Disk is fast, memory is slow. Forget all you think you know

Talks

Daniel Magliola

@dmagliola

#performance-optimization

Disk is fast, memory is slow. Forget all you think you know

by Daniel Magliola

The video titled "Disk is fast, memory is slow. Forget all you think you know" by Daniel Magliola explores performance considerations in Ruby applications, particularly in the context of monitoring metrics using Prometheus in multi-process environments. Magliola outlines several critical challenges faced by developers when attempting to gather aggregate performance metrics across multiple processes due to Ruby's Global Interpreter Lock (GIL) and the use of prefork servers. He discusses the following key points:

Overview of the Problem: Magliola introduces Go-Catalyst's need for reliable metrics in their banking application where uptime is critical. He highlights the use of Prometheus for monitoring but points out the complications arising from multi-process architecture.
Multiple Processes and Memory Sharing: Since Ruby processes do not share memory, each maintains its own counters, complicating the metric aggregation required for Prometheus’s functionality. Magliola examines the limitations of existing Ruby libraries in addressing this concurrency issue.
Performance Targets: The need to maintain performance metrics with minimal overhead led to the goal of incrementing counters within a microsecond. He shares insights about the goal’s challenging nature, indicating skepticism from peers about achieving it in Ruby.
Experimentation with Solutions: Magliola investigates multiple approaches, including the use of hashes, various storage methods, and Redis for data storage and inter-process communication. Notably, he addresses the performance constraints faced when deploying memory maps and the difficulties of using C extensions in Ruby.
Surprising Outcomes: Ultimately, through a series of experiments, he discovers that writing to disk offers surprising performance benefits over traditional in-memory operations or more complex systems. The use of disk was revealed to be faster than expected due to the asynchronous nature of operating systems in handling write operations.
Takeaway Messages: The main conclusion emphasizes the importance of questioning preconceptions about performance, advocating that experimenting with different solutions can reveal unexpected efficiencies. Magliola suggests that assumptions can lead to overlooking viable paths, and encourages viewers to measure performance before optimizing assumptions.

In summary, this talk illustrates how context can redefine performance assumptions, ultimately guiding developers toward discovering unconventional yet effective solutions through empirical experimentation.

00:00:12.139 Like many developers, I love working on code performance, getting deep into a problem, measuring, profiling, tweaking, and squeezing out every last drop of truth from my code. I'm a Rails back-end developer, but I don't get to do that very often. Last year, I worked on a project where performance was key, and working on that project taught me some interesting lessons about assumptions.

00:00:37.079 First of all, hi! My name is Daniel, and I work for Go-Catalyst, a payments company based in London, UK. As you probably noticed, I’m not originally from London; I come from Argentina, so just in case you're wondering, that's the accent. Now, before I get into the interesting part of this talk, I need to give you a little background on the problem we were trying to solve.

00:01:06.600 Go-Catalyst is a bank-to-bank payments company, which basically means we serve as an API on top of the banking system. For a payments company, your customers expect you to always be available, just like they expect the electricity to always work. Therefore, for us, uptime is critical, and that means, among other things, that having good metrics for our apps is essential.

00:01:29.600 This is where our project came in. One of the main tools we use for this is a metric server called Prometheus. Now, to clarify, it's not just Prometheus; we had a significant issue trying to use it well. Prometheus basically maintains a history of all of your metrics over time, and to get those numbers, it periodically hits your app asking for them.

00:01:54.720 In order to provide metrics, your web server, which is running your app, also needs to keep a count of various things, such as how many HTTP requests it served, how long they took, or how many payments were processed in different currencies. Prometheus will reach out to your app every few seconds for these metrics, and you'll need to reply with an output that looks somewhat like this.

00:02:18.450 Prometheus has libraries that help you accomplish this. These libraries allow you to declare your counters and set their values, and when scraped, they automatically generate the needed output. This is available for all major server languages, including Ruby.

00:02:35.330 However, in the Ruby ecosystem—and also in Python—the situation is a little trickier due to the Global Interpreter Lock (GIL), which prevents the execution of code in multiple threads simultaneously. Because of this, our web servers are typically not multi-threaded, as that doesn't scale effectively. Instead, in Ruby, we often use separate processes. Processes do not share memory, which effectively solves the same problem that the GIL tries to address.

00:03:06.340 In the web world, we call this a prefork server, and it's how Unicorn works. When Unicorn starts, it loads your app and then forks itself into a master orchestrator process that listens for requests, alongside a bunch of workers that actually handle those requests. When a request comes in, the master process selects a worker to handle it and returns the response to the client. This system works great, scales well across cores, and is how most Ruby web apps run in production.

00:03:32.260 However, for Prometheus’s use case, this presents a problem. Now that we have multiple processes that don’t share memory, each process maintains its own internal counters. When Prometheus hits your application and requests all of your metrics, each worker responsible for handling that request will report its own metrics, which is not what you want. You need an aggregate view of the metrics across the entire system.

00:04:02.890 Yet, because the processes don’t share memory—an intentional design choice—there’s no easy way for one process to access the numbers from another and combine them. This lack of visibility is the primary reason we use multiple processes; we want that isolation. Unfortunately, it becomes an issue for Prometheus, as the majority of Ruby web applications run in prefork servers.

00:04:50.080 Since the Prometheus library also lacks a solution for cross-process aggregation, many people using Ruby in production with Prometheus metrics face this very same challenge. We, too, encountered this issue while using Prometheus for our primary monolith written in Ruby. It wasn't just us; this has been a common problem for many people for some time.

00:05:12.590 Moreover, this Ruby library faced a number of other issues. It was one of the first libraries created, developed before the community established best practices for handling metrics, leading to various complications. There were some workarounds; for instance, GitLab had an official library with a multi-process solution, but it wasn’t getting merged back into the main repository and carried over all of the same issues regarding best practices.

00:05:50.720 We certainly didn’t want to use that either, nor did we want to create another fork. Instead, we communicated with the library maintainer, reached an agreement on a plan, and decided we would rewrite it while addressing all of the existing problems. However, there was one more aspect to consider: the maintainer strongly cared about performance.

00:06:18.950 The reasoning was that instrumentation should not adversely affect your app’s performance. If you're using counters and need to increment one in a tight loop, you shouldn’t even notice the overhead being added. Deep within a lengthy discussion from a 2015 ticket, I found an interesting target. While I’ve changed names to protect the innocent, one of the lead maintainers stated that incrementing a counter should take less than a microsecond.

00:06:52.220 I don't know why one microsecond was chosen as the target, but if I wanted to get our solution merged, we would really need to meet that target. Another maintainer commented that achieving this in Ruby was unlikely. This exchange led to one of my favorite responses, which basically implied that 'surely Ruby can't be that slow',

00:07:18.810 So we had to solve this problem swiftly. We had a clear performance target, and there was one more thing that the maintainer asked us to agree on regarding the library: we had these metrics classes that were employed to declare counters. Currently, they stored values in instance variables, but we wanted to change this to allow those values to be stored elsewhere.

00:07:44.080 We envisioned that these 'stores' should be interchangeable, which meant that we could go from a design like this to something a bit more flexible and modular. The reasoning behind this is that different projects have various requirements; some might run in a single-threaded, single-process environment and care deeply about performance, while others could have a web app running in multiple processes and hitting a database over the network. At that point, microsecond performance may not be a concern.

00:08:01.480 Having interchangeable stores not only provided better options for our users but also simplified our development process. The challenge with our project was sharing metrics across processes. With these interchangeable stores, it became easy to conduct small experiments, gather performance data, and measure the results without the need to modify the rest of the codebase, which was a very effective strategy for experimentation.

00:08:40.310 So, to recap, we knew what we had to do: we were going to rewrite this gem to support prefork servers, providing obstructed storage for metrics, following the current best practices, all while keeping it fast. The next step was to figure out how to do this in a way that allowed us to share those metrics while remaining within that performance budget.

00:09:17.430 Now, to store our numbers, we decided to use hashes instead of integers for reasons that will become clearer later. What this means is that when I say 'increment the counter,' it effectively involves reading and writing a value within a hash. My initial assumption was that doing this could or couldn’t be achieved in a microsecond.

00:09:52.210 If you had asked me whether this was feasible before embarking on this project, I would have firmly believed it was impossible due to my lack of intuition for performance at this low level. One individual claimed Ruby was slow, while another insisted that writing to hashes had to be faster than one microsecond, creating a divergence in opinions regarding Ruby’s performance.

00:10:29.710 The scale of this issue felt out of my depth compared to my usual experience in Ruby, where the smallest measurable unit of time I typically consider is a millisecond. As a web programmer, this is how I perceive the world: anything smaller than a millisecond seems insignificant. For instance, running a web app, a request comes in, and a few milliseconds are consumed by processes like Unicorn, Rack, and Rails.

00:11:10.890 Before it even reaches my code, I've lost time—like loading a user ID from the session, which takes about a millisecond—so the overall experience adds up. In a normal day as a Rails developer, I might see response times in the range of 60 to 70 milliseconds, and this is my daily life when performance monitoring.

00:11:44.239 But this perspective changes drastically depending on your field; for example, game developers operate within a strict 16-millisecond timeframe to deliver smooth gameplay at 60 frames per second. If they miss that frame rate, the game experiences lag, and frustrated gamers can quickly take to social media to voice their displeasure.

00:12:18.650 This raises the question: what is fast or slow is very subjective and relative to your specific context. Some developers may be looking to optimize memory access to minimize nanoseconds spent per operation, while I'm often concerned with the millisecond scale.

00:12:44.309 It’s crucial to understand that performance discussions can sometimes lack context. Things that seem fast to a Ruby developer might be viewed very differently by someone writing kernel drivers. When discussing speed, we must recognize that perspectives differ widely.

00:13:18.450 Returning to our original question: can we achieve incrementing a hash in a microsecond? As I've mentioned, my world largely exists in milliseconds, so my instincts were limited in their ability to assess this problem. Yet, I was presented with two distinct opinions that could both, in theory, be valid, so I had to measure and find out.

00:13:57.130 The first hurdle was that measuring this increment isn't straightforward. The standard approach is to use benchmarks, running the same operation multiple times to assess speed. But when the operation is so tiny, even the measurement process becomes a significant portion of the total time spent.

00:14:26.340 The method still works for comparative measurements, but it poses challenges when trying to assess the time taken for a very small operation. For instance, on my laptop, my measurements showed about one-half microsecond for incrementing a hash, which sounds impressive—but it's misleading. Each increment requires running a block, which, although fast in Ruby, still consumes a substantial fraction of total time.

00:14:51.799 If I modify the approach to avoid using a block, I can achieve a doubling in speed, which brings me closer to the actual performance of incrementing a hash. To get even more precise, I could strip the timing process itself from the final count, allowing me to identify how much time is wasted in the benchmarking process.

00:15:25.270 This reveals I can perform about 12 increments in a single microsecond—all well within budget. Interestingly, this seems to confirm that the initial opinion about Ruby's speed was valid after all: performing this operation is feasible in under a microsecond.

00:15:58.990 However, as I dive deeper into the specifics, I realize that it’s not all straightforward. I'm talking about storing these metrics in hashes. Why hashes instead of simple integers? The value of metrics is that we need much more granularity than a mere count; for example, when concerned with payments, we want to capture how many transactions occurred within various currencies.

00:16:27.780 Hence, we incorporate the concept of labels, or key-value pairs. Our internal implementation stores all numbers in a hash where the key itself is also a hash. The implications for performance are significant. When accessing a hash with a constant key, the operation is swift, but when your key is another hash, it entails additional comparisons that can dramatically slow down operations.

00:16:56.430 For example, if we have three labels in a metric, this can end up slowing things down by roughly 60 times, which poses a serious issue for us. In an ideal scenario, if most metrics lack labels, trying to use an empty hash as a key brings our incrementing operation closer to one microsecond.

00:17:27.060 However, typically, we need to validate those labels and process them for safety, which adds overhead. By implementing a mutex for thread safety, our total time creeps up to about one and a half microseconds, and we miss the target altogether.

00:17:57.590 This leads to the conclusion that the second maintainer's assertion about the microsecond target was accurate. Nevertheless, the most intriguing insight I gleaned is that I initially expected a simple response to the question but found no clear winner in this scenario.

00:18:24.220 When I say fast or slow, it’s essential to understand these terms will always be relative to the context within which they are considered. Ultimately, while we could not achieve the target of maintaining microsecond increments in a multi-process environment, we opted instead to maintain the essence of our goals — that instrumentation should be unobtrusive.

00:18:51.280 This became a vital objective for us as it would impact numerous projects across various companies, so it was worthwhile to invest time in optimizing performance, even if meeting the microsecond target was unrealistic. At this point, we established a baseline for performance and a clean, refactored codebase that facilitated rapid experimentation.

00:19:32.050 We began exploring different strategies extensively. First up, we looked into Buckets, which utilizes a rare data structure called an AP store. Essentially, it's a hash backed by a file on disk. Every time you write a value, the whole data structure gets serialized and stored, while reading requires access to the file, but this approach allows multiple processes to interact safely.

00:20:08.420 Unfortunately, while P stores operate similarly to hashes by allowing batch transactions to avoid hitting the disk for each individual operation, we quickly dismissed them as they weren't fast enough for our application needs. On the other hand, when I worked at Go-Catalyst, we had a clever idea.

00:20:38.320 Why not use Redis as a shared memory chunk between different processes? I was excited at the prospect because I knew it could offer the kind of solution we were searching for, and it had worked for similar use cases in other languages. However, when I executed some benchmarks, I was left feeling disappointed.

00:21:09.410 My expectations were dashed upon discovering that Redis was actually slower than marshaling an entire hash to disk. I was disheartened because I initially believed that the speed advantages of using Redis would hold true against the alternatives.

00:21:35.310 But when I reflected more on the dynamics at play, it made sense. Communicating with Redis over localhost involves network packets, which introduces latency—this communication process effectively led to higher counts of operational time than expected, especially against simple, in-memory operations.

00:22:10.030 I then switched my attention to the elephant in the room—memory maps. The consensus among experts and many implementations in the Python library suggested that using memory maps would solve our cross-process communication challenges. However, I was hesitant. My fear was rooted in my lack of familiarity with memory maps.

00:22:38.590 Before this project, I had never even encountered them. The documentation around memory mappings is often confusing, and while it has detailed descriptions of various flags, it lacks context on their applications for shared memory between processes.

00:23:06.046 Additionally, I discovered that accessing memory maps typically involved using file paths, which felt strange to me. Memory maps, serving as the fundamental building blocks of system kernels, act like Swiss Army knives, capable of numerous tasks but lacking straightforward examples relevant to our goals.

00:23:34.569 The method we needed was a shared-memory mapped file. What happens is that when a process opens a file and maps it, reading that memory will return the contents, and if you modify memory, that change reflects in the file. If another process maps the same file, they both access the same memory page, which is the core technique of rapid inter-process communication.

00:24:00.496 Yet, directly using memory maps from Ruby posed challenges—primarily because it had to be interfaced through C code. For someone with my limited proficiency in C, this was intimidating. Fortunately, we had the existing MMap gem, created by a developer known for successfully handling intricacies in Ruby.

00:24:32.140 The way this works is simple: you create a file, initialize it to a certain size; say, one megabyte, and then set up the mapping function to create an M variable, which behaves much like a byte array, although not entirely like a Ruby array. It enables you to read and write bytes but requires a deeper understanding of how to pack and unpack data.

00:25:02.680 If you've never used these methods, it’s best not to look them up. Ignorance is bliss in this scenario. However, if you must, when we need to write a float into memory, we pack it into eight bytes and specify where to place them within the mapped array.

00:25:40.990 At that point, we further explored how fast these stores would be. As a brief recap, we had developed a hash without synchronization, which had clocked in under a microsecond. We created a similar version for multi-threaded applications, observing performance around one and a half microseconds, and finally identified the PStore—the only workable solution thus far for multi-process interactions at a disappointing 35 microseconds.

00:26:07.420 However, we found our MMap approach was remarkably efficient. Testing each segment reinforced our observations, so we developed a script to apply stress testing in diverse scenarios. We subjected components to excessive loads to ensure stability, running complex metric trees from a plethora of threads and processes.

00:26:41.640 But upon executing this script, I encountered a segmentation fault—an unanticipated situation in Ruby, especially in high-level programming, where this doesn’t happen often. It was related to the memory being freed prematurely, which put me straight back at the root of my initial hesitancy toward memory maps.

00:27:15.195 The issue seemed tied to my code trying to resize memory maps if my original one-megabyte size was insufficient. If done incorrectly, this would lead to segmentation faults appearing well after the fact, indicating potential garbage collector issues.

00:27:49.090 Primarily, I had to isolate the root causes of these errors—something made even more challenging by the gem’s obscurity within the Ruby community. My experiments led me to isolate a common problematic sequence of events associated with resizing.

00:28:20.210 I honed my experiments, performing what felt like an elaborate dance of touch commands to flush, close, reopen, resize, remap, and only then, perhaps, avoid the dreaded segfault. Yet, despite a temporary win in overcoming the segfaults, I faced critical complications surrounding C dependency.

00:28:48.200 C extensions can be quite troublesome to manage and often lead to issues across environments, such as JRuby, where they aren’t supported, presenting ongoing installation problems similar to what many developers have experienced with other C-language integration gems.

00:29:18.470 Despite being relatively fast and unproblematic to a certain extent, challenges persisted. At that point, I found myself challenged with yet another project. The concept was simple: what if we discarded the MMap concept entirely and reverted to writing files directly?

00:29:59.510 The transition entailed minimal change from our coding perspective, simply inserting file functions to navigate and write while bypassing the more complex MMap functionality altogether. The efficiency gain resided in forgoing cumbersome marshaling tasks over the whole hash, instead opting for streamlined byte write operations.

00:30:31.660 To gauge how slow this new solution would be against disk drives, I expected significant performance disadvantages given the common understanding of disk speeds. Yet astonishingly, I discovered that writing to the disk presented healthier numbers, approximately six microseconds versus the MMap alternative.

00:31:06.970 This revelation blew my mind. Practically, I found myself presented with a compelling solution devoid of the risks that extending functionality through a C extension might introduce. With no segfault fears or compatibility issues present, it stood to be an optimal choice.

00:31:42.170 Consequently, we opted for disk as our solution. Although not necessarily the fastest, it proved to be effectively fast enough and far more reassuring than the alternatives we’d examined. Ultimately, we successfully resolved our performance problem and had our code merged into the existing library.

00:32:15.300 Now, the maintenance of this gem feels rewarding, as it accommodates Prometheus usage in Ruby applications. Curiously though, this represents an ironic twist—traditional programming wisdom touts the expectations of memory being orders of magnitude faster than disk, yet in this case, the reality fits a different narrative.

00:32:50.470 As I noted earlier, it's often necessary to recognize how context heavily influences our interpretations. Writing to memory should ideally take nanoseconds instead of microseconds, although this complexity emerges from the involved mechanics of the MMap gem itself.

00:33:27.100 Early observations might provoke skepticism toward my findings regarding write performance against traditional structures, and rightly so. What I discovered was somewhat misleading: writing configurations don’t inherently translate directly to disk commits under typical operating systems, emphasizing asynchronous write handling methods.

00:34:06.290 When you request a write operation, the OS simply recognizes the request and notes it to be done later, reducing immediate performance considerations. Since I only store metrics during operational runs without needing prior data once the app terminates, I sidestep these heavy write costs, permitting me to optimize performance without too much overhead.

00:34:41.600 In this peculiar incident, we certified that the widely accepted beliefs regarding speed do come with their undeniable nuances. The point of this talk is to convey an essential lesson regarding performance optimizations: it's crucial to measure before optimizing, as what seems slow may not be what impacts performance at all.

00:35:15.230 During my investigation of the performance bottlenecks within my code, I discovered some surprising places where my assumptions previously guided my thinking but led me to fast conclusions.

00:35:43.140 Conversely, my insights gathered from this project indicate that it extends further than previously predicted. Such revelations underscore that assumptions can dictate our paths irrationally, isolating some methods purely due to their perceived slowness.

00:36:15.120 In evaluating our solutions to performance challenges, we overlooked disk usage out of a collective preconception that it would be too slow. This paradigm shift came to light only through experimentation, which ultimately reassured both my coding instincts and our application processes.

00:36:49.330 So as I conclude, be cautious of the biases that filter out potential solutions. We must remain open to ideas we may initially assume won’t work. When exploring any new problem, pause to assess these preconceptions—ask why they exist, whether they hold merit, and, most importantly, carry out experiments to measure your thoughts.

00:37:20.760 In the end, there could be clear and ascertainable solutions that lie quietly in the background, waiting for you to recognize their potential. Thank you for listening.

RubyConf 2019