How to Performance

by Eileen M. Uchitelle

In the RubyConf 2015 talk titled 'How to Performance,' Eileen M. Uchitelle discusses methods for understanding and improving the performance of Ruby code, particularly in the context of speeding up integration tests in Rails. The talk emphasizes the use of various tools and techniques to analyze and enhance code performance effectively.

Key points include:
- Introduction to Performance Issues: Eileen shares her experience with Basecamp's integration tests, highlighting the necessity to speed them up to avoid future complaints from users during the transition to Rails 5.
- Benchmarking vs. Profiling: The distinction between benchmarking (measuring current performance) and profiling (identifying slow parts of the code) is explained, underscoring their complementary roles in the performance improvement process.
- Establishing a Baseline: She emphasizes the importance of obtaining a solid baseline for performance comparison, recommending tools like Benchmark IPS.
- Profiling Tools: She discusses various profiling tools, including RubyProf and StackProf, illustrating how they help identify bottlenecks in code performance.
- Optimization Techniques: Uchitelle shares the step-by-step process she undertook to optimize the code, including moving certain calls to reduce unnecessary allocations and employing strategies to minimize the impact of garbage collection.
- Real-World Case Studies: The talk includes specific examples of optimizations made, such as fixing issues in Rack’s object allocations and managing URL helpers to improve integration test performance.
- Performance Improvements Achieved: Ultimately, integration tests improved significantly from being 2.53 times slower than controller tests to just 1.12 times slower, demonstrating the effectiveness of the techniques employed.

Eileen concludes by encouraging the continuous use of benchmarking throughout the optimization process and emphasizes the need for understanding the limits of tools while being open to experimenting with various profiling approaches to achieve better performance outcomes. The main takeaway is to always benchmark your code changes to verify improvements effectively.

00:00:14.259 Good morning, everyone! I hope you enjoyed what just happened before this today. We're going to talk about performance. This talk will focus on the tools and techniques to help speed up your Ruby code.

00:00:21.220 The things we're going to discuss today came out of work that I did with Iron to speed up integration tests in Rails. I'm Eileen Uchitelle, and I'm a programmer at Basecamp, formerly known as 37signals. I work on the security and performance team, ensuring everything runs fast and safely. You can find me online at EileenCodes. I'm also on the Rails committers team, which means I'm allowed to push directly to master.

00:00:34.420 Additionally, I'm on the Rails security team. However, I'm not on the Rails core, which means I'm not allowed to tweet from the Rails Twitter account. I'm not sure if that makes you feel more or less secure. This is my dog, Arya. She is, of course, named after Arya Stark from Game of Thrones, and she has her own Twitter account: AryaDog. She would love it if you followed her. Be warned though, she talks about poop a lot.

00:00:49.000 I'm from upstate New York, specifically from a city called Kingston. Kingston was the first capital of New York in 1777, but then the British burned it down twice, after which they moved the capital to Albany because we couldn't be trusted with it. On that note, I’m really happy to be speaking here at RubyConf this year. Not only is this my first time speaking at RubyConf, but it’s also my first RubyConf ever!

00:01:22.209 I always like to mention something interesting about the city I'm currently speaking in. Since this is my first time in San Antonio, I find it fascinating that you have this River Walk with a really narrow walkway and a river in the middle. You have to cross bridges to get to the restaurants you actually want to visit. Do people fall in a lot?

00:01:57.250 That's crazy! So perhaps you're wondering why we wanted to make integration tests faster. Besides the obvious reason—faster is better—I'm giving this talk because integration tests and controller tests both process routes and make assertions about the responses, but they are two completely separate code bases in Rails. We figured that if we could make integration tests as fast as controller tests, we could combine the two code bases. If we didn’t speed it up first, everybody would notice their test speeds suddenly slowed down in Rails 5, and then they would complain to me. This has the added benefit that when we consolidate the code bases, nobody has to change their code—there's no deprecation notice. You can just continue writing tests as you were, but for us, it's faster.

00:02:39.710 The purpose of this talk is to show you the techniques and tools that I used to speed up integration tests. We’re going to go over what worked and what didn’t, and ultimately how successful we were. Before we can discuss speeding up integration tests, we first need to go over the difference between benchmarking and profiling. Benchmarking measures how slow our code is; it’s all about collecting data on where your code currently stands. Benchmarking tools are also instrumental in improvement.

00:03:06.270 We made our code faster, and we can use benchmarking tools to measure our code before and after we make changes to be sure that we’ve improved the speed. Profiling tells us what parts of our code are slow. We can use profiling to pinpoint the exact methods that are slowing down our code. Improving performance is a lot harder, if not impossible, if you don't use these tools together. Let's get started.

00:03:43.950 Back in January, David said to me, "Hey, I like fast things, like race cars, and I really, really hate it when things are slow. I want to move Basecamp’s controller tests and integration tests because right now they are 20 to 40% slower than controller tests, and that’s just not going to do. Can you see what you can do about speeding up integration tests in Rails to be as fast as race cars?" Of course, I said yes because this was a good opportunity to improve Rails for everyone.

00:04:12.760 So, if we're going to improve performance, what's the first thing we need to do? We need to get a baseline. This gives us something to measure against to know we are actually improving the speed of our code instead of just thinking or hoping. To get a good baseline, I decided that it would be easier not to change Basecamp's code but to build an application specifically for testing the speed of integration tests. Building a new application has many benefits; it’s public on GitHub, so you can see all the different kinds of tests I wrote to measure the integration test performance.

00:04:49.330 It also prevents anything from getting in the way of helping me build a solid baseline, like login code and other application-specific elements—those aren’t part of my integration test code. Of course, hello world examples won’t show me where all of the slowdown is, but I needed to start with the bare minimum and build up from there to find the underlying issues. Controller tests and integration tests don’t look that different; they both define a test, process routes, and make assertions about the response.

00:05:18.490 The index tests for both are getting the index route and asserting that the response was a 200. That’s it! To build a baseline, I started with researching Rails 4.2, because that was what Basecamp was using at the time. I wanted to ensure that integration tests were slow in isolation, using the same technologies Basecamp uses. First, I double-checked that David’s assertion that the integration tests were 20 to 40 percent slower than controller tests was true.

00:05:41.770 He was basing this off the time they took to execute, so I first looked at that. Knowing that wasn't going to be my real baseline, I ran the single controller test that we looked at earlier, and found that it took 1.625 seconds of real time. I ran the corresponding integration test, and it took 2.112 seconds of real time. This calculated the integration tests as being 23% slower than controller tests. That’s a huge difference! It would make running your entire test suite on integration tests only much slower.

00:06:15.700 I then decided to run those tests five more times and got some interesting results. The time a single test took to run varied a lot, and after a few runs, it appeared that there was no difference in speed between controller tests and integration tests. But I knew that this was false, and it points to the fact that time is not a good baseline for the type of performance work that needed to be done here.

00:06:36.250 This is because using time only runs one iteration. There are a lot of factors at play, like garbage collection, memory, and caching. To use time, we have to calculate the average of all the runs and figure out how many runs got a good average time. So, time is not a good baseline because it varies a lot and would have required a significant amount of work to prove that I could make it faster.

00:07:14.329 If time isn't a good baseline, then how do we get a baseline? We can't start improving our code without one because we don’t know where we stand. A good gem for benchmarking your Ruby code is Benchmark IPs. It was written by Evan Phoenix, and the gem measures the number of times the specified code can run in one second. That means the more iterations, the faster your code is.

00:07:35.400 It completely takes the guesswork out of figuring out the number of runs you need to get a good average. It also provides the percent of standard deviation, so you know how far spread your results are from the average. Benchmark IPS allows you to focus on the results rather than the data points when benchmarking your code. Here’s an example of the Benchmark IP script I use to measure the difference in speed between integration tests and controller tests.

00:08:06.070 First, we define the controller test; it’s a simple GET, and we assert the response's success. Then we do the same for the integration test; again, we access the index route and assert that the response was a success. Then we invoke the Benchmark IPs gem inside the benchmark block; we report on each of the tests using the Minitest run method. We can call each of the tests in the code and label them integration tests and functional tests, making them easy to recognize.

00:08:30.760 We then use the compare bang method from the benchmark IPs gem to compare how many times slower the integration tests are than the controller tests. I really like this method because it means I don’t have to do any math. We can run this like any other test, and it will tell us we were able to get 393 integration tests per second, give or take 8%. In comparison, we were able to run 920 controller tests in one second, give or take 7%.

00:09:00.880 The data shows that integration tests were 2.43 times slower than controller tests. That's a huge difference and will definitely be noticeable if you’re running your test suite on integration tests. So now that I have proven that integration tests are indeed slower than controller tests, and it is unrelated specifically to Basecamp’s code, I wanted to double-check that this is still true on Rails master and the latest Ruby.

00:09:37.570 Improvements in speed are being made all the time, so it’s good to confirm that there is still a problem in the latest versions. You don’t want to be redoing work that someone has already completed on master. Repeating the benchmarking script with Ruby 2.2.0 and Rails master revealed they were about the same, a little bit slower. If you check the iterations, they are basically the same and have only a slightly higher percent of standard deviation.

00:10:06.720 Whereas Ruby 2.2.0 on Rails master has integration tests at 2.53 times lower than controller tests. Your integration tests are in pink, and controller tests are in black. When we compare the difference between Rails 4.2 and Rails master, both versions were really close in the number of iterations, but the newest Ruby on Rails was slightly slower for integration tests. We’re getting about 395 iterations in one second while we achieved 998 iterations on the controller tests.

00:10:38.920 This really demonstrates how much slower integration tests are than controller tests. We wouldn’t have seen that if we were using time as a baseline. Now that we have a solid baseline, we can start improving our code. Having the benchmark script means that we can be confident that we really improved the speed of our code and can prove that to others as well.

00:11:27.410 The next step is to find what is slowing down our code. We can’t just start changing things without knowing what the culprits are; otherwise, we'll end up really frustrated. The first gem I used to profile the integration test and determine what was slowing down Rails was RubyProf. RubyProf requires Ruby 1.9.3 or higher and is faster than other Ruby profilers because it's a C extension.

00:12:02.230 It’s been around for a long time; it was first released eight years ago. The script I wrote to profile the integration test is the first script I worked on. I followed along with the first RubyProf sample on GitHub, as one does when they use a brand-new gem. First, I set up the documentation integration test so I could specifically profile the integration test stack.

00:12:37.560 Here, I'm using the RubyProf API to run the index test in a block. This will capture the integration test stack trace, and finally, I set the printer to the default flat printer and output to standard out. I then run the RubyProf script in the terminal, and it returns a lot of information. To be honest, when I first saw this output, I was pretty confused; I wasn't sure what I was even looking at or how to use this to speed up the integration test, and it felt like some kind of secret code.

00:13:23.280 Let’s break down what we're looking at. First, we have a percentage of time spent in this method. This is calculated from the time spent in the method divided by the total time. Then we have the total time spent in this method and its children. Next, we see the wait time spent in this method, time spent in its children, the number of times this method was called, and finally, the method name called. Even after breaking this down, I still wasn’t entirely sure what this data was telling me to do with it.

00:14:02.020 I felt overwhelmed and frustrated, so I started reading the documentation more closely and playing with the various options and printers. There are many options available, so if you decide to try out RubyProf, I encourage you to play around until you find data that you like. The first type of printer I discovered was the Graph HTML printer. There are no screenshots of how it works, so you’ll need to experiment a bit.

00:14:34.730 Instead of printing to standard out, this code will print to a specified file. When I opened that file in my browser, it was a bit more readable but still super confusing to understand where the real problems lie. At first glance, if you don't know how to read this output, it seems like global.nomads is the problem because we're spending 100 percent of our time there. That’s definitely not the case.

00:15:15.430 A better way to demonstrate time spent in the stack is with the Call Stack printer. This is by far the easiest output to understand. When run, it creates a visual display of the stack where your code is spending the most time. If we look closer at the top, we can see that it starts with 100% in global in the method visit. This is just the beginning of our stack; you will always spend 100% of your time at the top of your own stack because that’s where you started.

00:15:56.020 What we want to focus on are parts of the call stack graph that jut out with higher percentages. One of them is Action Dispatch Run, where we're spending 12.56 percent of our time; this could be a good place to start looking for performance issues in our code. RubyProf is a great tool for showing you the overall picture, but the data can feel overwhelming and frustrating if you're unfamiliar with how to interpret it.

00:16:31.560 There are a lot of options that come with RubyProf, so it's worth trying them all out. Overall, RubyProf is most useful for finding the differences between the code stacks for integration tests and controller tests. However, we didn’t really use it to improve the speed of our code. Once I had these results, I reached out to members of the Rails core team to give me some insight into the data I had collected.

00:17:08.080 I got a lot of feedback at this point. Aaron Patterson chimes in and says, "Oh, speeding up integration tests sounds like a really great idea for my RailsConf talk. Let's work together on this so I can use your code in my slides." So, the first tool that Aaron introduced was StackProf. StackProf is written by Amin Gupta and is a Ruby profiler like RubyProf, but it takes a sampling of the call stack, allowing you to focus on the biggest problems and slowdowns in your code.

00:17:44.430 Let's take a look at how to set this up. Here’s an example of the StackProf profiler code we used: again, we add the integration test code like in previous examples, then we call StackProf and set a file for it to output to inside the block. We run our test 3000 times to get a real sample of our code, and when we run this, we need to open that specific file after the script runs.

00:18:09.300 To view StackProf output, you can type stack raw - text and the path of the dump file that you named in your script. This will show you the sampled profile of your Ruby code. It can seem confusing at first, but it’s similar to what we learned in the RubyProf section; however, the stack is sorted by the time spent in each frame. First, we have the total number of samples where this frame existed in the stack, and next is the percent of time all samples spent in this frame.

00:18:56.310 StackProf then shows the number of samples where this was the topmost frame. After that, we see the percent of time where this method was the topmost frame and finally the method that was called in this frame. Since StackProf output is sorted by the most time spent at the top, we can just focus on the first line, which says we're spending 53% of our time in MiniTest::Runnable# on signal. Let’s take a look at that method. Essentially, this method is just yielding.

00:19:44.100 There’s a lot of other stuff going on, but in the end, all it does is yield, so if the integration tests are really spending 53% of our time here, we can delete all of the code, but we need to yield something else. That would indicate where the source of slowness originates. When we run the StackProf script again, we can see that MiniTest::Runnable is no longer in our stack, and the new slowest code is SearchForFile at 14.7%.

00:20:33.960 So we're done! This is a MiniTest problem now, so let’s open an issue, grab a beer, and work on something else. But wait! We forgot to follow one very important rule: always be benchmarking. I think of it as the rule of ABB—Always Be Benchmarking. Every time you make a change to improve performance, you have to benchmark your code; otherwise, you have no way to prove that you actually made it faster.

00:21:07.310 So, let’s run that benchmark script again with the change we made to MiniTest::Runnable. We should see a massive improvement now that we’re not spending 53% of our time there. But when we run it again, we can see, huh, it's still 2.53 times slower and hasn't improved at all. After a bit of research and consulting the creator of StackProf, we found out there’s a bug on OS X that causes CPU time to be completely inaccurate and not correctly profile the code.

00:21:46.390 This is something to remember if you use StackProf or any other profiler that defaults to CPU time. OS X does not have this issue, so you're free to use CPU mode with StackProf on that OS. We can resolve this by using wall time instead. If we change the mode in our StackProf script from CPU time to wall time, we get accurate results that tell us where the real problems lie.

00:22:29.210 It turned out that the problem wasn’t even attributed to MiniTest—it’s not even represented in this stack sample. StackProf in wall time actually pointed to a regression in the delegate method where we were spending 28.3% of our time. We changed the delegate methods and routes to be optimized, and if we run the StackProf script again, we can see that module: Delegate is no longer at the top of the list, and our slowest sample now shows 4% of time spent in per_thread_registry.

00:23:10.670 Now, we need to follow the number one rule of Always Be Benchmarking. If we run the benchmark script again, we’ll see that integration tests went from being 2.53 times slower to 1.4 times slower. That’s quite an improvement! Here, we can see that we actually lost some iterations in the controller test, but integration tests gained about 200 more iterations per second from where we started. We can see that we are definitely improving the speed and slowly closing the gap between the two types of tests.

00:23:53.660 StackProf is fantastic for helping point out the slowdowns in your code quickly because it sorts by the slowest profiler rather than giving you all the information at once. However, we faced an issue with StackProf: it was hard to pinpoint problems in dynamically created methods or anonymous modules. In fact, you wouldn't even have known there was a problem in an anonymous module if it weren’t for Ruby's VM.stat method.

00:24:33.100 The VM.stat method provides information about the current method cache. This is useful for seeing if your method cache is getting busted between runs, which is crucial for performance in Ruby since creating new globals and constants can be expensive. Here’s an example of the script we used to track the global and constant state in our tests.

00:25:15.710 First, we add the same integration test and index method like we've done previously. Then we output the Ruby VM stats before we run the test, run the test again, and then output the Ruby VM stats after we’ve completed the tests. When we run the test, we get the following output. What we want to focus on is the global constant state, which tells us that before we ran the test, we had 7,823 constants. Afterwards, we had 7,824.

00:25:56.600 This means that we created a new constant while running a single integration test. If we ran five more tests, we’d have five more constants, and that’s not good! Creating constants while running tests is expensive, as each time we bust the method cache and Ruby needs to resolve that method again. We found that the method creating an extra constant was URL helpers, and finding this method took considerable digging.

00:26:44.350 We couldn't rely on StackProf to find it or standard debugging tools since it was an anonymous module. We had to examine the Rails internals to locate it. I researched the git history and found that this method used to be memoized, so I figured it would be best to bring the cache back; however, it's more complicated than it seems. The URL helpers should be memoized again, and we should see some improvements in our benchmarks.

00:27:29.800 So, we run the benchmark script again and observe a minor improvement. Integration tests went from being 1.36 times slower to 1.27 times slower. We did see an improvement in the controller tests as well. Yet, it's pointed out that there is a better way to address this than just caching URL helpers. When dealing with performance, it's easy to think that you should cache everything to magically resolve all your issues.

00:28:09.680 However, caching isn't a free fix for every performance problem. Caching comes with a cost. If you can speed up your code without caching, you should do that first. You don't want to simply patch over real problems in your code; optimize first and cache second. The problem with the URL helpers was that they were being called when they weren't necessary.

00:28:49.880 The integration tests didn’t always need the URL helpers when they were being initialized. A better solution is to move where they are called to only when they are needed. The diff might be considerably larger than just this, but if we look closely, we can see that we have moved URL helpers from initialization to being called only when the session is reset.

00:29:31.710 Now that we've done that, if we run the Ruby VM script again to check the global method cache, we can see that the global constant state remains the same before and after we run the integration test.

00:30:12.670 If we run our benchmark code again, we can observe that this improved integration test even more. When we cached the URL helpers, we were 1.36 times slower. After this change, integration tests are now only 1.12 times slower than controller tests. This shows how costly busting a global method cache can be. It's a significant improvement for what seems like a minor fix.

00:30:48.740 RubyProf, StackProf, and the Ruby VM were instrumental in helping us find the largest problems in Rails that caused slowdowns in integration tests; of course, there were many more improvements that I didn’t demonstrate today. We were starting to reach a point where we could make improvements in performance. One thing we noticed after removing the largest culprits was that we were spending a lot of time in garbage collection—11.13 percent.

00:31:31.470 This was because we were creating many objects that need to be cleaned up afterwards. To figure out the source of this, we turned to Allocation Tracer. Allocation Tracer was written by Kuu Ichi and is a tool for determining how many objects were being allocated by Ruby and where they are in your code. Allocation Tracer focuses on very specific problem areas in your code and helps narrow down unnecessary allocations.

00:32:04.370 Here’s an example of the Allocation Tracer script we used to track allocations in integration tests. To use Allocation Tracer, we need to include ObjectSpace. Then we define the index integration test as we’ve done in previous scripts. We then set up Allocation Tracer and call path_line and type. This allows us to specify what information we want from allocations.

00:32:38.870 Allocation Tracer’s output had to get the path line and the type that the object was called on. After that, we assigned the trace to the result and re-ran the test 3000 times to get an accurate sample of allocations. Finally, we sort the results by the number of allocations in reverse to put the highest counts at the top, returning the first five to find out what the top five offenders are.

00:33:16.680 I ran the Allocation Tracer script, like we did with previous scripts. The output can be a bit confusing at first, but once you get the hang of the data, it makes a lot of sense. It shows us the path to the file where the object was allocated, then the line number where the object was created. Next is the type of the allocated object, which could be a string, array, struct, hash, or various other types.

00:34:08.880 The next array gives us the total count of created objects, the old count of objects that haven't been garbage collected yet, and the total age of those objects. We found that many of the object allocation problems were in Rack and not Rails; in fact, the top five biggest offenders were identified there.

00:34:47.880 One way we reduced the time spent in garbage collection in Rack was to limit the number of string objects being created. We achieved this by freezing certain strings. After each one of these optimizations—checking our allocations—now we don’t need to run benchmarks because we're not specifically trying to improve the speed of the code, but rather reduce the time spent in garbage collection.

00:35:30.710 In this run, we can see that Rails replaced three of the Rack allocations in our top five. This improvement stemmed from changes where we froze strings. The reason for freezing strings is that when you assign a string key in a hash, the string is copied, theoretically creating more objects. Freezing prevents this but makes the string immutable.

00:36:11.780 Before you declare, 'Winter is coming!' and decide to freeze all strings, do make sure you prove that string allocations are actually a problem. You don’t want to prematurely optimize code by making strings immutable without need, as that could be dangerous if the string needs to change. This should only be done if it's absolutely necessary.

00:36:56.470 So, I hope I don’t see anyone sending a giant pull request to Rails with all the strings frozen, claiming I told you to do it! You absolutely must first prove that string allocations are a bottleneck before freezing them. Allocation Tracer was a great tool for finding where objects were being unnecessarily allocated in Rails and where garbage collection was wasting our time.

00:37:39.610 However, we didn’t really see much speed improvement from this. It's also because these changes are micro performance improvements that can seem tedious. Comparing integration testing from before we started to where we ended up, we are almost faster than the original controller tests! Originally, we could only run 393 iterations of integration tests; now, we can run 828 iterations.

00:38:21.670 That’s a huge difference! It equates to integration tests being almost 3 times faster than current Rails when they were in Rails 4.2. Most of these changes didn’t actual alter behavior, and they were back-ported, making them available in the newest Rails 4.2, so you don't have to wait for Rails 5 to take advantage of these optimizations.

00:39:05.360 I hope this helps you better understand your tools and how to improve performance. Be sure to know the advantages and disadvantages of the tools you're using. You need to understand their limitations, so you don't end up frustrated. No two tools can do everything, so use multiple tools.

00:39:35.550 We were only able to speed up integration tests this much because we combined the usage of Benchmark IPs, RubyProf, StackProf, Ruby VM, and Allocation Tracer. And of course, always remember to be benchmarking! It’ll help you to be confident that you're truly speeding up your code and will allow you to defend your changes. It may even help you discover a bug in your Ruby profiler.

00:40:10.520 I want to say a special thank you to a few people for their help with this work. First, thank you to Aaron Patterson for collaborating with me on speeding up integration tests. There was some concern that he was serious when he said he stole my code for his RailsConf talk, but it was genuinely a joint effort. Thank you also to all of the creators of the gems we used.

00:40:55.180 Thank you, RubyProf, for being a great tool, and RubyComp for having me here today. And lastly, thank you all for listening! I have Basecamp logos and happy camper stickers if you’d like to come find me.

00:41:36.590 Oh, and before I go, someone asked if there are any tools that automatically include benchmarks in your commit message or pull requests. Not that I know of. If you want to make one, that might be quite cool, but you can’t always pinpoint every part of the code that you want to benchmark. If you send a pull request with ten changes, and only one of them impacts performance, you would need to measure each of those, then measure them together.

00:41:57.330 That's why it’s generally better to submit performance pull requests specifically, and when you do that, you should include your benchmarks in your commit messages to provide context as to why a change was made. The question was if I used any other tools that weren’t introduced in this talk.

00:42:35.900 I didn’t for this particular case, but there are other tools I'd use for performance issues. For application performance, such as MySQL calls, I would typically use MiniProfiler. But that doesn't really apply here. So there are plenty of other tools I would use depending on what type of performance I’m measuring, but for this instance, we didn't utilize anything else.