How to Performance

@eileencodes

Understanding performance output can feel like reading tea leaves. It makes sense to a few people, but many of us are left in the dark; overwhelmed and frustrated by the data. On top of that there are a ton of performance tools to choose from; StackProf, RubyProf, AllocationTracer. Where do you even start? While working on speeding up integration tests in Rails source, I learned that the key to improving the performance of Ruby code is having a baseline, not relying on one profiler and understanding the advantages and limitations of your tools. By utilizing these methods, integration tests are now 3 times faster than they were in Rails 4.2.0, with more improvements being made every day. In this talk we will not only look at how to read performance output, but when and how to use the right profilers for the job. We'll discuss a variety of methods and techniques for benchmarking and profiling so you can get the most out of any performance tool.

Talk given at GORUCO 2015: http://goruco.com

GoRuCo 2015

00:00:13.410 Today, we're going to talk about performance. This talk is going to focus on tools and techniques to help speed up your Ruby code. The topics we're discussing today came out of the work I did with Aaron Patterson to make integration tests faster in Rails. I'm Eileen Uchitelle, and I'm a programmer at Basecamp, formerly known as 37signals. I'm also on the Rails Commuters team.

00:00:25.900 This means I can push directly to master on Rails. I'm not in core, which apparently means I'm not allowed to tweet from the Rails Twitter account. You can find me online at @EileenCodes, and this is my dog, Aria. She's, of course, named after Arya Stark from Game of Thrones, and she has her own Twitter account @AriaDog. She would love it if you followed her, but you should be warned – she loves to talk about poop.

00:00:44.649 I'm from upstate New York, specifically from a city called Kingston, which is about two hours north of New York City. Kingston was the first capital of New York State in 1777, but it was burned down by the British twice, which is why we moved to Albany – because we can't keep it safe! Kingston is a wonderfully walkable city with some of the best hiking spots in New York. You can experience some of that hiking paired with inspiring technical talks if you come to the upcoming Catskills Conference from October 23rd to 25th. It's going to be a fantastic weekend full of inspiration and learning in the Catskills.

00:01:14.200 I'm super happy to be speaking at Heroku this year, as this was the first Ruby conference I ever attended back in 2012. I really like the new logo a lot better; it's much more appropriate. I’m not going to go into details about those of you here since 2012 will understand why I can't wear the t-shirt from this year. Earlier, I mentioned that I work at Basecamp, and DHH knows that performance is one of my favorite topics, which is why I'm on the Performance and Infrastructure team at Basecamp.

00:01:49.340 Back in January, David said to me, 'Hey, I like fast things like race cars, and I really hate it when things are slow. I want to move Basecamp's controller tests to integration tests, but right now, they are 20 to 40% slower than the controller tests, and that’s just not going to work. Can you see what you can do about speeding up integration tests in Rails so they’re as fast as race cars?' Of course, I said yes, because this was a great opportunity to improve Rails for everyone. If we're going to improve performance, what's the first thing we need to do? We need to have a baseline. This gives you something to measure against so you know you're actually improving speed rather than just thinking or hoping you are.

00:02:38.829 To get a good baseline, I decided it would be much easier not to change Basecamp's core code and instead build an application specifically for testing this. Building a new application has many benefits. One benefit is that it's on GitHub, so you can all see the scripts I wrote to test this. Another is that it removes anything that might get in the way of helping me build that baseline, such as other gems and weird stuff. Of course, Hello World examples aren't going to show where all of the slowdowns are, but starting with the bare minimum helps ensure you aren't changing things that aren't the real problems.

00:03:10.680 Controller tests and integration tests don’t look that different. Both define a test, process a route, and assert something about that response. Here, the index tests for both are getting the index route and asserting the response was a success. To build a baseline, I started with Ruby 2.1.5 and Rails 4.2.0, as that's what Basecamp was using at the time. I wanted to ensure that integration tests were slow in isolation, using the same knowledge Basecamp has. First, I double-checked David's assertion: that integration tests were 20 to 40 percent slower. David based this off the time they took to execute, so I looked at that, knowing it wouldn’t be my real baseline.

00:03:48.660 I ran a single controller test and found that it took 1.65 seconds of real time. I then ran the corresponding integration test, which took 2.120 seconds of real time. That calculated to integration tests being about 23% slower than controller tests. That's actually a significant difference, and running your entire test suite with slower integration tests would take much longer. I then decided to run each of those tests five more times and got some interesting results – they varied a lot. After a few runs, it appeared that there was no difference between controller tests and integration tests, and I knew this was completely false.

00:04:26.810 This pointed to the fact that time isn't a good baseline for the type of performance work that needs to be done. Using time only runs one iteration, and there are many factors at play, such as garbage collection, caching, and memory usage. To use time, we'd have to calculate the average of all runs and figure out how many runs got a good average. Time was not a good baseline for me because it varied a lot and required a lot more work to prove that I was making it faster. So, if time isn't a reliable baseline, how do we get one? We can't start improving our code without a good baseline.

00:05:00.540 A good gem for benchmarking Ruby code is Benchmark-AP. It was written by Evan Phoenix and measures the number of times per second the specified code can run. This means the more iterations there are, the faster your code will run. Benchmark-AP completely removes the guesswork from figuring out how many runs you need to get a good average. It also provides the percent of standard deviation so you know how spread out your results are from the average. Benchmark-Ap is helpful as it allows you to focus on the results rather than the data points when benchmarking your code.

00:05:43.620 Here's an example of the benchmark script I used to measure the difference between controller tests and integration tests. First, we define the controller test as a simple GET request and assert the response was a success. Then, we do the same for the integration test, getting the index route and asserting the response was successful. Next, we invoke the Benchmark-AP gem inside the benchmark block. We use MiniTest runner to report on each of the tests. We can call each of the tests individually and label them as integration or functional tests so that they're easy to recognize after the run.

00:06:34.860 Then, we compare the number of times slower integration tests are compared to controller tests using the benchmark comparison method. This approach is great because it doesn't matter for you how you run it. We can run this like any other test, and it tells us that we were able to run 393 iterations of integration tests per second with an error rate of 8%. In comparison, we were able to run 920 controller tests in one second, also with an error rate of 7%. These data show that integration tests were 2.43 times slower than controller tests. This is a significant difference and it's definitely going to be noticeable when running an entire test suite.

00:07:15.660 Now that I had proved that integration tests were indeed slower than controller tests and that this wasn't specifically a problem with Basecamp's code, I needed to double-check that this was still true on Rails master and the latest version of Ruby, which at the time was Ruby 2.2.0. Improvements to speed are being made all the time, so it's wise to ensure this is still a problem in the latest versions, as we don't want to redo work that someone else has already accomplished. I ran the same benchmark script with Ruby 2.1.0 and Rails master, and they were about the same, a little bit slower, but not concerning.

00:07:54.040 If you check the iterations, they're basically the same, with a slightly higher percentage of standard deviation. Running the tests on Ruby 2.1 and Rails master showed that integration tests were 2.53 times slower than controller tests. Integration tests were getting fewer iterations than controller tests, but we weren't seeing this when we were using time as a baseline. Now we have a solid baseline, and we can start improving our code. Having this benchmark script means we can be confident that we’ve improved the speed of our code and demonstrate it to others.

00:08:31.300 The next step is to pinpoint what is slowing down our tests. We can't just change things without knowing what is actually slow. The first gem I used to profile our integration tests was RubyProf. RubyProf requires Ruby 1.9.3 or higher and is faster than other profiling tools because it uses a C extension. RubyProf has been around a long time, being released about eight years ago. The first script I wrote to profile the integration test and identify where the slowdowns in Rails were is shown here.

00:09:02.540 To write this, I followed the first RubyProf example on GitHub. First, I set up the documents for integration tests so I could specifically profile the integration tests. Here, I'm using the RubyProf API to run the index tests in a block. This captures the integration test stack trace, and finally, I set the printer to the default printer, which is flat and outputs to standard out.

00:09:50.300 When you run RubyProf in this terminal, you get the following output. RubyProf has provided us with quite a bit of information. I don’t know about you, but when I first saw this, I was pretty confused. I wasn’t even sure what I was looking at and had no idea how to use this to speed anything up or understand what was going on. It felt like it was a secret code telling me where Ruby's slows were. So, let’s break down what we’re looking at. First, we have the percent total time spent in this method. This is calculated from time spent in the method divided by total time.

00:10:39.080 Next, we have the total time spent in this method and its children. Then, we have the total time spent in this particular method. After that is the wait time spent in this method and then the time spent in its children. We also have the total number of times this method was called, and finally, the method name that was called. Unfortunately, even after breaking this down, I wasn't entirely sure what to do with this. It was overwhelming, and I started reading the documentation more closely and experimenting with available options and printers.

00:11:45.320 I found an alternate printer called graph HTML printer. There are no screenshots of how these work, so you have to do a lot of trial and error to figure out what's useful and what data you can understand. Instead of printing to standard out, I printed to a specified HTML file, which I could open in a browser for better readability. But I still didn't fully understand what this was relaying. At first glance, it almost seems like the global method is your bottleneck because you’re spending 100% of your time there. That’s definitely not the case.

00:12:40.720 A better way to demonstrate the time spent is to use the call stack printer. This is by far the easiest output to understand. It creates a visual display of your stack and where your code spends the most time. What you need to focus on are the parts of the stack that are not part of the main stack trace. If you look closer at the top part where you have that straight line, you’ll see that 100% in the global method is just the first method call. You are always going to spend 100% of your time in the top of your stack.

00:13:26.420 So, you want to focus on the parts that are not part of the main stack and stick out further. Looking closer, we can see that we're spending about 12.56% of our time in the method called "budget_runner_reset." This could potentially be a good place to start looking for problems in your code. Root profiling is great for displaying the overall picture, but the data can feel overwhelmingly confusing if you’re unfamiliar with how to use it. There are many options that come with every profiling tool, so it’s worth experimenting with them.

00:13:59.870 Overall, RubyProf is most useful for finding the differences between the call stacks for integration tests and controller tests. In the end, we didn’t use it to improve the speed of the code—at least, not yet. Once I had these results, I posted to the Rails channel at Basecamp about my findings. I was hoping members of the Rails core team could offer some insights into the data I’d collected.

00:14:38.180 I received a lot of feedback, and Aaron Patterson chimed in, saying, 'Hey, that sounds like a really great idea for my RailsConf talk! Let’s work on this together so I can use your code in my slides.' I wish I hadn’t ruined that one. The first tool that Aaron introduced me to was StackProf. StackProf is a Ruby profiler like RubyProf, but it takes a sampling of the call stack, making it easier to focus on the biggest problems slowing down your code.

00:15:10.790 Let’s take a look at how to set this up. Here’s an example of the StackProf code we used. Again, we add the integration test code like we did in the previous examples. Then we run StackProf and set a file for it to output to. Inside the block, we execute MiniTest 3000 times to ensure we get a real sample of our Ruby code. We can run it like we did before, and since it’s outputting to a file, we only need to open the file we specified in the script.

00:15:49.190 You can do this using the command "stackprof --text path/to/the/file.dump." This command will display the sample profile of your Ruby code. Initially, it seems a bit confusing. It's similar to what we learned in the RubyProf section, but here the stack is sorted by time spent in each frame. Each column represents a different metric – the total number of samples where this print was in the stack, the percentage of time all samples spent in this frame, and the number of samples where this was the topmost frame.

00:16:24.060 Next, we have the time that shows the percent of time where this method was the topmost frame, and lastly, it displays the method that was called in this frame. StackProf output is sorted by time spent, so we can focus on the top line, where we see that 53% of our time was spent in a method called 'mini_test_run_all.' For those of you who saw Aaron's earlier talk, you might have an idea of where I'm going with this.

00:16:55.280 Let’s take a look at the 'on_signal' method. Essentially, this method is just yielding. If the integration tests are spending 53% of our time in 'on_signal,' we could remove all the code and make the method yield. However, there's no way that only a yield would be the source of slowness in our code. After running StackProf again, 'on_signal' is no longer in the stack, and the new slowest method is 'search_for_file,' taking up 14.7%.

00:17:38.700 So, we've confirmed that it’s a MiniTest issue, and it's time to create an issue report, grab a beer, and move on to another task. Oh wait, we forgot to do something super important. When it comes to improving performance, you need to follow the number-one rule of verifying your improvements. I call it ABB: 'Always Be Benchmarking.' Every time you change something to improve performance, you must benchmark your code.

00:18:14.180 Otherwise, you have no way to know if you are truly making it faster or just imagining it. This is like having unicorns running around—they're not real. So, let's run the benchmark script again with the changes made to the 'on_signal' and MiniTest, expecting to see substantial improvements now that we're not spending 53% of our time in 'MiniTest.'

00:18:59.670 When we run it again, we're still at 2.53 times slower, which is exactly the same, and it hasn't improved at all. At this point, I was pairing remotely with Aaron, who was just as upset. He said it was impossible, that there was no way this could be happening, and he was right. After some research and consultation with the creator of StackProf, he found that there was a bug in OS X that causes CPU time to be completely inaccurate and not correctly profile the code.

00:19:34.900 This is something to remember when using StackProf or any profiler that defaults to CPU time. Linux does not have this issue, so you can rely on CPU time with StackProf on that OS. We can resolve this by using wall time instead. So if we change the mode in our StackProf script from CPU to wall time, we get accurate results revealing the real problems.

00:20:11.780 The problem wasn’t in MiniTest; it wasn't even represented on our stack at all! Running it again with wall time showed a regression in the delegate method, where we previously spent 28.3% of our time. Aaron changed the delegate methods in route sets to just return straight away, and if we run StackProf again, we can see that module delegate is no longer present in the list. The slowest example we have now is just 4%.

00:20:46.340 If we continue to follow our number one rule of 'Always Be Benchmarking,' we run the benchmark script again. The integration tests have gone from 2.53 times slower to being only 1.4 times slower—this is a huge improvement! Here, we can see that we actually lost some iterations in the controller tests after this change, but the integration tests gained back 200 more iterations per second, indicating we're starting to see improvement in integration test speed while closing the gap between the two.

00:21:18.640 StackProf is great for pinpointing slowdowns in your code quickly, as it sorts by the slowest profile, allowing us to address the most significant issues first. However, we found it challenging to pinpoint problems with dynamically created methods or anonymous modules. One such method was 'url_helpers,' which we discovered was creating a net; URL helpers were creating an extra anonymous module. I looked back in the Git history and found that this method used to be made from an array, so I added caching back. The diff is much larger than I can show you here, but you get the idea.

00:22:03.420 URL helpers are now cached, so we should see significant speed improvements and everyone will be happy. After caching the URL helpers, we run the benchmark script again, and it shows a minor improvement down from 1.4 times slower to 1.36 times slower, while also showing an improvement in the controller tests.

00:22:39.400 However, I reported that there’s a better way to fix this than simply caching URL helpers when they're being called unnecessarily. The integration tests don’t always need URL helpers when they are called in initialization. The better approach is to move them to where they are required rather than caching them. The change is substantial; now, URL helpers are only called when the session resets. After implementing this, we need to remember to always benchmark.

00:23:28.960 If you run the script again, we can see that we have improved tests even more than when we captured URL helpers. Previously, we were 1.36 times slower; now, we have dropped to 1.12 times slower than controller tests. The chart clearly shows we've gained many iterations on integration tests and recovered the ones we lost on controller tests. StackProf is invaluable for highlighting the significant problems in our Rails source that contribute to slowdowns in integration tests.

00:24:18.020 Of course, I've made many other improvements that I didn't demonstrate today, but we were at the point where making performance tweaks became tedious. I noticed after removing the largest culprits that we were spending a considerable amount of time on garbage collection—approximately 11.13%. This was due to the creation of many objects that needed to be garbage collected. To figure this out, we turned to AllocationTracer, which was written by Kikuchi. This tool helps identify how many objects Ruby is allocating and where the allocations occur in your code.

00:25:29.460 AllocationTracer focuses on specific problems in your Ruby code and helps narrow down unnecessary object allocations. Here’s an example of the AllocationTracer script we used to track allocations in integration tests. To use AllocationTracer, we included object space, similar to our previous scripts.

00:25:47.460 Then we defined the index test as we’ve done in past scripts, included AllocationTracer, and called the tracing in our tests. We ran it through MiniTest 3000 times to gather accurate samples of allocations. Finally, we sort the results by count, allowing us to see the highest offenders at the top, and we can return only the first five, as that’s all we need.

00:26:35.440 When we run the AllocationTracer script, the output is a bit confusing at first. However, once you understand the data structure and what you’re looking for, it makes a lot of sense. The output will show you the path of the file allocating the object, the line number where the object was allocated, and the types of allocated objects (e.g., string, hash, struct, array). Then, the total count of created objects is given, which is crucial.

00:27:10.840 We also get the total age of objects, the minimum age, the maximum age of objects, and the total memory size. Specifically, we found that many of our object allocation problems were in Rack, not Rails. This showed the top offenders during the time we ran this script. To reduce time spent in garbage collection, we aimed to reduce the number of strings created.

00:27:57.080 We found that a simple way to do this was to freeze certain strings. After each of these improvements, we could check our allocations. At this point, we didn’t need to benchmark because we weren't attempting to speed up the tests per se, but rather we were looking to reduce time spent on garbage collection, which should eventually speed up the code.

00:28:46.920 In this run, we found that Rails replaced three of the Rack allocations in our top five. This is attributable to the changes, mainly within the reduced strings. The reasoning behind freezing the strings is that when you assign a string key to a hash, it creates a copy, increasing the number of objects. Freezing prevents this, but it also makes the string immutable. However, don’t rush into freezing strings without verifying that it’s necessary.

00:29:36.800 Premature optimization can lead to issues, especially with strings that must change. This should only be done if absolutely necessary. I hope we don’t see any of you submitting large pull requests to Rails with strings frozen everywhere, saying I suggested it! You must first prove that reducing allocations is the problem.

00:30:18.900 AllocationTracer was an effective tool that pinpointed where unnecessary object allocations occurred in Rails and where garbage collection was burdening our performance. Unfortunately, we didn’t see significant improvements from this mainly because such optimizations are often micro-level tweaks that get tedious and bore us.

00:30:42.780 Next, we focused on enhancing integration tests at a faster pace. Starting from where we began, we’ve improved greatly. Initially, we could run only 393 iterations of integration tests; now, we can run 828 iterations—an impressive feat!

00:31:00.640 This is a momentous difference, almost three times faster than in Rails 4. In fact, most of these changes did not alter the behavior of the code so that they're available in the newest Rails 4, meaning you don’t have to wait for Rails 5 to reap these benefits.

00:31:29.470 I hope this discussion helps you better understand how to utilize performance tools effectively. Be sure to recognize the advantages and disadvantages of the tools you are using. Understanding limitations prevents frustration when working with these tools. No single tool can address everything, so ensure you're using multiple tools. We were able to speed up integration tests because we combined Benchmark-AP, RubyProf, StackProf, and AllocationTracer. And, of course, always remember to benchmark!

00:32:18.700 This ensures you are actually speeding up your code, and it helps you defend your work. It may even help you identify bugs within your profiler. If you do ever submit a pull request to Rails or any other project that improves performance, please include those benchmark IDs in your commit message.

00:32:37.360 Lastly, I want to express my gratitude to a few people: thanks to Aaron for collaborating with me on this work. I know there were some concerns raised about him using my code for his RailsConf talk, but it was a collaborative effort, I promise. Thank you to all the gem creators of the tools we used, and thanks to Heroku for having me speak here today. Thank you all for listening!