RailsConf 2018

Using Skylight to Solve Real-World Performance Problems

Using Skylight to Solve Real-World Performance Problems

by Yehuda Katz, Godfrey Chan, and Krystan HuffMenne

In the talk titled 'Using Skylight to Solve Real-World Performance Problems' presented at RailsConf 2018 by Yehuda Katz, Godfrey Chan, and Krystan Huffmenne, the speakers discuss how Skylight can be utilized to improve the performance of open-source applications. They emphasize the challenges encountered in performance optimization and demonstrate practical solutions for three specific open-source projects using Skylight's performance metrics.

Key points discussed include:

  • Skylight for Open Source Program: Introduced to help existing contributors address performance issues by making performance data accessible, even for those without a Skylight account.
  • Examples of Open Source Applications:
    • The Awesome project, an event management tool, struggled with performance due to its self-hosting nature. They utilized Skylight metrics to identify and resolve N+1 query issues.
    • The Odin Project, an online coding curriculum, faced challenges regarding its extensive user data and the efficiency of rendering student progress badges. By optimizing database queries and adjusting index strategies, they improved performance significantly.
    • The Code Triage application, which helps triage GitHub issues, dealt with slow homepage rendering. The team implemented caching strategies to enhance loading times and discovered a critical caching issue related to item expiration that was affecting performance.
  • Broader Implications: Besides solving immediate performance challenges, the speakers noted they also found opportunities to contribute back to larger frameworks like Rails and to other open source communities through their work.

This session underscores the importance of performance optimization in software development, particularly in open source projects, and demonstrates how actionable insights from Skylight can lead to substantial improvements in application performance. The main takeaway is that with the right tools and strategy, teams can significantly enhance their applications' performance while encouraging community contribution to open-source projects.

00:00:10.880 Alright, so 240 on the dot, let's do it. I'm Godfrey. No one is introducing me today, so I have to do the work myself. But before we start, a very important question for you: Did you file your taxes? In case you didn't know, the deadline was yesterday. But if you missed it, don't worry, there will be no one here next year. More importantly, you are in the red room because this is a sponsored session. Just making sure everyone is here in the right room for a real talk. No regrets, right? So we all know what we're getting into; let's do this.
00:00:38.399 This video is brought to you by Blue Apron. You're busy, we get it; we're busy too. But that doesn't mean you can't have a nutritious and delicious meal. I figured I should burn the evidence before lawyers come calling me. I actually work for a small company called Skylight. I don't know if anyone believes us when we say we're a small company, so we decided to put the entire company on stage this year, hoping people will believe us. Even though we're not sponsored by Blue Apron, if you want to sign up for them, I have a personal promo code that you can use.
00:01:08.340 Speaking of personal referral codes, this is my personal referral code on Skylight. If you sign up with this URL, we'll both be swimming in Skylight credits. I'll give you a few minutes to copy that down. Got it? If you don't want to remember the long code, you can also use 'disgaea.l/r/o/railsconf' in which case—well, you will anyway. If you like free money, you probably are into cryptocurrency, so we're announcing the Skylight ICO as a token of appreciation. You can find the details at 'skeleton/favicon.ico'.
00:01:48.450 We have been very busy working on a lot of things in the past year, but instead of discussing all of them today, we decided to focus on just one: Skylight for Open Source. Vale will be here to tell you all about it. As Godfrey mentioned, one of the things we've been working on is the Skylight for Open Source program. It was born from our own experiences as open-source contributors. Many of us on the Skylight team actively work on open-source projects, and from our experiences, we saw that it can be pretty hard to get existing contributors to actually work on application performance issues.
00:02:29.130 Only a few contributors consider working on performance problems, to begin with, and even those who are interested can find it hard to know where to start. That's why we created the Skylight for Open Source program. This is, of course, the same Skylight that you know and love, except that open-source apps will have their public dashboards accessible to everyone. This way, all your potential contributors can access performance data easily, even if they don't have a Skylight account.
00:02:49.390 We started this program because we hope that by making performance data and information more accessible, we could inspire potential contributors to tackle those really slow parts of their favorite open-source apps. By leveraging Skylight, they can finally have a good way to see if their own contributions actually helped. Because at the end of the day, every single app—large or small, open source or not—has room for improvement when it comes to performance. After spreading the word about the Skylight for Open Source program, we decided to contribute back ourselves. We had a company-wide hackathon, yes—awesome! It's like a hackathon but focused on open-source projects, as you might have guessed.
00:04:09.120 We paired up to help make performance contributions and improvements to some of the apps that were participating in the Skylight for Open Source program. Today, we're going to show you three different open-source apps that are running on Skylight, each with their own unique performance challenges varying in complexity. To start, I'm going to hand it off to Yehuda and Peter, who will introduce us to the very first app. Hey Peter, want to see something awesome? Well, I'm glad you asked. Awesome stands for Open Source Event Manager. I have no idea if it's actually pronounced that way, but I'm going with it.
00:05:20.060 Awesome is an event management tool tailored to free and open-source software conferences. It's a one-stop shop to set up a website for your conference, handling everything from program registration to sponsorship, as well as the admin interface for editing all this content. Since Awesome is designed to be a self-hosted solution, working on performance issues could be tricky since you'll need real data and traffic to understand how the code is performing in the real world.
00:05:42.620 Luckily, one of the maintainers of Awesome is running a conference later this month, which is LinuxFest Northwest 2018, so you can see Awesome working in action on the website and on Skylight too. This is the Skylight dashboard for the LinuxFest website, which is powered by Awesome. At the top, you'll see a graph showing some key performance metrics over time. The blue line represents the app's typical response time, the purple line indicates the problem response time, and the gray part at the bottom represents the number of requests per minute.
00:06:02.690 You're probably wondering what the typical and problem response times are. The technical terms for these metrics are the median and 95th percentile, which are a pretty good proxy for what a typical user of your site would experience on a typical day versus a rainy day. However, why do we have to show you two numbers? Can't we just show you the average response time? That is a very good question; hold on to that thought for now. For now, let's shift our attention to the second half of the page: the endpoints list. This shows you all the endpoints in your app, their typical and problem response times, as well as their popularity. By default, they're ranked by agony, which helps you prioritize which performance issues are worth spending your time fixing.
00:06:49.950 It takes into account both how slow the endpoint is and how popular it is, so you won't waste your time fixing a slow endpoint that is only accessed by a handful of people every day. Those of you with good eyes might also notice the red heads-up icon next to the endpoint name; this indicates potential database issues. But what kind of database issues? To find out, let’s click on the first row to see the detailed endpoint page. The first thing you're probably going to notice is the histogram on top, which shows you a distribution of the response times for this endpoint. This is a good time to revisit the earlier question about typical and problem response times versus averages.
00:08:41.130 If there's one thing you remember from your statistics class, it's probably the normal distribution, aka the bell curve. You might expect your response time distribution to look something like that. If it did, the average would be right in the middle, making it a representative number. The reality, however, is that response times in the real world do not tend to follow this distribution; they usually show a distribution more skewed to one side. While many requests are fast, there is also a long tail of slow requests. In this case, an average does not accurately reflect the user experience of the fast requests and misleads about the slow requests, which are typically around ten times slower than the fast ones.
00:09:58.610 So, if you ask me to tell you where the average of this particular distribution is, I wouldn't be able to say. That's why instead of showing you the average and misleading you about the bell curve, we show you the typical response, which covers about half of all of your requests, and the problem response time, which covers about 95% of your requests. It's important to remember that one out of twenty requests falls outside of this range, so basically all of your users will experience slowness on the right side of that bar regularly. We show you both numbers to remind you of what your users actually experience in the real world.
00:10:50.300 So that's the first half of the page. The second half shows you what your app is actually spending time doing during the request. For example, as you would expect, most of the time in the request is spent processing the controller action, which is indicated by the blue bar. However, a lot of time is spent doing other things within the action, such as database queries and rendering views, which is why parts of the blue bar are painted in light blue.
00:11:12.690 The child tasks inside the light blue are also color-coded: green is for database-related processes and purple is for view rendering. You can click on the event to view more details about it, such as the actual SQL query being executed. If you remember the endpoint page we saw earlier, there was a heads-up icon referring to a potential database issue. In this endpoint, it corresponds to a red icon, indicating a common database issue in Rails apps called N+1 queries. Peter is going to explain what N+1 queries are and how they work.
00:12:27.310 Thank you! I imagine that many of you have heard of N+1 queries before, but if you haven't, don't worry; I'm about to explain them. Let's say you've built a blog in Rails. You probably have some code like this in the controller, where you're selecting the posts to send off to the template. Since this is a blog, you probably also have authors for each post.
00:12:36.790 Each post belongs to an author, and you'll have something like this in your templates where you're looping over each post and rendering the author's name and probably some other data about the post. This behavior may seem benign, but it's hiding some important work. When you access a post's author, Rails doesn't yet have the author data; it has only loaded the post itself. Therefore, it has to execute an additional query to fetch the author from the database. Since we do this for each post that we render, we end up issuing multiple separate author queries to the database, in addition to the one needed to fetch the posts. This is why it's called the N+1 query problem. Since we know we will need the author data, we can optimize this by fetching it ahead of time using just a single additional query.
00:13:57.650 Rails allows for an API to do this automatically. You simply change the controller code to add an 'includes' method; many of you may have done this, but if you haven't, it's a really useful optimization to perform. Now Rails will select from the posts in a single query, followed by another where it retrieves all the authors in one query. This is only the simplest scenario; in the real world, N+1 queries are often hidden deep within the code. They may follow a similar pattern but exist in other areas that might not be as obvious. Fortunately, Skylight does a good job of detecting and highlighting these issues, guiding you toward the proper investigations.
00:14:45.430 Going back to Awesome, we used the information from Skylight as a starting point. We pinpointed some areas where N+1 queries became an issue and submitted a pull request to address them. Our time is pretty limited here, so I won't delve into any of the nitty-gritty details. However, if you're interested in learning more, feel free to stop by our booth in the exhibition hall later today; we won't be around tomorrow and would be happy to discuss how we identified and fixed some of these problems. Now, I'm going to hand it over to Kristin and Zach to talk about their work on another open-source app, The Odin Project.
00:19:11.370 However, one challenge in running an app like The Odin Project is the size of its community. With over a hundred thousand users, there’s a lot of information to track for each user, such as which courses they took and their progress. For instance, let's take a look at the work involved in rendering these beautiful badges. To render a badge, you need to know two things: the lessons belonging to the course, and which of those lessons the student has completed. The first part is a simple select query, while the second one involves a join table called lesson completions, which is pretty standard.
00:20:07.060 Since we are showing badges for every course on the page, to avoid the N+1 query problem, we perform a single query to fetch all the completion records for the current student. However, according to Skylight, the second query takes up a noticeable amount of time on each request—certainly not terribly slow, but noticeable. So the question is, can we make it faster? Databases use indexes to keep these queries efficient. Generally, you do a little extra work on the insertion of the data in exchange for a considerable performance boost while querying.
00:20:59.280 In many situations, the answer is simply to add the missing index. However, this case is more nuanced because the student shouldn't be able to complete the same lesson twice. A compound index already exists on the lesson ID and student ID, guaranteeing uniqueness. Since we're querying solely on the second field, we must determine whether this compound index is adequate for our query’s needs. For a long time, and across most databases, the answer would be no. However, starting in version 8.1, Postgres is sometimes able to exploit the compound index for queries that use only the second column in that index.
00:22:14.390 Unfortunately, even when that is possible, it’s still not as efficient as having a dedicated index for that second column. We can confirm this by running 'EXPLAIN ANALYZE' on the queries. If you haven’t used this before, it’s a great tool in Postgres to comprehend what the database is doing under the hood in helping optimize your queries. If no indexes exist at all, Postgres will be forced to perform a sequential scan, which could be quite slow on a large table.
00:22:58.190 Now, let's add the compound index and try again. As you can see, Postgres is using the compound index here, but it’s still not as fast as it could be. Finally, let’s add a dedicated index for the student ID column. Now, Postgres can take full advantage of the index and significantly improve query performance. Remember, none of these queries use the first column of a compound index. If we were to query solely on the lesson ID, Postgres would still leverage the compound index just as effectively.
00:24:22.540 Here's a summary of our findings: First, Postgres is generally smart about employing multiple indexes for a single query; therefore, you should always start with individual indexes. However, if you need a compound index for other connections, it becomes complicated. Ensure you prioritize the field you're querying independently and place it first in the compound index. Alternatively, maintain a dedicated index for any additional columns you wish to query.
00:25:09.840 As we engaged with these optimizations, we submitted a pull request to improve this query's performance. While working on this, we discovered a parallel case on the lessons table, where a compound index existed for slug and section ID, while we wanted a query based solely on section ID. We considered adding a dedicated index for section ID too; however, when we executed 'EXPLAIN ANALYZE,' we found that Postgres wasn’t employing the new index. In this instance, the lessons table was quite small, with fewer than 200 rows in total. With such a small table, Postgres could execute a sequential scan far more rapidly than using the index.
00:26:15.000 The moral of the story is: don't optimize prematurely, and always verify your assumptions. It’s often better to take the straightforward route and allow Skylight to highlight what’s sluggish, then spend time investigating and optimizing your code. Ultimately, we managed to identify why the second query was slow and sent a second pull request to fix it. However, that's a story for another time. Should you be curious, come ask us about it at our booth after the talk. For now, I’ll pass it over to Vitae and Godfrey to discuss the open-source app that they helped improve.
00:26:46.899 I can't click until you're ready. It's ready enough. Thank you, Kristin! We decided to work on one of our favorite open-source apps called Code Triage. What makes Code Triage special is that it’s an open-source app built to help other open-source projects. Popular open-source projects receive a host of bug reports, feature requests, and pull requests every day, and just reading through all of them can be a massive drain on these projects’ maintainers’ time.
00:27:20.230 We're involved in various popular open-source projects and thoroughly understand this problem. Code Triage enables you to assist your favorite project maintainers by signing up to help triage GitHub issues and pull requests. Once you sign up, the app sends a random ticket to your inbox every day, enabling you to help distribute the workload so that everyone only has to contribute a little bit each day. You can even volunteer to help triage and improve documentation on the project. Receiving just one ticket in your inbox daily is actually quite an appealing concept.
00:28:23.800 Currently, there are already tens of thousands of users helping out in thousands of repository projects in this way, which is pretty impressive. As we can see, at this scale, some fascinating and unique challenges arise. Let’s examine the homepage, for example. When you visit codetriage.com, you can browse through many popular projects and see the number of open issues needing triage for each one. That equates to substantial information to render, which can make the site a bit slow at times. This page is also one of the most frequently visited, meaning optimizing it presents a valuable opportunity for performance enhancement.
00:29:03.060 Notably, even though there’s a wealth of information on the homepage, most of these elements do not change frequently, making the page an ideal candidate for caching. However, it turns out that a lot of performance-minded individuals have already worked on Code Triage, refining everything that could be optimized. Because of that, many of the caching strategies we identified had already been implemented. When examining this page through Skylight, we noted that to populate the meta tag, we had to run two queries to fetch the counts of users and projects on the site. Since these numbers don’t need to be super current, we should be able to cache them.
00:30:34.930 We submitted a pull request to cache this information for up to an hour. As expected, when we deployed this change, we succeeded in shaving off a little bit of time from the homepage's load. Aside from the header, footer, and meta tag, most of the time is spent rendering cards on this page. While they may appear straightforward, a tremendous amount of work goes into rendering these cards. However, for even the most popular open-source project, issues are submitted sporadically, maybe just a handful per day or every few hours. Most of these cards, therefore, typically don't change much, leading one to assume that the solution likely lies in fragment caching since we’re using Rails.
00:31:40.160 Interestingly, while many of the performance optimizations on this app have already been executed, these cards are framed cached. This seems good, but in fact, they are cached more intelligently than that. If you’ve just learned about fragment caching, you might think to loop through each card and cache them inside the loop. While this works, it poses a problem: Rails will query your cache store to fetch the fragment for each card every time it renders. So, now when we are looping through, say, 50 cards, this leads to making 50 requests to your cache store, which generates considerable overhead.
00:32:56.320 This issue resembles the N+1 query problem in caching. To resolve this challenge, the Code Triage maintainer capitalized on a Rails 5 feature called collection caching. Instead of iterating through each card and caching them individually, you move the content into a partial and use collection caching, telling Rails to cache appropriately. By doing so, Rails can fetch all the fragments in a single request to the cache store, significantly minimizing network overhead. Skylight reflects this enhancement in performance on the dashboard where the cache read count is shown as a single request.
00:34:02.790 However, upon closer inspection, we realized that something was wrong. If you look at the logs, you’ll see something surprising; for each request to the homepage, we were missing more than half of the 50 cards from the cache. If this were a one-time occurrence, it might not be a substantial issue. Rails could generate the missing content and save it to the cache, allowing subsequent requests to utilize that cached data. Oddly enough, though, we were able to reproduce these cache misses even after multiple consecutive requests to the homepage.
00:35:03.440 At this point, the only explanation could be that something else was inhabiting the cache, preventing our cards from being cached. We theorized that our cache must be full, but upon review, we had provisioned a 100-megabyte memcache store for this app, which should comfortably accommodate many cards. Each card, being a few hundred bytes, would imply needing at least a couple of hundred thousand of them to fill the cache. As an illustration, if this square represents our 100-megabyte cache space, those 50 cards would occupy only a small fraction of a pixel.
00:35:48.860 Thus, this wasn’t our issue. It could be possible that something unusually large in the cache was consuming all this space. Our understanding of memcached was that it operates as an LRU (least recently used) store—that recently used data is evicted first. However, since we weren't caching anything significantly large on the homepage, our mental model suggested that memcache would evict older items in the cache as necessary to allocate space for our cards. Clearly, our understanding wasn’t accurate, so we needed to learn more about how memcached manages its data.
00:36:58.800 Before we delve further, I should note that we’re not running memcached ourselves; we're using a SaaS provider that implements a proprietary memcache-compatible store. Some details may differ from standard memcache, but much of this pertains to it. The first thing we learned is that our cache store isn’t just 100 megabytes of available space; instead, memcache categorizes items into tiers based on their sizes, managing the space for each tier independently. We observed this in action by running the 'stats items' command in memcache.
00:38:07.360 The output showed various statistics, including sizes for different tiers, the number of items stored, and how many items have been evicted from a specific tier. When caching similarly sized items, they could end up competing for a smaller amount of space than you'd expect. For our cards, fragments being generally under one kilobyte fit perfectly into bucket number four, responsible for storing anything up to a kilobyte. The prior tier manages things up to 512 bytes. We noticed around 5,000 items in bucket four, amounting to about 5 megabytes, considerably less than the 100 megabytes we had initially assumed.
00:39:09.900 This implied that items of the same size in a memcache store might compete for a much smaller amount of space than expected. However, five megabytes is still enough to accommodate 5,000 cards. On the other hand, with all that in mind, we were only trying to cache 50 cards, so the math still didn’t add up. Paradoxically, checking the eviction count revealed that we had evicted almost two million items from bucket four alone. This supports our observation that the cache was overflowing.
00:39:51.050 It seemed there must be something other than our cards occupying the space and that these non-card items were somehow not being evicted. While it would be great to examine the contents of the bucket, our SaaS provider wouldn't allow it. Based on our research, we theorized that memcached features an automatic expiry on cached entries known as TTL (time to live). Our hypothesis was that mixing TTL data and non-TTL data within the same bucket could lead the cache store to always evict non-TTL data before considering any TTL data for eviction.
00:40:43.460 Since our cards did not have TTL and our cache was already mostly full, they would always be the first to be evicted. To check, we flushed the entire cache manually, and sure enough, it freed up space for our cards, achieving a 100% cache hit rate on the homepage, leading to a significant performance improvement. This investigation revealed that we had inadvertently discovered some bugs in other open-source projects. If you look at the Skylight metrics from before we flushed the cache, you can see that we utilized a single cache multi-read, which kept us from incurring a 'n+1 queries' problem.
00:41:34.350 However, we immediately found ourselves spending considerable time on cache reads and writes. When reviewing the logs, we saw that a particular read multi operation took longer than expected to return results, fetching all 50 fragments in a single request. Even so, when rendering the missing roles, we made another read to the cache store, despite being aware that the cache did not contain anything. Basically, during each homepage request, the cache misses compounded times many items were rendered!
00:42:10.380 That’s problematic. We should be able to eliminate this redundant read entirely because it added noticeable time to the page's load, even without a consistent 100% hit rate. Keeping up with the logs also revealed a second problem: while we generated and rewrote the missing roles back to the cache for good measure, we didn’t need to do this ready for every occurrence in the request; we could significantly cut this time in half. Over our work, we also discovered a bug in Rails 5 that causes each cache item to consume twice the necessary space.
00:43:00.700 We believe this is an issue within Rails or the Dalli gem. As we continue to refine our findings here, we are working on patches to address these concerns, and we aim to have them completed when we return to Portland. Importantly, the specific bug regarding Rails 5 and the two compression issue have already been fixed and merged in the master branch, which you should expect to see in Rails version 5.2.1 very soon. We initially set out to find ways to assist our favorite open-source apps with Skylight, and I believe we accomplished that.
00:43:49.160 However, we also managed to achieve something greater than that. During this journey, we discovered methods to contribute back to upstream projects like Rails, and we hope that initiatives like Skylight for Open Source and this journey will inspire more contributors to do the same. Speaking of journeys, I’ll be giving a talk tomorrow about the internals of the Rails router, and it’s going to be a wonderful experience—if I’m getting it right, that is! But I hope to see you all there!
00:44:52.980 For now, since we promised to bring the entire company here, I’ll send it over to Olivia with a baby. I usually bring him on stage with me, but he was so quiet and wonderful that I decided to skip it. I chose this Bitmoji because, well, that's what CEOs do: show up at the end after everyone else has done all the work and take all the credit. Thank you all! That was fabulous. This is our team. We are a small company, and I want to say a little family. I feel like all companies call themselves family nowadays, but it has no weight anymore.
00:45:54.000 However, we do invite our actual families to work with us. For example, we have a 'babies at work' program, and Jonas, which is my son, along with Kristen's daughter, hangs out at the office all the time. Turns out it’s not impractical at all, and it makes everyone happy. You walk down the hall and see babies babbling and smiling. I wrote a blog post about our 'babies at work' program. So if you are aiming to attract talent, parents, especially women, definitely mention it to your HR person, and they’ll probably look at you as though you’ve lost your mind. But then point them to my blog post because I discussed all the reasons I thought it was a terrible idea, and I was proven wrong!
00:46:44.440 We build Skylight and work really hard on it because we love giving back to the community. We are focusing intently on making performance a universal responsibility for everyone on the team. Many, especially those at the start of their programming careers, might think of performance work as something only the ops team handles, but it shouldn't be that way. One of our missions is to shift that notion, allowing anyone on your team to pause for a moment, jump into Skylight, discover what they can do, and find some low-hanging fruit.
00:47:38.130 We have plenty of low-hanging fruit outlined on our documentation site. If any of these suggestions have been outside your skillset, many of them are on our docs site, providing guidance on fixing issues in your own app. We have a booth downstairs so stop by if you wish to discuss Skylight or share your thoughts. Thank you all once again!