Talks

Lightning Talk: The 6 Characters That Could Bring Down Your App

by Moncef Belyamani

In this lightning talk, Moncef Belyamani presents a cautionary tale about how a seemingly innocuous method can lead to severe performance issues in an application. He recounts his experience working on a Rails app for the Department of Veterans Affairs, detailing how the use of '.count' in database queries caused significant CPU strain and made the app hang for users. Belyamani emphasizes the importance of understanding the consequences of different methods in Rails and outlines more efficient ways to check for empty results.

Key points discussed include:
- The Problem: Moncef reveals that calling '.count' on Active Record relations translates to expensive count queries, particularly on large tables, leading to performance degradation.
- The Context: He describes his work at Truss modernizing legacy systems to improve services for veterans, including the Caseflow project, which focuses on tracking and processing claims more efficiently.
- The Investigation: On a winter day, after the team noticed high CPU usage and a related exception in Sentry, Moncef was asked to troubleshoot the issue, which turned out to be caused by an inefficient query method.
- Identifying the Culprit: Spotting the infamous '.count' in the code, he shared how such a method could greatly slow down the application, especially given the growing size of the documents table.
- The Solution: Belyamani proposed replacing '.count > 0' with '.any?', which reduced CPU utilization by roughly five to ten times. The clean-up extended beyond this one instance, as he found and fixed several other occurrences of '.count' in the codebase.
- Lessons Learned: The talk concludes with valuable action items: developers should audit their code for occurrences of '.count', consider adding timestamps to tables for better tracking, and monitor SQL performance more closely to prevent future issues.

Ultimately, Moncef encourages developers to be proactive in optimizing database interactions and highlights the impact that small changes can have on application performance, especially in systems dealing with large datasets. He ends by inviting attendees to explore more details and resources related to the discussion on his blog.

00:00:00.160 Our next talk is a lightning talk about the six characters that could bring down your app. This talk is presented by Moncef Belyamani.
00:00:11.360 Moncef writes coding guides for beginners and beyond. His very first tutorial for setting up a Ruby web development environment on a Mac has helped tens of thousands of people since 2012. The process is now automated by his popular script. Moncef has extensive experience across both government and private sectors and has helped shape best practices on various engineering teams. His specialties include speeding up test suites and other repetitive tasks, improving code quality, and teaching people how to set up and use their Macs more efficiently.
00:01:14.560 This talk is a true story about how using the wrong method caused our app to hang for users and how it increased our database's CPU usage by up to 10 times. You probably are using the same method in your app right now, so find out what you should use instead. Let's go!
00:01:49.360 Hi everyone, my name is Moncef Belyamani. You might know me from my popular tutorials and scripts for setting up a Ruby development environment on a Mac, which you can find on my website, or maybe you know me as Monfresh, which was originally my DJ name but now I use it everywhere online. Either way, welcome! This lightning talk is based on a blog post of mine that was featured in Ruby Weekly earlier this year, and a lot of people told me it was very helpful. So hopefully, you'll learn something new today as well.
00:02:23.599 This story is about a common method and expression that ended up bringing down a Rails app I was working on, and it's likely that you have it in your app right now. Let me give you a quick background on this app. I work for Truss, a small consultancy, and one of the best companies I've ever worked for. We solve problems for both government and private sector clients and work on impactful projects that help improve people's lives. I was hired for my Rails expertise and joined a project called Caseflow for the Department of Veterans Affairs, or the VA for short.
00:03:14.240 This was a collaboration between the U.S. Digital Service, Navalist, and Truss. In the U.S., hundreds of thousands of veterans suffer from health conditions and depend on financial assistance from the VA. They can submit claims to receive benefits, but the problem is that it can take years to process these claims. To address this, this group of wonderful folks modernized legacy systems from the '80s and built open-source Rails apps using human-centered design to make it easier and faster to track and process claims.
00:03:37.519 One of those apps was for the Caseflow project, and the app was already three years old when I joined. Now, this story begins on a winter day a few years ago when I was working by my cozy fireplace, and I received a phone call from the engineering lead. He told me there was a recurring exception in Sentry that was affecting pretty much every user at the same time. Our Datadog dashboard showed that the Postgres CPU usage was a lot higher than normal.
00:04:03.760 Based on the Sentry logs, the team lead thought this exception was caused by a recent deploy, since it seemed to happen all of a sudden, but he couldn't find anything in the most recent commits. So, he asked me to look into it. As someone new to this project, I had the advantage of fresh eyes. As a side note, the photo with a person screaming into the phone was for comedic effect; in reality, this was a calm conversation over Slack. Going back to the problem, the first thing I did was look at the exception in Sentry.
00:04:25.120 I found the problematic line of code, which had been introduced about a year earlier. I immediately knew what the problem was when I saw the infamous six characters. Can you spot them? Yes, '.count'. Now in Rails, calling '.count' on an Active Record relation translates to a count query in Postgres, which is slow and gets slower as the table increases in size. I confirmed this by noticing statement timeouts for those count queries in New Relic, and this was with a statement timeout set to 30 seconds.
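For reference, here is a minimal sketch (not the actual Caseflow code) of what that translation looks like:

```ruby
# A bare count on a model or relation issues a COUNT query that Postgres
# has to answer by scanning the matching rows:
Document.count
# SELECT COUNT(*) FROM "documents"
```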
00:05:15.919 I also double-checked by looking at the ID of the last record in the table, and it was over 34 million. In this case, the query included a '.where' and a '.where.not', so it wasn't trying to count the entire table, but I presumed that even the filtered list was so big that Postgres couldn't count it all within 30 seconds. Looking at the code more closely, we can see that the only reason the previous documents were being counted was to see if there was at least one of them being returned by the query.
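The problematic pattern probably looked something like this; the column names here are my own guesses for illustration, not the real Caseflow code:

```ruby
# Hypothetical reconstruction; the real filters and column names differ.
previous_documents = Document.where(file_number: file_number)
                             .where.not(series_id: series_id)

# The code only needed to know whether anything matched, yet the check
# asked Postgres for an exact count of a very large filtered set:
previous_documents.count > 0
```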
00:06:13.680 We were not at all interested in the exact number of documents that matched the first query, just whether or not the result was empty. Now, there are more efficient and idiomatic ways to check if a query returns any results. The problem is there are several of them, such as '.any?', '.exists?', and '.present?', but they don't behave the same way. On top of that, the behavior depends on the Rails version. It's hard to remember which one to use in which situation, and there aren't any automated tools that I know of to guide us. Luckily, Nate Berkopec wrote a detailed article explaining it all.
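As a rough comparison, based on the behavior described in the talk and in Nate Berkopec's article (exact SQL and semantics vary by Rails version), the options look roughly like this:

```ruby
docs = Document.where(file_number: file_number)  # unloaded relation

docs.count > 0  # SELECT COUNT(*) ...           counts every matching row
docs.exists?    # SELECT 1 AS one ... LIMIT 1   stops at the first match
docs.any?       # without a block, behaves like exists? on Rails 5.2
docs.present?   # can load the whole result set into memory just to check it
```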
00:07:04.640 Now, since we were using Rails 5.2, the best solution was to replace '.count > 0' with '.any?'. As soon as this change was deployed, the CPU utilization was reduced by about five to ten times. Ta-da! Once this production issue was resolved, I found other instances of '.count > 0' that could be replaced with '.any?', and I fixed those as well. Another nice moment was when Nate found my PR and highlighted it in his Slack.
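In other words, the fix boiled down to a change like this (a minimal sketch with made-up names, not the actual PR):

```ruby
# Before: Postgres counts every matching row, which started timing out
# once the documents table grew large enough.
if previous_documents.count > 0
  # ...
end

# After: on Rails 5.2 this performs a cheap existence check instead,
# stopping at the first matching row.
if previous_documents.any?
  # ...
end
```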
00:07:40.800 One question that came up was why the timeouts happened all of a sudden on that particular day. My best guess was that the documents table was slowly growing over time and finally reached a size at which Postgres could no longer execute the count query within 30 seconds. Perhaps a larger than usual number of documents were created on that day. Unfortunately, the table didn't have any timestamps, so we couldn't easily tell how many documents were created at any given time.
00:08:33.840 So, what have we learned today? Here are actions you can take today to improve your project: Check your app for '.count > 0', '.count.positive?', or just '.count', and replace them with one of the faster methods. You can add timestamps to any tables that don't have them, even if you think you'll never need them. You can add monitoring to track and measure things that matter, and consider using a lower value for global database statement timeouts, setting different values on a case-by-case basis, such as for long-running migrations.
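As an illustration, the timestamps suggestion might look like the following migration; the table name is an assumption, not taken from the Caseflow codebase:

```ruby
# Add created_at/updated_at to a table that is missing them.
class AddTimestampsToDocuments < ActiveRecord::Migration[5.2]
  def change
    # null: true avoids failing on existing rows that have no timestamp values
    add_timestamps :documents, null: true
  end
end
```

For the statement timeout, one common approach with Postgres is to set a lower global value via the 'variables: statement_timeout:' entry in config/database.yml and raise it explicitly for specific cases such as long-running migrations.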
00:08:59.919 Well, there you have it. Thank you for your time. Again, my name is Moncef Belyamani, and you can find me as Monfresh everywhere. For more details and links that I referenced in this talk, as well as the PRs I submitted, check out my blog post at tinyurl.com/dotcount. Enjoy the rest of the conference!