Maciej Rząsa

Panel: Performance Problems in Rails Applications

wroc_love.rb 2024

00:00:06.120 All right, but now we'll start with the performance questions. The panelists are a bit unprepared, and so am I. Luckily, they are experts in this field.
00:00:12.200 Let's start with the first question: What are the typical performance problems in Rails applications, and how do you solve them? We can start with Stephen based on your experience.
00:00:24.640 I feel like most people would probably focus on database bottlenecks and how to make their queries faster. One interesting thing I found while preparing for a talk like this is that Action View is incredibly fast if you're just rendering a view. However, as soon as you call out to a partial, it gets significantly slower. Most of the applications I've worked with average about three to six layers of partial calls. The ERB engine itself is quite fast, but the Rails rendering layer that actually finds and compiles those partial files is slow. I remember running some benchmarks where I expected my write-benchmarking APIs to be slower than my reads, because writes are linear and all of that. Surprisingly, the writes were consistently faster.
00:01:10.479 I looked into what was happening, and SQLite was way faster than Action View. In one experiment, I introduced just one partial: I had an index view building a table, and I pulled the table row for each post out into a single partial. That change made it 40% slower. So sure, if you've got really slow queries, fix them, but don't lose sight of how deep your partial layers go.
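A rough sketch of the experiment described above (view and partial names are illustrative, not taken from the talk); the only difference between the two versions is whether each row goes through a partial render:

```erb
<%# app/views/posts/index.html.erb -- fast version: rows rendered inline %>
<table>
  <% @posts.each do |post| %>
    <tr><td><%= post.title %></td><td><%= post.created_at %></td></tr>
  <% end %>
</table>

<%# slower version: each row is rendered through a partial %>
<table>
  <% @posts.each do |post| %>
    <%= render "post_row", post: post %>
  <% end %>
</table>

<%# app/views/posts/_post_row.html.erb %>
<tr><td><%= post.title %></td><td><%= post.created_at %></td></tr>
```

Rails' collection rendering (`render partial: "post_row", collection: @posts, as: :post`) looks the template up once for the whole collection, so it is cheaper than calling render inside the loop, but it still costs more than inlining the markup.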
00:01:50.799 For instance, you might have a 'show' view that renders a header partial; the header partial includes a left and a right partial, and the left partial calls yet another button partial. That kind of structure can kill your application's performance. Okay, Stephen is very focused on views. Now let's imagine that you don't have any views and only talk about APIs, setting queries aside, since those can of course be improved by indexing and other means. It is easy to forget what is actually happening under the hood, and that is something you need to be very careful about when using Active Record and frameworks in general.
00:02:41.959 That's why talks like the one Stephen gave, going into the deep details of what's going on, are very important. During code reviews I commonly see large collections loaded into memory, and I often find myself advising, 'Please replace this each with a find_each.' Although you're only changing five letters, if you load a thousand items into memory and then iterate over them, you are doing costly work that could be much more efficient. Understanding exactly what you are using, and making sure you run background jobs in the most efficient way, is also crucial, because many times we think, 'Let's run everything as a background job.' However, that can lead you to serialize a large object, which makes your background queue slower than running the work in the foreground, because the majority of your time is spent serializing objects.
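A minimal sketch of that five-letter change (model and mailer names are illustrative):

```ruby
# Loads every matching record into memory at once, then iterates over them.
User.where(active: true).each do |user|
  NewsletterMailer.weekly(user).deliver_later
end

# Loads records in batches (1,000 by default), keeping memory usage flat.
User.where(active: true).find_each do |user|
  NewsletterMailer.weekly(user).deliver_later
end
```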
00:03:41.080 So, this is just about developing a critical mindset: understand exactly what you are using. When enqueuing a background job, it's often more efficient to just pass numbers, IDs, and reload your objects inside the job instead of serializing a large object, which can be slower than running a database query to load it again. These are two additional considerations I think it's important to emphasize.
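A minimal sketch of that pattern, with hypothetical job and model names (note that Active Job already serializes model arguments to a GlobalID reference, so the concern applies mainly to large hashes, attribute dumps, or custom serialization):

```ruby
# Heavier: the whole attribute hash travels through the queue's serializer.
ReportJob.perform_later(report.attributes)

# Lighter: only the id travels; the job reloads the record when it runs.
class ReportJob < ApplicationJob
  queue_as :default

  def perform(report_id)
    report = Report.find(report_id)
    report.generate! # hypothetical domain method
  end
end

ReportJob.perform_later(report.id)
```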
00:04:24.320 All right, let me repeat the question for you: What are the typical performance problems in Rails applications, and how do you solve them? I believe my colleagues have answered most of them, but from my observation, most performance issues are related to fetching too much data or executing too many queries to the database. One specific case to consider arises when you need to fetch data from two sources. In such situations, you may need to essentially reimplement some database algorithms at the application level.
00:05:03.400 For example, let's say you have transactions from one source and users from another source, and you want to match them. If this leads to a nested loop approach, it can be quite slow. Additionally, if you fetch large amounts of data from Elasticsearch and keep it in hashes, trying to add just a few fields to those deeply nested hashes can become rather costly as well. I once spent half a day trying to optimize such an update.
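A sketch of the difference, assuming illustrative field names: building a lookup hash turns the quadratic nested loop into a single pass.

```ruby
# Nested loop: every transaction scans the whole users list (O(n * m)).
transactions.each do |txn|
  owner = users.find { |u| u.external_id == txn.user_id }
  # ... combine txn and owner
end

# Hash join: build the lookup once, then match each transaction in O(1).
users_by_id = users.index_by(&:external_id)
transactions.each do |txn|
  owner = users_by_id[txn.user_id]
  # ... combine txn and owner
end
```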
00:05:51.919 Fortunately, we managed to bypass this issue by communicating with the product team. Instead of trying to optimize the update, we decided to present less data. That's actually one of the common problems I have noticed in the industry.
00:06:46.520 Based on what you have said, I'm curious about the models we often use, the ones driven by the framework we rely on and that most of us love. How do you see growing models, like a user class with 57 columns in the database, affecting performance? I am asking all of you, by the way.
00:07:11.280 Moreover, this situation necessitates performing numerous joins sometimes to present a more sophisticated page, like a report. How do you view this issue in general?
00:07:34.280 If I may start, if I need to show very specific data for a background job that sends data to an external service, I do not hesitate to bypass normal model usage, instead directly fetching the data I need, structuring it simply, and then sending it off. When dealing with regular business logic and needing methods defined on models, I'd consider their structure more seriously, but having numerous fields—say 60—might not be a significant problem.
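A minimal sketch of that kind of bypass, with illustrative model and column names: fetch only the columns you need as plain values instead of instantiating full Active Record objects.

```ruby
# Instantiates full Active Record objects with every column loaded.
payload = Order.where(status: "shipped").map do |order|
  { id: order.id, total_cents: order.total_cents }
end

# Fetches only the two needed columns as plain Ruby values.
payload = Order.where(status: "shipped")
               .pluck(:id, :total_cents)
               .map { |id, total| { id: id, total_cents: total } }
```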
00:08:20.000 This ties back to focusing on pragmatism and understanding the context of problems. Sometimes the difference between 100 microseconds and 1 millisecond is crucial for business, but those scenarios are not as frequent as we might think.
00:08:50.480 Adding unnecessary complexity to an issue that does not require such attention can often cause more problems than solutions. Therefore, in most cases, models don’t affect performance. However, it's important to deeply understand the specific problem you're dealing with, including your performance budget and the factors that influence it. You need to do enough benchmarking to identify the cost of having an additional column in your model.
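For that kind of question, a quick micro-benchmark is usually enough; a minimal sketch using the benchmark-ips gem, with an illustrative model:

```ruby
require "benchmark/ips"

Benchmark.ips do |x|
  x.report("all columns") { Post.limit(1_000).to_a }
  x.report("two columns") { Post.select(:id, :title).limit(1_000).to_a }
  x.compare!
end
```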
00:09:44.000 An example of this would be whether to pull out a JSON payload column into a separate table, only retrieving it when necessary, or keeping it in its original table. I made that decision once, thinking it was smart to avoid loading unnecessary data, but it turned out to create a myriad of complications that were not worth the minuscule performance improvement.
00:10:32.680 I suggest pausing to consider your situation: negotiating with front-end and API clients is essential. In many instances, you don’t always have to load everything simultaneously, so loading on demand can significantly improve user experience. So while the structure of your data might not be problematic, how you access and use that data bears considerable significance.
00:11:35.360 Sometimes, introducing a new column can be beneficial as it may serve as a counter cache, but it's important not to treat caching as a panacea. Generally speaking, thoroughly measuring and comprehending the issue at hand can lead to finding the right solution, and my advice often circles back to the notion that it depends.
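Counter caches are one of the cases Rails supports directly; a minimal sketch (table and association names are illustrative):

```ruby
# Migration: add the cache column to the parent table.
class AddCommentsCountToPosts < ActiveRecord::Migration[7.1]
  def change
    add_column :posts, :comments_count, :integer, default: 0, null: false
  end
end

# Model: Rails keeps the column in sync when comments are created or destroyed.
class Comment < ApplicationRecord
  belongs_to :post, counter_cache: true
end

# Reading the count no longer issues a COUNT(*) query.
post.comments_count
```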
00:12:33.200 Understanding context, including budgets and user experience, is crucial as well. If at some point your database grows tremendously in terms of record numbers, this can lead to performance issues, which is a problem worth considering.
00:12:49.280 In my past experience working on a large application, I had to address partitioning. I want to note, however, that partitioning isn’t always necessary; if you have a good index, isolating your data can be efficient enough.
00:13:03.680 Thank you. Are there any questions from the audience? If not, let's move on to the second question: how to triage performance problems, and what are the best ones to solve first? Let's start with Maciej this time.
00:13:21.360 I’d echo what Kyo said. Triage requires communicating with the business to ascertain the areas of pain. If they feel something is running too slowly, that's where we should start. However, there's a caveat: sometimes we hear complaints about performance without fully understanding that the site might be slow due to the amount of data we need to show. This is where our expertise must come into play.
00:14:14.359 My heuristic is that if I come across a request taking longer than one second, I consider it worthy of investigation because it could be an easy win in terms of optimization.
00:15:02.000 Additionally, there’s an interesting point you brought up regarding complexity. What do you think is the biggest contributor to performance issues? Is it the volume of data being fetched or the structure of SQL queries?
00:15:26.440 In my experience, particularly in complex analytics applications, the prevailing issue tends to be loading too much unnecessary data, usually because implementing pagination is viewed as costly: developers have to think not just about the number of pages to show but also about search functionality and proper sorting done in the database.
00:16:12.240 When developers prioritize minimal implementation effort, the application can become slow, particularly when a ticket is delivered without pagination or any limit on the data.
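A minimal sketch of the difference, with illustrative names; in practice a gem such as Pagy or Kaminari would handle the bookkeeping:

```ruby
# Unbounded: returns every matching row, however many there are.
@events = Event.where(account_id: account.id).order(created_at: :desc)

# Bounded: cap the page size and let clients page through the rest.
page     = params.fetch(:page, 1).to_i
per_page = 50
@events  = Event.where(account_id: account.id)
                .order(created_at: :desc)
                .limit(per_page)
                .offset((page - 1) * per_page)
```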
00:16:57.120 Okay, thank you. Stephen, what is your perspective on this?
00:17:13.920 For triaging performance issues, I recommend a technique popularly known as time boxing exploration. This involves addressing a problem with the goal of identifying the most significant leverage opportunities. High leverage opportunities allow you to invest a short amount of time and gain a substantial performance boost.
00:17:59.920 Performance problems can sometimes be unintuitive, and our initial guesses may often be incorrect. Engaging in quick explorations to discover high leverage opportunities can be incredibly useful. You can become more adept at determining where to focus your optimization efforts for maximum impact.
00:18:55.600 In response to the second question, I don't have a definitive answer. Personally, I've never encountered performance bottlenecks at the database layer; database operations typically take microseconds, and for me the bulk of performance issues stem from the view layer. But since most people send their queries over the network to a separate database server, maybe that's where the problem lies for them.
00:19:57.760 You mentioned earlier about not optimizing for 1 millisecond or 100 milliseconds. Let’s change perspectives: What about situations where a page is timing out for certain customers—especially high-value Enterprise clients? What would you recommend for triaging that situation?
00:20:21.440 That's an intersection of art and science, where you'll need to have appropriate conversations with your team, including those from leadership. Generally, having a monitoring strategy is crucial. Using tools to help identify where users are experiencing issues can significantly inform your approach.
00:21:17.120 Understanding your users is paramount. Depending on how you operate, 100 milliseconds might be acceptable for one segment of your audience, while for another group, it’s not.
00:22:30.480 Thus, comprehensively analyzing various aspects is critical; sometimes, optimizations can be more efficiently implemented at a layer above where the performance issue lies, such as the front-end or the API, by requesting less data. Effective negotiation with front-end clients plays a significant role in managing expectations around loading those queries.
00:23:26.000 Having appropriate performance optimization and monitoring as part of your routine throughout development can also help reduce setbacks over time. It's crucial to identify low-hanging fruit first: those easy wins that can lead to significant impact with relatively little effort.
00:24:15.440 As for tools for performance monitoring, consider hosted observability services such as Honeycomb, which integrates with AWS and can track performance across regions, giving insight into how users around the world experience the application.
00:25:12.480 I also wanted to mention the importance of logging performance statistics. Monitoring performance as you develop can help you identify issues before they become problematic.
00:25:50.840 To add on, we're currently using OpenTelemetry to log performance metrics. The adaptability of this tool allows you to shift your metrics platform without major code changes. In the past, I’ve used tools like Datadog and New Relic, but of course, it's important to be aware of pricing models before using them.
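A minimal sketch of wiring up the OpenTelemetry Ruby SDK in a Rails app, assuming the opentelemetry-sdk and opentelemetry-instrumentation-all gems are in the Gemfile (the service name is illustrative):

```ruby
# config/initializers/opentelemetry.rb
require "opentelemetry/sdk"
require "opentelemetry/instrumentation/all"

OpenTelemetry::SDK.configure do |c|
  c.service_name = "my-rails-app"  # illustrative
  c.use_all                        # auto-instrument Rails, Active Record, net/http, ...
end
```

The exporter is chosen via configuration (for example the OTLP exporter), which is what typically makes it possible to switch backends such as Datadog or New Relic without touching the instrumented code.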
00:26:54.320 Moreover, when optimizing a crucial GraphQL-based API, we included custom telemetry for each request to facilitate pinpointing slow requests. By logging various performance stats, we could reproduce slow requests locally to apply necessary optimizations and conduct acceptance tests swiftly.
00:27:25.680 Prometheus and Grafana can also be very useful when used together, and the teams responsible for running the infrastructure are integral to making that work.
00:28:09.480 For smaller apps, or when you're just starting, the instrumentation built into Rails is incredibly useful. The many Active Support notifications can be leveraged effectively, and they remain valuable in larger applications once monitoring becomes crucial. Even if you decide to use PostgreSQL as your main database, something as simple as SQLite for storing analytics can serve a purpose and help you track performance improvements.
00:29:13.000 When it comes to larger applications, you can rely on Sidekiq, which offers comprehensive insights regarding background processing while keeping things simple. Additionally, the notifications from Active Support can be extremely helpful in identifying where performance is bottlenecking.
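A minimal sketch of hooking into one of those notifications, here logging slow SQL statements (the 100 ms threshold is illustrative):

```ruby
# config/initializers/slow_query_logging.rb
ActiveSupport::Notifications.subscribe("sql.active_record") do |_name, start, finish, _id, payload|
  duration_ms = (finish - start) * 1000
  if duration_ms > 100
    Rails.logger.warn("Slow query (#{duration_ms.round(1)} ms): #{payload[:sql]}")
  end
end
```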
00:30:08.960 Do any other tools come to mind when addressing performance issues surrounding CPU, memory, and so forth?
00:30:27.680 I may not have direct experience with all the tools available for front-end performance optimization, but keeping an eye on bundle sizes as you add dependencies via npm can save a lot of headaches down the road.
00:31:28.000 To address performance on the back end, I suggest exploring Ruby profilers. For example, tools like Pegasus and Arpr yield insightful metrics and can vastly streamline optimization efforts.
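As one example of this kind of tool, a minimal sketch using stackprof, a widely used sampling profiler for Ruby (not necessarily one of the tools named above); the profiled code path is illustrative:

```ruby
require "stackprof"

StackProf.run(mode: :wall, out: "tmp/report_profile.dump") do
  ReportGenerator.new(account).call  # the code path under investigation (illustrative)
end

# Inspect the hotspots with: bundle exec stackprof tmp/report_profile.dump --text
```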
00:32:04.000 Now, let us turn our attention to heuristics for determining when to cease performance optimizations. As we’ve discussed, cost plays an essential role in our decisions around optimization efforts.
00:32:51.760 For instance, you could have a clear understanding of performance budgets defined via conversations with your team and the business side. It’s important to negotiate early and define agreements or service level agreements on performance. Once those rules or expectations are set, it’s easier to establish whether you’re operating within those targets.
00:33:46.040 So if it becomes frustratingly difficult to meet those budget constraints, you have clear cutoff points for going back and adjusting expectations.
00:34:49.920 Even if performance budgets aren't a reality quite yet in your organization, it’s essential to call your shots. Declare how much time you’ll allocate to optimizing a particular page and stick to that time box for the sake of efficient progress.
00:35:23.720 It’s reasonable to establish goals for optimization that fit logically into your working weeks, therefore making incremental improvements across sprints.
00:36:18.840 Again, I completely agree with the idea of performance budgets. However, if an organization isn't working with them, it gets tricky. With one poorly performing page, you address it according to the urgency the business attaches to it. Transparency is key here, and while you might be in the middle of the chaos of seeking a solution, those initial conversations ultimately cultivate trust in your decision-making.
00:37:14.640 Diving deeper into specific incidents, what's the most complex performance problem you've ever encountered?
00:38:02.760 One notable performance pitfall occurred when I sanitized too aggressively, which ended up causing us to fetch too much data from our external services. The performance hit was immediate and very serious.
00:38:53.720 We were optimizing a GraphQL API and encountered a security issue due to parameters not being properly sanitized. I hastily introduced sanitization in both branches and deployed it. We were rolling out the GraphQL service to 10% of our traffic, but it exploded when it went to production. The infrastructure team had to revert my update.
00:39:45.320 It wasn’t the intricacy of the operation that was problematic; sanitizing parameters improperly led to fetching too much data from external sources. The resulting memory usage quickly spiraled out of control.
00:40:33.680 There was another case during the 2022 elections in Brazil, where our API connected to a WhatsApp channel needed to handle thousands of messages simultaneously. While our Rails API managed requests well, we didn't account for the machine learning service being used, which significantly hindered performance under heavy load.
00:41:20.760 The challenge was that we made requests to the machine learning service synchronously. When we hit scale, database contention issues emerged even though there was no apparent activity in the database, and the connection pool filled up.
00:42:07.560 Essentially, we had an open transaction due to waiting on that external Python service to respond, leaving other requests unable to get connections.
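A minimal sketch of that failure mode, with hypothetical model and client names: the external call sits inside the transaction, so the checked-out connection is held for the entire round trip and the pool drains under load.

```ruby
# Anti-pattern: the slow external call happens inside the transaction.
Message.transaction do
  message = Message.create!(body: params[:body])
  category = MlClient.classify(message.body)  # slow HTTP call to the Python service
  message.update!(category: category)
end

# Safer shape: keep the transaction (and the connection checkout) short.
message = Message.create!(body: params[:body])
category = MlClient.classify(message.body)
message.update!(category: category)
```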
00:42:54.480 That debugging session was quite intriguing because it highlighted how important it is to keep connections and transactions short, even when the slow work happens in a separate, independent service.
00:43:40.640 Now, as for handling larger datasets with various filters in place, it ultimately boils down to the volume of data you're working with.
00:43:58.500 In many situations, proper indexing is essential. Anyone entering the realm of big data should make sure they are employing effective, scalable solutions, especially when using both Postgres and Elasticsearch.
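A minimal sketch of the indexing part, with an illustrative table and filter columns; a composite index that matches the most common filter-and-sort combination often goes a long way before heavier tooling is needed:

```ruby
class AddFilterIndexToOrders < ActiveRecord::Migration[7.1]
  def change
    # Supports WHERE account_id = ? ORDER BY created_at DESC efficiently.
    add_index :orders, [:account_id, :created_at]
  end
end
```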
00:44:24.880 Having these conversations about data and how best to optimize what you’re using creates clarity. Too many times, people lean on advanced technology when simpler solutions would yield just as effective results.
00:45:01.040 To wrap up, while it's essential to consider advanced searching techniques for large datasets, naive approaches can yield significant performance bottlenecks. Ultimately, ongoing communication and learning are key as we continue to evolve in the industry.
00:45:39.440 This includes understanding expected performance outcomes and establishing a structured approach to the technologies we implement to ensure they are scalable and efficient.
00:46:28.560 Thank you, and let’s give a round of applause to our panelists for this engaging discussion.