Optimizing performance in Rails apps with GraphQL layer

00:00:09.599 You may wonder why I came this far, but actually, this time I'm very happy. A bunch of other appointments in Europe aligned with the conference here. I was scheduled to speak at this conference back in 2020, but it got canceled due to the pandemic. I’m delighted to finally be here, four years after the original schedule.

00:00:34.680 I'm from Brazil and have been working with Rails since 2018, so it’s been a long time. A lot of examples here will be based on an application that features posts and users. This is how we learned Rails at that time, following those "How to Build Your Blog in 20 Minutes" tutorials.

00:00:45.399 I've been working at Meedan since 2018 as a software engineer and was responsible for migrating Meedan's main products from PHP to Rails, which I’m glad to have accomplished. The main product of Meedan is Check, so a lot of what I will present here is based on lessons we learned over time during this full rewrite of a monolithic PHP application.

00:01:11.400 Meedan is a nonprofit based in San Francisco that develops tools for collaborative fact-checking. We moved to a microservices architecture, and the main core service is a Rails API, which is headless. When we did this rewrite back in 2016, GraphQL was a new concept; it had just been open-sourced by Facebook.

00:01:39.960 We decided to experiment with it, but over the years, we learned about the issues that can arise from this choice, like with any other decisions made in a product. I plan to share some screenshots from the monitoring tools we use, and maybe even glimpses of the codebase. To make it easier to understand, many of my examples will focus on generic database schemas.

00:02:17.280 The focus of this talk will be on Rails and GraphQL. However, many of the concepts and architectures I will discuss are applicable to other frameworks and technology stacks too. Just to gauge the audience, is there anyone here who currently has any production applications using GraphQL? Oh, wow! Great! Please bear with me, and I’ll also be happy to hear about your strategies on what has worked for you and what hasn’t.

00:03:05.280 This journey began in 2016 when GraphQL was new and started being adopted in larger systems, including Rails applications like GitLab and GitHub, as well as at Facebook itself. This is fantastic because the community began creating tools to help with that. Many of the tools I’ll share are widely available and open to use.

00:03:46.360 A few disclaimers: there are no silver bullets. No specific tool or solution works for everyone in every context or application. Everything needs to be considered in context, taking into account your application's level of concurrency, the specific database schema, and the data and demand. Premature optimization is a well-known pitfall that we hear about from the moment we start in computer science, so it’s crucial to approach optimization carefully.

00:04:24.560 When you know what to anticipate in your project, you can try to make certain decisions from day one. However, we know this isn’t always feasible, and there will be many surprises when things are put into production. In our case, this application has been running for eight years now. We've migrated from Rails 4 to Rails 5 to Rails 6, and now to Rails 7. It’s been a fun ride, and we have had to adapt over time.

00:05:23.720 The problems and solutions will vary based on different factors, such as the database used and the specific characteristics of the application. We learned that once you're in production and in a position to look at and analyze real data, problems arise that you did not anticipate, and each may require its own solution. As always, every solution comes with trade-offs.

00:06:44.200 Sometimes, introducing a new dependency into your codebase can require additional learning for your team and make the code more difficult to understand. All of these factors must be considered. I'm speaking not only as someone from a company that has had a product running for a long time, but also from the perspective of open-source software, which brings significant implications in terms of security and understanding how people will actually use your code, especially when working with different developers.

00:07:20.000 Before diving deeper, let’s do a basic introduction to GraphQL for those who may not be familiar. It is a query language used to describe data using a type system. It has bindings for different programming languages, so it's not a programming language itself but rather a means of accessing data for APIs. Sometimes it’s easier to understand GraphQL when we compare it to REST. Imagine a simple application where we have media, which could be a video. This is a simplified extract from our own Check application.

00:08:48.520 Fact-checking organizations worldwide are able to track and verify media that contains misinformation circulating the internet. A media item can have different comments from fact-checkers, and these comments are made by users and can include various tags. In REST, you would typically create different endpoints for your frontend or for your API clients, following a specific convention for resources and actions.

00:09:52.920 You would retrieve an instance of your object, and then from there, access tags, comments, and other information. One of the challenges we encountered was needing more specific data access when you have multiple teams working on the same product. If frontend engineers say, 'Hey, I need an endpoint for this,' it can spiral out of control quickly if the API isn’t carefully designed, leading to a proliferation of specific endpoints.

00:11:03.360 In practice, due to pressures like deadlines and team requirements, projects can end up with too many endpoints, which results in frontends that make excessive requests, often fetching more data than necessary. GraphQL addresses this with a different approach. One key difference in GraphQL compared to REST is that there is only one endpoint for all data manipulations instead of multiple GET requests for different data.

00:12:04.440 That single POST GraphQL endpoint receives a query parameter, letting you send larger queries without hitting limits typical for GET requests. GraphQL makes it natural to express the data needed for a given interface by asking only for that specific data. This request structure is beneficial, as you can ask for just the fields you need, avoiding unnecessary retrieval of extensive data.

00:13:30.359 For instance, if I only want one media item with its title, I can ask explicitly for the title without needing to load a full object with potentially expensive fields. This leads to a clear relationship between the data requested and the data displayed on the interface. Also, the response you get back resembles the query you sent, using JSON format.
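
To make the "response mirrors the query" idea concrete, here is a minimal plain-Ruby sketch. It does not use the graphql gem: the `MEDIA` record, the `resolve` helper, and the nested-hash representation of a selection are all assumptions made for illustration, standing in for what a real GraphQL server does when it resolves only the requested fields.

```ruby
require "json"

# Hypothetical in-memory record standing in for a media item.
MEDIA = {
  "id" => 1,
  "title" => "Viral video about a flood",
  "description" => "A long and expensive-to-compute text",
  "comments" => [{ "id" => 10, "text" => "Needs a source" }]
}.freeze

# Resolve only the requested fields. `selection` maps a field name to
# either true (scalar field) or a nested selection (object or list field).
def resolve(object, selection)
  selection.each_with_object({}) do |(field, sub), result|
    value = object.fetch(field)
    result[field] =
      case value
      when Array then value.map { |item| resolve(item, sub) }
      when Hash  then resolve(value, sub)
      else value
      end
  end
end

# Roughly the shape of the GraphQL query: { media { title comments { text } } }
selection = { "title" => true, "comments" => { "text" => true } }
response = { "data" => { "media" => resolve(MEDIA, selection) } }
puts JSON.generate(response)
```

The printed JSON contains only `title` and the `text` of each comment, in the same shape as the selection; `description` is never touched.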

00:14:47.760 So as an API client, you only receive the fields you requested, no surprises there. Now, regarding Rails, this is how it works: types are defined using a descriptive language, where you can use basic types like integer and string, or define your own custom types. Early versions of the graphql gem required more complex definitions, but later releases made it easier to define types using classes.

00:15:56.440 You can also define arguments as parameters to different fields, which allows for various return values depending on the arguments provided. Connections return collections of objects, often represented as Active Record relations, allowing pagination and sorting. Thus, it is quite straightforward to declare GraphQL types and mutations in Rails using a declarative language. Queries handle read operations, while mutations manage create, update, and delete operations.

00:17:47.600 This structure lets you define the operation, the input data for that change, and the corresponding GraphQL query with the desired output. As a result, the API client only receives the feedback they explicitly requested. However, this flexibility can sometimes become a problem. While it allows frontend developers to request exactly what they need, it can create unpredictability.

00:18:39.760 For example, nested queries can become problematic as they increase in depth. Although it may initially seem effective to ask for all the necessary data in a single query, this often leads to performance bottlenecks, because a highly expressive language allows for complex, deeply nested queries.

00:19:49.280 To elaborate further, it is not just about the number of fields requested, but also about the cost associated with fetching each field. In monitoring tools, you might observe a log entry stating that a request took a significant amount of time, and a frontend developer might argue, 'But I only asked for these ten lines.' This disconnect can be problematic if there isn't proper control and measurement regarding the queries executed.

00:20:58.440 With that introduction out of the way, the key point is that you can’t fix what you can’t test. Detecting performance regressions only after deploying to production can be stressful, so we implemented tests that detect performance regressions in our code. An important way to achieve this is to use test helpers that assert the number of database queries a specific GraphQL query generates.

00:22:03.640 It's essential to track and limit the number of database queries triggered by a GraphQL query. In our tests, we wanted to ensure that regardless of refactoring or new features added, we maintain the expected limits on database activity. This helps detect potential regressions during CI processes rather than letting them manifest in production.

00:23:20.640 To do this, we enable the query cache, since it isn't useful to count cached queries, which don't hit the database. During a code block, we count how many database queries occur and check that number against our expectation. The helper also takes a comparison operator, so in some tests we only need to ensure that a maximum count is not exceeded.
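
A query-counting helper of this kind can be sketched as follows. This is not the speaker's actual helper: `FakeConnection`, `count_queries`, and `assert_queries` are hypothetical names, and the fake connection stands in for a real database so the example runs without Rails. In a Rails test suite, the same idea would typically subscribe to the `sql.active_record` notification and skip events whose payload is marked as cached.

```ruby
# Minimal stand-in for a database connection with a query cache. It notifies
# subscribers about every query and flags the ones served from the cache.
class FakeConnection
  def initialize
    @subscribers = []
    @cache = {}
  end

  def subscribe(&block)
    @subscribers << block
    block
  end

  def unsubscribe(subscriber)
    @subscribers.delete(subscriber)
  end

  def select(sql)
    cached = @cache.key?(sql)
    @cache[sql] ||= "rows for #{sql}"
    @subscribers.each { |subscriber| subscriber.call(sql: sql, cached: cached) }
    @cache[sql]
  end
end

# Count the non-cached queries executed while the block runs.
def count_queries(connection)
  count = 0
  subscriber = connection.subscribe { |event| count += 1 unless event[:cached] }
  yield
  count
ensure
  connection.unsubscribe(subscriber)
end

# Compare the actual count with the expectation using the given operator,
# e.g. :== for an exact match or :<= for an upper bound.
def assert_queries(connection, expected, operator = :==, &block)
  actual = count_queries(connection, &block)
  return actual if actual.public_send(operator, expected)

  raise "expected query count #{operator} #{expected}, got #{actual}"
end

db = FakeConnection.new
assert_queries(db, 2, :<=) do
  db.select("SELECT * FROM posts")
  db.select("SELECT * FROM users WHERE id IN (1, 2)")
  db.select("SELECT * FROM posts") # served from the cache, not counted
end
```

If a refactoring later makes the block issue a third distinct query, the assertion raises in CI instead of the regression surfacing in production.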

00:24:36.160 While the tests may appear limited—focusing primarily on crucial code components—they provide a baseline. Still, it’s vital to measure and monitor the actual impact of these modifications post-deployment. In conversations around performance, it's crucial to gather metrics that help identify the bottlenecks present.

00:25:11.920 Standard monitoring typically involves tracking HTTP request times. Regular monitoring tools may suffice when dealing with endpoints, but due to GraphQL’s single endpoint nature, it only provides the overall request time. We integrated our logs with monitoring solutions like uptime monitoring to trigger incidents if requests exceed a specific time threshold; though helpful, this is not sufficient.

00:26:47.760 To get detailed data regarding GraphQL query performance, we needed to delve deeper. Understanding the nuances behind each query executed is vital, as each request goes through several underlying layers, such as authentication, load balancing, and validation. We have successfully implemented Honeycomb for our monitoring, which uses OpenTelemetry.

00:27:58.240 This integration has allowed us to gather detailed information on the timing of different steps in GraphQL query execution. Through this, we can understand how long each step, including the database queries, takes. This is particularly useful for APIs with connections to specific frontend applications, where the users and their queries aren’t fully predictable.

00:29:00.440 Tracking granular performance in GraphQL is essential since a single query can involve multiple fields. If one field particularly impacts the execution time, we need to identify it. This approach may involve trial and error as you assess the impact of removing certain fields from queries to determine their individual execution time.
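
The idea of attributing execution time to individual fields can be sketched in plain Ruby. The `resolve_with_timings` and `slowest_field` helpers and the map of field resolvers are assumptions for illustration only; graphql-ruby has its own tracing hooks, which this sketch does not use.

```ruby
# Time each top-level field resolver separately. `resolvers` is a
# hypothetical map of field name => callable that computes the field.
def resolve_with_timings(resolvers)
  data = {}
  timings = {}
  resolvers.each do |field, resolver|
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    data[field] = resolver.call
    timings[field] = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
  [data, timings]
end

# Name of the field that took the longest to resolve.
def slowest_field(timings)
  slowest = timings.max_by { |_field, seconds| seconds }
  slowest && slowest.first
end

resolvers = {
  title: -> { "Viral video" },
  comments_count: -> { sleep 0.05; 42 } # simulates an expensive field
}
data, timings = resolve_with_timings(resolvers)
puts data.inspect
puts "slowest field: #{slowest_field(timings)}"
```

With per-field timings in hand, there is no need to remove fields from a query one at a time to find the expensive one.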

00:30:03.360 We've utilized tools like Apollo GraphQL that not only offer excellent insights into query performance but also visual representations of your schema. This provides a way to maintain documentation and trace changes to the schema. The tool also includes monitoring features which can inform on which specific fields take longer, helping identify opportunities for improvements based on execution sequences.

00:31:24.600 These monitoring capabilities extend to our daily summaries, which track performance metrics and query durations. Several tools exist that can provide these insights, and many solutions are available for Rails applications specifically, making it unnecessary to rely solely on external solutions.

00:32:37.920 Returning to query performance, one of the classic problems with GraphQL is the N+1 query issue—where the number of database queries can soar based on the volume of loaded data. For example, if we want to retrieve posts along with the authors’ names, the typical query execution results in fetching the posts first and then triggering an individual query for each author.

00:33:58.440 This results in an excessive number of queries if you request multiple posts. In REST, you can anticipate these needs and preload data more effectively, but GraphQL’s flexibility means it’s not always feasible to preload everything, as it contradicts one of the key advantages of GraphQL—making precise data requests.

00:35:10.000 A potential resolution is to use batching strategies. When many fields in one query need related records, such as the author of each post, tools like the graphql-batch gem can collect those lookups and combine them. You then execute a single query for all the user objects instead of one per post, which reduces the number of database accesses.
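
The difference between the N+1 pattern and a batched lookup can be shown with a small plain-Ruby sketch. `FakeDB` and both resolver functions are hypothetical and exist only to count queries; a loader gem such as graphql-batch collects the keys automatically during query execution, whereas this sketch collects them by hand.

```ruby
# In-memory stand-in for a database that logs every query it receives.
class FakeDB
  USERS = { 1 => "Ana", 2 => "Bruno", 3 => "Carla" }.freeze
  POSTS = [
    { id: 1, title: "First", user_id: 1 },
    { id: 2, title: "Second", user_id: 2 },
    { id: 3, title: "Third", user_id: 1 }
  ].freeze

  attr_reader :log

  def initialize
    @log = []
  end

  def posts
    @log << "SELECT * FROM posts"
    POSTS
  end

  def find_user(id)
    @log << "SELECT * FROM users WHERE id = #{id}"
    USERS.fetch(id)
  end

  def find_users(ids)
    @log << "SELECT * FROM users WHERE id IN (#{ids.join(', ')})"
    USERS.slice(*ids)
  end
end

# Naive resolution: one query for the posts plus one per author (N+1).
def posts_with_authors_naive(db)
  db.posts.map do |post|
    { title: post[:title], author: db.find_user(post[:user_id]) }
  end
end

# Batched resolution: collect the author ids first, then load them at once.
def posts_with_authors_batched(db)
  posts = db.posts
  users = db.find_users(posts.map { |post| post[:user_id] }.uniq)
  posts.map { |post| { title: post[:title], author: users.fetch(post[:user_id]) } }
end

naive_db = FakeDB.new
posts_with_authors_naive(naive_db)
batched_db = FakeDB.new
posts_with_authors_batched(batched_db)
puts "naive: #{naive_db.log.size} queries, batched: #{batched_db.log.size} queries"
```

The naive version grows by one query per post, while the batched version stays at two queries regardless of how many posts are returned.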

00:36:23.920 However, handling 'has many' relationships can be trickier. For cases involving tags, for instance, you can use the graphql-preload gem to preload collections. This method works well if queries are structured flatly, but as nesting increases, complications can arise.

00:37:30.480 Identifying predictable requests can help preload the necessary associations. While look-ahead functionality provided by GraphQL can optimize specific data needs, it introduces added complexity, especially when accommodating a diverse range of queries. Assessing the potential performance impact against the complexity this introduces into your code is crucial.
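
A look-ahead style decision can be sketched as a function from the requested fields to the associations worth preloading. The `PRELOADABLE` mapping and the nested-hash selection format are assumptions for illustration. In graphql-ruby, the equivalent information comes from a lookahead object passed to the resolver, and in a Rails app the resulting list would typically be handed to something like `includes`.

```ruby
# Hypothetical mapping from a GraphQL field name to the association it needs.
PRELOADABLE = {
  "author" => :user,
  "tags" => :tags,
  "comments" => :comments
}.freeze

# `selection` is a nested hash of requested fields, with {} for leaf fields,
# e.g. { "title" => {}, "author" => { "name" => {} } }.
def associations_to_preload(selection)
  selection.keys.map { |field| PRELOADABLE[field] }.compact
end

selection = { "title" => {}, "author" => { "name" => {} }, "tags" => { "label" => {} } }
puts associations_to_preload(selection).inspect
```

Only the associations that the query actually selects are preloaded, so a query asking just for titles pays nothing extra. The cost is the mapping itself, which has to be kept in sync with the schema.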

00:38:54.480 It would be preferable to mitigate complex queries entirely, especially for open-source APIs free for public use. The risk of deeply nested requests, including circular dependencies where the same data can be requested from varying levels, can introduce performance risks.

00:40:02.160 A simple solution involves setting a maximum depth limit for queries. By applying a maximum depth restriction, you can throw exceptions for excessive nesting, imposing safeguards without thorough manual intervention. Additionally, timeout controls for query execution can also help manage performance.
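
Both safeguards can be sketched without any gem. The selection format (nested hashes with `{}` for leaf fields), `QueryTooDeepError`, and `execute_guarded` are hypothetical names for illustration. graphql-ruby offers schema-level settings for maximum depth and execution timeouts, which would normally be used instead of hand-written checks like these.

```ruby
require "timeout"

class QueryTooDeepError < StandardError; end

# Depth of a selection given as nested hashes, with {} for leaf fields.
def depth_of(selection)
  return 0 if selection.empty?

  1 + selection.values.map { |sub| depth_of(sub) }.max
end

# Reject queries that are too deep before running them, and abort the
# execution block if it takes longer than `max_seconds`.
def execute_guarded(selection, max_depth:, max_seconds:)
  depth = depth_of(selection)
  raise QueryTooDeepError, "depth #{depth} exceeds #{max_depth}" if depth > max_depth

  Timeout.timeout(max_seconds) { yield }
end

# posts -> author -> posts -> title is a circular path four levels deep.
circular = { "posts" => { "author" => { "posts" => { "title" => {} } } } }
begin
  execute_guarded(circular, max_depth: 3, max_seconds: 1) { :never_runs }
rescue QueryTooDeepError => e
  puts "rejected: #{e.message}"
end
```

The depth check is cheap because it runs before any resolver or database call, while the timeout acts as a backstop for queries that are shallow but still expensive.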

00:41:16.800 While these strategies help regulate query execution time, focusing on performance at the field level is vital. The computational cost of fields can vary widely, so implementing caching strategies at the GraphQL API layer using tools like the graphql-ruby-fragment_cache gem can lead to notable performance gains.

00:42:25.920 By declaring specific cache fragments for expensive-to-compute fields, we can alleviate server load for frequently requested data. However, caching introduces a new set of challenges around invalidation, requiring precise control over events that should trigger cache invalidation.
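
Field-level caching with event-driven invalidation can be sketched as follows. `FieldCache` is a hypothetical in-memory class written for this example. It is not the API of the fragment cache gem, and a production setup would use a shared store rather than a Ruby hash.

```ruby
# Cache for expensive field values, keyed by object type, id, and field name.
class FieldCache
  attr_reader :computations

  def initialize
    @store = {}
    @computations = 0
  end

  # Return the cached value, or compute it with the block and store it.
  def fetch(type, id, field)
    key = [type, id, field]
    return @store[key] if @store.key?(key)

    @computations += 1
    @store[key] = yield
  end

  # Called from the events that make the cached value stale.
  def invalidate(type, id, field)
    @store.delete([type, id, field])
  end
end

cache = FieldCache.new
comments = ["Needs a source", "Verified"]
comments_count = -> { cache.fetch(:media, 1, :comments_count) { comments.size } }

comments_count.call                          # computed
comments_count.call                          # served from the cache
comments << "Updated verdict"                # a "comment created" event...
cache.invalidate(:media, 1, :comments_count) # ...triggers invalidation
puts comments_count.call                     # recomputed with the new comment
puts "computed #{cache.computations} times"
```

Forgetting the `invalidate` call is exactly the failure mode to watch for: the field would keep returning the stale count until something else evicts it.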

00:43:37.760 This approach can simplify our overall management by consolidating cache control logic in one location. Improvements in managing cached values are especially crucial in large codebases with growing complexity. Everything covered so far concerns the handling of a single GraphQL query sent to the application.

00:44:48.640 When running in production under concurrent loads, however, things may differ and pose additional challenges concerning scalability. Many simultaneous queries could be sent from a single API client or across multiple clients, so it's imperative to consider ways to manage concurrent query handling.

00:46:06.000 Implementing query batching can reduce HTTP overhead and improve application responsiveness. Apollo Client, for example, allows you to send several queries together in one request. On the server side, the graphql-ruby gem supports processing a batch of queries concurrently through its multiplex feature.
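
On the server, accepting a batch mostly means handling a JSON array of operations in a single request and returning the results in the same order. The sketch below shows only that request shape. `handle_graphql_request` and `execute_operation` are hypothetical, and the executor just echoes the query instead of running it against a schema, which is where a real application would call into the GraphQL library.

```ruby
require "json"

# Hypothetical executor for a single operation. A real one would run the
# query against the schema; this one only echoes the query string.
def execute_operation(operation)
  { "data" => { "echo" => operation.fetch("query") } }
end

# Accept either one operation (a JSON object) or a batch (a JSON array) and
# return the results in the same order the operations were sent.
def handle_graphql_request(body)
  payload = JSON.parse(body)
  return execute_operation(payload) unless payload.is_a?(Array)

  payload.map { |operation| execute_operation(operation) }
end

batch = JSON.generate([
  { "query" => "{ me { name } }" },
  { "query" => "{ posts { title } }" }
])
puts JSON.generate(handle_graphql_request(batch))
```

One HTTP round trip now carries both operations, but the response is only sent once every operation in the batch has finished.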

00:47:10.720 This approach offers distinct advantages regarding performance, but it introduces a challenge: one long-running query in a batch delays the response until all queries are resolved. Good metrics and monitoring help here, as they let us observe the execution behavior of these batched queries.

00:48:01.920 It's critical to maintain a balance among these execution times in your architecture. While methodologies I have shared predominantly focus on read operations and GraphQL queries, the same considerations can apply to mutations.

00:49:29.080 We typically follow strategies that apply to any REST or GraphQL applications, such as offloading some workloads to background jobs, particularly for bulk operations.
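
Offloading a bulk mutation can be sketched with a queue and a worker thread. `BulkJobQueue` is a hypothetical class for illustration only. A Rails application would normally rely on its background job framework rather than a hand-rolled thread.

```ruby
require "securerandom"

# Tiny in-process job queue: the caller gets a job id back immediately and
# the work happens on a separate worker thread.
class BulkJobQueue
  def initialize
    @jobs = Queue.new
    @results = {}
    @worker = Thread.new do
      # A nil job is the shutdown signal.
      while (job = @jobs.pop)
        @results[job[:id]] = job[:ids].map { |id| job[:work].call(id) }
      end
    end
  end

  # Enqueue work for many ids and return right away with a job id.
  def enqueue(ids, &work)
    job_id = SecureRandom.uuid
    @jobs << { id: job_id, ids: ids, work: work }
    job_id
  end

  # Result of a finished job, or nil if it has not finished yet.
  def result(job_id)
    @results[job_id]
  end

  # Process the remaining jobs and stop the worker.
  def shutdown
    @jobs << nil
    @worker.join
  end
end

queue = BulkJobQueue.new
job_id = queue.enqueue([1, 2, 3]) { |id| "item #{id} archived" }
puts "mutation responds immediately with job #{job_id}"
queue.shutdown
puts queue.result(job_id).inspect
```

The mutation's response stays fast because it only reports that the job was accepted. The client then needs some way to learn the outcome, for example by querying the job status later.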

00:50:20.000 In conclusion, with great systems comes great responsibility. Performance optimization should be a priority, ensuring scalability is achieved by consistently monitoring metrics at various levels. Tracking performance bottlenecks will not only direct you to the right problems but will also help you implement solutions sooner, as slow responses can hinder user adoption.

00:51:20.000 Moreover, keep in mind that optimization is an ongoing process, and it's essential to cultivate a performance-aware culture within your engineering team. Monitoring and tracking bottlenecks systematically will ensure timely remediation as your application grows.

00:51:56.360 Thank you for your attention, and please feel free to reach out with your questions or thoughts.

00:54:24.240 I have not evaluated all cases, but in our scenario, using GraphQL Batch resolved performance issues effectively. Denormalization can work, but it often depends on context; we found GraphQL Batch suitable. Our API structure must be flexible, accommodating numerous clients and their requirements.