Building High-Performance GraphQL APIs

by Dmitry Tsepelev

Summary of 'Building High-Performance GraphQL APIs'

In this presentation, Dmitry Tsepelev discusses improving the performance of GraphQL APIs specifically aimed at an audience already using GraphQL in production environments. He outlines common issues that may lead to slow responses and shares methods for diagnosing and enhancing performance.

Key Points:

Understanding Query Execution:
- GraphQL queries are string-based and go through several stages of processing when handled by a server, including parsing, validating, and executing queries.
Measuring Performance:
- Performance measurement is essential before optimization. Tools like New Relic can track query performance, but because every GraphQL request hits the same endpoint, additional setup is needed, such as the 'tracer' feature in GraphQL Ruby, to monitor slow queries effectively.
Identifying Bottlenecks:
- Common performance issues include parsing efficiency and response building speed. Smaller and simpler queries are preferable to facilitate quick parsing, while excessive response sizes can lead to slowdowns.
Dealing with N+1 Problem:
- This problem arises in scenarios where associations aren’t eagerly loaded, leading to multiple queries for related data. Solutions include eager loading associations using methods like ‘includes’ and employing lazy preloading techniques.
Caching Strategies:
- Advanced methods such as HTTP caching, ETags, and Apollo Persisted Queries can enhance performance. For instance, switching from POST to GET for certain queries allows for traditional caching methods to be applied effectively.
Practical Examples:
- Tsepelev provides examples, like discovering that slow price conversion queries were improved simply by reducing unnecessary data in Redis. He emphasizes the importance of client headers for maintaining performance insights in evolving applications.
Continuous Improvement:
- Regularly updating to the latest versions of GraphQL libraries is crucial as updates typically include performance enhancements.

Conclusions and Takeaways:

Monitoring Setup is Critical: Performance monitoring must be configured initially to effectively address potential issues.
Simplifying Queries: Reducing the complexity and size of queries can yield substantial improvements.
Addressing N+1 Issues: Implementing effective loading strategies for associations is key to avoiding performance pitfalls.
Utilizing Caching: Implementing caching solutions tailored for GraphQL can lead to significant performance gains.
Dmitry concludes by encouraging continuous evaluation and adaptation of strategies to keep GraphQL APIs performing optimally.

00:00:00.160 Our next speaker is Dmitry Tsepelev. He is going to talk about building high-performance GraphQL APIs. Dmitry has a background in engineering at Evil Martians, where he works on open-source projects. Today, he will share insights into improving the performance of GraphQL APIs.

00:00:12.000 One day, you decided to give GraphQL a try and implemented an API for your new application using GraphQL. You deployed the app to production, but after a while, you realized that the responses were not super fast. How can you find out what is causing your GraphQL endpoint to be slow?

00:00:35.120 In this talk, we will discuss how queries are executed and what makes processing slower. We will learn how to measure performance in the GraphQL ecosystem and identify the aspects we should improve. Finally, we’ll explore possible solutions and some advanced techniques to keep your GraphQL app fast.

00:01:19.119 You can send your questions to Dmitry via the stream chat, and I will be picking them up from there. Dmitry, the floor is yours.

00:01:37.280 Okay, I hope you can hear me now and see my screen. Hello everyone and welcome to my talk called 'Building High-Performance GraphQL APIs.' My name is Dmitry, or Dima if that's easier.

00:01:43.119 I live in a small Russian town called Vladimir and I am a backend engineer at Evil Martians. I work on open-source projects, give conference talks, and sometimes write articles. Today, I plan to share some of my articles with you.

00:02:07.840 We use GraphQL extensively in our production applications, so this talk is specifically for people who are already using GraphQL in production. It's not an introductory talk, so we won't discuss the pros and cons. However, if you need details about the technology itself, I suggest looking at my previous talk titled 'Thinking Graphs,' which I gave two years ago.

00:02:30.080 You can also check one of Evil Martians' articles about writing GraphQL applications in Rails. There are three articles that guide you in building a GraphQL application from scratch, including a client for it. This can be a good starting point.

00:02:53.840 Now, let's start talking about performance. I want to begin with the query path that is executed when a GraphQL application processes a request. It's essential to know the points where things can go wrong.

00:03:10.720 In GraphQL, a query is just a string, which we can refer to as a query string. This string typically goes to the '/graphql' endpoint via a POST request. The request is sent to the GraphQL controller, where it must be passed to the GraphQL schema along with the arguments and other necessary variables.

00:03:56.160 Next, we need to prepare the abstract syntax tree or AST. We send the query string to the query parser method. Once we have the abstract syntax tree, we must validate it to ensure that it has valid syntax and that all required variables are present. Afterward, we execute the query. As you can see, the query is structured like a tree, which we traverse recursively to execute each field.

00:04:42.560 This execution stage is called 'execute field,' which continues until we execute all fields. Sometimes, we also need to access data sources, like a database. Occasionally, we might not want to execute a field immediately; in such cases, we can mark it as 'lazy,' postponing its execution until all non-lazy fields are processed.

00:05:01.120 After finishing the execution of all fields, we obtain the query result object. The final step is to return this result to the client by converting it to JSON format before sending it back.
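
The execution and serialization steps described above can be sketched in plain Ruby. This is a toy model under stated assumptions, not GraphQL Ruby's engine: the query is represented as an already-parsed tree (a nested hash), and `Lazy`, `RESOLVERS`, `execute`, and `run` are hypothetical names used only to show the traversal order and the postponement of lazy fields.

```ruby
require "json"

# Toy model of the execution stage. The "AST" is a nested hash of
# field name => sub-selection (nil for leaf fields). All names here are
# hypothetical and only illustrate the traversal order.
Lazy = Struct.new(:block) # marks a field whose resolution is postponed

RESOLVERS = {
  "item"  => ->(_parent) { { id: 1, title: "Lamp", cents: 1999 } },
  "title" => ->(parent) { parent[:title] },
  # A lazy field: returns a marker instead of a value.
  "price" => ->(parent) { Lazy.new(-> { parent[:cents] / 100.0 }) }
}.freeze

# Recursively execute every field; lazy fields are queued, not resolved.
def execute(selection, parent, pending)
  selection.each_with_object({}) do |(name, sub_selection), result|
    value = RESOLVERS.fetch(name).call(parent)
    if value.is_a?(Lazy)
      pending << [result, name, value]
    elsif sub_selection
      result[name] = execute(sub_selection, value, pending)
    else
      result[name] = value
    end
  end
end

def run(ast)
  pending = []
  result = execute(ast, nil, pending)
  # Resolve postponed fields only after all non-lazy fields are done.
  pending.each { |target, name, lazy| target[name] = lazy.block.call }
  JSON.generate(result) # final step: serialize the result for the client
end

AST = { "item" => { "title" => nil, "price" => nil } }.freeze
RESPONSE = run(AST)
puts RESPONSE
```

Parsing and validation are left out on purpose; the sketch starts where the tree already exists.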

00:05:50.280 This is the general query execution path. Our plan today involves beginning with different methods on how to identify bottlenecks in your applications. Following that, we will discuss how to fix common issues typically encountered in GraphQL applications. Finally, we will cover some advanced techniques for addressing specific cases.

00:06:39.440 It doesn’t make sense to optimize before you have measurements. If you do not have them, you should set them up first. In production applications, we often use services like New Relic; however, out of the box these may not be very helpful for GraphQL.

00:07:05.360 Given that all requests are directed to the same GraphQL endpoint, the data will be aggregated, preventing us from identifying which specific query is slow. To address this, GraphQL Ruby offers a feature called 'tracer.' This feature allows you to capture events from the execution engine and perform various actions with them.

00:07:50.720 For example, you can send that information to your monitoring tool, such as New Relic. To enable the tracing feature, you simply need to turn on the plugin in your schema, and afterward, you can see all the stages in New Relic. This setup allows you to identify which queries are slow.

00:08:59.919 If you are using your own monitoring solution, you will have to set it up manually. I can recommend a set of gems known as 'Yabeda,' created by my colleague Andrey Novikov. The idea is that you can declare all the metrics you'd like to collect from various components of your applications, such as Rails or Sidekiq, and send them to any data source of your choice.

00:09:38.640 For instance, it can be sent to Prometheus. In the case of GraphQL, it sends various metrics such as GraphQL API request counts, field resolve run times, and counts for query and mutation fields. This setup provides visibility into the operations occurring in your application.

00:10:36.119 Next, I'd like to recommend a book titled 'Production Ready GraphQL.' I read it one day and found it to be filled with useful tips, some of which will feature in today's talk. One notable tip discusses client headers, which are beneficial when clients, such as mobile applications, are not immediately updated.

00:11:04.639 The problem is that you might identify a slow query and resolve it on the client side, only for the issue to resurface later. Without client headers, you won’t be able to tell whether the problem is new or old. Therefore, the solution is to add headers indicating the client name and client version.
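
A minimal sketch of how such headers could be read on the server and attached to a slow-query log line. The header names `X-Client-Name` and `X-Client-Version` and both helper methods are assumptions for illustration; the talk does not name specific headers.

```ruby
# Sketch: read client identification from a Rack-style env hash.
# The header names X-Client-Name / X-Client-Version are an assumption;
# use whatever your clients actually send.
def client_info(env)
  {
    name: env.fetch("HTTP_X_CLIENT_NAME", "unknown"),
    version: env.fetch("HTTP_X_CLIENT_VERSION", "unknown")
  }
end

# Build a log line that ties a slow query to the client that sent it.
def slow_query_log(env, operation_name, duration_ms)
  client = client_info(env)
  format("slow_query op=%s client=%s version=%s duration_ms=%.1f",
         operation_name, client[:name], client[:version], duration_ms)
end

env = { "HTTP_X_CLIENT_NAME" => "ios-app", "HTTP_X_CLIENT_VERSION" => "2.3.1" }
puts slow_query_log(env, "ItemsWithPrices", 812.4)
```

With the version in every log line, a slow query that reappears can be attributed either to an old client that was never updated or to a genuinely new regression.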

00:11:30.560 Now that we have our monitoring setup, we can find out if a query is slow. However, sometimes a query may be fast under certain conditions and slow under others. For instance, a specific variable might cause the query to slow down in some cases, and conventional monitoring won’t help us.

00:12:15.520 Let me illustrate this with an example. Imagine we have a list of items for sale, each with a price in USD, but we want to show the prices in the user's local currency. While the query that returns items with prices may normally be fast, it can occasionally log as being really slow.

00:12:56.880 To address this, we can take advantage of the tracer feature I previously mentioned. We can write our custom tracer, which is a class with a single method called 'trace.' This method accepts a key, which represents the name of the event, and the data, which is a hash that varies depending on the event key.

00:13:39.360 What we do here is remember the time when we received the event, allow the execution engine to process the stage, and compare the new time with the previous one, logging it accordingly. By doing this, we can see the various stages with the time spent within the code.
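
A minimal sketch of such a tracer in plain Ruby. The `trace(key, data)` shape follows the description above; the three calls at the bottom are a hypothetical stand-in for the execution engine, and the event keys used there are illustrative.

```ruby
# Sketch of a custom tracer: an object with a single `trace(key, data)`
# method that wraps each stage of execution in a timer.
class TimingTracer
  attr_reader :timings

  def initialize
    @timings = []
  end

  def trace(key, _data)
    started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield # let the execution engine run the stage
    elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at) * 1000
    @timings << [key, elapsed_ms]
    result # the engine needs the stage's return value back
  end
end

# Stand-in for the execution engine (hypothetical): it only shows how
# stages are wrapped by the tracer.
tracer = TimingTracer.new
tracer.trace("parse", {}) { sleep(0.001); :ast }
tracer.trace("validate", {}) { sleep(0.001) }
tracer.trace("execute_query", {}) { sleep(0.005) }

tracer.timings.each { |key, ms| puts format("%-15s %.2f ms", key, ms) }
```

How the tracer is attached to a schema depends on the GraphQL Ruby version in use, so check the tracing documentation for the release you run.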

00:14:33.440 For example, we may find that the field execution takes a considerable amount of time, but we won’t know which specific field is the bottleneck. Thus, we can change our tracer to compare the event key with the 'execute_field' string. If it matches, we print the field name to understand what is slow.
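
The field-level variant could look roughly like this. It is a sketch that assumes the event key is the string `"execute_field"` and that the data hash carries the field path under `:path`; both should be verified against the GraphQL Ruby version in use. `FieldTracer` and `threshold_ms` are hypothetical names.

```ruby
# Sketch: only time field execution and record which field was slow.
# Assumes the event key is "execute_field" and that the data hash
# exposes the field path under :path (assumptions, not confirmed API).
class FieldTracer
  attr_reader :slow_fields

  def initialize(threshold_ms:)
    @threshold_ms = threshold_ms
    @slow_fields = []
  end

  def trace(key, data)
    # Every other stage passes straight through untimed.
    return yield unless key == "execute_field"

    started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at) * 1000
    @slow_fields << [data[:path].join("."), elapsed_ms] if elapsed_ms >= @threshold_ms
    result
  end
end

tracer = FieldTracer.new(threshold_ms: 5)
tracer.trace("execute_field", { path: %w[items title] }) { "Lamp" }
tracer.trace("execute_field", { path: %w[items price] }) { sleep(0.02); 19.99 }
tracer.slow_fields.each { |path, ms| puts format("%s took %.1f ms", path, ms) }
```

A threshold keeps the log readable: only fields that exceed it are reported, which is usually enough to point at the resolver worth profiling.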

00:15:56.320 If you're curious how my own problem turned out, I found that we stored conversion rates for currencies in Redis. When I executed the field for the first time, it resulted in loading a massive amount of data, which took a significant amount of time. The fix was straightforward: I removed unnecessary data from Redis, and the query execution improved drastically.

00:16:43.440 Another effective technique is called directive profiling. You can write a custom directive, such as one called 'PP,' which can apply any profiler you want to a specific field. This allows you to test performance on staging or even production while logging the results on the server for further analysis.

00:17:25.760 Next, I'd like to discuss parsing issues. When I prepared for the Ruby Russia conference in 2019, I wanted to explore how slow parsing could be. I created a small benchmark with a list of queries to test various aspects, such as nested fields and query complexity.

00:17:59.360 The results showed that smaller queries are parsed quickly, while larger queries with more fields and nesting take significantly longer, sometimes up to a second. Therefore, it’s essential to keep your queries simple, as there isn’t a way to reduce parsing time significantly.

00:18:44.800 The second issue worth mentioning is response building. Sometimes building a response can be slow due to the large response size. In one scenario I encountered, profiling indicated that most time was spent on the library responsible for handling large records, not due to developer error.

00:19:11.760 In such cases, keeping your responses small is crucial. Implementing pagination can significantly reduce the amount of data transferred and processed. Additionally, applying maximum depth and complexity limits can help regulate the size and complexity of queries.
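
To illustrate what a depth limit checks, here is a small sketch over a selection modeled as a nested hash. `MAX_DEPTH`, `selection_depth`, and `within_depth_limit?` are hypothetical names; a real server would perform this check on the parsed query before execution.

```ruby
# Sketch: reject queries nested deeper than a limit before executing them.
# The selection is modeled as a nested hash (field => sub-selection or nil);
# MAX_DEPTH and both method names are hypothetical.
MAX_DEPTH = 3

def selection_depth(selection)
  return 0 if selection.nil? || selection.empty?

  1 + selection.values.map { |sub| selection_depth(sub) }.max
end

def within_depth_limit?(selection, limit = MAX_DEPTH)
  selection_depth(selection) <= limit
end

shallow = { "items" => { "title" => nil, "price" => nil } }
deep    = { "items" => { "seller" => { "items" => { "seller" => { "name" => nil } } } } }

puts within_depth_limit?(shallow)
puts within_depth_limit?(deep)
```

GraphQL Ruby exposes schema-level `max_depth` and `max_complexity` settings for this purpose, so in practice the check does not need to be written by hand.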

00:19:49.680 One key recommendation is to ensure that you are using the latest version of GraphQL Ruby, as each iteration usually brings performance improvements. I ran benchmarks from 2019 again after two years, and I noticed solid performance increases and reduced memory consumption.

00:20:20.160 Another common problem arises from database interactions. Database queries might be suboptimal, which can create a bottleneck. This talk won’t cover database optimization methods as that’s a topic in itself.

00:20:50.880 A significant issue for GraphQL users is the 'N+1 problem,' which occurs when you have a list of entities with associations. If you forget to preload these associations, you could execute one query to load the entities and many additional queries to retrieve associations for each.

00:21:33.520 To mitigate this, the simplest solution is to eager-load all associations using 'includes' or 'eager_load.' While this approach works well for smaller applications, it isn't always efficient since it may load unnecessary data. Thus, using techniques like 'lookahead' can improve efficiency by checking whether the query actually requests an association before loading it.
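
The difference between the naive and the eager approach can be shown with an in-memory stand-in for the database that counts queries. `FakeDB` and its methods are hypothetical; in a Rails application the batched lookup corresponds to what eager loading does for you.

```ruby
# Sketch of the N+1 pattern with an in-memory "database" that counts
# queries. FakeDB and its methods are hypothetical stand-ins for a real
# data layer.
class FakeDB
  ITEMS = [{ id: 1, seller_id: 10 }, { id: 2, seller_id: 11 }, { id: 3, seller_id: 10 }].freeze
  SELLERS = { 10 => "Alice", 11 => "Bob" }.freeze

  attr_reader :query_count

  def initialize
    @query_count = 0
  end

  def items
    @query_count += 1
    ITEMS
  end

  def seller(id)
    @query_count += 1
    SELLERS.fetch(id)
  end

  def sellers(ids)
    @query_count += 1
    SELLERS.slice(*ids)
  end
end

# Naive: one query for the list plus one per item.
naive = FakeDB.new
naive.items.map { |item| naive.seller(item[:seller_id]) }

# Eager: one query for the list plus one batched query for all sellers.
eager = FakeDB.new
items = eager.items
sellers = eager.sellers(items.map { |item| item[:seller_id] }.uniq)
names = items.map { |item| sellers.fetch(item[:seller_id]) }

puts "naive: #{naive.query_count} queries, eager: #{eager.query_count} queries"
puts names.inspect
```

The naive count grows with the length of the list, while the eager count stays at two regardless of how many items are returned.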

00:22:38.960 Lazy preloading is another useful technique, where associations are loaded only when they are accessed for the first time. Various gems can implement this feature, and I have one called 'ar_lazy_preload' for those interested.
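
The idea behind lazy preloading can be sketched without any gem: each record remembers the collection it was loaded with, and the first access to an association on any record loads it for the whole collection in one batch. This is an illustration of the concept only, not how 'ar_lazy_preload' is implemented; `LazyRecord` and `SellerLoader` are hypothetical.

```ruby
# Sketch of lazy preloading. Records keep a reference to their sibling
# records; the first association access triggers one batched load for
# all of them. All names are hypothetical.
class LazyRecord
  attr_reader :attrs
  attr_accessor :siblings, :seller_name

  def initialize(attrs)
    @attrs = attrs
  end

  def seller(loader)
    loader.preload_sellers(siblings) if seller_name.nil?
    seller_name
  end
end

class SellerLoader
  SELLERS = { 10 => "Alice", 11 => "Bob" }.freeze
  attr_reader :batch_count

  def initialize
    @batch_count = 0
  end

  # One batched lookup fills the association on every sibling record.
  def preload_sellers(records)
    @batch_count += 1
    records.each { |record| record.seller_name = SELLERS.fetch(record.attrs[:seller_id]) }
  end
end

records = [10, 11, 10].map { |seller_id| LazyRecord.new({ seller_id: seller_id }) }
records.each { |record| record.siblings = records }

loader = SellerLoader.new
puts records.map { |record| record.seller(loader) }.inspect
puts "batches: #{loader.batch_count}"
```

If the query never asks for the association, no batch is run at all, which is the advantage over unconditional eager loading.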

00:23:15.680 Moving on to advanced techniques for improving application performance, let’s discuss HTTP caching, which allows us to avoid loading data from the database when we know that the client already has valid data. We can set 'Cache-Control' and 'Expires' headers to specify how long the data is valid.

00:24:08.480 Additionally, we can use 'ETags' or 'Last-Modified' headers for determining when data needs to be refreshed. However, typical HTTP caching can be challenging for GraphQL because it primarily uses POST requests that are not cacheable according to the specification.
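
A minimal sketch of ETag-based revalidation for a response served over GET. `respond` is a hypothetical helper returning a Rack-style `[status, headers, body]` triple, and the `max-age` value is arbitrary.

```ruby
require "digest"
require "json"

# Sketch of ETag-based revalidation for a GET GraphQL response.
# `respond` is a hypothetical helper that returns [status, headers, body]
# in the style of a Rack response.
def respond(result, if_none_match)
  body = JSON.generate(result)
  etag = %("#{Digest::SHA256.hexdigest(body)}")
  headers = { "ETag" => etag, "Cache-Control" => "public, max-age=60" }

  # The client already holds this exact representation: skip the body.
  return [304, headers, ""] if if_none_match == etag

  [200, headers.merge("Content-Type" => "application/json"), body]
end

banners = { "data" => { "banners" => [{ "id" => 1, "title" => "Summer sale" }] } }

status, headers, = respond(banners, nil)
puts status
status, = respond(banners, headers["ETag"])
puts status
```

Note that this sketch still computes the result in order to hash it, so it saves transfer rather than database work. To skip the database as well, the validator would have to come from something cheap to read, such as a last-updated timestamp.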

00:25:05.440 A solution is to allow GET requests for your GraphQL endpoint, enabling you to build an HTTP caching mechanism from scratch. For instance, if you have static data, like banners that don’t change frequently, you can cache them efficiently.

00:25:51.440 Another useful technique is called 'Apollo Persisted Queries.' Instead of sending the entire query string, you send a hash representing the query. The server checks its storage for the query associated with that hash. If found, the server can use it directly; if not, it returns an error until the query is saved.
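
The handshake described above can be sketched with an in-memory store. `PersistedQueryStore` and its return values are hypothetical; only the `PersistedQueryNotFound` error name is taken from Apollo's automatic persisted queries protocol.

```ruby
require "digest"

# Sketch of the persisted-query handshake with an in-memory store.
# PersistedQueryStore and its return values are hypothetical; the
# not-found error name follows Apollo's persisted queries protocol.
class PersistedQueryStore
  def initialize
    @queries = {}
  end

  # Returns the query string to execute, or an error hash.
  def resolve(sha256_hash:, query: nil)
    if query
      # Client sent both: verify the hash, then remember the query.
      unless Digest::SHA256.hexdigest(query) == sha256_hash
        return { error: "hash does not match query" }
      end

      @queries[sha256_hash] = query
    end
    @queries.fetch(sha256_hash) { { error: "PersistedQueryNotFound" } }
  end
end

store = PersistedQueryStore.new
query = "{ items { title price } }"
sha = Digest::SHA256.hexdigest(query)

p store.resolve(sha256_hash: sha)               # first attempt: hash only
p store.resolve(sha256_hash: sha, query: query) # retry with the full query
p store.resolve(sha256_hash: sha)               # later requests: hash only
```

Because the hash is short and stable, requests that carry only the hash fit comfortably into a GET URL, which is what makes this technique combine well with HTTP caching.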

00:26:50.720 With GraphQL Ruby, there’s an existing implementation, although it's currently part of the paid Pro version. I have also created my own implementation, a gem called 'graphql-ruby-persisted_queries,' which stores queries and looks them up by hash.

00:27:36.080 To sum up, first ensure you have monitoring configured before making optimizations. If issues arise with specific queries, you can use a custom tracer to pinpoint performance issues directly.

00:27:51.840 If parsing issues persist, consider simplifying your queries or implementing compiled queries. It's crucial to address any N+1 problems as they can lead to significant performance issues. Keep your responses small, as smaller responses can be serialized quickly, and explore various caching strategies for better performance.

00:28:57.760 Now, I'm ready to take your questions. I will also post my slides on Twitter later, so please follow me for updates. Thank you all for your attention!

00:29:31.680 Thank you, Dima! Once again, I will drop your links in the stream chat. If you’re attending from the other side of the event, feel free to share your links with us too. We have a couple of questions. Are you ready to answer them?

00:29:44.800 Yes, I’m ready! Let’s do it.