Service modeling at GitHub

by Elena Tănăsoiu

In her presentation at the Friendly.rb 2023 conference, Elena Tănăsoiu, a Senior Engineer at GitHub, delves into the concept of service modeling, which she has been working on for the past few months. The talk provides insights into how GitHub approaches performance optimization of its vast Rails monolith.

Key Points Discussed:

- Introduction to Service Modeling:

- Service modeling is the process of identifying and documenting performance bottlenecks in GitHub services.

- It involves collaboration with service owners to address issues and improve performance.

- Performance Testing Environment:

- GitHub creates an environment to analyze the performance impacts of specific endpoints during real traffic scenarios.

- Focus is on educating engineers on their code's resource impact.

- Technical Background:

- GitHub relies on Unicorn rather than Puma, which is now standard in Rails.

- The focus is on uncovering performance issues like the notorious 'n+1 queries' which can lead to excessive database calls.

- Findings from Profiling:

- Profiling revealed that over 40% of rows fetched by their monolith came from one heavily queried permissions table.

- Six controllers were found to serve 50% of GitHub's traffic, making them clear targets for optimization.

- Process of Service Optimization:

- Steps include gathering requests data, prioritizing endpoints based on load and latency, and visualizing service flows using diagrams and flame graphs.

- Elena highlights the importance of understanding dependencies across databases and caches, along with the utilization and saturation levels impacting performance.

- Future Considerations:

- As GitHub transitions to React, new performance demands will shift how their services are structured, necessitating ongoing analysis.

- Conclusion and Recommendations:

- Success in this optimization work is measured through monitored performance metrics, and findings are shared via educational blog posts.

- Elena encourages others to adopt similar methodologies for performance improvement, such as applying the USE method (utilization, saturation, errors), using tracing tools, and incorporating real traffic data in their evaluations.

The talk concludes with a sense of community spirit and an invitation for collaboration in optimizing code for better performance, leaving the audience with actionable insights to enhance their own processes.

00:00:06.200 At the end of June, I went to Brighton for Brighton Ruby. We met, but unfortunately, we didn’t get to talk much because she was always surrounded by other people. Just last week, I was at Euruko and a few people asked me to say hi to Elena for them, so let’s give a warm welcome to Elena Tănăsoiu.

00:00:36.280 Hi everyone! I see the sound is working, so that’s great! Thank you for joining me today on this second day of the conference. It’s been really cool, and I’m quite impressed with the quality of the talks so far. I’m pretty sure I’ll be coming back next year if Adrian is organizing it.

00:01:01.239 I don’t know if you’ve noticed, but everyone keeps using the word 'friendly.' I think we should do a drinking game or something at this point, because it’s everywhere! But that’s a good sign. Alright, let’s start my presentation.

00:01:15.400 My name is Elena, and I work at GitHub. The topic I’ll discuss today is service modeling. I should preface this by saying that this is a project I’ve been working on for the last couple of months. I hope you get a peek under the hood at GitHub and hear my story from the past few months.

00:01:31.920 A little about me: you can find all my social media at that address. I sometimes like to flex a bit by claiming I have my own link. I also use the Pikachu emoji because that’s my team’s emoji, and we’re quite attached to it — it’s a Detective Pikachu because we do investigations. I’m also Romanian, and it feels good to be at a Ruby conference in Romania.

00:02:02.079 It's very satisfying to see this community thrive. I am very proud. I studied computer science at the University of Bucharest in 2006, which doesn't feel that long ago, right? Now I live in London and have been working mostly with Rails for the past 14 years. I have been part of the performance engineering team at GitHub for the last couple of months, and I enjoy terrible puns. If you have any, please share them with me; I love them!

00:03:00.519 Before I start, I want to say a few words about Kris Nóva. She was the reason my team was founded and was our tech lead. You might know her from her work with eBPF, or her book on cloud-native infrastructure, or from the Nova Show. She was the founder of Hachyderm and the Nivenly Foundation for open source. Sadly, she passed away last month in a climbing accident at the age of 36. We deeply feel her loss on our team.

00:03:41.080 Kris was a kind, funny, and passionate person. We partnered on this service modeling project, which was something close to her heart. She aimed to help developers understand the load they put on our resources and educate them on how to write performant code. We will miss you, Kris.

00:04:11.480 Now, what is service modeling? It’s a standard process to find and document bottlenecks and performance opportunities at GitHub. That’s the definition we have in the documentation. It's a collaborative process where we work with the respective service owners to address problems.

00:04:22.720 We’ve created a performance testing environment that I will discuss later. So, why do we do this? GitHub is one of the largest Rails monoliths in the world. We want strongly isolated vertical slices of our application and to understand the performance of a specific endpoint at runtime with real traffic. Moreover, we want to educate engineers about the performance of their code so they develop an awareness of what their code is doing.

00:05:06.640 Less YOLO, more precision. While we have service separation and good observability, we still cannot accurately determine an endpoint's load on our resources. When I refer to resources, I mean CPU consumption, memory, the number of database connections, MySQL queries, and cache hits and misses, at that level.

00:05:34.640 This lack of visibility makes it difficult to do capacity planning. We have countless requests from teams asking us questions like, 'If I scale out this service, what’s the downstream effect?' We need to be able to answer that.

00:06:15.079 To clarify, GitHub runs on Unicorn. So when you hear me mention Unicorn, just visualize the number of requests we can process simultaneously. Rails now ships with Puma by default, but this wasn’t the case when GitHub was started. We do have Puma in some places, but Unicorn is the main one.

00:06:34.800 I want to briefly explain what n+1 queries are. It’s a performance problem where the application makes many database queries instead of a single query to fetch or modify data. For instance, in a Rails example, let's say you have books belonging to authors. You fetch all books in one query, but if you then want to fetch a resource related to each book, you might end up firing off n additional queries, one per book, when ideally you want just one query that fetches everything.

00:06:58.720 When I refer to n+1 queries, I’m highlighting a performance issue that we aim to identify and resolve in our work.
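The books-and-authors example can be made concrete with a minimal sketch. It uses a plain-Ruby stand-in for the database so the query count is visible; `FakeDB` and its methods are hypothetical and are not GitHub's code or the Active Record API. In a Rails application, eager loading with `includes` is the usual way to get the batched behavior.

```ruby
# Hypothetical in-memory "database" that counts how many queries it receives.
class FakeDB
  attr_reader :query_count

  def initialize
    @books = [
      { id: 1, title: "Book A", author_id: 10 },
      { id: 2, title: "Book B", author_id: 11 },
      { id: 3, title: "Book C", author_id: 10 }
    ]
    @authors = { 10 => "Author X", 11 => "Author Y" }
    @query_count = 0
  end

  def reset_count
    @query_count = 0
  end

  # One query: fetch every book.
  def all_books
    @query_count += 1
    @books
  end

  # One query per call: fetch a single author by id.
  def author(id)
    @query_count += 1
    @authors[id]
  end

  # One query for many ids, similar to WHERE id IN (...).
  def authors(ids)
    @query_count += 1
    @authors.slice(*ids)
  end
end

db = FakeDB.new

# N+1 pattern: one query for the books, then one more query per book.
db.all_books.each { |book| db.author(book[:author_id]) }
n_plus_one_queries = db.query_count

db.reset_count

# Batched pattern: one query for the books, one query for all their authors.
books = db.all_books
authors = db.authors(books.map { |book| book[:author_id] }.uniq)
batched_queries = db.query_count

puts "N+1: #{n_plus_one_queries} queries, batched: #{batched_queries} queries"
```

The gap grows with the number of books: the first pattern costs one query per row, while the batched version stays at two queries regardless of how many books are returned.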

00:07:48.399 So what did we find? We started by profiling one of the oldest services at GitHub, which is Issues. There is a lot of technical debt here, and I can share many stories about it. We began with this service because it has the most experienced people regarding performance optimizations, so we learned together.

00:08:12.360 Believe it or not, over 40% of all rows fetched by the monolith come from one table, the permissions table. This is due to the numerous small queries we make to check if each piece of data is allowed to be viewed. We are adamant about ensuring security is properly managed, but this is starting to show in our performance.

00:08:55.000 We didn’t expect that one table would be responsible for such a large portion of our traffic, but focusing on optimizing this can have a significant effect across the entire monolith, not just within the Issues service.

00:09:34.720 Additionally, we discovered that 50% of the traffic at GitHub is served by just six controllers. Again, really good news! We can strategically optimize these controllers to enhance the performance of a large portion of our requests. It’s worth mentioning that our architecture consists of both a front-end and a REST API, where we’re focusing on optimizing the front-end.
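A finding like this comes from ranking controllers by traffic and walking down the list until a target share is covered. The sketch below shows that calculation under stated assumptions: the controller names and request counts are invented for illustration and do not reflect GitHub's real numbers.

```ruby
# Hypothetical request counts per controller (invented names and numbers).
requests = {
  "IssuesController"       => 300,
  "PullRequestsController" => 250,
  "RepositoriesController" => 200,
  "CommentsController"     => 150,
  "LabelsController"       => 100
}

# Returns the busiest controllers needed to cover at least target_share
# (a fraction between 0 and 1) of all requests.
def controllers_covering(requests, target_share)
  total = requests.values.sum.to_f
  covered = 0
  selected = []
  requests.sort_by { |_name, count| -count }.each do |name, count|
    break if covered / total >= target_share
    selected << name
    covered += count
  end
  selected
end

top = controllers_covering(requests, 0.5)
puts "#{top.size} controllers cover at least half of the traffic: #{top.join(', ')}"
```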

00:10:06.560 This is an example of the process used when creating an issue. It takes about 3.56 seconds to complete, where we spend 0.9 seconds solely creating the issue. The remaining time can be spent on tasks that can run in the background, updating related resources without holding up the user.
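The arithmetic behind those timings, and the idea of deferring follow-up work, can be sketched as follows. Only the two timings come from the talk; the job names and the in-process queue are hypothetical stand-ins, and it is an assumption that all of the remaining time is deferrable. A Rails application would normally hand such work to a job framework such as Active Job rather than a bare `Queue`.

```ruby
# Timings quoted in the talk.
total_seconds = 3.56
core_seconds  = 0.9

# Upper bound on what could move off the request path, assuming all of the
# remaining work can be deferred.
deferrable_seconds = (total_seconds - core_seconds).round(2)
deferrable_share   = (100 * deferrable_seconds / total_seconds).round(1)

# Hypothetical sketch: do the core write inline, queue everything else.
def create_issue(title, follow_up_jobs)
  issue = { id: 1, title: title } # stands in for the core write
  %i[notify_subscribers update_search_index].each do |job|
    follow_up_jobs << [job, issue[:id]] # picked up later by a worker
  end
  issue
end

jobs = Queue.new
issue = create_issue("Example issue", jobs)

puts "Created issue #{issue[:id]} with #{jobs.size} follow-up jobs queued"
puts "Up to #{deferrable_seconds}s (#{deferrable_share}%) could run in the background"
```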

00:10:25.759 Here’s a diagram that illustrates what happens when a user creates an issue. I initially did this manually as we developed our automated tools. The diagram details all the MySQL queries that are executed, bringing to light how many operations are triggered.

00:10:59.760 We also identified a potential future bottleneck. GitHub plans to move to React, which means different performance considerations might arise. Previously, the whole page would load at once, but in the new experience, only the first page will load once, while additional interactions will yield GraphQL requests back to our database.

00:11:35.160 This raises a significant question: what is the total load on our backend as we continue to move more pages to React? Currently, there isn’t a lot of data on this, but we want to stay one step ahead. The workload characterization, a technical term, is changing. We plan to invest more in resource isolation.

00:12:14.840 So what does the typical process look like? There are several steps to follow. First, you gather request data. When we began, we needed to make decisions about what to profile because GitHub Issues is a massive service. We narrowed our focus to the REST API, as it has more traffic than the front end.

00:12:51.559 From there, we made a list of our request paths and assessed which ones were receiving the most traffic. We prioritize high-impact areas based on traffic load, usually measured in requests per second. However, the REST API includes over 180 endpoints, which would be overwhelming to visualize all at once.

00:13:30.640 We settled on three criteria for prioritization: endpoints with the highest load, those with the highest latency, or those that are most critical for business operations. Our focus was on the CRUD operations — create, read, update, delete — specifically regarding the index page.
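The three criteria can be combined into a shortlist by taking the union of the top endpoints by load, the top endpoints by latency, and anything flagged as business critical. This is a minimal sketch: the `Endpoint` struct, the paths, and all numbers are hypothetical and are not real GitHub API routes or measurements.

```ruby
# Hypothetical endpoint record; field names are illustrative.
Endpoint = Struct.new(:path, :requests_per_second, :p99_ms, :business_critical,
                      keyword_init: true)

endpoints = [
  Endpoint.new(path: "GET /issues", requests_per_second: 900, p99_ms: 400,
               business_critical: true),
  Endpoint.new(path: "POST /issues", requests_per_second: 120, p99_ms: 3500,
               business_critical: true),
  Endpoint.new(path: "GET /issues/:id/timeline", requests_per_second: 300,
               p99_ms: 800, business_critical: false),
  Endpoint.new(path: "DELETE /issues/:id", requests_per_second: 5, p99_ms: 200,
               business_critical: false)
]

# Union of: highest load, highest latency, and business-critical endpoints.
def shortlist(endpoints, top_n: 1)
  by_load    = endpoints.max_by(top_n) { |endpoint| endpoint.requests_per_second }
  by_latency = endpoints.max_by(top_n) { |endpoint| endpoint.p99_ms }
  critical   = endpoints.select(&:business_critical)
  (by_load + by_latency + critical).uniq
end

shortlist(endpoints).each { |endpoint| puts endpoint.path }
```

Raising `top_n` widens the shortlist; keeping it small matches the goal of not trying to profile all of the endpoints at once.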

00:14:00.360 Next, we create diagrams to visualize the service. Finding the right code in a massive monolith can be challenging, but it’s crucial to understand how different parts affect the service. We obtain traces of requests to create useful flame graphs and build awareness of where performance bottlenecks may arise.

00:14:35.800 We want to determine the dependencies for each endpoint, looking at databases, caches, and calls to internal services. We used the USE method proposed by Brendan Gregg in his book Systems Performance, where you iterate through each of your dependencies to track utilization, saturation, and errors.

00:15:04.480 In our case, we also modified this approach to focus on procedural aspects, specifically identifying where the code waits for work to be done. This helped us pinpoint systemic bottlenecks.

00:15:45.760 With the gathered data, we assess the lifecycle of our requests. We focus on errors first because they are quick to identify, followed by analysis of CPU and memory usage. Afterward, we look at levels of saturation as any saturation can lead to issues.
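A checklist in this style can be sketched as a pass over each dependency that reports errors first, then high utilization, then any saturation. The resource names, the sample values, and the 80% utilization threshold below are assumptions made for illustration; they are not GitHub's metrics or tooling.

```ruby
# Hypothetical per-resource snapshot:
#   utilization: fraction of capacity in use (0.0 to 1.0)
#   saturation:  amount of queued work waiting for the resource
#   errors:      count of error events
Resource = Struct.new(:name, :utilization, :saturation, :errors, keyword_init: true)

resources = [
  Resource.new(name: "mysql_connections", utilization: 0.65, saturation: 0, errors: 0),
  Resource.new(name: "unicorn_workers", utilization: 0.92, saturation: 14, errors: 0),
  Resource.new(name: "cache", utilization: 0.40, saturation: 0, errors: 3)
]

# Reports findings in the order described above: errors, utilization, saturation.
# The utilization_limit default is an arbitrary illustrative threshold.
def use_findings(resources, utilization_limit: 0.8)
  errors = resources.select { |r| r.errors > 0 }
                    .map { |r| "#{r.name}: #{r.errors} errors" }
  utilization = resources.select { |r| r.utilization > utilization_limit }
                         .map { |r| "#{r.name}: #{(r.utilization * 100).round}% utilized" }
  saturation = resources.select { |r| r.saturation > 0 }
                        .map { |r| "#{r.name}: #{r.saturation} queued" }
  errors + utilization + saturation
end

use_findings(resources).each { |finding| puts finding }
```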

00:16:29.440 Using dashboards and metrics that concern us, we pinpoint where issues arise and associate them with our flame graphs. This visualization allows us to detect problems like fragmentation quickly.

00:17:09.120 Once we have established these insights, we can identify opportunities for optimization based on the data we’ve gathered. We focus on aspects such as unbounded SQL calls, synchronously processing tasks that might work better in the background, and evaluating caching methods.

00:17:51.760 Ultimately, we compile our findings into a concise report for the team. The goal is to keep it short and engaging, identifying three key bottlenecks we can work on together.

00:18:32.720 To measure success, we monitor our performance metrics and celebrate improvements. We share this knowledge through blog posts and educational materials.

00:18:52.080 If you’re looking to implement something similar, I encourage you to use the tools we relied on: the USE method, tracing tools, flame graphs, and effective diagramming. Don’t forget to consider real traffic in your measurements.

00:19:24.960 Thank you all for your attention!