Handling 225k requests per second to RubyGems.org
Summarized using AI

by Samuel Giddins

In this presentation at RubyConf Taiwan 2023, Samuel Giddins discusses the intricacies of managing RubyGems.org, a vital gem hosting service that experiences significant traffic peaks. The session opens with an acknowledgment of the audience's familiarity with RubyGems through common commands like 'gem install' or 'bundle install'. Giddins, a maintainer of RubyGems and Bundler, shares insights into how RubyGems.org handles an average of 20,000 requests per second, peaking at an astounding 225,000 requests per second.

Key points discussed include:
- Traffic Volume: RubyGems.org has served approximately 150 billion downloads across around 192,000 gems, indicating a massive user base and continuous growth.
- Infrastructure: The service uses AWS as its primary server host and Fastly for caching and DDoS protection. The AWS infrastructure costs about $20,000 per month.
- Volunteer Contribution: A small, mostly-volunteer team, alongside Ruby Central’s support, is pivotal in maintaining operations, which highlights the reliance on community efforts in sustaining the service.
- Caching Strategies: The team lets Fastly's cache and other simple, optimized systems serve the vast majority of requests, so the Rails application does as little work as possible. This matters most during traffic peaks, when it keeps the service running without excessive resource use.
- Incident Management: Giddins recounts a critical event in May in which an unforeseen traffic spike overloaded the database. In response, the team began pre-calculating responses and caching them in S3, making the system more resilient to future traffic surges.
- Support Importance: The discussion emphasizes the essential role of sponsors like AWS and Fastly in facilitating the overall functionality and sustainability of RubyGems.org.

In conclusion, the presentation underlines the collaborative effort and thoughtful architecture management required to maintain agile and effective service for users. Giddins encourages audience engagement with RubyGems.org and expresses gratitude towards the audience for their involvement, reinforcing the significant community aspect of the ecosystem.

00:00:00.040 The topic of handling 225k requests per second to RubyGems.org is up next. Please join me in welcoming your speaker, Samuel.
00:00:28.960 Good morning everyone! I know it's still early, and maybe we haven't had enough coffee yet, but we're going to talk a bit about everyone's favorite website that you never think about: RubyGems.
00:00:36.079 A little about me: I am pretty much everywhere on the internet. My name is Samuel, and I'm a maintainer of RubyGems, Bundler, and RubyGems.org.
00:00:49.039 You might know me for contributing bugs to all the above for at least a decade. My sincerest apologies for that! Currently, I am the Security Engineer in Residence at Ruby Central, thanks to AWS for sponsoring that role. What does that mean? It means thinking about security across the entire packaging ecosystem.
00:01:01.879 It's a lot of fun! I actually just shipped something called Trusted Publishing this past Thursday at a coffee shop here in Taipei. You'll have to stay tuned for a talk about that some other time.
00:01:17.000 Okay, so I promised you a talk about RubyGems.org. Let's begin with some Q&A. Show of hands: if you have ever run 'gem install'. Great! This is an easy way to tell who's not paying attention to this talk. Now, show of hands if you've ever put this in a Gemfile. Huh! Some people here aren't using Bundler, so guess what? You make my life difficult.
00:01:58.039 You've helped contribute to the traffic that RubyGems serves. Would it surprise you to hear that most of the traffic we have comes from people doing things like installing gems and resolving Gemfiles? How selfish of all of you! I'm, of course, kidding. I gave a talk at RubyConf San Diego last month. If you saw that, I apologize for some quick repetition, but I have some numbers that I like to share because they make people a bit surprised.
00:02:51.720 RubyGems.org might not have the enormous scale of something like GitHub or npm, or whatever numbers were quoted earlier about Shopify, but we certainly don't make any revenue off the traffic we serve. However, the service has grown about 20% per year for the last 20 years. Compound growth is something!
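To put the compound-growth remark in numbers, here is a small Ruby calculation. It assumes a constant 20% annual rate for all 20 years, which is a simplification of the rounded figures quoted in the talk; the method name is purely illustrative.

```ruby
# Multiplier after compounding at a fixed annual rate.
# Assumption: the rate is constant every year, which real traffic is not.
def growth_multiplier(annual_rate, years)
  (1 + annual_rate)**years
end

multiplier = growth_multiplier(0.20, 20)
puts format("About %.1fx over %d years", multiplier, 20)
```

Under that assumption, traffic ends up roughly 38 times its starting level.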
00:03:50.240 Currently, we have around 180,000 registered users on RubyGems.org, who have pushed approximately 192,000 gems and over a million and a half versions across all those gems. I believe this totals approximately 150 billion downloads of gems. Feel free to count the commas to verify this!
00:04:09.840 Additionally, we average about 20,000 requests per second. That works out to about 2 billion requests every weekday, and we hit a maximum of 225,000 requests per second at peak times. All of that equates to 7.5 terabytes an hour served, approximately 185 terabytes a day, or 4.5 petabytes per month, and about 54 to 55 petabytes per year. That's a lot of traffic, and all of you contribute to it every time you run 'gem install', 'bundle install', visit the website, or push a gem. So, give yourselves a round of applause!
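The per-day figures can be checked with a few lines of Ruby. This is only a sanity check on the rounded numbers from the talk; the method names are made up for the example.

```ruby
SECONDS_PER_DAY = 24 * 60 * 60

# Requests served in one day at a steady average rate.
def requests_per_day(requests_per_second)
  requests_per_second * SECONDS_PER_DAY
end

# Daily volume in terabytes, given an hourly rate.
def terabytes_per_day(terabytes_per_hour)
  terabytes_per_hour * 24
end

puts requests_per_day(20_000) # 1728000000
puts terabytes_per_day(7.5)   # 180.0
```

An average of 20,000 requests per second gives about 1.7 billion requests per day, which the talk rounds to 2 billion, and 7.5 terabytes an hour gives 180 terabytes a day, close to the quoted 185.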
00:05:15.240 Now, let's consider the cost of serving all these requests. The vast majority of our infrastructure comes from two providers. AWS is our primary server host. We use Amazon EKS to host the Rails application, served by application load balancers, with files stored in S3.
00:05:51.759 In total, that costs about $20,000 per month. Our other significant infrastructure partner is Fastly, which acts as our cache and DDoS protection layer. Fastly handles about 20,000 requests per second and 185 terabytes per day of traffic. Retail prices for that would be more than half a million dollars per month! Fortunately, Fastly is super generous, and they don't charge us for these services, allowing us to create and serve gems.
00:07:05.639 Now, let's discuss the people involved. There are two sides here: Ruby Central, which has been supporting RubyGems.org hosting for the past decade or so, and volunteers who help out in their spare time. Until three weeks ago, nobody worked full-time on RubyGems.org. Now I am an employee, which is excellent, but it also means you can now complain to me on a full-time basis about any issues with RubyGems.org.
00:08:51.639 Despite all of that, we need to maintain a 24/7 on-call rotation, because it’s hard to serve 10,000 requests per second without anyone on call. Without that support, problems can turn into outages, much like what happened ten years ago when RubyGems went down for five days. That was a bad time for many people.
00:09:10.200 Since then, we've only been paged a few times a year for minor incidents, and there haven't been any major outages in a long time. Let's keep it that way. Now, how on earth does a group of volunteers using sponsored hosting handle this kind of traffic?
00:09:38.040 Well, to achieve this, we optimize our architecture to handle the requests. It's not like we have an army of little elves typing out HTTP responses by hand; that just wouldn't be feasible for our amount of traffic. So how do we actually manage?
00:10:40.320 The secret weapon is that we enable caching for our traffic. We might be cheating, you could say, but it’s about using the tools and infrastructure appropriately. As engineers, we often try to minimize our workload, and in optimizing systems, laziness is a virtue.
00:11:15.920 One key lesson I've learned throughout my career in optimization is that the fastest work you can do is no work at all. This ties back to leveraging the work done by others instead of recreating the wheel. In our case, we allow simpler and more optimized systems to serve the vast majority of our traffic.
00:12:57.960 When we take a look at the lifecycle of a request to RubyGems.org, it starts with a client making a request that hits a Fastly edge point of presence, eventually routing through multiple servers until it reaches the backend.
00:14:45.600 We have two different backends: one serves static assets out of S3, and the other is our Rails application running in Docker containers on AWS EKS. Each layer in this lifecycle is optimized for different tasks. Fastly provides caching, while our Rails app handles complex requests.
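The request lifecycle described here, a cache in front of two backends, can be sketched as a read-through cache. This is a toy model under stated assumptions, not how Fastly or RubyGems.org is configured: the `EdgeCache` class, the rule that paths ending in `.gem` go to the static backend, and the example paths are all hypothetical, and there is no expiry or invalidation.

```ruby
# Hypothetical sketch: an edge cache in front of two origins.
class EdgeCache
  attr_reader :hits, :misses

  def initialize(static_origin:, app_origin:)
    @static_origin = static_origin # stand-in for the S3 backend
    @app_origin = app_origin       # stand-in for the Rails application
    @store = {}
    @hits = 0
    @misses = 0
  end

  # Serve from the cache when possible; otherwise ask an origin once
  # and remember the answer.
  def get(path)
    if @store.key?(path)
      @hits += 1
      return @store[path]
    end

    @misses += 1
    @store[path] = origin_for(path).call(path)
  end

  private

  # Assumed routing rule, for illustration only.
  def origin_for(path)
    path.end_with?(".gem") ? @static_origin : @app_origin
  end
end

static = ->(path) { "static bytes for #{path}" }
app = ->(path) { "rendered response for #{path}" }
cache = EdgeCache.new(static_origin: static, app_origin: app)

3.times { cache.get("/gems/example-1.0.0.gem") }
cache.get("/api/v1/gems/example.json")
puts "hits=#{cache.hits} misses=#{cache.misses}"
```

Only misses reach a backend, which is why the hit ratio matters so much at this traffic level.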
00:18:00.080 Let me share a quick story titled 'The Weekend I Got Paged'. In May, things went a bit sideways when our database became overloaded, causing requests to fail and triggering alerts. This peak had us hitting 225,000 requests per second, which led to a lot of panic.
00:19:38.760 We had deprecated the Dependency API, and that caused a significant increase in traffic. When we turned off that one endpoint, requests surged sharply. Our bandwidth costs skyrocketed, and we realized that old Bundler versions were falling back to the less efficient full index, compounding the problem.
00:21:01.200 To address this issue, we had to move quickly to minimize the number of requests to our Rails application by managing traffic more effectively. Understanding users' behaviors played a significant role in tackling the situation. We blocked some unreasonable IPs that were generating excessive requests.
00:23:56.120 However, as we encouraged users to upgrade to more recent versions, they began making requests using the compact index instead, generating even more traffic than before. The pattern of traffic revealed that while Fastly handled a large share of requests, cache misses still burdened our database.
00:25:59.040 To combat this, we decided to set up a system where the work could be done in the background. By pre-calculating the responses and caching them in S3, we could take some of the pressure off our database and Rails app, solving the initial traffic problem.
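The idea of pre-calculating responses can be sketched in Ruby as follows. This is a minimal illustration, not the RubyGems.org implementation: the `PrecomputedResponses` class is hypothetical, a plain Hash stands in for an S3 bucket, and the key and response body are invented.

```ruby
# Hypothetical sketch of precomputing responses into an object store.
class PrecomputedResponses
  def initialize(store = {})
    @store = store # stand-in for an S3 bucket; keys act as object keys
  end

  # Background step: run the expensive computation once and save the result.
  def refresh(key)
    @store[key] = yield
  end

  # Request path: a plain lookup that never touches the database.
  # Raises KeyError if nothing has been precomputed for the key.
  def fetch(key)
    @store.fetch(key)
  end
end

db_queries = 0
responses = PrecomputedResponses.new

# Pretend this block is the costly database query behind an endpoint.
responses.refresh("info/example") do
  db_queries += 1
  "precomputed body for example"
end

1_000.times { responses.fetch("info/example") }
puts "database queries: #{db_queries}"
```

In this sketch the expensive block runs only when `refresh` is called, so a thousand reads cost a single database query; the trade-off is that responses are only as fresh as the last refresh.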
00:28:18.720 This is a reminder of how vital our sponsors are! As a nonprofit, we can’t afford a high bandwidth bill, and the support from groups like Fastly and AWS allows us to function effectively.
00:29:37.360 In closing, I must say thank you all for your attention. Please ask questions and feel free to share anything about RubyGems, RubyGems.org, or maybe even about my brunch plans.
00:30:54.040 When these incidents happen, we have a 24/7 on-call rotation. There are four of us around the world, which enables us to react quickly and share the burden of incident management.
00:31:32.480 In those tense moments, having multiple people in the virtual room is incredibly valuable. It’s not just about solving the immediate problem; we need to consider the bigger picture.
00:32:50.200 Thank you for this discussion, and now we can open the floor for questions.
00:33:18.080 Thank you, Sam, for your informative talk! Now, does anyone have any questions? Please raise your hand.
00:34:09.560 Thank you once again, and let’s enjoy the next break before our next session starts.