00:00:00.040
The topic of handling 225k requests per second to RubyGems.org is up next. Please join me in welcoming your speaker, Samuel.
00:00:28.960
Good morning everyone! I know it's still early, and maybe we haven't had enough coffee yet, but we're going to talk a bit about everyone's favorite website that you never think about: RubyGems.
00:00:36.079
A little about me: I am pretty much everywhere on the internet. My name is Samuel, and I'm a maintainer of RubyGems, Bundler, and RubyGems.org.
00:00:49.039
You might know me for contributing bugs to all the above for at least a decade. My sincerest apologies for that! Currently, I am the Security Engineer in Residence at Ruby Central, thanks to AWS for sponsoring that role. What does that mean? It means thinking about security across the entire packaging ecosystem.
00:01:01.879
It's a lot of fun! I actually just shipped something called Trusted Publishing this past Thursday at a coffee shop here in Taipei. You'll have to stay tuned for a talk about that some other time.
00:01:17.000
Okay, so I promised you a talk about RubyGems.org. Let's begin with some Q&A. Show of hands if you have ever run 'gem install'. Great! This is an easy way to tell who's not paying attention to this talk. Now, show of hands if you've ever put this in a Gemfile. Huh! Some people here aren't using Bundler, so guess what? You make my life difficult.
00:01:58.039
You've all helped contribute to the traffic that RubyGems serves. Would it surprise you to hear that most of our traffic comes from people doing things like installing gems and resolving Gemfiles? How selfish of all of you! I'm, of course, kidding. I gave a talk at RubyConf San Diego last month. If you saw that, I apologize for some quick repetition, but I have some numbers that I like to share because they tend to surprise people.
00:02:51.720
RubyGems.org might not have the enormous scale of something like GitHub or npm, or whatever numbers were quoted earlier about Shopify, but we certainly don't make any revenue off the traffic we serve. However, the service has grown about 20% per year for the last 20 years. Compound growth is something!
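To make the compound growth claim concrete, here is a minimal Ruby sketch of the arithmetic. The only inputs taken from the talk are the 20% rate and the 20-year span; the module and method names are illustrative assumptions.

```ruby
# Compound growth arithmetic. Names are illustrative, not from the talk.
module Growth
  # Total multiplier after growing by `rate` each year for `years` years.
  def self.factor(rate, years)
    (1 + rate)**years
  end

  # Number of years needed to double at a constant yearly growth rate.
  def self.doubling_time(rate)
    Math.log(2) / Math.log(1 + rate)
  end
end

puts Growth.factor(0.20, 20).round(1)     # 38.3
puts Growth.doubling_time(0.20).round(1)  # 3.8
```

In other words, 20% per year for 20 years is roughly a 38x increase, with traffic doubling a little less than every four years.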
00:03:50.240
Currently, we have around 180,000 registered users on RubyGems.org, who have pushed approximately 192,000 gems and over a million and a half versions across all those gems. I believe this totals approximately 150 billion downloads of gems. Feel free to count the commas to verify this!
00:04:09.840
Additionally, we average about 20,000 requests per second. That works out to about 2 billion requests every weekday, and we hit a maximum of 225,000 requests per second at peak times. All of that equates to 7.5 terabytes an hour served, approximately 185 terabytes a day, or 4.5 petabytes per month, and about 54 to 55 petabytes per year. That's a lot of traffic, and all of you contribute to it every time you run 'gem install', 'bundle install', visit the website, or push a gem. So, give yourselves a round of applause!
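The figures above are rounded, so it can help to see how they relate to one another. The following Ruby sketch redoes the arithmetic; only the input numbers come from the talk, and the module and method names are assumptions made for illustration.

```ruby
# Sanity check of the quoted traffic figures.
# All names are illustrative; only the input numbers come from the talk.
module TrafficMath
  SECONDS_PER_DAY = 24 * 60 * 60

  # Average requests per second -> requests per day.
  def self.requests_per_day(rps)
    rps * SECONDS_PER_DAY
  end

  # Terabytes per hour -> terabytes per day.
  def self.tb_per_day(tb_per_hour)
    tb_per_hour * 24
  end

  # Petabytes per month -> petabytes per year.
  def self.pb_per_year(pb_per_month)
    pb_per_month * 12
  end
end

puts TrafficMath.requests_per_day(20_000)  # 1728000000, i.e. "about 2 billion"
puts TrafficMath.tb_per_day(7.5)           # 180.0, close to the quoted 185 TB/day
puts TrafficMath.pb_per_year(4.5)          # 54.0
puts 225_000 / 20_000.0                    # 11.25, the peak is about 11x the average
```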
00:05:15.240
Now, let's consider the cost of serving all these requests. The vast majority of our infrastructure comes from two providers. AWS is our primary server host. We use Amazon EKS to host the Rails application, served by application load balancers, with files stored in S3.
00:05:51.759
In total, that costs about $20,000 per month. Our other significant infrastructure partner is Fastly, which acts as our cache and DDoS protection layer. Fastly handles about 20,000 requests per second and 185 terabytes per day of traffic. Retail prices for that would be more than half a million dollars per month! Fortunately, both AWS and Fastly are super generous, and they don't charge us for these services, allowing us to host and serve gems.
00:07:05.639
Now, let's discuss the people involved. There are two sides here: Ruby Central, which has been supporting RubyGems.org hosting for the past decade or so, and volunteers who help out in their spare time. Until three weeks ago, nobody worked full-time on RubyGems.org. Now I am an employee, which is excellent, but it also means you can now complain to me on a full-time basis about any issues with RubyGems.org.
00:08:51.639
Despite being largely volunteer-run, we need to maintain a 24/7 on-call rotation, because it's hard to serve 20,000 requests per second without anyone on call. Without that coverage, problems can turn into outages, much like what happened ten years ago when RubyGems went down for five days. That was a bad time for many people.
00:09:10.200
Since then, we've only been paged a few times a year for minor incidents, and there haven't been any major outages in a long time. Let's keep it that way. Now, how on earth does a group of volunteers using sponsored hosting handle this kind of traffic?
00:09:38.040
Well, to achieve this, we optimize our architecture to handle the requests. It's not like we have an army of little elves typing out HTTP responses by hand; that just wouldn't be feasible for our amount of traffic. So how do we actually manage?
00:10:40.320
The secret weapon is that we enable caching for our traffic. We might be cheating, you could say, but it’s about using the tools and infrastructure appropriately. As engineers, we often try to minimize our workload, and in optimizing systems, laziness is a virtue.
00:11:15.920
One key lesson I've learned throughout my career in optimization is that the fastest work you can do is no work at all. This ties back to leveraging the work done by others instead of reinventing the wheel. In our case, we let simpler and more optimized systems serve the vast majority of our traffic.
00:12:57.960
When we take a look at the lifecycle of a request to RubyGems.org, it starts with a client making a request that hits a Fastly edge point of presence, eventually routing through multiple servers until it reaches the backend.
00:14:45.600
We have two different backends: one serves static assets out of S3, and the other is our Rails application running in Docker containers on Amazon EKS. Each layer in this lifecycle is optimized for different tasks. Fastly provides caching, while our Rails app handles complex requests.
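The request lifecycle described here can be sketched as an edge cache sitting in front of two backends. This is a simplified Ruby model, not how Fastly or RubyGems.org is actually configured: the `EdgeCache` class, the rule that paths ending in `.gem` are static, and the lambda backends are all assumptions made for illustration, and cache expiry is left out entirely.

```ruby
# Hypothetical model of an edge cache in front of two backends.
# Class name, routing rule, and backends are illustrative assumptions.
class EdgeCache
  attr_reader :hits, :misses

  def initialize(static_backend:, app_backend:)
    @static_backend = static_backend  # stands in for S3
    @app_backend = app_backend        # stands in for the Rails application
    @store = {}
    @hits = 0
    @misses = 0
  end

  # Serve from the cache when possible; otherwise ask the right backend
  # once and remember the response.
  def get(path)
    if @store.key?(path)
      @hits += 1
      return @store[path]
    end

    @misses += 1
    backend = static_path?(path) ? @static_backend : @app_backend
    @store[path] = backend.call(path)
  end

  private

  # Assumption: gem files are static; everything else goes to the app.
  def static_path?(path)
    path.end_with?(".gem")
  end
end

s3 = ->(path) { "file bytes for #{path}" }
rails = ->(path) { "rendered response for #{path}" }
edge = EdgeCache.new(static_backend: s3, app_backend: rails)

edge.get("/gems/rack-3.0.0.gem")    # miss, fetched from the static backend
edge.get("/gems/rack-3.0.0.gem")    # hit, served from the edge
edge.get("/api/v1/gems/rack.json")  # miss, fetched from the application
puts "hits=#{edge.hits} misses=#{edge.misses}"  # hits=1 misses=2
```

Every hit is a request that neither backend has to see, which is the "no work at all" idea applied to HTTP.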
00:18:00.080
Let me share a quick story titled 'The Weekend I Got Paged'. In May, things went a bit sideways when our database became overloaded, causing requests to fail and triggering pages. At the peak we were hitting 225,000 requests per second, which led to a lot of panic.
00:19:38.760
We had deprecated the Dependency API, and turning off that one endpoint caused requests to surge dramatically. Our bandwidth costs skyrocketed, and we realized that old Bundler versions were falling back to the less efficient full index, compounding the problem.
00:21:01.200
To address this issue, we had to move quickly to minimize the number of requests to our Rails application by managing traffic more effectively. Understanding users' behaviors played a significant role in tackling the situation. We blocked some unreasonable IPs that were generating excessive requests.
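The talk does not say how the excessive clients were identified. As one possible approach, the Ruby sketch below counts requests per IP within a window and flags any address over a limit. The `RequestCounter` class, the limit, and the example addresses (taken from documentation-only ranges) are all hypothetical.

```ruby
# Hypothetical per-IP request counter; not the mechanism used in the talk.
class RequestCounter
  def initialize(limit:)
    @limit = limit            # maximum requests allowed per window
    @counts = Hash.new(0)     # ip => requests seen in the current window
  end

  # Record one request from the given IP.
  def record(ip)
    @counts[ip] += 1
  end

  # True when the IP has exceeded the limit in the current window.
  def blocked?(ip)
    @counts[ip] > @limit
  end

  # All IPs currently over the limit.
  def offenders
    @counts.select { |_ip, count| count > @limit }.keys
  end

  # Start a new window.
  def reset!
    @counts.clear
  end
end

counter = RequestCounter.new(limit: 3)
5.times { counter.record("203.0.113.7") }
2.times { counter.record("198.51.100.4") }
p counter.offenders  # ["203.0.113.7"]
```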
00:23:56.120
However, as we encouraged users to upgrade to more recent versions, they began making requests using the compact index instead, generating even more traffic than before. The pattern of traffic revealed that while Fastly handled a large share of requests, cache misses still burdened our database.
00:25:59.040
To combat this, we decided to set up a system where the expensive work is handled in the background. By pre-calculating the responses and caching them in S3, we could take pressure off our database and Rails app, solving the initial traffic problem.
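A minimal Ruby sketch of that idea: the expensive computation runs only when a background job refreshes an entry, and the request path just reads the stored result. The `PrecomputedIndex` class, the `info/` key scheme, and the in-memory hash standing in for S3 are assumptions for illustration, not the actual RubyGems.org implementation.

```ruby
# Hypothetical sketch of pre-computing responses outside the request path.
class PrecomputedIndex
  def initialize(store: {}, &compute)
    @store = store      # stands in for an S3 bucket
    @compute = compute  # stands in for the expensive database query
  end

  # Called from a background job, for example after a gem is pushed.
  def refresh(name)
    @store["info/#{name}"] = @compute.call(name)
  end

  # Called on the request path: a plain read that never runs the query.
  def fetch(name)
    @store["info/#{name}"]
  end
end

db_queries = 0
index = PrecomputedIndex.new do |name|
  db_queries += 1
  "versions of #{name}"
end

index.refresh("rack")            # one query, done in the background
3.times { index.fetch("rack") }  # three requests, zero queries
puts db_queries                  # 1
```

The trade-off is that readers may briefly see a stale response until the next refresh, in exchange for database load that depends on how often gems change rather than on how often they are requested.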
00:28:18.720
This is a reminder of how vital our sponsors are! As a nonprofit, we can’t afford a high bandwidth bill, and the support from groups like Fastly and AWS allows us to function effectively.
00:29:37.360
In closing, I must say thank you all for your attention. Please ask questions and feel free to share anything about RubyGems, RubyGems.org, or maybe even about my brunch plans.
00:30:54.040
When these incidents happen, we have a 24/7 on-call rotation. There are four of us around the world, which enables us to react quickly and share the burden of incident management.
00:31:32.480
In those tense moments, having multiple people in the virtual room is incredibly valuable. It’s not just about solving the immediate problem; we need to consider the bigger picture.
00:32:50.200
Thank you for this discussion, and now we can open the floor for questions.
00:33:18.080
Thank you, Sam, for your informative talk! Now, does anyone have any questions? Please raise your hand.
00:34:09.560
Thank you once again, and let’s enjoy the break before our next session starts.