00:00:16.480
Hello everyone! Before I start this talk, I want to mention that I've copied everyone else's slides so that people at the back can see them if they need to.
00:00:24.000
This is about Bundler versus RubyGems. You can find the slides on Speaker Deck if you need them.
00:00:36.000
I'm André Arko, and I’m involved with many internet-related things. I work for Cloud City Development in San Francisco, where we focus on Rails and general web development and consulting. I also work on something called Bundler.
00:00:48.800
This talk is the story of a programming disaster and how I accidentally launched a distributed denial of service attack that took down rubygems.org.
00:01:02.480
This is a tale of overcoming adversity, and I apologize for what happened. It's not the best thing ever. The story starts many moons ago with the release of Bundler 1.0 on the 30th of August 2010, almost simultaneously with the release of Rails 3.0.
00:01:14.240
When Bundler 1.0 was released, it became apparent that users now needed to run 'bundle install,' and it took a long time. This left many people feeling unhappy.
00:01:26.799
I was alerted to this issue when someone sent me a link showing how Rubyists were sensitive to slow experiences.
00:01:31.760
So I thought that I should do something about it. The Bundler team spoke with the RubyGems team, and under Nick Quaranto's leadership, we devised a plan.
00:01:45.200
The plan was for RubyGems.org to provide an API specifically for Bundler, listing only the gems specified in your Gemfile, rather than having Bundler fetch the entire list of every gem that ever existed, which was how Bundler 1.0 functioned.
00:02:05.600
With the new version, 1.1, the process was significantly faster. For example, a toy Gemfile with just Rails and two other gems had a bundle install time of 32 seconds in 1.0, while it dropped to just 8 seconds in 1.1. This was a considerable improvement.
00:02:32.560
Bundler 1.1 officially launched on March 7, 2012, after nearly two years of pre-releases. We were thrilled with the speed improvements, but we completely failed to realize something crucial.
00:02:52.239
Almost no one runs 'gem install bundler --pre'; they simply run 'gem install bundler.' So during the pre-release period the API saw very little real traffic, and only after the final release did we discover that the API requests were a lot of work for the server to handle.
00:03:19.360
Back in 2012, RubyGems.org was hosted entirely on a single Rackspace VPS running Postgres, Redis, and multiple Unicorn servers. This was the single choke point for all the gems in the world.
00:03:41.920
Increasing the load on that one box was not a feasible long-term plan. Downloading a gem was cheap: it took a single database lookup and returned a 302 redirect to download from a CloudFront server.
00:04:02.720
The API operations, however, were complicated. They required intricate Postgres queries spread across the database, and the returned data had to be serialized into Ruby Marshal format before it was delivered.
00:04:30.720
Every time someone ran 'bundle install,' there was a good chance that Bundler had to repeat this request 10 or 20 times, which added substantially to the load.
00:04:49.520
In February 2012, the Bundler core team was essentially the only user of the API. By March, after the official release, the user base grew. By September, more and more people were actively using Bundler 1.1.
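A minimal sketch of why one install can turn into many API calls, assuming each response reveals new dependency names that then have to be looked up in a further request. The gem data, `fetch_dependencies`, and `resolve_names` are hypothetical stand-ins, not Bundler's actual code or the real API.

```python
# Hypothetical in-memory stand-in for the server's dependency data.
FAKE_INDEX = {
    "rails": ["activesupport", "actionpack"],
    "actionpack": ["activesupport", "rack"],
    "activesupport": ["i18n"],
    "rack": [],
    "i18n": [],
}


def fetch_dependencies(names):
    """Pretend API call: return {gem: [dependency names]} for the given gems."""
    return {name: FAKE_INDEX.get(name, []) for name in names}


def resolve_names(gemfile_gems):
    """Keep asking about newly discovered gems until nothing new turns up."""
    known = {}
    pending = set(gemfile_gems)
    round_trips = 0
    while pending:
        response = fetch_dependencies(sorted(pending))
        round_trips += 1
        known.update(response)
        # Collect every dependency mentioned in this response.
        discovered = {dep for deps in response.values() for dep in deps}
        # Only gems we have not asked about yet need another request.
        pending = discovered - set(known)
    return known, round_trips


known, round_trips = resolve_names(["rails"])
print(sorted(known), "resolved in", round_trips, "requests")
```

Even this toy Gemfile with a single entry needs several round trips, and every round trip is another expensive query on the server.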
00:05:05.120
Then came October. RubyGems.org vanished from the internet, and all gems seemed to be gone! As it turns out, this occurred because RubyGems.org didn’t have sufficient CPU power to inform users which gems they needed.
00:05:27.759
The entire API became unavailable, which freaked everyone out. As the RubyGems.org team worked to restore service, they found the site crashed again as users continuously re-ran bundle installs waiting for functionality to return.
00:05:47.440
By October last year, it wasn't only me messing up but a large and anxious user base frustrated that they couldn't install or deploy their applications.
00:06:05.120
We made a new plan, and this time it was shaped by what we had learned from those failures.
00:06:20.880
Since hosting the API on RubyGems.org's single server seemed unmanageable, the Bundler team (Terence Lee from Heroku and myself) decided to implement the API as a standalone app.
00:06:39.600
RubyGems.org agreed with this, and we created an app hosted on Heroku. It's a simple Sinatra app that talks to a Postgres database with a schema similar to RubyGems.org's database.
00:06:54.320
As we built and deployed it, we had questions about its functionality. Could it really handle all the Rubyists in the world? How would it scale up? We had many concerns about whether it would perform as needed.
00:07:15.040
Data measurement was crucial to answer these questions. We deliberated on what metrics to track and how to assess traffic performance and limitations.
00:07:35.520
To gather useful metrics, we collaborated with various services such as Librato for metrics and Papertrail for logging, ensuring we had visibility into our data.
00:07:50.320
With the right metrics in place, we learned about our response times and could compare the server's performance before and after the API transition.
00:08:03.360
From previous data, we knew that API requests sometimes took half a second to two seconds, with some taking up to 20 seconds because of the single-VPS architecture.
00:08:23.839
After better instrumentation, we found our median response times were around 80 milliseconds, but the 95th percentile was a shocking 600 milliseconds.
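To illustrate why the median and the 95th percentile can tell such different stories, here is a small sketch using made-up response times (not the real measurements) and a hypothetical nearest-rank `percentile` helper.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, so no floating-point rounding is involved.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Illustrative response times in milliseconds: mostly fast, with a slow tail.
samples_ms = [70, 75, 78, 80, 80, 80, 82, 85, 90, 95,
              60, 65, 72, 79, 81, 88, 92, 100, 600, 900]

print("median:", statistics.median(samples_ms), "ms")
print("p95:", percentile(samples_ms, 95), "ms")
```

A handful of slow requests barely moves the median but dominates the tail, which is why both numbers are worth tracking.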
00:08:42.480
The big discovery from this experience was that using a more robust PostgreSQL system significantly improved our performance. By upgrading our PostgreSQL hosting, query times dropped to 5 to 15 milliseconds, and we could manage many complex queries efficiently.
00:09:06.799
As we continued experimenting with performance metrics, we initially thought Redis caching was beneficial, only to find that it often slowed down our operations.
00:09:20.560
We discovered that few users maintained identical Gemfiles. This led to very low cache hit rates; thus, keeping data in Redis often led to increased wait times rather than reducing them. We turned off Redis and noticed improvements.
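The effect of a low hit rate can be worked through with simple arithmetic. This sketch assumes every request checks the cache first and falls through to the database on a miss; the 2 ms cache lookup and 10 ms query are illustrative numbers, not measurements from the app.

```python
def expected_latency_ms(hit_rate, cache_ms, db_ms):
    """Average latency when every request pays for a cache lookup and misses also pay for the database."""
    return cache_ms + (1 - hit_rate) * db_ms


def cache_helps(hit_rate, cache_ms, db_ms):
    """The cache only pays off if the average with it beats going straight to the database."""
    return expected_latency_ms(hit_rate, cache_ms, db_ms) < db_ms


# Illustrative numbers: a 2 ms cache lookup in front of a 10 ms query.
for hit_rate in (0.05, 0.20, 0.50):
    average = expected_latency_ms(hit_rate, cache_ms=2, db_ms=10)
    print(f"hit rate {hit_rate:.0%}: {average:.1f} ms, helps: {cache_helps(hit_rate, 2, 10)}")
```

Under these assumptions the break-even hit rate is `cache_ms / db_ms`, so the faster the database gets, the higher the hit rate a cache needs before it is worth having.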
00:09:43.920
Switching to a threaded server was also key. Heroku routes requests to dynos at random, so with Puma each dyno could work on several requests at once instead of queueing them.
00:10:02.080
However, we ran into deadlocks while using Puma, which left requests blocked indefinitely. This pushed me to switch to Thin, which handled requests more reliably.
00:10:17.200
Keeping track of users and request load became more complex. The Heroku router measurement data gave us useful charts for working out how many dynos we actually needed.
00:10:31.400
Once we adapted to the new routing system, we became concerned about capacity, so we set up services to monitor response times and track errors.
00:10:48.960
Sending alerts to PagerDuty allowed us to maintain good service levels while keeping the number of dynos, and therefore the cost on Heroku, to a minimum.
00:11:04.000
In the final stages of the investigation, we realized that server-side response times often don't reflect what clients experience. After setting up external monitors, we learned that five percent of requests timed out.
00:11:21.679
The problem came from Thin: each dyno handled only one request at a time, so requests timed out whenever a backlog built up. Switching to Unicorn with multiple child processes proved beneficial.
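A rough simulation of that effect, assuming evenly spaced arrivals, a fixed service time, and a router that picks a dyno at random without checking whether it is busy. All parameters are invented for illustration, and extra workers are assumed not to slow each other down.

```python
import heapq
import random


def simulate(num_dynos, workers_per_dyno, num_requests,
             arrival_gap_ms, service_ms, timeout_ms, seed=0):
    """Count requests that wait in a dyno's queue longer than timeout_ms."""
    rng = random.Random(seed)
    # For each dyno, a min-heap of the times at which each worker becomes free.
    free_at = [[0.0] * workers_per_dyno for _ in range(num_dynos)]
    timed_out = 0
    for i in range(num_requests):
        arrival = i * arrival_gap_ms
        # Random routing: the choice ignores how busy the dyno already is.
        dyno = free_at[rng.randrange(num_dynos)]
        # The request starts when the earliest-free worker on that dyno is available.
        start = max(arrival, heapq.heappop(dyno))
        if start - arrival > timeout_ms:
            timed_out += 1
        # Simplification: a timed-out request still occupies the worker.
        heapq.heappush(dyno, start + service_ms)
    return timed_out


common = dict(num_dynos=10, num_requests=5000, arrival_gap_ms=10,
              service_ms=80, timeout_ms=500)
print("one request at a time:", simulate(workers_per_dyno=1, **common))
print("four workers per dyno:", simulate(workers_per_dyno=4, **common))
```

Because both runs share a seed, they see the same routing decisions, so any difference comes only from how many requests a dyno can work on at once.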
00:11:38.720
Even though we made substantial progress, we noticed bugs such as requests to yanked gems arising from synchronization issues between RubyGems.org and the Bundler API.
00:11:53.920
We experimented with a background process that consistently fetched updates from RubyGems.org until we transitioned to a webhook-based system.
00:12:11.520
Webhooks provided syncing without pushing too much load, allowing for near real-time updates whenever a gem was added.
00:12:25.440
Despite challenges with reliability, combining webhook notifications with periodic background checks allowed us to maintain acceptable database synchronization.
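One way to picture that combination is the sketch below: webhooks apply changes as they arrive, and a periodic pass compares against the upstream list to repair anything a lost webhook missed, including yanked gems. The `GemMirror` class and its methods are hypothetical, not the app's real code.

```python
class GemMirror:
    """Hypothetical local copy of the set of (gem name, version) pairs."""

    def __init__(self):
        self.versions = set()

    def on_webhook(self, name, version):
        """Fast path: apply a single pushed gem as soon as we are told about it."""
        self.versions.add((name, version))

    def reconcile(self, upstream_versions):
        """Slow path: compare against the full upstream set and repair any drift."""
        upstream = set(upstream_versions)
        missing = upstream - self.versions   # pushes whose webhook never arrived
        yanked = self.versions - upstream    # versions removed upstream
        self.versions = upstream
        return missing, yanked


mirror = GemMirror()
mirror.on_webhook("rack", "1.5.2")
mirror.on_webhook("oldgem", "0.1.0")

# Pretend the webhook for rails was lost and oldgem was yanked upstream.
upstream = [("rack", "1.5.2"), ("rails", "4.0.0")]
missing, yanked = mirror.reconcile(upstream)
print("added:", sorted(missing), "removed:", sorted(yanked))
```

The webhook keeps the common case nearly real time, while the periodic pass bounds how long a missed notification can leave the two databases out of step.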
00:12:42.560
Today, thanks to these changes, Bundler installs faster than it ever has: at least twice as fast as before.
00:12:57.600
The functionality of the app server has greatly improved, and we have created opportunities for further enhancements to make our system even better.
00:13:10.960
We have an exciting plan in place. The Bundler and RubyGems teams are collaborating to create a new index format for both servers.
00:13:30.640
This new design will streamline updates, as it will utilize an append-only cacheable format with HTTP range requests to improve efficiency.
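The idea can be sketched without any network at all. Here `serve_range` stands in for a server that honours a `Range: bytes=N-` header, and the client asks only for bytes past the end of its local copy. The index contents and function names are invented for illustration and say nothing about the final format.

```python
def serve_range(index_bytes, range_header):
    """Tiny stand-in for a server answering an open-ended 'bytes=N-' range."""
    prefix = "bytes="
    if not (range_header.startswith(prefix) and range_header.endswith("-")):
        raise ValueError("only open-ended byte ranges are supported in this sketch")
    start = int(range_header[len(prefix):-1])
    # A real server would answer 416 when start is past the end; here we return nothing.
    return index_bytes[start:]


def update_local_copy(local, server_index):
    """Fetch only the bytes appended since our last sync and add them to the local copy."""
    new_bytes = serve_range(server_index, f"bytes={len(local)}-")
    return local + new_bytes, len(new_bytes)


# Because the file is append-only, yesterday's copy is a prefix of today's.
server_index = b"rack 1.5.2\nrails 4.0.0\n"
local = b"rack 1.5.2\n"

local, transferred = update_local_copy(local, server_index)
print(transferred, "bytes transferred;", "in sync:", local == server_index)
```

An append-only file never rewrites earlier bytes, which is what makes the cached prefix safe to keep and the range request sufficient for an update.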
00:13:50.960
We have a working prototype built and are excited to begin implementation, which could further speed up the server-side interactions.
00:14:05.920
To conclude, I'm thrilled to announce that Ruby Central has granted me funds to work on this project, providing dedicated time to improve Bundler.
00:14:21.760
If you're looking to contribute, I urge all of you to reach out. Whether it's improving documentation or delving into the code, there are plenty of exciting opportunities.
00:14:42.960
We are also collaborating with Rails Girls for a summer of code initiative to help onboard new contributors to the project.
00:15:00.480
Thank you for your attention, and I hope you will consider helping us in this important venture!
00:15:18.560
Thank you!