GoRuCo 2013

Deathmatch Bundler vs Rubygems.org

The story of the quest to make bundle install faster; in which Rubyists around the world inadvertently DDoS rubygems.org, witness its ignominious death, and vow to rebuild it from the ashes stronger than it was before. Then, a tour of the changes; why is Redis so much slower than Postgres? Marvel at the gorgeous metrics and graphs used to measure and optimize; gasp in delight as we track, live, exactly how many Heroku dynos are needed. Finally, a happy ending: today, the server responds to requests TWO ORDERS OF MAGNITUDE faster than it did before.

Help us caption & translate this video!

http://amara.org/v/FG9a/

00:00:16.480 Hello everyone! Before I start this talk, I want to mention that I've copied everyone else's slides so that people at the back can see them if they need to.
00:00:24.000 This is about Bundler versus RubyGems. You can find the slides on Speaker Deck if you need them.
00:00:36.000 I'm André Arko, and I’m involved with many internet-related things. I work for Cloud City Development in San Francisco, where we focus on Rails and general web development and consulting. I also work on something called Bundler.
00:00:48.800 This talk is the story of a programming disaster and how I accidentally launched a distributed denial of service attack that took down rubygems.org.
00:01:02.480 This is a tale of overcoming adversity, and I do apologize for what happened; it was not our finest moment. The story starts many moons ago with the release of Bundler 1.0 on the 30th of August 2010, almost simultaneously with the release of Rails 3.0.
00:01:14.240 When Bundler 1.0 was released, it became apparent that users now needed to run 'bundle install,' and it took a long time. This left many people feeling unhappy.
00:01:26.799 I was alerted to the issue when someone sent me a link that made it very clear how sensitive Rubyists are to anything slow.
00:01:31.760 So I thought that I should do something about it. The Bundler team spoke with the RubyGems team, and under Nick Quaranto's leadership, we devised a plan.
00:01:45.200 The plan was for RubyGems.org to provide an API specifically for Bundler, listing only the gems specified in your Gemfile, rather than having Bundler fetch the entire list of every gem that ever existed, which was how Bundler 1.0 functioned.
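To make that concrete, here is a rough sketch of what one of those API calls looks like from the client side. The endpoint path and the Marshal-encoded response shape reflect my understanding of the rubygems.org dependency API; treat the details as approximate rather than authoritative.

    # Sketch of a Bundler-style dependency API call (details approximate).
    require "net/http"
    require "uri"

    gems = %w[rails rack rake]
    uri  = URI("https://rubygems.org/api/v1/dependencies?gems=#{gems.join(',')}")

    response = Net::HTTP.get_response(uri)

    # The endpoint returns a Marshal dump of an array of hashes, one per
    # matching gem version, each listing that version's runtime dependencies.
    Marshal.load(response.body).each do |version|
      puts "#{version[:name]} #{version[:number]} #{version[:platform]}"
    end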
00:02:05.600 With the new version, 1.1, the process was significantly faster. For example, a toy Gemfile with just Rails and two other gems had a bundle install time of 32 seconds in 1.0, while it dropped to just 8 seconds in 1.1. This was a considerable improvement.
00:02:32.560 Bundler 1.1 officially launched on March 7, 2012, after nearly two years of pre-releases. We were thrilled with the speed improvements, but we completely failed to realize something crucial.
00:02:52.239 Almost no one runs 'gem install bundler --pre'; they simply run 'gem install bundler.' So once 1.1 was officially out, everyone started using the API at once, and we discovered that those API requests were a lot of work for the server to handle.
00:03:19.360 Back in 2012, RubyGems.org was hosted entirely on a single Rackspace VPS, equipped with a Postgres and Redis database, and multiple Unicorn servers. This became the single choke point for all the gems in the world.
00:03:41.920 Increasing the load on that one box was not a feasible long-term plan. Downloading a gem itself was cheap: a single database lookup and a 302 redirect to download from a CloudFront server.
00:04:02.720 The API calls, however, were expensive. They required intricate Postgres queries joining tables across the database, and the results then had to be serialized into Ruby's Marshal format before being delivered.
00:04:30.720 And every time someone executed 'bundle install,' that work typically had to be repeated 10 or 20 times, as Bundler asked for the dependencies of the dependencies it had just learned about, which substantially added to the load.
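In other words, resolving a Gemfile is iterative: Bundler asks about the gems you named, learns about their dependencies, and has to ask again about those, and so on. The sketch below shows the shape of that loop; fetch_dependencies is a hypothetical stand-in for the API call shown earlier.

    # Why one `bundle install` becomes many API calls: each round discovers new
    # gem names whose dependencies haven't been fetched yet.
    def resolve_names(initial_names)
      known = []
      queue = initial_names.uniq

      until queue.empty?
        versions = fetch_dependencies(queue)   # one round trip per batch
        known   |= queue

        # Queue up dependency names we haven't asked about yet.
        queue = versions.flat_map { |v| v[:dependencies].map(&:first) }.uniq - known
      end

      known
    end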
00:04:49.520 In February 2012, the Bundler core team was essentially the only group using the API. By March, after the official release, the user base grew. By September, more and more people were actively using Bundler 1.1.
00:05:05.120 Then came October. RubyGems.org vanished from the internet, and all gems seemed to be gone! As it turns out, this occurred because RubyGems.org didn’t have sufficient CPU power to inform users which gems they needed.
00:05:27.759 The entire API became unavailable, which freaked everyone out. As the RubyGems.org team worked to restore service, they found the site crashed again as users continuously re-ran bundle installs waiting for functionality to return.
00:05:47.440 By October of last year, this was no longer just my mess: there was a large and anxious user base frustrated that they couldn't install gems or deploy their applications.
00:06:05.120 We made a new plan, hoping to learn from those failures and devise a better approach.
00:06:20.880 Since serving the API from the main RubyGems.org box was clearly unsustainable, the Bundler team, consisting of Terence Lee from Heroku and myself, decided to implement the API as a standalone app.
00:06:39.600 The RubyGems.org team agreed, and we created an app hosted on Heroku. It's a simple Sinatra app that talks to a Postgres database laid out much like RubyGems.org's database.
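A heavily simplified sketch of what that standalone app looks like is below, using Sinatra with the Sequel gem for database access; the table and column names are illustrative stand-ins rather than the real bundler-api schema.

    # Minimal dependency endpoint as a standalone Sinatra app (illustrative schema).
    require "sinatra"
    require "sequel"

    DB = Sequel.connect(ENV.fetch("DATABASE_URL"))

    get "/api/v1/dependencies" do
      names = params[:gems].to_s.split(",").uniq

      payload = DB[:versions].where(name: names).map do |row|
        { name:         row[:name],
          number:       row[:number],
          platform:     row[:platform],
          dependencies: DB[:dependencies].where(version_id: row[:id])
                                         .select_map([:dep_name, :requirements]) }
      end

      content_type "application/octet-stream"
      Marshal.dump(payload)
    end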
00:06:54.320 As we built and deployed it, we had questions about its functionality. Could it really handle all the Rubyists in the world? How would it scale up? We had many concerns about whether it would perform as needed.
00:07:15.040 Data measurement was crucial to answer these questions. We deliberated on what metrics to track and how to assess traffic performance and limitations.
00:07:35.520 To gather useful metrics, we collaborated with various services such as Librato for metrics and Papertrail for logging, ensuring we had visibility into our data.
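As one example of the kind of instrumentation involved, a tiny Rack middleware can emit per-request timings into the logs, where a drain such as Papertrail or Librato's log-based metrics can pick them up. The "measure#" key below is just a logging convention, not something either service requires.

    # Log one timing line per request; Heroku forwards stdout to the log drain.
    class RequestTimer
      def initialize(app)
        @app = app
      end

      def call(env)
        started = Time.now
        status, headers, body = @app.call(env)
        elapsed_ms = ((Time.now - started) * 1000).round

        puts "measure#api.request.time=#{elapsed_ms}ms path=#{env['PATH_INFO']} status=#{status}"

        [status, headers, body]
      end
    end

    # In config.ru (usage):
    #   use RequestTimer
    #   run Sinatra::Application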
00:07:50.320 With the right metrics in place, we learned about our response times and discovered the server's performance before and after the API transition.
00:08:03.360 From the old logs, we knew that API requests on the single VPS typically took anywhere from half a second to two seconds, with some taking as long as 20 seconds.
00:08:23.839 After better instrumentation, we found our median response times were around 80 milliseconds, but the 95th percentile was a shocking 600 milliseconds.
00:08:42.480 The big discovery from this experience was that using a more robust PostgreSQL system significantly improved our performance. By upgrading our PostgreSQL hosting, query times dropped to 5 to 15 milliseconds, and we could manage many complex queries efficiently.
00:09:06.799 As we continued experimenting with performance metrics, we initially thought Redis caching was beneficial, only to find that it often slowed down our operations.
00:09:20.560 We discovered that few users maintained identical Gemfiles. This led to very low cache hit rates; thus, keeping data in Redis often led to increased wait times rather than reducing them. We turned off Redis and noticed improvements.
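For context, the caching we removed looked roughly like the sketch below, keyed on the exact set of gem names in the request; since almost no two Gemfiles share that set, nearly every lookup missed, and the Redis round trip was pure overhead on top of the Postgres query. The query_postgres_for helper is a hypothetical stand-in for the database lookup.

    # Cache keyed on the requested gem names; with a low hit rate this only adds latency.
    require "redis"
    require "digest/sha1"

    REDIS = Redis.new(url: ENV["REDIS_URL"])

    def dependencies_for(names)
      key = "deps:#{Digest::SHA1.hexdigest(names.sort.join(','))}"

      if (cached = REDIS.get(key))
        Marshal.load(cached)                        # the rare hit
      else
        payload = query_postgres_for(names)         # hypothetical DB lookup
        REDIS.setex(key, 300, Marshal.dump(payload))
        payload
      end
    end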
00:09:43.920 Additionally, switching to a threaded server was key. Heroku's router assigns requests to dynos at random, so a dyno can receive several requests at once; by running Puma, each dyno could work on more than one request at a time.
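The Puma side of that is just a few lines of configuration; the numbers here are placeholders rather than what we actually ran.

    # config/puma.rb: one process per dyno, a pool of threads per process,
    # so a dyno can serve several requests concurrently.
    threads 8, 16
    port        ENV.fetch("PORT", 3000)
    environment ENV.fetch("RACK_ENV", "production")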
00:10:02.080 However, we ran into deadlocks while using Puma, which left requests blocked indefinitely. That pushed me toward switching to Thin, which handled requests more predictably.
00:10:17.200 Keeping track of request load became more complex. The Heroku router's measurements gave us useful charts for working out how many dynos we actually needed.
00:10:31.400 Once we had adapted to the new routing system, we were still worried about running out of capacity, so we set up services to monitor response times and track errors.
00:10:48.960 Sending alerts to PagerDuty let us maintain good service levels while keeping the number of dynos, and therefore our Heroku bill, to a minimum.
00:11:04.000 In the final stage of this effort, we realized that server-side response times often don't reflect what clients actually experience. After setting up external monitoring, we learned that five percent of requests were timing out.
00:11:21.679 The problem was Thin: each dyno handled only one request at a time, so a backlog of requests on a dyno meant timeouts. Switching to Unicorn, with multiple child processes per dyno, proved much better.
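The Unicorn configuration for that is also small; the worker count and timeout below are placeholders, not the exact values we used.

    # config/unicorn.rb: several forked workers per dyno, so one slow request
    # no longer blocks every other request on that dyno.
    worker_processes Integer(ENV.fetch("WEB_CONCURRENCY", 4))
    timeout 15
    preload_app true

    before_fork do |server, worker|
      # Drop the shared database connection so each worker opens its own after forking.
      DB.disconnect if defined?(DB)
    end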
00:11:38.720 Even with that progress, we still saw bugs, such as the API returning yanked gems, arising from synchronization issues between RubyGems.org and the Bundler API.
00:11:53.920 We experimented with a background process that continually fetched updates from RubyGems.org, until we transitioned to a webhook-based system.
00:12:11.520 Webhooks provided syncing without pushing too much load, allowing for near real-time updates whenever a gem was added.
00:12:25.440 Despite challenges with reliability, combining webhook notifications with periodic background checks allowed us to maintain acceptable database synchronization.
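The webhook half of that sync is essentially one endpoint, sketched below; the payload field names and the record_version helper are assumptions for illustration, not the exact bundler-api code.

    # Receive a "gem pushed" webhook and record the new version so the
    # dependency endpoint starts returning it right away.
    require "sinatra"
    require "json"

    post "/hooks/gem_pushed" do
      payload = JSON.parse(request.body.read)

      # record_version is a hypothetical helper that writes into the same
      # Postgres tables the dependency endpoint reads.
      record_version(payload["name"], payload["version"], payload["platform"])

      status 204
    end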
00:12:42.560 Today, thanks to these changes, Bundler installs are faster than they have ever been, at least twice as fast as before.
00:12:57.600 The app server performs far better than it used to, and we've opened up opportunities for further improvements to make the system better still.
00:13:10.960 We have an exciting plan in place. The Bundler and RubyGems teams are collaborating to create a new index format for both servers.
00:13:30.640 The new design will streamline updates: it uses an append-only, cacheable format, so clients can fetch just the new entries with HTTP range requests instead of re-downloading the whole index.
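The trick that makes an append-only index cheap to keep current is sketched below: keep the bytes you already have, and ask the server only for what comes after them. The index URL and file path here are hypothetical; only the Range-request mechanics are the point.

    # Fetch just the tail of an append-only index using an HTTP Range request.
    require "net/http"
    require "uri"

    index_path = "versions.list"                       # local copy (hypothetical path)
    uri        = URI("https://rubygems.org/versions")  # hypothetical index URL

    local_size = File.exist?(index_path) ? File.size(index_path) : 0

    request = Net::HTTP::Get.new(uri)
    request["Range"] = "bytes=#{local_size}-"

    response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }

    # 206 Partial Content means the server honored the range; append the new bytes.
    File.open(index_path, "ab") { |f| f.write(response.body) } if response.code == "206"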
00:13:50.960 We have a working prototype built and are excited to begin implementation, which could further speed up the server-side interactions.
00:14:05.920 To conclude, I'm thrilled to announce that Ruby Central has granted me funds to work on this project, providing dedicated time to improve Bundler.
00:14:21.760 If you're looking to contribute, I urge all of you to reach out. Whether it's improving documentation or delving into the code, there are plenty of exciting opportunities.
00:14:42.960 We are also taking part in Rails Girls Summer of Code to help onboard new contributors to the project.
00:15:00.480 Thank you for your attention, and I hope you will consider helping us in this important venture!
00:15:18.560 Thank you!