How Shopify Scales Rails

by John Duff

The video "How Shopify Scales Rails", presented by John Duff at RailsConf 2013, details the evolution of Shopify, a hosted e-commerce platform built on Ruby on Rails, from its inception to its current scale. The talk highlights how Shopify has addressed significant growth challenges over the years through strategic scaling methods.

Key Points Discussed:

History and Growth of Shopify:
- Launched publicly in 2006, Shopify has grown tremendously, currently supporting over 40,000 online stores and processing up to half a million product sales daily.
- The codebase has seen continuous improvement; Shopify has upgraded from Rails 0.13.1 to Rails 3.2 without any rewrites.
Technical Architecture and Tools:
- The architecture involves 53 app servers running 1,590 Unicorn workers, with Memcache and Redis serving as caching and data stores.
- They use various tools, including New Relic for real-time monitoring, Splunk for log management, and StatsD for metrics gathering.
Scaling Strategies:
- Caching Layers: Shopify added page-level caching ("Cacheable") and model-level caching ("Identity Cache") to reduce database load for commonly requested data.
- Job Processing: Transitioned from Delayed Job to Resque for background job processing, raising throughput from 120 jobs/second to over 300 jobs/second.
- MySQL Optimization: Introduced techniques for MySQL, such as query optimization and proper index management, leading to significant improvements in performance.
Handling Traffic Spikes:
- Discussed specific scenarios, like handling high request rates during flash sales (over 200k requests per minute), emphasizing the need for flexible scaling solutions.
- The importance of knowing the system characteristics and continuously measuring performance to identify issues was underlined.
Service-Oriented Architecture:
- As Shopify grew, separating components into services became necessary to manage load effectively, such as moving image processing to a dedicated service, which allowed for tailored scaling and resource allocation.

Conclusions and Takeaways:

Adapting and evolving are crucial as needs change; understanding performance metrics and the system's limitations is key to successful scaling.
Continuous measurement and external tools play a vital role in recognizing what scaling strategies work and where improvements can be made without negative impacts.
The talk reinforces that scaling is not a one-size-fits-all solution; each part of the system must be approached based on current performance demands and predictions for future growth.

00:00:12.259 Thank you for being here. I'm going to talk about some high-level skills related to scaling at Shopify. This talk will include an overview of the Shopify stack, what we focus on when determining what to scale, and how Shopify has evolved through challenges and advancements. We will discuss scaling beyond just caching and touch upon how we split tasks to enhance performance.

00:00:51.860 For those who may not be familiar, Shopify is a hosted e-commerce platform that has been using Ruby on Rails since 2004. We have maintained the same code base since our inception, transitioning through various versions of Rails without a complete rewrite. As of now, we are on Rails 3.2. To give you an idea of what Shopify looks like, here is our admin interface, which allows users to manage orders and products effectively. We've had notable clients like GitHub and Google using Shopify as their e-commerce solution, which is something we take pride in.

00:01:55.680 Our stack is carefully curated, and we stay up to date with patch levels for optimal performance. Currently, we are using Unicorn as our web server, running multiple processes to handle requests. Our architecture has several components: load balancers, app servers, and services like Memcache and Redis to manage data. In total, we have 53 app servers running approximately 1,590 Unicorn workers, alongside job servers that handle background processing.

00:03:30.450 Over the past year, we have processed 9.9 million orders, averaging one order every 3.2 seconds, which is remarkable. During events like Cyber Monday, we peaked at 2,000 sales per minute. Sales are doubling year over year, so we need to prepare for even more load in the future. Our average RPM is about 50,000 with a 45-millisecond response time, though this fluctuates significantly.

00:04:41.560 Before diving deeper into the scaling challenges, I want to share a bit of our history. The first line of code for Shopify was written in 2004, just a year after David Heinemeier Hansson started on Basecamp. Tobi Lütke, our CEO, adopted Rails early on, and Shopify launched in June 2006 on the same code base we still run today.

00:06:04.919 When I looked back at our initial commit, the code base consisted of nearly 7000 lines of application code and 4000 lines of test code. We have experienced substantial growth since then, evolving our technology and practices to meet increasing demands. In the early days, the Rails features we now take for granted, such as nested includes and polymorphic associations, didn't exist. The gradual upgrades we've embraced have allowed us to maintain robustness in our application.

00:07:20.540 A crucial factor in scaling is understanding your system. There's no magical formula; it requires knowing your system's characteristics and identifying constraints. For us at Shopify, we serve one request per process using standard Unicorn, which leads us to focus on optimizing the speed of these processes. We also need to consider the limits imposed by database connections.

00:08:02.700 The maximum requests per minute (RPM) can be calculated by multiplying the number of workers by the number of requests each worker can serve in a minute, which is 60 seconds divided by the average response time. Using our earlier figures, our theoretical ceiling is about 1.3 million RPM, assuming we can minimize bottlenecks and keep the app performing well. When scaling Shopify, we primarily adjust two parameters: increasing the number of workers and reducing response times.
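
The formula above can be sketched in a few lines of Ruby. This is only an illustration: the function name is made up, and the 72 ms response time is an assumption chosen because it reproduces the roughly 1.3 million figure with 1,590 workers; the talk summary does not state it.

```ruby
# Theoretical ceiling on requests per minute when each worker
# serves exactly one request at a time (the one-request-per-process model).
def max_rpm(workers, avg_response_seconds)
  requests_per_worker_per_minute = 60.0 / avg_response_seconds
  (workers * requests_per_worker_per_minute).round
end

# The two levers: more workers, or faster responses.
puts max_rpm(1_590, 0.072)  # 1,590 workers at an assumed 72 ms average
puts max_rpm(1_590, 0.036)  # same workers, half the response time
puts max_rpm(3_180, 0.072)  # twice the workers, same response time
```

Halving the response time raises the ceiling by the same factor as doubling the worker count, which is why both levers matter. With the 45 ms average quoted earlier, the same formula would give about 2.1 million, so the ceiling is very sensitive to which response time is plugged in.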

00:09:55.740 Shopify's application is write-intensive, meaning we generate many database hits as transactions occur, making caching more complex. Additionally, various network requests for shipping services and payment gateways contribute to this complexity. Our priority is to ensure that both the storefront, where customers interact with products, and the checkout process, where transactions happen, operate efficiently.

00:11:05.180 An interesting challenge occurs with flash sales, exemplified by websites like chive.com, which experience spikes in requests during limited-time offers. During these flash sales, we often need to handle 200,000+ requests per minute, significantly higher than our usual load. This variability calls for an architecture that can absorb sudden surges in traffic while still serving normal traffic well.

00:12:56.640 To scale our application effectively, we emphasize measurement and monitoring. Tools like New Relic provide valuable insight into app performance and help identify problem areas. Coupled with Splunk for logs and StatsD for custom metrics, they give our team a comprehensive picture of where the bottlenecks are and whether our optimizations work.
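
To make the custom-metrics idea concrete, here is a minimal sketch of emitting StatsD counters and timers from Ruby. It uses only the standard library and the StatsD plain-text line protocol (name:value|type, sent over UDP). The helper names and metric names are hypothetical, and this is not the client library Shopify used.

```ruby
require "socket"

# Formats a metric in the StatsD line protocol: name:value|type
# ("c" is a counter, "ms" is a timer).
def statsd_line(name, value, type)
  "#{name}:#{value}|#{type}"
end

# Fire-and-forget UDP send; 8125 is the conventional StatsD port.
def send_metric(line, host: "127.0.0.1", port: 8125)
  socket = UDPSocket.new
  socket.send(line, 0, host, port)
rescue SystemCallError
  nil # metrics are best-effort and must never break the caller
ensure
  socket&.close
end

# Times a block and reports the elapsed milliseconds as a timer metric.
def measure(name)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  send_metric(statsd_line(name, (elapsed * 1000).round, "ms"))
  result
end

send_metric(statsd_line("checkout.completed", 1, "c"))   # hypothetical counter
total = measure("storefront.render") { (1..1_000).sum }  # hypothetical timer
```

Because the send is UDP and errors are swallowed, instrumenting a code path adds very little overhead and cannot take the request down if the metrics collector is unavailable.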

00:14:03.720 New Relic allows us to analyze response times and throughput to identify slow areas. Understanding where performance issues lie is crucial; solving the identified problems is often more straightforward now that we have a clear picture of our performance metrics. Similarly, Splunk aids in tracking historical performance data and identifying trends over time.

00:15:18.780 In the lead-up to high-traffic days like Cyber Monday, our team worked tirelessly on storefront performance. We dissected requests and identified database hits to eliminate unnecessary load on our servers. Each team member had a tangible goal: reducing load that could hurt our site's performance.

00:16:49.740 Caching is one of the essential techniques used to improve request performance at Shopify. We use a tool called Cacheable for controller-based page caching, which lets us serve gzipped content quickly. By using ETags and 304 Not Modified responses, we avoid re-sending pages the client already has.
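
The HTTP mechanism behind this can be sketched as follows. This is not the API of the page-caching tool mentioned above; it is a standalone illustration, with a hypothetical function name, of how an ETag comparison yields a 304 and how a full response is served gzipped.

```ruby
require "digest"
require "zlib"

# Returns [status, headers, body] for a rendered page.
# if_none_match is the client's If-None-Match header value, or nil.
def respond_with_cache_validation(page_html, if_none_match)
  # The ETag is a fingerprint of the content, quoted as HTTP requires.
  etag = %("#{Digest::MD5.hexdigest(page_html)}")

  # Client already holds this exact version: send no body at all.
  return [304, { "ETag" => etag }, ""] if if_none_match == etag

  # Otherwise send the page compressed, along with its ETag.
  [200, { "ETag" => etag, "Content-Encoding" => "gzip" }, Zlib.gzip(page_html)]
end

html = "<h1>Snowboards</h1>" * 50
status, headers, body = respond_with_cache_validation(html, nil)
puts "first visit:  #{status}, #{body.bytesize} bytes (uncompressed #{html.bytesize})"

status, _headers, body = respond_with_cache_validation(html, headers["ETag"])
puts "second visit: #{status}, #{body.bytesize} bytes"
```

In a real page cache the gzipped body and ETag would themselves be stored in the cache, so a hit avoids both rendering and compression.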

00:18:12.420 In our ongoing efforts to enhance performance, we introduced Identity Cache, which provides model-level caching by storing objects in Memcache. Caching is opt-in, so we can balance performance against flexibility and apply it only where it is needed.
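
A minimal sketch of opt-in, read-through model caching is shown below. Everything here is a stand-in: the in-memory hash replaces Memcache, the repository replaces the database layer, and the class and method names are hypothetical rather than Identity Cache's actual API.

```ruby
# In-memory stand-in for Memcache (assumption for the sake of a runnable example).
class FakeCache
  def initialize
    @data = {}
  end

  # Read-through: return the cached value, or compute and store it.
  def fetch(key)
    return @data[key] if @data.key?(key)
    @data[key] = yield
  end

  def delete(key)
    @data.delete(key)
  end
end

Product = Struct.new(:id, :title)

class ProductRepository
  attr_reader :db_reads

  def initialize(rows, cache)
    @rows = rows
    @cache = cache
    @db_reads = 0
  end

  # Regular lookup: always hits the "database".
  def find(id)
    @db_reads += 1
    Product.new(id, @rows.fetch(id)[:title])
  end

  # Opt-in cached lookup: callers choose this only where slightly
  # stale reads are acceptable.
  def fetch(id)
    @cache.fetch(cache_key(id)) { find(id) }
  end

  # Writes go to the database and expire the cached copy.
  def update(id, title)
    @rows[id] = { title: title }
    @cache.delete(cache_key(id))
  end

  private

  def cache_key(id)
    "product:#{id}"
  end
end

repo = ProductRepository.new({ 1 => { title: "Snowboard" } }, FakeCache.new)
repo.fetch(1)
repo.fetch(1)
puts "database reads after two cached fetches: #{repo.db_reads}"
```

Keeping the cached path as a separate method makes every cached read visible at the call site, which is the point of making the cache opt-in.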

00:19:51.619 Moving work that does not need to happen during the request out of the main request cycle is another vital strategy. We used to use Delayed Job, which stores jobs in the database and so added to database load, but we have since moved to Resque, a Redis-backed job queue. Resque processes jobs efficiently, which helps us keep response times fast.
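
The pattern of pushing work out of the request cycle can be sketched like this. A plain array stands in for the Redis list, and the job class and helper names are hypothetical. The payload shape (a class name plus arguments, serialized as JSON) follows the general style of Redis-backed queues, but this is not Resque's own code.

```ruby
require "json"

# In-memory stand-in for a Redis list (assumption for a runnable example).
QUEUE = []

# Hypothetical job: work that does not need to finish inside the web request.
class SendOrderConfirmation
  def self.perform(order_id)
    "confirmation sent for order #{order_id}"
  end
end

# Called from the request: only a small JSON payload is stored,
# so the request returns without waiting for the work itself.
def enqueue(job_class, *args)
  QUEUE.push(JSON.generate("class" => job_class.name, "args" => args))
end

# Called from a separate worker process: pop one payload,
# look up the job class by name, and run it.
def work_one
  payload = QUEUE.shift
  return nil if payload.nil?

  job = JSON.parse(payload)
  Object.const_get(job["class"]).perform(*job["args"])
end

enqueue(SendOrderConfirmation, 1001)
puts work_one
```

Because payloads are plain JSON, job arguments should be simple values such as IDs, with the worker reloading any records it needs.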

00:20:55.499 Furthermore, we rely on Redis for several functions including inventory reservation and session management. These tools enhance system responsiveness while requiring careful management to avoid data inconsistency.

00:22:19.130 Optimizing our MySQL server has become paramount as we rely heavily on it for data retrieval. We have implemented various optimizations, including ensuring full working sets are held in memory. This adjustment can lead to significant performance improvements.

00:23:54.180 We've also made use of tools like pt-query-digest, which lets us identify slow queries in our logs, and we avoid queries that create temporary tables, which further reduces load times. Combined with hiring a dedicated MySQL DBA, this has brought further gains in query performance.

00:25:50.270 Tuning involves careful adjustments, such as changing the table open cache limit. These changes can affect how the whole system behaves, so they must be balanced against what the underlying hardware and architecture can support. Additionally, we have switched to the tcmalloc memory allocator to reduce contention during high loads.

00:27:17.640 Once a transaction commits, follow-up work should run outside it so that it does not slow or block the main transaction. This lets actions such as webhooks run independently, so a slow or failing webhook cannot affect the primary transaction.
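
One way to picture this is a transaction object that collects follow-up work and runs it only after a successful commit. Rails has a callback named after_commit built on the same idea, but the class below is a hypothetical standalone sketch, not Rails code and not necessarily how Shopify implemented it.

```ruby
class Transaction
  def initialize
    @after_commit = []
  end

  # Register work to run only once the transaction has committed.
  def after_commit(&block)
    @after_commit << block
  end

  # Runs the block as the transaction body. Callbacks fire after a
  # successful commit and are discarded if the body raises.
  def run
    begin
      yield self
    rescue StandardError
      @after_commit.clear
      return :rolled_back
    end
    @after_commit.each(&:call)
    :committed
  end
end

sent = []

ok = Transaction.new.run do |tx|
  # ... write the order to the database here ...
  tx.after_commit { sent << "orders/create webhook" }  # hypothetical webhook topic
end

failed = Transaction.new.run do |tx|
  tx.after_commit { sent << "never sent" }
  raise "payment declined"
end

puts "#{ok}, #{failed}, sent: #{sent.inspect}"
```

The callbacks run outside the begin/rescue that guards the transaction body, so an error in a webhook cannot be mistaken for a failed transaction. In practice the callback would typically enqueue a background job rather than make the network call directly.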

00:28:25.499 Lastly, we emphasize leveraging services efficiently. As we expand, we've migrated components out of our main application to allow independent scaling. This thoughtful extraction minimizes complexity and enhances performance without sacrificing overall system stability.

00:29:48.180 To conclude, continuously adapting and evolving your systems based on performance data is vital. Scaling should rely on real-world data to inform decisions, ensuring you're making meaningful improvements, rather than adding complexity.

00:30:00.000 Thank you.