
Summarized using AI

How Shopify Scales Rails

John Duff • February 28, 2013 • Earth

The video titled "How Shopify Scales Rails" features John Duff discussing the evolution of Shopify from its inception to its current state as a leading e-commerce platform powered by Ruby on Rails. Shopify, founded nearly ten years ago, has grown to support over 40,000 online stores and processes up to half a million product sales daily. The video outlines the extensive scaling challenges Shopify has faced and how the company has evolved to handle these issues effectively through various strategies.

Key points discussed in the video include:
- Initial Setup and Evolution: Shopify's journey began in 2004 with a simple code base that has since expanded significantly while still using the same foundational architecture. Initial versions used Ruby 1.8.2 and a pre-1.0 version of Rails, which have evolved substantially over the years.
- Current Technical Stack: The current stack incorporates Ruby 1.9.3, Rails 3.2, MySQL (a Percona build with performance patches), Unicorn for web serving, and caching layers using Memcache and Redis. The system supports tens of thousands of requests per minute while maintaining low response times.
- Scaling Challenges: Duff emphasizes the importance of understanding system constraints, optimizing for the storefront, and the prioritization necessary for effective scaling practices. Monitoring tools such as New Relic and Splunk play vital roles in performance management.
- Caching Strategies: The adoption of caching tools like Cacheable and Identity Cache has allowed Shopify to increase efficiency by reducing unnecessary database hits and improving response times.
- Optimizing Background Jobs: Transitioning to a Redis-backed system for background job processing enhanced performance significantly.
- Database Optimization: Utilizing high-performance hardware alongside query optimizations and adjusting MySQL configurations ensures efficient data handling.
- Service Segmentation: Separating services based on system metrics has enabled Shopify to independently scale components like image handling, ultimately improving performance.

In conclusion, John Duff highlights that adaptability and a data-driven approach to scaling have been essential to Shopify's success. By continuously monitoring performance metrics and adjusting strategies accordingly, businesses can effectively navigate growth challenges in complex systems.

Tobi Lutke wrote the first line of code for Shopify nearly 10 years ago to power his own snowboard shop. Two years later Shopify launched to the public on a single webserver using Rails 0.13.1. Today Shopify powers over 40k online stores and processes up to half a million product sales per day across the platform. Over 30 people actively work on Shopify, which makes it the longest-developed and likely largest Rails code base out there.

This is the story of how Shopify has evolved to handle its immense growth over the years. This is what getting big is all about: evolving to meet the needs of your customers. You don't start out with a system and infrastructure that can handle a billion dollars in GMV. You evolve to it. You evolve by adding caching layers, hardware, and queuing systems, and by splitting your application into services.

This is the story of how we have tackled the various scaling pain points that Shopify has hit, what we have done to surpass them, and what we are doing to go even further.

Help us caption & translate this video!

http://amara.org/v/FGdU/

Big Ruby 2013

00:00:19.439 My talk is about how Shopify scales Rails. First up, just what is Shopify? I already had someone confuse us for Spotify, so we're not Spotify; we're Shopify. We provide hosted online e-commerce and make it really easy for anyone to set up an online store and get started.
00:00:27.279 I've grabbed a couple of pictures of our office. We're based in Ottawa, Canada, and we've made our office a really fun place to work, incorporating interesting elements from various offices in Silicon Valley. It's an amazing space, despite the cold weather we experience up here in Canada.
00:00:50.000 Now, getting into the talk, I want to share a bit about our stack. Currently, we're running Ruby 1.9.3 patch level 327 and Rails 3.2; we recently made a significant upgrade from Rails 3 to Rails 3.2. Our database uses a Percona flavor of MySQL 5.5, which adds instrumentation and performance patches.
00:01:14.799 Our stack includes Unicorn for web serving, along with Memcache and Redis. Overall, the structure supports 33 application servers, 1,172 Unicorn workers, five job servers, and 307 job workers, allowing us to process a lot of requests simultaneously, which is quite impressive.
00:01:55.840 This provides a snapshot of what a small part of our cluster looks like: we have firewalls and load balancers, our application servers are replicated, and we utilize search servers, Redis, and job servers. This standard setup lets us scale these components horizontally.
00:02:02.000 Here's a glimpse into the amount of code involved in the Shopify project itself. For the Shopify core product, we have about 55,000 lines of application code, as well as 80,000 lines of test code, 211 controllers, and 468 models. As you can see, there’s a significant amount of Ruby code and complexity at play.
00:02:25.280 What does our current scale look like? Last year, we processed 9.9 million orders for our merchants, translating to an order every 3.2 seconds. On Cyber Monday, which is traditionally our busiest day, we processed 2,008 sales per minute, which is pretty extraordinary.
00:03:05.120 One remarkable aspect of this is that in just a month or two, our regular transaction rate will double. We typically handle around 50,000 requests per minute on regular days and maintain a steady 45-millisecond response time. In total, last year we served about 13.3 billion requests, which underscores the sheer volume of activity occurring within Shopify.
00:03:58.959 Before diving deeper, I believe it's essential to consider our origin and evolution. Developers, myself included, tend to focus on new technologies, but understanding our past is crucial in recognizing what has been accomplished and solved over the years. The first line of code for Shopify was written in 2004 by Tobi in a coffee shop, and we released the first version in 2005.
00:05:48.160 What's noteworthy is that we've been using the same code base since that initial line of code was written. Over the past nine years, we've gone through numerous Rails upgrades and enhancements but maintained that core. As far as I know, we are one of the longest-running Rails applications out there.
00:06:37.760 When Shopify was first released in 2005, it had only 6,000 lines of code, significantly less test code, controllers, and models. Over the years, our code base has grown substantially, and we’ve adapted to manage that growth effectively. I dug through the earliest commits to get an idea of our initial stack and discovered that we were likely running on Ruby 1.8.2, using a pre-1.0 version of Rails, and MySQL 4.1.
00:07:12.240 We were also using Lighttpd ("Lighty") instead of Unicorn, and Memcache was integrated from the outset. It's interesting to see the evolution over time, given that various features we now consider standard were nonexistent back then, such as RJS, response formats, eager loading, and the like.
00:08:09.859 Now, let's discuss scaling your application. The most crucial part of scaling is understanding your system's unique characteristics and knowing how to alter or remove the constraints it imposes. The constraints on your system may look very different from those on our e-commerce platform, where elements like the admin area and the customer-facing storefront have very different requirements.
00:08:43.520 In our case, the storefront is paramount as we want transactions to occur. That’s why we optimize it, while optimizing the admin would be a waste of resources. Understanding this distinction helps prioritize our optimization efforts. In our setup, each request is assigned to one process, which is typical for most Rails applications.
00:09:30.880 If we want to enhance our performance, we can either add more processes or make individual processes more efficient. Earlier, I mentioned our requests per minute. To clarify, theoretical RPM is workers times 60, divided by the response time in seconds. Using that formula with the figures from the previous slide, we can estimate a potential RPM close to a million.
00:10:12.640 This potential figure highlights the need to optimize both the number of workers and the response times. Increasing workers is a straightforward task; however, improving response time can be more complex and represents the more engaging work of performance optimization.
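As a rough back-of-the-envelope sketch of that formula (using the figures quoted earlier in the talk; this is an illustration, not a capacity plan), the theoretical ceiling looks like this:

```ruby
# Theoretical ceiling on requests per minute: each Unicorn worker serves one
# request at a time, so throughput is bounded by workers * 60 / response time
# (in seconds).
def theoretical_rpm(workers, response_time_ms)
  workers * 60.0 / (response_time_ms / 1000.0)
end

theoretical_rpm(1_172, 45)  # => ~1.5 million RPM as an absolute upper bound
# With a more realistic blended response time, the ceiling lands closer to the
# "about a million" figure above; either way, the system actually runs at
# roughly 50k RPM, leaving plenty of headroom for flash-sale spikes.
```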
00:10:49.679 To decrease response times, it’s crucial to avoid network calls during requests. Any unavoidable calls should be accelerated. In Shopify’s context, the storefront and checkout are front and center, and we concentrate on these parts since they handle the bulk of user interactions.
00:11:57.679 For instance, we handle some interesting challenges with customers like The Chive, which sells products through flash sales. A recent New Relic graph shows our request metrics during one such event. Normally, we see around 55K RPM, but that can quickly spike to 200K during a flash sale. This dynamic necessitates maintaining a buffer to manage sudden spikes in traffic.
00:12:24.480 Monitoring everything we can is imperative. Understanding how quickly things operate in the system and observing critical metrics are key to effective scaling. Our toolset includes New Relic, which provides invaluable insights into our application's performance, Splunk for log analysis, and StatsD for additional performance metrics.
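As an illustration of what that StatsD instrumentation might look like, here is a minimal sketch using the statsd-instrument gem; the controller and metric names are made up for the example, not Shopify's actual ones:

```ruby
require 'statsd-instrument'

class ProductsController < ApplicationController
  def show
    # Time the lookup and count the page view; both metrics flow into the
    # same dashboards as the New Relic and Splunk data.
    StatsD.measure('storefront.product.lookup') do
      @product = Product.find(params[:id])
    end
    StatsD.increment('storefront.product.views')
  end
end
```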
00:13:25.199 We also have our in-house tool, Conan, which allows us to test our production system under load. It enables us to simulate substantial traffic patterns realistically and analyze performance during stress tests. It is vital to understand your system’s capacity through this testing, as findings from development environments can be misleading.
00:14:03.199 Data collected through tools like New Relic provides insights into web traffic, database performance, and Ruby execution times. Splunk, on the other hand, allows us to respond in real-time to the application status, setting alerts for unusual spikes in error codes like 500.
00:14:51.440 To streamline our operations, we implemented dashboards using the open-source framework Dashing, placing them throughout our office to communicate key metrics to the teams. This transparency encourages focus on improving performance, as recently, we managed to optimize our storefront requests down to 51 milliseconds.
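A Dashing dashboard is fed by small scheduled jobs that push values to widgets. A minimal sketch looks roughly like this; the metric source is a hypothetical helper, not one of our actual data feeds:

```ruby
# jobs/storefront_response_time.rb -- a Dashing job file.
# SCHEDULER and send_event are provided by the Dashing framework.
SCHEDULER.every '30s' do
  # fetch_storefront_p95_ms is a hypothetical helper that would pull the
  # latest response-time figure from whatever metrics store backs the board.
  send_event('storefront_response_time', value: fetch_storefront_p95_ms)
end
```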
00:15:30.400 However, other areas like the Shopify API lagged, averaging around 300 milliseconds. It’s crucial to know where to focus your efforts in order to drive impactful results.
00:15:51.920 Caching plays a significant role in performance enhancement. We developed a caching gem called Cacheable, which supports page caching at the controller level backed by Memcache. It stores and serves gzipped content to browsers that support it, cutting down both the data exchanged with Memcache and the processing done per request.
00:16:27.920 Cacheable employs generational caching: each cache key includes a generation value that is bumped whenever the underlying data changes, so stale entries simply stop being read and no explicit expiry is needed. It also includes middleware that serves a cache hit before the controller is ever invoked, ensuring rapid responses.
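The generational-key idea is simple enough to sketch without the gem itself; this is a simplified illustration of the concept, not Cacheable's actual API:

```ruby
require 'dalli'

# Each shop carries a generation value; bumping it makes every previously
# written key unreachable, so stale entries simply age out of Memcache.
class ShopCache
  def initialize(cache, shop_id)
    @cache = cache        # e.g. Dalli::Client.new('localhost:11211')
    @shop_id = shop_id
  end

  # Call this whenever the shop's data changes (product saved, theme updated).
  def bump_generation!
    @cache.set(generation_key, Time.now.to_f.to_s)
  end

  def fetch(key)
    generation = @cache.get(generation_key) || '0'
    namespaced = "shop:#{@shop_id}:gen:#{generation}:#{key}"
    cached = @cache.get(namespaced)
    return cached if cached

    yield.tap { |fresh| @cache.set(namespaced, fresh) }
  end

  private

  def generation_key
    "shop:#{@shop_id}:generation"
  end
end
```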
00:17:20.159 Recently, we also began caching 404 pages, which we hadn't previously paid much attention to. With random URLs being requested all the time, 404s occur frequently across Shopify, and caching those pages led to substantial performance improvements.
00:17:57.120 Another tool we built internally is Identity Cache, which caches full model objects, along with their associated objects, in Memcache. This is something we had overlooked for a while, but it has enhanced our performance significantly by cutting down how often we need to hit the database.
00:18:38.919 We’ve maintained an opt-in strategy for this caching method, allowing flexibility for different models and their associated data. This approach avoids unnecessary reads from the database while still ensuring fresh data when required.
00:19:14.880 The Identity Cache implementation is very straightforward: you include it in your models, specify which data to cache, and it automatically handles associated objects, yielding remarkable performance gains.
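For reference, this is roughly what usage looks like with the identity_cache gem that Shopify open-sourced; the model and fields here are illustrative, and the internal version described in the talk may have differed slightly:

```ruby
class Product < ActiveRecord::Base
  include IdentityCache

  cache_index :handle, unique: true    # adds Product.fetch_by_handle
  cache_has_many :images, embed: true  # caches the association alongside
end

# Reads go through Memcache and fall back to MySQL on a miss:
product = Product.fetch(product_id)          # instead of Product.find
by_slug = Product.fetch_by_handle('t-shirt') # via the cached index
images  = product.fetch_images               # cached association, no extra query
```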
00:19:53.360 In terms of the request itself, the simplest way to make it faster is often to remove work from it entirely. So we push work into background jobs: we started with Delayed Job and then migrated to Resque, which is faster and avoids contention on the database.
00:20:37.920 The transition from Delayed Job to the Redis-backed Resque led to big improvements in performance; we can now process significantly more jobs per second than our earlier system allowed. This shift also gave us Redis for ancillary processes such as our inventory reservation system.
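A minimal Resque job sketch (the job and helper names are illustrative, not Shopify's actual jobs): the class declares a queue and a `perform` class method, and enqueueing just pushes a small payload into Redis.

```ruby
class WebhookDeliveryJob
  @queue = :webhooks

  def self.perform(shop_id, payload)
    shop = Shop.find(shop_id)
    WebhookSender.deliver(shop, payload)  # hypothetical delivery helper
  end
end

# Enqueueing returns immediately, keeping the slow HTTP call out of the
# web request; one of the job workers later picks it up from Redis.
Resque.enqueue(WebhookDeliveryJob, shop.id, payload)
```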
00:21:24.559 As we continue to handle vast amounts of data, optimizing our MySQL setup is crucial. Our database servers are equipped with high-performance hardware, large amounts of RAM, and SSDs to ensure that a significant portion of our working set remains in memory, vastly enhancing response times.
00:22:11.439 Alongside hardware improvements, we actively perform query optimization utilizing tools such as pt-query-digest from Percona. We scrutinize our database operations to avoid generating temporary tables, which can undermine performance benefits.
00:22:48.480 We also focus on proper indexing to improve query efficiency, particularly for our main orders table, which is central to many operations. Tuning MySQL configurations has become an integral part of maintaining the performance of our database systems.
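As a hedged illustration of the kind of indexing this implies (the column choices are assumptions for the example, not Shopify's actual schema), a composite index matching the most common lookup pattern lets MySQL satisfy both the filter and the sort without scanning the table:

```ruby
class AddShopIdCreatedAtIndexToOrders < ActiveRecord::Migration
  def change
    # Orders are almost always scoped to a shop and listed newest-first,
    # so one composite index covers both the WHERE and the ORDER BY.
    add_index :orders, [:shop_id, :created_at]
  end
end
```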
00:23:29.520 Some specific configurations we tuned include the way InnoDB fetches metadata, which reduces overhead during routine database operations. It is also important to keep cache sizes and connection limits in step with one another.
00:24:06.560 Furthermore, we've switched memory allocators to perform better under write-heavy patterns, adjusted the auto-increment lock mode, and leveraged after-commit hooks to push additional work out of the transaction.
00:24:51.679 These techniques allow for better resource management. The after-commit mechanism ensures that actions like notifications and follow-up updates run only once the data has actually been persisted, which makes our order processing more predictable.
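A short sketch of the after-commit pattern described above (the callback and job names are illustrative): the hook fires only once the surrounding transaction has committed, so anything it enqueues is guaranteed to see the persisted row.

```ruby
class Order < ActiveRecord::Base
  # Runs after the INSERT's transaction commits rather than inside it, so a
  # background worker picking up the job will always find the order row.
  after_commit :enqueue_confirmation_email, on: :create

  private

  def enqueue_confirmation_email
    Resque.enqueue(OrderConfirmationJob, id)  # hypothetical job class
  end
end
```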
00:25:31.679 Lastly, I encourage consideration of services and their segmentation within your application. Creating independent services allows for individual scaling and management as necessary. It’s essential to determine when to extract services based on metrics that indicate their necessity.
00:26:15.440 For example, we recently segmented our image handling services because they began overshadowing other application responses, making it more challenging to derive meaningful data insights. Identifying when to separate such functions is a critical part of scaling your system effectively.
00:26:59.200 To sum up, adaptability in your approach is paramount. Use the data and your understanding of your system to guide decisions; the direction taken at Shopify has always been data-driven, progressing only when necessary and often prioritizing simplicity over premature optimization. By following these principles, I hope you can improve your own scaling strategies.