00:00:12.259
Thank you for being here. I'm going to talk about some high-level skills related to scaling at Shopify. This talk will include an overview of the Shopify stack, what we focus on when determining what to scale, and how Shopify has evolved through challenges and advancements. We will discuss scaling beyond just caching and touch upon how we split tasks to enhance performance.
00:00:51.860
For those who may not be familiar, Shopify is a hosted e-commerce platform that has been using Ruby on Rails since 2004. We have maintained the same code base since our inception, transitioning through various versions of Rails without a complete rewrite. As of now, we are on Rails 3.2. To give you an idea of what Shopify looks like, here is our admin interface, which allows users to manage orders and products effectively. We've had notable clients like GitHub and Google using Shopify as their e-commerce solution, which is something we take pride in.
00:01:55.680
Our stack is carefully curated, and we stay up to date with patch levels for optimal performance. Currently, we use Unicorn as our web server, running multiple worker processes to handle requests. The architecture has several components: load balancers, app servers, and services such as Memcached and Redis to manage data. In total, we have 53 app servers running approximately 1590 Unicorn workers, alongside job servers that handle background processing.
00:03:30.450
Over the past year, we have processed 9.9 million orders, averaging one order every 3.2 seconds, which is remarkable. During events like Cyber Monday, we peaked at 2000 sales per minute. Sales are doubling year over year, so we need to prepare for even more load in the future. Our average throughput is about 50,000 requests per minute (RPM) with a 45-millisecond response time, though this fluctuates significantly.
00:04:41.560
Before diving deeper into the scaling challenges, I want to share a bit of our history. The first line of code for Shopify was written in 2004, just a year after David Heinemeier Hansson started on Basecamp. Tobi Lütke, our CEO, adopted Rails early on, and Shopify launched in June 2006. We are still running on that same code base today.
00:06:04.919
When I looked back at our initial commit, the code base consisted of nearly 7000 lines of application code and 4000 lines of test code. We have experienced substantial growth since then, evolving our technology and practices to meet increasing demands. In the early days, the Rails features we now take for granted, such as nested includes and polymorphic associations, didn't exist. The gradual upgrades we've embraced have allowed us to maintain robustness in our application.
00:07:20.540
A crucial factor in scaling is understanding your system. There's no magical formula; it requires knowing your system's characteristics and identifying constraints. For us at Shopify, we serve one request per process using standard Unicorn, which leads us to focus on optimizing the speed of these processes. We also need to consider the limits imposed by database connections.
00:08:02.700
Requests per minute (RPM) can be calculated by multiplying the number of workers by the number of requests each worker can serve in a minute, that is, 60 divided by the average response time in seconds. Using our figures, our theoretical RPM ceiling could reach about 1.3 million, assuming we can minimize bottlenecks and optimize our app's performance. When scaling Shopify, we primarily adjust two parameters: increasing the number of workers and reducing response times.
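The capacity formula can be written out as a short calculation. This is a minimal sketch, assuming one request per worker at a time and an average response time given in whole milliseconds; the method names `max_rpm` and `workers_needed` are made up for illustration. With the 1,590 workers and 45 ms average quoted earlier it gives about 2.1 million RPM, and a ceiling near 1.3 million corresponds to an average of roughly 72 ms.

```ruby
# Theoretical throughput ceiling when each worker serves one request at a time.
# workers:         number of single-request worker processes
# avg_response_ms: average time a worker spends on one request, in milliseconds
def max_rpm(workers, avg_response_ms)
  raise ArgumentError, "response time must be positive" unless avg_response_ms.positive?

  # 60_000 ms per minute / ms per request = requests per worker per minute
  (workers * 60_000.0 / avg_response_ms).floor
end

# Workers needed to sustain a target RPM at a given average response time.
def workers_needed(target_rpm, avg_response_ms)
  (target_rpm * avg_response_ms).fdiv(60_000).ceil
end

puts max_rpm(1590, 45)            # => 2120000
puts max_rpm(1590, 72)            # => 1325000
puts workers_needed(200_000, 45)  # => 150
```

The two levers from the talk show up directly: adding workers raises the numerator, and shaving response time shrinks the denominator.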
00:09:55.740
Shopify's application is write-intensive, meaning we generate many database hits as transactions occur, making caching more complex. Additionally, various network requests for shipping services and payment gateways contribute to this complexity. Our priority is to ensure that both the storefront, where customers interact with products, and the checkout process, where transactions happen, operate efficiently.
00:11:05.180
An interesting challenge occurs with flash sales, exemplified by websites like chive.com, which experience spikes in requests during limited-time offers. During these flash sales, we often need to handle 200,000+ requests per minute, significantly higher than our usual load. This variability calls for an architecture that can absorb sudden surges in traffic while continuing to serve normal traffic for every other shop.
00:12:56.640
To effectively scale our application, we emphasize the importance of data measurement and monitoring. Tools like New Relic provide valuable insights into app performance and identify problem areas. Coupled with Splunk for logs and statsd for custom metrics, our team gains a comprehensive understanding of performance bottlenecks and optimizations.
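Custom metrics of the kind mentioned here are usually sent to statsd as small plain-text UDP datagrams of the form `name:value|type`. The following is a minimal sketch of such a client, not the client Shopify uses; the class name `TinyStatsd`, the metric names, and the default host and port are assumptions for illustration.

```ruby
require "socket"

# Minimal statsd client sketch. statsd accepts plain-text UDP datagrams of the
# form "name:value|type" (c = counter, ms = timer).
class TinyStatsd
  def initialize(host = "127.0.0.1", port = 8125, socket: UDPSocket.new)
    @host = host
    @port = port
    @socket = socket
  end

  # Counts an event, e.g. a completed checkout.
  def increment(name, by = 1)
    deliver("#{name}:#{by}|c")
  end

  # Reports a duration in milliseconds.
  def timing(name, milliseconds)
    deliver("#{name}:#{milliseconds.round}|ms")
  end

  # Times a block with a monotonic clock, reports it, and returns the block's value.
  def time(name)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000
    timing(name, elapsed_ms)
    result
  end

  private

  # Fire-and-forget: a metrics failure must never break the request being measured.
  def deliver(datagram)
    @socket.send(datagram, 0, @host, @port)
    datagram
  rescue SystemCallError
    datagram
  end
end

statsd = TinyStatsd.new
statsd.increment("checkout.completed")         # hypothetical metric name
statsd.time("storefront.render") { 21 * 2 }    # hypothetical metric name
```

UDP keeps the cost of instrumentation low: the app never waits for the metrics server, so it is cheap to measure many code paths.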
00:14:03.720
New Relic allows us to analyze response times and throughput to identify slow areas. Understanding where performance issues lie is crucial; solving the identified problems is often more straightforward now that we have a clear picture of our performance metrics. Similarly, Splunk aids in tracking historical performance data and identifying trends over time.
00:15:18.780
In the lead-up to high-traffic days like Cyber Monday, our team worked tirelessly on storefront performance. We dissected requests and identified database hits to eliminate unnecessary load on our servers. Each team member had a tangible goal: to reduce the load that could hinder our site's performance.
00:16:49.740
Caching is one of the essential techniques used to improve request performance at Shopify. We use a tool called Cacheable for controller-level page caching, which lets us serve gzipped content quickly. By using ETags and 304 Not Modified responses, we avoid re-sending pages the client already has.
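The combination of a page cache, gzip, and conditional requests can be sketched in plain Ruby. This is an illustration of the general technique only, not the API of the gem mentioned above; the method name `cached_page_response`, the Hash used as a cache, and the cache key are assumptions.

```ruby
require "digest"
require "zlib"

# Returns a Rack-style [status, headers, body] triple.
# cache:         any Hash-like store (a plain Hash stands in for a real cache)
# cache_key:     identifies the page, e.g. shop plus path
# if_none_match: the ETag the client sent back, or nil on a first visit
# The block renders the page and is only called on a cache miss.
def cached_page_response(cache, cache_key, if_none_match)
  entry = cache[cache_key] ||= begin
    html = yield
    { etag: %("#{Digest::MD5.hexdigest(html)}"), gzipped: Zlib.gzip(html) }
  end

  if if_none_match == entry[:etag]
    # The client already has this exact page: send headers only.
    [304, { "ETag" => entry[:etag] }, []]
  else
    # Serve the pre-compressed bytes straight from the cache.
    [200, { "ETag" => entry[:etag], "Content-Encoding" => "gzip" }, [entry[:gzipped]]]
  end
end

cache = {}
status, headers, = cached_page_response(cache, "shop-1/products", nil) { "<h1>Products</h1>" }
puts status                                                                        # first visit
puts cached_page_response(cache, "shop-1/products", headers["ETag"]) { "" }.first  # repeat visit
```

Storing the page already gzipped means a cache hit costs neither a render nor a compression pass, and a matching ETag avoids sending the body at all.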
00:18:12.420
In our ongoing efforts to enhance performance, we introduced IdentityCache, which provides model-level caching by storing records in Memcached. Caching is opt-in: code has to ask for the cached copy explicitly, so we get the performance benefit while keeping the option of going straight to the database where stale data is not acceptable.
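Opt-in, read-through model caching can be sketched as follows. This is a toy under stated assumptions, not the library's implementation: a Hash stands in for Memcached, another Hash stands in for the database, and `ReadThroughCache`, `ProductRepository`, and `Product` are hypothetical names.

```ruby
# Read-through cache: on a miss, run the block, store the result, return it.
class ReadThroughCache
  attr_reader :misses

  def initialize(store = {})
    @store = store   # stand-in for Memcached
    @misses = 0
  end

  def fetch(key)
    return @store[key] if @store.key?(key)

    @misses += 1
    @store[key] = yield
  end

  def delete(key)
    @store.delete(key)
  end
end

Product = Struct.new(:id, :title)

class ProductRepository
  def initialize(rows, cache)
    @rows = rows     # id => Product, stand-in for the database
    @cache = cache
  end

  # Plain lookup: always hits the "database".
  def find(id)
    @rows.fetch(id)
  end

  # Opt-in cached lookup: callers choose this when slightly stale data is fine.
  def fetch(id)
    @cache.fetch(cache_key(id)) { find(id) }
  end

  # Writes go to the "database" and expire the cached copy.
  def update_title(id, title)
    @rows[id] = Product.new(id, title)
    @cache.delete(cache_key(id))
  end

  private

  def cache_key(id)
    "product:#{id}"
  end
end

cache = ReadThroughCache.new
repo = ProductRepository.new({ 1 => Product.new(1, "Mug") }, cache)
repo.fetch(1)
repo.fetch(1)
puts cache.misses   # only the first fetch reached the "database"
```

Keeping `find` and `fetch` separate is the point of the opt-in design: code paths that must see fresh data keep using the uncached call.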
00:19:51.619
Removing unnecessary work from the main request cycle is another vital strategy. We used to store jobs in the database with Delayed Job, which added database load, but we have since moved to Resque, a Redis-backed job queue. Resque processes jobs efficiently and helps us keep response times low.
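The idea of pushing work out of the request cycle onto a queue can be sketched like this. It is a simplified model, not the job library itself: an in-memory list stands in for a Redis list, and `TinyQueue`, `SendOrderEmail`, and the `mailers` queue name are hypothetical.

```ruby
require "json"

# Toy job queue: the web process enqueues a small JSON payload and returns
# immediately; a separate worker later pops the payload and runs the job.
class TinyQueue
  def initialize
    @lists = Hash.new { |hash, key| hash[key] = [] }   # stand-in for Redis lists
  end

  # Called inside the request: cheap, no job logic runs here.
  def enqueue(job_class, *args)
    payload = JSON.generate("class" => job_class.name, "args" => args)
    @lists[job_class.queue_name] << payload
  end

  # Called by a worker process: runs at most one job, returns false if idle.
  def work_one(queue_name)
    payload = @lists[queue_name].shift
    return false unless payload

    job = JSON.parse(payload)
    Object.const_get(job["class"]).perform(*job["args"])
    true
  end
end

# A job is a class with a queue name and a class-level perform method.
class SendOrderEmail
  DELIVERED = []

  def self.queue_name
    "mailers"
  end

  def self.perform(order_id, email)
    DELIVERED << "order #{order_id} -> #{email}"
  end
end

queue = TinyQueue.new
queue.enqueue(SendOrderEmail, 42, "buyer@example.com")   # request ends here
queue.work_one("mailers")                                # worker runs later
puts SendOrderEmail::DELIVERED.inspect
```

Because the payload is only a class name and simple arguments, the queue store never needs to know anything about the application's models.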
00:20:55.499
Furthermore, we rely on Redis for several functions including inventory reservation and session management. These tools enhance system responsiveness while requiring careful management to avoid data inconsistency.
00:22:19.130
Optimizing our MySQL server has become paramount as we rely heavily on it for data retrieval. We have implemented various optimizations, including ensuring full working sets are held in memory. This adjustment can lead to significant performance improvements.
00:23:54.180
We've also made use of tools like pt-query-digest, which lets us identify slow queries in our logs and avoid operations that create temporary tables, further reducing load times. Combined with hiring a dedicated MySQL DBA, this has given us further gains in query performance.
00:25:50.270
Tuning involves careful adjustments, such as changing the table_open_cache limit. These changes can affect how the system behaves, so they must be balanced against the capabilities of the underlying hardware and architecture. Additionally, we have switched to the tcmalloc memory allocator to reduce contention under high load.
00:27:17.640
It is also crucial to move follow-up work into after-commit hooks so that it does not hinder the main transaction. Actions such as webhooks then run once the transaction has committed, instead of holding it open while they wait on the network.
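The ordering this relies on can be sketched with a toy transaction object. This is an illustration of the pattern, not Rails' own callback machinery; `TinyTransaction` and the webhook strings are hypothetical.

```ruby
# Toy transaction that queues follow-up work and runs it only after commit.
class TinyTransaction
  def initialize
    @hooks = []
  end

  # Register work to run after the transaction has committed.
  def after_commit(&block)
    @hooks << block
  end

  def run_hooks
    @hooks.each(&:call)
  end

  # Runs the block as the "transaction". Returns true if it committed.
  def self.run
    tx = new
    begin
      yield tx          # database writes would happen here
    rescue StandardError
      return false      # rolled back: queued hooks are dropped
    end
    tx.run_hooks        # committed: now do the slow, external work
    true
  end
end

events = []

TinyTransaction.run do |tx|
  tx.after_commit { events << "webhook: order created" }
  events << "write: order row"
end

TinyTransaction.run do |tx|
  tx.after_commit { events << "webhook: never sent" }
  raise "payment declined"   # simulates a failure inside the transaction
end

puts events.inspect
```

The webhook is registered first but runs after the write, and the failed transaction sends nothing, so external systems never hear about data that was rolled back.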
00:28:25.499
Lastly, we emphasize leveraging services efficiently. As we expand, we've migrated components out of our main application to allow independent scaling. This thoughtful extraction minimizes complexity and enhances performance without sacrificing overall system stability.
00:29:48.180
To conclude, continuously adapting and evolving your systems based on performance data is vital. Scaling should rely on real-world data to inform decisions, ensuring you're making meaningful improvements, rather than adding complexity.
00:30:00.000
Thank you.