00:00:19.439
My talk is about how Shopify scales Rails. First up, just what is Shopify? I already had someone confuse us for Spotify, so we're not Spotify; we're Shopify. We provide hosted online e-commerce and make it really easy for anyone to set up an online store and get started.
00:00:27.279
I've grabbed a couple of pictures of our office. We're based in Ottawa, Canada, and we've made our office a really fun place to work, incorporating interesting elements from various offices in Silicon Valley. It's an amazing space, despite the cold weather we experience up here in Canada.
00:00:50.000
Now, getting into the talk, I want to share a bit about our stack. Currently, we're running Ruby 1.9.3 patch level 327 and Rails 3.2; we recently made a significant upgrade from Rails 3 to Rails 3.2. Our database uses a Percona flavor of MySQL 5.5, which adds instrumentation and performance patches.
00:01:14.799
Our stack includes Unicorn for web serving, along with Memcache and Redis. Overall, the structure supports 33 application servers, 1,172 Unicorn workers, five job servers, and 307 job workers, allowing us to process a lot of requests simultaneously, which is quite impressive.
00:01:55.840
This provides a snapshot of what a small part of our cluster looks like: we have firewalls and load balancers out front, our application servers are replicated behind them, and we run search servers and Redis-backed job servers. This standard setup lets us scale each of these components horizontally.
00:02:02.000
Here's a glimpse into the amount of code involved in the Shopify project itself. For the Shopify core product, we have about 55,000 lines of application code, as well as 80,000 lines of test code, 211 controllers, and 468 models. As you can see, there’s a significant amount of Ruby code and complexity at play.
00:02:25.280
What does our current scale look like? Last year, we processed 9.9 million orders for our merchants, translating to an order every 3.2 seconds. On Cyber Monday, which is traditionally our busiest day, we processed 2,008 sales per minute, which is pretty extraordinary.
00:03:05.120
One remarkable aspect of this is that in just a month or two, our regular transaction rate will double. We typically handle around 50,000 requests per minute on regular days and maintain a steady 45-millisecond response time. In total, last year we served about 13.3 billion requests, which underscores the sheer volume of activity occurring within Shopify.
00:03:58.959
Before diving deeper, I believe it's essential to consider our origin and evolution. Developers, myself included, tend to focus on new technologies, but understanding our past is crucial in recognizing what has been accomplished and solved over the years. The first line of code for Shopify was written in 2004 by Toby in a coffee shop, and we released the first version in 2005.
00:05:48.160
What's noteworthy is that we've been using the same code base since that initial line of code was written. Over the past nine years, we've gone through numerous Rails upgrades and enhancements but maintained that core. As far as I know, we are one of the longest-running Rails applications out there.
00:06:37.760
When Shopify was first released in 2005, it had only 6,000 lines of code, significantly less test code, controllers, and models. Over the years, our code base has grown substantially, and we’ve adapted to manage that growth effectively. I dug through the earliest commits to get an idea of our initial stack and discovered that we were likely running on Ruby 1.8.2, using a pre-1.0 version of Rails, and MySQL 4.1.
00:07:12.240
We were also using Lighty instead of Unicorn, and Memcache was integrated from the outset. It's interesting to see the evolution over time, given that various features we now consider standard were nonexistent back then, such as RJS, response formats, eager loading, and the like.
00:08:09.859
Now, let's discuss scaling your application. The most important part of scaling is understanding your own system's characteristics and knowing how to alter or remove the constraints it imposes. The constraints on your system may be very different from the ones on our e-commerce platform, where the key distinction is between the merchant admin and the customer-facing storefront.
00:08:43.520
In our case, the storefront is what matters most, because that's where transactions happen. That's why we pour optimization effort into it, while spending the same effort on the admin would be a waste of resources. Understanding this distinction helps us prioritize our optimization work. In our setup, each request is handled by one process, which is typical for most Rails applications.
00:09:30.880
If we want more throughput, we can either add more processes or make individual processes faster. Earlier, I mentioned our requests per minute. To clarify, RPM works out to the number of workers times 60, divided by the average response time in seconds. Plugging in the numbers from the previous slide gives a theoretical ceiling on the order of a million requests per minute.
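As a rough back-of-the-envelope illustration, using the worker count and response time from the earlier slides (and ignoring the fact that real traffic is never spread perfectly across workers), the arithmetic looks something like this:

```ruby
# Theoretical throughput ceiling: RPM = workers * 60 / response time (seconds).
workers         = 1_172   # Unicorn workers from the earlier slide
response_time_s = 0.045   # 45 ms average response time

theoretical_rpm = workers * 60 / response_time_s
# => on the order of a million-plus requests per minute in the ideal case;
#    actual capacity is lower because load is never distributed perfectly.
```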
00:10:12.640
This potential figure highlights the need to optimize both the number of workers and the response times. Increasing workers is a straightforward task; however, improving response time can be more complex and represents the more engaging work of performance optimization.
00:10:49.679
To decrease response times, it’s crucial to avoid network calls during requests. Any unavoidable calls should be accelerated. In Shopify’s context, the storefront and checkout are front and center, and we concentrate on these parts since they handle the bulk of user interactions.
00:11:57.679
For instance, we handle some interesting challenges from customers like The Chive, which sells products through flash sales. A recent New Relic graph shows our request metrics during one such event: normally we see around 55K RPM, but that can quickly spike to 200K during a flash sale. This dynamic forces us to maintain headroom to absorb sudden spikes in traffic.
00:12:24.480
Monitoring everything we can is imperative. Understanding how quickly things operate in the system and observing critical metrics are key to effective scaling. Our toolset includes New Relic, which provides invaluable insights into our application's performance, Splunk for log analysis, and StatsD for additional performance metrics.
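To give a flavour of what that instrumentation looks like in the code, here is a minimal sketch using Shopify's open-source statsd-instrument gem; the metric names and the complete_checkout method are made up for illustration:

```ruby
require 'statsd-instrument'

class CheckoutsController < ApplicationController
  def create
    # Time the critical path and report the duration to StatsD.
    StatsD.measure('storefront.checkout.duration') do
      complete_checkout  # placeholder for the real checkout work
    end
    # Count successes so dashboards and alerts can watch the rate.
    StatsD.increment('storefront.checkout.success')
  end
end
```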
00:13:25.199
We also have our in-house tool, Conan, which allows us to test our production system under load. It enables us to simulate substantial traffic patterns realistically and analyze performance during stress tests. It is vital to understand your system’s capacity through this testing, as findings from development environments can be misleading.
00:14:03.199
Data collected through tools like New Relic gives us insight into web traffic, database performance, and Ruby execution times. Splunk, on the other hand, lets us react to the application's status in real time, with alerts for unusual spikes in HTTP 500 errors.
00:14:51.440
To streamline our operations, we implemented dashboards using the open-source framework Dashing, placing them throughout our office to communicate key metrics to the teams. This transparency encourages focus on improving performance, as recently, we managed to optimize our storefront requests down to 51 milliseconds.
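For reference, a Dashing dashboard is fed by small scheduled jobs along the lines of the sketch below; the data source here is invented, but the SCHEDULER/send_event shape is the framework's standard pattern:

```ruby
# jobs/storefront_response_time.rb -- pushes a number to a dashboard widget.
SCHEDULER.every '30s' do
  avg_ms = Metrics.average_storefront_response_time_ms  # hypothetical data source
  send_event('storefront_response_time', current: avg_ms)
end
```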
00:15:30.400
However, other areas like the Shopify API lagged, averaging around 300 milliseconds. It’s crucial to know where to focus your efforts in order to drive impactful results.
00:15:51.920
Caching plays a significant role in performance. We built a caching gem called Cacheable, which handles page caching at the controller level backed by Memcached. It stores content gzipped and serves it as-is to browsers that support compression, which cuts down the data exchanged with Memcached and the processing done per request.
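I won't reproduce the gem's exact API here, but conceptually, wiring it into a controller looks something like this sketch; treat the module and method names as illustrative rather than the gem's documented interface:

```ruby
class ProductsController < ApplicationController
  include Cacheable::Controller  # module name assumed for illustration

  def show
    # Wrap the action body; on a hit the cached (gzipped) response is served
    # straight from Memcached and the block never runs.
    response_cache do
      @product = Product.find(params[:id])
      render :show
    end
  end
end
```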
00:16:27.920
Cacheable employs generational caching: every cache key includes a generation that changes whenever the underlying data changes, so entries stay fresh without explicit expiry. It also ships middleware that bypasses controller processing entirely on a cache hit, so those responses come back very quickly.
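Here is a minimal sketch of the generational-key idea (not the gem's actual code): the key embeds a per-shop generation that is assumed to be bumped by a callback whenever data changes, so stale entries are never read again and simply age out of Memcached.

```ruby
require 'digest/md5'

class StorefrontCacheKey
  def initialize(shop, cache = Rails.cache)
    @shop, @cache = shop, cache
  end

  # Current generation for this shop; assumed to be incremented whenever
  # products, pages, or templates change.
  def generation
    @cache.fetch("shop:#{@shop.id}:generation") { 1 }
  end

  # Cache key for a storefront path; changing the generation invalidates
  # every previously written key without touching Memcached.
  def for_path(path)
    "shop:#{@shop.id}:gen:#{generation}:#{Digest::MD5.hexdigest(path)}"
  end
end
```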
00:17:20.159
Recently, we began caching 404 pages as well; they aren't something you normally think to prioritize, but with people hitting random URLs, 404s occur constantly across Shopify, and caching those pages led to substantial performance improvements.
00:17:57.120
Another tool we built internally is IdentityCache, which caches model data in Memcached. It was overlooked for some time, but it has improved our performance significantly by caching complete model objects along with their associations, cutting how often we have to hit the database.
00:18:38.919
We’ve maintained an opt-in strategy for this caching method, allowing flexibility for different models and their associated data. This approach avoids unnecessary reads from the database while still ensuring fresh data when required.
00:19:14.880
Using IdentityCache is very straightforward: you include it in your models and declare which fields and associations to cache, and it handles the associated objects automatically, yielding remarkable performance gains.
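That roughly matches how the open-source IdentityCache gem is used; a small sketch, where the model and association names are just examples:

```ruby
class Product < ActiveRecord::Base
  include IdentityCache

  cache_index :handle, :unique => true      # enables Product.fetch_by_handle
  cache_has_many :variants, :embed => true  # variants are cached with the product
end

# Reads go through Memcached and fall back to MySQL on a miss:
product  = Product.fetch(product_id)
variants = product.fetch_variants
```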
00:19:53.360
In terms of processes, the simplest way to make a request faster is often to remove work from it entirely. So we push that work into background jobs; we started with Delayed Job and later migrated to Resque, which is faster and minimizes database contention.
00:20:37.920
The transition from Delayed Job to the Redis-backed Resque improved performance noticeably; we can now process significantly more jobs per second than the earlier system allowed. The switch also gave us Redis for ancillary pieces such as our inventory reservation system.
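A Resque job is just a class with a queue name and a perform method; here is a minimal sketch, with illustrative job and method names:

```ruby
class WebhookDeliveryJob
  @queue = :webhooks

  # Resque serializes the arguments into Redis, and a job worker calls this.
  def self.perform(shop_id, payload)
    Shop.find(shop_id).deliver_webhook(payload)  # deliver_webhook is hypothetical
  end
end

# Enqueued from the web request so the user never waits on the delivery:
Resque.enqueue(WebhookDeliveryJob, shop.id, payload)
```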
00:21:24.559
As we continue to handle vast amounts of data, optimizing our MySQL setup is crucial. Our database servers are equipped with high-performance hardware, large amounts of RAM, and SSDs to ensure that a significant portion of our working set remains in memory, vastly enhancing response times.
00:22:11.439
Alongside hardware improvements, we actively perform query optimization utilizing tools such as pt-query-digest from Percona. We scrutinize our database operations to avoid generating temporary tables, which can undermine performance benefits.
00:22:48.480
We also focus on proper indexing to improve query efficiency, particularly for our main orders table, which is central to many operations. Tuning MySQL configurations has become an integral part of maintaining the performance of our database systems.
00:23:29.520
Some specific configurations we tuned include how InnoDB fetches statistics on metadata, which reduces overhead during routine database operations. It's also crucial to keep cache sizes and connection limits tuned in step with the rest of the system.
00:24:06.560
Furthermore, we've switched the memory allocator to better handle write-heavy patterns, adjusted the auto-increment lock mode, and leaned on after_commit hooks to kick off additional work without holding the transaction open.
00:24:51.679
These techniques allow for better resource management. The after_commit mechanism ensures actions like notifications and follow-up updates run only against data that has already been persisted, which makes our order processing much more predictable.
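In Rails that pattern is just an after_commit callback; a small sketch, with an illustrative job name:

```ruby
class Order < ActiveRecord::Base
  # Runs only once the surrounding transaction has committed, so the job
  # can never pick up an order row that other connections can't see yet.
  after_commit :enqueue_notifications, :on => :create

  private

  def enqueue_notifications
    Resque.enqueue(OrderNotificationJob, id)  # job name is illustrative
  end
end
```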
00:25:31.679
Lastly, I encourage consideration of services and their segmentation within your application. Creating independent services allows for individual scaling and management as necessary. It’s essential to determine when to extract services based on metrics that indicate their necessity.
00:26:15.440
For example, we recently segmented our image handling services because they began overshadowing other application responses, making it more challenging to derive meaningful data insights. Identifying when to separate such functions is a critical part of scaling your system effectively.
00:26:59.200
To encapsulate, adaptability in your approach is paramount. Use the data and understanding of your system to guide decisions; the direction taken in Shopify has always been data-driven, progressing only when necessary and often prioritizing simplicity over premature optimization. By following these principles, I hope you can enhance your scaling strategies.