
Summarized using AI

How Shopify Sharded Rails

Camilo Lopez • February 20, 2014 • Earth

In this video, Camilo Lopez from Shopify presents a detailed account of how the company successfully implemented sharding in their Rails architecture in 2013. The need for this transition arose from exponential growth in traffic and customer count, particularly after the heavy load experienced during Cyber Monday 2012. Shopify had to evolve from a single large database to a scalable model with multiple smaller databases.

Key Points Discussed:

  • Company Background: Shopify is an e-commerce platform serving approximately 100,000 active shops, with high-performance infrastructure based on Ruby on Rails and MySQL.
  • Scaling Challenges: The platform handled 150,000 requests per minute on average, with peaks up to 400,000, which necessitated a shift to sharding to improve performance and reliability.
  • Sharding Complexities: Introducing multiple databases complicated the Rails application which was traditionally designed for single-database operations, leading to challenges in data grouping and transactional integrity.
  • Initial Strategy: Before sharding, Shopify focused on optimizing MySQL and tuning their hardware until expansion needs prompted the shift to a sharded architecture.
  • Implementation Process: A dedicated team was assigned to focus on sharding, which included substantial preparation to prevent complexities during migration and maintain data consistency.
  • Data Migration and Model Changes: The migration involved redefining the data structures to include shop IDs and implementing sharded model connections. A structured team approach ensured both stability and ongoing development.
  • Performance Optimization: Real-time performance was crucial, especially during peak times, thus advanced concurrency management and transaction handling were integrated.
  • Success Metrics: By Cyber Monday 2013, Shopify successfully handled up to 100,000 RPM with excellent response times, proving that their sharding approach effectively supported their significant increase in user demand.

Conclusions:

Lopez concludes the presentation by emphasizing the importance of thorough planning and structural awareness in application architecture, which laid the groundwork for Shopify's ongoing adaptability. He also highlights the profound benefits of the sharding strategy, ensuring robust performance and service reliability even amidst fluctuating traffic demands.

How Shopify Sharded Rails
Camilo Lopez • February 20, 2014 • Earth

Last year at this very conference John Duff spoke about how Shopify scales while maintaining one of the longest lived and largest Rails deployments, and how we confront the challenges that come with growth. Shopify in 2013 became more than twice the size in every single aspect: requests per minute, GMV, merchants, number of developers, etc.

After Cyber Monday 2012 it became clear that if we wanted to survive Cyber Monday 2013 we needed to spread the load across more than a single huge database, and move to a model of smaller databases to enable horizontal scalability. This is the story of how, in 2013, we more than doubled the number of databases that power Shopify, and all the challenges that come along when sharding a living, breathing and money-producing Ruby application.

Help us caption & translate this video!

http://amara.org/v/FG3p/

Big Ruby 2014

00:00:20.560 Hello, my name is Camilo Lopez, and I work at Shopify. Today, I want to share with you how Shopify sharded Rails in 2013. First, let me give you a bit of background on Shopify. We are an e-commerce platform; we don't sell products but instead enable people to create their own stores to sell their own products.
00:00:29.359 Let's talk about the scale of Shopify. We are deployed on around 100 application servers. Our hardware is heterogeneous but generally high-performance. Currently, we have approximately 100,000 active shops, of which about 88,000 are paying customers. This ecosystem is supported by a technology stack that includes Ruby 2.1, Rails 3.2, and MySQL 5.5, along with some important patches for MySQL. A few details about this stack are essential since some aspects are MySQL-specific.
00:00:43.600 We operate a typical Rails stack with web servers managed by Unicorn and job services handled by Resque. We have a master database along with some shards. A usual day at Shopify involves handling around 150,000 requests per minute, with peaks reaching up to 400,000 requests. Overall, our shards process roughly 20,000 queries per second. Being in the commerce business means that any downtime can have significant consequences, especially in terms of revenue loss.
00:01:03.680 Last year, our Gross Merchandise Volume (GMV) was $1.6 billion, which works out to roughly $3,000 processed every minute through our platform. Outages in our service can be quite disruptive. This brings us to the topic of sharding, and why you should never shard lightly. Sharding involves slicing your data over multiple databases. Rails and many libraries are inherently designed to work with a single database, creating friction when you introduce multiple databases.
00:01:16.720 When you shard, you find yourself fighting the underlying assumptions throughout the Rails ecosystem. Common operations, such as grouping data by a customer or utilizing primary keys, require substantial rethinking. You might end up writing custom solutions or leaning on external gems, but there's no straightforward way to implement sharding effectively without significant maintenance overhead.
00:01:44.879 For many years, our solution was simply to keep upgrading our database machine. For about seven years we focused on tuning MySQL, improving caching, and implementing optimizations to keep performance steady. Eventually, though, we realized that our traffic and customer count were doubling every year, and that Black Friday and Cyber Monday impose massive loads on our platform, often twice the normal workload.
00:02:03.200 During peak user activity it is vital that we remain fully transactional, with no caching involved, since every operation has to be reflected accurately in real time. Thus, we made the decision to shard our database. One key insight was that MySQL handles concurrency far better than our application, which is not threaded, so we focus on MySQL's concurrency settings to get the best performance out of each database server.
00:02:28.360 In 2011, we were running on a single configuration of Intel hardware. By 2012, as our needs grew, buying ever faster CPUs had become too expensive, making horizontal scaling the logical next step. We therefore set sharding as a goal well ahead of Cyber Monday. The initiative also promised important side benefits, including smaller, more localized buffer pools and better overall performance.
00:02:52.560 Sharding would also yield smaller indexes, making data retrieval faster. Additionally, if one shard failed, the shops on the other shards would keep running, preserving overall service availability. Before embarking on this journey, we spent an entire year preparing: we anticipated potential pitfalls and established constraints, committing to never mix writes to different shards within a single transaction, since that would jeopardize transactional integrity.
00:03:27.679 We recognized early on that joining across different databases is not feasible; if you need data from more than one shard, you have to rethink your model. We therefore restructured our code so that cross-shard access became explicit, using blocks to establish a session context and enforce data locality. A dedicated team of four to six engineers focused solely on sharding while other developers continued their feature work without interruption.
00:03:49.600 Our shard key is the shop: everything tied to a shop moves together, and we made sure every association carried a shop ID, which is what makes it possible to move a shop's data between shards. At the same time we introduced the sharding code paths even while we still had only one database, as a preparatory step. This work culminated in a clear separation between the master database and sharded data, achieved over multiple iterations. As we rolled out these changes, we kept our indexes consistent, with primary keys designed for a multi-shard environment.
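As a concrete illustration of this step, the migration below is a hypothetical sketch of denormalizing shop_id onto a table that previously reached its shop only through a parent association; the table, columns, and Rails 3.2-style migration API reflect the stack described earlier, not code from the talk.

```ruby
# Hypothetical preparatory migration: give line_items its own shop_id so its
# rows can be routed to a shard directly, then backfill from the parent orders
# table. Table and column names are illustrative.
class AddShopIdToLineItems < ActiveRecord::Migration
  def up
    add_column :line_items, :shop_id, :integer, null: false, default: 0
    add_index  :line_items, [:shop_id, :order_id]

    # Backfill shop_id from the association so every existing row carries the shard key.
    execute <<-SQL
      UPDATE line_items
      JOIN orders ON orders.id = line_items.order_id
      SET line_items.shop_id = orders.shop_id
    SQL
  end

  def down
    remove_index  :line_items, column: [:shop_id, :order_id]
    remove_column :line_items, :shop_id
  end
end
```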
00:04:20.560 One significant decision concerned ID generation. A service like Twitter's Snowflake can generate IDs outside the database and keep identifiers unique across shards, but relying on an external service would have introduced additional risk into our architecture, so we also kept MySQL-based ID generation configured while we iteratively refined our approach.
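To make the Snowflake approach concrete, here is a minimal sketch of that style of ID: a 64-bit value composed of a millisecond timestamp, a shard identifier, and a per-process sequence. The bit layout and class name are assumptions for illustration, not Shopify's actual generator.

```ruby
# Snowflake-style ID sketch: time-ordered, unique across shards without any
# database round trip. A real generator would also reset the sequence each
# millisecond and wait when it overflows.
class SnowflakeLikeId
  EPOCH = 1_288_834_974_657 # custom epoch in milliseconds (value is arbitrary)

  def initialize(shard_id)
    @shard_id = shard_id & 0x3FF # 10 bits reserved for the shard
    @sequence = 0
  end

  def next_id
    millis    = (Time.now.to_f * 1000).to_i - EPOCH
    @sequence = (@sequence + 1) & 0xFFF # 12-bit sequence number
    (millis << 22) | (@shard_id << 12) | @sequence
  end
end

SnowflakeLikeId.new(3).next_id # => a 64-bit integer unique to shard 3
```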
00:04:50.920 Connection switching was another core change we had to implement. We created a block-based context API that lets our sharded models hold shard-specific connections. In practice, a sharded model must include a specific module and have a shop ID column. Models that are not sharded keep defaulting to the master database, while sharded models pick up their connection from the surrounding context dynamically.
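A minimal sketch of what such a block-based context could look like on the Rails 3.2 stack described above; the names (Shard, Sharded, the shard_1/shard_2 database.yml entries) and the connection-override technique are assumptions, since the talk describes the API but does not show it.

```ruby
# One abstract base class per shard, each holding its own connection pool.
# :shard_1 and :shard_2 are assumed entries in database.yml.
class ShardOneBase < ActiveRecord::Base
  self.abstract_class = true
  establish_connection :shard_1
end

class ShardTwoBase < ActiveRecord::Base
  self.abstract_class = true
  establish_connection :shard_2
end

class Shard
  BASES = { 1 => ShardOneBase, 2 => ShardTwoBase }

  # Block-based context: everything inside the block talks to the given shard.
  # Thread-local state is enough here because the app servers are single-threaded.
  def self.with(shard_id)
    previous = Thread.current[:current_shard]
    Thread.current[:current_shard] = shard_id
    yield
  ensure
    Thread.current[:current_shard] = previous
  end

  def self.current_connection
    shard_id = Thread.current[:current_shard]
    raise "no shard context established" unless shard_id
    BASES.fetch(shard_id).connection
  end
end

# Mixed into models that carry a shop_id column and live on a shard.
module Sharded
  def self.included(model)
    model.extend(ClassMethods)
  end

  module ClassMethods
    def connection
      Shard.current_connection
    end
  end
end

class Order < ActiveRecord::Base
  include Sharded # has a shop_id column; resolves its connection from the current context
end

# Usage: every Order query inside the block hits shard 2.
Shard.with(2) { Order.where(shop_id: 42).count }
```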
00:05:22.720 Routing requests to the relevant shard happens at the HTTP layer: when a request arrives, we look up the shop ID and its shard, and a shared module in our controllers wraps the action in the corresponding shard context. Background jobs use the same module, so the shard context extends seamlessly across our architecture.
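Building on the Shard.with sketch above, this is roughly what the controller and job wiring could look like; the around_filter, the Shop lookup by domain, and the Resque job shape are assumptions consistent with the Rails 3.2 / Resque stack mentioned earlier.

```ruby
# Hypothetical controller wrapper: the shop lives on the master database, and
# its shard_id tells us where the rest of its data lives.
class ShardedController < ApplicationController
  around_filter :with_shard

  private

  def with_shard
    shop = Shop.find_by_domain!(request.host) # column name assumed for illustration
    Shard.with(shop.shard_id) { yield }       # run the action inside the shop's shard context
  end
end

# Hypothetical Resque job: the shard id travels with the job arguments so the
# worker can re-establish the same context.
class OrderExportJob
  @queue = :exports

  def self.perform(shop_id, shard_id)
    Shard.with(shard_id) do
      Order.where(shop_id: shop_id).find_each do |order|
        # per-order work goes here
      end
    end
  end
end
```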
00:05:56.360 Given that our existing codebase knew nothing about the constraints of this new approach, we built verifiers that alert us to issues such as missing shard contexts. This logging proved invaluable as we iterated through fixes, ensuring we resolved every context issue before transitioning fully to the sharded model. It also gradually made our developers aware of the limits that sharding introduces.
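A sketch of what such a verifier could look like, again building on the earlier Shard sketch: in detect-only mode it logs offending call sites and falls back to the master database instead of raising. The module name and the ENFORCE switch are assumptions.

```ruby
# Logs (or, once enforcement is on, raises) when a sharded model is queried
# outside of a Shard.with block.
module ShardContextVerifier
  ENFORCE = false # flip to true once every offending code path has been fixed

  def connection
    return super if Thread.current[:current_shard]

    message = "sharded model #{name} used without a shard context"
    raise message if ENFORCE
    Rails.logger.warn("#{message}\n#{caller.first(5).join("\n")}")
    ActiveRecord::Base.connection # fall back to the master database during the transition
  end
end

# Prepend so the check runs before the Sharded connection lookup.
Order.singleton_class.prepend(ShardContextVerifier)
```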
00:06:32.400 Because we may need to rebalance load across shards, we took a cautious approach: a shop is fully locked during data migration to prevent conflicting operations. Our solution uses two levels of locking, ensuring that no in-flight transactions can interfere with a move. The locking itself is built on Apache ZooKeeper, which gives us a reliable way to manage concurrency across shards.
00:07:00.900 As we moved forward, we added optimizations to avoid excessive requests reaching ZooKeeper, relying on in-table flags to identify active locks. Moves are orchestrated by Ruby scripts that copy the data, record information about the migration, and check data integrity, so we can be confident in the outcome. This gave us a graceful way to move shops between shards with the level of confidence the business requires.
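A sketch of the shop-move flow those scripts imply, with the two locking levels made explicit. DistributedLock is a hypothetical wrapper around a ZooKeeper client, and the copy and verification steps are only outlined; none of this is code from the talk.

```ruby
class ShopMover
  def initialize(shop, target_shard)
    @shop = shop
    @target_shard = target_shard
  end

  def move!
    # Level 1: cluster-wide lock, e.g. backed by ZooKeeper (hypothetical wrapper).
    DistributedLock.acquire("shops/#{@shop.id}") do
      @shop.update_column(:locked, true) # Level 2: in-table flag, cheap to check on every request
      wait_for_in_flight_requests
      copy_rows_to(@target_shard)
      verify_copy!
      @shop.update_column(:shard_id, @target_shard)
      @shop.update_column(:locked, false)
    end
  end

  private

  def wait_for_in_flight_requests
    sleep 5 # simplification: a real mover would track open transactions explicitly
  end

  def copy_rows_to(shard_id)
    # stream each sharded table's rows for this shop into the target shard,
    # e.g. SELECTs on the source feeding batched INSERTs inside Shard.with(shard_id)
  end

  def verify_copy!
    # compare per-table row counts (or checksums) before flipping shard_id
  end
end
```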
00:07:37.680 As for old data left behind after a migration, we decided that manual intervention would initially be necessary to clean up rows that no longer belong on the source shard. Automating this is a future consideration, and there are existing approaches in the Ruby world that could be applied; for now, we are content to manually ensure our records stay coherent as we build out the sharded application.
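For the manual cleanup, something along these lines is what the talk implies: once a shop is confirmed on its new shard, delete its leftover rows from the old one in small batches. The table list, batch size, and helper are illustrative assumptions.

```ruby
# Purge a moved shop's rows from the shard it used to live on, batch by batch
# so we never hold large locks or stall replication.
SHARDED_TABLES = %w[orders line_items products]

def purge_moved_shop(shop_id, old_shard_id)
  Shard.with(old_shard_id) do
    SHARDED_TABLES.each do |table|
      loop do
        deleted = Shard.current_connection.delete(
          "DELETE FROM #{table} WHERE shop_id = #{shop_id.to_i} LIMIT 1000"
        )
        break if deleted.zero?
      end
    end
  end
end
```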
00:08:23.760 For queries that span shards, particularly anything that used to be a join, we developed an API that provides conditional access, allowing queries without incurring expensive locks. This decoupled the code from the physical database layout, keeping queries fast. Each shard now sits behind an abstraction layer where retrievals occur, hiding the complexity while keeping data accurate across shards.
00:09:06.840 We employ several ways to extract data using the shard's context, whether through ActiveRecord relations or background processing tasks that fan out across shards. Having proper APIs gave us the freedom and flexibility our request volume demands while keeping the architecture clean. None of this was entirely seamless; it required comprehensive planning and rigorous coding to minimize the impact on developers and their established workflows.
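A sketch of the cross-shard access pattern this implies: instead of a SQL join across databases, run the query once per shard and combine the results in Ruby. The MultiShard helper is an assumed name built on the earlier Shard sketch.

```ruby
# Fan a read-only query out to every shard and merge the results.
module MultiShard
  def self.flat_map(&block)
    Shard::BASES.keys.flat_map do |shard_id|
      Shard.with(shard_id, &block)
    end
  end
end

# Usage: gather recent orders from all shards without a cross-database join.
recent_orders = MultiShard.flat_map { Order.where("created_at > ?", 1.hour.ago).to_a }
```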
00:09:53.120 Documenting our journey to sharding was essential, as it captures the decisions and trade-offs we made throughout 2013. Reflecting on the experience, the biggest lesson is to understand your application's architecture before making drastic changes. The adjustments we made will let Shopify adapt easily over the coming years, maintaining performance as our user base keeps growing.
00:10:33.040 By Cyber Monday 2013, we successfully handled around 100,000 RPM, significantly higher than our day-to-day load, and we observed spikes of up to 30,000 requests per minute during flash sales. Our sharding architecture was fully engaged, separating high-volume transactions from the general flow and ensuring that performance didn't suffer when it mattered most. It marked a triumphant moment for the Shopify team.
00:11:15.200 Our metrics showed low average response times during peak shopping hours that year, even as data volumes grew. The graphs showed the sharded architecture performing well, with requests flowing smoothly despite the underlying growth in data. As we assess these results, we remain focused on refining our sharding practices and pursuing further optimizations.
00:11:41.760 In closing, sharding Rails fortified Shopify's back-end infrastructure, setting us up to sustain significant traffic growth while maintaining service reliability. Thank you for your attention. If there are any questions at this time, please feel free to ask!