Running Jobs at Scale
by Kir Shatrov

In the presentation "Running Jobs at Scale," Kir Shatrov, a platform engineer at Shopify, discusses the complexities of managing background jobs within web applications, particularly in environments that require rapid deployments. His talk emphasizes the challenges faced by developers, especially those using Ruby on Rails and libraries such as Active Job and Sidekiq, when handling long-running jobs that can obstruct timely task execution.

Key Points Discussed:
- Acknowledgment of GoRuCo: Shatrov appreciates the opportunity to present at the final GoRuCo, noting the event's importance to the Ruby community.
- Background Jobs Overview: He defines background jobs as processes that run outside of user requests, often involving tasks like sending emails or data processing that can be time-consuming.
- Challenges with Long-Running Jobs: Shatrov elaborates on difficulties posed by jobs that exceed the usual timeframe for execution. For example, jobs dealing with thousands of records may complete in minutes, but those processing millions can take much longer, complicating deployment strategies.
- Deployment Interruptions: He explains how standard practices in job libraries can lead to lost job progress during deployments, raising concerns about job reliability during frequent code updates.
- Operational Impact: For platforms like Shopify, where the scale can involve numerous merchants and vast product databases, long-running jobs can hinder critical functions, such as payment processing, due to resource constraints.
- Solution Concept: To combat these issues, Shatrov proposes the idea of interruptible and resumable jobs. By splitting job definitions into distinct parts—one for the collection of records and another for the individual processing—developers can manage interruptions more effectively, allowing jobs to save progress and resume without data loss.
- Implementation Benefits: This approach allows for better tracking of job progress, enables parallel processing of tasks, and enhances efficiency through throttling based on database load.
- Future Steps: Shatrov mentions plans to open-source this solution, inviting discussion with other developers facing similar challenges related to background job management.

Conclusions: Shatrov's insights underscore the necessity of rethinking how we handle long-running jobs in web applications, particularly in a cloud environment, to accommodate frequent deployments and ensure operational efficiency. The proposed solutions could greatly benefit developers looking to optimize their background job processing workflows.

00:00:14.599 Hi everyone!
00:00:15.750 My name is Kir, and I work as a platform engineer at Shopify. Today, my talk is about running jobs at scale.
00:00:19.800 Before I begin, I want to thank all the GoRuCo organizers. Let's give them a quick round of applause.
00:00:28.109 I've heard about GoRuCo as a really great Ruby conference, and I've wanted to speak here since 2015. I applied that year, applied again the next year, and in 2017 it still didn't work out. So I was very happy to receive the acceptance email from Joe this year. I also heard that this is the last GoRuCo, which makes it feel very special to be here.
00:01:06.510 My talk focuses on background jobs. Many of you here are Rails developers and have probably worked with libraries like Active Job and Sidekiq that allow you to define units of work to execute in the background. These are usually tasks that you don't want your users to wait for, such as sending emails and notifications, or importing and exporting data. These tasks can take longer than a typical web request.
00:01:30.270 The definition of these jobs typically looks something like this: there is a method, usually named 'perform', in a Ruby class. This method serves as the entry point that defines the job's logic. Let's jump to a more concrete example. In this example, we iterate over all products—every record in the database—and call a method on them. For instance, you might want to synchronize all the products in your database with a store, reconciling the data and refreshing the records. This is a common pattern I've seen in jobs.
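The job shape described above can be sketched like this (class and method names are illustrative, not the code from the slides; in a Rails app the class would inherit from `ApplicationJob` or include `Sidekiq::Job`):

```ruby
# A single 'perform' entry point that iterates the whole collection in
# one call, so the job's runtime grows with the number of records.
class ProductSyncJob
  def initialize(products)
    @products = products
  end

  def perform
    # e.g. Product.find_each in a real Rails app
    @products.each { |product| product.sync_with_store! }
  end
end
```

With a handful of records this is fine; the problem Shatrov describes appears only once the collection grows into the millions.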
00:02:06.159 This works fairly well, especially when you have just a few records in the database. When you have hundreds of records, this job will complete in a few seconds. If you have thousands of records, it might take minutes, and when you reach millions of records, the job starts taking days or even weeks. Here we encounter the problem of long-running jobs.
00:02:40.420 Let me explain why long-running jobs can be problematic. When you deploy a new version of code, you usually want to roll it out by shutting down workers running the old version and starting new workers with the new revision. However, what happens if you have a job that has two or three hours left to run? You have to be careful how you handle this situation. One approach is to wait for all workers to complete their jobs, but this could delay the deployment for hours or even days if there are very long jobs.
00:03:03.069 Many libraries like Sidekiq handle this by aborting the job and pushing it back to the queue, so it will be retried later by another worker on the new revision. Unfortunately, this results in losing the existing progress. This issue worsens with frequent deploys, as illustrated in a timeline where a job starts running, then the deployment occurs, and the job gets aborted repeatedly if deploys are frequent enough. In a typical environment with many developers, we deploy every 20 minutes during working hours, making it difficult for any job that takes longer than that to succeed, leaving only the night or weekends for such jobs to complete.
00:03:46.849 Another challenge we face concerns the capacity of workers and job duration. With more long-running jobs, there is a higher chance that many workers will be busy with these tasks, which could prevent higher-priority jobs—like payment processing—from being executed in a timely manner. Long-running jobs can complicate things further in cloud environments, where hardware is less predictable. For example, a cloud provider like Google or AWS might give you a notice that an instance will be shut down in a few minutes due to health issues, requiring your application logic to manage such interruptions.
00:04:57.120 At Shopify, this started to become a pressing problem; we were encountering too many long-running jobs that sometimes took weeks to complete. As we transitioned to the cloud, we needed a solution. We researched the reasons why so many jobs were taking so long, and we found that these jobs often iterate over long collections. Shopify is an eCommerce platform with many merchants, and we had jobs that iterated over all products for every merchant. While smaller merchants had shorter jobs, enterprise merchants with millions of products faced significant delays.
00:06:03.600 To address this, we began to conceptualize interruptible and resumable jobs. What if we could abort jobs upon deployment but save their progress? We envisioned splitting the job definition into two parts: one for the collection to process—which could be a smaller collection or even a large one containing millions of records—and another for the work to be performed on each record. In our previous example, the 'collection to process' would be all product records in the database, while the 'work to be done' would be a method call executed for each product.
00:08:06.180 This structure allowed us to include a module that provided iteration features. Rather than a single perform method, we would now have one method defining the collection and another method called on each record within that collection. With this more structured design, interrupting and resuming jobs becomes straightforward: treating the collection as an enumerable lets us maintain a cursor, persist it when the job is interrupted, and resume from that position so the job can still complete.
00:09:03.600 This approach is not limited to Active Record relations; we can indeed build enumerators from multiple sources, including CSV files. Once we started implementing this, we realized the vast possibilities it opened up. We could track job progress effortlessly and parallelize tasks since the units of work were smaller and clearly defined. Additionally, we could throttle job processing based on the database load, improving efficiency. This enabled us to give developers the ability to work even with collections containing millions of records while keeping infrastructure uptime secure. We are planning to open-source this solution soon, and I look forward to discussing it with anyone facing challenges related to background jobs. Thank you all very much!
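The point about non-Active-Record sources can be illustrated with a resumable enumerator over a CSV file, using the row index as the cursor (the file layout and helper name here are hypothetical, chosen only to show the shape):

```ruby
require "csv"

# Yields [row, index] pairs starting from the saved cursor, so an
# interrupted import can resume mid-file instead of starting over.
def csv_enumerator(path, cursor: 0)
  Enumerator.new do |yielder|
    CSV.foreach(path, headers: true).with_index do |row, index|
      next if index < cursor
      yielder << [row, index]
    end
  end
end
```

Because the job only sees an enumerator, the same interruption and cursor-persistence machinery works unchanged whether the underlying collection is a database relation, a CSV file, or any other enumerable source.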