GoRuCo 2018

Running Jobs at Scale, by Kir Shatrov

00:00:14.599 Hi everyone!
00:00:15.750 My name is Kir, and I work as a platform engineer at Shopify. Today, my talk is about running jobs at scale.
00:00:19.800 Before I begin, I want to thank all the GoRuCo organizers. Let's give them a quick round of applause.
00:00:28.109 I'd heard of GoRuCo as a really great Ruby conference, and I had wanted to speak here since 2015. I applied that year, applied the next year too, and in 2017 it still didn't work out. But I was so happy to receive the acceptance email from Joe this year. I also heard that this is the last GoRuCo, so it feels very special to be here.
00:01:06.510 My talk focuses on background jobs. Many of you here are Rails developers and have probably worked with libraries like Active Job and Sidekiq that allow you to define units of work to execute in the background. These are usually tasks that you don't want your users to wait for, such as sending emails and notifications, or importing and exporting data. These tasks can take longer than a typical web request.
00:01:30.270 The definition of these jobs typically looks something like this: there is a method, usually named 'perform', in a Ruby class. This method serves as the entry point that defines the job's logic. Let's jump to a more concrete example. In this example, we iterate over all products—every record in the database—and call a method on them. For instance, you might want to synchronize all the products in your database with a store, reconciling the data and refreshing the records. This is a common pattern I've seen in jobs.
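A minimal sketch of that pattern. The class name, the model, and the `sync!` method are all illustrative, and the database model is stubbed with plain Ruby so the example runs standalone:

```ruby
# Stand-in for an Active Record model so the sketch runs by itself;
# in a real app this would be `class Product < ApplicationRecord`.
Product = Struct.new(:id, :synced) do
  def sync!
    self.synced = true  # pretend to reconcile this record with the store
  end
end
PRODUCTS = (1..5).map { |i| Product.new(i, false) }

# The naive job shape: a single `perform` that walks the whole collection.
# With millions of rows, one invocation can run for hours or days.
class SyncProductsJob
  def perform
    PRODUCTS.each(&:sync!)  # in a Rails app: Product.find_each(&:sync!)
  end
end

SyncProductsJob.new.perform
```

The whole loop lives inside one `perform` call, so there is no safe point at which the worker can stop partway through.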
00:02:06.159 This works fairly well, especially when you have just a few records in the database. When you have hundreds of records, this job will complete in a few seconds. If you have thousands of records, it might take minutes, and when you reach millions of records, the job starts taking days or even weeks. Here we encounter the problem of long-running jobs.
00:02:40.420 Let me explain why long-running jobs can be problematic. When you deploy a new version of code, you usually want to roll it out by shutting down workers running the old version and starting new workers with the new revision. However, what happens if you have a job that has two or three hours left to run? You have to be careful how you handle this situation. One approach is to wait for all workers to complete their jobs, but this could delay the deployment for hours or even days if there are very long jobs.
00:03:03.069 Many libraries like Sidekiq handle this by aborting the job and pushing it back to the queue, so it will be retried later by another worker on the new revision. Unfortunately, this means losing all progress made so far. The issue worsens with frequent deploys, as illustrated in a timeline where a job starts running, a deployment occurs, and the job gets aborted, over and over, if deploys are frequent enough. In a typical environment with many developers, we deploy every 20 minutes during working hours, making it difficult for any job that takes longer than that to ever succeed; only nights and weekends leave enough room for such jobs to complete.
00:03:46.849 Another challenge we face concerns the capacity of workers and job duration. With more long-running jobs, there is a higher chance that many workers will be busy with these tasks, which could prevent higher-priority jobs—like payment processing—from being executed in a timely manner. Long-running jobs can complicate things further in cloud environments, where hardware is less predictable. For example, a cloud provider like Google or AWS might give you a notice that an instance will be shut down in a few minutes due to health issues, requiring your application logic to manage such interruptions.
00:04:57.120 At Shopify, this started to become a pressing problem; we were encountering too many long-running jobs that sometimes took weeks to complete. As we transitioned to the cloud, we needed a solution. We researched the reasons why so many jobs were taking so long, and we found that these jobs often iterate over long collections. Shopify is an eCommerce platform with many merchants, and we had jobs that iterated over all products for every merchant. While smaller merchants had shorter jobs, enterprise merchants with millions of products faced significant delays.
00:06:03.600 To address this, we began to conceptualize interruptible and resumable jobs. What if we could abort jobs upon deployment but save their progress? We envisioned splitting the job definition into two parts: one for the collection to process—which could be a smaller collection or even a large one containing millions of records—and another for the work to be performed on each record. In our previous example, the 'collection to process' would be all product records in the database, while the 'work to be done' would be a method call executed for each product.
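The two-part structure could look something like this. The method names are assumptions for illustration, not a real library API, and the model is again stubbed so the sketch runs standalone:

```ruby
# Stand-in for the Active Record model from the earlier example.
Product = Struct.new(:id, :synced) do
  def sync!
    self.synced = true
  end
end
PRODUCTS = (1..5).map { |i| Product.new(i, false) }

# Sketch of the two-part job definition described in the talk.
class SyncProductsJob
  # Part 1: the collection to process.
  def collection
    PRODUCTS  # in a Rails app: Product.all
  end

  # Part 2: the work to perform on a single record.
  def each_record(product)
    product.sync!
  end

  # The framework, not the job author, now drives the loop, which is
  # what makes checkpointing and resumption possible.
  def perform
    collection.each { |record| each_record(record) }
  end
end

SyncProductsJob.new.perform
```

Because the per-record work is isolated in its own method, the runner can stop between any two records instead of only between whole jobs.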
00:08:06.180 This structure allowed us to include a module that provided iteration features. Rather than having just one perform method, we would now have a method defining the collection and a method called on each record within that collection. By implementing a more structured design, we could handle interruptions and resumption of jobs easily. Considering relations or collections as a whole allows the use of a cursor that we can persist during job interruptions, enabling the job to complete seamlessly.
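The cursor idea can be sketched in plain Ruby. The `CursorRunner` class and its methods are hypothetical; a real implementation would persist the cursor to a datastore between deploys, while here it simply survives in memory across two `run` calls:

```ruby
Product = Struct.new(:id)
PRODUCTS = (1..10).map { |i| Product.new(i) }

# Hypothetical runner: it remembers the id of the last processed record
# (the cursor) and, when restarted, skips everything up to that point.
class CursorRunner
  attr_reader :cursor, :processed

  def initialize
    @cursor = nil    # persisted externally in a real system
    @processed = []
  end

  # Process records until `interrupt_after` of them are done,
  # simulating a worker being shut down by a deploy.
  def run(interrupt_after: Float::INFINITY)
    count = 0
    PRODUCTS.each do |product|
      next if @cursor && product.id <= @cursor  # resume past the cursor
      return if count >= interrupt_after        # the "deploy" happens here
      @processed << product.id
      @cursor = product.id                      # checkpoint progress
      count += 1
    end
  end
end

runner = CursorRunner.new
runner.run(interrupt_after: 4)  # interrupted partway through
runner.run                      # resumes from the saved cursor
```

After the second call every record has been processed exactly once, even though the first call was cut short.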
00:09:03.600 This approach is not limited to Active Record relations; we can indeed build enumerators from multiple sources, including CSV files. Once we started implementing this, we realized the vast possibilities it opened up. We could track job progress effortlessly and parallelize tasks since the units of work were smaller and clearly defined. Additionally, we could throttle job processing based on the database load, improving efficiency. This enabled us to give developers the ability to work even with collections containing millions of records while keeping infrastructure uptime secure. We are planning to open-source this solution soon, and I look forward to discussing it with anyone facing challenges related to background jobs. Thank you all very much!
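As one illustration of iterating over a non-database source, a Ruby `Enumerator` can wrap CSV rows so the same iteration machinery consumes them like any other collection (the data and variable names here are made up):

```ruby
require "csv"

csv_data = <<~CSV
  id,title
  1,Shirt
  2,Mug
  3,Poster
CSV

# Wrap the parsed rows in an Enumerator; a cursor-aware version could
# instead track a row offset and skip ahead to it on resumption.
rows = Enumerator.new do |yielder|
  CSV.parse(csv_data, headers: true).each { |row| yielder << row }
end

titles = rows.map { |row| row["title"] }
```

Anything that can be expressed as an enumerator, such as database relations, files, or API pages, becomes a valid job collection.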