Ruby Unconf 2018

Background Jobs at Scale

00:00:16.119 When a new user signs up for an application, that application usually creates a new account for them and sends a welcome email. However, sending emails is notoriously slow, and response time is crucial for a great user experience. Nobody likes to wait, but we can inform the user about the success or failure of their account creation, even if sending the welcome email hasn't been completed yet. We can run the slow task of sending emails outside of our main request-response cycle by offloading it into a background job. Background jobs can do much more than just improve response times.
00:00:41.360 So today, I'm going to talk about how to use background jobs to scale applications and how they can help us keep our codebase simple. I'll start with a brief introduction to background jobs, describe some of their features that aid in scaling, and explain how to master the challenges of working with them. To get the most out of these features, I'll also explain how asynchronous techniques like background jobs can actually complement a RESTful approach. Finally, I will show you how we work with background jobs at scale at Shopify, one of the largest Ruby on Rails applications in production.
00:01:16.000 In the example of a user signing up, background jobs decouple the user-facing request from the time-consuming task of sending a welcome email. Our application server can fork a worker process and offload the slow task to the worker. It shouldn't wait for the worker to finish sending that email, but on the other hand, a fire-and-forget approach is not ideal either. We would like the worker to communicate in case something goes wrong. We can facilitate that communication by placing a message queue between the two processes. The sender and receiver of messages can interact with that queue at different times, which makes the communication asynchronous.
00:02:58.940 The message, in our case, is the slow task of sending that email, which is why our message queue is often referred to as a task queue. The application server places the message in the queue and does not have to wait for the worker to process it; it can immediately return to the user. This is how our response time improves. We can also have multiple workers, and these workers can exist on different machines, as long as they have access to the queue. A task queue works best when paired with a so-called broker, which routes the messages and manages the workers. Essentially, this is what a background job backend is—a task queue along with some broker code. It encapsulates all the complex functionalities, allowing us to keep our codebase simple and easy to maintain.
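As a concrete illustration of that simplicity, here is a minimal Active Job sketch of the welcome email example. It assumes a Rails application; the UserMailer and its welcome method are hypothetical.

```ruby
# app/jobs/welcome_email_job.rb
class WelcomeEmailJob < ApplicationJob
  queue_as :default

  def perform(user)
    # Runs later, on a worker process, outside the request/response cycle.
    UserMailer.welcome(user).deliver_now
  end
end

# In the controller, right after creating the account:
#   WelcomeEmailJob.perform_later(user)
# perform_later only places a message on the queue and returns immediately,
# so the response to the user is not delayed by the email delivery.
```

With Action Mailer alone, UserMailer.welcome(user).deliver_later would achieve the same thing without a hand-written job class.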
00:04:00.000 Our code is not aware of being run asynchronously on the worker. Background jobs improve the user experience by enhancing response times. But how do they help scale? If our application server experiences a sudden spike in image uploads, for instance, it won't hurt us if the actual image processing is occurring as a background job. Due to the spike, our queue might grow, and it might take a little longer until all images are processed, but users can still upload more images as quickly as usual. The speed of queuing more jobs is not constrained by the speed of processing them.
00:04:43.210 Our task queue serves as a buffer between the application server and the worker process. We can also add more workers on-the-fly, allowing us to process more jobs simultaneously, thereby speeding up the entire image processing task. In fact, we can process multiple jobs in parallel, and we can also handle large tasks by splitting them into many subtasks, queuing those as individual background jobs. Those subtasks can run on different worker machines simultaneously, effectively parallelizing the processing of the overall task. Again, our code remains very simple because it is not aware it is being executed in parallel; all it needs to know is how to execute one subtask at a time.
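As a sketch of that fan-out pattern, a large task can queue one small job per subtask. The models and the generate_thumbnails! method below are made up for illustration.

```ruby
# Queues one small job per image instead of processing everything in one go.
class ProcessUploadedImagesJob < ApplicationJob
  def perform(upload)
    upload.images.find_each do |image|
      ProcessImageJob.perform_later(image)
    end
  end
end

class ProcessImageJob < ApplicationJob
  def perform(image)
    # Only knows how to handle a single image; parallelism comes from
    # several workers picking up these jobs at the same time.
    image.generate_thumbnails!
  end
end
```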
00:05:59.780 Background job backends typically come with some kind of retry functionality. If a worker encounters any error while processing a job, the job is re-queued after some time, allowing it to be retried later. If all attempts to run the job fail, it is marked as failed and is no longer retried. When a worker successfully runs a job, it confirms that to the queue, which means it is safe to remove that job from the queue. However, if the job is never confirmed, another worker can pick up that job from the queue and try to process it.
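With Active Job, for example, retry behavior can be declared on the job class itself. The error class, wait strategy, and attempt count below are placeholders, not a recommendation.

```ruby
# Hypothetical error raised by the (assumed) webhook delivery code.
class WebhookTimeoutError < StandardError; end

class DeliverWebhookJob < ApplicationJob
  # Re-queue the job with exponential backoff on transient errors
  # and give up after a handful of attempts.
  retry_on WebhookTimeoutError, wait: :exponentially_longer, attempts: 5

  # Don't retry at all if the record no longer exists.
  discard_on ActiveRecord::RecordNotFound

  def perform(webhook)
    webhook.deliver!
  end
end
```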
00:07:10.490 Some of our background jobs are more urgent than others. Sending a welcome email should happen soon after sign-up, while cleaning up stale data can certainly wait until later. We have some background jobs closely related to the checkout experience of customers, which are more urgent than, for example, sending webhooks to third-party partners. A common feature of many background job backends is the ability to set up multiple named queues and start workers with a queue list sorted by priority. Jobs in the high-priority queue will be processed first, but if that queue is empty, workers will not remain idle; they will process jobs of lower priority.
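In code, this usually amounts to a queue name per job class plus a priority-ordered queue list when starting the workers; the queue names here are just examples.

```ruby
class SendWelcomeEmailJob < ApplicationJob
  queue_as :high    # urgent, close to the user experience
end

class CleanupStaleDataJob < ApplicationJob
  queue_as :low     # can wait until things are quiet
end

# A worker started with a prioritized queue list, e.g. with Resque:
#   QUEUES=high,default,low bundle exec rake resque:work
# drains 'high' first, but works off 'default' and 'low' whenever 'high' is empty.
```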
00:08:38.430 This feature also allows us to set up a specialized pool of workers that only process certain tasks. This is beneficial if some of our background jobs have specific requirements, such as running on certain hardware. Leveraging all these features can pose challenges. By offloading some tasks into background jobs, we also move them outside of the main transaction, which means we lose some consistency guarantees. The best we can hope for is eventual consistency, but even that can’t be guaranteed. We might end up with inconsistent data if a background job fails.
00:09:40.250 This is a general trade-off we face; we need to balance the benefits of background processing against the risks of having some inconsistent state. However, we can try to contain the impact of inconsistencies by setting up data reconciliation. Perhaps we can have a scheduled background job that tries to find inconsistencies and fix them or at least report them.
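Such a reconciliation job can be quite small. In the following sketch, the model, the welcome_email_sent_at column, and the scheduling mechanism are all assumptions.

```ruby
# Scheduled regularly (cron, a scheduler gem, etc.) to repair state that a
# failed background job may have left behind.
class ReconcileWelcomeEmailsJob < ApplicationJob
  def perform
    Account
      .where(welcome_email_sent_at: nil)
      .where("created_at < ?", 1.hour.ago)
      .find_each do |account|
        # Fix the inconsistency by re-queuing the original job...
        WelcomeEmailJob.perform_later(account.user)
        # ...or at least report it to a metrics or alerting system.
      end
  end
end
```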
00:10:16.450 The job that is queued first is not necessarily the job that is processed first. Jobs exit the queue in a predefined order, but if we have more than one worker, we can never predict which worker will work faster. A job might also face an error and get re-queued for later processing.
00:10:58.990 So we cannot rely on the order of our background jobs. If background jobs are used to build some kind of history or timeline, the order in which they run is not something we can depend on; instead, we should use a timestamp, like 'updated at', to keep track of the order. If a background job depends on another job finishing first, we can run into trouble because we never know which one will finish first. But if we queue only the first job and have it queue the second job as a follow-up once it has completed, we ensure that both jobs run sequentially.
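A minimal sketch of that chaining, with hypothetical job, model, and mailer names:

```ruby
class GenerateInvoiceJob < ApplicationJob
  def perform(order)
    invoice = Invoice.create_for!(order)
    # Queue the dependent job only after this one has done its work,
    # so we never rely on queue ordering between the two.
    SendInvoiceEmailJob.perform_later(invoice)
  end
end

class SendInvoiceEmailJob < ApplicationJob
  def perform(invoice)
    InvoiceMailer.send_invoice(invoice).deliver_now
  end
end
```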
00:12:38.820 If a worker does not confirm that a job was processed successfully within a certain timeframe, we allow another worker to pick up the job from the queue. However, if the first worker didn't actually crash but was just slower than expected, we risk running the job twice. If we do not allow a second worker to pick up the unconfirmed job, we risk the job not running at all. Therefore, we need to choose between at-least-once and at-most-once delivery. At-most-once delivery is only the right choice if running the job twice would be worse than not running it at all.
00:13:58.780 However, more often than not, that points to a flaw in how the job is written. If the job is written to be idempotent, running it twice does not cause issues, and we can opt for at-least-once delivery, ensuring the job does actually run. Idempotency often comes for free, for example when we're removing stale data or writing a default value into an empty column. In other contexts, we need to be more vigilant: if a job loops over all users and sends them an email, or worse, charges them, running that job twice could send the email twice or charge the customer twice.
00:15:12.840 In this case, we should carefully track when an email was sent or when a customer was charged. When the background job processes the loop, it should check this state to determine if the record has already been processed by a previous run of the job. Background jobs do not need to be as fast as user-facing requests. However, long-running jobs block our workers and prevent deployments, resulting in different workers running different revisions of our codebase, which is undesirable.
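Returning to the loop example: one way to do that bookkeeping is a per-record timestamp that every run checks before acting. The charged_at column and the charge! method below are assumptions.

```ruby
class ChargeCustomersJob < ApplicationJob
  def perform
    Customer.find_each do |customer|
      # A previous, partially completed run may already have charged this
      # customer; skip them instead of charging twice.
      next if customer.charged_at.present?

      customer.charge!                            # hypothetical billing call
      customer.update!(charged_at: Time.current)  # record the progress
    end
  end
end
```

For payments in particular, combining this with an idempotency key on the payment provider's side is a more robust guard against two runs overlapping.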
00:16:31.730 If a user-facing request is too slow, we can offload some of its slow tasks into background jobs. If a background job is too slow, we can offload some of its tasks into other background jobs—either by splitting up the tasks into multiple jobs from the outset, or by queuing follow-up jobs from within a background job. This approach may not reduce our overall processing time, but it does reduce the processing time of the individual job, which is what we care about here.
00:18:17.760 Asynchronous techniques like background jobs require extra care when exposing a REST API. If a POST request offloads resource creation into a background job, the status code 201 (Created) is no longer appropriate: the resource has not been created yet, the creation can still fail, and we do not have the location of the new resource to return to the client. Instead, a 202 (Accepted) status code informs the caller that the request was accepted and looks valid, but has not been processed yet.
00:19:38.440 This response can also return a Location header that points to a temporary resource, typically some kind of status monitor. The caller can then poll this monitor to learn about the progress of resource creation. Once the server has created the actual resource, the request to the status monitor can return a 303 (See Other) status code along with the location of the new resource. In this case, the resource the caller requested is the status monitor, but they are redirected to the newly created resource: not just a different response, but an entirely different resource. That is the key difference between a 303 redirect and a regular redirect, which points to the same resource at a different location.
00:20:48.000 A caller can also use the 'Expect' header to request synchronous or asynchronous behavior. If a caller sets the 'Expect' header to a 202 status code, they are indicating that they only accept asynchronous resource creation. Conversely, a caller that expects 200, 201, or 204 is demanding synchronous resource creation. If the server cannot fulfill that expectation, it can return a 417 (Expectation Failed) status code.
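Sketched as Rails controllers, with made-up routes, params, models, and job names, the 202 / status monitor / 303 flow could look roughly like this:

```ruby
class ReportsController < ApplicationController
  # POST /reports -- the actual creation happens in a background job.
  def create
    status = ReportStatus.create!(state: "pending")
    GenerateReportJob.perform_later(status.id, report_params.to_h)  # report_params: strong parameters, omitted here

    # 202 Accepted: the request looks fine but has not been processed yet.
    # The Location header points at a temporary status monitor resource.
    head :accepted, location: report_status_url(status)
  end
end

class ReportStatusesController < ApplicationController
  # GET /report_statuses/:id -- the temporary status monitor.
  def show
    status = ReportStatus.find(params[:id])

    if status.completed?
      # 303 See Other: the caller is redirected to a different resource,
      # namely the newly created report.
      redirect_to report_url(status.report), status: :see_other
    else
      render json: { state: status.state }, status: :ok
    end
  end
end
```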
00:22:14.700 Background jobs provide benefits for applications of varying sizes. For smaller to medium-sized applications, Delayed Job is a great starting point. Delayed Job was extracted from Shopify and open-sourced many years ago. It’s a nice solution to get started with if you're on Rails, as it doesn’t require additional infrastructure; it persists the task queue using Active Record, so it ends up in the database you're already using.
00:23:06.390 The application should not communicate with Delayed Job directly, but use Active Job, which comes with Rails. Active Job is an abstraction layer that makes it easy to replace the background job backend if the application ever outgrows Delayed Job. It allows developers to swap in a different background job backend without needing to rewrite code, as Active Job handles that for us.
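Hooking a backend up through Active Job is a one-line configuration, which is exactly what makes swapping it out later so cheap:

```ruby
# config/application.rb (inside the Application class)
config.active_job.queue_adapter = :delayed_job

# If the application outgrows Delayed Job, switching backends is the
# same one-line change, for example:
#   config.active_job.queue_adapter = :resque
#   config.active_job.queue_adapter = :sidekiq
```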
00:24:39.080 However, if that table gets overloaded, the queue backs up significantly. The overhead of a relational database can become an issue with Delayed Job, and workers are mostly monitored from the outside. If something goes wrong, a worker process often ends up consuming too much memory, and all we can do is kill it and restart it. Restarting is quite painful because it reloads the whole environment, including Rails.
00:25:42.720 We need to perform those restarts regularly since that's essentially the only way to deal with troubled workers. Additionally, keeping track of what's happening with all these workers can be challenging because restarts and reporting happen from the outside. This is where Resque does better. Resque uses Redis to persist the task queue. Redis is primarily known as a datastore, but it has many features that are useful for building great message queues.
00:26:47.000 By using Redis for the task queue, the Resque Ruby code can focus on processing tasks, and it assumes from the outset that workers may run into trouble, such as hanging or consuming too much memory. Resque does not run background jobs directly in the worker process; instead, the worker forks a child process for each job. If anything goes wrong, the worker can kill the child process without needing to restart the whole worker, which is a big advantage compared to Delayed Job.
00:27:56.170 It is also much easier to monitor what's happening, since the worker manages the state of all its child processes; for instance, it knows when it is killing or starting a job. At Shopify, we are currently using Resque at a massive scale: we are queuing tens of thousands of jobs per second, and Resque keeps up with that. However, forking all these child processes is memory-intensive, and Resque's speed can be lacking.
00:29:25.530 Sidekiq improves on this. Sidekiq is fully compatible with Resque, meaning that jobs queued via Resque can be processed by Sidekiq. Sidekiq does not fork child processes; instead, it uses threads for running background jobs, which makes it significantly faster and less memory-intensive. However, it does mean that our code and all of its dependencies must be thread-safe.
00:30:46.810 With the large volume of data processed at Shopify, we have no way around sharding the database. The approach we've chosen is not only to shard the database but also to create what we internally call a pod: a shard that comes with its own Redis, its own workers, and its own application servers. We hook into Resque's startup process and overwrite substantial portions of its code to make it aware of this sharded infrastructure.
00:32:11.500 By having a background job system that understands this architecture, it becomes quite seamless to work with. Our database migrations do not run during the deployment; instead, we have scheduled jobs that check for pending migrations and run them one by one. This means our deployments remain fast and significantly less risky. However, it requires that our code works both before and after running a migration.
00:33:37.380 Even regular data changes at Shopify are often written as database migrations, and because a heavyweight operation on a large MySQL table would be detrimental, in the end these migrations, too, are run as background jobs. When we need to update a large dataset, for example after adding a new column to a database table, we must backfill it with data for existing records. We write regular data migrations that do not change the schema but queue a background job to perform the update.
00:34:50.770 Since it is a database migration, the background job gets queued right after deployment. We use this approach even if we are only updating a smaller dataset, because it means our updates can be treated essentially as code changes: they go through regular quality assurance procedures, such as code review, and they are tracked, which is very helpful for debugging. Backfilling a new column often means iterating over millions of records, resulting in a long-running background job.
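A sketch of such a pair of migrations, with illustrative names: a schema migration adds the column, and a separate data migration queues the backfill job instead of updating millions of rows inline.

```ruby
# A schema migration that adds the new column:
class AddLocaleToUsers < ActiveRecord::Migration[5.2]
  def change
    add_column :users, :locale, :string
  end
end

# A separate data migration that only queues the backfill:
class BackfillUserLocale < ActiveRecord::Migration[5.2]
  def up
    # No schema change here: the heavy lifting happens in a background job,
    # so the migration itself stays fast.
    BackfillUserLocaleJob.perform_later
  end

  def down
    # Nothing to undo; the backfill only fills in missing values.
  end
end

class BackfillUserLocaleJob < ApplicationJob
  def perform
    User.where(locale: nil).in_batches do |batch|
      batch.update_all(locale: "en")
    end
  end
end
```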
00:36:07.000 If we need to shut down the worker for a deployment, Resque makes us wait until the job finishes, effectively blocking deployments for this worker. If we deploy quite frequently, we would end up running many different revisions concurrently, which is definitely not a desirable state. Sidekiq does better here: if we want to shut down a worker, it aborts the running job and re-queues it. However, that is not enough if the job starts from scratch every time it is retried, especially if deployments happen faster than the job can finish.
00:37:57.920 To solve this problem, we have our own abstraction layer for iterating over large collections, which splits a job into two parts: the collection being iterated over and the task to be executed per iteration. The iteration can be done record by record or in batches. After each iteration, a checkpoint is saved and the job is re-queued, so the next time it is picked up by a worker, it resumes from that checkpoint.
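Shopify later open-sourced an abstraction along these lines as the job-iteration gem. A minimal sketch in its spirit (the exact API here should be treated as an assumption) makes the hypothetical backfill job from above interruptible:

```ruby
class BackfillUserLocaleJob < ApplicationJob
  include JobIteration::Iteration

  # Describes WHAT we iterate over. The cursor is the checkpoint that lets
  # a re-queued job resume where it left off.
  def build_enumerator(cursor:)
    enumerator_builder.active_record_on_records(
      User.where(locale: nil),
      cursor: cursor,
    )
  end

  # Describes what to do for a single record. Between iterations the job
  # can be interrupted, checkpointed, and re-queued.
  def each_iteration(user)
    user.update!(locale: "en")
  end
end
```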
00:39:33.240 This effectively gives us interruptible jobs with automatic resuming. Firstly, it allows for more frequent deployments: when shutting down a worker, we do not need to wait for the entire collection to be processed, only for the current iteration to finish. It also helps with disaster prevention; if our database experiences issues, for example, we can throttle a running job or stop it completely, preventing an ongoing incident from getting worse.
00:40:58.810 If we need to move data around, interruptible jobs help us preserve data integrity, because we can automate everything without developers having to worry about our sharded infrastructure. Controlling the iterations also simplifies progress tracking for large jobs, and it allows for parallel processing: we can run multiple iterations simultaneously on different workers. Some of these advantages could be had by simply breaking the large job into many smaller ones, but that makes it harder to track the overall job's progress or success.
00:42:40.600 Our abstraction layer makes this easier because a 'resume' is not a 'retry', allowing us to observe retries and failures for the complete long-running task. Encapsulating all of this in an additional layer hides the scaling issues from developers writing the background jobs, helping us maintain a simple codebase. Background jobs are easy to implement, allowing even small applications to improve response times, but they do require us to make trade-offs regarding data consistency and delivery guarantees.
00:43:37.640 However, in the end, they help us keep our codebase organized and manageable, even when we are working at a massive scale. Thank you very much for your attention. If you have any questions or comments, here's my Twitter handle. I'll also be available for the rest of the weekend. If you're interested in working on a system that queues tens of thousands of jobs per second and processes them, please come and talk to me—Shopify is hiring!
00:45:03.550 Thank you so much.