00:00:00.240
Nice, welcome back! We're already at our final talk of the day here in Main Track Land. Our next two speakers are not strangers to the community. Ramon, can you tell our audience a bit about them? Absolutely! Here we go, folks.
00:00:15.759
Our next speaker comes in twos. We've got Kir, who is a platform engineer at Shopify and quite the culinary expert. He enjoys cooking and touring his favorite places to eat, and I hear he writes about them. Joining him is Julik, a software developer at WeTransfer, where he is responsible for the backend components of the Transfer product. He gave a lightning talk at NoRuKo last year, and he's also a musician who plays the trumpet. Let's hope he makes an appearance at karaoke today! But I'll let him toot his own horn.
00:00:51.520
Well folks, let's hear a little bit about sleeping on the job.
00:01:06.799
Hello everyone, and welcome to this talk about background jobs. My name is Julik, and today with me, we have the wonderful Mr. Kir. We thought it was really worth discussing background jobs because, while they can be very useful, they can also pose significant risks. Background jobs can crash your application.
00:01:18.479
The danger with background jobs lies in their delayed execution. They do not start executing immediately when you spool them up; rather, they run for a variable amount of time, which is not always predictable. For example, you might expect to complete a certain job within 300 milliseconds, but instead, it could take 10 seconds or even a minute.
00:01:31.680
Even if you're only running a single Sidekiq worker, you may have multiple flows of execution happening simultaneously. Some of those jobs will take longer than others to complete, or fail altogether, making it difficult to predict how long a specific task will take in your application. We think this session can be beneficial even if you're not running a large application. Understanding how to safely manage background jobs may help you better size your service and figure out how many servers you need.
00:01:55.600
Additionally, this knowledge could offer insights into where to look if you feel that your background job cluster or code is not functioning optimally. This is a touchy topic since managing background jobs involves controlling parallel processing over time, sometimes in non-immediate ways.
00:02:32.640
This image shows the control panel of an RBMK nuclear reactor at Chernobyl. The control panel on the right allows manual control of the load or the control rods within the reactor channels. The role of the nuclear reactor operator was to ensure that these channels heated evenly and produced power consistently. It required significant manual input.
00:03:12.319
This theme of managing background job executions evenly in terms of duration and load will be a recurring topic in our presentation. The first thing we need is a way to visualize background jobs appropriately. We suggest a timeline swim-lane view, where you can see your flows of execution laid out in time from left to right.
00:03:28.400
In this visualization, a job is represented as a green block, which gets added to your message queue when you call something like 'ProcessPaymentJob.perform_later'. It gets pushed onto the queue, where it spends some time before being picked up and executed by one of the worker threads. After execution, it completes, and the process continues for subsequent jobs. Keeping an eye on multiple workers and threads of execution is essential.
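This lifecycle can be sketched with a minimal in-memory queue; this is a simplified stand-in for Redis or SQS, and 'ProcessPaymentJob' is the hypothetical job class from the example above, not real code from either company:

```ruby
require "thread"

# A minimal in-memory job queue standing in for Redis/SQS.
QUEUE = Queue.new

class ProcessPaymentJob
  def self.perform_later(payment_id)
    # Enqueue: the job waits in the queue until a worker picks it up.
    QUEUE << [name, payment_id]
  end

  def perform(payment_id)
    "processed payment #{payment_id}"
  end
end

# A worker loop would pull jobs off the queue and execute them:
ProcessPaymentJob.perform_later(42)
job_class, arg = QUEUE.pop
puts Object.const_get(job_class).new.perform(arg) # => "processed payment 42"
```

The time a job spends between the `perform_later` push and the `pop` is exactly the queue latency the swim-lane view makes visible.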
00:04:02.720
To effectively manage your background jobs, it's vital to have a basic understanding of what your system is doing. You should visualize your workflows effectively, hoping to maintain the same level of observability in your live production system. It is crucial to know the status of your background jobs: whether they are stuck and at what rate they are executing. Observability is key.
00:04:45.679
There are various metrics you want to monitor regarding your background jobs, such as throughput, the rate of jobs processed over time, performance metrics for job classes, and their distributions, typically focused on p99, p95, and p50 metrics. At Shopify, we use Datadog for monitoring because we prefer building our infrastructure for e-commerce ourselves while leveraging third-party vendors for monitoring.
00:06:07.600
On the other hand, WeTransfer operates differently. We prioritize a more frugal approach, utilizing AppSignal for application performance monitoring instead. With AppSignal, creating extra metrics is straightforward and inexpensive, allowing us to monitor our background jobs efficiently.
00:06:30.400
An example of a dashboard could include metrics showing the number of jobs received for execution by our workers and how many jobs are currently executing. Observational patterns such as this can help identify spikes in specific job types, providing valuable insights into the overall system performance. For instance, if we see a spike in deletion jobs executing at night, it indicates a shift in our background processing workload.
00:07:11.440
It's essential to identify outlier jobs that take significantly longer than average or those that fail repeatedly. Another critical topic to discuss is the two-step deployment process. Generally, with a blue-green release, your web servers and background workers briefly run different code versions, and a job may be enqueued during that window.
00:08:29.440
However, you may face complications if your background worker server is still running an older application version while your web application server has already introduced the new job class. This mismatch can lead to errors when the old worker attempts to pick up the new job, which won't be executable.
00:09:49.760
To address this, consider implementing a two-step deployment process. In the first deployment, you make the job code available so that all workers recognize the job class; only once all background workers are running the updated code do you ship the change that enqueues the job. It's a simple but effective strategy.
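A hedged sketch of how the second step can be gated; the 'FEATURE_FLAGS' hash and 'NotifyCustomerJob' are hypothetical stand-ins for whatever rollout mechanism and job class you actually use:

```ruby
# Step 1 ships NotifyCustomerJob so every worker can recognize it.
# Step 2 flips the flag (or ships the caller) once all workers are updated.
FEATURE_FLAGS = { "enqueue_notify_customer" => false }

class NotifyCustomerJob
  def self.perform_later(customer_id)
    # Guard the enqueue until every worker knows this class.
    return :skipped unless FEATURE_FLAGS["enqueue_notify_customer"]
    :enqueued # real code would push the job onto the queue here
  end
end

NotifyCustomerJob.perform_later(1) # => :skipped while rolling out
FEATURE_FLAGS["enqueue_notify_customer"] = true
NotifyCustomerJob.perform_later(1) # => :enqueued
```

The same effect can be had without a flag by simply merging and deploying the job class first, then the calling code in a second deploy.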
00:10:57.760
For smaller applications, it is also possible to deploy in sequence: first updating your background workers, then your web application. At WeTransfer, we prioritize this strategy during code reviews, ensuring that changes related to job codes and their invocation are correctly sequenced.
00:11:40.640
As human reviewers, we may overlook critical aspects of a pull request. At Shopify, we utilize GitHub bots to flag specific changes or patterns within PRs, prompting authors to review documentation related to two-step deployments to reduce errors.
00:12:36.440
Additionally, we must consider job storage. While discussing background jobs and best practices, it's essential to address how job storage scales and what makes it durable. At Shopify, we've been using Redis for the last decade, with extensive operational experience. We run highly available Redis clusters and have invested in a robust Redis stack.
00:13:38.239
This includes two instances for high availability. Jobs are pushed onto the queue with LPUSH, and a worker moves each job into a working list, acknowledging it once it has been processed successfully.
00:14:45.680
However, Redis is single-threaded and effectively bound to a single CPU core, so as your workload increases, you'll need a plan for horizontal scaling, such as sharding across multiple Redis instances. Meanwhile, WeTransfer decided to utilize AWS SQS as its message queue service, which offers benefits such as built-in durability and simplicity.
00:15:49.920
SQS is user-friendly with minimal complexity, making it an attractive solution for background job management, despite some limits on prioritizing requests. While it can manage a single queue efficiently, it may lack the robust features needed for complex scenarios.
00:17:14.240
Next, it's important to address sequential execution. This becomes an issue as your application scales. For example, a large number of jobs can be queued simultaneously for the same resource, leading to overwhelming pressure on your database or external APIs.
00:18:25.520
If one job monopolizes a worker thread for an extended time, it prevents other jobs from executing. To mitigate this, break long-running jobs into smaller segments. This allows scheduling flexibility, enabling other jobs to process in between.
00:19:05.759
Chopping jobs into manageable pieces avoids monopolization of resources and allows for improved responsiveness across your tasks.
00:19:59.680
It's worth noting that modern cloud environments require jobs to be interruptible, as workloads can be terminated with minimal notice.
00:20:22.480
This requirement means that jobs, like those iterating over large datasets, must be capable of pausing and resuming, avoiding the total loss of progress.
00:21:09.600
At Shopify, we spent considerable time developing a mechanism called the iteration API to address this challenge. Instead of defining a job with a simple method, we structure it using two methods: one for defining the collection of items to process and another for the action taken on each item.
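A simplified, self-contained sketch of that two-method shape. Shopify's actual implementation is the open-source job-iteration gem; the tiny runner below, with its cursor and artificial interruption, is a stand-in so the example runs without the gem:

```ruby
class DeleteRecordsJob
  # Two methods instead of one big perform: what to iterate over,
  # and what to do for each item. The cursor lets a job resume.
  def build_enumerator(cursor:)
    records = (1..6).to_a # stand-in for an ActiveRecord relation
    start = cursor || 0
    records[start..].each_with_index.map { |rec, i| [rec, start + i + 1] }
  end

  def each_iteration(record)
    @processed = (@processed || []) << record
  end

  attr_reader :processed

  # Minimal runner: process until done or an interruption is signalled,
  # returning the cursor to resume from (nil when finished).
  def run(cursor: nil, interrupt_after: nil)
    build_enumerator(cursor: cursor).each do |record, next_cursor|
      each_iteration(record)
      return next_cursor if interrupt_after && record >= interrupt_after
    end
    nil
  end
end

job = DeleteRecordsJob.new
cursor = job.run(interrupt_after: 3) # interrupted mid-way, cursor saved
job.run(cursor: cursor)              # resumes where it left off
puts job.processed.inspect # => [1, 2, 3, 4, 5, 6]
```

In the real gem the framework, not the job, decides when to interrupt (for example when the worker is asked to shut down) and persists the cursor with the re-enqueued job.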
00:21:58.640
This design allows for better job management and can adapt to interruptions while providing insights into job performance. The enumerator builder pattern we've adopted makes it easier for developers to understand job iteration.
00:22:41.440
As we implemented this, developers needed guidance, but with proper documentation and onboarding, they quickly adapted.
00:23:27.050
Moving forward, it’s crucial to address resource congestion. At times, traffic spikes or unexpected loads can overwhelm workers. Increasing workers isn’t always the solution; you may just hit the same bottlenecks again.
00:24:16.960
It's vital to identify system bottlenecks, such as database write limits or third-party API rate limits, and implement throttling measures to improve system stability.
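A hedged sketch of one throttling shape: check a limiter before doing the work, and step aside instead of hammering the bottleneck. The token counter here is a simplified in-memory stand-in for a shared, typically Redis-backed, limiter:

```ruby
# A tiny token bucket standing in for a shared rate limiter.
class Throttle
  def initialize(limit)
    @tokens = limit
  end

  def allow?
    return false if @tokens.zero?
    @tokens -= 1
    true
  end
end

DB_WRITE_THROTTLE = Throttle.new(2)

class BulkWriteJob
  def perform(batch)
    # If the bottleneck is saturated, back off instead of piling on;
    # real code would re-enqueue itself with a delay here.
    return :requeued_with_delay unless DB_WRITE_THROTTLE.allow?
    :written # real code would perform the database writes here
  end
end

puts BulkWriteJob.new.perform([1]) # => :written
puts BulkWriteJob.new.perform([2]) # => :written
puts BulkWriteJob.new.perform([3]) # => :requeued_with_delay
```

The key design choice is that the throttle protects the shared resource (the database, the third-party API), not the worker pool, so adding more workers can't push past it.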
00:25:12.320
With the implementation of features that encapsulate workload units, developers can adhere to standardized throttling practices, thereby managing performance seamlessly across the platform.
00:26:11.440
Another aspect to consider is the phenomenon of job piling, where multiple jobs target the same resource at once, potentially overwhelming that resource, such as with API calls. S3 might yield a 'slow down' error if overloaded with requests.
00:27:01.840
To address this, we apply a concurrent-execution lock keyed on the shared resource: if one job is already executing against that resource, subsequent jobs with the same parameters are delayed appropriately.
00:27:48.400
For example, if jobs try to delete from the same S3 prefix at the same time, we ensure that if a new job arrives attempting the same call, it gets canceled or delayed. This strategy is essential to maintaining operational efficiency.
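A minimal sketch of such a lock; the hash stands in for Redis, where in production you would use an atomic 'SET key value NX EX ttl' so the lock expires if a worker crashes while holding it:

```ruby
# In-memory stand-in for Redis; real code would use SET ... NX EX.
LOCKS = {}

def acquire_lock(key)
  return false if LOCKS.key?(key) # another job holds it: delay/cancel
  LOCKS[key] = true
end

def release_lock(key)
  LOCKS.delete(key)
end

class DeletePrefixJob
  def perform(prefix)
    # The key derives from the job's parameters, so only jobs touching
    # the same S3 prefix contend with each other.
    lock_key = "lock:delete:#{prefix}"
    return :delayed unless acquire_lock(lock_key)
    begin
      :deleted # real code would issue the S3 deletions here
    ensure
      release_lock(lock_key)
    end
  end
end

puts DeletePrefixJob.new.perform("uploads/2021") # => :deleted
```

A job that returns :delayed would typically be re-enqueued with a short backoff rather than dropped, so the work still happens, just not concurrently.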
00:28:38.720
As for controlling concurrency, both Shopify and WeTransfer utilize Redis as a metadata store to manage concurrent job execution effectively, minimizing resource contention issues.
00:29:32.160
During our session, we also need to discuss disaster recovery. If a bug inadvertently creates an infinite loop causing a job to queue excessively, we should have measures to cancel faulty jobs quickly via chat ops.
00:30:30.080
Shopify has implemented advanced chat ops for job operations, allowing selective cancellation of jobs based on specific filters or parameters.
00:31:35.440
Conversely, at WeTransfer, we often utilize a simpler system built on Redis. By setting keys that dictate job cancellation, we can manage runaway jobs.
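A hedged sketch of that simpler kill switch: each job checks a cancellation key before doing any work, so setting a single key from a console or chat ops stops a runaway job class. The hash and 'ThumbnailJob' are hypothetical stand-ins; the real store would be Redis, with a TTL on the key:

```ruby
# Stand-in for Redis; chat ops or a console would set this key.
CONTROL_KEYS = {}

class ThumbnailJob
  def perform(upload_id)
    # Bail out early if this job class has been switched off.
    return :cancelled if CONTROL_KEYS["disable:ThumbnailJob"]
    :done # real work would happen here
  end
end

puts ThumbnailJob.new.perform(1) # => :done
CONTROL_KEYS["disable:ThumbnailJob"] = "1" # flip the kill switch
puts ThumbnailJob.new.perform(2) # => :cancelled
```

Because the check runs at the top of every perform, already-enqueued copies of a buggy job drain harmlessly instead of having to be deleted from the queue one by one.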
00:32:20.480
We also discussed job prioritization strategy, noting that while most libraries suggest creating multiple queues with assigned priorities, it's more complicated in practice. The assumption is that a higher priority gives a job an operational advantage, yet in practice that's not guaranteed.
00:33:40.159
In the case of e-commerce platforms, understanding job importance can vary significantly—checkout jobs typically require higher urgency compared to other backend tasks. The operational challenges arise with executing jobs of varying priorities and ensuring resource availability.
00:34:51.280
Ultimately, determining job prioritization is tricky due to the unpredictable nature of job performance and system stress points. At WeTransfer, we focus on establishing a balance without over-relying on prioritization.
00:35:40.000
As we wrap up, it’s clear that while Shopify and WeTransfer have different operational strategies and scales, they face similar challenges regarding background job management. Regardless of the number of developers, the need for effective concurrency controls, robust deployment strategies, and efficient monitoring exists.
00:37:05.360
The tools and practices we've described developed from experience and trial; by sharing these learnings, we hope to empower the broader community to build reliable background job architectures.
00:39:27.120
We aim to provide insights that help you adapt strategies best suited to your needs. Thank you for tuning in, and we look forward to any questions.
00:41:00.000
Thank you so much to both of you for your enlightening talk. As we move closer to our panel discussion, we will skip the quick questions and transition directly into the panel.