Summarized using AI

Sleeping on the job

Julik Tarkhanov and Kir Shatrov • August 21, 2020 • online

The video titled "Sleeping on the Job" features Kir Shatrov and Julik Tarkhanov discussing the challenges and strategies associated with managing background jobs in software applications. They illustrate how background jobs, while useful for executing delayed tasks, can pose risks, such as crashing applications due to unpredictable execution times. The speakers emphasize the importance of effectively visualizing and monitoring background job workflows to enhance control and reliability.

Key points discussed include:
- Delayed Execution Risks: Background jobs do not execute immediately; instead, they can take varying times to complete, complicating performance predictions and resource management.
- Visualization Techniques: A timeline swim lane view is proposed as an effective way to visualize job execution flows, allowing developers to track multiple worker threads over time.
- Monitoring Metrics: Essential metrics include job throughput and performance distributions, which are crucial for performance assessment. Different companies utilize various monitoring tools like Datadog at Shopify and AppSignal at WeTransfer, reflecting their operational needs.
- Two-step Deployment Strategy: They recommend a two-step (blue-green) deployment strategy to prevent version mismatches between the web application and background workers from causing execution errors.


We all love our Sidekiqs and our Resques. But they do let us down sometimes. Not because they are bad, but because queueing theory limits us. There is a way to break out of the madness though - let’s explore how to get our job queues under control.

Kir Shatrov is a platform engineer at Shopify, where he works on the scalability and reliability of one of the world’s largest ecommerce platforms. When not working, Kir enjoys cooking, gastronomic tourism (he even has a GitHub repo with his favourite spots!) and exploring London by bike.
Julik Tarkhanov is a software developer at WeTransfer, where he is responsible for the backend components of the Transfer product, enabling effortless transfer of creative ideas. Prior to WeTransfer he worked in the visual effects industry, creating images that inspire and befuddle. In his free time he explores weird user interfaces and plays the trumpet.

Welcome to the #NoRuKo conference. A virtual unconference organized by Stichting Ruby NL.

#NoRuKo playlist with all talks and panels: https://www.youtube.com/playlist?list=PL9_A7olkztLlmJIAc567KQgKcMi7-qnjg

Recorded 21st of August, 2020.
NoRuKo website: https://noruko.org/
Stichting Ruby NL website: https://rubynl.org/

NoRuKo 2020

00:00:00.240 Nice, welcome back! We're already at our final talk of the day here in Main Track Land. Our next two speakers are not strangers to the community. Ramon, can you tell our audience a bit about them? Absolutely! Here we go, folks.
00:00:15.759 Our next speaker comes in twos. We've got Kir, who is a platform engineer at Shopify and quite the culinary expert. He enjoys cooking and touring his favorite places to eat, and I hear he writes about them. Joining him is Julik, a software developer at WeTransfer, where he is responsible for the backend components of the Transfer product. He gave a lightning talk at NoRuKo last year, and he's also a musician who plays the trumpet. Let's hope he makes an appearance at karaoke today! But I'll let him toot his own horn.
00:00:51.520 Well folks, let's hear a little bit about sleeping on the job.
00:01:06.799 Hello everyone, and welcome to this talk about background jobs. My name is Julik, and today with me, we have the wonderful Mr. Kir. We thought it was really worth discussing background jobs because, while they can be very useful, they can also pose significant risks. Background jobs can crash your application.
00:01:18.479 The danger with background jobs lies in their delayed execution. They do not start executing immediately when you spool them up; rather, they run for a variable amount of time, which is not always predictable. For example, you might expect to complete a certain job within 300 milliseconds, but instead, it could take 10 seconds or even a minute.
00:01:31.680 Even if you're only running a single Sidekiq worker, you may have multiple flows of execution happening simultaneously. Some of those jobs will take longer than others to complete or fail, making it difficult to predict how long a specific task will take in your application. We think this session can be beneficial even if you're not running a large application. Understanding how to safely manage background jobs may help you better size your service and figure out how many servers you need.
00:01:55.600 Additionally, this knowledge could offer insights into where to look if you feel that your background job cluster or code is not functioning optimally. This is a touchy topic since managing background jobs involves controlling parallel processing over time, sometimes in non-immediate ways.
00:02:32.640 This image shows the control panel of an RBMK nuclear reactor at Chernobyl. The control panel on the right allows manual control of the load or the control rods within the reactor channels. The role of the nuclear reactor operator was to ensure that these channels heated evenly and produced power consistently. It required significant manual input.
00:03:12.319 This theme of managing background job executions evenly in terms of duration and load will be a recurring topic in our presentation. The first thing we need is a way to visualize background jobs appropriately. We suggest a timeline swim lane view, where you can see your flows of execution laid out in time from left to right.
00:03:28.400 In this visualization, a job is shown in green: it gets added to your message queue when you call something like ProcessPaymentJob.perform_later. It sits in the queue for some time before being picked up and executed by one of the worker threads. After execution, it completes, and the process continues for subsequent jobs. Keeping an eye on multiple workers and threads of execution is essential.
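The lifecycle just described can be sketched in plain Ruby, with an array standing in for the real message queue. ProcessPaymentJob and the QUEUE constant are illustrative names, not anything from the talk's codebases:

```ruby
# A minimal sketch (no Rails) of the lifecycle: enqueue, wait in the
# queue, then execute on a worker. The array-backed QUEUE stands in
# for Redis or SQS.
QUEUE = []

class ProcessPaymentJob
  def self.perform_later(payment_id)
    # In a real app this serializes the job into the message queue.
    QUEUE << { job: name, args: [payment_id], enqueued_at: Time.now }
  end

  def perform(payment_id)
    "processed payment #{payment_id}"
  end
end

ProcessPaymentJob.perform_later(42)

# Some time later, a worker thread pulls the message and executes it.
message = QUEUE.shift
result  = Object.const_get(message[:job]).new.perform(*message[:args])
```

The gap between `enqueued_at` and the moment the worker calls `perform` is exactly the unpredictable delay the speakers warn about.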
00:04:02.720 To effectively manage your background jobs, it's vital to have a basic understanding of what your system is doing. You should visualize your workflows effectively, hoping to maintain the same level of observability in your live production system. It is crucial to know the status of your background jobs: whether they are stuck and at what rate they are executing. Observability is key.
00:04:45.679 There are various metrics you want to monitor regarding your background jobs, such as throughput, the rate of jobs processed over time, performance metrics for job classes, and their distributions, typically focused on p99, p95, and p50 metrics. At Shopify, we use Datadog for monitoring because we prefer building our infrastructure for e-commerce ourselves while leveraging third-party vendors for monitoring.
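As a small illustration of the distribution metrics mentioned above, here is one simple (nearest-rank-style) way to compute p50 and p99 from a list of job durations; the sample numbers are made up:

```ruby
# Percentiles over recorded job durations, the per-job-class
# distribution metrics the talk recommends tracking.
def percentile(samples, p)
  sorted = samples.sort
  rank = (p / 100.0 * (sorted.length - 1)).round
  sorted[rank]
end

durations_ms = [12, 15, 14, 300, 18, 16, 13, 17, 15, 14]
p50 = percentile(durations_ms, 50) # the typical job
p99 = percentile(durations_ms, 99) # the tail: one slow outlier dominates
```

Note how a single 300 ms outlier leaves p50 untouched but defines p99, which is why tracking only averages hides exactly the jobs you need to investigate.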
00:06:07.600 On the other hand, WeTransfer operates differently. We prioritize a more frugal approach, utilizing AppSignal for application performance monitoring instead. With AppSignal, creating extra metrics is straightforward and inexpensive, allowing us to monitor our background jobs efficiently.
00:06:30.400 An example of a dashboard could include metrics showing the number of jobs received for execution by our workers and how many jobs are currently executing. Observational patterns such as this can help identify spikes in specific job types, providing valuable insights into the overall system performance. For instance, if we see a spike in deletion jobs executing at night, it indicates a shift in our background processing workload.
00:07:11.440 It's essential to identify outlier jobs that take significantly longer than average, or that fail repeatedly. Another critical topic to discuss is the two-step deployment process. Imagine a blue-green release in which the newly deployed code enqueues a job mid-rollout.
00:08:29.440 However, you may face complications if your background worker server is still running an older application version while your web application server has already introduced the new job class. This mismatch can lead to errors when the old worker attempts to pick up the new job, which won't be executable.
00:09:49.760 To address this, consider implementing a two-step deployment process. In the first deployment, you make the job code available so all workers recognize the job class; only once all background workers are running the updated code do you start enqueueing or executing the job. It's a simple but effective strategy.
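One way to implement the two-step rollout is to gate enqueues behind a flag: deploy one ships the job class with enqueues disabled, and deploy two flips the flag once every worker has the new code. ENQUEUE_FLAGS, safe_enqueue and ExportOrdersJob are hypothetical names for this sketch:

```ruby
# Deploy 1: the class exists everywhere, but nothing enqueues it yet.
ENQUEUE_FLAGS = { "ExportOrdersJob" => false }

class ExportOrdersJob
  def self.perform_later(*args); end # stand-in for the real enqueue
end

def safe_enqueue(job_class, *args)
  return :skipped unless ENQUEUE_FLAGS.fetch(job_class.name, false)
  job_class.perform_later(*args)
  :enqueued
end

first = safe_enqueue(ExportOrdersJob)   # during deploy 1: no-op
ENQUEUE_FLAGS["ExportOrdersJob"] = true # deploy 2: all workers updated
second = safe_enqueue(ExportOrdersJob)  # now safe to enqueue
```

An old worker can never pick up a job class it does not know about, because no such job is enqueued until the flag flips.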
00:10:57.760 For smaller applications, it is also possible to deploy in sequence: first updating your background workers, then your web application. At WeTransfer, we prioritize this strategy during code reviews, ensuring that changes related to job codes and their invocation are correctly sequenced.
00:11:40.640 As human reviewers, we may overlook critical aspects of a pull request. At Shopify, we utilize GitHub bots to flag specific changes or patterns within PRs, prompting authors to review documentation related to two-step deployments to reduce errors.
00:12:36.440 Additionally, we must consider job storage. While discussing background jobs and best practices, it's essential to address how job storage scales and what makes it durable. At Shopify, we've been using Redis for the last decade, with extensive operational experience. We run highly available Redis clusters and have invested in a robust Redis stack.
00:13:38.239 This includes running two instances for high availability. Jobs are pushed onto a queue list with LPUSH; a worker then moves each job onto an in-progress list, and the job is removed, acknowledging it, once it has been processed successfully.
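The push/claim/acknowledge flow just described matches the classic Redis reliable-queue pattern (LPUSH, RPOPLPUSH, LREM). The sketch below uses an in-memory FakeRedis stand-in, a Hash of arrays, so it runs without a server; the method names mirror the Redis commands:

```ruby
# In-memory stand-in for the three Redis list operations used by the
# reliable-queue pattern.
class FakeRedis
  def initialize
    @lists = Hash.new { |h, k| h[k] = [] }
  end

  def lpush(key, val)
    @lists[key].unshift(val)
  end

  # Atomically pop the oldest item from src and push it onto dst.
  def rpoplpush(src, dst)
    val = @lists[src].pop
    @lists[dst].unshift(val) unless val.nil?
    val
  end

  def lrem(key, val)
    @lists[key].delete(val)
  end

  def llen(key)
    @lists[key].length
  end
end

redis = FakeRedis.new
redis.lpush("queue:default", "job-1")

job = redis.rpoplpush("queue:default", "queue:in_progress") # claim the job
# ... perform the job here ...
redis.lrem("queue:in_progress", job)                        # acknowledge it
```

If a worker dies mid-job, the job is still sitting on the in-progress list instead of being lost, which is what makes the pattern "reliable".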
00:14:45.680 However, Redis is effectively single-threaded and can use only one CPU core, so as your workload increases, you'll need a plan for horizontal scaling. Meanwhile, WeTransfer decided to utilize AWS's SQS for message queue services, which offers benefits such as built-in durability and simplicity.
00:15:49.920 SQS is user-friendly with minimal complexity, making it an attractive solution for background job management, despite some limits on prioritizing requests. While it can manage a single queue efficiently, it may lack the robust features needed for complex scenarios.
00:17:14.240 Next, it's important to address sequential execution. This becomes an issue as your application scales. For example, a large number of jobs can be queued simultaneously for the same resource, leading to overwhelming pressure on your database or external APIs.
00:18:25.520 If one job monopolizes a worker thread for an extended time, it prevents other jobs from executing. To mitigate this, break long-running jobs into smaller segments. This allows scheduling flexibility, enabling other jobs to process in between.
00:19:05.759 Chopping jobs into manageable pieces avoids monopolization of resources and allows for improved responsiveness across your tasks.
00:19:59.680 It's worth noting that modern cloud environments necessitate jobs to be interruptible, as workloads can be terminated with minimal notice.
00:20:22.480 This requirement means that jobs, like those iterating over large datasets, must be capable of pausing and resuming, avoiding the total loss of progress.
00:21:09.600 At Shopify, we spent considerable time developing a mechanism called the iteration API to address this challenge. Instead of defining a job with a simple method, we structure it using two methods: one for defining the collection of items to process and another for the action taken on each item.
00:21:58.640 This design allows for better job management and can adapt to interruptions while providing insights into job performance. The enumerator builder pattern we've adopted makes it easier for developers to understand job iteration.
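The two-method structure described above (in the spirit of Shopify's job-iteration library) can be sketched with a toy runner: `build_enumerator` declares what to iterate, `each_iteration` handles one item, and a cursor lets the run be interrupted and resumed. The runner and job names here are illustrative, not the real framework:

```ruby
class DeleteStaleRecordsJob
  def build_enumerator(ids, cursor:)
    start = cursor || 0
    # Yield [item, item_index]; the index is the resume cursor.
    ids.each_with_index.drop(start)
  end

  def each_iteration(id, deleted)
    deleted << id # stand-in for the real per-record work
  end
end

# Toy runner: performs at most `budget` iterations, then returns the
# cursor where a later run should resume (nil when finished).
def run(job, ids, deleted, cursor: nil, budget: Float::INFINITY)
  done = 0
  job.build_enumerator(ids, cursor: cursor).each do |id, index|
    return index if done >= budget # interrupted; resume at this index
    job.each_iteration(id, deleted)
    done += 1
  end
  nil
end

deleted = []
# First run is interrupted after two iterations (e.g. the node is
# being terminated); the second run resumes from the saved cursor.
cursor = run(DeleteStaleRecordsJob.new, [10, 20, 30], deleted, budget: 2)
run(DeleteStaleRecordsJob.new, [10, 20, 30], deleted, cursor: cursor)
```

Because the framework checkpoints the cursor rather than the job's local state, an interruption costs at most one iteration of progress instead of the whole run.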
00:22:41.440 As we implemented this, developers needed guidance, but with proper documentation and onboarding, they quickly adapted.
00:23:27.050 Moving forward, it’s crucial to address resource congestion. At times, traffic spikes or unexpected loads can overwhelm workers. Increasing workers isn’t always the solution; you may just hit the same bottlenecks again.
00:24:16.960 It's vital to identify system bottlenecks, such as database write limits or third-party API rate limits, and implement throttling measures to improve system stability.
00:25:12.320 With the implementation of features that encapsulate workload units, developers can adhere to standardized throttling practices, thereby managing performance seamlessly across the platform.
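A throttle like the ones described can be sketched as a shared counter per time window: a job over the limit would re-enqueue itself for later instead of hitting the bottleneck. This is a hypothetical minimal design, with the clock injected so it stays deterministic:

```ruby
# Windowed throttle: counts attempts per time bucket and refuses
# callers beyond the limit for that bucket.
class Throttle
  def initialize(limit:, window_seconds:)
    @limit  = limit
    @window = window_seconds
    @counts = Hash.new(0) # window bucket => attempts seen
  end

  # May the caller proceed at time `now` (epoch seconds)?
  def allow?(now)
    bucket = (now / @window).floor
    @counts[bucket] += 1
    @counts[bucket] <= @limit
  end
end

throttle = Throttle.new(limit: 2, window_seconds: 60)
results  = Array.new(3) { throttle.allow?(0) } # three jobs, same minute
# The third job is over the limit and would re-enqueue itself.
```

In production the counter would live in a shared store such as Redis so that all workers see the same limit, but the decision logic is the same.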
00:26:11.440 Another aspect to consider is the phenomenon of job piling, where multiple jobs target the same resource at once, potentially overwhelming that resource, such as with API calls. S3 might yield a 'slow down' error if overloaded with requests.
00:27:01.840 To address this, we apply a concurrent execution lock to manage job processing effectively, focusing on jobs with shared resources; if one job is executing, subsequent jobs with the same parameters are delayed appropriately.
00:27:48.400 For example, if jobs try to delete from the same S3 prefix at the same time, we ensure that if a new job arrives attempting the same call, it gets canceled or delayed. This strategy is essential to maintaining operational efficiency.
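The concurrent execution lock described above can be sketched with a key derived from the job's parameters: the first job claims the key and runs, and a duplicate arriving mid-execution is delayed instead. LOCKS is an in-memory stand-in for what would be a Redis key set with NX semantics and a TTL:

```ruby
LOCKS = {}

def with_concurrency_lock(key)
  return :delayed if LOCKS.key?(key) # another job holds the lock
  LOCKS[key] = true
  begin
    yield
    :ran
  ensure
    LOCKS.delete(key) # always release, even if the work raises
  end
end

inner = nil
outer = with_concurrency_lock("delete:s3-prefix/abc") do
  # A second job with the same parameters arrives mid-execution:
  inner = with_concurrency_lock("delete:s3-prefix/abc") { :work }
end
```

Keying the lock on the parameters (here the S3 prefix) rather than the job class means unrelated deletions still run in parallel; only true duplicates are serialized.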
00:28:38.720 As for controlling concurrency, both Shopify and WeTransfer utilize Redis as a metadata store to manage concurrent job execution effectively, minimizing resource contention issues.
00:29:32.160 During our session, we also need to discuss disaster recovery. If a bug inadvertently creates an infinite loop causing a job to queue excessively, we should have measures to cancel faulty jobs quickly via chat ops.
00:30:30.080 Shopify has implemented advanced chat ops for job operations, allowing selective cancellation of jobs based on specific filters or parameters.
00:31:35.440 Conversely, at WeTransfer, we often utilize a simpler system built on Redis. By setting keys that dictate job cancellation, we can manage runaway jobs.
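The Redis-key cancellation approach just described amounts to a kill switch checked at the start of each unit of work; an operator sets the key (via console or chat ops) to stop a runaway job class. KILL_KEYS and the job name below are illustrative stand-ins:

```ruby
KILL_KEYS = {} # stand-in for Redis

def perform_guarded(job_class_name)
  # Bail out before doing any work if the kill key is set.
  return :cancelled if KILL_KEYS.key?("kill:#{job_class_name}")
  yield
  :completed
end

before = perform_guarded("BrokenFanoutJob") { :work }
KILL_KEYS["kill:BrokenFanoutJob"] = true # operator flips the switch
after  = perform_guarded("BrokenFanoutJob") { :work }
```

Because every queued instance checks the key as it is picked up, the backlog drains quickly and cheaply without having to scan and delete individual messages from the queue.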
00:32:20.480 We also discussed job prioritization strategy, noting that while all libraries suggest forming multiple queues with assigned priorities, it’s more complicated in practice. The assumption is that a higher priority lends it an operational advantage, yet that’s not guaranteed.
00:33:40.159 In the case of e-commerce platforms, understanding job importance can vary significantly—checkout jobs typically require higher urgency compared to other backend tasks. The operational challenges arise with executing jobs of varying priorities and ensuring resource availability.
00:34:51.280 Ultimately, determining job prioritization is tricky due to the unpredictable nature of job performance and system stress points. At WeTransfer, we focus on establishing a balance without over-relying on prioritization.
00:35:40.000 As we wrap up, it’s clear that while Shopify and WeTransfer have different operational strategies and scales, they face similar challenges regarding background job management. Regardless of the number of developers, the need for effective concurrency controls, robust deployment strategies, and efficient monitoring exists.
00:37:05.360 Tools and practices that develop from experience and through trial are meant to share these learnings to empower the broader community to build reliable background job architectures.
00:39:27.120 We aim to provide insights that help you adapt strategies best suited to your needs. Thank you for tuning in, and we look forward to any questions.
00:41:00.000 Thank you so much to both of you for your enlightening talk. As we move closer to our panel discussion, we will skip a quick question and transition directly into the panel.