Talks

What does high priority mean? The secret to happy queues

by Daniel Magliola

In the talk titled "What does high priority mean? The secret to happy queues," Daniel Magliola presents a practical guide to managing background job queues, emphasizing a latency-focused approach to improve job flow and user satisfaction. He tells the story of Sally, a lead engineer at Manddling, a stationery company, who faces persistent queue management issues.

Key Points Discussed:

  • Introduction to the Problem: Sally has been facing challenges with unhappy queues, leading to delays and impacting customer experience.
  • Historical Context: Initially, Manddling operated with a single, simple job queue. As the company grew, complexity increased and job backlogs began to build. Specific incidents, such as failed password resets and delayed credit card transactions, exposed the flaws in how jobs were organized and prioritized.
  • Prioritization Issues: Joey, a team member, attempted to solve problems by implementing priority queues. However, this caused conflicts, as everyone’s jobs were deemed important, leading to confusion and further incidents.
  • Marianne’s Insights: A new engineer, Marianne, suggests organizing jobs not by priority but by their purpose, advocating for separate queues for different functions (e.g., mailers, surveys) to enhance predictability.
  • Scaling Challenges: As more teams were formed, each with their own queues, Manddling faced chaos with too many queues (60), which increased operational costs and complexity.
  • The Role of Latency: Daniel emphasizes that the critical issue is not the prioritization of jobs but their latency. He proposes structuring queues based on the maximum latency acceptable before a job is perceived as late.
  • Implementing Changes: Sally’s innovative idea of naming queues according to their maximum latency tolerances brought clarity and improved performance. The team established contracts for each queue, providing clear expectations for job execution.
  • Continuous Improvement: It took time for the team to transition existing jobs into the new system, but they learned to maintain momentum while tightening constraints, ensuring that every job was placed in the appropriate queue.

Conclusions and Takeaways:
- Focus on Latency: The single most significant factor in queue management is latency rather than priority. By defining clear latency tolerances, teams can more effectively manage jobs and ensure an efficient workflow.
- Accountability and Monitoring: Clearly defined queues improve accountability, set job expectations, and simplify alerting systems, ultimately leading to proactive issue resolution.
- Flexibility Needed: It’s important to remain flexible and realistic in adjusting latency limits based on specific jobs and operational requirements.

This insightful session at BalticRuby 2024 underscores the importance of reconsidering how job management is approached in software systems, ultimately contributing to happier queues and satisfied users.

00:00:09 I want you to meet Sally. Sally is a lead engineer at Manddling, a leading provider of paper and other stationery products.
00:00:12 She's been there from the very beginning, one of the first engineers hired. Because of that, she knows the codebase inside and out, and she's extremely familiar with the issues of running the app in production.
00:00:20 Unfortunately, Sally is feeling sad today because the queues are once again unhappy. She is facing this problem again and is trying to find a solution to make the system happy so she can return to her normal work.
00:00:31 This has been a recurring issue for years, and no matter what they try, they never seem to be able to fix it. However, the next morning, after a good night's sleep, Sally wakes up with a radical new idea that she believes will solve these problems once and for all.
00:00:45 But to understand how this will solve all of their problems, we first need to explore a little bit of history and understand how Manddling arrived at this situation. First of all, though, hi, my name is Daniel, and I work at D Flex, a flexible staffing platform based in London. You've probably noticed that I'm not originally from London; I come from Argentina.
00:01:06 So in case you're wondering, that's the accent. Now, let's go back to Manddling. When Sally joined the company, Manddling was a tiny team of only three developers, and they wrote the entire app that was running the whole company—buying paper, keeping track of inventory, selling it, everything. Initially, everything was running on a single little web server, and that was fine for a while.
00:01:25 However, over time, they started having a bit of trouble, so Sally decided to add a background job system, thereby introducing a queue. The queue was a good solution; it solved many problems for the team. They began adding more and more jobs to it, and the queue remained effective. So, our story begins a few months later.
00:01:54 Joey receives a support ticket: users are reporting that the password reset functionality is broken. Joey works on the ticket and says, 'Works on my machine,' and then he closes it because he cannot reproduce the issue.
00:02:07 Of course, the ticket comes back with more details: users are still saying they're not receiving the email. Sure enough, when Joey tests this again in production, he confirms that he indeed does not receive the email. After a bit of investigation, he finds that the queue has 40,000 jobs in it, and emails are getting stuck, causing late deliveries.
00:02:31 Joey spins up some extra workers so the queues drain faster and marks the ticket as resolved; however, this had customer impact, so he calls it an incident. While writing the postmortem, he starts thinking about how to prevent this from happening again.
00:02:49 Joey knows that when a job gets stuck in a queue, it's because other types of jobs are interfering with it. One way to fix that is to isolate those loads so that they don't interfere with each other. We can't have all the jobs in one big bag and just pick them out in order, right? Some jobs are more important than others. In some queuing systems, you can set priorities for your jobs, allowing certain jobs to skip the queue, if you will, and run earlier.
00:03:12 Joey thinks that might be a good idea, but it turns out their queuing system doesn't support job priorities. Jobs are picked up in the same order they are placed in the queue. By the way, that's a good design decision; priorities are not going to solve this problem, and we're going to see why later.
00:03:27 Instead of priorities, what Joey needs to do is isolate the workload by creating separate queues. So, he creates a high-priority queue for more critical jobs, and because programmers love symmetry, he also creates a low-priority queue for less important tasks. A few days later, he identifies some jobs that need to run sporadically but aren't urgent, and they take a bit longer to complete. He places them in the low-priority queue.
00:04:01 A few months down the road, Joey is addressing another incident related to queues. It turns out that now everybody's jobs are considered important.
00:04:06 The high-priority queue now contains numerous jobs, and credit card transactions have started processing late due to other long-running jobs, which is costing the company sales. After another postmortem, Joey tightens the rules on the critical queue, making it clear to everyone that while their jobs are important, only very critical tasks can reside there.
00:04:27 Several months later, a credit card company experiences an outage, and credit card jobs, which are typically completed in about a second, are now taking a full minute, resulting in timeouts. This creates a backlog in the queue, and two-factor authentication messages, which are also critical, start being delayed. However, by this time, the company has hired a senior engineer from a larger firm who brings valuable experience with her.
00:05:14 This is Marianne, and when she notices the incident, she quickly identifies the root of the problem. Organizing your queues based on priority or importance is a recipe for failure.
00:05:28 First, there isn't a clear definition of what constitutes high priority or low priority. Yes, there are guidelines with examples, but it's impossible to predict all the different tasks that will be added to our queues. Additionally, some of these jobs may be high volume or long-running, making it hard to determine how they will interact with one another.
00:05:45 However, Marianne has seen this type of situation before, and she understands that we need to organize our jobs based on their different purposes rather than simply setting priorities. This way, jobs that run together can be more predictable in terms of performance, and it will be much clearer what belongs where. This is crucial because if credit card jobs start experiencing issues, they shouldn't interfere with unrelated tasks, like two-factor authentication.
00:06:16 Of course, while we can't have a separate queue for every single task, some companies have tried. You can define a few queues for your most common tasks. Marianne sets up a queue for mailers, which are generally quite important, and also creates a specific low-priority queue for the thousands of customer satisfaction surveys, allowing them to avoid disrupting critical processes.
00:06:35 A few months later, a new system is in place, and jobs are functioning more efficiently. Everyone is content with this arrangement for a while.
00:06:53 Now, fast forward a few years—our company has grown significantly. We now have dozens of developers organized into various teams focusing on different parts of the application, such as purchasing, inventory tracking, logistics, etc. One day, during a retrospective meeting, the logistics team finds their backlog is excessively long.
00:07:15 The purchasing team has once again created a colossal problem. This is the fourth time this quarter that they have delayed things. How many times do we need to remind them? Apparently, purchasing had this brilliant idea: instead of contacting vendors for better prices, they decide to automatically email everyone and let AI handle the rest. The outcome was disastrous!
00:07:42 The emails were excessive, and trucking companies didn't receive the notifications that they desperately needed, resulting in late shipments. This is not solely the fault of the purchasing team; we've faced similar issues with the sales team, and it's worth noting that we've also contributed. Do we remember the shipping backlog from last Cyber Monday, which congested everyone's queues and ruined their day?
00:08:10 The conclusions seem clear: we have all these queues, but we lack control over what other teams are contributing to them, and they don't understand what complicates our jobs or what our requirements are for those tasks. There's only one solution: teams must operate independently.
00:08:26 Thus, the logistics team's queue is created. At least they didn't go for microservices! Naturally, it isn't just the logistics team that gets their own queue; they encourage the other teams to adopt this approach as well.
00:08:54 However, not everything the logistics team does shares the same urgency, so three months later we end up with logistics high-priority and logistics low-priority queues, and the purchasing, marketing, and sales queues split into high and low as well. With hundreds of different jobs spread across all these queues, nobody knows where anything belongs or what each queue is for. This is becoming increasingly chaotic.
00:09:26 At least we are fortunate to have venture capital backing this company. Still, some individuals start noticing that we're spending significantly more money on servers than necessary, leading to inquiries. It turns out we have 60 queues now, translating to 60 servers. Most of the servers are not doing anything most of the time, but someone has to manage those queues.
00:09:51 Thus, an obvious decision is made: we can configure worker processes to handle multiple queues simultaneously, grouping some of them together and lowering the server count. And guess what? With queues sharing workers again, the different workloads start interfering with each other just like before.
00:10:12 Now, you might be wondering why I'm sharing this story about people making such blatant mistakes. The truth is, this isn't fiction; I've renamed the queues and teams for privacy reasons, but I've watched this exact pattern unfold at every company I've worked for. Half of the characters proposing solutions that just wouldn't work were actually me. I've seen this enough to believe it is a common progression that companies go through. You may have observed it as well, and hopefully the remainder of the talk will help you address those issues or, if you haven't hit this situation yet, save you those headaches.
00:11:14 I believe the reason this is a common progression is not that these characters are bad engineers. The thing is, when faced with these challenges, these steps look like the obvious solution; they seem sensible and they do solve the immediate problem. However, queues have interactions and behaviors that are incredibly hard to predict, making it difficult to foresee which issue each of these changes will cause next.
00:12:08 So, how do we escape this cycle? It starts with understanding what problem we are actually trying to solve when we create new queues. The issue is deceptive, and we often address the wrong problems. The problem is not queues, jobs, priorities, or threads; it lies in the language we use to discuss these things. We have jobs running in the background all the time without anyone noticing. A real problem arises when someone notices that a job didn't execute, which typically means it hasn't been executed yet, but it feels like it should have.
00:12:49 Put differently, a job is running late. The key issue is the expectation surrounding how long the jobs will take before they execute, but this expectation is often implicit. It's one of those 'I'll know it when I see it' situations; we recognize that it's late, but if you ask when it should have run, you likely can't pinpoint the time.
00:13:22 You have a database maintenance queue where you analyze every table every night because of an issue with the statistics two years back. Will we ever trust the database again? Now, if this job is running twenty minutes behind, is that a problem? Not at all. But if your critical queue is twenty minutes behind, that is definitely a problem. What about if it's only two minutes behind? Is that acceptable? Or what if it's the low priority queue?
00:13:45 What you can see is that the language we use to define these queues and the jobs within them is vague, and that puts us in a difficult situation. Low priority or high priority gives you a relative idea of importance, but it doesn't tell you whether your queues are healthy until someone realizes a job is missing and starts shouting. As engineers, we are left without guidance on where to put things. When creating a job, you search for something similar in the existing queues and copy that, or you rely on intuition about its importance. This is a serious issue.
00:14:24 The notion of high and low priority doesn’t offer significant utility. However, the solution lies within this same language, because prioritizing jobs is not what matters. What we care about is whether a job is running late. The primary focus should be on latency. That’s how we should structure our queues—by assigning jobs based solely on the maximum latency they can tolerate without feeling late.
00:15:02 The main problem our colleagues are facing is that they are attempting to isolate their work from each other, yet they are not designing their queues around the one aspect we care about: latency. The symptom is always that something didn't happen quickly enough. The sole thing that matters is latency, even if that sounds simplistic. I’m firmly committed to this idea: latency is truly the only factor we care about for managing our queues, while everything else is merely additional.
00:15:47 Now, while there are other factors you will need to consider—such as throughput, thread counts, and hardware resources—these are just methods to achieve latency goals. You don’t truly care about throughput; what you care about is that when you submit a job, it gets completed in a timely manner. Everything else is about coping with the load to ensure that happens. Therefore, separating our queues based on priorities or teams is merely a roundabout approach to managing latency properly.
00:16:20 To be fair, the first approach of implementing high and low queues was an attempt to specify latency tolerances for jobs, and the instinct behind it was right. The problem with high, low, and even super-extra-critical queues is the ambiguity they present. Say you have a new job you're implementing, and it's acceptable if it takes a minute to finish, but not ten minutes. Where does that job go? Is it high priority, or critical?
00:16:59 Additionally, in the event that something fails, you want to discover the issue before your customers do. That means you should set up alerts, but when do you notify someone? At what point is latency excessive for the high priority or low priority queue? This is where Sally steps up with her genius idea. We’ll organize our queues around the different latencies permitted, establishing clear guarantees and strict constraints.
00:17:44 Now, let's break this idea down. First, we define queues based on the maximum latency a job in them might experience. Focusing on maximum latency may sound naïve, but you'll see why it matters shortly. What's crucial is that if you submit a job to a queue, it will start running within that advertised maximum latency. We name the queues accordingly, so everyone knows what to expect from each one.
00:18:09 Each name guarantees a contract: if you place a job in one of those queues, it has to start running before the specified time elapses. If a queue’s latency exceeds the defined limit, alerts will trigger, and the relevant department will be mobilized to address the situation. This naming convention is essential as it ensures accountability.
00:18:53 Consequently, as engineers, when you write a new job, you choose the queue that fits its allowable latency: the point past which, if the job hasn't run, someone will notice and conclude that it's late or that the system is malfunctioning. With that, placement becomes far less ambiguous. If the job ran an hour late, would anyone notice? If not, put it in the one-hour queue; it can tolerate that. If it needs to run in less than ten minutes, place it in the corresponding queue.
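To make this concrete, here is a minimal sketch in Ruby, assuming a Sidekiq-style job system; the specific queue names (within_1_minute, within_10_minutes, and so on) and the example job are illustrative, not taken from the talk.

```ruby
# Hypothetical latency-named queues: each name is a promise about the maximum
# time a job may wait in that queue before it starts running (in seconds).
LATENCY_QUEUES = {
  "within_1_minute"   => 60,
  "within_10_minutes" => 600,
  "within_1_hour"     => 3_600,
  "within_1_day"      => 86_400,
}.freeze

class SendPasswordResetEmailJob
  include Sidekiq::Job

  # Users notice within minutes if a reset email never arrives, so this job
  # promises to start within ten minutes of being enqueued.
  sidekiq_options queue: "within_10_minutes"

  def perform(user_id)
    # ... look up the user, build and deliver the email ...
  end
end
```

Choosing a queue then reduces to the question from the talk: how late can this job run before someone notices?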
00:19:56 Finding these thresholds is more straightforward for some jobs than for others; the exercise is to think through the consequences of the job running at varying levels of lateness and assess what problems would actually arise in each scenario.
00:20:01 It's also important to remain realistic. There's always a temptation to want everything instantly, but not everything can happen at once. So determine how long a job can genuinely wait, and then choose the slowest queue, the one with the largest latency guarantee, that it can still tolerate.
00:20:51 Now, you may not be completely convinced, so let’s delve deeper into why this approach addresses these problems. The fundamental point is that the names given to queues are authentic promises and contracts. They set clear expectations.
00:21:10 For example, with clearly named queues, anyone can tell what a job requires. Earlier, we discussed the problem of jobs having an implicit tolerance for delay that is never recorded or measured. But if a job sits in a queue with a ten-minute limit, it is clear that it must start within ten minutes. Jobs that run at a different pace, or that need unique hardware or processing, should be delineated just as clearly.
00:21:49 These queues also simplify alerting and metrics reporting. With clear names identifying the queues, alert thresholds become obvious: if a job is in the ten-minute queue, alert once that queue reaches ten minutes of latency. Of course, while we want to act preemptively to avoid problems, we also want to avoid unnecessary server costs from over-allocating resources, and these clear promises make autoscaling easier to implement.
00:22:20 You know precisely what latency level is acceptable for each queue. Combine that with how long your servers typically take to boot and start accepting work, and you have an autoscaling threshold with a bit of leeway built in: when a queue approaches its latency limit, it's time to start scaling up servers to meet the demand.
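As a hedged sketch of what that check could look like, assuming Sidekiq (whose Sidekiq::Queue#latency returns the age in seconds of the oldest job waiting in a queue); the paging and scaling hooks, the queue names, and the 80% leeway factor are illustrative assumptions, not the speaker's implementation.

```ruby
require "sidekiq/api"

# Hypothetical hooks; a real system would call its paging and infrastructure APIs here.
def page_on_call(queue:, latency:)
  warn "ALERT: #{queue} latency is #{latency.round}s, past its promised maximum"
end

def scale_up_workers(queue:)
  warn "Scaling up workers for #{queue} before it breaks its promise"
end

# Maximum acceptable latency per queue, in seconds (illustrative values).
QUEUE_LATENCY_LIMITS = {
  "within_10_minutes" => 600,
  "within_1_hour"     => 3_600,
}.freeze

# Run periodically, e.g. from a cron job or a monitoring agent.
QUEUE_LATENCY_LIMITS.each do |queue_name, max_latency|
  latency = Sidekiq::Queue.new(queue_name).latency # age of the oldest waiting job

  if latency >= max_latency
    page_on_call(queue: queue_name, latency: latency)
  elsif latency >= max_latency * 0.8 # leave room for new servers to boot
    scale_up_workers(queue: queue_name)
  end
end
```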
00:22:50 Remember, the one truth in queue management is that these guarantees come at a cost. For us, this translates to a simple rule: a queue that promises low latency should only hold jobs that execute quickly, or they will get in everyone else's way. I call this the swimming pool principle. In a swimming pool, lanes are organized by speed: if you want to swim in the fast lane, you have to swim fast enough to keep up; otherwise, you move to a slower lane so you don't get in the way.
00:23:55 Queues work the same way. In a slow queue, where jobs can tolerate longer delays, your jobs can take longer to finish without consequences; but in a fast queue, where tasks must start quickly, you must free up your worker promptly or you'll disrupt everyone else. By putting a job in a specific queue, you accept that implicit contract, and if you violate it, your job can be kicked out of the queue. As a result, it becomes essential to impose limits on how long jobs are allowed to run, based on the queue they are in.
00:25:09 So, to be eligible to stay in a queue that promises, say, a one-minute maximum latency, your job must itself finish quickly; otherwise, it has to move to a slower queue.
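One way to enforce that contract, sketched here under the assumption of Sidekiq server middleware and illustrative per-queue runtime budgets (the talk doesn't prescribe specific numbers, or whether offenders are moved automatically or merely reported):

```ruby
# Per-queue runtime budgets, in seconds: a job in a fast queue must itself be
# fast (the swimming pool principle). Values are illustrative.
RUNTIME_BUDGETS = {
  "within_1_minute"   => 5,
  "within_10_minutes" => 60,
  "within_1_hour"     => 300,
}.freeze

class RuntimeBudgetMiddleware
  include Sidekiq::ServerMiddleware

  def call(job_instance, job_payload, queue)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    budget  = RUNTIME_BUDGETS[queue]
    if budget && elapsed > budget
      # Surface the violation so the job's owners know it has outgrown this queue.
      Sidekiq.logger.warn(
        "#{job_payload['class']} ran for #{elapsed.round(1)}s, over the " \
        "#{budget}s budget for the #{queue} queue"
      )
    end
  end
end
```

It would be registered inside Sidekiq.configure_server with config.server_middleware { |chain| chain.add RuntimeBudgetMiddleware }.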
00:25:30 Of course, this varies depending on each company's needs.
00:25:32 Now, let’s transition to what happens with Sally and the rest of our software developers. After her epiphany, Sally proposed her idea to the team. They embraced it and began working on its adoption.
00:25:49 It wasn't easy; the implementation required effort and time, but eventually, they succeeded. There were valuable lessons learned during this process. Initially, they developed the new queues, complete with alerts, establishing the rule that every new job from now on could only be placed in these new queues.
00:26:12 To make sure nobody forgot the new protocol, they created a custom rule enforcing it. Then, they undertook the large and somewhat daunting task of reassessing all existing jobs and deciding where each one should belong.
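The talk doesn't describe what that custom rule looked like; one plausible sketch is a CI test that fails whenever a job class declares a queue outside the approved latency-named set. The queue list, the temporary "legacy" bucket, and the reliance on eager-loaded job classes are all assumptions.

```ruby
require "minitest/autorun"
require "sidekiq"

# Approved latency-named queues, plus a temporary bucket for jobs not yet migrated.
ALLOWED_QUEUES = %w[
  within_1_minute within_10_minutes within_1_hour within_1_day legacy
].freeze

class QueueNamingTest < Minitest::Test
  def test_every_job_uses_an_approved_queue
    # Assumes all job classes are already loaded (e.g. via Rails eager loading).
    job_classes = ObjectSpace.each_object(Class).select { |klass| klass.include?(Sidekiq::Job) }

    job_classes.each do |job|
      queue = job.get_sidekiq_options["queue"]
      assert_includes ALLOWED_QUEUES, queue,
                      "#{job} uses unapproved queue #{queue.inspect}; pick a latency-named queue"
    end
  end
end
```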
00:26:30 They began with the easiest calls first. Some jobs were very obvious, such as the overnight maintenance jobs with no urgency at all; those were moved to the 'within one day' queue.
00:26:48 Any lengthy task that didn't require immediate attention was likewise directed to one of the slower queues. Once the straightforward jobs had been relocated, it made sense to consolidate the remaining queues across the different teams.
00:27:14 Consequently, they merged numerous queues into nine distinct ones and experienced a remarkable boost in motivation. The workload lightened, and the system's operability improved significantly.
00:27:41 Now they had momentum, and they were eager to continue this pace. They then tackled the tasks that were run most frequently. High-volume jobs tend to be the most challenging to manage.
00:27:55 So, they paid special attention to ensuring these were well classified. They realized they still had several jobs left to classify, but the volume of these remaining jobs was minimal compared to what had already been taken care of.
00:28:27 Oddly enough, as they kept working, the number of unclassified jobs dwindled to the point where it was manageable. They bundled the few remaining jobs into a general catch-all queue, and that was sufficient for their needs.
00:29:00 Day-to-day, the streamlined system became a dream to maintain. Nevertheless, they still had work ahead of them, and to keep moving forward and maintain momentum, they were willing to take some shortcuts.
00:29:17 They loosened the time limits they had originally set and simply allocated extra resources to relieve the pressure, so jobs could be moved into their proper queues without worrying about performance right away. They kept a precise list of the jobs operating outside their constraints.
00:29:40 They gradually improved these outliers, eventually tightening these time limits until they struck a proper balance.
00:29:58 Thus, they learned when it was appropriate to diverge from the intended path. While I firmly stand by the importance of latency as our focus, practical considerations occasionally necessitate some deviation.
00:30:20 Some jobs require special hardware, which is particularly common in environments like Ruby, where memory-hungry jobs can make servers expensive. If those jobs live in their own dedicated queue, you might only need a few powerful servers for that queue instead of paying for expensive servers across all of them.
00:30:45 Some jobs need to run one at a time. That's usually an anti-pattern, but there are legitimate reasons to want a controlled execution environment, and a dedicated single-threaded queue makes it explicit how many instances can run concurrently.
00:31:06 Jobs that primarily wait on I/O can run with high thread counts while using minimal CPU. These nuances are another argument for specialized queues that cater to such jobs, saving server costs.
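As a rough illustration of those specialized queues, here is what a worker configuration could look like using Sidekiq 7 capsules; the capsule names, queue names, and concurrency numbers are placeholders, not recommendations from the talk.

```ruby
# config/initializers/sidekiq.rb (sketch)
Sidekiq.configure_server do |config|
  # Default capsule: ordinary latency-named queues.
  config.queues = %w[within_10_minutes within_1_hour within_1_day]

  # Jobs that must run strictly one at a time get a single-threaded capsule.
  config.capsule("serial") do |cap|
    cap.concurrency = 1
    cap.queues = %w[serial_within_1_hour]
  end

  # I/O-bound jobs spend most of their time waiting, so they can run with a
  # much higher thread count on the same hardware.
  config.capsule("io_bound") do |cap|
    cap.concurrency = 30
    cap.queues = %w[io_within_10_minutes]
  end
end
```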
00:31:37 In the end, these are the usual trade-offs between cost and complexity, and they will vary with your specific scenario; it's fine to deviate slightly from the plan. Just remember that every queue, specialized or not, should still carry a defined latency guarantee.
00:32:00 To sum up, adopting clear naming conventions, focused metrics, and alerts is key to improving system performance and keeping all stakeholders in the loop regarding job tolerances.
00:32:16 Thank you for your time, and I hope these insights help you optimize your queue management strategies for happier, more efficient systems.