Job Scheduling
What does "high priority" mean? The secret to happy queues
Summarized using AI

by Daniel Magliola

In this presentation titled "What does 'high priority' mean? The secret to happy queues," Daniel Magliola discusses the challenges teams face when managing job queues within web applications, emphasizing the importance of maintaining low latency for a smooth operation. He introduces the story of Sally, a lead engineer who continually finds her team's job queues overwhelmed despite attempts to solve the issue with separate high and low priority queues.

Key points covered include:

- The Origins of Queue Problems: Sally's company initially thrived with a small number of developers, but as complexity grew, support tickets began highlighting issues like delayed password reset emails due to a backlog of 40,000 jobs in the queue.

- Ineffectiveness of Priorities: Joey, another engineer, attempted to create priority queues, but issues persisted as critical jobs got slowed down by longer-running tasks in the same queue.

- Proposed Solution by Marianne: A senior engineer recognized that organizing by priorities failed because there's no clear understanding of what constitutes 'high' or 'low' priority. Instead, she suggested creating separate queues based on job types to manage latency effectively.

- Understanding Latency: The crux of the issue lies in accurately measuring and managing latency rather than attempting to set job priorities. Jobs should be categorized by how much latency they can tolerate, and queues must be renamed to reflect those guarantees—like 'within ten minutes'—creating a clear agreement between job timing expectations and performance.

- Implementation of Fixed Latency Contracts: By grouping jobs into defined latency limits and requiring jobs to finish within those constraints, development teams ensure clarity and accountability in their queue management. This creates a responsive system where alerts trigger if latency expectations are breached, thus preventing issues before they arise.

- Ongoing Maintenance and Improvement: Teams focused initially on easy wins by categorizing critical jobs, gradually shifting high volume and long-running tasks into designated queues without overwhelming the system. They learned to balance efficiency with performance needs while continually iterating on their system structure.

In conclusion, the presentation emphasizes the importance of understanding job latency over vague prioritization, driving home the point that effective queue management is essential to ensure a healthy and efficient system operation, ultimately leading to happier users and developers alike.

00:00:00.000 Ready for takeoff.
00:00:17.820 I want you to meet Sally. Sally is a lead engineer at Mandal Delfing,
00:00:22.859 a leading provider of paper and other stationery products. She's been there from the very beginning,
00:00:28.439 one of the first engineers they hired. Because of this, she knows the code base inside and out and is extremely
00:00:35.280 familiar with the issues of running the app in production.
00:00:40.320 Sadly, today her queues are sad, so she is dealing with the problem again.
00:00:45.899 This has been an ongoing issue for years, and no matter what they've tried, they never seem to be able to fix it.
00:00:51.840 The next morning, after a good night's sleep, Sally wakes up with a radical new idea that will solve these problems once and for all.
00:00:57.840 But to understand how that will fix all their problems, we first need to get to know a little bit of history.
00:01:04.019 First of all, though, hi! My name is Daniel, and I work at Indeed Flex,
00:01:10.500 a flexible staffing platform based in London. You can probably tell that I'm not originally from London; I come from Argentina.
00:01:15.780 In case you're wondering, that's the accent.
00:01:21.840 Now let's go back to our little paper company.
00:01:26.939 When Sally joined Mandal Delfing, there was a tiny team of only three developers who wrote the app that
00:01:32.759 was running the entire company—buying paper, keeping track of inventory, selling it, delivering it, everything.
00:01:39.960 At first, everything was just running inside that one little web server, and that was fine for a while.
00:01:45.900 But they started having some trouble at some point, so Sally decided to add a background job system.
00:01:51.600 Then there was a queue, and the queue was good; it solved a lot of problems for the team.
00:01:57.060 So they started adding more and more jobs to the queue, and it was still effective.
00:02:03.000 And so our story begins.
00:02:10.800 A few months later, Joey receives a support ticket.
00:02:16.819 Users are reporting that the reset password functionality is broken. Joey takes the ticket.
00:02:23.280 "Works on my machine," he says, and closes it, as he can’t reproduce the issue.
00:02:28.739 Of course, the ticket comes back with more details. Users say they are still not receiving the email.
00:02:35.099 Sure enough, when Joey tries to reproduce the issue again, this time in production, he does not receive the email.
00:02:41.459 After a bit of investigation, it turns out that the queue has 40,000 jobs in it, and emails are getting stuck there.
00:02:46.500 Joey spins up some extra workers to drain the queues faster and marks the ticket as resolved.
00:02:51.900 But since it had customer impact, he decides to declare it an incident.
00:02:58.080 Once he's writing the postmortem, he starts thinking about how to prevent this from happening again.
00:03:04.379 Now, Joey knows that when a job gets stuck in a queue, it’s often because other types of jobs are getting in the way.
00:03:10.200 One way to fix that is to try to isolate those loads so they don't interfere with each other.
00:03:16.260 We can't have all the jobs in one big bag and just pick them in order, right?
00:03:21.720 Some of them are more important than others.
00:03:28.739 Now, in some queuing systems, you can set priorities for those jobs and let some of them skip the queue.
00:03:34.379 So Joey thought maybe that would be a good idea, but it turns out their queues don't support priorities.
00:03:40.319 Jobs are picked up in the same order that they're put in the queue, and as an aside, by the way, that's a good thing.
00:03:46.799 Priorities won't really solve the problem, and we'll see why later.
00:03:52.920 Instead of priorities in this system, what you need to do to isolate loads is create separate different queues.
00:03:59.099 So Joey decides to create a high priority queue for important jobs, and because programmers love symmetry,
00:04:04.500 he also creates a low queue for less important tasks.
00:04:10.980 A few days later, he finds some jobs that need to run every now and then, but they're not urgent, and they take a bit long to run.
00:04:17.280 It would be a shame if something important was late because of them, so he puts them in the low queue.
00:04:24.600 A few months down the road, Joey is dealing with another incident related to queues.
00:04:30.540 It turns out everyone's jobs are important, so the high priority queue has tons of different jobs,
00:04:39.240 and now critical transactions are not being processed fast enough because of some other new job that's taking too long, and we're losing sales.
00:04:46.380 So after another postmortem, Joey makes it very clear to everyone,
00:04:51.780 "I know your jobs are important, but only very, very critical things can go here. We cannot let this happen again."
00:04:58.800 I think none of you are surprised here. A few months later, the credit card company had an outage.
00:05:04.979 Credit card jobs, which are critical and normally take about a second to run, were now taking a whole minute until they timed out.
00:05:11.340 This started backing up the queue, and two-factor authentication text messages, which are also critical, started going out late again.
00:05:18.120 But by this point, the company had hired a very senior engineer from a much bigger company, and she brought a lot of hard-won experience with her.
00:05:23.160 This is Marianne. When she noticed the incident, she recognized the problem immediately and got involved.
00:05:28.620 Marianne has seen this exact same pattern before and knows what the problem is.
00:05:34.199 Organizing jobs based on priority or importance is doomed to fail.
00:05:40.800 First of all, there's no clear meaning of what high priority or low priority is.
00:05:47.280 And yes, there are docs with examples of what is high priority, but they can never predict everything that we will need to throw at our queues.
00:05:52.979 Some of those jobs will also be very high volume or long-running, and it's really hard to predict how all of them will interact.
00:06:00.120 But she's seen this before, and she has the answer.
00:06:05.460 We need to have separate queues for different purposes, not priorities. That way, jobs that run together all look similar,
00:06:12.300 making performance more predictable, and it's also going to be much clearer what goes where.
00:06:17.820 For example, if credit card jobs start having trouble, they don't interfere with unrelated tasks like two-factor authentication.
00:06:23.880 Of course, you can't have a queue for every single purpose, although some companies have tried.
00:06:31.319 But you can define a few for your most common tasks. Marianne said to create a queue for mailers, which are generally pretty important.
00:06:38.699 She also noticed we send millions of customer satisfaction survey emails, and those are not important.
00:06:43.700 So she added a survey queue so they don't get in the way.
00:06:49.199 A few months later, they had something like this, and jobs were humming along just fine.
00:06:58.800 Every now and then, a new queue was born, and everybody was happy with this for a while.
00:07:05.080 Now fast forward a few years, and the company has grown quite a bit. We now have a couple dozen developers, and as usual, they're split into teams based on the different functionalities of the app. You have the purchasing folks buying paper, the inventory people keeping track of it, the logistics team, etc.
00:07:55.080 One day the logistics team is having a little retro, and the Mad column is pretty long.
00:08:01.080 The purchasing folks have done it again! This is the fourth time this has happened this quarter.
00:08:07.919 How many times do we need to tell them not to do this?
00:08:13.440 They had this genius idea that instead of calling vendors and asking them for better prices, they could just write an email to everyone and let the computer do the haggling.
00:08:19.500 The feature was a little aggressive and sent a ton of emails, so the trucking company didn't get the notifications they needed, and shipments went out late that day.
00:08:25.199 It's not always purchasing though; we had trouble with the sales team doing the same thing too.
00:08:30.960 And to be fair, we've also done this, remember the ship apocalypse last Cyber Monday, where we clogged every queue in the system and ruined everyone's day?
00:08:36.779 The conclusion seems clear: we have all of these queues, but we have no control over what the other teams are putting in them.
00:08:43.320 And they don't know what our jobs look like or what the requirements are for those jobs.
00:08:49.199 There's only one way to fix this: teams need to act independently of each other.
00:08:55.080 That's the point of teams, and just like that, the logistics queue is born.
00:09:00.420 They didn't go for microservices. Now, of course, it's not just the logistics team that gets their own queues;
00:09:07.125 they share this idea with other teams and ask them to do the same.
00:09:14.640 Guess what? Not everything that the logistics team does has the same urgency.
00:09:20.640 So three months later, we end up with Logistics High and Purchasing Low.
00:09:26.159 But of course, our original High and Low mailers are still there, and we have hundreds of different jobs in there.
00:09:34.680 Nobody knows what they do or where they go.
00:09:40.560 This is starting to get hairy, and it's a good thing that we have VC money behind this company.
00:09:46.440 But even then, at some point, somebody does notice that we're spending way more on services than we should.
00:09:52.080 They start asking questions, and that's how we end up with 60 queues now.
00:09:57.600 That means 60 servers, and most of those servers are not doing anything most of the time.
00:10:04.320 But somebody has to serve those queues.
00:10:09.360 So an obvious decision is made: we can configure processes to work multiple different queues at once.
00:10:17.940 So we will group some of them together and reduce the server count.
00:10:24.180 Um, guess what?
00:10:30.600 Now you may be asking, why am I telling you this clearly facetious story about people making very obvious mistakes?
00:10:34.440 The truth is this isn’t fiction at all. I've renamed the queues and the teams to protect the innocent.
00:10:41.040 But I've seen this exact same story develop pretty much exactly this way in every company I’ve worked in.
00:10:48.420 Hell, I've been half of the characters in this story, proposing things that were obviously going to work.
00:10:55.920 That was me with the bad ideas.
00:11:01.500 And I've seen this enough that I think it’s a common progression that many companies go through. You may have seen this too,
00:11:09.000 and hopefully the rest of this talk will help you with those issues.
00:11:14.280 Or if you haven't gone down this path yet, hopefully you can prevent some headaches.
00:11:19.680 The reason I think this is a common progression is not that these characters are bad engineers.
00:11:27.300 The thing is, when you're faced with these challenges, these steps do seem like the obvious solution.
00:11:33.060 The problem is that queues tend to have interactions and behaviors that are very hard to reason about.
00:11:41.220 So when we propose one of these changes to fix a problem, it's really hard to predict the next problem that they will cause.
00:11:46.740 So how do we get out of this? Well, first, we need to think about what actual problem we're trying to solve.
00:11:54.780 When we started making new queues, the problem is deceptive, and we tend to attack the wrong issue.
00:12:00.360 The problem is not the queues, and it's not priorities or threads.
00:12:07.680 The problem is the language that we use to talk about this.
00:12:13.920 If you think about it, you have jobs running in the background all the time, and no one notices.
00:12:20.520 If someone notices, that’s when you have a problem.
00:12:26.760 But what is it that people notice? A job didn't run.
00:12:31.680 What "didn't run" generally means is that it hasn't started yet, but I feel like it should have by now.
00:12:38.040 Or put another way, a job is running late.
00:12:44.640 And here's where the trouble starts, because what does late mean here?
00:12:49.740 Many jobs every day don't get to start immediately once they're queued.
00:12:55.560 They tend to wait in the queue a bit, and no one notices. Some wait for minutes, and no one notices.
00:13:01.620 Others make it very obvious if they're late, even by a little bit.
00:13:08.040 For example, you probably know this one: you send a message to a friend, and the app tells you when it has reached their phone.
00:13:14.220 That took a second, and you didn't even notice. But let's try a different scenario.
00:13:21.840 How anxious are you getting right now? Is he in a tunnel?
00:13:27.360 Did his phone run out of battery? Should I be worried? Is he dead? Oh, thank God, he's okay.
00:13:32.520 Now, that job was late.
00:13:38.520 So here's the issue: there's an expectation that you have for your jobs of how long it will be until they run.
00:13:45.600 But that expectation is implicit. It's one of those "I know it when I see it" kinds of things.
00:13:52.920 You know if it's running late, but if I ask you when that would be, you probably can't tell me.
00:13:58.980 Think about the database maintenance job, right? You run a vacuum and analyze on every table every night,
00:14:06.300 because there was a problem with statistics two years ago, and we will never trust the database again.
00:14:13.260 And if that's running 20 minutes behind today, is that a problem? Almost certainly not.
00:14:19.680 But what if your critical queue is running 20 minutes behind? That’s almost definitely a problem.
00:14:26.820 Now, what if it's two minutes behind? Is that okay?
00:14:32.220 What if the low priority queue is 20 minutes behind? Is that bad?
00:14:38.520 The problem is the language that we use to define these queues and the jobs we put in them.
00:14:45.840 We give them these vague names, and it gets us in trouble. High priority or low priority gives you some relative idea.
00:14:52.320 This is more important than that, but it gives you no insight as to whether your queues are healthy or not at any given point.
00:14:59.520 Until somebody's missing some job and they start shouting.
00:15:06.480 And as an engineer, it also gives you no indication as to where you should put things.
00:15:12.000 So what do you do when you're writing a new job? You look around the queues for other jobs that look similar and follow that,
00:15:18.180 or you just guess, or you listen to your gut, or you go with how important it feels.
00:15:23.760 The problem is that the words we use—high and low priority—don't really give you anything very useful.
00:15:29.760 They are very vague names, but the good thing is the answer is also in this language.
00:15:36.960 Because what we care about is not priority or importance or the type of jobs or what team owns them.
00:15:42.960 What we care about is whether a job is running late.
00:15:49.740 The one thing we care about is latency, and that's how we should structure our queues.
00:15:57.120 We should assign jobs to queues looking only at how much latency they can tolerate before it feels like they're late.
00:16:04.740 And this is the problem our friends are having, right? They're trying to isolate workloads from each other,
00:16:12.300 but their queues were not designed around the one thing that we care about.
00:16:18.300 Remember, the symptom is always: this thing didn't happen as quickly as it should have.
00:16:25.740 The one thing we actually care about is latency.
00:16:32.640 And that sounds simplistic, but I'm dying on this hill; latency is the only thing you actually really care about for your queues.
00:16:38.520 Everything else is gravy. Now, there are going to be other things that you care about in order to achieve that latency:
00:16:44.820 throughput, thread counts, hardware resources, etc.
00:16:51.420 But those are a means to an end: you don't really care about throughput.
00:16:57.420 What you care about is that you put a job in a queue and it gets done soon enough.
00:17:04.920 Throughput is just how you cope with the volume to ensure that happens.
00:17:12.960 So separating your queues based on purpose or per team are just roundabout ways to try and get that latency under control.
00:17:19.860 But they usually do not center on that latency; they're just roundabout ways.
00:17:27.780 And what we wanted to do is attack the problem directly.
00:17:35.880 Now, to be fair, the very first approach was doing roughly the right thing: the high and low queues were trying to specify that latency tolerance for those jobs.
00:17:43.980 The problem is that while that decision comes from the right instinct, having high and low and super critical is very ambiguous.
00:17:50.040 For two reasons: First, I have this new job that I just wrote, and it's okay if it takes a minute to run, but not 10 minutes.
00:17:56.520 Where does that go? Is that high? Is that critical?
00:18:00.540 And also, when things go wrong, we want to find out if there's a problem before our customers start calling us.
00:18:05.520 So we want to set up some alerts, for example, but at what point do we alert?
00:18:12.540 How much latency is too much latency for the high queue or the low queue?
00:18:19.260 How do you know?
00:18:25.440 And here’s our hero Sally coming in with her brilliant idea: we will organize our queues around different allowed latencies with clear guarantees and strict constraints.
00:18:31.680 Now, let’s unpack that because there’s a lot there.
00:18:37.920 We will define queues around how much latency a job in those queues might experience.
00:18:44.100 That is, the maximum latency we will allow in these queues.
00:18:50.760 Now, that may sound naive, but you'll see what I mean in a second.
00:18:56.640 What's important is that you know that if you put a job in those queues, it will start running in no more than that advertised maximum latency.
00:19:04.380 We will name these queues based on that clearly and unambiguously.
00:19:10.080 Anyone that's looking at a queue needs to know exactly what they can expect from it.
00:19:17.040 And here’s the important part: that name—that’s a guarantee. It’s a contract.
00:19:24.540 If you put a job in one of these queues, it will start running before that much time elapses.
00:19:29.280 If the queue's latency goes over what it says in the name, alerts go off,
00:19:34.680 pagers go off, on-call engineers get woken up, and we fix it.
00:19:40.920 The name is a promise, and you can count on it. That part is key to the whole solution.
00:19:47.100 So, as an engineer, whenever you're writing a new job, you will choose what queue to put it in based on how much latency you can tolerate.
00:19:54.480 What that means is: what's the threshold where, if your job doesn't run by that time,
00:19:59.400 somebody will notice and think that the job is late or it didn't run, or the system is broken?
00:20:06.660 And this makes it much less ambiguous where to put things.
00:20:13.200 What happens if that new job runs an hour later? Will anyone notice? No? Great, it goes into the within one hour queue.
00:20:20.040 One hour too much? Is ten minutes okay? Then you go to the within ten minutes queue.
00:20:27.960 If not, within one minute it is.
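That selection rule can be sketched in a few lines of Ruby. The queue names and thresholds below are illustrative, not taken from any real codebase:

```ruby
# Advertised maximum start latency per queue, in seconds (illustrative names).
QUEUES = {
  "within_1_minute"   => 60,
  "within_10_minutes" => 600,
  "within_1_hour"     => 3_600,
}.freeze

# Pick the slowest queue whose guarantee still fits the job's tolerance:
# always choose the slowest queue possible.
def queue_for(tolerance_seconds)
  fitting = QUEUES.select { |_name, max| max <= tolerance_seconds }
  name, _max = fitting.max_by { |_name, max| max }
  name || raise(ArgumentError, "no queue is fast enough for #{tolerance_seconds}s")
end

queue_for(2 * 3_600) # a two-hour tolerance lands in "within_1_hour"
queue_for(90)        # 90 seconds of tolerance can only fit "within_1_minute"
```

Note that a 45-minute tolerance lands in the within ten minutes queue, not the one-hour queue: the one-hour queue is allowed to wait longer than that job can tolerate.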
00:20:33.600 Now, granted, finding the threshold will be easier for some jobs than others, but you want to try and map out the consequences of running at different levels of lateness.
00:20:41.760 Figure out what would go wrong and be realistic. There’s always a temptation to think everything must happen now,
00:20:48.600 but remember that it can't all happen instantly, and we need to be thorough in figuring out how much time we can allow our job to wait.
00:20:55.620 And always choose the slowest queue possible.
00:21:01.380 Now, maybe you're not convinced yet, so let's look at how this solves the problems we were seeing in a bit more detail.
00:21:07.620 The key thing here is that the names that we give these queues are clear promises—they're a contract.
00:21:14.040 And when a developer puts a given job in a given queue, they are declaring what that job's needs are.
00:21:20.760 Remember we said earlier the problem is that jobs have an implicit amount of delay that they can tolerate.
00:21:26.740 That’s a requirement those jobs have, but it's not documented anywhere, and it's not actionable in any way.
00:21:32.600 When we set up our queues like this, we turn that implicit requirement into an explicit one.
00:21:39.300 If a job is in the within ten minutes queue, you know it needs to start running in ten minutes or less or else.
00:21:46.320 Another side of this is that when you have things like high and low priority queues, you will be unintentionally mixing jobs that can tolerate different levels of latency.
00:21:52.320 If your high queue is running, say, four minutes behind, that will be fine for some jobs, but maybe not for others.
00:21:58.740 And it's not really very clear what we can do about that.
00:22:04.740 When you set clear expectations, this is much easier.
00:22:10.320 You already know that all the jobs in the within ten minutes queue can tolerate up to ten minutes of delay.
00:22:18.420 If your queue is running five minutes behind, you unequivocally know it’s fine,
00:22:24.420 because the authors of those jobs explicitly told you that by putting them in that queue.
00:22:30.300 And this was another one of our problems: If our queues are having trouble, we want to know before our customers find out.
00:22:36.660 And fix it as fast as possible. So you want alerts for when things go wrong.
00:22:43.680 If you have people on call, you want them paged if their queues breach the latency guarantees.
00:22:50.040 When you have vague names for your queues, how do you set thresholds for those alerts?
00:22:57.060 How do you know that any threshold you set is going to be good for every job in that queue that exists now or that will get added in the future?
00:23:02.520 With clear queue names, the alert pretty much sets itself.
00:23:09.200 For the within one minute queue, you alert when the latency hits one minute, and you're done.
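As a sketch, if the queue names follow a hypothetical `within_N_minutes` convention, the alert threshold can literally be parsed out of the name:

```ruby
# Parse the advertised latency straight out of the queue name, so the
# alert threshold "sets itself". The naming convention is an assumption.
def alert_threshold_seconds(queue_name)
  case queue_name
  when /\Awithin_(\d+)_seconds?\z/ then Regexp.last_match(1).to_i
  when /\Awithin_(\d+)_minutes?\z/ then Regexp.last_match(1).to_i * 60
  when /\Awithin_(\d+)_hours?\z/   then Regexp.last_match(1).to_i * 3_600
  else raise ArgumentError, "#{queue_name} does not advertise a latency"
  end
end

# Fire the alert exactly when the contract is breached.
def breached?(queue_name, current_latency_seconds)
  current_latency_seconds > alert_threshold_seconds(queue_name)
end
```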
00:23:16.440 And of course, if you’re alerting, it’s too late, because things have already gone wrong.
00:23:22.680 Ideally, you want to try and prevent this from happening.
00:23:29.640 You also don't want to spend too much money on servers by having too many of them sitting idle all the time.
00:23:36.360 These clear promises make it easier to do auto-scaling and set adequate thresholds for it.
00:23:42.720 You know precisely what level of latency is acceptable.
00:23:50.460 The queues tell you. You have an idea of how long your servers take to boot up and start helping with the load.
00:23:58.320 You add a bit of margin, and that's your auto-scaling threshold.
00:24:05.880 When you hit that much latency, you start adding servers.
00:24:11.760 At Indeed Flex, for example, we use 50% for most of our queues.
00:24:18.480 If you hit 50% of the allowed latency, we start adding servers—one more every few minutes—until the latency goes back down below 50%.
00:24:25.920 Now, 50% was an easy number to choose, and it's a bit arbitrary.
00:24:33.840 We can probably tune that up for the slower queues, but it's high enough that most of our queues are running on the minimum number of servers most of the time.
00:24:39.720 And when we get spikes, the queues just deal with it. We almost never get paged for breaches.
00:24:47.520 Oh, by the way, if you do auto-scaling, always do it on queue latency.
00:24:52.500 Never on the number of jobs pending or CPU or anything else; it's always based on latency.
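A minimal sketch of that scale-up rule, keyed on latency only. The 50% fraction and the one-server-at-a-time step are as described; the function shape and minimum-server handling are illustrative:

```ruby
SCALE_UP_FRACTION = 0.5 # start adding servers at 50% of the allowed latency

# One autoscaling tick: decide on queue latency, never on job counts or CPU.
def autoscale_step(allowed_latency, current_latency, servers, min_servers: 1)
  if current_latency > allowed_latency * SCALE_UP_FRACTION
    servers + 1                      # one more server every few minutes
  else
    [servers - 1, min_servers].max   # drift back down toward the minimum
  end
end

autoscale_step(600, 400, 3) # latency over 50% of 10 minutes: scale to 4
autoscale_step(600, 100, 3) # latency comfortably low: scale down to 2
```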
00:24:59.100 Now, there's a flip side to this.
00:25:02.640 These guarantees don't come for free. Contracts always go both ways.
00:25:07.920 And for us, that means just one simple rule: if you're in a queue that can only tolerate small amounts of latency, you need to finish running pretty quickly.
00:25:13.200 That's the flip side to this contract. I call it the swimming pool principle.
00:25:20.760 In a swimming pool, you normally have fast, medium, and slow lanes, right?
00:25:27.540 And if you want to be in this lane, you better swim at least this fast, or you need to go to a slower lane.
00:25:35.520 So you’re not getting in the way.
00:25:42.660 In practice, this applies to our jobs too: if you're in a slow queue, in other words one where jobs can tolerate a lot of latency,
00:25:49.500 You can run for a while, and that’s fine.
00:25:56.220 But if you're on a fast queue, jobs need to finish quickly; you need to get out of the way pretty fast.
00:26:02.520 Or you’re going to ruin the party for everyone.
00:26:08.100 When you put your job in a given queue, you're signing into that contract.
00:26:14.820 And if you break the contract, you’re going to get kicked out of that queue.
00:26:21.600 So we need to set a time limit for jobs in each of these queues.
00:26:27.900 That is, if you want to run in the within one-minute queue, you need to finish in this many seconds.
00:26:37.560 Or we can't take you in here; sorry, you need to go to a slower queue.
00:26:44.040 Now to see why, there are a bunch of queuing theory formulas to calculate this, but that’s kind of boring, so just picture this instead.
00:26:50.520 You have a queue here on the left where all of the jobs need to start within one minute.
00:26:57.180 In any queue here on the right, you have a number of threads working those jobs.
00:27:05.760 The queue is running its little jobs, and they get out of the way pretty quickly.
00:27:11.520 New jobs arrive, and everybody's happy.
00:27:18.480 Now you suddenly imagine you start putting jobs in the queue that each take one hour from start to finish.
00:27:25.680 What happens to that queue? As soon as one of those threads gets one of those long jobs, that thread is now gone.
00:27:31.560 It can’t pick up any new jobs for an hour, and a small number of these jobs will clog all of the threads in that queue.
00:27:38.520 Pretty quickly, and no work can get done anymore. If you do auto-scaling, you're going to be adding a new thread for each job.
00:27:46.260 For that close, that closes the thread.
00:27:51.720 That's a lot of threads. Imagine, instead, if you had a queue where every job takes two seconds instead.
00:27:58.920 Now a single thread can handle a lot of jobs before you breach your promise.
00:28:05.520 You can run a lot more jobs with fewer threads, and it's going to be much easier to keep your latency under control.
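The arithmetic behind that picture is simple: a pool's rough capacity is threads divided by average job duration. A back-of-the-envelope sketch (the numbers mirror the examples above):

```ruby
# Roughly how many jobs per minute a worker pool can absorb before
# queue latency starts growing without bound.
def jobs_per_minute(threads, avg_job_duration_seconds)
  (threads * 60.0 / avg_job_duration_seconds).floor
end

jobs_per_minute(10, 3_600) # ten threads on one-hour jobs: 0 jobs a minute
jobs_per_minute(10, 2)     # ten threads on two-second jobs: 300 jobs a minute
```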
00:28:12.840 And this is also, by the way, why priorities don't work.
00:28:19.140 Your low priority but long-running jobs are going to quickly clog all of the threads, and then there's no one left to pick up the new high priority jobs that come in.
00:28:25.560 So you need to set time limits for each queue: the faster the queue, the quicker you need to finish.
00:28:33.960 Now, where you choose to set these limits depends pretty much on how much money you want to spend on servers versus on developers.
00:28:40.920 It’s a trade-off, like everything else.
00:28:47.760 If you set the limits too high, allowing jobs to run for a while makes your developers' lives easier,
00:28:52.920 but you're going to need more threads, and you’ll spend more money on servers.
00:28:59.880 If you set them too low, jobs that need to run with low latency need to be optimized more.
00:29:07.560 Maybe they need refactoring and splitting into different parts.
00:29:12.900 It’s going to be more annoying for your developers, and you're going to spend more time doing that; therefore, you’ll spend more money on developers.
00:29:19.380 It’s a balancing act, and it depends on your priorities.
00:29:27.240 I like the rule of thumb of setting the target at one-tenth the latency limit.
00:29:34.620 So a job in the within one-minute queue can only run for up to six seconds; within ten minutes, it can run for up to a minute.
00:29:41.280 Etc. In reality, this is kind of aspirational and will vary a lot,
00:29:46.860 depending on the state of your codebase and your app and your requirements.
00:29:53.040 It's quite possible that you're going to have to set this quite a lot higher first and spend more on servers for a while.
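The one-tenth rule of thumb is easy to express as a helper; as noted above, treat the fraction as an aspirational default you may loosen at first:

```ruby
# Run-time limit for a job as a fraction (default one-tenth) of the
# queue's advertised maximum latency. Both numbers are in seconds.
def runtime_limit_seconds(queue_latency_seconds, fraction: 0.1)
  queue_latency_seconds * fraction
end

runtime_limit_seconds(60)                # within one minute queue: 6 seconds
runtime_limit_seconds(600)               # within ten minutes queue: 60 seconds
runtime_limit_seconds(60, fraction: 0.5) # a looser start while you optimize
```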
00:29:59.880 Now, there are three important things to keep in mind that are going to be key to this implementation.
00:30:06.300 Number one: the latency promise in the queue and those limits are a contract between the infrastructure and the developer.
00:30:13.140 And like any contract, we need to know if it's broken.
00:30:19.620 If a job is going over those limits, a warning should go out to the respective team that owns the job,
00:30:25.740 to either improve it or move it to a slower queue.
00:30:32.520 Now this is not a page; nobody should be woken up in the middle of the night, but it should make clear that it needs to be changed.
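One way to sketch that warning is a generic timing wrapper around job execution. This is not any real queue library's middleware API, and the limits table is illustrative:

```ruby
# Illustrative run-time limits per queue, in seconds.
RUNTIME_LIMITS = { "within_1_minute" => 6.0, "within_10_minutes" => 60.0 }.freeze

# Time the job; collect a warning (not a page) when it overruns its
# queue's limit, so the owning team can fix it or move it.
def run_with_contract(queue, warnings, limit: RUNTIME_LIMITS.fetch(queue))
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  if elapsed > limit
    warnings << "#{queue}: job ran #{elapsed.round(2)}s, limit is #{limit}s"
  end
  result
end
```

In a real system the `warnings` sink would be a log line or a ticket in the owning team's queue, not a pager.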
00:30:39.180 Number two: there's an implicit social contract here.
00:30:45.180 Remember the inter-team fights that resulted in one queue per team? This is how you prevent that.
00:30:51.840 If a job is breaching the time limit, there needs to be explicit permission to move it.
00:30:58.740 Now, of course, we will all talk to the relevant people to get an idea of what's the best course of action.
00:31:05.520 But if a given job is making the queue sad for everyone, other teams need that explicit permission to move it out.
00:31:11.280 Or you will end up in a situation where each team wants to control their own queues.
00:31:18.840 And finally, respect the principle and focus on latency tolerance.
00:31:25.680 Just because the job finishes quickly and can run on the fast queues doesn't mean it should.
00:31:33.480 There’s going to be a temptation to put jobs in the fastest queue that will accept them.
00:31:40.080 Don’t do that. You may have jobs that run very quickly, but nobody cares if they're late.
00:31:45.360 Put them in the slow queues; don't let them get in the way of jobs that are actually sensitive to latency.
00:31:51.720 And this, in a nutshell, is how you fix the problems you might be having with your queues.
00:31:58.440 But wait, Daniel, I hear you say, what happened to our merry band of software developers?
00:32:04.920 Well, after Sally had an epiphany, she proposed this new structure to the team, and they liked it.
00:32:12.720 They started work on adopting it. It wasn’t easy; it certainly wasn’t quick, but they got there in the end.
00:32:20.460 And there are useful lessons in what they did.
00:32:27.480 First, of course, they created the new queues with their alerts.
00:32:34.140 The rule was that every new job from now on could only go into those new queues.
00:32:42.420 They actually made a custom RuboCop rule to make sure nobody would forget.
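The talk doesn't show the cop itself. As a simplified, stdlib-only stand-in in the same spirit, a check could scan job source for queue options outside the allowed set (a real custom RuboCop cop would match AST nodes instead of using a regex):

```ruby
# The only queues new jobs are allowed to use (names follow the talk's
# latency-based convention; adjust to your own queue list).
ALLOWED_QUEUES = %w[within_1_minute within_10_minutes within_1_hour].freeze

# Return every queue name referenced in the given source string that is
# not in the allowed set. Matches `queue: "name"`, `queue: 'name'`,
# and `queue: :name` forms.
def disallowed_queues(source)
  source.scan(/queue:\s*["':]?([a-z0-9_]+)/).flatten.reject do |q|
    ALLOWED_QUEUES.include?(q)
  end
end
```

Wired into CI, this fails the build the moment someone adds a job pointing at a legacy queue, which is exactly the forgetting the team's custom cop was guarding against.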
00:32:48.620 Then, they started the huge task of figuring out where all the existing jobs should go, which is a lot of jobs and can be daunting.
00:32:55.320 But they found a clever strategy.
00:33:03.240 First, they started not with the most important jobs but with the easiest.
00:33:10.320 For example, some queues can be pretty obvious. They had two queues that run overnight maintenance jobs.
00:33:18.720 You see, no rush for those; they go into the within one-hour queue.
00:33:24.600 Jobs that took a long time to run for the most part also just got sent to the slow queues.
00:33:31.500 Nobody expected them to run quickly anyway!
00:33:38.100 And once the easy stuff was moved out, there was no point in having the queues for each team anymore.
00:33:46.140 They could just collapse them back into priorities and maybe add a few more servers to them if necessary.
00:33:51.960 They were still coming out ahead, and pretty quickly those 60 queues turned into 9.
00:33:59.220 And that was a massive motivational push. It was so much easier to keep the system running happily with fewer queues.
00:34:05.520 And they could see real progress, so they were very eager to do the rest.
00:34:12.240 Now, the next trick that was really clever: they focused on the jobs that ran the most often.
00:34:15.240 The highest volume jobs were candidates for ruining a day, so they made sure they were in the right queues.
00:34:22.740 And those can be somewhat hard to categorize, but there are normally not that many of the really big ones.
00:34:29.520 And as they kept going down this list of the most common jobs and the slowest jobs, at one point they realized
00:34:36.300 they had a lot of different jobs left. There was a lot of work left to do.
00:34:41.640 But the total volume of those jobs left was tiny compared to the overall volume.
00:34:48.960 And here's a little secret: they never actually finished.
00:34:54.300 After a while, the jobs they had left were so few and so quick, there was little chance they would cause problems.
00:35:01.380 So they just merged them all into the default queue and left them there.
00:35:05.520 It wasn't worth cataloging each one, and that left them with this.
00:35:10.260 Now this was bliss. It was a dream to keep it running day to day.
00:35:16.560 But they weren't done. You see, to move faster and keep momentum, they did take a couple of shortcuts.
00:35:24.600 First of all, those speed limits? They set them way higher.
00:35:31.200 They first just threw more servers at the problem, and this made it much easier to start putting jobs in the right queues without worrying too much about performance.
00:35:39.300 But now they had a clear list of trespassers—jobs they could work on gradually to make them faster.
00:35:46.740 And then gradually lower those time limits until they got to the sweet spot.
00:35:53.040 They also learned an extremely important lesson: when to break the one rule.
00:35:58.200 You know how I said latency is the one thing you care about? That’s true.
00:36:03.840 But sometimes practical considerations mean it makes sense to stray from that path a little bit.
00:36:10.320 For example, some jobs need special hardware. In Ruby, that generally means a lot more RAM.
00:36:17.820 If you put those in their own queue, you can have a few large servers for it instead of making all of your servers more expensive.
00:36:23.520 Sometimes you have jobs that need to run one at a time. Generally, this is an anti-pattern,
00:36:30.720 but there are some legitimate reasons to do it. Giving those jobs their own queue, with one thread
00:36:36.480 and never more than one instance, lets you do that, and the name makes it clear that you should never give more than one instance to that queue.
00:36:43.920 Some jobs do a lot of waiting on I/O and use pretty much no CPU.
00:36:50.520 Mailers or webhooks tend to be like that; you could have really high thread counts for those,
00:36:57.060 and save on servers, but it would hurt the other jobs in the same queue.
00:37:04.680 Giving them their own queue lets you make that distinction and save money on servers.
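As an illustration of that per-queue tuning, assuming Sidekiq-style worker processes (the queue names and concurrency numbers here are made up, not from the talk):

```ruby
# Illustrative process layout: each entry is one worker process,
# tuned for the kind of jobs its queue holds.
PROCESSES = [
  { queue: "within_1_minute",  concurrency: 10 }, # normal CPU/IO mix
  { queue: "mailers_webhooks", concurrency: 50 }, # IO-bound: mostly waiting
  { queue: "serial",           concurrency: 1  }, # must run one at a time
].freeze

# Build the command line for each dedicated worker process.
def sidekiq_commands
  PROCESSES.map do |p|
    "sidekiq --queue #{p[:queue]} --concurrency #{p[:concurrency]}"
  end
end
```

The IO-bound queue gets many threads on few servers, while the serial queue is pinned to a single thread, without either choice leaking into the other queues.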
00:37:11.760 You see, these are the usual trade-offs of money versus complexity, and as always, your mileage may vary.
00:37:19.800 But sometimes it's useful to stray from that path a little bit.
00:37:26.760 Just make sure all of your queues still have a clear latency guarantee in the name, the usual alerts, and you’re golden.
00:37:33.720 And that's how our brave team of developers solved all of their queuing woes, reached pure queue bliss, and lived speedily ever after.
00:37:41.520 Thank you!
00:37:57.480 Now before you go, I want to leave you with a couple of quick notes.
00:38:05.640 First, I just mentioned some of the most important lessons that our friends learned while implementing this, but there are a lot more that don't fit in half an hour.
00:38:11.400 You can get more practical advice on how to do this from the resources on this slide.
00:38:17.400 There's also the talk I mentioned; you can get the slides from there.
00:38:23.520 I also want to do some thanking. First to Suleiman, who wrote one of the best talks I've ever seen
00:38:29.040 and whose style massively inspired this talk. If you haven't seen it, you should definitely watch it.
00:38:35.760 Also, my friends Nick, Lisa, and Chris for watching early versions of this talk and every other talk and making them much better.
00:38:43.680 Another thanks to Neil Chandler, one of my teammates who actually came up with most of those clever ideas that I described.
00:38:49.080 And that’s all from me. Thank you, and I'll see you all in the hallway.
Explore all talks recorded at RubyConf 2022