00:00:00.000
Ready for takeoff.
00:00:17.820
I want you to meet Sally. Sally is a lead engineer at Mandal Delfing,
00:00:22.859
a leading provider of paper and other stationery products. She's been there from the very beginning,
00:00:28.439
one of the first engineers they hired. Because of this, she knows the code base inside and out and is extremely
00:00:35.280
familiar with the issues of running the app in production.
00:00:40.320
Sally said today that her queues are sad, so she is dealing with the problem again.
00:00:45.899
This has been an ongoing issue for years, and no matter what they've tried, they never seem to be able to fix it.
00:00:51.840
The next morning, after a good night's sleep, Sally wakes up with a radical new idea that will solve these problems once and for all.
00:00:57.840
But to understand how that will fix all their problems, we first need to get to know a little bit of history.
00:01:04.019
First of all, though, hi! My name is Daniel, and I work at Indeed Flex,
00:01:10.500
a flexible staffing platform based in London. So you probably notice that I'm not originally from London; I come from Argentina.
00:01:15.780
In case you're wondering, that's the accent.
00:01:21.840
Now let's go back to our little paper company.
00:01:26.939
When Sally joined Mandal Delfing, there was a tiny team of only three developers who wrote the app that
00:01:32.759
was running the entire company—buying paper, keeping track of inventory, selling it, delivering it, everything.
00:01:39.960
At first, everything was just running inside that one little web server, and that was fine for a while.
00:01:45.900
But they started having some trouble at some point, so Sally decided to add a background job system.
00:01:51.600
Then there was a queue, and the queue was good; it solved a lot of problems for the team.
00:01:57.060
So they started adding more and more jobs to the queue, and it was still effective.
00:02:03.000
And so our story begins.
00:02:10.800
A few months later, Joey receives a support ticket.
00:02:16.819
Users are reporting that the reset password functionality is broken. Joey takes the ticket.
00:02:23.280
"Works on my machine," he says, and closes it, as he can’t reproduce the issue.
00:02:28.739
Of course, the ticket comes back with more details. Users say they are still not receiving the email.
00:02:35.099
Sure enough, when Joey reproduces the issue again in production, he does not receive the email.
00:02:41.459
After a bit of investigation, it turns out that the queue has 40,000 jobs in it, and emails are getting stuck there.
00:02:46.500
Joey spins up some extra workers to drain the queues faster and marks the ticket as resolved.
00:02:51.900
But since it had customer impact, he decides to declare it an incident.
00:02:58.080
While he's writing the postmortem, he starts thinking about how to prevent this from happening again.
00:03:04.379
Now, Joey knows that when a job gets stuck in a queue, it’s often because other types of jobs are getting in the way.
00:03:10.200
One way to fix that is to try to isolate those loads so they don't interfere with each other.
00:03:16.260
We can't have all the jobs in one big bag and just pick them in order, right?
00:03:21.720
Some of them are more important than others.
00:03:28.739
Now, in some queuing systems, you can set priorities for those jobs and let some of them skip the queue.
00:03:34.379
So Joey thought maybe that would be a good idea, but it turns out their queues don't support priorities.
00:03:40.319
Jobs are picked up in the same order that they're put in the queue, and as an aside, by the way, that's a good thing.
00:03:46.799
Priorities won't really solve the problem, and we'll see why later.
00:03:52.920
Instead of priorities, what you need to do in this system to isolate loads is create separate queues.
00:03:59.099
So Joey decides to create a high priority queue for important jobs, and because programmers love symmetry,
00:04:04.500
he also creates a low queue for less important tasks.
00:04:10.980
A few days later, he finds some jobs that need to run every now and then, but they're not urgent, and they take a bit long to run.
00:04:17.280
It would be a shame if something important was late because of them, so he puts them in the long queue.
00:04:24.600
A few months down the road, Joey is dealing with another incident related to queues.
00:04:30.540
It turns out everyone's jobs are important, so the high priority queue has tons of different jobs,
00:04:39.240
and now critical transactions are not being processed fast enough because of some other new job that's taking too long, and we're losing sales.
00:04:46.380
So after another postmortem, Joey makes it very clear to everyone,
00:04:51.780
"I know your jobs are important, but only very, very critical things can go here. We cannot let this happen again."
00:04:58.800
I think none of you are surprised here. A few months later, the credit card company had an outage.
00:05:04.979
Credit card jobs, which are critical and normally take about a second to run, were now taking a whole minute until they timed out.
00:05:11.340
This started backing up the queue, and two-factor authentication text messages, which are also critical, started going out late again.
00:05:18.120
But by this point, the company had hired a very senior engineer from a much bigger company, and she brought a lot of hard-won experience with her.
00:05:23.160
This is Marianne. When she noticed the incident, she recognized the problem immediately and got involved.
00:05:28.620
Marianne has seen this exact same pattern before and knows what the problem is.
00:05:34.199
Organizing jobs based on priority or importance is doomed to fail.
00:05:40.800
First of all, there's no clear meaning of what high priority or low priority is.
00:05:47.280
And yes, there are docs with examples of what is and isn't high priority, but they can never predict everything that we will need to throw at our queues.
00:05:52.979
Some of those jobs will also be very high volume or long-running, and it's really hard to predict how all of them will interact.
00:06:00.120
But she's seen this before, and she has the answer.
00:06:05.460
We need to have separate queues for different purposes, not priorities. That way, jobs that run together all look similar,
00:06:12.300
making performance more predictable, and it's also going to be much clearer what goes where.
00:06:17.820
For example, if credit card jobs start having trouble, they don't interfere with unrelated tasks like two-factor authentication.
00:06:23.880
Of course, you can't have a queue for every single purpose, although some companies have tried.
00:06:31.319
But you can define a few for your most common tasks. Marianne said to create a queue for mailers, which are generally pretty important.
00:06:38.699
She also noticed we send millions of customer satisfaction survey emails, and those are not important.
00:06:43.700
So she added a survey queue so they don't get in the way.
00:06:49.199
A few months later, they had something like this, and jobs were humming along just fine.
00:06:58.800
Every now and then, a new queue was born, and everybody was happy with this for a while.
00:07:05.080
Now fast forward a few years, and the company has grown quite a bit. We now have a couple dozen developers, and as usual, they're split into teams based on the different functionalities of the app. You have the purchasing folks buying paper, the inventory people keeping track of it, the logistics team, etc.
00:07:55.080
One day the logistics team is having a little retro, and the Mad column is pretty long.
00:08:01.080
The purchasing folks have done it again! This is the fourth time this has happened this quarter.
00:08:07.919
How many times do we need to tell them not to do this?
00:08:13.440
They had this genius idea that instead of calling vendors and asking them for better prices, they could just write an email to everyone and let the computer do the haggling.
00:08:19.500
The feature was a little aggressive and sent a ton of emails, so the trucking company didn't get the notifications they needed, and shipments went out late that day.
00:08:25.199
It's not always purchasing though; we had trouble with the sales team doing the same thing too.
00:08:30.960
And to be fair, we've also done this; remember the shipping apocalypse last Cyber Monday, where we clogged every queue in the system and ruined everyone's day?
00:08:36.779
The conclusion seems clear: we have all of these queues, but we have no control over what the other teams are putting in them.
00:08:43.320
And they don't know what our jobs look like or what the requirements are for those jobs.
00:08:49.199
There's only one way to fix this: teams need to act independently of each other.
00:08:55.080
That's the point of teams, and just like that, the logistics queue is born.
00:09:00.420
They didn't go for microservices. Now, of course, it's not just the logistics team that gets their own queues;
00:09:07.125
they share this idea with other teams and ask them to do the same.
00:09:14.640
Guess what? Not everything that the logistics team does has the same urgency.
00:09:20.640
So three months later, we end up with logistics-high and logistics-low.
00:09:26.159
But of course, our original high, low, and mailers queues are still there, and we have hundreds of different jobs in them.
00:09:34.680
Nobody knows what they do or where they go.
00:09:40.560
This is starting to get hairy, and it's a good thing that we have VC money behind this company.
00:09:46.440
But even then, at some point, somebody does notice that we're spending way more on services than we should.
00:09:52.080
They start asking questions, and that's when we realize we have 60 queues now.
00:09:57.600
That means 60 servers, and most of those servers are not doing anything most of the time.
00:10:04.320
But somebody has to serve those queues.
00:10:09.360
So an obvious decision is made: we can configure processes to work multiple queues at once.
00:10:17.940
So we will group some of them together and reduce the server count.
00:10:24.180
And... guess what?
00:10:30.600
Now you may be asking, why am I telling you this clearly facetious story about people making very obvious mistakes?
00:10:34.440
The truth is this isn’t fiction at all. I've renamed the queues and the teams to protect the innocent.
00:10:41.040
But I've seen this exact same story develop pretty much exactly this way in every company I’ve worked in.
00:10:48.420
Hell, I've been half of the characters in this story, proposing things that were obviously going to work.
00:10:55.920
That was me with the bad ideas.
00:11:01.500
And I've seen this enough that I think it’s a common progression that many companies go through. You may have seen this too,
00:11:09.000
and hopefully the rest of this talk will help you with those issues.
00:11:14.280
Or if you haven't gone down this path yet, hopefully you can prevent some headaches.
00:11:19.680
The reason I think this is a common progression is not that these characters are bad engineers.
00:11:27.300
The thing is, when you're faced with these challenges, these steps do seem like the obvious solution.
00:11:33.060
The problem is that queues tend to have interactions and behaviors that are very hard to reason about.
00:11:41.220
So when we propose one of these changes to fix a problem, it's really hard to predict the next problem that they will cause.
00:11:46.740
So how do we get out of this? Well, first, we need to think about what actual problem we're trying to solve.
00:11:54.780
When we start making new queues, the problem is deceptive, and we tend to attack the wrong issue.
00:12:00.360
The problem is not in the queues, and it's not priorities or threads.
00:12:07.680
The problem is the language that we use to talk about this.
00:12:13.920
If you think about it, you have jobs running in the background all the time, and no one notices.
00:12:20.520
If someone notices, that’s when you have a problem.
00:12:26.760
But what is it that people notice? A job didn't run.
00:12:31.680
What "didn't run" generally actually means is that it hasn't started yet, but I feel like it should have by now.
00:12:38.040
Or put another way, a job is running late.
00:12:44.640
And here's where the trouble starts, because what does late mean here?
00:12:49.740
Many jobs every day don't get to start immediately once they're queued.
00:12:55.560
They tend to wait in the queue a bit, and no one notices. Some wait for minutes, and no one notices.
00:13:01.620
Others make it very obvious if they're late, even by a little bit.
00:13:08.040
For example, you probably know this one: you send a message to a friend, and the app tells you when it has reached their phone.
00:13:14.220
That took a second, and you didn't even notice. But let's try a different scenario.
00:13:21.840
How anxious are you getting right now? Is he in a tunnel?
00:13:27.360
Did his phone run out of battery? Should I be worried? Is he dead? Oh, thank God, he's okay.
00:13:32.520
Now, that job was late.
00:13:38.520
So here's the issue: there's an expectation that you have for your jobs of how long it will be until they run.
00:13:45.600
But that expectation is implicit. It's one of those "I know it when I see it" kinds of things.
00:13:52.920
You know if it's running late, but if I ask you when that would be, you probably can't tell me.
00:13:58.980
Think about the database maintenance job, right? You run a vacuum and analyze on every table every night,
00:14:06.300
because there was a problem with statistics two years ago, and we will never trust the database again.
00:14:13.260
And if that's running 20 minutes behind today, is that a problem? Almost certainly not.
00:14:19.680
But what if your critical queue is running 20 minutes behind? That’s almost definitely a problem.
00:14:26.820
Now, what if it's two minutes behind? Is that okay?
00:14:32.220
What if the low priority queue is 20 minutes behind? Is that bad?
00:14:38.520
The problem is the language that we use to define these queues and the jobs we put in them.
00:14:45.840
We give them these vague names, and it gets us in trouble. High priority or low priority gives you some relative idea.
00:14:52.320
This is more important than that, but it gives you no insight as to whether your queues are healthy or not at any given point.
00:14:59.520
Until somebody's missing some job and they start shouting.
00:15:06.480
And as an engineer, it also gives you no indication as to where you should put things.
00:15:12.000
So what do you do when you're writing a new job? You look around the queues for other jobs that look similar and follow that,
00:15:18.180
or you just guess, or you listen to your gut, or you go with how important it feels.
00:15:23.760
The problem is that the words we use—high and low priority—don't really give you anything very useful.
00:15:29.760
They are very vague names, but the good thing is the answer is also in this language.
00:15:36.960
Because what we care about is not priority or importance or the type of jobs or what team owns them.
00:15:42.960
What we care about is whether a job is running late.
00:15:49.740
The one thing we care about is latency, and that's how we should structure our queues.
00:15:57.120
We should assign jobs to queues looking only at how much latency they can tolerate before it feels like they're late.
00:16:04.740
And this is the problem our friends are having, right? They're trying to isolate workloads from each other,
00:16:12.300
but their queues were not designed around the one thing that we care about.
00:16:18.300
Remember, the symptom is always: this thing didn't happen as quickly as it should have.
00:16:25.740
The one thing we actually care about is latency.
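To make that concrete: a queue's latency is simply how long its oldest waiting job has been sitting there. Here is a minimal plain-Ruby sketch; the struct and names are mine, not from any particular job library.

```ruby
# Sketch: queue latency is the age of the oldest job still waiting,
# i.e. how long ago it was enqueued (names here are illustrative).
Job = Struct.new(:name, :enqueued_at)

# Latency of a queue = time the oldest waiting job has spent in it.
def queue_latency(jobs, now: Time.now)
  oldest = jobs.min_by(&:enqueued_at)
  oldest ? now - oldest.enqueued_at : 0.0
end

now  = Time.now
jobs = [
  Job.new("send_email", now - 45), # enqueued 45 seconds ago
  Job.new("send_sms",   now - 12), # enqueued 12 seconds ago
]
queue_latency(jobs, now: now) # => 45.0 seconds
```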
00:16:32.640
And that sounds simplistic, but I'm dying on this hill; latency is the only thing you actually really care about for your queues.
00:16:38.520
Everything else is gravy. Now, there are going to be other things that you care about in order to achieve that latency:
00:16:44.820
throughput, thread counts, hardware resources, etc.
00:16:51.420
But those are a means to an end: you don't really care about throughput.
00:16:57.420
What you care about is that you put a job in a queue and it gets done soon enough.
00:17:04.920
Throughput is just how you cope with the volume to ensure that happens.
00:17:12.960
So separating your queues by purpose or by team is just a roundabout way to try and get that latency under control.
00:17:19.860
But those approaches usually do not center on that latency; they only get at it indirectly.
00:17:27.780
And what we wanted to do is attack the problem directly.
00:17:35.880
Now, to be fair, the very first approach was doing roughly the right thing: the high and low queues were trying to specify that latency tolerance for those jobs.
00:17:43.980
The problem is that while that decision comes from the right instinct, having high and low and super critical is very ambiguous.
00:17:50.040
For two reasons: First, I have this new job that I just wrote, and it's okay if it takes a minute to run, but not 10 minutes.
00:17:56.520
Where does that go? Is that high? Is that critical?
00:18:00.540
And also, when things go wrong, we want to find out if there's a problem before our customers start calling us.
00:18:05.520
So we want to set up some alerts, for example, but at what point do we alert?
00:18:12.540
How much latency is too much latency for the high queue or the low queue?
00:18:19.260
How do you know?
00:18:25.440
And here’s our hero Sally coming in with her brilliant idea: we will organize our queues around different allowed latencies with clear guarantees and strict constraints.
00:18:31.680
Now, let’s unpack that because there’s a lot there.
00:18:37.920
We will define queues around how much latency a job in those queues might experience.
00:18:44.100
That is, the maximum latency we will allow in these queues.
00:18:50.760
Now, allow me my naïveté here; you'll see what I mean in a second.
00:18:56.640
What's important is that you know that if you put a job in those queues, it will start running in no more than that advertised maximum latency.
00:19:04.380
We will name these queues based on that clearly and unambiguously.
00:19:10.080
Anyone that's looking at a queue needs to know exactly what they can expect from it.
00:19:17.040
And here’s the important part: that name—that’s a guarantee. It’s a contract.
00:19:24.540
If you put a job in one of these queues, it will start running before that much time elapses.
00:19:29.280
If the queue's latency goes over what it says in the name, alerts go off,
00:19:34.680
PagerDuty gets called, on-call engineers get woken up, and we fix it.
00:19:40.920
The name is a promise, and you can count on it. That part is key to the whole solution.
00:19:47.100
So, as an engineer, whenever you're writing a new job, you will choose what queue to put it in based on how much latency you can tolerate.
00:19:54.480
What that means is: what's the threshold where, if your job doesn't run by that time,
00:19:59.400
somebody will notice and think that the job is late or it didn't run, or the system is broken?
00:20:06.660
And this makes it much less ambiguous where to put things.
00:20:13.200
What happens if that new job runs an hour later? Will anyone notice? No? Great, it goes into the within one hour queue.
00:20:20.040
One hour too much? Okay, can it wait ten minutes? Then it goes to the within ten minutes queue.
00:20:27.960
If not, within one minute.
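This decision can even be mechanical: pick the slowest queue whose guarantee still fits the job's tolerance. A sketch, with hypothetical queue names and numbers:

```ruby
# Hypothetical latency queues: name => maximum allowed latency in seconds.
LATENCY_QUEUES = {
  within_1_minute:   60,
  within_10_minutes: 600,
  within_1_hour:     3600,
}.freeze

# Pick the slowest queue whose guarantee still fits the job's tolerance.
def queue_for(tolerance_seconds)
  candidates = LATENCY_QUEUES.select { |_, max| max <= tolerance_seconds }
  raise ArgumentError, "tolerance below the fastest queue" if candidates.empty?
  candidates.max_by { |_, max| max }.first
end

queue_for(90)    # tolerates 90s -> :within_1_minute
queue_for(7200)  # tolerates 2h  -> :within_1_hour
```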
00:20:33.600
Now, granted, finding the threshold will be easier for some jobs than others, but you want to try and map out the consequences of running at different levels of lateness.
00:20:41.760
Figure out what would go wrong and be realistic. There’s always a temptation to think everything must happen now,
00:20:48.600
but remember that it can't all happen instantly, and we need to be thorough in figuring out how much time we can allow our job to wait.
00:20:55.620
And always choose the slowest queue possible.
00:21:01.380
Now, maybe you're not convinced yet, so let's look at how this solves the problems we saw in a bit more detail.
00:21:07.620
The key thing here is that the names that we give these queues are clear promises—they're a contract.
00:21:14.040
And by putting a given job in a given queue, a developer is declaring what that job's needs are.
00:21:20.760
Remember we said earlier the problem is that jobs have an implicit amount of delay that they can tolerate.
00:21:26.740
That’s a requirement those jobs have, but it's not documented anywhere, and it's not actionable in any way.
00:21:32.600
When we set up our queues like this, we turn that implicit requirement into an explicit one.
00:21:39.300
If a job is in the within ten minutes queue, you know it needs to start running in ten minutes or less or else.
00:21:46.320
Another side of this is that when you have things like high and low priority queues, you will be unintentionally mixing jobs that can tolerate different levels of latency.
00:21:52.320
If your high queue is running, say, four minutes behind, that will be fine for some jobs, but maybe not for others.
00:21:58.740
And it's not really very clear what we can do about that.
00:22:04.740
When you set clear expectations, this is much easier.
00:22:10.320
You already know that all the jobs in the within ten minutes queue can tolerate up to ten minutes of delay.
00:22:18.420
If your queue is running five minutes behind, you unequivocally know it’s fine,
00:22:24.420
because the authors of those jobs explicitly told you that by putting them in that queue.
00:22:30.300
And this was another one of our problems: If our queues are having trouble, we want to know before our customers find out.
00:22:36.660
And fix it as fast as possible. So you want alerts for when things go wrong.
00:22:43.680
If you have people on call, you want them paged if their queues breach the latency guarantees.
00:22:50.040
When you have vague names for your queues, how do you set thresholds for those alerts?
00:22:57.060
How do you know that any threshold you set is going to be good for every job in that queue that exists now or that will get added in the future?
00:23:02.520
With clear queue names, the alert pretty much sets itself.
00:23:09.200
In the within one minute queue, you alert when the latency hits one minute, and you're done.
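Since the threshold is right there in the name, deriving it can be automated. A sketch, assuming a hypothetical "within_&lt;n&gt;_&lt;unit&gt;" naming scheme:

```ruby
# Derive the alert threshold from the queue name itself
# (hypothetical naming scheme: "within_<n>_<unit>").
UNIT_SECONDS = { "second" => 1, "minute" => 60, "hour" => 3600 }.freeze

def promised_latency(queue_name)
  n, unit = queue_name.match(/within_(\d+)_(\w+?)s?\z/).captures
  Integer(n) * UNIT_SECONDS.fetch(unit)
end

def breached?(queue_name, current_latency)
  current_latency > promised_latency(queue_name)
end

promised_latency("within_10_minutes") # => 600
breached?("within_1_minute", 75)      # => true, page someone
breached?("within_10_minutes", 75)    # => false, all fine
```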
00:23:16.440
And of course, if you’re alerting, it’s too late, because things have already gone wrong.
00:23:22.680
Ideally, you want to try and prevent this from happening.
00:23:29.640
You also don't want to spend too much money by having too many servers that are idle all the time.
00:23:36.360
These clear promises make it easier to do auto-scaling and set adequate thresholds for it.
00:23:42.720
You know precisely what level of latency is acceptable.
00:23:50.460
The queues tell you. You have an idea of how long your servers take to boot up and start helping with the load.
00:23:58.320
You add a bit of margin, and that's your auto-scaling threshold.
00:24:05.880
When you hit that much latency, you start adding servers.
00:24:11.760
At Indeed Flex, for example, we use 50% for most of our queues.
00:24:18.480
If you hit 50% of the allowed latency, we start adding servers—one more every few minutes—until the latency goes back down below 50%.
00:24:25.920
Now, 50% was an easy number to choose, and it's a bit arbitrary.
00:24:33.840
We can probably tune that up for the slower queues, but it's high enough that most of our queues are running on the minimum number of servers most of the time.
00:24:39.720
And when we get spikes, the queues just deal with it. We almost never get paged for breaches.
00:24:47.520
Oh, by the way, if you do auto-scaling, always do it on queue latency.
00:24:52.500
Never on the number of jobs pending or CPU or anything else; it's always based on latency.
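That trigger is only a few lines. A sketch of the decision, using the 50% ratio mentioned earlier (the ratio and names are illustrative, not a prescription):

```ruby
# Scale on queue latency as a fraction of the promised maximum.
SCALE_UP_RATIO = 0.5

def scaling_action(current_latency, promised_latency)
  ratio = current_latency.to_f / promised_latency
  if ratio >= SCALE_UP_RATIO
    :add_server # one more every few minutes until latency drops back down
  else
    :hold       # still comfortably inside the promise
  end
end

scaling_action(20, 60) # 33% of the budget -> :hold
scaling_action(45, 60) # 75% of the budget -> :add_server
```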
00:24:59.100
Now, there's a flip side to this.
00:25:02.640
These guarantees don't come for free. Contracts always go both ways.
00:25:07.920
And for us, that means just one simple rule: if you're in a queue that can only tolerate small amounts of latency, you need to finish running pretty quickly.
00:25:13.200
That's the flip side to this contract. I call it the swimming pool principle.
00:25:20.760
In a swimming pool, you normally have fast, medium, and slow lanes, right?
00:25:27.540
And if you want to be in this lane, you better swim at least this fast, or you need to move to a slower lane.
00:25:35.520
So you’re not getting in the way.
00:25:42.660
In practice, this applies to our jobs too: if you're in a slow queue, your jobs can tolerate a lot of latency.
00:25:49.500
You can run for a while, and that’s fine.
00:25:56.220
But if you're in a fast queue, jobs need to finish quickly; you need to get out of the way pretty fast.
00:26:02.520
Or you’re going to ruin the party for everyone.
00:26:08.100
When you put your job in a given queue, you're signing into that contract.
00:26:14.820
And if you break the contract, you’re going to get kicked out of that queue.
00:26:21.600
So we need to set a time limit for jobs in each of these queues.
00:26:27.900
That is, if you want to run in the within one-minute queue, you need to finish in this many seconds.
00:26:37.560
Or we can't take you in here; sorry, you need to go to a slower queue.
00:26:44.040
Now to see why, there are a bunch of queuing theory formulas to calculate this, but that’s kind of boring, so just picture this instead.
00:26:50.520
You have a queue here on the left where all of the jobs need to start within one minute.
00:26:57.180
And over here on the right, you have a number of threads working those jobs.
00:27:05.760
The queue is running its little jobs, and they get out of the way pretty quickly.
00:27:11.520
New jobs arrive, and everybody's happy.
00:27:18.480
Now imagine you suddenly start putting jobs in the queue that each take one hour from start to finish.
00:27:25.680
What happens to that queue? As soon as one of those threads gets one of those long jobs, that thread is now gone.
00:27:31.560
It can’t pick up any new jobs for an hour, and a small number of these jobs will clog all of the threads in that queue.
00:27:38.520
Pretty quickly, and no work can get done anymore. If you do auto-scaling, you're going to be adding a new thread for each job.
00:27:46.260
Each of those jobs clogs yet another thread.
00:27:51.720
That's a lot of threads. Imagine, instead, a queue where every job takes two seconds.
00:27:58.920
Now a single thread can handle a lot of jobs before you breach your promise.
00:28:05.520
You can run a lot more jobs with fewer threads, and it's going to be much easier to keep your latency under control.
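Little's law gives a quick way to put numbers on this picture: the number of busy threads is roughly the job arrival rate times the average job duration. An illustrative sketch:

```ruby
# Back-of-the-envelope version of the clogging picture: by Little's law,
# busy threads ~= arrival_rate * average_duration.
def threads_needed(jobs_per_second, avg_duration_seconds)
  (jobs_per_second * avg_duration_seconds).ceil
end

threads_needed(10, 2)     # 2-second jobs at 10/s -> 20 threads
threads_needed(0.5, 3600) # 1-hour jobs at 0.5/s  -> 1800 threads
```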
00:28:12.840
And this is also, by the way, why priorities don't work.
00:28:19.140
Your low priority but long-running jobs are going to quickly clog all of the threads, and then there's no one left to pick up the new high priority jobs that come in.
00:28:25.560
So you need to set time limits for each queue: the faster the queue, the quicker you need to finish.
00:28:33.960
Now, where you choose to set these depends pretty much on how much money you want to spend on servers versus developers.
00:28:40.920
It’s a trade-off, like everything else.
00:28:47.760
If you set the limits too high, allowing jobs to run for a while makes your developers' lives easier,
00:28:52.920
but you're going to need more threads, and you’ll spend more money on servers.
00:28:59.880
If you set them too low, jobs that need to run with low latency need to be optimized more.
00:29:07.560
Maybe they need refactoring and splitting into different parts.
00:29:12.900
It’s going to be more annoying for your developers, and you're going to spend more time doing that; therefore, you’ll spend more money on developers.
00:29:19.380
It’s a balancing act, and it depends on your priorities.
00:29:27.240
I like the rule of thumb of setting the target at one-tenth the latency limit.
00:29:34.620
So a job in the within one-minute queue can only run for up to six seconds; within ten minutes, it can run for up to a minute.
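The rule of thumb itself is trivial to encode (aspirational numbers, as noted):

```ruby
# One-tenth rule of thumb: a job may run for at most a tenth of
# its queue's promised maximum latency.
def runtime_limit(promised_latency_seconds)
  promised_latency_seconds / 10.0
end

runtime_limit(60)  # within-one-minute queue  -> 6.0 seconds per job
runtime_limit(600) # within-ten-minutes queue -> 60.0 seconds per job
```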
00:29:41.280
Etc. In reality, this is kind of aspirational and will vary a lot,
00:29:46.860
depending on the state of your codebase and your app and your requirements.
00:29:53.040
It's quite possible that you're going to have to set this quite a lot higher first and spend more on servers for a while.
00:29:59.880
Now, there are three important things to keep in mind that are going to be key to this implementation.
00:30:06.300
Number one: the latency promise in the queue and those limits are a contract between the infrastructure and the developer.
00:30:13.140
And like any contract, we need to know if it's broken.
00:30:19.620
If a job is going over those limits, a warning should go out to the respective team that owns the job,
00:30:25.740
to either improve it or move it to a slower queue.
00:30:32.520
Now this is not a page; nobody should be woken up in the middle of the night, but it should make clear that it needs to be changed.
00:30:39.180
Number two: there's an implicit social contract here.
00:30:45.180
Remember the inter-team fights that resulted in one queue per team? This is how you prevent that.
00:30:51.840
If a job is breaching the time limit, there needs to be explicit permission to move it.
00:30:58.740
Now, of course, we will all talk to the respective people to get an idea of what's the best course of action.
00:31:05.520
But if a given job is making the queue sad for everyone, there needs to be explicit permission for other teams to move it out.
00:31:11.280
Or you will end up in a situation where each team wants to control their own queues.
00:31:18.840
And finally, respect the principle and focus on latency tolerance.
00:31:25.680
Just because the job finishes quickly and can run on the fast queues doesn't mean it should.
00:31:33.480
There’s going to be a temptation to put jobs in the fastest queue that will accept them.
00:31:40.080
Don’t do that. You may have jobs that run very quickly, but nobody cares if they're late.
00:31:45.360
Put them in the slow queues; don't let them get in the way of jobs that are actually sensitive to latency.
00:31:51.720
And this, in a nutshell, is how you fix the problems you might be having with your queues.
00:31:58.440
But wait, Daniel, I hear you say, what happened to our merry band of software developers?
00:32:04.920
Well, after Sally had an epiphany, she proposed this new structure to the team, and they liked it.
00:32:12.720
They started work on adopting it. It wasn’t easy; it certainly wasn’t quick, but they got there in the end.
00:32:20.460
And there are useful lessons in what they did.
00:32:27.480
First, of course, they created the new queues with their alerts.
00:32:34.140
The rule was that every new job from now on could only go into those new queues.
00:32:42.420
They actually made a custom RuboCop rule to make sure nobody would forget.
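A real cop would depend on RuboCop's internals, but the same rule can be sketched as a runtime guard in a base job class (all class and queue names here are hypothetical):

```ruby
# Runtime sketch of the "new jobs only go in the latency queues" rule.
ALLOWED_QUEUES = %i[within_1_minute within_10_minutes within_1_hour].freeze

class BaseJob
  # Declare the job's queue; reject anything that isn't a latency queue.
  def self.queue(name)
    unless ALLOWED_QUEUES.include?(name)
      raise ArgumentError,
            "#{name} is not a latency queue; pick one of #{ALLOWED_QUEUES.join(', ')}"
    end
    @queue = name
  end

  def self.queue_name
    @queue
  end
end

class SendReceiptJob < BaseJob
  queue :within_1_minute # fine
end

# class LegacyJob < BaseJob
#   queue :high          # raises ArgumentError
# end
```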
00:32:48.620
Then, they started the huge task of figuring out where all the existing jobs should go, which is a lot and can be daunting.
00:32:55.320
But they found a clever strategy.
00:33:03.240
First, they started not with the most important jobs but with the easiest.
00:33:10.320
For example, some queues can be pretty obvious. They had two queues that run overnight maintenance jobs.
00:33:18.720
You see, no rush for those; they go into the within one-hour queue.
00:33:24.600
Jobs that took a long time to run for the most part also just got sent to the slow queues.
00:33:31.500
Nobody expected them to run quickly anyway!
00:33:38.100
And once the easy stuff was moved out, there was no point in having the queues for each team anymore.
00:33:46.140
They could just collapse them back into the new queues and maybe add a few more servers if necessary.
00:33:51.960
They were still coming out ahead, and pretty quickly those 60 queues turned into 9.
00:33:59.220
And that was a massive motivational push. It was so much easier to keep the system running happily with fewer queues.
00:34:05.520
And they could see real progress, so they were very eager to do the rest.
00:34:12.240
Now, the next trick was really clever: they focused on the jobs that ran the most often.
00:34:15.240
The highest-volume jobs were the likeliest to ruin a day, so they made sure those were in the right queues.
00:34:22.740
Those can take some work to categorize, but there are normally not that many really big ones.
00:34:29.520
And as they kept going down this list of the most common jobs and the slowest jobs, at one point they realized
00:34:36.300
they had a lot of different jobs left. There was a lot of work left to do.
00:34:41.640
But the total volume of those jobs left was tiny compared to the overall volume.
00:34:48.960
And here's a little secret: they never actually finished.
00:34:54.300
After a while, the jobs they had left were so few and so quick, there was little chance they would cause problems.
00:35:01.380
So they just merged them all into the default queue and left them there.
00:35:05.520
It wasn't worth cataloging each one, and that left them with this.
00:35:10.260
Now this was bliss. It was a dream to keep it running day to day.
00:35:16.560
But they weren't done. You see, to move faster and keep momentum, they did take a couple of shortcuts.
00:35:24.600
First of all, those speed limits? They set them way higher.
00:35:31.200
They first just threw more servers at the problem, and this made it much easier to start putting jobs in the right queues without worrying too much about performance.
00:35:39.300
But now they had a clear list of trespassers—jobs they could work on gradually to make them faster.
00:35:46.740
And then gradually lower those time limits until they got to the sweet spot.
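That "list of trespassers" can be built with a simple timing wrapper (a sketch, not shown in the talk; the limits and names are made up): time each job against its queue's current speed limit, record the ones that exceed it, and ratchet the limits down as the list shrinks.

```ruby
# Hypothetical per-queue speed limits, in seconds. Start generous, lower over time.
SPEED_LIMITS = {
  "within_30_seconds" => 5.0,
  "within_1_hour"     => 300.0,
}

# Jobs that ran longer than their queue's limit: the list to work on gradually.
TRESPASSERS = []

def run_with_limit(queue, job_name)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
ensure
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  if elapsed > SPEED_LIMITS.fetch(queue)
    TRESPASSERS << { queue: queue, job: job_name, seconds: elapsed }
  end
end

run_with_limit("within_30_seconds", "ResizeImageJob") { sleep 0.01 }
```

In production you would hang this off your job framework's middleware hooks and report trespassers to your metrics system rather than an in-memory array.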
00:35:53.040
They also learned an extremely important lesson: when to break the one rule.
00:35:58.200
You know how I said latency is the one thing you care about? That’s true.
00:36:03.840
But sometimes practical considerations mean it makes sense to stray from that path a little bit.
00:36:10.320
For example, some jobs need special hardware. In Ruby, that generally means a lot more RAM.
00:36:17.820
If you put them in their own queue, you can have a few large servers for just that queue instead of making every server more expensive.
00:36:23.520
Sometimes you have jobs that want to run one at a time. Generally, this is an anti-pattern,
00:36:30.720
but there are some legitimate reasons to do it. Giving those jobs their own queue with one thread
00:36:36.480
and never more than one instance lets you do that, and the queue's name should make it clear that it must never get more than one instance.
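If the team runs Sidekiq (the talk doesn't name the backend, and the queue name below is made up), the "one thread, never more than one instance" rule is an operational setting rather than code:

```shell
# Dedicated worker for the serialized queue: exactly one process, one thread.
# The queue name itself warns operators never to scale this out.
bundle exec sidekiq --queue serial_never_scale --concurrency 1
```

The flags are standard Sidekiq CLI options; the discipline of running only one such process has to live in your deployment config.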
00:36:43.920
Some jobs do a lot of waiting on I/O and use pretty much no CPU.
00:36:50.520
Mailers or webhooks tend to be like that; you could have really high thread counts for those,
00:36:57.060
and save on servers, but it would hurt the other jobs in the same queue.
00:37:04.680
Giving them their own queue lets you make that distinction and save money on servers.
00:37:11.760
You see, these are the usual trade-offs of money versus complexity, and as always, your mileage may vary.
00:37:19.800
But sometimes it's useful to stray from that path a little bit.
00:37:26.760
Just make sure all of your queues still have a clear latency guarantee in the name, the usual alerts, and you’re golden.
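Because the guarantee lives in the queue's name, the alert threshold can be derived from the name itself (a sketch; the `within_<n>_<unit>` naming pattern is an assumption, not from the talk):

```ruby
# Seconds per unit for names like "within_5_minutes" or "within_1_hour".
UNITS = {
  "second" => 1, "seconds" => 1,
  "minute" => 60, "minutes" => 60,
  "hour" => 3600, "hours" => 3600,
}.freeze

# Parse the latency guarantee out of the queue name, in seconds.
def latency_threshold(queue_name)
  m = queue_name.match(/within_(\d+)_(\w+)/) or
    raise ArgumentError, "no latency guarantee in #{queue_name}"
  m[1].to_i * UNITS.fetch(m[2])
end

def breached?(queue_name, current_latency_seconds)
  current_latency_seconds > latency_threshold(queue_name)
end
```

In production, the current latency would come from your backend (for Sidekiq, `Sidekiq::Queue.new(name).latency`), and `breached?` would feed whatever alerting you already use.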
00:37:33.720
And that's how our brave team of developers solved all of their queuing woes, reached pure queue bliss, and lived speedily ever after.
00:37:41.520
Thank you!
00:37:57.480
Now before you go, I want to leave you with a couple of quick notes.
00:38:05.640
First, I just mentioned some of the most important lessons that our friends learned while implementing this, but there are a lot more that don't fit in half an hour.
00:38:11.400
You can get more practical advice on how to do this on this rebuild.
00:38:17.400
There's also that talk I mentioned; you can download it from there.
00:38:23.520
I also want to do some thanking. First, to Suleiman, who wrote one of the best talks I've ever seen
00:38:29.040
and whose style massively inspired this talk. If you haven't seen it, you should definitely watch it.
00:38:35.760
Also, my friends Nick, Lisa, and Chris for watching early versions of this talk and every other talk and making them much better.
00:38:43.680
Another thanks to Neil Chandler, one of my teammates who actually came up with most of those clever ideas that I described.
00:38:49.080
And that’s all from me. Thank you, and I'll see you all in the hallway.