Talks

What does "high priority" mean? The secret to happy queues

Like most web applications, you run important jobs in the background. And today, some of your urgent jobs are running late. Again. No matter how many changes you make to how you enqueue and run your jobs, the problem keeps happening. The good news is you're not alone. Most teams struggle with this problem, try more or less the same solutions, and have roughly the same result. In the end, it all boils down to one thing: keeping latency low. In this talk I will present a latency-focused approach to managing your queues reliably, keeping your jobs flowing and your users happy.

RubyConf 2022

00:00:00.000 ready for takeoff
00:00:17.820 I want you to meet Sally. Sally is a lead engineer at Dunder
00:00:22.859 Mifflin, a leading provider of paper and other stationery products. She's been there from the very beginning,
00:00:28.439 one of the first engineers they hired. Because of this she knows the code base inside and out, and she's extremely
00:00:35.280 familiar with the issues of running the app in production. And Sally
00:00:40.320 is sad today, again. And she's sad today because her queues are
00:00:45.899 sad today, again. So she deals with the problem again,
00:00:51.840 makes the system happy, then goes back to her normal work. She's still thinking about this, though.
00:00:57.840 This has been a problem for years, and no matter what they've tried, they never seem to be able to fix it.
00:01:04.019 The next morning, after a good night's sleep, Sally wakes up with a solution, a radical new idea that will solve these
00:01:10.500 problems once and for all. But to understand how that will fix all their problems, we first need to get to
00:01:15.780 know a little bit of history, to understand how Dunder Mifflin got here in the first place. First of all though, hi, my name is Daniel
00:01:21.840 and I work at Indeed Flex. We're a flexible staffing platform based in London. You probably noticed, though, I'm
00:01:26.939 not originally from London; I come from Argentina, so in case you're wondering, that's the accent. And now let's go back to a little paper
00:01:32.759 company. You see, when Sally joined Dunder Mifflin, there was a tiny team of only three developers, and they wrote the app that
00:01:39.960 was running the entire company: buying paper, keeping track of the inventory, selling it, delivering it, everything.
00:01:45.900 And at first, everything was just running inside that one little web server, and that was fine for a while, but they
00:01:51.600 started having some trouble with that at some point, so Sally decided to add a background job system.
00:01:57.060 And then there was a queue, and the queue was good. It solved a lot
00:02:03.000 of problems for the team, so they started adding more and more jobs to the queue, and the queue was still good, so more and
00:02:10.800 more jobs get added to it. And so our story begins.
00:02:18.120 A few months later, Joey receives a support ticket: users are reporting that the reset password
00:02:23.280 functionality is broken. Joey takes the ticket.
00:02:28.739 "Works on my machine," he says, and closes it as "can't reproduce." Now of course the ticket comes back
00:02:35.099 with more details: users say they're still not receiving the email, and sure enough, when Joey tries this again in production,
00:02:41.459 he indeed does not get the email. After a bit of investigation, it
00:02:46.500 turns out the queue has 40,000 jobs in it, and emails are getting stuck there and going out late.
00:02:51.900 Joey spins up some extra workers to drain the queues faster and marks the ticket as resolved.
00:02:58.080 But this had customer impact, so he decides to declare it an incident, and while he's writing the postmortem he
00:03:04.379 starts thinking: how can we prevent this from happening again? Now, Joey knows that when a job gets stuck
00:03:10.200 in a queue, it's because other types of jobs are getting in the way, and one way to fix that is to try to isolate those
00:03:16.260 loads so they don't interfere with each other. We can't have all the jobs in one big bag and just pick them in order, right?
00:03:21.720 Some of them are more important than others. Now, in some queuing systems you can set priorities for some jobs and
00:03:28.739 let those jobs skip the queue and run faster, so Joey thought maybe that would be a
00:03:34.379 good idea. But it turns out their queues don't support priorities; jobs are picked up in the same order they're put in
00:03:40.319 the queue. As an aside, by the way, that's a good thing: priorities won't really solve the problem, and we'll see why later.
00:03:46.799 Instead of priorities, in this system what you need to do to isolate loads is create separate queues.
00:03:52.920 So Joey decides to create the "high" priority queue for important jobs, and because programmers love us some
00:03:59.099 symmetry, he also creates a "low" queue for less important stuff. A few days later he finds some jobs that
00:04:04.500 need to run every now and then, but they're not urgent, and they take a bit long to run. It'd be a shame if something important was
00:04:10.980 late because of them, so he puts them in the "low" queue. A few months down the road, Joey is
00:04:17.280 dealing with another incident related to queues. Turns out everyone's jobs are important, so the "high" queue has tons of
00:04:24.600 different jobs, and now critical transactions are not being processed fast enough because of some other new job that's taking too long, and we're
00:04:30.540 losing sales. And so, after another postmortem, Joey christens the "critical" queue,
00:04:39.240 and he makes it very clear to everyone: I know your jobs are important, but only very, very critical things can go here.
00:04:51.780 I think none of you are surprised here. A few months later the credit card company had an outage, and credit card jobs, which are
00:04:58.800 critical and normally take about a second to run, were now taking a whole minute until they timed out.
00:05:04.979 This started backing up the queue, and 2FA text messages, which are also critical, started going out late
00:05:11.340 again. But by this point the company had hired a very senior engineer from a much bigger company, and she brought a lot of
00:05:18.120 hard-won experience with her. This is Marianne, and when she noticed the incident she recognized the problem
00:05:23.160 immediately and got involved. Marianne has seen this exact same pattern before, and she knows what the
00:05:28.620 problem is: organizing queues based on priority or importance is doomed to fail.
00:05:34.199 First of all, there's no clear meaning of what high priority or low priority is, and yes, there are docs with examples of what should go where, but they can never predict
00:05:40.800 everything that we will need to throw at our queues. Some of those jobs will also be very high volume or long running, and it's
00:05:47.280 really hard to predict how all of them will interact. But she's seen this before and she has the answer:
00:05:52.979 we need to have separate queues for the different purposes, no priorities. That way jobs that run together all look
00:06:00.120 similar, making performance more predictable, and it's also going to be much clearer what goes where.
00:06:05.460 That way, if credit cards start having trouble, they don't interfere with unrelated stuff like 2FA.
00:06:12.300 Now of course you can't have a queue for every single purpose, although some companies have tried, but you can define
00:06:17.820 a few for your most common things. Marianne sets up the queue for mailers, which are generally pretty important.
00:06:23.880 She also notices we send millions and millions of customer satisfaction survey emails, and those are not important, so she adds a surveys queue so
00:06:31.319 they don't get in the way. A few months later you have something like this, and jobs are humming along just fine. Every now and then a new
00:06:38.699 queue is born, and everybody is happy with this for a while.
00:06:49.199 The company has grown quite a bit by now. We now have a couple dozen developers, and as usual they're split into teams, in
00:06:55.080 this case based on the different functionalities of the app: you have the purchasing folks buying the paper, the inventory people keeping track of it, the
00:07:01.080 logistics team, etc. One day the logistics team is having a little retro, and the "Mad" column is
00:07:07.919 pretty long. The purchasing folks have done it again. This is the fourth time this happens
00:07:13.440 this quarter. How many times do we need to tell them? Turns out purchasing had this genius
00:07:19.500 idea: instead of calling vendors and asking them for better prices, they could just auto-email everyone and let the
00:07:25.199 computer do the haggling. The feature was a little aggressive and there were a ton of emails, so the
00:07:30.960 trucking company didn't get the notifications they needed, and shipments went out late that day, again.
00:07:37.380 It's not always purchasing, though; we had trouble with the sales team doing the same thing too. And to be fair, we've also done this:
00:07:43.979 remember the shipping apocalypse last Cyber Monday, where we clogged every queue in the system and ruined everyone's day?
00:07:50.340 The conclusion seems clear: we have all of these queues, but we have no control over what other teams are
00:07:55.979 putting in them, and they don't know what our jobs look like or what the requirements are for those jobs.
00:08:01.259 There's only one way to fix this: teams need to be able to act independently of each other, that's the
00:08:07.199 point of teams. And just like that, the logistics queue is born.
00:08:12.660 At least they didn't go for microservices. Now, of course, it's not just the logistics team that gets a queue; they
00:08:18.840 obviously share this idea with the other teams and ask them to do the same. And guess what, not everything that the
00:08:23.940 logistics team does has the same urgency, so three months later we end up with logistics_high and purchasing_low. And
00:08:28.979 of course our original high and low and mailers are still there, because we have hundreds of different jobs in there and nobody knows what they do or where they
00:08:34.680 go. This is starting to get hairy. And also, it's a good thing that we have
00:08:40.560 VC money behind this company, but even then, at some point somebody does notice that we're spending way more
00:08:46.440 on servers than we should, and they start asking questions. That's because we have 60 queues by now, and
00:08:52.080 that means 60 servers. Most of those servers are not doing anything most of the time, but somebody has to serve those
00:08:57.600 queues. So an obvious decision is made: we can configure processes to work several different queues at once, so we will
00:09:04.320 group some of them together and reduce the server count. Guess what.
00:09:11.040 Now, you may be asking why I'm telling you this clearly facetious story about people making very obvious mistakes, but
00:09:17.940 the truth is this isn't fiction at all. I've renamed the queues and the teams to protect the innocent, but I've seen this
00:09:24.180 exact same story develop pretty much exactly this way in every company I've worked at. Hell, I've been half the
00:09:30.180 characters in this story, proposing things that were obviously going to work. That was me with the bad ideas.
00:09:35.820 And I've seen this enough that I think this is a common progression that many companies go through. You may have seen
00:09:41.040 this too, and hopefully the rest of this talk will help you with those issues, or if you haven't gone down this path yet,
00:09:46.440 hopefully I can save you some headaches. The reason I think this is a common progression, by the way, is not that these
00:09:52.500 characters are bad engineers. The thing is, when you're faced with these challenges, these steps do seem like the
00:09:58.500 obvious solution; they make sense. The problem is that queues tend to have interactions and behaviors that are very
00:10:04.019 hard to reason about, so when we propose one of these changes to fix a problem, it's really hard to predict the next
00:10:09.839 problem that it will cause. So how do we get out of this? Well, we
00:10:15.600 first need to think about what actual problem we're trying to solve when we start making new queues, because the
00:10:20.760 problem is deceptive and we tend to attack the wrong issue. The problem isn't queues, and it's not priorities or threads;
00:10:27.300 the problem is the language that we use to talk about this. At least it often is.
00:10:33.000 If you think about it, you have jobs running in the background all the time and no one notices. If someone notices, that's when you have
00:10:39.839 a problem. But what is it that people notice? "A job didn't run," which generally actually means "it didn't run
00:10:46.500 yet, but I feel like it should have by now." Or, put another way, a job is running late.
00:10:52.800 And here's where the trouble starts, because what does "late" mean here? Many, many jobs every day don't get to
00:11:00.000 start immediately once they get enqueued; they tend to wait in the queue a bit and no one notices. Some wait for
00:11:05.339 minutes and no one notices. For some others, it's very obvious if they're late even by a little bit.
00:11:12.360 For example, you probably know this one: you send a message to a friend and the app tells you when it has reached their
00:11:18.000 phone. That took a second; you didn't even notice. But let's try a different scenario.
00:11:33.000 How anxious are you getting right now? Is he in a tunnel?
00:11:38.040 Did his phone run out of battery? Should I be worried? Is he dead? Oh, thank God,
00:11:45.180 he's okay. Now, that job was late.
00:11:50.220 So here's the issue: there's an expectation that you have for your jobs of how long it will be until they run, but that expectation is implicit. It's
00:11:57.600 one of those "I know it when I see it" kind of things, where you know if it's running late, but if I ask you when that would be,
00:12:03.000 you probably can't tell me. Like, you've got the database maintenance queue, right, where you run a VACUUM
00:12:08.399 ANALYZE on every table every night, because there was a problem with statistics two years ago and we will never trust the database again.
00:12:13.800 And that's running 20 minutes behind today. Is that a problem? Almost certainly not.
00:12:19.680 What if your critical queue is running 20 minutes behind, though? That's almost definitely a problem. Now, what if it's two minutes behind?
00:12:26.279 Is that okay? What if the low priority queue is 20 minutes behind? Is that bad?
00:12:33.300 The problem is the language that we use to define these queues and the jobs we put in them. We give them these vague
00:12:39.000 names and it gets us in trouble. "High priority" or "low priority" gives you some relative idea, this is more
00:12:45.300 important or urgent than that, but it gives you no insight as to whether your queues are healthy or not at any given point,
00:12:50.639 until somebody's missing some job and they start shouting. And as an engineer, it also gives you no
00:12:55.800 indication as to where you should put things. So what do you do when you're writing a new job? You look around the queues for
00:13:01.440 other jobs that look similar and you follow that, or you just guess, you listen to your gut, or how important
00:13:07.079 this job feels. The problem is the words we use: "high" and
00:13:12.480 "low" priority don't really give you anything very useful, they're very vague names. But the good thing is the answer is also
00:13:18.899 in this language, because what we care about is not priority or importance or the type of
00:13:24.060 jobs or what team owns them. This is what we care about: a job is running late.
00:13:29.940 The one thing we care about is latency, and that's how we should structure our queues: we should assign jobs to queues
00:13:36.779 looking only at how much latency they can tolerate before it feels like they're late.
00:13:42.300 And this is the problem our friends are having, right? They're trying to isolate workloads from each other, but their queues were not designed around the one
00:13:48.360 thing that we care about. Remember, the symptom is always "this thing didn't happen as quickly as I needed it
00:13:54.420 to." The one thing we care about is latency, and that sounds simplistic, but
00:13:59.459 I'm dying on this hill: latency is the only thing you actually really care about for your queues. Everything else is
00:14:05.399 gravy. Now, there are going to be other things that you care about in order to achieve that latency: throughput, thread counts,
00:14:11.940 hardware resources, etc. But those are a means to an end. You don't really care about throughput; what you care
00:14:18.420 about is that you put a job in a queue and it gets done soon enough. Throughput is just how you cope with the
00:14:24.120 volume to make sure that happens. So separating your queues based on purpose or per team, those are just roundabout ways
00:14:31.139 to try and get that latency under control, but the issue is they don't really center on that latency;
00:14:36.180 they're roundabout ways, and what we want is to attack the problem directly. Now, to be fair, the very first approach
00:14:43.560 was doing roughly the right thing: the high and low queues were trying to specify that latency tolerance for those jobs.
00:14:50.579 The problem is that while that decision comes from the right instinct, having high and low and super-really-extra-
00:14:55.740 critical is very ambiguous, for two reasons. First, I have this new job that I just
00:15:01.980 wrote and it's okay if it takes a minute to run, but not 10 minutes. Where does that go? Is that high? Is that
00:15:08.399 critical? And also, when things go wrong, we want to find out if there's a problem before our
00:15:14.160 customers start calling us, so we want to set up some alerts, for example. But at what point do we alert? How much latency
00:15:20.220 is too much latency for the high queue, or for the low queue? How do you know?
00:15:25.680 And here's where our hero Sally comes in with her brilliant idea: we will organize our queues around
00:15:31.320 different allowed latencies, with clear guarantees and strict constraints.
00:15:36.899 Now let's unpack that, because there's a lot there. We will define queues around how much
00:15:42.300 latency a job in those queues might experience, that is, the maximum latency we will allow in those queues. Now, "allow" may sound
00:15:49.740 naive, but you'll see what I mean in a second. What's important is that you know that if you put a job in those queues, it will
00:15:55.139 start running in no more than that advertised maximum latency.
00:16:00.420 We will name these queues based on that, clearly and unambiguously. Anyone that's
00:16:05.459 looking at a queue needs to know exactly what they can expect from it. And here's the important part:
00:16:11.339 that name, that's a guarantee, it's a contract. If you put a job in one of these queues, it will start running
00:16:17.339 before that much time elapses. If a queue's latency goes over what it says in the name, alerts go off, fire departments get
00:16:24.060 called, on-call engineers get woken up, and we fix it. The name is a promise and you can count on it, and that part is the
00:16:30.600 key to the whole solution. So as an engineer, whenever you're writing a new job, you will choose what
00:16:36.480 queue to put it in based on how much latency you can tolerate. What that means is: what's the threshold where, if your
00:16:42.480 job doesn't run by that time, somebody will notice and think that the job is late, or it didn't run, or the system is broken?
00:16:48.300 And this makes it much less ambiguous where to put things. What happens if that new job runs an
00:16:54.000 hour later? Will anyone notice? No? Great, it goes into the "within one hour" queue. One hour too
00:16:59.579 much? Is 10 minutes okay? Can you wait 10 minutes? You go to "within 10 minutes"; if not, "within one minute."
00:17:05.040 Now, granted, finding the threshold will be easier for some jobs than others, but you want to try and map out what would
00:17:10.500 be the consequences of running at different levels of lateness and figure out what would go wrong. And be realistic: there's always a
00:17:18.240 temptation to think everything must happen now, but remember that it can't all happen instantly, and we need to be thorough in
00:17:24.720 figuring out how much time we can allow our job to wait, and always choose the slowest queue possible.
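
To make this concrete, here is a minimal sketch of what latency-named queues can look like in a Rails app using ActiveJob; the queue names, the two job classes, and the mailer calls are hypothetical examples, not taken from the talk.

    # Queues are named after the maximum latency they promise.
    # Each job declares the slowest queue whose guarantee still feels "on time".
    class PasswordResetJob < ApplicationJob
      queue_as :within_1_minute    # the user is staring at their inbox right now

      def perform(user_id)
        UserMailer.password_reset(user_id).deliver_now
      end
    end

    class SatisfactionSurveyJob < ApplicationJob
      queue_as :within_1_day       # nobody notices if this runs hours later

      def perform(user_id)
        SurveyMailer.satisfaction_survey(user_id).deliver_now
      end
    end
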
00:17:31.320 Now, I sense maybe you're not convinced yet, so let's look at why this solves the problems we were seeing in a bit more
00:17:36.660 detail. The key thing here is that the names that we give these queues are clear promises, they're a contract, and they set
00:17:43.559 clear expectations for our developers. By putting a given job in a given queue, a developer is declaring what that job's
00:17:49.919 needs are. Remember, we said earlier the problem is that jobs have an implicit amount of
00:17:54.960 delay that they can tolerate. That's a requirement that those jobs have, but it's not documented anywhere and it's not actionable in any way.
00:18:01.140 When we set up our queues like this, we turn that implicit requirement into an explicit one: if a job is in the "within
00:18:07.559 10 minutes" queue, you know it needs to start running in 10 minutes or less, or else.
00:18:14.039 Another side of this is that when you have things like high and low priority queues, you will be unintentionally
00:18:19.200 mixing jobs that can tolerate different levels of latency. If your high queue is running, say, four
00:18:25.559 minutes behind, that will be fine for some jobs but maybe not for others, and
00:18:30.960 it's not really very clear what we can do about that. With these clear expectations this is much easier: you already know that all the jobs in the
00:18:37.200 "within 10 minutes" queue can tolerate up to 10 minutes of delay. If your queue is running five minutes
00:18:42.240 behind, you unequivocally know it's fine, because the authors of those jobs explicitly told you that by putting them
00:18:48.660 in that queue. And this was another one of our problems: if our queues are having trouble, we want
00:18:53.760 to know before our customers find out, and fix it as fast as possible. So you want alerts for when things go wrong; if
00:19:00.720 you have people on call, you want them paged if the queues breach their latency guarantees. When you have vague names for your queues,
00:19:07.679 how do you set thresholds for those alerts? How do you know that any threshold you set is going to be good
00:19:13.260 for every job in that queue that exists now or that will get added in the future? With clear queue names, the alert pretty much
00:19:20.580 sets itself: in the "within one minute" queue, you alert when the latency hits one minute and you're done.
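
As a rough sketch of how those alerts can set themselves, assuming Sidekiq as the backend (Sidekiq::Queue#latency reports how long the oldest waiting job has been in the queue); the guarantee table and the Alerting call are illustrative placeholders, not the talk's actual code.

    require "sidekiq/api"

    # The queue name doubles as the alert threshold: page when the oldest
    # waiting job has been in the queue longer than the name promises.
    QUEUE_GUARANTEES = {
      "within_1_minute"   => 60,      # seconds
      "within_10_minutes" => 600,
      "within_1_hour"     => 3_600,
    }.freeze

    QUEUE_GUARANTEES.each do |name, max_wait|
      latency = Sidekiq::Queue.new(name).latency   # seconds the oldest job has waited
      Alerting.page!(queue: name, latency: latency) if latency > max_wait
    end
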
00:19:27.780 And of course, if you're alerting, it's already too late, things have already gone wrong. Ideally you want to try and prevent this from happening.
00:19:33.720 You also don't want to spend too much money on servers by having too many servers that are idle all the time.
00:19:39.900 These clear promises make it easier to do auto-scaling and set adequate thresholds for it. You know precisely
00:19:46.020 what level of latency is acceptable, the queues tell you. You have an idea of how long your servers take to boot up and
00:19:51.840 start helping with the load, you add a bit of margin, and that's your auto-scaling threshold: when you hit that much
00:19:56.880 latency, you start adding servers. At Indeed Flex, for example, we use 50% for most of our queues: if we hit 50% of
00:20:03.900 the allowed latency, we start adding servers, one more every few minutes, until the latency goes back down below 50 percent.
00:20:10.080 Now, 50 was an easy number to choose and it's a bit arbitrary, we could probably tune that up for the slower queues, but
00:20:16.679 it's high enough that most of our queues are running on the minimum number of servers most of the time, and when we get
00:20:21.960 spikes, the queues just deal with it. We almost never get paged for breaches.
00:20:27.120 Oh, by the way, if you do auto-scaling, always do it on queue latency, never on the
00:20:32.580 number of jobs pending or CPU or anything else; it's always based on latency.
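
A sketch of that auto-scaling rule, again assuming Sidekiq; the 50% trigger is the number mentioned in the talk, while the Autoscaler calls are hypothetical stand-ins for whatever scaling API you actually use.

    require "sidekiq/api"

    SCALE_UP_FRACTION = 0.5   # scale on latency, never on pending-job counts or CPU

    QUEUE_GUARANTEES = { "within_1_minute" => 60, "within_10_minutes" => 600 }.freeze

    QUEUE_GUARANTEES.each do |name, max_wait|
      latency = Sidekiq::Queue.new(name).latency
      if latency > max_wait * SCALE_UP_FRACTION
        Autoscaler.add_worker(queue: name)            # one more every few minutes
      else
        Autoscaler.drift_toward_minimum(queue: name)  # shed servers once latency recovers
      end
    end
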
00:20:38.640 Now, there is a flip side to this: these guarantees don't come for free, contracts always go both ways.
00:20:46.200 And for us that means just one simple rule: if you're in a queue that can only tolerate small amounts of latency, you
00:20:52.260 need to finish running pretty quickly. That's the flip side to this contract. I call it the swimming pool principle: in
00:20:59.460 a swimming pool you normally have fast, medium, and slow lanes, right? And if you want to be in this lane, you better swim at least this fast, or you need to go to
00:21:06.780 the slower lane so you're not getting in the way. In practice this applies to our jobs too: if you're in a slow queue, where jobs
00:21:13.260 can tolerate a lot of latency, you can run for a while and that's fine, but if you're in a fast queue, where jobs need to
00:21:19.559 start quickly, you need to get out of the way fairly fast or you're going to ruin the party for everyone. When you put your job in a given queue,
00:21:26.100 you're signing up to that contract, and if you break the contract, you're going to get kicked out of that queue.
00:21:31.440 So we need to set a time limit for jobs in each of these queues. That is, if you want to run in the "within one minute"
00:21:37.320 queue, you need to finish in this many seconds, or we can't take you in here; sorry, you need to go to a slower queue.
00:21:43.620 Now, to see why, there's a bunch of queuing theory formulas to calculate this, but that's kind of boring, so just
00:21:49.140 picture this instead: you have a queue here on the left where all of the jobs need to start within one minute.
00:21:54.600 And like any queue, here on the right you have a number of threads working those jobs, and the queue is running its little
00:21:59.760 jobs and they get out of the way pretty quickly, new jobs arrive, and everybody's happy.
00:22:05.460 Now imagine you suddenly start putting jobs in the queue that each take one hour from start to finish.
00:22:11.940 What happens to that queue? As soon as one of those threads gets one of those long jobs, the thread is now gone
00:22:18.000 for an hour, it can't pick up any new jobs. So a small number of these jobs will clog all of the threads in that queue
00:22:24.600 really quickly, and no work can get done anymore. If you do auto-scaling, you're going to
00:22:29.700 be adding a new thread for each job that clogs a thread. That's
00:22:34.799 a lot of threads. Imagine instead if you had a queue where every job takes two seconds. Now
00:22:41.640 a single thread can handle a lot of jobs before you breach your promise. You can now run a lot more jobs with a lot fewer threads, and it's going to be much
00:22:47.760 easier to keep your latency under control. And this is also, by the way, why priorities don't work: your low priority
00:22:55.080 but long-running jobs are going to fairly quickly clog all of the threads, and then there's no one left to pick up the new high priority jobs that come in.
00:23:03.240 So you need to set time limits for each queue: the faster the queue, the faster you need to finish.
00:23:08.640 Now, where you choose to set these depends pretty much on how much money you want to spend on servers and on developers. It's a trade-off, like
00:23:15.720 everything else. If you set these limits too high, if you let jobs run for a while, that's going to make your developers'
00:23:20.940 lives easier, but you're going to need more threads and you'll spend more money on servers. If you set them too low, jobs
00:23:27.179 that need to run with low latency need to be optimized more, maybe refactored and split into different parts. It's going to
00:23:32.400 be more annoying for your developers and you're going to spend more time doing that, so you'll spend more money on developers. It's a balancing act and it
00:23:38.940 depends on your priorities. I like the rule of thumb of setting the target at one tenth of the latency
00:23:45.000 limit, so a job in the "within one minute" queue can only run for up to six seconds, "within 10 minutes" can run for up to a
00:23:50.940 minute, etc. In reality this is kind of aspirational, and it's also going to vary a lot
00:23:56.880 depending on the state of your code base and your app and your requirements. It's quite possible that you're going to
00:24:02.880 have to set this quite a lot higher at first and spend more on servers for a while.
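
Here is one way the "one tenth" rule of thumb could be checked in code, sketched with an ActiveJob callback; the limits table and the SlowJobReporter are hypothetical, and a real setup would likely report through whatever error tracker the team already uses.

    # Roughly 1/10th of each queue's latency guarantee, in seconds.
    RUNTIME_LIMITS = {
      "within_1_minute"   => 6,
      "within_10_minutes" => 60,
      "within_1_hour"     => 360,
    }.freeze

    class ApplicationJob < ActiveJob::Base
      around_perform do |job, block|
        started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
        block.call
        elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
        limit   = RUNTIME_LIMITS[job.queue_name]

        if limit && elapsed > limit
          # A warning to the owning team, not a page: optimize the job
          # or move it to a slower queue.
          SlowJobReporter.warn(job: job.class.name, elapsed: elapsed, limit: limit)
        end
      end
    end
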
00:24:09.659 Now, there are three important things to keep in mind that are going to be key to this implementation. Number one: the latency promise of the
00:24:15.720 queue and those time limits are a contract between the infrastructure and the developers, and like any contract, we need
00:24:21.840 to know if it's broken. If a job is going over those limits, a warning should go out to the respective team that owns the
00:24:28.080 job, to either improve it or move it to a slower queue. Now, this is not a page, nobody should be woken up in
00:24:34.020 the middle of the night, but it should make it clear that it needs to be changed. Number two: there's an implicit social
00:24:41.039 contract here. Remember the inter-team fights that resulted in one queue per team? This is
00:24:46.620 how you prevent that. If a job is breaching the time limit, there needs to be explicit permission
00:24:51.780 for folks to move it. Now, of course, we will all talk to the respective people and get an idea of what's the
00:24:58.500 best course of action, but if a given job is making the queue sad for everyone, you need to have that explicit permission for
00:25:04.260 other teams to move it out, or you will end up in a situation where each team wants to control their own queues.
00:25:10.679 And finally, respect the principle and focus on latency tolerance. Just because a job finishes quickly
00:25:17.340 and can run in the fast queues doesn't mean it should. There's going to be a temptation to put jobs in the fastest
00:25:23.220 queue that will accept them. Don't do that. You may have jobs that run very quickly, but nobody cares if they're
00:25:28.980 late: put them in the slow queues, don't let them get in the way of jobs that are actually sensitive to latency.
00:25:34.380 And this, in a nutshell, is how you fix the problems that you might be having with your queues.
00:25:41.039 But wait, Daniel, I hear you say, what happened to our merry band of software developers?
00:25:47.220 Well, after her epiphany, Sally proposed this new structure to the team and they liked
00:25:53.340 it, and they started work on adopting it. It wasn't easy, it certainly wasn't quick, but they got
00:25:59.700 there in the end, and there are useful lessons in what they did. First, of course, they created the new queues with their alerts, and the rule
00:26:06.120 was that every new job from now on could only go into those new queues. They actually made a custom RuboCop rule to
00:26:11.940 make sure nobody would forget.
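
A custom cop along those lines might look roughly like this; the cop name, the allowed queue list, and the assumption that jobs declare their queue with ActiveJob's queue_as are all illustrative, not the team's actual cop.

    # Only latency-named queues may be used in `queue_as` calls.
    module RuboCop
      module Cop
        module Jobs
          class LatencyNamedQueue < Base
            ALLOWED = %w[within_1_minute within_10_minutes within_1_hour within_1_day].freeze
            MSG = "Use one of the latency-named queues: #{ALLOWED.join(', ')}."

            def on_send(node)
              return unless node.method?(:queue_as)

              queue = node.first_argument
              return unless queue&.basic_literal?

              add_offense(queue) unless ALLOWED.include?(queue.value.to_s)
            end
          end
        end
      end
    end
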
00:26:18.120 And then they started the huge task of figuring out where all the existing jobs should go, which is a lot, and it can be a bit daunting.
00:26:23.400 But they found a very clever strategy: first, they started not with the most important jobs but with the easiest.
00:26:30.179 For example, some queues can be pretty obvious: they had two queues that run overnight maintenance jobs.
00:26:35.460 Nobody's in a rush for those, they go into the "within one day" queue. Jobs that took really long to run, for the
00:26:41.220 most part, also just got sent to the slow queues; nobody expected them to run quickly anyway.
00:26:46.740 And then, once the easy stuff is moved out, there's no point in having the queues for each team anymore, we can just
00:26:52.740 collapse them back into the priority queues and maybe add a few more servers to them if necessary; you're still coming out
00:26:57.960 ahead. And pretty quickly those 60 queues turned into nine.
00:27:03.000 And that was a massive motivational push: it was so much easier to keep the system running happily with all those fewer
00:27:08.940 queues, and they could see real progress. They were very eager to do the rest.
00:27:14.279 Now, the next trick was really clever: they went for the jobs that run the most often, the highest volume ones.
00:27:20.940 Those are always a candidate for ruining a day, so you want to make sure that they're in the right queues.
00:27:26.340 And those are sometimes hard to categorize, but there's normally not that many of the really big ones.
00:27:32.760 And as they kept going down this list of the most common jobs and the slowest jobs, at one point they realized that they had
00:27:39.720 a lot of different jobs left, there was a lot of work left to do, but the total volume of those jobs left
00:27:45.960 was tiny compared to the overall volume. And here's a little secret:
00:27:51.419 they never actually finished. After a while, the jobs they had left were so few and so quick there was very
00:27:57.059 little chance that they would cause problems, so they just merged them all into a default queue and left them there; it wasn't worth
00:28:03.179 cataloging each one. And that left them with this. Now, this was bliss, it was a dream to
00:28:10.320 keep running day to day. But they weren't done. You see, to move faster and keep momentum, they did take a couple of shortcuts.
00:28:17.520 First of all, those speed limits: they set them way higher at first and just threw more servers at the problem,
00:28:23.039 and this made it much easier to start putting jobs in the right queues without having to worry too much about
00:28:28.320 the performance. But now they had a clear, ordered list of trespassers, jobs that were too slow for
00:28:35.159 their queues, which they could work on gradually to make them faster, and then gradually lower those time limits
00:28:40.440 until they got to the sweet spot. They also learned an extremely important lesson: when to break the one rule.
00:28:48.779 You know how I said latency is the one thing you care about? That's true, but sometimes practical
00:28:54.360 considerations mean it makes sense to stray from that path a little bit. For example, some jobs need special
00:29:00.720 hardware; in Ruby that generally means a lot more RAM. If you put them in their own queue, then you can have a few large servers
00:29:07.500 for them, instead of making all the servers for that queue more expensive. Sometimes you have jobs that want to run
00:29:13.380 one at a time, and generally this is an anti-pattern, but there are actually some legitimate reasons to do this;
00:29:19.679 having their own queue with one thread, and never more than one instance, lets you do that, and the name makes it clear
00:29:24.899 that you should never give more than one instance to that queue. Some jobs do a lot of waiting on IO,
00:29:30.899 so they use pretty much no CPU; mailers and webhooks tend to be like that. You could have really high thread counts for those
00:29:37.380 and save on servers, but it would hurt the other jobs in the same queue; giving them their own queue lets you
00:29:42.600 make that distinction and save money on servers. You see, these are the usual trade-offs of money versus complexity, and as always,
00:29:49.380 your mileage may vary, but it's sometimes useful to stray from that path a little bit. Just make sure all of your queues
00:29:55.620 still have a clear latency guarantee in the name and the usual alerts, and you're golden.
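
Those special-case queues can be expressed directly in worker configuration. A sketch using Sidekiq 7 capsules, with made-up queue names and concurrency numbers for illustration; the talk doesn't prescribe a specific tool.

    Sidekiq.configure_server do |config|
      # Ordinary latency-named queues run in the default capsule.
      config.queues = %w[within_1_minute within_10_minutes within_1_hour]

      # Jobs that must run one at a time: their own queue, a single thread
      # (and make sure only one such process runs, per the talk's caveat).
      config.capsule("serial") do |cap|
        cap.concurrency = 1
        cap.queues = %w[serial_within_1_hour]
      end

      # IO-bound jobs (mailers, webhooks): lots of threads, barely any CPU.
      config.capsule("io_bound") do |cap|
        cap.concurrency = 50
        cap.queues = %w[io_within_10_minutes]
      end
    end
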
00:30:01.080 And that's how our brave team of developers solved all of their queuing woes, reached pure queue bliss, and lived
00:30:08.159 speedily ever after. Thank you.
00:30:20.279 Now, before you go, I want to leave you with a couple of quick notes. First, I just mentioned some of the most important lessons that our friends
00:30:25.559 learned while implementing this, but there are a lot more and they don't fit in half an hour; you can get more practical advice on how to do this at
00:30:31.559 this URL. There's also that cop that I mentioned, which you can download from there. I also want to do some thanking: first, to
00:30:36.960 Tekin Süleyman, who wrote one of the best talks I've ever seen and whose style massively inspired this talk; if
00:30:42.779 you haven't seen it, you should definitely watch it. Also my friends Nick, Lisa, and Chris for watching early versions of this talk, and
00:30:49.380 every other talk, and making them much better. And last but not least,
00:30:54.600 to Neil Chandler, one of my teammates, who is actually the one that came up with most of those clever ideas that I
00:30:59.760 mentioned at the end on how to make the implementation tolerable. And that's all for me. Thank you, and I'll see you all in the hallway.