00:00:16.080
all right well you already heard my name you heard where I work I contribute to open source you can find me on social
00:00:22.960
media nice to meet you all if you have heard my name you've
00:00:28.039
probably heard it in relation to SQLite. I have been working a lot in
00:00:35.239
the last two years to make Ruby on Rails the best Ruby application framework in the world for building on top of the SQLite
00:00:42.120
database engine and I had the opportunity this September to talk about that and all of
00:00:49.000
the work that is in Rails 8 at Rails World. Those talks are now on YouTube
00:00:56.960
with subtitles in various languages. If you haven't checked it out, I would
00:01:03.160
recommend it I heard it's a really good talk but today I'm not going to talk
00:01:09.040
about SQLite. I'm going to talk about one of my older passions, which is how to make background
00:01:15.720
jobs reliable and resilient this is a sequel talk of sorts
00:01:23.079
to a talk I gave at RubyConf 2021 in Denver. That talk was focused more on a
00:01:30.119
sort of introduction to the problem space like what does it mean to make a
00:01:36.680
job reliable and resilient? What are some of the core problems? In this talk I want to
00:01:41.759
focus much more on the solution space what does it take to build and test
00:01:47.119
reliable and resilient jobs and to guide that exploration we
00:01:55.079
are going to be looking at two gems that I maintain which
00:02:00.680
encode the core principles that I believe are essential to
00:02:06.719
reliability. The first is a gem called Chaotic Job. This is a testing helper,
00:02:11.959
this is going to help you test that your jobs are indeed reliable and resilient and the second is a gem called
00:02:19.080
Acidic Job. This is a workflow execution engine. This is what you can use inside
00:02:25.840
of your jobs to help them become reliable and resilient and we're going to use these as sort of
00:02:32.760
guideposts to walk our way through the core principles um and problem areas and
00:02:39.519
how to solve them so let's jump into it and let's talk about testing
00:02:45.920
because we really can't begin this discussion if we don't start from an
00:02:53.080
understanding that it's kind of fundamentally difficult to know is my
00:02:58.519
job reliable is it resilient like one way you can find out is run a million of
00:03:03.720
them in production and see if it's problematic but that's not like maybe
00:03:09.360
the smartest or safest thing to do with your business especially for business critical
00:03:14.720
jobs and the first thing that we have to tackle that I ended up tackling is like
00:03:22.519
how do you even test a job for resilience when you need to retry it you need to have multiple executions of it
00:03:30.560
when I was reading through the Rails guides I was like, oh, I see how to do this, this is straightforward: I'll use
00:03:36.400
the perform_enqueued_jobs block. I'll put my job in there, a transient error occurs, a
00:03:44.360
retry will get scheduled it'll get run again it'll be fine and those two
00:03:52.159
executions end up done, and I can then do my assertions. Straightforward, yeah?
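Roughly, that naive test looks like this. perform_enqueued_jobs is the real Active Job test helper; the job, its argument, and the assertion are hypothetical stand-ins:

```ruby
class PaymentJobTest < ActiveSupport::TestCase
  include ActiveJob::TestHelper

  test "payment job retries through a transient error" do
    # Naive approach: wrap the enqueue in perform_enqueued_jobs and assume
    # the retry will be performed the way production would perform it.
    perform_enqueued_jobs do
      PaymentJob.perform_later(123)
    end

    assert_equal 1, Payment.where(order_id: 123).count
  end
end
```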
00:03:57.640
Unfortunately, that's not how it works at all. This is not a useful helper for testing
00:04:03.239
production behavior. It should really be called "perform instead of enqueuing
00:04:09.319
jobs". The way it behaves is it actually, effectively, overwrites the enqueue method
00:04:14.799
and just immediately performs that job when it was told to enqueue it. And what that means is instead of enqueuing it,
00:04:21.280
finishing the job, and then starting this new job, inside of the execution of this first job instance it just starts
00:04:28.240
running the second job instance, which is not how production is going to behave at all. And of course for good
00:04:35.240
tests, we want tests that mimic as much as possible actual production behavior,
00:04:41.479
right? So perform_enqueued_jobs: don't use that method to test reliability. What can
00:04:46.520
we use instead from the Active Job test helpers? Well, this is going to work much
00:04:53.000
better, right? So we're going to work in a loop and say: if we have enqueued jobs, flush
00:04:59.280
them, that means perform them. And so each time we run through the loop: the first time, we have one job, we perform it,
00:05:06.160
there's a transient error, a retry gets enqueued, enqueued_jobs has an element so we come
00:05:11.880
into the loop again and we flush it one more time, everything goes fine, enqueued_jobs is now empty, and we move on to our
00:05:18.400
assertions. This is going to work like production. This is good.
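A sketch of that loop, using the enqueued_jobs and flush_enqueued_jobs helpers from ActiveJob::TestHelper (flush_enqueued_jobs may be non-public API depending on the Rails version; the job and assertion are hypothetical):

```ruby
test "payment job retries through a transient error, production-style" do
  PaymentJob.perform_later(123)

  # Each pass performs whatever is currently enqueued; a retry enqueued by a
  # failing execution is picked up on the next pass through the loop.
  flush_enqueued_jobs until enqueued_jobs.empty?

  assert_equal 1, Payment.where(order_id: 123).count
end
```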
00:05:23.479
And I thought everything was fine, I didn't need to do anything more, until I started testing jobs that enqueued and scheduled
00:05:30.280
other jobs and I immediately found okay this is not quite what I need because
00:05:38.720
the way that flush_enqueued_jobs works, I learned, is that it's going to execute those jobs, perform those jobs, in the
00:05:45.000
order in which they were put into that enqueued_jobs array, which is to say the order in which they were enqueued, not the
00:05:50.919
order in which they are scheduled. Right, so you have a job: it's going to schedule one job in the future
00:05:56.600
and it also has a transient error that we've forced in our tests. It schedules the job, let's say it
00:06:02.440
schedules some other job 5 minutes from now, and then there's an error, so the retry gets put in. So we have two
00:06:07.759
elements in enqueued_jobs: the first one is scheduled to happen in the future, and our retry. This method is going to
00:06:14.440
perform the thing that's supposed to happen in the future and then, after that, our retry. Of course in production the
00:06:19.880
order would be flipped we would retry that job in 3 seconds not 5 minutes so it would occur first and then the next
00:06:26.599
one and if they happen to be talking to some of the same resources those race
00:06:31.880
well, they're not really races, but the order of operations is going to be different, it's not going to match production. So what I came to see pretty
00:06:38.599
quickly is that the Active Job test helpers are not really well built to
00:06:44.400
test reliability. They are there to help you test performing a job once. What do we
00:06:51.360
need I came to think about it and believe that there's really only three
00:06:56.599
helpers that we need, so I built them. The first thing we need is just a perform_all_jobs helper, right, that's
00:07:02.360
going to perform jobs immediately but in the order that they would be performed
00:07:08.199
in production so we're going to virtualize time and shrink it but we're going to guarantee that the order
00:07:13.599
matches production we're going to execute jobs in waves we're going to execute jobs based on when they're
00:07:20.360
scheduled but if we're dealing with scheduled jobs when we're virtualizing time we're going to need a little bit of
00:07:26.400
control. So in addition to perform_all_jobs you have perform_all_jobs_before and perform_all_jobs_after. This allows you
00:07:34.400
to ensure that you for example in that first case perform all of the jobs and
00:07:42.240
its retries and perform no scheduled jobs so you could do a block of
00:07:47.440
assertions to say okay this job ran and it had some errors and it had some retries but it eventually succeeded
00:07:53.520
what's the state of my system now okay those assertions are done now let me run the jobs in the future wait
00:08:01.039
till that's done maybe they have some transient errors that I've forced that resolves and now I have a second block
00:08:06.080
of assertions right so I can have manual control and make sure that I'm testing my system with the level of granularity
00:08:11.759
that I need, to be very confident that things are going to be working.
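A sketch of that manual control in a test, based on the helper names described here (the exact ChaoticJob signatures, and whether they take a duration or a Time, are assumptions; the job and assertions are hypothetical):

```ruby
test "the job, its retries, and its scheduled follow-up all behave" do
  OnboardingJob.perform_later(123)

  # Run the job and its quick retries, but not the work scheduled 5 minutes out.
  perform_all_jobs_before(1.minute)
  assert User.find(123).welcomed?

  # Now run everything that was scheduled further into the future.
  perform_all_jobs_after(1.minute)
  assert User.find(123).followed_up?
end
```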
00:08:16.879
And it's going to do just a little bit of magic to sort of soften those time boundaries so that you actually
00:08:23.639
perform jobs right like if I say this is scheduled in 3 seconds and I do that inside of the job and then say perform
00:08:29.960
all jobs before 3 seconds, those two Time.nows, right, they're going to have different seconds, they're going to have different milliseconds for sure. We want
00:08:36.640
to make sure that it actually gets in there, so it rounds down by like one
00:08:42.440
order of magnitude to make sure that it pulls those jobs a little tiny bit of magic just to make sure that you're not
00:08:49.200
losing any jobs in your tests. Those three helpers are what you need, these are like the
00:08:55.440
foundational blocks for testing jobs but once we have the ability to run our
00:09:03.160
tests with virtualized time but the correct production behavior, how do we actually force
00:09:09.399
failure scenarios right like this is really what we need to test resilience we need to have errors and we need to
00:09:15.200
have a very particular kind of error if you imagine a simple job like
00:09:20.839
this I happen to know because I wrote this code that the weakest point in this
00:09:26.760
job is right here so I want to make sure that it behaves correctly if an error
00:09:33.519
occurs between these two steps how do I force that to happen and how do I get very particular
00:09:39.880
behavior? Right, because what I need here is a transient error, not a permanent error. There's no point in testing
00:09:45.000
permanent errors we know what's going to happen your job will try try try try try end up in the dead set the error is
00:09:50.839
non-resolvable, so who cares. What we need to test are transient errors: something that went wrong once, and when you retry it
00:09:59.399
magically it's fine right rate limiting flaky networks uh there's all kinds of
00:10:05.000
transient errors so how can we force a failure scenario to
00:10:10.920
test well in this case we can monkey
00:10:16.560
patch this method to say: all right, I know in this test I want this method to fail, I need it to be a transient failure, so
00:10:22.760
I need it to fail once. So I'm going to set up a little bit of state in my class to say have I errored or not, change the
00:10:29.839
state, do the error, so the second time through it doesn't error and it behaves like a transient failure.
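A rough sketch of that manual approach, with a hypothetical PaymentJob and a hypothetical #charge_customer method standing in for the weak point:

```ruby
# Reopen the job class in the test to force the weak point to fail exactly once.
class PaymentJob
  class << self
    attr_accessor :already_errored
  end

  alias_method :original_charge_customer, :charge_customer

  # Fail on the first call, then behave normally, to simulate a transient error.
  def charge_customer(*args)
    unless self.class.already_errored
      self.class.already_errored = true
      raise Timeout::Error, "forced transient failure"
    end
    original_charge_customer(*args)
  end
end
```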
00:10:36.600
And this will work, but this sucks, this does not scale. You don't want to have to write all of this code for every scenario, for every job that you need to test. We need a way
00:10:44.000
to do conceptual compression to borrow a phrase so that's the other helper that
00:10:50.680
Chaotic Job provides. It allows you to define a scenario, and a scenario allows
00:10:56.079
you to inject a glitch into the execution of a job. And a glitch is just a tuple that says the position, before or
00:11:04.079
after and a particular line of code and a particular line of code is just a file and a
00:11:10.680
line. The actual engine will do, effectively (it's implemented
00:11:16.320
differently, but effectively) what we saw on that last slide. And it'll run it, it'll use those
00:11:22.040
perform_all_jobs helpers to run the jobs correctly, run all of the retries, get to a final completed state, and then you
00:11:28.320
have your assertions: what's the state of the system?
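A sketch of what that might look like in a test, based on the description above; the exact ChaoticJob scenario API and glitch format are assumptions here, and the job and assertion are hypothetical:

```ruby
test "job survives a crash right after the charge" do
  run_scenario(
    PaymentJob.new(123),
    # A glitch is a tuple: a position (before/after) and a file:line location.
    glitch: ["before", "#{Rails.root.join("app/jobs/payment_job.rb")}:12"]
  ) do
    # Assertions about the final state, after all retries have completed.
    assert_equal 1, Charge.count
  end
end
```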
00:11:36.079
This ability to have glitches, to inject glitches into code execution, is really the heart of Chaotic Job. And what I found after I found this and
00:11:42.839
built it was that this unlocked a really cool possibility this unlocked the potential
00:11:50.600
to run simulations so the final helper method that you get
00:11:57.920
is run_simulation. You just give it your job instance and then you define a block, and inside that block you can
00:12:03.600
define whatever assertions you want. And what that's going to do, if we go back to our made-up job here, is that it is going
00:12:10.760
to build up for itself a scenario for every possible
00:12:16.440
error location in your job what it does is it performs your job once tracks all
00:12:21.839
of the line executions using a TracePoint, and then says okay, an error could occur here or here or here or here.
00:12:29.880
I'm going to take your block of assertions I'm going to define those scenarios run them run them to
00:12:35.320
completion run your assertions and if you get that test to pass now you have
00:12:41.160
some pretty strong foundations to say, like, yeah, I think this job is resilient, I have tortured it in a high-level set of
00:12:48.360
ways, right? Like we're not going into every method that's called inside of, like, the backtrace of one method, but
00:12:54.399
we're injecting these glitches into every possible line execution point in
00:12:59.920
your code flow. And then you define your assertions, these are the state of the side effects, and if that passes, that's a really strong test.
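A sketch of the simulation helper in use (exact ChaoticJob API may differ; the job and assertions are hypothetical):

```ruby
test "job is resilient to an error injected at any line" do
  run_simulation(PaymentJob.new(123)) do
    # These assertions run once per generated scenario, i.e. once per possible
    # glitch location, after that scenario has run all retries to completion.
    assert_equal 1, Charge.count
    assert_equal 1, ActionMailer::Base.deliveries.size
  end
end
```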
00:13:06.519
So altogether, these five helpers, I think,
00:13:14.120
are an incredibly high-leverage set of tools to get you clarity on business-
00:13:23.000
critical jobs: are they reliable, are they resilient? And that's not even all that
00:13:29.519
Chaotic Job offers, that's like 80% of it, but this is the foundation upon which
00:13:36.839
you can start to scale out reliability but of course right how do you actually build reliable
00:13:42.959
jobs? So you can find out, yeah, my jobs aren't very reliable. Chaotic Job will help you with that, it'll show you
00:13:49.320
precisely where you have problems. How do you solve those problems? That's where we turn to Acidic
00:13:55.680
Job. A little bit of context: I think that it's definitionally true
00:14:02.360
that every single business that has any background jobs has at least one business critical
00:14:09.519
background job that must be reliable and resilient which is to say it must be
00:14:14.560
ACIDic: it needs to have atomic operations, the data state needs to be
00:14:20.279
consistent in where it ends up right the operations have to be isolated and of course those side effects have to be
00:14:26.480
durable and in working in this space for many years I have come to the belief
00:14:33.000
that the best mental model to help you get to that place are durable execution
00:14:39.759
workflows. So what in the world is a durable execution workflow? Durable execution is a bit of jargon that comes
00:14:46.199
from the microservices world primarily that has grown in popularity in the last
00:14:52.320
few years that describes an approach to
00:14:57.639
executing code that is specifically oriented around fault tolerance, right, to
00:15:02.759
say how can we ensure that this code executes inside of some eventually consistent environment in a way that is
00:15:11.199
correct. And in order to do that we need some durability. A workflow is a very
00:15:20.680
simple concept it's just a linear sequence of steps each step is some executable
00:15:26.560
function and you define that workflow in a way that can be serialized and passed to an independent
00:15:34.360
execution engine this is at the heart of uh many
00:15:39.920
of these tools. And all of this is driving towards getting to idempotency. This is a fancy
00:15:46.720
word that you are going to see anytime you are working in this problem space it is fancy Latin for safe to
00:15:54.319
retry and what it really means is that your side effects
00:16:00.079
only take effect once regardless of how many times you run that job and this is
00:16:05.639
what you have to have this is the essential characteristic of a resilient background job because when you're in an
00:16:12.040
eventually consistent environment where your code is going to attempt to self-heal by retrying in order for your system to be
00:16:19.639
correct you have to know that running that code again from the beginning because it's not magic it's just going
00:16:26.000
to start over from the very beginning, that you're going to not duplicate side effects. You don't want to send 40 emails
00:16:32.639
to your biggest Customer because your API rate limit was triggered
00:16:38.160
right all of these things together help us get a little bit of a clearer sense of like what do we even mean by a job
00:16:44.440
right like a job is just an operation that runs in an eventually consistent environment right an environment that's
00:16:50.240
going to try to self-heal through retries and is defined by the side effects that it produces not the
00:16:57.199
computation that it returns at the end the value that it returns at the end that has been
00:17:03.880
computed and this space of durable execution engines that you can pass your
00:17:09.919
definitions to is growing quite large many of these companies here in this
00:17:15.319
list are valued at more than $10 million and one of the things that I love about
00:17:20.480
the Ruby ecosystem and the Rails ecosystem is that we can compete with
00:17:26.799
them for free on the weekends, and that's what Acidic Job is:
00:17:32.080
this is my attempt to take down these $10 million behemoths with uh a gem that
00:17:37.280
has about a thousand lines of code all totaled in it so what is
00:17:44.360
it? It's just a module that you include into any regular job. That module is going to give you a few methods, let's
00:17:50.799
walk through them. Primarily it gives you this execute_workflow method. This is the
00:17:57.320
heart this is where you pass over control to the execution engine you're going to put this inside of your perform
00:18:02.360
method and you're going to allow the workflow engine to take control it's
00:18:08.600
straightforward, takes two arguments: this block, which receives this builder object
00:18:13.679
that you can only call the step method on, and you just define your linear sequence of steps. These symbols just map
00:18:19.039
to method names. Those methods just have to be available on this job instance by,
00:18:24.600
you know convention those are probably going to be private methods defined inside of that class but they can come through inheritance or composition or
00:18:31.360
whatever, they just need to be available. And then you have this unique_by keyword argument.
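Putting that interface together, a workflow job might look roughly like this; the module and method names are assumptions based on the description here, and the job and its step methods are hypothetical:

```ruby
class TransferJob < ApplicationJob
  include AcidicJob::Workflow

  def perform(transfer)
    @transfer = transfer

    execute_workflow(unique_by: transfer.id) do |workflow|
      # A linear sequence of steps; each symbol maps to a method on this instance.
      workflow.step :debit_sender
      workflow.step :credit_receiver
      workflow.step :send_receipt
    end
  end

  private

  def debit_sender    = @transfer.sender.debit!(@transfer.amount)
  def credit_receiver = @transfer.receiver.credit!(@transfer.amount)
  def send_receipt    = TransferMailer.receipt(@transfer).deliver_later
end
```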
00:18:38.320
This is a really important point, so I want to go into it a little bit. Like, uniqueness is fundamentally tied to
00:18:45.520
idempotency, you can't have one without the other. You have to think really deeply about what that means. To give an
00:18:51.880
example, if we imagine a system that's doing financial transfers, and Jill wants to give, I mean,
00:18:59.360
they want, and again this is a hot topic, Jill wants to give $10 to Jack, right,
00:19:04.840
so that background job is there and it has some sort of transient error so it
00:19:10.240
retries. It is essential that the
00:19:16.600
idempotency guards that our system places on that second execution are not placed on a
00:19:22.600
completely independent transfer of $10 from Jill to Jack. Right, we don't want these two things to get overridden. If our
00:19:29.280
sense of what makes a unique execution is not correctly applied, right, if we said, oh, what makes a unique thing is this
00:19:34.919
sender sends this amount of money to this person the first time Jill sends Jack $10 and we're like we've
00:19:40.640
successfully done that, she tries to do it to him again, we say oh, we're just going to skip that, it's just an unsafe
00:19:45.720
free retry, oops. You know, that's a failure. You have to really understand what defines a unique execution of a thing
00:19:53.120
such that you can differentiate new executions from retries of old executions
00:19:59.559
and that's why it is a required keyword argument. Of course you can just default to the job
00:20:05.960
ID, but if you have the time you really should define
00:20:14.000
an actual uniqueness set right like whatever that might be these arguments or the whole
00:20:20.360
set of arguments or you passing other things it's going to be foundational to really thinking through your system and
00:20:26.080
its resiliency. The step method takes one optional keyword: you can say run this in
00:20:32.159
a transaction, yes or no. By default it's a no.
00:20:38.039
That's the interface: if you imagine, like, a step like this that does two database
00:20:44.000
operations, right, you just want to put that in one transaction. That's going to make it idempotent for free, and
00:20:50.039
cheap retries, who doesn't want that? The other thing it gives you is
00:20:55.919
this context bag. You're going to need to stash values, you know, you do some computation in one step, you're going to
00:21:01.559
need that result in some other step; stash them durably so that you can fetch them later regardless of retries.
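A sketch of both ideas together, a transactional step plus the durable context, inside a workflow job like the one sketched earlier; the exact keyword and context accessor (transactional:, ctx) are assumptions, and the domain objects are hypothetical:

```ruby
execute_workflow(unique_by: @order.id) do |workflow|
  workflow.step :create_invoice, transactional: true  # two DB writes, one transaction
  workflow.step :charge_card
end

def create_invoice
  invoice = Invoice.create!(order: @order)
  @order.update!(invoice_id: invoice.id)
  ctx[:invoice_id] = invoice.id             # stash durably for later steps
end

def charge_card
  invoice = Invoice.find(ctx[:invoice_id])  # available regardless of retries
  PaymentsApi.charge(invoice.total)
end
```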
00:21:08.280
Then it gives you a few directives to control behavior: you can tell the execution
00:21:14.279
engine to repeat a step, you can tell it to halt a step, and you can ask a question: is this step currently retrying? We're
00:21:20.480
going to see in just a bit how that can be useful but that's it like this is the
00:21:25.559
entire public interface of Acidic Job. It fits on one slide with properly sized font, I'm very proud of
00:21:31.919
that but seven minutes left let's hit the good stuff how do you use these
00:21:38.200
tools to actually build resilient jobs what are the golden rules you need to follow there are four of them let's walk
00:21:45.400
through them with some examples. Firstly, focus on side effects. If you have work that is like
00:21:53.679
preparatory work, and this is very common I find in jobs that I write, you don't need to put that in a step. It's not doing
00:21:59.799
anything this has no side effects right you're inside of a perform method just
00:22:05.720
do it before you start the execution engine it's worth reminding ourselves that even though the interface that
00:22:12.200
Active Job provides us for performing jobs is class methods, the performance
00:22:17.559
actually happens at the level of an instance. You have an actual job instance, you can just use instance variables. That
00:22:23.039
state is available to any of these methods that are going to be called by that engine. So do your preparatory work,
00:22:30.320
do any just, like, read-only work up front, before you start the execution engine.
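A sketch of that first rule, with hypothetical names: the read-only setup happens before handing control to the engine and is shared via instance variables.

```ruby
def perform(order_id)
  # Preparatory, side-effect-free work: just do it up front, outside any step.
  @order    = Order.find(order_id)
  @customer = @order.customer

  execute_workflow(unique_by: order_id) do |workflow|
    workflow.step :charge_card    # steps are reserved for the side effects
    workflow.step :send_receipt
  end
end
```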
00:22:35.760
Your steps are going to be defined by side effects. Speaking of side effects, it's
00:22:41.120
really important that we isolate the different kinds of IO in our steps you
00:22:46.640
do not want to have a step that does two different kinds of IO, if you can do
00:22:52.960
anything to avoid it it's going to be very difficult to make that job resilient if you carve your workflow up
00:23:00.159
and each step is only talking to one particular IO
00:23:05.919
backend, it's a lot easier to make each step idempotent. And if every step is idempotent then, compositionally, the entire
00:23:12.640
job is idempotent. But the fact of the matter is
00:23:19.720
that sometimes you can't avoid doing two different kinds of IO in one step a
00:23:25.000
really common example is you make an API request you create something and you need that response because you're going
00:23:31.679
to work with it in a later step right I'm talking to an API over HTTP and I
00:23:37.159
have to store this in my database, in this context bag. There's no two ways around that. How do you make this
00:23:45.120
idempotent? Well, the first thing is you want to make each operation idempotent. The context bag is already idempotent, it's
00:23:53.320
built to be idempotent, it's functionally a PUT, you can call it five times, it doesn't matter, everything will
00:23:58.600
stay the same. How do we make API POSTs idempotent? This is probably the most common
00:24:04.679
example. Well, you should always check: does the API that you're talking to support idempotency keys? That is a
00:24:12.080
great innovation, Stripe really led the way on that, you should be using them, always go and check. Unfortunately most
00:24:19.400
APIs don't. That sucks, and we just have to deal with this fact. Right, if you want
00:24:26.000
to make an API POST idempotent, the only way to guarantee it is to check before
00:24:32.679
you write. Right, if you build this and you run a simulation, you are going to find
00:24:38.880
there are multiple error locations where you will get two resources created in that API, there's no way around it. And I
00:24:46.360
know it's a little bit frustrating to be like well now I have to do two API requests every single time just to
00:24:51.880
ensure that this is safe like I just hate that overhead especially because definitionally I know on the first pass
00:24:57.840
it's fine. That's where the step-retrying check comes in. If you really wanted to make this as
00:25:05.960
optimally performing as possible, you'd say: only try to check if I've already
00:25:11.000
created it if I'm retrying. If it's the first pass, just go straight to the POST, I'm
00:25:16.320
pretty confident that I haven't done this already if I'm retrying let me see where in this step did it fail did I
00:25:22.840
already create it or did I not. And this is as good as it's going to get:
00:25:28.919
sometimes you're going to have to do two API requests to make sure you don't create two resources.
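A sketch of that optimization; the retrying check's exact name and the API client here are assumptions, not a confirmed AcidicJob API:

```ruby
def create_remote_customer
  # Only pay for the extra lookup when we know this step has already been
  # attempted; on the first pass, go straight to the POST.
  if step_retrying?
    existing = PaymentsApi.find_customer(email: @user.email)
    if existing
      ctx[:remote_customer_id] = existing.id
      return
    end
  end

  created = PaymentsApi.create_customer(email: @user.email)
  ctx[:remote_customer_id] = created.id
end
```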
00:25:34.320
But by ensuring that your step methods are idempotent, like, the
00:25:39.399
composition just follows, your job will be idempotent, if you're using a workflow engine and it's performing
00:25:46.399
correctly which it does like that is a natural guarantee but if you are using a
00:25:51.640
workflow engine one thing that is true and it's worth flagging is that you can't mutate the workflow definition you
00:25:57.799
don't want to see this in a pull request because if you deploy this and
00:26:03.720
there are any jobs in the queue they're mismatched now the engine is going to
00:26:10.600
just throw up its hands and create a permanent error and say Hey you mismatched the definition that I just got from this job and the definition
00:26:16.919
that was put in the database I don't know how to resolve it so you screwed up and this will go to the dead set see you in a
00:26:22.840
bit if you want to do this safely what you need to do and this is very similar
00:26:28.039
to Strong Migrations, right, like we are embracing resiliency, we're embracing safety, and that means that oftentimes
00:26:33.440
we have to do a little bit of extra work you're going to have to clone that job into a new class you're going to have to
00:26:38.679
give it a new name, you're going to have to tell your application to only enqueue the new job with the new definition, you're
00:26:45.520
going to have to deploy that change you're going to have to wait for that deployment to get there then you're going to have to look at your job queue
00:26:52.200
and you're going to have to wait until the old job is completely drained and then you should wait uh double that
00:26:57.240
amount of time to make sure you actually caught every single place where you were enqueuing it and you don't get new ones, and
00:27:02.799
only after you're 100% certain that there are no new instances of that old job in your queue then you can delete
00:27:10.240
it. Now, if you want to rename the new job back to the old job, you're going to have to do this 1.5 times more. That kind of sucks.
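Concretely, the clone-and-drain move might look roughly like this (class names hypothetical):

```ruby
# Before: the workflow whose step definition needs to change.
# Leave it untouched while the already-enqueued jobs drain.
class SyncCustomerJob < ApplicationJob
  include AcidicJob::Workflow
  # ... old step definition ...
end

# After: a copy with a new name and the new definition; the application now
# only enqueues this one. Delete SyncCustomerJob once its queue is fully empty.
class SyncCustomerV2Job < ApplicationJob
  include AcidicJob::Workflow
  # ... new step definition ...
end
```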
00:27:18.520
It is worth saying though, as I said, the engine will throw up a permanent failure. So if you're okay with
00:27:26.240
manually resolving all the jobs that get sent to the dead set, feel free to just send that up, but
00:27:33.200
just know that that's what you are saying it's like however many jobs there are that are currently running for the
00:27:38.880
old definition when I make this deployment you are going to have to manually figure out how to get them to
00:27:46.039
some completed state in the console but those are the golden rules
00:27:52.480
for making your job resilient there's so much more I wish I
00:27:58.440
could say I would love to have had like an hour maybe two possibly five I'm just going to flash some stuff and if you're
00:28:05.080
interested in these things like come and talk to me afterwards or we can hack on this stuff in the hack day there are some higher order patterns one of the
00:28:11.399
things I get a lot is, like, eh, workflows, they're only a linear sequence of jobs, like what I really want is like a fancy
00:28:18.000
DAG, and I want like all this fancy fancy fancy. You don't want fancy, you want
00:28:23.600
reliable. But it's worth remembering that you can build fancy on top of
00:28:28.760
reliable. We've already looked at how to do external IO, right, like we want to check the step retrying, always do a GET
00:28:35.120
before a POST. We've already looked at how to do internal IO, right, make sure those are in
00:28:40.840
transactions you can do fancier stuff though you can have iteration for those of you who've ever used the Shopify job
00:28:46.279
iteration gem, we have a lower-level primitive, you can do it yourself. I wish I could
00:28:53.679
spend more time on it. That's resilient, you have to do a lot of stuff to make it resilient, but there you go. You want to
00:28:59.600
do delays? You want to do some work, then you want to wait 3 weeks, then you want to pick back up and do
00:29:04.720
more work, and you want to do that resiliently? You can do that, there you go, there's a pattern, I'll leave it up there for
00:29:12.200
five seconds you want to do job batching I'm not trying to like take money
00:29:17.679
directly out of Mike's pocket, but you could do Sidekiq Pro-style batching if
00:29:23.200
you control everything yourself it's more code than fits on one slide so you have to come and ask me about it later
00:29:30.399
in general though, these are the principles that you're going to need, and these are the ways in which I think
00:29:36.080
Acidic Job provides you a minimal set of very sharp knives to allow you to define
00:29:41.240
your jobs in such a way that they will run reliably and resiliently over time
00:29:47.760
Chaotic Job of course paired with it, always have good tests, and that's fundamentally all you