Summarized using AI

ACIDic Jobs: Scaling a resilient jobs layer

Stephen Margheim • November 13, 2024 • Chicago, IL • Talk

In the RubyConf 2024 talk titled "ACIDic Jobs: Scaling a resilient jobs layer", Stephen Margheim delves into the importance of building resilient and maintainable background jobs in Rails applications. He emphasizes two key principles: testing for reliability and constructing jobs that are ACID-compliant. The presentation is anchored around two main gems he maintains: Chaotic Job, which assists in testing job resilience, and Acidic Job, which serves as a workflow execution engine to enhance job reliability.

Key Points Discussed:

  • Understanding Resilience in Jobs:
    • Resilience is challenging to measure, and blindly running jobs in production to test them is unwise, especially for critical operations.
  • Testing Challenges:
    • Rails' built-in test helpers do not mimic production behavior, making it difficult to ascertain job reliability. Margheim explains the limitations of the perform_enqueued_jobs test helper and introduces alternative strategies for testing jobs under production-like conditions.
  • Chaotic Job:
    • A gem that provides testing helpers to simulate errors and glitches in job execution. It allows developers to define scenarios to inject transient errors, which is crucial for testing resilience by ensuring that jobs can recover from errors.
  • Acidic Job:
    • This gem helps structure background jobs to ensure they follow the ACID principles: atomicity, consistency, isolation, and durability. The module includes methods to manage workflow executions and ensures jobs can be retried safely without unintended side effects.
  • Best Practices for Building Resilient Jobs:
    • Focus on isolating side effects in jobs to promote reusability and reliability.
    • Emphasize idempotency to ensure that retried jobs do not duplicate their side effects.
    • Avoid mutating the workflow definitions of deployed jobs. Changing a definition requires careful change management (for example, cloning the job under a new name and draining the old queue) to avoid mismatches between in-flight jobs and the new definition.
    • Use structured steps within jobs where each step is linked to specific side effects and can be retried without impacting others.

Conclusion:

  • The talk highlights the necessity for developers to adopt tools and patterns that not only enhance testing capabilities but also structure jobs in a way that achieves greater reliability and resilience. By applying the principles behind Chaotic Job and Acidic Job, developers can significantly improve their background job infrastructure, ensuring that it scales effectively while maintaining system integrity.
  • Margheim encourages developers to think critically about how they define and manage jobs in the Rails ecosystem, ultimately aiming to improve their application’s performance and robustness.

ACIDic Jobs: Scaling a resilient jobs layer
Stephen Margheim • November 13, 2024 • Chicago, IL • Talk

Background jobs have become an essential component of any Rails app. As your /jobs directory grows, however, how do you both ensure that your jobs are resilient and that complex operations are maintainable and cohesive? In this talk, we will explore how we can build maintainable, resilient, cohesive jobs, all by making our jobs ACIDic.

RubyConf 2024

00:00:16.080 All right, well, you already heard my name, you heard where I work, I contribute to open source, you can find me on social media. Nice to meet you all. If you have heard my name, you've probably heard it in relation to SQLite. I have been working a lot over the last two years to make Ruby on Rails the best Ruby application framework in the world for building on top of the SQLite database engine, and I had the opportunity this September to talk about that, and about all of the work that is in Rails 8, at Rails World. Those talks are now on YouTube, with subtitles in various languages; if you haven't checked it out, I would recommend it. I heard it's a really good talk. But today I'm not going to talk about SQLite. I'm going to talk about one of my older passions, which is how to make background jobs reliable and resilient.

00:01:15.720 This is a sequel talk of sorts to a talk I gave at RubyConf 2021 in Denver. That talk was focused more on an introduction to the problem space: what does it mean to make a job reliable and resilient, and what are some of the core problems? This talk I want to focus much more on the solution space: what does it take to build and test reliable and resilient jobs? And to guide that exploration, we are going to be looking at two gems that I maintain, which encode the core principles that I believe are essential to reliability. The first is a gem called Chaotic Job. This is a testing helper; it is going to help you test that your jobs are indeed reliable and resilient. The second is a gem called Acidic Job. This is a workflow execution engine; it is what you can use inside of your jobs to help them become reliable and resilient. We're going to use these as guideposts to walk our way through the core principles and problem areas, and how to solve them.

00:02:39.519 So let's jump into it, and let's talk about testing, because we really can't begin this discussion if we don't start from an understanding that it's fundamentally difficult to know: is my job reliable? Is it resilient? One way you can find out is to run a million of them in production and see if it's problematic, but that's maybe not the smartest or safest thing to do with your business, especially for business-critical jobs. And the first thing that I ended up tackling is: how do you even test a job for resilience, when you need to retry it, when you need to have multiple executions of it?

00:03:30.560 When I was reading through the Rails guides, I thought: oh, I see how to do this, this is straightforward. I'll use the perform_enqueued_jobs block. I'll put my job in there, a transient error occurs, a retry gets scheduled, it gets run again, it's fine, those two executions finish, and I can then do my assertions. Straightforward, yeah?

00:03:57.640 Unfortunately, that's not how it works at all. This is not a useful helper for testing production behavior; it should really be called "perform instead of enqueuing jobs". The way it behaves is that it effectively overwrites the enqueue method and immediately performs a job the moment it is told to enqueue it. What that means is that instead of enqueuing the retry, finishing the first job, and then starting the new job, it starts running the second job instance inside of the execution of the first job instance, which is not how production is going to behave at all. And of course, for good tests, we want tests that mimic actual production behavior as closely as possible. So, perform_enqueued_jobs: don't use that method to test reliability.

00:04:46.520 What can we use instead from the Active Job test helpers? Well, this is going to work much better: we work in a loop and say, if we have enqueued jobs, flush them, which means perform them. Each time we run through the loop: the first time, we have one job, we perform it, there's a transient error, a retry gets enqueued, enqueued jobs has an element, so we come into the loop again and we flush one more time, everything goes fine, enqueued jobs is now empty, and we move on to our assertions. This is going to work like production. This is good.
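Sketched as a test, that loop looks something like this. The job class and the final assertion are illustrative (not from the slides); enqueued_jobs and flush_enqueued_jobs come from ActiveJob::TestHelper, though flush_enqueued_jobs is not prominently documented, so treat this as a sketch:

```ruby
require "test_helper"

class OrderConfirmationJobTest < ActiveSupport::TestCase
  include ActiveJob::TestHelper

  test "job eventually succeeds, even when a retry is scheduled" do
    OrderConfirmationJob.perform_later(order_id: 1)

    # Drain the queue the way production would: perform whatever is enqueued,
    # let a failure enqueue its retry, and loop until nothing is left.
    flush_enqueued_jobs until enqueued_jobs.empty?

    # ...assertions about the resulting state of the system go here
    assert_empty enqueued_jobs
  end
end
```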
00:05:23.479 And I thought everything was fine, that I didn't need to do anything more, until I started testing jobs that enqueued and scheduled other jobs. I immediately found: okay, this is not quite what I need. Because the way that flushing enqueued jobs works, I learned, is that it performs those jobs in the order in which they were put into the enqueued-jobs array — which is to say the order in which they were enqueued, not the order in which they are scheduled. So: you have a job, it's going to schedule one job in the future, and it also has an error that we've forced in our test. It schedules the other job, say, five minutes from now, and then there's an error, so the retry gets put in. Now we have two elements in enqueued jobs: the first one is scheduled to happen in the future, and then our retry. This method is going to perform the thing that's supposed to happen in the future, and then, after that, our retry. Of course, in production the order would be flipped: we would retry that job in three seconds, not five minutes, so it would occur first and then the other one. And if they happen to be talking to some of the same resources — they're not really races, but the order of operations is going to be different; it's not going to match production. So what I came to see pretty quickly is that the Active Job test helpers are not really built to test reliability. They are there to help you test performing a job once.

00:06:51.360 What do we need? I came to think about it and believe that there are really only three helpers we need, so I built them. The first thing we need is a perform-all-jobs helper that performs jobs immediately, but in the order in which they would be performed in production. So we virtualize time and shrink it, but we guarantee that the order matches production: we execute jobs in waves, based on when they're scheduled. But if we're dealing with scheduled jobs while we're virtualizing time, we're going to need a little bit of control. So in addition to perform-all-jobs, you have perform-all-jobs-before and perform-all-jobs-after. This allows you to ensure, for example in that first case, that you perform all of the jobs and their retries and perform no scheduled jobs. You can then do a block of assertions to say: okay, this job ran, it had some errors and some retries, but it eventually succeeded — what's the state of my system now? Those assertions done, now let me run the jobs scheduled in the future, wait till that's done — maybe they have some transient errors that I've forced, and those resolve — and now I have a second block of assertions. So I have manual control, and I can test my system with the level of granularity I need to be very confident that things are going to work.

00:08:16.879 And it does just a little bit of magic to soften those time boundaries so that you actually perform the jobs. If, inside the job, I say something is scheduled in three seconds, and then I say "perform all jobs before three seconds from now", those two "now"s are going to have different seconds — different milliseconds, for sure — and we want to make sure the job actually gets picked up. So it rounds down by about one order of magnitude to make sure it pulls those jobs in: a tiny bit of magic just to make sure you're not losing any jobs in your tests. Those three helpers are what you need; they are the foundational building blocks for testing jobs.
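As a sketch, with an illustrative job and the helper names as given in the talk (perform_all_jobs drains everything in production order; the before/after variants give the finer control described here — check the chaotic_job README for the exact module to include and the exact signatures):

```ruby
require "test_helper"

class SignupJobTest < ActiveSupport::TestCase
  include ChaoticJob::Helpers  # assumed include; see the gem's README

  test "immediate work and scheduled follow-ups can be asserted separately" do
    SignupJob.perform_later(user_id: 1)

    # Perform the job and any retries now, in production order, but leave
    # anything scheduled further out (e.g. a follow-up email) untouched.
    perform_all_jobs_before(1.minute)

    # ...assertions about the immediate side effects go here

    # Now perform the jobs that were scheduled for later.
    perform_all_jobs_after(1.minute)

    # ...assertions about the delayed side effects go here
  end
end
```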
00:08:55.440 But once we have the ability to run our tests with virtualized time and correct production behavior, how do we actually force failure scenarios? This is really what we need to test resilience: we need to have errors, and we need a very particular kind of error. If you imagine a simple job like this, I happen to know — because I wrote this code — that the weakest point in this job is right here. So I want to make sure that it behaves correctly if an error occurs between these two steps. How do I force that to happen, and how do I get very particular behavior? Because what I need here is a transient error, not a permanent error. There's no point in testing permanent errors; we know what's going to happen: your job will try, try, try, try, try, and end up in the dead set. The error is non-resolvable, so who cares. What we need to test are transient errors: something that went wrong once, and when you retry it, magically it's fine. Rate limiting, flaky networks — there are all kinds of transient errors.

00:10:10.920 So how can we force a failure scenario to test? Well, in this case we can monkey-patch this method to say: all right, I know that in this test I want this method to fail, and I need it to be a transient failure, so I need it to fail exactly once. I'm going to set up a little bit of state in my class to track whether I have errored or not, flip that state, raise the error, and the second time through it doesn't error — it behaves like a transient failure.
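A minimal sketch of that hand-rolled approach, with purely illustrative class names (this is not the code from the slide), assuming the job is configured to retry on the raised error, e.g. via retry_on:

```ruby
class ChargeCustomerJobTest < ActiveSupport::TestCase
  include ActiveJob::TestHelper

  test "recovers when the charge call fails exactly once" do
    # Patch the weak point so it raises once and then behaves normally —
    # i.e. a transient error, the kind a retry is supposed to heal.
    PaymentGateway.singleton_class.prepend(Module.new do
      def charge(...)
        unless @simulated_failure_raised
          @simulated_failure_raised = true
          raise "simulated transient failure" # ChargeCustomerJob must retry_on this
        end
        super
      end
    end)

    ChargeCustomerJob.perform_later(order_id: 1)
    flush_enqueued_jobs until enqueued_jobs.empty?

    # ...assert that exactly one charge and one receipt exist
  end
end
```

(And note the patch leaks into other tests unless you undo it in a teardown.)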
00:10:36.600 This will work, but it sucks. It does not scale: you don't want to have to write all of that code for every scenario, for every job that you need to test. We need a way to do some conceptual compression, to borrow a phrase. So that's the other helper that Chaotic Job provides: it allows you to define a scenario, and a scenario allows you to inject a glitch into the execution of a job. A glitch is just a tuple: a position — before or after — and a particular line of code; and a particular line of code is just a file and a line number. The actual engine is implemented differently, but it effectively does what we saw on that last slide, and then it runs the job: it uses those perform-all-jobs helpers to run the jobs correctly, run all of the retries, and get to a final completed state, and then you have your assertions: what's the state of the system?
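With Chaotic Job, the hand-rolled version above collapses into a scenario. A sketch, using the (position, file:line) shape of a glitch as described in the talk — the exact glitch constructor and helper signature may differ in current versions of the gem, and the job class and path are illustrative:

```ruby
class ChargeCustomerJobTest < ActiveSupport::TestCase
  include ChaoticJob::Helpers  # assumed include; see the gem's README

  test "survives a crash between charging the card and sending the receipt" do
    # Inject a transient error right before a specific line of the job.
    run_scenario(
      ChargeCustomerJob.new(order_id: 1),
      glitch: ["before", "#{Rails.root.join("app/jobs/charge_customer_job.rb")}:12"]
    )

    # The scenario has performed the job and its retries and reached a final
    # state, so we can assert on the side effects.
    # ...assert that exactly one charge and one receipt exist
  end
end
```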
00:11:36.079 This ability to inject glitches into code execution is really the heart of Chaotic Job, and what I found after I built it was that it unlocked a really cool possibility: the potential to run simulations.

00:11:50.600 So the final helper method you get is run-simulation. You just give it your job instance, and then you define a block, and inside that block you can define whatever assertions you want. What that's going to do, if we go back to our made-up job here, is build up for itself a scenario for every possible error location in your job. It performs your job once, tracks all of the line executions using a TracePoint, and then says: okay, an error could occur here, or here, or here, or here. It takes your block of assertions, defines those scenarios, runs them to completion, and runs your assertions. And if you get that test to pass, you now have some pretty strong foundations to say: yeah, I think this job is resilient; I have tortured it in a high-level set of ways. We're not going into every method called inside the backtrace of one method, but we are injecting these glitches into every possible line-execution point in your code flow, and then you define your assertions about the state of the side effects. If that passes, that's a really strong test.
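And a sketch of the simulation helper, again with illustrative models and the helper name as given in the talk:

```ruby
class ChargeCustomerJobTest < ActiveSupport::TestCase
  include ChaoticJob::Helpers  # assumed include; see the gem's README

  test "is resilient to a transient error at any point in its execution" do
    # For every line the job executes, a scenario is generated that injects
    # a transient error there, the job is run to completion (retries and
    # all), and then this block of assertions is run against the result.
    run_simulation(ChargeCustomerJob.new(order_id: 1)) do
      assert_equal 1, Charge.count
      assert_equal 1, Receipt.count
    end
  end
end
```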
00:13:06.519 So, altogether, these five helpers are, I think, an incredibly high-leverage set of tools to get you clarity on your business-critical jobs: are they reliable, are they resilient? And that's not even all that Chaotic Job offers — that's maybe 80% of it — but this is the foundation upon which you can start to scale out reliability.

00:13:42.959 But of course, how do you actually build reliable jobs? You can find out, "yeah, my jobs aren't very reliable" — Chaotic Job will help you with that; it will show you precisely where you have problems. But how do you solve those problems? That's where we turn to Acidic Job.

00:13:55.680 A little bit of context. I think it's definitionally true that every single business that has any background jobs has at least one business-critical background job that must be reliable and resilient — which is to say, it must be ACIDic. It needs to have atomic operations, the data needs to end up in a consistent state, the operations have to be isolated, and of course those side effects have to be durable. And in working in this space for many years, I have come to the belief that the best mental model to get you to that place is durable execution workflows.

00:14:39.759 So what in the world is a durable execution workflow? Durable execution is a bit of jargon that comes primarily from the microservices world, and has grown in popularity in the last few years. It describes an approach to executing code that is specifically oriented around fault tolerance: how can we ensure that this code executes inside of some eventually consistent environment in a way that is correct? And in order to do that, we need some durability. A workflow is a very simple concept: it's just a linear sequence of steps, where each step is some executable function, and you define that workflow in a way that can be serialized and passed to an independent execution engine. This is at the heart of many of these tools.
00:15:39.920 And all of this is driving toward idempotency. This is a fancy word that you are going to see any time you work in this problem space; it is fancy Latin for "safe to retry". What it really means is that your side effects only take effect once, regardless of how many times you run that job. And this is what you have to have — it is the essential characteristic of a resilient background job — because when you're in an eventually consistent environment, where your code is going to attempt to self-heal by retrying, then in order for your system to be correct you have to know that running that code again from the beginning (because it's not magic, it just starts over from the very beginning) is not going to duplicate side effects. You don't want to send 40 emails to your biggest customer because your API rate limit was triggered.

00:16:38.160 All of these things together give us a clearer sense of what we even mean by a job: a job is just an operation that runs in an eventually consistent environment — an environment that's going to try to self-heal through retries — and it is defined by the side effects that it produces, not by the value it computes and returns at the end.

00:17:03.880 And this space of durable execution engines that you can pass your definitions to is growing quite large. Many of the companies in this list are valued at more than $10 million, and one of the things that I love about the Ruby ecosystem and the Rails ecosystem is that we can compete with them for free, on the weekends. And that's what Acidic Job is: my attempt to take down these $10 million behemoths with a gem that has about a thousand lines of code, all totaled.
00:17:44.360 So what is it? It's just a module that you include into any regular job, and that module gives you a few methods. Let's walk through them. Primarily, it gives you this execute_workflow method. This is the heart; this is where you pass control over to the execution engine. You put it inside of your perform method and you allow the workflow engine to take control. It's straightforward: it takes two arguments. There's the block, which receives a builder object that you can only call the step method on, and you just define your linear sequence of steps. Those symbols simply map to method names, and those methods just have to be available on the job instance — by convention they're probably going to be private methods defined inside of that class, but they can come through inheritance or composition or whatever; they just need to be available. And then you have this unique_by keyword argument, which is a really important point, so I want to go into it a little bit.

00:18:38.320 Uniqueness is fundamentally tied to idempotency; you can't have one without the other, and you have to think really deeply about what that means. To give an example, imagine a system that does financial transfers, and Jill wants to give — I mean, they want to; again, a hot topic — Jill wants to give $10 to Jack. So that background job runs, it hits some sort of transient error, and it retries. It is essential that the idempotency guards our system places on that second execution are not placed on a completely independent transfer of $10 from Jill to Jack; we don't want those two transfers to get collapsed into one. If our sense of what makes a unique execution is not correctly applied — if we said, oh, what makes this unique is this sender sending this amount of money to this person — then the first time Jill sends Jack $10 we say we've successfully done that, and when she tries to send it again we say, oh, we're just going to skip that, this is an unsafe retry. Oops. That's a failure. You have to really understand what defines a unique execution of a thing, such that you can differentiate new executions from retries of old executions. That's why it is a required keyword argument. Of course, you can just default to the job ID, but if you have the time you really should define an actual uniqueness set — whatever that might be: these arguments, the whole set of arguments, or other things you pass in. It's going to be foundational to really thinking through your system and its resiliency.
00:20:26.080 The step method takes one optional keyword: you can say whether to run the step in a transaction, yes or no; by default it's a no. If you imagine a step like this, one that does two database operations, you just want to put that in one transaction — that gets you idempotency for that step essentially for free and makes retries cheap.

00:20:55.919 The other thing it gives you is this context bag. You're going to need to stash values: you do some computation in one step and you need that result in some other step, so stash it durably, so that you can fetch it later regardless of retries.

00:21:08.280 Then it gives you a few directives to control behavior: you can tell the execution engine to repeat a step, you can tell it to halt a step, and you can ask a question — is this step currently retrying? We're going to see in just a bit how that can be useful. But that's it: this is the entire public interface of Acidic Job. It fits on one slide, with properly sized font. I'm very proud of that.
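Putting that interface together, a workflow job might look roughly like this. This is a sketch assembled from the interface described in the talk (execute_workflow, step, unique_by, transactional, the durable context store); the module name, the ctx accessor, and all of the domain classes here are assumptions, so check the acidic_job README for the exact spelling:

```ruby
class RideCheckoutJob < ApplicationJob
  include AcidicJob::Workflow  # assumed module name

  def perform(ride)
    @ride = ride

    # Hand control to the workflow engine: a linear sequence of steps,
    # uniquely identified so that retries are recognized as retries and
    # not as brand-new executions.
    execute_workflow(unique_by: [ride.id, ride.user_id]) do |workflow|
      workflow.step :charge_card
      workflow.step :record_charge, transactional: true
      workflow.step :send_receipt
    end
  end

  private

  def charge_card
    # Stash results durably in the workflow context so later steps (and
    # later retries) can read them.
    ctx[:charge_id] = PaymentGateway.charge(@ride.user, @ride.fare).id
  end

  def record_charge
    # transactional: true wraps this step's two database writes in one
    # transaction, so the step is atomic and safe to retry.
    Charge.create!(ride: @ride, external_id: ctx[:charge_id])
    @ride.update!(state: :paid)
  end

  def send_receipt
    ReceiptMailer.with(ride: @ride).paid.deliver_later
  end
end
```

(Making the charge_card call itself idempotent is exactly what the golden rules below address.)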
00:21:31.919 But, seven minutes left — let's hit the good stuff. How do you use these tools to actually build resilient jobs? What are the golden rules you need to follow? There are four of them; let's walk through them with some examples.

00:21:45.400 Firstly: focus on side effects. If you have work that is preparatory work — and this is very common, I find, in the jobs that I write — you don't need to put it in a step. It isn't doing anything; it has no side effects. You're inside of a perform method, so just do it before you start the execution engine. It's worth reminding ourselves that even though the interface Active Job gives us for performing jobs consists of class methods, the performing actually happens at the level of an instance. You have an actual job instance, so you can just use instance variables, and that state is available to any of the methods that will be called by the engine. So do your preparatory, read-only work before you start the execution engine; your steps are going to be defined by side effects.
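A small sketch of that first rule, with illustrative classes — the read-only lookups happen in perform, before the engine takes over, and only the side effect gets a step:

```ruby
class InvoiceDeliveryJob < ApplicationJob
  include AcidicJob::Workflow  # assumed module name

  def perform(invoice_id)
    # Read-only preparation: no side effects, so no step needed. Instance
    # variables are visible to every step method, because the steps run on
    # this same job instance.
    @invoice   = Invoice.find(invoice_id)
    @recipient = @invoice.customer.billing_email

    execute_workflow(unique_by: invoice_id) do |workflow|
      workflow.step :deliver_invoice # only the side effect gets a step
    end
  end

  private

  def deliver_invoice
    InvoiceMailer.with(invoice: @invoice, to: @recipient).issued.deliver_later
  end
end
```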
00:22:35.760 Speaking of side effects, it's really important that we isolate the different kinds of I/O in our steps. You do not want a step that does two different kinds of I/O if you can do anything to avoid it; it's going to be very difficult to make that job resilient. If you carve your workflow up so that each step is only talking to one particular I/O backend, it's a lot easier to make each step idempotent — and if every step is idempotent, then compositionally the entire job is idempotent.
00:23:12.640 But the fact of the matter is that sometimes you can't avoid doing two different kinds of I/O in one step. A really common example: you make an API request, you create something, and you need that response because you're going to work with it in a later step. I'm talking to an API over HTTP, and I have to store the result in my database, in the context bag; there's no two ways around that. How do you make this idempotent? Well, the first thing is that you want to make each operation idempotent. The context bag is already idempotent — it's built to be; it's functionally a put. You can call it five times, it doesn't matter, everything stays the same.

00:23:58.600 How do we make API POSTs idempotent? This is probably the most common example. Well, you should always check whether the API you're talking to supports idempotency keys — that is a great innovation, and Stripe really led the way on it. You should be using them; always go and check. Unfortunately, most APIs don't support them. That sucks, and we just have to deal with that fact. If you want to make an API POST idempotent, the only way to guarantee it is to check before you write. If you build this without the check and run a simulation, you are going to find multiple error locations where you end up with two resources created in that API; there's no way around it. And I know it's a little frustrating to think, "now I have to do two API requests every single time just to ensure this is safe" — I hate that overhead too, especially because, definitionally, I know that on the first pass it's fine.

00:24:57.840 That's where step-retrying comes in. If you really want to make this perform as optimally as possible, you say: only check whether I've already created it if I'm retrying. If it's the first pass, just go straight to the POST — I'm pretty confident I haven't done this already. If I'm retrying, let me see where in this step it failed: did I already create it, or did I not? And this is as good as it's going to get; sometimes you're going to have to do two API requests to make sure you don't create two resources. But by ensuring that your step methods are idempotent, the composition just follows: your job will be idempotent, if you're using a workflow engine and it's performing correctly — which it does — that is a natural guarantee.
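A sketch of that pattern as a step method, with an illustrative API client (BillingApi) and the step-retrying directive named in the talk (rendered here as step_retrying?; check the gem for the exact name):

```ruby
class SyncInvoiceJob < ApplicationJob
  include AcidicJob::Workflow  # assumed module name

  def perform(invoice)
    @invoice = invoice
    execute_workflow(unique_by: invoice.id) do |workflow|
      workflow.step :create_remote_invoice
    end
  end

  private

  def create_remote_invoice
    # On the first pass, go straight to the POST — nothing can exist yet.
    # Only when this step is retrying do we pay for the extra GET to check
    # whether a previous, partially-failed attempt already created it.
    if step_retrying?
      existing = BillingApi.find_invoice(reference: @invoice.reference)
      if existing
        ctx[:remote_invoice_id] = existing.id
        return
      end
    end

    created = BillingApi.create_invoice(
      reference: @invoice.reference,
      amount_cents: @invoice.amount_cents
    )
    ctx[:remote_invoice_id] = created.id
  end
end
```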
00:25:51.640 But if you are using a workflow engine, one thing that is true, and worth flagging, is that you can't mutate the workflow definition. You don't want to see this in a pull request, because if you deploy it and there are any jobs in the queue, they're now mismatched. The engine is just going to throw up its hands and raise a permanent error: hey, you mismatched the definition I just got from this job and the definition that was put in the database; I don't know how to resolve that, so you screwed up, and this is going to the dead set — see you in a bit.

00:26:22.840 If you want to do this safely, what you need to do — and this is very similar to strong migrations; we are embracing resiliency, we're embracing safety, and that means we often have to do a little extra work — is this: you clone that job into a new class and give it a new name; you tell your application to only enqueue the new job with the new definition; you deploy that change and wait for the deployment to go out; then you watch your job queue and wait until the old job is completely drained; and then you wait about double that amount of time, to make sure you actually caught every single place that was enqueuing it and no new ones show up. Only after you're 100% certain that there are no new instances of that old job in your queue can you delete it. Now, if you want to rename the new job back to the old name, you're going to have to do this about 1.5 times more. That kind of sucks.

00:27:18.520 It is worth saying, though, that, as I said, the engine will throw a permanent failure. So if you're okay with manually resolving all the jobs that get sent to the dead set, feel free to just ship the change — but know that that's what you're signing up for: however many jobs are currently running against the old definition when you make that deployment, you are going to have to manually figure out how to get them to some completed state in the console.
00:27:52.480 But those are the golden rules for making your jobs resilient. There's so much more I wish I could say — I would love to have had an hour, maybe two, possibly five. I'm just going to flash some stuff, and if you're interested in these things, come and talk to me afterwards, or we can hack on this stuff at the hack day.

00:28:11.399 There are some higher-order patterns. One of the things I hear a lot is: "workflows are only a linear sequence of steps; what I really want is a fancy DAG, and I want all this fancy, fancy, fancy." You don't want fancy, you want reliable — but it's worth remembering that you can build fancy on top of reliable. We've already looked at how to do external I/O: check whether the step is retrying, and always do a GET before a POST. We've already looked at how to do internal I/O: make sure those operations are in transactions. But you can do fancier stuff. You can do iteration — for those of you who've ever used Shopify's job-iteration gem, we have a lower-level primitive, so you can do it yourself; I wish I could spend more time on it. That is resilient, though you have to do a fair amount of work to make it resilient, but there you go. You want delays? You want to do some work, then wait three weeks, then pick back up and do more work — you can do that resiliently; there's a pattern, I'll leave it up there for five seconds. You want job batching? I'm not trying to take money directly out of Mike's pocket, but you could do Sidekiq Pro-style batching if you control everything yourself; it's more code than fits on one slide, so you'll have to come and ask me about it later.

00:29:30.399 In general, though, these are the principles that you're going to need, and these are the ways in which I think Acidic Job provides you with a minimal set of very sharp knives to define your jobs in such a way that they will run reliably and resiliently over time — paired, of course, with Chaotic Job: always have good tests. And that's fundamentally all you need.