Ancient City Ruby 2013

How to Fail at Background Jobs

From DRb, XMPP, and AMQP to Resque and Rails 4: running a background worker process is a tool I've reached for often, and while the underlying tools may change, the same problems seem to crop up in every one. A failed request serves your fancy custom 500 error page, but what about a failed job? Is there such a thing as a "reliable" queuing system that will never lose OR double-process any jobs? Are we talking about "simple" asynchronous method calls on models, or should we build "pure" workers with only the knowledge of a single task? What does "idempotent" mean again? Please allow me to enliven the debates.

00:00:00.560 This talk is called "How to Fail at Background Jobs," and the slides are right there, so you
00:00:06.080 can skip ahead and cheat. I think that failure is really interesting, and we learn a lot from
00:00:12.000 failure; like the guy who was just talking said about introducing the pain before the solution, I felt a lot of pain and maybe a few
00:00:19.359 solutions. Mostly I want to talk about pain, I think. But I was talking about all the things
00:00:25.119 that I could be talking about in this talk to my good co-worker Evan, who's sitting back there,
00:00:31.039 and he was like, "Oh, so you want to talk about how to fail at background jobs? Well, the answer is simple, right?
00:00:36.079 Delayed Job. That's how you fail at background jobs." Does anybody use Delayed Job? Have in the past? Anybody still using
00:00:42.480 it? I'm sorry. Go talk to Evan and find out why you shouldn't be using Delayed Job. Basically,
00:00:49.039 Evan works in support; he works with a lot of Engine Yard customers, and many times when he sees problems with
00:00:54.960 database performance, it relates to a customer using Delayed Job that's hammering their database. So if you happen to be using Delayed Job, or are
00:01:00.960 considering using Delayed Job, you might want to look at this other, similar alternative that requires Postgres, written by a Heroku guy,
00:01:07.360 there. And that's it; that's my talk.
00:01:13.200 Engine Yard is having a conference in August called Distill, and you should all submit to the CFP. You could probably
00:01:18.479 give a better talk than the one I just gave. I still have time, huh?
00:01:23.759 Yeah, okay, I guess I'll continue then. So maybe the talk I really want to give isn't... well, we're definitely
00:01:30.880 going to get some failure, but maybe a lesson we can get out of this is abstraction. So I'm going to give a talk
00:01:37.280 about seeking better abstractions for background jobs. My current working theory is that the
00:01:42.799 reason I fail so much is because the abstractions are wrong, or fail me, or I need better ones.
00:01:50.079 So, speaking of abstractions, let's talk about Rails 4 queueing. How many people have heard about the Rails
00:01:56.320 4 queueing system, the Queue system in Rails 4? How many people have heard that it is
00:02:01.520 not going to be in Rails 4? Okay, so you already know. If you didn't know, or you're interested in more,
00:02:07.119 you can go to this commit on GitHub and read all the comments about why it was pulled out.
00:02:12.400 I'm going to attempt to summarize some of the reasons it was pulled out and draw some conclusions of my own about
00:02:17.680 what about it was a fail. So, first of all, the API for the Rails
00:02:23.440 4 queueing system, as it stood:
00:02:29.040 one of the commenters points out that the biggest problem with this API
00:02:34.879 is the fact that the name of the queue that you're putting jobs into is defined at the place where you call the
00:02:41.680 push, as opposed to being defined in the job class. Actually, I should probably just explain this API
00:02:47.519 straight up. The API says you create any object you want
00:02:52.879 that has a run method, and the contract is you push that object onto some queue, and then later on
00:02:58.879 somebody else can pull it off, call run, and run your job. The problem I
00:03:05.360 see with this is that the arguments to your job, all the data that's needed to
00:03:11.280 reconstruct that object, are not part of this contract, right?
00:03:16.959 Somehow it's all in this object, and you're going to have to serialize this object and deserialize it.
00:03:22.560 And given this API, what they ended up having to do was use Marshal.
00:03:29.280 If you've ever tried to Marshal an ActiveRecord object, you'll see something like this, and this isn't great for debugging
00:03:35.360 things in production. But more than that, it's very limiting to Marshal things,
00:03:40.480 and Marshal itself is prone to problems: marshaling something deep down, you might find
00:03:45.760 a Proc, you might have circular references, and you might marshal way too much, so you have this
00:03:51.040 massive amount of completely unnecessary data in your queue system. So I think that's a fail too.
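To make that contract concrete, here is a hedged sketch of the API being described; Rails.queue.push matches my understanding of the pulled branch, but the job class, mailer, and model are purely illustrative.

```ruby
# Illustrative only: "any object with a run method" gets pushed onto the queue.
class WelcomeEmailJob
  def initialize(user)
    @user = user                 # a whole ActiveRecord object ends up in here
  end

  def run
    UserMailer.welcome(@user).deliver
  end
end

Rails.queue.push(WelcomeEmailJob.new(User.find(1)))

# Any backend that leaves the current process has to do something like this,
# which drags the AR object, its loaded associations, etc. into the payload:
payload = Marshal.dump(WelcomeEmailJob.new(User.find(1)))
job     = Marshal.load(payload)
job.run
```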
00:03:58.959 Finally, they alluded to the fact that they solved the completely wrong problem. Don't look at this code yet; look at this
00:04:04.560 code in a second. One of the main use cases I could see in Rails 4 for the queueing
00:04:10.080 system was to send emails. Action Mailer is part of Rails; you send emails with Rails. But when you send an
00:04:16.239 email, you don't want it to interfere with the request processing, right? If you sent an email from a controller action,
00:04:22.800 which might be the simplest thing to do, then it could take some time to send that email before you continue to process your response,
00:04:29.120 and that would slow down the request. I should mention that one of the main reasons background job systems exist is so that
00:04:35.919 you can do things that you might want to do in your request, or as a result of a request, without interfering with
00:04:41.360 processing that request. So to that end, what they said is they really want changes to Rack. They want you to not have to run a
00:04:48.080 separate system or a separate job processor for simple things like sending email; they just want that email send to
00:04:53.280 happen after the response goes to the browser but before you're done processing the request. So
00:04:59.440 here's a terrible, evil way to hack Rack to make it do this.
00:05:05.199 Rack expects a triplet of status, headers, and body, so we can make a proxy object
00:05:11.600 around that body, implement our own each to return the real body, and then the last thing we do is
00:05:18.000 actually send the email. And because we set Content-Length, the browser will think we're done and not be blocked on
00:05:23.759 this extra work we're doing. But somehow we'll have to figure out how to get this into Rack.
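Here's a hedged reconstruction of that kind of hack, not the actual slide code; UserMailer is a stand-in for whatever mailer you'd call.

```ruby
# Wrap the response body so the slow work happens after the body is streamed.
class SendAfterBody
  def initialize(body, &after)
    @body, @after = body, after
  end

  def each(&block)
    @body.each(&block)   # stream the real response to the client first
    @after.call          # then do the extra work (send the email)
  end

  def close
    @body.close if @body.respond_to?(:close)
  end
end

class EmailAfterResponse
  def initialize(app)
    @app = app
  end

  def call(env)
    status, headers, body = @app.call(env)
    # Content-Length is already set, so the client treats the response as
    # complete once the body has been written; it isn't blocked on the email.
    [status, headers, SendAfterBody.new(body) { UserMailer.welcome.deliver }]
  end
end
```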
00:05:29.919 So I took these problems and thought, well, maybe I can help.
00:05:35.520 Maybe what they're asking for is something we can fix. So I tried these pull requests, and it looks like really
00:05:41.120 they're going to rewrite this for Rails 4.1; I don't expect it to look anything like this when they're done.
00:05:47.600 So, moving on, let's tell a story. Back in 2009
00:05:54.080 I was working at a company called 3M. This was our product: the Lava C.O.S.,
00:06:00.080 a chairside oral scanner. I was writing a Rails app, Rails 3 or Rails 2, something like that,
00:06:07.600 and basically that Rails app was a glorified file server. Files would get uploaded from
00:06:14.160 these devices in the wild, and we'd organize them and do things with them. And
00:06:19.440 the state of the art at the time for doing background processing... we googled this, because we
00:06:24.800 had a use case of: we're going to copy some files, and that's going to come from a web request; we already have the files in one place and
00:06:30.639 need to move them somewhere else, so that needs to happen in the background, not in the request. So the state of the art at the time, from what we could tell,
00:06:36.720 was this queue called Starling. Anybody use that? Still use that? Workling? Anybody still use Workling?
00:06:44.479 He's the only guy; I just randomly found you earlier. So these were queue systems. This is
00:06:49.919 how Workling looks when you use it: you create a worker class that represents some work
00:06:55.600 that you want to do, you define a method that does that work, and you call async_ plus that method name with the arguments that are then going
00:07:01.520 to be used to do that work. You could set up the backend system, so this
00:07:07.199 dispatcher= line is setting it up to talk to Starling,
00:07:12.880 and you can configure Starling pretty easily, and you just call run and you've got a worker.
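Reconstructed from memory as a rough sketch; the exact Workling class names and the async_ prefix spelling may differ slightly from what the real library used.

```ruby
require "fileutils"

# A worker class represents some work; each public method is a unit of work.
class FileCopyWorker < Workling::Base
  def copy(options)
    FileUtils.cp(options[:source], options[:dest])
  end
end

# In an initializer: point the dispatcher at Starling as the backend.
Workling::Remote.dispatcher = Workling::Remote::Runners::StarlingRunner.new

# From app code: the async_ prefix enqueues the call instead of running it inline.
FileCopyWorker.async_copy(source: "/uploads/scan.stl", dest: "/archive/scan.stl")
```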
00:07:20.080 So this is cool; we used this for a while, and then we started reading the Twitter engineering blog.
00:07:27.440 This is around the time that people were saying Rails can't scale because Twitter is not scaling,
00:07:33.280 and we also were concerned about our ability to be distributed and highly available, and to have a version of this thing running in
00:07:39.919 Europe. So we decided to do the Eric Redmond thing and use some Erlang.
00:07:49.199 RabbitMQ was pretty popular, or seemed to be gaining popularity, so we were like, all right, we're going to move:
00:07:54.560 we're going to keep using Workling, we're just going to move from Starling to RabbitMQ as the backend system, because
00:08:00.960 it runs on multiple nodes that replicate each other. If one goes down, you still have a queue system that's working, you didn't lose anything, and it has all these
00:08:07.440 other benefits. The protocol that RabbitMQ speaks is called AMQP.
00:08:13.680 I'm not going to explain this exactly, but this is some code for speaking AMQP. It's slightly more complicated
00:08:18.960 than the simple push and pop that you get with Starling. Those are two libraries for using AMQP,
00:08:25.599 if you're interested. And then some nice guy, before the existence of GitHub, on some SVN
00:08:30.879 repository, had written this AMQP exchange runner for Workling, so all we had to do was swap out
00:08:37.039 our Starling runner for the AMQP runner, and we thought we'd be good to go.
00:08:44.080 And then, a little bug. It wasn't exactly related to Rabbit; it was
00:08:49.760 actually related to the fact that Rabbit was faster than Starling. This is what we started getting:
00:08:56.320 ActiveRecord::RecordNotFound. And that's because
00:09:02.399 we were enqueueing jobs in after_create blocks, and the funny thing about after_create
00:09:09.120 is that it runs after your object is created. You're all familiar with Rails'
00:09:14.320 after_create: it runs after your model is created in the database, but it's not after
00:09:20.240 the database transaction that was creating that object completes.
00:09:26.240 To create an object in the database there are two statements: INSERT and then COMMIT.
00:09:31.680 So the INSERT happened, we enqueued the job, and
00:09:38.480 our workers have an open socket to Rabbit, so there's not even polling; it immediately goes right to a worker, which picks it up: okay, let me go to the
00:09:43.839 database, let me find that object, let me do some work... it's not there. So you can see the hack fix commented
00:09:50.000 out here: sleep 1.
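Here's a hedged sketch of the race, with illustrative model and worker names, plus the after_commit callback that Rails 3 and later offer as one way around it.

```ruby
class Upload < ActiveRecord::Base
  # The racy version: fires inside the transaction, so a fast consumer can
  # go looking for the row before the INSERT has been committed.
  after_create :enqueue_copy

  # The safer version: fires only after COMMIT, so the worker can find the row.
  # after_commit :enqueue_copy, on: :create

  def enqueue_copy
    FileCopyWorker.async_copy(upload_id: id)
  end
end
```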
00:09:55.600 But the other hack fix was this plugin that we wrote. It's old, I don't use it anymore, it's Rails 3
00:10:01.680 only, but it's interesting, maybe. Does somebody do something like this?
00:10:07.440 Because I don't know how to solve this other than just not doing after_creates or after_saves,
00:10:14.079 or rescue-and-retry. So what this does is it monkey patches the current connection object:
00:10:20.959 finding it, opening up the class, and doing some method chaining, so that you can run a block
00:10:26.480 of code when the COMMIT executes. Clever.
00:10:31.680 Useful? Maybe. So that was cool, but then we had some more bugs.
00:10:38.880 And maybe these weren't just bugs; it developed into this fundamental
00:10:44.079 flaw. See, what we wanted to do was use the power of RabbitMQ. We
00:10:49.120 wanted to start actually using the tool underneath Workling to the extent that it could be used, right?
00:10:56.720 We actually had several apps, and we were like, oh, let's send messages between those apps, and Workling was getting in
00:11:02.880 our way, because it was a very simple abstraction.
00:11:08.800 The abstraction for Workling assumed that you have your Rails app
00:11:14.160 accepting requests and you have your workers processing background jobs, and that's all you need to do. It
00:11:19.839 didn't even let you specify names of queues. So we had started hacking this stuff in, and
00:11:26.079 eventually just decided to throw it away and write our own wrapper that was like Workling, because it had a runner and so on. I don't understand why this didn't
00:11:32.399 already exist, by the way, but we wrote it; I think we're the only ones that ever used it. And it's very similar: you say
00:11:38.800 who you subscribe to, as the queue name, and then you can notify to that named queue. And this worked such that if we share
00:11:44.880 RabbitMQ between two apps, then suddenly you could have background jobs in one app that are actually running due to messages that
00:11:51.519 were notified by the first app. So that's kind of cool; you can't do that in Resque.
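As a hedged sketch of that named-queue, app-to-app messaging idea, here's roughly what it looks like with today's Bunny gem rather than our old wrapper; the queue name and payload are illustrative.

```ruby
require "bunny"
require "json"

conn = Bunny.new
conn.start
channel = conn.create_channel
queue   = channel.queue("app_two.jobs", durable: true)

# App one "notifies" the named queue:
channel.default_exchange.publish({ "job" => "copy_files", "id" => 42 }.to_json,
                                 routing_key: queue.name)

# App two subscribes to the same name and runs the work:
queue.subscribe(manual_ack: true, block: true) do |delivery_info, _props, body|
  payload = JSON.parse(body)
  # ...perform the job described by payload...
  channel.ack(delivery_info.delivery_tag)
end
```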
00:11:58.240 Okay, let's reflect on this for a second. What did we learn?
00:12:08.160 Workling failed us. Well, I mean, it worked for a while,
00:12:13.360 but it failed to give us an abstraction that would go the distance.
00:12:19.839 And we failed at open source too, I guess, because that wrapper never went anywhere. Let's move on.
00:12:26.160 My next job, my current job, is at Engine Yard. I have to thank Engine Yard, by the way, for sending me here to speak to you lovely people.
00:12:32.160 We're not a training company, but we use Resque a lot, and our primary use case for Resque,
00:12:40.000 and background jobs in general, is to boot a server on Amazon EC2. So let
00:12:45.760 me walk through this code a little bit to give you a sense of the type of thing we're doing here.
00:12:51.440 We've created an instance, a server that some customer wants; we've created that in our database
00:12:56.880 already, and we have a job here to boot that instance. So we go find it, and we create this Fog object.
00:13:04.320 Fog's a great library for talking to Amazon. We make the create call to Amazon, we save off the Amazon ID,
00:13:11.440 and then we poll Amazon to see if that server is up. Now, by the way, this code is terrible; all
00:13:18.160 the code that I'm posting is terrible, it's all just for example purposes. Don't copy my code, copy my ideas.
00:13:24.399 Maybe don't even copy my ideas. Anyway, we wait for that server to be ready, and then we move on to the
00:13:31.519 next thing, which is attaching an IP, and it actually goes on for quite a while, doing all the things that are necessary to set the server up.
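To give a flavor of the shape of that job, here's a hedged sketch: my own example code, not Engine Yard's, with a made-up Instance model, using the Resque perform convention and Fog.

```ruby
class InstanceProvisionJob
  @queue = :provision

  def self.perform(instance_id)
    instance = Instance.find(instance_id)

    compute = Fog::Compute.new(provider: "AWS",
                               aws_access_key_id:     ENV["AWS_ACCESS_KEY"],
                               aws_secret_access_key: ENV["AWS_SECRET_KEY"])

    server = compute.servers.create(image_id: instance.image_id,
                                    flavor_id: instance.flavor_id)
    instance.update_attributes!(amazon_id: server.id)

    server.wait_for { ready? }          # poll Amazon until the server is up

    # ...attach an IP, attach volumes, configure the server, and so on.
  end
end
```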
00:13:39.040 So this job was kind of ugly, and I was talking to some co-workers about ways that we could refactor this sort of
00:13:44.880 behavior. One direction that you could go in
00:13:51.680 would be to kind of do what we did at 3M, or did for some kinds of jobs at 3M,
00:13:57.360 which is to think about what it would be like if we ran this job in an entirely separate system from the
00:14:03.279 place that enqueued it. For that to happen, obviously, we don't want to share databases; I know
00:14:08.320 nobody here does that, right? So what we need is a way to get all the information that we want
00:14:14.399 sent along with the arguments to that job, not just some ID that we could look up.
00:14:19.680 So we'd have extra arguments to our perform method, and then we'd maybe have some callback API where,
00:14:27.279 when we get to key points in our job, we call back to update the customer on what's going on.
00:14:33.680 We never actually implemented anything like this, of course. You could also go in the
00:14:38.880 complete opposite direction: you could say that background jobs
00:14:44.880 don't actually need to have all of that logic defined inside the background job. Just because the API says
00:14:50.480 "a class with a perform method and logic inside of it" doesn't mean that you need to put all of that logic in
00:14:55.680 the perform method; you could put that logic somewhere else. What if we create a job called MethodCalling, and that's the
00:15:01.519 only job we need? Any time we need to do anything in the background, we'll just enqueue the MethodCalling job with the ActiveRecord object we're calling the method
00:15:07.519 on and the name of the method. We actually did this for some jobs.
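A hedged sketch of that single generic job, using Resque's conventions; the model and method names are illustrative.

```ruby
# One job to rule them all: re-find the record and send it the message.
class MethodCalling
  @queue = :method_calling

  def self.perform(class_name, id, method_name, *args)
    class_name.constantize.find(id).send(method_name, *args)
  end
end

# Instead of a dedicated job class, enqueue the model, the id, and the method:
Resque.enqueue(MethodCalling, "Instance", instance.id, :provision!)
```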
00:15:12.880 The downside being that now, when you go and look at what jobs you're running, they're all just MethodCalling jobs; you don't have any insight,
00:15:19.360 and you don't have the other niceties. But along those lines, I recently
00:15:25.040 wrote this library called async, which has this clever DSL for
00:15:30.639 asynchronously running a method on
00:15:36.000 an ActiveRecord object later, and it has a pluggable backends concept,
00:15:42.959 so you could use any backend you want, because actually they're all very similar; it's just very slight changes to support Resque, Sidekiq,
00:15:49.600 and Qu.
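The DSL I'm describing looked something like this hedged sketch; it is not the library's literal API, and the backend here is hard-coded to Resque just to show where the pluggability seam sits.

```ruby
module Async
  # Swappable backend: Resque here, but Sidekiq or Qu would slot in the same way.
  def self.backend
    @backend ||= ->(klass, id, method, *args) {
      Resque.enqueue(MethodCalling, klass, id, method, *args)
    }
  end

  def async(method_name)
    define_method("async_#{method_name}") do |*args|
      Async.backend.call(self.class.name, id, method_name, *args)
    end
  end
end

class Instance < ActiveRecord::Base
  extend Async
  async :provision!        # now instance.async_provision! runs it later
end
```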
00:15:54.959 Okay, anyway. Refactorings aside, we had a problem with this instance provisioning job. Our customers had a problem:
00:16:00.639 they would sometimes be doing things and they would get this; they would get
00:16:06.800 their instances spinning in some waiting status, really stale, old, nothing ever happening to their instances.
00:16:13.680 We would go into resque-web, which is the web interface for Resque that tells you what's going on, and we'd see things like
00:16:19.040 jobs running for the last eight hours, you know, 157 workers.
00:16:24.320 I don't think that's the configured number of workers that we wanted. That could be because some workers
00:16:29.519 died and were replaced and the count didn't decrement, or because some workers are hung, or,
00:16:34.560 like, who knows. And what I realized,
00:16:39.920 or what I started to realize, was that there's a difference in the types of reliability that our queue
00:16:45.920 system can have, or that our background job processing system might have.
00:16:50.959 At 3M, with AMQP, we thought we had a reliable system because we used
00:16:56.160 RabbitMQ, and it had these acknowledgments, and it had these durable queues and durable messages concepts, and we were like, well, if
00:17:03.279 a Rabbit node goes down, the queue's still there. But we didn't consider all the problems
00:17:08.720 that the worker processes themselves could have, right? So there's a difference. For
00:17:14.400 example, you could have a guarantee that this job's going to be delivered, and if it's not acknowledged it'll be delivered
00:17:20.319 somewhere else. But what if the job is, like, halfway done processing and then crashes, or is almost all the way done
00:17:26.559 processing and then crashes, and you don't get that acknowledgment, and you run it twice? Maybe that doesn't work.
00:17:33.679 And then there's the concept of retrying. A lot of libraries for background jobs will give you simple
00:17:39.760 retry logic, like if you actually raise an exception it'll retry. But if your Ruby process crashes for some reason,
00:17:45.679 maybe it won't be retried. Or if you have a hang, maybe you'll need a timeout, or some
00:17:50.799 method of timing out; maybe your timeout won't fire because your process is really hung. We actually had problems like this,
00:17:56.400 because we open sockets and SSH connections to places from our background jobs, and then the remote end disappears
00:18:02.320 and Ruby doesn't know to close it. We have the problem of monitoring and
00:18:08.000 maintaining some set of workers, right? We want to keep the pool at a certain
00:18:14.880 number, and we want a graceful restart when it's time. We deploy code
00:18:21.600 multiple times a day; when we deploy new code, we want our Unicorn workers to bounce, and we want our
00:18:27.760 background job workers to bounce too and pick up that new code. But you don't want to interrupt a job that's running,
00:18:33.919 so you need to tell the workers to, like, finish the job you're doing and then shut down, and then spin up a new one. And Unicorn is
00:18:40.640 cool because it has a 60-second built-in request timeout, so you know you only need to wait 60 seconds, but who knows how long your jobs are going to run,
00:18:46.880 or whatever job is currently running inside of your worker.
00:18:51.919 And then, on top of that, with all these problems comes the problem of not
00:18:57.200 knowing. Like, if a job failed, or the thing that you expected to happen didn't
00:19:03.120 happen and you know a job is involved: is that because the job was never enqueued in the first place? Is it because
00:19:10.000 the job failed somewhere? How did it fail? Where in this code path did it fail?
00:19:15.760 So, having all these problems, I tried to tackle the simplest, or the most logical, one first, which is
00:19:21.280 figuring out what's going on. So I wrote this Resque plugin, because, you know, there are like 200 Resque plugins
00:19:27.200 on RubyGems, so clearly that's how you do things in Resque: you write plugins.
00:19:32.640 So you can try this out; it's kind of okay. You include it in your job and you
00:19:38.240 define a track method, and what that does is it lets you extract identifiers for the objects
00:19:44.880 that are relevant to the job you're running. So for an instance provision job, the relevant
00:19:51.280 objects are the instance and also the account, which customer owns this
00:19:56.720 instance, and maybe other things. And this will run during enqueue and
00:20:02.480 during perform and keep up to date what jobs are going on.
00:20:07.760 And then after the fact, or during the fact, you can call methods like this, where you look up: are there any jobs
00:20:13.760 currently running, of any sort, that might be affecting this customer or this
00:20:18.799 instance? Or are there any jobs that failed in the past 24 hours? Because, you know, the thing about Redis is everything you put in there you should expire,
00:20:25.760 because that's what's awesome about Redis, right? Just put temporary data in there: tracking data.
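The plugin's real API isn't reproduced here, so this is only a hypothetical sketch of the shape being described: extract identifiers in a track method, record them in Redis with a TTL on enqueue and on perform, and query them later. Module and query method names are invented for illustration.

```ruby
class InstanceProvisionJob
  extend Resque::Plugins::JobTracking    # hypothetical plugin module
  @queue = :provision

  # Which objects does this job affect? Return identifiers to index under.
  def self.track(instance_id)
    instance = Instance.find(instance_id)
    ["instance:#{instance.id}", "account:#{instance.account_id}"]
  end
end

# Later, while debugging (again, hypothetical query methods):
Resque::Plugins::JobTracking.running("account:42")           # jobs touching this account
Resque::Plugins::JobTracking.failed("instance:7", 24.hours)  # recent failures
```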
00:20:32.880 That kind of helped a little bit. I was probably the only person that really used it to debug problems.
00:20:39.280 A bigger problem that we had, another problem that we had:
00:20:44.559 dependencies. So one of the things you might do as an Engine Yard customer is add a database
00:20:52.240 replica to your cluster, and since we like making code in nice
00:20:58.159 discrete blocks instead of monolithic methods, we break this out into several jobs.
00:21:03.840 You create a database replica; that means we need another job to snapshot the database master, or we reuse the job
00:21:09.120 that snapshots the database master. And when that job finishes, we're going to need to provision
00:21:14.960 a volume, that's like the disk it's going to run on, from that copy of the database master. And
00:21:20.559 before we can do anything after that, we're going to have to wait for the snapshot to complete, and then there are going to be these callbacks that, like,
00:21:26.480 re-enqueue those jobs to make sure this happens. And that was a big mess. So why not write another Resque plugin?
00:21:36.880 This one's much crazier than job tracking; it never got used in production, but I thought it was super clever.
00:21:43.120 What this does, in this example here, is depend on other jobs. So we'll run the sandwich job
00:21:50.000 with these arguments, and we'll see that the first step is that we need to fetch a tomato, so we'll depend on the tomato job, pass it the
00:21:56.240 argument from this job, and continue on. And we'll see that slicing the tomato depends on
00:22:03.039 the output of the first job, so that step can't run yet, and then when the first job finishes we'll re-enqueue the sandwich
00:22:09.120 job and proceed to the next step. And for some steps we'll see that they can run simultaneously, so we'll run the sandwich job twice on those steps, and
00:22:15.679 it'll all come back together and finish. And this solved the problem of
00:22:22.080 expressing the dependencies of the jobs more clearly, but it didn't solve the problem of being able to debug what was
00:22:28.480 going on, because it just put a bunch of keys in Redis.
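The real plugin's DSL isn't shown here, so as a stand-in, here's a tiny runnable sketch of the underlying idea: declare steps with dependencies and run whatever is currently runnable. (In the real plugin the graph was driven by re-enqueueing the job in Resque, not by a single in-process loop like this.)

```ruby
class StepGraph
  def initialize
    @steps = {}     # name => { deps: [...], block: ... }
    @done  = []
  end

  def step(name, deps: [], &block)
    @steps[name] = { deps: Array(deps), block: block }
  end

  def run
    until @done.size == @steps.size
      runnable = @steps.reject { |name, _| @done.include?(name) }
                       .select { |_, s| (s[:deps] - @done).empty? }
      raise "unsatisfiable dependencies" if runnable.empty?
      runnable.each { |name, s| s[:block].call; @done << name }
    end
  end
end

sandwich = StepGraph.new
sandwich.step(:fetch_tomato)                                  { puts "fetching tomato" }
sandwich.step(:slice_tomato, deps: :fetch_tomato)             { puts "slicing tomato" }
sandwich.step(:toast_bread)                                   { puts "toasting bread" }
sandwich.step(:assemble, deps: [:slice_tomato, :toast_bread]) { puts "assembling" }
sandwich.run
```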
00:22:35.679 So, in desperation, we went back to that job tracking plugin. By the way, the job tracking plugin depended on this Resque plugin
00:22:41.919 called metadata, which lets you associate arbitrary metadata with any job. So we
00:22:47.360 decided, okay, well, it's kind of hard to figure out what's going on by looking in Redis for these failures; what if we
00:22:52.480 put them in a database instead? Then maybe we could figure it out. So we create a ResqueJob model
00:22:57.840 and we start hooking in to create a ResqueJob record whenever you
00:23:03.440 enqueue a job in Redis, and then when it starts to execute, we'll update that record, and so forth,
00:23:09.440 just like the job tracking plugin did. And now we have this ActiveRecord thing: we can do finds on
00:23:15.440 it, we can do SQL to figure out what's going on, we can figure out what the most-failing job is and how recently it happened.
00:23:22.400 And this kind of hosed our database for a while, and so we turned it off.
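A hedged sketch of that hooking, using Resque's plugin hook naming convention; the ResqueJob model and its columns are illustrative.

```ruby
module DbJobTracking
  def after_enqueue_db_tracking(*args)
    ResqueJob.create!(job_class: name, args: args.to_json, state: "queued")
  end

  def before_perform_db_tracking(*args)
    record = ResqueJob.where(job_class: name, args: args.to_json, state: "queued").first
    record && record.update_attributes!(state: "running", started_at: Time.now)
  end

  def after_perform_db_tracking(*args)
    record = ResqueJob.where(job_class: name, args: args.to_json, state: "running").first
    record && record.update_attributes!(state: "finished", finished_at: Time.now)
  end
end

class InstanceProvisionJob
  extend DbJobTracking
  @queue = :provision
  # ...
end
```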
00:23:28.320 One of my co-workers, Andy Delcambre, gave this talk at Cascadia Ruby
00:23:34.720 not too long ago; you can download my slides and click play on this one if you're interested. And his premise was
00:23:42.000 that we could solve this tracking-dependencies problem by generating unique IDs.
00:23:47.600 So when a customer hits a page, that's a web request, we'll generate a unique ID, and then if there are any jobs that get
00:23:54.400 enqueued as a result of that, we'll just use the same unique ID on them, and if any of those jobs enqueue other jobs,
00:24:00.320 we'll use the same unique ID, and then finally, when we get some exception somewhere down in the middle of nowhere, we'll have that ID and
00:24:07.600 we'll be able to go back and look: we can build this tree of all these jobs and
00:24:13.919 figure it out. He never actually implemented that, but it's still a cool idea.
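A hedged sketch of that unique-ID idea; nobody shipped this, and the middleware and wrapper names are invented for illustration.

```ruby
require "securerandom"

# Rack middleware: give every web request an id.
class RequestId
  def initialize(app)
    @app = app
  end

  def call(env)
    Thread.current[:request_id] = SecureRandom.uuid
    @app.call(env)
  end
end

# Enqueue through a wrapper so every job (and any job it enqueues) carries
# the same id as the request that started the chain.
module TracedEnqueue
  def self.enqueue(job_class, *args)
    Resque.enqueue(job_class, Thread.current[:request_id], *args)
  end
end
```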
00:24:20.159 And this guy, this is Josh, he went and solved the problem for me.
00:24:27.120 He called the solution "modeling intent." He said that what we needed to do was
00:24:33.440 model what the customer wanted to do with an object in our system, and
00:24:39.279 my interpretation of that, in terms of background jobs, means creating
00:24:45.360 models that are specific to the task you're performing. So whereas before,
00:24:50.720 InstanceProvision was a job that was defined to go on Redis, and you could go look at that queue in Redis and see all
00:24:56.640 of them, and they would then disappear when they completed, now InstanceProvision is an Active
00:25:02.559 Record object, and it relates to an Instance via a simple belongs_to, and the implementation is
00:25:08.960 very similar. And, you know, we could use the MethodCalling job, or some similar flavor thereof, to actually
00:25:14.960 invoke this thing, but the point is that now we have a much better
00:25:20.880 sense of what's going on in our system. And we have not solved the problem in an abstract way;
00:25:26.640 we're solving the problem that we have in a way that is specific to the domain.
00:25:34.960 And some of the important attributes that this InstanceProvision
00:25:40.080 object had were things like requested_at, started_at, finished_at, and state.
00:25:45.360 So you could look at the system, look at all of the InstanceProvision jobs for a given
00:25:51.279 account or environment, and know the ones that recently finished, and the ones that are running and haven't finished. You
00:25:57.200 could look at InstanceTerminate jobs and compare them, and now suddenly our InstanceProvision job
00:26:03.919 could see that an InstanceTerminate is running, so maybe it shouldn't run itself. Before, that would have just been
00:26:09.440 a race condition between two jobs, one trying to start a server and the other trying to shut it down.
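A hedged sketch of what "modeling intent" looks like as code; the column names are the ones just mentioned, everything else is illustrative.

```ruby
# create_table :instance_provisions do |t|
#   t.references :instance
#   t.string     :state, default: "requested"
#   t.datetime   :requested_at, :started_at, :finished_at
# end

class InstanceProvision < ActiveRecord::Base
  belongs_to :instance

  scope :unfinished, -> { where(finished_at: nil) }

  def run!
    update_attributes!(state: "running", started_at: Time.now)
    # ...the same provisioning steps the old Resque job performed...
    update_attributes!(state: "finished", finished_at: Time.now)
  end
end
```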
00:26:14.640 And having state lets us be idempotent. Everybody knows that this is
00:26:20.640 important when you're writing jobs, right? If you write your jobs in an idempotent way,
00:26:25.919 that means that if you run the job twice, it will not do anything more than if you had run it just once.
00:26:31.360 A GET request is idempotent because it always returns the same result, whereas a POST request is not idempotent, because
00:26:37.760 it's potentially modifying something.
00:26:42.880 So making every job idempotent is really powerful, and with a simple lock and some updating
00:26:48.159 of state, we can ensure that the job won't run twice, or, if it attempts to run
00:26:53.919 a second time, that it will raise an exception.
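Continuing the sketch above, here is one hedged way to get that guarantee: a stricter run! that takes a row lock (assuming Rails' with_lock, which does a SELECT ... FOR UPDATE) and guards on the state column. perform_provisioning stands in for the real work.

```ruby
class InstanceProvision < ActiveRecord::Base
  class AlreadyRun < StandardError; end

  def run!
    with_lock do                      # pessimistic row lock on this record
      raise AlreadyRun, "already #{state}" unless state == "requested"
      update_attributes!(state: "running", started_at: Time.now)
    end

    perform_provisioning              # the actual work, now safe from a double run
    update_attributes!(state: "finished", finished_at: Time.now)
  end
end
```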
00:26:59.120 Additionally, there's a piece of code that my coworker wrote
00:27:05.200 called Viaduct, which is extracted from a project called Vagrant, which you may have heard of.
00:27:10.320 What Viaduct is, is Rack middlewares without Rack. So it's this concept of: you have a bunch
00:27:17.440 of things to do, and you want them to be wrapped such that the innermost one is
00:27:24.240 wrapped by the outer ones (see the sketch below). So we could have this generic concept of instrumentation
00:27:30.400 for jobs, and we would use Viaduct to run it. And this wasn't specific to Resque; it wasn't
00:27:36.799 specific to the abstraction that we're using for background jobs. We could just put this inside the implementation of any particular
00:27:42.960 instance-provision type of thing. I guess what we learned, again, is that abstractions are failing us.
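Here is a minimal sketch of that middleware-without-Rack wrapping idea; it's my own illustrative code, not Viaduct's actual API, and it reuses the hypothetical run! from the model sketch above.

```ruby
# Each middleware wraps the next; the innermost "app" is the job itself.
class Instrumentation
  def initialize(app)
    @app = app
  end

  def call(env)
    started = Time.now
    @app.call(env)
  ensure
    puts "#{env[:job].class} took #{Time.now - started}s"
  end
end

class ProvisionBody
  def call(env)
    env[:job].run!      # the actual work, e.g. an InstanceProvision record
  end
end

stack = Instrumentation.new(ProvisionBody.new)
# stack.call(job: InstanceProvision.find(id))
```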
00:27:52.240 Modifying Resque and adding plugins is a fun exercise, and solving the problem in an abstract
00:27:58.320 way is like this goal: I'm going to solve this problem in an abstract way, and then
00:28:03.840 everybody else who has this problem is also going to have that solution, because I've solved it abstractly. But in many cases,
00:28:10.240 it's a lot easier, simpler, and better to simply solve exactly the problem you have.
00:28:16.720 So, on my latest project, a billing project that I've been working on,
00:28:22.880 I thought, well, if I'm really going to learn how these
00:28:29.120 background job systems work and understand anything, I'm going to have to build one
00:28:34.399 myself. All right, so let's distill down the three ingredients that it takes to make
00:28:39.520 a background job system. It's peaches, butter, and sugar; it's really delicious on the grill. No,
00:28:45.279 it's actually a work loop: some loop of code that's
00:28:50.720 picking up a job, running it, picking up the next job, running it. Maybe there's polling, maybe it's push,
00:28:57.039 whatever. The second ingredient is monitoring and restarting.
00:29:03.600 An anecdote I missed mentioning earlier: at 3M, I was a developer;
00:29:09.600 there was a separate person that did operations. I thought I was being awesome when I wrote code that would make AMQP
00:29:17.039 fail over to another server if she provided me a list of the servers that were running, by their host names.
00:29:24.000 At Engine Yard there's no operations, because our product is operations.
00:29:29.039 So every engineer, every developer, is operations. And, you know, we have DBAs that are really good at
00:29:35.919 databases, and some support people that really know Unicorn or this or that, but for the most part we just use our products.
00:29:42.320 So I got to see what it was really like to monitor background job systems, ensure that they
00:29:49.360 gracefully restart, et cetera, and so now I consider this a major part of building any background job system.
00:29:55.679 And of course the third ingredient that we'll need is the queue:
00:30:01.840 that's just a place to put jobs so that you can then pop them off.
00:30:07.600 So let's talk about some different loops. This is a rough,
00:30:14.399 some-lines-stripped-out version of what Resque's event loop looks like. It's just a
00:30:20.080 loop, as you would expect; there's a nice little hook for before_fork.
00:30:26.880 Resque has this fun behavior of forking off child processes to insulate the parent from failures, and that
00:30:32.960 kind of works in some cases, and kind of makes it harder to monitor in other cases. But the thing I don't like about
00:30:39.279 this is that it monopolizes what the worker process is doing. You run a
00:30:45.039 Resque worker process, it never exits, it's just running this loop; I can't hook into it very well.
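A miniature paraphrase of that fork-per-job shape; this is not Resque's real source, and the jobs here are just lambdas.

```ruby
jobs = 3.times.map { |i| -> { puts "working on job #{i}" } }

until jobs.empty?
  job = jobs.shift                 # "reserve" the next job
  child = fork do                  # child does the work, insulating the parent
    job.call
    exit!(0)
  end
  Process.wait(child)              # parent just babysits the child
end
# A real worker would now sleep, poll the queue again, and loop forever.
```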
00:30:51.360 What I'd like, in many cases, is EventMachine. This is EventMachine's loop; it's written in C++
00:30:56.559 as a C extension, so it's kind of harder to read, and there's a lot more code that I stripped out to be able to fit it on the slide. But what EventMachine
00:31:03.120 has is hooks. So if EM.run starts a server, it's going to run forever,
00:31:09.840 but somewhere inside of there I can call EM.next_tick, and that'll just get stuck into this EventMachine event loop,
00:31:15.760 and I can say EM.add_timer, do something after this amount of time, and that will get shoved into that loop
00:31:22.080 at some point. So I kind of wish we had something like this for Resque.
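For reference, a small hedged example of those hooks with the eventmachine gem.

```ruby
require "eventmachine"

EM.run do
  # Queue a block onto the next pass of the reactor loop.
  EM.next_tick { puts "ran inside the loop" }

  # Run a block after a delay, also inside the same loop.
  EM.add_timer(2) { puts "two seconds later" }

  # Stop the reactor after a bit so this example exits.
  EM.add_timer(3) { EM.stop }
end
```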
00:31:28.640 Then... I've never actually used Sidekiq in production, but we've tried to move to Sidekiq a few
00:31:34.399 times and been prevented by the existence of too many Resque plugins. So I went trying to figure out how
00:31:40.159 Sidekiq works. It's a lot of magic relating to Celluloid. Basically there are these three things,
00:31:45.600 a Manager, a Fetcher, and a Processor, and they are Celluloid actors, which
00:31:50.880 means they are threads running on their own, and they're, like, calling back to each other. So the Manager
00:31:56.640 will kick off the Fetcher, which will pick up a job, which will call the Processor, which says okay, I'm done, and goes back to the Manager, which calls the Processor again,
00:32:02.000 and so forth. So that's a cool loop concept, because you can just add a new
00:32:07.840 thread in there, and it's just all threads, and it's just like this pool of things running.
00:32:14.640 Next topic: monitor and restart.
00:32:19.679 So take a look at how Resque handles signals and how it handles the restart. The signal we care about for
00:32:26.080 a graceful restart with Resque is QUIT, because that will wait for the current job to finish.
00:32:32.799 The signal that you send when you Ctrl-C or kill the process is that second one, TERM or INT, so you have to be careful.
00:32:40.080 So we want something like this, and we're going to need a monitoring solution. Most of the Resque deploys we
00:32:45.600 have in production are monitored with God. You know, we could just SSH into
00:32:51.360 a box and run "bundle exec rake resque:work &" to background it like that, but we want to daemonize this and have another
00:32:58.000 process monitoring it and restarting it if it dies, and have this logic for restarting it.
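Something along these lines with God's config DSL; a minimal sketch, and the name, directory, environment, and start command are all illustrative for your own deploy.

```ruby
# config/resque.god -- minimal sketch of watching a Resque worker with God.
God.watch do |w|
  w.name     = "resque-worker-1"
  w.dir      = "/data/my_app/current"
  w.env      = { "QUEUE" => "*", "RAILS_ENV" => "production" }
  w.start    = "bundle exec rake resque:work"
  w.interval = 30.seconds
  w.keepalive          # start it, and restart it if it dies
end
```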
00:33:03.279 So we use God, but you could also just use the daemons gem. This is what Workling used, and this is
00:33:09.440 therefore what our AMQP listener used, because I didn't know anything else. And daemons works pretty simply,
00:33:15.679 except for the fact that you have to remember to set all these options to true; otherwise it's just one line of "this is the file to run."
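A hedged sketch with the daemons gem; the option values and the work inside the block are illustrative.

```ruby
# run_worker.rb -- daemonize and monitor a worker loop with the daemons gem.
require "daemons"

Daemons.run_proc(
  "invoice-worker",
  dir_mode:   :normal,
  dir:        "tmp/pids",   # where the pid file goes
  log_output: true,         # capture stdout/stderr to a log file
  monitor:    true,         # restart the process if it dies
  backtrace:  true
) do
  loop do
    # ...pick up and run the next job...
    sleep 5
  end
end
```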
00:33:22.880 Or maybe you don't need a separate background job process at all. Anybody use TorqueBox,
00:33:28.960 or heard of TorqueBox? So I was in South Africa,
00:33:34.480 and this guy with this awesome handstand gave a talk about how his company is using JRuby exclusively,
00:33:41.200 and he talked a lot about background jobs. He mentioned that TorqueBox, which is built on top of JBoss, is
00:33:46.720 built on top of this Java stuff, this messaging system, and so on, and basically they don't have worker processes that
00:33:53.120 they have to manage; they just have to manage the restart of their web application server, their
00:33:58.240 TorqueBox. And they can even have, you know, some threads that process web requests and then process a
00:34:04.320 background job, and it's all built in together and they don't have to worry about it. So I was pretty jealous of that;
00:34:09.839 I kind of wish I had that for Resque. The third ingredient is the queue.
00:34:17.040 Do you guys know who Terence is? He's a maintainer of Resque, and he's also a maintainer of
00:34:23.520 Bundler. And I thought it was interesting that he wrote this code. This is a very small snippet of that
00:34:29.520 code, but he gave a talk about how the Bundler API works. Whenever you bundle, that makes a request to the Bundler API to resolve
00:34:35.760 dependencies; that API is fetching data from rubygems.org, and
00:34:41.679 it's not using Resque. He had a need for background job processing, and instead he used a thread pool.
00:34:49.440 And the construct for the queue in this case is the Ruby standard library class called Queue, which is basically just a thread-
00:34:55.760 safe array that you can push and pop things onto, or enq and deq things. You can go read the full source code there.
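A small, hedged sketch of that thread-pool-plus-Queue pattern; this is my own toy example, not the bundler-api source.

```ruby
require "thread"

jobs = Queue.new      # thread-safe: push/pop (a.k.a. enq/deq) from any thread
POOL_SIZE = 3

producer = Thread.new do
  10.times { |i| jobs << i }
  POOL_SIZE.times { jobs << :shutdown }   # one sentinel per worker thread
end

workers = POOL_SIZE.times.map do
  Thread.new do
    while (job = jobs.pop) != :shutdown
      puts "processing job #{job}"
    end
  end
end

producer.join
workers.each(&:join)
```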
00:35:03.280 Or we could use a DRb object. Anybody use DRb before?
00:35:08.400 Just a little bit? Let me explain what DRb is. DRb is written in Ruby. Eric's
00:35:16.240 examples could have been written with DRb instead of ZeroMQ; ZeroMQ is like the cool version, I guess.
00:35:21.599 But it lets you take an object and start a service on some port that is
00:35:27.200 hosting that object, and then other Ruby processes can go and
00:35:32.960 call methods on that object remotely and get the result serialized and re-marshaled on their side.
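A hedged sketch of hosting a queue over DRb; these are two separate processes, and the port and payload are illustrative.

```ruby
# Process one: host a plain Queue object as a DRb service.
require "thread"
require "drb/drb"

DRb.start_service("druby://localhost:8787", Queue.new)
DRb.thread.join

# Process two (a producer or a worker): talk to that queue remotely.
require "drb/drb"

DRb.start_service   # local endpoint so results can be marshaled back
queue = DRbObject.new_with_uri("druby://localhost:8787")
queue.push(class_name: "Invoice", id: 42)   # enqueue from one process...
job = queue.pop                             # ...pop and work it from another
```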
00:35:39.599 So we can have this service manager object. I wanted to make a library called Hydra, but that name was taken, so we have
00:35:45.920 another multi-headed Greek monster. We create the service manager, which has these things that it does, and we can give things to it and
00:35:51.440 take things from it, and that'll be our queue. And actually, this is a cool library, because
00:35:57.440 this is what I use when I want to run a migration that affects a lot of records.
00:36:03.359 Because what it'll do is fork off child processes that communicate with the service manager,
00:36:10.000 and so you could do find_in_batches on some ActiveRecord model, and it'll pull
00:36:15.280 and feed the monster these batches of things, and then workers will come up and pick them up, and then they'll all close up and finish, and your
00:36:21.599 migration runs faster because it's multi-core, because otherwise Rails doesn't use multiple cores, or Ruby
00:36:27.839 doesn't use multiple cores, unless you have a better implementation. Redis:
00:36:33.280 the most common, popular, default queue system. This is a little snippet of code
00:36:39.440 pulled from the Qu background processing library. We use RPUSH and LPOP; those are the
00:36:45.280 canonical methods for putting things into and taking them out of Redis. You all know Redis, right?
00:36:50.800 No? You just don't like raising your hands anymore. And I picked Qu because it also has a
00:36:57.359 backend concept, so it's kind of like Workling in that it tried to revive some of that concept of, well, we can
00:37:03.119 have multiple backends. And this is one we never use, so I don't even know exactly how this code works: there's an upsert, which I thought was cool, where
00:37:09.839 you put things into the queue with an upsert and then you pull them out by deleting them.
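For reference, the Redis side in its simplest hedged form with the redis gem; the key name and payload are illustrative.

```ruby
require "redis"
require "json"

redis = Redis.new

# enqueue: push a JSON payload onto the right end of a list
redis.rpush("queue:invoices", { "class" => "ProcessInvoice", "args" => [42] }.to_json)

# work loop: pop from the left end (a real worker would loop, or use BLPOP to block)
if (payload = redis.lpop("queue:invoices"))
  job = JSON.parse(payload)
  puts "would run #{job["class"]} with #{job["args"].inspect}"
end
```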
00:37:16.000 Okay, so let's put all this stuff together. Let's make the simplest queue that we can
00:37:23.119 that still works and is production ready.
00:37:28.240 So we're going to have this script that runs and requires our Rails environment.
00:37:33.760 I created a loop called TrapLoop; the whole implementation fits on the next slide, so it's pretty simple. I'll explain
00:37:39.520 it. I start this trap loop, that's a loop, and I put in the call to the work
00:37:44.560 that I'm going to do. This is a very specific processor: this whole queue system is just for processing
00:37:50.480 invoices in our billing system. And then we do the daemons thing, where we say this is the
00:37:56.320 file I want to run, and we put in all those options that are important.
00:38:01.839 And then this trap loop, what it does is it lets us do graceful restarts.
00:38:08.480 So, start: with our loop object, we said trap_loop.start, and inside of that, do the
00:38:13.839 work, and it has the "while looping, yield" thing, so that's running
00:38:19.040 forever. And then when we get a signal, from somebody Ctrl-C-ing or killing our process,
00:38:24.560 we hit that stop call, and we save off the fact that our loop should terminate:
00:38:29.599 looping = false. And then looping is false, and it won't yield anymore, and it'll exit.
00:38:36.000 Additionally, we have this thing called a safe exit point. I don't know why this doesn't exist in any other queue system,
00:38:41.760 or how other queue systems solve this; if you know, please tell me. Safe exit point means that I can put,
00:38:48.000 somewhere in my invoice processing code, places where I'm finished processing
00:38:53.040 something and it's okay for me to stop before I start processing the next thing.
00:38:58.800 And so what that'll do is: if I call that and we're not looping, then we raise, and if we are looping,
00:39:04.560 then we don't do anything, because nobody told us to stop.
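A hedged reconstruction of the kind of trap loop being described; this is not the actual slide code, and the work inside the block is a placeholder.

```ruby
class TrapLoop
  def initialize
    @looping = true
  end

  def start
    %w[INT TERM QUIT].each { |sig| Signal.trap(sig) { stop } }
    yield while @looping            # run the work block until asked to stop
  end

  def stop
    @looping = false                # finish the current pass, then exit
  end

  # Call between units of work: if we've been told to stop, bail out here
  # rather than starting the next unit.
  def safe_exit_point
    raise SystemExit unless @looping
  end
end

trap_loop = TrapLoop.new
trap_loop.start do
  # process_next_invoice(trap_loop)  # real work would call trap_loop.safe_exit_point
  sleep 1
end
```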
00:39:10.240 So, finally, the implementation of our invoice processing task. Our use case here
00:39:17.839 was especially ill suited to the conventional queueing systems, because we actually have two queues:
00:39:25.200 we have a queue of invoices, and then each invoice has a queue of tasks that must be run to process that
00:39:31.760 invoice. So I actually wanted to be able to enqueue a task, pushing onto both lists, and
00:39:37.119 then pop off the pending-invoices queue, and then, as you process an invoice, pop off its tasks
00:39:42.320 list, and so forth. And when you run a task, it may enqueue new tasks. And then I've called safe_exit_point in
00:39:48.480 these convenient places, so that I know I won't interrupt my jobs in the middle of processing something.
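A hedged sketch of that two-level structure with Redis lists; the keys and task names are illustrative.

```ruby
require "redis"
redis = Redis.new

# enqueue: note the invoice is pending and push a task onto its own list
redis.rpush("invoices:pending", 42)
redis.rpush("invoice:42:tasks", "charge_card")

# work loop: pop an invoice, then drain its task list; running a task may
# rpush more tasks onto the same list before we loop around again
while (invoice_id = redis.lpop("invoices:pending"))
  while (task = redis.lpop("invoice:#{invoice_id}:tasks"))
    puts "invoice #{invoice_id}: running #{task}"
    # a safe exit point would go here: one task finished, OK to stop
  end
end
```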
00:39:55.200 And that's it. So let me try to draw some conclusions from my failures.
00:40:00.720 Abstraction awareness. I think one of the lessons I learned
00:40:08.640 was that it's okay, sometimes, to create a specific abstraction, or to
00:40:15.839 solve the problem you're having in a way that's specific to the problem, as opposed to thinking in the
00:40:21.520 broader, global sense of "let me solve this generic problem." I love doing that; I love coming up with a solution
00:40:27.599 that's going to solve the problem for everybody who has this class of problem. But don't always do that.
00:40:32.800 Model important jobs in your domain. If you're sending an email, you can use a simple job class for that. If
00:40:39.680 you are doing something that's critical to your business process, where you need to know if it completed, when it completed, what happened if it
00:40:45.839 failed, and why it failed, it's going to be a lot easier to track that stuff down if you put it directly in your
00:40:51.200 database and you have a one-to-one relationship between a job that could be running and
00:40:57.119 a record in your database that is the intention of that job.
00:41:02.560 And contribute to Resque, because I want to see these problems solved. I want to see better abstractions out
00:41:08.480 there; I want to see Rails 4 actually solve a lot of the problems that I've had.
00:41:13.920 And it seems like Resque is currently the best place for that type of innovation to happen. There's a lot going
00:41:19.520 on with Resque 2, and there are tons of pull requests being pulled in, and I think whatever comes out of that will feed
00:41:24.960 back into the eventual Rails 4.1 queueing system.