00:00:15.519
Good day. My name is Yousef, as has already been said, and this is Building Scalable and Resilient Background Jobs for Fintech. A quick agenda: we'll go through a little bit of intro about myself and the company I work for, and then pipelines, the what, why, and how; this is how we use background jobs at the company, PayPal and Braintree. Then the keys to scale, how we build resilient pipelines and background jobs, then observability, and then a little bit of Q&A. I usually like to do interactive presentations. This is not one of them, but if you do have a question and you want to interrupt me, please feel free; you're empowered by me to do so. So let's get into it. First of all, introductions, real quick.
00:01:09.640
I've been doing Ruby for over 15 years; I stopped counting after 15. Honestly, if it weren't for Ruby, I don't know if I would still be a software developer. Matt talked earlier about the love for Ruby and the community, and that's really what keeps me going. I do love Ruby; I think I would hate my job if it weren't for it. Honestly, I've tried different things, and I really did give Python and PHP a chance; it just didn't work out for me. So any time I get a chance to give back to the Ruby community, I'm all for it. If you're looking for a mentor in Ruby or something like that, please reach out to me; I'd be happy to share my experience with you. I don't know how many of the 15 years are good experience, but I'll share that with you anyway. I work at PayPal, in a sub-org of PayPal called Braintree. PayPal acquired Braintree over a decade ago, I think, if I'm not mistaken. And yeah, to your surprise, PayPal does Ruby.
00:02:13.720
Sharon did a great job introducing us earlier today, and she already covered all of that much better than I could: the mobile SDKs and payment processing. But if you're familiar with e-commerce and you want to take credit card transactions from your customers and get paid, well, you probably want to use something like Braintree. We give you all the tools that you need, and coupled with PayPal you get the quick checkout buttons and all that kind of stuff. So obviously I don't know why I would choose anything else. We have a large Ruby code base at Braintree; it's old, though I won't use the L word.
00:02:58.400
I work on a subgroup of Braintree where we do the merchant processing: everything that happens after transactions are taken and we move the money. I'll talk a little more about that in a bit. And we're hiring; we have a booth outside, and you probably saw the big PayPal sign. We have an office here in Chicago, and there's a bunch of people here from PayPal. Awesome, thank you, that was not rehearsed, so I'm glad. All right, come talk to us; we're super friendly, I promise. Maybe not me, but they are.
00:03:36.400
A required sort of disclaimer: all the code that you're going to see today is not production code. It's for illustrative purposes; these are examples, so don't go try to copy-paste them into your code right now. But hopefully they will make sense for what I'm trying to show. So let's dig into it. Pipelines: what are pipelines, and why do we use them?
00:04:00.599
As a team, we take all of the data that accumulates throughout the day as merchants take credit card transactions. Those merchants need to get paid; they need their money. So how do we get them their money? We are a global processing platform, so we have to do this all over the globe: US customers, Canada, Europe, Brazil, Australia, you name it. Each region has its own particulars; each region has its own partners and banking partners and all kinds of stuff, so we have to build something that caters to all of that. What my team works on is this: every day we compile all the payout data, all the amounts that merchants need to get paid, and we generate payout instructions.
00:04:52.800
Those get sent to our banking partners, and you can imagine what working with banks looks like: XML files, or pain files, a spec that some banking partners use. So we send a file to a banking partner, our banking partners transfer the money in those instructions, and then they either tell us everything worked, or they don't tell us anything and we have to guess, or they tell us something failed and we have to fix it. And so they send us back some files. We have to do this in every region, with every partner, and this is why we build these pipelines of background jobs that work together in some sort of sequence to achieve what we do. Some of the requirements we have:
00:05:40.039
We need schedulable jobs, running at different parts of the day depending on the job. Jobs have to have dependencies: we have a job that compiles the payout data, another job that creates the file, and another job that transfers the file to our banking partners, so these jobs need to depend on each other. We need to be able to do retries, because failure happens, and we want that to be one of the baked-in features of our pipeline. We need notifications: we need to know what's going on and be able to send our developers and on-call folks information about it. And we need some sort of observability, so we need to tag our jobs, track our jobs, have stats about our jobs, et cetera. And each region gets its own pipeline.
00:06:38.840
If you put all of that together, you have what a pipeline looks like: a job that generates a report or consumes a report from our partners; then we transmit the report to our partner, or we save the report data to our DB; and then we notify somebody. Put that all together and it lends itself to the concept of pipelines, which is really a grouping of background jobs that run in succession at a specific time of the day. And if any step in that pipeline fails, because of these dependencies the whole pipeline is halted, so we need to be able to intervene and fix whatever is going on so that we can restart or continue the process. That is why we use pipelines. Now, the how. Well, obviously we use Ruby; this is why we're here.
00:07:35.000
We also use Ruby on Rails, really because Rails as a framework gives us a lot of glue: we're talking to a database, and Rails gives us that layer. We do have some web-based stuff, but that's mostly for internal purposes; we generate certain reports and things like that. And one of the stars of the show for us is really the qless gem. You might not have heard of it, but it's a great gem that we work with, similar to Sidekiq. I'm not sponsoring either of them, but qless was the gem that was chosen long before I joined, and it has some features that I'll talk about in a second. One of the things we really like about qless is the dependency management, which I'll show examples of, and it has a good-enough admin UI for us to see what's going on. We've got a lot of modules on top of that, a lot of custom code that enhances it, and I'll show you some examples.
00:08:45.640
All right, the first thing is: how do we create a job? This is a very basic, boilerplate Rails example. A pipeline job for us is a database model, and every time we want to queue a job we create an instance of it in our database. We do that so we can keep track of every job that runs and the metadata associated with it, so we're able to retrieve that and figure out what's going on with our system. It's very basic if you look at it: the job type is really a class that we have, a background job class; job data is a hash that we save; we'll talk about delays and dependencies shortly; and priority and pipeline date are just extra metadata that we keep on it. Every time we run our scheduler, we save this information in our database.
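In plain Ruby, the shape of that create_job helper might be sketched like this. The real thing is an ActiveRecord model backed by a database; here an in-memory store stands in so the example runs on its own, and names such as `PipelineJob` and `JobStore` are assumptions, not the production code.

```ruby
require "date"

# Sketch of the create_job helper described above. In production this is
# an ActiveRecord model; an in-memory array stands in for the table here
# so the example is self-contained. All names are illustrative.
PipelineJob = Struct.new(:job_type, :job_data, :delay, :depends_on,
                         :priority, :pipeline_date, keyword_init: true)

class JobStore
  def initialize
    @jobs = []   # stands in for the pipeline_jobs table
  end

  # Persist one job record so every queued job is traceable later.
  def create_job(job_type:, job_data: {}, delay: 0, depends_on: [],
                 priority: 0, pipeline_date: Date.today)
    job = PipelineJob.new(job_type: job_type, job_data: job_data,
                          delay: delay, depends_on: depends_on,
                          priority: priority, pipeline_date: pipeline_date)
    @jobs << job
    job
  end

  attr_reader :jobs
end

store = JobStore.new
store.create_job(job_type: "TransferGeneratorJob",
                 job_data: { region: "US" },
                 pipeline_date: Date.new(2024, 5, 1))
```

Keeping every queued job as a row like this is what makes the later tracking and observability possible.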
00:09:51.360
Every pipeline has a schedule service. In that schedule service, in this example for a pipeline X, we have a few jobs being scheduled for the day. The scheduler uses the method I showed you before, create_job, a very simple name, but it's the code from the previous slide; we call it once for each of the jobs we want for this pipeline. Assume there's a class called a transfer generator job and then a disbursement job; those are the classes that are going to get called when we enqueue a job. A couple of things to note here. At the very top, you see that we create a noop job. This noop job does what the name says: nothing. The reason we do this is that we want to start certain jobs at a particular time of day, so we use the delay option in qless when we enqueue jobs, which lets us enqueue a job to run at a specific time via the delay parameter. Now, why don't I just specify that on any job that needs it? Well, we have jobs that not only depend on a specific time and need to start after it, but also need a specific job to finish first. That's the third job here: the disbursement job depends on the job before it, but also should not start before a specific time. As a contrived example, suppose the second job somehow started early and finished early, but we don't want the third job to start before 7:00. Now we can specify two dependencies for the third job, the noop job and the generator job, and this is how we combine both a delay and a job dependency for one job. So this is our workaround for being able to do both types of delays: based on another job, and based on a time.
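The noop trick above can be sketched like this, with plain hashes standing in for qless's enqueue call (the real code passes delay and depends options to qless; every class name here is illustrative):

```ruby
# Sketch of a pipeline scheduler using the noop trick described above.
# In production the jobs go through qless with its delay/depends options;
# plain hashes stand in here so the wiring is visible. Job names are
# illustrative.
class PipelineXScheduler
  def initialize
    @jobs = []
  end

  def schedule
    # 1. A no-op job whose only purpose is to "finish" at 07:00,
    #    so later jobs can depend on a point in time.
    noop = create_job("NoopJob", delay_until: "07:00")

    # 2. The job that compiles payouts and generates transfer files.
    generator = create_job("TransferGeneratorJob")

    # 3. The disbursement job must wait for BOTH the generator job and
    #    the 07:00 noop job: a delay and a job dependency combined.
    create_job("DisbursementJob", depends_on: [noop, generator])

    @jobs
  end

  private

  def create_job(job_type, delay_until: nil, depends_on: [])
    job = { job_type: job_type, delay_until: delay_until,
            depends_on: depends_on.map { |j| j[:job_type] } }
    @jobs << job
    job
  end
end

jobs = PipelineXScheduler.new.schedule
```

Even if the generator finishes early, the disbursement job cannot start until the noop "finishes" at 07:00, which is the whole point of the trick.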
00:12:04.600
So now that we have this pipeline scheduler, every day we have a cron job that runs, and that cron job calls the pipeline scheduler, which creates the jobs for us. Really, it does the following: it checks whether this is a business day. Do we need to do anything, or is it a bank holiday? If it's a business day, it creates the jobs in our database and then queues them up using qless. After the dependencies are met for each job, the job runs and we save the execution. There's obviously a lot more that we do, but for this talk we'll limit it to this.
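That daily flow could be sketched roughly as below: a business-day check, then job creation. The holiday list and all of the names are assumptions for illustration.

```ruby
require "date"

# Sketch of the daily cron entry point described above: check whether
# it's a business day, and only then create and enqueue the pipeline's
# jobs. Holiday data and names are illustrative.
BANK_HOLIDAYS = [Date.new(2024, 12, 25)].freeze

def business_day?(date)
  !date.saturday? && !date.sunday? && !BANK_HOLIDAYS.include?(date)
end

def run_daily_schedule(date, scheduler)
  return [] unless business_day?(date)  # nothing to do on holidays
  scheduler.call(date)                  # creates DB records, enqueues via qless
end

# A stand-in scheduler that just reports what it would create.
scheduler = ->(date) { [{ job_type: "TransferGeneratorJob", pipeline_date: date }] }
```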
00:12:47.839
Like the last slide showed, the execution is saved. I mentioned modules; we built a lot of modules to help, and one of them is the execution middleware. In this code you can see that the beauty of Ruby is that everything is extendable, and we can override anything that we want. Obviously that's dangerous, so don't do it all the time, but in certain cases like this one we can override the around-perform hook that qless gives us, track the start time and the completion time of the job, and then save that once the job has executed. This is how we can hook into qless as a gem and then use Ruby to track what we want. In this method you just have to assume some basic association between our pipeline job and the pipeline job execution.
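A minimal sketch of that idea: the real code overrides the hook qless exposes, but a prepended Ruby module around `perform` shows the same mechanics. Class and attribute names are assumptions.

```ruby
# Sketch of the execution-tracking middleware described above. The real
# version hooks qless's around-perform; here a prepended module over a
# plain perform method shows the same idea. Names are illustrative.
module ExecutionTracking
  def perform(*args)
    execution = { job: self.class.name, started_at: Time.now }
    result = super                       # run the actual job body
    execution[:completed_at] = Time.now
    result
  ensure
    # Save the execution record whether the job succeeded or failed
    # (completed_at stays nil on failure).
    Execution.log << execution
  end
end

# Stand-in for the pipeline_job_executions table.
class Execution
  def self.log
    @log ||= []
  end
end

class ExampleJob
  prepend ExecutionTracking

  def perform(*)
    :done
  end
end

ExampleJob.new.perform
```

Because the module is prepended, the tracking wraps every subclassing job's `perform` without the job author having to remember to call anything.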
00:13:44.360
Those are just regular Rails associations and models. So now that we have the setup for how our pipelines are defined, saved, and triggered, how do we make sure we can scale? Because at PayPal we move a lot of money: billions of dollars a day, the GDP of Switzerland per year, maybe, or something like that. So scale is important. One of the keys to scale for us is resiliency. Obviously, anything that can go wrong will go wrong; we all know that. So how do we prepare for that?
00:14:25.320
One of the ways we build resiliency into our pipelines is retries, which I talked about earlier; they're baked into qless. Another is idempotency, through save points, which I'll show an example of in a second. And there are feature flags. Those are just a small subset of the things you can build; obviously each app is different, so figure out what your app needs. But one of the important things is to bake anything that can go wrong into your code, because you know it's going to happen. Bake it in; figure out how you're going to deal with it in an automated way. So let's look at examples.
00:15:09.440
Save points. I talked about idempotency; retries are baked into qless, so we don't have to write custom code, but there are certain times when idempotency is not easy to achieve through vanilla Ruby. So we built something called save points, where you can enclose your code in a block. How you implement this is up to you, but for us it's a row in a table called save points, with a unique label. If our code tries to run the same code again with the same label, and it knows this save point has already been run before, it doesn't run it again. As an example, assume we're downloading and processing a file that a bank partner gave us. If the job fails for whatever reason at some point after we've successfully downloaded the file, we don't want to try to download it again. Usually that's because after we download the file, we move it somewhere, so it's no longer there; if we reran the job, it would fail because the file is gone. But we know we've already processed the file and saved it where we want it, in S3 let's say. So we encapsulate that in a save point, and if the job is rerun for any reason, it knows that this particular file, which is unique by date and some identifier, has already been downloaded, so it doesn't rerun that part of the code. That's one way we build resiliency into our pipelines.
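A minimal save-point helper along those lines: the production version keeps completed labels as unique rows in a table; a Set stands in here, and the name `with_save_point` is illustrative.

```ruby
require "set"

# Sketch of the "save point" idempotency helper described above. The
# real implementation stores completed labels as unique rows; a Set
# stands in so the example runs on its own. Names are illustrative.
class SavePoints
  def initialize
    @completed = Set.new
  end

  # Run the block exactly once per label. On a retry of the job, code
  # wrapped in an already-completed save point is skipped.
  def with_save_point(label)
    return :skipped if @completed.include?(label)
    result = yield
    @completed << label   # only marked complete after the block succeeds
    result
  end
end

points = SavePoints.new
downloads = 0

2.times do
  # Simulates the whole job being rerun: the download happens only once.
  points.with_save_point("download-settlement-file-2024-05-01") do
    downloads += 1
  end
end
```

Note the label is only recorded after the block succeeds, so a failure inside the block leaves the save point open for the retry.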
00:16:48.199
uh into our pipeline another way um is we know that
00:16:54.920
we're going to get uh query timeout because we usually process long files and we save those into our
00:17:01.440
database uh and a lot of the times we get these peak days like holidays for example where the file is too big it
00:17:07.880
doesn't uh process within the uh timeout that we've already set for the database
00:17:13.319
connection it doesn't need to page we don't need to have manual intervention uh we know usually that a
00:17:19.760
retry with a larger timeout uh actually works um and so we've built that into
00:17:26.280
our code we automatically retry any query time out once with uh you can set
00:17:32.320
that uh through metadata on the job if you want more retries but uh you can you
00:17:38.640
can say I wanted to retried once with double timeout and so we Bak that into our
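That retry policy could be sketched like this; the `QueryTimeout` class stands in for whatever the database adapter actually raises, and the doubling policy mirrors the description above.

```ruby
# Sketch of the automatic query-timeout retry described above: if a
# query times out, retry with a doubled timeout before giving up. The
# QueryTimeout class stands in for whatever the real adapter raises,
# and the retry count could come from job metadata.
class QueryTimeout < StandardError; end

def with_timeout_retry(timeout:, retries: 1)
  attempts = 0
  begin
    yield(timeout)
  rescue QueryTimeout
    attempts += 1
    raise if attempts > retries   # out of retries: let it page
    timeout *= 2                  # give the holiday-sized file more room
    retry
  end
end

# A query that only succeeds once it is given at least 60 seconds.
result = with_timeout_retry(timeout: 30) do |t|
  raise QueryTimeout if t < 60
  "#{t}s: ok"
end
```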
00:17:45.360
And if you remember, part of the pipeline scheduler earlier showed these cutoff metrics. Well, the second key to scale is observability. It's a tough word for me to say; I wish there were a shorter one. But you can't diagnose issues if you don't know what's going on with your system, so you have to collect as many metrics as you can. An example for us: part of the metadata, which is just a Ruby hash we can add whatever we want to, is something called the cutoff-at metric, which says this job needs to be finished by this particular time, otherwise we will miss some business deadline; the banks will not be able to process our transfers, something like that. We set that metadata on our job, and then we use a middleware (like I said, we do a lot of middlewares) that says, and you don't have to know this code, it's just an example: when the job runs, track the time, and if the time exceeds the cutoff metric, track that somewhere.
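The cutoff check might look like the sketch below, with an array standing in for the Datadog client; the `cutoff_at` key mirrors the metadata just described, but the surrounding names and the metric name are assumptions.

```ruby
# Sketch of the cutoff-tracking middleware described above. The job's
# metadata (a plain Ruby hash) carries a cutoff_at time; after the job
# runs we record how it did against that deadline. METRICS stands in
# for a Datadog client; names are illustrative.
METRICS = []

def track_cutoff(metadata, finished_at:)
  cutoff = metadata[:cutoff_at]
  return unless cutoff   # not every job has a business deadline
  METRICS << {
    metric: "pipeline.job.cutoff_margin_seconds",
    value:  cutoff - finished_at,   # positive = finished before cutoff
    missed: finished_at > cutoff
  }
end

cutoff = Time.utc(2024, 5, 1, 10, 0, 0)
track_cutoff({ cutoff_at: cutoff }, finished_at: Time.utc(2024, 5, 1, 9, 30, 0))
track_cutoff({ cutoff_at: cutoff }, finished_at: Time.utc(2024, 5, 1, 10, 5, 0))
```

Recording the margin (not just a boolean) is what makes the "higher is better" charts described next possible.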
00:18:55.760
And so we'll get charts like this, where we use Datadog to track our metrics. These are two examples. The top one shows all of the jobs that have run for the past X amount of time; they're all in the green, meaning they all finished before the cutoff time, and the line represents the time between when each finished and the cutoff. So the higher the graph, the better. The lower example shows where one of the jobs at one point exceeded the cutoff time and finished after it. What we can do is set up paging alerts in Datadog that track these graphs and metrics and let the on-call folks know if something's gone wrong. This is how we build into our pipelines a way to know what's going on. It's very important to collect baselines so that we're not generating a lot of false positives and paging for no reason. But observability is really a key thing for us, and we have a great team that works on this all the time and keeps improving it; some of those folks are here, and I really owe all of this talk to them.
00:20:15.600
Another thing we want, for example, is to track when a job has run. This is another example of a middleware you can write, to record a job event, and what we do here is notify. If you remember the diagram I showed earlier: you run the job, and at the end you want to notify. Well, you want to know if your job is running, and the difficult thing about background jobs is that they're in the background. Unlike a web request, where you're loading a browser, you see what's happening, and you get instant feedback, with a background job you're sitting there hoping that everything's going well. So for us it's important to notify about events as they happen. We have this middleware (again, a contrived example; there's a lot more code that usually goes into it), and you can see it says: we have these events that qless gives us, for when a job starts or completes or fails, and then we can notify. In our case we notify a bunch of systems: Slack, email, Datadog, whatever you want.
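A notification middleware in that spirit, with stub lambdas standing in for the Slack and email senders; the event names here are illustrative rather than qless's actual hook names.

```ruby
# Sketch of the event-notification middleware described above: on each
# lifecycle event (started / completed / failed), fan the message out
# to every configured notifier. The notifiers here are stubs standing
# in for Slack, email, Datadog, etc.; names are illustrative.
class JobNotifier
  def initialize(notifiers)
    @notifiers = notifiers
  end

  def record_event(job_name, event)
    message = "[#{event}] #{job_name}"
    @notifiers.each { |n| n.call(message) }   # Slack, email, ...
    message
  end
end

slack_messages = []
notifier = JobNotifier.new([->(msg) { slack_messages << msg }])

notifier.record_event("DisbursementJob", :started)
notifier.record_event("DisbursementJob", :completed)
```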
00:21:29.919
This is the same thing displayed a little differently: you can see that any time a job starts, we notify Slack, and any time a job finishes, we notify Slack. So if you want a quick view of what's going on with your system, whether the jobs that are supposed to be running are running, you can immediately search Slack for the job name and you'll find the latest instance: whether it's been put in the queue, started, not started, et cetera. And we have a lot of SLAs in our system.
00:22:16.360
We are moving people's money; it's very important that we do it right. So one of the things we have is a cron job that checks the health of our system on a periodic basis, plus monitoring code that takes any job we want to monitor and asks: has this job started at the time we expected it to, and has it finished at the time we expected it to? The way we do that is a simple cron job that calls the health endpoint of our app. For each job we're monitoring, this endpoint gets the list of jobs and checks the expected runtime, which we log into the metric like we saw before. Is it out of bounds? It's easy to check: you compare the time of the job, whether it's finished or not, against when it's supposed to finish; basic time math. If it is out of bounds, you alert: hey, listen, this job was supposed to finish at 10:00 and it hasn't; something's wrong, go check it out.
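The endpoint's core comparison really is basic time math, roughly as sketched here; the field names are assumptions.

```ruby
# Sketch of the health check described above: for each monitored job,
# compare where it is against where it should be by now, and collect
# alerts for anything out of bounds. Field names are illustrative.
def health_alerts(jobs, now:)
  jobs.filter_map do |job|
    overdue = now > job[:expected_finish_by] && !job[:finished]
    "#{job[:name]} should have finished by #{job[:expected_finish_by]}" if overdue
  end
end

now = Time.utc(2024, 5, 1, 10, 30)
jobs = [
  { name: "TransferGeneratorJob", expected_finish_by: Time.utc(2024, 5, 1, 9, 0),  finished: true  },
  { name: "DisbursementJob",      expected_finish_by: Time.utc(2024, 5, 1, 10, 0), finished: false },
]
alerts = health_alerts(jobs, now: now)
```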
00:23:27.320
This gives our on-call folks the ability to come in and intervene manually if any issues come up. Which brings us to our last section: Q&A. I have five minutes, so I'm happy to answer any questions you have.
00:23:46.000
We already have one over there, and I'll come back to you here. So the question is: if you have checkpoints, what happens if a checkpoint fails? Right. So if the checkpoint fails, it restarts. Either it finishes and you save in the database that it finished, so you know not to execute it again, or, if it failed, you hope that the code you wrote in there can be restarted. Usually you don't put a lot of code in each checkpoint; you try to split things up in a way that's logical and easy to manage, so that if it does restart, you get the expected outcome. Does that answer your question?
00:24:26.880
There was a question here first. The beauty of working for a company like PayPal is that we have offices around the globe, so we have folks who are dedicated to being first responders to on-call issues, and we have 24-hour coverage with folks around different time zones. If you don't have that, PagerDuty and loud alarms can help.
00:25:00.240
Since the mic is there, let me go with that, and then I'll come back to you. Just another question about the specifics of the checkpoint system: do you keep track of those in perpetuity, or do you delete them once they're completed? Well, there was a great question in the talk before me about SLAs and time to live; everything we have in our database has sort of a time to live. I don't know, Cooper, if you want to add to that. Yeah, it's a database record, and then some things are iceboxed or archived, moved away from our main database, or purged. OK, so it's just a combination of requirements for keeping track of things versus not having the database full of year-old jobs. Yeah. OK, thank you.
00:25:46.120
Francisco. My question was actually along the same lines: do you use a specialized database to track all of these things? Because, Postgres? Oh, cool. Yeah, it's all Rails models and Postgres. The power of open source.
00:26:10.279
Yeah, that's the bane of Scott and Cooper's existence, the white noise. It's a really tough battle: how much noise is good noise or bad noise, and all that kind of stuff. The Slack channel is easy to mute if you don't want to listen to it, and we have layers and tiers of things that are more or less noisy. There's a channel where everything goes into it; you don't have to listen to it, but if you need to go back, you can. Some of that data is also replicated in Datadog, and there are more metrics and alerts that come out of that. And there is some stuff that actually pages the primary or backup on-call, and so on. So it really depends on how critical it is and how much intervention it requires versus how much is automated. That's really up to you to tweak to your liking.
00:27:08.600
All the time, all the time. I mean, we have these conversations all the time, but just like Matt said, re-engineering something from scratch doesn't work, so we keep improving and building on our systems. Obviously, I think there's a Sidekiq talk coming up next; would we use qless versus Sidekiq or some other technology? There are always these conversations. I don't have anything worth settling in the next ten seconds, but there are always tweaks that we're making.
00:27:46.559
With the save points: for jobs that you want to run multiple times, but maybe not very close together, how are you scoping those save points so that you can pay the same person two times without doing the same transaction? Yeah. So, we don't pay the same person two times a day; that's one of the ways we solve that. But everything has a unique key. Usually, for example, when we're running something for the day, there's a model with a unique identifier that is generated, and that can go in as part of the label for the save point. So you always know, if you're trying to run the same job multiple times (and we do have a lot of jobs that run multiple times), that there's something unique. For example, the pipeline is part of the label: if you're running it in the EU versus the US, the pipeline is part of the label, so you know this job can run multiple times, once for this pipeline and once for that one. Stuff like that can help.
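Scoping a label that way might look like this; the exact components (pipeline, date, run id) are assumptions for illustration.

```ruby
require "date"

# Sketch of how a save-point label can be scoped, as just described:
# fold the pipeline, the run date, and a unique run identifier into the
# label so the same job can run per-pipeline and per-day without the
# save points colliding. Components are illustrative.
def save_point_label(pipeline:, date:, run_id:)
  "#{pipeline}/#{date.iso8601}/#{run_id}"
end

eu = save_point_label(pipeline: "eu", date: Date.new(2024, 5, 1), run_id: "run-42")
us = save_point_label(pipeline: "us", date: Date.new(2024, 5, 1), run_id: "run-42")
```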
00:28:57.760
Are there more questions? I can take one more question and keep things on time.
00:29:09.799
OK, so we do rely on Datadog quite heavily, for example; I don't mean to endorse or plug them, but it is one of the tools we rely on. There are plenty of tools out there that are very helpful and good to use. All right, the disclaimer again: all the code snippets are examples. I generated the snippet images using this website, if you're interested. And a reminder: we are hiring. Come meet us at the PayPal booth right next door. Thank you so much.