Hey now, so get spaceo. So, a little more
about me: I have been in the Ruby community since 2004, with my
character's name the entire time. I've been using Sidekiq for a long time as well, across several different
companies. And, like many of you, I learned Ruby from why's guide, and
so when we were talking about it on Mastodon, it was, "Hey, do the thing in character." So I am, and we'll see how
this goes. Anyway, the way that I set this presentation up is that it's largely
going to be the mistakes that I've made. I've been doing this for a long time, and
some of them are really boneheaded mistakes, so don't feel bad if you do that sort of stuff too.
There's a lot of them, I've made a lot of mistakes, but these are the top seven. I'm also going to assume that you're familiar with Sidekiq, and if not,
hopefully you can hang on a little bit. Anyway, this is really my opinion
and not necessarily that of anyone I've ever worked for, just to put that out there.

So,
what's the first lesson? Use a separate database. Don't do what I
did. I've been using Sidekiq for a long, long time, and early on it wasn't really well
understood what would happen if you shared the Redis database with Rails. Eventually it made it to the wiki, and
"definitely never do that" is the answer, if you can avoid it.
The reason we did it anyway: we didn't put the other database up because we're a startup and
it costs money, so why bother? And
why would you want a separate database? Well, you're going to need to scale out the Rails side of things, right?
You're going to want to leverage clustering, and you're going to want to have
isolation between the different databases and stuff like that. Sidekiq doesn't really need that;
it can work great with one database, no problem. So what do you do? Well,
you might think: Redis supports different database indexes, so you
could use database index 1, put Sidekiq there, and you're technically only running the one database. But that
doesn't actually work, because they still compete for resources, the clustering is configured more or less the same, and
also a lot of different distributions of Redis or Valkey don't
support numbered databases. So here are some ideas on how to fix this, or at
least try to. The brute-force solution: what would that be? It's the simplest. Shut everything off, copy,
paste. No, that doesn't work. I mean, you could do it if you can go offline,
but that doesn't work at scale. So what's the second one you might try? Well, use Redis
replication. That kind of works. With this, the job queuing never stops; you can
still write, but you can't read, so you don't really have any user-facing downtime. But the way that
this works is that it leverages a feature, slash misfeature, of Redis wherein
you can write to a replica, which is not a thing you normally would want to do. If you set it up that way and you
write to this replica and everything's good, great, you've migrated over, no problem. The problem: a replication
restart throws all your data away. So if you've started to migrate your writes and they're about halfway over and replication restarts (the network went down,
the replication buffer overflowed, whatever), you've just lost all those jobs permanently and you're never going to get them back. So that doesn't
quite work. And of course, if you did get it to work the first time, you have to immediately turn off replication
and take a snapshot, and then hope that it didn't fail at any point in that process. Okay, that doesn't do it
for you either. Another thing is that not every provider gives you access to REPLICAOF, and in those
cases maybe you could use RIOT, but maybe you can't. RIOT is a thing that you can download from Redis, and
it depends on the provider, but it may or may not work for you. So, okay, what
do you do? Well, we have to manually replicate. We have to do something. So
what I've done, and talked with some colleagues about, is basically just implement a manual
replication process, and that's essentially what I'm going to go with. This is a high-level overview,
but the keywords to think about, if you're trying to think
about this problem too: we basically RPOPLPUSH the jobs to a new key, and then use the sorted
set commands to move them in batches, and deal with Sidekiq Pro batches and all of that stuff. The whole idea is
that there's not a step in the process that's going to result in data loss, because that would definitely be a problem if it happened. And so
you can always revert every step. The only downtime, again, is that you can't execute jobs, but you can
queue them, so that's good. You also end up with an end state where the source database and destination
database are clean: you don't have to remove any data, there's nothing left behind, anything like that. And, just
anecdotally, I'm trying to advocate for open sourcing the migrator once we're done with it, but we'll see. I can't promise
anything.

So what's the second lesson? This one's really kind of
common knowledge at this point. Adam McCrea of Judoscale wrote a really good article about this called "An Opinionated
Guide to Planning Your Sidekiq Queues," and it covers everything you need to know. Go search for that,
or find him; I think he's here. Anyway, the one thing I'd say is that
you don't necessarily have to start with a gigantic cluster of Sidekiq processes, but you do eventually need to have that
as a thing you can scale to. So think about it, make sure you're prepared for it, and give yourself
the provision for that in your infrastructure orchestration and stuff like that. And of course, what happens if you
don't do this? This is the mistake that I made: you end up with queue blocking. Something simple:
you have jobs that take tens of milliseconds, and you queue 10,000 of them, maybe
sending activity emails or something. You've got those 10,000 jobs queued, but they're not going to execute, because
you've got image processing or transcoding or something that takes a minute each, and there's a dozen or so of those, and that's using all
your resources, so you can't possibly run the things that are super fast.
You can't really use autoscaling for this, because it takes minutes, sometimes 10 minutes or so, to scale
out. And when you run into this and you haven't really
learned, like I did, then you kind of get gun-shy about things that
would be better as just a long-running Sidekiq job. What I would say is that I've done jobs that
take half an hour, an hour, two hours sometimes. Don't be afraid of them; just make sure they're in a queue by
themselves.

So the third lesson is, to kind of a little bit contradict what
I just said about using queues named after the SLA you want: fault
isolate. Think about boundaries like with Packwerk, or I guess packs. If you're
going to modularize with Packwerk, then you're already drawing these boundaries, right? You're going to have different components and stuff like that,
and you can think about your Sidekiq queues the same way. So you have your SLA-based queues, but you prefix them
with the name of the component. You don't necessarily always do this, but you want to do it here because these are
typically going to be different Amazon ECS services, each serving
a different set of queues or whatever. What you really want is for
the things to not necessarily be able
to interact with each other. You might want to have different resources allocated, different CPU, different RAM, whatever. You might want to
pack them more efficiently or less efficiently, depending on whether the job can technically go a little bit
over its SLA, or maybe it can't. That's a thing you need to think about. The biggest,
most important thing, which I think is not thought about all that much, is that the security context is important.
My background is in information security, so this is what I think about at night sometimes. Anyway,
what does that mean? What it really means is that you have Kubernetes pods or
ECS services or whatever, and those are going to get different secrets assigned to them. Those are going to have different permissions: on AWS it'd be an IAM role, or whatever
your cloud provider has. And you don't necessarily need something that's just doing image
processing to be able to read and write customer CRM data. So you can
restrict that at a level where it doesn't matter if someone could just upload an executable and run it: it doesn't do
anything, because it can't access that data or do anything bad to it.
You can also use a restricted database user. Again, in the image processing example, you might say, okay,
this job can only read and write the images table and read a few others, and
that's it. Like I said, that really limits the impact of a remote code execution vulnerability.
The example I'm thinking of is Mastodon with ImageMagick. If that were fault
isolated in the way that it should be, then it would only have been able to
affect images, because the image processing is happening in a completely separate set of containers
that are isolated from the others by virtue of the container ecosystem. You could probably still
escape, there are a lot of container escapes out there, but it's more difficult. You also can deal
with the fact that some jobs might need to scale out more aggressively or less aggressively, and you need to have a knob to turn for that,
theoretically. That said, faster iteration is most important when you're starting out. So really this
is more: have this in the back of your mind when you're working on it, and don't create extra tech debt for yourself,
so that you can go back in a future refactor and add this.

So the fourth lesson is to ensure observability.
And I will make a little bit of an
admission: I don't run the Sidekiq dashboard. Originally it
was because I didn't have the time to set it up and make it access restricted. Then I did, but I'd already set up
the metrics collection, so I didn't really need it. And now the company got acquired by a big
company and we need to put SSO in front of it, so it's more complicated, and I still haven't done it.
Sidekiq Enterprise gives you a hook for a lot of this. It's not a lot of
work; it's just a little bit of work that doesn't necessarily bring all that much value at
the moment. But regardless, if you're going to run the Sidekiq dashboard, don't make it internet accessible. Put it behind a
bastion host or restrict it to a certain IP address range or whatever. Don't put it on the internet;
that's not a good idea. In addition to the different libraries I
have on the screen, you can also roll your own with Sidekiq::Stats. That's what
I did, because at the time these didn't exist. And really, regardless of whatever system you're going to use to collect metrics,
collect them more than once per minute. The reason for that is pretty simple: you don't want
to use just one data point to make a decision like "oh, I need to scale out" or "something bad's happening, I need to page somebody." You want a few, and you
want a few over some time period, and that time period being 10 minutes might be the difference between having
customer impact or not. So if you could make that time to response a few minutes, like one or
two or three, because you're measuring every 15 or 20 seconds or something, that's way better than if it
takes 10 minutes. Because if the action is to scale out, well, scaling out, if you have to
launch new instances, can take 10 minutes, and now you're back to having customer impact if the jobs have to
execute quickly enough. So you really want to keep
track of that. One thing, though, is that it's going to require a little bit of special consideration, because usually
high-resolution metrics cost more and sometimes require you to use a different API to publish them or whatever. But
that's something that you can plan for accordingly. You can also think
about, okay, APM, application performance monitoring. What I would say is that that's not going
to be a way out. That's the mistake in this slide:
you can't really rely on APM to tell you that your jobs are performing poorly, because it's going to tell you the code path that's performing poorly, which
might not translate to the job actually being bad. You might have a method that is not necessarily the most
optimal it can be, but hey, guess what, it's still executing way faster than you need
it to, and you don't need to care. So it frequently doesn't correlate, and that's kind of unfortunate, because usually
you're already using some tool to do that. I think one of the sponsors is an APM provider, so you're using
their tool, theoretically, and if you are, awesome. You just probably
want to also collect your own separate metrics for job performance, not just APM.
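To make the "more than one data point" idea concrete, here is a minimal sketch in plain Ruby. It assumes you are sampling queue latency every 15 seconds or so (with Sidekiq the number could come from something like `Sidekiq::Queue#latency`, but the sampler is left out here), and it only signals once several consecutive samples are over a threshold. The class name, window size, and threshold are hypothetical, not something from the talk.

```ruby
# Rolling window of queue-latency samples (hypothetical helper, not part of Sidekiq).
# A breach is only reported when the window is full and every sample is over the
# threshold, so a single noisy data point never triggers a page or a scale-out.
class LatencyWindow
  def initialize(size:, threshold_seconds:)
    @size = size
    @threshold_seconds = threshold_seconds
    @samples = []
  end

  # Record one sample, dropping the oldest once the window is full.
  def record(latency_seconds)
    @samples << latency_seconds
    @samples.shift if @samples.size > @size
    self
  end

  def breached?
    @samples.size == @size && @samples.all? { |s| s > @threshold_seconds }
  end
end

# Four samples at a 15-second interval means a decision in about a minute.
window = LatencyWindow.new(size: 4, threshold_seconds: 30)
[5, 45, 50, 48].each { |s| window.record(s) }
puts window.breached? # one healthy sample is still inside the window
window.record(52)
puts window.breached? # now all four samples are over the threshold
```

The trade-off is the one described above: a shorter sampling interval lets the window fill in a minute or two instead of ten, at the cost of publishing higher-resolution metrics.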
But what do you do when you have metrics? Well, like I said, you're probably going
to end up paging someone in the worst case, and in the best case you have data to draw from to make decisions
about how to scale the different Sidekiq runners, workers, whatever you
have. In the case of Amazon ECS, you have different ECS services, and you want to give some
more CPU and some less CPU. How do you know which one's
which? How do you know what the best choice is? You don't; you have to have the data, and by collecting the job performance metrics you're able
to figure that out. And then also: how many
different instances of the service are you running? What's the Sidekiq concurrency set to?
All of these are things where you can collect the data and then easily go compare later: hey, this is good, this is
bad, I'm going to make a tweak and make it more efficient; okay, it still performs just as well, I'm
going to continue to tweak that until I'm at my cost optimization goal or whatever. That said, for autoscaling,
don't use CPU or RAM utilization as a proxy for performance, as the signal that you need to scale out or scale
back in. It doesn't really work very well for that, in my experience, and that's the mistake that I made here.
It's unfortunate, because it's really unhelpful to have
those be provided by default when you can't use them. I mean, I guess if you did nothing else you could use them
and that would be okay, but I would suggest wiring up autoscaling with some
of the metrics you're collecting. The other thing is that your scale-in
and scale-out logic might need to be more complicated, and really the key thing is to eagerly scale out but
conservatively scale in.

So the fifth lesson is to avoid the sharp edges.
There's a lot of them, and I thought about making this a whole bunch of different lessons, but then we're at
lesson 50 or something and it's not worth it. The first thing I have on here is: make the job
idempotent. What that really means is that sometimes you have a situation, say it happens 0.01% of the
time, and at a certain scale that's daily, or hourly, or every minute for some
folks. What might happen is that the job does all of its work and
commits the transaction to your RDBMS, and everything is good, and then it can't
mark itself complete in Redis: that packet gets dropped. Sidekiq Pro will happily deal with that problem, but
not in every situation; there's still the possibility that it wouldn't necessarily
be able to check the job back in as complete. So what happens when that job gets recovered? Well, it's
going to redo all the same work again. Meaning, if you didn't have any check for idempotency, you're going to redo all the work and duplicate
rows, theoretically. Or it might be too late at this point to do the thing you needed to do, and now you're creating
an exception that lives forever until you go clear out the dead set or something. That's deeply unpleasant, so
try to make sure that you're checking this when the job starts, or at least have some version of
idempotency, whether it's checking the status to see that you still need to do the thing or whatever.
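A minimal sketch of the "check the status when the job starts" idea, in plain Ruby with an in-memory hash standing in for the database. The `Order` struct, the `SendReceiptJob` class, and the status values are made up for illustration; a real Sidekiq job would load the row from the database inside `perform`, and would want the side effect and the status update committed in the same transaction.

```ruby
# In-memory stand-ins for a database row and table (assumptions for the sketch).
Order = Struct.new(:id, :status, :receipts_sent)

class SendReceiptJob
  def initialize(orders)
    @orders = orders # hash of id => Order, standing in for the RDBMS
  end

  # Safe to run twice: the status check up front turns a re-run into a no-op.
  def perform(order_id)
    order = @orders.fetch(order_id)
    return :skipped if order.status == :receipt_sent

    order.receipts_sent += 1     # the actual side effect
    order.status = :receipt_sent # recorded together with the side effect
    :done
  end
end

orders = { 1 => Order.new(1, :paid, 0) }
job = SendReceiptJob.new(orders)
job.perform(1) # first run does the work
job.perform(1) # a recovered duplicate run is skipped
puts orders[1].receipts_sent
```

Passing only the record's id and re-reading its state at the start of `perform` is what makes the second run harmless.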
Another thing is that the file system in a containerized environment is weird, because containers are going to have quotas, the file system is going to
have a quota, and they're not necessarily going to be set to what you expect them to be. They might need to be tuned, or
might not need to be tuned, whatever. You also might have it where each
container has a volume for itself, or there might be a shared volume for the whole instance,
for a cache or something. Those are all things to think about. But if
you're going to use the file system and you don't really know what the
configuration is, then you might end up
in a bad situation. For instance, temporary directories: those are usually a RAM disk by default nowadays. So
you're downloading, say, a multi-gigabyte file, which you think is going to a disk that you don't really care about the size of, but it's actually going to RAM,
and you crash the container. That's not great. You don't want to do that. Ask me how I know.
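One way to find out before a multi-gigabyte download does it for you is to check what kind of file system a directory actually lives on. The sketch below is plain Ruby and assumes a Linux-style mount table (the format of `/proc/mounts`: device, mount point, file system type, and so on); the helper names are hypothetical, and it ignores details such as escaped spaces in mount points.

```ruby
RAM_FILESYSTEMS = %w[tmpfs ramfs].freeze

# Returns the fields of the mount entry with the longest mount point
# that contains the given path, or nil if nothing matches.
def mount_for(path, mounts_text)
  mounts = mounts_text.each_line.map(&:split).select { |fields| fields.size >= 3 }
  candidates = mounts.select do |_device, mount_point, _fstype|
    mount_point == "/" || path == mount_point || path.start_with?("#{mount_point}/")
  end
  candidates.max_by { |_device, mount_point, _fstype| mount_point.length }
end

# True when the path sits on a RAM-backed file system.
def ram_backed?(path, mounts_text)
  mount = mount_for(path, mounts_text)
  !mount.nil? && RAM_FILESYSTEMS.include?(mount[2])
end

# Sample mount table for illustration; the devices and paths are made up.
sample = <<~MOUNTS
  overlay / overlay rw,relatime 0 0
  tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
  /dev/nvme1n1 /scratch ext4 rw,relatime 0 0
MOUNTS

puts ram_backed?("/tmp/download.bin", sample)     # /tmp is tmpfs in the sample
puts ram_backed?("/scratch/download.bin", sample) # /scratch is ext4 in the sample
```

On a real Linux container the same check could be run as `ram_backed?(Dir.tmpdir, File.read("/proc/mounts"))` after `require "tmpdir"`, and a job that expects large files could refuse to start, or pick another directory, when the answer is true.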
So think about that. Another thing that happens for me:
a lot of my jobs are network hungry, not really CPU or RAM bound. So what do
you do about that? The kind of thing you might initially think is, I'm
going to go look at the product pricing page for my cloud provider and go, hey, what
is the performance level for the network here? It's going to say "up to" a number. Well, don't do that. Instead,
go to the documentation and find the minimum performance. Usually there's a per-flow rate and an
aggregate flow rate or whatever. Pick whichever one's smaller, and then map that to
the CPU and RAM that each instance type has, and use CPU and RAM as a proxy
for the amount of network performance you need. So if you know that you need to be on an instance serving five gigabits a second,
it would be really bad to have jobs that could get scheduled on an instance whose minimum guaranteed performance
is a megabit, or 10 megabits, or 50 megabits. You would clog
up that instance's pipe really quickly, even if it might be able to boost to 10 gigabits per second or something.
It's unfortunate that you can't really use that metric directly
for scheduling; you have to convert it to CPU and RAM. You
can't just use network performance.
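Here is a small sketch of that conversion in plain Ruby. Every number in the catalog is invented for illustration, as are the instance names and the `reservation_for` helper; the real per-flow and baseline figures have to come from your cloud provider's documentation.

```ruby
# All figures are made up for illustration; look up the real ones in your
# provider's documentation, not on the pricing page.
InstanceType = Struct.new(:name, :vcpus, :memory_gib, :per_flow_gbps, :baseline_gbps, :burst_gbps)

CATALOG = [
  InstanceType.new("small",  2,  4,  5.0,  0.75, 10.0),
  InstanceType.new("medium", 8,  16, 5.0,  2.5,  10.0),
  InstanceType.new("large",  32, 64, 10.0, 12.5, 12.5)
].freeze

# Plan around the smaller of the per-flow and sustained aggregate rates,
# never around the "up to" burst number.
def guaranteed_gbps(type)
  [type.per_flow_gbps, type.baseline_gbps].min
end

# Smallest instance type (by vCPU) that still guarantees the bandwidth a job
# needs. Its CPU and RAM become the reservation that stands in for network.
def reservation_for(required_gbps, catalog = CATALOG)
  fit = catalog.select { |t| guaranteed_gbps(t) >= required_gbps }.min_by(&:vcpus)
  fit && { instance: fit.name, cpu: fit.vcpus, memory_gib: fit.memory_gib }
end

p reservation_for(5.0) # only "large" guarantees 5 Gbps in this catalog
p reservation_for(1.0) # "medium" is enough for 1 Gbps
```

Reserving that much CPU and RAM for the job is what keeps the scheduler from placing it on an instance type whose guaranteed bandwidth is too low, even though the burst number on paper looks fine.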
Also, you might have a different networking strategy you want to use besides Docker or whatever. You can do
whatever; there are a lot of different choices out there, all with different tradeoffs. You might need to do some tuning for your cloud provider.
There's all kinds of stuff like that. There are also some more sharp edges, and really these are about how they
create customer impact.
I'm sure that no one here has ever pushed
a code change out to a job that then suddenly is failing or not performing correctly, and then, oops, I've got to roll
this back immediately. No one's ever done that. I have. And
what I found is that it's really valuable to have a key in Redis that you can just
set, and then jobs with that class name don't run anymore; they just immediately fail. You can use a
superclass, basically, that implements this or whatever. Sidekiq also lets you pause queues, but,
being real, you probably have more than one kind of job in a queue, so it's really not helpful to just pause the
queue, because you're still creating that customer impact. Instead, you can kill just that job: while the key is set,
the job fails and gets retried, and then you can fix the code, deploy it, pull the key out, and then everything's good.
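A rough sketch of that kill switch, in plain Ruby so it runs without Sidekiq or Redis. The `KillSwitch` class here is an in-memory stand-in for the Redis key, and `KillableJob`, `JobDisabled`, and `ThumbnailJob` are hypothetical names; in a real application the superclass would include the Sidekiq job module and look the key up in Redis.

```ruby
require "set"

# Stand-in for the Redis key: a set of disabled job class names.
# In production this would be a key you set and remove by hand.
class KillSwitch
  def initialize
    @disabled = Set.new
  end

  def disable(job_class_name)
    @disabled << job_class_name
  end

  def enable(job_class_name)
    @disabled.delete(job_class_name)
  end

  def disabled?(job_class_name)
    @disabled.include?(job_class_name)
  end
end

class JobDisabled < StandardError; end

# Hypothetical superclass: subclasses implement #run, and #perform fails fast
# while the switch is set, so the job lands in the retry path.
class KillableJob
  class << self
    attr_accessor :kill_switch
  end

  def perform(*args)
    if KillableJob.kill_switch.disabled?(self.class.name)
      raise JobDisabled, "#{self.class.name} is disabled by kill switch"
    end
    run(*args)
  end
end

class ThumbnailJob < KillableJob
  def run(image_id)
    "thumbnail for #{image_id}"
  end
end

KillableJob.kill_switch = KillSwitch.new
KillableJob.kill_switch.disable("ThumbnailJob")
begin
  ThumbnailJob.new.perform(42)
rescue JobDisabled => e
  puts e.message # a real queue would schedule a retry here
end
KillableJob.kill_switch.enable("ThumbnailJob")
puts ThumbnailJob.new.perform(42)
```

Because the disabled job raises instead of silently returning, nothing is lost: the normal retry mechanism keeps the work around until the fix is deployed and the key is removed.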
You've limited the customer impact to one thing. You might also have jobs that get re-executed frequently,
because Clockwork or cron or whatever is going to start those jobs.
If they fail, think about whether
or not they need to be retried in the first place. You also should probably avoid having a job
get retried while a newly scheduled run of the same job is already running concurrently.
That's a situation that's bad. Again, experience. Another
thing: if you have jobs that fail terminally, don't retry those. sidekiq_retry_in implements that if
you have a new enough version of Sidekiq in your project, and if you don't, sorry, but upgrade if you can.
And sometimes you have jobs where, if it gets retried after 10 minutes, it's worthless to have done it to begin with. Don't
retry those. Just customize the retry policy so that it doesn't try to run it 3 days from now if it can only be
valid 10 minutes from now or whatever.

So the sixth lesson: Sidekiq is not the only system out there.
It's complementary with a lot of systems, and you're not going to be able to say, like, pick one. They
all do different things, they're all really good, and you should use whatever the best tool for the job is. Sidekiq is
the best whenever you want to do work in Ruby later, from within Ruby, and that's all you want to
do. If you want to schedule a job from a Java
application and then run it in a Ruby application later, it's probably not the tool for you. But maybe Faktory would be.
Faktory is a lot like Sidekiq architecturally, and it has some neat features, like progress tracking, which might be
a thing that in hindsight I would have used. I don't know if I would use it or not, but I've definitely considered it. But if
you're going to use these other systems, don't use Active Job abstractions, because you're using those systems for their
special features, and you don't want to make those special features disappear, because what's the point in doing that?
So what are those other systems, and
what do they do for you? The high-level ones are Kafka, SQS, and AMQP or whatever.
Kafka is really a log. Consumers track where they are in the log, there are transactions, you can have exactly-once
semantics, and really Kafka is bad at the things Sidekiq is great at.
You can't really reliably execute individual jobs in Kafka, because it costs a lot of resources to acknowledge
individual messages, so instead you want to acknowledge a batch of them. And
what happens if some fail? It's not particularly great at that, where Sidekiq is
particularly great at that. Also, I'll plug Karafka: it's a really great processing framework.
I highly recommend using it if you're using Kafka. If you're using SQS and SNS, which you might be using in addition to
Kafka, one thing to keep in mind is that you can really only have one consumer per queue. So if you need
to have two different things that don't do the same thing, you have to have two queues and use SNS to fan out to the two
queues. Typically, I've been using SQS to fan
messages out over to Kafka, or to queue Sidekiq jobs directly in some rare situations,
but usually it's to just process it entirely on AWS using Lambda, and
that's what most people do. But again, with all of these, you're probably going to have three or
four different tools in place. AMQP and MQTT: that's a telemetry thing.
Most of the time an AMQP or MQTT message is not actually going to be a job that needs to be executed; it's going to
be an observation, or a set of them. There's a whole bunch of
different semantics that are possible,
and it really depends on the software involved. It's not really the same thing
as any of the other things. So, for instance, you might have your
AMQP queue or whatever, and
use something like Apache Flink to
process the messages and submit the observations as, like, here are the action items. Send those to Kafka,
and then use Karafka to consume the Kafka topic and schedule Sidekiq jobs. That's a thing
you might do in that situation. Again, you might run
every one of these, or even more, and it's all totally reasonable.

So what's the seventh lesson? The last one is to pay
Mike. Sidekiq Pro.
Yeah, Sidekiq Pro is worth it. There's a whole bunch of cool features, besides the fact that you're
supporting the development of it, and the fact that Mike can be the person leading this track and all that.
Besides that, Sidekiq Pro has a bunch of cool features. It also solves some licensing issues if
you work in a company that doesn't like the LGPL. I think it's easy to comply with, but that's my opinion, not
necessarily the opinion of corporate legal. So it makes it really easy to
deal with that, because now you're paying for it and they don't care. super_fetch is really good, reliable push is really
good; both of those would be worth it individually, on their own. Batches are useful;
again, worth it. Enterprise: maybe this applies to you, maybe it doesn't, but it
checks all the compliance boxes that nobody really wants to deal with.
It's one of those things that you might have to do, or you might just roll your own, but it's a lot easier
to just get Enterprise. It also has a bunch of other features, like
the SSO thing I mentioned earlier. And really, that's the main takeaway from this: it's really worth
it to get Sidekiq Pro at the very least, particularly if you're at scale. That's what we run. We don't need the
stuff in Enterprise, because before Enterprise existed we had already dealt with the compliance stuff. So anyway, that's the
end. I don't know if I'm over or not, because the clock stopped working,
but these are all my socials. I will probably upload the thing to
my website at some point, but I didn't think about it, so it won't be until I'm back home, next week sometime,
whatever. I don't know if we have time for questions. If you have questions, shoot. If you don't have questions,
or whatever, I have stickers: more than just the two that are on the screens over there. Well, not more than the
ones that are on the screen. If you're interested, I've got them in big size and small size. One thing I ask is, if you put
my character somewhere interesting, send me a picture. I'm in the Oakland
Coliseum. So, anyway, anybody have questions? I didn't look
up. Yeah, so, I guess this is really just an
observation: if you're trying to fault isolate, then when you have a job that's 30 minutes, or 30
seconds rather, and you have five jobs and you don't want them to be able to compete for resources, and you isolate
them, you end up with what you said, like 30 seconds A, B, C, D, whatever. It's really hard to reason about that, and I
agree. The thing I left out is that in practice we just have
our queues named not for the time but rather for the function. So we would have
listen tracking, for instance, and everything listen-tracking related is going to take a certain amount of time,
and that goes in one place. That said, at some point that becomes
really hard to reason about, and it would be way better if I could say something like delivery_10_milliseconds
and then delivery_5_seconds, because
there are two things that are the same thing but different, and one takes longer. So that would be better. But
there's a certain inflection point where it makes a lot more sense to just do what I did and name them not the
way that Adam suggested. And I would say that at a certain scale, I guess,
it'd be really annoying to have to deal with: oh, how long does this queue take? Which queue should I put this job into? I don't
know, I'll put it in this one. And now you're creating the problem you tried to avoid to begin with, because you have 50
developers and it's just super hard. So anyway, those are my thoughts on
that. But yeah, I didn't mention it; it's a lie of omission, is what that is. Totally get it,
though. Thank you. Thank you. That concludes our morning program. Please enjoy your