Summarized using AI

Lessons Learned Running Sidekiq at Scale

Keith Gable • December 10, 2024 • Chicago, IL • Talk

Summary of "Lessons Learned Running Sidekiq at Scale"

In this presentation, Keith Gable shares his extensive experience using Sidekiq and the various challenges he has faced over the past decade. The talk primarily focuses on crucial lessons learned while operating Sidekiq in different environments, ranging from a small startup to a large corporation.

Key Points:

  • Separate Database Usage:

    • Importance of using a dedicated Redis database for Sidekiq instead of sharing one with Rails.
    • Common mistakes made due to cost-cutting, leading to potential issues in scalability and data loss.
  • Planning and Scaling:

    • It is critical to plan for scaling from the beginning, rather than scrambling to adapt as needs grow, to avoid blocking queues with jobs that require extensive resources.
  • Isolation of Job Queues:

    • Use named queues that align with Service Level Agreements (SLAs), allowing for better resource allocation and isolation of jobs, aiding performance and security.
  • Observability:

    • Collect job metrics yourself, more than once per minute, so that scaling and paging decisions rest on several data points; do not rely solely on APM tools for job performance data.
  • Avoiding Common Pitfalls:

    • Implement idempotency checks in jobs to prevent duplicated work or data corruption during processing failures.
  • Integration with Other Systems:

    • Recognizing Sidekiq as part of a larger toolkit for queuing systems, understanding how it complements technologies like Kafka and SQS, while leveraging their strengths appropriately.
  • Investment in Sidekiq Pro:

    • Discussing the benefits of Sidekiq Pro's features that greatly enhance operational capabilities and provide compliance advantages, especially for scaling applications effectively.

Conclusion:

Gable emphasizes the need for systematic planning, thoughtful isolation of jobs, and the importance of investing in the right tools to enhance performance and manageability when using Sidekiq. His experience along the journey spotlights that making informed architectural decisions significantly impacts the scalability and reliability of the system. The takeaway is clear: mistakes are inevitable, but learning from them systematically leads to better engineering practices.


I've been using Sidekiq for more than a decade and have made a ton of mistakes along my journey from being the second engineer at a tiny startup to being acquired by a huge company over the past 10 years.

I aim to share my operational experience, architectural advice, and how I got out of sticky situations when I made poor choices.

RubyConf 2024

00:00:15.280 hey now so get spaceo so a little more
00:00:20.439 about me um you know uh I have been in the Ruby Community since 2004 uh with my
00:00:27.840 character's name the entire time I've been using Sidekiq for a long time as well uh across several different
00:00:33.840 companies um and you know like many of you uh I learned Ruby by why's guide and
00:00:40.239 so like we were talking about it on Mastodon hey do the do the thing in character so I am and uh we'll see how
00:00:49.280 this goes um anyway so uh the way that I set this presentation up is it's largely
00:00:55.920 going to be the uh mistakes that I've made doing this for a long time and
00:01:02.039 some of them are really boneheaded mistakes and so don't feel bad if you do that sort of stuff too um and you know
00:01:09.280 there's a lot of them I've done a lot of mistakes I've made but these are the top seven um and also going to assume that you're familiar with Sidekiq and if not
00:01:16.479 hopefully you can hang a little bit um anyway so um this is really my opinion
00:01:24.159 and not necessarily anyone I've ever worked for just to put that out there so
00:01:29.960 what's the first lesson um use a separate database don't do what I
00:01:35.159 did um so we've been using I've been using Sidekiq for a long long time and uh early early on it wasn't really well
00:01:42.680 understood what would happen if you shared the database with Rails and um eventually made it to the wiki and it's
00:01:49.479 definitely never do that as the answer um if you can avoid it what whatever so
00:01:56.960 the reason for that is that um you you know okay we didn't put the other database up because we're a startup and
00:02:03.119 it costs money and so why bother and um
00:02:08.440 why would you want to maybe have a separate database well you're going to need to scale out the Rails side of things right
00:02:13.760 you're going to want to use you're going to want to leverage cluster and you're going to want to leverage um you know
00:02:19.120 having sort of isolation between the different databases and stuff like that Sidekiq kind of doesn't really need that
00:02:24.319 it can work great with one database no problem and so what do you do right well
00:02:30.280 you might try okay well if you want to use a different you Redis supports a different database index So like um you
00:02:35.840 could use the database index one put Sidekiq there and you're technically only running the one database but that
00:02:41.599 doesn't actually work because they still compete for resources the clustering is configured more or less the same and
00:02:47.319 also like a lot of different distributions of you know Redis or Valkey don't
00:02:52.640 support um numbered databases so here's some ideas on how to fix this or at
00:02:59.480 least try to so the Brute Force solution uh what would that be right it's the simplest shut everything off copy
00:03:07.680 paste no that doesn't work that I mean you could do it if you can go offline
00:03:13.799 but that doesn't yeah that that doesn't work at scale right so what's the second one you might try well Redis
00:03:20.400 replication right that kind of works so we with this it's like the job queuing never stops um or you know like you can
00:03:27.599 still write but you can't read and um you know you don't really have any user facing downtime but the way that
00:03:34.799 this works is it leverages sort of a a feature/misfeature of Redis wherein
00:03:40.040 you can write to a replica which is not a thing you normally would want to do and so if you set it up that way and you
00:03:45.879 write to this replica and you know everything's good great you've migrated over no problem problem replication
00:03:53.159 restart throws all your data away so if you started to migrate your writes and they're about halfway over and replication restarts Network went down
00:03:59.920 the replication buffer overfilled whatever you've just lost all those jobs permanently and you're never going to get them back so that doesn't
00:04:06.840 quite work and of course if you did get it to work the first way like you have to just immediately turn off replication
00:04:12.200 and take a snapshot and then you know hope that it didn't fail at any point in that process and okay that doesn't do it
00:04:20.000 for you either right also another thing is not every provider gives you the access to use REPLICAOF and in those
00:04:26.639 cases maybe you could use RIOT but maybe you can't and uh you know RIOT it's a thing that you can download from Redis and
00:04:32.440 it's it depends on the provider but it may or may not work for you so okay what
00:04:37.720 do you do right well we have to manually replicate we have to do something so um
00:04:43.199 what we've done or you know I've done talked with some colleagues about is that basically just Implement a manual
00:04:50.080 replication process and that's essentially what I'm going to go with um and this is also a high level overview
00:04:55.479 but you know sort of the keywords to think about if you're trying to think
00:05:01.120 about this problem too is we basically just you know RPOPLPUSH the jobs to a new key and then like use the sorted
00:05:08.120 set commands to move them in batches and deal with Sidekiq Pro batches and all of that stuff um the whole idea is really
00:05:15.199 that like there's not a step in the process that's going to result in data loss um you know that would definitely be a problem if that happened um and so
00:05:23.360 you can always revert every step yada yada yada um the only downtime again is that like you can't execute jobs but you can
00:05:30.680 queue them so that's that's good right uh you also end up with the end state of like the source database and destination
00:05:37.039 database are clean and you don't have to remove any data that's left behind anything like that and I'm just
00:05:42.199 anecdotally trying to advocate for open sourcing the migrator once we're done with it but we'll see I can't promise
00:05:49.240 anything so what's the second lesson um this one's really you know kind of uh
00:05:54.560 common knowledge at this point um Adam McCrea of Judoscale wrote a really good article about this called An Opinionated
00:06:01.280 Guide to Planning Your Sidekiq Queues and it's really good it covers everything you need to know go search for that um
00:06:07.360 or you know find him I think he's here anyway so the one thing I'd say is like
00:06:14.000 you don't necessarily have to start with a gigantic cluster of Sidekiq processes uh you do eventually need to have that
00:06:19.680 as a thing you want to scale to so think about it like make sure you're prepared for it you know maybe like give yourself
00:06:25.000 the provision for that in your you know infrastructure orchestration stuff like that and of course what happens if you
00:06:30.680 don't do this right and this is the mistake that I that I made um you end up with queue blocking so like something simple
00:06:37.280 like okay you have jobs that take tens of milliseconds so you're going to queue those 10 you know 10,000 jobs like maybe
00:06:43.759 sending activity emails or something so you got those 10,000 jobs queued but they're not going to execute because
00:06:50.080 you've got like image processing or transcoding or something that takes like a minute and there's like a dozen or so of those and uh that's just using all
00:06:56.759 your resources and you can't possibly run the things that are super fast and
00:07:02.039 um you can't really use autoscaling for this because it takes minutes you know sometimes 10 minutes or so to autoscale
00:07:07.800 out and you know this sort of when you run into this and you hadn't really you
00:07:12.879 know again learned like I did um then you kind of get gun-shy
00:07:30.080 would be better if it's just a long-running Sidekiq job right so what I would say is like I've done jobs that
00:07:35.400 take half an hour hour two hours sometimes and don't be afraid of them just make sure they're in a queue by
00:07:42.080 themselves um so the third lesson is um to kind of a little bit contradict what
00:07:49.759 I just said about using queues named after the SLA you want um but uh to fault
00:07:55.599 isolate and so this is like think about like boundaries for like Packwerk right or I guess pack anyway um if you're
00:08:03.599 going to modularize with Packwerk then like uh you're already kind of drawing these boundaries right you're going to have different components and stuff like that
00:08:10.759 and you can think about your Sidekiq queues the same way right so you have your SLA-based queues but you like prefix them
00:08:16.759 with the name of the component and you know you don't necessarily always do this but this is a thing where you want to do that because um you know these are
00:08:24.240 going to typically be different like Amazon ECS services um that are all serving
00:08:29.960 a different set of queues or whatever and you know what you really want is for
00:08:35.760 that to um how do I put it you want that to the things to not necessarily be able
00:08:42.959 to interact with each other like you might want to have different resources allocated you know different CPU different Ram whatever you might want to
00:08:50.040 pack them more efficiently or less efficiently depending on whether or not the job can technically go a little bit
00:08:55.279 over its SLA or maybe it can't right like that's the thing you need to think about the most thing like the biggest
00:09:01.279 kind of important thing that is I think not thought about all that much is that the security context is important and
00:09:06.839 you know my background is in information security so this is what I think about at night sometimes anyway so what's the
00:09:16.399 um what's that mean right so what that really means is that like you okay you have like you know Kubernetes pods or
00:09:22.040 ECS services or whatever those are going to get different secrets assigned to them those are going to have different you know on AWS be an IAM role whatever
00:09:28.920 your cloud provider has um and you know you kind of don't necessarily need something that's just doing image
00:09:34.760 processing to be able to like read and write customer CRM data right like you know might not need that and so you can
00:09:41.920 restrict that at a level that it doesn't matter if someone could just upload an executable and run it it doesn't do
00:09:47.000 anything right like you can't access that data or do anything bad to it um
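A minimal Ruby sketch of the naming scheme described above, where each SLA-based queue is prefixed with the name of its component. The module name, the SLA labels, and the component names are assumptions made up for illustration; none of them appear in the talk.

```ruby
# Hypothetical helper: builds queue names such as "images_within_5_minutes".
module QueueNames
  # Assumed SLA labels; pick whatever latency targets your jobs really have.
  SLAS = %w[within_30_seconds within_5_minutes within_1_hour].freeze

  # Name of the queue for one component at one SLA.
  def self.queue_for(component, sla)
    raise ArgumentError, "unknown SLA: #{sla}" unless SLAS.include?(sla.to_s)

    "#{component}_#{sla}"
  end

  # Every queue a single isolated worker pool (for example one ECS service)
  # would listen on. Other components' queues are deliberately absent.
  def self.queues_for_component(component)
    SLAS.map { |sla| queue_for(component, sla) }
  end
end

puts QueueNames.queues_for_component("images").inspect
```

In a real job class the resulting string could be passed to `sidekiq_options queue: ...`, and each worker pool would be started with only its own component's queues, so that its secrets, IAM role, and database user can be scoped to that component.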
00:09:52.440 you can also use a restricted database user you know um again in the image processing example you might say okay
00:09:58.200 well the image is going to get uh you I'm only going to can only read/write the images table and read a few others and
00:10:03.600 that's it um and so you know sort of like I said really limits the impact of a remote code execution vulnerability and
00:10:10.000 the example I'm thinking of is Mastodon with ImageMagick um so if that were fault
00:10:15.440 isolated in the way that it should be then um it would be only able to have
00:10:21.279 affected images right because you have the image processing is happening in a completely separate set of containers
00:10:27.800 that are isolated from the others by virtue of the you know container ecosystem it probably you could still
00:10:33.800 Escape there's a lot of container escapes out there but it's more difficult right um You also can deal
00:10:40.360 with the fact that some jobs might need to more aggressively or less aggressively scale out and you know you need to have a knob to turn for that
00:10:46.519 theoretically and you know that said fast iteration is most important um when you're starting out so really this
00:10:53.360 is more think about this have this in the back of your mind when you're when you're working on it and kind of don't create extra tech debt for yourself
00:10:59.839 uh so you can go back in a future refactor and add this so the fourth lesson is to ensure observability
00:11:07.160 um and I will make a a little bit of a I don't
00:11:12.959 know uh admission I don't run the Sidekiq dashboard um and you know originally it
00:11:19.440 was because I didn't have the time to set it up to make it access restricted and then I did and uh I'd already set
00:11:27.639 the the metrics collection so I didn't really need it and then now I you know the company now got acquired by a big
00:11:34.079 company and uh we need to put SSO in front of it it's more complicated and still haven't done it and you know
00:11:39.920 Sidekiq Enterprise gives you a hook for a lot of this but yeah it's a lot of
00:11:45.079 work it's not a lot of work it's just a little bit of work that doesn't necessarily bring all that much value at
00:11:51.360 the moment um but regardless if you're going to run the Sidekiq dashboard don't make it internet accessible put it behind a
00:11:57.440 Bastion host or restrict it to a certain IP address range or whatever don't uh don't put it on the internet that that's
00:12:03.800 that's not a good idea um so in addition to the different uh you know libraries I
00:12:10.200 have on the screen you can also roll your own with Sidekiq::Stats that's what you know I did because um at the time
00:12:16.760 these didn't exist um and you know really regardless of whatever system you're going to use to collect metrics
00:12:23.920 um collect them more than once per minute the reason for that is really pretty simple it's that you don't want
00:12:30.160 to just use one data point to make a decision on oh I need to scale out or something bad's happening I need to page somebody right you want a few and you
00:12:37.240 want a few over some time period and that time period being 10 minutes might be the difference between having
00:12:43.720 customer impact or not right and so if you could make that you know time to response be a few minutes like one or
00:12:49.480 two or three because you're you know measuring every 15 20 seconds or something that's way better than if it
00:12:55.600 takes 10 minutes because if you think about like oh well the action is that scale out well uh uh scaling out if you have to
00:13:03.120 launch new instances can take 10 minutes and now again you're back to having customer impact if the jobs have to
00:13:08.240 execute quick enough right so you know you really excuse me you really want to avoid um you know really want to keep
00:13:15.600 track of that one thing though is that uh it's going to require a little bit of special consideration because usually
00:13:20.639 high resolution metrics cost more and sometimes require you to use a different API to publish them or whatever but you
00:13:26.199 know that's something that you can plan for accordingly um um you can also think
00:13:31.360 about like okay APM application performance metrics right and uh what I would say is that like that's not going
00:13:37.880 to be a way out that's that's the that's the mistake in this slide um is is you
00:13:42.920 can't really rely on APM to tell you that your jobs are performing poorly because it's going to tell you the code path that's performing poorly which
00:13:49.279 might not translate to the job actually being bad right like you might have a method that is not necessarily the most
00:13:54.959 optimal it can be but hey guess what it's still executing way faster than you need
00:14:00.279 it to and you don't need to care so it doesn't frequently correlate and that's kind of unfortunate because usually
00:14:06.240 you're already using some tool to do that um I think the sponsors or one of them is an APM provider so you're using
00:14:13.279 their tool theoretically and um you know if you are awesome and uh just probably
00:14:20.240 want to also collect your own separate metrics for job performance not just APM
00:14:25.320 um but you know what do you have when you have metrics what do you do right well like I said you probably are going
00:14:31.160 to end up paging someone in the worst case and then in the best case you have data to draw from to make decisions
00:14:36.720 about how to scale the different Sidekiq you know runners workers whatever you
00:14:42.040 have you know like in the case of like Amazon ECS right you have different ECS services and so you want to give some
00:14:47.600 more CPU and some less CPU and you know it's you know how do you know which ones
00:14:53.399 which you know how do you know what is the best choice is that you don't you have to have the data and by collecting the job performance metrics you're able
00:14:58.800 to figure that out um and then you know also it's like you know how many
00:15:04.560 different uh instances of the service are you running how many different you what's the Sidekiq concurrency set to
00:15:09.920 all of these are things that like you can collect this data and then easily go compare later hey this is good this is
00:15:16.320 bad you know I'm going to make a tweak and you know make it more efficient and okay it still performs just as well I'm
00:15:21.800 GNA continue to tweak that until I'm you know at my cost optimization goal or whatever right um that said for Auto
00:15:29.279 scaling uh don't use CPU or Ram utilization as a proxy for performance like that you need to scale out or scale
00:15:35.959 back in doesn't really work very well for that uh in my experience and that's the mistake that I made here um and you
00:15:43.680 know it's it's uh unfortunate because it's you know really unhelpful to have
00:15:48.720 uh those be provided by default and you can't use them um I mean I guess if you did nothing I guess you could use them
00:15:55.560 and that would be okay but like it really um I would suggest just hooking up you wiring up Auto scaling with some
00:16:01.959 of your metrics you're collecting so the other thing is um you know um your scale
00:16:08.000 in and out logic might need to be more complicated uh and really the the key thing is to eagerly scale out but
00:16:13.720 conservatively scale in so the fifth lesson is to avoid the sharp edges
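A rough Ruby sketch of the "eagerly scale out, conservatively scale in" rule, driven by several queue latency samples rather than by CPU or RAM. The class name, the thresholds, and the sample counts are assumptions for illustration. In a real setup the samples might come from something like `Sidekiq::Queue#latency`, polled every 15 to 20 seconds.

```ruby
# Hypothetical policy: decides what to do from recent queue latency samples
# (in seconds, oldest first), collected several times per minute.
class ScalingPolicy
  def initialize(sla_seconds:, scale_out_samples: 3, scale_in_samples: 40)
    @sla_seconds = sla_seconds
    @scale_out_samples = scale_out_samples # few samples: react quickly
    @scale_in_samples = scale_in_samples   # many samples: release capacity slowly
  end

  def decide(samples)
    # Scale out as soon as a short run of samples is over the SLA.
    return :scale_out if sustained?(samples, @scale_out_samples) { |s| s > @sla_seconds }
    # Scale in only after a long run of samples is far below the SLA.
    return :scale_in if sustained?(samples, @scale_in_samples) { |s| s < @sla_seconds * 0.25 }

    :hold
  end

  private

  # True when the most recent `count` samples exist and all satisfy the block.
  def sustained?(samples, count, &condition)
    recent = samples.last(count)
    recent.size == count && recent.all?(&condition)
  end
end

policy = ScalingPolicy.new(sla_seconds: 60)
puts policy.decide([12, 75, 90, 110])
```

Requiring a run of samples before acting is what makes sub-minute collection worthwhile: three samples at 20-second intervals give a decision in about a minute, while three samples at one-minute resolution take three.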
00:16:20.000 there's a lot of them and I thought about making this like a whole bunch of different ones but like then we're at
00:16:25.759 like lesson 50 or something and it's not worth it um the kind of the first thing I have on here is make the job
00:16:31.120 idempotent um and what that really means is that sometimes you have a situation you know say it happens 0.01% of the
00:16:37.720 time at a certain scale that's daily or hourly or you know minutely for some
00:16:43.800 folks um and so what might happen is that the job can do all of its work and
00:16:49.839 commit the transaction to your your RDBMS and everything is good and then it can't
00:16:55.279 mark it complete in Redis that packet gets dropped Sidekiq Pro will happily deal with that problem but
00:17:02.440 um if you you know not every situation would that still there's still the possibility that it wouldn't necessarily
00:17:08.520 be able to check the job back in as complete right and so um what happens when that job gets recovered well it's
00:17:15.720 going to redo all the same work again meaning if you didn't have any check for idempotency that you're going to redo all the work you're going to duplicate
00:17:22.480 rows theoretically uh it might be at this point too late to do the thing you needed to do and now you are creating
00:17:29.480 like a exception that lives forever until you go clear out the dead set or something so that's deeply unpleasant so
00:17:37.000 try to make sure that you're checking this when the job starts and then you know or at least some version of
00:17:42.320 idempotency whether it's the checking that the status is needing you still need to do the thing or whatever um
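A small Ruby sketch of the status check described above: the job looks at the record first and does nothing if an earlier run already finished the work. The `Invoice` struct and `SendInvoiceJob` class are hypothetical stand-ins. A real Sidekiq job would take an ID, load the row from the database, and do the check and the update inside a transaction.

```ruby
# Hypothetical record: a plain Struct stands in for a database row.
Invoice = Struct.new(:id, :status, :emails_sent)

# Hypothetical job body, written without Sidekiq so it runs on its own.
class SendInvoiceJob
  def perform(invoice)
    # Idempotency check: a recovered or retried job finds the work already
    # done and returns without repeating it.
    return :skipped if invoice.status == :sent

    invoice.emails_sent += 1 # the side effect that must not happen twice
    invoice.status = :sent
    :done
  end
end

invoice = Invoice.new(1, :pending, 0)
job = SendInvoiceJob.new
puts job.perform(invoice) # first run does the work
puts job.perform(invoice) # a re-run after a lost acknowledgement is a no-op
```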
00:17:47.960 another thing is like you know file system in a containerized environment is weird um because containers are going to have quotas the file system is going to
00:17:54.480 have a quota they're not necessarily going to be set to what you expect them to be and they might need to be tuned or
00:18:00.480 might not need to be tuned whatever um you also might have it where each
00:18:05.720 container has a volume for itself or it might be there's a shared shared volume and you know it's for the whole instance
00:18:12.400 and you know that's for like cache or something and you know you kind of those are all things to think about but uh if
00:18:18.640 you're going to use the file system and you don't really know what the configuration is then you might end up
00:18:24.080 in a bad situation for instance temporary directories those are usually by default RAM disk nowadays and uh so
00:18:30.760 you're downloading say a multi-gigabyte file which you think is going to a disk that you don't really care about the size of but it's actually going to RAM
00:18:36.480 and you crash the container that's not great you don't want to do that um ask me how I know
00:18:42.600 that um so you know think about that um also another thing that happens for me a
00:18:50.039 lot of my jobs are network hungry um not CPU or RAM bound really and so what do
00:18:55.880 you do about that right like the kind of thing you might initially think is I'm
00:19:01.480 going to go look at the product pricing page for my cloud provider and go hey how much does it you know how what what
00:19:07.120 is the the performance level for the network here right it's going to say up to a number well don't do that instead
00:19:13.679 go to the documentation and find the minimum performance usually it'll be like there's a uh a per flow rate and a
00:19:19.720 peak like a you aggregate flow rate or whatever pick whichever one's smaller and then you know use that to map to the
00:19:26.919 the CPU and RAM that are uh each instance type has and then use that to kind of use CPU and RAM as a proxy
00:19:34.600 for the amount of network performance you need right so like if you know that like I need to be on this instance serving you know five gigabits a second
00:19:41.760 it would be really bad to have jobs that could get scheduled on an instance that has a minimum performance that's
00:19:47.600 guaranteed to you that is a megabit right or 10 megabits or 50 megabits like that would be you would you would clog
00:19:54.080 up that instance's pipe really quickly even if it might be able to boost to 10 gigabits per second or something and you
00:20:01.520 know it's uh unfortunate you can't really you know use that metric as a
00:20:07.440 thing to you know you have to you have to convert it to being CPU and RAM you can't just use network uh performance um
00:20:14.480 also like you might have a different you know networking strategy you want to use besides Docker or whatever um you can do
00:20:21.760 whatever right there's a lot of different choices out there there all different tradeoffs you might need to do some tuning for your cloud provider
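One way to sketch the bandwidth-to-CPU mapping described above, in Ruby. The instance figures are invented for the example and are not taken from any provider's documentation. The 1024-units-per-vCPU convention is the one Amazon ECS uses for CPU reservations, and the function name is hypothetical.

```ruby
# Hypothetical instance description. Take the guaranteed (minimum) numbers
# from the provider's documentation, not the "up to" figure on the pricing page.
InstanceType = Struct.new(:name, :vcpus, :baseline_gbps, :per_flow_gbps)

CPU_UNITS_PER_VCPU = 1024 # ECS-style CPU units

# CPU units to reserve so that a task needing `needed_gbps` of network
# throughput claims a proportional share of the instance.
def cpu_units_for_bandwidth(instance, needed_gbps)
  # Use whichever guaranteed rate is smaller, as the talk suggests.
  usable_gbps = [instance.baseline_gbps, instance.per_flow_gbps].min
  share = needed_gbps / usable_gbps.to_f
  raise ArgumentError, "#{instance.name} cannot guarantee #{needed_gbps} Gbps" if share > 1

  (share * instance.vcpus * CPU_UNITS_PER_VCPU).ceil
end

example = InstanceType.new("example.large", 4, 5.0, 10.0)
puts cpu_units_for_bandwidth(example, 1.25)
```

Because the scheduler only understands CPU and RAM, reserving a quarter of the CPU for a task that needs a quarter of the guaranteed bandwidth keeps it from packing more network-hungry tasks onto the instance than its baseline can carry.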
00:20:28.080 there's all kinds of stuff like that there's also some more sharp edges um and really these are how the they
00:20:34.679 affect customer impact or how they have how they create customer impact and
00:20:40.480 so I'm uh sure that no one here has ever pushed a job out that uh or push a
00:20:45.840 code change out that is a job that then suddenly is failing or not performing correctly and then oops I got to roll
00:20:51.000 this back immediately um no one no one's ever done that so um I have um and so so
00:20:58.919 what I found is that it's really valuable to have it'd be possible to have uh a key in Redis that you can just
00:21:05.400 set and then jobs with that class name don't run anymore they just immediately fail and you just you can use a a you
00:21:12.159 know a super class basically that implements this or whatever and you know Sidekiq also lets you pause queues but you
00:21:18.840 know being real you probably have more than one job in a queue um and so that's really not helpful if you just pause the
00:21:26.120 queue because you're just still creating that customer impact um again you just can kill that job and then while you
00:21:32.559 make the job fail it'll get retried and then you can fix the code deploy it pull the key out and then everything's good
00:21:38.520 and you've limited the customer impact of like one thing um you might also have jobs that get re-executed frequently
00:21:45.039 because like clockwork or cron or whatever is going to start those jobs maybe if they fail think about whether
00:21:50.919 or not they need to be retried in the first place um you know you also should probably avoid having jobs that uh get
00:21:57.240 retried while the job is scheduled again that's already running concurrently
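The per-class kill switch mentioned a little earlier can be sketched in Ruby as a superclass that checks for a key before running. Everything here is an assumption for illustration: the class names, the key format, and the in-memory store, which stands in for Redis so that the example runs on its own.

```ruby
require "set"

# In-memory stand-in for Redis; a real version would check for a Redis key.
class KillSwitchStore
  def initialize
    @keys = Set.new
  end

  def set(key)
    @keys.add(key)
  end

  def delete(key)
    @keys.delete(key)
  end

  def exists?(key)
    @keys.include?(key)
  end
end

class JobDisabled < StandardError; end

# Hypothetical superclass: subclasses implement #run instead of #perform.
class KillableJob
  class << self
    attr_accessor :kill_switch_store
  end

  def self.kill_switch_key
    "kill_switch:#{name}"
  end

  def perform(*args)
    store = KillableJob.kill_switch_store
    # Fail immediately while the key is set. The failure goes through the
    # normal retry path, so the job runs again once the key is removed.
    raise JobDisabled, "#{self.class.name} is disabled" if store&.exists?(self.class.kill_switch_key)

    run(*args)
  end
end

class ResizeImageJob < KillableJob
  def run(image_id)
    "resized #{image_id}"
  end
end

KillableJob.kill_switch_store = KillSwitchStore.new
puts ResizeImageJob.new.perform(42)
```

Unlike pausing a whole queue, setting the key for one class stops only the broken job, and other jobs sharing its queue keep running.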
00:22:02.279 that's a situation that's bad um again experience um and you know kind of other
00:22:08.880 things is like you know we have jobs that fail terminally don't retry those sidekiq_retry_in implements that if you
00:22:14.360 have a new enough version of Sidekiq uh in your project and if you don't sorry um but upgrade if you can um
00:22:23.480 and you know sometimes you have jobs that are like you know if it gets retried after 10 minutes it's worthless to have done it to to begin with don't
00:22:29.600 retry those like just customize the retry policy so that it doesn't try to run it 3 days from now if it only can be
00:22:35.600 valid 10 you know 10 minutes from now whatever so the sixth lesson Sidekiq is not the only system out there
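A hedged Ruby sketch of the retry policy described just above. Recent Sidekiq versions let a `sidekiq_retry_in` block return a delay in seconds, or `:kill` or `:discard`. The function below mirrors that decision as plain Ruby so it can run without Sidekiq. The exception class, the ten-minute window, and the assumption that `enqueued_at` is epoch seconds are all illustrative, so check the job hash format for your Sidekiq version.

```ruby
# Hypothetical marker for failures that can never succeed on retry.
class TerminalError < StandardError; end

MAX_USEFUL_AGE = 600 # seconds; assumed window after which the work is worthless

# Mirrors the (count, exception, job_hash) arguments of a sidekiq_retry_in block.
# Returns :discard, or the number of seconds to wait before the next attempt.
def retry_decision(count, exception, job, now: Time.now.to_f)
  # Terminal failures: do not retry at all.
  return :discard if exception.is_a?(TerminalError)

  # Stale work: a retry that lands after the useful window helps nobody.
  age = now - job.fetch("enqueued_at")
  return :discard if age > MAX_USEFUL_AGE

  # Otherwise back off linearly: 10s, 20s, 30s, ...
  10 * (count + 1)
end

job = { "enqueued_at" => Time.now.to_f - 30 }
puts retry_decision(0, StandardError.new("timeout"), job)
```

Inside a job class the same logic would sit in the `sidekiq_retry_in do |count, exception, jobhash| ... end` block.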
00:22:42.400 it's complementary with a lot of systems and you're not going to be able to say like pick one you know they're all they
00:22:49.039 all do different things they all are really good and use whatever the best tool for the job is and so Sidekiq is
00:22:55.840 like the best whenever it's like you want to do work in Ruby later from within Ruby and that's what you want to
00:23:00.880 do if you want to schedule a job from a Java
00:23:08.480 application and then run it in a Ruby application later probably not the tool for you um but maybe Faktory would be
00:23:16.320 right like Faktory is a lot like Sidekiq architecturally it has some neat features like progress tracking might be
00:23:23.120 a thing that in hindsight I would have used but I definitely have considered it I I don't know if I would use it or not but I definitely consider it um but if
00:23:31.919 you're going to use these other systems don't use Active Job abstractions um because you're using systems for the
00:23:37.559 special features and you don't want to make those special features disappear because what's the point in doing that right so what are those other systems
00:23:44.159 what do they do for you there's you know sort of the high level ones are Kafka SQS and you know AMQP or whatever so
00:23:51.799 Kafka is a write-ahead log um consumers are going to track where they are in the log there are transactions you can have exactly
00:23:58.200 once semantics and um really Kafka is bad at the things Sidekiq is great at um
00:24:05.720 like you can't really reliably execute individual jobs in Kafka because it costs a lot of resources to acknowledge
00:24:12.039 individual messages so instead you want to like acknowledge a batch of them and
00:24:17.200 what happens if some fail right that's not it's it's not it's not particularly great at that where Sidekiq is is
00:24:23.520 particularly great at that um also I'll plug Karafka it's a really great processing framework
00:24:29.000 um highly recommend using it if you're using Kafka if you're using SQS SNS which you might be using in addition to
00:24:34.720 Kafka like you know one thing to keep in mind is that you can really only have one consumer per queue so if you need
00:24:40.960 to have two different like things that don't do the same thing you have to have two queues and use SNS to fan out to two
00:24:46.240 different queues um typically you know I've been using SQS to do like fan
00:24:51.760 messages out over to Kafka or queue Sidekiq jobs in some rare situations directly uh
00:24:58.000 but usually is to just process it entirely on AWS using Lambda um and you know that's what most people do but you
00:25:05.520 know again all of these are all these are things that um you're going to you know probably have three or
00:25:13.039 four different tools in place right um AMQP MQTT that's a telemetry thing um
00:25:19.080 most of the time an AMQP message or MQTT is not going to actually be like a job that needs to be executed it's going to
00:25:24.919 be an observation or a set of them and uh you know there's a whole bunch of different semantics that are possible
00:25:30.720 there's a whole bunch of different um it really depends on like software involved and it's it's not really the same thing
00:25:38.240 as any of the other things um so like for instance you might have your your amqp Q or whatever you might have that
00:25:44.080 that Q process your messages and uh use something like uh Apache Flink to
00:25:50.840 process the messages rather and like submit the observations like here's what action items right send those to Kafka
00:25:57.760 and then use Karafka to consume the Kafka topic to then schedule Sidekiq queues or Sidekiq jobs that's a thing
00:26:03.799 you might do um in that situation and so it's not um you know again you might run
00:26:10.399 every one of these or even more and it's all totally reasonable so what's the seventh lesson the last one is to pay
00:26:17.440 Mike um Sidekiq Pro
00:26:23.520 yeah Sidekiq Pro is worth it like there's a whole bunch of cool features besides the fact you're
00:26:30.399 supporting the development of it and the fact that you know Mike can be the person leading this track and all that
00:26:35.640 right like besides that Sidekiq Pro has a bunch of cool features it also solves some licensing issues if
00:26:41.480 you work in a company that doesn't like the LGPL um I think it's easy to comply with but that's my opinion um but not the
00:26:48.760 opinion necessarily of corporate legal so um yeah it makes it really easy to
00:26:53.919 deal with that because now you're paying for it and they don't care um super_fetch is really good reliable push is really
00:26:59.200 good both of those would be on their own individually worth it batches are useful
00:27:04.760 again worth it um Enterprise maybe this applies to you maybe it doesn't but it
00:27:10.360 checks all the compliance boxes um that nobody really wants to deal with um but
00:27:16.360 you know it's one of those things that uh you might have to or you might just roll your own but like it's a lot easier
00:27:21.399 to just do Enterprise um and you know it also has a bunch of other features like
00:27:27.279 the SSO thing I mentioned earlier um and really that's the main takeaway from this is that uh it's really worth
00:27:35.480 it to get Sidekiq Pro at the very least um particularly if you're at scale that's what we run um we don't need the
00:27:41.919 stuff in Enterprise because we've already had to deal with like before Enterprise existed we dealt with the compliance stuff so anyway that's the
00:27:49.519 end um I don't know if I'm over or not because the clock stopped working um
00:27:56.000 but this is all my socials um and and uh I will probably upload the the thing to
00:28:02.279 my website at some point but I didn't think about it and so it won't be until I'm back home so next week sometime
00:28:09.039 whatever um I don't know if we have time for questions if you have questions shoot them if you don't have questions
00:28:15.360 or whatever I have stickers more than just the two that are on the screens over there not more than the
00:28:22.000 ones that are on the screen if you're interested I've got them in big size small size one thing I ask is if you put
00:28:28.320 my character somewhere interesting send me a picture I'm in the Oakland
00:28:34.480 Coliseum so uh anyway that's anybody have questions I didn't look
00:28:40.240 up yeah so the um I guess this is really just an
00:28:46.039 observation that if you're trying to fault isolate then when you have a job that's like 30 minutes or 30
00:28:52.080 seconds rather and you have like five jobs and you don't want them to be able to compete for resources and you isolate
00:28:57.640 them right you end up with what you said like 30 seconds a b c d whatever it's really hard to reason about that and I
00:29:03.120 agree and so the thing I left out is that in practice we sort of just have
00:29:09.000 our queues not named for the time but rather the function so like we would have like
00:29:16.279 listen tracking for instance and we everything listen tracking related is going to take a certain amount of time
00:29:22.799 and that goes in one place and um that said though at some point that becomes
00:29:29.279 really hard to reason about and it would be way better if I could say like um you know something like uh delivery
00:29:37.440 underscore 10 milliseconds and then delivery underscore 5 seconds because
00:29:42.600 there's kind of two things that are the same thing but different and some takes longer and so that would be better but
00:29:49.320 there's a certain inflection point where it makes a lot more sense to just do what I did and name them sort of not the
00:29:56.240 way that Adam suggested and uh I would say that like at a certain scale I guess
00:30:01.720 it'd be really annoying to have to deal with um oh how long does this queue take which queue should I put this job into I don't
00:30:08.320 know I put it in this one and now you're creating the problem you tried to avoid to begin with because you have like 50
00:30:13.720 developers and it's just super hard so anyway that that's that's my thoughts on
00:30:19.039 that but like I totally um yeah I didn't it's a lie of omission is what that is um yeah totally get it
00:30:26.919 though um thank you thank you that concludes our morning program please enjoy your
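The queue-naming trade-off from the Q&A above, function-named queues versus function plus latency tier, could look roughly like this. It is a sketch under assumptions: the `QUEUES` table, the `queue_for` helper, and the thresholds are hypothetical, and only the names `listen_tracking`, `delivery_10ms`, and `delivery_5s` echo what was said in the talk.

```ruby
# Hypothetical queue table; the tier names and thresholds are illustrative only.
QUEUES = {
  listen_tracking: nil,                      # function-named: one queue for the whole area
  delivery: { "10ms" => 0.010, "5s" => 5.0 } # function plus latency tier (limit in seconds)
}.freeze

# Returns the queue name for a job, given its functional area and
# its expected runtime in seconds.
def queue_for(function, expected_seconds)
  tiers = QUEUES.fetch(function)
  return function.to_s if tiers.nil?

  # Pick the tightest tier whose limit still covers the expected runtime.
  tier, = tiers.sort_by { |_, limit| limit }.find { |_, limit| expected_seconds <= limit }
  raise ArgumentError, "no #{function} queue accepts #{expected_seconds}s jobs" unless tier

  "#{function}_#{tier}"
end
```

A job class would then point at the resulting name with Sidekiq's `sidekiq_options queue: ...`. Keeping the choice in one helper is one way to avoid 50 developers each guessing which queue a job belongs in.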