Plan to Scale or Plan to Fail: An Evidence-Based Approach for Improving Systems Performance

Summarized using AI

Jade Dickinson • November 13, 2024 • Chicago, IL • Talk

In the talk "Plan to Scale or Plan to Fail: An Evidence-Based Approach for Improving Systems Performance," Jade Dickinson discusses a methodology aimed at addressing scaling challenges for Rails applications. Operating at scale at Theta Lake, a fintech company, Dickinson emphasizes the importance of load testing to prepare systems for increased traffic and data ingestion demands before problems arise. Key points include:

  • Scaling Challenges: Dickinson addresses common issues faced by teams when experiencing spikes in traffic or data ingestion, highlighting the need for a proactive approach to performance testing.

  • The Methodology for Load Testing: The methodology presented focuses on replicating existing production systems to conduct realistic load tests. This process includes using tools like Terraform to build and tear down the infrastructure needed for testing, ensuring operational costs remain manageable.

  • Understanding Performance Metrics: Key performance metrics discussed include throughput, response time, and bottlenecks. Dickinson clarifies the definitions of these terms and their relevance in identifying system limitations.

  • Data Generation: As Theta Lake operates in a regulated environment, Dickinson notes that real user data cannot be used for load testing. Instead, they generate fake data that simulates real loading patterns to push through the system for testing.

  • Case Study of Load Testing: A significant example involves a five-day performance test where 190,000 fake chat records and 634,000 fake email records were processed per hour. This stress tested various components, particularly the identity matching functionalities, to uncover potential bottlenecks.

  • Testing Outcomes and Mitigations: Insights from the performance tests revealed specific areas for improvement, including database connection exhaustion under heavy load. Scaling up the database during testing resolved the issue and showed how a potential real-world production failure could be averted before it ever occurred.

  • Data Sharing and Team Collaboration: Dickinson advocates for sharing the load testing results and insights across engineering teams to foster collaboration and preemptively address performance issues.

In conclusion, Dickinson encourages teams facing scaling challenges to implement these methodologies for load testing, highlighting the importance of anticipating issues in advance rather than merely reacting to them. By understanding their systems through rigorous testing and sharing the insights across teams, they can improve scalability and operational efficiency.

Plan to scale or plan to fail: an evidence-based approach for improving systems performance by Jade Dickinson

In this talk, I will present a methodology for replicating most standard Rails systems, for the purpose of load testing.

You can use this to find out how your system performs with more traffic than you currently encounter. This will be useful if you are on a Rails team that is starting to see scaling challenges.

At Theta Lake we operate at scale and are applying this methodology to proactively find ways to bring down our server costs. You don’t want to leave it until either your server costs soar out of control, or your entire system is about to fail. By seeing into the future just a little bit, you can find bottlenecks in your system and so find where you can improve its scalability.

RubyConf 2024

00:00:15.440 Hello, my name is Jade. As Mike said, I'm a senior software engineer at Theta Lake. I've previously worked on eight different Rails apps as part of four different Rails systems, with totally different architectures and totally different products, but what they've had in common is scaling challenges.

00:00:33.200 So, first up: have you ever had a big spike in user traffic come into your system? Has your Rails app ever started slowing down or shedding requests to the point where it looks like it's actually down? Or maybe, instead of user traffic, you've had a different kind of problem, where whatever data source you ingest grows to the point where it's going to take too long to ingest it all into your system at the regular interval you need. If any of those ring true, this talk will be useful for you.
00:01:04.600 The last time I spoke at RubyConf, in Denver, one of the questions was about load testing for user traffic, and I recommended a tool some former co-workers had written for shadowing, or replaying, real requests against a system. The following year I moved to Theta Lake. We are a fintech, more specifically regulatory compliance, and since 2022 a few engineers have been doing performance testing on a replica of our production system; earlier this year I was asked to join in on this.

00:01:36.000 To give an overview, this is basically a cross-team project, with people from the ingestion team, the operations team, and my team, which works on the customer-facing Rails app, and it's led by our CTO, Rich. This talk is about a methodology for replicating a production system and then seeing how it performs when we push more data into it than we currently see.
00:02:01.240 You might ask yourself why. Well, the point of the performance and scalability testing we're carrying out is to demonstrate the maximum performance and throughput of our system when balanced against operating at a reasonable cost. You could always horizontally scale out further; we're just looking to balance operational cost and throughput.

00:02:19.760 There's the old quote about premature optimization, which I'm sure many of you know, and I would agree that if you don't need to do this sort of thing, then don't. But what we're trying to do here, given what we know, what we've been told, and where our ingestion is in terms of scale at the moment, is to see a little bit ahead into the future in order to find bottlenecks in our system. Then we can both demonstrate how scalable it is today and also find limiting factors that we can look to mitigate.

00:02:52.760 My hope is that if you're on a team that is starting to see scaling challenges, you'll be able to go back to your team and apply this. It could be a big hope, but the idea is you want to do this before either your server costs soar out of control or your entire system is about to start acting like it's down, or actually go down. So today I'm going to present our methodology for replicating our system for the purpose of load testing.
00:03:24.000 Right, so, a bit of background. A lot of the tooling and resources you'll hear about for this sort of thing are for looking at the performance of big systems, designed by and for absolutely massive companies operating at truly massive scale, such as Netflix and Shopify, and a lot of the work, including the literal book on the subject, has been done by Brendan Gregg, who worked at Netflix. The issue is that in smaller companies, where you don't have that large team, you may have different constraints. As well as people's time being limited, you might not want to operate a fake version of your system all the time, because it would simply be too expensive; why would you do that? You may also have some, or a lot of, PII (personally identifiable information) that you have to remove before it goes into an external logging tool, and, specifically for us, you cannot use production data in any form, even anonymized.

00:04:27.360 The problem is that when you're operating at some kind of scale, your Rails app alone, your single monolith, often isn't going to be your entire system. You could locally optimize performance in it: you could look at someone's PR and say, "I think that could be a bit faster there," or you could do the classics, like swapping out malloc for jemalloc or adding an index in your database where you need it. But this might not even be where the bottlenecks are, in which case that could be wasted work. In short, we want to replicate our entire system on real infrastructure with real logging, and then load test against that. I personally think this is really cool, and I've not actually seen it done before, so I'm going to walk you through how we do it.
00:05:15.720 So, I just wanted to clarify a bit of terminology. This is taken directly from Brendan Gregg's book, which I mentioned a few slides ago. A few important points: throughput is defined as the rate of work performed; workload is the input to the system, or the load you're applying to it; response time is the time for an operation to complete, comprising both wait time and actual service time. Utilization has two definitions: for resources servicing requests, like servers, it's how busy a resource was, and for resources that provide storage, it's the capacity consumed, for example memory utilization. Then, probably quite an important one for this talk, a bottleneck is a resource that limits the performance of the system, a limiting factor, and you're aiming to identify and remove systemic bottlenecks.
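
To make those definitions concrete, here is a tiny worked example (not from the talk) that computes throughput, response time, and utilization for a single hypothetical worker from made-up job timings:

```ruby
# Made-up timings for three jobs handled by one worker over a 10-second window.
jobs = [
  { wait: 0.4, service: 2.1 },  # seconds spent queued, then seconds being processed
  { wait: 0.6, service: 1.9 },
  { wait: 1.2, service: 2.0 }
]
window = 10.0

throughput     = jobs.size / window                        # rate of work performed => 0.3 jobs/sec
response_times = jobs.map { |j| j[:wait] + j[:service] }   # wait time + service time
utilization    = jobs.sum { |j| j[:service] } / window     # how busy the worker was => 0.6
```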
00:06:15.639 Okay, so this is a high-level system diagram of our system and what it does, and I've highlighted some fairly typical areas of a system in my experience: data ingestion, your pipeline, and then the Rails side, all the usual, maybe an API, maybe not. Our data ingestion has things like a Go service that leads into a pipeline for content analysis, and I'm not going to go into great detail on that. My team's part of the system is the Rails app and Sidekiq, the usual.

00:06:47.240 Then this is an architecture diagram. We have various integrations, like Zoom and Slack; they feed into a system called the integrator, which feeds into the ingestor, through to the pipeline, through to Portal, which is the Rails app, and the API feeds into quite a few of those. I've seen, or heard about, similar architectures, and a fairly common difference is in how you do data ingestion. I've seen a few places where the strategy is to write, or in some cases rewrite, the data ingestion service in a language other than Ruby, maybe Clojure, maybe Go.

00:07:30.479 The other month I actually heard about a web hosting platform who had also decided, like us, to write their data ingestion service in Go. It gets data into a database and eventually into a form the Rails monolith reads, but interestingly that team are actually going to move back to Ruby because of team changes, so that it's easier for them to maintain their data ingestion service.
00:08:01.440 So, I've covered when it will help to load test your system; let's get into details. Firstly, we want to replicate the existing system. Assuming you want to repeat this process and perform several load tests, rather than running constantly, and also not pay to have that infrastructure up all of the time, you're going to need a way to build up and tear down that infrastructure.

00:08:23.080 We use Terraform. Many of you will be familiar with it; in case anyone isn't, Terraform is infrastructure as code, and if you've ever dug around in an AWS console looking for some kind of configuration setting, you'll understand why that's a useful tool. In my experience, some teams on Heroku or AWS may already be using it, and if not, you might have plans to move onto it. The benefit for performance testing, like I said, is that you want to be able to build up that infrastructure in the same way and then tear it down between tests, so you're not leaving it idle and paying for that capacity. You want roughly the same setup as production, and you want to bring it up and tear it down in a repeatable way, and that is what Terraform was very useful for.
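
The talk doesn't show the Terraform setup itself, but as a hedged sketch of the build-up/tear-down workflow described here, the terraform CLI can be wrapped in rake tasks so the performance environment comes up and goes away the same way every time; the directory and variable-file names below are hypothetical:

```ruby
# lib/tasks/perf_env.rake
# Sketch only: wraps the terraform CLI so the performance-testing environment
# can be built up before a load test and destroyed afterwards, rather than
# paying for idle capacity.
namespace :perf_env do
  TERRAFORM_DIR = "infra/perf_testing".freeze  # hypothetical path

  desc "Build the performance-testing environment"
  task :up do
    Dir.chdir(TERRAFORM_DIR) do
      sh "terraform init"
      sh "terraform apply -auto-approve -var-file=perf.tfvars"
    end
  end

  desc "Tear the environment down between tests"
  task :down do
    Dir.chdir(TERRAFORM_DIR) do
      sh "terraform destroy -auto-approve -var-file=perf.tfvars"
    end
  end
end
```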
00:09:16.440 The next thing is to think about how work arrives in your system. Systems I've worked on have had two typical ways, pretty standard: either user traffic or some kind of data ingestion service.
00:09:34.560 Then the third thing to think about, for the purposes of load testing, is how to artificially push work into this performance-testing system. For user traffic, I'm aware of two companies that have looked at request replay, or shadow requesting: you capture real production requests, remove the PII so it doesn't go into your logging, and then replay those requests against your system. The companies I've heard of doing this have actually open-sourced their tools: one is Carwow, with Umbra, and the other is loveholidays, with a tool called Ripley, which, coincidentally, I recently saw a talk about.

00:10:17.680 In our case, because our customers are in regulated industries, it is not appropriate at all to use their real data. So instead we use a Go tool that my colleague David, who's on the ingestion team, wrote to generate fake data that approximates present-day workload, and from that we do 10x, 100x, 1,000x, and so on. The sorts of things coming into the system are emails, chats, Zoom calls, et cetera, so we push fake versions of those through the ingestion service.
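
The actual generator at Theta Lake is a Go tool, so the following is only a rough Ruby sketch of the idea: emit synthetic chat and email records (the field names are hypothetical) at a chosen multiple of the baseline rate, to be pushed through the ingestion service:

```ruby
require "securerandom"
require "json"
require "time"

# Hypothetical record shapes; the real fields are whatever the ingestion service expects.
def fake_chat_record
  {
    type: "chat",
    id: SecureRandom.uuid,
    sender: "user#{rand(1..50_000)}@example.com",
    body: "synthetic message #{SecureRandom.hex(8)}",
    sent_at: Time.now.utc.iso8601
  }
end

def fake_email_record
  {
    type: "email",
    id: SecureRandom.uuid,
    from: "user#{rand(1..50_000)}@example.com",
    to: ["user#{rand(1..50_000)}@example.com"],
    subject: "synthetic subject #{SecureRandom.hex(4)}",
    body: "synthetic body #{SecureRandom.hex(32)}",
    sent_at: Time.now.utc.iso8601
  }
end

# Emit one hour's worth of records as JSON lines; pick counts at the baseline
# rate, then at 10x, 100x, 1,000x of it for successive load tests.
def generate_hour(chat_count, email_count, out: $stdout)
  chat_count.times  { out.puts fake_chat_record.to_json }
  email_count.times { out.puts fake_email_record.to_json }
end
```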
00:10:52.440 Foundational work on our tool for generating and pushing fake data into our system actually happened before I got involved, so I asked David about his approach to reflecting the real traffic coming in. According to him, the approach was to push traffic at volumes corresponding to what we were already seeing, and work out what we would need to process that in terms of resources, servers, and so on. We could then look at the average rate across a 24-hour work day, compare it against the peak-load demand for resources, and work out from that how many resources we needed. We used volumes of anticipated customer data, allowed some extra for growth, and used that to get an average rate; once we could sustain that rate, we would have at most 24 hours of latency in processing incoming data, with everything normalized out.

00:12:02.160 With the rate of processing per machine, or per n machines, we could look at the heaviest-used production data centers, take the effective rate of the typical busiest hour, and see how that differed from the 24-hour rate. Again, that's about deciding how many resources we need to process extra incoming workload within a satisfactory time. This diagram shows 24 hours of ingestion for a production server; the average rate here was about 83,000 records processed per hour. The idea is simply to get a record processed through the system in a reasonable amount of time.
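
As a back-of-the-envelope illustration of the sizing arithmetic described here, with every number except the 83,000 records/hour average invented for the example:

```ruby
# Illustrative sizing arithmetic; only the 83,000/hour average comes from the talk.
avg_rate_per_hour  = 83_000    # 24-hour average from the production ingestion chart
peak_rate_per_hour = 140_000   # hypothetical effective rate of the busiest hour
per_machine_rate   = 20_000    # hypothetical records one worker machine processes per hour

machines_for_average = (avg_rate_per_hour.to_f  / per_machine_rate).ceil  # => 5
machines_for_peak    = (peak_rate_per_hour.to_f / per_machine_rate).ceil  # => 7

# The gap is the extra capacity needed to absorb peak load while keeping
# worst-case processing latency inside the 24-hour target.
extra_machines_for_peak = machines_for_peak - machines_for_average        # => 2
```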
00:12:54.560 So, finally, measuring results. Before I read computer science I read biology, which covered the scientific method: you generate a hypothesis, design an experiment, test it, and analyse your results to see whether you were correct, pretty standard stuff. But we also had a really interesting lecture from one of our immunology lecturers, Dan Davis, about how his research group was flipping that: gathering tons of data and then sharing the data itself with the wider scientific community, so that both his group and other teams could analyse it, draw conclusions, and publish from that.

00:13:32.199 We're partway there, in that we're gathering tons of data in our standard logging. Like I said, we have the same logging as in our production system, so from that we can look at CPU utilization, RAM, and individual log lines from each component in the system. I'm going to skip over looking for the absence of errors, because that tends to come early in a load test, where you might have needed to switch something external off, like sending emails, and that's likely to be quite specific to your system.
00:14:10.680 So, this is an example load test. We wanted to look at the part of the system that routes records through to workflow; the idea is that a record gets assigned to an individual for review. We were putting 240k records through and looking for no errors, the total time to process all 240k records, and how long each individual record took to be processed. This is what it looked like over time, the individual time per record processed, which worked out to about two and a half seconds per record to go through that workflow.

00:14:52.279 We also have logging throughout that code path, so we can break down where the time is being spent. From most to least time spent: assigning those records to a workflow, then a lot of time preparing, less time (31%) to set up a new record, then 8% to enter the actual workflow process, and everything from there was 4% of the time or less, so we weren't super concerned about that.
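
One way to get this kind of per-step timing breakdown in a Rails code path is ActiveSupport::Notifications; the talk doesn't specify how Theta Lake's logging is wired up, so the event names and workflow steps below are hypothetical stand-ins:

```ruby
require "logger"
require "active_support/notifications"

LOGGER = Logger.new($stdout)

# Log a line with the duration of every "workflow.*" event.
ActiveSupport::Notifications.subscribe(/\Aworkflow\./) do |name, start, finish, _id, payload|
  LOGGER.info("#{name} record=#{payload[:record_id]} duration_ms=#{((finish - start) * 1000).round(1)}")
end

# Each stage of the (hypothetical) workflow code path emits its own timing,
# so logs show where the ~2.5 seconds per record actually goes.
# prepare, assign_to_reviewer, and enter_workflow are placeholders.
def route_record_to_workflow(record)
  ActiveSupport::Notifications.instrument("workflow.prepare", record_id: record.id) { prepare(record) }
  ActiveSupport::Notifications.instrument("workflow.assign",  record_id: record.id) { assign_to_reviewer(record) }
  ActiveSupport::Notifications.instrument("workflow.enter",   record_id: record.id) { enter_workflow(record) }
end
```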
00:15:22.399 Okay, so that's the gist of one performance test. This is now a full-system performance test that I'm going to walk you through. You do not need to read all of this; the key points are that it was a full-system performance test carried out across five days, ingesting 190,000 fake chat records per hour and 634k fake email records per hour, and intentionally exercising the identity-matching part of the codebase.

00:16:00.399 On the Rails side, a very important area is how we recognize new participants from Zoom meetings, webinars, and chats, and make sure we're not constantly saying "Jade with this email address is one person and Jade with that email address is another person." You have to have a couple of things to match on; there are things that will identify people as clearly the same person, like a combination of name and number, or name and employee record, for example.

00:16:28.920 So, to artificially stress test this part of the system, my team already had something to generate an arbitrary number of meeting participants, written by our lead, and I wrote a rake task to generate pairs of participants that would be recognized as the same person, say 100,000, so that our Rails app would realize those are actually 50,000 individuals with very slightly different details. Then, when we push fake Zoom meetings into the system in the overall systems test, it hits this code path for identity matching and we can see how it handles that increase in throughput.
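
The talk doesn't show the rake task itself, so this is a hedged sketch of the shape such a task might take; the MeetingParticipant model and its attributes are hypothetical stand-ins, with each pair sharing a name and phone number so that identity matching should resolve the pair to one person:

```ruby
# lib/tasks/perf_data.rake
# Sketch only: generate pairs of slightly different participants that the
# app's identity matching should recognise as the same individual.
namespace :perf_data do
  desc "Generate N pairs of participants that should each resolve to one identity"
  task :participant_pairs, [:pair_count] => :environment do |_t, args|
    pair_count = (args[:pair_count] || 50_000).to_i

    pair_count.times do |i|
      name  = "Fake Person #{i}"
      phone = "+1555#{format('%07d', i)}"

      # Same name and phone number but different email addresses: identity
      # matching should treat these two rows as a single person.
      MeetingParticipant.create!(name: name, phone: phone, email: "fake.person#{i}@example.com")
      MeetingParticipant.create!(name: name, phone: phone, email: "f.person#{i}@example.net")
    end
  end
end
```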
00:17:13.600 Okay, so this was really interesting. Two days into the test (I mentioned it was a five-day test), just from force of habit from doing daytime site-reliability support rotations a couple of jobs ago, I checked the Sidekiq queues at about 11:00 UK time, and they were really, really backing up. I just want to emphasize that this is not a production system, this is a performance-testing system, so we're fine. The workflow Sidekiq jobs were taking excessive amounts of time to complete; to be more specific, the mean time in seconds to complete was suddenly exactly the same as the p99.9.

00:17:56.320 The other issue, really the same issue, was that we were looking at a latency of six and a half hours on one of our Sidekiq queues, which is nothing like our normal latency, so obviously that felt a bit high. At some point in the night the database connection count had doubled, from about 500 to 1,000, and therefore each workflow item was taking a lot more time to process. The actual time per record was still around two and a half seconds, but records were waiting far, far longer in the queue than usual.
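
The kind of backlog described here can be spotted from a console with Sidekiq's own API, where a queue's latency is the age of its oldest waiting job (six and a half hours is roughly 23,400 seconds):

```ruby
require "sidekiq/api"

# Print size and latency (age of the oldest waiting job, in seconds) for every queue.
Sidekiq::Queue.all.each do |queue|
  puts format("%-20s size=%-8d latency=%.1fs", queue.name, queue.size, queue.latency)
end

# The retry set is worth a glance once jobs start failing outright.
puts "retries: #{Sidekiq::RetrySet.new.size}"
```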
00:18:37.440 Then at midday UK time, about an hour later, jobs actually started failing, and there was a massive spike in the classic ActiveRecord database connection error, which I'm sure many of you will have seen before. Reasonably obviously, if you've seen this yourself, this is not good; in a production system it could be an incident, depending. What was happening here was that the database server resource was getting totally exhausted, which led to jobs trying to open a database connection, getting rejected, failing, going around again, and the queue just kept backing up like that.

00:19:21.880 That figure of 500 database connections actually used to be Heroku's limit on Postgres database connections, and at the end I'll share a reference from them on the rationale behind that. There are ways you can handle this; in this test we simply scaled up the database, and that worked perfectly fine. What I thought was really cool was that we got a preview of what could be a real production incident before it ever actually happened in production, so we got to mitigate it and prevent it from ever happening for real.
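
For a rough sense of why the connection count matters: every web thread and every Sidekiq worker thread can hold a database connection, so the theoretical maximum is roughly (web processes × threads per process) + (Sidekiq processes × concurrency). All of the numbers below are illustrative, not Theta Lake's real topology:

```ruby
# Illustrative topology only. Each Puma thread and each Sidekiq worker thread
# can hold one database connection when fully busy.
web_processes       = 20
threads_per_process = 5
sidekiq_processes   = 30
sidekiq_concurrency = 25

max_connections = (web_processes * threads_per_process) +
                  (sidekiq_processes * sidekiq_concurrency)
# => 850 potential connections, which a database provisioned around a
# 500-connection limit cannot serve once every worker is busy and retrying.
```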
00:19:57.640 So, sharing data: why bother? I've talked about the how, the why, and the sorts of things you might find when load testing; I want to go back to what I said about sharing data, and not just results. Why would you bother? You want to bring your team along with you (this slide was really an excuse to get this picture onto the slides again). The idea is that you're limited on time, so ideally, if you can share the results from your logs, then everyone can pick up optimization work when time is available. So what would be ideal is to share the data, in this case by sharing your logging results from each load test.
00:20:37.480 From what you learn in a load test you can then do smaller-scale controlled experiments, and here I would basically just point you to Nate Berkopec's work, especially the DRM method: Database, Ruby, Memory. I'd say both of his books are the go-to resources for Rails performance optimization.

00:20:59.080 This doesn't actually require a performance-testing environment like the one I've described. Instead, you start by benchmarking locally and proving the worth and improvement of your performance PR, and then you do the same in production once the PR is approved. What's quite cool is that once you follow that process, you can also prove how your optimization will perform in a larger load test; this is something we've done with a couple of my teammates' changes, which has been really interesting.
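
A minimal example of that local benchmarking step, using the benchmark-ips gem to compare a "before" and "after" implementation (both invented stand-ins, not code from the talk):

```ruby
require "benchmark/ips"  # gem "benchmark-ips"

# Invented "before" and "after" implementations standing in for a real performance PR.
RECORDS = Array.new(10_000) { |i| { id: i, email: "user#{i}@example.com" } }
INDEX   = RECORDS.each_with_object({}) { |r, h| h[r[:email]] = r }

def lookup_with_scan(records, email)
  records.find { |r| r[:email] == email }   # O(n) linear scan
end

def lookup_with_index(index, email)
  index[email]                              # O(1) hash lookup
end

Benchmark.ips do |x|
  x.report("linear scan") { lookup_with_scan(RECORDS, "user9999@example.com") }
  x.report("hash index")  { lookup_with_index(INDEX,  "user9999@example.com") }
  x.compare!
end
```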
00:21:30.080 Then, just going back to the point about premature optimization: I'd agree with anyone who's thinking about "premature optimization is the root of all evil," but the full quote is: "There is no doubt that the grail of efficiency leads to abuse. We waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of our programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." Everything I've discussed in this talk is about ignoring the non-critical and instead finding bottlenecks in critical parts of your system, so you can rectify them, preferably just before you need to.
00:22:26.640 Summing up: using the methodology I've described, you can carry out large-scale measurements and share the data, and your insights from those tests, with your wider engineering team, and you're using tools to anticipate problems rather than having to react to them when they happen and just cause panic. The idea is to anticipate multiples of your current scale; replicate your system with Terraform; think about how the workload or traffic arrives in your system, whether that's user traffic or data ingestion; and then load test and take measurements from a performance-testing environment. I've also walked you through a few example findings and, looking ahead, how you can share them with your team. So, thank you for listening; I'll be around for questions afterwards.