Summarized using AI

Chaos Engineering on the Death Star

Sara Jackson • November 13, 2024 • Chicago, IL • Talk

In the talk titled "Chaos Engineering on the Death Star" presented by Sara Jackson at RubyConf 2024, the speaker explores the concept of chaos engineering using the infamous Death Star as a case study. The primary theme revolves around understanding how proactive measures in software engineering can prevent disastrous failures and improve system resilience.

Key Points Discussed:

  • Understanding Chaos Engineering: Refers to applying scientific methodology to prepare for unpredictable system failures rather than merely relying on testing.
  • Observability: Emphasizes the importance of having a comprehensive understanding of typical system behavior to effectively conduct chaos experiments.
  • Testing in Production: Highlights that chaos engineering involves performing tests in the live environment, which provides clearer insights into how systems behave under stress when compared to pre-production scenarios.
  • Case Study - Maia: A fictional web developer working on the Death Star PR site lacks preparedness for unexpected incidents, illustrating various challenges such as server outages and improper handling of third-party API failures.
  • Simulated Experiments: Examples of chaos experiments that could have provided valuable lessons for Maia include simulating server failures and high latency situations to improve user response during actual outages.
  • User Experience Impact: Discusses how chaos engineering can enhance real user experiences, such as incorporating better messaging during system slowdowns or outages to retain user engagement.
  • Best Practices and Tools: Recommends specific chaos engineering tools for Ruby applications, such as Flu Shot and Chaos RB, to facilitate experimentation and system improvement.

Conclusions and Takeaways:

  • Implementing chaos engineering practices, even in small steps, can significantly enhance system resilience, improve confidence in application performance, and provide clear data to justify technical enhancements within teams.
  • Failure to apply these proactive techniques, as illustrated by the Death Star's operational flaws, can lead to disastrous outcomes.
  • Organizations should start small, focusing on improving observability, testing controlled elements in production, and using data-driven results to advocate for further investments in system resiliency.

This talk not only serves as an introduction to chaos engineering but also provides actionable insights on how such methodologies could have dramatically altered the fate of systems like the Death Star.

Chaos Engineering on the Death Star
Sara Jackson • November 13, 2024 • Chicago, IL • Talk

An exhaust vent wasn't the only flaw on the Death Star! We'll follow along with a flustered Death Star engineer, Maia, and learn how ideas from the world of chaos engineering could have saved her app from exploitation by the rebels.

RubyConf 2024

00:00:15.320 Hi, I'm Sara Jackson. Thank you for coming to my talk, Chaos Engineering on the Death Star. This talk covers a topic that may require some prerequisite knowledge, so let's get some things straight.

00:00:30.400 The Galactic Empire is bad. The Rebel Alliance is good. And the first Death Star was destroyed as a result of poor planning, poor execution, and hubris: hubris to assume that nothing would go wrong and everything would go according to plan.

00:00:54.160 The internet sees hubris and laughs. Things will go wrong, and we should prepare for it. Chaos engineering is a tool we can use to prepare for the unexpected. It can help prevent incidents, improve user experience, increase team confidence, and so much more.

00:01:17.320 So what is chaos engineering? Chaos engineering itself is a bit of a misnomer. If you've heard of Netflix's tool Chaos Monkey, you might imagine a monkey wildly disrupting the system, shooting off blasters with the aim of a stormtrooper. However, chaos engineering is a technique more similar to the scientific method: it's strategic and planned. So let's see how each of the steps of the scientific method plays out in chaos engineering.

00:01:54.799 Observability is key. We need to know how the system acts normally to compare how it responds under an experiment, so step zero should be ensuring you have sufficient observability in the area of the application you'd like to test.

00:02:13.400 We'll then form a hypothesis to base our experiment on, like "my application can withstand a few servers going offline" or "my application will fail gracefully from an unexpected HTTP response."

00:02:30.800 Experimental failures can then be injected into a production environment, the outcomes of which will help prove or disprove our hypothesis. By analyzing the logs and metrics recorded during the experiment and comparing them with a baseline, we can learn exactly how our system will react in the scenario.

00:02:56.760 If our hypothesis was false, and the system was degraded or taken offline when a few servers are also offline, then we can make a plan to implement improvements to our system's resiliency. This is a proactive approach to fault tolerance.

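As a rough illustration of that hypothesis-inject-compare loop, a minimal Ruby sketch might look like the following. The module name, the CHAOS_FAULT_RATE environment variable, and the PlanetStreamClient service object are hypothetical, not tools from the talk.

    require "logger"
    require "net/http"

    # Minimal sketch: expose a small, configurable slice of traffic to an
    # injected fault and record timing for both groups, so behavior under the
    # experiment can be compared with the steady-state baseline.
    module ChaosExperiment
      LOGGER = Logger.new($stdout)

      def self.inject_fault?
        rand < ENV.fetch("CHAOS_FAULT_RATE", "0").to_f
      end

      def self.observe(name)
        group = inject_fault? ? "experiment" : "control"
        started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
        yield(group == "experiment")
      rescue StandardError => error
        LOGGER.warn("chaos.#{name} group=#{group} error=#{error.class}")
        raise
      ensure
        elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
        LOGGER.info("chaos.#{name} group=#{group} seconds=#{elapsed.round(3)}")
      end
    end

    # Hypothesis: "the livestream page degrades gracefully when its upstream
    # times out." The block receives true when the fault should be injected.
    ChaosExperiment.observe("planet_stream_fetch") do |inject_fault|
      raise Net::ReadTimeout if inject_fault
      PlanetStreamClient.fetch_latest # hypothetical service object
    end

The "control" log lines give the steady-state baseline; the "experiment" lines show behavior under the injected fault, which is the kind of baseline-versus-experiment comparison described above.
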
00:03:16.159 Now, chaos engineering is traditionally within the domain of infrastructure and DevOps, and this is a Ruby talk. There is a growing interest in running chaos engineering experiments at the application layer. Focusing on the Ruby application tightens the feedback loop for us developers and lets us think about improvements we can make for our users when infrastructure is not working. Things like servers being unavailable and corrupt HTTP responses still result in application-layer errors.

00:03:54.120 So if we're injecting bad input or responses, isn't that just testing with extra steps? Close, but not quite. You see, a key difference is that the experiments run on production. Yes, that's right, folks: we're doing it live. Even the best pre-production environment doesn't replicate reality perfectly.

00:04:19.079 I've seen many clients in the past wish for a way to test against the chaos that occurs in production, usually with a focus on user-provided data; user-provided data can be a minefield of the unexpected. Also, your infrastructure scale will be different: think the number of servers, workers, threads, dynos, shards, replicas, etc. Third-party service availability is going to be different. Your genuine interactions are going to look different. The roles and permissions in your production environment tend to be more secure. And I don't know about you, but user traffic patterns on my staging environment are very different from production. The benefit of using production is having a true picture of what our user sees as a result of our experiments.

00:05:12.520 So, a disclaimer: this talk is designed as an introduction, and I avoided real experiment implementation because, unsurprisingly, chaos engineering can cause real damage. It should not be undertaken carelessly, and it's perfectly fine to start your journey with a pre-production environment, especially depending on your application's risk tolerance. That said, there is so much value in being able to compare the steady-state, real-world metrics to the metrics of the production environment under an experiment. Tests alone will tell you that X will happen when Y, but these experiments could tell you that X happens regularly in the steady state; an increase or decrease during an experiment will give you so much more information. Additionally, a test suite must be thorough and well designed on its own to provide the highest level of confidence, but the world in which every application is backed by a thorough, well-designed test suite might be in a galaxy far, far away.

00:06:23.280 Meet Maia, a web developer managing the Death Star's public relations site. It doesn't have a thorough and well-designed test suite backing it, yet it's surprisingly complex, with access to live streams of target planets. She's not familiar with chaos engineering, and she's woefully unprepared for what the rebels have planned for her site.

00:06:47.759 She overhears one of her fellow minions of the Empire complaining about how they can't upload a video they just took of a planet: "I keep hitting upload and nothing happens." When looking into the logs, Maia sees the app is receiving 503s and failing silently. Hopefully Darth Vader doesn't try this feature anytime soon.

00:07:13.759 Then a news bulletin reports that the rebels have taken over and shut down the Empire's favorite cloud service (enter your most disliked cloud service here). No matter what your cloud service provider is, at the end of the day it's just a collection of computers. Failures happen. AWS goes down.

00:07:34.840 Chaos Monkey, which I mentioned earlier, is a chaos engineering tool to simulate server outages, and it was developed at Netflix shortly after they migrated to AWS to shore up against this very situation. They realized the best defense is a good offense: you should fail while you're watching.

00:07:53.840 We can run a chaos engineering experiment to simulate this at the application layer. After all, what does a cloud outage look like to the application? A request timeout response, or a service unavailable response. How does your application deal with these? Which of your services are affected? What does your user see? Are your logs working properly? Are you sure? When was the last time you checked?

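To make that concrete, here is a hedged sketch of injecting those two symptoms, a timeout and a 503, at the point where the application calls out to its provider. UploadServiceClient, the endpoint URL, and the CHAOS_OUTAGE_MODE variable are hypothetical.

    require "net/http"
    require "uri"
    require "json"

    # Sketch: the upload path can be made to "feel" a cloud outage on demand.
    class UploadServiceClient
      ENDPOINT = URI("https://uploads.example.test/videos") # hypothetical

      def upload(payload)
        case ENV["CHAOS_OUTAGE_MODE"]
        when "timeout"
          raise Net::OpenTimeout # what an unreachable provider looks like
        when "unavailable"
          return Net::HTTPServiceUnavailable.new("1.1", "503", "Service Unavailable")
        end

        Net::HTTP.post(ENDPOINT, payload.to_json, "Content-Type" => "application/json")
      end
    end

    # Does the caller notice, or does it fail silently like Maia's 503s did?
    begin
      response = UploadServiceClient.new.upload(video: "alderaan.mp4")
      warn("Upload unavailable; tell the user instead of staying silent.") unless response.is_a?(Net::HTTPSuccess)
    rescue Net::OpenTimeout
      warn("Upload timed out; is there a retry or a clear message for the user?")
    end
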
00:08:37.000 Possible outcomes of such an experiment could be that the site handled the situation well: confidence is gained for the team and management. Or, more realistically, our service was degraded slightly; we can now decide how we want to handle that, shorter timeouts for example. Or, worst-case scenario, our service is shut down completely, and this requires an improvement, and soon. Even a page displaying a "temporarily offline" message would be better, so consider adding a failover service. Of course, solutions will be different depending on your application, scale, and priorities.

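One low-effort version of that "temporarily offline" idea is a tiny piece of Rack middleware; the class name and the MAINTENANCE_MODE flag here are hypothetical, a sketch rather than a prescription.

    # Sketch: when a dependency is known to be down, serve an honest
    # "temporarily offline" page instead of failing silently.
    class MaintenancePage
      BODY = "<h1>The Death Star PR site is temporarily offline.</h1>".freeze

      def initialize(app)
        @app = app
      end

      def call(env)
        if ENV["MAINTENANCE_MODE"] == "true"
          [503, { "Content-Type" => "text/html" }, [BODY]]
        else
          @app.call(env)
        end
      end
    end

    # In a Rails app this might be added with:
    #   config.middleware.insert_before 0, MaintenancePage
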
00:09:16.560 Maia's temporary solution is to add a banner on the page indicating the upload service is currently unavailable. Although, she wonders if anything else is happening on the Death Star's PR page. She clicks, and clicks. Suddenly things are not as responsive as they should be; the pages are loading so slowly. Then all she sees is a white screen. A glance at her Datadog dashboard (#notsponsored) tells her traffic has spiked, and what's this? All the IP addresses are originating on Hoth. It must be the rebels.

00:09:56.000 High latency can be an issue caused by a number of things, in this case increased traffic slowing down the site. Even if an application isn't a target for a DDoS attack like this, that doesn't mean it's safe from the friendly internet hug of death. That's when a site is shared to social media and goes viral; suddenly the entire internet seems to be on the site and takes it down, like a T-47's tow cable taking down an AT-AT.

00:10:26.360 The concept of a slowdown goes hand in hand with real-world user experiences and business metrics. How long is it acceptable for your user to handle a degraded response time with no further information? When will that user decide to just leave the site entirely and spread the word? This has real costs associated with it; I know you all know.

00:10:50.160 At the application layer, a slowdown can look the same as a simple sleep command. We can design a chaos engineering experiment to test the user's experience during a high-latency situation. This can surface various opportunities for improvements, like adding or improving site messaging. A shorter timeout and a message that says "hey, we're still loading" is really important, so your users know that it's not them or their internet; the site is still thinking. You could also add a loading icon or progress bar, or, hey, consider adding Turbo to the app's front end to lazily load some frames while things are going on in the background.

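At the Rack level, that "slowdown as a sleep" idea can be sketched as middleware that delays a small sample of requests; the class name, environment variables, and path filter below are hypothetical.

    # Sketch: add latency to a sampled fraction of matching requests so you can
    # watch what users actually see while the site is "still thinking".
    class ChaosLatencyMiddleware
      def initialize(app, path: "/", rate: ENV.fetch("CHAOS_LATENCY_RATE", "0").to_f,
                     delay: ENV.fetch("CHAOS_LATENCY_SECONDS", "2").to_f)
        @app = app
        @path = path
        @rate = rate
        @delay = delay
      end

      def call(env)
        sleep(@delay) if env["PATH_INFO"].start_with?(@path) && rand < @rate
        @app.call(env)
      end
    end

    # e.g. in config/application.rb:
    #   config.middleware.use ChaosLatencyMiddleware, path: "/subscriptions"
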
00:11:37.040 So, as I mentioned earlier, solutions will be different depending on your situation. Fortunately for Maia, the increased traffic all originated on Hoth, so she alerted Darth Vader, and it was dealt with.

00:11:53.000 Oh, fighting the rebels is expensive, so Maia added a subscription service to the Death Star site a while ago, promising live streams of planet demolitions. Heck yeah, we love an explosion. She checks in on the site's subscriptions page, the payments for which are handled by a third-party integration. Well, this page should be displaying information about the recent subscriptions, but suddenly it doesn't have any values. Another check of the logs, and she sees 401 responses coming from the API. It turns out a member of the Rebellion works for the third-party service and deleted the Death Star's API token.

00:12:39.199 I mean, third-party APIs are a pain to integrate to begin with. Tokens can expire on their own or be maliciously deleted by Rebel scum, APIs can change unexpectedly, and these services can experience their own outages. I believe that applications should take ownership of the entire user experience on their site, regardless of third-party integrations.

00:13:05.000 So we can run a chaos engineering experiment that mimics this situation by serving different error responses to the application. Existing observability can help you determine the most common ones your site faces and design experiments around that.

00:13:23.639 How does your site handle these HTTP responses? Is it graceful? Is your logging correct? Or, my worst nightmare: the user sees a stack trace. This is real, on the US Department of Transportation website. Retry mechanisms, graceful error messaging, and fallback options are all things that could be added to your site to improve your users' experience in these situations.

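A hedged sketch of those three ideas together, retries for transient failures, a fallback to cached data, and an honest message instead of a blank page, might look like this; every name here (SubscriptionsFetcher, AuthError, the cache key) is hypothetical.

    require "net/http"

    # Sketch: own the user experience even when the payments API misbehaves.
    class SubscriptionsFetcher
      Result = Struct.new(:subscriptions, :notice)
      AuthError = Class.new(StandardError) # hypothetical: raised by the client wrapper on 401

      MAX_ATTEMPTS = 3
      TRANSIENT_ERRORS = [Net::OpenTimeout, Net::ReadTimeout].freeze

      def initialize(api_client, cache)
        @api_client = api_client # hypothetical third-party payments client
        @cache = cache           # e.g. Rails.cache
      end

      def recent_subscriptions
        attempts = 0
        begin
          attempts += 1
          subscriptions = @api_client.recent_subscriptions
          @cache.write("subscriptions:recent", subscriptions)
          Result.new(subscriptions, nil)
        rescue *TRANSIENT_ERRORS
          retry if attempts < MAX_ATTEMPTS
          fallback("The payments service is slow right now; showing recent data.")
        rescue AuthError
          fallback("Subscription details are temporarily unavailable.")
        end
      end

      private

      def fallback(notice)
        Result.new(@cache.read("subscriptions:recent") || [], notice)
      end
    end
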
00:13:55.839 Oh, Maia is at her wit's end, handling all these incidents only as they arise. If she had known about chaos engineering, she could have run some experiments and shored up the resiliency of the Death Star site. But, you know, it's the Empire, so money's tight. There are so many different areas that could experience failure; even if she had the budget, where could she have started? Where can you start?

00:14:23.839 As I said, step zero should be increasing observability. From there, you should start small: you want to limit the unknowns when you're running an experiment. Most chaos engineering experts will tell you that you should start your experiments in an area that you are already very confident in. From there, you want to identify an area in your application that is going to limit the side effects, like a single microservice. And then you can prioritize experiments and improvements based on your own service level agreements and priorities. And remember: knowledge is power.

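One way to encode that "start small" advice is a guard that every injection point has to pass; the module, environment variables, and allowlist below are hypothetical, and the exact shape will depend on your app.

    # Sketch: experiments run only when explicitly enabled, only in a named,
    # well-understood area, and only for a small sample of traffic.
    module ChaosGuard
      ALLOWED_AREAS = %w[upload_banner subscriptions_page].freeze

      def self.run?(area)
        return false unless ENV["CHAOS_ENABLED"] == "true"  # instant kill switch
        return false unless ALLOWED_AREAS.include?(area)    # one scoped area at a time
        rand < ENV.fetch("CHAOS_SAMPLE_RATE", "0.01").to_f  # limit the blast radius
      end
    end

    # Every injection point checks the guard first, e.g.:
    #   sleep(2) if ChaosGuard.run?("subscriptions_page")
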
00:15:07.800 So, how do I get my team to care? You're all here listening to the wise; they're not. Well, experiments and their outcomes can provide a lot of value to the developers on your team. Chaos engineering can help build a higher level of confidence for everyone. We all want to be able to trust the code we write, and these experiments can test our theories around the unexpected. And then there's data; who doesn't love data? (Wrong talk, sorry.) This data can help when trying to lobby for tech investment. It's easier for a director to agree to choosing a tech investment sprint over feature work when you can point at the experiment's data and the costly user impact should these improvements not be made. Plus, it's cool to call yourself an agent of chaos.

00:16:04.759 Well, how do I get my manager, director, or CTO on board? This doesn't have to be an all-in sell, and it shouldn't be. As I've mentioned, chaos engineering is best practiced carefully and incrementally. Like getting buy-in for anything, it's important to bring up aspects that align with the priorities and values of your audience. So when you're speaking with a business leader, you can share details about how this practice will provide a better customer experience and increase your site's reputation. It will result in less downtime, thus fewer lost customer interactions, click-throughs, or sales.

00:16:48.600 For engineering leaders, you can take a different tack and bring up how this can improve the resiliency of your app. It can reduce the stress around launches or busy seasons. This can improve the team's confidence in the system and the work that they're doing, and improve nearly every aspect of managing tech investment, from determining areas to work on, to prioritizing timing, to having data-backed evidence to get upper management on board.

00:17:24.079 For the Galactic Empire, some selling points could be: the thermal exhaust ports are shored up, schematics are stored more securely, and the site does not go down. Ever.

00:17:40.160 I encourage you to seek tools and methods of experimentation that will allow you and your team to see the benefits of application-layer chaos engineering. Improving your application's resilience, your users' experience, and your team's confidence are worth the experiment. Here are some tools to get you started.

00:18:03.799 Flu Shot is a chaos testing tool for Ruby applications. It can inject unusual and unexpected behaviors into your system, like we've been talking about: it can add extra latency to your network requests, it can simulate infinite loops, and it can raise exceptions.

00:18:24.159 Chaos RB is another chaos testing tool for Ruby applications. It supports simulating issues like increased CPU usage, raised exceptions, long IO wait, and increased memory usage.

00:18:42.960 Chaos Orca is a chaos engineering system for Docker. It provides monitoring and injections that are done exclusively from an app's Docker host, so it's language agnostic. If you're interested in contributing to open source like I am, I'll be at Hack Day tomorrow representing Clearance. I know some of these projects are looking for community contributions, so please consider them.

00:19:09.880 Had the first Death Star properly employed chaos engineering practices, it could have been saved from such a thorough and obvious demise. But try telling the Emperor he's not doing development right.

00:19:27.240 This is the real-life Maia, my four-year-old, Ms. Kitty. I was delighted to hire Brian, an illustrator from Fiverr, to draw Maia as the Imperial officer I imagined her as for this presentation. This slide is hard to read, but these are my sources, and I'll be sharing these and other resources to get started in Slack later.

00:19:49.240 Thank you so much. You can find me on Mastodon, at thoughtbot.social and at vmst.io. I won't be taking questions now, but if you find me later, you can talk my ear off about chaos engineering, because I have so much I could talk about.