Hi, I'm Sarah Jackson. Thank you for coming to my talk, Chaos Engineering on the Death Star.

This talk covers a topic that may require some prerequisite knowledge, so let's get some things straight: the Galactic Empire is bad. The Rebel Alliance is good. And the first Death Star was destroyed as a result of poor planning, poor execution, and hubris. Hubris to assume that nothing would go wrong and everything would go according to plan.

The internet sees hubris and laughs. Things will go wrong, and we should prepare for it. Chaos engineering is a tool we can use to prepare for the unexpected. It can help prevent incidents, improve user experience, increase team confidence, and so much more.

So what is chaos engineering?
Chaos engineering itself is a bit of a misnomer. If you've heard of Netflix's tool Chaos Monkey, you might imagine a monkey wildly disrupting the system, shooting off blasters with the aim of a stormtrooper. However, chaos engineering is a technique more similar to the scientific method: it's strategic and planned. So let's see how each of the steps of the scientific method plays out in chaos engineering.

Observability is key. We need to know how the system acts normally to compare how it responds under an experiment, so Step Zero should be ensuring you have sufficient observability in the area of the application you'd like to test.

We'll then form a hypothesis to base our experiment on, like "my application can withstand a few servers going offline" or "my application will fail gracefully from an unexpected HTTP response."
Experimental failures can then be injected into a production environment, the outcomes of which will help prove or disprove our hypothesis. By analyzing the logs and metrics recorded during the experiment and comparing them with a baseline, we can learn exactly how our system will react in that scenario. If our hypothesis was false, and the system was degraded or taken offline when a few servers are also offline, then we can make a plan to implement improvements to our system's resiliency. This is a proactive approach to fault tolerance.
Now, chaos engineering is traditionally within the domain of infrastructure and DevOps, and this is a Ruby talk. There is a growing interest in running chaos engineering experiments at the application layer. Focusing on the Ruby application tightens the feedback loop for us developers and lets us think about improvements we can make for our users when infrastructure is not working. Things like servers being unavailable and corrupt HTTP responses still result in application layer errors.
So if we're injecting bad input or responses, isn't that just testing with extra steps? Close, but not quite. You see, a key difference is that the experiments run on production. Yes, that's right, folks: we're doing it live. Even the best pre-production environment doesn't replicate reality perfectly. I've seen many clients in the past wish for a way to test against the chaos that occurs in production, usually with a focus on user-provided data. User-provided data can be a minefield of the unexpected. Also, your infrastructure scale will be different: think the number of servers, workers, threads, dynos, shards, replicas, etc. Third-party service availability is going to be different. Your genuine interactions are going to look different. The roles and permissions in your production environment tend to be more secure. And I don't know about you, but user traffic patterns on my staging environment are very different from production.

The benefit of using production is having a true picture of what our user sees as a result of our experiments.
So, a disclaimer: this talk is designed as an introduction, and I avoided real experiment implementation because, unsurprisingly, chaos engineering can cause real damage. It should not be undertaken carelessly, and it's perfectly fine to start your journey with a pre-production environment, especially depending on your application's risk tolerance. That said, there is so much value in being able to compare the steady-state, real-world metrics to the metrics of the production environment under an experiment. Tests alone will tell you that X will happen when Y, but these experiments could tell you that X happens regularly in the steady state; an increase or decrease during an experiment will give you so much more information. Additionally, a test suite must be thorough and well designed on its own to provide the highest level of confidence, but the world in which every application is backed by a thorough, well-tested test suite might be in a galaxy far, far away.
Meet Maya, a web developer managing the Death Star's public relations site. It doesn't have a thorough and well-designed test suite backing it, yet it's surprisingly complex, with access to live streams of target planets. She's not familiar with chaos engineering, and she's woefully unprepared for what the rebels have planned for her site.

She overhears one of her fellow minions of the Empire complaining about how they can't upload a video they just took of a planet: "I keep hitting upload and nothing happens." When looking into the logs, Maya sees the app is receiving 503s and failing silently. Hopefully Darth Vader doesn't try this feature anytime soon.
Then a news bulletin reports that the rebels have taken over and shut down the Empire's favorite cloud service. Insert your most disliked cloud service here. No matter what your cloud service provider, at the end of the day it's just a collection of computers. Failures happen. AWS goes down.

Chaos Monkey, which I mentioned earlier, is a chaos engineering tool to simulate server outages. It was developed at Netflix shortly after they migrated to AWS, to shore up against this very situation. They realized the best defense is a good offense: you should fail while you're watching.

We can run a chaos engineering experiment to simulate this at the application layer.
After all, what does a cloud outage look like to the application? A request timeout response, or a service unavailable response. How does your application deal with these? Which of your services are affected? What does your user see? Are your logs working properly? Are you sure? When was the last time you checked?
Possible outcomes of such an experiment could be that the site handled the situation well, and confidence is gained for the team and management. Or, more realistically, our service was degraded slightly; we can now decide how we want to handle that, with shorter timeouts, for example. Or, worst-case scenario, our service is shut down completely.
And that requires an improvement, and soon. Even a page displaying a "temporarily offline" message would be better, so consider adding a failover service. Of course, solutions will be different depending on your application, scale, and priorities.
Maya's temporary solution is to add a banner on the page indicating the upload service is currently unavailable. Although, she wonders, is anything else happening on the Death Star's PR page? She clicks, and clicks. Suddenly things are not as responsive as they should be. The pages are loading so slowly. Then all she sees is a white screen. A glance at her Datadog dashboard (hashtag not-sponsored) tells her traffic has spiked, and what's this? All the IP addresses are originating on Hoth. It must be the rebels.
High latency can be an issue caused by a number of things; in this case, increased traffic slowing down the site. Even if an application isn't a target for a DoS attack like this, that doesn't mean it's safe from the friendly internet hug of death. That's when a site is shared to social media and goes viral: suddenly the entire internet seems to be on the site and takes it down, like a T-47's tow cable taking down an AT-AT.
The concept of a slowdown goes hand in hand with real-world user experiences and business metrics. How long is it acceptable for your user to handle a degraded response time with no further information? When will that user decide to just leave the site entirely and spread the word? This has real costs associated with it. I know you all know.
At the application layer, a slowdown can look the same as a simple sleep command. We can design a chaos engineering experiment to test the user's experience during a high-latency situation. This can surface various opportunities for improvements, like adding or improving site messaging. A shorter timeout and a message that says "hey, we're still loading" is really important; your users need to know that it's not them or their internet, the site is still thinking. You could also add a loading icon or progress bar. Or, hey, consider adding Turbo to the app's front end to lazily load some frames while things are going on in the background.
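Since a slowdown really can look like a sleep, a latency experiment can be as small as a Rack-style middleware that naps before passing the request along. This is a hedged sketch with hypothetical names, not a production-ready tool:

```ruby
# Rack-style middleware (hypothetical) that injects artificial latency into
# a fraction of requests; from inside the app, this is what a slowdown is.
class LatencyInjector
  def initialize(app, delay_seconds: 0.05, ratio: 1.0, rng: Random.new)
    @app = app
    @delay_seconds = delay_seconds
    @ratio = ratio # fraction of requests to slow down, 0.0..1.0
    @rng = rng
  end

  def call(env)
    sleep(@delay_seconds) if @rng.rand < @ratio
    @app.call(env)
  end
end

# A trivial downstream app, shaped like any Rack endpoint.
app = ->(env) { [200, { "content-type" => "text/plain" }, ["hello"]] }
slow_app = LatencyInjector.new(app, delay_seconds: 0.05, ratio: 1.0)

started = Time.now
status, _headers, _body = slow_app.call({})
elapsed = Time.now - started
# status is still 200, but elapsed now includes the injected delay
```

The `ratio` knob is the point: you would start with a tiny fraction of traffic and a small delay, then watch your dashboards before turning anything up.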
So, as I mentioned earlier, solutions will be different depending on your situation. Fortunately for Maya, the increased traffic all originated on Hoth, so she alerted Darth Vader, and it was dealt with.
Oh, fighting the rebels is expensive. So Maya added a subscription service a while ago to the Death Star site, promising live streams of planet demolitions. Heck yeah, we love an explosion. She checks in on the site's subscriptions page, the payments for which are handled by a third-party integration. Well, this page should be displaying information about the recent subscriptions, but suddenly it doesn't have any values. Another check of the logs, and she sees 401 responses coming from the API. It turns out a member of the Rebellion works for the third-party service and deleted the Death Star's API token.
I mean, third-party APIs are a pain to integrate to begin with. Tokens can expire on their own or be maliciously deleted by Rebel scum. APIs can change unexpectedly, and these services can experience their own outages. I believe that applications should take ownership of the entire user experience on their site, regardless of third-party integrations.
So we can run a chaos engineering experiment that mimics this situation by serving different error responses to the application. Existing observability can help you determine the most common ones your site faces, and design experiments around that.
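A sketch of what "serving different error responses" can look like, under the assumption (hypothetical names throughout) that the real third-party client is wrapped so an experiment can substitute the statuses we want to observe, like the 401 Maya saw after the token was deleted:

```ruby
# Hypothetical third-party client plus a fault-injecting wrapper.
ApiResponse = Struct.new(:status, :body)

class ThirdPartyClient
  def fetch_subscriptions
    ApiResponse.new(200, %w[sub_1 sub_2])
  end
end

class FaultInjectingClient
  def initialize(client, inject_status: nil)
    @client = client
    @inject_status = inject_status # e.g. 401, 429, 500; nil means passthrough
  end

  def fetch_subscriptions
    return ApiResponse.new(@inject_status, nil) if @inject_status
    @client.fetch_subscriptions
  end
end

# The application code under test should degrade gracefully rather than
# render a blank page (or worse, a stack trace).
def subscriptions_banner(client)
  response = client.fetch_subscriptions
  case response.status
  when 200 then "#{response.body.size} recent subscriptions"
  when 401 then "Subscription data is temporarily unavailable"
  else "Something went wrong, please try again later"
  end
end
```

The experiment is then just a matter of flipping `inject_status` for a slice of traffic and comparing what users see against the baseline.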
How does your site handle these HTTP responses? Is it graceful? Is your logging correct? Or, my worst nightmare: does the user see a stack trace? This is real, on the US Department of Transportation website. Retry mechanisms, graceful error messaging, and fallback options are all things that could be added to your site to improve your users' experience in these situations.
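The first of those improvements, a retry mechanism, can be sketched in a few lines. This is a hedged, hypothetical helper, not taken from any particular gem:

```ruby
# A small retry helper with pluggable backoff (hypothetical names).
class TransientError < StandardError; end

def with_retries(max_attempts: 3, backoff: ->(attempt) { 0.01 * attempt })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue TransientError
    # Give up once the budget is spent; otherwise back off and retry.
    raise if attempt >= max_attempts
    sleep(backoff.call(attempt))
    retry
  end
end

# Succeeds on the third try; exhausting the attempts would surface the
# error so graceful messaging or a fallback can take over.
calls = 0
result = with_retries(max_attempts: 3) do |attempt|
  calls += 1
  raise TransientError, "flaky upstream" if attempt < 3
  "ok"
end
# result == "ok", calls == 3
```

Retries only help for transient failures; a deleted API token will 401 forever, which is exactly the kind of thing the experiment teaches you to distinguish.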
Oh, Maya is at her wit's end, handling all these incidents only as they arise. If she had known about chaos engineering, she could have run some experiments and shored up the resiliency of the Death Star site. But, you know, it's the Empire, so money's tight. There are so many different areas that could experience failure. Even if she had the budget, where could she have started? Where can you start?
As I said, Step Zero should be increasing observability. From there, you should start small: you want to limit the unknowns when you're running an experiment. Most chaos engineering experts will tell you that you should start your experiments in an area that you are already very confident in. From there, you want to identify an area in your application that is going to limit the side effects, like a single microservice. And then you can prioritize experiments and improvements based on your own service level agreements and priorities. And remember: knowledge is power.
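One hedged way to picture that scoping (all names here hypothetical): gate the injected faults behind an experiment flag that is enabled for a single area of the app, so everything else keeps running normally and the blast radius stays small.

```ruby
# Hypothetical experiment flag scoped to named areas of the application.
class ChaosFlag
  def initialize(enabled_areas: [])
    @enabled_areas = enabled_areas
  end

  def enabled?(area)
    @enabled_areas.include?(area)
  end
end

# Two areas of the app; only one participates in the experiment.
def fetch_report(chaos)
  raise "injected failure" if chaos.enabled?(:reporting)
  "report data"
end

def fetch_uploads(chaos)
  raise "injected failure" if chaos.enabled?(:uploads)
  "upload data"
end

# Reporting is in the experiment; uploads are untouched.
chaos = ChaosFlag.new(enabled_areas: [:reporting])
```

In a real system the flag would also carry a traffic percentage and a kill switch, so an experiment can be stopped the moment the metrics look worse than the hypothesis allowed.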
So, how do I get my team to care? You're all here listening to the wise; they're not. Well, experiments and their outcomes can provide a lot of value to the developers on your team. Chaos engineering can help build a higher level of confidence for everyone. We all want to be able to trust the code we write, and these experiments can test our theories around the unexpected. And then there's data. Who doesn't love data? (Wrong talk, sorry.) This data can help when trying to lobby for tech investment: it's easier for a director to agree to choosing a tech-investment sprint over feature work when you can point at the experiments' data and the costly user impact should these improvements not be made. Plus, it's cool to call yourself an agent of chaos.
Well, how do I get my manager, director, or CTO on board? Well, this doesn't have to be an all-or-nothing sell, and it shouldn't be. As I've mentioned, chaos engineering is best practiced carefully and incrementally. Like getting buy-in for anything, it's important to bring up aspects that align with the priorities and values of your audience. So when you're speaking with a business leader, you can share details about how this practice will provide a better customer experience and increase your site's reputation. It will result in less downtime, thus fewer lost customer interactions, click-throughs, or sales. For engineering leaders, you can take a different tack and bring up how this can improve the resiliency of your app and reduce the stress around launches or busy seasons. This can improve the team's confidence in the system and the work that they're doing, and improve nearly every aspect of managing tech investment: determining areas to work on, prioritizing timing, and having data-backed evidence to get upper management on board.
For the Galactic Empire, some selling points could be: the thermal exhaust ports are shored up, schematics are stored more securely, and the site does not go down. Ever.

I encourage you to seek tools and methods of experimentation that will allow you and your team to see the benefits of application-layer chaos engineering. Improving your application's resilience, your users' experience, and your team's confidence are worth the experiment. Here are some tools to get you started.
Flu Shot is a chaos testing tool for Ruby applications. It can inject unusual and unexpected behaviors into your system, like we've been talking about: it can add extra latency to your network requests, it can simulate infinite loops, and it can raise exceptions.

Chaos RB is another chaos testing tool for Ruby applications. It supports simulating issues like increased CPU usage, raised exceptions, long IO wait, and increased memory usage.

ChaosOrca is an original chaos engineering system for Docker. It provides monitoring and injections that are done exclusively from an app's Docker host, so it's language-agnostic.

If you're interested in contributing to open source like I am, I'll be at Hack Day tomorrow representing Clearance. I know some of these projects are looking for community contributions, so please consider them.
Had the first Death Star properly employed chaos engineering practices, it could have been saved from such a thorough and obvious demise. But try telling the Emperor he's not doing development right.

This is the real-life Maya, my four-year-old Ms. Kitty. I was delighted to hire Brian, an illustrator from Fiverr, to draw Maya as the Imperial officer I imagined her as for this presentation.

This slide is hard to read, but these are my sources, and I'll be sharing these and other resources to get started in Slack later.

Thank you so much. You can find me on Mastodon at csarah at thoughtbot dot social and csarah at vmst dot io. I won't be taking questions now, but if you find me later, you can talk my ear off about chaos engineering, because I have so much I could talk about.