00:00:00.640
well thank you so much everybody for joining uh so today we're gonna do a
00:00:06.200
little introductory chat about observability and
00:00:11.240
hopefully that will help you ship code with more
00:00:19.359
confidence all right let's do this um little bit about me I'm a senior
00:00:26.519
software developer at Shopify. I'm a pet mom, I have two cats and a doggy. In my
00:00:32.960
free time I like to dream about living in the woods, and I'm also a gamer,
00:00:39.000
I love playing video games when I have the time, and I wish I had more time than I actually do. Cool, before we start
00:00:50.559
I actually am curious about what's your uh relationship with observability today
00:00:56.640
so I have here a few options, and if you feel comfortable, you can share in the chat whether you're number one,
00:01:03.320
two or three uh so number one is my company or team has observability
00:01:08.600
dashboards and alerts and I check them periodically and it directly impacts how I do my
00:01:14.520
work uh number two my company or team has observability dashboards and alerts
00:01:20.119
but I'm not very familiar with them and it doesn't impact how I do my
00:01:25.479
work. Number three would be: my company or team doesn't have observability dashboards and alerts, or it doesn't have
00:01:33.040
the need for them or number four um if you want to share in the
00:01:39.079
comments if you are in a different
00:01:46.719
scenario nice thanks thanks for sharing um for those who posted two and three I
00:01:52.439
think this talk will be perfect for you, and for those who posted one, well, I think
00:01:57.680
you can still get some value from it, so hang
00:02:03.439
tight. So, why did I decide to do this talk in
00:02:08.840
the first place? So I've been working in development for 13 years, but I feel
00:02:14.000
like I've been coding in the dark, because I've always been in scenario two or three, where my company or team
00:02:21.239
doesn't have observability at all, or it has it but it doesn't really impact how I do my work. I'm on the checkout
00:02:30.080
extensibility team at Shopify, and earlier this year I led the backend migration to the new checkout
00:02:37.280
editor uh so basically uh Shopify we used to let uh Merchants customize their
00:02:43.720
checkout with checkout.liquid, but for many reasons, security,
00:02:50.920
performance uh we wanted them to customize in a more controlled environment so we came up with check out
00:02:57.680
extensibility, and so we had to migrate all the millions of merchants to this
00:03:03.920
new platform, and we needed better visibility to ensure that we would have
00:03:09.360
a smooth rollout and my lights went up weird I don't know why anyway our
00:03:18.200
main concern there was uh handling a massive increase in requests because we were migrating millions of
00:03:25.080
merchants, and we wanted to ensure that there would be no performance
00:03:31.560
degradation uh I saw a question about liquid no more liquid on checkout I
00:03:37.000
think there's still Liquid in other areas of Shopify. Yeah, so we implemented
00:03:44.599
metrics dashboards, and we did in fact find an issue during the rollout, but
00:03:52.040
it was related to a separate team. Because we had dashboards and alerts, we were able to identify this earlier and
00:04:00.480
address it with the team. They made a few tweaks and there was no performance degradation, no issues, the release was a
00:04:07.159
success, and yeah, that's a really good thing. But maybe you're here because
00:04:15.879
you don't even know what observability means. So, if you don't know what
00:04:21.560
observability means, it's a term that we use to describe the ability to understand the internal state of a system
00:04:29.600
by examining its external signals. What are the signals? So,
00:04:37.080
there's logs. Logs are text records that show you how the application is
00:04:42.479
doing. When you run rails server you immediately see logs there, but some
00:04:47.639
companies also have a production application where you can see the logs um there's metrics uh metrics are
00:04:56.280
numeric data points that you can use to track things such as CPU usage
00:05:05.039
and memory consumption uh request rates and even Uh custom things uh like features and
00:05:13.400
anything anything really that you want to track when it goes to production and we also have traces
00:05:19.759
traces are detailed records of the entire journey of a request as it flows
00:05:26.160
through the different services and components, so it helps you figure out possible issues or delays in
00:05:32.840
specific spans of the request
00:05:38.800
journey. And profiling: profiling was only introduced to OpenTelemetry and
00:05:44.800
observability uh earlier this year and it allows you to inspect the code
00:05:49.880
behavior and performance at runtime in production so pretty cool
00:06:00.160
A few use cases for observability in production are: understanding the impact
00:06:06.759
of new code changes, quick problem detection, improved system understanding, tracking
00:06:14.120
health and performance, an enhanced user experience (which can be related to performance as well), and scalability and
00:06:23.120
reliability. So basically you're not only coding on your development
00:06:28.840
machine and fixing bugs, but you own the changes from development to production
00:06:34.639
and ensuring like that the system is uh
00:06:40.919
reliable. Here are a few commercial tools: there's Datadog, New
00:06:46.759
Relic, Honeybadger, AppSignal, Rollbar, Splunk, Sentry, Scout, and many, many others.
00:06:53.639
Some of them are more focused on logging or traces or
00:07:00.800
metrics, or a combination of all these things. Yeah, I'm curious, if
00:07:07.000
you feel comfortable uh sharing like U do you use any of those tools at work or
00:07:12.560
for for personal projects um feel free to to share in the comments if you if
00:07:19.759
you feel comfortable yeah I used I used New Relic
00:07:26.599
in the past, Honeybadger, Rollbar, Splunk and Datadog at companies that I worked
00:07:33.560
for and I used the Sentry for personal projects and I really like
00:07:40.840
Sentry
00:07:48.440
Nice, yeah, Datadog and Sentry and New Relic are very, very popular. Thanks
00:07:55.280
for sharing, that's really insightful. We also have open source tools: there's Prometheus,
00:08:02.800
more focused on metrics, Fluentd for logging, Jaeger for tracing. Grafana is
00:08:12.000
like a UI where you can see a bunch of things together, so you can create dashboards and manage alerts, and it
00:08:19.960
does a bunch of things I don't think it's only related to observability but um it's very commonly used with
00:08:27.159
Prometheus. And OpenTelemetry, which is not a tool but an open source
00:08:32.800
framework; I have a separate slide for it. Does anyone here use one of
00:08:38.360
those open source tools? Feel free to
00:08:45.839
share. Nice, good question, I will share more about these open source tools during the
00:08:54.680
talk. Nice, yeah, I'm going to cover a lot of Prometheus and Grafana in a more
00:09:01.839
introductory way, but yeah. Cool, let's talk a little bit about OpenTelemetry. So OpenTelemetry
00:09:08.959
is an open source framework, or toolkit, that's used to manage telemetry data:
00:09:14.560
logs metrics traces and now profiling um it integrates with many
00:09:19.760
tools, so Prometheus, Grafana, Datadog, New Relic, and it has been becoming very, very
00:09:25.720
popular due to the standardization that it does so uh let's say your company uses data
00:09:34.160
dog, but the bill is getting too expensive, so you want to switch to an open source
00:09:40.200
tool and then own your data, and you know, you have an infrastructure team to deal with that. It's supposed to be
00:09:47.240
easy to switch from Datadog to Prometheus and Grafana or other open source
00:09:53.839
tools if they use this standard
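As a rough idea of what adopting that standard can look like in a Rails app, here's a minimal OpenTelemetry setup sketch. The talk doesn't show any OTel code, so this is an assumption: it presumes the opentelemetry-sdk, opentelemetry-exporter-otlp and opentelemetry-instrumentation-all gems, and the service name is made up.

    # config/initializers/opentelemetry.rb -- a minimal sketch, not from the talk
    require "opentelemetry/sdk"
    require "opentelemetry/instrumentation/all"

    OpenTelemetry::SDK.configure do |c|
      c.service_name = "jokes-app"   # hypothetical service name
      c.use_all                      # auto-instrument Rails, Net::HTTP, etc.
    end

With this in place, the same telemetry can be shipped to Datadog, Grafana, or another OTLP-compatible backend by changing the exporter configuration rather than the application code.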
00:09:59.640
cool uh I have a demo later uh in this presentation and so a few things I'm
00:10:05.600
going to mention here so that during the demo you don't feel lost; it will help you understand things a little bit
00:10:12.200
better. So we're going to talk a little bit about Prometheus. Just a reminder, it's an open source tool, used mostly
00:10:19.680
for metrics and also alerts. So, Prometheus has four different
00:10:26.040
metric types. You have counter: a counter counts occurrences of events and
00:10:32.399
can be used to count the number of requests or errors. We have gauge: a gauge is more
00:10:39.519
used for measurements that fluctuate so temperature maybe number of open
00:10:44.800
database connections um histogram it's very uh
00:10:51.360
interesting one it samples observations and counts them in configurable buckets
00:10:57.120
it's usually used for request duration response sizes and uh lastly we have
00:11:03.440
summary which is very similar to histogram but it does a little bit more
00:11:09.360
it also provides us with a count and a sum of the observed
00:11:14.959
values, and quantiles, which we're going to see more about later. It's also used for request durations, response
00:11:22.399
sizes, things like that.
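To make the four types concrete, here's a small sketch using the prometheus_exporter gem that shows up later in the demo. The metric names and values are made up, and the exact register options for histogram buckets and summary quantiles may differ slightly from what the gem accepts, so treat this as a sketch.

    require "prometheus_exporter/client"
    client = PrometheusExporter::Client.default

    # counter: counts occurrences of events (requests, errors)
    requests = client.register(:counter, "jokes_requests_total", "requests served")
    requests.observe(1, status: 200)

    # gauge: a value that goes up and down (open DB connections, temperature)
    connections = client.register(:gauge, "db_open_connections", "open database connections")
    connections.observe(12)

    # histogram: observations counted into configurable buckets (request duration)
    duration = client.register(:histogram, "request_duration_seconds", "request duration",
                                buckets: [0.1, 0.25, 0.5, 1, 2])
    duration.observe(0.32)

    # summary: count, sum and quantiles of observed values (also durations, sizes)
    latency = client.register(:summary, "request_latency_seconds", "request latency",
                               quantiles: [0.5, 0.9, 0.99])
    latency.observe(0.32)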
00:11:30.680
Cool, Prometheus has a query language called PromQL. It allows you to query metrics, aggregate data, sum and average, get rates, and
00:11:38.720
visualize trends. Here's an example: this is a
00:11:48.480
query using PromQL. It's getting the rate of the Ruby HTTP requests total metric for the
00:11:54.440
jokes controller, using this rate interval variable that, I'm pretty sure, comes from
00:12:00.160
Grafana, and it's aggregating by the status. And so with this query we would expect
00:12:07.360
to see something like this: we see two results grouped by the HTTP
00:12:13.639
status, and then you can take some insights, like, between 210 and 220, the number
00:12:21.000
of error responses that return 500 is bigger than the 200s, so something
00:12:28.680
must have gone wrong here, right? And I want to mention another
00:12:36.240
thing about PromQL and Prometheus, because when I started learning about it, it was
00:12:43.760
driving me crazy that I couldn't get exact numbers um on my graphs and now I
00:12:50.160
understand why, so I want to share it with you in case you're going to work with Prometheus, so you don't go through the
00:12:55.839
same trouble that I was having. So, the way Prometheus works: it takes snapshots of data at regular intervals, and events
00:13:04.639
between these snapshots may not be captured exactly because of
00:13:09.920
the intervals. So we use rates to smooth out the trends, and there's an
00:13:17.279
example of how to get the rate over an interval of five minutes. And then you
00:13:22.760
might not see the exact number: maybe you're testing locally and you only have a few requests, like you send a
00:13:29.600
few occurrences of a metric and you expect to see the exact number;
00:13:34.920
you'll see something very, very similar to what you expect, but not the exact number. Still pretty useful,
00:13:43.000
though. Let's talk a little bit about SLIs, SLOs, and SLAs. They help the company or team to
00:13:52.279
understand and quantify the reliability and performance of the services okay but what what the hell is
00:13:58.440
that? SLIs are service level indicators, so the actual measurement of a
00:14:04.880
system. SLOs are service level objectives; they are targets for
00:14:11.759
the SLIs, so targets for the metrics. And SLAs are service level agreements, and
00:14:19.079
it's like a formal commitment that the company has with the customers I'm going
00:14:25.000
to give you an example and then it should be more clear. I found out that GitHub
00:14:30.560
actually has public SLAs, which is pretty cool. It says that GitHub commits
00:14:36.440
to maintain at least 99.9% uptime for its services, and failing to meet this criteria
00:14:42.600
means that a credit claim can be made by customer I didn't know that I thought
00:14:48.079
was pretty cool so you know next time you see that Angry unicorn there maybe maybe you can do a credit
00:14:54.360
claim um so that's a public SLA it exists and this is a a fake SLO just to
00:15:00.639
give you an example maybe in order for GitHub to um ensure this SLA doesn't
00:15:07.959
fail, they could have an SLO for uptime for Actions, which is one of their
00:15:13.160
services. The SLO usually will be slightly higher than the SLA
00:15:21.759
because you want to be conservative you want to be alerted before the SLA actually uh
00:15:29.680
fails um in order to measure this goal maybe
00:15:36.600
you could have an SLI for total triggered executions and another one for unavailable executions and then Crossing
00:15:45.079
this data could get you an SLO that will prevent the SLA from
00:15:51.959
failing. Another example of an SLO could be 99.99% uptime for Issues and
00:15:58.480
requests as another service, and then maybe an SLI for that could be the
00:16:03.720
service availability uh crossed with the error rate and then you cross this information you calculate your goal and
00:16:11.079
and then you have your SLO, your SLI, and your SLA.
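Just to make the relationship concrete, here's a tiny back-of-the-envelope sketch in Ruby. The numbers are entirely made up for illustration and don't come from GitHub's real figures:

    # hypothetical numbers, only to illustrate SLI vs SLO vs SLA
    total_executions       = 1_000_000   # SLI input: everything we triggered
    unavailable_executions = 400         # SLI input: executions that failed to run

    availability_sli = 1.0 - (unavailable_executions.to_f / total_executions)
    slo_target       = 0.9995            # internal target, stricter than the SLA
    sla_commitment   = 0.999             # the public promise to customers

    puts format("availability SLI: %.4f%%", availability_sli * 100)
    warn "SLO breached -- act before the SLA is at risk" if availability_sli < slo_target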
00:16:17.399
Still about SLOs, let's talk a little bit about percentiles. I've
00:16:24.600
always seen like people talking about P90 p50 P99 and I'm like I I don't know
00:16:31.519
what that means but I'll go with it um but let let's get some clarity about
00:16:38.959
that. So P50 will be, like, the average of
00:16:46.279
something. So let's say the P50 of a request is 1 second: it means that 50% of
00:16:53.600
the requests are actually faster than that and the other 50% are slower so the
00:16:59.440
average user experience. P95, however, will be
00:17:14.199
the value where 95% of the requests are faster than that, and that's the typical worst case
00:17:19.720
scenario for users and it means that 5% of the users are getting uh performance
00:17:26.039
that's worse than that. So if it's two seconds, it means that 5% of the users are having performance that's actually
00:17:32.640
worse than two seconds. And P99 is usually used
00:17:38.480
for rare performance issues; it means the 1% that's getting, like, the worst performance ever. So if P99 is 3
00:17:46.480
seconds, there's 1% of the users that are having really bad performance issues in the application.
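As a quick illustration with toy numbers (using the simple nearest-rank method, which is only one of several ways percentiles can be computed), calculating those values yourself might look like this:

    # request durations in seconds, made up for the example
    durations = [0.2, 0.3, 0.3, 0.4, 0.5, 0.8, 1.0, 1.2, 2.1, 3.0].sort

    def percentile(sorted, p)
      index = ((p / 100.0) * sorted.length).ceil - 1
      sorted[index]
    end

    p50 = percentile(durations, 50)  # half the requests are faster than this
    p95 = percentile(durations, 95)  # the "typical worst case" a user sees
    p99 = percentile(durations, 99)  # the rare, really slow outliers
    puts [p50, p95, p99].inspect     # => [0.5, 3.0, 3.0]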
00:17:52.480
None of the things that we've
00:17:59.760
seen so far matter at all if we don't have alerts, because you can have
00:18:05.600
beautiful dashboards and metrics, but if nobody looks at them when there is a real issue, they're worth nothing. So let's
00:18:14.000
talk a little bit more about that. You can create alerts based on your SLIs and
00:18:20.280
SLOs, and use them to warn the team when the application isn't working as
00:18:26.120
expected most monitoring tools will let you configure alerts uh but be careful
00:18:32.240
not to introduce noisy alerts, because if you have an alert that's going off
00:18:37.280
all the time and it's not a real issue, there's this weird thing that happens to our brain where it creates a pattern
00:18:44.400
and people start to ignore it, and it's nobody's fault, it's just something that
00:18:50.720
you need to be very careful about not to create noisy alerts same goes for uh
00:18:56.919
exceptions and uh lastly alerts must be actionable I'm going to talk more about
00:19:05.280
that um one thing that you can do in order to make the alerts more accessible
00:19:10.640
and actionable is to have a playbook. The playbook could be documentation that goes in the
00:19:19.000
company's internal documentation system, or it could go in the alert
00:19:25.360
description itself uh or even better could be both um so in the Playbook or
00:19:32.159
the alert description, or both, you should describe what to do when the alert is triggered and also what to do when there
00:19:37.679
are false positives false alerts again we want to avoid noisy
00:19:46.280
alerts uh my suggestion here for an alert description is to have a very
00:19:51.760
specific title, describe how the alert is being calculated, because you
00:19:57.360
don't want to go on vacation, then the alert goes off and nobody understands why, or whether it's a real
00:20:03.039
problem or not you want to uh describe why the alert may go off um what to do
00:20:10.640
when the alert goes off, including what to do when the alert goes off by
00:20:15.919
mistake, when it's not a real issue, because, again, I'm gonna be talking about this many times: we don't want
00:20:21.440
noisy alerts. And another cool thing to add is the customer
00:20:26.960
impact uh I have an example here I hope is readable but I will also read it um
00:20:32.840
so the title here is very specific, it says: backend errors, the jokes API custom error
00:20:38.280
monitor has received more errors than normal. The amount of errors on the back
00:20:44.039
end for the jokes API has exceeded the threshold of 0.1% because it's now at
00:20:51.159
0.2%. How do we calculate this metric? We are using the P50 percentile because
00:20:57.039
here we just care about the average uh impact
00:21:02.559
This monitor may go off because... and here I added whatever I could
00:21:09.440
think of that could cause this error. So I added: we are using an
00:21:15.440
external API and the API might be down the version of the API might have changed or the code that consumes data
00:21:23.240
from the API might be broken uh how to fix it so the first thing is investigate
00:21:28.840
what's going on, so if you look up the jokes API custom error on the
00:21:35.360
logs platform it might give you an idea of why the error is
00:21:40.840
happening. And then there is also an instruction here saying that if the alert is going off too often with false
00:21:49.200
positives to consider handling it in a different way or increase the threshold
00:21:55.559
because maybe there is a known issue the team doesn't have time to work on; it's better to increase the threshold from
00:22:03.039
0.1% to let's say 0.3 and then get alerted when it goes to
00:22:08.799
0.4%, than to let the alert pop all the time and get ignored when things go
00:22:16.440
off the rails. And then lastly here we have the impact: customers could have had
00:22:22.279
problems accessing the index page, which is really bad. So you put the
00:22:28.240
impact there uh maybe the person who sees the alert might uh you know take
00:22:34.279
action
00:22:41.440
faster. Awesome, now I'm gonna do something wild and I'm going to try a
00:22:46.640
live demo, I hope it goes well. For this live demo I'm using
00:22:53.240
this Rails project here. I'm gonna share the slides later so you can access
00:22:59.480
the links and I'm I'm going to be covering
00:23:05.480
metrics, and I'm going to be using Prometheus and Grafana to collect and visualize
00:23:10.679
metrics, and I'm using this gem called prometheus_exporter to export the metrics
00:23:16.440
from the rails app cool so this is how my application
00:23:24.200
looks like: nothing fancy, it's fetching a joke from a public jokes
00:23:29.600
API uh when I refresh it's going to fetch a new joke and then a new joke this joke doesn't have a title some of
00:23:36.679
them do have a title. Why do we care? We don't, but I have a custom metric
00:23:43.080
for that, I'm going to show you later. We have a dashboard here, so this is how Grafana looks. In the background I
00:23:50.480
see the jokes; are we good? I hope so. I got some bad ones that I was a
00:23:56.039
little bit concerned about refreshing the page because who knows what's going to show up there cool um let me here so
00:24:05.960
this is how Grafana looks, and in the backend there is Prometheus
00:24:12.799
and prometheus_exporter exporting metrics to
00:24:17.919
Prometheus we have here a board with the average request duration we have P99 P90
00:24:24.320
and p50 as we can see here p50 is um around 300 milliseconds uh P90 about
00:24:34.000
300 milliseconds as well P99 almost 400 millisecond so it's not bad it's
00:24:40.559
actually good performance for this purpose. I have another one
00:24:45.679
here that is the rate of HTTP requests grouped by status code. So
00:24:53.840
here we have the 500s: we've been having zero 500 requests, which is great, and here are
00:25:02.000
all the 200s, so mostly we're getting 200s, and mostly my
00:25:07.559
controller is succeeding um cool uh I have a script here that's
00:25:15.200
pinging my application every half a second and this is my
00:25:21.159
controller. I'm gonna uncomment this code here and I'm going to
00:25:26.720
simulate the API timing out, so I'm adding a sleep time of two
00:25:33.799
seconds and pretend that the API is hanging.
00:25:38.840
And then we should see... sorry, sorry,
00:25:44.279
just real quick I don't think we see the code only see the dashboard oh sorry let
00:25:49.720
me I think I need to share the screen again sorry thanks for the heads up
00:26:01.600
oh, I should have selected the entire screen
00:26:08.679
instead okay I'm going to go back to the code here I have a so yeah I was showing
00:26:14.000
that I have a script here that's pinging the application every half a
00:26:19.520
second and I have uh this is my controller I'm
00:26:25.360
fetching from this public API and there is a sleep time here that every
00:26:31.399
like 20% of the time, should generate errors.
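For context, the controller being described probably looks roughly like the sketch below. The class name, API URL, and failure rate are placeholders I'm assuming, not the demo's actual code:

    require "net/http"
    require "json"

    class JokesController < ApplicationController
      JOKES_API_URL = "https://example.com/random_joke" # placeholder for the public jokes API

      def index
        # simulate the upstream API hanging for ~20% of requests
        if rand < 0.2
          sleep 2
          raise "simulated jokes API timeout"
        end

        @joke = JSON.parse(Net::HTTP.get(URI(JOKES_API_URL)))
      end
    end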
00:26:36.919
As you can see here, we're getting a couple of 500s, and now what we're going to do is see how a timeout error in my
00:26:44.600
controller can impact uh the graphs um here we can see that the
00:26:50.520
number of 500 requests is increasing while the number of 200 requests is
00:26:56.520
decreasing. The P99 already went all the way up, saying that it's taking over two
00:27:03.880
seconds to complete requests for 1% of the users
00:27:10.279
um and the P90 should go up too because you're raising errors for 20% of the
00:27:16.399
users but that takes a little bit longer uh okay the other thing I want to
00:27:22.919
show you is that I have an alert here right now it's normal and it says that
00:27:29.240
when my jokes page p50 is over two seconds um it will notify me on slack um
00:27:39.480
for example. I don't really have it set up to notify me, but it will go
00:27:45.760
red. Cool... yeah, P95 went up, let's
00:27:51.640
make it so the timeout occurs for 80% of the users, and for
00:28:00.279
this uh we will have to wait a minute or two in the meantime I'm going to show
00:28:05.960
you some extra stuff hope I'm not going too fast I have a
00:28:13.039
separate dashboard here, a separate graph, because I was curious how many of
00:28:18.200
those jokes have uh a title or not uh because one time I was refreshing
00:28:25.200
it all the time and most of the jokes didn't have a title, so I thought maybe most of the jokes don't have a title, but surprisingly,
00:28:33.000
it's actually kind of half and half, and you can see here that sometimes most of them don't have a title but some other
00:28:39.760
times most of them do have a title and I don't know this is an example of
00:28:44.919
something custom you can track for some feature in your
00:28:53.080
application uh okay the 500's going up but the p50 is still
00:28:59.840
stable didn't get enough errors um so I'm going to show a little bit of
00:29:07.080
code in the meantime. So I have prometheus_exporter here as an initializer; I'm just adding the middleware
00:29:13.880
and it handles everything else, but I do have to run it as a separate server.
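Roughly what that initializer looks like, based on the prometheus_exporter README (the demo's file may differ in details):

    # config/initializers/prometheus_exporter.rb
    unless Rails.env.test?
      require "prometheus_exporter/middleware"
      # instruments every request: HTTP status and timings per controller/action
      Rails.application.middleware.unshift PrometheusExporter::Middleware
    end

    # the collector runs as a separate process, listening on port 9394 by default:
    #   bundle exec prometheus_exporter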
00:29:19.679
And then here I have the prometheus.yml, and what I'm saying is that this is the end
00:29:26.279
point for prometheus_exporter: I'm telling Prometheus to scrape metrics from
00:29:33.200
Prometheus exporter on this port and then I go back here I have my custom
00:29:40.240
metric: it's a counter metric, the name is jokes custom metric, and it tells
00:29:46.440
me whether a joke has a title, so joke title present is going to be true, otherwise it's going to be false.
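A sketch of what that custom metric might look like with prometheus_exporter; the metric and label names here are my approximation of what's being described, not the exact code:

    # registered once, e.g. in an initializer
    JOKES_METRIC = PrometheusExporter::Client.default.register(
      :counter, "jokes_custom_metric", "jokes fetched, labeled by whether they have a title"
    )

    # then, in the controller action after fetching the joke
    JOKES_METRIC.observe(1, joke_title_present: @joke["title"].present?)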
00:29:52.720
And here, this is the endpoint where Prometheus collects the
00:29:59.440
metric so you can see my custom metric here uh the counter for the ones that
00:30:05.200
had a title or not. We can see the summary metric for HTTP request duration;
00:30:12.080
it creates that for every controller, I only have one controller here, and it also creates the quantiles that I
00:30:18.320
can use for my p50s P90s and Etc if I go back here now something
00:30:27.279
really bad is going on, right? The P50 is taking over two seconds, but
00:30:32.960
luckily our team uh is aware of it because we're getting an alert on slack
00:30:38.519
and everybody is taking action, has their eyes on it, and we
00:30:45.000
heard about it from observability and not from the customers, which is the main takeaway
00:30:50.559
of this presentation. Let's go back to the
00:30:56.120
presentation; we're almost wrapping up here. I do want to invite
00:31:01.360
folks for a reflection maybe even homework uh so my question is do you have observability implemented if yes uh
00:31:09.440
when was the last time the observability helped you find a bug uh was there a
00:31:14.480
recent incident that observability could have caught and yeah this is a great
00:31:19.720
opportunity for you to introduce some alerts and maybe some metrics too, and
00:31:25.240
remember that the alerts have to be actionable. And if you don't have observability at
00:31:30.440
all then why not try to implement one of those tools that I mentioned on this talk however if you're just a Dev
00:31:37.720
and you don't want a full-time job taking care of all the infrastructure
00:31:44.080
for Prometheus and Grafana and the open source tools, I recommend going with one of the commercial tools; most of
00:31:50.960
them have like free trials and free accounts that you can start with and sell the idea to to your team and your
00:31:58.399
company, and yeah, it should really pay off over time. And last question: does
00:32:04.440
your team have a playbook? If not, maybe create a playbook: list all the alerts there, all the metrics, what the intent is,
00:32:11.960
uh what to do when the alerts are going off too often and things like that uh
00:32:17.440
and all of these things are really valuable to the company, so if you see an opportunity to work on some of these
00:32:24.159
items, it should help your career, hopefully maybe even get you a promotion that
00:32:30.240
you're working on um so yeah keep that in mind it's a great
00:32:35.960
opportunity. And that's it, we're done. I have there my email and my
00:32:43.880
website if you want to reach out; my website has all the social media, I would love to keep in touch, so feel free to
00:32:50.240
give me a follow and send a message and I have some references here too I'm
00:32:55.279
going to share the slides later so you can access the links. We're good.