Ruby on Rails

Summarized using AI

Observability on Rails

Caroline Salib • August 30, 2024 • online

Observability on Rails: Enhancing Confidence in Code Deployment

In the talk "Observability on Rails," Caroline Salib discusses the concept of observability in software development, particularly within Ruby on Rails applications, and how it can enhance developers' confidence when deploying code to production. The presentation serves as an introductory guide to observability for those who may not be familiar with the term or its importance in modern software development.

Key Points Discussed:

  • Definition of Observability: Observability is defined as the ability to understand the internal state of a system by examining its external signals. Key signals include:

    • Logs: Text records that provide insight into application behavior and events.
    • Metrics: Numeric data points that track factors like CPU usage, memory consumption, and request rates.
    • Traces: Detailed records tracking the journey of requests through services, helping identify issues and delays.
    • Profiling: A newer aspect allowing monitoring of code behavior at runtime in production environments.
  • Importance of Observability: The speaker emphasizes the role of observability in:

    • Understanding the impact of code changes.
    • Quick problem detection.
    • Tracking system health and performance.
    • Improving user experience and enhancing system reliability.
  • Common Tools: Various tools and frameworks for implementing observability are mentioned, including:

    • Commercial Tools: DataDog, New Relic, Honeybadger, and Sentry.
    • Open Source Tools: Prometheus for metrics, Grafana for visualizations, and OpenTelemetry for managing telemetry data.
  • Real-world Example: Caroline shares her experience at Shopify during a significant backend migration that required enhanced visibility to avoid performance degradation. Dashboard alerts helped identify and resolve issues early in the rollout process, contributing to a successful transition.

  • Alerts and Playbooks: The necessity of alerts tied to metrics and clear documentation (Playbooks) for handling incidents is highlighted. Effective alerts must be actionable to avoid noise and ensure relevant team responses.

  • Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs): These metrics help quantify reliability and performance goals within applications.

  • Live Demo: A practical demonstration of using Prometheus alongside a Rails application illustrates how observability aids in monitoring performance and managing requests effectively.

Conclusions and Takeaways:

  • Emphasizing observability can prevent production issues and enhance the team's ability to respond quickly to incidents.
  • Implementing observability tools and creating actionable alerts can significantly improve software reliability.
  • Caution against introducing noisy alerts, which could lead to alert fatigue and neglect.
  • Developers are encouraged to assess their current observability practices and consider potential improvements to enhance their systems' reliability and performance.

Observability on Rails
Caroline Salib • August 30, 2024 • online

If you're unsure about what observability is or haven't given it much thought, this talk is perfect for you! We'll cover the basics of observability and how it can boost your confidence when shipping code to production by improving your ability to manage and monitor your Ruby on Rails applications.
https://www.wnb-rb.dev/meetups/2024/08/30

WNB.rb Meetup

00:00:00.640 well thank you so much everybody for joining uh so today we're gonna do a
00:00:06.200 little introductory chat about observability and
00:00:11.240 hopefully that will help you ship code with more
00:00:19.359 confidence all right let's do this um a little bit about me I'm a senior
00:00:26.519 software developer at Shopify I'm a pet mom I have two cats and a dog in my
00:00:32.960 free time I like to dream about living in the woods and um I'm also a gamer
00:00:39.000 I love playing video games when I have the time and I wish I had more time than I actually do um cool before we start
00:00:50.559 I actually am curious about what's your uh relationship with observability today
00:00:56.640 so I have here a few options and if you feel comfortable you can share in the chat if you're number one
00:01:03.320 two or three uh so number one is my company or team has observability
00:01:08.600 dashboards and alerts and I check them periodically and it directly impacts how I do my
00:01:14.520 work uh number two my company or team has observability dashboards and alerts
00:01:20.119 but I'm not very familiar with them and it doesn't impact how I do my
00:01:25.479 work number three will be my company or team doesn't have observability dashboards and alerts or it doesn't have
00:01:33.040 the need for them or number four um if you want to share in the
00:01:39.079 comments if you are in a different
00:01:46.719 scenario nice thanks thanks for sharing um for those who posted two and three I
00:01:52.439 think this talk will be perfect for you uh and for those who posted one well I think
00:01:57.680 you still can get some value from it so hang
00:02:03.439 tight um so why did I decide to do this talk in
00:02:08.840 the first place so I've been working in development for 13 years but I feel
00:02:14.000 like I've been coding in the dark because I've always been in scenario two or three where my company or team
00:02:21.239 doesn't have observability at all or it has it but it doesn't really impact how I do my work um I'm on the checkout
00:02:30.080 extensibility team at Shopify and uh earlier this year I led the backend migration uh to the new checkout
00:02:37.280 editor uh so basically uh at Shopify we used to let uh merchants customize their
00:02:43.720 checkout with uh checkout.liquid uh but for many reasons security
00:02:50.920 performance uh we wanted them to customize in a more controlled environment so we came up with Checkout
00:02:57.680 Extensibility and so we had to migrate all the millions of merchants uh to this
00:03:03.920 new platform and we needed better visibility to ensure that we would have
00:03:09.360 a smooth rollout and my lights went up weird I don't know why anyway our
00:03:18.200 main concern there was uh handling a massive increase in requests because we were migrating millions of
00:03:25.080 merchants and uh we wanted to ensure that there would be no performance
00:03:31.560 degradation uh I saw a question about liquid no more liquid on checkout I
00:03:37.000 think there's still liquid in other areas of Shopify um yeah so we implemented
00:03:44.599 metrics dashboards and we did uh in fact find an issue during the rollout but
00:03:52.040 it was related to a separate team because we had dashboards and alerts we were able to identify this earlier and
00:04:00.480 address it with the team they made a few tweaks and there was no performance degradation no issues the release was a
00:04:07.159 success and yeah that's a really good thing um but maybe you're here because
00:04:15.879 you don't even know what observability means so if you don't know what
00:04:21.560 observability means it's a term that we use to describe the ability to understand the internal state of a system
00:04:29.600 by examining its external signals um what are the signals so
00:04:37.080 there's logs logs are text records that let you see how the application is
00:04:42.479 doing when you run rails server you immediately see logs there but some
00:04:47.639 companies also have a production application where you can see the logs um there's metrics uh metrics are
00:04:56.280 numeric data points that you can use to track things such as CPU usage
00:05:05.039 and memory consumption uh request rates and even custom things uh like features and
00:05:13.400 anything really that you want to track when it goes to production and we also have traces
00:05:19.759 traces are detailed records of the entire journey of a request as it flows
00:05:26.160 through the different services and components so it helps you figure out possible issues or delays in
00:05:32.840 specific spans of the request
00:05:38.800 journey and profiling uh profiling was only introduced to OpenTelemetry and
00:05:44.800 observability uh earlier this year and it allows you to inspect the code
00:05:49.880 behavior and performance at runtime in production so pretty cool
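To ground those signals in Rails terms, here is a minimal sketch — not from the talk — of a log line plus an ActiveSupport::Notifications event that a metrics or tracing tool could pick up; the event name and payload are invented for illustration.

    # Not the talk's code: a small illustration of two signals in plain Rails.
    # A log line is a text record; an ActiveSupport::Notifications event is a hook
    # that a metrics or tracing library could turn into a metric or a span.
    ActiveSupport::Notifications.subscribe("jokes.fetch") do |_name, start, finish, _id, payload|
      duration_ms = ((finish - start) * 1000).round(1)
      Rails.logger.info("jokes.fetch took #{duration_ms}ms (source=#{payload[:source]})")
    end

    # Somewhere in the app, wrap the interesting work in an instrumented block:
    ActiveSupport::Notifications.instrument("jokes.fetch", source: "public_api") do
      # fetch the joke here
    end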
00:06:00.160 a few use cases uh for observability in production are uh understanding the impact
00:06:06.759 of new code changes uh quick problem detection improving system understanding tracking
00:06:14.120 uh health and performance enhancing the user experience which can be related to performance as well and scalability and
00:06:23.120 reliability so basically you're not only there coding on your development
00:06:28.840 machine fixing bugs but you're owning the changes from development to production
00:06:34.639 and ensuring that the system is
00:06:40.919 reliable uh here are a few commercial tools um there's Datadog New
00:06:46.759 Relic Honeybadger AppSignal Rollbar Sentry Scout and many many others um
00:06:53.639 some of them are more focused on logging or traces or
00:07:00.800 um metrics or a combination of all these things um yeah I'm curious if
00:07:07.000 you feel comfortable uh sharing do you use any of those tools at work or
00:07:12.560 for personal projects um feel free to share in the comments if
00:07:19.759 you feel comfortable yeah I used I used New Relic
00:07:26.599 in the past Honeybadger Rollbar Splunk and Datadog in companies that I worked
00:07:33.560 for and I used Sentry for personal projects and I really like
00:07:40.840 Sentry
00:07:48.440 nice yeah Datadog and Sentry and New Relic are very very popular thanks thanks
00:07:55.280 for sharing that's really insightful um we also have open source tools um so there's Prometheus
00:08:02.800 more focused on metrics Fluentd for logging Jaeger for tracing Grafana is uh
00:08:12.000 like a UI where you can see a bunch of things together uh so you can create dashboards and uh manage alerts and it
00:08:19.960 does a bunch of things I don't think it's only related to observability but um it's very commonly used with
00:08:27.159 Prometheus um and OpenTelemetry which is not a tool but an open source
00:08:32.800 framework I have a separate slide for it does anyone here use one of
00:08:38.360 those open source tools feel free to
00:08:45.839 share nice good question I will share more about these open source tools uh during the
00:08:54.680 talk nice yeah I'm going to cover a lot of Prometheus and Grafana in a more
00:09:01.839 introductory way but yeah cool uh let's talk a little bit about OpenTelemetry so OpenTelemetry
00:09:08.959 it's an open source framework or toolkit that's used to manage telemetry data
00:09:14.560 logs metrics traces and now profiling um it integrates with many
00:09:19.760 tools so Prometheus Grafana Datadog New Relic and it has been becoming very very
00:09:25.720 popular due to the standardization that it does so uh let's say your company uses
00:09:34.160 Datadog but the bill is getting too expensive so you want to switch to an open source
00:09:40.200 tool and then own your data and you know you have an infrastructure team to deal with that uh it's supposed to be
00:09:47.240 easy to uh switch from uh Datadog to Prometheus and Grafana or other open source
00:09:53.839 tools if they use this standard
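For reference, wiring OpenTelemetry into a Rails app can be as small as an initializer like the sketch below, assuming the opentelemetry-sdk and opentelemetry-instrumentation-all gems are in the Gemfile; the service name is made up, and where the data ends up (Prometheus, Grafana, Datadog, New Relic, ...) depends on whichever exporter you configure.

    # config/initializers/opentelemetry.rb — a minimal sketch, not the talk's code.
    require "opentelemetry/sdk"
    require "opentelemetry/instrumentation/all"

    OpenTelemetry::SDK.configure do |c|
      c.service_name = "jokes-app"  # invented name for this example
      c.use_all                     # auto-instrument Rails, ActiveRecord, Net::HTTP, etc.
    end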
00:09:59.640 cool uh I have a demo later uh in this presentation and so a few things I'm
00:10:05.600 going to mention here are so that during the demo you don't feel lost it will help you understand things a little bit
00:10:12.200 better so we're going to talk a little bit about Prometheus uh just a reminder it's an open source tool it's used mostly
00:10:19.680 for metrics and also alerts so Prometheus has four different
00:10:26.040 uh metric types you have counter counter is more to count occurrences of events uh
00:10:32.399 it can be used to count the number of requests or errors we have gauge gauge is more
00:10:39.519 used for measurements that fluctuate so temperature maybe the number of open
00:10:44.800 database connections um histogram is a very
00:10:51.360 interesting one it samples observations and counts them in configurable buckets
00:10:57.120 it's usually used for request durations response sizes and uh lastly we have
00:11:03.440 summary which is very similar to histogram but it does a little bit more
00:11:09.360 uh it also provides us with a total count and a sum of the observed
00:11:14.959 values and the quantiles which we're going to see more about later also used for request durations response
00:11:22.399 sizes things like that
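As a rough illustration of those four types, here is what registering them could look like with the prometheus_exporter gem used later in the demo; the metric names are invented and this assumes a recent version of the gem, so treat it as a sketch rather than the talk's code.

    # A sketch of the four Prometheus metric types via the prometheus_exporter client.
    require "prometheus_exporter/client"

    client = PrometheusExporter::Client.default

    requests = client.register(:counter,   "jokes_requests_total", "number of requests")
    db_conns = client.register(:gauge,     "open_db_connections", "currently open DB connections")
    duration = client.register(:histogram, "jokes_request_duration_seconds", "request duration")
    sizes    = client.register(:summary,   "jokes_response_size_bytes", "response size")

    requests.observe(1)      # counter: count occurrences of events
    db_conns.observe(12)     # gauge: a value that goes up and down
    duration.observe(0.34)   # histogram: samples counted in configurable buckets
    sizes.observe(2048)      # summary: tracks count, sum, and quantiles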
00:11:30.680 uh cool Prometheus has a query language called PromQL uh it allows you to query metrics aggregate data sum and average rates and
00:11:38.720 visualize trends uh here's an example this is um a
00:11:48.480 query using PromQL it's getting the rate of the Ruby HTTP requests total on the
00:11:54.440 jokes controller using this rate interval variable which is from
00:12:00.160 Grafana and it's aggregating by the status and so with this query we would expect
00:12:07.360 to see something like this um we see two results grouped by the HTTP
00:12:13.639 status and then you can take some insights like between 2:10 and 2:20 the number
00:12:21.000 of error responses that return 500 is bigger than the 200s so something
00:12:28.680 must have gone wrong here right and I want to mention another
00:12:36.240 thing about PromQL and Prometheus because when I started learning about it um it was
00:12:43.760 driving me crazy that I couldn't get exact numbers um on my graphs and now I
00:12:50.160 understand why so I want to share it with you in case you're going to work with Prometheus so you don't go through the
00:12:55.839 same trouble that I was having so the way Prometheus works it takes snapshots of data at regular intervals and events
00:13:04.639 between these snapshots may not be captured exactly because of
00:13:09.920 the intervals uh so we use rates uh to smooth out the trends and there's an
00:13:17.279 example of how to get a rate over an interval of five minutes and then you
00:13:22.760 might not see the exact number maybe you're testing locally and you only have a few requests like you send a
00:13:29.600 few occurrences of a metric and you expect to see the exact number you
00:13:34.920 you'll see something very similar to what you expect but not the exact number still pretty useful
00:13:43.000 though uh let's talk a little bit about SLIs SLOs and SLAs um they help the company or team to
00:13:52.279 understand and quantify the reliability and performance of the services okay but what the hell is
00:13:58.440 that um SLIs are service level indicators so the actual measurements of a
00:14:04.880 system uh SLOs are service level objectives they are targets for
00:14:11.759 the SLIs so targets for the metrics and SLAs are service level agreements
00:14:19.079 a formal commitment that the company has with the customers I'm going
00:14:25.000 to give you an example and then it should be clearer I found out that GitHub
00:14:30.560 actually has public SLAs which is pretty cool it says that GitHub commits
00:14:36.440 to maintain at least 99.9% uptime for its services and failing this criteria
00:14:42.600 means that a credit claim can be made by the customer I didn't know that I thought it
00:14:48.079 was pretty cool so you know next time you see that angry unicorn there maybe you can make a credit
00:14:54.360 claim um so that's a public SLA it exists and this is a fake SLO just to
00:15:00.639 give you an example maybe in order for GitHub to um ensure this SLA doesn't
00:15:07.959 fail they could have an SLO for uptime for Actions which is one of their
00:15:13.160 services the SLO usually will be slightly higher than the SLA
00:15:21.759 because you want to be conservative you want to be alerted before the SLA actually
00:15:29.680 fails um in order to measure this goal maybe
00:15:36.600 you could have an SLI for total triggered executions and another one for unavailable executions and then crossing
00:15:45.079 this data could get you to an SLO that will prevent the SLA from
00:15:51.959 failing another example of an SLO could be uh 99.99% uptime for issues and
00:15:58.480 requests as another service and then maybe an SLI for that could be the
00:16:03.720 service availability uh crossed with the error rate and then you cross this information you calculate your goal and
00:16:11.079 then you have your SLO your SLI and your SLA
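To make the arithmetic concrete, here is a tiny sketch with invented numbers — not GitHub's real figures — showing how two SLIs could be combined into an availability measurement and checked against an SLO that sits above the SLA.

    # Invented numbers: combining two SLIs into availability, checked against an SLO.
    total_executions       = 1_000_000  # SLI: everything we triggered
    unavailable_executions = 600        # SLI: executions that failed to run

    availability = 1.0 - (unavailable_executions.to_f / total_executions)
    slo          = 0.9995               # internal target, stricter than the public SLA
    sla          = 0.999                # the public commitment

    puts format("availability: %.4f%%", availability * 100)  # => availability: 99.9400%
    warn "SLO breached - investigate before the SLA (#{sla * 100}%) is at risk" if availability < slo

Here the SLO is breached while the SLA still holds, which is exactly the point made above: the stricter internal target warns the team before the formal commitment is in danger.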
00:16:17.399 uh still on SLOs let's talk a little bit about percentiles um I've
00:16:24.600 always seen people talking about p90 p50 p99 and I'm like I don't know
00:16:31.519 what that means but I'll go with it um but let's get some clarity about
00:16:38.959 that so p50 will be like the average uh the average of
00:16:46.279 something so let's say the p50 of a request is 1 second it means that 50% of
00:16:53.600 the requests are actually faster than that and the other 50% are slower so the
00:16:59.440 average uh user experience the P95 however will be where 95% of the
00:17:14.199 requests are um faster than that but that's the typical worst case
00:17:19.720 scenario for users and it means that 5% of the users are getting performance
00:17:26.039 worse than that so if it's two seconds it means that 5% of the users are having a latency that's actually
00:17:32.640 higher than two seconds and P99 is usually used
00:17:38.480 for rare performance issues it means that 1% that's getting the worst performance ever so if the P99 is 3
00:17:46.480 seconds there's 1% of the users that are having really bad performance
00:17:52.480 issues on their application
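As a small worked example of those definitions (the latency numbers are invented), percentiles can be read off a sorted list of request durations:

    # Invented latencies in seconds. pNN is the value below which roughly NN% of requests fall.
    latencies = [0.2, 0.3, 0.3, 0.4, 0.5, 0.6, 0.8, 1.0, 2.1, 3.2].sort

    def percentile(sorted, pct)
      rank = ((pct / 100.0) * (sorted.length - 1)).round  # nearest-rank style lookup
      sorted[rank]
    end

    p50 = percentile(latencies, 50)  # typical experience
    p95 = percentile(latencies, 95)  # typical worst case: 5% of users see worse
    p99 = percentile(latencies, 99)  # the rare, truly bad cases
    puts "p50=#{p50}s p95=#{p95}s p99=#{p99}s"  # => p50=0.6s p95=3.2s p99=3.2s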
00:17:59.760 none of the things that we saw so far matters at all if we don't have alerts because you can have
00:18:05.600 beautiful dashboards and metrics but if nobody looks at them when there is a real issue uh it's worth nothing so let's
00:18:14.000 talk a little bit more about that um you can create alerts based on your SLIs and
00:18:20.280 SLOs uh and use them to warn the team when the application isn't working as
00:18:26.120 expected most monitoring tools will let you configure alerts uh but be careful
00:18:32.240 not to introduce noisy alerts because um if you have an alert that's going off
00:18:37.280 all the time and it's not a real issue there's this weird thing that happens to our brain where it creates a pattern
00:18:44.400 and people start to ignore it and it's nobody's fault uh it's just something that
00:18:50.720 you need to be very careful about not to create noisy alerts same goes for uh
00:18:56.919 exceptions and uh lastly alerts must be actionable I'm going to talk more about
00:19:05.280 that um one thing that you can do in order to make the alerts more accessible
00:19:10.640 and actionable is to have a playbook uh the playbook could be documentation that uh goes in the
00:19:19.000 company's uh internal documentation system or it could go in the alert
00:19:25.360 description itself uh or even better it could be both um so in the playbook or
00:19:32.159 the alert description or both you should describe what to do when the alert is triggered and also what to do when there
00:19:37.679 are false positives false alerts again we want to avoid noisy
00:19:46.280 alerts uh my suggestion here for an alert description is to have a very
00:19:51.760 specific title uh describe how the alert is being calculated because you
00:19:57.360 don't want to go on vacation then the alert goes off and nobody understands why or whether it's a real
00:20:03.039 problem or not you want to uh describe why the alert may go off um what to do
00:20:10.640 when the alert goes off including what to do when the alert goes off by
00:20:15.919 mistake when it's not a real issue because again I'm gonna be talking about this many times we don't want
00:20:21.440 noisy alerts and another cool thing to add is uh the customer
00:20:26.960 impact uh I have an example here I hope it's readable but I will also read it um
00:20:32.840 so the title here is very specific it says backend errors jokes API custom error
00:20:38.280 monitor has received more errors than normal the amount of errors on the back
00:20:44.039 end for the jokes API has exceeded the threshold of 0.1% because it's now at
00:20:51.159 0.2% how do we calculate this metric we are using the p50 percentile because
00:20:57.039 here we just care about the average uh impact
00:21:02.559 um this monitor may go off because um and here I added whatever I could
00:21:09.440 think of that could cause this error so I added we are using an
00:21:15.440 external API and the API might be down the version of the API might have changed or the code that consumes data
00:21:23.240 from the API might be broken uh how to fix it so the first thing is to investigate
00:21:28.840 what's going on so if you look up the jokes API custom error on the
00:21:35.360 logs platform it might give you an idea of why the error is
00:21:40.840 happening uh and then there is also an instruction here saying that if the alert is going off uh too often with false
00:21:49.200 positives to consider handling it in a different way or increasing the threshold
00:21:55.559 because maybe there is a known issue the team doesn't have time to work on it's better to increase the threshold of
00:22:03.039 0.1% to let's say 0.3% and then get alerted when it goes to
00:22:08.799 0.4% than let the alert pop all the time and get ignored when things uh go
00:22:16.440 off the rails and then lastly here we have the impact so customers could have had
00:22:22.279 problems accessing the index page which is really bad so you put the
00:22:28.240 impact there uh maybe the person who sees the alert might uh you know take
00:22:34.279 action
00:22:41.440 faster awesome now I'm gonna do something wild and I'm going to try a
00:22:46.640 live demo I hope it goes well uh for this live demo I'm using uh
00:22:53.240 this Rails project here uh I'm gonna share the slides later so you can access
00:22:59.480 the links and I'm going to be covering
00:23:05.480 metrics and I'm going to be using Prometheus and Grafana to collect and visualize
00:23:10.679 metrics and I'm using this gem called prometheus_exporter to export the metrics
00:23:16.440 from the Rails app cool so this is how my application
00:23:24.200 looks nothing fancy it's fetching a joke from a public jokes
00:23:29.600 API uh when I refresh it's going to fetch a new joke and then a new joke this joke doesn't have a title some of
00:23:36.679 them have a title uh why do we care we don't uh but I have a custom metric
00:23:43.080 for that I'm going to show you later we have a dashboard here so this is how Grafana looks in the backend I
00:23:50.480 see the jokes are we good I hope so I got some bad ones that I was like a
00:23:56.039 little bit concerned about refreshing the page because who knows what's going to show up there cool um let me here so
00:24:05.960 I have this is how Grafana looks and in the back end uh there is Prometheus
00:24:12.799 and prometheus_exporter exporting metrics to
00:24:17.919 Prometheus we have here a board with the average request duration we have P99 P90
00:24:24.320 and p50 as we can see here p50 is um around 300 milliseconds uh P90 about
00:24:34.000 300 milliseconds as well P99 almost 400 milliseconds so it's not bad it's
00:24:40.559 actually a good performance for this purpose um I have another one
00:24:45.679 here which is the rate of HTTP requests grouped by status code uh so
00:24:53.840 here we have the 500s we've been having zero 500 uh requests which is great and here are
00:25:02.000 all the 200s so mostly we're getting 200 uh so mostly my
00:25:07.559 controller is succeeding um cool uh I have a script here that's
00:25:15.200 pinging my application every half a second and this is my
00:25:21.159 controller I'm gonna uncomment this code here and I'm going to
00:25:26.720 simulate the API timing out so I'm adding a sleep time of two
00:25:33.799 seconds and pretending that the API is timing
00:25:38.840 out um and then we should see sorry sorry
00:25:44.279 just real quick I don't think we see the code only the dashboard oh sorry let
00:25:49.720 me I think I need to share the screen again sorry thanks for the heads up
00:26:01.600 oh I should have selected entire screen
00:26:08.679 instead okay I'm going to go back to the code here so yeah I was showing
00:26:14.000 that I have a script here that's pinging the application every half a
00:26:19.520 second and I have uh this is my controller I'm
00:26:25.360 fetching from this public API and there is a sleep time here that every
00:26:31.399 like 20% of the time should generate errors
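For readers following along without the video, a controller like that could look roughly like the sketch below; this is not the talk's exact code — the API URL and the failure simulation are stand-ins.

    # A stand-in for the demo controller: fetch a joke from a public API and,
    # while the simulation is enabled, pretend the API is timing out ~20% of the time.
    require "net/http"
    require "json"

    class JokesController < ApplicationController
      SIMULATE_TIMEOUT = true  # toggled on for the demo

      def index
        if SIMULATE_TIMEOUT && rand < 0.2
          sleep 2                               # pretend the upstream API is hanging
          raise "simulated joke API timeout"    # surfaces as a 500 response
        end
        body  = Net::HTTP.get(URI("https://jokes-api.example.com/random_joke")) # placeholder URL
        @joke = JSON.parse(body)
      end
    end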
00:26:36.919 as you can see here we're getting a couple of 500s and now what we're going to do is see how a timeout error in my
00:26:44.600 controller can impact uh the graphs um here we can see that the
00:26:50.520 number of 500 requests is increasing as well as the number of 200 requests is
00:26:56.520 decreasing um the P99 already went all the way up saying that it's taking over two
00:27:03.880 seconds to complete requests for 1% of the users
00:27:10.279 um and the P90 should go up too because we're raising errors for 20% of the
00:27:16.399 users but that takes a little bit longer uh okay the other thing I want to
00:27:22.919 show you is that I have an alert here right now it's normal and it says that
00:27:29.240 when my jokes page p50 is over two seconds um it will notify me on Slack um
00:27:39.480 for example I don't really have it set up to notify me but it will go
00:27:45.760 red um cool yeah P95 went up let's
00:27:51.640 make it so the timeout occurs for 80% of the users and for
00:28:00.279 this uh we will have to wait a minute or two in the meantime I'm going to show
00:28:05.960 you some extra stuff hope I'm not going too fast I have a
00:28:13.039 separate dashboard here a separate graph because I was curious how many of
00:28:18.200 those jokes have uh a title or not uh because one time I was refreshing
00:28:25.200 it all the time and most of the jokes didn't have a title so I thought maybe most of the jokes don't have a title but surprisingly
00:28:33.000 actually it's kind of half and half and you can see here that sometimes most of them don't have a title but some other
00:28:39.760 times most of them do have a title and I don't know this is an example of
00:28:44.919 something custom you can track for some feature uh in your
00:28:53.080 application uh okay the 500's going up but the p50 is still
00:28:59.840 stable I didn't get enough errors um so I'm going to show a little bit of
00:29:07.080 code in the meantime uh so I have prometheus_exporter here as an initializer I'm just adding the middleware
00:29:13.880 and it handles everything else uh but I do have to run it as a separate
00:29:19.679 server and then here I have the prometheus.yml and what I'm saying is that this is the end
00:29:26.279 point for prometheus_exporter I'm telling Prometheus to scrape metrics from
00:29:33.200 prometheus_exporter on this port and then I go back here I have my custom
00:29:40.240 metric it's a counter metric the name is jokes custom metric and it tells
00:29:46.440 me whether a joke has a title so joke title present is going to be true otherwise it's going to be
00:29:52.720 false um and here uh this is the endpoint where Prometheus collects the
00:29:59.440 metrics so you can see my custom metric here uh the counter for the ones that
00:30:05.200 had a title or not we can see the summary metric for HTTP request duration
00:30:12.080 uh it creates that for every controller I only have one controller here and it also creates the quantiles that I
00:30:18.320 can use for my p50s p90s etc
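Pieced together from that description, the wiring might look roughly like the sketch below. It assumes the prometheus_exporter gem and is not the talk's exact files; only the metric name jokes_custom_metric and the joke_title_present label come from the talk, everything else is illustrative.

    # config/initializers/prometheus_exporter.rb — a sketch of the setup described above.
    # The middleware records per-request metrics; the collector runs as a separate
    # process (`bundle exec prometheus_exporter`), and Prometheus scrapes its
    # /metrics endpoint (port 9394 by default, pointed at in prometheus.yml).
    require "prometheus_exporter/middleware"
    Rails.application.middleware.unshift PrometheusExporter::Middleware

    # A custom counter, as in the demo: counts fetched jokes by title presence.
    require "prometheus_exporter/client"
    JOKES_METRIC = PrometheusExporter::Client.default.register(
      :counter, "jokes_custom_metric", "counts fetched jokes by title presence"
    )

    # In the controller action, after fetching a joke (illustrative call):
    # JOKES_METRIC.observe(1, joke_title_present: @joke["title"].present?.to_s)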
00:30:27.279 if I go back here now something really bad is going on right the p50 is taking over two seconds uh but
00:30:32.960 luckily our team uh is aware of it because we're getting an alert on Slack
00:30:38.519 and everybody is taking action uh has their eyes on it and we
00:30:45.000 heard it from observability and not from the customers which is uh the main takeaway
00:30:50.559 of this presentation uh let's go back to the
00:30:56.120 presentation and we're almost wrapping up here I do want to invite uh
00:31:01.360 folks for a reflection maybe even homework uh so my question is do you have observability implemented if yes uh
00:31:09.440 when was the last time observability helped you find a bug uh was there a
00:31:14.480 recent incident that observability could have caught and yeah this is a great
00:31:19.720 opportunity for you to introduce some alerts and maybe some metrics too and
00:31:25.240 remember that the alerts have to be actionable and if you don't have observability at
00:31:30.440 all then why not try to implement one of those tools that I mentioned in this talk however if you're just a dev
00:31:37.720 and you don't want a full-time job um taking care of all the infrastructure
00:31:44.080 for Prometheus and Grafana and the open source tools I recommend going with one of the uh commercial tools most of
00:31:50.960 them have free trials and free accounts that you can start with and sell the idea to your team and your
00:31:58.399 company and yeah it should really pay off with time and last question does
00:32:04.440 your team have a playbook if not maybe create a playbook list all the alerts there all the metrics what's the intent
00:32:11.960 uh what to do when the alerts are going off too often and things like that uh
00:32:17.440 and all of these things are really valuable to the company so if you see an opportunity to work on some of these
00:32:24.159 items it should help your career hopefully maybe even a promotion that
00:32:30.240 you're working on um so yeah keep that in mind it's a great
00:32:35.960 opportunity and that's it we're done I have there my email and my
00:32:43.880 website if you want to reach out my website has all the social media I would love to keep in touch so feel free to
00:32:50.240 give me a follow and send a message and I have some references here too I'm
00:32:55.279 going to share the slides later uh so you can access the links we're good