Aaron Pfeifer

Keeping the lights on: Application monitoring with Sensu and BatsD

The good news: you're quickly signing up new customers, you've scaled your Rails app to a growing cluster of 10+ servers, and the business is really starting to take off. Great! The bad news: just 30 minutes of failures is now measured in hundreds or even thousands of dollars. Who's going to make sure the lights stay on when your app starts to fall over? Or worse, what if your app is up, but sign-ups, payments, or some other critical function is broken?

Learn how you can build a robust monitoring infrastructure using the Sensu platform: track business metrics in all of your applications, any system metric on your servers, and do so all with the help of BatsD - a time series data store for real-time needs. We'll also talk about how to look at trending data and how you can integrate Sensu against PagerDuty, RabbitMQ, or any other third-party service. Oh, and of course - everything's written in Ruby, so you can even use your favorite gems!


RailsConf 2013

00:00:14.920 so let me go ahead and get started first of all my name is aaron pfeifer for those of
00:00:21.439 you don't know me today i'm going to be talking about application monitoring using two open
00:00:26.480 source ruby projects sensu and batsd there are entire conferences
00:00:32.079 on application monitoring but today we're going to look at a small slice of that using these two technologies and what we
00:00:37.840 at tapjoy call keeping the lights on there's a github repo link at the bottom here and that has all the examples that
00:00:44.719 are in the slides as well as documentation on how to get sensu and batsd up and running on your own laptop
00:00:52.399 so there are really two things that i want everyone to sort of take away from this session
00:00:57.840 the first is that as engineers as organizations we really need to be focusing and spending time and effort on looking into
00:01:05.360 monitoring our business and system metrics so that we can operate effectively you know we're all here building rails
00:01:11.680 apps we're all spending time looking at performance good design but we often forget a major
00:01:18.240 part of that process of building new features which is how do we monitor it how do we make sure it's actually working properly when it goes
00:01:24.840 live the second thing that i want everyone to take away is that you know there are a lot of open source
00:01:31.360 technologies available for monitoring and it's really exploded over the past few years and ruby has been an important part of
00:01:37.759 that but so is rails and the principles that guided rails things like convention
00:01:42.880 over configuration simplicity and you know we really
00:01:48.399 still have a long way to go there are a lot of problems we need to tackle we still have a lot to learn
00:01:54.720 because the truth is there are still at least a few of us in this room who have walked into work over the past
00:02:01.680 year and found that a deploy that went out the other day broke the world right
00:02:07.680 and that's surprising why is this why are we responding that way because i mean we don't deploy code
00:02:14.879 unless we spec our code first right i mean maybe that's not always the case but
00:02:21.840 that's the ideal at least you know write specs before it goes live but the truth is
00:02:27.840 what happens when it goes live there's nothing testing that and that's what monitoring is for
00:02:33.040 monitoring is our 24/7 rspec running live in production against production data
00:02:38.800 this needs to be part of the ideal this is going to be catching those problems that we wouldn't have
00:02:44.480 seen otherwise a little bit about me i'm based out of boston i've been hacking on ruby for
00:02:51.440 quite a while now i've been working on a bunch of different ruby and rails projects though my main project is
00:02:56.480 called state_machine if you haven't heard of it before i'm a uh yeah
00:03:02.239 thank you thanks um i'm a principal engineer at tapjoy i previously worked
00:03:07.280 at viximo and uh you know over the past several years i've really learned
00:03:12.319 a lot and one of the primary things is that as you grow and as you scale
00:03:18.239 you tend to hit some of these issues with downtime and when you encounter downtime
00:03:23.760 it's going to cost real money it has a real effect on your company you know if you're a 10 million dollar
00:03:29.599 business hours or days of downtime is going to cost a lot hundreds of thousands of dollars
00:03:35.599 but the question is when we talk about downtime here what does that really mean and how can we use
00:03:41.440 monitoring to help improve that metric because this is what we think of as downtime
00:03:48.159 pingdom telling us our website returns 500s or takes too long to respond
00:03:53.920 this is what we think of as an outage but the question is what do you do in this case this is amazon returning with no css no images
00:04:01.680 returns 200 it's fast respond has all the right content no one's going to buy anything when they hit this website so
00:04:08.000 that has to be considered an outage so when we think about this you know pingdom's not going to catch this what
00:04:13.920 are the different metrics that we can monitor that are going to detect this type of downtime
00:04:20.160 i like to categorize that into three different types and the first are business metrics these aren't going
00:04:25.680 to tell you the root cause but these are sort of your backup these are going to tell you when there's an
00:04:30.720 outage that's affecting a key performance indicator for your business
00:04:36.080 at its most basic this is something like revenue but this could include other things like
00:04:41.360 conversions like purchases how about new users how many of us have run into this before
00:04:47.600 don't validate the presence of a boolean field we've all run into this at some point
00:04:54.080 and this validation seems innocent but the truth is that this is going to silently fail and
00:05:00.240 prevent new users from being created and on the surface nothing seems wrong we get no exceptions
00:05:05.759 so we're not going to know about it unless we're tracking our new users metric
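The failure described here can be shown without a full Rails app. Rails' presence validator rejects anything that is blank?, and ActiveSupport treats false as blank, so a boolean that is legitimately false never validates. The helpers below are hypothetical stand-ins that mimic that convention; they are not Rails' actual implementation:

```ruby
# Sketch of why `validates :tos_accepted, presence: true` silently fails
# for booleans: the presence validator rejects blank? values, and in
# ActiveSupport `false.blank?` is true. These methods only mimic that
# convention for illustration.
def blank?(value)
  value.respond_to?(:empty?) ? !!value.empty? : !value
end

def presence_valid?(value)
  !blank?(value)
end

# The fix for boolean columns: validate inclusion instead of presence.
def inclusion_valid?(value)
  [true, false].include?(value)
end

puts presence_valid?(true)    # true
puts presence_valid?(false)   # false -- a legitimate false never saves
puts inclusion_valid?(false)  # true  -- both boolean values accepted
```

In a real model the fix is `validates :field, inclusion: { in: [true, false] }` rather than a presence validation.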
00:05:12.320 and the second category i like to call application performance metrics and this is the type of information that
00:05:18.000 you might usually see from a service like new relic so these metrics deal with individual
00:05:24.000 requests and how performance is perceived by the user and so this is probably one of the most
00:05:29.759 common examples this is uh looking at the response time for a particular application
00:05:35.199 you can see it's taking us over 100 seconds to return to the user and that's going to cause major issues
00:05:42.160 and be directly correlated to conversions and the third
00:05:48.320 group i like to refer to as system metrics and these are going to tell you the root cause of
00:05:53.360 any infrastructure outage these are things like memcache metrics redis metrics network activity
00:05:59.919 here's an example of a disk usage metric we're not rotating our logs we forgot and that's going to cause major issues
00:06:07.039 in a server like your database server right so this is the type of thing we need to
00:06:12.240 be monitoring we have to have in place and so you know with all these different types of metrics we've talked about
00:06:18.240 pingdom being really useful for worldwide outages we've talked about new relic being really useful
00:06:24.240 for application performance but we haven't talked about a tool that we can use to access and
00:06:31.039 monitor real-time system and business metrics something that works really well in the cloud
00:06:36.560 and ideally something that would be written in ruby something that we might be able to read understand and
00:06:42.000 hack on and this is really where sensu and batsy really make a difference and shine
00:06:49.199 so i'm going to talk a little about what these two technologies are how they fit into your architecture
00:06:54.720 we'll talk a little bit about the components and then we'll go into some examples so first
00:07:00.240 sensu is a framework written by sean porter at sonian and this is the base framework
00:07:08.240 for your monitoring infrastructure so it's built with cloud-based apps in mind and at its core
00:07:14.000 function its purpose is to basically run commands on various servers in your
00:07:19.360 infrastructure and then process the results of those commands and that's that's really what it's very
00:07:25.280 good at now there are a lot of features that complement that but at its core it's really good at that
00:07:32.080 and batsd is a time series database and if you've heard of statsd
00:07:37.520 you know basically what batsd is about because it actually implements the same protocol
00:07:42.720 its core functions are to track our real-time metrics and aggregate them or roll them up over
00:07:48.879 over periods of time unfortunately batsd doesn't have a logo so i took it upon myself to give it
00:07:55.280 one that wasn't obvious uh so let's walk
00:08:00.479 through the architecture a little bit see how this actually fits into your architecture so we can
00:08:07.120 assume that you have a set of servers that are running in production these you want to monitor these can be app servers
00:08:12.319 they can be databases once they're running they're configured to connect to the sensu stack and this
00:08:18.319 is what allows it to run system specific checks now once we have all those checks running we probably
00:08:24.319 want to start tracking some of the metrics from those checks if we add memcache servers maybe they'd
00:08:29.759 be our memcache metrics so here's where batsd comes into play so that those metrics that are coming back
00:08:36.320 from those checks go through sensu and are stored in batsd so once we have all those system metrics being
00:08:42.080 stored the next step is where do we get our business metrics and this is where our app servers talk
00:08:49.440 directly to our batsd server using the same mechanism that we would normally track our system
00:08:56.480 metrics and the final piece is once we now have access to those particular business
00:09:01.760 metrics we can now implement some checks through sensu which read the data from batsd and then can
00:09:09.360 alert on particular thresholds of that data so at a high level this is how we're using
00:09:14.959 batsd and sensu to sort of fill that gap in our monitoring solution
00:09:21.279 so let me talk a little bit about the inner workings of sensu and its core
00:09:26.720 components so the core component here is the server
00:09:33.760 process and this really orchestrates the entire system its primary responsibility is to know
00:09:41.120 when to publish requests to run checks on your servers and what servers to publish those
00:09:48.640 requests to now the when part is not that interesting it's sort of a cron style
00:09:54.160 type implementation but the who part i think is actually more interesting you
00:10:00.320 know and i actually think it's easiest to think about this from the perspective of the message bus in sensu now the
00:10:07.680 message bus is what allows sensu to communicate to the different servers in your infrastructure and in this case
00:10:13.519 sensu uses rabbitmq and if you haven't used rabbitmq you should take a look at it it's a really awesome technology
00:10:20.480 i actually think this is the one component that really makes sensu work so well so you know when sensu
00:10:28.800 fires up for the first time it actually has no idea what servers are running in your
00:10:33.839 infrastructure it's only as clients start actually registering with
00:10:38.880 rabbitmq that sensu is aware of them and starts monitoring them automatically so let's look at an
00:10:45.360 example so on the left here you see what looks essentially like server roles we have things like
00:10:51.760 memcache redis in sensu terminology these are subscribers
00:10:57.360 and each subscriber has in rabbitmq a fanout exchange which means any message
00:11:02.720 or any check request that gets published to that particular subscriber
00:11:07.920 will get fanned out or sent out to each client that is consuming from that
00:11:13.120 exchange so you can see we actually have these bindings in place which provide that connection so on the
00:11:20.000 right we have what looks like a bunch of gibberish and letters but those are actually our clients in our system
00:11:25.279 those are our servers so they fire up and if we have something like a memcache server it knows
00:11:30.720 it wants to listen to the memcache role and if we have maybe a core role that
00:11:35.760 represents everything so any memcache check that gets published through the memcache exchange
00:11:42.160 is going to be picked up by any client bound to that so once a client picks up a request to
00:11:49.760 run a check this is where the sensu client process comes into play so the client is responsible for running
00:11:56.160 that check and returning the results to sensu and there are actually two requirements for checks the two very
00:12:02.800 basic requirements the first is to provide an exit code this tells you
00:12:08.160 the severity of the check is it warning is it critical is it okay and
00:12:13.920 the second thing is to provide output and this could be individual metrics it could be debugging information it all
00:12:20.320 depends on the individual check itself now once you've got a check that's
00:12:25.760 running returning results back to sensu the next question is what do we do with
00:12:31.680 those results and this is where handlers come in handlers are basically commands that
00:12:36.880 take in the output of a check and do something with it whether it's sending an email
00:12:43.200 maybe sending a pagerduty alert and you can essentially
00:12:48.399 think of the way this works as a command pipe in linux so that we get a little bit of additional metadata
00:12:54.000 so the information on the left there that json actually gets piped into standard input
00:13:00.079 to your handler command and handlers can do a lot with that and we'll see that in a little bit
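As a sketch of that pipe model, a minimal handler just reads the event JSON from standard input and pulls out what it needs. The field names below follow the sensu 0.x event shape described in the talk; treat the exact schema as an assumption to verify against the sensu docs:

```ruby
require 'json'

# Minimal sketch of a pipe-style handler: sensu writes the event JSON to
# the handler's standard input, and the handler does something with it.
def handle(event_json)
  event  = JSON.parse(event_json)
  client = event['client']['name']
  check  = event['check']['name']
  status = event['check']['status'] # 0 = ok, 1 = warning, 2 = critical
  "#{check} on #{client} exited with status #{status}"
end

# In a real handler this would be: puts handle(STDIN.read)
sample = '{"client":{"name":"i-424242"},"check":{"name":"check_load","status":2}}'
puts handle(sample)  # check_load on i-424242 exited with status 2
```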
00:13:06.399 so the sensu admin is really where this all starts to actually come together and this is actually a rails app again
00:13:12.720 open source in the sensu repository and from here you can see all the checks that are running the alerts
00:13:18.880 the clients that are connected it's all it's getting all this data through a restful api and given that it's in rails
00:13:24.880 we can all go in and sort of make modifications add additional things we should all be familiar with sort of
00:13:30.240 the underlying code of this admin so this is a diagram that shows all
00:13:35.920 those components now coming together and i really think this shows the simplicity of sensu
00:13:41.360 and if you actually look at the underlying implementation of sensu it's just about as straightforward it is
00:13:47.120 a pretty simple product and the meat of the product
00:13:52.560 really exists in the community provided checks and handlers those are the really really important
00:13:58.320 aspects so now let's shift a little bit to the batsd architecture i mentioned this is
00:14:04.079 our time series database now batsd basically has two
00:14:10.800 processes that are at its core the first is the receiver process this is responsible for taking in those
00:14:17.279 real-time metrics aggregating them or rolling them up over time and then storing them in a couple
00:14:22.720 of different data stores memory redis and disk once that's stored we then have the api
00:14:28.320 or the server process which allows you to then query for those data points and get the
00:14:33.600 get those previous data points out of those different data stores and again this is actually this is
00:14:38.959 completely written in ruby and it's on top of eventmachine so we should all be able to go in and at least start to follow the
00:14:44.959 implementation i mentioned there are three different data stores and which one gets used
00:14:51.279 depends on actually how real time the data is so for a typical scenario the memory is
00:14:57.279 going to be storing the last data point and that'll be data from less than a minute ago redis
00:15:02.720 is used for sort of the shortest term roll up of your data and that could be data from minutes or hours sometimes days ago
00:15:09.760 and the file system is what gets used for long-term storage this would be data for
00:15:15.199 months potentially even years ago so you can see that the different technologies were used depending on how
00:15:22.079 real time we needed the data to be now there are actually three different
00:15:27.600 types of metrics that can be reported to batsd the first are counters and they all
00:15:34.160 differ actually depending on how they behave over time for counters
00:15:40.320 these represent relative changes in a value so plus one minus one plus five
00:15:45.600 and these actually get summed up over time timers represent absolute values at
00:15:51.920 any point in time and as time goes on these values actually get averaged
00:15:57.440 as they're rolled up and we track a couple different stats about them like standard deviation 90th percentile and finally we have
00:16:05.120 gauges which are a little bit of an oddball i would say you probably don't use them that often
00:16:10.720 they are every data point that gets recorded actually gets tracked on disk and they're never aggregated in any
00:16:16.560 fashion so they don't actually take advantage of it being a time series database but it's there if you need it
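The three behaviors can be summarized with a toy rollup, a sketch of the semantics only and not batsd's actual storage code: counter deltas are summed per interval, timer values are averaged (alongside stats like the 90th percentile), and gauges are kept verbatim:

```ruby
# Toy rollup illustrating the three statsd/batsd metric types.
def rollup_counter(deltas)
  deltas.sum                     # relative changes get summed
end

def rollup_timer(values)
  values.sum.to_f / values.size  # absolute values get averaged
end

def rollup_gauge(values)
  values                         # every point kept, never aggregated
end

puts rollup_counter([+1, +1, -1, +5])  # 6
puts rollup_timer([100, 200, 300])     # 200.0
p    rollup_gauge([3, 7, 7])           # [3, 7, 7]
```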
00:16:24.320 so configuring batsd is actually really simple this is the entire configuration for batsd
00:16:29.680 and it's able to do this by making a few assumptions about your data you know following a little bit of
00:16:34.880 convention over configuration and the most important thing really to focus in on here is the retentions
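A sketch of what such a configuration might look like; the key names here are assumptions to check against the batsd README:

```yaml
# Sketch of a batsd config with the two retentions described:
# 1440 one-minute rollups (~a day) and 8640 five-minute rollups (~a month).
redis:
  host: 127.0.0.1
  port: 6379
root: /var/lib/batsd
retentions:
  60: 1440     # 1-minute rollups, kept for ~1 day
  300: 8640    # 5-minute rollups, kept for ~1 month
```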
00:16:40.160 because the retentions determine how long your data sticks around for and at what granularity so in this case
00:16:47.440 we've defined two retentions the first is that we keep around 1440
00:16:52.800 data points at one minute roll-ups so that's about a
00:16:57.839 day's worth and then we keep around 8640
00:17:02.880 data points at five minute rollups and that's about a month's worth of data so you're going to have data going back
00:17:09.120 to about a month if you want it longer you can increase these values now once we've got batsd up and running
00:17:15.679 i mentioned this implements the statsd protocol so you can actually use the statsd gem so this makes it really easy
00:17:23.039 to record a few counters timers gauges just install the gem and it's
00:17:28.240 really easy to get up and running now the next step naturally is to
00:17:33.520 actually integrate this into your app but before we do this you really need to think about what are the key performance
00:17:39.840 indicators for your business what are those kpis because that's going to determine what
00:17:45.440 are the things that you alert on and so these are going to be the basics right you're going to have things like
00:17:50.480 revenue new users you might look at clicks conversions if you're a mobile company you might be
00:17:55.679 looking at push notifications i wish we could look at keg usage but we don't have that yet i think we have
00:18:01.039 someone working a hackathon on that so that's coming up but once you've
00:18:06.240 figured out you know what those kpis are the next step is really to integrate them into your rails application and usually
00:18:13.360 you do this through some active record callbacks you might put them in observers you might
00:18:18.799 put them in your controller i have two examples up here the first one is hooking them into your active record callback
00:18:24.720 so here we're just hooking into after create and tracking a few things about conversions like the revenue number of
00:18:31.360 conversions there were and in the bottom example you can see us hooking into just a show action for the
00:18:37.520 controller so this is just tracking the number of times that we render a particular action
00:18:43.120 and the really nice thing at least with statsd is that this actually has little to no performance
00:18:49.200 impact on your application all the metrics get transferred over udp now if you're using a dns name you might
00:18:55.679 have to deal with dns lookup performance but for the most part this actually has little impact on your application
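Because the protocol is plain-text UDP datagrams, the fire-and-forget behavior can be demonstrated with nothing but Ruby's standard library. This sketch hand-builds the `name:value|type` wire format the statsd gem produces (the metric name is made up) and bounces it off a local UDP socket standing in for batsd:

```ruby
require 'socket'

# statsd-style metrics are single UDP datagrams of the form
# "name:value|type" -- "c" for counters, "ms" for timers, "g" for gauges.
# Sending is fire-and-forget, which is why it costs the app almost nothing.
def statsd_packet(name, value, type)
  "#{name}:#{value}|#{type}"
end

# Demonstrate end to end against a local UDP listener standing in for batsd.
server = UDPSocket.new
server.bind('127.0.0.1', 0)  # pick any free port
port = server.addr[1]

client = UDPSocket.new
client.send(statsd_packet('conversions.revenue', 5, 'c'), 0, '127.0.0.1', port)

datagram, _addr = server.recvfrom(64)
puts datagram  # conversions.revenue:5|c
```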
00:19:03.280 now you can actually take this one step further and use the notification system in rails 4
00:19:09.919 to start tracking your metrics and you can imagine this is actually what new relic is doing under the hood
00:19:16.080 it's getting information about all of the requests that are being made to your server in this case we're actually tracking the
00:19:22.000 number of requests for each http status code that we render
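The pattern can be sketched with a toy stand-in for ActiveSupport::Notifications: in a real Rails app you would subscribe to `"process_action.action_controller"` and read `:status` out of the event payload, but the class below only mimics that interface so the example is self-contained:

```ruby
# Toy stand-in for ActiveSupport::Notifications, just to show the
# subscribe-and-count pattern; this is not the real API.
class Notifications
  @subscribers = Hash.new { |h, k| h[k] = [] }

  def self.subscribe(name, &block)
    @subscribers[name] << block
  end

  def self.instrument(name, payload)
    @subscribers[name].each { |b| b.call(name, payload) }
  end
end

STATUS_COUNTS = Hash.new(0)

# Count every rendered response by HTTP status code, as described.
Notifications.subscribe('process_action.action_controller') do |_name, payload|
  STATUS_COUNTS["rack.response.#{payload[:status]}"] += 1
end

[200, 500, 200].each do |status|
  Notifications.instrument('process_action.action_controller', status: status)
end

p STATUS_COUNTS  # {"rack.response.200"=>2, "rack.response.500"=>1}
```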
00:19:28.000 and so you could do this and you could maybe alert on it and get again some of the same information
00:19:33.440 that new relic is getting so now we've got sensu up and running
00:19:39.120 we've got batsd integrated into our app the next step is to get into some of the meat of the project these are the checks
00:19:48.000 this example is sensu at its most basic this is one of the simplest examples
00:19:55.039 you could have and you can see it's really only a few lines of code
00:20:00.880 sensu has a really really nice interface for defining these checks there are two important things to look
00:20:06.559 at here the first is that we're actually looking at a few command line arguments those are the options warning and crit
00:20:12.559 which allow you to define a couple of thresholds that we're going to look at and this particular check is looking at
00:20:17.760 the 15-minute load average on a server the second thing to look at is that we're implementing a run method this is
00:20:24.080 sort of the core logic for the check and what we do is we read in that load average
00:20:29.840 and then based on the thresholds that have been configured in the command line we actually
00:20:36.000 render a status code warning or critical based on a comparison of those values so
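The core of a check like this reduces to comparing a reading against the two thresholds and exiting with the matching code. A dependency-free sketch of that convention (a real plugin would subclass Sensu::Plugin::Check::CLI and implement run):

```ruby
# Sketch of a load-average check's core logic. Sensu follows the nagios
# exit-code convention: 0 = ok, 1 = warning, 2 = critical (3 = unknown).
OK, WARNING, CRITICAL = 0, 1, 2

def check_load(load15, warn_at:, crit_at:)
  return CRITICAL if load15 >= crit_at
  return WARNING  if load15 >= warn_at
  OK
end

# On linux the 15-minute load average could be read (an assumption about
# the platform) with: File.read('/proc/loadavg').split[2].to_f
puts check_load(0.4, warn_at: 1.0, crit_at: 2.0)  # 0
puts check_load(1.2, warn_at: 1.0, crit_at: 2.0)  # 1
puts check_load(3.5, warn_at: 1.0, crit_at: 2.0)  # 2
```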
00:20:41.600 that's really easy so let's look at a little something maybe a little bit more interesting here we're defining a check for tracking
00:20:49.520 our memcache sets and again it actually looks pretty simple it's only a few lines of
00:20:55.520 code but the really interesting thing to call out here is that this is where we can start to take
00:21:00.960 advantage of everything that everyone in the ruby community has built here we're pulling in the memcache
00:21:06.799 gem which gives us really quick and easy access to the stats api and memcache and we can just write those out to be
00:21:13.919 picked up later on by sensu maybe stored in a database graphed on a dashboard you get a lot of
00:21:20.720 information by just outputting these metrics and remember this is ruby so we can even write our
00:21:26.400 specs around our checks that's kind of awesome all right and
00:21:32.799 this is where what the output looks like so you can see we've got the keys on the left those are our stats from memcache
00:21:39.919 we've got sort of the counts for each of those keys and we've got timestamp when they
00:21:45.200 occurred so now that those checks are written the next step is to actually configure them
00:21:51.840 in sensu server so that we can actually get them running on our servers so sensu uses a very basic json
00:21:59.520 configuration file but sometimes it's a little bit hard to remember what each configuration means so i have a little bit of a cheat sheet
00:22:05.600 at the top there that sort of provides a sentence for what this configuration means and we'll run
00:22:10.799 through it so for this load average check what we're asking sensu to do
00:22:16.240 is to run the check load 15 command on a set of subscribers in this
00:22:22.880 case our databases and to run it every interval seconds so that's 60 seconds
00:22:30.080 and after occurrences failures in this case after 10 failures we're going to run a set of handlers
00:22:36.320 in this case the mailer handler and we're going to send a pagerduty alert and we're going to be reminded every
00:22:41.840 3600 seconds if it's still failing so this is sort of the most basic implementation for a check configuration
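That sentence translated into sensu's JSON would look roughly like this; key names follow sensu 0.x as described in the talk, so verify the exact schema against the sensu documentation:

```json
{
  "checks": {
    "load-15": {
      "command": "check-load-15.rb -w 1.0 -c 2.0",
      "subscribers": ["database"],
      "interval": 60,
      "occurrences": 10,
      "refresh": 3600,
      "handlers": ["mailer", "pagerduty"]
    }
  }
}
```

A metrics-only check, like the memcache one, would additionally set `"type": "metric"` so its output is routed to handlers without ever generating alerts.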
00:22:50.159 you can see we're going to use that same memcache check we wrote before but the one key difference is that we
00:22:56.159 actually have a type configuration what this tells sensu is that it's only going to be reporting metrics
00:23:02.880 it's not actually going to generate any alerts in your system
00:23:09.120 and this is where the community starts to really show how easy it is to start tracking
00:23:15.200 some of these things we already have checks written for redis for memcache for resque there are a whole ton of
00:23:21.840 these and the really nice thing is that if you use nagios this is compatible with any nagios plug-in
00:23:28.159 so if you're looking to make the transition from a system like nagios to sensu this is
00:23:34.000 going to make it a lot easier now we've touched on system checks the
00:23:39.120 next thing to look at are our business checks but business checks are a bit of a different beast they
00:23:45.200 don't really behave quite as predictably as your system metrics you know it'd be great if our
00:23:51.440 revenue looked like this i mean maybe we'd want it to go up and to the right but it'd be really easy to alert on right we
00:23:58.080 could just set a few thresholds and if it goes past those thresholds bottom or
00:24:03.120 top we can raise an alarm but the truth is our revenue tends to look like this
00:24:09.279 that's a bit of a roller coaster ride right it's going to depend on the time of the day maybe the day the week
00:24:15.840 and for some of us our revenue graphs make no sense at all i don't even know what's going on there
00:24:22.640 right but the truth is with all these different graphs there are patterns we just need to
00:24:28.480 identify those patterns and i usually start with the most basic i usually start with just looking at the
00:24:34.000 absolute values because these are going to be likely the most catastrophic so if i'm looking at
00:24:39.840 something like revenue i'm going to want to alert when revenue is zero over the last hour
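That simplest alert, revenue is zero over the last hour, is just a sum and a comparison. A sketch with made-up data standing in for a batsd query (the batsd gem's query interface is not shown here; assume it returns the counter's data points for a time range):

```ruby
OK, CRITICAL = 0, 2

# Sketch of the most basic business check: alert when the revenue
# counter sums to (effectively) zero over the last hour. `datapoints`
# stands in for whatever a batsd query for that range would return.
def check_revenue(datapoints, min:)
  total = datapoints.sum
  total >= min ? OK : CRITICAL
end

puts check_revenue([12.5, 3.0, 7.25], min: 0.01)  # 0 -- money came in
puts check_revenue([],                min: 0.01)  # 2 -- nothing for an hour
```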
00:24:45.679 start with something basic right and as you start to find the need as your data starts to conform to it you
00:24:51.840 can start to look at more complex patterns whether that's percentage hour over hour difference
00:24:57.679 or if that's looking at trends week over week day over day or maybe you're looking at
00:25:03.600 a more complex forecast model these are the types of things you can start to look at but i do encourage
00:25:09.600 people to start thinking sort of small think about the easy ones first and then look
00:25:14.880 at more complicated ones as you find the need so let's run through a few examples again here we're going to define a new
00:25:21.200 check using the senso api and this one's pretty straightforward so what this is going to look at is the
00:25:27.440 absolute value of our revenue over the past hour and if it's below a certain value
00:25:33.600 we're going to alert so the two interesting things to look at here are
00:25:39.120 first we're actually pulling in the batsd gem that's written by ben hathaway and what this gives us access to are all
00:25:45.840 those metrics that we are previously recording in our batsd instance the second thing to
00:25:51.600 look at is sort of at the bottom where we're using it so we're going to grab that revenue counter that
00:25:57.440 we were tracking before for our conversions we're going to sum them all up and compare them against those thresholds
00:26:03.360 we've defined and based on that comparison we're going to provide an exit code pretty straightforward but we can get
00:26:09.520 more creative so here's one looking at percentage hour over hour difference
00:26:15.520 it's a similar idea we're going to read in that revenue over the previous hour but we're also going to
00:26:20.720 read in the hour before that and look at the percentage difference between those two
00:26:26.159 and provide an exit code based on that comparison now the one interesting thing to note with this one is that we're actually
00:26:31.840 providing some output what that output gives us is some debugging information so that if our revenue check is failing
00:26:38.400 we can quickly see in our pagerduty alert what the reason is you know
00:26:45.039 what the difference in revenue is and finally we can see an even more
00:26:50.640 complicated example what this uses is a forecast so this is using a gem called linefit from eric cline and
00:26:58.880 this provides a least squares line fit of our data points over a
00:27:04.640 period of time and in this case what we're doing is we're saying let's look at the previous hour and
00:27:12.480 compare that to the same hour in the last four weeks so
00:27:17.600 we're looking at three pm four weeks ago 3 p.m two weeks ago 3 p.m a week ago
00:27:25.440 and then we're saying okay for 3 p.m today what should it have been and as long as it's within a
00:27:31.840 certain uncertainty we're not going to alert on it so this starts to take a look at you know some
00:27:38.080 of the more creative things that we can start to do with our business checks so now we have all these checks
00:27:44.480 in place the next step is to look at our handlers what are we going to do with the output
00:27:50.159 the results of those checks so here's an example again of a pretty
00:27:55.520 basic handler and you can see the implementation is actually fairly similar to a check except in this
00:28:01.279 case we're basically defining a handle method instead of a run method
00:28:06.640 now this particular handler is going to take metrics that we've previously reported
00:28:12.080 from our checks and publish that to our batsd instance and again we can actually use the statsd
00:28:18.799 ruby gem since it implements that same protocol so you can see it's actually fairly
00:28:24.080 straightforward we pull in that connection to our batsd instance we parse the output of the check
00:28:31.279 and then we generate a few data points with statsd there are no exit codes it's really as
00:28:37.760 simple as just processing that output and doing something with it in this case it's reporting those metrics
00:28:44.480 to our batsd instance now here's another example this is a
00:28:49.679 pagerduty handler and what this allows us to do is
00:28:55.840 process a check result which has critically failed and send a pagerduty
00:29:02.640 alert and you'll notice one really nice thing again like the memcache check that we saw before
00:29:09.520 this allows us to pull in some of the existing work that people have already done in this case we're actually pulling in the
00:29:15.440 redphone gem and this is what gives us access to the pagerduty api so you can see
00:29:22.240 one interesting thing here which is the filter method and what this allows us to do is say we're only going to generate one pagerduty
00:29:28.559 incident per failure so that even if our check was failing every minute we wouldn't
00:29:34.320 constantly get pages to our phone and then in the handle method is where that core logic is so that's where we're
00:29:39.919 actually going to create an incident on pagerduty and you can see this is actually what it looks
00:29:45.360 like so we've got the name of the check the client that it ran on the output in this case shows our load
00:29:53.120 average was above the 1.0 critical threshold that we set
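[editor's note: the pipe-handler pattern described above can be sketched in plain ruby. this is a self-contained illustration, not sensu's actual API: a real handler would subclass Sensu::Handler from the sensu-plugin gem and read the event JSON from STDIN, and the statsd datagrams here are formatted by hand instead of going through the statsd-ruby gem. the class name, event values, and metric names are all made up for the example.]

```ruby
# Rough sketch of the metric handler described in the talk. A real
# Sensu handler subclasses Sensu::Handler (sensu-plugin gem) and is
# fed the event JSON on STDIN by the "pipe" handler type; here the
# event hash is passed in directly so the sketch is self-contained.
class MetricsHandler
  # event: a parsed Sensu event hash whose check output is assumed
  # to be graphite-style lines of "metric.path value timestamp"
  def handle(event)
    event['check']['output'].each_line.map do |line|
      name, value, _timestamp = line.split
      # "<name>:<value>|g" is the statsd gauge datagram that the
      # statsd-ruby gem would send over UDP to a batsd instance
      "#{name}:#{value}|g"
    end
  end
end

# Illustrative event, shaped like Sensu's event JSON
event = {
  'client' => { 'name' => 'web-01' },
  'check'  => {
    'name'   => 'load_metrics',
    'output' => "load.avg.1m 0.42 1366000000\nload.avg.5m 0.37 1366000000\n"
  }
}

datagrams = MetricsHandler.new.handle(event)
puts datagrams
```

[a handler like this would then be registered with the pipe-type handler configuration the talk turns to next, and the pagerduty-style dedup would be a `filter` method that simply returns early when the event's occurrence count is above one.]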
00:29:59.039 so now we've got our handlers written we have to do the same type of configuration we did for our checks
00:30:04.240 now the handler configuration is actually quite a bit simpler it basically has two main configurations
00:30:10.399 it's the type and the command the command is straightforward it's just the path to the actual command but the type
00:30:16.880 is pipe which tells us that all of the results from our checks are actually going to be
00:30:22.000 piped into our handlers using standard in and again by making some assumptions you
00:30:27.600 know we can really simplify these configurations there are other configurations
00:30:33.600 you can set here for example we have a severities configuration that we added to our pagerduty handler and this forces only critical
00:30:41.679 checks to actually get processed and again the community really helps out
00:30:47.440 here they've already integrated against all of these services there's a lot more we can do i think there are more
00:30:52.880 services that we should be integrating against but there's already a whole ton of things that we can do that's a great
00:30:58.960 start so the question is you know we've got the sensu framework up and running
00:31:05.840 we've got batsd data we've got system checks we've got business checks
00:31:10.960 what's next where do we go from here the first thing is i want to mention everything we've used
00:31:17.360 and everything we've shown you today is being used by tapjoy
00:31:22.559 in production for almost a year and this is at scale we're monitoring hundreds of servers
00:31:29.039 we're getting over a thousand check results per minute back from all those servers we're tracking thousands and thousands
00:31:35.600 of metrics per second in our batsd instance this is scaling this can work for some of the companies
00:31:42.080 that are at this scale the one other thing to note is that this is starting to become
00:31:47.600 a part of our development process and if you recall i mentioned at the beginning of the talk
00:31:53.120 that this is something people often miss when developing new features
00:32:00.080 it hasn't historically been really part of that process so the question is why not
00:32:05.600 i think one of the main reasons is because it's not easy enough yet you know when rails came
00:32:11.840 out and rspec came out and made it really easy maybe sometimes even fun to write our
00:32:18.320 unit tests but we haven't done that for monitoring yet and it's not the ops team's responsibility to be writing
00:32:25.760 these checks this is engineering responsibility it needs to be a part of the engineering process it needs to be
00:32:31.600 really really easy to write checks for our applications imagine if we could
00:32:38.559 do that right within our own rails app what if we could build a feature
00:32:44.159 instrument it with metrics write tests for it put the monitoring
00:32:49.840 checks in with that commit as well and maybe you can even write
00:32:54.880 specs against that check this is what we need we need the ability to be defining these
00:33:00.720 checks right within our application and what if it was tied to our
00:33:06.720 deployment process so what if we add a capistrano extension so that when our feature goes live
00:33:12.640 so does our check this would be awesome but no one's built this yet someone
00:33:17.760 needs to look at tying something like sensu into our rails apps to make it that much easier
00:33:23.120 where we have no excuse to not be monitoring any of the business metrics that
00:33:28.640 are coming out of the new features we're building this needs to happen and what about our
00:33:33.919 infrastructure we're using things like chef and puppet to automate our stacks but none of these have any
00:33:40.799 smart defaults or conventions around the metrics that we're tracking and the alerts on those metrics
00:33:49.440 you know every time we build a redis server or memcache server we should be automatically tracking those metrics and
00:33:55.919 there should be some smart default set of alerts that we're creating
00:34:00.960 and this hasn't happened yet but you could imagine we could actually implement some chef cookbooks or what have you
00:34:07.440 that automatically puts in this instrumentation in the process for building your stacks
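[editor's note: as the talk says, nothing like this exists yet, but the "smart defaults" idea can be sketched in plain ruby: provisioning a service type automatically registers a conventional set of checks with default thresholds. every service name, check name, and number below is hypothetical, and a real chef cookbook would render sensu check definitions while converging the node rather than just returning hashes.]

```ruby
# Hypothetical sketch of "smart defaults": provisioning a service
# automatically registers a default set of metric checks and alert
# thresholds. No cookbook like this exists; names and numbers are
# purely illustrative.
DEFAULT_CHECKS = {
  'redis' => [
    { name: 'redis-memory-fraction', warn: 0.80,  critical: 0.95 },
    { name: 'redis-latency-seconds', warn: 0.010, critical: 0.100 }
  ],
  'memcached' => [
    { name: 'memcached-evictions-per-minute', warn: 100, critical: 1000 }
  ]
}.freeze

# In a real cookbook this step would write Sensu check definitions
# into /etc/sensu/conf.d as part of building the node; here it just
# returns the definitions that would be written.
def provision(service)
  DEFAULT_CHECKS.fetch(service, []).map do |check|
    { service: service }.merge(check)
  end
end

provision('redis').each do |check|
  puts "#{check[:service]}: #{check[:name]} warn=#{check[:warn]} crit=#{check[:critical]}"
end
```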
00:34:14.399 this needs to happen we need to make it so easy there's no excuse to not do it
00:34:22.399 but yes alerting is hard it takes a while to figure out the right thresholds so it's
00:34:28.800 not going to be as easy as maybe i would like it to be where there's sort of all these defaults
00:34:34.399 but at the very least we need to be monitoring and measuring those metrics we need to actually be
00:34:41.919 tracking those metrics because even though this doesn't look easy this should be easy i love infomercials but seriously
00:34:49.760 alerting is hard yes there are a lot of things we need to be thinking about there's really this delicate balance
00:34:55.599 between good thresholds and noisy alerts if we
00:35:01.280 tighten our thresholds too much we're going to miss alerts if we loosen them too much we're going
00:35:06.800 to get too many alerts it takes a little while to find that balance
00:35:12.079 context is important you know etsy wrote in a blog post saying that
00:35:17.599 measure anything measure everything they weren't too far from the truth you
00:35:22.880 need all of that context to really know if there's actually a problem going on in your system and sometimes there's
00:35:29.520 even context outside of your product that is sometimes hard to anticipate things like sales
00:35:35.599 things like holidays the time of the year and alerting also requires this good
00:35:42.560 balance between top-down metrics and bottom-up metrics those top-down metrics are your business
00:35:48.400 metrics those are the things that are going to tell you when there's an outage affecting something important to your business but
00:35:55.520 those bottom-up metrics those system metrics are going to be critical to understanding
00:36:00.880 when there's a failure in your infrastructure so there needs to be a good balance there
00:36:07.119 some final thoughts i want to leave you guys with first is choosing the right monitoring technology
00:36:12.640 i know i've talked about sensu and batsd it's worked well for us but there are others out there and i
00:36:17.760 encourage everyone to think about the differences between those technologies and choose the right one
00:36:24.960 second it's really critical to think about alerts and metrics that you can actually do something with even though
00:36:32.320 etsy said measure everything if we measured everything it would be really difficult to find the important
00:36:38.480 things so we need to be measuring stuff that's actually actionable
00:36:44.320 and third we need to start making this transition from being reactive and reacting to outages
00:36:51.440 more towards being proactive and knowing about the metrics that are
00:36:56.640 starting to show a trend which is going to result in an outage affecting all of your users
00:37:01.680 and it's going to take a while to get there and you're never going to completely get there but i think we need to
00:37:06.880 start to try and move in that direction fourth we need really good smart
00:37:13.119 dashboards things that don't make it hard for us to get access to the metrics and data we need
00:37:19.599 when there's an outage things that don't get in our way and lastly we really need a game
00:37:25.920 changer in monitoring we need changes that will make it
00:37:31.119 so easy for us to add monitoring into our applications and into our infrastructure that there's
00:37:37.359 no excuse to not do it this is a hard problem
00:37:43.119 but we need someone to come in and make that contribution
00:37:48.560 so if you recall there were two things that i wanted everyone to take away from this talk
00:37:53.920 the first is that we need to be investing time and effort into this monitoring problem this will
00:37:59.760 affect your business it will affect your revenue and this is the type of thing that you don't realize you need it
00:38:07.520 until you need it so i'm telling you now you need it and the second thing is that there are a
00:38:15.440 lot of open source tools available for monitoring there are a lot
00:38:21.040 of people solving some of these problems they can always use more help though so what i encourage
00:38:27.440 everyone in this room you can take this back to your companies is that we need to work together and we
00:38:33.599 really need to think about how do we take monitoring to that next level and integrate that into our own rails
00:38:41.200 applications thanks for your time