Aaron Pfeifer

Keeping the lights on: Application monitoring with Sensu and BatsD

The good news: you're quickly signing up new customers, you've scaled your Rails app to a growing cluster of 10+ servers, and the business is really starting to take off. Great! The bad news: just 30 minutes of failures is now measured in hundreds or even thousands of dollars. Who's going to make sure the lights stay on when your app starts to fall over? Or worse, what if your app is up, but sign-ups, payments, or some other critical function is broken?

Learn how you can build a robust monitoring infrastructure using the Sensu platform: track business metrics in all of your applications, any system metric on your servers, and do so all with the help of BatsD - a time series data store for real-time needs. We'll also talk about how to look at trending data and how you can integrate Sensu against PagerDuty, RabbitMQ, or any other third-party service. Oh, and of course - everything's written in Ruby, so you can even use your favorite gems!


RailsConf 2013

00:00:14.920 so let me go ahead and get started first of all my name is aaron pfeifer for those of
00:00:21.439 you don't know me today i'm going to be talking about application monitoring using two open
00:00:26.480 source ruby projects sensu and batsd there are entire conferences
00:00:32.079 on application monitoring but today we're going to look at a small slice of that using these two technologies and what we
00:00:37.840 at tapjoy call keeping the lights on there's a github repo link at the bottom here and that has all the examples that
00:00:44.719 are in the slides as well as documentation on how to get sensu and batsd up and running on your own laptop
00:00:52.399 so there are really two things that i want everyone to sort of take away from this session
00:00:57.840 the first is that as engineers as organizations we really need to be focusing and spending time and effort on looking into
00:01:05.360 monitoring our business and system metrics so that we can operate effectively you know we're all here building rails
00:01:11.680 apps we're all spending time looking at performance good design but we often forget a major
00:01:18.240 part of that process of building new features which is how do we monitor it how do we make sure it's actually working properly when it goes
00:01:24.840 live the second thing that i want everyone to take away is that you know there are a lot of open source
00:01:31.360 technologies available for monitoring and it's really exploded over the past few years and ruby has been an important part of
00:01:37.759 that but so is rails and the principles that guided rails things like convention
00:01:42.880 over configuration simplicity and you know we really
00:01:48.399 still have a long way to go there are a lot of problems we need to tackle we still have a lot to learn
00:01:54.720 because the truth is there are still at least a few of us in this room who have walked into work over the past
00:02:01.680 year and found that a deploy that went out the other day broke the world right
00:02:07.680 and that's surprising why is this why are we responding that way because i mean we don't deploy code
00:02:14.879 unless we spec our code first right i mean maybe that's not always the case but
00:02:21.840 that's the ideal at least you know write specs before it goes live but the truth is
00:02:27.840 what happens when it goes live there's nothing testing that and that's what monitoring is for
00:02:33.040 monitoring is our 24/7 rspec running live in production against production data
00:02:38.800 this needs to be part of the ideal this is going to be catching those problems that we wouldn't have
00:02:44.480 seen otherwise a little bit about me i'm based out of boston i've been hacking on ruby for
00:02:51.440 quite a while now i've been working on a bunch of different ruby and rails projects though my main project is
00:02:56.480 called state_machine if you haven't heard of it before i'm a uh yeah
00:03:02.239 thank you thanks um i'm a principal engineer at tapjoy i previously worked
00:03:07.280 at viximo and uh you know over the past several years i've really learned
00:03:12.319 a lot and one of the primary things is that as you grow and as you scale
00:03:18.239 you tend to hit some of these issues with downtime and when you encounter downtime
00:03:23.760 it's going to cost real money it has a real effect on your company you know if you're a 10 million dollar
00:03:29.599 business hours or days of downtime is going to cost a lot hundreds of thousands of dollars
00:03:35.599 but the question is when we talk about downtime here what does that really mean and how can we use
00:03:41.440 monitoring to help improve that metric because this is what we think of as downtime
00:03:48.159 pingdom telling us our website returns 500s or takes too long to respond
00:03:53.920 this is what we think of as an outage but the question is what do you do in this case this is amazon returning with no css no images
00:04:01.680 returns 200 it's fast respond has all the right content no one's going to buy anything when they hit this website so
00:04:08.000 that has to be considered an outage so when we think about this you know pingdom's not going to catch this what
00:04:13.920 are the different metrics that we can monitor that are going to detect this type of downtime
00:04:20.160 i like to categorize that into three different types and the first are business metrics these aren't going
00:04:25.680 to tell you the root cause but these are sort of your backup these are going to tell you when there's an
00:04:30.720 outage that's affecting a key performance indicator for your business
00:04:36.080 at its most basic this is something like revenue but this could include other things like
00:04:41.360 conversions like purchases how about new users how many of us have run into this before
00:04:47.600 don't validate the presence of a boolean field we've all run into this at some point
00:04:54.080 and this validation seems innocent but the truth is that this is going to silently fail and
00:05:00.240 prevent new users from being created and on the surface nothing seems wrong we get no exceptions
00:05:05.759 so we're not going to know about it unless we're tracking our new users metric
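The failure described here can be shown without a full Rails app. Rails' presence validator rejects anything that is blank?, and ActiveSupport treats false as blank, so a boolean that is legitimately false never validates. The helpers below are hypothetical stand-ins that mimic that convention; they are not Rails' actual implementation:

```ruby
# Sketch of why `validates :tos_accepted, presence: true` silently fails
# for booleans: the presence validator rejects blank? values, and in
# ActiveSupport `false.blank?` is true. These methods only mimic that
# convention for illustration.
def blank?(value)
  value.respond_to?(:empty?) ? !!value.empty? : !value
end

def presence_valid?(value)
  !blank?(value)
end

# The fix for boolean columns: validate inclusion instead of presence.
def inclusion_valid?(value)
  [true, false].include?(value)
end

puts presence_valid?(true)    # true
puts presence_valid?(false)   # false -- a legitimate false never saves
puts inclusion_valid?(false)  # true  -- both boolean values accepted
```

In a real model the fix is `validates :field, inclusion: { in: [true, false] }` rather than a presence validation.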
00:05:12.320 and the second category i like to call application performance metrics and this is the type of information that
00:05:18.000 you might usually see from a service like new relic so these metrics deal with individual
00:05:24.000 requests and how performance is perceived by the user and so this is probably one of the most
00:05:29.759 common examples this is uh looking at the response time for a particular application
00:05:35.199 you can see it's taking us over 100 seconds to return to the user and that's going to cause major issues
00:05:42.160 and be directly correlated to conversions and the third
00:05:48.320 group i like to refer to as system metrics and these are going to tell you the root cause of
00:05:53.360 any infrastructure outage these are things like memcache metrics redis metrics network activity
00:05:59.919 here's an example of a disk usage metric we're not rotating our logs we forgot and that's going to cause major issues
00:06:07.039 in a server like your database server right so this is the type of thing we need to
00:06:12.240 be monitoring we have to have in place and so you know with all these different types of metrics we've talked about
00:06:18.240 pingdom being really useful for worldwide outages we've talked about new relic being really useful
00:06:24.240 for application performance but we haven't talked about a tool that we can use to access and
00:06:31.039 monitor real-time system and business metrics something that works really well in the cloud
00:06:36.560 and ideally something that would be written in ruby something that we might be able to read understand and
00:06:42.000 hack on and this is really where sensu and batsy really make a difference and shine
00:06:49.199 so i'm going to talk a little about what these two technologies are how they fit into your architecture
00:06:54.720 we'll talk a little bit about the components and then we'll go into some examples so first
00:07:00.240 sensu is a framework written by sean porter at sonian and this is the base framework
00:07:08.240 for your monitoring infrastructure so it's built with cloud-based apps in mind and at its core
00:07:14.000 function its purpose is to basically run commands on various servers in your
00:07:19.360 infrastructure and then process the results of those commands and that's that's really what it's very
00:07:25.280 good at now there are a lot of features that complement that but at its core it's really good at that
00:07:32.080 and batsd is a time series database and if you've heard of statsd
00:07:37.520 you know basically what batsd is about because it actually implements the same protocol
00:07:42.720 its core functions are to track our real-time metrics and aggregate them or roll them up over
00:07:48.879 over periods of time unfortunately batsd doesn't have a logo so i took it upon myself to give it
00:07:55.280 one that wasn't obvious uh so let's walk
00:08:00.479 through the architecture a little bit see how this actually fits into your architecture so we can
00:08:07.120 assume that you have a set of servers that are running in production these you want to monitor these can be app servers
00:08:12.319 they can be databases once they're running they're configured to connect to the sensu stack and this
00:08:18.319 is what allows it to run system specific checks now once we have all those checks running we probably
00:08:24.319 want to start tracking some of the metrics from those checks if we add memcache servers maybe they'd
00:08:29.759 be our memcache metrics so here's where batsd comes into play so that those metrics that are coming back
00:08:36.320 from those checks go through sensu and are stored in batsd so once we have all those system metrics being
00:08:42.080 stored the next step is where do we get our business metrics and this is where our app servers talk
00:08:49.440 directly to our batsd server using the same mechanism that we would normally track our system
00:08:56.480 metrics and the final piece is once we now have access to those particular business
00:09:01.760 metrics we can now implement some checks through sensu which read the data from batsd and then can
00:09:09.360 alert on particular thresholds of that data so at a high level this is how we're using
00:09:14.959 batsd and sensu to sort of fill that gap in our monitoring solution
00:09:21.279 so let me talk a little bit about the inner workings of sensu and its core
00:09:26.720 components so the core component here is the server
00:09:33.760 process and this really orchestrates the entire system its primary responsibility is to know
00:09:41.120 when to publish requests to run checks on your servers and what servers to publish those
00:09:48.640 requests to now the when part is not that interesting it's sort of a cron style
00:09:54.160 type implementation but the who part i think is actually more interesting you
00:10:00.320 know and i actually think it's easiest to think about this from the perspective of the message bus in sensu now the
00:10:07.680 message bus is what allows sensu to communicate to the different servers in your infrastructure and in this case
00:10:13.519 sensu uses rabbitmq and if you haven't used rabbitmq you should take a look at it it's a really awesome technology
00:10:20.480 i actually think this is the one component that really makes sensu work so well so you know when sensu
00:10:28.800 fires up for the first time it actually has no idea what servers are running in your
00:10:33.839 infrastructure it's only as clients start actually registering with
00:10:38.880 rabbitmq that sensu is aware of them and starts monitoring them automatically so let's look at an
00:10:45.360 example so on the left here you see what looks essentially like server roles we have things like
00:10:51.760 memcache redis in sensu terminology these are subscribers
00:10:57.360 and each subscriber has in rabbitmq a fanout exchange which means any message
00:11:02.720 or any check request that gets published to that particular subscriber
00:11:07.920 will get fanned out or sent out to each client that is consuming from that
00:11:13.120 exchange so you can see we actually have these bindings in place which provide that connection so on the
00:11:20.000 right we have what looks like a bunch of gibberish and letters but those are actually our clients in our system
00:11:25.279 those are our servers so they fire up and if we have something like a memcache server it knows
00:11:30.720 it wants to listen to the memcache role and if we have maybe a core role that
00:11:35.760 represents everything so any memcache check that gets published through the memcache exchange
00:11:42.160 is going to be picked up by any client bound to that so once a client picks up a request to
00:11:49.760 run a check this is where the sensu client process comes into play so the client is responsible for running
00:11:56.160 that check and returning the results to sensu and there are actually two requirements for checks the two very
00:12:02.800 basic requirements the first is to provide an exit code this tells you
00:12:08.160 the severity of the check is it warning is it critical is it okay and
00:12:13.920 the second thing is to provide output and this could be individual metrics it could be debugging information it all
00:12:20.320 depends on the individual check itself now once you've got a check that's
00:12:25.760 running returning results back to sensu the next question is what do we do with
00:12:31.680 those results and this is where handlers come in handlers are basically commands that
00:12:36.880 take in the output of a check and do something with it whether it's sending an email
00:12:43.200 maybe sending a pagerduty alert and you can essentially
00:12:48.399 think of the way this works as a command pipe in linux so that we get a little bit of additional metadata
00:12:54.000 so the information on the left there that json actually gets piped into standard input
00:13:00.079 to your handler command and handlers can do a lot with that and we'll see that in a little bit
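As a sketch of that pipe model, a minimal handler just reads the event JSON from standard input and pulls out what it needs. The field names below follow the sensu 0.x event shape described in the talk; treat the exact schema as an assumption to verify against the sensu docs:

```ruby
require 'json'

# Minimal sketch of a pipe-style handler: sensu writes the event JSON to
# the handler's standard input, and the handler does something with it.
def handle(event_json)
  event  = JSON.parse(event_json)
  client = event['client']['name']
  check  = event['check']['name']
  status = event['check']['status'] # 0 = ok, 1 = warning, 2 = critical
  "#{check} on #{client} exited with status #{status}"
end

# In a real handler this would be: puts handle(STDIN.read)
sample = '{"client":{"name":"i-424242"},"check":{"name":"check_load","status":2}}'
puts handle(sample)  # check_load on i-424242 exited with status 2
```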
00:13:06.399 so the sensu admin is really where this all starts to actually come together and this is actually a rails app again
00:13:12.720 open source in the sensu repository and from here you can see all the checks that are running the alerts
00:13:18.880 the clients that are connected it's all it's getting all this data through a restful api and given that it's in rails
00:13:24.880 we can all go in and sort of make modifications add additional things we should all be familiar with sort of
00:13:30.240 the underlying code of this admin so this is a diagram that shows all
00:13:35.920 those components now coming together and i really think this shows the simplicity of sensu
00:13:41.360 and if you actually look at the underlying implementation of sensu it's just about as straightforward it is
00:13:47.120 a pretty simple product and the meat of the product
00:13:52.560 really exists in the community provided checks and handlers those are the really really important
00:13:58.320 aspects so now let's shift a little bit to the batsd architecture i mentioned this is
00:14:04.079 our time series database now batsd basically has two
00:14:10.800 processes that are at its core the first is the receiver process this is responsible for taking in those
00:14:17.279 real-time metrics aggregating them or rolling them up over time and then storing them in a couple
00:14:22.720 of different data stores memory redis and disk once that's stored we then have the api
00:14:28.320 or the server process which allows you to then query for those data points and get the
00:14:33.600 get those previous data points out of those different data stores and again this is actually this is
00:14:38.959 completely written in ruby and it's on top of eventmachine so we should all be able to go in and at least start to follow the
00:14:44.959 implementation i mentioned there are three different data stores and which one gets used
00:14:51.279 depends on actually how real time the data is so for a typical scenario the memory is
00:14:57.279 going to be storing the last data point and that'll be data from less than a minute ago redis
00:15:02.720 is used for sort of the shortest term roll up of your data and that could be data from minutes or hours sometimes days ago
00:15:09.760 and the file system is what gets used for long-term storage this would be data for
00:15:15.199 months potentially even years ago so you can see that the different technologies were used depending on how
00:15:22.079 real time we needed the data to be now there are actually three different
00:15:27.600 types of metrics that can be reported to batsd the first are counters and they all
00:15:34.160 differ actually depending on how they behave over time for counters
00:15:40.320 these represent relative changes in a value so plus one minus one plus five
00:15:45.600 and these actually get summed up over time timers represent absolute values at
00:15:51.920 any point in time and as time goes on these values actually get averaged
00:15:57.440 as they're rolled up and we track a couple different stats about them like standard deviation 90th percentile and finally we have
00:16:05.120 gauges which are a little bit of an oddball i would say you probably don't use them that often
00:16:10.720 they are every data point that gets recorded actually gets tracked on disk and they're never aggregated in any
00:16:16.560 fashion so they don't actually take advantage of it being a time series database but it's there if you need it
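The three behaviors can be summarized with a toy rollup, a sketch of the semantics only and not batsd's actual storage code: counter deltas are summed per interval, timer values are averaged (alongside stats like the 90th percentile), and gauges are kept verbatim:

```ruby
# Toy rollup illustrating the three statsd/batsd metric types.
def rollup_counter(deltas)
  deltas.sum                     # relative changes get summed
end

def rollup_timer(values)
  values.sum.to_f / values.size  # absolute values get averaged
end

def rollup_gauge(values)
  values                         # every point kept, never aggregated
end

puts rollup_counter([+1, +1, -1, +5])  # 6
puts rollup_timer([100, 200, 300])     # 200.0
p    rollup_gauge([3, 7, 7])           # [3, 7, 7]
```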
00:16:24.320 so configuring batsd is actually really simple this is the entire configuration for batsd
00:16:29.680 and it's able to do this by making a few assumptions about your data you know following a little bit of
00:16:34.880 convention over configuration and the most important thing really to focus in on here is the retentions
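A sketch of what such a configuration might look like; the key names here are assumptions to check against the batsd README:

```yaml
# Sketch of a batsd config with the two retentions described:
# 1440 one-minute rollups (~a day) and 8640 five-minute rollups (~a month).
redis:
  host: 127.0.0.1
  port: 6379
root: /var/lib/batsd
retentions:
  60: 1440     # 1-minute rollups, kept for ~1 day
  300: 8640    # 5-minute rollups, kept for ~1 month
```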
00:16:40.160 because the retentions determine how long your data sticks around for and at what granularity so in this case
00:16:47.440 we've defined two retentions the first is that we keep around 1440
00:16:52.800 data points at one minute roll-ups so that's about a
00:16:57.839 day's worth and then we keep around 8640
00:17:02.880 data points at five minute rollups and that's about a month's worth of data so you're going to have data going back
00:17:09.120 to about a month if you want it longer you can increase these values now once we've got batsd up and running
00:17:15.679 i mentioned this implements the statsd protocol so you can actually use the statsd gem so this makes it really easy
00:17:23.039 to record a few counters timers gauges just install the gem and it's
00:17:28.240 really easy to get up and running now the next step naturally is to
00:17:33.520 actually integrate this into your app but before we do this you really need to think about what are the key performance
00:17:39.840 indicators for your business what are those kpis because that's going to determine what
00:17:45.440 are the things that you alert on and so these are going to be the basics right you're going to have things like
00:17:50.480 revenue new users you might look at clicks conversions if you're a mobile company you might be
00:17:55.679 looking at push notifications i wish we could look at keg usage but we don't have that yet i think we have
00:18:01.039 someone working a hackathon on that so that's coming up but once you've
00:18:06.240 figured out you know what those kpis are the next step is really to integrate them into your rails application and usually
00:18:13.360 you do this through some active record callbacks you might put them in observers you might
00:18:18.799 put them in your controller i have two examples up here the first one is hooking them into your active record callback
00:18:24.720 so here we're just hooking into after create and tracking a few things about conversions like the revenue number of
00:18:31.360 conversions there were and in the bottom example you can see us hooking into just a show action for the
00:18:37.520 controller so this is just tracking the number of times that we render a particular action
00:18:43.120 and the really nice thing at least with statsd is that this actually has little to no performance
00:18:49.200 impact on your application all the metrics get transferred over udp now if you're using a dns name you might
00:18:55.679 have to deal with dns lookup performance but for the most part this actually has little impact on your application
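Because the protocol is plain-text UDP datagrams, the fire-and-forget behavior can be demonstrated with nothing but Ruby's standard library. This sketch hand-builds the `name:value|type` wire format the statsd gem produces (the metric name is made up) and bounces it off a local UDP socket standing in for batsd:

```ruby
require 'socket'

# statsd-style metrics are single UDP datagrams of the form
# "name:value|type" -- "c" for counters, "ms" for timers, "g" for gauges.
# Sending is fire-and-forget, which is why it costs the app almost nothing.
def statsd_packet(name, value, type)
  "#{name}:#{value}|#{type}"
end

# Demonstrate end to end against a local UDP listener standing in for batsd.
server = UDPSocket.new
server.bind('127.0.0.1', 0)  # pick any free port
port = server.addr[1]

client = UDPSocket.new
client.send(statsd_packet('conversions.revenue', 5, 'c'), 0, '127.0.0.1', port)

datagram, _addr = server.recvfrom(64)
puts datagram  # conversions.revenue:5|c
```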
00:19:03.280 now you can actually take this one step further and use the notification system in rails 4
00:19:09.919 to start tracking your metrics and you can imagine this is actually what new relic is doing under the hood
00:19:16.080 it's getting information about all of the requests that are being made to your server in this case we're actually tracking the
00:19:22.000 number of requests for each http status code that we render
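The pattern can be sketched with a toy stand-in for ActiveSupport::Notifications: in a real Rails app you would subscribe to `"process_action.action_controller"` and read `:status` out of the event payload, but the class below only mimics that interface so the example is self-contained:

```ruby
# Toy stand-in for ActiveSupport::Notifications, just to show the
# subscribe-and-count pattern; this is not the real API.
class Notifications
  @subscribers = Hash.new { |h, k| h[k] = [] }

  def self.subscribe(name, &block)
    @subscribers[name] << block
  end

  def self.instrument(name, payload)
    @subscribers[name].each { |b| b.call(name, payload) }
  end
end

STATUS_COUNTS = Hash.new(0)

# Count every rendered response by HTTP status code, as described.
Notifications.subscribe('process_action.action_controller') do |_name, payload|
  STATUS_COUNTS["rack.response.#{payload[:status]}"] += 1
end

[200, 500, 200].each do |status|
  Notifications.instrument('process_action.action_controller', status: status)
end

p STATUS_COUNTS  # {"rack.response.200"=>2, "rack.response.500"=>1}
```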
00:19:28.000 and so you could do this and you could maybe alert on it and get again some of the same information
00:19:33.440 that new relic is getting so now we've got sensu up and running
00:19:39.120 we've got batsd integrated into our app the next step is to get into some of the meat of the project these are the checks
00:19:48.000 this example is sensu at its most basic this is one of the simplest examples
00:19:55.039 you could have and you can see it's really only a few lines of code
00:20:00.880 sensu has a really really nice interface for defining these checks there are two important things to look
00:20:06.559 at here the first is that we're actually looking at a few command line arguments those are the options warning and crit
00:20:12.559 which allow you to define a couple of thresholds that we're going to look at and this particular check is looking at
00:20:17.760 the 15-minute load average on a server the second thing to look at is that we're implementing a run method this is
00:20:24.080 sort of the core logic for the check and what we do is we read in that load average
00:20:29.840 and then based on the thresholds that have been configured in the command line we actually
00:20:36.000 render a status code warning or critical based on a comparison of those values so
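The core of a check like this reduces to comparing a reading against the two thresholds and exiting with the matching code. A dependency-free sketch of that convention (a real plugin would subclass Sensu::Plugin::Check::CLI and implement run):

```ruby
# Sketch of a load-average check's core logic. Sensu follows the nagios
# exit-code convention: 0 = ok, 1 = warning, 2 = critical (3 = unknown).
OK, WARNING, CRITICAL = 0, 1, 2

def check_load(load15, warn_at:, crit_at:)
  return CRITICAL if load15 >= crit_at
  return WARNING  if load15 >= warn_at
  OK
end

# On linux the 15-minute load average could be read (an assumption about
# the platform) with: File.read('/proc/loadavg').split[2].to_f
puts check_load(0.4, warn_at: 1.0, crit_at: 2.0)  # 0
puts check_load(1.2, warn_at: 1.0, crit_at: 2.0)  # 1
puts check_load(3.5, warn_at: 1.0, crit_at: 2.0)  # 2
```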
00:20:41.600 that's really easy so let's look at a little something maybe a little bit more interesting here we're defining a check for tracking
00:20:49.520 our memcache sets and again it actually looks pretty simple it's only a few lines of
00:20:55.520 code but the really interesting thing to call out here is that this is where we can start to take
00:21:00.960 advantage of everything that everyone in the ruby community has built here we're pulling in the memcache
00:21:06.799 gem which gives us really quick and easy access to the stats api and memcache and we can just write those out to be
00:21:13.919 picked up later on by sensu maybe stored in a database graphed on a dashboard you get a lot of
00:21:20.720 information by just outputting these metrics and remember this is ruby so we can even write our
00:21:26.400 specs around our checks that's kind of awesome all right and
00:21:32.799 this is where what the output looks like so you can see we've got the keys on the left those are our stats from memcache
00:21:39.919 we've got sort of the counts for each of those keys and we've got timestamp when they
00:21:45.200 occurred so now that those checks are written the next step is to actually configure them
00:21:51.840 in sensu server so that we can actually get them running on our servers so sensu uses a very basic json
00:21:59.520 configuration file but sometimes it's a little bit hard to remember what each configuration means so i have a little bit of a cheat sheet
00:22:05.600 at the top there that sort of provides a sentence for what this configuration means and we'll run
00:22:10.799 through it so for this load average check what we're asking sensu to do
00:22:16.240 is to run the check load 15 command on a set of subscribers in this
00:22:22.880 case our databases and to run it every interval seconds so that's 60 seconds
00:22:30.080 and after occurrences failures in this case after 10 failures we're going to run a set of handlers
00:22:36.320 in this case the mailer handler and we're going to send a pagerduty alert and we're going to be reminded every
00:22:41.840 3600 seconds if it's still failing so this is sort of the most basic implementation for a check configuration
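That sentence translated into sensu's JSON would look roughly like this; key names follow sensu 0.x as described in the talk, so verify the exact schema against the sensu documentation:

```json
{
  "checks": {
    "load-15": {
      "command": "check-load-15.rb -w 1.0 -c 2.0",
      "subscribers": ["database"],
      "interval": 60,
      "occurrences": 10,
      "refresh": 3600,
      "handlers": ["mailer", "pagerduty"]
    }
  }
}
```

A metrics-only check, like the memcache one, would additionally set `"type": "metric"` so its output is routed to handlers without ever generating alerts.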
00:22:50.159 you can see we're going to use that same memcache check we wrote before but the one key difference is that we
00:22:56.159 actually have a type configuration what this tells sensu is that it's only going to be reporting metrics
00:23:02.880 it's not actually going to generate any alerts in your system
00:23:09.120 and this is where the community starts to really show how easy it is to start tracking
00:23:15.200 some of these things we already have checks written for redis for memcache for resque there are a whole ton of
00:23:21.840 these and the really nice thing is that if you use nagios this is compatible with any nagios plug-in
00:23:28.159 so if you're looking to make the transition from a system like nagios to sensu this is
00:23:34.000 going to make it a lot easier now we've touched on system checks the
00:23:39.120 next thing to look at are our business checks but business checks are a bit of a different beast they
00:23:45.200 don't really behave quite as predictably as your system metrics you know it'd be great if our
00:23:51.440 revenue looked like this i mean maybe we'd want it to go up and to the right but it'd be really easy to alert on right we
00:23:58.080 could just set a few thresholds and if it goes past those thresholds bottom or
00:24:03.120 top we can raise an alarm but the truth is our revenue tends to look like this
00:24:09.279 that's a bit of a roller coaster ride right it's going to depend on the time of the day maybe the day the week
00:24:15.840 and for some of us our revenue graphs make no sense at all i don't even know what's going on there
00:24:22.640 right but the truth is with all these different graphs there are patterns we just need to
00:24:28.480 identify those patterns and i usually start with the most basic i usually start with just looking at the
00:24:34.000 absolute values because these are going to be likely the most catastrophic so if i'm looking at
00:24:39.840 something like revenue i'm going to want to alert when revenue is zero over the last hour
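That simplest alert, revenue is zero over the last hour, is just a sum and a comparison. A sketch with made-up data standing in for a batsd query (the batsd gem's query interface is not shown here; assume it returns the counter's data points for a time range):

```ruby
OK, CRITICAL = 0, 2

# Sketch of the most basic business check: alert when the revenue
# counter sums to (effectively) zero over the last hour. `datapoints`
# stands in for whatever a batsd query for that range would return.
def check_revenue(datapoints, min:)
  total = datapoints.sum
  total >= min ? OK : CRITICAL
end

puts check_revenue([12.5, 3.0, 7.25], min: 0.01)  # 0 -- money came in
puts check_revenue([],                min: 0.01)  # 2 -- nothing for an hour
```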
00:24:45.679 start with something basic right and as you start to find the need as your data starts to conform to it you
00:24:51.840 can start to look at more complex patterns whether that's percentage hour over hour difference
00:24:57.679 or if that's looking at trends week over week day over day or maybe you're looking at
00:25:03.600 a more complex forecast model these are the types of things you can start to look at but i do encourage
00:25:09.600 people to start thinking sort of small think about the easy ones first and then look
00:25:14.880 at more complicated ones as you find the need so let's run through a few examples again here we're going to define a new
00:25:21.200 check using the senso api and this one's pretty straightforward so what this is going to look at is the
00:25:27.440 absolute value of our revenue over the past hour and if it's below a certain value
00:25:33.600 we're going to alert so the two interesting things to look at here are
00:25:39.120 first we're actually pulling in the batsd gem that's written by ben hathaway and what this gives us access to are all
00:25:45.840 those metrics that we are previously recording in our batsd instance the second thing to
00:25:51.600 look at is sort of at the bottom where we're using it so we're going to grab that revenue counter that
00:25:57.440 we were tracking before for our conversions we're going to sum them all up and compare them against those thresholds
00:26:03.360 we've defined and based on that comparison we're going to provide an exit code pretty straightforward but we can get
00:26:09.520 more creative so here's one looking at percentage hour over hour difference
00:26:15.520 it's a similar idea we're going to read in that revenue over the previous hour but we're also going to
00:26:20.720 read in the hour before that and look at the percentage difference between those two
00:26:26.159 and provide an exit code based on that comparison now the one interesting thing to note with this one is that we're actually
00:26:31.840 providing some output what that output gives us is some debugging information so that if our revenue check is failing
00:26:38.400 we can quickly see in our pagerduty alert what the reason is you know
00:26:45.039 what the difference in revenue is and finally we can see an even more
00:26:50.640 complicated example what this uses is a forecast so this is using a gem called linefit from eric cline and
00:26:58.880 this provides a least squares line fit of our data points over a
00:27:04.640 period of time and in this case what we're doing is we're saying let's look at the previous hour and
00:27:12.480 compare that to the same hour in the last four weeks so
00:27:17.600 we're looking at three pm four weeks ago 3 p.m two weeks ago 3 p.m a week ago
00:27:25.440 and then we're saying okay for 3 p.m today what should it have been and as long as it's within a
00:27:31.840 certain uncertainty we're not going to alert on it so this starts to take a look at you know some
00:27:38.080 of the more creative things that we can start to do with our business checks so now we have all these checks
00:27:44.480 in place the next step is to look at our handlers what are we going to do with the output
00:27:50.159 the results of those checks so here's an example again of a pretty
00:27:55.520 basic handler and you can see the implementation is actually fairly similar to a check except in this
00:28:01.279 case we're basically defining a handle method instead of a run method
00:28:06.640 now this particular handler is going to take metrics that we've previously reported
00:28:12.080 from our checks and publish that to our batsd instance and again we can actually use the statsd
00:28:18.799 ruby gem since it implements that same protocol so you can see it's actually fairly
00:28:24.080 straightforward we pull in that connection to our batsd instance we parse the output of the check
00:28:31.279 and then we generate a few data points with statsd there are no exit codes it's really as
00:28:37.760 simple as just processing that output and doing something with it in this case it's reporting those metrics
00:28:44.480 to our batsd instance now here's another example this is a
00:28:49.679 pagerduty handler and what this allows us to do is
00:28:55.840 process a check result which has critically failed and send a pagerduty
00:29:02.640 alert and you'll notice one really nice thing again like the memcache check that we saw before
00:29:09.520 this allows us to pull in some of the existing work that people have already done in this case we're actually pulling in the
00:29:15.440 redphone gem and this is what gives us access to the pagerduty api so you can see
00:29:22.240 one interesting thing here which is the filter method and what this allows us to do is say we're only going to generate one pagerduty
00:29:28.559 incident per failure so that even if our check was failing every minute we wouldn't
00:29:34.320 constantly get pages to our phone and then in the handle method is where that core logic is so that's where we're
00:29:39.919 actually going to create an incident on pagerduty and you can see this is actually what it looks
00:29:45.360 like so we've got the name of the check the client that it ran on the output in this case shows our load
00:29:53.120 average was above the 1.0 critical threshold that we set
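[editor's note: the pipe-handler pattern described above can be sketched in plain ruby. this is a self-contained illustration, not sensu's actual API: a real handler would subclass Sensu::Handler from the sensu-plugin gem and read the event JSON from STDIN, and the statsd datagrams here are formatted by hand instead of going through the statsd-ruby gem. the class name, event values, and metric names are all made up for the example.]

```ruby
# Rough sketch of the metric handler described in the talk. A real
# Sensu handler subclasses Sensu::Handler (sensu-plugin gem) and is
# fed the event JSON on STDIN by the "pipe" handler type; here the
# event hash is passed in directly so the sketch is self-contained.
class MetricsHandler
  # event: a parsed Sensu event hash whose check output is assumed
  # to be graphite-style lines of "metric.path value timestamp"
  def handle(event)
    event['check']['output'].each_line.map do |line|
      name, value, _timestamp = line.split
      # "<name>:<value>|g" is the statsd gauge datagram that the
      # statsd-ruby gem would send over UDP to a batsd instance
      "#{name}:#{value}|g"
    end
  end
end

# Illustrative event, shaped like Sensu's event JSON
event = {
  'client' => { 'name' => 'web-01' },
  'check'  => {
    'name'   => 'load_metrics',
    'output' => "load.avg.1m 0.42 1366000000\nload.avg.5m 0.37 1366000000\n"
  }
}

datagrams = MetricsHandler.new.handle(event)
puts datagrams
```

[a handler like this would then be registered with the pipe-type handler configuration the talk turns to next, and the pagerduty-style dedup would be a `filter` method that simply returns early when the event's occurrence count is above one.]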
00:29:59.039 so now we've got our handlers written we have to do the same type of configuration we did for our checks
00:30:04.240 now the handler configuration is actually quite a bit simpler it basically has two main configurations
00:30:10.399 it's the type and the command the command is straightforward it's just the path to the actual command but the type
00:30:16.880 is pipe which tells us that all of the results from our checks are actually going to be
00:30:22.000 piped into our handlers using standard in and again by making some assumptions you
00:30:27.600 know we can really simplify these configurations there are other configurations
00:30:33.600 you can set here for example we have a severities configuration that we added to our pagerduty handler and this forces only critical
00:30:41.679 checks to actually get processed and again the community really helps out
00:30:47.440 here they've already integrated against all of these services there's a lot more we can do i think there are more
00:30:52.880 services that we should be integrating against but there's already a whole ton of things that we can do that's a great
00:30:58.960 start so the question is you know we've got the sensu framework up and running
00:31:05.840 we've got batsd data we've got system checks we've got business checks
00:31:10.960 what's next where do we go from here the first thing is i want to mention everything we've used
00:31:17.360 and everything we've shown you today is being used by tapjoy
00:31:22.559 in production for almost a year and this is at scale we're monitoring hundreds of servers
00:31:29.039 we're getting over a thousand check results per minute back from all those servers we're tracking thousands and thousands
00:31:35.600 of metrics per second in our batsd instance this is scaling this can work for some of the companies
00:31:42.080 that are at this scale the one other thing to note is that this is starting to become
00:31:47.600 a part of our development process and if you recall i mentioned at the beginning of the talk
00:31:53.120 that this is something people often miss when developing new features
00:32:00.080 it hasn't historically been really part of that process so the question is why not
00:32:05.600 i think one of the main reasons is because it's not easy enough yet you know when rails came
00:32:11.840 out and rspec came out and made it really easy maybe sometimes even fun to write our
00:32:18.320 unit tests but we haven't done that for monitoring yet and it's not the ops team's responsibility to be writing
00:32:25.760 these checks this is engineering responsibility it needs to be a part of the engineering process it needs to be
00:32:31.600 really really easy to write checks for our applications imagine if we could
00:32:38.559 do that right within our own rails app what if we could build a feature
00:32:44.159 instrument it with metrics write tests for it put the monitoring
00:32:49.840 checks in with that commit as well and maybe you can even write
00:32:54.880 specs against that check this is what we need we need the ability to be defining these
00:33:00.720 checks right within our application and what if it was tied to our
00:33:06.720 deployment process so what if we add a capistrano extension so that when our feature goes live
00:33:12.640 so does our check this would be awesome but no one's built this yet someone
00:33:17.760 needs to look at tying something like sensu into our rails apps to make it that much easier
00:33:23.120 where we have no excuse to not be monitoring any of the business metrics that
00:33:28.640 are coming out of the new features we're building this needs to happen and what about our
00:33:33.919 infrastructure we're using things like chef and puppet to automate our stacks but none of these have any
00:33:40.799 smart defaults or conventions around the metrics that we're tracking and the alerts on those metrics
00:33:49.440 you know every time we build a redis server or memcache server we should be automatically tracking those metrics and
00:33:55.919 there should be some smart default set of alerts that we're creating
00:34:00.960 and this hasn't happened yet but you could imagine we could actually implement some chef cookbooks or what have you
00:34:07.440 that automatically puts in this instrumentation in the process for building your stacks
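[editor's note: as the talk says, nothing like this exists yet, but the "smart defaults" idea can be sketched in plain ruby: provisioning a service type automatically registers a conventional set of checks with default thresholds. every service name, check name, and number below is hypothetical, and a real chef cookbook would render sensu check definitions while converging the node rather than just returning hashes.]

```ruby
# Hypothetical sketch of "smart defaults": provisioning a service
# automatically registers a default set of metric checks and alert
# thresholds. No cookbook like this exists; names and numbers are
# purely illustrative.
DEFAULT_CHECKS = {
  'redis' => [
    { name: 'redis-memory-fraction', warn: 0.80,  critical: 0.95 },
    { name: 'redis-latency-seconds', warn: 0.010, critical: 0.100 }
  ],
  'memcached' => [
    { name: 'memcached-evictions-per-minute', warn: 100, critical: 1000 }
  ]
}.freeze

# In a real cookbook this step would write Sensu check definitions
# into /etc/sensu/conf.d as part of building the node; here it just
# returns the definitions that would be written.
def provision(service)
  DEFAULT_CHECKS.fetch(service, []).map do |check|
    { service: service }.merge(check)
  end
end

provision('redis').each do |check|
  puts "#{check[:service]}: #{check[:name]} warn=#{check[:warn]} crit=#{check[:critical]}"
end
```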
00:34:14.399 this needs to happen we need to make it so easy there's no excuse to not do it
00:34:22.399 but yes alerting is hard it takes a while to figure out the right thresholds so it's
00:34:28.800 not going to be as easy as maybe i would like it to be where there's sort of all these defaults
00:34:34.399 but at the very least we need to be monitoring and measuring those metrics we need to actually be
00:34:41.919 tracking those metrics because even though this doesn't look easy this should be easy i love infomercials but seriously
00:34:49.760 alerting is hard yes there are a lot of things we need to be thinking about there's really this delicate balance
00:34:55.599 between good thresholds and noisy alerts if we
00:35:01.280 tighten our thresholds too much we're going to miss alerts if we loosen them too much we're going
00:35:06.800 to get too many alerts it takes a little while to find that balance
00:35:12.079 context is important you know etsy wrote in a blog post saying that
00:35:17.599 measure anything measure everything they weren't too far from the truth you
00:35:22.880 need all of that context to really know if there's actually a problem going on in your system and sometimes there's
00:35:29.520 even context outside of your product that is sometimes hard to anticipate things like sales
00:35:35.599 things like holidays the time of the year and alerting also requires this good
00:35:42.560 balance between top-down metrics and bottom-up metrics those top-down metrics are your business
00:35:48.400 metrics those are the things that are going to tell you when there's an outage affecting something important to your business but
00:35:55.520 those bottom-up metrics those system metrics are going to be critical to understanding
00:36:00.880 when there's a failure in your infrastructure so there needs to be a good balance there
00:36:07.119 some final thoughts i want to leave you guys with first is choosing the right monitoring technology
00:36:12.640 i know i've talked about sensu and batsd it's worked well for us but there are others out there and i
00:36:17.760 encourage everyone to think about the differences between those technologies and choose the right one
00:36:24.960 second it's really critical to think about alerts and metrics that you can actually do something with even though
00:36:32.320 etsy said measure everything if we measured everything it would be really difficult to find the important
00:36:38.480 things so we need to be measuring stuff that's actually actionable
00:36:44.320 and third we need to start making this transition from being reactive and reacting to outages
00:36:51.440 more towards being proactive and knowing about the metrics that are
00:36:56.640 starting to show a trend which is going to result in an outage affecting all of your users
00:37:01.680 and it's going to take a while to get there and you're never going to completely get there but i think we need to
00:37:06.880 start to try and move in that direction fourth we need really good smart
00:37:13.119 dashboards things that don't make it hard for us to get access to the metrics and data we need
00:37:19.599 when there's an outage things that don't get in our way and lastly we really need a game
00:37:25.920 changer in monitoring we need changes that will make it
00:37:31.119 so easy for us to add monitoring into our applications and into our infrastructure that there's
00:37:37.359 no excuse to not do it this is a hard problem
00:37:43.119 but we need someone to come in and make that contribution
00:37:48.560 so if you recall there were two things that i wanted everyone to take away from this talk
00:37:53.920 the first is that we need to be investing time and effort into this monitoring problem this will
00:37:59.760 affect your business it will affect your revenue and this is the type of thing that you don't realize you need it
00:38:07.520 until you need it so i'm telling you now you need it and the second thing is that there are a
00:38:15.440 lot of open source tools available for monitoring there are a lot
00:38:21.040 of people solving some of these problems they can always use more help though so what i encourage
00:38:27.440 everyone in this room you can take this back to your companies is that we need to work together and we
00:38:33.599 really need to think about how do we take monitoring to that next level and integrate that into our own rails
00:38:41.200 applications thanks for your time