Herding Elephants

by Clint Shryock

Herding Elephants presents insights on how Heroku operates the largest fleet of PostgreSQL databases through a blend of Ruby applications, emphasizing service-oriented architecture, infrastructure as code, and robust fault tolerance. Speaker Clint Shryock, a support engineer at Heroku, uses humor and personal anecdotes to connect with the audience while delving into the technical aspects of their database management approach.

Key Points:

  • Introduction to Heroku and Its Postgres Team:

    • Clint clarifies his role at Heroku and distinguishes his team's responsibilities, noting that they are a small unit managing a vast infrastructure with thousands of PostgreSQL databases.
    • Emphasizes the concept of a database as a service and the add-on relationship of Heroku Postgres, highlighting its early adoption in the marketplace.
  • Evolution of Heroku Postgres:

    • Begins with a simple Sinatra application that has grown into a constellation of applications for effective management.
    • Describes a distributed architecture in which separate applications handle distinct tasks (production tier, starter tier, data clips, backups, and internal administration).
  • Monitoring and Managing Databases:

    • Stresses the importance of continuously monitoring thousands of databases to spot issues early.
    • States that they adopt an outside-in approach, where workers connect over SSH and gather information to assess each database's status, rather than installing monitoring software on the database servers.
  • State Machines and Stateless Workers:

    • Clint outlines the use of state machines to manage complex behaviors and transitions among various states (e.g., up, down, uncertain) for server resources.
    • Discusses the efficiency of stateless workers that quickly execute tasks without maintaining deep state connections, allowing for rapid recovery from issues.
  • Incident Management:

    • Design of incident resolution protocols ensures issues are documented and addressed systematically.
    • Playbooks for common incidents promote knowledge sharing across team members, reducing reliance on individual expertise.
  • Handling Failures and Escalations:

    • If resolution efforts fail, there are escalation procedures in place that involve human intervention to resolve complex problems.
    • Stresses the importance of expecting failures as an inherent part of operating at scale and maintaining a positive attitude.

Conclusion:

Clint’s talk illustrates the significance of simplicity in design, effective monitoring, and error management in complex systems. He asserts that embracing the inevitability of failures while having a structured approach to handling them is crucial for success. This presentation serves as a valuable resource for engineering teams looking to improve database management processes while maintaining system reliability.

00:00:20.480 All right, that's our h.
00:00:25.599 Hi everyone, my name is Clint.
00:00:31.039 I like to start all my presentations with a very awkward photo of myself to kind of break the ice.
00:00:38.079 And my name is spelled very largely; that's mostly for vanity purposes, but also just to clarify that my name is not
00:00:45.920 Glenn, Quint, Client, Chad Quinn, or anything that ends in 'nt'.
00:00:52.000 Do I have any other Clints in the audience? No? If there was another Clint, he would know exactly what I'm talking about.
00:00:58.719 Whenever I call and order things, I have to spell my name. My wife makes fun of me, but if I don't, I'm always called Quinn.
00:01:04.720 Or Client; I don't get that one either.
00:01:10.880 Maybe you've noticed that I'm not wearing boots, I’m not wearing sandals, and I don't have a cowboy hat, so I am not from Texas.
00:01:16.400 I come all the way from Missouri. Yay, Missouri!
00:01:22.400 That's actually the next part, right? So, is anyone from Missouri? Yes!
00:01:27.840 Hey, two people! That's twice as many as I was expecting.
00:01:33.360 When I got here, I was meeting other people, and I said, 'Yeah, I'm from Missouri.'
00:01:38.479 As one from Missouri does, people are like, 'Oh yeah, Missouri! St. Louis! God, I love St. Louis, it's great. You know they’ve got baseball and an awesome hockey team, and the Arch, right?'
00:01:44.000 But yeah, I'm not from St. Louis.
00:01:49.680 So they're like, 'Oh, Kansas City! The City of Fountains!' which I honestly don't even know if that's Kansas City. Google says it is.
00:01:56.159 Why it’s called the City of Fountains, I have no idea.
00:02:01.439 There are some fountains there apparently, and no, I'm not from Kansas City either.
00:02:07.759 So now I have their attention because their list of cities in Missouri is exhausted, and I tell them I'm from Columbia.
00:02:13.360 Of course, they have to talk, I don’t know who you are, but he probably knows what I'm talking about.
00:02:18.640 Because then the next comment is 'Oh, okay, where's Columbia?' And I say, 'Well, come on, Missouri, right?'
00:02:23.840 So we've got Kansas City on the west, St. Louis on the right, and in between we've got Columbia!
00:02:29.120 Columbia is well known for two things: one, the University of Missouri, which had an awesome, fantastic football season better than any Texas team, for sure.
00:02:36.560 And, of course, we're known for being exactly in the middle of Kansas City and St. Louis.
00:02:41.760 I'm convinced we were founded in the wagon days, probably about two days' travel outside of St. Louis.
00:02:47.440 And on the second day, you really don’t want to sleep in your wagon again, so we had to build a roof or something.
00:02:53.040 So, yeah, I tell people I'm from Columbia.
00:02:59.040 That's how that goes.
00:03:05.840 I'm glad I got a few laughs there. If anybody does not find this slide hilarious, I'm really sorry you don't find this funny.
00:03:11.280 I laugh every time I see this internally; I don't know why.
00:03:17.520 But if you don't find this funny, then just hunker down because you're in for a rough ride.
00:03:22.640 This was just posted like an hour ago, and I know it's kind of cliché, but this is my first time presenting at a conference.
00:03:30.400 It happens to be on a Friday, so it'd be really cool if everybody would stand up.
00:03:37.680 I can take that funny little picture. Come on, no one's standing up? Oh God.
00:03:44.000 It's happening! Yay, all right! There we go, yay! Hugs, great.
00:03:50.120 They even got some woos for me! You know, woo is not really native to Missouri, but that's okay.
00:03:55.440 I'll post that picture in a little bit; if I'm on my A-game, go ahead and retweet that.
00:04:02.080 I'll be really famous after that and somehow profit.
00:04:08.639 Yeah, I work for a company named Heroku. If you’ve heard of us, you know how awesome we are.
00:04:14.240 If you have not heard of us, we do not make signs. We're actually a platform company.
00:04:19.840 In Missouri, that is really, really difficult to describe.
00:04:26.800 I’m a support engineer at Heroku, and a lot of people think, 'Oh, so you do support?'
00:04:34.080 Well, yes, I do, but at Heroku a lot of support engineering is just like you.
00:04:40.240 We're programming all day, but we have a mindset of taking all the support tickets we get and engineering ourselves out of support tickets.
00:04:47.760 Specifically, I work with the Heroku Postgres team where we are a database as a service for your favorite elephant-themed relational database.
00:04:55.440 The team itself is less than 10 people; we have hundreds of Amazon servers.
00:05:02.000 The last I checked, I guess I don’t know; we don’t really pay attention, but we have thousands of Postgres databases.
00:05:09.600 Just like the RTF team, we are internally referred to by an acronym: DOD, which stands for Department of Data.
00:05:17.240 And if you need to remember what that means, it just means we are way better than the RTF.
00:05:22.560 Oh, I ruined that joke! Is Richard even here? No? He's not even here for me to make fun of. There he is! Hi, Richard!
00:05:29.440 Sorry for poking fun at you.
00:05:34.240 Today, we're going to talk about the approach we take to managing lots and lots of databases in a talk I call Herding Elephants.
00:05:41.120 Get it? Elephant? You know, Postgres works out.
00:05:47.440 So, how Heroku uses Ruby to run the largest fleet of Postgres databases in the world.
00:05:53.440 The asterisk there means probably the largest at the time of writing.
00:05:58.800 That's probably true; it might not be true for the next few years – who knows?
00:06:03.680 Things change; it's not really a vanity metric we keep track of, but it sounds really cool on slides!
00:06:09.760 So, maybe you can tell I've never really spoken at a large conference.
00:06:17.200 When I got my little acceptance email, you're probably thinking now, like, 'Wow, what were they thinking with that?'
00:06:22.880 But I had this title, you know, this thinking of Herding Elephants. This is going to be great!
00:06:28.960 I thought I could come up with this whole talk that's all oriented around the Postgres elephant and elephants in general.
00:06:36.160 But as I've discovered, talking to a couple of people here, I've been here for the past few days.
00:06:43.680 This is not a PostgreSQL talk. Postgres is awesome, we love it, and it's great.
00:06:50.320 But no, I'm not actually going to talk about Postgres things, really.
00:06:55.680 We could be managing, I don't know, bots or something of whatever.
00:07:01.920 It's really more of an architecture talk, right? It's about managing a lot of things in the cloud.
00:07:07.680 Things that we refer to as fleets. Really, it's about managing fleets.
00:07:14.400 And who doesn't love Star Destroyers? Come on!
00:07:19.440 You are a rebel scum!
00:07:25.680 So, if you came here to hear important and cool things about managing Postgres, I’m sorry.
00:07:32.240 And I had this idea of herding elephants because it's thousands of Postgres databases.
00:07:39.440 I'd make this all elephant-and-herd-themed, but it's also not a talk about elephants.
00:07:46.080 If you came here for elephants, I’m sorry; I can't help you.
00:07:52.080 So, quick backstory: Heroku Postgres is actually not a core thing of the Heroku platform itself.
00:07:59.360 We're what's called an add-on. We exist in the add-on marketplace, which is a nice offering from Heroku.
00:08:06.400 It allows you to easily extend and add things to your applications, like New Relic, Redis, and Postgres.
00:08:12.720 You can attach these things as you will.
00:08:20.000 We were actually one of the first ones, which is cool. We kind of broke a lot of ground and a lot of things.
00:08:27.440 So, you can kind of think of it like this: you talk to Heroku, Heroku talks to us.
00:08:34.240 And we're kind of in our own little realm, even though we actually all work for Heroku.
00:08:40.960 We all sit there and eat the awesome lunches and stuff.
00:08:47.200 And all of our applications run on Heroku, so that’s pretty cool.
00:08:54.080 So, Heroku Postgres version zero, the very first thing, was just a single Sinatra application.
00:09:02.240 It used a library called Stem to speak to AWS. Can you read the orange?
00:09:08.160 Oh, so sorry! I'll just read the orange parts. That’s really disappointing; it looked great on my screen.
00:09:13.839 So, we used Stem to talk directly to AWS, and we used SQL to speak directly to the Postgres instance.
00:09:20.799 There weren't that many databases then, so this worked out great!
00:09:26.640 We just had one app and one server, and they talked to each other.
00:09:32.960 This was all great when we didn't have a lot of databases. The goal, design model, or mantra we had was just the simplest thing that could possibly work,
00:09:40.000 but no less. This is a common theme in the DOD; it's something we strive to do.
00:09:46.160 It's just something that's always in the back of our mind.
00:09:52.240 More things, more features mean more broken stuff.
00:09:58.400 Every line of code you write to do anything is something that will bite you at some point.
00:10:05.680 Or, if you move on, it'll bite someone else, and they won’t like you.
00:10:11.760 So, fast forward to now, the latest version has grown into a constellation of applications.
00:10:18.080 About five applications, still all using Sinatra. We're using Fog now to talk to AWS, and we still use SQL to talk to the Postgres instances themselves.
00:10:25.440 We've also grown to use background workers, using Sidekiq and queue_classic, to do the bulk of our work.
00:10:32.160 Now it kind of looks like this: this is nothing groundbreaking, right? It's a constellation of applications.
00:10:38.560 They communicate over APIs, and there’s separation of concerns.
00:10:46.480 We have one that's just in charge of managing the production tier and one that's in charge of the starter tier.
00:10:52.280 We've got one that does your data clips, and you could probably count PG Backups in there, which does backups and snapshots.
00:10:58.160 There's also an internal one used for administratively managing things.
00:11:04.080 Some of them talk to AWS, some of them don't, but we’ve grown and spread out like that.
00:11:11.040 Almost all of these applications are Sinatra apps running within Sinatra apps.
00:11:17.440 So it's expanded; we've got various middleware, and you've got all your different endpoints that themselves really just launch individual Postgres Sinatra applications.
00:11:23.760 By nesting Sinatra applications like this, it allows you to focus and isolate specific things within the single application domain.
00:11:32.320 It makes the code easier to reason about as it's separated into different endpoints.
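The nesting idea can be sketched without any gems. This is a minimal illustration, not the code from the slides: plain lambdas stand in for the inner Sinatra apps, and `Dispatcher`, `ResourcesApp`, `IncidentsApp`, and the mount paths are all hypothetical names. With real Sinatra apps, the same effect is usually achieved with Rack's URL mapping (for example `map` in `config.ru`).

```ruby
# Rack-style sketch: each "sub-app" is any object that responds to call(env).
# Plain lambdas stand in for Sinatra apps here (assumption; no gems required).
ResourcesApp = lambda do |env|
  [200, { "content-type" => "text/plain" }, ["resources: #{env['PATH_INFO']}"]]
end

IncidentsApp = lambda do |env|
  [200, { "content-type" => "text/plain" }, ["incidents: #{env['PATH_INFO']}"]]
end

# The outer app only routes by path prefix and hands the rest to a sub-app.
class Dispatcher
  def initialize(mounts)
    @mounts = mounts
  end

  def call(env)
    path = env["PATH_INFO"]
    prefix, app = @mounts.find { |pre, _| path == pre || path.start_with?("#{pre}/") }
    return [404, { "content-type" => "text/plain" }, ["not found"]] unless app

    # Strip the prefix so the sub-app sees paths relative to its own mount point.
    app.call(env.merge("PATH_INFO" => path.delete_prefix(prefix)))
  end
end

APP = Dispatcher.new("/resources" => ResourcesApp, "/incidents" => IncidentsApp)
status, _headers, body = APP.call("PATH_INFO" => "/resources/42")
puts status
puts body.first
```

Each sub-app only has to reason about its own endpoints, which is the isolation benefit described in the talk.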
00:11:40.480 The apps themselves are divided into several processes.
00:11:47.280 Anyone familiar with Heroku probably recognizes this; it's a Procfile.
00:11:53.600 It's a way of taking a single application and defining individual processes contained in there.
00:12:00.160 This is a feature that you can use and scale horizontally.
00:12:06.720 You can also see that we have a summary here.
00:12:13.040 The point is that the majority of these are workers.
00:12:20.080 Your background processes do most of the heavy lifting, while the front-end stuff is usually quick and doesn’t do much of the lifting.
00:12:28.080 We literally run hundreds of workers across the five applications we have.
00:12:35.200 I think we have about 50 distinct process types, and each application itself has maybe three or four web workers.
00:12:41.520 Each of them has probably 200 plus workers of various kinds; some of the queues or processes have over 200 workers.
00:12:48.480 So we use workers a lot.
00:12:55.440 So, even while splitting this into an ecosystem of applications, it's still the simplest thing that could possibly work.
00:13:02.640 But no less than that; so, that's kind of the ecosystem.
00:13:09.200 Or the architecture of the lay of the land, so to speak.
00:13:16.640 Now, on to managing databases.
00:13:23.920 So, like I said, we have this fleet, this great awesome fleet of things.
00:13:30.560 In order to successfully run a service like this, you have to be continuously monitoring them.
00:13:37.760 You have to keep watch of everything.
00:13:44.080 If you’re looking at this, you might be asking yourself what's wrong with this picture.
00:13:51.040 It should jump out pretty quickly that it's this guy; we've got this whole fleet doing stuff.
00:13:57.280 Then we got this random one going the wrong direction.
00:14:02.160 What’s this guy doing? I mean, all these ships are coming this way; this is dangerous.
00:14:09.680 This is no good; it's going to run into somebody!
00:14:15.520 So you need to be able to monitor the fleet and identify and spot this guy.
00:14:22.080 Find out what's going on.
00:14:28.480 When you manage and monitor a lot of things, you tend to expect them to go wrong.
00:14:34.960 Things do, and you have to have this attitude about it.
00:14:40.880 But whatever this guy is doing, you need to be able to expect that this is going to happen.
00:14:48.640 Yeah, he is doing his own thing, probably causing trouble.
00:14:54.000 When you see things like this from a service point of view, someone is probably having a bad time.
00:14:59.520 That represents someone's servers or someone's database that’s gone astray.
00:15:05.680 If you don't keep your eye on these things, if you don't monitor them, things will go wrong.
00:15:11.840 People will open support tickets stating, 'Things go wrong! My database is down,' and you can tell that people are mad.
00:15:18.080 So how do you do that? You've got thousands of these things to monitor, both at the server level
00:15:25.760 and at the resource level. I'll get to resources in a minute.
00:15:32.360 Your first thought might be: well, with these images that we’re using, we'll install software on them.
00:15:37.919 That’s not the approach we took; the approach we’ve taken is kind of outside-in.
00:15:44.160 So workers connect with SSH and they collect information about the environment.
00:15:50.640 The servers themselves are actually very dumb; we try to keep them dumb.
00:15:58.400 They only have the base OS, which is Ubuntu 12 or whatever long-term support we had last.
00:16:03.680 Postgres 9 and up; we finally killed off all the Postgres 8s. We used to have those until about six months ago.
00:16:10.560 That was a pain. There's also this thing called WAL-E.
00:16:18.080 WAL-E was developed internally for shipping our write-ahead logs (WAL), which are a feature of PostgreSQL.
00:16:24.480 We ship those off-site; that's part of our durability.
00:16:30.000 All that stuff. But WAL-E is written in Python.
00:16:35.280 So, if you came here for the WAL-E story, we're not going to talk about that.
00:16:41.920 Again, outside-in information is gathered by the workers.
00:16:48.320 A worker uses it to make an observation, determine the state, and then decide what action to take.
00:16:54.720 The primary things we observe here are resources; these represent the databases.
00:17:02.080 For those, we collect information such as database name, port, created-at time, and various database type information.
00:17:09.760 Then we have servers, which represent the physical things on AWS, or virtual things on AWS.
00:17:16.320 You've got IPs, instance ID, what availability zone it's in, how long it’s been up, and that kind of fun stuff.
00:17:24.080 We need to monitor all of these things all the time.
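As a rough illustration of the two kinds of records, here is a small Ruby sketch. The struct names `ResourceRecord` and `ServerRecord`, the field names, and the sample values are assumptions based only on the attributes listed above; the real schema is not shown in the transcript.

```ruby
# Hypothetical record shapes based on the attributes mentioned in the talk.
ResourceRecord = Struct.new(:name, :port, :created_at, :plan, keyword_init: true)

ServerRecord = Struct.new(:ip, :instance_id, :availability_zone, :launched_at,
                          keyword_init: true) do
  # Uptime in seconds, relative to a supplied clock value.
  def uptime(now = Time.now)
    now - launched_at
  end
end

server = ServerRecord.new(
  ip: "10.0.0.5",
  instance_id: "i-example",            # placeholder, not a real instance ID
  availability_zone: "us-east-1a",
  launched_at: Time.utc(2014, 1, 1)
)
resource = ResourceRecord.new(
  name: "db-example", port: 5432, created_at: Time.utc(2014, 1, 2), plan: "production"
)

puts resource.name
puts server.uptime(Time.utc(2014, 1, 1, 1)) # one hour after launch, in seconds
```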
00:17:31.440 To do this, we use two things which are awesome: state machines and stateless workers.
00:17:37.360 I'll explain a little bit more on that; you probably know what a state machine is.
00:17:42.560 The history there is rooted in game programming.
00:17:48.320 Peter V. H. is one of the founders of Heroku Postgres, and his background was in game development.
00:17:55.200 So when it came to this kind of monitoring idea, he naturally thought of gaming.
00:18:01.120 Where you have this constant loop of observing your environment and taking action.
00:18:07.280 Am I on fire? What should I do about that? Am I being attacked by a goblin? What should I do about that?
00:18:13.760 Am I sleeping in a tent? Great!
00:18:20.080 You know, it’s like we connect to a server, and we talk to it; we say hello.
00:18:26.880 The server says hi, all right, well that's established, we can connect to the server and make progress.
00:18:33.960 Then we say things like, 'All right, SELECT 1,' and Postgres is like, 'Oh, 1.' You're like, 'Great!'
00:18:41.200 Now the server is not only up, but Postgres is running.
00:18:47.520 We do this all the time, forever.
00:18:54.080 Right? We’ve got thousands of resources, thousands of things that need to be checked.
00:19:00.160 And every one of them gets checked at least once a minute.
00:19:06.480 So, yeah, you connect to another server.
00:19:14.080 It has Postgres installed; hi, hello, SELECT 1. Great!
00:19:20.320 Then you connect to yet another server; hello, hi, SELECT 1. Great!
00:19:26.240 All these workers are going around, feeling their environment, thinking about things, and maybe doing stuff.
00:19:32.560 And we need to do this all the time, forever.
00:19:38.720 So to do this, we use a queue but treat it like a ring.
00:19:46.240 We want a worker to grab a database off the top of the queue, you know, shift it off.
00:19:52.960 We want to feel, which is a method name, and it sounds kind of odd when I stand up here.
00:19:59.680 But we want to feel its state, and then we tell it to think, where we take action.
00:20:05.120 Once we're done with that, we push it back onto the queue.
00:20:11.760 We don't linger here; these steps are usually pretty quick, and we need to do this all the time.
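The ring can be sketched in a few lines of Ruby. This is a single-process illustration under stated assumptions: `Checkable` and `run_ring` are hypothetical names, and an in-memory array stands in for the shared queue (the talk mentions Sidekiq and queue_classic for the real workers).

```ruby
# A queue treated as a ring: shift an item off the front, feel, think, push it back.
# `Checkable` is a hypothetical stand-in for a monitored database.
class Checkable
  attr_reader :name, :log

  def initialize(name)
    @name = name
    @log = []
  end

  def feel
    @log << :feel   # observe the environment
  end

  def think
    @log << :think  # act on the latest observation
  end
end

def run_ring(ring, passes)
  passes.times do
    item = ring.shift   # take from the front of the queue
    break if item.nil?

    begin
      item.feel
      item.think
    ensure
      ring.push(item)   # always return it to the back, even if a step raised
    end
  end
end

ring = [Checkable.new("db-a"), Checkable.new("db-b"), Checkable.new("db-c")]
run_ring(ring, 4)
puts ring.map(&:name).inspect
```

Because the item always goes back on the queue, no worker "owns" a database between passes; any worker can pick it up next.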
00:20:17.920 We use state machines to help us out.
00:20:24.000 When you're creating a server or making a new database, we have different states that these things can be in.
00:20:29.920 The creating state, the happy 'up' state, the 'maybe' state, the 'whoops' state, and destroying.
00:20:36.000 We have these workers, and they need to go around and find out what state these things are in.
00:20:43.200 They need to feel their environment and determine, 'Am I up? Great! Let’s keep on going!'
00:20:50.960 'Am I maybe up? Well, that was my last state, so maybe I'm up now; maybe I'm back; I don’t know!'
00:20:57.200 My last one was not so great; maybe I'm down.
00:21:04.000 So feeling is when we go and connect to a server and observe the environment.
00:21:10.240 We have this class, Resource, which is obviously abbreviated here.
00:21:17.680 We have this class called Feeler, and the feelers collect information about the system.
00:21:22.880 So when we grab the resource, we call feel.
00:21:28.080 All we do is create a new observation with that feeler and grab the current environment.
00:21:34.240 We record that in the observations table, which is an append-only type table.
00:21:40.480 We don’t update an observation; we create a new record each time; hence, we have a history.
00:21:47.760 The observations are very simple; they have an ID, when they were created, attributes, and foreign UUIDs pointing to the resource or server.
00:21:55.040 Once we record that, we move on to the next step, which is thinking.
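A minimal sketch of the feel step follows, assuming an in-memory array in place of the observations table. `Feeler`, `feel`, and the observation fields come from the talk's description; `ObservedResource`, `current_environment`, and the probe block are hypothetical, and the slides' actual code is not reproduced here.

```ruby
require "securerandom"

# Append-only observation record; field names loosely follow the talk's description.
Observation = Struct.new(:id, :created_at, :resource_id, :attrs, keyword_init: true)

# Hypothetical Feeler: wraps whatever probe collects facts about the environment.
class Feeler
  def initialize(&probe)
    @probe = probe
  end

  def current_environment
    @probe.call
  end
end

class ObservedResource
  attr_reader :id, :observations

  def initialize(feeler)
    @id = SecureRandom.uuid
    @feeler = feeler
    @observations = []   # stands in for the observations table
  end

  # Never update an old observation; always append a new one, so history is kept.
  def feel
    obs = Observation.new(
      id: SecureRandom.uuid,
      created_at: Time.now,
      resource_id: id,
      attrs: @feeler.current_environment
    ).freeze
    @observations << obs
    obs
  end

  def last_observation
    @observations.last
  end
end

answers = [{ available: true }, { available: false }].each
resource = ObservedResource.new(Feeler.new { answers.next })
2.times { resource.feel }
puts resource.observations.size
puts resource.last_observation.attrs[:available]
```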
00:22:02.480 We consider the last observation we made.
00:22:08.960 What do we do? We include this thing called... I thought I switched to new slides.
00:22:15.040 So resources have these things called states; state is a method that comes from the Stateful module.
00:22:22.320 When the resource itself loads, we execute this method, which has a name and a block of code.
00:22:28.400 We end up creating this map of things.
00:22:35.680 Here's the Stateful module summarized: the state method takes a name and a default of nil, which I have no idea what it does.
00:22:42.000 Then we create like a map of names to blocks of code.
00:22:49.440 The uncertain one gets this block of code, and the available one gets that block, so on and so forth.
00:22:57.440 And here we get the think method.
00:23:03.760 After we've observed, we now say, hey, evaluate the state that we're currently in.
00:23:10.560 So look at this code and evaluate it; do this thing.
00:23:16.880 So if we're available, what do we do?
00:23:24.080 Well, if the last observation we had said the service was not available, we transition to the uncertain state.
00:23:31.600 If it didn't say that, well we just move on; hooray!
00:23:38.000 Just like if it's uncertain, and now it says it is true, we'll go back to available and get on with our lives.
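The available/uncertain transitions described above can be sketched as follows. `think`, `state`, and the two state names are from the talk; `transition`, `current_state`, and `last_available` (a stand-in for the latest observation) are hypothetical names chosen for this illustration.

```ruby
module Stateful
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    def states
      @states ||= {}
    end

    def state(name, &block)
      states[name] = block
    end
  end

  attr_reader :current_state

  def transition(to)
    @current_state = to
  end

  # Evaluate the block registered for the current state, in the instance's context.
  def think
    block = self.class.states.fetch(current_state)
    instance_exec(&block)
  end
end

class Resource
  include Stateful
  attr_accessor :last_available   # stands in for the latest observation

  def initialize
    @current_state = :available
    @last_available = true
  end

  state(:available) { transition(:uncertain) unless last_available }
  state(:uncertain) { transition(:available) if last_available }
end

db = Resource.new
db.last_available = false
db.think
puts db.current_state   # uncertain
db.last_available = true
db.think
puts db.current_state   # available
```

The worker calling `think` needs no knowledge of the transitions; all of the behavior lives in the blocks attached to each state.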
00:23:44.800 So pulling something off of the queue, feeling its environment, thinking about it, pushing it back onto the queue.
00:23:51.440 We need to do this all the time, forever.
00:23:57.440 So, state machines, stateless workers. The workers don't know much about the state.
00:24:05.360 They don't want to track the state because we don't want to tie up a worker to a resource.
00:24:11.680 We want to be able to quickly just grab it, do something really quickly, and move on.
00:24:18.080 We don't want a worker to have too much of an important relationship with what it’s doing.
00:24:24.320 Workers go down; all sorts of different things happen there.
00:24:31.440 So workers are constantly going through the queue; they’re the ones that do all the heavy lifting.
00:24:37.440 They’re the ones who do the things that take time.
00:24:43.680 The stateless workers are the ones who talk to AWS; they’re the ones who talk to the Heroku API.
00:24:50.160 To synchronize information or get commands or whatever they need to do.
00:24:56.960 And they’re the ones that connect to databases and talk directly to Postgres.
00:25:03.760 These, as far as computer terms go, are the things that take time.
00:25:10.560 We need to offload them to background workers because all of these things require networks.
00:25:16.720 In a giant cloud, even a great one like Amazon, all of them can fail, and they all do fail all the time.
00:25:22.720 Given Amazon's network, the great part is this idea of quickly grabbing something and doing this:
00:25:29.440 feeling, thinking, and moving on. Part of this is, if for some reason a worker can't connect to that service,
00:25:35.360 it just immediately moves on to the next thing.
00:25:42.240 But due to the way Amazon's network happens, sometimes that’s just a little glitch,
00:25:48.480 and the next worker is going to pop up in some other place in an entirely different availability zone.
00:25:55.760 It might have no problem at all connecting.
00:26:02.960 So we've kind of avoided a false positive situation and worked around maybe network partitions or various things that could go wrong.
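A small sketch of how a transient failure might be absorbed. Everything here is hypothetical: `Worker`, `ConnectionFailed`, `MonitoredDb`, and in particular the rule that a second consecutive failure marks the resource `:down` are assumptions made for illustration; the talk only says that a failed check moves the resource to uncertain and that a later successful check moves it back.

```ruby
# Sketch: one failed check only makes a resource :uncertain; the next worker's
# check either clears it or confirms it.
class ConnectionFailed < StandardError; end

MonitoredDb = Struct.new(:name, :state)

# A worker holds no per-resource state; it only knows how to probe.
class Worker
  def initialize(&probe)
    @probe = probe
  end

  def check(resource)
    reachable = begin
      @probe.call(resource)
      true
    rescue ConnectionFailed
      false   # do not retry or wait here; record the result and move on
    end

    resource.state =
      if reachable then :available
      elsif resource.state == :available then :uncertain
      else :down   # assumption: a second consecutive failure confirms the problem
      end
  end
end

flaky   = Worker.new { |_r| raise ConnectionFailed, "network glitch" }
healthy = Worker.new { |_r| :ok }

db = MonitoredDb.new("db-a", :available)
flaky.check(db)
puts db.state    # uncertain
healthy.check(db)
puts db.state    # available
```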
00:26:10.080 I’m tired of you already. Sorry.
00:26:15.280 I was supposed to tie that in with the last thing: what to do when things fail.
00:26:20.560 When we've figured out that it's not a network issue and the server is down.
00:26:28.240 So, when that last observation said that the service was not available, we need to create an incident.
00:26:35.680 It's a certain type of incident; we have a lot of these.
00:26:42.240 Incidents occur when things go wrong - as they will.
00:26:48.560 As I've said, on the cloud and at scale, strange things happen.
00:26:55.120 I was talking to Tanner about how, as you scale, edge cases will remain edge cases.
00:27:01.840 But they simply become more frequent; you're doing things so many times that they are no longer bizarre.
00:27:08.080 They're just kind of strange, and there are many different types of incidents.
00:27:14.160 So we could have a resource down, stalls, failed followers stuck, the mounting drives, critical servers down, or duplicates.
00:27:21.520 In order to address all of these things, you naturally start to develop playbooks.
00:27:27.760 Things that engineers can read and use to solve problems.
00:27:34.160 That way, the solutions to these things aren't locked up in one individual's head.
00:27:40.720 Once you start codifying and cataloging these things, you can create yet another state machine.
00:27:46.560 Incidents have their own state machine: you can have a triggered one, then resolving, waiting, needing a human, or resolved.
00:27:52.720 So we bounce back and forth here using the stateful module to do all of this.
00:27:59.280 We utilize state machines and stateless workers.
00:28:06.720 State machines, stateful modules, take that home with you.
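The incident states named above can be laid out as a transition table. The state names come from the talk; which transitions are allowed between them is an assumption made for this sketch, as are the class and method names.

```ruby
# Hypothetical transition table for incidents, using the state names from the talk.
class Incident
  TRANSITIONS = {
    triggered:   [:resolving, :needs_human],
    resolving:   [:waiting, :resolved, :needs_human],
    waiting:     [:resolving, :resolved, :needs_human],
    needs_human: [:resolved],
    resolved:    []
  }.freeze

  attr_reader :type, :state

  def initialize(type)
    @type = type
    @state = :triggered
  end

  def transition(to)
    allowed = TRANSITIONS.fetch(state)
    raise ArgumentError, "cannot go from #{state} to #{to}" unless allowed.include?(to)

    @state = to
  end
end

incident = Incident.new(:resource_down)
incident.transition(:resolving)
incident.transition(:waiting)
incident.transition(:resolved)
puts incident.state
```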
00:28:13.600 We have yet another queue—a ring of incidents!
00:28:20.080 The workers go along, and we don't feel at this point because we know things are wrong.
00:28:27.680 We just need to start taking action!
00:28:33.600 We pop something off the queue; we need to take some action, and then we push it back onto the queue.
00:28:40.000 Maybe that action will actually resolve things.
00:28:46.960 This worker won't know; it will just execute the code, transition to the state it needs to, and move on.
00:28:54.160 Then the next worker picks it up and says, 'Hey, you're all better now!'
00:29:01.440 So we need to do this all the time, a lot, all the time.
00:29:07.760 So again, with our wonderful stateful module, we have incidents.
00:29:15.120 All of these incidents are going to attempt this resolution.
00:29:22.080 And if the resolution doesn't immediately work, we will open a ticket to the customer.
00:29:29.920 If we get an error somewhere in there, we escalate to a human.
00:29:36.880 We’ll get to that in a minute.
00:29:43.040 It's the same with wait-for-resolution; we have one of these state blocks for basically everything.
00:29:49.040 We usually wrap these explicitly in begin/rescue-type statements because it's common for these things just to completely bomb out.
00:29:54.880 If something is just unreachable, we deal with it.
00:30:01.680 If all else fails, you know, escalate to a human.
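The attempt, ticket, and escalate flow might look roughly like this. It is a sketch under assumptions: `TrackedIncident`, `attempt_resolution`, the injected `resolver` callable, and the `events` list are hypothetical stand-ins for the real resolver, ticketing, and paging code.

```ruby
# Sketch of an incident's "attempt resolution" step with an explicit begin/rescue.
class TrackedIncident
  attr_reader :state, :events

  def initialize(resolver:)
    @resolver = resolver
    @state = :triggered
    @events = []
  end

  def attempt_resolution
    begin
      if @resolver.call
        @state = :resolved
      else
        @events << :customer_ticket_opened   # not fixed yet: tell the customer
        @state = :waiting
      end
    rescue StandardError => e
      # Anything unexpected (unreachable host, API error) goes to a person.
      @events << [:escalated, e.message]
      @state = :needs_human
    end
  end
end

ok     = TrackedIncident.new(resolver: -> { true })
slow   = TrackedIncident.new(resolver: -> { false })
broken = TrackedIncident.new(resolver: -> { raise IOError, "host unreachable" })

[ok, slow, broken].each do |incident|
  incident.attempt_resolution
  puts incident.state
end
```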
00:30:08.640 Actually, looking at a different part of the file, we have all these types here; it’s an array.
00:30:14.560 We load all these resolvers; these resolvers are files codified from our playbooks.
00:30:20.160 We've actually written it into code how to do these things.
00:30:26.800 So we load up all these files and create an in-memory hash of them for the type of incident and the resolver that can handle it.
00:30:33.360 We do that by calling the handles method.
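A registry like the one described can be built with a class-level `handles` macro. Only the method name `handles` is taken from the talk; `Resolver`, `REGISTRY`, `lookup`, the resolver class names, and the incident type symbols are hypothetical.

```ruby
# Sketch of a resolver registry keyed by incident type.
class Resolver
  REGISTRY = {}

  # Called in a subclass body to declare which incident types it can handle.
  def self.handles(*incident_types)
    incident_types.each { |type| REGISTRY[type] = self }
  end

  def self.lookup(incident_type)
    REGISTRY.fetch(incident_type)
  end
end

class BasicRestart < Resolver
  handles :resource_down
end

class ServerReboot < Resolver
  handles :server_down, :stuck_ebs
end

puts Resolver.lookup(:resource_down)
puts Resolver.lookup(:stuck_ebs)
```

Loading the resolver files is what populates the hash, so adding a new playbook means adding one file with a `handles` line.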
00:30:39.760 So, a resolver like this is a basic restart one, and it can handle the resource down state.
00:30:46.240 When the worker comes along, it’s going to attempt to initiate the resolution.
00:30:53.280 Here, it gets a lock on the resource in the database, so no other worker comes along and tries to do something.
00:31:00.480 If it sees that this is locked, it won't try to touch it.
00:31:06.720 The first thing it does, in this case, is try to restart it.
00:31:12.800 So, it turns out that on Amazon, if you're using Elastic Block Store (EBS) and your thing crashes, and if you restart it,
00:31:20.080 your thing—in this case, Postgres—is going to come up probably in the same availability zone.
00:31:25.920 But on a different machine, and if you've used AWS long enough, you notice that sometimes just restarting the thing,
00:31:31.840 and having it come up somewhere else, resolves all your problems.
00:31:38.440 So, the very first thing we do is often just restart it, and everything sorts itself out.
00:31:44.320 Later along the line, we'll call resolve, perform a new observation, get a new state, and we'll do the tick method.
00:31:51.200 This is actually an alias for the think method, and we’ll repeat the process.
00:31:57.440 We'll continue to check these things and transition to the right state along the state machine until we determine things are better.
00:32:04.720 Until we think they get resolved.
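Putting the restart resolver's steps together: lock, restart, observe again, then tick. This is an in-memory sketch with assumptions: the boolean lock stands in for the database lock mentioned in the talk (it is not safe across threads or processes), and `RestartableDb`, `try_lock`, `unlock`, and `healthy_after_restart` are hypothetical. `tick` as an alias for `think` follows the talk.

```ruby
class RestartableDb
  attr_reader :state, :restarts

  def initialize(healthy_after_restart:)
    @healthy_after_restart = healthy_after_restart
    @state = :down
    @restarts = 0
    @locked = false
    @last_available = false
  end

  # Non-blocking lock: returns false instead of waiting if it is already held.
  def try_lock
    return false if @locked

    @locked = true
  end

  def unlock
    @locked = false
  end

  def restart
    @restarts += 1
  end

  # Record a fresh observation (simulated here).
  def feel
    @last_available = @healthy_after_restart && @restarts.positive?
  end

  def think
    @state = @last_available ? :available : :down
  end
  alias tick think
end

class BasicRestart
  def resolve(resource)
    return :skipped unless resource.try_lock   # another worker is on it

    begin
      resource.restart
      resource.feel
      resource.tick
    ensure
      resource.unlock
    end
  end
end

db = RestartableDb.new(healthy_after_restart: true)
puts BasicRestart.new.resolve(db)   # available
db.try_lock
puts BasicRestart.new.resolve(db)   # skipped
```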
00:32:11.200 Here's another example of a resolver; this is for server down or for stuck EBS volumes with a production tier.
00:32:17.440 We might try to fail over if you have a high availability failover.
00:32:24.080 This will actually restart the entire server instead of the Postgres instance.
00:32:30.720 That's when the thing will move.
00:32:38.080 We do another thing like resolve or perform the observation: 'Is it available now? Does it still have a stuck EBS? What’s going on?'
00:32:44.480 So we’re doing good.
00:32:51.200 But even that’s not perfect, right? So, we’re left with the obvious question of what to do when even these resolutions fail.
00:32:58.240 Because they will.
00:33:05.760 So you saw there, when the resolver doesn't work, we have to escalate to a human.
00:33:12.080 We actually have to call somebody, which is the software equivalent of, 'Well, I tried!'
00:33:18.560 Eventually, this will all lead to Dumbledore there, which ultimately leads to PagerDuty.
00:33:25.440 Escalate to humans; we’ll wake somebody up, and then we’ll have to go and figure out why the resolvers didn’t work.
00:33:31.760 Why am I being woken up in the night? Not me, fortunately!
00:33:37.760 So, I have no idea how I'm on time, but I'm pretty close to the end.
00:33:44.160 Oh wow, that's like right on time; odd.
00:33:50.960 So yeah, in summary: the simplest thing that can work.
00:33:56.320 The simplest thing you can do that works, but no less than that.
00:34:03.680 State machines are fantastic for modeling complex behavior, complex states, and things that can go wrong.
00:34:09.600 Stateless workers are great because you don’t get too tied up in what you're doing.
00:34:17.440 You can quickly move to resolution, especially when things take time.
00:34:24.160 When you get big and you need to monitor things like that, you should expect things to break.
00:34:30.960 And have a good attitude about it; just remember: well yeah, things break.
00:34:37.920 This is the summary of my talk.
00:34:43.680 Welp! Well-driven presentations.
00:34:50.320 All right, that’s all I got.
00:34:57.200 Thank you!
00:35:00.000 That's all I've got.
Explore all talks recorded at Big Ruby 2014