00:00:25.599
Hi everyone, my name is Clint.
00:00:31.039
I like to start all my presentations with a very awkward photo of myself to kind of break the ice.
00:00:38.079
And my name is spelled out very large on the slide; that's mostly for vanity purposes, but also just to clarify that my name is not
00:00:45.920
Glenn, Quint, Client, Chad, Quinn, or anything that ends in 'nt'.
00:00:52.000
Do I have any other Clints in the audience? No? If there was another Clint, he would know exactly what I'm talking about.
00:00:58.719
Whenever I call and order things, I have to spell my name. My wife makes fun of me, but if I don't, I'm always called Quinn.
00:01:04.720
Or Client; I don't get that one either.
00:01:10.880
Maybe you've noticed that I'm not wearing boots, I’m not wearing sandals, and I don't have a cowboy hat, so I am not from Texas.
00:01:16.400
I come all the way from Missouri. Yay, Missouri!
00:01:22.400
That's actually the next part, right? So, is anyone from Missouri? Yes!
00:01:27.840
Hey, two people! That's twice as many as I was expecting.
00:01:33.360
When I got here, I was meeting other people, and I said, 'Yeah, I'm from Missouri.'
00:01:38.479
As one from Missouri does, people are like, 'Oh yeah, Missouri! St. Louis! God, I love St. Louis, it's great. You know they’ve got baseball and an awesome hockey team, and the Arch, right?'
00:01:44.000
But yeah, I'm not from St. Louis.
00:01:49.680
So they're like, 'Oh, Kansas City! The City of Fountains!' which I honestly don't even know if that's Kansas City. Google says it is.
00:01:56.159
Why it’s called the City of Fountains, I have no idea.
00:02:01.439
There are some fountains there apparently, and no, I'm not from Kansas City either.
00:02:07.759
So now I have their attention because their list of cities in Missouri is exhausted, and I tell them I'm from Columbia.
00:02:13.360
Of course, that means nothing to them. I don't know who you two are over there, but you probably know exactly what I'm talking about.
00:02:18.640
Because then the next comment is 'Oh, okay, where's Columbia?' And I say, 'Well, come on, Missouri, right?'
00:02:23.840
So we've got Kansas City on the west, St. Louis on the east, and right in between we've got Columbia!
00:02:29.120
Columbia is well known for two things: one, the University of Missouri, which had an awesome, fantastic football season better than any Texas team, for sure.
00:02:36.560
And, of course, we're known for being exactly in the middle of Kansas City and St. Louis.
00:02:41.760
I'm convinced we were founded back in the wagon days, probably about two days' travel outside of St. Louis.
00:02:47.440
And on the second day, you really don't want to sleep in your wagon again, so they had to build a roof or something.
00:02:53.040
So, yeah, I tell people I'm from Columbia.
00:02:59.040
That's how that goes.
00:03:05.840
I'm glad I got a few laughs there. If anybody does not find this slide hilarious, I'm really sorry you don't find this funny.
00:03:11.280
I laugh internally every time I see this; I don't know why.
00:03:17.520
But if you don't find this funny, then just hunker down because you're in for a rough ride.
00:03:22.640
This was just posted like an hour ago, and I know it's kind of cliché, but this is my first time presenting at a conference.
00:03:30.400
It happens to be on a Friday, so it'd be really cool if everybody would stand up.
00:03:37.680
I can take that funny little picture. Come on, no one's standing up? Oh God.
00:03:44.000
It's happening! Yay, all right! There we go, yay! Hugs, great.
00:03:50.120
I even got some woos! You know, the woo is not really native to Missouri, but that's okay.
00:03:55.440
I'll post that picture in a little bit, if I'm on my A-game, so go ahead and retweet it.
00:04:02.080
I'll be really famous after that and somehow profit.
00:04:08.639
Yeah, I work for a company named Heroku. If you’ve heard of us, you know how awesome we are.
00:04:14.240
If you have not heard of us, we do not make signs. We're actually a platform company.
00:04:19.840
In Missouri, that is really, really difficult to describe.
00:04:26.800
I’m a support engineer at Heroku, and a lot of people think, 'Oh, so you do support?'
00:04:34.080
Well, yes, I do, but at Heroku a lot of support engineering is just like you.
00:04:40.240
We're programming all day, but we have a mindset of taking all the support tickets we get and engineering ourselves out of support tickets.
00:04:47.760
Specifically, I work with the Heroku Postgres team where we are a database as a service for your favorite elephant-themed relational database.
00:04:55.440
The team itself is less than 10 people; we have hundreds of Amazon servers.
00:05:02.000
The last I checked, I guess I don’t know; we don’t really pay attention, but we have thousands of Postgres databases.
00:05:09.600
Just like the RTF team, we are internally referred to by an acronym - DOD, which stands for Department of Data.
00:05:17.240
And if you need to remember what that means, it just means we are way better than the RTA.
00:05:22.560
Oh, I ruined that joke! Is Richard even here? No? He's not even here for me to make fun of. There he is! Hi, Richard!
00:05:29.440
Sorry for poking fun at you.
00:05:34.240
Today, we're going to talk about the approach we take to managing lots and lots of databases in a talk I call Herding Elephants.
00:05:41.120
Get it? Elephant? You know, Postgres works out.
00:05:47.440
So, how Heroku uses Ruby to run the largest fleet of Postgres databases in the world.
00:05:53.440
The asterisk there means "probably the largest, at the time of writing."
00:05:58.800
That's probably true; it might not be true for the next few years – who knows?
00:06:03.680
Things change; it's not really a vanity metric we keep track of, but it sounds really cool on slides!
00:06:09.760
So, maybe you can tell I've never really spoken at a large conference.
00:06:17.200
When I got my little acceptance email... well, you're probably thinking now, 'Wow, what were they thinking with that?'
00:06:22.880
But I had this title, you know, this thinking of Herding Elephants. This is going to be great!
00:06:28.960
I thought I could come up with this whole talk that's all oriented around the Postgres elephant and elephants in general.
00:06:36.160
But as I've discovered talking to a couple of people here over the past few days:
00:06:43.680
This is not a PostgreSQL talk. Postgres is awesome, we love it, and it's great.
00:06:50.320
But no, I'm not actually going to talk about Postgres things, really.
00:06:55.680
We could be managing, I don't know, bots or something, whatever.
00:07:01.920
It's really more of an architecture talk, right? It's about managing a lot of things in the cloud.
00:07:07.680
Things that we refer to as fleets. Really, it's about managing fleets.
00:07:14.400
And who doesn't love Star Destroyers? Come on!
00:07:19.440
You rebel scum!
00:07:25.680
So, if you came here to hear important and cool things about managing Postgres, I’m sorry.
00:07:32.240
And I had this idea of herding elephants because it's thousands of Postgres databases.
00:07:39.440
I'd make this all elephant-and-herd-themed, but it's also not a talk about elephants.
00:07:46.080
If you came here for elephants, I’m sorry; I can't help you.
00:07:52.080
So, quick backstory: Heroku Postgres is actually not a core thing of the Heroku platform itself.
00:07:59.360
We're what's called an add-on. We exist in the add-on marketplace, which is a nice offering from Heroku.
00:08:06.400
It allows you to easily extend and add things to your applications, like New Relic, Redis, and Postgres.
00:08:12.720
You can attach these things as you will.
00:08:20.000
We were actually one of the first ones, which is cool. We kind of broke a lot of ground and a lot of things.
00:08:27.440
So, you can kind of think of it like this: you talk to Heroku, Heroku talks to us.
00:08:34.240
And we're kind of in our own little realm, even though we actually all work for Heroku.
00:08:40.960
We all sit there and eat the awesome lunches and stuff.
00:08:47.200
And all of our applications run on Heroku, so that’s pretty cool.
00:08:54.080
So, Heroku Postgres version zero, the very first thing, was just a single Sinatra application.
00:09:02.240
It used a library called Stem to speak to AWS. Can you read the orange?
00:09:08.160
Oh, so sorry! I'll just read the orange parts. That’s really disappointing; it looked great on my screen.
00:09:13.839
So, we used Stem to talk directly to AWS, and we used SQL to speak directly to the Postgres instance.
00:09:20.799
There weren't that many databases then, so this worked out great!
00:09:26.640
We just had one app and one server, and they talked to each other.
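Roughly, that version zero would have been shaped something like this. It's an illustrative sketch only, not the actual Heroku code; the Stem calls that booted and tracked the EC2 instance are left out, and the endpoint and parameter names are made up.

```ruby
# Illustrative sketch only (not the real Heroku code): one Sinatra app that
# provisions databases by speaking SQL directly to a Postgres server.
require "sinatra"
require "pg"

post "/resources" do
  conn = PG.connect(host: params["host"], user: "postgres", dbname: "postgres")
  begin
    conn.exec("CREATE DATABASE #{conn.quote_ident(params["name"])}")
  ensure
    conn.close
  end
  status 201
end
```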
00:09:32.960
This was all great when you didn't have a lot of databases. The goal, design model, or mantra we had was just the simplest thing that could possibly work.
00:09:40.000
But no less than that. This is a common theme in the DOD; it's something we strive for.
00:09:46.160
It's just something that's always in the back of our mind.
00:09:52.240
More things, more features mean more broken stuff.
00:09:58.400
Every line of code you write to do anything is something that will bite you at some point.
00:10:05.680
Or, if you move on, it'll bite someone else, and they won’t like you.
00:10:11.760
So, fast forward to now, the latest version has grown into a constellation of applications.
00:10:18.080
About five applications, still all using Sinatra. We're using Fog now to talk to AWS, and we still use SQL to talk to the Postgres instances themselves.
00:10:25.440
We've also grown to use background workers, underpinned wonderfully by Sidekiq and Queue Classic, which do the bulk of our work.
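As a rough illustration of what one of those background jobs looks like in Sidekiq terms (the class name, queue name, and the feel/think calls are assumptions for the sketch, not the production code):

```ruby
# Sketch of the kind of background job that does the heavy lifting.
require "sidekiq"

class CheckResourceWorker
  include Sidekiq::Worker
  sidekiq_options queue: :checks, retry: false   # fail fast; the ring comes back around

  def perform(resource_id)
    resource = Resource[resource_id]   # look up the record (ORM details assumed)
    resource.feel                      # observe the environment
    resource.think                     # act on what was observed
  end
end

# Enqueued from anywhere in the app:
#   CheckResourceWorker.perform_async(resource.id)
```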
00:10:32.160
Now it kind of looks like this: this is nothing groundbreaking, right? It's a constellation of applications.
00:10:38.560
They communicate over APIs, and there’s separation of concerns.
00:10:46.480
We have one that's just in charge of managing the production tier and one that's in charge of the starter tier.
00:10:52.280
We've got one that does your data clips, and you could probably count PG Backups in there too, which does backups and snapshots.
00:10:58.160
There's also an internal one used for administratively managing things.
00:11:04.080
Some of them talk to AWS, some of them don't, but we’ve grown and spread out like that.
00:11:11.040
Almost all of these applications are Sinatra; it's Sinatra running within Sinatra.
00:11:17.440
So it's expanded: we've got various middleware, and you've got all your different endpoints, which themselves are really just individual Sinatra applications.
00:11:23.760
Nesting Sinatra applications like this allows you to focus on and isolate specific things within a single application domain.
00:11:32.320
It makes the code easier to reason about as it's separated into different endpoints.
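For example, a config.ru can mount several small Sinatra apps side by side, each owning one slice of the API. The app names here are made up; this is just a sketch of the nesting idea.

```ruby
# config.ru -- sketch of mounting several Sinatra apps inside one application.
require "sinatra/base"

class ResourcesAPI < Sinatra::Base
  get("/:id") { "resource #{params[:id]}" }
end

class ServersAPI < Sinatra::Base
  get("/:id") { "server #{params[:id]}" }
end

# Rack's map isolates each endpoint into its own small application.
map("/resources") { run ResourcesAPI }
map("/servers")   { run ServersAPI }
```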
00:11:40.480
The apps themselves are divided into several processes.
00:11:47.280
Anyone familiar with Heroku probably recognizes this; it's a Procfile.
00:11:53.600
It's a way of taking a single application and defining individual processes contained in there.
00:12:00.160
This is a feature you can use to scale horizontally.
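A Procfile along these lines, with one web process and a handful of worker processes that each scale independently, is the kind of thing meant here. The process names and commands are illustrative, not the real ones.

```
# Procfile -- illustrative process types, not the real ones
web:             bundle exec puma -C config/puma.rb
check_worker:    bundle exec sidekiq -q checks
incident_worker: bundle exec sidekiq -q incidents
qc_worker:       bundle exec rake qc:work
```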
00:12:06.720
You can also see that we have a summary here.
00:12:13.040
The point is that the majority of these are workers.
00:12:20.080
Your background processes do most of the heavy lifting, while the front-end stuff is usually quick and doesn’t do much of the lifting.
00:12:28.080
We literally run hundreds of workers across the five applications we have.
00:12:35.200
I think we have about 50 distinct process types, and each application itself has maybe three or four web workers.
00:12:41.520
Each of them probably has 200-plus workers of various kinds; some individual queues or process types have over 200 workers on their own.
00:12:48.480
So we use workers a lot.
00:12:55.440
So, even while splitting this into an ecosystem of applications, it's still the simplest thing that could possibly work.
00:13:02.640
But no less than that. So, that's kind of the ecosystem,
00:13:09.200
or the architecture, the lay of the land, so to speak.
00:13:16.640
Now, on to managing databases.
00:13:23.920
So, like I said, we have this fleet, this great awesome fleet of things.
00:13:30.560
In order to successfully run a service like this, you have to be continuously monitoring them.
00:13:37.760
You have to keep watch of everything.
00:13:44.080
If you’re looking at this, you might be asking yourself what's wrong with this picture.
00:13:51.040
It should jump out pretty quickly that it's this guy; we've got this whole fleet doing stuff.
00:13:57.280
Then we got this random one going the wrong direction.
00:14:02.160
What’s this guy doing? I mean, all these ships are coming this way; this is dangerous.
00:14:09.680
This is no good; it's going to run into somebody!
00:14:15.520
So you need to be able to monitor the fleet and identify and spot this guy.
00:14:22.080
Find out what's going on.
00:14:28.480
When you manage and monitor a lot of things, you tend to expect them to go wrong.
00:14:34.960
Things do, and you have to have this attitude about it.
00:14:40.880
But whatever this guy is doing, you need to be able to expect that this is going to happen.
00:14:48.640
Yeah, he is doing his own thing, probably causing trouble.
00:14:54.000
When you see things like this from a service point of view, someone is probably having a bad time.
00:14:59.520
That represents someone's servers or someone's database that’s gone astray.
00:15:05.680
If you don't keep your eye on these things, if you don't monitor them, things will go wrong.
00:15:11.840
People will open support tickets stating, 'Things go wrong! My database is down,' and you can tell that people are mad.
00:15:18.080
So how do you do that? You've got thousands of these things to monitor, both at the server level.
00:15:25.760
And how do you monitor them at the resource level? I'll get to what a resource is in a minute.
00:15:32.360
Your first thought might be: well, we'll install monitoring software on the images that we're using.
00:15:37.919
That’s not the approach we took; the approach we’ve taken is kind of outside-in.
00:15:44.160
So workers connect with SSH and they collect information about the environment.
00:15:50.640
The servers themselves are actually very dumb; we try to keep them dumb.
00:15:58.400
They only have the base OS, which is Ubuntu 12.04 or whatever the latest long-term support release is.
00:16:03.680
Postgres 9-plus; we finally killed off all the Postgres 8s. We had those until about six months ago.
00:16:10.560
That was a pain. There's also this thing called WAL-E.
00:16:18.080
WAL-E was something developed internally for shipping our write-ahead logs, which are a feature of PostgreSQL.
00:16:24.480
We ship that off-site; that's part of our durability.
00:16:30.000
All that stuff. But WAL-E is written in Python.
00:16:35.280
So, if you came here for the WAL-E story, we're not going to talk about that.
00:16:41.920
Again, it's outside-in: information is gathered by the workers.
00:16:48.320
It's used to make an observation and determine the state, and then we decide the action to take.
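Conceptually, an outside-in check is just a short SSH session that asks a few questions and hangs up. Here is a minimal sketch using the net-ssh gem; the gem choice, the user, and the exact commands run are assumptions.

```ruby
# Sketch of an outside-in check: the worker SSHes in, asks a few questions,
# and hangs up. Nothing extra runs on the server itself.
require "net/ssh"

def observe_server(host)
  facts = {}
  Net::SSH.start(host, "ubuntu", timeout: 5) do |ssh|
    facts[:uptime]      = ssh.exec!("uptime").strip
    facts[:disk]        = ssh.exec!("df -h / | tail -1").strip
    facts[:postgres_up] = ssh.exec!("pg_isready -q; echo $?").strip == "0"
  end
  facts
rescue Net::SSH::Exception, Errno::ECONNREFUSED, SocketError
  nil   # unreachable right now; the next worker in the ring will try again
end
```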
00:16:54.720
The primary things we observe here are resources; these represent the databases.
00:17:02.080
The information collected is things like the database name, port, created-at timestamp, and various database type information.
00:17:09.760
Then we have servers, which represent the physical things on AWS, or virtual things on AWS.
00:17:16.320
You've got IPs, instance ID, what availability zone it's in, how long it’s been up, and that kind of fun stuff.
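In data terms, the two records look roughly like this. Plain structs are used here just to show the shape; the real schema and ORM aren't covered in the talk.

```ruby
# Rough shape of the two things being observed (illustrative only).
Resource = Struct.new(
  :id,                # UUID
  :database,          # database name
  :port,
  :created_at,
  :plan,              # database "type" / tier information
  :state              # creating, available, uncertain, ...
)

Server = Struct.new(
  :id,                # UUID
  :instance_id,       # AWS instance id
  :ip,
  :availability_zone,
  :launched_at,       # how long it has been up
  :state
)
```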
00:17:24.080
We need to monitor all of these things all the time.
00:17:31.440
To do this, we use two things which are awesome: state machines and stateless workers.
00:17:37.360
I'll explain a little bit more on that; you probably know what a state machine is.
00:17:42.560
The history there is rooted in game programming.
00:17:48.320
Peter van Hardenberg is one of the founders of Heroku Postgres, and his background was in game development.
00:17:55.200
So when it came to this kind of monitoring idea, he naturally thought of gaming.
00:18:01.120
Where you have this constant loop of observing your environment and taking action.
00:18:07.280
Am I on fire? What should I do about that? Am I being attacked by a goblin? What should I do about that?
00:18:13.760
Am I sleeping in a tent? Great!
00:18:20.080
You know, it’s like we connect to a server, and we talk to it; we say hello.
00:18:26.880
The server says hi, all right, well that's established, we can connect to the server and make progress.
00:18:33.960
Then we say things like, 'All right, SELECT 1,' to Postgres, and it says, 'Oh, 1.' You're like, 'Great!'
00:18:41.200
Now the server is not only up, but Postgres is running.
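That whole conversation boils down to a couple of lines. A sketch with the pg gem, with the connection details assumed:

```ruby
# "Hello? ... SELECT 1? ... Great." -- a sketch of the liveness check.
require "pg"

def postgres_up?(host, port)
  conn = PG.connect(host: host, port: port, user: "postgres",
                    dbname: "postgres", connect_timeout: 5)
  conn.exec("SELECT 1")   # the server answered AND Postgres is running
  true
rescue PG::Error
  false
ensure
  conn&.close
end
```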
00:18:47.520
We do this all the time, forever.
00:18:54.080
Right? We’ve got thousands of resources, thousands of things that need to be checked.
00:19:00.160
And every one of them gets checked at least once a minute.
00:19:06.480
So, yeah, you connect to another server.
00:19:14.080
It has Postgres installed; hi, hello, select one. Great!
00:19:20.320
Then you connect to yet another server; hello, hi, select one. Great!
00:19:26.240
All these workers are going around, feeling their environment, thinking about things, and maybe doing stuff.
00:19:32.560
And we need to do this all the time, forever.
00:19:38.720
So to do this, we use a queue but treat it like a ring.
00:19:46.240
We want a worker to grab a database off the top of the queue, you know, shift it off.
00:19:52.960
We want to feel, which is a method name, and it sounds kind of odd when I stand up here.
00:19:59.680
But we want to feel its state, and then we tell it to think, where we take action.
00:20:05.120
Once we're done with that, we push it back onto the queue.
00:20:11.760
We don't linger here; these steps are usually pretty quick, and we need to do this all the time.
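The ring itself can be as simple as shifting off one end and pushing back onto the other. In production this is a persistent queue worked by many processes; the in-memory loop below, with its assumed queue object and feel/think methods, is just a sketch to make the shape obvious.

```ruby
# The "ring": shift a resource off the front, feel, think, push it back on.
loop do
  resource = queue.shift          # grab the next database to look at
  if resource
    begin
      resource.feel               # observe: connect, ask questions, record it
      resource.think              # act: evaluate the block for the current state
    ensure
      queue.push(resource)        # back onto the end of the queue, forever
    end
  else
    sleep 0.1                     # nothing to check right now
  end
end
```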
00:20:17.920
We use state machines to help us out.
00:20:24.000
When you're creating a server or making a new database, we have different states that these things can be in.
00:20:29.920
There's the creating state, the happy "up" state, the "maybe" state, the "whoops" state, and destroying.
00:20:36.000
We have these workers, and they need to go around and find out what state these things are in.
00:20:43.200
They need to feel their environment and determine, 'Am I up? Great! Let’s keep on going!'
00:20:50.960
'Am I maybe up? Well, that was my last state, so maybe I'm up now; maybe I'm back; I don’t know!'
00:20:57.200
My last one was not so great; maybe I'm down.
00:21:04.000
So feeling is when we go and connect to a server and observe the environment.
00:21:10.240
We have this Resource class, which is obviously abbreviated here.
00:21:17.680
We have this class called Feeler, and the feelers collect information about the system.
00:21:22.880
So when we grab the resource, we say feel.
00:21:28.080
All we do is create a new observation with that feeler and grab the current environment.
00:21:34.240
We record that in the observations table, which is an append-only type table.
00:21:40.480
We don’t update an observation; we create a new record each time; hence, we have a history.
00:21:47.760
The observations are very simple; they have an ID, when they were created, attributes, and foreign UUIDs pointing to the resource or server.
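Put together, feel might look something like this. It's a sketch: the Feeler internals and the ORM calls are assumed, but the append-only observation row matches what's described above.

```ruby
# Sketch of "feel": observe the environment and append a new observation row.
class Resource
  def feel
    feeler = Feeler.new(self)               # knows how to SSH in, run SELECT 1, etc.
    Observation.create(
      resource_id: id,                      # foreign UUID back to this resource
      created_at:  Time.now.utc,
      attributes:  feeler.current_environment   # e.g. { available: true, ... }
    )                                       # append-only: never updated, so history is kept
  end

  def last_observation
    Observation.where(resource_id: id).order(:created_at).last
  end
end
```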
00:21:55.040
Once we record that, we move on to the next step, which is thinking.
00:22:02.480
We consider the last observation we made.
00:22:08.960
What do we do? We include this thing called... I thought I switched to new slides.
00:22:15.040
So resources have these things called states; state is a method that comes from the Stateful module.
00:22:22.320
When the resource itself loads, we execute this method, which has a name and a block of code.
00:22:28.400
We end up creating this map of things.
00:22:35.680
Here's the Stateful module summarized: the state method takes a name and a default of nil, which I have no idea what it does.
00:22:42.000
Then we create like a map of names to blocks of code.
00:22:49.440
The uncertain one gets this block of code, and the available one gets that block, so on and so forth.
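Boiled down, the module described here is just a class-level DSL that maps state names to blocks. A sketch of the idea, not the actual Heroku module:

```ruby
# Sketch of the Stateful idea: `state :name do ... end` at class-load time
# builds a hash mapping state names to blocks of code.
module Stateful
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    def states
      @states ||= {}
    end

    # e.g. state :available do ... end
    def state(name, default = nil, &block)
      states[name] = block
    end
  end
end
```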
00:22:57.440
And here we get the think method.
00:23:03.760
After we've observed, we now say, hey, evaluate the state that we're currently in.
00:23:10.560
So look at this code and evaluate it; do this thing.
00:23:16.880
So if we're available, what do we do?
00:23:24.080
Well, if the last observation we made said the service was not available, we transition to the uncertain state.
00:23:31.600
If it didn't say that, well we just move on; hooray!
00:23:38.000
Likewise, if it's uncertain and the observation now says it's available, we'll go back to available and get on with our lives.
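So the think step, plus those two state blocks, might read like this. It builds on the Stateful sketch above; transition, last_observation, and Incident.trigger are assumed helpers, not the real code.

```ruby
# Sketch: think evaluates the block registered for whatever state we're in.
class Resource
  include Stateful

  state :available do
    # the last observation says we couldn't reach it? get suspicious
    transition(:uncertain) unless last_observation.available?
  end

  state :uncertain do
    if last_observation.available?
      transition(:available)                           # it's back; get on with our lives
    else
      Incident.trigger(:resource_down, resource: self) # still bad; open an incident
    end
  end

  def think
    block = self.class.states.fetch(state.to_sym)      # `state` is the persisted state name
    instance_eval(&block)                              # run the code for that state
  end
  alias tick think                                     # `tick` is an alias for think

  def transition(new_state)
    update(state: new_state.to_s)                      # persist the new state (ORM call assumed)
  end
end
```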
00:23:44.800
So pulling something off of the queue, feeling its environment, thinking about it, pushing it back onto the queue.
00:23:51.440
We need to do this all the time, forever.
00:23:57.440
So, state machines, stateless workers. The workers don't know much about the state.
00:24:05.360
They don't want to track the state because we don't want to tie up a worker to a resource.
00:24:11.680
We want to be able to quickly just grab it, do something really quickly, and move on.
00:24:18.080
We don't want a worker to have too much of an important relationship with what it’s doing.
00:24:24.320
Workers go down; all sorts of different things happen there.
00:24:31.440
So workers are constantly going through the queue; they’re the ones that do all the heavy lifting.
00:24:37.440
They’re the ones who do the things that take time.
00:24:43.680
The stateless workers are the ones who talk to AWS; they’re the ones who talk to the Heroku API.
00:24:50.160
To synchronize information or get commands or whatever they need to do.
00:24:56.960
And they’re the ones that connect to databases and talk directly to Postgres.
00:25:03.760
These, as far as computer terms go, are the things that take time.
00:25:10.560
We need to offload them to background workers because all of these things require networks.
00:25:16.720
In a giant cloud, even a great one like Amazon, all of them can fail, and they all do fail all the time.
00:25:22.720
The great aspect of this design is the idea of quickly getting in, doing this, and getting out.
00:25:29.440
Feeling, thinking, and moving on; part of this is, if for some reason a worker can’t connect to that service,
00:25:35.360
it just immediately moves on to the next thing.
00:25:42.240
But due to the way Amazon's network works, sometimes that's just a little glitch,
00:25:48.480
and the next worker is going to pop up in some other place in an entirely different availability zone.
00:25:55.760
It might have no problem at all connecting.
00:26:02.960
So we've kind of avoided a false-positive situation and worked around network partitions or various other things that could go wrong.
00:26:10.080
I’m tired of you already. Sorry.
00:26:15.280
I was supposed to tie that in with the last thing: what to do when things fail.
00:26:20.560
When we've figured out that it's not a network issue and the server is down.
00:26:28.240
So, when that last observation said that the service was not available, we need to create an incident.
00:26:35.680
It's a certain type of incident; we have a lot of these.
00:26:42.240
Incidents occur when things go wrong - as they will.
00:26:48.560
As I've said, on the cloud and at scale, strange things happen.
00:26:55.120
I was talking to Tanner about how, as you scale, edge cases will remain edge cases.
00:27:01.840
But they simply become more frequent; you're doing things so many times that they are no longer bizarre.
00:27:08.080
They're just kind of strange, and there are many different types of incidents.
00:27:14.160
So we could have a resource down, stalls, failed followers, stuck mounted drives, critical servers down, or duplicates.
00:27:21.520
In order to address all of these things, you naturally start to develop playbooks.
00:27:27.760
Things that engineers can read and use to solve problems.
00:27:34.160
That way, the solutions to these things aren't tied in an individual's head.
00:27:40.720
Once you start codifying and cataloging these things, you can create yet another state machine.
00:27:46.560
Incidents have their own state machine: you can have triggered, resolving, waiting, needing a human, and resolved.
00:27:52.720
So we bounce back and forth here using the stateful module to do all of this.
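Using the same module, the incident life cycle can be modeled the same way. The state names come from the talk; the block bodies and helpers (resolver_for, waited_too_long?, page_on_call!) are assumptions for the sketch.

```ruby
# Sketch: incidents get their own state machine, built with the same Stateful idea.
class Incident
  include Stateful

  state :triggered do
    transition(:resolving)                    # a worker has picked it up
  end

  state :resolving do
    resolver_for(type).attempt_resolution     # run the codified playbook (resolvers below)
    transition(:waiting)
  end

  state :waiting do
    if resource.last_observation.available?
      transition(:resolved)                   # all better now
    elsif waited_too_long?
      transition(:needs_human)                # the software equivalent of "well, I tried"
    end
  end

  state :needs_human do
    page_on_call!                             # PagerDuty: wake somebody up
  end

  state :resolved do
    # nothing left to do; it drops out of the incident ring
  end
end
```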
00:27:59.280
We utilize state machines and stateless workers.
00:28:06.720
State machines, stateful modules, take that home with you.
00:28:13.600
We have yet another queue—a ring of incidents!
00:28:20.080
The workers go along, and we don't feel at this point because we know things are wrong.
00:28:27.680
We just need to start taking action!
00:28:33.600
We pop something off the queue; we need to take some action, and then we push it back onto the queue.
00:28:40.000
Maybe that action will actually resolve things.
00:28:46.960
This worker won't know; it will just execute the code, transition to the state it needs to, and move on.
00:28:54.160
Then the next worker picks it up and says, 'Hey, you're all better now!'
00:29:01.440
So we need to do this all the time, a lot, all the time.
00:29:07.760
So again, with our wonderful stateful module, we have incidents.
00:29:15.120
All of these incidents are going to attempt this resolution.
00:29:22.080
And if the resolution doesn't immediately work, we will open a ticket to the customer.
00:29:29.920
If we get an error somewhere in there, we escalate to a human.
00:29:36.880
We’ll get to that in a minute.
00:29:43.040
It's the same with waiting for resolution; we have one of these state blocks for basically everything.
00:29:49.040
We usually wrap these explicitly in begin/rescue statements because it's common for these things to just completely bomb out.
00:29:54.880
If something is just unreachable, we deal with it.
00:30:01.680
If all else fails, you know, escalate to a human.
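Inside a state block, that wrapping looks roughly like this; escalate_to_human is an assumed helper that ends up paging whoever is on call.

```ruby
# The same waiting state as above, now with the defensive wrapping: these
# network calls bomb out all the time, so rescue and, as a last resort,
# hand the incident to a human.
class Incident
  include Stateful

  state :waiting do
    begin
      resource.feel                                 # take a fresh observation, if we can
      transition(:resolved) if resource.last_observation.available?
    rescue StandardError => e
      escalate_to_human(error: e)                   # assumed helper; ends up paging on-call
      transition(:needs_human)
    end
  end
end
```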
00:30:08.640
Actually, looking at a different part of the file, we have all these types here; it’s an array.
00:30:14.560
We load all these resolvers; these resolvers are files codified from our playbooks.
00:30:20.160
We've actually written it into code how to do these things.
00:30:26.800
So we load up all these files and create an in-memory hash of them, mapping the type of incident to the resolver that can handle it.
00:30:33.360
We do that by calling the handles method.
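The registry part might look like this. The file layout and class names are assumptions; the handles method is the one mentioned in the talk.

```ruby
# Sketch: load every resolver file and build an in-memory hash mapping
# incident type to the resolver class that can handle it.
require_relative "resolvers/restart_resolver"
require_relative "resolvers/server_down_resolver"

RESOLVER_CLASSES = [RestartResolver, ServerDownResolver]

RESOLVERS = RESOLVER_CLASSES.each_with_object({}) do |klass, map|
  klass.handles.each { |incident_type| map[incident_type] = klass }
end

# A worker then picks the playbook for an incident like so:
#   RESOLVERS.fetch(incident.type).new(incident).attempt_resolution
```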
00:30:39.760
So, here's a resolver; this is a basic restart one, and it can handle the resource-down incident type.
00:30:46.240
When the worker comes along, it’s going to attempt to initiate the resolution.
00:30:53.280
Here, it gets a lock on the resource in the database, so no other worker comes along and tries to do something.
00:31:00.480
If it sees that this is locked, it won't try to touch it.
00:31:06.720
The first thing it does, in this case, is try to restart it.
00:31:12.800
So, it turns out that on Amazon, if you're using Elastic Block Store and your thing crashes, when you restart it,
00:31:20.080
your thing—in this case, Postgres—is going to come up probably in the same availability zone.
00:31:25.920
But on a different machine, and if you've used AWS long enough, you notice that sometimes just restarting the thing,
00:31:31.840
and having it come up somewhere else, resolves all your problems.
00:31:38.440
So, the very first thing we do is often just restart it, and everything sorts itself out.
00:31:44.320
Later along the line, we'll call resolve, perform a new observation, get a new state, and we'll do the tick method.
00:31:51.200
This is actually an alias for the think method, and we’ll repeat the process.
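Putting the basic restart resolver together, it might read like this; the locking and restart helpers are assumed names, not the real ones.

```ruby
# Sketch of the basic "just restart it" resolver described above.
class RestartResolver
  def self.handles
    [:resource_down]                 # the incident types this playbook covers
  end

  def initialize(incident)
    @incident = incident
    @resource = incident.resource
  end

  def attempt_resolution
    @resource.with_lock do           # assumed: lock the row so no other worker touches it
      @resource.restart_postgres     # often it comes back somewhere healthier and that's that
    end
  end

  def resolve
    @resource.feel                   # take a fresh observation
    @resource.tick                   # tick (alias for think) transitions based on what we saw
  end
end
```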
00:31:57.440
We'll continue to check these things and transition to the right state along the state machine until we determine things are better.
00:32:04.720
Until we think they get resolved.
00:32:11.200
Here's another example of a resolver; this one is for a server down or for stuck EBS volumes on the production tier.
00:32:17.440
We might try a failover, if you have high availability set up.
00:32:24.080
This will actually restart the entire server instead of the Postgres instance.
00:32:30.720
That's when the thing will move.
00:32:38.080
We do the same thing again: resolve, perform the observation. Is it available now? Does it still have a stuck EBS volume? What's going on?
00:32:44.480
So we’re doing good.
00:32:51.200
But even that’s not perfect, right? So, we’re left with the obvious question of what to do when even these resolutions fail.
00:32:58.240
Because they will.
00:33:05.760
So you saw there, when the resolver doesn't work, we have to escalate to a human.
00:33:12.080
We actually have to call somebody, which is the software equivalent of, 'Well, I tried!'
00:33:18.560
Eventually, this all leads to Dumbledore there, which ultimately leads to PagerDuty.
00:33:25.440
Escalate to humans; we’ll wake somebody up, and then we’ll have to go and figure out why the resolvers didn’t work.
00:33:31.760
Why am I being woken up in the night? Not me, fortunately!
00:33:37.760
So, I have no idea how I'm on time, but I'm pretty close to the end.
00:33:44.160
Oh wow, that's like right on time; odd.
00:33:50.960
So yeah, in summary: the simplest thing that can work.
00:33:56.320
The simplest thing you can do that works, but no less than that.
00:34:03.680
State machines are fantastic for modeling complex behavior, complex states, and things that can go wrong.
00:34:09.600
Stateless workers are great because you don’t get too tied up in what you're doing.
00:34:17.440
You can quickly move to resolution, especially when things take time.
00:34:24.160
When you get big and you need to monitor things like that, you should expect things to break.
00:34:30.960
And have a good attitude about it; just remember: well yeah, things break.
00:34:37.920
This is the summary of my talk.
00:34:43.680
Welp! Well-driven presentations.
00:34:50.320
All right, that’s all I got.
00:34:57.200
Thank you!
00:35:00.000
That's all I've got.