GORUCO 2018: After Death by Sam Phippen

00:00:14.570 Hello, everyone! You get, like, five seconds of hot, pulsing techno music when you come up on the stage. It's great! My name is Sam Phippen, and this talk is titled 'After Death.' I'm a member of the RSpec core team and an architect and interim manager at DigitalOcean. Let's get started.
00:00:29.849 If you've worked in software engineering for any amount of time, you'll have heard tales of issues that look like this: a service chugs along and along, and then, after you do a deploy, it gives up the ghost. Earlier today, not one hour ago, this status page went up at DigitalOcean, and there was a non-zero likelihood that I would be doing incident response right now instead of being on stage, which is fun. It happens. The things we build fail; it's inevitable. The systems we work on aren't perfect, and neither are the people. If you hold a pager for any production system, I'm sure you know the feeling of sadness and frustration when facing events like these.
00:01:13.830 You may feel anger at being woken up at 3 o'clock in the morning, questioning why your team didn't care enough to avoid that bad deploy, or wondering why your company is not protecting you from failing infrastructure. What on earth has gone wrong? As I mentioned, I'm a manager now, and I have a covenant with my people that this will be as rare as possible: I will empower them to prevent incidents as often as we possibly can. The thing about DigitalOcean is that it's not a Rails monolith; it's a big, complicated services world. I can tell you from experience that those services contribute just as much to downtime as any other factor in our infrastructure.
00:01:54.479 This happens; there's really nothing we can do about it, no matter how hard we try. Computers will break; the disks on your servers will fill up; a human will write code, deploy it to production, and in the worst-case scenario the hardware might literally catch fire. So what can we do about this? What tools do we have at our disposal to ensure that, as often as possible, we are not affected by these incidents and that we're improving? Many of us, when these things happen, conduct a post-incident review, a learning review, or something similar. At DigitalOcean, we call them post-mortems, which I appreciate.
00:02:51.420 Conducting these reviews gives us a literal system for learning from the failures we encounter. With that system, we're able to identify and resolve underlying causes. We're also better equipped to work with our organizations to understand the risks we might be accepting when building our systems. Everyone here who can deploy to their production environment accepts a risk every time they do so. Although we are all asked to ship features quickly, it’s important to remember that if your product isn't available, you can’t have any customers.
00:03:13.050 By no means do I believe I have all the answers. I used to be terrible at this, but working on a cloud-scale system for a couple of years has taught me a lot. In fact, we only started getting good at this about a year ago. So I'm just going to share some insights here. If you disagree, feel free to chat with me afterward; I love discussing this stuff!
00:03:50.490 To illustrate this, I'd like to dig into how we conduct post-mortems at DigitalOcean, to give you an understanding of how I think about the process. I'll walk you through DigitalOcean's post-mortem template. Every time we have a severe production incident, one of these gets created and an engineer starts filling it out. We give the incident a name, note the date it occurred, and assign a severity level. Severity ratings are useful because they guide our response: they tell us what kind of response is necessary and who should be involved.
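To make the shape of that template concrete, here is a minimal sketch of its header fields as a Ruby structure. The field names and the sample incident are my own illustration, not DigitalOcean's actual document.

```ruby
require "date"

# A hypothetical representation of the post-mortem header described above:
# a name, the date it occurred, and a severity. Field names are illustrative.
PostMortem = Struct.new(:name, :occurred_on, :severity, :timeline, :root_causes, keyword_init: true)

post_mortem = PostMortem.new(
  name:        "Example: droplet create API returning errors", # hypothetical incident name
  occurred_on: Date.new(2018, 6, 16),
  severity:    1,   # rated on the five-point scale described next
  timeline:    [],  # hard facts only, filled in during the review
  root_causes: []
)
```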
00:04:31.360 At DigitalOcean, we maintain a five-point scale that starts at 'Severity Zero.' A Severity Zero incident means that there is a critical impact on the business, indicating that if engineers don’t work quickly to fix it, everything could potentially fail. These severe events are incredibly rare; we’ve had two since we defined this scale. In a Severity Zero incident, every engineer in the company, multiple directors, and infrastructure staff coordinate a response, essentially banding together to resolve the problem.
00:05:17.040 Severity One denotes a major global outage or an entire product not functioning. If a data center goes down, that's rated as a Severity One. At this level we also involve executives, and there are usually multiple teams coordinating, but we do not believe the business will end if we don't act immediately. Severity Two is the lowest severity that wakes people up in the middle of the night; it usually applies when a single product has stopped working or some other severe issue has arisen, but nothing severe enough to alarm everyone. Typically, one engineer handles it while coordinating with our support and communications teams.
00:06:07.790 Severity Three and Four are just bugs and defects that go into JIRA to be addressed at a future date. Hopefully, your organization has a mature Incident Response practice; if not, having a severity scale is a great place to start. It allows you to communicate effectively with your company about incident classes, assess their severity, and outline needed procedures.
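As a rough sketch of how that scale could be encoded, assuming only the descriptions above (the exact wording and routing rules are DigitalOcean-internal, so treat this as illustrative):

```ruby
# Severity levels as described in the talk; meanings and response notes
# are paraphrased, not a real runbook.
SEVERITIES = {
  0 => { meaning: "critical impact on the business",            response: "every engineer, multiple directors, infrastructure staff" },
  1 => { meaning: "major global outage or entire product down", response: "executives looped in, multiple teams coordinating" },
  2 => { meaning: "single product down or other severe issue",  response: "one engineer plus support and communications" },
  3 => { meaning: "bug or defect",                              response: "ticket in JIRA for a future date" },
  4 => { meaning: "minor bug or defect",                        response: "ticket in JIRA for a future date" }
}.freeze

# Severity Two is the lowest level that wakes someone up in the middle of the night.
def wakes_someone_up?(severity)
  severity <= 2
end
```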
00:06:58.790 The next focus in our post-mortem is the timeline, which is one of the most critical parts of these documents. It's where we record everything that happened, from the very beginning of the incident to the very end, including everyone involved, the systems that went wrong, and the relevant logging and monitoring output. Certain rules apply: timelines must consist of hard facts; they should not include analysis or emotion. Actions taken are valid, but feelings about those actions are not. This provides a clear understanding of what occurred.
00:07:38.410 A tip: complete your timeline as close to the incident as possible, since logs rotate out, graphs expire, and memories get fuzzy. Depending on the incident's severity, we usually finish it either as soon as the response concludes or the next morning, to make sure we retain accurate details. Document everything, no matter how minor it may seem; small details can become critical to later analysis. The timeline is a genuinely valuable tool.
00:08:08.390 This is what a typical incident timeline looks like: a list of timestamped entries. We use UTC, although others may prefer their local timezone; that's your choice. The timeline includes events such as a deployment that introduced a bug. An important thing to remember is that your incident starts before you notice it: a deployment goes out, a server fails, or a database gets corrupted, and at that moment no human has noticed. So when we think about incident response, we should document when the incident actually started, separately from when it came to our attention and when the response began.
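A timeline entry could be captured along these lines; the fields and the deploy link are illustrative, and the timestamp is in UTC as described above.

```ruby
# One illustrative timeline entry: a UTC timestamp, a factual statement,
# and a permanent link to evidence. No analysis, no emotion.
TimelineEntry = Struct.new(:at, :fact, :evidence, keyword_init: true)

timeline = [
  TimelineEntry.new(
    at:       Time.utc(2018, 6, 16, 2, 41, 0),
    fact:     "Deploy shipped to production, introducing the bug", # the incident starts here, before anyone notices
    evidence: "https://example.com/deploys/1234" # hypothetical link; capture evidence before logs and graphs expire
  )
]
```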
00:08:41.559 Another entry could record that our monitoring systems detected something going wrong. You should include evidence alongside each timeline entry, whether that's a screenshot of a graph, logs, a link to a Slack channel, or a link to a page. Remember that log and graph systems expire their data; make sure you capture their output permanently. Effective observability is crucial: not only does it aid incident response, it's also a good practice that improves your systems in general.
00:09:12.789 An additional line might state that I, Sam, was paged for our internal tools team, with a link to the incident in PagerDuty, which is an excellent tool for alerting engineers. Documenting the time when a human first gets involved is important, because that marks the earliest opportunity to address the issue. So it's key to note not just when the incident started and when we first saw it, but also when the response began.
00:09:32.410 Another important factor is who alerted the person responding. At DigitalOcean, we are fortunate to have a 24/7 support and operations team capable of waking us up for incidents and performing technical investigations. If you don't have that, you need some machine alerting. Typically, we find that when a machine alerts us, our incident closure time is much shorter than if a human alerts us. Thus, having automated alerts is a healthy operations practice and something you can implement once you have observability established within your application.
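As a minimal sketch of what machine alerting means here, assuming you already have a metrics source and a paging provider (both stubbed out below as placeholders):

```ruby
# Check an error-rate metric and page a human when it crosses a threshold.
# `fetch_error_rate` and `page!` are placeholders for your own metrics and
# paging systems; the threshold is an arbitrary example value.
ERROR_RATE_THRESHOLD = 0.05 # page when more than 5% of requests are failing

def fetch_error_rate
  # In reality: query your metrics system for errors / requests over the last few minutes.
  rand * 0.1 # stand-in value so this sketch runs on its own
end

def page!(message)
  puts "PAGE: #{message}" # in reality: trigger your paging provider here
end

rate = fetch_error_rate
page!("Error rate #{(rate * 100).round(1)}% exceeds threshold") if rate > ERROR_RATE_THRESHOLD
```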
00:10:31.130 Next, we document when a person acknowledges the page. This is crucial because the time it takes for someone to wake up and respond can vary significantly, especially if they were asleep. If you're paging someone at 3 AM, it might take 40 to 50 minutes before they even realize what's happening. We also document multiple responders in our timelines: when several people are involved in the response, we track when each person was pulled in and what they were doing.
00:11:36.640 There's a valuable pro tip I want to share: if you're doing incident response, make sure your teammates know what you're working on. I recommend that primary incident responders update Slack at least once every five minutes. This practice isn't about micromanagement; it serves as a heartbeat confirming that they're continuously working through the incident, even if they don't have a status change at that moment. It's a useful and healthy practice.
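One way to make that heartbeat habit cheap is a tiny helper that posts to a Slack incoming webhook. The URL below is a placeholder, and this assumes Slack's standard incoming-webhook format of a JSON body with a "text" field.

```ruby
require "net/http"
require "json"
require "uri"

# Hypothetical incoming-webhook URL; replace with your incident channel's webhook.
INCIDENT_WEBHOOK = URI("https://hooks.slack.com/services/REPLACE/ME/TOKEN")

# Post one status line to the incident channel.
def post_incident_update(text)
  Net::HTTP.post(INCIDENT_WEBHOOK, { text: text }.to_json, "Content-Type" => "application/json")
end

# During a response, call this at least every five minutes, even if the
# update is just "still digging, no change yet".
post_incident_update("14:05 UTC: still investigating elevated API errors; no status change yet.")
```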
00:12:09.240 Finally, we document the fixes that were issued, with links to GitHub or any other relevant resources, accompanied by graphs showing resolution. Here we're asking: how did we fix it, and can we confirm that the issue is actually resolved? That rounds out the timeline. It might seem straightforward, but accurate documentation is exceptionally valuable: a continuous log of incidents lets engineers discern patterns in what went wrong, why it happened, and how it was fixed.
00:12:40.879 Let's move on to the next section, the most crucial part of any post-mortem or learning review: the root cause section. One thing to remember about root causes is that the term can be misleading, because it implies there is a single cause behind the incident. Many people, myself included in the past, fill out this section by identifying only what seems to be the most direct cause. A past example of mine at DigitalOcean listed a deployment error as the cause, which doesn't really help solve the problem.
00:13:18.310 To approach root cause analysis effectively, we must dig deeper into incidents. For instance, in one incident there was an Atlantis deployment involving a significant scaling factor, where MySQL queries were not optimized. These proximate causes sit close to the incident, but they are not the real root causes. What we need to do is analyze the factors behind them. What led to the database not having the appropriate index? Why was an outdated worker version in use? These questions point to a lack of testing, unrealistic testing environments, and the absence of a staging environment that properly reflects production.
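For context, the proximate fix for something like a missing index is usually a one-line change, for example an ActiveRecord migration like the sketch below (written for a Rails app; the table and column names are hypothetical). The point of this section is that the root-cause work is asking why the index was missing and why no test or staging run ever surfaced the slow query.

```ruby
# Hypothetical proximate fix: add the missing index. The real post-mortem
# work is about the process failures that let this ship without one.
class AddIndexToEventsOnDropletId < ActiveRecord::Migration[5.2]
  def change
    add_index :events, :droplet_id
  end
end
```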
00:14:40.790 There is a common issue where staging environments do not accurately reflect the production environment, which makes it essential to build and maintain an effective staging infrastructure. Furthermore, when we transitioned to Kubernetes, we had outdated nodes running in tandem with newer ones, which complicated our environment. That reflects a gap in our processes for managing transitions in production, compounded by a lack of maturity in handling them.
00:16:41.450 To be clear, estimates of how applications will scale need to be backed by rigorous procedures. If all of the previous factors had been handled, the problem wouldn't have surfaced. The thought process during an event like this requires looking for procedural, policy, system, and training issues that affect the organization at a macro level. It's essential to remember that a major production incident is never caused by a single person making a mistake; it results from systemic failures. Everyone involved is trying hard, so I often ask what safeguards exist to prevent dangerous operations like these. It's always about the system, never just the individual.
00:19:29.750 This journey takes time; getting good at post-mortems and addressing the organizational issues that lead to incidents requires ongoing effort, diligent practice, and empathy within teams. The best post-mortems are written by people who care deeply about their colleagues and understand the stress that comes with being roused from sleep by an incident. If you would like to see our post-mortem template, I'll be sharing it online; it could serve as a useful resource. Right now I'm actively hiring for my team, so if you want to work with me, that opportunity is there. Thank you!