Building for Gracious Failure

by James Thompson

In this talk titled "Building for Gracious Failure" presented at RailsConf 2019, James Thompson discusses the inevitability of failure in software development and the importance of addressing it gracefully. Throughout the presentation, Thompson emphasizes that failures are a part of the software lifecycle and that having strategies to handle them can lead to growth and improvement.

Key Points Discussed:

Acceptance of Failure: Thompson starts by acknowledging that all software fails at some point and emphasizes the need for proactive planning to mitigate these failures.
Visibility into Failures: Gaining visibility is crucial. By using tools for logging and metrics, developers can have better insights into issues when they occur. Thompson shares a personal experience where enhanced logging tools (Bug Snag and Signal Effects) significantly improved the team's ability to understand system failures, illustrating this point with a scenario of job failures in their staging environment.
Focus on Value: Developers should prioritize fixing issues that impact customer value rather than every error that arises. This involves understanding which errors are critical and which can be deprioritized based on their impact on user experience.
Returning What You Can: In the case of corrupt data during a migration, Thompson discusses how returning a null value instead of an error can prevent cascading failures and allow the system to remain operational.
Forgiving Systems: A system designed to be forgiving accepts all data it can understand, even if some data is missing or corrupt. This normalizes interactions and reduces total failures across the system.
Trust Carefully: Trust in dependencies, whether external services or internal teams, must be managed meticulously. Thompson warns that over-reliance can lead to significant failures if not properly mitigated.

Significant Examples:

Thompson recalls an incident where an error in data configuration led to a cascading failure affecting service availability and revenue loss. This scenario highlighted the importance of resilience in system design.

Conclusion:

Thompson concludes with the key takeaway that expecting failure is essential for robustness in software development. By implementing visibility, prioritizing valuable fixes, accepting partial data, and being cautious about trust, development teams can better handle failures and mitigate their effects on the user experience.

The session serves as a reminder that failure is not just a possibility but a certainty in software development, and how teams respond to it defines their success.

00:00:22.240 All right, good morning everyone. It's the last day of RailsConf, and you have come to my talk, so thank you. I am James Thompson, a principal software engineer for Mavenlink. We build project management software primarily for professional services companies. We are hiring in both Salt Lake City and San Francisco, so if you're looking for work, come and talk to me! I would love to get you on our team.

00:00:27.599 Now, this is a talk that is roughly similar to what I did in Los Angeles at RubyConf, but I have added some new material. If you did happen to attend my talk on the same subject at RubyConf, you will be getting a little extra today. I've also tried to change the focus, but what we’re going to talk about today is failure and how to deal with it. We will explore how we can cope with the failures that happen in our systems.

00:00:45.280 I want to start with a very simple question: how many of you have ever written software that failed? Yes, it fails in all kinds of ways. Sometimes it's due to hardware issues, sometimes the code we wrote isn't quite perfect, or even close to it, or is just garbage. Sometimes our systems fail for things completely outside our control. I'm going to discuss various ways to handle these kinds of failures, particularly those that are difficult to foresee but can be mitigated.

00:01:03.039 The first thing we have to come to terms with is that failure happens. We have to accept that everything fails some of the time; it's unavoidable. We need to plan for it and have strategies in place to help us mitigate it. It's crucial to think about these things ahead of time as much as possible. While we can't foresee the future, we can plan for reasonable outcomes and reasonable failure modes.

00:01:22.480 While not everything I discuss today will be immediately applicable to your projects—especially if you're working predominantly in a monolith—some specific stories I have come from a microservice ecosystem. There won't be a perfect one-to-one correspondence, but I will present ideas that should have general applicability regardless of your environment or programming languages.

00:01:29.119 I want to start with what I believe is the most basic and fundamental practice, which is that we can't fix what we can't see. I hope no one here believes they possess such perfect omniscience that they can address issues they are unaware of.

00:01:35.440 As we think about our systems, we need to look for ways to gain visibility into failures. This visibility aids us in managing our failures by providing a window into the many facets that determine when, how, and to what degree our systems are failing. Beyond just error reporting and instrumentation, systems for metric capturing can provide rich context, helping you understand why and how your systems are failing.

00:01:54.560 To illustrate this, let me share a story. We know that low visibility is dangerous in many contexts, whether sailing, flying, or driving. Having low visibility is a hazard in software as well, although not to the same life-threatening degree in most cases. I was working for a company in a microservice environment, focusing on a system written in Go. While I was able to get up to speed, we had extensive logging output, which made us feel as though we understood our system.

00:02:13.200 However, we started noticing something strange in our staging environment: processes we thought should be completing were not actually finishing. While we could see data coming into the system and the logs indicated processing was occurring, we were not seeing the expected results on the other end. This experience revealed that despite our confidence in our visibility, we were clearly missing some part of the picture.

00:02:44.360 In conjunction with the service I was working on, I began rolling out additional tools—specifically, Bugsnag and SignalFx—for metric tracking. These two solutions provided us with a much more context-aware view of the errors in our system. Our newfound visibility allowed us to understand better how many jobs were starting, succeeding, and failing.

00:03:00.879 Logging alone is often insufficient; I find most log files to be nearly useless in identifying system issues. They are typically noisy and offer a very low signal-to-noise ratio. Tools like Bugsnag and others, which we have here on the vendor floor, offer a clearer picture of what's going on, which in turn alters how we engage with our applications.

00:03:16.000 With increased visibility, we can determine what to focus on and when. However, simply knowing that there are thousands of errors in a system may not provide much insight. Even if we think there are high error rates, we need context to understand if those rates are normal or abnormal. As an example, one of our data sources—a credit bureau—had notoriously poor data quality, involving inconsistent formats.

00:03:31.520 Because of these inconsistencies, we were unsure how much we should expect to fail. We knew failures were to be expected due to the quality of the data we were handling, but we could not quantify them. This is why we brought in metrics tools to create clarifying graphs like the ones provided by SignalFx. This one particular graph was alarming: the blue line represented how many jobs were starting, and the orange line depicted failures; there was no green line, revealing that none of the jobs were succeeding.

00:03:50.879 This visualization quickly alerted us that something had gone badly wrong, thankfully, in our staging environment before changes were rolled out to production. Without this context, we would not have realized the severity of our problem, risking the chance of addressing lesser issues while ignoring the true underlying cause.

00:04:11.760 This taught us that visibility provides greater context not only for identifying what is failing but also understanding why the failure is significant. There's also additional tooling I have come to love from a company called LogRocket, which shows user interactions that triggered errors in the system, connecting them with services like Bugsnag or Sentry. This layer of detail allows us to understand not just what broke but why it matters.

00:04:30.639 Visibility is just the starting point for dealing with errors. We must ensure we are raising the visibility of our errors and gathering as much context as we can about those errors to address them effectively. Picking the right tools allows us to monitor not just system health, but also the broader context surrounding those errors.

00:04:45.200 This environment will lead to improved prioritization efforts when issues arise. Simply increasing visibility grants a significant advantage when apps go awry. This will further equip you to discern which errors are truly meaningful to your customers.

00:05:01.680 The more context you have, the better your actionable information becomes. This leads me to my next point: fix what’s valuable. How many of you have worked with compiled languages? Now, how many of you are familiar with the mantra that we should treat warnings as errors? This is something I first encountered in the 90s.

00:05:18.240 I thought it was a great idea without considering the implications. Treating every warning as an error sounds like a good practice because it should lead to better code. However, with dynamic languages like JavaScript, we might spend too much time focused on irrelevant warnings.

00:05:30.880 Even with a bug reporting system highlighting legitimately occurring bugs, not every issue requires an immediate response. This is why prioritizing efforts based on value is critical. We need to evaluate the value our systems bring to customers, collaborators, and consumers. Therefore, we should focus on addressing errors that are detrimental to that value.

00:05:46.200 When handling issues like outdated dependencies or security vulnerabilities, a lot of what we categorize as technical debt may not hold real value. If an issue isn't depriving users of valuable experience, it might not be as urgent as we think. We must discern whether a given error is genuinely worth fixing.

00:06:06.560 If you have product and customer service teams, it's best to consult them before focusing on a reported error; they might suggest alternative strategies. Perchance customer service could assist users in adapting to the system in a way that avoids the error altogether.

00:06:23.920 Sometimes, it’s prudent to allow certain issues to persist, enabling us to concentrate resources on correcting things that truly create user value. Ultimately, focusing on value leads to greater satisfaction among customers who rely on our systems.

00:06:37.760 I’d now like to share some stories that delve into unusual error conditions. The first principle relevant here is called 'return what we can.’ At one company where I worked, we had a microservice that replaced a previous generation of a similar service, which stored and tracked data points about businesses over time.

00:06:56.960 As part of our migration strategy for moving several million data points from the old service to the new one, we aimed to preserve historical context while adapting to a new database structure. The migration process itself went smoothly as we successfully brought over the data, and every cross-check returned passes.

00:07:14.560 However, once in production, it became evident that the historical data we migrated contained corrupt values. These corrupted data points triggered hard application errors because we had serialized them using YAML in the database.

00:07:31.200 As a result, our service returned a 500 error whenever it encountered such corrupt data. Not only did our service experience problems, but it also negatively impacted our collaborators, leading to a cascading failure that affected portions of our site critical for revenue generation.

00:07:45.440 We faced multiple issues that needed addressing, but the simplest solution was to rescue the parsing error. Having evaluated the corrupted data, we concluded that due to its unavailability, it held no practical usage. Thus, returning null for any data we couldn't parse became our approach.

00:08:07.919 It speeds up processes, allowing the system to continue functioning without throwing an error whenever it encounters a problem with a specific piece of data. In many application contexts, it is generally more effective to return something rather than return nothing. We discovered that rarely does all data need to be complete for it to maintain some usefulness.

00:08:26.720 Thus, focusing on how little can be returned without entirely compromising value reflects a mentality we should adopt in our systems. That’s why we should embrace the principle of returning what we can whenever possible.

00:08:42.560 Another related principle is the acceptance of varied input. While we need to be careful regarding what we return, we must also be generous in terms of what we accept. This means accepting as much data as collaborators can provide, even if it's faulty.

00:09:00.560 In one project, our system had many collaborators, each contributing only part of a business profile. Therefore, we designed our service to allow submission of a business profile without requiring all fields to be complete. Users could input one field or all fields depending on their data availability.

00:09:16.880 This flexibility ensured that our service became more resilient to the varying quality of data submitted by different sources. Even if one field was invalid, we would still accept valid inputs and communicate which fields needed attention later.

00:09:31.280 This forgiving approach greatly improved our ability to process data while simultaneously minimizing errors. By accepting what could be processed without enforcing strict data integrity, our systems became more robust.

00:09:48.000 The last principle I want to discuss revolves around trust. In any software system, trust is crucial because we depend on various services—either within our organization or third-party tools.

00:10:02.560 However, this dependency implies certain risks. When the reliability of those dependencies falters, their failures can quickly cascade into our own systems. We previously built a distributed monolith's worth of trust, ultimately resulting in a significant outage affecting user experience.

00:10:19.840 Our excessive reliance on other services resulted in failures propagating throughout our architecture. Therefore, we must be wary of who we trust—always preparing mitigation strategies for inevitable failures.

00:10:35.680 It’s essential to acknowledge that we might face outages in the services we depend on, be it a cloud provider or a third-party API. When we build our systems, we should lean towards lower dependency rather than a high level of faith in external systems.

00:10:50.800 It's critical to design our approaches so that the system will not fail catastrophically should any component start misbehaving. Many organizations may not be prepared for adopting microservices and the complexities involved.

00:11:07.440 Distribution in systems complicates our architectures, and many teams may not have the necessary skills or knowledge to navigate that complexity. This problem can occur in any collaborative environment where dependencies introduce risks.

00:11:25.680 This is not an argument against using dependencies; rather, it is a caution to trust carefully, analyzing who and what we rely on and actively preparing for when failures occur. Ensuring we provide adequate user experience even despite errors is essential.

00:11:42.320 To yield a net positive experience, we should not render service interruptions as standard errors whenever possible. Consider the user experience, whether they are customers or developers interacting with your service.

00:12:00.160 The principle takeaway is to expect failures—this concept is vital and is a critical part of chaos engineering. It's important to approach our software systems with this mindset as we cannot escape failures.

00:12:19.440 The first fundamental step involves establishing good visibility across our systems—not just through logs or error services, but utilizing meaningful metrics that encompass a full view of the extent and scale of failures should they transpire.

00:12:36.560 This practice enables us to preserve and restore customer value when systems break down. Thank you for listening, and if anyone has questions or you're interested in work opportunities, I will be here afterward to chat.

00:12:51.760 You can also access my slides via the link provided. Thank you for coming out!