00:00:22.240
All right, good morning everyone. It's the last day of RailsConf, and you have come to my talk, so thank you. I am James Thompson, a principal software engineer for Mavenlink. We build project management software primarily for professional services companies. We are hiring in both Salt Lake City and San Francisco, so if you're looking for work, come and talk to me! I would love to get you on our team.
00:00:27.599
Now, this is a talk that is roughly similar to what I did in Los Angeles at RubyConf, but I have added some new material. If you did happen to attend my talk on the same subject at RubyConf, you will be getting a little extra today. I've also tried to change the focus, but what we’re going to talk about today is failure and how to deal with it. We will explore how we can cope with the failures that happen in our systems.
00:00:45.280
I want to start with a very simple question: how many of you have ever written software that failed? Yes, it fails in all kinds of ways. Sometimes it's due to hardware issues, sometimes the code we wrote isn't quite perfect, or even close to it, or is just garbage. Sometimes our systems fail for things completely outside our control. I'm going to discuss various ways to handle these kinds of failures, particularly those that are difficult to foresee but can be mitigated.
00:01:03.039
The first thing we have to come to terms with is that failure happens. We have to accept that everything fails some of the time; it's unavoidable. We need to plan for it and have strategies in place to help us mitigate it. It's crucial to think about these things ahead of time as much as possible. While we can't foresee the future, we can plan for reasonable outcomes and reasonable failure modes.
00:01:22.480
Not everything I discuss today will be immediately applicable to your projects, especially if you're working predominantly in a monolith; the specific stories I'll share come from a microservice ecosystem. There won't be a perfect one-to-one correspondence, but I will present ideas that should have general applicability regardless of your environment or programming languages.
00:01:29.119
I want to start with what I believe is the most basic and fundamental practice, which is that we can't fix what we can't see. I hope no one here believes they possess such perfect omniscience that they can address issues they are unaware of.
00:01:35.440
As we think about our systems, we need to look for ways to gain visibility into failures. This visibility aids us in managing our failures by providing a window into the many facets that determine when, how, and to what degree our systems are failing. Beyond just error reporting and instrumentation, systems for metric capturing can provide rich context, helping you understand why and how your systems are failing.
00:01:54.560
To illustrate this, let me share a story. We know that low visibility is dangerous in many contexts, whether sailing, flying, or driving. Low visibility is a hazard in software as well, though usually not to the same life-threatening degree. I was working for a company in a microservice environment, focusing on a system written in Go. As I got up to speed, I found we had extensive logging output, which made us feel as though we understood our system.
00:02:13.200
However, we started noticing something strange in our staging environment: processes we thought should be completing were not actually finishing. While we could see data coming into the system and the logs indicated processing was occurring, we were not seeing the expected results on the other end. This experience revealed that despite our confidence in our visibility, we were clearly missing some part of the picture.
00:02:44.360
In conjunction with the service I was working on, I began rolling out additional tools: Bugsnag for error reporting and SignalFx for metric tracking. These two solutions gave us a much more context-aware view of the errors in our system. Our newfound visibility allowed us to understand how many jobs were starting, succeeding, and failing.
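To make that concrete, here is a minimal Ruby sketch of the kind of instrumentation this involves. The job wrapper, the metric names, and the tiny metrics stand-in are all illustrative; only Bugsnag.notify is the gem's real reporting call, and it assumes the bugsnag gem has been configured elsewhere in the app.

    require "bugsnag"  # error reporting; assumes Bugsnag is configured elsewhere

    # Stand-in for whatever metrics client you use (SignalFx, StatsD, etc.).
    class Metrics
      def self.increment(name)
        # A real client would emit a counter to your metrics backend here.
        puts "metric: #{name} +1"
      end
    end

    # Wrap each unit of work so every start, success, and failure is counted,
    # and every exception is reported with context rather than only logged.
    def run_job(name)
      Metrics.increment("#{name}.started")
      yield
      Metrics.increment("#{name}.succeeded")
    rescue StandardError => e
      Metrics.increment("#{name}.failed")
      Bugsnag.notify(e)
      raise
    end

    # run_job("nightly-import") { Importer.run }  # hypothetical caller

Even counting at this level is what makes the graphs I'll mention in a moment possible: you can't notice that successes have dropped to zero unless you were counting successes in the first place.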
00:03:00.879
Logging alone is often insufficient; I find most log files to be nearly useless in identifying system issues. They are typically noisy and offer a very low signal-to-noise ratio. Tools like Bugsnag and others, which we have here on the vendor floor, offer a clearer picture of what's going on, which in turn alters how we engage with our applications.
00:03:16.000
With increased visibility, we can determine what to focus on and when. However, simply knowing that there are thousands of errors in a system may not provide much insight. Even if we think there are high error rates, we need context to understand if those rates are normal or abnormal. As an example, one of our data sources—a credit bureau—had notoriously poor data quality, involving inconsistent formats.
00:03:31.520
Because of these inconsistencies, we were unsure how much of the data we should expect to fail. We knew failures were to be expected given the quality of the data we were handling, but we could not quantify them. This is why we brought in metrics tools to create clarifying graphs like the ones SignalFx provides. One graph in particular was alarming: the blue line represented how many jobs were starting, and the orange line depicted failures; there was no green line, revealing that none of the jobs were succeeding.
00:03:50.879
This visualization quickly alerted us that something had gone badly wrong, thankfully, in our staging environment before changes were rolled out to production. Without this context, we would not have realized the severity of our problem, risking the chance of addressing lesser issues while ignoring the true underlying cause.
00:04:11.760
This taught us that visibility provides greater context not only for identifying what is failing but also understanding why the failure is significant. There's also additional tooling I have come to love from a company called LogRocket, which shows user interactions that triggered errors in the system, connecting them with services like Bugsnag or Sentry. This layer of detail allows us to understand not just what broke but why it matters.
00:04:30.639
Visibility is just the starting point for dealing with errors. We must ensure we are raising the visibility of our errors and gathering as much context as we can about those errors to address them effectively. Picking the right tools allows us to monitor not just system health, but also the broader context surrounding those errors.
00:04:45.200
That kind of visibility leads to better prioritization when issues arise. Simply increasing visibility grants a significant advantage when apps go awry, and it further equips you to discern which errors are truly meaningful to your customers.
00:05:01.680
The more context you have, the better your actionable information becomes. This leads me to my next point: fix what’s valuable. How many of you have worked with compiled languages? Now, how many of you are familiar with the mantra that we should treat warnings as errors? This is something I first encountered in the 90s.
00:05:18.240
I thought it was a great idea without considering the implications. Treating every warning as an error sounds like a good practice because it should lead to better code. However, with dynamic languages like JavaScript, we might spend too much time focused on irrelevant warnings.
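Since this is a Ruby conference, here is a minimal sketch of what taking that mantra literally could look like in Ruby, using the Warning hook; it also hints at the downside, because one noisy deprecation anywhere now kills the process. This is illustrative, not a recommendation.

    # Take "treat warnings as errors" literally: override Ruby's Warning hook
    # so that anything routed through it raises instead of printing to stderr.
    module Warning
      def self.warn(message)
        raise message.to_s
      end
    end

    # Kernel#warn routes through this hook on modern Rubies, so a harmless
    # deprecation notice now becomes a crash:
    warn "this gem method is deprecated"  # => RuntimeError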
00:05:30.880
Even with a bug reporting system highlighting legitimately occurring bugs, not every issue requires an immediate response. This is why prioritizing efforts based on value is critical. We need to evaluate the value our systems bring to customers, collaborators, and consumers. Therefore, we should focus on addressing errors that are detrimental to that value.
00:05:46.200
A lot of what we categorize as technical debt, things like outdated dependencies or even some reported vulnerabilities, may not actually threaten that value. If an issue isn't depriving users of a valuable experience, it might not be as urgent as we think. We must discern whether a given error is genuinely worth fixing.
00:06:06.560
If you have product and customer service teams, it's best to consult them before pouring effort into a reported error; they might suggest alternative strategies. Perhaps customer service can help users work with the system in a way that avoids the error altogether.
00:06:23.920
Sometimes, it’s prudent to allow certain issues to persist, enabling us to concentrate resources on correcting things that truly create user value. Ultimately, focusing on value leads to greater satisfaction among customers who rely on our systems.
00:06:37.760
I’d now like to share some stories that delve into unusual error conditions. The first principle relevant here is called 'return what we can.’ At one company where I worked, we had a microservice that replaced a previous generation of a similar service, which stored and tracked data points about businesses over time.
00:06:56.960
As part of our migration strategy for moving several million data points from the old service to the new one, we aimed to preserve historical context while adapting to a new database structure. The migration itself went smoothly: we brought the data over successfully, and every cross-check passed.
00:07:14.560
However, once in production, it became evident that some of the historical data we migrated contained corrupt values. Those values had been serialized as YAML in the database, and when they failed to parse they triggered hard application errors.
00:07:31.200
As a result, our service returned a 500 error whenever it encountered such corrupt data. Not only did our service experience problems, but it also negatively impacted our collaborators, leading to a cascading failure that affected portions of our site critical for revenue generation.
00:07:45.440
We faced multiple issues that needed addressing, but the simplest solution was to rescue the parsing error. Having evaluated the corrupted data, we concluded that it was unrecoverable and held no practical value, so returning nil for any data we couldn't parse became our approach.
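Here is a minimal Ruby sketch of that approach; the class and attribute names are made up for illustration, but the idea is simply to rescue the YAML parse error and fall back to nil rather than letting it bubble up as a 500.

    require "yaml"

    # Hypothetical wrapper around a historical data point whose value
    # was stored in the database as serialized YAML.
    class DataPoint
      def initialize(raw_value)
        @raw_value = raw_value
      end

      # Return the deserialized value, or nil if the stored YAML is corrupt.
      # The corrupt rows were unrecoverable anyway, so a missing value is
      # more useful to callers than a hard application error.
      def value
        YAML.safe_load(@raw_value)
      rescue Psych::SyntaxError
        nil
      end
    end

    DataPoint.new("valid: true\n").value  # => { "valid" => true }
    DataPoint.new("{{ not yaml").value    # => nil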
00:08:07.919
This kept things moving: the system could continue functioning instead of throwing an error whenever it encountered a problem with one specific piece of data. In many application contexts, it is generally more effective to return something rather than nothing. We discovered that data rarely needs to be complete to retain some usefulness.
00:08:26.720
Asking how little we can return without entirely compromising value is a mentality we should adopt in our systems. That's why we should embrace the principle of returning what we can whenever possible.
00:08:42.560
Another related principle is the acceptance of varied input. While we need to be careful regarding what we return, we must also be generous in terms of what we accept. This means accepting as much data as collaborators can provide, even if it's faulty.
00:09:00.560
In one project, our system had many collaborators, each contributing only part of a business profile. Therefore, we designed our service to allow submission of a business profile without requiring all fields to be complete. Users could input one field or all fields depending on their data availability.
00:09:16.880
This flexibility ensured that our service became more resilient to the varying quality of data submitted by different sources. Even if one field was invalid, we would still accept valid inputs and communicate which fields needed attention later.
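As a rough Ruby sketch of that posture; the field names and validation rules here are invented, but the shape is the point: accept whatever a collaborator sends, keep the fields that pass validation, and report the rest back rather than rejecting the whole submission.

    # Hypothetical tolerant handling of a partial business-profile submission.
    ALLOWED_FIELDS = %i[name phone website employee_count].freeze

    VALIDATORS = {
      name:           ->(v) { v.is_a?(String) && !v.strip.empty? },
      phone:          ->(v) { v.to_s.count("0-9") >= 7 },
      website:        ->(v) { v.to_s.start_with?("http://", "https://") },
      employee_count: ->(v) { v.is_a?(Integer) && v >= 0 }
    }.freeze

    # Merge every valid field into the profile and collect the rest,
    # instead of failing the entire request over one bad value.
    def apply_profile_update(profile, submitted)
      accepted = {}
      needs_attention = {}

      submitted.slice(*ALLOWED_FIELDS).each do |field, value|
        if VALIDATORS[field].call(value)
          accepted[field] = value
        else
          needs_attention[field] = "could not be processed"
        end
      end

      [profile.merge(accepted), needs_attention]
    end

    updated, rejected = apply_profile_update(
      { name: "Acme Co" },
      { name: "Acme Corporation", website: "not-a-url", phone: "555-010-2020" }
    )
    # updated  => { name: "Acme Corporation", phone: "555-010-2020" }
    # rejected => { website: "could not be processed" }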
00:09:31.280
This forgiving approach greatly improved our ability to process data while simultaneously minimizing errors. By accepting what could be processed without enforcing strict data integrity, our systems became more robust.
00:09:48.000
The last principle I want to discuss revolves around trust. In any software system, trust is crucial because we depend on various services—either within our organization or third-party tools.
00:10:02.560
However, this dependency implies certain risks. When the reliability of those dependencies falters, their failures can quickly cascade into our own systems. At one company we had effectively built a distributed monolith, with services extending each other a great deal of trust, and that ultimately resulted in a significant outage affecting the user experience.
00:10:19.840
Our excessive reliance on other services resulted in failures propagating throughout our architecture. Therefore, we must be wary of who we trust—always preparing mitigation strategies for inevitable failures.
00:10:35.680
It’s essential to acknowledge that we might face outages in the services we depend on, be it a cloud provider or a third-party API. When we build our systems, we should lean towards lower dependency rather than a high level of faith in external systems.
00:10:50.800
It's critical to design our approaches so that the system will not fail catastrophically should any component start misbehaving. Many organizations may not be prepared for adopting microservices and the complexities involved.
00:11:07.440
Distribution in systems complicates our architectures, and many teams may not have the necessary skills or knowledge to navigate that complexity. This problem can occur in any collaborative environment where dependencies introduce risks.
00:11:25.680
This is not an argument against using dependencies; rather, it is a caution to trust carefully, analyzing who and what we rely on and actively preparing for when failures occur. Ensuring we provide adequate user experience even despite errors is essential.
00:11:42.320
To keep the experience a net positive, we should avoid surfacing a dependency's interruption as a raw error whenever possible. Consider the user experience, whether that user is a customer or a developer interacting with your service.
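As a hedged Ruby sketch of that idea; the endpoint, timeouts, and fallback shape are all invented for illustration: put tight timeouts around the call to a dependency, rescue the ways it can fail, and return a degraded but still useful response instead of letting the failure cascade.

    require "net/http"
    require "json"
    require "uri"

    # Hypothetical client for an external enrichment service we depend on.
    FALLBACK = { enrichment: nil, degraded: true }.freeze

    def fetch_enrichment(business_id)
      uri = URI("https://enrichment.example.com/businesses/#{business_id}")

      response = Net::HTTP.start(uri.host, uri.port, use_ssl: true,
                                 open_timeout: 1, read_timeout: 2) do |http|
        http.get(uri.request_uri)
      end

      return FALLBACK unless response.is_a?(Net::HTTPSuccess)

      { enrichment: JSON.parse(response.body), degraded: false }
    rescue Net::OpenTimeout, Net::ReadTimeout, SocketError, JSON::ParserError
      # The dependency is slow, unreachable, or returning garbage; report it
      # to your error service and serve the degraded response instead of a 500.
      FALLBACK
    end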
00:12:00.160
The principal takeaway is to expect failure; that idea is at the heart of chaos engineering. We have to approach our software systems with this mindset, because we cannot escape failures.
00:12:19.440
The first fundamental step involves establishing good visibility across our systems—not just through logs or error services, but utilizing meaningful metrics that encompass a full view of the extent and scale of failures should they transpire.
00:12:36.560
This practice enables us to preserve and restore customer value when systems break down. Thank you for listening, and if anyone has questions or you're interested in work opportunities, I will be here afterward to chat.
00:12:51.760
You can also access my slides via the link provided. Thank you for coming out!