RailsConf 2016
Your Software is Broken — Pay Attention

by James Smith

In the talk "Your Software is Broken — Pay Attention," James Smith addresses the critical need for effective production monitoring in software applications. He emphasizes that while rapid deployment is essential to stay competitive, it should not come at the cost of software quality. Effective production monitoring allows teams to ship confidently by identifying and resolving issues before they lead to user dissatisfaction. Smith breaks down production monitoring into three core areas: stability monitoring, performance monitoring, and availability monitoring.

Key points discussed include:

- The importance of confidence in software reliability post-deployment: Developers must understand that simply shipping code is insufficient; ongoing monitoring is crucial.

- Acknowledging customer frustrations: Common scenarios include users abandoning buggy apps that fail during critical transactions, like purchasing. A retention study highlighted that 84% of customers will cease using software after just two crashes.

- Common pitfalls or 'deadly sins' of production monitoring include:

  - Assuming everything is fine post-deployment

  - Relying solely on customer complaints for insights

  - Lack of visibility into issues due to ineffective logging practices

  - Absence of accountability regarding issue ownership

- Principles for improving production monitoring:

  - Accept that software can and will break; embrace this reality to foster a more proactive approach to issue resolution.

  - Use automated tools to detect problems promptly rather than relying solely on log files or alerts.

  - Prioritize alerts through effective aggregation and team communication tools to manage engineering efforts efficiently.

  - Establish accountability and ownership within teams so that problems are addressed promptly, promoting a healthy culture of learning from mistakes.

In conclusion, James Smith urges teams to implement these best practices for effective production monitoring to enhance software quality and user experience. By recognizing the inevitability of issues, employing the right tools, prioritizing alerts, and fostering accountability, organizations can navigate the challenges of rapid software deployment while ensuring customer satisfaction and brand integrity.

00:00:11.920 Hey everyone, thanks for coming down. Can everyone hear me at the back? Got a wave? Awesome! So yeah, thanks for coming down today. I'm going to talk a little bit about how to rethink production monitoring.
00:00:19.560 That's quite a broad topic, but I'll dive into it in a lot of depth.
00:00:26.039 But who the heck am I? You can probably tell by my accent that I'm not from around here. I'm originally from the UK and I've been working in the Bay Area for the past seven years.
00:00:38.079 I've also been building production monitoring systems for over ten years, originally in finance when I was working at Bloomberg, then at a startup, and now at Bugsnag, helping people do better production monitoring in their own teams.
00:00:51.120 Just a quick overview of Bugsnag: I think a lot of people may know about Bugsnag or have used a similar tool. We help you understand what's going wrong in your software in production.
00:01:01.320 Instead of receiving a slew of emails from the exception notifier gem, or digging into your log files or using an older tool like Airbrake, we provide you with the tools and workflow you need to identify the most important problems in your software.
00:01:14.439 We highlight answers rather than just data. I think Skylight used that term as well, but we focus on identifying the most harmful bugs and errors in your applications.
00:01:20.280 I'll be talking a lot about the philosophies and techniques we use when building Bugsnag. However, this is more of a general talk about setting up good quality production monitoring and ensuring that you're not dropping anything.
00:01:41.960 This is the scary reality for many companies. If people are honest with themselves, most are doing this: "Let's build some code, let's write tests. Most people are doing a good job of writing tests these days. Let's send it out to production, and then I guess it's okay. It's probably fine." But what you really want is confidence.
00:02:08.160 You want to be sure that when your code is live and your customers are using your product, it's working properly. I'll talk a bit about how you can move from the left to the right side of this process and gain confidence in your software.
00:02:21.400 So, what is production monitoring? It breaks down into three core areas: stability monitoring, performance monitoring, and availability monitoring. Stability monitoring is what Bugsnag does; it detects if your software is broken and alerts you if crashes are occurring.
00:02:53.360 Performance monitoring, on the other hand, includes tools like New Relic that inform you if your application is running slowly, and availability monitoring focuses on uptime, something like Pingdom. Essentially, it's checking if your site is responding to requests.
00:03:06.319 But really, all of these aspects aim to deliver an exceptional experience to your customers. That's the point of production monitoring. But why do we care about that?
00:03:31.080 It's easier to make software than ever before. Particularly in the Rails and Ruby communities, there's a wealth of tutorials available, enabling you to build applications quickly. However, your app's success depends on its quality.
00:04:07.120 For example, if you are trying to purchase a TV and your app crashes, you'll likely switch to another app like Amazon's to complete the transaction. It doesn't matter where you get the TV from; what matters is making the purchase.
00:04:27.080 If your app is broken, slow, or unavailable, your customers will be frustrated and may leave. A retention study conducted a year ago found that 84% of customers will abandon your software after just two crashes.
00:05:01.560 You may have invested significant time creating a valuable product, but if your app falters, customers will simply choose not to use it anymore. Moreover, unhappy customers often express their grievances online, whether on Twitter or the App Store, leading to permanent damage to your brand.
00:05:34.240 On the engineering side, a study we've repeated a couple of times suggests that 49% of engineering time is spent on finding and fixing bugs. This figure is consistent across many teams, and it's incredibly frustrating, as engineers prefer spending time building features rather than resolving issues.
00:06:13.680 On one side, you have a customer base that you painstakingly cultivated, abandoning your product due to stability, performance, or quality issues. On the other hand, you are wasting your own time as an engineer or engineering manager.
00:07:09.000 I've created a list of the deadly sins of production monitoring. Many people are guilty of one or more of these. The first sin is pretending that nothing is wrong. Some teams operate under the outdated belief that shipping to production is the final step of the process.
00:07:54.600 They think they can produce something, ship it, and walk away. However, modern software development focuses on deploying code as quickly as possible, observing its performance in real world conditions, and learning from it.
00:08:52.360 You should not assume everything is fine after shipping; this often leads to complacency. One common pitfall is assuming that writing tests guarantees quality. While it's true that most organizations work hard to write tests, no team can write tests for every potential scenario.
00:09:46.120 To complicate matters, too many organizations rely on QA teams, thinking they are responsible for all testing and oversight—even in teams without any QA staff.
00:10:23.640 Next, we have waiting for customers to complain. This is a significant mistake. Waiting for that first complaint ignores the reality that many customers won't voice their concerns, so the real churn is much larger than the complaints suggest.
00:10:49.760 By the time you hear a complaint, you may have already lost several other customers. If you allow your app to continue to fail, you've failed your customers already.
00:11:03.880 Lack of visibility is another critical problem. If you implement logging without actively monitoring it, you won't know if there are issues at hand. Many teams mistakenly assume that logging means they're covered.
00:11:52.079 For example, I've observed cases where teams rely on log files, assuming they can track issues through them—but in reality, log files can become a black hole where data goes unnoticed.
00:12:54.480 Once you have some visibility, it's essential to establish ownership. If your monitoring tools are not owned by someone, little will be done to fix problems.
00:13:21.480 Having a clear ownership structure ensures that there is accountability for addressing issues. Moving on to how we can improve production monitoring, I've developed a set of core principles that can be applied across all areas of monitoring.
00:13:57.600 First, accept that your software will break after shipping. Once you acknowledge this, you'll be more relaxed and less arrogant. You'll recognize that while you're a skilled programmer, things will inevitably go wrong.
00:14:17.080 By adopting this mindset, you'll be well-positioned to iteratively improve your application based on real-world feedback. Second, use tools that detect problems automatically. This will save you time digging through log files or relying on various alerts.
00:14:53.040 Most programming languages provide exception hooks. Utilize these to catch issues where they occur. For instance, in Ruby on Rails, you can configure middleware to capture exceptions and manage how they propagate up the stack.
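The middleware idea mentioned here can be sketched roughly as follows. This is a minimal illustration, not Bugsnag's implementation: `ErrorCapture` and the `reporter` callable are hypothetical names, and plain lambdas stand in for a real application and a real reporting service.

```ruby
# Hypothetical Rack-style middleware: reports any unhandled exception,
# then re-raises it so the rest of the stack still sees the failure.
class ErrorCapture
  def initialize(app, reporter)
    @app = app
    @reporter = reporter # assumed: any object responding to #call(exception, env)
  end

  def call(env)
    @app.call(env)
  rescue StandardError => e
    @reporter.call(e, env)
    raise # propagate the original exception unchanged
  end
end

# Usage with stand-ins for a real app and reporter.
reported = []
failing_app = ->(_env) { raise ArgumentError, "bad input" }
reporter = ->(error, env) { reported << [error.class.name, env["PATH_INFO"]] }
stack = ErrorCapture.new(failing_app, reporter)

propagated = false
begin
  stack.call({ "PATH_INFO" => "/checkout" })
rescue ArgumentError
  propagated = true # the exception still reaches the caller after being reported
end
```

In a Rails application, a middleware like this would typically be registered with `config.middleware.use`; where it sits in the stack decides which exceptions it gets to see.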
00:15:39.680 You can also establish performance monitoring to assess how long requests are taking and get alerts when specific parameters are exceeded. Finally, leverage existing tools, like Bugsnag's SDKs, which allow you to capture diagnostic data.
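One way to picture the request-timing idea is a small Rack-style middleware that measures each request and calls a hook when a threshold is exceeded. This is only a sketch under assumptions: `SlowRequestMonitor`, `threshold_seconds`, and `on_slow` are hypothetical names, and a real performance monitoring tool collects far more than a single duration.

```ruby
# Hypothetical Rack-style middleware: times each request and calls a hook
# when the request takes longer than the configured threshold.
class SlowRequestMonitor
  def initialize(app, threshold_seconds:, on_slow:)
    @app = app
    @threshold_seconds = threshold_seconds
    @on_slow = on_slow # assumed: any object responding to #call(path, seconds)
  end

  def call(env)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    @app.call(env)
  ensure
    # Runs whether the request succeeded or raised; the response is still returned.
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    @on_slow.call(env["PATH_INFO"], elapsed) if elapsed > @threshold_seconds
  end
end

# Usage with a deliberately slow stand-in app.
slow_requests = []
slow_app = lambda do |_env|
  sleep 0.05
  [200, {}, ["ok"]]
end
monitor = SlowRequestMonitor.new(
  slow_app,
  threshold_seconds: 0.01,
  on_slow: ->(path, seconds) { slow_requests << [path, seconds] }
)
response = monitor.call({ "PATH_INFO" => "/reports" })
```

A monotonic clock is used so that system clock adjustments do not distort the measured duration.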
00:16:12.960 Once you set everything up and start receiving data, it's crucial to manage the noise generated by all the alerts. Prioritize what you focus on by determining which issues have the broadest impact on users.
00:17:06.840 Aggregation plays a vital role here; for instance, if many errors stem from one line of code, fix that first rather than trying to tackle all errors individually.
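As a rough illustration of aggregation, the sketch below groups raw error events by exception class and top stack frame, then ranks the groups by how often they occur. The event shape (`:error_class`, `:top_frame`) and the sample data are invented for the example; real tools use more sophisticated grouping rules.

```ruby
# Group raw error events by exception class and top stack frame,
# then rank the groups from most to least frequent.
def aggregate(errors)
  errors
    .group_by { |error| [error[:error_class], error[:top_frame]] }
    .map do |(error_class, top_frame), occurrences|
      { error_class: error_class, top_frame: top_frame, count: occurrences.size }
    end
    .sort_by { |group| -group[:count] }
end

# Invented sample events: three share one line of code, one is unrelated.
errors = [
  { error_class: "NoMethodError", top_frame: "app/models/cart.rb:42" },
  { error_class: "NoMethodError", top_frame: "app/models/cart.rb:42" },
  { error_class: "Timeout::Error", top_frame: "app/services/payments.rb:17" },
  { error_class: "NoMethodError", top_frame: "app/models/cart.rb:42" }
]

ranked = aggregate(errors)
```

Here four events collapse into two groups, and the group at the top of `ranked` is the one line of code responsible for most of the noise.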
00:17:57.280 To achieve visibility into these issues, utilize team chat platforms to streamline communication. Instead of relying on email notifications, which can quickly become overwhelming, push those important alerts into channels where your engineering team already communicates.
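A sketch of pushing an alert into a chat channel might look like the following. It assumes the chat tool exposes an incoming webhook that accepts a JSON body with a `text` field, which is a common convention but should be checked against the tool in use; `build_chat_message` and `post_to_chat` are hypothetical helpers, and the webhook URL is left to the caller.

```ruby
require "json"
require "net/http"
require "uri"

# Build a short, human-readable alert as a JSON payload.
# Assumption: the chat webhook accepts {"text": "..."}.
def build_chat_message(error_class:, location:, users_affected:)
  text = "#{error_class} at #{location} is affecting #{users_affected} users"
  JSON.generate({ text: text })
end

# Send the payload to the webhook. Not called in this sketch,
# because it needs a real webhook URL and network access.
def post_to_chat(webhook_url, body)
  Net::HTTP.post(URI(webhook_url), body, { "Content-Type" => "application/json" })
end

message = build_chat_message(
  error_class: "NoMethodError",
  location: "app/models/cart.rb:42",
  users_affected: 120
)
```

Keeping message construction separate from delivery makes the wording easy to test without touching the network.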
00:18:50.160 Once notifications are aggregated in your teamwork channels, you can filter through them and prioritize issues that impact the most users. Prioritization requires consideration—how many users were impacted, the severity of each error, and other related attributes.
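One simple way to turn those considerations into an ordering is a score that combines users affected with a severity weight. The weights, field names, and sample data below are invented for illustration; any real scheme would need tuning for the team and product.

```ruby
# Hypothetical severity weights; higher means more urgent.
SEVERITY_WEIGHT = { "info" => 1, "warning" => 3, "error" => 10 }.freeze

# Score an error group by how many users it touches and how severe it is.
def priority_score(group)
  group[:users_affected] * SEVERITY_WEIGHT.fetch(group[:severity])
end

# Highest score first.
def prioritize(groups)
  groups.sort_by { |group| -priority_score(group) }
end

groups = [
  { name: "Slow avatar upload", users_affected: 300, severity: "warning" },  # score 900
  { name: "Checkout crash", users_affected: 120, severity: "error" },        # score 1200
  { name: "Deprecated API notice", users_affected: 500, severity: "info" }   # score 500
]

ordered_names = prioritize(groups).map { |group| group[:name] }
```

Note that the checkout crash ranks first even though it affects the fewest users, because its severity weight outweighs the raw count.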
00:19:53.600 It's important to ensure that you have the necessary diagnostic data available at the time a production issue arises. Understanding the context of an error—like what caused it or how many users are being affected—helps you mitigate its consequences.
00:20:37.240 At this point, I cannot stress enough the importance of ensuring that someone on your team is responsible for addressing these issues. Every organization, regardless of size, should prioritize assigning responsibility so that any detected problems can be promptly addressed.
00:21:41.200 That said, there are different ways to structure your team to manage production monitoring. One viable method involves creating a dedicated bug team tasked with monitoring production tools and identifying urgent problems.
00:22:36.600 This model can work, but it has drawbacks, as it may hinder individual developers from learning from production incidents. Another method involves running a rotation system, having team members take turns being responsible for bug monitoring for a period.
00:23:41.640 Finally, there’s the notion of accountability. The last person who modified the code may have insights into the cause of a problem. This way, they can work through the issue or provide context on why it’s happening.
00:24:23.520 A measured approach towards handling post-deployment bugs should strive for a blame-free culture. Establishing this kind of environment allows software engineers to learn from their mistakes while working collaboratively.
00:25:19.440 To conclude, as you work to avoid the sins of production monitoring, adhere to the core principles of quality assurance: accept that software will break, implement automated error hooks, and aggregate your error data.
00:26:10.640 Make use of team chat for notifications, prioritize critical issues, and ensure that you can access diagnostic data. Lastly, make certain that there is accountability—someone should be responsible for acting on detected issues.
00:27:08.640 These core tenets should inform your workflow as you seek to bolster your production monitoring strategies.
00:27:36.480 Are there any questions? (pause for audience questions) That’s a great question. When you have a lot of bugs that seem to appear at a similar level of priority or frequency, how do you prioritize fixing them?
00:28:07.920 One approach is to simply have a team project where you identify and address the most pressing issues. Another technique, known as 'declaring bankruptcy,' involves identifying duplicate bugs and cleaning house to reduce clutter.
00:28:56.440 We recently added a feature to Bugsnag that allows you to ‘snooze’ alerts, meaning if you’re not concerned about a specific issue right now, you can temporarily mute it while still being notified if it worsens.
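The snooze idea can be sketched as a rule that stays quiet until an error has occurred a given number of additional times. This is a hypothetical illustration of the concept, not how Bugsnag implements the feature; the `Snooze` class and its parameters are invented for the example.

```ruby
# Hypothetical snooze rule: mute an error until it has occurred
# a given number of additional times since it was snoozed.
class Snooze
  def initialize(count_at_snooze:, wake_after_additional:)
    @wake_at = count_at_snooze + wake_after_additional
  end

  # True once the error has worsened enough to notify again.
  def notify?(current_count)
    current_count >= @wake_at
  end
end

# Snoozed at 40 occurrences; wake up after 100 more.
snooze = Snooze.new(count_at_snooze: 40, wake_after_additional: 100)
still_quiet = snooze.notify?(90)
woken_up = snooze.notify?(140)
```

A count-based rule is only one option; a time-based rule (mute for a fixed period) or a rate-based rule (wake when errors per hour spike) would follow the same shape.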
00:29:51.680 This creates a more manageable situation instead of overwhelming developers with alerts. If a significant number of errors occur due to the same root cause, focus on fixing that specific issue first.
00:30:47.600 In summary, building and maintaining an effective production monitoring framework entails comprehensively addressing these concerns through thoughtful processes and cultural shifts. Avoid pitfalls where you feel everything’s okay and empower your engineering teams to speak up when issues arise.
00:31:53.760 Keep the lines of communication open, learn from feedback, and prioritize action items, and you will be well placed to thrive in a quality-focused development environment.
00:32:12.960 Thank you all for joining me today, and I look forward to seeing you successfully develop and monitor your applications.