How to get to zero unhandled exceptions in production

00:00:13.590 Good morning! I'm speechless after this introduction, to be honest. But yeah, I like to fire the heat—that's the reason I actually like to have zero exceptions. I don't want to have an exception like most people are afraid of, which could bring the server down and wake them up in the morning. In my case, I'm scared to get an exception because I have to go home and fix it. So yeah, my name is Radoslav, but most people know me as Radoslav Stankov.

00:00:40.809 My presentations are usually technical. I have a lot of code and other information on the slides, and while I'm talking someone often wants to take a picture of them. So I always make sure to share my slides online. The link will also be on the last slide of the presentation, so if you decide that you actually like my work, you don't need to take pictures now. Just enjoy the presentation!

00:01:10.600 I work at a company called Product Hunt, and this is our website. At conferences there are usually a lot of questions and answers, and we often joke that the answer to everything in the universe is 'it depends'. A fancier version of that is 'context is king'. So let me walk you through our setup and processes so that you know the context of what I'm talking about.

00:01:22.900 Our setup consists of Redux, with GraphQL on the API layer, and a cluster of Sidekiq jobs. This is our server setup, and it works quite well. When I started working at Product Hunt, my time management tool looked like a brand new BMW—this nice car. We were so excited to get started, doing random coding work like a startup does. We adopted a mantra: 'Move fast and have breaks.' Sometimes in a startup, we would skip lunch or dinner, because who really needs those meals?

00:02:05.680 In practice, when you do this in a startup or a regular company, you might find that you start to experience issues. Things become tricky; you start band-aiding around the edges, and there are some problems here and there. Most of the time, it’s nothing major, especially when it comes to exceptions, like crashes while using the app. Smaller issues tend to appear randomly, which we often dismiss as 'Oh, cute bunnies.' Those are the exceptions you see occasionally, and slowly it can build up.

00:02:39.780 If you actually see those errors on your servers, that’s really bad. In production, we had a situation where we were moving quickly on multiple projects and running into trouble. We were often skipping meals to keep pushing forward, and our ability to manage these exceptions started to decline.

00:03:06.840 To combat this, we decided to implement what we call 'Happy Friday.' On Happy Friday, developers choose what they want to work on, similar to Google's 20% time—but we have five categories to focus on. Developers may fix bugs on previous features, work on 'goodie features'—small things that improve the user experience—or tackle technical debt since our system had amassed many battle scars during our busy periods.

00:03:38.890 We also have a category called 'catch up on projects.' Sometimes you fall behind on your sprints, and Happy Friday is the day you can catch up in order to stay on track. Developers decide the priority of their own work on that day. The last category is exception tracking: just fix, fix, fix! I spent every Friday for two years just addressing exceptions, sometimes focusing on a single big bucket of exceptions to get everything sorted.

00:04:50.690 Now, we are in a pretty stable state. Most of the problems we faced earlier arose from ignoring exceptions and allowing them to pile up. Most exceptions aren't actionable; they're random. In this talk, I want to share some strategies that helped me resolve thousands of unhandled exceptions.

00:05:34.480 It's always good to go back to the basics when trying to solve issues. Often, the best solutions come from revisiting first principles. When it comes to exceptions, I recommend checking out the book 'Exceptional Ruby' because it explains how the exception system works in Ruby. Ruby’s exception system has several cool features that are beneficial.

00:06:02.139 Let's look at some practical tips. The ideal situation is a system where nothing raises an error. In practice, we need to accept that things will raise errors, whether from UI interactions or from other unexpected issues. It's better to rescue specific, known errors than to rescue too broadly and risk hiding a real problem.
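
As a minimal sketch of rescuing only a known error, assume a hypothetical helper `parse_settings` (not from the talk) that reads a JSON payload. It rescues only `JSON::ParserError`, the failure we expect, so any other exception still surfaces:

```ruby
require "json"

# Hypothetical helper for illustration: parse a JSON settings payload
# and fall back to an empty hash when the payload is malformed.
def parse_settings(raw)
  JSON.parse(raw)
rescue JSON::ParserError
  # Known failure mode: the payload is not valid JSON.
  # Only this error class is rescued; anything else still propagates.
  {}
end

parse_settings('{"theme": "dark"}') # parsed hash
parse_settings("not json")          # falls back to {}
```

A bare `rescue` here would also swallow unrelated errors raised by bugs in the surrounding code, which is exactly the noise-hiding the tip warns against.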

00:06:50.889 Leaving notes and comments in the code can be helpful for the next developer. Express why exceptions can occur and what they signal. For instance, a note that explains when a piece of code might fail can guide future work, especially when the errors have descriptive names in the codebase.

00:07:08.520 This is especially important if you're working with network requests. Network-related errors are common and can disrupt workflows. In Ruby, we can gather the various network exception classes in one place and rescue from all of them in a unified manner.
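
One way to sketch this, using only standard library exception classes: keep the network errors in a single constant and rescue them with a splat. The names `NETWORK_ERRORS` and `with_network_fallback` are assumptions for illustration, and the exact list of classes would depend on the HTTP client in use:

```ruby
require "net/http"
require "openssl"
require "socket"
require "timeout"

# Assumed list of network-related errors; extend it for your HTTP client.
NETWORK_ERRORS = [
  Errno::ECONNREFUSED,
  Errno::ECONNRESET,
  Errno::EHOSTUNREACH,
  SocketError,
  Timeout::Error,          # also covers Net::OpenTimeout and Net::ReadTimeout
  OpenSSL::SSL::SSLError,
  EOFError
].freeze

# Hypothetical helper: run the block and return `fallback`
# if any of the known network errors is raised.
def with_network_fallback(fallback)
  yield
rescue *NETWORK_ERRORS
  fallback
end

# Usage: a refused connection becomes the fallback value.
with_network_fallback(nil) { raise Errno::ECONNREFUSED }
```

Errors outside the list, such as an `ArgumentError` from a coding mistake, are not rescued and still reach the exception tracker.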

00:07:44.400 These captured errors can make it much easier to deal with exceptions. When we see repeated network errors, we can group them together. This way, developers can avoid the burden of sifting through repeated notifications about those standard errors, while still being aware of critical ones. The key is to keep the noise low so we can focus on actionable alerts.

00:08:30.320 When errors do arise, such as from user actions that are no longer valid, utilizing systems like Sentry to track these exceptions can provide insight. We often log scenarios where subscriptions are missing, allowing tracking of user-ID-specific issues. It's also crucial to distinguish between operational errors and genuine systemic problems.
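
A rough sketch of that idea follows. `ErrorReporter` is a hypothetical in-memory stand-in for a tracker client such as Sentry, and `SubscriptionMissing`, `SUBSCRIPTIONS`, and `unsubscribe` are invented for the example. A missing subscription is treated as an operational case: it is reported with the user ID as context, and the user still gets a sensible result instead of a crash:

```ruby
# Hypothetical in-memory reporter standing in for a real tracker client.
module ErrorReporter
  Event = Struct.new(:message, :context)

  def self.events
    @events ||= []
  end

  def self.report(message, **context)
    events << Event.new(message, context)
    nil
  end
end

# Invented domain error and data, for illustration only.
SubscriptionMissing = Class.new(StandardError)
SUBSCRIPTIONS = { 1 => "weekly-digest" }.freeze

def find_subscription!(user_id)
  SUBSCRIPTIONS.fetch(user_id) { raise SubscriptionMissing, "subscription missing" }
end

def unsubscribe(user_id)
  find_subscription!(user_id)
  :unsubscribed
rescue SubscriptionMissing => e
  # The action is no longer valid (e.g. a stale link): record it with
  # the user ID for later investigation, but do not fail the request.
  ErrorReporter.report(e.message, user_id: user_id)
  :already_unsubscribed
end
```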

00:09:08.780 Right now, we’re utilizing Sentry for monitoring our entire stack, which includes our front-end and back-end services. While monitoring errors is vital, having specialized projects for different services can help manage exceptions more effectively. We split web exceptions and background job exceptions into separate contexts.

00:09:30.950 This allows us to have targeted responses for different types of errors. Web exceptions often require quick resolution since they can directly affect user experience, whereas background jobs can be retried if they fail. We group similar exceptions together. By categorizing these exceptions into meaningful clusters, we can avoid flooding our logging systems.

00:10:08.300 Also, there's a common occurrence with exceptions like 'record not found' or 'record not unique' from Active Record, which can happen if multiple jobs operate simultaneously and cause race conditions. We developed a mechanism called 'handle conditions' to manage this. With it, we can ensure that we only log exceptions that represent true errors.
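
The talk does not show how this mechanism is implemented, so the following is only a guess at its shape. The module name `HandleRaceCondition`, the `max_retries` parameter, and the stand-in error classes (used so the sketch runs without Rails) are all assumptions. The idea is to retry the block a few times when a race-related error occurs, and re-raise only if it keeps failing:

```ruby
# Hypothetical sketch; not the speaker's actual implementation.
module HandleRaceCondition
  # Stand-ins for ActiveRecord::RecordNotUnique and
  # ActiveRecord::RecordNotFound so the example runs without Rails.
  RecordNotUnique = Class.new(StandardError)
  RecordNotFound = Class.new(StandardError)
  TRANSIENT_ERRORS = [RecordNotUnique, RecordNotFound].freeze

  # Runs the block, retrying up to `max_retries` times when one of the
  # transient errors is raised. After that, the error is re-raised so
  # that a persistent failure is still reported.
  def self.call(max_retries: 2)
    attempts = 0
    begin
      yield
    rescue *TRANSIENT_ERRORS
      attempts += 1
      raise if attempts > max_retries
      retry
    end
  end
end

# Usage: the first attempt loses the race, the second one succeeds.
calls = 0
HandleRaceCondition.call do
  calls += 1
  raise HandleRaceCondition::RecordNotUnique if calls == 1
  :saved
end
```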

00:10:39.050 We combine this process with Active Job handling on Sidekiq, which helps us manage background jobs effectively without getting overwhelmed by noise. Any time we encounter exceptions that we can easily fix, we fix them rather than allowing them to linger.

00:11:45.000 When delivering notifications about user actions, we also ensure they align with the correct context and do not overwhelm our exception reporting systems. This involves managing output from scheduled jobs effectively. During emergencies like outages, we focus on reducing overall noise, allowing real issues to stand out.

00:12:16.920 By using different layers of error-handling mechanisms, we were able to reduce the number of exceptions significantly. Our application is now checked against our expectations without overwhelming us with unnecessary error messages.

00:12:51.360 At the end of the day, simplifying our error management led us to a near-zero state for unhandled exceptions. When exceptions appear now, they often indicate real problems, enabling us to address them more effectively. Reducing exception noise allows for clear insights into the application's health.

00:13:30.720 In summary, having effective strategies for managing exceptions drastically decreases the workload and provides clarity. Thank you, and I'm looking forward to your questions!