How to get to zero unhandled exceptions in production

by Radoslav Stankov

In the talk titled "How to get to zero unhandled exceptions in production," Radoslav Stankov shares his insights from experience in managing exceptions within software applications. The presentation addresses the challenges developers face in handling exceptions and offers structured strategies to minimize these issues, aiming for a more robust and user-friendly application environment.

Key Points Discussed:
- Introduction of Personal Experience: Stankov opens by reflecting on the journey of starting new applications, likening the experience to acquiring a new car, yet acknowledging the confusion that can ensue as the project progresses.
- Understanding Exceptions: He compares unhandled exceptions to cute bunnies that gradually overwhelm an application, highlighting the necessity of having a structured approach to manage them.
- Process for Handling Exceptions: Stankov emphasizes the crucial need for an explicit process in tackling exceptions. He recounts creating a system called 'Happy Fridays' at Product Hunt, where developers dedicated one day a week to manage bugs and exceptions.
- Importance of Clear Ownership: He discusses the significance of ensuring clear ownership of exceptions, as a lack of it can lead to a backlog of unresolved issues.
- Effective Exception Handling Techniques: Stankov advises against generic error handling and promotes the practice of addressing specific exceptions intentionally, using logging to capture important information for troubleshooting.
- Monitoring Tools: The speaker highlights the value of using monitoring tools like Sentry to track exceptions efficiently, maintaining separate projects for various components of an application to distinguish between different types of errors.
- Communicating Exceptions: Setting up notifications via Slack to alert the team of new exceptions post-deployment is recommended to help prioritize responses effectively.
- Real-time Analysis: He emphasizes correlating code deployments with exceptions to spot issues early and how to manage noise in exception tracking to focus on significant problems.
- Practical Handling Examples: Stankov shares coding practices that help organize exception handling, recommending rapid retry methods for network-related errors without cluttering the error reporting process.
- Conclusion on Best Practices: The overall message advocates for a systematic approach to error handling, reinforcing the importance of logging errors, analyzing them, and consistently refining handling practices to maintain productive development environments.

Through structured processes, effective monitoring, and clear communication, teams can significantly reduce the occurrence of unhandled exceptions in production environments, leading to enhanced application reliability and user satisfaction. Stankov concludes by encouraging continued dialogue on improving exception management practices.

00:00:07.799 Living without exceptions is a challenge every time I start a new app. I always put my mental hat on and imagine my project.
00:00:15.120 It feels like getting into a nice new car. You start coding, getting into the groove, and following the motto of "move fast and have breakfast" because we all know that breakfast is the most important meal of the day.
00:00:27.800 However, as time goes on, your application starts devolving into something that's still recognizable, but you see patches and confusion—like a feeling of 'what the hell just happened here?'.
00:00:40.360 The thing with exceptions is that they are like cute bunnies that eventually cover your application until they suffocate you. You might find yourself staring at screens full of errors. Today, I'm going to talk about how to live without these exceptions.
00:01:11.680 My name is Radoslav Stankov, and I come from Bulgaria, a country known for its Windows error screens. I write a newsletter called 'Tips', and I am currently the CTO and co-founder of a startup called Angry Building. I hope to rename it to Happy Building once we accomplish our mission. I was previously the CTO of Product Hunt, and I'll be sharing tips from my experience in those two companies.
00:01:36.799 I like to include a lot of code in my presentations, and I've uploaded my code here because I often notice people taking photos and that I present too quickly. So, all the slides are available, and you can see the code and everything.
00:02:04.159 When I started at Product Hunt, there were only four developers. We realized that while nobody wants errors in their applications, it's crucial to have a structured process to manage them. We established an internal process called 'Happy Fridays'. As the company grew to eight developers, we developed another process called the 'Strike Team'.
00:02:30.200 The Strike Team consisted of junior developers responsible for fixing exceptions. Over time, as the company expanded to over 20 engineers, we introduced a designated 'Bug Duty' sprint, during which one or two engineers focused on fixing bugs and handling exceptions. At my current company, I'm the sole developer, so I only conduct Happy Fridays. I miss Strike Teams since it’s just a team of one.
00:03:07.440 Happy Fridays turned out to be an effective hack to manage and prioritize tasks. You work four days as normal, then on Fridays, you can focus on one of five tasks: fixing bugs, addressing exceptions, bumping dependencies, paying off technical debt, or catching up on projects. The key idea is to plan each sprint as if the week had only four days, so that Fridays stay free for this kind of work.
00:03:43.239 What I often did was pick exceptions to address. If you search GitHub, you'll see numerous pull requests dedicated to fixing exceptions left and right. The first takeaway today is that to establish a clear system for managing exceptions, you must have an explicit process outlining who fixes what, and when.
00:04:14.280 If no one owns the exceptions or if everyone feels responsible, they will just pile up. Another important aspect of working with exceptions is understanding that the Ruby exception system is well-designed, despite being somewhat dated. A good resource is a book that teaches us useful tricks about handling Ruby exceptions.
00:04:54.400 When you encounter an error in your code, you can wrap it in a single catch-all rescue. However, you should avoid this approach, because it swallows every kind of error and your application may start behaving erratically without you understanding the root of the issue.
00:05:17.479 Instead, I recommend handling specific errors explicitly. This way, if you encounter a specific issue, like a network service being unreliable, it's fine to explicitly catch that error and return nil.
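A minimal sketch of the difference, not code from the talk: the method below rescues only the failures expected from a flaky network call, logs them, and returns nil, while any other error still surfaces. The `exchange_rate` method and the two stand-in client classes are hypothetical names used for illustration.

```ruby
require "logger"
require "timeout"

LOGGER = Logger.new($stderr)

# Hypothetical wrapper around an unreliable network service.
# `client` is any object that responds to `fetch_rate(currency)`.
def exchange_rate(client, currency)
  client.fetch_rate(currency)
rescue Timeout::Error, Errno::ECONNREFUSED => e
  # Only the failures we expect from a flaky network are rescued here.
  LOGGER.warn("exchange rate unavailable for #{currency}: #{e.class}")
  nil
end

# Stand-in client that always times out.
class FlakyClient
  def fetch_rate(_currency)
    raise Timeout::Error, "service did not respond"
  end
end

# Stand-in client with a programming bug. The resulting NoMethodError
# is not on the rescue list, so it is not silently swallowed.
class BuggyClient
  def fetch_rate(_currency)
    nil.upcase
  end
end
```

Calling `exchange_rate(FlakyClient.new, "EUR")` returns nil and leaves a log line, whereas `exchange_rate(BuggyClient.new, "EUR")` still raises, so the bug reaches the error tracker.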
00:05:53.479 The key here is to be deliberate about handling errors. Make sure to log any relevant information, even if your customers don't see it, to help you understand what went wrong. Additionally, monitoring your application for exceptions is vital.
00:06:20.440 Having a robust monitoring system ensures you are aware of errors, and not just assuming that everything is fine because your systems show no exceptions. I personally like using Sentry for this purpose. It's a tool we’ve used effectively in high-scale applications over the last decade.
00:06:44.400 As a tip, I create multiple Sentry projects for different tools. For example, in my Rails applications, I maintain at least two projects—one for production web and another for Sidekiq. This distinction is useful because errors in the web layer can differ greatly from those in the Sidekiq background jobs. The former typically receive immediate attention, whereas you can afford to delay the latter.
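One way to route errors to separate Sentry projects is to pick the DSN by process type. The sketch below is an assumption about how this could look, not the setup shown in the talk; the environment variable names and the `sentry_dsn_for` helper are hypothetical.

```ruby
# Hypothetical environment variable names. Each one holds the DSN of a
# separate Sentry project.
DSN_ENV_KEYS = {
  web: "SENTRY_DSN_WEB",
  sidekiq: "SENTRY_DSN_SIDEKIQ"
}.freeze

# Returns the DSN for the given process type, or nil when it is not set.
# Raises ArgumentError for a process type that has no project.
def sentry_dsn_for(process_type, env = ENV)
  key = DSN_ENV_KEYS.fetch(process_type) do
    raise ArgumentError, "unknown process type: #{process_type.inspect}"
  end
  env[key]
end

# In a Rails initializer this could be wired up roughly as follows
# (assumes the sentry-ruby and sidekiq gems are loaded):
#
#   Sentry.init do |config|
#     config.dsn = sentry_dsn_for(Sidekiq.server? ? :sidekiq : :web)
#   end
```

Keeping the lookup in a plain function makes it easy to add a third project later, for example for a public API.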
00:07:24.279 It's also beneficial to organize your exceptions into specific buckets based on usage. For instance, when we had a public API at Product Hunt, its exceptions were monitored separately because they had different causes and resolutions compared to the main web application or Sidekiq.
00:07:59.240 You want to avoid being overwhelmed by a flood of exceptions where you won't know where to start fixing. Setting up a Slack channel to notify your team every time you deploy can help track new exceptions entering the systems as well.
00:08:36.640 When you discuss exceptions in this environment, people can react to new errors appropriately and prioritize what needs addressing. I've found new exceptions tend to be easier to solve, especially right after a deployment. Therefore, having a system that correlates code deployments with exceptions is invaluable.
00:09:10.960 As a great quote from Brian Kernighan and Rob Pike states, 'Exceptions are for exceptional situations.' They're not for regular occurrences or control flow. When the same types of errors occur frequently, we should determine whether this is noise or something to be addressed.
00:09:52.400 Managing the noise is important; you don’t want to flood your exception tracker with messages that don’t warrant immediate attention. Always ensure clarity by knowing whether certain exceptions, like 'Sidekiq shutting down', are significant enough to track.
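A small sketch of such a filter, under stated assumptions: the ignore list is illustrative, `noise?` is a hypothetical helper, and the `Sidekiq::Shutdown` class is defined here only as a stand-in so the example runs without the sidekiq gem.

```ruby
# Exception class names treated as noise. This list is illustrative only.
IGNORED_EXCEPTIONS = ["Sidekiq::Shutdown"].freeze

# True when the exception's class, or one of its ancestors, is on the list.
def noise?(exception)
  exception.class.ancestors.any? do |klass|
    IGNORED_EXCEPTIONS.include?(klass.name)
  end
end

# Stand-in for the class that the sidekiq gem defines.
module Sidekiq
  class Shutdown < Interrupt; end
end

# With sentry-ruby, the same list could be handed to the SDK instead:
#
#   Sentry.init do |config|
#     config.excluded_exceptions += IGNORED_EXCEPTIONS
#   end
```

Whatever mechanism is used, it helps to keep the list short and reviewed, so that a real problem is not filtered out by accident.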
00:10:33.440 It's about keeping the most relevant exceptions highlighted while filtering out those that don't require action. In development, you want to avoid introducing too much noise in your applications, as this can obscure critical issues. Now, let's shift gears and discuss some practical coding examples.
00:12:01.680 In my work, I often encounter various exceptions that I need to deal with. For example, when I first came across a specific error message and had to figure out how to fix it, I realized that sometimes the answers are easily available online. Nowadays, I often consult AI models like ChatGPT for quick solutions, which can save a significant amount of time.
00:13:20.919 Additionally, I have a practice of organizing my exception handling code into neat namespaces. In my projects, I create modules that encompass all my error handlers, making it easier to manage them efficiently. Keeping everything organized and neatly grouped helps to mitigate chaos in error handling.
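As a rough illustration of grouping handlers under one namespace, the sketch below uses a hypothetical `ErrorHandling` module. It is not the module from the slides, and the in-memory `reports` list only stands in for a real error tracker.

```ruby
# Hypothetical namespace that keeps every error handler in one place.
module ErrorHandling
  module_function

  # In-memory list standing in for an error tracker.
  def reports
    @reports ||= []
  end

  # Records the exception together with extra context.
  def report(exception, context = {})
    reports << {
      error: exception.class.name,
      message: exception.message,
      context: context
    }
    nil
  end

  # Runs the block. If one of the listed errors is raised, it is reported
  # and the fallback value is returned instead.
  def with_fallback(fallback, *error_classes)
    yield
  rescue *error_classes => e
    report(e, fallback: fallback)
    fallback
  end
end
```

A call site then reads as `ErrorHandling.with_fallback(0, ZeroDivisionError) { 1 / 0 }`, which makes every handled error easy to find with a single search.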
00:14:05.760 We might also encounter common errors like a method returning nil. To manage this, we need to determine the root causes, such as issues related to account subscriptions or misspecified data. When errors happen, we should remain calm and examine why the errors occurred, documenting any potential patterns that might arise.
00:14:52.100 One significant lesson is to never hide exceptions. If a network service cannot be accessed, and an error occurs, logging that is essential. The key takeaway is to analyze the reasons behind every exception efficiently.
00:15:36.320 For instance, in scenarios where users attempt to change subscriptions but don't have any, I inform them instead of letting an error bubble up without context. This allows users to see responses without seeing raw errors while still giving the development team the opportunity to troubleshoot proactively.
00:16:22.320 Managing exceptions gracefully can turn your application into a robust system that minimizes user frustration. It’s about building that safety net—ensuring users receive understandable errors while your team is notified in the background for resolution.
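The subscription case could look roughly like this. It is a sketch under assumptions: `MissingSubscriptionError`, `Account`, `change_plan`, and `change_plan_response` are hypothetical names, and the `reporter` lambda stands in for whatever notifies the team.

```ruby
# Hypothetical domain error for an account without a subscription.
class MissingSubscriptionError < StandardError; end

Account = Struct.new(:name, :subscription)

# Changes the plan, or raises when there is no subscription to change.
def change_plan(account, new_plan)
  if account.subscription.nil?
    raise MissingSubscriptionError, "#{account.name} has no subscription"
  end

  account.subscription[:plan] = new_plan
  { ok: true, message: "Plan changed to #{new_plan}." }
end

# Returns a response that is safe to show to the user. The error is still
# passed to `reporter`, so the team hears about it in the background.
def change_plan_response(account, new_plan, reporter: ->(error) { warn(error.message) })
  change_plan(account, new_plan)
rescue MissingSubscriptionError => e
  reporter.call(e)
  { ok: false, message: "You don't have an active subscription yet." }
end
```

The user sees a readable message in both cases, and only the expected domain error is translated; anything else still bubbles up.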
00:17:12.720 Another point to consider is during the deployment of updates, it's essential to capture how users interact with your application during this period. At times you might find errors you hadn't anticipated, and it's essential to analyze them systematically.
00:18:03.760 In dealing with races in processing, stay aware of how your jobs might interact with one another in quick succession. Multiple jobs being triggered for the same task can lead to conflicting states, generating setbacks. This is where implementing strategies to avoid race conditions becomes invaluable.
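One common way to keep two jobs from working on the same task at once is a lock keyed by the task. The sketch below is an in-memory illustration only; `TaskLocks` is a hypothetical class, and an application running several processes would need a shared mechanism instead, such as a database lock or a unique index.

```ruby
require "set"

# In-memory stand-in for a shared lock store.
class TaskLocks
  def initialize
    @mutex = Mutex.new
    @held = Set.new
  end

  # Runs the block only if nobody else holds the lock for `key`.
  # Returns :skipped when the lock is already taken.
  def with_lock(key)
    # Set#add? returns nil when the key is already present.
    acquired = @mutex.synchronize { @held.add?(key) }
    return :skipped unless acquired

    begin
      yield
    ensure
      @mutex.synchronize { @held.delete(key) }
    end
  end
end
```

A job that receives `:skipped` can simply finish without doing anything, because another job is already handling the same task.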
00:19:01.920 Let's also examine how job failures and retries work with Sidekiq or Active Job. Job recursion can lead to repeated failures if not handled correctly; having methods that effectively pull or retry jobs ensures that processes remain efficient.
00:20:00.160 I recommend having a uniform approach to handle network-related errors, ensuring that your jobs gracefully retry their attempts while providing helpful feedback to your systems. Should jobs continue to fail, keep a record so you can analyze patterns in resolution methods.
00:21:04.760 In managing network exceptions, for example, I use a method intended for rapid retries that are triggered upon failure. This method helps background processing run smoothly without cluttering your error reporting with transient failures.
00:22:00.400 As we tune our systems to handle these nuances, we still need reliable metrics and monitoring so that exception noise doesn't cloud the team's productivity and focus. Tailoring responses to the dynamic environment of development helps reduce confusion and maintain consistency.
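A rapid-retry helper could be sketched as follows. This is an assumption about the shape of such a method, not the code from the talk; `with_rapid_retries` and the `NETWORK_ERRORS` list are hypothetical.

```ruby
require "timeout"

# Errors treated as transient network failures. Illustrative list only.
NETWORK_ERRORS = [Timeout::Error, Errno::ECONNRESET, Errno::ECONNREFUSED].freeze

# Retries the block immediately, up to `attempts` times, on network errors.
# The last error is re-raised once the attempts run out, so the job system
# and the error tracker see one failure instead of one per attempt.
def with_rapid_retries(attempts: 3, on: NETWORK_ERRORS)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue *on
    retry if tries < attempts
    raise
  end
end
```

A job body can then wrap its network call in `with_rapid_retries { ... }` and leave slower, scheduled retries to the job framework.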
00:22:46.080 It’s imperative to be cautious about the common errors that can arise when working with asynchronous processing. For example, ensuring tasks are performed only after transactions are finalized prevents data consistency issues with job execution.
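The idea of running work only after a transaction has been finalized can be shown with a toy model. `ToyTransaction` below is a hypothetical illustration in plain Ruby; in a Rails application the same effect is usually reached with Active Record's `after_commit` callbacks.

```ruby
# Toy transaction that defers callbacks until the block finishes without
# raising. If the block raises, the callbacks never run.
class ToyTransaction
  def initialize
    @after_commit = []
  end

  # Registers work to run once the transaction has committed.
  def after_commit(&block)
    @after_commit << block
  end

  # Yields a transaction, then runs the deferred callbacks on success.
  def self.run
    tx = new
    result = yield tx
    tx.send(:commit!)
    result
  end

  private

  def commit!
    @after_commit.each(&:call)
  end
end
```

Enqueuing a job inside `tx.after_commit { ... }` means a worker cannot pick it up before the data it depends on is visible.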
00:23:33.680 Finally, it’s necessary to build upon a strong foundation of error handling measures in every aspect of your work. Keep refining the practices that help track, troubleshoot, and ultimately resolve the underlying issues—ensuring your applications run smoothly.
00:24:23.520 In summary, to live without unhandled exceptions, it’s crucial to explicitly define processes, utilize effective monitoring tools, review exceptions meaningfully, and involve your team actively in discussions. By investing time in building systematic approaches, you can avoid being overwhelmed by handling exceptions.
00:25:11.840 Thank you for your attention, and I'm eager to engage further with all of you. Let's keep the discussion productive!