The Arcane Art of Error Handling

by Brad Urani

The video titled "The Arcane Art of Error Handling" by Brad Urani, presented at RailsConf 2017, delves into the complexities of error handling in software development. In environments like Procore, where robust Rails applications are deployed frequently, managing unexpected errors effectively becomes crucial for customer satisfaction. The talk emphasizes the importance of adding contextual data to errors, designing error hierarchies, and developing reusable error handlers. Urani outlines several key principles and techniques for effective error handling, including:

Understanding Exceptions vs. Errors: Distinction between standard exceptions and errors in Ruby, with a focus on handling subclasses appropriately.
Avoiding Complexity: Cautions against over-engineering error handling solutions in simpler applications.
Creating Hierarchies: Discussing the implementation of custom error classes to improve context and control when handling failures in deep call stacks.
Recovery Strategies: Highlighting the differences between recoverable and unrecoverable errors, and techniques to manage control flow effectively during error scenarios.
Logging and Reporting: Encourages the use of error reporting systems like Bugsnag to ensure errors are logged and the context is retained for troubleshooting.
Clear Communication: Advocates for separating user messages from developer-facing error messages to improve user experience without leaking internal system details.

Urani uses real-world examples from Procore's complex software ecosystem to illustrate these points, especially in situations involving third-party service integrations. He emphasizes the need for appropriate user notifications and status codes in APIs to enhance the experience for both developers and end-users. The session concludes with the critical takeaway that a well-structured approach to error handling not only benefits developers by simplifying troubleshooting, but also ensures a pleasant experience for users, leading to overall happier business outcomes.

00:00:11.900 Hi everyone, thanks for sticking around this late. I'm not sure if you're actually staying to see me or if that's because this is the room where they're serving happy hour after my talk. Hopefully, it's for me! So thank you. My name is Brad Urani, I work at Procore.

00:00:22.920 We make software for the construction industry. Procore is a giant suite of tools for the industry, so many features that I don’t think one person knows them all. It’s one of the oldest, largest, and most mature Rails apps on earth, with about 2 million lines of code and around 90 to 100 people working on it every single day.

00:00:39.149 We deploy it four or five times a day and encounter a lot of errors. This isn’t because we’re not careful; it’s because we release a lot of really advanced business tools, and our customers constantly surprise us by using them in interesting and creative ways that we often can’t predict.

00:00:59.219 As a result, we often find ourselves shocked by some of the exceptions that pop up in our logs when customers do things we weren’t expecting. We have to react quickly because we care about our customers. They pay a lot of money for our software and expect us to be right there answering their support questions.

00:01:22.020 A lot of the content of this talk comes from techniques that I and our team have learned over the years to make errors more useful, create architectures that facilitate troubleshooting, and simply make our lives easier.

00:01:40.229 I really enjoy this subject. I know I'm a bit strange, but I get excited about error handling. Maybe it’s because when I go to sleep at night, at least I know that if something goes wrong, I will be able to troubleshoot it easily and figure out what happened.

00:02:06.180 By the way, this is RailsConf, right? Everyone makes their presentations funny, putting in lots of memes. I’m not very good at picking out memes; technical content comes naturally to me, but not humor. I put some funny memes in here, but I didn’t like them, so I took them out and put them all at the end. Stick around, and you’ll find a random selection of completely irrelevant memes.

00:02:28.410 Let’s see... this talk starts off with fundamentals, going through some basic and simple things, but it gets advanced pretty quickly. So if you're here for more advanced architectures, stick around. It ramps up.

00:02:41.910 A word of caution: when you start talking about what you can do with errors and exceptions (I’ll explain the difference in a second), it’s really easy to overdo it. I’ll show you a lot of tools and techniques that could add complexity to your application. Make sure any complexity you add is actually necessary.

00:03:00.570 If you're working with vanilla Rails, like MVC and a lot of CRUD operations, don't start using everything I’m showing you here. These architectures are best suited for reactive situations where you discover gaps in your knowledge or find that your error reports are coming through without the context you need.

00:03:24.450 I’ve seen this scenario play out with junior and senior engineers who get overly excited about the new power they’ve learned and create something elaborate that isn’t necessary. So, don’t overdo it.

00:03:42.720 By the way, most of these techniques apply to any language that supports exceptions. A lot of the best literature on this subject comes from the Java world, and it really originates all the way back to the C++ days.

00:03:56.160 To get started, here’s a simple example: in Ruby, raising an error is straightforward. You can simply pass a string to 'raise', which is equivalent to raising a 'RuntimeError'. In Ruby, errors form a hierarchy. At the top of the hierarchy, we have the 'Exception' class, and everything below it extends from 'Exception', including what we mostly deal with today: 'StandardError'.

00:04:15.660 I mentioned the difference between 'Exception' and 'Error.' The distinction is that the error classes we usually deal with are subclasses of 'Exception'. For those who are visually inclined, you can picture the hierarchy as a tree: 'StandardError' extends 'Exception', and many subclasses like 'ZeroDivisionError' extend 'StandardError'.
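The two points above (a bare string raises 'RuntimeError', and 'ZeroDivisionError' sits under 'StandardError' and 'Exception') can be checked directly in Ruby. This is a small illustrative sketch; the helper method names are made up for the example and do not come from the talk.

```ruby
# Returns the class of the exception produced by raising a bare string.
def class_of_string_raise
  raise "the nachos are cold"
rescue => e
  e.class
end

# Returns the class ancestry of an exception class, skipping mixed-in modules.
def exception_chain(klass)
  klass.ancestors.grep(Class)
end

puts class_of_string_raise                                # prints RuntimeError
puts exception_chain(ZeroDivisionError).first(3).inspect  # prints [ZeroDivisionError, StandardError, Exception]
```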

00:04:49.050 When rescuing, a bare 'rescue' with no class specified rescues only 'StandardError' and its subclasses. It will not rescue any other exception.

00:05:05.400 Never do this: rescuing from 'Exception' at the root of that hierarchy can capture all sorts of errors that you never meant to handle, such as the two described next.

00:05:23.820 'NoMemoryError' is raised when the process runs out of memory; rescue it, and you get to see the consequences of a program that keeps executing in that state. 'SignalException' (through its subclass 'Interrupt') is what gets raised when you press Control-C.

00:05:45.330 Rescuing from 'StandardError' avoids these pitfalls, allowing you to manage what errors you want to capture.
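A short sketch of why a bare 'rescue' is the safer default. It uses 'NotImplementedError' as a stand-in for the dangerous cases because it is easy to raise by hand and, like 'NoMemoryError' and 'SignalException', it does not descend from 'StandardError'. The method name 'rescued_by' is hypothetical, and the outer 'rescue Exception' exists only to show that the inner clause was skipped; it is not a recommendation.

```ruby
# Reports which rescue clause handled the exception class that was raised.
def rescued_by(error_class)
  begin
    raise error_class, "boom"
  rescue                  # bare rescue: StandardError and its subclasses only
    :bare_rescue
  end
rescue Exception          # demo only: proves the bare rescue did not match
  :escaped_bare_rescue
end

rescued_by(ZeroDivisionError)    # => :bare_rescue
rescued_by(NotImplementedError)  # => :escaped_bare_rescue
```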

00:06:01.520 You can make your own serious error, for example, 'NachosError,' by extending 'StandardError'. Inside the framework, and in popular gems, there is typically an exception hierarchy already created.
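As a sketch of what extending 'StandardError' might look like: 'NachosError' is the name used in the talk, while 'ColdNachosError', 'serve_nachos', 'order', and the temperature threshold are invented here purely for illustration.

```ruby
# Base class for this (imaginary) domain, as named in the talk.
class NachosError < StandardError; end

# Hypothetical, more specific failure underneath it.
class ColdNachosError < NachosError; end

def serve_nachos(temperature)
  raise ColdNachosError, "nachos served at #{temperature} degrees" if temperature < 60
  :served
end

# Rescuing the base class also catches every subclass.
def order(temperature)
  serve_nachos(temperature)
rescue NachosError => e
  "kitchen problem: #{e.message}"
end

order(70)  # => :served
order(40)  # => "kitchen problem: nachos served at 40 degrees"
```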

00:06:13.740 Let’s talk more about Rails. This is one area where I had a good meme, but I won't show it now. However, if you're thinking of having children, avoid it because you end up watching insufferable cartoons like Thomas the Tank Engine.

00:06:31.280 Yet, if you do have children, make sure they watch it because it's about trains on the Island of Sodor that cooperate. It’s a great parable for creating a well-functioning development team.

00:06:51.960 Here's a typical Rails controller, straight from the scaffold. We call 'user.save', which triggers validations. It can raise errors in scenarios you don’t expect, like when you can’t connect to your database. That kind of error can occur even without using 'save!'.

00:07:12.010 If you think you can just rescue everything with a catch-all, that’s an example of poor error handling: it hides the errors you did not anticipate. Rule number one is: don’t rescue just because you can. Sometimes, the best option is to let the error propagate and show the generic 500 error page.

00:07:52.400 Consider who your audiences are: the users, developers, and computers. For users, we need to show proper navigational error messages, redirect them accordingly, and ensure they have a good experience.

00:08:12.150 For developers, we want descriptive error messages to troubleshoot effectively, including additional metadata relevant to the error.

00:08:29.350 For computers, we must use status codes properly for APIs to facilitate programming against them.

00:08:44.580 Our goals include ensuring user-facing error messages are helpful. We should control status codes, add contextual data, and send notifications properly, whether through email, SMS, or Slack, ensuring they reach the right team.

00:09:14.970 This is our method again. There are problems if we rescue without understanding the context. If we simply swallow errors, we end up with no evidence of something having gone wrong, which makes troubleshooting much more difficult.

00:09:40.430 Do not swallow errors. Always ensure that if something bad happens, there’s traceable evidence of it in your system.
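One common way to leave traceable evidence is to log the error and then re-raise it, rather than rescuing and carrying on silently. The sketch below assumes Ruby's standard 'Logger' writing to an in-memory buffer; 'charge_card' and the constants are hypothetical names, and a real application would log to its normal log destination or an error reporting service.

```ruby
require "logger"
require "stringio"

LOG_OUTPUT = StringIO.new          # in-memory log, for demonstration only
LOGGER = Logger.new(LOG_OUTPUT)

def charge_card(amount)
  raise ArgumentError, "amount must be positive" unless amount.positive?
  :charged
rescue ArgumentError => e
  LOGGER.error("#{e.class}: #{e.message}")  # leave evidence behind
  raise                                     # re-raise the same error instead of swallowing it
end
```

Calling 'charge_card(-1)' still raises 'ArgumentError', but the log now contains a line recording what happened.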

00:10:05.050 Error reporting tools have greatly improved since the days of building solutions ourselves. They’ve become highly configurable and easy to use. For instance, we use Bugsnag at Procore, which has proved to be reliable.

00:10:30.600 They provide features such as graphs, charts, and excellent filtering capabilities. It’s easy to set up with minimal configuration. Most of them offer a free tier, so take advantage of that. However, understand the power-user features to unlock even more utility.

00:10:55.970 For your error classes, when you start your projects, I recommend creating a utility class with a method called 'handle'. Routing every error through that one method gives you a single place to hook in your error reporting solution.
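The talk does not show the body of this utility here, so the following is only one possible shape for it. 'ErrorHandler', its 'reporter' setting, and the 'context' argument are all assumptions; the default reporter just writes to standard error, and in a real application you would swap in a call to whatever reporting service you use.

```ruby
# Hypothetical single entry point for reporting errors.
module ErrorHandler
  class << self
    # A callable taking (error, context); replace it to plug in a real service.
    attr_writer :reporter

    def reporter
      @reporter ||= ->(error, context) { warn "#{error.class}: #{error.message} #{context}" }
    end

    # Every rescued error goes through here, so reporting lives in one place.
    def handle(error, context = {})
      reporter.call(error, context)
      nil
    end
  end
end
```

A caller would then write something like 'rescue SomeError => e' followed by 'ErrorHandler.handle(e, { user_id: 42 })', and changing the reporting tool later means editing only this module.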

00:11:26.090 Most systems report severity levels, which can be useful for filtering alerts. In the reporting systems, you can configure what alerts you want and their severity to avoid being overwhelmed by notifications. Customize severity levels to ensure you only get alerted about critical issues.

00:11:50.230 Also, having a way to add metadata for filtering will enhance your team's ability to respond to alerts effectively, letting you adjust notifications to avoid unnecessary noise.

00:12:16.890 Adding contextual data and flexibility in notifications can help your teams significantly by targeting alerts to the right audiences. It’s a very powerful feature for controlling how your team engages with reported errors.
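One way to make severity and metadata available to a reporting tool is to carry them on the error object itself. This is a sketch under assumptions: 'ApplicationError', 'ThirdPartyTimeoutError', 'notification_for', the severity symbols, and the paging rule are all invented for illustration and are not the API of any particular reporting service.

```ruby
# Hypothetical base class that carries metadata and a default severity.
class ApplicationError < StandardError
  attr_reader :metadata

  def initialize(message = nil, metadata: {})
    super(message)
    @metadata = metadata
  end

  def severity
    :error
  end
end

# A noisy but non-critical failure can lower its own severity.
class ThirdPartyTimeoutError < ApplicationError
  def severity
    :warning
  end
end

# Builds the data a reporting tool could filter and route on.
def notification_for(error)
  {
    severity: error.severity,
    metadata: error.metadata,
    page_on_call: error.severity == :error  # assumed rule: only page for :error
  }
end
```

With this in place, a timeout from an external service can still be recorded with its context while only errors of severity ':error' trigger an alert.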

00:12:38.110 Additionally, creating custom error classes can significantly enhance your application architecture's power, especially in scenarios with deep call stacks. For example, in an online shoe store, checking inventory may involve multiple layers of processing.

00:12:55.780 If there’s an issue when a customer tries to purchase an item that’s out of stock, we need to communicate that error up to the controller level from deep in the call stack.

00:13:12.740 Using exceptions appropriately allows us to raise errors deep in the call stack and then catch them at the top, where we can effect change, like showing user messages.
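The shoe store scenario can be sketched in plain Ruby. Everything here is hypothetical: the method names stand in for layers of a real application, 'purchase_action' plays the role of a controller action, and the 409 status is just one plausible choice for an out-of-stock response.

```ruby
class OutOfStockError < StandardError
  attr_reader :sku

  def initialize(sku)
    @sku = sku
    super("SKU #{sku} is out of stock")
  end
end

INVENTORY = { "red-sneaker" => 0, "blue-boot" => 3 }.freeze

# Deepest layer: the only place that knows about stock levels.
def check_inventory(sku)
  raise OutOfStockError.new(sku) if INVENTORY.fetch(sku).zero?
end

# Middle layers do not rescue; the error simply passes through them.
def reserve_item(sku)
  check_inventory(sku)
  :reserved
end

def place_order(sku)
  reserve_item(sku)
end

# Top layer: the one place that can decide what the user sees.
def purchase_action(sku)
  place_order(sku)
  { status: 200, message: "Order placed" }
rescue OutOfStockError
  { status: 409, message: "Sorry, that item is out of stock" }
end

purchase_action("blue-boot")[:status]    # => 200
purchase_action("red-sneaker")[:status]  # => 409
```

Note that the intermediate methods contain no error handling at all, which is the benefit of letting the exception carry the failure up the stack.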

00:13:41.100 This raises the question of control flow: do we want to affect the user experience and provide them useful feedback? By allowing these errors to bubble up, we can control how the application behaves when errors occur.

00:14:02.920 Finally, remember that if there’s ambiguity in user-facing messages or if they could expose sensitive information, avoid using exception messages directly. This is especially important in production environments.
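One way to keep the two audiences apart is to store a separate, safe message for users alongside the detailed message for developers. In this sketch, 'UserFacingError', 'DEFAULT_USER_MESSAGE', and 'message_for_user' are illustrative names, not something shown in the talk.

```ruby
DEFAULT_USER_MESSAGE = "Something went wrong. Please try again.".freeze

# Hypothetical error that keeps the user message apart from the developer message.
class UserFacingError < StandardError
  attr_reader :user_message

  def initialize(developer_message, user_message: DEFAULT_USER_MESSAGE)
    super(developer_message)      # detailed text, for logs and error reports
    @user_message = user_message  # safe text, for the UI
  end
end

# Never falls back to error.message, which may contain internal details.
def message_for_user(error)
  error.respond_to?(:user_message) ? error.user_message : DEFAULT_USER_MESSAGE
end
```

An unexpected 'ActiveRecord' or database error passed to 'message_for_user' would yield only the generic text, while the full message remains available for logging.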

00:14:24.960 For authorization errors, creating a robust hierarchy can be essential for managing permissions across your application. This allows you to recover gracefully from errors that may arise from all over the application.

00:14:41.680 For instance, instead of failing with a generic error, you can route users to specific guidelines related to their situation by raising contextual errors.

00:14:55.410 Creating a custom exception tree can help gather information that allows you to tailor your responses and improve workflows based on different error types.
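A sketch of how a small authorization hierarchy lets one handler respond differently by error type. The class names, the wrapper method, the redirect paths, and the status codes are all assumptions made for this example.

```ruby
class AuthorizationError < StandardError; end
class NotLoggedInError < AuthorizationError; end
class InsufficientPermissionsError < AuthorizationError; end

# Wraps any block of application code; the most specific rescue comes first.
def with_authorization_handling
  yield
rescue NotLoggedInError
  { status: 401, redirect_to: "/login" }
rescue AuthorizationError
  # Catches InsufficientPermissionsError and any subclass added later.
  { status: 403, redirect_to: "/access-denied" }
end
```

Because the second clause rescues the base class, a new kind of authorization failure raised anywhere in the application gets a sensible default response without changes to the handler.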

00:15:23.030 By enhancing our error handling system, we've established a structured way to manage errors and to trace them forwards and backwards while working toward a resolution.

00:15:48.220 The overall point of this is to provide meaningful interactions for your team and end-users by establishing a robust error handling strategy that scales.

00:16:12.780 Wrapping errors correctly allows us to add context and clarity. For instance, when working in API-driven architectures, we can derive better status codes and user-friendly messages.
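Ruby supports wrapping directly: when you raise a new error from inside a 'rescue' block, the original error is attached automatically and is available through 'Exception#cause'. The sketch below assumes a hypothetical 'PaymentGatewayError' with an invented 'status_code' method, and uses 'IOError' as a stand-in for whatever low-level failure a real client library would raise.

```ruby
# Hypothetical domain error that knows which HTTP status it maps to.
class PaymentGatewayError < StandardError
  def status_code
    502
  end
end

# Stand-in for a low-level failure in a third-party client.
def call_gateway
  raise IOError, "connection reset by gateway"
end

def charge
  call_gateway
rescue IOError
  # Raising inside rescue keeps the original error reachable via #cause.
  raise PaymentGatewayError, "could not charge card"
end

begin
  charge
rescue PaymentGatewayError => e
  puts "#{e.status_code}: #{e.message} (caused by #{e.cause.class})"
  # prints 502: could not charge card (caused by IOError)
end
```

The controller only needs to know about 'PaymentGatewayError' and its status code, while the error report can still include the underlying cause.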

00:16:34.000 To summarize what we've accomplished: we've looked at how enhancing user-facing messages improves user experience, we gained control over HTTP status codes, and we learned to gather and apply contextual metadata in handling errors.

00:16:54.630 We’ve changed how notifications are sent to ensure the right team gets the right message at the right time. By building effective reporting structures, we can target alerts more effectively.

00:17:16.430 All this can tie into logs that help with tracking errors, offering a comprehensive view from user impact to developer needs.

00:17:38.040 In project implementations, you might want to visualize complex interactions within your error-handling structure for clarity, as this can get complicated.

00:18:01.050 For further reading, I can share some great blog posts that delve deeper into error handling and reporting tools, providing additional capabilities.

00:18:21.840 I’m Brad Urani. Follow me on Twitter. I love to connect with professionals on LinkedIn, and I’m also on Mastodon for those on decentralized social networks.

00:18:37.220 I work in Santa Barbara at Procore, a growing company with an incredible team.

00:19:01.260 I invite each of you to check out my colleague Derek’s talk tomorrow about building a powerful API structure.

00:19:19.060 Now, here are some unrelated memes as promised! They are completely irrelevant, but you've made it this far!

00:19:45.880 Thank you for your time, and I would love to take questions afterwards.