Fix Production Bugs Quickly - The Power of Structured Logging in Rails

by John Gallagher

In the presentation "Fix Production Bugs Quickly - The Power of Structured Logging in Rails," John Gallagher discusses enhancing observability in Rails applications to expedite the resolution of production bugs. The talk highlights the challenges engineers face when troubleshooting issues, framed around an anecdote of a critical production incident involving password reset emails at his company, Bigger Pockets. Gallagher underscores the importance of structured logging, which is gaining momentum with Rails 8 and significantly improves on traditional text-based logs. He outlines five critical steps for implementing structured logging:

  • Identify the Problem: Defining what data needs to be collected to address specific questions, such as identifying delays in job processing.
  • Data Collection Criteria: Gallagher emphasizes the importance of determining the type of data (e.g., job duration, event types), stressing that visibility into application performance is crucial.
  • Implementing Structured Logging: He compares traditional logging to structured logging, advocating for structured formats that facilitate better searching and filtering.
  • Choosing a Logging Library: Gallagher recommends using Semantic Logger due to its robust features and compatibility with Rails, providing logged data in a structured JSON format.
  • Sending Data to Monitoring Tools: He details the process of configuring logs to integrate with observability tools like Rollbar and Honeybadger, illustrating how to utilize logs for proactive error tracking.

Throughout the talk, Gallagher reflects on the emotional aspects of engineering—frustration, embarrassment, and self-doubt—that arise from unresolved bugs and system outages. He encourages developers to view their struggles with observability as common challenges rather than personal failures. Ultimately, the session advocates not only for technical solutions to improve logging but also for a mindset shift towards proactive observability that can prevent future issues before they impact users.

00:00:09.920 Okay, so our third speaker, John, will tell us about observability in Rails. It's a really cool topic, and I'm really curious about it, so let's welcome John. Thank you, and good luck.
00:00:39.079 So, I'm going to talk about how to fix production bugs quickly and the power of structured logging. I work for a company called Bigger Pockets, which operates in the real estate sector. The following presentation is based on a true story, though names have been changed to protect the innocent.
00:00:55.320 So, stop me if this sounds familiar: I'm on call on a Friday afternoon and I get a Slack notification from the support channel. Something is going wrong. Oh, and another notification—it looks like password reset emails aren't working. Another alert pops up, and it looks like an emergency. So, let's search the logs.
00:01:21.040 Alright, the code looks fine, which doesn't help much. Twenty minutes have passed since the incident, and our CTO, James Greene (not his real name), is getting a little anxious. He's escalating it to the board, and people are starting to complain on Twitter.
00:01:32.079 Oh dear, nobody can get into their accounts; they're all locked out. I decide to pair with a colleague, and under great pressure, we discover that the emails are using background jobs. Let's add some logging in background jobs. Excellent.
00:02:03.520 While we wait for the app to deploy, it fixes itself. That's interesting—we'll handle it later, right? We’ll just put it in the backlog. And we all know what happens to things in the backlog, don't we? They rarely get done.
00:02:22.000 So, was this just a one-off incident? Maybe not, as it’s happening again. That's okay; we added the extra logging. Fine, let’s search the logs again. Still nothing. Let's take a look at the Sidekiq admin dashboard, and we see there are loads of processed jobs. But how does that help us? It really doesn't.
00:02:41.360 We found ourselves in a vicious cycle. We would add ad-hoc logging all over the place, deploy it, wait for the incident to happen again, and then go around the loop once more. We were wasting time, it was embarrassing for the company, and all in all, it was not good.
00:03:05.640 I put out a survey because I wondered if this was just me. I asked 50 engineers a question, and the results shocked me: 30% of people lack visibility into their apps on a weekly basis. This means people are struggling like we were at least once a week, and 54% report this every month. Wow, so perhaps it’s not just us.
00:03:49.680 Now, I want to discuss the feelings we experience as engineers. We don’t typically like to talk about our feelings much, but first, I get really annoyed. Seriously, it's 2024—how can it be this difficult? Then I feel a bit sad. I’m just going around in circles, looking at the stack trace, still with no clues.
00:04:15.479 As time passes—an hour, two hours—I still haven't solved the problem. There’s an outage, and it’s really critical. Then I start to wonder if maybe the problem is me. Wait a second! I’m an experienced engineer with 10 years under my belt. How can I go to my boss and say, 'I'm sorry, I can't fix this'? That wouldn't look good.
00:04:50.120 But I have good news for you: the problem isn’t you, and you're not alone if you’ve ever felt this way. At Bigger Pockets, I experienced this firsthand, and I came up with five steps to implement a solution. This is what we've been doing for the last two years to break this cycle.
00:05:10.680 The question you want answered is the starting point. We need to decide what data to collect, build the instrumentation to gather it, use that data, and then improve upon it. I call it the steps to observable software or SOS; it’s somewhat of a backronym. Let’s walk through these five steps using an example.
00:05:48.600 First, we need to ask what question we want answered. Why are password reset emails not being sent? That’s pretty general. We need to drill down into a more specific question. We formulate some hypotheses: is it failing, timing out, or is it delayed?
00:06:07.440 Well, we don’t have any errors in the logs, so it’s not failing. If it were timing out, we’d probably see errors. So, I'm guessing it’s delayed. To figure this out, we need to ask ourselves a more specific question: which jobs are taking the most time within the same queue?
00:06:38.120 All emails go into this five-minute queue in Sidekiq. To answer that question, we need data. We can't guess or divine this from the codebase.
00:06:49.360 I use four criteria to determine what data to collect: the event, the filter, the grouping, and the value. First, what event do we want? We want to know whenever a job is performed. Then we want to filter by jobs performed in a particular queue, group by job class, and analyze the duration values.
00:07:02.000 The hypothesis is that we have all these mailers queued in Sidekiq, but there might be a two- to three-hour delay because the queue is filling up with lots of other jobs before these messages are sent. That could explain why it resolves itself after a while.
00:07:12.560 In an ironic twist, when I booked my flight to this very conference, I encountered a similar issue. The flight confirmation said, 'We've sent your account confirmation,' but without that, I couldn't do anything on their site. I waited for an hour, two hours, went to bed, and when I got up the next morning—four hours later—the email finally landed.
00:07:33.200 So, let’s build some instrumentation to answer those questions. I want to cover four different topics: what data type to use, whether to use plain text or structured logging, which logging library to use, and what customizations need to be made.
00:07:50.600 Now let’s talk about structured logging. You can probably guess the first two answers, but I want to explain why. These are often discussed as the three pillars; however, they are really three data types: traces, logs, and metrics.
00:08:06.240 Traces are the most useful, followed by logs, and then metrics seem to be the least useful. While there are trade-offs, structured logging is on the rise, especially with Rails 8. Logging is fairly familiar to most engineers, so let’s discuss logs and why we prefer structured over plain text.
00:08:27.680 Here's what plain text logs look like in code: logger.info with a string that usually includes the job class and the duration. In observability tools such as Sentry, New Relic, Airbrake, Honeybadger, and Rollbar, if we wanted to look at our Action Mailer jobs, we’d have to use a regular expression.
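To make that concrete, here is a minimal sketch of the kind of plain-text job logging being described (not the exact code from the slides); every useful value is baked into one string, so the only way to query it later is a regular expression:

```ruby
# A plain-text log line for every performed job: everything ends up in one string.
class ApplicationJob < ActiveJob::Base
  around_perform do |job, block|
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    block.call
    duration_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round

    # Finding "ActionMailer jobs slower than 500ms" now requires a regex like
    # /job=ActionMailer::\S+ .* duration=\d{3,}ms/ in the observability tool.
    Rails.logger.info(
      "Performed job=#{job.class.name} queue=#{job.queue_name} duration=#{duration_ms}ms"
    )
  end
end
```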
00:09:02.959 Now, searching for logs with a duration of more than a certain amount becomes impossible. What we really want is structured data attached to the logs. This is akin to how we can filter, search, sort, and group fields in a database.
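As a rough sketch of the structured alternative (assuming a logger that accepts a message plus a payload hash, as Semantic Logger, discussed shortly, does), the same event becomes fields the tool can index:

```ruby
# The same event as structured data: each field can be filtered,
# grouped, and aggregated without regular expressions.
class ApplicationJob < ActiveJob::Base
  around_perform do |job, block|
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    block.call

    Rails.logger.info(
      "Performed job",
      job_class:   job.class.name,   # group by this
      queue_name:  job.queue_name,   # filter by this
      duration_ms: ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round
    )
  end
end
```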
00:09:21.120 This is great; we now can specify everything we want to see with a specific job class within a specific queue. Awesome! Now finally, let's discuss which logging library to use. Here are my four criteria.
00:09:39.839 While the third criterion, which is great documentation, is very subjective, I've set high standards. I also removed any logging library that hadn't seen a commit in the last year or anything that looked fairly new.
00:09:56.080 Here are the choices you have. Before I criticize any of these gems, I want to give a big shout-out to anyone doing open-source work; it's a really difficult and often thankless task.
00:10:10.720 That said, according to these criteria, Semantic Logger comes out as essentially the winner. Here's how we can install it: no surprise there. It also has Rails bindings, so we can install those as well. Set the format to JSON so that our observability tools can ingest structured logs.
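In practice the setup is roughly this (a sketch; check the Rails Semantic Logger README for the options that match your version):

```ruby
# Gemfile
gem "semantic_logger"
gem "rails_semantic_logger" # Rails bindings: replaces Rails.logger and the framework log subscribers

# config/environments/production.rb
Rails.application.configure do
  # Emit one JSON document per log line so observability tools can ingest structured logs.
  config.rails_semantic_logger.format = :json
end
```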
00:10:24.680 Semantic Logger gives you standard features out of the box. While it's not amazing, it’s a decent starting point. These are libraries it hooks into, and I want to give you a brief tour of what it provides in your logs.
00:10:48.760 This is what you get after an action completes in the action controller. You can see the duration, which is critical, along with the controller and action names. You’ll see standard attributes for HTTP requests, which is useful.
00:11:06.839 Now, there’s a bit of a caveat: Action Mailer sends unfiltered emails to your logs, which is not advisable. You don’t want personally identifiable information in your logs, so this needs filtering.
00:11:25.200 The attributes provided can save your life several times over—the mailer that was called, the action taken on it, and the arguments passed to that mailer, such as the invoice ID. This allows you to search through your logs more effectively.
00:11:51.919 For Action View logs, there isn’t much to see, but Active Record does log every SQL request. Those who attended the performance workshop know what I’m referring to. The highlight of this presentation is the Active Job section, where you can see the queue, job class, and the event name, which is essentially 'job performed.'
00:12:12.839 All of this sounds good, right? We just add Rails Semantic Logger and everything works perfectly! Well, not quite. It turns out this area, where convention over configuration is applied, hasn't been adequately addressed for structured logging.
00:12:30.960 There are still lots of missing attributes, and API requests aren’t instrumented out of the box. I could rant about this for much longer, but let’s focus on three key issues.
00:12:57.180 First, we have no conventions. One codebase uses job.class, another uses job_class or class_name; anybody can use anything, and we all know how that turns out. This leads me to introduce OpenTelemetry.
00:13:19.919 OpenTelemetry is an open-source project and one of the fastest-growing in that space, backed by the Cloud Native Computing Foundation. They’ve worked hard over the last five years to standardize many elements, and there’s a Ruby library available for it.
00:13:38.039 Sadly, the library is still immature, needing significant maintenance and support. Yet, we can leverage their semantic attributes. They specify which attributes should have standard names—like calling a job class by a more human-friendly name in your observability tool.
00:13:58.640 To ensure that we produce the kind of structured data we want, we may need to perform some hacks. We'll find the log subscriber for Active Job inside Rails Semantic Logger and copy it into our project, renaming it as necessary.
00:14:11.680 We’ll replace this payload event with a simple Ruby hash that maps a few things. This is the outcome we desire. Next, let’s address missing attributes.
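Before moving on to missing attributes, here is a stripped-down, hypothetical version of that subscriber swap; the exact code depends on the Rails Semantic Logger version you copy from, but the idea is to log a plain hash keyed by OpenTelemetry semantic attribute names instead of the raw notification payload:

```ruby
# app/subscribers/active_job_log_subscriber.rb
# Sketch only: map the ActiveSupport::Notifications payload for "perform"
# events to OpenTelemetry-style attribute names before logging.
class ActiveJobLogSubscriber < ActiveSupport::LogSubscriber
  def perform(event)
    job = event.payload[:job]

    logger.info(
      "job performed",
      "code.namespace"             => job.class.name,
      "messaging.destination.name" => job.queue_name,
      "messaging.system"           => "sidekiq",
      "duration_ms"                => event.duration.round(1)
    )
  end
end

# Attach ours after detaching the subscriber that ships with the gem
# (the exact class to detach depends on the gem version you copied from).
ActiveJobLogSubscriber.attach_to :active_job
```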
00:14:34.200 We have a number of attributes that are missing and quite a few issues—some are very basic. For instance, let's discuss HTTP headers and the user agent.
00:14:54.039 Here’s the standard name for the HTTP request header key—it's the standard OpenTelemetry name for the user agent. We can add this to our application.rb to attach these to every HTTP request.
00:15:15.680 There are two different styles here, a Lambda style and a straightforward hash style, and you can see our extra attributes are now included in the logs.
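A minimal sketch of what that can look like with Rails Semantic Logger's named log tags (the exact OpenTelemetry key for the user agent depends on which version of the semantic conventions you follow):

```ruby
# config/application.rb
module MyApp
  class Application < Rails::Application
    # Named tags are attached to every request's log lines.
    config.log_tags = {
      request_id:              :request_id,                          # hash style: calls request.request_id
      "user_agent.original" => ->(request) { request.user_agent }    # lambda style
    }
  end
end
```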
00:15:34.640 Now, let's talk about API requests. I’m going to simplify this a bit and assume we’re using Faraday, which we all know is the best HTTP client.
00:15:57.560 Here’s a bit of middleware we can create in Faraday. From there, we can initiate our API calls using this invented URL, and we can add a response middleware.
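Something along these lines, assuming Faraday 2's middleware API and an invented example URL (the attribute names again loosely follow OpenTelemetry conventions):

```ruby
require "faraday"

# Logs one structured event per outbound API call once the response completes.
class StructuredLoggingMiddleware < Faraday::Middleware
  def on_complete(env)
    SemanticLogger["Faraday"].info(
      "http request",
      "http.request.method"       => env.method.to_s.upcase,
      "url.full"                  => env.url.to_s,
      "http.response.status_code" => env.status
    )
  end
end

# Usage with an invented URL:
conn = Faraday.new(url: "https://api.example.com") do |f|
  f.use StructuredLoggingMiddleware
end
conn.get("/listings")
```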
00:16:13.640 Here’s how we can tackle many of the weaknesses. I wanted to whip through how I've addressed these issues. There’s much more beyond this, so let’s move on to the next step: sending data to our observability tool.
00:16:30.720 Now, we’re logging all these structured JSON logs internally, but how do we send them? There are three sponsors of this talk: AppSignal, Honeybadger, and Rollbar.
00:17:01.680 Here’s how it looks for Rollbar. In Semantic Logger, you have these things called appenders, which serve as destinations for logs, making it easy to send logs to multiple destinations.
00:17:16.840 You can see I’m subclassing one of the out-of-the-box appenders at the top and tweaking a little bit of code. Here's how it looks in the wild.
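The talk subclasses one of the built-in appenders; as a rough, vendor-neutral sketch, registering an HTTP appender looks something like this (the endpoint below is a placeholder, not Rollbar's documented ingest API):

```ruby
# config/initializers/semantic_logger_appenders.rb
# Appenders are log destinations; several can be registered at once.
# The URL is a placeholder, not any vendor's documented ingest endpoint.
SemanticLogger.add_appender(
  appender: SemanticLogger::Appender::Http.new(
    url:       "https://logs.example-vendor.com/ingest",
    formatter: :json
  )
)
```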
00:17:36.240 The same approach applies for Honeybadger. Here’s a mockup of how the logs look in Honeybadger. And finally, for AppSignal, it’s a similar setup.
00:17:54.800 That was a lot of information! I’ve added links to the slides and various resources at the end of this talk, so don’t worry if you missed anything.
00:18:06.960 Next, let's deploy that and use graphs, which I believe is the best part. Logging is fine, but graphs really highlight the important data.
00:18:31.680 So, as a reminder, this is the kind of log data we want, filled with all our new shiny attributes. You can see the event name, code namespace, and job class. We will filter by queue: this contains all Sidekiq jobs within a five-minute queue.
00:18:49.840 We can change the range to the last hour, group by the job class, and voilà! We have our data—pretty cool!
00:19:14.879 Some tips for using graphs: always check the time range. I can't tell you how often I’ve pulled my hair out asking why I can't find something, only to realize I was searching for records from two days ago instead of the last five minutes.
00:19:30.759 Also, don't go to the search and just start typing regular expressions and freeform text; this is guaranteed to throw you into a doom loop. Try to find the exact attribute in your tool’s user interface and then filter by that.
00:19:50.360 Go wide at first, and then gradually narrow down your search. These tips have been immensely beneficial to me in the last couple of years.
00:20:06.240 The final step is to improve. I tend to be critical about my work, often sorting through it meticulously and perhaps adding another bit of instrumentation.
00:20:25.680 After deploying and making tweaks, once I’m satisfied, I show it to my colleagues. Generally, they say they don’t care about that, but they would love to know about things like test suite speed.
00:20:41.840 Recently, someone raised concerns about the test suite's speed, so within four or five hours, we added custom logging to our test suite. We learned that 78% of our test suite runtime resulted from 17% of the tests, and surprise, surprise, those were all system specs!
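The talk doesn't show the test-suite instrumentation itself; one plausible sketch, assuming RSpec and Semantic Logger, is an around hook that logs each example's duration as structured data:

```ruby
# spec/support/structured_timing.rb
RSpec.configure do |config|
  config.around(:each) do |example|
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    example.run
    duration_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round

    SemanticLogger["RSpec"].info(
      "example finished",
      file:        example.metadata[:file_path],
      type:        example.metadata[:type],      # :system, :model, :request, ...
      duration_ms: duration_ms
    )
  end
end
```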
00:21:03.080 There are many additional metrics we could include, such as request ID, IP, job latency, and a variety of other facets. You get the idea; at Bigger Pockets, we've encountered this cycle many times.
00:21:24.600 Let’s revisit the original outage with the tools we have today. This time, it's a Friday morning, not afternoon. I’m still on call, but I receive a notification from our monitoring software saying that a specific queue, which should complete in 5 minutes, is now taking too long.
00:21:53.680 This notification comes before any issues arise from customers. We can click a link and see the job performance metrics. At the top, we see job performed, and it's grouping by the messaging destination name or queue name, indicating that our 5-minute queue has a total duration of 50 minutes.
00:22:19.840 That's not good! Let’s figure out which jobs are taking the most time by filtering all jobs performed. We group the results by job class and compute the total duration.
00:22:43.840 The results show a few seconds for analytics update, and the user visit job seems to be an odd one. Let’s dive deeper and examine what’s