Ask your logs

by Youssef Boulkaid

In the talk titled "Ask your logs," Youssef Boulkaid discusses the importance of structured logging in Ruby on Rails applications and how it can significantly enhance the insights derived from logs. The session is aimed at engineers who wish to improve their understanding and utilization of logs to address common issues faced when analyzing unstructured data.

Key points discussed include:

Logging Issues: Youssef recounts an experience where his team faced a CPU usage crisis due to over 800,000 jobs in progress, highlighting the inadequacies of unstructured logs, which are hard to analyze and lack essential insights.
Structure Matters: The talk emphasizes transforming logs from unstructured text into structured data, which allows for effective querying and analysis. The use of gems like Lograge and Semantic Logger are recommended for consolidating logs into single-line JSON documents.
Questions Over Answers: Youssef argues that formulating the right questions is more important than deriving answers. Questions should focus on user actions, job queues, API endpoint usage, and performance impacts of feature flags.
Four Problems with Current Logs: Boulkaid identifies four main issues with existing logs: they are multi-line, unstructured, lacking essential data, and uncorrelated across systems. He provides solutions to each of these problems to enable better log management and data extraction.
Log Correlation: The importance of correlating logs from various systems is underscored, particularly in distributed environments where multiple services interact. Use of correlation IDs to track requests is shared as a practical solution.
Insight Derivation: The talk transitions to how structured logs can be analyzed using tools like Kibana to gain actionable insights, demonstrating how to ask specific questions about user actions and job processing.
Broader Context: Boulkaid points out that structured logging isn't a new idea. Companies like Heroku, Stripe, Shopify, and others have recognized its value for years, yet many in the Ruby community have yet to embrace this approach.

The main takeaway from the presentation is that adopting structured logging practices can transform how developers perceive and interact with logs, enabling them to extract valuable insights quickly and efficiently. Boulkaid encourages the audience to start implementing structured logging in their Rails applications to ask meaningful questions and carry out better data analysis.

00:00:04.480 Welcome to the first talk after lunch! I wonder how many of you are energized and how many are still sleeping. All right, let's do this. I want to start with a story. Our story begins a couple of years ago when I was part of a team working on a small internal application at a previous company.

00:00:14.360 One day, we were paged by our infrastructure team because our worker servers were at 100% CPU usage and had been that way for hours. Something was wrong. This was related to our workers, so of course, the first thing we did was open our Sidekiq dashboard.

00:00:28.000 What we saw shocked us: over 800,000 jobs in progress. Considering this was a small internal app, this was completely unexpected. We fixed the issue—I won’t go into details here—but we wanted to figure out why this had happened to prevent it from occurring again. Someone on the team asked the natural question, 'Hey, can we check our logs?' This is where my thoughts immediately went...

00:00:50.520 We do have logs, but they are just piles of raw text that are hard to use. Finding anything in them was painful and far from the best experience. The second thought that crossed my mind was an article I read a couple of months prior, and it came back to haunt me. Our logs are basically trash. This realization hit too close to home. I kept thinking, why are our logs trash? Why can’t we use them to get actionable insights? They are considered trash precisely because we can’t do that.

00:01:59.679 What that article meant was that our logs were not structured data; they were just piles of raw text with no value. However, that got me thinking: what if they could be structured? What if instead of unstructured text, we could treat our logs like data and extract valuable insights from them? Hi, my name is Youssef, and I want to talk to you about turning logs into data.

00:02:20.080 I am a senior engineer at Workato, where we create a product for companies to automate their business processes without writing any code. Like all of you, I love writing Ruby. I've been coding in Ruby almost my entire professional career, and I especially enjoy working on developer tools because I believe that coding should never be a chore. It’s a joy to make developers more productive.

00:02:36.280 Interestingly, my second favorite hobby is photography. I rarely go out without my camera, and if I'm lucky, I sometimes come back with shots that I’m happy with, like these.

00:02:50.520 So speaking of logs, here’s our agenda for today. We’re going to start by looking at the default Rails logs and which questions they can help us answer—and which they can’t. In the second part, we’re going to introduce more structure into those logs to shape them into usable data. Finally, we’re going to see a showcase of the insights we can derive from them.

00:03:39.600 Let’s start by examining the default Rails logs. To get logs, we need to first run some code. There’s a basic action that everyone should be familiar with: the user update action. You find a user, update them, and then redirect to another action. The default Rails logs look something like this in a very simplified case: we would have five lines of output in total. You can see the request starts, and then we output a line. We start processing the request and output a second line, after which we output the parameters (like the user, country, and locale) before finally redirecting and summarizing the request at the end.

00:04:15.600 These logs are what we all have by default. So what kind of questions can we ask these logs? We can ask how many PATCH requests we process per day, which controllers are our busiest, or what our rate of 404 responses is. But are these really the questions we want to ask? As one French philosopher once wrote, 'Judge a man by his questions rather than by his answers.' I attended a French school, so by law, I have to quote Voltaire at least once a month, or they come for me.

00:05:14.760 In all seriousness, I find that thinking about questions is often more valuable than getting answers. So what are the questions we would like to ask? For instance, we want to be able to ask which requests a specific user performed, or why we suddenly queued a large amount of jobs. We might also want to know how many users are utilizing those legacy API endpoints we’ve been wanting to retire for a while—and what they all have in common.

00:06:02.560 Another more complex question might be: is this feature flag affecting database performance on some revision of our code in the US East region? Notice that this question requires correlating data from completely different parts of our stack, including the application, database, and infrastructure. So why can’t we answer this question with our current logs?

00:06:37.920 Today, I believe there are four main issues with these logs. First, they are multi-line; while viewing single requests is manageable, aggregating logs from multiple servers (which can number in the hundreds) causes them to become intertwined, making it impossible to make sense of them. Second, they are not structured: they are made for human eyes but difficult to analyze programmatically, which hinders the ability to find patterns.

00:07:49.119 Third, they lack essential data—there is no concept of user, revision, or AWS zone, which means there is nothing to query against to analyze the data. Finally, they are uncorrelated between systems. Each system’s logs typically refer only to it, and we can’t correlate, for example, the logs output from a worker to those from the web server.

00:08:20.560 Can we fix this? The good news is that we have four distinct problems that we can tackle one by one. This brings us to the second part of this talk, where we’ll take each of these points and explore how we can fix them. Our first problem is that the logs are multi-line. I refer to these as 'print-as-you-go' logs; as the request goes through the Rails stack, it outputs log lines for each step, which tells us a kind of story about the request's lifecycle. It indicates when a request has started and when it has finished processing.

00:09:15.920 However, this isn't the information we want. What we actually need is a summary of the request. We want all relevant information available at once. To achieve this, we can consolidate these logs. In Ruby, this process is quite straightforward; there is a gem for that! You can use Lograge, the OG, or a more advanced solution called Semantic Logger.

00:09:51.360 Both of these will give you the capability to output a single line summarizing the request, which includes the method, path, format, and all the information we had before—consolidated into a single line at the end of the request.

00:10:29.600 Now, the second issue is that these logs are unstructured. When we have unstructured logs, we can introduce structure into them. The gems I mentioned earlier have settings that allow formatting logs in a specific way, using the JSON formatter for Lograge.

00:10:55.440 What this formatter does is output a JSON document instead of just key-value pairs. The output is a single line encapsulating all relevant information from the request, which includes nested JSON documents. This is a key point: structured logging turns our logs into real data. If you're going to remember one key takeaway from this presentation, it’s that structured logging transforms logs into actionable data.

00:11:23.760 We can think of these logs as event summaries generated by the requests. Moving on to problem number three: logs are often missing data. Although our logs are now structured, they still contain generic information. They don’t have any specific details about our application.

00:11:54.000 We can enrich our logs with domain-specific information. I’m very thankful for the large screen here, as we can include environment information, which might involve passing the proc type, revision, or region. We can also include extra data from the request, like the request ID and user agent, as well as additional context from our controllers. This way, we capture all relevant data in our JSON document.

00:12:55.680 However, I want to stress a couple of warnings here. First, be mindful of performance; only log information that is inexpensive to compute. Environment variables are perfectly fine, and objects already in memory are great too, but avoid adding extra database calls to enrich your logs, as this overhead could be significant.

00:13:26.399 Secondly, there is a risk of accidentally exposing private information in logs. It’s all too easy to inadvertently log sensitive data. In fact, if you haven’t made any changes to your Rails app, by default, it probably outputs sensitive data in the logs, as the params object can contain names, emails, birth dates, Social Security numbers, and so on.

00:14:07.560 So, if you want to prevent that from happening, you can either avoid outputting the params object altogether or utilize the parameter filtering feature in Rails. This allows you to filter out sensitive information easily.

00:14:43.200 Now, I would like to share a cautionary tale. Five years ago, Facebook made headlines for storing user passwords in plain text. This did not occur because they stored the data itself in plain text in their database—they're too savvy for that—but rather because they were logging sensitive data as part of their parameters. So, remember to exercise caution!

00:15:31.600 Finally, we arrive at problem number four: the lack of correlation between logs across systems. Log correlation means linking logs between different systems, which varies from case to case. I want to provide you with two examples of log correlation for inspiration. The first example arises in the context of distributed systems. Sometimes, you may have numerous microservices calling each other just to answer one single HTTP request.

00:16:41.320 To track these requests, the entry-point service (like a gateway) would generate a unique identifier known as a correlation ID, which should be passed as an HTTP header for subsequent requests. Any service that receives this correlation ID will then include it in their logs. By filtering your logs with the correlation ID, you can see all requests associated with that specific chain.

00:17:14.360 The second example illustrates how to correlate jobs with web requests. Sidekiq possesses a concept called middlewares that allow you to add metadata to jobs when they are enqueued. You can add the request ID to the job when it is created so that whenever you execute a job, it outputs the request ID, allowing you to trace where the job originated.

00:17:45.840 To recap, we have consolidated our logs into one line per request, structured them into JSON documents, enriched them with domain-specific data, and correlated them across our systems. This approach allows us to extract insightful log lines.

00:18:38.600 Now we move into the fun part of the talk, where we will explore how we can analyze our logs in various ways. Now that the logs are treated as data, we can ingest them into our data analysis platform of choice. I recommend Better Stack, who is sponsoring this conference, and you can find them at the booth outside.

00:19:07.360 For this presentation, however, I’ll be using Kibana, which is based on Elasticsearch. This enables us to turn a single log line into a document within a database, transforming instances of nonsense into a queryable database containing all our requests.

00:19:50.360 Back to our questions: we originally wondered which requests a specific user performed. Now we will filter those logs for a particular user ID within Kibana. Doing so, we discover that there were 20 hits! Furthermore, we can select different fields we’re interested in to discern which controller, action, and request duration we are investigating.

00:20:35.200 Next, we asked why we suddenly queued a large number of jobs. Here, we can filter the Sidekiq logs and visually observe that the app has minimal activity, except at a certain point. We can zero in on the log lines that appeared during that time and correlate them using the request ID we included earlier.

00:21:19.720 By doing this, we see many jobs sharing the same request ID, which is indeed suspicious. With that request ID, we can investigate which controller was responsible, and we find it was the Maintenance Tasks Controller, specifically the action for 'start daily cleanup.' This confirms our hunch that a problem originated from that direction.

00:22:08.320 Now we can take it a step further. We can analyze the log entries for that action and retrieve the user email and ID. It turns out it was a super admin who initiated this process, and now we understand the situation.

00:23:11.000 Continuing with our analysis, let’s ask how many users are utilizing this legacy API endpoint and what they have in common. Transitioning from traditional log views to visual ones, we will count unique users by filtering on the legacy controller.

00:23:55.760 Doing this, we find that seven users are active on this legacy endpoint. Now we need to uncover who these users are and what they might have in common to facilitate retiring this endpoint.

00:24:29.399 The final question leads us to consider a feature flag’s effect on database performance for a specific revision in the US East region. By enriching our log data, we have streamlined our infrastructure into a unified level—but now, it becomes quite easy to analyze. For example, we can generate a graph of the database front time filtered by revision to see if enabling the feature flag impacts performance.

00:25:53.760 As anticipated, the results reveal that enabling the feature flag indeed negatively affects database performance. Structured logs are highly beneficial; by converting logs into actionable data, we can explore them in new ways. We can analyze this data to uncover patterns and answer questions without needing to alter any existing code. The inquiries may range from short-term needs to long-term trends, serving as a form of business intelligence tailored for developers.

00:26:45.680 I can guarantee that once you adopt this approach, you'll never want to return to sifting through traditional logs. As we approach the conclusion of this presentation, I’d like to touch upon structured logging within a broader context.

00:27:12.760 First and foremost, it’s essential to recognize that structured logging is not a new concept. This practice has existed for years. For example, Heroku has included it in their guides, and Stripe published a blog post in 2019 discussing what they termed canonical log lines. Recently, Hypernews produced a blog post advocating for wide events, which are essentially structured logs.

00:27:49.400 Furthermore, companies like Shopify and 37signals have employed in-house structured logging for a long time. While the practice remains lesser-known among smaller or newer companies, the Ruby and Rails community is lagging in its adoption compared to communities like those in Java or Golang, where such practices have long been integrated into new projects.

00:28:23.720 To conclude, if you're building anything in the Rails community, I encourage you to try adopting structured logging and sharing your learnings. Gem maintainers should consider adding an optional structured logging feature to their gems, facilitating the adoption of this increasingly crucial approach.

00:28:34.440 Today, we already have the chance to utilize structured logging. Let’s start leveraging it today so we can begin to ask our logs the questions we've been longing to have answered—rather than merely the ones we can. Thank you!