Forensic Log Analysis with BigQuery

Talks

Aja Hammerly

Forensic Log Analysis with BigQuery

by Aja Hammerly

The video titled 'Forensic Log Analysis with BigQuery' presented by Aja Hammerly at MountainWest RubyConf 2015 centers on the importance of analyzing logs effectively in the event of website performance issues. Aja discusses how common unexpected traffic spikes can lead to problems such as server errors and performance degradation, emphasizing that plain logs are often not user-friendly or efficient for troubleshooting. Instead, she advocates for the use of Google BigQuery, a tool designed for querying large datasets using SQL, to facilitate forensic log analysis after incidents occur.

Key points discussed include:

- Introduction to Log Analysis: Aja introduces the context of analyzing logs when a website faces issues due to non-malicious traffic or unexpected behavior. She shares anecdotes about her experiences with sudden spikes in requests and the challenges faced due to the volume of data.
- Challenges in Traditional Approaches: She outlines the difficulties in troubleshooting traditional log data, mentioning the time-consuming process of data importation, the lack of SQL querying knowledge among team members, and the ineffectiveness of some log analysis tools.
- Need for Effective Tools: Aja highlights the need for an efficient tool that can scale with data surge, like BigQuery, for forensic analysis. BigQuery helps in quickly locating anomalies by enabling ad-hoc SQL queries on massive datasets.
- Demo of BigQuery Capabilities: She provides a demo comparing rainfall data between Seattle and Salt Lake City to demonstrate BigQuery's speed and efficiency in handling queries over large datasets, illustrating the power of familiar SQL syntax in a new context.
- Practical Application of Log Analysis: Aja discusses setting up a simple blog application to generate logs, emphasizing the importance of structuring data for effective analysis. She details the steps to clean and upload log data into BigQuery, and how to perform meaningful queries, culminating in identifying performance bottlenecks.
- Conclusions and Benefits: The advantages of utilizing BigQuery for forensic log analysis include improved confidence in data analysis capabilities, cost-effective querying, and the ability to perform historical data analysis retroactively.

Overall, Aja's presentation illustrates how leveraging tools like BigQuery can enhance the capability to perform log analysis, ensuring effective troubleshooting and continuous performance improvement for web applications.

00:00:08.960 Hello, everyone. I'm Aja Hammerly. You can find me on Twitter as Thagomizer and on GitHub with the same username. I blog at thagomizer.com, and now that I'm paid to do some blogging, I expect to post more than just a few times each year.

00:00:11.960 I work as a developer advocate at Google, and if you're curious about what that means, feel free to ask me later. In short, I help Google learn how other people write code, and I assist others in learning about the Google Cloud Platform.

00:00:47.559 Now, how many of you have had your site taken down by non-malicious traffic? Raise your hands. Alright, and how many of you were actually that non-malicious traffic? We've all had those days when the internet just doesn't behave, right? Perhaps you struggled with a site that had a missing index that took you far too long to find.

00:01:24.320 Let me share a little story from one of my previous jobs. I went on vacation to speak at a conference, which is essentially how I take time off.

00:01:24.320 While I was gone, we received 500,000 requests in a little under five minutes for a site that had fewer than 100 active users. Clearly, something was wrong. You would think that given that statistic, it should have been easy to figure out what happened. However, it took us two days of following hunches to troubleshoot the issue.

00:01:41.120 We knew our system well, and we had some ideas about what the problem could be, but we still couldn't find it. When I returned, the team was busy importing app data. It was a microservices app, which meant there were several applications involved. We were importing logs from the app, syslogs, and database logs into a database, aiming to query and correlate the data to understand what went wrong.

00:02:01.119 However, a couple of issues arose.

00:02:01.119 For one, not all team members knew how to query the database.

00:02:26.160 In fact, I was one of the few who were somewhat knowledgeable in query languages. Also, it was time-consuming to import all those logs, taking most of a day. Most of this data was stored on a developer's local machine.

00:02:48.160 Another issue arose in a different company I worked with, where one user was sending 10 requests per second to the login page. Our average user age was only six, which led us to suspect it was just a case of a kitten on the keyboard constantly hitting refresh.

00:03:12.720 Regardless of how the issues arise, sometimes traffic is high, and you want to be busy, but when something goes wrong, things can pile up.

00:03:32.319 So, what do you do then? The first step is to get your site running again. Then, you need to understand what caused the issue.

00:03:51.760 Sometimes, identifying a problem is straightforward.

00:04:05.680 But other times, you need tools to make it easier. For example, I've had situations when it was as simple as discovering that I was missing an index, thanks to a tool that pointed it out. Nonetheless, I’ve also encountered problems that were complex due to complicated service-oriented architectures.

00:04:41.760 In some cases, everything might look normal initially, but by the time you reach the second step, things might be broken. Sometimes, bugs cannot be reproduced in your staging environment. Thus, how do you identify the issue?

00:05:35.760 There are many potential tools available for log analysis. If you've done any operations work, you probably have a collection of log extract scripts that pull interesting bits from your logs. Personally, I have dozens, despite never working in ops officially.

00:05:49.760 Using visual tools, spreadsheets, or even doing low-tech log analysis can be effective in pinpointing issues. At one company, my boss was convinced that Rails was at fault for our system's slowness, so I created a detailed timeline showing request times and processing data to demonstrate that the server was mostly idle, which indicated the issue lay elsewhere.

00:06:02.720 Yet none of the approaches I've mentioned scale very well.

00:06:04.000 When faced with half a million requests in under five minutes, we struggled to identify the actual problems because we were inundated with too much data. Most of that data simply indicated that our site was running slow—"Your site is slow," repeated over and over. Eventually, more sophisticated tools like log aggregators and monitoring systems are needed.

00:06:25.200 However, these tools can be expensive, especially if you’re working with a pre-alpha or alpha site with a small team and a tight budget. Sometimes, the tools on offer may not even be suited for the specific issues you're experiencing.

00:06:40.000 This is where being able to perform ad-hoc queries is crucial. Often, to solve unique problems, quick and spontaneous analysis is necessary, relying on your understanding of how your systems function.

00:07:03.440 However, many existing tools don’t facilitate this. Another issue is needing to anticipate problems before they arise—setting up the necessary analyses before issues develop.

00:07:10.560 This leads us to the topic of forensic log analysis—it's the kind of log analysis for use after an incident, proving invaluable if the appropriate tools weren’t established ahead of time.

00:07:20.800 Google BigQuery is a potential solution for this analysis. Before three months ago, I had no idea it existed; I suspect many of you might not have either, but now you do. It is a tool for querying large datasets, with 'large' often defined as being in the range of terabytes.

00:07:33.720 It serves two primary purposes: aggregating and summarizing millions or billions of rows, and locating anomalies quickly, even if they are needles in haystacks within large data sets.

00:07:47.560 For instance, it can query birth records in the U.S. from 1968 to 2008 in less than ten seconds—quite impressive! Moreover, its familiar SQL interface makes it accessible to those already comfortable with SQL querying.

00:08:04.960 How many of you are familiar with writing a simple SQL SELECT statement? Great! That’s all you need to work with BigQuery. You can upload CSV or JSON files, making it easy to work with formats you already know.

00:08:21.680 BigQuery was purpose-built for querying logs at a large scale, and it’s publicly available, making it an effective tool for the job. Having used it, I find that utilizing tools for their intended purpose generally yields better results than misusing them.

00:08:40.960 Let’s conduct a quick demo comparing rainfall in Seattle and Salt Lake City. I’m using a publicly available dataset on NOAA weather data through BigQuery, which comprises at least 50 years—the dataset has around 115 million rows.

00:09:05.680 In this query, I’ll gather data from two specific stations: Salt Lake City and Seattle. While counting all the rows to find total days of rain, we’ll also find the total number of rainy days for each location.

00:09:20.360 With SQL, we can use a simple GROUP BY clause for summarizing results based on the station. Once I run the query, it’ll yield quick results. What we expect is that Seattle gets significantly more rainy days compared to Salt Lake City.

00:09:36.560 Upon executing the query, we see that indeed, Seattle has about three to four times the number of rainy days than Salt Lake City, which isn't surprising given the geographical context of both cities.

00:09:51.760 Interestingly, other cities such as Atlanta actually experience more rainy days than Seattle, despite the common perception of Seattle as a consistently rainy city. When it comes to data, we can download our results as a CSV, allowing for additional analysis.

00:10:02.480 BigQuery runs queries incredibly fast; in this case, processing 115 million rows in under two seconds! Now, I'd like to illustrate how you can utilize this tool for your own data, especially for practical applications where you might want insights from your logging data.

00:10:42.400 This will resonate particularly with anyone familiar with data warehousing. The steps are quite straightforward: first, extract your data, then upload it, execute your queries, and most importantly, understand the data. Many times people skip over understanding the insights data provides, which is a crucial aspect of data analysis.

00:11:00.680 In this talk, I will use a simple blog app I created, a basic site with posts and titles, so not incredibly complex, but it serves to demonstrate the processes effectively. To generate appropriate load for testing purposes, I used a tool that simulated traffic, resulting in a large number of logs from the Rails app and NGINX.

00:11:26.160 The goal will be to effectively extract information from these logs and to get the data into a format that's compatible with BigQuery.

00:11:28.960 Initially, we need to verify log patterns and identify errors that may arise.

00:11:41.360 Rails logs can vastly differ in structure or format, thereby complicating analysis, but with the right code and techniques, the log data's most relevant pieces can be extracted.

00:12:28.760 To achieve effective data analysis results, you need to create a script to read the logs. Let’s say we utilize CSV given its compatibility with various analytical tools. Your goal is to always prepare data with a clear focus on what will be interesting for later analyses.

00:12:40.080 A robust understanding of your application will enable you to select the most interesting data points to analyze further, aiding in determining root causes of any ongoing issues.

00:12:55.440 Next, we set up a new CSV structure to keep our data organized. I generally incorporate headers for each data point, such as body, timestamp, IP address, HTTP verb, resource identifier, status codes, and processing times, feeding all necessary details into an array for processing.

00:13:09.760 Employing regular expressions effectively will help extract the required information from your log data. Creating a basic CSV will create the structured data we want to analyze.

00:14:03.640 Once we've crafted our CSV, we can upload the file.

00:14:03.640 If the file is less than 10MB, you can directly upload it to BigQuery, but since I anticipate somewhere larger, I’ll upload it to Google Cloud Storage directly.

00:14:13.920 Once set up, we define a schema to guide our data structure. The supported types include integer, float, and boolean, and defining how we expect the data to interact makes the dataset easier to query later.

00:14:35.440 Next, we will make a new dataset in BigQuery. Creating a dataset called log forensics and establishing a new table within will prepare us to import our already structured log data.

00:15:09.760 Finally, as data starts loading, we can soon begin querying it. Given the previous load test of over 2000 blog posts in only a few hours, I'm eager to analyze the results and see what the data reveals about performance and latency.

00:15:22.640 Let’s start by counting the rows that exceed a processing time of 60 milliseconds, using a straightforward WHERE clause that targets and filters out the relevant records.

00:15:50.640 After executing the query, we quickly identified 10,000 rows exceeding that processing threshold—a significant concern that would merit further investigation. Now let's see the details of those rows, focusing on the body and the processing time.

00:16:13.440 Sorting this data by high processing times will give us better visibility into the slower requests, allowing us to prioritize further examination of performance issues. For instance, we can find specific instances of requests that took remarkably long to process.

00:16:35.840 Upon reviewing these entries, processing times reveal that certain requests took as long as 36,000 milliseconds to process due to the bottlenecking nature of generating excessive views from rendering 2000+ posts at once.

00:16:53.040 After identifying these high-impact areas, we can effectively implement solutions like pagination that immediately alleviate strain on server resources.

00:17:13.840 This approach will drastically enhance user experience, reducing load times significantly without huge overhead or infrastructure changes.

00:17:33.360 Moreover, creating indexes on important fields not only assists with filtering through loads of records efficiently but also optimizes overall database performance, leading to quicker response times.

00:17:51.920 Importantly, you might wonder why all this effort is warranted. The advantage is that you can perform forensic log analysis retroactively as needed when your current tools fall short of resolving issues. Knowing that you have this option available boosts confidence in your analytical capabilities.

00:18:12.320 Being able to utilize SQL in a familiar format allows quicker integrations with existing collaborative workflows versus entirely new languages. Time is of the essence; thus, when fast access to data matters, the recursive nature of your data becomes a non-issue.

00:18:30.960 Additionally, a streamlined process curtails delays, enabling you to analyze various datasets across your organization more efficiently. This enables collaborative querying and sharing of insights, yielding tangible results over time.

00:18:50.560 The overall functionality of integrating with services like BigQuery incurs minimal costs; whereas the demo today cost less than five dollars, the querying function is free, highlighting the value of leveraging available tools effectively.

00:19:12.560 Lastly, if any of you have questions or need links to helpful resources, including how to set these tools up or get various datasets referenced in this talk, please let me know; I would be happy to assist!

00:19:30.776 Thank you to the conference organizers for hosting this event and creating such an enriching environment. I also want to acknowledge my colleagues for their support and encouragement throughout my preparation for this talk. Thank you all!