00:00:08.960
Hello, everyone. I'm Aja Hammerly. You can find me on Twitter as Thagomizer and on GitHub with the same username. I blog at thagomizer.com, and now that I'm paid to do some blogging, I expect to post more than just a few times each year.
00:00:11.960
I work as a developer advocate at Google, and if you're curious about what that means, feel free to ask me later. In short, I help Google learn how other people write code, and I assist others in learning about the Google Cloud Platform.
00:00:47.559
Now, how many of you have had your site taken down by non-malicious traffic? Raise your hands. Alright, and how many of you were actually that non-malicious traffic? We've all had those days when the internet just doesn't behave, right? Perhaps you struggled with a site that had a missing index that took you far too long to find.
00:01:24.320
Let me share a little story from one of my previous jobs. I went on vacation to speak at a conference, which is essentially how I take time off.
00:01:24.320
While I was gone, we received 500,000 requests in a little under five minutes for a site that had fewer than 100 active users. Clearly, something was wrong. You would think that given that statistic, it should have been easy to figure out what happened. However, it took us two days of following hunches to troubleshoot the issue.
00:01:41.120
We knew our system well, and we had some ideas about what the problem could be, but we still couldn't find it. When I returned, the team was busy importing app data. It was a microservices app, which meant there were several applications involved. We were importing logs from the app, syslogs, and database logs into a database, aiming to query and correlate the data to understand what went wrong.
00:02:01.119
However, a couple of issues arose.
00:02:01.119
For one, not all team members knew how to query the database.
00:02:26.160
In fact, I was one of the few who were somewhat knowledgeable in query languages. Also, it was time-consuming to import all those logs, taking most of a day. Most of this data was stored on a developer's local machine.
00:02:48.160
Another issue arose in a different company I worked with, where one user was sending 10 requests per second to the login page. Our average user age was only six, which led us to suspect it was just a case of a kitten on the keyboard constantly hitting refresh.
00:03:12.720
Regardless of how these issues arise, high traffic can be exactly what you want (a busy site is a good problem to have), but when something goes wrong, problems pile up fast.
00:03:32.319
So, what do you do then? The first step is to get your site running again. Then, you need to understand what caused the issue.
00:03:51.760
Sometimes, identifying a problem is straightforward.
00:04:05.680
But other times, you need tools to make it easier. I've had situations where it was as simple as discovering a missing index, thanks to a tool that pointed it out. I've also encountered problems made genuinely hard by complicated service-oriented architectures.
00:04:41.760
In some cases, everything might look normal at first, but by the time you reach the second step, things are broken. Sometimes bugs cannot be reproduced in your staging environment. So how do you identify the issue?
00:05:35.760
There are many potential tools available for log analysis. If you've done any operations work, you probably have a collection of log extract scripts that pull interesting bits from your logs. Personally, I have dozens, despite never working in ops officially.
00:05:49.760
Using visual tools, spreadsheets, or even doing low-tech log analysis can be effective in pinpointing issues. At one company, my boss was convinced that Rails was at fault for our system's slowness, so I created a detailed timeline showing request times and processing data to demonstrate that the server was mostly idle, which indicated the issue lay elsewhere.
00:06:02.720
Yet none of the approaches I've mentioned scale very well.
00:06:04.000
When faced with half a million requests in under five minutes, we struggled to identify the actual problems because we were inundated with too much data. Most of that data simply indicated that our site was running slow—"Your site is slow," repeated over and over. Eventually, more sophisticated tools like log aggregators and monitoring systems are needed.
00:06:25.200
However, these tools can be expensive, especially if you’re working with a pre-alpha or alpha site with a small team and a tight budget. Sometimes, the tools on offer may not even be suited for the specific issues you're experiencing.
00:06:40.000
This is where being able to perform ad-hoc queries is crucial. Often, to solve unique problems, quick and spontaneous analysis is necessary, relying on your understanding of how your systems function.
00:07:03.440
However, many existing tools don’t facilitate this. Another issue is needing to anticipate problems before they arise—setting up the necessary analyses before issues develop.
00:07:10.560
This leads us to the topic of forensic log analysis—it's the kind of log analysis for use after an incident, proving invaluable if the appropriate tools weren’t established ahead of time.
00:07:20.800
Google BigQuery is a potential solution for this kind of analysis. Until three months ago, I had no idea it existed; I suspect many of you might not have either, but now you do. It is a tool for querying large datasets, where 'large' typically means terabytes.
00:07:33.720
It serves two primary purposes: aggregating and summarizing millions or billions of rows, and locating anomalies quickly, even if they are needles in haystacks within large data sets.
00:07:47.560
For instance, it can query birth records in the U.S. from 1968 to 2008 in less than ten seconds—quite impressive! Moreover, its familiar SQL interface makes it accessible to those already comfortable with SQL querying.
00:08:04.960
How many of you are familiar with writing a simple SQL SELECT statement? Great! That’s all you need to work with BigQuery. You can upload CSV or JSON files, making it easy to work with formats you already know.
00:08:21.680
BigQuery was purpose-built for querying logs at large scale, and it's publicly available, making it an effective tool for the job. In my experience, using a tool for its intended purpose generally works better than bending some other tool into the role.
00:08:40.960
Let's do a quick demo comparing rainfall in Seattle and Salt Lake City. I'm using a publicly available NOAA weather dataset in BigQuery that spans at least 50 years and contains around 115 million rows.
00:09:05.680
In this query, I'll pull data for two specific stations, Salt Lake City and Seattle, and count the number of rainy days recorded at each.
00:09:20.360
With SQL, we can use a simple GROUP BY clause for summarizing results based on the station. Once I run the query, it’ll yield quick results. What we expect is that Seattle gets significantly more rainy days compared to Salt Lake City.
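The query itself is a plain SELECT with a GROUP BY. A sketch of what it looks like follows; the table and column names are placeholders rather than the public dataset's exact schema:

```sql
-- Illustrative sketch; substitute the public weather dataset's
-- real table and column names.
SELECT station_name, COUNT(*) AS rainy_days
FROM weather.gsod
WHERE total_precipitation > 0
  AND station_name IN ('SEATTLE', 'SALT LAKE CITY')
GROUP BY station_name;
```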
00:09:36.560
Upon executing the query, we see that indeed, Seattle has about three to four times as many rainy days as Salt Lake City, which isn't surprising given the geography of both cities.
00:09:51.760
Interestingly, other cities such as Atlanta actually experience more rainy days than Seattle, despite the common perception of Seattle as a consistently rainy city. When it comes to data, we can download our results as a CSV, allowing for additional analysis.
00:10:02.480
BigQuery runs queries incredibly fast; in this case, processing 115 million rows in under two seconds! Now, I'd like to illustrate how you can utilize this tool for your own data, especially for practical applications where you might want insights from your logging data.
00:10:42.400
This will resonate particularly with anyone familiar with data warehousing. The steps are quite straightforward: first, extract your data, then upload it, execute your queries, and most importantly, understand the data. Many times people skip over understanding the insights data provides, which is a crucial aspect of data analysis.
00:11:00.680
In this talk, I will use a simple blog app I created, a basic site with posts and titles, so not incredibly complex, but it serves to demonstrate the processes effectively. To generate appropriate load for testing purposes, I used a tool that simulated traffic, resulting in a large number of logs from the Rails app and NGINX.
00:11:26.160
The goal will be to effectively extract information from these logs and to get the data into a format that's compatible with BigQuery.
00:11:28.960
Initially, we need to verify log patterns and identify errors that may arise.
00:11:41.360
Rails logs can vastly differ in structure or format, thereby complicating analysis, but with the right code and techniques, the log data's most relevant pieces can be extracted.
00:12:28.760
To achieve effective data analysis results, you need to create a script to read the logs. Let’s say we utilize CSV given its compatibility with various analytical tools. Your goal is to always prepare data with a clear focus on what will be interesting for later analyses.
00:12:40.080
A robust understanding of your application will enable you to select the most interesting data points to analyze further, aiding in determining root causes of any ongoing issues.
00:12:55.440
Next, we set up a new CSV structure to keep our data organized. I generally incorporate headers for each data point, such as body, timestamp, IP address, HTTP verb, resource identifier, status codes, and processing times, feeding all necessary details into an array for processing.
00:13:09.760
Employing regular expressions effectively will help extract the required information from your log data. Writing the extracted fields out as a basic CSV gives us the structured data we want to analyze.
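As a minimal sketch of such an extraction script: the regex below matches a simplified NGINX-style access-log line, so treat the exact pattern and field names as assumptions; real Rails and NGINX logs will need their own patterns.

```ruby
require "csv"

# Simplified access-log pattern; real logs will need their own regex.
LINE_PATTERN = /\A(?<ip>\S+) \S+ \S+ \[(?<timestamp>[^\]]+)\] "(?<verb>\S+) (?<resource>\S+) [^"]*" (?<status>\d{3}) .* (?<ms>\d+)ms\z/

HEADERS = %w[ip timestamp verb resource status processing_ms]

# Turn an array of raw log lines into a CSV string, skipping
# lines that do not match the pattern.
def extract(lines)
  rows = lines.filter_map do |line|
    m = LINE_PATTERN.match(line.strip)
    next unless m
    [m[:ip], m[:timestamp], m[:verb], m[:resource], m[:status], m[:ms]]
  end
  CSV.generate do |csv|
    csv << HEADERS
    rows.each { |r| csv << r }
  end
end

sample = ['203.0.113.7 - - [12/Mar/2015:10:01:22 +0000] "GET /posts HTTP/1.1" 200 someagent 36000ms']
puts extract(sample)
```

In a real script you would stream the log file line by line rather than loading it into memory, but the shape of the extraction is the same.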
00:14:03.640
Once we've crafted our CSV, we can upload the file.
00:14:03.640
If the file is less than 10MB, you can upload it directly to BigQuery, but since I anticipate something larger, I'll upload it to Google Cloud Storage instead.
00:14:13.920
Once set up, we define a schema to describe our data's structure. Supported types include string, integer, float, and boolean, and declaring the type of each column makes the dataset easier to query later.
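BigQuery accepts a schema as a JSON array of name/type pairs. This is a sketch assuming log fields like the ones extracted earlier; I've used STRING for the timestamp here to sidestep timestamp-format parsing:

```json
[
  {"name": "body",          "type": "STRING"},
  {"name": "timestamp",     "type": "STRING"},
  {"name": "ip",            "type": "STRING"},
  {"name": "verb",          "type": "STRING"},
  {"name": "resource",      "type": "STRING"},
  {"name": "status",        "type": "INTEGER"},
  {"name": "processing_ms", "type": "INTEGER"}
]
```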
00:14:35.440
Next, we create a new dataset in BigQuery. Making a dataset called log forensics and a new table within it prepares us to import our already structured log data.
00:15:09.760
Finally, as the data loads, we can begin querying it. The earlier load test generated over 2,000 blog posts in only a few hours, so I'm eager to see what the data reveals about performance and latency.
00:15:22.640
Let's start by counting the rows that exceed a processing time of 60 milliseconds, using a straightforward WHERE clause that filters for the relevant records.
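In SQL form, that count might look like the following; the dataset, table, and column names are illustrative, assuming the schema you defined at import time:

```sql
-- Illustrative; assumes a processing-time column measured in milliseconds.
SELECT COUNT(*) AS slow_requests
FROM log_forensics.app_logs
WHERE processing_ms > 60;
```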
00:15:50.640
After executing the query, we quickly identified 10,000 rows exceeding that processing threshold—a significant concern that would merit further investigation. Now let's see the details of those rows, focusing on the body and the processing time.
00:16:13.440
Sorting this data by high processing times will give us better visibility into the slower requests, allowing us to prioritize further examination of performance issues. For instance, we can find specific instances of requests that took remarkably long to process.
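Returning the slow rows themselves, worst first, is the same filter plus an ORDER BY (again with illustrative names):

```sql
-- Illustrative; shows each slow request and its processing time,
-- slowest first.
SELECT body, processing_ms
FROM log_forensics.app_logs
WHERE processing_ms > 60
ORDER BY processing_ms DESC
LIMIT 20;
```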
00:16:35.840
Upon reviewing these entries, we find that certain requests took as long as 36,000 milliseconds, because the index page rendered more than 2,000 posts in a single view.
00:16:53.040
After identifying these high-impact areas, we can effectively implement solutions like pagination that immediately alleviate strain on server resources.
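The pagination fix itself is just offset arithmetic. Here is a minimal plain-Ruby sketch of the idea, with the array and page size standing in for a real model and pagination library:

```ruby
# Hypothetical page size; a real app would use a pagination gem
# or SQL LIMIT/OFFSET against the database.
POSTS_PER_PAGE = 25

# Return the slice of records belonging to a 1-indexed page.
def page_of(records, page)
  offset = (page - 1) * POSTS_PER_PAGE
  records[offset, POSTS_PER_PAGE] || []
end

posts = (1..2000).to_a
first_page = page_of(posts, 1)   # posts 1..25
third_page = page_of(posts, 3)   # posts 51..75
```

Rendering 25 posts per request instead of all 2,000 is what turns a 36-second response into a fast one.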
00:17:13.840
This approach will drastically enhance user experience, reducing load times significantly without huge overhead or infrastructure changes.
00:17:33.360
Moreover, creating indexes on important fields not only assists with filtering through loads of records efficiently but also optimizes overall database performance, leading to quicker response times.
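As a sketch, adding such an index in plain SQL might look like this; the posts table and created_at column are hypothetical stand-ins for whatever field your slow queries filter or sort on:

```sql
-- Hypothetical example: index the column your slow queries filter on.
CREATE INDEX index_posts_on_created_at ON posts (created_at);
```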
00:17:51.920
Importantly, you might wonder why all this effort is warranted. The advantage is that you can perform forensic log analysis retroactively as needed when your current tools fall short of resolving issues. Knowing that you have this option available boosts confidence in your analytical capabilities.
00:18:12.320
Because BigQuery speaks familiar SQL, it slots into your team's existing workflows rather than requiring an entirely new query language. Time is of the essence, and when fast access to data matters, not having to stand up new infrastructure first is a real advantage.
00:18:30.960
Additionally, a streamlined process curtails delays, enabling you to analyze various datasets across your organization more efficiently. This enables collaborative querying and sharing of insights, yielding tangible results over time.
00:18:50.560
Integrating with a service like BigQuery costs very little: the entire demo today cost less than five dollars, and running the queries themselves was free, which highlights the value of leveraging the tools already available.
00:19:12.560
Lastly, if any of you have questions or need links to helpful resources, including how to set these tools up or get various datasets referenced in this talk, please let me know; I would be happy to assist!
00:19:30.776
Thank you to the conference organizers for hosting this event and creating such an enriching environment. I also want to acknowledge my colleagues for their support and encouragement throughout my preparation for this talk. Thank you all!