00:00:09.120
Hello everyone and welcome to RailsConf. This year is a little different since we cannot meet in Portland, but I hope you still enjoy all the talks we recorded for you. Thank you for watching my talk. I would like to start today with a quote from Richard Feynman, a famous physicist who won a Nobel Prize in 1965. On his blackboard, they found this quote: 'What I cannot create, I do not understand.' But what did he mean by this?
00:00:20.770
By building something from scratch, step by step, we can develop a firm understanding of the problem at hand. Sometimes we have to admit that we do not understand something we thought we already grasped. If we eventually succeed, it proves that we understood the problem correctly. Today, we will build a performance monitoring tool from the ground up, which will help us use these tools better in the future. My name is Christian Bruckmayer, and you can find me on Twitter @bruckmayer or on my blog bruckmayer.net. I live in the southwest of England, in the beautiful city of Bristol, and I work for Cookpad—an online recipe sharing application.
00:01:02.920
Let's talk a little bit about performance monitoring. Performance monitoring helps us spot and identify performance bottlenecks in our applications. Many of you probably use software-as-a-service platforms like New Relic or Skylight. Today, we will build a tool called Active Monitoring together, which will implement a subset of the features you know from New Relic or Skylight. Given our limited time of 30 minutes, our implementation will only cover response time and database operation metrics.
00:01:31.090
The tool will consist of three components: a Ruby on Rails plugin, a data storage system, and a dashboard for data visualizations. Everything I present today is also published on my blog, so if something is a bit too fast, you can look it up in my blog article. Most of the knowledge I present today is extracted from the InfluxDB, an open-source performance monitoring tool I maintain with a friend. Since we have just 30 minutes, we will not discuss how to actually spot and fix performance issues. If you're interested in this topic, I recommend watching the talk from Meteor Quebec from last year's RailsConf.
00:02:12.000
The talk is divided into four chapters. In our first chapter, we will start collecting data with the help of Active Support. After that, we will continue with storing, processing, and visualizing the collected data. As I mentioned earlier, we will use Active Support to build our performance monitoring tool. Active Support is a utility framework of Ruby on Rails. One of the more popular features of Active Support is its Ruby core extensions, such as date and time extensions. Active Support also comes with an instrumentation framework.
00:02:54.050
You can think of this framework as a way to create an event stream. At the top, we instrument an event with a name and have a payload containing the term. We then execute a block of code, and at the bottom, we subscribe to this event. Every time the code at the top is executed, the subscribed block gets executed as well, allowing us to log the start time and payload. Luckily, Ruby on Rails already ships with more than 50 hooks into the framework, which helps us to extract metrics from our apps. A few events we will use today include the start processing event, which fires before a controller action is executed, and the process action event, which fires after a controller action has been executed. Active Record even provides an event for each SQL query executed.
00:04:10.260
To illustrate how this framework works, let's re-implement a simple subscribe method. We need a class attribute called notifier, which will be an empty hash. The subscribe method will accept a name and a block, which we will store in this hash. The instrument method will also accept a name and a payload. We will store the current time before we execute the block, then execute the block and store the time after execution. In the next step, we will iterate over all the callbacks stored in the subscribe method and call them with a payload, merging the start and finish times.
00:04:59.380
As I mentioned, Ruby on Rails already has more than 50 hooks into its framework, but sometimes we need additional instrumentation for external libraries like Redis or Memcached. In such cases, we can monkey patch the function we want to instrument and wrap it in an instrument block that we can later subscribe to. For this example, we will monkey patch the call method of the Redis library. In our first chapter, we closely examined Active Support and its instrumentation framework. We now know which hooks Ruby on Rails provides and how we can use them to extract data from our applications.
00:06:02.370
Next, we will implement the storage of the collected metrics. As a first step, we need to install the process action subscriber, which we will do in a Ruby on Rails plugin. Plugins are our way to extend Rails by hooking into its internals. The process action event payload provides us with information such as the controller and action name plus the execution runtime, and we may later extend this to also store the response format, the method, and the response status. From the payload, we could create a table containing an ID, timestamps, and the calculated durations based on start and finish times.
00:06:56.080
This structure resembles that of a relational database, which typically implements create, read, update, and delete operations known as CRUD. These operations align well with our controller actions, which often have corresponding methods such as create, show, update, and destroy. However, for the performance monitoring tool we plan to implement, we only really need some of these operations. We definitely need a create operation to write our metrics to the database, as, for each request, we will write multiple metrics including response time and several metrics for each SQL query.
00:07:56.489
In the future, we could also extend this tool to include view render times or background jobs. We will, of course, need to read the data, but only for recent metrics like those from the last three or twenty-four hours. We will almost never update any of the metrics since the response time, once recorded after executing a controller action, won't change. Sometimes, we may need to delete metrics, but that will essentially be more like garbage collection. For instance, if we are only interested in metrics for the last month, we can run a background job to delete anything older.
00:09:02.800
If we take a look at how most relational databases are implemented, we will notice they often use data structures like B-trees or B+ trees. These data structures contain multiple pages of data, with pointers to the next page. However, as we will be writing a high volume of metrics, it may not be efficient to constantly create new pages and write to disk. In recent years, different types of databases have become popular, particularly time series databases. While time series databases have existed for 30 to 40 years, they gained popularity largely due to the Internet of Things, where many devices have sensors that continuously write data such as temperature or humidity.
00:09:41.340
A few popular time series databases you might have heard of include InfluxDB, Prometheus, and Graphite. For the rest of this talk, I'll be using InfluxDB as a reference. If we look at the data schema in InfluxDB, we will notice that it is surprisingly similar to a relational database schema. Instead of tables, we have measurements, each of which has several metrics. Each metric can have tags, which serve as primary keys, and additionally, we have fields that can include duration and text, such as hook name and location.
00:10:30.929
The main difference between fields and tags is that tags are indexed, enabling us to filter our metrics easily. Time series databases are better for handling high-volume write applications because they use a data structure called a Log-Structured Merge Tree (LSM tree). An LSM tree consists of two main elements: a sorted tree in memory, which accepts new writes, and log files on disk that are also sorted by time and indexed. This allows for efficient searching of metrics.
00:11:02.160
When searching for a metric in our database, we first look in the in-memory tree; if it doesn't exist there, we then iterate over the log files on disk. The big advantage of most time series databases is that they already implement retention policies that handle garbage collection, compressing, and downsampling log files, making them more efficient than traditional relational databases.
00:12:08.430
With this knowledge, we can now write request metrics to the InfluxDB. We take the timestamp, calculate the values based on the hook name and location, and derive the duration from finish time minus start time. Additionally, we want to monitor database queries, but first, we need to clean, normalize, and group the payload we will receive.
00:12:55.710
In our next chapter, we will take a closer look at the data structure of our metrics and determine the most efficient type of data storage — opting for a time series database such as InfluxDB. In the third chapter, we will process the data received in the SQL event. Before storing the data in the database, we will need to clean, normalize, and group it. We will start with a normalization process because the payload received from Rails contains the SQL query along with actual values that require sanitizing.
00:13:39.520
We must remove these values to normalize the queries, allowing for effective grouping later. One approach is to use regular expressions to find and replace all variable values within the SQL query. There are several positions in SQL where values can appear, resulting in various regular expressions. Furthermore, SQL can vary between dialects, meaning we might need to adapt our expressions based on the database we use. The next step in the cleaning process involves removing unnecessary data that does not provide value.
00:14:53.580
Without this cleaning step, we would end up writing SQL queries that aren't actually executed in our app but are generated by Active Record in the background. For example, queries that check the version of the database or schema version should be filtered out since they typically do not affect performance. We should also filter out object commands, as they are usually executed in migrations or tests.
00:15:59.280
We can create a query object that only tracks SQL commands we are interested in, such as select, insert, update, and delete, while ignoring schema migrations. The final step in processing our data is grouping them so we can see which queries are executed by which controller actions. This helps us identify potential problems. However, the payload for the SQL query event does not include a request ID because Active Record is often used independently of the request/response cycle.
00:17:05.520
Despite its standalone nature, we can subscribe to the start processing event, which gets triggered at the beginning of each controller action. With this subscription, we can cache the current request ID and location. The attribute 'current_attributes' keeps track of the request context. This attribute is implemented in Active Support but we can easily replicate it by storing a hash in the current thread and then implement getter and setter methods to manage key-value pairs.
00:18:17.440
Bringing this all together, we first check if we want to track the query. We fetch the current request ID and location, normalize the query, and then store everything in our InfluxDB. This chapter addressed the cleaning, normalizing, and grouping of our data, allowing our metrics to provide better insights later on. However, we should note that this normalization process does add some response time. To avoid any impact on user experience, we could implement this process in a separate job or background task.
00:19:28.360
In the last chapter of this talk, we will discuss visualizing the collected metrics. But why is visualization necessary? If we have only a few metrics, it is easy to spot anomalies, like a response time issue. In a real-world application, however, we could have thousands, or even millions of data points, making it impossible to identify trends without proper visualization. A dashboard allows us to see that most users experience a response time of around 16 to 60 milliseconds, but that some users might experience delays of up to one second—a significant concern.
00:20:11.170
Visualization helps us pinpoint issues. Among the metrics we might calculate, the average is one possibility—measured by the sum of values divided by the number of values, while the median is another, found by ordering all values and identifying the middle point. In our example, this average is 105 milliseconds while the median is 56 milliseconds, with InfluxDB implementing mathematical functions that ease these calculations.
00:21:13.580
When we visualize our metrics in a histogram, we can see how the average and median often diverge, particularly in web applications where there are a lot of quick responses but also longer delays. This discrepancy can skew our understanding if we rely solely on the average metric, which could misrepresent the bulk user experience.
00:21:40.160
For example, we might find that 10% of our users experience response times slower than 10 seconds—an issue that deserves immediate attention. Richard Lehman wrote a great article that covers this topic in depth, and I highly recommend checking it out for further reading.
00:22:09.580
Finally, visualization assists us in setting the right priorities for performance improvements. By calculating which controller actions consume the most time, we can see which action demands nearly 50% of our processing time. The 80/20 rule often reveals that 80% of our time is spent in 20% of our actions, so we should focus our optimization efforts there to maximize efficiency.
00:22:47.290
Visualization not only helps identify performance issues, but also clues us into their causes. By calculating the maximum and median times for SQL queries, we might discover that some indices are missing. Additionally, we can tally the number of SQL queries executed, which may indicate the presence of N+1 problems or absent memoization.
00:23:16.610
In conclusion, today we built a performance monitoring tool step by step. We began with collecting metrics from Ruby on Rails, then stored them in a time series database. Some metrics required cleaning, normalization, and grouping, all of which were covered in previous chapters. Finally, we visualized the data to identify potential performance issues.
00:24:16.060
We built this tool with extensive use of open-source software, utilizing a Ruby on Rails plugin, storing metrics in InfluxDB, and visualizing data on a dashboard with a tool called Grafana. The significant advantage here is the ability to customize solutions tailored to our specific needs, collaborating with the Ruby on Rails community to share best practices and dashboards.
00:24:35.000
To return to Mr. Feynman's quote, 'What I cannot create, I do not understand.' By building a performance monitoring tool step by step, we now gain an understanding of how its internals work, which also prepares us to use these tools more effectively in the future.
00:25:00.000
As mentioned in the beginning, most of the knowledge in this talk is derived from the InfluxDB race gem. We are always looking for new contributors and help, so if this talk has sparked your interest in performance monitoring, you can find us on GitHub.
00:25:10.000
This year, I cannot take your questions in person, but I am happy to address any questions you might have. Please feel free to reach me via Twitter or by email at brookmayer.net. Thank you again for watching my talk.