Andrey Novikov

Yabeda: Monitoring monogatari

Developing a large, high-load application without monitoring is a hard and dangerous task, just like piloting an aircraft without gauges. Why is it so important to keep an eye on how the application's “flight” goes, which things you should care about most, and how can graphs help you resolve performance problems quickly? Using Sidekiq, Prometheus, and Grafana as an example, we will learn how to take metrics from a Ruby application and what we can see from their visualization, and I will tell a couple of sad tales about how this helps in everyday life (with a happy end!)

RubyKaigi 2019 https://rubykaigi.org/2019/presentations/Envek.html#apr19


00:00:00.060 So, there are a lot of people here, and I am very glad to see all of you. Let me take a selfie with you. So, say cheese! Thank you very much.
00:00:32.030 Minna-san, konnichiwa! Watashi no […] arigato gozaimasu! My name is Andrey, and I came today from Russia, from a city called Korolev. It's located just near Moscow, and Korolev is the center of Russian space exploration. There, we have a rocket factory and the mission control center for the International Space Station. So, if something went wrong in one of our space missions, we could say, 'Korolev, we have a problem.' But I have never actually heard such an expression, and it's not because we don't have any problems with our space missions; they actually happen a lot. Many lessons learned from those problems are applicable to software development.
00:01:22.140 I can refer you to a talk by my colleague this evening about this; you can watch the slides afterwards if you wish. I'm also very glad that, just before coming here to RubyKaigi, I got to visit the Japanese center of space exploration, the Tanegashima Space Center. Okay, I'm Andrey, and I am one of the Evil Martians. We are literally aliens: extraterrestrial engineers hungry for new technologies, quality software, and open source. We love open source and dedicate our whole development process and professional life to using and creating open source projects.
00:02:04.259 We help enhance existing projects, but open source does not appear by itself. Usually, it's born inside commercial projects and extracted from private code. And yes, we Evil Martians do commercial development for money for our customers. We launched the Russian Groupon and helped GetTaxi grow its service for ordering taxis and deliveries in Russia and Israel. We also have many projects with giants such as eBay.
00:03:07.040 Today, I want to share with you my experience from working on a project for eBay called eBaymag. This is a branch of eBay that helps sellers sell their goods on various eBay sites: apart from ebay.com, there are eBay Germany, eBay Australia, and others. What you need to know about this project is that it's highly asynchronous. Users can sign in, select their items, and publish them everywhere. We start translating titles and descriptions and suggesting categories, and we do all of this in the background, without user interaction.
00:04:01.100 We use Sidekiq to handle these jobs. Sidekiq is the de facto standard for executing background jobs in Ruby because it's very performant and reliable. However, like any tool that executes our code, it can become slow or break because of mistakes in our code. So we need to understand what is happening inside Sidekiq, and especially inside our jobs. Imagine a situation where, one day, our queues are full and workers are trying hard to work through them all. Everything is slow, and we are in trouble.
00:05:02.100 So, how do we understand whether the situation will improve or worsen in the near future? Should we take urgent action? Sure, Sidekiq has a nice web dashboard included in its default installation. It looks pretty cool and solid, with many numbers and graphs, but is it enough? Let's think about the questions we need answers to. For example, am I executing more jobs now than five minutes ago, or than before the last deployment? I can only estimate this if I stare at the page for a while.
00:06:24.330 Have the optimizations I made helped? By watching the numbers for five minutes, I can roughly estimate whether we are processing more jobs or fewer. Are we going to drain the queues at all? Are we going to execute the jobs on time, and when is that likely to happen? I can only tell whether the number of jobs is increasing or decreasing; that's not good. We don't have enough tools right now to understand what is going on. Are we safe, or should we do something? We need two things: first, how queue sizes are changing over time (growing or shrinking), and second, which job types are problematic and slow, and, more importantly, when they slow down.
00:07:43.680 This information will help us determine where to start digging for the root causes of failures. Also, we should monitor the application performance. If we have a problem with a slow Sidekiq job, we want to understand why. There are many possible reasons: we could be running out of memory on our servers, or maybe an external API is slowing down. Perhaps a link to our application was published on sites like Reddit or Slashdot, leading to a surge in customers and an influx of jobs.
00:09:05.820 Another possibility is that we made some refactoring changes and replaced an optimal SQL query with a non-optimal one. I've done that myself several times; it happens. Therefore, we need to track not only Sidekiq statistics, but also the whole application, including servers and external services, to understand why Sidekiq jobs are slowing down. We need a few more graphs for Sidekiq: one with queue sizes, one with job dynamics, and many graphs for all aspects of our application.
00:10:12.630 Ultimately, we need something resembling a cockpit filled with graphs. If you are worried now, thinking it looks like an aircraft cockpit, you are right! It's similar to a flight engineer's workspace in older aircraft, like the supersonic Tu-144, which rivaled the Concorde. Modern aircraft don't have this arrangement anymore because it's not very optimal. We can actually compare our applications to modes of transport. For example, when your application is small, it's like a bicycle: you can understand how it's operating without any dials or gauges.
00:11:31.010 As it grows and becomes more heavily used, it resembles an automobile; it becomes inconvenient when you don't know how much gasoline you have left or whether you are hitting speed limits. Finally, when applications get big and are under high load, you cannot fly without all the instrumentation. Pilots cannot see the engines or wings, and sometimes they can't even see the ground, so gauges are crucial to understanding whether the aircraft is functioning well.
00:12:37.130 Many crashes have occurred because of faulty or misunderstood instruments. Thus, monitoring applications as they grow is crucial, and you may need to add it to your stack. Next, we need to learn some theory. We understand the need for monitoring, but what do we need to know about adding more graphs?
00:13:29.250 I will share some principles. This is not too hard, just a little patience, please. So, what is monitoring, exactly? This is a definition I took from Russian Wikipedia, because it's concise and precise, and translated into English. It's important to have criteria for what is good and what is not. For instance, when someone visits the homepage of your application and it takes more than half a second to load, a typical user will think it's a very slow website and may leave. Everyone has an implicit understanding of what a slow website is versus a fast one.
00:14:58.080 You need to make these criteria explicit. For example, if you have a Sidekiq queue categorized as urgent and it has more than 10,000 jobs for over five minutes, that's a disaster that requires waking everyone up to fix it. Monitoring is a continuous process; you need to observe the application state every few seconds, day and night. A metric is simply a system parameter you are monitoring, such as the number of jobs executed in Sidekiq, queue size, number of HTTP requests, and average or 95th percentile job processing times.
00:16:21.800 When measuring metrics, we record these values at regular intervals. This forms a time series. When visualized, you get a graph with metric values on the y-axis and time on the x-axis. For example, while tracking Sidekiq queue sizes, you don't need just the total size of all queues; you need the size of each individual queue. Queue size is a single metric, but it can be segmented by labels, such as the queue name.
00:17:55.340 However, you can always aggregate these metrics to get totals, averages, or other statistics. Be cautious, though. Each set of labels forms a separate time series that takes memory—and having too many can crash your monitoring system. Therefore, keep label values finite and small. Metrics, while just numbers, actually have types. The most important type is the counter, which tracks events over time. They are monotonic and can only increase. When a Sidekiq job executes, whether it's successful or not, that event is counted.
00:19:29.820 You can't undo events, so counters behave like grow-only counters (G-counters) from the family of conflict-free replicated data types (CRDTs). This means you don't have to worry about persisting counter values: even if your application restarts or crashes, you simply start counting from zero again. Another metric type is a gauge, which measures external parameters like temperature or memory usage on the server. However, the operations you can perform on gauges are limited.
00:20:51.850 There are also more complex metric types called histograms and summaries, which are used for timing measurements. Histograms use multiple counters internally and allow calculations like the 95th percentile of job execution times. They give you an idea of the order of magnitude of execution times: do your jobs complete in seconds, tens of seconds, minutes, or even hours? But remember, they are rough and less precise.
00:21:35.000 Now, let’s practice applying these foundations. We'll use some open-source monitoring solutions that are easy to install and get started with. For starters, let's install the Prometheus client gem into our application and register some metrics. Our metrics will allow us to reconstruct the whole Sidekiq web dashboard in Grafana. Everything we had in Sidekiq web will be replicated, and we will also add two new metrics for the graphs we want to create.
00:22:41.510 The first metric is the number of executed jobs, successful or not. This is a counter since it’s simply event tracking, which cannot be undone. The second is job runtime, measuring how long it took to complete the job, which is quantified with a histogram. The third metric is queue size, which we can see changing as our application pushes jobs to queues and workers execute them. This is a gauge.
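For illustration, registering these three metrics with the prometheus-client gem might look roughly like the sketch below. The metric names, label names, and bucket boundaries are my own placeholders, and the exact API depends on the gem version (this assumes prometheus-client 0.10 or later).

```ruby
# A minimal sketch of registering the three metrics; names are illustrative.
require "prometheus/client"

PROMETHEUS = Prometheus::Client.registry

# Counter: how many jobs were executed, successfully or not.
SIDEKIQ_JOBS_EXECUTED = PROMETHEUS.counter(
  :sidekiq_jobs_executed_total,
  docstring: "Total number of executed Sidekiq jobs",
  labels: [:queue, :worker, :success]
)

# Histogram: how long jobs took to run.
SIDEKIQ_JOB_RUNTIME = PROMETHEUS.histogram(
  :sidekiq_job_runtime_seconds,
  docstring: "Sidekiq job execution time in seconds",
  labels: [:queue, :worker],
  buckets: [0.5, 1, 2.5, 5, 10, 30, 60, 300]
)

# Gauge: how many jobs are currently waiting in each queue.
SIDEKIQ_QUEUE_SIZE = PROMETHEUS.gauge(
  :sidekiq_queue_size,
  docstring: "Number of jobs waiting in a Sidekiq queue",
  labels: [:queue]
)
```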
00:23:39.200 To gather metrics about job execution, we need to write a small Sidekiq middleware. It will record the start time, yield to the job itself, and upon completion increment the counter and record the time taken for execution. For the gauges, we will periodically query Redis to find out how many jobs are sitting in the queues and how many workers are currently busy.
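Here is a minimal sketch of such a middleware, plus a background thread for the queue-size gauge, building on the hypothetical metrics registered above; a real setup would need more careful error handling and configuration.

```ruby
require "sidekiq"
require "sidekiq/api" # for Sidekiq::Queue

# Server middleware: counts every executed job and measures its runtime.
class SidekiqMetricsMiddleware
  def call(worker, _job, queue)
    started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    success = false
    yield
    success = true
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at
    labels = { queue: queue, worker: worker.class.name }
    SIDEKIQ_JOBS_EXECUTED.increment(labels: labels.merge(success: success.to_s))
    SIDEKIQ_JOB_RUNTIME.observe(elapsed, labels: labels)
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware { |chain| chain.add(SidekiqMetricsMiddleware) }

  # Periodically ask Redis (via the Sidekiq API) how many jobs wait in each queue.
  config.on(:startup) do
    Thread.new do
      loop do
        Sidekiq::Queue.all.each do |queue|
          SIDEKIQ_QUEUE_SIZE.set(queue.size, labels: { queue: queue.name })
        end
        sleep 15
      end
    end
  end
end
```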
00:24:52.650 Prometheus will obtain the metric values from us by issuing a simple HTTP GET request. So, in our Sidekiq process, we'll need to launch a web server; it may sound odd, but that's how Prometheus works, using the pull model. Other monitoring systems, like New Relic, use a push model, where an agent inside your process periodically performs HTTP POST requests to the monitoring service.
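Exposing the registry over HTTP from inside the Sidekiq process could look roughly like this, using the exporter middleware that ships with prometheus-client; the port, the background thread, and the embedded WEBrick handler (Rack 2 era) are my assumptions.

```ruby
require "rack"
require "prometheus/middleware/exporter"

Sidekiq.configure_server do |config|
  config.on(:startup) do
    # Serve GET /metrics on a separate port so Prometheus can pull values from this process.
    Thread.new do
      app = Rack::Builder.new do
        use Prometheus::Middleware::Exporter # responds to GET /metrics with the registry contents
        run ->(_env) { [404, { "Content-Type" => "text/plain" }, ["Not found"]] }
      end
      Rack::Handler::WEBrick.run(app, Host: "0.0.0.0", Port: 9394)
    end
  end
end
```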
00:25:55.860 We can make the same request Prometheus does and look at the metric values ourselves. After executing some jobs, we might see, for instance, that 16 jobs of the EmptyJob worker were executed and 160 jobs of the WaitingJob worker were processed from the default queue. We can also examine the histogram data: for instance, nine jobs executed faster than half a second and 16 jobs executed faster than one second, which tells us how efficient our processing has been.
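For illustration, the GET /metrics response in the Prometheus text format might look something like this (metric names follow the sketches above, values echo the numbers just mentioned, and the HELP/TYPE comment lines are omitted). Note that histogram buckets are cumulative: the le="1" bucket includes the jobs already counted in le="0.5".

```
sidekiq_jobs_executed_total{queue="default",worker="EmptyJob",success="true"} 16
sidekiq_jobs_executed_total{queue="default",worker="WaitingJob",success="true"} 160
sidekiq_job_runtime_seconds_bucket{queue="default",worker="WaitingJob",le="0.5"} 9
sidekiq_job_runtime_seconds_bucket{queue="default",worker="WaitingJob",le="1"} 16
sidekiq_job_runtime_seconds_bucket{queue="default",worker="WaitingJob",le="+Inf"} 160
sidekiq_job_runtime_seconds_count{queue="default",worker="WaitingJob"} 160
sidekiq_queue_size{queue="default"} 0
```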
00:27:04.370 This compact data allows us to compute averages, minimums, maximums, and percentiles. From here, I can graph the results and visualize the metrics, although I will skip that lengthy explanation. In the end, we get a Grafana dashboard with graphs about Sidekiq that replicates everything shown in the Sidekiq web UI. And here we have the two new graphs that we wanted to obtain.
00:28:40.040 One of them shows the queue sizes changing over time, allowing us to see that our application has enough capacity: even when we push thousands of jobs into the system, we are able to process them quickly, even though some jobs take longer, like 20 or 30 seconds. Now, it's time to share a story about how monitoring helped me debug some problems.
00:30:05.340 In one situation, one of our queues became incredibly full. Without monitoring, I would have had no idea what was going on, how long it might take to resolve, or what steps to take. However, by observing the queues graph, we could see that we were slowly draining it, as the number of jobs was decreasing slowly but steadily over several hours. It would likely take days to fully clear the queue, putting us in a difficult position where we knew we needed to act quickly.
00:31:56.960 We then noticed a clear negative correlation between the rate at which jobs were leaving the queue and the average processing time. This makes logical sense: the longer it takes to process each job, the fewer jobs can be handled in the same timeframe. Throughout this process, I attempted various optimizations. For instance, when I scaled up the number of Sidekiq processes, I expected it would allow us to execute more jobs, but surprisingly it had no effect.
00:33:28.490 This prompted me to look at other metrics; it is crucial to measure everything about your application. When I checked the dashboard for our PostgreSQL database, I discovered that scaling up had also increased the number of locks in the database, which was affecting the jobs. It turned out that all those jobs belonged to one customer and were locking each other, due to a new feature we had released that led to a massive inflow of jobs.
00:34:42.110 This indicated a problem initiated by a single user pressing a shiny new button. The debugging process, thanks to monitoring, took only 1 to 5 minutes instead of hours, showcasing the effectiveness of our setup. Monitoring aids not only in tracking performance but can also trigger alerts. You have criteria to determine what is good or bad. A monitoring system can detect anomalies early, indicating when something is wrong in your application, allowing you to respond swiftly.
00:36:01.790 So, it's very useful! But can we improve this process? Yes, we can. We have created a framework, a set of gems that simplifies monitoring setup. For example, the Sidekiq integration is as simple as adding the gem to your Gemfile; it plugs into your existing Sidekiq setup. Additionally, you can export metrics to several monitoring systems at once, which helps when migrating between them.
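As a sketch, plugging the pieces from this talk together is roughly a Gemfile change (these gem names come from the Yabeda ecosystem):

```ruby
# Gemfile: a minimal sketch
gem "yabeda-sidekiq"     # out-of-the-box job and queue metrics for Sidekiq
gem "yabeda-prometheus"  # exposes everything Yabeda collects in Prometheus format
```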
00:36:24.020 Yabeda gives you a DSL for declaring metrics and hides the differences between monitoring systems. After setup, you only need to increment counters, observe histograms, and set gauge values directly in your code. Alternatively, you can collect metrics from external systems, such as Sidekiq queues, on a schedule and send them to the monitoring system.
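A rough sketch of that DSL follows; the group, metric, and tag names, as well as the Item model and publish_item! method, are my own illustrations rather than something from the talk.

```ruby
require "benchmark"
require "yabeda"

Yabeda.configure do
  group :marketplace

  counter   :items_published_total, tags: %i[site], comment: "Items pushed to eBay sites"
  histogram :publish_duration, unit: :seconds, tags: %i[site],
            buckets: [0.1, 0.5, 1, 5, 10, 30], comment: "Time to publish one item"
  gauge     :pending_items, comment: "Items waiting to be published"

  # For external systems, collect values on a schedule instead of instrumenting code.
  collect do
    Yabeda.marketplace.pending_items.set({}, Item.where(published: false).count)
  end
end

# Somewhere in the application code:
Yabeda.marketplace.items_published_total.increment({ site: "ebay.de" })

elapsed = Benchmark.realtime { publish_item! }
Yabeda.marketplace.publish_duration.measure({ site: "ebay.de" }, elapsed)
```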
00:37:06.860 Yabeda is still a young project, but it already has adapters for the use cases we have in our applications today. Yesterday, Nate Berkopec talked about the importance of monitoring request queuing time in application servers; Yabeda includes a Puma plugin for this metric.
00:37:31.360 Moreover, we have a metrics adapter developed not by me but by the community, which is a testament to the capabilities of open source. Thank you to Dmitriy for creating this advanced adapter. Open source truly excels!
00:38:05.040 To conclude, I encourage you to try Yabeda. We have an example application where you can clone the repository, run it using Docker Compose, and explore how metrics and graphs work within the application. Monitoring is great, but it will not fix your bugs automatically; instead, it speeds up debugging time, helping you understand ongoing issues faster.
00:39:05.000 You will gain confidence in comprehending your application's performance at any moment. This way, you’ll be able to immediately evaluate whether the optimizations you implemented were effective or not. Thank you very much for your attention. Please follow Evil Martians on Twitter, read our articles, and visit us on the second floor after this talk to collect stickers. We have plenty of Martian stickers. Now, if you have any questions, please feel free to ask.