Salim Semaoune

Scale Background Queues

Paris.rb Conf 2020

00:00:14.700 Hello everybody! I'm really glad to be with you today. I would like to talk about background queues, which is a very important topic in the Ruby world. First, I want to apologize because I am a little bit sick; that's why I'm sitting right now. So don't hesitate to throw tomatoes or things like that at me, and if I throw up on stage, it will be the best conference of the year! Let's go!
00:00:29.380 First, let me introduce myself. I’m Salim, and I’m a software engineer at Canto. Canto is a bank, and we are building everything in Ruby, transitioning from COBOL to Ruby. This is quite a huge gap, but it’s happening right now in France. I’ve been working with Ruby for quite a long time; I started my professional career in 2008, and I have enjoyed my journey so far.
00:00:54.280 As with any talk about scalability, I feel obliged to share the definition from Wikipedia. Scalability is the property of a system to handle a growing amount of work by adding resources to it. We will explore how this principle is applied in the Ruby world, especially in web applications. I'm sure you have all heard of libraries like Sidekiq, Resque, and Delayed Job. These are among the most mentioned solutions when discussing how to scale a web application built on Rails. Most of the time, if you look online, you’ll find advice such as 'just use background queues', meaning Sidekiq or Resque. This is a clever solution: moving slow work out of the request cycle reduces latency, so each request gets its response sooner and your web processes are free to serve more requests.
00:01:49.179 So, how do we scale our job queues? I am sure this is a question that hasn’t been asked often, and for a good reason. You can get really far by using Sidekiq. However, the day may come when you start experiencing problems—such as increasing latency—and I’m sure you won’t be prepared for it. This happened to me a couple of years ago, and I have been searching for solutions ever since. My first piece of advice is to measure everything. If you look at the Sidekiq web interface right now, you won’t find that much useful information. You will see the number of jobs currently in the queue, the number of failed jobs, and other latency metrics, but it lacks comprehensive insights that show why things are failing in your queues.
00:02:53.370 At Canto, we measure various metrics to understand our job processing better. We have dashboards that display time spent on jobs, latencies within queues, and the number of failed jobs, among other things. This gives us a better picture of what is happening in our system. Interestingly, on the day I took some of my screenshots, we had zero failed jobs, which is a rarity in our experience, but the dashboard was only showing a short five-minute window. In reality, you may soon realize that your biggest issue comes from a single job blocking the entire queue. It’s similar to head-of-line blocking in network programming, where one slow item holds up everything queued behind it.
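To make the kind of measurement described above concrete, here is a minimal sketch in plain Ruby. It assumes each job carries the timestamp at which it was enqueued; the `JobMetrics` class and its `measure` method are hypothetical names for illustration, not part of Sidekiq or of the dashboards mentioned in the talk.

```ruby
# Hypothetical in-memory recorder. A real setup would ship these numbers
# to a metrics backend instead of keeping them in arrays.
class JobMetrics
  attr_reader :latencies, :durations, :failures

  def initialize
    @latencies = [] # seconds each job waited in the queue before starting
    @durations = [] # seconds each job spent running
    @failures = 0   # number of jobs that raised an error
  end

  # enqueued_at: monotonic timestamp (in seconds) taken when the job was queued.
  def measure(enqueued_at)
    started = now
    @latencies << (started - enqueued_at)
    yield
  rescue StandardError
    @failures += 1
    raise
  ensure
    @durations << (now - started) if started
  end

  private

  def now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

metrics = JobMetrics.new
enqueued_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
metrics.measure(enqueued_at) { sleep 0.01 } # stand-in for a real job
```

Tracking queue latency separately from run time is what lets you tell "the queue is backed up" apart from "this job is slow".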
00:04:13.310 When this issue arises, the first obvious solution is to add more workers. Adding more workers means scaling up with additional machines, which inevitably translates to higher costs. While this might seem like a straightforward solution, since it matches the scalability definition we discussed, it's not always clever, especially from a financial standpoint. Scaling with more machines means you keep paying more and more just to maintain the same level of service.
00:05:04.890 Next, consider adding more queues. You may have critical jobs that need immediate attention, such as payments, while slow jobs that call third-party services can be moved to their own queues so they do not interfere. This separation sometimes works, but it has limitations. For example, in a typical Sidekiq setup, you often categorize jobs into critical, medium, and low priority queues. As issues arise, however, you can end up with numerous queues, over a hundred, each named after a service provider or an expected processing time, which leads to complexity and chaos. This follows from how Sidekiq and Resque work: both rely on Redis and process each queue in FIFO order, so a job cannot overtake the jobs ahead of it in the same queue.
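The effect of FIFO ordering on a shared queue can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: a single worker, all jobs enqueued at time zero, and made-up job names and durations. `fifo_wait_times` is a hypothetical helper, not a Sidekiq or Resque API.

```ruby
# Simulate one worker draining a FIFO queue.
# Each job is [name, duration_in_seconds]; all jobs are enqueued at time 0.
def fifo_wait_times(jobs)
  clock = 0
  jobs.each_with_object({}) do |(name, duration), waits|
    waits[name] = clock # time the job spent waiting before it started
    clock += duration
  end
end

jobs = [["slow_third_party_call", 30], ["send_otp", 1], ["send_email", 1]]
waits = fifo_wait_times(jobs)
# The one-second OTP job waits 30 seconds behind the slow job.
```

Moving the slow job to its own queue fixes this particular case, which is why the number of queues tends to grow over time.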
00:06:55.850 Over time, you may notice that marketing campaigns or other tasks could saturate your critical queues, causing job processing latency to skyrocket. We’ve experienced situations where jobs were delayed for more than two hours, which is problematic when you're enforcing one-time passwords, for instance. So, what smarter solutions exist? One potential strategy is to implement auto-scaling features to dynamically increase and decrease the number of workers based on load. I’ve seen many blogs discussing this approach, and it is beneficial in lowering latency and ensuring a quality service experience.
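As a rough illustration of the auto-scaling idea, the sketch below computes how many workers would be needed to drain the current backlog within a target latency. The sizing rule, the `desired_workers` name, and the default bounds are assumptions made for this example; real autoscalers typically also smooth their inputs and add cooldown periods.

```ruby
# Hypothetical sizing rule: enough workers to drain the backlog
# within the target latency, clamped between min and max.
def desired_workers(queue_size:, avg_job_seconds:, target_latency_seconds:, min: 1, max: 20)
  needed = (queue_size * avg_job_seconds / target_latency_seconds.to_f).ceil
  needed.clamp(min, max)
end

desired_workers(queue_size: 300, avg_job_seconds: 2, target_latency_seconds: 60) # => 10
```

The `max` bound matters: without it, a traffic spike translates directly into an unbounded bill.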
00:08:21.580 Another method is to set timeouts for your jobs. If you know that a call to a third-party service is not going to succeed once it has taken longer than a certain time, you can set a shorter timeout so the job fails quickly rather than waiting in vain. Another technique, used especially by larger companies, is circuit breaking. The idea is to recognize when a service is down and to stop sending HTTP requests to it until it is operational again. However, both methods produce failed jobs, which still need to be retried and can overwhelm your system once again.
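A circuit breaker can be sketched in a few lines of plain Ruby. This is a minimal illustration, not the implementation of any particular gem: the `CircuitBreaker` class, its parameters, and the injectable `clock` are all hypothetical, and the sketch is not thread-safe.

```ruby
# Hypothetical minimal circuit breaker. After `failure_threshold`
# consecutive failures the circuit opens, and calls are rejected
# immediately until `reset_after` seconds have passed.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(failure_threshold: 3, reset_after: 30,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @failure_threshold = failure_threshold
    @reset_after = reset_after
    @clock = clock
    @failures = 0
    @opened_at = nil
  end

  def call
    raise OpenError, "circuit open, skipping call" if open?

    result = yield
    @failures = 0 # a success closes the circuit again
    @opened_at = nil
    result
  rescue OpenError
    raise
  rescue StandardError
    @failures += 1
    @opened_at = @clock.call if @failures >= @failure_threshold
    raise
  end

  def open?
    return false unless @opened_at

    (@clock.call - @opened_at) < @reset_after
  end
end
```

A job would wrap its outgoing request in `breaker.call { ... }`. Once the reset period has elapsed, the next call is let through as a trial; if it fails, the circuit opens again straight away.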
00:09:52.800 During my analysis of Ruby stack performance, I evaluated where the time is being spent. A graph indicated that while most of the time was consumed by the Ruby virtual machine (VM), a significant portion was also lost waiting on Postgres and HTTP calls. This indicated under-utilized CPU time: the system was essentially idle, just waiting for responses. I believe there likely exists a more efficient solution that allows us to better utilize this waiting time to minimize latency.
00:11:23.120 A well-established concept that addresses this issue is the event loop. This approach is especially useful when dealing with slow I/O operations. Servers like Puma or Nginx are cited as examples of this event-driven style. The idea behind an event loop is that a single process can handle many tasks concurrently: instead of waiting for one response before moving on, it starts the next task and comes back when the response is ready.
00:12:06.329 Implementing an event loop in Ruby can be tricky, but it helps to know that there are existing libraries, such as EventMachine, which have been around for a while and are reliable. More recently, the Async framework has emerged and gained attention, particularly from the Ruby community, thanks to notable figures like Samuel Williams. He developed a web server named Falcon that employs asynchronous I/O and event loops using pure Ruby, and it's showing better scalability compared to previous solutions like Puma or Unicorn.
00:13:13.600 So, how do fibers factor into this equation? Fibers are a language construct in Ruby that has largely been underutilized until now. They give you better control over the flow of your program by letting you pause execution and resume it later. With fibers, you can interleave many jobs inside a single thread instead of dedicating a thread to each job. Unlike traditional threads, which come with synchronization challenges, fibers are scheduled cooperatively and provide a much lighter and more efficient option for concurrency.
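The pause-and-resume behaviour can be seen with Ruby's built-in `Fiber` class. This small sketch only shows the control flow; the `log` array is there to make the execution order visible.

```ruby
log = []

fiber = Fiber.new do
  log << "step 1"
  Fiber.yield :paused # hand control back to the caller
  log << "step 2"
  :done
end

first = fiber.resume  # runs until Fiber.yield, returns :paused
log << "caller does other work"
second = fiber.resume # continues after Fiber.yield, returns :done

# log is now ["step 1", "caller does other work", "step 2"]
```

Nothing runs in parallel here: the fiber and its caller explicitly take turns, which is why no locks are needed.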
00:14:54.570 Now, let’s explore how to implement a Sidekiq-like system with fibers. Whenever a job performs an I/O operation, its fiber yields control. While the operation is pending, you add the I/O object or file descriptor to a monitoring list, and when the data is ready, you resume the fiber. This fundamentally changes how we handle job queues by shifting to a non-blocking model.
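The yield, monitor, and resume cycle described above can be sketched with nothing but the standard library. This is a toy illustration and not the code from the talk or from the Async gem: `TinyLoop`, `spawn`, and `wait_readable` are hypothetical names, pipes stand in for network sockets, and there is no error handling.

```ruby
require "fiber" # needed for Fiber.current before Ruby 3.1

# Toy event loop: each job runs in a fiber and parks itself while waiting on IO.
class TinyLoop
  def initialize
    @waiting = {} # IO object => fiber parked until that IO is readable
  end

  # Start a job in its own fiber; it runs until it finishes or waits on IO.
  def spawn(&job)
    Fiber.new(&job).resume
  end

  # Called from inside a job: park the current fiber until io is readable.
  def wait_readable(io)
    @waiting[io] = Fiber.current
    Fiber.yield
  end

  # Event loop: block until some IO is ready, then resume its fiber.
  def run
    until @waiting.empty?
      ready, = IO.select(@waiting.keys)
      ready.each { |io| @waiting.delete(io).resume }
    end
  end
end

reactor = TinyLoop.new
results = []
slow_r, slow_w = IO.pipe
fast_r, fast_w = IO.pipe

reactor.spawn { reactor.wait_readable(slow_r); results << slow_r.gets.chomp }
reactor.spawn { reactor.wait_readable(fast_r); results << fast_r.gets.chomp }

# The "slow service" answers after 50 ms; the fast one answers immediately.
writer = Thread.new { sleep 0.05; slow_w.puts "slow job done" }
fast_w.puts "fast job done"

reactor.run
writer.join
```

Although the slow job was started first, the fast job completes first, because the single thread is never stuck waiting on the slow pipe.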
00:15:26.100 Here’s a simple example of what using fibers with a Sidekiq-like system could look like. All the code within would be wrapped in an async block. This starts the event loop, allowing you to monitor I/O and resume jobs based on their readiness. Although this primitive implementation lacks robust error management, it illustrates the concept behind utilizing fibers.
00:16:00.240 There is a caveat when using popular libraries: not all of them are aware of or optimized for the event-driven model with fibers, so some existing libraries need to be modified or swapped out. In particular, commonly used libraries that perform blocking I/O need fiber-aware alternatives. The good news is that replacements are available for most Ruby libraries, so you can stay compatible without major alterations to your code.
00:17:08.940 Let’s discuss a library I developed called Quick, which acts as a replacement for Sidekiq. Although it's not fully battle-tested, the aim is to use fibers to achieve high concurrency and improved performance with asynchronous I/O. In testing, I was able to queue 10,000 jobs and process them in about eight seconds on a single thread. In contrast, using 10 threads with Sidekiq took about 1,000 seconds, or nearly seventeen minutes, to handle the same job load.
00:18:46.730 In summary, Quick showcases the potential advantages of using async I/O, allowing a significant increase in job processing speed—up to 125 times faster than traditional methods. However, one must be cautious about benchmarking results; real-world performance can differ after accounting for setup time and other overheads. Looking to the future, there is much discussion around concurrency in Ruby 3, including concepts surrounding fibers. Initiatives are underway to enhance Ruby's performance concerning asynchronous operations, making it an exciting time for Ruby developers.
00:20:55.360 As I wrap up, I encourage interest and contributions to this emerging field. There's a real need for better handling of I/O operations in the Ruby language, and ongoing discussions signal a promising path forward. Thank you for your time, and if anyone has questions, I’m happy to take them.