Solid Queue Internals, Externals and all the things in between

Summarized using AI


Rosa Gutiérrez • September 26, 2024 • Toronto, Canada

Rosa Gutiérrez's presentation at Rails World 2024 focused on Solid Queue, a new backend for Active Job processing that is a default in Rails 8. The talk covered the evolution, development, and challenges of creating the system, especially after prior experiences with Resque and Redis. Key points from the presentation included:

  • Background: Rosa outlined how Solid Queue emerged as a response to the inefficiencies encountered with Resque, emphasizing her experience at the Rails World conference where it was announced. She humorously noted the motivation induced by public promises regarding project delivery.

  • Need for Solid Queue: Several existing components were lacking adequate functionality, prompting the team to develop internal gems to enhance job handling, including mechanisms to prevent job loss and manage overlapping jobs effectively.

  • Core Components: Rosa discussed the architecture of Solid Queue, featuring multiple internal gems and forks that facilitate job execution, scheduling, and management, particularly the challenges faced with Redis memory limits when handling millions of scheduled jobs.

  • Design Innovations: The presentation covered advanced database techniques such as SELECT ... FOR UPDATE SKIP LOCKED to optimize the job-claiming process and mitigate the contention issues seen in earlier systems.

  • Efficient Polling: The introduction of a hot table for efficient job polling, along with worker heartbeats that let the system release claims held by dead workers, was highlighted as a significant design feature.

  • Performance Metrics: The architecture achieved impressive performance metrics, processing about 20 million jobs daily across 74 virtual machines with 800 workers, while also allowing for bulk job processing.

  • Community Involvement: Rosa expressed gratitude for the Rails community’s collaboration in addressing challenges and bugs encountered, underscoring the importance of community support in software development.

  • Future Directions: The talk concluded with an invitation for further discussion and collaboration to enhance Solid Queue’s functionality as the team continues to test and refine the system based on real-world usage.

Overall, the presentation encapsulated the journey of Solid Queue from conception to deployment, emphasizing design decisions tailored for efficiency, community engagement, and a focus on performance improvements, marking a significant advancement for Rails developers handling background jobs.

Solid Queue Internals, Externals and all the things in between
Rosa Gutiérrez • September 26, 2024 • Toronto, Canada

After years of tackling background job complexities with Resque and Redis at 37signals, the team finally decided to build an out-of-the-box solution. Enter #SolidQueue, now a default in Rails 8. Rosa Gutiérrez presented Solid Queue at Rails World and shared the journey and the challenges the team faced to get it live.

Note: Rosa lost her voice the morning of this presentation but put on her game face and delivered the talk anyway.

#Rails #Rails8 #solidqueue #resque #redis

Thank you Shopify for sponsoring the editing and post-production of these videos. Check out insights from the Engineering team at: https://shopify.engineering/

Stay tuned: all 2024 Rails World videos will be subtitled in Japanese and Brazilian Portuguese soon thanks to our sponsor Happy Scribe, a transcription service built on Rails. https://www.happyscribe.com/

Rails World 2024

00:00:09.639 Hi everyone! My name is Rosa, as she introduced me. This is actually the first time I’ve lost my voice like this. What better moment to lose your voice than right before speaking at a major conference, right? I apologize for that, and I’ll do my best. I hope you enjoy this talk. Today, as they already said, we’ll be discussing Solid Queue, a new backend for Active Job that will be the default in Rails 8.
00:00:31.880 Let’s take a moment to reflect on Rails World last year in Amsterdam. This was a key moment in the history of Solid Queue. I’ll show you a video that my partner, who was with me at the conference, happened to record while David was announcing this idea. I didn’t want to look because I didn’t know he was going to announce it that way. Before that conference, Solid Queue didn’t really exist much. I mean, it existed, but it wasn’t good; it didn’t work very well. It was just there. I hadn’t worked on it for several months, so when the keynote happened, I got to work and did almost everything. I call this methodology ‘promise-driven development.’ I highly recommend that if you’re struggling with motivation and procrastination, you should tell your boss to take you to a conference and promise the delivery of your project in front of hundreds of people. I guarantee you will recover your motivation right away.
00:01:00.879 You’ve seen this already in David’s presentation and in some articles, so let’s see how we got here. As you see here, we have three internal gems that we built and two private forks of public gems. You may wonder why we needed all this. Let’s go through them one by one to see what we needed and what we built.
00:01:18.959 The first one is obvious: we need to run the jobs. The next one was built by my colleague Donal, the author of Solid Cache. The reason we built it was that, in Resque, you may lose jobs if your worker dies. Just imagine someone pulling the plug or something similar; you risk losing jobs, so Donal built this to handle that. Then we have Resque Pool, a public gem we use to easily run multiple workers in different processes. We also have Resque Scheduler, another public gem, to support jobs scheduled in the future and delayed jobs. In addition, we have Resque Pause, used to pause a queue, for example, if you are having trouble in production and need to pause a queue for any reason.
00:01:42.400 We also have sequential jobs, which is a tricky one. My colleague Jorge built it because we had cases where certain jobs in our applications must not overlap while running; not so much because of resources, but because of application logic. For instance, one action would enqueue a job, and another action would undo what the first job did. When they ran in parallel, it created a mess. You could argue this is a design flaw in our application logic, but we had this issue all over, so we decided to tackle it at the job level, ensuring those jobs run sequentially. Finally, we have scheduled jobs, also built by Jorge, born out of the trouble we ran into when scheduling millions of jobs to run far in the future, talking about 30 to 90 days in advance.
00:02:03.760 The problem we faced was that we used Resque and Redis, and those scheduled jobs filled all available memory on the Redis server. This meant we couldn’t enqueue more jobs, which was a nightmare. This gem became a tiny, tiny seed for Solid Queue, because it uses the database to store jobs scheduled in the future, and it also has built-in capabilities to dispatch jobs in batches very fast, which is something we need in our applications when we send reports and so on.
00:02:26.560 With this list, we could draw up our first set of requirements for a new system. We wanted to build it using HEY as our test app. Another requirement was supporting roughly 19 million jobs per day; that was back then, we’re now around 20 million. We also needed to support a polling interval of 100 milliseconds. So, when we started on this, we wondered what would happen if we used a relational database, inspired by Solid Cache, which we were already running successfully in production.
00:02:51.840 We had some successful examples of this in the wild; GoodJob, for example, would actually have been great, but it only supports PostgreSQL, so we couldn’t use it directly. If we built on a relational database, we could also leverage Active Record to keep the gem simple. Some queueing gems are very complex, but we wanted something much simpler that you could open and understand. And finally, as David mentioned in his keynote, we could offer something that works out of the box, without any other moving pieces to configure.
00:03:03.080 If we fulfilled those goals, this could aspire to be a Rails default. But for it to be a Rails default, we had to add a couple more requirements: we had to support the three major databases that Rails supports, along with all the Active Job features. The only one missing from our own list was priorities, which we didn’t use ourselves, but we added support for it.
00:03:29.560 Now let’s get to the meat of the talk. Traditionally, before we started using Resque and Sidekiq, there was a very popular gem called Delayed Job. It used the database, but the classic problem was that if you ran multiple workers against the same queue, you would run into contention issues. Resque, on the other hand, pops jobs atomically from Redis and avoids the contention problem. Let me illustrate with a quick example how this would look in a database.
00:03:53.280 Imagine this is your query to pull jobs, how workers get the next jobs to work on. If we run this query over that table, it might return jobs 1 and 2. We then need FOR UPDATE to lock those rows, because otherwise another worker could run the same query and claim those jobs, leading to them being executed multiple times. Thus, we need to lock the rows. While this is happening, all the other workers running the same query will be waiting, since those rows are locked.
00:04:01.760 Now, the worker that got there first and claimed the jobs will update a boolean to indicate that it has claimed them and then will commit the transaction. Only then will the other workers that were waiting see the next two rows available. Essentially, we have workers waiting for each other because of this design.
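To make that concrete, here is a minimal sketch of the naive claiming pattern in Active Record terms. The `Job` model, its `claimed` column, and the batch size are illustrative, not Solid Queue's actual schema:

```ruby
# Naive claiming: every worker runs this same transaction. Plain
# `lock` issues SELECT ... FOR UPDATE, so any other worker running
# the same query queues up behind these row locks until the commit.
Job.transaction do
  jobs = Job.where(claimed: false)
            .order(:id)
            .limit(2)
            .lock # SELECT ... FOR UPDATE
            .to_a

  # Mark the locked rows as claimed, then commit to release the locks.
  Job.where(id: jobs.map(&:id)).update_all(claimed: true)
end
```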
00:04:30.000 So, how did we solve this? We used SELECT ... FOR UPDATE SKIP LOCKED, a feature that has been available in PostgreSQL for a while and in MySQL from version 8 onwards. It allows a query to skip the rows locked by the first worker while still processing the available rows.
00:04:39.760 With ‘skip locked,’ when the second worker tries to claim jobs, it simply skips the rows that are already locked. Therefore, the workers can now claim jobs at the same time without conflicting over the same jobs. By implementing this, we have resolved the contention issue. However, we wanted to keep polling as efficient as possible by ensuring that our polling table remains small.
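The skip-locked variant changes only the locking clause. Passing a raw clause to `lock` is standard Active Record; the model is still illustrative:

```ruby
# Same claim, but rows locked by another worker are skipped rather
# than waited on, so concurrent workers each grab a disjoint set of
# jobs and never block one another.
claimed = Job.transaction do
  jobs = Job.where(claimed: false)
            .order(:id)
            .limit(2)
            .lock("FOR UPDATE SKIP LOCKED")
            .to_a

  Job.where(id: jobs.map(&:id)).update_all(claimed: true)
  jobs
end
```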
00:04:57.720 In our first prototype, we established a dedicated executions table, the ‘hot table,’ containing only the jobs ready to be executed. When we get a scheduled job, we store it in a different table, so we can have millions of scheduled jobs without slowing down polling. When their time comes, they are moved over for workers to claim, and polling is never bogged down by jobs that aren’t ready yet.
00:05:17.440 This is how the first design looks. We have five tables; only one table actually stores the job data. The other tables exist primarily for metadata and holding execution data for the system to function. We introduced different agents to work over those tables. We have workers and dispatchers. The workers will pull from the execution table while the dispatchers will be pulling from the scheduled job table. We added a supervisor to manage those processes, and when the dispatcher picks jobs ready to execute, it will move them to the execution table. The same applies to the workers, moving jobs from ‘ready’ to ‘claimed.’
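A rough sketch of what one dispatcher pass might do under this design, with illustrative model names standing in for the scheduled and ready tables:

```ruby
# One dispatcher pass: find scheduled jobs whose time has come and move
# them into the small, hot ready-executions table that workers poll.
class Dispatcher
  def dispatch_due_jobs(batch_size: 500)
    ScheduledExecution.transaction do
      due = ScheduledExecution.where(scheduled_at: ..Time.current)
                              .limit(batch_size)
                              .lock("FOR UPDATE SKIP LOCKED")
                              .to_a
      next if due.empty?

      ReadyExecution.insert_all(
        due.map { |e| e.slice(:job_id, :queue_name, :priority) }
      )
      ScheduledExecution.where(id: due.map(&:id)).delete_all
    end
  end
end
```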
00:05:40.079 With this first prototype, we were able to meet quite a few requirements. However, the design introduced a new issue: because jobs are claimed and moved to another table, if a worker dies, its jobs stay claimed forever, which is not desirable. So we introduced a process registry where workers send heartbeat signals. We created a new table called ‘solid_queue_processes’ to store the last heartbeat, enabling us to detect whether a worker is alive, and we linked all claimed jobs to their process.
00:06:06.400 If a worker’s heartbeat goes stale, we consider it dead and can release all the jobs it held. The supervisor periodically checks whether workers are alive and takes care of releasing those jobs, so that checked off another requirement. Next, pausing queues: that one is straightforward; we just need a way to mark a queue as paused, which led us to another table, ‘solid_queue_pauses.’
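A minimal sketch of the heartbeat-and-reaping idea; the model names, the `release_back_to_ready` helper, and the thresholds are all invented for illustration:

```ruby
# Workers refresh their heartbeat on a timer; the supervisor treats a
# stale heartbeat as a dead process and releases the jobs it claimed.
HEARTBEAT_INTERVAL = 60.seconds
ALIVE_THRESHOLD    = 5.minutes

# In each worker, every HEARTBEAT_INTERVAL:
def heartbeat(process_record)
  process_record.update!(last_heartbeat_at: Time.current)
end

# In the supervisor, periodically:
def release_jobs_from_dead_processes
  WorkerProcess.where(last_heartbeat_at: ...ALIVE_THRESHOLD.ago).find_each do |dead|
    # Hypothetical helper that moves a claimed execution back to ready.
    ClaimedExecution.where(process_id: dead.id).find_each(&:release_back_to_ready)
    dead.destroy!
  end
end
```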
00:06:34.760 Now the challenging part: ensuring we can run the required volume of jobs and manage the polling effectively. This involved fine-tuning the polling queries since they were still running too slowly and burdening the database.
00:06:54.080 Let’s take a look at the ready executions table, our hot table. Only three columns are relevant for polling. We wanted to support the specific ways we run our jobs and the various methods of polling, which include single queues, multiple queues of varying priority, and wildcard queue names.
00:07:06.480 To keep the queries fast, we implemented two indexes: one to poll the table by queue name and another specifically for polling with a wildcard. The queries we run follow a consistent pattern, which is crucial: if you filter by more than one queue name, the index cannot be used, resulting in a filesort, which slows things down and increases the number of rows examined.
00:07:24.080 In the current version of Solid Queue, when using two queues, we perform two polling queries—one for each queue sequentially. When using the wildcard ‘all queues,’ we don’t specify any queue names, and the other index handles that. These indexes are designed to expedite those queries significantly. We continuously strive to execute the minimum number of queries, achieving about 4,900 polling queries per second, typically taking around 110 microseconds to complete.
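The index pairing described here might look roughly like the following migration; table and index names are illustrative, so check Solid Queue's actual schema for the real ones:

```ruby
# Two indexes over the hot table: one for polling a specific queue,
# one for the "all queues" wildcard case. Each polling query matches
# one index exactly, so it scans in index order and examines almost
# no rows.
class AddPollingIndexesToReadyExecutions < ActiveRecord::Migration[8.0]
  def change
    add_index :ready_executions, [:queue_name, :priority, :job_id],
              name: "index_ready_executions_for_polling_by_queue"
    add_index :ready_executions, [:priority, :job_id],
              name: "index_ready_executions_for_polling_all_queues"
  end
end
```

This is also why polling two queues issues two sequential single-queue queries: one query filtering on both queue names couldn't use the index order.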
00:07:50.480 Most of these queries do not even examine rows, averaging less than one row per query, so polling runs very quickly and poses no issues with our current setup. That satisfied our requirement for efficient polling under load. The second challenging requirement was sequential jobs.
00:08:14.720 Traditionally, you would achieve this by checking limits and restrictions during polling. However, I was resistant to touching the already fast polling queries. Instead, we reframed sequential jobs as a concept we call concurrency controls. The first step was realizing that, for most jobs that must not overlap, order wasn’t actually essential; we only had a few cases where order was significant.
00:08:37.440 We made adjustments in the application to allow those jobs to run in any order. Then we introduced limit checks at enqueue time rather than during polling: before a job becomes ready, it needs to acquire a semaphore that allows it to proceed; if it can’t, it waits. This resulted in various execution types: scheduled, claimed, and blocked executions. Yes, more tables!
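In Solid Queue as shipped, this surfaces as a declaration on the job class. The sketch below uses the gem's documented concurrency-controls declaration with an invented job class and key:

```ruby
# At enqueue time the job tries to acquire a semaphore for its key; if
# the semaphore is exhausted, the job is stored as a blocked execution
# and released when a job holding the same key finishes.
class ProcessInboxJob < ApplicationJob
  # Allow at most one concurrent job per account.
  limits_concurrency to: 1, key: ->(account) { account.id }

  def perform(account)
    # ... application logic ...
  end
end
```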
00:09:05.760 The new blocked-execution state introduced overhead, because more database operations are required when enqueuing a job that is concurrency-controlled. The worst case is a scheduled job that is also concurrency-controlled: that can require up to 11 write operations on the database.
00:09:36.000 Looking at our numbers, we initially peaked at about 400 jobs per second. We determined our system needed to support this load, which prompted us to separate the job system from our main app and move it to a dedicated database. The Rails 8 default does the same: installing Solid Queue sets up a separate database from the beginning.
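The separate-database wiring amounts to pointing Solid Queue at its own connection. A sketch of the kind of configuration the installer generates, assuming a `queue` database is defined in `config/database.yml` (your generated files may differ):

```ruby
# config/environments/production.rb
# Run Solid Queue through a dedicated database, keeping job churn out
# of the primary application database.
config.solid_queue.connects_to = { database: { writing: :queue } }
```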
00:10:04.480 This is much easier to manage than migrating to a separate database once the system is already operational. With that, we had support for sequential jobs; we also added recurring tasks, like cron jobs. Time is limited here, so I’ll leave that one as an exercise for you: all that’s needed is a couple more tables, and you’ll find it’s doable.
00:10:28.080 There is so much more to share: we’ve added support for job operations, logging, and instrumentation, and we’ve built a dashboard to manage and inspect your queues, which looks great. Today we run Solid Queue for HEY, processing about 20 million jobs per day across 74 virtual machines, with 800 workers dedicated to job processing.
00:10:51.760 While that provisioning may seem excessive, our earlier experiences, particularly the launch of HEY Calendar, led us to over-provision workers in anticipation of potential issues. Overall, this has let us maintain latency similar to Resque while improving throughput and efficiency, and adding capabilities like bulk enqueuing: Resque didn’t support bulk enqueuing, but Solid Queue provides it out of the box.
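Bulk enqueuing goes through Active Job's bulk API, which Solid Queue backs with batched inserts. For example, with an invented job class and collection:

```ruby
# Enqueue many jobs in a few batched INSERTs instead of one round trip
# per job (Active Job's perform_all_later, Rails 7.1+).
jobs = users.map { |user| MonthlyReportJob.new(user) }
ActiveJob.perform_all_later(jobs)
```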
00:11:16.060 During the migration to Solid Queue, we did not see any impact on our global response times, even though enqueuing in Solid Queue is slower than in Resque. A request typically enqueues zero or one job, so the difference doesn’t move overall response time meaningfully.
00:11:38.600 As my boss, David, said earlier, good frameworks are extracted, not invented. I’m not sure Solid Queue qualifies as a proper framework, and parts of it developed organically. But version 1.0.0 was released today, right before I came here, and it has undergone rigorous testing: we migrated all of our jobs and queues one by one, which involved substantial effort and nuanced tweaks across many deployments.
00:12:06.960 Rigorous testing means facing challenging situations, persistent fine-tuning alongside the app, and managing numerous deployments. There have been incidents, but during the transition no feature in Solid Queue was released without being extensively tested in production first; essentially, nothing shipped without being validated by running our own jobs at scale.
00:12:30.160 We prefer to call this process ‘fun™ tested’ rather than ‘battle tested,’ since that term evokes notions of warfare, which we want to avoid. More accurately, it’s about the resilience nurtured by facing adversity together as a team across the many challenges we tackle daily.
00:12:53.840 Now I’d like to ask for your help in making it even better. We have significant experience running MySQL in production, but we don’t have as much visibility into issues with PostgreSQL and SQLite. We are already seeing community collaboration: several community members have helped address issues we hit on PostgreSQL, or issues stemming from other setups.
00:13:11.080 For instance, a concerning database bug was identified and addressed by Andy CW. Community support has been indispensable in tackling these issues proactively. Even though I strive to ship only well-vetted features based on our own experience, I can’t possibly run every system and hit every potential issue by myself.
00:13:35.679 This is where the Rails community excels, ensuring we receive timely assistance. To wrap up, I want you to know about my little dog, Moi. I usually take her with me to many conferences, hoping she might help with my talks. Poodles are highly intelligent, but unfortunately, bringing her to Canada is not straightforward. However, I brought some stickers of her for anyone interested.
00:14:00.000 If you want to discuss queues or jobs or if you fancy a sticker, please find me. I might not approach you as I will be quiet for the rest of the day, but I would be delighted to meet you and engage in conversation. Thank you very much!