00:00:09.639
Hi everyone! My name is Rosa, as she introduced me. This is actually the first time I’ve lost my voice like this. What better moment to lose your voice than right before speaking at a major conference, right? I apologize for that, and I’ll do my best. I hope you enjoy this talk. Today, as they already said, we’ll be discussing Solid Queue, a new backend for Active Job that will be the default in Rails 8.
00:00:31.880
Let’s take a moment to reflect on Rails World last year in Amsterdam. This was a key moment in the history of Solid Queue. I’ll show you a video that my partner, who was with me at the conference, happened to record while David was announcing this idea. I didn’t want to look because I didn’t know he was going to announce it that way. Before that conference, Solid Queue didn’t really exist much. I mean, it existed, but it wasn’t good; it didn’t work very well. It was just there. I hadn’t worked on it for several months, so when the keynote happened, I got to work and did almost everything. I call this methodology ‘promise-driven development.’ I highly recommend that if you’re struggling with motivation and procrastination, you should tell your boss to take you to a conference and promise the delivery of your project in front of hundreds of people. I guarantee you will recover your motivation right away.
00:01:00.879
You’ve seen this already in David’s presentation and in some articles, so let’s see how we got here. As you see here, we have three internal gems that we built and two private forks of public gems. You may wonder why we needed all this. Let’s go through them one by one to see what we needed and what we built.
00:01:18.959
The first one is obvious: we need to run the jobs. The next one was built by my colleague Donal, the same author of Solid Cache. The reason we built this was because, in Resque, you may lose jobs if your worker dies. Just imagine someone pulling the plug or something similar; you might risk losing jobs. So, Donal built this to handle that. Then we have resque-pool, a public gem we use to easily run multiple workers in different processes. We also have resque-scheduler, another public gem, to support jobs scheduled in the future and delayed jobs. In addition, we have resque-pause, used to pause a queue, for example, if you are having trouble in production and need to pause a queue for any reason.
00:01:42.400
We also have sequential jobs, which is a tricky one. My colleague Jorge built this one because we had cases where we needed to ensure certain jobs in our applications didn’t overlap while they were running. Not so much because of resources, but because of the application logic. For instance, one action would enqueue a job to do something, and another action would enqueue a job that undoes what the first one did. When they ran in parallel, it created a mess. You could argue this is a design flaw in our application logic, but we had this issue all over, so we decided to tackle it at the job level, ensuring those jobs run sequentially. Finally, we have scheduled jobs, also built by Jorge, because of the trouble we ran into when scheduling millions of jobs to run in the far future, talking about 30 to 90 days in advance.
00:02:03.760
The problem we faced was that we used Resque and Redis, which led to filling all the available memory on the Redis server. That meant we couldn’t enqueue more jobs, which was a nightmare. This gem became a tiny seed for Solid Queue, because it uses the database to store jobs scheduled in the future and also has built-in capabilities to dispatch jobs in batches very fast, which is something we need in our applications when we send reports, etc.
00:02:26.560
With this list, we could put together our first set of requirements for a new system. We also wanted to build it using HEY as our test app. Another requirement was that we needed to handle roughly 19 million jobs per day; that was back then, now we’re around 20 million. We also needed to support a polling interval of 100 milliseconds. So, when we started on this, we wondered what would happen if we used a relational database, inspired by Solid Cache, which we were already running successfully in production.
00:02:51.840
We had some successful examples of this in the wild; GoodJob would actually have been great, but it only supports PostgreSQL, so we couldn’t use it directly. If we used a relational database, we could also leverage Active Record to make the gem simpler. Some gems are very complex, but we wanted something much simpler that you could open and understand. And finally, as David mentioned in his keynote, we could offer this as something that works out of the box without having to configure any other moving pieces.
00:03:03.080
If we fulfilled those goals, this could aspire to be a Rails default. However, if we wanted it to be a Rails default, we had to add a couple more requirements: we had to support the three major databases that Rails supports, and all of Active Job’s features. The only one missing from our own list was priorities, which we didn’t use ourselves, but we added support for them.
00:03:29.560
Now let’s get to the meat of the talk. Traditionally, before Resque and Sidekiq became popular, there was a very popular gem called Delayed Job. It used the database, but the classic problem with it was that if you ran multiple workers for the same queue, you would run into contention issues. Resque, on the other hand, works by polling multiple queues in Redis and avoids the contention problem. Let me illustrate with a quick example how this looks in a database.
00:03:53.280
Imagine this is the query your workers run to pull the next jobs to work on. If we run this query over that table, it might return jobs 1 and 2. We then need to add FOR UPDATE to lock those rows, because otherwise another worker could run the same query and claim those jobs, leading to them being executed multiple times. So we need to lock the rows. While this is happening, all the other workers running the same query will be waiting, since those rows are locked.
00:04:01.760
Now, the worker that got there first and claimed the jobs will update a boolean to indicate that it has claimed them and then will commit the transaction. Only then will the other workers that were waiting see the next two rows available. Essentially, we have workers waiting for each other because of this design.
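To make the contention concrete, here is a minimal sketch of that naive claiming pattern, using a hypothetical Job model with a `claimed` boolean column; it is only an illustration of the approach described above, not Solid Queue’s actual code.

```ruby
# Naive claiming: every worker runs this same query, so while one worker
# holds the row locks inside its transaction, the others simply wait.
class Job < ApplicationRecord
  def self.claim_next(limit = 2)
    transaction do
      # FOR UPDATE locks the selected rows until this transaction commits.
      jobs = where(claimed: false).order(:id).limit(limit).lock("FOR UPDATE").to_a
      where(id: jobs.map(&:id)).update_all(claimed: true)
      jobs
    end
  end
end
```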
00:04:30.000
So, how did we solve this? We used SELECT ... FOR UPDATE SKIP LOCKED, a feature that has been available in PostgreSQL for a while and in MySQL from version 8 onwards. This allows us to skip the rows that are locked by the first worker while still letting the others process the available rows.
00:04:39.760
With SKIP LOCKED, when the second worker tries to claim jobs, it simply skips the rows that are already locked. The workers can now claim jobs at the same time without conflicting over the same jobs. With that, we resolved the contention issue. However, we also wanted to keep polling as efficient as possible by ensuring the table we poll stays small.
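The same sketch from before, adjusted for this approach; the only change is the lock clause, which requires PostgreSQL 9.5+ or MySQL 8+.

```ruby
class Job < ApplicationRecord
  def self.claim_next(limit = 2)
    transaction do
      # Locked rows are skipped instead of waited on, so concurrent workers
      # end up claiming disjoint sets of jobs.
      jobs = where(claimed: false).order(:id).limit(limit)
                                  .lock("FOR UPDATE SKIP LOCKED").to_a
      where(id: jobs.map(&:id)).update_all(claimed: true)
      jobs
    end
  end
end
```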
00:04:57.720
In our first prototype, we established a dedicated executions table, the ‘hot table.’ This table contains only the jobs ready to be executed. When we get a scheduled job, we store it somewhere else, so we can have millions of scheduled jobs without slowing down polling. When their time comes, they are moved over and the workers can claim them, so polling is never bogged down by jobs that aren’t ready yet.
00:05:17.440
This is how the first design looks. We have five tables; only one of them actually stores the job data. The other tables exist for metadata and for holding execution state so the system can function. We introduced different actors to work over those tables: workers and dispatchers. The workers poll the ready executions table while the dispatchers poll the scheduled executions table. We added a supervisor to manage those processes. When the dispatcher picks up jobs that are ready to execute, it moves them to the ready executions table, and the workers do the same, moving jobs from ‘ready’ to ‘claimed.’
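To make the table split more concrete, here is a simplified, hypothetical sketch of that kind of schema as a Rails migration; the real Solid Queue schema has more tables, columns, and indexes, so treat the names here as illustrative only.

```ruby
# Sketch of the split: one table with the job data, plus small execution
# tables that the different actors poll and move rows between.
class CreateQueueTables < ActiveRecord::Migration[7.1]
  def change
    create_table :jobs do |t|                 # full job data lives here
      t.string   :class_name, null: false
      t.text     :arguments
      t.string   :queue_name, null: false
      t.integer  :priority, default: 0, null: false
      t.datetime :finished_at
      t.timestamps
    end

    create_table :ready_executions do |t|     # small, hot table that workers poll
      t.references :job, null: false
      t.string   :queue_name, null: false
      t.integer  :priority, default: 0, null: false
      t.datetime :created_at, null: false
    end

    create_table :scheduled_executions do |t| # future jobs, kept out of the hot table
      t.references :job, null: false
      t.string   :queue_name, null: false
      t.integer  :priority, default: 0, null: false
      t.datetime :scheduled_at, null: false
    end

    create_table :claimed_executions do |t|   # jobs currently held by a worker
      t.references :job, null: false
      t.bigint   :process_id
      t.datetime :created_at, null: false
    end
  end
end
```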
00:05:40.079
With this first prototype, we were able to meet quite a few requirements. However, this design introduced a new issue: because jobs are claimed by being moved to another table, if a worker dies, its jobs stay claimed forever, which is not what you want. So, we introduced a process registry where workers send heartbeats. We created a new table, solid_queue_processes, to store the last heartbeat, enabling us to detect whether a worker is alive or not, and we linked all claimed jobs to their process.
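A minimal sketch of what such a heartbeat-based registry can look like, using a hypothetical WorkerProcess model tied to the claimed executions from the earlier sketch; the intervals and method names are assumptions, not the actual Solid Queue implementation.

```ruby
class WorkerProcess < ApplicationRecord
  HEARTBEAT_INTERVAL = 60.seconds
  PRUNE_AFTER        = 5 * HEARTBEAT_INTERVAL

  has_many :claimed_executions, foreign_key: :process_id

  # Each worker calls this periodically to prove it is still alive.
  def heartbeat
    touch(:last_heartbeat_at)
  end

  # Processes that have not reported for a while are considered dead; their
  # claimed jobs are released so another worker can pick them up.
  def self.prune_dead
    where(last_heartbeat_at: ...PRUNE_AFTER.ago).find_each do |process|
      process.claimed_executions.each(&:release)
      process.destroy
    end
  end
end
```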
00:06:06.400
If a worker dies and its heartbeat goes stale, we can release all the jobs it was holding. This works because the supervisor periodically checks whether workers are alive and takes care of releasing those jobs. With that, we checked off this requirement. Next, pausing queues: that one is straightforward; we just need a way to mark a queue as paused, which led us to another table, solid_queue_pauses.
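A minimal sketch of how pausing can be modeled on top of such a table, assuming a hypothetical Pause model; the real Solid Queue API may differ.

```ruby
class Pause < ApplicationRecord
  def self.pause(queue_name)  = create_or_find_by!(queue_name: queue_name)
  def self.resume(queue_name) = where(queue_name: queue_name).delete_all
  def self.paused_names       = pluck(:queue_name)
end

# Pollers can then simply exclude paused queues before claiming work, e.g.:
#   queues_to_poll = configured_queues - Pause.paused_names
```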
00:06:34.760
Now the challenging part: ensuring we can run the required volume of jobs and manage the polling effectively. This involved fine-tuning the polling queries since they were still running too slowly and burdening the database.
00:06:54.080
Let’s take a look at the ready executions table, the hot table. Only three columns are relevant for polling. We wanted to support the way we run our jobs and the various methods of polling, which included polling a single queue, a list of queues in order of priority, and all queues via a wildcard.
00:07:06.480
To keep the queries fast, we added two indexes: one to poll the table by queue name and another specifically for polling all queues with the wildcard. The queries we run follow a consistent pattern, which is crucial, because if you put more than one queue name in a single query, the index can no longer be used for ordering; you end up with a filesort, which slows things down and increases the number of rows examined.
00:07:24.080
In the current version of Solid Queue, when you use two queues, we perform two polling queries, one for each queue, sequentially. When you use the ‘all queues’ wildcard, we don’t specify any queue names, and the other index handles that. These indexes are designed to make those queries very fast. We always strive to execute the minimum number of queries; we run about 4,900 polling queries per second, and they typically take around 110 microseconds to complete.
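The polling follows roughly this shape; this is a hedged reconstruction from the description above, reusing the hypothetical ReadyExecution model and simplified column names rather than the actual Solid Queue query.

```ruby
class ReadyExecution < ApplicationRecord
  # One query per queue, ordered so a (queue_name, priority, job_id) style
  # index can satisfy it without a filesort; the wildcard case omits the
  # queue_name filter and is served by the second index.
  def self.claimable(queue_name, limit)
    scope = queue_name == "*" ? all : where(queue_name: queue_name)
    scope.order(:priority, :job_id).limit(limit).lock("FOR UPDATE SKIP LOCKED")
  end
end

# Polling two queues means two sequential queries, each using one index:
#   ReadyExecution.claimable("real_time", 5)
#   ReadyExecution.claimable("background", 5)
```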
00:07:50.480
Most of these queries don’t even examine rows, averaging less than one row examined per query. So this runs very fast and doesn’t pose any issues with our current setup, and it satisfied our requirement of keeping polling cheap. The second challenging requirement was sequential jobs.
00:08:14.720
Traditionally, the way to achieve this would be to check for limits and restrictions during polling. However, I was reluctant to touch the polling queries we had just made so fast. Instead, we shifted our focus from sequential jobs to a concept we called concurrency controls. The first step was realizing that, for most of the jobs that must not overlap, order wasn’t actually essential; we only had a few cases where order was significant.
00:08:37.440
We made adjustments in the application to allow those jobs to run in any order. Then we decided to check the limits at the moment of enqueuing rather than during polling. This means that when a job is enqueued, it needs to acquire a semaphore that allows it to proceed; if it can’t, it waits. This approach gave us yet another execution type, blocked executions, alongside the ready, scheduled, and claimed ones. Yes, more tables!
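In Solid Queue this is declared on the job class itself; roughly like the example below, adapted from the gem’s documented concurrency controls (the contact/account arguments are only for illustration, so check the README for the exact options).

```ruby
class DeliverAnnouncementToContactJob < ApplicationJob
  # At enqueue time the job tries to acquire a semaphore for the computed key;
  # if it can't, it is stored as a blocked execution until a slot frees up.
  limits_concurrency to: 1, key: ->(contact) { contact.account }, duration: 5.minutes

  def perform(contact)
    # Only one of these jobs per account runs at a time.
  end
end
```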
00:09:05.760
The new blocked execution state introduces extra overhead, because more database operations are required when enqueuing a job that is concurrency-controlled. The worst case is a scheduled job that is also concurrency-controlled; that can require up to 11 write operations on the database.
00:09:36.000
Looking at preliminary numbers, we initially peaked at about 400 jobs per second. We knew our system needed to support that load, which prompted us to separate the job system from our main app and move it to a dedicated database. This is also the default in Rails 8: when you install Solid Queue, it is set up with a separate database from the beginning.
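In configuration terms, pointing Solid Queue at its own database comes down to a couple of lines, roughly what the Rails 8 defaults generate, assuming a `queue` database is declared in config/database.yml.

```ruby
# config/environments/production.rb
Rails.application.configure do
  # Use Solid Queue as the Active Job backend...
  config.active_job.queue_adapter = :solid_queue
  # ...and run it against a separate database named :queue.
  config.solid_queue.connects_to = { database: { writing: :queue } }
end
```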
00:10:04.480
That design is much easier to manage than migrating to a separate database later, once the system is already operational. On top of this system we then implemented support for sequential jobs, and also for recurring tasks, that is, jobs that run on a schedule. Time is limited here, so I will leave those as an exercise for you: all that’s needed is a couple more tables, and you will find it’s doable.
00:10:28.080
There is so much more to share: we’ve added support for job operations, logging, instrumentation, and more. We’ve built a dashboard to manage and inspect your queues, which looks great. Today we are running Solid Queue at HEY, as David mentioned, processing about 20 million jobs per day across 74 virtual machines, with 800 workers processing jobs.
00:10:51.760
While that provisioning may seem excessive, our earlier experiences, particularly during the launch of HEY Calendar, led us to over-provision workers in anticipation of potential issues. Overall, this has allowed us to maintain latency similar to Resque while improving our throughput and efficiency with capabilities like bulk enqueuing; Resque didn’t support bulk enqueuing, but Solid Queue provides it out of the box.
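For reference, bulk enqueuing goes through the Active Job API added in Rails 7.1; the WeeklyReportJob and the users collection here are made up for illustration.

```ruby
# Build the job instances first, then enqueue them all in one call; a backend
# with bulk support, like Solid Queue, can insert them in batches instead of
# paying one enqueue round-trip per job.
jobs = users.map { |user| WeeklyReportJob.new(user) }
ActiveJob.perform_all_later(jobs)
```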
00:11:16.060
During the migration to Solid Queue, we did not see any impact on our global response times, even though enqueuing in Solid Queue is slower than in Resque. It’s usually just zero or one job enqueued per request, so the difference doesn’t show up in the overall response time in any meaningful way.
00:11:38.600
As my boss, David, said earlier, good frameworks are extracted, not invented. I’m not sure Solid Queue qualifies as a proper framework, as parts of it developed organically, but version 1.0.0 was released today, right before I came here, and it has really been through rigorous testing. That meant migrating all our jobs and queues one by one, which involved substantial effort and lots of small tweaks across many deployments.
00:12:06.960
Rigorous testing means facing challenging situations, persistent fine-tuning together with the app, and managing numerous deployments. There have been incidents, but during the transition no feature in Solid Queue was shipped without being extensively tested in production first. Essentially, everything was validated by running it ourselves, refining the system based on what we learned from handling jobs at scale.
00:12:30.160
We prefer to call this process ‘fun-tested’™ rather than ‘battle-tested,’ because that term evokes notions of warfare, which we want to avoid. More accurately, it’s about the resilience our team builds by facing adversity together and the many challenges we tackle daily.
00:12:53.840
Now, I kindly request your assistance in making it even better. While we have significant experience running it on MySQL in production, we don’t have as much visibility into issues with PostgreSQL and SQLite. We are already seeing community collaboration, with several community members helping to address many of the issues people have hit with PostgreSQL or issues stemming from setups different from ours.
00:13:11.080
For instance, a concerning bug was identified in our database that was addressed by Andy CW. Community support has been indispensable in tackling these issues proactively. Even though I strive to ship only well-vetted features based on our experiences, it’s impossible to engage with every system and encounter every potential issue by myself.
00:13:35.679
This is where the Rails community excels, ensuring we receive timely assistance. To wrap up, I want you to know about my little dog, Moi. I usually take her with me to many conferences, hoping she might help with my talks. Poodles are highly intelligent, but unfortunately, bringing her to Canada is not straightforward. However, I brought some stickers of her for anyone interested.
00:14:00.000
If you want to discuss queues or jobs or if you fancy a sticker, please find me. I might not approach you as I will be quiet for the rest of the day, but I would be delighted to meet you and engage in conversation. Thank you very much!