00:00:00.560
This talk is called 'How to Fail at Background Jobs,' and the slides are right there, so you can skip ahead if you'd like.
00:00:06.080
I think that failure is really interesting, and we learn a lot from it.
00:00:12.000
As the speaker before me mentioned, we often need to introduce the pain before we find the solution. I've experienced a lot of pain and maybe a few solutions.
00:00:19.359
Mostly, I want to talk about the pain. I was discussing the various topics I could cover in this talk with my good co-worker, Evan, who is sitting back there.
00:00:25.119
He said, "Oh, so you want to talk about how to fail at background jobs? Well, the answer is simple, right? Delayed Job!" Does anyone here use Delayed Job?
00:00:31.039
If you're considering using Delayed Job, you might want to consult Evan to figure out why you shouldn't be using it. Evan works in support and has dealt with many engineering customers.
00:00:42.480
Frequently, he sees database performance issues related to customers using Delayed Job, which can severely hammer their databases.
00:00:49.039
If you happen to be using Delayed Job or are thinking about it, you might want to consider an alternative that's similar but requires Postgres, which was written by a Heroku guy.
00:00:54.960
And that's it—that's my talk. But I also want to mention Engine Yard's conference in August called 'Distill.' You should all submit to the call for proposals.
00:01:00.960
You could probably give a better talk than the one I just gave. Do I still have time?
00:01:07.360
Okay, great. I'll continue then. The talk I really want to give focuses on the lessons we can learn from failure, particularly about abstractions.
00:01:13.200
I think my current working theory is that I fail so much because the abstractions I use are wrong or fail me, or that I simply need better ones.
00:01:18.479
So, speaking of abstractions, let's talk about the Rails 4 queuing system. How many people have heard about the queuing system in Rails 4?
00:01:25.359
How many have heard that it's not going to be in Rails 4?
00:01:30.880
If you didn't know or if you’re interested, you can go to the commit message or the specific commit on GitHub and read all the comments about what was pulled out.
00:01:36.079
I'm going to attempt to summarize some of the reasons why it was pulled out and draw my own conclusions about why it was a failure.
00:01:42.480
First of all, the API for the Rails queuing system had significant issues. One commenter pointed out that the biggest problem was the definition of the queue.
00:01:50.079
The name of the queue where you push your jobs was defined at the place where you call the push, rather than in the job class itself.
00:01:56.320
Let me explain the API more clearly: the API allows you to create an object—any object you want—that has a run method.
00:02:05.360
According to the contract, you put that object onto some queue, and later, someone can pull it off and call run, executing the job.
00:02:11.280
The problem arises because the arguments needed to reconstruct that object are not part of this contract.
00:02:18.000
Somehow, all that data is embedded in the object, which requires serialization and deserialization.
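The contract can be sketched in plain Ruby. This is a minimal illustration, not the actual Rails 4 API; the class and variable names are mine. Note that serialization is nowhere in the contract, which is exactly the gap described above.

```ruby
require "thread"

# Any object with a `run` method goes onto a queue; a worker later
# pops it off and calls `run`. All job state lives inside the object.
class EmailJob
  def initialize(address)
    @address = address  # state the queue must somehow serialize
  end

  def run
    "sent mail to #{@address}"  # stand-in for the real work
  end
end

queue = Queue.new                            # thread-safe FIFO from the stdlib
queue.push(EmailJob.new("user@example.com"))

job    = queue.pop                           # a worker pulls the job off...
result = job.run                             # ...and runs it
```

Nothing here says how `EmailJob.new("user@example.com")` gets reconstructed on the other side of a real network queue, which is why Marshal ended up in the picture.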
00:02:25.760
Given the API, what they ended up having to do was use Marshal.
00:02:31.680
If you've ever tried to marshal an Active Record object, you'll know the end result is far from ideal.
00:02:35.360
This isn’t great for debugging issues in production. Moreover, marshalling is very limiting.
00:02:41.040
Marshalling can also run into problems, such as circular references, or drag along data you never needed in the first place.
00:02:48.080
Finally, they alluded to having solved the completely wrong problem. One of the major use cases I envisioned for the Rails 4 queuing system was sending emails.
00:02:56.080
Action Mailers are a core part of Rails, and when sending an email, you don't want it to disrupt the request processing.
00:03:03.919
If you send an email directly from a controller action, that could take a while and slow down the request.
00:03:10.640
One of the primary reasons background job systems exist is to allow for such actions without interfering with request processing.
00:03:17.760
They proposed changes to Rack so that simple tasks like sending emails could be handled without a separate job processor.
00:03:25.280
In an ideal setup, the email sending would occur after the request is sent to the browser but before processing is completed.
00:03:32.240
To achieve this, there’s a somewhat evil way to hack Rack. Rack expects a triplet of status, headers, and body.
00:03:39.040
We can create a proxy object around that body and implement our own each method to return the actual body.
00:03:47.440
The last thing we can do is send the email because we set the Content-Length header, which tricks the browser into thinking we're done.
00:03:54.560
However, we still need to figure out how to get this solution implemented into Rack.
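The body-proxy trick can be sketched like this. `DeferredWorkBody` is my name for it, not part of Rack; Rack only requires that a body respond to `each`, so we can wrap the real body and do extra work after the last chunk has been yielded.

```ruby
# Wrap a Rack response body and run a callback after streaming finishes.
class DeferredWorkBody
  def initialize(body, &after_send)
    @body = body
    @after_send = after_send
  end

  def each
    @body.each { |chunk| yield chunk }  # stream the real response first
    @after_send.call                    # then, e.g., send the email
  end
end

status, headers, body = 200, { "Content-Length" => "5" }, ["hello"]
sent = []
wrapped = DeferredWorkBody.new(body) { sent << :email }

# Because Content-Length is already set, the client treats the response
# as complete after 5 bytes, even while the server keeps working.
chunks = []
wrapped.each { |c| chunks << c }
```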
00:04:02.240
After considering these issues, I thought perhaps my input could help solve some of them.
00:04:10.080
I attempted to create some pull requests, but it seems they’re going to rewrite this in Rails 4.1.
00:04:16.239
I don't expect it to resemble the current version in any way by the time they are done.
00:04:22.560
Let’s now go back to a story from 2009. I was working at a company called 3M, and this was our product: a Lava COS chairside oral scanner.
00:04:30.960
I was writing a Rails app—Rails 3 or perhaps Rails 2—and it was essentially a glorified file server.
00:04:37.680
We would upload files from these devices in the wild, and our app would organize and facilitate actions with those files.
00:04:44.160
At that time, the state of the art for background processing was quite limited.
00:04:50.639
We needed something simple, since our use case was just copying files from one place to another.
00:04:56.480
The state of the art was a queuing system called Starling. Does anyone here still use it? Any Workling users?
00:05:02.480
We used these queue systems for a while until we started reading the Twitter engineering blog.
00:05:08.960
This was when Rails was criticized for not being scalable because Twitter wasn't scaling well.
00:05:15.840
Concerned about our distribution and the need for high availability, we explored moving to Erlang.
00:05:22.960
RabbitMQ started gaining popularity, so we decided to transition from Starling to RabbitMQ as our backend.
00:05:29.680
RabbitMQ operates on multiple nodes, replicating each other. If one goes down, the queuing system continues to function.
00:05:37.839
It offers numerous benefits, including a protocol called AMQP.
00:05:43.920
I'm not going to explain it in detail, but this is a code snippet showing how you interact with AMQP.
00:05:50.960
Before GitHub existed, a nice person wrote this AMQP exchange runner for Workling.
00:05:58.239
So, we simply swapped out the Starling runner for our AMQP runner and thought we were good to go.
00:06:05.759
However, we encountered a little bug—not directly related to RabbitMQ.
00:06:12.000
The issue stemmed from the fact that RabbitMQ was faster than Starling.
00:06:19.120
We started getting ActiveRecord::RecordNotFound errors because we enqueued jobs in after_create callbacks.
00:06:25.440
The funny thing about 'after_create' is that it runs after the object is created, but not after the transaction involving that creation completes.
00:06:33.440
So, to create an object in the database, there are two important steps: the insert and then the commit.
00:06:40.160
When the insert happened, we enqueued the job—not knowing the transaction was still pending.
00:06:46.720
With RabbitMQ managing an open socket to the workers, the worker would immediately look up the object and find it wasn't there.
00:06:52.080
The hack solution was to implement a one-second sleep.
00:06:58.480
But another hack fix involved an outdated plugin we wrote which isn't in use anymore.
00:07:06.760
This plugin was interesting; it monkey patched the current connection object to run code upon executing the commit.
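Here is a toy version of that idea. The real plugin monkey patched Active Record's connection adapter; the `StubConnection` class below just stands in for it so the mechanism is visible. Callbacks registered during a transaction only fire once `commit` actually runs.

```ruby
# Collect callbacks during a transaction; fire them only on commit.
class StubConnection
  def initialize
    @after_commit = []
  end

  def after_commit(&block)
    @after_commit << block
  end

  def commit
    # ...the real COMMIT would be issued to the database here...
    @after_commit.each(&:call)
    @after_commit.clear
  end
end

conn = StubConnection.new
enqueued = []
conn.after_commit { enqueued << :job }  # enqueue only after COMMIT
before = enqueued.size                  # still 0: transaction is open
conn.commit                             # now the job is enqueued
```

Rails later gained a built-in `after_commit` callback on models, which addresses this exact race without any monkey patching.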
00:07:12.800
It was clever, but we continued facing bugs.
00:07:18.280
These issues may not have simply been bugs but fundamental flaws in the system.
00:07:24.480
What we wanted was to harness the power of RabbitMQ by actually using the core functionality underneath Workling.
00:07:30.640
However, we had several applications communicating through this simple abstraction.
00:07:37.440
Workling's abstraction assumed you had your Rails app accepting requests and workers processing jobs, without any further specification.
00:07:43.520
It didn't allow for naming queues, which led us to hack on a new implementation.
00:07:50.160
We eventually decided it was easier to discard Workling and write our own wrapper.
00:07:56.080
This new wrapper enabled us to specify queue names and allowed for sharing RabbitMQ between applications.
00:08:04.160
This meant one app's background jobs could run simply from messages enqueued by another app.
00:08:09.760
So, let's reflect on the lessons learned.
00:08:16.000
Workling served us well, but ultimately it failed to provide an abstraction that could last.
00:08:22.160
In hindsight, we also failed to open-source our solution, which we never fully developed.
00:08:28.560
Currently, at Engine Yard, we use Resque extensively for managing background jobs.
00:08:35.680
Our primary use case is to boot servers on Amazon EC2.
00:08:43.680
Let me walk you through some example code to illustrate what we're doing.
00:08:50.080
When we create an instance of some server for a customer in our database, we have a job that boots that instance.
00:08:56.320
The job looks for the instance, creates it using a library called Fog to interact with AWS, and saves off the Amazon ID.
00:09:02.760
We then query Amazon to check if that server is running.
00:09:09.440
Note that all the code I’m presenting is merely for example purposes; don’t take it as best practices.
00:09:16.320
Next, we wait for that server to be up before moving on to attaching an IP.
00:09:22.399
We continue this process through all the necessary steps to set the server up.
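The shape of that job looks roughly like the sketch below. This is not our production code: `FakeCloud` stands in for Fog talking to AWS, and all names are illustrative. The point is the pattern of creating the server, saving off the ID, and polling until it is running.

```ruby
# A fake cloud API so the provisioning pattern is runnable without AWS.
class FakeCloud
  def initialize
    @calls = 0
  end

  def create_server
    "i-12345"  # Amazon-style instance ID
  end

  def state(_id)
    @calls += 1
    @calls < 3 ? "pending" : "running"  # "running" on the third poll
  end
end

cloud = FakeCloud.new
amazon_id = cloud.create_server          # save off the Amazon ID
polls = 0
until cloud.state(amazon_id) == "running"
  polls += 1
  # in real code: sleep between polls instead of spinning
end
# after this point we'd attach an IP, and so on through the setup steps
```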
00:09:28.560
This job can be perceived as somewhat unwieldy, and I discussed with my co-workers possible ways to refactor this behavior.
00:09:34.560
One direction could involve running this job in a completely separate system.
00:09:41.440
For this to happen, we wouldn't want to share databases. No one here does that, right?
00:09:47.200
So we must find a way to send all of the required information with the job arguments rather than merely an ID.
00:09:54.080
We need additional arguments in our 'perform' method, along with potential callbacks during the job.
00:10:00.640
This would facilitate updating the customer on progress.
00:10:06.000
While we never actually implemented this idea, we could also take the opposite approach.
00:10:13.760
Background jobs might not need all of their logic defined within the job class itself.
00:10:19.040
Instead of requiring a dedicated job class with logic included, we could create a simpler job.
00:10:25.600
For instance, a 'Method Calling' job could be just enough.
00:10:31.919
Every time we need to do something in the background, we’d just enqueue the 'Method Calling' job with the appropriate parameters.
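A minimal version of that generic job might look like this; all names here are illustrative. Instead of one job class per task, a single job receives a class name, a lookup argument, and a method to call on the found object.

```ruby
# One generic job that can invoke any method on any findable object.
class MethodCallJob
  def self.perform(class_name, finder_arg, method_name)
    klass  = Object.const_get(class_name)  # Rails would use String#constantize
    object = klass.find(finder_arg)
    object.public_send(method_name)
  end
end

# A stand-in model to demonstrate the call path:
class Invoice
  attr_reader :id

  def self.find(id)
    new(id)
  end

  def initialize(id)
    @id = id
  end

  def deliver!
    "delivered invoice #{@id}"
  end
end

result = MethodCallJob.perform("Invoice", 1, :deliver!)
```

The downside, as mentioned, is that the queue fills up with opaque `MethodCallJob` entries, so you lose visibility into what is actually running.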
00:10:38.080
While this worked for some jobs, it created a lack of clarity about what jobs were actually running.
00:10:45.679
So, I recently created a library called Async.
00:10:50.560
This library has a clever DSL for running methods asynchronously on Active Record objects at a later point.
00:10:58.399
The concept of pluggable backends allows for flexibility across different queuing systems.
00:11:06.240
Now, returning to our problem with instance provisioning jobs, we recognized a common issue: customers waiting on stale statuses.
00:11:12.480
They would see their instances stuck in a 'waiting' status, as jobs were running for hours.
00:11:18.560
During this time, we might observe far more workers than expected.
00:11:24.720
The total number of workers running might not align with the configuration.
00:11:31.679
This issue arose from either workers being replaced without decrementing the count or some workers hanging indefinitely.
00:11:39.440
It took me some time to realize that there are differing types of reliability within our queuing systems.
00:11:46.480
At 3M, we believed we were operating a reliable system because we used RabbitMQ.
00:11:53.600
RabbitMQ utilizes acknowledgments, durable queues, and durable messages.
00:12:00.160
Even if a RabbitMQ node fails, the queue remains intact.
00:12:06.560
However, we didn't consider the problems that could stem from the worker processes themselves.
00:12:13.760
We could ensure a job was delivered and would be requeued if it wasn’t acknowledged.
00:12:20.320
But what if the job was halfway through processing and then crashed?
00:12:27.680
There's a chance it could be processed twice, which is a significant concern.
00:12:34.720
Moreover, many libraries offer simple retry logic for exceptions raised during execution.
00:12:41.200
But if the Ruby process crashes unexpectedly or hangs up, it might not fire the retry logic.
00:12:47.920
We had challenges since we were opening sockets and SSH connections from background jobs, and if the connection vanished, Ruby wouldn’t know to close it.
00:12:55.040
As such, we faced problems monitoring and maintaining an appropriate number of workers.
00:13:02.080
We wanted to keep the pool at a certain number and to gracefully restart when deploying new code.
00:13:10.080
We deploy code multiple times a day, so maintaining workers while picking up the new code is critical.
00:13:16.800
We want older workers to finish current jobs before shutting down and restarting the new ones.
00:13:23.040
Furthermore, we encounter the issue of not knowing why a job failed or why the expected outcome did not occur.
00:13:30.000
Is it because the job was never enqueued in the first place, or because it failed at some point in the path?
00:13:37.440
Faced with all these challenges, I decided to address the simplest and most logical one first: understanding what’s happening.
00:13:44.480
To that end, I developed a Resque plugin; there are numerous plugins for Resque available on RubyGems.
00:13:51.440
The plugin provides a way to track identifiers relevant to the job being executed.
00:13:58.080
For an instance provisioning job, you would want to track the instance and the customer account that owns it.
00:14:05.120
During enqueue and job execution, the identifiers help keep updates in sync.
00:14:11.680
You can also call methods to check for any ongoing jobs affecting the customer or the instance.
00:14:19.440
Additionally, you can look for jobs that failed within a certain time window.
00:14:27.440
Though I was likely the only one who utilized this for debugging, it was useful.
00:14:34.320
The bigger problem we faced involved job dependencies.
00:14:41.040
For example, if you add a database replica to a cluster, we needed multiple jobs.
00:14:47.920
Creating a replica means Job A takes a snapshot of the master database, and Job B provisions the volume from it.
00:14:54.360
Before proceeding, Job B has to confirm that Job A has completed.
00:15:01.040
This dependency structure created a messy workflow, so we opted to write another Resque plugin.
00:15:06.760
This new plugin was much more complex than the job tracking one but never made it into production.
00:15:12.960
While this plugin made it easier to express job dependencies, it didn’t aid in debugging.
00:15:20.160
As a last resort, we revisited the tracking plugin.
00:15:26.000
The job tracking plugin depended on another Resque plugin called Metadata, which associates arbitrary metadata with jobs.
00:15:32.000
We decided to leverage database storage for tracking job outcomes instead of queuing everything in Redis.
00:15:39.680
To facilitate this, we created a Resque job model and hooked into it whenever a job got enqueued.
00:15:46.000
During execution, we updated the corresponding database record until completion.
00:15:54.240
This data storage allowed us to perform robust SQL queries to identify the most common job failures.
00:16:01.360
We utilized this tracking for a while until we ultimately disabled it.
00:16:07.200
One of my coworkers, Andy Delcom, shared insights in a talk at Cascadia.
00:16:12.800
In his talk, he proposed a unique approach to tracking job dependencies by generating unique IDs.
00:16:20.000
When a customer issues a request, we dry-run it to produce a unique ID.
00:16:26.560
Any tasks enqueued as a result would include this unique ID.
00:16:32.640
This identification method could be invaluable for future debugging efforts.
00:16:38.000
Josh explored the concept of 'modeling intent'—focusing on what a customer hoped to achieve with objects in our system.
00:16:46.880
For background jobs, this implies creating specific models for tasks.
00:16:53.280
For instance, instead of merely queuing instance provisioning as a job, you would define it within your database.
00:16:59.640
This change lets us track provisioning directly in the database.
00:17:06.160
Now the instance provisioning job aligns directly with an Active Record object.
00:17:12.720
This relationship provides valuable insights into running jobs, recently completed jobs, and job states.
00:17:18.560
Moreover, we could compare running 'instance terminate' jobs against 'instance provision' jobs to prevent conflicts.
00:17:24.720
Defining task state allows us to ensure idempotency across jobs.
00:17:30.320
Idempotence means that if you execute a job multiple times, it won’t alter results beyond the first execution.
00:17:36.080
For example, a GET request is idempotent (retrieving the same result), while a POST request typically alters data.
00:17:42.160
Implementing idempotent job designs simplifies our workflow by ensuring consistent behaviors.
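One way to sketch an idempotent job is to have it record a state transition and refuse to redo completed work; the names below are illustrative, not our actual job code.

```ruby
# A job that is safe to run any number of times: only the first
# execution changes anything.
class ProvisionJob
  attr_reader :state, :runs

  def initialize
    @state = "pending"
    @runs = 0
  end

  def perform
    return if @state == "provisioned"  # already done: running again is a no-op
    @runs += 1                         # the real provisioning work goes here
    @state = "provisioned"
  end
end

job = ProvisionJob.new
job.perform
job.perform  # a retry, a requeue, a double delivery: nothing changes
```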
00:17:49.760
Additionally, one of my coworkers created a tool called Viaduct.
00:17:55.360
It's a generic middleware stack, like Rack but not tied to HTTP, that provides job wrapping functionality.
00:18:01.760
Thanks to Viaduct, we could introduce instrumentation for job monitoring.
00:18:08.720
We learned from our experiences that reaching for more abstraction is not always the answer.
00:18:15.360
Altering or adding plugins to existing frameworks can be enjoyable and enlightening.
00:18:22.480
However, solving problems abstractly doesn’t always yield the best results.
00:18:28.080
Most importantly, directly solving the specific issues at hand often proves much easier.
00:18:34.320
For my recent project, I resolved to build a background job system from scratch.
00:18:40.400
Let’s distill what it requires down into three essential ingredients.
00:18:46.720
These ingredients are a work loop, monitoring capabilities, and a queue system.
00:18:53.760
The work loop is a piece of code tasked with picking up jobs, executing them, and monitoring them.
00:19:00.720
An anecdote I missed earlier: at 3M I was a developer and there was a separate operations person.
00:19:07.680
I thought I was excelling when I coded a failover mechanism for AMQP.
00:19:14.560
At Engine Yard, there's no separate operations team, because operations is essentially our product.
00:19:21.760
This meant each engineer doubled as operations, leading to a richer understanding of managing background job systems.
00:19:27.680
The third crucial ingredient is the queue, which functions as a repository for jobs awaiting processing.
00:19:34.160
Does anyone here know who Terence is? He's a maintainer of Resque and Bundler.
00:19:40.800
In one talk, Terence discussed how the Bundler API resolves dependencies in the background without using Resque.
00:19:47.360
Instead, he utilized a thread pool for background job processing.
00:19:53.760
The queue for this was Ruby's standard library Queue class, essentially a thread-safe array.
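An in-process thread pool along those lines can be sketched with nothing but the standard library (this is my own sketch, not Terence's actual code). Queue is a thread-safe FIFO, so worker threads can block on `pop` until a job arrives.

```ruby
require "thread"

jobs    = Queue.new
results = Queue.new

# Three worker threads, each blocking on the shared job queue.
workers = 3.times.map do
  Thread.new do
    while (job = jobs.pop) != :shutdown
      results << job.call            # run the job, collect its result
    end
  end
end

5.times { |i| jobs << -> { i * i } } # enqueue some work
3.times { jobs << :shutdown }        # one poison pill per worker thread
workers.each(&:join)
```

The poison-pill shutdown is the simplest way to drain such a pool: each worker consumes exactly one `:shutdown` marker and exits.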
00:19:59.920
Beyond Queue, we could also utilize DRb (Distributed Ruby) to set up remote objects.
00:20:06.080
DRb enables Ruby processes to call methods on remote objects over a network, offering a simple mechanism.
00:20:12.240
This service management object could form the basis of our queue system.
00:20:19.920
For example, a service manager could handle child processes for background jobs.
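A DRb-backed queue service can be sketched like this; `JobServer` and the job strings are illustrative. One process exposes an object over the network, and other Ruby processes call methods on it as if it were local.

```ruby
require "drb/drb"

# An object we want to share across processes.
class JobServer
  def initialize
    @jobs = []
  end

  def push(job)
    @jobs << job
  end

  def pop
    @jobs.shift
  end
end

# Expose the object; port 0 lets DRb pick a free port on localhost.
server = DRb.start_service("druby://localhost:0", JobServer.new)

# A client (here in the same process, but it could be another machine):
client = DRbObject.new_with_uri(server.uri)
client.push("provision instance i-12345")
job = client.pop
DRb.stop_service
```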
00:20:26.080
A more familiar example of a queue is Redis, which Resque uses as its backing store.
00:20:32.320
This snippet illustrates using Redis to implement basic background processing.
00:20:39.200
rpush and lpop are the canonical Redis commands to enqueue and dequeue jobs.
00:20:46.720
The beauty of keeping the queue behind a simple interface is that it allows for multiple backends, enhancing the system's flexibility.
00:20:53.840
So let’s consider how we can consolidate these ideas into a practical, production-ready system.
00:20:59.920
I assembled a script that requires our Rails environment and outlines this implementation.
00:21:06.080
The loop picks up invoices from our billing system and processes them, one at a time.
00:21:12.320
The main objective is ensuring that we manage the incoming tasks effectively.
00:21:19.520
During processing, we also harness the concept of safe exit points.
00:21:26.000
Safe exit points define moments in the code where we can safely stop processing.
00:21:32.200
A signal trap tells the loop to terminate at the next safe exit point.
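The loop-plus-trap pattern can be sketched like this; the job contents are illustrative. The TERM handler only flips a flag, and the loop checks it between jobs, so work in progress is never cut off halfway.

```ruby
# A work loop with safe exit points: the signal handler does almost
# nothing, and the loop decides when it is actually safe to stop.
$shutdown = false
Signal.trap("TERM") { $shutdown = true }  # just set a flag in the handler

pending   = [1, 2, 3]
processed = []

until $shutdown
  job = pending.shift
  break if job.nil?        # queue drained
  processed << job * 10    # the actual work happens here
  # the end of each iteration is a safe exit point: the job is finished
end
```

This is also what makes graceful deploys possible: send TERM to the old workers, let each finish its current job, and only then start workers on the new code.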
00:21:39.440
The point of all this is to build a deeper understanding of how background jobs can be made more efficient and reliable.
00:21:45.920
Let's reflect on lessons learned with abstraction awareness.
00:21:52.000
It is crucial to recognize when it is appropriate to create specific abstractions for your needs.
00:21:58.960
Sometimes, directly addressing the issue at hand yields far better outcomes than striving for abstract solutions.
00:22:06.000
I encourage everyone to contribute to getting these problems effectively solved.
00:22:14.040
We should aim for meaningful contributions while continuously innovating the tools we rely on.
00:22:22.080
Thank you for your time, and I hope you found value in these struggles and successes.