How to Fail at Background Jobs

by Jacob Burkhart

In the talk 'How to Fail at Background Jobs,' Jacob Burkhart discusses the complexities and pitfalls associated with managing background jobs in software development, particularly in the context of Ruby on Rails. The emphasis is placed on learning from failure to improve the abstraction mechanisms that assist with background job processing. Key points from the presentation include:

  • Failure as a Learning Tool: Jacob opens with the notion that understanding and examining failure leads to valuable insights in job management.
  • Critiques of Delayed Job: Burkhart warns against using Delayed Job due to its adverse impact on database performance, urging consideration of more efficient alternatives.
  • Rails 4 Queuing System: He critiques the API issues of the Rails 4 queuing system, notably its flawed handling of job information and serialization challenges that complicate debugging and make scaling difficult.
  • Transitioning Systems: Jacob shares his experiences transitioning from older queuing systems like Starling to RabbitMQ, highlighting challenges faced with job queuing timing and object availability during transaction processing.
  • Resque Usage: The discussion shifts to his use of the Resque library at Engine Yard, noting its application in provisioning servers on Amazon EC2. Here, he addresses the need for code refactoring to manage job complexity and improve queue management.
  • Pipeline of Job Dependencies: He describes the challenges of managing jobs that depend on each other, expressing the need for robust job tracking to monitor execution and outcomes across dependent tasks.
  • Idempotency Importance: Burkhart explains the critical concept of idempotency, which ensures consistent job execution without unintentional duplications, thereby enhancing reliability.
  • Custom Solutions: Finally, he reflects on building a background job system from scratch—emphasizing the importance of direct solutions over abstract ones. Jacob concludes with a reminder of the significance of understanding and mastering the mechanisms behind background job processing.

The overall takeaway is the necessity of refining the abstractions related to background job handling, to build systems that are understandable, maintainable, and effective. Lessons learned from various implementations serve to guide future development practices in this domain.

00:00:00.560 This talk is called 'How to Fail at Background Jobs,' and the slides are right there, so you can skip ahead if you'd like.
00:00:06.080 I think that failure is really interesting, and we learn a lot from it.
00:00:12.000 As the speaker before me mentioned, we often need to introduce the pain before we find the solution. I've experienced a lot of pain and maybe a few solutions.
00:00:19.359 Mostly, I want to talk about the pain. I was discussing the various topics I could cover in this talk with my good co-worker, Evan, who is sitting back there.
00:00:25.119 He said, "Oh, so you want to talk about how to fail at background jobs? Well, the answer is simple, right? Delayed Job!" Does anyone here use Delayed Job?
00:00:31.039 If you're considering using Delayed Job, you might want to consult Evan to figure out why you shouldn't be using it. Evan works in support and has dealt with many engineering customers.
00:00:42.480 Frequently, he sees database performance issues related to customers using Delayed Job, which can severely hammer their databases.
00:00:49.039 If you happen to be using Delayed Job or are thinking about it, you might want to consider an alternative that's similar but requires Postgres, which was written by a Heroku guy.
00:00:54.960 And that's it—that's my talk. But I also want to mention Engine Yard's conference in August called 'Distill.' You should all submit to the call for proposals.
00:01:00.960 You could probably give a better talk than the one I just gave. Do I still have time?
00:01:07.360 Okay, great. I'll continue then. The talk I really want to give focuses on the lessons we can learn from failure, particularly about abstractions.
00:01:13.200 I think my current working theory is that I fail so much because the abstractions I use are wrong or fail me, or that I simply need better ones.
00:01:18.479 So, speaking of abstractions, let's talk about the Rails 4 queuing system. How many people have heard about the queuing system in Rails 4?
00:01:25.359 How many have heard that it's not going to be in Rails 4?
00:01:30.880 If you didn't know or if you’re interested, you can go to the commit message or the specific commit on GitHub and read all the comments about what was pulled out.
00:01:36.079 I'm going to attempt to summarize some of the reasons why it was pulled out and draw my own conclusions about why it was a failure.
00:01:42.480 First of all, the API for the Rails queuing system had significant issues. One commenter pointed out that the biggest problem was the definition of the queue.
00:01:50.079 The name of the queue where you push your jobs was defined at the place where you call the push, rather than in the job class itself.
00:01:56.320 Let me explain the API more clearly: the API allows you to create an object—any object you want—that has a run method.
00:02:05.360 According to the contract, you put that object onto some queue, and later, someone can pull it off and call run, executing the job.
00:02:11.280 The problem arises because the arguments needed to reconstruct that object are not part of this contract.
00:02:18.000 Somehow, all that data is embedded in the object, which requires serialization and deserialization.
00:02:25.760 Given the API, what they ended up having to do was use Marshal.
00:02:31.680 If you've ever tried to marshal an Active Record object, you'll know the end result is far from ideal.
00:02:35.360 This isn’t great for debugging issues in production. Moreover, marshalling is very limiting.
00:02:41.040 The marshaller can introduce problems, such as circular references or the inclusion of unnecessary data.
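To make the serialization complaint concrete, here is a minimal sketch contrasting the two contracts. It is not the Rails 4 queue API: `WelcomeEmailJob` and the JSON payload shape are hypothetical names chosen for illustration.

```ruby
require "json"

# Hypothetical job object following the "anything with a run method" contract.
class WelcomeEmailJob
  attr_reader :user_id

  def initialize(user_id)
    @user_id = user_id
  end

  def run
    "sending welcome email to user #{user_id}"
  end
end

job = WelcomeEmailJob.new(42)

# Approach 1: the queue only sees an opaque object, so it has to Marshal it.
# The payload is a binary blob that is hard to read when debugging.
marshaled = Marshal.dump(job)
restored  = Marshal.load(marshaled)

# Approach 2: make the class name and arguments part of the contract, so the
# payload is plain, inspectable data.
payload = JSON.generate("class" => job.class.name, "args" => [job.user_id])
data    = JSON.parse(payload)
rebuilt = Object.const_get(data["class"]).new(*data["args"])

puts payload
puts rebuilt.run
```

With the second approach, whatever sits in the queue can be read by a person, which is the property the Marshal-based design gives up.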
00:02:48.080 Finally, they alluded to having solved the completely wrong problem. One of the major use cases I envisioned for the Rails 4 queuing system was sending emails.
00:02:56.080 Action Mailers are a core part of Rails, and when sending an email, you don't want it to disrupt the request processing.
00:03:03.919 If you send an email directly from a controller action, that could take a while and slow down the request.
00:03:10.640 One of the primary reasons background job systems exist is to allow for such actions without interfering with request processing.
00:03:17.760 They proposed wanting changes to Rack so that simple tasks like sending emails could be handled without a separate job processor.
00:03:25.280 In an ideal setup, the email sending would occur after the request is sent to the browser but before processing is completed.
00:03:32.240 To achieve this, there’s a somewhat evil way to hack Rack. Rack expects a triplet of status, headers, and body.
00:03:39.040 We can create a proxy object around that body and implement our own each method to return the actual body.
00:03:47.440 The last thing we can do is send the email because we set the Content-Length header, which tricks the browser into thinking we're done.
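The body-proxy trick described here can be sketched without the rack gem, since a Rack app is just a callable returning the triplet. `AfterResponseBody` and the `log` array are hypothetical names, and the last lines only imitate what a Rack server does with the body.

```ruby
# Wraps a Rack response body and runs a callback once the real body has been
# fully yielded to the server.
class AfterResponseBody
  def initialize(body, &after)
    @body = body
    @after = after
  end

  def each(&block)
    @body.each(&block)
    @after.call
  end
end

log = []

# A plain Rack-style app: a callable returning [status, headers, body].
app = lambda do |_env|
  chunks = ["Thanks for signing up!"]
  response_headers = {
    "Content-Type" => "text/plain",
    # Content-Length lets the client know the response is complete.
    "Content-Length" => chunks.sum(&:bytesize).to_s
  }
  wrapped = AfterResponseBody.new(chunks) { log << "email sent" }
  [200, response_headers, wrapped]
end

status, headers, body = app.call({})
# Stand-in for what a Rack server does with the body.
body.each { |chunk| log << "wrote #{chunk.bytesize} bytes" }
puts log.inspect
```

Note that the server process stays busy until the callback returns, so this moves the work out of the user's way without freeing the worker.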
00:03:54.560 However, we still need to figure out how to get this solution implemented into Rack.
00:04:02.240 After considering these issues, I thought perhaps my input could help solve some of them.
00:04:10.080 I attempted to create some pull requests, but it seems they’re going to rewrite this in Rails 4.1.
00:04:16.239 I don't expect it to resemble the current version in any way by the time they are done.
00:04:22.560 Let’s now go back to a story from 2009. I was working at a company called 3M, and this was our product: the Lava C.O.S., a chairside oral scanner.
00:04:30.960 I was writing a Rails app—Rails 3 or perhaps Rails 2—and it was essentially a glorified file server.
00:04:37.680 We would upload files from these devices in the wild, and our app would organize and facilitate actions with those files.
00:04:44.160 At that time, the state of the art for background processing was quite limited.
00:04:50.639 We needed something, because our use case involved copying files from one location to another.
00:04:56.480 The state of the art was a queuing system called Starling. Does anyone here still use it? Any Workling users?
00:05:02.480 We used these queue systems for a while until we started reading the Twitter engineering blog.
00:05:08.960 This was when Rails was criticized for not being scalable because Twitter wasn't scaling well.
00:05:15.840 Concerned about our distribution and the need for high availability, we explored moving to Erlang.
00:05:22.960 RabbitMQ started gaining popularity, so we decided to transition from Starling to RabbitMQ as our backend.
00:05:29.680 RabbitMQ operates on multiple nodes, replicating each other. If one goes down, the queuing system continues to function.
00:05:37.839 It offers numerous benefits, including a protocol called AMQP.
00:05:43.920 I'm not going to explain this in detail, but this code snippet shows how you work with AMQP.
00:05:50.960 Before GitHub existed, a nice person wrote this AMQP exchange runner for Workling.
00:05:58.239 So, we simply swapped out the Starling runner for our AMQP runner and thought we were good to go.
00:06:05.759 However, we encountered a little bug—not directly related to RabbitMQ.
00:06:12.000 The issue stemmed from the fact that RabbitMQ was faster than Starling.
00:06:19.120 We started getting Active Record 'not found' errors because we enqueued jobs in 'after_create' blocks.
00:06:25.440 The funny thing about 'after_create' is that it runs after the object is created, but not after the transaction involving that creation completes.
00:06:33.440 So, to create an object in the database, there are two important steps: the insert and then the commit.
00:06:40.160 When the insert happened, we enqueued the job—not knowing the transaction was still pending.
00:06:46.720 With RabbitMQ managing an open socket to the workers, the worker would immediately look up the object and find it wasn't there.
00:06:52.080 The hack solution was to implement a one-second sleep.
00:06:58.480 But another hack fix involved an outdated plugin we wrote which isn't in use anymore.
00:07:06.760 This plugin was interesting; it monkey patched the current connection object to run code upon executing the commit.
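The race can be reproduced with a toy model of a transactional database. Everything below (`ToyDatabase`, `worker_lookup`) is a hypothetical stand-in, not Active Record; in Rails 3 and later, the built-in `after_commit` callback plays the role of the deferred hook.

```ruby
# Toy stand-in for a database with transactions: rows become visible to other
# connections only after commit.
class ToyDatabase
  attr_reader :committed

  def initialize
    @committed = {}
    @pending = {}
    @after_commit = []
  end

  def transaction
    yield self
    @committed.merge!(@pending)
    @pending.clear
    callbacks, @after_commit = @after_commit, []
    callbacks.each(&:call)
  end

  def insert(id, row)
    @pending[id] = row
  end

  def after_commit(&block)
    @after_commit << block
  end
end

# A "worker" on another connection only sees committed rows.
def worker_lookup(db, id)
  db.committed.fetch(id) { :not_found }
end

db = ToyDatabase.new
results = {}

db.transaction do |tx|
  tx.insert(1, "scan-1")
  # Enqueueing here (like after_create) races a fast worker:
  results[:after_create] = worker_lookup(db, 1)
  # Deferring the enqueue until the commit has happened avoids the race:
  tx.after_commit { results[:after_commit] = worker_lookup(db, 1) }
end

p results
```

A slow queue hides the bug because the commit usually wins the race; a fast one exposes it.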
00:07:12.800 It was clever, but we continued facing bugs.
00:07:18.280 These issues may not have simply been bugs but fundamental flaws in the system.
00:07:24.480 What we wanted was to harness the power of RabbitMQ by actually using the core functionality underneath Workling.
00:07:30.640 However, we had several applications communicating through this simple abstraction.
00:07:37.440 Workling's abstraction assumed you had your Rails app accepting requests and workers processing jobs, without any further specification.
00:07:43.520 It didn't allow for naming queues, which led us to hack on a new implementation.
00:07:50.160 We eventually decided it was easier to discard Workling and write our own wrapper.
00:07:56.080 This new wrapper enabled us to specify queue names and allowed for sharing RabbitMQ between applications.
00:08:04.160 This meant one app's background jobs could run simply from messages enqueued by another app.
00:08:09.760 So, let's reflect on the lessons learned.
00:08:16.000 Workling served us well, but ultimately it failed to provide an abstraction that could last.
00:08:22.160 In hindsight, we also failed to open-source our solution, which we never fully developed.
00:08:28.560 Currently, at Engine Yard, we use Resque extensively for managing background jobs.
00:08:35.680 Our primary use case is to boot servers on Amazon EC2.
00:08:43.680 Let me walk you through some example code to illustrate what we're doing.
00:08:50.080 When we create an instance of some server for a customer in our database, we have a job that boots that instance.
00:08:56.320 The job looks for the instance, creates it using a library called Fog to interact with AWS, and saves off the Amazon ID.
00:09:02.760 We then query Amazon to check if that server is running.
00:09:09.440 Note that all the code I’m presenting is merely for example purposes; don’t take it as best practices.
00:09:16.320 Next, we wait for that server to be up before moving on to attaching an IP.
00:09:22.399 We continue this process through all the necessary steps to set the server up.
00:09:28.560 This job can be perceived as somewhat unwieldy, and I discussed with my co-workers possible ways to refactor this behavior.
00:09:34.560 One direction could involve running this job in a completely separate system.
00:09:41.440 For this to happen, we wouldn't want to share databases. No one here does that, right?
00:09:47.200 So we must find a way to send all of the required information with the job arguments rather than merely an ID.
00:09:54.080 We need additional arguments in our 'perform' method, along with potential callbacks during the job.
00:10:00.640 This would facilitate updating the customer on progress.
00:10:06.000 While we never actually implemented this idea, we could also take the opposite approach.
00:10:13.760 Background jobs might not need all of their logic defined within the job class itself.
00:10:19.040 Instead of requiring a dedicated job class with logic included, we could create a simpler job.
00:10:25.600 For instance, a 'Method Calling' job could be just enough.
00:10:31.919 Every time we need to do something in the background, we’d just enqueue the 'Method Calling' job with the appropriate parameters.
00:10:38.080 While this worked for some jobs, it created a lack of clarity about what jobs were actually running.
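A generic job of this kind might look like the sketch below. It follows Resque's `self.perform` convention, but `MethodCallingJob`, `Instance`, and `attach_ip` are hypothetical names, and `Instance` is a plain-Ruby stand-in for a model.

```ruby
# Hypothetical stand-in for an Active Record model.
class Instance
  REGISTRY = {}

  attr_reader :id, :state

  def initialize(id)
    @id = id
    @state = "pending"
    REGISTRY[id] = self
  end

  def self.find(id)
    REGISTRY.fetch(id)
  end

  def attach_ip(address)
    @state = "ip #{address} attached"
  end
end

# One generic job class: the payload names the class, the record id, the
# method, and its arguments.
class MethodCallingJob
  def self.perform(class_name, id, method_name, *args)
    record = Object.const_get(class_name).find(id)
    record.public_send(method_name, *args)
  end
end

Instance.new(7)
# What a worker would do after pulling the payload off the queue:
MethodCallingJob.perform("Instance", 7, "attach_ip", "10.0.0.5")
puts Instance.find(7).state
```

The clarity problem follows directly from this design: every entry in the queue is a `MethodCallingJob`, so the job name alone no longer says what work is running.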
00:10:45.679 So, I recently created a library called Async.
00:10:50.560 This library has a clever DSL for running methods asynchronously on Active Record objects at a later point.
00:10:58.399 The concept of pluggable backends allows for flexibility across different queuing systems.
00:11:06.240 Now, returning to our problem with instance provisioning jobs, we recognized a common issue: customers waiting on stale statuses.
00:11:12.480 They would see their instances stuck in a 'waiting' status, as jobs were running for hours.
00:11:18.560 During this time, we might observe far more workers than expected.
00:11:24.720 The total number of workers running might not align with the configuration.
00:11:31.679 This issue arose from either workers being replaced without decrementing the count or some workers hanging indefinitely.
00:11:39.440 It took me some time to realize that there are differing types of reliability within our queuing systems.
00:11:46.480 At 3M, we believed we were operating a reliable system because we used RabbitMQ.
00:11:53.600 RabbitMQ utilizes acknowledgments, durable queues, and durable messages.
00:12:00.160 Even if a RabbitMQ node fails, the queue remains intact.
00:12:06.560 However, we didn't consider the problems that could stem from the worker processes themselves.
00:12:13.760 We could ensure a job was delivered and would be requeued if it wasn’t acknowledged.
00:12:20.320 But what if the job was halfway through processing and then crashed?
00:12:27.680 There's a chance it could be processed twice, which is a significant concern.
00:12:34.720 Moreover, many libraries offer simple retry logic for exceptions raised during execution.
00:12:41.200 But if the Ruby process crashes unexpectedly or hangs up, it might not fire the retry logic.
00:12:47.920 We had challenges since we were opening sockets and SSH connections from background jobs, and if the connection vanished, Ruby wouldn’t know to close it.
00:12:55.040 As such, we faced problems monitoring and maintaining an appropriate number of workers.
00:13:02.080 We wanted to keep the pool at a certain number and to gracefully restart when deploying new code.
00:13:10.080 We deploy code multiple times a day, so maintaining workers while picking up the new code is critical.
00:13:16.800 We want older workers to finish their current jobs before they shut down and new ones start.
00:13:23.040 Furthermore, we encounter the issue of not knowing why a job failed or why the expected outcome did not occur.
00:13:30.000 Is it because the job was never enqueued in the first place, or because it failed at some point in the path?
00:13:37.440 Faced with all these challenges, I decided to address the simplest and most logical one first: understanding what’s happening.
00:13:44.480 To that end, I developed a Resque plugin, as there are numerous plugins for Resque available on RubyGems.
00:13:51.440 The plugin provides a way to track identifiers relevant to the job being executed.
00:13:58.080 For an instance provisioning job, you would want to track the instance and the customer account that owns it.
00:14:05.120 During enqueue and job execution, the identifiers help keep updates in sync.
00:14:11.680 You can also call methods to check for any ongoing jobs affecting the customer or the instance.
00:14:19.440 Additionally, you can look for jobs that failed within a certain time window.
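The idea of tracking jobs by the identifiers they affect can be sketched in memory. `JobTracker` and its method names are hypothetical and do not reflect the actual plugin's API; the identifier strings are an assumed format.

```ruby
# In-memory sketch of tracking jobs by the identifiers they affect.
class JobTracker
  Entry = Struct.new(:job, :identifiers, :status, :finished_at)

  def initialize
    @entries = []
  end

  def start(job, identifiers)
    entry = Entry.new(job, identifiers, :running, nil)
    @entries << entry
    entry
  end

  def finish(entry, status, at: Time.now)
    entry.status = status
    entry.finished_at = at
  end

  # Jobs still running that touch the given instance or account.
  def running_for(identifier)
    @entries.select { |e| e.status == :running && e.identifiers.include?(identifier) }
  end

  # Jobs that failed for the identifier within a time window.
  def failed_for(identifier, since:)
    @entries.select do |e|
      e.status == :failed && e.identifiers.include?(identifier) && e.finished_at >= since
    end
  end
end

tracker = JobTracker.new
now = Time.now

tracker.start("ProvisionInstance", ["instance:7", "account:3"])
old_failure = tracker.start("AttachIp", ["instance:7"])
tracker.finish(old_failure, :failed, at: now - 7200)
recent_failure = tracker.start("AttachIp", ["instance:7"])
tracker.finish(recent_failure, :failed, at: now - 60)

puts tracker.running_for("account:3").map(&:job).inspect
puts tracker.failed_for("instance:7", since: now - 3600).size
```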
00:14:27.440 Though I was likely the only one who utilized this for debugging, it was useful.
00:14:34.320 The bigger problem we faced involved job dependencies.
00:14:41.040 For example, adding a database replica to a cluster required multiple jobs.
00:14:47.920 Creating a replica means Job A adds a snapshot of the master database, and Job B provisions the volume.
00:14:54.360 Before proceeding, Job B has to confirm that Job A has completed.
00:15:01.040 This dependency structure created a messy workflow, so we opted to write another Resque plugin.
00:15:06.760 This new plugin was much more complex than the job tracking one but never made it into production.
00:15:12.960 While this plugin made it easier to express job dependencies, it didn’t aid in debugging.
00:15:20.160 As a last resort, we revisited the tracking plugin.
00:15:26.000 The job tracking plugin depended on another Resque plugin called Metadata, which associates arbitrary metadata with jobs.
00:15:32.000 We decided to leverage database storage for tracking job outcomes instead of queuing everything in Redis.
00:15:39.680 To facilitate this, we created a Resque job model and hooked into it whenever a job got enqueued.
00:15:46.000 During execution, we updated the corresponding database record until completion.
00:15:54.240 This data storage allowed us to perform robust SQL queries to identify the most common job failures.
00:16:01.360 We utilized this tracking for a while until we ultimately disabled it.
00:16:07.200 One of my coworkers, Andy Delcambre, shared insights in a talk at Cascadia.
00:16:12.800 In his talk, he proposed a unique approach to tracking job dependencies by generating unique IDs.
00:16:20.000 When a customer issues a request, we dry-run it to produce a unique ID.
00:16:26.560 Any tasks enqueued as a result would include this unique ID.
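A minimal sketch of carrying one ID through every job a request spawns follows. The talk does not say how the ID is produced, so a random UUID is assumed here; `JOB_LOG`, `enqueue_with_request_id`, and the job names are hypothetical.

```ruby
require "securerandom"
require "json"

# Hypothetical in-memory queue standing in for the real one.
JOB_LOG = []

def enqueue_with_request_id(job_name, args, request_id:)
  JOB_LOG << JSON.generate("job" => job_name, "args" => args, "request_id" => request_id)
end

# One customer request fans out into several jobs that share one ID.
def handle_add_replica_request(cluster_id)
  request_id = SecureRandom.uuid
  enqueue_with_request_id("SnapshotMaster", [cluster_id], request_id: request_id)
  enqueue_with_request_id("ProvisionVolume", [cluster_id], request_id: request_id)
  request_id
end

request_id = handle_add_replica_request(12)

# Later, when debugging, every job from that request can be found by ID.
related = JOB_LOG.map { |raw| JSON.parse(raw) }
                 .select { |job| job["request_id"] == request_id }
puts related.map { |job| job["job"] }.inspect
```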
00:16:32.640 This identification method could be invaluable for future debugging efforts.
00:16:38.000 Josh explored the concept of 'modeling intent'—focusing on what a customer hoped to achieve with objects in our system.
00:16:46.880 For background jobs, this implies creating specific models for tasks.
00:16:53.280 For instance, instead of merely queuing instance provisioning as a job, you would define it within your database.
00:16:59.640 This change allows us to track instances privately in the database.
00:17:06.160 Now the instance provisioning job aligns directly with an Active Record object.
00:17:12.720 This relationship provides valuable insights into running jobs, recently completed jobs, and job states.
00:17:18.560 Moreover, we could compare running 'instance terminate' jobs against 'instance provision' jobs to prevent conflicts.
00:17:24.720 Defining task state allows us to ensure idempotency across jobs.
00:17:30.320 Idempotence means that if you execute a job multiple times, it won’t alter results beyond the first execution.
00:17:36.080 For example, a GET request is idempotent (retrieving the same result), while a POST request typically alters data.
00:17:42.160 Implementing idempotent job designs simplifies our workflow by ensuring consistent behaviors.
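As a sketch of what idempotency means for a provisioning job, the version below checks recorded state before performing its side effect, so a retry or duplicate delivery does not boot a second server. `ServerRecord`, `FakeCloud`, and `ProvisionServerJob` are hypothetical stand-ins for the database record, the cloud API, and the job.

```ruby
# Hypothetical stand-ins for the database record and the cloud API.
ServerRecord = Struct.new(:id, :cloud_id)

class FakeCloud
  attr_reader :boot_count

  def initialize
    @boot_count = 0
  end

  def boot_server
    @boot_count += 1
    "i-#{@boot_count}"
  end
end

class ProvisionServerJob
  def self.perform(record, cloud)
    # Skip the side effect if an earlier run already completed it.
    return record.cloud_id if record.cloud_id

    record.cloud_id = cloud.boot_server
  end
end

record = ServerRecord.new(7, nil)
cloud = FakeCloud.new

# Simulate a retry or duplicate delivery: the job runs twice.
2.times { ProvisionServerJob.perform(record, cloud) }
puts "booted #{cloud.boot_count} server(s), id #{record.cloud_id}"
```

This guard only covers runs that finished recording their result; a crash between booting the server and saving its ID would still need separate handling.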
00:17:49.760 Additionally, one of my coworkers created a tool called Viaduct.
00:17:55.360 It is middleware that has no built-in Rack support but provides job-wrapping functionality.
00:18:01.760 Thanks to Viaduct, we could introduce instrumentation for job monitoring.
00:18:08.720 We learned from our experiences that the failure of abstractions is not the answer.
00:18:15.360 Altering or adding plugins to existing frameworks can be enjoyable and enlightening.
00:18:22.480 However, solving problems abstractly doesn’t always yield the best results.
00:18:28.080 Most importantly, directly solving the specific issues at hand often proves much easier.
00:18:34.320 For my recent project, I resolved to build a background job system from scratch.
00:18:40.400 Let’s distill what it requires down into three essential ingredients.
00:18:46.720 These ingredients are a work loop, monitoring capabilities, and a queue system.
00:18:53.760 The work loop is a piece of code tasked with picking up jobs, executing them, and monitoring them.
00:19:00.720 An anecdote I missed earlier: at 3M I was a developer and there was a separate operations person.
00:19:07.680 I thought I was excelling when I coded a failover mechanism for AMQP.
00:19:14.560 At Engine Yard, there are no operations personnel since our product is operational by nature.
00:19:21.760 This meant each engineer doubled as operations, leading to a richer understanding of managing background job systems.
00:19:27.680 The third crucial ingredient is the queue, which functions as a repository for jobs awaiting processing.
00:19:34.160 Does anyone here know who Terence is? He’s a maintainer of Resque and Bundler.
00:19:40.800 In one talk, Terence discussed how the Bundler API operates, resolving dependencies without using Resque.
00:19:47.360 Instead, he utilized a thread pool for background job processing.
00:19:53.760 The construct for this queue was a Ruby library called Queue, essentially a thread-safe array.
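A thread pool built on Ruby's `Queue` can be sketched in a few lines. This is an illustration of the general technique, not the Bundler API's actual code; the pool size and the `:stop` sentinel are assumptions.

```ruby
require "thread" # no-op on modern Rubies, where Queue is always available

jobs = Queue.new
results = Queue.new

# Three worker threads share one thread-safe queue.
workers = 3.times.map do
  Thread.new do
    # pop blocks until a job arrives; :stop tells the worker to exit.
    while (job = jobs.pop) != :stop
      results << job.call
    end
  end
end

5.times { |n| jobs << -> { n * n } }
workers.size.times { jobs << :stop } # one sentinel per worker
workers.each(&:join)

squares = []
squares << results.pop until results.empty?
puts squares.sort.inspect
```

Because the queue lives in the process's memory, any jobs still in it are lost if the process exits, which is the trade-off for not running a separate queue server.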
00:19:59.920 Beyond Queue, we could also utilize DRb (Distributed Ruby) to set up remote objects.
00:20:06.080 DRb enables Ruby processes to call methods on remote objects over a network, offering a simple mechanism.
00:20:12.240 This service management object could form the basis of our queue system.
00:20:19.920 For example, a service manager could handle child processes for background jobs.
00:20:26.080 A more familiar example of a queue system is Redis, the popular default queuing system.
00:20:32.320 This snippet illustrates using Redis to implement basic background processing.
00:20:39.200 rpush and lpop are the canonical commands to enqueue and dequeue jobs.
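The enqueue and dequeue pattern can be sketched as follows. To keep the example runnable without a server, `FakeRedis` is a hypothetical in-memory stand-in that mimics only `rpush` and `lpop`; with the redis gem, a `Redis.new` client responds to the same two calls. The key name and payload shape are also assumptions.

```ruby
require "json"

# In-memory stand-in that mimics the two Redis list commands used below.
class FakeRedis
  def initialize
    @lists = Hash.new { |hash, key| hash[key] = [] }
  end

  # Append to the tail of the list; returns the new length.
  def rpush(key, value)
    @lists[key].push(value).size
  end

  # Remove and return the head of the list, or nil when it is empty.
  def lpop(key)
    @lists[key].shift
  end
end

QUEUE_KEY = "queue:default" # hypothetical key name

def push_job(redis, job_name, *args)
  redis.rpush(QUEUE_KEY, JSON.generate("job" => job_name, "args" => args))
end

def pop_job(redis)
  raw = redis.lpop(QUEUE_KEY)
  raw && JSON.parse(raw)
end

redis = FakeRedis.new
push_job(redis, "SendInvoice", 101)
push_job(redis, "SendInvoice", 102)

first = pop_job(redis)
puts "#{first["job"]} #{first["args"].inspect}"
```

Pushing on the right and popping from the left gives first-in, first-out ordering.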
00:20:46.720 The beauty of Queue is that it allows for multiple backends, enhancing the system’s flexibility.
00:20:53.840 So let’s consider how we can consolidate these ideas into a practical, production-ready system.
00:20:59.920 I assembled a script that requires our Rails environment and outlines this implementation.
00:21:06.080 Establishing the loop initiates a set of steps to process invoices in our billing system.
00:21:12.320 The main objective is ensuring that we manage the incoming tasks effectively.
00:21:19.520 During processing, we also harness the concept of safe exit points.
00:21:26.000 Safe exit points define moments in the code where we can safely stop processing.
00:21:32.200 We signal the trap loop to terminate safely whenever certain conditions are met.
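A work loop with safe exit points might be sketched like this. `WorkLoop` is a hypothetical class, not the script from the talk: the signal handler only sets a flag, and the loop checks that flag between jobs. The example calls `request_stop` directly instead of sending a real signal so that it behaves the same everywhere.

```ruby
# A work loop that only stops at a safe exit point: between jobs.
class WorkLoop
  def initialize(queue)
    @queue = queue
    @stop = false
  end

  # The signal handler only flips a flag; it never interrupts a job.
  def install_signal_handlers
    Signal.trap("TERM") { @stop = true }
  end

  def request_stop
    @stop = true
  end

  def run
    processed = []
    until @stop
      job = @queue.shift
      break if job.nil? # a real loop would sleep and poll again
      processed << job.call
      # Safe exit point: the job is finished, nothing is half-done.
    end
    processed
  end
end

work_loop = nil
jobs = [
  -> { "invoice 1 processed" },
  -> { work_loop.request_stop; "invoice 2 processed" }, # stop arrives mid-job
  -> { "invoice 3 processed" }
]
work_loop = WorkLoop.new(jobs)
processed = work_loop.run
puts processed.inspect
puts "#{jobs.size} job(s) left in the queue"
```

The job that was running when the stop request arrived still completes, and the remaining job stays in the queue for the next worker.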
00:21:39.440 Are we nurturing a deeper understanding of how background jobs can be more efficient and reliable?
00:21:45.920 Let's reflect on lessons learned with abstraction awareness.
00:21:52.000 It is crucial to recognize when it is appropriate to create specific abstractions for your needs.
00:21:58.960 Sometimes, directly addressing the issue at hand yields far better outcomes than striving for abstract solutions.
00:22:06.000 Encouraging everyone to contribute to ensuring that problems are effectively solved is vital.
00:22:14.040 We should aim for meaningful contributions while continuously innovating the tools we rely on.
00:22:22.080 Thank you for your time, and I hope you found value in these struggles and successes.