Mo' Jobs Mo' Problems - Lessons Learned Scaling To Millions of Jobs an Hour

Summarized using AI


Tanner Burson • February 20, 2014 • Earth

In "Mo' Jobs Mo' Problems" at Big Ruby 2014, Tanner Burson discusses the challenges of scaling job processing systems at Tapjoy, a mobile ad platform. The session traces the company's growth from the Ruby 1.9 era in 2012 to processing millions of jobs per hour, and the problems that surfaced along the way.

Key Points Discussed:
- Job Processing Challenges: Burson introduces the struggles of managing job queues, particularly during high-traffic periods, where job backups reached tens of thousands, resulting in inefficiencies and data loss during event processing.
- Importance of Scalable Systems: The necessity for distributed systems and avoiding single points of failure is highlighted. The naive solutions to job processing faced immediate breakdowns, demonstrating the need for robust designs.
- Queue Selection and Trade-offs: Several queuing options were evaluated, including RabbitMQ, SQS, and Redis, based on factors like durability, availability, and throughput. Each queue had pros and cons, leading to decisions based on specific use cases.
- Building a Custom Job Processor: Finding existing job processors that fit Tapjoy’s requirements was challenging, leading to the development of a custom solution called Chore that could handle a variety of job types efficiently.
- Concurrency in Ruby: Burson explains the methods available for concurrency in Ruby, including forking, fibers, and threads, and shares his experiences and challenges in using these methods for effective job processing.
- Debugging Challenges: The complexities of debugging Ruby code, particularly in high-scale environments, are emphasized, with insights into race conditions and the importance of a deep understanding of the Ruby environment.

Conclusions and Takeaways:
- Job processing is a non-trivial task, especially at scale. The growth in request volumes brings challenges that require detailed consideration of design and implementation.
- Different workloads require tailored solutions; not all jobs can rely on the same processing strategies or systems.
- Recognizing the importance of availability and durability over throughput in some contexts leads to more informed decision-making in system design.
- Continuous learning from past experiences is crucial for building resilient and effective job processing systems.

Overall, Burson’s talk conveys that navigating the realm of job processing at scale in Ruby requires attentiveness to numerous factors, flexibility in solutions, and readiness for unexpected challenges while developing robust systems.


By Tanner Burson

At Tapjoy we process over a million jobs an hour with Ruby. This talk is a discussion of tools, techniques, and interesting problems from the trenches of scaling up over the last two years. Expect to learn a lot about Ruby job queues (beyond Resque/Sidekiq), performance, concurrency and more.

Help us caption & translate this video!

http://amara.org/v/FG3n/

Big Ruby 2014

00:00:20.400 All right, so thanks for the introduction, Mark. As he said, I'm here to talk about job processing; I think this is a really exciting and interesting topic that doesn't get the press it deserves in the Ruby community. Everybody knows about Resque, and everybody knows about Sidekiq, but much past that, it just kind of drops off into oblivion. This is stuff I've been working on pretty much for the last year, so I'm really excited to get a chance to come talk to everybody about this. I'm excited to be here with an audience of actual people today. In preparing this talk, I've given it at home, I don't know how many times yet, but I think my dog is sick of hearing about job processing. So I'm really glad to have people here to listen.
00:01:01.039 So obviously, I'm Tanner. I work for Tapjoy out of our Boston office. For those of you not familiar with Tapjoy, we are an in-app mobile ad platform. What does that mean? We have an SDK for mobile app developers that allows them to put ads, offers, and other monetization options into their applications so that users have ways to pay them that aren't just cash. Along with that, we've got about 450 million monthly active users. This is a big number. Because we're an ad network, we are dealing with lots and lots of requests—millions of requests a minute.
00:01:13.840 So what does that really boil down to? We're talking about scale. Right? This is a big Ruby conference, and we're here to talk about scale. Let's talk about scale. This brings up interesting problems of scale. Anything we put into production goes out to this load immediately, which means the naive solution fails—not in weeks or months; the naive solution fails in minutes. We have to avoid single points of failure; we need to have distributed systems from the start. We have to think about hard problems upfront immediately.
00:01:26.560 I think this is really fun every time. These are really great problems to solve every day, and I'm excited to share some of the things we've been doing. So let's set the scene a little bit. This all started way back in 2012. So what were we all doing in 2012? We were upgrading our apps to Ruby 1.9, we were watching Gangnam Style mashups, and waiting for the Mayan apocalypse. Outside of those things, things were going pretty well. We were pretty happy with the way things were, mostly. Good news: we're an ad network, and our traffic is going up—way up. We're seeing month after month, quarter over quarter, huge increases in traffic—50-plus percent increases within six-month spans. We're moving really fast, we're scaling up really big, and this is great for us because more requests mean more money.
00:02:00.639 But not everything is perfect. Our job system is starting to show some issues—predominantly during our high-traffic periods, jobs are backing up, and not just a little bit. Jobs are backing way up; we're seeing 10-plus thousand messages in queues that aren't clearing out. They'll stay that way until peak traffic starts to trend back down, and finally, we can get all caught up. This is not awesome. Secondly, we have an event processing system similar to what Coraline was talking about just a few minutes ago. Much like she was saying, these are important things to us. This data is really useful, but we've discovered that we're losing small amounts of this data. We don't know how much and we don't know exactly where in the pipeline we're losing it, but we're dropping some of it, and that's really bad.
00:02:43.680 For the worst news, all of this is our fault because the job system and the event processing system are entirely homegrown—emphasis on 'grown.' These are the kinds of systems you discover one day are there, and you can tell no one sat down to think hard and planned it out. No design reviews—this system just kind of accreted code over years. Now it's there in front of you, and it's a problem, particularly with the job system. We can't scale up individual jobs. We can't identify that one job that's going slow and say, "Hey, let's make sure we can do more of those right now." We can't really do that in the way it's designed.
00:03:21.360 Let's talk a little bit about what event processing means to us. For us, these are data points in our application, such as a user has seen an ad, a user has clicked an ad, or if we're lucky, a user has converted on an ad. That means we're getting money. This is good stuff. This is the core of our business analytics and intelligence system. We generate a lot of those—this is on the order of hundreds and hundreds of thousands per minute, you know, five to six hundred thousand per minute, lots and lots of these. We're losing a very small percentage of that, but we don't know how many that is, and we don’t know which ones we're losing, and that’s bad.
00:03:55.520 Part of the reason we don't know is that the way the system is designed currently is basically syslog. We write messages out to syslog; we have some custom log handlers that bundle and aggregate and do things to the data, and then ship it off to our actual data warehousing system. But syslog's kind of a mess, and it's really not easy to tell what's going on in there other than sometimes you get exit codes that aren't zero, which doesn't tell you much. So this is not awesome; it's time for us to do something about these problems. We're seeing all this growth and all this coming up ahead of us—we can't stay this way. We've got to fix things. But change is hard. Where do you even start with a problem this big? We can't just dive in and go fix all of this all at once; these are big systems with lots of data.
00:04:44.640 So we kind of need to figure out what's important to us first, and what we truly need to be approaching. What it comes down to is we're talking about message queues and job processing. When you start talking about that, you're really talking about two components: a queue, something that takes data in on one side and lets data out on the other side—it’s pretty simple—and secondly, you need something to extract data from that queue and do something meaningful to your application. Without that, it’s just data flowing through a queue that doesn't mean anything. We need something to make it mean something to us. So let's break this down a little bit.
00:06:22.560 When we talk about queues, there are a lot of choices. You may be familiar with some of these terms; I wasn’t familiar with all of them. I actually googled them the other day, and there’s just a crazy amount of them. I think most people in this room have probably heard of RabbitMQ or have seen Redis used as a queue, or maybe Beanstalkd if you're really up on some of the cool things in Ruby right now. But there’s just a ton of choices. How do we even figure out which of these things is useful and meaningful to us?
00:07:13.200 So we decide the same way we decide with every problem in computer science: we get three choices, and we only get to pick two of them. Great, we do this every time, but here we go again. In the case of queues, there’s really three things that matter more than anything else: we need durability, availability, and throughput. Let's break these terms down a little bit. Durability—what do we really mean there? All we mean is I want to put a message in a queue, and I want to be reasonably assured I'm going to get that message back out. We’re talking about avoiding disk failure or process crashes or the simple kinds of things that just happen every day in real systems.
00:08:19.520 Next, we're talking about availability; this is really critical. We’re talking about whether this thing is a single point of failure—hopefully not, because we don’t want single points of failure. We have big systems, but what kind of tuning can we do to that? How available can we make it, and what are some of the trade-offs in doing that? Lastly, and somewhat obviously, how many jobs can I mash through this thing? We've got a lot of jobs—we need to move a lot of data. How fast can we push this data in, and how fast can we get that data back out again? Because none of this matters if it’s going to take way too long; that’s not useful.
00:09:41.760 Except that isn't quite the whole list, right? We live in the real world, and things cost money. At some point, we will have to make trade-offs based just on how much it will cost to do these things. You can imagine ideal scenarios where you can get blazing fast things that are always available and will never lose your messages, but it will cost you ten million dollars a server to have it—that's probably not feasible for anyone in this room. That's not realistic, so we do have to consider cost, but it's not the top of our list.
00:10:15.120 So for us, we narrowed that list down to three choices. These choices were based on the criteria above and, truthfully, like anyone else, things we were familiar with internally—things we had experience with and knew what to do with. That list was RabbitMQ, SQS, and Redis. If you're not familiar with RabbitMQ, it's a blazing fast, really slick message queue that is flexible and can do almost anything you want. It’s pretty awesome in that regard. Then there’s SQS, Amazon's hosted queue system—it's a little less flexible, but it's really incredibly available. Lastly, there’s Redis, because we’re in Ruby, and everyone has used Redis or Resque or Sidekiq, and we know how this works. We have Redis around for other things anyway; maybe we can just use that.
00:10:54.960 Breaking down RabbitMQ, it can be incredibly fast. It can also be really durable; you can almost ensure that no message will ever get dropped—both putting them in or taking them back out. That’s awesome; that’s a really great feature, and it’s pretty fault-tolerant. The problem with all of this in the case of RabbitMQ is that’s entirely in how you configure it. You can configure it to be completely non-durable, incredibly fast, and not fault-tolerant at all—it’s a choice. So it's a little bit complex, and you really have to understand how Rabbit fits together and how it works to make those choices fit what you need them to do.
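To make that concrete, here is a minimal sketch using the Bunny client gem (the talk doesn't name a specific client, and the queue names here are made up): the same broker can give you durable, persistent delivery or fast, transient delivery purely through declaration options.

```ruby
require "bunny"

conn = Bunny.new # assumes a local RabbitMQ broker with default credentials
conn.start
ch = conn.create_channel

# Durable queue + persistent messages: survives a broker restart, at a throughput cost.
durable_q = ch.queue("events", durable: true)
ch.default_exchange.publish("user_clicked_ad",
                            routing_key: durable_q.name,
                            persistent: true)

# Transient queue + non-persistent messages: much faster, but gone if the broker restarts.
fast_q = ch.queue("scratch_events", durable: false)
ch.default_exchange.publish("user_saw_ad", routing_key: fast_q.name)

conn.close
```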
00:11:38.080 Then there's SQS. It's incredibly durable; I don't know if it's quite as durable as S3 (Amazon claims something like one bit of data lost somewhere around the heat death of the universe), but it's pretty darn close. After that, it's incredibly available. I know Amazon gets knocked a lot for EBS and some of the failures and things they have there, but SQS actually tends to be the one thing that is always up inside of Amazon. Lastly, it's not as fast. All of these things boil down to it just isn't going to be quite as fast as RabbitMQ. That might be okay; we'll have to really look at the numbers to see how that shakes out.
00:12:27.040 Then there's Redis. This is the one most people are probably familiar with, right? Redis is a key-value-ish data store—a bit of a Swiss Army Chainsaw of data stores. It can do anything, and it can do some of those things well. It turns out that queuing probably isn't the thing it does best, but it's still really useful. It's really fast; I think everybody here has played with it enough to know how incredibly fast it is. It's also reasonably durable; you can set how long stuff will reside in memory before it gets flushed to disk and manage all of that.
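The basic Redis queueing pattern (the one Resque builds on) is just a list: push JSON payloads on one end and block-pop them off the other. A minimal sketch, with a made-up queue name:

```ruby
require "redis"
require "json"

redis = Redis.new # assumes a local Redis server

# Producer: push a job payload onto the list.
redis.lpush("jobs", JSON.generate("class" => "SendEmail", "args" => [42]))

# Consumer: block until something is available, popping from the other end (FIFO).
_list, payload = redis.brpop("jobs", timeout: 5)
job = JSON.parse(payload) if payload
puts job.inspect
```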
00:13:20.920 But remember, we're way back in 2012. At that point, Redis Cluster, Redis Sentinel, and the other Redis clustering tools didn't exist; everyone was still running around trying to figure out how to make that work. So we didn't have a good story in Redis about how to make it distributed or how to make it not a single point of failure, and that's a real problem for us. What we ended up deciding is that we really had two very distinct use cases for our queuing. First, we have our event processing case, which we've talked about. We really care about this data because it's really important to us, so it needs to be very durable. It also needs to be incredibly fast—we have so much of this data coming in that we need to be able to mash it through the queue really quickly.
00:14:12.960 Secondly, we have our general-purpose jobs. When we talk about general-purpose jobs, these are the things you always have running in your web app; we need to send an email, or we need to make an HTTP call to a partner, or maybe we have this other task that’s not terribly important, but it's really memory-intensive, and we just don't want to do that interactively in the app. These are the types of tasks we send off into jobs, and we have a lot of those too, but not quite as many as we do with the event processing stuff. Okay, so this is a Ruby conference, and I’ve been rambling about queues, so let's talk about Ruby.
00:15:18.080 I think that’s what everyone would rather hear about. So let’s talk about job processors. Job processors—we have a lot of those too. There are not quite as many here as in the queuing case, but there are a lot of job processors in Ruby. So again, we have to figure out what makes sense and how we can fit these things in. Interestingly, in our case, we’ve already picked a queue, so we can really narrow this down to things that can work with the queues we want to work with, and the answer is they don’t.
00:15:58.640 In 2012, we looked at basically every job processor you could use in Ruby and discovered that pretty much none of them did what we wanted. So let’s build one or two. So what is a job processor? At its core, it's just three things: it’s something that takes messages out of a queue, it’s something that does some decision making based on that message and finds a chunk of code we want to run, and then it runs that code. That’s really it—that’s all we have to do. This is easy. We’re engineers; we can do this.
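In other words, a job processor is roughly this loop; the sketch below uses an in-memory Queue as a stand-in for the real queue, and the job names are made up:

```ruby
require "json"

# Map a job name from the message to the chunk of code that should run.
HANDLERS = {
  "SendEmail"   => ->(args) { puts "emailing user #{args.first}" },
  "CallPartner" => ->(args) { puts "calling partner API with #{args.inspect}" }
}

def process_one(queue)
  payload = queue.pop                          # 1. take a message out of the queue
  job     = JSON.parse(payload)
  handler = HANDLERS.fetch(job["class"])       # 2. decide which code it maps to
  handler.call(job["args"])                    # 3. run that code
end

queue = Queue.new                              # stand-in for SQS/Rabbit/Redis
queue << JSON.generate("class" => "SendEmail", "args" => [42])
process_one(queue)
```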
00:16:29.680 Okay, but we’re talking about scale here, right? Like, this is big stuff. We can’t just have one machine read from a queue, find a job, and then run that job one at a time on a machine—that’s going to take a lot of machines to get through hundreds of thousands of jobs a minute. So we need to run these jobs concurrently. Alright, well, we’re in Ruby, so this gets a little bit harder, but we know how concurrency in Ruby works, so let’s nail this.
00:17:16.480 So Ruby concurrency—we pretty much have three choices. You have forking, which has been around since the dawn of Unix; you spin up a child process and do things in it. It can't really talk back to the parent easily, but that's probably okay. Fibers—fibers are the cool new thing in Ruby 1.9; we should use these because new stuff is always better. Fibers are really cool; maybe we can make use of this. Then threads—we've been told forever that we can't use threads in Ruby, but maybe that's not true—maybe we need to look at this and see.
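For reference, the three primitives look like this in plain Ruby; this is only an illustration of the mechanics, not how the production system was structured:

```ruby
# Forking: a separate child process; isolated memory, easy cleanup, awkward to talk back.
pid = fork { puts "work happening in child #{Process.pid}" }
Process.wait(pid)

# Threads: shared memory; MRI's GIL keeps Ruby code from running truly in parallel,
# but threads still interleave nicely around blocking I/O.
threads = 2.times.map { |i| Thread.new { puts "thread #{i} doing some I/O-bound work" } }
threads.each(&:join)

# Fibers: cooperative; nothing switches until the fiber explicitly yields.
fiber = Fiber.new do
  puts "fiber started"
  Fiber.yield
  puts "fiber resumed"
end
fiber.resume
fiber.resume
```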
00:18:35.360 Let's talk about our RabbitMQ job processor. We really looked around at a lot of things, even played with a few that supported RabbitMQ, and decided that they just didn't fit what we needed, so we'd build our own. What could go wrong? We can handle this. When you interact with Rabbit, it's highly asynchronous; you make a request for something, and it sometimes comes back later and tells you about it. When you tell it, 'Hey, I'm done with this job,' it'll tell you it has accepted that some time later. Because of that, all the connections to Rabbit are also stateful, so the same connection that you open to receive a job is the same one you have to use to say you're done with that job.
00:19:28.480 So if we need multiple connections to be able to do this quickly, we have to be able to track all of those connections and which job matches with each one. Okay, that’s not terrible, but it's an important detail to note as we go through this. Lastly, it doesn’t have any inherent built-in support for batch processing. Wouldn’t it be nice if we could maybe speed this up by getting ten things at a time instead of doing one at a time? But Rabbit doesn’t really support that inherently.
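A sketch of what that statefulness looks like with the Bunny gem (again, not named in the talk; the queue name is made up): the ack has to go back over the same channel that delivered the job, and prefetch is about as close as you get to batching.

```ruby
require "bunny"
require "json"

conn = Bunny.new
conn.start
ch = conn.create_channel

ch.prefetch(10)                        # allow up to 10 unacknowledged messages in flight
queue = ch.queue("jobs", durable: true)

queue.subscribe(manual_ack: true, block: true) do |delivery_info, _props, body|
  job = JSON.parse(body)
  # ... run the job here ...
  # The ack must happen on this same channel, tagged with this delivery.
  ch.ack(delivery_info.delivery_tag)
end
```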
00:19:59.400 We have these cool things called fibers, and we really think fibers can work here. If you've never played with fibers, first, you're probably lucky, and second, fibers are kind of like threads, except they live on the same stack as your main application. Internally, Ruby gives each fiber its own frame on that stack and jumps back and forth within the stack to switch between fibers; this is much faster than the context switches threads require. It's also much lighter weight than forks or anything else. Because you're working in one stack, you don't have to worry about the same kind of thread mutex cross-talk nonsense you're always fighting, because you can't end up in another place in the stack.
00:20:45.360 Now, because we're only in one stack, if we're bound on CPU, this doesn’t help us because we can do one thing and wait for it to finish, and then we can do the other thing. Well that’s not concurrency—that’s just the same thing we had before. But if we’re waiting on I/O, we can actually leverage fibers in a really neat way and say we're going to send off this data to the server, and instead of waiting for it to come back, we're going to go do something on that other fiber and we'll check back with this one later. We can keep doing that and run hundreds, or even thousands of fibers concurrently because you can get an amazing amount of work done in the 20 milliseconds it takes to make that network request. This is really, really cool.
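A toy version of that idea, without EventMachine: each fiber starts a request, yields while it waits, and a tiny IO.select loop resumes whichever fiber has a response ready. This is a sketch of the mechanism only; the hostnames are arbitrary.

```ruby
require "socket"

waiting = {} # fiber => socket it is blocked on

fibers = ["example.com", "example.org"].map do |host|
  Fiber.new do
    sock = TCPSocket.new(host, 80)
    sock.write("HEAD / HTTP/1.0\r\nHost: #{host}\r\n\r\n")
    waiting[Fiber.current] = sock
    Fiber.yield                       # give up control while the response is in flight
    puts "#{host}: #{sock.readline.strip}"
    sock.close
  end
end

fibers.each(&:resume)                 # kick off every request

until waiting.empty?
  ready, = IO.select(waiting.values)  # wait until some socket has data
  ready.each do |sock|
    fiber = waiting.key(sock)
    waiting.delete(fiber)
    fiber.resume                      # resume that fiber right where it yielded
  end
end
```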
00:21:27.440 So really, fibers are great for these asynchronous workloads. This is really handy; it sounds perfect, except, of course, there's always a problem. Fibers have this little thing: think back a second to fibers being allocated on the main stack, which makes them really fast and avoids a lot of the other pain, but Ruby has a fixed size for an individual frame on the stack. Depending on your Ruby version and the bugs in the 1.9 branch, the size of that stack frame varies, but it's really small—small enough that you probably can't save an ActiveRecord object inside of a fiber. The callback chain inside of an ActiveRecord call can easily be deeper than the amount of stack you have inside of one frame in Ruby.
00:22:08.720 This hurts. You're going along in your application, and then suddenly: bam! SystemStackError. Cool, that was helpful, Ruby—thanks, I appreciate that. Beyond that, we talked about fibers being great when you need the I/O stuff, but it's painful to handle that switching yourself. You want to find some sort of framework or toolkit that manages those I/O switch-offs for you. At the time, we evaluated and went with EventMachine, which creates kind of an event loop and allows you to do neat things with callbacks. It's really nice, but it also means all of your I/O has to know it's being dealt with inside of a fiber, because it has to know when to make those switches and say, "Hey, I can get out of here."
So your code isn't really agnostic to this. Your code has to know that it's running inside of these things, so it's not a really easy thing to just slip into your application. There was also a really nice set of tools around EventMachine called Synchrony that lets you avoid callback chaining and a lot of the mess you end up with in evented code, but it is the most complex bit of Ruby code I've ever seen. We stumbled into a couple of bugs in it, and they were somewhat harrowing to resolve. Lastly: SystemStackError again. Man, these things are awful; not only are they really hard to predict, but Ruby really doesn't want to help you with this.
00:23:47.680 When you get a system stack error, you just get one line of backtrace, so we’re all used to being able to see the full backtrace of everything that just happened. But you get one line, and that’s the line that you blew the stack on. It could be 20 method calls deep in a library you didn’t know you were using on a line of code you had no idea was going to happen—that's all the data Ruby gives you. There’s a bug filed in Ruby to resolve this; it’s been there for about four years, so I wouldn’t hold your breath on that getting solved anytime soon.
00:24:40.480 So let's kind of move on—what did we really learn from this? Well, if we're going to be running somewhat arbitrary code, fibers are not going to be good for that, because we can't control that environment well enough to make sure we're staying inside of the stack, that we have the right network libraries enabled, and that everything plays nicely together. We also learned as we went through this that RabbitMQ is actually pretty complex. The trade-offs of durability, availability, and clustering took a lot of work—more work than we expected—to get all those things to fit together. We really should have looked harder at Celluloid; it's a common choice for this kind of fiber-based, evented work, and at the end of the day I think it simplifies a lot of the things that get hard in EventMachine.
00:25:46.640 Alright, so we're done with RabbitMQ. We got all this up, and honestly, it actually all works. It was a little rocky, but we got there, and we're cool. Now we need to go tackle our general-purpose jobs. Let's talk about what it takes to process jobs with SQS. As usual, there's nothing; in fact, I doubt many people in this room have ever used SQS as a job queue. There's nothing in Ruby that does this. It is shocking, so let's go build it. We learned our lessons the last time—we can do this, and this will be easy.
00:26:31.840 SQS is a really slick system—like everything else Amazon does, it's an HTTP-based API. You make calls to it to push some data, and then you make another call to it to request some data. Like all HTTP things, this means it's synchronous, which adds a little bit of latency. Because we have to wait on those requests to come back, a call to SQS can take maybe 20 milliseconds to do a write. That's a little bit longer than we would probably prefer, but we can probably get away with this for now. It does, however, have some really nice things in that it has batch processing built in; we can send it batches of varying sizes, and it will treat all of those as a batch of messages, and we can fetch messages back down as a batch, though not necessarily the same batch we sent in. But that's okay—we can just parallelize these requests and get a lot more done in that 20 millisecond window, so it probably balances out okay.
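With the modern aws-sdk-sqs gem (the 2012-era SDK looked different), batched sends and receives look roughly like this; the queue name and job payloads are made up:

```ruby
require "aws-sdk-sqs"
require "json"
require "securerandom"

sqs = Aws::SQS::Client.new(region: "us-east-1")
queue_url = sqs.get_queue_url(queue_name: "general-jobs").queue_url

# One ~20ms HTTP round trip can carry up to 10 messages.
entries = 10.times.map do |i|
  { id: SecureRandom.uuid, message_body: JSON.generate("class" => "SendEmail", "args" => [i]) }
end
sqs.send_message_batch(queue_url: queue_url, entries: entries)

# And one round trip can pull up to 10 back (not necessarily the same 10 we just sent).
resp = sqs.receive_message(queue_url: queue_url, max_number_of_messages: 10)
resp.messages.each do |msg|
  job = JSON.parse(msg.body)
  # ... run the job, then delete the message so SQS doesn't redeliver it ...
  sqs.delete_message(queue_url: queue_url, receipt_handle: msg.receipt_handle)
end
```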
00:27:43.520 Let’s talk about concurrency again. We realize for general-purpose jobs, fibers are probably not the answer. So now we're down to our two choices: forks and threads. So why not threads? Well, I think everybody here has read or heard or seen how the global interpreter lock (GIL) in Ruby sucks. It kind of does; it prevents you from being able to run threads completely concurrently. The Ruby VM blocks on certain operations and will not let you have two threads actually operating at exactly the same time, but it can interleave those in ways that may get you close. Other reasons threads are scary include: we have a legacy application—it’s several years old—is it thread-safe? I don’t know. I don’t know if every gem we’re using is thread-safe, and I don’t want to be the one to find out. Thread safety is a little concerning here.
00:30:16.080 General-purpose job queues can have memory bloat. We're going to be fetching a lot of data, and we could be doing a lot of possibly dumb things in our jobs precisely because we didn't want to do them inside a request—so this process could grow pretty big. Historically, Ruby lets you grow that process really big, and then when you're done with those objects, the process stays really big—that's not awesome. So threads don't give us the ability to get rid of that extra memory growth easily. I'm hopeful that 2.1 makes all of this better with all our great new garbage collection stuff, but at this point, we're barely on Ruby 1.9. We think we're doing much better than we were on 1.8, but this is still concerning.
00:30:43.720 Okay, so we're not going to thread our job processing. I guess that means forks. Forks are kind of boring, but we're all used to forks, right? These have been around forever, and some of the systems we all use all the time anyway fork all the time. Resque, notoriously, is a forking job system, and Unicorn, which is a solid Ruby web server, is built around forking behavior in Ruby. So this should be a well-worn path—we're in good shape. This should be easy; let's do this.
00:30:43.720 How can we make our SQS job processing work for us? Let's start by fetching with threads. I know, I know, I just got done saying we can't use threads, and threads are terrible. We can't use threads to process the jobs—that's scary. But if we want to use threads to get data out of SQS faster, that’s okay. In fact, we can avoid the GIL problem because this is all I/O. We're spending 20 milliseconds sending a request to SQS and waiting for data to come back. While that's happening, we could send five other requests to SQS and get more data, which allows us to fan that out quite a bit wider in a single process. We don’t have to fight a lot of the crazy thread safety issues because this is disconnected from our actual code.
00:31:53.680 Then, once we get that list of messages down from SQS, let's go ahead and fork a process with those messages. This gives us a really nice way to manage that memory bloat I was talking about. So we'll process that batch of messages, and when that batch is done, that child process will just die, hence any memory bloat that child process gained during its hopefully short lifetime is gone. We're back to the same small Ruby process, so we can manage this pretty well, and we can get a lot of these running at once.
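The shape of that design, sketched with stand-in methods (the fetch and the job runner here are fakes so the example runs on its own): threads keep a local queue of batches topped up, and each batch is handed to a short-lived forked child.

```ruby
require "json"
require "ostruct"

# Stand-in for an SQS batch fetch; in the real system this is one ~20ms HTTP call.
def fetch_batch
  3.times.map { |i| OpenStruct.new(body: JSON.generate("class" => "SendEmail", "args" => [i])) }
end

# Stand-in for looking up and running the job code.
def run_job(payload)
  puts "[pid #{Process.pid}] running #{payload["class"]} with #{payload["args"].inspect}"
end

batches = Queue.new # thread-safe hand-off between fetcher threads and the forking loop

# Fetcher threads spend nearly all their time blocked on network I/O,
# so the GIL barely matters here; they just keep the local queue full.
fetchers = 2.times.map { Thread.new { 2.times { batches << fetch_batch } } }
fetchers.each(&:join)

# Each batch runs in a short-lived child process; any memory it bloats dies with it.
until batches.empty?
  batch = batches.pop
  pid = fork { batch.each { |msg| run_job(JSON.parse(msg.body)) } }
  Process.wait(pid)
end
```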
00:32:41.560 Since we're building distributed systems here, we're going to do what all the cool distributed systems do: have our own internal stats. We'll run a little web server on a thread so we can hit it and learn how many messages it has processed, how many have errored out, and what the system's throughput looks like. This will be really cool, and then we can have a nice dashboard aggregating this from all these processes across all our servers—this is a great idea! Okay, maybe not a perfect idea; we ran into some problems pretty quickly with this approach. First, passing data over pipes between the child processes and the parent is actually really slow. You might think that, as a fundamental underpinning of the Unix way, it would be faster. The answer is no; it's slow. We could only get about six child processes running before the data going back and forth was consuming about 98% of our CPU. We had no time left to do any work or fetch more jobs; the faster we got, the slower we got. This is a bad trade-off; this doesn't work.
00:33:47.680 Additionally, we ran into an odd issue where occasionally the child processes would just hang. They wouldn't respond; they'd consume zero CPU. They'd stop at whatever memory they had, write no log output, do nothing, and just sit there. They wouldn't respond to signals; nothing was happening. They were just stuck. We scoured Google and saw that both Resque and Unicorn had had some issues integrating with New Relic that might be related, because we were doing something similar, so maybe that was it. A few weeks in, still dealing with this, we decided to just discard the whole pipe-based stats nonsense. We're not that cool, and it's not that useful. We added some more New Relic integration to tie into our internal monitoring better, and figured that would be fine. But we still couldn't reproduce the hanging issue. It was there, and it happened a lot, but we couldn't reproduce it under controlled conditions.
00:34:27.280 It turns out debugging running Ruby processes is really hard. Pry, debuggers, and many of those tools are great when Ruby is responding, but if you're deadlocked or you have an issue going on inside that process, there's no way to get at it. One of our engineers, who's way better at C than I ever want to be, started just running the process and jumping in with GDB, poking around at what was happening within the process and inside the Ruby VM when it was hanging—and that's when we found some really interesting stuff. It turns out Ruby had a little race condition in the early releases of 1.9.3. If a mutex is locked in your parent process at the moment you fork, there's clearly a problem: the child is holding a lock, it isn't running the code that took it, and it has no way to go back and release it. So Ruby helpfully knows this will be a problem and releases all the locks once you fork a child.
00:35:33.960 One of the first things the child process does when bootstrapping is release all of those mutex locks—they don't mean anything anymore. But there was a little race condition: if you forked a couple of processes at almost the same time, that cleanup didn't run. So you'd end up with processes that were literally dead from the start—they could not run their code; they couldn't accomplish anything. Additionally, we were using the railsexpress patch set, which back in the day was the only way to get decent garbage collection behavior in your Ruby app. There's a long history of these patches floating around, and this was fine, but it had another, fairly similar bug that would cause deadlocks. Lastly—and this is the best part; it's weird it never came up before—Ruby's DNS resolver library is written in C because you want it to be fast, and it uses system calls for looking up DNS names and giving you IP addresses.
00:36:29.440 It drops into C in a way that doesn't release the GIL, which means every time you look up a DNS name, your process is locked for the length of that request. Nobody else can do anything; they're all waiting on it to finish. If you hit a slow DNS request, or even a minor network hiccup—any number of things—and this is happening hundreds of times, the process hangs. The solution, which is never the solution in Ruby, was to rewrite it in Ruby. The solution to make a slow C library faster was to rewrite it in Ruby. It works because the pure-Ruby version doesn't hold the GIL while it waits; instead of dropping into a blocking system call, it uses Ruby's normal I/O to do the DNS requests, which releases the interpreter lock during the wait. That ends up being significantly faster.
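The talk doesn't name the exact library, but Ruby's standard library has long shipped a pure-Ruby resolver, resolv, plus a resolv-replace shim that swaps it in for the socket classes; using it looks like this:

```ruby
require "resolv"

# Pure-Ruby lookup: the wait happens in ordinary Ruby socket I/O,
# which releases the interpreter lock instead of blocking inside a C call.
puts Resolv.getaddress("example.com")

# Or swap it in globally: after this, TCPSocket.new("example.com", 80) and
# friends resolve hostnames through Resolv rather than the C resolver.
require "resolv-replace"
```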
00:37:47.720 If you're following along, that means in a month of debugging we encountered three Ruby bugs. There was also a small bug I didn't mention because I don't fully understand it: we ran into an issue in a range of glibc versions, which were the default installs on some Ubuntu releases, where creating a fork was significantly slower than it should be—like a hundred times slower. Who knew? It turns out the fix is really just to upgrade glibc. If you're interested in these things, the engineer who did most of this wrote up a really nice post on one of the Resque project's issues related to all these problems, so I encourage you to look at that for more details.
00:38:39.840 So what did we learn when we got through SQS? We built the thing, debugged it, and got it running. We chose the safest set of things we could, and it’s still quite hard. Scale is hard, and we have something of a motto in our office—a motto you’re likely to hear us say regularly—and that's: the naive solution fails fast. Instantly. This was another example of that. We thought we were doing something straightforward and easy, but it turns out the subtle differences in the ways we were handling this compared to Resque versus Unicorn were leading us to edge case problems that weren’t really edge cases for us anymore.
00:39:10.960 Also, debugging Ruby systems code is hard. We don’t have good tooling for it. I recommend reading up on GDB and learning how to poke around a running Ruby process. There’s an immense amount of things you can learn from this experience, and Ruby actually provides some tools, when running under GDB, that will let you execute Ruby code from GDB and see results. We didn’t know any of that before starting this, and that was some really valuable information to have. Lastly, sometimes the bug really isn’t your bug. I don’t know how many hours we spent poring over all our code, but we tore it inside out. We had logging on every statement; we were timing all kinds of requests; we knew every detail of everything, and it still wasn’t working right, because it turns out Ruby was broken.
00:40:15.600 It wasn't our issue, but we wouldn't have been able to reach that point if we hadn't done the previous work. At some level, things can be broken, and when you start dealing with significant scale and processing tasks at high speeds, you will encounter issues that aren’t under your control more often than you'd think. What this ultimately boiled down to is that we wrapped up everything I discussed into a library we call Chore. It uses the same serialization format as Resque, which Sidekiq also utilizes, so it’s a simple little JSON hash that we pass across the wire into the queues.
00:41:26.560 With it, we can configure behavior per job and scale things up per job, managing them in a way that stays sane. We can set, per process and per server, which queues to listen to, what priority each effectively gets, and how many resources to devote to it. Moreover, because we're engineers and we don't like solving the same problem three times, it is agnostic to the queue and to the concurrency model, which means you can plug different queues into it. It's designed with that intention in mind, and we already have two implemented in there right now.
00:42:42.320 Lastly, if you are fortunate enough to have an application that's thread-safe and doesn't do crazy memory-intensive things, you can run it threaded, which will actually be faster than going through the forking method of processing jobs. The downside, however, is that it's not released yet; it's coming very soon—we're currently going through our internal process to get all of that open-sourced, so it should be available shortly.
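Chore's actual API isn't shown in the talk, so the following is only a hypothetical sketch of what a job in a Chore-like processor might look like: per-job queue options, a Resque/Sidekiq-style JSON payload on the wire, and a dispatcher that maps the payload back to code.

```ruby
require "json"

# Hypothetical job class; none of these option names are Chore's real API.
class PartnerCallbackJob
  def self.queue_options
    { queue: "partner_callbacks", concurrency: 10, backend: :sqs }
  end

  def perform(partner_id, payload)
    # make the HTTP call to the partner here
    puts "notifying partner #{partner_id} about #{payload.inspect}"
  end
end

# The wire format is the familiar Resque-style JSON hash.
message = JSON.generate("class" => "PartnerCallbackJob", "args" => [42, { "event" => "install" }])

# A consumer turns the payload back into a class and runs it.
job = JSON.parse(message)
Object.const_get(job["class"]).new.perform(*job["args"])
```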
00:43:30.640 So let's look back at the issues we had when we started and review how we did. Our job queues were backing up: that's pretty well solved. We have a job processor that's much faster than our old one. Event processing was a little lossy: well, we replaced all that crazy syslog with RabbitMQ, and our new Rabbit job processing system means we can track these things. We know if we're losing messages, and the answer is… zero! So that's good; we've solved that.
00:43:54.160 Sadly, we still built all of this ourselves, so we still own the burden of everything and the debugging burden of it all. We didn’t escape from having to deal with things home-grown, but we do feel we've achieved a well-designed and well-understood structure with a little less suffering piled on top of it. Now we can actually scale up our jobs individually, managing them in ways that are more sane.
00:44:06.800 So now we’re here in 2014. It’s crazy, but we’re processing more than twice the amount of traffic we handled before. We're processing millions of events through this system per minute—millions. We're now processing well over a million of our general process jobs per hour. This stuff is running at scale, and it works. We designed something that fit our scale in 2012; we’re now a year later, and it still holds up. That’s really exciting for us, especially for those involved in building it.
00:44:39.200 In conclusion, what do we know? Job processing is hard. We thought it wasn’t that difficult, but really, things are hard. At scale, everything is difficult, and even seemingly simple tasks, like looking up a DNS address, can reveal themselves to be far from trivial. You can't assume anything when scaling—this was a painful lesson to learn. You shouldn't go into it thinking looking up a DNS address would create your biggest problems. In the end, it wasn’t our largest problem, but it did contribute.
00:45:20.800 Also, we didn’t treat all of our jobs the same at the end. We determined we had really different use cases, necessitating different queue technologies and trade-offs. In some scenarios, we accepted we couldn’t do the same thing, and that gives us more flexibility to move forward and more options at our disposal to manage things later.
00:45:50.640 Finally, we understood the trade-offs better. We worked through our RabbitMQ needs and learned the trade-offs between durability and performance. We also recognized, when we chose SQS, that we were accepting a slight hit to total throughput to gain much better availability numbers and durability we didn't have to manage. And on our bonus criterion, cost, we also understood the cost profile of our choices: how costly it would be to build and scale these systems. One thing we liked about SQS is that it's incredibly cheap. Since you pay per message, it scales as your load increases without needing large incremental jumps to add nodes to clusters.
00:46:24.640 We treat it like a magic bucket. It takes 20 milliseconds to send data into it and 20 milliseconds to pull data out. We haven't found the point at which we throw so much at it that it starts taking 30 milliseconds. We haven’t found that limit yet, which is fantastic. So those are some really interesting trade-offs and things we’re excited about and proud of.
00:47:00.000 If any of what I've mentioned sounds interesting, we, like many here, are hiring. Either come find me later or check out our job site. That’s it. Thank you, everyone!