Scientific Computing

Summarized using AI

Ruby Queue

Ara T. Howard • March 16, 2007 • Earth

The video titled "Ruby Queue" features a presentation by Ara T. Howard at the MountainWest RubyConf 2007. In this talk, Howard delves into RQ, a command line tool for building instant Linux clusters using the Ruby programming language. He emphasizes that RQ allows any user with basic Linux knowledge to set up a cluster in a matter of minutes, making it especially useful in scientific environments where large-scale computations are needed.

Some key points discussed include:
- Introduction of RQ: Howard explains the purpose of RQ as a tool designed for managing jobs across multiple nodes in a Linux cluster, allowing for efficient batch processing of tasks in a distributed system.
- Development Background: He recounts his experience at the NOAA, where the need for orchestration of numerous Linux boxes led him to develop RQ. The initial struggles with existing solutions like OpenMosix and SGE prompted him to create a simpler, more tailored solution.
- Architecture and Design Choices: Howard discusses the architecture of RQ, focusing on job queues, node isolation, and minimization of administrative overhead. He highlights his preference for using NFS for shared storage due to its availability despite its complexities.
- Locking Mechanism: The presentation details his approach to creating a locking mechanism for NFS to avoid file access conflicts, which led to the development of a LockFile class to manage concurrent processes effectively.
- Database Utilization: Howard chose SQLite as the underlying database for job management, attributing its robustness and effectiveness in handling shared access as key factors in RQ's success.
- User Experience and Features: He illustrates the simplicity of RQ through a demonstration of its command line interface, showing job submission and monitoring capabilities.
- Practical Use Cases: Throughout his talk, Howard offers examples of scenarios where RQ has been successfully implemented, including processing satellite data and generating images swiftly on a Linux cluster.

In conclusion, Howard emphasizes that RQ has become a mature and reliable tool for scientists and developers needing to manage distributed computing tasks effectively. The lessons learned throughout the development process underline the importance of zero administration, robust data management, and the realization that many users require more computing power rather than just managing data. The talk ends with an invitation for questions from the audience, highlighting Howard's collaborative spirit and openness to feedback.

Ruby Queue
Ara T. Howard • March 16, 2007 • Earth

Help us caption & translate this video!

http://amara.org/v/FGft/

MountainWest RubyConf 2007

00:00:09.760 All right, let's go. I actually just found out that I only had 45 minutes, so I'm going to be pushing through this pretty quickly.
00:00:14.880 I don’t know if anybody's online, but if you are, is anyone online? I guess not. They are? Okay, so if you want to follow along, this presentation is online at code4people.com.
00:00:26.800 The presentation is at presentations slash mountain west ruby conf mwrc.
00:00:32.320 A bulk of what I'm going to be talking about today is in a Linux Journal article, 7922; the URL is there. I just added this link last night. The rest of the presentation has had a few updates, but is complete online, and I will sync it up after the talk today.
00:00:49.920 Actually, it was kind of interesting; I forgot to add that link. As I did on my second slide here, I realized that it's actually Pat down here who hooked me up with Linux Journal to write about Ruby Q initially. I just wanted to give props to Pat for hooking me up and to all the organizers, because anything in the open source community tends to be thankless.
00:01:09.040 There are a lot of thankless tasks, so thanks to the organizers for doing all the hard work putting this together. Before we get moving too quickly here, just to take a second, let me tell you who I am. My name is Ara T. Howard. I've been a Ruby hacker for about five years. I actually work for the university, but I work at NOAA, the National Oceanic and Atmospheric Administration.
00:01:32.640 I used to work in a Forecast Systems Laboratory, and I now work in the National Geophysical Data Center. We work primarily with satellite datasets. For those of you that know me, this will be obvious, but just to make sure everyone knows, I'm a fairly informal person, so if anyone has questions, just throw them out as we go along here unless that gets out of hand.
00:01:58.960 I'm on Complaining Ruby a lot; you may see me, my signature tends to be dot RB scripts, so maybe we can run into each other there. The projects that I manage now include, of course, the Code for People repository, which I figured out the other day has about 50 projects of about a thousand lines each, translating to around 50,000 lines of open-source code—all paid for with your tax dollars. I encourage you to use those. My boss is pretty funny; he is an alien Elvis fanatic, and technically we should release this code with some sort of license, but he's really into pissing off the lawyers, so he just lets me dump whatever I want out there and figures he'll deal with it later. So, get it while the getting's good!
00:03:05.120 I also manage the Sci Ruby Wiki, which discusses scientific programming in Ruby. That's kind of the front end for a vaporware book that I'm never going to release on scientific programming with Ruby. At NGDC, what I do is write algorithms and a lot of code. We do a lot of big image processing; we're a very small group, so we wear a lot of hats.
00:03:30.239 I do all manner of things from low-level C bit twiddling; about 90 percent of the work that I do now is in Ruby. Primarily, what I do is design systems that you might say are medium size, say, 10 to 20-node distributed systems. There are about three of them that we run now, some Linux clusters, some near real-time satellite ingest systems, and some systems for generating global mosaic products.
00:04:01.840 I do a little bit of science, and I spend a fair amount of time reconciling bad software methods—also known as unscientific software methods—with what people purport to be scientific goals. I spend a fair amount of time doing that. You can read that as Fortran for anybody here that works for the government. This is the kind of data that we work with, and it's an example of the kind of product that we generate using the Ruby Q Linux clustering software suite.
00:04:32.320 This is a nighttime image of the Earth from the DMSP satellites. This is a mosaic of 15 orbits; we make these images every—let's see—this image, we make five bands of this. They're about a gig each, and we update this every 5 to 15 minutes on one of the systems that I mentioned. So, there's a Linux cluster behind that; obviously, it's a fair amount of data to be slinging around.
00:05:14.340 So, what I'm going to be talking about today is RQ. What RQ is, is a tool to build instant Linux clusters. That's basically it. It's a command line tool that should allow any non-root user with a couple of or even one Linux box and an NFS box to set up a Linux cluster in five minutes.
00:05:30.479 I've presented Ruby Q many times to scientists, and very often when I'm doing that, I won't even mention Ruby; that's not really part of the equation. So, while I've talked about it many times and given demos on using it, I've never really talked about the technical aspects of it. That's what I'm going to be talking about today. I'm not going to get too technical, but I'm just going to tell you a story about what I went through while writing this code because it has been quite interesting to me.
00:05:58.560 In any case, from a Ruby perspective, this may be my least interesting piece of software. There’s no metaprogramming or cool design patterns or anything like that! However, I learned more from writing this piece of software than any other piece I've ever written. In addition, it’s been the most heavily used piece of software that I've ever created. So, from my perspective, it was sort of an interesting journey to work on this project, and that’s what I’m going to relate to you for the most part.
00:06:29.360 I'm not going to be doing a bunch of code demonstrations here; I will show you some examples of using it. But the point of RQ is that if you go to the web, download it, and go through the tutorial, it really is instant. For someone like yourself, everyone in this room is comfortable installing software; you download a package, and in three lines, you have a Linux cluster. It might sound like a boring demonstration, but that's the point: it is supposed to be really boring and completely zero admin.
00:07:01.280 So, without further ado, let’s dig straight into RQ. The command line tool is called RQ. As I mentioned, it’s a command line tool for building instant Linux clusters. What does that mean exactly? People have different notions of what is meant by a Linux cluster: does that mean massively parallel processing? Is it some sort of batch system? There are other variants on that, like some sort of scheduler or is it multi-user? This really falls more into the category of a batch system. It's just a way to run embarrassingly parallel tasks.
00:07:38.000 You get a big stack of work and you want to farm out one job per node until the work is done. This is nothing like, you know, MapReduce or some kind of MPI compiled code. This is just a system for, you know, like converting your entire MP3 archive to OGG, which you would never do on government computers in the evening on the weekend.
00:08:08.000 So this is another little image, and I'm sorry it's a little small, but the stretched one looked terrible. I’m just showing you again the kind of examples of things that we do. This is not a particularly interesting image to look at; it’s a change image. The red areas are showing growth over time, even though they’re pretty little. The reason I have it up here and what's interesting about it illustrates why we needed—and still need—a Linux cluster.
00:08:41.840 This particular image, it’s hard to believe, took 240 terabytes of intermediate files to make and about three weeks of wall-clock time—not cluster time. So, some fairly heavy lifting. We do some near real-time stuff like this too, which is actually quite simpler to make, but we do some monitoring after storms. This shows when hurricane Katrina hit the Gulf Coast.
00:09:10.080 This is a—I guess I put these in reverse—no, no, this is the day after; and this, I think, is a month afterward. The red areas are showing power outages here, and we have to generate images like this fairly quickly. We actually got a request from the White House to make this image, and again, we need some heavy computing power and a very easy-to-use system for doing a bunch of work in a hurry. We use Ruby Q for that.
00:09:36.640 I'd like to back up a little bit and talk about why I wrote Ruby Q. Maybe someone who’s been around Linux clusters is wondering why I didn't use XYZ, so I want to give some history there. When I got to NGDC, they had just received this giant contract called the Big Dig. The Big Dig involved processing a lot of data, and by a lot, I mean we delivered 75,000 worth of tapes—just the tapes! Just the physical tapes of data to this client. So, this was a lot of data that we were processing.
00:10:06.960 They had been talked into by their IT department to buy a bunch of Linux boxes instead of, you know, a Cray or some other supercomputer because it was so much more cost-effective, which was a good decision. The problem was that the other people in my group were not too technical; no one really considered how to orchestrate 30 boxes sitting in a room. So, when I got there, what they were doing was SSHing to node number one to start a job, then SSHing to node number two to start another job. It was absolutely excruciating.
00:10:48.480 So, my first task was to make these machines work together. I’ve always assumed that when I have a problem, the answer is the first link on Google if I search. I am a heavy believer in using other people's code where possible. I pursued that route for this project. As an interesting aside, just a few days before this conference, I was doing a task and thought I knew exactly how I was going to do it. It was a little bit of image processing where I had to run a convolution filter over an image.
00:11:21.800 I do a fair amount of image processing with ImageMagick and RMagick, among some of the other Ruby facilities. But for those iterative types of processes, Ruby’s not so good; you need to be doing big matrix operations. I thought I was going to have to do this in Vigra—not Viagra, Vigra—a very good C++ computer vision library. I hate writing C++, though, so I just decided to Google, even though I knew nothing was out there, and turns out there is now an ANSI C-compatible computer vision library called CamelIA.
00:11:51.440 It has very cool Ruby bindings to it, so I was able to use that. I was very happy about that. It was the first link on Google if you searched for computer vision and Ruby—at least it was.
00:12:20.320 Getting back to what I was saying, we analyzed several of the big players: OpenMosix, SGE, Condor. These all take slightly different takes on clustering. For example, OpenMosix treats the cluster as one big computer. It makes transparent the fact that you have memory over on another machine and it does all this auto-magical stuff. That was terrible for us because we process big data files.
00:12:50.880 Like, a lot of our code will start out with allocations of six gigs—that's it—and so migrating a process across the network is a very bad idea. In our shop, code has to follow data, not the other way around. So, hopefully, Mosix was out. We looked at SGE, and it was okay, but it was just like a 50-ton sledgehammer to pound in the tack that we had in front of us.
00:13:10.960 That was particularly important because our shop is like a lot of shops where we have a guy that does all the grunt work, the data processing, and the operator, and he was going to be running this. I found it hard to figure out how to submit jobs in SGE; it took me weeks to install it in the first place. IT security made it just too big of a project. Next, we looked at Condor, which is a similar project, but they were just all way too big or not appropriate, and it shocked me there wasn’t something simple out there.
00:13:53.440 So we decided to roll our own—or at least I did at that point. I went to coffee quite a few times with my friend Dave Clements at Collective Intellect, and we just went round and round on what I needed. We came up with a really complicated diagram that was the architecture we wanted: a priority queue of jobs and n nodes. We didn't know if it was push/pull—that's why I have it bi-directional—to get jobs and run them as fast as they can.
00:14:39.360 This diagram had a few things that don’t jump out at you immediately, but they were quite important to the final architecture, which is there was no connectivity between nodes here. I wasn’t into detecting network partitions and dropped connections or trying to orchestrate the nodes. I wanted each node to work in isolation. In other words, if this queue is a boat that’s full of water that’s sinking, everyone’s got a bucket. That’s how it works. Everyone goes as fast as they can, and that’s it. If other nodes go up in smoke, I don’t care; everyone else keeps going.
00:15:10.639 That was the architecture that I wanted, and it was primarily because we’re a research shop and we do a lot of jobs that take weeks and weeks. There are very few things we’re not launching rockets for, and if a job doesn’t complete, I can find out about it on Monday. I don’t want the cluster down because communication between two nodes blew up or the switch went bad or something. I wanted a system that continued at all costs.
00:15:41.119 Another point to mention is that there’s no scheduling in this kind of a system. So the answer to ‘we can’t go fast enough’ in this system is: buy more nodes! I didn’t want to write a scheduler; I started, and it was just horrifically complicated. This was the architecture that we wanted.
00:16:07.440 After looking at and actually making some prototypes, we made an SSH based system that’s sort of like the Python cluster workstation cows. If anyone knows that, they just SSH jobs out everywhere. We did a prototype with that; we actually got some processing going. We talked a while about using MySQL. We had an instance of MySQL in the center... actually, several instances. I thought about using that as the orange box here to store the queue, but I decided against it for a couple of reasons.
00:16:43.680 At the time, we didn’t have a replication scheme. I didn’t want to poll, and MySQL doesn’t support asynchronous client callbacks like Postgres or Oracle support. I really didn’t want to be polling the database; I didn’t want the users of it to have to deal with the authentication username thing. Security in the government means six months added onto your project.
00:17:19.440 So, I wanted to manage the whole thing with normal UNIX permissions. But the biggest reason was that I didn’t want to add another point of failure to our system. We didn’t have replication, and I wasn’t happy with the availability of that service.
00:17:46.800 And that was what got me thinking about NFS. Like most shops—or most scientific shops I’ve been in—everyone has an NFS box sitting there. People don’t like to think about it, but when you put your code on there and your configuration, the reality is if that box stops, you stop! It turns out that it’s extremely difficult to make NFS highly available. It's very expensive; it can definitely be done, but when managers talk about wanting five nines in another system and NFS is involved, the answer is spend five nines on a NetApp—that’s what you do.
00:18:09.440 So we had the same setup. We had an NFS box that we were absolutely at the whims of—it had to be up and down. I wanted to leverage that, so I started thinking: what if I could put the shared storage on NFS? This isn’t exactly a revolutionary idea, but anybody who’s developed with NFS will know immediately some of the dangers that arise with 30 nodes simultaneously accessing something over NFS.
00:18:59.760 I follow this development process that I call hotspot development. I don’t know, there’s probably a better name for it, but it’s not top-down or bottom-up; I like to sketch out the rough idea of what I’m making and then drill down into the areas I think are likely to be problem spots and solve those in their entirety up front. I was most concerned in this case with shared access to stuff on NFS. I was very concerned about that.
00:19:43.680 Making that atomic, consistent, durable—all those things. The first thing that I found when I started working on that was that there was no access in Ruby to the fcntl command, which gives you safe locking over NFS, dude, man FN_CTL, aka POSIX compatible locks—file locks. So I wrote a Ruby gem that’s just a little C binding that gives you access to that. It’s no big deal; it’s like a 20-liner.
00:20:35.679 The reason I pointed that out is so that you guys know that it exists; but also—and here’s something I want to learn about today—I wanted to do this with DL, but I couldn't because the struct for the fcntl system call is such that all the fields aren’t defined in the spec; it just has those fields. So you don’t know the ordering of the spec! As far as I can tell, you can’t use it from DL because you need to completely describe the struct so it knows how to pack and unpack that binary structure in DL. If there’s a workaround for that, it would be great because there’s really no reason this needs to be compiled; it would be like a one-liner in Ruby DL.
00:21:25.920 Nevertheless, I made this code and I started testing it, and it worked! So I was on to the next thing. I’m paranoid, and I wanted—for reasons you’ll see later—an alternate locking mechanism that also worked for conflict resolution. So, I built this class LockFile, which is also available as a gem, and it makes NFS safe lock files.
00:21:56.560 Of the operations in NFS that are atomic, one is creating a hard link. It’s interesting that there’s one operation on NFS that’s actually atomic, and that’s link—that’s soft link make dir and open exclusive. All of those things that people use, none of them actually work! I stole this algorithm by the way—I didn’t make it up! There’s a lock file command on Linux; it’s a C program, and you can run it from the shell.
00:22:39.040 For my test, I was, you know, firing up a feeding frenzy on a shared data structure on NFS and running it for days and weeks. I could consistently core dump or get corruption from that. So, I rolled my own and added some features. The LockFile class implements a leasing mechanism so once a lock file is created, it will spawn a thread that keeps it fresh by touching it every configurable number of seconds. In that way, you can avoid stale lock files.
00:23:49.440 So, your system will never freeze on the weekend. In other words, if said lock file is older than whatever configuration you've set, we know that something went dreadfully wrong, and we can do some automatic recovery. Sometimes, you don’t want that, and that’s configurable, but this adds that logic on top of just the atomic file creation. I had that working, and I was noticing some really interesting features about lockd, which is the block daemon for NFS.
00:24:56.480 If any of you use this, these are very important to know about whether you’re using RQ or just NFS for shared access in general. NFS-based locking is not guaranteed to be fair! So what that means is if you have 30 machines that are trying or competing for a lock, the order that they get the lock in is completely undefined. What actually happens is very interesting, and I didn't look into the NFS code, but you can just imagine that the code that makes this work is if you have 30 nodes competing for the lock, one node would get it for like 512 times in a row, then another node would get it for like 512 times in a row.
00:25:47.999 So you can just see the wow, if somebody got it so many times in a row, give someone else a chance! This was not really a good thing because it meant somebody would get the lock for minutes on end, not getting a job before someone else would. The other issue that I had is firewalling. If you have a firewall in your shop and you have NFS, I guarantee you it doesn’t work and that your system administrators don’t know how to make it work!
00:26:57.680 We have really good system administrators and good security administrators, and they had no clue! The guys on the NFS list will tell you how basically NFS floats some ports in an FTP-style, you know, where the client and server negotiate what ports things are going to be open on, and you can pin those to specific ports. You could put that in your firewall, but that’s a very important consideration.
00:27:36.560 This is some ASCII art that didn’t really turn out great; what this is supposed to show is a sawtooth wave. The reason I have it drawn up here is that even once I got all those things working, I noticed that this—the way that the locks would be handed out—was kind of this predictable pattern where one node would get it for hundreds of times in a row.
00:28:16.480 Also, I noticed that I was acquiring the locks in a blocking fashion, right? That’s all you would normally do. You just get the lock and have your process go to sleep until you can get it. That resulted in very, very bad throughput for the locking. So, a node would go minutes without getting it.
00:28:54.640 I didn’t examine the code to see why, but I did some reading online, and it turns out that the way to get locks on NFS is in a way to pull. So to acquire the lock in a non-blocking fashion and be prepared to handle not getting it. The algorithm I came up with is a sawtooth wave where, basically, I pull many times rapidly in succession; you know, with a random sleep, to try to get the lock.
00:29:49.440 This is all configurable, but let’s just say eight times; I try to get it very fast and oh, I didn’t get it—back off ten seconds. Then I try again a bunch of times in succession, and back off 20 seconds. So you see where this is going; you back off. It’s a linear relationship, but you back off more and more and more until eventually, you become very impatient again and you start the process over. That resulted in our case, with 30 to 50 nodes, in extremely good throughput.
00:30:24.560 Basically, every node was competing to get the lock, and over the course of days, the way that the nodes got the lock was completely interleaved by node, and the same number of locks per node over a long period of time. So, that was actually a lot of work! It took me a couple of weeks to figure that out, but the lock policy I ended up running with is a combination of all those things.
00:30:50.960 Basically, using non-blocking POSIX locks, I implement a leasing policy on top of that to keep the file fresh so that other nodes can recognize when something's gone dreadfully wrong—which has happened once, incidentally, in three years of writing this code. And, finally, the LockFile class, which is not high performance but is robust and takes care of conflict resolution and a few other non-performance critical tasks in the code.
00:31:32.960 The database that I ended up using—the data store shared data store—is SQLite. It’s awesome! I don’t know if Jameis Buck is here, but I definitely wanted to give him props for that. Initially, I worked with a few of the pure Ruby solutions: PStore, FSDB, Madeleine, and figured I could just, as long as I was locking, use those right over NFS.
00:32:00.240 What’s the problem? Anyone know what the problem is? It’s caching! NFS is really heavy-duty client-side caching, so even though I’ve written code to make it seem like you can try to flush the NFS cache for pages, you can’t do it with absolute certainty! Again, this is one of those things where we were good for one day, two days, three days, four days, but after two weeks, the data store would be corrupt!
00:32:44.560 I was having to roll my own queries and all this, so I decided I would try SQLite. I had never used it at that point. Well, it is an extremely robust piece of C software that uses every trick in the book to ensure that data is flushed to disk. I personally think it actually does a better job at that than some of the more commonly used open network databases: MySQL, PostgreSQL. It’s very good.
00:33:35.200 I tested this by doing things like pressing ‘the button’ on the RAID array while everyone was writing to it. I pressed the button on the nodes. We brutalized it! Yeah, we could corrupt the database, but it would recover every time. To this day, we still have not had a database—despite hardware failures, RAID failures—that was not recoverable. I’m not saying it couldn’t happen; it obviously could! But it is an amazing piece of software.
00:34:23.600 At that point, we were basically at the alpha stage. You know, we had something that worked, the data store was robust, we could submit jobs to it, and they ran. And that was great! We were running about 30 nodes at that point, and we were doing some useful work with it. This here, this is my little demo. Like I said, I’m not going to bore you with the demo because it's just a command-line utility, so it's not that interesting to look at.
00:35:11.679 But this is going to show you how simple it is. Obviously, you’ve installed RQ at this point, which I’ll talk about later. So we have the capabilities shown right here, where we could create a queue. In other words, the command line always takes the name of the queue as the first argument. This queue, unless you’ve exported RQ_Q in the environment, is the case! If you’re subject to carpal tunnel syndrome, you don’t have to type the full path to the queue every time. You can set it in your environment as a mode of operation.
00:36:03.200 A lot of commands work this way, but all this is doing is creating a queue. The queue is a directory that has some stuff in it like the SQLite database and the external lock file that I use—some directories that are automatically put in your path. It’s really not that complicated. So that’s how you create a queue.
00:36:46.720 Next, you can submit a job to it. There are a variety of ways to do this; this is the simplest way, where I’m submitting to the queue whatever is in the rest of our v here, which is echo 42. And you can see we dump out a record of the job. There’s one table in the database, and these are the fields of that table. It's quite simple. So this probably makes perfect sense to everybody. The job needs to be run, so does everyone understand that? All we've done here is create a database on NFS, put a job into it, and no one is doing anything with it right now.
00:37:25.680 Next, you would start—I call it a feeder—the feeding process. This kind of scrolls off the slide here, but when you start the feeder, you know, this is the process that runs jobs from the queue. So that’s how you start this. This is just showing the Ruby logger output, and there’s really not that much that’s interesting about this, except we can see that the command job ID one is the PID on the host.
00:38:22.560 This is the exit status that was run; this was the exit status of that job. Of course, that information is written back into the database. But this is just showing you from the side that’s running the jobs; this gets logged. It’s like that’s it, basically. Right? You create the queue, put a bunch of jobs in it, then you start feeders on the nodes, and they just go. Now you have a Linux cluster!
00:39:09.600 So, that was the alpha version! We had some issues right away with that; does anyone know what those silly names are in NFS? Have you ever seen those dot NFS really long string of digits files on NFS? Essentially, what it means is in order to give the illusion of a local file system, you know, being cache consistent, if you have Node A and Node B that are both accessing the same file and one of them removes it, the server can’t remove it.
00:39:49.679 So, what it does is it creates this little dot file, dot NFS, in order to give consistency to the client that still needs it. It makes it work just like on Linux; you know when you remove the file and you’re editing it or it’s in memory, it’s still there? But with NFS, the server can go down and come up, and you still need that file on disk, so you get these dot NFS file names that are written!
00:40:35.200 The reason this was happening was due to a lot of database APIs being linked in a way where I was performing actions that required multiple transactions to achieve what I wanted to do. For example, I was starting a transaction on the database, looking for the highest priority—oldest job in the queue, running it, taking it out, and running it. But under the lock of the transaction, I wanted to write the PID back into the database.
00:41:29.280 I didn't want to do a separate transaction, so I started the job. Starting the job requires you to fork or run system calls or whatever. It turns out, it’s undefined to fork in most database APIs with the database handle open. I mean, what happens? Does the client flush it? The parent flushes it? People do this all the time, and they don’t think about it, but it can cause bugs, and in this case, it was causing a mess.
00:42:02.960 This wasn’t a big deal, but for the people using it, there were thousands of these things in the directory, so it was ugly, and users weren’t comfortable with it. This is a real problem. My whole design was based on a very simple idea, and I was like, ‘I want to double the number of transactions just to start a job?’ Besides, it’s not atomic—if I take the job and then crash and don’t start it, what happens to it? Do I introduce another state?
00:43:03.360 So, this is really aggravating! What I ended up doing was very satisfying: I wrote about a hundred lines of Ruby to create a process manager job runner daemon. Before I start the job, I start the DRb child, and the DRb child does all the forking and waiting on my behalf. So, I interact with it through DRb, and it works beautifully! It means I could fork without forking. I could have another process fork for me. Basically, it was totally transparent.
00:43:58.560 The way that that works is, basically, this is what I call a slave process. The DRb process has a pipe open to the parent with a thread in the background reading from it; if it ever throws an exception, it exits. In other words, we can't zombie this child. The DRb process is tied to the parent with a lifeline, this pipe that it just reads from until it detonates. This concept was abstracted and released as the slave.rb library gem. If I don’t know if Ezra is here, but that’s what’s under the hood of Background DRb.
00:44:40.000 So, we were moving on to bigger and better work at that point. This is a kind of cool grid; actually, it’s a depressing grid. What it shows is GDP crossed with nighttime lights, basically to represent population. The DINs here are digital numbers—the wider pixels are basically showing the percent of the people that are in poverty in these areas.
00:45:43.920 This is a somewhat easier-to-look-at representation where we mixed it with the world admin zones. You can see that in the dark maroon areas, 97 to 75 percent of the people are in poverty, and in other areas, 50 to 75 percent. The next issue I had was that starting the feeders by hand was less than ideal. Initially, I was doing it and screening the process so it could run in the background.
00:46:32.560 I was occasionally doing it by hand. Why didn’t I have a daemon process? Well, it’s the government! I mean I can’t open ports; I don’t have root, and I certainly can’t have something in init.d. It’s really painful! So what I did is when RQ starts, it gets a lock file, locally! In other words, when you start a feeding process on a host, it gets a lock file so you can’t run two instances of a feeder—one feeding process per host per queue.
00:47:24.960 So, I leverage that with cron. If you run RQ cron start, what it does is it adds a cron tab line that starts the queue! This command never fails; it’s a non-fatal error to start it if one’s already running! All this does is tickle the process every now and again—make sure it's running every 15 minutes. This gives you user space daemons. If the machine reboots on Sunday, you don’t have to come in; it just fires up by itself.
00:47:57.280 I use this technique in a couple of other pieces of code—my dir watch code, which is a piece of code for making event-driven systems off directory events (file created and so on). The daemon runs in exactly this way, and it’s just a cool design pattern for allowing normal users to install daemon processes. After solving those two problems, I guess we were pretty much beta. I had added some features—some of which we’ll see here—but it was shaping up pretty good.
00:48:50.080 It was running clean with not a lot of ugly files dropped around. It ran across reboots, and we had added some features at this point, like when you submit a job you can tag it with a keyword. Later, you can use that to query, which is very useful. We often dump 100,000 jobs into the queue, and you want to search for just yours.
00:49:41.920 A lot of the RQ commands—in fact, I think all of them—produce YAML on standard output. Most of the commands will take YAML on standard input, so the output of this command can be the input to submit or the input to delete or the input to resubmit. It was a beautiful thing using YAML for this because, of course, everyone can read it. I didn’t tell any of the scientists I work with at TML they had no problem saying, 'Oh, you know that’s the job ID,' but I also didn’t have to parse any of it, which was a good thing.
00:50:36.400 So, you can do these kinds of operations, so here we are. We’re looking for, this is, you know, SQL pattern matching, but you can search the queue for a pattern on the tag name. I’m looking at finished jobs, jobs that are done, and looking for ones done by a particular node. I can filter the fields that are printed out—that’s what the statement is saying—so I can just see a little subset of the database.
00:51:37.600 I collect status statistics, you know, the number of jobs in there, and all the various states, and some temporal statistics—the kind of stuff that you want to know when you check on a Sunday after doing your job to see how things are going and if everything's moving along like you thought it would. This is all kind of boring stuff, but this one is kind of interesting: these successes are programs that exited with a code 0. Failures are non-zero exit codes, and that’s it.
00:52:33.679 This last status is number 42, which we use as a warning signal. The only reason that’s interesting is it turns out when you have a piece of code that you think works and you throw 100,000 files at it, it doesn’t! So, we had to maintain a strict notion of what success was and have a warning code.
00:53:24.080 This is just showing a feature where you can do a hot rotation, basically a log rolling sort of feature of the queue. All the jobs that are finished, because the queue grows without bound, this rotates the ones that are finished or dead, and leaves only the pending, running ones in the queue. This—I mean, you’re not going to read this, but I’m simply showing here that you can access the queue through an API and I’m just gathering some custom statistics for this code.
00:54:20.320 Again here, this is showing a more sophisticated submit. Our queue works in a very Unix-y kind of way. I’m just finding a bunch of files and constructing commands on the fly. In this case, we’re taking the list of jobs to run from standard input. Standard output and standard error of every process is captured—it’s not necessarily a good idea to have a lot of standard output because you have all these processes on various nodes logging to NFS, but it can be very useful for debugging.
00:55:25.040 So, you can tailor a remote process. In other words, using RQ, this is just some lavish praise, and one of the biggest benefits of doing open-source development is that people start buying you beer when you show up in town for writing some software for them. So, the install of RQ is pretty simple; it’s a couple Ruby libraries, four of them, and the SQLite binding. None of you would have any problem with it.
00:56:06.000 But one thing I wanted to mention here—and maybe someone will have some feedback later in the conference—is that typically, with a cluster when you have an NFS server, you don’t want to install software on every node. You have an NFS server, you want to put it on that server and use it.
00:56:56.800 So, I have a custom installer to do that because none of the standard mechanisms, gems I guess being the one that most people use now, really support that notion of putting a whole hierarchy there with dependent libraries that are outside of Ruby. I am working on a gem installer, and I have a few issues trying to dovetail that with this idea of a whole hierarchy that’s separate and outside of system space.
00:57:51.440 I did look last week, and if anyone has experience with this, I’d like to do a Ruby script to EXE distribution of it. So, people can just download a binary and plop it in, but I have very little experience with that. If anyone does, maybe find me after this. So where are we going in the future with RQ? We probably won’t go that much farther. It’s pretty mature; it does what we set out to do.
00:58:50.320 It's extremely robust; I have hundreds of users for several years now, and no critical failures. I mean, I’ve had some bugs, but the system has never crashed for anybody that I know of yet. I’m going to be looking at a variety of ways to do package management better. I mentioned Ruby script EXE.
00:59:37.760 We’re installing RQ in a few classified facilities, which is pretty cool, just to get Ruby in the door in some classified government facilities. We’re looking more at doing single-host usage. People are using RQ just as a back end to queue processes on a single host.
01:00:05.360 Let’s just say you have a Rails application, and users can spawn processes. You don’t want your users to be able to click a button and create 3000 processes. So, people are starting to use it not in NFS. Of course, you can install it on localhost and just use it to queue jobs, so you can ensure that only one or two are running at a time.
01:00:36.800 Jeremy Hein Gardner is actually working on a Rails plug-in to do exactly that. I think this is my last slide here. The lessons I learned while working on this project were surprising. One was that we’re still compute-bound. People really think that data is the big problem now, and it is for a lot of people, but it turns out that a lot of people actually need to harness more computing power.
01:01:05.040 They have a lot of work, and that’s surprisingly difficult to orchestrate a number of nodes to do a task. I mentioned before, SQLite rocks! If you’re looking for an embedded database, I highly recommend it. NFS can be frustrating, but it’s really the only thing in town. It's the standard for shared file systems and I guess at this point we just live with it. There are a lot of issues with it though.
01:01:49.760 I learned that administration sucks; I think I can confidently say that RQ is zero admin. Every administration-based task I’ve automated. I don’t like to look at software once I put it into production, and RQ has that feature now where you can just start it and never look at it again.
01:02:45.760 If you’re going to be doing a lot of NFS work, you need to roll your own locking. If you just use the built-in default API, it’s not going to perform well. If any of you are using LVM on your RAIDs, it really kills performance—at least with the configuration that we had. We were not able to run the Linux Volume Manager on any of our RAID servers.
01:03:25.040 This one is worth mentioning. I want to skip over some of the last ones here, but one of the things about RQ that I didn’t anticipate, but is a very nice feature, is that most people when they’re mounting an NFS box mount it hard. What that means is that if the NFS server goes away, any access to it, that process is just automatically put to sleep. It doesn’t die; it doesn’t throw an error. That’s typically the behavior you want.
01:04:25.040 So, the cool thing with this is that if our server that has the queue and a lot of our data on it goes away, the cluster doesn't die; it just freezes! Even when we've had to rebuild RAIDs and it’s taken three days, I don't have to shut down the cluster. I just tell them to take the thing offline, and when it comes back up, everything just starts where it left off!
01:04:54.720 These three sets here, it turns out that distributing jobs is by far the simplest thing to do in a system. Moving your data around is the hardest thing, and RQ doesn’t address that at all. In our shop, we run vsftpd with anonymous FTP on all the boxes and move data around that way. But just be warned, that’s by far the most difficult thing with distributed computing.
01:05:43.320 I just want to mention, just getting a little jab at Chad here, too—I was bugging him last night, but I’ve used gems, and it’s great for Ruby developers to distribute libraries, but we don’t really have a mechanism for distributing Ruby applications—which is what RQ is—to a scientist that doesn’t even know what RQ or what Ruby is. He doesn’t need to know; he doesn’t want to know!
01:06:15.760 We don’t have a good way of distributing that. Eric Veenstra has done some very cool work with RubyScript EXE that I hope to leverage more, but I think we need a place on Ruby Forge for that for people to just distribute applications and to move outside of the Ruby developer realm to building applications for people who could care less about Ruby—just to write software for them.
01:06:58.640 The last thing I’ll mention is that I was very frustrated initially with the fact that I couldn't open ports and I couldn’t have daemons run, and I was just handcuffed every which way when I was building this software. But it turned out that was a very good thing because after I made it, it turns out there are a lot of other people in the same situation. Because it played within those constraints, now they can use it, and so I have a different attitude toward sometimes annoying constraints placed on developers.
01:07:46.000 So, that was a powerful lesson; did I learn! So, I guess I’ll open up for questions?
01:08:40.200 Okay, great! So, we’re going to break, and that’s it. If anyone has any questions, come on up!
Explore all talks recorded at MountainWest RubyConf 2007
+12