Polyglot Parallelism: A Case Study in Using Erlang and Ruby at Rackspace

by Phil Toland

In the presentation titled "Polyglot Parallelism: A Case Study in Using Erlang and Ruby at Rackspace," Phil Toland discusses how Rackspace addressed significant operational challenges in managing and backing up 20,000 network devices across nine data centers on three continents. The core challenge was to create a robust solution with minimal failure rates in a high-throughput environment where SSH access to the devices was complicated by latency and communications bottlenecks.

Key points covered in the presentation include:
- Initial Challenges: Rackspace struggled with previous solutions using Perl and Ruby that lacked reliability and efficiency due to slow I/O operations and inefficiencies in handling vast datasets. Early attempts resulted in a 260-gigabyte MySQL database that suffered from slow queries and a rigid schema that hindered adaptability to new devices.
- Adopted Technologies: Rackspace adopted a polyglot approach, leveraging Ruby for its front-end Rails application and Erlang for backend processes due to its superior support for concurrency and fault tolerance. MongoDB was integrated for its flexibility compared to traditional RDBMS.
- Operational Benefits: The new system increased scalability and reliability while improving maintenance through simplified code structures. Moreover, Erlang facilitated the independent operation of worker processes, allowing the system to handle failures gracefully without service interruptions.
- RESTful API Implementation: Utilizing the Webmachine framework in Erlang, the team developed a REST API to efficiently handle network device communication and data management.
- Key Learnings: The team emphasized understanding and respecting Erlang’s programming paradigms, including its functional nature and concurrency model, which greatly enhanced application efficiency.

In conclusion, the mix of Erlang and Ruby provided a successful solution to Rackspace's complex device management needs, showcasing the power of polyglot programming and the importance of choosing the right tool for the job. The presentation highlights not just the technical solutions but also the crucial lessons learned during the development process.

00:00:19.920 All right, so my name is Phil, and this is Mike, and we work at Rackspace. You may have heard of us; we host a few servers, 100,000 servers to be exact. We're going to talk about our experiences building a polyglot application using Rails and Erlang.

00:00:30.320 This isn't going to be an Erlang tutorial, but we'll show you just enough so you can look at the code and understand what we're talking about. So with that, I guess we'll just get started.

00:00:42.480 Part one of the problem started when I began at Rackspace about two years ago. There was a significant issue: we had 20,000 network devices that needed to be managed across nine data centers on three continents: North America, Europe, and Asia (Hong Kong). All of our operations run from a data center in Virginia, which means we have high-latency links to London and Hong Kong, and we have 5,000 devices in London. This situation was not trivial.

00:01:01.359 These devices include firewalls and load balancers, and they manage traffic effectively, but they are not designed for high-throughput management. Managing most of these devices requires SSH access, which involves using expect to drive the SSH session. As you can imagine, this process is not very fast.

00:01:26.560 Additionally, many of these firewalls have been purchased throughout the history of Rackspace. While some of the newer devices come with more robust management APIs, most of the older devices were designed for small to medium-sized businesses. Consequently, the product design didn't consider the need for parallel configuration at the scale we operate.

00:01:55.840 The problem we encountered was the need for a high-throughput solution to back up all these devices every 24 hours. We needed to update the devices, and all of this had to happen during off hours for the data centers where the devices are located. For example, if a device is in London, we run our backups at seven o'clock Central Time to avoid disrupting their business hours.

00:02:21.360 The primary bottleneck we faced was I/O communication with all these devices. This was one of those scenarios where we were hard-bounded by network I/O. Therefore, to speed things up, we needed to talk to more devices in parallel.

00:02:34.000 An interesting point arises when considering parallel computing. Often, we think about parallelizing computation, but in our case, the issue lies in parallel access to many different devices. This fact played a significant role in the solution we devised. We also had to deal with massive blobs of data coming from numerous devices that generated a substantial amount of information.
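
Because the work is bound by network I/O rather than computation, the speedup comes from waiting on many devices at once. A minimal Ruby sketch of that idea follows, using a small pool of threads that pull device names from a queue. The `fetch_config` method, the device names, and the pool size are hypothetical stand-ins, and the `sleep` only simulates a slow SSH session; the production system described later in the talk uses Erlang processes rather than Ruby threads.

```ruby
require "thread"

# Hypothetical stand-in for a slow SSH session to one device.
def fetch_config(device)
  sleep 0.05                      # simulated network latency
  "config for #{device}"
end

# Backs up all devices using a fixed number of worker threads.
def backup_all(devices, pool_size: 10)
  queue = Queue.new
  devices.each { |device| queue << device }
  results = {}
  lock = Mutex.new

  workers = Array.new(pool_size) do
    Thread.new do
      loop do
        device = begin
          queue.pop(true)         # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break                   # no work left, so this worker exits
        end
        config = fetch_config(device)
        lock.synchronize { results[device] = config }
      end
    end
  end

  workers.each(&:join)
  results
end

devices = (1..40).map { |i| "fw-#{i}" }
started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
results = backup_all(devices)
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
puts "backed up #{results.size} devices in #{elapsed.round(2)}s"
```

Run serially, forty simulated sessions would take about two seconds; with ten workers most of that waiting overlaps, which is the effect the speakers were after.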

00:03:06.480 A lot of backups equal a big database. The system in place when I started had a 260-gigabyte MySQL database, and almost all of it was in one table. At the time, we were saving a backup from the device every night, regardless of whether the configuration changed, which resulted in a lot of wasted data storage.

00:03:39.280 While we could optimize the amount of data stored, given the size of devices and the fact that even relatively stable devices require changes occasionally, we still observed a significant amount of data. As Phil mentioned, we implemented measures to reduce the data we had to store, but we didn’t want to get too clever with deduplication.
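
The talk does not spell out which measures were used. One simple measure consistent with the description is to keep every stored backup as a whole blob but skip the write when the configuration is identical to the previous night's. The sketch below assumes an in-memory hash as the store and made-up configuration strings.

```ruby
require "digest"

# Stores config for device only if it differs from the most recent backup.
# store maps a device name to an array of { digest:, config: } entries.
# Returns true when a new backup was written, false when it was skipped.
def save_backup(store, device, config)
  digest = Digest::SHA256.hexdigest(config)
  history = (store[device] ||= [])
  return false if history.last && history.last[:digest] == digest

  history << { digest: digest, config: config }
  true
end

store = {}
save_backup(store, "fw-1", "permit tcp any any eq 443")   # first backup: stored
save_backup(store, "fw-1", "permit tcp any any eq 443")   # unchanged: skipped
save_backup(store, "fw-1", "permit tcp any any eq 22")    # changed: stored
puts store["fw-1"].size   # 2
```

Each stored entry remains a complete, directly restorable configuration, so nothing has to be reassembled during a restore.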

00:04:07.199 It's best to keep the data as blobs that are easily accessible. Moreover, conducting ad hoc searches through that information can be difficult but is essential since we need to find specific pieces of information quickly.

00:04:30.880 For example, we had a specific load balancer vendor that had a flaw in its SSH daemon, which posed a security risk. We needed to identify how many firewalls were behind that load balancer and might have been vulnerable, which necessitated taking swift action to secure them.

00:04:56.000 With the old MySQL database, due to improper indexing, that query would have taken an entire day to run, and we simply didn’t have that time. The purpose of these backups is to be able to restore devices when necessary. We have Service Level Agreements (SLAs) with our customers, which set a strict time limit for device restoration. This means the technicians must rapidly access that information to pull configurations and load them onto devices. Fast and reliable data access is imperative.

00:05:39.200 Additionally, each time we interact with a device, we must record events for auditing purposes, compliance with standards, and traceability. Knowing when we last touched a device is critical because we frequently interact with them, which leads us to retain a considerable amount of event information.

00:06:04.320 We faced challenges in MySQL due to improperly indexed tables. We thought about addressing this through migrations, but every time we attempted to add indexes, it literally caused the database to crash. This was a moment of significant frustration—and it made us contemplate other employment opportunities.

00:06:42.000 The rigid schema of a relational database also made it challenging to adapt to new devices, as each device type possesses different properties. As we manage firewalls, load balancers, and IDS, we want to track various attributes. Backup processes can mean different things for different devices, particularly with varying vendor technologies. For some, it might be just a handful of configuration lines, while for others, like the F5 Big-IP, it could be a large zip file containing numerous files.
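
To make the contrast with a fixed table schema concrete, here is a sketch of device records as schemaless documents, written as plain Ruby hashes rather than through any database driver. All names and attribute values are invented for illustration; the point is only that each device type can carry its own fields and still be queried ad hoc.

```ruby
# Hypothetical device documents: each type carries its own attributes.
devices = [
  { name: "fw-lon-17", type: "firewall",
    backup: { format: "text", lines: 42 } },
  { name: "lb-dfw-03", type: "load_balancer", pool_count: 12,
    backup: { format: "zip", files: 118 } },
  { name: "ids-hkg-1", type: "ids", signature_version: "2011-07" }
]

# Ad hoc query: names of all devices whose backup is a zip archive.
zip_backed = devices
  .select { |device| device.dig(:backup, :format) == "zip" }
  .map { |device| device[:name] }

p zip_backed
```

Adding a new device type means adding documents with new keys; existing documents and queries are unaffected.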

00:07:15.200 The ability to reliably store this data and serve it to our technicians for quick retrieval is essential. To give some context to our growth, in 2009, we had approximately 17,000 network devices that we backed up, primarily a mix of firewalls and load balancers. Now, we're at about 21,000 devices. Over the past two years, we've added roughly 4,000 devices, and this growth is likely to continue.

00:07:57.960 The solution in place at that time was composed of multiple Ruby applications—some using Rails, some scripts—and there were overlapping responsibilities among these applications. We had three applications communicating with firewalls in slightly different ways, which complicated matters.

00:08:14.880 This overlap made it difficult to make changes. If we discovered a flaw in our methodology, we had to modify several applications, each structured a bit differently, which complicated the fix. We also experienced scaling problems. Concurrency relied on Ruby 1.8 threads, and because the workload was I/O bound, those threads spent most of their time blocked on I/O operations.

00:08:50.080 To add to our challenges, the version of expect that shipped with Ruby 1.8 analyzed the input stream one character at a time, which was ridiculously slow. I would add that expect.rb is something we complain about frequently. Given that it was designed for a different environment, reading one character at a time seemed logical then, but with our scale—fifteen to twenty thousand devices—it became an inefficient algorithm.
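
The cost of character-at-a-time reading is easier to see next to the alternative. The sketch below is not the `expect.rb` API; it is a hypothetical `read_until` helper that pulls whatever bytes are available in each call and checks for the prompt after every chunk. A pipe stands in for the SSH session, and the prompt text is made up.

```ruby
# Reads from io until the accumulated output matches pattern, taking
# whatever bytes are available (up to chunk_size) per call instead of
# one character at a time. Raises EOFError if the stream ends first.
def read_until(io, pattern, chunk_size: 4096)
  buffer = +""
  buffer << io.readpartial(chunk_size) until buffer.match?(pattern)
  buffer
end

# Simulate a device session with a pipe instead of a real SSH connection.
reader, writer = IO.pipe
writer.write("login banner\nhostname fw-1\nfw-1# ")
writer.close

output = read_until(reader, /fw-1# \z/)
puts output.lines.size   # 3
```

With a chunk size of one this degenerates into the slow behavior described above; a larger chunk size cuts the number of read calls and pattern checks per session dramatically.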

00:09:21.280 We did explore vendor device managers, but they typically only work for specific types of devices delivered by particular vendors, and the number of devices per manager is often very limited. Additionally, these solutions tend to be expensive. What we needed was a solution that would work for the diverse array of devices present in our data centers.

00:09:44.160 To illustrate the problem, big corporations might have, say, 500 firewalls, and they could purchase a couple of licenses for an enterprise device manager from Cisco or F5. However, asking one of those vendors for a solution to manage 15,000 devices would immediately raise a red flag due to potential costs, as this scale is rarely encountered.

00:10:16.560 We implemented a better solution by creating a Rails application for the front end, while the back-end application that handles communication with the devices was developed in Erlang, with everything stored in MongoDB. We were fully buzzword compliant.

00:10:59.920 Today, we won't focus too much on Rails or MongoDB, as most in this room are likely familiar with Rails, and to be frank, our Rails app is not particularly interesting. In regard to MongoDB, we simply set it up and began using it.

00:11:24.160 Interestingly, we might be among the few Rails developers in Texas who have had a good experience with MongoDB. Many developers express concerns about data being corrupted or certain collection names inadvertently routing data to nowhere without warnings; however, that has not been our experience.

00:11:43.680 We have navigated its limitations and remained mindful of them, and we have faced virtually no significant issues. We followed the principle of building the simplest solution that could work, which is something prominently advocated in agile methodology.

00:12:06.320 When asked why we chose MongoDB over Couch when using Erlang, the straightforward answer is that we weren't using Erlang when we selected MongoDB. Additionally, we ultimately felt that MongoDB fit our problem statement better than Couch did.

00:12:39.760 To clarify, the decision to go with MongoDB was made before I joined Rackspace, and it’s not something I want to criticize Phil for. He's very comfortable under the bus, it seems. To be clear, we've had no issues with MongoDB. From my experience with Couch, I’ve felt that its focus on views and distributed map-reduce operations didn't align with what we were aiming to achieve.

00:13:10.240 While Couch looks impressive on the surface, the programming paradigm would require us to invest more time learning it. In contrast, MongoDB appears similar enough to MySQL or PostgreSQL, allowing us to navigate it without extensive learning.

00:13:36.080 You won't find a message queue in our solution because we didn't see the necessity for one. Instead, we run numerous batch jobs within our Erlang application, initiated from the Rails app. The configuration file stores these jobs, while the Erlang application polls for information, allowing us to run scheduled tasks at specific intervals.

00:14:19.040 Our application architecture contains a Rails application at the front end and an Erlang application interacting with our MongoDB in the background. We also have our network devices operating at the core, and other clients within the company connect to our REST API.

00:14:42.080 By a show of hands, how many here are Rackspace customers? Great to see you. You are part of a smart audience!

00:15:03.200 Are any of you using the Rackspace portal, MyRackspace? If you manage your firewall devices from within the portal, you are indeed communicating via our REST API—the Firewall Manager.

00:15:33.600 Now, let’s shift gears and discuss Erlang. Who in this audience was familiar with Erlang when they came in today? A show of hands? And how many of you have Erlang code in production? Just one, and he works at Rackspace, so that doesn’t quite count.

00:16:26.320 Let's look at a brief history of Erlang. It was invented at Ericsson Computer Science Laboratory by Joe Armstrong in 1986. By 1988, Erlang was employed in actual telecom hardware, and in 1998, it was released as open-source. The most notable product built with Erlang is the AXD 301 ATM switch created for British Telecom, which featured about a million and a half lines of Erlang code and achieved nine nines of uptime during a two-year evaluation.

00:17:06.960 It's important to note that Erlang does not guarantee nine nines; that was an exceptional product. However, it does provide the tools necessary to achieve five nines or any reliability target you desire. Erlang is a functional language, distinct from the imperative object-oriented languages most of us are used to.

00:17:51.840 It is dynamically typed and employs the concept of single assignment—once you assign a variable, its value cannot be changed. Although many may initially find this challenging, this feature helps maintain a cleaner code structure.

00:18:31.840 In practical usage, if you keep your functions small and contain your variable assignments within those functions, the approach becomes manageable. For example, consider the simple pattern matching shown in the Erlang example of variable assignment.

00:19:06.080 Mutable data structures are absent in Erlang, leading to the creation of new copies for any modifications made. Surprisingly, this architectural decision, which some deem inefficient, has an important practical upside—minimal state. This aspect aids with concurrency and parallelism.

00:19:52.720 Let's take a look at an example of creating a new dictionary in Erlang. You'll find the first parameter is the key, the second is the value, and the third is the existing dictionary you wish to alter; the call returns a new dictionary and leaves the existing one unchanged. Tuples and records, the latter with named fields, are another notable element of Erlang’s syntax.
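
The slide's code is not part of the transcript. In Erlang, `dict:store(Key, Value, Dict)` returns a new dictionary and leaves `Dict` untouched. A rough Ruby analogue, using a frozen hash and `Hash#merge` (which also returns a new hash), looks like this; the keys and values are made up.

```ruby
original = { "fw-1" => "10.0.0.1" }.freeze

# merge returns a new hash; the frozen original cannot be changed in place.
updated = original.merge("fw-2" => "10.0.0.2")

puts original.size   # 1
puts updated.size    # 2

begin
  original["fw-3"] = "10.0.0.3"
rescue StandardError => e
  puts "in-place update refused: #{e.class}"
end
```

Ruby needs the explicit `freeze` to get this behavior, whereas in Erlang it is the only behavior available, which is why shared state causes so few surprises there.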

00:20:29.920 Erlang is designed with concurrency in mind. This isn't a library extension; it's an integral part of the language. Everything about Erlang, from the virtual machine (VM) to the higher levels, is created with concurrency at its core.

00:21:06.720 As an example of Erlang programming, let us consider the standard factorial function. This simple but illustrative code shows how the functional programming paradigm carries over to real-world use, specifically tail-call optimization, which Erlang handles quite well.
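
The Erlang code from the slide is not in the transcript. As a rough Ruby rendering of the tail-recursive shape, the version below carries the running product in an accumulator so that the recursive call is the last thing the method does. Note the difference in the runtimes: Erlang reuses the stack frame for such calls, while standard Ruby does not eliminate tail calls by default, so this illustrates the structure only.

```ruby
# Tail-recursive factorial: the recursive call is the final action,
# and the running product travels in the accumulator.
def factorial(n, acc = 1)
  unless n.is_a?(Integer) && n >= 0
    raise ArgumentError, "n must be a non-negative integer"
  end
  return acc if n.zero?

  factorial(n - 1, acc * n)
end

puts factorial(5)    # 120
puts factorial(20)   # 2432902008176640000
```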

00:21:41.760 Next, let's examine a slightly more complicated example of implementing QuickSort. In addition to illustrating the recursive function, it also showcases destructuring, a practice familiar to those coming from languages like Ruby.
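
Since the slide itself is not in the transcript, here is the same idea in Ruby. The `pivot, *rest = list` line is the destructuring the speakers compare to Ruby; it plays the role of Erlang's `[Pivot | Rest]` pattern, which splits a list into its head and tail.

```ruby
def quicksort(list)
  return [] if list.empty?

  pivot, *rest = list                                  # head and tail of the list
  smaller, larger = rest.partition { |x| x < pivot }   # split around the pivot
  quicksort(smaller) + [pivot] + quicksort(larger)
end

p quicksort([3, 1, 4, 1, 5, 9, 2, 6])   # [1, 1, 2, 3, 4, 5, 6, 9]
```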

00:22:16.720 Now, let's transition to discussing two key features in our Erlang application: the batch job framework used for device backups and updates, and the RESTful API built with the Webmachine framework. The job framework consists of a runner that spawns multiple worker processes.

00:22:57.200 These workers handle actual job logic through a designated callback module. Thus, communication between the runner and the workers is crucial to the operational efficiency of the system. Our runner starts a worker and is notified when it's ready; it then queues an item for processing.

00:23:38.959 One of the appealing aspects is that the workers operate independently of the runner. If a worker crashes, Erlang’s monitoring features allow the runner to be notified, which enables it to instantiate new workers without any interruption to the ongoing processes.
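
The runner and worker relationship can be approximated in Ruby, with threads standing in for Erlang processes and `Thread#join` (which re-raises a dead thread's exception) standing in for a monitor notification. This is a loose analogue under those assumptions, not the team's code: the `process_item` callback, the device names, and the simulated failure are all hypothetical.

```ruby
require "thread"

# Runs every item through callback using pool_size workers. When a worker
# dies, the runner notices, counts the crash, and starts a replacement.
# Returns [completed_results, crash_count].
def run_jobs(items, callback, pool_size: 2)
  queue = Queue.new
  items.each { |item| queue << item }
  results = Queue.new                      # thread-safe result collection

  start_worker = lambda do
    Thread.new do
      Thread.current.report_on_exception = false   # the runner handles crashes
      loop do
        item = begin
          queue.pop(true)                  # raises ThreadError when empty
        rescue ThreadError
          break
        end
        results << callback.call(item)     # an exception here kills this worker
      end
    end
  end

  workers = Array.new(pool_size) { start_worker.call }
  crashes = 0
  until workers.empty?
    worker = workers.shift
    begin
      worker.join                          # re-raises the worker's exception
    rescue StandardError
      crashes += 1
      workers << start_worker.call         # replace the crashed worker
    end
  end

  [Array.new(results.size) { results.pop }, crashes]
end

# Hypothetical job logic that fails for one device.
process_item = lambda do |item|
  raise "connection reset by #{item}" if item == "fw-3"
  "backed up #{item}"
end

completed, crashes = run_jobs(%w[fw-1 fw-2 fw-3 fw-4 fw-5], process_item)
puts "completed #{completed.size} items, #{crashes} worker crash(es)"
```

The failing item is lost in this sketch, but the remaining items still complete, which mirrors the point made in the talk: a crash is confined to one worker and the runner keeps the job moving.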

00:24:16.320 To illustrate with an incident: our production instance is deployed on two VMs, and when one of them began misbehaving, we moved all of its responsibilities to the second VM without any downtime, maintaining operational continuity while we reinstalled the problematic VM.

00:25:05.440 Moving on, let's discuss the RESTful API, which is built primarily with Webmachine. This framework makes developing an application feel more like writing a tailored web server, and it exposes the nitty-gritty details of the HTTP request-response cycle.

00:25:42.560 Webmachine operates under the principle of convention over configuration, providing sensible defaults at each step of the HTTP lifecycle. I've included its state diagram, which you might find helpful as you familiarize yourself with the workflow.

00:26:23.200 Building resources in Webmachine involves creating simple functions that handle various HTTP methods. A resource is essentially a plain Erlang module that contains functions to manage request processing, making it highly straightforward.

00:27:02.320 For those who may not be well-versed in HTTP, there might be a little curve to overcome, but the architecture is simple, relying on basic functional principles.

00:27:38.000 A dispatch rule resembles a Rails route: it binds parameters from the incoming URL and directs each request to the resource that should handle it.
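
In Webmachine a dispatch rule pairs a list of path segments with a resource module, and atoms in the path bind whatever appears in that position. The Ruby sketch below imitates that matching with symbols in place of atoms. It is an illustration of the idea rather than Webmachine's implementation, and the rules and resource names are hypothetical.

```ruby
# Returns [resource, bindings] for the first rule matching path, or nil.
# In each rule, strings must equal the path segment exactly, while
# symbols bind whatever segment appears in that position.
def dispatch(rules, path)
  segments = path.split("/").reject(&:empty?)

  rules.each do |pattern, resource|
    next unless pattern.size == segments.size

    bindings = {}
    matched = pattern.zip(segments).all? do |expected, actual|
      if expected.is_a?(Symbol)
        bindings[expected] = actual
        true
      else
        expected == actual
      end
    end
    return [resource, bindings] if matched
  end

  nil
end

# Hypothetical dispatch table.
rules = [
  [["devices"],                        :device_list_resource],
  [["devices", :device_id],            :device_resource],
  [["devices", :device_id, "backups"], :backup_resource]
]

resource, bindings = dispatch(rules, "/devices/fw-17/backups")
puts resource                 # backup_resource
puts bindings[:device_id]     # fw-17
```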

00:27:55.200 When I refer to resources, these modules contain callbacks that correspond to different stages in the request cycle, allowing for highly customizable API behaviors.

00:28:33.600 For instance, if a resource does not exist, it can be configured to trigger a 404 response, while authorization checks can ensure secure access.
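
Webmachine resources answer questions such as `is_authorized` and `resource_exists`, and the framework turns the answers into status codes. The Ruby sketch below imitates just those two decisions with a plain module and a tiny `handle` method. It is a simplification: the real decision graph has many more steps with defaults for each, and the device data, token, and method names here are invented.

```ruby
# A resource is a plain module whose callbacks each answer one question.
module DeviceResource
  DEVICES = { "fw-17" => "hostname fw-17" }.freeze   # made-up data
  TOKEN = "secret-token"                              # made-up credential

  def self.authorized?(request)
    request[:token] == TOKEN
  end

  def self.resource_exists?(request)
    DEVICES.key?(request[:device_id])
  end

  def self.to_text(request)
    DEVICES.fetch(request[:device_id])
  end
end

# Walks the callbacks in order; the first "no" decides the status code.
def handle(resource, request)
  return [401, "Unauthorized"] unless resource.authorized?(request)
  return [404, "Not Found"] unless resource.resource_exists?(request)

  [200, resource.to_text(request)]
end

p handle(DeviceResource, token: "secret-token", device_id: "fw-17")
p handle(DeviceResource, token: "secret-token", device_id: "fw-99")
p handle(DeviceResource, token: "wrong", device_id: "fw-17")
```

The resource never builds a response for the failure cases; it only answers yes or no, and the surrounding machinery picks the correct HTTP status.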

00:29:03.680 It's important to note the flexibility that comes with building these modular resources, which you can compose to define how your application will operate while remaining clear and maintainable.

00:29:40.560 Throughout our experiences with Erlang, we learned several valuable lessons. One lesson is to respect the programming paradigms within the language; Erlang's distinct type system can feel low-level compared to Ruby.

00:30:15.160 As a result, understanding how strings work in Erlang—being lists of integers rather than traditional string types—can take some adjustment.
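
In Erlang, the string `"Rack"` and the list `[82, 97, 99, 107]` are the same value. Ruby keeps strings as a separate type, but `String#codepoints` and `Array#pack` can show what the list-of-integers view looks like and why ordinary list operations apply to it.

```ruby
text = "Rack"
codes = text.codepoints

p codes                                 # [82, 97, 99, 107]
p codes.pack("U*")                      # "Rack"

# Ordinary list operations work on the integers directly.
p codes.map { |c| c + 1 }.pack("U*")    # "Sbdl"
p codes.reverse.pack("U*")              # "kcaR"
```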

00:30:52.000 Furthermore, the absence of built-in hash or dictionary data types may necessitate a rethink in how developers structure their code and data.

00:31:27.840 However, leveraging Erlang's capabilities and focusing on its core strengths in concurrent programming will lead to efficient and powerful applications.

00:32:03.200 For anyone considering exploring Erlang, be sure to embrace OTP instead of handling low-level primitives, because it provides a robust framework for building Erlang applications.

00:32:40.480 In conclusion, I'm Phil, and this is Mike. Please don’t forget to rate our talk at Speakerrate.com. If you have questions that we couldn't address, feel free to find us for additional discussions throughout the day.

00:33:26.080 Thank you, and have a great day!

LoneStarRuby Conf 2011