Growing Distributed Systems

Bryce Kerley

Bryce Kerley

1 talk

#distributed-systems

#cap-theorem

#ruby-on-rails

#service-oriented-architecture

#data-consistency

#react

Growing Distributed Systems

by Bryce Kerley

Growing Distributed Systems is a talk by Bryce Kerley, presented at Ruby on Ales 2014, focusing on the challenges of developing distributed software systems. Distributed systems are defined as those that do not operate on a single computer and face constant issues due to their complex nature. Here are the key points discussed in the talk:

In conclusion, Kerley emphasizes that while distributed systems are complex and often fraught with issues, they can be effectively built with a focus on user value and incremental development strategies. He suggests that developers can harness the capabilities of Ruby for distributed systems without needing to pivot to other programming environments.

The talk reassures attendees about the feasibility of scaling their systems while addressing the common challenges associated with distributed systems.

For further insights, attendees are encouraged to consult the shared resources and to engage with the strategies discussed during the talk.

00:00:18.480 So, I live in Miami, and my birthday was right in the middle of RailsConf and RubyConf, and nobody stopped their presentation to wish me a happy birthday. I was leaving for the airport on Tuesday morning, and my neighbors thought I looked ridiculous. My head was really hot; it was about 80 degrees outside, and I was worried that Ben would be insufferably weird, that someone at the hotel would really dislike style sheets, and there would be people I wanted to hire as life coaches who kept a couch on their porch in the rain.

00:00:30.480 So, I'm Bryce. I work at Basho, and I work on the React Ruby client. React is a distributed database. What even is a distributed system? Nobody is really sure, but they are computer systems that don't fit on a single computer. Distributed systems are literally always broken. They are broken all the time; there's always something that's malfunctioning, something that's kind of failing, or something that's slow, which makes it really hard to make them work correctly.

00:01:03.200 These are the kind of systems where you have to worry about light cones, which is about how events propagate. I've played a lot of Team Fortress 2 over the last decade, which is a little embarrassing. The soldier on the left and the sniper on the right both have different latencies to the server in the middle. Whenever the soldier pops into view of the sniper, both react as soon as they see it happen. Who wins? The soldier shot a rocket, which is going to take time to reach the sniper, while the sniper fires immediately.

00:01:40.239 The server has to make a lot of hard choices about which client to trust or which combination of client states to trust. In a lot of cases, something completely incongruous will happen. In TF2, the sniper might shoot the soldier, and the soldier will die, perhaps after he pops back into cover, while the sniper will explode due to a rocket the sniper didn't see hit them. Leslie Lamport, a great expert on distributed systems, says that a distributed system is one in which a computer you didn’t know existed can make your computer stop working.

00:02:18.400 I'm going to talk a little bit about the rules and theory of distributed systems, what inconsistency and unavailability mean, and how to make your Rails app distributed. There's a very famous problem or thought experiment called the Byzantine Generals. It involves two generals, each in valleys on either side of a hill. They must coordinate to attack a town at the hill at the same time or they will both lose. The only way they can communicate is through horse messengers that pass right by the town.

00:02:53.200 Whenever they coordinate times, the first step is to send one soldier or messenger to say they will attack at dawn. However, the first general who sent the messenger doesn’t know if the other general got the message. Now, the other general wants to reply, but the town can intercept these messengers. It turns out there’s no protocol that covers every possibility of intercepted horse messengers, leading to compromises. Unfortunately, a lot of these compromises or protocols are self-organizing critical, meaning that when a small piece breaks, usually part of the safety protocol, it can bring the whole system down.

00:04:19.360 For example, in 2008, Amazon S3 had single-bit errors in their gossip protocol between machines. They would send a message saying there was an error, and as they retransmitted, the errors accumulated. Eventually, they had to bring down all of S3, leading to a terrible Sunday on the internet. In December 2012, GitHub encountered a hardware issue in their network. File server pairs would say they had been partitioned and would shoot the other node in the head, which is the name of the protocol. Unfortunately, they would both die or both act as masters, resulting in divergent states which took a long time to sort out.

00:05:17.680 Finally, over a decade ago, the power grid in the northeastern U.S. had a wire touch a tree during high loads, causing parts of the network to shut down due to safety protocols. Long story short, all traffic lights in New York, Toronto, parts of Washington D.C., and even as far south as Atlanta were left without power because of these safety cut-offs. So, going back to the idea that everything is broken, when your system scales and you have a lot of parts, things will break.

00:06:08.360 At any given time, the Backblaze backup company has 40 dead hard drives. Gears of War II had 40,000 hours of play-testing before release. However, within the first 15 minutes of release, there were more players than there were hours spent in play-testing, and they discovered new bugs that hadn't been detected. The big takeaway in distributed systems is the CAP theorem: consistency, availability, and partition tolerance. You can pick two, but you will always have partitions, so you have to pick one.

00:06:48.560 In conclusion, distributed systems are a land of compromise. Let’s talk a little more about the CAP theorem. Your system will have partitions. There’s a great blog post by Kyle Kingsbury, also known as Afer, called 'The Network is Reliable,' which details the unreliability of networks. You will experience partitions, so do you want to sacrifice everything being correct or functioning at all?

00:07:02.400 If you want to sacrifice functioning, Apache ZooKeeper is a fantastic project that allows you to get a quorum and ensure that your data is correct. In it, you run an odd number of nodes ideally in different data centers. If you lose some data centers, as long as you have a majority still working, they will continue serving requests, while the minority will fail and say they can't meet quorum.

00:07:26.560 Whenever you don't fulfill requests, that’s a valuable property to have. If you're signing up for a new Twitter account, you don’t want to claim someone else’s username, and Twitter would rather you not sign up at all than steal a username from anyone in this audience. On the flip side, we also have React, which is an AP system. When React loses a quorum, it will still accept write operations and perform reads. The writes should eventually become durable and persist at every node, but the reads may be incorrect.

00:07:55.360 Sometimes that really doesn't matter. For example, if you're checking out a YouTube video and notice that it has been viewed fewer times than liked, it’s because the view count isn't critical. YouTube prefers to record likes, reads, and views, etc., and have them momentarily inaccurate rather than not function at all. Ideally, you want a hybrid setup with multiple data stores that have different semantics for different situations.

00:09:00.560 Thus, you can maintain consistency where it's crucial, like for Twitter usernames, and have eventual correctness for tweets themselves. I assure you this will be my last plug: React 2, which will release soon, will feature both existing weak consistency and strong consistency. Lastly, regarding distributed Rails, you're already operating in a distributed way.

00:09:14.640 Whenever someone registers for your app, they typically enter their username and email address and hit sign-up, after which you send them a confirmation email. However, what happens when you've committed the transaction to Postgres, but it never completes and fails to communicate back to your Rails client? Will they receive the welcome email? Probably not. Do they get a 500 error from your Rails app? They do.

00:09:45.360 They'll reload and resubmit the form dialog, leading to a username already taken error. What you thought your Rails app looked like, with neat boxes, validation, and callbacks, is more complicated than you anticipated. Sometimes, the database connection is lost partway through a transaction. Sometimes, everything works but the user loses connection when they switch to airplane mode or leave cell coverage. A powerful way to handle this is by embedding as much business logic as possible within your SQL database.

00:10:17.320 You can enforce correctness in SQL. Many issues arise when I validate uniqueness in Rails but forget to add the corresponding index in SQL, leading to multiple submissions causing havoc later. SQL is an excellent language for writing business logic. However, SQL relies on ACID guarantees, which require most SQL systems to be consistent and partition-tolerant but not necessarily available.

00:10:36.960 Conversely, NoSQL provides flexibility, as there are no expectations for NoSQL systems to be ACID-compliant. They operate under a different set of rules known as BASE: basically available, soft-state, and eventually consistent. NoSQL systems, in general, are not as programmable as SQL, so you can’t depend on your data store to enforce all your business logic. This leads to the temptation of embedding all the application's logic into model classes.

00:11:09.280 This situation is referred to as a 'monorail.' For instance, Twitter started as a Rails app and continued in Rails. It grew larger, and they faced challenges with multiple users trying to work on the same model, which didn't scale. While it might have scaled computationally, it didn't scale organizationally.

00:11:46.680 So, what happens is that you want to move off the monorail and transition to a service-oriented architecture. This might sound daunting, but it extends the solid principles of individual responsibility from objects or classes to services that manage a specific function. Consider the user class in your application. If it exceeds 50 lines, 100 lines, or 1000 lines, that's too much.

00:12:03.920 Your user class probably provides more than one service to your app. It checks passwords against Bcrypt or Scrypt, handles user profile fields, uploads new avatars, and more. It doesn't make sense for all these services to link back to the same model. For username uniqueness, it may be more practical to store that in a sharded MySQL or sharded Postgres database, while larger user profile fields could go into something less expensive, perhaps stored in Memcached or Redis.

00:12:39.040 This way, you can continue to expand your model, breaking it into smaller components. You should create databases or services that act as databases for different services that the user model provides. Next, you'll want to extract the various parts of your app. If you're Twitter, your most pressing issue is scaling to millions of tweets.

00:13:16.560 It's natural to extract tweet storage as a separate app from user profiles, following, hashtags, etc. Once you decide what to extract, establish the verbs and nouns that you'll use to communicate with your distributed app's separate service. For example, fetching a tweet by ID, performing a full-text search on tweets, or posting a tweet, you must also establish error-handling protocols.

00:13:44.760 For instance, if a user posts the same tweet five seconds ago, or if the tweet appears spammy compared to previous submissions, document these protocols that might not already be in place. Implement these protocols and conduct tests to build this service. What is the benefit? You achieve smaller deployments. In a 2009-2010 Rails app, using Capistrano might take 20 minutes to deploy, and all your users might get shunted off simultaneously.

00:14:29.440 The goal of a service-oriented approach is to ensure smaller deployments, deploying a new tweet service to just one percent of your machines at first, gauging stability while maintaining a low error rate and then incrementally deploying it broader.

00:14:46.240 How do you test this? You simply continue writing the unit tests you already have along with higher-level service tests that ensure the service runs as expected. Acceptance testing your system that runs on hundreds or thousands of computers is a different matter. So, you instrument parts of your system and deploy them to a small user base.

00:15:25.520 For example, at Big Ruby in Texas two weeks ago, Netherland presented how they refactored GitHub using science, incorporating numerous probes to gather insights. They conduct A/B testing behind the scenes, running both a new version and the old version in parallel. If the new version returns different outputs, they log the differences without inconveniencing users, enabling informed decisions about functionality.

00:16:14.800 Next, working offline is a specialized case of working online. I, for one, don't run a web server on my iPhone, leading to latency issues with online websites. Machines occasionally halt for garbage collection and stop responding. I often work on planes, and it's disheartening when Google Drive fails to set up offline mode correctly, forcing me to use a simple text editor.

00:16:55.360 Packet loss can be a significant issue. People are working to develop smarter clients to navigate this. Many are tiring of traditional click-buttons resulting in screen reloads. Turbo links are now built into Rails, along with Ember, Backbone, etc., allowing smarter offline apps.

00:17:34.560 The big move is to ensure functionality offline, recording information and addressing inevitable conflicts. For instance, who here has used Google Wave? With only a few hands raised, Google Wave implemented an operational transform protocol, which sounds complex—and it can be, but generally, it operates as follows.

00:18:29.200 Every time you edit something on the smart client, JavaScript reflects your changes immediately and sends a copy of those operations to the server. The server receives the operations from clients, reorganizes them, and yields a coherent result, sending both the results and collected operations back to clients.

00:19:11.440 This enables clients to receive real-time updates for a truly interactive user experience. Although Google Wave may be mostly defunct now, they employ similar technology in Google Docs, allowing multiple cursors and collaborations without glitching. Another significant offline app is Git, which effectively names many changes and provides a fallback mechanism where users can select which version they prefer.

00:19:50.080 Lastly, another prominent offline app I've studied is Vesper, created by some celebrity iOS developers. They ensure their offline client can independently name changes and can merge data on the server when viable, prioritizing the retention of data and preferring duplicates over data loss.

00:20:21.600 These are the three main topics I wanted to cover. I realize I've gone a bit fast. At a high level, Ruby can work effectively for building distributed systems. You don't need to shift to an unfamiliar JVM language with strange punctuation.

00:20:47.360 The main priority is to create something that users find valuable. Every complicated system that functions has emerged from a simple functioning system rather than a complex one that failed. You can develop your systems incrementally, resolving bottlenecks as they arise, without scrapping the entire codebase and starting anew.

00:21:18.320 You can definitely scale your system as required, and that’s everything I wanted to present, excluding numerous bonus slides. You can find all my relevant links at bitly.com/roa-minus-dist, and I'm BonzoAsk on Twitter.

00:21:52.640 Are there any questions so far? Yes?

00:22:18.000 The question concerns handling many deployments and parallel versions when working with distributed systems, particularly regarding versioning different APIs. You manage this in the same way as conventional REST APIs, ensuring that having a little version field can be helpful. While it's unnecessary to build your distributed services solely on REST principles, having organizational methods such as releasing, say, version two of the tweet service, can provide guidance.

00:22:42.560 It's also important to inform clients well in advance when breaking older versions and ensure your protocols are designed to accommodate or phase out obsolete versions without much hassle. However, that can be a tough challenge.

00:23:03.440 Do you have any further questions?

00:23:07.040 I'll share a few more insights I noted. One alternative perspective on the CAP theorem pertains to harvest and yield. Harvest refers to how much data you gain from a specific request, while yield is a measure of the likelihood of that request succeeding. Furthermore, when implementing distributed systems, you lose the auto-incrementing ID columns.

00:23:29.680 This is because while you're writing to a table, someone else might be waiting for the ID to increment before executing their write. Therefore, Twitter utilizes a tool named Snowflake, which generates guaranteed unique IDs that are still loosely ordered enough for storing sequences like tweets.

00:23:54.560 With databases, SQL systems typically focus on ACID, while NoSQL systems fall under BASE rules, which stands for basically available, soft-state, and eventual consistency.

00:24:00.960 There’s also a large array of justifications for extracting services, and many offline databases have not satisfied many users when it comes to performance and management.

00:24:31.200 Lastly, I recommend checking out Jeff Hodges' presentation from Recon West 2013 titled 'The Practicalities of Productionizing Distributed Systems,' which is exceptional and listed within the resource links I shared.

00:24:53.040 Is there anything else you would like to discuss?

00:25:10.560 Thank you for your attention.

Ruby on Ales 2014