Fault Tolerance

Summarized using AI

Smoke & Mirrors: The Primitives of High Availability

Paul Hinze • March 29, 2015 • Earth

The video titled "Smoke & Mirrors: The Primitives of High Availability", presented by Paul Hinze at MountainWest RubyConf 2015, explores the concept of abstraction in computer science and its critical role in building highly available applications. The talk opens with the idea that abstraction lets computer systems simplify complex realities through strategic 'lies', which makes possible practical achievements like the internet and high availability applications.

Key points include:
- Definition of Abstraction: Abstraction is described as simplifying complexity to direct focus away from technical intricacies.
- Web Request Breakdown: Hinze breaks down a web request to GitHub, illustrating multiple underlying technologies such as DNS, TCP, and HTTP that contribute to the user experience without requiring users to understand the complexities involved.
- High Availability Explained: He emphasizes the importance of fault tolerance, stating that it involves anticipating and reacting to failures in a system.
- Redundancy as a Primitive: Redundancy is presented as a simple yet powerful form of fault tolerance, exemplified by modern servers using RAID systems and load balancing to handle potential failures smoothly.
- Handling Stateful Services: The challenges of redundancy in stateful services like databases are covered, where replication strategies are vital and trade-offs between consistency and performance must be considered.
- Monitoring Systems: Monitoring is crucial for identifying potential issues before they escalate into failures, thus maintaining system reliability.
- Concluding Thoughts: Hinze advises that designing highly available systems begins with recognizing that failures will occur and that systems should be structured accordingly. Therefore, high availability should be a fundamental consideration in technological decisions.

Overall, the talk emphasizes that understanding and preparing for potential failures will lead to more robust and reliable applications.


by Paul Hinze
Many of the greatest achievements in the history of computers are based on lies, or rather, the strategic sets of lies we generally call “abstraction”. Operating systems lie to programs about hardware, multitasking systems lie to users about parallelism, Ruby lies to us about how easy it is to tell a CPU what to do… the list goes on and on.
One of the primary “strategic lies” of the internet is the presentation of each service as though it were a discrete, cohesive entity. When we use GitHub, we think of it as just “GitHub”, not a swarm of networked computers. This lie gives us the opportunity to build high availability applications: apps designed to never go down.
Let’s take a tour through the amazing stack of tools that helps us construct high availability applications. We’ll review some of the incredible technology underlying the internet: things like TCP, BGP, and DNS. Then we’ll talk about how these primitives combine into useful patterns at the application level. I hope you’ll leave with not only a renewed appreciation for the core innovations of the internet, but also some practical working knowledge of how to go about building and running a zero-downtime application.

MountainWest RubyConf 2015

00:00:08.960 Hello, my name is Paul, and I'm from HashiCorp. Here is my intro slide.
00:00:23.119 This is a picture of me from my first week at my job at HashiCorp. I like this picture because you can clearly see all the layers of emotion: disbelief, excitement, and ultimately fear. It's all in the eyes.
00:00:38.800 You may recognize some of HashiCorp's products. We make Vagrant, Packer, Serf, Consul, and Terraform. I'd like to start today by talking about abstraction.
00:00:56.480 In my opinion, abstraction is the single most important concept in computer science. It's a principle that underlies so much of modern technology.
00:01:06.640 So what is abstraction? In the talk summary, I called it 'strategic lying,' which is a bit unfair. For today, let's define it as the use of simplifying metaphors to manage complexity.
00:01:17.600 The word comes from the Latin 'abstractus,' which is the past participle of 'abstrahere,' meaning to draw away. This aligns with our usage in computer science, where it’s about directing focus away from complexity.
00:01:29.519 The use of abstraction has certainly enabled us to build some remarkable wonders, case in point: the internet. To illustrate this concept, let's quickly go through one of my favorite exercises: breaking down some layers of abstraction underlying a single web request.
00:01:42.320 Let’s say you pull up GitHub in your browser. Most of the time, that’s how you think of it—just pulling up GitHub. However, there's a lot happening behind the scenes when you do that, all in such a way that you don't have to think about it. So for the next few minutes, let’s consider it.
00:02:01.200 Inside your browser, a complicated layout engine determines the position, layering, shape, font, color, and behavior of each element on the page. The layout engine knows what to do by consuming large blobs of text that dictate how to render the content. This text comes in formats like HTML, CSS, and JavaScript, along with assets like images and fonts.
00:02:22.800 So we know that a browser needs blobs of text to function. But where does that text come from? The browser asks GitHub, which means we first have to get GitHub's address.
00:02:39.599 The Domain Name System (DNS) is vast and intricate, an abstraction layer we will explore later. For now, suffice it to say, the browser asks DNS to look up the IP address of github.com and gets an answer back.
00:02:58.959 Now that we know GitHub's address, how do we ask it to send us the blobs of text we need? That's done through a simple text-based protocol called HTTP, which most of you likely understand.
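
To make "simple text-based protocol" concrete, here is a minimal sketch of the text a client could send for an HTTP/1.1 GET request. The helper name `build_get_request` is hypothetical and not from the talk, and a real browser sends many more headers than this.

```python
def build_get_request(host: str, path: str = "/") -> bytes:
    """Return the raw bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",  # request line: method, path, protocol version
        f"Host: {host}",         # the Host header is required in HTTP/1.1
        "Connection: close",     # ask the server to close the connection afterwards
        "",                      # an empty line ends the header section
        "",
    ]
    # HTTP separates lines with CRLF, not a bare newline.
    return "\r\n".join(lines).encode("ascii")


if __name__ == "__main__":
    print(build_get_request("github.com").decode("ascii"))
```

The whole request is plain text that you could type by hand, which is why the protocol is easy to inspect and debug.
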
00:03:16.000 We know how we’re going to structure the conversation with GitHub to get our blobs of text, but how do we actually get messages back and forth between our browser and GitHub? For that, we need a connection, and that's where TCP comes in.
00:03:28.640 TCP establishes a connection between two nodes on the internet, allowing them to exchange text. It's essentially a two-way pipeline for slinging messages back and forth. But the browser isn't directly connected to GitHub’s servers.
00:03:44.640 If I’m doing this from Chicago, for example, there’s a significant distance to cover to establish that connection. This is where the Internet Protocol (IP) comes into play.
00:04:01.040 To see the route from Chicago to GitHub in San Francisco, you can ask a tool like MTR, which displays all the hops between one computer and another. In my case, that’s 13 hops across eight different possible servers.
00:04:12.000 The first two IPs will be my router and my ISP's router in Chicago, followed by two hops in the D.C. area, four in Tennessee, and then several difficult-to-pinpoint nodes somewhere in Colorado and Utah before finally arriving in San Francisco.
00:04:25.760 So how did that path get chosen? Each hop along the route has a router that receives a packet with a portion of our message and the target IP of GitHub’s server. That router uses its routing table to determine which upstream connection is best to send the packet.
00:04:42.560 The router makes a decision and forwards the packet to the next router, which follows the same process. This operation is repeated 13 times until the packet reaches its destination. But how do those routers keep their routing tables updated to know what the best next hop is?
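
The per-hop decision described above can be sketched as a longest-prefix match against a routing table. This is a simplified illustration: the table contents and next-hop names are made up, and real routers use specialized data structures and hardware rather than a Python list.

```python
import ipaddress
from typing import List, Optional, Tuple

Route = Tuple[ipaddress.IPv4Network, str]

# Hypothetical routing table: (destination network, next hop name).
ROUTES: List[Route] = [
    (ipaddress.ip_network("0.0.0.0/0"), "upstream-default"),
    (ipaddress.ip_network("192.0.2.0/24"), "peer-a"),
    (ipaddress.ip_network("192.0.2.128/25"), "peer-b"),
]


def next_hop(destination: str, routes: List[Route] = ROUTES) -> Optional[str]:
    """Pick the next hop whose network matches with the longest prefix."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes if addr in net]
    if not matches:
        return None
    # The most specific route (largest prefix length) wins.
    return max(matches, key=lambda match: match[0].prefixlen)[1]
```

Each router repeats this lookup independently, so no single machine needs to know the full path from Chicago to San Francisco.
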
00:05:00.240 They use something called Border Gateway Protocol (BGP). I find BGP fascinating, and it's hard for me not to turn the rest of this talk into a BGP seminar. For today, let's just mention that BGP lets routers exchange reachability information with each other.
00:05:09.000 The major concepts include announce, which is the ability to route to a network that was previously unreachable; withdraw, which indicates that a network is no longer accessible; and update, which signifies that the path to a network has changed. The core routers of the internet process thousands of these messages per second, allowing the internet as a whole to route messages reliably.
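
As a rough illustration of the three concepts above, here is a toy reachability table that reacts to announce, update, and withdraw events. It is not the BGP wire format (in the actual protocol, announcements and withdrawals are both carried in UPDATE messages); the class and method names are hypothetical.

```python
from typing import Dict, List


class ReachabilityTable:
    """Toy table mapping a network prefix to the path used to reach it."""

    def __init__(self) -> None:
        self.paths: Dict[str, List[str]] = {}

    def announce(self, prefix: str, path: List[str]) -> None:
        # A previously unreachable network becomes reachable.
        self.paths[prefix] = list(path)

    def update(self, prefix: str, path: List[str]) -> None:
        # The path to an already known network has changed.
        if prefix not in self.paths:
            raise KeyError(f"cannot update unknown prefix {prefix}")
        self.paths[prefix] = list(path)

    def withdraw(self, prefix: str) -> None:
        # The network is no longer reachable through this neighbor.
        self.paths.pop(prefix, None)

    def reachable(self, prefix: str) -> bool:
        return prefix in self.paths
```
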
00:05:27.840 These dynamics mean that even though the system is constantly in flux, it remains capable of reliably routing packets. This allows a logical connection to be established between two locations for exchanging blobs of text, which can then be rendered by a browser’s layout engine. To you, it’s simply pulling up GitHub.
00:06:11.520 So, that’s abstraction in action on the internet. The internet is cool; I think it’s awesome how it works. I could talk about this all day long, and while I may have shoehorned it into this talk, I promise it’s relevant.
00:06:28.320 Now, let’s discuss high availability. I’ll start by explaining what it is not. It’s incredible to me that a top-tier company like Chase would take extensive maintenance windows, seeming to go down for six hours every weekend.
00:06:42.480 I hope the explanation is financial and not technical, as otherwise it seems like they just gave up. High availability is essentially about fault tolerance: the ability of your system to avoid, detect, and minimize the risk of failures.
00:07:03.200 Realistically, this is an impossible task, because it involves facing chaos head-on. One reason I find it such an engaging challenge is that successful high availability requires tackling uncertainty.
00:07:18.560 How do we begin tackling chaos? The first step, like in many problem-solving approaches, is to admit that you have a problem. Once we acknowledge that things are going to fail, we can start thinking about why they are likely to fail.
00:07:40.800 Once we identify what is likely to fail, we can begin to take steps to anticipate and recover from those failures. That’s essentially the process for constructing highly available systems: anticipate failure, prepare for failure, and then react when failure occurs.
00:08:05.760 This can also be framed as asking yourself what could fail in your system and then deciding what actions to take based on the knowledge that it will eventually fail.
00:08:26.720 After something fails, analyze what you anticipated correctly, what you didn’t, how your preparations performed, and what changes to implement for future failures. That will be the structure of each primitive we look at today.
00:08:46.880 We’ll discuss anticipated failures, preparations made for those failures, and the reactions taken when things go wrong. You’ll see that each of these primitives exists within an abstraction, so the rest of the system doesn’t have to handle the details of preparing for failure.
00:09:04.320 Let’s start with redundancy. Redundancy is probably the simplest form of fault tolerance. If you know something might fail, just keep several of them on hand. However, in a fault-tolerant system, this means more than just having a closet full of spare parts.
00:09:23.920 When we implement redundancy within a system, we use abstraction to enable the rest of the system to treat multiple redundant copies as a single entity. For instance, when we think about hardware failures, we must consider what happens when a physical component fails.
00:09:41.840 The modern server serves as a great case study in redundancy, incorporating features designed to mitigate common component failures. This includes using RAID to treat a group of disks as a single disk, allowing any of them to fail without a complete system failure.
00:09:56.960 It also utilizes link aggregation for network connections and can include dual or even multiple power supplies. This means that in a modern server, you can literally disconnect any cable or remove any hard drive, and everything keeps working without a hiccup. In fact, it might just beep and send an email notification.
00:10:13.760 As application developers, we need to recognize that though hardware engineers are doing their best to prevent servers from failing due to component issues, we must still expect that servers will fail regardless.
00:10:29.520 A significant part of designing a fault-tolerant system involves asking what happens if a server fails—this applies to both physical servers and cloud infrastructure. Therefore, to implement redundancy at the server layer, we need to allow the network to treat a group of servers performing the same task as a single server.
00:10:48.320 Several technologies in the internet stack can help achieve this. One example is a transparent proxy, which listens on a single IP address and directs incoming messages to multiple servers behind it. Those servers then process the messages as if they originate from the proxy, ultimately allowing redundancy and improving reliability.
00:11:08.520 In addition to providing simple redundancy, the proxy also enables another critical benefit: load balancing. Proxies that fill this role are often referred to as load balancers, and they allow the overall system to handle more traffic than a single server could alone.
00:11:40.480 When considering high availability, understanding the impact of server loss within a load balancing cluster is essential. For instance, if we plan on losing any one of three app servers, we need to ensure each can handle 50% of the incoming traffic.
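
The 50% figure follows from simple arithmetic: if one of three servers is lost, the two survivors split all of the traffic. A small sketch, assuming traffic is spread evenly across the surviving servers (the function name is hypothetical):

```python
def required_share_per_server(total_servers: int, tolerated_failures: int) -> float:
    """Fraction of total traffic each server must be able to handle
    so the cluster survives the given number of simultaneous failures."""
    survivors = total_servers - tolerated_failures
    if survivors <= 0:
        raise ValueError("at least one server must survive")
    return 1.0 / survivors


if __name__ == "__main__":
    # Three app servers, planning to lose any one of them.
    print(required_share_per_server(3, 1))
```

In normal operation each of the three servers only carries about a third of the traffic, so the headroom up to 50% is the price of tolerating one failure.
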
00:12:00.960 To determine whether a load balancer can route traffic to a given backend server, it often makes what are called heartbeat requests. These small periodic requests expect a specific response. If a server fails to respond correctly or at all, the proxy removes it from the rotation.
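
The heartbeat pattern can be sketched as a pool that periodically checks each backend and only routes to the ones that answered correctly. This is a minimal illustration, not how any particular load balancer is implemented: the `BackendPool` class is hypothetical, and the `check` callable stands in for a real network request.

```python
from typing import Callable, List


class BackendPool:
    """Round-robin pool that drops backends failing their heartbeat."""

    def __init__(self, backends: List[str], check: Callable[[str], bool]) -> None:
        self.backends = list(backends)
        self.check = check          # returns True if the backend responded correctly
        self.healthy = list(backends)
        self._next = 0

    def run_heartbeats(self) -> None:
        healthy = []
        for backend in self.backends:
            try:
                ok = self.check(backend)
            except Exception:
                ok = False          # no response at all also counts as a failure
            if ok:
                healthy.append(backend)
        self.healthy = healthy      # failed backends leave the rotation

    def pick(self) -> str:
        if not self.healthy:
            raise RuntimeError("no healthy backends")
        backend = self.healthy[self._next % len(self.healthy)]
        self._next += 1
        return backend
```

Because every backend is re-checked on each pass, a server that recovers rejoins the rotation on the next heartbeat round.
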
00:12:22.240 Now we have a solid setup where we can lose any one of our app servers. However, we must also consider what happens if we lose the load balancer itself. In our example, if the load balancer is the single point of failure (SPOF), the impact could be significant.
00:12:43.360 Chasing down single points of failure is a vital exercise when architecting a highly available system. You might solve an SPOF at one layer only for it to crop up at another layer; addressing them is a continual effort.
00:13:05.120 So we need to tackle the issue of the load balancer. Recall when we translated the github.com domain name to an IP address; DNS doesn’t just respond with a single IP, but usually with a handful. This design allows GitHub to use multiple endpoints to service requests.
00:13:23.920 Each of these could have its own load balancer and, utilizing the heartbeat pattern we discussed, they could evaluate their health and be removed from DNS responses if necessary. This appears to be a great solution.
00:13:39.920 However, the rub is that DNS is heavily cached across multiple layers. Clients may hold onto IP addresses that have been removed from the pool for longer than we expect, creating a potential risk.
00:13:56.320 In our example, if one of the IPs has failed and was removed from DNS, the browser might still have cached the old value, potentially causing issues. Therefore, we must consider going below the IP layer to better manage this situation.
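
The caching problem can be illustrated with a toy resolver that remembers answers for a fixed time to live (TTL). Everything here is an assumption made for illustration: the `CachingResolver` class, the injected `lookup` and `clock` callables, and the single cache layer (real DNS has several).

```python
import time
from typing import Callable, Dict, List, Tuple


class CachingResolver:
    """Toy resolver that caches answers until their TTL expires."""

    def __init__(
        self,
        lookup: Callable[[str], List[str]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.lookup = lookup
        self.ttl = ttl_seconds
        self.clock = clock
        self._cache: Dict[str, Tuple[float, List[str]]] = {}

    def resolve(self, name: str) -> List[str]:
        now = self.clock()
        entry = self._cache.get(name)
        if entry is not None and now < entry[0]:
            # Still within the TTL: return the cached answer, even if the
            # authoritative records have changed in the meantime.
            return entry[1]
        addresses = list(self.lookup(name))
        self._cache[name] = (now + self.ttl, addresses)
        return addresses
```

Until the TTL runs out, a client keeps using an address that may already have been pulled from the pool, which is the risk described above.
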
00:14:13.200 We need one IP address held by multiple machines, which presents its own challenges. However, we can achieve redundancy through a process called clustering.
00:14:21.840 Clustering occurs when a group of machines collaborates to ensure a service is provided. In this setup, it’s sufficient for just one of those machines to maintain the IP address.
00:14:37.920 Technologies that facilitate this include algorithms like Paxos and Raft, as well as software like Pacemaker. I envision a cluster as a scene from a gangster movie, where three thugs are sitting around deciding who will take care of that IP address.
00:14:56.800 While this metaphor may not reflect reality, the concept hinges on a constant chatter about who is okay and who can handle the IP address. They must work closely to determine their status as it is often impossible for a single server to know whether it can’t reach another due to its own failure or that of a peer.
00:15:17.600 The act of transferring a service from one server to another is called failover. In a situation like this, we often encounter automatic failover, which does not rely on a human to intervene.
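
To give a feel for automatic failover, here is a deliberately naive sketch of deciding which node should hold the shared IP. It is not Paxos, Raft, or Pacemaker: it assumes every node already agrees on who is alive, which is exactly the hard part those tools solve. The function name and the lowest-name-wins rule are hypothetical.

```python
from typing import Dict, Optional


def elect_ip_holder(alive: Dict[str, bool]) -> Optional[str]:
    """Return the node that should hold the shared IP, or None without quorum."""
    healthy = sorted(name for name, ok in alive.items() if ok)
    quorum = len(alive) // 2 + 1  # a strict majority of the whole cluster
    if len(healthy) < quorum:
        # Without a majority, nobody claims the IP. This avoids two isolated
        # groups each believing they should serve traffic.
        return None
    return healthy[0]
```

When the current holder is marked as failed, re-running the election hands the IP to the next healthy node with no human involved.
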
00:15:42.240 A quick note: clusters of load balancers can synchronize their TCP state tables, allowing for switchovers between load balancers while maintaining ongoing requests. It’s amazing to observe how seamlessly they operate without interrupting service.
00:15:59.920 Even so, failover that drops zero requests is achievable because packet retransmission is built into TCP. When failover occurs, a few packets may get dropped, but TCP ensures that unacknowledged packets are retransmitted.
00:16:11.920 In practice, this means that at the HTTP layer, everything continues to function smoothly. Timeouts and retries are crucial from the perspective of abstraction.
00:16:31.680 Ideally, a client could always make a request and trust that any failure would be handled downstream. However, in the real world, especially with HTTP services, this isn't always the case.
00:16:47.040 I suspect that the lack of properly set timeouts is one of the leading causes of outages in modern web applications. Thus, properly considering these can yield significant advantages.
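
One common way to apply timeouts and retries in client code is a small wrapper like the sketch below. The function name, the default values, and the exponential backoff policy are assumptions for illustration; the talk does not prescribe specific settings.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[float], T],
    attempts: int = 3,
    timeout_seconds: float = 2.0,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(timeout_seconds), retrying on timeout or connection errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            # The operation is expected to enforce the timeout it is given.
            return operation(timeout_seconds)
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # Exponential backoff: wait longer after each failed attempt.
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

Retrying is only safe when the operation is idempotent, so a wrapper like this fits reads better than, say, payment submissions.
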
00:17:09.600 Next, I want to discuss the elephant in the room: managing state. The toughest class of services to make fault tolerant are those that handle state, such as databases.
00:17:24.560 The challenge with redundancy and stateful services is that you need the entire system to agree on a single source of truth. The notion of a ‘single source’ leads to potential single points of failure, a common issue in this context.
00:17:46.560 The trade-offs in attempting to make a data store fault tolerant are described by the CAP theorem, which relates consistency, availability, and partition tolerance. Simply put, you can achieve two of the three.
00:18:07.840 Different data storage models occupy various positions on this spectrum, but all involve some form of replication. Replication is the process of continuously shipping data from one location to another, thereby ensuring redundancy.
00:18:28.560 Replication can be synchronous, where data must be confirmed in multiple places before it’s considered written, or asynchronous, where a write is acknowledged before it’s copied.
00:18:49.440 This means synchronous replication sacrifices performance to ensure consistency, while asynchronous replication improves performance but does so at the potential cost of consistency.
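
The difference between the two modes can be shown with an in-memory toy model. The `Primary` and `Replica` classes are hypothetical and ignore networks, failures, and ordering; the sketch only shows when a write is acknowledged relative to when it is copied.

```python
from typing import Dict, List, Tuple


class Replica:
    def __init__(self) -> None:
        self.data: Dict[str, str] = {}

    def apply(self, key: str, value: str) -> None:
        self.data[key] = value


class Primary:
    def __init__(self, replicas: List[Replica], synchronous: bool) -> None:
        self.data: Dict[str, str] = {}
        self.replicas = replicas
        self.synchronous = synchronous
        self.pending: List[Tuple[str, str]] = []  # writes not yet shipped

    def write(self, key: str, value: str) -> None:
        """Returning from this method is the acknowledgement to the client."""
        self.data[key] = value
        if self.synchronous:
            # Synchronous: every replica has the write before we acknowledge.
            for replica in self.replicas:
                replica.apply(key, value)
        else:
            # Asynchronous: acknowledge now, copy later.
            self.pending.append((key, value))

    def ship_pending(self) -> None:
        for key, value in self.pending:
            for replica in self.replicas:
                replica.apply(key, value)
        self.pending.clear()
```

In the asynchronous case, a primary that dies before `ship_pending` runs takes acknowledged writes with it, which is the consistency cost mentioned above.
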
00:19:07.120 Once replication is taking place, other primitives, such as clustering, automatic failover, and read load balancing, can be added to allow databases to handle various failures.
00:19:27.920 There are certainly tools available for managing state in a highly available manner, but this aspect is the most challenging part of any infrastructure.
00:19:48.560 The final primitive of high availability I wish to mention is monitoring. Monitoring is a vast subject with conferences dedicated to it, but for this talk, I’d like to highlight two significant points.
00:20:06.360 First, failure doesn’t always mean outright breakdown; it can simply arise from a server hitting a reasonable limit, such as a disk becoming full. Proper monitoring can help identify resource depletion before it leads to system failure.
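
A disk-full check is one of the simplest monitors to write. The sketch below uses Python's standard `shutil.disk_usage`; the function names and the 80% and 95% thresholds are illustrative assumptions, not recommendations from the talk.

```python
import shutil


def disk_fraction_used(path: str = ".") -> float:
    """Return the fraction (0.0 to 1.0) of the disk holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def disk_alert(fraction_used: float, warn_at: float = 0.8, critical_at: float = 0.95) -> str:
    """Classify disk usage so an alert can fire before the disk is actually full."""
    if fraction_used >= critical_at:
        return "critical"
    if fraction_used >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(disk_alert(disk_fraction_used(".")))
```

The warning level is what matters most: it gives someone time to act while the server is still healthy.
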
00:20:20.720 A second key aspect of monitoring is that it’s your best resource for understanding failures you didn’t foresee. Observing metrics like response times and system load can provide insights into potential failures.
00:20:37.840 In conclusion, if I could sum up my advice for designing highly available fault-tolerant systems in one question, it would be: What happens when it fails? Ask this about every component of your system, from app servers to caches and external APIs.
00:20:54.560 If you make it a habit to regularly ask this question about the systems you work on, you will ultimately develop more fault-tolerant systems.
00:21:11.600 High availability should always be a priority when making technology decisions. Remember that while there will always be trade-offs, options better suited for fault tolerance will exist.
00:21:24.560 Highlighting these considerations early on can prevent you from becoming cornered with an unstable system due to previous technology choices that neglected high availability.
00:21:42.320 I also believe that in due time, we application developers will progress to the point where we won’t need to concern ourselves with the complexities of creating highly available fault-tolerant systems.
00:22:00.320 We'll be able to lift the abstraction level of our tools to the point where using them becomes as straightforward as accessing GitHub, with the HashiCorp logo prominent in the background.
00:22:13.120 Now is a good time for me to mention that I have HashiCorp stickers, so feel free to find me if you’d like one. In the meantime, I believe high availability should always be a consideration because chaos is powerful and unpredictable.
00:22:35.760 We possess a variety of tools to manage it, and I’d like to add perseverance to this list—a word I have trouble spelling. Every instance of this term you've seen from me has been corrected through spellcheck.
00:23:01.280 Perseverance—a dedication to learn from failure—is what has led humanity to where we are today. We make mistakes, learn from them, and ultimately improve. It’s a process that, while not without its challenges, has served us well.
00:23:22.960 Thank you very much for your time. I look forward to hanging out with you all.