00:00:08.960
Hello, my name is Paul, and I'm from HashiCorp. Here is my intro slide.
00:00:23.119
This is a picture of me from my first week at my job at HashiCorp. I like this picture because you can clearly see all the layers of emotion: disbelief, excitement, and ultimately fear. It's all in the eyes.
00:00:38.800
You may recognize some of HashiCorp's products. We make Vagrant, Packer, Serf, Consul, and Terraform. I'd like to start today by talking about abstraction.
00:00:56.480
In my opinion, abstraction is the single most important concept in computer science. It's a principle that underlies so much of modern technology.
00:01:06.640
So what is abstraction? In the talk summary, I called it 'strategic lying,' which is a bit unfair. For today, let's define it as the use of simplifying metaphors to manage complexity.
00:01:17.600
The word comes from the Latin 'abstractus,' which is the past participle of 'abstrahere,' meaning to draw away. This aligns with our usage in computer science, where it’s about directing focus away from complexity.
00:01:29.519
The use of abstraction has certainly enabled us to build some remarkable wonders, case in point: the internet. To illustrate this concept, let's quickly go through one of my favorite exercises: breaking down some layers of abstraction underlying a single web request.
00:01:42.320
Let’s say you pull up GitHub in your browser. Most of the time, that’s how you think of it—just pulling up GitHub. However, there's a lot happening behind the scenes when you do that, all in such a way that you don't have to think about it. So for the next few minutes, let’s consider it.
00:02:01.200
Inside your browser, a complicated layout engine determines the position, layering, shape, font, color, and behavior of each element on the page. The layout engine knows what to do by consuming large blobs of text that dictate how to render the content. This text comes in formats like HTML, CSS, and JavaScript, along with assets like images and fonts.
00:02:22.800
So we know that a browser needs blobs of text to function. But where does that text come from? The browser asks GitHub, which means we first have to get GitHub's address.
00:02:39.599
The Domain Name System (DNS) is vast and intricate, an abstraction layer we will explore later. For now, suffice it to say, the browser asks DNS to look up the IP address of github.com and gets an answer back.
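As a rough sketch of that lookup step, here is how a program can ask the system resolver (which in turn speaks DNS) for a host's addresses. The helper name `resolve` is just for illustration:

```python
import socket

def resolve(hostname):
    """Ask the system resolver for a host's IPv4 addresses."""
    infos = socket.getaddrinfo(hostname, 443,
                               family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, (ip, port));
    # keep the unique IPs in the order the resolver returned them.
    seen, ips = set(), []
    for *_, (ip, _port) in infos:
        if ip not in seen:
            seen.add(ip)
            ips.append(ip)
    return ips

# resolve("github.com") would return one or more of GitHub's public IPs.
```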
00:02:58.959
Now that we know GitHub's address, how do we ask it to send us the blobs of text we need? That's done through a simple text-based protocol called HTTP, which most of you likely understand.
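To make "simple text-based protocol" concrete, here is the request text itself, built by hand. The `build_get_request` helper is a hypothetical name for illustration:

```python
def build_get_request(host, path="/"):
    """Construct the plain-text HTTP/1.1 request a browser would send."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",  # a blank line ends the header block
        "",
    ]
    return "\r\n".join(lines)

print(build_get_request("github.com"))
```

The server's reply is text in the same spirit: a status line, headers, a blank line, then the body, which is exactly the blob of HTML the layout engine wanted.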
00:03:16.000
We know how we’re going to structure the conversation with GitHub to get our blobs of text, but how do we actually get messages back and forth between our browser and GitHub? For that, we need a connection, and that's where TCP comes in.
00:03:28.640
TCP establishes a connection between two nodes on the internet, allowing them to exchange text. It's essentially a two-way pipeline for slinging messages back and forth. But the browser isn't directly connected to GitHub’s servers.
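That "two-way pipeline" can be sketched with a loopback socket pair, one end standing in for the browser and the other for GitHub's server:

```python
import socket
import threading

def echo_server(sock):
    conn, _addr = sock.accept()
    with conn:
        data = conn.recv(1024)           # read text off the pipe
        conn.sendall(b"echo: " + data)   # write text back the other way

# TCP gives two connected endpoints a bidirectional byte stream.
server = socket.socket()
server.bind(("127.0.0.1", 0))            # any free port on loopback
server.listen(1)
threading.Thread(target=echo_server, args=(server,), daemon=True).start()

client = socket.create_connection(server.getsockname())
client.sendall(b"hello")
reply = client.recv(1024)
client.close()
server.close()
print(reply)  # b'echo: hello'
```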
00:03:44.640
If I’m doing this from Chicago, for example, there’s a significant distance to cover to establish that connection. This is where the Internet Protocol (IP) comes into play.
00:04:01.040
To see the path from Chicago to San Francisco, you can ask a tool like MTR, which displays all the hops between one computer and another. In my case, that’s 13 hops across eight different possible servers.
00:04:12.000
The first two IPs will be my router and my ISP's router in Chicago, followed by two hops in the D.C. area, four in Tennessee, and then several difficult-to-pinpoint nodes somewhere in Colorado and Utah before finally arriving in San Francisco.
00:04:25.760
So how did that path get chosen? Each hop along the route has a router that receives a packet with a portion of our message and the target IP of GitHub’s server. That router uses its routing table to determine which upstream connection is best to send the packet.
00:04:42.560
The router makes a decision and forwards the packet to the next router, which follows the same process. This operation is repeated 13 times until the packet reaches its destination. But how do those routers keep their routing tables updated to know what the best next hop is?
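The routing-table decision at each hop boils down to a longest-prefix match. The prefixes and next-hop names below are made up for illustration; a real core router holds hundreds of thousands of entries:

```python
import ipaddress

# A toy routing table: prefix -> next hop.
table = {
    ipaddress.ip_network("0.0.0.0/0"): "isp-uplink",        # default route
    ipaddress.ip_network("140.82.112.0/20"): "peer-a",
    ipaddress.ip_network("140.82.121.0/24"): "peer-b",
}

def next_hop(dest):
    """Pick the most specific (longest) matching prefix."""
    addr = ipaddress.ip_address(dest)
    matches = [net for net in table if addr in net]
    return table[max(matches, key=lambda net: net.prefixlen)]

print(next_hop("140.82.121.4"))   # "peer-b": the /24 beats the /20
print(next_hop("8.8.8.8"))        # "isp-uplink": falls through to default
```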
00:05:00.240
They use something called Border Gateway Protocol (BGP). I find BGP fascinating, and it's hard for me not to turn the rest of this talk into a BGP seminar. For today, let's just mention that BGP lets routers exchange reachability information with each other.
00:05:09.000
The major concepts include announce, which is the ability to route to a network that was previously unreachable; withdraw, which indicates that a network is no longer accessible; and update, which signifies that the path to a network has changed. The core routers of the internet process thousands of these messages per second, allowing the internet as a whole to route messages reliably.
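As a toy model of those three message types (real BGP sessions carry far more state, such as AS paths and route policy), here is how they act on a routing table:

```python
# Toy model of a router's table reacting to BGP-style messages.
# Message names mirror the talk: announce, withdraw, update.
def apply_message(table, msg):
    kind, prefix, *rest = msg
    if kind in ("announce", "update"):
        table[prefix] = rest[0]       # a new or changed path to the network
    elif kind == "withdraw":
        table.pop(prefix, None)       # the network is no longer reachable
    return table

table = {}
apply_message(table, ("announce", "140.82.112.0/20", ["as1", "as36459"]))
apply_message(table, ("update",   "140.82.112.0/20", ["as2", "as36459"]))
apply_message(table, ("withdraw", "140.82.112.0/20"))
print(table)  # {}
```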
00:05:27.840
These dynamics mean that even though the system is constantly in flux, it remains capable of reliably routing packets. This allows a logical connection to be established between two locations for exchanging blobs of text, which can then be rendered by a browser’s layout engine. To you, it’s simply pulling up GitHub.
00:06:11.520
So, that’s abstraction in action on the internet. The internet is cool; I think it’s awesome how it works. I could talk about this all day long, and while I may have shoehorned it into this talk, I promise it’s relevant.
00:06:28.320
Now, let’s discuss high availability. I’ll start by explaining what it is not. It’s incredible to me that a top-tier company like Chase would take extensive maintenance windows, seeming to go down for six hours every weekend.
00:06:42.480
I hope the explanation is financial and not technical, as otherwise it seems like they just gave up. High availability is essentially about fault tolerance: the ability of your system to avoid, detect, and minimize the risk of failures.
00:07:03.200
This is a nearly impossible task, because it means facing chaos head-on. One reason I find it such an engaging challenge is that successful high availability requires tackling uncertainty.
00:07:18.560
How do we begin tackling chaos? The first step, like in many problem-solving approaches, is to admit that you have a problem. Once we acknowledge that things are going to fail, we can start thinking about why they are likely to fail.
00:07:40.800
Once we identify what is likely to fail, we can begin to take steps to anticipate and recover from those failures. That’s essentially the process for constructing highly available systems: anticipate failure, prepare for failure, and then react when failure occurs.
00:08:05.760
This can also be framed as asking yourself what could fail in your system and then deciding what actions to take based on the knowledge that it will eventually fail.
00:08:26.720
After something fails, analyze what you anticipated correctly, what you didn’t, how your preparations performed, and what changes to implement for future failures. That will be the structure of each primitive we look at today.
00:08:46.880
We’ll discuss anticipated failures, preparations made for those failures, and the reactions taken when things go wrong. You’ll see that each of these primitives exists within an abstraction, so the rest of the system doesn’t have to handle the details of preparing for failure.
00:09:04.320
Let’s start with redundancy. Redundancy is probably the simplest form of fault tolerance. If you know something might fail, just keep several of them on hand. However, in a fault-tolerant system, this means more than just having a closet full of spare parts.
00:09:23.920
When we implement redundancy within a system, we use abstraction to enable the rest of the system to treat multiple redundant copies as a single entity. For instance, when we think about hardware failures, we must consider what happens when a physical component fails.
00:09:41.840
The modern server serves as a great case study in redundancy, incorporating features designed to mitigate common component failures. This includes using RAID to treat a group of disks as a single disk, allowing any of them to fail without a complete system failure.
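The parity trick behind RAID levels like RAID 5 can be shown in miniature: XOR the data blocks together, and any single lost block is recoverable from the survivors plus the parity block:

```python
# RAID-style parity in miniature.
def parity(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

d0, d1, d2 = b"abcd", b"efgh", b"ijkl"
p = parity([d0, d1, d2])

# The "disk" holding d1 fails; rebuild it from the rest plus parity.
rebuilt = parity([d0, d2, p])
print(rebuilt == d1)  # True
```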
00:09:56.960
It also utilizes link aggregation for network connections and can include dual or even multiple power supplies. This means that in a modern server, you can literally disconnect any cable or remove any hard drive, and everything keeps working without a hiccup. In fact, it might just beep and send an email notification.
00:10:13.760
As application developers, we need to recognize that though hardware engineers are doing their best to prevent servers from failing due to component issues, we must still expect that servers will fail regardless.
00:10:29.520
A significant part of designing a fault-tolerant system involves asking what happens if a server fails—this applies to both physical servers and cloud infrastructure. Therefore, to implement redundancy at the server layer, we need to allow the network to treat a group of servers performing the same task as a single server.
00:10:48.320
Several technologies in the internet stack can help achieve this. One example is a transparent proxy, which listens on a single IP address and directs incoming messages to multiple servers behind it. Those servers then process the messages as if they originate from the proxy, ultimately allowing redundancy and improving reliability.
00:11:08.520
In addition to providing simple redundancy, the proxy also enables another critical benefit: load balancing. Proxies that meet this role are often referred to as load balancers, allowing the overall system to handle more traffic than a single server could alone.
00:11:40.480
When considering high availability, understanding the impact of server loss within a load balancing cluster is essential. For instance, if we plan on losing any one of three app servers, we need to ensure each can handle 50% of the incoming traffic.
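That headroom calculation generalizes: if any one of n servers may fail, each survivor must be able to absorb its share of the full load, total_load / (n - 1):

```python
# Capacity planning for losing one server out of n.
def required_capacity_per_server(total_load, n):
    """Load each server must be able to handle if one of n fails."""
    return total_load / (n - 1)

print(required_capacity_per_server(100, 3))  # 50.0: the 3-server example
print(required_capacity_per_server(100, 5))  # 25.0: more servers, less headroom each
```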
00:12:00.960
To determine whether a load balancer can route traffic to a given backend server, it often makes what are called heartbeat requests. These small periodic requests expect a specific response. If a server fails to respond correctly or at all, the proxy removes it from the rotation.
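A heartbeat loop reduces to "probe each backend; keep only the ones that answer correctly." The `probe` callable below stands in for the small periodic HTTP or TCP request a real load balancer would make:

```python
# Heartbeat sketch: filter the backend pool by health.
def healthy_backends(backends, probe):
    pool = []
    for backend in backends:
        try:
            if probe(backend) == "ok":    # the expected response
                pool.append(backend)
        except Exception:
            pass                          # no answer at all: drop from rotation
    return pool

responses = {"app1": "ok", "app2": "error 500"}
probe = lambda b: responses[b]            # "app3" raises: unreachable
print(healthy_backends(["app1", "app2", "app3"], probe))  # ['app1']
```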
00:12:22.240
Now we have a solid setup where we can lose any one of our app servers. However, we must also consider what happens if we lose the load balancer itself. In our example, if the load balancer is the single point of failure (SPOF), the impact could be significant.
00:12:43.360
Chasing down single points of failure is a vital exercise when architecting a highly available system. You might solve an SPOF at one layer only for it to crop up at another layer; addressing them is a continual effort.
00:13:05.120
So we need to tackle the issue of the load balancer. Recall when we translated the github.com domain name to an IP address; it doesn’t just respond with a single IP, but usually with a handful. This design allows GitHub to use multiple endpoints to service requests.
00:13:23.920
Each of these could have its own load balancer and, utilizing the heartbeat pattern we discussed, they could evaluate their health and be removed from DNS responses if necessary. This appears to be a great solution.
00:13:39.920
However, the rub is that DNS is heavily cached across multiple layers. Clients may hold onto IP addresses that have been removed from the pool for longer than we expect, creating a potential risk.
00:13:56.320
In our example, if one of the IPs has failed and was removed from DNS, the browser might still have cached the old value, potentially causing issues. Therefore, we must consider going below the IP layer to better manage this situation.
00:14:13.200
We need one IP address held by multiple machines, which presents its own challenges. However, we can achieve redundancy through a process called clustering.
00:14:21.840
Clustering occurs when a group of machines collaborates to ensure a service is provided. In this setup, it’s sufficient for just one of those machines to maintain the IP address.
00:14:37.920
Technologies that facilitate this include algorithms like Paxos and Raft, as well as software like Pacemaker. I envision a cluster as a scene from a gangster movie, where three thugs are sitting around deciding who will take care of that IP address.
00:14:56.800
While this metaphor may not reflect reality, the concept hinges on a constant chatter about who is okay and who can handle the IP address. They must work closely to determine their status as it is often impossible for a single server to know whether it can’t reach another due to its own failure or that of a peer.
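As a toy illustration of that chatter (real Paxos or Raft involve terms, logs, and multiple message rounds), a strict-majority vote decides who holds the IP, so an isolated or failed node can never claim it alone:

```python
# Toy quorum vote: who gets to hold the shared IP address?
def elect_holder(votes):
    """votes maps each node to the candidate it currently considers healthy."""
    tally = {}
    for candidate in votes.values():
        tally[candidate] = tally.get(candidate, 0) + 1
    winner, count = max(tally.items(), key=lambda kv: kv[1])
    # A strict majority is required; otherwise nobody takes the IP.
    return winner if count > len(votes) / 2 else None

print(elect_holder({"a": "a", "b": "a", "c": "c"}))  # 'a': 2 of 3 votes
print(elect_holder({"a": "a", "b": "b"}))            # None: no majority
```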
00:15:17.600
The act of transferring a service from one server to another is called failover. In a situation like this, we often encounter automatic failover, which does not rely on a human to intervene.
00:15:42.240
A quick note: clusters of load balancers can synchronize their TCP state tables, allowing for switchovers between load balancers while maintaining ongoing requests. It’s amazing to observe how seamlessly they operate without interrupting service.
00:15:59.920
That zero-request-dropping failover is only achievable because packet retransmission is built into TCP. When failover occurs, a few packets may get dropped, but TCP ensures the remote side will retransmit them.
00:16:11.920
In practice, this means that at the HTTP layer, everything continues to function smoothly. Timeouts and retries are crucial from the perspective of abstraction.
00:16:31.680
Ideally, a client could always make a request and trust that any failure would be handled downstream. However, in the real world, especially with HTTP services, this isn't always the case.
00:16:47.040
I suspect that the lack of properly set timeouts is one of the leading causes of outages in modern web applications. Thus, properly considering these can yield significant advantages.
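A minimal client-side pattern, with illustrative names: bound every attempt with a timeout and retry a few times before surfacing the failure:

```python
import time

# Retry wrapper sketch; in real code you would catch specific errors
# and likely add exponential backoff with jitter.
def with_retries(operation, attempts=3, delay=0.01):
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as err:
            last_error = err
            time.sleep(delay)    # brief pause before retrying
    raise last_error

calls = {"n": 0}
def flaky():
    """Simulates a service that times out twice, then answers."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("no response")
    return "response"

print(with_retries(flaky))  # "response", after two failed attempts
```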
00:17:09.600
Next, I want to discuss the elephant in the room: managing state. The toughest class of services to make fault tolerant are those that handle state, such as databases.
00:17:24.560
The challenge with redundancy and stateful services is that you need the entire system to agree on a single source of truth. The notion of a ‘single source’ leads to potential single points of failure, a common issue in this context.
00:17:46.560
The trade-offs in attempting to make a data store fault tolerant are described by the CAP theorem, which covers consistency, availability, and partition tolerance. The usual shorthand is that you can have at most two of the three.
00:18:07.840
Different data storage models occupy various positions on this spectrum, but all involve some form of replication. Replication is the process of continuously shipping data from one location to another, thereby ensuring redundancy.
00:18:28.560
Replication can be synchronous, where data must be confirmed in multiple places before it’s considered written, or asynchronous, where a write is acknowledged before it’s copied.
00:18:49.440
This means synchronous replication sacrifices performance to ensure consistency, while asynchronous replication improves performance but does so at the potential cost of consistency.
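The difference is purely about when the acknowledgement happens, which a sketch makes plain. Function names here are illustrative, not any real database's API:

```python
def write_sync(primary, replicas, record):
    primary.append(record)
    for replica in replicas:
        replica.append(record)      # block until every copy has the record
    return "ack"                    # acknowledged only once all copies exist

def write_async(primary, replicas, record):
    primary.append(record)
    pending = [(replica, record) for replica in replicas]
    return "ack", pending           # acknowledge now; ship the copies later

primary, replica = [], []
write_sync(primary, [replica], "row1")
print(replica)  # ['row1']: the replica is consistent before the ack returns
```

With `write_async`, the ack returns while `pending` is still unshipped, which is exactly the window where a failure can lose the write.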
00:19:07.120
Once replication is taking place, other primitives, such as clustering, automatic failover, and read load balancing, can be added to allow databases to handle various failures.
00:19:27.920
There are certainly tools available for managing state in a highly available manner, but this aspect is the most challenging part of any infrastructure.
00:19:48.560
The final primitive of high availability I wish to mention is monitoring. Monitoring is a vast subject with conferences dedicated to it, but for this talk, I’d like to highlight two significant points.
00:20:06.360
First, failure doesn’t always mean outright breakdown; it can simply arise from a server hitting a reasonable limit, such as a disk becoming full. Proper monitoring can help identify resource depletion before it leads to system failure.
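For example, a resource-exhaustion check for disk space takes only a few lines with Python's standard library; the 90% threshold here is an arbitrary choice:

```python
import shutil

# Warn before a filesystem actually fills up and takes the service with it.
def disk_alert(path, threshold=0.9):
    usage = shutil.disk_usage(path)
    fraction_used = usage.used / usage.total
    return fraction_used >= threshold

if disk_alert("/", threshold=0.9):
    print("disk nearly full: page someone before it becomes an outage")
```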
00:20:20.720
A second key aspect of monitoring is that it’s your best resource for understanding failures you didn’t foresee. Observing metrics like response times and system load can provide insights into potential failures.
00:20:37.840
In conclusion, if I could sum up my advice for designing highly available fault-tolerant systems in one question, it would be: What happens when it fails? Ask this about every component of your system, from app servers to caches and external APIs.
00:20:54.560
If you make it a habit to regularly ask this question about the systems you work on, you will ultimately develop more fault-tolerant systems.
00:21:11.600
High availability should always be a priority when making technology decisions. There will always be trade-offs, but some options are better suited to fault tolerance than others.
00:21:24.560
Weighing these considerations early on can prevent you from being cornered into a fragile system by earlier technology choices that neglected high availability.
00:21:42.320
I also believe that in due time, we application developers will progress to the point where we won’t need to concern ourselves with the complexities of creating highly available fault-tolerant systems.
00:22:00.320
We'll be able to lift the abstraction level of our tools to the point where using them becomes as straightforward as accessing GitHub, with the HashiCorp logo prominent in the background.
00:22:13.120
Now is a good time for me to mention that I have HashiCorp stickers, so feel free to find me if you’d like one. In the meantime, I believe high availability should always be a consideration because chaos is powerful and unpredictable.
00:22:35.760
We possess a variety of tools to manage it, and I’d like to add perseverance to this list—a word I have trouble spelling. Every instance of this term you've seen from me has been corrected through spellcheck.
00:23:01.280
Perseverance—a dedication to learn from failure—is what has led humanity to where we are today. We make mistakes, learn from them, and ultimately improve. It’s a process that, while not without its challenges, has served us well.
00:23:22.960
Thank you very much for your time. I look forward to hanging out with you all.