Beauty and the Beast: your application and distributed systems

00:00:16.470 Hello, everyone! My name is Emily Stolfo, and I'm going to talk about distributed systems and your code. This talk is called "Beauty and the Beast." I am a Ruby engineer at Elastic, where I use Elasticsearch. It's great to see so many familiar faces here.

00:00:34.590 If you're aware of Elastic, the company has a number of products. In this talk, I will discuss how we have very complex ecosystems in our applications. Sometimes, you might have Elasticsearch running in the background and be using it without even realizing it, alongside other databases, servers, and services.

00:00:53.010 I work on the clients team, which writes client code that interacts with Elasticsearch. There are different clients available for the various Elastic products, but my focus is on Elasticsearch. I'm also an adjunct faculty member at Columbia University, where I've taught classes on web development, specifically Ruby on Rails, and most recently New SQL databases. Last spring, I gave an overview of different New SQL databases and their use cases.

00:01:16.320 I’ve lived in Berlin for about five years, originally hailing from New York. I love the city! I’m a maintainer of several Ruby gems that relate to Elasticsearch. These gems can be categorized into two sections: one that deals with using pure Ruby with Elasticsearch, and another that integrates Elasticsearch into Rails applications.

00:01:46.680 Before joining Elastic last June, I worked at MongoDB for six years, where I co-wrote and maintained the Ruby driver. My experience has given me significant insight into how developers interface with distributed systems. Over these seven years, I've encountered numerous questions and bugs regarding the libraries I've maintained. I've realized that with a deeper theoretical understanding of distributed systems, including their behavior, protocols, and algorithms, developers can write more clever code, which ultimately leads to fewer bugs.

00:02:34.220 This brings me to my friend Bear, an application developer in Berlin. Like many others, he uses distributed systems in his applications but hasn’t stopped to consider what a distributed system truly is or the algorithms and protocols involved. Because Bear never took the time to understand distributed systems, he simplistically believes they are straightforward.

00:03:09.330 This belief is reflected in his coding, where he often writes simple operations. For example, Bear connects to a client library and pings the distributed system. If he gets a positive response, he assumes he can proceed with an operation, such as inserting data. This, however, is where the complexity of distributed systems comes into play.

00:04:07.910 For instance, if Bear performs an operation right after getting a positive response, an election might occur in the distributed system, electing a new primary server that now must handle write operations. If he attempts to perform a write action during such a transition, he might receive an error indicating that there is no primary server, which showcases how distributed systems can operate independently and unexpectedly.

00:05:00.499 Bear’s misunderstanding stems from the elegant designs of distributed systems that obscure their inherent complexities. On the surface, they appear simple, making it easy for developers like Bear to become complacent. However, as I mentioned earlier, applications today are intricate—composed of numerous services and databases, sometimes up to 10,000 APIs, which we might not even realize involve Elasticsearch.

00:05:48.860 This talk is divided into two main sections: The first will cover the 'Beast'—distributed systems theory—where we will explore consensus, synchronization, and related concepts. Some of these principles may remind you of computer science courses you’ve taken but may not have engaged with in years. For those without formal education in this field, some information may be brand new. We won’t, however, delve into security since it's a complex topic deserving its own discussion.

00:06:31.849 The second section—the 'Beauty'—will look into Bear's code, addressing errors he encounters and how he can implement retry mechanisms effectively. To connect it to the fairy tale, while the Beast may seem unsightly at first, love prevails, ultimately exposing beauty beneath the surface.

00:07:27.440 So, what exactly defines a distributed system? The textbook definition is a system composed of autonomous computers that collaborate to present the appearance of a unified whole. It focuses on middleware—the algorithms, protocols, and communication methods that integrate various components into a cohesive unit.

00:08:25.570 I use the metaphor of 'Cerberus'—the three-headed dog from mythology—to illustrate a distributed system. Each head represents an independent unit, yet they collectively serve a singular purpose, demonstrating how distributed systems can host numerous components while functioning as a single entity.

00:09:03.800 One key characteristic of distributed systems is that communication among components is often invisible; users do not need to know how these systems interact internally. This hidden organization is crucial for scalability—adding more nodes should theoretically not affect performance, much like how Cerberus can grow more heads without losing efficiency.

00:09:49.540 Furthermore, faults within the system are also hidden from the user. This is essential, as users often remain oblivious to the underlying issues that can arise due to failures in a distributed system.

00:10:05.690 Now, let’s get a bit philosophical. This is my favorite slide of the presentation because it summarizes several hard truths: Nothing is free, nothing is reliable, and nothing is secure. These assumptions must be recognized; if they are neglected while developing code for a distributed system, failure is inevitable.

00:10:51.220 In the realm of distributed systems, consensus and synchronization play pivotal roles. Consensus pertains to communication and reaching an agreement among various nodes, while synchronization involves ensuring that all nodes have the same information at the same time. When addressing consensus, it's helpful to examine specific code examples.

00:11:42.600 For instance, in Elasticsearch, there’s a feature called 'shard' that divides indices among nodes. When operations are executed, it's important that they are effectively acknowledged by the shards involved, as failing to do so leads to issues across the entire system.

00:12:25.300 More formally, consensus is utilized to ensure overall system reliability despite faulty processes. Its applications in programming can include choosing a leader, managing transactions in databases, or synchronizing actions like clock synchronization or load balancing—especially relevant in distributed systems.

00:13:15.140 Understanding synchronization in distributed systems introduces the concept of wall clock time. In common usage, this is how we measure time on our devices, but in a distributed context, ensuring that all nodes have synchronized clocks is challenging. Instead of wall clock time, researchers prioritize the order of events via logical clocks.

00:14:36.940 Additionally, election algorithms are employed to designate roles among nodes in a distributed system. For example, the Raft algorithm used by MongoDB offers users automatic electing of leader nodes, allowing systems to adapt over time with minimal disruption to services.

00:15:06.610 Next, let’s examine reliability—essential for any distributed system—to ensure that data exists in multiple locations. Replication serves to enhance reliability, but it also leads to challenges surrounding consistency. Even when multiple copies of data exist, they might not always reflect the same state due to replication lag.

00:15:57.500 To handle these consistency challenges, developers need to categorize their expectations and make decisions accordingly.

00:16:00.230 Different models of consistency exist, such as causal consistency, which requires a cause-effect relationship between operations. For instance, if one operation precedes another in time, reading the results must reflect that order. MongoDB provides specific constructs for maintaining such causal consistency in client sessions.

00:16:55.180 Then, there’s the concept of eventual consistency—essentially stating changes will happen at some undetermined future point, removing the requirement for immediate consistency. In scenarios like social media, where updates don’t require instantaneous visibility for all users, eventual consistency suffices.

00:17:43.900 When considering client-centric consistency, it allows the application to offer guarantees regarding user experience, catering to the specific needs of your application. This aspect is critical for understanding how to present operational characteristics to individual users.

00:18:20.240 As our discussion progresses into fault tolerance—which relates back to our Cerberus analogy—it's essential to maintain functionality even when parts of the distributed system may fail. To manage this, it’s crucial to categorize faults as transient, intermittent, or permanent, each requiring different management strategies.

00:19:06.830 Implementing redundancy helps mitigate the impacts of various faults, allowing your system to utilize backup nodes seamlessly when necessary. Recovery strategies also play a crucial role here, determining whether your system will revert to a previous state or dynamically adapt to a new functional state.

00:19:38.450 As Bear realizes the complexity of distributed systems, he must shift his focus towards the beauty of structured code. We’ll move on to addressing specific error scenarios he might encounter while programming. There are two primary categories of errors: network errors and operational errors.

00:20:09.680 Network errors can manifest as fleeting, transient 'blips,' which may occur either on the way to the server or back from it, often making it unclear if the server received the initial request. To demonstrate this, we can refer to an instance where the server returns an error, but one must discern its source correctly.

00:21:24.310 Operational errors, on the other hand, indicate that a connection was made, but the operation failed due to specific reasons, such as authorization issues or a missing primary node in a replica set.

00:22:04.930 Understanding these distinctions in error types enables Bear to develop retry strategies based on the context of each error. For example, if he experiences a network blip, a straightforward retry might be sufficient, whereas managing an operational failure may require more careful consideration.

00:22:57.830 When implementing retry logic, it is essential to balance the operation's idempotency. An idempotent operation can be executed multiple times without altering the outcome—crucial for safely retrying operations while maintaining data correctness. We'll discuss how to construct such idempotent operations effectively.

00:23:56.330 For example, in MongoDB, a unique operation can ensure that repeated requests for the same action don’t lead to duplication. By leveraging unique identifiers in their code, Bear could guarantee that no matter how many times an operation is retried, his application would resist redundancy.

00:24:36.870 Similarly, Elasticsearch offers optimistic locking, where every document comes with an associated version, allowing clients to match and validate the current state with their operations, minimizing issues when retrying failed requests.

00:25:27.050 As we explore these solutions, it’s important for Bear to understand how to utilize gems effectively. A well-designed client library encapsulates a range of options to manage errors and implement retries, making the development process smoother.

00:26:23.530 Bear’s familiarity with his libraries should include expectations for network blips, primary node availability, and handling different types of failures. Understanding the features those libraries offer will empower him to write more adaptive and reliable code.

00:27:25.260 Testing is another critical aspect for Bear to consider. Testing code against a single server on localhost falls short of assessing the full capabilities of his application. He needs to work in environments that reflect how an application operates in distributed systems.

00:28:18.080 Employing a network in his testing paradigms will simulate congestion and help him understand operational behaviors better. As a client author, Bear should acknowledge that his code will interact with distributed systems beyond the simplistic framework of localhost.

00:29:04.080 Finally, as Bear begins to grasp these principles, he reaches out about issues he's encountering with Redis. His inquiries indicate an evolving thought process—he's considering whether to use mutexes to manage potential race conditions. This illustrates that he is thinking more deeply about synchronization in distributed environments.

00:30:02.150 In our discussion, I prompted Bear to consider aspects of Redis design, querying behaviors, and nuances of write operations, emphasizing the importance of thorough analysis in diagnosing issues within distributed systems.

00:31:23.500 With these reflections, I look forward to Bear's future inquiries, as he continues internalizing distributed systems theory.

Beauty and the Beast: your application and distributed systems

Key Points Discussed:

Main Takeaways: