Beauty and the Beast: your application and distributed systems

00:00:12.099 Welcome back to "Beauty and the Beast: Your Code and Distributed Systems" by Emily Stolfo. Emily works at Elastic, maintaining their Ruby client and Rails integrations project. She is also an adjunct faculty member at Columbia University and has a passion for long-distance running and making sourdough bread. Originally from New York, she is currently based in Berlin. Let's welcome her to the stage!

00:00:40.370 Hi, my name is Emily and I'm going to give a talk called "Beauty and the Beast: Your Code and Distributed Systems." I'm a Ruby engineer at Elastic, where we develop products like Elasticsearch, Logstash, and APM. I actually work on the clients team, which focuses on coding libraries used for interfacing with the Elasticsearch server itself.

00:00:53.360 Prior to my current role, I worked at MongoDB for six years on their clients team, giving me substantial experience with distributed systems. The reasons I wanted to give this talk are twofold. First, I've spent nearly seven years interacting with an open-source community where everyone can see and use my code, which mostly comes in the form of gems that interface with distributed systems.

00:01:17.230 In resolving issues and communicating with this community, I observed that application developers could write more clever and intelligent code when armed with a better understanding of the theory behind distributed systems. This knowledge can help them anticipate the errors they may encounter and write more resilient code. The second reason for this talk is that applications are becoming increasingly complex, often broken into many different components and resembling distributed systems themselves.

00:01:46.790 With an extra layer of knowledge surrounding distributed systems, people can learn to write applications more coherently and intelligently. In addition to my role at Elastic, I'm also an adjunct faculty member at Columbia University where I've taught courses on web development, primarily focused on Ruby on Rails.

00:02:10.050 Most recently, I conducted a class on NoSQL databases, reviewing and comparing different types of new SQL databases, analyzing use cases, and discussing which scenarios might be more appropriate for one database over another.

00:02:34.300 As I mentioned, I originally hail from New York but I've been residing in Berlin for the past five years, where I truly love living. I maintain various gems that make up the pure Ruby offering for interfacing with Elasticsearch, as well as the integration with Rails.

00:02:54.500 Let me ask, who here uses MongoDB? Great to see! As I said, I previously worked with MongoDB for a long time, co-writing the Ruby driver, and now I'm glad to be part of the elastic community. Today, we will dive into how to better understand and manage distributed systems.

00:03:11.540 Allow me to introduce my friend, Bear. He is an application developer from Berlin who fits the profile I just described. His application is highly complex and interfaces with both MongoDB and Elasticsearch. With additional knowledge about the theory of distributed systems, Bear can develop a more robust and resilient application.

00:03:46.150 Though Bear thinks distributed systems are simple, they actually present some complex challenges. Let's take a look at an example of his code. Here we have a simple database object, which is retrieved from one of the gems I maintain.

00:04:12.030 This DB object represents the core of the Elasticsearch gem and will allow Bear to ping the database. If he receives a positive response, he assumes the database is available and proceeds with an insert operation. However, this perspective is too simplistic; distributed systems tend to be more complex due to their autonomous nature.

00:04:56.600 Between the time the ping is sent and when Bear executes the write command, various events can occur—an unexpected election for a new primary might begin, or the network could fail, leading to an unresponsive database. Any of these scenarios could potentially produce an error instead of a confirmation that the operation was successful.

00:05:45.300 Thus, while Bear's initial impression of distributed systems might be that they are simple, understanding their inherent complexity is crucial. This knowledge allows Bear to write simpler code capable of anticipating the distributed system's behavior.

00:06:17.250 The complexity not only lies within the systems themselves but also in the applications interacting with them. Bear's application shares some characteristics with distributed systems; it operates like a coordinated chaos, continuously sending and receiving requests across numerous APIs.

00:06:39.000 In this talk, we will focus on education targeted towards Bear. The first part will cover the fundamentals of distributed systems theory, and the second part will analyze Bear's code and apply his new theoretical knowledge to produce more resilient software. We'll define what a distributed system is and explore certain concepts.

00:07:07.500 Ordinarily, we define a distributed system as a set of autonomous computers that collaborate to present the appearance of a single coherent system. Relevant to this discussion, it is often regarded as middleware—software that creates a communication bridge between discrete components, which we refer to as nodes or servers.

00:07:22.490 In a distributed system, these elements must exhibit coordinated behaviors and agree on protocols or algorithms, which come together to form a unified entity. I often liken it to Cerberus, the mythic creature with multiple heads; although it is a single beast, it consists of many components that should cooperate effectively. If one head is lost, the remaining heads can still maintain a functional unit.

00:08:00.670 Moreover, communication among the various parts of a distributed system must be invisible to the user, so their experience remains seamless and coherent. The internal organization of these components, along with their interactions, should remain opaque to the user; understanding these complexities shouldn't burden the end-users or even developers.

00:08:29.840 A distributed system should also be smooth to scale—able to grow from a few components to hundreds without overwhelming the user. Regarding fault tolerance, a distributed system must exhibit the capacity for some components to fail without disrupting the users' experience. If one part of the system goes down, the remaining components must be able to step in and carry on functioning.

00:09:06.120 So, one of the key takeaways here is that nothing is reliable, nothing is free, and nothing is secure. Making such false assumptions can lead to the failure of applications interfacing with distributed systems. Consistency is vital, and application developers must take great care to ensure that their code accounts for potential failures or errors.

00:09:41.330 Now, let's dive into some concepts, the first of which is consensus. This refers to the collective agreement within a system, allowing components to return to a known good state after an error. In distributed systems, consensus enables individual processes to agree on a value or action, thereby improving the overall reliability of the system.

00:10:05.910 Using Elasticsearch as an example, if a node goes down, you could still achieve a consensus on which nodes are active. When Bear makes a request, the system can discern the state of its nodes and adapt accordingly to maintain consistency and reliability.

00:10:50.600 In practical terms, consensus might mean coordinating transactions and ensuring that all nodes reflect the same information. The distributed system must reach an agreement, whether for committing a transaction or electing a new leader in a multi-node setting. In the real world, the implications of failure are widespread—from clock synchronization to smart grid control.

00:11:24.640 Next, we discuss synchronization. This is crucial because it permits nodes within a distributed system to agree on event ordering. Time is not merely a metric based on physics but a complex concept that affects distributed systems' operations. Keeping different components on the same ‘wall clock’ time can be difficult, so researchers focus on logical clocks instead.

00:12:20.320 Logical clocks facilitate the ordering of events in distributed systems, disregarding absolute time values, such as those reflected on devices. By utilizing logical clocks, different nodes can maintain an understanding of when operations occurred while ensuring that processes are executed in the right order, which is fundamental to distributed system performance.

00:12:57.250 Roles within distributed systems are determined by election algorithms. Various algorithms exist for determining which node is the leader, such as the Raft algorithm or the Bully algorithm. Understanding how these elections resolve issues of leader unavailability and data consistency is essential for Bear, as it will help him interpret errors when making requests. Should there be a network partition, election algorithms guide the system through temporary failures.

00:13:48.290 This ties back to Bear's code—if he pings a node and it is no longer the leader due to an election, the client should accurately report that error back to him. He should be aware of how retries and error handling work in order to maintain a solid application performance.

00:14:25.060 In terms of system reliability, replication is essential. Writing data to one component and propagating that change across the system helps maintain consistency, but it also introduces challenges. Bear needs to consider how out-of-date or stale a piece of data can be, as well as how quickly the consensus model can restore consistency after an error or failure.

00:15:02.270 To manage consistency, various models exist, which can provide guarantees around data state. Causal consistency is one of these models, ensuring that operations maintain a defined order. For example, if Process 1 writes a value and Process 2 reads it, the system guarantees that Process 3 will see all operations in the correct sequence.

00:15:54.540 Understanding client-centric consistency is another key concept. This allows one client to observe a certain sequence of operations regardless of changes in other clients' states. Bear can leverage this understanding to write safer and more resilient applications.

00:16:30.290 Finally, let’s talk about fault tolerance, which permits systems to recover from server failures. By identifying fault types, like transient, intermittent, and permanent faults, Bear can create strategies that employ redundancy to maintain high availability of the components in a distributed system, ensuring that if one node goes down, another can step in.

00:17:20.570 With the theory of distributed systems laid out, Bear can take these concepts back to his application code and improve its resilience. He must understand—and account for—the various types of errors he may encounter. With increased awareness of the complexity inherent in distributed systems, he can write more intelligent and robust applications.

00:17:55.720 Errors can be classified as either network errors or operational errors. Network errors may come from interruptions during requests while operational errors can arise from interacting with databases improperly, like writing data to a non-primary node. Bear needs a clear strategy for error handling; network issues could be transient blips or complete outages.

00:18:36.740 Moreover, Bear must acknowledge the importance of distinguishing between these error types. For instance, in cases of authorization issues, simply retrying would be futile. Depending on the situation, he can outline a plan to gracefully handle errors—potentially retrying with limits for network failures while easily abandoning operational errors.

00:19:21.200 This leads us to idempotency: operations where executing the same action multiple times will yield the same result. For Bear, ensuring that certain operations are idempotent simplifies the design of retry logic. For example, in MongoDB, he can create a unique value to ensure that subsequent tries don't inadvertently modify data more than once.

00:20:06.480 To reinforce this knowledge, we can discuss an example using Elasticsearch. Bear can also leverage versioning to ensure that when an operation is being retried, the updates are based on accurate information—preempting conflicts during concurrent operations.

00:20:49.490 Exploring the APIs of the gems he utilizes is equally vital. Bear should ensure he understands how the client libraries interact during network disturbances, what errors are returned, and how these clients behave when components in the distributed system fail.

00:21:34.930 When working with client libraries, Bear will have the advantage of having their internal logic handle certain behavior, streamlining his own code. For instance, he can analyze options such as how frequently the client pings the server to ensure it's operational.

00:22:12.940 With these insights and preparations, Bear’s testing practices become paramount. Testing against a single server instance isn’t adequate. It’s vital for him to create tests that mimic distributed system behavior, including network latency and transient faults to gauge how resilient his application truly is.

00:22:55.710 These tests serve to validate the application against conditions typical of production environments. By employing strategies such as simulating network outages in various scenarios, Bear can comprehend potential failures and develop effective strategies for application reliability.

00:23:39.820 Bear, while feeling overwhelmed, should recognize that this wealth of information empowers him to enhance his application. He no longer needs to resort to emergency learning but stands prepared to confidently tackle potential issues in distributed systems.

00:24:20.780 Throughout the session, the question of race conditions has arisen, leading Bear to contemplate a mutex for ensuring operation order. The important takeaway here is to leverage what he has learned about distributed systems and analyze how his application interacts in a distributed environment.

00:25:03.370 Ultimately, Bear can transform his understanding into practical solutions that employ resilient error handling and avoid unnecessary complications. To anyone with questions regarding distributed systems and their applications, I encourage you to reach out.

00:25:51.340 Thank you all for your time. If there are any questions, I’d love to discuss them, including opportunities with Elastic!

00:26:44.719 In closing, during this talk, we covered several aspects of distributed systems theory and their implications for application developers. Understanding complex concepts like consensus and fault tolerance, as well as visualizing scenarios like distributed networks' failures, are essential for creating resilient applications in today's tech landscape.