Talks

We All Build Distributed Systems

wroc_love.rb 2017

00:00:11.550 Hi, I'm Maciej, and before I start the core of my talk, let me check one thing. I'm not sure if I've chosen the right topic.
00:00:17.490 The organizers accepted it, but maybe they were tricked. Generally, on the internet, when people write popular blog posts about distributed systems, there are three main topics. The first one is, of course, distributed databases.
00:00:30.270 The second is deployment in the cloud — Kubernetes, Heroku, and so on. The last one is microservices. Of course, we all want microservices.
00:00:42.859 Let me ask a very important question: How many of you have ever built a database engine? Not used, but built. Oh, one brave person! Another question: How many of you have built tools for large-scale cloud deployment? Just one person? Okay, easier question: How many of you have ever deployed at least two microservices in one application?
00:01:02.490 Oh, that's better! But this talk is not about databases, nor about the cloud, and not about microservices. So what is this talk really about? Why would I be telling you about distributed systems? I claim that all of us create distributed systems, even if we don't fall into any popular category.
00:01:30.120 Before I answer that, let me share a bit about myself. I'm a Ruby developer at TextMaster, where we provide professional translation services on the web. Besides being a Ruby developer, I'm also greatly interested in methodologies for creating software that is meaningful for end users.
00:01:58.950 I enjoy reading about distributed systems, and this talk is the result of years of reading and thinking about the topic. I never thought it would actually be applicable to my daily work, but I believe I should share what I've learned. I'm also involved in the Ruby user group in Wrocław. If you happen to be around, just ping us; maybe there will be a meeting, and it would be great if you could join us.
00:02:23.370 Additionally, I give some courses and lectures at Wrocław University of Technology, but it is more of a hobby for me rather than a main occupation. So, that’s a bit about me. Now let's discuss why you should care about this talk. I will demonstrate what features of distributed systems are important to me and share a case study that illustrates how I stumbled upon this topic.
00:02:55.380 So, what is a distributed system? A professor named Tanenbaum provided a definition, saying it is a system that works on multiple computers but, for the user, appears to work as if it were on a single computer. That is a really good definition. Another expert, Leslie Lamport, explained how distributed systems do not work. We had a perfect opportunity to check his definition during the AWS downtime, which many people, not just those in IT, learned about.
00:03:37.950 Many popular applications, such as Trello and Docker Hub, rely on AWS. When it fails, many regular applications may fail as well. So that is what distributed systems are, but as we found out, probably no one thinks they build them. Who builds web applications? A few of you! That's good. Web applications are simple; you can build one in 15 minutes in Rails using a scaffold and managing simple requests and responses.
00:04:04.769 But we are now a bit wiser. After building that first basic application, we added more complexity to the frontend with Redux and various data stores. This complexity can make the errors that appear in web applications seem unusual and much more complex than we initially thought. Now, what do these two kinds of applications have in common?
00:04:56.160 Let’s refer back to the definition of a distributed system: a set of computers that appears to work as one. Think about modern web and mobile applications: consider Facebook. Do you care if you use Facebook on the web or on mobile? Or if you use Trello on the web or on mobile? To you, it is a single system. What matters is the overall user experience and that you can use it effectively.
00:05:23.370 I realized that we can describe modern web and mobile applications as distributed systems simply because, for the end user, it doesn't matter which client they use. They just want a consistent experience wherever they access the application. You might ask, what’s the point of all this? Maybe this is just an academic exercise, proving that web apps are distributed systems.
00:05:54.840 I’ll discuss why this is important and what I have learned. I will focus on three guarantees that may hold in distributed systems: consistency, availability, and partition tolerance. Let’s start with consistency, which means that all data in all nodes appears to be a single copy of the data.
00:06:34.390 Consider a very simple notes application, which we will call Ruby Notes. If we write a note on the Android client, we should see the same note on our computer. If we write 'five' on the Android device, we should always read 'five' on any other device. However, there is an important disclaimer: consistency here does not imply the strict validation of data; it is more about serialization and maintaining a single copy.
00:07:11.300 The next guarantee is availability, meaning that every node that works responds meaningfully. This means that if I write 'five,' it should respond with data, not an error message. In mobile applications, if a client, for example, is on a phone, the app should provide all features effectively even when the network is down.
00:07:37.200 The last guarantee is called partition tolerance, which indicates that, because the system operates in a distributed environment, the network may lose packets, and in extreme scenarios, a client can go offline. In such cases, if the network drops all packets, the client does not know if the note on the other side is down or if the network itself is down.
00:08:13.730 It is important to remember that while consistency and availability are features of our applications, partition tolerance is the truth of life. It's an inherent property of the network we use. Thus, we can discuss consistency and availability much more than we can about partitions themselves.
00:08:51.960 If we take all three of these guarantees, we can agree that they are all useful. It would be ideal to have all three of them, but it might not be possible. In 2000, a theorist named Eric Brewer proposed a conjecture suggesting that in distributed systems, we cannot have all three guarantees simultaneously.
00:09:39.370 Let’s consider a simple scenario: switch off the network on your device and try writing something on Facebook. There are only two possible outcomes. Either the client rejects your message to preserve consistency, or it accepts your message to maintain availability, which means the system becomes inconsistent.
00:10:02.390 If we only have those three features and need to choose two, we have three choices: choose consistency and availability, ignoring partition tolerance; or choose consistency and partition tolerance, but at the cost of availability; or finally choose availability and partition tolerance, but ignore consistency.
00:10:31.660 Choosing consistency and availability seems like the perfect choice, but ignoring the reality of partitions — the unreliable nature of networks — means that this approach can only work effectively in a single-host environment. This might work only if your mobile app has no backend or if you’re writing a browser application without network communication.
00:11:05.280 To provide an example of what can happen if you make that choice in a distributed environment: let’s say we write a note in our Ruby Notes application through a web form, and before we quit, we hit Save, but the network is down. What happens? We see an error message in the browser, and if it isn't modern, our data could be lost. This is a clear example of how disregarding partitions can result in data loss.
00:11:52.450 So, let’s try a different approach: drop availability to maintain consistency. In this case, our application will only work fully when the network is available. If there are reconnections, it should retrieve all previously created data. This is convenient because data remains in order, with no inconsistencies, which is easier for developers.
00:12:39.910 However, the downside is that users can’t use the application when there’s no network. Alternatively, if we prioritize availability and partition tolerance, the app can be used all the time. Users can write notes even without network coverage, leading to a good user experience; however, synchronization is required when the network is restored, and that can be challenging.
00:13:18.069 These choices are often seen in real-time domains. For example, in finance, one might think that financial operations should be inherently consistent. However, banks generally choose availability because it is more profitable. Consider ATM machines that operated offline because it was advantageous for banks. Users were able to access funds at all times, which made applications profitable.
00:14:04.660 To summarize, we cannot choose all three guarantees: consistency, availability, and partition tolerance. We need to select two of them, and because the first choice is often not viable due to network partitions, the decision typically comes down to a business choice between consistency and availability. This is where the client's need to work offline may drive the design of the application.
00:14:26.759 What I learned from this is that these choices are not made for the entire application but can vary based on specific use cases. Let me show you a case study from a recycling company for which we developed an application. They had two types of users: admins in the back office and mobile agents called buyers who visited scrap yards.
00:14:56.789 These buyers visited scrap yards to buy scrap metal and other commodities, which were later brought back to the primary base for recycling and sale. Most operations took place at scrap yards with little access to the Internet. The buyers arrived at the scrap yard with an Android tablet running our application to collect commodities, pay for them, and record operations.
00:15:26.630 Having to work offline was a challenge, and that’s where we started to recognize some significant limitations as we were used to crafting applications that favored consistency over availability. In the following sections of my talk, I will show you some cases of synchronization that we developed while working on this project.
00:15:58.450 There is a very important disclaimer regarding this project: when I say 'we,' I mean the entire development team. We weren’t just one company working on the project. We also worked collaboratively. I intend to share the knowledge we collectively gained rather than just highlight what I personally developed.
00:16:35.640 Regarding the types of synchronization, we initially tackled single-directional synchronization. When we shifted to two-way synchronization while dealing with concurrent edits, the complexity increased significantly. The initial approach involved fetching all required data, and before I describe the first case, let’s discuss the domain.
00:17:05.780 The commodities were bought based on prices that were updated weekly, with prices generated each Monday. When we began development, the client assured us it would be a small dataset; therefore, we decided to fetch all data. Initially, everything worked well, and we were happy with our progress.
00:17:36.420 However, after some time, as the client gained more confidence in our software, they began generating significantly more data. Eventually, the application started to struggle under the load. We experienced timeouts, particularly with our hosting provider, Heroku. To cope, we implemented more aggressive caching when there were no issues with Heroku.
00:18:08.110 However, in the meantime, the Android client began hanging and crashing due to memory limitations. This prompted us to realize that we needed to find better solutions, and pagination became necessary when data exceeded acceptable limits. Our first approach to pagination resembled a conventional web application where we just decided how to start.
00:18:37.640 As we made subsequent requests to fetch one page after another, it worked reasonably well for web applications. However, when synchronizing all data, gaps between requests can lead to issues: if an entity was removed, pages could shift, causing us to skip entities and disrupt synchronization.
00:19:23.090 If, for example, a buyer arrives at the scrapyard and tries to make a purchase without valid pricing, their day could be significantly affected. The solution we developed allowed the mobile client to decide the timestamps for the pages to avoid missing data. However, this also did not resolve the critical peak load we faced on Mondays.
00:19:58.000 So, we decided to mix these two approaches. The client would send in the timestamp when the data synchronization should begin and include the desired page size, ensuring that no data was missed. The client would not face congestion, gaining more control over how much data they could process.
00:20:36.590 The next synchronization topic revolves around what happens when the client buys products at the scrapyard and makes payments. It is vital that admins also see this data. There are instances where synchronization needs to be bidirectional. For instance, if a buyer works from one tablet, submits data, the tablet crashes, and they switch to another tablet for another purchase, they need to ensure all necessary data is available.
00:21:21.130 We initially handled this by fetching the timestamp of the last synchronization, identifying all changes made since then, and sending the changes to the server. After receiving a response, the server would also send its changes to the client.
00:22:03.010 This method ensured that both client and server maintained their respective datasets without any conflicts arising from changing timestamps. We needed to maintain a synchronization process to filter these changes properly, ensuring that data from both ends was not lost or overridden.
00:22:48.630 However, when we began testing this synchronization, we encountered problems due to the identical dataset being sent back and forth. Data was overridden on both sides, leading to some losses. To remedy this, we decided to record status changes as separate events, which allowed us to synchronize changes individually.
00:23:11.120 This approach enabled us to retain complete status records while ensuring that changes could exist separately from the main dataset. Although this may seem straightforward, it reflects a necessary step in synchronization practices.
00:23:56.320 Yet, as clients frequently require more features, they often introduce push notifications and real-time updates. This added complexity impacts synchronization, especially as user expectations increase.
00:24:32.650 Let’s consider the importance of retransmission as a final synchronization challenge. Imagine when a client sends data to the server, and just before the server can respond, the network drops. That’s where retransmission becomes crucial. If we rely solely on the conventional scaffolded controller design, the data could be recorded multiple times, leading to errors in processing.
00:25:15.970 To deal with this, we began to identify each payment uniquely on the client side, ensuring that our synchronization remained orderly. This strategy allows clients to resubmit their data confidently without worrying about duplicating transactions.
00:26:04.970 After several months working on these synchronization processes, I learned to ask myself if our synchronization method is idempotent before releasing it. It saved us from significant issues down the line.
00:27:14.630 Ultimately, I have learned four critical patterns regarding data synchronization. When dealing with large datasets, pagination is essential, allowing clients to dictate the size and boundaries of the data retrieval. Next, it’s beneficial to distinguish between sent and received data during synchronization.
00:27:59.770 Moreover, it is easier to synchronize immutable events than mutable states—this significantly aids in making the synchronization process idempotent while identifying changes on the client side.
00:28:34.160 In conclusion, I set out to convince you that we all build distributed systems, and I hope I have shown how some principles of distributed system design are significant to us. As developers, we encounter unreliable networks, and it is essential to understand the delicate balance between availability and consistency.
00:29:29.050 What is good to remember is that even though we must make choices regarding these aspects, these choices are not set in stone. Initially, many of our client applications may lean towards consistency, but as new features and requests arise, we can migrate toward availability.
00:30:12.460 It’s crucial to synchronize what is immutable to prevent overwriting of critical data. Overall, effective synchronization practices can prevent many pitfalls and ensure a smoother user experience.
00:30:59.568 Lastly, I'd like to note that most of you have likely encountered many of these concepts before. Some of you may have already learned about the CAP theorem or had experiences with data synchronization, perhaps the hard way in production.
00:31:41.020 This talk emphasizes the significance of principles from theoretical computer science that are applicable in our everyday work. Theoretical proofs and guidelines aren't just academic concepts—they can be directly relevant to how we develop applications.
00:32:28.430 Thank you for listening. Remember that investing time studying theoretical principles can yield significant benefits in your practical work. Many problems have been solved in other realms of computer science, and it is always valuable to apply those solutions where they fit.
00:34:23.860 You.