Distributed Systems: Your Only Guarantee Is Inconsistency

Talks

Anthony Zacharakis

3 talks

Distributed Systems: Your Only Guarantee Is Inconsistency

by Anthony Zacharakis

In this presentation titled "Distributed Systems: Your Only Guarantee Is Inconsistency" at Euruko 2017, Anthony Zacharakis discusses the complexities of distributed systems, particularly focusing on the inconsistencies that can arise within data during such transitions. Drawing from his experience at DigitalOcean, he illustrates the importance of understanding these inconsistencies utilizing billing processing as a relatable example.

Key Points Discussed:
- Introduction to Distributed Systems: Zacharakis stresses the growing significance of this topic, especially with the rise of microservices and service-oriented architectures.
- Importance of Month-Close Processing: He explains the month-close pipeline in billing, detailing the actions taken for each customer, like generating invoices and charging users. The increasing scale leads to challenges maintaining this process.
- Challenges of Synchronous Processing: The current reliance on a single class for several tasks presents issues, particularly with performance and handling failures due to external dependencies, which complicates maintenance.
- Role of Background Workers: The presentation highlights the advantages of using background workers to improve workflow. Jobs can be queued, prioritized, and retried automatically, enabling more efficient handling of heavy tasks. Zacharakis notably mentions Sidekiq as a preferred tool in this context.
- Practical Example of Job Handling: He outlines how to implement Sidekiq by developing a separate job class, which can run expensive tasks asynchronously, reducing processing time significantly.
- Dealing with Idempotency Issues: Zacharakis explains the necessity to address idempotency, which is the challenge of ensuring that if a job fails, it can safely retry without causing duplicate charges or errors.
- Unforeseen Bugs and Adjustments: The transition to asynchronous processing led to unexpected issues such as duplicate payments, demonstrating the need to adjust existing assumptions about timing and process flows.
- Acceptance of Dynamic Environments: He encourages acknowledging that conditions may change and emphasizes designing systems to adapt and handle these changes gracefully, thus preparing for a self-healing system.

Conclusions and Takeaways:
- The key lesson is the essential nature of adapting to inherent inconsistencies within distributed systems. By understanding and planning for unexpected behaviors, systems can be designed to handle failures and changes more effectively. Zacharakis underlines that systems must be built with the knowledge that the environment is dynamic, encouraging a shift from synchronous to asynchronous thinking, while fostering resilience within the architecture.

00:00:14.930 Hi everybody, thanks for coming. I know it's getting late in the day, and it can be hard to concentrate. But if you make it through this, you'll get the coffee break, so look forward to that! I'm Anthony, and I'm here to talk to you about distributed systems, specifically the inconsistencies inherent in your data when you transition to one of these systems. With all the buzz around microservices and service-oriented architectures today, this topic is becoming more and more important.

00:00:26.780 I work at DigitalOcean, which is a cloud provider that offers networking, storage, and server solutions, among other things. If you're an email marketing company with Black Friday around the corner and you need something between a hotfix and a long-term code solution, this could be really interesting for you. We provide infrastructure as a service, so keep us in mind. More importantly, I work on the billing team, focusing on everything related to payment processing, invoicing, and billing. That's what I want to talk about today.

00:00:53.500 One of the most important processes in our billing system is our month-close processing. This is where we tally up all the services our customers have used and bill them accordingly. After that, we take any necessary actions. Although my talk focuses on this billing example because it's relatable, the concepts transcend and can apply to any situation where you have a pipeline of steps to perform, and you're looking for ways to scale and make it more distributed.

00:01:22.880 Our end-of-month pipeline looks like this: we have a list of actions we perform for each user, such as generating invoices, charging them, emailing them, placing account holds on people who don't pay, and generating reports for our internal users. There are many tasks to manage in this pipeline, and as we scale up, it becomes increasingly difficult to sustain. Many actions here don't need to be performed in a strict order; there are some loose dependencies, but we aim to transition to a more parallel system.

00:01:59.900 Here's a slimmed-down version of our actual month-close class: we have a service class called MonthClose, and we can perform these actions for all users. However, there are various problems with doing everything in one class. Many of these actions are expensive; for example, generating invoices and determining what a user has used takes time and resources. Some actions have external dependencies, such as charging a user or emailing them, which involves third-party services. Additionally, the complexity of our logic makes it hard to maintain this class.

00:02:32.200 We also encounter issues with idempotency—what happens if a job fails halfway through? If one of our third-party services is down and fails, what occurs when we rerun the job? Will we recharge the user accidentally? It’s difficult to predict these unintended side effects in our current setup. This situation makes it hard to refactor because touching anything can be risky due to many implicit assumptions.

00:03:11.030 This is where background workers become incredibly useful. They provide several benefits for our workflow. You can persist jobs in SQL or Redis. Once you queue a job into a worker, you know it’s stored and will run at some point. We can prioritize job queues to adjust how quickly different tasks are processed. For example, if you have a lower-priority job—like a report—you can set it to run at your convenience. You can also schedule jobs for immediate execution or delayed start, which grants flexibility.

00:04:03.240 A significant advantage of using background workers is their automatic retry feature. If a worker fails because of a network issue, it will automatically rerun when the issue resolves. You don’t have to intervene—it simply retries for you. Additionally, batching allows you to run a group of workers and execute a callback when they all finish, whether successfully or with errors. At DigitalOcean, we heavily rely on this system, as demonstrated by our Sidekiq dashboard, which shows we process billions of workers, reflecting possibly a year’s worth of data.

00:04:44.539 While we use Sidekiq, there are other background processing tools available. Options include Delayed Job for Active Record and Resque, which are also effective. If these tools interest you, I suggest exploring them, as the Ruby community offers many fantastic choices. Now I want to illustrate what using a background worker looks like in practice.

00:05:49.610 On the left, we might have our expensive jobs, which can be costly to run. Normally, you would initialize and run the expensive method directly. However, you really just need to add another class, like an example for Sidekiq. You would add an ‘ExpensiveJobWorker’ class and include the Sidekiq worker, setting the necessary options. For instance, let’s place it in a high-priority queue and define a perform method that triggers our expensive job class. This adjustment allows us to run the job in the background seamlessly.

00:06:57.290 For instance, we might set it to process within ten minutes. It’s just as easy to accomplish, and you could also run more complex tasks on a recurring schedule, such as daily processes. One helpful gem for this is the Whenever gem, which uses a DSL that lets you define cron jobs to run daily, hourly, or at other intervals. Our month-close worker operates this way, running monthly for all users, allowing us to keep the logic in one place in our repo instead of having it scattered across various file definitions.

00:08:05.280 Let's apply this to our example, transforming our previous approach. When generating an invoice, instead of directly processing payments and emails synchronously, we can run those tasks in the background, passing through the amount owed. This change helps us address the idempotency issue that arose earlier. If our payment worker fails due to a payment provider being down, it will simply retry automatically without requiring intervention on our part. We are effectively decoupling the actions, which is a notable benefit.

00:09:11.480 Moreover, we can now scale components separately. Different third-party providers may have varying throttling requirements and SLAs. By setting throttling limits on different workers, we can efficiently manage resource usage and scaling for bottlenecks. The results have been promising. Previously, we executed the month-close process synchronously, taking about 30 minutes per user on average, and requiring up to two days to complete for all users. For finance teams, waiting two days to understand transactions isn't feasible, as they would constantly check on collection status.

00:10:35.600 With the new approach, we’ve reduced processing time to less than 10 minutes per user, and we're able to finish the entire monthly process for all users overnight in less than 8 hours. We just go to bed and wake up to find everything done smoothly. However, this transition came with its own set of unforeseen bugs. For instance, we had users reporting duplicate payments, which led us to investigate.

00:11:54.460 The issue stemmed from the fact that our timing assumptions changed. Previously, once a payment was made, we’d quickly handle notifications and errors. In contrast, this new asynchronous flow meant that we could not guarantee the timing between steps. Jobs would be queued and might take longer to execute than expected, allowing users to receive notifications for multiple payments, leading to confusion. This scenario illustrates how our mental model does not always correspond to reality. We may have designed workflows under the illusion of perfect timing, but the reality is that numerous factors can influence business processes.

00:13:23.480 Our first takeaway is that when you pass information in your system, you make assumptions that reflect the state of the world at that specific time. For example, when we generated the invoice and specified the amount owed, it represented the state of the world at that moment. However, the state may have changed by the time the payment worker executed the task. Similar assumptions can lead to discrepancies when we transition from synchronous to asynchronous processes, resulting in implicit behavioral changes.

00:14:48.140 So, how can we address this? One idea is to immediately run payments right after invoices are generated by utilizing high-priority queues in Sidekiq. However, this can lead to a priority arms race, where every new process tries to outrun previous ones, ultimately complicating the queue. The better approach is to accept that the world changes and build our processes to accommodate this assumption. For our payment worker, we need to check the user's current balance when processing payments. If the current balance doesn't match the expected amount, we need to perform appropriate actions, whether that means charging the user, throwing an error, or doing nothing.

00:16:16.020 This type of logic might feel messy or impure, but it's essential to acknowledge that reality itself is dynamic. By preparing for these changes, we can build systems that not only accommodate errors but can gracefully adapt to changes in the environment. Effectively, we’ll be developing systems capable of self-healing, much like the T-1000 robot from Terminator 2, which could recover from damage efficiently.

00:17:09.440 The next approach involves

See Slides on speakerdeck.com

EuRuKo 2017