How to Ensure Systems Do What We Want and Take Care of Themselves

by Michał Zajączkowski de Mezer

In the presentation titled "How to Ensure Systems Do What We Want and Take Care of Themselves," Michał Zajączkowski de Mezer discusses approaches to designing systems that function reliably and autonomously. Despite the common occurrence of technical challenges, he emphasizes the importance of preparing systems to handle potential issues effectively. The core theme revolves around creating robust systems with self-managing capabilities, utilizing principles that minimize the need for manual intervention in production environments.

Key points discussed in the presentation include:

- Understanding System Components: Systems are composed of various components that need to communicate and process data effectively.

- Abstract Patterns for Components: Using simple abstract patterns is vital when connecting components to enhance design and development practices.

- Message-Passing Concept: The approach of message-passing allows different actors to send and receive messages, facilitating data processing. This concept also prepares systems to deal with inevitable failures.

- Key Terms in System Design: Michał introduces three key terms related to execution modes and delivery semantics that are crucial for understanding system reliability:

  - At Most Once: An action is triggered only once and never retried, so it may not be executed at all.

  - At Least Once: An action is attempted repeatedly until confirmation of success is received, which can result in duplicate actions.

  - Exactly Once: This ideal mode seeks to ensure that a message is processed only once, though it is difficult to implement without a reliable mechanism.

- Importance of Message Reliability: He stresses the significance of not losing messages and implementing retries when issues arise during message handling.

Overall, Michał underscores that system stability and effective failure management require deliberate design choices and an understanding of the inherent risks within system processes. By applying these principles, developers can significantly enhance the resilience and autonomy of their systems, reducing the frequency of reactive fixes in production environments.

00:00:16.199 Okay, hi everyone.

00:00:25.439 My name is Michał Zajączkowski de Mezer, and I'm here from Naguro. My first takeaway is that no matter how much you prepare, you always encounter technical issues.

00:00:32.399 But the team here is amazing, so let's give them some applause.

00:00:44.460 I feel really privileged and honored to be here, to share my thoughts and perspective with you.

00:00:49.620 Especially since my presentation doesn't include a single Ruby line of code; in fact, it doesn't have any code at all. It's language agnostic.

00:00:56.120 Gathering all these thoughts has been very useful to me, and I hope that at least some of you wonderful people will find some useful or fresh insights.

00:01:21.540 Before I start, I would like to give special thanks to Mikhail Bronikowski, who contacted me to give this speech. I wouldn't be here without him.

00:01:34.200 Today, I'm going to discuss how to ensure systems do what we want and take care of themselves.

00:01:41.759 Sounds bold, right? Stability heaven.

00:01:50.159 I would really like to know about your experiences, but in my career as a back-end engineer, I often find myself and my colleagues operating in production.

00:01:55.640 More often than I would like, we are hot-fixing production, manipulating production data, and ensuring everything goes smoothly. This is bad for many reasons.

00:02:15.720 I believe these issues can be avoided by design, at least some of them. What I often see is that people run into trouble because they don't respect or provide certain processing guarantees.

00:02:27.360 Of course, bugs create many issues; I won't give you any silver bullets, but I hope to provide you with a bag of hints and a useful perspective to consider for systems that will lead you toward this goal.

00:02:40.819 Let's start with a broad view. Any systems we build are made up of many components that process data and communicate with each other.

00:02:48.480 Here's my first piece of advice: when connecting these components, use a simple abstract pattern.

00:03:00.540 This will help you design and code components that take care of themselves.

00:03:08.940 We have various backgrounds here, and since these ideas may sound a bit vague, let's drill down into the details.

00:03:24.060 I won’t be saying anything new; this is knowledge gathered from wiser people through many experiences. For the sake of simplicity, I'll use the abstraction of message-passing.

00:03:39.900 In message-passing, various actors send or receive messages and process data in between. A very important thing to remember is that failures can happen at any time.

00:03:45.239 No matter how good your system is, anything can break, and this is the most truthful fact in this world.
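The talk itself contains no code, so the following is only an illustrative sketch of the message-passing abstraction described above. The names `Message`, `Mailbox`, and `run_receiver` are hypothetical, and an in-memory queue stands in for whatever real channel connects two components. The point it tries to show is that the receiver treats a failure while handling a message as a normal outcome to be recorded, not as something that cannot happen.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    id: int
    payload: str


class Mailbox:
    """Hypothetical in-memory channel between a sender and a receiver."""

    def __init__(self):
        self._queue = deque()

    def send(self, message):
        self._queue.append(message)

    def receive(self):
        # Return the oldest message, or None when the mailbox is empty.
        return self._queue.popleft() if self._queue else None


def run_receiver(mailbox, handle):
    """Drain the mailbox, passing each message to handle().

    Returns the list of (message, error) pairs for messages whose
    handling raised, so the caller can decide what to do with them.
    """
    failed = []
    while True:
        message = mailbox.receive()
        if message is None:
            return failed
        try:
            handle(message)
        except Exception as error:  # any step can break at any time
            failed.append((message, error))
```

A handler that raises for one message leaves the others unaffected, and the failed message is returned to the caller instead of silently disappearing.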

00:03:52.380 This message-passing concept can be applied to almost anything. We had great talks before about various communication patterns.

00:04:02.400 We've had discussions about queuing jobs and event sourcing. Many times, I was inspired by previous speakers discussing event-passing.

00:04:16.320 To continue, we need a few more terms to remember. We have three key terms: execution modes, processing guarantees, and delivery semantics.

00:04:31.680 Let’s go through them, and you will see they’re not too difficult. The first one is 'at most once.' This means that whenever I want to do something, I trigger it just once.

00:04:45.860 To really understand what that means, you have to be a bit paranoid about how much you actually know about what you did.

00:05:00.540 If I’m not sure whether I did something, even if I have some signs that suggest I might have done it, I won’t try again.

00:05:06.380 This may mean the action didn't actually happen, and that is the risk. The next execution mode is 'at least once,' which is the other side of being paranoid.

00:05:39.060 If I'm uncertain whether something was completed or executed, I will attempt to do it as many times as needed until I’m sure about it.

00:06:02.700 As you may imagine, the risk here is that an action can potentially happen multiple times, which is also not ideal.
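The difference between the two modes can be sketched in a few lines. This is an illustration under stated assumptions, not something shown in the talk: `FlakyReceiver` is a hypothetical receiver whose behavior follows a fixed script, and `max_attempts` is an added safety bound (the talk describes retrying as many times as needed).

```python
class FlakyReceiver:
    """Hypothetical receiver driven by a fixed script of outcomes.

    Each script entry is one of:
      "ok"       - message processed and confirmation returned
      "lost"     - message never arrived, nothing processed
      "ack_lost" - message processed, but the confirmation never
                   reached the sender
    Once the script is exhausted, every delivery is "ok".
    """

    def __init__(self, script):
        self._script = list(script)
        self.executions = 0

    def deliver(self, message):
        outcome = self._script.pop(0) if self._script else "ok"
        if outcome == "lost":
            return False
        self.executions += 1      # the action really happened
        return outcome == "ok"    # False when the confirmation was lost


def send_at_most_once(receiver, message):
    # Fire once and never retry, whatever the result was.
    receiver.deliver(message)


def send_at_least_once(receiver, message, max_attempts=10):
    # Retry until a confirmation arrives; max_attempts is an assumed bound.
    for attempt in range(1, max_attempts + 1):
        if receiver.deliver(message):
            return attempt
    raise RuntimeError("no confirmation after %d attempts" % max_attempts)


lossy = FlakyReceiver(["lost"])
send_at_most_once(lossy, "charge customer")
print(lossy.executions)   # 0: the action never happened

silent = FlakyReceiver(["ack_lost", "ok"])
send_at_least_once(silent, "charge customer")
print(silent.executions)  # 2: the action happened twice
```

With a lost message, at-most-once leaves the action undone; with a lost confirmation, at-least-once performs it twice. These are the two risks described above.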

00:06:15.479 What we all actually want is 'exactly once.' When a sender sends a message, the intention is to send it once. Similarly, when the receiver gets the message, they want it processed only once.

00:06:38.640 Achieving this is hard without a reliable mechanism to ensure it.

00:06:44.400 But we have a recipe, so let’s look at the first component of the recipe: 'at least once.' What does this mean in practice? It means we need to retry our messages.

00:07:00.540 Messages are crucial because we don't want to lose them. If we send a message and something breaks, we need to retry.

00:07:12.799 We want to ensure we receive confirmation of success. That's great, but what happens if the receiver is in trouble and the sender doesn’t know?
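To make the idea of not losing messages concrete, here is a minimal sketch in which a sender keeps every message until the receiver confirms it. The `Sender` class and the `transport` callable are hypothetical names introduced for illustration, and the in-memory dictionary only stands in for whatever storage a real system would use to survive a restart.

```python
class Sender:
    """Keeps each message until the transport confirms it (a sketch)."""

    def __init__(self, transport):
        # transport(message_id, payload) -> bool is an assumed interface:
        # True means the receiver confirmed success.
        self._transport = transport
        self.pending = {}  # message_id -> payload

    def send(self, message_id, payload):
        # Record the message first, so a failed attempt leaves it
        # queued for a later retry.
        self.pending[message_id] = payload
        self.flush()

    def flush(self):
        """Attempt every pending message once; forget only confirmed ones.

        Returns the number of messages still waiting for confirmation.
        """
        for message_id, payload in list(self.pending.items()):
            try:
                confirmed = self._transport(message_id, payload)
            except ConnectionError:
                confirmed = False  # something broke; keep it for a retry
            if confirmed:
                del self.pending[message_id]
        return len(self.pending)
```

Calling `flush()` again later (for example on a timer) retries whatever is still pending. A message leaves the pending set only after a confirmation, which is what makes this an at-least-once sender.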