How to Ensure Systems Do What We Want and Take Care of Themselves

wroc_love.rb 2022

00:00:16.199 Okay, hi everyone.
00:00:25.439 My name is Michał Zajączkowski de Mezer, and I'm here from Naguro. My first takeaway is that no matter how much you prepare, you always encounter technical issues.
00:00:32.399 But the team here is amazing, so let's give them some applause.
00:00:44.460 I feel really privileged and honored to be here, to share my thoughts and perspective with you.
00:00:49.620 Especially since my presentation doesn't include a single line of Ruby code; in fact, it doesn't have any code at all. It's language agnostic.
00:00:56.120 Gathering all these thoughts has been very useful to me, and I hope that at least some of you wonderful people will find some useful or fresh insights.
00:01:21.540 Before I start, I would like to give special thanks to Mikhail Bronikowski, who contacted me to give this speech. I wouldn't be here without him.
00:01:34.200 Today, I'm going to discuss how to ensure systems do what we want and take care of themselves.
00:01:41.759 Sounds bold, right? Stability heaven.
00:01:50.159 I would really like to know about your experiences, but in my career as a back-end engineer, I often find myself and my colleagues operating in production.
00:01:55.640 More often than I would like, we are hot-fixing production, manipulating production data, and ensuring everything goes smoothly. This is bad for many reasons.
00:02:15.720 I believe at least some of these issues can be avoided by design. What I often see is that people struggle because certain processing guarantees are not provided or not respected.
00:02:27.360 Of course, bugs create many issues; I won't give you any silver bullets, but I hope to provide you with a bag of hints and a useful perspective on system design that will lead you toward this goal.
00:02:40.819 Let's start with a broad view. Any systems we build are made up of many components that process data and communicate with each other.
00:02:48.480 Here's my first piece of advice: when connecting these components, use a simple abstract pattern.
00:03:00.540 This will help you design and code components that take care of themselves.
00:03:08.940 We have various backgrounds here, and these elements may sound a bit vague, so let's drill down into the details.
00:03:24.060 I won’t be saying anything new; this is knowledge gathered from wiser people through many experiences. For the sake of simplicity, I'll use the abstraction of message-passing.
00:03:39.900 In message-passing, various actors send or receive messages and process data in between. A very important thing to remember is that failures can happen at any time.
00:03:45.239 No matter how good your system is, anything can break, and this is the most truthful fact in this world.
00:03:52.380 This message-passing concept can be applied to almost anything. We had great talks before about various communication patterns.
00:04:02.400 We've had discussions about queuing jobs and event sourcing. Many times, I was inspired by previous speakers discussing event-passing.
00:04:16.320 To continue, we need a few more terms to remember. We have three key terms: execution modes, processing guarantees, and delivery semantics.
00:04:31.680 Let’s go through them, and you will see they’re not too difficult. The first one is 'at most once.' This means that whenever I want to do something, I trigger it just once.
00:04:45.860 To really understand what that means, you have to be a bit paranoid about how much you actually know about what you did.
00:05:00.540 If I'm not sure whether I did something, even if I have some traces that suggest I might have done it, I won't try again.
00:05:06.380 This may mean the action didn't actually happen, and that is the risk. The next execution mode is 'at least once,' which is the other side of being paranoid.
00:05:39.060 If I'm uncertain whether something was completed or executed, I will attempt to do it as many times as needed until I’m sure about it.
00:06:02.700 As you may imagine, the risk here is that an action can potentially happen multiple times, which is also not ideal.
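The two execution modes described so far can be contrasted in a small Python sketch. This is only an illustration under stated assumptions: `FlakyReceiver`, `send_at_most_once`, and `send_at_least_once` are hypothetical names that do not come from the talk, failures are scripted so the run is deterministic, and the retry loop is capped at `max_attempts` rather than running "as many times as needed".

```python
from dataclasses import dataclass, field


@dataclass
class FlakyReceiver:
    """Hypothetical receiver with scripted failures (an assumption for illustration).

    failures maps a 1-based call number to "request_lost" (the message never
    arrives) or "ack_lost" (the message is processed, but the confirmation
    never reaches the sender). The sender cannot tell the two apart.
    """
    failures: dict
    processed: list = field(default_factory=list)
    calls: int = 0

    def handle(self, message: str) -> bool:
        self.calls += 1
        failure = self.failures.get(self.calls)
        if failure == "request_lost":
            return False                  # nothing happened, no confirmation
        self.processed.append(message)    # the work itself is done
        return failure != "ack_lost"      # False means the sender hears nothing


def send_at_most_once(receiver: FlakyReceiver, message: str) -> bool:
    """One attempt, never a retry: the action may silently not happen."""
    return receiver.handle(message)


def send_at_least_once(receiver: FlakyReceiver, message: str, max_attempts: int = 5) -> int:
    """Retry until confirmed: the action may happen more than once."""
    for attempt in range(1, max_attempts + 1):
        if receiver.handle(message):
            return attempt
    raise RuntimeError("no confirmation received")


# At most once: the single attempt is lost, so the action never happens.
lossy = FlakyReceiver(failures={1: "request_lost"})
send_at_most_once(lossy, "charge order 42")
print(lossy.processed)       # []

# At least once: the first confirmation is lost, so the action happens twice.
duplicating = FlakyReceiver(failures={1: "ack_lost"})
send_at_least_once(duplicating, "charge order 42")
print(duplicating.processed)  # ['charge order 42', 'charge order 42']
```

From the sender's side both failures look identical (no confirmation), which is why each mode has to accept one of the two risks.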
00:06:15.479 What we all actually want is 'exactly once.' When a sender sends a message, the intention is to send it once. Similarly, when the receiver gets the message, they want it processed only once.
00:06:38.640 Achieving this is hard without a reliable mechanism to ensure it.
00:06:44.400 But we have a recipe, so let’s look at the first component of the recipe: 'at least once.' What does this mean in practice? It means we need to retry our messages.
00:07:00.540 Messages are crucial because we don't want to lose them. If we send a message and something breaks, we need to retry.
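One possible way to avoid losing messages on the sender side is to keep each message until its delivery has been confirmed. The sketch below is a minimal illustration, not something shown in the talk: `Outbox`, `enqueue`, `flush`, and the `deliver` callback are all hypothetical names, and the buffer lives in memory only, so a real system would need durable storage.

```python
from collections import deque
from typing import Callable


class Outbox:
    """Hypothetical sender-side buffer: a message leaves it only after a
    confirmed delivery (in-memory only, an assumption for illustration)."""

    def __init__(self, deliver: Callable[[str], bool]) -> None:
        self._deliver = deliver   # must return True only on confirmed success
        self._pending = deque()

    def enqueue(self, message: str) -> None:
        self._pending.append(message)

    def flush(self) -> int:
        """Try every pending message once; keep the unconfirmed ones."""
        confirmed = 0
        for _ in range(len(self._pending)):
            message = self._pending.popleft()
            try:
                ok = self._deliver(message)
            except Exception:
                ok = False        # an error counts as a missing confirmation
            if ok:
                confirmed += 1
            else:
                self._pending.append(message)  # retried on the next flush
        return confirmed

    @property
    def pending(self) -> list:
        return list(self._pending)


attempts = []


def flaky_deliver(message: str) -> bool:
    """Hypothetical transport that fails on its very first call."""
    attempts.append(message)
    return len(attempts) > 1


outbox = Outbox(flaky_deliver)
outbox.enqueue("order-created")
print(outbox.flush(), outbox.pending)  # 0 ['order-created']
print(outbox.flush(), outbox.pending)  # 1 []
```

Calling `flush` on a timer or after each failure turns this into the retry behaviour described above: nothing is dropped until a confirmation arrives.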
00:07:12.799 We want to ensure we receive confirmation of success. That's great, but what happens if the receiver is in trouble and the sender doesn’t know?