wroc_love.rb 2018

Beyond the Current State: Time Travel to the Rescue!

00:00:13.900 Hello everyone! As was already mentioned, my name is Ivan, and you can find me easily on the internet. I work at a company called Solarisbank. We are primarily a tech company with a banking license, consisting mostly of software engineers. If anyone is interested in opportunities at our company, feel free to come talk to me after this talk or at the party later. Today, I would like to discuss a few interesting concepts that I have picked up over the last few years and found extremely useful when tasked with crafting resilient software systems.
00:00:51.210 Without further ado, let me start with an application containing multiple modules implemented in a single codebase. The approach to building applications like this is commonly known as a majestic monolith. Just kidding! It's actually known as layered architecture, typically represented in three layers. We all know what this means: when a client wants to make changes, a request is issued, during which the client is put on hold. The application validates the data, processes it, and, at the end, persists the result. All fine and dandy! However, this result mutates the current state as it was understood up to that point, and an additional query is usually performed against the database to fetch the mutated state and return it to the client. This is what some call the request-response cycle. It is pretty straightforward, and for most applications it works just fine. However, if the business domain we are trying to describe is extremely complex, or if we have simply built a particularly intricate application, it can become very hard to reason about our code.
00:01:41.080 It is also common to encounter situations where we want to increase the scalability of our system, or we just dream of having a system with superpowers. One noticeable and straightforward optimization is to recognize the read-write disparity: most applications have many more reads than writes, although some have the opposite profile. This observation lets us structure an application so that commands are issued to a write endpoint, which validates the user input and immediately responds with either a validation failure or the location at which the result can later be read. This example uses a typical HTTP interface, but it doesn't have to be limited to that. Usually the response contains unique identifiers, and the client is redirected to a read interface. This redirection can happen at any time in the future because, in most situations, clients who send payloads to the write side already know what they sent and what the previous state was. Therefore, there is little reason for the write side and the read side to be the same component, except where they are forced together, perhaps to check for errors that may have occurred in the meantime.
00:02:29.340 At some later point in time, the client might come and say, "Okay, I have this identifier and would like to check the state." Then a database query is made and the client receives the current state. This approach may seem very simple, and it is; once we strip away the extras, we can see what the application actually looks like. This optimization concept is known as Command Query Responsibility Segregation (CQRS). We have touched on it over the last couple of days, but no talk has really described it here, so apologies if you already know about it. CQRS builds on another concept called Command Query Separation (CQS), which was formulated by Bertrand Meyer and described in his 1988 book, "Object-Oriented Software Construction." If you read the slide, I think you would agree that this principle still holds as true today as it did back then.
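Since the slide itself is not reproduced in this transcript, here is a minimal Ruby sketch of Meyer's CQS principle (my own illustration, not from the talk): commands mutate state and return nothing, while queries return values and cause no side effects.

    # CQS: a command changes state and returns nothing;
    # a query returns a value and changes nothing.
    class BankAccount
      def initialize
        @balance = 0
      end

      # Command: mutates state, returns no meaningful value.
      def deposit(amount)
        raise ArgumentError, "amount must be positive" unless amount.positive?
        @balance += amount
        nil
      end

      # Query: side-effect free, merely reports state.
      def balance
        @balance
      end
    end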
00:03:15.780 Applying the same principle at the service level is what CQRS is about. The term was coined by Greg Young and first publicly mentioned on his blog in 2009. If you have not heard of Greg Young before, which I find hard to believe at this conference, I highly recommend going on YouTube, searching for his name, and watching his videos; you will definitely learn a lot. Delving deeper, we can see that the requirements for our write model can differ significantly from those of our read model. Write models implement complex business logic, while read models merely run simple queries against our persistence layer, displaying previously prepared state. Read models can be shaped to meet specific business requirements, allowing us to have multiple views on the same data, using elements like materialized views in databases. With this, we can split our application into two self-contained parts. While this particular concept can be beneficial in transitioning from a monolith to a service-oriented architecture, what we call microservices today, you can still view this construct as a single application or as a holistic entity. That is perfectly fine; it's just an implementation detail. However, if we decide to physically split our application into multiple parts, we obtain one significant benefit, which is the unlocking of our first superpower: scale. This isn't much of a power, but it's a start.
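One hypothetical shape of that service-level split, with the write side answering only with a location and the read side serving prepared state; the class and method names here are my own, not from the talk.

    require "securerandom"

    CommandResult = Struct.new(:accepted, :resource_id, :errors)

    class WriteSide
      def initialize(store)
        @store = store
      end

      # Validates input, persists, and answers only with a
      # validation failure or the identifier to read from later.
      def place_order(customer_id:, items:)
        return CommandResult.new(false, nil, ["no items"]) if items.empty?
        id = SecureRandom.uuid
        @store[id] = { customer_id: customer_id, items: items }
        CommandResult.new(true, id, [])
      end
    end

    class ReadSide
      def initialize(store)
        @store = store
      end

      # A simple query against the previously prepared state.
      def order(id)
        @store[id]
      end
    end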
00:04:31.210 Before proceeding, I believe it's important to mention two things that some of us either take for granted or don’t spend much time contemplating. One of them is the very idea of the current state in systems dealing with data. We commonly mutate this current state, but maintaining only the current state comes with some drawbacks. The most obvious drawback is that every state mutation effectively removes knowledge about prior states. In essence, things get forgotten. Another interesting drawback is much more apparent when dealing with distributed systems, where atomically updating databases and publishing events is a major challenge. Distributed transactions are possible, but they come at the cost of performance, which ironically is usually the main motivator for moving your architecture to a distributed system.
00:05:29.650 However, there is another perspective from which to view the current state. If we examine how databases work in the background, we can often come across the term "transaction log." To cite Wikipedia—"a transaction log is a history of actions executed by a database management system, used to guarantee the ACID properties over crashes or hardware failures." What we think of as an application's current state seems to be, in essence, a product of a sequence of events that introduce changes to the state.
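To make that concrete, here is a minimal sketch of the current state as a left fold over a sequence of events, in plain Ruby; the event names are my own.

    # State as a left fold over events (illustrative only).
    Deposited = Struct.new(:amount)
    Withdrawn = Struct.new(:amount)

    def apply(balance, event)
      case event
      when Deposited then balance + event.amount
      when Withdrawn then balance - event.amount
      else balance
      end
    end

    events = [Deposited.new(100), Withdrawn.new(30), Deposited.new(5)]
    current_state = events.reduce(0) { |state, event| apply(state, event) }
    puts current_state  # => 75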
00:06:05.390 To some of us, this concept may imply eventual consistency, which is another topic I would like to discuss. It is somewhat foreign yet familiar, as the real world operates on the principle of eventual consistency. Once we accept and embrace this fact, especially when dealing with distributed systems, we can adapt our systems accordingly. For example, in Germany, where I work, there is a system called Elster for dealing with taxes. To get credentials, you go to their website, register, and submit your details, after which you receive a response: your username and password arrive in seven to eight days. This delay is perfectly acceptable due to security requirements, and despite the latency, the credentials are eventually issued. The whole flow is eventually consistent by design.
00:07:03.180 Our habit of applying immediate consistency has often been too strict; the real world frequently contradicts it. Once this fact becomes evident and we are comfortable with the idea, we can start to reap the benefits. In the system we are developing, rather than trying to persist a state every time a business-relevant event occurs, we can publish an event to a persistence layer built to manage events. Because events are immutable and represent facts that have already transpired, we can structure this store to permit only appending and reading. The event store can subsequently trigger projectors, allowing the state to be projected into various forms: into a graph database, into memory, or into a specific slice of state that suits a particular business component.
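A deliberately naive sketch of such an append-only store with projectors attached; this is illustrative, not a real event store implementation.

    # Append and read are the only operations; nothing is updated or deleted.
    class EventStore
      def initialize
        @events = []
        @subscribers = []
      end

      def subscribe(projector)
        @subscribers << projector
      end

      # Appending a new fact notifies every attached projector.
      def append(event)
        @events << event
        @subscribers.each { |subscriber| subscriber.call(event) }
      end

      # Reading replays the full history in order.
      def read(&block)
        @events.each(&block)
      end
    end

A projector here is just any callable that folds incoming events into whatever shape some component needs.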
00:07:59.580 I won't delve deeply into event sourcing here, just focus on the essential advantages we gain. Essentially, each time we handle a command, we construct an in-memory representation of the state by replaying the events for the specific aggregate. On the command side, we have a domain model which serves as the command model. This process was already described by Nathan yesterday, but let me briefly go over it. When you get a request, you can treat your HTTP request as a command, or produce a value object that is a command. It goes to a command handler, which identifies the relevant aggregate and asks the event store for all the events that have happened on that aggregate. You replay them in order to construct the current state in memory, which you can then use to apply the new event produced from this command.
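A compact sketch of that command-handling flow, with a hypothetical in-memory store standing in for a real event store; all names here are mine.

    MoneyDeposited = Struct.new(:account_id, :amount)  # past facts
    MoneyWithdrawn = Struct.new(:account_id, :amount)
    WithdrawMoney  = Struct.new(:account_id, :amount)  # the command

    class InMemoryEventStore
      def initialize
        @events = []
      end

      def events_for(account_id)
        @events.select { |e| e.account_id == account_id }
      end

      def append(event)
        @events << event
      end
    end

    class WithdrawMoneyHandler
      def initialize(event_store)
        @event_store = event_store
      end

      def call(command)
        # 1. Fetch the aggregate's history and fold it into current state.
        balance = @event_store.events_for(command.account_id).reduce(0) do |state, event|
          case event
          when MoneyDeposited then state + event.amount
          when MoneyWithdrawn then state - event.amount
          else state
          end
        end
        # 2. Enforce the business rule against the reconstructed state.
        raise "insufficient funds" if balance < command.amount
        # 3. Persist the new fact; the state itself is never stored.
        @event_store.append(MoneyWithdrawn.new(command.account_id, command.amount))
      end
    end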
00:09:04.290 Once this event or aggregate satisfactorily meets your business logic requirements, you can persist the new event and proceed. We do this for each request. A question arises: is this slow? Let's consider a bank account. My bank account might have 100 to 200 entries per month; over a period of 10 years, that amounts to roughly 20,000 entries. How long would it take to fold through 20,000 items? I measured it, and it takes well under a second. And this cost applies only on the write side, demonstrating that this approach is not inherently slow. In some domains, however, such as advertising, which may process petabytes of data daily, optimizations truly become necessary.
00:10:13.980 The concept of capturing snapshots has already been explored in these discussions, and I won't go deeper into it now. Essentially, applying this methodology empowers our system with a new superpower: the ability to time travel. Keeping all business-relevant facts for the lifetime of the system enables a multitude of functions. The most obvious one is projecting state at any previous point in time. For instance, if you want to know the projected state from three weeks ago, we can easily construct a projector that meets this requirement, allowing us to conduct temporal queries. Furthermore, we've gained the ability to foresee the future, a concept often referred to as precognition in a comic-book context: features can be constructed as if they were conceived on the very first day, as long as we have the relevant events.
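A sketch of such a temporal query, assuming each event carries an occurred_at timestamp; folding only the events that had already happened by the cutoff yields the state as of that moment.

    Deposited = Struct.new(:amount, :occurred_at)
    Withdrawn = Struct.new(:amount, :occurred_at)

    # Project the balance as it was at a given point in time
    # by ignoring every younger event.
    def balance_as_of(events, cutoff)
      events.select { |e| e.occurred_at <= cutoff }
            .reduce(0) do |balance, e|
              e.is_a?(Deposited) ? balance + e.amount : balance - e.amount
            end
    end

    events = [
      Deposited.new(100, Time.utc(2018, 1, 1)),
      Withdrawn.new(40,  Time.utc(2018, 2, 1)),
    ]
    balance_as_of(events, Time.utc(2018, 1, 15))  # => 100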
00:11:45.830 Imagine, for instance, an e-commerce scenario where a manager informs you that they want to send promotional emails to every customer who added specific items to their cart over the last year but subsequently removed them. In a conventional system, you might sigh and explain that this feature will take a year to deliver, because that data was never kept and you would have to start collecting it now. In an event-sourced system, you can simply respond, "No problem, we'll build a projection," because all of the relevant business events have already transpired and are retained in the system. This proves to be an incredibly powerful tool based on a deceptively simple concept.
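A sketch of such a retroactive projection; it is deliberately simplified, since it ignores event ordering and customers who later re-added the item, and the event names are my own.

    ItemAddedToCart     = Struct.new(:customer_id, :item_id)
    ItemRemovedFromCart = Struct.new(:customer_id, :item_id)

    # Customers who both added and removed the given item,
    # computed purely from historical events.
    def customers_to_email(events, item_id)
      added   = events.select { |e| e.is_a?(ItemAddedToCart)     && e.item_id == item_id }
      removed = events.select { |e| e.is_a?(ItemRemovedFromCart) && e.item_id == item_id }
      added.map(&:customer_id) & removed.map(&:customer_id)
    end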
00:12:36.400 The next superpower we gain is total self-reconstruction. Take, for example, the case of a junior developer who accidentally drops a production database. This could lead to immediate termination, but if you have an event store that logs events securely and is backed up, you typically wouldn't need to worry, provided you don't allow individuals to destroy the event store itself. It could even be a perfect learning opportunity for that developer. With event sourcing, reconstructing the current state becomes trivial, enabling complete recovery within hours. That is less than ideal for customers, but still far better than losing all your data without backups. The benefits extend further. When I previously worked with Ruby on Rails, there was the concept of migrations that alter the schema, and while migrations are important, they harbor their own problems.
00:13:35.300 In an event-sourced system, you do not need to execute migrations. You can change the schema of your projections, build a new projection accordingly, populate it with the necessary information, and redirect your application to read from this new projection, abandoning or repurposing the old one. Lastly, we have the superpower of enhanced charisma, particularly in regulatory environments. If regulators inquire how you build software, explaining that you use a ledger from which everything is projected generally makes a positive impression, and this can greatly enhance your standing with them.
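One hedged sketch of that migration-free approach: build the new projection next to the old one, replay history into it, then switch readers over. The OrderPlaced event and its fields are invented for illustration.

    require "date"

    OrderPlaced = Struct.new(:order_id, :customer_id, :occurred_at)

    # A fresh projection with the new "schema", populated by replaying
    # the full event history instead of migrating a table in place.
    class OrdersV2Projection
      attr_reader :rows

      def initialize
        @rows = {}  # stands in for a table with the new layout
      end

      def handle(event)
        return unless event.is_a?(OrderPlaced)
        @rows[event.order_id] = {
          customer_id: event.customer_id,
          placed_on:   event.occurred_at.to_date  # new derived column
        }
      end
    end

    # projection = OrdersV2Projection.new
    # store.read { |event| projection.handle(event) }  # replay history
    # ...then point the read side at the new projection and drop the old one.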
00:14:28.510 There are additional benefits to this approach, such as debugging an exact state. Time travel proves exceptionally useful for race conditions, which may occur when multiple systems simultaneously write business-relevant events to the event store that need synchronization. We once had such a case, and due to a mistake it became difficult to pinpoint the source of an issue caused by a race condition, which is notoriously challenging to debug. We rewound the state to a point before it was broken and replayed events sequentially, which allowed us to identify the cause in under five minutes. That is a task that could have taken me a week had I been working with a current-state model. The next benefit is the simplicity of testing: there is no need to update or delete things, which streamlines and accelerates your test setup.
00:15:24.510 That should be practically self-explanatory. Additionally, backing up your system is trivial on a per-event basis; you can create a projection or reactor that absorbs every event in the system, tracking a global sequence number to verify that no event is missed. There are techniques to manage this, effectively giving you redundant backups. When I first learned of these concepts, I found myself bewildered, wondering why nobody had told me such capabilities existed and why they weren't taught in schools. Many established software engineering techniques have remained unchanged for the past forty years, yet concepts like double-entry bookkeeping, which is fundamentally a ledger, have persisted for over five centuries. I believe this knowledge is quite valuable and simple enough to incorporate into one's toolkit.
00:16:20.900 Please note that while you can certainly apply CQRS and event sourcing independently, I would like to emphasize, as a personal view, that while CQRS performs quite well on its own, event sourcing only makes sense if you aim to create read state from a stream of events, which by definition is also a CQRS implementation. I once encountered a system where event sourcing was used purely to maintain records: state updates and direct reads both went through the command model. The rationale was to reduce complexity; however, after a few months of engineering, the reality was that complexity had increased considerably. A significant amount of logic congregated in our workflows, which is the term we use for our orchestration layer.
00:17:26.690 While this arrangement resulted in fewer system objects, the increase in complexity was substantial. Once we rejected the mutable-current-state model and applied CQRS patterns properly, the number of classes grew significantly. We ended up with many new system objects, but these were focused, fulfilled single responsibilities, and were easy to test and reason about, and the system operated flawlessly. Of course, nothing in software engineering comes without trade-offs, which has been discussed numerous times throughout this conference. One notable trade-off extends beyond the technical details to a broader mind shift: it is hard to move your thought process from writing the current state to projecting the state from a sequence of events.
00:18:36.080 I am a prime example of this; it took me significant time to wrap my mind around the concept. I grasped the idea thoroughly on paper, but putting it into practice was challenging. Time and practice help with this adaptation. I was fortunate that my employer invested time in researching this methodology and did not demand an immediately live system, which turned out to be immensely helpful. The second hurdle involves hiring and training engineers in these techniques. Finding engineers with experience in these concepts can be particularly difficult, so if you are a startup that needs a working system quickly, it may be prudent to build up the team's experience before introducing these methodologies.
00:19:53.510 Regarding the final point, the nature of eventual consistency is especially daunting when managing legacy systems. Embracing the idea is one thing, but it is not easy, and dealing with external systems complicates matters further. Therefore, alongside the projectors I mentioned previously, which build up the state, you also need components that interface with external systems. These are called reactors: they respond to certain events, sometimes even producing new ones that feed back into the state. Intermixing these concepts can help, but bear in mind that it becomes incredibly challenging if you try to reason about external systems as simply another persistence repository. These are challenges one can surmount, even if they often feel daunting in practice.
00:20:56.010 I would like to propose some tips we've discovered while developing such systems. First, projectors and reactors must be idempotent. This is not overly difficult to accomplish but can present complications; I will refrain from going into specific implementations here, as Nathan has already discussed that. The second tip involves aggregate-scoped sequence numbers. Early on, I was unaware of this necessity. When you run multiple instances of your write model, events can arrive out of order across systems; aggregate-scoped sequence numbers help you detect and avoid such anomalies. A store such as Postgres can facilitate this, but each event should carry a locally scoped sequence number in addition to its global identifier.
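A minimal sketch combining both tips: the projector skips anything at or below the last global position it handled, making redelivery harmless, and each recorded event carries an aggregate-scoped version alongside the global position. The field names are my own.

    RecordedEvent = Struct.new(:global_position, :aggregate_id,
                               :aggregate_version, :payload)

    class IdempotentProjector
      def initialize
        @last_position = 0
        @state = Hash.new(0)  # e.g. a count of events per aggregate
      end

      def handle(event)
        # Already processed (a redelivery or a replay overlap): skip.
        return if event.global_position <= @last_position
        @state[event.aggregate_id] += 1
        @last_position = event.global_position
      end
    end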
00:22:00.390 The third point is to reuse your command model aggregates for responses. This is an effective strategy for layering your system. In essence, when you develop such a system, you usually have an interface, typically backed by a message bus: you construct your command, serialize it, and place it onto the bus, where a command handler picks it up further along the line. This allows you to decouple your system, and it's an effective method.
00:23:04.390 However, even if you skip this step and lack a command bus or event bus, you can often predict the response to a request from your aggregate's state before the projection has caught up. This gives you the necessary assurance of what the response will be, despite the unknowns regarding how the projection will evolve. This is crucial in certain contexts, while also teaching developers to remain cautious in their expectations.
00:24:03.710 The next point involves using a command UUID to aid in the identification of duplicates within your system. Commands should carry a unique identifier which, if incorporated into the event's metadata, can help prevent sourcing duplicate events, presuming your commands and events maintain a one-to-one relationship. Lastly, consider leveraging sagas to manage reactor synchronization errors. Though some discussion has already touched on the saga pattern, also known as a process manager, the essence is that it solves problems with external systems when handling these types of errors.
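A toy illustration of command-UUID deduplication, assuming the one-to-one command/event relationship mentioned above; the store interface is invented for the example.

    require "securerandom"
    require "set"

    class DeduplicatingStore
      def initialize
        @events = []
        @seen_command_ids = Set.new
      end

      # Refuse to append an event whose causing command was already handled.
      def append(event, command_id:)
        return false if @seen_command_ids.include?(command_id)  # duplicate
        @seen_command_ids << command_id
        @events << event
        true
      end
    end

    store = DeduplicatingStore.new
    id = SecureRandom.uuid
    store.append({ type: "MoneyWithdrawn" }, command_id: id)  # => true
    store.append({ type: "MoneyWithdrawn" }, command_id: id)  # => false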
00:25:17.140 Should your request to the external system time out without a definitive outcome, your system may need to follow up to ensure the operation's completion. A reactor can check the external system after a period to verify whether something was written. If it has been recorded, it sources the corresponding event; if not, it either retries or escalates by emailing a human operator for a time-sensitive adjustment. This becomes necessary for business-critical functions, which would prove unwieldy without such a setup.
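A toy saga step along those lines; the gateway, event store, and mailer interfaces are all invented for illustration, and a real implementation would schedule the retries rather than recurse immediately.

    class PaymentFollowUpSaga
      MAX_ATTEMPTS = 3

      def initialize(gateway:, event_store:, mailer:)
        @gateway     = gateway      # checks the external system's state
        @event_store = event_store  # where confirmed facts get sourced
        @mailer      = mailer       # escalation path to a human operator
      end

      def follow_up(payment_id, attempt: 1)
        if @gateway.confirmed?(payment_id)
          # The external write happened: source the event and move on.
          @event_store.append(type: "PaymentConfirmed", payment_id: payment_id)
        elsif attempt < MAX_ATTEMPTS
          # Not yet visible: try again (after a delay, in a real system).
          follow_up(payment_id, attempt: attempt + 1)
        else
          # Still unresolved: escalate to a human for manual review.
          @mailer.alert_operator("payment #{payment_id} needs manual review")
        end
      end
    end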
00:26:15.000 Thank you very much for your attention and engagement. Please feel free to reach out post-talk for deeper discussions!
00:26:32.690 I also welcome any questions you might have regarding what I've presented.
00:26:58.200 Audience question: in the command pattern, do you persist the commands themselves, saving them in a database? As an object-oriented developer working with data in tables, I wonder about maintaining a level of abstraction between the data and the database.
00:27:17.420 Regarding storing commands: truthfully, I don't save commands; I log them within a logging system. However, as Nathan discussed, you can record command IDs in the events' metadata. Whether storing a command is significant depends on the business context: if it has business relevance, then certainly store it. If not, events are what matter, as commands merely express intent.
00:27:47.270 As for storing numerous events in databases, businesses that receive high volumes create methods for managing this. Some might utilize systems like Kafka for event storage, although I wouldn't personally recommend it due to alarming difficulties shared by other users. Financial entities often have an obligation to submit end-of-year reports, at which point they can take a complete snapshot, store it externally, and continue running their operations from there.
00:28:26.860 In this case, retaining data becomes manageable, but companies handling heavy workloads in advertising might face more challenges. Different strategies must be employed, aligning with the specific business sector.
00:28:42.890 Audience question: with regards to time travel, doesn't it suffer under log compaction, since it only works if the historical events are retained?
00:29:02.240 In a merchandising business, log compaction can be acceptable, since the data becomes irrelevant once a campaign completes. In contrast, e-commerce operations can afford to retain such events for reporting and analytics purposes.
00:29:58.050 Thank you for your insightful questions. I appreciate your engagement and I’m eager to continue discussions following this presentation.
00:30:30.330 My name is Ivan Pašalić, and you can find me easily on the Internet. Please connect with me if you'd like more discussion around these valuable concepts.
00:30:44.910 Thank you all for your attention.