Working with RailsEventStore in Cashflow Management System

00:00:04.580 There is one fun fact about Łukasz. Who can guess what programming language he used two years ago?

00:00:11.340 Yes, Łukasz was a .NET developer two years ago. I'm really happy that I helped him transition to Ruby.

00:00:19.440 Today, Łukasz will talk about a real-world project where Rails Event Store was introduced, or more generally, where event-driven architecture was adopted. I hope some of you will implement something similar in your projects soon.

00:00:32.520 So, let's welcome Łukasz. Good luck!

00:00:57.719 Thanks, André. Yes, I admit, two years ago I was addicted to .NET, and I’m really grateful that he helped me transition away from it. I'm honored to be the first speaker at this conference.

00:01:08.939 As you already know, my name is Łukasz. I work at a company called Arkency on a daily basis, helping rescue legacy projects. In this talk, I will share six stories about how we work with Rails Event Store in Trezy, our cash flow management application.

00:01:26.340 Let's begin with an overview of our cash flow management application. This application assists small businesses in managing their cash flow. Users can view their bank account balances in one place, and we also handle some accounting tasks for them, so they don’t have to wait for their accountants to do it once or twice a year, which is the case for many.

00:01:39.000 The Rails Event Store is an open-source event store implementation for Ruby on Rails, established in 2014. It acts as a library for publishing, consuming, storing, and retrieving events. In this talk, I will focus on event sourcing in conjunction with event-driven architecture, domain-driven design patterns, and testing, which will play an important role in this presentation.

00:02:03.840 Is anyone here familiar with this concept? Please raise your hand. Great! Some of you are, while others may not be, but let’s move on to the first story, which is about our integration with an open banking provider.

00:02:30.239 As I mentioned, we are a cash flow management application, so we need access to your bank account data. However, we don’t create the integrations ourselves; we rely on third-party providers, known as open banking services, to send us the data. This enables us to show you the balances of all your accounts and transactions.

00:02:55.319 Based on this information, we can generate useful reports for the small businesses using Trezy.

00:03:01.080 The banking providers integrate with us through webhooks. They essentially hit our API and send us data in the payload. There are at least two approaches to process a webhook. The first is a synchronous approach: when we receive a webhook from the banking provider, we can pass the payload through the application's layers until it lands in the database.

00:03:14.819 While this method works, it raises a question: what if something fails during this journey to the database? For instance, there are many issues that could arise in the application layer, such as violating an index, changes in the database schema, or processing logic errors.

00:03:39.000 To address this, we could store the payload as a technical event, which leads us to the first relation to event sourcing. Instead of saving it in a traditional database, we store it in the event store. Once we do that, we can publish it to a queue and process it asynchronously.

00:03:50.940 This is what the process looks like: when the banking provider sends us a webhook, it first hits our REST API. We then store and publish a technical event. Afterward, a queue takes the event, and an event handler processes it, fetching the payload and storing it in the database along with any necessary calculations.
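The flow described above can be sketched in plain Ruby. This is a minimal illustration under stated assumptions, not the project's code and not the Rails Event Store API: `TechnicalEvent`, `InMemoryEventStore`, `WebhookEndpoint`, and `BalanceHandler` are hypothetical names, a `Queue` stands in for the real job queue, and a hash stands in for the database.

```ruby
require "json"
require "securerandom"

# Hypothetical technical event: the raw webhook payload plus metadata.
TechnicalEvent = Struct.new(:id, :payload, :received_at, keyword_init: true)

# In-memory stand-in for an event store (not the Rails Event Store API).
class InMemoryEventStore
  def initialize
    @events = {}
  end

  def append(event)
    @events[event.id] = event
  end

  def read(id)
    @events.fetch(id)
  end
end

# The "REST API" step: persist the payload, enqueue its id, respond at once.
class WebhookEndpoint
  def initialize(event_store, queue)
    @event_store = event_store
    @queue = queue
  end

  def call(raw_body)
    event = TechnicalEvent.new(
      id: SecureRandom.uuid,
      payload: raw_body,
      received_at: Time.now.utc
    )
    @event_store.append(event)
    @queue << event.id
    200 # HTTP status returned to the provider
  end
end

# The asynchronous step: a handler fetches the payload and updates the database.
class BalanceHandler
  def initialize(event_store, balances)
    @event_store = event_store
    @balances = balances
  end

  def call(event_id)
    data = JSON.parse(@event_store.read(event_id).payload)
    @balances[data.fetch("account_id")] = data.fetch("balance")
  end
end

event_store = InMemoryEventStore.new
queue = Queue.new
balances = {} # stand-in for the relational database

endpoint = WebhookEndpoint.new(event_store, queue)
endpoint.call(JSON.generate("account_id" => "acc-1", "balance" => 100))

# A worker would normally drain the queue in another process.
handler = BalanceHandler.new(event_store, balances)
handler.call(queue.pop) until queue.empty?
```

The endpoint does no domain work at all, which is why the response to the provider stays fast; everything that can fail happens later in the handler, with the payload already safely stored.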

00:04:22.259 This approach offers several benefits. Firstly, we gain an external system audit log—we have the payload information stored as technical events that we can access whenever necessary. This is incredibly useful for debugging and troubleshooting.

00:04:39.300 Additionally, the performance improves; responses to third-party systems are much faster since we don’t process data immediately. We only store the payload, so there is minimal database interaction. Once stored, we can publish the event, which takes only milliseconds.

00:04:57.780 Later, we can process it asynchronously depending on the availability of processing units. A notable advantage is that we can scale the web server that receives the webhooks independently of the background workers that process them.

00:05:17.280 We will likely have more background workers processing the webhooks than web server instances, and we don't have to rely on the third-party retry mechanism. Once we have stored the payload, if something goes wrong, we can always access it.

00:05:28.500 This means if there’s a bug, I can fix my code and retry as many times as needed without depending on whether the third-party provider will send me a retry payload, which might not happen for a while.
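To make the "fix and retry" idea concrete, here is a small sketch in plain Ruby. The stored payload, the handler names, and the particular bug (reading a key that the payload does not contain) are all invented for illustration; the point is only that a retry reads the payload we already have instead of waiting for the provider.

```ruby
# Stored technical events, keyed by id (stand-in for the event store).
STORED_PAYLOADS = {
  "evt-1" => { "account_id" => "acc-1", "balance" => 100 }
}.freeze

# First version of the handler: it has a bug and reads a key that does not exist.
BUGGY_HANDLER = lambda do |payload, balances|
  balances[payload.fetch("account_id")] = payload.fetch("amount")
end

# Fixed version: reads the key the provider actually sends.
FIXED_HANDLER = lambda do |payload, balances|
  balances[payload.fetch("account_id")] = payload.fetch("balance")
end

# Replaying means running a handler again on a payload we already stored.
def replay(event_id, handler, balances)
  handler.call(STORED_PAYLOADS.fetch(event_id), balances)
  :processed
rescue KeyError => e
  warn "processing #{event_id} failed: #{e.message}"
  :failed
end
```

Calling `replay("evt-1", BUGGY_HANDLER, balances)` fails and leaves `balances` untouched; after the fix is deployed, `replay("evt-1", FIXED_HANDLER, balances)` processes the very same stored payload.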

00:05:41.340 Keep this in mind; I will refer to it in the next stories. Now, let’s return to the open banking provider. We decided to move the processing of webhooks into an async process.

00:06:00.000 As with most things, a question arises: what could possibly go wrong here? Nothing is without drawbacks. While there are many benefits, one problem is that the webhooks may come out of order.

00:06:23.119 This ordering is critical. Imagine we receive a webhook with a bank account balance of 100, and then we publish it as a technical event, but the processing fails. Next, we receive another webhook for the same account with a different balance. If we successfully process the second one and then go back to the first, we might overwrite the data, which we want to avoid.

00:06:39.000 Before we solve this issue, it's a good moment to introduce the context of the project we are working on. Some of you might think this is a greenfield project, but it’s actually not. The code exists within a startup context, and this picture conveys the challenge quite effectively.

00:06:53.539 Sometimes, I feel like when I want to change one small thing, I have to be careful not to break something else, resulting in unforeseen consequences.

00:07:06.500 So, what does this mean in reality? For most concepts in the system, there is one relational data model containing the entire structure. We call this logical coupling. I won't mislead you; this is common.

00:07:21.360 While it served its purpose and earned revenue, extending that model at this point would be impractical. We began seeking a new model and, after several iterations, arrived at a refined design.

00:07:40.560 In this new model, the white boxes represent the pieces of information needed, such as those for external providers, timestamps, statuses, and, of course, IDs. There are three primary operations: opening a bank account, closing one, and performing a specific operation.

00:08:02.460 The yellow boxes depict the rules. We utilized event storming notation here. The first rule states that an operation can only be performed if the bank account is not closed. The second rule prevents processing webhooks with timestamps older than the current one, which protects against data inconsistency.

00:08:19.980 This is how it appears in Ruby code. Some of you who attended the workshop might recognize this. The class is quite straightforward: it includes the AggregateRoot module from the aggregate_root gem, which provides methods for applying and handling events.

00:08:34.620 This class checks whether operations can be performed while also managing the event and its publication.
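The slide itself is not part of the transcript, so the following is only a sketch of what such an aggregate might look like. It is written in plain Ruby so that it runs without any gems; it imitates the apply-and-handle style of an event-sourced aggregate rather than using the aggregate_root gem, and the event and error names are hypothetical.

```ruby
# Hypothetical domain events; the names are illustrative.
BankAccountOpened  = Struct.new(:account_id, keyword_init: true)
BankAccountClosed  = Struct.new(:account_id, keyword_init: true)
OperationPerformed = Struct.new(:account_id, :balance, :timestamp, keyword_init: true)

class BankAccount
  AlreadyClosed = Class.new(StandardError)
  StaleData     = Class.new(StandardError)

  attr_reader :unpublished_events

  def initialize(account_id)
    @account_id = account_id
    @closed = false
    @last_timestamp = nil
    @unpublished_events = []
  end

  # Rebuild state from a stream of past events.
  def self.from_history(account_id, events)
    account = new(account_id)
    events.each { |event| account.send(:on, event) }
    account
  end

  def open
    apply(BankAccountOpened.new(account_id: @account_id))
  end

  def close
    raise AlreadyClosed if @closed

    apply(BankAccountClosed.new(account_id: @account_id))
  end

  def perform_operation(balance:, timestamp:)
    # Rule 1: no operations on a closed account.
    raise AlreadyClosed if @closed
    # Rule 2: reject data older than what we have already processed.
    raise StaleData if @last_timestamp && timestamp < @last_timestamp

    apply(OperationPerformed.new(account_id: @account_id, balance: balance, timestamp: timestamp))
  end

  private

  # Change state and remember the event so it can be stored and published.
  def apply(event)
    on(event)
    @unpublished_events << event
  end

  def on(event)
    case event
    when BankAccountClosed  then @closed = true
    when OperationPerformed then @last_timestamp = event.timestamp
    end
  end
end
```

With this shape, the out-of-order webhook from the earlier example is rejected with `StaleData` instead of silently overwriting a newer balance.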

00:08:46.500 However, I must confess that we had to be pragmatic here; we just didn’t have time to rewrite half the system. Our existing approach required writing to the bank account model, which forms the top layer, where we instantiate the class and execute the operation.

00:09:08.760 This means that changes to the read model occur in conjunction with the write model. It’s often not easy to adopt a perfect system in a real project.

00:09:24.580 The underlying assumption is that the stream always tells the truth. The state of a bank account may change for multiple reasons. For instance, with a soft-delete mechanism in ActiveRecord, a bank account can be deleted for various reasons.

00:09:39.000 However, when we check the stream and see that the bank account is closed, we know not to act on it. Next, we have a story about a closed bank account and missing events.

00:09:57.919 You may already guess that we introduced a problem when implementing the aggregate tasked with managing operations.

00:10:13.200 When we receive a webhook and publish it as a technical event, we begin processing it, but it fails. The error indicates that we can’t save the record, which surprised me since there was nothing wrong with the code.

00:10:29.640 I reviewed the stream, and everything seemed fine; it indicated the bank account was not closed. However, upon examining the code snippet that generated the output, I noted it failed during the save operation.

00:10:46.080 This issue was only encountered in specific bank accounts, presenting a real challenge to debug and reproduce accurately.

00:11:02.640 To address this, I utilized my advanced debugging techniques and replayed the event since it was stored in my event store.

00:11:16.500 As it turns out, the read model had been soft-deleted for some reason, but I noticed we were missing the event indicating the bank account closure.

00:11:32.520 Given our policy that the stream is the source of truth, this missing event should not have happened. It became clear that the closure should have gone through the bank account aggregate to keep the event stream reliable.

00:11:48.720 Previously, a new feature had been implemented to ignore all incoming webhooks for specific data, which posed substantial risk.

00:12:04.600 The developer neglected to incorporate the necessary changes into the aggregate, bypassing the single source of truth for this part of the logic.

00:12:22.520 While some may view this as poor design by introducing multiple verifiable sources, it was a conscious decision aimed at retaining a reliable stream.
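The difference between the two code paths can be shown with a small sketch. It uses an array as the stream and a hash as the read model; `BankAccountService` and its method names are hypothetical and only illustrate the ordering, not the project's actual classes.

```ruby
# In-memory stand-ins: a stream of event hashes and a read model row.
class BankAccountService
  def initialize(stream, read_model)
    @stream = stream
    @read_model = read_model
  end

  # Before the fix: only the read model learns about the deletion.
  def soft_delete_bypassing_stream(now)
    @read_model[:deleted_at] = now
  end

  # After the fix: record the fact in the stream first, then update the read model.
  def close(now)
    return if closed?

    @stream << { type: :bank_account_closed, at: now }
    @read_model[:deleted_at] = now
  end

  def closed?
    @stream.any? { |event| event[:type] == :bank_account_closed }
  end
end
```

After `soft_delete_bypassing_stream`, `closed?` still answers `false` even though the row is deleted, which is exactly the mismatch that made the later save fail; after `close`, both sources agree.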

00:12:39.000 To resolve this issue, the first part was straightforward. I wrote a test to confirm the fix. Once the problem was identified, I implemented the correction, which made it pass.

00:12:57.880 But the challenge didn't end there; I now had to backfill the historical data, a critical aspect of maintaining an effective event-sourcing strategy.

00:13:16.920 This step is crucial when managing legacy systems that continue to evolve rapidly.

00:13:31.200 This dynamic environment prevents us from adhering strictly to every architectural principle. We emphasize adapting solutions to the specific context of the current business needs.
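The backfill mentioned above might look roughly like the sketch below. The data shapes (rows as hashes with `:id` and `:deleted_at`, streams as arrays of event hashes keyed by account id) and the `backfilled: true` marker are assumptions made for the example; a real backfill would read from the database and append to the event store.

```ruby
# Append a closing event for every soft-deleted account whose stream lacks one.
# Returns the ids of the accounts that were backfilled.
def backfill_closed_events(read_models, streams)
  read_models.each_with_object([]) do |row, backfilled|
    next unless row[:deleted_at]

    stream = (streams[row[:id]] ||= [])
    next if stream.any? { |event| event[:type] == :bank_account_closed }

    stream << { type: :bank_account_closed, at: row[:deleted_at], backfilled: true }
    backfilled << row[:id]
  end
end
```

Because the function skips streams that already contain a closing event, running it a second time changes nothing, which makes it safe to retry if the backfill is interrupted.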

00:13:46.200 Now, let’s move on to another story, about pending transactions: another hurdle with event sourcing that I had to tackle.

00:14:02.520 Events in an async system that depends on external data from providers can be complex. Let me illustrate what pending transactions entail.

00:14:19.560 One of our banking providers notified us they could send transactions not yet fully processed by the bank. Customers could view these pending transactions early, but when they sent a new batch, we needed to clear the previous ones.
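The replace-on-new-batch behaviour can be sketched as follows. `TransactionList` is a hypothetical, in-memory model for a single bank account; how the real application stores booked and pending transactions is not shown in the talk.

```ruby
# Booked and pending transactions for one bank account.
class TransactionList
  def initialize
    @booked = []
    @pending = []
  end

  # A new batch replaces the whole previous set of pending transactions.
  def replace_pending(batch)
    @pending = batch.dup
  end

  def book(transaction)
    @booked << transaction
  end

  # What the customer sees: booked transactions plus the latest pending batch.
  def visible
    @booked + @pending
  end
end
```

If clearing the previous batch fails halfway, the customer is left without the new transactions, which matches the symptom described in the complaints that follow.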

00:14:35.040 After a few weeks with this feature in production, we received complaints from customers stating that they couldn’t see new transactions, and the bank account balance seemed incorrect.

00:14:53.780 This situation occurred multiple times a week, and at first, I brushed it off, thinking customers didn’t understand our role as a bridge, not a bank.

00:15:12.480 After some investigation, I pulled the recent events and discovered a delay. We realized 700 pending transactions were missing, and the resulting inferential calculations yielded an incorrect balance.

00:15:29.700 With newfound determination, I checked the event logs provided by the open banking provider and observed that they had indeed sent us the relevant events, which we recorded as events.

00:15:44.520 However, I ran into the same save record error once more. After initial panic, I noted the payload looked healthy.

00:16:03.000 Employing my debugging repertoire once more, I replayed the event and traced the issue back to the soft-delete configuration in the ActiveRecord setup, which unexpectedly affected the pending transactions.

00:16:25.680 Here’s how I resolved it. First, I wrote a test confirming my understanding of the bug, then I fixed the issue and deployed the correction.

00:16:43.720 Moreover, unlike traditional debugging methodologies, I could replay the event without waiting for another webhook; I didn't need to wait for the bank provider to resend it.

00:17:01.120 This significantly boosted customer satisfaction as they did not experience prolonged delays.

00:17:16.560 Next, I want to cover a scenario involving large transactions and categorizing them accurately.

00:17:35.040 As I mentioned, the classification of transactions is essential for reporting and managing cash flow effectively. Customers can manually select a category for their transactions.

00:17:52.720 In Poland, for instance, cars available for rent observe specific travel expenses. There are instances where transactions are automatically classified by our system to save time and avoid the need for manual input.

00:18:05.520 This entire system was built as a monolith initially, and over time, the model became too large with distributed business logic causing an amalgamation of complexities.

00:18:24.000 The consequence was a lack of cohesion as business rules got scattered everywhere, especially within various event handlers.

00:18:41.760 When we realized transaction classification was crucial, we knew the project needed addressing, as customers kept randomizing their transactions.

00:18:57.680 After deliberations with our team, we crafted a project to rewrite the classification mechanism to enhance stability and predictability, aiming to do so effectively.

00:19:13.600 A breakthrough introduced an automatic classification feature. To ensure reliability, I laid out exploratory tests to establish a point of control.

00:19:30.240 Crucially, all of this hinged on solid integration testing due to the unpredictable nature of event-driven architectures.

00:19:54.960 We developed a classification algorithm that learns from existing data. For instance, if a user manually classifies one transaction, the system can suggest the same classification for future transactions.

00:20:14.240 The code encapsulates the business rules carefully. Every logical pathway is robustly checked against numerous criteria, maintaining our emphasis on quality.

00:20:30.000 We utilized a short, simple Ruby object that does not depend on an ActiveRecord setup, simplifying unit testing.
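As an illustration of a plain Ruby object that learns from manual classifications, here is a minimal sketch. Matching on the counterparty name is an assumption made for the example (the talk does not say which attributes the real algorithm uses), and `LearnedClassifier` is a hypothetical name.

```ruby
# Remembers the category a user picked and suggests it for similar transactions.
class LearnedClassifier
  def initialize
    @category_by_counterparty = {}
  end

  # Called when a user manually classifies a transaction.
  def learn(counterparty:, category:)
    @category_by_counterparty[normalize(counterparty)] = category
  end

  # Returns the learned category, or nil when nothing is known yet.
  def suggest(counterparty:)
    @category_by_counterparty[normalize(counterparty)]
  end

  private

  def normalize(name)
    name.strip.downcase
  end
end
```

Since the object has no ActiveRecord or database dependency, a unit test only needs to call `learn` and `suggest`, which also keeps mutation testing fast.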

00:20:44.400 Following development, we aimed for complete mutation coverage, ensuring our confidence in the transition to production.

00:21:02.000 Eventually, we backfilled the historical data. However, we decided to introduce a rich part into the classification process, separating the read and write model.

00:21:19.720 By developing an event handler, we can maintain a simplified read structure while adapting flexibly to future requirements.

00:21:36.640 After deploying these changes, we instantly received notifications of errors related to incorrect transaction classifications, prompting us to revise our implementation.

00:21:52.640 To clarify the hierarchy in our decision-making process, we established a clear classification pipeline evaluating rule-based classifications first, followed by account classifications, ensuring that manual overrides remained within context.
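One way to express such an ordered pipeline is a list of steps where the first non-nil answer wins. The sketch below is an interpretation, not the project's implementation: the example rule, the account defaults, the `"uncategorized"` fallback, and the choice to let a manual category short-circuit the pipeline are all assumptions.

```ruby
Transaction = Struct.new(:description, :account_id, :manual_category, keyword_init: true)

# Each step returns a category or nil; the first non-nil answer wins.
RULE_BASED = lambda do |tx|
  "taxes" if tx.description.match?(/\btax\b/i)
end

ACCOUNT_DEFAULTS = { "acc-payroll" => "salaries" }.freeze
ACCOUNT_BASED = ->(tx) { ACCOUNT_DEFAULTS[tx.account_id] }

# Order matters: rule-based classification runs before account-based.
PIPELINE = [RULE_BASED, ACCOUNT_BASED].freeze

def classify(tx)
  # Assumption: a category chosen by the user is never overwritten.
  return tx.manual_category if tx.manual_category

  PIPELINE.each do |step|
    category = step.call(tx)
    return category if category
  end
  "uncategorized"
end
```

Keeping the order in a single array makes the hierarchy explicit and easy to test: changing the priority means reordering `PIPELINE`, not hunting through event handlers.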

00:22:08.600 This iterative feedback helped us improve our system, allowing us to refine processes regularly.

00:22:23.520 In this cumulative process, we maintained visibility in real time, practically recalibrating pathways for transactions to protect against categorization errors.

00:22:38.800 This feedback continues informing future refinements, demonstrating the iterative value of our evolving architecture and classification algorithm.

00:22:54.400 Once we launched the new classification method, the operational efficiency improved significantly, demonstrating the simplicity behind a well-structured event-driven architecture.

00:23:09.240 Ultimately, we adapted this system to accommodate user feedback dynamically, also tracking responses to past decisions.

00:23:27.200 The alignment of user input with historical data created a robust foundation for future data-driven decisions.

00:23:44.080 With tracking mechanisms in place, we anticipated even smoother transitions for users returning to our application.

00:24:02.880 With the event streaming available throughout the systems, we realized what’s mistaken or unreachable will always be held within the archived data.

00:24:19.280 To encapsulate this process in user-focused terms, we've defined specific feedback categories to identify business analysis opportunities easily.

00:24:36.040 Ultimately, a streamlined flow facilitates a thorough understanding of past actions concerning customer retention, creating numerous opportunities for improvements.

00:24:54.080 There is no arbitrary limit restricting us to only sending notifications if we can access value in existing data. Businesses may always iterate upon categorical classifications based on relevant events.

00:25:13.760 This system induces flexibility when responding to varying types of user interactions while remaining grounded in analysis. Flexibility is a significant advantage in our rapidly changing landscape.

00:25:32.740 Ultimately, the power of event sourcing becomes evident through improved decision-making facilitated by thorough analysis over historical events.

00:25:50.080 At the end of this journey, embracing an event-driven architecture requires acknowledging the inherent complexities while cherishing its advantages.

00:26:04.560 My message to you all is: don't let misconceptions on event sourcing intimidate you. Understand its role within the system and approach it step by step, recognizing that it can seamlessly integrate into legacy systems as much as new builds.

00:26:23.720 When adapting legacy projects, break down the problem, evaluate your business needs, and determine how much of the architecture should be applied and when it should take place.

00:26:38.560 I believe this is essential in guiding practical implementations towards assured success. Working iteratively incorporates flexibility while keeping proactive measures in sight.

00:26:55.920 Ultimately, event sourcing and domain-driven design are not magic solutions to every problem, but serve specific purposes in unique contexts.

00:27:15.560 None of these patterns are silver bullets and may not resolve every concern you face; there are definitely traps to watch for.
