Ruby Unconf 2019

Building Resilient API Dependency

00:00:02.560 Let's give a warm welcome to Sergey Dolganov. He's working at Evil Martians.
00:00:04.779 Currently, he is working remotely while on a European trip. Thanks for being here.
00:00:12.730 Okay, thank you! Hello everyone, my name is Sergey Dolganov. Today, I'm going to talk about how we build resilient distributed applications using primarily two things: the contracts approach and implementing sagas.
00:00:20.770 A couple of words about myself: I came to you from St. Petersburg where I live with my dog and my drum kit. Most of my spare time is spent with them. Today, I'm here because I work for a company called Evil Martians. At Evil Martians, we help startups to grow their businesses at web speed by creating and contributing to various open-source tools.
00:00:34.749 The project I’m working on is called eBaymag, or eBay for Business, where we help sellers from about 30 countries, including Eastern Europe, Russia, and parts of Asia, sell millions of products. An interesting aspect of this system is that it is fully distributed: almost any user action results in changes to an external system that we can reach only through HTTP APIs.
00:01:01.629 We found that the best way to build such a complex application is by using contracts. When I talk about the contracts approach, I don’t mean a specific library or gem; I’m referring to a broader concept built on three kinds of rules: preconditions, postconditions, and invariants. Let me show you how this can be applied. Imagine that any functional block in your system can be controlled: you apply rules to its input, rules to its output, and rules to the state that changes while the block runs.
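To make that idea concrete, here is a minimal sketch (not the talk's production code) of a functional block guarded by a precondition on its input, a postcondition on its output, and an invariant check on surrounding state; all names here are hypothetical:

```ruby
# A minimal sketch of the contracts idea: every rule is a callable
# returning true/false, and the block only runs when the rules hold.
class GuardedOperation
  PreconditionFailed  = Class.new(StandardError)
  PostconditionFailed = Class.new(StandardError)
  InvariantViolated   = Class.new(StandardError)

  def initialize(pre:, post:, invariant:)
    @pre, @post, @invariant = pre, post, invariant
  end

  # Runs the block only if the input rule holds, then checks the
  # output rule and the state rule after the block has finished.
  def call(input)
    raise PreconditionFailed unless @pre.call(input)
    output = yield(input)
    raise PostconditionFailed unless @post.call(output)
    raise InvariantViolated   unless @invariant.call
    output
  end
end

# Usage with made-up parcel rules:
op = GuardedOperation.new(
  pre:       ->(parcel) { parcel[:weight].positive? },
  post:      ->(result) { result.key?(:tracking_number) },
  invariant: -> { true } # e.g. verify that internal counters still balance
)
op.call({ weight: 2.5 }) { |_parcel| { tracking_number: "ABC123" } }
```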
00:01:43.960 If any of those rules are broken, you run into execution problems inside the functional block, because you no longer understand how to proceed with the execution. That is essentially what the contracts approach is about. Today we have an application that fully follows this approach, but it was not structured that way from the beginning, so let me share the story of how it evolved.
00:02:26.440 First we needed input/output control, then state control, and finally a general design for all of this control logic. The story dates back several years, when our sales team came to us and said, 'You know, we have many orders that need to be sent through DHL, and it would be great if we could register them automatically with a button click in the main app.' So we said, 'No problem! Let's take a look. It seems pretty simple: we have an API client, a business domain, and an external system. Let's wire them together and get started.'
00:02:58.450 We deployed it but soon realized that, in the real world, DHL behaves as a complex external system with peculiarities that we could not predict. None of this was specified in the documentation. At this point, we understood that having already deployed the application, the only way to learn how to handle it correctly was to debug it in production. The first chapter of our journey is termed 'Standalone Policy' because, in order to operate correctly, we introduced a variety of validations between the business domain and the API client. Each validation can be viewed as a policy object.
00:04:05.319 Additionally, to gather this knowledge effectively, we introduced a concept we termed 'contract.' I was inspired by Netflix's idea of chaos engineering, which suggests that production isn't as terrifying as it seems. We implemented a tagging system for each response and request between our systems, where each tag serves as a marker denoting the type of information we have about the communication session. If there was a situation where we couldn't understand how to process this request and response, we marked it with 'unexpected behavior.'
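As a rough illustration of the tagging idea (only the 'unexpected behavior' marker comes from the talk; the other tag names are made up), a session could be classified like this:

```ruby
# A toy sketch: every HTTP session gets a tag describing how well we
# understood it. Anything that matches no known rule is flagged for
# investigation as "unexpected_behavior".
module SessionTagger
  KNOWN_ERROR_CODES = [400, 422].freeze # codes our policies already handle

  def self.tag_for(response)
    return "expected_success" if response.status == 200
    return "expected_error"   if KNOWN_ERROR_CODES.include?(response.status)

    "unexpected_behavior" # nothing matched, so a human should look at it
  end
end
```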
00:04:49.510 For each request type (for example, a tariff request to DHL), whenever we started getting responses flagged as unexpected behavior, we needed to add logic somewhere in the system. So we created various policy objects, first as validations of the data we intended to send, and then as pre-processing logic for what DHL actually expects from us.
00:05:30.850 Next, let me show you the policy object. The policy object is simply a tool we built at our company, called tram-policy. It is a thin abstraction over errors and works much like Active Record validations, but with no dependencies except dry-initializer. You simply list the options or parameters you expect and then define the methods that act as validations. A validation method adds entries to an error list if anything is wrong.
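For illustration, a minimal policy in the spirit of tram-policy might look like this; the DHL-specific rules are made up for the example:

```ruby
require "tram-policy"

# A hypothetical readiness check before we send a parcel to DHL.
class DhlReadinessPolicy < Tram::Policy
  option :parcel

  validate :weight_present
  validate :country_present

  private

  def weight_present
    return if parcel.weight.to_f.positive?
    errors.add "Weight must be positive", field: "weight"
  end

  def country_present
    return unless parcel.country.to_s.empty?
    errors.add "Country is required by DHL", field: "country"
  end
end

parcel = Struct.new(:weight, :country).new(2.5, nil)
policy = DhlReadinessPolicy.new(parcel: parcel)
policy.valid?  # => false
policy.errors  # => lists the missing country
```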
00:06:13.450 After passing through several validations, we call a class called TariffMapper. The tariff mapper is nothing more than a dry-initializer class: it takes some input and generates a hash that can easily be converted to XML, JSON, or whatever format is needed. You’ll notice there is some default logic, since sometimes we need to alter certain values when mapping our business domain models to DHL to make sure DHL will accept them.
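A sketch of such a mapper, as a plain dry-initializer class; the field names and defaults are illustrative, not DHL's real schema:

```ruby
require "dry-initializer"

# Turns our domain data into the hash an external carrier expects.
class TariffMapper
  extend Dry::Initializer

  option :parcel
  option :currency, default: proc { "EUR" } # hypothetical default the carrier requires

  def call
    {
      "Weight"   => parcel.weight.round(2),
      "Currency" => currency,
      "Country"  => parcel.country.upcase
    }
  end
end

parcel = Struct.new(:weight, :country).new(2.499, "de")
TariffMapper.new(parcel: parcel).call
# => { "Weight" => 2.5, "Currency" => "EUR", "Country" => "DE" }
```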
00:06:52.440 This solution proved effective, but we recognized that it was quite complex. The underlying problem was that we were missing a level of abstraction: all of this logic lived inside service objects, so we lost clarity about the purpose of the code. In essence, we didn’t quite understand what it was, so we simply labeled it a service object. We got the integration working, but we postponed addressing the core issue.
00:07:31.700 Under the hood, this used dry-initializer and tram-policy in what we refer to as contracts, and it all grew out of that first real integration. One day, our sales team approached us again and said, 'In Russia, things work a bit differently. Sellers there use Russian Post for their shipments and send thousands of parcels. We have millions of parcels ready to ship, and sellers would prefer not to register such a high volume manually by clicking a button thousands of times. Could we automate this process with Russian Post?'
00:07:59.220 So we began to consider the differences. The key change was that we needed to execute multiple operations as a single transaction against the external system. If something failed, we didn't want Russian Post to be left with traces of an incomplete transaction that could cause miscommunication between the systems. This leads us to the next chapter: sagas. A saga is simply a fancy term for a business transaction, or distributed transaction. When you don’t have access to the source code of the services you depend on, there’s an approach called orchestration, which says that every time you communicate with an external system, you should also implement a rollback for that step.
00:09:08.730 Orchestration can be thought of much like a traditional transaction, with methods for rolling back and retrying. If every step executes successfully, you continue; if not, you roll back the completed actions in reverse order. For us, each step was essentially a pair of methods: one to perform the action and one to roll it back. We created a class we call 'Pipeline,' where on the left side we receive a parcel as input, and on the right side the output is also a parcel, which is either the initial parcel or one successfully registered with all the necessary details.
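A minimal sketch of the orchestration idea in plain Ruby, assuming each step object responds to #call and #rollback (the step classes named in the comment are hypothetical):

```ruby
# Runs steps in order; if one raises, rolls back the completed steps
# in reverse order and re-raises the original error.
class Orchestrator
  def initialize(steps)
    @steps = steps
  end

  def call(parcel)
    done = []
    @steps.each do |step|
      step.call(parcel)
      done << step
    end
    parcel
  rescue StandardError => error
    done.reverse_each { |step| step.rollback(parcel) }
    raise error
  end
end

# Orchestrator.new([CalculateTariff.new, CreateBatch.new, RegisterParcel.new]).call(parcel)
```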
00:10:11.640 Along the way, we executed different tasks like obtaining transportation costs and shipping directions, all relying on communication between systems. If something fails, we simply perform a rollback. This raised an interesting question: why isn’t there a library for this? I wrote one, but it isn't quite ready yet. I also started thinking about what happens in the real world when we hit a runtime error or an unexpected failure somewhere in the chain that prevents the transaction from completing.
00:11:03.040 This is particularly relevant since we run our application on Kubernetes, where any pod can be restarted at any moment, and as you might expect, this does happen from time to time. I won't go into too much detail, but there are often thousands of processes running in parallel, so background executions can be interrupted or reordered unpredictably and leave state in a mess. As a result, we started exploring how to handle these situations.
00:11:56.620 The solution in our Ruby code is a class called Pipeline. It uses method chaining to list the operations we plan to run as one transaction; after building the chain, you initialize it with the subject and call it. What’s interesting here is that we avoid conditional statements and instead rely on status objects with payloads as output: regardless of whether the run is a success or a failure, we always get back a status object.
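Here is a self-contained sketch that approximates the interface described: operations are declared at the class level, and #call always returns a status object instead of raising or branching. This is my approximation, not the unreleased library itself, and the DSL names are guesses:

```ruby
Status = Struct.new(:success, :payload, :errors) do
  def success?
    success
  end
end

class Pipeline
  def self.chain(name)
    operations << name
  end

  def self.operations
    @operations ||= []
  end

  def initialize(subject)
    @subject = subject
  end

  # Runs every declared operation in order and always returns a Status.
  def call
    self.class.operations.each { |op| @subject = send(op, @subject) }
    Status.new(true, @subject, [])
  rescue StandardError => e
    Status.new(false, nil, [e.message])
  end
end

class RussianPostPipeline < Pipeline
  chain :calculate_tariff
  chain :register_parcel

  private

  # Both steps are stand-ins for real API calls.
  def calculate_tariff(parcel)
    parcel.merge(tariff: 100)
  end

  def register_parcel(parcel)
    parcel.merge(tracking: "RP123")
  end
end

result = RussianPostPipeline.new({ weight: 1.0 }).call
result.success? # => true
result.payload  # => { weight: 1.0, tariff: 100, tracking: "RP123" }
```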
00:12:57.600 In the case of success, the payload yields a successfully registered parcel, while in the event of an error, we see the errors we encountered. Let’s discuss what occurs under the hood. The solution we found uses event sourcing. For those unfamiliar, event sourcing involves storing events and their outcomes in separate tables. The first table purely retains events and the second tracks the completion status of each event. The intention is that any row created in the events table cannot be modified and retains all necessary information.
00:14:26.100 When we use event sourcing, we can effectively continue the execution from the point we left off prior to any interruption. Therefore, if an action fails or our application restarts in the middle of a chain, we have regular jobs in place that generate a done event with a failure status and attempt to perform a rollback operation. This will happen automatically as needed.
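To picture the two-table layout, here is a hedged sketch of what such a schema might look like in a Rails application; the table and column names are illustrative, and jsonb assumes PostgreSQL:

```ruby
# An append-only events table plus a separate table recording how each
# event finished. A periodic job can then find events without a
# completion row, write a "failure" completion, and trigger a rollback.
class CreatePipelineEvents < ActiveRecord::Migration[5.2]
  def change
    create_table :pipeline_events do |t|
      t.string :pipeline,  null: false  # e.g. "russian_post_registration"
      t.string :operation, null: false  # which step was started
      t.jsonb  :payload,   null: false, default: {}
      t.timestamps
    end

    create_table :pipeline_event_completions do |t|
      t.references :pipeline_event, null: false
      t.string     :status,         null: false  # "success" or "failure"
      t.jsonb      :output,         null: false, default: {}
      t.timestamps
    end
  end
end
```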
00:15:26.200 Notably, even if we start the cleanup process, we will always know which actions were processed and which were not, allowing us to effectively handle such situations. For example, one day our sales team came to us explaining that in many European countries, they prefer using EPS over DHL. The two services appear somewhat similar on the surface, but we quickly recognized that we could not reuse any existing code from our DHL integration since we lacked essential abstractions.
00:16:09.530 That led us to consider how best to organize our validations without relying on service objects. I started discussions within our company, and colleagues with relevant experience suggested we try writing the code in Haskell, since functional programming handles validation very effectively. Once I began learning Haskell, my appreciation for functional programming grew. But I still loved Ruby, and I wanted to bring specific concepts from functional programming into my Ruby projects.
00:17:50.580 The concepts I wanted to introduce relate to composition and error handling. In functional programming, there's a notion of algebraic data types, which gives you two operations for combining types: products (and) and sums (or). We can use the same idea for our validation types: a request contract is simply the combination of two validations, the person contract and, if that succeeds, the schema contract.
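Here is a toy sketch of the product/sum composition in plain Ruby; the Check class and its combinators are hypothetical illustrations, not the talk's library:

```ruby
# A validation is just a predicate; "and" builds a product (both must
# hold), "or" builds a sum (either may hold).
class Check
  def initialize(&predicate)
    @predicate = predicate
  end

  def call(value)
    @predicate.call(value)
  end

  def and(other)
    Check.new { |value| call(value) && other.call(value) }
  end

  def or(other)
    Check.new { |value| call(value) || other.call(value) }
  end
end

person_contract = Check.new { |req| req[:person].is_a?(Hash) }
schema_contract = Check.new { |req| req.key?(:schema) }

request_contract = person_contract.and(schema_contract)
request_contract.call({ person: {}, schema: "v1" }) # => true
```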
00:19:11.440 The 'or' side is also useful, for example to express that a result is either a recoverable input error or an invalid request. By combining these validation types we can express complex business logic clearly. Now, how are these validations implemented? They use 'refinement types,' which are just a combination of a type and a predicate. For example, you can specify that an integer must be greater than five, or apply it to business logic, like validating that a certain record really represents a cat in our system.
00:20:37.120 In practical terms, we can use a refinement type to handle our response validations. Think of it as a box with a label saying it should contain a cat: we validate the rules about what it actually contains. If we succeed in unpacking it, we get an object that represents, say, a parcel or a user; but if someone put a dog inside, we get a validation error saying that the expected item differs from the actual one.
00:21:28.340 Our implementation has two methods: a match method that gathers the validation context, and an unpack method that resolves the refinement type. So when dealing with JSON responses, if validation succeeds we return the object; if the requirements are not met, we get the error, and for more advanced cases we can wrap it in another refinement type. Policy objects plug into this the same way, so everything is organized homogeneously.
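A small sketch of the match/unpack idea as described, again assuming hypothetical names rather than the real library:

```ruby
# A refinement type is a name plus a predicate: #match collects the
# validation context, #unpack resolves to the value or a failure.
class RefinementType
  Failure = Struct.new(:errors)

  def initialize(name, &predicate)
    @name, @predicate = name, predicate
  end

  def match(value)
    @value  = value
    @errors = []
    @errors << "#{@name}: unexpected value #{value.inspect}" unless @predicate.call(value)
    self
  end

  def unpack
    @errors.empty? ? @value : Failure.new(@errors)
  end
end

cat = RefinementType.new("Cat") { |animal| animal[:says] == "meow" }

cat.match({ says: "meow" }).unpack # => { says: "meow" }
cat.match({ says: "woof" }).unpack # => Failure with a validation error
```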
00:22:30.040 What's compelling about our contracts solution is that we no longer rely on separate tagging logic; we now use the names of particular refinement types for requests and responses. The contract first performs input validation; if that passes, we execute the block where the actual HTTP call happens, and the result of the block is then wrapped in another refinement type. This way, our contracts consistently yield a refinement type.
00:24:24.370 Interestingly, under the hood the base contract uses a gem called Sniffer, which captures requests and responses, including the full HTTP session. This ties back to our initial goal of tracking any mismatched requests: whenever a contract fails, we log the captured session, so if we encounter unexpected behavior during parsing we can promptly identify whether it is a parsing failure or a contract failure and address it accordingly.
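For reference, basic usage of the Sniffer gem looks roughly like this; check the gem's README for the authoritative API, as this only reflects its basic documented flow:

```ruby
require "sniffer"
require "net/http"

# Capture outgoing HTTP traffic so a failed contract can log the
# full request/response pair of the session.
Sniffer.enable!

Net::HTTP.get(URI("https://example.com/tariff")) # any instrumented HTTP call

session = Sniffer.data.last # the captured HTTP session
session.to_h                # request and response as a hash, ready to log
```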
00:25:46.220 In summary, we've evolved our system step by step: we developed ways to control inputs and outputs, regulated state management via sagas, and figured out how to organize validations. However, we still have work to do before releasing the pipeline module. We understood that we shouldn't release an unfinished gem prematurely, especially given the complexities around locking and event sourcing.
00:27:01.360 As soon as we finalize those tools, we will open source the pipeline and present it as a seamless solution for Rails applications, allowing for distributed transactions in the context I’ve shared today.
00:27:31.410 Lastly, I compiled a list of books and topics that have inspired my journey. I highly recommend 'Category Theory for Programmers' for those keen on exploring functional programming. If you’ve always hesitated to learn Haskell, 'Learn You a Haskell for Great Good' is a fantastic starting point.
00:27:58.810 Finally, 'Chaos Engineering' is a key concept to explore, derived from Netflix’s approach that encourages us not to fear debugging in production. Thank you! Here are some links where I will share the libraries as soon as they are released, so stay tuned on our blog.
00:28:28.060 You can also follow me on Twitter and other social networks for updates. Thank you!