00:00:08.960
Hello, everyone. Today, we're going to talk about distributed systems.
00:00:23.439
By a show of hands, how many people here are currently working on some sort of distributed system, or considering working on one? That's quite a lot of us.
00:00:34.560
We'll use a loose definition of a distributed system as multiple applications working together. Coming from our shared background, we're all very familiar with the significant benefits that Ruby on Rails conventions offer us. They allow us to move quickly because we can start with sensible defaults and only deviate when necessary for a specific application.
00:00:52.560
The company I work for, MX, has been building a distributed system composed of Rails applications for the last three and a half years. We've learned a lot of valuable lessons along the way, and we've turned the most valuable insights into conventions that guide our team's work. Today, we’ll share some of those lessons.
00:01:12.799
Rather than handing you a simple list of things you should do, whose value would be hard for you to assess for your own team or system, we will discuss three key ideas. First, we're going to collaboratively build a sample project, something that represents the typical experience of a junior engineer at MX.
00:01:34.080
Your first job, as a new engineer, will be to delete user data whenever a user is removed from the system. This sample project should help illustrate our experiences, the pain points we faced, and the lessons we learned, making this transferable knowledge to your own context.
00:01:50.480
Our system consists of 18 Rails applications running together in production. Along the bottom, in green, we have what we call front-end applications. These applications deal with public HTTP interfaces but don't manage their own databases, delegating that responsibility to our core services.
00:02:01.520
Each of these blue core services manages a database of information. On the sides, we have periphery services. The services on the left focus on aggregating and analyzing data, while those on the right help us connect to banks and credit unions to retrieve information.
00:02:19.200
Now, when we instruct you to clean up user data across these 18 applications, you might feel overwhelmed. You might consider reading through every README for all repositories to understand where to begin, but that's simply going to be impossible. So, here's our first pro tip: create a directory for your distributed system.
00:02:43.919
We wanted our directory to meet two primary criteria. Firstly, it should be easy for humans on the team to identify where to look for something. Secondly, outdated documentation can create frustration and confusion, so it’s crucial that we keep the documentation updated. It’s important that the production codebase uses the same directory to achieve its functionality. If documentation is outdated, it will lead to broken code, and you'll know about it.
00:03:05.760
To achieve this, we utilize an internal gem called Atlas. Atlas houses a set of Protocol Buffers definitions. Protocol Buffers is a data serialization format developed by Google, and while there are other options available, we have found notable benefits using Protocol Buffers. However, this approach isn’t strictly required.
00:03:20.719
One core idea in your directory is to define how to divide responsibilities within your system. Each application should be responsible for a designated set of resources, which helps streamline our communication. During the talk, you will hear us mention the terms 'application' and 'resource' frequently.
00:03:40.639
For example, if you want to learn about the user resource, you would check the directory for the application that manages users. This will lead you to virtually all the information you need about users. Having it as a private gem allows us to version it using semantic versioning.
00:04:09.440
This versioning gives clues to team members about whether a change is backward compatible or not, using cues that any developer familiar with gems already understands. Let’s look at an example. If I search the Atlas project for ‘user,’ I will find a user.proto file. This immediately indicates that users are managed by the 'Amigo' application, as denoted by the directory.
00:04:29.360
Looking at the file, a definition of what a user looks like will be defined there, illustrating how a user is represented as it flows between the applications in our system. At the bottom of the file, you'll see the definition of an RPC service. Protocol Buffers defines RPC by outlining various services, each containing its own RPC calls.
00:04:55.600
It’s essential to establish that each resource must have a set of RPC calls, which are owned by a single service or application. Generally, you’ll find that each resource holds specific RPC calls to execute its functions.
00:05:13.600
Each RPC includes a name, a request type, and a response type. While static types may seem counterintuitive coming from Ruby, statically typed messages prove beneficial at a system boundary; proper typing has its time and place.
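A user.proto along these lines illustrates the structure being described. The package name, field numbers, and request/response shapes are illustrative assumptions; the talk only mentions a user with a guid and a name, and a user search call:

```protobuf
syntax = "proto3";

package atlas;

// Users are owned by the Amigo application.
message User {
  string guid = 1;
  string name = 2;
}

message UserSearchRequest {
  string guid = 1;
}

message UserSearchResponse {
  repeated User users = 1;
}

// The RPC service at the bottom of the file: each call has a name,
// a request type, and a response type.
service UserService {
  rpc Search (UserSearchRequest) returns (UserSearchResponse);
}
```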
00:05:39.840
Designing an API often means that it won't stay the same forever. It’s chaotic when changes are necessary. Thus, establishing upfront which changes will be backward-compatible is vital. For changes that aren't backward-compatible, creating a deprecation life cycle allows your system to evolve gradually, without requiring a total overhaul of the communications every time you modify a message.
00:06:12.000
Protocol Buffers handle this exceptionally well, defining semantics for which migrations are backward compatible, which is crucial. Over time, our teams have grown to appreciate this upfront structure. Every engineer knows if their changes will create issues for the message consumers.
00:06:35.600
We'll proceed with a scenario: imagine we released version 1.0.0 of the Atlas gem, which contains a user definition with just a guid and a name. If my application sends a user search request to the Amigo service, it'll receive a 1.0.0 user object containing just those fields.
00:06:48.560
Next, suppose the product team wants to add an email field for users, resulting in an updated 1.1.0 version of Atlas. If another application, like Newman, continues using the old version, it will still retrieve a user object from the API, but it will be missing the new email field.
00:07:07.760
Adding an email field does not break compatibility, and this situation highlights our desire for a 'tolerant reader' principle, so changes don’t disrupt communications between services. Each team should follow a clear convention around these interactions to maintain consistency.
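The tolerant-reader idea can be sketched in plain Ruby. This is a minimal stand-in, not how Protocol Buffers actually decodes messages (protobuf handles unknown fields for you); the payload shape and field names are illustrative:

```ruby
require "json"

# A tolerant reader plucks only the fields it knows about and ignores
# everything else, so upstream additions are harmless.
class UserReader
  KNOWN_FIELDS = %w[guid name].freeze

  def self.read(json)
    payload = JSON.parse(json)
    payload.slice(*KNOWN_FIELDS) # unknown keys (e.g. "email") are dropped
  end
end

# An Atlas 1.1.0 publisher added "email"; a 1.0.0-era reader still works.
new_payload = { guid: "abc-123", name: "Jane", email: "jane@example.com" }.to_json
UserReader.read(new_payload) # => {"guid"=>"abc-123", "name"=>"Jane"}
```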
00:07:30.560
However, later, if the product team decides to remove the name field from the user object and introduce separate first and last name fields, this version change from 1.1.0 to 2.0.0 becomes a breaking change. Newman would depend on the old name field, which would now be absent.
00:07:53.360
To address this shift, we could implement a deprecation strategy. With version 1.2.0, the system would mark the old field as deprecated while concurrently including the new fields. This way, when Newman queries a user, it can still access the old name field while being informed of the upcoming removal.
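A plain-Ruby sketch of that 1.2.0 transitional shape: the old field still works, but reading it emits a deprecation warning. The warning mechanism here is illustrative, not the gem's actual behavior:

```ruby
# The 1.2.0 user: new first_name/last_name fields exist alongside the
# deprecated name field, so old consumers keep working while being warned.
class User
  attr_reader :guid, :first_name, :last_name

  def initialize(guid:, first_name:, last_name:)
    @guid = guid
    @first_name = first_name
    @last_name = last_name
  end

  # Deprecated: removed in Atlas 2.0.0. Use first_name/last_name instead.
  def name
    warn "[DEPRECATION] User#name is deprecated; use first_name/last_name"
    "#{first_name} #{last_name}"
  end
end

user = User.new(guid: "abc-123", first_name: "Jane", last_name: "Doe")
user.name # still returns "Jane Doe", but warns on stderr
```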
00:08:08.560
When Newman later updates its dependency, it will receive a deprecation warning during a bundle update, prompting necessary adjustments before the old field is removed entirely.
00:08:23.560
This method allows teams to effectively communicate changes and warnings without necessitating meetings or global emails to address deprecations. It is wiser to keep these decisions localized within each team as they advance with their application.
00:08:43.280
Now, what do we want from our directory? Ideally, we want to easily find the information we're interested in, trust its accuracy since it should support programmatic use, and have a clear plan for making changes without causing disruptions.
00:09:04.280
We need to be mindful of what semantics surround changes. It’s important to know when something is likely to break, providing control over the necessary actions.
00:09:16.320
With our directory in place, let’s shift focus to the responsibilities for cleaning up associated user data. Whenever a user is deleted, it affects multiple other applications.
00:09:28.480
Every application operating within the system needs to clean up or take action regarding that user. Similar to our stated goal of decoupling services, we should establish a straightforward way to handle these events.
00:09:45.280
This is important to ensure that when one service, like Amigo, removes a user, it doesn’t necessarily need to be aware of the interconnected relationships in every application downstream.
00:10:06.080
Instead, we can use an event to inform relevant systems asynchronously. Whenever Amigo deletes a user, it can broadcast a message notifying everyone else about the deletion, and each application performs the proper cleanup on its end.
00:10:23.440
A system where each application independently subscribes to the events they care about will help maintain independence and modularity. This will also enable us to introduce features in the future without breaking existing functionalities.
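Here is a minimal in-process stand-in for that broker, showing the shape of the idea: applications opt in to the events they care about, and the publisher never needs to know who is listening. The application names come from the talk; the API is illustrative, not ActionSubscriber's:

```ruby
# A tiny pub/sub bus: subscribers register handlers per event name,
# and a publish fans the payload out to every registered handler.
class EventBus
  def initialize
    @subscriptions = Hash.new { |h, k| h[k] = [] }
  end

  def subscribe(event, &handler)
    @subscriptions[event] << handler
  end

  def publish(event, payload)
    @subscriptions[event].each { |handler| handler.call(payload) }
  end
end

bus = EventBus.new

# Each downstream application registers its own cleanup independently.
cleanups = []
bus.subscribe("user.deleted") { |user| cleanups << "newman cleaned up #{user[:guid]}" }
bus.subscribe("user.deleted") { |user| cleanups << "analytics cleaned up #{user[:guid]}" }

# Amigo deletes the user and simply broadcasts the fact.
bus.publish("user.deleted", guid: "abc-123")
```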
00:10:43.839
When implementing an event system, loose coupling is crucial. Keeping a relaxed relationship between components allows teams to move at different speeds, ensuring individual applications can progress on their own update cycles.
00:11:02.839
An effective event system should feel like a natural extension of Rails. We plan to implement this using RabbitMQ. While RabbitMQ is just an example, the core philosophy stands—the event mechanisms must feel intuitive and cohesive with the existing Rails conventions.
00:11:25.960
Each application will listen for events published by the source service when significant changes occur. This avoids tightly coupling components and maintains independence in how each service decides to respond.
00:11:44.640
RabbitMQ serves another purpose; it can help with load distribution and redundancy by managing which applications receive copies of published messages. This prevents multiple nodes from processing the same event.
00:12:01.760
It’s essential to note that not every application may wish to receive all messages—this necessitates implementing an opt-in system. As our architecture grows, we can streamline which services should receive which notifications.
00:12:21.680
RabbitMQ simplifies this process, ensuring that multiple instances receive exactly the messages they are interested in, leading to operational efficiency. This avoids potential confusion over the processing of redundant events.
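The delivery model can be sketched as follows, again as an in-process stand-in rather than real RabbitMQ: each application that opts in gets its own queue, and multiple instances of the same application compete for messages on that queue, so each event is processed exactly once per interested application:

```ruby
# Each (app, event) pair gets one "queue"; instances of the same app
# round-robin on it, so an event is handled once per subscribed app.
class Broker
  def initialize
    @queues = Hash.new { |h, k| h[k] = { instances: [], next_up: 0 } }
  end

  # One instance of `app` opts in to `event` with a handler.
  def bind(app, event, &handler)
    @queues[[app, event]][:instances] << handler
  end

  def publish(event, payload)
    @queues.each do |(_app, bound_event), queue|
      next unless bound_event == event        # opt-in: unbound apps see nothing
      instances = queue[:instances]
      handler = instances[queue[:next_up] % instances.size]
      queue[:next_up] += 1                    # round-robin within the app
      handler.call(payload)
    end
  end
end

broker = Broker.new
log = []
broker.bind("newman", "user.deleted") { |p| log << "newman-1: #{p[:guid]}" }
broker.bind("newman", "user.deleted") { |p| log << "newman-2: #{p[:guid]}" }
broker.bind("analytics", "user.deleted") { |p| log << "analytics: #{p[:guid]}" }

broker.publish("user.deleted", guid: "abc-123")
# one Newman instance handles it, and Analytics handles it once
```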
00:12:43.680
The goal is to ensure that when changes occur, there’s a clear path for other systems to respond without being inundated by unnecessary noise. By keeping the messaging layer efficient, we find ourselves in a sustainable environment.
00:13:01.920
To encapsulate these interactions, we abstract RabbitMQ behind an internal gem we've created called ActionSubscriber, which lets our Ruby applications connect and communicate through message queues.
00:13:25.920
ActionSubscriber simplifies the interactions with the RabbitMQ framework, allowing us to interface with it seamlessly from our Rails applications. If you find it intriguing, feel free to reach out and explore this with our team.
00:13:49.200
As we define how these messages are structured and sent, we keep it familiar for Rails developers. By doing so, we maintain a coherence that eases adoption across teams. Keeping the message definitions and consensus around the types helps foster clarity in communication.
00:14:06.800
This will ultimately lead to actionable responses that integrate smoothly into our application's business logic. Each system can handle incoming messages as though they were performing ordinary synchronous operations.
00:14:24.360
Hydrating these messages into understandable objects lets engineers interact with them the way they'd engage with params in a Rails controller. Middleware decodes the messages so each service receives the event in a familiar format.
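The shape of that subscriber pattern might look like this. To be clear, this is not ActionSubscriber's actual API, just a self-contained sketch of the idea: a decoding step hydrates the raw message, and the handler method receives a payload the way a Rails action receives params:

```ruby
require "json"

# Base class standing in for the framework: decode the raw message
# ("middleware"), then dispatch to a handler named after the event.
class Subscriber
  attr_reader :payload

  def handle(raw_message)
    @payload = JSON.parse(raw_message)   # middleware decoding step
    public_send(@payload.fetch("event"))
  end
end

class UserSubscriber < Subscriber
  def deleted
    # Business logic reads payload just like controller params.
    "cleaning up data for user #{payload['guid']}"
  end
end

UserSubscriber.new.handle({ event: "deleted", guid: "abc-123" }.to_json)
# => "cleaning up data for user abc-123"
```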
00:14:43.040
The publishing of events also needs to be properly structured. Amigo, as the publisher, is responsible for notifying any interested applications about events like user creation, updates, and deletions. This will facilitate a flexible ecosystem that speaks the language of Rails development.
00:15:03.120
Eventually, the goal is for this gem to move into the public domain, thereby allowing its use beyond our team. Should you have thoughts about how its API should operate, I welcome those conversations.
00:15:21.760
Whatever messaging technology you choose for your distributed system, the overarching concern always comes back to design principles: structures that remain robust and maintainable.
00:15:43.720
As we further refine our event-handling strategies, teams can manage these efficiently without tying everything back together, promoting freedom in how different elements communicate within the ecosystem.
00:16:05.200
Each aspect of building a system like this can seem overwhelming, but our focus here is to build frameworks that reduce coupling and enhance cohesion. This leads to a system that continually evolves while maintaining stability.
00:16:27.280
For all the developers looking to implement event-driven systems, consider embracing patterns that feel natural with your existing technology stacks. This should feel almost like a familiar function, rather than forcing novel practices that can complicate the integration.
00:16:48.120
The transition process is crucial—we want to ensure teams understand how effective communications enhance their practices. This groundwork makes it accessible for teams to adapt quickly.
00:17:12.480
Summing up, it’s vital to instill practices that enhance communication and collaboration across teams, while balancing the necessity for independence in their workflows. Effective messaging, proper organization, and low coupling promote a healthy distributed environment.
00:17:37.760
Lastly, let’s reflect on our objectives as we create efficient systems and return to deploying code. By fostering a culture that embraces these events, your team will engender a responsiveness and adaptability that proves beneficial as systems evolve.
00:18:03.280
Always remember the importance of small, frequent changes to keep deployments low-risk. Foster collaboration, enjoy the development process, and lean on each other for support.
00:18:28.480
In closing, we’ve covered different aspects of maintaining effective communication and deployment conventions while transitioning towards event-driven architectures. These insights will guide teams moving forward.
00:18:50.960
The essential takeaways: create a directory to streamline resource access, ensure you publish significant events to facilitate communication, and use small deployments to minimize risk.
00:19:12.960
As we transition away from monolithic systems, remember that change begins at the team level. These interactions matter just as significantly as the technology stack itself.
00:19:34.640
Integrating mentorship and continued learning will be pivotal as teams grow more accustomed to distributed architectures, facilitating a culture of improvement.
00:20:02.360
Thank you for being a part of this discussion, and for your engagement throughout. Let's continue fostering our learning on these topics!
00:20:22.720
If any questions arise or you’d like to chat about implementing these ideas, I’m here for you!