Some Funny Things Happened on The Way to A Service Ecosystem

by Chris Hoffman

In the video, titled "Some Funny Things Happened on The Way to A Service Ecosystem," Chris Hoffman discusses the transition from a monolithic architecture to a service-based ecosystem, sharing experiences from his work at Optoro, an e-commerce company. He reflects on the challenges faced during the shift, emphasizing the importance of learning from mistakes in adopting microservices. The presentation covers several key insights:

- Monolith to Services: Hoffman introduces the concept of moving from a monolith to a service architecture, sharing background about Optoro and illustrating the difficulties encountered in implementing services, such as issues with authentication (auth) and inter-service communication.

- Learning from Mistakes: He emphasizes that their early attempts were fraught with errors, particularly in managing shared models, query optimizations, and latency between services, which ultimately led to project cancellations.

- Data Sharing Challenges: Hoffman discusses how sharing data between services caused serious complications. Attempts to synchronize data often led to inconsistency, so he advocates a model in which services broadcast changes instead.

- Service Development Insights: The speaker advises starting with small, manageable service projects that add functionality instead of extracting it from the monolith, and highlights the need for project visibility to gain organizational support.

- Technical Approaches: Specific strategies for implementing services are outlined, including a recommendation to avoid complex, asynchronous protocols in favor of simpler HTTP-based communication.

- Operational Ownership: Hoffman asserts that developers should take ownership of operations in a microservices environment, advocating for conventions that empower engineers to manage production incidents effectively.

In conclusion, Hoffman shares several important takeaways for those looking to implement microservices, including the need to:

- Avoid Active Record callbacks to simplify data extraction.

- Focus on projects that deliver noticeable benefits to gain political support within organizations.

- Establish strong operational practices and conventions to streamline deployment and maintenance of services.

This talk aims to prepare teams for the potential pitfalls they may encounter on their journey to a service ecosystem, providing actionable insights gained from real-world experiences at Optoro.

00:00:10.880 Hi, I'm Chris Hoffman, and we're going to talk about services, specifically how to move from a seemingly intractable monolith to an ecosystem of services that you can operate well and sustainably. I work at Optoro; we're an e-commerce firm headquartered in DC. The only thing that's interesting about us is how we source our inventory: we aggregate retail returns.

00:00:25.980 If you're Best Buy, Target, or some other major retailer, you turn over your return merchandise to us. We test and grade everything that comes through our warehouse and determine, based on the client's recovery goals, whether we should resell it individually, in bulk, donate it, or recycle it. We sell items on a variety of online marketplaces, such as Amazon, Best Buy, eBay, and our own marketplace called Blinq.

00:00:44.789 Things ending in 'q' will happen again. I don't know why we end things in 'q' even after five and a half years; we just do. That's about all you need to know about the business model. So, I've been at Optoro for five and a half years. We still have a monolith, but we are increasingly moving towards an ecosystem of services around it.

00:01:08.580 Over that time, we've undertaken several service projects, and I've been involved in a bunch of them. They weren't all successes, so today I will talk about three of them, give you enough context about what we were trying to do, what we did wrong, what we did right, and finish up by providing some advice on what to do before starting any service projects.

00:01:30.030 I started in 2012 and walked into a company that was already trying to figure out how to break apart our monolith. These conversations were happening when I got there, but we didn't actually give it a go until 2013. We started with authentication and authorization because that's easy, right? Well, not so much. With any multi-tenant software, authentication and authorization are never just about permissions. It's always more complex.

00:01:52.799 In our case specifically, authorization meant knowing not just what permissions a user has based on their user group but also who they are employed by and what warehouse they’re assigned to. Our software runs on hardware provided in our clients' warehouses, which presents an immediate problem for extracting things to an authentication service.

00:02:04.920 It’s very easy to identify that the user group permission model should be owned by authentication and not inventory. However, we faced a challenge across network boundaries, as the authentication service still needed to know the client warehouse to authorize a request. This relationship is vital because warehouses are central to our inventory management system.

00:02:34.051 Initially, we didn’t know nearly as much about distributed systems as we do now, so we didn’t resolve this data-sharing problem adequately. Instead, whenever inventory needed to authorize a request, it would ping the authentication service asking, 'Hey, is this person allowed to do this?' The first result was a completely locked-up application in staging.

00:03:02.319 When you block a web request that requires making another web request, the first thing you realize is that too many web requests at the same time can cause your service to fail. Eventually, we learned this lesson in production. The latency introduced by this architecture was unacceptable for our system. It transformed a zero-network transaction request into multiple network requests.

00:03:45.857 Even with dedicated inventory instances, the performance penalty was an issue; thus, we abandoned the project. While it's amusing to share these ghost stories with diagrams, it's essential that you actually learn something from my experiences so that you can avoid similar mistakes.

00:04:06.992 The first takeaway I noticed immediately was that callbacks are not your friend. Their primary purpose seems to be hiding code, and if they're hiding code that does more than modify that model, it's not a good thing. After this project at Optoro, we decided to eliminate Rails callbacks. We developed other patterns for managing persistence actions and try to forgo those callbacks whenever we can.

00:04:28.689 In defense of callbacks, they do handle transaction management for you, ensuring that all callbacks related to a particular save occur within a transaction, but this behavior can be replicated without them. It’s essential to possess a solid understanding of transactions and transaction isolation, as these concepts will only become more important as you scale up to a larger service ecosystem.
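As a rough illustration of replacing callbacks with explicit transaction handling, here is a minimal sketch in plain Ruby. It is not Optoro's code: `MemoryStore` is a hypothetical in-memory stand-in for a database so the example runs without Rails, and `CreateWarehouse` is an assumed name for a service object. In a Rails application the same shape would typically wrap its steps in `ActiveRecord::Base.transaction`.

```ruby
# Hypothetical in-memory stand-in for a database, so the sketch runs without Rails.
class MemoryStore
  attr_reader :rows

  def initialize
    @rows = []
  end

  # Run the block atomically: restore the previous rows if anything raises.
  def transaction
    snapshot = @rows.map(&:dup)
    yield
  rescue StandardError
    @rows = snapshot
    raise
  end

  def insert(row)
    @rows << row
  end
end

# Service object (assumed name): every persistence step is listed in one
# visible place instead of being spread across save callbacks.
class CreateWarehouse
  def initialize(store)
    @store = store
  end

  def call(name:, client_id:)
    @store.transaction do
      @store.insert({ type: :warehouse, name: name, client_id: client_id })
      # A failure partway through rolls back the warehouse row as well.
      raise ArgumentError, "client_id is required" if client_id.nil?

      @store.insert({ type: :audit_log, message: "created warehouse #{name}" })
    end
  end
end
```

The point of the design is that a reader can see, in `call`, everything that happens when a warehouse is created, which is the visibility that callbacks tend to hide.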

00:05:03.850 However, even if you’ve eliminated callbacks, data sharing remains challenging. How do you determine which service owns shared models? We punted on this question and ended up establishing a second monolith, realizing only years later that the architecture of a distributed system has to be thought through up front.

00:05:37.990 Initially, we thought that warehouses and clients belonged to inventory, but in reality, they should have been owned by the authentication service. The right approach would be to have the authentication service manage warehouses and clients, with both the authentication and inventory services having their own databases.

00:06:01.679 The authentication service would handle requests for creating new clients, warehouses, and users, creating a smooth user experience since everything would happen in one place. However, inventory still needs access to that information to process requests for scanning units. We had to ensure that the authentication service notified inventory whenever there were changes related to clients or warehouses.

00:06:38.460 We needed a mechanism so that inventory wouldn't freak out if clients’ or warehouses’ statuses changed. We envisioned a data replication method rather than traditional caching or synchronization, where the data doesn’t need to remain identical across all platforms. If you were present during the previous talk, this could be one of the most primitive examples of event sourcing.
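The notification flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the system from the talk: `EventBus` is a hypothetical in-process stand-in for a real message broker, and the event name `warehouse.updated` and the `active` field are invented for the example.

```ruby
# Hypothetical in-process event bus standing in for a real message broker.
class EventBus
  def initialize
    @subscribers = Hash.new { |hash, key| hash[key] = [] }
  end

  def subscribe(event_name, &handler)
    @subscribers[event_name] << handler
  end

  def publish(event_name, payload)
    @subscribers[event_name].each { |handler| handler.call(payload) }
  end
end

# The authentication service owns warehouses and announces every change.
class AuthService
  def initialize(bus)
    @bus = bus
    @warehouses = {}
  end

  def update_warehouse(id, attributes)
    @warehouses[id] = (@warehouses[id] || {}).merge(attributes)
    @bus.publish("warehouse.updated", { id: id }.merge(@warehouses[id]))
  end
end

# Inventory keeps only the fields it cares about and never queries auth.
class InventoryService
  attr_reader :warehouses

  def initialize(bus)
    @warehouses = {}
    bus.subscribe("warehouse.updated") do |event|
      @warehouses[event[:id]] = { active: event[:active] }
    end
  end
end
```

Note that inventory's copy deliberately differs from the original record: it drops the warehouse name and keeps only the status it needs to process requests.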

00:07:05.350 For example, both the shipping and accounting services need to understand the 'unit' concept, but they have different requirements for what that entails. Shipping doesn’t need to know how much was charged for a unit, whereas accounting doesn’t need to know that unit's weight. In our current approach, we designate an 'origin service' for the model we wish to share among multiple services.

00:07:43.929 Typically, this is identified at the start of the lifecycle. For instance, in our setup, the authentication service serves as the origin for users, warehouses, and clients. This means that any other service that is not the origin service must include an 'origin ID' column as a foreign key that points to the originating service's ID.

00:08:10.180 The origin service does not need to include an 'origin ID'; it can create globally recognized system-wide records. Other services create local records that represent that system-wide data. For instance, in shipping, a service might make a call to a third-party API for rates and store the resulting weight in its own database, even though that data did not exist anywhere before.
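A minimal sketch of the 'origin ID' convention, seen from the shipping side, might look like this. The names `ShippingUnit`, `ShippingUnitRepository`, `sku`, and `weight_oz` are assumptions made for illustration, and a plain hash stands in for shipping's database table.

```ruby
# Shipping's local copy of a unit. origin_id points at the record in the
# origin service; weight_oz is local-only data that shipping gathers itself.
ShippingUnit = Struct.new(:id, :origin_id, :sku, :weight_oz, keyword_init: true)

class ShippingUnitRepository
  def initialize
    @by_origin_id = {}
    @next_id = 1
  end

  # Apply a broadcast from the origin service: create the local row on first
  # sight, otherwise update only the fields the origin service owns.
  def apply_origin_event(event)
    unit = @by_origin_id[event[:id]]
    if unit.nil?
      unit = ShippingUnit.new(id: @next_id, origin_id: event[:id], sku: event[:sku])
      @next_id += 1
      @by_origin_id[event[:id]] = unit
    else
      unit.sku = event[:sku]
    end
    unit
  end

  # Local-only enrichment, for example a weight returned by a rate lookup.
  def record_weight(origin_id, weight_oz)
    @by_origin_id.fetch(origin_id).weight_oz = weight_oz
  end

  def find_by_origin_id(origin_id)
    @by_origin_id[origin_id]
  end
end
```

Because later broadcasts touch only origin-owned fields, the locally recorded weight survives updates from the origin service.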

00:08:37.740 The takeaway is that if you need to share data, consider broadcasting the changes instead of trying to synchronize it. Ultimately, your first service project will likely be a learning experience, and while it's crucial to implement some functionality, make sure it doesn't come at the expense of significant architectural complexity.

00:09:05.640 A notable lesson was that we aimed too large at first, which caused inefficiencies in our work. Therefore, the very next year, we decided to pick the simplest project possible; we focused on product photo processing, such as uploading, resizing, and cropping images.

00:09:37.279 While I’m sure you can envision this as a task suitable for a background worker, we had an ambition to disrupt the way services functioned at Optoro. Instead of using a decentralized synchronous protocol like HTTP, we envisioned adopting a centralized asynchronous protocol to transform our service interactions.

00:10:08.759 To achieve this, we developed a Rack server that communicated via AMQP, which allowed us to keep our current Rails applications while adopting this new protocol. Additionally, we created a gem called 'AMQ Party', which mimicked the semantics and interface of HTTParty.

00:10:35.760 In this setup, a client service communicates by publishing a request message to a request channel. Since the service can’t respond directly, it publishes a response message back to a paired response channel for the client service to consume. Unfortunately, this approach turned out to be quite complicated and, while we presented it as a service, it was still dependent on the inventory database.
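To make the paired-channel exchange concrete, here is a small sketch that simulates it with Ruby's thread-safe `Queue` instead of a real AMQP broker. Nothing here reflects the actual gem from the talk: `Broker`, `serve_one`, `call_service`, and the channel names are hypothetical, and the `reply_to` and `correlation_id` fields follow the common request/reply convention used with message brokers.

```ruby
require "securerandom"
require "timeout"

# Hypothetical broker: named, thread-safe in-memory queues act as channels.
class Broker
  def initialize
    @queues = {}
    @lock = Mutex.new
  end

  def queue(name)
    @lock.synchronize { @queues[name] ||= Queue.new }
  end
end

# Service side: consume one request, then publish the reply to the channel
# named in the request, tagged with the request's correlation id.
def serve_one(broker)
  request = broker.queue("photos.requests").pop
  reply = { correlation_id: request[:correlation_id], body: "resized #{request[:body]}" }
  broker.queue(request[:reply_to]).push(reply)
end

# Client side: publish a request, then block on the paired reply channel.
def call_service(broker, body, timeout_seconds: 1)
  reply_to = "photos.replies.#{SecureRandom.hex(4)}"
  correlation_id = SecureRandom.uuid
  broker.queue("photos.requests").push(
    { reply_to: reply_to, correlation_id: correlation_id, body: body }
  )
  reply = Timeout.timeout(timeout_seconds) { broker.queue(reply_to).pop }
  raise "unexpected reply" unless reply[:correlation_id] == correlation_id

  reply[:body]
end
```

Even in this toy form, the client needs a reply channel, a correlation id, and a timeout just to imitate what a single HTTP call gives for free, which hints at why the approach proved complicated.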

00:11:08.799 This led to the realization that adding features is easier than extracting them from a monolith. This extraction challenge is one of the two hardest problems when working in a service ecosystem.

00:11:27.200 When considering your first service, aim for minimal data extraction, allowing your new service to add functionality rather than extract it, and ensure it’s something of significance that stakeholders will recognize and value. Ultimately, the implementation should appear seamless and yield positive visibility across your organization.

00:11:54.200 The next service project we worked on was bulk sales pricing, launched in 2015. Historically, bulk sales were handled by a two- or three-person team who communicated directly with customers. This method was successful but did not scale, so my role involved collaborating with our data science team to create accessible pricing models based on historical data.

00:12:21.480 We identified this as another suitable opportunity for a service, as our infrastructure team was focused elsewhere at the time. I took on the responsibility for provisioning and configuring this new service while ensuring it was ready in time for our upcoming regression tests. Initially, I underestimated the complexity, thinking it would be straightforward due to existing patterns.

00:12:44.480 Despite our confidence, it took a daunting amount of time to get the service ready, largely due to the absence of necessary conventions. However, the infrastructure team’s patterns played a vital role in expediting processes, allowing me to focus more on logic than on infrastructure logistics.

00:13:10.080 We concluded that if we are to expand upon our services, developers should be responsible for operating applications. This empowers developers by granting them ownership of their codebases while ensuring they understand how their systems operate in production.

00:13:37.700 Next, developers must be equipped with the knowledge and tools required to effectively respond to production incidents. We need conventions that make it easy for developers to know where logs and dashboards are located so that whether it’s 4 a.m. or 4 p.m., they are not left scrambling.

00:14:07.720 To facilitate this, it’s valuable to create tools or wrappers around existing conventions to help maintain consistency. Whether you're providing project skeleton templates or managing deployment tools, empowering developers with the ability to follow conventions can foster efficiency.
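One way to picture a wrapper around conventions is a small helper that derives every operational location from the service name. This is only a sketch: the class name `ServiceConventions`, the log path layout, and the `dashboards.example.com` URL are all invented for illustration and are not taken from the talk.

```ruby
# Every path and URL pattern here is a made-up convention for illustration.
class ServiceConventions
  NAME_FORMAT = /\A[a-z][a-z0-9-]*\z/

  def initialize(service_name, environment: "production")
    unless service_name.match?(NAME_FORMAT)
      raise ArgumentError, "service names must be lowercase with dashes"
    end

    @service_name = service_name
    @environment = environment
  end

  # Where an on-call developer should look for logs.
  def log_path
    "/var/log/#{@service_name}/#{@environment}.log"
  end

  # Where an on-call developer should look for dashboards.
  def dashboard_url
    "https://dashboards.example.com/#{@environment}/#{@service_name}"
  end

  # Everything needed during an incident, derived from the name alone.
  def runbook
    { service: @service_name, logs: log_path, dashboard: dashboard_url }
  end
end
```

With a helper like this, `ServiceConventions.new("bulk-pricing").runbook` answers the 4 a.m. question of where to look without anyone needing to remember per-service details.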

00:14:36.370 In summary, the best practices I’d advise include getting rid of Active Record callbacks, managing your transactions actively, and ensuring you have clear data ownership conventions. Look to create mechanisms that reinforce those conventions in your organization, especially if you’re building shared services.

00:15:02.670 Finally, ensure your services are built around visibility and utility within your organization. Do not overlook the political side of a service ecosystem project if you want it to develop and grow sustainably.

00:15:30.850 Thank you all for being a fantastic audience. I sincerely hope you've gained insightful takeaways from my stories and lessons learned. If you have any questions or comments, I would be delighted to address them.

00:15:51.620 Do we have time for questions? If anyone needs to locate me, I have this ponytail to help you out. Thank you very much.