00:00:10.880
Hi, I'm Chris Hoffman, and we're going to talk about services, specifically how to move from a seemingly intractable monolith to an ecosystem of services that you can operate well and sustainably. I work at UpToro; we're an e-commerce firm headquartered in DC. The only thing that's interesting about us is how we source our inventory: we aggregate retail returns.
00:00:25.980
If you're Best Buy, Target, or some other major retailer, you turn over your return merchandise to us. We test and grade everything that comes through our warehouse and determine, based on the client's recovery goals, whether we should resell it individually, in bulk, donate it, or recycle it. We sell items on a variety of online marketplaces, such as Amazon, Best Buy, eBay, and our own marketplace called Blinq.
00:00:44.789
Things ending in 'q' will come up again. I don't know why we end things in 'q', even after five and a half years; we just do. That's about all you need to know about the business model. So, I've been at UpToro for five and a half years. We still have a monolith, but we are increasingly moving towards an ecosystem of services around it.
00:01:08.580
Over that time, we've undertaken several service projects, and I've been involved in a bunch of them. They weren't all successes, so today I will talk about three of them, give you enough context about what we were trying to do, what we did wrong, what we did right, and finish up by providing some advice on what to do before starting any service projects.
00:01:30.030
I started in 2012 and walked into a company that was already trying to figure out how to break apart our monolith. These conversations were happening when I got there, but we didn't actually give it a go until 2013. We started with authentication and authorization because that's easy, right? Well, not so much. With any multi-tenant software, authentication and authorization are never just about permissions. It's always more complex.
00:01:52.799
In our case specifically, authorization meant knowing not just what permissions a user has based on their user group but also who they are employed by and what warehouse they’re assigned to. Our software runs on hardware provided in our clients' warehouses, which presents an immediate problem for extracting things to an authentication service.
00:02:04.920
It’s very easy to identify that the user group permission model should be owned by authentication and not inventory. However, we faced a challenge across network boundaries, as the authentication service still needed to know the client warehouse to authorize a request. This relationship is vital because warehouses are central to our inventory management system.
00:02:34.051
Initially, we didn’t know nearly as much about distributed systems as we do now, so we didn’t resolve this data-sharing problem adequately. Instead, whenever inventory needed to authorize a request, it would ping the authentication service asking, 'Hey, is this person allowed to do this?' The result was an application that locked up completely in staging.
00:03:02.319
When you block one web request on another web request, the first thing you realize is that too many web requests at the same time can cause your service to fail. Eventually, we learned this lesson in production. The latency introduced by this architecture was unacceptable for our system. It turned a request that had needed no network calls into one that needed several.
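To make the capacity problem concrete, here is a small back-of-the-envelope sketch. The worker count and timings are hypothetical numbers chosen for illustration, not measurements from the talk; the point is only that a worker blocked on a nested call cannot serve anything else.

```python
def max_requests_per_second(workers, ms_per_request):
    """Upper bound on throughput when each request holds one worker
    for its whole duration, including time spent waiting on other services."""
    return workers * 1000 / ms_per_request


# Hypothetical figures: 10 workers, 50 ms of local work per request.
baseline = max_requests_per_second(workers=10, ms_per_request=50)

# Same workers, but every request now also blocks on a 150 ms
# authorization call to another service.
with_blocking_auth_call = max_requests_per_second(workers=10, ms_per_request=50 + 150)

print(baseline, with_blocking_auth_call)
```

Under these assumed numbers the ceiling drops to a quarter of the baseline, and any slowdown in the downstream service lowers it further.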
00:03:45.857
Even with dedicated inventory instances, the performance penalty was an issue; thus, we abandoned the project. While it's amusing to share these ghost stories with diagrams, it's essential that you actually learn something from my experiences so that you can avoid similar mistakes.
00:04:06.992
The first takeaway I noticed immediately was that callbacks are not your friend. Their primary purpose seems to be hiding code, and if they're hiding code that only modifies that model, it's not a good thing. After this project at UpToro, we decided to eliminate Rails callbacks. We developed other patterns for managing persistence actions and try to forgo those callbacks whenever we can.
00:04:28.689
In defense of callbacks, they do handle transaction management for you, ensuring that all callbacks related to a particular save occur within a transaction, but this behavior can be replicated without them. It’s essential to possess a solid understanding of transactions and transaction isolation, as these concepts will only become more important as you scale up to a larger service ecosystem.
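As an illustration of replicating that behavior without callbacks, the sketch below wraps two related writes in one explicit transaction. It uses Python's standard `sqlite3` module rather than Rails, and the `units` and `audit_log` tables and the function names are hypothetical, not the patterns UpToro actually built.

```python
import sqlite3


def create_schema(conn):
    """Create two hypothetical tables: the record itself and a side-effect log."""
    with conn:
        conn.execute(
            "CREATE TABLE units (id INTEGER PRIMARY KEY, sku TEXT NOT NULL, "
            "warehouse_id INTEGER NOT NULL)"
        )
        conn.execute(
            "CREATE TABLE audit_log (id INTEGER PRIMARY KEY, "
            "unit_id INTEGER NOT NULL, action TEXT NOT NULL)"
        )


def create_unit_with_audit(conn, sku, warehouse_id):
    """Persist a unit and its audit row together, or neither.

    The side effect is spelled out here instead of hiding in a callback.
    """
    # `with conn` commits if the block succeeds and rolls back if it raises.
    with conn:
        cursor = conn.execute(
            "INSERT INTO units (sku, warehouse_id) VALUES (?, ?)",
            (sku, warehouse_id),
        )
        unit_id = cursor.lastrowid
        conn.execute(
            "INSERT INTO audit_log (unit_id, action) VALUES (?, ?)",
            (unit_id, "created"),
        )
    return unit_id
```

If the second insert fails, the first is rolled back with it, which is the guarantee callbacks provided implicitly.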
00:05:03.850
However, even if you’ve eliminated callbacks, data sharing remains challenging. How do you determine which service owns shared models? We kind of punted this issue and ended up establishing a second monolith, realizing only years later that when building a distributed system, a proper architecture needs foresight.
00:05:37.990
Initially, we thought that warehouses and clients belonged to inventory, but in reality, they should have been owned by the authentication service. The right approach would be to have the authentication service manage warehouses and clients, with both the authentication and inventory services having their own databases.
00:06:01.679
The authentication service would handle requests for creating new clients, warehouses, and users, creating a smooth user experience since everything would happen in one place. However, inventory still needs access to that information to process requests for scanning units. We had to ensure that the authentication notified inventory whenever there were changes related to clients or warehouses.
00:06:38.460
We needed a mechanism so that inventory wouldn't freak out if clients’ or warehouses’ statuses changed. We envisioned a data replication method rather than traditional caching or synchronization, where the data doesn’t need to remain identical across all platforms. If you were present during the previous talk, this could be one of the most primitive examples of event sourcing.
00:07:05.350
For example, both the shipping and accounting services need to understand the 'unit' concept, but they have different requirements for what that entails. Shipping needs to know a unit's weight but not how much was charged for it, whereas accounting needs to know how much was charged but not the weight. In our current approach, we designate an 'origin service' for the model we wish to share among multiple services.
00:07:43.929
Typically, the origin service is the one where a record's lifecycle starts. For instance, in our setup, the authentication service serves as the origin for users, warehouses, and clients. Any service that is not the origin must include an 'origin ID' column, a foreign key that points to the record's ID in the origin service.
00:08:10.180
The origin service does not need an 'origin ID' column; the IDs it assigns serve as the system-wide identifiers for those records. Other services create local records that represent their own view of that data, and they can add data of their own. For instance, shipping might call a third-party API for rates and store the resulting weight in its own database, even though that data did not previously exist anywhere in the system.
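The origin ID convention can be sketched as a consuming service that keeps its own table and applies change events broadcast by the origin. This is a minimal illustration using Python's `sqlite3`; the event shape, the event type names, and the `dock_count` column are assumptions made for the example, not details from the talk.

```python
import sqlite3


def create_shipping_schema(conn):
    """Shipping's local copy of warehouses, keyed back to the origin service."""
    with conn:
        conn.execute(
            """CREATE TABLE warehouses (
                   id INTEGER PRIMARY KEY,            -- local ID, private to shipping
                   origin_id INTEGER NOT NULL UNIQUE, -- the record's ID in the origin service
                   name TEXT NOT NULL,
                   dock_count INTEGER                 -- hypothetical shipping-only data
               )"""
        )


def apply_warehouse_event(conn, event):
    """Apply one change event broadcast by the origin service.

    Only the fields shipping cares about are copied; local-only columns
    are left untouched, so the two copies never need to be identical.
    """
    with conn:
        if event["type"] == "warehouse.deleted":
            conn.execute("DELETE FROM warehouses WHERE origin_id = ?", (event["id"],))
            return
        # Update the existing local record, or create it on first sight.
        cursor = conn.execute(
            "UPDATE warehouses SET name = ? WHERE origin_id = ?",
            (event["name"], event["id"]),
        )
        if cursor.rowcount == 0:
            conn.execute(
                "INSERT INTO warehouses (origin_id, name) VALUES (?, ?)",
                (event["id"], event["name"]),
            )
```

Because every event is matched on `origin_id`, replaying the same event twice leaves the local table unchanged, which helps if the broadcast mechanism delivers duplicates.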
00:08:37.740
The takeaway is that if you need to share data, consider broadcasting the changes instead of trying to synchronize it. Ultimately, your first service project will likely be a learning experience, and while it's crucial to implement some functionality, make sure it doesn't come at the expense of significant architectural complexity.
00:09:05.640
A notable lesson was that we aimed too large at first, which caused inefficiencies in our work. Therefore, the very next year, we decided to pick the simplest project possible; we focused on product photo processing, such as uploading, resizing, and cropping images.
00:09:37.279
While I’m sure you can envision this as a task suitable for a background worker, we had an ambition to disrupt the way services functioned at UpToro. Instead of using a decentralized synchronous protocol like HTTP, we envisioned adopting a centralized asynchronous protocol to transform our service interactions.
00:10:08.759
To achieve this, we developed a Rack server that communicated via AMQP, which allowed us to keep our current Rails applications while adopting this new protocol. Additionally, we created a gem called 'AMQ Party', which mimicked the semantics and interface of HTTParty.
00:10:35.760
In this setup, a client service communicates by publishing a request message to a request channel. Since the service can’t respond directly, it publishes a response message back to a paired response channel for the client service to consume. Unfortunately, this approach turned out to be quite complicated and, while we presented it as a service, it was still dependent on the inventory database.
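The request and response pairing described above can be sketched with two in-memory queues standing in for the broker. This is not the AMQ Party implementation: the queues come from Python's standard library, and the correlation ID used to match a response to its request is an assumption borrowed from the common request/reply pattern over message queues.

```python
import queue
import threading
import uuid

# Stand-ins for the paired channels; a real setup would use a message broker.
request_queue = queue.Queue()
response_queue = queue.Queue()  # assumed to belong to a single client


def serve_one(handler):
    """Service side: consume one request, publish the result to the response queue."""
    message = request_queue.get()
    response_queue.put(
        {"correlation_id": message["correlation_id"], "body": handler(message["body"])}
    )


def call(body, timeout=1.0):
    """Client side: publish a request, then wait for the matching response.

    Raises queue.Empty if nothing arrives within `timeout` seconds.
    """
    correlation_id = str(uuid.uuid4())
    request_queue.put({"correlation_id": correlation_id, "body": body})
    while True:
        response = response_queue.get(timeout=timeout)
        # Skip stale replies left over from earlier calls that timed out.
        if response["correlation_id"] == correlation_id:
            return response["body"]


if __name__ == "__main__":
    threading.Thread(target=serve_one, args=(str.upper,), daemon=True).start()
    print(call("resize photo 17"))
```

Even in this toy form, the client has to handle timeouts and unmatched replies itself, which hints at why the approach turned out to be complicated.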
00:11:08.799
This led to the realization that adding features is easier than extracting them from a monolith. This extraction challenge is one of the two hardest problems when working in a service ecosystem.
00:11:27.200
When considering your first service, aim for minimal data extraction, allowing your new service to add functionality rather than extract it, and ensure it’s something of significance that stakeholders will recognize and value. Ultimately, the implementation should appear seamless and yield positive visibility across your organization.
00:11:54.200
The next service project we worked on was bulk sales pricing, launched in 2015. Historically, bulk sales were handled by a two- or three-person team who communicated directly with customers. This method was successful but did not scale, so my role involved collaborating with our data science team to create accessible pricing models based on historical data.
00:12:21.480
We identified this as another suitable opportunity for a service. Our infrastructure team was focused elsewhere at the time, so I took on the responsibility for provisioning and configuring this new service while ensuring it was ready in time for our upcoming regression tests. Initially, I underestimated the complexity, thinking it would be straightforward due to existing patterns.
00:12:44.480
Despite our confidence, it took a daunting amount of time to get the service ready largely due to the absence of necessary conventions. However, the infrastructure team’s patterns played a vital role in expediting processes, allowing me to focus more on logic rather than the infrastructure logistics.
00:13:10.080
We concluded that if we are to expand our service ecosystem, developers should be responsible for operating their own applications. This gives developers ownership of their codebases while ensuring they understand how their systems operate in production.
00:13:37.700
Next, developers must be equipped with the knowledge and tools required to effectively respond to production incidents. We need conventions that make it easy for developers to know where logs and dashboards are located so that whether it’s 4 a.m. or 4 p.m., they are not left scrambling.
00:14:07.720
To facilitate this, it’s valuable to create tools or wrappers around existing conventions to help maintain consistency. Whether you're providing project skeleton templates or managing deployment tools, empowering developers with the ability to follow conventions can foster efficiency.
00:14:36.370
In summary, the best practices I’d advise include getting rid of Active Record callbacks, managing your transactions actively, and ensuring you have clear data ownership conventions. Look to create mechanisms that reinforce those conventions in your organization, especially if you’re building shared services.
00:15:02.670
Finally, ensure your services are built around visibility and utility within your organization. Don't overlook the political elements of a service ecosystem project; they matter for sustainable development and growth.
00:15:30.850
Thank you all for being a fantastic audience. I sincerely hope you've gained insightful takeaways from my stories and lessons learned. If you have any questions or comments, I would be delighted to address them.
00:15:51.620
Do we have time for questions? If anyone needs to locate me, I have this ponytail to help you out. Thank you very much.