Service Extraction at Groupon Scale

by Jason Sisk and Abhishek Pillai

The video titled "Service Extraction at Groupon Scale" features Groupon engineers Jason Sisk and Abhishek Pillai discussing the challenges and strategies involved in transitioning from a Ruby on Rails monolithic architecture to a service-oriented architecture (SOA) at scale. The presentation highlights the significant challenges faced by Groupon due to rapid growth and high traffic, particularly when their systems encountered performance issues that led to outages. Key points from the presentation include:

Transitioning from Monolith to SOA: Groupon initially utilized a monolithic Rails application, affectionately dubbed the 'cobra,' which became difficult to manage as the business grew and the codebase expanded to nearly 2 million lines.
Identifying Pain Points: The engineers discuss the performance degradation experienced during high traffic, reflecting on lessons learned from early scaling pains and how they contributed to the decision to explore service extraction.
Service Extraction Process: Groupon utilized a methodical approach to separate components by creating new services, starting with the order service, to alleviate database contention issues.
Service Wall Tool: To simplify the extraction process, they developed a 'service wall' that helped maintain separation of concerns, allowing teams to manage dependencies without directly coupling their services.
Greenfield Approach: The presentation also covers building new services from scratch (greenfield services) that connect to legacy systems via a message bus, providing a modern approach to handle critical interactions.
Manager Accessor Pattern: This pattern allowed them to manage data from various sources and hide complexity behind a simple interface in the codebase, further supporting the scalability of their services.
Strategic Planning for Service Extraction: Sisk and Pillai emphasize the necessity of having a clear strategy for breaking down monoliths, including understanding domain boundaries, preparing for unexpected challenges, and maintaining good coding practices. They stress the importance of effective communication between teams.

In conclusion, the speakers remind the audience that while Rails monoliths can be advantageous for rapid development, they can become burdensome without appropriate planning and a strategy for splitting out services. They encourage developers to share knowledge and solutions to common issues faced during this transition. The overarching message is the importance of being prepared for changes and challenges when evolving from a monolithic architecture to a distributed system of services.

00:00:18.640 Thanks for coming! I know there are some other cool talks happening right now, but you're here, so that's awesome. Let's get started!

00:00:24.560 You're here to learn about how to tame cobras.

00:00:31.199 My name is Jason Sisk. I work at Groupon and have been here for a couple of years, predominantly working on Ruby on Rails systems, back-end development, etc.

00:00:42.879 And I do not like onions! My name is Abhishek Pillai and I have been at Groupon for about two years as well. Jason and I work on a team that manages back-end services, essentially dealing with inventory. I, on the other hand, do not like fruits.

00:01:00.559 Part of what we're going to share today is a bit of a history lesson about the early challenges Groupon faced, including site outages due to scaling issues. We want to share the story of the developers who handled those problems and some of the decisions they made.

00:01:12.000 We want to kick off with one important point.

00:01:23.280 Boom! Pause.

00:01:28.640 We don't have to pause for that long. So, back around 2007, we were doing what all the other cool kids were doing: we were using a Rails monolith—and to some degree, we still are. Rails 2 was a great framework.

00:01:34.320 Who was using Rails 2? Alright, so, you and us! Rails is a fantastic framework we all love, and that's why we're here. What's great about it is that it's excellent for agile teams. For us, it was really simple. We could make quick decisions and iterate on products very fast with a small team of five to ten developers. We had a single repository, a single test suite, and a single deploy process. Everything was straightforward.

00:01:55.200 Most importantly, we had a shared conceptual understanding of the codebase. When we wanted to make a change, we knew where to put it.

00:02:06.719 Additionally, what was great—and still is about Rails—is that integrating components is very easy. The convention over configuration model, associations, and so forth allow you to put things together quickly.

00:02:23.040 But we didn't come here to talk to you about Rails; we came here to discuss cobras and how to tame them.

00:02:32.239 At Groupon, we actually have a monolith, and we like to call it the primary web app. But for the purposes of this talk, Jason and I thought it would be fitting to come up with a more scientifically accurate name for it: the Centralized Omnipotent Big Ass Rails Application.

00:03:28.879 So, let’s take you back to 2009 for a moment. Groupon was about two years old, and we were still gaining momentum. People would wake up, come into the office in Chicago, open up New Relic, and see data that looked like this.

00:03:35.280 In the middle of the night, everything looked great, and all systems were functioning well. But as soon as people woke up and started using it, our performance would sharply drop. Eight months later, we were handling about 30,000 requests per minute, and everything felt like it was on fire!

00:04:06.879 We blamed Oprah. Oprah crashed Groupon, not once, but at least twice! Gap (the clothing brand) crashed Groupon as well.

00:04:14.480 The reality is, Groupon crashed Groupon. We were not scaling properly, which turned into a bad situation for us.

00:04:26.880 The cobra was getting fatter and fatter. Initially, we started with about three to five hundred commits per month. But over the years, we averaged about 2,000 commits in a single month, with a lot of developers working on the same codebase.

00:04:40.080 And we began to think seriously about service-oriented architecture (SOA) as it was already becoming painful to manage the monolith.

00:04:50.800 However, as we looked into the eyes of the cobra, it scared the heck out of us. We had a lot of scoping problems, largely due to model coupling. One of the biggest issues preventing us from extracting services early on was the natural convention coupling occurring in the models as the code grew.

00:05:06.000 Let's consider an oversimplified example. Say you're on the 'My Groupons' page and want to view all the groupons you purchased, along with their titles. In the monolith, you might encounter a set of dependent relationships that demonstrate cyclical dependencies.

00:05:29.680 Building these types of associations became quite common, which proved problematic. For instance, you'd instantiate a user, leading to a database lookup for that user in the users table to retrieve orders and subsequently map over those orders to get all the deal titles.

00:05:55.600 This design exemplified a violation of the Law of Demeter, which is undesirable. On the surface, the code looked clean, but it effectively coupled our components together, creating challenges.
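
The association chain described here can be sketched as follows. This is a minimal illustration, not Groupon's code: `User`, `Order`, and `Deal` are hypothetical plain-Ruby stand-ins for the ActiveRecord models, and `purchased_deal_titles` is an assumed name for a narrower interface.

```ruby
# Hypothetical stand-ins for the ActiveRecord models; Structs keep the sketch runnable.
Deal  = Struct.new(:title)
Order = Struct.new(:deal)

class User
  attr_reader :orders

  def initialize(orders)
    @orders = orders
  end

  # Narrower interface: callers ask the user and never see orders or deals.
  def purchased_deal_titles
    orders.map { |order| order.deal.title }
  end
end

user = User.new([Order.new(Deal.new("Half-off pizza")), Order.new(Deal.new("Spa day"))])

# Coupled version: the caller walks user -> orders -> deal -> title.
coupled_titles = user.orders.map { |order| order.deal.title }

# Demeter-friendly version: the caller only knows about User.
titles = user.purchased_deal_titles
```

Both calls return the same titles, but only the first forces the calling code to know how three models are related, which is the kind of coupling that makes a later extraction hard.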

00:06:09.680 At the time, if designed properly, you could avoid these pitfalls from the beginning. However, Rails conventions did not encourage this early on. As you can see, our codebase grew significantly, with nearly 2 million lines of code, and the backend began to present a significant operational burden.

00:06:40.000 Testing was painful as well; we had to wait around 45 minutes for builds to run. After running tests, developers had to find something else to occupy their time while the tests executed. Meanwhile, our release engineering team put considerable effort into optimizing for speed.

00:07:05.680 Deployments were tedious. The deployment process for the monolithic application took about three hours. This created a terrible development experience, especially as teams began to divide ownership—they wanted to iterate on features that mattered to them without being held back by the massive monolithic application.

00:07:17.120 Additionally, deploys were happening only once a week, which severely hindered any team that wanted to practice continuous deployment.

00:07:37.160 Development pace was increasing, leading us to contemplate where the best place would be for the next line of code. As heard in a previous talk, the ideal location is within the area you are modifying, but bloated models hinder this flexibility.

00:07:51.600 All these factors contributed to a painful development environment, pushing us further towards more serious service extraction.

00:08:03.440 If there is one takeaway from this first part, it’s that cobras are great—but only up to a point. We needed to alleviate our pain immediately. We required a quick extraction of code.

00:08:54.440 Thus, we decided to extract a new service and build it on top of an existing schema, starting with the order service. It was responsible for significant database contention since many users were buying groupons concurrently.

00:09:09.440 While this was a good problem to have, it was overwhelming our database. By choosing to start with orders, we could create a long-lived model for our domain.

00:09:31.680 To illustrate our objective: we wanted to separate the orders code base from the monolith, allowing it to have its own database while retaining read-only access to the monolith’s database.

00:09:43.200 Consequently, we focused on making sure the order service could access the cobra's database, keeping coupling manageable.

00:10:02.720 Finding all the ways components were coupled was a tricky task, as Rails callbacks and model associations created numerous hidden dependencies.

00:10:12.960 To aid in this process, we built several tools, one of which we dubbed the 'service wall.' The primary goal was to delineate the concerns of orders within the application.

00:10:26.480 Initially, we separated order services into a distinct directory to promote clear boundaries. Within this directory, we housed its app, lib, and specs.

00:10:42.880 In the environment.rb file, we iterated through these services, adding them to the load path so the application would appear as one big application, while, in reality, the code remained separate.
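
A rough sketch of that load-path step, in plain Ruby rather than Rails configuration. The `services/<name>/{app,lib}` layout and both method names are assumptions made for illustration; the talk does not show the actual environment.rb.

```ruby
require "tmpdir"
require "fileutils"

# Returns the app and lib directories of every service under services_root.
# Spec directories are deliberately left out of the load path.
def service_load_paths(services_root)
  service_dirs = Dir.glob(File.join(services_root, "*")).sort
  candidates = service_dirs.flat_map do |service_dir|
    %w[app lib].map { |sub| File.join(service_dir, sub) }
  end
  candidates.select { |path| File.directory?(path) }
end

# Appends each service directory to the given load path exactly once.
def add_services_to_load_path(services_root, load_path = $LOAD_PATH)
  service_load_paths(services_root).each do |path|
    load_path << path unless load_path.include?(path)
  end
  load_path
end

# Demo against a throwaway directory tree and a local array instead of $LOAD_PATH.
relative_paths = Dir.mktmpdir do |root|
  %w[app lib spec].each { |sub| FileUtils.mkdir_p(File.join(root, "orders", sub)) }
  add_services_to_load_path(root, []).map { |path| path.delete_prefix("#{root}/") }
end
```

With the directories on the load path, the monolith can still require the order code as if nothing had moved, while the files themselves live under their own directory.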

00:11:07.680 One way we implemented the service wall was by disabling model access. You declare which models a piece of code may no longer touch, and either disable or deprecate access to them, thereby preventing violations.

00:11:23.520 With disable model access in place, any call to a restricted model during tests raises an error, so the service wall cannot be breached unnoticed.

00:11:41.760 Alternatively, using the deprecate model access method would only log the infractions without raising errors, allowing us to identify any points of failure during development or staging.

00:11:54.960 This approach is not recommended in production due to performance issues.

00:12:09.480 At the top of your controller, you’d specify either disable model access or deprecate model access, depending on your needs. Importantly, this allows you to exempt actions that are not causing violations, thus enabling you to address issues incrementally.
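
The two modes and the per-action exemption might look roughly like this. It is a hypothetical sketch in plain Ruby: the `ServiceWall` module, `check!`, `enforce`, and `run_action` are invented for illustration, and only the names `disable_model_access` and `deprecate_model_access` echo the talk. A real Rails implementation would hook into controller filters and model callbacks instead.

```ruby
# Hypothetical service wall; none of this is Groupon's actual implementation.
module ServiceWall
  class Violation < StandardError; end

  @mode = nil
  @warnings = []

  class << self
    attr_reader :warnings

    # Guarded models call this before touching the database.
    def check!(model_name)
      case @mode
      when :disable   then raise Violation, "#{model_name} is behind the service wall"
      when :deprecate then @warnings << "DEPRECATED: direct access to #{model_name}"
      end
    end

    # Runs the block under the given mode, then restores the previous one.
    def enforce(mode)
      previous = @mode
      @mode = mode
      yield
    ensure
      @mode = previous
    end
  end

  # Class-level macros a controller uses to declare its policy.
  module Controller
    def disable_model_access(except: [])
      @wall_policy = { mode: :disable, except: except }
    end

    def deprecate_model_access(except: [])
      @wall_policy = { mode: :deprecate, except: except }
    end

    # Stand-in for request dispatch: exempt actions run with no restrictions.
    def run_action(name)
      policy = @wall_policy || { mode: nil, except: [] }
      mode = policy[:except].include?(name) ? nil : policy[:mode]
      ServiceWall.enforce(mode) { new.public_send(name) }
    end
  end
end

class Order
  def self.find(id)
    ServiceWall.check!(name)
    { id: id }
  end
end

class DealsController
  extend ServiceWall::Controller
  disable_model_access except: [:legacy_show]

  def show
    Order.find(1)
  end

  def legacy_show
    Order.find(1)
  end
end

class ReportsController
  extend ServiceWall::Controller
  deprecate_model_access

  def index
    Order.find(2)
  end
end
```

Here `DealsController.run_action(:show)` raises `ServiceWall::Violation`, the exempted `legacy_show` still works, and `ReportsController.run_action(:index)` succeeds while recording a warning in `ServiceWall.warnings`.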

00:12:23.520 Another challenge we faced during extraction was the leftover cruft from our old domain code. Teams frequently questioned if certain endpoints were even being utilized anymore.

00:12:46.560 In response, a small team of developers created a tool called Route 66 to track down and manage cruft in both our old monolith and the new service.

00:13:07.920 Route 66 answers questions such as whether endpoints are still currently in use. The tool analyzes log files, checking the frequency with which controller actions are accessed.

00:13:18.319 A route that is only hit once a week can be effectively identified, allowing for aggressive de-crufting.
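
The core of that log analysis can be approximated in a few lines. This is not Route 66 itself: the log line format, the `count_action_hits` and `rarely_used` helpers, and the threshold are all assumptions chosen to illustrate the idea.

```ruby
# Matches lines such as "Processing OrdersController#show ..."; the format is assumed.
ACTION_PATTERN = /Processing ((?:\w+::)*\w+#\w+)/

# Counts how often each controller action appears in the given log lines.
def count_action_hits(log_lines)
  log_lines.each_with_object(Hash.new(0)) do |line, hits|
    match = ACTION_PATTERN.match(line)
    hits[match[1]] += 1 if match
  end
end

# Known actions that were hit at most max_hits times in the sampled window.
def rarely_used(known_actions, hits, max_hits: 1)
  known_actions.select { |action| hits[action] <= max_hits }
end

sample_log = [
  "Processing OrdersController#show (for 10.0.0.1) [GET]",
  "Processing OrdersController#show (for 10.0.0.2) [GET]",
  "Processing DealsController#index (for 10.0.0.3) [GET]",
  "Completed in 12ms"
]

hits = count_action_hits(sample_log)
known = %w[OrdersController#show DealsController#index DealsController#legacy_export]
candidates = rarely_used(known, hits)
```

Comparing the hit counts against the full list of known actions matters, because an endpoint that is never called does not show up in the logs at all.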

00:13:37.679 While separating the code simplified the process, a notable drawback of this extraction practice is that the new service was still tied to the legacy database.

00:13:54.560 We employed various tactics to address these challenges; for instance, we also considered building greenfield services using a message bus. Sometimes, you must keep the legacy API operational due to numerous client dependencies that exist.

00:14:38.560 Greenfield work is often preferred among developers. In a similar situation, we envisioned the greenfield extraction where the red box represents all new code. A client gem being utilized in the original monolith runs alongside the new service.

00:15:00.959 Once the monolith writes data to its database, a message is sent, which the new service consumes, effectively routing the data to its dedicated data store.

00:15:16.720 The noteworthy piece that keeps everything in sync during these service cutovers is the background sync worker, which ensures a one-way data transfer from the old database to the new one.
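
A toy version of that flow, assuming the monolith is the writer and the new service is the consumer. Everything here is hypothetical: the in-memory `MessageBus`, the `record.written` topic, and the Hash-backed stores stand in for a real broker and real databases.

```ruby
# In-memory stand-in for a message bus; a real deployment would use a broker.
class MessageBus
  def initialize
    @subscribers = Hash.new { |hash, topic| hash[topic] = [] }
  end

  def subscribe(topic, &handler)
    @subscribers[topic] << handler
  end

  def publish(topic, payload)
    @subscribers[topic].each { |handler| handler.call(payload) }
  end
end

legacy_db = {} # rows owned by the monolith
new_db    = {} # rows owned by the greenfield service
bus = MessageBus.new

# The new service consumes messages and writes to its own data store.
bus.subscribe("record.written") { |row| new_db[row[:id]] = row }

# Writes in the monolith land in the legacy database and are announced on the bus.
write_to_legacy = lambda do |row|
  legacy_db[row[:id]] = row
  bus.publish("record.written", row)
end

# One-way backfill: copy rows the new store is missing, never the reverse.
sync_legacy_to_new = lambda do
  missing = legacy_db.reject { |id, _row| new_db.key?(id) }
  new_db.merge!(missing)
  missing.size
end

legacy_db[1] = { id: 1, code: "OLD-1" }        # written before the bus existed
write_to_legacy.call({ id: 2, code: "NEW-2" }) # written after the cutover began
backfilled = sync_legacy_to_new.call
```

The message handler covers new writes, while the backfill picks up rows that predate the bus. A production worker would also need to handle updates, ordering, and retries, which this sketch ignores.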

00:15:33.119 There are pros and cons to this approach as well. One of the advantages includes swiftly eliminating legacy data while utilizing greenfield features that developers typically enjoy designing.

00:15:49.600 Additionally, minimizing cutover risk through efficient data syncing allows for a smoother transition, preventing significant disruptions due to database changes.

00:16:05.600 However, creating a synchronization worker is no trivial task. It's equally challenging to build a validation engine to ensure data remains in sync and to handle potential race conditions.

00:16:18.760 Jason and I work on a team that manages inventory at Groupon. One of our future goals is extracting vouchers from the order service.

00:16:40.080 Vouchers are critical, as they represent what customers redeem. We needed an identification process to differentiate between vouchers neatly stored in a database and prices stored in the legacy database.

00:17:19.600 Groupon has grown significantly since the introduction of orders, creating an international platform that operates in numerous locations, including Berlin, London, Chennai, Korea, and many more.

00:17:48.800 Now, the responsibility of our service is to ensure that anyone querying voucher data doesn’t have to understand the intricacies of the various data sources.

00:18:02.480 In managing these diverse sources of truth, we utilized the manager accessor pattern to streamline the codebase’s structure. With this pattern, the controller could simply specify to talk to a designated manager object to initiate a find request on the voucher.

00:18:46.400 Inside the manager lies all the complexity of data access. In this way, separate accessors interact with various databases—the cobra accessor accesses legacy data while an international accessor might employ HTTP calls to external sources.

00:19:24.960 Ultimately, all the information is consolidated and returned to the controller, making the system more modular while incorporating multiple data sources.
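
One way to picture the manager and its accessors is the sketch below. The class names, the `Voucher` struct, and the injected `fake_http` callable are assumptions for illustration, and the manager here simply returns the first match rather than consolidating results from several sources.

```ruby
Voucher = Struct.new(:id, :code, :source)

# Reads vouchers from the legacy (cobra) database; a Hash stands in for the table.
class CobraAccessor
  def initialize(rows)
    @rows = rows
  end

  def find(id)
    row = @rows[id]
    row && Voucher.new(id, row[:code], :cobra)
  end
end

# Would make HTTP calls in practice; the client is injected so the sketch stays offline.
class InternationalAccessor
  def initialize(http_client)
    @http_client = http_client
  end

  def find(id)
    body = @http_client.call("/vouchers/#{id}")
    body && Voucher.new(id, body[:code], :international)
  end
end

# The only object the controller talks to; it hides where the data lives.
class VoucherManager
  def initialize(accessors)
    @accessors = accessors
  end

  # Asks each accessor in turn and returns the first voucher found, or nil.
  def find(id)
    @accessors.each do |accessor|
      voucher = accessor.find(id)
      return voucher if voucher
    end
    nil
  end
end

fake_http = ->(path) { path == "/vouchers/42" ? { code: "INTL-42" } : nil }
manager = VoucherManager.new([
  CobraAccessor.new({ 7 => { code: "US-7" } }),
  InternationalAccessor.new(fake_http)
])

local  = manager.find(7)
remote = manager.find(42)
```

The controller calls `manager.find` in both cases and gets back the same kind of object, regardless of whether the data came from the legacy table or from a remote call.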

00:19:40.720 Although the facade pattern we implemented has its benefits, it also brings significant complexity, especially as accessors are coupled with schema changes.

00:19:56.720 We must ensure that our legacy schemas still align with necessary data access requests while adapting to the new service-oriented architecture.

00:20:12.960 These three extraction patterns we explored are only a small snippet of what Groupon has been navigating. These practices should resonate with many of your own experiences.

00:20:29.760 In addition to these methods, there are excellent discussions at RailsConf, and I encourage everyone to explore those. Consider allowing your teams to decide on tactics suitable for SOA implementation, as you may discover neat strategies previously unknown to you.

00:21:02.760 This experience has taught us many lessons about service extraction. Taming a cobra is serious business. As I always say, 'Yippie-ki-yay, you probably won't need it right now.' But once you start feeling the pains of your architecture, it's essential to prepare for action.

00:21:29.120 The tipping point guiding your shift towards service-oriented architecture isn’t clear-cut. The process is more of an art than a science. As you begin discussing SOA and feel the pressure, ensure you formulate a strategy to tackle these challenges.

00:22:11.920 Don't wait for Oprah to crash your site! It's also crucial that you allow your domain to progress organically. Models that seem vital today may become obsolete over time.

00:22:23.760 The monolith’s true benefit is its ability to support quick iterations. Additionally, when diving into service extraction, always maintain a clear strategy.

00:22:51.440 Identify which elements need breaking apart and which need to remain within the monolith. Understand the priorities relating to these focuses. We can easily run into issues when we operate in a scattershot manner without comprehending our business model.

00:23:09.760 Focus on things that are distinctly their own while considering maintenance challenges or legacy designs that necessitate extraction. You should be cognizant of the unexpected—scope creep can bite you when handling larger codebases.

00:23:38.240 You must also think comprehensively about your service stack and understand how your business operates. Always conceptualize data flow, inter-service agreements, and the need for caching to meet load and latency requirements.

00:24:05.760 It’s vital to consider inter-service messaging during the extraction process. If your services need to communicate, allocate thought to the structure of those messages.

00:24:20.240 Consider the implications of delivery SLAs, guarantees, and payload structures. Remember to also prioritize authentication and authorization protocols.

00:24:42.400 Building a supportive infrastructure around your services is crucial. We were fortunate at Groupon; we had dedicated teams striving to generate tools for smoother service deployment.

00:25:01.760 In your organization, it's essential to ensure you think critically about simplifying these processes.

00:25:13.440 When discussing service-oriented architecture, it's important to adopt UUIDs from the outset. Doing so effectively separates your systems from your databases.
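
As a small illustration of the point about UUIDs, the sketch below assigns an identifier at creation time using Ruby's standard `SecureRandom.uuid`. The `VoucherRecord` class and its `database_id` attribute are hypothetical.

```ruby
require "securerandom"

# Hypothetical record whose public identifier does not depend on
# whatever auto-increment key a particular database assigns.
class VoucherRecord
  attr_reader :uuid
  attr_accessor :database_id

  def initialize
    @uuid = SecureRandom.uuid
  end
end

record = VoucherRecord.new
record.database_id = 1001 # row id in the legacy database
original_uuid = record.uuid
record.database_id = 5    # a different row id after moving to a new store

# Shape of a random (version 4) UUID as produced by SecureRandom.uuid.
uuid_format = /\A\h{8}-\h{4}-4\h{3}-[89ab]\h{3}-\h{12}\z/
```

Because other services refer to the record by its UUID, the row can be copied to a new data store and receive a new primary key without breaking any of those references.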

00:25:30.119 Good code practices are imperative. This is easier said than done, but you should adhere to SOLID principles, ensuring separation of components and reflecting on the purpose of their coupling.

00:25:56.880 Think critically about your models—these will evolve into your service APIs. Ensure public methods clearly articulate service behaviors.

00:26:12.080 As you develop your systems, be cautious about entwining your components. For instance, introducing associations in Rails expands the API, creating new pathways for potential coupling.

00:26:22.960 Always test! It's common knowledge yet needs reiterating. High-level testing should take precedence, especially as you approach service extraction—expect extensive breakage in unit tests.

00:26:39.479 Emphasize solid coverage with high-level, end-to-end tests, particularly since spinning up other services for testing can be challenging. A strong set of integration tests fosters confidence in making changes.

00:26:57.360 Communication is key! As service extraction efforts expand, share findings with other teams. Document solutions, create internal gems, put insights to paper, and deliver presentations.

00:27:15.679 At Groupon, we encourage collaborative environments through our core architecture forum, where developers gather to discuss new services or design problems.

00:27:28.079 In conclusion, cobras are fantastic. Rails is fantastic, and cobras serve valuable functions, but proceed with caution!

00:27:36.839 Deciding to raise a baby cobra means you must be prepared for what follows.

00:27:58.560 By the way, we are hiring! If you want to help us tackle these challenges, please talk to us after this session, check out our booth downstairs, or tweet at us.

00:28:07.920 We owe a debt of gratitude to those who supported us in crafting this talk. The insights we gained from many individuals, who contributed to our service extraction processes, encompass the experiences we presented today.

00:28:31.040 We greatly appreciate having mentors who dedicate their time, knowledge, and support to help us navigate these challenges. Thank you all for your attention!

00:32:03.840 Thank you!