00:00:18.640
Thanks for coming! I know there are some other cool talks happening right now, but you're here, so that's awesome. Let's get started!
00:00:24.560
You're here to learn about how to tame cobras.
00:00:31.199
My name is Jason Sisk. I work at Groupon and have been here for a couple of years, predominantly working on Ruby on Rails systems, back-end development, etc.
00:00:42.879
And I do not like onions! My name is Abhishek Pillai and I have been at Groupon for about two years as well. Jason and I work on a team that manages back-end services, essentially dealing with inventory. I, on the other hand, do not like fruits.
00:01:00.559
Part of what we're going to share today is a bit of a history lesson about the early challenges Groupon faced, including site outages due to scaling issues. We want to share the story of the developers who handled those problems and some of the decisions they made.
00:01:12.000
We want to kick off with one important point.
00:01:23.280
Boom! Pause.
00:01:28.640
We don't have to pause for that long. So, back around 2007, we were doing what all the other cool kids were doing: we were using a Rails monolith—and to some degree, we still are. Rails 2 was a great framework.
00:01:34.320
Who was using Rails 2? Alright, so, you and us! Rails is a fantastic framework we all love, and that's why we're here. What's great about it is that it's excellent for agile teams. For us, it was really simple. We could make quick decisions and iterate on products very fast with a small team of five to ten developers. We had a single repository, a single test suite, and a single deploy process. Everything was straightforward.
00:01:55.200
Most importantly, we had a shared conceptual understanding of the codebase. When we wanted to make a change, we knew where to put it.
00:02:06.719
Additionally, what was great—and still is about Rails—is that integrating components is very easy. The convention over configuration model, associations, and so forth allow you to put things together quickly.
00:02:23.040
But we didn't come here to talk to you about Rails; we came here to discuss cobras and how to tame them.
00:02:32.239
At Groupon, we actually have a monolith, and we like to call it the primary web app. But for the purposes of this talk, Jason and I thought it would be fitting to come up with a more scientifically accurate name for it: the Centralized Omnipotent Big Ass Rails Application.
00:03:28.879
So, let’s take you back to 2009 for a moment. Groupon was about two years old, and we were still gaining momentum. People would wake up, come into the office in Chicago, open up New Relic, and see data that looked like this.
00:03:35.280
In the middle of the night, everything looked great, and all systems were functioning well. But as soon as people woke up and started using it, our performance would sharply drop. Eight months later, we were handling about 30,000 requests per minute, and everything felt like it was on fire!
00:04:06.879
We blamed Oprah. Oprah crashed Groupon, not once, but at least twice! Also, Gap (the clothing brand) crashed Groupon as well.
00:04:14.480
The reality is, Groupon crashed Groupon. We were not scaling properly, which turned into a bad situation for us.
00:04:26.880
The cobra was getting fatter and fatter. Initially, we started with about three to five hundred commits per month. But over the years, we averaged about 2,000 commits in a single month, with a lot of developers working on the same codebase.
00:04:40.080
And we began to think seriously about service-oriented architecture (SOA) as it was already becoming painful to manage the monolith.
00:04:50.800
However, as we looked into the eyes of the cobra, it scared the heck out of us. We had a lot of scoping problems, largely due to model coupling. One of the biggest issues preventing us from extracting services early on was the natural convention coupling occurring in the models as the code grew.
00:05:06.000
Let's consider an oversimplified example. Say you're on the 'My Groupons' page and want to view all the groupons you purchased, along with their titles. In the monolith, you might encounter a set of dependent relationships that demonstrate cyclical dependencies.
00:05:29.680
Building these types of associations became quite common, which proved problematic. For instance, you'd instantiate a user, leading to a database lookup for that user in the users table to retrieve orders and subsequently map over those orders to get all the deal titles.
00:05:55.600
This design exemplified a violation of the Law of Demeter, which is undesirable. On the surface, the code looked clean, but it effectively coupled our components together, creating challenges.
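As a rough illustration of the chain described above, here is a minimal sketch in plain Ruby. The `User`, `Order`, and `Deal` structs are hypothetical stand-ins for the ActiveRecord models, and `deal_title` and `purchased_deal_titles` are names invented for this example, not Groupon's actual code.

```ruby
# Hypothetical stand-ins for the ActiveRecord models; names are illustrative.
Deal = Struct.new(:title)

Order = Struct.new(:deal) do
  # Demeter-friendly: the order answers for its own deal's title.
  def deal_title
    deal.title
  end
end

User = Struct.new(:orders) do
  # Callers ask the user directly instead of walking orders and deals.
  def purchased_deal_titles
    orders.map(&:deal_title)
  end
end

user = User.new([
  Order.new(Deal.new("Half-off pizza")),
  Order.new(Deal.new("Spa day"))
])

# Coupled version: the caller reaches through User -> Order -> Deal.
coupled_titles = user.orders.map { |order| order.deal.title }

# Decoupled version: only User's public API is touched.
titles = user.purchased_deal_titles
```

Both versions return the same titles, but only the first one forces the calling code to know how users, orders, and deals are wired together, which is the coupling that makes a later extraction hard.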
00:06:09.680
If the code had been designed properly from the beginning, these pitfalls could have been avoided. However, Rails conventions did not encourage this early on. As you can see, our codebase grew significantly, to nearly 2 million lines of code, and the backend began to present a significant operational burden.
00:06:40.000
Testing was painful as well; we had to wait around 45 minutes for builds to run. After running tests, developers had to find something else to occupy their time while the tests executed. Meanwhile, our release engineering team put considerable effort into optimizing for speed.
00:07:05.680
Deployments were tedious. The deployment process for the monolithic application took about three hours. This created a terrible development experience, especially as teams began to divide ownership—they wanted to iterate on features that mattered to them without being held back by the massive monolithic application.
00:07:17.120
Additionally, deploys were happening only once a week, which severely hindered any team that wanted to practice continuous deployment.
00:07:37.160
Development pace was increasing, leading us to contemplate where the best place would be for the next line of code. As heard in a previous talk, the ideal location is within the area you are modifying, but bloated models hinder this flexibility.
00:07:51.600
All these factors contributed to a painful development environment, pushing us further towards more serious service extraction.
00:08:03.440
If there is one takeaway from this first part, it’s that cobras are great—but only up to a point. We needed to alleviate our pain immediately. We required a quick extraction of code.
00:08:54.440
Thus, we decided to extract a new service and build it on top of an existing schema, starting with the order service. It was responsible for significant database contention since many users were buying groupons concurrently.
00:09:09.440
While this was a good problem to have, it was overwhelming our database. By choosing to start with orders, we could create a long-lived model for our domain.
00:09:31.680
To illustrate our objective: we wanted to separate the orders code base from the monolith, allowing it to have its own database while retaining read-only access to the monolith’s database.
00:09:43.200
Consequently, we focused on making sure the order service could access the cobra's database, keeping coupling manageable.
00:10:02.720
Finding all the ways components were coupled was a tricky task, as Rails callbacks and model associations created numerous hidden dependencies.
00:10:12.960
To aid in this process, we built several tools, one of which we dubbed the 'service wall.' The primary goal was to delineate the concerns of orders within the application.
00:10:26.480
Initially, we separated order services into a distinct directory to promote clear boundaries. Within this directory, we housed its app, lib, and specs.
00:10:42.880
In the environment.rb file, we iterated through these services, adding them to the load path so the application would appear as one big application, while, in reality, the code remained separate.
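The load path step might look something like the sketch below. The `services/<name>/{app,lib}` layout and the helper name `add_services_to_load_path` are assumptions for illustration; the talk does not show the actual `environment.rb` code.

```ruby
require "tmpdir"
require "fileutils"

# Adds each service's app/ and lib/ directories to the load path.
# The services/<name>/{app,lib} layout is an assumption for illustration.
def add_services_to_load_path(services_root, load_path = $LOAD_PATH)
  Dir.glob(File.join(services_root, "*")).sort.each do |service_dir|
    next unless File.directory?(service_dir)

    %w[app lib].each do |subdir|
      path = File.join(service_dir, subdir)
      next unless File.directory?(path)

      load_path.unshift(path) unless load_path.include?(path)
    end
  end
  load_path
end

# Usage: build a throwaway tree and confirm `require` finds the service code.
SERVICES_ROOT = Dir.mktmpdir
FileUtils.mkdir_p(File.join(SERVICES_ROOT, "orders", "lib"))
File.write(
  File.join(SERVICES_ROOT, "orders", "lib", "order_totals.rb"),
  "module OrderTotals; def self.sum(values); values.sum; end; end\n"
)

add_services_to_load_path(SERVICES_ROOT)
require "order_totals"
```

The service code lives in its own directory tree, yet the rest of the application can still `require` it as if it were part of one big application.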
00:11:07.680
One way we implemented the service wall was by restricting model access. A controller declares which services' models it should not touch, and access to them is either disabled or deprecated, which keeps violations from creeping in.
00:11:23.520
With disable model access, touching a restricted model during tests raises an error, so a violation of the service wall cannot slip through unnoticed.
00:11:41.760
Alternatively, using the deprecate model access method would only log the infractions without raising errors, allowing us to identify any points of failure during development or staging.
00:11:54.960
This approach is not recommended in production due to performance issues.
00:12:09.480
At the top of your controller, you’d specify either disable model access or deprecate model access, depending on your needs. Importantly, this allows you to exempt actions that are not causing violations, thus enabling you to address issues incrementally.
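To make the idea concrete, here is a small plain-Ruby sketch of how such a wall could behave. Everything in it is hypothetical: the `ServiceWall` module, the `disable_model_access` and `deprecate_model_access` spellings, the `except:` option, and the controller and model classes are invented for illustration and are not Groupon's implementation.

```ruby
# Hypothetical service wall: models report access, the wall raises or logs.
module ServiceWall
  class Violation < StandardError; end

  class << self
    attr_accessor :mode # nil (off), :disable, or :deprecate
    attr_reader :log
  end
  @log = []

  # Runs a block with the wall in the given mode, then restores the old mode.
  def self.enforce(mode)
    previous = self.mode
    self.mode = mode
    yield
  ensure
    self.mode = previous
  end

  # Called by a walled model whenever it is accessed.
  def self.check!(model_name)
    case mode
    when :disable
      raise Violation, "#{model_name} accessed across the service wall"
    when :deprecate
      log << "DEPRECATED: #{model_name} accessed across the service wall"
    end
  end
end

# Stand-in for a Rails controller base class.
class BaseController
  class << self
    attr_reader :wall_mode, :wall_exempt

    def disable_model_access(except: [])
      @wall_mode = :disable
      @wall_exempt = except
    end

    def deprecate_model_access(except: [])
      @wall_mode = :deprecate
      @wall_exempt = except
    end
  end

  # Runs an action with the wall active unless the action is exempt.
  def process(action)
    exempt = Array(self.class.wall_exempt).include?(action)
    ServiceWall.enforce(exempt ? nil : self.class.wall_mode) { public_send(action) }
  end
end

# Stand-in for a model that belongs to the orders service.
class Order
  def self.find(id)
    ServiceWall.check!(name)
    { id: id }
  end
end

class DealsController < BaseController
  disable_model_access except: [:legacy_show]

  def show
    Order.find(1)
  end

  def legacy_show
    Order.find(2)
  end
end

class ReportsController < BaseController
  deprecate_model_access

  def index
    Order.find(3)
  end
end
```

In this sketch, `DealsController#show` raises `ServiceWall::Violation`, the exempt `legacy_show` action still works, and `ReportsController#index` succeeds while appending a line to `ServiceWall.log`. That mirrors the incremental approach described above: exempt what you cannot fix yet, and tighten the wall action by action.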
00:12:23.520
Another challenge we faced during extraction was the leftover cruft from our old domain code. Teams frequently questioned if certain endpoints were even being utilized anymore.
00:12:46.560
In response, a small team of developers created a tool called Route 66 to track down and manage cruft in both our old monolith and the new service.
00:13:07.920
Route 66 answers questions such as whether endpoints are still currently in use. The tool analyzes log files, checking the frequency with which controller actions are accessed.
00:13:18.319
A route that is only hit once a week can be effectively identified, allowing for aggressive de-crufting.
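The kind of log analysis described here can be sketched in a few lines of Ruby. This is not Route 66 itself: the function names are invented, and the regular expression assumes Rails-style "Processing by Controller#action" log lines (it also accepts the older "Processing Controller#action" form), so it would need adjusting for other log formats.

```ruby
# Matches "Processing by OrdersController#show" and "Processing OrdersController#show".
# The log format is an assumption; adjust the pattern to your own logs.
ACTION_LINE = /Processing (?:by )?(\w+(?:::\w+)*)#(\w+)/

# Counts hits per controller action across an enumerable of log lines.
def action_hit_counts(log_lines)
  counts = Hash.new(0)
  log_lines.each do |line|
    match = ACTION_LINE.match(line)
    counts["#{match[1]}##{match[2]}"] += 1 if match
  end
  counts
end

# Returns the known actions that were hit at most `threshold` times.
def rarely_used(known_actions, counts, threshold: 1)
  known_actions.select { |action| counts.fetch(action, 0) <= threshold }
end

log = [
  "Processing by OrdersController#show as HTML",
  "Processing by OrdersController#show as JSON",
  "Processing by DealsController#index as HTML",
  "Completed 200 OK in 12ms"
]
known = %w[OrdersController#show DealsController#index GiftsController#create]

counts = action_hit_counts(log)
stale = rarely_used(known, counts, threshold: 1)
```

With the sample lines above, `stale` contains `DealsController#index` and `GiftsController#create`, which are the candidates you would investigate before deleting anything.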
00:13:37.679
We preferred separating code because it simplified the process, but a notable issue with this extraction practice was that we remained tied to the legacy database.
00:13:54.560
We employed various tactics to address these challenges; for instance, we also considered building greenfield services using a message bus. Sometimes, you must keep the legacy API operational due to numerous client dependencies that exist.
00:14:38.560
Greenfield work is often preferred among developers. In a similar situation, we envisioned the greenfield extraction where the red box represents all new code. A client gem being utilized in the original monolith runs alongside the new service.
00:15:00.959
Once the monolith writes data to its database, a message is sent, which the new service consumes, effectively routing the data to its dedicated data store.
00:15:16.720
What keeps everything in sync during these service cutovers is a background sync worker that performs a one-way data transfer from the old database to the new one.
00:15:33.119
There are pros and cons to this approach as well. One advantage is swiftly eliminating legacy data while building greenfield features, which developers typically enjoy designing.
00:15:49.600
Additionally, minimizing cutover risk through efficient data syncing allows for a smoother transition, preventing significant disruptions due to database changes.
00:16:05.600
However, creating a synchronization worker is no trivial task. It's equally challenging to build a validation engine to ensure data remains in sync and to handle potential race conditions.
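A toy version of the sync worker and validation pass could look like this. Plain hashes stand in for the two databases, and the `version` field used to guard against stale writes is an assumption for illustration; a real worker would also need batching, retries, and far more careful handling of concurrent writes.

```ruby
# One-way sync from a legacy store to a new store, plus a validation pass.
# Plain hashes stand in for the two databases; the record shape is hypothetical.
class SyncWorker
  def initialize(legacy, fresh)
    @legacy = legacy
    @fresh = fresh
  end

  # Copies records whose legacy version is newer. Never writes to legacy.
  def run
    @legacy.each do |id, record|
      current = @fresh[id]
      # Skip if the new store already has this version or a later one,
      # so a late-running sync cannot clobber a newer write.
      next if current && current[:version] >= record[:version]

      @fresh[id] = record.dup
    end
  end

  # Returns the ids that are missing or different in the new store.
  def mismatches
    @legacy.reject { |id, record| @fresh[id] == record }.keys
  end
end

legacy = {
  1 => { version: 2, status: "paid" },
  2 => { version: 1, status: "new" }
}
fresh = { 1 => { version: 1, status: "new" } }

worker = SyncWorker.new(legacy, fresh)
before_sync = worker.mismatches
worker.run
after_sync = worker.mismatches
```

Running the validation before and after the sync shows which records were out of date and confirms that nothing is left mismatched, which is the confidence you want before cutting clients over.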
00:16:18.760
Jason and I work on a team that manages inventory at Groupon. One of our future goals is extracting vouchers from the order service.
00:16:40.080
Vouchers are critical, as they represent what customers redeem. We needed an identification process to differentiate between vouchers neatly stored in a database and those stored in the legacy database.
00:17:19.600
Groupon has grown significantly since the introduction of orders, creating an international platform that operates in numerous locations, including Berlin, London, Chennai, Korea, and many more.
00:17:48.800
Now, the responsibility of our service is to ensure that anyone querying voucher data doesn’t have to understand the intricacies of the various data sources.
00:18:02.480
In managing these diverse sources of truth, we used the manager accessor pattern to structure the codebase. With this pattern, the controller simply asks a designated manager object to run a find request for the voucher.
00:18:46.400
Inside the manager lies all the complexity of data access. In this way, separate accessors interact with various databases—the cobra accessor accesses legacy data while an international accessor might employ HTTP calls to external sources.
00:19:24.960
Ultimately, all the information is consolidated and returned to the controller, making the system more modular while incorporating multiple data sources.
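The shape of the manager and accessor objects might be sketched as follows. The class names, the record hashes, and the lambda standing in for an HTTP client are all hypothetical, and this version simply returns the first voucher any source knows about, which is a simplification of the consolidation described above.

```ruby
# Each accessor hides one data source. Names and record shapes are hypothetical.
class CobraAccessor
  def initialize(rows)
    @rows = rows # stands in for the legacy database
  end

  def find(id)
    @rows[id]
  end
end

class InternationalAccessor
  def initialize(remote)
    @remote = remote # stands in for an HTTP client; any callable works here
  end

  def find(id)
    @remote.call(id)
  end
end

# The manager is the only object the controller talks to.
class VoucherManager
  def initialize(accessors)
    @accessors = accessors
  end

  # Asks each source in order and returns the first voucher found, or nil.
  def find(id)
    @accessors.each do |accessor|
      voucher = accessor.find(id)
      return voucher if voucher
    end
    nil
  end
end

cobra = CobraAccessor.new({ 1 => { code: "ABC", source: :cobra } })
international = InternationalAccessor.new(
  ->(id) { id == 2 ? { code: "XYZ", source: :international } : nil }
)
manager = VoucherManager.new([cobra, international])

legacy_voucher = manager.find(1)
remote_voucher = manager.find(2)
missing_voucher = manager.find(3)
```

The caller never learns which store a voucher came from, so a source can be added, migrated, or retired by changing the accessor list rather than every controller.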
00:19:40.720
Although the facade pattern we implemented has its benefits, it also brings significant complexity, especially as accessors are coupled with schema changes.
00:19:56.720
We must ensure that our legacy schemas still align with necessary data access requests while adapting to new service-oriented architecture.
00:20:12.960
These three extraction patterns we explored are only a small snippet of what Groupon has been navigating. These practices should resonate with many of your own experiences.
00:20:29.760
In addition to these methods, there are excellent discussions at RailsConf, and I encourage everyone to explore those. Consider allowing your teams to decide on tactics suitable for SOA implementation, as you may discover neat strategies previously unknown to you.
00:21:02.760
This experience has taught us many lessons about service extraction. Taming a cobra is serious business. As I always say, 'Yippie-ki-yay, you probably won't need it right now.' But once you start feeling the pains of your architecture, it's essential to prepare for action.
00:21:29.120
The tipping point guiding your shift towards service-oriented architecture isn’t clear-cut. The process is more of an art than a science. As you begin discussing SOA and feel the pressure, ensure you formulate a strategy to tackle these challenges.
00:22:11.920
Don't wait for Oprah to crash your site! It's also crucial that you allow your domain to progress organically. Models that seem vital today may become obsolete over time.
00:22:23.760
The monolith’s true benefit is its ability to support quick iterations. Additionally, when diving into service extraction, always maintain a clear strategy.
00:22:51.440
Identify which elements need breaking apart and which need to remain within the monolith. Understand the priorities relating to these focuses. We can easily run into issues when we operate in a scattershot manner without comprehending our business model.
00:23:09.760
Focus on things that are distinctly their own while considering maintenance challenges or legacy designs that necessitate extraction. You should be cognizant of the unexpected—scope creep can bite you when handling larger codebases.
00:23:38.240
You must also think comprehensively about your service stack and understand how your business operates. Always conceptualize data flow, inter-service agreements, and the need for caching to meet load and latency requirements.
00:24:05.760
It’s vital to consider inter-service messaging during the extraction process. If your services need to communicate, allocate thought to the structure of those messages.
00:24:20.240
Consider the implications of delivery SLAs, guarantees, and payload structures. Remember to also prioritize authentication and authorization protocols.
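One way to think about payload structure and delivery guarantees is sketched below. The envelope fields (`message_id`, `type`, `schema_version`, `payload`) and the consumer class are assumptions made for this example, not a format described in the talk. The consumer assumes at-least-once delivery and skips messages it has already seen.

```ruby
require "json"
require "securerandom"
require "set"

# Builds a versioned message envelope. Field names are illustrative.
def build_message(type, payload)
  JSON.generate({
    "message_id" => SecureRandom.uuid,
    "type" => type,
    "schema_version" => 1,
    "payload" => payload
  })
end

# Consumer that tolerates at-least-once delivery by skipping repeated ids.
class OrderEventConsumer
  attr_reader :handled

  def initialize
    @seen = Set.new
    @handled = []
  end

  # Returns true if the message was processed, false if it was a duplicate.
  def consume(raw)
    message = JSON.parse(raw)
    return false unless @seen.add?(message["message_id"])

    @handled << message["payload"]
    true
  end
end

raw = build_message("order.created", { "order_id" => 42 })
consumer = OrderEventConsumer.new
first_delivery = consumer.consume(raw)
second_delivery = consumer.consume(raw)
```

Carrying an explicit id and schema version in every message gives consumers a way to deduplicate redeliveries and to handle payload changes deliberately instead of by accident.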
00:24:42.400
Building a supportive infrastructure around your services is crucial. We were fortunate at Groupon; we had dedicated teams striving to generate tools for smoother service deployment.
00:25:01.760
In your organization, it's essential to ensure you think critically about simplifying these processes.
00:25:13.440
When discussing service-oriented architecture, it's important to adopt UUIDs from the outset. Doing so effectively separates your systems from your databases.
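As a small illustration, an identifier can be generated in application code with Ruby's standard library rather than taken from a database auto-increment column. The `Voucher` struct and its `create` helper are hypothetical.

```ruby
require "securerandom"

# A record identified by a UUID generated in application code,
# rather than by a database auto-increment column.
Voucher = Struct.new(:uuid, :code) do
  def self.create(code)
    new(SecureRandom.uuid, code)
  end
end

first = Voucher.create("ABC-123")
second = Voucher.create("DEF-456")
```

Because the id does not depend on any one table's sequence, a record can keep the same identifier when it moves from the monolith's database to a new service's data store.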
00:25:30.119
Good code practices are imperative. This is easier said than done, but you should adhere to SOLID principles, ensuring separation of components and reflecting on the purpose of their coupling.
00:25:56.880
Think critically about your models—these will evolve into your service APIs. Ensure public methods clearly articulate service behaviors.
00:26:12.080
As you develop your systems, be cautious about entwining your components. For instance, introducing associations in Rails expands the API, creating new pathways for potential coupling.
00:26:22.960
Always test! It's common knowledge yet needs reiterating. High-level testing should take precedence, especially as you approach service extraction—expect extensive breakage in unit tests.
00:26:39.479
Emphasize solid coverage with high-level, end-to-end tests, particularly since spinning up other services for testing can be challenging. A strong set of integration tests fosters confidence in making changes.
00:26:57.360
Communication is key! As service extraction efforts expand, share findings with other teams. Document solutions, create internal gems, put insights to paper, and deliver presentations.
00:27:15.679
At Groupon, we encourage collaborative environments through our core architecture forum, where developers gather to discuss new services or design problems.
00:27:28.079
In conclusion, cobras are fantastic. Rails is fantastic, and cobras serve valuable functions, but proceed with caution!
00:27:36.839
Deciding to raise a baby cobra means you must be prepared for what follows.
00:27:58.560
By the way, we are hiring! If you want to help us tackle these challenges, please talk to us after this session, check out our booth downstairs, or tweet at us.
00:28:07.920
We owe a debt of gratitude to those who supported us in crafting this talk. The insights we gained from many individuals, who contributed to our service extraction processes, encompass the experiences we presented today.
00:28:31.040
We greatly appreciate having mentors who dedicate their time, knowledge, and support to help us navigate these challenges. Thank you all for your attention!
00:32:03.840
Thank you!