ArrrrCamp 2015
Stop Building Services, Episode 1: The Phantom Menace

Rachel Myers • October 01, 2015 • Ghent, Belgium

Summary (generated with AI):

In "Stop Building Services, Episode 1: The Phantom Menace," Rachel Myers explores the pitfalls of adopting service-oriented architecture (SOA) without proper consideration of code complexity and architecture design. Drawing from her experiences, Myers emphasizes the importance of thoughtful architectural decisions backed by evidence rather than relying on common mantras that often misguide developers.

Key Points Discussed:
- Introduction to SOA: Myers defines service-oriented architecture as a model where applications are built from smaller services rather than a monolithic application. She highlights that while SOA is appealing, it can create brittle systems if not executed properly.

- Critique of Common Beliefs: Myers critiques several widely held beliefs about monoliths, such as their size making them hard to manage or change. She reviews these views with a skeptical perspective, asserting that many stem from confusion about code organization rather than inherent flaws in monolithic architecture.

- Falsifiability in Decision Making: A significant point Myers makes is the concept of falsifiability, which asserts that good architectural decisions should be based on evidence that can change one's perspective. This leads to more informed, rational choices rather than adhering to buzzwords.

- Case Studies of Failures: Myers shares several case studies detailing her past mistakes with SOA, specifically how attempts to separate services led to increased complexity and did not simplify code management. For instance, in one case, they mistakenly intertwined an identity service with their main application, resulting in shared points of failure.

- Collaboration Issues: She addresses how creating separate services did not inherently solve team ownership problems and often exacerbated them, emphasizing that culture and communication within teams are crucial when managing service boundaries.

- Importance of Code Quality: Myers concludes that improving code quality within a monolith is a more stable first step than hastily implementing service-oriented architecture. She encourages refactoring to clear up code complexity before creating new services.

Main Takeaways:
- SOA should not be seen as a universal remedy to issues posed by a monolith.
- Thoughtful consideration of architecture that prioritizes resilience and team collaboration is essential.
- Refactor existing code for comprehensibility before pursuing SOA to enhance overall system performance and maintainability.

Stop Building Services, Episode 1: The Phantom Menace
Rachel Myers • October 01, 2015 • Ghent, Belgium

This talk runs through examples of services creating more brittle systems with worse failure cases. Building on those failure scenarios, I draw out guidelines for if, when, and how to consider Service Oriented Architecture. Software architecture is informed both by principles of software design and principles of operability; for lots of us, there are new skills to learn there. Also, set to music, this talk becomes a cool remake of Star Wars, Episode 1.

Help us caption & translate this video!

http://amara.org/v/H4aN/

ArrrrCamp 2015

00:00:08.059 I'm Rachel Myers. My talk today is called 'Stop Building Services.' I'm going to do a quick intro about myself.
00:00:15.120 I am on the board and a volunteer with an organization called RailsBridge, which holds free Ruby and Rails workshops for people from marginalized groups around the world.
00:00:26.609 A few years ago, we realized that we should split off some of our back-office functions into Bridge Foundry, which supports ClojureBridge, MobileBridge, and Cooperage, among others.
00:00:38.719 So, get in touch with me if you want to help out with RailsBridge or if you have a technology that you want to get into the hands of marginalized people.
00:00:51.510 Another side project I have, with my coworker Jessica Lord at GitHub, is an initiative to get great development environments running on inexpensive hardware like Chromebooks. This can make programming accessible to many more people.
00:01:03.870 If you are doing something similar or want to start, please get in touch with me. My day job is as a Ruby engineer at GitHub, where I work on the New User team.
00:01:15.360 We focus on improving the platform specifically for people who are joining GitHub to learn to code. My talk today includes Star Wars jokes.
00:01:31.799 I’m either happy to tell you or sad to tell you that the whole talk is full of bad Star Wars jokes. Because of this and because I'm presenting something that might be controversial—namely, that we should stop building services—I brought tomatoes.
00:01:45.240 Feel free to throw them at me if you dislike what I say. Just raise your hand, and our emcee can assist you with that. Please don't really hand out any tomatoes.
00:02:04.320 So, I’m also going to use a lot of Ruby examples, but this talk is not just about Ruby. It’s really about how we are making technical decisions.
00:02:11.770 In general, my thesis is that we need to be more thoughtful about how we're building services. We can't just say, 'Build more services.' Building smaller services forever isn't a solution. At some point, we need to discuss why we're experiencing these problems.
00:02:31.000 To clarify, Service-Oriented Architecture (SOA) is a style of software architecture that deploys distinct services instead of a single monolithic application. Outside of Ruby, in the Java world for example, this has a long history, but it's still relatively new for Ruby, and we're figuring it out.
00:02:45.100 Service-Oriented Architecture has become popular in Ruby primarily because Rails apps allow for rapid development, making it easy to build large applications that can become quite difficult to manage.
00:03:00.459 We've told ourselves some mantras that led us to believe in building services. For instance, we often say, 'Monoliths are big.' Yes, that's true. Monoliths can have a ton of code, and because of that, they can be hard to navigate.
00:03:19.329 It's also probably true that in a large application, there are parts that are badly factored, making it confusing.
00:03:24.670 We also tell ourselves that 'Monoliths are hard to change.' This can stem from bad abstractions formed early on. If you built things around a bad abstraction over time, correcting it can become challenging.
00:03:42.670 We claim that monoliths don't work for big teams, meaning if you're responsible for one part of an app and someone else keeps making changes, there's often no good way to enforce boundaries within a monolithic app.
00:04:06.130 Spoiler alert: splitting off a repository just so you can restrict a teammate's commit access is rude, and if you feel the need to do it, you more likely have an organizational problem than a technical one.
00:04:29.560 We also say that monoliths make apps slow, but I don't think this argument holds up. Though there could be scenarios where dedicated workers for subsets of requests might increase speed.
00:04:40.930 Saying 'monoliths aren’t web scale' is simply trolling. If I had 14 different micro-applications, I could scale them and claim to be web scale. I used to believe all these things passionately back in 2012. I’m contrasting my past self with present experiences.
00:05:03.190 With the examples I’m sharing today, I’m reflecting on decisions I made that I now consider poor choices. This isn’t about criticizing anyone else; it’s about confessing my past mistakes.
00:05:28.360 Before diving into these missteps, I want to discuss how we should actually make architectural decisions. I've made many architectural mistakes, and if you asked me in 2012 what to consider for architecture, I would have shared those mantras without question.
00:05:54.100 The point I want to make is that mantras aren't the answer. When making architectural decisions, we need solid reasons and evidence. I want to introduce the idea of falsifiability, a concept from the philosophy of science introduced by Karl Popper.
00:06:13.240 The central idea is that there should be some evidence that could change my mind. If there’s evidence that could cause me to reassess my belief, that belief is falsifiable. Conversely, if there’s no evidence that could change my perspective, that belief is not falsifiable. I propose that our architectural decisions should be rooted in falsifiable beliefs.
00:06:40.000 We need to be convinced through evidence rather than relying on mantras. I used to hold non-falsifiable beliefs, but I eventually changed my mindset.
00:07:05.220 So, today I’m going to propose that we make decisions based on evidence. Next, I will suggest criteria to consider when looking at evidence, and I’ll walk through case studies as a way to examine that evidence.
00:07:30.840 At the end, I hope to draw some conclusions and establish principles for future decisions, such as grouping together things that are likely to change together rather than splitting them into separate services.
00:07:47.590 We should also be aware of the differences between libraries and services, and be careful to not mix those two strategies. Now, what should we look for when assessing our architecture?
00:08:05.470 I believe our architecture should meet three key criteria: First, it should enhance the resilience and fault tolerance of our product. We should prefer architectures that allow us to withstand and recover from failures.
00:08:35.360 We should also avoid introducing unnecessary new fragility. Secondly, the architecture should make working on and improving the product easier. If it’s easier to understand, debug, and enhance features, this indicates a better architecture.
00:08:52.560 Lastly, our architecture influences how well teams can collaborate. Everybody is likely familiar with Conway's Law, which states that the structure of an organization is mirrored in the software design produced by that organization.
00:09:14.930 For example, if there’s a lead architect giving orders and individual teams aren't collaborating effectively, the main app might talk to many services, but those services won't communicate with one another.
00:09:30.470 It’s essential to keep these considerations in mind as we proceed. I am framing this discussion with certain criteria I believe are important.
00:09:43.360 Now, let's move on to the case studies, where I will explain all the ways I messed things up. Before I do that, I need to clarify that I’m an engineer at GitHub, and these stories are not about GitHub.
00:10:00.000 Most of the experience you have when visiting GitHub.com is based on a single monolithic Rails app. We do have services, like our metrics collection service, but even features that aren't core functionality are housed in the main app.
00:10:16.380 For instance, notifications, A/B testing frameworks, and audit logging all live in the main app. The first case study I want to discuss is a feature request for a neglected part of the site where the code was far more tangled than in other areas.
00:10:35.660 This was a highly desirable feature that allowed users to vote on items they wanted sold in the application. This feature is common in e-commerce because people tend to buy items they vote for.
00:10:51.810 Here’s what happened when we attempted this. Our JavaScript was essentially a hot mess.
00:11:03.390 It mixed behavior and presentation, as well as obscure browser fixes, all within a single line of code. There were hundreds of lines similar to this. Remember, the feature was simply voting yes or no; thus, there was no reason for the complexity.
00:11:24.740 Often we didn’t even know if the browser fixes were relevant for browsers we still supported. The code lacked any apparent structure or intention behind its lines.
00:11:37.490 As a response to this confusion, we decided that the JavaScript was unrefactorable—a fascinating idea in retrospect. The server-side code had issues too.
00:11:56.130 I recreated a model to describe the structure: a single Hat model covered both hats that could be voted on, which weren't actual products yet, and hats that could be purchased, even though those two kinds of items have entirely different behaviors.
00:12:12.420 This confused data model violated the single responsibility principle. If we had been properly following Sandi Metz's advice on managing complexity, we would have refactored this into two distinct classes—one for purchasable items and another for votable ones.
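As a rough illustration of that split (the class, column, and association names here are invented for the example, not taken from the actual codebase), a sketch in ActiveRecord terms:

    # Instead of one Hat model that mixes voting and purchasing concerns,
    # each responsibility gets its own class.
    class VotableHat < ActiveRecord::Base
      has_many :votes

      def vote_total
        votes.count
      end
    end

    class PurchasableHat < ActiveRecord::Base
      validates :price_cents, numericality: { greater_than: 0 }

      def in_stock?
        inventory_count > 0
      end
    end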
00:12:29.060 We mistakenly believed that creating services would serve as an alternative approach to managing complexity, thinking we could refactor the code as we built new services.
00:12:49.000 Our project goals included making a small improvement to the feature, and we viewed this as a chance to refactor and better manage complexity.
00:13:02.360 To begin, we had our main database and main app. We created a service that operated with a new database, but then we made one poor decision: we connected it back to our old database to retrieve information on existing hats.
00:13:19.300 This should have been a warning sign. However, we continued to make another bad choice: we realized the main app also needed to know some attributes regarding the voting hats.
00:13:33.310 We couldn’t wait for the API to be built, so we connected the main app directly to the voting hats database. We should pause and recognize that this is what I call 'the diagram of doom.' If you find yourself here, stop what you are doing.
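To make the cross-wiring concrete, here is a hypothetical sketch in ActiveRecord terms (connection and class names are illustrative, not the real code):

    # In the new voting service: an abstract base class pointing at the main
    # app's database, so existing hats are read directly instead of through an API.
    class MainAppRecord < ActiveRecord::Base
      self.abstract_class = true
      establish_connection :main_app_database # a second entry in the service's database.yml
    end

    class LegacyHat < MainAppRecord
      self.table_name = "hats" # the "old" hats still live in the main app's schema
    end

    # The main app then did the reverse, connecting straight to the voting
    # service's database rather than waiting for its API.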
00:13:53.790 Drawing this type of diagram signals that you didn't understand the code you were extracting well enough to identify its boundaries. Rather than creating a neatly defined new service, we instead mirrored the poorly factored code in our original app with an ecosystem of services.
00:14:24.000 We found so much complexity in the model and its associations with hats that we weren't willing to spend time to untangle it. Instead, we packaged these models into a gem and included that in both our main app and service.
00:14:42.290 I assure you, I make better decisions now; I'm just sharing my past mistakes. The first failure here was drawing the incorrect lines around our services.
00:15:03.520 This is a significant risk that often goes unaddressed in discussions regarding services. Creating classes with well-defined responsibilities is essential before attempting to build a service.
00:15:20.990 More importantly, services are not a substitute for addressing code complexity. Teams driven to write services because they are struggling with messy code will not succeed.
00:15:41.909 If you’re dealing with unmanaged complexity, focus on refactoring that code until you have a proper grasp on it. Only then should you reconsider building a service.
00:16:04.640 In our example, we locked bad code into our new architecture. By the way, I drew those databases using Keynote's drawing tools—what do you think?
00:16:23.480 It took me a while to realize that services and libraries are designed for two different purposes: a library encapsulates behavior in your app, whereas a service extracts behavior, relieving you of responsibility.
00:16:42.290 If you find yourself extracting services and then sharing libraries between them, it indicates that something has gone wrong. Either you've drawn the boundaries incorrectly or haven’t fully extracted behavior.
00:17:00.490 Returning to our architectural criteria, this project was overly focused on improving code quality, but we ultimately failed to make it easier to understand and enhance the code.
00:17:21.779 So, let's discuss how we handle this at GitHub today. GitHub remains a massive monolithic Rails app with corners that are neglected, poorly understood, and inadequately tested. Refactoring is critical, but risky.
00:17:39.000 To aid with refactoring and rewrites, we maintain an open-source project called Scientist. It lets you define an old code path and a new code path, run both in production, and report any discrepancies.
00:17:57.300 You can investigate these discrepancies to determine whether an edge case needs to be accounted for in your new code path or if it's actually a bug in the old code. It’s surprising how often the latter is the case.
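For reference, a minimal sketch of how an experiment is wired up with the Scientist gem (the class and the two checker methods are illustrative stand-ins, not GitHub's actual permissions code):

    require "scientist"

    class PermissionsChecker
      include Scientist

      def allowed?(user, repo)
        science "repo-permissions" do |experiment|
          experiment.use { legacy_check(user, repo) }     # old code path; its result is what gets returned
          experiment.try { refactored_check(user, repo) } # new code path; its result is only compared
        end
      end

      # Stand-ins for the real old and new implementations:
      def legacy_check(user, repo)
        repo.legacy_collaborators.include?(user)
      end

      def refactored_check(user, repo)
        repo.collaborators.include?(user)
      end
    end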
00:18:25.000 This way, you can ship your code and test it with real users without impacting their experience. For more on this, my coworker Jesse Toth gave a great talk called 'Easy Rewrites with Ruby and Science.' It's about her approach to writing the permissions code and handling edge cases.
00:18:42.000 She successfully rolled out tested, reliable, and well-factored code, which contrasted sharply with what existed prior to her efforts.
00:19:04.950 In our second case study, we began building services to manage authentication and authorization because we anticipated needing to authenticate across many applications.
00:19:24.950 This seemed like a natural first step for us, given that our complexity was concentrated in our user model, which would allow us to streamline things.
00:19:45.490 However, we were also experiencing issues with team ownership and responsibility. Some teams consuming the API would make changes without consulting those responsible.
00:20:02.050 Furthermore, we noticed people making proposals for code they weren't accountable for, only to leave without following through—something I've termed the 'swoop and poop.'
00:20:19.050 This was particularly problematic for the team maintaining the user code, and extracting that code seemed to provide the complexity management and separation we needed.
00:20:34.680 Looking back at our architectural goals focused on improving code understanding and team collaboration, we learned from our past mistakes.
00:20:52.200 Rather than creating a new database for the identity service, we pointed it to our existing main database, allowing requests to flow through.
00:21:12.520 Initially, everything worked smoothly. We felt accomplished, like we were living the dream, finally running services.
00:21:27.570 Simultaneously, around this time, we launched an iOS app that would need to authenticate users, similar to the main app.
00:21:43.160 You would think that authentication would work seamlessly; a request would come in from the iOS app to the identity service, hitting the main database for a response.
00:22:07.040 Yet, since user objects were modified across various areas of the app, we could never fully extract all user behavior into the identity service.
00:22:29.780 As a consequence, the identity service had to repeatedly call back to the main application for every authorization and authentication query.
00:22:45.950 This was a failure; we ended up creating the identity service but failed to make it functional on its own. We inadvertently set ourselves up with significant operational complexity.
00:23:04.400 Now, instead of having one single point of failure with our main app, we created two interdependent applications. If either of these applications had issues, it would crash the entire system.
00:23:19.890 To make a Star Wars analogy, think of the droid control ship: wave after wave of fighters attack it, and although its shields hold, one lucky shot brings it down, and every droid it controls shuts off at once.
00:23:39.090 That’s what we created—a massive single point of failure in our architecture.
00:23:54.570 Furthermore, the identity service wasn't taking any meaningful load off our main app, because it was still reliant on the main database.
00:24:11.150 We ended up with more network calls than before. It's common to hear that Service-Oriented Architecture (SOA) helps reduce load on an app and supports scalability.
00:24:28.150 However, the relationship between SOA and scaling can be complex. Initially, we believed that adding more workers to offload tasks from the main app would enhance scalability.
00:24:49.920 But the reality is that the identity service could still become a bottleneck, as the coffee shop analogy illustrates: several separate lines, one per register, versus a single line feeding every register.
00:25:11.390 In many cases, having a single pool of workers may be more efficient than attempting to divide responsibilities too simplistically.
00:25:31.030 We also needed to accurately account for the time spent on network calls, which can quickly add up. With the identity service needing to regularly communicate with the main app, we encountered additional delays.
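To put rough, purely illustrative numbers on that (these are not figures from the talk): if an authorization check that used to be an in-process method call taking microseconds becomes one HTTP call to the identity service plus one call back to the main app at roughly 5 ms each, a page that performs ten such checks picks up on the order of 10 × 2 × 5 ms = 100 ms of added network time, before counting retries or timeouts.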
00:25:47.930 When evaluating the situation, our architecture inadvertently led to increased fragility and less resilience, creating two concurrent points of failure.
00:26:05.780 Regarding team responsibilities, we attempted to define clearer roles but didn’t effectively solve our core issue of team members respecting each other's boundaries.
00:26:20.700 As a solution, every controller class is assigned a responsible team. These values are then tracked in our error monitoring system.
00:26:39.330 This method allows me to see the errors in my areas of responsibility and, consequently, creates a culture where teams need to coordinate changes before merging code.
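A hypothetical sketch of that setup (class, team, and hook names are made up for illustration; this is not GitHub's actual implementation):

    # Each controller declares an owning team.
    class ApplicationController < ActionController::Base
      class_attribute :responsible_team, default: "unowned"

      def self.owned_by(team)
        self.responsible_team = team
      end
    end

    class VotingController < ApplicationController
      owned_by "new-user-experience"
    end

    # In whatever error-reporting hook the app uses, attach the owning team as
    # metadata so each team can filter the error tracker to its own controllers;
    # the exact call depends on the monitoring service, for example:
    #   error_report.context[:team] = controller.class.responsible_team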
00:27:16.540 Our third case study was not necessarily destined to fail, but poor implementation led to complications.
00:27:34.580 Initially, a single set of assets for our website was enough, but we soon developed a mobile web experience and several services, with native apps on the horizon, all of which needed those assets.
00:27:34.580 To avoid repeating our resources, we sought to share these assets across all applications—everything from shared headers to e-commerce functionality like size charts.
00:27:50.390 Instead of duplicating efforts, we thought we could create a gem to consolidate our assets, enabling each app to fetch them as needed.
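A minimal sketch of that approach (gem and module names are illustrative): package the shared assets inside a Rails engine so each host app's asset pipeline picks them up when the gem is bundled.

    # shared_assets/lib/shared_assets.rb
    require "shared_assets/engine" if defined?(Rails)

    module SharedAssets
      VERSION = "0.1.0"
    end

    # shared_assets/lib/shared_assets/engine.rb
    module SharedAssets
      class Engine < ::Rails::Engine
        # Anything under the engine's app/assets directory (shared header styles,
        # size-chart widgets, and so on) becomes visible to the host app's asset
        # pipeline, so every app renders the same assets. The flip side, described
        # next, is that every app has to bump the gem and redeploy together
        # whenever those assets change.
      end
    end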
00:28:08.670 While this was a convenient solution initially, problems arose when assets needed to be updated.
00:28:23.550 In those cases, a significant change to a shared asset had to appear uniformly across every application at once.
00:28:37.290 As a result, we had to coordinate redeploying all of the apps simultaneously, something our original deploy process was never designed for.
00:28:54.050 This became a major pain point. While we drew slightly better boundaries this time, we regretted this architecture because our deployment workflow complicated things.
00:29:09.290 Ultimately, we discovered that our deployment processes hindered our ability to implement changes effectively.
00:29:25.950 Additionally, when new team members joined, they couldn't easily jump in and contribute, which had been a strength of working within a monolith.
00:29:45.680 People would frequently get confused when changing aspects of the application housed within the gem, as this led to issues around permissions and updates.
00:30:00.810 In retrospect, while we successfully avoided duplicating our assets, the price we paid in making that code harder to find and change proved greater than expected.
00:30:20.390 Though this change wasn't as severe as those in our identity service, it still didn’t deliver the results we initially sought.
00:30:40.030 I'd like to explore some alternatives that could have improved this process. We could have built an asset service that every application fetched assets from, rather than relying on a gem baked into each app.
00:30:59.790 That would have allowed uniform changes across the apps without sacrificing flexibility or taking the logistical hit of coordinated redeploys.
00:31:20.900 It's also worth noting that monolithic apps like our own often face similar issues when trying to manage larger teams or complex features, after all.
00:31:31.820 Lastly, I want to address the importance of failing gracefully. In previous iterations of this talk, some attendees voiced skepticism regarding the necessity of building services.
00:31:49.420 So, I will now present a situation where building a service actually worked out well. It involved developing a social feature within an already established site.
00:32:06.640 For example, we received a request for a comments feature on our main page. While comments weren't core to the website, we wanted to make sure they couldn't compromise central functionality.
00:32:24.120 Building this feature as a separate service allowed us to ensure that if the comment system failed or lagged, the rest of the site would remain operational.
00:32:39.980 Some finishing details had to be ironed out to ensure that failures had minimal impact on the user experience. Ultimately, we were able to accomplish our goals.
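A hypothetical sketch of that failure mode (the host name and method are illustrative): the page asks the comments service for data with a short timeout, and if the service is slow or down it simply renders without comments.

    require "net/http"
    require "json"

    COMMENTS_URI = URI("https://comments.example.com") # illustrative host

    def fetch_comments(page_id)
      Net::HTTP.start(COMMENTS_URI.host, COMMENTS_URI.port,
                      use_ssl: true, open_timeout: 0.5, read_timeout: 0.5) do |http|
        response = http.get("/pages/#{page_id}/comments")
        response.is_a?(Net::HTTPSuccess) ? JSON.parse(response.body) : []
      end
    rescue Net::OpenTimeout, Net::ReadTimeout, SystemCallError, JSON::ParserError
      [] # degrade gracefully: an empty comment list, never a broken page
    end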
00:32:55.280 As we conclude, I want to share that while separating out services can boost resilience, service-oriented architecture still presents a variety of pitfalls to stay aware of.
00:33:12.000 First, if you do embrace SOA, make sure your code is sufficiently untangled before you extract anything; only then will you understand it well enough to draw the right boundaries.
00:33:30.600 Second, it's crucial to remain vigilant about potential points of catastrophic failure and to keep their number to a minimum; doing so brings advantages ranging from happier developers to greater operational integrity.
00:33:48.730 Lastly, remember to group together the parts of your application that are likely to change together; that foresight will spare you many pain points during development.
00:34:05.810 In closing, I reiterate that many motivations for pursuing SOA stem from frustration with code quality, yet services can easily exacerbate those same issues.
00:34:19.050 The first step should always be to focus on enhancing code quality within your core monolith. If there’s only one takeaway from today, let it be that.
00:34:37.360 I’d like to express my appreciation for the amazing artwork displayed, drawn by Sam Brown, and for the generous memes from the website Meme Center, which were used throughout.
00:34:49.990 If you feel inclined to either agree or disagree with me, I invite you to engage in civil discourse. You can reach out to me; I’m Rachel Myers.
00:35:08.490 Thank you for your attention.