00:00:08.059
I'm Rachel Myers. My talk today is called 'Stop Building Services.' I'm going to do a quick intro about myself.
00:00:15.120
I am on the board and a volunteer with an organization called RailsBridge, which holds free workshops for marginalized people around the world, where we teach Ruby and Rails.
00:00:26.609
A few years ago, we realized that we should split off some of our back-office functions into Bridge Foundry, which supports ClojureBridge, MobileBridge, and Cooperage, among others.
00:00:38.719
So, get in touch with me if you want to help out with RailsBridge or if you have a technology that you want to get into the hands of marginalized people.
00:00:51.510
Another side project I have, with my coworker Jessica Lord at GitHub, is an initiative to get great development environments running on inexpensive hardware like Chromebooks. This can make programming accessible to many more people.
00:01:03.870
If you are doing something similar or want to start, please get in touch with me. My day job is as a Ruby engineer at GitHub, where I work on the New User team.
00:01:15.360
We focus on improving the platform specifically for people who are joining GitHub to learn to code. My talk today includes Star Wars jokes.
00:01:31.799
I’m either happy to tell you or sad to tell you that the whole talk is full of bad Star Wars jokes. Because of this and because I'm presenting something that might be controversial—namely, that we should stop building services—I brought tomatoes.
00:01:45.240
Feel free to throw them at me if you dislike what I say. Just raise your hand, and our emcee can assist you with that. Please don't actually throw any tomatoes.
00:02:04.320
So, I’m also going to use a lot of Ruby examples, but this talk is not just about Ruby. It’s really about how we are making technical decisions.
00:02:11.770
In general, my thesis is that we need to be more thoughtful about how we're building services. We can't just say, 'Build more services.' Building smaller services forever isn't a solution. At some point, we need to discuss why we're experiencing these problems.
00:02:31.000
To clarify, Service-Oriented Architecture (SOA) is a style of software architecture that deploys distinct services instead of a single monolithic application. Outside of Ruby, notably in Java, this has a long history, but it's still relatively new for Ruby, and we're figuring it out.
00:02:45.100
Service-Oriented Architecture has become popular in Ruby primarily because Rails apps allow for rapid development, making it easy to build large applications that can become quite difficult to manage.
00:03:00.459
We've told ourselves some mantras that led us to believe in building services. For instance, we often say, 'Monoliths are big.' Yes, that's true. Monoliths can have a ton of code, and because of that, they can be hard to navigate.
00:03:19.329
It's also probably true that in a large application, there are parts that are badly factored, making it confusing.
00:03:24.670
We also tell ourselves that 'Monoliths are hard to change.' This often stems from a bad abstraction formed early on: if you've built things around that abstraction over time, correcting it becomes challenging.
00:03:42.670
We claim that monoliths don't work for big teams, meaning if you're responsible for one part of an app and someone else keeps making changes, there's often no good way to enforce boundaries within a monolithic app.
00:04:06.130
Spoiler alert: splitting repositories apart just so you can restrict a teammate's commit access is rude. If you're tempted to do that, it's more likely you have an organizational problem than an architectural one.
00:04:29.560
We also say that monoliths make apps slow. I don't think this argument holds up, though there could be scenarios where dedicating workers to a subset of requests increases speed.
00:04:40.930
Saying 'monoliths aren't web scale' is simply trolling. If I had 14 different micro-applications, I could scale them and claim to be web scale. I believed all of these things passionately back in 2012; today I'm contrasting my past self with what I've learned since.
00:05:03.190
With the examples I’m sharing today, I’m reflecting on decisions I made that I now consider poor choices. This isn’t about criticizing anyone else; it’s about confessing my past mistakes.
00:05:28.360
Before diving into these missteps, I want to discuss how we should actually make architectural decisions. I've made many architectural mistakes, and if you asked me in 2012 what to consider for architecture, I would have shared those mantras without question.
00:05:54.100
The point I want to make is that mantras aren't the answer. When making architectural decisions, we need solid reasons and evidence. I want to introduce the idea of falsifiability, a concept from the philosophy of science introduced by Karl Popper.
00:06:13.240
The central idea is that there should be some evidence that could change my mind. If there’s evidence that could cause me to reassess my belief, that belief is falsifiable. Conversely, if there’s no evidence that could change my perspective, that belief is not falsifiable. I propose that our architectural decisions should be rooted in falsifiable beliefs.
00:06:40.000
We need to be convinced through evidence rather than relying on mantras. I used to hold non-falsifiable beliefs, but I eventually changed my mindset.
00:07:05.220
So, today I’m going to propose that we make decisions based on evidence. Next, I will suggest criteria to consider when looking at evidence, and I’ll walk through case studies as a way to examine that evidence.
00:07:30.840
At the end, I hope to draw some conclusions and establish principles for future decisions, such as grouping together things that are likely to change together rather than splitting them into separate services.
00:07:47.590
We should also be aware of the differences between libraries and services, and be careful not to mix the two strategies. Now, what should we look for when assessing our architecture?
00:08:05.470
I believe our architecture should meet three key criteria: First, it should enhance the resilience and fault tolerance of our product. We should prefer architectures that allow us to withstand and recover from failures.
00:08:35.360
We should also avoid introducing unnecessary new fragility. Secondly, the architecture should make working on and improving the product easier. If it’s easier to understand, debug, and enhance features, this indicates a better architecture.
00:08:52.560
Lastly, our architecture influences how well teams can collaborate. Everybody is likely familiar with Conway's Law, which states that the structure of an organization is mirrored in the software design produced by that organization.
00:09:14.930
For example, if there’s a lead architect giving orders and individual teams aren't collaborating effectively, the main app might talk to many services, but those services won't communicate with one another.
00:09:30.470
It’s essential to keep these considerations in mind as we proceed. I am framing this discussion with certain criteria I believe are important.
00:09:43.360
Now, let's move on to the case studies, where I will explain all the ways I messed things up. Before I do that, I need to clarify that I’m an engineer at GitHub, and these stories are not about GitHub.
00:10:00.000
Most of the experience you have when visiting GitHub.com is based on a single monolithic Rails app. We do have services, like our metrics collection service, but even features that aren't core functionality are housed in the main app.
00:10:16.380
For instance, notifications, A/B testing frameworks, and audit logging all live in the main app. The first case study I want to discuss is a feature request for a neglected part of the site where the code was far more tangled than in other areas.
00:10:35.660
This was a highly desirable feature that allowed users to vote on items they wanted sold in the application. This feature is common in e-commerce because people tend to buy items they vote for.
00:10:51.810
Here’s what happened when we attempted this. Our JavaScript was essentially a hot mess.
00:11:03.390
It mixed behavior, presentation, and obscure browser fixes, all within a single line of code, and there were hundreds of lines like this. Remember, the feature was simply voting yes or no; there was no reason for that much complexity.
00:11:24.740
Often we didn’t even know if the browser fixes were relevant for browsers we still supported. The code lacked any apparent structure or intention behind its lines.
00:11:37.490
As a response to this confusion, we decided that the JavaScript was unrefactorable—a fascinating idea in retrospect. The server-side code had issues too.
00:11:56.130
I've recreated a model to describe the structure: the same model covered hats that could be voted on, even though they weren't actual products yet, and hats that could be purchased, despite those two sets of items having entirely different behaviors.
00:12:12.420
This confused data model violated the single responsibility principle. If we had been properly following Sandi Metz's advice on managing complexity, we would have refactored this into two distinct classes: one for purchasable items and another for votable ones.
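To make that concrete, here is a minimal sketch of the refactoring that advice points toward; the class, association, and method names are hypothetical, not from the actual codebase:

```ruby
# Before: one model with two unrelated responsibilities
# (all names here are hypothetical).
class Hat < ActiveRecord::Base
  has_many :votes
  has_many :orders

  def vote!(user)     # only meaningful before the hat is a real product
    votes.create!(user: user)
  end

  def purchase!(user) # only meaningful once the hat is for sale
    orders.create!(user: user)
  end
end

# After: two classes, each with a single responsibility.
class VotableHat < ActiveRecord::Base
  has_many :votes

  def vote!(user)
    votes.create!(user: user)
  end
end

class PurchasableHat < ActiveRecord::Base
  has_many :orders

  def purchase!(user)
    orders.create!(user: user)
  end
end
```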
00:12:29.060
We mistakenly believed that creating services would serve as an alternative approach to managing complexity, thinking we could refactor the code as we built new services.
00:12:49.000
Our project goals included making a small improvement to the feature, and we viewed this as a chance to refactor and better manage complexity.
00:13:02.360
To begin, we had our main database and main app. We created a service that operated with a new database, but then we made one poor decision: we connected it back to our old database to retrieve information on existing hats.
00:13:19.300
This should have been a warning sign. Instead, we went on to make another bad choice: we realized the main app also needed to know some attributes of the voting hats.
00:13:33.310
We couldn’t wait for the API to be built, so we connected the main app directly to the voting hats database. We should pause and recognize that this is what I call 'the diagram of doom.' If you find yourself here, stop what you are doing.
00:13:53.790
Drawing this type of diagram signals that you didn't understand the code you were extracting well enough to identify its boundaries. Rather than creating a neatly defined new service, we instead mirrored the poorly factored code in our original app with an ecosystem of services.
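In Rails terms, the tangle looked roughly like this; a sketch with hypothetical model and connection names, since the real code was considerably messier:

```ruby
# Inside the new voting service: instead of owning its data, one model
# reaches back into the main app's database (names are hypothetical).
class Hat < ActiveRecord::Base
  establish_connection :main_app_db        # service -> old database
end

class VotingHat < ActiveRecord::Base
  # lives in the service's own, new database
end

# Meanwhile, inside the main app: rather than wait for the service's
# API, a model points straight at the service's database.
class VotingHat < ActiveRecord::Base
  establish_connection :voting_service_db  # main app -> new database
end
```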
00:14:24.000
We found so much complexity in the model and its associations with hats that we weren't willing to spend time to untangle it. Instead, we packaged these models into a gem and included that in both our main app and service.
00:14:42.290
I assure you, I make better decisions now; I'm just sharing my past mistakes. The first failure here was drawing the incorrect lines around our services.
00:15:03.520
This is a significant risk that often goes unaddressed in discussions regarding services. Creating classes with well-defined responsibilities is essential before attempting to build a service.
00:15:20.990
More importantly, services are not a substitute for addressing code complexity. Teams driven to write services because they are struggling with messy code will not succeed.
00:15:41.909
If you’re dealing with unmanaged complexity, focus on refactoring that code until you have a proper grasp on it. Only then should you reconsider building a service.
00:16:04.640
In our example, we locked the bad code into our new architecture. By the way, I drew those databases using Keynote's drawing tools; what do you think?
00:16:23.480
It took me a while to realize that services and libraries are designed for two different purposes: a library encapsulates behavior in your app, whereas a service extracts behavior, relieving you of responsibility.
00:16:42.290
If you find yourself extracting services and then sharing libraries between them, it indicates that something has gone wrong. Either you've drawn the boundaries incorrectly or haven’t fully extracted behavior.
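The difference shows up in the calling code. Here is a rough sketch, with a hypothetical gem name and endpoint: a library runs inside your process and you still own its behavior, while a service sits behind a network boundary that another team operates.

```ruby
# Library: the behavior is encapsulated, but it still runs in
# your process and ships with your deploys.
require "hat_voting"                           # hypothetical gem
HatVoting.record_vote(hat_id: 42, user_id: 7)

# Service: the behavior is extracted behind a network boundary;
# someone else owns its deploys, uptime, and data.
require "net/http"
require "json"

uri = URI("https://voting.example.com/votes")  # hypothetical endpoint
Net::HTTP.post(uri,
               { hat_id: 42, user_id: 7 }.to_json,
               "Content-Type" => "application/json")
```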
00:17:00.490
Returning to our architectural criteria, this project was squarely aimed at improving code quality, yet we ultimately failed to make the code easier to understand and enhance.
00:17:21.779
So, let's talk about how we handle this at GitHub today. GitHub remains a massive monolithic Rails app, with corners that are neglected, poorly understood, and inadequately tested. Refactoring is critical, but risky.
00:17:39.000
To aid with refactoring and rewrites, we maintain an open-source project called Scientist. It lets you define an old code path and a new code path, run both in production, and report any discrepancies.
00:17:57.300
You can investigate these discrepancies to determine whether an edge case needs to be accounted for in your new code path or if it's actually a bug in the old code. It’s surprising how often the latter is the case.
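The shape of an experiment looks roughly like this; the class and the two predicate methods are hypothetical stand-ins, but the science/use/try pattern is Scientist's actual interface:

```ruby
require "scientist"

class HatPermissions
  include Scientist

  def allowed?(user)
    science "hat-permissions" do |experiment|
      experiment.use { legacy_allowed?(user) }     # old path (control)
      experiment.try { refactored_allowed?(user) } # new path (candidate)
    end
    # `science` runs both paths, records any mismatch, and returns
    # the control's result, so users only ever see the old behavior.
  end
end
```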
00:18:25.000
This way, you can ship your code and test it with real users without impacting their experience. For more on this, my coworker Jesse Toth gave a great talk called 'Easy Rewrites with Ruby and Science' about her approach to rewriting our permissions code and handling the edge cases.
00:18:42.000
She successfully rolled out tested, reliable, and well-factored code, which contrasted sharply with what existed prior to her efforts.
00:19:04.950
In our second case study, we began building services to manage authentication and authorization because we anticipated needing to authenticate across many applications.
00:19:24.950
This seemed like a natural first extraction for us, since so much of our complexity was concentrated in the user model, and pulling it out promised to streamline things.
00:19:45.490
However, we were also experiencing issues with team ownership and responsibility. Some teams consuming the API would make changes without consulting those responsible.
00:20:02.050
Furthermore, we noticed people making proposals for code they weren't accountable for and then leaving without following through, something I've termed the 'swoop and poop.'
00:20:19.050
This was particularly problematic for the team maintaining the user code, and extracting that code seemed to provide the complexity management and separation we needed.
00:20:34.680
Looking back at our architectural criteria, this project focused on making the code easier to understand and on helping teams work together, and we had learned from our earlier mistake.
00:20:52.200
Rather than creating a new database for the identity service, we pointed it to our existing main database, allowing requests to flow through.
00:21:12.520
Initially, everything worked smoothly. We felt accomplished, like we were living the dream, finally running services.
00:21:27.570
Around this time, we launched an iOS app that would need to authenticate users, just like the main app.
00:21:43.160
You would think that authentication would work seamlessly; a request would come in from the iOS app to the identity service, hitting the main database for a response.
00:22:07.040
Yet, since user objects were modified across various areas of the app, we could never fully extract all user behavior into the identity service.
00:22:29.780
As a consequence, the identity service had to repeatedly call back to the main application for every authorization and authentication query.
00:22:45.950
This was a failure; we ended up creating the identity service but failed to make it functional on its own. We inadvertently set ourselves up with significant operational complexity.
00:23:04.400
Now, instead of having one single point of failure with our main app, we created two interdependent applications. If either of these applications had issues, it would crash the entire system.
00:23:19.890
To make a Star Wars analogy, it's like the droid control ship: the fighters keep attacking, and even with its shields up, a single shot from the inside brings it down, and the entire droid army shuts off with it.
00:23:39.090
That’s what we created—a massive single point of failure in our architecture.
00:23:54.570
Furthermore, the identity service wasn't taking any meaningful load off our main app, because it was still reliant on the main database.
00:24:11.150
We ended up with more network calls than before. It's common to hear that Service-Oriented Architecture (SOA) helps reduce load on an app and supports scalability.
00:24:28.150
However, the relationship between SOA and scaling can be complex. Initially, we believed that adding more workers to offload tasks from the main app would enhance scalability.
00:24:49.920
But the reality is that the identity service could still be a bottleneck. Think of the coffee shop analogy: a single line feeding all the baristas usually beats multiple separate lines, because no one gets stuck behind a single slow order.
00:25:11.390
In many cases, having a single pool of workers may be more efficient than attempting to divide responsibilities too simplistically.
00:25:31.030
We also needed to accurately account for the time spent on network calls, which can quickly add up. With the identity service needing to regularly communicate with the main app, we encountered additional delays.
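Some back-of-the-envelope arithmetic shows why; the latencies below are illustrative assumptions, not measurements from any real system:

```ruby
# Rough, assumed latencies: an in-process method call is on the order
# of a microsecond; a network hop inside a data center costs milliseconds.
IN_PROCESS_CALL = 0.000_001 # seconds
NETWORK_CALL    = 0.005     # seconds

# Ten auth checks that used to be in-process method calls:
monolith = 10 * IN_PROCESS_CALL        # => 0.00001 s

# The same ten checks as calls to the identity service, where each
# one also calls back to the main app:
with_service = 10 * (2 * NETWORK_CALL) # => 0.1 s

puts format("monolith: %.5fs, with service: %.5fs", monolith, with_service)
```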
00:25:47.930
Evaluating the situation against our criteria, our architecture inadvertently increased fragility and reduced resilience, leaving us with two interdependent points of failure.
00:26:05.780
Regarding team responsibilities, we attempted to define clearer roles but didn’t effectively solve our core issue of team members respecting each other's boundaries.
00:26:20.700
The solution we use today: every controller class is assigned a responsible team, and those assignments are tracked in our error monitoring system.
00:26:39.330
This method allows me to see the errors in my areas of responsibility and, consequently, creates a culture where teams need to coordinate changes before merging code.
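A minimal sketch of that idea; the `owned_by` helper, the team name, and `ErrorTracker` are hypothetical illustrations, not GitHub's actual implementation:

```ruby
# Hypothetical sketch: each controller declares its owning team,
# and errors get tagged with that team for monitoring.
class ApplicationController < ActionController::Base
  class_attribute :responsible_team

  def self.owned_by(team)
    self.responsible_team = team
  end

  rescue_from StandardError do |error|
    # Tag the error with the owning team before re-raising, so each
    # team's dashboard shows only the failures it is responsible for.
    ErrorTracker.notify(error, team: self.class.responsible_team)
    raise error
  end
end

class HatsController < ApplicationController
  owned_by :new_user_experience
end
```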
00:26:58.300
Our third case study was not necessarily destined to fail, but a poor implementation led to complications.
00:27:16.540
Initially, we had one website that needed one unified set of assets, but we soon had a mobile web experience and several services, and potentially native apps as well.
00:27:34.580
To avoid repeating our resources, we sought to share these assets across all applications—everything from shared headers to e-commerce functionality like size charts.
00:27:50.390
Instead of duplicating efforts, we thought we could create a gem to consolidate our assets, enabling each app to fetch them as needed.
00:28:08.670
While this was a convenient solution initially, problems arose when assets needed to be updated.
00:28:23.550
In those cases, a significant change to an asset had to show up everywhere at once to keep the applications uniform.
00:28:37.290
As a result, we had to coordinate redeploying every application simultaneously, something our original deploy process was never designed for.
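Concretely, every consuming app carried a pinned dependency along these lines (the gem name is hypothetical), so shipping even a small asset change meant cutting a release and redeploying each app in lockstep:

```ruby
# Gemfile in each of the consuming apps (hypothetical gem name).
# Any shared-asset change means releasing a new version of the gem,
# bumping it here, and redeploying every app at the same time.
gem "shared_assets", "~> 1.4"
```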
00:28:54.050
This became a major pain point. While we drew slightly better boundaries this time, we came to regret this architecture because of what it did to our deployment workflow.
00:29:09.290
Ultimately, we discovered that our deployment processes hindered our ability to implement changes effectively.
00:29:25.950
Additionally, when new team members joined, they couldn't easily jump in and contribute, which had been a strength of working within a monolith.
00:29:45.680
People would frequently get confused when the part of the application they wanted to change lived in the gem, which led to issues around who could update it and how changes rolled out.
00:30:00.810
In retrospect, while we successfully avoided duplicating our assets, the price we paid in making the code harder to reach proved greater than expected.
00:30:20.390
Though this change wasn't as severe as those in our identity service, it still didn’t deliver the results we initially sought.
00:30:40.030
I'd like to explore some alternatives that could have improved this process. We should have created an asset service rather than managing assets through a gem.
00:30:59.790
That would have let us ship a uniform change once and have every app pick it up immediately, without sacrificing flexibility or taking on the logistical overhead of coordinated deploys.
00:31:20.900
It's also worth noting that monolithic apps like our own often face similar issues when managing larger teams or complex features, after all.
00:31:31.820
Lastly, I want to address the importance of failing gracefully. In previous iterations of this talk, some attendees voiced skepticism regarding the necessity of building services.
00:31:49.420
So, I will now present a situation where building a service actually worked out well. It involved developing a social feature within an already established site.
00:32:06.640
For example, we received a request for a comments feature on our main page. It wasn't core to the website, and we wanted to make sure it couldn't compromise the site's central functionality.
00:32:24.120
Building this feature as a separate service allowed us to ensure that if the comment system failed or lagged, the rest of the site would remain operational.
00:32:39.980
Some details had to be ironed out to ensure failures had minimal impact on the user experience, but ultimately we accomplished our goals.
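The mechanism that makes this work is a tight timeout plus a harmless fallback. Here is a sketch with a hypothetical comments endpoint: if the service is slow or down, the page simply renders without comments.

```ruby
require "net/http"
require "json"

# Hypothetical client for the comments service.
def fetch_comments(page_id)
  uri = URI("https://comments.example.com/pages/#{page_id}/comments")
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: true,
                             open_timeout: 0.2,  # fail fast on connect
                             read_timeout: 0.2) do |http|
    http.get(uri.request_uri)
  end
  JSON.parse(response.body)
rescue StandardError
  [] # fail gracefully: no comments, but the rest of the page still works
end
```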
00:32:55.280
As we conclude: separating out services can boost resilience, but it's essential to stay aware of the pitfalls that service-oriented architecture can still present.
00:33:12.000
First, if you must embrace SOA, make sure your code is sufficiently untangled before you extract anything; only then should you venture forth, grounded in a real understanding of its boundaries.
00:33:30.600
Second, stay vigilant about potential points of catastrophic failure, and minimize the total number of them; doing so brings a plethora of advantages, from happier developers to greater operational integrity.
00:33:48.730
Lastly, remember to group together the parts of your application that are likely to change together; that foresight will mitigate many pain points during development.
00:34:05.810
In closing, I reiterate that many motivations for pursuing SOA stem from frustration with code quality, yet services can easily exacerbate those same issues.
00:34:19.050
The first step should always be to focus on enhancing code quality within your core monolith. If there’s only one takeaway from today, let it be that.
00:34:37.360
I’d like to express my appreciation for the amazing artwork displayed, drawn by Sam Brown, and for the generous memes from the website Meme Center, which were used throughout.
00:34:49.990
If you feel inclined to either agree or disagree with me, I invite you to engage in civil discourse. You can reach out to me; I’m Rachel Myers.
00:35:08.490
Thank you for your attention.