Service Oriented Disasters

by Rachel Myers

In her talk "Service Oriented Disasters," Rachel Myers addresses the complexities involved in building and maintaining modern applications, particularly in the context of Ruby on Rails and service-oriented architecture (SOA). The presentation begins with an overview of code complexity, emphasizing that while starting a Rails application may be straightforward, as it grows, complexities arise that developers must confront. Myers highlights the importance of understanding these complexities fully, especially when making architectural decisions that can lead to sustainable maintenance and scalability.

Key Points Discussed:
- Importance of Code Complexity: As applications mature, addressing code complexity is essential. Myers references Sandi Metz’s principles for managing this complexity, such as adhering to the single responsibility principle and refactoring code for readability and reusability.
- Misunderstandings of SOA: Myers challenges the popular perception that SOA is a blanket solution for managing application complexity. She emphasizes that uncritical adoption of SOA can lead to its own complications, such as difficult-to-maintain services and inter-service dependencies.
- Case Studies:
  - The first case study revolves around "Hats for Spacedogs.com," where attempts to extract a feature related to hat voting led to unmanaged complexity and a failure to properly separate functionalities. This illustrated the dangers of extracting services without fully grasping the existing code.
  - The second case study involved building an identity service. Here, the service ended up entangled with the main application, creating a situation where both had to remain operational, leading to a single point of failure.
  - Finally, the dangers of a monolithic application without proper separation of responsibilities are discussed, showcasing how such systems can quickly become problematic.

Conclusions and Takeaways:
- The presentation concludes with the recognition that while SOA can provide benefits, it is not always the ideal solution for managing code complexity in applications. Instead, understanding the code and making informed architectural decisions based on situational needs is crucial.
- Myers encourages developers to be wary of over-architecting solutions and to focus on simplicity where possible, reflecting on her experiences and the lessons learned from each case study presented.

00:00:00.320 So, I’m going to do a quick summary of all the goody two-shoes stuff I do. There’s RailsBridge, which, like Josh said, holds workshops around the world. These workshops are free to everyone and are aimed at helping women and marginalized groups learn Ruby and Rails. If you want to have one near you, talk to me. Bridge Foundry is another way to get your technology in the hands of marginalized groups, so if you want to do that, let me know.

00:00:30.160 Tomorrow night, GitHub is holding a Patchwork workshop, which is a free workshop for anyone who wants an introduction to Git and needs help. There’s a link to all these things in a handy Bitly bundle I made for this talk. If you want to RSVP to Patchwork, or if you want to participate in RailsBridge or Bridge Foundry activities, you can check it out there. This is bit.ly/serviceorienteddisasters, all one word. It also contains links that are relevant to my talk.

00:00:51.840 Thank you all for being here! Before jumping into service-oriented architecture, I want to say a few words about code complexity. Code complexity is a problem that everyone who maintains a Rails app is going to run into eventually, if they haven't already. It’s easy to start a Rails app, but as that application grows, parts of the code become entangled with other parts.

00:01:04.559 As applications mature, we need to be thinking about this complexity more and more. First, we should build every feature with an eye to how it will be extended in the future. Second, high-level architecture decisions can have a huge impact on code complexity and where that complexity resides. So, we should absolutely be thinking about that as we decide our architecture. We should aim for an architecture that reduces overall complexity.

00:01:24.720 We are fortunate to live in an age when figures like Sandi Metz have extensively addressed how to deal with code complexity. I believe she has provided valuable guidance. I’ll now attempt to summarize all of her points (essentially a lifetime of work) on one slide. The gold standard according to Sandi Metz for removing complexity is that each class should represent a single responsibility. That class can still have complexity within it, but the class itself should be conceptually simple. It should do one thing, and we should refactor the methods in those classes to make them shorter, more readable, and better named, so that they are easier to change and reuse.

00:01:52.800 However, those kinds of refactors can require a significant amount of effort, and you might not have much to show for it once you’re done, as you haven't shipped a new feature. Therefore, it’s necessary to compromise and concentrate your efforts on reducing complexity in the most important code paths. You should look for tightly coupled areas in the parts of your application that are exercised frequently. If you discover complexity in a section of the code that you rarely use, you don’t need to worry about it as much. Stable code paths that aren’t changing or are at the end of the line are the places where complexity won't obstruct your work.

00:02:20.000 That is Sandi Metz summarized in one slide. I have linked to her blog and book in the Bitly bundle. One area she hasn't addressed in depth is the conversation around service-oriented architecture. I really appreciated Sabrina’s talk earlier about rewriting a complex Rails app into services; it was excellent. I’m going to offer a contrasting perspective based on my own experiences, and I hope the two talks will complement each other.

00:02:52.000 I think it’s useful to merge the discussions about service-oriented architecture and code complexity because, in my view, service-oriented architecture has been misunderstood by many as a solution for managing complexity, and I don’t think that's actually the case.

00:03:06.000 For anyone new to service-oriented architecture (SOA), it is an approach to managing software by building small services that each represent a distinct function. The goal of SOA is to provide easier maintenance and more sustainable scalability compared to a large monolithic application. There are many talks, blog posts, and books discussing the benefits of building out a service-oriented architecture.

00:03:27.000 Lauren gave a fantastic talk yesterday about how to extract a service from an existing application, and Sabrina did the same today. I would like to see more analysis on the ultimate success or failure of these projects. I think we are only in the early stages of understanding this, and as such, I’m going to make an effort to address those points in this talk as well.

00:03:59.000 I’m eager to learn about the successes and challenges of breaking these large applications into smaller services. Service-oriented architecture may work in specific situations, but perhaps we should not aim to replace all of our large Rails apps with services.

00:04:15.000 There are significant downsides to consider. It's possible to end up with services that are difficult to test, hard to debug, and five times the work to deploy. As Coraline pointed out yesterday, a poor SOA implementation can create an ecosystem of services that mirrors your bad app, with the added complexity of network failures.

00:04:36.000 So, the first case study I’m going to share highlights extracting tangled code. To set the stage: the team at Hats for Spacedogs.com received a request to revive an old, neglected feature on the site that would allow space dogs to vote on hats, and we would then sell the winning hats on the site.

00:05:01.000 Most of the app was reasonably well-factored; however, this part, which wasn't core to our functionality, had become much more tangled than the rest of the code. The client-side code was the absolute worst. The entire feature was a long function that mixed logic with presentation and weird bug fixes over and over.

00:05:25.000 It’s astonishing how a simple feature became so dauntingly complex, primarily due to a lack of structure. On the server side, the code suffered from a similar issue: the hats you could vote on were the same classes as the hats you could actually buy, even though their behaviors were entirely different.

00:05:53.000 Moreover, if a hat won the contest and we were going to sell it on the site, we created a second hat record, with an association between the two. This resulted in a confusing data model. If we had managed our complexity properly, we would have followed Sandi Metz's advice and refactored it into two separate classes to accurately reflect their distinct behaviors.

00:06:07.000 We would have adhered to the single responsibility principle, but we didn't manage that. When we looked to update this feature, we wanted to resolve some of the existing issues in the Rails code. The hat model was violating the single responsibility principle because it merged the behaviors of what I'm going to call buyable hats and votable hats.
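As a rough illustration of the split described here, the sketch below separates the two behaviors into two plain Ruby classes. The class names follow the talk's "votable hats" and "buyable hats", but every attribute and method (`vote!`, `purchase!`, `promote_winner`, prices, stock) is an assumption made for illustration, not the actual Hats for Spacedogs code.

```ruby
# Hypothetical plain-Ruby sketch; attributes and methods are assumptions,
# not the actual Hats for Spacedogs models.

# A hat that space dogs can vote on. It knows only about votes.
class VotableHat
  attr_reader :name, :votes

  def initialize(name)
    @name = name
    @votes = 0
  end

  def vote!
    @votes += 1
  end
end

# A hat that can be bought. It knows only about price and stock.
class BuyableHat
  attr_reader :name, :price_cents, :stock

  def initialize(name, price_cents:, stock:)
    @name = name
    @price_cents = price_cents
    @stock = stock
  end

  def in_stock?
    stock.positive?
  end

  def purchase!
    raise "#{name} is sold out" unless in_stock?

    @stock -= 1
  end
end

# The contest winner is converted explicitly into a buyable hat, rather
# than one record trying to behave as both kinds of hat.
def promote_winner(votable_hats, price_cents:, stock:)
  winner = votable_hats.max_by(&:votes)
  BuyableHat.new(winner.name, price_cents: price_cents, stock: stock)
end
```

With this shape, neither class needs to ask "did this hat win?" or "is this hat for sale?" about itself; the conversion happens in one named place.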

00:06:31.000 The front-end code was so confusing that it appeared nearly impossible to refactor. So we had unmanaged complexity, and we thought we could see clear lines along which to extract the feature into an app composed of votable hats and votes. We were aware of SOA, which promised small, well-defined, and easy-to-maintain services.

00:06:57.000 We began rewriting those features into a new application using 'rails new' and defined the data model we had always desired. Then we started planning how to import the old votable hat records, which were essentially hat records, into our new votable hats app. This process felt glorious; we were progressing rapidly and believed we were doing what was ideal.

00:07:19.000 But as we began developing a migration script to import votable hats and their associations into our new application, we quickly realized we needed to make adjustments to fit our data model. That’s when things took a downturn. For context, we faced numerous questions about whether the votable hat won the contest and if it was available for purchase, among many others.

00:07:45.000 We found that the votable hats had more dependencies on the hat model and its peculiarities than we had previously understood. Nevertheless, we persisted, diagramming our main application and the new votable hats app, which had their own databases.

00:08:07.000 We made a crucial mistake by connecting the new app to the old database, which should have been a warning sign. Next, we realized the main app needed attributes about votable hats, and since we couldn't wait for our API to be completed, we connected it to the new app's database.

00:08:33.000 In hindsight, this was the diagram of doom. If you find yourself sketching this out, consider stopping for coffee, or perhaps even seeking a new job. It indicates that you've drawn the lines around your new service incorrectly—you're not actually extracting an independent service.

00:08:55.000 At this point, we should have called it a disaster, but we proceeded regardless. We realized there was too much complexity in the hat model and its associations, and the quickest solution was to reuse everything in the new application.

00:09:17.000 So, we packaged up all our models into a gem, which we shared with our new app. Reflecting on this, we created an ecosystem of services that mirrored the unmanaged complexity of our initial codebase, which was not a success.

00:09:46.000 What can we learn here? Firstly, we failed to draw the correct boundaries around our new service. We defined it around a customer-facing feature rather than a logical unit in our application.

00:10:06.000 We drew those lines without really understanding the code, as if to avoid it. We believed that SOA would help manage our complexity, but that wasn't the case. The hat model presented challenges for us, and instead of refactoring to grasp its dependencies before extraction, we attempted to extract before we properly understood the system.

00:10:29.000 Fortunately, we recognized this was a poor approach, and we backed out before going live with it. We face similar issues at GitHub too.

00:10:44.000 GitHub.com is a massive monolithic application built in Rails, and there are areas of it that are neglected and poorly tested, making refactoring risky yet essential.

00:11:02.000 To assist in such efforts, GitHub maintains an open-source project called Scientist. It allows you to define an alternate code path and simultaneously run the old and new code paths in production, reporting back whether their outputs agree.

00:11:16.000 This way, users will still get the output from the existing code while you assess how the new code performs in real-world scenarios, which is particularly useful for revealing bugs and unexpected behaviors.
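The pattern described above can be sketched in a few lines of plain Ruby. This is a simplified stand-in and deliberately not the Scientist gem's API: the `Experiment` class, its `run` method, and the two lambdas are hypothetical names invented for this example. Consult the Scientist README for the real interface.

```ruby
# Simplified stand-in for the pattern described above. This is NOT the
# Scientist gem's API; Experiment and its methods are hypothetical names.
class Experiment
  attr_reader :mismatches

  def initialize(name)
    @name = name
    @mismatches = []
  end

  # Runs both code paths, records any disagreement, and always returns
  # the result of the existing (control) path.
  def run(control:, candidate:)
    control_value = control.call
    begin
      candidate_value = candidate.call
      unless candidate_value == control_value
        @mismatches << { control: control_value, candidate: candidate_value }
      end
    rescue StandardError => e
      # A failing candidate must never affect what the user sees.
      @mismatches << { control: control_value, error: e.message }
    end
    control_value
  end
end

# Hypothetical old and new implementations of the same check.
old_check = ->(votes) { votes > 10 }
new_check = ->(votes) { votes >= 10 } # differs only at exactly 10 votes

experiment = Experiment.new("hat-contest-winner")
results = [5, 10, 15].map do |votes|
  experiment.run(control: -> { old_check.call(votes) },
                 candidate: -> { new_check.call(votes) })
end
```

Callers only ever see the control result, while `experiment.mismatches` collects the inputs on which the rewritten code disagrees, which is the information you would want before switching over.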

00:11:37.000 Jesse Toth, who recently reworked much of the permissions code at GitHub, gave a talk a few months ago about using Scientist to rewrite convoluted and poorly understood code, ensuring that when it shipped, it came with tests and was easy to understand and maintain. It was an impressive project that didn’t take as long as one might expect.

00:11:56.000 Now, onto the next case study. Back at Hats for Spacedogs.com, we aimed to build an identity service that would manage authentication and authorization for our applications.

00:12:06.000 The complexity resided in two models: our hat model and our space dog model, which served as our version of a user model. We believed that an identity service would help simplify complexity by isolating the intricacies of the space dog model.

00:12:21.000 In the process of defining this identity service, we aimed to resolve various issues. Initially, we defined our first service for the votable hats based on customer-facing features, but this time we focused on one of our Rails models, leading to some important lessons.

00:12:43.000 Our intention was to avoid tight coupling with other services, like votable hats. We still saw service-oriented architecture as a way to create small, maintainable services, a goal that ultimately didn’t materialize.

00:13:03.000 In this endeavor, we neglected to give the identity service its own database. Our grand idea was that requests would come into the main app, which would call the identity service's API, which would in turn query the main database for a response.

00:13:16.000 Later, we built a mobile app and incorrectly assumed it would function similarly, but we didn’t understand the breadth of where the space dog model was being utilized across the main application. As a result, we were never able to fully extract the identity service.

00:13:35.000 This oversight meant that every time the identity service required information, both systems hit the database before yielding a response, which was disastrous. This setup was riskier than the votable hats application, especially since it went to production.

00:13:53.000 Moreover, it resulted in increased operational complexity as the identity service could not function independently of the main application, creating a situation where both needed to remain operational. If either failed, users couldn’t shop for hats as space dogs.
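A back-of-the-envelope calculation shows why "both must be up" is costly. The sketch below assumes two components that fail independently, each with a made-up availability of 99.9%; neither figure comes from the talk, and in the setup described here the shared database would make failures correlated rather than independent.

```ruby
# Hypothetical availability figures, chosen only for illustration.
MAIN_APP_AVAILABILITY = 0.999
IDENTITY_SERVICE_AVAILABILITY = 0.999

# When every request needs all components to be up, and failures are
# assumed independent, the availabilities multiply.
def serial_availability(*components)
  components.reduce(1.0) { |product, availability| product * availability }
end

combined = serial_availability(MAIN_APP_AVAILABILITY, IDENTITY_SERVICE_AVAILABILITY)
hours_per_year = 24 * 365

puts format("combined availability: %.4f", combined)
puts format("expected downtime: %.1f hours/year", (1 - combined) * hours_per_year)
```

Under these assumptions the combined system is always less available than either part alone, so adding a mandatory service to the request path lowers uptime unless something else improves.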

00:14:13.000 This led to a significant workload shift onto the operations team without any clear benefits, presenting the worst-case scenario where we created a single point of failure that was poorly understood.

00:14:35.000 We mistakenly believed that offloading tasks onto smaller services would enhance scalability. In our scenario, we thought if our main application required 2,000 servers, adding 100 servers for authentication would improve performance.

00:14:55.000 However, since the identity service shared uptime requirements with the main application, this assumption was misguided. Users would still need to authenticate through the identity service, creating an unnecessary bottleneck.

00:15:17.000 Instead, if all requests are served from one larger pool of servers, they tend to be processed faster than in a segregated setup. This is one reason service-oriented architecture may not be ideal for every case.

00:15:36.000 The situation could become suboptimal if the uptime requirements differ between the two services. For those attempting SOA in their Ruby applications, I suggest reviewing queuing theory, which could prove beneficial.
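One standard queuing-theory result is that a single shared pool of servers gives shorter waits than the same servers split into separate pools. The sketch below uses the textbook M/M/c model and the Erlang C formula to compare the two. The model itself (Poisson arrivals, exponential service times, interchangeable work) and all the rates are assumptions for illustration; the talk does not give a model or numbers.

```ruby
# Textbook M/M/c queue: Poisson arrivals, exponential service times,
# c identical servers. All rates used below are hypothetical.

def factorial(n)
  (1..n).reduce(1, :*)
end

# Erlang C: probability that an arriving request has to wait.
def probability_of_waiting(servers, arrival_rate, service_rate)
  offered_load = arrival_rate.to_f / service_rate
  utilization = offered_load / servers
  raise ArgumentError, "queue is unstable" unless utilization < 1

  waiting_term = (offered_load**servers / factorial(servers)) / (1 - utilization)
  idle_terms = (0...servers).sum { |k| offered_load**k / factorial(k) }
  waiting_term / (idle_terms + waiting_term)
end

# Mean time a request spends queued before a server picks it up.
def mean_wait(servers, arrival_rate, service_rate)
  probability_of_waiting(servers, arrival_rate, service_rate) /
    (servers * service_rate - arrival_rate)
end

pooled = mean_wait(4, 3.2, 1.0) # one pool: 4 servers take all the traffic
split  = mean_wait(2, 1.6, 1.0) # each of two pools: 2 servers, half the traffic

puts format("pooled mean wait: %.3f", pooled)
puts format("split mean wait:  %.3f", split)
```

Both configurations run at the same utilization, yet the pooled one waits less, because an idle server in one pool cannot help a backed-up queue in the other. Real traffic rarely fits these assumptions exactly, so treat the numbers as a direction rather than a prediction.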

00:15:55.000 Lastly, we aimed to define clear team responsibilities through this project, drawing inspiration from GitHub’s model. In our main application, we have a YAML file that assigns a team to every single file in the application. This has streamlined responsibilities, making it clear who manages each part of the code.
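A file-to-team mapping like this is simple to consume from Ruby. The sketch below is an assumption-heavy illustration: the talk does not describe the file's format, so the flat `path: team` layout, the paths, the team names, and the helper methods `owner_of` and `unowned` are all hypothetical.

```ruby
require "yaml"

# Hypothetical ownership file. The real format is not described in the
# talk, so the layout, paths, and team names here are made up.
OWNERSHIP_YAML = <<~YAML
  app/models/hat.rb: hats-team
  app/models/space_dog.rb: identity-team
  app/controllers/votes_controller.rb: contests-team
YAML

OWNERS = YAML.safe_load(OWNERSHIP_YAML).freeze

# Returns the owning team, or raises so unowned files are noticed.
def owner_of(path)
  OWNERS.fetch(path) { raise KeyError, "no team owns #{path}" }
end

# Lists files that have no assigned team.
def unowned(files)
  files.reject { |file| OWNERS.key?(file) }
end
```

A check such as `unowned(Dir.glob("app/**/*.rb"))` could run in CI so that every new file gets an owner, which gives clear team boundaries without splitting the codebase into services.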

00:16:14.000 So, sometimes simplicity is the best solution. Now, I’ll conclude with my final case study. This one didn’t seem destined to be a disaster like the others, but the implementation created significant issues.

00:16:42.000 We had a monolithic Rails application. The appealing aspect of monoliths is that you only require one set of assets. However, soon enough, we ran into significant problems due to the lack of separation.

RubyConf AU 2015