Laying the Cultural and Technical Foundation for Big Rails

by Alex Evanczuk

In the talk titled "Laying the Cultural and Technical Foundation for Big Rails" at RailsConf 2022, Alex Evanczuk addresses the challenges of scaling large Rails applications and the social and technical frameworks necessary for sustained growth. The discussion reflects on Gusto's journey as an organization transitioning from a smaller codebase to managing an increasingly complex Rails monolith. Evanczuk emphasizes that as the number of contributors grows, maintaining velocity and managing complexity becomes critically important.

Key Points Discussed:

- Introduction and Background:
  - Recognition of contributions made by the Rails community and the importance of the conference.
  - Explanation of Gusto's system graph, illustrating its various subsystems and interconnections.
- Velocity Challenges:
  - Acknowledgment of decreasing per-contributor velocity despite increasing team size.
  - Difficulties faced when implementing new features, which sometimes made later features harder to add rather than easier.
- Complexity Management:
  - Strategies attempted by Gusto to manage growing complexity, including gems, Rails engines, and microservices, with mixed results.
  - Evolution of the approach to focus on social and cultural changes alongside technical solutions.
- Big Rails Definition:
  - Presentation of a definition for "big Rails" as a system of socio-technical tools and practices that help scale Rails development.
  - Five principles behind the approach: accountability and ownership, clear boundaries, thoughtful dependency management, gradual adoption, and sustainable feedback loops.
- Implementation Details:
  - Development of tools like the code ownership and teams gems to clarify ownership within the codebase.
  - Reorganization of the codebase by business domain rather than by the typical architectural layers.
  - Introduction of Packwerk, a tool to manage dependencies and enforce clear boundaries between domain areas.
  - Cultivation of a culture where developers take responsibility for maintaining clean boundaries within their systems.
- Cultural and Behavioral Shifts:
  - Focused effort on educating teams about the benefits of the new systems and advocating for accountability in managing dependencies.
  - Introduction of feedback loops to continuously engage developers with the system's design and practices.

Conclusion:

Evanczuk concludes by encouraging the Ruby on Rails community to continue refining tools and practices to combat complexity in large applications. He invites collaboration and emphasizes that scaling a Rails application sustainably requires both cultural change and technical adaptation.

00:00:00.900 Welcome everyone.

00:00:12.900 Thank you for coming to my talk called "Laying the Cultural and Technical Foundation for Big Rails."

00:00:15.360 To start off, I want to take a moment to thank everyone who's contributed to and maintained Rails. They have created something bigger than themselves and inspired people and businesses of all sizes. Thank you to them.

00:00:21.119 I also want to thank everyone who helped put together this conference and brought us here today. A little about me: my name is Alex, and I'm really happy to be here today to speak with you all. I appreciate you taking the time to attend. I live in Vermont with my partner, and I enjoy many things, including gardening. This is a picture of our work-in-progress garden. I also love eating fresh fruits and vegetables, composting, and other crunchy things.

00:00:43.579 I have worked at Gusto for nearly six years now, primarily on the product infrastructure team, focusing mostly on back-end modularization. Before that, I worked on the benefits team. I truly love working at Gusto, and I’ll be available throughout the conference and beyond if you want to learn more.

00:01:38.820 To dive in, this is Gusto's system graph. Each of the black rectangles you see represents a subsystem within Gusto's large Rails monolith, and the red arrows indicate when one subsystem communicates with another. Over time, despite our team growing at a linear or even exponential rate, we noticed that our per-contributor velocity was actually decreasing.

00:01:50.620 It became harder to add new features, and we frequently found that implementing a new feature made adding subsequent features more challenging rather than easier. We realized that people were struggling with making large-scale changes in our codebase because large-scale changes require structure at scale, which we were lacking.

00:02:01.740 So, how do we even begin to solve this problem? We tried a couple of different approaches. We created gems and Rails engines, but in the best cases, we found that our business logic was so entangled that the most we could manage was to extract small, mostly functional components that lacked significant business value and wouldn’t make a dent in the overall complexity of our application.

00:02:29.340 In the worst cases, it led us on a bit of a wild goose chase. We recognized that we had many years of work ahead if we wanted to extract all of our code into gems and engines. We also attempted to implement microservices for new and existing services, which had varying degrees of success.

00:02:40.920 However, none of these techniques could adequately address the challenges posed by our ever-growing Rails monolith. This is the story of how Gusto's complexity grew and brought me here today to share that journey with you. In this talk, I will discuss some of the progress we've made and share tools and techniques that have been helpful.

00:03:43.740 As I share this story, keep in mind that I've taken some creative liberties in my storytelling to hopefully help you avoid making some of the mistakes we made along the way. But I’d be happy to share some of those mistakes if you're curious.

00:04:42.000 Gusto first began as a payroll provider for small businesses in California. Over the years, we expanded both geographically and in the problems we solve. Today, we assist companies in running payroll, ensuring tax compliance, setting up health insurance benefits, and taking care of their employees, among other tasks.

00:05:03.000 Gusto experiences significantly lower web traffic and data storage requirements compared to many other similar companies. If we do our job right, users spend very little time on our website, running payroll or figuring out how to set up their health insurance benefits. Personally, I believe the less time users spend on our site and the less information we hold about our customers, the better.

00:05:32.000 Thus, this story is not focused on performance, big data, or web traffic throughput, but rather on scaling domain complexity. If this narrative feels familiar to you, you may also be working in a large Rails application. We want to change something about this narrative—and we can.

00:06:03.000 Regarding 'big Rails,' many people have different definitions. One definition is that big Rails is a system of socio-technical tools, practices, and conventions that facilitate scaling Rails development in terms of lifespan, the number of contributors, and complexity. Among these, five key principles have guided a lot of our approach to 'big Rails': accountability and ownership, clear boundaries, thoughtful dependency management, gradual adoption, and intentionally curated sustainable feedback loops.

00:06:21.840 Accountability and ownership are crucial concepts. Imagine a piece of code in your codebase. When that code is called and fails, who receives the error? How do you route the error to the right team? Who do you consult regarding a change to this code when 'git blame' shows dozens of contributors, many of whom may no longer be with the company? Accountability and ownership are about clarifying which team is responsible for what, covering both the behavioral and structural concerns of the codebase.

00:07:06.060 We aim to make it straightforward to identify ownership, both for human operators working within the codebase and through automated tooling. For instance, an engineer should be able to identify what team owns a part of the domain easily. Bug logs, monitoring, asynchronous processes, and more should all be attributable to a specific team.

00:07:38.160 To accomplish this, we created two gems, named 'code ownership' and 'teams.' We open-sourced these in advance of this talk, along with several other tools that I'll be discussing. Look out for the RubyGems logo in the bottom right for references to these open-source gems. You can also find all the references I discuss on the Gusto Engineering blog.

00:08:05.040 On the left, you can see a basic team declaration. On the right, you can see how code ownership links a piece of code back to the team responsible for it. With this, we could tie our codebase to specific teams and individuals. Our error monitoring system assigns bugs to individual teams, allowing them to take responsibility for their impact on system behavior. They can also define their own service level agreements (SLAs) and tolerance for errors based on their business objectives.

00:08:45.200 We configured our asynchronous work system, Sidekiq, to tag jobs with owner information, so monitoring dashboards could be filtered by team. We also use this setup to generate our GitHub CODEOWNERS file programmatically, ensuring teams receive notifications for code reviews.
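To make the ownership idea concrete, here is a minimal sketch of mapping a file path to a team and generating CODEOWNERS-style lines from the same data. It is not the API of Gusto's gems: the `Team` struct, the `owner_for` and `codeowners_lines` helpers, the path prefixes, and the `@example-org/...` handles are all hypothetical names chosen for illustration.

```ruby
# Hypothetical sketch of path-based ownership; not the real gems' API.
Team = Struct.new(:name, :github_handle, :owned_prefixes, keyword_init: true)

# Illustrative team definitions (names and handles are made up).
TEAMS = [
  Team.new(name: "Benefits", github_handle: "@example-org/benefits",
           owned_prefixes: ["packs/benefits/"]),
  Team.new(name: "Payroll", github_handle: "@example-org/payroll",
           owned_prefixes: ["packs/payroll/"])
].freeze

# Returns the first team owning a prefix of the path, or nil if unowned.
def owner_for(path, teams = TEAMS)
  teams.find do |team|
    team.owned_prefixes.any? { |prefix| path.start_with?(prefix) }
  end
end

# Builds one "pattern owner" line per owned prefix, in CODEOWNERS style.
def codeowners_lines(teams = TEAMS)
  teams.flat_map do |team|
    team.owned_prefixes.map { |prefix| "/#{prefix} #{team.github_handle}" }
  end
end

owner = owner_for("packs/benefits/app/models/plan.rb")
puts owner ? owner.name : "unowned"
puts codeowners_lines
```

The same lookup could feed error routing or job tagging: anything that knows the file a failure came from can ask which team to notify.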

00:09:13.060 As a bonus, this sparked conversations about code ownership—or the lack thereof—throughout our codebase, both for new and existing code. Consequently, this pushed us towards the necessity of creating boundaries, as it's difficult to establish boundaries without accountability and ownership.

00:09:44.000 Clear boundaries involve working towards easily understandable conceptual and mechanical separations between domains. Each system should only ever communicate with other systems through intentionally maintained public APIs. Thoughtful dependency management seeks to minimize dependencies to reduce cognitive load when understanding a system. When we must add a dependency, we should do so explicitly.

00:10:36.240 We should avoid creating cycles in our dependency graph, as these reduce our ability to understand a subsystem in isolation. To move towards this goal, we started with one simple change to a standard Rails convention. In a standard Rails app, you have an app directory and secondary directories for architectural concerns, like models, views, controllers, and services, followed by directories for business domains.

00:11:38.760 Earlier, I mentioned I was on the benefits team for an extended period. This meant that as a product engineer on that team, I had to navigate across folders to work within my team’s codebase. This was not only cumbersome, but it violated an important principle of coupling and cohesion, which states that things that change together should reside together.

00:12:23.340 What we really wanted was all the benefits-related files to be in the same folder. So that's precisely what we did. We felt it was far less meaningful to organize the app by architectural layer and much more meaningful to organize it by domain.

00:12:56.640 Note that we didn’t divide it by team, as our software shouldn’t reflect our organizational structure. This change didn’t require a significant amount of additional technology initially; we just had to set up some load paths, which we simplified by open-sourcing a gem called 'stimpack.' Once added to your Gemfile, it sets up Rails load paths and more for this new structure.

00:13:35.160 Please keep in mind that this switch is well supported by Rails in case you want to configure it yourself. In fact, it's quite similar to Rails engines, but we chose not to utilize Rails engines for several important reasons I will address later.
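As a rough illustration of what "setting up load paths" for a domain-first layout involves, the sketch below collects every `packs/<domain>/app/<layer>` directory under a root. The `packs/` folder name and the `pack_autoload_dirs` helper are assumptions made for this example; in a real Rails app the resulting directories would be added to the autoload paths (for instance through `config.autoload_paths`), which is the kind of wiring the gem mentioned above is described as handling.

```ruby
require "fileutils"
require "tmpdir"

# Returns every packs/<domain>/app/<layer> directory under root, sorted.
# Hypothetical helper: a Rails app would add these to its autoload paths.
def pack_autoload_dirs(root)
  Dir.glob(File.join(root, "packs", "*", "app", "*"))
     .select { |path| File.directory?(path) }
     .sort
end

# Build a throwaway domain-first layout and list what would be autoloaded.
Dir.mktmpdir do |root|
  %w[
    packs/benefits/app/models
    packs/benefits/app/services
    packs/payroll/app/models
  ].each { |dir| FileUtils.mkdir_p(File.join(root, dir)) }

  pack_autoload_dirs(root).each do |dir|
    puts dir.delete_prefix("#{root}/")
  end
end
```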

00:14:14.460 This pattern also came with numerous additional benefits, such as having test files co-located alongside the code they test. The technologies used for a domain became more of an implementation detail hidden from consumers. I don't mean to imply that Rails' default way of organizing by architectural layer is incorrect.

00:15:00.000 In fact, most new Rails projects tend to focus on a single domain, so the left-side pattern makes a lot of sense for those applications.

00:15:54.600 Now that we have our app divided by domain, the next question becomes how to systematically manage the relationships between those domains. This is where Packwerk comes into play, and I want to thank Shopify for this wonderful tool, which has provided us with numerous opportunities.

00:16:30.520 With Packwerk, we maintained the structure we had before, organizing first by domain, and then introduced a package.yml file, which you can see there on the right. Note that Packwerk isn't concerned with how your file system is organized, but we found it incredibly beneficial when our system was organized by domain first.

00:17:37.340 We also added an owner field to each package.yml. This is a custom field that Gusto added, and it is used by the code ownership gem. Next, we incorporated public folders into each package to house our public API while keeping everything else private.

00:18:03.860 Your basic building block for Packwerk is called a package. At Gusto, we often refer to this as a pack, mainly because it's quicker to say and write. A pack is simply any folder of code that has a package.yml at its root, and packs are the nodes in this graph.

00:18:55.680 Each node in the graph has an inner concentric circle representing the private API, while the outer ring shows the public boundary that a pack exposes to the outside world. Packwerk provides several ways to declare something as public or private, but at Gusto, we prefer to have a public folder because we like the idea of everything being private by default.

00:19:57.240 I've added two other systems, so now we have three packs: benefits, HR, and payroll. Each pack's package.yml lists its dependencies on the other packs, represented by these large white arrows.

00:20:00.840 Packwerk requires that these explicitly stated dependencies never form a cycle, which is one of my favorite features of Packwerk. In this small system, the HR pack depends on both the benefits and payroll packs, while the benefits pack depends on payroll.
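The three-pack example and the no-cycles rule can be sketched with Ruby's standard `TSort` module. This is not how Packwerk itself is implemented. The YAML snippets are held inline as strings, the `packs/...` paths are assumed, and the key names (`enforce_dependencies`, `enforce_privacy`, `dependencies`) follow Packwerk's package file as I understand it from around the time of this talk, so check the documentation for the version you use. `PackGraph` is a hypothetical name.

```ruby
require "yaml"
require "tsort"

# Inline stand-ins for each pack's package.yml (paths are assumptions).
PACKAGE_FILES = {
  "packs/hr" => <<~YAML,
    enforce_dependencies: true
    enforce_privacy: true
    dependencies:
      - packs/benefits
      - packs/payroll
  YAML
  "packs/benefits" => <<~YAML,
    enforce_dependencies: true
    enforce_privacy: true
    dependencies:
      - packs/payroll
  YAML
  "packs/payroll" => <<~YAML
    enforce_dependencies: true
    enforce_privacy: true
  YAML
}.freeze

# Hypothetical wrapper exposing the declared dependency graph to TSort.
class PackGraph
  include TSort

  def initialize(dependencies_by_pack)
    @dependencies_by_pack = dependencies_by_pack
  end

  def tsort_each_node(&block)
    @dependencies_by_pack.each_key(&block)
  end

  def tsort_each_child(pack, &block)
    @dependencies_by_pack.fetch(pack, []).each(&block)
  end

  # Groups of two or more packs that depend on each other in a loop.
  # (A pack that lists itself as a dependency is not detected here.)
  def cycles
    strongly_connected_components.select { |group| group.size > 1 }
  end
end

declared = PACKAGE_FILES.transform_values do |yaml|
  YAML.safe_load(yaml).fetch("dependencies", [])
end

graph = PackGraph.new(declared)
puts "cycles: #{graph.cycles.inspect}"
puts "dependency order: #{graph.tsort.join(', ')}"
```

With the dependencies from the talk the cycle list is empty. Adding a dependency from payroll back to HR would put all three packs into one cyclic group.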

00:20:51.000 Next, Packwerk will parse every Ruby, ERB, and Rake file using the same parser that RuboCop employs. After parsing, it retains a list of every constant, class, or module referenced within the code of that pack. These references are denoted here as purple squares.

00:21:33.720 It's essential to note that Packwerk relies on static analysis of the codebase to operate. This means Packwerk isn't required in production, nor is it loaded at runtime. Additionally, implicit references to constants, classes, or modules may be hidden from Packwerk's view.
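A small experiment shows why implicit references escape static analysis. The sketch below uses `Ripper`, the lexer that ships with Ruby, as a stand-in; Packwerk itself is described above as using the parser RuboCop uses, so treat this only as an illustration of the principle. `constant_tokens`, `Hr::Helper`, and `employee` are hypothetical names, and the source strings are only lexed, never executed.

```ruby
require "ripper"

# Returns the constant-name tokens that appear literally in the source.
# Each lexed element looks like [[line, column], type, token, state].
def constant_tokens(source)
  Ripper.lex(source)
        .select { |element| element[1] == :on_const }
        .map { |element| element[2] }
end

explicit = 'Hr::Helper.call(employee)'
implicit = 'Object.const_get("Hr::Helper").call(employee)'

# The explicit reference exposes both parts of the constant path.
p constant_tokens(explicit)

# The implicit one hides the name inside a string, so only Object shows up.
p constant_tokens(implicit)
```

Any tool working from the source text sees the first reference but has no straightforward way to see the second, which is why dynamically built constant names can cross a boundary unnoticed.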

00:22:16.260 At Gusto, we like using Ruby's static type checker, Sorbet, to give Packwerk more to work with. Once Packwerk has parsed the files, it draws edges between any reference to a class, constant, or module (the purple squares) and its definition (shown as an orange circle). Packwerk knows where something is defined because Zeitwerk, Rails' autoloader, establishes a convention that Packwerk relies upon.

00:23:18.520 Please note that these arrows may not respect the public API of the other package. Similarly, a pack can use a class from another package without explicitly stating a dependency. Packwerk reports these deviations from the intended API use and declared dependencies as privacy and dependency violations.

00:24:22.640 So let's discuss these dotted red arrows. The solid green arrow at the bottom right indicates that Packwerk doesn’t detect a problem because the HR pack is referencing the public API from the payroll pack and has declared a dependency on payroll.

00:25:28.180 On the other hand, the topmost violation indicates a reference from benefits to HR, which constitutes both a privacy and a dependency violation, hence the double line. This is a privacy violation because the benefits pack references a class, constant, or module from the HR pack that is private, meaning it doesn't exist in the HR pack's public folder.

00:26:28.900 Furthermore, this is a dependency violation because the benefits pack is using HR without having declared HR as a dependency. This would be a dependency violation even if benefits were using HR's public API. Clear boundaries and thoughtful dependency management are crucial for managing a large Rails application, and Packwerk makes these principles much easier to follow.

00:27:22.260 The code snippet on the lower left shows example code relating to the Packwerk graph. In this case, the HR helper is part of the HR pack's private API, and benefits does not declare a dependency on HR.

00:27:50.960 Packwerk constructs this graph and outputs these red dotted arrows as essentially a to-do list of violations, as shown in the bottom right corner. I cannot emphasize enough how impactful this to-do list is: it lets us incrementally improve our system by identifying areas where we need to reinforce boundaries and by tracking our progress.
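The two kinds of violation described above can be sketched as a small classifier. This models the rules as stated in the talk, not Packwerk's implementation or its to-do file format. The `Pack` and `Reference` structs, the constant names such as `Hr::Helper` and `Payroll::Api`, and the shape of the to-do list are all hypothetical.

```ruby
# Hypothetical model of packs: declared dependencies plus public/private constants.
Pack = Struct.new(:name, :dependencies, :public_constants, :private_constants,
                  keyword_init: true)
Reference = Struct.new(:from_pack, :constant, keyword_init: true)

PACKS = {
  "payroll" => Pack.new(name: "payroll", dependencies: [],
                        public_constants: ["Payroll::Api"],
                        private_constants: ["Payroll::Ledger"]),
  "benefits" => Pack.new(name: "benefits", dependencies: ["payroll"],
                         public_constants: ["Benefits::Api"],
                         private_constants: []),
  "hr" => Pack.new(name: "hr", dependencies: %w[benefits payroll],
                   public_constants: ["Hr::Api"],
                   private_constants: ["Hr::Helper"])
}.freeze

# Finds the pack that defines a constant, or nil if no pack does.
def defining_pack(constant, packs)
  packs.values.find do |pack|
    (pack.public_constants + pack.private_constants).include?(constant)
  end
end

# Returns the violations (:privacy and/or :dependency) a reference causes.
def violations_for(reference, packs = PACKS)
  target = defining_pack(reference.constant, packs)
  return [] if target.nil? || target.name == reference.from_pack

  source = packs.fetch(reference.from_pack)
  found = []
  found << :privacy unless target.public_constants.include?(reference.constant)
  found << :dependency unless source.dependencies.include?(target.name)
  found
end

# Collects only the offending references, keyed by "from -> constant".
def todo_list(references, packs = PACKS)
  references.each_with_object({}) do |reference, todo|
    found = violations_for(reference, packs)
    todo["#{reference.from_pack} -> #{reference.constant}"] = found unless found.empty?
  end
end

references = [
  Reference.new(from_pack: "hr", constant: "Payroll::Api"),
  Reference.new(from_pack: "benefits", constant: "Hr::Helper"),
  Reference.new(from_pack: "benefits", constant: "Hr::Api")
]
todo_list(references).each { |edge, found| puts "#{edge}: #{found.join(', ')}" }
```

The first reference is the green arrow (public API, declared dependency). The second is the double-line case, and the third shows that using a public API without declaring the dependency is still a dependency violation.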

00:28:32.780 But what about gems and engines? It’s important to note that this workflow is quite distinct from using inline gems and engines to modularize applications. When comparing the two, Packwerk supports gradual modularity.

00:29:09.160 Extracting large areas of existing systems as a gem or engine often forces you to confront modularization issues in areas that might offer low business value. In contrast, Packwerk allows for aspirational gradual modularity by enabling you to state an ideal system diagram and giving you a to-do list to guide you toward that goal.

00:30:11.460 In other words, Packwerk decouples statements regarding system structure and boundaries from the implementation of those boundaries. Additionally, it makes altering boundaries inexpensive and easy because creating packs and moving files between them does not have to affect your runtime at all. This makes it particularly advantageous for greenfield projects, where you can learn more about domain boundaries along the way.

00:31:10.620 Gems and engines also offer distribution and versioning. We don't need either for inline gems and engines, and as of now, Packwerk does not support those features. Test speeds are relatively comparable when using Spring and Bootsnap. Gems are advantageous because they support strict boundaries, but a package with no violations can easily become a gem.

00:32:01.100 To that end, we've released another gem called Package Protections to ensure that package boundaries remain as clean as those of a gem or engine. Lastly, stimpack allows packages to incorporate some engine features with less boilerplate code.

00:34:02.420 Overall, we continue to strongly believe that gems and engines are—and will remain—a critical component in the modularization toolchain. We’ve also identified areas where, although certain parts of the system could be gems, at Gusto, we're perfectly fine if they choose to remain as packages indefinitely.

00:35:18.420 I could spend a lot of time comparing and contrasting gems and engines with Packwerk packages, but I’ll move ahead for now. If you have further questions, please let me know after the talk.

00:35:45.860 A system never starts off as a big Rails application; it grows into one organically. Practices that have contributed to success at one scale may lead to confusion and challenges at another scale. Similarly, tools designed for large Rails applications may not suit small Rails apps.

00:36:32.560 Therefore, big Rails tools must be adoptable in gradual increments. Scaling a large Rails app includes technical components but cannot be achieved without addressing behavioral and cultural elements. We can't just turn on these technologies and anticipate enhanced growth.

00:37:12.120 This transformation requires substantial evangelizing and educating at every step. Throughout this transformation, I worked with teams to ensure we achieved the desired outcomes.

00:38:09.020 My first goal was to get every team to turn on privacy and dependency enforcement, meaning that the Continuous Integration (CI) system would fail if Packwerk detected a system boundary violation. To do this, I needed to identify whom to engage with, so I reached out to individuals with high context and encouraged them to add team owners to packages.

00:39:15.420 Over the course of about a year, we reached the point where nearly all our packages have a single team owner. In some instances, achieving this required developers to reorganize their code to decouple domain areas controlled by different teams.

00:40:00.460 I met with the teams that owned the packages and delivered essentially the same message I'm sharing with you now about why I believe this process is crucial. One by one, teams committed to establishing a boundary between their public and private APIs and explicitly managing their dependencies.

00:40:55.640 This seemed to turn out very well, and you can see the standard S-shaped technology adoption curve starting in October when we began collecting this data. As expected, teams adopted these tools at different rates and with varying levels of enthusiasm, but there is general consensus about the value of this approach.

00:41:25.840 However, I often found that teams were simply updating the violation to-do lists as if that were the only correct way to resolve a failing Packwerk build step. It was as if, every time there was a RuboCop error, a user just added it to the RuboCop to-do list.

00:42:25.420 While fixing a lint error is relatively straightforward, Packwerk requires us to create public APIs and be intentional about our dependencies, which makes it much harder to use effectively than a linter. This highlights the need to rethink our feedback loops for a large Rails system, as for any software system.

00:43:50.420 Here are some events we care about where we want to enhance our feedback loops. Note that we could potentially add countless more events to this list, but these are just a few examples of events that consistently occur, and for which we already have a platform to build upon.

00:45:01.440 Regarding the challenge I faced, I noticed that users often merely executed the command to update the Packwerk to-do list. To understand why users were doing this, I needed to dig deeper. This involved doing something non-scalable to figure out how to make it scalable.

00:46:21.680 To achieve this, I set up a Slack integration to alert me whenever there was a new privacy or dependency violation. I would comment on the pull requests (PRs) asking users to share more info. I created a spreadsheet to track each PR where I left comments, and over the course of about a year, covering approximately a thousand pull requests, I learned a great deal.

00:48:21.680 I used the spreadsheet to generate histograms showing why users updated the to-do list, which helped improve our documentation and foster a better understanding of how to support these developers.

00:49:35.420 More significantly, I frequently met with developers over Zoom to share insights about what we were doing and why. Through this process, developers began to become more familiar with the system, leading to a cultural shift.

00:50:11.400 Developers recognized that there were reasons for our efforts and that real people cared about how they interacted with these messages and used these tools. Over time, I noticed that developers started to proactively add context about these violations, often addressing the system design concerns before I even had a chance to comment.

00:51:11.640 To enhance the efficacy of these feedback loops, we developed some tooling. Early user feedback indicated that developers desired a quicker feedback loop, prompting us to create a VS Code extension for Packwerk and to offer the CI check as a configurable Git commit hook.

00:52:11.640 We also established systems to automatically leave helpful inline comments on PRs when Packwerk detects a privacy or dependency violation, as well as when the developer updates the to-do list.

00:53:37.840 This graph shows the average number of Packwerk violations per file over the past year or so. Overall, we observe a clear downward trend. Regarding the system graph from earlier, while there remains substantial work ahead, we continue to make progress.

00:54:28.000 What’s next? Just as Rails itself is a product of an engaged and passionate community, I hope we can approach 'big Rails' applications similarly. Many organizations are currently or will soon face the challenges of managing significant domain complexity.

00:55:50.020 I am immensely grateful for all the contributions from individuals and companies that have played a part in the solution. Some questions I have for the community are: In what ways can Ruby and Rails continue to provide excellent tools and cultural norms that help users create well-modularized systems?

00:56:52.380 What can be learned from the different conventions of Packwerk packages, gem specifications, and other packaging systems? All these tools, including those developed by the broader community, are imperfect.

00:57:30.540 If you have an interest in this problem space, I invite you to join us at Gusto or any of the other communities. Everyone is welcome.

00:58:01.420 You can also catch me right here after the talk to discuss this material or to simply reach out via email. Your feedback would be greatly appreciated. Whether you try out the tools and share your experiences, leave comments, open pull requests, or experiment with diverse approaches, everything is welcomed.

00:58:43.000 I hope I've been able to convey some of the value we've extracted from these tools and strategies, and I’m excited to collaborate on solving some of these challenges. Thank you.