2000 engineers, 2 millions lines of code: the history of a Rails monolith

by Cristian Planas and Anatoly Mikhaylov

In their presentation at BalticRuby 2024, Cristian Planas and Anatoly Mikhaylov from Zendesk discuss the intricate history and evolution of a Rails monolith over a decade. They emphasize the need for collaboration and adaptability in engineering practices, particularly as legacy systems evolve under changing technological landscapes. The talk highlights the importance of understanding both technological and organizational aspects, drawing on various experiences and lessons learned throughout their tenure at Zendesk.

Key points discussed include:
- The historical perspective on monolithic architectures, which have gone from being frowned upon to experiencing a resurgence as the community recognizes their value.
- The journey of Zendesk from a Rails monolith to an event-driven architecture, indicating a natural progression towards modularization while retaining Rails as a core component.
- The growth and challenges of front-end technologies, marking a shift from multiple frameworks to a more centralized use of React.
- Challenges related to database performance and scaling as data sets grow, highlighting the need for efficiency at the database level.
- The perspective on testing within the Ruby community, particularly the debate between testing and typing, and the importance of maintaining a robust test suite as part of a large codebase.
- Concluding reflections on the interconnected roles of performance scaling and effective engineering organization management, coupled with the call to leverage community collaboration for overcoming challenges.

The speakers concluded with an optimistic view of the future of the Ruby community, encouraging professionals to learn from the past and collaborate to achieve better outcomes as technologies continue to evolve.

00:00:06.000 Cristian and Anatoly welcome everyone to Malmö. You are not in the GR event; this is BalticRuby. My name is Cristian, and we both work as senior staff engineers at Zendesk. It's really nice to be here giving this talk.
00:00:19.279 We’re excited about this event because it’s related to what I’m going to talk about. You might be surprised, but this talk is about the history of a Rails monolith. You might wonder how this is related, but if you know a bit about the community, you understand that it’s not just about technology.
00:00:32.119 The main aspects are the politics, the drama, and the mess involved. Let’s be honest; there’s a lot of drama in the Ruby community, and it took me only a few minutes to recall plenty of stories. I’m not criticizing; I actually find the drama entertaining. As Matz once said, open-source communities tend to lose momentum when people become bored, but we seem to thrive on the drama.
00:01:15.040 A few years ago, I attempted to learn Scala and attended a few conferences. Unfortunately, I can’t show you actual pictures, but let’s just say the average Scala speaker’s appearance left much to be desired.
00:01:32.920 Let’s start discussing the main topic of the talk. Recently, I reflected on my career over the last ten years. There were multiple reasons for this reflection: I returned to Barcelona after ten years abroad, I reached my tenth anniversary at Zendesk, and I’ve been watching a TV show called "Great People of Culture". It’s an anime series about typical elves, warriors, and wizards, following a tradition of magic that can metaphorically represent technology.
00:02:21.000 One of my favorite examples is the novel "A Wizard of Earthsea", where the idea is that you can perform magic if you know the true names of things. This concept aligns perfectly with abstractions in object-oriented programming.
00:02:56.560 As Arthur C. Clarke famously said, 'Any sufficiently advanced technology is indistinguishable from magic.' How does this relate to my talk? In the anime, the main character is an immortal elf who has lived for 2,000 years. The show is primarily told through flashbacks, showing how certain spells and technologies revolutionized magic. This concept resonates with our view as engineers at Zendesk.
00:03:39.600 Anatoly and I feel we have a rare form of knowledge, much like the elf's forgotten magical wisdom. We both have been with Zendesk for ten years, which is quite rare in this industry. We want to share what we’ve learned, particularly about our Rails monolith.
00:04:21.600 I will discuss not only our monolith at Zendesk but also the concept of monoliths and how it has evolved over the past decade. I can take you back to a time when monolithic architectures were viewed negatively. Five years ago, they were completely out of fashion.
00:04:40.000 There's been a recent comeback, and I believe that many software debates within the engineering community behave like pendulums. Opinions often swing between extremes throughout the decades. While it's important to respect varied opinions, some pendulums can be dangerous if you lean too far in one direction. For instance, remember the time when NoSQL databases were deemed to replace relational databases?
00:05:25.759 I doubt companies that discarded all their relational databases fared well. Personally, I think that the microservices boom was a zero-interest-rate phenomenon: companies had abundant funds to optimize their tech before truly needing those optimizations. It's interesting to note that after I presented these ideas a few months ago, DHH echoed a similar sentiment on Twitter.
00:06:17.560 I’m not implying you should never implement microservices; rather, they do make sense in various situations. Just months ago, a friend in Montreal described his company’s overly complex microservices architecture, and I initially thought they were doing great—only to find out they were struggling.
00:06:33.080 It's challenging now with the layoffs that many companies face. Imagine managing thousands of microservices when your workforce is cut in half. Still, let’s not lose hope; there’s a space for microservices, but it’s vital to evaluate when they make sense.
00:07:30.360 Looking at our journey at Zendesk, we initially created a monolith in Rails. The idea was to extract parts of it into gems so we could save time when building new applications. Currently, however, we are at a stage I call the 'event-driven architecture era' where different applications generate events through Kafka, consumed by other applications.
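The event-driven stage described here can be sketched in a few lines of Ruby. This is a minimal in-memory illustration under stated assumptions, not Zendesk's implementation: the `EventBus` class, the `ticket.created` topic, and the payload fields are all hypothetical, and a real system would publish to a broker such as Kafka rather than to an in-process hash.

```ruby
# Minimal in-memory stand-in for a message broker (hypothetical; a real
# deployment would publish to Kafka topics rather than an in-process hash).
class EventBus
  def initialize
    @subscribers = Hash.new { |hash, topic| hash[topic] = [] }
  end

  # Register a consumer block for a topic.
  def subscribe(topic, &handler)
    @subscribers[topic] << handler
  end

  # Deliver the event to every consumer of the topic, in subscription order.
  # Returns the number of consumers that received it.
  def publish(topic, payload)
    handlers = @subscribers.fetch(topic, [])
    handlers.each { |handler| handler.call(payload) }
    handlers.size
  end
end

bus = EventBus.new
search_index = []
notifications = []

# Two independent consumers react to the same event (names are invented).
bus.subscribe("ticket.created") { |event| search_index << event[:id] }
bus.subscribe("ticket.created") { |event| notifications << "New ticket ##{event[:id]}" }

delivered = bus.publish("ticket.created", { id: 42, subject: "Cannot log in" })
```

The producer only knows the topic name, not who consumes it, which is what lets new applications subscribe without changes to the monolith.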
00:08:18.240 Despite the big monolith, we’re experimenting with modularization. We’re using patterns from Shopify to create clearer boundaries within our code. However, before delving too deeply into the history of Rails, I want to touch on the front-end developments over the past decade.
00:08:53.759 When I entered the industry around the early 2010s, it felt like entering the 'Great JavaScript Wars.' Every week, a new JavaScript framework emerged. Back in the day, many big applications used various front-end technologies, like Backbone, Ember, or React. The enthusiasm for creating new frameworks among JavaScript engineers is legendary.
00:09:57.880 At Zendesk, we had our in-house framework, lovingly called 'CJs' (nobody remembers the original name), developed by an unknown engineer. It was never documented and has become a daunting maintenance task over the years. If that engineer is listening, please reach out.
00:10:52.480 Eventually, like many companies, we centralized our efforts on React, rewriting most of our front-end code. I genuinely enjoy working with React, especially with React Native. However, I sometimes wonder if React triumphed simply because people grew weary of constantly learning new frameworks.
00:11:13.840 Let me shift gears and discuss infrastructure and deployment. Over the last 15 years, we transitioned from VPS services like GigaSpaces and Rackspace, eventually moving to our own metal. After using dedicated servers for five years, we decided to migrate to the cloud using AWS.
00:12:03.680 Currently, all our databases run on AWS, utilizing Aurora. I won’t go into detail about this part of our journey—there are blog posts explaining the shift, including interviews with our original CTO.
00:12:45.760 Next, I’d like to discuss how we adopt new technologies at a large company like Zendesk, which has around a thousand engineers spread across multiple global offices. Despite using a variety of technologies, Ruby remains our primary language.
00:13:32.680 In the past, integrating new technology was relatively easy; you just needed to talk to your manager. However, it led to chaos, as engineers had different needs and experiences. To address this, we implemented a 'tech menu,' requiring proposals for new technologies to be reviewed and approved by a team of architects.
00:14:07.680 Some discussions have drawn much attention within Zendesk, particularly around which languages to use for microservices. We accepted Java and Scala, but ultimately Java became the more dominant language due to hiring challenges with Scala.
00:14:57.760 Acquisitions have been instrumental in introducing new technologies. For example, ten years ago we acquired Zopim, a chat company from Singapore, which used Python. Consequently, working at Zendesk could also involve Python.
00:15:43.680 The most intriguing aspect of acquisitions is domain fusion. As a customer support company, we manage tickets, while Zopim focused on chat. Thus, integrating the two systems raises conceptual questions about what constitutes a ticket or a chat conversation.
00:16:36.840 To create a common experience for our users, we need to ensure that the integration feels fluid. Our historical experience leads to essential agreements about shared features across various domains, aligning everything to the concept of a 'ticket' as our core building block.
00:17:19.200 Now, I'll hand over to Anatoly, who will dive into database performance.
00:17:50.200 Anatoly begins his presentation by expressing appreciation for Japanese-made products, stating that he uses Ruby at work and pointing out his Japanese-made guitar and watch. He then asks how many in the audience have used Ruby for over 15 years.
00:18:04.800 Anatoly started with Ruby in 2007, and it remains his favorite language. Shifting topics, he indicates that complex systems resemble weather systems—unpredictable. For instance, decisions within large systems may not align with expectations.
00:18:41.200 Anatoly references a saying that all happy systems are alike, while each unhappy system is unhappy in its own way. This challenges the notion that perfection can be achieved without addressing every little detail.
00:19:06.240 Anatoly emphasizes that the database is the primary bottleneck at Zendesk. When asked whether Ruby is fast or slow, he argues that database performance ultimately determines application performance.
00:19:51.760 Drawing attention to complexity, he explains how the size of the data set impacts response time and requires attention. Anatoly makes the point that addressing database efficiency is paramount, as the data structure dominates performance.
00:20:51.360 He highlights three critical aspects of databases: consistency, runtime, and complexity. You cannot have all three at once, and the choice of which two to prioritize shapes your infrastructure design.
00:21:14.240 Anatoly discusses Zendesk's early days when database queries were relatively straightforward, but as data increased, performance declined, exposing the complexity of real-time queries.
00:22:05.960 He notes that as the data set expanded, complex queries would no longer work. They consequently began optimizing the database to address the increase in data while moving away from complexity.
00:22:56.320 Anatoly brings up the Pareto principle, which suggests that a small percentage of queries consume the vast majority of resources. Optimizing these queries can yield significant performance improvements.
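The Pareto observation can be made concrete with a short Ruby sketch that finds the smallest set of queries responsible for most of the database time. The query names and timings below are invented for illustration, and `heaviest_queries` is a hypothetical helper, not something shown in the talk.

```ruby
# Hypothetical per-query totals (milliseconds of database time), as they
# might be aggregated from a slow-query log. Names and numbers are invented.
QUERY_TIME_MS = {
  "tickets_by_status" => 5_000,
  "search_comments"   => 3_000,
  "user_lookup"       => 900,
  "org_settings"      => 600,
  "audit_trail"       => 300,
  "tag_autocomplete"  => 200
}.freeze

# Return the names of the heaviest queries, taken in descending order of
# cost, until their combined time reaches threshold_percent of the total.
def heaviest_queries(totals, threshold_percent: 80)
  grand_total = totals.values.sum
  running = 0
  sorted = totals.sort_by { |_name, ms| -ms }
  selected = sorted.take_while do |_name, ms|
    below_threshold = running * 100 < grand_total * threshold_percent
    running += ms
    below_threshold
  end
  selected.map(&:first)
end

top = heaviest_queries(QUERY_TIME_MS)
```

With this sample data, two of the six queries account for 80% of the total time, so they are the ones worth optimizing first.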
00:23:36.480 The focus then turns back to Cristian, who shares thoughts on maintaining the monolith, stressing the importance of testing and reliability.
00:24:04.080 Cristian mentions an ongoing debate in the Ruby community about testing versus typing. Traditional Ruby philosophy suggests that rigorous testing eliminates the need for strict type definitions.
00:24:38.560 Even after the introduction of the Ruby 3 type system, the Zendesk team still relies heavily on tests: the codebase has over 1.6 million lines of code and 55,078 tests in place.
00:25:09.600 Cristian introduces Ryan Davis, the protector of the main branch. He humorously emphasizes the importance of testing, illustrating the significance of ownership in a vast codebase.
00:25:49.840 In the codebase, ownership is assigned to teams or individual engineers to facilitate smoother code reviews and integration.
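One way to picture code ownership is a lookup from file paths to owning teams, similar in spirit to a CODEOWNERS file. The sketch below is an assumption about how such a mapping could work, not a description of Zendesk's tooling: the glob patterns, team names, and the `owner_for` helper are hypothetical.

```ruby
# Hypothetical ownership rules, loosely modeled on a CODEOWNERS-style file.
# The glob patterns and team names are invented for illustration.
OWNERSHIP_RULES = [
  ["app/models/ticket*.rb", "team-tickets"],
  ["app/controllers/api/**/*.rb", "team-api"],
  ["lib/search/**/*.rb", "team-search"]
].freeze

# Return the first team whose pattern matches the path, or nil if unowned.
# FNM_PATHNAME keeps "*" from crossing directory separators and enables
# "**/" to match nested directories.
def owner_for(path, rules = OWNERSHIP_RULES)
  match = rules.find do |pattern, _team|
    File.fnmatch?(pattern, path, File::FNM_PATHNAME)
  end
  match&.last
end

reviewer = owner_for("app/controllers/api/v2/users_controller.rb")
```

A review tool could call a helper like this for every file in a pull request and request reviews from the matching teams; unowned files (a `nil` result) would need a fallback rule.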
00:26:17.120 Cristian then highlights the historical reliance on Minitest within Zendesk, as it provides speed and efficiency, running 11,572 tests for each pull request.
00:27:05.480 Despite efforts to enhance reliability, challenges like flaky tests persist due to the vast size of their test suite.
00:27:49.320 Historically, testing required running the entire suite when just a single test failed, but advancements now allow for more efficient testing, improving reliability.
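The talk does not detail how reruns work, so the following Ruby sketch shows only one possible approach: run everything once, rerun just the failures, and flag anything that passes on the second attempt as flaky. The `failing` and `run_with_retry` helpers and the sample checks are hypothetical and stand in for a real test runner.

```ruby
# Run each named check and return the names of those that fail.
# A check is any callable that returns true (pass) or false (fail).
def failing(checks, names = checks.keys)
  names.reject { |name| checks.fetch(name).call }
end

# The first pass runs everything; the second pass reruns only the failures.
# Checks that fail once and then pass are reported as flaky.
def run_with_retry(checks)
  first_failures = failing(checks)
  still_failing = failing(checks, first_failures)
  { failed: still_failing, flaky: first_failures - still_failing }
end

attempts = 0
checks = {
  "adds numbers" => -> { 1 + 1 == 2 },
  "broken on purpose" => -> { 1 + 1 == 3 },
  # Deterministic stand-in for a flaky test: it fails on its first call only.
  "passes on retry" => -> { (attempts += 1) > 1 }
}

report = run_with_retry(checks)
```

Rerunning only the failed subset keeps the feedback loop short on a large suite, and separating flaky results from real failures shows which tests need stabilizing.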
00:28:31.480 Cristian underlines Zendesk's commitment to keeping the codebase maintainable using ownership and testing without compromising quality during upgrades.
00:29:14.880 He outlines challenges faced over the years, including a heavy reliance on metaprogramming that made updating Rails versions difficult, and points to issues illustrated by deleted or modified code in their branches.
00:30:32.960 Cristian reiterates how they struggled with maintaining effective features in previously diverged gems and frameworks during transitions to updated Rails.
00:31:54.480 In closing, Cristian emphasizes the importance of learning from the past, sharing that they have had several first-mover disadvantages in the competitive Ruby space, often leading to larger issues down the road.
00:32:35.120 To avoid repeating those mistakes, he advises that when you discover unmet needs or challenges within the Rails environment, rather than executing significant changes independently, collaborate with the community for better outcomes.
00:33:09.520 As they wind down their presentation, they emphasize the interconnectedness between successfully scaling performance and managing engineering organizations: while mitigating performance issues is vital, understanding the organizational landscape plays an equally crucial role.
00:34:19.320 Finally, Cristian shares reflections on the themes of history, lessons learned, and hopes for the Ruby community's bright future, urging participants to consider potential opportunities at Zendesk as they continue hiring.