The history of a Rails monolith

Cristian Planas is a software engineer who has worked primarily with Rails since the release of Rails 3, more than 10 years ago. He is currently a group tech lead and senior staff engineer at Zendesk.

Anatoly Mikhaylov is a Performance and Reliability engineer with over 15 years of experience. He's a part of the Capacity and Performance team at Zendesk.

They're delivering a story about a long-running Ruby monolith powering a successful business at scale.

Balkan Ruby 2024

00:00:05.080 Hello, Bulgaria! I'm super happy to be talking here. One of the reasons this presentation is about the past is that this is the homeland of someone important to me, my childhood hero. Does everybody know who this guy is? Before someone talks about Bulgarian soccer, I want to mention that in Barcelona, we know a lot about soccer, but there's another reason I’m happy to be here.
00:00:17.880 It's about a novel by Georgi Gospodinov called 'Time Shelter'. I recommend you read it. The general idea is that people don't want to live in the present; they prefer to live in an imaginary past and seclude themselves there. This is also the topic of my presentation in a certain way.
00:00:39.800 Let's start with some general ideas. This conversation will be about time. It's a moment in my life where I felt the need to reflect on the last ten years. I recently moved back to Barcelona after living abroad for ten years, mainly in Denmark and the U.S. Interestingly, I have also been working at Zendesk for the past ten years. As an aside, I started watching a show called 'Frieren', which you might find interesting if you're into anime. In my opinion, it's one of the greatest anime shows ever, as it explores magic, a great tradition in storytelling that can also metaphorically relate to technology.
00:01:15.680 In this show, you find an elf who is immortal, having lived over a thousand years and witnessed the evolution of magic. It’s similar to how frameworks evolve; what once seemed strange can become mainstream. I relate to this character because, like her, I worked in the same company as an engineer for ten years, which is quite unusual in our field, making me a custodian of forbidden magical Ruby on Rails knowledge.
00:01:56.600 This presentation is about sharing that knowledge. I want to talk about the Ruby monolith. I've been working with Ruby for about fifteen years now; this picture is from 2010. Back then, we had interesting times; we liked monoliths. I am happy to see that monoliths are back in favor, at least in some communities. Five years ago, if you mentioned starting a company with a monolithic architecture, people would think you were making a foolish choice. However, I believe in a pendulum effect in software engineering: concepts float back and forth between extremes.
00:02:46.400 For example, monoliths were very mainstream fifteen years ago, then everyone rushed to distributed systems, and now we find ourselves somewhere in the middle. I don’t mean to suggest that opinions on architecture don’t matter—this pendulum can be dangerous. If a startup initiates with twenty microservices before it even has a user base, success becomes hard to achieve. Interestingly, I believe that the recent boom in microservices is tied to a zero-interest-rate phenomenon; companies had ample funding without millions of users at the onset.
00:03:36.960 This phenomenon can lead to premature optimization, creating overly complex architectures too soon, which can hinder future scalability. I remember meeting a friend in Montreal who told me that their startup had more microservices than customers. I’m not against microservices; they have their place in technology. But my opinion aligns with that of Jason Warner, the former CTO of GitHub, who reminds us that there’s a gradient between different architectural decisions.
00:04:14.320 If you’re starting a company, starting with a monolith can be the safer choice, and then gradually exploring the service-oriented architecture as the company grows makes sense. Now, how does this apply to Zendesk? We started with a simple vision: our CEO wrote a Ruby on Rails application. Over time, as we grew, we decided to create smaller Rails applications to serve various functions, while these applications shared core logic.
00:05:05.040 Internally at Zendesk, we have a lot of shared private gems utilized by our legacy applications. Then we transitioned to what I call the service era. This shift occurred as we began acquiring new products and desired integration—like creating a unified account and user service without relying solely on our Rails monolith. This led us to write new applications in different technologies and adopt event-driven architecture, where we generate Kafka messages consumed by other applications.
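The event-driven pattern mentioned here can be illustrated with a small in-memory sketch. It is not Zendesk's implementation: `EventBus`, the `account.updated` topic, and the payload fields are hypothetical, and a plain Ruby object stands in for Kafka so the example runs without a broker.

```ruby
# In-memory stand-in for a message broker; a real setup would use a Kafka client.
class EventBus
  def initialize
    @subscribers = Hash.new { |hash, topic| hash[topic] = [] }
  end

  # Register a consumer block for a topic.
  def subscribe(topic, &handler)
    @subscribers[topic] << handler
  end

  # Deliver an event to every consumer registered for the topic.
  def publish(topic, payload)
    @subscribers[topic].each { |handler| handler.call(payload) }
  end
end

bus = EventBus.new
received = []

# Hypothetical downstream service reacting to account changes.
bus.subscribe("account.updated") { |event| received << event[:account_id] }

# The monolith emits an event instead of calling the service directly.
bus.publish("account.updated", { account_id: 42, name: "Acme" })
```

The point of the design is that the publisher does not know who consumes the event, so new applications can subscribe without changes to the monolith.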
00:05:40.720 Now, regarding the future of our monolith: I believe we will focus more on developing strict modules that interact only in specific ways. We’re actually beginning to integrate into the monolith. However, before moving forward in this section, I want to discuss the front end, even though this is a Ruby conference. Let’s revisit the 2010s, a time when anyone developing web applications might feel some PTSD from the JavaScript wars.
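One way to picture "strict modules that interact only in specific ways" in plain Ruby is to hide a module's internals behind a small public entry point. This is only a sketch under assumptions: the `Billing` module and its methods are hypothetical, and the talk does not say which mechanism Zendesk uses.

```ruby
module Billing
  # Internal implementation detail; other modules should not touch it.
  class InvoiceCalculator
    def total(line_items)
      line_items.sum { |item| item[:price] * item[:quantity] }
    end
  end
  private_constant :InvoiceCalculator

  # The only entry point other modules are meant to call.
  def self.invoice_total(line_items)
    InvoiceCalculator.new.total(line_items)
  end
end

total = Billing.invoice_total([{ price: 10, quantity: 2 }, { price: 5, quantity: 1 }])

# Reaching past the public API raises NameError.
begin
  Billing::InvoiceCalculator
  leaked = true
rescue NameError
  leaked = false
end
```

`private_constant` only guards constant lookup, so it is a lightweight boundary rather than full isolation.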
00:06:40.920 Do you remember those days when Hacker News would have multiple new frameworks every week? This created a need for constant learning. At Zendesk, we utilized various frameworks like Vapor, Ember, and React at different times, forcing us to rewrite many applications frequently.
00:07:00.920 We even reached a point where we had an internal framework called KJS, named so informally because no one could remember its actual name. This showcases how everyone seemed eager to create frameworks, adding unnecessary complexity. Ultimately, the JavaScript scene began to standardize around React, simplifying life for developers, as it reduced the need to learn countless different frameworks.
00:07:52.440 At Zendesk, we have a robust Ruby monolith that has been in place for over seventeen years. In those years, it was easy to introduce a new technology to the stack. However, as our company grew, we recognized the necessity for a more structured approach. Thus, we developed a technology menu listing approved technologies that teams must use. If a team wants to introduce a new technology, they must submit a proposal for approval by the architecture team.
00:09:05.760 These discussions have been quite lively within the company, especially when we debated which technology to use for service writing. I advocated for Elixir, but the decision settled on Java and Scala, a compromise that allowed various parts of our company to develop services with those languages.
00:09:43.960 As of 2024, Java remains more prevalent than Scala, partly due to the difficulty of hiring Scala developers. Interestingly, many of the technologies we adopted have come through acquisitions; half of our existing products descend from acquired companies, like Zopim, which brought Python into our stack.
00:10:25.120 However, one of the critical lessons learned from acquisitions is not merely about technology; it's also about domain fusion. At Zendesk, we specialize in customer support. When we acquired a chat company, we faced the challenge of integrating two different domain models into a cohesive service.
00:10:52.920 This task of merging domains quickly became a technical challenge as we sought to align chat interactions with ticketing functionality. As we navigated these complexities, we also realized the growing importance of integrating AI technology within customer support interactions.
00:11:40.000 This has led us to label everything as a ticket, which has opened up significant debates within our architecture team about how to map chats, conversations, and phone calls to a single entity for consistency. Now, Anatoly will discuss database performance.
00:12:15.600 Hi everyone, my name is Anatoly, and I’ll be covering database performance at Zendesk. Currently, we have around 2,000 engineers and 6,000 employees in total, which means managing large data sets is critical for our database performance.
00:12:48.800 How many of you deal with large data sets in your database? I would venture to say everyone will eventually face this issue. The importance of database performance is evident as application performance directly relies on how efficiently our database operates. Since the transition from data centers to the cloud seven or eight years ago, databases have become the biggest bottleneck.
00:13:32.160 When I began working on database performance issues seven years ago, I discovered the book 'High Performance MySQL', which outlines deep troubleshooting techniques and highlights the need to focus on fixing specific query issues instead of over-generalizing server performance problems. One important concept to remember is that fixing database performance issues requires significant effort, and small oversights can trigger broader failures.
00:14:20.720 Applications rely on a well-performing database, which directly impacts the reliability and speed of our operations. At Zendesk, we learned that having too much data can severely hinder server responsiveness and make operations much trickier.
00:15:01.320 One major takeaway is to not expect cloud providers to magically improve speed; simply accepting reality helps streamline operations. The truth is, as data sets grow, performance deteriorates, so managing that growth becomes paramount for a seamless user experience. I cannot stress enough that analytical queries and transactional queries cannot coexist effectively within the same database environment.
00:15:47.680 Running long analytical queries on top of transactional queries can degrade performance significantly. As we discovered with our own setup, we ran into trouble because the query planner in MySQL became overwhelmed due to the growing data set, making it imperative to address any potential index misses or small errors before they escalate.
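To make "index misses" concrete, here is a hedged sketch that scans rows shaped like MySQL `EXPLAIN` output and flags likely full table scans. The `type`, `key`, and `rows` columns are real `EXPLAIN` fields; the `index_misses` helper, the row threshold, and the sample rows are assumptions made for illustration.

```ruby
# Flag EXPLAIN rows that suggest a missing index on a large table.
# `row_threshold` is an arbitrary cutoff chosen for this sketch.
def index_misses(explain_rows, row_threshold: 10_000)
  explain_rows.select do |row|
    full_scan = row["type"] == "ALL" # "ALL" means a full table scan
    no_index = row["key"].nil?       # no index was chosen for this table
    (full_scan || no_index) && row["rows"].to_i >= row_threshold
  end
end

# Hypothetical EXPLAIN rows, as hashes keyed by column name.
explain_rows = [
  { "table" => "tickets",  "type" => "ALL", "key" => nil, "rows" => 2_500_000 },
  { "table" => "users",    "type" => "ref", "key" => "index_users_on_account_id", "rows" => 12 },
  { "table" => "settings", "type" => "ALL", "key" => nil, "rows" => 40 }
]

suspects = index_misses(explain_rows).map { |row| row["table"] }
```

Small tables are skipped on purpose, since a full scan over a few dozen rows is rarely the problem.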
00:16:32.040 The past also teaches us that our applications depend heavily on the reliability of data interactions. As we have evolved at Zendesk, we recognized that maintaining a small, efficient set of data interactions—like focusing on a mere 20% of the most critical queries—can lead to significant performance improvements.
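The idea of concentrating on the most critical 20% of queries can be sketched as a ranking by total time consumed. The `QueryStat` struct, the `heaviest_queries` helper, and all sample numbers are hypothetical; in practice the input would come from something like a slow query log or a monitoring tool.

```ruby
# Hypothetical per-query stats, e.g. aggregated from a slow query log.
QueryStat = Struct.new(:fingerprint, :calls, :avg_ms) do
  def total_ms
    calls * avg_ms
  end
end

# Return the top `fraction` of queries ranked by total time consumed.
def heaviest_queries(stats, fraction: 0.2)
  count = [(stats.size * fraction).ceil, 1].max
  stats.sort_by { |stat| -stat.total_ms }.first(count)
end

stats = [
  QueryStat.new("SELECT tickets by account", 9_000, 12.0),
  QueryStat.new("SELECT users by email", 20_000, 0.5),
  QueryStat.new("UPDATE ticket status", 3_000, 4.0),
  QueryStat.new("SELECT audit trail", 50, 900.0),
  QueryStat.new("INSERT comment", 5_000, 1.0)
]

top = heaviest_queries(stats)
# Share of total database time spent in the selected queries.
share = top.sum(&:total_ms) / stats.sum(&:total_ms)
```

Ranking by calls times average latency, rather than by latency alone, keeps frequent but cheap-looking queries in view.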
00:17:20.840 However, I also want to address how we keep the large Ruby application reliable. A huge part of the Ruby community’s ethos suggests that we don't need types because we have tests. Since we don’t use types, we must ensure comprehensive testing, and unit tests are integrated into our workflows.
00:18:06.040 At Zendesk, we have maintained around 55,786 unit tests, not counting integration tests or browser-based API tests. A pivotal individual behind our success is our Custodian, famous for protecting our main branch and ensuring the integrity of our code. Making sure the code base remains stable and cohesive is central to maintaining a functional application.
00:19:08.440 One common question from new employees is why we chose to use MiniTest over RSpec, despite RSpec being popular. The principal reason is performance: MiniTest is smaller and faster than RSpec, and since test runtime matters to our workflows, we prefer it.
00:19:48.080 We experienced issues with flaky tests—those that pass most of the time but fail occasionally—because they affect our build confidence. Previously, our testing systems could run over 111,000 tests, making it cumbersome to pinpoint failures, but with recent improvements, we have simplified our processes significantly.
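A simple way to spot a flaky test is to run it several times and see whether the outcome changes. The sketch below assumes tests are plain callables that raise on failure; `classify`, the run count, and the sample lambdas are hypothetical, and a real suite would hook into the test runner instead.

```ruby
# Run a test callable several times and classify the outcome.
# `runs` is an arbitrary choice for this sketch.
def classify(test, runs: 10)
  outcomes = Array.new(runs) do
    begin
      test.call
      :pass
    rescue StandardError
      :fail
    end
  end.uniq

  return :flaky if outcomes.size > 1

  outcomes.first == :pass ? :stable : :broken
end

calls = 0
# Deterministic stand-in for a flaky test: fails on every third call.
sometimes_fails = lambda do
  calls += 1
  raise "timeout" if (calls % 3).zero?
end
always_passes = -> { true }
always_fails = -> { raise "assertion failed" }

verdicts = [sometimes_fails, always_passes, always_fails].map { |test| classify(test) }
```

Separating flaky from consistently broken tests matters because they call for different fixes: one is a timing or ordering problem, the other a real regression.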
00:20:46.080 We maintain strict ownership of our code—using GitHub's code owners feature—and we require teams to claim ownership of files before making changes. This organizational structure provides clarity and streamlines contributions.
00:21:43.600 Now, let's discuss upgrading Rails. Zendesk has been around for a while, and we've had to navigate upgrading from earlier versions, which has often relied on our comprehensive testing infrastructure. As we move through versions, it is understood that we should always run tests for both the current version we intend to maintain and a target version for which we aim to upgrade.
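Running the suite against both the current and the target Rails version is often done by switching the dependency with an environment flag. The following is a minimal sketch of that idea, not Zendesk's setup: the version pins, the `RAILS_NEXT` variable, and the `rails_requirement` helper are all assumptions.

```ruby
# Hypothetical version pins; adjust to the versions you actually run.
CURRENT_RAILS = "~> 7.0.0"
TARGET_RAILS  = "~> 7.1.0"

# Pick the Rails requirement based on an environment flag.
def rails_requirement(env = ENV)
  env["RAILS_NEXT"] == "1" ? TARGET_RAILS : CURRENT_RAILS
end

# In a Gemfile this would be used as:
#   gem "rails", rails_requirement
```

CI can then run the same tests twice, once with the flag unset and once with it set to `1`, so regressions against the target version surface before the upgrade lands.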
00:22:38.200 Historically, we found it hard to keep up with the Rails ecosystem, especially because we adopted certain features early on, like strong parameters, and later paid for it as versions shifted. Through these upgrade processes, we’ve constantly wrestled with our earlier choices, unifying methods and ensuring consistent usage across the organization.
00:23:49.680 I urge you to seek community collaboration rather than working alone. If you find a gem or library that might require improvement, contribute to it instead of reinventing the wheel. Leverage the collective experience and knowledge to evolve your projects in line with industry standards.
00:25:44.240 Lastly, I want to highlight my main idea today. In the world of software development, while performance is vital, the real reasons companies stumble often lie elsewhere. Many startups do not fail due to performance issues but rather internal mismanagement or organizational challenges that are less visible.
00:26:23.560 It’s crucial to consider not just scalability in tech but also the organizational structure that governs a company. Building a well-functioning organization can feel like a time bomb, but learning from those experienced before us can make all the difference. The past can inform the future but should never cage us into a fixed mindset.
00:27:26.400 It’s essential to listen to the past and learn without letting preconceived notions hinder progress. By merging the lessons of the past with visions for the future, you may become knowledgeable and help the Ruby community thrive.
00:28:12.320 Thank you!