2000 Engineers, 2 millions lines of code: The history of a Rails monolith

by Cristian Planas

In this presentation at the Ruby Warsaw Community Conference, Cristian Planas, a Senior Staff Engineer at Zendesk, discusses the evolution and maintenance of a Rails monolith over a span of 17 years. The session highlights the challenges and strategies involved in sustaining a large codebase while adapting to technological changes and scaling demands. Planas draws parallels between the tech community and cinema, using a reference to the film Gladiator 2 to illustrate that narratives within software development can sometimes be as dramatic and convoluted as those in movies.

Key Points:

- Rails and Monoliths: Planas emphasizes the importance of Rails as the core of a successful startup and its viability over time, despite past doubts about monoliths.
- Experience at Zendesk: He shares insights from his 10 years at Zendesk, which scaled from a startup to a major player serving hundreds of millions of users, while maintaining Rails as the backbone of their architecture.
- The Pendulum of Software Architecture: Planas discusses the shifting opinions in the tech community, particularly the perspectives on monolithic versus microservices architectures, highlighting that neither approach is without its trade-offs.

- Evolution of Technologies: He details Zendesk's journey from a monolith to experimenting with service-oriented architectures as the company grew and acquired other businesses, leading to a need for more modular solutions.
- Frontend Development: Planas recounts the “JavaScript Wars,” discussing Zendesk's adoption of multiple JavaScript frameworks and the eventual centralization on React due to its robustness and community support.
- Adopting New Technologies: He explains Zendesk's 'tech menu' for approving new technologies, reflecting the challenges of integrating diverse tech solutions into a large organization.
- Scaling Databases: Planas describes various database optimization techniques including sharding and archiving, which are essential for managing large datasets.
- Testing and Code Management: He stresses the significance of testing in maintaining quality amidst a vast codebase, citing Zendesk’s robust testing culture and challenges with flaky tests.
- Rails Upgrades: Planas shares experiences with Rails upgrades, stressing the importance of thorough testing to ensure smooth transitions between versions and calling attention to the technical debt associated with previous code practices.

Conclusions and Takeaways: Planas concludes that while performance issues are often visible, maintaining a well-organized engineering team is more crucial yet less perceptible. He encourages engineers to engage with the community, learn from historical practices, and not to rely solely on tradition when approaching new problems. This ongoing journey of adapting and evolving technologies at Zendesk exemplifies the empowerment drawn from accumulated knowledge in the engineering realm.

00:00:09.160 Okay, let's start the second presentation. Welcome our second speaker.
00:00:14.759 Cristian Planas is a Staff Engineer at Zendesk, and he will bring us the topic about a really huge monolith. So, we are capable of doing that in Rails. Let's welcome Cristian!
00:00:22.439 Good luck, thank you!
00:00:48.800 Hello, my name is Cristian. This is a talk for the Ruby Warsaw Community Conference. You didn't enter the wrong cinema; I'm still going to talk about Rails.
00:00:55.520 Funny enough, until I literally arrived at the theater, I didn't know this talk was going to happen in a cinema.
00:01:06.200 Anyway, I wanted to start by talking about Gladiator 2, and you will see why. So, I decided that it would be nice to play the trailer; it's a short trailer, only 1 minute.
00:01:20.600 [Trailer plays] This is not a matter over which a duel should be fought to the death; it should be settled quietly. If this were to get out, war would follow. Rome, if you think this can be stopped, you're fooling yourselves. All of this madness began with Maximus Meridius, and now, years later, we are paying for his deeds.
00:01:40.920 You rest your bones; I'll finish your quest for you.
00:01:46.399 I'm tired of reliving that nightmare every day. What do you want from me? Revenge. This is the last step in a master plan, long in the making.
00:02:05.039 Let the games begin!
00:02:48.239 I want you to understand that Ruby is very special. We are a really special community. This presentation is about the history of Rails through the monolith of Zendesk, which has been around for 17 years now.
00:03:00.800 Why is storytelling important? If you like cinema, and I love cinema, there's been a lot of discussion about how historically accurate 'Gladiator 2' is.
00:03:11.560 It's not the first time we have this debate; remember Napoleon, the previous movie by Ridley Scott? A lot of people said that they had to add drama to the story, making things more beautiful.
00:03:29.319 Some directors do this. But do you know what doesn't need any embellishment to be epic? Our community. It is full of drama; it's actually fantastic. Last year, we saw DHH versus TypeScript; there used to be a website called rubydramas.com. I think they removed it. It's fascinating. Don’t get me wrong, I love this; it has huge entertainment value.
00:03:50.879 For about a year or so, I tried to become a Scala developer, and it was terrible. I went to a Scala conference, and the majority of speakers looked like responsible people in nice suits. Meanwhile, I also attended a Ruby conference where the keynote speaker looked very different.
00:04:10.360 I mean, we are special, that’s for sure. Now, let me introduce why I wanted to talk about this topic. For me, it's an interesting moment in my life. Maybe that’s why I'm reflecting on the last 10 to 15 years; that's the time I've been a Rails developer.
00:04:31.080 I moved back to Barcelona, my hometown, after 10 years. I also celebrated my 10th anniversary working for Zendesk.
00:04:41.479 Additionally, I started watching this TV show called 'Frieren.' Does anyone know about it? It's an anime show in a typical fantasy environment, like 'Lord of the Rings,' featuring elves, magic, and swords.
00:04:54.240 It follows a beautiful tradition in fantasy of talking about technology and various themes philosophically through magic.
00:05:10.199 One of my favorite novels states that magic is knowing the true names of things. This resonates with me; if you create the right abstraction, you control it. It’s how I think.
00:05:22.280 This metaphor isn’t new; it reflects ideas from various thinkers. What makes 'Frieren' special and relevant here is that she (the main character) lives in a world where there are very few elves. Elves are immortal, and the majority of the characters are human.
00:05:36.199 She experiences development through magic and other elements across centuries. Events that initially didn’t matter become significant over time.
00:05:45.440 This mirrors how we develop technology, particularly software. After watching the show, I started identifying with Frieren. She's an elf over a thousand years old, possessing forgotten magical knowledge.
00:05:58.400 I feel equally mythical; after spending over 10 years at Zendesk, I possess forgotten magical Rails knowledge I'd like to share.
00:06:05.919 First, I want to address the concept of a monolith. This was quite popular in the early 2010s; it was seen as the way to go. But a few bad ideas arose during that time.
00:06:12.880 By the late 2010s, it became terrible to speak positively about monolithic architectures. Lately, it's become more acceptable to defend them. I find it unsurprising; debates in tech, particularly in software engineering, behave like a pendulum.
00:06:29.959 First, there’s option A, then A all the way, then to option B, and we oscillate between them. I really dislike when people talk about software engineering as a hard science; we are more like architects.
00:06:44.880 This doesn’t mean we should think all ideas are right; it can be a dangerous pendulum. Ideas can sometimes be wrong or impractical. I remember when many proclaimed relational databases were finished.
00:06:54.879 Everyone was told to move to NoSQL. While I use NoSQL databases sometimes, I believe relational databases are still very important.
00:07:09.839 If you thought that was spicy, here’s an even crazier take: I believe the microservices boom in the last decade was a zero interest rate phenomenon, meaning that a great deal of investment was poured into tech.
00:07:26.399 As a result, some things that shouldn't have worked well worked out just fine. I find this slide amusing because just a few weeks after I included it in my presentation, DHH said something similar.
00:07:41.920 Many people probably shared the same sentiment: microservices required an extensive workforce. I recall visiting a friend's startup in Montreal years ago.
00:07:56.480 He told me they had more microservices than customers. I was always worried he'd see my presentation and think I was making fun of him, because he said he was the only person who understood their architecture.
00:08:10.839 He felt they were not doing great. Now that there’s less money in tech, we’re seeing the trade-offs that come with using microservices.
00:08:19.480 Maintaining a system with thousands of microservices is far easier with 8,000 engineers than with just 2,000. The added complexity becomes much clearer in this scenario.
00:08:34.720 Don't misunderstand; I'm not against microservice architecture. I think it's a useful tool in the right circumstances, just like monoliths.
00:08:45.360 It's not like the current CEO of Twitter, who calls microservices bloatware; that's pretty crazy to me. My opinion aligns more with that of GitHub's CTO, Jason Warner, who says it’s a spectrum: when starting a new company or application, you should begin with a monolith and then gradually break it down into smaller services.
00:09:05.480 This is similar to what we did at Zendesk.
00:09:20.040 Initially, we aimed to create an ecosystem of Rails applications that shared logic. The idea was to centralize logic like authentication through shared files. If you look at our monolith, we have numerous private gems shared by several applications.
00:09:34.720 Then we entered what I call the service era. This period started when we began building services, not so much out of necessity but due to acquiring new companies.
00:09:50.080 For example, we acquired a chat company; you wouldn’t want account information for the chat service mixed with those of other products. Therefore, we needed a centralized account service to share data across applications.
00:10:02.880 Currently, we are in the era of event-driven architecture, where multiple services write events via Kafka, and other services consume them, triggering additional events.
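To make the event-driven flow described here concrete, the following is a minimal Ruby sketch. It is not Zendesk's implementation and it does not use Kafka: `EventBus`, the topic names, and the payloads are hypothetical stand-ins, and the bus is a synchronous in-memory object. It only illustrates the shape of the pattern, where one service publishes an event and a consumer reacts by publishing another.

```ruby
# In-memory stand-in for a message broker such as Kafka (hypothetical, for illustration only).
class EventBus
  attr_reader :log

  def initialize
    @subscribers = Hash.new { |hash, topic| hash[topic] = [] }
    @log = []
  end

  # Register a consumer block for a topic.
  def subscribe(topic, &handler)
    @subscribers[topic] << handler
  end

  # Append the event to the log, then deliver it to every consumer of the topic.
  def publish(topic, payload)
    @log << [topic, payload]
    @subscribers[topic].each { |handler| handler.call(payload) }
  end
end

bus = EventBus.new

# A consumer of "ticket.created" reacts by publishing a follow-up event.
bus.subscribe("ticket.created") do |event|
  bus.publish("notification.requested", { ticket_id: event[:ticket_id] })
end

# A second consumer records which tickets should trigger a notification.
notified = []
bus.subscribe("notification.requested") { |event| notified << event[:ticket_id] }

bus.publish("ticket.created", { ticket_id: 42 })
```

A real broker delivers events asynchronously and durably, so consumers there also have to handle retries and ordering, which this sketch leaves out.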
00:10:19.360 Still, we maintain a large Rails monolith and engage with various models. We’re experimenting with different ways to use them.
00:10:30.480 Right now, we're using Prowler, but we’re still deciding how to incorporate it effectively. I didn’t want to move on from this section without mentioning front-end development.
00:10:45.440 If you've been working in tech since the late 2000s or 2010s, as I have, you’ve likely witnessed what I termed the ‘JavaScript Wars.' If you recall 2014, there was a new JavaScript framework making waves every few months.
00:10:58.200 At Zendesk, we still use a mix of Backbone, React, and Ember. I'm surprised we didn't use Angular, honestly.
00:11:09.760 We even have something amusing that we call CJs, a term that no one else understands. The reason for CJs is that one engineer named Cart wrote his own framework and then left the company.
00:11:25.440 As you can imagine, that was a challenge since we didn’t have any documentation.
00:11:32.159 Currently, we are centralizing on React for several reasons. First of all, React is a great framework—personally, I love React Native. But I believe that the community just got tired of so much change.
00:11:43.040 We needed to focus our efforts on learning one framework that would remain useful for a while.
00:11:58.400 Next, I want to discuss adopting new technologies—how we decide what we adopt at Zendesk. We were born as a Ruby company.
00:12:07.599 Yet, in many situations, additional technologies are necessary. Once upon a time, adding new technologies to the stack was relatively easy, and it proved to be a bad idea.
00:12:21.839 The challenge arises when you have a large team, like ours, with 1,900 people in product alone, and everyone adds whatever they want.
00:12:35.839 This chaos led us to establish a 'tech menu.' If you want to start writing applications in Elixir, great! You need to write a document and have it approved by the architectural team.
00:12:48.480 A notable example was deciding which language we wanted to use for writing services. The two winners were Java and Scala. I was on Team Elixir; we even wrote a proposal to adopt it, but it was rejected.
00:13:03.040 They told us that for those proposals, we already had Java and Scala. Over time, due to the size of our company, different parts were using Scala while others were using Java.
00:13:17.920 However, we faced difficulties hiring Scala engineers or teaching existing engineers how to use Scala. Now, more and more of the engineering teams at Zendesk are using Java for services.
00:13:32.080 Interestingly, most technologies are adopted not through intention but rather via acquisitions. Over time, we have acquired quite a few companies—around half of our main products come from acquisitions.
00:13:45.040 One example is our acquisition of Zopim, a company from Singapore. Zopim used Python, which has now become a Zendesk language.
00:13:57.760 A couple of years ago, I presented at RailsConf about scaling a Rails application. I shared a personal story about my experience before joining Zendesk.
00:14:11.720 I co-founded a company called Playful Bet, which, although it never made any money, became quite popular. We were one of the 50 most visited websites in Spain, but I was the only engineer, which made scaling incredibly hard.
00:14:28.060 At some point, I realized it didn't make sense to complain when companies like Zendesk, GitHub, or Shopify were successfully scaling their applications.
00:14:38.960 To understand why some claim that 'Rails doesn't scale,' it’s important to note the situation in which Rails became popular. Here's a picture from 2007 featuring the founding partners; you can see only two.
00:14:55.880 The one in front is our first CTO, Morten, and this coincided with the rise of Web 2.0. Before this, websites were static. However, with Web 2.0, users began to interact more.
00:15:10.560 This led to an increase in the needed database interactions, especially with the emergence of platforms like Twitter, which became an example of how Rails could work.
00:15:24.680 It’s humorous that Twitter's struggles with scaling became an example of why Rails might not be reliable, even though they did have issues.
00:15:39.840 Do you know the infamous 'fail whale'? I wrote my master’s thesis based on Twitter’s architecture. The good news is that you are not alone in your experiences.
00:15:55.600 Many companies, including GitHub, Shopify, and Zendesk, have scaled Rails significantly. Shopify's work is incredible; during Black Friday, their CEO Tobi Lütke tweeted that they were receiving 60 million requests per minute.
00:16:07.520 You might wonder what Twitter leadership thinks about Rails now, considering its history.
00:16:18.760 I have enough material to produce a full presentation on that topic. But here's some shameless self-promotion—I wrote a book about it.
00:16:36.080 One of the most critical aspects of scaling an application is the database. Typically, it's the bottleneck for most applications.
00:16:54.740 There are multiple techniques for optimization, aside from caching. Sharding is one way that surprisingly many people may not be familiar with.
00:17:09.480 The basic idea applies when your data can be partitioned. This is especially true in B2B environments, where data is only shared within a single account. Keeping everything in one database could otherwise become a huge problem.
00:17:26.000 So, splitting the database into smaller, manageable units becomes crucial. For instance, if you have a billion records and break it into 1,000 databases, each should average 1 million records.
00:17:35.159 That’s much easier to manage! Another technique we utilize is archiving. We send all data to DynamoDB.
00:17:47.600 As for what data gets archived, that's a decision we continuously evaluate. We also regulate the functionalities afforded to archived data.
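Returning to the sharding idea, here is a minimal Ruby sketch of routing by account and of the arithmetic from the example. The modulo routing, the `SHARD_COUNT` constant, and the `shard_NNNN` naming scheme are assumptions made for illustration; the talk does not describe how Zendesk actually assigns accounts to shards.

```ruby
SHARD_COUNT = 1_000 # hypothetical number of shards, matching the example in the talk

# Map an account to a shard; every record of that account lives in the same shard.
def shard_for(account_id)
  account_id % SHARD_COUNT
end

# Hypothetical naming scheme for the per-shard databases.
def database_name(account_id)
  format("shard_%04d", shard_for(account_id))
end

# One billion records spread evenly over 1,000 shards.
total_records = 1_000_000_000
average_per_shard = total_records / SHARD_COUNT
```

Because the shard is derived from the account, a query scoped to one account only ever touches one database. The trade-off is that queries spanning many accounts have to fan out across shards.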
00:18:01.000 In general, we’ve found there are three properties in good databases: they maintain a consistent runtime, manage complex queries, and house a large dataset. However, you can only have two of the three.
00:18:19.920 In the startup phase, you can maintain both query complexity and runtime, but as your dataset grows too large, performance often suffers.
00:18:36.480 Microservices tend to scale better because they push you toward simpler queries. When data is distributed among many databases, each owned by a different service, complex queries across multiple tables become unmanageable.
00:18:44.560 Next, I'd like to discuss testing and reliability in a big company. We're a large organization at Zendesk, and it’s timely to talk about testing, especially since an antivirus update broke Windows systems.
00:18:56.839 Please, test your applications thoroughly.
00:19:08.720 What I want to focus on here is the traditional debate in the Ruby community: tests versus types. You may have heard that Ruby applications are challenging to maintain due to lack of types.
00:19:23.640 The original premise of Ruby was you don’t need types, as long as you write a substantial number of tests to back it up. It’s critical to have tests, regardless.
00:19:38.320 I think both sides have valid arguments, but at Zendesk, we prioritize tests; we maintain extensive test coverage.
00:19:45.440 To illustrate, we maintain over 1.6 million lines of code. Can anyone guess how many unit tests we have?
00:19:55.440 No, it's not 5 million. It starts with a five.
00:20:00.640 So, we have 557,805 tests.
00:20:11.720 One of our junior engineers came to me recently, saying that they were unable to merge a feature because there was someone asking for more tests.
00:20:27.679 I responded, 'Oh, you must be talking about Ryan—he's the author of MiniTest and its main maintainer.' The next question new engineers ask is usually: why do we use MiniTest?
00:20:38.639 There are other more popular testing frameworks in Ruby, but we have our reasons for choosing MiniTest. It's smaller than other options, has less magic, and is much faster.
00:20:53.680 When we tried to apply more features to MiniTest, we ran into challenges. One significant issue is the prevalence of flaky tests, particularly when it comes to performance testing.
00:21:06.480 Right now, we run tests in groups of 2,000. This means if one test fails, you only need to rerun those 2,000 tests, which is much more efficient.
00:21:17.760 Previously, running over 100,000 tests would be necessary to identify a single broken test.
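The grouping described above can be sketched in a few lines of Ruby. This is only an illustration under assumptions: tests are represented by integer IDs, groups are fixed-size slices of an ordered list, and the helper names are hypothetical. The talk does not say how the groups are actually formed.

```ruby
GROUP_SIZE = 2_000

# Split an ordered list of test identifiers into fixed-size groups.
def group_tests(test_ids, group_size = GROUP_SIZE)
  test_ids.each_slice(group_size).to_a
end

# Find the group that contains a given test, so only that group is rerun.
def group_containing(groups, test_id)
  groups.find { |group| group.include?(test_id) }
end

test_ids = (1..557_805).to_a           # the test count quoted in the talk
groups = group_tests(test_ids)         # 279 groups, the last one partially filled
rerun = group_containing(groups, 123_456)
```

With this split, a single failing test means rerunning 2,000 tests instead of the whole suite.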
00:21:28.880 One key aspect of this process is ownership. With a significant codebase comes a lot of developers.
00:21:43.360 Historically, our monolith has seen 2,459 committers. While Zendesk has been around for 17 years, which might suggest a lot of code is from the past, 570 different engineers pushed code just last year.
00:21:59.720 If we don't set limits and rules, we could easily fall into chaos. To prevent this, we use a GitHub feature that establishes ownership; all files must be owned by at least one team.
00:22:14.480 This makes changes more accountable. If an incident occurs, it’s much easier to detect who to approach.
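The rule that every file must have an owning team can be checked mechanically. The Ruby sketch below is a simplified illustration, not the GitHub feature itself: it assumes ownership is declared per directory prefix with the longest prefix winning, and the team names, paths, and helper names are hypothetical.

```ruby
# Hypothetical ownership map: directory prefix => owning team.
OWNERS_BY_PREFIX = {
  "app/models/" => "team-core",
  "app/models/billing/" => "team-billing",
  "lib/" => "team-platform"
}.freeze

# Return the owning team for a path (longest matching prefix wins), or nil if unowned.
def owner_for(path, owners = OWNERS_BY_PREFIX)
  prefix = owners.keys.select { |candidate| path.start_with?(candidate) }.max_by(&:length)
  prefix && owners[prefix]
end

# List the files that no team owns; a CI check could fail when this is non-empty.
def unowned_files(paths, owners = OWNERS_BY_PREFIX)
  paths.select { |path| owner_for(path, owners).nil? }
end

paths = ["app/models/user.rb", "app/models/billing/invoice.rb", "script/cleanup.rb"]
unowned = unowned_files(paths)
```

In this example `script/cleanup.rb` has no owner, so a check built on `unowned_files` would flag it before the change is merged.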
00:22:31.520 Next, I'd like to discuss upgrading Rails.
00:22:38.720 We've been using Rails for a long time; we started with Rails 1.
00:22:46.640 Our original CEO, Mikkel, worked with DHH before Rails was created.
00:23:02.760 Regarding upgrades, tests are invaluable in predicting what may break during an upgrade.
00:23:15.200 That's why we run tests on both the current version and the version we wish to upgrade to, which helps avoid regressions.
00:23:27.920 If a test suite fails on the new version at first, that’s acceptable; but once it turns green, it cannot go back to red, which protects everyone’s work during the upgrade.
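The rule that a suite may not go from green back to red on the upgrade target works like a ratchet. Here is a minimal Ruby sketch of that check, assuming the CI system stores the set of suites that have already passed on the next Rails version. The method name, suite names, and data shapes are hypothetical and not taken from Zendesk's tooling.

```ruby
require "set"

# Decide whether a test run on the next Rails version is acceptable.
# known_green: suites that have already passed on the next version.
# results: { suite name => true (passed) / false (failed) } for the current run.
def evaluate_next_version_run(known_green, results)
  regressions = results.select { |suite, passed| !passed && known_green.include?(suite) }.keys
  newly_green = results.select { |suite, passed| passed && !known_green.include?(suite) }.keys
  {
    ok: regressions.empty?,
    regressions: regressions,
    known_green: known_green | newly_green.to_set
  }
end

known_green = Set["models/user_test"]
results = {
  "models/user_test" => true,
  "models/ticket_test" => true,    # newly green: added to the ratchet
  "controllers/api_test" => false  # never green yet: tolerated for now
}
first_run = evaluate_next_version_run(known_green, results)

# Later, a suite that had turned green fails again: this run is rejected.
second_run = evaluate_next_version_run(first_run[:known_green], { "models/ticket_test" => false })
```

The first run is accepted even though one suite is red, because that suite had never passed on the new version. The second run is rejected because `models/ticket_test` had already turned green.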
00:23:44.640 Throughout 17 years of upgrading Rails, we faced many pain points.
00:23:56.480 Too much metaprogramming was a significant issue. While I adore this book that explains Ruby, I found maintaining Rails code with excessive metaprogramming incredibly challenging.
00:24:06.640 In the 2010s, we all embraced this concept until we hit a wall during the upgrades to Rails 5.
00:24:21.840 Attempting to upgrade to Rails 5 broke most of our tests; everything seemed to explode.
00:24:32.920 At that point, we decided to adopt a stricter approach. For years, we had been fixing issues with monkey patches, doing the bare minimum to continue building features.
00:24:52.880 To maintain a well-maintained application, we had to go into full cleanup mode. The transition from Rails 4 to Rails 5 took two and a half years.
00:25:07.920 We meticulously cleaned our code, removing as much magic as possible, which was intensive work.
00:25:21.520 During the last year, we documented the evolution of test failure rates as we refined our approach.
00:25:38.440 What was challenging? One challenge was strong parameters.
00:25:53.520 Initially, we implemented these slightly before Rails adopted them, necessitating updates across our endpoints to conform with Rails.
00:26:06.720 Another significant challenge arose with how dirty attributes functioned; we relied heavily on this in callbacks. The implications of these changes meant that endpoints broke or returned ambiguous results.
00:26:21.760 While the solutions weren’t incredibly complicated, they required extensive communication with countless teams worldwide.
00:26:36.560 For us, it took at least six months to update effectively. The bright side, however, is that cleaning magic out of the application has made things significantly easier since then.
00:26:49.760 My concluding thoughts on this matter are that often, we faced the disadvantage of being first movers.
00:27:05.760 We tried building something ourselves before the community established solutions. Sharding, for instance, is something we've used for over 10 years.
00:27:20.160 We still have our unique logic for sharding, but one of our objectives now is to align with the standard that Rails has adopted since version 6.
00:27:35.840 I advise everyone to join the community. If you think a feature in Rails is deficient, create an issue, engage with others, and avoid working in a silo.
00:27:49.920 In conclusion, I'd like to leave you with a few thoughts.
00:28:05.760 I previously gave a presentation focused solely on scaling Rails performance but later realized that performance isn’t always the most pressing issue for companies.
00:28:20.800 Performance problems are obvious: a slow Rails app is easy to notice and often ends up breaking.
00:28:34.160 A poorly managed engineering organization, however, is much harder to detect.
00:28:49.200 What defines a well-run engineering environment can even be debated. In my opinion, knowing history helps you appreciate the feeling you get when you speak with a seasoned engineer.
00:29:02.720 When you have someone who has been around longer recognizing a problem, it feels empowering to both of you.
00:29:18.320 But the past can also be perilous. Beware of relying on prior knowledge simply because it's tradition.
00:29:32.000 There’s an anti-pattern called the 'Frozen Caveman' that I discovered while researching for this presentation.
00:29:48.080 So, what’s the takeaway? Listen—listen to the history of applications and programming languages, but maintain an emotionally neutral perspective, if possible.
00:30:03.840 I’m happy to be here today because we're currently hiring!
00:30:43.559 Hi, thanks for your speech! I have a question: what's the current version you're running and how lengthy was the last update?
00:30:55.200 I honestly am not sure; I think it's around Rails 7.1 or so?
00:31:05.240 Yes, we're typically caught up. After moving to 5, we are generally on track, often just one minor version behind.
00:31:14.240 Shopify runs its large Core Monolith on Rails' main branch. Why not do that instead of running two branches?
00:31:29.880 Well, for a time, we didn’t want to jump too far ahead, and in the past, we were running version 4 while Rails 6 was already out.
00:31:41.400 Now, it makes more sense to adopt the main branch as we’re caught up.
00:31:50.480 Hi Cristian, that was an amazing talk! My quick question: I noticed much custom code and libraries at Zendesk.
00:32:02.520 How do you tackle onboarding new engineers? How long does the process take, and could you share two or three best practices?
00:32:14.320 That’s a valuable question. It varies significantly. Depending on which team someone is joining, the code base they’re dealing with may be quite different.
00:32:26.960 For example, if you work in a smaller service, the learning curve is more manageable. However, working in the monolith poses challenges.
00:32:41.679 In essence, you can join a specific team, grow within that framework, and remain there, although switching teams is also an option.
00:32:58.640 Yes, I've personally been in four or five different teams over my time at Zendesk.
00:33:09.360 Regarding event architecture, do you employ techniques like event sourcing or CQRS?
00:33:15.960 Yes, we are moving towards more event sourcing for inter-service communication.
00:33:27.960 Thank you for the talk! How long does it take to run your 550,000 tests on CI?
00:33:39.680 It varies, but it typically takes around 15 to 20 minutes.
00:33:46.480 We run them in groups of 2,000 tests for better performance.
00:33:54.920 Do you have system tests or integration tests? How do you manage these?
00:34:02.479 We have a set of feature tests that run alongside the unit tests. Before deploying, we stage our environment and run API and browser tests.
00:34:12.000 We also run canary tests to check for issues before deploying more widely.
00:34:23.680 Thank you for the presentation! Considering you have numerous gems, how do you manage versioning?
00:34:30.960 Yes, we manage several gems in our code, and when conflicts occur, we typically merge the pull request from the first one.
00:34:39.720 In recent cases, we focused on collaborative efforts to manage potential conflicts.
00:34:46.560 The last question: how often do you deploy to production?
00:34:58.720 We deploy about two to three times daily. These deployments affect the entire platform.
00:35:06.920 Thank you for your inquiries, everyone!