Mega Rails

by Jack Danger Canty

In his talk "Mega Rails" at GoGaRuCo 2012, Jack Danger Canty from Square discusses the challenges and solutions for managing large Rails applications as they scale. The primary theme revolves around the evolution of Rails apps from small beginnings to complex structures that often lead to technical debt and team dissatisfaction. Canty emphasizes two main takeaways:

- Rails Application Growth: As a Rails app matures, it often becomes a 'monorail', a large, monolithic codebase that is difficult to manage. He illustrates this growth by describing the transition from a simple application to one with numerous controllers, slow tests, and low developer morale.
- Management Issues: Canty highlights the lack of ownership among developers as a key issue when working with large codebases. With ownership, teams can better manage their products, experience satisfaction from their contributions, and respond to incidents effectively.

In discussing specific pitfalls, Canty warns against:
- Tying email templates to the API backend: This creates a tangled codebase that slows down feature development and complicates the deployment process.
- Using the main datastore for analytics: It can be detrimental when your primary application database also serves analytics needs, as it affects performance and manageability.
- Over-reliance on Rails defaults: While Rails is great for rapid development, it often leads to shortcuts that generate technical debt, such as putting too much functionality into controller methods without thinking of long-term maintenance.

Canty presents examples from the challenges faced at Square, urging developers to adopt a three-step process to combat these issues:
1. Build Service Interfaces: Internal APIs should be established to facilitate easier separation of components as the application grows.
2. Prepare for Team Scaling: When the time comes, replicate or clone code to retain functionality without modifying core systems directly.
3. Manage Data Better: Split databases for different functionalities to improve application performance and responsiveness.

Ultimately, Canty advocates for building conventions around code ownership and modularization to maintain developer happiness and cultivate healthy agile practices in large Rails environments. He concludes by emphasizing the importance of shared experiences among developers to navigate the complexities of large codebases efficiently.

00:00:08.639 Our last speaker of the day is Jack Danger Canty. He works at Square. He's a part-time Rubyist and a full-time feminist. He moved to San Francisco last year and is firmly ensconced in the Mission. I almost put him in at the first talk of the day today as punishment for breaking everyone at the party last night, but I figured that might be too mean. Anyway, he's here to talk about some of the things he's learned at Square for building large, complicated applications. Thanks, Jack!

00:00:41.120 Thank you, Josh. My name is Jack Danger, and I work at Square. Did people go to the party last night? Anyone? You're welcome, and I'm sorry! Just to get that out of the way—also, my slides are going to be a little dark. I apologize; I'll put them online right after. As I said, I work at Square, and we have a lot of problems, but also hopeful solutions to those problems. We work with iOS, Android, networking, and hardware. We have electrical engineers, and we rack our own hardware in data centers. We do a lot of high-availability work.

00:01:05.960 Basically, if you're good at anything computer-related at all, we would really like you to join the team. We're growing fast and hope to grow even faster! Today I'm giving a talk I originally called 'Mega Rails,' as in when Rails gets very large, but I've renamed it to 'Monorails' because, as I've done more research and talked to more people who are in the same situation, the term we all use for monolithic Rails apps is 'monorails.' It's consistent, which is surprising. The progression of a Rails application from its humble beginnings with 'rails generate something' or 'script/generate something' to where it ends up follows a pretty common pattern.

00:01:39.079 I'm going to walk you through that and show you what to avoid. If there's two things I want you to take away from this, it's: one, that Rails applications do go in a particular direction, and you can avoid it if you're thoughtful; and two, that the problem is primarily about ownership—this is what you should pay attention to. So, if you'll indulge me, imagine you have not a monorail, but a nice little greenfield Rails application with a few controllers. People can sign up, and you just got featured on TechCrunch. Now, let’s fast forward three years into the future—it's 2015, and you’ve just left a marathon night of rescuing your monorail.

00:02:28.040 You have this Rails app now in multiple data centers, and in each data center, you've got a $30,000 MySQL box. I'm not even exaggerating here. If you get a Fusion IO card because the I/O load is too high for one box, you can squeeze more life out of your MySQL database, but that card costs 15 to 20 grand. So your MySQL database machine is as expensive as a whole cluster, and you now have around a thousand controllers. If you haven't worked on a big application, you might think that number is too high, but a lot of Rails applications now are well in excess of a thousand controllers and who knows how many models.

00:03:08.159 You have test coverage, but you don’t know how much because there's no way you could run all your tests; it would take seven years, or most likely, it would segfault halfway through. There's no way to know how covered it is. More importantly, you have sad developers. Your 22 developers come to work each day and talk about how they're not sure they want to work at a big company anymore because you have 30 employees. They want to work on something nice and small, where their efforts have meaning and where they can actually make a difference.

00:03:56.519 You've got a slow test suite. Your tests are perhaps exhaustive, but when you have a critical issue that needs to be fixed in production, it's slow and hazardous to get it out there. Running your test suite is just a gigantic pain; it might even be somebody’s full-time job. You might think that you can avoid experiencing this—that this isn’t going to happen to you. I don't think that's the case unless you've been through it a couple of times. I think I would probably have fallen into this trap again. But before I explain what I think the problem is, let’s take another look at that first day of your greenfield application and the decisions that Rails makes.

00:04:40.080 There's a clear pattern in terms of the defaults you go with and how you organize things when you're getting the application up and running. If you still think this really couldn't happen to you: these are great companies, full of brilliant people, and they've all run into the problem. Each one of these companies has survived a monorail; it grew, they managed it, and eventually they split some services out of it. Every company did this in a slightly different way, knowing that they had to.

00:05:24.960 Most of these companies actually still have one giant monorail that they're trying to recover from. The heroes of the organization, the people who would be writing compilers otherwise, spend their days pulling pieces of Rails 2 code running on 1.8.7 out into some sort of service that is manageable. A sidebar: this is not a talk about 'Can Rails scale?' That’s not a question you should ask. Asking 'Can I get a lot of customers on Rails?' is just fine—Heroku, more dynos, please! Ta-da, you’re done! Rails just scaled itself. That’s not the problem you have.

00:05:51.239 So let’s avoid that question. Real problems you have with scaling are scaling your data, scaling your code base, scaling your customers, and scaling your feature count. The data aspect is mostly a solved problem. People have been addressing this for years, and you can too; it’s really a matter of getting familiar with your data, understanding it, finding out what it likes, what it hates, what shape it needs to be in, and where it wants to live for you to do the kind of things you need to do with it. It’s hard, but totally possible.

00:06:51.440 Scaling your codebase is something GitHub fixed. When I say scaling your customers, I don’t mean marketing efforts to get customers; I mean managing them. This is primarily about communication, engaging with your customers, and logging so you can explain to your customer what just happened to them, their account, and then scaling your feature count. Rails is great at this. You want a new feature? You create a new path with a controller of the same name, a model of the same name, and a database table of the same name—and it works! It’s a Rails engine taking a vertical slice and throwing it right in. It’s wonderful; Rails is better at scaling your feature count than most things.

00:07:28.600 What we don't talk about is scaling your developer headcount. That's the one thing Rails has no default way to manage. There's no tool off the shelf that lets us do this, and there's no pattern that we've identified in our community that helps us with it. So, let's go back to the beginning. This is you on day one of a greenfield project. You have one product, implemented with one application, and probably one developer. Maybe four, but probably one, and that's great. It works fine.

00:08:13.759 Then you get another developer who joins the team, and you say, "Hey, commit code on your first day to our app—that's where our code is!" So she sits down, commits code, and your pace doubles. In the early days, this is a great pattern, and it’s how you should start. You need an admin interface; add a nested set of controllers under admin, and you’ve got a bunch of admin views—that’s so great! Only allow a few users in there.

00:09:04.000 Then you need some analytics to track trends over time. Well, just put that in the admin controller views. You just want to have the graphs right there, so you write it into the view. Rails can do anything—let’s make it do everything! Now, that sounds like I’m saying that tongue-in-cheek, but let’s be real: you wouldn't know if your product was worth spending your finite time on Earth building unless you got to a point where you could test it in the market. If you can't get there, then you have no idea if you're wasting your time, so Rails helps you get there quickly with a product you can keep building on.

00:09:55.280 The reason we have monorails is because people reach a point where they think, "This is a little technical debt, but it’s good enough—let’s keep going." You don’t have to do a rewrite when you realize you can still keep going in this direction. Rails can do anything; we make it do everything.

00:10:37.880 This will work fine even if there’s technical debt. You sub-class Action Mailer for no reason, because you only do it once, and then you make a bunch of methods that just copy-paste one into the other whenever you want to send a new email. This is not good software design, but it works fine. You get seven or eight methods in there sending seven or eight templates, and they’re only ever called from one place in your code each. So, you could probably just put it there instead, but this works, and you have email support—it’s going fine.

00:11:36.000 Now, the user model is a little hairy because you’re throwing a bunch of stuff there. But when your users suddenly have phone numbers, you add a field to the database and have some validation on that. Maybe they have a Twitter account or a Facebook ID; you’ve got that there too. Now you have 40 columns in your users table, which is hard to manage, but all the data is right there. It’s convenient—it’s so easy to add modules or methods inside the user.rb to manage the localized data there. This is the most convenient way to place the data and the easiest way to get a feature out the door.

00:12:13.640 If you’re still in the phase where you’re testing whether your product should even exist, that’s the right place to put it, and Rails makes it very easy. Let’s say you finally decide to take the time series work out of your admin views and put it into some sort of Ruby module. Congratulations! You extracted it into lib, and now you’ve got some Time Series generator engine—that's great! If you work at an amazing company where you have specs in spec/lib—sidebar: if you’re at a company that doesn’t let you write tests for everything, you should say, "I write tests for everything or I quit!" This is for your benefit.

00:13:38.080 So, we're up and running with our Rails application—an obligatory cat slide. Most of your features are there. They’re not all organized the way you want, but it’s working, and it’s actually bringing in some revenue. So, what is the problem? The problem is that Rails is optimized toward the beginning experience. It’s optimized for getting up and running quickly. All good software has trade-offs. Bad software tries to do it all. Take MongoDB, for example—it’s great for some things but not for others.

00:14:42.960 The stuff it is great at is amazing, whereas Postgres is generally good for most things. Your software, in this case Rails, is optimized to get you up and running quickly, taking your first 30 days into account and saying, "Let's make that magical." It does a fantastic job at that, but it does not optimize for three years in; it is decidedly not optimized for that stage. I'm going to walk through just a couple of examples of how Rails can go off track from the beginning. If you've ever worked on giant enterprise projects, they might have picked the right column, but Rails will always choose the left.

00:15:41.760 You begin with one database and you put everything there, because you need to do joins or something. The alternative is many databases, where you have to connect across different ActiveRecord connections and manage multiple thread pools to connect across however many machines. The problem is that even with a single database, if you've seen how Rails does its queries when you do eager loading, it's going to select a bunch of records from one table, grab those IDs, then go query another table with those IDs and grab another set of records.

00:16:18.800 Sometimes it will do joins, but often it just performs eager loading with those multiple queries. So, if you're hitting the database multiple times anyway, why couldn't the first query be in one database? You get those IDs and then look up user locations sorted by distance from a certain point in Solr or Redis or some other database. This isn't hard to do; it’s very trivial—you just can’t rely on how ActiveRecord wants to behave.
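
The two-step lookup described here can be sketched in plain Ruby. This is a minimal illustration, not code from the talk: USERS_DB and LOCATION_STORE are hypothetical in-memory stand-ins for a primary SQL database and a second store such as Solr or Redis, and every method name is an assumption made for the example.

```ruby
# Hypothetical stand-in for the primary database (assumption for illustration).
USERS_DB = [
  { id: 1, name: "Ada",      active: true },
  { id: 2, name: "Grace",    active: true },
  { id: 3, name: "Linus",    active: false },
  { id: 4, name: "Yukihiro", active: true }
].freeze

# Hypothetical stand-in for a second store keyed by user ID: user_id => [x, y].
LOCATION_STORE = {
  1 => [0.0, 5.0],
  2 => [1.0, 1.0],
  3 => [0.5, 0.5],
  4 => [3.0, 0.0]
}.freeze

# Query 1: ask the primary store a question and keep only the IDs.
def active_user_ids
  USERS_DB.select { |user| user[:active] }.map { |user| user[:id] }
end

# Query 2: hand those IDs to the second store, sorted by distance from a point.
def ids_by_distance(ids, from:)
  ids.select { |id| LOCATION_STORE.key?(id) }.sort_by do |id|
    x, y = LOCATION_STORE[id]
    Math.hypot(x - from[0], y - from[1])
  end
end

# Stitch the two result sets together in application code.
def nearest_active_users(from:)
  users_by_id = USERS_DB.map { |user| [user[:id], user] }.to_h
  ids_by_distance(active_user_ids, from: from).map { |id| users_by_id[id][:name] }
end
```

Calling nearest_active_users(from: [0.0, 0.0]) filters in the first store and orders in the second, which has the same shape as the multi-query eager loading described above, just with the second query pointed at a different store.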

00:17:04.520 I don't mean to be a hater, but I've run into a lot of pain using MySQL. If you have 10,000 records, MySQL and Postgres are both fine at that count: everything's going to happen in less than a second, and you won't notice a difference. However, Postgres handles some scenarios really well that MySQL does not. For example, there's a guy, Nolan Evans, at Square who was working on an analytics project where he had collected probably a billion records at one point, and he sent out an email to the engineering team one night saying he's starting role plan 2280. It was a JIRA ticket to add a new column and change the default type of another column, the kind of change you would dread in MySQL.

00:18:04.800 He remarked that it took 28 seconds! This was due to the fact that he was using Postgres, which doesn’t require an entire table rewrite to alter some of the default settings on columns or to add a new column. MySQL, however, needs to copy all that data over, rewrite it, and block the entire database the whole time. You have downtime; it’s tough. Now, Action Mailer—is it bad? No, it’s fine! It’s a nice templating system.

00:19:10.480 However, it has no place in most applications. You should have an Action Mailer application that you maintain separately, maybe hosted in Sinatra or something, to manage sending emails. As Copen was saying earlier today, there should be a mail service. At some point, you might realize that you’ve got this big app, and you don’t know what to pull out. Then you see everything related to email template generation is just baggage—it doesn’t need to be there.

00:19:57.440 Part of the reason is that you have two kinds of HTML templates: one for the web and one for the crazy HTML that works inside emails. Getting somebody to maintain both of those is tough. Those should be different jobs, and anybody working in HTML email should be given a break. Email should be handled in a different process or project with its own deployments, so lightweight copy changes can go out at will, without needing to deploy your API at the same time.

00:20:57.920 Another bad idea is having lots of data in the Users table, which during your first 40 days really seems like a good idea. You may well present lots of data about a user, but it should not live in the account record (which might be called Person, or whatever you name it to identify a human being in the real world). That record should contain some sort of unique ID and the minimum necessary authentication data, and that's it. Everything else is a feature that relates to that user, and if it grows, it will probably be extracted into a service.

00:21:45.440 Next: features in lib, with specs in spec/lib. It's great that they're extracted. If you have specs (and obviously you should have specs), then you're running the specs for this extracted library code every time you run your build, which is silly, because being in lib means it should be a dependency and not part of your core application. If anything is in lib, it needs to be extracted into a private gem. If it's in lib and does not have tests, it needs tests and then should be extracted.

00:22:36.000 Nothing should be in lib for longer than 10 days. If you have a file called something like object_patch.rb, I recommend deleting it, or adding it to Yehuda's list of bad ideas made for good reasons a while ago, but delete it after 2013. validates_presence_of and validates_uniqueness_of are for generating error messages, not for validating your data. If you have an application being developed on your laptop in development mode, you really just need the left column.

00:23:24.200 If you're dealing with anyone else's data, meaning any person who is not you or employed by your company, you need database-level constraints, because otherwise you'll have corrupt data and violate your users' trust. They've just given you something, or you're supposed to record something, and then you lost it. You dropped it on the floor! In particular, customers who pay you can call you up and ask, "What is going on?" A bad answer would be, "I don't know; we lost your data!" Logging what your application does is not part of the defaults. Tom or David did a great job this morning discussing how Rails logs what it does.
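
To illustrate why an application-level uniqueness check is not the same as a constraint enforced by the data store, here is a small simulation in plain Ruby. It is a sketch under stated assumptions: EmailTable, UniqueViolation, and simulate_race are hypothetical names invented for this example, and the "race" is replayed deterministically (both checks run before either write) rather than using real concurrent requests or a real database.

```ruby
# Raised by the store itself, the way a database raises on a UNIQUE violation.
class UniqueViolation < StandardError; end

# Hypothetical in-memory table. enforce_unique: true plays the role of a
# database-level UNIQUE constraint (assumption for illustration).
class EmailTable
  attr_reader :rows

  def initialize(enforce_unique:)
    @enforce_unique = enforce_unique
    @rows = []
    @mutex = Mutex.new
  end

  # What an application-level uniqueness validation does: read, then decide.
  def taken?(email)
    @rows.include?(email)
  end

  # The write. Only the store can check and insert as one atomic step.
  def insert(email)
    @mutex.synchronize do
      raise UniqueViolation, email if @enforce_unique && @rows.include?(email)
      @rows << email
    end
  end
end

# Two requests both validate before either one writes.
def simulate_race(table, email)
  passed = [!table.taken?(email), !table.taken?(email)]
  passed.map do |ok|
    next :invalid unless ok
    begin
      table.insert(email)
      :saved
    rescue UniqueViolation
      :rejected_by_store
    end
  end
end
```

Without the constraint, both writers pass validation and both rows are saved, leaving a duplicate. With it, the second write is rejected by the store even though its validation passed.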

00:24:37.760 Your application also needs to log what it does because you need to know what’s actually happening. Log every significant action—this does not happen by default! Here’s one: you need to analyze your data. In the beginning, sure, you can do that in your database; it’s not too hard to make SQL perform multi-table joins and process time series. You extracted that time series generator engine into lib and now a private gem.
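
The advice to log every significant action could look something like the following. This is a minimal sketch using only Ruby's standard library; AuditLog, its record method, and the field names are assumptions made for illustration, not anything shown in the talk.

```ruby
require "json"
require "logger"
require "stringio"
require "time"

# Hypothetical helper that writes one JSON line per significant action.
class AuditLog
  def initialize(io = $stdout)
    @logger = Logger.new(io)
    # Emit only the message; the timestamp lives inside the JSON payload.
    @logger.formatter = proc { |_severity, _time, _progname, msg| "#{msg}\n" }
  end

  # action:  a short name for what happened, e.g. "refund.issued"
  # details: any extra fields needed to explain the event to a customer later
  def record(action, actor_id:, **details)
    entry = { at: Time.now.utc.iso8601, action: action, actor_id: actor_id }.merge(details)
    @logger.info(JSON.generate(entry))
    entry
  end
end
```

For example, AuditLog.new.record("refund.issued", actor_id: 42, amount_cents: 500) writes a single machine-readable line to standard output. One line per event keeps the log easy to search when a customer asks what happened to their account.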

00:25:36.560 The issue is that the data may be in the wrong shape for your needs, and analyzing it interferes with what's happening on your site at the moment. You should probably move the analysis away from your main database, either onto a replica that receives your data or via ETL, which means Extract, Transform, Load. Xavier taught me at Square everything we needed to know about how to not perform functions well, and he fixed some of our habits. If you have questions about data, talk to Xavier.
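
As a rough illustration of the Extract, Transform, Load idea, the sketch below reshapes per-payment rows into one summary row per merchant per day and writes them to a separate store. Everything in it is hypothetical: the PAYMENTS data, the field names, and the use of a plain Hash as the analytics store are assumptions chosen to keep the example self-contained.

```ruby
require "date"

# Hypothetical rows as they might sit in a transactional payments table.
PAYMENTS = [
  { id: 1, merchant_id: 10, amount_cents: 500,  created_at: "2012-09-14T09:15:00Z" },
  { id: 2, merchant_id: 10, amount_cents: 250,  created_at: "2012-09-14T17:40:00Z" },
  { id: 3, merchant_id: 10, amount_cents: 1000, created_at: "2012-09-15T08:05:00Z" },
  { id: 4, merchant_id: 20, amount_cents: 300,  created_at: "2012-09-14T12:00:00Z" }
].freeze

# Extract: read from the source (an array here; a replica in practice).
def extract(source)
  source.map(&:dup)
end

# Transform: one fact row per merchant per day, ready for trend queries.
def transform(rows)
  grouped = rows.group_by { |row| [row[:merchant_id], Date.parse(row[:created_at]).to_s] }
  grouped.map do |(merchant_id, day), group|
    {
      merchant_id: merchant_id,
      day: day,
      payment_count: group.size,
      total_cents: group.sum { |row| row[:amount_cents] }
    }
  end
end

# Load: write the reshaped rows into a separate analytics store.
def load_into(warehouse, rows)
  rows.each { |row| warehouse[[row[:merchant_id], row[:day]]] = row }
  warehouse
end
```

Running load_into({}, transform(extract(PAYMENTS))) produces three summary rows. Queries for charts then read the small pre-aggregated store and never touch the transactional tables.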

00:26:51.360 He's amazing! He set us up with replicas, star schemas, and methods to take your data out of the transactional database and put it into a form that works for querying, where you can group things while evaluating trends. When I joined about a year ago, I saw this amazing admin interface that had been built before I came on board. It had lots of trends, charts, and graphs, with so much data coming from the main database.

00:27:40.080 We couldn't have any delay at all, not even the delay of a replica. One day, we deployed some updates, and our admin page was on a completely different host. We don't mind a little downtime when deploying; our Passenger instances just go down for a few seconds. However, they connect to the same database as our main code, and we were ramping back up from the deployment.

00:28:32.280 One individual decided he needed to refresh that page, so he hit command-R. Nothing happened because Rails has a slow boot-up process. Command-R again. He must have played a lot of video games as a kid because he kept hitting it, maybe 60 times! Eventually, when the Rails instances booted, they all ran the most expensive page on our entire site simultaneously, which allowed this one guy to take us down. That's a significant issue: a single user hitting refresh should not be able to take you down.

00:29:26.760 There’s a reason why these Rails applications hit a point and fail. Mel Conway said, "Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations." Put another way, the shape of your people defines the shape of the products they create.

00:30:13.960 All these examples I’ve given you were the normal growing pains of an app growing up. Use new data types, new databases, clean things up, move stuff around, and tech problems are resolved in ways we’re getting better at, but they aren't the real issue. There's an emotional problem at hand that I think we, as a community, haven’t addressed very well—this problem is ownership.

00:30:58.560 If your team is responsible for building something, then your team’s success should be dependent on the success of that product. If you produce something that doesn’t do what it was supposed to do, then you should know that was your failure—own it and adjust it. When you stay up all night finishing a feature, you'll do it because you know that your combined effort produces success. The issue arises when your product's quality is intermixed with unrelated tasks handled by others, and you have no control over it.

00:31:42.960 Now, Rails can handle this; it can grow, split, and allow people to work without stepping on each other's toes. But that’s not how we’re doing it in Rails right now. There’s no clear definition of your responsibilities, and a little bit of that is because in any Ruby object, you can do anything to any other Ruby object.

00:31:57.799 Rails can totally do this. If you split your big team into smaller teams, chances are you're going to have a monorail, a large piece of code you can't manage. When you make that HR management decision, you are left with unwieldy code that is not divided by role, task, feature, or section. To close this talk, and to close this conference, I'd like to walk you through a three-step process that every large application must go through.

00:32:41.960 Most of the companies I've talked to have done a great job taking ownership of pieces of the code in a way that they did not when their Rails app was young. While it's three steps, only the first one works the same for everybody. The first step is to build service interfaces inside your app. Build internal APIs, not for everything, but for anything you think maybe someday shouldn't live here anymore.

00:33:44.760 The easiest example is email: instead of calling ApplicationMailer to send, or Notifications to deliver, or whatever it is, give the minimal information necessary to perform the task. Email a person using a template and an arbitrary dictionary or hash of data; that's all you need. Then you'll reach a certain point when you need teams and they'll say, 'Oh, we need services. We should cut some things out,' and some team says, 'We'll take everything on the other side of that interface.' Nothing you do will change.
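
One way to picture such an internal interface is sketched below. This is not Square's code: MailService, InProcessBackend, RemoteBackend, and the "welcome" template are hypothetical names invented for illustration. The point it tries to show is that callers pass only a recipient, a template name, and a hash of data, so the implementation behind the interface can later move to another team without any caller changing.

```ruby
# Hypothetical internal API: callers never touch mailer classes or templates.
module MailService
  class << self
    attr_writer :backend

    def backend
      @backend ||= InProcessBackend.new
    end

    # The whole contract: who, which template, and the data to fill it in.
    def deliver(to:, template:, data: {})
      backend.deliver(to: to, template: template.to_s, data: data)
    end
  end

  # Day-one implementation: renders and "sends" inside the main app.
  class InProcessBackend
    TEMPLATES = { "welcome" => "Hi %{name}, welcome aboard!" }.freeze

    attr_reader :outbox

    def initialize
      @outbox = []
    end

    def deliver(to:, template:, data:)
      body = format(TEMPLATES.fetch(template), data)
      @outbox << { to: to, body: body }
      :queued
    end
  end

  # Later implementation: the same interface, owned by a separate mail team.
  class RemoteBackend
    attr_reader :requests

    def initialize
      @requests = []
    end

    def deliver(to:, template:, data:)
      # A real client would send this payload to the mail service (assumption);
      # here it is only recorded so the example stays self-contained.
      @requests << { to: to, template: template, data: data }
      :queued
    end
  end
end
```

A caller writes MailService.deliver(to: "ada@example.com", template: :welcome, data: { name: "Ada" }) on day one and on day one thousand. Swapping MailService.backend is the only change needed when the email code moves out of the main application.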

00:34:24.720 The second step usually looks something like this: either you copy the app via git clone and delete the parts you don't need, or you re-implement the functionality from scratch. Importantly, there's a line on your side of which nothing changes; you are now counting on another team's work. In reality, due to the way we write Ruby on Rails applications, you'll likely have to modify files, but that's not how it should be: you should only need to delete, move, or add files, not modify them, in order to pull features apart.

00:35:00.360 Step three, you'll have to make adjustments to your data. Maybe in the beginning, both talk to the same database, and then you replicate it across and have it connect to its own database. How you manage that entirely depends on you and the unique approach each company takes to solve it opportunistically.

00:36:02.080 We should recognize the human cost of not understanding the benefits of these service interfaces early, and know that the moment we change our group structure, we'll find ourselves in trouble. You may end up spending your time maintaining code you don't own, just debugging it for an extended time.

00:36:58.960 Square sponsored the party last night—not just because we like recruiting people, though we need help. We sponsor events because we struggle to write Ruby apps together, forming bonds through shared experiences. We deserve to go into work each day and be really happy. Ruby was optimized for happiness by Matz, and Rails was optimized for happiness by DHH. The kind of Rails apps we’re building now aren’t optimizing for happiness, and that’s why we need to build some conventions around that.

00:37:44.600 So that we can continue to enjoy working every day on thousand-controller projects and know how to tackle these challenges together.

00:38:20.799 My name is Jack Danger. I work at Square, and I thank you for your time!