Oh S***: How to bring a big Rails website down (and how not to)

00:00:14.679 Hi, okay, my talk is called 'Lessons in Continually Breaking a Huge Website.' Can everyone hear me? Good, okay, that's good. My name is Tim Morgan.

00:00:20.279 I'm a senior Rails developer at Scrib. I was hired at Scrib as kind of like a Rails guru because I knew the API really well, but I didn't really have a lot of experience in scaling. That was kind of a trial by fire for me at Scrib, and I learned a lot of things the hard way.

00:00:33.920 This talk isn't going to focus on security vulnerabilities or anything like that. Instead, I'm going to share what I've learned as a Rails developer for Scrib. I believe that as engineers, we learn the most when we break things. You know, when we're knee-deep in the debugger, that's when we really start to understand Rails and the software we're working with.

00:00:46.760 A bit about Scrib: Scrib is a really large Ruby on Rails website. Here's our Quadcast page. We handle about 40 million users per month in traffic, serving 36 megabytes of traffic per second. Every day, 50,000 new documents are uploaded into Amazon S3. My point isn't to dazzle you with statistics; rather, I want to highlight that when you have a website this big, it becomes incredibly easy to break it if you don't fully understand what you're doing and what Rails is doing behind the scenes.

00:01:18.920 I'll take you through some of the more infamous mistakes I've made. These mistakes are almost always problems of scale. It's something you wouldn't even think twice about if you had, say, 3,000 users per day on a hobby site. But once your traffic explodes, these issues can really come back to bite you. Almost all of these mistakes center around how Rails interacts with the database. At Scrib, we use MySQL, so there will be some MySQL-specific content, but I will also try to generalize it a bit and include information on PostgreSQL and Oracle for those using those databases.

00:01:43.320 So, my most devastating mistake ever at Scrib, which I'll discuss in just a moment, resulted in our DBA giving me a few good pointers, some of which were expressed with a bit of shouting. Here's a log of our conversation from Campfire. Essentially, he told me to understand exactly what my code is doing. It's crucial to recognize that Rails can't always be relied upon to generate the best query for every situation. In fact, sometimes it can't even generate a good query. I hope everyone is familiar with SQL's EXPLAIN and EXPLAIN ANALYZE tools. If not, just prefix your query with EXPLAIN, and it will provide information on how the query would run, without executing it if it is slow. This is very useful, and you'll see examples of this in the upcoming slides.

00:02:29.239 To boil it down to a few points: every time you add a feature, you should run that feature on your site and check your query log. Unless it's the most trivial of features and doesn't touch the database, you should always review your query log. You need to understand what each query does, and if you suspect it may be a problem, use EXPLAIN on that query. Testing with a large database will frustrate you with how slow some parts of your code are, and believe me, your users will be frustrated too. It's a bit hypocritical for me to say this, as I don't always do it myself, but when I have, it has proven very useful.

00:02:51.280 With Rails 3.0, we're going to be a bit more agnostic with our ORM, and some of you may have used DataMapper. DataMapper has a really cool tool called DM Sweatshop, which allows you to create a fully populated production database using your factories with an arbitrary number of rows. This way, you can create a database larger than you would expect to have in production, and when you interact with your site, you might notice some crucial slowdowns, making you just as frustrated as your users.

00:03:07.680 Lastly, pay close attention to your indexes. Hopefully, everyone here understands the basics of indexes. For example, if you have a users table with a login column and you need to look up users by login, you need an index on that login column. Things get more complicated with more complex queries. Indexes are very database-specific; MySQL and PostgreSQL employ different strategies for using indexes. Therefore, it's vital to understand how your database utilizes indexes and what indexes you will need for different queries.

00:03:23.200 Now, let's begin with some anecdotes. The first topic has been discussed extensively, but I've added what I hope is a novel twist! It's about the problem with 'find' and 'batches.' We're discussing situations where you need to update every row (or many rows) in your table, but this isn't something you can do with a SQL UPDATE statement. You need to instantiate every row as an ActiveRecord object to perform some operation, like downloading from S3 and uploading to a different bucket.

00:03:36.280 The problem arises when you do `User.all;` you'll instantiate every single record in your database all at once, if you can even fit it in memory. It's still going to be very slow because, of course, it triggers a very slow query: `SELECT * FROM users;`. In Rails 2.3, we got the `find_each` method, which calls `find_in_batches`. This method grabs the first N rows from the table, looks at the ID of the last item in the batch, and uses that ID to fetch the start of the next 10 rows. We avoid using OFFSET in MySQL because it's slow—it performs a linear iteration over the table to find the N-th item, while using the primary key (which should have an index) allows for a b-tree lookup that is O(log n).

00:04:13.560 The issue arises with composite primary keys. If you have a table, like a group memberships table that joins a groups and memberships table, it doesn't require a surrogate key. User ID and group ID are enough to uniquely identify a row, and we often utilize composite primary keys. Rails doesn't handle composite primary keys well on its own, but there is a composite primary keys plugin. However, the `find_in_batches` method, which I've copied and pasted here, will break if you use composite primary keys. Not realizing this, I foolishly pushed my implementation to production, and Scrib went down.

00:04:56.520 During the fix, I discovered a few bugs. Notably, the table names weren't escaped, so if you have a table name that's also a reserved keyword, it will break. Lastly, these records are yielded within the scope, so if you supply your own scope, it will be overwritten.

00:05:24.280 The composite primary keys method allows you to pass a list of primary keys, and when you call `model.primary_key`, it returns an array instead of a string. The solution is to build a query generator that loads batches of objects in a similar manner to counting multi-digit numbers. For instance, if you have three primary keys, it works similarly to counting three-digit numbers.

00:05:43.000 The full solution is available at this URL. If you use composite primary keys and need to utilize `find_in_batches`, this is a gist URL that monkey patches the default `find_in_batches` implementation. There's some contention around monkey patching, so if you object to it, you can rename the method to avoid overwriting an existing one. This also resolves the two bugs I mentioned earlier, enabling you to use tables with reserved SQL keywords, and thus avoid issues while ensuring that your custom scopes are respected. Standard disclaimer: this may contain bugs, so be sure to test it before deploying.

00:06:51.120 The next topic is arguably my most infamous moment: the problem with `validates_uniqueness_of`. Some of you are laughing; I know some of you have experienced this before. It stands as a glaring example of why you should monitor your query log, explain your queries, and investigate even the simplest changes in the database. Here's a very abbreviated example from Scrib's user model. This is pretty standard stuff, until the requirement comes down the pipeline for case insensitivity in logins.

00:07:13.520 The requirement necessitated that if I signed up with one login and someone else used the same login but with a different case combination, they should be denied registration. We all know how to address this issue: set `case_sensitive` to false. I was working late one night, and I quickly added this change, pushed it out, and went home. The next morning, I discovered Scrib was down. Everyone on Campfire was upset with me because they managed to bisect the problem back to my commit. How could this have happened? Let's examine the SQL Rails runs for a user model validation. The core offender is the LOWER(login) portion, forcing MySQL to calculate the lowercase version for every user's login one by one. With over 10 million users, it took two minutes to execute that query, running it every time a user attempted to sign up, which was multiple times per minute. Before long, Scrib went down, and I had thoroughly angered my DBA.

00:08:03.760 There was, however, a silver lining to this experience—I learned something valuable about indexing. So, what’s the solution? If you're lucky enough to work with PostgreSQL or Oracle, you could create a functional index on LOWER(login). It takes a bit of time to create the index, but it effectively resolves the issue. Unfortunately, MySQL doesn’t support functional indexes.

00:08:16.880 In MySQL, however, it gets trickier. MySQL is case-insensitive for string comparisons by default, which complicates things. With `validates_uniqueness_of`, MySQL will make it case insensitive regardless of what you specified. Essentially, anywhere you use `validates_uniqueness_of`, it will implicitly treat the comparison as case insensitive. So, the solution is to use the BINARY keyword in MySQL. This keyword tells MySQL to treat the comparison as binary data instead of a string. It incurs no speed penalties, so it's a feasible workaround.

00:09:03.680 In practice, you need to modify `validates_uniqueness_of` to include this BINARY keyword when `case_sensitive` is true and leave it out when it is false. These situations could become convoluted, especially if different environments treat BINARY differently. It is essential to address this correctly. And this was not the only issue on that fateful day; this led to a separate bug as well. For some peculiar reason, I decided to use ORDER BY next to the login column. Consequently, this caused an excessive query similar to before.

00:09:47.760 As a result, I ended up committing the code. This would be the second time I broke Scrib that day. We have the whole procedure documented for solving this issue which you can also refer to online. Yet again, it's a monkey patch for `validates_uniqueness_of`; it's functional, but it doesn't assess whether you're using MySQL. The BINARY keyword exists in other databases as well, so it's prudent to only include it if your project is running on MySQL. The disclaimed version may also contain bugs, so be careful.

00:10:49.960 The next topic revolves around the confusion between using `delete` and `destroy`. This issue is more of a concern with Rails documentation than anything else. Rails documentation generally specifies how different methods operate. However, this area falls short when it comes to using `delete` and `destroy` with `has_many` and `belongs_to` associations. The behavior of these methods changes based on how you set up your association, which may seem disconnected from where you'd expect it to be.

00:11:09.720 To illustrate, I'll share a real-world situation that occurred at Scrib. We have a category model, which groups documents around a common subject. It is joined via a 'category membership' model, where a category has many category memberships and many documents through them. We set a `dependent: :delete_all` option. As most of you know, `dependent: :delete_all` is usually faster than `dependent: :destroy`, as it immediately performs a SQL delete.

00:11:34.520 One day, we needed to modify the behavior. When we delete a category, we wanted to tag all documents associated with that category. If there were five documents in an aviation category, we wanted to tag them as 'Aviation' before removing that category to maintain some context. It's a straightforward implementation: simply add a before_destroy callback to tag those documents. I thought I was clever, implementing it using a more complex SQL syntax so that I wouldn't have to iterate over and instantiate each row individually.

00:12:10.080 However, when the query ran, I quickly discovered it returned no rows. In essence, by the time we reached the before_destroy callback, all category memberships had already been deleted. This raises the question: Why would Rails remove associated objects before the destroy method is called? This behavior, admittedly frustrating, is due to how Rails handles the delete callback order. The fix becomes less obvious after this revelation and likely perplexes many developers. The key is repositioning your code: move it to the top of the association declaration.

00:12:53.360 The reasoning for this is that `has_many` calls do a bit of metaprogramming. When you call `has_many`, it invokes a before_destroy method that it generates on its own. The before_destroy hooks execute in the order they are defined. If your `has_many` appears before your before_destroy callback, it will trigger first. This may seem counterintuitive, but that’s how Rails functions. If you, like me, had no awareness of this behavior and attempted to directly delete the records after the callback by simply removing the dependent: :delete_all association, you may have been surprised to discover that when you call delete all subsequently, it won’t delete those category memberships.

00:13:51.360 The reason this occurs is due to the phrasing used. The delete_all method removes records from the association without actually deleting them. Rails interprets this as effectively removing the objects, even going so far as to set their foreign key to NULL. Hence, these now orphaned category memberships hang around in your database, taking up space even though conceptually they are detached. Essentially, you might be looking for functionality similar to ActiveRecord's delete_all method that performs the SQL delete directly, accomplishing precisely what you intended.

00:14:35.360 So, this awareness is crucial. The ActiveRecord.delete_all method is designed to perform one exact deletion and nothing else. Don't hesitate to embrace some LAI magic if you want to employ precise control over your database operations. By the way, a quick note: If you include a conditions hash when using delete_all, Rails may ignore the value you've suggested for the database column. Hence, you could end up inadvertently deleting unnecessary records; this happened to us with our credit cards table! Fortunately, at the time, there were very few credit cards in the system since we had just launched. We ended up having to inform users that they needed to enter their credit cards again.

00:15:15.760 This unfortunate event became more manageable with the understanding that you can't store credit card information directly—these had to be stored in an external vault. We were simply maintaining the IDs of the credit cards stored by a large enterprise vault company. Thus, users might have thought they were still in the system, but their records were gone.

00:15:39.680 This story was just a brief example of the challenges presented in Rails. Not only does it map out the flaws of documentation but reveals that Rails 3, with its abstracted query logic, intended to address these particular concerns. Lastly, I'm going to cover a very MySQL-specific topic, unfortunately. I apologize for anyone who doesn’t use MySQL, but here is where many of our problems stem.

00:16:53.600 The topic at hand is indexes. Indexes can be wonderful, yet they're a source of issues when they are absent. Database performance hinges prominently on implementing the right indexes. Yet many times, particularly newer Rails developers don’t appreciate what should be indexed and at which point it’s valuable. Additionally, as the database grows, introducing new indexes becomes more challenging. That is why it's crucial to think strategically and consider what indexes you will need early on while the row count remains low within your tables.

00:17:34.680 Let's say you own a scenario where your database might be affected by massive spikes in user interaction, resembling what happened with Scrib. When you are in a position where adding new indexes is no longer straightforward, you'll ensure that your indexes are effectively tailored for critical searches conducted by your application. For example, suppose you want to retrieve all user documents for a profile page.

00:18:08.760 To perform this lookup efficiently, you need to have proper index design in place. Without and optimized database structure, the typical 'delete' functionality can incur a lot of unnecessary processing while causing slowdowns. In addition, it may further weaken the performance of your database as you run more queries on this table. Therefore, you need to ensure you're correctly indexing the fields you rely upon beforehand; this observation may seem almost trivial, but I've seen it lead to headaches down the line.

00:18:40.560 Therefore, my recommendation is to add indexes at the onset of database design, prior to scaling. Outline the performance implications of growing your database. Should you notice performance hiccups, integration of these indexes may conflict with existing functionality. Consequently, when needed, force MySQL to use particular indexes through usage or force index statements.

00:19:22.400 In conclusion, if your user document table is very large, adding an index can lock the table for extensive amounts of time. Ah, and be mindful of indexing the appropriate fields. If you find you're stuck in a situation with extensive data, using MySQL's index hints can alleviate some burdens until a better arrangement is made. MySQL often selects reasonable indexes. Still, you may occasionally need to assist MySQL to select the best index to optimize your queries effectively.

00:20:00.140 Ultimately, with frameworks and databases growing more sophisticated and complex, understanding what queries your code is generating is essential. Unless the SQL is trivially obvious, it is essential to consider how behavior may change at scale. You should be using the EXPLAIN tool when appropriate. Test your code against heavily populated databases as well. While I don’t always implement this practice, it's insightful.

00:20:21.800 Additionally, using tools such as DM Sweatshop can assist in finding ways to create random test data that approximates actual sizes and behaviors in production effectively. Finally, understand your indexes and develop a robust awareness of how they function, because indexes are often misunderstood by newcomers in the field.

00:20:33.800 That concludes my talk. Are there any questions?

00:20:55.320 Have you done any experimenting with PostgreSQL using some of the same techniques?

00:21:04.520 Actually, Scrib runs on MySQL, but I use PostgreSQL for my smaller personal websites. I've encountered some issues with PostgreSQL; the functional index is nicer for some of the nitty-gritty problems. I generally don't run large databases like Scrib. However, there are many experts who can provide deeper insight into optimizations in PostgreSQL.

00:21:20.480 Have you looked into other database technologies, like NoSQL? Many of the challenges we're facing in Rails affect how we store references. When you have large tables with millions of rows, it can be tough to operate using a relational database like MySQL.

00:21:40.360 NoSQL databases are gaining traction, driven by innovations in companies like Facebook and Twitter, which have made significant advancements. Many agree that relational databases serve a very specific use case, and have become stretched in terms of general use. Typically, at Scrib, we primarily rely on MySQL for core data, but that could change in the future.

00:22:20.400 Thank you.