RailsConf 2014

Biggish Data With Rails and Postgresql

by Starr Horne

In the talk "Biggish Data With Rails and Postgresql" at RailsConf 2014, Starr Horne explores the challenges of managing large datasets in Rails applications using PostgreSQL. The presentation focuses on how traditional ActiveRecord conventions become ineffective as databases grow, leading to performance degradation. Horne shares his experiences from working at Honeybadger, a company that monitors errors and uptime, and emphasizes that common Rails practices can falter as data volume increases. The key points discussed include:

  • Understanding Biggish Data: Horne clarifies that biggish data doesn't involve complex solutions like Hadoop but refers to practical issues faced as simple applications scale up to handle large data volumes.
  • Performance Considerations: The speaker discusses how the performance characteristics of a database change significantly as its size increases. For example, typical pagination systems may fail at high page numbers due to inefficiencies in how queries are executed.
  • Query Optimization: Horne emphasizes the importance of refining queries to limit the number of touched rows. He introduces the EXPLAIN command in PostgreSQL as a tool for understanding query performance and suggests using range queries to avoid the pitfalls of inefficient pagination.
  • Infrastructure Improvements: The talk covers practical steps to enhance database performance, such as increasing RAM, optimizing disk I/O, and utilizing connection pooling to manage database connections efficiently.
  • Database Management Techniques: Key strategies for maintaining performance over time include regular database vacuuming, partitioning data to improve deletion processes, and implementing read replicas for intensive queries.
  • Backup Strategies for Large Datasets: Horne highlights challenges in backing up large databases and recommends using tools like WAL-E for incremental backups.

Ultimately, the speaker reassures the audience that while managing biggish data can be daunting, many issues arise one at a time and are solvable with practical approaches. The overarching message is that proper preparation and proactive measures can significantly ease the transition into handling large datasets, making applications more robust and performant.

00:00:17.119 Thank you all for coming. It's going to be an awesome talk, quite possibly the best talk you'll ever hear in your life. But, I actually have a built-in escape hatch here, so if things start going wrong, I might just bail.
00:00:28.160 A couple of weeks ago, I was having a bit of a panic attack about this presentation. As a programmer, my natural habitat is in some dark place, looking at a screen; it's not talking to a bunch of people under bright lights. What's up with this?
00:00:39.600 During this panic attack, I went for a walk to clear my mind. I was listening to my favorite podcast called Hardcore History. I highly recommend it. This particular episode was about the Punic Wars.
00:01:06.080 In case you haven't heard about the Punic Wars, let me give you a brief rundown. A long time ago, there was this place called Rome, and they decided they should take over the world, which they largely achieved. However, there was this one country that was a thorn in their side for a long time: Carthage.
00:01:30.640 They fought back and forth in a stalemate, and then one day, Hannibal had this brilliant idea. He decided he was going to lead a sneak attack on Rome by marching his army through the Alps. That's pretty cool, right? But the most awesome part of the story, at least for me, is that he had war elephants.
00:01:56.399 Some people have doubted this story over the years. In 1959, a British engineer borrowed a circus elephant named Jumbo and marched it across the Alps as part of a bar bet. I guess the moral here is not to underestimate elephants; they don't like it, and they have long memories.
00:02:40.879 Now, this talk is really about biggish data. So, what is biggish data? It's not about big data or Hadoop clusters; that's way over my head. It's also not about fancy architectures. Instead, I want to discuss something that I find more practical and interesting: how to keep your app working as your production dataset grows into biggish data territory.
00:03:57.200 This issue can arise even if you don't have a fire hose of data coming in. For example, if you run a popular e-commerce site or a site with many users, you can accumulate a ton of data over the years. As this data accumulates, you may find that your site's performance gradually declines in ways you don't quite understand.
00:04:12.560 This talk is based on my experience at Honeybadger. In case you haven't heard of us, we provide exception, performance, and uptime monitoring. Right now, we have about a terabyte of errors in our database, and we acquire roughly two gigabytes of new errors per day. All of this goes into a plain vanilla PostgreSQL database behind a pretty basic Rails application.
00:04:38.080 The good news is that PostgreSQL can handle this pretty easily. The bad news is that if your app isn't engineered to handle this level of data, you're kind of screwed. A 100 megabyte database behaves fundamentally differently than a one terabyte database.
00:04:54.400 Many of the normal conventions we use in Rails apps just stop working when you have this much data.
00:05:00.720 For example, I learned that pretty much every pagination system breaks after page 2000. And even if you try to delete half your data to slim things down, the delete itself can take a very long time.
00:05:08.159 I will explain why this happens, how you can work around it, and how to optimize your queries in your database. To truly understand this, we need to go back in time to the summer of 1978, a summer filled with disco and the VT-100 terminal.
00:05:30.080 The VT-100 was the first computer terminal you could buy from a company and set on your desk to prove you were the alpha nerd of the office. It was also the time when Oracle 1.0 was being developed, one of the first databases to use SQL. This was written in PDP-11 assembly language.
00:06:00.800 Now, in our modern times, marketers tend to throw around buzzwords like cloud computing and platform as a service, but back then the buzzword was 'real-time computing,' which meant generating your reports in five minutes instead of five days.
00:06:14.240 While that was impressive at the time, it’s not the kind of thing I’d want to build a web app on top of. But over the past 36 years, something funny happened: Moore's Law transformed this idea into a web stack.
00:06:30.360 Most people know vaguely what Moore’s Law is, but in technical terms, it means that computers get more awesome as time goes forward, defined as being able to do more stuff faster and process more data efficiently. I’d like to present, for the first time in public, Starr’s corollary to Moore’s Law: as database growth outpaces Moore's Law, you literally travel back in time.
00:07:10.320 I can already see your minds being blown; it sounds like bubble wrap popping. To survive in this new time period, you need to understand the methods and motivations of the people from that era. We need to start thinking like our ancestors, who were very interested in hardware.
00:07:25.120 If you happen to be experiencing database scaling issues where your app is slowing down due to the growing size of your database, you can probably solve it right now by purchasing a real computer. A real computer can provide the two critical resources databases need for high performance: a lot of RAM and fast disk I/O.
00:07:44.480 Virtual servers tend to lack both of these capabilities, so investing in a physical machine can be a game-changer. While you're at it, make sure to use multiple disks. It’s not just about having a RAID array; you want to separate your operating system, database data, and logs onto different hard drives to enable better I/O scheduling.
00:08:05.600 If buying a real computer solves your problem, great! You're welcome to head out now; I won't be offended. However, if that didn't work, or if you want to make sure everything is optimal, the next thing to examine is your queries.
00:08:30.760 Most of us develop against small datasets, so inefficient queries often go unnoticed. Books have been written on query optimization, and I can't cover everything in 30 minutes, but I’ll point out one vital tool: EXPLAIN.
00:08:43.840 PostgreSQL offers an incredible feature called EXPLAIN, where you pass a query and it shows you the query plan. This can look overwhelming, but for the purposes of this talk, we only need to focus on one metric: rows.
00:09:02.240 The more rows you touch in a query, the more data your system has to process to give you the answer you want. The key to query optimization is limiting the number of rows touched. For example, if you run a naive COUNT query over a big table, PostgreSQL literally counts the matching rows one by one.
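As a rough illustration of how you might look at a plan from a Rails app (the `Notice` model, its columns, and the literal values below are hypothetical stand-ins, not anything from the talk):

```ruby
# In a Rails console: ActiveRecord's #explain (Rails 3.2+) prints the
# PostgreSQL query plan, including the estimated rows each step touches.
puts Notice.where(fault_id: 123).order(created_at: :desc).limit(25).explain

# The same idea with raw SQL, using EXPLAIN ANALYZE to also run the query
# and report actual row counts and timings.
result = ActiveRecord::Base.connection.execute(<<-SQL)
  EXPLAIN ANALYZE
  SELECT * FROM notices WHERE fault_id = 123
  ORDER BY created_at DESC LIMIT 25;
SQL
result.each { |row| puts row["QUERY PLAN"] }
```

The number to watch, as the talk says, is rows: a plan that estimates millions of rows for a 25-row page is the one worth rewriting.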
00:09:51.199 Pagination systems tend to break around page 1000 or 2000 because OFFSET and LIMIT behave a lot like COUNT: to hand you page 2000, PostgreSQL still has to walk past every row in all of the pages before it, so deep pages touch enormous numbers of unnecessary rows.
00:10:04.080 The solution to this inefficiency is to use range queries. Range queries are faster because they leverage indexes, so far fewer rows are touched than with OFFSET and LIMIT. A well-structured range query can be dramatically faster than a deep OFFSET.
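A minimal sketch of the difference, assuming a hypothetical `Notice` model paginated by its primary key (the 25-per-page size and the cursor value are illustrative):

```ruby
per_page = 25

# OFFSET-based pagination: to build page 2,000, PostgreSQL still has to walk
# past the 49,975 rows that precede it before returning the 25 you asked for.
page = 2_000
Notice.order(:id).limit(per_page).offset((page - 1) * per_page)

# Range ("keyset") pagination: remember the last id you rendered and ask for
# the rows after it. With the primary-key index, this touches only ~25 rows.
last_seen_id = 1_250_000 # e.g. taken from a params[:after] cursor
Notice.where("id > ?", last_seen_id).order(:id).limit(per_page)
```

The trade-off is that you link to "next" cursors rather than arbitrary page numbers, which is usually an easy sell once page 2000 stops timing out.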
00:10:45.120 Sorting is another tricky element; it can be very fast if done correctly, but it can also lead to performance issues if the data isn't sorted the way the index is configured. Understanding how to structure your sorts in conjunction with your database indexes is crucial for performance.
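One common pattern, sketched below with made-up table and column names, is to give the query an index whose column order matches its ORDER BY, so PostgreSQL can read rows straight off the index instead of sorting them after the fact:

```ruby
# Migration: a composite index that matches the query's filter and sort order.
class AddFaultCreatedAtIndexToNotices < ActiveRecord::Migration
  def change
    add_index :notices, [:fault_id, :created_at]
  end
end

# With that index in place, this query can be answered by walking the index
# (backwards, for DESC) rather than sorting all matching rows. Verify with EXPLAIN.
Notice.where(fault_id: 123).order(created_at: :desc).limit(25)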
00:11:00.160 As a recap for query optimization, you should develop against a real dataset and utilize the EXPLAIN tool often. Always strive to minimize the number of rows your queries process because touching more rows generally leads to slower performance.
00:11:45.119 Now, let's pick up the pace. For the second half, we'll cover infrastructure issues related to biggish data, including workflows for handling large datasets. Here's the plan: each topic gets a cute picture, plus a link if you want to do further reading.
00:12:06.079 Our ancestors were deeply involved in the creation of disk operating systems. It seems like they were developing new systems every couple of seconds, yet today we find ourselves with only a few main choices. It’s interesting how times change.
00:12:34.079 The first recommendation for Linux users is to increase your read-ahead cache. Many people aren't aware of this, but making this change can lead to a doubling of read throughput. Linux pre-fetches the next set of blocks into RAM when it sees consecutive requests.
00:12:53.679 The default setting for the read-ahead cache is typically around 256k, but increasing it to 2 or 4 megabytes can substantially improve performance. Additionally, use a modern file system; ext3 is not recommended, and if you're using ext4 or XFS, you may want to tweak journaling settings.
00:13:21.679 You'll want to adjust PostgreSQL settings based on the RAM available in your machine. An easy way to do this is with a tool called pgTune, which analyzes your setup and configures PostgreSQL with sensible values.
00:13:46.400 Another critical aspect is the VACUUM command in PostgreSQL. It's essential because updates and deletes leave dead rows behind, and VACUUM cleans them up and reclaims the space. However, this operation can be resource-intensive.
00:14:01.120 If your server is heavily loaded and VACUUM is adding to that load, you might be tempted to run it less often. That tends to backfire, as we learned the hard way: generally, the solution to VACUUM problems is to run it more frequently, not less.
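One way to do that is with per-table autovacuum settings, so a hot table gets vacuumed far more often than the cluster-wide defaults would allow. The sketch below uses a hypothetical `notices` table and illustrative thresholds, not values recommended in the talk:

```ruby
# Migration sketch: lower the autovacuum thresholds on one hot table so it is
# vacuumed and analyzed more aggressively. The numbers are illustrative.
class TuneAutovacuumOnNotices < ActiveRecord::Migration
  def up
    execute <<-SQL
      ALTER TABLE notices SET (
        autovacuum_vacuum_scale_factor  = 0.01,
        autovacuum_analyze_scale_factor = 0.01
      );
    SQL
  end

  def down
    execute <<-SQL
      ALTER TABLE notices RESET (
        autovacuum_vacuum_scale_factor,
        autovacuum_analyze_scale_factor
      );
    SQL
  end
end
```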
00:14:24.759 Next, I'm going to touch on velocity. If you have a huge influx of data or many read requests, managing database connections becomes vital. Each connection in PostgreSQL is a separate process that consumes its own RAM, making it crucial to limit your connections.
00:14:43.920 Implementing connection pooling is a great way to address this challenge. In Ruby, there are libraries to handle this, but you can also use a tool like PgBouncer, which serves as a proxy between your Ruby application and PostgreSQL.
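A rough sketch of what pointing Rails at PgBouncer can look like; normally this lives in database.yml, and the host, port, credentials, and pool size below are placeholders. Note that PgBouncer's transaction-pooling mode does not work with prepared statements, so they are disabled here:

```ruby
# Connect ActiveRecord to a local PgBouncer (default port 6432) instead of
# talking to PostgreSQL directly. All values here are placeholders.
ActiveRecord::Base.establish_connection(
  adapter:  "postgresql",
  host:     "127.0.0.1",
  port:     6432,               # PgBouncer's default listen port
  database: "app_production",
  username: "app",
  password: ENV["DATABASE_PASSWORD"],
  pool:     5,                  # ActiveRecord's per-process pool
  prepared_statements: false    # required for transaction pooling
)
```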
00:15:01.759 Another common issue is dealing with locks. When updating rows within a transaction, locks are placed until the action is completed, preventing other operations on the same rows. Avoiding situations where a record is updated continuously in quick succession helps mitigate locking issues.
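One concrete way to keep those locks short, sketched with a hypothetical `Fault` model and `notices_count` column: replace read-modify-write transactions on a hot row with a single atomic UPDATE, and keep slow work (API calls, mail) out of lock-holding transactions entirely.

```ruby
fault_id = 42 # some existing Fault's id

# Lock-prone: the row lock is held for the whole block, so concurrent writers
# to the same fault queue up behind it.
fault = Fault.find(fault_id)
fault.with_lock do
  fault.update!(notices_count: fault.notices_count + 1)
end

# Friendlier: one atomic statement, so the row lock is held only momentarily.
# Issues "UPDATE faults SET notices_count = notices_count + 1 WHERE id = ?".
Fault.increment_counter(:notices_count, fault_id)
```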
00:15:14.240 For intensive queries, consider creating read-only replicas using PostgreSQL's streaming replication. You can direct heavy queries to the replica, ensuring that the primary database remains responsive for your users.
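In the Rails 4 era this was typically wired up by hand. A hedged sketch, assuming a `replica` entry in database.yml that points at the standby and hypothetical model names:

```ruby
# Abstract base class whose connection points at the streaming-replication
# standby (a hypothetical "replica" entry in database.yml).
class ReplicaRecord < ActiveRecord::Base
  self.abstract_class = true
  establish_connection :replica
end

# Heavy, read-only reporting models inherit from it, so their queries hit the
# standby while the primary keeps serving user traffic.
class ReportingNotice < ReplicaRecord
  self.table_name = "notices"
end

ReportingNotice.where("created_at > ?", 30.days.ago).group(:fault_id).count
```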
00:15:36.240 Partitioning is an excellent practice for managing large datasets. PostgreSQL allows you to set up partitioning schemes where data is organized into separate tables based on different criteria, which aids in significantly speeding up data culling and archiving.
00:15:57.799 This way, deleting old data becomes a matter of dropping a table instead of running lengthy delete queries that can freeze your system, especially when dealing with two terabytes of data.
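At the time of this talk, PostgreSQL partitioning meant table inheritance plus CHECK constraints (declarative partitioning arrived later, in PostgreSQL 10). A hedged sketch with made-up table names; routing inserts into the correct child table, usually done with a trigger or by writing to the child directly, is omitted for brevity:

```ruby
# Create a monthly child table that inherits from the parent "notices" table.
class AddMay2014NoticesPartition < ActiveRecord::Migration
  def up
    execute <<-SQL
      CREATE TABLE notices_2014_05 (
        CHECK (created_at >= '2014-05-01' AND created_at < '2014-06-01')
      ) INHERITS (notices);
    SQL
  end

  def down
    execute "DROP TABLE notices_2014_05;"
  end
end

# Culling a month of old data is then a near-instant DROP TABLE:
#   execute "DROP TABLE notices_2014_01;"
```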
00:16:15.520 Lastly, managing backups is a significant concern for large datasets. Using traditional cron jobs to dump a two-terabyte database into S3 doesn't scale well. Instead, taking advantage of PostgreSQL's streaming replication allows for incremental backups.
00:16:36.759 Tools such as WAL-E make it simple to upload these incremental backups to S3, providing a straightforward route for point-in-time restores.
00:16:58.559 As I've worked on my slides, I realized there's a lot of information here. This can be overwhelming, but remember that the issues with biggish data tend to present themselves one at a time. You don't have to know everything at once.
00:17:20.480 You can tackle each challenge as it arises, and I have faith that you all can handle this because you're incredible. None of us expected we could turn transistors into cat memes, but we did it, and that’s the type of people we are.
00:17:43.920 If you’re interested in discussing this further, please come up to me after the talk. I have delicious, spicy Mexican candy as an incentive. If you'd like to learn more, all the referenced links from today's talk will be available online.
00:18:06.320 Also, if you're looking for better visibility into production errors, check out Honeybadger. We love Rails developers. And that brings me to the end of the show.
00:18:22.960 Thank you!