Eileen M. Uchitelle
CRUD! The Consequences of Not Understanding How ActiveRecord Translates into MySQL

Summarized using AI

CRUD! The Consequences of Not Understanding How ActiveRecord Translates into MySQL

Eileen M. Uchitelle • March 17, 2014 • Earth

In this talk titled "CRUD! The Consequences of Not Understanding How ActiveRecord Translates into MySQL," Eileen Uchitelle explores the complexities of using ActiveRecord in Ruby on Rails and its interactions with MySQL, particularly when handling large datasets. Eileen emphasizes that while ActiveRecord simplifies database interactions, it is crucial for developers to understand how it translates commands into SQL to prevent performance issues, such as MySQL timeouts and memory leaks. She discusses common CRUD (Create, Read, Update, Delete) operations, presenting examples for each and illustrating the performance pitfalls associated with naive use of ActiveRecord.

Key points discussed in the video include:

- Understanding CRUD Operations: The CRUD functions are foundational to data management in applications. Eileen provides definitions and usage examples for each operation.

- Performance Issues with Large Datasets: Eileen shares her experiences with data sets, revealing how operations that perform well with small data sets can break down with larger ones.

- Optimizing Create Operations: Instead of individually inserting records, Eileen advocates for batch inserts to enhance efficiency. She highlights that a batch insert for 10,000 records can reduce operation time from 45.9 seconds to under three seconds by executing a single SQL insert statement rather than multiple ones.

- Efficient Read Operations: Utilizing methods like 'findeach' and 'pluck' allows for memory-efficient and faster data retrieval, especially as the dataset size increases.

- Improving Update Operations: Eileen contrasts the 'update
attributes' method, which is slow due to instantiation for each record, with 'updateall,' which executes a single update query, vastly improving performance.

- Cautions with Delete Operations: She explains the difference between 'delete
all' and 'destroy_all', emphasizing the importance of understanding model associations in handling deletions to avoid unintended consequences.

- Profiling Tools: Eileen suggests tools like New Relic and Bullet to help developers profile their queries and identify inefficiencies.

In conclusion, the key takeaway from Eileen's talk is that while ActiveRecord is a powerful tool for database interactions, a good understanding of its operations and their implications on performance is essential. Developers should strive to write thoughtful and efficient queries to manage large datasets effectively and avoid common pitfalls in application performance.

CRUD! The Consequences of Not Understanding How ActiveRecord Translates into MySQL
Eileen M. Uchitelle • March 17, 2014 • Earth

By Eileen Uchitelle

The magic of ActiveRecord database interactions is easy to rely on and allows us assume it knows best. Without a solid understanding of how ActiveRecord translates into MySQL, however, significant issues can arise. This is particularly true with large data sets and complex model relationships. My talk explores an example for each CRUD function and shows how these queries can result in MySQL timeouts, memory issues or stack level too deep errors. The examples will examine the consequences of chaining large datasets, uses for Arel, and how to avoid encountering major problems and most importantly, how these queries can be rewritten to run more efficiently.

Help us caption & translate this video!

http://amara.org/v/FG2L/

MountainWest RubyConf 2014

00:00:25.519 Hello everyone! Today we're going to talk about dealing with a large data set from an Active Directory. I'm Eileen Uchitelle. I work for Fishme, and you can find me almost anywhere online at the handle @anycodes. That's my blog dot com.
00:00:40.000 You may have known me in the past because I'm a carpenter, but I recently got married and changed my last name. I know it's hard to pronounce, so just think of it as Uchitelle without the 'ld.' I'm a senior developer at Fishme. We provide simple solutions that help our customers educate their workforce about phishing by sending simulated spear-phishing attacks.
00:01:01.920 Everyone here should know what phishing is, but if you don't, shoot me an email, and I'll send you some ways to recognize phishing attacks. Over 17 million emails have been sent with the Fishme app since 2009. With that statistic in mind, it's no surprise that Fishme has a lot of data.
00:01:16.799 I've built many applications in the past but had never worked with data sets this large. Queries that ran fine with 10 or 100 records would blow up when dealing with 10,000. ActiveRecord is the library that wraps around our database interactions and protects us from having to write raw SQL.
00:01:36.960 It's like a magical uniform full of wonder and happiness, but naively ignoring SQL output is not a remaining school of thought. Ignorance leads us to blame ActiveRecord for our problems instead of being accountable for the code that we write.
00:01:54.960 There were quite a few times where I felt ActiveRecord failed me because I didn't write thoughtful queries. I wasn't paying enough attention to how my code and data interacted, and I set myself up for disappointment. When MySQL crashed or my queries were slow, I felt frustrated, asking how could ActiveRecord treat me this way.
00:02:13.200 But deep down, I think this was an act of ignorance on my part. I assumed its magical properties extended to being obvious. We need to understand how ActiveRecord translates into MySQL so we can tell it how we want it to behave. We can all be friends with it.
00:02:35.280 After a ton of research, I figured out how to better write my ActiveRecord queries to avoid MySQL timeouts, memory leaks, and slow queries. I'm here to share my mistakes so that you can avoid personal problems. After discussing each of the fundamental CRUD functions—Create, Read, Update, and Delete—I'm going to demonstrate a problem I've encountered and the solution I came up with.
00:02:54.480 CRUD functions are relevant to any persistent data model and can be found in many aspects of an application. Create can be represented by 'create' and 'new' in Rails or 'insert' in MySQL. Read can be demonstrated with 'find,' 'find_each,' and various other methods in Rails, as well as 'select' in MySQL.
00:03:10.239 Update is handled by 'update,' 'update_all' in Rails, and 'update' in MySQL. Lastly, Delete corresponds to 'destroy' and 'delete,' which are synonymous with 'destroy_all' and 'delete_all' in Rails and MySQL, respectively. Once you understand these functions, it's typical to find CRUD operations in all aspects of our applications, from the database to the user interface to HTTP request methods.
00:03:38.640 My examples are represented by database operations and ActiveRecord methods. For my examples, let's imagine our application is an address book. I'm using an address book as a model of associations between people. Each user has many contacts and belongs to many categories.
00:04:03.120 Contacts belong to a user and to many categories. Categorizations are the junction model that connects both. Without categorizations, contacts do not know about categories, and vice versa. Let's start with the Create operation.
00:04:22.880 Imagine we have a CSV spreadsheet with 10,000 products. We can run through each row of the CSV and create each individual contact using ActiveRecord. This will create SQL statements constructing an insert statement for each individual contact that needs to be added to the database.
00:04:55.520 However, this insert statement will run ten thousand times, which will take quite a while. What if there was a way to insert more than one record at once? After a lot of research, I found a method to speed up the creation of these 10,000 records.
00:05:24.160 This method involves using MySQL's batch insert, which allows you to insert multiple records simultaneously. While this isn't commonly done in Rails, it can greatly improve efficiency.
00:05:41.520 Back to the spreadsheet with ten thousand contacts, we'll read each row of the CSV and create an array of all the contacts that need to be inserted into the database. It’s crucial to note that MySQL can't handle all ten thousand records at once. We must lower our MySQL maximum query size.
00:06:07.520 After extensive trial and error, I found that 2000 was a reliable batch size for my servers, but you'll need to experiment on your own. Until I've inserted all the contacts, the contact values array shifts back, removing a specified number of contacts from the front of the array until it raises empty.
00:06:32.000 An SQL statement is then filled with the column names and the values to be inserted. Batch insert is made possible through the insert syntax by feeding in all the values for each record we want to create. It's important that column names and values line up perfectly.
00:06:50.720 Finally, we connect to the database and execute the insert statement. I'd like to emphasize that this example assumes we've sanitized the input against SQL injection. Batch inserts create a MySQL query that looks similar to other create statements except it's chaining all the values instead of making a new insert statement for each contact.
00:07:09.440 Let’s check the speed on these two queries using a benchmark tool from the Ruby standard library. The output represents system CPU time along with the elapsed real-time for my examples. For the example where we created each record individually, it took approximately 45.9 seconds.
00:07:41.920 That’s a long time to wait for 10,000 records to be inserted into the database. My SQL batch insert took less than three seconds. That’s a huge difference! Benchmarking times may vary slightly based on garbage collection, memory allocation, and the versions of Ruby and Rails that you're using.
00:08:07.920 However, this doesn't change the fact that MySQL batch insert is drastically faster than creating and saving each individual record. But since we aren’t saving each record, no callbacks will be fired, and we skip directly to the database.
00:08:34.440 Now let's look at reading data. Let's say we want to output the first name of each contact. There are multiple ways to achieve this result. You can iterate through each contact and output the first name, but that can be very memory costly if we have a lot of records.
00:08:58.080 Instead, a single SQL statement can be run, collecting all the records at once. Using 'find_each' helps save both time and memory. 'Find_each' will collect data in batches of thousands without overloading the memory. Since we're only outputting the first name, we can use 'pluck' to retrieve just that attribute.
00:09:21.440 This will be much faster because 'pluck' creates an array of strings instead of returning whole objects with unnecessary attributes. Let’s compare these queries for performance.
00:09:54.240 Each benchmark indicates that while 'find_each' is not much faster, when collecting records, we're more concerned with memory and time efficiency—especially when dealing with larger data sets.
00:10:09.040 We can see a significant time difference with 'pluck,' which is much faster than using 'each' or 'find_each', as we only need one attribute, the first name.
00:10:32.320 What would these benchmarks look like with 100,000 records? When the data set increases from 10,000 to 100,000 records, the length of time taken increases significantly, and the time savings become more apparent.
00:11:04.440 For 100,000 records, the individual create methods take quite a long time, while 'find_each' becomes relatively slower, and 'pluck' proves to be the fastest method by a considerable margin.
00:11:30.080 Another addition method that ActiveRecord provides is 'find_by_sql.' This method allows us to craft custom SQL queries when writing queries this way may be faster and more efficient since ActiveRecord doesn’t always know the best way to retrieve the needed data.
00:11:55.680 Now let's move on to updates. Suppose you want to change all categorizations from the 'co-workers' category to the 'network' category. In this scenario, you're collecting all records to update the category attribute on each one using 'update_attributes.'
00:12:20.160 This process instantiates and updates each individual record for all 10,000 entries, generating multiple SQL statements. A more efficient method would be to use 'update_all,' which creates a single SQL update statement and allows records to be updated without instantiation.
00:12:51.440 The 'update_all' method produces a single update statement: 'UPDATE categorizations SET category = new_category WHERE ...'. This method is much faster and efficient because all records are updated at once without running 'save' on each object.
00:13:20.480 The first query, which updates each individual record one at a time, takes approximately 14 seconds, while 'update_all' executes almost instantaneously. This is an incredible savings in performance.
00:13:48.320 Now, let’s focus on delete functions. We first need to discuss the differences between 'delete_all' and 'destroy_all.' How ActiveRecord handles associations can affect their outcomes.
00:14:02.720 If you delete a model through 'destroy' and dependencies are set to 'destroy_all,' then deleting a contact will also remove all associated categorizations recursively. Conversely, 'delete_all' will not fire any callbacks, and will simply remove records in bulk.
00:14:23.760 For instance, if we use 'delete_all' on contacts, only those contacts will be removed, regardless of whether dependencies are set to 'delete_all' or 'destroy.' In contrast, simply running 'destroy_all' will execute callbacks for dependencies, deleting relationships.
00:14:48.720 This difference in handling relationships emphasizes the importance of understanding how models are associated. While using 'delete_all' removes contacts all at once without triggering callbacks, 'destroy_all' processes each contact individually, ensuring callbacks are executed.
00:15:12.160 Now, consider a scenario where you try to run a query like 'contacts.destroy_all.' It won’t delete contacts related to the categories simply because the category doesn’t directly associate with contacts.
00:15:32.080 The only way for categories to know about contacts is through categorizations. Such queries can lead to confusion, as many developers may assume that deleting contacts associated with a category will work as intended.
00:15:54.960 It's essential to write clear and concise code to avoid misunderstandings with model associations. Developers should be cautious when executing deletions in relationships to prevent unintended consequences.
00:16:20.800 For instance, when we set the dependency option to 'delete_all,' it should ideally perform efficiently. However, it may not operate as expected because of how the collection proxy works with associations.
00:16:46.160 You might notice unexpected delays since, when deleting through associations, dependencies generally need to collect every object to delete, which can slow down performance.
00:17:08.240 The most effective approach in such cases would be to collect categorizations through the category model and deal with deletions directly, which will produce clear SQL statements.
00:17:29.679 So what have we learned from these examples? ActiveRecord is a great tool, but we shouldn't let its magical properties lead us to make assumptions.
00:17:58.320 Being mindful of how our queries are processed is crucial, especially with large data sets. Changing how we frame our ActiveRecord queries can help us avoid complications.
00:18:14.400 There are great tools available to help us profile our queries and usage. For instance, New Relic can identify slow SQL queries that consume memory, while Mini Profiler tracks your application's queries and timings.
00:18:32.000 Bullet identifies potentially inefficient 'n+1' queries and reminds you to optimize your loading. Thank you, everyone!
00:19:38.240 You.
Explore all talks recorded at MountainWest RubyConf 2014
+8