00:00:25.519
Hello everyone! Today we're going to talk about dealing with large data sets in ActiveRecord. I'm Eileen Uchitelle. I work for PhishMe, and you can find me almost anywhere online at the handle @eileencodes. That's also my blog, eileencodes.com.
00:00:40.000
You may have known me in the past as a Carpenter, but I recently got married and changed my last name. I know it's hard to pronounce, so just think of it as Uchitelle without the 'ld.' I'm a senior developer at PhishMe. We provide simple solutions that help our customers educate their workforce about phishing by sending simulated spear-phishing attacks.
00:01:01.920
Everyone here should know what phishing is, but if you don't, shoot me an email, and I'll send you some ways to recognize phishing attacks. Over 17 million emails have been sent with the PhishMe app since 2009. With that statistic in mind, it's no surprise that PhishMe has a lot of data.
00:01:16.799
I've built many applications in the past but had never worked with data sets this large. Queries that ran fine with 10 or 100 records would blow up when dealing with 10,000. ActiveRecord is the library that wraps around our database interactions and protects us from having to write raw SQL.
00:01:36.960
It's like a magical unicorn full of wonder and happiness, but naively ignoring the SQL it produces is not a sustainable school of thought. Ignorance leads us to blame ActiveRecord for our problems instead of being accountable for the code that we write.
00:01:54.960
There were quite a few times where I felt ActiveRecord failed me because I didn't write thoughtful queries. I wasn't paying enough attention to how my code and data interacted, and I set myself up for disappointment. When MySQL crashed or my queries were slow, I felt frustrated, asking how could ActiveRecord treat me this way.
00:02:13.200
But deep down, I think this was an act of ignorance on my part: I assumed its magical properties meant I didn't need to understand what it was doing. We need to understand how ActiveRecord translates into MySQL so we can tell it how we want it to behave, and then we can all be friends with it.
00:02:35.280
After a ton of research, I figured out how to write my ActiveRecord queries in ways that avoid MySQL timeouts, memory leaks, and slow queries. I'm here to share my mistakes so that you can avoid the same problems. For each of the fundamental CRUD functions (Create, Read, Update, and Delete), I'm going to demonstrate a problem I've encountered and the solution I came up with.
00:02:54.480
CRUD functions are relevant to any persistent data model and can be found in many aspects of an application. Create can be represented by 'create' and 'new' in Rails or 'insert' in MySQL. Read can be demonstrated with 'find,' 'find_each,' and various other methods in Rails, as well as 'select' in MySQL.
00:03:10.239
Update is handled by 'update' and 'update_all' in Rails, and 'update' in MySQL. Lastly, Delete corresponds to 'destroy,' 'delete,' 'destroy_all,' and 'delete_all' in Rails, and 'delete' in MySQL. Once you start looking for them, you'll find CRUD operations in all aspects of our applications, from the database to the user interface to HTTP request methods.
00:03:38.640
My examples are represented by database operations and ActiveRecord methods. For my examples, let's imagine our application is an address book. I'm using an address book as a model of associations between people. Each user has many contacts and has many categories.
00:04:03.120
Contacts belong to a user and to many categories. Categorizations are the junction model that connects both. Without categorizations, contacts do not know about categories, and vice versa. Let's start with the Create operation.
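To make those associations concrete, here is a minimal sketch of what the models might look like, assuming standard Rails has_many :through associations; the exact class and attribute names are illustrative.

```ruby
class User < ActiveRecord::Base
  has_many :contacts
  has_many :categories
end

class Contact < ActiveRecord::Base
  belongs_to :user
  has_many :categorizations
  has_many :categories, through: :categorizations
end

class Category < ActiveRecord::Base
  has_many :categorizations
  # Categories only know about contacts through categorizations.
  has_many :contacts, through: :categorizations
end

# The join model connecting contacts and categories.
class Categorization < ActiveRecord::Base
  belongs_to :contact
  belongs_to :category
end
```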
00:04:22.880
Imagine we have a CSV spreadsheet with 10,000 contacts. We can run through each row of the CSV and create each individual contact using ActiveRecord. This will construct an insert statement for each individual contact that needs to be added to the database.
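As a rough sketch, that row-by-row approach might look like this; the CSV filename and column names are assumptions for illustration.

```ruby
require "csv"

# Creates one contact per CSV row; ActiveRecord issues a separate INSERT for each call.
CSV.foreach("contacts.csv", headers: true) do |row|
  Contact.create(
    first_name: row["first_name"],
    last_name:  row["last_name"],
    email:      row["email"]
  )
end
```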
00:04:55.520
However, this insert statement will run ten thousand times, which will take quite a while. What if there was a way to insert more than one record at once? After a lot of research, I found a method to speed up the creation of these 10,000 records.
00:05:24.160
This method involves using MySQL's batch insert, which allows you to insert multiple records simultaneously. While this isn't commonly done in Rails, it can greatly improve efficiency.
00:05:41.520
Back to the spreadsheet with ten thousand contacts: we'll read each row of the CSV and build an array of all the contact values that need to be inserted into the database. It's crucial to note that MySQL can't necessarily handle all ten thousand records in a single statement without exceeding its maximum query size, so we need to insert them in batches.
00:06:07.520
After extensive trial and error, I found that 2,000 was a reliable batch size for my servers, but you'll need to experiment on your own. Until all the contacts are inserted, we shift that many contacts off the front of the contact values array, repeating until the array is empty.
00:06:32.000
An SQL statement is then filled with the column names and the values to be inserted. Batch insert is made possible through the insert syntax by feeding in all the values for each record we want to create. It's important that column names and values line up perfectly.
00:06:50.720
Finally, we connect to the database and execute the insert statement. I'd like to emphasize that this example assumes we've sanitized the input against SQL injection. Batch inserts create a MySQL query that looks similar to other create statements except it's chaining all the values instead of making a new insert statement for each contact.
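Putting those pieces together, a minimal sketch of the batch insert might look like the following, assuming contact_values already holds sanitized value tuples built from the CSV rows; the batch size and column names are illustrative.

```ruby
BATCH_SIZE = 2_000

# contact_values might look like: ["('Jane', 'Doe', 'jane@example.com')", ...]
until contact_values.empty?
  batch = contact_values.shift(BATCH_SIZE)
  sql   = "INSERT INTO contacts (first_name, last_name, email) " \
          "VALUES #{batch.join(', ')}"
  ActiveRecord::Base.connection.execute(sql)
end
```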
00:07:09.440
Let’s check the speed on these two queries using a benchmark tool from the Ruby standard library. The output represents system CPU time along with the elapsed real-time for my examples. For the example where we created each record individually, it took approximately 45.9 seconds.
00:07:41.920
That's a long time to wait for 10,000 records to be inserted into the database. The MySQL batch insert took less than three seconds. That's a huge difference! Benchmarking times may vary slightly based on garbage collection, memory allocation, and the versions of Ruby and Rails that you're using.
00:08:07.920
However, this doesn't change the fact that MySQL batch insert is drastically faster than creating and saving each individual record. But since we aren’t saving each record, no callbacks will be fired, and we skip directly to the database.
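For reference, a comparison like the one above can be set up with the Benchmark module from the Ruby standard library; this is only a sketch, and the two method names are hypothetical stand-ins for the approaches shown earlier.

```ruby
require "benchmark"

Benchmark.bm(20) do |x|
  x.report("individual creates") { create_contacts_one_by_one(rows) }  # hypothetical helper
  x.report("mysql batch insert") { batch_insert_contacts(rows) }       # hypothetical helper
end
```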
00:08:34.440
Now let's look at reading data. Let's say we want to output the first name of each contact. There are multiple ways to achieve this result. You can iterate through each contact and output the first name, but that can be very memory costly if we have a lot of records.
00:08:58.080
Instead, a single SQL statement can be run, collecting all the records at once. Using 'find_each' helps save both time and memory: 'find_each' collects records in batches of 1,000 by default, without overloading memory. And since we're only outputting the first name, we can use 'pluck' to retrieve just that attribute.
00:09:21.440
This will be much faster because 'pluck' creates an array of strings instead of returning whole objects with unnecessary attributes. Let’s compare these queries for performance.
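To make the comparison concrete, here are the three variants side by side (a sketch using the contact model from earlier):

```ruby
# Loads every contact into memory before iterating.
Contact.all.each { |contact| puts contact.first_name }

# Loads contacts in batches of 1,000 by default, keeping memory usage flat.
Contact.find_each { |contact| puts contact.first_name }

# Selects only the first_name column and returns an array of strings.
Contact.pluck(:first_name).each { |first_name| puts first_name }
```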
00:09:54.240
The benchmarks indicate that 'find_each' is not much faster than 'each' here, but when collecting records we care about memory efficiency as well as time, especially when dealing with larger data sets.
00:10:09.040
We can see a significant time difference with 'pluck,' which is much faster than using 'each' or 'find_each', as we only need one attribute, the first name.
00:10:32.320
What would these benchmarks look like with 100,000 records? When the data set increases from 10,000 to 100,000 records, the length of time taken increases significantly, and the time savings become more apparent.
00:11:04.440
For 100,000 records, creating each record individually takes a very long time, 'find_each' slows down relative to the smaller data set, and 'pluck' proves to be the fastest method by a considerable margin.
00:11:30.080
Another method ActiveRecord provides is 'find_by_sql.' It allows us to craft custom SQL queries, and writing queries this way can be faster and more efficient, since ActiveRecord doesn't always know the best way to retrieve the data we need.
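As an example, 'find_by_sql' takes a raw SQL string and returns instantiated records; the query below is purely illustrative.

```ruby
# Find every contact in a given category with one hand-written query.
contacts = Contact.find_by_sql(<<-SQL)
  SELECT contacts.*
  FROM contacts
  INNER JOIN categorizations ON categorizations.contact_id = contacts.id
  WHERE categorizations.category_id = 42
SQL
```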
00:11:55.680
Now let's move on to updates. Suppose you want to change all categorizations from the 'co-workers' category to the 'network' category. In this scenario, you're collecting all records to update the category attribute on each one using 'update_attributes.'
00:12:20.160
This process instantiates and updates each individual record for all 10,000 entries, generating multiple SQL statements. A more efficient method would be to use 'update_all,' which creates a single SQL update statement and allows records to be updated without instantiation.
00:12:51.440
The 'update_all' method produces a single update statement: 'UPDATE categorizations SET category_id = <new category id> WHERE ...'. This method is much faster and more efficient because all records are updated at once without running 'save' on each object.
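A sketch of the two approaches, assuming we've already looked up the 'co-workers' and 'network' categories; the variable names are illustrative.

```ruby
# Slow: instantiate each categorization and save it, firing callbacks every time.
Categorization.where(category_id: coworkers.id).find_each do |categorization|
  categorization.update_attributes(category_id: network.id)
end

# Fast: one UPDATE statement, no instantiation, no callbacks.
Categorization.where(category_id: coworkers.id).update_all(category_id: network.id)
```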
00:13:20.480
The first query, which updates each individual record one at a time, takes approximately 14 seconds, while 'update_all' executes almost instantaneously. This is an incredible savings in performance.
00:13:48.320
Now, let’s focus on delete functions. We first need to discuss the differences between 'delete_all' and 'destroy_all.' How ActiveRecord handles associations can affect their outcomes.
00:14:02.720
If you delete a record through 'destroy' and the association's dependent option is set to ':destroy,' then deleting a contact will also remove all of its associated categorizations. Conversely, 'delete_all' will not fire any callbacks and will simply remove records in bulk.
00:14:23.760
For instance, if we use 'delete_all' on contacts, only those contacts will be removed, regardless of whether the dependent option is set to ':delete_all' or ':destroy.' In contrast, running 'destroy_all' will execute the dependent callbacks, deleting the associated records as well.
00:14:48.720
This difference in handling relationships emphasizes the importance of understanding how models are associated. While using 'delete_all' removes contacts all at once without triggering callbacks, 'destroy_all' processes each contact individually, ensuring callbacks are executed.
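Assuming the contact model declares its categorizations with a dependent option, the difference might look like this; the dependent setting and the query are assumptions for illustration.

```ruby
class Contact < ActiveRecord::Base
  belongs_to :user
  has_many :categorizations, dependent: :destroy
end

# destroy_all instantiates each matching contact and runs its callbacks,
# so the dependent categorizations are removed as well.
Contact.where(user_id: user.id).destroy_all

# delete_all issues a single DELETE for the contacts, skips callbacks,
# and leaves the categorizations untouched.
Contact.where(user_id: user.id).delete_all
```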
00:15:12.160
Now, consider a scenario where you try to run a query like 'category.contacts.destroy_all.' It won't delete contacts related to the category the way you might expect, simply because the category doesn't directly associate with contacts.
00:15:32.080
The only way for categories to know about contacts is through categorizations. Such queries can lead to confusion, as many developers may assume that deleting contacts associated with a category will work as intended.
00:15:54.960
It's essential to write clear and concise code to avoid misunderstandings with model associations. Developers should be cautious when executing deletions in relationships to prevent unintended consequences.
00:16:20.800
For instance, when we set the dependent option to ':delete_all,' it should ideally perform efficiently. However, it may not operate as expected because of how the collection proxy works with associations.
00:16:46.160
You might notice unexpected delays since, when deleting through associations, dependencies generally need to collect every object to delete, which can slow down performance.
00:17:08.240
The most effective approach in such cases would be to collect categorizations through the category model and deal with deletions directly, which will produce clear SQL statements.
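A minimal sketch of that direct approach (names are illustrative):

```ruby
# Delete the join records for a category directly, producing one clear DELETE statement.
Categorization.where(category_id: category.id).delete_all
```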
00:17:29.679
So what have we learned from these examples? ActiveRecord is a great tool, but we shouldn't let its magical properties lead us to make assumptions.
00:17:58.320
Being mindful of how our queries are processed is crucial, especially with large data sets. Changing how we frame our ActiveRecord queries can help us avoid complications.
00:18:14.400
There are great tools available to help us profile our queries and memory usage. For instance, New Relic can identify slow SQL queries and memory consumption, while rack-mini-profiler tracks your application's queries and their timings.
00:18:32.000
Bullet identifies potentially inefficient N+1 queries and reminds you when to use eager loading. Thank you, everyone!