JD Harrington

Microtalk: Take a picture, it'll last longer

Cheap code branches permeate our daily workflow, but the code we write is only half the story. Introducing data model changes into production can challenge developers and ops people alike, but how do we deal with these issues in our experimental phase?
In this talk, we'll discuss using features provided by modern filesystems like ZFS & Btrfs to branch databases along with code, letting developers experiment and ops people model complex environments, all from the comfort of our laptops.

GoRuCo 2013

00:00:11.480 Thank you.
00:00:16.379 Hey everybody, I am JD. You can email me at jdharrington.net. I'm Psy on Twitter and GitHub, and I work at StreetEasy, focusing on both application development and system operations.
00:00:22.580 So a quick show of hands: how many people here use Git? Cool! It seems that almost all of us use Git at this point, and there are a lot of great features in it. However, when we all started using it, we were switching from Subversion or something else, and the thing that we all said was, 'We heart branches!'
00:00:39.540 We like branches because Git makes them easy and fast. Branching has become part of our day-to-day workflow. The really great thing about branching is that it allows us to sandbox our experiments. So, if I want to work on a new feature, I create a new branch, and I start working on it. If someone comes along and says, 'Hey, something is broken in production,' I can switch back to master, make the changes there, commit them, push them out, and then go back to working on my experiment. Git makes this process super easy.
00:01:07.200 However, things can get complicated. If, in your experimental branch, you make some changes to the database that aren't compatible with what's in master—like renaming a column, renaming a table, or changing a relationship from one-to-one to one-to-many or many-to-many—it becomes a total pain to manage your database while switching back and forth between those branches.
00:02:05.520 Thinking about this, the thing we would really like is to be able to branch the database too. That's a great idea, but how do we do that? I'm going to show you a few ideas, and these examples will use MySQL, but the concepts here can apply to any database or data store.
00:02:15.599 The general process will involve stopping MySQL, swapping in different database files, and then restarting MySQL. We all know how to stop and start MySQL, so the tricky part is figuring out how we can swap in the different files. The first thing we could do is simply copy the entire database. We stop MySQL, move our MySQL directory aside and name it after our master branch, copy everything, swap our copy into place, and then restart MySQL. That works, but if my database is something like 100 gigabytes, it takes forever to make those copies.
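The copy approach above can be sketched roughly as follows. This is a minimal illustration, not the speaker's code: the helper name `switch_by_copy`, the one-directory-per-branch layout, and the use of a symlink as the "swap into place" step are all assumptions, and the `stop` and `start` callables stand in for whatever stops and starts MySQL on your machine.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: copy-based "branching" of a database directory.
#   data_link     - symlink the server reads its data from (assumed layout)
#   branches_root - directory holding one full copy of the data per branch
#   stop / start  - callables that stop and start the database server
def switch_by_copy(data_link, branches_root, from_branch, to_branch, stop:, start:)
  stop.call
  source = File.join(branches_root, from_branch)
  target = File.join(branches_root, to_branch)

  # Full copy the first time a branch is seen. This is the slow step that
  # also doubles the disk usage for every branch.
  FileUtils.cp_r(source, target) unless Dir.exist?(target)

  # Swap the branch's copy into place by repointing the symlink.
  File.unlink(data_link) if File.symlink?(data_link)
  File.symlink(target, data_link)

  start.call
  target
end

# Demo against a throwaway directory instead of a real MySQL data directory.
Dir.mktmpdir do |root|
  branches = File.join(root, "branches")
  FileUtils.mkdir_p(File.join(branches, "master"))
  File.write(File.join(branches, "master", "users.ibd"), "rows")

  link = File.join(root, "mysql")
  File.symlink(File.join(branches, "master"), link)

  noop = -> {}
  switch_by_copy(link, branches, "master", "experiment", stop: noop, start: noop)
  puts File.basename(File.readlink(link))
end
```

The cost of `FileUtils.cp_r` grows with the size of the database, which is exactly the problem with this approach.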
00:02:44.640 Additionally, what if the copy doesn't finish because my SSD runs out of space? Clearly, that's a problem. So let's consider something else. We're looking for a Git-style branching functionality, so we think maybe we can put the entire database into a Git repository.
00:03:01.519 The process would be the same: stop MySQL, add everything into Git, check out a new branch, restart the server, and see if that works. It does work, but it still takes forever to add large files in your database directory to Git, and you're using a significant amount of extra disk space. Git does some internal compression, so you won't waste as much space, but you're still using more than you would like.
00:03:22.740 On the bright side, switching to a new branch is really fast: you're on master, you check out a new branch, and that happens super quickly. However, you still have to work out how to commit changes to your database, and at that point Git starts to feel like the wrong tool for this job.
00:04:04.360 Then I thought, 'Hey, there’s this thing called ZFS.' How many people are familiar with ZFS? Cool! ZFS was developed by Sun; they started writing it in 2001, and its first public release was in 2005. It was developed as the next-generation file system for Solaris.
00:04:31.740 ZFS is interesting because it combines a standard file system, a volume manager, and software RAID into one component. If you're using the standard set of Linux tools to achieve similar functionality, you would need a file system like XFS or ext4, LVM as your volume manager, and mdadm for software RAID. ZFS uses a copy-on-write transaction model on disk, meaning that any time you change a block of data, the modified data is written to a new location and the reference is then updated to point at that new block, rather than the block being overwritten in place. Operating systems use a similar model to manage memory, which lets you spin up multiple processes without using excessive RAM at startup.
00:05:48.600 Because of this model, ZFS can provide fast and efficient snapshotting and cloning features. A snapshot is essentially a point-in-time representation of an existing file system. Once you take a snapshot, you can clone it to create a new file system based on that snapshot. Since this process involves updating references rather than duplicating data, it happens very quickly and uses minimal initial disk space. When you clone a new file system and make changes over time, ZFS stores the differences between those two file systems.
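To make the snapshot and clone idea concrete, here is a toy in-memory model of copy-on-write. It is only an illustration of reference sharing, not how ZFS is implemented; the class name `ToyFilesystem` and its methods are invented for this sketch.

```ruby
# Toy model of copy-on-write snapshots and clones (illustration only).
class ToyFilesystem
  def initialize(blocks = {})
    @blocks = blocks # block name => reference to an immutable block
  end

  # A snapshot copies the table of references, never the block contents,
  # so it is cheap no matter how much data the file system holds.
  def snapshot
    @blocks.dup.freeze
  end

  # A clone is a new writable file system that starts from a snapshot.
  def self.clone_from(snapshot)
    new(snapshot.dup)
  end

  # Writes never modify a block in place: a new block is stored and only
  # this file system's reference is updated.
  def write(name, data)
    @blocks[name] = data.dup.freeze
  end

  def read(name)
    @blocks[name]
  end
end

master = ToyFilesystem.new
master.write("users", "id,name")
master.write("orders", "id,total")

branch = ToyFilesystem.clone_from(master.snapshot)
branch.write("users", "id,full_name") # only the branch diverges

puts master.read("users")
puts branch.read("users")
# Unchanged blocks are still the same shared object in both file systems.
puts master.read("orders").equal?(branch.read("orders"))
```

Only the blocks that are rewritten after the clone take up new space, which is why a fresh clone is nearly free.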
00:06:49.140 This approach starts to sound like Git branches. Additionally, ZFS is not just for Solaris anymore; ports exist for macOS, Linux, and FreeBSD. If we apply ZFS to our problem at hand, we can do something like this: a ZFS file system is named by its storage pool followed by the name of the file system. Here we're looking at 'myapp/MySQL'.
00:07:25.220 We take a snapshot, clone it into a new file system named after our new branch, swap that file system into place, and restart MySQL. This is incredibly fast and uses virtually no extra disk space. The drawback is that typing all of that out every time you want to branch is tedious, and nobody is likely to do it consistently.
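For reference, the sequence being typed out looks roughly like the commands this sketch prints. `zfs snapshot`, `zfs clone`, and `zfs set mountpoint=...` are standard ZFS commands. Everything else is an assumption made for illustration: the lowercase dataset name `myapp/mysql`, the mount point, the `mysql.server` stop/start commands, and the naming scheme for the clone. The function only builds the commands; it does not run them.

```ruby
# Builds (but does not run) the commands for branching a MySQL dataset
# with ZFS. Dataset name, mount point, and stop/start commands are
# assumptions; adjust them for your own setup.
def zfs_branch_commands(branch,
                        base: "myapp/mysql",
                        mountpoint: "/usr/local/var/mysql",
                        stop: %w[mysql.server stop],
                        start: %w[mysql.server start])
  # A branch like "feature/x" would otherwise create a nested dataset name.
  name = branch.gsub(/[^A-Za-z0-9_.-]/, "-")
  snapshot = "#{base}@#{name}"
  clone = "#{base}-#{name}"

  [
    stop,
    ["zfs", "snapshot", snapshot],                    # point-in-time snapshot
    ["zfs", "clone", snapshot, clone],                # writable clone for the branch
    ["zfs", "set", "mountpoint=none", base],          # move the original out of the way
    ["zfs", "set", "mountpoint=#{mountpoint}", clone], # swap the clone into place
    start
  ]
end

zfs_branch_commands("add-tags").each { |argv| puts argv.join(" ") }
```

This covers only the first branch off the base file system; switching back and forth between existing branches would need to reuse clones rather than create them.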
00:08:19.560 So, this sounds like a job for Git hooks! Luckily, Git has a post-checkout hook that we can use for this process. This hook triggers every time you check out a file or branch. Git passes it a parameter identifying whether you're checking out a branch, so we can react only to branch switches and automate this task.
00:09:04.920 I've written a Git hook that implements this functionality. It wraps up everything we discussed earlier in a bit of Ruby. It runs every time we switch branches and automatically creates a new file system named after the current branch.
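The speaker's hook is not reproduced in this transcript, so the following is only a sketch of the general shape such a hook could take. Git calls post-checkout with three arguments: the previous HEAD, the new HEAD, and a flag that is "1" for a branch checkout and "0" for a file checkout. The helper names are hypothetical, and the database switch itself is left as a placeholder.

```ruby
#!/usr/bin/env ruby
# Sketch of a .git/hooks/post-checkout script (not the speaker's code).

# Git's third argument is "1" for a branch checkout, "0" for a file checkout.
def branch_checkout?(argv)
  argv[2] == "1"
end

# Returns the branch whose database file system should be swapped in,
# or nil when the hook should do nothing.
def database_branch_for(argv, current_branch)
  return nil unless branch_checkout?(argv)
  # `git rev-parse --abbrev-ref HEAD` prints "HEAD" on a detached HEAD.
  return nil if current_branch.nil? || current_branch.empty? || current_branch == "HEAD"

  current_branch
end

# Only act when run with the three arguments Git passes to the hook.
if __FILE__ == $PROGRAM_NAME && ARGV.length == 3
  current = `git rev-parse --abbrev-ref HEAD`.strip
  if (branch = database_branch_for(ARGV, current))
    # Placeholder: a real hook would stop MySQL, create or reuse the ZFS
    # file system for this branch, swap it into place, and restart MySQL.
    puts "would switch the database to the file system for #{branch}"
  end
end
```

To try it, the script would be saved as `.git/hooks/post-checkout` and made executable.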
00:09:24.420 As a result, this system works seamlessly and efficiently. Restarting MySQL will take longer than creating the file system. This approach works well for standard development workflows. Additionally, I developed a Vagrant plugin called vagrant-zfs, which applies the same concept on a per-VM basis, making it excellent for testing database clusters.
00:10:11.339 Using this plugin, you can easily deploy multiple VMs that need the same seed data, with each VM quickly getting its own file system. Again, this solution is fast and efficient, and it requires virtually no extra disk space.
00:10:21.899 This functionality will soon be updated for the new version of Vagrant, and that’s all I have.
00:10:24.540 If anyone has questions, I have slides and a walkthrough available that shows you how to set this up on your Mac, along with the code in the post-checkout hook.