Microtalk: Take a picture, it'll last longer

by JD Harrington

In the video 'Microtalk: Take a picture, it'll last longer' presented by JD Harrington at GoRuCo 2013, the speaker discusses the integration of branching strategies from Git into database management to enhance development workflows. He emphasizes the challenges developers face when integrating data model changes into production during their experimental development phase.

Key points discussed include:

Importance of Branching: JD highlights how Git has revolutionized branching, making it easier for developers to work on features and return to stable versions, but notes that database changes complicate this process.
Challenges in Database Management: The speaker identifies specific issues with incompatible database modifications, such as renaming columns or tables, that make switching between branches painful.
Potential Solutions: Various approaches to creating database branches are discussed:
- Copying Database Files: While this is simple, it can be inefficient, especially with large databases.
- Using Git for Database: The concept of putting the database in a Git repository is presented, but this method still has overhead due to large file sizes.
Introduction of ZFS: JD explains ZFS, a filesystem that features efficient snapshotting and cloning, making it ideal for branching databases due to its copy-on-write nature, which saves disk space and time when creating new branches.
Git Hooks for Automation: To streamline the branching process, JD introduces the Git post-checkout hook, which can automatically associate database snapshots with Git branches. His hook is written in Ruby and removes the manual steps.
Application in Virtual Environments: He mentions the development of a Vagrant plugin, vagrant-zfs, which enables rapid deployment of individual filesystems for different VMs, providing a practical application for testing environments.

In conclusion, JD stresses the efficiency of using modern filesystems like ZFS to manage database changes alongside Git branching, resulting in a streamlined workflow for developers and operations teams. He encourages the audience to explore these technologies and suggests resources for implementation on Mac systems.

00:00:11.480 Thank you.

00:00:16.379 Hey everybody, I am JD. You can email me at jdharrington.net. I'm Psy on Twitter and GitHub, and I work at StreetEasy, focusing on both application development and system operations.

00:00:22.580 So a quick show of hands: how many people here use Git? Cool! It seems that almost all of us use Git at this point, and there are a lot of great features in it. However, when we all started using it, we were switching from Subversion or something else, and the thing that we all said was, 'We heart branches!'

00:00:39.540 We like branches because Git makes them easy and fast. Branching has become part of our day-to-day workflow. The really great thing about branching is that it allows us to sandbox our experiments. So, if I want to work on a new feature, I create a new branch, and I start working on it. If someone comes along and says, 'Hey, something is broken in production,' I can switch back to master, make the changes there, commit them, push them out, and then go back to working on my experiment. Git makes this process super easy.

00:01:07.200 However, things can get complicated. If, in your experimental branch, you make some changes to the database that aren't compatible with what's in master—like renaming a column, renaming a table, or changing a relationship from one-to-one to one-to-many or many-to-many—it becomes a total pain to manage your database while switching back and forth between those branches.

00:02:05.520 Thinking about this, the thing we would really like is to be able to branch the database too. That's a great idea, but how do we do that? I'm going to show you a few ideas, and these examples will use MySQL, but the concepts here can apply to any database or data store.

00:02:15.599 The general process will involve stopping MySQL, swapping in different database files, and then restarting MySQL. We all know how to stop and start MySQL, so the tricky part is figuring out how we can swap in the different files. The first thing we could do is to simply copy the entire database. We stop MySQL, move our MySQL directory, name it after our master branch, copy everything, swap our copy into place, and then restart MySQL. That works, but if my database is like 100 gigabytes, it takes forever to make those copies.
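
The copy-based steps above can be sketched as a dry-run plan in Ruby. This is an illustrative sketch, not the speaker's code: the data directory path, the `mysql.server` stop/start script, the directory naming scheme, and the use of a symlink to put the copy in place are all assumptions. The function only returns the commands; nothing is executed.

```ruby
# Dry-run sketch of the "copy the whole data directory" approach.
# Assumptions (not from the talk): the paths, the "<dir>-<branch>" naming,
# the mysql.server script, and using a symlink to swap the copy into place.

# Returns the commands, as argv arrays, that would branch the database by
# making a full copy of the data directory. Nothing is executed here.
def copy_branch_plan(data_dir:, from_branch:, to_branch:)
  from_dir = "#{data_dir}-#{from_branch}"
  to_dir   = "#{data_dir}-#{to_branch}"
  [
    %w[mysql.server stop],            # stop the server before touching files
    ["mv", data_dir, from_dir],       # keep the current data under the old branch name
    ["cp", "-Rp", from_dir, to_dir],  # full copy: slow and doubles the disk usage
    ["ln", "-s", to_dir, data_dir],   # point the original path at the new copy
    %w[mysql.server start],
  ]
end
```

Each entry could be passed to `system(*argv)`. The `cp` step is the one that takes forever on a large database, which is the cost the rest of the talk tries to avoid.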

00:02:44.640 Additionally, what if the copy doesn't finish because my SSD runs out of space? Clearly, that's a problem. So let's consider something else. We're looking for a Git-style branching functionality, so we think maybe we can put the entire database into a Git repository.

00:03:01.519 The process would be the same: stop MySQL, add everything into Git, check out a new branch, restart the server, and see if that works. It does work, but it still takes forever to add large files in your database directory to Git, and you're using a significant amount of extra disk space. Git does some internal compression, so you won't waste as much space, but you're still using more than you would like.

00:03:22.740 On the bright side, switching to a new branch is really fast: you're on master, you check out a new branch, and that happens super quick. However, you still have to work out how to commit changes to your database, and at that point Git starts to feel like the wrong tool for this job.

00:04:04.360 Then I thought, 'Hey, there’s this thing called ZFS.' How many people are familiar with ZFS? Cool! ZFS was developed by Sun; they started writing it in 2001 and had its first public release in 2005. It was developed as the next-generation file system for Solaris.

00:04:31.740 ZFS is interesting because it combines a standard file system, a volume manager, and software RAID into one component. If you're using the standard set of Linux tools to achieve similar functionality, you would need a file system like XFS or ext4, LVM as your volume manager, and mdadm for software RAID. ZFS uses a copy-on-write transaction model on disk, meaning that anytime you're changing a block of data, the modified data is written to a new location and the reference is then updated to point at that new block, rather than overwriting the original in place. It's similar to the copy-on-write model operating systems use to manage memory when forking, which allows you to spin up multiple processes without using excessive RAM at startup.

00:05:48.600 Because of this model, ZFS can provide fast and efficient snapshotting and cloning features. A snapshot is essentially a point-in-time representation of an existing file system. Once you take a snapshot, you can clone it to create a new file system based on that snapshot. Since this process involves updating references rather than duplicating data, it happens very quickly and uses minimal initial disk space. When you clone a new file system and make changes over time, ZFS stores the differences between those two file systems.
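
To make the snapshot and clone idea concrete, here is a toy in-memory model in Ruby. It is only an illustration of copy-on-write sharing under simplifying assumptions; the class and method names are hypothetical, and it is not how ZFS is implemented.

```ruby
# Toy in-memory model of copy-on-write snapshots and clones.
# Hypothetical names throughout; this illustrates the idea only.
class ToyFilesystem
  attr_reader :blocks

  # blocks maps a block number to a String holding that block's data.
  def initialize(blocks = {})
    @blocks = blocks
  end

  # A snapshot is a read-only copy of the block *references*, not the data.
  def snapshot
    @blocks.dup.freeze
  end

  # A clone starts out pointing at exactly the same blocks as the snapshot.
  def self.clone_from(snapshot)
    new(snapshot.dup)
  end

  # Writing never modifies an existing block. It stores the new data in a new
  # object and updates only this file system's reference to point at it.
  def write(block_no, data)
    @blocks[block_no] = data.dup.freeze
  end

  # Number of blocks that are the very same objects as in another block map.
  def shared_with(other_blocks)
    @blocks.count { |no, data| other_blocks[no].equal?(data) }
  end
end

master = ToyFilesystem.new({ 0 => "users", 1 => "orders" })
snap   = master.snapshot
branch = ToyFilesystem.clone_from(snap)
branch.write(1, "orders_v2")   # only this block diverges from the snapshot
```

After the write, `branch` still shares block 0 with the snapshot and holds its own version of block 1, while `master` is unchanged. That is the sense in which a clone only costs space for the differences.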

00:06:49.140 This approach starts to sound like Git branches. Additionally, ZFS is not just for Solaris anymore; ports exist for macOS, Linux, and FreeBSD. If we apply ZFS to our problem at hand, we can do something like this: a ZFS file system name consists of the name of the storage pool followed by the name of the file system. Here we're looking at 'myapp/mysql'. When we take a snapshot and create a new file system, we can name it after our new branch.

00:07:25.220 We snapshot and clone a new file system, naming it after our new branch, and then swap it into place and restart MySQL. This is efficient and incredibly fast while using virtually no extra disk space. However, the drawback is that typing all of that out every time you want to branch is tedious. Nobody is likely to do that consistently.
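
The sequence just described might look like the following dry-run plan in Ruby. The talk does not show the exact commands, so treat this as a sketch under assumptions: the naming scheme for snapshots and clones, the mount point, the `mysql.server` script, and the choice to swap file systems by changing `mountpoint` properties are all illustrative. The `zfs snapshot`, `zfs clone`, and `zfs set mountpoint=...` subcommands themselves are standard ZFS commands.

```ruby
# Dry-run sketch: returns the commands to branch a MySQL data set with ZFS.
# Assumptions (not from the talk): the "<dataset>-<branch>" clone naming, the
# mount point handling, and the mysql.server stop/start script.
def zfs_branch_plan(dataset:, branch:, mountpoint:)
  # Branch names such as "feature/foo" contain characters that would change
  # the meaning of a data set name, so replace anything unusual with "-".
  safe     = branch.gsub(/[^A-Za-z0-9_.-]/, "-")
  snapshot = "#{dataset}@#{safe}"   # point-in-time view of the current data
  clone    = "#{dataset}-#{safe}"   # writable file system based on the snapshot
  [
    %w[mysql.server stop],
    ["zfs", "snapshot", snapshot],
    ["zfs", "clone", snapshot, clone],
    ["zfs", "set", "mountpoint=none", dataset],        # unmount the original
    ["zfs", "set", "mountpoint=#{mountpoint}", clone], # mount the clone in its place
    %w[mysql.server start],
  ]
end
```

Running it for real would be something like `zfs_branch_plan(...).each { |argv| system(*argv) or abort }`, typically with root privileges. Unlike a full copy, the snapshot and clone steps only create references, so they complete almost immediately regardless of database size.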

00:08:19.560 So, this sounds like a job for Git hooks! Luckily, Git has a post-checkout hook that we can use for this process. This hook triggers every time you check out a file or branch. There is a parameter identifying whether you're checking out a branch, which allows us to automate this task.

00:09:04.920 I've written a Git hook that implements this functionality. It wraps everything we discussed earlier in a bit of Ruby. This runs every time we switch branches, automatically creating a new file system named after the current branch.
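
The speaker's actual hook is not reproduced in this transcript. As a rough idea of the shape such a script could take, here is a hypothetical sketch of a `.git/hooks/post-checkout` file in Ruby. Git does pass the hook three arguments (the previous HEAD, the new HEAD, and a flag that is `1` for a branch checkout and `0` for a file checkout); everything else, including the helper names, the `myapp/mysql` base data set, and the naming scheme, is an assumption.

```ruby
#!/usr/bin/env ruby
# Hypothetical sketch of a .git/hooks/post-checkout script; not the speaker's code.

# Git calls post-checkout with: previous HEAD, new HEAD, and a flag that is
# "1" for a branch checkout and "0" for a file checkout.
def branch_checkout?(argv)
  argv.length == 3 && argv[2] == "1"
end

# Turns a branch name into something safe to use in a ZFS data set name.
def dataset_suffix(branch)
  branch.gsub(/[^A-Za-z0-9_.-]/, "-")
end

# Decides which data set should back MySQL for the given branch.
# existing is the list of data set names that already exist.
def plan_for(branch, base:, existing:)
  target = "#{base}-#{dataset_suffix(branch)}"
  action = existing.include?(target) ? :reuse : :create
  { dataset: target, action: action }
end

# Returns a plan for branch checkouts, or nil when there is nothing to do.
def run_hook(argv)
  return nil unless branch_checkout?(argv)
  branch = `git rev-parse --abbrev-ref HEAD`.strip
  return nil if branch.empty? || branch == "HEAD" # detached HEAD: no branch name
  existing = `zfs list -H -o name`.split("\n")    # assumes zfs is installed
  plan_for(branch, base: "myapp/mysql", existing: existing)
end

if __FILE__ == $PROGRAM_NAME
  plan = run_hook(ARGV)
  # A real hook would stop MySQL, snapshot/clone or remount, and restart here.
  puts "post-checkout: #{plan[:action]} #{plan[:dataset]}" if plan
end
```

Keeping the decision logic in small functions that take plain arguments makes it possible to check the naming and the create-or-reuse choice without a ZFS pool or a running MySQL server.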

00:09:24.420 As a result, this system works seamlessly and efficiently. Restarting MySQL will take longer than creating the file system. This approach works well for standard development workflows. Additionally, I developed a Vagrant plugin called vagrant-zfs, which applies the same concept on a per-VM basis, making it excellent for testing database clusters.

00:10:11.339 Using this plugin, you can easily deploy multiple VMs that need the same seed data, with each one quickly getting its own file system. Again, this solution is fast, efficient, and uses virtually no extra disk space.

00:10:21.899 This functionality will soon be updated for the new version of Vagrant, and that’s all I have.

00:10:24.540 If anyone has questions, I have slides and a walkthrough available that shows you how to set this up on your Mac, along with the code in the post-checkout hook.