00:00:15.519
My name is Brandon Keepers. I'm a developer at GitHub, and you can find me online as bkeepers on both Twitter and GitHub, so feel free to send me a message. I'm here today mostly because I made the organizers feel guilty.
00:00:24.160
Back in March, when I saw that this was announced, I thought, 'Hey, this conference is on my birthday!' I submitted a talk, and that meant I should get to come to Hawaii for my birthday. So it worked out well.
00:00:36.399
This talk is about using Git as a database. I don't think I'm the only one with this idea. In fact, when I first thought of it, it was one of those ideas where you think, 'Yes, this is so revolutionary!' and then you Google it and realize thirty people have already had the same idea and tried it.
00:00:41.600
But I still want to experiment with it. This was one of those moments like in the Windows commercials where the guy's in the shower, and he comes up with a beautiful idea: what if I could use Git as a data store? It works great for code; I love it for that.
00:00:53.280
But what would it look like to use it as your application's main database? One of my coworkers, John Hoyt, and I played around with this. We built an app called Gaskit because we really liked Pivotal Tracker at the time. We thought, 'It would be really awesome to have those issues in our Git repository with the rest of the code.'
00:01:10.960
You can check it out on GitHub; it’s still up there, although there haven’t been any commits for a while. Before we get too far into this, I have a couple of disclaimers.
00:01:26.560
First, I work at GitHub, which you may or may not know. We use Git here, and a lot of people at GitHub know a lot about Git.
00:01:32.239
Despite appearances, like that guy with his toe in his mouth during the video shot, I assure you he's brilliant. Trust me. I will show you later the response I received when I was doing research for this talk. I’m not one of those people at GitHub who knows all about Git internals.
00:01:44.880
I know a fair amount about using Git and would say I'm a fairly advanced user, but when it comes to the internals of Git, I'm pretty much a novice. So there's my first disclaimer. My second disclaimer is about the term 'NoSQL.' I think it's a poor marketing term.
00:02:02.000
I love NoSQL databases; I use key-value stores all the time. I've reached a point where it pains me to use an SQL database. But I think 'NoSQL' is a bad term. When people talk about NoSQL, they typically mean a database that is schema-less and non-relational.
00:02:17.120
This means it doesn't care about the underlying data structure or schema, nor does it care about the relationships within your data. In these systems, you define the relationships and give them meaning yourself. So there are my two disclaimers.
00:02:32.239
Today, I want to look at how we can use Git as a NoSQL database, covering some reasons why this is really awesome, and also some reasons why it really sucks.
00:02:44.000
So, how is Git a database? If we look at Git's man page, it describes itself as the 'stupid content tracker.' It's so humble.
00:02:51.679
But I found this interesting: Git doesn't even claim to be a distributed version control system in that one-line description; it only claims to be a content tracker.
00:03:03.360
I'll be using the term 'database' rather loosely here. If we look at the dictionary definition, a database is a structured set of data held in a computer.
00:03:11.280
Basically, all of the data in your computer is structured because it's in a file system. So that's the working definition we have here.
00:03:17.040
So if it’s a database, how do we store data in it? Here we go with Git 101. We initialize the database by creating a directory and then using 'cd' to navigate into it.
00:03:25.960
Next, we call 'git init' to create a new database. We can then write data to our database. For instance, I will store JSON data.
00:03:35.760
I echo a JSON string into a file, add that file to Git, and then commit the changes, effectively adding data to our database.
00:03:46.720
We can read that data back out by using the command 'git show' followed by the file name. Awesome! Thanks for coming, that's how you use Git as a database.
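The Git 101 steps just described can be sketched as a short shell session. The JSON payload, the commit message, and the identity values are placeholders, not the ones from the slides. The file name '1.json' is the one mentioned later in the talk. The 'HEAD:' prefix is an assumption on my part: it tells 'git show' to print the file as stored in the latest commit.

```shell
# Create a scratch directory to act as the "database" (temp path is arbitrary)
repo=$(mktemp -d)
cd "$repo"
git init -q

# Write a record: a JSON document stored under the path 1.json
echo '{"name": "example"}' > 1.json
git add 1.json
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "Add record 1"

# Read the record back out of the latest commit
record=$(git show HEAD:1.json)
echo "$record"   # prints the JSON document we stored
```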
00:03:55.520
Of course, a file system is obviously not a good data store. If we want to get technical, it is a key-value store: you give it a key, which is a path, and stick some value in it.
00:04:02.560
Other key-value stores like Redis, Cassandra, and Amazon Dynamo are all awesome databases. So let's look at how we would use Git more like one of these databases, instead of directly accessing files.
00:04:12.400
If we just followed the file system route, we'd quickly run into issues, as there are about three operations I have to perform on the file system before I can commit.
00:04:19.279
Because those steps all share one working directory and one index, only one process can safely perform them at a time, so we need to dig deeper to understand how to use Git as a database.
00:04:25.000
Most of us are probably familiar with the everyday Git commands, known as 'porcelain' commands, like 'git add,' 'git commit,' and 'git show.'
00:04:34.640
However, there's more to Git than just those commands. If you look through the Git man pages, you'll find many commands known as 'plumbing.'
00:04:43.680
These plumbing commands are the foundation of how Git is structured. They are really small components, which the higher-level commands glue together.
00:04:52.480
To use Git as a database, let’s explore what happens with 'git add' and 'git commit' that gives us this structure.
00:04:57.920
By doing this, we will learn more about Git’s data structure. At the lowest level in Git, we have blobs. Does anyone know what blobs are?
00:05:04.880
A few people do? So in Git, blobs are the fundamental building blocks. We can take any content we want and store it into a blob.
00:05:11.440
Using the command line, we can echo a string we want to save and pipe it into 'git hash-object --stdin,' which computes a SHA-1 checksum for that content.
00:05:22.240
That SHA is also the name of the file the content gets stored under. All blobs are unique based on their contents.
00:05:29.120
If I try to take the same data and create another blob, it's going to point to that exact same blob. We can save it using 'git hash-object -w' to write it.
00:05:37.600
In this way, we can actually see the unique SHA that represents those contents, which are now stored in our Git database.
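As a rough sketch of the blob steps above, assuming a scratch repository and a made-up JSON string: without '-w' the command only computes the ID, and with '-w' it also writes the blob into the object database.

```shell
# Scratch repository (temp path is arbitrary)
repo=$(mktemp -d)
cd "$repo"
git init -q

# Compute the SHA only; nothing is written to .git/objects yet
sha_dry=$(echo '{"name": "example"}' | git hash-object --stdin)

# -w computes the same SHA and also writes the blob to the object database
sha=$(echo '{"name": "example"}' | git hash-object -w --stdin)

# Identical content always yields the identical ID
echo "$sha_dry"
echo "$sha"
```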
00:05:47.760
When we initialized that directory with 'git init,' it created the '.git' directory. And after hashing our objects, we discover that there's one blob stored in there.
00:05:59.440
To read these blobs back out, we can look directly into the file system. We can find the '.git/objects' directory and see the files stored with our SHAs.
00:06:07.760
If we tried to view the file directly, it would appear as binary data, because Git compresses objects on disk. Instead, we can use 'git cat-file -p' followed by the SHA to retrieve the content.
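Here is a small sketch of reading a blob back, again with a placeholder JSON string. It relies on the standard loose-object layout, where the first two hex characters of the SHA name a directory and the rest name the file.

```shell
# Scratch repository with one blob in it
repo=$(mktemp -d)
cd "$repo"
git init -q
sha=$(echo '{"name": "example"}' | git hash-object -w --stdin)

# Loose objects live at .git/objects/<first 2 hex chars>/<remaining chars>
dir=$(echo "$sha" | cut -c1-2)
file=$(echo "$sha" | cut -c3-)
ls ".git/objects/$dir/$file"

# The file on disk is compressed; cat-file decodes it for us
type=$(git cat-file -t "$sha")      # object type
content=$(git cat-file -p "$sha")   # the original content
echo "$type"
echo "$content"
```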
00:06:15.760
At this point, it’s starting to look a little more like a database. We can have multiple clients writing blobs without them stomping on each other.
00:06:24.560
How do we update data? Git’s underlying data model uses the contents to generate a SHA, which makes blobs immutable.
00:06:32.240
Once we store something in it, we can’t change it. If we change it, it will create a new ID, which isn’t helpful in a database context.
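To make the immutability point concrete, here is a minimal sketch with two made-up versions of a record. "Updating" the content does not touch the first blob; it simply produces a second blob with a different ID.

```shell
# Scratch repository
repo=$(mktemp -d)
cd "$repo"
git init -q

# Store the original record, then an "updated" version of it
v1=$(echo '{"name": "example"}' | git hash-object -w --stdin)
v2=$(echo '{"name": "changed"}' | git hash-object -w --stdin)

# Two different IDs; the first blob is still there, unchanged
echo "$v1"
echo "$v2"
old=$(git cat-file -p "$v1")
echo "$old"
```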
00:06:40.000
Next, we have trees.
00:06:49.840
Trees are like directories in your file system; they can hold a list of blobs or even other trees.
00:06:56.560
What we want to do is stage some changes into a tree, essentially telling it to keep a reference to this blob that we created.
00:07:06.320
We can do this with 'git update-index --add --cacheinfo' to add the cache information, which consists of the file mode and name of the file we're adding, along with its SHA.
00:07:14.640
Next, we run 'git write-tree,' which will generate another SHA.
00:07:22.360
Just like blobs, a tree's SHA depends on its contents.
00:07:30.080
Now we have a tree that points to a blob, and both have their own SHAs. You can also have multiple blobs in a tree, which looks like a directory with files.
00:07:38.080
You can even have a tree pointing to another tree with blobs underneath that.
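The tree steps might look roughly like this. The mode '100644' means a regular file, and the JSON payload is a placeholder. The nested-tree part uses 'git read-tree --prefix', and the 'archive/' directory name is purely illustrative; neither is something the talk shows.

```shell
# Scratch repository with one blob in it
repo=$(mktemp -d)
cd "$repo"
git init -q
blob=$(echo '{"name": "example"}' | git hash-object -w --stdin)

# Stage the blob in the index: file mode, blob SHA, and path
git update-index --add --cacheinfo 100644 "$blob" 1.json

# Write the index out as a tree object and list its entries
tree=$(git write-tree)
git ls-tree "$tree"

# A tree can also point to another tree: read the first tree into the
# index under a subdirectory, then write a new top-level tree
git read-tree --prefix=archive/ "$tree"
nested=$(git write-tree)
git ls-tree "$nested"
```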
00:07:46.080
The next important component is commits.
00:07:59.680
We've discussed blobs and trees, but now we need a marker in time that indicates the state of the tree at a specific moment.
00:08:05.680
To do this, we use 'git commit-tree.' We provide a commit message along with the SHA of the tree we want to commit.
00:08:12.640
This creates a commit object that points to that tree, which in turn points to the respective blobs.
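A sketch of creating the commit object by hand follows. The identity values and commit message are placeholders; 'git commit-tree' needs an author and committer, which are supplied here through environment variables.

```shell
# Scratch repository with one blob and one tree in it
repo=$(mktemp -d)
cd "$repo"
git init -q
blob=$(echo '{"name": "example"}' | git hash-object -w --stdin)
git update-index --add --cacheinfo 100644 "$blob" 1.json
tree=$(git write-tree)

# commit-tree needs an identity (placeholder values)
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

# The commit message is read from standard input
commit=$(echo "Add record 1" | git commit-tree "$tree")

# The commit object records the tree it points to, plus author and message
git cat-file -p "$commit"
```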
00:08:19.920
However, we still can't look up our file, '1.json,' by a friendly name; we would have to already know the SHA of the commit.
00:08:27.520
The final step is writing a reference. This reference serves as a symbolic link to the latest state of our database.
00:08:35.760
We use 'git update-ref' and give it a friendly name, such as 'refs/heads/master.'
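Putting the whole chain together, here is a minimal sketch that goes from blob to tree to commit to reference without ever touching the working directory. The payload, message, and identity are placeholders, and reading the record back with 'git show master:1.json' is my assumption about the lookup syntax.

```shell
# Scratch repository
repo=$(mktemp -d)
cd "$repo"
git init -q
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

# blob -> tree -> commit
blob=$(echo '{"name": "example"}' | git hash-object -w --stdin)
git update-index --add --cacheinfo 100644 "$blob" 1.json
tree=$(git write-tree)
commit=$(echo "Add record 1" | git commit-tree "$tree")

# Point a friendly name at the commit
git update-ref refs/heads/master "$commit"

# Now the record can be found by reference name and path
record=$(git show master:1.json)
echo "$record"   # prints the JSON document we stored
```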