Brandon Keepers
Git: The NoSQL Database
Summarized using AI

by Brandon Keepers

The video, titled "Git: The NoSQL Database," features Brandon Keepers discussing the use of Git as a potential data store, extending its application beyond just code management. The talk explores the flexibility and advantages of utilizing Git's schema-less structure for storing data, as well as the challenges that come with it.

Key Points:

  • Introduction to Git as a Database:

    • Brandon introduces himself as a developer at GitHub and shares his interest in using Git as an unconventional database.
    • The concept is not new, but he has wanted to experiment with it.
  • Understanding Git's Structure:

    • Git is described as a content tracker, which supports the idea of using it as a data storage solution.
    • A basic understanding of how to initialize a Git repository and store JSON data is provided to illustrate its functionality as a database.
  • Functions of Git in Data Storage:

    • Git storage involves creating blobs that hold data, which can be referenced by SHA-1 checksums, making them unique based on content.
    • The process of updating data and the immutability of blobs are highlighted, distinguishing Git from traditional databases.
  • Usage of Trees and Commits:

    • Brandon explains that trees function similarly to directories, holding blobs or even other trees, providing a structured way to reference multiple data items.
    • The role of commits in marking specific states over time is discussed, further solidifying Git's capability to manage changes.
  • Practical Application:

    • An example application, "Gasket," was created to merge issue tracking with code repositories, showcasing practical implementation.
  • Challenges in Scaling:

    • The speaker addresses potential issues like concurrent access and load distribution when using Git as a data store.

Conclusion:

Brandon concludes that while using Git as a database can offer incredible flexibility and unique advantages through versioning and branching, it also presents challenges that should not be overlooked. Further exploration and experimentation within this framework could lead to innovative practices in data management.

The audience leaves with an understanding of the mechanics of Git as a database and the complexities involved in such a paradigm shift.

00:00:15.519 My name is Brandon Keepers. I'm a developer at GitHub, and you can find me online as bkeepers on both Twitter and GitHub. So feel free to send me a message. I'm here today mostly because I made the organizers feel guilty.
00:00:24.160 Back in March, when I saw that this was announced, I thought, 'Hey, this conference is on my birthday!' I submitted a talk, and that meant I should get to come to Hawaii for my birthday. So it worked out well.
00:00:36.399 This talk is about using Git as a database. I don't think I'm the only one with this idea. In fact, it was one of those ideas where you think, 'Yes, this is so revolutionary!' and then you Google it and realize thirty people have had the same idea and tried it.
00:00:41.600 But I still want to experiment with it. This was one of those moments like in the Windows commercials where the guy's in the shower, and he comes up with a beautiful idea: what if I could use Git as a data store? It works great for code; I love it for that.
00:00:53.280 But what would it look like to use it as your application's main database? One of my coworkers, John Hoyt, and I played around with this. We built an app called Gasket because we really liked Pivotal Tracker at the time. We thought, 'It would be really awesome to have those issues in our Git repository with the rest of the code.'
00:01:10.960 You can check it out on GitHub; it’s still up there, although there haven’t been any commits for a while. Before we get too far into this, I have a couple of disclaimers.
00:01:26.560 First, I work at GitHub, which you may or may not know. We use Git here, and a lot of people at GitHub know a lot about Git.
00:01:32.239 Despite appearances, like that guy with his toe in his mouth during the video shot, I assure you he's brilliant. Trust me. I will show you later the response I received when I was doing research for this talk. I’m not one of those people at GitHub who knows all about Git internals.
00:01:44.880 I know a fair amount about using Git and would say I'm a fairly advanced user, but when it comes to the internals of Git, I was pretty much a novice. So there’s my first disclaimer. My second disclaimer is about the term 'NoSQL.' I think it's a poor marketing term.
00:02:02.000 I love NoSQL databases; I use key-value stores all the time. I've reached a point where it pains me to use an SQL database. But I think 'NoSQL' is a bad term. When people talk about NoSQL, they typically mean a database that is schema-less and non-relational.
00:02:17.120 This means it doesn't care about the underlying data structure or schema, nor does it care about the relationships within your data. In these systems, you define the relationships and give them meaning yourself. So there are my two disclaimers.
00:02:32.239 Today, I want to look at how we can use Git as a NoSQL database, covering some reasons why this is really awesome, and also some reasons why it really sucks.
00:02:44.000 So, how is Git a database? If we look at Git's man page, it describes itself as the 'stupid content tracker.' It's so humble.
00:02:51.679 But I found this interesting: Git doesn't even claim to be a distributed version control system in that one-line description; it only claims to be a content tracker.
00:03:03.360 I'll be using the term 'database' rather loosely here. If we look at the dictionary definition, a database is a structured set of data held in a computer.
00:03:11.280 Basically, all of the data in your computer is structured because it's in a file system. So that's the working definition we have here.
00:03:17.040 So if it’s a database, how do we store data in it? Here we go with Git 101. We initialize the database by creating a directory and then using 'cd' to navigate into it.
00:03:25.960 Next, we call 'git init' to create a new database. We can then write data to our database. For instance, I will store JSON data.
00:03:35.760 I echo a JSON string into a file, add that file to Git, and then commit the changes, effectively adding data to our database.
00:03:46.720 We can read that data back out by using the command 'git show' followed by a revision and the file name, such as 'HEAD:1.json'. Awesome! Thanks for coming, that's how you use Git as a database.
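The "Git 101" flow described above can be sketched as a short shell session. This is a minimal illustration, not the speaker's slides: the temporary directory, the JSON record, the commit message, and the inline author identity are all assumptions made so the sketch runs on its own.

```shell
# Work in a throwaway directory so nothing else is touched.
db=$(mktemp -d)
cd "$db"
git init -q

# Write a JSON document to a file; the file path acts as the key.
echo '{"id": 1, "name": "kumquat"}' > 1.json

# Stage and commit it. The identity is passed inline (hypothetical values)
# so the sketch runs even where user.name/user.email are not configured.
git add 1.json
git -c user.name='Example' -c user.email='example@example.com' \
    commit -q -m 'Add record 1'

# Read the record back out of the committed state.
record=$(git show HEAD:1.json)
echo "$record"
```

Note that the read goes through a revision ('HEAD:1.json') rather than the working file, so it returns what was committed even if the file on disk has since changed.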
00:03:55.520 Of course, a file system is obviously not a good data store. If we want to get technical, it is a key-value store: you give it a key, which is a path, and stick some value in it.
00:04:02.560 Other key-value stores like Redis, Cassandra, and Amazon Dynamo are all awesome databases. So let's look at how we would use Git more like one of these databases, instead of directly accessing files.
00:04:12.400 If we just followed the file system route, we'd quickly run into issues, as there are about three operations I have to perform on the file system before I can commit.
00:04:19.279 This means that only one process can perform those operations at a time, so we need to dig deeper to understand how to use this.
00:04:25.000 Most of us are probably familiar with the everyday Git commands, known as 'porcelain' commands, like 'git add,' 'git commit,' and 'git show.'
00:04:34.640 However, there's more to Git than just those commands. If you look through the Git man pages, you'll find many commands known as 'plumbing.'
00:04:43.680 These plumbing commands are the foundation of how Git is structured. They consist of really small components, which are molded together by higher-level commands.
00:04:52.480 To use Git as a database, let’s explore what happens with 'git add' and 'git commit' that gives us this structure.
00:04:57.920 By doing this, we will learn more about Git’s data structure. At the lowest level in Git, we have blobs. Does anyone know what blobs are?
00:05:04.880 A few people do? So in Git, blobs are the fundamental building blocks. We can take any content we want and store it into a blob.
00:05:11.440 Using the command line, we can echo a string we want to save and pipe it into 'git hash-object --stdin,' which generates a SHA-1 checksum for that content.
00:05:22.240 When we write the blob, Git stores it in a file named after that SHA. All blobs are unique based on their contents.
00:05:29.120 If I take the same data and create another blob, it's going to point to that exact same blob. We save it by passing '-w' to 'git hash-object' to write it.
00:05:37.600 In this way, we can actually see the unique SHA that represents those contents, which are now stored in our Git database.
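As a rough sketch of the blob step just described, the commands below hash the same content three times and compare the resulting IDs. The JSON string and the temporary directory are placeholders chosen for illustration.

```shell
db=$(mktemp -d)
cd "$db"
git init -q

# Without -w, hash-object only computes the ID; nothing is stored.
sha=$(echo '{"id": 1, "name": "kumquat"}' | git hash-object --stdin)

# With -w, the blob is also written into .git/objects.
written=$(echo '{"id": 1, "name": "kumquat"}' | git hash-object -w --stdin)

# Same content, same ID: writing again points at the same blob.
again=$(echo '{"id": 1, "name": "kumquat"}' | git hash-object -w --stdin)

echo "$sha"
# Loose objects are stored under a directory named after the
# first two hex digits of the ID.
find .git/objects -type f
```

Because the ID is derived from the content, two clients that write the same value independently end up with the same key.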
00:05:47.760 When we initialized that directory with 'git init,' it created the '.git' directory. And after hashing our objects, we discover that there's one blob stored in there.
00:05:59.440 To read these blobs back out, we can look directly into the file system. We can find the '.git/objects' directory and see the files stored with our SHAs.
00:06:07.760 If we tried to view the file directly, it would appear as binary data, because Git compresses its objects. Instead, we can use 'git cat-file -p' followed by the SHA to retrieve the content.
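Reading a blob back can be sketched like this. The stored value is again a made-up JSON record; '-t' and '-p' are standard 'git cat-file' options for printing an object's type and its content.

```shell
db=$(mktemp -d)
cd "$db"
git init -q

# Store a value and keep the ID that Git hands back.
sha=$(echo '{"id": 1, "name": "kumquat"}' | git hash-object -w --stdin)

# -t prints the object type; -p pretty-prints the stored content.
type=$(git cat-file -t "$sha")
content=$(git cat-file -p "$sha")

echo "$type"
echo "$content"
```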
00:06:15.760 At this point, it’s starting to look a little more like a database. We can have multiple clients writing blobs without them stomping on each other.
00:06:24.560 How do we update data? Git’s underlying data model uses the contents to generate a SHA, which makes blobs immutable.
00:06:32.240 Once we store something in it, we can’t change it. If we change it, it will create a new ID, which isn’t helpful in a database context.
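To make the immutability point concrete, here is a small sketch (with illustrative values) in which an "updated" record is written. The update does not modify the existing blob; it produces a second blob with a different ID, and the original stays readable under its old ID.

```shell
db=$(mktemp -d)
cd "$db"
git init -q

# Original record.
old=$(echo '{"id": 1, "name": "kumquat"}' | git hash-object -w --stdin)

# "Updating" the record really means writing a new blob.
new=$(echo '{"id": 1, "name": "mango"}' | git hash-object -w --stdin)

# Both versions remain available under their own IDs.
old_content=$(git cat-file -p "$old")
new_content=$(git cat-file -p "$new")
echo "$old -> $old_content"
echo "$new -> $new_content"
```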
00:06:40.000 Next, we have trees.
00:06:49.840 Trees are like directories in your file system; they can hold a list of blobs or even other trees.
00:06:56.560 What we want to do is stage some changes into a tree—essentially telling it to keep a reference to this blob that we created.
00:07:06.320 We can do this with 'git update-index' to add the cache information, which consists of the file mode of the file we're adding, its SHA, and its path.
00:07:14.640 Next, we run 'git write-tree,' which will generate another SHA.
00:07:22.360 Just like blobs, the SHA of trees is dependent on the content they contain.
00:07:30.080 Now we have a tree that points to a blob, and both have their own SHAs. You can also have multiple blobs in a tree, which looks like a directory with files.
00:07:38.080 You can even have a tree pointing to another tree with blobs underneath that.
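The tree steps above can be sketched as follows. The paths '1.json' and 'notes/2.json' and the record contents are assumptions for illustration; 100644 is Git's mode for a regular, non-executable file.

```shell
db=$(mktemp -d)
cd "$db"
git init -q

# Two blobs to reference from trees.
blob1=$(echo '{"id": 1, "name": "kumquat"}' | git hash-object -w --stdin)
blob2=$(echo '{"id": 2, "name": "mango"}' | git hash-object -w --stdin)

# Stage each blob under a path without ever creating a working file.
git update-index --add --cacheinfo 100644 "$blob1" 1.json
git update-index --add --cacheinfo 100644 "$blob2" notes/2.json

# Write the staged state out as tree objects; the top-level ID is printed.
tree=$(git write-tree)

# The top-level tree holds one blob and one nested tree ("notes").
git ls-tree "$tree"

# -r walks into nested trees and lists every blob path.
git ls-tree -r --name-only "$tree"
```

Staging a path that contains a slash is what makes Git create the nested tree, which matches the "tree pointing to another tree" case.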
00:07:46.080 The next important component is commits.
00:07:59.680 We've discussed blobs and trees, but now we need a marker in time that indicates the state of the tree at a specific moment.
00:08:05.680 To do this, we use 'git commit-tree.' We provide a commit message along with the SHA of the tree we want to commit.
00:08:12.640 This creates a commit object that points to that tree, which in turn points to the respective blobs.
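A minimal sketch of the commit step, assuming an illustrative record, commit message, and author identity (passed inline so the example does not depend on local Git configuration):

```shell
db=$(mktemp -d)
cd "$db"
git init -q

# Build a tree that references one blob.
blob=$(echo '{"id": 1, "name": "kumquat"}' | git hash-object -w --stdin)
git update-index --add --cacheinfo 100644 "$blob" 1.json
tree=$(git write-tree)

# commit-tree reads the message from standard input and prints the
# ID of the new commit object.
commit=$(echo 'Add record 1' | \
    git -c user.name='Example' -c user.email='example@example.com' \
    commit-tree "$tree")

# The commit records the tree it points to, plus author and message.
git cat-file -p "$commit"
```

Unlike blobs and trees, the commit ID also depends on the author and timestamp, so it will differ from run to run.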
00:08:19.920 However, we still can't look up our file, '1.json,' by a friendly name; we would have to remember the commit's SHA.
00:08:27.520 The final step is writing a reference. This reference serves as a symbolic link to the latest state of our database.
00:08:35.760 We use 'git update-ref' and give it a friendly name, such as 'refs/heads/master.'
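Putting the plumbing steps together, the sketch below writes a blob, a tree, and a commit, then points 'refs/heads/master' at the commit so the record can be read back by name. The record, message, and identity are illustrative assumptions.

```shell
db=$(mktemp -d)
cd "$db"
git init -q

# Blob -> tree -> commit, using only plumbing commands.
blob=$(echo '{"id": 1, "name": "kumquat"}' | git hash-object -w --stdin)
git update-index --add --cacheinfo 100644 "$blob" 1.json
tree=$(git write-tree)
commit=$(echo 'Add record 1' | \
    git -c user.name='Example' -c user.email='example@example.com' \
    commit-tree "$tree")

# Give the commit a friendly name.
git update-ref refs/heads/master "$commit"

# The record is now reachable by branch name and path.
record=$(git show master:1.json)
echo "$record"
```

No working-tree file is ever created here; every write goes straight into the object store, which is what makes this route more database-like than the 'git add' and 'git commit' flow.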
Explore all talks recorded at Aloha RubyConf 2012