Keynote: Wrapping Your Head Around Git
Summarized using AI

by Paolo "Nusco" Perrotta

In the keynote titled "Wrapping Your Head Around Git," Paolo Perrotta explores the intricacies of Git, aiming to demystify its underlying structures. He argues that many new users struggle with Git despite knowing basic commands because they lack an understanding of the .git directory and the tool's core principles.

Key Points Discussed:

  • Git as an Onion:
    Perrotta begins by simplifying Git to its core functions, shedding light on its fundamental components by comparing Git to an onion—removing layers of complexity to reveal a simple content tracker.
  • Persistent Hash Map:
    He describes Git as a persistent hash map that utilizes SHA-1 hashing for storing files, emphasizing that identical content will always yield the same hash, thereby ensuring data integrity.
  • Directory Structure in .git:
    He explains the anatomy of the .git directory, detailing how commits and branches are organized, including the objects directory that houses all committed content.
  • Commit Objects and Tree Structure:
    Each commit references a tree, which in turn references blobs (file contents) and further trees. Identical files are stored only once, which keeps versions cheap to store.
  • Version Management:
    Perrotta illustrates how new commits create versions while maintaining a reference to previous commits, creating a branching history that enables efficient project management.
  • Branching and Merging:
    The significance of branches is discussed, emphasizing that branches function as pointers to commits and allow users to manage different lines of development seamlessly. The merging process reconciles changes from multiple branches.
  • Distributed Version Control:
    The keynote also outlines Git’s distribution mechanism, describing how cloning a repository transfers the entire history and how commands like ‘git fetch’ and ‘git push’ facilitate smooth collaboration between local and remote repositories.

Conclusion:

Paolo Perrotta concludes that understanding the internal workings of Git is crucial for using it effectively, encouraging users to appreciate its sophisticated design and the elegance behind its operations. Mastery of these concepts can significantly enhance users' capabilities in version control and collaborative software development.

In summary, Perrotta aims to empower developers with the knowledge of Git's structure, asserting that profound understanding transforms the use of Git from a basic command-line experience to a robust, elegant system for version control.

00:00:14.920 Consider the entire concept of Git as an onion. If you remove distribution, what you're left with is solely version control. Forget about the idea of versioning itself, and pretend that every project has only one commit. This simplification makes things much easier to understand.
00:00:27.760 Once we remove that outer layer of complexity, you're left with what Git calls itself: a stupid content tracker. You provide it with files and directories, and it stores those files and directories away. But that's still too big of a concept to grasp right now.
00:00:39.120 Let’s dive deeper and go straight to the core of Git. What I would argue Git is, at its heart, is a persistent hash map. It may sound peculiar, but allow me to demonstrate.
00:01:03.960 When you give Git a piece of content—for example, the string 'something'—it returns a hash, specifically a 20-byte hash. This functionality, like many others in Git, operates at a low level, but you can access it from the command line. There is a command called 'git hash-object' that allows you to do this.
00:01:20.280 If I run this command here, Git expects to read from a file. Since I don’t have a file, I'll read from standard input instead. I'll pipe the string into it with the echo command.
00:01:32.799 So I will type 'something' literally. For those unfamiliar with Unix commands, I am piping this string into the command, and what I get back is a hash beginning with 'de'. This is the same hash I noted earlier, demonstrating that identical content produces the same hash every time.
00:01:43.360 This is immensely important because Git uses these hashes everywhere. All of your files are hashed in this manner with the SHA-1 algorithm, so let's call these hashes SHA-1s for short.
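The hashing step described above can be reproduced without Git. This is a minimal sketch, assuming the standard blob format (a `blob <size>` header, a NUL byte, then the content); the function name `git_blob_hash` is made up for illustration.

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 ID Git assigns to a blob with this content."""
    # Git hashes a small header ("blob <size>" plus a NUL byte) followed by the content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

# `echo something` emits the text plus a trailing newline.
first = git_blob_hash(b"something\n")
second = git_blob_hash(b"something\n")
other = git_blob_hash(b"something else\n")

print(first)            # 40 hex digits, i.e. 20 bytes
print(first == second)  # identical content gives an identical hash
print(first == other)   # different content gives a different hash
```

The header is why the result differs from a plain SHA-1 of the same bytes.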
00:02:01.680 Some newcomers to Git might ask whether there are enough SHA-1s to represent all possible files. What if two different files in my project have the same SHA-1 hash? Yes, that would be unfortunate, but it's extremely unlikely to happen. Let's crunch some numbers for perspective. The odds of winning the American lottery are about one in 175 million. That's hard to visualize, so let me illustrate it.
00:02:42.840 Imagine if I created 175 million fortune cookies and put a different number inside each, including the jackpot number. If a fortune cookie is about 5 centimeters long, you could create a line of them reaching across continents, and you'd be amazed to find you would end up close to where I live, around Venice.
00:03:07.360 Now, suppose you were to walk that entire line of fortune cookies. At some point, you'd get hungry and feel the urge to eat one. You are only allowed to eat one fortune cookie during the entire voyage. When you open it, lucky you! You find the jackpot number. But wait, this gives you a 'gambler's mindset.' You say, 'I must be lucky; let me do it again!' You walk back, eat another cookie, and remarkably, you win again. Your friends start calling you a lucky bastard!
00:03:39.799 However, the chance of two random hashes colliding is astronomically small. It's not like winning the jackpot once or twice; it's like winning it over and over, many times in a row. Hence, SHA-1 hashes are not just unique within a single project; for practical purposes, they are unique in the universe.
00:04:43.480 You could take every software project on Earth, place them into the same Git repository, and despite potential performance issues, you'd still experience no hash collisions.
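To put a rough number on that claim, here is a back-of-the-envelope sketch. It assumes SHA-1 behaves like a uniformly random 160-bit function for ordinary content (it ignores deliberately engineered collisions), and the figure of one trillion objects is an arbitrary stand-in for "every project on Earth", not a number from the talk.

```python
HASH_BITS = 160  # a SHA-1 hash is 20 bytes

def collision_upper_bound(num_objects: int, bits: int = HASH_BITS) -> float:
    """Upper bound on the chance that any two of num_objects hashes are equal.

    Union bound over all pairs: (number of pairs) / (number of possible hashes).
    """
    pairs = num_objects * (num_objects - 1) // 2
    return pairs / (2 ** bits)

# Hypothetical repository holding one trillion distinct objects.
everything = 10 ** 12
print(f"{collision_upper_bound(everything):.2e}")
```

Even with that many objects, the bound stays many orders of magnitude below the odds of a single lottery win.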
00:05:02.759 Now, shifting back to our hash map, I mentioned it being a persistent hash map. To make it persistent, you can use the '-w' switch in the 'git hash-object' command.
00:05:26.640 If I try that here, it breaks because the object doesn't have a home; there is no repository to write to. Every Git project contains a '.git' directory in its root where Git stores configuration settings and the object database.
00:06:02.680 So, let me create a new directory. I'll be using 'git init' to set it up, which creates the required .git folder in the root directory. Now, if I check inside, I can see an objects directory that contains a couple of folders named info and pack—just ignore those for now. Currently, there are zero objects, as I've just initialized it.
00:06:49.720 Now, let's generate the hash again and store the content in the object database. If you look inside the objects directory again, you will find a folder named 'de', which, not coincidentally, matches the first two characters of the hash. It contains a file named after the remaining characters of the hash. The reason for this structure is to avoid cluttering a single directory with too many objects.
00:07:18.080 Inside the file is the content, prefixed with a small header and compressed with zlib. Another command I can use is 'git cat-file', to which I can pass the hash or just the first few digits of it. When I run it with '-t', it prints the type of the object, and Git calls it a blob. Your files in Git are referred to as blobs.
00:08:07.080 If I run it with '-p', it will show the content: 'something'. So far, so good.
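The storage layout just described can be sketched in a few lines. This is an illustration under stated assumptions, not Git's implementation: `write_object` and `read_object` are hypothetical helpers, a temporary directory stands in for `.git/objects`, and unlike `git cat-file` the reader needs the full hash rather than a prefix.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path
from typing import Tuple

def write_object(objects_dir: Path, obj_type: str, content: bytes) -> str:
    """Store content as a loose object and return its hash."""
    data = f"{obj_type} {len(content)}".encode("ascii") + b"\x00" + content
    sha = hashlib.sha1(data).hexdigest()
    # First two hex digits name the folder, the remaining 38 name the file.
    path = objects_dir / sha[:2] / sha[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(data))
    return sha

def read_object(objects_dir: Path, sha: str) -> Tuple[str, bytes]:
    """Return (type, content), roughly what `git cat-file -t` and `-p` report."""
    raw = zlib.decompress((objects_dir / sha[:2] / sha[2:]).read_bytes())
    header, _, content = raw.partition(b"\x00")
    obj_type, _size = header.decode("ascii").split(" ")
    return obj_type, content

with tempfile.TemporaryDirectory() as tmp:
    objects = Path(tmp) / "objects"  # stands in for .git/objects
    sha = write_object(objects, "blob", b"something\n")
    stored = [p.relative_to(objects).as_posix() for p in objects.rglob("*") if p.is_file()]
    obj_type, content = read_object(objects, sha)

print(stored)    # one file, under a two-character folder
print(obj_type)
print(content)
```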
00:08:33.360 This persistent hash map is the core of Git. Now, let’s work towards the next layer. This next layer involves a lot of command line usage and looking at hashes, so don’t try to memorize every step. Instead, focus on the structure. Let’s say I start adding files to my project. I have a shell script prepared to add a few files, so I don’t have to do it manually.
00:09:21.000 Upon inspecting my project, I can see that there is a README file, which contains the string 'something,' and an SRC directory that contains two files: one with 'something else' and another again with the string 'something.' Now I will quickly add these files to the Git staging area, preparing them for commit.
00:09:59.720 Now, if I check the Git status, these files are ready, and I can proceed to commit them. I'll provide a commit message using the '-m' switch, which I'll call 'first.' Now that the command has been executed, I can check again how many objects are in the database.
00:10:32.120 This time, I see that there are five objects in the database. The question is: why five? I will use 'git cat-file' again. When I check the Git log, it shows me the commit and its hash.
00:11:03.280 Now, if I run 'git cat-file -t' with that hash, it tells me it's a commit. What’s inside a commit? It includes the metadata such as the commit message, the date of the commit, and the author. All of this is encapsulated in the commit object.
00:11:29.360 What’s less obvious is this 'tree.' A tree in Git is equivalent to a directory. The root of your project is a tree. If I run 'git cat-file -p' on the hash of this tree, I’ll discover its contents. It will have references to the SRC directory and to the README file.
00:12:17.360 Let me illustrate this graphically. You have a commit called 'first,' which references a tree as its root. This tree references two items: the SRC directory and the README file. The SRC directory, in turn, references two files: one containing the string 'something else' and another that also contains 'something.' The key point is that the name of the file is not stored in the blob; it's in the tree that contains the blob.
00:13:08.480 If you have two files that are identical, there will be only one object in Git's storage, which makes Git efficient. Git also periodically compresses and optimizes its storage as you add new objects, but that is more of a backend detail.
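The "why five?" question can be checked with a small model of the object database. This is a sketch under assumptions: an in-memory dict replaces `.git/objects`, the file names `a.txt` and `b.txt` inside `src` are invented (the talk does not name them), and the author identity and timestamp are placeholders.

```python
import hashlib
from collections import Counter

store = {}  # object ID -> (type, content); a stand-in for .git/objects

def put(obj_type: str, content: bytes) -> str:
    """Hash an object and keep it; identical content lands on the same key."""
    data = f"{obj_type} {len(content)}".encode("ascii") + b"\x00" + content
    sha = hashlib.sha1(data).hexdigest()
    store[sha] = (obj_type, content)
    return sha

def put_tree(entries) -> str:
    """entries: (mode, name, object ID) triples. Names live here, not in blobs."""
    def sort_key(entry):
        mode, name, _ = entry
        # Git orders directory entries as if their names ended with "/".
        return (name + "/" if mode == "40000" else name).encode("utf-8")
    body = b""
    for mode, name, sha in sorted(entries, key=sort_key):
        body += f"{mode} {name}".encode("utf-8") + b"\x00" + bytes.fromhex(sha)
    return put("tree", body)

something = put("blob", b"something\n")
something_else = put("blob", b"something else\n")
readme = put("blob", b"something\n")  # same content as an existing blob

src = put_tree([("100644", "a.txt", something_else), ("100644", "b.txt", something)])
root = put_tree([("100644", "README", readme), ("40000", "src", src)])

who = "A U Thor <author@example.com> 0 +0000"  # hypothetical identity
put("commit", f"tree {root}\nauthor {who}\ncommitter {who}\n\nfirst\n".encode())

kinds = Counter(obj_type for obj_type, _ in store.values())
print(len(store), dict(kinds))
```

Three files produce only two blobs because two of them have the same content; add two trees and one commit and the total is five.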
00:13:48.679 Another interesting part of Git is the underlying structure of the objects themselves. You have blobs containing content, and trees that contain more trees and blobs, with the names organized in the trees. This elegantly forms a filesystem.
00:14:17.679 Git is primarily a filesystem and understanding this will help clarify how Git functions, as well as how it relates to traditional filesystems. It's an abstracted way of managing file data.
00:14:54.079 Now, let's consider the next layer: versioning. I edit the README file and add a new line that says 'my git project.'
00:15:07.480 Now we have a new version to commit. I commit this change with the message 'second.' Again, if we look at the Git log, we see a hash for this new commit.
00:15:22.160 When I run 'git cat-file -p' on that hash, I see one thing that the first commit did not have: this second commit has a parent. That parent link is what chains commits into a history.
00:15:56.559 By visualizing this, you'll note that the second commit points back to the first commit. Thus, every time a new commit is created, it creates a snapshot of the file system in its exact state at that moment in time—all while reusing what can be reused from previous commits.
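The parent link can be made concrete by building the text of two commit objects by hand. This is a simplified sketch: the project is reduced to a single README, and the author identity and timestamp in `WHO` are placeholders, so the resulting hashes will not match those from the talk.

```python
import hashlib

def object_id(obj_type: str, content: bytes) -> str:
    data = f"{obj_type} {len(content)}".encode("ascii") + b"\x00" + content
    return hashlib.sha1(data).hexdigest()

def readme_tree(readme: bytes) -> str:
    """ID of a root tree holding a single README blob."""
    blob = object_id("blob", readme)
    entry = b"100644 README\x00" + bytes.fromhex(blob)
    return object_id("tree", entry)

WHO = "A U Thor <author@example.com> 0 +0000"  # hypothetical identity and timestamp

def commit_text(tree: str, message: str, parents=()) -> str:
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # absent on the very first commit
    lines += [f"author {WHO}", f"committer {WHO}", "", message]
    return "\n".join(lines) + "\n"

first_text = commit_text(readme_tree(b"something\n"), "first")
first = object_id("commit", first_text.encode())

second_text = commit_text(
    readme_tree(b"something\nmy git project\n"), "second", parents=[first]
)
second = object_id("commit", second_text.encode())

print(second_text)
```

The second commit points at a new tree (the README changed) and names the first commit as its parent.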
00:16:44.440 The same thing happens from here on. If I change the README yet again and commit, the new commit stores the new version of the file and links back to the previous commit as its parent.
00:17:09.919 Now, let's explore the branching system they implemented in Git. If I issue the command 'git branch,' I can see all branches. Initially, we only have one branch called master.
00:17:34.840 Let's create another branch, called 'fix-me,' so now we have two branches. Let's see what a branch actually is under the hood.
00:17:52.620 Digging inside the .git directory and observing the refs folder is particularly enlightening. Inside heads, we see files labeled 'fix-me' and 'master,' both of which contain the latest commit hash.
00:18:20.159 This implies that branches are simply pointers, or references, to the latest commit. In the output of 'git branch', the current branch is shown in green and the others in white, which makes it easy to see which one HEAD is on.
00:18:45.480 Now, if I make some changes in the command line and commit them, I’m effectively creating a new commit. This new commit now has as its parent the latest commit from 'master,' maintaining the structure of the commits.
00:19:17.560 If I later switch branches to 'fix-me,' Git will update my HEAD to reflect this change. It allows the user to navigate back and forth between different states in the history without losing any commits.
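The pointer mechanics above can be imitated with plain files. This is a sketch that assumes the loose-ref layout (`refs/heads/<name>` files plus a symbolic `HEAD`); real repositories may also keep refs in a `packed-refs` file, and switching branches updates the working tree too. The commit IDs are placeholders.

```python
import tempfile
from pathlib import Path

def resolve_head(git_dir: Path) -> str:
    """Follow HEAD to the commit ID it currently designates."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        return (git_dir / head[len("ref: "):]).read_text().strip()
    return head  # detached HEAD: the file holds a commit ID directly

with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    heads = git_dir / "refs" / "heads"
    heads.mkdir(parents=True)

    old_commit = "1" * 40  # placeholder commit IDs
    new_commit = "2" * 40

    (heads / "master").write_text(old_commit + "\n")
    (heads / "fix-me").write_text(old_commit + "\n")  # a new branch is one small file
    (git_dir / "HEAD").write_text("ref: refs/heads/master\n")
    branches = sorted(p.name for p in heads.iterdir())
    current = resolve_head(git_dir)

    # Committing on master moves only the master pointer.
    (heads / "master").write_text(new_commit + "\n")
    after_commit = resolve_head(git_dir)

    # Switching to fix-me rewrites HEAD; no commit is touched.
    (git_dir / "HEAD").write_text("ref: refs/heads/fix-me\n")
    after_switch = resolve_head(git_dir)

print(branches, current[:7], after_commit[:7], after_switch[:7])
```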
00:20:06.160 Let’s say I create another commit while in the 'fix-me' branch. Each new commit keeps a record of its parent, thus enabling a tree of versions.
00:20:39.640 If I want to merge branches, Git strives to create a new commit that reconciles both branches, merging the changes from both historical points.
00:21:10.680 The result of the merge is a new commit that combines the changes from both branches, with any conflicts between them resolved.
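Structurally, a merge commit is an ordinary commit with two parents. The toy model below illustrates that; the `Commit` class and the commit messages are invented for the example, and it models only the history graph, not file contents or conflict resolution.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Commit:
    message: str
    parents: tuple = ()

def ancestors(commit: Commit) -> set:
    """All commits reachable from `commit` by following parent links."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(current.parents)
    return seen

first = Commit("first")
second = Commit("second", (first,))
on_master = Commit("change on master", (second,))
on_fix_me = Commit("change on fix-me", (second,))

# The merge commit records both branch tips as its parents.
merge = Commit("merge fix-me into master", (on_master, on_fix_me))

print(len(merge.parents))
print(sorted(c.message for c in ancestors(merge)))
```

Walking back from the merge reaches the commits of both branches, which is what "reconciling" the two histories amounts to in the graph.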
00:21:39.680 There's also the durability of versions to consider. Imagine a scenario where branches have diverged and I run 'git rebase master.' This operation might seem straightforward: it is intended to reattach your commits on top of the latest commit on master.
00:22:13.440 However, since Git commits are immutable, you cannot change a commit directly. Instead, you create new copies with changed parents to preserve Git’s integrity.
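Why a rebase has to copy commits follows from how commit IDs are computed: the parent is part of the hashed text. The sketch below is simplified (author and committer lines are omitted, and the tree and parent IDs are placeholders), and in a real rebase the tree usually changes as well.

```python
import hashlib

def commit_id(tree: str, parent: str, message: str) -> str:
    """ID of a simplified commit object (author/committer lines omitted)."""
    text = f"tree {tree}\nparent {parent}\n\n{message}\n".encode()
    header = f"commit {len(text)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + text).hexdigest()

tree = "c" * 40        # placeholder IDs
old_base = "a" * 40    # commit the branch originally started from
new_base = "b" * 40    # latest commit on master

original = commit_id(tree, old_base, "fix")
rebased = commit_id(tree, new_base, "fix")

print(original == rebased)  # False: a different parent means a different commit
```

Since the ID is derived from the content, "changing the parent" can only mean creating a new object with a new ID; the original stays as it was.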
00:22:51.067 Therefore, after a rebase the original commits and their new copies both exist in the database, which makes updates easy but can also lead to confusion. To avoid losing track of versions, simply ensure every commit you want to retain is reachable from a branch.
00:23:34.119 When commits become unreachable, Git will eventually garbage collect them to reclaim space. However, any reachable commits will stay, thanks to the reference links through branches.
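Reachability is a plain graph walk from the branch tips. The sketch below uses made-up commit labels for a history after a rebase; it considers only branches, whereas real Git also treats other references (such as tags and the reflog) as starting points and waits before pruning.

```python
# Commit label -> list of parent labels (hypothetical history after a rebase).
parents = {
    "first": [],
    "second": ["first"],
    "third": ["second"],        # latest commit on master
    "fix": ["second"],          # original commit on fix-me, before the rebase
    "fix-rebased": ["third"],   # new copy created by the rebase
}
branches = {"master": "third", "fix-me": "fix-rebased"}

def reachable(tips, parents):
    """Every commit that can be reached from any of the given tips."""
    seen = set()
    stack = list(tips)
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen

kept = reachable(branches.values(), parents)
garbage = set(parents) - kept

print(sorted(kept))
print(sorted(garbage))  # candidates for garbage collection
```

Pointing any branch at the old `fix` commit would make it reachable again and take it off the garbage list.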
00:24:07.560 As we dive deeper into distribution, I would love to illustrate how it works together with the local and remote repository.
00:24:35.919 So, when cloning, Git copies the .git folder, transferring all history and objects to your local environment. Each object retains its place and affiliation with the entire history of your project.
00:25:01.919 As the local and remote repositories change over time, you 'git fetch' to bring the latest objects from the remote into your local repository.
00:25:34.919 In the other direction, when you want to share your changes, you 'git push' your commits to the remote repository. In both cases, objects are identified by their hashes, so the same object has the same ID in every repository and copies never conflict.
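At the level of the object database, fetch and push both come down to copying the objects the other side lacks. The following is a deliberately reduced sketch: dicts stand in for the two object databases, `copy_missing` is a hypothetical helper, and it leaves out what real Git also does (walking from refs, sending packfiles, updating branch pointers).

```python
import hashlib

def blob_id(content: bytes) -> str:
    header = f"blob {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

def make_store(*contents: bytes) -> dict:
    return {blob_id(c): c for c in contents}

def copy_missing(source: dict, destination: dict) -> list:
    """Copy every object that destination lacks; return the IDs transferred."""
    missing = sorted(set(source) - set(destination))
    for object_id in missing:
        destination[object_id] = source[object_id]
    return missing

remote = make_store(b"something\n", b"something else\n")
local = dict(remote)  # a clone starts with every object the remote has

local.update(make_store(b"my git project\n"))        # work done locally
remote.update(make_store(b"a teammate's change\n"))  # work pushed by someone else

fetched = copy_missing(remote, local)  # the direction `git fetch` moves objects
pushed = copy_missing(local, remote)   # the direction `git push` moves objects

print(len(fetched), len(pushed), local == remote)
```

Because IDs are derived from content, objects both sides already hold are recognized by key alone and never need to be compared or sent again.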
00:26:14.919 This system makes it incredibly efficient, enabling seamless collaborative coding and efficient version management without risking data loss.
00:26:41.799 So my presentation aims to provide a deep understanding of Git and its core systems. With this knowledge, you are empowered to navigate challenges with confidence and clarity.
00:27:09.399 Finally, as we summarize, the goal is to encourage you to learn how deep Git actually goes. Once you understand the underlying model, everything else becomes simpler.
00:27:30.360 Git is not merely a tool; it is a sophisticated system, and understanding its internals offers you a foundation for working fluently with it.
00:28:01.000 Now you have seen the layers that make up Git—its structure reverberates throughout the design, making it powerful and efficient.
00:28:32.480 In conclusion, if you learn Git not just at the surface but understand how it organizes the metadata underlying its commands, your experience and efficiency will dramatically improve.
00:28:58.480 Thank you very much for your attention.
Explore all talks recorded at Garden City Ruby 2015