00:00:14.920
Consider the entire concept of Git as an onion. If you remove distribution, what you're left with is solely version control. Forget about the idea of versioning itself, and pretend that every project has only one commit. This simplification makes things much easier to understand.
00:00:27.760
Once we remove that outer layer of complexity, you're left with what Git calls itself: a stupid content tracker. You provide it with files and directories, and it stores those files and directories away. But that's still too big of a concept to grasp right now.
00:00:39.120
Let’s dive deeper and go straight to the core of Git. What I would argue Git is, at its heart, is a persistent hash map. It may sound peculiar, but allow me to demonstrate.
00:01:03.960
When you give Git a piece of content—for example, the string 'something'—it returns a hash, specifically a 20-byte hash. This functionality, like many others in Git, operates at a low level, but you can access it from the command line. There is a command called 'git hash-object' that allows you to do this.
00:01:20.280
If I run this command here, Git expects to read from a file. Since I don't have a file, I'll have it read from standard input instead, using the '--stdin' switch, and pipe the string into it with the echo command.
00:01:32.799
So I will type 'something' literally. For those unfamiliar with Unix commands, I am streaming this string into the command, and what I get back is a hash starting with 'deba'. This is the same hash I noted earlier, demonstrating that identical content produces the same hash every time.
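What 'git hash-object' computes can be reproduced in a few lines. This is a minimal sketch in Python rather than the Git command itself; the function name `git_blob_hash` is made up for illustration. It relies on Git hashing a short header ("blob", the size, a zero byte) followed by the content.

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 that Git assigns to a blob with this content."""
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# echo appends a newline, so the content is 10 bytes, not 9.
first = git_blob_hash(b"something\n")
second = git_blob_hash(b"something\n")
other = git_blob_hash(b"something else\n")

print(first)            # 40 hex digits, which is 20 bytes
print(first == second)  # identical content gives an identical hash
print(first == other)   # different content gives a different hash
```

The hash depends only on the content, never on a file name or a timestamp, which is why the same string always maps to the same key.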
00:01:43.360
This is immensely important in Git because it uses SHA-1 hashes everywhere. All of your files are hashed in this manner. Let's call these hashes SHA-1s for short.
00:02:01.680
Some newcomers to Git might question if there are enough SHA-1s available to represent all possible files. What if two different files in my project have the same SHA-1 hash? Yes, that would be unfortunate, but it's unlikely to happen. Let's crunch some numbers for perspective. For instance, the odds of winning the American lottery jackpot are about one in 175 million. That's a tough number to visualize, so let me illustrate it.
00:02:42.840
Imagine if I created 175 million fortune cookies and put a different number inside each, including the jackpot number. If a fortune cookie is about 5 centimeters long, you could create a line of them reaching across continents, and you'd be amazed to find you would end up close to where I live, around Venice.
00:03:07.360
Now, suppose you were to walk that entire line of fortune cookies. At some point, you'd get hungry and feel the urge to eat one. You are only allowed to eat one fortune cookie during the entire voyage. When you open it, lucky you! You find the jackpot number. But wait, this gives you a 'gambler's mindset.' You say, 'I must be lucky; let me do it again!' You walk back, eat another cookie, and remarkably, you win again. Your friends start calling you a lucky bastard!
00:03:39.799
However, the chance of two random hashes colliding is far smaller than the chance of winning the jackpot once, or even twice in a row. A SHA-1 has 160 bits, so there are 2^160 possible values, and a collision between two random hashes is about as likely as winning the jackpot six times in a row. Hence, SHA-1 hashes are not just unique within a single project; for practical purposes, they are unique in the universe.
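The lottery comparison can be checked with a quick calculation. This sketch assumes the one-in-175-million odds quoted above and treats hashes as uniformly random 160-bit values; it asks how many consecutive jackpots are as unlikely as two given hashes being equal.

```python
import math

HASH_BITS = 160               # a SHA-1 is 20 bytes, which is 160 bits
LOTTERY_ODDS = 175_000_000    # the one-in-175-million figure from the talk

possible_hashes = 2 ** HASH_BITS

# Two random hashes are equal with probability 1 / 2**160.
# Solve LOTTERY_ODDS ** n == 2 ** HASH_BITS for n.
jackpots_in_a_row = math.log(possible_hashes) / math.log(LOTTERY_ODDS)

print(f"{possible_hashes:.2e} possible hashes")
print(f"about {jackpots_in_a_row:.1f} consecutive jackpots")
```

This only covers accidental collisions between random values. Deliberately constructed collisions are a separate matter that the talk does not go into.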
00:04:43.480
You could take every software project on Earth, place them into the same Git repository, and despite potential performance issues, you'd still experience no hash collisions.
00:05:02.759
Now, shifting back to our hash map, I mentioned it being a persistent hash map. To make it persistent, you add the '-w' switch, for write, to the 'git hash-object' command. If I run that here, outside of any repository, it fails.
00:05:26.640
It breaks because it doesn't have a home; it lacks a location. Every Git project contains a '.git' directory in its root where Git stores configuration settings and the object database.
00:06:02.680
So, let me create a new directory. I'll be using 'git init' to set it up, which creates the required .git folder in the root directory. Now, if I check inside, I can see an objects directory that contains a couple of folders named info and pack—just ignore those for now. Currently, there are zero objects, as I've just initialized it.
00:06:49.720
Now, let's generate the hash again and store it in the object directory. If you look inside the objects directory again, you will find a folder named 'de', which, not coincidentally, is the first two characters of the hash. It contains a file named after the remaining characters of the hash. The reason for this structure is to avoid cluttering a single directory with too many objects.
00:07:18.080
Inside the file is the content, prefixed with a small header and compressed with zlib. Another command I can use is 'git cat-file', to which I can pass the hash, or just the first few digits of it. When I run this with '-t', it prints the type of this thing, and Git calls it a blob. Your files in Git are referred to as blobs.
00:08:07.080
If I run it with '-p', it will show the content: 'something'. So far, so good.
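The write and read steps can be sketched as a tiny persistent hash map. This is an illustration in Python, not Git's own code: `write_object` and `read_object` are hypothetical helpers, and a temporary directory stands in for '.git/objects'. The layout follows what the talk describes: a folder named after the first two hex digits, a file named after the rest, and zlib-compressed content with a small header.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path

def write_object(objects_dir: Path, obj_type: str, content: bytes) -> str:
    """Store content roughly the way `git hash-object -w` does; return its hash."""
    header = obj_type.encode("ascii") + b" " + str(len(content)).encode("ascii")
    data = header + b"\0" + content
    sha = hashlib.sha1(data).hexdigest()
    # The first two hex digits name the folder, the remaining 38 name the file.
    path = objects_dir / sha[:2] / sha[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(data))
    return sha

def read_object(objects_dir: Path, sha: str) -> tuple[str, bytes]:
    """Return (type, content), roughly what `git cat-file -t` and `-p` report."""
    raw = zlib.decompress((objects_dir / sha[:2] / sha[2:]).read_bytes())
    header, _, content = raw.partition(b"\0")
    obj_type, _size = header.decode("ascii").split(" ")
    return obj_type, content

objects_dir = Path(tempfile.mkdtemp()) / "objects"  # stand-in for .git/objects
sha = write_object(objects_dir, "blob", b"something\n")
obj_type, content = read_object(objects_dir, sha)

print(sha[:2], sha[2:])   # folder name, file name
print(obj_type, content)  # the type and the original bytes
```

Unlike the real command, this sketch needs the full hash to read an object back; resolving abbreviated hashes is left out.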
00:08:33.360
This persistent hash map is the core of Git. Now, let’s work towards the next layer. This next layer involves a lot of command line usage and looking at hashes, so don’t try to memorize every step. Instead, focus on the structure. Let’s say I start adding files to my project. I have a shell script prepared to add a few files, so I don’t have to do it manually.
00:09:21.000
Upon inspecting my project, I can see that there is a README file, which contains the string 'something,' and an SRC directory that contains two files: one with 'something else' and another again with the string 'something.' Now I will quickly add these files to the Git staging area, preparing them for commit.
00:09:59.720
Now, if I check the Git status, these files are ready, and I can proceed to commit them. I'll provide a commit message using the '-m' switch, which I'll call 'first.' Now that the command has been executed, I can check again how many objects are in the database.
00:10:32.120
This time, I see that there are five objects in the database. The question is: why five? I will use 'git cat-file' again. When I check the Git log, it shows me the commit and its hash.
00:11:03.280
Now, if I run 'git cat-file -t' with that hash, it tells me it's a commit. What’s inside a commit? It includes the metadata such as the commit message, the date of the commit, and the author. All of this is encapsulated in the commit object.
00:11:29.360
What’s less obvious is this 'tree.' A tree in Git is equivalent to a directory. The root of your project is a tree. If I run 'git cat-file -p' on the hash of this tree, I’ll discover its contents. It will have references to the SRC directory and to the README file.
00:12:17.360
Let me illustrate this graphically. You have a commit called 'first,' which references a tree as its root. This tree references two items: the SRC directory and the README file. The SRC directory, in turn, references two files: one containing the string 'something else' and another that also contains 'something.' The key point is that the name of the file is not stored in the blob; it's in the tree that contains the blob.
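The same picture can be built by hand to see where the five objects come from. This is a simplified model in Python with an in-memory dictionary as the object database. The file names inside the source directory ("a.txt", "b.txt") and the author identity are placeholders, since the talk does not give them, and entries are sorted by plain name, which is simpler than Git's real ordering rule for directories.

```python
import hashlib

store: dict[str, bytes] = {}  # hash -> "<type> <size>\0<body>"

def put(obj_type: str, body: bytes) -> str:
    data = f"{obj_type} {len(body)}".encode() + b"\0" + body
    sha = hashlib.sha1(data).hexdigest()
    store[sha] = data  # storing identical content twice keeps a single entry
    return sha

def put_tree(entries: list[tuple[str, str, str]]) -> str:
    """Entries are (mode, name, hash). Names live here, not in the blobs."""
    body = b""
    for mode, name, sha in sorted(entries, key=lambda e: e[1]):
        # Each entry: "<mode> <name>", a zero byte, then the 20-byte binary hash.
        body += f"{mode} {name}".encode() + b"\0" + bytes.fromhex(sha)
    return put("tree", body)

something = put("blob", b"something\n")
something_else = put("blob", b"something else\n")

# Hypothetical file names for the two files in the source directory.
src = put_tree([("100644", "a.txt", something_else),
                ("100644", "b.txt", something)])
# README has the same content as b.txt, so it reuses the same blob.
root = put_tree([("100644", "README", put("blob", b"something\n")),
                 ("40000", "src", src)])

who = "A U Thor <author@example.com> 0 +0000"  # placeholder identity and date
commit = put("commit",
             f"tree {root}\nauthor {who}\ncommitter {who}\n\nfirst\n".encode())

print(len(store))  # 1 commit + 2 trees + 2 blobs
```

Three files went in, but only two blobs came out, because two of the files have identical content.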
00:13:08.480
If you have two files that are identical, there will only be one object in Git's storage, which is part of what makes Git efficient. Git also periodically compresses and optimizes its storage as objects accumulate, but that is more of a backend detail.
00:13:48.679
Another interesting part of Git is the underlying structure of the objects themselves. You have blobs containing content, and trees that contain more trees and blobs, with the names organized in the trees. This elegantly forms a filesystem.
00:14:17.679
Git is primarily a filesystem and understanding this will help clarify how Git functions, as well as how it relates to traditional filesystems. It's an abstracted way of managing file data.
00:14:54.079
Now, let's consider the next layer: versioning. I edit the README file and add a new line that says 'my git project.'
00:15:07.480
Now, we have a new version to commit. I commit this edition with the message 'second.' Again, if we look up the Git log, we see a hash for this new commit.
00:15:22.160
When I run 'git cat-file -p' on that hash, I see something the first commit did not have: this second commit has a parent. The parent is the hash of the first commit, and that link is what chains commits into a history.
00:15:56.559
By visualizing this, you'll note that the second commit points back to the first commit. Thus, every time a new commit is created, it creates a snapshot of the file system in its exact state at that moment in time—all while reusing what can be reused from previous commits.
00:16:44.440
The pattern repeats from here. If I change the README again and commit, the new commit stores the new version of the file and links back to the previous commit.
00:17:09.919
Now, let's explore the branching system they implemented in Git. If I issue the command 'git branch,' I can see all branches. Initially, we only have one branch called master.
00:17:34.840
Let's create another branch, called 'fix-me,' so now we have two branches. Let's see what branches are under the hood.
00:17:52.620
Digging inside the .git directory and observing the refs folder is particularly enlightening. Inside heads, we see files labeled 'fix-me' and 'master,' both of which contain the latest commit hash.
00:18:20.159
This implies that branches are simply pointers, or references, to a commit. Git also needs to know which branch is the current one, which 'git branch' highlights: green for the current branch and white for the others. It tracks this in a file called HEAD, which is a reference to a branch.
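The layout of the refs folder can be mimicked with plain files. This Python sketch uses a temporary directory as a stand-in for '.git' and a made-up commit hash; `resolve_head` is a hypothetical helper. It ignores details of a real repository such as packed refs and a detached HEAD.

```python
import tempfile
from pathlib import Path

git_dir = Path(tempfile.mkdtemp())  # stand-in for a real .git directory
heads = git_dir / "refs" / "heads"
heads.mkdir(parents=True)

commit = "1" * 40  # placeholder commit hash, not a real one
(heads / "master").write_text(commit + "\n")
(heads / "fix-me").write_text(commit + "\n")  # a new branch is just a new file
(git_dir / "HEAD").write_text("ref: refs/heads/master\n")

def resolve_head(git_dir: Path) -> tuple[str, str]:
    """Return (current branch, commit hash) by following HEAD to a branch file."""
    ref = (git_dir / "HEAD").read_text().strip().removeprefix("ref: ")
    branch = ref.removeprefix("refs/heads/")
    return branch, (git_dir / ref).read_text().strip()

print(resolve_head(git_dir))

# Switching branches rewrites HEAD; no commit object is touched.
(git_dir / "HEAD").write_text("ref: refs/heads/fix-me\n")
print(resolve_head(git_dir))
```

Creating a branch costs one small file holding a 40-character hash, which is why branches are so cheap.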
00:18:45.480
Now, if I make some changes and commit them from the command line, I'm creating a new commit. Its parent is the commit that 'master' was pointing to, and 'master' moves forward to point to the new commit, while 'fix-me' stays where it was.
00:19:17.560
If I later switch branches to 'fix-me,' Git will update my HEAD to reflect this change. It allows the user to navigate back and forth between different states in the history without losing any commits.
00:20:06.160
Let's say I create another commit while on the 'fix-me' branch. Each new commit keeps a record of its parent, so the history now forks: 'master' and 'fix-me' each have their own commits on top of a shared ancestor.
00:20:39.640
If I want to merge branches, Git strives to create a new commit that reconciles both branches, merging the changes from both historical points.
00:21:10.680
The resulting merge commit has two parents, one from each branch, and it records the outcome of resolving any conflicts between them.
00:21:39.680
But there is also the fate of old versions to consider. Imagine a scenario where branches have diverged and I run 'git rebase master.' This operation might seem straightforward, as it is intended to move the current branch's commits on top of the latest commit on 'master'.
00:22:13.440
However, since Git commits are immutable, you cannot change a commit directly. Instead, you create new copies with changed parents to preserve Git’s integrity.
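Why a commit cannot simply be given a new parent follows from how its hash is computed. This Python sketch hashes a commit body the way Git does, with placeholder tree and parent hashes and a placeholder author; `commit_hash` is a hypothetical helper. In a real rebase the tree usually changes as well, which this example leaves out.

```python
import hashlib

def commit_hash(tree: str, parent: str, message: str) -> str:
    who = "A U Thor <author@example.com> 0 +0000"  # placeholder identity
    body = (f"tree {tree}\nparent {parent}\n"
            f"author {who}\ncommitter {who}\n\n{message}\n").encode()
    data = f"commit {len(body)}".encode() + b"\0" + body
    return hashlib.sha1(data).hexdigest()

tree = "a" * 40        # placeholder hashes, not real objects
old_parent = "b" * 40
new_parent = "c" * 40

original = commit_hash(tree, old_parent, "fix")
rebased = commit_hash(tree, new_parent, "fix")

# Same content and message, different parent: a different object entirely.
print(original != rebased)
```

The parent hash is part of the hashed content, so changing it yields a new hash, and therefore a new commit alongside the old one.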
00:22:51.067
The rebased branch now points to the new copies, and the original commits are left behind with no branch pointing to them, which can be confusing. To avoid losing track of versions, simply ensure every commit you want to retain is reachable from a branch.
00:23:34.119
Now, when commits become unreachable, Git will eventually garbage collect them to save space. Any commit that is reachable from a branch, directly or through parent links, will stay.
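Reachability is a plain graph walk from the branch tips along parent links. The sketch below, in Python, uses a made-up history after a rebase, with short labels instead of real hashes. Real Git is more conservative than this: it also considers other references, and it waits before deleting anything.

```python
# Hypothetical history: each commit maps to the list of its parents.
parents = {
    "c1": [],
    "c2": ["c1"],
    "m3": ["c2"],          # latest commit on master
    "f3": ["c2"],          # original fix-me commit, abandoned by the rebase
    "f3_rebased": ["m3"],  # its new copy on top of master
}
branches = {"master": "m3", "fix-me": "f3_rebased"}

def reachable(tips, parents):
    """Collect every commit that can be reached from the given branch tips."""
    seen, stack = set(), list(tips)
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen

kept = reachable(branches.values(), parents)
garbage = set(parents) - kept

print(sorted(kept))
print(sorted(garbage))
```

Pointing any branch at the abandoned commit would make it reachable again, which is the practical meaning of keeping a branch on every commit you care about.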
00:24:07.560
Now for the last layer, distribution. I would love to illustrate how a local repository and a remote repository work together.
00:24:35.919
So, when cloning, Git copies the .git folder, transferring all history and objects to your local environment. Each object retains its place and affiliation with the entire history of your project.
00:25:01.919
As the local and remote repositories diverge over time, you run 'git fetch' to bring the latest objects from the remote into your local repository.
00:25:34.919
On the other end, when you want to share your changes, you 'git push' your commits to the remote repository. In both directions, objects are identified by their hashes, so the same object has the same ID in every repository and only the missing ones need to be copied.
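At the level of the object database, fetch and push can be pictured as copying whatever the other side is missing. This Python sketch models each repository as a dictionary keyed by hash; `transfer` is a hypothetical helper and the teammate's change is invented for the example. Real Git negotiates over the commit graph and also updates branch references, which this leaves out.

```python
import hashlib

def put(store: dict[str, bytes], content: bytes) -> str:
    data = f"blob {len(content)}".encode() + b"\0" + content
    sha = hashlib.sha1(data).hexdigest()
    store[sha] = data
    return sha

def transfer(source: dict[str, bytes], target: dict[str, bytes]) -> int:
    """Copy objects the target lacks; identical hashes mean identical objects."""
    missing = source.keys() - target.keys()
    for sha in missing:
        target[sha] = source[sha]
    return len(missing)

remote: dict[str, bytes] = {}
put(remote, b"something\n")
put(remote, b"something else\n")

local = dict(remote)                        # a clone starts as a full copy
put(local, b"something\nmy git project\n")  # work done locally
put(remote, b"a teammate's change\n")       # hypothetical work done elsewhere

fetched = transfer(remote, local)  # like fetch: remote objects come to local
pushed = transfer(local, remote)   # like push: local objects go to remote
again = transfer(local, remote)    # nothing left to send

print(fetched, pushed, again)
```

Because a hash identifies content everywhere, neither side has to compare file contents to know what it already has.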
00:26:14.919
This system makes it incredibly efficient, enabling seamless collaborative coding and efficient version management without risking data loss.
00:26:41.799
So my presentation aims to provide a deep understanding of Git and its core systems. With this knowledge, you are empowered to navigate challenges with confidence and clarity.
00:27:09.399
Finally, as we summarize, the goal is to encourage you to learn how deep Git actually goes. Once you understand the underlying model, everything else becomes simpler.
00:27:30.360
Git is not merely a tool; it is a sophisticated system, and understanding its internals offers you a foundation for working fluently with it.
00:28:01.000
Now you have seen the layers that make up Git—its structure reverberates throughout the design, making it powerful and efficient.
00:28:32.480
In conclusion, if you learn Git not just at the surface but understand how it organizes the metadata underlying its commands, your experience and efficiency will dramatically improve.
00:28:58.480
Thank you very much for your attention.