Dissecting Git's Guts, Emily Xie - Git Merge 2016

Captions
Great, so how is everyone today? Awesome. My name is Emily, and for this talk, Dissecting Git's Guts, I'll be doing a deep dive into how Git works under the hood. But before I begin, I wanted to get a show of hands: how many of you here use Git on a daily basis? Right, most if not all of you.

I really like teaching this topic because programmers use this tool all the time, but most have a rather superficial understanding of how it actually works. I think you've seen this comic once before tonight, but xkcd makes a pretty funny joke of this, of how for most people Git is complete sorcery, and because of that they are kind of left in the dark once they screw something up. So how many of you have had something go terribly wrong in Git? Yeah. I believe there's a term for that: Git happens.

Fortunately, knowing what goes on under the surface empowers you, as it leads to a better intuition and thus a better ability to navigate the system once something goes horribly awry, or even just in everyday usage of the program. So that's what I hope you walk away with from this talk tonight: more comfort in using Git and a better conceptual understanding of the system. For example, what is at the heart of a branch? What is happening when you do a git checkout? What is a detached HEAD state?

The way I'll approach this talk is with the assumption that you are familiar with Git porcelain commands. This is the term for all of the high-level commands that you as an end user interface with on a daily basis: git push, git pull, git log, git merge, and so on and so forth. And while we're on the topic of terminology, on the flip side you have something called Git plumbing, and we're not talking about toilets here. Plumbing is the term for all of the low-level commands that allow you to manipulate, inspect, and compare the basic structures of Git. We'll use these plumbing commands as a tool to poke around and dissect how Git is fundamentally structured. And this is how we're gonna do it. First,
we'll talk through the concept of the .git folder, which is where the Git magic happens. Then we'll drill in and look at the objects directory, which is where Git stores stuff. We'll later move on to the refs folder to explore how Git aliases things. And lastly we'll take a look at Git packfiles, which is how Git saves space, and how that ties into the system's general structure. Sounds good? Great.

So let's walk through the Git internals from the ground up, starting with what makes a Git repository. I'm gonna go create a folder, and I'm gonna go into it, and I want to run a number of Git commands on it, like git log. But even though I have Git installed on my machine, I can't do it yet, and if I try I get a fatal error. That's because before you can perform project-specific operations on a folder, you actually have to initialize Git, and the way you do that is through a porcelain command. Does anybody know what that command is? git init, yes. This puts into place the scaffolding that Git needs to operate on that project.

Great, so now it tells me there's a .git folder that's been created, but if you run an ls on it you don't really see it. So let's do an ls -la, which shows the hidden folders and files in a directory: -l standing for the long listing format, to make it look pretty, and -a to show all of the files. And there you go, now you can see it, the very bottom one.

So the .git folder is what makes a directory a Git repository. It's where everything that Git stores for a project lives, and because of this, if you ever wanted to back up or duplicate a Git project, you can simply copy over this hidden folder and you'd have all of the history intact. In fact, when you run a git clone of a repo from a remote source like GitHub or GitLab, that is essentially all that you're doing: copying over the .git folder. So now it makes sense that the .git folder would be hidden to begin with, right? If this folder contains all of the vitals, you typically don't want people tinkering around with it. But we know what we're doing, so let's run an ls to see what Git's guts are made of.

Cool. The parts we're concerned about for this talk are the objects folder, the refs folder, the HEAD file, as well as an index file that has yet to materialize. These four things constitute the heart of Git's structure; the rest of it is either just personalized configuration or user-defined scripts that are beyond the scope of this talk. Who has gone into the .git folder, by the way? Awesome, a roomful of intrepid explorers. Well then, let's go spelunking and examine this folder to see what it's all about.

We'll start by diving into the objects directory. You might wonder, where does Git keep all of the different versions of my files, all the content? The answer is right here in this objects folder. This is just a folder which functions as Git's database, and the term that you can use to describe how this database works is content-addressable file system, which is essentially just a method of storing information so that it may be retrieved based on its content.

So let's do this: let's put an object in the Git database. I'm gonna go ahead and write a file, hello world, because why not, fill it up with hello world, and save it. Now we're gonna use a plumbing command called git hash-object that allows us to copy this file into our objects database. The -w here indicates to Git that you want to write it, and we pass in the name of the file that we want to store. And great, now we get back this hash, this weird-looking string of gibberish comprised of 40 hex characters. In your time using Git I'm sure you've run into these. Who's seen these? Yeah, we've all seen them; git log is probably the example that you're thinking of.

This hash is generated by the SHA-1 algorithm, which is built into the hash-object plumbing command we just used, and the hash it produces is, for the most part, uniquely generated based on the contents of a file. To drive home my point, I'm gonna run the
raw SHA-1 function on this content on my command line so we can see for ourselves. We prepend some metadata, the file type and size, to match the fact that git hash-object automatically does this for you right before running SHA-1, and that's just to escape the bang character. Here's your command, and voila. For any given set of content you will always reliably generate the same hash key. You can do this any number of times and you're always going to get the same hash back for this content. However, if you alter the content in any form (I'm gonna go in and add an s to hello world), you'll notice that you get something drastically different back. So you get a hash that uniquely matches a set of content, kind of in the way that a fingerprint is a unique identifier for a person. This concept is super important, as it's at the heart of what makes Git a content-addressable file system.

Anyway, this hash can now be used to retrieve the content that we have saved in the objects folder. And by the way, this is where we have saved it, I'm just gonna show you. It's organized such that the first two hash characters create a subdirectory under the objects folder and the remaining 38 function as the file name. We can retrieve the contents by using the plumbing command git cat-file, which allows you to inspect any Git object. The -p flag here stands for pretty, as in readable format, and we pass in the hash of the corresponding object. And there you go, there are the contents of the file.

But if you try to read it like any other file, if I try to cat it, you'll see that what you get back is entirely gibberish. The files that Git saves into its database aren't just stored as raw copies of what you have. That would make no sense if you think about the fact that Git has to operate on potentially massive codebases. I think, for example, of my previous company, which had repos with hundreds of thousands of lines of code and thousands of
files. It would not be scalable. Rather, the contents are compressed into these smaller objects, and zlib is the compression library that Git uses.

So now that you've witnessed how Git stores our files, let's play around with some versioning. I'm gonna go ahead and edit this file to make version number two. I'll open it up in Vim, add another line, the world is beautiful, because it certainly is, and save that. Now if I run the hash-object command on it, passing the -w flag once again to write it and passing the name of the file, you'll see that what we get back is a different hash, as you'd expect. And if we show the contents of our objects folder (I'm just gonna use a shortcut here for that find command), you'll see that there are now two Git objects stored in the Git database. If we open the object that we just created, what you see might surprise you: it's the newest version of our file in its entirety.

I taught a workshop version of this talk at my previous company, and in that time I found that developers consistently tended to think that version 2 of a file in the objects database would be a diff off of version 1. I think some other version control systems, like Subversion, store diffs, but Git is a different animal: for each version, it initially stores an entirely separate copy of your file. There is, however, a follow-up to that statement, and we'll get to it a little later in the talk.

Another thing to note is that Git uses the hash to detect when a file has changed, and will thus be selective about when to store a new object. If you try to store a file that is exactly the same, line for line, character for character, into your objects directory, it will detect that the object already exists and won't duplicate it, but will spit back the same hash, as you see here. This is true even if you saved a dozen files with different names into your database. That is part of its beauty: it's pretty space efficient in the way that it's
designed.

So anyway, this type of object that we've been talking about has a very specific name. I briefly mentioned it before, but for demonstration purposes we can use the plumbing command cat-file to inspect the object, and this time we'll pass in the -t flag to indicate that we want to know the type. And there you go: blob. That's the name of this object that we've been creating and inspecting. Blobs are important because they are the primary object store containing all of your file content.

But you might have noticed that when we looked at these blobs, they are literally only your file content. So how do we know what file name this blob goes with? How does Git represent saving multiple copies of a file under different names? And more importantly, what if we want to group a bunch of files together to create a snapshot? Because that's what Git is, right? It's not just a snapshot of one file, but of your entire folder at any given point. There's another type of object that gives us this layer of information that we're looking for, and it's called a tree object. Whereas blobs correspond to your file contents, you can think of tree objects as a complete snapshot of your project directory.

We're gonna make a tree to demonstrate, but first we need an index file, because that's what Git makes trees from. You might ask, what is an index file? Well, actually you know what it is, because there's a user-friendly metaphorical term for it that you see all the time: the staging area. So to move forth, let's first stage some files, AKA put stuff into our index, by using the plumbing command update-index. The --add flag specifies that we're adding to the index, and we pass in the name of the file that we want to stage. I'm gonna go ahead and add yet another file into our index. I'm gonna create one, foobar, open it up, write some gibberish, very meaningful, and save it. And this time I don't
actually have to manually call hash-object beforehand, because update-index has that functionality built into it: if the blob isn't already in the objects database, it will automatically be added under the hood. Now if we look at the .git folder, we see that we have this index file, which wasn't there before. And if we inspect the contents of the index with the plumbing command git ls-files, you can see that it is a running list of the stuff in our staging area. Under the hood, that's all your staging area really is. And if you run a porcelain git status, there you go, you'll see that we have indeed added stuff into our staging area.

So by now we've saved files into the Git database and we've updated our staging area, and you've probably already guessed which porcelain command we've effectively mimicked by using low-level plumbing. What is it? git add, that's right.

Now that we have an index for the tree object to base itself off of, let's go ahead and finally write that tree. As you'd expect of all objects you write to the objects folder, you get back a hash. Let's examine it: we'll use the cat-file command, pass in -p for pretty, and there you go. This is what a tree object looks like. You'll see that it contains the file mode, the object type, a reference to the SHA hashes of the blobs, and the file names. One thing to note that isn't being shown here, by the way, is that in addition to pointing to blobs, trees can also reference other trees, which is how Git represents subdirectories.

If you're paying close attention, you might notice that this tree object looks exactly like the index file that we just looked at, right? That's absolutely the case, except that unlike the index file, which is meant to be in a constant state of flux because it is your staging area, the tree object is a finalized snapshot, captured and persisted into your Git database. You can see that this is the case because when you list all of your objects,
we now have another item in there. So now it's official: we have a snapshot of our current working directory stored in our Git objects folder.

But it seems like we're still missing a bit of metadata here. We don't have any information about who saved these snapshots, what time they were saved, or why they were saved. Enter the concept of the commit object, which takes care of all that. So let's go and create a commit object. We'll use the plumbing command commit-tree for this, and we'll pass in the SHA hash of the tree object, the one that we just made. Like with all objects that we've dealt with so far, we get back a SHA hash, and Git predictably sticks the object into the objects database, as you can see right here. If we take a look at it, we'll see that what you get back looks like our typical commit. This is pretty familiar, right? Very importantly, at the very top we have the hash of the tree that this object points to. We have the author, that's me, we have the committer, that's also me, then we have the timestamps, and then the all-important commit message. By the way, that is a terrible commit message; I hope your commit messages are a lot more descriptive than this. I'm just gonna take a drink of water.

So as of now, by writing a tree and creating a commit object, we've effectively mimicked a porcelain git commit. Let's put another commit on top of this one, because I want to demonstrate how Git relates commits to one another. Let's say that I'm editing this foobar file for this commit. I'm gonna go ahead and add some more lines of gibberish here and save that. I'm going to update my index again, I'm going to write this tree, and then I'm going to create a second commit. Here's the hash of the tree that I want to make a commit out of, and here I'm going to chain the commits with this -p flag, p standing for parent, telling it to link this commit to a parent commit, the one that came before it, so as to
indicate this sense of hierarchy. We get back a hash, as we expect. Let's take a look at this commit object. It looks pretty similar to the previous commit, but notice that there is now an additional line in there, one that says parent, along with the hash of the previous commit. Now if you run a git log on this specific commit hash, there you go, you'll see how, by simply chaining these commits, we're starting to build up a history.

So there you have it: you've performed a low-level git add and git commit. If you've been following along, you've probably noticed the tree-like way that Git is structured. At the very lowest level we have the contents of our files, which are the blobs, and each revision is a different blob. Then you have another layer built on top of that to associate these files in a snapshot, which is the tree object. On top of that we have yet another layer of metadata, the commit objects, which are then chained to one another to form a history. Everything is chained in one direction, by the usage of hashes to point from one node to the next: a child always knows its parents, but never the other way around. And if you drop in anywhere on this chart and follow the pointers, you'll never end up back where you started. Thus it is a DAG, a directed acyclic graph, which is a directed graph that contains no cycles.

These hashes are generated based on the files' contents, which in turn, as you saw, contain the hashes of any preceding nodes. This creates a chain of dependencies in which the hash of each subsequent object depends on the ones before it. In this way it is also structured as a Merkle tree, and as it's both a DAG and a Merkle tree, some like to call it by the hybrid name of Merkle DAG. Regardless of what you call it, you can start to see what sort of advantages this structure might hold. For one, because of the hashing, in that the key is uniquely generated based on the contents, you can verify
that the data you put in will always be the data that you get back out. It's a way to maintain data integrity: if there's any corruption, Git will notice it. At the same time, it allows for deduplication of any common children, which I demonstrated at the blob level. And for another, it makes for a highly flexible, lightning-fast piece of software, given that we can content-address any of the nodes in the data structure. You have all the content, and it's just a matter of pointing to it via the hashes, which brings us to our next point: references.

So now we know how Git stores our information in three primary object types: blobs, trees, and commits. They're all actually just floating around in the objects directory; they're not organized by whether it's a blob or a tree or a commit. But when we work with Git, how do we keep track of what commits we work off of? Well, usually we're working with branches, so there's our answer. But what exactly is a branch, and how does Git know what objects go with a given branch? The answer is pretty simple: branches in Git are merely aliases, or pointers, to commit objects. These pointers reside in your refs/heads folder, refs meaning references and heads meaning the top-level commit for a given alias. So this directory should contain a running list of all branches that are in this repository. We listed it, but there's nothing there so far. That's because we don't have any branches yet.

Let's change that. We can use the plumbing command git update-ref to do so: we pass in the name of the branch that we want to make, master, and the commit that we want to point it to, which is the last commit we just made. Now if you list your refs/heads folder, you'll see that you have a master branch in there. And if you open this guy up, you'll see that's all there really is to it. It's just a text file, I can cat it, and it contains the hash of the commit object that this branch points to. So in effect what
we've done is this: we just created a reference to the commit, and that's all there really is to a branch.

Now, oftentimes when you're working in Git you're branching off of master, so let's try that. Let's branch off of master and see what happens. We made a feature branch, and if we list the refs/heads folder again, we'll see that we've created another reference in the heads folder. If you open up this new branch to see the hash (we'll just cat it, since it's not an object), you'll see that the hash for this branch is exactly the same as the one for master. So effectively we've created two references to the same commit. Visually it looks something like this: we just added a new branch for the same commit. That's all that happens when you freshly branch off of master.

But let's say that we've edited some files and then added a commit on top, because this is what we normally do when we're working on a new feature branch, right? I'm gonna edit the hello world file, save it, and we're gonna use some porcelain commands to speed ourselves up here. I'm just gonna do a git add and a git commit, also with a really terrible commit message, don't do this. If you open up the branch file now, you see that this new branch points to the new commit; this is a new hash. And if you take a look at git log, you'll see that the top hash corresponds to the latest commit, and it's chained to all of the preceding commits. Visually it looks something like this: we changed a file, staged it, and committed it, which creates a commit object, links it to the prior commit, and at the same time moves the branch reference so that it now points at this new commit object.

And if you wanted to check out master and get your master revision back, what would happen is that Git would read the master branch file, which contains the commit's hash, and from there it would follow the chain of hashes through the trees until it
gets to your relevant blob objects, and from there it unpacks those blobs into your working directory.

Now you're probably wondering, how does Git know what branch we're currently working on, so that when we do a git commit like we just did, it knows what branch to move to that commit? The answer is in the HEAD file, which resides at the top level of the .git folder. This is just a text file that points to the path of your branch, as you can see right there. git log, git branch, and a bunch of other commands that you run when you want info on your current branch read off of this file. If you bring the HEAD file into the picture, the diagram starts to look like this. I ran out of space there. And when you do a checkout to another branch, under the hood what's happening is that this HEAD file is getting edited so that it points to the other branch.

By the way, you've probably also seen the term detached HEAD state floating around. What a funny term, right? It's a very memorable phrase. If you're wondering what that is, it's when you do a git checkout of a commit directly rather than of a branch, so that HEAD points at a commit hash instead of at a branch. So it looks like that; that is a detached HEAD.

So anyway, the overarching point that I really want to drill in here is that branches are not some sort of fleshed-out entity. Rather, they are literally just a text file with a commit hash in it. We do this because we need a human-readable and meaningful way to reference the commit object that we want to work off of, because those 40-character hashes are seriously hard to remember, right? But we'll certainly remember something like master or feature, which is, by the way, a terrible name for a branch.

So, one last topic: Git packfiles. I mentioned earlier that Git saves a copy of each version of your file, right? If that is the case, then you might start thinking to yourself, jeez, that would become pretty clunky pretty fast. I'm gonna show the contents of our
objects folder, and there you have it. This is the smallest repo ever, but it already has this many objects. How does this scale?

What I've been showing you so far are called loose objects. The thing is that Git sometimes automatically packs up these loose objects into a binary file called a packfile in order to save space and be more efficient, and it does this by using deltas. We can manually reenact this process by using the git gc command; the gc here stands for garbage collect. With garbage collect, Git performs a rather complicated algorithm to determine which of these objects are similar, and then it picks a base from which to make the delta, which is the difference between these objects, and stores that instead, which saves quite a bit of room.

Before I run this command to start packing things up, let's run a git count-objects on this to see what we have. The -H flag here represents human-readable, and you see that we have a count of eleven loose objects for a total size of forty-four kilobytes. Now if we run a git gc to compress some things, and run a git count-objects again to get the stats, we see that we have way fewer loose objects hanging around and the total size has significantly shrunk. And if we do another listing of our objects, it now looks different. The pack file, the little guy on the last line, is the single file containing your objects and deltas, and the index immediately above contains offsets into that pack file so that you can reference your objects pretty quickly. And if you run the plumbing command verify-pack, you'll see that all the things have been packed up in your pack object. Copy the whole path, okay, there you go.

So that's it for my portion on how Git saves space and manages to stay even more lightweight, and that also concludes my talk on how Git works. I hope I've at least somewhat demystified Git for you, but really the best way to learn a topic is to go in and dig for yourself, and so I
wanted to share with you some of the resources I used in preparing this talk. The first item, Pro Git by Scott Chacon and Ben Straub, is an amazing open source book, and you've probably seen it around online when you google for some Git help. Personally, it is my Git bible, and if you've learned a lot from this talk you'll learn even more from reading this source, because I've structured much of my talk and content around it and drew from its really brilliant explanatory approaches. So I'd like to give credit to this book as well. Other great perspectives on internals are Mary Rose Cook's Git from the inside out and John Wiegley's Git from the Bottom Up. A particularly great talk I found is by Matthew McCullough, if you're more of a video-type learner. And lastly, I discovered this gorgeous, very recent blog post by Tyler Cipriani, which does a very thorough job visualizing Git's Merkle DAG structure.

I want to thank everyone who helped me in preparing this talk by providing an audience for a dry run and offering feedback. And next I want to thank all of you for being a great audience. It looks like we're fresh out of time, so I can't take any questions now, but I'll be hanging around and you can ask me then, I'm very nice, or feel free to tweet at me and I'll respond there. Thank you very much.
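
The hash-object step in the talk (prepend the type and size, run SHA-1, compress with zlib, store under a two-character subdirectory) can be sketched in a few lines of Python. This is a minimal illustration rather than anything shown on stage; the function names are made up for the example, and it only computes values in memory without touching a real repository.

```python
import hashlib
import zlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Hash an object the way the talk describes: header, NUL byte, then content."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def loose_object_path(object_id: str) -> str:
    """First two hex characters name the subdirectory, the other 38 the file."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


def loose_object_bytes(obj_type: str, body: bytes) -> bytes:
    """The bytes that end up on disk: the same header plus content, zlib-compressed."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return zlib.compress(header + body)


if __name__ == "__main__":
    # The empty blob has a well-known id: e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
    print(git_object_id("blob", b""))

    v1 = git_object_id("blob", b"hello world\n")
    v2 = git_object_id("blob", b"hello worlds\n")
    print(v1 == git_object_id("blob", b"hello world\n"))  # same content, same id
    print(v1 != v2)                                       # one extra character, new id
    print(loose_object_path(v1))
```

Because the id depends only on the type, size, and bytes of the content, saving the same content under a dozen file names still yields one object, which is the deduplication described in the talk.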
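
The Merkle DAG idea from the talk, where a tree's hash covers its blobs and a commit's hash covers its tree and parent, can be illustrated the same way. The sketch below assumes regular files only (mode 100644) with ASCII names, and the author line is a made-up placeholder; it is meant to show how a change propagates upward, not to replace write-tree or commit-tree.

```python
import hashlib


def object_id(obj_type: str, body: bytes) -> str:
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries: dict) -> bytes:
    """entries maps file name -> blob id. Each entry is 'mode name', NUL, raw 20-byte id."""
    body = b""
    for name in sorted(entries):  # assumption: plain ASCII file names, no subdirectories
        body += b"100644 " + name.encode() + b"\x00" + bytes.fromhex(entries[name])
    return body


def commit_body(tree_id: str, parent_ids: list, who: str, message: str) -> bytes:
    """Text of a commit object: tree line, parent lines, author, committer, message."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {p}" for p in parent_ids]
    lines += [f"author {who}", f"committer {who}", "", message]
    return ("\n".join(lines) + "\n").encode()


if __name__ == "__main__":
    # Hypothetical identity and timestamp, used only for this example.
    who = "Example Author <author@example.com> 1478563200 +0000"

    # The empty tree has a well-known id: 4b825dc642cb6eb9a060e54bf8d69288fbee4904
    print(object_id("tree", tree_body({})))

    blob_v1 = object_id("blob", b"gibberish\n")
    blob_v2 = object_id("blob", b"gibberish\nmore gibberish\n")
    tree_v1 = object_id("tree", tree_body({"foobar": blob_v1}))
    tree_v2 = object_id("tree", tree_body({"foobar": blob_v2}))

    first = object_id("commit", commit_body(tree_v1, [], who, "first commit"))
    second = object_id("commit", commit_body(tree_v2, [first], who, "second commit"))
    print(tree_v1 != tree_v2)  # changing a blob changes the tree that points to it
    print(first, second)
```

Since the second commit's text contains the first commit's hash, altering anything in the earlier history would change every hash that follows it.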
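
The point that a branch is just a text file, and that HEAD is a text file pointing at a branch, can also be sketched. The example below builds a throwaway directory laid out like a .git folder and resolves HEAD by reading files. The helper names and the placeholder hashes are invented for the illustration, and it ignores details of real repositories such as packed refs.

```python
import tempfile
from pathlib import Path


def write_ref(git_dir: Path, name: str, commit_id: str) -> None:
    """Create or move a ref: a text file whose content is a commit hash."""
    path = git_dir / name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(commit_id + "\n")


def resolve_head(git_dir: Path):
    """Return (ref_name, commit_id); ref_name is None when HEAD is detached."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        ref_name = head[len("ref: "):]
        return ref_name, (git_dir / ref_name).read_text().strip()
    return None, head  # HEAD holds a raw commit hash: detached HEAD


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        git_dir = Path(tmp)
        old_commit = "a" * 40  # placeholder hashes, not real commits
        new_commit = "b" * 40

        write_ref(git_dir, "refs/heads/master", old_commit)
        write_ref(git_dir, "refs/heads/feature", old_commit)  # branching copies the hash
        (git_dir / "HEAD").write_text("ref: refs/heads/feature\n")
        print(resolve_head(git_dir))

        write_ref(git_dir, "refs/heads/feature", new_commit)  # a commit moves the branch
        print(resolve_head(git_dir))

        (git_dir / "HEAD").write_text(old_commit + "\n")      # checking out a commit directly
        print(resolve_head(git_dir))
```

Switching branches in this model is a one-line edit to HEAD, and committing is a one-line edit to the branch file HEAD names, which matches the diagrams described in the talk.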
Info
Channel: GitHub
Views: 8,624
Rating: 4.9810429 out of 5
Keywords: git, github, VCS, programming, version control, open source, software development, collaboration, github training, git basics, octocat, what is github, git (revision control software)
Id: Y2Msq90ZknI
Length: 35min 13sec (2113 seconds)
Published: Wed Nov 16 2016