Git Internals by John Britton of GitHub - CS50 Tech Talk

Captions
All right, good afternoon, everybody. Thank you so much for coming. CS50 is super excited to introduce John Britton of GitHub today, who's going to give us a deeper dive into the internals that underlie Git, the popular open source source control software. CS50 uses Git for just about everything we do, so we're super excited to take a deeper look as well. So without any further ado, John Britton of GitHub.

Thank you, everybody. [inaudible] too much pizza. All right, so just a brief overview of what we're gonna be covering today: I'm gonna walk you through creating a Git repository, the underlying data structures in Git, and the process of creating a commit, and kind of how all that stuff works. I'm gonna assume that you already have some understanding of how to use Git. You'll be able to take something away from this no matter what, but it really will be building on the fact that you probably already know or have used Git before. So let's get started. We're gonna be at the command line most of the time just to look at things, and I'll be using the whiteboard a lot too.

So Git is a version control system. The idea behind Git is that it allows you to track changes to your project over time, and that's useful for things like collaborating with others, or finding out when bugs were introduced and reverting changes. It's also really useful as a documentation system. If you've ever worked on a project that was built by previous developers, or something you did a few years ago and you don't remember why you did it, you can go back and look at your Git repository, your history, and use it as a way to remember how you thought about things to find a solution. A really important attribute of Git, and what makes it a very powerful version control system, is the fact that it was designed to be distributed. And the fact that it's distributed is the reason why there are some things that don't always make perfect sense. You know, if you've ever used
Subversion or other version control systems, you might have noticed that they have things like sequential version numbers: one, two, three, four, five. In Git you have these long hash strings that identify versions, or commit history checkpoints, and the reason for that is because Git is decentralized. So we're gonna look at how it was designed and the motivation for it as well.

So to get started, I'm just going to create a Git repository. I'm gonna say git init example. What this command does is it just creates a new repository on disk for us called example. I'm in my home directory, and this will create it in my home directory on my machine. You can see it reports back "Initialized empty Git repository", and I can change directories into my new repository. Pretty good first step. You'll notice on the right-hand side that it says I'm on the master branch and I'm in a clean status. What this is doing is, every time I press Enter, my terminal just runs git status and reports the status for me. We can also check the status using the git status command. All right, you can see we're on the initial commit, and so we've got our repository created.

The first thing I want to look at is the command ls. ls shows you what's in the directory, and it looks like there's nothing. In the case of Unix (I'm on a Mac, so I'm using a Unix-based operating system), files that begin with a dot, or a period, are considered to be hidden files and are not displayed when you run ls. There's a command, ls -a, which will show hidden files as well. There we go. Now you can see that I have a folder in my Git repository called .git. So this is the first internal thing that we're gonna look at: a Git repository is actually just a folder on your machine with a folder inside of it called .git, with some metadata inside of that. It's important to know, like we said in the beginning, Git is distributed, so all the commands I'm going to run, unless I tell you
otherwise, they're going to be local commands only. There's no network activity going on; it's just things happening on my machine. The way that it works is that as I make changes in my Git repository, in my project, those changes get logged and saved as files inside of this .git directory. Other version control systems have a client-server kind of model, where you have a centralized server on the network, and every time you take an action you have to connect to that centralized server and communicate and coordinate between you and all the other clients. With Git there is no centralized server, there's no back-end process or daemon; it's all file operations in this .git directory.

All right, let's look inside of that .git directory. So I have a command called tree. tree is a nice Unix command; it just gives you an ASCII art tree structure. I'll just do tree .git, and what you can see here is that there's a bunch of folders and files inside of the .git directory. In practice you'll never need to use this, and you should never mess around with this stuff. If you delete things in here you can damage your repository, but it's good to know how this stuff works. The two things we're going to be focusing on are down here: this objects directory and then this refs directory.

So I'm gonna use the whiteboard for this. I think this is probably something that, if you've used Git before, you're familiar with, and it's the concept of having a staging area. So you have three areas: working, staging, and repo. I like to think about it like this: the purpose of Git is to provide a version control system that allows you to track snapshots of your project over time. As developers you're probably familiar with the working directory. You have your compiling tools, you have your home directory, you have your IDE, your text editor; there's a folder on your machine where you edit files, the working directory, and that's pretty straightforward. And then Git layers on top of that
two more areas to think about in your mental model: the staging area and the repository. The way to think about the staging area is like a rough draft. You work with your editor in your working area and put things exactly how you want, maybe making a feature or fixing a bug, and as you make progress there you add it to your staging area to get it just right. When it's just right, you save it into your repository history permanently.

So we have a couple of commands. This is git add. What git add does is it takes a file in your working directory and it stages it: it puts it in your rough draft to later be saved into your repository history. And then you have another command, git commit. What that does is it takes the rough draft that you have in your staging area, which is a representation of a snapshot of your project exactly the way you like it (you can put things in, take things out), and saves it into your repository as a permanent snapshot that you can then later share via the network with your collaborators or to GitHub. So I'm going to walk through these commands, and as I walk through them I'm going to show you what happens under the hood.

To simulate writing some code I'm just gonna use the command touch, and I'm gonna say touch README.md. md stands for Markdown; it's just a text file type. touch creates a file on disk with no contents. So I did touch README.md. You'll notice that my status has changed over on the right-hand side, and if I do ls in my current directory you'll see I have a new file called README.md. It's pretty straightforward. If we investigate the file with cat, you can see it's an empty file; there's nothing inside. git status reports that I have a new file that's untracked, and it says to use git add, right there. So I'll say git add README.md. Now, at the same time as I do that, I'm going to open up another tab, and I'm gonna cd into example, and I'm gonna
do watch -n 0.5 tree .git. On the right-hand side this is just showing us the contents of the .git directory, and it's updating every half a second. So whenever I do something that edits the .git directory, you'll see it happen on the right-hand side in real time.

So when I run git add README.md, what actually happens? We said we have our working directory, which has the README, that blank file that my editor can use, and then this staging area, which represents a rough draft of my work, and git add copies the file. It's important to note that it's not a move: it doesn't take it from one place and move it to the other, it copies it over. So here's git add README.md, and it copies it to the staging area. And when we do that, you'll notice that something changed in my .git directory. So this kind of furthers the proof that there's no network activity, there's no back-end server; all we're doing is editing files on disk in this hidden folder. There's a new object created. This objects directory has a really interesting way of storing stuff that we'll get into later, but first I want to talk about the objects.

There are four kinds of objects that Git stores. I'm only talking about three of them today, just for simplicity. The first is called a blob, the second is a tree, and the third is a commit. The fourth object, if you're really curious, is an annotated tag, but we're just not going to get into that today. So when we did git add README.md, what happened? Git conceptually took my README file and put it in the staging area. But what it actually did is it looked at the contents of the file, took those contents, and put them as a copy into a file in the objects directory, which represents a blob. Blob stands for binary large object; it's basically just a collection of data. Blobs don't have names; it's just the raw data. And then it took that raw data and a little bit of header information, like how long the file is and what type of object
it is, and it ran it through the SHA hashing algorithm. The hashing algorithm always outputs a 40-character hex string, which in this case is e69de29 and so on, all that stuff over there. Let me make it a little bit bigger. So that's this e69de whatever, this whole line. So it calculated that hash, and then it created a directory in the objects directory called e6, and then it took the rest of the hash and used it as the file name, and then it stored the raw data, which in our case is the empty file. So it's a blob object of 0 bytes, and that was hashed, and that's where it came up with this e6 thing. What that tells us is that every time you run git add, you're actually putting something in your objects directory. So things are being saved on your local machine.

But what's important with Git is that we care about snapshots of the entire project; we don't care about individual files. So the next step, which you're probably familiar with, is git commit. Commit lets you create a checkpoint in time, a snapshot of your repository that references the entire working directory at a point in time. So let's do that: git commit -m "create empty readme". When you create a commit you also provide a message. And now again, watch what happens on the right-hand side. You can see that there are two more objects that were created. These objects are of the other types, tree and commit. Let's look at these more closely and see what those objects are actually made of.

So the first thing I'm going to do here is use a command called git show. git show is a plumbing command in Git. Broadly speaking, Git is divided into subcommands; they're called the porcelain and the plumbing. The plumbing commands are kind of under the hood. You don't usually use them that often, but other, more user-friendly commands are made by stringing these together. A common example is the
command git pull, which is actually two commands, git fetch and git merge, together. In our case, git show allows you to look at an object. So I'm gonna look at our first object, which is the blob: git show e6d7ba21. I don't have to type out the whole thing; I can just type out the shortest unique identifier for it. Maybe I typed it wrong... git show e6d7... Oh, yeah, you're right, thank you: e69de. There we go. OK, all that work for something very interesting: it's an empty file, right? We kind of already knew that, but you can prove it by using git show. It shows you that that object is in fact an empty file object; an empty blob object is more appropriate.

So let's show the other stuff: git show, and this time I'll type it correctly, 6dd7ba. What object is this? I have no idea, but it tells me it's a commit object. It tells me what the commit identifier is, it tells me the author, that's me, John Britton, the date it was created, and the message. And then this part is not actually stored in the commit; it's computed, as a reference for what was changed by this commit. It shows that I added a new empty file. So let's check out the third: git show f93e3. You can see here there's an object called a tree, and it has one item inside.

Now, these git show commands are just showing you a basic overview. We can get more detailed with this as well. In the case of the commit, I can say git show --pretty=raw, and this will give us more information. Again, you don't need to know this, but it's good to inspect these things. So I'm gonna go ahead and say 6dd7ba. OK, there's a lot more information here. I'll point with this so that it's easier. First you have the commit again, then you have the tree. So this shows us that when we make a snapshot, when we do git commit, what we're doing is creating a moment in time that says this person created a snapshot with
this message, and that snapshot is represented by this tree. A tree is, in a way, just a directory: this is a tree that represents what your working directory looked like at that point in time in your project. So there's the tree and the pointer to the tree. You also see that there's an author here and a committer. This is something in Git that catches people sometimes: the author and the committer, why are they there twice? The author is the person who actually wrote the code; the committer is the person who saved that change to the repository. As Git was actually created for Linux kernel development, it's pretty common for somebody to author changes and send a patch file to a maintainer, and then the maintainer commits it. So it allows you to track those two things independently. And then you have the message.

There's another command I'll show, which is git ls-tree. This is a way to inspect tree objects and see more information about what's inside them. So I'll say f93e3. Let's try it again. There we go. You'll see that there are a few different parts to this tree entry. First it has the number 100, which means this is a regular file. 644 you might recognize as a Unix permission, you know, 644, 755, 777, those kinds of things. This is the permissions of the file in the project. So Git actually version controls the permissions on disk as well: if you edit the permissions of a file in your project, that's considered a change in Git and it will be version controlled. Then there's the type of object this tree entry references. We're looking at what's inside the tree; this is a list of all the items, like ls. So we have one regular file with 644 permissions, which is a blob object, and the blob object is referenced by this hash, this unique identifier, e69 and so on. And then lastly it has the file name. And this is very interesting: the file name of the file is not stored in the blob object, it's stored in
the tree. Think about why that might be. What happens if you have multiple files that have the same contents? With this system you're able to have multiple files with the same contents, but Git only stores the contents one time, even if they have different file names. What happens if you change the name of a file? Should you duplicate that data when you change the name of a file? No. With Git, all you do is create a new tree object, which has a new file name. The object in your objects directory that represents the file, the blob, doesn't change. So file name changes don't actually affect the raw data, and what this means is that we're able to rename files really easily, without a high cost. So this is what's inside of a tree, and we've been able to inspect those three different objects.

Yeah, go ahead. So the question was, when I delete a file, Git will show that as a new file that was deleted; is that what you're saying? Yes. So the question was about what happens when I delete a file. Say I have a repository with a file inside, and I delete the file. What happens is Git says this file was expected to be here. What you need to do is use git add to stage the fact that you deleted the file. You need to tell Git, "my rough draft should be updated such that the file is no longer in my rough draft." So you do that, and then you commit it, and when you make the new commit, the new tree for that commit will have one less entry in the list.

Oh, no, it doesn't actually track the name change. So the question was: if I rename a file, Git shows it as a deleted file and a newly added file. Under the hood, what I'm explaining is that when you rename a file, there's a blob object that contains all the data that the file possesses. So say you have a 5 megabyte audio file in your repository. All the binary data in there is stored in an object called abc123, and then there's a tree that references it as intro.wav, or .mp3, whatever that file is. If you
change that name, there was no data in that blob object that was modified. So what will happen is a new tree will be created with a pointer to the same blob object and a different name. Now, what you're seeing reported in the status output, that's a totally different thing, and there's actually a feature of Git that allows you to detect when a file was renamed. But the way it's stored under the hood is what I wanted to explain. I hope that answers it. I'm also going to take some more questions at the end, so I can do some demos of that too.

Go ahead. Yeah, so the comment was that there's a Git subcommand called git mv. So if you do git help mv, this is the git move command. It basically does your operating system move, and it also manages your staging area for you at the same time, so you don't have to do the add command separately.

All right, so to continue: we have three objects. One's a commit, one's a tree, and one's a blob, and so we have our first bit of history in our project. I'm gonna pull up a command called gitk. If you have Git installed, gitk should come with it. It's a GUI tool, and it gives you a visualization of your repository. It's pretty basic right now; our repository is very simple, we just have one commit. Here's the commit, and if we click on it you'll see it has the message, and then down below you have some information about it. What I want to drive home here is that when you're thinking conceptually about Git as a user, you don't think about trees and blobs, you just think about commits. You think: I took a snapshot, I took a snapshot, I took a snapshot; I want to look at this snapshot; I want to go back and look at that snapshot; I want to merge these two snapshots together; I want to create a branch and work on the side. You don't think about the trees and blobs. But this talk is about the internals. So you
don't see any of the complexity of the blobs and trees and stuff on here. So we have one commit, and I'm gonna point out this thing called a branch. When you create a Git repository you get one default branch. It's called master by default, and it doesn't mean anything special. What master is, as a branch, is just a pointer (it's also called a reference) to a commit in your repository. Think about it as a bookmark, a tool for navigating around your repository history. And I'll show you where that's stored as well. If we go into our .git directory, on the right side you'll see there's this directory called refs, for references, then heads, and a file called master. This is there by default. If I do cat, which just shows the contents of a file... yes, sure, like that, it's better. So I'm going to do cat .git/refs/heads/master, and that will show you what's inside that file. Can anybody guess what's in there? A SHA, thank you. So this is the hash of the commit that master references. Master is a bookmark that points to a commit. We only have one commit, so it's obviously the one that we just created, but it basically serves as a navigational tool for you.

If you've ever heard the term "being on a branch", being on a branch just means that when you make your next commit, that branch will be updated to point to the new commit you just created; it will be advanced. Conceptually, people think about branches as a divergence, where you go in two different directions. That's the wrong mental model for Git. It makes sense that you're doing some work off on the side, but the mental model you want to have is that a branch is just a pointer, a bookmark, to somewhere in my repository history. As we get into more complex history you'll get to see how this works a bit more. I want to pause for just a second and see if anybody has any questions on the stuff that I've covered up to this point.
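As a rough illustration of the "a branch is just a pointer" idea from this part of the talk, here is a minimal Python sketch. It is not Git's implementation: the helper names read_ref and advance_ref are made up for illustration, the commit identifiers are placeholders rather than real hashes, and it assumes the simple case where the ref is a loose file under refs/heads (real repositories may also keep refs in a packed-refs file).

```python
import tempfile
from pathlib import Path


def read_ref(git_dir: Path, branch: str) -> str:
    """Return the commit id stored in refs/heads/<branch> (loose ref only)."""
    return (git_dir / "refs" / "heads" / branch).read_text().strip()


def advance_ref(git_dir: Path, branch: str, commit_id: str) -> None:
    """Point the branch at commit_id, as happens when you commit on a branch."""
    ref_path = git_dir / "refs" / "heads" / branch
    ref_path.parent.mkdir(parents=True, exist_ok=True)
    ref_path.write_text(commit_id + "\n")


def demo() -> tuple[str, str]:
    """Build a throwaway .git-like layout and move 'master' forward once."""
    first_commit = "1" * 40   # placeholder ids, not real commit hashes
    second_commit = "2" * 40
    with tempfile.TemporaryDirectory() as tmp:
        git_dir = Path(tmp) / ".git"
        advance_ref(git_dir, "master", first_commit)
        before = read_ref(git_dir, "master")
        # "Being on master" and committing just rewrites this one small file.
        advance_ref(git_dir, "master", second_commit)
        after = read_ref(git_dir, "master")
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print("master before:", before)
    print("master after: ", after)
```

Moving the bookmark rewrites one small text file; in this sketch, as in the talk's description, no commit, tree, or blob object is touched when the branch advances.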
People seem to be mostly coming along. Yeah, go ahead.

Audience member: I think I get it, but just generally, a blob seems to be related to git add, commits obviously to git commit, but [inaudible] hash?

OK, so the question was about how git add seems to be related to blobs and git commit seems related to commits. What I want to emphasize is that git add is really the main thing to think about: that is the staging area, creating your rough draft. You can put things into the staging area and get it ready. Every object has a hash. A hash is just the identifier. I mentioned at the beginning of the presentation that Git is distributed. One of the challenges to overcome with a distributed system is how to reliably give things identifiers without communication. What we have with the SHA hashing algorithm is a deterministic way of creating an identifier for a particular object without communicating. So in the case of a blob, any data that we're storing, we take the contents of that file and we run it through the algorithm, and that's where we get the identifier.

In the case of a tree, we ran this command before, git ls-tree. Think about what we said before, that the object identifier is a SHA of the contents of the object. So the object identifier of this tree, f93 and so on, is actually a hash of the contents of the tree, which is the list of items in the tree. This data right here is what we took and hashed to create an identifier. And what's interesting is this data includes the hashes of all the blobs and trees underneath it. So if any one bit changes in any file in any subdirectory, then the identifier of the tree will change. It will be different, and it'll be drastically different because of the way the SHA hashing algorithm works. And then with a commit, you take the staging area, which has kind of a prepared tree, a rough draft of your working directory that you want to save
forever, and you create a new object. That new object (let me show this) has an identifier right here. That identifier is calculated from the contents of the object, and the contents of the object include the hash of the tree and also all the other stuff. So what it means is that if any bit changes anywhere along the line, including all of this author information and the message, you'll get a different, but deterministic, identifier for the commit. And it's really important. Does anybody know what that data structure is called, where way down at the bottom one thing changes and the hashes change going up? It's a hash tree. A Merkle tree is another way to say it. So that is the core concept of Git that allows it to be distributed, because otherwise we would have to communicate about what our identifiers are.

So yeah, that's a good question for after class. I would be happy to talk about my terminal after class. We have a tight timeline, though, so I won't get into that right now. Any other questions before I move on? I just want to make sure everybody's following along with us on this stuff.

OK, so we've made our first commit. Let's start talking about multiple commits. What happens when I create a new file? Let's say touch new_feature.rb. I'm a Ruby developer, so I make a new feature, I just create a new file. I'll use Atom: atom new_feature.rb. I open up my text editor, I say class NewFeature, and, you know, def foo, puts "bar". All right, some simple code. git status: there's a new file that's untracked. Again we follow the same kind of pattern: git add new_feature.rb, and on the right side you see one new object pop up, somewhere down here. Then we do git commit -m "add new feature". All right, now it's committed, and two more objects popped up. Let's open gitk and look at this from the perspective of just commits. The first commit existed before; the second commit is a
new commit. Master is updated: that pointer, the bookmark, now points to the new commit. If I go in here and I cat .git/refs/heads/master, you can see that it's this new identifier. With git show I would normally put an object identifier here, typing in these hashes. But now we know master is just a pointer, and since we were on the master branch when we made a new commit, that pointer points to the newest commit. So I could say git show master as shorthand for git show 99fba whatever; I don't have to type that all out. You can see that there's a new commit, it says "add new feature", and it shows me a diff. I can do git show --pretty=raw master, and this is where we see the additional details. What I want to point out here is there's a new entry, and this new entry is the parent. So when we create a commit, it includes all the data about all the blobs and trees and stuff that are in there, it includes metadata about who authored the commit and when it was authored, it includes a message, and it includes a reference to its parent, the commit that came before.

A lot of times people think, why don't I have a pointer going the other way? What's the next commit, if I go from the back going forward? Well, all these objects, objects being blobs, trees, and commits, are immutable. They're stored in your objects directory, and once you create them they can never be touched. We couldn't possibly go in and add a pointer to the future thing, right? We couldn't add a pointer to the new commit that we created; we'd have to know the identifier of that commit before we even created it. It'd be impossible. So in this case what we do is just create a pointer going back. What's also interesting is that, because the pointer going back to its parent is included in the commit object, the identifier of the new commit object is unique and, in a way, based on what came before. So if you imagine we have
two parallel universes, we have two projects: one project where I create a README in the main folder of the project and I create a commit, and another project where I do two commits, a first commit with an empty README and then a second commit where I put in all the same contents. Even if everything else were identical, done at the same time with the same author, because there's a history there in the one with two commits, the commit identifiers would be different, right? Even though the sum of the work is the same. So that makes it so that in a distributed system we could all be on a team together, we could all clone one repository, we could all do the exact same work, and we would all end up with unique commit identifiers. However, if we all did the exact same work and we all created this 10 megabyte file that was referenced somewhere in our repository, and then we all synced with each other, we would never transfer that 10 megabyte file, because they would all have the same hash. In our communication with each other we would know that because we have the same hash, we have the same file, and we don't need to transfer it. All right, that makes sense? Yeah? Cool.

Yes, that's a great question. I think what you're getting at is: when I make a change to a file in my project, does Git store the changes, or does Git do something else? So if we look at our structure, a commit points to a tree, and the tree points to blobs and other subtrees. There's no reference in there to a diff. We don't store diffs. If you look at Subversion, Subversion stores diffs, and at first glance that might sound like a smart idea. We can store diffs because they take up less space, right? It's like, I only changed one line. If I have a 10 megabyte file and I change one line, now I have to duplicate the 10 megabyte file, right? And
that's actually how it works under the hood: we do duplicate things. If you change one bit of a file, you get a totally new object with that data duplicated. However, think about computational complexity. If you're using a Subversion repository and you want to check out a version of the project, how do you get that version? You go through the entire history and you sum up all the diffs up to that point, and you have to do a calculation. That calculation has n parts, n being the number of commits in your history. With Git it's an O(1) operation: you just go to the commit, you pull it out, and it's a perfect reference already made. If you want to compare two points, you can do a computation to compare point A and point B. That's how it works in that situation. Do others have questions on the stuff up to this point?

So the question is, what's the size of my repository? Is it the number of commits times the amount of files I have? You would think that, because of what I just told you about how things are stored, but there's really an intelligent layer in Git. I won't get into the commands for that today, but there's a compression layer. What happens is you have this objects directory, right over here. Each of these objects is unique; it represents a unique blob, tree, or commit. However, when you have a lot of blobs that are mostly similar, say you edit one character each time, Git is smart enough to take all those blobs together, run a compression algorithm on them, and put them into what's called a pack file, so that you don't end up wasting space. Actually, it's possible for a project's entire Git history to be smaller than one checked-out version of the project. So for something like a huge operating system, where you have hundreds of megabytes, maybe gigabytes, of source code files in one checkout, the Git repository could
be smaller than that because of the compression. It's a very good question. All right, so now what I want to get into is a bit more about branches and how that works. So we learned about refs, heads, master; we learned that branches are just pointers, and as we make commits those pointers get updated. So we'll make another commit, and we'll go in and, let's modify a file, or let's make another new file. So in here we'll say touch new_feature_2.rb, git add new_feature_2.rb, git commit -m "add another". Obviously, you know, this is not great practice to just write this kind of commit message, but this is just for demonstration; I don't actually write code like that. So, gitk: you can see we have another commit, and our pointer points to that. I'm going to draw this on the board also. So far we've only gone in one direction: you take stuff that's in your working area, you put it into staging, your rough draft, and then you save that as a snapshot into your repo. So we've got something like this: we made our first commit, which was just an empty readme; then we added a new feature, and we made a second commit that had a pointer here; then we had a third, new feature 2, and that points here; and now we have a pointer here called master. Right now, if you want to look at all three of these areas together, what's in my repo (the latest thing there), what's in my staging area, and what's in my working area, they're all equal. So what's a command to compare these things? Does anybody know a Git command to compare? Diff, all right. So git diff: if you use git diff you can actually compare what's changed, but a lot of times people use git diff without knowing what it's comparing. So when you run git diff, what do you think you're actually comparing? Let's try it out. So right now I run git diff, and right now there are no differences. If I touch a new file and run git diff... oh, actually, let's not use that as an example. Let me do this: going in here, example edit, git status, git diff. So I
edited a file that's already in Git, and you can see that something changed. Now what's happening here is it's comparing what's in my working area with what's in HEAD, or the latest thing in the repo history. If I do git add, I'm sorry, new_feature_2.rb, git diff... sorry, with the staging area; it's sometimes confusing for me too. So it's comparing what's in the working area with what's in the staging area. So it was basically saying there's a new line in the working area that's not in the staging area. Then I ran git add, and now when I compare them it says there are no differences. So even though my repository doesn't have that edit that I made, git diff shows no differences. The other command that we haven't seen, that goes the opposite direction here, is git reset. So git reset lets you take things out of your rough draft. So if I say git reset new_feature_2.rb and then I run git diff, now it says we have this one line in our working area that's not in our staging area, and I can re-add it. There's another command, git diff --staged; this is what I was getting to. git diff --staged lets you choose what you're comparing: instead of comparing your working area to your staging area, it compares your staging area to the repo. So git diff would be here, and git diff --staged would be there, so you can see what's in all these different areas. And one of the things that we said was really important about Git is being able to go back in time, right? So how do we go and get what was in our history before, and what actually happens under the hood? So I'm going to just run tree in here, and you can see my project has three files: it has an empty readme, it has new feature, which has some code in it, and it has this new feature 2, which has some example edits in it, which I'm going to throw away. Open up Atom, and I'm just going to go ahead and delete that. So if I say git status... oh, git reset new_feature_2.rb, there. I'm all clear. I've got those three files and I have three commits in my history, but
you're going to see that with git log. So if I go through git log you can see one, two, three. Now one of the things that we said was really good is that we can go back in time and look at what happened in the past. So this command, git checkout: what you do with git checkout is specify a commit, a snapshot of your project, and it updates your working directory to reflect that snapshot. So on the right side, I'm going to close that, and I'm going to run instead watch -n 0.5 tree . so this is the list of files that are in my working directory, updating every half a second. Okay, if I do git log and go back to my first commit, I'll just grab this. I say git checkout and specify the commit identifier of my first commit. What happens in my working directory? All the other files disappear. So what Git is actually doing under the hood is it's going into that history and finding that commit; that commit references a tree, that tree references subtrees and files, or blobs; and it takes what that tree under the commit represents and it makes the working directory on my machine look identical to that representation, to the snapshot in time. That's really scary for some people, because it basically just deleted all my code, right? It just deleted all the code in my working directory, and, like, I don't know what's going on. But it's not scary, because I can use git checkout master and it all comes right back. Yes? So that's a good question. So the question was: I understand that all this data is coming from the .git folder, which is a hidden folder storing all my Git history, but that also exists inside of my working directory, so is it immune to those changes? I would say yeah. The way that that works is the .git folder is special, so you can't add the .git folder to Git; that would be, you know, kind of meta, it would just be an infinite loop of objects representing themselves over and over again. So yeah, Git is smart enough to know not to mess with that.
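The chain described here (a commit references a tree, the tree references blobs, and checkout rebuilds the working directory from them) can be sketched as a toy model in Python. This is only an illustration under stated assumptions, not Git's implementation: ToyRepo, its methods, and the file names are hypothetical, the tree and commit bodies use a simplified text layout rather than Git's real formats, and subtrees are left out. The one part that follows Git's actual rule is the blob ID, which is the SHA-1 of a "blob <size>" header, a NUL byte, and the file contents.

```python
import hashlib


def object_id(kind: str, body: bytes) -> str:
    """Hash a '<kind> <size>' header, a NUL byte, and the body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


class ToyRepo:
    """Hypothetical in-memory stand-in for the objects directory."""

    def __init__(self):
        self.objects = {}  # object ID -> (kind, payload)

    def put(self, kind, payload, body):
        """Store an object under the ID derived from its content."""
        oid = object_id(kind, body)
        self.objects[oid] = (kind, payload)
        return oid

    def commit(self, files, parent=None):
        """Snapshot a {filename: bytes} working directory as one commit."""
        entries = {name: self.put("blob", data, data) for name, data in files.items()}
        # Simplified tree body (an assumption; real trees are binary).
        tree_body = "\n".join(f"{n} {o}" for n, o in sorted(entries.items())).encode()
        tree = self.put("tree", entries, tree_body)
        commit_body = f"tree {tree}\nparent {parent}\n".encode()
        return self.put("commit", {"tree": tree, "parent": parent}, commit_body)

    def checkout(self, commit_id):
        """Follow commit -> tree -> blobs and rebuild the working directory."""
        _, commit = self.objects[commit_id]
        _, entries = self.objects[commit["tree"]]
        return {name: self.objects[oid][1] for name, oid in entries.items()}


repo = ToyRepo()
first = repo.commit({"README": b""})
later = repo.commit({"README": b"", "new_feature.rb": b"puts 1\n"}, parent=first)

working_directory = repo.checkout(first)  # only README is left
restored = repo.checkout(later)           # the other file comes right back
```

Because nothing is ever removed from the object store, checking out the older commit loses nothing: the later snapshot is rebuilt from the same stored objects, which mirrors why git checkout master brings the files back.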
Basically it's everything but the .git folder that Git's messing around with. And actually, under the hood, I don't want to misspeak because I don't know the exact details of it, but under the hood there are some performance optimizations that happen on some operating systems, where basically, rather than hard deleting the files and moving the files and copying them, it's using hard links so that it can be really, really fast, so that operation of checkout can just be boom, like so. But I don't know the full details of that. Yeah, in the back there's a question. Okay, so the question was: what happens to untracked files when you do a checkout? Git tries to be smart: it will never throw your work away unless you explicitly tell it to, with an option like force or something like that. If there are untracked files in your repository and Git can do its job without impacting those files, it will just do it and ignore them. If, for example, you have an untracked file that has a specific file name, and a different version of your code has that file name, and you're trying to check that out, it'll throw an error: it'll say you can't do that, there's a conflict, and it'll make you move it out of the way, or use git stash or something to that effect. The question was: does git rm remove the blobs from previous commits of a file when you do git rm? And no, it doesn't. What it does is it removes it from your working directory and it adds the deletion of the file to the staging area. So if we think about the working area and the staging area, the staging area is a copy: when we add a new file we copy it over to the staging area, but if we want to delete a file we have to copy the deletion over to the stage. So I'll demonstrate that, because that's kind of a tricky one that people don't always get. So in my case, first git status, we're all clear. I can do ls, there's three files. I'm going to delete this new
feature 2. So there's a command, git rm, and I could use that, but I'm not going to, because that's kind of a compound command: what it does is a few different steps. I'm going to do each step. I'm going to use my operating system's rm command: rm new_feature_2.rb. I'm going to delete it, and I'm going to run git status, and what you see is that Git reports that the new_feature_2.rb file was deleted, but it will say the changes are not staged for commit, right here: "Changes not staged for commit: use git add/rm <file> to update what will be committed". So my editor, my command line, my IDE, all that stuff just operates on the working directory; it doesn't know anything about version control, necessarily. They might have add-ons, but broadly speaking they're just working on your working directory. So when I deleted it, I removed it from the working directory, and if I do git diff, what happens is it'll say your working directory deleted the file but it exists in the staging area. So what you need to do, and this is counterintuitive, is you need to add the deleted file to the staging area. So what you do is git add new_feature_2.rb, git status; now it says changes to be committed: deleted file new_feature_2.rb, and that's in your staging area. So when you make a commit, the referenced tree of that new commit will no longer contain new feature 2. You look very confused by this. It's added as deleted to the stage in this case. And to get it back, I can use git reset new_feature_2.rb... mmm, maybe not... git... no, I can git checkout -- new... so, yeah, no, okay, I don't know how to do it, I forgot. Anyways, I could get that back. And what I wanted to demonstrate was doing git rm and then the name: if I use git rm on new feature, what will happen is, git status, it moves straight to this deleted status. So git rm makes it simpler, it's more logical: you just do git rm, it deletes it from your file system and then it stages it, all in one step. So that's the logical thing people do. The
reason you need to know about git add of a deleted file is, say for example you use a build tool, and the build tool accidentally checks some things in and then it removes them, but they're showing up as deleted files and they're not staged: you need to know how that works. Yeah, so let's run a command, git reset --hard HEAD. What this does is it basically says throw away everything that's not committed and go back to the latest version of the stuff you were working on. So I'm just going to go back to the point where we're here with three. So yeah, we learned about checkout and how it works and how it updates the working area; we learned about the structure of a commit, blobs, trees, and the commit objects themselves; and we learned about how commits are related to each other via a parent relationship, and there's a pointer that goes through all of these things. The really important thing to take away from all of this is that every bit in Git matters: every single bit, down to the very bottom, finest-grained thing, impacts every other thing. So sometimes people will ask, oh, I committed this really large file to my repository, can you remove it from my history? And the answer to that is yes, but what happens is, when you remove something from the history, like we said, all these objects are immutable, so the only way to remove something from the history is actually to create new objects that lack the thing you want gone and use those objects instead. But by definition, the way it works, all of those objects are going to have new identifiers. So basically what you do is you rewind history and then you rewrite history as if the thing didn't happen. So I'm going to show you how that works. So we have git log, and we have three commits in our history. Let me do a couple more. ls... def bar, puts... let's go git add new_feature.rb, git commit -m "add bar method", and then go back here and I'll do one more: def baz, puts... All right, so I'll
just run gitk. I just did that so I have more commits to work with. So say, for example, we're in this situation: we have all these commits and there's this file called new feature, and you really just want to remove it from all repo history. How could you possibly do that? Well, what you can do is use this command called git rebase. And what git rebase does, conceptually: our situation is this now, and master is here, and so what we need to do is we need to go back to the first commit where that new feature was introduced, which I think was this one, and we need to create a parallel universe where there are different commits, and each one of these commits now references the one before it, and this one references the original commit. And, say this is A, we'd call this A prime, and, you know, B, B prime, so on and so forth. And what we need to do is go back through history, grab these, and create new commits that lack the features that we want to remove. So I'll show you how to do that. So there's a command, git rebase, and what it takes is an argument of the commit to go back to in your history. So git log: we're going to go back one, two, three, four commits, and there's five in the history. So we'll do git rebase -i HEAD~4. What that means is go from where we currently are, our currently active thing, and go back four commits from now. Excuse me, yes? The question was, can you use just a hash? Yeah, this is a shorthand so I didn't have to copy and paste that big hash or look it up. I could also do, you know, master~4. Actually I think I want three, but anyways, it's the same thing. This is what's called a commit-ish in Git terms: basically it's something that represents a commit ID. So you could parse this to look at master, go back three steps, get the commit ID of that, and substitute it. So I'll do that, and it will open my editor and it will give me a list of things that happened, so it'll give me all the commits. So actually I did
choose the wrong number, so I need to go back to four. So here we go. First we have "add new feature"... I'll use this one so the streamers can see. We have "add new feature", that was one commit; then we have "add another new feature", which we want to keep; and then we added these two methods to that first feature. So what we want to do is essentially get rid of those commits, rewrite history, and just remove them. You could modify each individual commit, like say I did a bunch of other work in those commits; that's a little complicated for a quick demo. But what I'm going to do is I'm just going to delete this commit, delete this commit, and delete this commit, and I save the file and I close my editor, and what you'll see is down on the bottom it briefly said "Rebasing (1/1)". And if I run gitk, what you'll see is I only have two commits, and I have my readme and I have my new feature 2. So my new feature file is just gone, deleted from the repository. However, the new commits that I created... so I only have A, and I now have A prime. So I have the original commit and A prime; all those other commits I just kind of threw away. Just a second, I'll get to you. So we were able to remove that stuff. However, if I do git log, my first commit, which was just "create empty readme", that commit in our universe is the same as it was before, nothing changed, but this new one is a totally new commit. So if we go into our objects directory, what we actually have is something like this, where master no longer points there, and we have basically just master here, just pointing at that. So what's interesting, and you'll notice, is I didn't delete those commits from the whiteboard, I didn't erase them. That's because they still exist on disk on my machine. They're not referenced, so after some amount of time, which is configurable in the advanced kind of config stuff for Git, they'll be garbage collected. Also, if you do a network interaction, they will not be transferred in a network interaction
because they're not referenced, they're not useful. But because Git's a version control system, it wants to be able to recover from mistakes. So say, what if I made a mistake on my rebase and I actually deleted the wrong file, and now my repository history is gone? Well, it has this really handy thing, and this will be the last thing that I cover before we wrap up, which is, I like to call it version control for your version control: it's the reflog. So what it is is a version history for each of the references in your repository that tells you what the status of that reference was. Okay, so in our case we have a master branch, and previously the master branch was here. The master branch is not one of the core objects; it's mutable. So if I copy this repository to GitHub or to a colleague, they'll have their own master branch that can independently be pointed to something else and move around as a bookmark within their repository. All of these objects are immutable and will be the same and can't change, but these can be changed. So Git keeps a history of every state of all of these refs, and that's called the reflog. And so if we want to undo the rebase, which would essentially say move this back here, like this, the way to do that is with the reflog. So I'll demonstrate that and then we'll wrap up. So if I do git reflog, you'll see here the history: I did a rebase, so rebase finish, rebase pick, rebase start, and then the last place where I made changes was HEAD@{5}, with these funny brackets, which is when I added the baz method to my class that had the code in it. So I'm going to take that and I'm going to go to my command line and use git, and keep an eye on the right-hand side here as well. Actually, I think I can do this at the same time too, maybe: gitk... so I'll put this here, and I'll put this here, and so you'll be able to see all of this update at the same time, I think. So git reset --hard HEAD@{5}. So what this is saying is take my
master branch and put it back to where it was five steps ago in its history. Okay, and what you'll see is my file called new feature will show back up on here, and this commit will go away and the other commits that I deleted, you know, three or four, whatever it was, will come back. So hopefully this works. Okay, it showed up here. I think I need to run gitk separately; this doesn't update in real time. Yeah, all right. So now we've undone the rebase, which was mucking around with our history, and if I do this one more time and this time just do one, I can undo the undo, which will put us back over here, and then I can go here and say gitk, and now we're back. All right, so that's everything. Thank you all very much for coming. [Applause]
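The reflog idea from the end of the talk (commit objects never change, while a branch is a mutable pointer whose every past position is recorded) can be sketched as a toy model in Python. This is a minimal sketch under stated assumptions, not how Git stores reflogs: ToyRef and its methods are hypothetical names, and the single-letter commit labels stand in for real commit IDs.

```python
class ToyRef:
    """Hypothetical mutable branch pointer that remembers every position."""

    def __init__(self, name, commit_id):
        self.name = name
        self.log = [commit_id]  # newest first, like name@{0}, name@{1}, ...

    @property
    def target(self):
        """The commit the ref currently points at."""
        return self.log[0]

    def move(self, commit_id):
        """Point the ref somewhere new; the old position stays in the log."""
        self.log.insert(0, commit_id)

    def at(self, steps_back):
        """Look up where the ref pointed `steps_back` moves ago."""
        return self.log[steps_back]


master = ToyRef("master", "A")
for commit_id in ["B", "C", "D"]:  # three ordinary commits move the pointer
    master.move(commit_id)
master.move("A-prime")             # a history rewrite points master at a new commit

before_rewrite = master.at(1)      # where master pointed before the rewrite
master.move(before_rewrite)        # undoing is just another move of the pointer
undone = master.target

master.move(master.at(1))          # one step back is now the rewritten commit
redone = master.target             # so the undo itself can be undone
```

Every move, including the undo, is appended to the log rather than erasing anything, which is why the same lookup one step back can reverse the undo as well.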
Info
Channel: CS50
Views: 26,306
Rating: 4.969512 out of 5
Id: lG90LZotrpo
Length: 57min 39sec (3459 seconds)
Published: Wed Apr 11 2018