Advanced Git: Graphs, Hashes, and Compression, Oh My!

Captions
[Music] All right, thank you very much for having me back. I have 50 to 60 minutes of things that I want to share with you, and this is a slightly different format than some of my typical presentations because it's very command-line driven, but what I want to show you is actually kind of an abstract concept, so I have a real dissonance that's playing here. The reason that I want to bring this to the Java users group, when there's actually little in terms of Java syntax, is that the Git tooling has become, even in the last couple of months, ever more important to Java people. In fact, we ran some statistics the other day (my colleagues are with me here from GitHub), and Java is just off the charts in all of our repos; we ran some site-wide statistics. The other piece that goes along with this is that Eclipse, as a foundation, is now 60% moved over to Git as a repository. So in fact this tool is everywhere in the Java ecosystem.

What I'd like to do is take an opportunity to step back to computer science, all the way back to, dare I even say, your university or deep technical-level training, and tell you a little bit about how this system is constructed, so as to make the day-to-day usage far less intimidating.

I have very little that I want to say about myself; I like to keep this as low-key as possible. I'm Matthew McCullough. I work for GitHub, which I certainly believe is the best job on the planet, and the thing that I am here to talk to you about involves the GitHub system but is more focused this evening on the Git tooling. When you bring Git into an organization, you'll often bring up the README at some point, at which point, if it's not utterly rejected by the mere fact that someone would put this kind of thing into the README for a product, you then win, because you get to use the best version control system that I've ever used in my career. I've only worked for GitHub here in this
calendar year, but I've taught Git for five years, and for much of my 15 years of doing Java software development (that is who I am at the core) I've really struggled with version control, because it's been one of those things that's supposed to enable you to do more stuff but has often been the thing that's held us back. That changed when I started to use Git, and one of the things that it enabled is a massive amount of collaboration. This is a graph that I actually don't believe has been shown publicly before. This is a contribution graph from last week; this is github.com being built, and these are the simultaneous actions of the people on our team. This graph has been partly public for a week or two now, and this is the number of deploys and commits that we do. You say, what does this have to do with the technology that you're actually showing us? But this is really the product of what using Git gives you: collaboration, ease of deployment, all these flexible build and work models.

But I want to go to the opposite end of the spectrum, because it's rare that you actually get the opportunity to do so. We often treat programming as this ultimately simplistic kind of activity that is just sitting in front of a keyboard and making sure you type the right sequence of characters, but there's an awful lot of math and science to this, and that's the hat that begins right now.

So here's the beginning of all of this. We use a centralized version control system, and in that we get the simplicity of a database giving us a monotonically increasing revision number: r1, r2, r3, r4. You may have even heard me use that line before. But when we switch to Git, we use a hash of the content itself. I often say to people who are new to this domain that an easy way to think about it is that it's a fingerprint. It's an identifier for the content inside the file, inside the folders, inside the commit, and also for tags: four primary things that are hashed here. Now, these forty hex characters are, on
the surface, if you were to just treat them as this thing that you have to refer to in reference, terrible, horrible to actually cope with. If you were talking to colleagues and trying to use this as the means of "how are you coming on 9ab223?", it's impractical, to say the least. Now, what they represent is extremely useful, and what I need to give you is a bridge between this unwieldy number space and something that's manageable to hand from person to person. This digest on the content is globally unique for all intents and purposes. Physicists would even agree that two to the hundred and sixtieth power, which is our number space for this, is a globally unique identifier, so we'll treat it as such for the rest of the talk.

Now, you see that thing, and you can imagine, with me teaching introductory classes on Git, that after you show this slide half the class leaves, and the ones who stay cannot believe that this is the new way that we identify individual commits. Neither could I, until I saw the science behind it.

We are going to create a Git repository without ever using git commands, and I believe this will give you a newfound understanding of the system. We are going to head over to our terminal window, where we're going to do our bit of coding tonight, and I'm going to git init project1. Now you say, wait a minute, you claimed you were not going to use Git to actually initialize this. But hold on, I'll be truthful about this: I'm going to show you what's in this .git directory, so that you see this little setup of a set of folders and directories will have no real impact on what we create for our commits and our content. The piece that I'm most interested in is the .git directory that I'm showing you, the thing that contains the version-controlled objects that we're going to write. At the moment it has this objects directory that is empty: info and pack, two small directories beneath there, but no files whatsoever. And I'm going to continue to
repeat this tree command, and we're going to look at what appears in that objects directory and make sense of it.

So our next step is where I actually get to eschew using anything about Git at all. We're going to write some content to a file, so I'm going to do just a simple echo, and we're going to wrap it in quotes, and we're going to very carefully make sure that our capitalization is just as it should be, and we're going to write it to a specially named file over here that we call hello.txt. Now, having written that file to disk, Git does not yet know anything about it, and what I need to do is simply use the git add command, which causes that to be digested. We're taking kind of an approach here of using it a little bit, and soon enough not using it at all, to add that to version control. But what does that really mean? That means that if I show you the .git directory once again, we have taken the content of that file and run it through the SHA-1 digest function and determined that the individual fingerprint for that file's content is 557db03de997.

Okay, all right, what is this folder? What does this thing actually represent over here? What's written out to the directory? Well, I can show you. I can list that file that's now preserved, but I can also manually recreate this same structure. Now I'm going to go one more level further with a git commit, so we'll have two more objects created in this directory. That one represents the file's content. Next we need to represent the directory, and then also the commit: the action, the transaction, the wrapper, the shell that sits around the folders and the files beneath it. We're going to supply a commit message over here, and we're going to carefully type it as "first hello", exactly like so. And this commit that we've now made has saved a couple more files that you'll see with the tree command. We now have three of them: one for the file, one for the folder, and one for the commit that represents the transaction
that encompasses both of those previous two steps.

Now we've written this to disk, and the next step, which I think is a lot of fun to show off, is to actually take and run the contents of a little string over here, a specially formatted string, through a command-line tool that produces that same hash code. Now, in this line of code I'm not running a Git function anymore, and yet if you do a little bit of comparison over here for just a second and look at the content, I see 7f and I see 0, 4, Echo, Delta, Charlie, and so on. I see the same kind of identifier that I saw when we actually wrote this out to disk.

Now, this printf that we just did here, this "blob", this "12", "Hello World" with a character at the very end, is actually giving us a few bits of information that are also in those encapsulated pieces in the objects directory. This is the type of thing that was written to disk. A blob means a singular file. It doesn't actually have a mode for binary versus text; everything it writes is simply just a blob. The next is the number of bytes, and you say, well, we could read that by actually looking at the contents of the file, but it's an optimization that it actually stores the length at the time that it writes it. It's kind of a cheat, a shortcut if you will. And then we actually have our null byte over here, our actual characters, and a null terminator on the same. Running it through this function, running it through this digest over here, has provided us a means of calculating SHA-1 hash codes without Git.

Now, given that we've run that through, we have a question of matching that up with the exact content, the Hello World, and the file and directory structure that we had created before. Well, I'll do this: I'm going to actually route some content in from standard in, kind of cheating, halfway between the Git-less form and the Git form that I used before. I'm just going to take an echo statement in this case, and I'm going to put Hello World over here, with a little
improvement to the quotes; my shell could be slightly unhappy with that particular choice of quotes. And we're going to git hash-object with a -w, and we're going to say that we're going to get the content from standard in. Now, with some of these modes you wonder, why would Git actually provide this low-level function, which it appropriately, by stratum, by layer, calls the plumbing? There's the porcelain on the top, the user-facing functions, and the plumbing, the ones that function beneath it. Well, this plumbing piece over here allows us to actually write the object to disk, and --stdin in this case says get the content from what I'm piping from that first shell function there.

So when I run this instruction over here (let me make sure I've got my standard in and all of my quotes appropriately as they should be), we're writing this out to disk, running this. But wait a minute, wait a minute, I am clearly missing something: I'm missing a dash over here at the very end. --stdin, much better than the em dash that appeared. There we go. So finally we've produced, from an echo stream, the ability to write one of these same files, this 55 that we're getting here on our disk.

Now, what did this do to our tree of .git? So far we're just kind of feeling our way around; we're feeling the walls, seeing what we can see. But wait a minute, wait a minute, wait a minute: that 5, 5, 7, Delta, Baker was already there, and you saw that this produced exactly the same output when I wrote it. So what we've now done is show that Git actually doesn't have any magical recipe. It's not a specific flavor or an incantation that the tool itself came up with. What we've discovered, the first little layer that we've peeled back, is that you can actually just use standard commands like echo, or even build your own kind of conglomerate function with SHA, and you can digest the contents of these strings and end up with these unique fingerprints. That's the thing that we want
to recognize at this point: you can produce these fingerprints just from the contents of strings or the contents of bytes.

Now, given that we've gone all the way into that, and we've said it's written to disk, and we've even seen where it is in the file and folder structure, could we just read this back from disk? I mean, could I take the cat function and just read back the 5, 5, 7, Delta, Baker directory's file over here? Could I read that back to screen? Well, that does not look quite like what I expected. It's not as naked or as exposed as the other layers that I've shown. So what happened here? Well, what we actually got in this case was some zlib compression of the file that was in that directory. But you know what, we can actually set up our own little Perl alias and get the same kind of thing, to be able to decompress in the absence of Git. I can set up an alias over here with a perl -MCompress::Zlib, a little Perl function that I'm defining. It will actually be a little bit clearer: what I'd like to do as the second step to this is to have it deflate. That's a little program that I just set up here. "But you're using Perl?" Well, I wanted to be polyglot this evening. So I now am able to deflate the object that's hanging in this directory.

I hope ever so slowly it's becoming just a little bit and a little bit and a little bit clearer how these are put together. Just a little bit at a time it's starting to sharpen that these things are just decompressed, or compressed, strings that we can deflate. And look at that pattern; you recognize that from a few minutes ago: blob, 12, Hello World. Do you remember that? You're starting to be able to put the threads of these pieces together. You thought they were kind of scattered all over, and you're starting to see the same thing show up and show up. "Oh, so it's just a 'what is it' with the byte count and the content in the file?" Correct. "And the file name itself is the fingerprint of the unique content?" Correct.
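The steps walked through so far (build the `blob <size>\0<content>` string, SHA-1 it, compress it with zlib, and read it back) can be sketched in a few lines of Python. This is a minimal illustration, not Git's own implementation. It assumes the file content is exactly `Hello World` followed by a newline (12 bytes), which is consistent with the `557db03de997` prefix quoted in the talk. The helper names `blob_object`, `object_id`, and `loose_object_path` are made up for this example, and the null byte is only the separator between header and content; the sketch does not append anything after the content.

```python
import hashlib
import zlib


def blob_object(content: bytes) -> bytes:
    """Build the uncompressed object body: b'blob <size>\\x00<content>'."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return header + content


def object_id(store: bytes) -> str:
    """SHA-1 of header plus content, as 40 hex characters."""
    return hashlib.sha1(store).hexdigest()


def loose_object_path(oid: str) -> str:
    """First two hex characters name the directory, the other 38 the file."""
    return f".git/objects/{oid[:2]}/{oid[2:]}"


# Assumption: the file written with echo held "Hello World" plus a newline.
content = b"Hello World\n"
store = blob_object(content)
oid = object_id(store)

# What would be written to disk is the zlib-compressed form of `store`.
compressed = zlib.compress(store)
# Reading it back is just zlib decompression, no Git required.
restored = zlib.decompress(compressed)

print(oid)
print(loose_object_path(oid))
print(restored)
```

Nothing in this sketch touches a repository: it only shows that the identifier depends on the type, the byte count, and the content, and that the file name never enters the calculation.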
And the first two characters of those 40 characters are the directory name, and the remaining 38 characters are just the file name itself. Done. I've literally explained how every Git commit works, and the only thing that I've left opaque at all in all of this is actually just the compression library itself. You could even install your own zlib command-line utilities; you could use this from Ruby; there are Java equivalents of the same. But that's the only small piece that I've left, and it simply looks like zip compression as you're putting it into and out of that file, for a little bit of efficiency as you write it to disk.

Now that we've seen that piece, I actually want to go one step further and say, well, then how would Git actually write out the tree, the next most important piece here? Well, that would be done through another plumbing command called update-index. We can say update-index --add --cacheinfo, a UNIX file permission over here, 644 (we recognize that), the 100 being a normal file in Git, and then we have over here the hash of the file that was already calculated, and the file name. Now, why does it matter that we write it in this form? And why does it matter that when we're writing the directory, the tree, the folder that our source code is contained in, we supply the file name and the hash then?

Well, this takes us, for our first time actually, back over to slides for just a minute, because everything that we've done so far can be turned back into what I believe is a very, very helpful picture of the data structure itself as we're building these things from top to bottom. Some of you may have already seen a graph or a picture like this that I use often in my talks. We calculated a digest of the file's contents, but what you saw at the lowest level possible is that the file name actually didn't matter at all at the time that the file's content was written to disk. This is one portion of Git's efficiency, in that if you copied and pasted the exact contents of a file, it
really only gets written to disk once, but then gets pointed to by any number of directories that happen to use that content. You saw that firsthand, because there was no remnant of the file name itself in the first little object that we wrote. It was (wait a minute, let's go back to it to be doubly sure) over when we were writing the directory names that we actually assigned content to be paired up with a file name. It was at this level that those two things were finally glued together.

Back to the picture again. As we start to build this up, the picture has exactly the same concepts. Up here we have the type and the hash and the file name stored at the tree level, which we commonly say is the same as a directory, and these things of course can be nested one beneath another, all the way until we get up to the level of a commit. This is not a varargs field at the top; it is a singular field for the top-level tree that it points to. The parents are a vararg field: you can have more than one parent, and that's a merge. But in this case the tree, the file and folder system, is a singular field; it can only point to one. And this is why the .git directory lives at the top of your project structure. That is the singular directory that captures everything else beneath it. It has to be one at the top of the pyramid that's pointed to by each and every commit.

But the commit's hash itself is what? Since we've been looking at the constituent parts: the composition of the commit is a pointer to the tree, so it knows what content, what files, what file names belong in there, but also a pointer to parents, also encapsulating the content of the author and the committer as separate fields, and the commit message itself. It's all of these pieces glommed into the final little bit at the top of the tree.

If we look at our tree of .git again, does Git actually give you any useful commands that you can go at and act on those things with, without whipping up Ruby or Perl or Java scripts to actually touch
those? Of course it does, and I actually think that you'll find a reasonable use case for these in your day-to-day use of Git. Git provides a function called cat-file. So now we're getting back to the fringe of a useful utility, back out of the land of computer science, back to a practical sense. cat-file can actually serve you reasonably well if somebody gave you a hash of the last deployment to the site. Let's say that we happen to know that we could get a list with git log: 4, F, Delta, 1, 0, that's our most recent commit's ID. We can git cat-file and ask it what type 4, F, Delta, 1 is in this case, and it says, "Why, Matthew, that is a commit." And I say, okay, I'm going to repeat the same instruction, and this time, cat-file, I'd like you to give me the contents of that particular file that I'm reading from the objects directory. And the result is, oh, it's funny how that matches up exactly with the diagram: a pointer to the tree, the author, the committer, and the commit message. You're now viewing the composition of these individual parts that are calculated as that hashed sum for the commit.

Now that we have a pointer to the tree, do you remember that 9, 7, Baker, 4? We just saw that as a result of running that instruction. Let's try to repeat the same thing. Let's ask it what the type is of 9, 7, Baker, 4; that should be sufficient. Oh, that's a tree. Well, no surprise there; you already knew that before running the instruction. But we are interested in what the content of it is, and here is where that information was captured when we did the write-tree. It's got the mode over there: the 100 representing a standard file, 644 being the user write and other read privileges on the file; blob being that it's a file (it doesn't care if it's text or image, it's a blob, not a tree or a commit); there's the identifier of the content itself; and there's the representation of the file name at the end of the line. There's all your
pieces put together in a way that you might never have seen if you hadn't gone down to the bottom of this stack, but it really seems kind of simple in hindsight. It's just composition of files, folders, and commits, each with their unique identifiers at each level of this data structure.

Now that we've built this out piece by piece, there's an interesting take on this of, well, what do I do with it, though? I have this knowledge, and maybe it's kind of computer science, but what basis does this give me for using it better, for trusting it more, for perhaps even switching to it? Well, now that you've seen these three pieces, I think this example that I often use of damaging a repository becomes a lot more meaningful.

I'm going to copy recursively here for just a second the project1 that we were just working on, and I'm going to call it project1-damaged. I'm going to go into this damaged repository. It's not damaged yet, but we're soon going to do our work on it. And we're going to, once again for the sake of reminding you, look at that most recent hash: 4, F, Delta, 1, 0. git status is very happy. git log is very happy; it's showing me the commits over here. But I'm going to take vi and go into the .git directory, and over into the objects folder, and over into 4f, and complete the file name, and at the very front of this insert a character that effectively makes it incongruent, incomplete, unhappy. So I'm going to write that out to disk, and then I'm going to drop back to the command prompt, having written a new character into that file, and I will run git status, at which point it says, no you don't, this repository is no longer valid.

Now, I did that on an extremely small one with just one commit. It's very shallow, but that made the example nice and simple and clear. The same thing has value on a large repository with a great deal of history. You can run git fsck to actually do a filesystem check on that repository, and with a few option switches you can actually
have it recalculate everything with --full, and if you want to change the rule set so that it doesn't allow some of the older misbehavior that was present in some ancient editions of Git, you can also have it be strict about the things that it finds, with --strict. You could even set this up as a cron job if you like. We have this as part of some of the back-end infrastructure at GitHub, but this is the kind of thing that you could run on your own repositories. You could set up your own job to health-check them.

And this is giving you a guarantee (I think it's almost hard to figure out how much value this is) that every bit of every commit ever written to the repository, from the beginning of time until now, is in its proper place, without any bit rot or malicious manipulation. This is making sure that all those checksums, recalculated from oldest to newest, still all match up. And you know how that happens, because they're simply linked one to another. You saw that top-to-bottom picture of a single commit, but then how would a commit be linked to its neighbor? The answer is incredibly simple. The very first commit has a special marker in the parent field called nil, but all the subsequent ones past that inception, original commit point to their neighbor, and this linkage from one to the next to the next would be damaged, would be broken, and in a way that you can't really trick it to match back up.

If you write content, you say, "Well, wait a minute, I'll take one of the .java files and I'll put some sort of backdoor into it." But remember, when Git runs the SHA function, you're not controlling the output. That's a fingerprint of the contents. Oh, drats, you can't control what the fingerprint will be there. And the fingerprint gets written into the tree data structure that represents the directory, and when that is SHA-summed, it's a listing of all the content beneath it, so that will then turn out to be a different hash if
you made even the slightest change to the file, and that will propagate all the way up until it breaks the linkages between these, because the hash of the commit is based on the hash of the tree, which is based on the hash of the files. You simply cannot maliciously craft, in a reasonable amount of time, a file that would be a replacement for the existing one with, say, a backdoor placed in it. That is a very fascinating thing that it gives you.

But this is kind of what you get for free. What do you actually get as far as workflow and behavior from all of this manipulation of commits and trees, and the merges that we often do as part of a software development process? Well, some things that were increasingly hard as team scale grew and grew in a version control system are incredibly easy here, because they're baked into the data structure, which is why I feel completely comfortable saying there's value in knowing how it works.

When you have divergent branches (you're working on a feature and the main branch at the same time, the master branch) and you go to merge those back together, you often don't stop to think, how is it finding what actually needs to be merged? But the answer here is baked into the data structure, simply, to the point that we say it is finding where these two commits, this commit and this one, have the same parent. In fact, I can generalize that statement to say the merge base is where two commits have the same parent hash, the same identifier. That was where they both had the same common ancestor, and that's where they started to diverge in different directions.

In fact, if we try this from the command line. You say, is that visible? I mean, is that just something that you're showing us, or is that something that you can actually see from the command line? You can see it quite well. If I do a git branch over here (and I go back into the healthy repository instead; that will be a wise choice on my part), I will clear off the screen and start a git branch that I'll call feature number
one, over here to the side. And having created that branch, I've got a feature1 branch that's hanging off to the side. Let me show you the list of those branch names there, like so. And now that I've got the split, I'll make some random changes on each of these two. I'll do two random master branch changes, like so, and I'll git checkout over to the feature1 branch, and I'll do two more feature .txt changes. Now I'll visualize what's going on here with those two: we've now got two arms splitting in different directions.

Now, I am currently on the feature1 branch, and I could ask it in very simple form, Git, please give me the (and yes, merge-base is a top-level command) merge base of this and, say, master, I guess. And what are we trying to actually do here? Wait a minute, that's insufficient: I need to give it two arms for it to consider. What is the merge base between, aha, master and, let's say, feature1? We're actually giving it two inputs, and I'll relate this to how you type the merge command in just a second. But look at that: there's the common ancestor. That simple, simple algorithm is being run: race backwards from these two points until you find a commit where both of these two racetracks are pointing at the same common ancestor. End of algorithm, end of loop.

That now gives us a really clean graph as to what could potentially be merged together, and it is a basis for why Git is easier at merge time. You're not supplying a long list of revision numbers, and in fact, for people who are using Git at their jobs, you never really have to type this. This is happening for you at the time that you, perhaps, git checkout the master branch and git merge in something like feature1. People often think, well, it always has kind of a context sensitivity when I type a git merge command, and you're right. In fact, it's kind of like that merge-base, where I ended up having to supply two parameters instead of one. This is saying, well, I'm going to assume the one you're on, because I need a current working
directory in case there are conflicts or things you have to resolve, so, simply for an optimization reason, it just has you supply the other branch that you need to merge in. But what is less commonly known is that you can supply as many branches as you like here. Wait a minute, wait a minute, that seems to complicate things. But no, it doesn't, because the algorithm that I just described scales perfectly well: walk back the paths on all the branches given until we find a common ancestor for all of them. It simply is a couple more checks, two more checks in this case, but the algorithm remains absolutely consistent. It's just how many arms we're checking at each point in time. And I think there's almost a comfort level in knowing that it's not quite as fancy or as black-box as you might have assumed. It's a pretty simple algorithm that produces a pretty simple outcome, which means a lot more automatic merges that we don't have to deal with in terms of conflicts.

If we can describe these things in terms of these soft reference names, these pointers to branches, to tags, what more could you possibly derive from this? Well, in fact, I'll show you something that I've actually never seen in a presentation before in terms of Git syntax, but that has been present for quite some time. You all may know the very basics of commit-ish and tree-ish, but now, rather than laughing at these names, you actually know where they derive from, because the things that I showed you we wrote, three actually, are the blobs, the trees, and the commits. So you see that these are just reference notations for pointing at these long (remember, annoyingly long) 40-hex-character identifiers.

So why was I able to get away with using as little as four in these cases? Well, in fact, this causes me to introduce one more command to your vocabulary. Inside almost every git command is a first run of rev-parse before it runs what you asked it to do. rev-parse takes any kind of short hash, let's say like the 4,
Frank, Delta, 1 that we had from before, and turns it into its nice globally unique identifier. It's kind of the same way that in most programs we might reference variable names, and those are translated to memory addresses. That was a human convenience, that we could give it a friendly name or some small part of the identifier. This is ultimately what the system, the code, actually wants for referring to any given commit. So in fact, whether you're running merge, whether you're running cherry-pick, whether you're running show or log or even rebase, it is turning your short hashes, your friendly names for branches and tags, and small little bits of commit-ish, into their long counterpart as the very first step.

Now, there's got to be some sort of compromise, it seems, between this shorthand form and the longhand form, especially if you're going to start describing and passing it around, and also giving some relative position, because you don't want to be pedantic about having to apply a tag of "I just left my computer for coffee, number 21 for the day." It's just unreasonable to expect you to put a tag every time that you walk away. So there's a git instruction that actually helps out a lot with this. I'm going to generate a few more commits, generate some random changes, two more, four more, sample .txt files, like so, and I'm going to ask it to please describe the most recent commit. This describe command is actually really pretty helpful, because what it does is walk back on the graph and find the nearest annotated tag that points to this, and then give you a relative position to that tag. But what I seem to get is a complaint about this; that didn't seem to give me anything helpful at all.

There are three modes for tags, and one of the benefits that I'd like to see you walk out with here tonight is that that tag that I ran just a moment ago is probably something that you start using a lot less after tonight's presentation. There is an equivalent tag
command with the -a, little a, that I would assume should become your default. This is an annotated tag, and to be honest it's really more a real tag than the one without the flag is. An annotated tag is a first-class citizen. And in fact, it would be helpful if... I'm going to Control-C and not write that out, and I'm going to git gc to garbage-collect my objects directory. I'm going to tree my .git/objects folder to show you what I have hanging in there for the moment. I want to show you a before and an after. I want to run a normal tag, so we'll just say "normal tag" right here, and I'll run that tree of the .git/objects directory again, and there's no change before and after. I wrote a tag, but nothing got written into the objects directory. Why? Because the form of tag that the default writes is just a pointer to an existing hash; it doesn't have any of the integrity, the contents, the who, the what, the when, the where of a normal commit. An annotated tag, the one that I was trying to run prior to that, git tag -a (the a stands for annotated), is actually writing an object. So we'll say in this case "sample with annotated", like so, and having done that one, what you'll find is that up pops my commit message editor. This is my place to be able to type "this is a tag to show to my SF JUG friends", and save and close. The before and after is what you're supposed to take note of in this case: now you see that an object was actually written to that objects directory. This is what I consider to be my go-to tag command, because inside here... wait a minute, are you telling me this would actually work? Remember that thing that we used before, -t for type? Could I ask it what 0188 happens to be? Yes, I can, and it tells me it's a tag. Well then, can I ask it what the contents of that tag are? Why, absolutely you could: there's an object, a type, a tag, a tagger (who did it), and then lastly the commit message. Now that looks an awful lot like the commit that you saw from a while ago. It is in fact about
the same as the commit-writing code, but the only thing that's missing over here is that it doesn't point to a tree. Why not? Because the tree is already preserved in this commit. We have the state already saved; we want to point at an existing state and describe why we got to that state: we cut a release, we were rolling back, that was a trial run that we pushed out to production. That's the purpose a tag serves. So it is an indirection to point over at an existing commit. When we ask git rev-parse (we're starting to glue these commands back together) if it would be able to tell us anything about "sample with annotated", it of course converts it over into the first-class citizen, like so. But if I were talking about one of those other tags from before, so let's see what other one, just like this tag, like that, that actually points at the commit itself. Let me say that once again: the rev-parse of the annotated tag says "here is your actual tag object", because there is such a thing in the .git/objects directory. The rev-parse of this other lighter-weight, dare I even say of-minimal-use, tag form is actually pointing over at the actual commit. "Wait a minute, can you prove that to us?" Of course, by going right back over to cat-file again: please show me the type of Delta - - Charlie (you're seeing that number come from the bottom little element there). Ah, indeed, that's a commit. Whereas if I try to do the same thing on the 0188, but you already knew that, that's actually pointing at a tag. So what I think this helps you do is take all of this stuff that's happening beneath... I find that people use git far more effectively when there's a sense, even if it's not your daily knowledge, even if you have to flip back to the slide deck, that it seems a lot less special or amazing as to what's happening beneath. It's simple file manipulation, and you can in fact, with the file system, see exactly where it's writing these things. Now you say, well, that's one small tangible bit to this, but I think it really
really gives you a grounding in the language of this tree-ish and commit-ish that you can use to navigate this graph. The caret is the most common navigation character that we teach in early classes on this kind of thing with git, because that's one commit back. So you now remember that the data structure is a linked list, so we just start at the end, or whatever place we've identified, and go one step prior or two steps prior. But that's pretty tedious, so we could go five steps prior by changing that symbol over to a tilde rather than repeating that caret symbol. We can also use the same thing to constrain two elements in the graph, so you could say "I'm interested from here to here". You could certainly supply these unique identifiers, and they're going to be rev-parsed into their full form, but you could also use a tag and a tag, or you could even use a branch and a branch, as either of these two places, because ultimately they're all the same thing: they're just pointers to nodes in a graph, and you're talking about walking each of the steps between these two points. In fact, that's what a lot of git operations are: here's a starting point, here's an ending, find the shortest path between the two, walk, done, end of algorithm. In many of those cases we also have some nice symbolic words, things like HEAD, that can also be combined with caret operators and tilde operators, and then can be mixed and matched, even to the point that you have the prior-prior-prior as part of a range. We have the branch names, we can merge, we can also use those symbols with the branch names, and we can do it longhand or shorthand. We can use longhand like that, but it doesn't make us write that; it'll allow us to use the shorthand of things like this. Why can we be sure that it's pointing at the same thing? Well, in fact, if I have a remote repository, this can turn out to be quite useful. If I do something like a git clone of git@github.com:githubtraining/hellogitworld, like so, and clone this down to my
local disk. So there we go, we just retrieved a whole repository. I can ask it to git ls-files and see what I have here locally, and see a listing of those, and for any of those files I could get the hash. I could also get a listing of my branch names, and if I turn on things like verbose, I start to see these identifiers and start to think, oh, so maybe rather than being annoyed by the precision of these 40-hex-character codes, I can say that they're a little less ambiguous than the human-friendly names like master. git ls-remote, which will show me all of the remote heads pointed to by this repository, all these pull requests, all these branches that were created... let's try that again, piped through more this time, and do a little contrast and comparison with the master branch that we were on before. I see bisect, feature, master, gh-pages; 8 Dog 8 zero in this case. And you see this ls-remote command is asking the GitHub side, the upstream side of the repository, what that identifier happens to be. And I could ask that same simple question of my local repository over here: git, what is the most recent commit? We'll do a branch -v, or git, could you tell me what the rev-parse is of HEAD? And every single thing is telling me exactly the same sequence of characters. What I now get the confidence of is that when I see it on a remote identification, on a web page on GitHub, as a list of the branches, as a rev-parse, well, this is all telling me that I have a globally unique number (I said I was going to literally treat it as such from a physics perspective) that says the state of the entire repository on that branch, and locally on this branch, and the most recent commit from the current branch are all pointing at exactly the same thing. Now there's got to be a little bit of a way to mix and match and blend these two things together for some identification and some ability to navigate this. Here's some of that syntax that I said you're unlikely to have seen before
and now suddenly makes sense. From a user perspective, this is legal to literally type out like that. I was typing cat-file, and then I was looking at the most recent hash, then I copied and pasted to get the hash of the tree, the directory structure and files that lived beneath it. But that's a common enough operation that we can do it just like so. Let's give it a try: the master's tree in this case. And let's do a comparison: git cat-file -t for HEAD, what does that provide? It's a commit. Let's look at -p: it says that the tree is Charlie 72 466. So let's try our alternative approach to this. Let's do a git rev-parse, let's wrap it in a nice set of double or single quotes over here, and let's do a master and a caret with the brace-wrapped tree, like so, and a single tick at the end. Just contrast and compare: that's finding this commit's tree. I wanted to make that exceedingly clear before I press Enter. And now suddenly all the surprise of the evening is diminishing towards the end, and maybe I should have turned the whole presentation on its head, but all of these things are seeming pedestrian now. Maybe even 45 minutes ago they would have seemed crazy confusing, and the more I repeat these, and the more ways that I show you to get to the same thing, you're like, "So they're all just 40-hex-character identifiers of a state of the files or a state of the whole repo." And you've suddenly demystified almost everything about git that has the potential to confuse you down the road: it's just states of the whole repo or states of the files themselves. In fact, we can go even further with this. We can say, now that we've written a properly annotated tag, is there a way that we can refer to the most recent tag, but then some distance or position from that? Well, of course we can. Let's go back and try again and say git, please describe, as we tried before, the current HEAD, and it says no annotated tags can describe this as of yet. Ah, you're right. Now, can I describe, let's say, the master branch?
Well, it's the same outcome for this. You're right, that is true: I've got that same problem, the same difficulty. But if I actually allowed it to use one of those non-annotated tags (that's what this option switch will do here; I'm actually going to place it more appropriately in the front, right before the branch name itself), in this case it says, oh, I'm now allowed to use these non-annotated tags, and I can create this hybrid syntax. That's what we've got, a hybrid syntax, in which it takes an existing tag and melds it with the hash. Let me show you the list of tags that we have at our disposal. Now let me add another tag to this. Let me do a git tag -a, little a, of an annotated one, like so, and I'm going to apply this to one commit ago; that's where I'm going to put it in the graph. "A good tag, to be sure." Save and close, and then try a git describe. We're going to try this again, of HEAD, ah, and of master, ah, and there we have in both cases a hybrid identifier that's a little bit more human-friendly but is still blended with that globally unique identifier. I've actually found this to be a very reasonable thing to perhaps embed inside an About dialog, if you like. It's a little bit of both: a human-friendly portion as well as a machine-identifiable piece. We put that tag at HEAD~1, one back, so as to give a little bit of a distance, because the git describe of the current branch, if I didn't put the tag farther back, would have nicely simplified just to the tag itself, almost in a kind of way of "smack, smack, Matthew, you dummy, you're on the tag, so there's no need to actually have any of the marker alongside it." What this provides you is that I don't have to be pedantic about applying tags. I can have one drifting farther back in history that I applied, let's say, yesterday or the morning before, and it still will use that as a basis for me making more commits ahead. So kind of last street sign seen plus distance from street
sign would be a very nice little way to map that to an analogy. Sweet. So this is a very useful bit to embed inside, as I said before, your About dialogs. It's highly scriptable: you could just take the output of this and write something that captures it and writes it over into a properties file. I've done exactly that in a Java project and it worked really well. But what you should then be almost comfortable with, as soon as we've run it once, is this syntax. This should become part of your daily routine as well. You remember that recently somebody made a commit that said something about a variable name or a person or a particular type of change. There's a shorthand syntax for that. You can simply say git, find me (log, rev-parse, cat-file, any of those instructions), and using this commit-ish notation that we've been talking about, say "find me a commit that has those words embedded in it". You can even go further and say that you want it to refer to a specific file, so you can be more articulate. This is kind of generic, of course, right? It could hit multiple things, so it finds the first one from the current point working backwards, because it's a linked list that points at the parents. But this is a little bit more precise: this allows us, rather than a generic commit, to say, you know what, on one particular commit I'd like to reach all the way down to the level of a file. Where does this become useful? I've seen people struggle, or use graphical user interfaces, to try to get an old copy of a file into the current directory, or even lighter-weight than that, get an old copy of the file and just see it on screen. They'll use a GUI to go over to the right commit and then somehow display it in there, but that's accessible from here as well. Let's go back and look at our history with a git log --stat to see where we've recently been. There's a .travis.yml file, there's a build.gradle file that I recently did. So what I could do is take a little bit of this
hash identifier right here, copy that to my clipboard, and having embedded in my head the file structure: git, could you show me the contents of that commit hash? Well, of course it can, but that's not what I want to see. I want to see one specific file in there, so I can simply just say, at the end of that, build.gradle, and it dutifully outputs just that single one. This again is boring because it's demystified, because you now know that it's navigating through the graph to that certain point, and it's opening the tree, and it's looking for a file with that path, and it's doing a git cat-file -p of that file's hash identifier's contents. Look back, look in: a two-step process. So many of these things that seemed like they might have been, or could have been, complicated are in fact very, very simple and straightforward, as long as someone's actually stated what that recipe and process happens to be. Now for one that uses numbers, where if I were king I perhaps would not have put them in quite this syntax form, but here they are anyway. Let's say you're in a merge scenario. You're in the middle of a merge; the merge has not gone well; perhaps you have a conflict. It says "conflicted file", and you think, oh, I wanted to see the way it was, and the merge, and I used the tool and it's wrong, and I'm like half-baked, but I don't want to roll it back with a merge abort. I just want to see the way the file used to be in the current staged area, or in the common ancestor that was found by merge-base, or the one that is the target, on the current branch that you're on, or the one that you're bringing in, the source branch that you're folding over. This syntax, with the file name at the end of it, in the context of a merge is allowing you to see, with something like the git show command, the contents of each of these by state. Let me repeat that again from the first one. You're in a merge; you have, let's say, a conflict; you're not yet complete with the merge. So there's a sense of what is the current staged file's contents, maybe it's partly
resolved; what is the one that was the common ancestor found by merge-base, the place where the file had its inception, its split apart; or the one on, you could even call it, right and left, as many people do for the thought of a merge. Each of those has its own syntax for being uniquely identified. So now all of this dialect, this dialogue, this ability to get at the files and the contents is really simply just a navigation language like anything else that we'd use, like SQL or even lat/long coordinates; it just has some funny manipulators at this point. But does all of this actually help you be a better git user? Absolutely. I've seen people, even with just a few of these things, like accessing the merge base or the right or left piece, go from "well, I've got to fire up a tool and get down to the right commit and make sure that I dig in and somehow copy and paste it and put it in another text editor" to simply being able to navigate back over here, like we did with the build.gradle file, and perhaps even just pipe that over to their favorite editor and pop it up, like so. The time from typing that and getting it there versus watching anybody use a mouse to scroll is a real time-saver, especially when you consider that that is repeated over and over and over again. And I think that, with the difficult statement, means that we actually can return to our comfort zone of thinking of programming once again, of something like this. I hope you enjoyed the demonstration. I hope that it was something different; that was the intent. And I thank you for your time. Thanks again. I know it was kind of a nice little string of content, so my intent now is actually to take and repeat or clarify anything in this nice long string of stuff that you'd like me to. I'll take questions, and I'll repeat them and show off any demo that goes with this. The question was: as you're building up a git graph of commits like this, we've got some commits and we have branches off to the side, and then we decide at
some point to perhaps rebase these. Well, those of us kind of in the know of how git manipulates these things would say not that rebasing and these other activities are moving things around, but rather that it's making a copy of them and putting the new flavor of the branch, the arm, the series of commits, somewhere else. Now when it does this manipulation, the key, the thing that you care about, is the label. Let's call this over here "feature one" for this branch name. It is the label that's actually being moved around when, let's say, a rebase is happening, and the garbage collection algorithm is simply looking for heads that have no human-readable form of either a tag or a branch name. Those are the two qualifiers that make them eligible for living. It then proceeds to walk back the series of commits: that's useless, that has no human-readable name; that's useless, that has no human-readable name; until it gets to a path that's kind of electrified, if you will, that is reachable, to use the more articulate term, that's reachable by some branch name that we do still know about. So it's actually really simple: it walks commits that are no longer reachable by a human reference name, tag or branch name. End of story. And I love that many of these things are simpler than maybe you thought they were, in a good way. Other hands? Other questions? Yes, sir. Let me jump back there. [Audience question, partly inaudible, about a merge.] So let me try, but let me see if I'm getting it right. At any point that I'm off the rails about what you're asking, have me start over or clarify. I'm going to repeat, and I'm going to see if I understood it correctly. So here are two branches, and we're going to put some labels on these. We're going to talk about a feature branch (feature, "feature" spelled funny, there we go) and a master branch up top, and we're going to talk about merging this back together in the traditional sense. That's our normal merge pattern. So when that
occurs, what's actually happening (the question was what is happening here) is that those two pieces are writing a new commit at the end of the graph. This is the default behavior for the recursively evaluated set of branches, and that commit simply stores two pointers. One is the identifier of the parent commit, that's kind of the primary branch, and then, let's change the numbers up to kind of signify it, something that came over from the feature branch. So it's simply a node that points at the two ancestors that it brought together, plus any rectifying commit activity to resolve conflicts, the code that you helped glue back together that didn't quite fit well. So let me stop there. Is that okay? [Audience question] Sorry, good, and coming from the Subversion universe, if you're doing a merge like that, you can either be merging to the branch or merging back to the trunk, so what differentiates whether you're merging to the branch or... All right, so let me repeat this by starting with code to do this. I'm going to clear off the screen here for a second. I'm on the master branch. I'm going to start a branch off to the side that I'm going to call feature2. I'm going to generate two random changes on master; I'll show you what the difference is here, txt; and I'll git checkout feature2, and I'll likewise generate two random changes over here to the side, so this is a feature txt. And if I git merge master into this branch, which happily goes (save, close), it's all happy, and all the way back down to... there we go. Oops, projectors. Okay, let's come back over here. So now that we're back at our scratch window (something a little funny going on with the projector there), let's go back into our project that we were in. We're on our feature2 branch and I just merged that, so I'm going to git show the most recent commit, and what you'll see is that there is an order sensitivity. So I'm answering the question by doing an example: there is an order sensitivity to the
branch parents that are listed on a merge commit. The first one is the branch that we were currently on, kind of the destination in that case, and the second one is the one that was brought in, so the additive piece. So the order is what controls what from and what to. If we had merged the feature branch into the master branch, what you would have seen is the order of those two flipped: on the master branch would be a commit that listed its own predecessor first and then the incoming feature one as the second node. So: order-sensitive, sequence-sensitive. Why was that? Because I changed... there were simultaneous changes on both of the branches. Just to do a quick little history run over here: history, tail, most recent 5 commands over here... wait a minute, go like that. You'll see that what I did is I created the branch, and then I generated changes on master, then I toggled to the previously created branch and generated more changes on the feature. So I was creating commits that caused them to diverge, and then brought them back together. If master had been quiet, let's say, for example, you might have gotten the fast-forward. Very, very good that you called it out. [Audience question] Would you be able to publish that history that you've got there as part of the presentation? That's already done; it will end up in a gist less than an hour after this. So yes, I will publish the history for everything that I've typed this evening. I even have a cheat sheet that goes with this, so every little example that I did in Ruby and Perl and Python you can copy and paste from. So yep. Yes, sir? [Audience question, partly inaudible, about what the commit graph looks like for an annotated tag.] Yes, yes. So let me repeat that by doing a picture over here. We have a commit graph; this is what we're starting off with to answer the question, and we have some branches over on the side, so the same pattern that I've been setting up before. But what we effectively have is a box over here in which we can place tags. The question was
around tags. Now, in this box over to the side, if I write a traditional tag, a non-annotated tag, what I have is a file, not a commit. So it would be inaccurate for me, since I'm using arrows as my commit thing here, to actually use this. What I'm going to do instead is say that it is a pointer back to some commit in the graph. It's an indirection that has a name that points at the already-written thing. Now, to put a label on this box, we'll call this our non... it's actually called a reference tag, so we'll just call it a reference tag in that case. What would an annotated tag look like differently, though? It would be an actual commit in that box, or honest-to-goodness object sitting there, that on the inside has a reference to a particular commit. The other one was merely the indirection; the second one was who, what, when, where, message, plus the indirection back to the other piece. So let me start with a little text document over here to outline both the question and the answer. Basically I'm going to summarize it as "how does the pack file work?" (I don't like that) "it captures all the hashed objects" (and please permit any typos while we do this), and also "how would I get at something that is in a pack file?" Well, to the latter piece I can answer that it is just that same zlib compression that we used on the individual files. So you could use that same decompress function that we used during class on a pack file, and what you'd get is a brute output of directories and files that were the original constituents in the folder and file system, and nothing else unique, no more decomposition or digging into them. Now each of those individual files, just as we did at the beginning of class, is still in turn individually compressed, so you might have one more step to go, but it is just a two-layer shell: the outer shell encompassing all the files and folders, and the inner ones exactly as we showed them in class. So using a Perl or Python or Ruby library would be the way to crack
them open. But I want to not answer your question, or answer it in a different direction, with one other instruction over here. As I clear the screen, for this history repository that we have over here, let me show you: here's all of our history. And if I then git gc, which garbage-collects, all of those things are now in a pack file. Having run the garbage collection, I still can use any of these plumbing commands that we used tonight. I think this is very helpful to be able to see. I can use any of the git cat-file that we were using before, any of the git show against any particular hash. The fact that the files are now in the pack file, the compressed archive, or not in it, is completely hidden from you, so all the vocabulary that I used this evening that are git commands will function whether or not it's inside or outside of a pack file. So what it provides you is this nice thin abstraction as to whether or not they're in there, and that would be the interface by which you'd program to get stuff out of it. It'd be more consistent than writing your own compressed archives. Yes, sir? [Audience question] Basically, how do you get the confidence of getting kind of the golden copy that you had back in Subversion, knowing that the flavor that's up there is the same that everybody else has? And I think that answer is really, really easy. It is, as ugly as this sounds at first surface, compare hashes, because that is actually a stronger statement than you could have ever made in Subversion. You can mathematically, dare I even say physics, identify the state of everything that led up to there and say that someone else has every bit in exactly the same place in exactly the same state. So it's actually a stronger level of guarantee for the golden copy. So it is very reasonable in a build process, and even in a deploy process, to ask what the production hash is of the copy that's out there for a customer. And CI servers like Jenkins, for example, very popular in
the Java space, will do an archive and will do an MD5 or SHA-1 sum. You can do an archive command where it will keep track of those on a per-build basis, and what's crazy is you can actually even put that in the manifest, automatically if you like, when it's writing the JARs. And you can ask a production customer, without any identifying information in the JAR names, for this little tidbit of information, and tell when and who caused the build to be published by using the search function in Jenkins. I mean, I can tell you it was a button press by Jen on Friday at 5:00 p.m. by just that mere little tidbit of unique signature. And we all know, we've all done tech support at some point in our lives, even if it's for family and friends, the person on the other end will lie to you about what their... "It's on revision six." "Okay, check this file." "It's on revision six, I'm reading it right now." "Yes? No, you're not, you're reading Dilbert." But when they give you these unique identifiers, it kind of destroys that, because they can't tell you a hash code that doesn't map to a build that you produced, or at least you can easily smell it out: it doesn't map to any build that you've ever seen come out of your system. [Audience question] What I was thinking is, setup, this is probably the right idea: you can git init, right, and then after that you don't do any git add or check-in or whatever from that machine, you use something else... Oh, as a hosting solution, to have like the clean server copy of it? That's about the right idea. That's in fact what clicking the New button on GitHub does: it's a git init of an empty repository with a --bare on the server side. [Audience] What I don't like is, when I look at that machine and that git status, I always see everything is deleted, is what I mean. Ah, but --bare is the option switch you're looking for. It means without a working directory, and so you can never actually make any changes there; it just becomes the recipient. Yup, that's the right vector, and
and I give free office hours every two weeks, so if you have any trouble, just come and I'll help you set it up too. And maybe one more. [Audience question, partly inaudible] I mean, this is a little question, but we were looking at the .git/objects directory and I noticed that all the directories were just a byte and then the longer part of the hash... Yep. ...why one byte, why is it flat like that? So the balancing was kind of through a semi-scientific process of trying to figure out what file systems kind of liked. It's a compromise, to be sure, but of the forty hex characters that are the SHA-1 digest, the first two are made the folder name and the remaining 38 are the file name, and that turned out to be happy on most file systems for creating balance. There are some file systems that do not like a million files in a single directory, and this was a very inexpensive way to not... Yeah, it's still true. Depending on what kind of SAN storage you're putting it on and so on, the performance characteristics are pretty varied. There are some really anomalous performance curves for millions of files in a single directory. It's still a concern on some platforms, and remember, not everybody has the advantage of moving all the way up the sliding scale. Let's say we solve it tomorrow; let's say that that's truth, everything that ships is that way. We have long-tail customers still using, and I say customers, not even GitHub customers, git server implementations in places that have systems that are required to be four years old for airline requirements and so on. So even if we do solve it, it doesn't mean that it's solved for the long tail of everybody using the system. So it's a durable solution, it's a good solution. So effectively a compromise of balancing for file systems, to make them a little bit happier by giving them a blend of files that are nicely balanced in folders, as opposed to a million in a
single directory. I'll stay for more questions, but I think it's great to actually call that formally concluded. There are some prizes to give away, so I don't want to stand in the way of that. Thanks once again. You can ask me anything in the hallway. [Applause]
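The object naming and directory fan-out described in the last answer can be sketched in a few lines of Python. This is a minimal illustration, not git's own code: the function names `git_blob_id`, `loose_object_path`, and `loose_object_bytes` are hypothetical, and it assumes the classic SHA-1 object format for a blob (a `blob <size>` header, a NUL byte, then the content, stored zlib-compressed as a loose object).

```python
import hashlib
import zlib


def _with_header(content: bytes) -> bytes:
    # A blob is stored as "blob <size>\0" followed by the raw content.
    return b"blob " + str(len(content)).encode("ascii") + b"\0" + content


def git_blob_id(content: bytes) -> str:
    """Return the 40-hex-character SHA-1 ID for a blob with this content."""
    return hashlib.sha1(_with_header(content)).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Map an object ID to its loose-object path: the first two hex
    characters are the folder name, the remaining 38 are the file name."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


def loose_object_bytes(content: bytes) -> bytes:
    """Return the zlib-compressed bytes a loose blob file would hold."""
    return zlib.compress(_with_header(content))


if __name__ == "__main__":
    sample = b"hello from the talk\n"  # hypothetical sample content
    object_id = git_blob_id(sample)
    print(object_id)
    print(loose_object_path(object_id))
    # Decompressing recovers the header plus the original content.
    print(zlib.decompress(loose_object_bytes(sample)))
```

Running `git hash-object` on a file with the same bytes should print the same ID, which is one way to check the sketch against a real repository.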
Info
Channel: InfoQ
Views: 52,118
Rating: 4.9702969 out of 5
Keywords: git, version control system, github, matthew mccullough, graphs, hashes, compression, marakana, techtv, yelp, sfjava, java
Id: ig5E8CcdM9g
Length: 68min 59sec (4139 seconds)
Published: Wed Sep 19 2012