Diffing and Merging Jupyter Notebooks with nbdime | SciPy 2016 | Min Ragan-Kelley

Captions
All right, hi, I'm Min. I work on the Jupyter project, now out of Simula Research Lab in Oslo, Norway, and I'm going to be talking about some work that my colleagues at Simula have been doing, partly under the Jupyter project and partly under the OpenDreamKit project, which is a large European project for virtual research environments for mathematics. One of the things they let us do is solve problems like diffing and merging notebooks.

At this point roughly all of my talks start with some new definition of what a notebook is, so just in case there is anybody who is not quite familiar: a notebook, as Jupyter defines it, is a document containing some prose, which might include math, that Jupyter encodes in Markdown; some code to run in cells; and then the notebook document itself also contains the output that's the result of running this code, which can be images, it can be HTML, it can be JavaScript, it can be all kinds of stuff. That's the concept of the notebook.

But when you start thinking about diffing and merging, you need to think about what the data structure of a notebook actually is. What is a notebook, really? And it's not quite as nice as the rendered view. We've got this collection of JSON data. We don't store purely the rendered result; we store a structured document of the information that we need to reconstruct the document. That means we've got metadata about different pieces of the document, each cell is a JSON object, we have the source code for the cells, and then we have, keyed by MIME type, every output in the document. So in principle that's nice: we've got the structured information that's easy to find. If you open a notebook as JSON and just load it as a Python dictionary, it's easy to poke around and see what everything means. There's certainly value in
that. But part of the goal, if you've got a file format that's JSON, is that humans shouldn't be looking at it, right? Humans should be looking at some processed result of that. So any time you've got a human looking at JSON, something has gone wrong.

One of the reasons for that: if we look at the source code from our nice Python cell, it's not that hard to read, but we've got these extra quotes, this unnecessary indentation, newline symbols on the end, and a bunch of commas. It's not horrible, but it's not the best way to look at Python code. And then down here, we have this nice feature that a notebook is a single file that you can share and look at on nbviewer, and part of the reason for that is that the outputs are in the file. The disadvantage is that JSON is a text format and images are not text, which means you have to do a text encoding of binary data, which means we've got these base64 blobs, and you see just a hint of how not-so-nice they are.

So when you start thinking about the difference between two notebooks, what does that look like? If you just say, "All right, git, regular traditional tool, what's the difference between two notebooks?", it might look a bit like this. Again, it's not so horrible in this case, where you can see you've got the text input and some text output. There's a little bit of extra cruft with the quotes and everything, so it's not quite as easy to see as if it were just a plain Python script or Markdown file or something like that. But if you look at a different part of the same file, it's just a big mess. I have asked git, "What's the difference in this text file?", and it says, "Well, there's a whole lot of text that's different." The notebook format and I know that that's not text, that's an image, but when I ask my diff tool to diff this JSON file,
it doesn't know that, so it's going to say, "Here's a bunch of text, it's different, that's your problem." So one of the things we're looking to deal with is how we look at notebooks and actually understand what's different about them. That's the situation right now, which is less than great.

So what about the tools we have for working with notebooks? What's the current situation? GitHub is great: GitHub renders notebooks, with a little help from nbconvert. You look at a notebook on GitHub, you don't see JSON, you see a notebook, and that's really nice. What happens when you look at a diff on GitHub? It says, "Ah, that diff is too gross, I'm not even going to try." So that's, again, not great. If you're trying to review a pull request and all you get is "this is different, I don't know what to do," that impedes your GitHub code-review workflow of looking at pull requests all in one place. It's not a great way to work, and it's one of the disadvantages of moving from traditional Python scripts and files to notebooks. If you're working in a collaborative setting on GitHub, this is one of the main drawbacks of collaborating with notebooks.

So what about our regular local tools? If you read what GitHub said, it said you should take this file and go look at it on your own computer. So what happens if we do it on our own computer? Most diff tools are line-based diffs. They say, "This is a text file, text files are composed of lines, I'm going to compare the lines and then show you a difference of the lines." They assume files have no structure, they don't understand anything about the content of the files, and all lines are treated equally: this is a file that has some content, and as far as I'm concerned all the lines should be
treated the same, and they have no understanding of the content of that document. Even with a Python file where you're changing some text in a function, the diff tools don't have an understanding of the Python language, and sometimes make silly decisions about showing you a difference in a file, based on the naive alignment that they do, since they don't try to understand the content of the file. For simple text things like Python scripts and Markdown, that's usually not a big deal. It's usually close enough, and then you can help it over the finish line.

So what about Markdown? If we have a notebook, nbconvert lets us convert notebooks to Markdown. This is a diff of a notebook converted to Markdown. We see there was a line removed, there's an output that changed, or maybe it didn't, actually, we don't know, and then there's some Python code that changed. So it's getting us a nice plain-text difference: I see there's some input that changed. But it discards information, right? When you convert a notebook to Markdown, all you really get is the input. Sometimes that's great. Sometimes all you care about is the input, and you're really saying, "I want to save the source, the input that I'm working with, and I don't actually want to save all the interactive stuff." That's fine, and if that's what you want to do, you can serialize things to Markdown during the diffing process and you get a pretty good experience.

You can also get somewhere in between by splitting files. Rather than discarding your information, you take it out and put it somewhere else, so you're not discarding it, you're just putting it aside. There can be challenges with that, because then the files can get out of sync, where you can leave leftover files from the last
export, and it can get a little bit tricky. There are tools, ipymd and notedown, that provide things like converting notebooks to Markdown and converting Markdown to notebooks, and even, in your notebook web application, when you save your notebook it will actually write a Markdown file instead of a JSON file. For collaborating on GitHub this can provide a better experience than the default JSON format. The disadvantage is that other people can't open your notebooks without installing the same tools, so there's a bit of a trade-off between a nicer GitHub workflow and a file format that more people have the tools to work with.

So what about notebooks? When we're thinking about doing these comparisons, notebooks are structured data. We've got all this structural information that we, as people who understand notebooks, know: this is a cell, this is input, this is output. And there's a hierarchy of importance of the data. The most important thing is the input, the code and the Markdown that you wrote, which is what you get when you save to Markdown. That's not the only information you care about, but if we're comparing two notebooks we can prioritize input over the other content, because we know the other content can be regenerated. But maybe we still care that it changed, or how it changed. There are also different kinds of images, different kinds of data, so it's not just JSON-native stuff; we're also cramming in a bit of other types. So we can, theoretically, reason about the kind of output that this is. When we're looking at the document, we have the information about what this piece of the document is, and we can use that information to make decisions in computing the comparisons.

So, JSON Patch. We've got a JSON document. There is a standard
library for JSON diff and JSON Patch. There's an IETF standard, I think, defining how to compare JSON files, and that's great: it understands the structure. Because notebooks are JSON, a JSON diff will never mess up your JSON and make it invalid. But it doesn't understand the content. So it gets us the structure, and not messing with the structure, but there's still going to be that big blob of image text and things like that. It gets us partway there, but it doesn't let us make the more intelligent, thoughtful decisions based on the fact that this isn't just a JSON document: we know a lot about the structure of the document we're looking at.

So from our perspective, when we started working on this project: what should the diff be when you're comparing notebooks? It should always be valid. That's step one: if you compare notebooks, or you merge notebooks, you should get a notebook back. You shouldn't get "this isn't JSON anymore," if you've ever encountered that after trying to merge notebooks. It should be properly structured, and it should actually take the content of the notebook into account when you're doing the comparison. You should be thinking about what is in the notebook, what's important, and how we deal with that.

So, a bit of an aside about how one computes diffs, because how we think about notebooks is relevant to how we compute the diffs. When you diff, comparing two things, almost all of the work you're involved in is this LCS problem of finding longest common subsequences. You've got two sequences, and you want to align them based on the longest common subsequence so that you get the smallest difference. There are lots of different correct transformations to get from the top sequence to the bottom sequence, but in
order for the diff to be intelligible and useful, you want to have small transformations and make it as clear as possible what actually happened. The first alignment is easy: we've got three elements that are the same, in the same position. But then we've got another, longer subsequence that is not quite aligned, and there are diverging elements in the middle. So part of computing the difference of two sequences is finding these, figuring out which ones should be treated as the same, and then finding the transforms: removing those two items, the red and the green, adding the blue, and then adding the green on the end, to go from the top to the bottom.

So if we have some code (it doesn't really matter what the content of this code is; this is from the nbdime source), you can see that there is a block that's the same and aligned, and there's another block that's the same but not aligned. In between, there's a bit that was deleted going from left to right. But then the first line is not quite the same: there are a couple of extra characters. This points to the fact that even when you're doing a plain-text diff, it's more complicated than a sequence of lines, because with a sequence of lines it's just a scalar: is it the same, or is it not the same? With a text file you actually deal with some amount of similarity. Is this the same line, maybe changed a little bit, or is it a totally different line that happens to be similar? So there are a lot of heuristics in computing alignments, in terms of how you deal with similar items.

This comes up a lot in notebook diffing, because a notebook isn't just a bunch of lines that have different distances. They have structure, and when we're comparing two cells for whether they're the
same or similar, different pieces of that cell should actually rate differently in terms of how you compare whether it's the same cell or not. So when you're aligning notebooks, the first thing we do is align and ask: is this cell, in all its input and output, the exact same? If it is, it's the same cell, right? There's no question there. But then we also do a comparison ignoring differences in output. If you took the same cell and reran it, and it produced different results based on some changes further up in the document, we should still be able to identify that, yeah, this is probably the same cell, because the input's the same. And then we also align on the content of the input of the cells. Just like a regular line diff will do some similarity comparison on individual lines that might differ a little bit, we also do alignment based on how similar the content of the cell is, so that we can tell that a cell is actually the same cell if you changed a few lines here and there.

That brings us to the actual project, nbdime. A lot of people are wondering why the hell it's called nbdime. It's for "notebook diff and merge," and it's also because nbdiff was taken, and we also do merge. So what are our goals in the project? We're making tools for diffing and merging notebooks. That includes command-line rendering of diffs: you're in a terminal, hacking away, and you want to say, "All right, what's the difference between these notebooks? I don't want to go somewhere else." But notebooks really are rich documents, so if you want to really see how a notebook changed, an HTML view is a logical choice, so we'll also be providing HTML renderings of the diffs between notebooks. And because part of the point of this is easing the pain of collaboration, we want to integrate with git tools. Working with git and notebooks can be a pain, and we want to
reduce exactly that pain, not just solve diffing and have it still be unpleasant. So the basic tools in this part of nbdime are nbdiff, for a console diff of notebooks; nbdiff-web, for a web rendering of the difference; and an nbdiff driver for integrating with git, which I'll get to in a little bit.

So what does it look like when you compute the difference of two notebooks? It looks a bit like this. I've got two notebooks, and I can see at the top I've got some text output that changed: in one I was using matplotlib 1.5.2, and in the other notebook I was using the new matplotlib beta, because this was from this week. Then, because I'm using the new matplotlib, I set the colormap to viridis, and I can see that there's an image that changed. But I don't see a huge bunch of stuff that changed. I just see: there's an image here. I'm not going to show you all of it, I'm just going to show you that there's an image here, because I know you're in a terminal and I can only talk to you through text. I know this is an image, and it's pointless for me to show you all of it because your eyes don't understand how to turn base64 into pixels. So I'm just going to say, yeah, there's stuff here, and I'm going to hide it from you, but I'm not going to hide that there's stuff here.

In Jupyter, when you produce an output it can have multiple representations, and we'll actually show you the diff of all the representations. So we can see that not only did the image change but also the text representation, which is just a memory address, so not that interesting. And we can also see that there's some metadata: by updating matplotlib, the height of my figure actually changed a little bit.

So, nbdiff-web: just call a different program. What does that look like? Same notebooks, and it looks a bit like that. We've got cells that are aligned, and we can tell that these are the same cell, but some lines have been changed, some lines have been added. There's a
cell that's exactly the same, so we just show it in the middle there, and then here's our plot that's different, with the different colormap.

And what about git? Git has the notion of drivers and tools. A git driver is a plug-in for computing a custom diff or a custom merge operation at the command line, and so we have git-nbdiffdriver, which you can enable on the command line to say, "Hey git, let me take care of diffs of notebooks." A git tool, for whatever reason, is a separate thing for launching GUI applications, to say, "I want to actually launch an application to view the diff of these notebooks," and we've got git-nbdifftool for that. You invoke it as with any git GUI tool: use git difftool instead of git diff, and then add -g for GUI.

And now we can demo a couple of things, so I can prove that it actually works. I've got two notebooks here. I can compute the diff, and I can see, yes, there is in fact a diff. Some inputs have changed (I guess it's cut off a little) and some outputs have changed. And the conda env I used is also different, since that's how I switched my matplotlib version, so I can see that my kernel is different. But I can also do nbdiff-web, and that opens a browser that gives me my view of the notebook. If you don't care about the output, you can hide it, look at the cells, and see: here's the diff of my notebooks. There's a lot of UI work to do, but we've got the basics.

And I've got a repo ready here. I mentioned the git diff driver. I've enabled the git diff driver for nbdime, so I can do git diff and compare two branches, and it called out to nbdiff. So now when I compare two branches on any repo on my computer, if there are notebooks, I'll see this instead of JSON. Nobody will ever show me that base64 crap again. What I'm showing is a commit I have on the IPython repo changing one of our example notebooks, and I can call the diff tool, so I invoke
git difftool -g, for "git, open your GUI diff" for any files that happen to have changed in this commit, and that launches the diff of that notebook, which is substantially bigger because it's a real notebook with a lot of stuff. This is illustrating custom display stuff in IPython. I changed the parameters of the Gaussian, and I can see the distribution is different. I can see the rendering is also different, because this notebook is from a long time ago and matplotlib preferences and things have changed (my matplotlib preferences, that is). All right, and I've changed parameters, and you can see the outputs where they're different. So, simple alignment: you can see, you know, I changed 1, 2, 3 to 3, 2, 1. You can review, and there's synchronized scrolling and things if you need to move around for long lines. Cells added show up as green on the right, and cells removed show up as red on the left; this is a deleted cell, right? There's a lot of rendering work to do. The main thing we're focusing on is just getting the information on the page, and then we've got wonderful Jupyter designers that we can ask for help to make it look nice.

And then the metadata. This is another thing: you can hide the output if you're not interested, but we show it to you by default. Metadata is similar, but the other way around: we'll tell you that the metadata changed, but we'll hide it from you by default, because the odds are you don't care. But you'll see that last time I ran it, I was running Python 3.4, and this time it was 3.5.1. I can expand to see the rest of the metadata if I'm interested. Go back.

So, another aside: what about Markdown? If you just care about the input, using Markdown for your diffs is also totally fine, and you can do the same thing with git diff drivers. You can pre-process files to convert them to Markdown
with nbconvert, and then when git diff shows you the diff, it'll show you the Markdown version of the two files. Tim Head posted about this on Twitter last week, and it works fine. That's just a couple of lines of git configuration.

So the last bit, where I'll finish: what about merge? Merge is a lot harder, and there's a lot that we will have, but because I'm saying "we will have it," we don't have all of this yet. What we have now, I will show you. I'm in this repo and I've got my local and remote. I've got nothing configured; this is regular, default git. I'm going to do git merge, and I get conflicts. That's fine: auto-merge failed, there are conflicts in the file. What happens if I look at that? Well, there's trouble now: it's not actually a JSON file anymore. What if I reset it, and then enable the notebook merge driver, and run the same merge command again? This time it succeeded, and I can open my notebook to see what happened, and I've got a valid notebook. So at least that's a start. And then I can see I've got conflict markers saying, you know, I've got output from one side and output from the other, and I've got both my figures, and if I rerun it locally it will all be consistent.

So we've got lots of work to do: talk to designers about the view; custom diffs for every MIME type; handle unresolved conflicts (this only works well if it can actually do the auto-merge, and there's a lot of work we need to do on the merging stuff); then a web app for actually doing conflict resolution while merging a notebook; and then integrate it into the cool JupyterLab stuff.

And then, just pointing out other people's work: nbstripout is a git filter for if you don't ever want to commit outputs to git; nbstripout will let you do that with a git filter. And ipymd and notedown will let you save notebooks as other formats, if that's your preferred choice. We're also thinking, in the Jupyter project itself, about splitting
notebooks into two files, at least some of the time, so that outputs are in an easily git-ignorable file, kind of by default. That's a tricky conversation, but it's ongoing. And then thanks to you and everybody, especially Vidar and Martin at Simula, who are doing almost all of this work (I'm just showing it to you), and then [unclear] Fernando and Brian, who let me move to Norway, which I'm enjoying pretty much. Yeah, thanks.

I haven't checked what version of git diff drivers were introduced in. I didn't know about them six months ago, so I was writing against git difftool, and the diff driver is much nicer. So I don't actually know how new your git needs to be. I know that my version has it, but I use brew, so it's probably stable as of a day ago.

Yeah, repeating the question: you were hoping that when I said git I meant GitHub, and I was going to show you a really nice GitHub diff view. GitHub has custom diff views for things like images and GeoJSON. We want the HTML view to look nicer than it does, and then we'll start the same conversations that got notebook viewing on GitHub; we'll go through that same process to get notebook diff views on GitHub, hopefully. But we don't want to ask them to put something on the page that we know they wouldn't. Sure, yeah, anybody who's interested. The main thing is, if we can take two notebooks and make a nice HTML snippet, then it's easy for them to say, "Yeah, we'll put that up there, that's fine." But we want to make that HTML snippet nice first.

So diff drivers are for humans; filters are for actually writing commits. It doesn't actually mutate the document, it's just deciding what to show the human. Yeah, the question was about whether you get patch-level stuff in terms of composing a commit. You can do that with custom git filters, but that's not actually
what we're working on. None of this work involves changing what is committed; it's just changing how it looks.

So, if it involves showing you a diff, there's some inconsistency in git. There's a flag called --ext-diff. The question is: if I'm looking at what changed a week ago, in the many different ways that git can show you information about what you've been doing, will it invoke this? There's a flag in git for external diff, and it seems that for some commands it will use external diffs by default and for others it will not, and you have to ask for it explicitly. But generally, if it would show you a diff, you can ask it to show you that. It may or may not do it by default; you might have to add a flag. I think show does need you to add the flag, for some reason, but diff doesn't. I don't know why; it may not be on purpose.
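The cell-alignment idea from the talk (treat a notebook as a sequence of cells, and match cells on their input while ignoring outputs) can be sketched in a few lines of Python. This is only a minimal illustration built on the standard library's `difflib`, not nbdime's actual algorithm; the helper names `cell_sources` and `align_cells` and the two tiny example notebooks are hypothetical, and it only matches cells whose source is identical, without the similarity heuristics the talk describes.

```python
import difflib


def cell_sources(notebook):
    """Return each cell's source as one string, ignoring outputs and metadata."""
    sources = []
    for cell in notebook["cells"]:
        src = cell["source"]
        # On disk, a cell's source may be a single string or a list of lines.
        sources.append("".join(src) if isinstance(src, list) else src)
    return sources


def align_cells(old, new):
    """Align two notebooks cell by cell, comparing inputs only.

    Returns difflib opcodes: (tag, i1, i2, j1, j2), where tag is one of
    'equal', 'insert', 'delete' or 'replace'.
    """
    matcher = difflib.SequenceMatcher(
        None, cell_sources(old), cell_sources(new), autojunk=False
    )
    return matcher.get_opcodes()


# Hypothetical example notebooks: same code, but the plot cell was rerun
# (different base64 image) and one cell was added.
old_nb = {"cells": [
    {"cell_type": "code", "source": ["import numpy as np"], "outputs": []},
    {"cell_type": "code", "source": ["x = 1"], "outputs": []},
    {"cell_type": "code", "source": ["plot(x)"],
     "outputs": [{"data": {"image/png": "AAAA"}}]},
]}
new_nb = {"cells": [
    {"cell_type": "code", "source": ["import numpy as np"], "outputs": []},
    {"cell_type": "code", "source": ["x = 1"], "outputs": []},
    {"cell_type": "code", "source": ["y = 2"], "outputs": []},
    {"cell_type": "code", "source": ["plot(x)"],
     "outputs": [{"data": {"image/png": "BBBB"}}]},
]}

for tag, i1, i2, j1, j2 in align_cells(old_nb, new_nb):
    print(tag, "old cells", (i1, i2), "new cells", (j1, j2))
```

Because only the inputs are compared, the rerun plot cell still lines up with its counterpart even though its image output differs, and the new `y = 2` cell is reported as a single insertion. A real tool would then diff the outputs of the aligned cells separately.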
Info
Channel: Enthought
Views: 4,783
Rating: 5 out of 5
Keywords: Python, Jupyter, JSON, diffing, nbdime, SciPy Conference
Id: tKAmwC8ay8E
Length: 32min 9sec (1929 seconds)
Published: Fri Jul 15 2016