I don't like notebooks. - Joel Grus (Allen Institute for Artificial Intelligence)

Captions
Hi everyone, thanks for coming. My name is Joel, and I don't like notebooks. It's possible I'm the only person like this at the conference, but out in the room? No? All right, good, because out in the real world there are dozens of us, and a few here.

So how did we end up here? Well, I was angry about notebooks. I tweeted that I wanted to give a talk about how I was angry about notebooks, and someone was listening, and it got accepted. Data science luminaries thought that it sounded like an unhelpful way to spend time, and I got lots of other positive feedback about it as well, but here we are.

So what kind of person not only doesn't like notebooks, but goes to a conference about notebooks to talk about how he doesn't like notebooks? I'm a research engineer at the Allen Institute for Artificial Intelligence. My job is Python library design and reproducible science, so I have lots of thoughts on these topics. I won't call myself a Python expert (you should never call yourself an expert), but I've been using Python for a long time and I have a lot of opinions about it. I was the chief data scientist at VoloMetrix. I've managed data scientists, I mentor data scientists, and I wrote a popular book on data science, so I have some thoughts about how to teach data science as well. I worked as a software engineer at Google for a couple of years, and I work on AllenNLP, which is a very well engineered project. I make live coding videos, so I think a lot about how to teach concepts using code. I write a blog that's about Python sometimes, and I update it sometimes. And I co-host a data science podcast with Andrew Musselman called Adversarial Learning. If you're wondering why there are no episodes recently, that's my fault; I need to edit them.

Okay, so that's me. I was going to start a timer and I didn't, so let me start it now. Too fast; I have a lot of slides. Okay, so that's me. What do I assume about you? Well, you're at this conference, so possibly you think notebooks are great, and you came to this talk and maybe want me to
change your mind. So let me just give you a couple of caveats. One is that I'm not a notebook expert. I've tried very hard not to misrepresent notebooks, but it's possible I may have done so inadvertently; if so, I apologize. But I have a lot of complaints, so even if one is wrong, they can't all be wrong. It's also the case that there are probably advanced add-ins that might address some of my concerns, but they're obscure and I didn't know about them, and that's why they're not here; I apologize for not knowing about them. A minor caveat is that there's a new product called JupyterLab. When I proposed this talk it was not yet released, so I haven't used it very much. It's possible it addresses some of my concerns as well; it likely doesn't address all of them.

Okay, so before I get into things I don't like about notebooks, to be fair, I want to talk about things I do like about notebooks. One is that I really love the literate programming model. I love well-documented code, so I like the idea of mixing Markdown and code. I find this a very nice aesthetic, and it's a great way of presenting things. Another thing I like about notebooks is that inline plots are pretty great. The story around plotting in the Python console is not a good story, so I find this pretty nice. Okay, so those are the things I like about notebooks.

Now let's get to the things that I don't like about notebooks. My number one, fundamental complaint is that notebooks have tons and tons of hidden state that's easy to screw up and difficult to reason about. So here's some very simple Python. I define a function f that takes a number x and returns x plus 2. I set y equals f of 2. I check: is y equal to 4? Yes it is, that's true. Print y: it's 4. Okay, very simple, nothing fancy going on here. Here's that same example in a notebook. Define f: x plus 2. y is f of 2. Is y 4? False. Print y: y is 5. Okay, so you're looking at that and you're thinking, you know, if you look at the numbers, those
cells weren't even executed in order, so of course it doesn't work. Okay, so let's go back and make sure that they're executed in order. Here one comes before three comes before four comes before five, but still the same result. And if you look at the numbers, it's clear that something else was run between those first two lines, which is fair.

But on one hand you have this idea that notebooks are great for iterative development, and then you also have this idea that notebooks are actually kind of dangerous unless you run each cell exactly once, in order, and otherwise you can't really rely on what the outputs of the cells are. So there's this tension there that makes me kind of uncomfortable, I guess. Some of you might be thinking that my preferred REPL also has plenty of hidden state, which it does, but that state was built up in a linear fashion just by running one command at a time. And there is a history magic that I can put in a cell and see all the commands that have been run. But reasoning about state is actually one of the hardest parts of coding, so we should make it as easy as possible, not as difficult as possible.

So here's one more. This is not photoshopped; this is not me hacking the HTML. One, two, three, four, in order. What kind of trickery is this? Well, I just edited the first cell and didn't execute it, which I did on purpose, to be deceptive, but you could do that by accident too, and really confuse yourself as well.

So my next complaint is that notebooks are difficult for beginners. Someone comes to you and says, I would like to learn Python, and you say, okay, that's great, create a notebook and get going. Why do I think they're difficult for beginners? It's because these sorts of hidden-state complications that I'm talking about are not obvious. This is most people's experience of how code works: you run a line of code, you run the next line of code, it happens in order. The
ability to run code snippets in arbitrary order is actually really weird if you think about it, and it's very unintuitive. And if you look at beginner tutorials, they're either cavalier about or completely silent on this hidden-state issue. So I just did a Google search and found what I could find.

This is Dataquest's Jupyter Notebook for Beginners: A Tutorial. They talk about some cells and say this will work regardless of the order of the cells in your notebook, which addresses the issue, but not as: this is dangerous, watch out, be careful, don't shoot yourself in the foot. And then later on they say most of the time the notebook will be top to bottom, but it's common to go back and make changes; in this case, just keep an eye on the numbers and see if you have stale output. And if your notebook is longer than a page, or you can't hold 30 numbers in your head like that, then, not sure.

This is Jupyter Notebook Tutorial: The Definitive Guide. It did not have anything about order or state. In case it was just stated differently: hidden, nothing; sequence, nothing; enough rope to hang yourself, nothing. So it didn't seem to mention the issue. Here was the Comprehensive Beginner's Guide to Jupyter Notebooks for Data Science and Machine Learning. It also did not seem to address the issue.

So back to this example where I skipped cell 2. You people are all notebook experts, so the problem here is kind of obvious to you, and also this is a really simple example; it couldn't be much simpler than this. For beginners who have dozens of cells and more complex code, this is utterly confusing, and I know this because they come to me with their problems. In fact, the original angry tweet that kind of launched this talk was a result of me helping someone who was new to Python and came to me and said, Python makes no sense, I don't understand it, it doesn't behave the way a programming language should. When what happened was Python worked exactly the way they
expected it to, but they'd been very loose with state and execution out of order in the notebooks and ended up in a situation that appeared to make no sense. And this has happened to me multiple times. You might be thinking, lots of beginners use notebooks, so clearly it can't be that difficult. And it's not that it's insurmountably difficult; it's that out-of-order execution makes learning Python more confusing than it needs to be.

Okay, so my next complaint is that notebooks encourage bad habits. You've got folders full of these, I've got folders full of these, everyone's got folders full of these. There's this sense in the data science community, which I find super unfortunate, that data science code doesn't need to follow the rules of good software engineering. And this was posed as a great tip, although apparently this was a misrepresentation of the tip, which I didn't learn until after I started a big Twitter fight over it. But if you take one thing away from my talk, it should be nothing to do with notebooks; it should be that data science code should follow the rules of good software engineering.

Sometimes people will say, I don't need to be rigorous about my engineering because I'm just experimenting. Okay, well, experimenting is doing science, and to say I don't need to be rigorous because I'm doing science doesn't make much sense to me. Sometimes people will say, I just want to see if my model works before I put it in production. Well, if you want to find out if it works, you need to write it correctly and make sure that your code is correct. Another line of thought here is that people will say, I need to write my code as fast as possible, and doing poor software engineering allows me to write my code as fast as possible. I don't agree that that's even true, but generally speaking, trying to write your code as fast as possible is great if you're doing a live coding presentation, perhaps, but it's not great when
you're trying to do reproducible science and rigorous work. So having taken a broad tour of the Jupyter ecosystem, my really unfair, cynical take is that a lot of the things that people build are to allow people not to have to develop good habits. You don't want to develop good habits? Oh, I built this for you.

So this is just a blog post that came across my radar; I think I probably saw it on Twitter. It's just some random blog post, but people liked it, so there must be something to it. It's how to present your data science results in a Jupyter notebook the right way, and then it says how to import one Jupyter notebook into another. So here's someone who wrote some code, some function, and doesn't want their client or decision maker to see it, so they've invented a couple of kinds of magic that allow this to happen. It turns out that here's some code I wrote but I don't want my client to see it is actually a solved problem that people who do Python know how to solve, but if you insist on putting everything in a notebook, then it's not a solved problem without this kind of thing. And in the comments, people actually pointed this out to the author and said, you should really just write modules and libraries and test them and be clean. And the author kind of admitted, I agree that would be better than what I proposed here, but this takes into account that data scientists write bad code, and yes, this is hard to debug, I agree with you. But anyway, this blog post was called the right way.

Okay, so notebooks encourage bad habits. They also discourage good habits. Here is some code from a live demonstration I do about building a deep learning library. On the left you can see that I've nicely segregated things into modules: there's a loss module, there's an optimizers module, there's a tensor module. And on the right-hand side you can see that I've used classes and inheritance, and I've made this code so that it's very clean and testable and reusable. And this is something
that I live code in an hour, so it's hard for me to accept that it's too slow to do things this way. Okay, so how do we make sure our code works? We write tests for it. How do we make sure it keeps working after we make changes to it? We run those tests. Is it important that we use that kind of discipline when writing code to do science? Yeah, of course it is. And notebooks don't easily lend themselves to the kind of rigorous unit testing that you can get when you write things in terms of libraries, which is not saying it's impossible to do, but it's not natural to do.

Here's how I write code. I use lots of type hints and I run mypy to type check them. You probably don't write code this way, in part because I haven't lectured you about it enough (if you work with me, you'll be lectured about it a lot), but also in part because notebooks, again, don't have a great story for using these type hints and checking them. So that's really tough for me. This is not pie-in-the-sky stuff; this is me, I live this. I work on AllenNLP, which is a library for deep learning researchers, and we don't want people to do bad science with our library, so we test the hell out of it. We run type checking, we run linting, we run everything, and in our tutorials we tell our users to write tests for their code as well. This is something I strongly believe: tests are so important to doing good science.

Okay, so my next complaint is that notebooks are way less helpful than my text editor, and some things are more easily demonstrated, so let me show you. I use VS Code. I define xs equals [1, 2, 3] and I want to insert something, so I do xs dot and I get this autocomplete that nicely pops up, and I do insert, and it says, okay, 0 is my index, now 10 is my object. That's fantastic. So now if I go and do the same thing over in my notebook, well, to start with, if I do xs equals [1, 2, 3] and then try to do tab completion, nothing happens, because I have to do it in different cells. So, okay, now if I run it
and I do xs dot, I get the completions, but they don't have that extra information, and if I do insert, that doesn't do anything. So if I want to get help, I can run that, and now I get the help, but it took me a couple of extra steps, and now I have sort of this mark of shame, that I had to use a question mark. If I delete it, then suddenly I get cells that are out of order, but if I leave it there, everyone knows I didn't know how to use insert. So I'm in a little bit of a bind.

This one is just kind of silly, but I want to compute the cost of my vanpool. So I'll say it has ten riders, and cost equals vanpool_riders times 50. Okay, that's reasonable enough. So let's go over here and do vanpool_riders equals 10. That's good. Now if we do cost equals van and I want to do tab completion, for some reason the tab completion seems to favor all the files in the directory over my variable. And it's possible that I might want to use a file name as my variable name, but it's sort of unlikely.

The next one is a little bit tougher. Like I said, I use these type hints. So from typing import List, and I have some function, def f, which let's say takes ys, which is a list of strings, and it will return None. One of the joys of using type hints is that now I can get this kind of autocomplete on this variable, just because I've told my editor this is a list, so now you can give me things like that. In a notebook, because you have to do from typing import List, and you have to execute the cell before you can do that, as far as I can tell (and I may be wrong, but I can't figure out how to do it) it is impossible to get that kind of tab completion. And so you're thinking, okay, I don't use these kinds of type hints, which you probably don't; only I do. But here's one thing that you probably do use: with open file.txt as f, f dot, and now here I get autocomplete on all my nice file commands, whereas if I try to do the same thing over here, I get nothing. So that's just a way that my
editor is a lot more helpful to me than working in a notebook is. That's kind of a small frustration, but it's kind of integral to the way that I do my work. These are just slides that I have showing the same things in case it didn't work, but I'll just skip past them.

Okay, so there is an extension called Hinterland that I found that does some of these autocomplete-type things, but some of its autocomplete behavior is kind of questionable, and it still can only autocomplete things that have been executed, which means it still can't handle these type hints or the with blocks that I showed you.

The other thing is that my editor has an integrated linter. So if I do something mistaken, like, hey, I forgot to use one of my variables, it will underline it and say, hey, by the way, you forgot to use this variable, you probably did something wrong. I couldn't figure out a way to get that kind of feature in a notebook. I did look: is there a way that I can run some kind of linting on my notebook? There is one solution where I could install this and use a %%pep8 cell magic for each cell, and then there was another one where I could install pycodestyle_magic and do %%pycodestyle for each cell, and then there was another one where I could do these other things. But anyway, whatever it is, it's a lot of friction, and it didn't get me the same feature that I get in my IDE, which is one that helps me be productive and write good code. It's possible that some of these things are addressed in the next-generation JupyterLab, but in this version, this is what happens when I tried to do tab completion in the text editor, so it didn't do what I wanted either.

So my next complaint is that notebooks encourage bad processes. I saw this on Twitter fairly recently: Jupyter notebooks can now share kernels within JupyterLab; makes it easy to share state between notebooks. So this is fantastic. And I tried very hard to
think if I could come up with a non-evil application of this functionality, and I was not able to. But it may just be kind of above my pay grade, because I have the small brain, where I have to run commands in order in the console. You might be a little bit more enlightened and run commands out of order in a notebook. But then you need the cosmic brain to run commands in indeterminate order across multiple notebooks.

Once everyone knew that I was giving this talk, they started coming in with their Jupyter problems and said, you don't like Jupyter, you must know a lot about notebooks, tell me how to solve my problem. So here's someone who said, when I have a huge dataset, I'm used to saving time by just keeping it in memory in my notebook and then running different analyses off of it. And I asked, why don't you just persist the processed data? That seems sort of safer and more replicable than having it hanging around in memory. And they seemed to agree that that was a pretty good solution.

Okay, so this one you will probably find controversial: notebooks hinder reproducible and extensible science. And yeah, I'm serious. Don't take it from me, take it from François Chollet, but don't tell him that I quoted him in this talk; he probably doesn't agree with me about notebooks. Poorly factored code is bad science: it hinders reproducibility and increases chances of a mistake. Notebooks are a recipe for poorly factored code.

Okay, so this is an idea that my coworkers and I were kicking around at lunch fairly recently. We thought we could use a neural network to generate some MIDI, and there's more to it, but this was kind of at the core. And so I thought, I don't want to reinvent the wheel, let me find someone else who's done this. I use PyTorch, so I said, let's find a library that does MIDI in PyTorch. And so there's the first result; they've got a GitHub repo, so I'll go check it out. I go there, and all I see is a notebooks folder.
I was hoping there'd be a model.py or models.py or something that I could look at to get a sense of how the model works, but they actually have just a notebook that trains their model. So I go there, and they've got a bunch of imports. Is that PyTorch 0.3? 0.4? 0.4.1? I don't know. I can guess based on the commit date, but it doesn't say. And just as an aside, I went to look: is there a good story around managing requirements for notebooks? I couldn't find one, but someone helpfully on Stack Overflow said that good libraries are written to be backwards compatible, so ideally any version should do. I downvoted that answer.

Okay, so this is the code for training the model that I found. In one cell they've got the model definition. In the next cell they've got model instantiation with some hard-coded parameters, then lower in that cell they've got more parameters. In the next cell they've got some hard-coded paths, then further down there are more parameters, and then finally a training loop. So this code is pretty much impossible for me to use. First I'd have to look at the notebook, manually install all the dependencies, and hope that I got the right versions. Then either I can run it the exact same way on the exact same data at the exact same paths, or I can copy the notebook and try to figure out how to modify it, which forces me to work in a notebook, or I can do a lot of work to copy and paste the right parts into a module that I can import into my own code. And yes, there's a tool called nbconvert, and that does get the code out of the notebook, but it still mixes the library and the execution code. I still have to go through line by line and figure out which parts I need to change to customize it. Still no requirements or tests, and still few comments.

And I want to be fair: this is someone's fun project, so good for them for sharing what they did. I don't want to look down on them for sharing what they did. I think they probably thought
they were being helpful by providing notebooks, but what they did was make it sort of possible, but not easy, for me to repeat their exact same code with their exact same data at their exact same paths, if I have all the right dependencies, while making it very hard for me to build new things on top of their code. Lots of non-notebook code has these same issues, but notebooks sort of implicitly encourage this workflow.

You can imagine an alternative: a requirements.txt that indicates exactly the dependencies you need; a clean, parameterizable load-data module and a clean, parameterizable model, with good documentation and unit tests; an easy way to programmatically specify all the model parameters, and so on; and a script that replicates exactly the original result and that can be tweaked in obvious ways to run novel variants. And this is what we do on my team at work. This is the code for producing ELMo, which is a kind of contextual word embedding that one of my colleagues came up with. The installation instructions say run pip install this exact thing, run python setup.py install, and then run the tests. And then when you want to train your own language model, here's the command to run. It's very obvious where to put in your own parameters, and if you want to go and start creating your own variations, it's very easy to look at the code and make those modifications, and it's obvious where.

Next complaint: notebooks make it hard to collaborate across media. So what do I mean by this? Imagine the PyTorch Slack, beginner channel. I'm a beginner; that's where I hang out. Someone asks a question: I don't understand why my code is not working. I understand why it's not working, so I want to help. You've done something wrong; let me create a really minimal reproduction that shows you exactly where you went astray. So I did it, I copied and pasted it, and I explained it. And you could say, it probably took you a few tries to get that code
exactly right; I bet a notebook would have made it easier. Okay, so I can do that. I go to a notebook and I write the same code, and it works the same, but now I need to copy and paste it into Slack, and that turned out to be more difficult than I was expecting. I tried select all, copy, paste; that got me a bunch of headers and other stuff. I tried select cells, Edit, Copy Cells, paste, which didn't seem to do anything. I did select cells, copy, paste; that got the inputs but not the outputs, which made it hard to illustrate. So basically I could not figure out how to copy and paste code and outputs from a notebook into Slack, and as far as I can tell, in JupyterLab the things I tried in the other version wouldn't work at all. This might seem like a frivolous complaint, but as part of my job I actually spend a lot of time debugging people's Python issues over Slack, and similarly I spend a lot of time pasting code into GitHub issues and code reviews, so making it impossible for me to do that is kind of a non-starter for me.

One other nuisance: if I do Kernel, Restart and Run All, it gets to this error cell and it just stops executing. The error was expected, and that was what I wanted to demonstrate, but it won't run any of the cells after that. So I went and found that this was a known issue, and that I can get around it by installing the Runtools extension, and installing this, and configuring this, and restarting the server. But that means that anyone I share the notebook with would have to go through all those same steps to get it to work the same way, and that makes it not really reproducible as an illustration.

Okay: notebooks make it easy to teach poorly. How many of you have ever seen a Jupyter notebook tutorial where there's not really much to the tutorial other than pressing Shift+Enter to run one line at a time and get to the bottom, and then you're done with the tutorial? Okay. So I went to the official Jupyter wiki, and they've got a gallery of interesting
notebooks, and I thought, if there's a place to find a great tutorial in Jupyter, surely it's the official Jupyter wiki that has a gallery of interesting notebooks. Here's a Python tutorial. Let's see: what is the Python tutorial that showcases the power of notebooks? Okay, so this is what the tutorial looks like. Let's learn while: while some condition, algorithm; executable cell. Break, as the name says, is used to break out of a loop; executable cell. So this is Shift+Enter, Shift+Enter, Shift+Enter, Shift+Enter, and there's really nothing going on here beyond the fact that the cells are executable. It continues. If you want to learn about if, these are executable cells. Are they a great way of teaching if and if-else? Not to me. Is this a great showcase of what notebooks are capable of? Again, not to me. But this is from the curated collection of Jupyter notebooks that are notable; it was an introductory tutorial. That's not to say you can't do really nice Jupyter tutorials, but it feels like too often there's this notion of every cell must be runnable, and that's kind of the bulk of what I'm producing.

The flip side is that notebooks make it hard for me to teach the way that I want to teach. Okay, so this is from a book, a real book, written by me. It's called Data Science from Scratch, available wherever books are sold; second edition being written as we speak. Not exactly as we speak, but so to speak. Several times people have said, you should make a notebook version of your book, and I said, okay, sure, let's give it a try. So this is the databases chapter in my book, Data Science from Scratch. We talk about databases and we implement a database from scratch. This involves creating a fairly large class that represents the database. So how do I introduce this class? I say, well, let's introduce a Table, and let's introduce the first method, which is insert. Okay, now let's show some examples of how to use this insert functionality, and now
let's go back and talk about the next method, which is update. So I tried to port this over to a notebook, and when I got to the part where I was trying to define the second method, I ran into a problem. The problem was that I had already defined the class, and so this idea that every cell has to be executable standalone meant that I could not go back and later add new methods to that class, even if that's what I wanted to do pedagogically. And you could say, your book shouldn't split the class into multiple pieces like that. That's fair, but the alternative is dumping pages of code at once on the reader and then referring back to it bit by bit, and I personally hate books that do that, so I made a very deliberate choice not to write my book that way.

And I'm not the only one who wants this. I found an issue where people wanted to define a Python class across multiple cells, so other people have desired this as well. One solution is called Jupyter Dynamic Classes: all you have to do is, when you want to add the new method, you just add this %%add_to or whatever. So that's one solution. Another solution (this was clever, I would not have thought of it myself) is that I define a class with one method, then I redefine the class as a subclass of itself and add the next method, and then I keep doing that. It's clever; I don't think I want to do it in my book, but I like it. This was also suggested: just define the function as kind of a bare function and then assign it to the class, Table.update equals update.

So every one of these requires presenting unnatural code to my readers just to appease the notebook gods, and I don't want to present unnatural code to my readers. I try to be very thoughtful about how I can best use code to illustrate these concepts and to teach, and so adding in cruft that's just there for the purpose of making it run in a notebook is not great for me. That's a complicated example, but here's a simpler
example, also from my book. I want to motivate when you might want to use a set versus a list, and so I say, imagine you have a list of stopwords: a, an, at, plus hundreds of other words, plus yet and you, and here's why you would want to convert it to a set before checking for membership. Now, hopefully it's clear what's going on here, and I personally think that this would not be improved by forcing me to define hundreds of other words as a variable; that would actually make the example less clear. Either there would be a short list, or I'd just have a list of hundreds of other words for no reason. So basically my point is that it's not always the case that having every code snippet executable is the best way to communicate something, but notebooks sort of force you into that format. And that's what I just said.

Okay, so what do I do instead, if I don't use notebooks? Because generally I don't use notebooks. One is I make Markdown tutorials and docs. You can see they look a lot like Jupyter notebooks visually, but the snippets are not executable, which is fine, because here I'm showing you a function signature; it shouldn't be executable anyway. Below, I'm showing you just the constructor for a class by itself; it also should not be executable anyway. So this is how I write tutorials and docs, and it works pretty well for me.

For my actual development stack, I use VS Code and then the IPython console, and I was going to give you a demo of what that looks like, just because some people don't know what that looks like. I want to demonstrate logistic regression, so we're going to do logistic regression on some digits. So from sklearn.datasets import load_digits, and let's say digits equals load_digits(). So now, on this side I'm going to write Python, and what I'll do is I'll just run that file, and when you use %run in the console, it runs through and it loads all those variables into memory. So now if I say digits, I can inspect it, and data seems to be the arrays. Let me
take a step back. What are digits? This is a digit recognition data set, so it's a bunch of 8x8 images of digits with what they're supposed to be. Data is this length-64 array, and target is the correct classes. So I can pull those out: data = digits.data and target = digits.target. Now let's go back and run it again, and data, that looks good. What's its shape? Its shape is 1797 by 64. I want to demonstrate a test and train set, so I'll take, say, the first 1500 and split it up that way. Let's say my X_train is data up to 1500, which means that my X_test will be data from 1500 on. Similarly, my y_train will be target up to 1500, and my y_test equals target from 1500 on. So that's good.

Now I want to do a model, so from sklearn.linear_model import LogisticRegression. We'll define a logistic regression model, model = LogisticRegression(), and I want to fit the model. Fit on X and y, so that'll be X_train and y_train. Then I want to make some predictions, so I'll say predicted = model.predict on X_test. Let's go back and run this again, check this out, and say predicted. OK, so those look like good predictions for what these digits are, 0 to 9.

Now let's do some metrics and see how good our predictions are. So from sklearn.metrics import, I want to do confusion_matrix and I also want to do accuracy_score. We'll say cm = confusion_matrix, and it wants y_true and y_pred; y_true is just y_test and predicted is predicted. And then a is accuracy_score of y_test and predicted. One more time. Accuracy, I'm sorry, a, is 88.8, so that's not terrible. Here's my confusion matrix; that looks good. So that's one, it was a three, [unclear] predicting three, and so on.

So let's visualize; I like visualizing things. So from matplotlib import pyplot as plt. OK, and I'm not that good at matplotlib, so we'll see if I get this right. So fig, ax = plt.subplots(), and then unfortunately that's not typed, so I don't get good
types in there. So ax.matshow of the confusion matrix, and I want to make a color bar with that, so I'll do this, and then I'll say fig.colorbar(colorbar), and then fig.savefig, and let's call it sklearn.png. If I run it again, let's see if it works. So here's sklearn.png. That looks pretty good, but I would have liked to have a title on it, so let me go back and add a title: fig.suptitle, "confusion". Now I can run it again, and if I go here, now it's got a title on it, and I can proceed in that way.

So you're thinking this was possibly a lot more clunky and confusing than using a notebook, and possibly it was, and the whole thing of saving our plot and then going to have a look at it is not great. Where it's nice is, OK, now I want to start factoring some of this code out so I can test it and reuse it. So let's do load_data, filename: str = None, and this will return a Tuple of numpy array and numpy array. So I've just used some types that I don't really have, so I'd better import them: import numpy as np, and then from typing import Tuple. So that'll be good. Now I'll just put all this inside a function and return data and target.

OK, so now I've got this inside a function, and what I can do is go over and write a test for it. I want to test that load_data works, so I'll say data, target = load_data(), and now I can assert that data.shape equals (1797, 64), which I think is what it was. And now I can run my test. "name 'data' is not defined." Hmm, so I already broke it. Let's see. Yep: data, target = load_data(), again. So now my test passes. And so the virtue of doing this, apart from being able to work in my text editor, which I really appreciate, is that now we can start factoring all these different pieces out into functions. I can test them, I can reuse them, and I don't have to rewrite everything from the beginning each time.

So I promised that I would give you a chance to win me over, so here's how you can win me over. So
one is the Jake plan: Jupyter notebooks could have a reproducibility mode, where code cells are read-only once executed, new code cells cannot be inserted above previously executed cells, and no cell can be executed until previous cells are executed. This is sort of how physical lab notebooks work. When you're doing science and keeping results in a physical lab notebook, you can't go back and cross off the observations from yesterday and write new observations on top of them; you have to keep going in order. So something like this would alleviate a lot of my complaints, but I also suspect that a lot of people would hate this workflow. I could be wrong, but that's my suspicion.

You could force people to name their notebooks. That's really simple, but I think it would be a good idea. You could give me, like I said, IDE-style autocomplete. I really rely on it for my coding, and it's very tough for me to code without it. It's the sort of thing where, if you're not used to using it, it can be hard to appreciate how much easier it makes your life, but it makes my life a lot easier. Some kind of real type checking and linting: I also rely on that a lot to write good code, so if there were a way to offer that, that would be much appreciated.

Definitely a better story around dependency management. Even outside notebooks, Python has a huge problem with this, so notebooks are not alone in this regard, but if there were a way that notebooks could step up and say here's a great way to manage your dependencies, that would be, I think, really helpful for reproducibility, extensibility, and things like that. And then first-class support for refactoring code out of notebooks into modules. Notebooks are good for experimenting with things, but in the long run I firmly believe you want your code in modules, you want unit tests on it, you want to be able to reuse it between projects, you want all these things, and so the easier it is to get your code out and into a module, the better it is. Now the
reality is you're not going to provide me all these things, and I'm not going to switch to notebooks. But I hope that I've challenged you to think about, one, how your tools limit you; two, how to write better software; three, how to teach better; four, how to practice better science; how different workflows can make your life easier; and some ways you could make notebooks better. So in conclusion, I hope, despite Jeremy's warning, that this was not an unhelpful way to spend time. And next I'll talk about how I don't like that slideshow program where sometimes the next slide is right, sometimes the next slide is down. I can't stand it. And thank you all for coming.

I will tweet out the slides; my Twitter, @joelgrus, is right there. Check out my book. There's a new edition coming soon-ish that will be updated to Python 3 and have some good stuff in it. My blog is very infrequently updated, but it's there. I have a podcast, I make live coding videos, you can send me an email and tell me why I'm wrong, and like everyone else on the Internet I have a SoundCloud, which you can check out and see my stupid little experiments in music composition. So I'm happy to take questions, heckling, feedback. [Applause]

I'll check it out. No, so, like I said, I feel like if you really want to explore data, but you explore it in a way where I'm not going back and changing things I already did but I'm appending to the end, to me that makes it a much more compelling format for exploring data, because I'm not going to get myself in this situation where I have inadvertently changed something I'm relying on and it's a mystery to me. So that's one use case. I also think they're great for communicating finished results. Like you said, whether you execute them or not, they're a good format for communicating. So there's this, The Annotated Transformer, which I don't know if you've seen. This is sort of in my NLP world, but basically someone took this really interesting Google paper, Attention Is All You
Need, and made a notebook presentation of it that's a reimplementation in PyTorch, which is easier to understand. But the thing is that most people who are looking at this are not using it as an executable notebook. He did that to make sure that it worked, but this is just a Markdown file that people are reading. So I think this is a really nice presentation format, and I think that the notebook probably enabled him to do this much better as a presentation. But where it would go awry is if you started, this is a really expensive thing to train, on multiple GPUs, and it will train for a day on God knows how much data, so you don't need to tell people you can shift-enter and run this cell, and shift-enter and run this cell. Really, I'm using the code to illustrate how to do this without making it so you need to run each little piece.

Yep, so if you had to mark it somehow, that was transparent. It's an interesting question. I have two thoughts about that. One is that it would be reasonable, and possibly this is even the case, that if a cell starts off indented and it matches the indentation of the previous cell, it just continues it. That wouldn't actually address this case, because the way that it's done in the book is: I'm going to introduce a class and one method, now I'm going to show you some instances of how to use that method, and now I'm going to go back and indent again. So it's almost as if you have this nonlinear thing where one thing branches off but then the execution continues down. So it's pretty complicated. I think it would be tricky to get it right in a way that doesn't make the notebook really hard to understand or make it seem kind of magical. But I don't know, I haven't thought much about it. I got to the point where I'm like, this doesn't support the flow I used in my book, and I didn't get much beyond how to make that work.

So I agree with that, and I think
there's plenty of ways to use notebooks that are actually quite helpful. I would say my position is more that there are a lot of ways to use them that are unhelpful, and a lot of people gravitate towards those ways or build tools to help facilitate those ways, and I consider that kind of dangerous.

There we're getting much closer to the, I think that's fair. And like I said, I think that if that's the use case you're optimizing for, there are things you can do that make them kind of a safer way of exploring coding. So just not letting you go back and change cells you've already run, that would allow beginners to code with, I think, a lot fewer problems, for instance. Someone else over there.

So I don't have a great approach. I thought about this a little bit, and I think my ideal would be, so on AllenNLP we use Sphinx. You write up docs for your classes in the Sphinx format, and if you're lucky it converts them to docs; if you're unlucky it gives you a cryptic error message that you forgot to put a newline somewhere, and it takes you a while to figure it out. So I feel like there's a way that, if you put your docstrings or your code comments in some Markdown-ish format, if you will, you could have a way of browsing code that would render those things nicely and give you kind of a more literate interface to your code. But this is something that I've thought about very little. Something like that would definitely be interesting to me.

So I don't think I attended that presentation. I would say that, if I understand it correctly, probably my biggest qualm about that is that it's running a notebook and capturing the output of that inside a notebook and then saving the notebook. To me, having the output of a process stored in a notebook doesn't feel like a super usable format, because then I have to do something to get it out of the notebook, whereas I might want to run some giant MapReduce queries over the
results, or I might want to ingest it into some other system. So my initial reaction is, wow, that creates an extra layer of complexity at that margin. I get having notebooks be sort of the source of truth about here's what I ran. It feels a little bit stranger to me to have notebooks be the source of truth about here's the output of this process, unless the output is very small, a few numbers. But I didn't see that talk, so I can't speak super intelligently about it.

So would I feel better about that? I haven't used Binder very much. Which of my concerns would that address? Mostly just having an environment with the right dependencies, so that would help with the dependency part. I would still struggle with the fact that all of the parameters for the experiment are kind of all over the place: some are hard-coded in model instantiation, some are in one cell, some are in another cell. So I have to hunt through the notebook to figure out, OK, here I need to change the path, OK, in this other place I need to change the number of epochs, OK, in this other place I need to change the number of hidden layers. So not having a clear separation between the code and the configuration causes me a lot of difficulties as well, and makes it much harder to build on top of it.

So that's a fair concern, and if your intermediate results are 12 gigs of data, then you probably don't want to do the workflow that I showed. But are we talking about doing exploratory data analysis on 12-gig data sets? Interesting. It's not that it's a big data set; to me it feels like a big data set to do exploratory data analysis on. But I work in a world with pretty small data, so, hmm, I need to think about that. Obviously, yes, if you have 12 gigs of intermediate results and you want to rapidly try 20 different things on them, then
you're probably stuck either in a notebook or in the REPL, having it in memory and trying things on it. I don't think I have a good solution for rapidly reading gigs of data off disk with that.

Oh, so these are immutable data sets, let's say the intermediate results. That's fine. So I think that the part where you can't go back and change the earlier results would help with that a lot still, right? So I've computed this data set, now I can do various things off it in the notebook, but I can't go back and change something it depended on or change the data set itself. And if you're asking what if I want to change that data set itself, then I guess you would have to modify it in place or recreate it. Yeah. Yep, that's fair, and I didn't really address that. All right, thank you for coming. [Applause]
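Two of the workarounds mentioned earlier in the talk for adding methods to a class that has already been defined (redefining the class as a subclass of itself, and assigning a bare function to the class) can be sketched in plain Python as below. This is only an illustrative sketch: the Table class, its columns, and the method bodies are hypothetical stand-ins, not the code from the book or the slides.

```python
# Hypothetical stand-in class; names and method bodies are illustrative only.
class Table:
    def __init__(self, columns):
        self.columns = columns
        self.rows = []

    def insert(self, row):
        # Store each row as a dict keyed by column name.
        self.rows.append(dict(zip(self.columns, row)))


# Workaround 1: redefine the class as a subclass of itself.
# The base class is looked up before the name Table is rebound,
# so the new Table inherits __init__ and insert from the old one.
class Table(Table):
    def update(self, updates, predicate):
        for row in self.rows:
            if predicate(row):
                row.update(updates)


# Workaround 2: define a bare function, then attach it to the class.
def delete(self, predicate):
    self.rows = [row for row in self.rows if not predicate(row)]


Table.delete = delete

users = Table(["id", "name"])
users.insert([0, "Hero"])
users.insert([1, "Dunn"])
users.update({"name": "Hiro"}, lambda row: row["id"] == 0)
users.delete(lambda row: row["id"] == 1)
```

Both versions run cell by cell, but each one shows the reader code that would not normally be written that way, which is the objection raised in the talk.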
Info
Channel: O'Reilly
Views: 63,489
Rating: 4.8838172 out of 5
Keywords: O'Reilly Media (Publisher), O'Reilly, OReilly, OReilly Media, I don't like notebooks, Joel Grus, Artificial Intelligence, Python, data science, learning, Jupyter
Id: 7jiPeIFXb6U
Length: 56min 12sec (3372 seconds)
Published: Wed Oct 10 2018