Iodide and Pyodide: Bringing Data Science Computation to the Web Browser - Michael Droettboom

Captions
Okay, so I'm Mike Droettboom. I'm a data engineer at Mozilla, and today I'm going to talk about a project we're working on there called Iodide, and a sub-project of that called Pyodide. This is the rough agenda: we'll talk about Iodide, then I'll tell you what Pyodide is, and then the Iodide server and how that relates to everything. And then, if you stay all the way to the end of the talk, we have a very fun little toy demo.

This is our team. We've got five people at Mozilla, and then we have some outside contributors in academia and in industry. We want to grow this slide with people here, hopefully.

Okay, so what is Iodide? It came out of a problem we had internally at Mozilla, where data scientists were doing experiments in order to answer some business question or some technical question about Firefox, and they would share that with their manager, who makes the decisions. The decision would be made, and then all that stuff would just fall off a cliff, and then you repeat. So there's this broken cycle: things get built, explained, and then thrown away, where it would be much nicer to have a cycle that connects all of the different phases of data science (the exploration, explanation, and collaboration) and keeps it going all in one system, without any friction between these things.

This diagram is not my own. It's shamelessly stolen from a paper by Adam Rule and others called "Exploration and Explanation in Computational Notebooks". One of the interesting insights in that paper is that many analysts simply choose to explain and share their analyses using other, more established media, and provide a link for the curious back to the notebook where they performed the analysis in the first place. So clearly the tool is not supporting all of the things they need to do, or it's making it too hard to do all the things they need to do in one tool. So we're trying to build something that's a little more akin to that.

So what's the primary difference between our tool
and something like Jupyter? Let me be clear right up front: we're not trying to replace Jupyter. We're exploring a different trade-off space, trying to make something that's different and explore what happens if you have a different set of trade-offs. What does that help? What does that hurt?

On the left, this is a diagram stolen from the Jupyter documentation that you may all be familiar with. Here you have a graphical user interface in your browser. It communicates with a server, that server communicates with a kernel, and that's where the actual data science computation happens. The kernel is usually connected to some remote data source, or maybe a local data source. So that's what you may know.

What we've done in Iodide is we've moved the kernel into the browser, so all of the computation is happening in the browser. It puts you right next to the UI, which means you can do very interactive, low-latency kinds of things with the UI in the browser. And then we do have a server to allow things like collaboration, data archiving, and other things, but it's totally optional. I think it's going to be very important because it supports a lot of use cases, but it's not critical in order to run an Iodide notebook.

So when you have these two different models, what do you get? With Jupyter, as you're probably familiar, if you want to share a Jupyter notebook with someone remotely, the state of the art is to deploy it in a container using Docker or something, and using Binder. This stuff is all way better than it was a few years ago, but it's still a cost, because you're running a remote service on someone else's behalf. There are free services to do that, but they don't always scale. If you're doing it internally, that's a big cost to your IT infrastructure. And if you're going to share a notebook locally, the person you're sharing it with is going to have to install Anaconda, install Jupyter and all the dependencies. There's a lot of complication there. The Iodide
model is basically: if you want to share something with someone, you put it up on a static web server. Every enterprise I've ever come across is running a static web server somewhere, so that's a pretty low-cost thing to do. And if you want to just give someone the file, you can do that, and they just open it up in their web browser. Now, obviously, the downside here is that your data has got to be pretty small and your computation has got to be pretty simple. But if that's okay with you, we're in this different trade-off space, and I think it creates some interesting opportunities.

Another major design feature of what we built is this thing called JSMD, which is shamelessly stolen from things like MATLAB cell mode or R Markdown. The Jupytext project is also going in this direction. Rather than using JSON like Jupyter notebooks use, we're using a plain text format. This lets you put it in Git; pull requests work; all that tooling around standard text files works really well. Many of our users, we found, actually don't like to use the cells in the notebook. They just open these things up in their text editor of choice, and it all just works.

Okay, so I'm going to do a quick demo of that. This is what an Iodide notebook looks like. It's communication first: this looks like a blog post or a document, right? It has explicative text, you can put math in there, and then generally you'd have something interactive, an interactive visualization. Here we have a Lorenz attractor demo. All the computation here is happening in the browser. I can change some of the parameters, and all this is very low latency because there's no server it's going to; it's all happening right here.

But then, beyond just acting like a blog post, there's this Explore button up here. I push this, and it's like View Source. If you remember the old days of the web in the '90s, with GeoCities and what have you, you could view source on things and you could actually look. That was how you learned how to put new things on
the web. We're trying to bring that sort of ethos into the data science world and into the modern age. So here you can see, on the left, this is the code. It's in cells: there's Markdown with JavaScript. I'm going to get to the JavaScript issue in a sec, but this is basically all the things that created the page. On the right-hand side we have a console that behaves very much like the Firefox dev tools. You can look at the value of variables and play around here, inspect things and see what they are. All these little tabs move around; it has a little layout engine and stuff. So anyway, it's this neat little development environment. The idea is that you might come across a notebook on the web that does something almost like what you want to do, or you want to drill down on the data somewhere. You would hit the Explore button and, Bob's your uncle, you have a full development environment here. You just keep going.

All right, so I'm going to pop back to my slides. Everything I've talked about so far, because we're in the browser, is built around JavaScript. JavaScript is a real love/hate relationship for me personally. On the one hand, it's really fast. It's got probably the best compiler technology of any dynamic language out there. You can write things that are, in a lot of cases, as fast as what you would write in C. That's great. On the other hand, you have these really rough legacy edges. The numerics in JavaScript aren't great, the typing isn't great. There are a lot of things you have to learn not to touch in JavaScript, and you do okay if you learn what not to do. It's familiar to a lot of programmers, so the body of JavaScript programmers is huge, but it's not very familiar in this crowd, to data scientists. There are data scientists who use JavaScript, but it's not the big mature ecosystem that Python has, or R has, or even Julia.

So that led us to the creation of the Pyodide project. Pyodide is basically a compilation of
standard upstream CPython, with NumPy, pandas, Matplotlib, and other data science tools, to WebAssembly. Now, WebAssembly, if you're not familiar with it, is a fairly new technology in browsers. It's a bitcode for compiled code that you share on the web, so it's completely portable, but once it hits your browser it gets compiled into native code and it runs really, really fast.

There are some related projects here. This is the first project I'm aware of that really is data science focused and tries to get the data science stack working really well. But there's PyPy.js, which brings PyPy into this environment and has the same pros and cons that PyPy has. There was zero dependency Python, which was an attempt to do this with Google Native Client. Google Native Client has been subsumed by WebAssembly; it's basically dead. And then there's Brython, which does something different: it actually transpiles Python into JavaScript and then runs it. So it's not a proper upstream CPython; you wouldn't get NumPy, you wouldn't get those other things.

So I will pop over again and give you a demo. Here I have an Iodide notebook, and all you have to do to use Python is this: we have a little cell type drop-down here where you can say JavaScript or Python. Change it to Python, run this cell, and over here on the right you see it says it's loading the Python plug-in, and a second or two later you have a full Python. I'll make this a little bit bigger, I think. There we go.

What's really cool here is that all the data types automatically convert back and forth. So if I have this Python data structure and I return it to display it, it's getting converted to JavaScript, and then the Iodide tool here is just displaying it as if it's JavaScript. It doesn't actually know anything about Python, but it's able to display these things because it's converting the data types.

You actually have access to the entire DOM API of the browser from Python, which is really powerful. So
here I have a little cell in Python that's going to make a little button and a little div, and when I click on the button it'll increment the value. So there's a little callback going into Python, adding to the number, and coming back. It all kind of magically works. You could use this to rewrite Facebook entirely in Python if you wanted to. I don't recommend that, but you literally have everything; there's no shortcoming here. Another fun little demo: you can make a canvas and hook up all the mouse events to draw on the canvas, and that's all happening from Python. This is also the basis for how we get Matplotlib to work, which I'll show in a sec.

So that's all pure Python that I've shown so far. If you import NumPy, it's going to go out to the network, fetch NumPy, bring it down, and then it displays the numbers. I just made a little sine wave here. It's not great to see a sine wave as numbers; what you want to do is plot it. So this cell imports Matplotlib. It's going out and importing Matplotlib and all of Matplotlib's dependencies. It's going out over the internet, from what's currently hosted on GitHub Pages, but really just from an HTTP website. For the demo, I wanted you to actually see the time it takes, but once this happens once, it gets cached in your browser and it's really zippy the next time. Exactly, yeah, exactly.

Oh, here we go, I've got to scroll down. So here's Matplotlib. The screen is a little small here, but basically you have a full Matplotlib plot of the sine wave, and it's fully interactive. All of this back and forth is happening without a round trip to a server, entirely within the browser. So that's really cool: you've got the bones of the Python scientific stack here.

However, because the data is actually shared back and forth between Python and JavaScript without even copying it, you can mix and match things if you want. So I can take this NumPy
array I've made, import Plotly, which is a JavaScript plotting library, and plot it in Plotly. So if you prefer Plotly, there's nothing keeping you from using that instead, and it works just as well. This, of course, is interactive too, but here they've written the plotting library in JavaScript, so it's very tightly coupled. I think that's one of the coolest parts about this.

So I will pop back to my slides. The biggest question I get is: how fast is it? This is a set of NumPy numeric benchmarks that someone else wrote, and you see some variation there. The scale here is how many times slower it is in the browser versus native on the same machine. Worst case, it gets to be about ten times slower in Firefox. Chrome is still significantly slower; hopefully, I assume, they will catch up at some point, but they are playing some catch-up.

The thing that seems to make the difference between the things that are ten times slower and the things that are the same speed is how much Python work you're doing. One thing that is slow in WebAssembly is calling a function pointer. That's for security reasons, which I don't fully understand well enough to explain, but they have a little extra work they have to do to call a function pointer. And because the Python interpreter is basically a function-pointer-calling machine (that's what it does), the more Python stuff you do, the slower it gets. However, these benchmarks down here are mainly working in NumPy, so they're tight C loops, and they pretty much perform at par. So that's pretty cool.

What doesn't work is actually pretty obvious. The Python test suite is running and, for the most part, is working just great. But things that you can't do in a browser, you can't do with this either. So you can't open up a raw network socket and start serving things over it. You can't fork a subprocess, because we don't have subprocesses in the browser. And you can't access files on your filesystem,
right? Threads and asynchronous stuff are coming. That's not in the WebAssembly MVP level, but there are plans to build that, so eventually, someday, we'll get multi-threading in this. You can use web workers now and do some things, but there you're building specifically for the browser, and it's hard.

So where do we want this to go in the immediate future? These are the packages we have: we have Python, NumPy, pandas, and Matplotlib. If you go down the list of the most popular scientific Python packages, the next on the list are scikit-image and scikit-learn, but those both depend on SciPy, and SciPy depends on Fortran. It's a little dirty secret of this community that we're relying on software using one of the oldest extant programming languages around. The story of compiling Fortran to WebAssembly isn't really fully there yet. It's not good enough to compile SciPy, so we're going to have to push on that and move that forward. But once we solve that, hopefully we can also get R.

A big part of where we want to go with this is having this nice, tightly integrated system where you can do some of your stats in R, where it's really good, and do some of your data munging in Python, where it's really good, and not really care, and do that all in the same process within the browser. The key to that, of course, is something like Apache Arrow or libndtypes in the middle as the conversion layer, to help you move those data types around without any copying.

But already we have a few other languages, not just Python. Outside contributors have given us Ruby, OCaml, and Lua. The JavaScript-related languages are obviously really easy, so we have JSX from React, and TypeScript. So it's already growing into kind of an ecosystem here.

The other thing we want to do is extend conda-forge. Right now conda-forge builds for Linux, Mac, and Windows. We're going to add a fourth one called WebAssembly, and that will hopefully make it a lot easier to get new packages into this
thing, and it won't be bottlenecked on a handful of developers to add new packages. Also, already (and I don't have a slide for this), if you have a binary wheel of a pure Python package on PyPI, you can install it and it just works. So go wheels; they make that easy.

All right, so that's Pyodide. I'm going to check my time. Okay, so now I'm going to talk about the server. Like I said, this is an optional feature, but I think it turns this from a thing that's kind of a cool toy into a real professional collaboration tool. We want to build a real Google Docs-like experience, where you just edit and things are saving, and you don't have to care about remembering to save. You know, the thing that everyone knows from Google Docs: we don't have to care about that stuff. Make it easy to share; add commenting, collaborative editing, forking, all these nice things. Google Docs is the shorthand way of saying that. It's also a place for archiving and exporting your notebooks as well as your data. I think when you have data scientists all managing their own stuff, sometimes it's a little tricky for collaboration, whereas if you have a centralized place to just share and work around, that's basically what the server is going to handle.

The server also acts as a data broker. There are many times where, in the browser, you can just hit an HTTP URL and get your data back; you have some data service or whatever. That doesn't always work. Sometimes there's an authentication step you need to do in the middle, and the server will help with those sorts of cases and act like a broker or a connector to various data sources.

So now I'm going to invite my coworker Teon up here, and maybe you can explain what this is while I get the demo going.

Sure. So in a former life I was a neuroscientist, and I actually specialized in neurophysiology. What we have right here is a portable EEG. It connects over Bluetooth. It's pretty cool: it is no
longer the case that you have to go into a lab and get hooked up with electrodes in order to record your brainwaves. So this is pretty cool. It's 200 bucks, and it's from Muse. It's a great company. Is it turned on? Yeah.

Okay, so we're connecting to the headset using a technology called Web Bluetooth, which lets your web browser communicate over Bluetooth with all kinds of crazy devices.

So you can view your brain through the web.

We're having the same issue we had the other day. Sometimes this glitches out and it doesn't connect the first time. I'm going to reload the page. Please turn it off. All right, I'm going to restart this. That worked last time, at least. We haven't gotten to the bottom of where that issue is. Okay, I'll reconnect. It won't let me. There we go. It's not letting me click the button. All right, one more time. One more time's the charm, right?

To talk about what this notebook is doing: it's taking the Bluetooth data from the headset, bringing it into Python, performing an FFT there, because we have the FFT in NumPy, and then plotting that out again using Matplotlib. So it's a real hybrid of all the various technologies here.

And what EEG does is measure the voltage on the scalp, so you can think of this as a multi-voltmeter. You basically have a reference, and then we have data coming out of your head. He's getting noise right now. Try not to think too much. Let me see. Yeah, I can try. There you go.

So you're seeing, in the top four lines, the four detectors on the device, and the bottom one is the FFT. We have TP9, AF7, AF8, and TP10. These are the frontal electrodes, and these are the ones around the temporoparietal area. What's really cool is that this is also used for meditation. One thing that we know from meditation is that when you are being very attentive, very focused, you have an increase in the alpha range, which is between 9 and 12 hertz. So there are oscillations that happen in
the brain, and between 9 and 12 hertz, when you are meditating, you see an increase of power. So what you will see in the FFT is that the values go up when you're meditating, versus when you are talking or just doing other things where you don't have a singular focus.

Cool. Yeah, live demo.

Yeah. All right, so I've just got two more slides, and then I think we have some time for questions. Part of us coming here: we're Mozilla, and this is all open source. This isn't really a sales pitch at all; it's actually a call for contributions. We need experimenters, designers, programmers, writers, bug hunters, anyone who's interested in helping us out. Find us on GitHub; that's our GitHub project. We are also having an unconference space at 4:30 in Belasco, way down around the corner. And there's my email. So that's it.

This is great, wonderful work and really inspiring stuff. You mentioned that right now the biggest performance bottleneck is doing native Python operations, and you also mentioned that, as opposed to some other libraries, there's not any translation happening. Have you considered transpiling easily transpiled bits of Python into JavaScript or WebAssembly?

So I haven't played with that at all. I mean, I have played with the Brython package and demos, and the thought of rebuilding a NumPy that would work in that environment put me off, and I thought this might be a faster path to all the data science stuff. But you're right, there may be some sort of hybrid approach in the middle there, transpiling some of it and getting it to connect. I just haven't experimented with that at all.

Thank you.

Question: you said we couldn't get raw sockets, but can we get WebSockets, and can we get them today?

Yes, you can get WebSockets, but you have to have something that speaks WebSockets at the other end.

I really liked the format of the Jupyter-like plain text. Does it have the output, or is that planned? Because what you showed
was only the input, right?

Yeah, so actually it was input on one side and the output in another pane. In an earlier iteration of this, we were more like Jupyter, where we had inputs and outputs interleaved. From the user community within Mozilla, the feedback was that, for a tool that's about building a report, that actually got to be really confusing, because at a certain point you're trying to have things be up here, but your code is down here. So it was much easier to just do it side by side, more like a console thing. My understanding is that this is what MATLAB does, and RStudio does this too. So that's the model we've gone with.

I mean internally: in the text file, does it store everything, input and output, or is it separated?

It doesn't store any output in the text file; it's just input. The premise is that you can always generate the output, because you're always going to be in a browser environment.

Good, thank you.

Sure.

I think this is really cool. Have you tried using f2c to compile those bits of Fortran?

Yes, and actually f2c gets us pretty close. Unfortunately, f2c only supports Fortran 77, and there is some more recent Fortran standard stuff inside SciPy and R. And I think the last commit to f2c was sometime in 2002 or 2003, so it's kind of a static project. But there are all kinds of ways we can hopefully move it forward. I think there was a question up here, if we still have time.

How easy is it for someone to replicate your demo? Is it on your laptop? I missed the initial part. Is it on the cloud?

I did everything locally on my laptop, because I always worry about conference Wi-Fi, but many of these demos are actually already available online at iodide.io. Unfortunately it's an older version of our thing, but you can just visit it there, hit it with your browser, and play with it yourself.
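The plain-text notebook idea from the answer above (a text file that holds only the input cells, with output regenerated in the browser) can be sketched as a small parser. This is an illustration only, not Iodide's implementation: it assumes each cell starts with a line beginning with `%%` followed by a cell type such as `md` or `py`, in the spirit of MATLAB cell mode, and the function name `split_cells` is hypothetical.

```python
from typing import List, Optional, Tuple


def split_cells(text: str, delimiter: str = "%%") -> List[Tuple[str, str]]:
    """Split a plain-text notebook into (cell_type, source) pairs.

    Assumption: each cell starts with a line such as "%% py". Lines that
    appear before the first delimiter are ignored. Only input is stored.
    """
    cells: List[Tuple[str, str]] = []
    cell_type: Optional[str] = None
    body: List[str] = []
    for line in text.splitlines():
        if line.startswith(delimiter):
            # Close the previous cell before starting a new one.
            if cell_type is not None:
                cells.append((cell_type, "\n".join(body).strip()))
            # Whatever follows the delimiter names the cell type.
            cell_type = line[len(delimiter):].strip() or "unknown"
            body = []
        elif cell_type is not None:
            body.append(line)
    if cell_type is not None:
        cells.append((cell_type, "\n".join(body).strip()))
    return cells


notebook = """%% md
# Sine wave

%% py
import math
values = [math.sin(x / 10) for x in range(5)]
"""

cells = split_cells(notebook)
for kind, source in cells:
    print(kind, "->", len(source.splitlines()), "line(s)")
```

Because the file is ordinary text with no stored output, it diffs cleanly in Git and opens in any editor, which is the property the talk emphasizes.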
So I have two related questions. The first one is: if I have a command-line application, how easy is it to hook up a terminal that you can then use the command-line application from? The second question is: say I have a rich client application based on Qt or whatever. What happens when you go and try to load those tools? Where do they get rendered, and what happens there?

So with the command-line application, I haven't put any effort into trying to get standard in and standard out working. Standard out kind of works: if you say print from Python, it prints to the JavaScript web console. If you open up the developer tools, that's where it comes out. And actually the default behavior, if you read from standard in, is that it pops up a little dialog. If you've ever seen JavaScript alert, that really old-school thing, it pops up a little dialog where you can type things. So that's what works now. It's not really what you're talking about, but the pieces are all there to probably build something like that. We just haven't put any effort into it.

My understanding is that Qt has a WebAssembly port that's being worked on, so if you put it together with that, theoretically it could work. There's a big splashy demo of AutoCAD running in the browser, which is an old, traditional desktop application, so it can be done. It's just hard to say how much effort it is.

Perhaps a very goofy question, but once you have scikit-learn there, would you envision some parts of machine learning production being moved to the front end using this technology, say recommendation systems or carousels?

I think it's hard to say. I think the sweet spot of this is for a notebook, where the Python is exposed to the user. That's where I think this is really interesting, because you have a really good reason for Python to be there, which is that it's the user interface to what you're trying to do. For things that
are in production, that maybe work in a different environment, this is not going to be the most efficient way to do that. It might be the most developer-efficient way, and so there's that trade-off. But yeah, I don't really know. It's an interesting space to look at.

A lot of the tooling, I guess, is written in Rust, and then transpiled or compiled to WebAssembly?

No, it's literally the CPython source code that gets compiled to WebAssembly. There's no Rust in here yet. We'd like to have some Rust support, but it's just the upstream projects recompiled, with a few patches to make it all work.

Are there any projects you know of that are targeted at being able to compile Python to Wasm? Or I guess this is kind of what it is?

Yeah, so this is basically interpreting Python in the traditional way that CPython does, but inside the browser. Brython is something that compiles or converts the Python to JavaScript, so it's a different approach. It ends up being less like standard Python and a little less complete, but who knows, with more work on that, that might be a really viable approach too.

Thank you, it was really incredible.

Thanks. I think we might have time for one more.

For compiling Fortran, there's something called Flang that compiles to LLVM IR, which goes to WebAssembly. Have you...?

Yeah, we've actually been talking to the Flang developers, and they're very receptive in general to this idea. There's a little work that needs to be done. Flang uses a port of LLVM that's a little old, so the WebAssembly support is not great. So we need to push LLVM from this end and Flang from this end and get them to meet in the right place. But the pieces are there, and it's hard to say how close it is, but we wouldn't have to start from scratch.

We could probably do one more. Okay. All right. Thank you very much.
You'll be in Belasco later.
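The processing step described in the EEG demo (take a window of samples, run an FFT in NumPy, and look at the power between 9 and 12 Hz) can be sketched as follows. This is not the notebook's code: the 256 Hz sampling rate, the synthetic signals standing in for a headset channel, and the function name `band_power` are all assumptions made for illustration.

```python
import numpy as np


def band_power(signal, sample_rate, low_hz, high_hz):
    """Mean spectral power of `signal` between low_hz and high_hz (inclusive)."""
    spectrum = np.fft.rfft(signal)
    # Frequency (in Hz) that corresponds to each FFT bin.
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    power = np.abs(spectrum) ** 2
    mask = (freqs >= low_hz) & (freqs <= high_hz)
    return float(power[mask].mean())


sample_rate = 256  # Hz; assumed value, not taken from the talk
t = np.arange(2 * sample_rate) / sample_rate  # two seconds of samples

rng = np.random.default_rng(0)
noise = rng.normal(scale=0.5, size=t.size)

# Synthetic stand-ins for one channel: noise alone, and noise plus a 10 Hz rhythm.
resting = noise
focused = noise + 2.0 * np.sin(2 * np.pi * 10 * t)

alpha_resting = band_power(resting, sample_rate, 9, 12)
alpha_focused = band_power(focused, sample_rate, 9, 12)
print(alpha_focused > alpha_resting)
```

In the live demo the input would come from the headset over Web Bluetooth rather than from a random generator, and the spectrum would be redrawn with Matplotlib on each new window of samples.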
Info
Channel: PyData
Views: 2,994
Rating: 5 out of 5
Id: iUqVgykaF-k
Length: 32min 31sec (1951 seconds)
Published: Fri Feb 01 2019