Keynote: Jake VanderPlas

Captions
Thanks, it's really great to be here. PyData is a really fun conference, and over the last five years it's grown immensely. One thing that happens when a whole bunch of new people come into a community is that they may not know the history or all the little details, so what I decided to do with my keynote today is take advantage of that and give a kind of PyData 101: all the things you may have missed if you're new to the PyData community, some context about where to start if you're just coming in, and which packages and tools to look into as you go out and do the work you want to do in Python.

A little bit about me: I'm Jake VDP, and you can find me most places on the internet under that handle. The most important thing to know about me is these little weenies right here; they've recently replaced Python as my favorite thing to talk about, so feel free to ask questions, they're amazing. In the Python world I do a number of things. I have a blog that's updated much more rarely these days, but it's a fun Python blog. I've contributed to SciPy, scikit-learn, AstroPy, and other packages in the space, and I have a few books on Python, statistics, astronomy, and astrophysics that you can check out. I'm from the eScience Institute at the University of Washington, a short hop across Lake Washington from here, which is super convenient: given Seattle traffic, getting here from there is only slightly longer than flying from San Francisco. I got here in just under three hours today; it was great. We're funded by three organizations, the Moore Foundation, the Sloan Foundation, and the Washington Research Foundation, and I always want to shout out to them, because their generous funding has supported a lot of the Python open-source work I've done over the last few years.

At UW I do a lot of Python training, and I talk to a lot of students who are getting started, and I hear the same sorts of questions over and over: What is Jupyter? How do I load a CSV? How should I install Python? How do I make my code fast? Endless questions. There are so many resources out there, and you can find answers to all of these on the internet, but it's hard to find one place to get them. So my audacious goal today is to answer all these questions and more in about 40 minutes. Those questions boil down to two things. First, why is the PyData space the way it is? Why is this collection of packages and tools the way it is right now? That gets back to the history of how we got where we are. Second, what is the best tool for the job? That's more about having a catalog of associations: if I want to do this thing, I should look at this tool. So this talk has two parts: first I'll delve into the history of how the PyData ecosystem got here, and then I'll do a quick survey of what I think of as the most important fundamental tools in the ecosystem.

One thing to realize going into the history of PyData is that Python is not a data science language, and that fact comes to the forefront in all of these issues when we look at where Python came from.
Python was created by Guido van Rossum in the 1980s, basically as a teaching language: he wanted a good language to teach undergrads, and he also wanted to bridge the gap between the shell and C, to make things a little easier to use on UNIX systems. It's interesting that if you ask him what he thought Python would be used for, he imagined scripts of maybe a dozen or a few dozen lines. These days there are Python applications with thousands, tens of thousands, or hundreds of thousands of lines; Instagram runs on Python, along with many other companies. It's amazing where Python has come, given that it wasn't designed for any of this from the beginning. So the real question to answer is: how did Python become this data science powerhouse when it wasn't designed for it? I want to dial back and talk about where Python was in earlier eras, and this is the way I think about it.

In the 1990s, I think of Python in science and data as the scripting era, and the motto might be: Python is an alternative to Bash. No one wants to code in Bash, so let's code in Python instead. That's where we were in the 90s, and you see interesting effects of that. One of the people working in the scientific space at that point was David Beazley, whom you may know from the Python Cookbook and other things. He was working in a research lab in the 1990s and wrote an interesting paper about scientific computing in Python, in which he observed that scientists were using all these different tools, and tended to write homegrown software to implement their own domain-specific languages or command-line interfaces to tie them together. In the paper he argued: why don't we just use Python to stitch all these tools together? He gave a case study of a project he'd been working on for about four years that used Python as glue to script together a bunch of other tools. He also wrote a library that was really influential at that time, SWIG, the Simplified Wrapper and Interface Generator, which could parse an entire Fortran or C codebase and generate a Python interface for you, so you no longer had to write Fortran or C to drive your code. A lot of the early SciPy and PyData tools were built on SWIG; my first contribution to scikit-learn was C++ code wrapped with SWIG. (Later on we abandoned SWIG and moved to Cython, but that's another story.)

After that, I think of the 2000s as the SciPy era, and if there's a motto for Python in science in the 2000s, it's: Python is an alternative to MATLAB. For many reasons; I see some knowing nods in the audience. If you look at the people who were influential in the early 2000s in developing the SciPy stack, you can see a common thread. John Hunter was the creator of matplotlib, and a few weeks before he passed away in 2012 he gave an amazing SciPy keynote that I've honestly gone back and watched four or five times. In it he talks about his pre-Python workflow: a hodgepodge of Perl scripts that called C++, plus MATLAB; he got tired of MATLAB and started loading things into gnuplot.
That experience is what inspired him to build matplotlib, which is basically a MATLAB replacement in Python, the language he wanted to use, rather than that hodgepodge of other languages. Similarly, Travis Oliphant, who founded Continuum and before that wrote the NumPy and SciPy projects, says that prior to Python he used Perl, and then MATLAB, shell scripts, Fortran, and C++ libraries. He said: when I discovered Python I really liked the language, but it was nascent and lacked a lot of libraries; I felt I could add value to the world by connecting low-level libraries to high-level usage in Python. That's what inspired SciPy: a replacement for MATLAB, Fortran, and shell scripts, and he wrote it with that in mind. And similarly, if you know the IPython project or the Jupyter project, Fernando Perez created IPython with the same kind of hodgepodge behind him: C, C++, Unix awk, sed, sh, Perl, IDL, Mathematica, make. It's horrible to think about what science was like before Python. He built the IPython project because he wanted something IDL-like or Mathematica-like in the Python space, a single tool to replace all of those.

So all these tools that came out in the early 2000s had basically the same goal: each one wanted to be the replacement for MATLAB or for one of those all-in-one packages. If you look at the early code, they all had elements of visualization, computation, and shell. In matplotlib you can still import the mlab submodule, which has things like computing periodograms; there is still computation in matplotlib, even though a lot of it has since been moved out, and the libraries we know today, matplotlib, SciPy, and IPython, are now very distinct in their goals. It's been an evolution in the community that way. The key conference series for this era is the SciPy conference; these are all the SciPy logos I could find online (I looked for the ones from before 2008, but according to Google they don't seem to exist). The SciPy conference has driven a lot of that innovation from 2002 up to today. A few people in this room are going to be at SciPy Austin next week; it's always a really fun conference, and I'd recommend attending if you ever have the chance.

So after the scripting era of the 90s and the SciPy era of the 2000s, I think of the 2010s as the PyData era, and if there's a motto for the PyData era, it's: let's use Python as an alternative to R. I think we're doing pretty well as a PyData community at offering that. There are still a few things R does really well that we've not matched: one is breadth of statistical routines, another is visualization, though a few of us are working on an answer to that. The PyData era is typified by Wes McKinney, his package pandas, and his book Python for Data Analysis. This is what he says in the intro to the book: I had a distinct set of requirements that were not well addressed by any single tool: data structures with labeled axes, integrated time series functionality, flexible handling of arithmetic operations and reductions, flexible handling of missing data, merge and other relational operations. I wanted to be able to do all these things in one place, preferably in a language well suited to general-purpose software development.
This is what inspired pandas, and I would argue we would not be sitting here today if it weren't for the pandas library and what Wes did around 2009 to 2011, when he quit his day job and ate ramen for two years so he could work on pandas all day. If you ever see Wes, thank him for that, because he has done a huge thing for our community. There's other key software from this era too: pandas, whose first big release was around 2011; scikit-learn, which had an early release in 2007 but whose main releases came around 2009 or 2010; conda for packaging, which arrived around 2012 and really changed the way I, and a lot of other people, work in Python; and the IPython notebook in 2012. Later the IPython project was rebranded as Jupyter, and the Jupyter project has really pushed forward the way we interact with code, particularly in this community.

And of course the key conference series is PyData; you're sitting at one right now. The early PyData workshop in 2012 was a one-day event at Google in Mountain View, and I hold it close to my heart because my first public Python talk was at that workshop: I gave a one-hour tutorial on scikit-learn, and from there I was hooked, and I've been trying to attend as many of these conferences as possible. After 2012 they got their branding in order, and it's been pretty consistent since then. PyData conferences have been held all over the world, and the series has really pushed forward data science as something distinct from the scientific computing that the SciPy era was all about.

Of course, all these eras are concurrent: right now there are people using Python for scripting, people using the SciPy tools, and people working in the PyData style. But this is how I like to organize in my mind how we've gotten to where we are, and it helps explain why the tools are the way they are. The overarching theme is that people want to use Python because of its intuitiveness, its beauty, its philosophy, its readability. Python gets a lot of converts from other languages because it's fun to write. So what people do is build Python packages that incorporate lessons learned in other tools and communities. Wes specifically wrote pandas because he wanted to do what R does with data frames; John Hunter specifically wrote matplotlib because he wanted MATLAB-style plotting without having to pay for a MATLAB license. Python is really good at extracting knowledge from other tools and domains, pulling it into our own space, and running with it. We've also developed plenty of interesting things of our own: scikit-learn, for example, is, across any language, kind of the premier way of thinking about machine learning, or at least of interfacing to it; I would argue no other language has anything as succinct and well thought out as scikit-learn's API. But we have to recognize through all of this that Python, again, is not a data science language, and that adds a little bit of complexity at times. Python is a general-purpose language, and I would actually argue
that the general-purpose nature of Python is one of its strengths. You can think of Python as a Swiss Army knife: you can do all these different things with it. You can do web programming in Django, you can do all the backend stuff, you can do front-end-type stuff. But over the years, as more and more people use it, the Swiss Army knife gets complicated: all these little tools, and you have to choose which one and remember where it is so you can find the one you want. The strength is this huge space of capability; the weakness is knowing where to start. I deeply empathize with new users who say, "I want to learn Python right now," and then find that the universe is so huge, with so many packages and so many little unwritten pieces of knowledge passed from person to person, that it can be tough to break into.

So with that historical perspective on the PyData community, I want to jump into a quick tour of the PyData world: a quick summary of what I think of as the essential tools today.

For installation, I recommend going with conda. You can think of conda as a cross-platform package manager, similar to apt-get or yum or Homebrew or MacPorts, except it works the same way on Linux, OS X, and Windows, and lets you install everything you need in basically one click or one command-line invocation. It comes in two flavors: Miniconda, which is just the installer, and Anaconda, which is the installer plus hundreds of packages that Continuum thinks you would like. I recommend Miniconda, because you can get started with a 25-megabyte download and then install only what you need. You go to the website, click the installer for your platform, and run it at the command line (there are ways to do this through a graphical interface as well). Once you have that, you have a program called conda with its own Python installation, so your Python is now tied to conda and you've completely divorced yourself from your system Python or any other Python installations. Then you can run commands like conda install with the names of the packages you want, and all the dependencies are managed for you. It's a really nice system. If any of you were around the scientific Python world before 2012, before conda, things were way more painful, especially if you were trying to get a roomful of people with Mac, Linux, and Windows laptops all set up with the same tools. I'm glad we're not there anymore; those were dark times.

The other thing that's amazing about conda is that you can create environments, which are basically sandboxes for trying out new things. If you do conda create -n py27 (the -n gives the environment a name, here "py27") and tell it which version of Python and which packages you want, it creates a new environment, and once you activate that environment you have a brand-new Python executable and a brand-new set of core packages that go with it, and you can start using it without breaking anything else.
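As a concrete sketch of those commands (the environment name and package choices here are just examples):

```shell
# create a sandboxed environment named "py27" with its own Python
conda create -n py27 python=2.7 numpy

# switch into it ("source activate" was the spelling at the time;
# newer conda releases use "conda activate")
source activate py27

# install more packages into the active environment,
# with dependencies resolved automatically
conda install pandas scikit-learn jupyter
```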
I use this all the time. I have, I think, 70 or 80 environments on my machine, because I make one every time I'm developing a package. Whenever I'm working on scikit-learn, I switch to a scikit-learn-dev environment, install from master, and do the development work there; then, when I need my code to just work and not depend on bleeding-edge stuff that might be broken on master, I switch back to my Python 3.6 environment. You can move seamlessly back and forth between Python versions this way. So conda is huge; I would start there. And if you've heard about the pip-versus-conda debate: pip is another installer for Python packages, and it links up to the Python Package Index. Briefly, the distinction is that pip installs Python packages only, but can install them into any environment; conda can install any kind of package (you can install Node, you can install R packages, you can install anything), but only within conda environments. That's the distinction I would keep in mind, and if you want to read about three thousand words on conda versus pip, I wrote a blog post on exactly that a little while ago.

For your coding environment, once you have conda installed you can install Jupyter and the Jupyter notebook. If you've not seen this, it's huge. It was introduced around 2012; in fact, the first time I heard about the notebook was at that PyData 2012 meetup. I gave my scikit-learn tutorial using a web page and a terminal and all these different windows, and while I was giving the tutorial, Fernando Perez was in the audience typing my entire tutorial into a notebook, which had been released just two months earlier. He came up after the talk and said, "Hi, I'm Fernando, have you heard of the notebook?" and handed me my tutorial in IPython notebook form at the end of the hour. I've never looked back; every tutorial I've given since then has been in a notebook, because it's amazing. What you do is run the Jupyter notebook, and you get this web-based platform, kind of like file system access through your browser; you create a new notebook, and you get an interface where you can start running code and trying different things, and you can even embed graphics inline. You can do a lot with notebooks: the Python Data Science Handbook that I recently published was written entirely as Jupyter notebooks, so it's almost a publishing platform. If you don't feel like buying the book, you can go to my GitHub repository, where all the notebooks are; if you feel like helping out my kids' college fund, you can buy the book. And as of this summer (I think there may even be a release today; I was talking to the JupyterLab folks), there's the JupyterLab project, which is the next iteration of the Jupyter notebook. You can think of it as bringing the notebook into the future: a full IDE with a text editor and file viewers and things like that. I'm really excited about what JupyterLab is going to do for our community, and I anticipate all my work will be in JupyterLab within a couple of months; it's a really cool project that's coming together.
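To give a flavor of the notebook workflow described above: after installing with conda install jupyter and launching with jupyter notebook, a first cell might look like this minimal sketch (the plotted function is arbitrary):

```python
# a typical first notebook cell: inline figures plus the core imports
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 500)
plt.plot(x, np.sin(x))   # the figure renders directly below the cell
```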
So that's the coding environment and installation. What about numerical computation? If you want to do fast numerics in Python, everything depends on NumPy (and yes, it's pronounced "num-pie", not "num-pee"). If you use pandas or scikit-learn or any of these other libraries, they're built on NumPy throughout. You can conda install numpy, and what you get is a way to create arrays that you can interact with really efficiently. You can do element-wise operations on them: if you compute x * 2 on a NumPy array, it multiplies each element by two, whereas if you take a Python list and do x * 2, it doubles the length of the list by repeating its elements. That's because Python was designed not for data science but for something different, and we've layered these data science tools on top of it. You can do linear algebra, like taking the singular value decomposition of a random matrix; you can generate random numbers, like standard normal draws; you can take a fast Fourier transform. A lot of the core numeric operations you want in Python are implemented in NumPy, and they're implemented really efficiently.

Here's an example. If you're coming from C or Fortran or C# or another compiled language, you might be used to doing things by hand: to take an array of numbers, multiply them all by two, and add one, you might be tempted to write a for loop, because that's how you would do it in C. But in Python that's really, stinkin' slow: it takes about six seconds to do this basic arithmetic on 10 million values. It comes down to a number of reasons, but basically Python is interpreted and dynamically typed, and the CPython implementation most people use is not great at repeated numeric operations because it has to do a lot of type inference. With NumPy you can write this much more succinctly, and instead of six seconds it takes about 60 milliseconds. The way it manages that is that NumPy arrays know the types of their values, so the loops get pushed down into compiled code where the type inference doesn't have to be done ten million times; it's done once per operation. So any time you want fast numerics in Python, think about this kind of vectorization: if you're writing for loops over large data arrays, there's probably a faster way to implement your code. And if you want a more complete intro to vectorization in all the guises it takes in the NumPy world, I gave a talk on it at PyCon 2015; you can look up the video.
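A minimal sketch of that loop-versus-vectorized comparison (the timings quoted above are from the talk's slides; yours will vary by machine):

```python
import numpy as np

x = np.random.random(10_000_000)

# The C-style way: a Python for loop, slow because every iteration
# re-does dynamic type checks (on the order of seconds here)
def multiply_add(arr):
    out = np.empty_like(arr)
    for i in range(len(arr)):
        out[i] = arr[i] * 2 + 1
    return out

# The NumPy way: one vectorized expression, with the loop pushed
# down into compiled code (tens of milliseconds)
result = x * 2 + 1
```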
Next up is labeled data. We talked about how pandas opened the PyData era. Pandas is a library that essentially implements data frames and relational operations in Python. It's similar to a NumPy array in that you have typed data in dense arrays, but data frames have labeled columns and labeled indices, and you can add columns to a data frame using Python's slicing and indexing syntax. You can also load data from disk in a really seamless way that automatically infers the types of the columns. If you ever have data on disk and want to get it into your Python space, pandas is basically the way to go. NumPy has things like loadtxt and genfromtxt (has anyone ever used genfromtxt? It's horrible; you never want to use it), and pandas has essentially superseded all of that. And you can do all sorts of interesting SQL-like grouping operations, and they're really quick. Here's one where we have a bunch of IDs and a bunch of values, and pandas lets you say: group by this ID, take everything with the same ID, and sum the values; you get a data frame out that gives you exactly what you want. This is the kind of operation you basically can't do in any of the SciPy-era tools; pandas provides it, and that's a new thing, from around 2010. So yeah, pandas is great.
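A small sketch of that grouping operation (the columns here are made up, and the file name in the comment is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 1, 2, 3],
                   'value': [10, 20, 30, 40, 50]})

# SQL-like grouping: collapse rows sharing an id and sum their values
totals = df.groupby('id')['value'].sum()
print(totals)   # id 1 -> 40, id 2 -> 60, id 3 -> 50

# for data on disk, read_csv infers the column types automatically:
# df = pd.read_csv('mydata.csv')
```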
Moving on to the visualization space: if you look at visualization in Python, you'll probably come up with matplotlib first, and that's because matplotlib is battle-tested. It's been around since 2002, everybody's used it, and the Space Telescope Science Institute, which runs the Hubble Space Telescope, threw a whole bunch of resources into it back around 2004 and 2005. You can use it for just about anything. It looks a lot like MATLAB; if you've used MATLAB, you'll find it very familiar. Many people treat that as a bug these days, but it was definitely a feature when it was created: one of the reasons the SciPy ecosystem was able to take off is that it was so seamless to switch from MATLAB to Python. People bash matplotlib, and it's easy to bash it for its API and various other reasons, but we should keep the historical perspective in mind: we wouldn't be where we are today if matplotlib didn't have the API it has. It's easy to make quite simple plots; for more complicated things, I would go beyond matplotlib these days.

If you're visualizing data frames, pandas has really nice plotting routines built in that produce matplotlib plots without your having to touch the matplotlib API. If you take a data frame and call data.plot.scatter with the two column names you want, it gives you exactly the plot you want, with no need to fiddle with axis labels and the like; it just comes out. Seaborn is a different package, designed for statistical visualization, and it lets you build really sophisticated plots in a few lines of code, so it's a great library to check out. Beyond matplotlib, there are also libraries like Bokeh, which offers much more interactivity and lots of different plot types; I'm not going to dive deep into it, I just want you to know it's out there and worth checking out. Another is Plotly, which is similar in spirit to Bokeh: it renders plots in the browser, allows interactive visualizations, and has a huge gallery of interesting examples. So there's a lot out there.

And if you're an R user (remember, we're in the era of Python as a replacement for R), the one thing R has over Python that's really nice is the ggplot library, and I think nothing in Python fully matches it at this point. One approach is the plotnine library, which essentially gives you ggplot2's API on top of matplotlib, so that's worth checking out if you're a ggplot fan and want to keep using Python. It's not totally mature and complete yet, but I think it's pretty promising. There's also the Altair library, which I'm not going to cover here, but you can find some of my other talks on it online. Visualization is genuinely complicated in Python: there's a slide from a talk I gave at PyCon a few weeks ago where basically every node in the graph is some Python library used for visualization, so if you want 40 minutes of me talking about that graph, you can find it on YouTube.

With all that out of the way, you might want to do some numerical algorithms. SciPy is the package for numerics. SciPy started as a wrapper of NETLIB, a whole collection of Fortran libraries that do things like integration, interpolation, and optimization really quickly and efficiently, and SciPy contains many submodules that are essentially wrappers around these fast Fortran routines. I can't give examples of all of them, but basically any numerical operation you want to do, SciPy will have. For example, you can import the special submodule (special functions) and the optimize submodule, find the minimum of the first-order Bessel function, and plot it. That's the kind of thing SciPy does; especially if you're a physicist, SciPy is great, and it has all the routines you need.

If you want to do machine learning, I mentioned scikit-learn. It's a great library because of its API, and I'll show you what I mean. Imagine you have some 2D data and you want to fit a machine learning model (and we all know a machine learning model is just a fancy way of fitting a line to data, right? If you're using machine learning to drive a car, you just have a gigantic parameter space, and the line you're fitting is the one that makes the car not crash). To fit a line to data with scikit-learn, you use the model API: you create a model, fit it to your data, and then predict with the model on new data and plot the result. You can see what a random forest fit to the data looks like, and if you want to try a different model, all you have to do is change the model definition at the top. If I switch from a random forest to a support vector machine regressor, all the rest of the code stays the same. That's the benefit of scikit-learn: a single API that lets you explore basically every important machine learning algorithm out there without writing a lot of boilerplate yourself. It's a nice way to explore these methods, and I think it's a real strength of scikit-learn. We actually wrote a paper about the scikit-learn API, which I think is fun reading about the choices that were made in defining it at the beginning.
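A minimal sketch of that create/fit/predict pattern (the data here is synthetic and the hyperparameters are arbitrary):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = 10 * rng.rand(200, 1)                      # 200 samples, 1 feature
y = np.sin(X).ravel() + 0.1 * rng.randn(200)   # noisy target

model = RandomForestRegressor(n_estimators=100)
# model = SVR()   # swap in a different model; everything below is unchanged

model.fit(X, y)
X_new = np.linspace(0, 10, 500)[:, np.newaxis]
y_new = model.predict(X_new)
```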
If you want to start doing things in parallel, there's a great library that's maybe a year or two old now called Dask, and it's really interesting. In NumPy, you might take an array called a, multiply it by 4 (remember, that multiplies every element by 4), take the minimum, and print it. Dask lets you write the same thing, but rather than actually doing the computation, it saves the task graph that defines the computation. When you multiply the array by 4, it records "I want to multiply this array by 4" and builds up a graph: the data sits at the bottom, and then on, say, five different cores we construct the chunks of the array, multiply each by 4, and take the minimum of each; and of course the minimum of the minimums is the minimum. Dask knows about the associativity of these kinds of operations and aggregations. So in the end you have a way to construct the task graph without doing any computation, and then you can farm that graph out to anything: multiple cores on your laptop, multiple machines in a cluster, or something on Amazon or Azure cloud. At the end you call compute. There are some really cool things happening with Dask in the data science space; for example, there are ways to plug Dask into the backend of scikit-learn so it parallelizes things transparently, so you can look those up as well.
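A rough sketch of that deferred style with dask.array (the array size and chunking here are arbitrary):

```python
import dask.array as da

# a chunked array: each 1000x1000 block can live on a different core
a = da.ones((10000, 10000), chunks=(1000, 1000))

b = (a * 4).min()   # no work happens yet: this only builds the task graph
print(b.compute())  # now the graph runs, in parallel across cores or a cluster
```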
If you want to optimize code, there's a really interesting project that's been around for maybe five or six years called Numba. Essentially, it takes Python code and compiles it through LLVM to make it run really fast, and it's remarkably seamless. Say you're writing an algorithm with a big for loop in it. I told you for loops are bad and you should use NumPy if possible, but some algorithms can't be converted to vectorized code very easily. So you have code like this (everyone uses Fibonacci), and it takes 2.7 milliseconds to compute the 10,000th Fibonacci number. All you have to do is add Numba's just-in-time compiler decorator and you get a 500x speed-up. What it does is parse the Python code in the function and compile it with LLVM, very quickly, and from then on, whenever you call that function, you get the fast version. Some really nice projects have been built on top of this, for example the Datashader project, a visualization project tied to Bokeh, which uses Numba in the backend to do really fast visualization of billions of points. You can find demos where Datashader renders a billion taxi pickups while you scroll and zoom in real time; that's Numba in the backend.

Another way to optimize code is Cython. Cython is something different: it's a superset of the Python language that lets you compile Python into fast C code. As an example, take our same fib function, 2.73 milliseconds for the result. If we run it through Cython (%%cython is a way to do this inside the Jupyter notebook), we get about a 10% speed-up, which is, okay, sort of fast. What it's doing is taking the Python code, compiling it to C, and then running the C code rather than the Python code. But to really get the benefit of Cython, you need to add some types. If you look at the difference, all I did was declare int n at the top, and instead of a, b = 0, 1 I wrote cdef int a = 0, b = 1. Now the compiler knows these are integers and can optimize the code, and you get that 500x speed-up just by running your code through Cython with this little bit of extra syntactic sugar. Cython is an amazing project; the fact that it can do what it does is remarkable. If you look at the source code of NumPy, SciPy, scikit-learn, AstroPy, basically any numerical code in the PyData ecosystem, it's using Cython at its core. All these tools are built on top of Cython; that's how you get fast numerics in these libraries, and it's how you wrap other C libraries: libsvm in scikit-learn is interfaced through Cython. So if you're doing anything beyond basic Python development, check out Cython, because it's fun.
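Roughly what that typed cell looks like in a notebook, assuming the Cython extension is loaded first with %load_ext cython (note that with C integer types, large n would overflow, so this is purely illustrative):

```python
%%cython
def fib(int n):
    # typed locals let Cython drop into straight C arithmetic
    cdef int i
    cdef int a = 0, b = 1
    for i in range(n):
        a, b = a + b, a
    return a
```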
So that's the extent of the tour I wanted to give of all these packages; hopefully it was helpful. I've tried to put references throughout the slide deck if you want to dig deeper, and most of these packages have nice websites with tutorials. But remember, as you use Python: Python is not a data science language, and sometimes that makes things a little complicated. Sometimes it means there are many different ways to do the same thing, because everyone is building their own API on top of this language they love. But I think it's also Python's greatest strength, because we can draw from so many different communities and use Python for so many things beyond data science. Looking back at these eras of Python development is interesting, but what's most interesting is thinking about what's coming: what are the 2020s going to bring? Even though there are lots of challenges to Python's sovereignty in the data area, I'm pretty confident Python will remain relevant for the next ten years, because of the community and the way people in it keep adapting ideas learned elsewhere and bringing them into Python. So I think all of us will still be writing Python in 2029, but we'll see. Thanks very much; this is my contact info. [Applause]

We have about 15 minutes for questions, which, as Craig said, come in via Twitter. So, like he said, tweet at PyData Seattle; I'm using the very fancy technology where I refresh the page as often as I remember.

The first question came in from a couple of different people: you were talking about NumPy, so, is it "gif" or "jif"? I'm going to abstain on that one; I say "jif", but I don't know if that's the right answer. Although I also say "git" with a hard g, so I'm not consistent.

One of the first real questions: when should I use pandas instead of SQL, when both are possible? That's a great one. Pandas is designed for basically in-core operations, and SQL is a very different thing: you write SQL that can be pushed out to huge databases and run at scale. In general, if your dataset fits into memory, use pandas, because it's easy: you don't have to spin up a SQL server and figure all that out. There are some interesting projects trying to take the pandas API, or something similar to it, and attach a SQL backend, so there are various approaches you can look up, but for the time being I'd say that's the dividing line: if your data fits in memory, use pandas, and for a lot of people that covers their needs. If your data doesn't fit in memory and you still want to use pandas and core Python tools, Dask has a pandas layer that lets you do distributed pandas computation across data stored on different machines and in different places, so that's worth looking up as well.

Next, going largely out of order: do you want to comment on open science initiatives, any thoughts on open science? I like open science. (I don't know whether this means the actual capital-letters Open Science Initiative; there was no capitalization in the tweet, so it doesn't imply much.) One of the things that got me into Python, and has kept me in Python, is that I do a lot of scientific research, and I really think open science is the way of the future, and Python enables it: it's free and open, you can put your code out there, and you don't have to worry about site licenses the way you do when your work is in MATLAB or IDL. So I advocate Python partly because it's so good for open science, especially in more academic settings.

Another one: what about this deep learning thing I've heard about? Well, you might know this little company just across the water called Google; I've heard of it. Deep learning in Python is one of the areas where Python has really excelled in the last few years, and I think you can make the argument that Python is now the premier language for deep learning. A great example is the Keras project; I was going to put it in the talk, but it felt like a bit too much. What Keras does is provide a nice, clean API that can target different deep learning backends. Right now I believe it targets TensorFlow, which is Google's deep learning codebase, and also Theano, a project that's been around the Python community for probably a decade but has rebranded itself as a deep learning platform, because it can do things like automatic differentiation that make deep learning fast. So if you want to try deep learning in Python, I'd suggest checking out Keras; there have been some great tutorials at PyCon and recent PyData conferences on how to use it.
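For flavor, a Keras model of that era looked roughly like this minimal sketch (the layer sizes are arbitrary, and X_train / y_train stand in for whatever training arrays you have):

```python
from keras.models import Sequential
from keras.layers import Dense

# a tiny feed-forward network, compiled against whichever backend
# (TensorFlow or Theano) Keras is configured to use
model = Sequential([
    Dense(32, activation='relu', input_shape=(10,)),
    Dense(1),
])
model.compile(optimizer='adam', loss='mse')
# model.fit(X_train, y_train, epochs=10)   # given training arrays
```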
Next one: you said don't write for loops, but given NumPy, does Numba change that advice? As I mentioned, you can take some for-loop code and make it really fast with Numba, but at this point using Numba is still a bit of an art. People always put up Fibonacci examples like I did and show a 500x improvement, but as soon as you move to something more complicated that has to access data in different places, it takes more trial and error to get the code running really fast. I think Numba has a lot of potential and has already been used for some great things, but if you can do something you know will work well, like NumPy vectorization instead of a for loop, that's the first thing you should do. If you can't vectorize with NumPy for some reason, for example a memory issue, since NumPy instantiates all the intermediate results, then reaching for something like Numba or Cython is a good option.

(A quick announcement: if you're planning on going to a different session for the next talk, now's not a bad time to start moving. We'll keep doing Q&A, but the next talks are about to start, and we don't want everything to run late.)

The next one is close to my heart: does Python have an equivalent of dplyr? That's another area where I think R wins at this point. As far as I know, there's no good dplyr-type implementation in Python, but again, I think this is one of those areas where, in the coming years, you'll see Python adapt and absorb some of those ideas, so we'll see how that plays out.

A very urgent one: tabs or spaces? I of course follow PEP 8, which recommends four spaces, and that's about it.

Another one, maybe a podcast-style question: what other packages do you recommend out in the wild, things that didn't come up here but are fun to check out? If you listen to Talk Python To Me, the host likes to ask, what's your favorite thing on PyPI that nobody's ever heard of? Well, I'll mention Altair, because it's my own thing. I keep talking about how graphics in Python are not up to speed with R, and I think Altair is going to be a good way to address that: it's essentially a way to interface Python to the Vega and Vega-Lite grammars, which are graphical grammars, and you end up with a very nice declarative API for statistical visualization. Also, one of my favorite packages is emcee (spelled e-m-c-e-e): it's a really lightweight way to do Markov chain Monte Carlo, so if you're into Bayesian inference, check out emcee.
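A minimal emcee sketch, sampling a stand-in log-density (the Gaussian here is just a placeholder for a real posterior, and get_chain assumes emcee 3's interface; older versions expose .flatchain instead):

```python
import numpy as np
import emcee

def log_prob(theta):
    # placeholder posterior: a standard 2-D Gaussian
    return -0.5 * np.sum(theta ** 2)

nwalkers, ndim = 32, 2
start = np.random.randn(nwalkers, ndim)   # initial walker positions

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(start, 1000)
samples = sampler.get_chain(flat=True)
```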
You may have hinted at this at the end of your talk, but do you have any prediction for what the 2020s will be the era of? I don't know; prediction is hard, especially about the future. I think in the 2020s we're going to see people using Python more and more for deep learning kinds of things, and maybe in 2027 someone will give this talk and say the 2020s were the era of deep learning, when Python was trying to be an alternative to TensorFlow.

Here's a topic you didn't really hit, but it seems pretty important: what about report publishing? Do you see good ways for Python to integrate with things like, say, Tableau? There are different aspects to that, and I'm going to go back to the R community, because they've done some great stuff in this area. There's R Markdown, which I really, really like, and I wish we had an answer to it in the Python world, because it's just a nice way to create documents, books, blogs, and websites. There's another thing in the R world, Shiny, which is a way to create web apps: interactive ways to publish your data science results with graphics and interactions on the web. I think some of the answer in the Python world is going to come via JupyterLab. If you think about what the notebook is, it already gives you a lot of these elements, like creating interactive visualizations in the browser, but right now they're tied to the notebook: unless you have a notebook server running, you can't view them. JupyterLab is adding the ability to pull individual cells out of the notebook, individual pieces of the analysis, so you could imagine running through a whole notebook, taking the last cell with the interactive Bokeh image or the interactive Altair image or whatever, and saying, "I want to publish this," and having just that cell out on a web page, with everything else in the notebook running in the background where the user can't see it. Another interesting development is that the Plotly team just introduced a package called Dash, which is quite similar to what Shiny does: you can create interactive dashboards in Python and publish them to Plotly's server in one click. Of course it costs some money to put things on their server, but the package itself is open source, BSD licensed, so there's no reason you have to pay Plotly to use it. It could be interesting.

rpy2, question mark? I mean, more broadly, how do you use Python alongside other languages? rpy2 is an answer for R, and Beaker is a notebook that does this for many languages; what are your top picks? I know of those tools, but I don't use them much because I tend to do everything in Python. Jupyter is another thing that lets you have kernels in multiple languages. I guess I don't have a good answer for that; I apologize.

I think I have hit all of the pending questions; I have two different Twitter clients, and they both tell me we're caught up. So with that, let's go ahead and thank Jake again. [Applause]
Info
Channel: PyData
Views: 12,986
Rating: 4.9609122 out of 5
Id: DifMYH3iuFw
Length: 50min 29sec (3029 seconds)
Published: Mon Jul 24 2017