Aron Ahmadia, Matthew Rocklin | Parallel Python: Analyzing Large Data Sets

Video Statistics and Information

Captions
Hi everyone, my name is Matthew Rocklin. I'm a software developer and I work on a particular library for parallelism called Dask; if you were in the last session you may have heard about it. OK, I'll try that again — how's this? Great. This tutorial was originally built with a few different library developers who all work on different parallel programming libraries, and the intention was to talk about parallelism in general, so that people can have a better understanding of how to choose the right framework for their computation and how to reason about parallel computing generally. So it's not about any particular tool; it's about a variety of tools, and more generally about how to think about parallel programming. If you came here looking for a Dask tutorial, it's not quite that — it uses mostly mundane technologies.

There are a few web pages you should know about. First, there's this GitHub repository, which has some materials, some notebooks, and some installation instructions. Inside the instructions it says to download Anaconda — please do not do that. We've noticed that there's relatively low bandwidth in this room, so if you need to download conda, download Miniconda instead, or use the cluster we've set up for you with everything you need. Some people like to play along with everything on their local laptop, so that when they walk out of this room they're still well served — you can do that too: follow those instructions, try using pip, or download Miniconda, which is a small conda installation. If you can test that the libraries are working, then great.

The other web page you should be aware of — I'm still trying to get a better name for this, but it's bigfatintegral.net; I meant to put it under a better domain. If you press this button it will launch a four-node cluster for you on Google Compute Engine. It will have all the libraries you need, all the data sets you need; it'll have Spark running, it'll have Dask running, it'll have IPython Parallel. So if you don't want to install stuff locally, or if the bandwidth is too low, you can just click that button at bigfatintegral.net. That link is also inside the chat room; the chat room is at gitter.im/dask/pydata-dc-2016, and it has all the links I've just mentioned, except for itself, because that would be sort of silly.

OK, so either you've started a Jupyter notebook server on your local machine or you're clicking that button. There are a few different places you can go: you want to go to the directory named pydata-dc-2016, which has the sequence of notebooks we're going to run through. I'm going to go through the first three and Aaron will go through the last three; mine are generally about how to think about parallel programming and Aaron's are about actual problems. So we're going to wait a few minutes while we get set up. If you have questions, please wave your hands wildly, ask on the Gitter chat, or, if you're totally fine, ask your neighbors if they need help — and maybe also hop into the Gitter chat. We'll come back in around five minutes and make sure everything's OK.

OK, some additional administrative notes: there are a few different wireless networks available to you. One is Guest BYOD, which I hear is fairly terrible; there's another one, C1 Coders, which is better but has a password. That password is, unhelpfully, in the Gitter chat, so if you have a friend nearby who does have internet access, you should ask them for the password.
Also, if you're a Capital One employee using your own internet: lucky you, but you actually can't access this cluster, because it's not using HTTPS. All right, no Cap One employees? OK. I'm going to start going now; feel free to ignore me. I'll talk for around ten minutes, maybe; we're going to swap between slides, explanation, and exercises. I'm also going to sit down, unfortunately, because of how the mics are set up.

In this first section we're talking about three different ways of thinking about parallel computing. First, parallel map, which is just applying one function to lots of things — a pretty common case: I have lots of files, I want to analyze them all. Option two is submit, which is fully free: lots of parallelism, however you want to set it up. And three is big collections — things like Spark, MapReduce, or Dask — that give you a few big operations, and you rewrite your computation in terms of those. We'll go through those three things in hopefully the next 30 or 40 minutes. This tutorial was originally a three-hour tutorial presented at SciPy this year, so if you go home and want the full thing, it's on YouTube. It was presented with Ben Zaitlen, who's online on Gitter helping us out — he's the person who built the cluster for you — and Min Ragan-Kelley of Jupyter.

So, first bit: the first three notebooks are map, submit, and collections. Map: we often have code that looks like this — we have some list of inputs, we want to apply some function to all of those inputs and produce a list of outputs. This might be a bunch of log files I want to parse, a bunch of CSV files I want to run some pandas code on, etc. We might use a comprehension instead, or, if we're feeling cool, we might use the map function, which some people use and some don't. The map function is interesting because we can redefine it and make it a parallel map. Most parallel frameworks include some sort of map function — things like concurrent.futures or thread pools, things like Spark, IPython Parallel, or Dask, things like joblib. Generally, how this works is that we import some library, we create some object, and that object has its own map; instead of the pure Python map we use our library's map, where that library might be some smart cluster, some thread pool, or something else.

So the first notebook is on the topic of map. This is the very common case — quick show of hands: who here has used some parallel library to map a function over data in parallel? OK, around twenty percent of the room. How about something more complex than map? Maybe about ten percent. OK, this is a pretty common case and a good tool to have in your belt; it honestly solves something like eighty percent of parallel programming problems. Of course you still need a decent computer or a good cluster, but it solves most of them.

Before we start, we're going to generate some data. We have some fake stock data: it takes a couple of megabytes of historical daily data and generates values on something like a ten-second interval, so it'll look like real data, but it's definitely fake. There we go — it's doing a bunch of work, all happening in parallel, incidentally — and what it produces is a directory with a bunch of data for us, in both CSV format and JSON format.
So the data set we're going to be working with for the first three notebooks runs on your local machine: you downloaded about two megabytes of data and it's now expanding out to an uncomfortable gigabyte or so.

I have a bunch of file names — a bunch of files of JSON data — and these are all for a bunch of different stocks; Apple is in here, and a number of others. Let me increase the font size. What we're going to do in this first example is load up each file, read it in as JSON, turn it into a pandas DataFrame, and then write it out as an HDF5 file. All it's doing is changing the data format: JSON is really slow to work with, HDF5 is really fast, so after we do this step all of our future computations will be faster.

This runs sequentially: we see it processing through the data set one file at a time; I'm just using a for loop. This is the kind of code we'd like to accelerate. While it runs over all the files, let's look at some options. As we saw, there are a few ways of writing this sort of embarrassingly parallel computation — "do this computation many, many times". One: we can use map. When we use something like map, we often need to take whatever we're doing inside the for loop and pull it out as a separate function. In this example, all the code in the body of the for loop needs to become a single function that we can call on each element of the data set. After we pull that for-loop code out into a function, we can use map, and we can use a parallel map.

So that took a while; you should get the feeling that it was somewhat painful. As an example, let's look at much simpler code. Rather than presenting a complex real-world example, we'll switch to a toy example, have a short exercise to solve that toy problem, and then use the same approach to solve the real-world problem. In our toy problem we're calling sleep eight times, one after another, and what we want is to call that in parallel. We're going to take the body of this code — which does two things: one, it sleeps, and two, it adds one to some input — and turn it into a function. Hopefully the body of this for loop, which took eight seconds, and this function look similar to you: we've pulled the body out into a separate function.

Now we can create a ProcessPoolExecutor, which comes from the concurrent.futures module — this is in the standard library — and rather than calling our function in a for loop, we use that object's map function. We have this function, we have a list of inputs (just range(8)), and we call map on those. So we're still calling the function eight times, but we're calling it in parallel across a bunch of cores — this machine has, I think, 16 cores — and it ran in about a second; it ran all of them at the same time. So there are two steps: one, we made a ProcessPoolExecutor — you can copy-paste that — and two, we took the body of the for loop and made it into a separate function, then called map with that function and that list of inputs.
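A minimal sketch of that toy example, assuming a small helper named slowinc (the notebook's exact names may differ):

import time
from concurrent.futures import ProcessPoolExecutor

def slowinc(x):
    # the body of the for loop, pulled out into a function: sleep, then add one
    time.sleep(1)
    return x + 1

if __name__ == '__main__':
    # sequential: eight calls, roughly eight seconds
    results = [slowinc(i) for i in range(8)]

    # parallel: same function, same inputs, but the executor's map runs the
    # calls across processes, so this finishes in roughly a second
    with ProcessPoolExecutor() as e:
        results = list(e.map(slowinc, range(8)))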
So back to our original example: we have our sequential code, which takes a while to run, and we want to parallelize it. We want to pull out the body into a function — it will probably take a file name as an input — then create a ProcessPoolExecutor and call its map with that function on all of our file names. OK, that's our first exercise; let's try that for a bit, and if you have questions, ask on the Gitter chat or wave your hands and we'll come by and help out.

OK, I'm going to keep moving along — there's a fair amount of material to cover here, so some of this might be abbreviated. Also, all the solutions are already present here. I get the impression that someone else is on this cluster, which is confusing. Yeah — your clusters are all built with Kubernetes on Google Compute Engine; there's a system that spawns up all of this when you click that button, which is neat, and we should all be entirely isolated, but I'm getting something weird. OK, great.

So here's a solution: we took the body of the for loop out and made it a separate function that takes in one of the file names; we also created a ProcessPoolExecutor; and then we're just mapping that function across all the different file names. We can time how long this takes — it's going to run in parallel using all the cores on my machine — and it took two seconds, which is a lot faster than what we saw previously. This is, again, the common case; most parallel computing problems fall under this pattern: I've got all this data, I have this function, I want to run it on top of all that data. Many frameworks implement some sort of map — we're using concurrent.futures here, but most parallel computing frameworks implement some sort of map.

Parallelism isn't everything, though: this computation can be equally well accelerated just by using a faster JSON library. So, a quick reminder that parallelism can solve your problems, but being slightly smarter about the libraries you use, or about your algorithms, can help just as much with much less pain. OK, so that's map — again, the common case, really useful.
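A rough sketch of what that solution can look like; the directory layout, the line-delimited JSON assumption, and the output paths here are guesses rather than the notebook's exact code, and writing HDF5 needs PyTables installed:

import json
from glob import glob
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

filenames = sorted(glob('data/json/*.json'))   # assumed data layout

def convert(fn):
    # read one JSON file, build a DataFrame, write it back out as HDF5
    with open(fn) as f:
        records = [json.loads(line) for line in f]   # assumes one JSON record per line
    df = pd.DataFrame(records)
    out = fn.replace('.json', '.h5')                 # hypothetical output path
    df.to_hdf(out, key='data')
    return out

if __name__ == '__main__':
    with ProcessPoolExecutor() as e:
        list(e.map(convert, filenames))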
Sometimes you have computations that are clearly parallelizable but don't fit into this standard map paradigm. For example, I have nested for loops, or I'm iterating over two collections and comparing elements within them, and depending on whether one is greater than the other I'm calling f or calling g. There's a bit more logic here — it's not just map — and it's tricky to think about how to write this code using one of those higher-level frameworks. In these cases it's nice to have a fallback, something I can use to get parallelism without needing a big operation like a map or a group-by. For that there's something called submit, inside that same concurrent.futures library; submit is also implemented in things like Dask, IPython Parallel, and multiprocessing. It lets us do parallel computing on arbitrarily structured computations, which is nice when you have obvious parallelism but it's not clear how to express it in terms of a framework.

How submit works is that we have some sort of system we've built, like a concurrent.futures executor, and we submit one function along with its arguments. This is just a single function call happening in some other process, in some other thread, or on some other computer, and I can call it many, many times. Rather than running the function immediately and blocking, it gives me back a future — some sort of promise that I can use to get the result when it's finished. Later on I can call result on that future to wait until the result actually arrives; in the meantime I can do lots of other stuff — for example, I can call submit many more times.

Here's one way you can implement map using submit; submit is sort of a fine-grained version. Here I'm submitting a function on many elements of an input list and appending all of those futures to some list. This happens more or less immediately, and then, while we're doing other things — maybe checking email, maybe plotting something — some other resource is computing those functions for us, and the answers come back when we call result. So for our previous example — the code that clearly had parallelism but wasn't a clean fit for map — every time we wanted to call a function like f, we replace it with a call to submit, where submit takes as its first argument the function we wanted to call, and as the rest of its arguments the arguments we would have given to that function. It's as if we're just moving the parenthesis over to the left of f. This code starts everything running in parallel, and then we call result on all of our futures to actually wait for them to finish. So submit is a pretty powerful tool: you can apply it to more or less anything to get parallelism out, and it handles that for you. It's good in cases where map doesn't fit.
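A small sketch of the submit pattern just described, using a toy slow function (the names here are illustrative):

import time
from concurrent.futures import ThreadPoolExecutor

def slowinc(x):
    time.sleep(1)
    return x + 1

e = ThreadPoolExecutor(4)

# submit starts the call running somewhere else and returns a future immediately
future = e.submit(slowinc, 10)

# ... free to do other work here while the call runs ...

result = future.result()          # block until the computation finishes; gives 11

# a map built out of submit: launch everything, then gather all the results
futures = [e.submit(slowinc, i) for i in range(8)]
results = [f.result() for f in futures]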
Any quick questions on that? I'd love to have at least one person ask a question. Yes — so the question is: is this always forking memory? I've got a lot of data inside my notebook session; is that going to cause problems? The answer: first, submit is an abstract API, just like map. Map can be implemented many different ways — for example, there's a ThreadPoolExecutor, which operates within the same process. The interfaces we're showing here we're going to play with through concurrent.futures, but they could be implemented many different ways; we don't really know how, and we don't really care — it's up to the library or framework to take care of it. In practice you don't need to worry; it's going to be fine. If you're using the ProcessPoolExecutor, which is what we'll use in a moment, it will fork, but that's probably copy-on-write, so you'll be fine; you'll be moving your data inputs over a wire, but in practice don't worry about it. A second question: does submit trigger computation immediately? The answer is, again, it depends on the framework implementing it; ThreadPoolExecutor and ProcessPoolExecutor have a queue of work, so they'll run, say, eight things at once and put the rest onto the queue. But let's move on to the second notebook, the submit notebook.

OK, so we have the same data set here, but now conveniently built as HDF5 files from the last exercise. If you didn't finish the last exercise, just load the solution cell and run it; it'll create all of this for you. What we're going to do is read the closing prices in from these HDF5 files. Let's look at the data for a moment — let's look at one series, I think Apple. This is the closing price of Apple stock over a long stretch of time, so inside this dictionary we have the closing prices of a lot of stocks, maybe similar stocks, over a span of time. Something you often want to do with this sort of data is see how well the stocks are correlated, or find a couple of companies that seem to be correlated very strongly, so that in the future when one goes up you might invest in the other one. So one thing I might want to do is find the two stocks that are the most correlated.

Here I'm going to do a few things, again in sequential code — this is the thing we're going to parallelize in a bit. We iterate over all the file names twice, skip the pairs where they're the same, and compute the correlation of one series with the other using the pandas correlation method. It takes a little bit of time, not too long, about one and a half seconds. This code is parallelizable, but it would be tricky to do with map; instead we can use the same trick we did with submit — take something like this correlation call, write a function, and then submit that function. OK, so it looks like there are two stocks that are pretty well correlated — you can see the correlation here — and we want to find those stocks.

Again, our computation starts out embarrassingly parallel: loading in data from many HDF5 files is something we could easily do with map. Then there's a nested loop with an if statement inside it; this is maybe harder to use map for — there's more complexity here. And finally there's a reduction over all of our correlations to find the best one; that part is just not very parallelizable, but it also wasn't very slow, so we don't really need to parallelize it. What we're going to work on is the middle part, which is maybe the hard part.

So, this submit function: if we have a function like slow_add, which waits for about a second and then returns the sum of two numbers, we can call it normally, or we can call it with submit. You may notice that this ran instantly — it returned immediately and the state of the future was "running" — but now, asynchronously, that computation has finished; it was running in a separate thread somewhere, and we can get the result back: three. So this is concurrent.futures with submit, which is a handy interface. We did it with just one single computation, but we could do this many, many times. For example, we can call slow_add in this comprehension, just normally, on ten numbers, and that's going to take roughly ten seconds, because it delays one second for every slow_add call. Or, in this second cell, we can submit slow_add and get back ten futures; that line finishes within a couple of milliseconds, and then we wait until we have all the results back, which takes however long it takes. This machine has a number of cores; we started a ThreadPoolExecutor with, I think, four threads, so it did four of the calls, then four more, then two more, and it took three seconds.

"Why is this just the function name, slow_add?" — that's a great question. Aaron's question is: I want to call this thing, I want to call slow_add on one and one, so why — normally in Python I'd write the function name, an open parenthesis, the arguments, and a closing parenthesis. The thing is, if I do that, it's going to call that function right now, and submit will never actually get a chance to run it. So we can't just call it immediately — that would trigger the computation inside our local thread, which is bad: we get no parallelism. We do this trick where we instead give submit the function we want to call and then, separately, all the arguments we want to call it with, and it does the actual calling on its own, somewhere — in some other thread, some other process, or on some other machine. A common mistake, though, is to do what Aaron was just describing and actually call slow_add in there. You can't do that. If we did, slow_add would run and produce two, and then we'd be calling e.submit on two, which is not of much use. So we always pass in the function and then the arguments (and keyword arguments if we need them), but we never actually call the function immediately.
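A sketch of the right and wrong ways to hand work to submit, using the slow_add toy function:

import time
from concurrent.futures import ThreadPoolExecutor

def slow_add(a, b):
    time.sleep(1)
    return a + b

e = ThreadPoolExecutor(4)

# correct: pass the function and its arguments separately; the executor
# performs the actual call later, in one of its worker threads
future = e.submit(slow_add, 1, 1)
print(future.result())            # 2

# common mistake: this calls slow_add(1, 1) right here in the local thread
# (blocking for a second), then submits the plain number 2, which is useless
bad = e.submit(slow_add(1, 1))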
OK, so, exercise: we have a couple of different functions here, slow_add and slow_sub, and we have some sequential code, and we need to use submit inside that sequential code to produce not a list of results but a list of futures. This is going to be very similar to the code we were just doing, but now it's a small exercise for you; if you need them, the solutions are in here. If you finish early, your actual full exercise is down here, and it's mostly the same thing but this time on actual data. Let's spend a few minutes on this exercise and I'll be back in a bit.

OK, I'm going to move on. Here's my solution to that problem: we submit the functions slow_add and slow_sub rather than calling them directly, and collect a list of futures. Submit returns a future object, which is something we can use to get the actual result when it's finished, so this futures list is a list of those future objects, and this code finishes immediately and gives us that list. Now, what actually happened? We're using the ThreadPoolExecutor, so there are maybe four different threads, all watching some queue; whenever we call submit we put a function and some data onto that queue, and the threads pull from the queue and do the work. When they finish, they put the results onto the future object, and we can call result on that future to wait for and get the result. So we can run this, and it runs faster than it would have otherwise.

We can do the same thing with our actual computation. The sequential version ran in about one and a half seconds. For a solution — this one's a little tricky, because we had a method call here, and it's a little awkward to submit methods — instead we make a little correlation function that takes two inputs and calls the correlation method on them. It's the exact same pattern. We run it, and it ran in about half the time. That actually isn't a whole lot faster — maybe twice as fast, not four or ten times as fast — and that's because this correlation computation isn't very computationally intense, so it's not easy to make it fast.
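A sketch of that pattern with made-up stand-in data; the dictionary of closing-price series and the ticker names below are assumptions, not the notebook's actual data:

import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

# hypothetical stand-in for the notebook's data: ticker -> series of closing prices
series = {name: pd.Series(np.random.random(1000).cumsum())
          for name in ['aapl', 'goog', 'ibm', 'msft']}

def correlation(a, b):
    # small wrapper so the .corr() method can be handed to submit
    return a.corr(b)

e = ThreadPoolExecutor(4)

futures = {}
for a in series:
    for b in series:
        if a != b:
            futures[a, b] = e.submit(correlation, series[a], series[b])

results = {pair: f.result() for pair, f in futures.items()}
best_pair = max(results, key=results.get)     # the most correlated pair of tickers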
It's worth noting that a lot of parallel computing really just helps you use multiple CPUs, but that might not be your bottleneck — you may be bottlenecked on disk I/O, on the network, on memory, or on other things. Someone in the corner also pointed out: the correlation is symmetric, right? So it would actually be a lot faster to consider only half of these pairs — we don't need to consider A and B and then B and A; the correlation is the same, so we only need one of each pair. We could have gotten the same speedup just by being a little smarter about our problem, without thinking about futures or threads at all. So again: always think about your computation before you parallelize it.

Now, about threads and processes: a good rule of thumb is that if you're using NumPy, pandas, scikit-learn, Numba, or any sort of Cython-ized code, you should be using threads. This avoids moving data between processes, which is expensive, and Python is actually very good at parallelism for numeric code with threads. If you're using pure Python code — dictionaries and lists and sets — you should be using processes. This is because of the GIL; you can look online to learn what that is. So: NumPy and pandas, threads; pure Python, processes. You can break this rule in fun ways.

OK, so that's submit — good for unstructured computation. There's a middle ground, which is big collections: things like Spark RDDs, or Dask DataFrames or Bags or Arrays, where you have some large abstraction — all of my data is a big table, or a big array, or a big list — and you're restricted to a few operations, things like group-by or shuffle or map or filter. You construct your computation with those big operations; the framework provides a few things you can do that are safe, and everything else is sort of not safe. If you can rewrite your computation in a few of these patterns, these semi-structured collections might be a good choice. It's a middle ground between map, which is just one pattern, and submit, where anything you want to do, you can do: here you have a few different operations and you try to compose your algorithm in terms of them. This works in lots of cases — SQL databases, data frames — if your computation fits inside that mold.

We're running a little long, so the next exercise you can do in either Spark or Dask; they look mostly the same. I work on Dask, so I'll use Spark, just to be egalitarian — there's about a one-word difference between the two. We have the same sequential code from before — let me make a copy — and we can think about Spark operations like map, Cartesian product, filter, or max, and about how to rewrite our for loops in terms of those operations.

Just to give a brief example of how that works: I make a little SparkContext — that takes a few seconds. An RDD is like a parallel collection; it's like a list that is maybe on my machine or maybe spread across processes on different computers. In the simplest case I can make one by calling this parallelize method, and I can gather it back by calling collect; there's some overhead here, but it's fine. I can do things like map, where I map a function across all of my data: my RDD contains the numbers one through five, I can call map on it and that gives me another RDD, or I can call map and then collect, and that gives my results back. So just like concurrent.futures, Spark also provides this map function — so does IPython Parallel, so does joblib, so does Dask, lots of things.
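A minimal sketch of those RDD basics, assuming a local PySpark installation (the master string and app name are arbitrary):

from pyspark import SparkContext

sc = SparkContext('local[4]', 'rdd-basics')    # small local context

rdd = sc.parallelize([1, 2, 3, 4, 5])          # a parallel collection (RDD)
rdd.collect()                                  # gather it back: [1, 2, 3, 4, 5]

rdd.map(lambda x: x ** 2).collect()            # map, then collect: [1, 4, 9, 16, 25]
rdd.filter(lambda x: x % 2 == 0).collect()     # keep only the even elements: [2, 4]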
With filter I can select particular elements to keep in (or remove from) my collection — here I'm selecting only the even elements. Cartesian product: I have this list, one through five, and it makes tuples of all possible combinations — (0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), and so on. And using map and Cartesian and filter and other things, we can start to chain these operations together: here I take my elements, take the Cartesian product with my original data set, filter for the elements where the second one is even, then collect. Again, if we can think with these big operations in our heads — map, Cartesian product, filter, group-by, join — then we can use systems like Spark or Spark DataFrames, Dask or Dask DataFrames.

So now we have this problem: we need to look at our sequential code — this nest of for loops, this condition, and then applying this function — and think about how to rewrite it in terms of the operations we've just seen: Cartesian product, map, filter, and so on. (There's a rough sketch of that chaining pattern just below.) There's some PySpark work happening here... OK, when are we out of here? Let's figure out a schedule — 12:15, OK. Go play with that while Aaron and I switch out, and then Aaron will talk about some other materials; I'll give a brief summary of this bit first.

We've seen a few different ways of thinking about parallel computation. We've seen map on one end and submit on the other: map handles the common case and is really good most of the time; submit is completely freeing and looks a lot like your normal Python code; and there's the middle ground of libraries that give you a fixed set of operations to work with. Depending on your problem, you might choose whichever one fits well. All of these kinds of algorithms have been implemented in many different ways: you can use all of them on your local computer, and you can use all of them on a cluster. So you don't choose one or the other based on your hardware; you choose based on your computation. It's a good way of thinking about a given problem: where do I fit? And then, separately, find some tool that matches that algorithm type on my hardware. If I'm on a single machine and I just want to use map, maybe concurrent.futures is enough — it's really simple, easy to use, done. If I want to use a cluster and I want to use submit, well, IPython Parallel does that and Dask does that, so I'm sort of limited to those two. If I want to use collections on a single machine I've got a few options, and on a cluster I also have a few options, things like Spark and Dask. So you can break your choices along those boundaries.

Aaron, do you want to get set up while people try out this exercise, rewriting this for-loop code using the Spark primitives? Again, if you can't get it, the solution is available. We're going to move forward onto notebook four, which is the cross-validated parameter search.
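A rough sketch of the chaining pattern the exercise is asking for, on the toy data; the actual exercise works on the stock series, so this only shows the shape of the solution:

from pyspark import SparkContext

sc = SparkContext('local[4]', 'chaining')      # or reuse an existing context
rdd = sc.parallelize([1, 2, 3, 4, 5])

# nested-for-loop logic as a chain of big operations: all pairs, keep the
# pairs whose second element is even, apply a function, then collect
result = (rdd.cartesian(rdd)
             .filter(lambda pair: pair[1] % 2 == 0)
             .map(lambda pair: pair[0] + pair[1])
             .collect())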
OK, so I'm going to wait a minute for everybody to get that notebook up, and then I'll start talking. I think I also need to be speaking into this mic, so that's a balancing act. If you're in the pydata-dc-2016 folder it's going to be numbered four; if somehow you ended up in the other folder, that's OK — the notebooks are the same, just make sure you're in the cross-validated parameter search notebook. Go ahead and raise your hand if you've got that notebook up. Raise your hand if you need some more time. OK, I'm going to move forward, and I'll let Matt and Hussain help out anybody who still needs time — Hussain is joining us; he's already been voluntold.

Cross-validated parameter search. Who here does model fitting with scikit-learn? Who here has done a parallel model fit with scikit-learn? Who here has done a parallel model fit with scikit-learn on a cluster, with a hyperparameter search? OK, I think I've seen one person. At the end of this exercise you will all have done a parallel hyperparameter search for a model fit on a cluster — well, actually you'll have done it on a node, but the steps to go from a node to a cluster with Dask are trivial. So we're going to talk through this.

Go ahead — do I have a mic? I am miked; there are probably just no speakers on that end of the room, so I can switch mics with Matt. Let's try this. I can shout too, but if I shout I have to keep turning my head left and right — do you prefer shouting or miking? OK, let me just switch mics and see if that works. Am I miked on the right side now? Yeah? Nice.

Just to recap what I said over the last two minutes: we're going to do a parallel hyperparameter fit. If you don't know what that is, don't worry about it — basically we're going to search over a set of parameters in a model fit to find the best possible one. This is a task that's trivially parallelizable, and yet we never parallelize it; sometimes you'll have vast computing resources available to you and still won't be doing things faster, because you haven't been able to parallelize things — until now.

So here we go; we're going to walk through this. We're going to start with a problem that a lot of you will find familiar: the digits data set in scikit-learn. If you've worked with scikit-learn before, you'll probably have worked with this data set at some point while training a classifier. I'm waiting for this to execute... OK, I lost the kernel for a moment there, and it's back. So this is the problem — go ahead? OK, so you should not be in pydata-dc; you need to go back up a level, because there's a file I failed to move over. Sorry — Matt just informed me that I failed to move a file over last night while we were getting this tutorial ready. The easiest way around this is to go back up to the cluster controller, head to the notebooks directory, and then open number 5, cross-validated parameter search; I will fix this after the tutorial.

Taking a quick look: what we're looking at is a set of data with handwritten digits, and we want those digits to map to the numbers 0 through 9. For each of these digits — probably the best guess here is that this one was a 0, for example — we don't want to do this by hand; we want the computer to do it for us. If you step forward, this import should work; if it didn't, you need to move over to the other notebook directory again: come up to the home directory, go into the notebooks directory instead of pydata-dc-2016, and then open notebook 5, cross-validated parameter search.
OK, so here's the parameter space. There are a couple of different parameters used, and I believe this is a support vector classifier — yes. A couple of parameters show up in a support vector classifier: I believe C is the penalty parameter and gamma is the kernel coefficient. This one is actually relatively important — it's the stopping criterion. The default that a library like scikit-learn gives you may not actually be suitable for your use case; you may want a model with really tight convergence, or a really tight attempt at convergence, so depending on your criterion you may need to change this tolerance. Don't always rely on the defaults, and don't even necessarily rely on what comes out of the hyperparameter search; take a minute and think about what your actual loss function is and what you're trying to get to. This is a little preview of a talk coming up on Sunday by Chris White, so if you have more thoughts about that you can come see him after the talk, or go watch his talk in the afternoon.

Obviously, when you train a model you don't want to train on your test data; you split it so you have a cross-validation set, and scikit-learn makes that easy. In this cv_params_demo file we've done some of that work ahead of time for you, splitting the data set up so that you can operate on individual pieces of the cross-validation set. So we have a training set, and now what we're going to do is run through this parameter space sequentially — did I miss something? Oh, right, there's a quick way to not execute this correctly. So this is a coarse-grained search that we executed in serial, and we're doing a small, coarse search precisely because we're in serial.

Here's a quick plot of the parameter space we're searching over. You'll find that the points in the lower right are the better solutions, but we don't really have a good understanding of what this space looks like, and we could do a much higher-resolution parameter search if we did it in parallel. Again, this is the number you may want to increase — this is just a pair of splits, so you could have more splits — and then this is the parameter grid, roughly 10 by 10; once you have this running in parallel you can increase it to get higher resolution. There are two solutions here that you can look at, but why don't you start, as an exercise, and try to apply all the things you learned in the first 45 minutes with Matt. If you get stuck, talk to a friend, raise your hand, or look at the solutions files — there are two solutions that have been saved for you and are already available.
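A sketch of what that sequential coarse search looks like; the notebook uses its own cv_params_demo helpers, so the splitting, the grid values, and the scoring below are simplified stand-ins (and model_selection is the modern scikit-learn location for these utilities):

import numpy as np
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, ParameterGrid

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0)

# a coarse grid over the SVC parameters discussed above (values are illustrative)
param_grid = {
    'C': np.logspace(-3, 3, 10),
    'gamma': np.logspace(-3, 3, 10),
    'tol': [1e-4],
}

# sequential search: evaluate one parameter set at a time
scores = []
for params in ParameterGrid(param_grid):
    model = SVC(**params).fit(X_train, y_train)
    scores.append((model.score(X_test, y_test), params))

best_score, best_params = max(scores, key=lambda pair: pair[0])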
What I'm going to do is paste my solution into the Gitter room; this is a solution that does the hyperparameter search in parallel on the cluster. I'm going to show you the differences between this solution and solution one — I'll just load it into another cell so we can look at them next to each other... oh, it's just this, all above; I see it now. OK. So this is the original solution — you all have it in front of you: you can pull up a ThreadPoolExecutor or a ProcessPoolExecutor, both would work, and then here's the work: you submit a function that evaluates a single model fit — that's what evaluate_one is doing, taking a specific set of parameters and fitting the model — append that future to the list of futures, and then force those results to return; it blocks on all the results in this last bit.

This is that same solution, but now it's been, we could say, Dask-ified. Instead of using a ThreadPoolExecutor we say from dask.distributed import Executor, and progress is just a little bit of sugar that makes it easier to see how things are going and to spot a problem. Of course this code will work for you as well if you have a cluster up — if you're logged into bigfatintegral you'll be able to execute this. You set up e = Executor on port 9000; that's just telling it where the scheduler is and what port to look for. This upload_file is there because the cv_params_demo file is not on the other cluster nodes; if you have some lightweight data or a Python module you need to ship over while you're testing things out, you can send it up to all of the nodes. That isn't what you would do in production, it's just something you'd do while playing around. And then everything else looks the same; the code actually hasn't changed. So if you got something working in parallel and tested it out and were comfortable with the way it was working, you could take this code, put it on a cluster, and run it in parallel, and you'll get the speedup from the cluster. The API is the same — sometimes things will subtly change and it won't work, but at least you'll be able to write code that looks the same or very familiar, test it out, and work on it locally before you go onto a big cluster. There's also a Spark solution down at the bottom; I don't know if anybody tried it, but it should work.

Any questions about this before I move on? I know we probably gave this thing not even a tenth of the time it deserves; there's a lot to explore here. Yes — did you have to start something on port 9000, like a separate process? The processes were already running when the cluster started; we started them up for you. Go ahead — sure... oh, right, we didn't save it as a result, we just kept it here; got it. So this line here, don't worry about it too much; it's almost assumed that the scheduler is going to be running on this port — it's just how the Dask cluster is set up. Go ahead. He's saying that the Dask solution is four times faster than PySpark. I wouldn't take those numbers too seriously — we get really critical of folks who don't benchmark properly, and you're on a cluster with shared tenancy. Yes, in this particular case it probably is faster; remember that PySpark has to cross the Java-Python boundary every time data comes in and out, and that slows it down, especially for smaller workloads where there's a lot of communication relative to computation. But there are no claims about the speed of Dask versus PySpark in this tutorial. OK, I'm going to move on before we get into dangerous territory.
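A sketch of that Dask-ified version, reusing the names from the previous sketch; the Executor name and upload_file call follow the 2016 dask.distributed API described in the talk (newer releases call it Client), and the scheduler address and the evaluate_one helper are assumptions:

from distributed import Executor, progress     # newer versions: from dask.distributed import Client
from sklearn.svm import SVC
from sklearn.model_selection import ParameterGrid

def evaluate_one(params, X_train, y_train, X_test, y_test):
    # stand-in for the notebook's helper: fit one model, score it
    model = SVC(**params).fit(X_train, y_train)
    return model.score(X_test, y_test), params

# connect to the scheduler that is already running on the cluster
e = Executor('scheduler-address:9000')         # hypothetical address

# ship a local helper module to every worker while experimenting
e.upload_file('cv_params_demo.py')

# same submit pattern as the local solution, just pointed at the cluster;
# param_grid and the train/test arrays are assumed from the earlier sketch
futures = [e.submit(evaluate_one, params, X_train, y_train, X_test, y_test)
           for params in ParameterGrid(param_grid)]
progress(futures)                               # a little progress-bar sugar
results = [f.result() for f in futures]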
OK, the final thing I wanted to show you is distributed data frames — it's number eight in the notebooks — and basically what I'm going to do for the next five minutes is just talk a little bit about it. This cluster is going to stay up after the tutorial ends and through lunch, so you may need to log in and create a new cluster, because I believe they're expiring — they're not expiring after 30 minutes? OK, keep your current one up; don't listen to me.

So now we're going to talk about a different type of analysis. Before, we started with our data already loaded, and what we wanted to do was fit a model. Now, as anybody who has ever worked with a data set can tell you, ninety percent of the effort is just getting it into memory and into the shape you want it to be in — if it's over a gigabyte, or over ten gigabytes, all of a sudden all of your standard tricks start getting really painful. What we're going to show you here is how to bring data into a pandas-like data frame in Dask. The tool we're using is the Dask DataFrame; it's similar to Spark's concept of an RDD or a distributed data frame, but to whatever extent possible it uses the same interface as pandas. Some things will be missing — you won't have everything — but you'll be able to use a lot of the same ideas. So if you're just exploring some data, or trying to get it from one state into another, this is a really good place to start.

The first thing we're doing is looking at the New York taxi cab data set; it's stored as a bunch of CSV files on S3, and we can see that if we tried to just pull one of these in with pandas it would be pretty painful. This isn't really showing you how hard that is, but it is what the data looks like. So instead we're going to use a parallel Dask execution context, and we're going to read those CSV files. That read_csv should look familiar — it looks like pandas read_csv. There's one more thing in here, this storage_options; I'm not going to go into it, but it's a Dask-specific keyword for loading the remote data.

Note that it didn't actually finish loading in that call. By default in Dask there are two things going on: it's lazy and it's non-blocking. Lazy means it's not going to execute until you tell it to; non-blocking means something comes back right after you make the function call, and that may not be the final result — it just says "I started this thing" — so you can go and start doing something else. There's a real difference there: lazy means it doesn't start executing until I say I need it; non-blocking means I stay interactive and get something back immediately. So, for example, if it's lazy I might say: OK, I actually need you to compute this. There are two ways you can tell Dask to compute. You can say persist, which means "do this thing" — an execution call, but not blocking. And there's compute, which means "I actually want the result back" — compute is blocking; it won't come back until it's actually got a result, and then it gives you that result as well. There are more notes on that in here. But now you actually have this data frame up, and you'll see progress being called every time we do an action, just to give you a sense of how tough some of these computations are and how long they're going to take.
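A sketch of that Dask DataFrame workflow; the S3 path, column names, and scheduler address below are assumptions drawn from the public NYC taxi data, not necessarily the notebook's exact cells:

import dask.dataframe as dd
from distributed import Executor, progress

e = Executor('scheduler-address:9000')          # hypothetical cluster scheduler

# looks just like pandas.read_csv, but it's lazy: nothing is read yet
df = dd.read_csv('s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-*.csv',
                 storage_options={'anon': True})   # extra keyword for remote data

# persist: start executing on the cluster, but return immediately (non-blocking)
df = e.persist(df)
progress(df)

# compute: block until a concrete result comes back
positive_fares = df[df.fare_amount > 0].fare_amount       # still lazy
stiffed_total = df[df.tip_amount == 0].fare_amount.sum().compute()
total_passengers = df.passenger_count.sum().compute()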
We can look at, for example, the fares we pulled out of the data frame that are positive — that's the request here. If you're familiar with SQL but not super familiar with Python or pandas, this starts looking a little bit like a SQL-like language; that's intentional — pandas data frames are sort of an in-between of arrays, or tabular data, and SQL-like or database-like access to your data. So we can take the sum of all of the fares where the taxi cab driver got stiffed (didn't get a tip), we can look at the total number of fares, and we can look at the total number of passengers. These are just easy reductions over the entire data set.

OK, it's 12:15, so I'm going to turn it over to Matt. I have to run — I've got another thing to get to. It's been awesome, this has been really fun; follow up with us, come to the happy hour tonight, and thanks.

OK, so I think we're actually at time — the tutorials are done now, right? I'll stick around for questions; we won't take questions on mic, but now would be a good time to finish up, and if you want to leave the room you can do so in a way that's friendly and happy. Thank you all for coming. I'll give another talk on Dask on Sunday that covers much more than what we've seen here, but generally speaking this wasn't a Dask talk — it was a general parallel computing tutorial, so remember that. Please play around with this stuff: these clusters will be up for at least the next day, so please play around; it's a nice chance to try a lot of different tools in a nice environment that's safe — you can't break anything, everything's protected — so do what you want. Thank you all for coming.
Info
Channel: PyData
Views: 1,825
Rating: 4.8 out of 5
Id: agD2VYrq-n8
Length: 60min 36sec (3636 seconds)
Published: Mon Oct 24 2016