Parallel Programming in R and Python

Captions
All right, hi everybody, and welcome — thanks for joining us. There's a lot I want to cover, so I'm going to jump in and get started. If you have any questions while this is going on, just shoot an email to nick@dominodatalab.com.

Really quickly, a little bit about me so you know who you're hearing from. I'm one of the cofounders of Domino Data Lab, which is a platform for doing data science work on scalable infrastructure — it does a lot of other things as well, and I'll show you some of that as we get into this. Before that I built analytical software at a large hedge fund, and before that I had a formal education in computer science, so a lot of what I'm going to go through today is informed by a software engineering perspective on how to extract the most juice from parallel programming.

A quick outline of what I'm going to cover. I'm going to motivate the whole concept, explaining why I think this is important. Then I'll give an overview of general concepts in parallel programming and some pitfalls — this will be abstract, not language specific. I'll talk about some machine learning applications, and then I'll get into code: a section of Python examples, covering both general parallel programming and machine-learning-specific applications, and then some analogous examples in R.

OK, why do I think this is important? Here's a quote that I like — I've heard various versions of it attributed to different people, so I'm not sure of the canonical source: "Big data is like teenage sex: everybody talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everybody claims they're doing it." What does this really mean? You hear "big data" thrown around a lot, but when I ask people who say they're doing big data how large their data sets are, they say things like 10 or 20 or 30 gigabytes. That's not big data — that's medium data, or small data. What I mean by that is data that can fit easily into memory on one machine, and it's really easy now to get machines with lots of memory. These are standard EC2 instance types you can spin up with the click of a button, and some of them have 240 gigs of memory; you can buy machines with a terabyte of memory without any problem. So if you're doing something that can fit into memory on one machine, there's not a lot of point in going through the overhead of setting up a cluster or a very complicated infrastructure. Clusters are hard: there's a lot that can go wrong, there's a lot of work involved, and they create more fragility and more maintenance burden.

The other thing that's been happening is that lots of problems people need to solve in the real world are naturally parallelizable — I'll get into what I mean by that in a second — and this larger, more powerful hardware you can access also has lots of cores, so it's very easy to take advantage of that and parallelize your work.

So, a little bit about basic concepts in parallel programming. The key idea is to think about independent tasks within the work you need to do. As a heuristic, for loops are a great place to key in: if you've written some code that has a for loop in it, there's a good chance that what you do inside it can be parallelized.
You also want to parallelize things that are computationally heavy, as opposed to things that, say, read and write files or read and write to a database. That's because what you have multiple of on these more powerful machines is CPUs — multiple cores — so you want to match your work to the resource you actually have plenty of.

A couple of warnings and pitfalls — I'll show you examples of these, but the headline is that you can't just throw CPUs at your problem and expect it to be solved. Parallelism is not a substitute for good code, and there's some subtlety to getting the performance gains. There's overhead to distributing your work across cores, so not everything will benefit. You can have shared resource contention: if you're doing 32 computations in parallel that all use some shared database, you can actually make things worse. And if you try to use more parallelism than you have underlying CPU cores, you can run into contention there too. I'll show you examples of all of those.

Just as a basic example to reinforce this concept of independent work: the idea is doing payroll for a bunch of employees. We have all our employees and one processor, so we send them all to that processor. But if you have four processors, or four cores, you can split that work out, because processing each employee can be done in parallel — nothing about processing one employee depends on the others.

There are different levels at which you can parallelize. Starting at the lowest level, there are math operations: a lot of computational math operations, like matrix multiplication, can be parallelized. I'm not really going to focus on this at all in this webinar, except to say you should use an underlying linear algebra library that supports parallelism — OpenBLAS or ATLAS will do that automatically for you. What I'm mostly going to talk about is parallelizing algorithms: if you've got code you're writing, or machine learning algorithms you're using, how they can take advantage of parallelism. And I'll also talk a little bit about how to parallelize entire experiments — different techniques you might want to try.

There's one operation that's so common in parallel programming that I want to give you an overview of it, because it will frame the specific examples I show. It's commonly referred to as "map" — some programming languages call it "apply". The idea is that you have some manipulation you want to perform on a bunch of items: you have a bunch of items, and you have a function that operates on one of those items and returns a new, altered or computed result. Nothing about processing any one item depends on processing any other item. So instead of a normal for loop where you loop through each one, a map operation applies your function to each item and returns a list or array the same size as the original set of items, but with the function applied to each one. This is really easy to parallelize: you can split these independent tasks across your different cores. MapReduce, if you don't know it already, is an additional step on top of this, where first you map over all your items, and then there's a step where you reduce them by combining them in some way — adding the results together, or grouping them, or something like that.
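A minimal sketch of the map idea in Python, using the standard multiprocessing module (the payroll function and the employee records are made up for illustration):

```python
# The "map" idea: the same function applied independently to every item,
# so the work can be spread across cores. process_employee and the employee
# records are hypothetical stand-ins for a real payroll calculation.
from multiprocessing import Pool, cpu_count

def process_employee(employee):
    # Pretend this is the CPU-heavy payroll calculation for one employee.
    hours, rate = employee
    return hours * rate

if __name__ == "__main__":
    employees = [(40, 25.0), (35, 30.0), (45, 22.5), (38, 28.0)]

    # Serial version: an ordinary loop (or the built-in map).
    serial_results = [process_employee(e) for e in employees]

    # Parallel version: same function, same inputs, spread across cores.
    with Pool(processes=cpu_count()) as pool:
        parallel_results = pool.map(process_employee, employees)

    # Results come back in the same order as the inputs.
    assert serial_results == parallel_results
```

The parallel call returns results in the same order as the inputs, just like the serial loop — that is the property that makes map so convenient to parallelize.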
OK, so a couple of pitfalls and general principles. One is that you want to parallelize tasks that match up with your underlying resources. Typically what you have multiple of is CPUs, so — I mentioned this earlier — you want to parallelize things that are going to use the CPU, and ideally not parallelize things that are going to use a database or a disk, because there's only one of those, and your different parallel tasks will end up contending, fighting for that one shared resource.

As an example — this is pseudocode — say we have a bunch of items and we're going to do something to each of them: first we fetch some data for an ID, then we compute something about that object, and then we save a result. The fetching and the saving use the database and network traffic; it's really only the computing that uses your CPU. If you run the whole thing in parallel — say we have four cores and we're spreading this work across them — you'll have four cores at once hitting the database, which can introduce contention depending on how scalable your database is; then you'll get the real juice from spreading your computation across your four cores; and then you'll hit your one database again from four different threads. A better, cleaner way to structure this is to fetch your data all at once, parallelize just the actual computation, and then save your results back all at once — I'll show a rough sketch of that pattern below.

Another common pitfall: in the work you're parallelizing — in the actual parallel computation step — avoid modifying global state. I'll show you this in both Python and R, but most of the libraries I'm going to show use an underlying operating system concept called processes to split your work out. A process has its own memory, so when you start new processes your variables are copied into them, and the library knows how to copy return values back — but changes made to those copies don't show up in the original. This will be a lot clearer when I show some examples, but in the abstract: say we've got an array of items, and in parallel we're going to update one spot in the array per task. What happens, with most libraries, is that you have your array on one process and you spread the work out to, say, four different subprocesses, each of which gets a copy of what the array looks like — so they all start with zeros. Each one of them modifies just the one position its task is responsible for, but when you get back to the original array on the master process that kicked all of these off, it hasn't changed at all, because each local copy was modified only inside its own subprocess. I'll show you how to avoid that as well.
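Here is the rough sketch promised above of the fetch / compute / save restructuring, with only the CPU-bound step parallelized (the "database" is just an in-memory dict standing in for whatever store you actually query):

```python
# Fetch everything serially, parallelize only the computation, then save
# everything back in one shot, so the shared resource is not hit from many
# workers at once. The data and functions are made-up stand-ins.
from multiprocessing import Pool

FAKE_DB = {i: list(range(i, i + 1000)) for i in range(100)}

def fetch_data(item_id):
    # One read per id -- uses the shared resource, so keep it serial.
    return FAKE_DB[item_id]

def compute_result(data):
    # The CPU-heavy part -- the only step worth spreading across cores.
    return sum(x * x for x in data)

def save_results(results):
    # One bulk write back (here we just build a dict instead of a real save).
    return dict(enumerate(results))

if __name__ == "__main__":
    item_ids = list(range(100))
    data = [fetch_data(i) for i in item_ids]      # 1. fetch up front
    with Pool() as pool:                          # 2. parallelize the compute
        results = pool.map(compute_result, data)
    saved = save_results(results)                 # 3. save all at once
```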
OK, so lots of machine learning tasks are naturally parallelizable. Cross-validation is done all the time — training lots of different models — and it's inherently parallelizable, because what you're doing is splitting your dataset into different folds of training and test data and then training and validating a model on each; each of those folds, or partitionings of your dataset, can be handled on its own core, completely independent of the others. Grid search is also inherently parallelizable: here you're exploring different hyperparameters for your model — say we're training a support vector classifier, which kernel do you want to use and what penalty amount do you want to use — and each of those combinations of parameters can be evaluated completely in isolation from the others. Random forests are also very easy to parallelize, because building each tree in the forest can be done completely independently of the others.

A couple that are more subtle: k-means clustering can benefit from parallelism at each step of the iterative algorithm — when you compute the new distances between the points and the centers of the clusters you've picked, that computation can be parallelized. And neural network training can be parallelized as well. If you're using something like a backprop algorithm, you can split your training data up into different partitions, train the network in parallel on different cores with different subsets of your data, and then aggregate the weight deltas from each of the parallel training steps — you sum up the weight updates from each one and apply them to the whole network before you start the next step. Chances are you're not going to be writing your own neural net training library — good packages and libraries will handle this for you — but I wanted to make you aware of it.
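As a toy illustration of that data-parallel idea — assuming a plain linear model and synthetic data rather than a real neural network, since real libraries handle this for you — each worker computes the weight updates on its own slice of the data, and the slices' updates are averaged before being applied:

```python
# Data-parallel gradient computation: split the training data, compute a
# partial gradient per slice in parallel, average, then take one update step.
# Everything here (model, data, step size) is made up for illustration.
import numpy as np
from multiprocessing import Pool

def chunk_gradient(args):
    # Gradient of mean squared error on one slice of the training data.
    X_chunk, y_chunk, w = args
    residual = X_chunk @ w - y_chunk
    return X_chunk.T @ residual / len(y_chunk)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 20))
    true_w = rng.normal(size=20)
    y = X @ true_w + rng.normal(scale=0.1, size=10_000)

    w = np.zeros(20)
    n_workers = 4
    with Pool(n_workers) as pool:
        for step in range(50):
            # Re-split the data with the current weights each step.
            chunks = list(zip(np.array_split(X, n_workers),
                              np.array_split(y, n_workers),
                              [w] * n_workers))
            grads = pool.map(chunk_gradient, chunks)
            w -= 0.1 * np.mean(grads, axis=0)   # aggregate, then update
```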
OK, so let's get into some code. I'm going to start with some Python examples — Python has a few great ways to take advantage of multiple cores — and I'm going to cover joblib, IPython notebook clusters, and then scikit-learn. I'm going to switch over and show you Domino. I mentioned that Domino is a platform for accelerating your data science work by making it easy to access powerful infrastructure without having to worry about any of the plumbing or setup. This is the Domino website, and I've got a project set up here. It works kind of like GitHub: I've synchronized some files from my computer to the server, so I've got these files here — some R code and Python code. What I can do then, instead of just looking at these files, is access really powerful hardware and execute code. So I'm going to switch my hardware tier to get a big machine with 16 cores and 30 gigs of RAM — I could get a whole lot more, but for our purposes 16 cores and 30 gigs is plenty. I change my hardware, go back to my runs, and what I want to do is spin up an IPython Notebook session.

(Sorry — it sounds like we lost the screen, so let me go back a minute; it sounded like the screen cut out for a second. Domino synchronizes files with your computer: I just have a folder here with some code in it, I've uploaded those files, and I've got R code, Python code, and notebooks. Then I can change the hardware I want to use, so I can easily run this stuff not on my machine but on remote, much more powerful, more scalable hardware. I picked a 16-core, 30-gig machine; that can go a lot higher if we need it to. And then I'm going to fire up an IPython Notebook session.)

It's just one click, and I've now got an IPython Notebook session running on a remote 16-core, 30-gig machine where I didn't have to set anything up, and Domino handled transferring all my files over there. So I can pop open the notebook here, and I've got a bunch of examples to go through.

The first place I want to start is a really great library called joblib. joblib makes it easy to do essentially parallelized for loops: you've got a number of different items to go over and you want to process them in parallel. We import some packages here that let us check the number of cores on the machine — and I really do have 16 cores. The basic way joblib works is that I have some function I want to apply to a bunch of different inputs. Here are my inputs — 100 numbers — and, in parallel, using up to the number of cores on my machine, I'm going to apply that function to all the items in my input array. What I get back is as though I had written a for loop and run the function on each of the inputs: I get back the answers in sequence, so joblib takes care of ordering the results correctly, but it actually spread the work across all 16 cores. A more complicated example: here's a function that checks whether one line segment is contained by another, longer line segment. By the way, setting n_jobs equal to -1 — I'll make it a little bigger here — tells joblib to just use all the cores available, so you can write generalized code that will run on a machine with any number of cores. To show some of the benefit: if we run this algorithm using all the cores at our disposal, it takes about three seconds, and if we do it on just one core it takes a lot longer — we don't need to wait for that to finish... well, actually we might have to, so we'll come back to it in a second while I start explaining the next example.

Here's one of the pitfalls I described earlier about modifying global state: when your tasks run on the different cores, they're getting copies of the data they're using, so you don't want to be modifying your variables in the parallel compute step — you want to be returning things that get aggregated later. I'll make that more concrete. (By the way, the single-core run finished — about eight times slower.) Here I've got an array that starts as all zeros, and a function that modifies one position in the array, setting it to the value of its index. First we do a normal for loop — nothing parallel — and this is the array we get back after we've manipulated it, as you'd expect. If I try to do this in parallel — first reset the array to all zeros, then in parallel apply this update call to each of the indices — what we get back is counterintuitive: the actual array doesn't change at all. That's because when this code ran on each of the individual processes, the value of the array it was using was local to each process, and it didn't affect the original copy back on the master process that kicked all of these off.
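A minimal joblib sketch of the parallel-for-loop pattern and the global-state pitfall just described (not the exact notebook code; the work functions are made up):

```python
# joblib's Parallel/delayed: a parallelized for loop whose results come back
# in input order. n_jobs=-1 means "use all available cores".
import numpy as np
from joblib import Parallel, delayed

def square(x):
    return x * x

inputs = range(10)

# The basic pattern: apply a function to every input, in parallel.
squares = Parallel(n_jobs=-1)(delayed(square)(i) for i in inputs)

# The pitfall: mutating a shared variable inside the parallel step.
a = np.zeros(10)

def update_array(arr, i):
    arr[i] = i              # modifies a copy of the array in a worker process

Parallel(n_jobs=-1)(delayed(update_array)(a, i) for i in inputs)
print(a)                    # still all zeros -- only the workers' copies changed

# The fix: return the new value from the task and rebuild the result here.
def new_value(i):
    return float(i)

a = np.array(Parallel(n_jobs=-1)(delayed(new_value)(i) for i in inputs))
print(a)                    # [0. 1. 2. ... 9.]
```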
The better way to do this is to have your functions just return information from the parallel task: instead of actually modifying a variable, you return a new value. So we parallelize a call that calculates the new value instead of modifying the variable, and what that returns is the modified array — which is now just a normal variable I can assign and use however I want. So that's joblib; it's really powerful.

The next thing I want to show is a different way to do parallelism if you're specifically using IPython Notebook: it has its own mechanism for setting up a cluster. The way you do this is from the notebook home screen: you go to your Clusters tab, pick the profile you want to use, and spin up a number of engines. I know I have 16 cores, so I'm going to start a 16-engine cluster. Again we'll start the same way and check the number of CPUs we have. This is the syntax for using the IPython Notebook cluster: you create a Client, which is the thing you tell to do work, and the notebook cluster engines identify themselves with IDs — I have 16 of them. This magic command will run a piece of code on all the different nodes in your cluster, so I get 16 output statements, one from each of the nodes, and that code can take advantage of variables too. You can see they really are different processes — this is just printing which process ID each one is using — and if I apply that function to all the different nodes in my cluster, I indeed get back something different from each engine it runs on.

Whereas joblib has the Parallel function, IPython Notebook clusters have a map function, but it works the same way: there's a function you want to apply to a list of inputs, and you call map on the view you created. What you get back, interestingly, is an asynchronous result object, because this thing is going to run in the background and we don't know how long it could take. You can query how it's doing and whether it's done, and when you want the results you just call get, which returns what was computed. Again, this spreads the work out to the different nodes, aggregates the results in order, and gives them back to you. It even works for functions that take multiple arguments — multiplying two numbers, say: I map the multiply function over a first list of inputs and a second list of inputs, get the results, and we get what we expect. There's a lot of depth and a lot more flexibility behind all this, but I just want to give you a flavor of the tools available: just by using one of these libraries and easily spinning up one of these big machines, you can dramatically speed up your work.
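A rough sketch of that cluster workflow using the ipyparallel package (the modern home of IPython.parallel); it assumes engines have already been started, for example from the notebook's Clusters tab or with `ipcluster start -n 16` on the command line:

```python
# Connect to running IPython engines and map a function across them.
import ipyparallel as ipp

rc = ipp.Client()          # connect to the running engines
view = rc[:]               # a "direct view" across all of them

def multiply(a, b):
    return a * b

# map_async sends the work to the engines and returns an asynchronous result;
# you can poll it, and .get() blocks until everything is back, in order.
async_result = view.map_async(multiply, range(10), range(10, 20))
print(async_result.get())
```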
Now I'm going to show some machine learning examples specifically. I've taken this from Olivier Grisel's talk at PyCon a couple of years ago — this is one of his notebooks — and I'm just going to highlight a couple of things in it. This is all using the scikit-learn package in Python, which is a really fantastic package for doing a lot of different kinds of machine learning, and I'm just going to precalculate some things here. scikit-learn has built-in functions for a lot of different classifiers — random forests, k-means — and for hyperparameter search like grid search, and for cross-validation, and all of these are built from the ground up to take advantage of multiple cores to parallelize those tasks. I'll show you how to do that.

This example notebook walks through handwritten digit classification, which is one of the canonical machine learning examples. The first part of it doesn't have anything to do with parallelism — it's just getting a feel for the dataset. It loads it in, and I just want you to see what we're dealing with: the training dataset has some images of handwritten digits, and the goal is to classify which digit each image shows.

Where things get interesting — I'll just show a couple of examples — the first is cross-validation. I mentioned earlier that this is inherently parallelizable, and the way it works in scikit-learn is that there's a function, cross_val_score, that takes whatever classifier you're trying to use and does the cross-validation across the number of folds you've set up. It has a parameter called n_jobs, which defaults to 1. If we run it here it takes about a second and a half, but all I have to do is flip n_jobs to -1 and it takes advantage of all the cores on my machine, so all of a sudden that drops by a lot — that was five times faster or so. If I were doing more tasks, or my tasks took longer, I'd see a bigger speedup; I'm only doing a few things here. And it works exactly as expected: I didn't have to change the syntax or how I structured my code, I just take advantage of the additional cores.

The next thing I want to show is grid search — same idea. This is the pattern in scikit-learn: a number of these routines just take n_jobs. If I fit this with the default n_jobs=1, it's seven seconds, and if I take advantage of all my cores it again becomes a whole lot faster, with no changes at all to my code. The k-means classifier in scikit-learn also has this n_jobs parameter, as does the random forest classifier.

Just to show you a real example with the random forest classifier — this is accessible to you on Domino, and we did a blog post about it — if you're familiar with the Kaggle plankton competition, the idea is to take images of plankton and classify what type they are. We took some starter code that Kaggle released that was using a random forest classifier, and the code initially had n_jobs set to whatever the author's machine was — 3 or something. So all we had to do was change it to -1: here's the original, non-parallelized code, limited to three jobs, and then we just changed it to -1. What you can see in the history of the work we did is how much faster this ran on a 32-core machine — about six and a half minutes, versus about fifty minutes running the same code on a one-core machine. So you change one number and get an almost 10x speedup.
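A small sketch of the n_jobs pattern on the same digits dataset; the module paths below are for current scikit-learn (in versions from the era of this talk they lived under sklearn.cross_validation and sklearn.grid_search instead):

```python
# Cross-validation and grid search in scikit-learn, parallelized by flipping
# n_jobs to -1. The parameter grid here is illustrative, not from the demo.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Cross-validation: n_jobs=-1 runs the folds on all available cores.
scores = cross_val_score(RandomForestClassifier(n_estimators=100),
                         X, y, cv=5, n_jobs=-1)

# Grid search: every (C, gamma) combination is fit independently, so the
# whole search parallelizes the same way.
grid = GridSearchCV(SVC(),
                    param_grid={"C": [1, 10, 100], "gamma": [0.001, 0.0001]},
                    cv=5, n_jobs=-1)
grid.fit(X, y)

print(scores.mean(), grid.best_params_)
```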
OK, so that's kind of a summary of Python. Before I transition to R, I just want to show a different paradigm. In addition to parallelizing code at the level of your algorithms, you can also parallelize what I like to think of as experiments — maybe trying entirely different techniques — and this is one of the ways Domino really helps, because it makes it easy to get access not just to one big machine but to any number of big machines. I'll show you how this works. In theory, maybe you're going to try a random forest, a neural net, and a support vector classifier that itself is going to be grid searched: each of these techniques could benefit from parallelism on its own, and you can easily use several different machines for them.

Just to give you a sense of how that works, I'll switch back to this project I've got here — I've still got my notebook running, by the way, on my sixteen-core machine. This is my project folder, where I've got some code like I mentioned earlier: some scikit-learn code that does some classification on a Reuters dataset, generates some charts, and takes a parameter. The details don't matter; I'm just going to show you how you run this. We have a command-line tool with a command called run, and it lets you run whatever Python or R code you want. I mentioned my particular script expects a parameter, so I pass the parameter value, and what the client does is package up all my code — if I made any changes on my machine it ships those over — spin up a new machine of whatever hardware type you specify (another 16-core, 30-gig machine in this case), ship all my code over there, and start executing it. Oops — I've got the wrong project, sorry; this is the actual project I was in, so I'll do that again, and here we'll try a different parameter. If you don't like using the command-line tool, that's fine — you can also use the website to start new parallel executions, so we'll try yet another parameter. So now I've got three tasks going in parallel, each trying a different parameter, each using its own 16-core machine — whoops, I forgot to change the hardware on that one, but you get the idea: I can pick what hardware I want and use many different machines at once, without having to manage those machines or transfer my data. Now I've got four things going at once: I can try different parameters, I can try different techniques, and Domino handles all the infrastructure and plumbing of doing that.

As these things finish, the results are kept isolated and trackable, so I can see the results for this particular run — these are the charts we generated — and compare them to what I generated for the run with the parameter of 800. They don't overwrite each other; each run is kept independent, trackable, and shareable if I want it to be. So that's an example of what we talked about: parallelizing Python at the level of algorithms, but also really easily getting lots of machines at your fingertips — you can get ten 32-core machines and all of a sudden you've got 320 cores, without having to set anything up at all.

So now I'm going to transition to R, and I'm going to cover the same kind of structure: general parallel programming techniques in R, and then some more specialized R packages.
I'm going to flip back to my Domino session and close down this Python notebook session. In addition to Python notebooks, Domino can also spin up R notebook sessions — it's the same UI as IPython Notebook, if you're familiar with that, but instead of running Python code it runs R code. Same deal: I get whatever hardware I've picked, 16 cores in this case. So we spin up this R notebook session, and once it's ready I can pop it open; I've still got all my files here — same basic idea — but I'm going to open up an R notebook.

All right, the first thing we're going to talk about is the basic parallel package. It has a function for figuring out how many cores we have on our machine, and as expected we still have sixteen cores. The simplest version of this — a simple map use case — is that we've got a function we want to apply, this one generating some random numbers, and let's say we want to do that 20 times. mclapply is the most basic, primitive form of this: it's going to apply my test function to all my inputs, and I can say how many cores I want to use. And then, just like what we saw with joblib, mclapply aggregates the results back and puts them in the right order, so here are all 20 of my random datasets. Just to give you a sense of performance: if I do a non-parallel version of that same thing, it takes a lot longer — the elapsed time was much better in the parallel case. One note for Windows users: this won't work out of the box on Windows, because it uses an underlying fork operating system call that Windows doesn't have, so on Windows you'll need to do something like constructing a cluster and then using parLapply.

I mentioned a couple of pitfalls earlier; here's one I want to show you about overhead. Sending a task out to another process to be worked on takes time — there's work involved in that — so you don't want to parallelize things that would be very fast to compute serially, especially in R for functions that have vectorized versions. For instance, if I take the square root of the first 10,000 numbers, here's what happens if I do it in parallel and here's what happens if I do it serially: the serial version is about an order of magnitude faster, because shipping a number off to another process to compute its square root is often slower than the work that would have been required to just compute the square root in place.
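A small sketch of the mclapply pattern just described, with the Windows-friendly socket-cluster alternative (the work function is a made-up stand-in):

```r
# mclapply: the parallel package's basic "map" primitive. It relies on
# fork(), so on Windows use makeCluster()/parLapply() instead.
library(parallel)

n.cores <- detectCores()

test.function <- function(i) {
  # Stand-in for a CPU-heavy task: summarise a chunk of random numbers.
  summary(rnorm(1e6))
}

# Serial version.
serial.results <- lapply(1:20, test.function)

# Parallel version: same call shape, results come back in order as a list.
parallel.results <- mclapply(1:20, test.function, mc.cores = n.cores)

# Windows-friendly alternative using a socket cluster.
cl <- makeCluster(n.cores)
cluster.results <- parLapply(cl, 1:20, test.function)
stopCluster(cl)
```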
mclapply is nice, but it's a little constrained: you have to structure your code so there's a single function you can apply. Sometimes you just want to work in a loop and put whatever code you want inside it, and there's a great package called foreach that lets you do that. The way foreach works is that it gives you a construct — let me make my screen a little bigger — that lets you write parallelized for loops, and it takes care of spreading the work inside the loop body across the different cores in your machine. Under the hood there are different "backends" you can register that tell R how to actually spread that work out. I'm going to use a backend called doMC — it stands for "do multicore" — and I'm going to register how many cores I have with it. This won't work on Windows; you'll want to use doParallel on Windows. You can also use multiple machines, and for that you can use snow, but I'm not going to get into that today — this is just 16 cores on one machine.

The way this works in this example is that I'm going to build some linear models from the iris dataset, taking a bunch of samples — a thousand trials. That code runs, and — let me insert a cell below here — it works the same way as before: the result I get back from executing this foreach is just the aggregated result set from each of the individual calculations. So there they are, and the time for this was — sorry, no — about 5 seconds. The foreach package also lets you do things not in parallel, if you just like the syntax of a foreach loop: %do% instead of %dopar% will run it serially. That will take a while to run again — quite a bit slower; I think we're already past five seconds, so we don't need to wait for it to finish.

Here's that data-copying gotcha again, the same thing I showed in Python. If we have a variable — an array of just zeros — and we update it in a normal for loop, we get back (I've got to wait for it to finish... all right, yeah, it was a lot slower) the manipulated array: I just set each value to be two times its index. But if we try to do that inside the foreach loop, the x we're modifying is again a copy of the original x, and what we get back at the end won't have been modified at all. The better way, as I showed in Python, is to have what's inside the loop return what you actually want; that gives me back — I have to format it a little differently — the actual values I care about.

That's a good segue to one of the options to foreach and %dopar%: how you combine the results. You may have noticed that mclapply defaults to just returning things as a list, which isn't very pretty, and foreach will by default do the same thing. If I want the results formatted differently, or combined in a different way, I can specify a combine function when I call foreach — here I'll just concatenate them, and then I get a much nicer array to work with.

A couple of other, more advanced options to foreach: I can specify whether I want to preschedule the tasks. Prescheduling says: OK, I know you're asking me to use sixteen processes, and I know you're asking me to do a thousand different things in this case; so before actually running the code, it divides the inputs up into equal-sized sets to send out to the different processes on the different cores. That can save time because it does it all in one shot — it doesn't have to transfer things one at a time to the different processes. So if I preschedule something here — the same code, a thousand inputs, taking the square root of each one — prescheduling, you'll see, is much faster, over ten times faster than not prescheduling. That's because the overhead I mentioned earlier of sending each one of these numbers over to another process is relatively high, and if I can just send a whole batch of them at once, that's cheaper.
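Before the caveat about prescheduling, here is a rough sketch of the basic foreach / doMC pattern with a .combine function; the bootstrap linear models on iris are a stand-in, not the exact demo code:

```r
# A parallelized for loop with foreach and the doMC backend. On Windows,
# register doParallel instead of doMC, as mentioned above.
library(foreach)
library(doMC)

registerDoMC(cores = parallel::detectCores())

# Each iteration fits a linear model on a bootstrap sample of iris;
# .combine = rbind stacks the coefficients into a matrix instead of a list.
coefs <- foreach(i = 1:1000, .combine = rbind) %dopar% {
  fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length,
            data = iris[sample(nrow(iris), replace = TRUE), ])
  coef(fit)
}

# Swap %dopar% for %do% to run the same loop serially.
# doMC also accepts a preschedule flag, e.g.
# foreach(..., .options.multicore = list(preschedule = TRUE)).
head(coefs)
```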
But you have to be careful, because there are instances where prescheduling can be slower — in particular if there's a lot of variability in how long your tasks take and you don't have that many of them. What can happen is that the prescheduled groups get defined unluckily: you can end up with one batch that takes way more time than all the other batches, and your overall call just waits for that one bad batch to finish — all the others will have finished, but you won't get your results back because the one bad batch is still running. Here I'm sleeping for a fraction of a second per task, but increasing how long I sleep each time. If we don't preschedule, then each time one of my processes finishes a task and frees up, it grabs the next one available — it's on demand, spreading things around as necessary. But if I preschedule, I can get a bad grouping where, say, all the slowest tasks — the ones waiting five seconds or so — get grouped together, and so here, where I preschedule, it actually takes longer.

To get into some machine learning examples: here's one looking at wine quality. This is another dataset available from one of the canonical machine learning repositories; it has chemical properties of wine, like acidity, residual sugar, and pH, and then the quality of the wine as measured subjectively. The idea is: can we predict the quality of the wine? The randomForest package, which you've probably seen, is really nice — train a random forest using our chemical properties and our quality — and I've set the number of trees to 500 here. I can really easily parallelize this with the foreach package, so I'll just talk through what this is doing. First, the number of trees trained in each partition is the total number of trees divided by the number of cores I have, so I'm actually calling randomForest 16 times in parallel, each call getting a fraction of the trees I want to train. That's what's happening here: I'm using the foreach function, and the number of iterations comes from repeating that per-core tree count once per core, so this happens 16 times, with the tree count taking on that value each time. The real trick is that the combine function I'm using in my foreach call is the combine function from the randomForest package — it combines different forests into one overall forest. So basically we just trained 16 little forests and merged them all together, and if I run this you'll see it comes in almost nine times faster than the non-parallel version, and we didn't have to change the code much at all.
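A sketch of that parallel random forest idea: grow a few trees per core with foreach and merge them with randomForest's combine function. The wine data frame and quality vector here are synthetic placeholders for the real dataset:

```r
# Train ntree/cores trees on each core, then merge the forests with
# randomForest::combine via foreach's .combine argument.
library(foreach)
library(doMC)
library(randomForest)

registerDoMC(cores = parallel::detectCores())

n.cores <- parallel::detectCores()
n.trees <- 500

# Placeholder data so the sketch is self-contained; substitute your own.
wine    <- data.frame(matrix(rnorm(1000 * 11), ncol = 11))
quality <- factor(sample(3:8, 1000, replace = TRUE))

rf <- foreach(ntree = rep(ceiling(n.trees / n.cores), n.cores),
              .combine  = randomForest::combine,
              .packages = "randomForest") %dopar% {
  randomForest(x = wine, y = quality, ntree = ntree)
}

print(rf)
```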
Another great library I want to show is called caret. It has very general functions for training different classifiers, doing cross-validation, and doing grid search, all wrapped together. The dataset here is the Sonar dataset, which has some information about sonar pings that come back, and the idea is to predict what type of object the ping bounced off — metal or animal or whatever. Here's an example where we do a few different things: we do cross-validation with eight different folds, and we also do a grid search over the random forest hyperparameters — we try two, three, and four features considered at each node in the tree. The train function is the caret function, and we just specify that we're going to use 750 trees, these grid search parameters, and this training-control parameter set for the cross-validation, and caret takes care of parallelizing all of it: it parallelizes the grid search, it parallelizes the cross-validation, and so this comes back really fast. Notice that all of this is automatic — I didn't have to specify a number of cores, I didn't have to use foreach. caret takes advantage of the same backend we registered when we used the foreach package: when I registered doMC with a number of cores, caret knows to look at that. To show you how long the non-parallel version takes, I have to explicitly register the number of cores back down to one and run the same call — this will take a long time, and I don't want to wait for it to finish, but it'll probably take about a minute.

The final thing I want to mention, though I'm not going to demo it, is plyr, another really great package for applying functions to lists of things, or to data frames rather. ddply is a very common function that gets used all over the place, and it takes a parameter called .parallel. By default it's FALSE; if you set it to TRUE, it will again use the cores you've registered for your foreach backend — so when you've called registerDoMC or registerDoParallel, it takes advantage of that and applies your function to the different groups of your data frame in parallel. OK — so that single-core caret run finally came back, and it took about 50 seconds.
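A sketch of the caret and plyr calls just described — 8-fold cross-validation plus a small grid over mtry, with the parallelism picked up automatically from the registered doMC backend. The Sonar data comes from the mlbench package; the specific numbers (750 trees, mtry of 2 to 4) mirror what was described above:

```r
# caret's train() wraps cross-validation and grid search together, and uses
# whatever foreach backend is registered without any extra arguments.
library(caret)
library(mlbench)
library(doMC)

registerDoMC(cores = parallel::detectCores())
data(Sonar)

ctrl <- trainControl(method = "cv", number = 8)

fit <- train(Class ~ ., data = Sonar,
             method   = "rf",
             ntree    = 750,
             tuneGrid = data.frame(mtry = c(2, 3, 4)),
             trControl = ctrl)
print(fit)

# To see the serial timing, register a single core and rerun:
#   registerDoMC(cores = 1)

# plyr's ddply can reuse the same registered backend via .parallel = TRUE.
library(plyr)
means <- ddply(iris, .(Species), summarise,
               mean.length = mean(Sepal.Length), .parallel = TRUE)
```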
And then the final thing I wanted to show is that, once again, you can take advantage of this approach to experimentation with your R code as well. If we go back to the project I was showing you earlier, we can run some R code and try different experiments — and by the way, Domino has a nice package, the domino library, that lets you start these experiments from your R console or RStudio. What we can do is domino run, and I'm going to run the script I've been working with here, which creates the visual average of some images from the machine learning dataset — I've got it up here — and again it takes a parameter at the command line that I just read in. We can kick this off with one parameter value, and you'll see in the web interface in a second that it notices I've got some file changes, sends those off to the server, ships my R code and all the dependent files and datasets to the remote machine, spins up a new sixteen-core machine, and starts running it over there. If I want to try some other R code — the same script, maybe with a different parameter value — I start that in parallel too. So now I've got two different sixteen-core machines each running some R code, and I can track progress as it goes, cranking through the different digits; this one's a step farther behind. And then, once again, the results get tracked separately: here are the results from when I ran this with a parameter value of five, and here are the results when I ran it with a parameter value of eight — they should have been different sizes, but that's OK.

So that's the basic idea: it's really easy to get big machines, and there are lots of great libraries in Python and R for taking advantage of multiple cores on those machines. And in addition to making it easy to get big machines, Domino makes it really easy to spread different experiments — entirely different scripts, or different parameters — across many different machines, so you can spin up lots of machines at once, each with lots of cores, and really go a lot faster when you're trying to build out your models, train large machine learning models, or just crank through lots of datasets or lots of data.

We've got a few minutes, so I'm going to see if any questions have come in. One question we had was about how the R in IPython Notebook thing works. It's something we built, and we've done a blog post on it if you want to check it out on our blog. There is an underlying library being used under the hood — we're using an IPython kernel that knows to evaluate the contents of the cells as R code instead of Python code; I think it takes advantage of rpy2, or something like that is the name of the library — but you don't have to worry about that; it's all handled by what we've built.

OK, well, thank you guys very much. I hope you'll try some of these techniques and put them into practice — check out our website and our blog. We'll send out a follow-up email with the recording of this and also an opportunity to ask questions offline if you have anything like that. I really appreciate you taking the time to join us today, and I hope to talk to you soon. Thanks.
Info
Channel: Domino Data Lab
Views: 12,222
Rating: 4.9298244 out of 5
Keywords: data science, R (Programming Language), Python (Programming Language), Programming Language (Software Genre), Machine Learning (Software Genre), predictive analytics, Analytics (Industry), Data Mining (Software Genre), data scientists, Parallel Programming, Analysis Of Algorithms (Field Of Study)
Id: FIS_LsOzxYo
Length: 50min 53sec (3053 seconds)
Published: Wed Jan 28 2015