Getting In Shape For The Sport Of Data Science

Captions
OK, so we're here at the Melbourne R meetup, and we're talking about some of the techniques Jeremy Howard has used to do as well as he has in a variety of Kaggle competitions. We're going to start by looking at some of the tools I've found useful in predictive modeling in general, and in Kaggle competitions in particular, so I've tried to write down here what I think are some of the key steps.

After you download data from a Kaggle competition you generally end up with CSV files, and they can come in all kinds of formats. Here's the first thing you see when you open up the time series competition's CSV file; it's not very helpful, is it? Each of these columns is actually quarterly time series data, and for various reasons each one is a different length and they start and end in different places. The way this was provided didn't really suit the tools I was using, and in fact, if I remember correctly, I've already adjusted it slightly: it originally came in rows rather than columns. This is where a data manipulation toolbox comes in. There are all kinds of ways to swap rows and columns around; a really simple approach is to select the whole lot, copy it, and paste it with "transpose" in Excel. That's one way to do it, and having done that, I had something I could open up.

Let's have a look at the original file. This is the original file in Vim, which is just my text editor of choice. This is a really good time to get rid of all of those leading comment lines, because they only confuse things. This is where Notepad++, Emacs, or any of these power-user text editors will work fine, as long as you know how to use regular expressions; if you don't, I'm not going to teach you now, but you should definitely look them up. In this case I just use a regular expression substitute over the file, and I can now save it and I've got a nice, easy format.

That's why I've listed data manipulation tools in my toolbox, and to me Vim, or some regular-expression-powered text editor that can handle large files, is something to be familiar with. Just in case you didn't catch that: regular expressions, sometimes called regex, are probably the most powerful tool for text and data manipulation that I know of. The most powerful dialect of regular expressions is the one from Perl, and it has been widely adopted elsewhere: any C program that uses the PCRE engine has the same regular expressions as Perl, and C# and .NET have more or less the same regular expressions as well. This is a nice example of one group of people getting it right and everybody else copying them. Emacs's regular expressions are slightly different, unfortunately, which annoys me, but they still do the job. So make sure you've got a good text editor that you know well. Something with a good macro facility is nice too, and Vim is great for that: you can record a series of keystrokes, hit a button, and it repeats them on every line.
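If you'd rather script that cleanup than do it in an editor, the same regular-expression idea works in R. This is just a sketch; the file names and the "#" comment marker are placeholders, not details from the talk.

    # Drop leading comment lines (assumed to start with "#") and write the
    # file back out.
    lines <- readLines("raw_download.csv")
    writeLines(lines[!grepl("^#", lines)], "clean.csv")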
I also put Perl in here, because to me Perl is a rather unloved programming language, but if you think back to where it comes from, it was originally developed as the Swiss Army chainsaw of text processing, and that is something it still does better, I think, than any other tool. It has amazing command-line options that do things like "run the following command on every line of the file", or "run it on every line and print the result", and there's an option to back up each file before changing it. I find that with Perl I can do things that would take me much, much longer in any other tool, even simple little things. On the weekend I had to concatenate a whole bunch of CSV files, each of which had a header line, and I only wanted to keep the first one, so I had to delete the header from every file. In Perl — and it's probably still sitting here in my shell history — that's basically: -n means run this on every single row; -e means I'm not even going to write a script file, I'll give you the thing to run right here on the command line; and then a piece of rather difficult-to-comprehend Perl which, trust me, says "if the line number is greater than one, print the line". So there's something that strips the first line from every file. This kind of command-line work is great, and I see a lot of people on the forums complaining that the format of the data wasn't quite what they expected, or isn't convenient, "could you please change it for me" — and I always think: well, this is part of data science. This is data hacking; this is data munging, or data manipulation. There's actually a really great book — I don't know if it's hard to find nowadays, but I loved it — called Data Munging with Perl, and it's a whole book about the cool stuff you can do with Perl one-liners.

OK, so I've now got the data into a form where I can load it into some tool and start looking at it. What's the tool I normally start with? I normally start with Excel. Now, your first reaction might be "Excel — not so good for big files", to which my reaction would be: if you're just looking at the data for the first time, why are you looking at a big file? Start by sampling it. Again, this is the kind of thing your data manipulation tools are for: that Perl pattern I just showed you, "if rand() is greater than 0.9 then print", samples roughly every tenth row. So if you've got a huge data file, get it down to a size you can actually play with, which normally means some random sampling. Then I like to look at it in Excel, and I'll show you, for a particular competition, how I go about doing that.
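Here is the same sampling idea sketched in R, for anyone who'd rather not reach for Perl: keep roughly one row in ten of a big CSV so it opens comfortably in Excel. The file names are placeholders.

    con    <- file("big_file.csv", "r")
    header <- readLines(con, n = 1)   # keep the header line
    lines  <- readLines(con)          # the rest of the file
    close(con)

    set.seed(42)
    keep <- runif(length(lines)) > 0.9          # ~10% of rows, like rand() > 0.9
    writeLines(c(header, lines[keep]), "sample.csv")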
Let's have a look, for example, at the New South Wales RTA competition, which was to predict how long it will take to travel along each segment of the M4 motorway in each direction. The data is just a lot of columns, where every column is a route, and a lot of rows, where every row is another two-minute observation, and it's very hard to get a feel for what's going on. There were various terrific attempts on the forum to create animated pictures of what the road looks like over time. I did something extremely low-tech, which I'm quite proud of: I created a simple little macro in Excel which selected each column and applied conditional formatting with a red-to-green color scale, ran it on every column, and got this picture. So here's each route on the road, here's how long it took to travel that route at each time, and isn't this interesting, because I can immediately see what traffic jams look like. See how they flow: as a traffic jam starts here, it flows along the road as time goes on, and you can start to see at what kinds of times jams happen and where they tend to start. Here's a really big one, and it's interesting: going into Sydney in the afternoon you obviously start getting these jams up here, and as the afternoon progresses you can see the jam moving, so that at 5:00 p.m. there are actually a couple of them, and at the other end of the road it just stays jammed. You get a real feel for it, and even outside peak hour, and even in the quieter sections, you can see something interesting: there are parts of the freeway with basically constant travel times. The colors show you all of this immediately — you can see how easy it is.

When we actually got on the phone with the RTA to take them through the winning model — the people who won were kind enough to organize a screencast with the RTA and Kaggle to walk through it — the winners explained that they had basically created models that, for a particular time on a particular route, looked at the times on the routes just before and after it, on both sides. I remember one of the RTA people said, "That's weird, because these traffic jams queue in one direction, so why would you look at both sides?" And I was able to say: that's true, but have a look at this — go to the other end, and you can see how sometimes, although queues form in one direction, they slide away in the other direction. So by looking at this kind of picture you can see what your model is going to have to capture, what inputs it will need, and how it will have to be set up, and you can immediately see that if you build a model that predicts each cell from the previous few periods of the routes around it, whatever modeling technique you use, you'll probably get a pretty good answer. Interestingly, that is basically all the winners did: a really nice simple model — they used random forests, as it happens, which we'll talk about soon — plus a couple of extra features such as the rate of change. It's a really good example of how visualization can quite quickly tell you what you need to do.

I'll show you another example. This is a recent competition set up by the dataists.com blog: they wanted to create a recommendation system for R packages. They got a bunch of users to report, package by package, whether they had it installed or not, and they added a bunch of additional potential predictors for you: how many dependencies the package has, how many suggests and imports it has, how many CRAN task views include it, whether it's a core or recommended package, who maintains it, and so forth. I found it not particularly easy to get my head around what this data looks like, so I used my number-one favorite tool for data visualization and initial analysis, which is a pivot table.
A pivot table is a thing which dynamically lets you slice and dice your data. If you've used something like Tableau you'll know the feel; this is kind of like Tableau, except it doesn't cost $2,000. Tableau has cool stuff as well, but pivot tables are fantastic for most things I find I need to do. In this case I simply dragged user ID up to the top and package name down the side, and quickly turned the data into a matrix, and you can see here what it looks like. Those nasty people at Kaggle have deleted a bunch of cells in this matrix; that's the stuff they want you to predict. Then you can see that, generally, as you'd expect, there are ones and zeros. There's something weird going on where some people apparently have things installed twice, which suggests to me there's something funny with the data collection, and there are other interesting things: some packages are quite widely installed, most people don't install most packages, and there is this mysterious user number five, who is the world's biggest R package fan — he or she installs everything they can. I can only imagine that this one package is particularly hard to install, because not even user number five got around to it. So you can see how creating a simple picture like this gives me a sense of what's going on.

So I took the data in the R package competition and I thought: say this empty cell is the one we're trying to predict. If I just knew, in general, how commonly the AcceptanceSampling package was installed, and how often user number one installed things, I'd probably have a good sense of the probability of user number one installing AcceptanceSampling. And one of the interesting realizations was: actually, I don't think I care about much else. So I jumped into R, and what I did was basically: read that CSV file in — there's a whole bunch of rows here because this is my entire solution, but these are the rows I used for submission number one. Read the whole lot in; although user is a number, treat it as a factor, because user number 50 isn't fifty times user number one; turn the trues and falses into ones and zeros to make life a bit easier; and then apply the mean function to each user across their installations, and the mean function to each package across the users who installed it. So now I've got a couple of lookup tables: user number 50 installs this percentage of packages; this particular package is installed by this percentage of users. Then I just stuck them back into my file of predictors: for each row I did these two simple lookups, the mean for that user and the mean for that package, and that was basically it. At that point I created a GLM in which, obviously, the ones and zeros of installations were the thing I was predicting, and in my first version those two probabilities were my predictors. In fact, the first version was even easier than that: all I did was take the max of the two. pmax, if you're not familiar with R, just takes the max on each row individually — in R nearly everything works on vectors by default except max, which is why you need pmax; well worth knowing. So I took the max — this user installs 30% of things, this package is installed by 40% of users, so the max of the two is 40% — and I created a GLM with just that one predictor.
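Here is a rough reconstruction of that approach in R. It assumes a training file with columns User, Package and Installed (TRUE/FALSE); the file name and column names are placeholders, not taken from the talk.

    d <- read.csv("training_data.csv")
    d$User      <- factor(d$User)              # user 50 isn't "50 times" user 1
    d$Package   <- factor(d$Package)
    d$Installed <- as.numeric(d$Installed)     # TRUE/FALSE -> 1/0

    # How often does each user install things, and how widely is each
    # package installed?
    user_rate    <- tapply(d$Installed, d$User,    mean)
    package_rate <- tapply(d$Installed, d$Package, mean)

    # Look the two rates up for every row, then use the larger of the two
    # as a single predictor (pmax works row-wise, unlike max).
    d$p_user    <- user_rate[as.character(d$User)]
    d$p_package <- package_rate[as.character(d$Package)]
    d$p_best    <- pmax(d$p_user, d$p_package)

    # One-predictor logistic regression.
    fit <- glm(Installed ~ p_best, data = d, family = binomial)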
The benchmark created by the dataists people for this used a GLM on all of those predictors, including all kinds of analysis of the manual pages and maintainer names and goodness knows what, and it had an AUC of 0.8. This five-lines-of-code thing had an AUC of 0.95. So the message here is: don't over-complicate things; if people give you data, don't assume you need to use it; and look at pictures.

If we look at my progress: here's my first attempt, which was to multiply the user probability by the package probability — one of the nice things on Kaggle is that you get a history of your results, so there's my AUC for that. Then I changed to the maximum of the two, and there's my 0.95 AUC. "Oh, that was good. Imagine how powerful this will be when I use all that data they gave us, with a fancy random forest" — and it went backwards. So you can really see that a bit of focused, simple analysis can often take you a lot further. On the next page you can see I kept thinking "random forests — surely more random forests", and that went backwards too, and then I started adding in a few extra things. Then I realized there was one piece of data that really was useful, which was the dependency graph: if somebody has installed package A, and package A depends on package B, then I also know they've got package B. So I added that piece. That's the kind of thing I find a bit awkward to do in R, because I find R a slightly clunky programming language for this kind of work, so I did that piece in a language I quite like, which is C#, and imported the result back into R. As you can see, each time I send something off to Kaggle I generally copy and paste into my notes just the line of code that I ran, so I can see exactly what it was. So here I added the dependency graph and pushed the score up a bit further, and that's basically as far as I got in this competition, which was enough for sixth place.

I made a really stupid mistake, though. Yes, if somebody has package A and it depends on package B, then they've got package B; what I didn't do was the reverse — if somebody doesn't have package B, and package A depends on it, then they definitely don't have package A. I forgot that piece, and when I went back and added it after the competition was over, I realized I would have come second or third if I'd done it. So, in fact, to get into the top three in this competition, that's probably about as much modeling as you needed. I think you can do well in these competitions without necessarily being an R expert or a statistics expert, but you do need to be able to dig into the toolbox appropriately.
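Here is a sketch of that dependency-graph trick in R (the talk's version was written in C#). It assumes a hypothetical data frame deps with columns Package and DependsOn, plus the d data frame from the earlier sketch; the reverse rule mentioned above — if B is known to be absent, A must be absent too — works the same way in the other direction.

    # Every (user, A) pair with A installed implies (user, B) for each B
    # that A depends on.
    has <- unique(d[d$Installed == 1, c("User", "Package")])

    implied <- merge(has, deps, by = "Package")        # (User, A, B) triples
    implied <- unique(data.frame(User    = implied$User,
                                 Package = implied$DependsOn,
                                 Implied = 1))

    # `implied` can now be merged onto the rows being predicted, to force
    # those probabilities to 1.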
So let's go back to my extensive slide presentation. We've talked about data manipulation and interactive analysis, and a bit about visualization — and I include there even simple things like those pivot tables. The next thing in my toolbox is some kind of general-purpose programming tool, and to me there are three or four clear leaders in this space. Half the people I speak to in the data science world don't really know how to program. You definitely should, because otherwise all you can do is use what other people have prepared for you, and I would be picking from these tools. I like the highly misunderstood C#, and I would combine it with these particular libraries. [Audience question: is this complementary to R?] Yes, complementary, and I'll come to that in the very next bullet point. This general-purpose programming tool is for the stuff that R doesn't do that well — even Ross Ihaka, who wrote R, says he's not that fond nowadays of various things about R as an underlying language. There are other languages that are just so powerful, so rich and so beautiful; I really should have included some of the functional languages here too — Haskell would be another great choice. If you've got a good, powerful language, a good matrix library, and a good machine learning toolkit, you're doing great. Python is fantastic too, and it also has a really nice REPL — the REPL is where you type in a line of code, like in R, and immediately see the result, and you can keep working through like that; IPython is a really fantastic REPL for Python. The other really nice thing in Python is matplotlib, which gives you a really nice charting library. Much less elegant, but just as effective for C#, and just as free, are the MS Chart controls; I've written a functional layer on top of those to make them easier to use for analysis. If you use C++, that also works great; there's a really interesting, very under-utilized library called Eigen, which originally came from the KDE project, and it provides an amazingly powerful vector and scientific programming layer on top of C++. Java, to me, is something that used to be on a par with C# back in the 1.0 and 1.1 days; it's looking a bit sad nowadays, but on the other hand it has just about the most powerful general-purpose machine learning library on top of it, which is Weka, so there's a lot to be said for that combination. In the end, if you're a data scientist, the message is: learn to program. I don't think it matters too much which of these you pick, but without one you're going to be struggling to go beyond what the packaged tools provide.

[Question from the back.] OK, so the question was about visualization tools — are there good ones freely available? I'd have a look at something like GGobi. GGobi is a fascinating tool — Tableau, which isn't free, is in the same kind of area — and it supports this concept of brushing: you put up a whole bunch of plots — scatter plots, parallel coordinate plots, all kinds of plots — and you can highlight one region of one plot and it shows you where those points sit in all the other plots. So in terms of really powerful visualization tools, GGobi is where I would go. Having said that, it's amazing how little I use it in real life, because things like Excel, and what I'm about to come to, ggplot2, although much less fancy than GGobi and Tableau, support a hypothesis-driven problem-solving approach very well.

Something else I do is build visualizations that meet my particular needs. Take the time series problem we talked about: for that one I used a very simple ten-line piece of JavaScript to plot every single time series in a huge mess like this. You might think, if you're plotting hundreds and hundreds of time series, how much insight are you really getting? But I found it was amazing how much my brain picked up just from scrolling through hundreds of time series.
Then, when I started modeling, I turned these into something a bit better: I repeated the plots, but this time showed both the orange, which is the actuals, and the blue, which is my predictions, and I printed the metric for how well that particular time series was being fit. By building these slightly more focused visualizations I could immediately see which numbers were high — here's one, 0.1, a bit higher than the others — so I could immediately see where I'd gone wrong and get a feel for how my modeling was going. So I tend to think you don't necessarily need particularly sophisticated visualization tools; they just need to be very flexible, and you need to know how to drive them to give you what you need. Through this kind of visualization I was able to check every single chart in this competition: if one wasn't matching well, I could look at it and say, yes, it's not matching because there was a shock in some period which couldn't possibly have been predicted, so that's OK. This is one of the competitions I won, and I really think this visualization approach was key.

I mentioned I'd come back to ggplot2. ggplot2 was created by a particularly amazing New Zealander who seems to have more time than everybody else combined and creates all these fantastic tools: Hadley Wickham. I want to show you what I mean by a really powerful yet simple plotting tool. Here's something really fascinating. You know how creating scatter plots with lots and lots of data is hard, because the plot just turns into a black blob? Here's a really simple idea: give each point a level of transparency, so that where points sit on top of each other it's like transparent disks stacking up and getting darker and darker. In ggplot2 you can do exactly that: here's something that says plot the carats of diamonds against their price, and vary the alpha channel — graphics-speak for the level of transparency — setting the alpha for each point to 1/10, or 1/100, or 1/200, and you end up with plots which actually show you the density of each region. It's just so much better than any other approach to this problem, it's so simple, and it's one little line of code in ggplot2. This, by the way, is covered in a completely free chapter of the book: there's a fantastic book about ggplot2 by the author of the package, which you should definitely buy, but this, one of the most important chapters, is available free on his website, so check it out.
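A minimal sketch of the alpha-transparency idea, using the diamonds data that ships with ggplot2, plus the default loess smoother and confidence band that come up next. The sample size for the smoother is just an assumption to keep the fit quick.

    library(ggplot2)

    # Alpha transparency: overplotted points stack up like transparent disks,
    # so dense regions read as dark and sparse regions stay light.
    ggplot(diamonds, aes(carat, price)) +
      geom_point(alpha = 1/100)

    # A loess smoother with its confidence band, drawn on a random sample of
    # rows so the loess fit stays fast.
    set.seed(1)
    ggplot(diamonds[sample(nrow(diamonds), 2000), ], aes(carat, price)) +
      geom_point(alpha = 1/10) +
      geom_smooth(method = "loess")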
I'll show you another couple of examples where everything's done just right. Here's a simple case of plotting a loess smoother through a bunch of data — always handy — but every time you plot something like this you should see the confidence intervals, and ggplot2 does that by default. The kind of fit you normally want to see is a loess smoother, so if you ask for a fit it gives you the loess smoother by default, and the confidence interval by default. It makes it hard to create really bad graphs in ggplot2, although some people have managed. Things like box plots next to each other: such an easy way of seeing, in this case, how the colors of diamonds compare — they've all got roughly the same median, but some of them have really long tails in their prices. What a powerful plotting device. And it's so thorough that in this chapter of the book he shows the options side by side: here's what happens if you use a jitter approach, and here's what happens if you use that alpha-transparency approach, so you can really compare them. So ggplot2 is something you should work through; you can see what kind of things you can do with it, and it's a really important part of the toolbox. Here's another one I love. We do lots of scatter plots, and scatter plots are really powerful, and sometimes you want to see, when the points have some logical order, how they change over time. One way is to connect them up with a line — pretty bloody hard to read. But take the same plot and just add one simple thing — set the color to be related to the year of the date — and bang, now you can follow the color and see exactly how the points are ordered: one end here, one end there. So ggplot2 has done fantastic things to help us understand data more easily.

One other thing I will mention is caret. How many people here have used the caret package? I'm not going to walk through all of caret, but I will show you this. In R, you write model equals train — there's a command called train, and you can pass in a string choosing from, I think, about 300 different classification and regression models — and then you can add various options, like "center the data first, please" and "do a PCA on it first, please", and it just puts all of the pieces together. It can do things like automatically remove columns that hardly vary at all and are therefore useless for modeling. But most powerfully, it has a wrapper that lets you drive any of hundreds of R's most powerful algorithms — many of them really awkward to use directly — through this one command. And here's the cool bit: imagine we're doing an SVM. I don't know how many of you have tried; they're really hard to get a good result from, because they depend so much on the parameters. With train, it automatically does a grid search to find the best parameters. So you write one command and it does all of that for you. You definitely should be using caret.
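A minimal caret sketch along those lines — the data set and model choice are just for illustration (the "svmRadial" method needs the kernlab package installed), and caret::nearZeroVar() is the related helper for finding near-constant columns.

    library(caret)

    x <- iris[, 1:4]
    y <- iris$Species

    model <- train(x, y,
                   method     = "svmRadial",              # an SVM, as in the example
                   preProcess = c("center", "scale", "pca"),
                   trControl  = trainControl(method = "cv", number = 5),
                   tuneLength = 5)                         # size of the tuning grid

    model$bestTune   # the parameter combination the grid search settled on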
There's one more thing in the toolbox I want to mention: you need to use some kind of version control tool. How many people here have used a version control tool like Git or CVS? OK, so let me give you an example from our terrific designer at Kaggle. He's recently been changing some of the HTML on our site and checking it into the version control tool we use, and it's so nice, because I can go back to any file and see exactly what changed and when. Then I can go through and say: OK, I remember that thing broke at about this time — what changed? Oh, I think it was this file here; that line was deleted, this line was changed, this section of this line was changed. The version control tool is keeping track of everything. Can you see how powerful this is for modeling? You go back through your submission history on Kaggle and see: I used to be getting 0.97 AUC, now I'm getting 0.93, and I'm sure I'm doing everything the same. Go back into your version control tool and look at the history, the list of commits; go back to the date where Kaggle shows you that you had the really good result you can't now remember how you got, and you go, oh yes, it's this one here, and you can diff it and see exactly what changed. And you can do all kinds of other useful things, like merging back in changes from earlier versions, or undoing the changes you made between two dates, and so on. Most importantly, in a competition, when you win and Kaggle sends you an email saying "Fantastic — send us your winning model", and you think, "Oh no, I don't have that exact model any more" — no problem: you go back into your version control tool and ask for the code exactly as it was on the day you got that fantastic answer.

So that's my toolkit. There are quite a few other things I wanted to show you, but I don't have time, so I'm going to jump to this interesting one, which was about predicting which grant applications would be successful or unsuccessful at the University of Melbourne, based on data about the people involved in the grant and all kinds of metadata about the application. This one's interesting because I won it by a fair margin — something like 0.96 versus 0.97 AUC, which is a decent chunk of the available error — so it's worth thinking about what I did right this time and how I set it up. Basically, what I did was use random forests, so I'm going to tell you a bit about how random forests work. What's also interesting is that I didn't use R at all. That's not to say R couldn't have come up with a pretty good answer — the person who came second used R — but I think he used something like twelve gig of RAM on a multi-core machine, and mine ran on my laptop in a couple of seconds. So I'll show you an approach that is very efficient as well as very powerful.

I did this all in C#, and the reason I didn't use R for this is that the data was complex. Each grant had a whole bunch of people attached to it, and it was provided in a denormalized form. I don't know how many of you are familiar with normalization strategies, but denormalized basically means you had a whole bunch of information about the grant — the dates and so on — and then a whole block of columns about person one (did they have a PhD, and so on), and then a whole block of columns about person two, and so forth, for, I think, about thirteen people. It's very difficult to model: an extremely wide and extremely messy data set, and it's the kind of thing that general-purpose programming tools are pretty good at. So I pulled it into C#, into a grants data class, where I basically said: read through this file, and for each line, split it on the commas and add that grant to the collection.
For those of you who maybe aren't so familiar with general-purpose programming languages, you might be surprised to see how readable they are: I can say "for each line in lines, select the line split by comma", and if you haven't used anything like this before you might be surprised that something like C# looks so easy. The next bit just skips the first line, since it's a header — and in fact later on I discovered the first couple of years of data weren't very predictive of today, so I ended up skipping those as well. The other nice thing about these tools is that if I want to know what this Add method does, I can hit one button and jump to its definition; these IDE features are really helpful, and that's equally true of most Python, Java and C++ editors. The kind of thing I was able to do here was create all sorts of interesting derived variables. Here's one called max year of birth: it goes through all of the people on the application and finds the largest year of birth. Again, it's just a single line of code, and if you can get past the curly brackets, the actual logic is extremely easy to understand. Things like "does anyone have a PhD": if there are no people listed, then none of them do; otherwise, check whether any person has a PhD. So I created all these derived fields, and I used pivot tables to work out which ones seemed to be quite predictive, before putting them into the model.

So what did I do with all this? I wanted to build a random forest. Now, random forests in R are a very powerful, very general-purpose tool, but the R implementation has some pretty nasty limitations: if you have a categorical variable — in other words, a factor — it can't have more than 32 levels, and if you have a continuous variable — an integer or a double or whatever — it can't have any missing values. Collectively these limitations made it particularly difficult to use here, because things like the RFCD codes had hundreds and hundreds of levels, and the continuous variables were full of nulls; in fact, if I remember correctly, even the factors weren't allowed nulls, which I find a bit weird, because to me null is just another level — maybe it's male, or female, or null; it's still something I should be able to model on. So I built a system that made it easy to create a data set I could run on. For doubles that have nulls in them, I added two columns: one column which is "is that value null", one or zero, and another column which is the actual data — whatever it was, say 2.36 — with every null replaced by the median. So I now have two columns where I used to have one, and both of them go into the model. Why the median? It actually doesn't matter, because every place this column holds the median, there's a one over in the other column. Both go in as predictors, so if the places where that column was originally null actually mean something interesting, that gets picked up by the is-null version of the column. To me this is something tools should do automatically, because it's clearly the obvious way to deal with missing values in continuous variables.
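The talk's version of this was written in C#; here is the same trick sketched in R. The data frame and column name in the usage line are hypothetical.

    # For a numeric column with missing values, add an "is missing" indicator
    # and fill the original with the median.
    add_null_indicator <- function(df, col) {
      is_na <- is.na(df[[col]])
      df[[paste0(col, "_isnull")]] <- as.numeric(is_na)
      df[[col]][is_na] <- median(df[[col]], na.rm = TRUE)
      df
    }

    # Hypothetical usage on a grants-style data frame:
    # grants <- add_null_indicator(grants, "YearOfBirth")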
Then, as I said, for the categorical variables — the factors — if there's a null, I just treat it as another level. And finally, for the factors, I took all of the levels and applied a rule: if a level had more than, I think it was 100 observations, keep it; if it had more than 25 observations but fewer than 100, and it was quite predictive — in other words, that level's application success rate was different from the others — keep it too; otherwise merge everything else into one super-level called "the rest". That way I was able to create a data set I could feed into a random forest, although in this case I think I actually ended up using my own random forest implementation.
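A sketch of that level-merging rule in R. The threshold is as recalled in the talk, and the "quite predictive" test for mid-sized levels is left out here, so this is only the skeleton of the idea.

    merge_rare_levels <- function(f, min_keep = 25) {
      f      <- as.character(f)
      counts <- table(f)
      rare   <- names(counts)[counts < min_keep]
      f[f %in% rare] <- "the rest"       # one super-level for everything rare
      factor(f)
    }

    # Hypothetical usage:
    # grants$RFCD.Code <- merge_rare_levels(grants$RFCD.Code)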
So, should we have a quick talk about random forests and how they work? There are basically two main types of model. There are parametric models — models with parameters — where you say "this bit's linear, this bit's an interaction of those two, this bit's logarithmic": as best I can, I specify how I think the system looks, and all the modeling tool does is fill in the parameters — the slope of the linear bit, the coefficient on the logarithmic bit, how the pieces combine. Things like GLMs are very well-known parametric tools. Then there are nonparametric, or semi-parametric, models, where I don't do any of that: I just say, here's my data, I don't know how the variables relate to each other, build a model. Support vector machines, neural nets, random forests and decision trees have that kind of flexibility. Nonparametric models are not necessarily better than parametric models. Think back to the R package competition, where really all I wanted was a couple of weights saying how those two columns should combine; if all you really want is some parameters, a GLM is perfect. GLMs can certainly overfit, but there are ways of fitting GLMs that don't — you can use stepwise regression, or the much fancier modern version, glmnet, which is basically another way of fitting GLMs that doesn't overfit. But any time you don't really know what the model form is, that's where you'd use a nonparametric tool, and random forests are great because they're super fast and extremely flexible, and they don't really have any parameters that need tuning, so they're pretty hard to get wrong.

So let me show you how they work. A random forest is simply — in fact we shouldn't even use the term "random forest", because Random Forest is a trademark for one particular ensemble of decision trees, and that version, from 2001, wasn't where ensembles of decision trees were invented: the idea goes back to 1995, and it was independently developed by three different people in '95, '96 and '97. The Random Forest implementation is really just one way of doing it. It all rests on a really fascinating observation: if you have a model that is really, really crap — but not quite random, slightly better than nothing — and you've got ten thousand of these models, all different, all crap in different ways, but all better than nothing, then the average of those 10,000 models will actually be fantastically powerful as a model in its own right.

This is the wisdom of crowds, or ensemble learning. You can sort of see why: out of these 10,000 models, all crap in different ways, all a bit random, all a little better than nothing, 9,999 of them might be basically useless, but one of them happened upon the true structure of the data. The other 9,999, if they're unbiased and not correlated with each other, will average out to roughly the average of the data, so any difference in the predictions of the ensemble comes down to that one model which actually figured it out. That's an extreme version, but it's basically the concept behind all these ensemble techniques. And if you want to invent your own ensemble technique, all you have to do is come up with some learner — some underlying model — which you can randomize in some way so that each one comes out a bit different, and repeat it lots of times. Generally speaking, this whole family is called random subspace techniques.

Let me show you how unbelievably easy this is. Take any modeling algorithm you like. Here's our data: all the rows, all the columns. I'm now going to create a random subspace: some of the columns, some of the rows. Now build a model using just that subset of rows and that subset of columns. It's not going to fit the training data as well as using the full lot, but it's one way of building a model. Then I build a second model, and this time I use a different subspace: a different set of rows and a different set of columns. [Question: are they disjoint?] No, absolutely not — I just didn't want to draw four thousand lines, so let's pretend. What I'm really doing each time is pulling out a random bunch of rows and a random bunch of columns. That's a random subspace — it's just one way of creating one, but it's a nice easy one, and because I didn't do very well at linear algebra — in fact I'm a philosophy graduate; I don't know what "subspace" means well enough to do it properly — this certainly works, and it's all these decision tree ensembles do.

So, as I said, for each of these random subspaces we're going to build a decision tree. How do we build a decision tree? Easy. Let's create some data: say we've got age, sex, is-smoker, and lung capacity, and we're trying to predict lung capacity. Assume this is our particular subset of columns and rows — our random subspace. To build the tree, I ask: on which predictor, and at which point of that predictor, can I make a single split that creates the biggest possible difference in my dependent variable? It might turn out that if I split on is-smoker, yes versus no, the average lung capacity for the smokers is thirty and the average for the non-smokers is seventy. Literally all I've done is gone through the candidate splits, calculated the average of the two groups for each, and found the single split that makes the biggest difference. And then I keep doing that, so next I look within the people who are non-smokers.
Interestingly, with random forests and these decision tree ensemble algorithms generally, at each split point I select a different random group of columns, but I keep using the same rows — I obviously have to, because I'm taking them down the tree. So it might turn out that, amongst the non-smokers, age — less than 18 versus greater than 18 — is the split in this random subspace that makes the biggest difference, with averages of, say, 50 and 80. And that's how I create a decision tree: at each point a different random subset of columns, for the whole tree the same random subset of rows, and I keep going until every leaf either has only one or two data points left, or all of its data points have exactly the same value, and at that point the tree is finished. I put it aside and say: that is decision tree number one. Now I go back, take a different set of rows, and repeat the whole process; that's decision tree number two. And I do that a thousand times, or however many.

At the end I've got a thousand decision trees, and for each thing I want to predict, I send it down every one of them. The first thing I'm trying to predict might be a non-smoker who's 16 years old; the prediction at each leaf is simply the average of the dependent variable — here, lung capacity — for that group. That might be 50 in tree one, 30 in tree two, 40 in tree three, and I take the average of all of them, and that gives me what I wanted: a whole bunch of independent, unbiased, not-completely-crap models. How not-completely-crap are they? The nice thing is that we get to choose. If you want to be super cautious and you really need to avoid overfitting, you make your random subspaces smaller — fewer rows and fewer columns — so each tree is weaker; if you want to be quick, you give each one more rows and more columns so it better reflects the full data. Obviously, the fewer rows and columns you use each time, the less powerful each tree is and therefore the more trees you need. And the nice thing is that each of these trees takes a tiny fraction of a second to build — it depends how much data you've got, but on the data sets I look at you can build thousands of trees in a few seconds, so generally it isn't an issue.

And here's a really cool thing. I built this tree with these rows, which means those other rows weren't used to build it, which means those rows are out of sample for that tree. And that means I don't need a separate cross-validation data set. I can go through my full data set and, for each row, ask: how good am I at predicting this row? Well, here are all my trees, one to a thousand. Row number one was one of the rows included when I built tree number one, so I won't use that tree; but it wasn't included when I built tree two, it wasn't in the random subspace for tree three, and it was included in tree four. So I send row number one down every tree it wasn't used to build — trees two and three and so on — get the predictions, and average them. That gives me this fantastic thing, an out-of-bag estimate for row one, and I do that for every row. None of these predictions uses any of the data that built the trees they came from, so they're truly out of sample; and therefore, when I put it all together to compute my final AUC or log-likelihood or R-squared or SSE or whatever, and then send my submission off to Kaggle, Kaggle should give me pretty much the same answer, because by construction I'm not overfitting.
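Here is a minimal random-subspace ensemble in the spirit of what's described above, using rpart trees as the weak learner. It's an illustrative sketch, not the custom implementation from the talk: columns are sampled once per tree rather than at every split, the row and column fractions are arbitrary, and rows a tree never saw supply its out-of-bag predictions.

    library(rpart)

    random_subspace_forest <- function(x, y, n_trees = 500,
                                       row_frac = 0.2, col_frac = 0.5) {
      n <- nrow(x)
      oob_sum   <- rep(0, n)   # running sum of out-of-bag predictions
      oob_count <- rep(0, n)   # how many trees each row was out-of-bag for

      for (i in seq_len(n_trees)) {
        rows <- sample(n, size = max(2, round(row_frac * n)))
        cols <- sample(ncol(x), size = max(1, round(col_frac * ncol(x))))

        d    <- data.frame(y = y[rows], x[rows, cols, drop = FALSE])
        tree <- rpart(y ~ ., data = d,
                      control = rpart.control(minsplit = 10, cp = 0.01))

        oob  <- setdiff(seq_len(n), rows)                      # rows this tree never saw
        pred <- predict(tree, newdata = x[oob, cols, drop = FALSE])
        oob_sum[oob]   <- oob_sum[oob] + pred
        oob_count[oob] <- oob_count[oob] + 1
      }
      oob_sum / pmax(oob_count, 1)     # averaged out-of-bag predictions
    }

    # Example on a built-in data set: predict Sepal.Length from the rest of iris.
    oob_pred <- random_subspace_forest(iris[, 2:5], iris$Sepal.Length, n_trees = 200)
    cor(oob_pred, iris$Sepal.Length)^2   # rough out-of-bag R-squared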
[Question.] Yes, you can — but no, I wouldn't. The question was: could you just pick the best single tree, and would that one tree be better than what we've just done? That's a really important question, so let's think about why it won't work. The whole purpose of this was to not overfit. The whole purpose was to say: each of these trees is pretty crap, but it's better than nothing, so we average them all out and that tells us something about the true data; each one can't overfit on its own. If I now go back and do anything to those trees — prune them, which is what the old-fashioned decision tree algorithms did, or weight them, or pick a subset of them — I'm introducing bias based on the training set, and any time I introduce bias I fundamentally break this out-of-sample machinery. The other thing I'd say is there's no point: if you've got so much training data that sampling isn't really a problem, you just use bigger subspaces and fewer trees, and the only reason to do even that is time, and this approach is so fast anyway that I wouldn't bother. And the nice thing is you can say: I'm going to use this many columns and this many rows in each subspace, and start building trees — tree number one gives me an out-of-bag error, tree number two another, tree number three another — and I can watch it. It won't be exactly monotonic, it'll be a bit bumpy, but it will keep getting better on average, and I can stop when I decide it's good enough. Normally we're talking four or five seconds, so time's not an issue, but with huge data sets this is a way you can watch it converge.

So that's the technique I used in the grants prediction competition, and I did a bunch of things to make it even more random than this. One of the big problems here, both in terms of time and in terms of lack of randomness, is that for continuous variables, the official random forest algorithm searches through every possible breakpoint to find the very best one, which means that every time it uses a particular variable — particularly if it's in the same spot, like at the top of the tree — it's going to do the same split. In the version I wrote, every time it comes across a continuous variable, it randomly picks three breakpoints — it might try 50, 70 and 90 — and takes the best of those three. To me this is the secret of good ensembles: make every tree as different from every other tree as possible.
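A sketch of that "pick a few random breakpoints" trick: instead of scanning every possible cut point of a continuous variable, try three random ones and keep whichever reduces the squared error most. This is illustrative only; the talk's implementation was in C#.

    best_random_split <- function(x, y, n_candidates = 3) {
      cuts  <- sample(x, n_candidates)          # three random candidate breakpoints
      score <- function(cut) {
        left  <- y[x <  cut]
        right <- y[x >= cut]
        if (length(left) == 0 || length(right) == 0) return(Inf)
        sum((left - mean(left))^2) + sum((right - mean(right))^2)   # within-group SSE
      }
      cuts[which.min(vapply(cuts, score, numeric(1)))]
    }

    # e.g. a split point for horsepower against mpg:
    best_random_split(mtcars$hp, mtcars$mpg)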
[Question.] No, not at all. The question was: does the distribution of the dependent variable matter? The answer is it doesn't, and the reason is that we're using trees. Imagine the dependent variable has a very long-tailed distribution. As the tree looks at each independent variable, it's looking at the difference between two groups and trying to find the split with the biggest difference between them, so regardless of the distribution it behaves more like a rank measure: it picks a breakpoint and asks which one separates the two groups most. So whatever the distribution of the dependent variable, it's still going to find the same breakpoints, because it's really a nonparametric measure — we're using something like Gini, or some other measure of information gain, to build the tree. This is true of pretty much all decision tree approaches, in fact.

[Question.] The question is: does it work for highly unbalanced data sets? Some versions can and some can't; the approaches that use more randomization are more likely to cope. The problem is that with highly unbalanced data sets you can quite quickly end up with nodes that are all the same value, so I've often found I get better results if I do some stratified sampling. For example, think about the R package competition, where most users — other than user number five — don't have 99% of packages installed. In that case I tend to say: at least half of this data set is so obviously zero, let's just call it zero and model the rest. I find I often get better answers that way, but it does depend.

[Question: can you call it a forest if you use something other than a tree?] So, can you use other learners in a random subspace ensemble? Yes, you absolutely can, and a lot of people have gone down that path. It would have to be fast, so glmnet would be a good candidate because it's very fast, but glmnet is parametric. The nice thing about decision trees is that they're totally flexible: they don't assume any particular structure in the data, they can handle an almost unlimited amount of interaction, and you can build thousands of them very quickly. But there are certainly people building other types of random subspace ensembles, and I believe some of them are quite effective. Interestingly — I can't remember where I saw it — I have seen papers showing evidence that if you've got a truly flexible underlying model, and you make it random enough, and you build enough of them, it doesn't really matter much which learner you use or exactly how you do it. That's a nice result, because it suggests we don't have to keep trying to come up with better and better generic predictive modeling tools. There are "better" versions of this, in quotes — things like rotation forests, and things like GBMs, gradient boosting machines, and so forth — and in practice they can be faster in certain situations, but the general result is that these ensemble methods are flexible enough.

[Question at the back.] The question is how to choose the optimal size of the subspace, and that's a terrific question. The answer is really nice: generally speaking, you don't have to. The fewer rows and the fewer columns you use, the more trees you need, but the less you'll overfit and the better your results will be.
The nice thing, normally, is that for most data sets, because of the speed of random forests, you can pretty much always pick a row count and a column count small enough that you're absolutely sure it's fine. Sometimes it does become an issue — huge data sets, really severe class imbalance, or hardly any training data — and in those cases you can use the kind of approach most of us are familiar with: create a grid of a few different values of the column count and the row count, try a few out, and watch the graph of how the error improves as you add more trees. But the truth is it's so insensitive to this that if you pick a number of columns somewhere between, I don't know, 10% and 50% of the total, and a number of rows between 10% and 50% of the total, you'll be fine, and then you just keep adding trees until you get sick of waiting or the error has obviously flattened out. If you do a thousand trees, it really doesn't matter; it's just not sensitive to that choice.

[Question about sampling with or without replacement.] Right — the R routine. For this idea of a random subspace, there are different ways of creating it, and one key question is: can I pull out a row I've already pulled out? The R randomForest, and the original implementation it's based on, by default lets you pull the same row out multiple times, and by default, if you've got n rows, it pulls out n rows; but because some come out multiple times, on average it covers about 63.2% of the rows. I don't get my best results when I use that, but it doesn't matter, because in the R randomForest options you can choose to sample without replacement. [Question: do you think it makes a difference?] Yes, I absolutely think it makes a difference — though I'm sure it depends on the data set. I always enter Kaggle competitions in areas I've never worked in before, domain-wise or algorithm-wise, so I suppose I get a good spread of different situations, and in the ones I've looked at, sampling without replacement — which is a bit more random — works better for me, and I also tend to pick a much lower fraction than 63.2%: more like ten or twenty percent of the data in my random subspaces. That's my experience, but I'm sure it depends on the data set, and I'm not sure it's terribly sensitive to it anyway.
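If you'd rather stay with the standard R implementation, the sampling behavior discussed above is exposed directly in randomForest(): replace switches off bootstrap (with-replacement) sampling, sampsize shrinks the row sample per tree, and mtry sets the number of columns tried at each split. The 20% figure and the iris example are just for illustration.

    library(randomForest)

    fit <- randomForest(x = iris[, -5], y = iris$Species,
                        ntree    = 1000,
                        mtry     = 2,                        # columns tried per split
                        replace  = FALSE,                    # sample without replacement
                        sampsize = round(0.2 * nrow(iris)))  # ~20% of rows per tree

    fit$err.rate[1000, "OOB"]   # out-of-bag error after the final tree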
[Question: do you always split into two branches?] I always split into two branches. There are a few possibilities here: in this case I've got a binary variable, so obviously that just splits into two; in this case I've got a continuous variable, and I still split it into two, because if a three-way split would actually have been optimal, then when the variable appears again at the next level it can always be split into another two at the lower cut point. It just depends — remember that at every level I re-sample a different bunch of columns, so that same column won't necessarily be available again on that route — but it can easily happen that it is, and that it again finds the best split point within that group, and if you're doing ten thousand trees with a hundred levels each, it's going to happen lots of times. The nice thing is that if the true underlying system is, say, a single univariate logarithmic relationship, these trees will eventually capture it absolutely fine. And definitely don't prune the trees: the key things that make this approach so fast and so easy, without overfitting, are that you don't prune and you don't need a separate validation set. [Question about whether the splits are balanced.] Not necessarily — the split point is chosen so the two branches won't necessarily have balanced counts. That's right: in the under-18 group you could have not very many people and in the over-18 group quite a lot, so the parent's value is the weighted average of the two.

[Question: where does this fit with gradient boosting machines?] Gradient boosting machines are interesting. They're not much harder to understand: they're still basically an ensemble technique, but they work with the residuals of the previous models. There are a few pieces of theory around gradient boosting machines which are nice: they ought to be faster, and they ought to be better directed, and you can do things like declare, when you build one, that a particular column has a monotonic relationship with the dependent variable, so you can add constraints, which you can't do with random forests. In my experience I don't need the extra speed of GBMs, because I've just never found it necessary, and I find them harder to tune — there are more parameters to deal with — so I haven't found them as useful. But in a lot of data mining competitions, and a lot of real-world, big, demanding problems, people try both.

All right, we're probably just about out of time, so if there are any more questions I'm happy to take them afterwards. Thanks very much.
Info
Channel: Jeremy Howard
Views: 81,016
Keywords: Science, kaggle, data mining
Id: kwt6XEh7U3g
Length: 73min 58sec (4438 seconds)
Published: Thu Nov 24 2011