Combining Random Forests and GLMs in R

Video Statistics and Information

Captions
[Applause] [Music] Hello and welcome back. Today we're doing random forests in R. As a reminder, there are three advantages to random forests: they detect interactions, they detect nonlinearities, and they naturally validate. But they are not transportable. GLMs can overfit, but they are transportable. So why not combine the strengths of both approaches? That's what I'm going to show you today in R.

But first, let me talk about my general strategy. We're going to use random forest, with its variable importance measures, to find a set of plausible predictors. When we visualize these predictors, they will tell us whether we need to include interactions or nonlinear terms. Then we're going to use the general linear model to fit a model based on the visual of the random forest model, and we may need to add nonlinear terms or interactions. So with that, let's get to R.

Welcome to R. We're going to start with require (or you could use library, which does the same thing), and that loads the party package. I don't remember what party stands for; it's something like recursive partitioning. Anyway, it's used for making decision trees and that sort of thing. And of course we're going to require flexplot. I've also made comments in this file, and I will be sure to include a link to it in the comments so you can follow along as you wish.

So we're going to set a seed. What does that mean? When you're doing random forests, remember, it is randomly sampling variables and randomly sampling observations, and because it does this at random, the answers are by nature not reproducible. What you can do is set the seed, which makes the random numbers generated in this instance consistent. So if you call set.seed and set the number to 1010 like I did, then your results will be identical to mine, assuming you're using the same version of
R as I am. I heard rumors that they changed how the seeds operate. Anyway, the results should be similar even if they're not perfect, but if you want to try to get them exactly the same as mine, then set it to 1010.

The data set I'm going to use is avengers, and because I'm lazy, I'm going to go d equals avengers so I don't have to type avengers every time I reference my data set. Again, I have comments in here. The function we're going to use is cforest, and if you want to learn more about what arguments are allowed in cforest, you can always type, like it says right there, question mark and then cforest, and that'll bring up the documentation. What I am doing is ptsd tilde dot. What does that mean? It says I want to predict ptsd with every single variable in the data set; that's what the dot means. If I run this model, it's going to take quite a while to run. Well, actually, I guess it wasn't that bad; it only took about two seconds.

Later I'm going to show you how to visualize these. Do not visualize this model. I repeat, do not visualize this model. It will take forever. You don't need to know why, but if you are interested, the reason is that when you use the visualize function, in the background it comes up with unique values for every single variable and pairs them with every unique value of every other variable. So if you've got five unique values for one variable and five for another, that's five times five different combinations, so 25. Not a big deal if you've only got two variables. But if you've got lots of variables, and instead of 5 values you use 50, which I think is the default, then with 10 variables that's 50 to the 10th power. That is a lot. So don't visualize that one. Instead, what I recommend is you run the model with everything and then you look at
the estimates. This too will take a little bit of time, but not as much as if you were to visualize it. There we go. I'm going to make this bigger and then run it again; I wish it would automatically adjust, but it doesn't.

Like I said in the video, there are really two metrics that we are interested in: one is the out-of-bag performance and the other is the variable importance. Let me spend some time interpreting these, because they are a little confusing just because of how random forests work. Remember, out of bag means the observations that were not used to fit the model. What this tells you is the quantiles of the absolute value of out-of-bag performance. It takes the predicted score for each individual, subtracts the actual score, and then takes the absolute value of that. Again, the predicted score is the out-of-bag prediction, so every person has a predicted out-of-bag score.

This right here tells you the quantiles. The minimum difference between predicted and actual is 0.001; not surprisingly, the zero quantile is usually pretty close to zero. The 25th percentile is 0.117. Then there's the median, which you can think of as the average: on average, the difference between your actual score and your predicted score is about 0.263.

Now, this is a ptsd score, and I don't remember its scale, so we have to think about what this means on the metric of the variable. If we go flexplot(ptsd ~ 1, data = avengers), we can get an idea of the scale of this variable. Okay, so this ranges from two to six, mostly, with some deviation. So what is this saying? On a scale of about one to six, we are off on average by 0.26. That's respectable. It's not super
impressive, but it's respectable. Maybe someday I'll include standardized differences so you can interpret those in terms of standard deviations, but for now this will work. Then the maximum difference is three, so on a scale of one to six, the maximum deviation between the predicted and the actual score is about half the scale. That's a pretty severe departure. But if you look at the 75th percentile, 75 percent of the scores are within half a point, so that's doing pretty well. Out-of-bag performance is actually easier to interpret when we have a binary outcome, and maybe I'll show you how to do a binary outcome eventually.

After that we've got the variable importance measures. So what is this? Again, when you're using random forest for regression, with a continuous or numeric outcome, it's not as intuitive as when you have a binary outcome. This is the root mean squared error of predicted versus permuted. Remember, when we're computing out-of-bag error, we take the values that the random forest model predicted, and then we shuffle everybody's score. So what this represents is basically the average deviation between what was originally predicted and the permuted prediction. On our ptsd scale, shots taken seemed to be the most important variable, and it was off by about 0.26 points. Not a huge difference, but it's the best we've got. Then we've got agility, which is off by about 0.257, then injuries, north south, et cetera. Down at the bottom, died is NaN, not a number. I'm not quite sure why that happens, but it's not that important right now.

So this tells us, if we were to include everything in the model, these are the best predictors in order. Notice that they're sorted: the most important predictor is shots taken, then agility, then injuries, etc. Remember, we don't want to visualize that
entire model. So now I'm going to take the top four variables and create a new model. Why the top four? Because flexplot can only visualize four variables at a time; that's really the only reason. I'll show you in a minute what happens when we only take the top four, and how we know whether we need to go beyond four.

So now I'm going to fit a new model. Oh, it looks like last time I ran it, it had a slightly different model; I might not have set the seed before that. Anyway, we're going to just take them in order: shots.taken plus agility plus injuries, and north south is next. Now that is fitting a much smaller model. Why? Because we know we're going to really struggle visualizing the entire model. It's also going to struggle with only four variables, but much less so than it would otherwise.

Then we can go ahead and look at the estimates again if we want. Sometimes the estimates are different, and sometimes they're pretty much the same. Before, we had shots taken at 0.26; this time around it's 0.26 again. But notice the order of importance changed, and that's to be expected: the original model, with all those variables, was trying to account for any correlation each predictor shares with all the other variables, and now it's only concerned with four variables. Okay, so we have the estimates, and things are looking good. And look at that: our prediction actually improved slightly. The worst before was 3.23 and now it's 2.916, so apparently having so many variables before might have confused it. But then again, the other quantiles don't look all that different, so maybe that's just a trivial difference.

Now, what we could do at this point is use the flexplot visualize function, and that would be fine, except that it's going to choose for you. Let me put this on
separate lines just so you can see it all at once. What it's going to do is choose which variables go where, and you may want to look at it from different angles. If you wanted to look at it from different angles, you would have to rerun visualize lots of different times, and in the background visualize is generating predictions for different levels of shots taken, agility, injuries, north south, etc. So it's probably going to take about five seconds every time you do that, maybe 10 seconds, and I'm a little impatient. So rather than running visualize for every view of the data I might want, I'm going to go down here and use the compare.fits function. There's a little-known argument where you can tell it to return the predictions, so I'm going to go ahead and run that.

Oh, it gave me an error. That's right: remember, before I ran it with a different seed, so now it's choosing different variables. I'm going to have to put these new variables in and make that a flexplot formula. I don't think the order will matter, because we're just generating predictions here. Now if I run that, it should work. It's running, and it looks like it took between five and ten seconds, which isn't too bad.

Now we have a matrix of predictions. I'm going to go head(predictions) just to look at the first few rows. Notice what it has done: it has taken a bunch of different values of shots taken, a bunch of different values of agility, the different types of north versus south, different numbers of injuries, and so on, and prediction is the predicted outcome. Notice that agility is constant here, and that's because it's going to look at every single unique combination of all the different values of shots taken with all the different values of agility. In fact, if I went to tail, that
agility will not be negative 11 anymore; it'll be 93, which is apparently the maximum that it predicts. Anyway, you get the idea: it's just sampling a bunch of different values.

So now we can visualize it. I have flexplot here, and I'm going to copy this line right here and paste it in, because that is a flexplot formula. So now I'm asking flexplot to look at shots taken. Was that the most important variable? No, it wasn't; injuries was. Usually what I like to do on my first glance is look at the least important variable. Why is that? Because whatever variable is put first is going to be on the x-axis. If this one has a really strong relationship with ptsd, positive or negative, then I know that when I initially said just give me four variables, we might need five instead. Hopefully not; hopefully we can visualize it and see that there ain't nothing going on.

So far this looks like a normal flexplot formula, except now I'm adding the predictions that I generated: I'm going prediction equals predictions. What this does for me is overlay the fit of the model. [Music] All we're looking at is whether these lines are mostly parallel. If I squint my eyes, I see that in this panel right here there's a pretty strong difference between south and north, and there is right there too, but aside from that, it doesn't look like there's a really strong association between north and south and ptsd. So I'm not too worried about that, and I feel comfortable rejecting it. I'm going to go ahead and take that out.

So now I'm going to put in the second weakest predictor, which was, well, I'm going to have to scroll up. I don't want to run it again, because of course it'll take two seconds, which I can't afford, so
instead I'm going to spend five seconds looking for where I ran it before. That makes perfect sense. Okay, oh, I'm looking at the wrong one. Well, darn it, it looks like I screwed things up: I just messed up the north south. That needs to be damage resistance. Now, if we look at the estimates again, that's going to take a little bit of time. So the most important is injuries, then shots taken, then agility, then damage resistance.

What I'm going to do now is go in order, so I'm going to start with damage resistance, put that on the x-axis, and then I will do agility. Basically, I'm trying to either dismiss that variable and say it's really not that important, or say, oh shoot, damage resistance is highly correlated with the outcome, so maybe I ought to go one more variable back. With that, I'm going to again copy this and paste it right there, change that plus to a vertical pipe, and put damage.resistance there and shots.taken. Now if I run this, it'll give me a flexplot that we can reinterpret. I'm going to zoom in on it. Again, we're looking for evidence that there is basically no association there, and it looks like it's all pretty flat. Maybe there's some curvilinearity there, maybe there are some interactions, but it doesn't look like that strong of a relationship. Nothing too impressive, so I feel more comfortable about removing damage resistance.

Now, what was the second most important variable? I guess I'll run that again and just wait for my computer. Okay, so after damage resistance, next is agility, which is what's there already, so we can just leave that formula as is. In fact, I'm going to put shots taken right there just so I know what's next. I'm just modifying the flexplot formula. If I visualize that, what do we get? Okay, so the red line is the random forest
model, and there is a little bit of a bend in some areas, not in all areas. Is that worth keeping? For the sake of simplicity, I'm going to say it's not worth keeping, so let's get rid of that one too. Now we will just do shots taken given injuries, and then let's look at the zoomed version. Okay, so now we're getting somewhere. We definitely have a pretty significant slope there, and now we can start using this to inform what kind of general linear model we're going to use. To me, it looks like there might be some curvilinearity going on there.

Let's go ahead and look at it from a different view. I'm just going to copy that, paste it down here, and switch the order. So now we've got the largest predictor, which was injuries, on the x-axis. It looks to be relatively flat; I'm kind of surprised that variable is more important, because that slope doesn't look as steep. But anyway, between these two plots, this tells me it's possible that there are polynomial terms going on, but probably no interaction effects. Let's go ahead and look at that plot again so you can see why. Oh yeah, it's already over here. Now, this blue line isn't a ghost line, but we can use it like a ghost line to see if these lines are parallel, and they seem to be pretty parallel.

Okay, so if I were to look at this, I would say: all right, random forest has been awesome. Thank you, random forest, for showing me the wisdom of your massive experience and showing me which variables are actually related to ptsd. It seems that random forest is telling me that injuries and shots taken are two important predictors, and it's possible that there is a curvilinear relationship between shots taken and ptsd, whereas injuries seems to be flat. So again, random forest is great at telling you which variables are important and at detecting
these wonky relationships, but it's not so good at being transportable. You can't use an equation in a new data set; you have to keep those 500 or a thousand trees or whatever. For that reason, I'm now going to model using GLMs.

I'm going to fit a model with lm. I'll do full equals lm, and, oh, I need my outcome variable: ptsd tilde shots.taken plus injuries. Then I'm going to add the polynomial term (not the interaction term), because this is my full model, so I'll go I(shots.taken^2), and then data equals avengers. That is my regular old general linear model fit. Now I will do a reduced model, which is that minus the polynomial term, so just get rid of that, and we have our full and reduced models.

Now we can go compare.fits. We're going to put shots taken on the x-axis, because that's the one that had the curvilinear effect, then data equals avengers, and then we compare the full and the reduced. Oh, I forgot to run full. Now if we visualize that and zoom in, well, even from this small plot, it looks like ain't nothing going on. Oops, I should say it looks like that curvilinear effect didn't do anything. There is no polynomial. Okay, I'll trust you. Fine, you're the boss, you win.

So we visualized it, and then of course I could do model.comparison and go full, reduced. It's totally not necessary, because I saw the plot and knew they weren't going to be very different. And of course the Bayes factor shows 28 times stronger evidence for the reduced model than the full model. Not surprising. So that's good; that simplifies things.

What we've basically done here is actually really amazing. We took a data set with lots and lots of variables, and we used random forest to figure out which of these variables are actually informative, and
it helped us use the variable importance to narrow it down to really two variables, shots taken and injuries. We looked at the random forest plots, and they seemed to indicate that we could fit a general linear model, maybe with a polynomial term, but when we fit the polynomial term it didn't seem to do anything. So now we have a final model: we used random forest to find the variables, and we used GLM to fit a final model. That's cool.

So let me do a quick summary of what we've talked about over the last couple of videos. With lots of variables, it's very easy to overfit your data. The regression solution, or the general linear model solution, stepwise regression, really sucks. Random forest, on the other hand, is amazing; it's really good at figuring out which variables are most important. So in R, I showed you how to combine the strengths of random forest with the strengths of the general linear model, and the end result is a good model that doesn't overfit, and we can transport that same model to use it for predictions or whatever.

Let me make one final comment about four unique uses of random forest models, and again, once I finish my paper, I'm going to link it in the description, because that's basically what the paper is about. One use is variable selection; that's kind of what it was designed to do. The second is detecting interactions and nonlinear terms, which we've already talked about. Number three, you can use it for nonparametric modeling, and I've done this a lot: if you've got a wonky curved fit, a polynomial term doesn't work, and you can't figure out what the functional form is, why not use a random forest? And four, we can use it for prediction and classification.

So with that, let's review our learning objectives. Number one, the general strategy for random forest: we use random forest to find the variables and the functional form of the variables, and we use GLM to then fit the model so it can be transportable. Number two, how to model
random forests in R. Number three, how to compute variable importance as well as out-of-bag measures of error. Number four, how to visualize random forest models in flexplot. And then number five (is that what we're on?), the different uses of random forests: again, variable selection, nonlinear and interaction detection, nonparametric modeling, and prediction and classification. So with that, peace out.
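
The workflow walked through in the captions could be sketched in R roughly as follows. This is a reconstruction from what is said aloud, not the presenter's actual script (which the video says is linked in its comments). It assumes the party and flexplot packages are installed and that the avengers data ships with flexplot. The seed value 1010, the final set of four predictors, and the panel layout of the plotting formula are assumptions taken from the spoken description.

```r
library(party)     # cforest(): random forests built from conditional inference trees
library(flexplot)  # estimates(), compare.fits(), flexplot(), model.comparison()

set.seed(1010)     # assumed reading of the spoken "10 10"
data(avengers)     # assumed to be bundled with flexplot
d <- avengers

# Step 1: fit a forest with every predictor. Do not try to plot this model.
rf_all <- cforest(ptsd ~ ., data = d)
estimates(rf_all)  # out-of-bag error quantiles and variable importance

# Step 2: refit with the four predictors the video ends up keeping.
rf_small <- cforest(
  ptsd ~ injuries + shots.taken + agility + damage.resistance,
  data = d
)
estimates(rf_small)

# Step 3: generate the forest's predictions once, then reuse them for plots.
# The formula puts the weakest candidate (damage.resistance) on the x-axis.
plot_formula <- ptsd ~ damage.resistance + agility | shots.taken + injuries
predictions <- compare.fits(plot_formula, data = d,
                            model1 = rf_small, return.preds = TRUE)
head(predictions)
flexplot(plot_formula, data = d, prediction = predictions)

# Step 4: carry the two surviving predictors into ordinary linear models.
full    <- lm(ptsd ~ shots.taken + injuries + I(shots.taken^2), data = d)
reduced <- lm(ptsd ~ shots.taken + injuries, data = d)

# Visual and numeric comparison of the model with and without the squared term.
compare.fits(ptsd ~ shots.taken | injuries, data = d, full, reduced)
model.comparison(full, reduced)
```

Generating the predictions once with return.preds and passing them to flexplot through its prediction argument avoids recomputing the forest's predictions for every new view of the data, which is the reason the video gives for not calling visualize repeatedly. Exact importance values and their ordering may differ across R and package versions even with the same seed.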
Info
Channel: Quant Psych
Views: 584
Rating: 5 out of 5
Keywords: Statistics, Psychology, NHST, Null Hypothesis Significance Testing, Philosophy of Science
Id: QctVIuLTKtY
Length: 25min 35sec (1535 seconds)
Published: Wed Feb 24 2021