Regressions with SAS Enterprise Miner

Captions
OK, I'm going to start from the beginning and walk through the process for doing regression in SAS Enterprise Miner, assuming that our first step is already completed and that we have imported a set of data that has been randomized.

The next thing we need to concern ourselves with, in the veterans data set we've worked with in the past, is that income has been improperly reported: in many cases it says zero rather than missing, which is what the actual value should be. To fix that, we put a Replacement node on the diagram and change a few things. The replacement value we want to put into the matching column is "missing," and in the interval variables area we choose None, because we're going to handle this on our own. Then we go into the Replacement Editor, and where it says income we choose "user specified," so that anything under one is now replaced with a missing value. After running the node, we get confirmation that 2,357 values were replaced.

The next thing we want to do, before we use any of our modeling nodes, is split the data into training and validation sets, because that lets us test the models we create on validation data they haven't seen yet. The model is built with the training data, then it's tested against the validation data, and we get a report of how well it's doing on that unseen data. So we set the Data Partition node to 50, 50, and 0, and run it.

At this point we might be tempted to go straight to Model, find the Regression node, and attach it to the Data Partition. Before doing that, though, we need to consider whether our data meets the criteria for being analyzed with a regression. One thing we have to be concerned about is whether there are missing values in our columns of data, because if there are, the default behavior is to throw that entire row out. So if 50 percent of your rows have a missing value somewhere, that data goes away, which is problematic not only for creating your model but also for scoring data with it later.

At this stage, with just creating the model in mind, what we want to do is take our existing data and fill in values where there is no value currently. The way we do that is to go under Modify and grab the Impute node. By default, it replaces missing values in class (categorical) variables with the value that appears most often, and it replaces missing values in interval variables with the average of that column. So if the average age is 18 and there are missing ages, any missing age will be filled with 18.

The one thing we need to change is down under Indicator Variables. In addition to creating a new, imputed version of each variable with the missing values filled in, we also want to include variables in our data set that indicate whether a value was imputed in that row. We can do that in two ways: one option just says that something in the row changed, but doesn't indicate where; the other creates a 1/0 indicator variable, named after every column that had missing values, so we know exactly which column was missing.
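As a rough sketch of what this preprocessing chain looks like outside of Enterprise Miner, here is the same sequence of steps (recode the bad income values as missing, split 50/50, then impute with indicator flags) written in Python with pandas and scikit-learn. The file name and column names (veterans.csv, income, target) are placeholders, not the actual fields of the course data set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("veterans.csv")                 # placeholder file name

# Replacement node: incomes under 1 were really unreported, so mark them missing.
df.loc[df["income"] < 1, "income"] = np.nan

# Data Partition node: 50/50 training/validation split of the randomized rows.
train, valid = train_test_split(df, train_size=0.5, random_state=0)

# Impute node: mean for interval variables, most frequent value for class
# variables, plus an M_ indicator column marking which rows were filled in.
# Fill values are learned on the training data and reused on validation.
def impute(frame, fill_values=None):
    out = frame.copy()
    if fill_values is None:
        fill_values = {
            col: (out[col].mean() if out[col].dtype.kind in "if"
                  else out[col].mode()[0])
            for col in out.columns if out[col].isna().any()
        }
    for col, value in fill_values.items():
        out["M_" + col] = out[col].isna().astype(int)   # 1/0 "was imputed" flag
        out["IMP_" + col] = out[col].fillna(value)      # imputed copy of the column
    return out, fill_values

train_imp, fills = impute(train)
valid_imp, _ = impute(valid, fill_values=fills)
```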
With the indicator option turned on, a missing value in a given row of data produces two new variables for us: one is the new variable with the value filled in, and the other is a 1/0 flag saying which column was changed. Both of those become inputs, to be tested for whether either of them affects the outcome in the regression.

After running the node, it looks like we had three different variables with missing values in them; one of them was the variable we just created in the previous step. Each of those results in two more variables. We get the new variable over here in the third column, which is the updated value, and that's what we want to use in our predictions: it's identical to the previous column of data, except it now has averages filled in where it used to be empty. Then we have the other column with the M in front of it, which is just a 1 or a 0 saying a change happened where the value used to be missing.

Next I dragged a Regression node down and connected it. I'm not going to change anything at this point; I'm just going to let it run with what it has and see what happens. To compare this particular model's efficiency and accuracy against something else, like a decision tree, I would look at things like average squared error and misclassification rate on the validation data set, and I would use those numbers to compare against future models.

The behavior of a default Regression node is that it looks at every single variable in the model at the exact same time. It doesn't try to get rid of any of them; it asks, subject to the presence of the other variables, whether any of them are significant, meaning whether they indicate an impact on the thing we're predicting. The intuition here is that if you put in too many things, some of the things that matter won't show up as well, and that's where the stepwise procedures come in.

If we scroll through the output far enough (I'll pause and find the right spot; it turns out the answer was near the top), the first thing is the test of the overall model, and it says the probability is far less than 0.05, so we think that, on the whole, our model matters. But in terms of which individual variables stood out as impactful, there aren't very many. We can see that this one does, because its p-value is less than 0.05, which is our usual rule-of-thumb cutoff for saying whether something matters. So really there are just a few significant things in here.
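Continuing the earlier Python sketch, this is roughly what that first "no selection" regression amounts to: fit a logistic regression on every input at once, look at the overall model test and the per-variable p-values, then score the validation data with misclassification rate and average squared error. It assumes the inputs are numeric (class variables would need dummy coding first) and that the target column is called target; both are placeholder assumptions.

```python
import numpy as np
import statsmodels.api as sm

# Keep numeric inputs that are fully populated in training data
# (the un-imputed originals, which still contain missing values, are dropped).
inputs = [c for c in train_imp.select_dtypes("number").columns
          if c != "target" and not train_imp[c].isna().any()]
X_train = sm.add_constant(train_imp[inputs])
X_valid = sm.add_constant(valid_imp[inputs])

# Default behavior: every variable enters the model at the same time.
full_model = sm.Logit(train_imp["target"], X_train).fit(disp=False)

print(full_model.llr_pvalue)                           # overall model test (far below 0.05 here)
print(full_model.pvalues[full_model.pvalues < 0.05])   # the few individually significant inputs

# Validation fit statistics, comparable across models (e.g. against a decision tree).
p_valid = full_model.predict(X_valid)
misclassification_rate = np.mean((p_valid >= 0.5) != valid_imp["target"])
average_squared_error = np.mean((p_valid - valid_imp["target"]) ** 2)
```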
We can do better than this, and get better accuracy, if we narrow down our variable set, so let's see how we do that.

All right, what I've done now is add three more Regression nodes and have each one perform its procedure differently to select variables that would be useful for predicting the outcome. I really only made one alteration in each case: under Model Selection on each node, I can select Forward, Backward, or Stepwise.

The thought process with the forward method is that it looks through the variables one at a time, sort of like when we were building our decision tree, and selects the variable that is going to be most important to the model. It puts that one in, runs the model again, looks for the next variable that would also be important, and continues doing that until there aren't any other significant variables left to add. So it starts with basically no variables and keeps adding them.

The backward procedure does something a little different; sometimes it will come up with the same answer and sometimes it won't. It starts with every variable in the model (you saw how that performed before, where only three variables came out as significant) and then removes the least important variables one at a time. As it does that, other variables start to become more impactful in the model, and it picks that up. It keeps going, one variable at a time, until the only things left in the model are significant.

Stepwise does both of those procedures at once: it runs a forward and a backward procedure together, so it lets variables in when their significance is high enough and kicks them back out if it drops too low.

In addition to just going with the defaults, you can also say not to use the defaults and go into your selection options. There I would do a few things: make sure you allow plenty of iterations, like 30, and I would typically recommend a looser criterion for letting variables into the model than for kicking them out. As we saw in class with this data set, the defaults are 0.05 and 0.05, but if we go a little higher than that on entry, we actually tend to get a better model when we look at the results. (A small sketch of this kind of selection loop follows below.)

It's also useful to discuss the results that came out. If I look down in my stepwise regression, at some point it says that no further effects were useful, and the selected model is the one from step eight. As some background, this chi-squared statistic generally keeps getting higher, and the higher it gets, the higher the significance of the overall model, meaning it's performing better and better at predicting the outcome. Right here, under the Type 3 Analysis of Effects, are the variables that are important in our model. This is more than what we had before: in our first run we only had three, and now we have a lot more that are useful for prediction. We should also get greater accuracy, because it's selecting this model based on how it performs against our validation data, and that gives me a high degree of confidence.
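Enterprise Miner runs these selection procedures for you; purely as an illustration of the idea, and continuing the same Python sketch, here is a minimal forward-selection loop (statsmodels has no built-in stepwise routine, so this just mimics it). The entry threshold of 0.10 is an assumed value standing in for the "looser than 0.05" advice above; a full stepwise version would also re-test fitted variables and drop any whose p-values rise above a stay threshold.

```python
import statsmodels.api as sm

def forward_select(y, X, entry_threshold=0.10):
    """Greedy forward selection: at each step, add the remaining input with the
    smallest p-value, as long as it clears the entry threshold."""
    selected, remaining = [], list(X.columns)
    while remaining:
        best_p, best_col = 1.0, None
        for col in remaining:
            fit = sm.Logit(y, sm.add_constant(X[selected + [col]])).fit(disp=False)
            if fit.pvalues[col] < best_p:
                best_p, best_col = fit.pvalues[col], col
        if best_col is None or best_p >= entry_threshold:
            break                      # no further effects meet the criterion
        selected.append(best_col)
        remaining.remove(best_col)
    return selected

chosen = forward_select(train_imp["target"], train_imp[inputs])
print(chosen)                          # variables added, in order of selection
```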
Another thing to look at and talk about is the practical significance of the variables that were selected, which is down here under the Odds Ratio Estimates. A couple of things to discuss. Any value greater than 1 means that a unit increase in that variable increases the likelihood of getting a yes answer in our outcome variable. It isn't age in this case, but if it were age, every unit increase, like going from 18 to 19 to 20, would multiply the odds by that estimate; the part of the number beyond the 1 is the size of the increase. Similarly, if the estimate is less than 1, each one-unit increase in that input reduces the likelihood: the odds get multiplied by that fraction, so the chance of the outcome keeps decreasing. So greater than 1 raises the likelihood of the outcome, and less than 1 reduces it.

A couple of other items. For categorical variables, the table actually compares levels, saying that if you have this value versus that value in a particular column, your odds go up or down. The way this typically works is that one level is selected as the default, or reference, case; it looks like the S value is the reference here. Then, compared to getting an S, the odds go up or down depending on which other categorical level you have.

One estimate that's a little confusing is the one that shows up as exactly 1. As I said, greater than 1 increases your chances and lower than 1 decreases your chances of getting the outcome you're looking for. The reason this one reads as 1 is that the variable, median home value, takes such large values, into the hundreds of thousands, that a one-unit (one-dollar) increase isn't very meaningful. If instead you collapsed it into much bigger steps in value, like 10,000, 20,000, 30,000, you would get a larger estimate here. As it is, if you carried this estimate out to enough decimal places, somewhere out there would be a digit greater than zero, which means that for every extra dollar of home value there is a really tiny increase in the likelihood. We know it's greater than one; we just don't know, at this precision, where the impact starts to show up. If you changed the scale of the variable, perhaps by turning it into categories or much bigger brackets rather than a value that runs from one up to hundreds of thousands, you would get something different here.
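To make that rescaling point concrete, here is the tail end of the same Python sketch. Exponentiating a logistic regression coefficient gives the odds ratio for a one-unit increase in that input, and multiplying the coefficient by 10,000 before exponentiating expresses the same effect per $10,000 of home value. The column name median_home_value is a placeholder.

```python
import numpy as np

# Odds ratios from the fitted coefficients: >1 raises the odds of the outcome,
# <1 lowers them (categorical levels would be read relative to a reference level).
odds_ratios = np.exp(full_model.params)
print(odds_ratios.sort_values(ascending=False))

# The per-dollar odds ratio for a large-valued input rounds to 1.0000...;
# rescale it to a per-$10,000 effect to see the practical impact.
beta = full_model.params.get("median_home_value", 0.0)   # placeholder column name
print(np.exp(beta))            # odds ratio per $1 increase
print(np.exp(beta * 10_000))   # odds ratio per $10,000 increase
```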
Info
Channel: Degan Kettles
Views: 17,115
Rating: 4.9230771 out of 5
Keywords:
Id: wQ24e0LrmPg
Length: 14min 29sec (869 seconds)
Published: Mon Oct 31 2016