Optimizing Regression in SAS Enterprise Miner

Captions
I'd now like to talk about optimizing our model. We've already done some good things in regression as it relates to imputing some variables, but another problem that can come up is that our variables may have a significant amount of skew. If I go over to the Data Partition node, click the ellipsis next to Variables, and look at these variables, the gift count or just the gift variables in general, it will pull up a list of histograms. Any of the ones that have a significant tail on them are skewed, so we might perform a transformation on that variable so that it is more symmetric and more bell-shaped.

One of the transformations we can use is the log function. What the log function really does is compress and bring down really high values: the low values stay kind of the same and the big values end up getting squished down. That's a good thing and can help our model perform better.

So the good news is that our model will likely perform better when we do that, but the bad news is that it's harder to explain what it means. As it is, if I look at something like gift amount average all months and say that the average someone has given over the last 36 months is a dollar fifty, then when we put that into the formula for a regression we can talk in a very straightforward way about what giving something means. From a dollar fifty to four hundred and five dollars, that's our range of donations. But if we transform the scale by taking the log of it, now the input value is the logged gift amount, and that number just doesn't mean quite as much to us if we're talking to a business person about it.

Now, based on the histograms that we were just looking at, the book argues that everything other than the time gift variables is skewed, so if we select just these ones we can perform a
simple change to these and click OK. If I go ahead and run this again, it should go through all of the steps, all the way to the stepwise regression, and I will get a better outcome than I did before. This may not always be the case, but you can verify it by looking at your misclassification rate or your average squared error.

If I click on the transform node and look at its results, I can see that the gift variables I used before have now been used to create new variables. When these new variables are used as part of a stepwise regression, it will determine whether or not they perform better in predicting an outcome than the original ones. I have just looked at the results for the stepwise regression that was performed with the logged variables included, and it looks like this chi-squared value is higher than it was before, so I probably have an even better model. These are the variables that it selected as being useful in predicting the outcome. In the near term, what we can say is that we've now included these two logged variables and they matter more than just a straight gift average all variable. The logged version of that is actually more important in predicting things, and it's highly significant.

From a bigger perspective, we can also compare what we started with to what we have now, and we see that this imputation indicator is important and these other variables that we created are also important.

A couple of times while working on regression we've made changes to interval variables. For example, the first replacement we did was when there was a zero in the income of people; we changed that to missing because that was more appropriate. We've also done imputation, which in this particular case, where we had a bunch of missing values on interval data, allows us to take the average of any
column and fill it in for the missing spots.

We haven't talked about class variables and one of the problems they might introduce when performing regressions. If you have a continuous variable, an interval variable, and the regression is creating the formula for it, you usually get only one single term where you plug in the value of your continuous variable. So the models kind of look like beta_0 + beta_1 X_1 + beta_2 X_2, where X_1, X_2, etc. are the actual values of your observation for a given row; X_1 could be age, for example.

But when it comes to categorical variables, let's just say you have a color and it's green, yellow, or brown: what regression does, in essence, is create a different column for each color. If there are only a couple of values, that's not a big deal. But if you have lots of values, let's say sixty-some values for a particular categorical variable, that means you could be adding 59 or 60 columns to your model, and that is bad. It makes it harder for the math to find the true signal compared to the noise.

So the bottom line is: if a categorical variable, in our case DemCluster, has many, many different values that it could take, you want to reduce the number of different options there are, if you think the distinction really doesn't matter. For example, for the demographic cluster, if 1 to 30 really fits into one grouping and 31 to 50 fits into another, it's much better to rename those Group 1 and Group 2 than to have fifty different options.

So let's look at putting in another Replacement node to fix a problem related to having too many categorical options. I bring this down, and I've already selected None here for interval variables because we don't want to
mess with that in this pass, but we do want to mess with the class variables. So let's pull that open and look at what we have available. Here I see my class variables and the different values associated with them. With DemCluster, one of my variables, there's a ton of different values that it takes on, and ideally it would be nice to just enter a number there to recode those. In the real world you'd probably recode it to something else that makes sense, so you'd reduce the number of values that it has. Maybe you thought there's really only group 1, 2, 3 or A, B, C, something like that; you could just pick those out and reassign them to A, B, C, and then every single level is accounted for. Gender is probably not so bad: female, male, and unknown. I guess we have two versions of unknown, which is a little weird.

For purposes of this demonstration, we're going to pick on this variable right here, StatusCat96NK. I don't recall off the top of my head what it means, but I do know that it keeps coming up as a significant variable in our model. If you recall, when we do the odds ratio estimates for this particular variable, it'll say A vs. S, F vs. S, N vs. S, and E vs. S. These values represent an increase or a decrease in the outcome variable, but the variable also adds on, in essence, one, two, three, four, five, six or so different variables, one less than the total number of options that are there. It's kind of bad to have too many, more particularly so for the demographic cluster variable, but this is just to demonstrate how you would fix that problem.

I could have A continue to be A, I could have S now be A, I could have F be N, N be N, and this one be L, and L be L, and really I'm done. Once I run this, it will replace the values and create a whole bunch of new ones. If I look at the results of this Replacement node, it tells me up at the top the number of replacements that were made: 1625
in the training dataset and 1627 in the other. Then if I go to View, Model, Replaced Levels, it pops up in this little box here, which is useful because maybe I forget after the fact what I did. I can see what happened: an original value of A became A, an original value of S became A, etc.

I reran the stepwise regression with my new transformed categorical variable for StatusCat96NK in the model. When it got through selecting its significant variables, these were the ones selected, and it opted to include this replaced-value column that we just created as an important column, over perhaps the original one. If I go down and see how it works in terms of prediction, it's now just looking at three different levels of this variable: A, L, and N. N is treated as the default case, and the other two are the alternatives. Computationally this is better for getting a signal out of the noise that is there; the math works better.

Now of course this isn't something you do arbitrarily. You should collapse levels if the levels are telling you the same thing, but if they're very different then it doesn't make sense to collapse them. So use this with caution, and also look at what happens to the outcomes. Is this improving the predictive capability of your model? Is it affecting the average squared error and the misclassification rate? Are those things going down by using this approach? If you've segmented things appropriately, hopefully the answer will be yes.
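
The two ideas in the captions, compressing a skewed interval variable with a log and collapsing the levels of a class variable, can be sketched outside of Enterprise Miner. The following is a minimal Python illustration, not what the SAS nodes do internally. The gift amounts, the status values, and the `recode` table are made-up assumptions for demonstration (the table loosely follows the A/S/F/N/E/L recoding described above), and `skewness` and `dummy_columns` are hypothetical helper names.

```python
import math
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    m = mean(values)
    s = pstdev(values)
    return mean(((v - m) / s) ** 3 for v in values)


def dummy_columns(levels):
    """Indicator columns needed under reference coding: distinct levels minus one."""
    return len(set(levels)) - 1


# Hypothetical right-skewed gift amounts in dollars (not the course data).
gifts = [1.5, 2, 3, 5, 5, 8, 10, 15, 25, 50, 120, 405]

# The log leaves small values close together and pulls large values in.
logged = [math.log(g) for g in gifts]

raw_skew = skewness(gifts)
log_skew = skewness(logged)

# Hypothetical recoding table: six original levels collapse to three.
recode = {"A": "A", "S": "A", "F": "N", "N": "N", "E": "L", "L": "L"}

# Hypothetical observed status values for ten rows.
status = ["A", "S", "F", "N", "E", "L", "S", "A", "N", "F"]
collapsed = [recode[s] for s in status]

before = dummy_columns(status)
after = dummy_columns(collapsed)

print(f"skewness: raw={raw_skew:.2f}, logged={log_skew:.2f}")
print(f"dummy columns: before={before}, after={after}")
```

With these made-up numbers the logged values are less skewed than the raw ones, and the collapsed variable needs fewer indicator columns than the original. Whether either change actually helps a real model still has to be checked against the misclassification rate and average squared error, as the captions point out.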
Info
Channel: Degan Kettles
Views: 3,804
Rating: 5 out of 5
Id: mYSkTTz8728
Length: 12min 33sec (753 seconds)
Published: Mon Oct 31 2016