SAS Statistics - Logistic Regression (Module 04)

Video Statistics and Information

Captions
Hi everyone, my name is king IV, and this is Introduction to SAS Statistics. In today's lesson we'll be covering logistic regression. If you haven't seen the previous three lessons, I recommend you check them out, and check out my Introduction to SAS lessons as well, just so we're all on the same page.

Logistic regression is a very popular and common topic in statistics. It's also known as classification, and it differs from linear regression because your response variable is not a continuous variable, such as a monetary amount or SAT scores. Instead, the response is a classification: it could be yes or no, or it could be an ordered or nominal set of values, for example red, green, or blue, or small, medium, or large. In today's lesson we'll cover the binary case, so yes or no.

The data set we're working with is the bank marketing data set, I think from a Portuguese banking institution. The goal is to predict who is going to subscribe to a term deposit. You'll see the details of the data set, the number of hits and so on, and here's the data dictionary: the response is binary, yes or no, and it lists the types of data, so it's very helpful.

So let's go ahead and get started. This file is tab delimited, so I'm going to use my handy-dandy import script. Here is proc import, and we're just going to save this to the work library. It asks for the data file path, so I'm going to go in and find the file path; it really depends on where you saved the file, but in my case I've saved it here. OK, perfect, and that's all you need to do. Let's go ahead and run that and check the log. It looks like it ran successfully: 45,211 observations and
17 variables, so quite a rich data set. If we go ahead and take a look at it, you'll see all these different characteristics: the yes or no, the previous, day, campaign, all these different components. But what I'm most interested in is this: is there a model I can build that will help me better predict whether or not someone is going to subscribe to a term deposit? That way I can better focus my marketing and have my people call those individuals, as opposed to just randomly selecting some individuals, and build some science around this.

So keeping that in mind, let's go ahead and do some proc logistic. Here the proc step is proc logistic, and we need to define the data set that we're using, in this case bank. I also want to include some plots. In this case I'm going to get lazy and just put plots=all, so it's going to include all the plots. Sometimes people just do effects or just odds ratios, but we'll leave it like that.

Next, for our non-continuous variables, we have to define them in a class statement, just like we have to do in other models and other procedure steps. In this case I want to use marital, and then here I'm going to use param=ref, and then I'm going to say my reference, which is my baseline, is going to be single. Put that in quotation marks, followed by the semicolon.

Then I'm going to be using units. I'm going to be using balance and age in my model as well; feel free to use different components. Whenever you use continuous variables in your logistic regression, it's often good to specify an increment. That way you can actually measure and assess it, and you'll be able to see what the actual impact is. So here I set one for balance; otherwise it would analyze the model in $1 increments, which is fine as well, but it's a little bit hard to see on the graphs and appreciate what the actual differences are.

Now we develop our model. In this case the response
variable is demand, and then here I have to define what I'm modeling against. In this case I want to model for yes, where there is a demand loan. Here I'm going to list my variables, marital, balance, and age. I also wanted to produce the odds ratios, and we'll look at the concordant and discordant pairs, which is a very powerful way to assess the model. So that's good, and I'm going to go ahead and run this portion.

OK, so the variable is not called what I had in my head; it's actually called y. It's related to the demand loan, which is why I got confused. OK, perfect.

I'm going to walk through this output, starting at the very top. Obviously I should have put a title, but that's fine. Here it shows your data set, what your response variable is, and the number of response levels. In this case the model type is a binary logit, because there are only two response levels, yes or no. It's going to use Fisher scoring. Then you have the number of observations read and the number of observations used, so it doesn't look like it's dropping any observations, which is good. Even if it did, it would still work out fine, but you should understand why that's occurring.

Then here you have the ordered values, one and two, the responses, and the frequency counts, so you can get an idea of how common a yes is. And here it says it's going to model for y equal to yes. Next are the class levels for marital status: divorced, married, single. You'll see the different design variables. Because there are three levels, you have two design variables, since one level is the baseline; this one is divorced and this one is married, just to give you an idea.

It looks like it did converge and actually create the model, which is good and expected. Then here are the model fit statistics, which tell you whether the intercept-only model is better or worse than the model with the intercept and covariates. In this case we're going to focus on AIC and SC. Basically the premise is that AIC penalizes a model for using more
independent variables, so it actually prefers a simpler model over a more complicated model, while SC punishes not only for the number of variables in your model but also scales with the number of observations, and typically it punishes more than AIC. So you usually see that SC has a higher score than AIC. For both of those measures, the lower the score the better. Comparing the intercept-only column to the one with the covariates, you'll see that the covariates actually improve the model, so that's good.

Then you'll see here the likelihood ratio tests. If your alpha is 0.05, which is a common threshold, you'll see that your model passes. What this is testing is whether or not all the coefficients of the model are zero, which would effectively say that the intercept-only model is just as good.

If we continue on, you'll see each of the different explanatory variables and whether or not they are statistically significant, and based on this they are. You'll see the number of degrees of freedom and their chi-square scores as well. If you go down here, you'll see the analysis of maximum likelihood estimates, and across the board they are all statistically significant, which is good.

Then you'll see here the c statistic, which basically tells you, across the pairs it formed, how many times the model was actually correct versus not correct. In this case it's 59 percent, which isn't the greatest, but considering that yes responses were only about 1/9 of the population, 11%, this is not too bad. You'll also see the percent of pairs that were concordant versus discordant, the number of pairs that were formed, and these other measures to help you better assess your model.

If you go down here, you'll see the units, which for the class levels is just one. For balance and age, because we included units there,
it has 1000 and 5 as the increments. You'll see here, within the 95% confidence interval, as long as it doesn't cross one, so in this case it's either entirely above one or entirely below one, that means it's statistically significant, which is good. The confidence interval doesn't cross this boundary of one, which is the odds ratio that would mean no effect. So you'll see here that this is the odds ratio with 95% confidence limits: you'll see the estimate as well as the range of the confidence limits. That's good; it tells us that these are statistically significant and tells us what the actual impact is.

We'll cover the ROC curve in the next couple of lessons. You'll also be able to see here what the impact of balance is for these different marital statuses. You'll be able to see single, which is this middle line, you'll see the impact of divorced, and, sorry, that's the wrong marital status; this is kind of hard to see because these are different colors. And then you'll see married at the very bottom, along with the predicted probability that the response will be yes. This is at a given age of 40.94, so it's controlling for the age variable.

We're going to go back now, because what you can also do is assess for interactions. If you want to test for interactions, you just need to put this vertical bar in between the variables. You can limit the number of interactions, but we don't have that many variables here, or that many degrees of freedom within the variables, so I'm not too concerned there.

If we go up, you'll see a lot of this is very similar, but when you get down to here, it does these joint tests, so you can see whether or not each term is actually statistically significant. You'll see the impact of balance on marital status, which doesn't look statistically significant. The only interaction that does look
like it is, is age on marital, which makes sense. Balance on age, no, and then balance, age and marital status. So that's good. If we were actually doing a stepwise model selection, which we'll be covering in the next lesson, you'll see that when we run through it, it would probably drop these terms, or not add them to the model, depending on what kind of model you were starting with.

OK, good. You'll see here the estimates for the different components, such as balance and marital status for divorced people and balance and marital status for married people, but all of them are not statistically significant. Compared to the previous model, where the c statistic was 0.589, it's now 0.6605, so the interactions did in fact improve the model. You can also take a look at the AIC and SC, which I believe are lower as well. And you'll see the charts have now updated to take that into consideration; I believe they've moved a little bit higher since the model has improved, which is probably expected.

So that's it for proc logistic and logistic regression. Obviously you can make it more complicated, or you could use just one explanatory variable as opposed to multiple. If you have any questions or comments, feel free to leave them in the comment section below. Don't forget to subscribe, and I look forward to speaking to you next time. Thank you.
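The steps described in the captions can be sketched roughly as the SAS program below. This is a reconstruction from the narration, not the code shown on screen: the file path is a placeholder, the unit increments of 1000 and 5 are taken from the output described in the captions, and the `clodds=wald` option is an assumption about how the odds ratio table with confidence limits was requested.

```sas
/* Import the tab-delimited file into the WORK library.            */
/* The path is a placeholder; point it at wherever the file lives. */
proc import datafile="/path/to/bank.txt"
        out=work.bank
        dbms=tab
        replace;
    getnames=yes;
run;

/* Main-effects model: marital status, balance, and age. */
proc logistic data=work.bank plots=all;
    /* Reference-cell coding with 'single' as the baseline level. */
    class marital (param=ref ref='single');
    /* Report odds ratios per 1000 units of balance and 5 years of age. */
    units balance=1000 age=5;
    /* Model the probability that y equals 'yes'.                  */
    /* clodds=wald is an assumption, not confirmed by the captions. */
    model y(event='yes') = marital balance age / clodds=wald;
run;

/* Interaction model: the vertical bar expands to all main effects */
/* and interactions. Appending @2 after the last variable would    */
/* limit the expansion to two-way interactions.                    */
/* UNITS is omitted here for brevity.                              */
proc logistic data=work.bank plots=all;
    class marital (param=ref ref='single');
    model y(event='yes') = marital|balance|age;
run;
```

To request only some of the plots instead of all of them, `plots(only)=(effect oddsratio)` can be used in place of `plots=all`.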
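The fit statistics and odds ratios discussed in the captions can be written out using their standard definitions. The symbols are not from the video and are introduced here only for illustration: $L$ is the maximized likelihood, $p$ the number of estimated parameters, $n$ the number of observations, $\hat\beta$ the coefficient of a continuous variable, $k$ its increment from the units statement, and $n_c$, $n_d$, $t$ the concordant, discordant, and total pair counts.

```latex
\begin{align*}
\mathrm{AIC} &= -2\log L + 2p \\
\mathrm{SC}  &= -2\log L + p\,\log n \\
\mathrm{OR}_k &= \exp\!\left(k\,\hat\beta\right) \\
c &= \frac{n_c + \tfrac{1}{2}\,(t - n_c - n_d)}{t}
\end{align*}
```

With $n = 45{,}211$ observations, $\log n \approx 10.72$, so SC charges each parameter more than the 2 that AIC charges, which is why SC comes out higher than AIC for the same model. An odds ratio confidence interval that excludes 1 corresponds to a coefficient whose interval excludes 0.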
Info
Channel: SAF Business Analytics
Views: 34,335
Rating: 4.769784 out of 5
Keywords: SAS Institute (Business Operation), Computer Science (Field Of Study), Analytics, Advanced Analytics, SAS TUTORIAL, Regression, Box Plot, ANOVA, Model Selection, Logistic Regression
Id: QtiVi20PK10
Length: 13min 35sec (815 seconds)
Published: Sat Sep 19 2015