Logit and probit in SPSS and SAS

Captions
Okay, welcome to my tutorial on logit and probit models. This time I'm going to show you something from statistics, or econometrics. I hope you know something about these models, so I'm not going to focus on the very basic things; I'll just briefly recall some necessary things like notation and so on.

So I'll start with notation. Let X be a vector of parameters and B a vector of coefficients. Basically, what logit and probit models are about is estimation of the coefficients that enter the model. From now on I will not make the letters bold, but I assume you know that it's a vector, so if I write B, it means this vector of coefficients.

Both logit and probit models assume that there is a binary indication of some event; that means the event either occurs or not. So there is some binary variable Y that can be either 0 or 1. You can think of Y as, say, zero means survival and one means death, or zero means no default and one means default, and so on.

We are interested in the probability that comes from logit or probit. For logit, the probability P, which is equal to the probability that Y is equal to 1 (the probability of death of the patient, or the probability of default, or something like that), is equal to 1 divided by 1 plus e to the minus X transpose times B. So this is basically the dot product of two vectors, parameters and coefficients.

Probit is related to the normal distribution. In probit, the probability that the event happens is equal to the integral from minus infinity to X B of 1 divided by the square root of 2 pi, times e to the minus one half z squared, dz. So here you can identify the normal distribution. I can also write it as N of X B, where N stands for the cumulative distribution function of the normal distribution. So these are the basic relations for
logit and probit. In logit we model the probability of an event happening using this expression, and in probit we model the probability of some event happening using this expression. What we are going to do is estimate the parameters.

I have a sample spreadsheet over here; I hope you can see the data. Let me just rearrange this a little bit. I have a data set where in the first column I keep information about the income of some individuals, and in the second column I keep information about the age class. It's not important what exactly the age class represents, it can be young people or old people; it's just a variable that I have created. In each of these groups (this is one group, this is the second group, this is the third group) we have several elements, and of these elements some default.

You can see that, for instance, in the first group there are three elements, all with the same characteristics: all of them have 25 thousand USD income and they are all in the first age group. Among these three, two defaulted. And so on. As you can see, in the last group, for instance, there are five individuals but none defaults. It might be caused by the age group, or it might also be caused, and we assume that it is caused, by their higher income, because we assume that with higher income there is a lower likelihood that they default. That's the basic idea of this data set.

Moreover, I have the same data set here, but ungrouped. This is because some of the software I am using only accepts the data in this form. So from the grouped data I have to create this sample sheet. It copies exactly what's written here: if there are three elements with these characteristics, we have three elements over here, and we have an indicator of whether they defaulted or not. Two of them defaulted, so the first defaulted, the second defaulted, and the
third hasn't defaulted. Likewise I have the data for the whole sample sheet. Now I'm going to try to estimate the parameters here, so that I can define the probability that some debtor with some income and of some age group defaults.

I intentionally use age group, not the exact age, because sometimes you are required to use categorical variables, and it's important to know how you incorporate categorical variables into the resulting equation for the probability. You cannot use the same approach for a quantitative, or continuous, variable as for a categorical one. Moreover, if you use a categorical variable you can benefit from it: if you know that some variable can only take the values 1, 2 or 3, it gives you extra information, because you can constrain the range of these values. So you really get a better estimation, because you know that you will never have a group number 4 or something like that.

Okay, fine, that's it; let's start with SAS. I will start with logit. I open the sample sheet now. I'm not sure whether it accepts the grouped data or the ungrouped data, but I think it's ungrouped, so it's the long list of individuals. Finish. Okay, so this is exactly what we have in Excel.

If we want to apply logit, we have to go to Analyze, Regression, Logistic Regression. This part is crucial. Our dependent variable is whether the debtor has defaulted or not, so it can only be the indicator 1 or 0. Then we have quantitative variables. Assume that a quantitative variable is the same as a continuous variable, so income is a continuous variable because it can take any value, any positive value to be more specific. Then we have a categorical variable; this is named a classification variable here, so put it into the classification variables. And here use reference; this is the way
or the parameter or option that affects how the classification variables are treated in the model.

Here we use the binary model, which means that either zero or one can occur. We are using logit, or rather we will try to calibrate logit, and fit the model to level one. That means that the parameters the model will output are predetermined to model the probability that the debtor defaults. We are not going to model the probability that the debtor survives, but that the debtor defaults. However, it's complementary, so there is no difficulty in determining the probability of survival if you know the probability of default.

Effects: here put everything into effects. This means that you include these variables in the model. Include intercept means that we include a constant in the model. Selection: keep it like that. Here you can choose some advanced statistics if you want; usually, if you want to know something about goodness of fit, you choose the Hosmer and Lemeshow goodness of fit or something else. My task today is not to go through all these features, but to roughly show you how you can get the parameters out of the model. Plots: choose if you want. Predictions: you can also choose if you want. And that's basically all. Hit Run; it will process the data for several moments, and eventually we get the result.

So here we have the overview of the results. These are some basic statistics. This is related to the classification of the categorical variable; I will get back to this later. These are some fit statistics results. What interests us is basically this table, because here we get the estimation of the parameter; this is the B. What's also important is the significance of the parameter in the model. This is not a good significance, but for the purpose of this video I'm just dropping the assumption that this column is important. Normally you would
like to see numbers less than or equal to 0.05 under standard conditions. So this table is really the only important thing here for us now. This is the SAS estimation result, and now we are going to do the same in SPSS.

So here we are in SPSS. We open the same data set, logit probit. Ah, SPSS doesn't like that something else is accessing this file. Let me open that. Here we're going to choose, let's say, also the ungrouped data.

In order to estimate the parameters for the logit model we have several options here. This time I'll just use one specific option: I'll choose Binary Logistic. Here we have the following options. We have covariates, so these are the independent variables, which are income and age classification, and the dependent, which is "has defaulted", which is 0 or 1. Here we want to say that age classification is a categorical variable. How do we do that? Here's the button, and what you do is put this variable into the categorical covariates. Here keep Indicator and Last, and hit Continue. These are some other statistics if you want; this is the already mentioned Hosmer-Lemeshow goodness of fit. And you want the constant to be in the model. Hit OK, and here we have the outputs.

Let me first verify that the results are the same in both SAS and SPSS. So B is minus point zero point 54; what do we have in SAS? Yes, we got exactly the same numbers here. That means we have performed the configuration correctly, and now we can roughly go through the output.

Here is some basic overview, the number of individuals. Here is a table that is exactly the same as the table I mentioned in SAS; I will discuss this table shortly. Then we have the so-called beginning block. The beginning block is a model where there is only a constant; there are no other parameters. What SPSS does is compare the beginning block, with only a constant, with the model that has some covariates, and you can see if the model improved when you
include the covariates in the model. So here, with the constant only, the model can correctly estimate 67.7% of cases, or individuals. Here, for block one, this is our model with parameters; we can see that now that the model includes the parameters, it can correctly estimate about 77.4% of cases. And here we have the same table as in SAS for the estimated parameters.

So we now have this output. Let me go to Excel, because we would like to know what probability of default the model predicts for our debtors. Say we would like to know the probability of default for the first debtors. Like I said at the very beginning, for logit we use the first equation.

What I do first is copy the estimated parameters. This is the table from SAS, but you can also use the very same table in SPSS. I'll copy it in here, and once again paste it as values, because I don't like the formatting. I now also drop the unused columns.

The equation here says that once we know the coefficients, so this is basically B here, we can, based on the parameters, which are income and age classification, estimate the probability of default. That would be easy, because we know the variables and we also know the coefficients. But how do we deal with the age classification? What does it want to say to us, this age classification number one and number two, when we know that we have three groups of age classification?

Well, now it's time to go back to this table. This table is trying to say that if age classification has value 1, then you apply coefficient number 1; this is number 1. If age classification is number 2, which is here, then you apply variable number two, because it's the second column. And if age classification is number three, then you just
simply include zero. So if the age classification is number three, the value of the parameter is zero. If it's number one, the value of the parameter is this; if it's two, the value of the parameter is this; and if it's three, the value of the parameter is zero.

But the important thing, and I highlight this to make it clear, is that normally, according to this equation, you use the relevant X times the relevant B, but for these you don't apply an X anymore; you just use this number. Let me show you.

Here we would like to know, for each of these age classifications, the value of the parameter, because we know that for one it's this, for two it's this, and for three it's this. So we create a column called, say, age coefficient, and we try to look up the value which is valid for the value in this column in this table. So I apply VLOOKUP: we want to see the value related to this column. Let me just erase this one. Once again, you look it up in this table, return the second column, and you want an exact match. So in the first case we expect here the number 1.2171. I think everything is working correctly; for two we have the value which is in here.

Now we can basically set this up. XB is equal to the intercept, which is the constant, plus the income times the parameter, oh sorry, not the parameter but the variable, just like this. And like I said, you only include the value of the coefficient; you don't include this column anymore. Like this. I hope everything is set up correctly.

Now I just express the probability. The probability of default is equal to 1 divided by 1 plus the exponential of negative XB. I can see a nice result here: as we expect, for the bad debtors here, with low income and low age, we have a very high probability of default. On the contrary, for individuals with high income we have a very low probability of default. So hopefully our model is working correctly. Okay, so that's basically it, and we
can do the same for probit.

If you want to do the same for probit in SAS, we go to this data set and hit Analyze, Regression, Logistic Regression. "Has defaulted" is the dependent variable, income is the quantitative variable, and age classification is the classification variable. Here it says reference; the option here basically has an effect on, where is that, on this table. Normally I just choose reference. Now go to Model, select probit, and again select to fit the model to level one, because we are going to estimate the probability that the debtor defaults. Effects: put all the variables in as main effects. Model selection, options, plots, titles, properties: okay. We get the result, and again we have the set of coefficients that are crucial for us here.

But now the tricky thing: how do we do this in SPSS? One would normally go to Regression, and where we see Binary Logistic here, we would choose Probit. But the thing is not that simple. Why not? Because here we do not see the same boxes as before. Here we see Response Frequency and Total Observed, and we also have no categorical option. So we basically cannot use this data set, in the structure that it has here, to get the probit model.

So I cancel, go to Open, and I just choose the second list, the grouped data list, from here; that will satisfy the needs of SPSS to perform probit. I save this one. Okay, wait, no: open logit probit, and I'm going to use the grouped data.

So here we have the data set in grouped form. Go to Analyze, Regression, Probit. What is Total Observed? It's the total number of elements within each group. What is the Response Frequency? That's the number of defaulted elements, isn't it, because you can see the frequency of default here. Then we also have the independent variables here. But the problem is that there is no categorical variable option for the age classification. So what we have to do is settle for the result without
categorical variables. And here we have the result. You can see there is no age classification number one or age classification number two, and that's because we haven't been using categorical variables. So now SPSS deals with age as technically the same kind of variable as income, as a continuous variable.

Now we're going to check the output directly. I go to the data set. So this was logit; put this away. Here is the space for probit. What do we have and what do we need? We have the set of coefficients and we are going to model the probabilities.

First I'll copy the output from SAS, clear this away, and erase the extra information. SAS uses age as a categorical variable, so we will apply basically the same approach as before. The age coefficient is equal to VLOOKUP: we match this value, find it in here, return the value of the second column, and we want an exact match. I forgot to put three, zero here; that means I had to extend the table here. Okay, that's it. As for the rest, XB is equal to the intercept, the constant, plus income times the income coefficient, plus the age coefficient. And don't forget that you don't multiply the age coefficient by the value of the column here, because it's already defined in here, because it's a categorical variable. Once you have XB, your probability of default is the cumulative normal function, and you can see that the numbers are quite close but not the same. That's basically our comparison of logit and probit.

But as you remember, we were also trying to apply probit using SPSS, which is here. What's the output over here? We were unable to incorporate the information that age is a categorical variable, so I'll just use the values here. I hate the formatting. Clear this away. So this was SAS; let me put a comment here: categorical variable. The last part here will be for SPSS, where the variable age is not categorical but treated as continuous. So let me
copy the values here. We have no age coefficient, because age is treated as continuous. XB is the intercept plus income times the income coefficient plus age times its coefficient; now you can see that I include the column here in this equation. You can see that the output differs slightly, but not that much. And here we have the probability: NORMSDIST, the standardized normal cumulative distribution function. This is nothing extra. So you can compare the outputs. This is probit in SPSS, where age is continuous. So there is a slight difference.

However, we are interested in which model performs best. For that reason I need the grouped data table. I put here the probability that Y is equal to one, that is, default; well, that's equal to defaulted divided by the total. We would like to get as close as possible to these numbers. However, there are also some weights, because there are more elements here than, say, in here, so this observation has more weight. So let me put here the weight of the group, which is equal to the elements of the group divided by the sum of all elements in the sample. And I'm going to apply a weighted R squared.

But first I have to put the data from the other sheet in here. Wait a second, I'll do it and be right back. I'm back. I just copied the three columns with the probabilities from that sheet into this one, and now I need to assign the probabilities from here to here so that they match. For the first group I know it's basically this one. I could match on age, because I know that for every age the probabilities are the same; no, it's not age, it's income that matters. But I know that I have the same numbers here, so I'm going to use VLOOKUP on this value here and return the value of the second column. That's it. Likewise I do the same here, but I return the third column, and here I return the fourth column.

In order to calculate R squared I have to do the following. The R squared statistic is equal to 1 minus
the error of the estimation divided by the total variation in the sample. In our case that means R squared is equal to one minus the sum, and it should be weighted, of w i times y hat i minus y i observed, squared, divided by the sum of w i times y i observed minus the average value, or the mean value, of y, squared.

The mean I don't have now, so I need to calculate it. It's the weighted average of the probability of default; I just calculate the dot product to get the weighted average probability of default. Here I can calculate the numerator and denominator values; these are going to be these, and then I sum them. The numerator is the weight times the estimate minus the observation, lock it like this, squared. I forgot to lock the weight. The denominator will be the weight times, now we are computing this expression, the observation minus the mean value, squared. So basically this is the same expression. Now I'm going to hit the sums, and R squared is equal to 1 minus the numerator divided by the denominator.

Okay, [unclear], 0.84. So probably we would conclude that the first model, which was logit, describes best the data set we were given. And that basically concludes the video. Bye-bye.
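The spreadsheet steps described in the captions (look up the coefficient for the age class, build XB, turn it into a logit or probit probability, then score each model with a weighted R squared) can be sketched in Python. This is a minimal sketch and not the presenter's workbook. All function names are made up for illustration, and every coefficient value is a hypothetical placeholder, not an estimate from the video. Of the grouped data, only the first group (3 individuals, 2 defaults) and the last group (5 individuals, 0 defaults) come from the captions; the middle group is invented.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, the N(.) in the probit formula (NORMSDIST in Excel)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def linear_index(income, age_class, intercept, b_income, age_coefs):
    """XB with a dummy-coded age class.

    The coefficient of the age class is added as-is (the VLOOKUP step);
    it is NOT multiplied by the class number. The reference class adds 0.
    """
    return intercept + b_income * income + age_coefs.get(age_class, 0.0)


def logit_prob(xb):
    """P(Y = 1) = 1 / (1 + exp(-XB))."""
    return 1.0 / (1.0 + math.exp(-xb))


def probit_prob(xb):
    """P(Y = 1) = N(XB)."""
    return normal_cdf(xb)


def weighted_r2(observed, predicted, counts):
    """1 - sum(w * (yhat - y)^2) / sum(w * (y - ybar)^2), with w = group size / sample size."""
    total = sum(counts)
    weights = [c / total for c in counts]
    mean = sum(w * y for w, y in zip(weights, observed))  # weighted mean default rate
    num = sum(w * (p - y) ** 2 for w, p, y in zip(weights, predicted, observed))
    den = sum(w * (y - mean) ** 2 for w, y in zip(weights, observed))
    return 1.0 - num / den


# Hypothetical coefficients (placeholders only). Class 3 is the reference class.
AGE_COEFS = {1: 1.2, 2: 0.5, 3: 0.0}
INTERCEPT = 2.0
B_INCOME = -0.05  # per thousand USD of income

# Grouped data: (income in thousand USD, age class, group size, number of defaults).
# The middle row is invented for illustration.
groups = [(25.0, 1, 3, 2), (40.0, 2, 4, 2), (90.0, 3, 5, 0)]

xb = linear_index(25.0, 1, INTERCEPT, B_INCOME, AGE_COEFS)
p_logit = logit_prob(xb)
p_probit = probit_prob(xb)

observed = [d / n for _, _, n, d in groups]
counts = [n for _, _, n, _ in groups]
predicted = [
    logit_prob(linear_index(inc, cls, INTERCEPT, B_INCOME, AGE_COEFS))
    for inc, cls, _, _ in groups
]
r2_logit = weighted_r2(observed, predicted, counts)
print(p_logit, p_probit, r2_logit)
```

In a real comparison the logit and probit models would each have their own estimated coefficients; the same placeholder set is reused here only to show the mechanics. A model whose predictions equal the observed group default rates scores exactly 1, and a model that always predicts the weighted mean default rate scores 0.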
Info
Channel: Marek Kolman
Views: 73,078
Rating: 4.2666669 out of 5
Keywords: Logit, probit, SPSS, SAS, categorical variable
Id: TxwvDgoPKHE
Length: 54min 2sec (3242 seconds)
Published: Sat Sep 01 2012