Logistic Regression for Excel

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hi this is wagner kamakura and i'm going to show you how you can run logistic regression using my analytic tools for excel but before doing that i want to give you the intuition behind logistic regression at this point i assume that you're already familiar with linear regression a linear regression is somewhat related to logistic regression because both of them produce a linear score as a function of predictors the main difference is that in linear regression this linear score is a direct prediction of the dependent variable you have a continuous dependent variable and the score is already an estimate of the dependent variable or a prediction for it in logistic regression the linear score measures the propensity for the event for a binary event or its log odds so in this case the score is equal to the log of the probability that the event will happen over the probability that it won't happen and by the little math you will find out that the probability of the event is equal to the exponential of the score over 1 plus the exponential of the score and we're going to see how this works using a fairly simple example in this example we have 768 female patients they were tested for diabetes and the dependent variable is a result of that test whether the patient was diagnosed as a diabetic one if yes or zero if not in the predictors explaining that propensity would be the number of pregnancies glucose level blood pressure skin thickness insulin body mass index diabetes the degree function and age before doing anything let's run a regression just a simple linear regression using excel go to data analysis linear regression or regression and our dependent variable will be has diabetes one or zero our predictors our input range will be pregnancies glucose etc down to h we have labels and we're going to save the regression coefficient on a new sheet and here we have the linear function the coefficients for the linear regression i'm going to insert two rows to copy the regression coefficients then i'm going to use as a starting point to build our logistic regression here we have the intercept and the weights for each one of the predictors so i'm going to compute this score here the score for each patient the score for each patient will be equal to intercept plus sum product of the weights regression coefficients and the predictors for each patient because we want to copy this formula all the way down for every one of the patients i'm anchoring the coefficients so that the same row 2 will be applied to every patient so there we have the linear score which right now is the linear regression score because we are using the linear regression coefficients now i'm going to calculate the probability of diabetes as x of the score over 1 plus x all the same score first patient has a 65.57 probability of having diabetic now we are going to calculate the likelihood not the likelihood of having diabetes but the likelihood associated with the actual event whether the person has or not if the person has diabetes which is the case for the first patient then the value of the likelihood should be the probability on the other hand if the patient does not have diabetes the probability should be one minus the probability of having diabetes so that's the likelihood for us to see the difference between having diabetes and not have it in the second case the patient did not have diabetes and the probability of having diabetes 0.501 so the likelihood of the event the event is not having diabetes is a complement of that 0.498 and we're going to calculate the log likelihood which is just the log of the likelihood now we can copy these formulas for every patient each one of the 768 patients so we know that these coefficients here are the regression coefficient linear regression coefficient but we wanted to estimate the logistic regression coefficients we wanted to find what are the coefficients that make the likelihood of this sample in other words the product of these probabilities here the largest possible but we are dealing with probabilities so when i multiply 768 probabilities that go from zero to one that product is likely to be very small that's why we use log likelihood so that now instead of multiplying the likelihood i can calculate the sum of all the likelihoods now we can find what are the weights those weights here that i'm highlighting what should be those weights so that the sum of log likelihoods is the largest possible in other words what will be the set of weights that will make the observed events most likely and we can do this using solver so what is our goal here we want to maximize the log likelihood which is on cell n1 we want to maximize and we want to do this by varying the range that i had highlighted before these cells here we wanted to find what is the combination of weights intercept pregnancies good calls etc etc that would produce the highest log likelihood and we don't want the coefficients to be non-negative right we wanted the coefficients to be anything they want to be so we do this and runs over and you can see that the likelihood now is not as negative went up from minus 500 and something to minus 361. so now this score here is the logistic score so those are the scores that measure each patient's propensity of being diabetic they are the long odds of being diabetic based on the patient's number of pregnancies glucose level blood pressure skin thickness etc etc so these now are our logistic regression coefficient these are our log odds or the logistics scores and this is the maximum log likelihood to see how well the logistic regression predicts the event having or not having diabetes let's copy this range here all the data up to our score to a new sheet i'm going to create a new sheet and our goal is to see how well this score discriminates between people who have diabetes and people who do not have diabetes i'm going to only focus on that variable dependent variable and the score and again the score is our propensity score the score that measures the patient's propensity to have diabetes let's sort this data in decreasing order of the score so since this score was computed through a logistic regression this score measures the propensity of the patient with the highest propensity of being diabetic is patient number 446 the second highest propensity is patient 194 etc itself so let's calculate the cumulative number of patients the cumulative number of diabetic patients and so on and here let's just count the number of patients so we have in this column here the propensity ranking on column e and on column g we have the count of diabetic patients so for example if we take the top 20 patients with the highest propensity score they include 16 of the diabetic patients so let's transform those two columns into percentage so here is the cumulative percentage of diabetic patient and here we have the cumulative percentage of patient this number is going to be cell d2 over the last cell d76 and let's make sure that we encode and this cell is going to be e2 over b769 and let's make them look like proportions or percentage so now we can see how well the propensity score isolates diabetic patients from non-diabetics let's see the top 20 percent of patients with the highest propensity include 47 of all the diabetic patients it shows that the propensity score really helps identify diabetic patients from non-diabetic and we can see this by plotting those two columns in a scatter plot now we can see in this chart how well the propensity score separates diabetic patients from non-diabetic patients the top half of patients with the highest propensity score account for more than 80 percent of all diabetic patients and we want to quantify that discrimination we can draw this line here if you measure the area between the curve in the diagonal and the total area of the chart that's a number called the gini coefficient gene coefficient shows the extent to which a high percentage of the highest propensity patients account for a large percent percentage of the patients with diabetes or you can evaluate this chart directly so the top 70 of patients with the highest propensity score account for 95 of all diabetic patients top 40 percent of the highest propensity score patients account for 75 of the diabetic patients now let's see how you run a logistic regression using my analytic tools for this let's open my examples there you're going to find a tab with an example data sample for logistic regression in running a logistic regression with my tools is fairly simple as long as the data is in the right format the first column should have an id for each observation the second column should have the dependent variable coded as zero for no one for yes and then every other column should contain the predictors and the first row should have labels for each one of the columns and as long as your data has a simple format uh running it is straightforward you go to analytic tools lucific regression you can do the analysis for up to 80 000 rows and up to 60 predictors the data is automatically highlighted but if it's not you just highlight the range that you want to analyze we're reading 5 50 rolls we have these predictors available and i'm going to run an analysis using transactions per month profit income whether the customer has a savings account credit card account and atm card and we are trying to predict the likelihood of the customer having a certificate of deposits click ok and you can either do a straight estimation or calibration using a hundred percent of the sample or you can do an automatic holdout test a predictive validity test by holding out between zero to fifty percent of the data i'm going to hold out 30 percent of the data in other words we are going to use 70 of the data for calibration randomly picked and the other 30 percent will be used for a predictive validity test ok and there we have it we have here the log likelihood the consistent ikaiki information criteria is something that you should google if you want to know about it it's saying here that we're holding out 30 of the cases for validation here's the genie coefficient that i mentioned earlier in the calibration sample and in the validation sample typically the fit tends to be lower in the prediction because after all we are trying to predict the unknown auc is the area under the curve calculated under the roc curve receiver operating characteristics curve and something also that you should google if you want to know more about it is closely related to the gini coefficient but the way the chart is defined is different so we calibrated the logistic regression on 38-29 observations and we are validating it on 17 21 observations and here are the logistic regression coefficients and the t value for each one of them and that's pretty much how you run a logistic regression using my analytic tools for excel if you're interested in running a logistic regression you should know that kate kamakura's analytic tools for excel has a new tool called predictive stepwise logistic regression this tool will automatically select predictors for a logistic regression in a way that maximizes predictive fit in holdout samples if you're interested watch the tutorial on predictive stepwise logistic regression for excel in this channel but i'll give you a very brief preview of how this tool works my idea here is to select predictors that will improve the performance of your model for predictions rather than fit to the calibration sample so in order to do that we split the sample into hold out p percent of your total sample will be randomly chosen for a holdout and the remainder one minus p will be used for calibration and this is done randomly then if you have a candidate predictor and you want to know should this predictor be included in the model or not you'll run the logistic regression in the calibration sample cathy regression coefficients then apply these regression coefficients to the holdout sample calculate the predicted log likelihood and sum the predicted log likelihood across all observations in the holdout sample and if the inclusion of this predictor improves the log likelihood that predictor is a good candidate now which particular predictor should you include you should include the one that produces the highest increment in log likelihood and if none of the predictors improves the predicted log likelihood you know that you already have the best possible model so you stop search and this is done multiple times instead of doing this on a single holdout sample you can choose to replicate this process for as many as 10 random draws random draws from your observat sample split into code out and calibration samples so that's pretty much how it works now this is a new tool available in kate and there is a tutorial for it in this channel if you're interested in running a logistic regression just take a look at my tutorial on predictive stepwise logistic regression for excel
Info
Channel: KamakuraAnalyticTools
Views: 10,894
Rating: 4.8165135 out of 5
Keywords: Excel, Logistic Regression, Business Analytics
Id: xKA-fOCENaY
Channel Id: undefined
Length: 17min 59sec (1079 seconds)
Published: Tue Oct 20 2020
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.