Logistic Regression for Excel

Video Statistics and Information

Video

Captions Word Cloud

Reddit Comments

Captions

hi this is Wagner kamakura and I'm going to show you how you can run along gistic regression using my analytic tools for excel but before doing that I want to give you the intuition behind logistic regression at this point I assume that you are already familiar with linear regression a linear regression is somewhat related to logistic regression because both of them produces a linear score as a function of predictors main difference is that in linear regression this linear score is a direct prediction of the dependent variable you have a continuous dependent variable and the score is already an estimate of the dependent variable or a prediction for it in logistic regression the linear score measures the propensity for the event for a binary event or its log odds so in this case the score is equal to the log of the probability that the event will happen over the probability that it won't happen remember the little math you will find out the probability of the event is equal to the exponential of the score over 1 plus the exponential of the score am I going to see how this works using a fairly simple example in this example we have 768 female patients they were tested for diabetes and the dependent variable is a result of that test whether the patient was diagnosed as a diabetic one if yes or zero if not and the predictors explaining that propensity would be the number of pregnancies glucose level blood pressure skin thickness insulin body mass index diabetes pedigree function and age before doing anything let's run a regression just a simple linear regression using Excel go to data analysis linear regression regression and our dependent variable will be has diabetes 1 or 0 our predictors or input range will be pregnancies glucose etcetera down to H we have labels and we are going to save the regression coefficients on a new sheet and here we have the linear function the coefficients for the linear regression I'm going to insert two rolls to copy the regression coefficients then I'm going to use as a starting point to build our logistic regression here we have the intercept in the weights for each one of the predictors I'm going to compute the score here score for each patient the score for each patient will be equal to intercept plus some product of the weights regression coefficients and the predictors for each patient because we want to copy this formula all the way down for every one of the patients I'm anchoring the coefficients so that the same Row 2 will be applied to every patient so there we have the linear score which right now is the linear regression score because we are using the linear regression coefficients now I'm going to calculate the probability of diabetes as X of the score over 1 plus X for the same score first patient has a 65 point 57 probability of having diabetes now we are going to calculate the likelihood not the likelihood of having diabetes but the likelihood associated with the actual event whether the person has or not if the person has diabetes which is the case for the first patient then the value of the likelihood should be the probability on the other hand if the patient does not have diabetes the probe will should be one minus the probability so that's the likelihood for us to see the difference between having diabetes and not have it in the second case the patient did not have diabetes and the probability of having diabetes point 501 so the Michael who does the event the event is not having diabetes is the complement of that 0.49 a and we're going to calculate the log likelihood which is just in log of the like load now we can copy these formulas for every patient each one of the 768 patients so we know that this coefficients here are the regression coefficient linear regression coefficient but we wanted to estimate the logistic regression coefficients we wanted to find what are the coefficients that make the likelihood of the sample in other words the product of these probabilities here the largest possible but we're dealing with probabilities so when I multiply 1768 probabilities that go from 0 to 1 that product is likely to be very small that's why we use log likelihood so that now instead of multiplying the likelihoods I can calculate the sum of all the likelihoods now we can find what are the weights those weights here what should be those weights so that the sum of like my clearance isn't largest possible in other words what would be the set of weights that will make the observant events most likely and we can do this using solver what is our goal here we want to maximize the log likelihood which is on cell n1 to maximize we want to do this by varying the range that I had highlighted before this cells we wanted to find what is the combination of weights intercept pregnancies glucose etc that would produce the highest log likelihood and we don't want the coefficients to be no negative we want the coefficients to be anything they want to be so we do this and run solver and you can see there is a likelihood now it's not as negative went up from minus 500 and something to minus 361 so now this score here is the logistic score so those are the scores that measure each patients propensity of being diabetic they are the log odds of being diabetic based on the patient's number of pregnancies glucose level broad pressure skin thickness etcetera etcetera so these now are our logistic regression coefficient these are our log odds or the logistic scores and this is the maximum likelihood to see how well the logistic regression predicts the event having or not having diabetes let's copy this range here all the data up to our score to a new sheet going to create a new sheet and our goal is to see how well this score discriminates between people who have diabetes and people who do not I'm going to only focus on that variable dependent variable and the score and again the score is our propensity score the score the measure is the patient's propensity to have diabetes let's sort this data in decreasing order of the score so since the score was computed to in logistic regression this score measures the propensity of the patient with the highest propensity of being diabetic this patient number four four six the second highest propensity is patient 194 etc so let's calculate the cumulative number of patients the cumulative number of diabetic patients and so on and here let's just count the number of patients so we have in this column here the propensity ranking and call you me and oncology we have the count of diabetic patients so for example we take the top twenty patients with the highest propensity score they include sixteen of the diabetic patients so let's transform those two columns into percentage so here is the cumulative percentage of diabetic patient and here we have the cumulative percentage of patient this number is going to be cell d2 over the last cell d7 69 and let's make sure that we anchor and this cell is going to be e YouTube / e 769 and let's make them not like proportions or percentage so now we can see how well the propensity score isolates diabetic patients from non diabetics let's see the top 20% of patients with the highest propensity include 47% of all the diabetic patients it shows the depression score really helps identify diabetic patients from non-diabetic and we can see this by plotting those two columns in the scatter plot now we can see in this chart how well the propensity score separates diabetic patients from non diabetic patients the top half of patients with the highest propensity score account for more than 80% of all diabetic patients and we want to quantify that discrimination you can draw this line here if you measure the area between the curve in the diagonal and the total area of the chart that's a number called the Gini coefficient Gini coefficient shows the Excel to which a high percentage of the highest propensity patients account for a large percent percentage of the patients with diabetes or you can evaluate this chart directly so the top 70 percent of patients with the highest propensity score account for 95% of all diabetic patients top forty percent of the highest propensity score patients account for 75 percent of the diabetic patients now let's see how you run a logistic regression using my analytic tools for this let's open my examples there you're going to find a tab with an example data sample for logistic regression in running a logistic regression with my tools is fairly simple as long as the data is in the right format the first column should have an ID for each observation the second column should have a dependent variable coded as zero for no one for yes and then every other column should contain the predictors in the first row should have labels for each one of the columns and as long as your data has a simple format running it with straightfoward go to analytic tools leucistic regression you can do the analysis for up to 80,000 rows and up to 60 predictors the data is automatically highlighted but if it's not you just highlight the range that you want to analyze we're reading 50 550 rows we have these predictors available and I'm going to run an analysis using transactions per month profit income whether the customer has a savings account credit card account an ATM card and we are trying to predict the likelihood of the customer having a certificate of deposits click OK and you can either do a straight estimation or calibration using a hundred percent of the sample or you can do an automatic holdout test at reddit ability test by holding out between 0 to 50 percent of the data I'm going to hold out 30 percent of the data in other words we are going to use 70 percent of the data for calibration randomly picked and the other 30 percent will be used for a predictive validity test okay and there we have it we have here the log-likelihood the consistent like information criterion is something that you should google if you want to know about it it's saying here that we are holding out 30% of the cases for validation here's the Gini coefficient that I mentioned earlier in the calibration sample and in the validation sample typically the fifth tends to be lower in the prediction because after all we are trying to predict the unknown LC is the area under the curve calculated under the ROC curve receiver operating characteristics curve that's something also that you should go go you want to know more about it is closely related to the Gini coefficient by the way the chart is defined as different so we calibrated the logistic regression 38:29 observations and we are validating it on 17:21 observations and here are the logistic regression coefficients and the T value for each one and that's pretty much how you run a logistic regression using my analytic tools for Excel

Info

Channel: KamakuraAnalyticTools

Views: 24,782

Rating: 4.9000001 out of 5

Keywords: Business Analytics, Add-ins, Excel, Analytics, Logistic Regression

Id: 8R9byhGAYM4

Channel Id: undefined

Length: 15min 31sec (931 seconds)

Published: Fri Aug 31 2018