Logistic Regression Using Excel

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
in this video I'm going to demonstrate how you can perform logistic regression using Microsoft Excel to begin go to the real statistics website and download the real statistics resource pack then also go to the kaggle website and download the training data set for the titanic competition next I'm going to install the real statistics resource pack and open the dataset downloaded from keidel now the real statistics resource pack requires that you put the variable you want to predict on the right side and in this case we want to know who survived and who died so we move that variable we can call that our dependent variable or our target variable to column L all the way on the right also we the real statistics resource pack the algorithm does not automatically recode variables so we have to do that manually and I'm going to remove several columns just to keep it simpler and just recode the sex variable so I'm going to change this variable so that if it's male and make it zero otherwise we'll make it a one and if you wanted you could do that for some of the other variables such as the embarked I think variable that tells which port the passenger embarked from but for now I'll just do sex call it sex recoded and copy and then paste as values so now instead of a formula here I have values because I copied and pasted as values so now I can delete the original column and I have a new column that just says 0 or 1 0 for male one female now I also notice that there are some missing values for age that's not good because if I just use the data as is it would I think these would be treated as zeros so I'm going to put the mean value in the cases where it's blank if I was going to be doing this really properly I would also probably create a dummy variable that would be called age missing that would be a 1 if that the value of missing in a 0 if it's not so that if there's a difference if there's anything unusual or any informational value in the missing values then I could include that also in the logistic regression model but just to keep it simpler for now although the average is 29 point 7 so I'm going put that in for all the blanks okay so now I have a data set with only numeric values as my independent and dependent variables so I'm ready to run the logistic regression I click Add Ins real statistics data analysis tools and select logistic regression then I click this button to select the range of values and then fill notice I didn't select column a because that's just telling me what row number I'm on currently it's not valuable so I leave it out I also click new here because I want the results of the regression to go on a new tab another field that's important is the classification cutoff so to understand this you need to realize that logistic regression is predicting a probability that survived will be 1 here and so what I'm saying is the classification cutoff is if the probability is above that cutoff I'll treat that as though it's a survived and if it's below point 5 I'll treat that as though it's a non survived or death if I raise the cutoff that would mean you need a probability of say for example 0.6 to be considered of a survival so now I click OK and I get the following output these variables are simply my independent variables and so they're just copies of these columns up from over here however it should be noted that if there are two or more rows that have the same values for all of these variables they're grouped together so for example any of these rows that have a2 over here this is actually two passengers combined into one row so it kind of rolls them up next look at column G H and I this is telling me was the value of survived 1 or 0 if it's a 1 then we put then we count it as a success if it's a zero it's counted as a and then this is just calm combining the two now the most important column here is probably column K because that actually shows me the predicted probability that the logistic regression is putting for this particular passenger or group of passengers with based on the independent variables so what we can do here is basically say we believe that based on the data the logistic regression is giving us a probability of 0.7 179 3/4 in this case the passenger actually did survive and since our cutoff is 0.5 we're above the cutoff so we consider this a correct prediction now in this row for example we have a probability above 0.5 so we consider that we're predicting a survival because we're above our cutoff however the passenger actually died so this is an incorrect prediction okay so moving on we have the following classification table this shows us of the passengers who actually survived how many did we predict would survive how many did we predict would die so 243 of the survivors were predicted as surviving and 99 of the survivors were predicted as dying of those who died 85 were predicted as surviving and 464 were predicted as actually dying so if we change the cutoff these values will shift if we for example made the cutoff 1 then we'd be predicting nobody survives right and so how many of the survivors we predicted none of them to survive right we basically just say everybody dies right and if similarly if you put it at zero we predict everybody lives so point five is in between and so we see the change and you also see the accuracy for predicting the survivals and the deaths
Info
Channel: Savvy Data Science
Views: 327,617
Rating: 4.7256432 out of 5
Keywords: Logistic Regression, Microsoft Excel (Software)
Id: EKRjDurXau0
Channel Id: undefined
Length: 7min 55sec (475 seconds)
Published: Wed Feb 18 2015
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.