Logistic regression in R

Hello and welcome to this workshop on logistic regression in R. Logistic regression is a really useful tool for modeling data, and it is in a sense the reverse of the kinds of experiments we typically see in the lab. There, the predictor variables are usually factors, that is, categorical (vehicle versus treatment, say), and the dependent variable, the thing we measure as an outcome, is continuous. Logistic regression handles models and experiments that are the other way around: the predictor variables may be continuous, while the outcome variable is categorical. For example, we might measure lots of different things about individuals and use those variables to predict whether a person will survive or die, whether they are positive for an infection or not, or even whether they will default on their mortgage. So in logistic regression the dependent variable is categorical, and the independent variables can be either continuous or categorical; we'll look at both in this tutorial.

Logistic regression can be binomial, meaning the outcome variable has two levels: survive or die, positive or negative, and so on. Another form is multinomial logistic regression, which we'll touch on at the very end, where, as the name suggests, we have more than two categories - three, five, seven, as many as you like. We'll see when we get there that multinomial logistic regression is, at least in my opinion, not always as useful as binomial logistic regression: the more categories you add, the less interpretable the result becomes, and there are other techniques that are probably better at dealing with that kind of model. There's a third kind of logistic regression which we won't talk about today, called ordinal logistic regression, where the categories of the dependent variable have an order to them; for example, we could use it to predict the likelihood that someone will progress from disease stage one to disease stage two, so the order of the levels matters.

Before we carry on, I'd recommend that you've watched at least the first introduction-to-R tutorial, just so the commands we use make sense and you know how to navigate R. Other than that, let's get stuck in.

Let's download some example data, which should be available to you. We're going to create a data frame called mydata using the read.csv function. Something we haven't talked about before is that read.csv can read data files directly from a web server: there's some publicly available data at http://stats.idre.ucla.edu/stat/data/binary.csv. If you hit Ctrl+Enter, R retrieves the data from that server and saves it in an object called mydata, and then we can explore it a little. Basically we have a data frame with four variables: admit, which records whether or not a student was admitted to a particular university (0 means they weren't admitted, 1 means they were); GRE (I'm not entirely sure what this stands for); GPA, which stands for grade point average, a measure of the student's performance; and finally a variable called rank, a ranking system where a rank of 1 means the university the student applied to was the most prestigious, all the way down to 4 for the least prestigious. If we ask for a summary of our data frame we get some summary statistics, and notice that both admit and rank are being treated by R as continuous variables, because we get numeric summary statistics for them.
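As a sketch of what is being typed here - the object name mydata follows the audio, and the file name binary.csv is my assumption, based on the UCLA example data set that matches the variables described (400 observations of admit, gre, gpa and rank):

```r
# Read the example data straight from the UCLA web server
# (binary.csv is an assumed file name for the data set described)
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

head(mydata)     # first few rows: admit, gre, gpa, rank
summary(mydata)  # note: admit and rank are summarised as if numeric
str(mydata)      # 400 obs. of 4 variables, stored as int/num vectors
```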
In fact they are both factors: admit has two levels, 0 and 1, and rank has four levels, 1 to 4. We'll deal with that in a second. We can also have a look at the structure of the data frame, and the additional information this gives us is that R considers these columns to be numeric or integer vectors, and that we have 400 observations of 4 variables.

We want R to consider the rank variable a factor, and I'll explain later why that matters. The way to do this is essentially to overwrite the column: we subset the rank column with the dollar sign and overwrite it with exactly the same data - we don't want to modify the data - but wrapped in the function factor(), which turns the vector into a factor. If we run that and print the vector again, we can see it is now a factor, because R prints out the levels at the bottom: it has four levels, 1, 2, 3 and 4.

For the purposes of this demonstration we're going to alter the data a little, just so the tutorial works out more nicely and makes more sense: we'll take every student who was admitted to the university they applied to and increase their GPA by one. So we take the mydata data frame, subset all the cases for which admit equals 1, and select the GPA variable, which is column number three (count across: one, two, three). That selects the GPA value of every student who was accepted into the college, and we reassign to it the identical value plus one.
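The two steps just described might look like this (mydata and the column position of gpa follow the narration):

```r
# Convert rank to a factor by overwriting the column with itself,
# wrapped in factor()
mydata$rank <- factor(mydata$rank)
mydata$rank  # now prints with "Levels: 1 2 3 4" at the bottom

# Tutorial tweak only: add 1 to the GPA of every admitted student
# (gpa is the 3rd column of the data frame)
mydata[mydata$admit == 1, 3] <- mydata[mydata$admit == 1, 3] + 1
```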
If we print the data out again, I hope you can see that for every admitted student we've added one to their GPA, just to separate the two groups a little for the sake of this tutorial.

Okay, let's plot our data to understand the relationship between GPA and whether or not a student was admitted to their university of choice. I'm going to do this using ggplot, so if you've completed the ggplot tutorial, feel free to follow along; if not, don't worry too much, because this is just to demonstrate exactly what we're doing with logistic regression. We plot mydata with GPA on the x-axis and admit on the y-axis, and add some points. Each dot is a different student, with their grade point average and whether they were admitted: the students down at 0 were not admitted, the ones up at 1 were.

Now, we could model this relationship using a linear regression model, and we can have a go at that in ggplot using geom_smooth, telling it to use the linear model method; I'm also going to tell it not to plot standard-error bands, and to limit the plot so we only see what we need to see. As you can see, a linear model sort of works, in a way: as GPA increases, so does the likelihood that a person is admitted. But the model isn't a very good fit, and it violates some of the assumptions of a linear model, such as homoscedasticity. So let's model this instead using a binary logistic regression model. We can represent that in ggplot by asking geom_smooth for a "glm" method; GLM stands for generalized linear model.
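The two plots could be produced along these lines; the method.args form of passing the binomial family is current ggplot2 syntax and is my reconstruction of what is typed on screen:

```r
library(ggplot2)

# Scatter of admit (0/1) against GPA with a straight-line (lm) fit
ggplot(mydata, aes(x = gpa, y = admit)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  coord_cartesian(ylim = c(0, 1))

# The same plot with a binomial logistic regression curve instead
ggplot(mydata, aes(x = gpa, y = admit)) +
  geom_point() +
  geom_smooth(method = "glm", se = FALSE,
              method.args = list(family = "binomial"))
```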
I'll explain a little about generalized linear models in a second. Again we don't want standard-error bands, and we pass the binomial family through the method arguments. What we're asking ggplot to do is plot our data with a binomial logistic regression curve on top, and this is what we'll be modeling with our logistic regression; you can see it models the data rather better. Basically the way this works is that we build a model, in this case predicting admit from GPA. In this two-outcome scenario, if an individual's value of GPA predicts a value of admit closer to 1 than to 0, we predict that individual would be admitted to college, whereas if their GPA predicts a value of admit closer to 0 than to 1, we predict they would not be admitted.

So let's build our model formally, so we can look at the model coefficients and work out exactly how much more likely you are to be admitted to university for a one-unit change in GPA. I should mention, by the way, that this is completely false data and not based on real relationships. We'll model admit by GPA; we'll look at the other variables in our data table in a second, but for now we'll just focus on GPA. We're going to create an object called gpamodel and assign a model object to it. You'll recall, at least if you did the second introduction-to-R tutorial, that if you want to build a linear model - a regression, an ANOVA, even a t-test - you can do all of these through the lm (linear model) command. Think of linear models as a subclass of generalized linear models: for linear models, the relationship between the outcome variable and the predictor variables is linear, whereas for generalized linear models it can be nonlinear, as in this case with the logistic model. So we build the model using the glm function, but we specify it in much the same way as with lm: we give it our outcome variable, admit, modeled as dependent on GPA (this is our model formula), data = mydata, and then, importantly, the family. There are lots of different generalized linear models - binomial logistic regression, multinomial logistic regression, Poisson regression and so on - and to tell R that we want a binomial logistic regression, we simply say family = "binomial" in double quotes. Hit Ctrl+Enter and we've created our model.

If we have a look at the model using the summary function, we get some output statistics. First we get the call, the model formula we just specified, and then a table of coefficients with some p-values. How do we interpret these? The estimate for the intercept is probably not that interesting, but the estimate for GPA is the log odds: for every one-unit increase in GPA, the log odds of being admitted to university increase by 8.71. But nobody can think in log odds - I can't picture exactly what log odds are - and one reason clinicians like logistic regression so much is that you can derive odds ratios from these log odds. An odds ratio is a much more interpretable way of understanding the relationship between the variables in your model, and I'll explain how to interpret them in a second. We also get a p-value: the relationship between GPA and whether a student was admitted is highly unlikely to be due to chance; it is likely to be genuinely predictive of admission.
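The model-building step as dictated (the name gpamodel is my rendering of "GPA model"):

```r
# Binomial logistic regression: predict admit from gpa
gpamodel <- glm(admit ~ gpa, data = mydata, family = "binomial")

summary(gpamodel)  # coefficients are on the log-odds scale
```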
We can use the function coef, for coefficients, to get the coefficients from the model, and you'll notice they are the same as in the summary: 8.71, the change in log odds for every one-unit increase in GPA. To derive odds ratios from log odds you simply take the exponent, with exp. Taking the exponent is the opposite of taking a log - for example, the exponent of log(23) is 23 - so if we take the exponent of the log odds from the model, we get our odds ratios. An odds ratio is literally a ratio of odds, but the way we interpret it in terms of our model is: for every one-unit increase in GPA, a student is 6,049 times more likely to be admitted to university than not. (This example is a little ridiculous, because we changed the values earlier to make things nicer.) That's how we interpret an odds ratio.

Usefully, we can also get confidence intervals for our odds ratios, using the confint function. Be careful: by itself this gives the confidence intervals for the log odds, not for the odds ratios, so to get the confidence intervals for the odds ratios we do what we did before and wrap the command inside the exp function. Then we get our intervals: the lower bound is 789 times more likely and the upper is 94,759 times more likely - quite a wide confidence interval, but clearly highly predictive in our model.

Now, I mentioned earlier that in logistic regression your predictor variables can be continuous, but they can also be categorical, so let's build a model where we predict admission from both categorical and continuous variables.
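The sequence of commands being described is, roughly:

```r
coef(gpamodel)          # log odds (matches the summary: ~8.71 for gpa)
exp(coef(gpamodel))     # odds ratios
confint(gpamodel)       # 95% CIs, still on the log-odds scale
exp(confint(gpamodel))  # 95% CIs for the odds ratios themselves
```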
We'll create a full model, again using the glm function, this time predicting admit from GPA, GRE (I'm still not sure what the GRE variable stands for, sorry) and rank, with data = mydata and family = "binomial". Just like before we can call summary on our full model: again we get the call, and now we get coefficients and p-values for each of our predictors - an estimate and p-value for GPA, an estimate and p-value for GRE - and then notice that something interesting has happened: for rank, which is a single variable, we have three coefficients. The reason is that rank is a factor with four levels, and each coefficient compares the log odds of being in one level against a baseline: rank 2 versus rank 1, rank 3 versus rank 1, rank 4 versus rank 1. Rank 1, the first level of the factor, has been taken as a reference or baseline level to which the others are compared. This is happening because earlier we told R to consider rank a factor; if we hadn't, we would have got a single estimate for rank, treating it as a continuous variable, which we would have interpreted as the change in log odds of admission for every one-unit increase in rank. So we interpret these coefficients as follows: the log odds of admission decrease by 0.287 as we go from rank 1 to rank 2, and decrease by 1.423 as we go from rank 1 to rank 3. Basically this says that the higher the rank number of the university, the less likely the individual is to be admitted - although none of these rank terms is actually significant in our model at the moment.

Again, I'm not very good at interpreting log odds - I think few people are - so let's reuse the exp(coef()) command from above to look at the odds ratios, changing the model name to the full model. For GPA, for every one-unit increase a person is 7,318 times more likely to be admitted. For GRE, the odds ratio is smaller than one.
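The full model as described, mixing continuous predictors with the rank factor (fullmodel is my rendering of the spoken name):

```r
# Full model: two continuous predictors plus the rank factor
fullmodel <- glm(admit ~ gpa + gre + rank,
                 data = mydata, family = "binomial")

summary(fullmodel)   # rank appears as rank2, rank3, rank4, each vs rank 1
exp(coef(fullmodel)) # odds ratios for every term
```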
An odds ratio smaller than one is itself a slightly difficult thing to interpret. The trick is to compute 1 divided by the odds ratio and read it the other way around: for every one-unit increase in GRE, a student was 1.005 times less likely to be admitted to their university. Notice that this odds ratio is very, very close to one, and the reason is that a one-unit increase in GRE is actually very small: GRE values are in the hundreds. So although the effect is significant, you would almost be tempted to conclude that there's no real effect of GRE, because a one-unit increase changes the odds by very little. This is an important point: if a one-unit change in your variable is fairly meaningless - either because it would be a huge change in that variable or, as here, a very tiny one - it can be much more intuitive, instead of entering the raw variable into your model, to enter the log2 of the variable instead. We'll do this in a second; the way you then interpret the coefficient is that every time the variable doubles, that is the odds ratio you are given.

Similarly, the odds ratios for the levels of rank are all smaller than one, so to make them more interpretable we can again take 1 divided by the odds ratio. The interpretation is that if the university was ranked number 2, a student was 1.33 times less likely to be admitted than if they had applied to a university ranked number 1, and similar interpretations apply for universities ranked number 3 and number 4. We can also get confidence intervals for all of these odds ratios if we want, and we can combine everything into a slightly nicer table using the cbind function: we say OR = the odds ratios, then supply the confidence intervals, which come with their own headings. Hit Ctrl+Enter and we get a tidier table with each variable, its odds ratio, and our lower and upper confidence limits.

I mentioned earlier that we can take log2 of a predictor variable to make its odds ratio more interpretable, so let's do that with the GRE predictor in our model. We can do it directly in the model call, by wrapping the log2 function around the GRE variable. If we rerun the model, look at the coefficients, and take 1 divided by the odds ratio for the log2(GRE) predictor, it's now more interpretable: every time GRE doubles, a student is 8.72 times less likely to be admitted to university. We can get confidence intervals for these again and rebuild our table to make it easier to read. So I hope that makes sense for interpreting odds ratios: if the odds ratio is greater than one, you interpret it as a one-unit increase in the variable making a person that many times more likely to be admitted; if it's smaller than one, the simple transformation of dividing one by the odds ratio lets you interpret it the other way around - so for every one-unit increase in log2(GRE), which of course is a doubling of GRE (which is why we used log2), a student is 8.72 times less likely to be admitted.
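The table-building and log2 refit steps might be typed as follows; refitting over the top of fullmodel mirrors how the narration edits the existing call in place:

```r
# Combine odds ratios and their confidence intervals into one table
cbind(OR = exp(coef(fullmodel)), exp(confint(fullmodel)))

# Refit with gre on a log2 scale, so its odds ratio describes
# a doubling of gre rather than a one-point change
fullmodel <- glm(admit ~ gpa + log2(gre) + rank,
                 data = mydata, family = "binomial")

cbind(OR = exp(coef(fullmodel)), exp(confint(fullmodel)))
```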
In our model we get individual odds ratios and p-values for the different levels of our categorical variable rank, each compared back to our baseline, rank 1. But we may actually want a single p-value testing whether the predictor as a whole contributes to predicting whether a student was admitted or not, and we can get one using the aod package. We install it using the install.packages function and load it into memory using library; the function we want is wald.test. It has an argument called b, to which we give the coefficients from the model; an argument called Sigma, which requires something called the variance-covariance matrix of our model (don't worry if you don't understand that); and we need to tell it exactly which terms in the model to test. We're asking whether terms 4, 5 and 6 significantly contribute to the model: counting down the coefficient table with the intercept as term 1, those are the three rank terms, so we say Terms = 4:6. Hit Ctrl+Enter and we get our Wald test, and the p-value is much larger than 0.05, so we conclude that the rank of the university does not significantly contribute to our model. So we can now alter the model: let's remove rank, since we know it doesn't contribute to the predictive value of our model, whereas GPA and GRE do, and recreate the model using just the parameters we know contribute. That gives us our final model, and we find that both GPA and log2(GRE) significantly contribute to predicting whether a student would be admitted to university or not.

But once you have your model, it's important to assess whether it is generalizable - whether you have just overfitted your points. Perhaps you have some sampling error in your sample.
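The Wald test and the reduced model, roughly as dictated (finalmodel is my rendering of the spoken name; the term positions assume the coefficient order intercept, gpa, log2(gre), rank2, rank3, rank4):

```r
install.packages("aod")
library(aod)

# One overall test for rank: terms 4 to 6 are rank2, rank3, rank4
wald.test(b = coef(fullmodel), Sigma = vcov(fullmodel), Terms = 4:6)

# rank does not contribute significantly, so drop it
finalmodel <- glm(admit ~ gpa + log2(gre),
                  data = mydata, family = "binomial")
summary(finalmodel)
```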
If so, your model fits your data very well but doesn't generalize well to the real world, and if you then went and gathered more data, perhaps your model wouldn't fit it very well at all. A way of guarding against that, and working out whether your model generalizes, is something called cross-validation, and we're going to use a version called k-fold cross-validation. Cross-validation means splitting your data set into a training set - a subset of your data on which you build the model - and a test set - a subset to which you apply the model built from the training set, to see how well it fits. The principle is that the observations in your training set are not in your test set, and vice versa. k-fold cross-validation works as follows: the first time, you randomly select a subset of your observations to be the test set, and the rest are your training set. You build the logistic regression model on the training data, then apply the model to the test set and see how well it predicts whether each student was admitted. Then you do it again, this time picking a different subset of your data to be the test and training sets, build a model on the new training data, apply it to the held-out data, and again see how well it predicts who was accepted into university. You repeat this for a certain number of folds - hence "k-fold" cross-validation - and the model from each fold, and how well it predicted group membership in its test set, is saved. At the end you can take a sort of average across all the folds and see how well the models predict the correct values overall. If a model built on our training data were too specific to the training set and predicted poorly in the test sets, that would mean our data are really insufficient to build a model that is predictive and accurate enough for real-world scenarios.

So how do we perform this in R? There's a handy package called caret - not spelt like the vegetable - which handles most of this for us. I should say as well that cross-validation is not specific to logistic regression: if you're building models to predict, particularly models to classify, then cross-validation is always an important step at the end of your analysis, to make sure the data you have are a sufficient and good enough sample of the population to predict from unknowns. So we install the caret package and load it into our current R session.

Before we can perform the cross-validation, we just need to give caret a few settings saying exactly how we want it performed. We'll call these crossvalsettings, and to set them we use a function in caret called trainControl, which simply lets us set our settings and preferences for the cross-validation process we'll use in a second. There are different ways of performing cross-validation, and the method we're going to use is repeated cross-validation. We'll use 10 folds; the number is somewhat arbitrary, but if you look online, 10 seems to be a decent choice. Then there's an argument savePredictions = TRUE, which basically means that every time it performs a cross-validation it will save the result of the model, including its accuracy. We run this, and now we can actually perform our cross-validation.

Let's call the result crossval, using the function train from the caret package. Inside the train function we specify the model in a similar way to the glm function earlier, with one slight quirk: whereas before we didn't need to specify that admit was a factor - the glm function was smart enough to know - for the train function we need to say so explicitly, by wrapping admit in the function as.factor. So our dependent variable is as.factor(admit), modeled on GPA plus log2(GRE) - we omit rank, because we decided it was not important - with data = mydata and family = binomial. And whereas before we didn't have to specify that we were using a generalized linear model, because that's in the name of the glm function, for the train function we need to specify method = "glm". Finally, this is where we add the settings we created a second ago, by saying trControl = crossvalsettings; that passes all of those options into the train function. If we run this, R performs our cross-validation. Looking inside the object, we don't get an awful lot of information: we have 400 samples, two predictors, and two output classes, 0 and 1, and we can see the resampling was cross-validated with 10 folds, repeated once.
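Put together, the caret steps described above might look like this (crossvalsettings and crossval follow the spoken names):

```r
install.packages("caret")
library(caret)

# Settings: repeated 10-fold cross-validation, keeping each fold's predictions
crossvalsettings <- trainControl(method = "repeatedcv", number = 10,
                                 savePredictions = TRUE)

# train() needs the outcome wrapped in as.factor(), and the model type
# given explicitly as method = "glm"
crossval <- train(as.factor(admit) ~ gpa + log2(gre),
                  data = mydata, family = "binomial",
                  method = "glm", trControl = crossvalsettings)

crossval  # 400 samples, 2 predictors, two classes: 0 and 1
```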
Then we get a brief summary of sample sizes: because this was 10-fold, of our 400 starting observations, for each fold around 360 were used as the training set and the remaining 40 or so as the test set. We also get a value for accuracy, but we'll discuss that a little more in a second.

Now, arguably one of the best ways to validate the predictive power of a binomial logistic regression model - or indeed any classifying model - is to look at something called the confusion matrix, a table of the true positives, true negatives, false positives and false negatives, which gives us an idea of the accuracy of the model. We do this by creating a new object called pred, using the function predict. What this function does is take our cross-validated model - the averaged model produced across those folds - and apply it to our original data, using the model to predict whether each individual in mydata was accepted or rejected in their university application. We can then use the predicted values to construct a confusion matrix: the data argument is the predicted values, pred, and we also have to supply the actual values of our dependent variable, 0 and 1, which we do by giving it mydata$admit as the reference.
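The two commands being described, sketched in code (wrapping the reference column in as.factor is my addition, since caret's confusionMatrix expects factors on both sides):

```r
# Predict class membership for the original data
# from the cross-validated model
pred <- predict(crossval, newdata = mydata)

# Confusion matrix: predicted values against the real values of admit
confusionMatrix(data = pred, reference = as.factor(mydata$admit))
```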
were predicted to be 0 but were actually 1, so these are the false negatives. These down here were predicted to be 1 but were actually 0, so these are the false positives. And these down here were actually 1 and were predicted to be 1 as well, so these are the true positives. I hope you can see that there are 265 students who were correctly classified as rejected, and 111 students who were correctly classified as accepted. So in total 376 individuals were correctly predicted and only 24 individuals were falsely predicted; this model is pretty good.

Then we get some more statistics down here, which are really summaries of this confusion matrix. Our accuracy is 94 percent, meaning 94 out of every 100 students were correctly predicted by our model to be admitted or not, and we get 95% confidence limits for that value. We also get a p-value for the overall model, although I'm not sure how useful that really is. The statistics down here are particularly useful: the sensitivity, the ability to detect that a person was accepted when they actually were accepted, and the specificity, the ability to classify as admitted only those who actually were admitted.

So we've constructed our binomial logistic regression model, we've performed cross-validation to check that our model generalizes well enough to unseen observations, and we've evaluated the model in terms of its accuracy, its sensitivity and its specificity. Another way to represent visually how good your model is at predicting group membership is by producing a receiver operating characteristic (ROC) curve, so we're going to produce one for our model. We'll use the package ROCR, load it into our current session, and use it to visually compare the full model that we created earlier, which has both GPA and log2(GRE) as predictors, with the GPA-only model, which has GPA as its only predictor.

First we create an object called prob.full using the function predict(): we pass it our full model and our data, and we ask it to return the response. What this function does is take the mydata data frame, use the full model to predict the value of admit, and save the result back inside prob.full. If we call prob.full you can see the predicted values of y: if an individual's value is closer to 1 then that individual is predicted to be admitted, and if it's closer to 0 then the individual is predicted to be rejected.

From this we can create another object called pred.full using a similarly named function called prediction(), so don't get the two confused, it's easily done. To prediction() we pass prob.full and the admit variable from our mydata data frame; this function generates the data that ROCR requires to produce the ROC curve. The function performance() then actually calculates the performance measures for our model, such as the true positive rate and the false positive rate, that we'll use to draw the ROC curve: the first measure is the true positive rate, and the x.measure is the false positive rate. Hit Ctrl+Enter to run. We've done all of this for our full model, so let's copy and paste these lines to save some typing, rename everything for the GPA model, making sure that we use the GPA model and call the objects prob.gpa, pred.gpa and perf.gpa, and run those three lines as well.
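Put together, the ROCR steps just described might look like the sketch below. The object names (mydata, prob.full and so on) follow the video, but the admissions data here are simulated for illustration, so the exact numbers will differ from the video's dataset:

```r
# ROC workflow with ROCR, mirroring the steps above.
# The admissions data are simulated, not the video's real dataset.
library(ROCR)

set.seed(1)
n <- 400
gpa   <- rnorm(n, mean = 3.4, sd = 0.4)
gre   <- rnorm(n, mean = 580, sd = 120)
admit <- rbinom(n, 1, plogis(-20 + 5 * gpa + 0.005 * gre))
mydata <- data.frame(admit, gre, gpa)

full.model <- glm(admit ~ gpa + log2(gre), data = mydata, family = "binomial")
gpa.model  <- glm(admit ~ gpa,             data = mydata, family = "binomial")

# Predicted probabilities of admission for every student
prob.full <- predict(full.model, mydata, type = "response")
prob.gpa  <- predict(gpa.model,  mydata, type = "response")

# prediction() pairs the probabilities with the true labels;
# performance() computes true/false positive rates across thresholds
pred.full <- prediction(prob.full, mydata$admit)
perf.full <- performance(pred.full, measure = "tpr", x.measure = "fpr")
pred.gpa  <- prediction(prob.gpa, mydata$admit)
perf.gpa  <- performance(pred.gpa, measure = "tpr", x.measure = "fpr")

plot(perf.full, col = "blue")  # full model in blue
plot(perf.gpa, add = TRUE)     # GPA-only model overlaid in black
```

Note that predict(..., type = "response") is what returns probabilities on the 0-to-1 scale rather than log odds.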
Using base graphics (we could do this in ggplot2, but for the sake of time we'll use base graphics for now), you simply supply a performance object to the plot command. Let's add a colour so that we can discriminate between the models, col = "blue", and hit Ctrl+Enter; base graphics then draws a receiver operating characteristic curve for us.

The way we interpret this is as follows. Imagine that, instead of being GPA, this axis is the overall value output by the model: exactly where along that scale we choose to set our separator determines our false positive rate and our false negative rate. For example, say this was the outcome variable of the model and we set our cutoff at 4: maybe one or zero of these individuals would be misclassified, but all of these individuals to the left of 4 would be misclassified as not being admitted to university. What a ROC curve does is take different arbitrary thresholds along the output of the model and ask, at this particular value, what is the false positive rate and what is the true positive rate? If the line fell along the diagonal, our model would be no better than a best-guess estimate; the further away from the diagonal the model's line sits, the better it is at classifying our observations. You can see that this particular model, our full model, is very far from the diagonal: the true positive rate remains very high while the false positive rate remains quite low.

Now let's compare this to our GPA-only model. We add its line on top of this plot by calling the plot function again, this time on the GPA performance object that we created, and supplying add = TRUE; this just prints the line on top of the plot that we already have open. You can see that this added a black line, and the black line sits very close on top of the blue line for the full model. The reason is that, although the log2(GRE) variable was significant in our model and did contribute to prediction, the greatest coefficient, the greatest log odds in our model, was for GPA; remember it was something like 7,000 times more likely to be accepted for every one-unit increase in GPA. Because of this, as long as GPA is in the model, we're still able to predict with very high accuracy whether a student will be accepted or rejected.

Now, remember we changed the data a little to make the example run more nicely, so let's go back and see what happens if we hadn't done that. Let's re-read the data back up here, but this time we're not going to add 1 to the GPA of students who were accepted. We recreate the GPA model, recreate the full model, then come back down to the section on ROC, run this whole section again and rebuild our ROC curves. Because we no longer made the difference in GPA between accepted and rejected students as large as it was, you can see that the ROC curve is now much closer to the diagonal line: the model is no longer as phenomenal at predicting, because there isn't such a big difference between the GPAs of students who were accepted and those who weren't. If we add our GPA-only model on top, you can again see it is still quite similar, because GPA was still more important than GRE.

Looking at the graph is a nice way of visually telling which models are better than others, but we'd like numbers to quantify that, and one way to get them is the area under the curve (AUC): the greater the area under the curve, the more accurate your model is at predicting. We can calculate it by reusing the performance() function from earlier.
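As a self-contained sketch of the AUC idea, here is the same performance() call on toy scores rather than the admissions models (the labels and scores are invented for illustration):

```r
# AUC via ROCR on a toy example: scores that partially track the labels
library(ROCR)

set.seed(1)
labels <- rbinom(200, 1, 0.5)
scores <- labels * 0.6 + runif(200)  # positives tend to score higher

pred <- prediction(scores, labels)
auc  <- performance(pred, measure = "auc")

auc@y.values        # an S4 slot holding a one-element list
auc@y.values[[1]]   # the AUC itself: 0.5 is chance, 1 is perfect
```

The @ extraction is needed because ROCR performance objects are S4 objects, which is the "slightly unusual structure" mentioned below.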
Copy and paste it, call this one auc.full, and change it so that this time we only want one measure, and that measure is "auc"; then do the same for the GPA model and call it auc.gpa. These objects have a slightly unusual structure, so this next bit is a little strange, but it's just so that we can extract the AUC value. I'll show you what I mean: if I just call this auc object, we get a list with some names and a few different values in it, and this here is the value that we want. To make it a little cleaner we can call auc and use the @ symbol to subset the list item that we want, y.values. If we run that we get the area under the curve for the full model, and if we do the same for auc.gpa we get the area under the curve for just our GPA model. Just to note: if the area under the curve were 0.5, the ROC curve would theoretically lie exactly along the diagonal line and the model would be no better at predicting whether a student would be accepted or not than a best guess. So these models, though no longer fantastic, are at least better than a best guess with no information at all.

Now, very finally, I mentioned at the start that we would touch on multinomial logistic regression at the very end, so let's briefly build a multinomial model. For this we're going to load the iris data frame again, which is built in, it comes with R. Just to refresh our memory, if we have a look at the summary we basically have four continuous variables, sepal length and width and petal length and width, and one categorical variable, Species. We're going to build a multinomial logistic regression model to predict flower species from sepal width. Unfortunately we can't build a multinomial logistic regression model using the glm() function like we did for the binomial model, so instead we need a package called nnet. I get a warning because mine is already loaded into R, so I'm just going to select cancel; if you get the same thing you can select cancel too, otherwise just let it install, and then load nnet into our current working memory.

So we're going to build a multinomial logistic regression model and call it multi.model, and this time, as I said, we can't use glm(); we use the multinom() function instead. We want to predict Species based on, let's say, log2(Sepal.Width), and the data is the iris data frame. If we hit Ctrl+Enter and build our model, we get some feedback that the model has finished fitting, and then we can summarize our model using summary(). Again we get our model call and our coefficients and standard errors, but note that we don't get any p-values; we'll talk about that in a second. As well as looking here, we can use the coef() function to extract just the coefficients. Again these are log odds, which I can't interpret, so let's go ahead and take the exponent of them, and now these are our odds ratios. Note that these odds ratios are a lot smaller than one, so to make them easier to interpret we can take 1 divided by them; let's subset this table so that we just look at both rows of the second column. The way to interpret this, in the case of a multinomial logistic regression model, is that every time sepal width doubles, a plant is about 430 times less likely to be versicolor than it is to be setosa, and every time sepal width doubles, a plant is nearly 9,000 times less likely to be virginica than it is to be setosa. This is a slight difference between binomial and multinomial logistic regression: in multinomial logistic regression, no matter how many categories you have, the odds ratios are always interpreted relative to a baseline category, and by default this will be the level that comes first alphabetically, although you can change that manually.
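The multinomial steps just described can be sketched like this. nnet ships with standard R installations; the hand-rolled Wald z-test at the end is one way to get p-values directly from the summary, as an alternative to the package-based approach discussed next:

```r
# Multinomial logistic regression on iris with nnet::multinom().
library(nnet)

data(iris)

# setosa is the baseline category (it comes first alphabetically)
multi.model <- multinom(Species ~ log2(Sepal.Width), data = iris)
summary(multi.model)   # coefficients and standard errors, but no p-values

exp(coef(multi.model))           # odds ratios, relative to setosa
1 / exp(coef(multi.model))[, 2]  # "times less likely" per doubling of sepal width

# summary() gives no p-values; Wald z-tests can be computed by hand
s <- summary(multi.model)
z <- s$coefficients / s$standard.errors
p <- 2 * (1 - pnorm(abs(z)))
p
```

The rows of coef(multi.model) are versicolor and virginica; setosa does not appear because it is the baseline everything else is compared against.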
Let's plot this data and see exactly what this relationship is. We'll plot iris with Species along the bottom, Sepal.Width on the y-axis, and go with a box plot. I hope this explains the meaning of our odds ratios: every time sepal width doubles, an individual observation is far less likely to belong to versicolor than to setosa, and every time sepal width doubles, an individual observation is far less likely to be virginica than to be setosa.

In this case a unit change of one in sepal width actually makes sense, so let's just rebuild our model; I think it's probably unnecessary to log2-transform, and the raw scale makes the odds ratios slightly nicer. If we redo this and redo this, this is perhaps something easier to get our heads around: for every one-unit, that is one-centimetre, increase in sepal width, an observation is about 454 times less likely to be part of the versicolor species than the setosa species, and similarly, for every one-centimetre increase in sepal width, an individual observation is just shy of 60 times less likely to be virginica than setosa. Although these odds ratios compare back to setosa and not between the other two species, the magnitude of the odds ratios gives you some indication of where the larger difference is: there's a greater difference between setosa and versicolor than between virginica and setosa.

Now, our multinomial logistic regression model didn't give us any p-values, and of course we would like to have p-values. One way we can get them is by installing the package AER; if we load it into memory we can use the function coeftest(), supply it with our multinomial model, and hit Ctrl+Enter, and then we get p-values for our odds ratios. What we're interested in are not the intercepts but the odds ratios for versicolor and for virginica.

So, I hope that was useful. Binomial particularly, but also multinomial, logistic regression models are very powerful models for predicting class membership and for seeing the relationship between increases in certain variables and the likelihood that a person will belong to one class or the other. There are two reasons why they're particularly popular. One is that clinicians like logistic regression: they understand odds ratios in the context of a clinical setting. The other, from a more modelling and statistical point of view, is that logistic regression models don't care about the distribution of your predictor variables. Our sepal width, for example, could be horrendously non-normally distributed and the logistic regression model doesn't care, whereas if you were running something like a t-test or an ANOVA, that would render your coefficients and your p-values uninterpretable because you'd violated the assumptions; logistic regression doesn't have that assumption. So I strongly suggest that you play around with this data, and if you've got any data of your own, even from experiments where you would normally perform a t-test, flip your data around, include the two groups as the dependent variable, and see how the odds ratios change for an increase in your continuous variable. I hope that was useful, and I will see you in the next video.
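That closing suggestion, recasting a two-group comparison as a logistic regression, can be sketched like this; the group labels and simulated numbers are made up for illustration, not from the video:

```r
# Flipping a classic two-group design: instead of t-testing the outcome
# between groups, model group membership from the outcome.
set.seed(42)
group   <- factor(rep(c("vehicle", "treatment"), each = 30))
outcome <- c(rnorm(30, mean = 10, sd = 2),   # vehicle measurements
             rnorm(30, mean = 13, sd = 2))   # treatment measurements
dat <- data.frame(group, outcome)

# glm() models the odds of the second factor level ("vehicle" here,
# because factor levels default to alphabetical order)
flipped <- glm(group ~ outcome, data = dat, family = "binomial")
exp(coef(flipped))["outcome"]  # odds ratio per one-unit increase in outcome
```

Because the treatment group has the higher mean, larger outcome values make "vehicle" less likely, so the odds ratio comes out below 1.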
Info
Channel: Hefin Rhys
Views: 19,155
Rating: 4.9335546 out of 5
Keywords: rstudio, statistics, programming, logistic regression
Id: D1xVEi8PU-A
Length: 66min 49sec (4009 seconds)
Published: Mon Jul 10 2017