ROC Curve & Area Under Curve (AUC) with R - Application Example

Captions
Using getwd() I can see that my working directory is currently the Desktop. I am going to read a CSV file whose first row contains the variable names, so I set header = TRUE, and I will call this data frame binary. After running this you can see that binary has 400 observations and 4 variables, and we can look at its structure with str().

Using this data we want to build a predictive model for whether or not a student will be admitted to this college, and the variables that help make this prediction are GRE, GPA, and rank. I am going to use the nnet package. Our variable of interest is admit, written as a function of the other variables; the tilde followed by a dot means I want to use all three predictors, GRE, GPA, and rank, and our data set is binary. We run this model, obtain a solution, predict values for the binary data set, and store these predictions in p.

Now we can create a table, tab, from the predictions and the actual values. Looking at tab, the actual values run along one side and the predictions along the other. It shows that 253 students who were not admitted were also predicted by the model to be not admitted, but 20 students who were actually not admitted were predicted to be admitted; these are misclassifications. Similarly, 98 students who were actually admitted were predicted to be not admitted, while 29 were actually admitted and were also predicted to be admitted. So the correct classification rate for this data set is (253 + 29) divided by the entire data set of 400. Calculating the sum of the diagonal values of tab divided by the sum of the entire table gives about 0.705; that is the correct classification rate, and one minus that gives 0.295, the misclassification rate.

Now, is this 70% correct classification any good? Let's simply look at how many students were actually admitted and how many were not, with a quick table of admit. In the data set, 127 students were admitted and 273 out of 400 were not. One way to predict whether applicants will be accepted is to always pick the larger group: 273 divided by 400 means that if we predict that no student will be accepted, we are still right 68.25% of the time. If we build a statistical model whose accuracy is less than this number, we obviously should not use that model. Our logistic regression model has an accuracy of 70.5%, which is slightly better, so at least it beats this benchmark.

Turning to model performance evaluation, let's make use of the ROCR package. First we make a prediction using the model we developed, mymodel, on the binary data set, with the prediction type set to probability, and store the result in pred. Now pred holds 400 predicted probabilities; we can look at the first six with head(pred). If you also look at head(binary), you can see that the first applicant was not admitted, and our predicted probability is 0.18, which is very low, so the prediction is also that this student should not be admitted; this is a correct prediction. The classification table we made earlier uses a cutoff of 0.5: if the probability is below 0.5 the prediction is 0, and if it is above 0.5 the prediction is 1.
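The steps narrated so far can be reconstructed roughly as follows. The transcript only names the nnet package, so the use of nnet's multinom() for the logistic model, the file name binary.csv, and the exact object names are assumptions filled in from the narration rather than verbatim commands:

    # Sketch of the steps above; "binary.csv" and nnet::multinom are assumptions,
    # object names (binary, p, tab) follow the narration.
    library(nnet)

    getwd()                                    # confirm the working directory
    binary <- read.csv("binary.csv", header = TRUE)
    str(binary)                                # 400 obs. of 4 variables

    # admit as a function of all remaining predictors (gre, gpa, rank)
    mymodel <- multinom(admit ~ ., data = binary)

    # Class predictions at the default 0.5 cutoff, and the confusion matrix
    p   <- predict(mymodel, binary)
    tab <- table(Predicted = p, Actual = binary$admit)
    tab

    sum(diag(tab)) / sum(tab)                  # correct classification, ~0.705
    1 - sum(diag(tab)) / sum(tab)              # misclassification, ~0.295

    table(binary$admit)                        # baseline: 273/400 = 68.25% not admitted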
Looking again at the first few probabilities, the second one is 0.3, so the prediction will be that the student should not be admitted, whereas in reality this student was admitted; that is a classification error. Similarly, the third probability is 0.71, which is more than 0.5, and this student was in fact admitted, so that is a correct classification.

Now let's look at the prediction values with a histogram of pred. The probabilities vary between zero and about 0.8, and most of the values are below 0.4. With a cutoff of 0.5 we get one classification; if instead we use a cutoff of, say, 0.4 or 0.6, the accuracy and misclassification rates might change. To see what happens, I am going to use the prediction() function from ROCR, giving it the probabilities we calculated along with the actual values, and store the result back in pred. Then we apply performance() to pred with the accuracy measure, store the result in eval, for evaluation, and plot eval.

We get a curve where the cutoff values range from zero to one, and for each cutoff the plot shows the overall accuracy we would get. You can see that when the cutoff is close to 0.1 the accuracy is really very low, close to 30%; it rises rapidly as we increase the cutoff and reaches a peak. Remember 0.5 was our default cutoff, and here we can see what the accuracy would have been for other cutoff values. To identify the best value, let's draw a horizontal line on this chart with abline() at about 0.71, where the peak appears to be, and a vertical line at about 0.45, which looks like the cutoff giving more or less the highest accuracy; but this is only an eyeball estimate.

To find it exactly we use which.max(). The way the ROCR package is built, the results are stored in slots, so I use slot() on eval, take the y.values component, index it with double square brackets and 1, and store the result in max. Before running this, if you simply type eval and hit Enter, you will notice there are lots of values in there: y values, x values, and so on. Running which.max() tells us the maximum is the sixty-first value. Now we go into the y.values slot of eval again, with double square brackets and one more square bracket for the max index we just identified, and store this accuracy value in acc. Looking at acc, the highest accuracy here is 0.7175. Next we want to figure out the optimal cutoff for that 0.7175 value; it may not be exactly the 0.45 we read off the graph.
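A sketch of the probability predictions and the accuracy-versus-cutoff search, assuming the standard ROCR interface the narration describes (prediction(), performance() with the "acc" measure, and the x.values / y.values slots):

    library(ROCR)

    # Predicted admission probabilities ("prob" partially matches multinom's "probs")
    pred <- predict(mymodel, binary, type = "prob")
    head(pred)
    head(binary)
    hist(pred)                                 # most probabilities fall below 0.4

    # Accuracy for every possible cutoff
    pred <- prediction(pred, binary$admit)     # reuse the name, as in the narration
    eval <- performance(pred, "acc")
    plot(eval)
    abline(h = 0.71, v = 0.45)                 # eyeball estimate of the peak

    # Exact location of the accuracy peak
    max <- which.max(slot(eval, "y.values")[[1]])   # the 61st value here
    acc <- slot(eval, "y.values")[[1]][max]         # ~0.7175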
For the cutoff we use the same pattern: slot() on eval, but now we want the values on the x-axis, so x.values, with double square brackets and 1, again indexed at max, and we call this cut, for cutoff. Looking at cut, it is 0.468, not exactly the 0.45 we were eyeballing on the graph. Now we can print both values: compared to the default cutoff of 0.5, a cutoff of 0.4683 gives us a better accuracy of 0.7175. Remember the table we saw earlier described just one situation, where the cutoff is 0.5, and told us how the model performed there.

Sometimes, though, instead of focusing on overall accuracy or misclassification, we are more concerned with predicting one group more accurately than the other. For example, with data on bankruptcies, where a company going bankrupt is represented by 1, our interest may lie more in correctly predicting the ones than the zeros. That is where the ROC curve comes in. We use performance() again and calculate tpr, the true positive rate. Based on our table, the true positive rate is 29 divided by (29 + 98), about 23%. That is a very low accuracy for correctly predicting the ones; most of the time a 1 is being misclassified as a 0. When we look at the overall model and see 71.75% accuracy it looks very good, but an accuracy of 23% on the ones obviously is not very good and needs big improvement. Similarly we calculate the false positive rate from the same table: 20 students are falsely predicted as 1 out of (20 + 253), so the false positive rate is 20 divided by 273, about 7%. We do this calculation and store the result in roc, because we are going to make an ROC curve. Remember these hand calculations are based on the default cutoff of 0.5, but the ROC curve will show the performance across all cutoff values.

Now let's plot roc. This is how the ROC curve looks: the true positive rate is on the y-axis and the false positive rate on the x-axis. The ideal situation would be a curve that starts at (0, 0), rises straight up to (0, 1), and then runs across to (1, 1); that would be perfect classification, with 100% accuracy. In reality, based on the data, we get curves that are not that close to the ideal. We can draw a reference line through the middle with abline(), with intercept a = 0 and slope b = 1. This straight line represents the benchmark of having no model at all; recall that simply rejecting all 400 students would be right about 68% of the time. If the model did worse than that benchmark, the curve would fall below the line, but in this case the model is clearly doing better. Curves for different models can be compared on the same chart to see which does better. We can also customize this chart: if we add colorize = TRUE and rerun the plot, the curve is colored, and the color is keyed to the cutoff.
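Continuing the sketch: the optimal cutoff, the rates computed by hand from the 0.5-cutoff confusion matrix, and the ROC plot:

    # Cutoff (x-axis value) at the accuracy peak
    cut <- slot(eval, "x.values")[[1]][max]    # ~0.4683
    print(c(Accuracy = acc, Cutoff = cut))

    # Rates read off the 0.5-cutoff confusion matrix
    29 / (29 + 98)                             # true positive rate, ~0.23
    20 / (20 + 253)                            # false positive rate, ~0.07

    # ROC curve: tpr against fpr across all cutoffs
    roc <- performance(pred, "tpr", "fpr")
    plot(roc)
    abline(a = 0, b = 1)                       # no-model benchmark diagonal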
For example, a cutoff of 0.5 corresponds to the light green portion of the curve; the cutoff values in this example range from about 0.05 up to 0.72. We can also add a title, and note that the y label, the true positive rate, has another name: sensitivity. Likewise, for the x label we can write 1 - specificity, which is another name for the false positive rate. Running that gives us the title along with Sensitivity and 1 - Specificity on the axes, and of course we can draw the abline() again to see how the model performs against the benchmark.

Another way people use the ROC curve is to calculate the area under the curve. Visually we can see that this curve does better than the benchmark, but when there are many curves on one chart it becomes difficult to differentiate the performances, so we need a numeric value. We find the area under the curve; the higher the area, the better the model performance. Note that the total area of the rectangle here is 1, and the area below the diagonal line is 50%, so the area under the curve for the model we built will be more than 50%. Let's see how much we get: we call performance() on the pred object we calculated earlier with the auc measure and store the result in auc. Then we apply unlist() to the y.values slot of auc and store that back in auc. If you run auc, you get 0.6921 and so on; to show fewer decimals, apply round() to auc with, say, 4, and now you see only four decimal places.

Finally, let's add this number to the graph using legend(). We want the legend to start at x = 0.6 and y = 0.2, showing the auc value, and we can give it the title AUC so the area under the curve is indicated on the graph. You can also change the text size with cex: a value of 4 would be very big and obviously would not fit, so let me run the line again with 1.2; this is the plot with size 1.2.
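And a sketch of the customized plot and the AUC calculation; the plot title and the exact legend placement are assumptions based on the narration:

    # Colorized ROC with a title and the conventional axis names
    plot(roc,
         colorize = TRUE,
         main = "ROC Curve",
         ylab = "Sensitivity",
         xlab = "1 - Specificity")
    abline(a = 0, b = 1)                       # no-model benchmark

    # Area under the curve, rounded and placed on the plot
    auc <- performance(pred, "auc")
    auc <- unlist(slot(auc, "y.values"))       # ~0.6921
    auc <- round(auc, 4)
    legend(0.6, 0.2, auc, title = "AUC", cex = 1.2)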
Info
Channel: Dr. Bharatendra Rai
Views: 85,571
Rating: 4.9356136 out of 5
Keywords: data science, big data analytics, data mining, business analytics
Id: ypO1DPEKYFo
Length: 19min 40sec (1180 seconds)
Published: Mon Mar 06 2017