Logistic Regression in R | Machine Learning Algorithms | Data Science Training | Edureka

Captions
Hey guys, this is Aman from Edureka. Today's session is going to be on logistic regression, so without wasting any more time let's move on to today's agenda to understand what all will be covered. We'll start off by discussing the five main questions which can be asked of you in data science; based on these questions you decide which algorithm to use. We'll see how regression fits into these questions, and then move on to the part where we'll discuss what regression is exactly. After that we'll move on to the topic of the day, which is logistic regression, and understand the what and why of it. Then we'll see how logistic regression actually works, and towards the end we'll be doing a demo wherein we take a diabetes data set and solve it using logistic regression. In the end we'll discuss the use cases, the real-life scenarios where logistic regression is actually used. All right guys, this is our agenda for today; are we clear with it? I'm getting confirmations, so Denise says yes, and others are confirming as well. Okay guys, let's start with today's session.

Our first topic is the five questions, the only five questions which can be asked of you in data science; based on these questions you decide which algorithm to use. The first question that could be asked is: is this A or B? Is this an apple or a pineapple? Is this a pen or a pencil? Is it a mouse or an elephant? When you have these kinds of questions, the algorithms for them are the classification algorithms. Our next question is: is this weird? This question deals with patterns, so whenever there is a change in a pattern, the algorithm detects it, and the algorithms which
deal with these kinds of problems are called anomaly detection algorithms. Then you have questions which are quantifiable, where you ask for numbers: how much, or how many? For example, what will the temperature be tomorrow, or after how many days will it rain? All these kinds of questions are tackled by algorithms called regression algorithms. Then you have questions like: how is this organized? These deal with clustering, and the algorithms that handle them are called clustering algorithms. Then you have questions such as: what should I do next? When you have to make a decision, these decision-taking capabilities come from algorithms called reinforcement learning; using them you can decide what to do next. So these are the five questions asked in data science, and these are the algorithms made to tackle them.

Now, our topic for today is logistic regression. As the name suggests, it comes under the regression algorithms, but with logistic regression the answer that comes out is categorical: it's either a yes or a no, either A, B, or C, either true or false. The values are fixed; the dependent variable, the output that we get, is categorical like this. So it is also categorized under the classification algorithms as well. It is a regression algorithm because, before classifying the output, you get a probability, and based on that probability you decide whether it will be a yes or a no, or an A or a B. That is the reason it is categorized under both families of algorithms.

Moving on, let's first understand what regression is. Regression is basically trying to establish a relationship between two variables. So what happens in this scenario is
you have an independent variable and you have a dependent variable, and the dependent variable is related to the independent variable. In our case Y is the dependent variable and X is the independent variable. As you can see in the graph, as the value of X increases, Y also increases, but Y's value depends on X: X can increase as much as it wants, but Y will increase according to X. So where does regression come into the picture? For some arbitrary or random value of X, what will be the value of Y? You are predicting Y, and this prediction is done using regression algorithms. That is what regression is all about: estimating a relationship between a dependent variable and an independent variable. Any doubts regarding what I've told you so far? Okay, Karan is saying yes. Guys, let's try to make this session as interactive as possible, because that will benefit both of us.

Since most of you are clear, let's move on to the next topic. Regression is basically categorized into three sub-parts, so there are three types of regression: the first type is linear regression, then you have logistic regression, and then you have polynomial regression. Today we are going to discuss logistic regression, so let's move on to the part where we understand what logistic regression is. But before that, let us understand where we use logistic regression, or why it is actually needed. So: why logistic regression? Whenever the outcome of the dependent variable is categorical or discrete. When I say discrete, it means the value is fixed: it can either be an A or a B, it can either be a zero or a one. So if I'm asking you a question like: is this animal a rat or an
elephant, you cannot say it's a dog, because my question is: is this a rat or an elephant? Are you getting me? So when you have outcomes which are discrete, which are categorical, which are predefined, you use logistic regression.

Now the next question which can be asked is: why can't we fit the formula of linear regression here? Let's understand why. Guys, this is the best-fit line that you get in linear regression. We have already discussed what linear regression is, but for the people who don't know, let me give you a quick summary: you use linear regression when your Y value is in a range. In our case, though, our value is discrete: it could either be 0 or it could be 1. As you can see, these are the values for our Y; it can only be 0 or 1. But the best-fit line for a linear regression is this, and as you can see it crosses 1 and also goes below zero, while our Y value cannot be below 0 or above 1. So the only solution we can see here is to clip the best-fit line, and when we clip it, it comes out like this. But this had to be solved, because this kind of shape cannot be formulated; it is actually not a curve at all, it is basically three straight lines. So it had to be formulated into an equation, and once it is, it comes out like this: a sigmoid curve. It's a sigmoid function curve, and it's very famous; it's called the S-curve.

So we had to make an equation whose curve comes out like this. But why? Let me address that. As you can see, the value of Y is 0 for certain values of X, and the value is 1 for certain values of
X as well. Now there is a transition happening: after a certain point the values become 1, where the Y value becomes 1. This transition has to be represented by a curve, and hence we came up with this S-curve.

Now, in logistic regression, how do we decide whether the value will be a 1 or a 0 when we predict it through a model? Logistic regression basically gives us a probability: it will tell me what the chances are of Y being 1. For example, say we have a match going on and we want to predict which team will win. If my team has scored 75 runs and I ask my logistic regression model to tell me whether this team will win or not, then with 75 runs it calculates the probability of winning from the values we trained the model with, and it says: okay, the winning probability of your team is 0.8. Now we have to decide what the threshold value will be, the value above which we say the team will definitely win. So we define: if my probability comes above, say, 0.5, whatever that probability is, we convert it into 1. If my team has a probability of 0.8, it is going to win. So if I ask my model about my team scoring 75 runs, the model will calculate the probability of winning, which is 0.8, and then apply the if/else logic: if 0.8 is greater than 0.5, Y is equal to 1; otherwise it is equal to 0. This is how logistic regression actually works, and this is how you predict values through the S-curve.

So basically we had to come up with this kind of curve because it is the curve that represents logistic regression, because of the transformation which is happening. But behind this kind of curve we should have a linear equation, so to come up with one, we compared it with the straight line's linear
equation, which is like this. Now, in the case of a straight line, Y's range is between minus infinity and infinity, but if we talk about logistic regression, we get probabilities, and our Y's value is between 0 and 1. So we have to change the range of our Y so that this equation can be achieved. The right-hand side will basically remain the same; the Y will change, and hence we can come up with a linear equation which represents that curve.

Our Y's range for now is between 0 and 1, so we had to change the value of Y. To make our value range between 0 and infinity, we can divide Y by 1 minus Y. If we do that: if Y is 0, it becomes 0 over 1 minus 0, which is 0; and if Y is 1, it becomes 1 over 1 minus 1, which is 1 by 0, which is infinity. So my Y's range has now become 0 to infinity. But we want our Y's range to be minus infinity to infinity, and hence we do one more transformation and apply the log function: log of 0 is minus infinity, and log of infinity is again infinity. So now this particular element, log of Y over 1 minus Y, has a range between minus infinity and infinity, and hence it can now be compared with the straight line's Y, whose range is also between minus infinity and infinity, and it becomes a linear equation. This is the linear equation for the S-curve.

So guys, any doubts in whatever we did? Karan is asking: is this formula used in our logistic regression in R? No, Karan, while you are working with R you will not be implementing this formula yourself; this is the formula which runs in the background. The reason I'm telling you is that whenever you pass a command, you should know what that command is actually doing; you should understand the math behind it, and this is exactly that math. Okay, so any more questions? Is there anything that you didn't
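The transformations just described can be checked numerically. Here is a minimal sketch in R (the session does not show this code; the `logit` and `sigmoid` helpers are defined here purely for illustration):

```r
# logit: maps a probability in (0, 1) onto the whole real line,
# log(y / (1 - y)), exactly the transformation derived above
logit <- function(y) log(y / (1 - y))

# sigmoid (inverse logit): maps any real number back into (0, 1),
# which is why the S-curve can never go below 0 or above 1
sigmoid <- function(z) 1 / (1 + exp(-z))

logit(0.5)           # 0: a 50% probability sits at the midpoint
logit(0.99)          # a large positive number
logit(0.01)          # a large negative number

sigmoid(0)           # 0.5
sigmoid(-100)        # practically 0, but never below it
sigmoid(100)         # practically 1, but never above it

sigmoid(logit(0.8))  # 0.8: the two functions undo each other
```

This is the sense in which log(Y / (1 - Y)) ranges over minus infinity to infinity while Y itself stays between 0 and 1.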
understand in this part? Okay, so you guys are giving me a go; let's move ahead then, guys. We have understood why logistic regression is used; now let's move on to the what part. What is logistic regression? Logistic regression is also known as logit regression, or the logit model; it goes by these names as well. It is a regression model where the dependent variable is categorical. Like I've been saying in the previous slides, whenever the dependent variable is categorical, you use logistic regression, and that is what it is all about. What happens in logistic regression is that you calculate probabilities: what is the probability of Y being 1? Based on those probabilities you take a call on what the threshold should be, and whenever your probability is above that threshold, you convert it to a 1, and whenever your probability is below that threshold, you convert it to a 0. That is what logistic regression is.

Categorical means your values are fixed, they are discrete; your values could be A or B or C. Dependent means your Y depends on X: you can feed any value into X, but your Y value will depend on X, and whatever input you give to X, Y will change according to it. Y cannot have its own value; Y will always take a value according to whatever relationship it shares with X. So Y is dependent on X, and whenever our dependent variable is categorical, we use logistic regression; this is what logistic regression is all about.

You have seen the graph; for your understanding, let's see it again. This is my value of Y, so it's between 0 and 1, and my graph is an S-curve: my values transform from this point to this point, and this is the best curve that can be fit according to my
data. Now, in between, my logistic regression model will come up with values. Say for this X it comes up with this value; let me draw it for you. If I give my model this value of X, the model will calculate which point on the graph my X corresponds to, and then see what value of Y that point corresponds to. This value is actually the probability. Based on this probability, you decide whether your Y will be a 1 or a 0, and how is that calculated? You decide a threshold, and based on that threshold your model decides whether to categorize it under 1 or under 0. This is what happens in your logistic regression model.

Now, how does it work? Let me give you an example; we have understood the theory, so let's see it in a real-life use case. I have a list of people's IQs. I am a company and I want to recruit people, so I have their IQs in my hand and I want them to be selected automatically using a logistic regression model. We have not yet discussed how the model itself is built, so let us consider it a black box; we don't know what is inside. What we will do is feed these values to the model, and the model will predict whether each candidate will be selected or not based on his IQ. For example, the candidate with a 147 IQ has been selected and the one with a 107 IQ has not been selected. This is a calculation done inside the model: it would have calculated the probability of being selected based on my past records and compared it with the threshold, and if it was less than the threshold, the candidate would have been categorized under the not-selected part; if it was greater than
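The IQ example can be sketched end to end in R. Note that the session never shows its actual training records, so the data frame below is entirely invented for illustration:

```r
# Hypothetical past hiring records (made-up numbers; the session's real
# training data is not shown)
past <- data.frame(
  iq       = c(95, 100, 105, 110, 120, 125, 130, 140, 145, 150),
  selected = c( 0,   0,   1,   0,   0,   1,   1,   1,   1,   1)
)

# The "black box": a logistic regression model trained on past records
model <- glm(selected ~ iq, data = past, family = binomial)

# Probability of selection for two new candidates...
p <- predict(model, newdata = data.frame(iq = c(147, 107)), type = "response")

# ...then the 0.5 threshold turns each probability into a decision
ifelse(p > 0.5, "selected", "not selected")
```

With this invented data, the 147 IQ candidate gets a probability above the threshold and the 107 IQ candidate falls below it, mirroring the example on the slide.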
the threshold, the candidate would have been categorized under the selected part. So this is how it happens, guys. We now know what logistic regression will do, and we have an abstract understanding of how it functions, but what we don't know yet is the main part: how is this model created? So now we will discuss how you can create this kind of model in logistic regression.

Let us take an example with a sample data set which is built into R, which is mtcars. You have these many values; don't get scared by the values you are seeing, they are just values and it will all be clear to you in a minute. You have mpg, cyl, disp, et cetera, and the focus should be here, on vs and am. As you can see, the values of vs and am are discrete, or categorical; they are either 0 or 1. What we'll be doing in this example is predicting, based on these values, the type of engine my car has: will it be a V-shaped engine or a straight engine? If the value of vs is 0, that means the car has a V-shaped engine; if the value of vs is 1, that means the car has a straight engine. These values will be predicted based on the other inputs, which are actually the parameters of the car, and you can look up what each one means in the key for the data set: mpg is miles per US gallon, cyl is the number of cylinders in the engine, disp is the displacement, hp is the gross horsepower of the car, drat is the rear axle ratio, wt is the weight, and so on.

Now, we will not be taking all the values. This is a step that you will learn in the demo part of today's session, how you select which values to take in this
model. As of now, forget how we select values; we will be selecting disp and wt. While creating the model, we'll be feeding in only these two values, and we'll be predicting Y, where Y is basically vs. So we'll be predicting vs based on the two values we feed into the model: disp, the displacement of the car, and wt, the weight of the car. Based on displacement and weight, we will predict whether the car has a V-shaped engine or a straight engine.

Now let's see how we do that. Before creating a model, the first step is to divide the data set into training and testing. Why are we doing that? The training set is the data we feed to the model; this is how we train it, and our model has to be trained before it can predict values. Like I told you at the start of today's session, regression means establishing the relationship between two variables, and it establishes that relationship by looking at past records, at what has happened before. So in our training data it sees that for the Mazda RX4 car, when the displacement was 160 and the weight was 2.620, the engine used was a V-shaped engine, that is, vs being 0; however, when the displacement of the car was 108 and the weight was 2.320, the engine used was a straight engine, with vs being 1. This is how it establishes relationships. We will train our model using the training data set, and once the model is created, we will test it using the testing data set. So this is what we do: we divide the data set in two, where the larger part is used for training and the smaller part is used for model validation, so that we can come up with an accuracy.
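The split-then-train step for mtcars can be sketched as below. This is a self-contained base-R sketch (the session itself uses the caTools split function shown later in the demo; base `sample` is used here as a stand-in):

```r
set.seed(42)  # fix the random seed so the split is reproducible

# Randomly mark ~80% of the 32 mtcars rows for training, the rest for testing
idx      <- sample(seq_len(nrow(mtcars)), size = round(0.8 * nrow(mtcars)))
training <- mtcars[idx, ]
testing  <- mtcars[-idx, ]

# Train the model: predict vs (engine shape) from wt and disp only
model <- glm(vs ~ wt + disp, data = training, family = binomial)

summary(model)   # coefficients (the beta values), deviances, AIC
```

Because the split is random, the exact coefficients will differ from run to run and from the numbers quoted in the session.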
All right. So we will divide the data set, and like I said, the training portion will be used to create the model. Once your model is created, this is what you get when you look for the summary of the model. When you type summary(model), you will get all these values, and these values are very important, guys. The first value to focus on is the intercept: this is basically the constant value in your equation. Remember, our logistic regression formula was logit of Y equal to beta-naught plus X1 beta-1 plus X2 beta-2 plus X3 beta-3, and so on. Over here, our beta-naught is the constant value; beta-1 is the coefficient for wt, the independent variable we fed to our model, which is 1.094; and disp's coefficient is beta-2, which is minus 0.02529.

Now, these values are calculated by R using the MLE method. What is MLE? It is maximum likelihood estimation. If we were to discuss how these values are calculated, it is very complex; I can explain it to you guys, but it would take a lot of time. What you can do is google MLE, and if you don't understand it, ping me and I'll explain it to you in the next session. MLE need not really concern you, because these values are calculated automatically; that is why we have R to calculate them for us. But if you are still curious about how they are calculated, google it, and if you don't understand, I will explain it in the next session.

Having said that, we get these three values, which are the coefficients of the independent variables, and these coefficients are then fed into this expression. How did we come up with this expression? When we solve log of Y over 1 minus Y,
which is equal to beta-naught plus beta-1 X1 plus beta-2 X2, and remove the log and solve for Y, we come up with this expression. With this expression you basically get the probability of Y being 1. Over here you feed in the values of beta-naught, beta-1, beta-2, X1, and X2; e is Euler's number, which is a constant, about 2.71. Beta-naught is the value that you got from here, and you have the beta-1 value and the beta-2 value; then you feed in the X1 value and the X2 value and calculate the probability.

Again, this actually happens in the background: to get the Y value you simply type predict, pass in your data, that is, your X1 and X2, and R will automatically plug them into this formula, pull the coefficient values from the model, and predict the value for you. But you should understand what is happening in the background, and that is the reason we have explained this expression; this is the expression, derived from logit of Y, that is used to calculate the probability of Y.

Now, as you can see, these are the values that we got from our data set, and these are the values that we'll be feeding into our function. Where did we get X1 and X2 from? We took a row from the data set: the disp value is 120.3 and the wt value is 2.140. We feed these values into the function along with the beta-naught, beta-1, beta-2, and e values, and we come up with the value 0.4962. This is the probability of Y. As you can see in the data set, when the displacement was 120.3 and the weight was 2.140, the vs value was 0. And when we solve the equation, the Y is basically the
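The manual calculation just described can be reproduced in R and checked against predict(). Note this sketch fits on the full mtcars data so it is self-contained; the coefficients, and hence the probability, will therefore differ from the session's 0.4962, which came from a particular training split. The point is that the hand formula and predict() agree exactly:

```r
# Fit on the full mtcars data for a self-contained illustration
# (the session fits on a training split, so its coefficients differ)
model <- glm(vs ~ wt + disp, data = mtcars, family = binomial)
b <- coef(model)   # beta0 ("(Intercept)"), beta1 ("wt"), beta2 ("disp")

# One observation: disp = 120.3, wt = 2.140 (the Porsche 914-2 row)
x <- data.frame(wt = 2.140, disp = 120.3)

# Manual calculation: p = 1 / (1 + e^-(b0 + b1*wt + b2*disp))
z <- b["(Intercept)"] + b["wt"] * x$wt + b["disp"] * x$disp
p_manual <- 1 / (1 + exp(-z))

# Same thing via predict(); type = "response" returns the probability
p_predict <- predict(model, newdata = x, type = "response")

ifelse(p_manual > 0.5, 1, 0)   # apply the 0.5 threshold to get vs
```

The two probabilities match to machine precision, which is exactly what "predict applies this formula in the background" means.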
value of vs. All right, so this probability is basically the probability of vs being 1. When my disp and wt values were these, my probability of Y, or vs, being 1 was 0.4962. Our threshold in this case, as usual, is 0.5: if my probability is greater than 0.5 we say it's a success, and if it is less than 0.5 we say it's a failure. This value is less than 0.5, so it comes out to be zero, and that is the predicted value of vs. The predicted value of vs from our calculations is zero, and if we compare it with the testing data set, for the values 120.3 and 2.140 the actual value of vs was also zero. So as you can see, our model was correct; it has predicted the right value, and our model is validated now: it predicted the right value for these values of disp and wt.

So this is what happens in logistic regression. We have done it manually, as in we applied the formula by hand and came up with this number; but had it been in R, we would have simply typed the predict command, entered the wt and disp values, and got this number, and then R would have applied the threshold to it, checking whether the number is greater than 0.5 or not. Since it is not greater than 0.5, it would automatically have reduced that value to 0, and hence it would have matched the vs value. This is how a logistic regression function works.

So let's move on to the demo part now. We will be using a different data set, a more complex and more elaborate one. Our aim now is to predict whether a patient is diabetic or not. How will we predict that? We have a data set wherein we have values like these: we have npreg, which is the
number of pregnancies that the patient has had; we have glu, which is the plasma glucose concentration; we have bp, the blood pressure level of the patient; skin, which is the triceps skin fold thickness, a test that is done; we have the body mass index of the patient; we have the diabetes pedigree function; we have the age; and in the end we have the type. In this data set, if the type is 1 for a row, it means the patient is diabetic; if the value is 0, it means the patient is not diabetic. So by entering these values, we have to create a model which will predict whether a patient is diabetic or not.

Having said that, let's move on and create this model. First we will pass this command, because first we have to import our data set; let me quickly go to my R console. So I've passed this command to include the data set in our environment, and this is my data set, as you can see. Our next step is to split the data set into testing and training in the ratio 80:20, so let's do that. Basically there is a library which first has to be included; this library, caTools, is used for the split function. For splitting our data set into training and testing we have to include this library, and once we do that, we will split the data set into training and testing.

That has been done, so let me explain the commands. First, my data, which I got from the CSV file, is in this variable, and I am splitting this variable in the
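The import-and-split step can be sketched as follows. The session reads a CSV that isn't shown; as a stand-in, the Pima.tr data set from the MASS package has exactly the columns described (npreg, glu, bp, skin, bmi, ped, age, type). sample.split from caTools assigns the TRUE/FALSE labels exactly as described (caTools must be installed):

```r
library(MASS)      # Pima.tr: diabetes data, a stand-in for the session's CSV
library(caTools)   # provides sample.split()

data <- Pima.tr
data$type <- as.integer(data$type == "Yes")  # recode Yes/No as 1/0

set.seed(1)
# TRUE for a random ~80% of rows, FALSE for the remaining ~20%
split    <- sample.split(data$type, SplitRatio = 0.8)
training <- subset(data, split == TRUE)
testing  <- subset(data, split == FALSE)

nrow(training)   # roughly 0.8 of the 200 rows
nrow(testing)
```

sample.split picks the rows randomly rather than sequentially, which is the behavior described in the session.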
So what this command does is split my data variable in the ratio 0.8 to 0.2. For the 0.8 part of the data set, the value becomes TRUE, and these rows are actually picked randomly, not sequentially: random rows are picked and each one is assigned a value, TRUE or FALSE. So 0.8 of my data set is assigned the TRUE value and 0.2 of it is assigned the FALSE value. What that means, you'll understand in the next line. Once I have the split, the next line is the training set: for my training data set, I take the subset of data where the split value is TRUE, so the 0.8 part of my data labeled TRUE is assigned to the training variable. Then for the testing data set, I take the subset of data where split equals FALSE; that means the rows labeled FALSE, the 0.2 part of my data set, come under the testing data set. This is how the splitting of the data set happens.

Okay, so once I have divided my data set, the next step is creating the model, so let's do that using this command. glm is the function basically used for creating a model in logistic regression. This part specifies the formula that I am applying: my type is the Y that I want to predict, and the tilde symbol basically separates the Y
from the X part. On the X part we have specified a dot, and dot means take into account all the variables. So type tilde dot is the formula; then, after the comma, training is the data set that we will be using, since we split the data set in the 80:20 manner and want the training portion; and then our family for Y is binomial. Binomial means Y can basically take on only two values: yes or no, true or false, 1 or 0. We have to specify that as well. So let's pass this command into R and see what our model is. This is our model; let's run it, and then we will get the summary of the model.

Now what is the summary of the model? Whatever values get calculated will be shown in the answer, so let's calculate them. This is what you get when you look for the summary of the model. You get a lot of values here; don't worry, I'll be explaining everything to you guys. The Estimate column is basically the values of your coefficients, the beta-naught, beta-1, beta-2 that we saw in the previous example. You have a lot of independent variables here, and hence a lot of coefficients; these will be used when you predict your values, and you don't have to worry about them, because they are picked up automatically.

The next thing that you should focus on here is this: the significance codes that R provides. Basically these tell you how significant each particular independent variable is; in our summary we have got these significance codes, so let's see what they actually mean. Three stars, or three asterisks, in front of an independent variable mean the model is 99.9
percent confident that that specific independent variable is significant to the model, meaning it will add to the accuracy of the model. Two stars, or two asterisks, mean your model is 99% confident that the variable is significant and contributes to the accuracy of the model, so the prediction of the value will be more accurate. One star, or one asterisk, means it's 95% confident, and a simple dot means it is 90% confident. So if we look at our data, our intercept is obviously significant, because it's a constant; then this particular value, glu, is significant, our bmi is significant, ped is significant, and the rest of the values, it says, are not significant. What we are doing over here is basically optimizing our model. Now, if a variable is not significant, you cannot straight away remove it; there are other things that you have to check first, and we'll see what those are.

There are three more things over here: the null deviance, the residual deviance, and the AIC. So what is null deviance? Null deviance is basically the deviance of your model from the actual values of your data set when the model is null. What is the meaning of null? It means you are not using any of your independent variables, only the intercept, this constant; or, if you go by the equation, you are only using the beta-naught and not the beta-1 X1, the beta-2 X2, and the beta-3 X3. How deviant is your model from the actual values then? It is 311 units. When we talk about the residual deviance: your residual deviance is the deviance when you include your independent variables in the model.
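The model fit and the three quantities just discussed (null deviance, residual deviance, AIC) can be pulled out directly in R. This sketch again uses MASS's Pima.tr as a stand-in for the session's CSV, so the exact numbers will differ from the 311.15 / 229.35 / 245.35 quoted in the session:

```r
library(MASS)  # Pima.tr diabetes data (stand-in for the session's CSV)

data <- Pima.tr
data$type <- as.integer(data$type == "Yes")

# type ~ . means: predict type from all the other columns
model <- glm(type ~ ., data = data, family = binomial)

summary(model)        # coefficients, significance stars, deviances, AIC

# The three quantities discussed above, pulled out individually:
model$null.deviance   # deviance with only the intercept (beta-naught)
model$deviance        # residual deviance, with all predictors included
AIC(model)            # should be as small as possible
```

As the session argues, the residual deviance is always smaller than the null deviance, because adding predictors can only improve the fit to the training data.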
is basically the deviance when you include your independent variables in your model. When you include your independent variables, this deviance comes down to a smaller number, which makes sense, because using the independent variables makes your model more accurate. So this value comes down to 229.35 from 311.15, which is correct. Now the third value is the AIC. This value should be as small as possible, and it is helpful when you are removing unnecessary independent variables from the data set. So now we will optimize our model, and we do it like this. We know that npreg, the number of pregnancies, is not a significant independent variable; according to its calculations, R is telling us it is not contributing to the model. Even BP is not a significant variable, skin, the skin fold test, is also not significant, and according to R, age is also not significant. Now how can you double-check this? Like this: we remove the independent variables one by one and check the difference in the values. Basically, your residual deviance should not increase and your AIC should decrease; if both of these hold, then removing the variable is right. So let's do that. Let's call our model function and remove age from it. If we remove age and then call the summary of the model, let's see what values we get. All right, so my residual deviance was 229.35 before, and now it has increased to 231.57, which is not right. Also, my AIC had to come down if I am removing a variable, but it was 245.35 earlier and it has actually increased to 245.57. That means age is a significant variable, hence it cannot be removed
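The fitting and summary steps discussed above can be sketched in R. This is a minimal, self-contained sketch on simulated data; the column names (glu, bmi, type) and the values are stand-ins of my own, not the actual diabetes data set used in the session:

```r
# Simulated stand-in for the diabetes training split used in the session
set.seed(1)
n <- 200
train <- data.frame(
  glu = rnorm(n, mean = 120, sd = 30),  # made-up glucose values
  bmi = rnorm(n, mean = 32,  sd = 6)    # made-up BMI values
)
# Binary outcome driven by the two predictors
train$type <- rbinom(n, size = 1,
                     prob = plogis(-9 + 0.04 * train$glu + 0.10 * train$bmi))

# type ~ . means: model 'type' using every other column;
# family = binomial makes this a logistic regression
model <- glm(type ~ ., data = train, family = binomial)

summary(model)        # Estimates, significance codes (*** ** * .),
                      # null/residual deviance and AIC, as described above
model$null.deviance   # deviance with the intercept only
deviance(model)       # residual deviance, with predictors included
AIC(model)            # should be as small as possible
```

The residual deviance is always at most the null deviance, since the intercept-only model is a special case of the full one.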
right, so this is the conclusion we get by interpreting the data. The next not-significant variable is BP, so let's take BP next. BP is again marked not significant, so let's try and remove it. If we remove BP, let's see what changes we get; let's get the summary of the model. All right, so if I remove BP: earlier my residual deviance was 229.35 and my AIC was 245.35. When I remove BP, my AIC goes down, but my residual deviance goes up by about one unit: it is 230.33 now, compared to 229.35 before. So again, I don't think BP can be considered for removal, and hence we will not remove BP. Let's try the other variables. What are we left with? Our npreg is not significant; we have tried BP, and we have not tried skin as of now. So let's try the number of pregnancies next. We remove npreg from our model and get the summary. All right, as you can see, my residual deviance was 229 over here and it's 231 now, a significant change, and my AIC is also about the same if I remove npreg, so we will not be removing npreg either. Okay, let's try the last variable, which is skin. So we remove skin from the model and get the summary. Okay, so before, my AIC was 245 point something, and it has reduced, awesome; and my residual deviance was 229.35 and now I am getting 229.64, not a significant change, and my AIC has also reduced. So this can be considered: we can remove the skin variable. We will go ahead with this, we will remove the skin variable, and our model now seems to be optimized; I have removed the insignificant part. So let us go back to our slide
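The drop-one-variable check just walked through can be sketched with `update()`. Again the data here is simulated, with a pure-noise column playing the role of an insignificant variable like skin; the names are my own stand-ins:

```r
# Simulated data: 'noise' plays the role of an insignificant predictor
set.seed(2)
n <- 300
d <- data.frame(glu = rnorm(n, 120, 30), noise = rnorm(n))
d$type <- rbinom(n, 1, plogis(-5 + 0.04 * d$glu))  # outcome ignores 'noise'

full    <- glm(type ~ ., data = d, family = binomial)
reduced <- update(full, . ~ . - noise)   # drop the candidate variable

# The rule used in the session: keep the reduced model only if the
# residual deviance barely increases and the AIC goes down
c(dev_full = deviance(full), dev_reduced = deviance(reduced))
c(aic_full = AIC(full),      aic_reduced = AIC(reduced))
```

Note that the residual deviance of the reduced model can never be lower than that of the full model, so the question is only whether the increase is small enough for the AIC (which penalizes extra parameters) to come down.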
right, so let's move ahead. As you can see, the insignificant field was skin, which has now been removed. So what does null deviance mean? I had told you guys: it is how well the response variable is predicted by a model that includes only the intercept, which is this. And what does residual deviance mean? It means how well the response variable is predicted with the inclusion of the independent variables. This is what we have learnt. Moving ahead, let's now predict values for our testing data set and check the accuracy of our model. So what we'll be doing now is predicting values from the model using the testing data set. Let's pass this and check what this variable has in store for us. So if I type in res over here, it gives me these values, and these values are all probabilities, as you can see: 0.045, 0.628, 0.200, 0.36 and so on. So this is the testing data set that has been predicted, and these probabilities have been attached to it. Patient number two has a probability of 0.051 of being diabetic. If you want to check this, let's quickly do it for patient number two. If it's 0.051 for patient number two, that is close to zero, so he should not be diabetic. Let's check the actual values: for the second patient I get a zero. All right, so our model is predicting right. Patient number two does not have diabetes, the actual value is zero, and my model also says he has only about a 0.05 probability of having diabetes. So far so good; my model has been correct. Let's check one more value. For patient number six my probability is 0.628, which is greater than 0.5, so he should be diabetic. And if we check, yes, my patient is diabetic. So my model is doing right; it is predicting correct values
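The prediction step just described, and the confusion-matrix accuracy check that comes next, can be sketched end to end. Everything here is simulated stand-in data of my own; the 0.5 and 0.3 cut-offs are the thresholds discussed in the session:

```r
# Simulated train/test split standing in for the real diabetes data
set.seed(3)
make_data <- function(n) {
  d <- data.frame(glu = rnorm(n, 120, 30))
  d$type <- rbinom(n, 1, plogis(-5 + 0.04 * d$glu))
  d
}
train <- make_data(200)
test  <- make_data(80)

model <- glm(type ~ glu, data = train, family = binomial)

# type = "response" returns probabilities between 0 and 1
res <- predict(model, newdata = test, type = "response")

# Confusion matrix at a 0.5 threshold: actual value vs predicted value
cm <- table(actual = test$type, predicted = res > 0.5)
cm

# Accuracy = correctly predicted instances / total instances,
# i.e. the main diagonal of the matrix over its total
accuracy <- function(threshold) {
  m <- table(test$type, res > threshold)
  sum(diag(m)) / sum(m)
}
accuracy(0.5)
accuracy(0.3)   # the alternative threshold tried later in the session
```

On the session's numbers the same formula gives (47 + 15) / (47 + 15 + 13 + 9), roughly 74%.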
Now how will we check the accuracy of the model? We will check it like this. We have predicted from the model using the testing data set, and we have typed type equals response. Response means we want a probability out of the predict function. So we write predict, and in the predict function we pass the model it has to predict from, we pass the data set whose values it should use, and the type of the answer, which should be response, and we store all of this in the res variable. We saw the res variable: it has all the probabilities. The next step is creating a confusion matrix. Now, what is a confusion matrix? It is actually very simple once you understand it, but it might be a little confusing at first, so be a little careful here when I am telling you. My confusion matrix is something like this: this is my actual value and this is my predicted value. The actual value is something which is there in my data set, and the predicted value is something which has been predicted by the model. In my actual values, my data set had the values 0 and 1, so the type could be either 0 or 1, and my predicted value is either false or true: false means he is non-diabetic, that is 0, and true means he is diabetic, that is 1. So my confusion matrix is something like this; let me draw it here. When my actual value was 0 and my predicted value was also false, that is, my patient was not diabetic and the model also predicted that he is not diabetic, this scenario was found 47 times in my data set. Which means 47 times in my dataset
it has occurred that my actual value was 0 and my predicted value was also 0. And 13 times it has occurred that my actual value was 0, my patient was not diabetic, but my model predicted that my patient is diabetic. So this is an error, and it has occurred 13 times. Again for the value 1, the patients who were diabetic: my model predicted that the patient is not diabetic while he was actually diabetic in the actual values; this has occurred 9 times in my model. And the data wherein my patient was actually diabetic and my model also predicted that he is diabetic has occurred 15 times. So this is how you read a confusion matrix. Now, if you were to find the accuracy in this case, the formula is basically the main diagonal, 47 plus 15, divided by every record in the confusion matrix. Why is this? Let me explain. When my actual value was 0 and my predicted value was also false, 47 is the number I predicted correctly. Again, when my actual value was 1 and the predicted value was also 1, 15 was the number, so 15 is also correct. But when my actual value was 0 and my predicted value was true, that was 13, so this is an error; similarly, when my actual value was 1 but my predicted value was 0, or false, the value is 9, so this is again an error. So if we were to find the accuracy, the accuracy is the correct number of instances over the total instances. The correct instances are 47 plus 15, divided by 47 plus 13 plus 9 plus 15. This is what I've done in the next step: for calculating the accuracy I've written 47 plus 15 divided by 47 plus 15 plus 13 plus 9, and it came out to be roughly 74%. Let's check in our model as well; we will pass the testing data set now. So for passing the confusion
matrix, this is the syntax; I will explain the syntax in a minute, let's see the output first. I'm assuming my threshold to be 0.5 as of now. So let's predict. This is my confusion matrix. If you see: my actual value was 0 and my predicted value was also 0 happened 52 times. My actual value was 0 and my predicted value was true, as in my patient was not diabetic but my model said he is diabetic, happened only one time in my model, so this is an error. Similarly, my actual value was 1 and my model predicted that he is not diabetic, and this is very dangerous, right? If you have a disease and you go to the doctor and the doctor says you're fine, you have been given a wrong consultation, which could be harmful; it could cost you your life as well. So this is a very sensitive number, and it should be the minimum. It is okay if you are not diabetic and the model says you are diabetic, because you are actually fine; but in this case you actually have diabetes and the doctor is saying you don't, and you might end up risking your life. And this is just diabetes: what if this model were for cancer? You had cancer and the doctor said you're fine; it could be a huge catastrophe. So, like I said, let's now calculate the accuracy of our model from the confusion matrix. It is 52 plus 18 divided by 52 plus 18 plus 12 plus 1. My accuracy now comes out to be about 84%, which is a very good number, guys. So we have optimized our model correctly and got the accuracy to 84.3. Now, we have assumed the threshold to be 0.5, but how can we be so sure about the threshold? What if I change my threshold and my accuracy actually increases? And also, this false-negative number has to be reduced; it should be the minimum. What if this number can further go
down? All right, so for that we need the correct threshold. One way of finding it is by hit and trial: trying each and every threshold and seeing the effect on the model, on the accuracy of the model, or on the confusion matrix. But is there any other way to find the threshold? Let's think about that. Don't think too much: R actually has a package called ROCR, which gives you the ROC curve that is used to choose the threshold for your model. Let's see how that is done. First you have to pass in the predicted values, and the predicted values should come from the training data set, because you are calculating the threshold from the training data so that it can then be applied to whatever values you pass to your model. So your threshold will be calculated from the training data set. Let us do that: our res currently has the predicted values from the testing data set, so let us change res to use the training data set. Let's run this command. Okay, next is running the ROCR commands. The first step towards creating your ROC curve is including the library in your environment, so first you include the library ROCR, and then you pass the next lines of code. You have stored the predicted values for the training data set in the res variable, and you have imported the ROCR package; the next step is defining the ROCR prediction and the ROCR performance. Prediction and performance are the objects the curve uses to plot the graph, so you have to specify them. Inside prediction you specify the predicted result set, comma, the actual result set that is there in the data; the predicted res and the actual values are compared, and it is stored in ROCRpred, all
right. After that you check the performance: here you pass ROCRpred, so you will see how good your prediction is, and then you plot it against tpr and fpr. TPR is basically the true positive rate and FPR is the false positive rate. So you specify tpr and fpr, and you check the performance using the prediction that you just made. Let's do that. We have included the library, so let's pass these two commands. All right, so my ROCRpred and my ROCRperf have been created. The next step is plotting the graph. For plotting, I will plot ROCRperf, the performance object, and I can also colour the graph: I have set colorize equal to T, which colours the graph so that as the cut-off values change, the colour changes too. And I want the cut-offs printed on the graph as well; print.cutoffs.at basically prints the cut-off values on the curve so that it becomes easier for me to choose the threshold. Once you execute this command, this is how your curve will look. Let's pass this command and plot the ROCR curve. All right, so this is the graph that you get, and you interpret it like this: the x-axis is the false positive rate, which should be the minimum, and the y-axis is the true positive rate, which should be the maximum. If you go by this graph, between 0.3 and 0.2 there is a lot of gap in the false positive rate; so if I choose 0.2 as my threshold, I will actually be moving towards a higher false positive rate and my model will go towards lower accuracy. But 0.3 gives a higher true positive rate and at the same time the least false positive rate. If we compare it to 0.5, 0.5 actually had a lower
true positive rate: though it had a lower false positive rate as well, its true positive rate was lower too. Now what does this mean? It means this false-negative value is dangerous, because if I were a cancer patient and the doctor told me I am okay, it would be very dangerous for me. So we have to reduce this value, and it can be reduced by increasing the true positive rate. This might cost you some accuracy of the model, but in this scenario our aim is to serve humankind: your model should predict the right values where it matters most. It is okay if the model predicts that I am diabetic even if I am not, but if I am diabetic and it tells me that I am not, that is the dangerous case, and that is the one that should be more accurate. That is what we are trying to solve here. So it depends on your use case what kind of threshold you choose. You may have some other use case where this particular value does not matter; in that case you would choose your preferred threshold. But in our case, the medical case, where you are predicting whether a person has a particular disease or not, you have to reduce this number. So according to our graph I should choose 0.3 and not 0.2; and if you compare with 0.4, 0.4 and 0.3 actually have a very small margin of false positive rate, so I'll go with 0.3. Now let us check the threshold 0.3 on the confusion matrix for our testing data set and see the difference. We will copy this and paste it here, so our result should be the predicted values for the testing data set, and let me do it for 0.5 first so that we can compare the confusion matrices. So this is testing. All right, so this is the confusion matrix for threshold 0.5 for my testing
data set. Now let us do it for 0.3. Okay, so as you can see, the value for the case where my patient was diagnosed wrong, in the sense that he had the disease and was told he does not have it, has been reduced to six by changing the threshold to 0.3. At the same time, I think the accuracy has also changed; let's check. For the first confusion matrix the accuracy is 52 plus 18 divided by 52 plus 18 plus 12 plus 1, which is 84.33. Now let us check the accuracy for the 0.3 threshold: it is 48 plus 24 divided by 48 plus 24 plus 6 plus 5. Let's check. Awesome, my accuracy has actually increased to 86.7%, and at the same time the false-negative value has gone down. So this has been a very fruitful exercise: we found that with a threshold of 0.3 we get 86% accuracy, and at the same time the model has improved, because the false negatives have been reduced. If my patient is diabetic, the chance that he will be diagnosed wrong has gone down. Any doubt in this part, guys? We have reduced the false negatives from 12 down to 6. Usually, when you solve other data sets, bringing this value down might cost you a little accuracy, but since this number is more important, you let go of some accuracy and stick with it. In our case, though, the accuracy has actually gone up, which is a good sign. So we have about 86% accuracy, which is tremendous for our model, and we have also brought down the number of false negatives. Awesome, guys: the accuracy on my testing data set is 86%, which is awesome. Now let me recap what we did. We got to learn how to create a model, then we divided our data set into testing and training, then we trained our model
using the training data set and we tested the model using the testing data set. When we tested the model using the testing data set with the 0.5 threshold, we got an accuracy of 84.33%. Then we were not sure whether that threshold value was right or wrong, and we also wanted to reduce the false negatives, so we went to the ROCR curve. We saw that 0.3 is an optimum value for the true positive rate versus the false positive rate, we set the threshold to 0.3 in our model, and the accuracy actually improved, to 86.74%, while at the same time the false negatives were reduced. So this is what we have done until now, and it has been fruitful. We are done with the demo, guys; any doubts in the demo part, you can come up with a question right now and I shall explain it to you. All right, no doubts. Okay, let's move on to the next part of today's session, which is the use cases. So in 2005 we had this use case wherein we had to predict which areas in Africa were breeding grounds for malaria, so that help could be provided there, and what we had was a limited set of values for a small part of Africa where malaria was actually present. Now, along with a geographic information system, what we did was take the topography of the areas where malaria was present. Malaria basically occurs where there is dense vegetation and a lot of water, so there are a lot of mosquitoes over there and hence malaria sets in. So using the geographic information system we captured the topography of the areas where malaria was more prevalent, and at the same time we had areas where malaria was less prevalent, along with their topography. This data was used to create a model, and then we had to predict: now we had the topography for the rest
of the areas. So, say we have the topography for a given area: that area is fed to the model, and the model predicts whether the area will have malaria or not, and based on the probabilities we came up with a plan for which areas should be given care first. So if this area is orange and, as you can see, this one is yellow, then this one has a higher probability of being malaria-prone, so our help could go to this area first and then maybe the other one, because this area might need more help from us. This is how logistic regression was used: we fed in the geographic information system data, came up with a model, and that model was used to predict the other areas of Africa where malaria could occur. Our next use case is in marketing. Basically it is used by marketing research firms where you have to know your customer: when you have to predict whether a customer will buy a product or not, that can be done using logistic regression. Logit analysis is used by marketers to assess the scope of customer acceptance of a product. For example, if our company comes up with a product, say a shaving gel, and we look at past market trends, that is, what kind of product has been more successful when it comes to shaving creams, we can create a model, feed the features of our new product into that model, and the model can predict whether our product can be successful or not. So it is used by marketing research firms, and it is also used in ad targeting, that is, deciding what kind of content should be shown to you. Say I want to predict whether you will buy a pen drive or not; that will depend on your search history, whether you have searched for a pen drive in recent days, whether you were looking for a pen drive on, say, Amazon, or something like that,
and based on those values my model will predict whether you will buy a pen drive or not. Based on all this information, marketing research firms come up with data and sell it to companies that need that kind of data. So if I'm an e-commerce company and I need to know what kind of products I should display on my website so that more customers come, I can do that using logistic regression: I can ask the model which product would be more successful and keep that product in my e-commerce catalogue. So these are the two use cases that we just discussed. Okay, guys, with this we have come to the end of our session for today; we are done with logistic regression. Any doubts in anything that we discussed today? All right, so Dinesh is saying thank you, nice session. You're welcome, Dinesh. Okay, since none of you have any doubts, let me wrap up the session for today. Let me recap what all we did in today's session. First we looked at the five questions which can be asked in data science, and according to that we got to know which algorithms can be applied; then we discussed what regression is. After that we moved on to logistic regression, what it is and why it is used, and then we saw the working of logistic regression using an example. After that we did a demo on the diabetes use case, where we applied logistic regression and came up with a model which can predict whether you are diabetic or not based on specific values, and in the end we saw use cases where logistic regression has actually been applied. So, guys, this is what we did today. Any doubt in any part that we have discussed till now? I'm asking again; I can repeat it all over again if you have any issues. All right, since none of you have any more doubts, let's wrap up today's session. Thank you, guys, for attending; it was a
pleasure to teach you guys, and I hope you learnt something new today. Assignments have been uploaded to your LMS; I expect you guys to solve those assignments before our next session. Also, since this was a slightly complex topic, I expect you to go through the recording again in your LMS and come more prepared for the next session, because our next session is going to be on polynomial regression, and it actually builds on logistic regression as well, so be prepared with logistic regression so that it becomes easier for you to understand what polynomial regression is. All right, guys, this brings us to the end of our session. Thank you, everyone, see you in the next session, goodbye. I hope you enjoyed listening to this video. Please be kind enough to like it, and you can comment any of your doubts and queries; we will reply to them at the earliest. Do look out for more videos in our playlist, and subscribe to our Edureka channel to learn more. Happy learning!
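As a short editorial footnote to the ROC discussion in the session: the session builds the curve with the ROCR package (prediction(), performance(), then plot with colorize and print.cutoffs.at), but the quantities it plots can be computed by hand. This toy sketch uses made-up labels and probabilities of my own, not the session's model output:

```r
# Ten made-up predicted probabilities with their actual labels
actual <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
probs  <- c(0.10, 0.20, 0.25, 0.45, 0.65, 0.40, 0.55, 0.60, 0.80, 0.90)

# One point on the ROC curve per threshold:
# TPR (y-axis, want high) and FPR (x-axis, want low)
roc_point <- function(threshold) {
  pred <- probs > threshold
  c(tpr = sum(pred & actual == 1) / sum(actual == 1),
    fpr = sum(pred & actual == 0) / sum(actual == 0))
}
sapply(c(0.2, 0.3, 0.5), roc_point)  # one column per candidate threshold
# On this toy data, 0.3 keeps TPR at 1 with a lower FPR than 0.2,
# the same kind of trade-off that led the session to prefer 0.3 over 0.2
```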
Info
Channel: edureka!
Views: 142,178
Rating: 4.9317698 out of 5
Keywords: logistic regression, logistic regression in r, logistic regression machine learning, logistic regression machine learning tutorial, logistic regression example in r, logistic regression in r tutorial, logistic regression in r studio, logistic regression in r edureka, edureka, data science training, data science tutorial, data science edureka, r edureka, logistic regression tutorial, machine learning algorithms, logistic regression algorithm
Id: Z5WKQr4H4Xk
Length: 69min 12sec (4152 seconds)
Published: Fri May 12 2017