Logistic Regression | Logistic Regression Machine Learning | Logistic Regression in Python

Captions
Hey guys, I welcome you all to this session by Great Learning. Classification techniques are an essential part of machine learning, and today we will go through one of the most widely used classification techniques: logistic regression. With the help of logistic regression you can identify whether a tumor is malignant or not, it helps in telling whether an email is genuine or spam, and you can use it to find out whether a credit card transaction is fraudulent or not, so there is a surfeit of applications of logistic regression. Keeping its importance in mind, we have come up with this comprehensive tutorial. The session will be taken by Professor Mukacheve, who has over 20 years of industry experience in market research, project management and data science. Before we start, I'd like to inform you that we'll be coming up with a series of high-quality tutorials on artificial intelligence, machine learning and computer vision, so please do subscribe to Great Learning's YouTube channel and click on the bell icon so that you get notified of our upcoming videos. In this tutorial we'll start by understanding the core concepts of logistic regression and then implement a demo with the scikit-learn library. So let's start off with the session.

The name logistic regression is slightly a misnomer: it is used for classification, it cannot be used for regression. Though the name says regression, there is a reason it is called logistic regression and not logistic classifier: under the hood logistic regression uses a linear model, linear regression, so it is a classification method based on a linear regression. The response variable, that is the target variable, can be binary (default or non-default, diabetic or non-diabetic), or it can be multi-class; I can use logistic regression for optical character recognition, for example. From personal experience, and from papers I have read, when we compare models logistic regression often comes out on top, and if not exactly on top then in the top two or three for classification. I am going to start by showing you binary classification, but this can be used for multi-class classification also. Given your independent variables, what is the class likely to be? That is the objective of this model.

I can also use a function available in scikit-learn where, instead of predicting which class a test record belongs to, I can print out the probability that the record belongs to this class versus that class. This function is not unique to logistic regression; it is available in all probability-based models. You remember we used predict; we just add underscore proba, and predict_proba will print out probability values, while plain predict prints out the classes. Decision trees have it, the naive Bayes algorithm has it, and logistic regression has predict_proba as well.
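As a quick illustration of the difference between the two functions, here is a minimal sketch; the dataset, variable names and train/test split are my own assumptions for demonstration, not something shown in the session:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A small binary-classification problem (malignant vs benign tumours)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = LogisticRegression(max_iter=5000)   # larger max_iter so the solver converges
model.fit(X_train, y_train)

print(model.predict(X_test[:3]))            # hard class labels, e.g. [1 0 1]
print(model.predict_proba(X_test[:3]))      # per record: [P(class 0), P(class 1)]
```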
Let's take a hypothetical situation and discuss this concept; don't worry about the PPT for now. You are a data scientist working in a bank, and you have been tasked with building a model that will take certain inputs and tell you: if I give this person a loan, is he likely to default or not default? You have to build that model. What kind of data will you need? To decide, you might use the person's age, gender, income, any existing loans, any property he owns, and so on. Gender is a touchy subject, so let's take age: one of the inputs given to you is the age of the applicants. The bank has given you a historical data set in which you have both defaulters and non-defaulters, and one of the columns is age.

You notice a pattern here: most of the defaulters, most, not all, are in the lower age bracket, and most of the non-defaulters are concentrated in the higher age bracket. That does not mean there are no non-defaulters at lower ages, but you observe this pattern. Since there are two classes, we assign the green class a numerical value of one and the red class a numerical value of zero. Remember we cannot work on string data types, so the target column is given numerical values; you can flip the assignment if you wish, but we assign only zero and one, no other values, and there is a reason for that.

Now, as age increases you find more greens, and as it decreases you find more reds. You have already built a linear model, so you try a linear model here: let the y-axis be the probability of belonging to one of the classes, say class one. You could just as well take it as the probability of belonging to class zero; the two probabilities are complements of each other, so you can take whichever you like. With a linear model, as age increases the probability of belonging to the green class increases linearly; if I had taken the y-axis as the probability of class zero instead, the line would simply be reversed. I am taking class one for ease of explanation, because you will be able to connect it to what we did yesterday. So the line is y = mx + c, where y is the probability of belonging to class one, x is the variable (age), m is the slope and c is the intercept; in this example c will be negative.

But the problem with the linear model is that it goes to plus and minus infinity, whereas probabilities have to be between 0 and 1; they cannot be below 0 or above 1. So we cannot use the linear model directly. We use another technique where we transform this line into an S-curve. This S-curve is called a sigmoid, and it is very easy to achieve: the sigmoid is nothing but p = 1 / (1 + e^-(mx + c)), where e is Euler's constant from mathematics. The best-fit line that is found for you is fed into this transformation, and the result is the sigmoid curve, whose property is that it approaches 0 towards minus infinity and approaches 1 towards plus infinity, which is exactly the property a probability should have.
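To make the transformation concrete, here is a small sketch of that formula; the slope and intercept values are made up purely for illustration of the age example:

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x, m, c):
    """Squash the linear model m*x + c into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-(m * x + c)))

age = np.linspace(18, 80, 200)
m, c = 0.15, -6.0                   # hypothetical slope and intercept
plt.plot(age, sigmoid(age, m, c))   # probability of belonging to class 1 (green)
plt.xlabel("age")
plt.ylabel("P(class = 1)")
plt.show()
```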
Now let us interpret what this curve is telling us. Keep in mind we are building the model, which means we are working on the training set, and in the training set we already know the labels of every record; the algorithm gets both the input variables and the output variable, x and y. All the labels are already known, and they are shown by the colours here. Shall we proceed? Now that I have the sigmoid, I can remove the straight line; it goes to infinity and we do not need it any more.

What is this model telling you? Take this red point. The model is saying that the probability of this point belonging to the green class, supposing we did not know its colour, is very low, which means it belongs to the red class. So the model prediction and the actual label are matching here. Take this green one: the model says the probability that this point belongs to the green class is very high, and it actually is green, so again the prediction and the actual match.

Are you with me? Don't get left behind; your expressions tell me a lot. I have taken more than 250 sessions in data science now, and based on the positions people occupy in the class I can draw a lot of inferences; people who are feeling doubtful but are not voicing their questions, I can identify them, so feel free to ask. I was always a member of the backbencher society myself, so I know the back bench better than the front bench. That was a deviation from our main topic; the bottom line is, don't hesitate to ask.

Coming back to the discussion. Now look at this point: you see a red point here, and the model is saying that, given its age, the probability of it belonging to the green class is very high, so the model will classify it as green, but it is actually red; the model has made an error here, a mistake. Similarly, look at the green point on the other side: the model says the probability of it belonging to the green class is very low, so it is likely to be red, but it actually is green, so the model has made an error here too. Whenever you have overlapping data sets, as in this case, in spite of the best-fit line you will find some misclassifications; these are called training errors, errors on the training data.
Which one is minus one? Neither; this is 0 and this is 1, and in this particular picture the vertical axis is the probability of belonging to class 1, which is why the S-curve comes out like this. This S-curve has been built out of a linear model. If we had taken the axis as the probability of belonging to class 0 instead, it would be a reverse S-curve, because the underlying line would be reversed as well.

Do all models have a problem with outliers? Yes. In linear models we don't have any boundary; they go to plus and minus infinity, and we don't want that, because probabilities have to be between 0 and 1. That is why the line is converted into a sigmoid: the sigmoid curve always remains between 0 and 1, so we are able to map the numerical values to a probability.

Yes, in multi-class classification it will be one versus rest. Suppose I use logistic regression for OCR, optical character recognition: you will have one S-curve for A versus the others, one S-curve for B versus the others, and so on. In your mathematical space you will have multiple S-curves cutting each other, one for A versus the rest, one for B versus the rest, C versus the rest, and so forth.

All of these algorithms, decision trees, naive Bayes, support vector machines, neural networks, logistic regression, break your mathematical space into compartments, pockets, where one pocket belongs to this class and another pocket belongs to that class. You did decision trees yesterday; which data set did you use? The Pima Indians diabetes data set, which is actually a binary classification problem. When you make a decision tree, the entire mathematical space, all the data points put together, is your root node. Then you find one particular column on which you split the data into two, which is like putting a vertical boundary, and the two sides become child nodes. But you have not yet achieved homogeneity, your Gini index or entropy is still high, it is not yet a pure leaf, so you again find a column in that subset and split it into two, which means drawing another boundary, giving further child nodes. So a decision tree also partitions your mathematical space into pockets such that, at leaf level, each pocket becomes homogeneous; the conditions under which a record falls into a pocket are the rules of the decision tree. All algorithms break your mathematical space into pockets, including logistic regression.

Here we treat one class as one; this is binary classification. If you are doing multi-class classification, say optical character recognition, then at one point in time you treat all data belonging to class A as ones and the rest as zeros and see what the boundary is, then treat all the records belonging to class B as ones and the rest as zeros and find that boundary, and so on. So you will have multiple boundaries, hypersurfaces, cutting each other, one for A, one for B, one for C, and the whole model is the composite of all these surfaces; a small sketch of this one-versus-rest idea follows.
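Scikit-learn can fit exactly this kind of one-versus-rest scheme for you; here is a minimal sketch on the built-in digits data, which is my own choice of example rather than something from the session:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# A 10-class problem: handwritten digits 0-9 (a simple stand-in for OCR)
X, y = load_digits(return_X_y=True)

# One binary logistic regression (one sigmoid surface) per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=5000))
ovr.fit(X, y)

print(len(ovr.estimators_))          # 10 underlying "class k vs rest" models
print(ovr.predict(X[:5]), y[:5])     # predicted vs actual digits
```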
Is it always one versus rest in multi-class classification? Most of the algorithms give you one versus rest by default. Some algorithms give you the facility of replacing one versus rest with Bayesian probabilities; you can modify that in the hyperparameters, but most algorithms will by default give you one versus rest. For some algorithms, like support vector machines, one versus rest is not very efficient, so they offer another technique based on Bayesian probability. Let's move on.

Logistic regression has a gradient descent algorithm working under the hood to find the best-fit line. The linear model uses gradient descent; remember the gradient descent discussion we had yesterday? I'll quickly summarize, I know it was too much to cover in one day. Your error function will always have a convex form. It is not very difficult, guys: if I give you the equation y = x squared, then for any value of x, y will always be positive, because it is a squared term; if x is minus one, minus one into minus one becomes plus one. This is called a convex function, and whenever you have a function raised to the power two it will always be convex. That was a single dimension, but x can be multiple dimensions: if I take the two parameters m and c and plot the error, it will be a bowl shape, and in this bowl shape you will have only one global minimum; for quadratic error expressions you are guaranteed exactly one global minimum. So we start with some random m and c, that random m and c gives some error, and from there the algorithm comes down towards the minimum using partial derivatives. There is a lot to discuss here that I am not going to cover now, but you will pick it up when you do deep learning and neural networks; at that time we'll take it up in detail.

The same concept applies to logistic regression, because it uses a linear model under the hood: it finds the best-fit line, or best-fit plane or hyperplane, given the spread of the data points in the mathematical space. Once that plane is found, it is sent through the sigmoid transformation and converted into the S-curve. In the linear case the driving force was the quadratic error; here, the S-curve that is formed can be shifted or flattened. Let me see if I can show you: you can move the curve up or down by changing the c in y = mx + c, and to flatten it, what do you think we use? The slope m; by changing the slope I can flatten the curve, and by changing c I can shift it. So by using these two parameters, m and c, the algorithm searches for the best sigmoid surface among the infinitely many possible ones, and to decide which surface is best it uses a function called log loss, which plays the same role the error function played in gradient descent. Don't worry; it looks dangerous, but it is actually very simple.

Let us look at the log loss function. Forget the sigma for a moment and just take the expression inside: for a single record the loss is -[ y_i · log(p_i) + (1 - y_i) · log(1 - p_i) ], where y_i is the actual label and p_i is the predicted probability of class one, and the total loss is the sum of this over all records; it is sketched in code below.
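Here is a tiny sketch of that expression in code, with made-up labels and predicted probabilities chosen to mirror the four cases discussed next (correct green, wrong green, correct red, wrong red):

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy: -mean(y*log(p) + (1-y)*log(1-p))."""
    p = np.clip(p_pred, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 1, 0, 0])                     # actual classes (1 = green, 0 = red)
p = np.array([0.95, 0.05, 0.10, 0.90])         # model's P(class = 1) for each record

print(log_loss(y, p))   # the two confident-but-wrong records (0.05 and 0.90) dominate
```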
Look at this expression. What is y? y is the target variable, and we know the target variable takes only the values zero and one: one is for green and zero is for red, so y_i can be 0 or 1. Now let us look at four situations, four test records.

Case one: for the first record, the model says the probability of belonging to the green class is very high, and the record actually is green, so there should be zero error. Check the formula: the y value for a green record is 1, multiplied by the log of a probability that is very high, almost 1; the log of a number close to 1 is close to 0 (2 raised to the power 0 is 1, since this is log to the base 2, but the same holds in any base). The second term contains 1 minus 1, so we don't even need to evaluate it, it is 0. So when you correctly classify a green point, the error is 0.

Case two: take this green point on the other side. As per the model, given its x-value, the probability of it belonging to the green class is very low; anything below 0.5 is low, so the model is predicting that this record is red, but it is actually green. Its y value is 1, and the model's probability for the green class is very low; the log of a very small number is a very large quantity, so this term is 1 times a large value. Since the record is green, the second term is again 0 and we do not need to evaluate it. So whenever you misclassify a green point you get a very large error, and the further the green point sits on the wrong side, the larger the error becomes; in technical parlance, the more confident the algorithm is about a classification which we know is wrong, the larger the penalty.

Case three: we know this record is a defaulter, the class is red, and the model, given its x-value, is also predicting red. The label value for a red record is zero, so the first term becomes zero and we don't even need to evaluate it. The second term is 1 minus 0, which is 1, times the log of 1 minus a very small probability, which is almost 1, and the log of a number close to 1 is close to 0. So for a correct red classification the error also comes out close to zero. For red records we use the second part of the equation; for green records we use the first part. Now take the last case, the fourth one, where a red record is misclassified and the model surface says it belongs to the green class.
In this last case, the misclassification of a red record, the model surface says the probability of the record belonging to the green class is very high. The first term is for green records, so forget it; the second term is 1 minus 0, which is 1, times the log of 1 minus a very high probability, that is, the log of a very small number, which is a very large value, so the error is very large. You sum up the errors across all the data points, those correctly classified and those misclassified, and you get the total error across the entire model, just as the sum of squared errors did yesterday. The objective is to minimize this total log loss by finding the right logistic surface: yesterday the sum of squared errors was driving gradient descent, here the log loss drives the choice of sigmoid curve, and finding the line and transforming it effectively happen together, not one after the other. When is the loss exactly zero across the board? Only when every green record is classified as green and every red record as red with full confidence, which, with overlapping classes, will essentially never happen. Shall we move on?

Logistic regression is an extension of the linear model we saw yesterday. The beauty of logistic regression is that it makes no assumption about the distribution of the classes in the feature space. Many algorithms, especially linear classifiers and linear regression, expect Gaussian distributions; you all understand the term Gaussian distribution? The expectation there is that, for each attribute, the data is spread around its central value with some standard deviation, and when you plot the attributes together in the mathematical space, assuming they are independent, you get the bell-shaped distribution we already discussed; in 3D it looks like a bell curve. Logistic regression does not make any such assumption. But keep in mind that if you have outliers in your data set, even this algorithm will fall on its face, because it is based on a linear model, so those outliers have to be addressed. What happens with outliers? If class one has an extreme outlier on the wrong side, you are in trouble; if the outlier is on the other side, no problem; and the more extreme the outlier, the more severe the impact and the larger the errors.

You can do multi-class classification using either the binomial formulation, as in this case, or a multinomial formulation. You can also print out the probability values if you are not interested in the class itself but want the probability of belonging to this class or that class. It is quick to train, because it is a linear model with gradient descent to find the best-fit line, and it is very fast on test data as well. And as I have said many times, I have personally seen logistic regression stand out among the top few models in many situations. It is resistant to overfitting, because a simple linear model cannot take very complex shapes; resistant to overfitting does not mean resistant to bias errors, it means resistant to variance errors.
But bias errors are another matter we have to deal with. You can also print out the coefficients and the probability values and interpret them, so the model is not a black box.

The disadvantage of logistic regression for classification is that it uses linear boundaries. What that means is: if the distribution of the classes is not linearly separable, the model will suffer. If there is overlap but the classes are still shifted apart, it will do well; but if the two classes sit on top of each other on the attribute you have chosen, you see the difference. If I plot the density graph, in one case the two density curves overlap completely; your t-test, analysis of variance and so on will say a record belongs to this class or that class fifty-fifty, and you will not be able to classify. On the other hand, if overlap is there but the central values are shifted apart, so the distributions are at least roughly linearly separable, maybe not one hundred percent, then logistic regression will do well. I don't know whether you noticed, but the Pima Indians data set is like the first case; that is why on the Pima data set you will not be able to get accuracy beyond about 75 or 76 percent, do whatever you can. When the data distribution is like that, none of the algorithms will be able to get you to 100 percent.

So before you start building models you have to know your data, which means you have to know every attribute: how the data is distributed and, if you are doing classification, how the classes are distributed on it. Find which dimensions, which attributes, are able to linearly separate the two classes and use those for your classification models. If an attribute has the two classes significantly overlapping, including the central values, using it to build the model is useless; what the human brain cannot separate, the algorithm cannot separate either. So you have to have at least one attribute that separates the classes. As you will notice in another data set, called bank.csv, the data overlaps on most dimensions between the two classes, but there are two or three dimensions where the central values are separated apart, and based on those it gives you about 96 percent accuracy. If you look at how the density changes as you move along an attribute, you will see it is very low, then increases to a peak, then starts falling; extended to more dimensions, that is the familiar normal distribution.

Can we switch the axis for analysis in this case? Switch means the y-axis becomes the x-axis. What I am saying is: if the classes are separated on an axis, logistic regression will do well; if not, logistic regression will suffer, and any algorithm will suffer. What you are asking is, can't we switch to the axis on which the two classes are separated, and that is exactly what you should do: if on one attribute the classes are not linearly separable at all but on another attribute they are, use the other one, use whichever attribute is helping. A small density-plot sketch of this check follows.
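One quick way to eyeball this kind of separability is to draw a per-class density plot for each attribute; this is a self-contained sketch on made-up data, where the attribute names and distributions are invented purely to illustrate the separable versus overlapping situations described above:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Two invented attributes: one where the class centres are shifted apart,
# one where the two classes sit almost on top of each other.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "separable_attr":   np.concatenate([rng.normal(30, 5, 500),  rng.normal(55, 5, 500)]),
    "overlapping_attr": np.concatenate([rng.normal(40, 10, 500), rng.normal(42, 10, 500)]),
    "class": [0] * 500 + [1] * 500,
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.kdeplot(data=df, x="separable_attr",   hue="class", common_norm=False, ax=axes[0])
sns.kdeplot(data=df, x="overlapping_attr", hue="class", common_norm=False, ax=axes[1])
plt.show()
```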
What about random forests, where every node of the forest is constructed using randomly selected features from the data set? It does do that internally, and random forest is the only algorithm that comes close to what you are asking, but it picks the features randomly; it does not pick them because they separate the classes well.

All right, let's do a hands-on. This is the same data set, the Pima Indians diabetes data set we used earlier, and I will do the hands-on on it. Are you all ready? If you have access to the code, the first line is the matplotlib inline magic, we already know what that means. I am importing LogisticRegression from scikit-learn's linear_model; yesterday we had imported LinearRegression. Then I am importing matplotlib.pyplot as plt and seaborn; I use the seaborn library for plots, especially the pair plot, which comes out very well in it. I am going to use train_test_split, which you already know, and numerical Python, and to measure the accuracy of the model I am going to use the metrics module: whenever you are doing classification, one of the most important evaluation tools is the confusion matrix, and that comes through this metrics library. We will see all these things down the line.

Earlier this Pima diabetes data used to be available in the UCI repository, but I think it has since been removed or archived, so I downloaded it from Kaggle. I am loading the data set here; it is a text file, and while loading it I am feeding in the column names, which I got from the original source. By the way, this study was done among the Pima tribe, a community where apparently the women are prone to type 2 diabetes very early in age; research was done on that community, some of them diabetic and some not, with the objective of building a model to predict who is going to develop type 2 diabetes. The characteristics captured were these: the first column is the number of times pregnancy was reported by the subject; the next is the plasma glucose level, a component of the blood; then blood pressure, the diastolic blood pressure; then skin, because apparently people suffering from type 2 diabetes in that tribe develop skin folds, and this is a measurement of that; then a test value from a blood test; then body mass index; then pedi, a pedigree measure, another diabetes-related score; then age; and the last column is your target, whether this person was diabetic or not, diabetic is one and non-diabetic is zero.

The whole objective is to find out whether there is a relationship between the independent variables and the target variable. We assume that relationship exists in the universe, in the real world, and we are trying to discover whether we can recover it; that is why we build these models. A sketch of the imports and the data loading follows.
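The code being described would look roughly like the sketch below; the file name and the short column labels are assumptions based on the description (a common Kaggle download is pima-indians-diabetes.data), so adjust them to your copy of the file:

```python
# %matplotlib inline  (if you are running this in a Jupyter notebook)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# The raw file has no header row, so the column names are supplied by hand
col_names = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]
pima = pd.read_csv("pima-indians-diabetes.data", names=col_names)

pima.head()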
Based on my experience with the car MPG data set, I always do this next: when you look at head and a few sample records you see all numerical values, but what is the guarantee that there is no record further down with some non-numerical value sitting in a numeric column? How do you identify that? We can use this method: I am using the numerical Python function isreal, a boolean function that returns true or false, and I am applying it to all the rows and all the columns. That is the beauty of Python, where you have to do minimal coding; if you had to do this in Java you would probably write a for loop within a for loop. And this tilde is the not operator: I want only those records where it returns false. Fortunately there are no records where it returns false, which means all records in all columns are numerical. What is the other way of finding out whether any non-numerical values have crept in? dtypes: when you look at the data types of all the columns, any column containing a non-numerical value will be reported as object. That is even easier; then you go and identify the record, convert it to a numerical value or replace it with a mean, median or mode replacement strategy. We don't have any such case right now.

But we have nothing to be happy about, because we see a lot of zeros, and zeros in blood pressure are surprising; a blood pressure of zero is not possible, and a plasma glucose of zero is not possible, yet you have zero values in both. So missing values are there, disguised as zeros, and they will cause problems; we need to address them.

Let's move on and do a describe to see how the data is distributed on the various dimensions, univariate analysis; if you don't like this, the pair plot gives you the same univariate picture visually. One thing I'll quickly point out: for the test attribute, look at the difference between the mean, almost eighty, and the median, around thirty. The mean is drastically shifted away from the median on the higher side, which means there is a long tail on the right; the mean gets easily impacted by outliers, so some outlier is pulling the mean to the right. How do you confirm that? Look at the distance between the third quartile and the maximum, and compare the length of the right-side tail with the left side, where there is barely any tail. Is such an extreme value really possible, or is it a typo? We need to go and check; maybe it is a typo, we don't know.

I am going to move on. After understanding how the data is distributed and deciding to address the outliers and missing values, the next thing to do, since you are doing classification, is to check how many records are available for each class. Look at that: the number of non-diabetic (0) cases is 500, whereas the number of diabetic (1) cases is 268, almost half. All of these checks are sketched in code below.
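A hedged sketch of those sanity checks, continuing with the pima DataFrame assumed in the loading sketch above:

```python
# Any cells that are not numeric? (~ negates the boolean mask)
print(pima[~pima.applymap(np.isreal).all(axis=1)])

# Easier check: a non-numeric value anywhere makes that column's dtype 'object'
print(pima.dtypes)

# Univariate view: spot the suspicious zeros and skewed attributes (mean far from median)
print(pima.describe().T)

# Class balance: roughly 500 non-diabetic (0) vs 268 diabetic (1)
print(pima["class"].value_counts())
```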
Think of preparing for your board exams: if most of the questions you practised were on calculus and only a small part on trigonometry, obviously you are going to flunk trigonometry, because the pattern of questions that come to you in the exam may not have been reflected in the patterns you practised; you have not gone through all the possible patterns of trigonometry questions. The same thing will happen here to the model: I have given it only 268 diabetic cases, and maybe those 268 do not reflect all the possible patterns that lead to diabetes, so the model will perform poorly in predicting the diabetic cases and relatively well in predicting the non-diabetic cases, but I want the reverse; the objective is the reverse. And keep in mind, whenever you are doing classification and the classes are skewed like this, all algorithms are biased towards the better-represented class; the objective of all models is to minimize the overall misclassification, so their focus will be on the majority class. This is another source of bias that comes into the model, where the algorithms themselves are biased.

How do you handle this? We call the strategy up-sampling and down-sampling; there is a package available for it called imblearn, and I will take you through it when we do model tuning, but right now we are going to live with the imbalance. There is another way of improving the classification performance on the under-represented class, called modifying the threshold. In logistic regression, naive Bayes or any other probability-based model, the cutoff is 50 percent: if the probability of belonging to the positive class is 51 percent, the data point is classified as positive; if it is 49 percent, it is classified as negative; the boundary is at 50 percent. I can modify that boundary. None of the algorithms allow me to modify it directly, but I can build a wrapper class around the algorithm, a model around the model, where I modify the threshold. Once again, I'll cover this during fine tuning, not now, it would be too much. So whenever you see skewed class distributions like this, be ready: there is no way you can achieve very high accuracy on the under-represented class unless you play around with these two techniques.

Let's move on to the pair plot. I don't like the default, so let me quickly redo it: I am calling seaborn's pairplot with diag_kind equal to 'kde', kernel density estimate, which makes sense since these are numerical values, and hue equal to the target, the class column. It will take some time to render; the call is sketched below.
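A minimal version of that call, again assuming the pima DataFrame and the class column name from the sketches above:

```python
# Pairwise scatter plots with a kernel density estimate of each attribute on the
# diagonal, coloured by class (1 = diabetic, 0 = non-diabetic)
sns.pairplot(pima, diag_kind="kde", hue="class")
plt.show()
```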
Now let us understand this pair plot. Look at the very first dimension, the number of pregnancies reported; one colour is the non-diabetic class and the other is the diabetic class. The two distributions are overlapping, with the central values almost eclipsed; there is only a very small separation here. Come to the next dimension, plasma: you see a long tail on the left with a small bump, those are the outliers, the zero values, but otherwise the non-diabetic class looks almost like a normal curve and the central values largely overlap, so once again not a very good attribute, but at least there is some differentiation. Look at blood pressure: even the outliers overlap, those are the zero blood pressure cases, and they overlap for both classes; there is no way blood pressure will separate them, based on blood pressure you cannot distinguish diabetic from non-diabetic. Look at the next one, skin fold: even more surprising, the separation is almost gone. You will see the same problem on almost every dimension: except for a few where there is a small difference between the two classes, none of the dimensions separate the classes out. So even if you put all these dimensions together and build a model, we are not going to get a great result; the two classes overlap on every dimension, they are not shifted apart. Let's go ahead and build a model anyway, but let's not be very hopeful; any model will have a problem here.

What I am doing next is not strictly required; all the algorithms can take the data frame itself as input, but I like to do this as a practice: I convert the data frame into an array (a data frame minus the column headers is essentially an array), take the independent attributes into X and the dependent attribute into y, and then split X and y into a training set and a test set. Are you all with me on this? Great. Now I instantiate the logistic regression model, you can call the object anything you like, I am calling it model, and I call fit; this is where the best-fit line is found and converted into the sigmoid curve. Once the model has been fit, I use the predict function of that object on the test set; those are my predicted labels, diabetic or non-diabetic, and the actual labels are in y_test. Some people get confused at the next part, so for the time being just look at everything up to this point; up to here everything is fine.

Below that I look at the confusion matrix. Have you seen a confusion matrix before? It is exactly what it is called; it confuses people a lot. As a practice, when I make my confusion matrix I always keep the actuals on the rows, actual diabetic and non-diabetic, and the predicted output of the model on the columns, predicted diabetic and non-diabetic. You can follow the other convention, but whichever you choose, follow only one. How do you control this? The first argument, y_test, the actuals, goes to the rows, and y_predict, the predicted values, goes to the columns; if you flip the arguments you flip the matrix. And keep in mind, if you go and read the source of the data, it is very clearly stated that 0 is non-diabetic and 1 is diabetic. The code and the resulting matrix are sketched below.
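A sketch of that whole sequence, with the split proportions and variable names being my own assumptions; the exact counts in the matrix will depend on the random split:

```python
# Split the array into independent attributes (X) and the target (y)
array = pima.values
X = array[:, 0:8]    # first eight columns: preg ... age
y = array[:, 8]      # last column: class (0 = non-diabetic, 1 = diabetic)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)            # best-fit line found, then squashed into the sigmoid
y_predict = model.predict(X_test)      # hard 0/1 labels for the test records

# Actuals (first argument) go to the rows, predictions (second argument) to the columns
print(metrics.confusion_matrix(y_test, y_predict))
```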
Of the 147 non-diabetic records in the test data, 132 have been classified as non-diabetic, correct classifications for that class, and 15 have been classified as diabetic, incorrect classifications; these correspond to the points of one class lying at the extreme end of the other. Look at the diabetic class: there are 84 diabetic cases in the test data, of which 46 have been correctly classified as diabetic, and 84 minus 46, that is 38, have been misclassified.

Now look at the overall accuracy: about 77 percent, which looks decent, but let's look at the accuracy within the diabetic class. We call it recall, and the formula for recall is true positives divided by true positives plus false negatives; it sounds very fancy, but it simply means the true positives divided by the row total. So the recall for the diabetic class is 46 divided by (46 + 38), that is 46 out of 84, which comes to around 55 percent. That is like tossing a coin, heads or tails; you don't need a model for that kind of accuracy, you can just toss a coin and say diabetic or non-diabetic. So the overall number looks great, but don't get fooled by it; what matters is the class we care about, and this is what we expected, because the diabetic class is represented very poorly and that is the class that is important for us. Look at the recall for the non-diabetic class: 132 out of 147, which is very high, roughly 90 percent. The recall for the non-diabetic class is very high and the recall for the diabetic class is extremely poor, but that is what you should expect; we wanted the reverse.

So what happens in situations like this? There is no way you can get a very good model when the individual attributes are poor attributes and one class is under-represented. The magic happens in the data: how you prepare your data for analytics, that is where the magic is, not in the algorithms. Around 80 percent of the estimated effort in a data science project goes into preparing the data; running the algorithm is not the hard part, and the remaining 15 to 20 percent goes into fine-tuning the model.

One last thing: there is a function called score. Instead of doing all of this I could directly call score, but don't rely on it alone when you are doing classification. What this function does is take the fitted model, run it on the test data, compare the predictions with the actual labels and give you the overall accuracy; it has nothing to do with the confusion matrix. You can remove the hash and run it and it will give you the score (if it says model is not defined, that is a spelling mistake or the cell has not been executed; I have not executed it here, I am just showing it to you). It gives you an overall score, but the overall score on its own is not a reliable measure, so don't depend on it.
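To close the loop on those numbers, here is a short sketch of how you might print the per-class recall and the overall score, continuing with the objects assumed in the sketches above:

```python
# Per-class precision/recall/f1 -- the diabetic (1.0) row is the one to watch
print(metrics.classification_report(y_test, y_predict))

# Overall accuracy only -- convenient, but it hides the poor recall on the minority class
print(model.score(X_test, y_test))
```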
Info
Channel: Great Learning
Views: 21,040
Rating: 4.8900342 out of 5
Keywords: Great Learning, Logistic Regression, Logistic Regression Algorithm, Logistic Regression Machine Learning, Logistic Regression in Python, Machine learning algorithms, Logistic regression in Python tutorial, Logistic regression example, Binary logistic regression, Logistic regression for dummies, Multinomial logistic regression, What is logistic regression, Linear regression vs logistic regression, logistic regression demo, Logistic Regression Example, machien learning tutorial
Id: dXCawGZh6dc
Length: 66min 41sec (4001 seconds)
Published: Tue Dec 24 2019