Logistic Regression with Maximum Likelihood

Captions
Hello everyone, welcome back to another video. Today we're going to talk about logistic regression. I'm Gus and this is Endless Engineering, so let's dive right in.

Okay, so logistic regression is a statistical algorithm that allows us to predict the probability that a dependent variable is of a certain category, say "cat", given an input and a certain model. So say I have a bunch of inputs, the independent variables x, and a dependent variable y, and let's say the inputs are images. I'm drawing them in 2D as single data points just to make the case for what logistic regression is, but each one is an image, and each image is clearly either an image of a cat or an image of a dog. Those are the categories, 0 and 1, so this becomes a classification problem, and a classification problem is the ability to predict that a certain input is of a certain type or class. That's what logistic regression does for us: it's a classification algorithm that finds the probability, given an input and a certain model, that the input belongs to a certain category.

So how do I do that? I could try a hypothesis similar to linear regression, θᵀx̄ — you can go back and watch my video on linear regression to see what that means — where θᵀ is the row vector (θ₀, θ₁, …, θₙ) and x̄ is the column vector that starts with a 1 and then has x₁ through xₙ. If I did that I could fit this data with a line, but if I want this value to be a probability then it can't be a line, because a line can take values larger than one and less than zero, and that violates the definition of a probability. So what do I do? I make my hypothesis h(x) equal to some function σ(θᵀx̄), and the function I pick is the sigmoid, so h(x) = 1 / (1 + e^(−θᵀx̄)).

The reason I pick the sigmoid is twofold. First, the sigmoid is an S-shaped curve: it starts near zero and ramps up to one, squashed at both ends (technically it's centered around zero on the horizontal axis), so no matter what the input is, the sigmoid can never be larger than one or less than zero. Second, the derivative of the sigmoid is mathematically convenient, and we'll see that in a few minutes.

So that's my hypothesis: the probability that one sample yᵢ is a cat given xᵢ, parametrized by θ, is P(yᵢ = 1 | xᵢ; θ) = h(xᵢ) = 1 / (1 + e^(−θᵀx̄ᵢ)). About the notation: the vertical bar means "given" xᵢ, and I put a semicolon before θ because θ is not really a random variable — it's a parameter we need to find. And the probability that yᵢ is not a cat, in this case a dog, given xᵢ and parametrized by θ, is one minus that: P(yᵢ = 0 | xᵢ; θ) = 1 − h(xᵢ). Don't confuse h with σ: h itself is σ(θᵀx̄). So that's the probability of one class given the input, and the other probability is the probability that the sample is of the other class.
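To make the hypothesis concrete, here is a minimal sketch in Python/NumPy — my own illustration, not code from the video, and the names `sigmoid` and `predict_proba` are just assumed labels — of h(x) = σ(θᵀx̄):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid (logistic) function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    """Hypothesis h(x) = sigma(theta^T x_bar).

    theta : parameters of length n + 1 (theta_0 pairs with the leading 1)
    x     : raw feature vector of length n
    Returns P(y = 1 | x; theta), e.g. the probability the image is a cat.
    """
    x_bar = np.concatenate(([1.0], x))  # prepend the 1 that multiplies theta_0
    return sigmoid(theta @ x_bar)
```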
I can combine those two into one equation and say that the probability of yᵢ given xᵢ, parametrized by θ, is P(yᵢ | xᵢ; θ) = h(xᵢ)^yᵢ · (1 − h(xᵢ))^(1 − yᵢ). This function is known as a Bernoulli distribution, and it's nice for a binary problem like this one, where we have two classes. In the case where yᵢ is 1 — my class is a cat — the second exponent is 1 − 1 = 0, so that factor becomes 1 and I'm left with h(xᵢ), the probability that I have a cat. If yᵢ is 0, the first factor becomes 1 and the second one remains, and I have the probability that it's not a cat, i.e. a dog. What we're discussing here is only the binary classification problem, to show the essence of logistic regression, but it can be expanded to multi-class classification problems.

Okay, so now I have the probability function, which tells me how plausible it is that my dependent variable equals a certain class given an input, with a model parametrized by θ. If I combine all my observations into a large matrix X and stack the labels into a large vector Y, I can write P(Y | X; θ) as the product over i = 1 to m, where m is the number of data points, of h(xᵢ)^yᵢ · (1 − h(xᵢ))^(1 − yᵢ). This whole function is typically referred to as the likelihood of θ, L(θ). What it represents is how plausible my model — the parameters θ — is given all of my data points. The reason I'm able to multiply all the probabilities together is that I'm assuming my data points are independent: if the data points are independent, I can multiply their probabilities and get the likelihood function. That's a reasonable assumption here, since pictures of dogs and cats aren't really related to each other — it doesn't matter what camera you take them with, it doesn't matter if it's a cell phone, it doesn't matter if it's the same dog. All that matters is that each one is an image of a dog or an image of a cat, and I'm trying to find a model that lets me predict, given a new image, whether it's a cat or a dog.

This likelihood function is what's going to allow us to find the θ parameters. So now that we have our likelihood function, which represents the plausibility of the model parameters θ given all the data points, and I know that my hypothesis uses a sigmoid, how do I find my model parameters — how do I find the thetas? The way to do that is to maximize the likelihood function L(θ), and that's not easy to do when I have a product of probabilities. One way to deal with this is to use the log-likelihood. If I take the log of the likelihood function — I'll write it as a script ℓ(θ) — the nice thing is that the product becomes a sum, which is a nice property of the log, and each exponent becomes a multiplier: ℓ(θ) = Σᵢ [ yᵢ · log σ(θᵀx̄ᵢ) + (1 − yᵢ) · log(1 − σ(θᵀx̄ᵢ)) ]. So that's my log-likelihood.
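As a rough illustration of the log-likelihood written above, here is a small self-contained sketch — again my own code, not the video's — which assumes a design matrix `X_bar` whose rows already start with the leading 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X_bar, y):
    """Bernoulli log-likelihood  l(theta) = sum_i [y_i*log(h_i) + (1 - y_i)*log(1 - h_i)].

    X_bar : (m, n+1) matrix of inputs, each row starting with a 1
    y     : (m,) vector of 0/1 labels
    """
    h = sigmoid(X_bar @ theta)   # h_i = sigma(theta^T x_bar_i) for every sample
    eps = 1e-12                  # keeps log() away from exactly 0
    return np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
```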
I need to maximize the log-likelihood, and this is nice because to maximize a function I can take the derivative and move in the direction of the gradient — gradient ascent. In our last video, on linear regression, we talked about gradient descent because we wanted to minimize a function; in this case we want to maximize the likelihood, or the log-likelihood. The reason is that since the likelihood represents the plausibility of my model, maximizing it intuitively means finding the most plausible model I can use to represent this data.

Okay, so let's go ahead and do that. Let's look at one data point — I'm going to drop the sum over i because it's not affected by the derivative — and take the partial derivative of the log-likelihood ℓ(θ) with respect to θ. The y in front is a constant with respect to θ, so I keep it, and the derivative of a log is 1 over whatever is inside the log, here σ(θᵀx̄), times the derivative of what's inside, ∂σ(θᵀx̄)/∂θ. Then for the other term I keep (1 − y) — there's no θ in it — times the derivative of its log, which is 1 over (1 − σ(θᵀx̄)), times the derivative of what's inside: the derivative of the 1 is zero and goes away, so I'm left with a negative sign on ∂σ(θᵀx̄)/∂θ.

You can see both terms share ∂σ(θᵀx̄)/∂θ, the derivative of the sigmoid with respect to θ. The sigmoid takes θᵀx̄ as its input, so by the chain rule I can write ∂σ(θᵀx̄)/∂θ = [∂σ(θᵀx̄)/∂(θᵀx̄)] · [∂(θᵀx̄)/∂θ]. The second factor is just x̄, and the first factor is the derivative of the sigmoid with respect to its input. This is where the mathematical convenience of the sigmoid comes into play, because the derivative of the sigmoid with respect to its input is the sigmoid itself times one minus the sigmoid itself: σ′(z) = σ(z)(1 − σ(z)). That's very convenient — you can go prove it to yourself, but that is the actual derivative of the sigmoid.

So if I substitute that derivative, times x̄, into both terms and multiply everything out, I can show that ∂ℓ/∂θ = (y − σ(θᵀx̄)) · x̄, that is, (y − h(x)) times x̄, summed over all the data points once I put the sum back in. This is the derivative of the log-likelihood with respect to θ, and this equation is what I would use in gradient ascent to compute my values of θ iteratively until I converge.
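That gradient is easy to vectorize over the whole data set. Here is a hedged sketch of how it could look — my own code, same `X_bar` convention as before — computing Σᵢ (yᵢ − σ(θᵀx̄ᵢ)) · x̄ᵢ as a single matrix product:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood_gradient(theta, X_bar, y):
    """Gradient of the log-likelihood: sum_i (y_i - sigma(theta^T x_bar_i)) * x_bar_i."""
    h = sigmoid(X_bar @ theta)   # predicted probabilities, shape (m,)
    return X_bar.T @ (y - h)     # same shape as theta
```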
Since this is gradient ascent, the update adds a positive value: θ ← θ + α · ∂ℓ(θ)/∂θ, where α is our scaling parameter, the step size, and the partial derivative is the expression we just derived. So there you have it — this is the maximum-likelihood approach to finding the model parameters θ that maximize the likelihood of the distribution, so that I have a model that lets me predict the probability of a categorical dependent variable given an input x, the independent variable.

To sum up, logistic regression is a statistical model that allows us to predict the probability that an input belongs to a certain category, and I do that by finding the model that maximizes the log-likelihood of the distribution we wrote down. Everything here is the theory for binary logistic regression, meaning two classes, but it can easily be generalized to multi-class logistic regression; I just wanted to go through some of the math. I skipped a few steps because I ran out of space, but you can go ahead and prove to yourself, given all these equations and the properties of the sigmoid function, that you arrive at this gradient, and that's what would be used in your gradient ascent algorithm to find the parameters θ that best, or most likely, fit the model.

If you liked this video on logistic regression, hit that thumbs up button and think about subscribing to Endless Engineering if you like our videos and would like to see more. Also, maybe hit that notification bell so YouTube notifies you every time we have a new video. Thank you for watching.
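Putting the pieces from the captions together, here is a minimal end-to-end sketch of fitting binary logistic regression by gradient ascent on a toy data set. It is my own illustration under the assumptions above; the step size `alpha`, the iteration count, and the toy data are arbitrary choices, not from the video:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.1, n_iters=1000):
    """Maximize the log-likelihood by gradient ascent: theta <- theta + alpha * gradient."""
    X_bar = np.column_stack([np.ones(len(X)), X])  # prepend the column of 1s
    theta = np.zeros(X_bar.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X_bar @ theta)                 # current predicted probabilities
        theta += alpha * (X_bar.T @ (y - h))       # ascend along the gradient
    return theta

# Toy 1-D example: class 1 ("cat") tends to have larger x than class 0 ("dog").
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
theta = fit_logistic_regression(X, y)
p_new = sigmoid(np.array([1.0, 2.0]) @ theta)      # P(y = 1) for a new input x = 2.0
print(theta, p_new)
```

With a fixed step size the iterates converge as long as α is small enough; in practice one would monitor the log-likelihood and stop once it plateaus.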
Info
Channel: Endless Engineering
Views: 16,722
Rating: 4.9419236 out of 5
Keywords: logistic regression, regression, statistics, statistical model, model, machine learning, data analysis, data analytics, data science, data scientist, maximum likelihood, maximum likelihood estimation, gradient descent, gradient ascent, classification, calculus, derivative, sigmoid, parameter estimation, bernoulli distribution, bernoulli, binomial distribution, probability, Bayesian inference, Bayesian, binary classification, likelihood function, log likelihood
Id: TM1lijyQnaI
Length: 15min 51sec (951 seconds)
Published: Sun Jan 06 2019