Logistic Regression

Captions
Today in this module we will study the Support Vector Machine, but before that we will have a brief lecture on Logistic Regression. This is Part A, on Logistic Regression. In the previous class, in the second week, we talked about linear regression, which is used for regression problems. But if you have a classification problem you cannot use linear regression directly, so we want to see what is the simplest way in which we can handle a classification problem.

In linear regression, we had the hypothesis function h(x) = sum over i from 0 to n of beta_i x_i, where n is the number of predictor variables. That is, h(x) = beta_0 + beta_1 x_1 + beta_2 x_2 + ... + beta_n x_n, and learning involves learning the values of these beta_i in order to optimize a certain objective; for example, we try to minimize the sum of squared errors. Now suppose we have a classification problem: we have different training points, some positive and some negative, and we want to classify them, that is, to say when a point is positive and when it is negative. This linear function gives a real value and is not appropriate for classification, but what we can do is apply another function on top of this linear function so that we can use the result for classification. In logistic regression we use the logistic function, or sigmoid function, for this task.

First of all, let us look at what the logistic function or sigmoid function is. It is given by the formula g(z) = 1 / (1 + e^(-z)), and it has an S-shaped profile. The value of the function lies between 0 and 1: at z = 0 the value is 0.5; as z tends to infinity the value tends to 1; and as z tends to minus infinity the value tends to 0. So this function gives a value between 0 and 1, and we can use it for classification by saying that if the output is greater than 0.5 the example belongs to the positive class, and if it is less than 0.5 it belongs to the negative class.

So, for classification using logistic regression, we apply this function g, also called the sigmoid function sigma(z), to sum over i of beta_i x_i, which can be written more compactly in matrix notation as beta^T x. The hypothesis for classification is therefore h_beta(x) = g(beta^T x) = 1 / (1 + e^(-beta^T x)): we take a linear function of beta, pass it through the sigmoid function, and use the result as the classification function.
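As a concrete illustration of the hypothesis just described, here is a minimal Python sketch (not part of the lecture; NumPy is assumed, and the names sigmoid, hypothesis, and predict are purely illustrative) of the sigmoid function and the resulting 0.5-threshold classifier.

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)); the output lies strictly between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(beta, x):
    # h_beta(x) = g(beta^T x): a linear function of x passed through the sigmoid
    return sigmoid(np.dot(beta, x))

def predict(beta, x):
    # classify as positive (1) if h_beta(x) > 0.5, otherwise negative (0)
    return 1 if hypothesis(beta, x) > 0.5 else 0

For example, with beta = np.array([0.0, 1.0]) and x = np.array([1.0, 2.0]) (a constant feature x_0 = 1 plus one predictor), beta^T x = 2, hypothesis(beta, x) is about 0.88, and predict returns 1.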
Now let us look at certain properties of this sigmoid function which make it very attractive to use. As we have seen, g(z) tends to 1 as z tends to infinity (it asymptotically approaches 1), and g(z) tends to 0 as z tends to minus infinity. A particularly attractive feature appears when we take the derivative of this function. We have g'(z) = d/dz [ 1 / (1 + e^(-z)) ]. Treating the denominator as a variable, the derivative of 1/u is -1/u^2, and the derivative of the inner part 1 + e^(-z) is -e^(-z), so g'(z) = e^(-z) / (1 + e^(-z))^2. With a little manipulation this can be rewritten as g'(z) = [ 1 / (1 + e^(-z)) ] * [ 1 - 1 / (1 + e^(-z)) ], which is simply g(z) * (1 - g(z)). So the derivative of g(z) can be written as g(z)(1 - g(z)). The derivative is extremely simple to compute, and this property makes the logistic or sigmoid function attractive to use.

Based on this function we can look at the conditional distribution generating the data. Suppose you have the input x and want to find the probability of y given x. If y = 1 the probability is h(x), and if y = 0 the probability is 1 - h(x). We can write the distribution of y given x compactly as P(y | x) = h(x)^y * (1 - h(x))^(1 - y): if y = 1, then 1 - y = 0, the second factor is 1, and we are left with h(x); if y = 0, the first factor is 1 and we are left with 1 - h(x). Here, as we have seen, h(x) = 1 / (1 + e^(-beta^T x)).

Given this model, we can now try to learn the parameters using gradient ascent: we want to maximize the likelihood, just as we used gradient descent to minimize the error in linear regression. In logistic regression we need to learn the conditional probability distribution P(y | x). Suppose P(y | x; beta) is our estimate of the probability of y given x, where beta is the vector of parameters whose values we have to learn. We will use stochastic gradient ascent: we take a single training example and update the parameters with respect to that example.

To do that, we first define the likelihood of the data. We have to learn the optimal values of beta, and we use the maximum likelihood approach: the likelihood of beta is the probability of observing the data if beta were the actual parameters. Because we have m training examples, the likelihood is the product over the training examples of the probability of y_i given x_i under beta, L(beta) = product from i = 1 to m of P(y_i | x_i; beta), which, as we have seen, is the product from i = 1 to m of h(x_i)^(y_i) * (1 - h(x_i))^(1 - y_i). We have to find the beta that maximizes this expression. Since all the probabilities are positive, whatever maximizes this expression also maximizes its logarithm, so to make the computation simpler we work with the log likelihood of beta.
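The derivative identity and the likelihood described above are easy to check numerically. The following self-contained Python sketch (illustrative only, not part of the lecture; NumPy is assumed, and the function names and the tiny dataset are made up) compares g'(z) = g(z)(1 - g(z)) against a finite-difference estimate and evaluates the likelihood product for a small dataset.

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Check g'(z) = g(z) * (1 - g(z)) against a central finite difference.
z = 0.7
eps = 1e-6
analytic = sigmoid(z) * (1.0 - sigmoid(z))
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)
print(abs(analytic - numeric))   # should be very small (around 1e-10 or less)

def likelihood(beta, X, y):
    # L(beta) = product over i of h(x_i)^(y_i) * (1 - h(x_i))^(1 - y_i)
    h = sigmoid(X @ beta)        # h(x_i) for every row x_i of X
    return np.prod(h ** y * (1.0 - h) ** (1.0 - y))

# Tiny made-up dataset: the first column is the constant feature x_0 = 1.
X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
print(likelihood(np.zeros(2), X, y))   # 0.5^3 = 0.125 when beta = 0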
So l(beta), the log likelihood, is the log of L(beta), which is the sum from i = 1 to m of [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]. This is what we get by taking the logarithm of the likelihood expression. Now we have to maximize this log likelihood, and to do so we use gradient ascent. We start with some initial value of beta and update it iteratively as beta := beta + alpha * (partial derivative of l(beta) with respect to beta), where alpha is the learning rate. We can do this one example at a time if we are using stochastic gradient ascent.

Suppose we have a single training example (x, y). Based on this training example and the current beta, we want to find what the next beta will be. For that we take the derivative of the log likelihood and take a step in the direction of the derivative. Taking the partial derivative of l(beta) with respect to beta_j for this single example, we get

partial l(beta) / partial beta_j = [ y / g(beta^T x) - (1 - y) / (1 - g(beta^T x)) ] * (partial / partial beta_j) g(beta^T x).

Using the fact that g'(z) = g(z)(1 - g(z)), the last factor expands to g(beta^T x) * (1 - g(beta^T x)) * (partial / partial beta_j)(beta^T x), and on simplification this gives

[ y * (1 - g(beta^T x)) - (1 - y) * g(beta^T x) ] * x_j = (y - h_beta(x)) * x_j.

So the partial derivative of the log likelihood for a single example is (y - h_beta(x)) * x_j. Plugging this into the update formula, the jth component of beta is updated as beta_j := beta_j + alpha * (y - h_beta(x)) * x_j. This is the change we make for a single training example (x, y), and this is the formula by which we can do stochastic gradient ascent to find the right values of beta for logistic regression. With this we come to the end of today's lecture. Thank you very much.
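To tie the derivation together, here is a self-contained sketch (an illustration under assumptions, not the lecturer's code; NumPy is assumed, and the learning rate, number of passes, and toy data are arbitrary choices) of the stochastic gradient ascent update beta_j := beta_j + alpha * (y - h_beta(x)) * x_j derived above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sga_logistic_regression(X, y, alpha=0.1, epochs=100):
    # X: m x (n+1) matrix whose first column is the constant feature x_0 = 1
    # y: length-m vector of 0/1 labels
    m, n = X.shape
    beta = np.zeros(n)
    for _ in range(epochs):
        for i in range(m):                          # one training example at a time
            h = sigmoid(np.dot(beta, X[i]))         # h_beta(x) = g(beta^T x)
            beta = beta + alpha * (y[i] - h) * X[i] # beta_j += alpha * (y - h) * x_j
    return beta

# Made-up, linearly separable data: the label is 1 when the predictor exceeds 1.
X = np.array([[1.0, 0.0], [1.0, 0.5], [1.0, 1.5], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
beta = sga_logistic_regression(X, y)
print(sigmoid(X @ beta))   # low probabilities for the first two rows, high for the last two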
Info
Channel: Machine Learning- Sudeshna Sarkar
Views: 59,043
Id: CE03E80wbRE
Length: 20min 40sec (1240 seconds)
Published: Fri Aug 12 2016