CS480/680 Lecture 8: Logistic regression and generalized linear models

Captions
Okay, we're recording everyone, let's get started. All right, so if you recall, last class we talked about a statistical model for classification that was based on mixtures of Gaussians, but the obvious limitation of this model is the fact that we're assuming a Gaussian distribution. So today we're going to remove this limitation: we're going to consider a broad class of distributions, and this will lead to a new model known as logistic regression. Okay, so that's on the menu for today, to talk about logistic regression. Beyond being essentially a generalization of mixtures of Gaussians, we're going to see that it is a very simple yet powerful technique that is quite flexible and is very much used by industry; lots of companies are using logistic regression in all kinds of products, for good reasons. We're going to see a simple use case for that, and this will also serve as a precursor to our discussion later on about neural networks: we're going to see that logistic regression can be thought of as essentially a simple one-layer type of neural network with a sigmoid activation function. Okay, so just as a brief recap: last class we talked about mixtures of Gaussians, and I'm going to draw again on the board the picture that you should all have in your mind. Whenever we talk about mixtures of Gaussians in the context of classification, the idea is that there are different classes, and for each class we assume that the data is essentially centered in one location; but due to errors in the measurements, some noise creeps in, and when you get your data it's not precisely in one location but in some region. If we assume that the noise follows a Gaussian distribution, then this region will have a nice bell shape that corresponds to a Gaussian distribution. Okay, so let's draw again the picture for mixtures of Gaussians. So if we have
three classes, let's say class A is here, class B here, and class C here, the idea is that normally we would assume that if there was no noise, the data would be concentrated at these three specific locations; but because of noise, we observe data in the neighborhood. So this would correspond to data for class A, and if the data is a little bit away from the center due to some errors, some noise that follows a Gaussian distribution, then we can draw some contour lines that correspond to where we expect the data to be. Same thing for class C: there's data that's nearby, and we can draw some contour lines that correspond to the Gaussian distribution. And finally, the same idea for class B: we've got some data nearby, and these are the contour lines. Okay, so this works well when the data does form clusters that have the shape of a Gaussian, so here that means a spherical shape, or otherwise some sort of ellipsoid shape. But now, what if the data does not have clusters with nice shapes like this? What could we do? Because in practice the data will come in whatever shape, and presumably it is concentrated in different regions that correspond to different classes, and we'd still like to do classification. Okay, so for this we're going to have to relax our assumption of a Gaussian distribution, and then the question will be: what other distributions can we consider besides Gaussians? It turns out that there is a family of distributions known as the exponential family that is quite important. In most introductory courses on probability and statistics you would see lots of distributions from this family; these would include obviously the Gaussian, but also the exponential, the Bernoulli, the categorical, the Poisson, the beta, the Dirichlet, the gamma, and many others. Okay, so many of the famous distributions that you've often seen in an introductory course on probability and statistics are actually part of this exponential family.
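To make the family concrete, here is a small sketch (mine, not from the lecture slides) showing the Bernoulli distribution rewritten in the standard exponential-family form exp(theta * T(x) - a(theta) + b(x)), with natural parameter theta = log(p / (1 - p)), sufficient statistic T(x) = x, a(theta) = log(1 + e^theta), and b(x) = 0:

```python
import numpy as np

def bernoulli_pdf(x, p):
    """Standard Bernoulli probability: p^x * (1-p)^(1-x) for x in {0, 1}."""
    return p**x * (1 - p)**(1 - x)

def bernoulli_exp_family(x, p):
    """The same distribution in exponential-family form
    exp(theta * T(x) - a(theta) + b(x)) with T(x) = x and b(x) = 0."""
    theta = np.log(p / (1 - p))        # natural parameter
    a = np.log(1 + np.exp(theta))      # log-partition function a(theta)
    return np.exp(theta * x - a)

# both parameterizations give the same probabilities
for x in (0, 1):
    assert np.isclose(bernoulli_pdf(x, 0.3), bernoulli_exp_family(x, 0.3))
```

The Gaussian, Poisson, gamma, and the others in the list admit the same rewriting, just with different choices of T, a, and b.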
So this family is parameterized as follows. We can write down a distribution over where the data might be, based on some parameters theta, and it's called the exponential family because the probability density function has an exponential, where the exponent is of the following form: a linear term in theta times some other term in x, minus another term a(theta), plus another term b(x). So in this general form, the key is that there's an exponential, and then there are three terms: the first is a hybrid term that is linear in theta and depends on x as well, the second is just in theta, and the third is just in x. What makes the difference between, let's say, a Gaussian and a Bernoulli and a gamma and so on are the precise functions T(x), a(theta), and b(x). In general, all of the distributions that are part of the exponential family can be rewritten in this form; they just have different definitions for T(x), a(theta), and b(x). Okay, so the beauty is that, because these distributions have this form, it turns out we can show that the posterior when we do classification will also be of the form of a logistic sigmoid applied to a function that is linear in x. If you recall, for mixtures of Gaussians we did the derivation: if we start with a prior, which is a categorical distribution, and we have a likelihood, which is a class conditional that is Gaussian, then the posterior we use to predict the class that a point belongs to will be a logistic sigmoid function. That's what we arrived at last class when we did the derivation. I'm not showing you the derivation here, but for any distribution of this form you can show that the posterior will also be a logistic sigmoid function that is linear in x. This is beautiful, because the theory that we derived last class will apply more or less directly here to a wide range of distributions.
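As a quick numerical check of this claim, here is a 1-D sketch of my own (the means, variance, and test points are made up for illustration): two Gaussian classes with equal priors and a shared variance, where the posterior computed directly with Bayes' rule matches a logistic sigmoid of a linear function of x:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gauss(x, mu, var):
    """Gaussian density N(x; mu, var)."""
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# two classes, equal priors, shared variance (assumed values)
mu0, mu1, var = -1.0, 2.0, 1.5

def posterior_bayes(x):
    """p(y=1 | x) computed directly with Bayes' rule."""
    p0, p1 = gauss(x, mu0, var), gauss(x, mu1, var)
    return p1 / (p0 + p1)

# the same posterior as sigmoid(w*x + w0), linear in x
w = (mu1 - mu0) / var
w0 = (mu0**2 - mu1**2) / (2 * var)

for x in (-3.0, 0.0, 0.7, 4.0):
    assert np.isclose(posterior_bayes(x), sigmoid(w * x + w0))
```

The same kind of identity holds for the other members of the exponential family, only the expressions for w and w0 in terms of the underlying parameters change.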
Maybe not all possible distributions, but at least all the distributions in the exponential family. Okay, any questions regarding this so far? Okay, very good. All right, so based on this we're going to formulate what are known as probabilistic discriminative models. The idea is that, since we know the posterior distribution has the form of a logistic function, instead of assuming that the data is perturbed with some noise that follows a specific distribution, then trying to estimate the parameters of that distribution, and then computing the posterior, we're going to go straight to the posterior. We know that for all these distributions the posterior is a logistic function that depends essentially on w and w naught, so the question is: can we estimate w and w naught directly? What we did with mixtures of Gaussians is fit a mixture of Gaussians by learning the mean and covariance matrix of each Gaussian, and from that compute the posterior, which tells us what w and w naught should be in terms of the mean and covariance matrix. But maybe we don't need so many steps; perhaps we can go straight to estimating w and w naught, since we know the posterior is going to have this form. That's exactly what we're going to do, and we can do this by essentially treating this as if it was a likelihood for the classes and then optimizing the parameters to maximize this likelihood. Okay, so in general, we know that for binary classification the posterior is going to be a logistic sigmoid, and if we have multiple classes then we'll simply have a softmax, which is the generalization of the sigmoid. Again, for all these distributions from the exponential family, the posteriors all follow the softmax whenever we've got multiple classes. Okay, so this leads to a technique known as logistic regression, because for all these distributions we know the posterior is a logistic sigmoid, so we're
going to essentially try to find the best logistic function that captures the posterior for the data. Okay, so we model the posterior, and let's say that we want to model it not just in terms of x but also in terms of y, because the point of the posterior is to compute the probability that a data point belongs to a specific class, which we denote by y. If we denote those classes by either 0 or 1, then we can use an expression like this to represent the posterior, so that it isn't just the posterior of a specific class, but the posterior for any class when there's only 0 or 1. If it is of class 1, then the exponent ensures that this term is active; but if the exponent is 0, then this term disappears, and we have this other term instead, which is active when y is 0. Okay, so for two classes I can write my posterior distribution in this form, and this is essentially a conditional distribution for y given x, parameterized by w. Now we would like to find the best w that maximizes the posterior, because we're given some data: our data includes both the input x and the output y, which is our target, and we would like to find the w that ensures that, for most data points, the probability of the correct class y is as high as possible. So, yeah, we can simply try to learn w directly in this way. The beauty is that w would normally correspond to some expression in terms of the parameters of some underlying distribution, like a Gaussian, or maybe a gamma, or something else from the exponential family, but now we don't really care what distribution it's from; we're simply going to fit w directly, and the idea is that there is presumably some underlying distribution from the exponential family that would correspond to it. So that's the idea. Okay, so here I'm showing the expression for one data point, but
now we can consider an entire data set. The main difference, you see, is that I've got a capital X to denote the data set, where before this was a single data point x, and I've got a vector of outputs Y, in comparison to a single output y. Okay, so with a data set, the main difference is that now I'm going to have a product over the posteriors of all those data points, and beyond that we simply try to optimize this posterior to find the best w. The problem is that this type of optimization is not straightforward, because I've got a product, so I'm going to use the same trick that we've seen already several times: I take the log of my objective. That doesn't change where the maximum is, but it allows me to change the product into a sum, and then I apply the log to those terms, which gives me this expression here. The other change I'm making is that I change the max to a min by introducing a negative sign, so that I can treat this as a loss function. In everything we've seen so far, whenever we optimize, we can always go from a max to a min by introducing a minus sign, and that allows me to interpret this as a minimization problem, where this is essentially some kind of loss function. But this is really the same optimization problem as here, just written differently. Okay, so with this objective we follow a similar approach to what we saw for previous models: we compute the derivative and see what happens when we set it to zero. In computing the derivative, it's important to look at what the derivative of the sigmoid function itself is, and it turns out that the sigmoid has a nice derivative, which is the sigmoid times 1 minus the sigmoid. Okay, so we're going to make use of that; this is an interesting observation, and it's actually quite convenient to make use of this expression. All right, so now the goal is to solve this
optimization problem to find the w that minimizes this objective. But before we go on and do this, let me make a remark and say that perhaps what's a little bit awkward is that this technique is called logistic regression, but really we're trying to do classification, right? We talked at the beginning of the course about how, whenever we have categories, so classes that are categorical, it's classification, and whenever we try to predict something that is numerical and continuous, it's regression. So how can we call this logistic regression when it's a technique applied to a classification problem? This seems to be a contradiction, but it turns out there's a good reason for it. The idea is that we can think of what we're really doing as a form of regression, because at the end of the day we're trying to estimate the posterior probability of the class given x, and this posterior, being a probability, is a number between 0 and 1. So we're making a prediction about a numerical value. So yeah, even though it's a little bit awkward that we have a regression technique applied to a classification problem, it makes sense, because we turned our classification problem into a form of regression by saying that we're going to predict the probabilities of each class, and the probabilities are themselves numbers, continuous values. Any questions regarding this? Okay, very good, let's continue. All right, so with that in mind we're going to proceed with finding the best parameters w and w naught. On this slide I've got a little derivation similar to what we've seen for previous models, where we take our objective, compute the derivative, and set it to zero. So if you take the expression that we have on this slide, and compute the derivative while using this identity, then you arrive
at the following expression right here. Now this expression can be simplified: I cancel off some terms, I cancel off some more terms, and I finally arrive at this expression. This expression is a little simpler, but unfortunately, once I set the derivative to zero, what I would really like to do is isolate w. The goal is that, when I set the derivative to zero, I should be able to isolate my variable and figure out its expression in closed form. But if you look at this expression, even though it's short, w is inside the sigmoid, which is itself inside the summation, and getting w out of the sigmoid and out of the summation is not easy; in fact, to this day we don't have a way of doing this in closed form. So this is an example of an optimization problem where we have to resort to an iterative method. In the previous types of optimization that we did, there was always a way to set the derivative to zero and then isolate the variable; often that led to a system of linear equations. But here we really have a nonlinear equation, because the sigmoid is a nonlinear function. If it wasn't for the sigmoid, this would be just fine, we could isolate w, but the sigmoid prevents us from isolating w. Okay, so let me draw a curve to illustrate what the objective looks like, and then we'll discuss some iterative methods for dealing with this. The objective that we're working with, if I just go back, is this objective here; that's our loss function. We have on the x-axis w, and on the y-axis L(w), our loss function. Now, I'm going to claim that this function is actually a convex function that has a nice bowl shape like this. If you're familiar with optimization, you would notice that whenever we have the log of a sigmoid, that turns out to be concave, and with a minus sign it becomes convex, and then the same
thing on this side, so it's a convex function. Now, this is not a course about optimization, so I guess you can just take my word for it. Okay, so this function is convex overall. The nice thing about a convex function is that it has a single global optimum, and we can use lots of techniques that will find this global optimum relatively easily. Now, the technique we were about to use was the one that says: since we're trying to find the minimum, and the minimum is where the slope is zero, I can draw a line here, and this would be the place where the curve is essentially flat, where there's no gradient. That's why, when we set the derivative to zero and try to find the w for which it holds, normally it gives us precisely the solution, the best w right here. The problem is that it's not easy: as we just discussed, isolating w is not possible because of the sigmoid, so it's a nonlinear system of equations. Okay, so even if we can't directly isolate w, what we can do is use an iterative method, and perhaps the simplest would be some form of gradient descent. Let's say I start here; I might have a certain initial starting point for w, and what I can do is compute the gradient. From here I have a gradient in this direction that tells me I need to move this way; then I compute another gradient, I move this way, another gradient, and so on, and eventually I arrive at the solution. So I can use gradient descent, and the idea is that at every step I simply find the direction of steepest descent, make a step in that direction, and keep going until I arrive at a point where there's no more slope. This would be the simplest method, and it will work. The only issue is that it's not efficient, and there's also the problem of how to determine the step size. When I say I
have a direction and I make a step in that direction, how big should the step be? Because if I make a tiny step, then I have to do a lot of steps, and if I make a big step, I might overshoot and have to come back. So coming up with the right step length is not obvious; there's a whole literature on this, and that's one of the tricky things regarding gradient descent. So let's consider another method, known as Newton's method. This is another popular and common method for doing optimization whenever we've got a convex objective, and this method is going to be much faster than gradient descent in the sense that it takes much fewer steps, and it also does not require us to come up with a step length. At every step, as we'll see in a moment, we can automatically find where our next point should be without having to worry about the step length. The way this method works is that you start with an initial guess for w, and then you subtract from w the inverse of the Hessian, this is the Hessian here, times the gradient of our loss function. Okay, so the way to think of Newton's method is that it is really doing what is known as iteratively reweighted least squares. I'll explain that by drawing a graph. So again, we've got our loss function in terms of w, with the same bowl shape as over here. What I drew over there as an example is really gradient descent, and what gradient descent does, at some level you can think of it as: we start with an initial point and we fit a line. We fit a line that goes through that point, which tells us where we can go down, we take a step, and we fit a line again, fit a line, and so on. So we continuously approximate our objective with lines. But perhaps we could get a better approximation: instead of fitting a line, maybe I could fit a quadratic function. A quadratic function could essentially
approximate the curve better, and then we might be able to take fewer steps and converge faster; that's exactly what's going to happen. So we know our global optimum is here; let's say we start here. If I approximate this curve with a quadratic function, it might look like this: a quadratic function that touches the curve at this point, and at this particular point it has the same slope and the same second-order derivative. So we're approximating not just with a line but with a quadratic, which means that the linear and quadratic terms of this new function, in comparison to my original function, are the same. What might be different are the higher-order terms, but at least the linear and quadratic terms are the same, and it would look like this. In comparison, with gradient descent I'm essentially fitting a line, which means only the linear terms are the same. Okay, when I fit a quadratic function like this, what I can do then is say: well, why don't I treat my quadratic function as my new objective and minimize that? If I find the global minimum of this quadratic function, I find the bottom right here, and this bottom I can treat as a new point, so it gives me a new w. I started here, now I have another w that is here, and then I can approximate again. Let me draw another quadratic function that approximates the function at this new point; again I find the minimum of this quadratic function, which happens to be here, and this gives me a new w, and I approximate again. I keep on going: I find another minimum, then I approximate once more, and so on. So I will have a series of approximations, but the beauty is that I don't have to worry about computing a step length, because each time I have a quadratic function I solve optimally for the minimum in closed form. And here, solving for the minimum of this quadratic
function is essentially the same as what we did for linear regression when we were minimizing squared loss. With linear regression, what you coded in assignment one was essentially this problem of minimizing a quadratic function, because the objective when we minimize squared loss is a quadratic objective, and it happens to be convex, so it was essentially this problem. So what we're proposing now is that at every step we approximate the function with a quadratic, turning it, if you want, into some form of linear regression, where you find the minimum of that quadratic curve, which gives you a new estimate, and then you refit the function, and you keep going like this. The beauty is that, because we have a better fit, a quadratic function that matches the curve better, at least for the linear and quadratic terms, the number of steps we have to take is in general much fewer than for gradient descent. In fact, Newton's method is a second-order optimization method that makes use of both the gradient and the second-order derivatives, and that's where the Hessian comes in. Okay, so on this slide I've got the Hessian, which is a matrix of second-order derivatives: it's the derivative of the loss with respect to every pair of variables. In this case here it would be with respect to w0 once and then a second time with respect to w0; in this case it's the derivative with respect to w0 and also wM, and so on. So you see, we take derivatives with respect to every pair of variables, and this essentially gives us the curvature up to the quadratic terms; it allows us to match, in this graph, the function up to the quadratic term. Okay, so when we update w by subtracting the inverse of the Hessian times the gradient of the loss function, what we're effectively doing is taking a step here, where we essentially solve one quadratic approximation and find the
minimum. So finding the minimum corresponds to doing this update right here. I'm not showing you how we derive Newton's method, but that's essentially what it boils down to. Again, the picture you should have in your mind is what I drew on the board: what we're really doing is approximating our function with quadratics, and at every step we simply find the minimum, which can be done as simply as computing the Hessian, taking the inverse, and multiplying by the gradient. Okay, any questions regarding this? Yes? So, yeah, Newton's method tends to be much faster, meaning that it takes much fewer iterations to arrive at the global optimum. When you ask which one is more accurate, it really depends on how many steps, how many iterations, you take, but for the same number of iterations Newton's method is going to be much faster, much more accurate. Now, I should be careful: the cost of one iteration of Newton's method is more expensive than gradient descent, because we have to compute the Hessian. The Hessian includes all the second-order derivatives, and that's quadratic in the dimensionality. So this method is great when our problems are low dimensional. For instance, once we start talking about neural networks, where we have millions of weights, millions of parameters, this will not be feasible. But if you have an easier problem, where your data points have just a few dimensions, and the weights as well, then you can use this, and it tends to be much faster and therefore more accurate. Okay, any other questions? Yes? That's right, so what Newton's method does is always approximate with a quadratic, regardless of what the function is. In fact, the function in our case, if we just go back, this is our loss function, and you'll notice there's nothing that looks quadratic in here: we have log
terms, we have sigmoids, we have something linear, so it's just a complicated nonlinear function. It happens to have this bowl shape, and we're just saying: let's approximate this bowl with a quadratic function at every step. Newton's method always does that; it always approximates with a quadratic function regardless of what the original function is. Yeah, so I guess the question is: is there any specific reason why we would approximate with a quadratic? You see, we're looking for a way of making steps that are as effective as possible and at the same time as cheap as possible. The way to think about this is that instead of approximating with a line, we approximate with a quadratic, so it's a better fit and therefore allows us to make better steps, but it's still not too expensive. It is more expensive than gradient descent, because we have to compute the Hessian, but it's still doable. I guess we could ask: could we fit something with, let's say, cubic terms? I'm sure we could come up with such a technique, but at that point it becomes too expensive, and it might not be obvious how to find the minimum either. Whereas with Newton's method, approximating with a quadratic, we know how to find the minimum very easily; that's what we had on one of the slides, the iteration is simple. Yes, a good question: what if we use Newton's method on a non-convex objective? If you do that, then Newton's method is not guaranteed to find the global optimum, and it will generally converge towards a local optimum. So again, in this case, for logistic regression, the function is convex, there's a single global optimum, and Newton's method can start anywhere, so you can initialize your weights randomly. It doesn't matter where you start; if you do enough steps of Newton's method, you converge to the global optimum, and that's because we've got this
nice bowl shape. But if the objective is non-convex, then we might have a shape where there's a local optimum, then a hill, and then it goes down again to a global optimum, and Newton's method might get stuck in the local optimum; but that's not just Newton's method, gradient descent as well, and most techniques, would get stuck. Okay, so let's continue. All right, so to apply Newton's method, the main things we have to compute are the Hessian and the gradient. You know how to compute the gradient; now what about the Hessian? I said that it's the second-order derivatives, so that's by definition what it is, but what's a simple way of getting that Hessian in the context of logistic regression? This slide shows that we can do a little derivation and arrive at an expression where the Hessian is essentially X bar times R times X bar transpose. If you recall, capital X bar is essentially our data set with a row of ones on top of the data: every column is a data point, and whenever I put the bar notation on top of X, it means that I append a 1 to each data point, so I essentially have my data set with a row of ones. So it's X bar times R, where R is a diagonal matrix of this shape: a diagonal matrix where the first entry is sigma 1 times 1 minus sigma 1, and here I'm using a shorthand to mean that sigma 1 is really the sigmoid applied to w transpose x bar for the first data point. Every entry of the diagonal is like this, the sigmoid times 1 minus the sigmoid applied to a different data point; the last one is here, the sigmoid of w transpose x bar for the nth data point. So this is interesting because it shows the structure of the Hessian: it can be rewritten in this form, where you have your data set, which is easy to obtain; all you need to compute after that is R, a diagonal matrix of this form, and you multiply again by X bar transpose. Really, the only things you have to
compute are those sigmoids, and you'll notice that those sigmoids are closely related to the identity here: whenever we take the derivative of the sigmoid, we get the sigmoid times 1 minus the sigmoid, so that's where it comes from. Okay, so I'm not showing you all the steps to get to this expression, but the idea is that in assignment 2, when you program logistic regression and I ask you to apply Newton's method, an important quantity that you have to code is essentially the Hessian, and you can obtain it with this formula. Yes? Ah, that's right, so all the other values are zero here; perhaps I should have made it explicit that this is a diagonal matrix, which means we only have nonzero values on the diagonal and everything else is zero. Is our Hessian constant? No, our Hessian is not constant; it will differ at every step, because when you compute the sigmoid of w transpose x bar, we've got w in there, and this is the w of the current step, so the Hessian changes as a result. I think there was one other question; okay, maybe no question, so let's continue. Okay, so I've derived logistic regression, and this was a little bit of mathematics. I know a lot of you have been concerned that in this course maybe there's too much math; the reality is that machine learning is based on a lot of mathematics, and it's important for us to go through it so that you can understand how those methods arise and why they are the way they are. But now let's talk about a case study. Essentially, logistic regression, up until five years ago, was perhaps one of the most important machine learning techniques in industry. Today neural networks are very important and have perhaps displaced logistic regression, but for many applications this was the go-to technique for machine learning. Okay, so in this case study I'm going to consider recommender systems and ad placement. This is a type of application where
essentially we've got a user and then we want to make a recommendation to the user so this could be a product recommendation where the user will then decide whether to buy or not the product it could be an ad recommendation so whenever you go to a webpage and then there's some ads that are shown right then there's a machine learning technique under the hood that quickly decides which ads should be shown to the user and then these ads are not the same for different users right so the idea is that there's a machine learning technique and often or at least historically it was based on logistic regression where the idea is that the company would like to show an ad that has a high probability that the user would click on it right okay other examples would include let's see recommending movies or recommending apps to be downloaded so there's lots of situations in social media where you would like to make some recommendations to users and then the idea is that this recommendation either is going to be followed by the user where the user is gonna buy something click on something or download something or the user will simply ignore the recommendation and here we can think of this as a classification problem because the recommendation will either be followed by the user or will not be followed by the user right so two classes follow or not follow and now we would like to simply maximize that probability right so we would like to make a recommendation that the user is most likely to find interesting and therefore will buy the product will click on the ad or will download an app ok so we can formulate this as a form of logistic regression and here part of the reason why a lot of companies would use logistic regression for these problems is because it's simple it's also flexible and it's efficient now ok we saw the derivation you might say well how is that simple I mean there was all these slides of math but when I say simple what I mean is that in terms of implementation
to make a prediction the computation is actually simple so there's not much computation to be done okay and then in terms of the implementation I'm going to see in a moment that making a prediction just boils down to computing a dot product so even though the technique has a lot of mathematical justification that might not be so easy to grasp applying it is a matter of computing a dot product okay so let's say that we consider app recommendation as an example it would be the same thing for ad placement it would be the same thing for other types of product recommendation so here if we're making a recommendation for an app to be downloaded this might be let's say the Google Play Store or the Apple App Store right so very often you'll see that there's some recommendations that are made and how are they doing this right so what they're going to do is to take into account lots of features about yourself about your phone about anything that they think might matter and then make a prediction based on those features and here in terms of features they typically take into account millions of features and those features might be binary they might be numerical here it doesn't matter right so as long as all the features can be cast as some form of number then we can feed that to logistic regression so here those features would correspond to the X that's the input okay so now let's write down some examples for features in this context okay so we're gonna have lots of features and we can think of this as a vector perhaps some of the most classic and very important features would be binary features that might correspond to whether some apps were downloaded and installed okay so here there would be a long vector in fact yeah it's not a vector of a small size so let me do this okay so this would be apps already installed so as a starting point right we could consider the apps that you've already downloaded and installed in your phone to get a sense of what
other apps you might be interested in so that when a recommendation is made right then you're likely to be interested in the apps that are recommended okay so here I'm denoting by one the fact that an app was downloaded and installed and then zero when it was not downloaded or not installed right so it's going to be a long vector and in fact because there are millions of apps out there right this vector is going to be of a size that corresponds to the number of apps out there so therefore a length of at least a million now beyond that there's going to be other things like perhaps your age the gender so here the gender obviously is male or female but then we can represent this by simply a binary feature of 0 or 1 and then many other personal features so this could include the location and then depending on what the company or the smartphone knows about you right then any other relevant information could be taken into account so the idea is that yeah we have a long vector like this these are just examples of some features but then there's many more features that would be taken into account and all of this corresponds to X all right so this is X and here I wrote it as a row vector so it's X transpose okay so once we have our features like this and let's say we've already trained a logistic regression model and we'd like to do classification so let's say your phone is about to make a recommendation of an app that you might want or you might be interested in downloading so now in terms of choosing this app right then we'd like to know with what probability the user is likely to download and install the app okay so we can formulate this as a problem with two classes right so either the person is going to install the app that's one or the person won't install the app and that's zero now we can calculate the probability that the person is going to download and install the app just with the sigmoid of W transpose X bar and then if this probability is
greater than 0.5 then we would say that yes this person is very likely to install and download that app and then zero otherwise now to make this prediction all right whether the person is going to download or not the app we just need to verify this condition here and if you recall from last class right whenever the sigmoid is equal to 0.5 that corresponds to the linear expression inside W transpose X bar being equal to zero right so this is the boundary between the two classes and so when we talked about mixtures of gaussians we saw that it gives us a linear separator right and then the linear separator is where the two classes have probability 0.5 and 0.5 and that corresponds to W transpose X bar being equal to zero so here we can simplify this condition and really all we have to verify is whether W transpose X bar is greater or equal than zero or not okay this is for two classes now in practice really it's not just that you want to know with what probability somebody is going to download an app or not but it's actually to decide among all the apps which one should be recommended right and perhaps we can simply use the one that has the highest probability as the one that we're going to recommend but still it's a multi-class classification problem if we think of which app the person might be most likely to download next so if we think of this as a multi-class problem then the computation would be the following so we would use a softmax distribution and so we have e to the W transpose X bar divided by the sum of the e to the W transpose X bar and here the main difference with multiple classes versus two classes is that here we have a vector W for every class K in contrast here there's a single vector W for the entire model for both classes but when we've got multiple classes the generalization would use one vector W for every class K so now if we wanted to select which class has the highest probability right then we would maximize this but it turns out the denominator
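To make the multi-class case concrete, here is a small sketch of softmax prediction; the weight matrix and feature vector are hypothetical stand-ins (one weight vector per class, with the constant 1 appended to the features as the last entry):

```python
import numpy as np

def softmax(scores):
    # subtract the max for numerical stability; this doesn't change the result
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

# hypothetical trained weights: one row w_k per app (class), last entry is the bias
W = np.array([[ 0.8, -0.2,  0.1],
              [-0.5,  0.9,  0.3],
              [ 0.1,  0.4, -0.7]])
xbar = np.array([1.0, 0.0, 1.0])   # feature vector with the constant 1 appended

scores = W @ xbar                  # one dot product w_k^T xbar per class
probs = softmax(scores)            # full distribution, if probabilities are needed
best = np.argmax(scores)           # picking the max score skips the denominator
```

Since the exponential is monotone and the denominator is shared, taking the argmax of the raw scores gives the same class as taking the argmax of the probabilities.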
doesn't matter because it's the same for all classes so we're simply left with the numerator and then again it's just a question of computing a dot product all right so this is the beauty of logistic regression at the end making a prediction is simple because it's just a matter of computing a dot product and it's the case for multiple classes as well as just two classes okay so it's simple but we can show even more that we can make this very efficient by exploiting sparsity and also parallelizing some of the computation okay so for sparsity the idea is that whenever we compute a dot product we only need to multiply the nonzero entries and we can ignore all the zero entries so if I look at the features that I gave here as an example right so we can see that for the apps already installed right this was a super long vector that's millions in terms of length because there's one entry per app but how many apps does anybody have on their phone and so most people have maybe I don't know 20 30 apps if you have a lot of apps maybe you've reached a hundred right but by far most of the apps out there you've never installed them right so that means this vector is very sparse because most of the entries are zero so here when we compute a dot product we can ignore all of those zeros and then we're going to multiply only the nonzero entries so even if this vector is very long it's not a big deal right so this can be very fast now we can also parallelize the computation so we can use a GPU to do some element-wise products so especially if we consider multiple classes right then we have to compute this dot product for every vector W associated with each class and then the features X and so there's lots of dot products we could compute all of those in parallel very easily by using a GPU and in fact a lot of the element-wise operations like products can be done in one operation on the GPU so here we can make this super efficient so the beauty of logistic regression is that
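The sparsity trick above could be sketched like this; storing the feature vector as a dictionary of nonzero indices is just one possible representation, chosen here for illustration:

```python
# sketch: dot product over sparse features, assuming x is stored as a dict
# mapping the indices of its nonzero entries to their values
def sparse_dot(w, x_nonzero):
    return sum(w[i] * v for i, v in x_nonzero.items())

w = [0.2, -0.1, 0.05, 0.7, -0.3]
# user has installed only apps 0 and 3 out of (potentially) millions;
# every other entry of the feature vector is zero and is never touched
x_nonzero = {0: 1.0, 3: 1.0}
score = sparse_dot(w, x_nonzero)   # multiplies 2 entries instead of all of them
```

The cost is proportional to the number of installed apps (tens), not the number of apps in the store (millions).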
it's something that can run in a smartphone or otherwise a low resource embedded device very simply and then even for applications where the prediction could be made on a large server or in some kind of server farm the idea is that if you have millions or billions of users and you need to serve lots of recommendations like ad placement for instance each time somebody clicks on a new link right chances are there are some ads in that new web page and for each one of them right you need to be able to make a prediction and select the ad very quickly right so all of this has to happen in a fraction of a second and so we need to make this very very efficient and so logistic regression is a good technique for that because first it's just a dot product we can simplify this dot product by exploiting sparsity and then we can parallelize a lot of those dot products all right so that's part of the reason the other aspect is that when you start by using logistic regression and let's say that you take into account some features and then you launch your product or you launch your recommender system and now you'd like to expand the type of features that you're using it's actually fairly easy all it means is that now you're going to have a vector that is just larger than before right but then you know the cost of expanding is not really an important cost it just means you've got to scale a little bit more but at least you don't have to redesign everything from scratch right so it's a nice technique in that sense to design scalable recommender systems yes yeah so if we change the feature vector yes in theory we need to retrain W another way to think about it is that you could say well the original W it was as if there were zero weights for those new features right so if you want to you could still use the old W but then that would mean that you're ignoring those new features so it's still possible but yes in general you will
have to retrain W yeah good point okay so yeah so the method that I explained is not scalable so Newton's method is not scalable if we have really millions of features because then the Hessian is going to be millions squared right so this is a I guess yeah I wanted to cover this method so that you know of another method beyond just gradient descent so it's by far one of the most popular techniques among the second-order techniques these techniques typically are not very scalable on the other hand they tend to be very fast if you're in low dimensional spaces yeah so in practice then people end up using approximations or what they would call quasi second-order methods okay so Newton's method will work well but one issue and in fact okay it's not just Newton's method but logistic regression in general it will work well but then it might overfit part of the issue is that because with logistic regression the optimization is convex so we've got this nice bowl shape for the loss function and then there's lots of techniques that we can use to find the global optimum and it's generally easy to find the global optimum in practice then we tend to find the global optimum but a problem is the global optimum might fit the data too well so it's a method that tends to overfit easily now when we have overfitting with logistic regression what happens is that at some point we have some data where there is a class that is known to be the correct class and if we want to fit this data very well then it means that we're going to try to make the probability of the class close to 1 or arbitrarily close to 1 and when we do this we're going to see in a moment that the weights will tend to become very large they will either go off to infinity or minus infinity and then the Hessian will tend to be singular so if the Hessian is singular we cannot apply Newton's method because we need to take the inverse of the Hessian and we can't do that if the matrix is singular okay so
let's draw a picture to understand this okay so I've got here the sigma so W transpose X as the input and this is here the sigma of W transpose X okay all right so if we're overfitting as I explained the problem is that we have some data and we know what the correct class is so if we're overfitting we're gonna try to make the probability of the correct class as large as possible and in fact as close as possible to one so now if we are trying to make a probability close to one what happens is that the sigmoid function is a function that goes from zero to one and never actually reaches one so it's asymptotically one right so now if we want to get as close as possible to one then what it means is that we're going to have to go as far as possible along this axis and it means that W transpose X bar will be arbitrarily large okay so to reach a sigma of W transpose X bar that is close to one we need W transpose X bar that goes to infinity and therefore the magnitude of W so the norm of W will also go off to infinity okay so again to reach a probability of one we need to go as far as possible along this axis so that means W transpose X bar goes to infinity now X corresponds to our features if they're binary that means they're just 0 or 1 whatever the features are they're typically something bounded but it's something fixed right so if W transpose X is going to blow up and become infinitely large it's got to be because of W and then as a result it means that we're gonna have some weights that become arbitrarily large okay so this is for W now what about the Hessian right so the Hessian if you recall is X bar R X bar transpose and here R is a matrix where we've got sigma times 1 minus the sigma so it's a diagonal matrix that looks like this and since the sigmoid will go to one and this is the case for every data point right then here it means that the entries on the diagonal the sigmas are going to be close to one but then one minus the sigmoid is going to be close to zero so over
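A quick numeric sketch of this collapse: as the magnitude of W transpose X bar grows, sigma times 1 minus sigma (the diagonal entries of R) shrinks toward zero:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# as w^T xbar grows, sigma -> 1 and sigma * (1 - sigma) -> 0,
# so the diagonal of R (and hence the Hessian) collapses toward zero
diag_entries = [sigmoid(z) * (1.0 - sigmoid(z)) for z in [1.0, 5.0, 10.0, 20.0]]
```

Already at W transpose X bar around 20 the corresponding entry of R is on the order of 1e-9, which is why the Hessian becomes numerically singular.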
all the diagonal is going to go to zero so here this will tend to 0 okay and as a result H will tend to zero as well okay so that's how we end up with a singular Hessian and then computing the inverse of the Hessian will not be possible numerically we're gonna have problems and then we won't be able to apply Newton's method so this is interesting because in fact you can almost well you can use this as a way of diagnosing whether or not you're overfitting when that happens then perhaps it's time to do something to prevent that now does anybody have any ideas of what we could do to prevent overfitting in this context yeah so regularize so we've seen methods before in the context of linear regression to deal with this and it turns out that we can do the same in this context here okay so regularization if you recall can be thought of as a way of simply making sure that the weights are not going to be too large by introducing a penalty so here we've got our objective so we're minimizing some loss function of W but the problem is some of the weights might become arbitrarily large so we add a penalty term that tries to minimize the magnitude of the weights and here we're just going to use the Euclidean norm for the magnitude so our new objective is the same as before plus this additional term and now in order to work with Newton's method you need to change what the Hessian is going to be so the Hessian again is the second order derivative of this objective and it will change in a way where you're going to add lambda I so you add lambda I to H so that's the main difference and here it should be clear that this will help to prevent any type of singularity because if R is going to 0 and this is also all going to 0 once we add lambda I then we make sure that the diagonal entries are essentially at least lambda and therefore the eigenvalues are going to be away from 0 and you'll be able to take the inverse okay all right so this course again is not about
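The regularized update could be sketched as follows, with the same assumed data layout as before (X bar with one column per data point); note the penalty changes both the Hessian and the gradient:

```python
import numpy as np

def regularized_newton_step(w, Xbar, y, lam):
    """One Newton update for L2-regularized logistic regression (a sketch).

    Adding lam * I to the Hessian keeps its eigenvalues at least lam,
    so the matrix stays invertible even as sigma_n (1 - sigma_n) -> 0.
    """
    s = 1.0 / (1.0 + np.exp(-(w @ Xbar)))
    grad = Xbar @ (s - y) + lam * w                 # the penalty also changes the gradient
    R = np.diag(s * (1.0 - s))
    H = Xbar @ R @ Xbar.T + lam * np.eye(len(w))    # Hessian plus lambda I
    return w - np.linalg.solve(H, grad)
```

On perfectly separable data, where the unregularized weights would run off to infinity, this iteration converges to finite weights.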
optimization so the details here regarding eigenvalues singularity and so on don't worry about this if this is not making too much sense it's a nice connection with linear algebra but for our purposes in machine learning the main point is that overfitting is possible and then we can prevent overfitting by essentially adding this penalty term and that's what you would do when you apply Newton's method so the Hessian changes and now what is not shown on this slide but you also need to take into account is that the gradient will change too because Newton's method is the inverse of the Hessian times the gradient right so you need to change the gradient to also take into account this additional term any questions regarding this ok very good so I guess in assignment 2 you guys will get to practice this by programming this all right so now what we've done so far is that we've seen how we can generalize mixtures of gaussians to logistic regression where we're not assuming Gaussians anymore right but what is still somewhat of a limitation here is that we obtain a classifier where the boundary is linear right so if I just go back when we have two classes right the boundary between those two classes is determined by W transpose X bar and whenever we've got multiple classes it's also a linear expression all right so that means we're gonna have a linear separator between the classes and in a lot of problems we don't want linear separators it might be the case that we have data that clusters in interesting ways and then the boundary should be nonlinear so an interesting question then is can we generalize what we've talked about to nonlinear separators okay so so far we talked about linear models and now an interesting question is can we relax this assumption can we do classification and also regression that are nonlinear and obviously the answer is yes but then there's a catch once we go nonlinear things are going to be more complex right so it's going to be more complex
mathematically computationally it's also going to take more time and then in many cases we're not going to be able to find algorithms that are robust and solutions that are necessarily good so then the real question is when we go nonlinear is there a way that we could in fact retain the simplicity of linear models and yet really do nonlinear classification and regression and it turns out that there is a way of doing this we're going to make some assumptions still for that but there is a way and that leads to what are known as generalized linear models okay so the main idea behind this is that we're going to map the inputs the inputs are the x's that we have in our data set to a new space and then we're going to do linear regression and classification in that space and here this is going to achieve something nonlinear as long as we choose our mapping to go to the new space in a way that this mapping is nonlinear right because if I have something in some space and I map it to a new space with a nonlinear transformation and then in that new space I do linear regression or linear classification then a line in that new space really corresponds to a curve in the original space that is nonlinear so this is a very simple approach that will allow us to generalize all of the linear models that we talked about so that they can work in nonlinear settings and in fact it's also going to be the basis for kernel methods that we're going to see later in the course okay so I'm going to draw a picture to illustrate this and in this particular case just as a simple example let's consider a quadratic function okay so let's say that I've got some data that is roughly in this location and now let's say that I'm interested in doing regression but so far all we've seen is how to do linear regression and it's clear that when you look at this data I can't fit a line very well through this data because it looks like it's curved right and now so let's
say that the true underlying function so we've got X here f of X here so the true underlying function perhaps is this nice curve that corresponds let's see to a quadratic function so this could be f of X is equal to X minus 1 squared plus 2 which is the same as having x squared minus 2x plus 3 okay so this is just an example it's an arbitrary function right but let's say that this is the true underlying function for this data and now the question is how can we approximate this function by doing some form of nonlinear regression okay so as I explained before the main idea is going to be let's map our data from the original space to a new space and here I'm gonna use Phi to denote this mapping so Phi is going to map every input X into a new feature so for instance in the example that I wrote on the board here X is just a scalar but let's say that I map this scalar X to three values so I'm gonna have a first mapping Phi 0 that gives me 1 regardless of what X is another mapping Phi 1 that essentially keeps X the same and then a third mapping Phi 2 that essentially squares X so this is a transformation I'm taking my input in this case it's a scalar X right and then I'm just mapping it into a three dimensional space that has one X and X squared so this might feel a little bit weird because very often in data science what we want to do is the opposite we have let's see a large input space and then we're looking to extract just a few features and we're trying to do some form of compression and get into a low dimensional space but here we're actually doing the opposite and we'll see that it's a good thing so we're starting from something that is one-dimensional and then in this example we're going to a three-dimensional space generally speaking for what I'm explaining here we're gonna start with some input space and then when we want to essentially do some form of nonlinear regression or classification into a new space then we'll
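The board example above can be sketched directly: map the scalar x to (1, x, x squared) and fit a linear model in that space; the sampled range and the least-squares solver are my own choices for illustration:

```python
import numpy as np

def phi(x):
    # polynomial basis functions: phi_0 = 1, phi_1 = x, phi_2 = x^2
    return np.array([np.ones_like(x), x, x**2]).T

# noiseless samples from the example function f(x) = (x - 1)^2 + 2
x = np.linspace(-2, 4, 20)
f = (x - 1)**2 + 2

# ordinary least squares in the new space: f is linear in phi(x)
w, *_ = np.linalg.lstsq(phi(x), f, rcond=None)
# w recovers the coefficients (3, -2, 1) of x^2 - 2x + 3
```

So a plane in (phi_0, phi_1, phi_2) space corresponds exactly to the curve in the original space.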
want to come up with some mapping that will get us into a larger space where by having more dimensions we'll be able to capture some of the nonlinearities okay so this mapping we're going to say that in general it denotes a basis function so every phi i is going to be a different function that corresponds to a different basis function and the idea is that I can feed it with any input X and then it gives me a new feature and the intuition is that in many problems right the features are not something that's necessarily hard-coded or something that everyone would agree to so depending on how a problem is formulated and depending on what type of instruments you have to measure the inputs in your data you might come up with certain features but then someone else might just encode this data using different features and here if the encoding is something we can play with well why don't we define a mapping that will essentially allow us to go from whatever original space to a new space make it nonlinear and therefore implicitly when we do linear classification or regression in the new space we're effectively doing nonlinear classification or regression in the original space okay so now in the new space you see now Phi 0 Phi 1 and Phi 2 of X are essentially the input and now we're going to be looking for coefficients W naught W 1 and W 2 such that then we can find the best function in that hypothesis space so you see here we're just looking for weights right and this is a linear expression in the Phi's right so we're really just doing linear classification or regression and everything is nice it's just that we use Phi to essentially capture the nonlinearities so that's the beauty of this and here this is again just an example but ok to be concrete you see in that new space you see the function we're going to be looking for is linear so what it means is that I could draw a corresponding figure here where now I'm gonna have more dimensions so let's
say that I have here the axis for Phi 1 of X and here Phi 2 of X and here f of X okay so here it's curved right now I'm gonna have something linear and okay so it's just gonna correspond to some sort of plane or hyperplane that is linear with respect to so I guess f of X is going to be linear with respect to Phi 1 and Phi 2 here I've dropped Phi 0 simply because I can draw something in 3d but I can't draw anything in 4d okay so normally there's also Phi 0 and in this particular case Phi 0 is always 1 so I can just drop it it's really just going to move my hyperplane but it still shows up as a hyperplane ok so that's the idea any questions regarding this yeah ah very good question yes so here for this to work the assumption is that we need to be able to come up with basis functions essentially the mapping and now I mean this was easy because I knew the function was quadratic so I just picked a mapping that essentially captures any quadratic function so 1 x x squared and everything is good right but if I don't know whether the function is quadratic then what should be my basis functions right so this is an important question ideally what happens is that you have domain knowledge you have a sense of what might be the space of functions that you need to capture and then you pick basis functions to capture that failing that there's actually a whole literature on learning basis functions and also learning kernels so this is beyond the scope of the course but in any case there are techniques as well for learning that yes so yeah so here okay I said that this function is the true underlying function f right now what I've done is simply transform f into my new space so this is still f of X ok so it's the true underlying function but now when I'm searching I'm searching in a space of functions the hypothesis space and there I'm gonna search in a space of linear functions so that means I'm gonna search in the space of all hyperplanes this is just one that
corresponds to the true underlying hyperplane but you know I can tilt it I can change it but it's still gonna be a hyperplane yeah very good point so here it's a vector right so in this example I started with X as input which is a scalar and now I get into a new space where my input is a vector Phi 0 Phi 1 Phi 2 and generally speaking I'm gonna go into a higher dimensional space so it's definitely going to be a vector yes that's right so yeah a hyperplane is defined by essentially the vector that's perpendicular so we can think of the weights as essentially defining the vector that's normal to the hyperplane yes so why are the Phi's okay so here yeah the whole point is that you see I can take any quadratic function rewrite it in a way where I've got ax squared plus bx plus c alright so a b and c happen to be 1 -2 and 3 here and I can treat x squared x and 1 as essentially just basis functions where now in the new space f is really so I wrote here f of X but really we can think of it as f of Phi of X right and now you see if I feed in Phi of X which corresponds really so here Phi of X is like a vector 1 x x squared right then if I feed this in right then I just need to find coefficients so it's really linear in Phi of X okay okay so now coming back to the question of basis functions so we just saw a little example the type of basis functions in our example were essentially what are known as polynomial basis functions because I can take a scalar X and then I can look at every single power of X and then if I have essentially all the powers up to a certain degree then I will be able to span polynomial functions so then these are known as polynomial basis functions now I don't have to limit myself to this right so I can consider all kinds of other functions so here in general they just have to be nonlinear so for instance I could consider Gaussian basis functions and here we should not be confused with Gaussian
distributions right so this is not meant to return a probability right it's just a function that happens to have the same form as the algebraic formula of the Gaussian distribution so it's e to the minus X minus mu J squared divided by 2 s squared so you can see that it has the same form as the Gaussian distribution but it's meant just to be something nonlinear right and then the Gaussian distribution right has a shape that is like a bell shape right so it's a nonlinear function and then we could use it as a basis function right so it will allow us to map some input to some outputs in this form another type of function that we've talked about is the sigmoid function and then again so far we talked about it in the context of a statistical model where the sigmoid arises as the posterior distribution but in this case it's not a distribution we can just use it as a function that is nonlinear and then it's just the sigmoid applied to X minus mu J divided by s and then this is another type of basis function that we can consider that is nonlinear now these are just examples there's also for basis functions wavelets and many others and in general you can consider any type of nonlinear function that you think might be appropriate yeah okay yeah good point so here for Gaussian basis functions there are two parameters the mean and now here the standard deviation although because this is not a distribution really the standard deviation is just the width of the function so in the case of a Gaussian basis function it's really a bump right that corresponds to a bell shape right and now we can think of mu as essentially indicating where this bump is going to be along the x-axis and then s is going to tell us the width of that bump so these are two parameters and now we can in fact generate lots of Gaussian basis functions just by changing what mu and s are and then we can also devise some learning techniques to fit what might be the
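The two basis functions just mentioned could be written as follows, a minimal sketch where mu and s are the location and width parameters described above:

```python
import numpy as np

def gaussian_basis(x, mu, s):
    # bell-shaped bump centered at mu with width s; not a probability density,
    # just a nonlinear function with the algebraic form of a Gaussian
    return np.exp(-(x - mu)**2 / (2 * s**2))

def sigmoid_basis(x, mu, s):
    # smooth step centered at mu, with slope controlled by s
    return 1.0 / (1.0 + np.exp(-(x - mu) / s))
```

Varying mu and s generates a whole family of bumps and steps, and a linear combination of enough of them can approximate quite flexible nonlinear functions.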
best choice for mu and s but for at least the purpose of just this slide in this lecture right my intention here is just to show you that we could also construct basis functions using this algebraic formula I think there was one other question or okay keep going oh yes okay good point so yeah here we could make s a parameter therefore it should have J as an index so that we can change it depending on which basis function we look at yes yeah okay go ahead absolutely yeah so sine cosine are common ones that can be used here yeah so we could come up with a really long list okay all right so with that in place now we're ready to revisit all of the linear models that we discussed and generalize them so that we can implicitly do nonlinear regression and nonlinear classification so if we go back to linear regression all right the way we formulated linear regression or at least one way we formulated it was as an optimization problem where we minimize squared loss with some regularization term and now we can change that simply by replacing X with Phi of X right so X is the input in our original space Phi of X is now the input in the new space and then everything else is going to work the same way so in terms of running an algorithm right or deriving all of the math everything is the same as before all I'm doing is changing X to Phi of X and then for classification we saw for instance how we can do logistic regression for classification using this expression and now we replace X again by Phi of X in both locations and now we obtain I guess a separator that is not necessarily linear I mean it is linear in the new space but in the original space it's nonlinear depending on the type of basis functions we use any other questions regarding this material yes okay so when Phi is applied to a vector it's also going to be a vector so I guess in the example that I did on the board Phi was applied to X where X happened to be just a scalar but
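The recipe of replacing X by Phi of X in regularized least squares could be sketched like this; the particular basis functions (a constant, sine, and cosine) are an arbitrary choice for illustration:

```python
import numpy as np

def design_matrix(x, basis_fns):
    # replace each input x_n by its feature vector phi(x_n):
    # one column per basis function, one row per data point
    return np.column_stack([phi(x) for phi in basis_fns])

def ridge_fit(Phi, y, lam):
    # regularized least squares in the mapped space:
    # w = (Phi^T Phi + lam I)^{-1} Phi^T y
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)
```

All the linear-regression math is unchanged; the nonlinearity lives entirely inside the basis functions that build the design matrix.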
generally speaking we will have many dimensions for our inputs right the original space will be defined by a vector so X is going to be a vector so that's okay when we apply Phi to a vector X we also get another vector just with different features as output exactly yeah so here Phi can be applied to different parts of the original input X so it's possible that the first basis function really just applies to the first entry of X and the second basis function could just apply to the second entry of X and so on yep yeah so ideally we choose the Phi's the basis functions in a way that will give us a space of functions that should capture what we think is the true underlying function so for instance if we think that the true underlying function is some polynomial function of a certain degree then we can use polynomial basis functions up to that degree but now if we don't think that polynomials are going to be good enough because maybe it's something that depends more on some exponential then perhaps a set of Gaussian basis functions would be better yeah oh no no no so yeah when we apply Phi it doesn't have to be applied to each dimension of X separately it can be so but it doesn't have to be like that so generally speaking Phi takes as input the entire vector X and then might use all of the entries in X to produce the first output the first dimension in our new space and then same thing for the second dimension and so on yeah okay very good so this concludes the slides for this lecture and now let's move on to the next topic
Info
Channel: Pascal Poupart
Id: CXV-vVoPFHw
Length: 94min 44sec (5684 seconds)
Published: Sat Jun 08 2019