Lecture 4 - Perceptron & Generalized Linear Model | Stanford CS229: Machine Learning (Autumn 2018)

A couple of announcements before we get started. First of all, PS1 is out. Problem set 1 is due on the 17th. That's two weeks from today, so you have exactly two weeks to work on it. I think you can take up to three late days. There's a good amount of programming and a good amount of math you need to do. The solutions need to be uploaded to Gradescope, and you'll have to make two submissions. One submission will be a PDF file, for which you can either use a LaTeX template that we provide or handwrite it, but you're strongly encouraged to use the LaTeX template. And there is a separate coding assignment, for which you'll have to submit code as a separate Gradescope assignment. So you're gonna see two assignments in Gradescope: one for the written part, the other for the programming part.
With that, let's jump right into today's topics. Today we're gonna briefly cover the perceptron algorithm. Then a good chunk of today is gonna be exponential families and generalized linear models. And we'll end with softmax regression for multi-class classification.
So first of all, the perceptron algorithm, I should mention, is not something that is widely used in practice. We study it mostly for historical reasons, and also because it's nice and simple, it's easy to analyze, and we also have homework questions on it. So, logistic regression. We saw that logistic regression uses the sigmoid function, right?
Logistic regression uses the sigmoid function, which essentially squeezes the entire real line, from minus infinity to infinity, into the range between 0 and 1, and the value between 0 and 1 kind of represents the probability, right? In the sigmoid function, at z equals 0, g of z is a half. As z tends to minus infinity, g tends to 0, and as z tends to plus infinity, g tends to 1. You could also think of a variant of that, which is the perceptron. The perceptron algorithm uses a somewhat similar but different function. Let's say this is z. Here g of z is 1 if z is greater than or equal to 0, and 0 if z is less than 0, right? So you can think of this as the hard version of the sigmoid function. And this leads to the hypothesis function here being h Theta of x equals g of Theta transpose x. Theta is the parameter, x is the input, and h Theta of x will be 0 or 1, depending on whether Theta transpose x was less than 0, or greater than or equal to 0. Similarly, in logistic regression we had h Theta of x equal to 1 over 1 plus e to the minus Theta transpose x. That's essentially g of z where g is the sigmoid function.
Both of them have a common update rule which, on the surface, looks similar: Theta j equals Theta j plus Alpha times (y_i minus h Theta of x_i) times x_ij, right? So the update rules for the perceptron and logistic regression look the same, except h Theta of x means different things in the two scenarios. We also saw that it was similar for linear regression as well. And we're gonna see why this is actually a more common theme.
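As a rough illustration of the two choices of g and the shared update rule described above, here is a minimal Python sketch. The function names (`sigmoid`, `step`, `update`), the learning rate of 0.1, and the single toy example are assumptions made for illustration; they are not taken from the lecture.

```python
import math

def sigmoid(z):
    """Logistic function: squashes the real line into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def step(z):
    """Perceptron threshold (the 'hard' sigmoid): 1 if z >= 0, else 0."""
    return 1.0 if z >= 0 else 0.0

def dot(theta, x):
    """Inner product theta^T x for two equal-length lists."""
    return sum(t * xj for t, xj in zip(theta, x))

def update(theta, x, y, g, alpha=0.1):
    """One online update: theta_j := theta_j + alpha * (y - g(theta^T x)) * x_j."""
    error = y - g(dot(theta, x))
    return [t + alpha * error * xj for t, xj in zip(theta, x)]

# Same rule, different g: only the hypothesis h_theta(x) = g(theta^T x) changes.
theta = [0.0, 0.0]
x, y = [1.0, 2.0], 0.0  # hypothetical training example with label 0

theta_perceptron = update(theta, x, y, step)    # error is exactly -1 here
theta_logistic = update(theta, x, y, sigmoid)   # error is -0.5 here
```

With the threshold function the error term is always 0, +1, or -1, so the perceptron either leaves Theta alone or moves it by a full multiple of the example; with the sigmoid the step is scaled by how confident the wrong prediction was.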
So what's happening here? If you inspect this equation to get a better sense of what's happening in the perceptron algorithm, this quantity over here is a scalar, right? It's the difference between y_i, which can be either 0 or 1, and h Theta of x_i, which can also be either 0 or 1. So when the algorithm makes a prediction h Theta of x_i for a given x_i, this quantity will be 0 if the algorithm got it right already. And it will be either plus 1 or minus 1 if it got it wrong: if the ground truth was 1 and the algorithm predicted 0, this evaluates to plus 1, and similarly it is minus 1 if the algorithm was wrong and y_i is 0.
To see what's happening, it's useful to look at this picture. This is the input space, and let's imagine there are two classes, boxes and, let's say, circles. You want to learn an algorithm that can separate these two classes. Imagine that what the algorithm has learned so far is a Theta that represents this decision boundary. So this line represents Theta transpose x equals 0. Anything above it is Theta transpose x greater than 0, and anything below is Theta transpose x less than 0, all right? And let's say the algorithm is learning one example at a time, and a new example comes in. This time the new example happens to be a box, but the algorithm has misclassified it. Now, for this line, the separating boundary, the vector equivalent would be a vector that's normal to the line. So this would be Theta, all right? And this is our new x.
So this got misclassified; it's lying below the decision boundary. So what's gonna happen here? Let's call this the 1 class and this the 0 class. So y_i minus h Theta of x_i will be plus 1. And what the algorithm is doing is it sets Theta to be Theta plus Alpha times x. So this is the old Theta, this is x, and Alpha is some small learning rate. Let me use a different color here. It adds Alpha times x to Theta, and now, let's call it Theta prime, this is the new vector. That's the updated value. And the separating hyperplane corresponding to this is something that is normal to it, right? So it updated the decision boundary such that x is now included in the positive class.
The idea here is that we want Theta to be similar to x, in general, when y is 1, and we want Theta to be not similar to x when y equals 0. The reason is that when two vectors are similar, the dot product is positive, and when they are not similar, the dot product is negative. What does that mean? Let's say this is x and you have Theta. If they're kind of pointing away from each other, their dot product would be negative. And if you have a Theta that looks like this, call it Theta prime, then the dot product will be positive if the angle is less than 90 degrees. So this essentially means that as Theta is rotating, the decision boundary stays perpendicular to Theta, and you wanna get all the positive x's on one side of the decision boundary. And what's the most naive way of taking Theta and, given x, making Theta closer to x? A simple thing is to just add a component of x in that direction. You add it here and that gives you the new Theta.
And this is a very common technique used in lots of algorithms: if you add a vector to another vector, you make the second one kind of closer to the first one, essentially. So this is the perceptron algorithm. You go example by example in an online manner, and if the example is already classified correctly, you do nothing. You get a 0 over here. If it is misclassified, you either add a small multiple of the example itself to your Theta, or you subtract it, depending on the class of the example. This is about it. Any questions about the perceptron? Cool. So let's move on to the next topic, exponential families. An exponential family is essentially a class of... yeah?
[Student] Why don't we use them in practice?
It's not used in practice because it does not have a probabilistic interpretation of what's happening. You kinda have a geometrical feel of what's happening with the hyperplane, but it doesn't have a probabilistic interpretation. Also, I think the perceptron was pretty famous in the 1950s or the '60s, when people thought this was a good model of how the brain works. And I think it was Marvin Minsky who wrote a paper saying the perceptron is kind of limited because it could never classify points like this. There is no possible separating boundary that can do something as simple as this. And people kind of lost interest in it. In fact, what we see is that logistic regression is like a soft version of the perceptron itself, in a way. Yeah?
[Student] [inaudible]
It's up to you; it's a design choice that you make.
What you could do is anneal your learning rate with every step: every time you see a new example, decrease your learning rate until you stop changing Theta by a lot. You're not guaranteed that you'll be able to get every example right. For example here, no matter how long you learn, you're never gonna find a separating boundary. So it's up to you when you wanna stop training. A common thing is to just decrease the learning rate with every time step until you stop making changes.
All right, let's move on to exponential families. Exponential families are a class of probability distributions which are somewhat nice mathematically. They're also very closely related to GLMs, which we will be going over next. But first we'll take a deeper look at exponential families and what they're about. An exponential family is one whose PDF can be written in the following form. By PDF I mean probability density function, but for a discrete distribution it would be the probability mass function. All right, this looks pretty scary. Let's break it down into what the terms actually mean. So y over here is the data, right? And there's a reason why we call it y, because... yeah?
[Student] Can you write a bit larger?
A bit larger, sure. Is this better? Yeah. So y is the data. There's a reason why we call it y and not x, and that's because we're gonna use exponential families to model the output of your data in a supervised learning setting. You're gonna see x when we move on to GLMs. Until then we're just gonna deal with y's. So y is the data. Eta is called the natural parameter.
T of y is called the sufficient statistic. If you have a statistics background and you've come across the term sufficient statistic before, it's the exact same thing. But you don't need to know much about this, because for all the distributions that we're gonna be seeing today, or in this class, T of y will be equal to just y. So you can just replace T of y with y for all the examples today and in the rest of the class. b of y is called the base measure. And finally, a of Eta is called the log-partition function. We're gonna be seeing a lot of this function.
So again, y is the data that this probability distribution is trying to model. Eta is the parameter of the distribution. T of y will mostly be just y, but technically T of y is more correct. b of y is a function of only y; it cannot involve Eta. Similarly, T of y cannot involve Eta; it should be purely a function of y. And a of Eta has to be a function of only Eta and constants; no y can be part of a of Eta.
The reason why this is called the log-partition function is pretty easy to see, because this can be written as b of y times exp of Eta times T of y, divided by e to the a of Eta. These two are exactly the same; you just take this term out. Sorry, this should be the log. I think it's fine. These two are exactly the same.
[Student] It should be the [inaudible] and that should be positive.
Oh, yeah, you're right. This should be positive. Thank you. So you can think of this as a normalizing constant of the distribution, such that the whole thing integrates to 1, right? And therefore the log of this will be a of Eta. That's why it's called the log of the partition function.
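The board is not captured in the captions. Reconstructed from the spoken definitions above, and assuming the standard notation for this material, the form being described is:

```latex
% Exponential family: y is the data, \eta the natural parameter,
% T(y) the sufficient statistic, b(y) the base measure,
% a(\eta) the log-partition function.
p(y; \eta) \;=\; b(y)\,\exp\!\big(\eta^{\top} T(y) - a(\eta)\big)
          \;=\; \frac{b(y)\,\exp\!\big(\eta^{\top} T(y)\big)}{e^{a(\eta)}}

% Requiring the density to integrate to 1 fixes the normalizer:
e^{a(\eta)} \;=\; \int b(y)\,\exp\!\big(\eta^{\top} T(y)\big)\,dy
\quad\Longrightarrow\quad
a(\eta) \;=\; \log \int b(y)\,\exp\!\big(\eta^{\top} T(y)\big)\,dy
```

The second form makes the naming visible: the denominator is the partition function (the normalizing constant), and a of Eta is its logarithm. For a discrete distribution the integral becomes a sum.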
So the partition function is a technical term for the normalizing constant of a probability distribution. Now, you can plug in any definition of b, a, and T. Yeah? Sure. So y, for most of our examples, is going to be a scalar. Eta can be a vector, but we will mostly be focusing on the case where it's a scalar, except maybe in softmax. T of y has to match, so the dimensions of these two have to match [NOISE]. And these are scalars, right? So the choice of a, b, and T can be completely yours. As long as the expression integrates to 1, you have a family in the exponential family. What does that mean? For some specific choice of a, b, and T, this will be equal to, say, the PDF of the Gaussian, in which case, for that choice of T, a, and b, you got the Gaussian distribution: a family of Gaussian distributions such that for any value of the parameter, you get a member of the Gaussian family.
To show that a distribution is in the exponential family, the most straightforward way is to write out the PDF of the distribution in a form that you know, and just do some algebraic massaging to bring it into this form. And then you do a pattern match and conclude that it's a member of the exponential family. So let's do it for a couple of examples [NOISE].
A Bernoulli distribution is one you use to model binary data. It has a parameter, let's call it Phi, which is the probability of the event happening. Now, what's the PDF of a Bernoulli distribution? One way to write this is Phi to the y, times 1 minus Phi to the 1 minus y. I think this makes sense.
This pattern is like a way of writing a programmatic if-else in math. Whenever y is 1, this term cancels out, so the answer is Phi. And whenever y is 0, this term cancels out and the answer is 1 minus Phi. So this is just a mathematical way to represent an if-else that you would do in programming. So this is the PDF of a Bernoulli, and our goal is to take this form and massage it into that form, and see what the individual T, b, and a turn out to be.
Whenever you see your distribution in this form, a common technique is to wrap it with a log and then an exp, because these two cancel out, so this is exactly equal to this [NOISE]. And if you do some more algebra on this, we will see that it turns out to be exp of log of Phi over 1 minus Phi, times y, plus log of 1 minus Phi. It's pretty straightforward to go from here to here; I'll let you verify it yourselves. But once we have it in this form, it's easy to start doing some pattern matching from this expression to that expression. If you match this with that, the base measure b of y will be just 1, because there's no b of y term here. And this would be Eta, this would be T of y, and this corresponds to a of Eta. So b of y is 1. T of y is just y, as expected. Eta is equal to log of Phi over 1 minus Phi. And an equivalent statement is to invert this operation and say Phi is equal to 1 over 1 plus e to the minus Eta. I'm just flipping the operation: this went from Phi to Eta, and here it goes from Eta to Phi, right?
And a of Eta: here we have it as a function of Phi, but we got an expression for Phi in terms of Eta, so you can plug that expression in here, and note the change of sign. Let me work this out. a of Eta is minus log of 1 minus Phi; this is just the pattern matching there. And that is minus log of 1 minus 1 over 1 plus e to the minus Eta. The reason is that we want an expression in terms of Eta. Here we got it in terms of Phi, so we need to plug in Eta over here. And this will just be log of 1 plus e to the Eta. So there you go. This verifies that the Bernoulli distribution is a member of the exponential family. Any questions here? Note that this may look familiar. It looks somewhat like the sigmoid function, and that's actually no accident. We'll see how it relates to logistic regression in a minute.
So another example [NOISE]: a Gaussian with fixed variance. A Gaussian distribution has two parameters, the mean and the variance. You can also consider Gaussians where the variance is also a variable, but for our course we are only interested in Gaussians with fixed variance, and we are going to assume that the variance is equal to 1. So this gives the PDF of a Gaussian, p of y parameterized by Mu. Note here that when we start writing out, we start with the parameters that we are commonly used to; they are also called the canonical parameters. And then we set up a link between the canonical parameters and the natural parameters. That's part of the massaging exercise that we do.
So we're going to start with the canonical parameters: p of y is equal to 1 over root 2 pi, times exp of minus (y minus Mu) squared over 2. This is the Gaussian PDF with variance equal to 1. And this can be rewritten as... again, I'm skipping a few algebra steps, straightforward, no tricks there. Any question? Yep? [BACKGROUND] Fixed variance. So this is 1 over root 2 pi, times e to the minus y squared over 2, times exp of Mu times y minus Mu squared over 2. Again, we go through the same exercise and pattern match: this is b of y, this is Eta, this is T of y, and this would be a of Eta. So we have b of y equals 1 over root 2 pi times e to the minus y squared over 2. Note that this is a function of only y; there's no Eta here. T of y is just y, and in this case the natural parameter Eta is Mu. The log-partition function is equal to Mu squared over 2, and we repeat the same exercise we did here: we start with a log-partition function that is parameterized by the canonical parameters, and we use the link between the canonical and the natural parameters, invert it, and in this case it's the same thing, so a of Eta is Eta squared over 2. So a of Eta is a function of only Eta, T of y is a function of only y, and b of y is a function of only y as well. Any questions on this? Yeah?
[Student] If the variance is unknown [inaudible].
Yeah, if the variance is unknown you can still write it as an exponential family, in which case Eta will now be a vector. It won't be a scalar anymore; it'll have two components, like Eta 1 and Eta 2, and you will also have a mapping between the canonical parameters and the natural parameters. You can do it; it's pretty straightforward.
So these are exponential families. The reason why we use exponential families is that they have some nice mathematical properties.
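The two pattern-matching exercises worked through above can be summarized as follows. This is a reconstruction from the spoken derivation, since the board itself is not in the captions, with the skipped algebra steps filled in.

```latex
% Bernoulli with parameter \phi
p(y; \phi) = \phi^{y}(1-\phi)^{1-y}
           = \exp\!\big(y\log\phi + (1-y)\log(1-\phi)\big)
           = \exp\!\Big(\log\!\tfrac{\phi}{1-\phi}\; y + \log(1-\phi)\Big)

b(y) = 1, \qquad T(y) = y, \qquad
\eta = \log\frac{\phi}{1-\phi} \;\Longleftrightarrow\; \phi = \frac{1}{1+e^{-\eta}}, \qquad
a(\eta) = -\log(1-\phi) = \log\!\big(1 + e^{\eta}\big)

% Gaussian with mean \mu and variance fixed to 1
p(y; \mu) = \frac{1}{\sqrt{2\pi}}\exp\!\Big(-\frac{(y-\mu)^{2}}{2}\Big)
          = \frac{1}{\sqrt{2\pi}}\, e^{-y^{2}/2}\,\exp\!\Big(\mu y - \frac{\mu^{2}}{2}\Big)

b(y) = \frac{1}{\sqrt{2\pi}}\, e^{-y^{2}/2}, \qquad T(y) = y, \qquad
\eta = \mu, \qquad a(\eta) = \frac{\mu^{2}}{2} = \frac{\eta^{2}}{2}
```

In both cases b and T depend only on y, and a depends only on Eta, which is what the pattern match requires.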
So one property is that if we perform maximum likelihood on the exponential family, when the exponential family is parameterized in the natural parameters, then the optimization problem is concave. So MLE with respect to Eta is concave. Similarly, if you flip the sign and use what's called the negative log-likelihood, so you take the log of the expression and negate it, the negative log-likelihood is like the cost function equivalent of doing maximum likelihood. You're just flipping the sign: instead of maximizing, you minimize the negative log-likelihood, and the NLL is therefore convex.
The next property is about the expectation of y. What does this mean? For each distribution, we start with a of Eta and differentiate the log-partition function with respect to Eta. You get another function of Eta, and that function is the mean of the distribution as parameterized by Eta. Similarly, the variance of y parameterized by Eta is just the second derivative of a with respect to Eta; this was the first derivative, this is the second derivative. The reason why this is nice is that, in general, to calculate the mean and the variance of a probability distribution you need to integrate something, but over here you just need to differentiate, which is a much easier operation, all right? And you will be proving these properties in your first homework. You're provided hints, so it should be [LAUGHTER].
All right, now we're going to move on to generalized linear models. This is all we wanna talk about exponential families. Any questions? Yep?
[Student] [inaudible]
Exactly. If it's a multivariate Gaussian, then this Eta would be a vector, and this would be the Hessian. All right, let's move on to GLMs.
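The derivative properties above can be checked numerically for the two families derived earlier. The following is a small sketch, not a proof: it uses central finite differences (an assumption of this example, chosen to avoid any external library), and the helper names and the test value `eta = 0.7` are hypothetical.

```python
import math

def a_bernoulli(eta):
    """Log-partition function of the Bernoulli: log(1 + e^eta)."""
    return math.log(1.0 + math.exp(eta))

def a_gaussian(eta):
    """Log-partition function of the unit-variance Gaussian: eta^2 / 2."""
    return 0.5 * eta * eta

def first_derivative(f, x, h=1e-5):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def second_derivative(f, x, h=1e-4):
    """Central-difference approximation of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

eta = 0.7  # arbitrary value of the natural parameter

# Bernoulli: mean should be phi, variance phi * (1 - phi).
phi = 1.0 / (1.0 + math.exp(-eta))
bern_mean = first_derivative(a_bernoulli, eta)
bern_var = second_derivative(a_bernoulli, eta)

# Unit-variance Gaussian: mean should be mu = eta, variance 1.
gauss_mean = first_derivative(a_gaussian, eta)
gauss_var = second_derivative(a_gaussian, eta)

print(bern_mean, phi)
print(bern_var, phi * (1.0 - phi))
print(gauss_mean, gauss_var)
```

The printed pairs should agree to several decimal places, which matches the claim that the mean and variance come from differentiating a of Eta rather than from integrating against the density.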
So the GLM is somewhat like a natural extension of the exponential families to include covariates, or include your input features in some way. In the exponential families we are only dealing with the y, which in our case will map to the outputs. But we can actually build a lot of powerful models by choosing an appropriate family in the exponential family and plugging it onto a linear model. So these are the assumptions, or design choices, that are gonna take us from exponential families to generalized linear models. The most important assumption is that y given x, parameterized by Theta, is a member of an exponential family with natural parameter Eta. By exponential family of Eta, I mean that form. In the particular scenario that you have, it could take on any one of these distributions. We only talked about the Bernoulli and the Gaussian, but there are also other distributions that are part of the exponential family. For example, I forgot to mention this: if you have real-valued data, you use a Gaussian. If you have binary data, a Bernoulli. If you have counts, you can use a Poisson. Real-valued means it can take any real value, while by count I mean just non-negative integers, and not anything in between. If you have positive real-valued data, like the volume of some object, or the time to an event, where you are only predicting into the future, you can use something like a Gamma or an exponential.
So there is the exponential family, and there is also a distribution called the exponential distribution, and these are two distinct things. The exponential distribution happens to be a member of the exponential family as well, but they're not the same thing. And you can also have probability distributions over probability distributions, like the Beta and the Dirichlet. These mostly show up in Bayesian machine learning or Bayesian statistics. So it depends on the kind of data that you have. If you're trying to do regression, then your y is going to be, say, a Gaussian. If you're trying to do classification, and it's binary classification, then the exponential family member would be a Bernoulli. So depending on the problem that you have, you can choose any member of the exponential family, as parameterized by Eta. That's the first assumption: that y given x is a member of the exponential family.
The second design choice that we are making here is that Eta is equal to Theta transpose x. So this is where your x now comes into the picture. Theta is in R^n, and x is also in R^n. Now, this n has nothing to do with anything in the exponential family. It's purely the dimension of your inputs, the x's, and it does not show up anywhere else. So we make a design choice that Eta will be Theta transpose x.
And another kind of assumption is about test time. Given a new x, we want to produce an output. So given an x, we get an exponential family distribution, right?
And the mean of that distribution will be the prediction that we make for a given x. This may sound a little abstract, but we're going to make it more clear. What this essentially means is that the hypothesis function is actually just the expected value of y given x, parameterized by Theta. This is our hypothesis function. And we will see that if you plug in a Gaussian as the exponential family, then the hypothesis will be the same hypothesis that we saw in linear regression. If we plug in a Bernoulli, then this will turn out to be the same hypothesis that we saw in logistic regression, and so on.
One way to visualize this is that there is a model and there is a distribution. The model we are assuming to be a linear model: given x, there is a learnable parameter Theta, and Theta transpose x will give you a parameter, Eta. This is the model, and here is the distribution. The distribution is a member of the exponential family, and the parameter for this distribution is the output of the linear model. This is the picture you want to have in your mind. And the exponential family we choose depends on the data that we have. Whether it's a classification problem, a regression problem, or a time-to-event problem, you would choose an appropriate b, a, and T based on the distribution of your choice. And from this you can get the expectation of y given Eta, which is the same as the expectation of y given Theta transpose x. And this is essentially our hypothesis function. Yep? [BACKGROUND] That's exactly right.
So the question is: are we training Theta to predict the parameter of the exponential family distribution whose mean is the prediction that we're gonna make for y? That's correct. So this is what we do at test time. And during training time, how do we train this model? In this model, the parameters that we are learning by doing gradient descent are these parameters, Theta. You're not learning any of the parameters in the exponential family. We're not learning Mu or Sigma squared or Eta. We're learning Theta, which is part of the model and not part of the distribution. And the output of the model will become the distribution's parameter. It's unfortunate that we use the word parameter for both this and that, but it's important to understand what is being learned during the training phase and what's not. So this parameter, Eta, is the output of a function. It's not a variable that we do gradient descent on. During learning, what we do is maximum likelihood: maximize, with respect to Theta, p of y_i given x_i, parameterized by Theta. So you're doing gradient ascent on the log probability of y, where the natural parameter was reparameterized with the linear model, and we are doing gradient ascent by taking gradients with respect to Theta. This is the big picture of what's happening with GLMs, and how they are an extension of exponential families. You reparameterize the natural parameter with the linear model, and you get a GLM [NOISE].
So let's look in some more detail at what happens at train time [NOISE]. Another kind of incidental benefit of using GLMs is at train time. We saw that you wanna do maximum likelihood on the log probability with respect to Theta, right?
Now, at first it may appear that we need to do some more algebra: figure out what the expression for p is as a function of Theta transpose x, take the derivatives, come up with a gradient update rule, and so on. But it turns out that no matter what kind of GLM you're doing, no matter which choice of distribution you make, the learning update rule is the same [NOISE]. The learning update rule is Theta j equals Theta j plus Alpha times (y_i minus h Theta of x_i) times x_ij. You guys have seen this so many times by now. So you can straight away apply this learning rule without ever having to do any more algebra to figure out what the gradients are or what the loss is. You can go straight to the update rule and do your learning. You plug in the appropriate h Theta of x, depending on the choice of distribution that you make, and you can start learning. Initialize your Theta to some random values and you can start learning. Any question on this? Yeah?
[Student] [inaudible]
If you wanna do it for batch gradient descent, then you just sum over all your examples.
[Student] [inaudible]
Yeah. So Newton's method is probably the most common method you would use with GLMs, and that again comes with the assumption that the dimensionality of your data is not extremely high. As long as the number of features is less than a few thousand, you can do Newton's method. Any other questions? Good. So this is the same update rule for any specific type of GLM, whatever the choice of distribution. Whether you're doing classification, whether you're doing regression, whether you're doing a Poisson regression, the update rule is the same.
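To make the "same update rule, different h Theta of x" point concrete, here is a minimal stochastic-gradient sketch in plain Python. Everything named here (`glm_sgd`, the response functions, the toy datasets, the learning rate, the epoch count, and starting Theta at zeros rather than random values) is an assumption for illustration; the lecture only specifies the update rule itself.

```python
import math

def dot(theta, x):
    """Inner product theta^T x for two equal-length lists."""
    return sum(t * xj for t, xj in zip(theta, x))

# Response functions g(eta) = E[y; eta], one per choice of distribution.
def identity(eta):       # Gaussian -> linear regression
    return eta

def logistic(eta):       # Bernoulli -> logistic regression
    return 1.0 / (1.0 + math.exp(-eta))

def exponential(eta):    # Poisson -> Poisson regression (not exercised below)
    return math.exp(eta)

def glm_sgd(data, response, alpha=0.05, epochs=1000):
    """Fit theta with theta_j += alpha * (y - g(theta^T x)) * x_j, one example at a time."""
    theta = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            error = y - response(dot(theta, x))
            theta = [t + alpha * error * xj for t, xj in zip(theta, x)]
    return theta

# Toy data; the first feature is a constant 1 acting as the intercept.
regression_data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0),
                   ([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)]   # y = 1 + 2x
classification_data = [([1.0, -2.0], 0.0), ([1.0, -1.0], 0.0),
                       ([1.0, 1.0], 1.0), ([1.0, 2.0], 1.0)]

theta_lin = glm_sgd(regression_data, identity)
theta_log = glm_sgd(classification_data, logistic)

print(theta_lin)                                  # close to [1, 2]
print(logistic(dot(theta_log, [1.0, 2.0])))       # above 0.5
print(logistic(dot(theta_log, [1.0, -2.0])))      # below 0.5
```

The training loop never changes; only the `response` argument does. For the batch version mentioned in the answer above, the per-example corrections would be summed over all examples before Theta is updated.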
You just plug in a different h Theta of x and you get your learning rule. Some more terminology. So Eta is what we call the natural parameter. And the function that links the natural parameter to the mean of the distribution has a name: it's called the canonical response function, g. Let's call the mean of the distribution Mu, so Mu is g of Eta. Similarly, you can go from Mu back to Eta with the inverse of this, g inverse, and that is called the canonical link function. So that's some terminology. We also already saw that g of Eta is equal to the gradient of the log partition function with respect to Eta. So, as a side note, g is equal to the derivative of the log partition function with respect to Eta, right. And it's also helpful to make explicit the distinction between the three different kinds of parameterizations we have. So we have three parameterizations. We have the model parameters, that's Theta; the natural parameters, that's Eta; and we have the canonical parameters. And these are Phi for Bernoulli, Mu and Sigma squared for Gaussian, Lambda for Poisson. Right. So these are three different ways we can parameterize either the exponential family or the GLM. And whenever we are learning a GLM, it is only this thing that we learn. Right. That is the Theta in the linear model. This is the Theta that is learned. Right. And the connection between the model parameters and the natural parameter is linear. So Theta transpose x will give you the natural parameter. And this is the design choice that we're making. Right. We choose to reparameterize Eta by a linear model, linear in your data. And between the natural parameter and the canonical parameters, you have g to go this way and g inverse to come back this way, where g is also the derivative of the log partition function. So yeah. So it's important to realize this.
It can get pretty confusing when you're seeing this for the first time, because you have so many parameters that are being swapped around and getting reparameterized. There are three different ways in which we are parameterizing generalized linear models. There are the model parameters, the ones that we learn; the output of the linear model is the natural parameter for the exponential family; and you can do some algebraic manipulations to get the canonical parameters for the distribution that we are choosing, depending on whether the task is classification or regression. Any questions on this? So now you can see what happens when you are doing logistic regression, right? So h Theta of x is the expected value of y conditioned on x, parameterized by Theta, and this is equal to Phi, right? Because here the choice of distribution is a Bernoulli, and the mean of a Bernoulli distribution is just Phi in the canonical parameter space. And if we write that in terms of Eta, it is 1 over 1 plus e to the minus Eta, and this is equal to 1 over 1 plus e to the minus Theta transpose x, right? So the logistic function: when we introduced logistic regression, we just pulled the logistic function out of thin air and said, hey, this is something that can squash minus infinity to infinity into the range between 0 and 1, seems like a good choice. But now we see that it is a natural outcome. It just pops out from this more elegant generalized linear model: if you choose Bernoulli to be the distribution of your output, then logistic regression just pops out naturally. So any questions? Yeah. Could you maybe speak a little bit more about choosing a distribution to be the output? Yeah.
So the choice of what distribution you are going to use is really dependent on the task that you have. So if your task is regression, where you want to output real-valued numbers like the price of a house or something, then you choose a distribution over the real numbers, like a Gaussian. If your task is classification, where your output is binary, 0 or 1, you choose a distribution that models binary data. Right? So the task in a way influences you to pick the distribution. And most of the time that choice is pretty obvious. If you want to model the number of visitors to a website, which is a count, you want to use a Poisson distribution, because the Poisson distribution is a distribution over the nonnegative integers. So the task pretty much tells you what distribution you want to choose, and then you go through this machinery of figuring out what h Theta of x is, and you plug in h Theta of x over there and you have your learning rule. Any more questions? So we made some assumptions, these assumptions. Now it's also helpful to get a visualization of what these assumptions actually mean, right? And to expand upon your point: if you think of the question, "Are GLMs used for classification, or are they used for regression, or are they used for something else?", the answer really depends on the choice of distribution that you're going to make. GLMs are just a general way to model data, and that data could be binary, it could be real valued. And as long as you have a distribution that can model that kind of data and falls in the exponential family, it can just be plugged into a GLM and everything works out nicely. Right.
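For reference, the chain of parameterizations and the response functions for the three example distributions mentioned above can be summarized as follows. This is a summary sketch rather than a derivation; the symbol a(\eta) for the log partition function and the unit variance for the Gaussian are notational assumptions made here.

```latex
% Requires amsmath. a(\eta) denotes the log partition function (notation assumed here).
\begin{align*}
\theta \;&\xrightarrow{\;\eta \,=\, \theta^{T} x\;}\; \eta
  \;\xrightarrow{\;g\;}\; \text{canonical parameter},
  \qquad g(\eta) = \mathrm{E}[y;\eta] = \frac{\partial a(\eta)}{\partial \eta} \\
\text{Gaussian } (\sigma^{2} = 1):\quad \mu &= \eta \\
\text{Bernoulli}:\quad \phi &= \frac{1}{1 + e^{-\eta}} \\
\text{Poisson}:\quad \lambda &= e^{\eta}
\end{align*}
```

Only \theta is learned; \eta and the canonical parameter are outputs of functions of \theta and x.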
So, the assumptions that we made. Let's start with regression, right? So for regression, we assume there is some x. To simplify, I'm drawing x as one-dimensional, but x could be multi-dimensional. And there exists a Theta, right? And Theta transpose x would be some line, or a hyperplane in higher dimensions. And this, we assume, is Eta, right? And in the case of regression, Eta was also Mu. Right? And then we are assuming that y, for any given x, is distributed as a Gaussian with Mu as the mean. Which means, for every possible x, you have the appropriate Eta. And with this as the mean, let's think of this axis as y. So there is a Gaussian distribution at every possible x, and we assume a variance of 1. So this is a Gaussian with standard deviation, or variance, equal to 1, right? So for every possible x, there is a distribution of y given x, which is parameterized by Theta transpose x as the mean, right? And you assume that your data is generated from this process, right? So what does it mean? It means you're given x, and let's say this is y. So you would have examples in your training set that may look like this, right? The assumption here is that, let's say for this particular value of x, there was a Gaussian distribution centered at the mean over here. And from this Gaussian distribution this value was sampled, right? You're just sampling it from the distribution. This is how your data is generated. Again, this is our assumption, right? Now, based on these assumptions, what we are doing with the GLM is we start with the data. We don't know anything else. We make an assumption that there is some linear model from which the data was generated in this way.
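A minimal sketch of this generative assumption in one dimension: each y is drawn from a Gaussian with mean Theta times x and variance 1, and Theta is then recovered from the samples alone. The value true_theta = 2.0, the grid of inputs, and the random seed are hypothetical choices for this illustration. For a Gaussian with fixed variance, maximizing the likelihood is equivalent to least squares, which in this one-dimensional, no-intercept case has the closed form used below.

```python
import random

random.seed(0)                          # fixed seed so the run is repeatable

true_theta = 2.0                        # hypothetical "true" slope
xs = [i / 10 for i in range(1, 201)]    # 1-D inputs 0.1, 0.2, ..., 20.0

# Data-generating assumption: y | x ~ Gaussian(mean = theta * x, variance = 1).
ys = [random.gauss(true_theta * x, 1.0) for x in xs]

# Working backwards: the maximum likelihood estimate of theta under this
# model is the least-squares solution sum(x * y) / sum(x * x).
theta_hat = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
```

With this many samples theta_hat lands close to true_theta, which is the sense in which the fitted line is the one from which the y's were most likely to have been sampled.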
And we want to work backwards, right, to find the Theta that will give us this line. For different choices of Theta we get a different line, right? We assume that the line represents the Mu's, the means of the y's for each particular x, from which the y's are sampled, and we are trying to find the line, which will be your Theta transpose x, from which these y's are most likely to have been sampled. That's essentially what's happening when you do maximum likelihood with the GLM, right? Similarly for classification, again let's assume there's an x, right? And there is some Theta transpose x, right? And this Theta transpose x is Eta. We assign this to be Eta, right? And from this Eta, we run it through the sigmoid function, 1 over 1 plus e to the minus Eta, to get Phi, right? So if these are the Etas, we run each Eta through the sigmoid and we get something like this, right? So this tends to 1. This tends to 0. And at this point, when Eta is 0, the sigmoid is 0.5. This is 0.5, right? And now, at any given choice of x, we have a probability distribution. In this case, it's binary. So let's say the probability that y is 1 is the height of the sigmoid curve, and here it is low. Right. At every x we have a different Bernoulli distribution, essentially, where the probability of y being 1 is the height of the sigmoid applied to the natural parameter. And from this, you have a data-generating distribution that would look like this. So you have a few x's in your training set. And for those x's, you figure out what your y distribution is and sample from it. So let's say, right. And now, again, our goal is to start from the data. So over here this is x and this is y.
So these are points for which y is 0. These are points for which y is 1. And so given this data, we want to work backwards to find out what Theta was. What's the Theta that would have resulted in a sigmoid-like curve from which these y's were most likely to have been sampled? And figuring out that Theta is essentially doing logistic regression. Any questions? All right. So in the last 10 minutes or so, we will go over softmax regression. In the lecture notes, softmax regression is explained as yet another member of the GLM family. However, in today's lecture we'll be taking a non-GLM approach and see how softmax is essentially doing what's also called cross entropy minimization. We'll end up with the same formulas and equations. You can go through the GLM interpretation in the notes. It's a little messy to do it on the whiteboard, whereas this has a nicer interpretation. And it's good to get this cross entropy interpretation as well. So here we are talking about multiclass classification. Let's assume we have three classes of data. Let's call them circles, squares, and say triangles. Now here, this is x1 and x2. We're just visualizing your input space, and the output space, y, is implicit in the shape of the point. So in multiclass classification, our goal is to start from this data and learn a model that can, given a new data point, make a prediction of whether this point is a circle, a square or a triangle, right? We're just looking at three because it's easy to visualize, but this can work over thousands of classes. And so what we have is: you have x_i's in R^n. All right. And the label y is in {0, 1}^k.
So k is the number of classes, right? So the label y is what you would call a one-hot vector. It's a vector which indicates which class the x corresponds to. Each element in the vector corresponds to one of the classes. So this may correspond to the triangle class, circle class, square class, or maybe something else. So the labels are in this one-hot form, where we have a vector that's filled with 0s except for a 1 in one of the places, right? And the way we're going to think of softmax regression is that each class has its own set of parameters. So we have Theta class, right, in R^n. And there are k such things, where class is in triangle, circle, square, etc., right? So in logistic regression, we had just one Theta, which would do a binary yes versus no. In softmax, we have one such vector Theta per class, right? So you could also optionally represent them as a matrix. There's an n by k matrix where each column is one Theta class, right? So softmax regression is a generalization of logistic regression where you have a set of parameters per class, right? And we're going to do something similar. So corresponding to each class of parameters, there exists a line. There's this line which represents, say, Theta triangle transpose x equals 0, and anything to the left will be Theta triangle transpose x greater than 0, and over here it'll be less than 0, right? So for the triangle class, there is this line which corresponds to Theta triangle transpose x equals 0. Anything to the left will give you a value greater than zero, anything to the right a value less than zero. Similarly, there is also another line.
So this corresponds to Theta square transpose x equals 0. Anything below will be greater than 0, anything above will be less than 0. Similarly you have another one: this corresponds to Theta circle transpose x equals 0. And in this half plane it is greater than 0, and to the left it is less than 0, right? So we have a different set of parameters per class which hopefully satisfies this property, and now our goal is to take these parameters and see what happens when we feed in a new example. So given an example x, over here we have classes, right? So we have the circle class, the triangle class, the square class, right? And over here, we plot Theta class transpose x. So we may get something that looks like this. Let's say for a new point x over here, if that's our new x, we would have Theta square transpose x be positive. All right. And for the others, we may have something negative, and maybe something like this for this one, right? So this space is also called the logit space, right? These are real numbers, right? This is not a value between 0 and 1, this is between minus infinity and plus infinity, right? And our goal is to get a probability distribution over the classes. In order to do that, we perform a few steps. So we exponentiate the logits, which gives us exp of Theta class transpose x, and this will make everything positive, so the negative ones become small positive values. Squares, triangles and circles, right? Now we've got a set of positive numbers. And next, we normalize this. By normalize, I mean divide everything by the sum of all of them. So here we have e to the Theta class transpose x over the sum, over i in triangle, square, circle, of e to the Theta i transpose x.
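The pipeline just described (logits, then exponentiate, then normalize) can be sketched in a few lines. The per-class parameter values and the input x below are hypothetical numbers chosen for illustration. Subtracting the maximum logit before exponentiating is a common numerical-stability step that does not change the result; it is an addition here and was not part of the lecture.

```python
import math

def dot(a, b):
    """Inner product theta^T x for two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def softmax(logits):
    """Exponentiate the logits, then divide by the sum of the exponentials."""
    m = max(logits)                           # stability shift (assumption, see above)
    exps = [math.exp(z - m) for z in logits]  # every entry is now positive
    total = sum(exps)
    return [e / total for e in exps]          # entries sum to 1

# Hypothetical parameters: one theta vector per class, each in R^2.
thetas = {
    "triangle": [-1.0, 0.5],
    "square":   [1.0, 1.0],
    "circle":   [0.5, -1.0],
}
x = [1.0, 1.0]                                # hypothetical new example

logits = {c: dot(t, x) for c, t in thetas.items()}          # real-valued scores
probs = dict(zip(logits, softmax(list(logits.values()))))   # distribution over classes
predicted = max(probs, key=probs.get)
```

Here the square logit is the only positive one, so the square class receives the largest probability.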
So once we do this operation, we get a probability distribution where the heights add up to 1, right? So given a new point x, if we run it through this pipeline, we get a probability output over the classes, for which class that example is most likely to belong to, right? And this whole process, let's call the result p hat of y for the given x, right? So this is like our hypothesis. The hypothesis function will output this probability distribution. In the other cases, the hypothesis function generally output a scalar or a probability. In this case, it's outputting a probability distribution over all the classes. And now, the true y would look something like this, right? Let's say the point over there was a triangle, for whatever reason, right? If that was a triangle, then p of y, which is also called the label, you can think of as a probability distribution which is 1 on the correct class and 0 elsewhere, right? So p of y. This is essentially representing the one-hot representation as a probability distribution, right? Now the goal, or the learning approach that we're going to take, is in a way to minimize the distance between these two distributions, right? This is one distribution, this is another distribution. We want to change this distribution to look like that distribution, right? And technically, the term for that is minimizing the cross entropy between the two distributions. So the cross entropy between p and p hat is equal to minus the sum, over y in circle, triangle, square, of p of y times log p hat of y. I don't think we'll have time to go over the interpretation of cross entropy, but you can look that up. So here we see that p of y will be one for just one of the classes and zero for the others. So let's say in this example, y was a triangle.
So this will essentially boil down to minus log p hat of triangle, right? And we saw that the hypothesis is essentially that expression. So that's equal to minus log of e to the Theta triangle transpose x over the sum, over class in triangle, square, circle, of e to the Theta class transpose x. Right. And you treat this as the loss and do gradient descent. Gradient descent with respect to the parameters. Right, yeah. With that I think, any questions on softmax? Okay. So we'll break for today in that case. Thanks.
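A minimal sketch of the cross entropy loss and one gradient descent step for a single example whose true class is triangle. The gradient used below, (p hat of class minus p of class) times x for each class's Theta, is the standard result for softmax with cross entropy; it was not derived in the lecture and is stated here as an assumption of the sketch, as are the example values, the zero initialization, and the learning rate.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def softmax(logits):
    m = max(logits)                           # stability shift, not from the lecture
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, p_hat):
    """-sum_y p(y) log p_hat(y); terms with p(y) = 0 contribute nothing."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, p_hat) if pi > 0)

def predict(thetas, x):
    return softmax([dot(t, x) for t in thetas])

classes = ["circle", "triangle", "square"]
thetas = [[0.0, 0.0] for _ in classes]        # hypothetical zero initialization
x = [1.0, 2.0]                                # hypothetical training example
p = [0.0, 1.0, 0.0]                           # one-hot label: triangle

p_hat = predict(thetas, x)
loss_before = cross_entropy(p, p_hat)         # equals -log p_hat(triangle)

# One gradient descent step on each class's parameters.
alpha = 0.1
thetas = [
    [tj - alpha * (ph - pc) * xj for tj, xj in zip(t, x)]
    for t, ph, pc in zip(thetas, p_hat, p)
]
loss_after = cross_entropy(p, predict(thetas, x))
```

With all parameters at zero the prediction is uniform over the three classes, so the initial loss is log 3; after the step the probability assigned to triangle increases and the loss drops.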

Info

Channel: stanfordonline

Views: 82,705

Rating: 4.301446 out of 5

Id: iZTeva0WSTQ

Length: 82min 1sec (4921 seconds)

Published: Fri Apr 17 2020