CS231n Lecture 3 - Linear Classification 2, Optimization

Captions
Before we dive into today's material on loss functions and optimization, I wanted to go over some administrative things first. As a reminder, the first assignment is due next Wednesday, so you have roughly nine days left. And just as a warning, Monday is a holiday, so there will be no class and no office hours; plan your time accordingly to make sure you can complete the assignment in time. Of course, you also have some late days that you can use and allocate among your assignments as you see fit.

Okay, so diving into the material. First I'd like to remind you where we are. Last time we looked at the problem of visual recognition, specifically image classification, and we talked about the fact that this is actually a very difficult problem: if you just consider the cross product of all the possible variations we have to be robust to when we recognize any of these categories, such as cat, it seems like an intractable, impossible problem. And yet not only do we know how to solve this problem now, we can solve it for thousands of categories; the state-of-the-art methods work almost at human accuracy, or even slightly surpass it for some classes; and it runs nearly in real time on your phone. Basically all of this happened in the last three years, and you'll be experts in all of this technology by the end of the class, so it's really cool and exciting.

So that's the problem of image classification and visual recognition. We talked specifically about the data-driven approach and the fact that we can't just explicitly hard-code these classifiers, so we have to train them from data. We looked at the idea of splitting the training data off from validation splits, where we test out our hyperparameters, and a test set that you don't touch very much. We looked specifically at the nearest neighbor classifier and the k-nearest neighbor classifier, and I talked about the CIFAR-10 dataset, which is the toy dataset we play with during this class.

Then I introduced the idea of what I termed the parametric approach, which is that we write a function f from the image directly to the raw class scores: ten scores if you have ten classes. The parametric form we assumed was linear for now, so we just have f = Wx. We talked about the interpretations of this linear classifier: you can interpret it as matching templates, or you can interpret the images as points in a very high-dimensional space, with the linear classifier going in and coloring that space by class score, so to speak.

By the end of the class we got to this picture: suppose we have a training dataset of just three images, here along the columns, and some classes, say the ten classes of CIFAR-10. The function f assigns scores to every one of these images for some particular setting of weights, which I've chosen randomly here, and some of these results are good and some are bad. If you inspect the scores for the first image, you can see that the correct class, cat, got a score of 2.9, which is somewhere in the middle: some classes received a higher score, which is not very good, and some classes received a much lower score, which is good for that particular image. The car image was classified very well, because the car class score was much higher than all of the other scores.
The frog was not classified well at all. So we have this notion that different weights work better or worse on different images, and of course we're trying to find weights that give scores consistent with all the ground truth labels in the data. So far we've only eyeballed what I just described, saying this is good or that's not so good, but we have to actually quantify this notion: we have to be able to say that this particular set of weights W is, say, 12-bad, or 1.5-bad. Once we have this loss function, we're going to minimize it: we're going to find the W that gives us the lowest loss. So today we'll look at how to define a loss function that measures this unhappiness; we'll look at two specific cases, an SVM cost and a softmax cost (also called a cross-entropy cost); and then we'll look at the process of optimization: how you start from random weights and efficiently find a very good setting of weights.

I'm going to downsize this example so we have something nice to work with. Suppose we only had three classes instead of ten, and we have these three images, and these are our scores for some setting of W. We're now going to try to write down exactly our unhappiness with this result.

The first loss we'll look at is termed the multiclass SVM loss. This is a generalization of the binary support vector machine that you may have seen already in other classes; I think CS229 covers it as well. The setup is that we have the score function, so s is a vector of class scores, and the loss for the i-th example is

L_i = sum over j ≠ y_i of max(0, s_j − s_{y_i} + 1).

Let me interpret this loss for you, and then we'll see through a specific example why the expression makes sense. Effectively, for every example, the SVM loss sums across all the incorrect classes j, comparing the score s_j that each incorrect class received to the score s_{y_i} of the correct class, y_i being the correct label, plus one, with the whole thing clamped at zero by the max. So we're comparing score differences, and this particular loss says that not only do I want the correct score to be higher than the incorrect scores, but there's a safety margin on top, and we're using a safety margin of exactly 1.

We'll touch on why 1 makes sense to use, as opposed to some hyperparameter we'd have to choose; you can look in the notes for a much more rigorous derivation of exactly why that 1 doesn't matter. Intuitively, the scores are kind of scale-free: I can make my W larger or smaller and get larger or smaller scores, so the overall size of the scores is a free parameter tied to how large your weights are in magnitude. The scores are kind of arbitrary, so using 1 is, to some extent, just an arbitrary choice. Okay, so let's see specifically how this expression works with a concrete example.
Here I'm going to evaluate the loss for the first example by plugging in the scores. We compare the score we got for car, which is 5.1, minus the 3.2 of the correct class, add the safety margin of 1, and take the max with 0. What the max is really doing is clamping values at zero: if we get a negative result, we just clamp it. For the second incorrect class, frog, we take its score of −1.7, subtract 3.2, add the safety margin, and get −3.9, which clamps to zero. When you work this through, you get a loss of 2.9.

Intuitively, the way this worked out is: the cat score is 3.2, so according to the SVM loss, what we'd ideally like is for the scores of all the incorrect classes to be at most 2.2. But the car class actually had a much higher score, 5.1, and the difference between what we would have liked, 2.2, and what actually happened, 5.1, is exactly this 2.9: that's how bad this outcome was. In the frog case, the frog score was quite a bit below 2.2, so when you compare the scores you get a negative number, the max clamps it at zero, and that part contributes zero loss. So you end up with a loss of 2.9 for this first image.

For the second image, we do the same thing and plug in the numbers. Comparing the cat score to the car score gives 1.3 minus 4.9 plus the safety margin, and the same for the other class, and when you plug it all in you end up with a loss of 0. Intuitively, the loss is zero because the car score is higher than all the other scores for that image by at least 1, so the constraint was satisfied. In the frog case we end up with a very bad loss, because the frog class received a very low score while the other classes received quite high scores, and this adds up to an unhappiness of 10.9.

Now, if we want to combine all of this into a single loss function, we do the relatively intuitive thing: take the average of the losses over the entire training set, and we'd say the loss at the end is 4.6. So this particular setting of W on this training data gives us some scores, we plug them into the loss function, and we get an unhappiness of 4.6 with this result.
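To make the arithmetic concrete, here's a minimal numpy sketch of the first image's loss. This is my illustration of the computation just described, not code from the lecture:

```python
import numpy as np

delta = 1.0                          # the safety margin
s = np.array([3.2, 5.1, -1.7])       # (cat, car, frog) scores; cat is the correct class
loss = max(0, s[1] - s[0] + delta) \
     + max(0, s[2] - s[0] + delta)   # sum over the incorrect classes only
print(loss)                          # max(0, 2.9) + max(0, -3.9) = 2.9
```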
Okay, so now I'm going to ask you a series of questions to test your understanding of how this works; I'll get to your questions in a bit, but let me pose my own first. First: what if that sum were over all the classes, not just the incorrect ones, so we allowed j to equal y_i? Why am I adding that small constraint to the sum? Right: if we allow j = y_i, then s_{y_i} cancels s_{y_i} and that term comes out to max(0, 1) = 1, so all you'd be doing is inflating the loss by a constant of one. That's the only reason the constraint is there.

Second: what if we used a mean instead of a sum over these score constraints, just as we use a mean to average the losses over the examples? Good: you're saying that with many classes this would dilute the loss, and you're right that the absolute value of the loss would be lower by a constant factor. If we averaged here, we'd be dividing by the number of classes, which is constant (three in this example), so that amounts to putting a factor of one third in front of the loss. But in the end, what we're always interested in is minimizing the loss over W, and shifting the loss by a constant or scaling it by a constant doesn't change the solutions: you still end up at the same optimal W. So these choices are basically free parameters and don't matter; for convenience, I use j ≠ y_i and don't take the mean, although it would be the same thing, and the same goes for whether we average or sum across the examples.

Next question: what if we instead used a very similar-looking formulation, but with an additional square at the end, so we take the difference between scores plus the margin and then square it? Do we obtain the same or a different loss, in the sense that optimizing it would find the same best W or not? You would in fact get a different loss. It's not as obvious to see, but one way to see it is that we're no longer just scaling the loss up or down by a constant or shifting it: we're changing the trade-offs nonlinearly, in terms of how the support vector machine trades off the score margins across different examples. The point I wanted to illustrate is that not all changes to this loss are a no-op. The second formulation is in fact called the squared hinge loss, as opposed to the first, which we call the hinge loss, and which one you use is kind of a hyperparameter. Most often you see the first formulation, and that's what we use most of the time, but you sometimes see datasets where the squared hinge loss ends up working better, so it's something you can play with.

Let's also think about the scale of this loss: what are the minimum and maximum possible losses you can achieve with the multiclass SVM on your dataset? The smallest value is 0, good. And the highest value is infinite: the score assigned to the correct class could be arbitrarily terrible, so the loss can go to infinity.
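To make the squared-hinge point above concrete, here's a tiny hypothetical comparison: two made-up margin-violation profiles that the plain hinge loss scores identically but the squared hinge ranks differently, which shows the change is genuinely nonlinear and not a rescaling:

```python
import numpy as np

# Hypothetical clamped margin violations, i.e. values of max(0, s_j - s_{y_i} + 1)
a = np.array([2.0, 0.0])    # one big violation
b = np.array([1.0, 1.0])    # two small violations

print(a.sum(), b.sum())               # hinge:         2.0 vs 2.0 -> indifferent
print((a**2).sum(), (b**2).sum())     # squared hinge: 4.0 vs 2.0 -> strongly prefers b
```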
And one more question, which becomes important when we start doing optimization. When we optimize these loss functions, we usually start with an initialization of W with very small weights, so at the very beginning of optimization all the scores are small numbers near zero. What is the loss when all the scores are near zero? That's right: the number of classes minus one, since each incorrect class contributes max(0, 0 − 0 + 1) = 1. So in this particular case, with three classes and averaging this way, we'd get a loss of 2. Where this matters is sanity checks: when you start optimization with very small weights and print your first loss, you want to understand the functional form well enough to think through whether the number makes sense. If I see 2 in this case, I'm happier that the loss may be implemented correctly; I'm not a hundred percent sure, but at least nothing is obviously wrong right away. So it's interesting to think through these questions.

I'll go into this loss a tiny bit more, but is there a question on this slide right now? (A student asks whether it wouldn't be more efficient to drop the ugly j ≠ y_i constraint, since it makes vectorized implementations of this loss more awkward.) That actually anticipates my next slide to some degree, so let me just go to it. Here is some numpy code for how I would write this loss function in vectorized form; a sketch follows below. We're evaluating L_i for a single example: x is a single column vector, y is an integer specifying the label, and W is our weight matrix. We evaluate the scores, which is just W times x; then we compute the margins, which are the differences between the scores we obtained and the correct-class score, plus one, clamped at zero, so these are numbers between 0 and whatever. And then there's this additional line, margins[y] = 0. Why is it there? Exactly: in this efficient vectorized implementation, which goes to your point, I computed a margin for the correct class too, and I'm certain that margins[y] is currently exactly 1, so I erase it there rather than inflate the loss. (Yes, I suppose you could instead subtract 1 at the end; we can optimize this if we want, but we're not going to think about it too much. If you do it in your assignment, that's very welcome for extra bonus points.) Then we sum up the margins and get the loss.

By the way, if you write this formulation down for just two classes, you'll see that it reduces to the binary support vector machine loss.
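Here is a sketch reconstructing the vectorized per-example loss as described above; the function name and exact layout are my guesses at the slide's code, and W is assumed to have one row per class:

```python
import numpy as np

def L_i_vectorized(x, y, W):
    """SVM loss for one example, vectorized over classes.
    x: column of input data, y: integer label,
    W: weight matrix of shape (num_classes, num_features)."""
    delta = 1.0
    scores = W.dot(x)                                    # vector of class scores
    margins = np.maximum(0, scores - scores[y] + delta)  # clamp at zero
    margins[y] = 0    # the correct class would contribute exactly delta; erase it
    return np.sum(margins)
```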
Okay, going back to the slide: at this point what we have is the linear mapping that gives scores, and this loss function, which you've now seen in its full form: differences between scores plus one, summed over the incorrect classes, and averaged across all the examples. That's the loss function right now.

Now I'd like to convince you that there's actually a bug with this loss function. In other words, if I used this loss by itself on some dataset in practice, I might get some not very nice properties, and it's not completely obvious what the issue is, so I'll give you a hint. Suppose we found a W that achieves zero loss on some dataset. The question is: is this W unique? Or phrased another way, can you give me a W that is different but also definitely achieves zero loss? That's right: we can scale it by some constant alpha, and in particular alpha must obey the constraint of being greater than one. If I make my weights larger and larger, all I'm doing is making the score differences larger and larger as I scale up W, because of the linear form, so the margins stay satisfied. That's not a very desirable property, because we have this entire subspace of W that is optimal, and according to this loss function all of it is completely the same; intuitively, that's not a nice property to have.

Just to see this numerically and convince yourself it's the case, take the example where we previously achieved zero loss, and suppose I multiply my W by two. It's very simple math: I'd be inflating all my scores by a factor of two, so their differences also become larger. If all the score differences inside the max(0, ...) were already negative, they just become more and more negative, so they stay clamped at zero all the time. (And yes, on the scale question: alpha has to be greater than one, because of that added margin of 1, which complicates things. Another question: does the scaling apply to the bias part? Here I'm ignoring the bias for simplicity; the scores are really Wx + b, but forget the bias, we're just scaling W by itself.)

Okay, cool. So the way to fix this: we have this entire subspace of Ws that all work the same according to this loss function, and we'd like to express a preference for some Ws over others, based purely on what we intrinsically desire W to look like, forgetting what the data is. This introduces the notion of regularization, which we append to our loss function: an additional term, lambda times a regularization function R(W), which measures the niceness of your W. So we don't only want to fit the data, we also want W to be nice, and we'll see some ways of framing that and why they make sense. Intuitively, regularization is a way of trading off your training loss against your generalization loss on a test set: it's a set of techniques where we add objectives to the loss that fight with the data term. The data term just wants to fit your training data, and the regularization term wants W to look some particular way, so they sometimes fight
each other in your objective, because we want to achieve both simultaneously. But it turns out that adding these regularization techniques, even though it can make your training error worse, so we're not correctly classifying all the training examples, typically makes the test set performance better, and we'll see some intuition for why on the next slide. For now I just want to point out that the most common form of regularization is what we call L2 regularization, or weight decay. Since W here is a 2D matrix, we have two sums, over k and l, the rows and columns, but really we're just putting all the element-wise squared weights into the loss:

R(W) = sum over k, l of W_{k,l}^2.

This particular regularization likes Ws to be zero: when W is all zeros, the regularization is happiest. Of course W can't be all zeros, because then you can't classify, so these two terms fight each other. There are different forms of regularization with different pros and cons, and we'll go into some of them much later in the class; for now, L2 regularization is the most common form, and it's what you'll use quite often in this class as well.

Now I'd like to convince you that this is a reasonable thing to want from a W, that its weights are small. Consider this very simple cooked-up example to get the intuition. Suppose we're doing classification in four-dimensional space, and the input vector x is all ones. Now suppose we have two candidate weight vectors: one is [1, 0, 0, 0], and the other is [0.25, 0.25, 0.25, 0.25]. Since the score function is linear, their effects are the same: the dot product with x is identical for both, so the scores come out the same. But the regularization would strictly favor one of these over the other. Even though their effects are the same, which one is better in terms of the regularization cost? The second one, right. So the regularization tells us that even though they achieve the same effect in terms of the data loss, we significantly prefer the second one.

What's better about the second one? Why is that a good idea? (Sorry, can you repeat that?) That's correct, and it's the interpretation I like most as well: it takes into account the largest number of things in your x vector. What this regularization wants to do is spread out your weights as much as possible, so that you're taking into account all the input features, or all the input pixels; it wants to use as many of those dimensions as it can while achieving the same effect. Intuitively, that's better than focusing on just one input dimension; it's something that often works better in practice, basically because of the way datasets are usually arranged and the statistical properties they usually have.

Okay, any questions about this? Is regularization a good idea? Everyone sold? Okay, great. So basically our losses will always have this form: a data loss plus a regularization loss. It's a very common thing to have in practice.
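Here's a minimal sketch of that cooked-up example; the vectors come from the slide as described above, and the check itself is mine:

```python
import numpy as np

x  = np.array([1.0, 1.0, 1.0, 1.0])          # input of all ones
w1 = np.array([1.0, 0.0, 0.0, 0.0])          # focuses on a single input dimension
w2 = np.array([0.25, 0.25, 0.25, 0.25])      # diffuse: uses every input dimension

print(w1.dot(x), w2.dot(x))          # 1.0 1.0  -> identical scores, identical data loss
print(np.sum(w1**2), np.sum(w2**2))  # 1.0 0.25 -> the L2 penalty strictly prefers w2
```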
I'm now going to go into the second classifier, the softmax classifier, and we'll see some differences between the support vector machine and the softmax classifier in practice. These are the two most commonly used linear classifiers. You'll often see the softmax classifier preferred, and I'm not exactly sure why, because they usually end up working about the same. I'd also like to mention that this is sometimes called multinomial logistic regression, so if you're familiar with logistic regression, this is just a generalization of it to multiple classes.

Was there a question over there? Yes, the question is: why do we want to use regularization at all? I don't think I sold it very well. If you have this entire subspace of Ws that all achieve the same effect, we'd like to pick between them in some way, and what I'm arguing is that wanting small weights is a reasonable way to pick among them. L2 regularization favors diffuse Ws, like in this case: w1 completely ignores inputs two, three, and four, but w2 uses all of the inputs, because the weights are more diffuse. Intuitively, that usually ends up working better at test time, because more evidence is being accumulated into your decisions, instead of one single feature.

That's right: w1 and w2 achieve the same effect, so the data loss doesn't care between the two, but the regularization expresses a preference, and since it's in the objective we're optimizing, we end up finding the W that accomplishes both: a W that not only classifies correctly, but that we also prefer to be as diffuse as possible.

Good, yes: in this particular example, L1 regularization would actually be indifferent between the two. L1 has some nice properties that I don't want to go into right now; we might cover it later. It has sparsity-inducing properties: if you put L1 into your objective, you'll find that lots of weights end up being exactly zero, for reasons we might go into later, and that acts almost like feature selection. So L1 is another alternative we might return to.

Good, another question: isn't it maybe a good thing that we're ignoring some features and just using one of them? Yes, there are many technical reasons why regularization is a good idea; I wanted to give you just the basic intuition, so maybe I failed at that, but I think that's a fair point, and I'm not sure I have a good immediate answer.
(A student suggests that a vector like [0.001, 0, 0, 0], a scaled-down version of w1, would make essentially the same decisions as w1 while having a much smaller regularization cost.) I see, okay, thanks. I didn't want to dive too deeply into this; there's actually an entire literature on regularization and test error, with theorems from learning theory, and you saw some of that in CS229. There are results there on why regularization is a good idea, but I don't want to go into that, and it's beyond the scope of this class. For this class: L2 regularization will make your test error better.

Okay, so I'm going to go into the softmax classifier now, which is this generalization of logistic regression. The way this works is that it's just a different functional form for how the loss is specified on top of the scores. In particular, the softmax classifier puts an interpretation on these scores: they're not just arbitrary numbers for which we want some margins to be met; we have a specific interpretation that is maybe more principled from a probabilistic point of view, where the scores are the unnormalized log probabilities assigned to the different classes, given the image.

We'll see exactly what this means in a bit, but if the scores are unnormalized log probabilities of the classes y given the image, then the way to get the probability of a class k is to exponentiate all the scores, which gives the unnormalized probabilities, and then normalize them: divide by the sum over all exponentiated scores. That gives us this expression for the probability of a class given the image:

P(Y = k | x) = e^{s_k} / sum over j of e^{s_j}.

This function is called the softmax function, if you see it somewhere: e raised to the element you're currently interested in, divided by the sum over all exponentiated scores.

Now, if we're in this probabilistic framework, where we've decided this is the probability of the different classes, then it makes sense to do what you usually do in this setting: one of the classes is correct, so we want to maximize the log likelihood of the true class, and since we're writing a loss function, we minimize the negative log likelihood of the true class:

L_i = −log( e^{s_{y_i}} / sum over j of e^{s_j} ).

So the loss function says: we want the log likelihood of the correct class to be high, so the negative of it should be low, and the log likelihood is the log of the softmax of the scores.

Let's look at a specific example to make this more clear; I've substituted that expression in, so the loss is the negative log of the softmax, and I think working through it will give you better intuition for exactly what it's computing. Suppose we have these scores that came from our linear classifier (or, later, a neural network); these are the unnormalized log probabilities. As I mentioned, we exponentiate them first, because under this interpretation that gives us the unnormalized probabilities, and probabilities always sum to one,
so we have to divide by the sum of all of these: we add up the exponentiated scores and divide, to actually get probabilities out. Under this interpretation, having carried out this set of transformations, the probability assigned to this image being a cat is 13%, car is 87%, and frog is very unlikely, about 0%. Now, normally in this setting you maximize the log probability, because it turns out that maximizing the raw probability is not as nice mathematically, so you normally see people maximizing log probabilities, which means we minimize the negative log probability. The correct class here is cat, which got only a 13% chance under this interpretation, so we take −log(0.13) and get 0.89. That's the final loss we'd achieve for this example under this interpretation: 0.89 for the softmax classifier.

Okay, so let's go over some questions now, to try to interpret exactly how this works. First: what are the minimum and maximum possible losses with this loss function? I'll give you a chance to think about the smallest value we can achieve. Zero, good, and how would that happen? If your correct class gets a probability of one, we plug 1 into the log and get −log(1) = 0. And the highest possible loss is infinite. So just as with the SVM, we get the same bounds: zero is the minimum and infinity is the maximum. The infinite loss would be achieved if you assign your correct cat class a vanishingly tiny probability: log of 0 is negative infinity, so the negative of it is infinite.

And there's also the same sanity-check question: when we initialize W with small weights, all the scores end up nearly zero. What do you expect to see as your first loss at the beginning of optimization? Yes, that's right: it's −log(1/number of classes). With all-zero scores, you get all ones after exponentiation, so each probability is one over the number of classes, and the negative log of that is your starting loss. Actually, whenever I run an optimization myself, I sometimes take note of my number of classes, evaluate −log(1/C), and make sure that when I start the optimization I'm getting roughly that; otherwise I know something may be slightly off. Moreover, as I optimize, I expect to go from that value toward zero, and if I ever see negative numbers, I know from the functional form that something very strange is going on, because you never expect negative numbers out of this loss. So that's the softmax loss.
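Here's a small sketch of that worked example in numpy. One caveat I should flag: with the natural log, −log(0.13) comes out around 2.04; the 0.89 quoted above matches a base-10 log, and the two differ only by the constant factor ln(10), so nothing conceptual changes:

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])   # unnormalized log probs for (cat, car, frog)
unnorm = np.exp(scores)               # unnormalized probabilities
probs  = unnorm / np.sum(unnorm)      # ~ [0.13, 0.87, 0.00], sums to one
print(probs)
print(-np.log(probs[0]))              # ~2.04 (natural log); -log10 gives ~0.89
```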
I'll show you one more slide and then take some questions, just to reiterate the difference between the two. We have the score function, f = Wx plus a bias b, which gives us our score vector, and the difference is only in how the two losses interpret the scores coming out of f. Either they're just scores, with no interpretation whatsoever, and we want the correct class's score to be some margin above the incorrect scores; or we interpret them as unnormalized log probabilities, in which case we first form the probabilities and then maximize the probability of the correct class, or rather its log, which gives us the softmax loss. So they start off the same way, but they arrive at different results; we'll go into exactly what the differences are in a bit. Are there questions on these?

(A question about computational cost.) Correct, they take about the same time to compute. Especially once you have a ConvNet, your classifier is nearly instantaneous to evaluate; most of the work is in doing all the convolutions. The softmax involves an exp and so on, so those operations are perhaps slightly more expensive, but it completely washes away compared to everything else you're worried about, which is all the convolutions over the image.

Good. Sorry, I didn't catch your question: why do we maximize log probabilities instead of probabilities directly? This is partly covered in CS229 when you do logistic regression. If you maximized the probability, it would be the exact same problem, because log is a monotonic function, so maximizing the probability and maximizing the log probability give identical results; but the math comes out much nicer-looking when you put the log there. It's the exact same optimization problem.

Okay, cool. Let's look at some interpretations of these two and exactly how they differ, softmax versus SVM, to give you an idea of one property that actually is quite different between them. They have these two different functional forms, and now suppose we have three examples and three classes, and these are the score vectors of the three examples, where the first class is always the correct one: the correct class scores 10 and the other two classes get the remaining scores. Think about what these losses tell you about how desirable each outcome is in terms of that W. One way to think about it: suppose I take the third data point, with scores [10, −100, −100], and I jiggle it, moving it around a bit in input space. What happens to the losses as I do that? Do they increase or decrease?

For the third data point, the SVM loss remains the same, correct. And why is that? It's because the margin was met by a huge amount, so there's an added robustness: when I take this data point and shake it around, the SVM is already very happy. We desire a margin of 1, and here the score gap is 110, so the constraint is met with 109 to spare. The SVM expresses no preference among outcomes where these incorrect scores come out very negative: it doesn't care whether they're −20 or −100 or −1000. But the softmax always sees an improvement: the softmax expresses a preference for these to be −200 or −500 or −1000; all of them would give you a strictly better loss. The SVM at this point doesn't
care. That's one clear distinction. The SVM has this added robustness: it wants the margin to be met, but beyond that it doesn't micromanage your scores, whereas the softmax always wants the scores to be better: everything here, nothing there. So that's one very clear difference between the two.

Was there a question? Good, yes, about the margin of 1. I mentioned very briefly that it's not a hyperparameter; you can fix it to be one. The reason is that the absolute values of these scores don't really matter: I can make my W larger or smaller and achieve different-sized scores. One turns out to work fine, and in the notes I have a longer derivation that goes into detail on exactly why 1 is safe to choose, so I'll refer you there rather than spend lecture time on it. Oh, and if you wanted to use zero, that would be trouble: you can use any positive number and that would give you an SVM, but with zero things would look different. One property the positive constant gives you, when you go through the mathematical analysis as for the SVM in CS229, is the max margin property: the SVM finds the best margin when you have that positive constant combined with L2 regularization on the weights, so you want very small weights that meet a specific margin. I won't go into that in this lecture, but basically you do want a positive number there, otherwise things break.

Good, another question: we're taking the exponential of these real-valued scores and interpreting them as probabilities; are we free to do that? Yes: you're getting these scores out, and it's up to you to endow them with an interpretation. We can have different losses: in this specific case I showed you one multiclass SVM, and in fact there are multiple versions of a multiclass SVM where you fiddle with the exact loss expression. One interpretation we can put on the scores is that they're unnormalized log probabilities. They can't already be normalized, because there's no constraint that the output of your function is normalized, and they can't be probabilities directly, because you're outputting real numbers that can be positive or negative. So we interpret them as log probabilities, which requires us to exponentiate them and then normalize explicitly. It's kind of a bad explanation of it, I admit. (Yes, that's right: there are also really cool connections to physics and how they think about loss functions there; energy and loss are kind of equivalent.)

Good, one more: once you go beyond a certain margin, the exponentiated differences you get in the softmax become vanishingly small, so wouldn't the softmax also barely change? So what you're saying, if I understand correctly, is that the softmax would already assign essentially zero probability to these classes and one to the correct class, and if I jiggle this point around, nothing changes. I think the difference is that the loss would still definitely change for the softmax: it wouldn't change a lot, but it would change, and the softmax would express its preference, whereas the SVM loss stays identically zero.
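To see this numerically, here's a sketch. One assumption to note: I use milder stand-in scores than the slide's [10, −100, −100], because at −100 the softmax's preference is smaller than float64 can even resolve; the qualitative point is the same:

```python
import numpy as np

def svm_loss(s, y, delta=1.0):
    m = np.maximum(0.0, s - s[y] + delta)
    m[y] = 0.0
    return m.sum()

def softmax_loss(s, y):
    p = np.exp(s - s.max())     # shift by the max score for numerical stability
    p /= p.sum()
    return -np.log(p[y])

# Correct class is index 0; s2 is "the same point after pushing wrong scores lower".
s1 = np.array([10.0, 4.0, 4.0])
s2 = np.array([10.0, 2.0, 2.0])

print(svm_loss(s1, 0), svm_loss(s2, 0))          # 0.0 0.0 -> the SVM is indifferent
print(softmax_loss(s1, 0), softmax_loss(s2, 0))  # ~4.9e-3 vs ~6.7e-4 -> softmax prefers s2
```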
So the preference wouldn't be very big, but it's there. In practice, the intuition I'm trying to give you is that the SVM cares about a very local part of the space around the thing you're classifying, and beyond that it's invariant, while the softmax is a function of the full data cloud: it cares about all the points. If there's a small class you're trying to separate out from everything else, the SVM just wants to separate that piece from the immediate part of the data cloud, whereas the softmax considers the full data cloud when it's fitting your plane. In practice, when you actually run these, they give nearly identical results almost always. So I'm not trying to pitch one or the other; I'm trying to give you the notion that you're in charge of the loss function: you get some scores out, and you can write down nearly any mathematical expression, as long as it's differentiable, for what you want your scores to be like. These are just two formulations that are common to see in practice, and being able to put down any loss for what you want your scores to be is a very nice feature, because we can optimize over all of it.

Okay, let me show you an interactive web demo at this point. This is on our class page; you can find it at this URL. I wrote it last year, and I have to show it to all of you to justify spending one day of my life developing it; last year not too many people looked at it, so you should all look at it.

What we have here is a two-dimensional problem with three classes, each with three examples, in two dimensions. I'm showing the three classifiers via their level sets: for example, the red classifier has a score of 0 along this line, and the arrows show the direction in which its score increases. Here's our W matrix; as you recall, the rows of W are the different classifiers, so we have the blue, green, and red classifiers, with weights for both the x and y components, and also the biases. And here we have the dataset: the x and y coordinates of all the data points, their correct labels, and their scores, as well as the loss achieved by each data point under the current setting of W. You can see I'm taking the mean over all the losses: right now the data loss is 2.77, the regularization loss for this W is 3.5, and the total loss is 6.27.

So you can fiddle around with this. As I change my W, making one of the weights bigger, you can see what that does over there, and the bias basically shifts these hyperplanes. And then, as a kind of preview of what's coming: we get the loss, and then we do backpropagation, which gives us the gradient telling us how to adjust these Ws in order to make the
loss smaller. So what we do is repeated updates: we start off with this W, and I can improve the set of Ws. When I do a parameter update, it uses these gradients, shown here in red, and makes a tiny change to every single parameter according to its gradient. As I do parameter updates, you can see the loss decreasing, especially the total loss: it just keeps getting better and better. This is the process of optimization, which we'll go into in a bit. I can also start repeated updates, and we keep improving W over and over: our mean loss over the data started at roughly 3, and now it's around 0.1, and we're correctly classifying all of these points. I can also randomize W, which knocks everything off, and the optimization always converges back to the exact same point.

You can play with the regularization here as well, and with different forms of the loss: the one I've shown you is the Weston–Watkins SVM formulation, there are a few more SVM formulations, and there's also softmax. You'll see that when I switch to the softmax loss, the losses look different, but the solution ends up being roughly the same; when I switch back to the SVM, the hyperplanes move around a tiny bit, but it's mostly the same. And this is the step size: how big a step we take once we have the gradient on how to improve things. If I randomize the parameters and start with a very big step size, these things jiggle around, trying to separate out the data points; over time, in optimization, we decrease the update size, and this thing slowly converges to the parameters we want in the end. So you can play with this and watch how the scores jiggle around and what the loss does. Oh, and you can also drag the points, though I think on a Mac it doesn't work: when I try to drag this point it disappears. It works on a desktop; I don't want to go in and figure out exactly what happened there.

Okay, so I'm going to go into the process of optimization now, just to give you a sense of what this looks like. We have the function f, we have the two loss formulations I've shown you, and the full loss is the mean loss over the data plus the regularization. Here's one more diagram of what this looks like. I don't think it's a very good diagram, and there's something confusing about it that I can't remember from last year, but basically: you have the data, x and y, your images and labels; there's W; we compute the scores and get the loss; and the regularization loss is a function of the weights only, not of the data. What we want to do now is this: we don't have control over the dataset, that's given to us, but we do have control over W, and as we change W the loss changes. For any W you give me, I can compute the loss, and that loss is linked to how well we're classifying all of our examples. A low loss means we're classifying the training data very well, and then we're crossing our fingers that this also works on some test data we
haven't seen.

So here's one strategy for optimization: random search. Because we can evaluate the loss for any arbitrary W, what I can afford to do, and I don't want to go through this in full detail, is randomly sample Ws, check their loss, and just keep track of the W that works best. That's an amazing process of optimization: guess and check. It turns out that if you do this, I think I tried a thousand times, and take the best W found at random and run it on the CIFAR-10 test data, you end up with about 15.5% accuracy. Since there are ten classes in CIFAR-10, the baseline is 10%, chance performance, so 15.5% means there's some signal. State of the art is about 95%, which is a ConvNet, so we have some gap to close over the next two weeks or so. Don't use random search just because it's on the slides.

One interpretation of what this process of optimization looks like: we have this loss landscape in the high-dimensional space of W. If you picture it in 3D, with the loss as the height, then you only have two Ws; you're standing somewhere, blindfolded, at your current W. You can't see where the valleys are, but you're trying to find low loss, and you have an altimeter, so you can tell your loss at any single point, and you're trying to get to the bottom of the valley. That's really the process of optimization, and what I've shown you so far is random optimization, where you teleport around and check your altitude. Not the best idea.

What we'll do instead is use what I've already referred to as the gradient: really, we're just computing the slope in every single direction, and then going downhill. I won't go into too much detail, but there's an expression for the derivative, the calculus-101 definition:

df/dx = limit as h → 0 of ( f(x + h) − f(x) ) / h,

and in multiple dimensions, the vector of derivatives along every dimension is what's referred to as the gradient. Because we have multiple Ws, we have a gradient vector. In fact, we can evaluate this expression numerically; before I go into the analytic approach, I'll show you what it looks like to evaluate a gradient at some W this way. Suppose we have a current W with some loss, say around 1.25, and we want to get an idea of the slope at this point. What the formula tells you to do is evaluate f at x + h, subtract f(x), and divide by h. What that corresponds to, on the landscape, is taking a small step in some direction and checking whether your foot went up or down; that's what the gradient tells you. So suppose I take a small step along the first dimension and the loss comes out slightly lower there: using the formula with a small h, a finite-difference approximation, I can derive that the slope here is, say, −2.5. I took a step, the loss decreased, so the loss slopes downward in this particular dimension,
negative 2.5 in that particular dimension. Now I can do this for every single dimension independently: I go to the second dimension, add a small amount there, so I step in a different direction, look at what happened to the loss, use the formula, and it tells me the slope there is 0.6; and I can do the same in the third dimension, and get the gradient there too. What I'm describing is evaluating the numerical gradient: using the finite-difference approximation, for every single dimension independently, I take a small step, evaluate the loss, and that tells me whether the loss slopes up or down along that parameter. As a Python function this looks a bit ugly, because it turns out to be slightly tricky to iterate over all the elements of W, but basically we're just computing f(x + h), comparing it to f(x), and dividing by h, and we get the gradient; a sketch follows below.

Now, the problem with the numerical gradient is that we have to do this for every single dimension to get the full gradient, and when you have a ConvNet you have hundreds of millions of parameters, so we can't afford hundreds of millions of loss evaluations before taking a single step. This approach is, first of all, approximate, because we're using a finite-difference approximation, but it's also extremely slow.

It also turns out to be silly, because the loss is a function of W that we've written down, and what we want is the gradient of the loss function with respect to W, which we can just write down, thanks to these guys. You should know who they are: Newton and Leibniz, the two inventors of calculus. (Do I know which is which? They look remarkably similar here. There's actually controversy over who really invented calculus, and they fought each other over it.) Calculus is a powerful hammer: instead of the silly thing where we fumble around numerically, checking whether the loss goes up or down, we can use calculus to write down an expression for the gradient of the loss function in weight space, and simply evaluate that entire vector. That's the only way you can actually run this in practice: have an expression for the gradient, do a step, and so on.

So, in summary: the numerical gradient is approximate and slow, but very easy to write, because it's a simple procedure that works for any arbitrary loss function. The analytic gradient, where you actually do the calculus, is exact, with no finite approximations, and very fast, but it's error-prone, because you actually have to do math. So in practice, we always use the analytic gradient: we do the calculus and figure out what the gradient should be.
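Here's roughly what that finite-difference function might look like; this is a sketch in the spirit of the course code described above, not a verbatim copy, and it's slow by design, which is exactly why it's used only for checking:

```python
import numpy as np

def eval_numerical_gradient(f, x, h=1e-5):
    """Finite-difference gradient of f at x.
    f: function of a single float array (e.g. the loss as a function of W),
    x: the point (array) to evaluate at; modified in place and restored."""
    fx = f(x)                          # loss at the original point
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h                # nudge one dimension by a small amount
        fxh = f(x)                     # re-evaluate the loss
        x[ix] = old                    # restore the original value
        grad[ix] = (fxh - fx) / h      # slope along this dimension
        it.iternext()
    return grad

# For a gradient check, compare against your analytic gradient, e.g.:
# rel_err = np.abs(g_num - g_ana) / np.maximum(1e-8, np.abs(g_num) + np.abs(g_ana))
```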
But then you always check your implementation using what's referred to as a numerical gradient check. I do all my calculus, I figure out what the loss function and its gradient should be, I write an expression for the gradient, and I evaluate it in my code, so I get an analytic gradient. Then I also evaluate a numerical gradient on the side, which takes a while, and I make sure the two are the same; if they are, we say you pass the gradient check. That's what you see in practice: whenever you develop a new module for your network, you write out its loss, you write the backward pass that computes the gradient, and then you gradient check it to make sure your calculus is correct.

I've already referred to this process of optimization, which we saw nicely in the web demo: we have a loop where we evaluate the gradient of the loss function, and then, knowing the gradient, we perform a parameter update, where we change W by a tiny amount. In particular, we update with the negative step size times the gradient. The negative is there because the gradient tells you the direction of greatest increase, which way the loss is increasing, and since we want to minimize the loss, we have to go in the negative gradient direction.

The step size, also called the learning rate, is a hyperparameter that will cause you a huge amount of headaches. There are really two hyperparameters you have to worry about most: the step size or learning rate, and the weight regularization strength lambda that we saw already. Those two are the biggest headaches, and they're usually what we cross-validate over.

There was a question in the back. The gradient is just the gradient: it tells you the slope along every single dimension, and then we take a step of some size. The process of optimization in this weight space is: you're somewhere in W, you get your gradient, and you march some amount in that direction, but you don't know how much; that's the step size. You saw that when I increased the step size in the demo, things were jiggling and jittering around quite a lot; there was a lot of energy in the system, because I was taking huge jumps all over the basin. Here the loss function is minimal in the blue region and high in the red regions, and we want to get to the lowest part of the basin. This is actually what the loss function looks like for an SVM or for softmax regression: these are convex problems, so you really just have a bowl and you're trying to get to the bottom of it; it's just that this bowl is thirty-thousand-dimensional, which is why it takes a while. So we take a step, we evaluate the gradient, and we repeat this over and over.

In practice there's one additional part I wanted to mention: we don't actually evaluate the loss over the entire training set. Instead we use what's called mini-batch gradient descent: we have the entire dataset, but we sample batches from it, say 32 examples out of the training data, evaluate the loss and the gradient on that batch of 32, do the parameter update, and keep repeating this over and over again.
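Put together, the loop just described might look like the following minimal sketch; sample_training_data, evaluate_gradient, and loss_fun are hypothetical placeholder names for a batch sampler, your analytic gradient routine, and your loss, and data and weights are assumed to already exist.

```python
# Vanilla mini-batch gradient descent (a sketch with placeholder helpers).
step_size = 1e-3                                 # learning rate (hyperparameter)
while True:
    batch = sample_training_data(data, 32)       # sample 32 training examples
    weights_grad = evaluate_gradient(loss_fun, batch, weights)
    weights += -step_size * weights_grad         # step in the negative gradient direction
```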
If you only sample a few data points from your training data, then your estimate of the gradient over the entire training set is of course noisy, because you're estimating it from a small subset of the data; but in exchange you get to take many more steps. So you can take many steps with an approximate gradient, or a few steps with the exact gradient, and in practice taking many approximate steps works better, and it's much more efficient; doing full-batch gradient descent is simply impractical. Common mini-batch sizes are 32, 64, 128, or 256. This is not usually a hyperparameter we worry about too much; you set it based on whatever fits on your GPU. We'll be talking about GPUs in a bit, but they have a finite amount of memory, say about 6 or 12 gigabytes on a good GPU, so you choose a batch size such that the mini-batch fits in memory. It's not a hyperparameter that matters a lot in an optimization sense. And to the question about momentum: yes, we'll get to momentum in a bit, and if you want to use momentum, mini-batch gradient descent works just fine with it; in fact, mini-batch gradient descent with momentum is the common combination.

To give you an idea of what this looks like in practice: if I run the optimization over time and plot the loss evaluated on small mini-batches of data, the loss on these mini-batches from the training data goes down over time; as I'm optimizing, I'm going downhill. If I were doing full-batch gradient descent instead of sampling mini-batches, you wouldn't expect as much noise; you'd expect a curve that just goes down. Because we use mini-batches, and some batches are easier than others, you get noise in the curve, but over time it all trends downward.

There was a question about the shape of this loss curve, since you may be used to seeing more rapid improvement early on. These loss curves come in different shapes and sizes, so it really depends; it's not necessarily the case that the loss must drop sharply in the beginning, although sometimes it does. It also depends on your initialization: if I'm careful with my initialization, I'd expect less of an initial jump, but if I initialize very badly, you'd expect that to get fixed very early in the optimization. We'll get to those parts much later.

I also wanted to show you a plot of the effect of the learning rate, which is the step size, on your loss. If you have a very high learning rate, you start thrashing around in W space, and either you don't converge or you explode. If you have a very low learning rate, you're barely doing any updates at all, so it takes a very long time to converge. And with a somewhat-too-high learning rate you can also get stuck in a bad region of the loss. When you talk about neural networks and optimization you'll see a lot of hand-waving, because that's the only way we can communicate about these losses and basins, but imagine a big basin of loss with smaller pockets of lower loss inside it. If you have too much energy and you're stepping too quickly, you don't allow the problem to settle into those smaller local minima; you keep thrashing around and can't settle into the lower-loss pockets and converge further. That's why too high a learning rate is no good. So you need to find the correct learning rate, which will cause you a lot of headaches, and what people do most of the time is start with a relatively high learning rate, so you get some quick progress, and then decay it over time as you settle in on a good solution.
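As a concrete illustration of that "start high, decay over time" practice, one hypothetical schedule is a simple step decay; the constants below are made-up example values, not recommendations from the lecture.

```python
# Step decay: drop the learning rate by a fixed factor at regular intervals.
base_lr = 1e-2        # initial, relatively high learning rate (example value)
decay_factor = 0.1    # multiply the rate by this factor...
decay_every = 20      # ...every 20 epochs (example schedule)

def learning_rate(epoch):
    """Return the decayed learning rate for a given training epoch."""
    return base_lr * (decay_factor ** (epoch // decay_every))
```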
I also wanted to point out, and I'll go into this in much more detail later, that the way I'm using the gradient to actually modify W here is called the parameter update, and there are many different forms of it. This is the simplest one, which we refer to as just SGD, plain stochastic gradient descent. But there are many other formulas, such as the momentum update that was already mentioned. With momentum, you imagine keeping track of a velocity as you do the optimization: as I'm stepping, I'm also accumulating velocity, so if I keep seeing a positive gradient in some direction, I accumulate velocity in that direction and move faster along it (see the sketch below). There are several such formulas, which we'll look at shortly in the class; Adagrad, RMSProp, and Adam are commonly used.

Just to show you the effect of these different choices, here is what they might do on a loss function; this is a figure from Alec Radford. We have a loss function shown by its level curves, we start the optimization over there, and we're trying to get to the basin; different update formulas give you better or worse convergence on different problems. You can see, for example, that momentum, in green, builds up speed as it goes downhill, then overshoots and comes back. And SGD, in red, which is what I've presented to you so far, takes forever to converge. So there are different ways of performing the parameter update, and they are more or less efficient at optimization. We'll see much more of this soon.
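Here is a minimal sketch of that momentum update, using the same hypothetical placeholder helpers as in the earlier loop; mu (the momentum coefficient) and step_size are hyperparameters, and the values shown are illustrative only.

```python
import numpy as np

mu = 0.9                              # momentum coefficient (illustrative value)
step_size = 1e-3
v = np.zeros_like(weights)            # running velocity, starts at zero
while True:
    batch = sample_training_data(data, 32)
    weights_grad = evaluate_gradient(loss_fun, batch, weights)
    v = mu * v - step_size * weights_grad   # accumulate velocity along persistent directions
    weights += v                            # step along the velocity, not the raw gradient
```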
Now, slightly shifting gears: at this point in the class we basically understand linear classification. We know how to set up the problem, we know there are different loss functions, and we know how to optimize them, so we can do linear classifiers. I wanted to give you a sense of what computer vision looked like before ConvNets came about, so that you have a bit of historical perspective, because we used linear classifiers all the time; but of course you don't use linear classifiers on the raw original image, because those are pixels, and we saw all the problems with that, like having to cover all the modes of appearance. So what people used to do is compute all these different feature types over the image, different descriptors that give statistical summaries of what the image looks like, what the frequencies are, and so on; then we concatenated all of those into large vectors, and those went into linear classifiers. Different feature types, all of them concatenated, and then a linear classifier: that was the usual pipeline.

To give you an idea of what these feature types were really like, one very simple feature type you might imagine is just a color histogram: I go over all the pixels in the image and bin them by hue, counting how many pixels fall into each color bin (a small code sketch of this appears at the end of this discussion). As you can imagine, this is one statistical summary of what's in the image, just the number of pixels of each color, and it becomes one of the features that would eventually be concatenated with many other feature types. And if you think about it, the linear classifier can use such features to perform the classification, because it can like or dislike seeing lots of a particular color in the image through positive or negative weights.

Very common features also included things like what we call SIFT and HOG features, where you go over local neighborhoods in the image and look at whether there are lots of edges of different orientations, lots of horizontal edges or vertical edges; you make histograms over that, so you end up with a summary of what kinds of edges are where in the image, and you concatenate all of those together. Lots of different feature types were proposed over the years, such as LBP and textons, lots of different ways of measuring what kinds of things are in the image and their statistics. Then we had these pipelines called bag-of-words pipelines, where you look at different points in your image and describe a little local patch with some scheme you come up with, like looking at the frequencies or the colors or whatever. You come up with dictionaries of the stuff you see in images, for example lots of high-frequency stuff, low-frequency stuff, blue stuff, and so on, so you end up with centroids, found using k-means, of the kinds of stuff seen in images, and then you express every single image as statistics over how much of each thing you see in it. For example, an image with lots of high-frequency green stuff might get a feature vector with high values in the high-frequency and green entries. Then we took these feature vectors, concatenated them, and put a linear classifier on them.

So really, the context for what we're doing is as follows. What computer vision mostly looked like before roughly 2012 was: you take your image, and you have a feature extraction step where we decided what the important things to compute about an image are, different frequencies, different bins, whatever we decided were interesting features; you'd see people take ten different feature types in every paper and just concatenate all of it, ending up with one giant feature vector over the image; and then you put a linear classifier on top of it, just like we saw right now, so you'd train, say, a linear SVM on top of all these feature types. What we've replaced that with, since we found it works much better, is to start with the raw image and think about the whole thing: you're not designing some part of the pipeline in isolation based on what you think is a good idea or not. We come up with an architecture that can simulate a lot of those different features, so to speak, and since everything is just a single function, we don't only train the classifier on top of the features; we can actually train all the way down to the pixels, so we can effectively train our feature extractors too. That was the big innovation.
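To make the simplest of those features concrete, here is a hypothetical sketch of a hue histogram in Python; the function name, the bin count, and the use of matplotlib's RGB-to-HSV conversion are all assumptions for illustration, not the lecture's actual code.

```python
import numpy as np
import matplotlib.colors as mcolors   # used only for rgb_to_hsv

def hue_histogram(image_rgb, n_bins=16):
    """Count pixels per hue bin: a crude color-histogram image feature.

    image_rgb: H x W x 3 array of floats in [0, 1]
    returns: length-n_bins feature vector (normalized bin counts)
    """
    hsv = mcolors.rgb_to_hsv(image_rgb)     # hue channel lands in [0, 1]
    hues = hsv[..., 0].ravel()
    hist, _ = np.histogram(hues, bins=n_bins, range=(0.0, 1.0))
    return hist / hist.sum()                # normalize to a distribution
```

A vector like this would then be concatenated with the other feature types before going into the linear classifier.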
The way we approach the problem now is that we try to eliminate a lot of hand-engineered components and have a single differentiable blob, so that we can train the full thing starting from the raw pixels. So that's where this is coming from historically, and that's what we'll be doing in this class. Next class we'll look specifically at the problem of computing the analytic gradients: we're going to go into backpropagation, which is an efficient way of computing the analytic gradient, and you're going to become good at it, and then we're going to go slightly into neural networks. That's it.
Info
Channel: MachineLearner
Views: 23,105
Rating: 4.9259257 out of 5
Keywords: classification, gradient descent
Id: q3TZVNGtOug
Length: 71min 23sec (4283 seconds)
Published: Tue Jun 14 2016