Machine Learning Lecture 16 "Empirical Risk Minimization" -Cornell CS4780 SP17

Video Statistics and Information

Captions
Ready? Please put your laptops away, thank you. One little thing: the exam is in one week and one day. I hope you are feeling ready; please don't start too late. Is the mic on, can you hear me in the back? Thumbs up? Okay. So the exam is next Tuesday, please start early. Some people asked what's going to be on the exam, and the answer is everything — everything that we have covered. We are posting (or I think it's already posted) exams from previous years on Piazza under Resources; if you click on Resources at the top you can see them there. We separated the exams into versions with and without solutions, so first try to do them yourself without solutions and then look at the solutions. And please, please, please start the studying early. One common failure mode in this class is that people do very well on the projects but don't take the exam seriously enough. The exam is 50% of your grade, so please keep that in mind. I do this purposely, because machine learning really is half theory and half practical chops; you can also have people who know the theory but can't actually get anything working, so I really want this course to cover both of these sides of machine learning, and in order to do well you have to do well on both of them. Any questions about logistics? Okay, good.

Last thing, the project: the next project will really be released today. I've been promising it for several lectures, but we ran into a bug that sometimes appeared when we tested it (we always have some TAs pretend they're students), and it turned out we actually had to communicate with Vocareum and they had to change something in their back end to make sure it didn't happen anymore. That delayed things a little bit, but ultimately you will get the same amount of time: two weeks starting the moment it gets released. It's basically ERM — the idea is you have to implement a bunch of these loss functions, hinge loss, logistic loss, and so on. It doesn't take very long, so it's good preparation for the exam, but you can also finish it afterwards if you feel like you'd rather study the lectures first.

Okay, so last lecture we went over empirical risk minimization. The idea is that in machine learning we can very often write the learning algorithm as follows: we try to find some parameters — call them w, they don't have to be called w, these are basically our weights — and what we minimize is some loss that goes over the data set plus a regularizer:

    min_w  sum_i ℓ( h_w(x_i), y_i )  +  λ r(w)

Here h_w(x_i) is the prediction for data point x_i, y_i is the true label, and the loss ℓ says how much I have to pay for getting a point wrong — or even right, but not confidently enough. The regularizer, the second term, penalizes really complicated solutions. It turns out that in a high-dimensional space, typically even with linear classifiers you can almost always find a solution that makes no single mistake on the training data anymore, but it tends to be overly complicated: you basically learn rules that just rely on the noise in your data, which doesn't mean anything. So by trying to come up with simpler solutions you effectively obtain better generalization.
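To make that objective concrete, here is a minimal NumPy sketch of a regularized empirical risk; the function name, the linear model h_w(x) = w^T x, and the tiny data set are illustrative assumptions, not the course's project code.

```python
import numpy as np

def empirical_risk(w, X, y, loss, lam):
    """Sum of per-example losses plus lambda times the L2 regularizer w^T w."""
    preds = X @ w                        # h_w(x_i) = w^T x_i for every row x_i
    return np.sum(loss(preds, y)) + lam * np.dot(w, w)

squared = lambda h, y: (h - y) ** 2      # one possible choice of loss

# tiny made-up example
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
print(empirical_risk(np.array([0.1, 0.2]), X, y, squared, lam=0.5))
```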
Last time we talked about the hinge loss, which is also what we saw in SVMs, then the logistic loss, the zero-one loss, and the exponential loss. These are the typical, famous classification loss functions. Today I want to move on to regression. Regression basically means we are in the regime where y can be a real number: for example you're predicting a person's height, or the value of a house based on its features, or the temperature based on previous temperatures, et cetera. Again we have many common loss functions; I'll just go through them and make a few comments on each, and the important thing is that all of them fit into the same framework, so if you understand how to optimize one of them you probably know how to optimize all of them.

The first one is one that we've already seen many times: the squared loss, which we saw in ordinary least squares and in ridge regression. The loss is very simple: you penalize (h_w(x_i) − y_i)². One nice thing about squaring is that it's always positive, so whether you're below or above the value you're trying to predict — say the price of a house — it doesn't matter; too high or too low gets the same penalty. The other property of this loss function is that, because it's squared, being more off gives you a much larger penalty: if you're off by 1, 1 squared is just 1, but if you're off by 10, 10 squared is 100. This quadratic relation forces the loss to focus more on outliers — it basically makes sure that nothing is really badly wrong. Suppose I took all the people in this class, looked at their heights, and tried to predict the one value that minimizes the squared loss across everybody — so here h_w(x) is just one number, it ignores the x and the w, I'm just outputting one constant and trying to fit that constant. Does anyone know what I would get? The average, that's exactly right. The squared loss tries to fit the average: if you can output exactly the average, that is the optimal solution in terms of the squared loss — you can't do better than the average with a constant. So the squared loss estimates the mean.

For example, if you do income prediction, the average may not be well suited, because income is distributed according to a power-law distribution: very few people own as much as everybody else. If you take the average income in the United States and, say, Bill Gates makes a good investment and increases his income by 10%, it would look like everybody's income went up by quite a bit, but that's not very meaningful because only one guy made more money. On the other hand, in settings where you don't have such crazy outliers, the mean may be totally appropriate. The nice thing about the squared loss is that it's fully differentiable everywhere, so it's very easy to fit with gradient descent — and if you use the Hessian, Newton's method converges in just one step, since the objective is quadratic. If you minimize this loss with a linear predictor, it just becomes ordinary least squares, exactly what we had before. Any questions about the squared loss?
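A quick way to check the "squared loss estimates the mean" claim is to fit a single constant by brute force; the heights below are made up, and the grid search just stands in for the derivative argument mentioned later.

```python
import numpy as np

heights = np.array([158.0, 165.0, 170.0, 172.0, 180.0, 191.0])  # made-up heights

def avg_squared_loss(c):
    return np.mean((c - heights) ** 2)

candidates = np.linspace(heights.min(), heights.max(), 10001)
best = candidates[np.argmin([avg_squared_loss(c) for c in candidates])]
print(best, heights.mean())   # the minimizer matches the mean (up to the grid spacing)
```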
All right, the next one is the absolute loss. Here the loss is almost the same but with the absolute value: |h_w(x_i) − y_i|. Any idea, without looking at the notes, what this one estimates? It does not estimate the average. If I again try to predict a single constant for the height of everybody in here, and I want to be off by as little as possible in absolute terms, it's not the average — what could it be? The median, right. So for example, in the income example that would make a lot more sense: if Bill Gates gets much richer, it wouldn't change the median at all.

[Student question] No — if you evaluate the loss at the median it doesn't give you zero; it's just as good as you can do with a constant, you can't do any better with a constant. And you can do a lot better if you do something different than a constant, if you actually take the x into account. Let me expand on this: for every single x_i there is some distribution P(y | x_i), and with the squared loss, conditioned on your features, you're basically trying to predict the mean label given x_i — the average of that conditional distribution. With the absolute loss, given your x, you're trying to predict the median label. Does that make sense? Okay, good. So for example, say I try to predict height in this class. If I just take a constant — the single best number for the height of the students in this room — that would be the average height or the median height. But I can do better if I take features into account: one obvious feature would be gender, and gender certainly affects height, so then for all the guys I would predict the average height of a guy, and for all the girls the average height of a girl. That would be the best prediction if I only have one binary feature.

[Student: how do you show the squared loss gives the mean?] It's quite simple, actually: you plug in a constant, minimize the squared loss over that constant, take the derivative and set it to zero, and what you find is that the answer is exactly the mean. The median case is not quite as trivial. Any more questions? All right.
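The same brute-force check for the absolute loss, with made-up incomes and one Bill-Gates-style outlier, shows the minimizer sitting at the median while the mean is dragged far away.

```python
import numpy as np

# made-up incomes; the last entry is the outlier
incomes = np.array([30e3, 40e3, 45e3, 52e3, 60e3, 75e3, 10e6])

def avg_abs_loss(c):
    return np.mean(np.abs(c - incomes))

candidates = np.linspace(incomes.min(), incomes.max(), 200001)
best = candidates[np.argmin([avg_abs_loss(c) for c in candidates])]
print(best, np.median(incomes), np.mean(incomes))  # ~52k, 52k, ~1.47M
```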
Okay, so the absolute loss is often better, and here is the advantage. Suppose you have something that's really far off: say I'm trying to predict house prices, most houses in Ithaca are not that expensive, but some guy built a beautiful mansion at the lake that costs ten million dollars. With the squared loss, if I'm off on that one house by a million, a million squared is just gigantic, so the squared loss would spend all its energy trying to get that one mansion correct and would happily sacrifice all the other houses in Ithaca. That's the problem with squaring. The absolute loss wouldn't treat that mansion any differently. The problem with the absolute loss is that it is not differentiable at 0 — and you're trying to drive the error to zero, so right at the point you're aiming for it's not differentiable, and that makes it a lot harder to optimize.

There's a beautiful best-of-both-worlds loss called the Huber loss — one of my favorite losses. The idea is: very close to zero you use the squared loss, because that's where it's differentiable, but far away from zero, where squaring the differences becomes problematic because you get very large losses, you switch to the absolute loss. You can construct a continuous function that way:

    1/2 (h_w(x_i) − y_i)²              if |h_w(x_i) − y_i| ≤ δ
    δ ( |h_w(x_i) − y_i| − δ/2 )       otherwise

So this says: if the absolute error is less than δ, square it, which makes it basically the same as the squared loss — around zero you have a quadratic — but past that point you just continue linearly, so for large errors you get the penalty of an absolute loss. If someone is a millionaire outlier, you're not squaring that on top of an already large error. That's very outlier-friendly, and it's typically the best of both worlds. Surprisingly, many people don't know it; many people just use the squared loss or the absolute loss, but most of the time I would say the Huber loss is actually the preferred choice.

[Student: what should δ be?] One answer is to just pick a default value, and it doesn't matter all that much, because by picking δ you're basically saying how much of the squared loss you want versus how much of the absolute loss. If you think there are really drastic outliers, make it small; if you think there won't be many, make it larger. [Student: how do you come up with that expression?] You basically just connect the two functions: you have a quadratic function up to a certain point, then you solve for the first derivative at that point and continue linearly so the pieces match. I'll draw it in a minute.
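A small NumPy sketch of the Huber formula above; the function name and the default δ are mine, not the project's API.

```python
import numpy as np

def huber_loss(h, y, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    r = np.abs(h - y)
    return np.where(r <= delta,
                    0.5 * r ** 2,                 # quadratic part near zero
                    delta * (r - 0.5 * delta))    # linear part for outliers

# the two branches meet at r = delta, where both equal 0.5 * delta**2
print(huber_loss(np.array([0.5, 3.0]), np.array([0.0, 0.0]), delta=1.0))
```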
The Huber loss is just one way of doing this. There's another one that's very similar, called the log-cosh loss. For those not familiar with it, the cosh function is cosh(x) = (e^x + e^{−x}) / 2. Think about this for a second: what happens when x gets very large, and what happens when it gets very negative? It's actually a pretty symmetric function. If x is very positive, the e^{−x} term goes to zero and you're left with essentially e^x. If x is very negative, the e^x term goes to zero and the e^{−x} term takes over — the negative signs cancel and you again get something that grows like e^{|x|}. So it's basically e^{|x|} in both directions. Does that make sense? Raise your hand if you have an intuition for what that function looks like. Okay, good.

The log-cosh loss then just takes the logarithm: log( cosh( h_w(x_i) − y_i ) ), and I'm going to give you a quiz in a minute where you have to think about what this function looks like. The nice thing is that, like the Huber loss, it's quadratic near zero and linear far away, but it's differentiable everywhere — the Huber loss, right at the point where the error equals δ, is only once differentiable (its curvature jumps there), but that's a single point you don't really care about very much. The downside is that you have to compute these exp and log functions, which is a little slow if you have to do it millions of times; computers are not very good at computing exponentials and logarithms.

Okay, so why don't we do a quick exercise: on your sheet, right below the table, you see a little empty graph. Quickly draw all four of those functions — the Huber loss with δ = 1 and with δ = 5, the log-cosh, the absolute, and the squared loss. [Student question] Oh, good point — δ here is just a constant; it's not the Dirac delta function, it just means a small number. Thanks for clarifying.

All right, who thinks he or she has all of them? Who thinks he or she will have all of them soon? Okay, let's get started; I can draw them here. Who knows which function this is? That's the easiest one: the squared loss. The next one? The absolute loss, exactly. Now it gets harder: which one is green and which one is blue? The green one is actually the log-cosh, and the red and pink ones are the Huber losses with δ = 1 and δ = 5. There's basically almost no difference between log-cosh and Huber — they're very, very similar functions, which is why it's hard to tell them apart. Not impossible, but hard.
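For reference, a quick matplotlib sketch of the drawing exercise; the residual range on the x-axis is arbitrary, and note that over this range the δ = 5 Huber curve coincides with half the squared loss.

```python
import numpy as np
import matplotlib.pyplot as plt

def huber(r, delta):
    return np.where(np.abs(r) <= delta, 0.5 * r ** 2,
                    delta * (np.abs(r) - 0.5 * delta))

r = np.linspace(-5, 5, 500)                  # residual h_w(x) - y
plt.plot(r, r ** 2, label="squared")
plt.plot(r, np.abs(r), label="absolute")
plt.plot(r, huber(r, 1.0), label="Huber, delta = 1")
plt.plot(r, huber(r, 5.0), label="Huber, delta = 5")
plt.plot(r, np.log(np.cosh(r)), label="log-cosh")
plt.xlabel("h_w(x) - y")
plt.ylabel("loss")
plt.legend()
plt.show()
```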
Here you can see it again: the way you have to look at these loss functions is, if I have a point that's off by some amount, how much do I gain by adapting my parameters to move it closer to zero error? The gain is how much the function goes down. For the absolute loss — the black line — if a point is off by three and I correct it by one, the improvement is exactly the same as for a point that's off by one and I move it one to the right. So there's no special incentive to look at the points that are really far off; improving anything helps the same way. Whereas with the quadratic loss, if I'm way off and move one unit to the right, I get a huge improvement, while near zero I only get a tiny improvement. Does that make sense? That's what you have to think about: the squared loss will look at the points that are really far out and try to move them in, because that drops the loss a lot on average.

Any questions about these loss functions? [Student: the choice seems somewhat arbitrary.] It is somewhat arbitrary. In some sense this is the choice you have to make as the data scientist — which one is the best loss — and how noisy your data is certainly weighs into it as a big factor. People have written books about each one of these loss functions, so the theory is very well understood, but in practice it still mostly comes down to how noisy your data is. I remember when I used to work in industry we had fierce discussions about which loss function is the right one — we almost got into real arguments. [Student: is log-cosh slower?] Nowadays that's maybe not so much of a problem anymore, but computing log-cosh is computationally pretty expensive. If you think in terms of compute cycles, squaring is really easy — it's just one multiplication, or even a shift, super fast, one cycle, a few nanoseconds — whereas the exponentiation and the log are much, much more expensive. [Student: are the notes posted?] They are already on the website. Any more questions?

Okay, good. The author of the demo is not here today, but he's been bugging me to show you the one demo I promised you last time — I don't want to pretend I had a demo and then not show it. I owe you a demo on ridge regression, and it's just as relevant now as it was back then, because it's basically just the squared loss with an L2 regularizer. So here's a data set — I'm just drawing some data points (it helps when you make that sound) — and now I fit it with a least-squares regressor, and here's the line that minimizes the squared loss; on average it's off very little. This is for a very small lambda, very little regularization. If I increase my regularization, you can see what happens: it flattens the line out. "Simpler" solutions here are more parallel to the horizontal axis, and in the extreme case, with a lot of regularization, it just ignores the data and gives me a flat line. That's the effect of the regularization.

Let me draw one more example, this time with an outlier: I put one point way out here, and if I now fit this, you see that it moves the whole line a lot towards that one point, so it becomes a really bad estimate for all the other points — and that's only because of this one point. Maybe that measurement was atypical: say in biology you measure something, and at that moment a big Mack truck drives past your lab and everything shakes, so this is the one point the Mack truck distorted. The squared loss basically falls for it, and all the other points now get worse predictions just because of that one outlier. [Student: could you downweight noisy dimensions?] The problem is that the prediction is an overall prediction across all the dimensions, so it's not actually clear how to do that — you're right that some dimensions may be very noisy, but probably the best fix is to use the log-cosh loss or the Huber loss. Any more questions? Okay.
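The original interactive demo isn't reproduced here, but a minimal sketch in the same spirit — made-up 1-D data with one "Mack truck" outlier and a closed-form ridge fit — shows the flattening effect of λ.

```python
import numpy as np

# made-up 1-D data; the last point is the outlier
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 0.9, 2.1, 2.9, 4.2, 15.0])
X = np.column_stack([x, np.ones_like(x)])          # features: [x, bias]

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w = (X^T X + lam I)^{-1} X^T y.
    For simplicity the bias is regularized too, which the demo may not do."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in [0.0, 1.0, 100.0]:
    slope, intercept = ridge_fit(X, y, lam)
    print(lam, slope, intercept)   # larger lam flattens the fitted line
```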
Okay, let's move on to regularization. What is regularization? It's adding this extra term: we minimize the error plus λ r(w). We've already seen this: when we did MAP inference for logistic regression, and when we did MAP inference to go from ordinary least squares to ridge regression, the second term is exactly what we got, and in our case r(w) was always w^T w. In fact we've seen three classifiers with this regularizer — does anyone remember the third one? One is ridge regression, the second is logistic regression with MAP estimation — what's the third? The SVM, thank you. With the SVM, the loss became the hinge loss plus the L2 regularizer.

Someone asked me a really good question at the end of last lecture — I'm paraphrasing, but it was a great question. With the SVM, initially we started out saying we have this data set of positive and negative points and we would like to find the hyperplane that has maximum margin. But at the end, when we looked at the final optimization problem, it just says minimize w^T w subject to the constraint y_i (w^T x_i + b) ≥ 1 for every data point. So why does that maximize the margin? The intuition was very clear at the beginning, and every single step made sense, but we arrived at something that doesn't obviously maximize the margin. For example, this other hyperplane here also seems like a perfectly good classifier according to that optimization problem. Let me make that connection, because it's a very good point — it turns out that hyperplane is actually a suboptimal solution, and here's why.

The constraint says that the inner product with w must be at least +1 for every positive data point and at most −1 for every negative one. That means I can draw a line on each side of the hyperplane: all the points on one side have inner product at least +1, and all the points on the other side at most −1. Does that make sense? Raise your hand if it does. Okay, awesome. This is my w, and now here comes the point: what does the objective say? It says minimize w^T w — take w and shrink it, make it shorter. What does that mean for those lines where w^T x equals ±1?
It means the lines move outwards: the points x have to be further away to reach the same inner product, because I'm multiplying with a smaller number — if w gets smaller, x has to be larger to get the same inner product. So if I shrink w, which is exactly what the objective is trying to do (give me the smallest possible w), this margin moves out. But with the suboptimal hyperplane from before, the margin can't move outwards any more, because this x already lies right on the line — I have no more wiggle room, I'm up against the corner, so I can't shrink w any further. Whereas if I rotate the whole hyperplane, I get more wiggle room and can push the lines out even further; in this case the maximum-margin hyperplane would lie somewhere right in the middle between these points, because then I can move the lines out as far as possible, and that means I get the smallest possible w^T w. Does that make sense? So it's exactly the same thing, just a different way of looking at it: the constrained problem basically says move the lines where the inner product is ±1 as far away as possible, but don't go any further — the moment you hit any data points you have to stop — and that means you arrive at the maximum-margin solution. [Student question] Yes, exactly: if you make the constraint soft and put it into the objective, it basically becomes the hinge loss, and then you have the loss function plus the regularizer — that's the soft-constraint SVM.

Any questions? Okay. So for the SVM it's pretty clear what the regularizer does: it says don't just find me any hyperplane, find me the hyperplane that gives me maximum margin. In general, a regularizer can be viewed as saying "I want a simple solution" — this is a notion of Occam's razor, a notion of simplicity. Here, the simplest solution is the one that sits right smack in the middle.
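Tying this back to ERM: a minimal sketch of the soft-constraint SVM as hinge loss plus L2 regularizer, minimized by subgradient descent. The toy data, step size, and λ are made up, and the bias is folded into w via a constant feature.

```python
import numpy as np

def svm_erm(X, y, lam=0.01, lr=0.1, steps=1000):
    """Minimize (1/n) * sum_i max(0, 1 - y_i * w^T x_i) + lam * w^T w."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w)
        active = margins < 1                           # hinge loss is nonzero here
        grad = -(y[active][:, None] * X[active]).sum(axis=0) / n + 2 * lam * w
        w -= lr * grad
    return w

# tiny separable toy set, labels in {-1, +1}; the last feature = 1 acts as the bias
X = np.array([[1.0, 2.0, 1.0], [2.0, 3.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -3.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = svm_erm(X, y)
print(w, np.sign(X @ w))    # all training points end up classified correctly
```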
Typically it's a little more general than that, and before we get into it, let me tell you a little trick from optimization theory. We have the following problem — let me write it here; can people read this in the back? Thumbs up? Okay, good. The optimization problem we're trying to solve is

    min_w  sum_i ℓ( h_w(x_i), y_i )  +  λ r(w).

For those of you who've done optimization theory, this is basically the Lagrangian formulation of an optimization problem that has constraints. What I mean is that you can rewrite it as something that looks slightly different:

    min_w  sum_i ℓ( h_w(x_i), y_i )   subject to   r(w) ≤ B.

For every λ there exists a B such that these two are equivalent, and for every B there exists a λ such that they are equivalent. It's not obvious what that B is, but trust me, it exists. Why is this the case? If I solve the first problem for a given λ, I arrive at some solution with some value of r(w). The larger λ is, the harder I try to make r(w) small; if λ is small, I don't care much about r(w) and mostly want to make the loss small — λ says how much I weigh off these two terms. Now I take the value of r(w) at that solution and make it exactly my B; if I then run the constrained optimization problem, I get exactly the same answer. The other direction is not as obvious — given a B, you'd have to try out a bunch of λ's — but it turns out one always exists. Any questions about this? Who believes me? Raise your hand. Okay, good — call it gullible, but it's actually true.

So let's view the regularizer this way: the regularizer is a constraint. The regularizer we've been looking at so far is the L2 regularizer, r(w) = w^T w, and from this perspective it has a very neat interpretation: it means that ||w||² must be less than or equal to some B that you specify. What does that mean? Picture the space of w: this axis is the first coordinate of w, this axis is the second — say w is just two-dimensional so I can draw it. Any point here is a candidate solution w. Does that make sense so far? Okay. Now we have some loss function over w that we're trying to minimize; say it looks like a valley, and here is its minimum. When we do gradient descent we start with some w and walk downhill until we hit the minimum — that's just minimizing the loss. What does the constraint say? w^T w ≤ B means w1² + w2² ≤ B — and that's a circle; actually in higher dimensions it's a ball, a hypersphere, of radius √B, and my w must lie inside it. So I'm saying: minimize this loss function, but the solution must lie inside this ball — that's what the constraint, i.e. the regularizer, is doing. So what is gradient descent going to do? It's trying to get to the unconstrained minimum, but it's not allowed to; it will find the closest feasible point and end up right here on the boundary of the ball. Does that make sense? And as you increase B, at some point it doesn't matter anymore: if B is really, really large, you just end up at the unconstrained minimum. That corresponds to λ being so small that the regularization has no effect anymore.
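One way to make the ball picture concrete is projected gradient descent: take gradient steps on the loss and, whenever you leave the ball ||w||² ≤ B, scale w back onto it. This is a sketch with made-up data, not how ridge regression is normally solved.

```python
import numpy as np

def projected_gd(X, y, B, lr=0.01, steps=5000):
    """Minimize the mean squared loss subject to ||w||^2 <= B."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * 2 * X.T @ (X @ w - y) / len(y)   # gradient of the mean squared loss
        norm_sq = w @ w
        if norm_sq > B:
            w *= np.sqrt(B / norm_sq)              # project onto the ball of radius sqrt(B)
    return w

# made-up data: with a huge B the constraint is inactive and we recover least squares,
# with a small B the solution gets pulled back onto the ball's surface
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.5])
print(projected_gd(X, y, B=1e6), projected_gd(X, y, B=0.5))
```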
But if your λ is large enough, then it will not find the unconstrained optimum; it will find the optimal solution under the constraint that you're inside this ball. That's the idea behind L2 regularization. Any questions? [Student: about B?] That's right — for every λ there exists a B, and for every B there exists a λ. [Student: how do you find λ in practice?] You just search over different values of λ on some holdout set and see which one does best. The point is that if you only minimize on the training data, you always get lower training error by decreasing λ — the regularization hurts your training performance — but it may improve your test performance. We will get into this very soon: once we're done with ERM we'll talk about the bias-variance decomposition, and then we'll get into exactly this trade-off between more regularization and less regularization. Any more questions? [Student] That's exactly right: a small λ means a larger B, and λ = 0 basically means B is infinite — the constraint is no longer active and w can be anything. Any more questions? Okay, I guess we are out of time; I will continue with this next week.
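The λ search described here is just a loop over a candidate grid, scored on a held-out split. A sketch with an arbitrary grid and random made-up data; ridge_fit is re-defined so the snippet is self-contained.

```python
import numpy as np

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def pick_lambda(X_train, y_train, X_val, y_val,
                grid=(0.01, 0.1, 1.0, 10.0, 100.0)):
    """Keep the lambda with the lowest squared error on the held-out split."""
    val_errors = []
    for lam in grid:
        w = ridge_fit(X_train, y_train, lam)
        val_errors.append(np.mean((X_val @ w - y_val) ** 2))
    return grid[int(np.argmin(val_errors))], val_errors

# made-up data and split, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=100)
best_lam, errs = pick_lambda(X[:70], y[:70], X[70:], y[70:])
print(best_lam)
```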
Info
Channel: Kilian Weinberger
Views: 14,669
Rating: 4.9316239 out of 5
Keywords: artificial intelligence, machine learning, cornell, cs4780, empirical risk minimization, erm, kilian weinberger
Id: AkmPv2WEsHw
Length: 46min 32sec (2792 seconds)
Published: Wed Jul 11 2018