CSC411/2515 Why L1 regularization drives some coefficients to 0

Captions
In this video I want to give an intuitive argument for why, when you apply L1 regularization, what usually happens is that some of your coefficients get driven to zero, whereas with L2 regularization that does not tend to happen.

Let's say that we're trying to predict height, and the input variables we're using are shoe size and weight. I'll invent the units in such a way that, roughly, shoe size plus weight equals height; I'll just rescale the shoe size and the weight so that that works out, and of course that will only be approximate. Another thing I'll say is that the shoe size predicts the height a lot better: the height is exactly two times the shoe size, but the height is only approximately two times the weight, so there is some variance there. For example, if the shoe size is 3, the height is definitely 6 and the weight is approximately 2.8; if the shoe size is 4, the height is 8 and the weight is, say, a bit of an overestimate, something like 4.1; and if the shoe size is 1, the height is definitely 2 and the weight is, let's say, 1.43 (I'm just making these numbers up). So you can predict the height from the weight, just not as well as you can predict it from the shoe size.

Okay, so let's say you're trying to predict height, and we set up a linear regression such that h = alpha_1 * s + alpha_2 * w, where s is the shoe size and w is the weight. You can see that lots of settings would work here: alpha_1 = 2, alpha_2 = 0 would work great, but alpha_1 = 1, alpha_2 = 1 would kind of work as well.

So now let's see what happens when we apply regularization. Our cost function is the error in the prediction, which for a data point is (alpha_1 * s + alpha_2 * w - h)^2, plus the regularization term, so:

cost = (alpha_1 * s + alpha_2 * w - h)^2 + lambda * (alpha_1^2 + alpha_2^2)

What this cost says is: if the prediction error is large, that's bad, and if lambda * (alpha_1^2 + alpha_2^2) is large, that's also bad; we're trying to make both terms small at the same time.

So what happens as lambda grows? We still want the prediction to be as accurate as possible, which means we want alpha_1 + alpha_2 to be about equal to 2, because 1 plus 1 works, 2 plus 0 works, and 0 plus 2 also kind of works. So the error term guides us toward that constraint. What does the regularization term say? It says: just make both alpha_1 and alpha_2 as small as possible. Now, given that constraint, what's a good way to satisfy it? Well, if alpha_1 is 2 and alpha_2 is 0, the penalty is 4; if alpha_1 is 0 and alpha_2 is 2, the penalty is also 4; but if they're both 1, the penalty is just 2. In fact, subject to the constraint, the penalty is as small as possible when alpha_1 = alpha_2 = 1, because the minimum of alpha_1^2 + alpha_2^2 subject to alpha_1 + alpha_2 = 2 is attained at alpha_1 = alpha_2 = 1, and that is roughly what you'll get for a reasonable value of lambda when you minimize this cost.
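Here is a minimal sketch of that last comparison in Python. The candidate coefficient pairs are hypothetical points chosen to satisfy the constraint alpha_1 + alpha_2 = 2 from the argument above; the code is an illustration, not something from the video:

```python
# Candidate settings that all satisfy a1 + a2 = 2, i.e. settings that
# predict height about equally well on the lecture's toy data.
candidates = [(2.0, 0.0), (1.5, 0.5), (1.0, 1.0), (0.5, 1.5), (0.0, 2.0)]

for a1, a2 in candidates:
    l2_penalty = a1 ** 2 + a2 ** 2  # what L2 regularization charges
    print(f"a1 = {a1}, a2 = {a2}: L2 penalty = {l2_penalty}")

# The L2 penalty is 4 at (2, 0) and (0, 2) but only 2 at (1, 1):
# along the line a1 + a2 = 2, a1^2 + a2^2 is smallest when the
# coefficient is split evenly between the two features.
```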
Okay, now let's say the cost is instead

cost = (alpha_1 * s + alpha_2 * w - h)^2 + lambda * (|alpha_1| + |alpha_2|)

The error term still says alpha_1 + alpha_2 should be 2. So if alpha_1 + alpha_2 is 2, when is the L1 penalty as small as possible? Well, it doesn't matter: if alpha_1 + alpha_2 = 2 and they're both positive numbers, then you can trade off the magnitude of alpha_2 against the magnitude of alpha_1 and the penalty |alpha_1| + |alpha_2| stays constant. What does that mean? It means that when you minimize this cost, the error term gets to break the tie: because you do better when you just use the shoe size to predict the height, alpha_1 is going to become 2 and alpha_2 is going to become 0. The penalty doesn't care how you trade alpha_1 against alpha_2; it's the error term that wants alpha_1 + alpha_2 to equal 2 and the prediction to be as accurate as possible, and that makes alpha_1 be 2 and alpha_2 be 0.

So this is what we mean when we say that with L1 regularization you will tend to see some coefficients driven to zero: the coefficients that get driven to zero correspond to features that do not predict the outcome as well, and the coefficients of the features that do predict the outcome well are going to stay large.
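To see both behaviors end to end, here is a hypothetical sketch of the lecture's toy problem using scikit-learn; the library, the regularization strengths, and the exact weight numbers (my reading of the made-up values above) are all illustrative assumptions, not anything shown in the video:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Toy data in the spirit of the lecture: height is exactly 2 * shoe_size,
# but only approximately 2 * weight.
s = np.array([3.0, 4.0, 1.0])    # shoe size
w = np.array([2.8, 4.1, 1.43])   # weight: a noisy copy of shoe size
h = np.array([6.0, 8.0, 2.0])    # height
X = np.column_stack([s, w])

# fit_intercept=False matches the model h = alpha_1*s + alpha_2*w above;
# the alpha values here are illustrative choices of lambda.
ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X, h)
lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, h)

print("ridge coefficients:", ridge.coef_)  # both near 1: L2 splits the weight
print("lasso coefficients:", lasso.coef_)  # shoe size near 2, weight near 0
```

With two nearly redundant features like these, ridge spreads the coefficient across both while lasso concentrates it on the better predictor, which is exactly the tie-breaking behavior described above.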
Info
Channel: Michael Guerzhoy
Views: 2,531
Rating: 4.8688526 out of 5
Id: iqXEnO2a-no
Length: 7min 17sec (437 seconds)
Published: Mon Jan 22 2018