Ridge Regression (L2 Regularization)

Video Statistics and Information

Captions
Hello everyone and welcome to this new video. Today's topic is ridge regression. I'm Gus and this is Endless Engineering, so let's dive right in.

Let me start by talking about what regression is; if you're not too familiar, feel free to check out my video on linear regression, I'll leave a link in the description below. In the regression problem we have an independent variable x and a dependent variable y, and we want to find the relationship between them. In this example we do what is called linear regression, where the relationship is assumed to be linear: y_i = x̄_i^T θ. The index i refers to each data point, so with m points i runs from 1 to m. Here x̄_i^T = [1, x_i, x_i^2, ..., x_i^n], so the model is an nth-order polynomial, and θ^T = [θ_0, θ_1, ..., θ_n] is the column vector of parameters (I write the transpose just so it fits on one line). Multiplying the two together gives my model.

A typical problem with an nth-order model like this is that the fit can pass through a lot of points and find very good solutions on the training data, but then it has overfit to the training data: when you deploy it in the real world and it encounters data it has never seen before, it doesn't know what to do and you get a bad estimate.

One solution from my prior video was to define a cost function J = Σ_{i=1}^{m} (x̄_i^T θ − y_i)², which is "my model minus all my measurements, squared", and solve for θ with gradient descent. This is essentially linear least squares, or ordinary least squares. If I do that I'll fit something, but it might overfit.

Ridge regression gives us a partial solution to this problem, so that I don't need to overfit. What I can do is add a term λ‖θ‖₂². The double bars around θ mean it's a norm, the subscript 2 means it's the 2-norm, and the square is just the square of that value. So the first term is fitting to the data, while this new term keeps the parameters as small as possible. Sometimes you'll hear the term shrinkage applied to this, or the term L2 regularization; these are all the same thing: regularization with the L2 norm means you're making your parameters small, you're regularizing them, or shrinking them.
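To make the pieces above concrete, here is a minimal sketch in Python/NumPy (not from the video; the helper names poly_design and ridge_cost are illustrative) of the polynomial design matrix and the ridge cost described in the captions:

```python
import numpy as np

def poly_design(x, degree):
    # Rows are [1, x_i, x_i^2, ..., x_i^degree], one row per data point.
    return np.vander(x, N=degree + 1, increasing=True)

def ridge_cost(theta, X, y, lam):
    # Sum of squared residuals plus lam * ||theta||_2^2 (the L2 / shrinkage term).
    residual = X @ theta - y
    return residual @ residual + lam * theta @ theta
```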
So how do I solve this problem? I can rewrite the summation in matrix form. I take the row x̄_i^T for every data point and stack them, x̄_1^T, x̄_2^T, all the way to x̄_m^T, to form a large matrix X, and I stack all my measurements y_1, y_2, ..., y_m into a large vector y. The summation term can then be written as (Xθ − y)^T (Xθ − y), which is the same sum of squared residuals. For the other term, the 2-norm is the square root of the sum of the squared elements, so when I square it the square root goes away and I can write it as θ^T θ.

So my cost function is J = (Xθ − y)^T (Xθ − y) + λ θ^T θ, and this λ is basically a hyperparameter that you have to tune for your specific problem. I won't go into too much detail about the techniques used to pick it; I'm just going to say it's a parameter you have to tune and select so that you can find the best model that fits your data.

So now, how do I find θ? We treat this as an optimization problem: we look for the θ that minimizes J, and the way to do that is to take the derivative of J with respect to θ. The first term is quadratic, so its derivative is 2 X^T (Xθ − y): a factor of 2 because it's quadratic, times the derivative of what's inside the bracket, which is X^T, times the bracket itself. The second term is also quadratic, λ is a constant that stays as is, and I get 2λθ. Now I set this derivative equal to zero. The 2's cancel, and multiplying the bracket through by X^T gives X^T X θ − X^T y + λθ = 0. I take X^T y to the other side, where it becomes positive, and I factor out θ from the remaining two terms to get (X^T X + λI) θ = X^T y. The reason I write λI, lambda times identity, rather than just λ is that these are matrices and this is matrix multiplication, so the scalar λ has to multiply an identity matrix of the same size as X^T X.

Now I can solve for θ: I invert this matrix and write θ = (X^T X + λI)^{-1} X^T y, and that gives me the θ that minimizes this cost function. So essentially what we've done here is solve the ridge regression problem by stacking all of our measurements and all of our variables into the matrices X and y, and we found a θ that depends on this λ. I know X, I know y; only λ is unknown, but I pick that as a parameter that I tune based on my model, and then I can solve for θ.
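Here is a short, hedged sketch of that closed-form solution, again assuming NumPy and reusing the import from the sketch above (the function name ridge_fit is illustrative); using np.linalg.solve rather than forming an explicit inverse is a numerical-stability choice on my part, not something stated in the video:

```python
def ridge_fit(X, y, lam):
    # Closed-form ridge solution: theta = (X^T X + lam * I)^{-1} X^T y.
    n_params = X.shape[1]
    A = X.T @ X + lam * np.eye(n_params)
    return np.linalg.solve(A, X.T @ y)  # solve the linear system instead of inverting A
```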
Now, what's interesting here is this λ. You can think of it as a weight added to the singular values of the X^T X matrix: the larger λ gets, the smaller the parameters get. I would recommend a positive value, typically between 0 and 1, as a starting point. Another thing to note is that if you set λ equal to 0, this term goes away and you're essentially solving an ordinary least squares problem. So ridge regression is ordinary least squares with a nice term tacked on top that does the shrinkage, or L2 regularization, of the weights, the parameters of my model. That way I don't overfit to the data and I get a nice model that can generalize in a good way, and obviously the more data you have, the better you generalize (a short numerical sketch of this follows at the end of these captions).

I hope you've enjoyed this Endless Engineering video, and if you did, hit that thumbs-up button. If you liked this, there's a lot more where that came from, so think about subscribing to the channel and hitting the bell; that way you get a notification every time we drop a new video. Thanks for watching.
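As a closing illustration of that last point (λ = 0 reduces the fit to ordinary least squares, and larger λ shrinks the parameter norm), here is a short sketch that reuses the poly_design and ridge_fit helpers from the sketches above; the sine-plus-noise data and the degree-10 polynomial are made-up illustration values, not from the video:

```python
# Noisy samples of a smooth underlying function (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(x.size)

# A high-degree polynomial that would tend to overfit without regularization.
X = poly_design(x, degree=10)
for lam in [0.0, 0.1, 1.0]:  # lam = 0.0 is plain ordinary least squares
    theta = ridge_fit(X, y, lam)
    print(f"lambda={lam}: ||theta|| = {np.linalg.norm(theta):.3f}")
```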
Info
Channel: Endless Engineering
Views: 6,904
Rating: 4.9502072 out of 5
Keywords: regression, ridge regression, linear regression, L2 regularization, shrinkage, least squares, ordinary least squares, OLS, Tikhonov regularization, bias-variance tradeoff, machine learning, statistics, data analysis, data science, statistical analysis, data
Id: skOcLw_fXDs
Length: 9min 54sec (594 seconds)
Published: Mon May 18 2020