Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets

Captions
Hello, thank you for joining me today. My name is Derek Kane, and today we're going to get into the topic of modern regression approaches. This tutorial is just one part of a broader series of lectures covering data science, predictive analytics, and machine learning. Okay, let's begin.

A brief review of the topics we're going to get into today: we'll talk a little bit about advancements in regression analysis, then we'll shift gears and focus on ridge regression, then we'll cover the lasso technique and elastic nets. Once we've done all of that, we'll get into a practical example where we walk through a data set related to prostate cancer.

If we continue to draw from OLS as our only approach to linear regression, then methodologically speaking we're still within the late 1800s and early 1900s timeframe. When most people learn about regression analysis, typically at the undergraduate or even the graduate level, we're really just looking at OLS. It's been around for a long time and it's a very powerful technique, but perhaps there's more. With advancements in computing technology, regression analysis can be calculated using a variety of different statistical techniques, which has led to the development of new tools and methods, and that's exactly the point we're going to get into today. The methods and techniques we cover will bring us up to date with advancements in regression analysis as a whole.

Now, one thing to consider: why can't I just use OLS? In modern data analysis we often have data with a very high number of independent variables. This is referred to as higher dimensional modeling, and we need better regression techniques to handle it; the techniques we're going to get into today work very well in this scenario.

Before we dive into the more advanced topics, I think it's important to do a very basic review of OLS and how to read the models, because we will get into some matrix notation and other ways of calculating regression a little later, and we need a firm understanding of all the components. If you want to look at this in greater detail, please refer back to the lecture on regression analysis earlier in the series, but for now let's take a look at the model.

This is the standard form: y = beta_0 + beta_1 * x_1 + e. Y is our dependent variable, the thing we're trying to predict. X_1 represents our independent variable, beta_1 is its coefficient, and beta_0 is a coefficient in its own right but is generally referred to as the intercept. The e is the error term: the value needed to account for the prediction error between the observed and predicted values. This error term will become more important as we dive into these techniques down the road. Just remember that when we make a prediction, our prediction isn't going to be correct for every observation, so the error term is kind of like a buffer that captures the difference between our predicted and observed values, and it is flexible.
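As a minimal sketch of what this looks like in practice (my own illustrative R example; the data frame, variable names, and coefficient values here are made up and are not the lecture's slide data):

    # Simulate illustrative data: predict weight from height
    set.seed(1)
    df <- data.frame(height = rnorm(100, mean = 70, sd = 3))
    df$weight <- -114 + 6.5 * df$height + rnorm(100, sd = 10)

    # Fit a simple OLS model and print the coefficient table
    fit <- lm(weight ~ height, data = df)
    summary(fit)$coefficients   # estimates, standard errors, t values, p values

The summary() output is the kind of coefficient table discussed next.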
The output of a regression analysis will produce a coefficient table similar to the one we have below. This table shows that the intercept is -114.326 and the height coefficient is 106.505, plus or minus 11.55, which is the standard error in this case. This can be interpreted as: for each unit increase in X, we can expect Y to increase by 106.5. Also, the t value and the p value indicate that these variables are statistically significant at the 0.05 level and can be included in the model. We want statistically significant variables, and we want their p-values to be less than 0.05.

Now let's take a look at a multiple linear regression formula. It's essentially the same as simple linear regression, except there are multiple coefficients and independent variables. The interpretation of a coefficient is slightly different than in simple linear regression: looking at the output of our OLS regression model, the interpretation is that each one-unit change in width increases Y by 94.56, holding all other coefficients constant. That idea of holding the other coefficients constant is what's important here.

Now that we've talked a little about the basic model structure for regression analysis, I want to get into the topic of ordinary least squares, at least as a refresher. So what is ordinary least squares? In statistics, ordinary least squares, or linear least squares, is a method for estimating the unknown parameters in linear regression. The goal of OLS is to minimize the differences between the observed responses in some arbitrary data set and the responses predicted by the linear approximation of the data. What it is trying to do is find a line through the data points that minimizes the errors. Visually, this is seen as the sum of the squared vertical distances between each data point in the set and the corresponding point on the regression line; the smaller the differences, and hence the squares, the better the model fits the data.

I just want to take one minute to look at this line. Imagine that I create my formula and it produces a line, represented by the blue line here. If I take each data point and draw a line connecting it to the regression line, I can then create a square from that line, and the sizes of the squares for all of the different data points are represented visually here. If I sum all of these squares, or rather their areas, that total is what I'm trying to minimize. So what ordinary least squares is essentially saying is: I'm producing the sum of squares that is as small as possible, and therefore a regression line that best fits all of these data points in a linear fashion.

Now that we've talked about OLS regression and the sum of squares, I want to talk a little more about the error term. The sum of squares is a representation of the error for our OLS regression model, and we're trying to minimize that degree of error. But when we think of linear regression models, prediction errors can actually be decomposed into two main components that we care about: the error due to bias and the error due to variance. We'll talk about these two points in the upcoming slides. Understanding bias and variance is critical for understanding the behavior of prediction models, but in general what we really care about is the overall error,
not the specific decomposition. It's important to understand bias and variance, but what we really care about is the general overall error. Understanding how different sources of error lead to bias and variance helps us improve the data fitting process, resulting in more accurate models, and it's the accuracy of the models that is ultimately what we're trying to produce with our predictive techniques.

I want to take a moment and talk about the bias-variance trade-off so we can really understand it visually, because I know these are abstract concepts. The error due to bias is taken as the difference between the expected prediction of our model and the correct value we're trying to predict. The error due to variance is taken as the variability of a model's prediction for a given data point: imagine you could repeat the entire model-building process multiple times; the variance is how much the predictions for a given point vary between different realizations of the model. That gives us a clue about how the variance component works, but let's take a moment and look at these archery targets and what they tell us.

If I have a model that is very accurate, which means its errors are going to be very low, I would expect to have low bias and low variance, and we see this in the top left-hand side. The bull's-eye would be the most accurate model, and we see all the different data points fall within that bull's-eye; that's how we know we have a very powerful model. Now look at the bottom left-hand side: low variance but high bias. The bias means the shots are off center, not hitting the bull's-eye but landing in this blue region, yet the spread of the points is very compact, just like we would see in the bull's-eye; they're just off center, and that off-center shift indicates a high degree of bias. If I have low bias but high variance, which we see in the upper right-hand corner, the points are generally spread around the center, so on average I'm hitting the center of the target, but the high degree of variance takes that compact cluster of points and spreads it out; even though it's centered, it's not as accurate as it could be. And if we have models with high bias and high variance, as we see in the bottom right-hand side of the image, the shots are not centered on the target, in this case landing in the upper part of it, and the spread isn't very compact either. When we look at the different sources of error and the trade-off between them, this image is a very good one to come back to.

So there is a trade-off between a model's ability to minimize bias and variance. Understanding these two types of error can help us diagnose model results and avoid the mistake of over- or under-fitting. The sweet spot for any model is the level of complexity at which the increase in bias is equivalent to the reduction in variance. Bias is reduced and variance is increased in relation to model complexity: as more and more parameters are added to the model, the complexity of the model rises and variance becomes our primary concern while bias steadily falls.
Looking at the image on the left-hand side, you can see where the trade-off is in terms of the complexity of the model. What we're trying to do is find the optimum point that minimizes the trade-off between variance and bias. An example of added complexity is including, say, polynomial terms in a linear regression; the model becomes more complex because we're performing mathematical operations on the independent variables, and the complexity naturally rises.

The Gauss-Markov theorem states that among all linear unbiased estimators, ordinary least squares has the smallest variance, and this implies that our OLS estimates have the smallest mean squared error among linear estimators with no bias. Well, this begs an important question: can there be a biased estimator with a smaller mean squared error? If I have an OLS model, I know I have the smallest variance possible among unbiased estimators, but could there be another model that reduces the error even further? That's really the class of regression models we're going to be talking about here.

As we dive deeper into these more advanced regression models, we first have to understand the concept of shrinkage estimators. Let's consider something that initially seems really crazy: we're going to replace our OLS estimate beta with something slightly smaller. In this case I have a formula where beta prime equals 1/(1 + lambda) times beta. What this tells us is that if lambda is 0, we get our OLS estimates back, and if lambda gets really large, the parameter estimate approaches a minimal value; in fact it approaches 0. Lambda is referred to as the shrinkage constant, or ridge constant, depending on how you're looking at it. In principle, with the right choice of lambda, we can get an estimator with a better mean squared error. The estimate is not unbiased, but what we pay for in the increase in bias we make up for in the variance.

To find the minimum lambda by balancing the two terms, we get the following expression. Therefore, if all the coefficients are large relative to their variances, we set lambda to be very small; on the other hand, if we have a number of small coefficients, we want to pull them closer to zero by setting lambda to be very large. If we're trying to establish a good value of lambda, with the optimum being this specification, do we ever really have access to this information? That's a really great question. Suppose we know sigma squared. In the early 1960s, Charles Stein, working with graduate student Willard James, came up with the following specification. I'm not going to get too much into the mathematics, and it does get a little mathematics-heavy in the next couple of slides, but just understand that they came up with this statistical specification. The formula was expanded upon by Stanley Sclove in the late 1960s. Sclove's proposal was to shrink the estimates to zero if we get a negative value, where x-plus here is the maximum of x and zero. If sigma squared is unknown, he proposed that we use the following criterion in its place, for some value of c. This formula can be re-expressed as shown, and then, expressing Sclove's estimate as the following specification, the statement reads that we will set the coefficients to zero unless F is greater than some value of c. Alternatively, the result shows that we set the coefficients to zero if we fail an F test with the significance level set by the value of c. If we pass the F test, then we shrink the coefficients by an amount determined by how strongly the F statistic rejects the null hypothesis.
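Before moving on, a small simulation can make the basic shrinkage idea concrete. This is my own illustrative R sketch (the true coefficient, lambda, and sample sizes are arbitrary choices, not values from the lecture): it repeatedly estimates a coefficient by OLS, applies the shrinkage factor 1/(1 + lambda), and compares mean squared errors against the true value.

    set.seed(42)
    true_beta <- 0.5      # a smallish true coefficient
    lambda    <- 0.5      # shrinkage constant (illustrative choice)
    n_sims    <- 5000

    ols_est    <- numeric(n_sims)
    shrunk_est <- numeric(n_sims)

    for (i in 1:n_sims) {
      x <- rnorm(30)
      y <- true_beta * x + rnorm(30, sd = 2)   # noisy data
      b <- coef(lm(y ~ x - 1))                 # OLS estimate of beta
      ols_est[i]    <- b
      shrunk_est[i] <- b / (1 + lambda)        # shrinkage estimator
    }

    mean((ols_est    - true_beta)^2)   # MSE of the unbiased OLS estimate
    mean((shrunk_est - true_beta)^2)   # often smaller: bias traded for variance

In settings like this one, the biased, shrunken estimate can indeed have a lower mean squared error than OLS, which is exactly the point being made above.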
These brilliant statisticians came up with a way for us to essentially run a statistical test on whether or not we should shrink a parameter, and it follows the formulas we just looked at. To me this is an absolutely brilliant piece of mathematical work. This preliminary testing procedure acts to either kill coefficients or keep them and shrink them. It is kind of like a model selection criterion, except that it also shrinks the coefficients it keeps, unlike the keep-or-kill rules we experience with AIC and BIC. Now, we know that simple model selection via AIC and BIC can be applied to regular regressions; we use it in our forward, backward, and stepwise approaches, and this is the formula for calculating those criteria, if you didn't already know. But what about these shrinkage parameters? Can we use them for determining coefficients?

Before we dive into that point, I want to introduce the topic of ridge regression. Ridge regression is a modeling technique that works to solve the multicollinearity problem in OLS models through the incorporation of this shrinkage parameter, the lambda we've been talking about. The assumptions of the model are the same as OLS: linearity, constant variance, and independence; however, normality does not need to be assumed. Additionally, multiple linear regression has no built-in way to identify a smaller subset of important variables; we use AIC and BIC as selection criteria, but in the regression formula itself there is no mechanism to identify a smaller subset of important variables.

In OLS regression, the equation y = beta_0 + beta_1*x_1 + beta_2*x_2 + ... + e can be represented in matrix notation as follows. I'm not going to go too far into the description of the mathematical terms, but just understand that we can represent this equation in matrix notation and rearrange it to solve for our coefficients, where R = X'X and R in this case is the correlation matrix of the independent variables. Here is the rearranged OLS equation again from the previous slide; I want to keep this equation sitting in the back of our minds, because once we get to the ridge regression statement of the formula, we'll understand why it's called ridge and we can build from it. Looking at this, the estimates are unbiased, so the expected values of the estimates are the population values, and the variance-covariance matrix of these estimates is specified as follows; this holds because we are assuming the Y's are standardized, which is to say sigma squared equals 1 in this case.

Now that we've covered the OLS statement, let's get into ridge regression. Ridge regression proceeds by adding a small value, lambda, to the diagonal elements of the correlation matrix. This is where ridge regression gets its name, since the diagonal of ones may be thought of as a ridge: if you look at the matrix, the main diagonal is where this ridge constant is added, creating a kind of peak similar to what we would expect to see at the top of a mountain. The specification is very similar to OLS, except that it now incorporates this ridge constant.
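As a minimal sketch of that matrix form (my own illustration, assuming standardized predictors and a centered response; the lambda value here is arbitrary): the ridge solution simply adds lambda to the diagonal of X'X before inverting.

    # Illustrative ridge solution via matrix algebra
    set.seed(7)
    n <- 50; p <- 3
    X <- scale(matrix(rnorm(n * p), n, p))           # standardized predictors
    y <- X %*% c(2, -1, 0.5) + rnorm(n)
    y <- y - mean(y)                                  # centered response

    lambda <- 0.1
    beta_ols   <- solve(t(X) %*% X)                    %*% t(X) %*% y
    beta_ridge <- solve(t(X) %*% X + lambda * diag(p)) %*% t(X) %*% y

    cbind(OLS = beta_ols[, 1], Ridge = beta_ridge[, 1])   # ridge estimates are shrunk

The only change from the OLS normal equations is the lambda * diag(p) term on the diagonal, which is the "ridge" being described.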
Now, lambda is a positive value less than one, usually less than 0.3. The amount of bias in the estimator is given by the following equation, and the covariance matrix is likewise given in matrix notation.

Now that we've talked about the basics of ridge regression, let's get into some of the diagnostics. One of the main obstacles in using ridge regression is choosing an appropriate value of lambda. The inventors of ridge regression suggested using a graphic which they call a ridge trace: a plot that shows the ridge regression coefficients as a function of lambda. When viewing the ridge trace, we are looking for the lambda at which the regression coefficients have stabilized. Often the coefficients vary widely for small values of lambda and then stabilize; we then choose the smallest possible value of lambda, which introduces the smallest bias, after which the regression coefficients seem to remain constant. If we look at the trace chart on the left-hand side, the various lines represent each coefficient, and we see that they start out wide, almost like a funnel tipped on its side, and as we increase lambda they begin to converge and stabilize into more of a straight line. If we were to extend the ridge parameter out toward infinity, we would see all of these coefficients converge to zero, but what we're looking for is the smallest lambda value where the coefficients stabilize, and we'll look at some examples in the next couple of slides.

As I said before, increasing lambda will eventually drive the regression coefficients to zero. In this example the values of lambda are shown on a logarithmic scale, and I've drawn a vertical line at the selected value of lambda = 0.006. We see that lambda has little effect until it reaches about 0.001, and the action seems to stop somewhere near 0.006. A couple of things to note: the vertical axis contains points for the least squares solution, and these are labeled as 0.00001. The reason the label is not zero is that we can't actually show zero on a logarithmic scale; the logarithm of zero is negative infinity, so we can't graph it, and that's why we plot those points at a very small value instead.

Alternatively, there is a procedure in R which automatically selects the lowest value of lambda. This is great, and it's very similar to the Box-Cox test we've gone over previously. The value is calculated using a generalized cross-validation procedure and is called GCV. The example on the left shows the value as 6.5, which is the lowest point on the curve; we're trying to identify that minimal point. One thing to note is that the range of lambda needs to be respecified for each model we build, because lambda can take a very small or a very large value, so whenever you're working with this, make sure you are respecifying the bounds in your procedure.

Here's a snippet of some R code that shows how this works. I'm using the select function around lm.ridge, which is how R names this particular function. I specify my regression model very much like I would for an OLS regression, but then I have to specify the lambda, and in this case the sequence of lambda is important: it says give me values between 0 and 1, increasing in 0.001 increments. If we don't see a good value between 0 and 1, we'll change that 1 to 100, maybe change that 0.001 to 0.1, and play around with it until we find what works for the model.
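The snippet described above is based on MASS::lm.ridge; a minimal sketch of that pattern might look like the following (the data frame and variable names are illustrative placeholders, not the lecture's actual objects):

    library(MASS)

    # Illustrative data frame standing in for the lecture's data
    set.seed(3)
    mydata <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
    mydata$y <- 1.5 * mydata$x1 - 0.8 * mydata$x2 + rnorm(50)

    # Fit ridge regressions over a sequence of lambda values, as described above
    ridge_fits <- lm.ridge(y ~ x1 + x2 + x3, data = mydata,
                           lambda = seq(0, 1, by = 0.001))

    select(ridge_fits)   # reports the lambda chosen by GCV (among other criteria)
    plot(ridge_fits)     # the ridge trace: coefficients as a function of lambda

If the GCV minimum sits at the edge of the sequence, widening the lambda range (for example seq(0, 100, by = 0.1)) is the "respecify the bounds" step mentioned above.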
Here is a visual representation of the ridge coefficients versus a linear regression. We can see that the size of the coefficients, which are penalized, has decreased through our shrinkage function; this is our penalization at work, and we'll get into it a little more later. It is also important to point out that in ridge regression we usually leave the intercept unpenalized, because it is not on the same scale as the other predictors; the lambda penalty is unfair if the predictor variables are not on the same scale. So typically in ridge regression, in order to treat the terms on an equal footing, we scale all of the predictors, create our model and parameter estimates, and then rescale back up to the original values of our independent variables. Whenever you're working with ridge regression, just remember: scale, build the model, then rescale when you're done.

We talked about variable selection procedures and how this lambda could come into play; now I want to talk more about variable selection. The problem of picking out the relevant variables from a larger set is really what variable selection is. Suppose there is a subset of coefficients that are identically zero; this means that the mean response doesn't depend on those predictors at all. The red paths on the plot on the left-hand side are the true nonzero coefficients, and the gray paths are the true zeros. The vertical dashed line is the point at which ridge regression's mean squared error starts losing to linear regression; once we move past that point, OLS becomes the better predictor. Note that the gray coefficient paths are not exactly zero: they are shrunken, but they are nonzero, and even though they converge near the zero line, they will never actually be zero. What's important when looking at this graph is that there are two groups of variables, the red and the gray; the gray ones will eventually converge around zero, but the red ones take much longer. What we're really talking about is true zero coefficients and true nonzero coefficients.

What does this mean exactly? We can show that ridge regression doesn't set the coefficients exactly to zero unless lambda equals infinity, in which case they are all zero. Therefore ridge regression cannot perform variable selection; it's just not built for it. Ridge regression performs well when there is a subset of true coefficients that are small or zero; this is an important consideration when working with it. It doesn't do as well when all of the true coefficients are moderately large, though it will still perform better than OLS regression. But ridge regression runs into its own problems when all of the true coefficients are large. So what can we do about it? Are there other techniques we can employ? It turns out there are. Now that we've talked about ridge regression, let's get
into the topic of the lasso. The lasso combines some of the shrinkage advantages of ridge regression with variable selection, so it overcomes some of the limitations of ridge regression. LASSO is an acronym for Least Absolute Selection and Shrinkage Operator. The lasso is very competitive with ridge regression in regards to prediction error, so both of them will give very strong MSEs, but we're going to gain some advantages with the lasso, as we'll see. The only difference between the lasso and ridge regression is the penalty: the ridge penalty term (L2) uses a beta-squared penalty, whereas the lasso penalty (L1) uses an absolute-value-of-beta penalty, and we'll get more into this later. We have to introduce these penalty terms in order to work with these techniques, and even though the L1 and L2 penalties look very similar, their solutions behave very differently, and it's the nuances of how these penalty terms work that give the lasso its strength.

The tuning parameter lambda controls the strength of the penalty, and just like ridge regression, the lasso coefficients equal the OLS estimates when lambda is zero and approach zero as lambda approaches infinity. For lambda in between these two extremes, we are balancing two ideas: fitting a linear model of Y on X and shrinking the coefficients. The nature of the L1 penalty causes some of the coefficients to be shrunk to exactly zero, and this "exactly zero" is very important in the lasso world; it's what makes the lasso different from ridge regression, because it is able to perform variable selection in the linear model. Imagine I have an independent variable and I'm multiplying it by a coefficient: if the coefficient is zero, I'm multiplying my independent variable by zero, which makes the whole term zero, and when that happens we're effectively removing that variable from our final model. So as the lasso shrinks the coefficients, those which settle on zero are selected out; the lasso is saying, hey, this variable no longer helps my model, eliminate it. That's the difference between the lasso and ridge: the lasso will take a coefficient exactly to zero, while ridge will only converge toward zero as lambda heads to infinity.

As I said, as lambda increases, more coefficients are set to zero, so fewer variables are selected, and among the nonzero coefficients more shrinkage is applied. So as our lambda gets higher, the lasso performs more of this shrinking on the coefficient estimates. Because the lasso sets coefficients to exactly zero, it performs variable selection in the linear model. If I look back at my ridge regression from earlier, the coefficients stabilize at a point and move down toward zero but never exactly hit the zero mark, whereas with the lasso, as we increase lambda it incorporates more of this penalty, and at higher values of lambda all of the variables converge to zero. This convergence actually gives us some indication of variable importance and strength: the variables that require the largest lambda values before converging to zero are the most desirable variables for the model, so we can use this plot as an indication of which variables are more important than others.
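A minimal sketch of fitting a lasso in R with the glmnet package (my own illustrative data; the lecture applies the same idea to the prostate data later): alpha = 1 gives the lasso penalty, and cv.glmnet picks lambda by cross-validation.

    library(glmnet)

    # Illustrative data: a matrix of predictors x and a numeric response y
    set.seed(1)
    x <- matrix(rnorm(100 * 8), 100, 8)
    y <- x[, 1] - 2 * x[, 2] + rnorm(100)

    lasso_fit <- glmnet(x, y, alpha = 1)     # alpha = 1 is the lasso penalty
    plot(lasso_fit, xvar = "lambda")         # coefficient paths versus log(lambda)

    cv_fit <- cv.glmnet(x, y, alpha = 1)     # cross-validation to choose lambda
    cv_fit$lambda.min                        # the lambda with the lowest CV error
    coef(cv_fit, s = "lambda.min")           # some coefficients are exactly zero

The zeros in the coefficient vector are the variable selection being described above.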
Now there are some alternative plots we can look at: we can use plots against the degrees of freedom to put the different estimates on an equal footing, if we want to compare these coefficients in a similar manner across techniques. Here ridge regression is plotted against the degrees of freedom, versus the lasso, and we see that the shapes themselves differ: ridge has these nice curves and linear lines coming out, whereas the lasso is more rounded in this case. But we can now look at them in an equal context, which is a great technique that's readily available in your statistical package of choice; I know it can be done in R, and I'm assuming in SAS as well. As before, the most important variables can be readily identified because they sit well above the 0.0 horizontal line.

It can be helpful to think about our penalty terms, the L1 and L2 parameters, in the following form. I want to get into these penalty terms because they can be confusing, but in order to understand the techniques, let's just look at this mathematical form. These lambda terms are the L1 and L2 penalties for the lasso and the ridge, but we can also think of the formula in a constrained, penalized form, where t is a tuning parameter; we've been calling this lambda so far, and you can think of it as a tuning parameter very similar to lambda. The usual OLS regression solves the unconstrained least squares problem; these estimators instead constrain the coefficient vector to lie in some geometric shape centered around the origin, and that constrained form, the geometric shape centered at the origin, gives us clues about how these models behave. We're going to look at a couple of these shapes. The shape we choose generally reduces the variance because it keeps the estimate close to zero, but the shape we choose really matters: it's not just an interesting visual representation, it's important in terms of how the shrinkage is performed.

In the case of a lasso regression, looking at this chart, the contour lines are the least squares error function and the blue diamond is the constraint region for the lasso; the lasso's constraint region takes a diamond shape, whereas in ridge regression the constraint region is a circle instead of a diamond. We'll look at how to interpret this chart in the upcoming slide; for now, just understand that these two techniques have differently shaped constraint regions. Here's a nice visual that gives a more detailed breakdown. The black points are the least squares coefficients, and the red ellipses around them are the contours of the RSS as it moves away from its minimum at the least squares coefficients. The lasso coefficients are the point at which the contours touch the constraint region: for the lasso, the outer contour connects at the tip of the diamond, and on the right-hand side, for ridge regression, the outer contour touches the circle at a different point. Those L1 and L2 penalty terms really are the constraint regions we're looking at here; it's the geometric shape we talked about before.
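To write the penalized and constrained forms down concretely, here is one standard way to express them (a notation sketch consistent with the slides' use of lambda and t, not a verbatim copy of the slides):

    \hat{\beta}^{OLS} = \arg\min_{\beta} \sum_{i} (y_i - x_i^{\top}\beta)^2

    \hat{\beta}^{ridge} = \arg\min_{\beta} \sum_{i} (y_i - x_i^{\top}\beta)^2 + \lambda \sum_{j} \beta_j^2
        \quad\text{equivalently}\quad \min_{\beta} \sum_{i} (y_i - x_i^{\top}\beta)^2 \ \text{subject to}\ \sum_{j} \beta_j^2 \le t

    \hat{\beta}^{lasso} = \arg\min_{\beta} \sum_{i} (y_i - x_i^{\top}\beta)^2 + \lambda \sum_{j} |\beta_j|
        \quad\text{equivalently}\quad \min_{\beta} \sum_{i} (y_i - x_i^{\top}\beta)^2 \ \text{subject to}\ \sum_{j} |\beta_j| \le t

The ridge constraint set (sum of squared betas at most t) is the circle or sphere, and the lasso constraint set (sum of absolute betas at most t) is the diamond; the corners of the diamond are what allow a lasso solution to land on a coefficient of exactly zero.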
As we move away from OLS regression and penalize these coefficients, eventually the coefficients will run up against one of these constraint regions, and that's why the shape matters.

Now that we've talked about ridge regression and the lasso, I want to shift focus and talk about the concept of elastic nets. These come into play when we are working with high-dimensional data, that is, data sets with a very large number of independent variables. Imagine you don't have a data set with 30 or 40 variables, but one with a thousand or ten thousand independent variables. A problem we're going to run into in the model building is that these variables will be correlated with other independent variables: not just correlations between the dependent and independent variables, but correlations amongst the independent variables themselves. When this happens, we have the problem of multicollinearity that we talked about with OLS regression. These correlated variables can sometimes form groups or clusters: if I have a thousand variables in my data set, I might have twenty variables that are all correlated with each other, and I could have several of these correlated pockets among the overall pool of variables.

Now there are many times where we want to include the entire group in the model selection if one variable from it has been selected. Imagine I'm looking at my data set and I find one variable that is very strong and I want it selected into the model; by removing the other variables that are correlated with it, perhaps I'm losing something in terms of interpretability and overall predictive performance. How can I address that? I like to think of this visually as the elastic net catching a whole school of fish instead of lassoing a single fish: if I'm trying to catch fish and I see one fish swimming around, I throw my net and collect the entire school, because even though they're correlated with each other, I want to look at the entire group. An example of a data set with this structure is a leukemia data set containing 7,129 genes and their correlation with a tumor type: I have a bunch of genes related to leukemia and an outcome variable that says whether or not you have leukemia, and within this set of independent variables, these genes, there are lots of correlated pockets, so we need regression techniques that can handle this. That's where the elastic net comes into play.

The total number of variables that the lasso's variable selection procedure can choose is bounded by the total number of samples in the data set. Additionally, the lasso fails to perform group selection: it tends to select one variable from a group and ignore the others. These issues with the lasso technique, the bound imposed by the size of the data and the inability to perform the group selection we just talked about, are limitations that have to be overcome in certain circumstances. So the elastic net forms a hybrid of the L1 and L2 penalty terms.
If I look at the mathematics behind the ridge and lasso coefficient determination, we have our L1 and L2 penalty terms, and the elastic net actually combines both of them; we see that highlighted toward the bottom in blue. Ridge, lasso, and elastic net are all part of the same family, with a penalty term of the following form. This is just a re-expression of the penalty term, but the way we re-express it, I think, makes a little more sense: if alpha equals zero, we have ridge regression; if alpha equals one, we have the lasso; and if alpha falls between 0 and 1, we have an elastic net. So we can use this alpha as a specification criterion when building these regression models. If we want to explore various strengths of an elastic net, we can set alpha anywhere between 0 and 1 depending on what we want: if we want it to behave more like a ridge, we move it closer to 0; if we want it to be more like a lasso, we move it closer to 1.

Getting back to that specification of the elastic net, I want to talk very briefly about some elastic net theory. What we've written so far is actually what we would call a naive elastic net, and unfortunately the naive elastic net doesn't perform well in practice: the parameters are effectively penalized twice, and that's really why we call it naive. To correct this, we can use the following function, where the coefficients of the elastic net equal (1 + lambda_2) times the beta of the naive elastic net; so we're actually calculating it in two separate steps.

Let's take a look at the visualization of the constraint region for the elastic net. If we remember the constraint regions from before, we had the diamond shape for the lasso and the circular shape for the ridge; the elastic net falls in between, so it's not quite circular and not quite a diamond, and depending on how you establish these thresholds it can flex between the two. Looking at the constraint region on the right-hand side, we see the elastic net represented in red, sitting somewhere in between, and depending on how we specify the alpha level it can flex its shape even more. We can also see that the elastic net organizes the coefficients, or the lasso rope, into organized groups, forming the framework of the elastic net. If I look at a lasso and just consider the shape, I imagine a cowboy taking the rope, spinning it over his head and throwing it out, and you can think of all of these lines as the shape of that throw. An elastic net, by contrast, is more rigid, with tighter, more linear lines. I tend to be a visual learner, so when working with these techniques, just think of the elastic net as a tighter, more linear representation, while the lasso is looser.

So let's take a final look at the visualizations for ridge regression, the lasso, and elastic nets. Looking at the ridge trace chart on the left-hand side, and this is a fictional data set, we see that the ridge trace converges toward zero at a certain point in these nice smooth lines; it's more linear in how it behaves. The lasso is looser: the coefficients hit zero at different points because of the nature of the penalty term applied to them, not in that smooth linear fashion. An elastic net falls somewhere in between, so we see more linearity in the shape than a lasso but more flex than a ridge. Just understand that these visuals of the traces help to denote the shape and to understand the strength of the variables and how they behave across the trace.
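Since glmnet parameterizes this whole family with alpha, a minimal sketch of moving between the three fits might look like this (illustrative simulated data again; the alpha = 0.5 value is just one possible blend, not a recommendation from the lecture):

    library(glmnet)

    set.seed(2)
    x <- matrix(rnorm(100 * 8), 100, 8)
    y <- x[, 1] - 2 * x[, 2] + rnorm(100)

    ridge_fit <- glmnet(x, y, alpha = 0)      # pure ridge (L2) penalty
    lasso_fit <- glmnet(x, y, alpha = 1)      # pure lasso (L1) penalty
    enet_fit  <- glmnet(x, y, alpha = 0.5)    # elastic net: a blend of the two

    # Compare the coefficient paths of the three penalties side by side
    par(mfrow = c(1, 3))
    plot(ridge_fit, xvar = "lambda", main = "Ridge")
    plot(lasso_fit, xvar = "lambda", main = "Lasso")
    plot(enet_fit,  xvar = "lambda", main = "Elastic net")

Plotted this way, the smooth ridge paths, the abrupt zeros of the lasso, and the intermediate behavior of the elastic net mirror the trace comparison described above.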
We've spent a lot of time talking about modern regression theory; I now want to go through and apply these techniques to a data set, and we're going to take a look at prostate cancer. There are many forms of cancer which humans contract throughout their lifetimes, and prostate cancer is a more prevalent form which occurs in males. Having an understanding of the variables and measurements which impact the development of this cancer could aid researchers in developing medical treatment options. The data set we will be working with contains the level of a prostate-specific antigen, which in this case is our dependent variable, and a number of clinical measures in men who were about to receive a radical prostatectomy. The goal of this exercise is to examine various predictive models that can be leveraged to help model these relationships.

As with all my tutorials, I like to start by getting an understanding of the data we're working with, so here is a description of the variables within the data set. Our goal is to develop various regression models: we're going to build a linear regression, a ridge regression, a lasso, and an elastic net, and we're trying to predict the lpsa variable from all of the other variables. Let's take a look at the raw data: we can see it's all numeric in nature, which is good, I see no specific outliers jumping out at me, and it looks like the data I would expect. To do this correctly we would go through an extensive EDA process, which I covered in previous lectures, but here I just want to get a sense of the numbers I'm working with.

For brevity's sake, I'm not going to go through all of the diagnostics, the different approaches to linear regression, the residuals, and all the statistical tests; I'm going to get right to it. First we create a linear regression model to assess the predictive performance. I bring in all of the coefficients so I can see the parameter estimates, the various p-values, and so on. Then I pare the model down to a smaller set: I looked at all of the parameter estimates with statistically significant p-values and kept those, and if a parameter did not have a statistically significant p-value, I eliminated it from the model. What we end up with is a smaller group of coefficients to work with; there are many different approaches you could take, but that's the one I took for this exercise. In order to assess the predictive performance, we calculate the mean squared error, or MSE, for this linear regression model, and when we did this we got a value of 0.49262.
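A minimal sketch of this baseline step in R (assuming the prostate measurements are already loaded in a data frame named prostate with the response column lpsa; the object names and the in-sample error calculation are my own simplification, not necessarily how the lecture's code was structured):

    # Baseline OLS model predicting lpsa from all other variables
    ols_fit <- lm(lpsa ~ ., data = prostate)
    summary(ols_fit)                       # inspect estimates and p-values

    # Mean squared error of the fitted model on the data used here
    ols_mse <- mean((prostate$lpsa - predict(ols_fit, prostate))^2)
    ols_mse

The reduced model described above would simply drop the non-significant terms from the formula and recompute the same error measure.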
Now I want to see if I can further improve this mean squared error by exploring a ridge regression model. The first step is to calculate the minimum value of lambda to use in the ridge model. After flexing the boundaries of the lambda values, as we discussed, I determined that 6.5 would be the ideal value. We then create our ridge trace to see where the coefficients converge toward zero. The ridge trace I'm showing here has lambda ranging between 0 and 1,000, so we can see just by looking at the shape that we still haven't reached full convergence, but we can see exactly where it stabilizes. The next ridge trace shows a subset of the original, isolating the values of lambda between 0 and 10. In this case I'm bringing all of the variables into the procedure, and when I apply my lambda value of 6.5, that's where the vertical dashed line sits. I then calculated the mean squared error of this model to be 0.4601.

Let's now build a lasso model; maybe we can improve things further. I'm going to use the R package glmnet for the lasso by setting the alpha parameter equal to 1. There are diagnostics that give me the minimum lambda value to use for calculation purposes, along with various diagnostic plots; in this case the minimum lambda value was determined to be 0.039. A comparison of the shrinking of the coefficients between the ridge regression and the lasso can give us some clues about the significance of the variables. We talked about this variable selection procedure a little earlier, but I like to actually look at these coefficients because they help me understand the variables better. The variables lcp and gleason have been reduced to 0, which effectively eliminates them from the lasso model. Also, please note that the variables age and pgg45 are very close to a zero value, which indicates that they were less significant than the other variables. I use the term significant loosely here: I feel these variables are less important for understanding the overall relationship, though in a statistical sense they remain in the model because they were not shrunk to zero. The regression model utilizing the lasso produced a mean squared error of 0.4725.

Finally, let's assess a regression model utilizing an elastic net. We built the elastic net model using the R package glmnet by setting the alpha parameter to 0.2. There are many different thresholds we could choose, remember, anywhere between zero and one; in this case I simply chose the smaller value of 0.2. When I run this, I get a number of diagnostic plots and components to look at, and I also get a value for the minimum lambda, which was roughly 0.112 for the elastic net. Notice the rigidness of the shape of the elastic net coefficients as compared to the lasso when plotted against the penalized L1 norm. These diagnostic plots from the glmnet procedure in R give us an idea of the linearity and rigidity of the shapes of these lines: you can see that the elastic net behaves in a much more linear fashion, while the lasso is looser, and this is exactly what I would expect to see. I think this is a nice visual way to highlight the differences between the two techniques for non-technical audiences; understanding when to use a lasso, an elastic net, or a ridge regression can be very difficult for non-practitioners, but these visualizations really highlight how the approaches are working, so definitely consider using them in your analysis and your post-analysis.
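A hedged sketch of the lasso and elastic net steps with glmnet (again assuming a prostate data frame with lpsa as the response; the alpha = 0.2 value mirrors the choice described above, while the exact lambda and error values will depend on the cross-validation folds and may not match the lecture's numbers):

    library(glmnet)

    x <- model.matrix(lpsa ~ ., data = prostate)[, -1]   # predictor matrix, intercept dropped
    y <- prostate$lpsa

    # Lasso: alpha = 1, lambda chosen by cross-validation
    cv_lasso <- cv.glmnet(x, y, alpha = 1)
    cv_lasso$lambda.min                        # reported as about 0.039 in the lecture
    coef(cv_lasso, s = "lambda.min")           # some coefficients shrink to exactly 0
    lasso_mse <- mean((y - predict(cv_lasso, newx = x, s = "lambda.min"))^2)

    # Elastic net: alpha = 0.2 blends the L1 and L2 penalties
    cv_enet <- cv.glmnet(x, y, alpha = 0.2)
    cv_enet$lambda.min
    enet_mse <- mean((y - predict(cv_enet, newx = x, s = "lambda.min"))^2)

    c(lasso = lasso_mse, enet = enet_mse)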
The regression model utilizing the elastic net produced a mean squared error of 0.47115.

Now that we've produced these different models, let's summarize and recap what we've seen from them. Here is a comparison of the new coefficient values and the predictive performance ranking between the regression methods. In the coefficient table on the left-hand side, the values highlighted in red are equal to zero and have been removed from the final model, so we can see that certain variables are eliminated in both the lasso and the elastic net. Remember that ridge regression doesn't have that variable selection property, so we will never see its coefficients at exactly 0.00 unless all of them are zero. What I find much more interesting is the diagram on the right-hand side, which compares the predictive performance. We have an ordinary least squares regression model, reduced based on the statistical significance of the variables, and its mean squared error is 0.49, which is very good. But when we incorporate the lasso, the elastic net, and the ridge regression, we find that the predictive performance is better using these techniques than using OLS, and in this case, with the way we approached the modeling, the ridge regression was actually the strongest performer of them all.

This is a very good reason to explore a different class of regression models and not be so narrow-minded as to focus on OLS exclusively. As the data we use becomes higher dimensional, with more independent variables, we as practitioners are required to build models that are stronger predictors, closer to the reality underlying the data we have, and we have to draw from these techniques. That's why I think ridge regression, elastic nets, and lassos are very important tools in our analytical toolbox.

I'd like to thank everyone for joining; it's been a lot of fun talking about this class of models with you. Please feel free to check out some of my other tutorials and subscribe to the channel. Thank you very much.
Info
Channel: Derek Kane
Views: 89,164
Rating: 4.9008622 out of 5
Keywords: LASSO, lasso, elastic net, ridge, regression, bias, variance, machine learning, predictive analytics, statistics, matrix, OLS, collinear, multicollinearity, Ridge Regression, Least Squares, Econometrics, model, Linear Regression, multivariate, multivariate regression
Id: ipb2MhSRGdw
Length: 64min 52sec (3892 seconds)
Published: Wed Feb 11 2015