When Should You Use Regression Methods?

Captions
Linear and logistic regression are some of the most commonly used methods in applied statistics and data science. For newcomers, though, a question that doesn't get asked nearly enough is when and why you should use these methods in the first place. Well, keep watching, because we're going to talk about that. [Music]

Hello again everyone, and welcome back to my channel. If this is your first time here, my name is Richard, and this is Richard On Data, the channel where I help you break into data science and grow your skills in the field. I know this happens, because I saw it myself in academia, but a lot of people will start out by learning regression through the math and the matrix interpretation: the estimate of the slope vector is equal to the inverse of X transpose times X, times X transpose times y, that is, beta-hat = (X^T X)^(-1) X^T y. Interesting. I'm not knocking understanding that by any means; that's not what I'm trying to do here. There is a tremendous amount of value that comes from understanding the mathematical underpinnings of regression, or quite frankly of any model you choose to use. But you'll quickly find yourself in a situation where you're trying to solve a real-world problem for a client, and you don't want to catch yourself asking: is regression truly the best way to attack this problem?

So I'm going to give a super brief overview of what linear and logistic regression are, broaden that discussion to supervised learning and what the goals of these types of exercises are in the first place, and then cover how you should think through these problems and how to attack them. Before I do that, though, subscribe to my channel if you haven't already, hit the notification bell, and smash the like button for the YouTube algorithm. Then, if you guys would like to support my channel above and beyond that, I'll have links in the description of the video to my PayPal and Patreon accounts, as well as to my crypto wallet addresses. Your support is highly appreciated.

All right, so let's start off with what the regression methods even are in the first place; if you're familiar with this already, feel free to skip ahead in the video. Let's start from the premise that we have a data set that's in tidy format, that is, every cell contains a value, every row contains an observation, and every column contains a variable. Then let's suppose we have one variable that represents some kind of outcome or response that we're interested in, and we're typically going to represent this with the letter y. Now, that outcome variable might be continuous or discrete. If it's continuous, then you're looking at a linear regression problem. If your response variable is discrete, though, that gets a little trickier. Suppose your response variable has two distinct levels; often we'll just call one of those levels "success" and the other "failure," and under those circumstances your data set is a candidate for what's known as binomial logistic regression, which is often just referred to simply as logistic regression. We'll cover that more in a little bit. Granted, if that response variable has more than two levels, you have extensions in multinomial logistic regression, and there are ways to approach that problem depending on whether those levels are nominal or ordinal. But anyway, back to the whole rest of the data set: all the rest of the variables go by a lot of names, such as explanatory variables, features, covariates, and predictors. My favorite term is "predictors," but let's also refer to these as the x's.
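To make that mapping from response type to method concrete, here is a minimal sketch in R. The data frame `df` and its columns `y`, `x1`, and `x2` are hypothetical stand-ins, not anything from the video:

```r
# Hypothetical tidy data frame `df` with response y and predictors x1, x2

# Continuous y: linear regression, fit by ordinary least squares
fit_linear <- lm(y ~ x1 + x2, data = df)

# Binary y (success/failure): binomial logistic regression
fit_logistic <- glm(y ~ x1 + x2, data = df, family = binomial)

# Three or more nominal levels: a multinomial extension, e.g. nnet::multinom()
# Ordered levels: an ordinal model, e.g. MASS::polr()
```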
In fact, for now, let's just suppose we have two variables: the response variable y and a predictor x. For this example, we're going to use the super basic mpg data set from R. Suppose you want to understand the relationship between these two variables. The first thing you're probably going to want to do is create a scatter plot of them, and you end up with something like this: you very quickly get the general idea that as the x variable, displacement, increases, the y variable, highway miles per gallon, decreases. You can also generate the correlation coefficient, which is one of the most frequently used metrics even if it's not the most interpretable in and of itself; it's -0.766 here. So already we're beginning to learn a few pieces of information about the relationship between the y variable, highway miles per gallon, and the x variable, displacement.

A regression model is going to formalize our understanding of this relationship. You fit a linear regression model, generate output like this, and suddenly, yes, our understanding of that relationship is formalized, and we have tangible information we can take away. Most importantly, the estimate of the slope parameter on displacement is -3.5306, meaning that for every one-unit change in engine displacement, we expect highway miles per gallon to decrease by 3.5306 units. Also, multiple R-squared is 0.5868, meaning 58.68% of the variation in highway miles per gallon is explained by this model, specifically by the linear relationship with engine displacement. That super simple model was created, like all classical regression models, through the concept of least squares: finding the line that minimizes the sum of the squared residuals, residuals being the differences between our observed values (the y's) and the fitted values (call them y-hat). In symbols, we minimize the sum of (y_i - y-hat_i)^2.

We can take this same concept and essentially project it into multiple dimensions, where the effects of multiple different variables on the response are analyzed and controlled for. Take, for example, this model, where once again the response variable is highway miles per gallon, but on top of the displacement variable we had in the model before, we also introduce variables for the number of engine cylinders, the drivetrain, and the class of car. Suddenly we have a lot more information. Consider the variable class. This is a factor variable that estimates the effect on highway miles per gallon for different classes of car, all relative to some baseline. You have to do a little bit of digging through the data set to find out that the baseline is the "2seater" class. So, holding everything else constant, a compact class is expected to have a highway miles per gallon 3.8536 less than a two-seater. Also of interest is the fact that the estimate of the slope parameter for displacement has changed a lot: recall that in our previous example it was -3.5306, and now it's -0.3203. So there you go. Are you getting the idea so far that linear regression really helps you understand relationships, and how a particular variable of interest affects some response variable?

We don't have to repeat this same exercise for a logistic regression, because you essentially get the same kind of information from it; I'll have a link in the description of this video to an R tutorial I've done where I interpret every single piece of output from both linear regression and logistic regression. But in short, you get to understand how the log odds of the outcome, that is, the natural log of the probability of success divided by the probability of failure, changes with various predictors. It's a very similar premise to linear regression, even if the mechanics are a little bit different.
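For reference, here's roughly how that whole exercise looks in R, using the mpg data set from the ggplot2 package. The simple model reproduces the numbers quoted above; the multiple-regression formula is my reconstruction, since the exact formula isn't spelled out in the video, and the binary outcome at the end is a purely hypothetical example to show where the log odds come in:

```r
library(ggplot2)  # provides the mpg data set

# Scatter plot of engine displacement vs. highway miles per gallon
ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# Correlation coefficient (about -0.766)
cor(mpg$displ, mpg$hwy)

# Simple linear regression: slope about -3.5306, multiple R-squared about 0.5868
fit_simple <- lm(hwy ~ displ, data = mpg)
summary(fit_simple)

# Multiple regression, adding cylinders, drivetrain, and class of car;
# class enters as a factor, with "2seater" as the baseline level
fit_multi <- lm(hwy ~ displ + cyl + drv + class, data = mpg)
summary(fit_multi)

# Hypothetical binary outcome, just to illustrate logistic regression:
# the log odds of being above the median in highway miles per gallon
mpg$efficient <- as.integer(mpg$hwy > median(mpg$hwy))
fit_logit <- glm(efficient ~ displ, data = mpg, family = binomial)
coef(fit_logit)  # coefficients are on the log-odds scale
```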
Now I want to take a moment, zoom out to 30,000 feet, and look at the broader family of methods that linear and logistic regression fit into in the first place, and at what the overall purpose of any of these methods can be. Regression models are a type of supervised learning method. What is supervised learning, you might ask? Broadly speaking, it's any type of problem where you have the kind of data set I've described up to this point: a response variable and a set of potential explanatory variables, or predictors. Contrast that with unsupervised learning, where there is no known response variable at all. Often in those problems our goal is dimensionality reduction, so we might use a technique like principal component analysis; probably more often, our goal is to create our own classes or clusters.

When you've got one of those supervised learning problems with a fully defined response variable, there are really two types of things you might want to do: inference and prediction. I've done a whole video on the difference between these two, where I go into detailed examples; you can find that in the description, but I'll briefly go over the difference here. If your goal is inference, then you want to infer how the various predictors, or often just one single predictor, influence the response variable. You want to understand that relationship, or those various relationships, as well as possible, so you probably care most about maximizing the goodness of fit of the model. There are various metrics for that, such as adjusted R-squared, pseudo R-squared for logistic regression, AIC, and BIC; the list goes on and on. Prediction, on the other hand, is when you want to build a model that can predict the response variable with the greatest accuracy. Suppose you were given a brand new data set where you knew all the predictors, but the response variable was at that time unknown. You would want to be able to take your model, run it, and get as close to the true response variable as possible; there are tons of real-life scenarios where that kind of thing applies. If you have a continuous response variable, your goal is probably going to be to minimize a metric like root mean square error (see the sketch below). Granted, you may run into circumstances where predicting too high is worse than predicting too low, or vice versa, but that's usually the kind of approach you're going to take. If your response variable is categorical, though, things get a little more complicated and nuanced. I talk about these considerations in my R tutorial series on caret, which will be in the description too, but that's broadly a whole separate video topic.

You'll notice that almost everything I've described so far about what a regression method has going for it pertains specifically to inference. That really is where regression methods shine: they dissect the variation in the response variable in linear terms, and they're easily interpretable. You could say they're so simple even your clients will understand them. But these and other statistical methods aren't necessarily designed for prediction, and yes, there usually is a trade-off between inference and prediction. Granted, it's difficult to find studies or rigorous analyses on this, because every problem is a little bit different, but it's generally agreed that machine learning methods that rely on novel algorithms and detect complex patterns, for example neural networks, are going to be superior for predictive purposes.
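Returning to the prediction goal for a moment, here is a minimal sketch of computing root mean square error on held-out data. The 80/20 split, the seed, and the model formula are my own illustrative choices, not from the video:

```r
library(ggplot2)  # for the mpg data set

set.seed(42)

# Hypothetical 80/20 train/test split
idx   <- sample(nrow(mpg), size = floor(0.8 * nrow(mpg)))
train <- mpg[idx, ]
test  <- mpg[-idx, ]

# Fit on the training data only, then predict the held-out response
fit  <- lm(hwy ~ displ + cyl, data = train)
pred <- predict(fit, newdata = test)

# Root mean square error: how far off the predictions are, on average
sqrt(mean((test$hwy - pred)^2))
```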
Now, there is one possible approach to regression that sort of strikes a balance between the two goals, though, and that's regularization. Specifically, the two most common ways we can skin this cat are L2 regularization, also known as ridge regression, and L1 regularization, also known as lasso regression or just "the lasso." This is another topic where I could do a whole separate video on the distinction between the two, but I'll try to boil it down as simply as possible. In essence, both of these impose a penalty for having too-large estimates of the slope coefficients: as your slope coefficient estimates increase, that leads to high variance and overfitting, and you're trying to correct for that by introducing some bias. The single most important distinction between these two methods is that the lasso can actually shrink your slope estimates all the way to zero. Remember that if the slope, representing the rise in the response over the run of the predictor, is zero, then that predictor is essentially useless, because it means there's literally no linear relationship between the predictor and the response. So the lasso is actually helping us perform feature selection, and can even get rid of some variables in our model. Ridge regression doesn't do that; you may end up in a scenario where your slope coefficient estimates get pretty close to zero, but they're never actually going to go all the way to zero, so ridge regression does not double as a feature selection approach in the way that the lasso does.

If you take either of these two regularization approaches, you'll finish up and still have a regression model, and it's probably going to perform a little bit better for predictive purposes, but you've lost some of that beautiful interpretability that comes with the ordinary least squares solution. That's because you now have biased slope coefficient estimates; how exactly are you going to interpret those and explain them to people? Not to mention that if you're using the lasso, you might have dropped one of the very terms whose influence on the response variable you wanted to understand in the first place. So you see, this is all a trade-off, and it all comes down to understanding what the purpose of your analysis is in the first place.
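As a minimal sketch of both flavors of regularization, here is how they might look with the glmnet package, reusing the mpg variables from before; the formula and the cross-validated lambda are illustrative assumptions, not choices made in the video:

```r
library(ggplot2)  # for the mpg data set
library(glmnet)

# Model matrix of predictors (drop the intercept column) and the response
x <- model.matrix(hwy ~ displ + cyl + drv + class, data = mpg)[, -1]
y <- mpg$hwy

# alpha = 0 gives ridge (L2); alpha = 1 gives the lasso (L1);
# cv.glmnet picks the penalty strength lambda by cross-validation
ridge <- cv.glmnet(x, y, alpha = 0)
lasso <- cv.glmnet(x, y, alpha = 1)

# Ridge shrinks coefficients toward zero but never exactly to zero;
# the lasso can zero some out entirely, performing feature selection
coef(ridge, s = "lambda.min")
coef(lasso, s = "lambda.min")
```

In the printed lasso coefficients, any term shown as a dot has been shrunk to exactly zero, which is the feature-selection behavior described above.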
One final idea that I'll leave you with, though, is that you don't have to be monolithic. For the same analysis or exercise, you can try multiple models; in fact, when the goal is prediction, it's perfectly common to ensemble predictions over multiple different classifiers or multiple different models. But even for inference, every algorithm is different. Every practitioner is painfully aware of all the assumptions that linear regression carries: for example, normality and constant variance of the residuals, truly linear relationships between the predictors and the response, and minimal multicollinearity. You should probably check these things, especially if your goal is inference. But you can tell a very interesting overall story if, on top of a regression method, you also use some other statistical or machine learning method that can tell an inferential story in and of itself. If you've got the same covariates coming out as super important both by the absolute value of a t-statistic from a regression method and by the mean decrease in Gini index from a random forest, you've probably got yourself a pretty robust result. But if these things tell wildly different stories, it couldn't hurt to try to understand, or at least come up with a hypothesis for, why that is.

To summarize this whole discussion: I personally love regression methods, and I use them all the time, especially when I'm trying to create a baseline for predictive performance to see how much better machine learning methods are doing, or when I'm trying to tell a story to my client about exactly how one predictor or multiple predictors influence some response variable. But don't make regression methods an exercise in trying to fit a square peg into a round hole; use them with intentionality, for the right kinds of problems.

So thanks for watching this video. If you enjoyed it, please consider sharing it, smash the like button, and also leave me a comment down below to let me know: do you like regression methods yourself? Or do you hate them, and you're all in on machine learning? Let me know what you think. Then I'll see you all in the not-so-distant future. Until then, Richard On Data.
Info
Channel: RichardOnData
Views: 1,712
Rating: 5 out of 5
Keywords: when should you use regression, why should you use regression, regression methods, regression techniques, linear regression, logistic regression, regression, linear regression in r, regularization, l1 regularization, l2 regularization, l1 regularization vs l2 regularization, ridge regression, lasso regression, regularization in machine learning, inference vs prediction, regression for inference, regression for prediction, supervised learning vs unsupervised learning
Id: 9FH0CIMq8-Y
Length: 16min 46sec (1006 seconds)
Published: Sun Apr 11 2021