Hi. I'm Andy, and I'm a technical trainer at SAS. And today, I'm going to talk to you about how to perform simple linear regression in SAS. Linear regression has actually been around for a really long time, and it's used in many different industries-- for example, the medical field. It's used in retail sales. And we use linear regression in order to help us predict a continuous target, a continuous variable. Something like sales-- let's say we're trying to figure out exactly how much a customer is going to spend with us. That's a really useful piece of information. So we have other variables, which we consider our inputs. And those are the variables that allow us to try to explain sales. Like for example, maybe if you knew how much money I made, you'd be able to predict how much I was going to spend. Why don't we go ahead and take a look at what the formula for linear regression looks like? I don't know when the last time was that you were in high school or college and took a mathematics course, and I'm certainly not going to tell you how old I am. However, you probably remember that the formula for a line looks something like this-- y equals mx plus b. And when we were learning about lines back then, you probably remember that this m term is the slope of the line, and this b term is the y-intercept. That's actually where our line is going to intersect the y-axis. Well, in more modern days, as we kind of bring this up to date, we're going to do the same thing in order to create our simple linear regression, but we're going to use slightly different terms. So don't let that throw you off. Instead, we're going to say that y-hat, which is going to be our predicted value, is going to be equal to a beta naught term. Now, beta naught is just a parameter estimate, and that is going to reflect our y-intercept. And then, we're going to add to that a beta 1 term times an x1. X1 just means that's my first input. 
And we're going to be performing a simple linear regression, so that means we're only going to have one input. And beta 1, well, once again, that's just my slope that we were taking a look at in the earlier equation. So you see they're kind of flipped, but it's really the exact same thing. The last thing that we're going to include as we sort of move into a little bit of theory is, we do have an error term. So we're going to add in this epsilon, or error term, because there's always some unknown error. And so this is the formula for a simple linear regression. And what we want to do is we want to find these parameter estimates. We want to find this beta naught and this beta 1 that's going to give us the best line through our data. So I think now is a good time for me to switch over and start to talk about the way that that's going to get calculated in the background. The method that's going to be used is something called-- there we go, that's where I was trying to get-- ordinary least squares regression. And so ordinary least squares regression is basically going to work like this. What's going to happen is if we think about it-- so let's suppose that my response variable, the thing that I'm trying to predict, is something like sales as I was talking about earlier. So we have sales over here. And then my predictor variable, in this case, maybe this is income. So this is the income of all of our customers. And so we have different customers. They're going to have different incomes. And what we have here is a scatter plot. So you see all the little dots along this scatter plot? Those are the actual value. So that means there's a customer that we know about that has this amount of income. We could come down to the x-axis and see what that is. And we see that in the past, they've had this amount of sales. And we can look at the y-axis. So these points are really just a sample of what we call our true population, but I'm not going to get way heavy into theory today. 
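For anyone following along in text rather than on screen, the formula described above can be written out as follows. This is a sketch in standard textbook notation; the hats marking estimated quantities and the index i over customers are notation added here and are not spoken in the talk.

```latex
% The simple linear regression model: one input x_1, plus an error term.
y = \beta_0 + \beta_1 x_1 + \varepsilon

% The fitted line: predicted value from the estimated intercept and slope.
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1

% Ordinary least squares picks the estimates that minimize the
% sum of squared differences between actual and predicted values.
\min_{\hat{\beta}_0,\, \hat{\beta}_1} \; \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1i} \right)^2
```

Compared with y equals mx plus b, the intercept plays the role of b and the slope plays the role of m.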
I'm going to try to keep things a little bit lighter and fast-moving so that I can show you how to do this. And what we want to do is we want to find the line that best goes through this data. Now, you can see the blue line that I've currently got on the screen. That's actually doing a really good job. But the question is, how did we get there? And this is the way that we got there. Let's suppose that we don't know anything about our customers, but we have a whole bunch of sales values so sort of like worst-case scenario. If I wanted to be able to predict how much a customer was going to spend, what would be a good value in order to make that prediction? So not knowing anything about incomes, really, the best guess that I could make would be the average of all of my ys, that average sales value. So let's kind of look at my chart here and say, it looks like the average is probably going to be somewhere right about here. That's going to be the average sales. So if I have a really bad model, and my income does not help me predict the sales, what I'm going to do is I'm going to actually have a flat line going through that average. So we think of that as our baseline model. It's sort of like the worst-case scenario. We have to do better than this line. And then now, we get into the part where we want to talk about, well, how do we decide where to put that line? Well, as you can imagine, if we were to take that line and actually move around that pivot point, we could start to place that line so that it's going to be the very best possible fit that we could get in our data. But how are we going to define that best fit? Well, think about this. If I look at my blue line, I can see how far away my blue line is from a lot of my different points. And we can actually take this distance, which we're actually going to call errors. Those are what we consider the error in our model, the part of our model that is not being explained. 
And if you think about it, if we keep toggling that line, you can imagine that those errors are going to grow greater. And they're also going to get smaller as well. And at some point, we're going to find the best line that's actually going to minimize those errors. And that's where the least squares estimates are going to come in. Now, why is it called least squares? Well, think about this. If we were to add up all of these error terms, you'll notice that some of them are going to be positive, and some of them are going to be negative. So they would cancel each other out. So what we do is we square the terms so they're always positive. And then we're going to try to find the line with the minimum sum of squared errors. And that's exactly what's happening in the background. So why don't we go ahead and actually do that in SAS right now? I'm going to tell you that I have been around SAS for 30 years. So I'm fortunate enough to have a lot of experience at SAS. And I can tell you there are so many different ways of performing a linear regression in SAS that I just couldn't even count all of the ways. I may not even know all of them. And I'm going to focus in on just showing all of you two different ways of doing it. And I'm going to start in the SAS Studio interface, where there are these really cool things called Tasks. And I like Tasks because I don't have to write code. But actually, we're going to be able to see the code written as we're pointing and clicking. So here's what I'm going to do. You'll see that underneath my list of Statistics tasks, I actually happen to have a Linear Regression task. And if I double-click on that, it's going to open up a new interface for me where I'm going to be able to do some pointing and clicking. And I want to focus in on this real estate, so I'm going to go ahead and close our list of tasks because we're not going to use that again. In this particular example, I have chosen a data table called the Class data table in the SASHELP library. 
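Going back to the least squares idea for a moment, the arithmetic can be sketched in a few lines. This is plain Python rather than SAS, and the five income and sales values are made up purely for illustration; the function names are hypothetical too. It fits the line with the usual closed-form formulas, shows that the raw errors cancel out, and compares the squared error of the fitted line with the flat average-sales baseline.

```python
# Made-up data for illustration only: income (x) and sales (y) for five customers.
incomes = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.0, 4.0, 5.0, 4.0, 5.0]


def fit_line(x, y):
    """Return (intercept, slope) of the ordinary least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_squared_errors(x, y, intercept, slope):
    """Add up the squared distances between actual and predicted values."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


b0, b1 = fit_line(incomes, sales)

# Raw errors: some positive, some negative, so they cancel when summed.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(incomes, sales)]

# Baseline model: a flat line through the average sales value (slope 0).
mean_sales = sum(sales) / len(sales)
baseline_sse = sum_squared_errors(incomes, sales, mean_sales, 0.0)
fitted_sse = sum_squared_errors(incomes, sales, b0, b1)

print("intercept:", round(b0, 4), "slope:", round(b1, 4))
print("sum of raw errors:", round(sum(residuals), 10))
print("baseline SSE:", round(baseline_sse, 4), "fitted SSE:", round(fitted_sse, 4))
```

Nudging the slope or intercept away from the fitted values and calling `sum_squared_errors` again gives a larger number, which is what "least squares" means in practice.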
And wherever you log into SAS, you're more than likely going to have access to the SASHELP library. And you're going to get access to this class table as well. This class table has some different students in it. And it records their heights and weights. So wouldn't it be interesting if we tried to create a linear regression model where maybe we could actually predict somebody's weight by looking at how tall they are? That kind of makes sense. So let's do that. So what we're going to do is we're actually going to call our target variable our dependent variable. So in this example, what I'm interested in doing is trying to be able to explain or predict somebody's weight. And then I'm going to scroll down in the interface. And I'm going to find the list of continuous variables. And these are my inputs. These are my independent variables. We also sometimes call these explanatory variables. And I'm going to click on plus. And I'm going to add in their height. We'll click on OK. And then we'll notice that SAS Studio is giving me a little message down here. And it says, you know what? We need to add at least one model effect, which is one term, into this model before it'll actually run for us. And the way that we'll do that is by coming over to the Model tab. And on the Model tab, I'm going to edit here. And you'll see here's the one input that we can use. I'm going to turn it on and say, let's add that in as a single effect. So the reason that we have all of this terminology is because I'm showing you how to use a simple linear regression, which just has one input. It turns out there's something called multiple linear regression where you can have a whole bunch of inputs, but that's for another day. That's for another topic. That's for another YouTube video, right? OK, now, what's really cool about this is you'll see that now that I've met all the needs of the task, it actually wrote some code for me. 
So let's take a look at this code that got written in the background. You'll see that we're using the REG procedure. And you'll see that it's using the DATA equals option to actually specify the name of the table that we're working with. Then over here, you'll notice we're using the PLOTS options. And we're actually going to produce several different plots to help us take a look at how well that model is performing. And then finally-- and perhaps this is the most important part-- this is actually the MODEL statement. And this is where we say, hey, we want to try to explain weight by using the height. So that really makes weight my y variable and height my x1 variable that I was showing you a little bit earlier when I was outlining this out. But it's nice that SAS Studio did all the work for me so all I really have to do is click on the Submit button, or you'll notice the little running man beside that. We like to call that a little running man. And now I actually have some output. So I'm going to make this screen just a little bit wider so that we can focus in on the output. Over on the left-hand side, if you want to go to a specific piece of output, it's very easy to navigate to that from here. But I'm going to show you this from the top down. And as you can see, on this particular linear regression, we actually had a total of 19 students that went in here. And then we get down to the analysis of variance table. You might go, hey, Andy, wait a minute. You lied to me. You said this was least squares regression. Well, how come there's an analysis of variance table? Well, we're, in essence, doing the same thing because that's our ANOVA table. And it's analyzing the variance. So it's looking at that variability. It's trying to minimize those error terms. And probably the most important piece of information here, since I don't want to get too much into the ANOVA table today, is you'll notice that we have this p-value there that's very small. 
It's associated with the F statistic that's associated with the model. And since that p-value is very small, what that tells me is that this model is explaining a significant amount of the variability in my target. So a lot of variability in the data. So we like that. We actually want really small p-values. We take a look at the table underneath that. And there's a couple of interesting statistics there. For example, we can actually see what the overall average is. So it looks like the overall average weight of all my students was about 100 pounds. And then there's another really interesting statistic here called the R-squared value. The R-squared value tells me what proportion of the variability in my target my input is going to explain. So in other words, it looks like that my heights-- did I get that right? I always get those two backwards. Yeah, I'm using heights to predict weights. That's right. So the variability in the heights is actually explaining about 77% of the variability in my weight. And that's a good number because the R-squared is going to vary between 0 and 1, and higher values are better. The next table-- and this is probably the most interesting piece of information because if statistics bores you, then you're probably more interested in the juicy details. Andy, just tell me the line. Well, what we have here are those parameter estimates. We have that beta naught and that beta 1. So here's my y-intercept, and here's my height. And you'll notice there's a couple of p-values associated with those two because in this table, there's a t-test that says, hey, is this different-- is this value different from 0? And with that small p-value, we go, yes, it is. So we know now that my parameter estimates are statistically significant. But how do we tie this back in to y equals mx plus b? Well, now, by looking at these parameter estimates, I can do fun things like this. 
I can say, hey, I know that somebody's weight is going to be equal to that y-intercept, which is about minus 143. Plus it looks like if we take about 3.9, almost 4 times somebody's height, and that's the formula for my line, which is really pretty cool. Let's start to take a look at some of the plots that got produced here by SAS Studio. And you know what? Let me go ahead and clear some of this out. We don't need this anymore. We can leave the equation for our line up there. And you'll notice that first piece of information that we're shown is a plot of my observed by predicted. So in this particular case, we can see that line that we're creating, that's actually the diagonal line that we see here. And then we see the actual weights that are there. Now, if our model is doing a good job of predicting the weights, then what's going to happen is those points are going to be real close to that line. And so we want a random scatter because if we see a pattern here, that probably means that our one input is not sufficient in being able to help us explain what's going on here. And we'll want to look at a different model. So we don't want to see any patterns here. We want to see a random scatter, and we want it close to this line. Then as we start to take a look at some of the other output, we'll see that we have some fit diagnostics here. So one of the things that we can use with these plots down here is they can help us validate our assumptions. Assumptions? Andy, you didn't talk about any assumptions before. Well, guess what? We're not going to be able to get away scot-free when we produce our linear regression. It turns out there's certain assumptions that you need to meet in order to perform a linear regression. And there is a nice little acronym or shortcut that I like to use to help me remember what those assumptions are. And it looks like this. It's actually the word LINE. 
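As a quick aside on the equation for the line, here is a small Python sketch (not SAS) that plugs heights into it. The intercept and slope are the rounded values quoted in the talk, about minus 143 and about 3.9, so the results are approximate; the function name and the three example heights are made up for illustration.

```python
# Rounded parameter estimates as quoted in the talk (approximate values).
INTERCEPT = -143.0  # "about minus 143"
SLOPE = 3.9         # "about 3.9, almost 4"


def predict_weight(height_inches):
    """Predicted weight in pounds from height in inches: y-hat = b0 + b1 * x1."""
    return INTERCEPT + SLOPE * height_inches


# Hypothetical heights, chosen only to show how the line is used.
for height in (57, 62, 67):
    print(height, "inches ->", round(predict_weight(height), 1), "pounds")
```

Each extra inch of height adds the slope, roughly 3.9 pounds, to the predicted weight. The intercept on its own is not a meaningful weight, since nobody in the data has a height anywhere near zero.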
That helps me remember the assumptions that we need to validate in order to perform a linear regression. L stands for linear. And what that means is we are going to assume that there is a linear relationship between the target variable and the input variables. So what do we mean when we say that? Well, clearly, what we're talking about is if the actual relationship between these two is like a curve, then we know a line is not adequate. Another way I like to think about this assumption is that if you plot a scatter plot, it needs to look like a line. If it doesn't look like a line, you probably shouldn't be doing linear regression. The remaining three assumptions that we're going to talk about can actually be validated by looking at those errors that we were talking about earlier. So you'll remember that error terms are actually the distance between the predicted value that comes from our line and the actual value. So when we're looking at those errors-- and that's what's in these fit diagnostics that we're going to be spending some time looking at in just a second-- we're going to use these other three assumptions. The first one has to do with independence-- "in-de-pen-dence." Yeah, I'm spelling it right. That's a long word. Whew. All right, independence of my errors. That also means independence of the individual observations as well. That means that knowing something about one point cannot tell me anything about the next point. A good counter-example to that would be things like temperature. Temperature goes up and down. It has a seasonal effect. And so as that temperature is moving up and down, if you know it's 71 degrees right now, you know in just a little bit, it's either going to be 72 or 70. It's going to be one or the other. So in that case, we really don't have independence. We also see seasonal effects in things like the stock market and even purchase behavior. So if you have seasonal data, we actually analyze that with time series. 
So independence is the second assumption. The third assumption has to do with a normal distribution of those errors. We want to have a normal distribution of the errors because if we don't, it's going to make it very tough for us to trust the p-values on our parameter estimates. And it's also going to make it difficult when we want to produce confidence intervals. And so I'm going to show you what those look like in just a little bit. And then my final assumption that we want to validate is equal error variance. In other words, we don't want to see any patterns in our variability of those errors. What's really great about the charts that we have on the screen now is that we can actually use these charts to help us validate our assumptions. So remember, we're going to validate the last three by using the errors. And so what you can see here when we're trying to validate independence is we actually want to take a look at these residuals. And what we want to see is a random scatter against the predicted values. We don't want to see a pattern. What would a pattern look like? Well, if you see a cornucopia in either this direction or the other direction, that's a bad sign. That means that your model's not picking up all the signal in your data. And that also means that that variability is increasing as, for example, the predicted value is increasing. So we don't want that. If we have that, we might have to do something like a transformation. Then we also want to have equal error variance. So you can see that by looking at that first chart, it also validates that for us as well. Finally, in terms of trying to validate the normal assumption, that's this plot down here. It's a good plot for us to see there. And what we have here is called the quantile-quantile plot. And in that particular plot, you want to see your dots following that diagonal line. If your error terms are normally distributed, they will do a good job of following that line. And they're doing that. 
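To give a feel for what sits behind a quantile-quantile plot, here is a minimal Python sketch (not SAS). It pairs each sorted residual with the standard normal quantile you would expect at that rank. The residual values are made up, the function name is hypothetical, and the plotting position (i + 0.5) / n is one common convention assumed here; the positions SAS uses for its own plot may differ.

```python
from statistics import NormalDist


def qq_points(residuals):
    """Pair each sorted residual with its expected standard normal quantile.

    Uses the plotting position (i + 0.5) / n for the i-th smallest value
    (0-based), which is an assumption made for this sketch.
    """
    n = len(residuals)
    ordered = sorted(residuals)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return [
        (standard_normal.inv_cdf((i + 0.5) / n), value)
        for i, value in enumerate(ordered)
    ]


# Made-up residuals, for illustration only.
example_residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]
points = qq_points(example_residuals)

for theoretical, observed in points:
    print(round(theoretical, 3), observed)
```

Plotting the first value of each pair on the x-axis and the second on the y-axis gives the dots. If the errors are roughly normal, those dots fall close to a straight line.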
We even have another chart that can help us validate the assumption for the normal errors. And that's the one right below it. And you can see here, we just have a nice bin chart where we're taking a look at the distribution of those errors, and it looks relatively normal. I'm not really concerned about anything here. As we continue to scroll down in this output, we'll see that we do get a really big plot that shows us the residuals against our input. And once again, this is that same chart that we would use to help us validate things like equal error variance. And then finally, at the bottom here, this is probably the most interesting part. And this is our line. This is the line that we created using that y-intercept and that slope. And one of the things that's kind of interesting about this particular line is that you'll see that there is a blue area on it. Those are those confidence intervals I was talking to you about. Confidence intervals are really cool because they let us put a range around what we're saying. The blue area represents the confidence interval for the average. So for example, if I was looking at people who are 60 inches tall, then I would be 95% confident that their average weight is going to be somewhere between, it looks like, oh, maybe about 75 and 90 pounds. And that's how we use the confidence intervals. You'll also notice there are some outside lines. Those are actually prediction intervals for the individual values. So if we wanted to say how confident we were about the weight of an individual person who was 60 inches tall, you can see that that's much wider. There's a much bigger range there in terms of predicting their weight. OK, great, so now, I've shown you how to perform this in SAS Studio. And now, I'd like to show you a second way of performing a linear regression. And for that, I want to take advantage of the newest piece of the SAS platform, which is called SAS Viya. 
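For readers who want to see why the outside lines on the fit plot are so much wider than the blue area, the two intervals can be written side by side. These are the standard textbook formulas for simple linear regression, not something shown in the talk; x_0 is the height being asked about, s is the estimated standard deviation of the errors, and t is the appropriate t quantile with n minus 2 degrees of freedom.

```latex
% Estimated standard deviation of the errors.
s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}}

% Confidence interval for the AVERAGE response at x_0 (the blue area).
\hat{y}_0 \pm t_{1-\alpha/2,\, n-2} \; s \,
  \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}

% Prediction interval for an INDIVIDUAL response at x_0 (the outside lines).
\hat{y}_0 \pm t_{1-\alpha/2,\, n-2} \; s \,
  \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
```

The only difference is the extra 1 under the square root, which accounts for the person-to-person scatter around the line. That term does not shrink as the sample grows, which is why the individual interval stays wide.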
SAS Viya is really cool because it allows us to take advantage of in-memory analytics and in-memory data. In other words, if you've got big data, and you want to crunch through it, Viya is a great way to do that. I'm going to go into the upper left-hand corner, which we like to call our hamburger menu. And we're going to come over to Explore and Visualize Data. So now what I've moved into is SAS Visual Statistics, which is part of SAS Viya. And I actually have a different table that we want to use in order to perform our linear regression. The reason I want to show you this different way is because this is a much more interactive point-and-click way of actually producing linear regression. And then also, if I happen to have big data, I'll be able to get those answers very quickly, although I'm not using big data in this example. I'm going to come over here to my data pane. And I'm going to say that we want to open up the VS_BANK table. So I actually have some banking data that has been anonymized. So that way, we can't tell who this really belongs to. I certainly don't want you to look at my bank account. And you can see that this is a list of all of the inputs that are in that data. And it's broken down into two sections. We have our categorical variables on top, and we have our measures down below. And what I'd like to do is I'm going to add in a linear regression object. So I'm going to move over to the Objects pane. I'm going to scroll down and find my list of SAS Visual Statistics. Here's the Linear Regression object. I can either drag and drop it onto what we call the canvas, or I can just double-click on it. I'm going to double-click on Linear Regression. And then what happens now is it's ready to perform a linear regression, but we have a little message that says the required roles have not been assigned. So let's come over to my Roles pane. And let's pick a response. So what do I want to predict? 
In this banking data, we actually have information about all of the new sales that a customer has given us in the last six months. So that's going to be my target variable. So that's what I want to be able to predict-- how much they're going to spend with us in the future by looking at the past data. What am I going to use to help us explain that? Well, it turns out we have a lot of explanatory variables in this particular table. I'm going to focus in on just one because we want to perform a simple linear regression. So we're going to look at the amount that they spent on their last purchase, and we're going to use that as the predictor. So I'm going to click on OK here. And as soon as I've filled everything out, we're going to get information about that linear regression. Now, you'll notice that by default, all of the outputs are put together in one panel. And I think this is a little easier to look at if we actually break this up. So I'm going to show you an option. We're going to come over to the Options pane. And I'm going to scroll down, and under the Model Display options, I'm going to change this plot layout from a Fit to a Stack. And what that means is, hey, take each one of those windows and put them on a different tab. So now, we have a lot more room to examine what's going on. So let's start by taking a look at the Fit Summary pane. We can see here this is my one input variable, the last product purchased amount. And what this very long, green line tells me is that I have a very small p-value. In other words, this item is significant to this model. We have a 5% cutoff line, which means that if that p-value was greater than 5%, it wouldn't be green. It would actually be blue. And so we would think that, hey, this really did not help me explain how much somebody was going to purchase. But in this case, we have a very long green line. The green line is actually the minus log of the p-value. 
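As a quick numeric sketch of that minus-log scale, here is a small Python example (not SAS). It assumes the natural logarithm, since the talk does not say which base Visual Statistics uses, and the p-values other than the 5% cutoff are made up for illustration. With a different base the numbers change but the ordering does not.

```python
import math


def neg_log_p(p_value):
    """Minus the natural log of a p-value (the base is an assumption here)."""
    return -math.log(p_value)


# Where the 5% cutoff lands on the minus-log scale.
cutoff = neg_log_p(0.05)
print("cutoff at p = 0.05:", round(cutoff, 4))

# Made-up p-values: smaller p-values give longer bars.
for p in (0.5, 0.05, 0.001, 1e-10):
    bar_length = neg_log_p(p)
    side = "past the cutoff" if bar_length > cutoff else "not past the cutoff"
    print(p, "->", round(bar_length, 2), side)
```

The smaller the p-value, the longer the bar, so a very significant input shows up as a very long line.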
So if you think about that for just a second, if you take a very small value, and then you take the log of it, it's going to give you a very big value. But it's going to be negative. So we take the negative of that. So the minus log of a very small value is going to be a very large value. And that's where that long, green line comes from. Let's also take a look at my residuals. Now, here, I'm a little concerned because in general, what we would hope is that the majority of our residuals would be within this plus or minus two range, those two horizontal lines there. This is actually a studentized deleted residual, which means it's standardized. And so one of the things that I see here that concerns me a little bit is I am beginning to see what does look perhaps like a little bit of a cornucopia. In other words, it looks like the variation in my residuals is actually increasing as my predicted value is increasing. And that's not a good thing. So that means that maybe just this one term is not a good idea. We might need to do a transformation on it. We might need to look at a different term. But for now, let's just go ahead and keep looking at our output. And then finally, we're getting an assessment piece from Visual Statistics. And you can see that the green line is the predicted value. That's what the model's telling me. And the orange line is the actual value, the observed average. That's what we're trying to get. And one of the things that I notice here is I can see that in these lower percentiles, which are the higher target sales, it looks like my model is under-predicting, whereas in other areas, it looks like it's over-predicting. So we might want to think about a different tactic, but that's OK. Let's just keep going because I want to show you one more important piece of information. And that is remember when we performed our linear regression in SAS Studio, we got the equation for the line. Well, what is the equation for the line? 
Where is that? That's actually in this Details table that I can open up by clicking on the Maximize button. And let me get rid of some of this stuff that we wrote here earlier. There we go. I'm going to come over to the Parameter Estimates tab. And aha! Those are my parameter estimates. So there is my slope and my y-intercept. So what does that mean? If we want to predict the sales, that's going to be equal to this intercept, which is about 8940. I'm just rounding up now. And then we're going to add that to about 187 times the last product purchase amount. So we'll just call that LPPA. And there's the formula for the line. So it's kind of hidden in that Details table, but now we can see what it is. There are also some assessment statistics. And we can see, for example, the average squared error. But maybe we should save that for another day. Thanks for joining me today. I hope you learned a little bit more about how to perform a simple linear regression in SAS. If you want to learn more, don't forget to subscribe. Also, we have some great information for you down below. You can click on the links, find some interesting resources, and also, you know what? Let me know what you think. Ask me some questions. Leave me some feedback.
