Hi. I'm Andy, and I'm a
technical trainer at SAS. And today, I'm
going to talk to you about how to perform simple
linear regression in SAS. Linear regression has
actually been around for a really long
time, and it's used in many different industries--
for example, the medical field. It's used in retail sales. And we use linear
regression in order to help us predict a continuous
target, a continuous variable. Something like sales--
let's say we're trying to figure out exactly
how much a customer is going to spend with us. That's a really useful
piece of information. So we have other variables,
which we consider our inputs. And those are the
variables that allow us to try to explain sales. Like for example, maybe if you
knew how much money I made, you'd be able to predict how
much I was going to spend. Why don't we go
ahead and take a look at what the formula for
linear regression looks like? I don't know when the last time was that you were in high school or college and took your mathematics course, and I'm certainly not going
to tell you how old I am. However, you probably remember
that the formula for a line looks something like
this-- y equals mx plus b. And when we were learning
about lines back then, you probably
remember that this m term is the slope of the line,
and this b term is the y-intercept. That's actually
where our line is going to intersect the y-axis. Well, in more modern days, as we
kind of bring this up to date, we're going to do the
same thing in order to create our simple
linear regression, but we're going to use
slightly different terms. So don't let that throw you off. Instead, we're going to
say that y-hat, which is going to be our
predicted value, is going to be equal
to a beta naught term. Now, beta naught is just
a parameter estimate, and that is going to
reflect our y-intercept. And then, we're going to add to
that a beta 1 term times an x1. X1 just means that's
my first input. And we're going to be performing
a simple linear regression, so that means we're only
going to have one input. And beta 1, well,
once again, that's just my slope that
we were taking a look at in the earlier equation. So you see they're
kind of flipped, but it's really the
exact same thing. The last thing that
we're going to include as we sort of move into
a little bit of theory is, we do have an error term. So we're going to add in
this epsilon, or error term, because there's always
some unknown error. And so this is the formula for a simple linear regression. And what we want to do is find these parameter estimates-- this beta naught and this beta 1-- that are going to give us the best line through our data.
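Putting those pieces together, the model we've just described is

$$y = \beta_0 + \beta_1 x_1 + \varepsilon$$

and the prediction, which drops the unknown error, is $\hat{y} = \beta_0 + \beta_1 x_1$.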
So I think now is a good time for me to switch over and start to talk about the way that that's going to get calculated in the background. The method that's going to be used is something called ordinary least squares regression. And so ordinary least
squares regression is basically going
to work like this. What's going to happen is
if we think about it-- so let's suppose that my response
variable, the thing that I'm trying to predict, is
something like sales as I was talking about earlier. So we have sales over here. And then my predictor
variable, in this case, maybe this is income. So this is the income
of all of our customers. And so we have
different customers. They're going to have
different incomes. And what we have here
is a scatter plot. So you see all the little
dots along this scatter plot? Those are the actual values. So that means there's a
customer that we know about that has this amount of income. We could come down to the
x-axis and see what that is. And we see that in
the past, they've had this amount of sales. And we can look at the y-axis. So these points are
really just a sample of what we call our
true population, but I'm not going to get
way heavy into theory today. I'm going to try to keep
things a little bit lighter and fast-moving so that I
can show you how to do this. And what we want
to do is we want to find the line that best
goes through this data. Now, you can see the blue
line that I've currently got on the screen. That's actually doing
a really good job. But the question is,
how did we get there? And this is the way
that we got there. Let's suppose that we don't know
anything about our customers, but we have a whole
bunch of sales values-- so, sort of like a
worst-case scenario. If I wanted to be able to
predict how much a customer was going to spend, what would
be a good value in order to make that prediction? So not knowing
anything about incomes, really, the best guess
that I could make would be the average
of all of my ys, that average sales value. So let's kind of look at
my chart here and say, it looks like the
average is probably going to be somewhere
right about here. That's going to be
the average sales. So if I have a really bad
model, and my income does not help me predict the sales,
what I'm going to do is I'm going to actually
have a flat line going through that average. So we think of that as our baseline model. It's sort of like the worst-case scenario-- we have to do better than this line.
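In symbols, that baseline model just predicts the overall average of the target for everybody:

$$\hat{y} = \bar{y}$$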
And now, we get into the part where we want to
talk about, well, how do we decide where
to put that line? Well, as you can imagine,
if we were to take that line and actually pivot it around a point, we could start to place
that line so that it's going to be the very
best possible fit that we could get in our data. But how are we going to
define that best fit? Well, think about this. If I look at my
blue line, I can see how far away my blue
line is from a lot of my different points. And we can take those distances, which we're going to call errors. Those are what we
consider the error in our model, the
part of our model that is not being explained. And if you think about it, if
we keep pivoting that line, you can imagine that those errors are going to grow larger in some places and smaller in others. And at some point,
we're going to find the best line that's actually
going to minimize those errors. And that's where the
least squares estimates are going to come in. Now, why is it
called least squares? Well, think about this. If we were to add up all of these error terms, you'll notice that some of them are going to be positive, and some of them are going to be negative. So they would cancel each other out. So what we do is we square the terms so they're always positive, and then we find the line that minimizes the sum of those squared errors. And that's exactly what's happening in the background.
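In other words, least squares picks the beta naught and beta 1 that minimize the sum of the squared errors:

$$\min_{\beta_0,\,\beta_1} \sum_{i=1}^{n} \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2$$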
So why don't we go ahead and actually do that in SAS right now? I'm going to tell
you that I have been around SAS for 30 years. So I'm fortunate enough to have
a lot of experience at SAS. And I can tell you there
are so many different ways of performing a linear
regression in SAS that I just couldn't even
count all of the ways. I may not even know all of them. And I'm going to focus in
on just showing all of you two different ways of doing it. And I'm going to start in
the SAS Studio interface, where there are these really
cool things called Tasks. And I like Tasks because I
don't have to write code. But actually, we're going to
be able to see the code written as we're pointing and clicking. So here's what I'm going to do. You'll see that underneath
my list of Statistics tasks, I actually happen to have
a Linear Regression task. And if I double-click
on that, it's going to open up a
new interface for me where I'm going to be able to
do some pointing and clicking. And I want to focus in
on this screen real estate, so I'm going to go ahead
and close our list of tasks because we're not going
to use that again. In this particular
example, I have chosen a data table called the
Class data table in the SASHELP library. And if you log into
SAS anywhere you're at, you're more than likely
going to have access to the SASHELP library. And you're going to get access
to this class table as well. This class table has some
different students in it. And it records their
heights and weights. So wouldn't it be
interesting if we tried to create a linear
regression model where maybe we could actually predict
somebody's weight by looking at how tall they are? That kind of makes sense. So let's do that. So what we're going to
do is we're actually going to call our
target variable our dependent variable. So in this example, what
I'm interested in doing is trying to be able to explain
or predict somebody's weight. And then I'm going to scroll
down in the interface. And I'm going to find the
list of continuous variables. And these are my inputs. These are my
independent variables. We also sometimes call
these explanatory variables. And I'm going to click on plus. And I'm going to
add in their height. We'll click on OK. And then we'll notice that
SAS Studio is giving me a little message down here. And it says, you know what? We need to add at least
one model effect, which is one term, into this
model before it'll actually run for us. And the way that we'll
do that is by coming over to the Model tab. And on the Model tab,
I'm going to edit here. And you'll see here's the
one input that we can use. I'm going to turn
it on and say, let's add that in as a single effect. So the reason that we have
all of this terminology is because I'm
showing you how to use a simple linear regression,
which just has one input. It turns out there's
something called multiple linear
regression where you can have a whole
bunch of inputs, but that's for another day. That's for another topic. That's for another
YouTube video, right? OK, now, what's
really cool about this is you'll see that now that I've
met all the needs of the task, it actually wrote
some code for me. So let's take a look at this code that got written in the background. You'll see that we're using the REG procedure, and you'll see that it's using the DATA= option to specify the name of the table that we're working with. Then over here, you'll notice we're using the PLOTS option, and we're going to produce several different plots to help us take a look at how well that model is performing. And then finally-- and perhaps this is the most important part-- this is the MODEL statement. And this is where we say, hey, we want to try to explain weight by using the height. So that really makes weight my y variable and height my x1 variable that I was showing you a little bit earlier when I was outlining the equation.
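The generated code looks roughly like this-- a sketch, since the exact PLOTS= list the task writes depends on your settings:

```sas
/* Simple linear regression on SASHELP.CLASS; the plot list shown
   here is an assumption and may differ from what the task requests */
proc reg data=sashelp.class
         plots(only)=(observedbypredicted diagnostics residuals fitplot);
    model Weight = Height;   /* y is Weight, x1 is Height */
run;
quit;
```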
But it's nice that SAS Studio did all the work for me, so all I really have to do is click on the Submit button-- or you'll notice the little
running man beside that. We like to call that
a little running man. And now I actually
have some output. So I'm going to make this
screen just a little bit wider so that we can focus
in on the output. Over on the left-hand
side, if you want to go to a specific
piece of output, it's very easy to navigate
to that from here. But I'm going to show you
this from the top down. And as you can see, on this
particular linear regression, we actually had a total of 19
students that went in here. And then we get down to the
analysis of variance table. You might go, hey,
Andy, wait a minute. You lied to me. You said this was least
squares regression. Well, how come there's an
analysis of variance table? Well, we're, in essence,
doing the same thing because that's our ANOVA table. And it's analyzing the variance. So it's looking at
that variability. It's trying to minimize
those error terms. And probably the most important
piece of information here, since I don't want to get
too much into the ANOVA table today, is you'll notice that
we have this p-value there that's very small. It's associated with
the F statistic for the overall model. And since that
p-value is very small, what that tells me
is that this model is doing a really good
job at explaining a lot of the variability
in my target, that is, a lot of the variability in the data. So we like that. We actually want
really small p-values. We take a look at the
table underneath that. And there's a couple of
interesting statistics there. For example, we can actually
see what the overall average is. So it looks like the overall
average weight of all my students was about 100 pounds. And then there's another really
interesting statistic here called the R-squared value. The R-squared value
tells me what proportion of the variability my
input is going to explain. So in other words-- I'm using heights to predict weights; did I get that right? I always get those two backwards. Yes, that's right-- the variability in the heights is actually explaining about 77% of the variability in my weight. And that's a good number, because the R-squared is going to vary between 0 and 1, and higher values are better.
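For reference, the R-squared is just the proportion of the total variability that the model accounts for:

$$R^2 = 1 - \frac{SS_{\text{error}}}{SS_{\text{total}}} \approx 0.77$$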
The next table-- and this is probably the most interesting piece of information, because if statistics bore you, then you're probably more interested in the juicy details. Andy, just tell me the line. Well, what we have here are
those parameter estimates. We have that beta
naught and that beta 1. So here's my y-intercept,
and here's my height. And you'll notice there's a
couple of p-values associated with those two
because in this table, there's a t-test that says,
hey, is this different-- is this value different from 0? And with that small
p-value, we go, yes, it is. So we know now that
my parameter estimates are statistically significant. But how do we tie this back
in to y equals mx plus b? Well, now, by looking at these parameter estimates, I can do fun things like this. I can say, hey, I know that somebody's predicted weight is going to be equal to that y-intercept, which is about minus 143, plus about 3.9-- almost 4-- times somebody's height. And that's the formula for my line, which is really pretty cool.
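Just to make that concrete, here's a tiny sanity-check data step that plugs those rounded estimates into the equation-- the height value is hypothetical, and your output will show more decimal places:

```sas
/* predicted weight = y-intercept + slope * height (rounded estimates) */
data _null_;
    height = 62;                            /* a hypothetical height in inches */
    pred_weight = -143.03 + 3.90 * height;  /* about 98.8 pounds */
    put height= pred_weight=;
run;
```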
Let's start to take a look at some of the plots that got produced here by SAS Studio. And you know what? Let me go ahead and
clear some of this out. We don't need this anymore. We can leave the equation
for our line up there. And you'll notice that
first piece of information that we're shown is a plot
of my observed by predicted. So in this particular
case, we can see that line that we're creating, that's
actually the diagonal line that we see here. And then we see the actual
weights that are there. Now, if our model is doing
a good job of predicting the weights, then
what's going to happen is those points are going to
be real close to that line. And so we want a random scatter
because if we see a pattern here, that probably means
that our one input is not sufficient in being
able to help us explain what's going on here. And we'll want to look
at a different model. So we don't want to
see any patterns here. We want to see a
random scatter, and we want it close to this line. Then as we start to take a look
at some of the other output, we'll see that we have
some fit diagnostics here. So one of the things that we can
use these plots down here for is to help us
validate our assumptions. Assumptions? Andy, you didn't talk about
any assumptions before. Well, guess what? We're not going to be able
to get away scot-free when we produce our linear regression. It turns out there's
certain assumptions that you need to meet in order
to perform a linear regression. And there is a nice
little acronym or shortcut that I like to use
to help me remember what those assumptions are. And it looks like this. It's actually the word LINE. That helps me remember
the assumptions that we need to
validate in order to perform a linear regression. L stands for linear. And what that means
is we are going to assume that there is
a linear relationship in between the target variable
and the input variables. So what do we mean
when we say that? Well, clearly, what
we're talking about is if the actual relationship
between these two is like a curve, then we
know a line is not adequate. Another way I like to
think about this assumption is that if you plot
a scatter plot, it needs to look like a line. If it doesn't look like a line, you probably shouldn't be doing
linear regression. The remaining three assumptions
that we're going to talk about can actually be validated
by looking at those errors that we were talking
about earlier. So you'll remember that
error terms are actually the distance or the value
between the predicted value that comes from our line
and the actual value. So when we're looking
at those errors-- and that's what's in
these fit diagnostics that we're going to be
spending some time looking at in just a second-- we're going to use these
other three assumptions. The first one has to
do with independence-- "in-de-pen-dence." Yeah, I'm spelling it right. That's a long word. Whew. All right, independence
of my errors. That also means independence
of the individual observations as well. That means that knowing
something about one point cannot tell me anything
about the next point. A good counter-example
to that would be things like temperature. Temperature goes up and down. It has a seasonal effect. And so as that temperature
is moving up and down, if you know it's 71
degrees right now, you know in just
a little bit, it's either going to be 72 or 70. It's going to be
one or the other. So in that case, we really
don't have independence. We also see seasonal effects
in things like the stock market and even purchase behavior. So if you have seasonal
data, we actually analyze that with time series. So independence-- the I in LINE-- is the second assumption. The third assumption, the N, has to
do with a normal distribution of those errors. We want to have a normal
distribution of the errors because if we don't, it's going
to make it very tough for us to trust the tests on our parameter estimates. And it's also going
to make it difficult when we want to produce
confidence intervals. And so I'm going to
show you what those look like in just a little bit. And then my final assumption
that we want to validate, the E, is equal error variance. In other words, we don't
want to see any patterns in our variability
of those errors. What's really great
about the charts that we have on
the screen now is that we can actually
use these charts to help us validate our assumptions. So remember, we're going
to validate the last three by using the errors. And so what you
can see here when we're trying to
validate independence is we actually want to take
a look at these residuals. And what we want to
see is a random scatter against the predicted values. We don't want to see a pattern. What would a pattern look like? Well, if you see a cornucopia-- a funnel shape-- opening
in either this direction or the other direction,
that's a bad sign. That means that your
model's not picking up all the signal of your data. And that also means
that that variability is increasing as, for example,
the predicted value is increasing. So we don't want that. If we have that, we might
have to do something like a transformation. Then we also want to have
equal error variance. So you can see that by
looking at that first chart, it also validates
that for us as well. Finally, in terms of trying to
validate the normal assumption, that's this plot down here. It's a good plot
for us to see. And what we have here is called the quantile-quantile, or Q-Q, plot. And in that particular
plot, you want to see your dots following
that diagonal line. If your error terms are
normally distributed, they will do a good job
of following that line. And they're doing that. We even have another
chart that can help us validate the assumption
for the normal errors. And that's the one
right below it. And you can see here, we
just have a nice histogram where we're taking a look at the
distribution of those errors, and it looks relatively normal. I'm not really concerned
about anything here. As we continue to scroll
down in this output, we'll see that we do get
a really big plot that shows us the residuals
against our input. And once again, this
is that same chart that we would use to
help us validate things like equal error variance. And then finally,
at the bottom here, this is probably the
most interesting part. And this is our line. This is the line that we
created using that y-intercept and that slope. And one of the
things that's kind of interesting about
this particular line is that you'll see that
there is a blue area on it. Those are those
confidence intervals I was talking to you about. Confidence intervals are
really cool because they quantify how confident we can be in what we're saying. The blue lines represent
the confidence interval for the average. So for example, if I
was looking at people who are 60 inches tall, then
I would be 95% confident that their average weight is going to be somewhere between, it looks like, about 75 and 90 pounds. And that's how we use
the confidence intervals. You'll also notice there are some outside lines. Those are actually prediction intervals-- confidence intervals for the individual values. So if we wanted to say how confident we were about an individual person's weight who was 60 inches tall, you can see that that's much wider. There's a much bigger range there in terms of predicting their weight.
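By the way, if you'd rather read those limits out of a table than off the plot, one way-- sketched here, assuming the same SASHELP.CLASS model-- is to add the CLM and CLI options to the MODEL statement:

```sas
/* CLM prints 95% confidence limits for the mean predicted value;
   CLI prints 95% prediction limits for individual values */
proc reg data=sashelp.class;
    model Weight = Height / clm cli;
run;
quit;
```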
OK, great, so now, I've shown you how to perform
this in SAS Studio. And now, I'd like to show you
a second way of performing a linear regression. And for that, I want
to take advantage of the newest piece
of the SAS platform, which is called SAS Viya. SAS Viya is really
cool because it allows us to take advantage
of in-memory analytics and in-memory data. In other words, if
you've got big data, and you want to
crunch through it, Viya is a great way to do that. I'm going to go into the upper
left-hand corner, which we like to call our hamburger menu. And we're going to come over
to Explore and Visualize Data. So now what I've moved into
is SAS Visual Statistics, which is part of SAS Viya. And I actually have
a different table that we want to use in order to
perform our linear regression. The reason I want to show
you this different way is because this is a much more
interactive point-and-click way of actually producing
linear regression. And then also, if I happen to
have big data, which I'm not using big data in
this example, but I'll be able to get those
answers very quickly. I'm going to come over
here to my data pane. And I'm going to
say that we want to open up the VS_BANK table. So I actually have some banking
data that has been anonymized. So that way, we can't tell
who this really belongs to. I certainly don't want you
to look at my bank account. And you can see that this is
a list of all of the inputs that are in that data. And it's broken down
into two sections. We have our categorical
variables on top, and we have our
measures down below. And what I'd like to
do is I'm going to add in a linear regression object. So I'm going to move
over to the Objects pane. I'm going to scroll down and
find my list of SAS Visual Statistics. Here's the Linear
Regression object. I can either drag and drop it
onto what we call the canvas, or I can just
double-click on it. I'm going to double-click
on Linear Regression. And then what
happens now is it's ready to perform a
linear regression, but we have a little message
that says the required roles have not been assigned. So let's come over
to my Roles pane. And let's pick a response. So what do I want to predict? In this banking
data, we actually have information about
all of the new sales that a customer has given
us in the last six months. So that's going to be
my target variable. So that's what I want
to be able to predict-- how much they're going to
spend for us in the future by looking at the past data. What am I going to use
to help us explain that? Well, it turns out we have a
lot of explanatory variables in this particular table. I'm going to focus
in on just one because we want to perform
a simple linear regression. So we're going to
look at the amount that they spent on
their last purchase, and we're going to use
that as the predictor. So I'm going to
click on OK here. And as soon as I've
filled everything out, we're going to get information
about that linear regression. Now, you'll notice that by
default, all of the outputs are put together in one panel. And I think this is a
little easier to look at if we actually break this up. So I'm going to
show you an option. We're going to come over
to the Options pane. And I'm going to scroll down,
and under the Model Display options, I'm going to change
this plot layout from a Fit to a Stack. And what that means is, hey,
take each one of those windows and put them on a different tab. So now, we have a lot more room
to examine what's going on. So let's start by taking a
look at the Fit Summary pane. We can see here this is
my one input variable, the last product
purchased amount. And what this very
long, green line tells me is that I have a
very small p-value. In other words, this item is
significant to this model. We have a 5% cutoff
line, which means that if that p-value was greater
than 5%, it wouldn't be green. It would actually be blue. And so we would think
that, hey, this really did not help me explain how much
somebody was going to purchase. But in this case, we have
a very long green line. The green line is actually
the minus log of the p-value. So if you think about that for just a second: if you take the log of a very small value, it's going to give you a value that's very large in magnitude, but negative. So we take the negative of that, and the minus log of a very small value becomes a very large positive value. And that's where that long, green line comes from.
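For example, assuming base-10 logs (the exact base the software uses is my assumption here):

$$-\log_{10}(0.0001) = -(-4) = 4$$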
Let's also take a look at my residuals. Now, here, I'm a little
concerned because in general, what we would hope is that
the majority of our residuals would be within this plus or
minus two range, those two horizontal lines there. This is actually a
studentized deleted residual, which means it's standardized. And so one of the
things that I see here that concerns me a little bit is
I am beginning to see what does look perhaps like a little
bit of a cornucopia. In other words, it looks like
the variation in my residuals is actually increasing as my predicted
value is increasing. And that's not a good thing. So that means that
maybe just this one term is not a good idea. We might need to do a
transformation on it. We might need to look
at a different term. But for now, let's just go ahead
and keep looking at our output. And then finally, we're
getting an assessment piece from Visual Statistics. And you can see that the green
line is the predicted value. That's what the
model's telling me. And the orange line is the
actual value, the observed average. That's what we're trying to get. And one of the things
that I notice here is I can see that in these
lower percentiles or the higher target sales, it looks like
my model is under-predicting, whereas in other areas, it
looks like it's over-predicting. So we might want to think
about a different tactic, but that's OK. Let's just keep
going because I want to show you one more important
piece of information. And that is remember when we
performed our linear regression in SAS Studio, we got the
equation for the line. Well, what is the
equation for the line? Where is that? That's actually in
this Details table that I can open up by clicking
on the Maximize button. And let me get rid
of some of this stuff that we wrote here earlier. There we go. I'm going to come over to
the Parameter Estimates tab. And aha! Those are my parameter estimates. So there is my slope and my y-intercept. So what does that mean? If we want to predict the sales, that's going to be equal to this intercept, which is about 8,940-- I'm just rounding now-- and then we're going to add to that about 187 times the last product purchase amount. So we'll just call that LPPA. And there's the formula for the line. It's kind of hidden in that Details table, but now we can see what it is.
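Written out with those rounded estimates, that's

$$\widehat{\text{new sales}} \approx 8940 + 187 \times \text{LPPA}$$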
There are also some assessment statistics. And we can see, for example,
the average squared error. But maybe we should save
that for another day. Thanks for joining me today. I hope you learned
a little bit more about how to perform a simple
linear regression in SAS. If you want to learn more,
don't forget to subscribe. Also, we have some great
information for you down below. You can click on the links,
find some interesting resources, and also, you know what? Let me know what you think. Ask me some questions. Leave me some feedback.