Hello friends, I welcome you all to the course
marketing research and analysis. In the last lecture, we covered concepts of correlation.
So, we have understood what is correlation and how it impacts or how it explains the
relationship between 2 variables or more than 2 variables. Sometimes, we just find out the
relationship between 2 variables and more and sometimes we partial out or we try to
control one variable and see the effect of other variables which was the case of partial
correlation. Today we are trying to discuss an another
important concept, very, very important concept and very highly utilized in all spheres of
education or academics be it management or be it social science, be it engineering wherever
you can think of and this is regression. So regression basically is very interesting because
this concept actually means to regress towards the mean or to moves towards the mean so what
it means something that there is a law of nature tends to average out everything. So,
if something has happened you got some high value that the next time when you do the same
exercise the chances are fair that you will get a low or lesser value which is closer
to the mean. So, regression means to regress or to where
does it regress towards the mean, that what it means, so what exactly is regression? It
is a very powerful and very flexible approach or procedure for analysing associative relationships
between a metric dependent variable and one or more independent variables. But it has to be it is like you know Y and
X and both this Y let us say could X 1, X 2 up to X n, for example so these are both
in metric, this is a metric variable and these also have to be metric this is how it is defined.
So it can be used the following ways to determine whether the independent variables, so these
are my independent variables, explain a significant variation in the dependent variable. So my
dependent variable; so any change in my dependent variable will be explained by a change in
the independent variable so what is the relationship, how much of the variation of the variance
in the dependent variable is explained through the independent variables. Second to determine how much of the variation
in the dependent variable can be explained by the independent variables first it is whether
the relationship exists and second the strength. The third to predict the values of the dependent
variable you need to find out calculate how much what is the change in the what is the
Y value for a particular X 1, X 2 or particular X value and the third and the last is to control
for other independent variables when evaluating the contributions of a specific variable or
set of variables supposed I want to find out the effect of the X 1 and Y, I want to control
X 2 or I want to take X 1, X 2 and I want to control may be X 3. So we will start with the simple linear regression
model, so as it says linear regression is the next step after correlation, so you have
understood correlation, very clearly you have understood correlation. Now it is the next
step, so it is used when we want to predict so this is regression analysis is also called
as predictive analysis, sometimes you must have heard the terms predictive analysis.
To predict the value of a variable, which variable now Y the dependent variable based
on the value of another variable which is the X the independent variable. The variable
we want to predict is called the dependent variable or the outcome variable. The variable, which was using to predict the
X is called the independent variable or the predictor variable. For example you could
use linear regression to understand whether exam performance can be predicted based on
the revision time. So how much time a student is using for revision for revising his subject
that will have an impact on how he is performing in the exam. Another example whether cigarette
consumption can be predicted, can we predict how many cigarettes does a person consume? Based on the time he is spending on smoking,
duration of smoking and so for. If you have 2 or more independent variables, rather than
just one we say it is a case of a multiple regression. Since there are multiple independent
variables, they are called multiple regression. Now what are the assumptions of simple linear
regression or regression for example, a 2 variables should be measured at the continuous
level I said that both Y and X we need to be measured in a continuous or metric variable
so which is either interval or ratio, examples of continuous variables include for example,
revision time measured in hours, intelligence measured in IQ score, performance measured
from 0 to100% or something, weight measured in kg etcetera. Assumption 2 there needs to be a linear relationship
between the 2 variables. That means what we are saying the linear relationship means they
are moving linearly that means, they have a proportionate the movement the change, there
is an equal change there is a proportionate change. Third there should not be any significant
outliers because if there is outlier it will drastically distort the entire relationship. You should have independence of observations
so this independence of observation can be checked by the Durbin-Watson statistics test
which is also a measure of autocorrelation. Now just forget this autocorrelation. So if
you want to check it you can check through this also, whether the presence of autocorrelation
is there or not and this helps you to find out independence of observation but just imagine
and understand independence of observation we mean that no variable or no respondent
is repeated for more than one time that means it has; it gets a chance to be a part of the
study for once only. Until and unless we are doing a repeated measured
design, so you had understood earlier experimental design so in that we are saying in our factorial
design whatever we are doing so there we are using the independence of observation assumption
that means one respondent can be part of study for once only and if the sample is repeated
for let say 2 times, so once for year 1 once for year 2 then it is called a repetition
of repeated measured observation. The last assumption is the data need to be
homoscedastic. What is homoscedastic? This is where the variances along the line of best
fit remain similar, now look at to this case, so homo means similar, so similar variance
that means the variance is similar and if the variance is not similar it is hetro so
these 2 cases are hetroscedastic you see the variance here and you see the variance here. So there is a large gap, this is significant
difference. But look at the gap of the plots are the data points here and here everywhere
it is more or less the same. This homogeneity of variance is very; very important assumption
right, homoscedastic or the homogeneity of variance is very important assumption in any
statistical analysis. One more thing is you need to check the residuals
or errors. Now what are the errors? Please understand them, many of times students get
confused. The error is basically the unexplained part in a study. Now let me explain it through
may be a diagram if possible so what I will do is I will draw it here, I have some space.
So Let us say this is my X and Y, this is my Y, this is my X, now let us say I have
this is my Y mean, so the value of Y is the average value of Y. Now let us take a variable and we are assuming,
now this is where I am saying is my Y estimated, now what is Y estimated? Means the Y which
you have calculated, the calculated value of Y the dependent variable that you have
calculated, now let us take a variable here, now this is my Y observed, so although my
estimated calculated Y is this much, my mean is this much but this is my real Y or my observed
Y. Now if you look at this part, this entire
thing, this total from here observed minus the mean, this part is called my total variance.
Now so Y = Y – Y bar summation of this, similarly if you look at the Y, this Y bar
estimated Y - the Y bar, So Y estimated - Y mean this gives me what, now this is the one
which we call as the explained variance, so explained variance. So this is called sum
of square of regression also it is sometimes it is donated as sum of square regression.
What is this part? The Y the actual Y - the estimated Y, now this part is what is called
my unexplained variance and this unexplained variance is called as sum of square of error
or residual. So, when I said this keep residual is nothing
but this part, so you need to check the residuals of the regression line are approximately normally
distributed, this should be normally distributed, 2 common methods to check is through a normal
PP plot or histogram or I have shown explained when I was taking about data purification
you can easily go and calculate and find out whether the live in the range of -2 to +2
and if it is yes then it is within the range and then we say it is normally distributed
if it is beyond -2 to +2 then we say is not normally distributed. Otherwise just you can go through a histogram
or a normal plot and PP plot and find out. Let us take this case. A salesperson for a
large car brand wants to determine whether there is a relationship between an individual’s
income and the price they pay for a car. So you watch now is there is any relationship
between individual income and price. As such, the individual income is the independent
variable and the price they pay is my dependent variable. The salesperson wants to use this
information to determine which cars to offer to the potential customers. Suppose a customer
is a very income customer so which car should we will be shown, so to offer potential customer,
new area where average income is known to them. So as it says regression analysis involves
one independent variable and one dependent variable which you have understood, in which
the relationship is approximated by a straight line. Now what is this straight line? So let
me explain that point, so for example if you see is something like this so we say this
is the regression line let say and this line is called the best fit line. Why it is called
the best fit line? Because there could be infinite lines but
this is line which we are saying the best line why because if you take all the data
points and calculate the variances, the distance from this line then you will see it is in
this line that the variance is the minimum, if you would had taken any other line like
say this one or let us say here or somewhere here then the variance would be more in comparison
to when you take this line as the regression line. So we have understood that this is a case
of simple linear regression and when we are using more than 2 independent variables, 2
or more then we say it is a case of multiple regression analysis. Now let us go to the
example Anand pizza parlour is a chain is a food restaurants located in 5 states. The
most successful locations are near college campuses. The managers believe that quarterly sales
for these restaurants are related positively to the size of the student population, which
is X, that is restaurants near campuses with a large student population tend to generate
more sales than those located near campuses with a small student population obviously
you know pizza’s are more liked by the students. So now using regression analysis, we can develop
an equation showing how the dependent variable Y is related to the independent variable X.
So what is Y? Sales, what is X? My size or population. So now in Anand’s example the population
consists of all the Anand’s restaurant for every restaurant there is a value of x student
population and a corresponding value of y sales. The equation that describes how y is
related to x and error term in the model, so you see y = b0 or this b right, + b1x +
e, now what is this let us explain where b0 or b0 and b1 are referred to as the parameters
of the model so this parameter is called my intercept and this is called my slope. So I did you had more number of x for example
x 1, x 2, x 3 so there would be b1, b2, b3 goes on but what is this error I just now
explained, the unexplained variance or the residuals or the errors ok, so these are the
thing we have the Greek letter epsilon is a random variable referred to as the error
term this one. The error term accounts for the variability in y that cannot be explained
by the relationship between x and y. So let us understand again let us draw a line
this is my let say this is a regression line. Now this is called my intercept, the one we
choose as seen here b0 and this is my b1, the slope, so y is the b1 x which we are talking
about so this is the Beta 1 so the slope. Now continuing the population of all Anand’s
restaurant can be viewed as a collection of subpopulations, one for each distinct value
of x, so x was my size of the population. For example one subpopulation consists of
all the restaurants of Anand located near college campuses with 8000 students, and so
on. Each subpopulation has a corresponding distribution of y values. So, for each value of x there will be a corresponding
value of y obviously, if x is 8000 the y is something let us say y 1, If it is 7000 let
us say y 2, it is 9000 let us say y 3 so we have corresponding values, thus a distribution
of this y values, this distribution of this y values y 1, y 2, y 3 goes on y n is associated
with restaurants located near the campuses. So each distribution of y values as it is
own mean or expected value that is what I was trying to say when I started the lecture
that regression is regressing towards the mean, so what is the mean or excepted value
the equation that describes how the excepted value of y is related to x is called the regression
equation. So the regression equation is E(y) = b0 + b1 x that means my intercept + my slope
* the independent variable. Now this is about the relationship you see
in correlation also you have seen this so if you see, if I take a slope between x and
y, so this part is my intercept as I shown b0, and this is my regression line, these
are the regression lines, so in this case you see this case the b1 is positive. So when
it is positive we say it is positive linear relationship that means as x is increasing
y is also increasing. But in this case as x increasing you see the value of y is decreasing,
so the slope is negative in this case, it is a negative linear relationship. The third
case you see with the change in x, there is no change in the y, so the slope is 0, so
this is no relationship. If the values of the population parameters
b0 and b1 the intercept and the slope were known, we could use previous equation to compute
the value of y for a given value of x. Obviously if I need to find the value of y the estimated
y, the estimated value of the dependent variable and I know my x but I do not these 2, I need
to calculate. So in practice, the parameter values are not known and must be estimated
using the sample data. Since you cannot infer it from the population
is difficult so you have the sample so let us use the sample and because it represent
the population and calculate the b0 and b1 the slope. So sample statistics b0 and b1
are the reflection of the same b0 for the population. So are computed as estimates of
the b0 and b1. Substituting the values of the sample statics b0 and b1 for 0 so that
means the intercept is 0, so if intercept is 0 that means it is starting from the origin
and 1 in the regression equation. We obtain the estimated regression equation, so this
is how it is looks like. Now the least square method, so least square
is a procedure for using sample data to find the estimated regression equation. So, in
the case of this Anand’s case there are 10 restaurants, student population in 1000
is given, 2000, 6000, 8000, 8000, 12000 up to 26000. The sales are given to us. So now
it says 10 Anand’s pizza restaurants data is given, and the xi is the size of the student
population in 1000’s, yi is the quarterly sales in 1000’s of dollars. So the values
of xi and yi for the 10 restaurants are given to us. Now if I draw a just I draw the data points,
if I draw the data points this is what it says, student population is shown on the horizontal
axis and quarterly sales is shown in the y axis. So scatter diagram is just representation
of the data points, for the regression analyses are constructed with the independent variable
x and y. The scatter diagram enables us to observe the data graphically and to draw preliminary
conclusion you can draw any final conclusion but preliminary conclusion about the possible
relationship. Now just see this and tell me what is the
relationship you think? So this is a positive relationship, it is growing, as x is growing
y is growing, if this is my regression line so this is a positive relationship, from this
figure we can find out that Now quarterly sales appear to be higher at
campuses with large student populations, in addition, for these data the relationship
between the size of the student population and quarterly sales appears to be approximated
by a straight line, indeed, a positive relationship. We therefore choose the simple linear regression
model to represent the relationship between quarterly sales and student population. Given that choice, our next task is to use
the sample data and determine the value of b0 and b1 the simple parameters in the estimated
equation. If we find out these for suppose b0 and b1 then for any x 1 we can find the
value of y. So y is my estimated value of quarterly sales, b0 the y intercepts, b1 the
slope and x 1 is the student population size. So the criteria is saying y this is what I
was explaining, the minimum the criteria least square criteria is the summation of (yi - y
?)2. So I had explained you so least square what
it is says the minimum, minimize this, what is this minimize is just go back let us go
back to the diagram, so I said earlier this was my estimated y, this is my y bar and we
have a observed value y, so this part was my unexplained part. So the intension of the researcher is always
to minimize this part and to reduce it as much as possible, so that means what our estimated
value should be able to incorporate or accommodate the actual observed values. So y is my observed
value of the dependent variable for the ith observation and yi is my estimated value. So if my observed value minus my estimated
value is equal to 0, that means what observed value is equal to my estimated value that
means what we can say that it lies on the same point which means that there is no unexplained
variance, everything is explained in the study. It is a very important concept because the
more unexplained you have that means the researcher has very little control over the research. So how do you calculate the slope and intercept?
First let us calculate the slope, so the formula is if you see
b1 = summation of [(xi – x¯) (yi – y¯)] and divided by the summation of (xi – x¯)
2. So this is the xi is my value for the independent
variable for ith observation, yi is value for the dependent variable for the ith observation,
x bar is mean value for the independent variable, y bar is mean value for the dependent variable,
n is my total number of observations. Once we get this and I have let us say y and x
the estimated, the mean value of y we have and the x value then we can calculate b 0.
So let us do this. So first we find out these are the 10, so
this is my xi the values which earlier also we have shown and this is my y. Now first
we calculate the x¯ so x¯ is how much 140/10 so that is equal to 14, y¯ is how much 1300
/ 10 = 130. We need (x – x¯) so we find out x - 14 so 2 - 14, 6 - 14, 8 - 14, 8 - 14,
12 - 14 goes on till 26 - 14, so x - x¯ similarly y - y¯ 58 - 130, 105 - 130 and goes on till
202 - 130. Now we want (x - x¯ )* (y - y¯). So we can do this, calculate this multiply
and we have found out this. Now (x - x¯)2 is what we are finding out so (x - x¯)2,
12 square is 144, 8 square 64 goes on. So after finding everything, see this is the
formula, so we have got everything now with us, so let us use the formula So this part (x - x¯) * (y - y¯) this is
how much 2840 / 568 so my slope is equal to 5. Now we calculate the intercept, so calculate
the intercept b 0 we have y bar is my mean, so 130 - 5 the slope * x¯ the mean is 14
here so this gives me the value of the intercept is 60 what is the intercept mean it means
that whenever you do not have any value for x that means when x is 0 still there is some
value for y and this value for y is nothing but the intercept, the meaning of this is
that. When x = 0 whatever value of y remains that
is my intercept. So the estimated regression equation is now y estimated is equal to 60
this is my slope + 5 the slope this looks like, how do you look b0 + b1x so b 1 is my
5 * x. So for any new variable any new value of x now you can calculate the value of y,
you can estimate. So for example in this case so let us say
b1 is positive. Implying that as a student population increases, sales increase because
the slope is increasing, in fact we can conclude that an increase in the student population
for 1000 is associated with an increase of 5000 in excepted sales that is quarterly sales
are expected to grow by 5$ per student. If we believe the least squares estimated regression
equation adequately describes the relationship between x and y, it would seem reasonable
to use the estimated regression equation to predict the value of y for a given value of
x. For example, if we wanted to predict the quarterly
sales for a restaurant to be located near a campus with 16000 students, we would compute
as how y is equal to estimated y is = 60 which is my intercept + 5 is my slope * 1000 we
are taking it into only the numeric is 16 so 16000 because that is how you have written.
So finally this becomes 140000 the sales 140000. So this is what we understand, so this is
how graphically this is how it is shown the regression equation so this is my slope and
this is my intercept everything is shown here. Now I will show you how to just do it in the
SPSS. First let me show in the slide also, so how do you do it in the regression, go
to analyze regression, go to the linear model and then you take what you want as the dependent
you take it here and what you to take it independent take it here. So here price and income this
is just arbitrary example so we have taken and then we need to check. So let me show this and the model summary,
obviously this later on we will come. So, but if you want to understand this understand
you will get such kind of descriptive table model summary which this is the R and if you
remember this R we had said this is related to the multiple correlation value, anyway
let us go to this what is the R square what is the adjusted R square explain. R square
is nothing but the square of this value and adjusted R square is something very interesting
which let I will explain that adjusted R square is a value which goes on increasing up to
a particular level as you increase the number of variables but then after a certain point
of time when you add more variables the adjusted R square value actually it is stared declining.
That means either no change or it is start declining because the point is that adjusted
R square only accommodates those variables which contribute to the data or to the dependent
variable or to the study. So let us go to this simple regression. So simple regression I go to analyze, this
case is IQ and Grade. So is Grade effected by IQ, so let us see so I take Grade as my
dependent variable and IQ as my independent variable. So which this is a very basic regression
we are doing and we are understanding. So let you can go to statics and you can see
there are several things I want to check descriptive for example R square change is not required
here. For example independent observation you can
check it through a Durbin Watson which is an auto correlation again. Collinearity I
will explain it later on, what is the role of Collinearity, what is multiple Collinearity,
multi collinearly problem I will explain it later but forget it for a moment. So you want
to anything else now. I do not want to do any change here so I just want to run it. So I see, you see if you look at this now
first what it saying the Grade the mean is 30.67 and standard deviation is 9.2, IQ 49.5,
12.9. Now let us look at the correlation, what is the correlation between Grade and
IQ it is 0.498 and is it significant? Yes it is significant at a 0.007 level. Now look
at this the model summary, So my R can you go up and see so here you have got a correlation
of .98 and here R is .98. That is what I was trying to explain many
a times student do get confused how is this R connected, is this R connected with the
correlation or not so well this is the output, this is same as the correlation, multiple
correlation coefficient or the correlation coefficient. So R is .948 in this case and
does the R square; now R square is nothing but my coefficient of determination. So if I divide this 1 - coefficient of determination
I will get something which can be explained as strength of the test. So, .248 but look
at this adjusted R square that means what, when I am taking all variables it is .248
and remember the R value will go on increasing. The R square value and the R value will go
on increasing as you add more and more number of variables. But the adjusted R square, will not, it will
remain same or it will not increase. So then we say well this is how the models look like,
now this is the ANOVA, ANOVA means it is variances I will explain you may be later on. These
2 terms ANOVA and regression are very strongly correlated also, you can understand each other
through one from the other and now what is the coefficient. Now there is 2 coefficients
you can see unstandardized Beta coefficient and standardized Beta coefficient. Now this is the t value and this is my significance.
If you look at this IQ, so IQ the unstandardized coefficient is 0.358, the standardized coefficient
is 0.498 which you already got standardized coefficient is my correlation and my t value
is 2.694 and it is significant, that means what we can say that IQ plays a significant
role in the Grade of a student. So, if higher the IQ because it is positive relationship,
the higher is the students Grade. So this is what it explains this you know
this is the simple regression model and it has explained you how to find out the value
of the dependent variable from by changing the value of the independent variable and
I have explained to you what error terms mean and you should not be afraid of the word error,
it is not actually the error it is an unexplained variance. So what is the relationship between
explained and unexplained I have explained all. So I think this is the just a beginning for
the regression class, we will be doing more forms of regression, regression can be used
in multiple ways so as I said it starts with a basic that both the dependent and independent
variable have to be metric in nature but we will see several special cases where it might
not be the independent variable might be in some other format may be in a categorical
scale but we can still do it, so how we will do it all we will see in the later on future
classes in the upcoming classes for today we will close it here, thank You very much.