Welcome to lecture number four. In this lecture we will discuss how to estimate the parameters of a linear regression model. In the earlier lecture we had discussed that there are three parameters: beta0, beta1 and sigma^2. If you try to recall, in the earlier lecture we had taken the model y = beta0 + beta1 x + epsilon, and we had obtained n observations, say (x1, y1), (x2, y2), ..., (xn, yn), and we assumed that all these observations satisfy yi = beta0 + beta1 xi + epsilon_i; this is the model they are going to follow. If you also recall, we had drawn a diagram: this was the x axis, this was the y axis, we had observed points something like this, and so on, and we wanted to fit a line through them, something like this.
We had labelled the points: this is (x1, y1), this is (x2, y2), and so on. In more technical terms, this is the line which we want to fit, and it is essentially the line y = beta0 + beta1 x. In this case we had also assumed that each epsilon_i has mean zero and variance sigma^2, and that the epsilon_i are iid, that is, identically and independently distributed. At this moment I am going to make a further assumption: that the epsilon_i are iid and follow a normal distribution. Briefly, I can write that the epsilon_i are iid N(0, sigma^2). This means that every epsilon_i has been observed from the normal probability density function with mean zero and variance sigma^2, and we also assume that epsilon1, epsilon2, ..., epsilon_n are mutually independent of each other.
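To make these assumptions concrete, here is a minimal simulation sketch (my addition, not part of the lecture); the values of beta0, beta1, sigma and n below are arbitrary choices for illustration only.

```python
import numpy as np

# Minimal sketch: simulate n observations from
# y_i = beta0 + beta1 * x_i + epsilon_i, with epsilon_i iid N(0, sigma^2).
# beta0, beta1, sigma and n are illustrative values, not from the lecture.
rng = np.random.default_rng(0)
beta0, beta1, sigma, n = 2.0, 3.0, 1.5, 50

x = rng.uniform(0, 10, size=n)          # observed regressor values x_1, ..., x_n
epsilon = rng.normal(0, sigma, size=n)  # iid N(0, sigma^2) random errors
y = beta0 + beta1 * x + epsilon         # observed responses y_1, ..., y_n
```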
I would like to make one note here. When we go for least squares estimation, this assumption of a normal distribution will not be used. It is only when we go for tests of hypotheses and confidence interval estimation that the normality assumption will be needed, and later on, when we do maximum likelihood estimation, we will require the normality assumption right from the first step, so you have to keep that in mind. I will explain this again as soon as I come to maximum likelihood estimation and ordinary least squares estimation. So under this setup we now try to estimate the parameters; our objective is the estimation of the parameters, and you have to keep in mind that there are three parameters, beta0, beta1 and sigma^2, that we want to estimate.
Now I am going to use two methods, or two approaches: one is the method of least squares and the other is maximum likelihood estimation. First we try to understand the method of least squares. In this graphic, we had said that this is the random error associated with the first observation, denoted epsilon1, and similarly this is epsilon2, and so on. So in every observation there is some random error, and the principle of least squares says that I would like to find this line, this orange line, in such a way that these random errors are as small as possible and most of the points lie as close as possible to the line. So in the first case I use the method of least squares, and let us first try to understand what least squares estimation is.
The principle of least squares says that we try to find the values of the parameters in such a way that the total error is as small as possible and most of the points lie close to the line. If you look at this picture, the random error in the first observation is epsilon1, in the second observation it is epsilon2, and so on. So, in case we try to minimize the total error, the total error is the summation over i from 1 to n of epsilon_i. But can we really do that? Does minimizing this quantity make any sense? We had assumed that some of the errors are in the positive direction, that is, above the line, and some errors are in the negative direction, lying below the line. So if we sum them up, the sum may come out very close to zero, and that would wrongly indicate that my observations have no random error.
That is wrong, so this idea does not work here; it is not meaningful. So what do we do instead? Let us try to minimize the summation over i from 1 to n of epsilon_i^2. Does this make any sense? The answer is yes. Why? Because the problem we faced earlier was that some of the random errors were negative; once I square them, the negatives become positive, and now I can meaningfully minimize the sum.
Well, at this stage you might ask: if I am converting my negative random errors into positive ones, then another option is to take the absolute value of the errors. Yes, that is also possible; you can minimize the sum of absolute errors, that is, the summation over i from 1 to n of the absolute value of epsilon_i. This is also available in the literature and is called the least absolute deviation estimation technique, but in this course we are not going to discuss it. So we will consider obtaining the values of the parameters by minimizing the sum of squares of the random errors. The next question is: how do we minimize it?
Well, I can use the principle of maxima and minima. Let us try to use it to obtain the values of beta0 and beta1. Let me write the sum of squared errors as a function S of beta0 and beta1: S(beta0, beta1) = summation over i from 1 to n of epsilon_i^2, which can also be written as the summation over i from 1 to n of (yi - beta0 - beta1 xi)^2.
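Just to fix ideas, here is a small sketch (my addition, not from the lecture) writing the three criteria we have discussed as Python functions; the function names are only illustrative.

```python
import numpy as np

def sum_of_errors(beta0, beta1, x, y):
    # Raw sum of errors: positive and negative errors cancel, so a value
    # near zero does NOT mean the line fits well -- this is why it is not used.
    return np.sum(y - beta0 - beta1 * x)

def sum_of_squared_errors(beta0, beta1, x, y):
    # S(beta0, beta1): the least squares criterion minimized in this lecture.
    return np.sum((y - beta0 - beta1 * x) ** 2)

def sum_of_absolute_errors(beta0, beta1, x, y):
    # Criterion of least absolute deviation estimation (mentioned, not covered here).
    return np.sum(np.abs(y - beta0 - beta1 * x))
```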
The principle of maxima and minima says that we need to obtain the partial derivatives of S with respect to beta0 and beta1, set them equal to zero, solve, and then check, using the second-order conditions, whether the solution gives us a maximum or a minimum. That is exactly the rule we are going to follow. So first I obtain the partial derivatives. The partial derivative of S with respect to beta0 comes out to be minus twice the summation over i from 1 to n of (yi - beta0 - beta1 xi). Next, I partially differentiate S with respect to beta1, and this comes out to be minus twice the summation over i from 1 to n of (yi - beta0 - beta1 xi) xi. Now I set both of these equal to zero and solve.
Let us call these equation number one and equation number two. If I try to solve equation number one, it can be handled as follows. Once I open the bracket, it gives me: the summation over i from 1 to n of yi, minus n times beta0, minus beta1 times the summation over i from 1 to n of xi, equals 0. Dividing through by n, I can write y bar - beta0 - beta1 x bar = 0, and solving this gives beta0 = y bar - beta1 x bar. But this beta0 can be known to us only if beta1 is known, and up to now we do not know beta1. So now I try to solve equation number two and see what we obtain.
Let us consider equation number two and solve it. Equation number two is: the summation over i from 1 to n of (yi - beta0 - beta1 xi) xi = 0. If we open the bracket, substitute beta0 = y bar - beta1 x bar from equation one, and solve, we get beta1 = (sum of xi yi - n x bar y bar) / (sum of xi^2 - n x bar^2), where both sums run over i from 1 to n. If I simplify, the numerator is nothing but the summation over i from 1 to n of (xi - x bar)(yi - y bar), and the denominator is the summation over i from 1 to n of (xi - x bar)^2. Keep in mind that x bar and y bar are simply the sample means; based on whatever observations we have obtained, I can compute these sample means, so x bar and y bar are known to us.
So now I can note one thing: when we stated our model y = beta0 + beta1 x + epsilon, the parameters beta0 and beta1 in that model were unknown to us. But now I can see that once I have the observations, I can use them to compute a value for beta1. So I take this as an estimator of beta1; in simple words, an estimator is a rule that gives the value of a parameter on the basis of a given set of data. So I have a parameter, beta1, whose value is completely unknown, and I am saying that using my observations I can compute its value from the expression written here; this is an estimator of beta1. For the sake of simplicity, let me rewrite it as beta1 hat = Sxy / Sxx, where Sxy is nothing but the summation over i from 1 to n of (xi - x bar)(yi - y bar) and Sxx is the summation over i from 1 to n of (xi - x bar)^2.
We are going to use this notation in the later lectures. So what we have seen is that we have obtained the value of beta1 as beta1 hat. Now, the value of beta0 that we obtained on the earlier slide, beta0 = y bar - beta1 x bar, can be known to us only when beta1 is known, so I replace beta1 by beta1 hat and write beta0 hat = y bar - beta1 hat x bar. Using this expression I can now also estimate the intercept term, so this beta0 hat is an estimator of beta0. Both beta0 hat and beta1 hat have been obtained from the principle of least squares; in this case we have minimized the vertical distances between the observed values and the line, as you can see here. They are therefore also known as direct regression estimators: beta0 hat is the direct regression estimator of beta0, and beta1 hat is the direct regression estimator of beta1. They are also called the least squares estimates, or least squares estimators, of beta0 and beta1.
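To summarize the computation, here is a minimal Python sketch (my addition, not from the lecture), assuming the observations are held in NumPy arrays x and y, for example the simulated ones above.

```python
import numpy as np

def least_squares_estimates(x, y):
    """Direct (least squares) estimates of beta0 and beta1 from data arrays x, y."""
    x_bar, y_bar = x.mean(), y.mean()
    sxy = np.sum((x - x_bar) * (y - y_bar))   # Sxy
    sxx = np.sum((x - x_bar) ** 2)            # Sxx
    beta1_hat = sxy / sxx                     # beta1 hat = Sxy / Sxx
    beta0_hat = y_bar - beta1_hat * x_bar     # beta0 hat = y bar - beta1 hat * x bar
    return beta0_hat, beta1_hat

# Usage (x and y are NumPy arrays of observations):
# beta0_hat, beta1_hat = least_squares_estimates(x, y)
```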
Well, we have obtained these estimators, but we do not yet know whether the values beta0 hat and beta1 hat are really minimizing the sum of squared errors or maximizing it. For that we have to check the second-order condition. Here we have two parameters and we are estimating them jointly, so we need to look at the Hessian matrix, the 2 x 2 matrix of second-order partial derivatives of S: the second-order partial derivative with respect to beta0, the mixed partial derivative with respect to beta0 and beta1, and, on the other diagonal entry, the second-order partial derivative with respect to beta1. This matrix has to be evaluated at beta0 = beta0 hat and beta1 = beta1 hat; so I simply differentiate once more and substitute beta0 hat and beta1 hat, obtained from the normal equations, into it. In fact, the estimators provide us a global minimum of S, so we have indeed obtained the least squares values of beta0 and beta1.
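The lecture does not write the Hessian out explicitly; as an addition for completeness, with S as defined above, its entries are:

```latex
H =
\begin{pmatrix}
\frac{\partial^2 S}{\partial \beta_0^2} & \frac{\partial^2 S}{\partial \beta_0 \,\partial \beta_1}\\[2pt]
\frac{\partial^2 S}{\partial \beta_1 \,\partial \beta_0} & \frac{\partial^2 S}{\partial \beta_1^2}
\end{pmatrix}
=
\begin{pmatrix}
2n & 2\sum_{i=1}^{n} x_i\\[2pt]
2\sum_{i=1}^{n} x_i & 2\sum_{i=1}^{n} x_i^2
\end{pmatrix},
\qquad
\det H = 4\, n \sum_{i=1}^{n} (x_i - \bar{x})^2 > 0 .
```

Since the first diagonal entry 2n is positive and det H is positive whenever the xi are not all equal, H is positive definite, which is why the stationary point (beta0 hat, beta1 hat) is a minimum rather than a maximum.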
So, if I try to write it down compactly: we had the model y = beta0 + beta1 x + epsilon, and we have obtained a fitted model, y hat = beta0 hat + beta1 hat x. This is called the fitted regression model. After this, I have to obtain the fitted values. What is a fitted value? You see, when we conducted the experiment and obtained the data, there was some difference between the observed data and the line. If I draw the earlier diagram once again: this is the x axis, this is the y axis, this is the line, and the observations are lying somewhere here, here, and so on. If you look at this point, suppose it is (x1, y1); we had observed the value (x1, y1), but we expect the corresponding value to lie somewhere here, on the fitted line. The value of y which is obtained from the fitted line using the observed value xi is the i-th fitted value.
Well, let me explain with a simple example. Suppose I have some data, with columns xi and yi, and suppose I have four pairs of observations: for xi = 1 I obtain yi = 6, for xi = 3 I obtain yi = 10, for xi = 6 I obtain yi = 22, and for xi = 7 I obtain yi = 21. Now suppose that after fitting the model, that is, after obtaining the values of beta0 hat and beta1 hat on the basis of these four pairs of observations, I get the fitted model y hat = 2 + 3x. Using this model and the observed xi, I can now obtain the fitted values yi hat.
How do I obtain them? For example, y1 hat is obtained from this model as 2 + 3 times x1, where x1 is 1, so y1 hat = 5. Similarly, y2 hat = 2 + 3 times 3 = 11, y3 hat = 2 + 3 times 6 = 20, and y4 hat = 2 + 3 times 7 = 23.
After this I can plot the point (x1, y1 hat), and so on; these are nothing but my fitted values. If you observe what these values are: I have simply fitted the model on the basis of the given set of data, and then, using the given values of xi, I obtain the values yi hat. So the yi hat are the values of y obtained from the model, and they are called fitted values. I can write the fitted value as yi hat = beta0 hat + beta1 hat xi, that is, the fitted model evaluated at the given value x = xi; this is the i-th fitted value.
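As a small sketch of this computation (my addition), using the supposed fitted model y hat = 2 + 3x from the example:

```python
import numpy as np

x = np.array([1, 3, 6, 7])       # observed xi from the example
y = np.array([6, 10, 22, 21])    # observed yi from the example
beta0_hat, beta1_hat = 2.0, 3.0  # the supposed fitted model y hat = 2 + 3x

y_hat = beta0_hat + beta1_hat * x   # fitted values yi hat
print(y_hat)                        # [ 5. 11. 20. 23.]
```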
Let us now look at a different aspect. I can find the difference between yi and yi hat: yi is the observed value and yi hat is the value of y obtained from the model. If I denote this difference by ei and define ei = yi - yi hat, then the first one is 6 - 5 = 1, the second is 10 - 11 = -1, the third is 22 - 20 = 2, and the fourth is 21 - 23 = -2. These values are called residuals.
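These values can be computed directly; again only an illustrative sketch (my addition), reusing the numbers from the example:

```python
import numpy as np

y = np.array([6, 10, 22, 21])        # observed values yi
y_hat = np.array([5, 11, 20, 23])    # fitted values yi hat from the example
e = y - y_hat                        # residuals ei = yi - yi hat
print(e)                             # [ 1 -1  2 -2]
```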
So ei is nothing but the difference between yi and yi hat, and in general I can define the residual e as the difference between the observed and the fitted value. Now, this residual has a very important property. If you observe in this picture, the difference between y1 and y1 hat, which by this definition is e1, is the distance that we had earlier denoted by epsilon1. So you can see that the residuals are going to act as if we had observed the random errors in the data. Remember one thing: residuals are random variables and errors are random variables, and I am not estimating the random errors by the residuals, but the residuals will look as if they were the observed values of the random errors, and they help us a great deal in obtaining information about the random errors. We will discuss this in the forthcoming lectures; till then, goodbye.