Statistics with R (1) - Linear regression

Captions
Hello everybody. Today we are going to use the software program R to perform a linear regression. It's going to be very simple: I'm going to show you the first steps in working with R and setting up a linear regression model. In later videos this will become more complex, and we're going to look at multiple regression models with and without interactions among continuous and categorical variables, but for now we keep it very simple.

You can see here I've just opened the interface of R. It's the 32-bit version on a Windows system, and when you open R you can see the console; it's version 3.1, so the most recent one that's available up to now. We end up in the console window here, and in order to fit a linear regression model we have to use the lm command, which comes with the base installation (in the stats package) and is already loaded. Before we can start with that, we open a new script window; that's the usual way to start working in R. For convenience we arrange the windows, so we tile them vertically: the left-hand window is going to be the editor and the right-hand window will be the console.

Now we can start typing, and you may wonder which kind of data we are going to use. We're going to use a data set that already ships with R, with your personal R installation. To browse through the data sets we just use the command data, brackets open, brackets closed, that is data(), and you can see that a window pops up. There's a wide range of data sets available already with which you can play a little bit. We're going to use the so-called airquality data set, so we type airquality in the same line, giving data(airquality), and this loads the airquality data set. What I'm doing here: every time I type something on the left-hand side, I press Ctrl+R to send it over to the console on the right-hand side. If you don't want to press Ctrl+R, you can also push this button
in the middle here, "Run line or selection"; this has essentially the same effect.

All right, so we loaded the data set. Where is it? Let's just see the names of the airquality data set, names(airquality), so that we can get a feeling for it. Ctrl+R, and you can see that there are several variables in this data set. We can copy these variable names over to the left-hand side so that we know what we have in here. Don't forget to put a hash in front so that this line is not interpreted by R as containing any command. All right, so we know this data set contains the variables Ozone, then something with solar radiation (Solar.R), Wind, Temp, Month and Day. We're not going to focus on month and day; today we're just going to look at the relationships among ozone, solar radiation and wind.

The first thing to do before we fit any regression model is to plot the data. So we use the plot command, brackets open, and we know that the response variable is going to be the ozone concentration and one potential explanatory variable would be the solar radiation. If we just use this command and press Ctrl+R, R will moan and say object 'Ozone' not found. What has happened here? We haven't told it which data set to use, so we have to say data = airquality in here. If we do that we get a plot, and we can see that there is definitely going to be a nonlinear relationship between ozone and solar radiation: at some levels of solar radiation the ozone concentration is higher than at lower levels of solar radiation. Just for the sake of simplicity we don't do anything about this non-linearity here; we just pretend that this relationship is linear, and we use the lm command to fit a linear regression line to these data.

The first thing to do is calculate the mean of the ozone concentration and plot that line, because that would be the null hypothesis: there is no relationship between solar radiation and ozone. So let's do that.
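The steps up to this point can be sketched in a short script. This is only a sketch: the captions do not show the exact plot call, so the formula notation and the axis labels below are assumptions.

```r
# Load the airquality data set that ships with R and list its variables.
data(airquality)
names(airquality)   # "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"

# Scatter plot of the response (Ozone) against one explanatory variable
# (Solar.R). Without data = airquality, R cannot find the object Ozone.
# The formula notation and the axis labels are assumptions for illustration.
plot(Ozone ~ Solar.R, data = airquality,
     xlab = "Solar radiation", ylab = "Ozone concentration")
```

The data argument is what tells plot where to look up Ozone and Solar.R, which is why the call without it fails.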
First we have to calculate the mean of the ozone concentration, and be reminded that we always have to say from which data set this variable is to be taken, so airquality dollar Ozone (airquality$Ozone). We want to calculate the mean of the ozone concentration across the whole airquality data set, and it happens to be NA. Why is that? Well, if we just type airquality$Ozone on the left-hand side we can see the variable with all its values, and we can see immediately that it contains NAs, which means "not available": there are missing data in this variable. So in order to calculate the mean we have to get rid of these missing values. What we do is type mean(airquality$Ozone, na.rm = TRUE); we want to remove the missing values, NA dot remove equals true, and if we do that we get a value of 42.

Now let's go back to our window here, the so-called graphics device, and check whether that's reasonable. An ozone concentration of 42, could that be a true mean of these data? We don't know; maybe a geometric mean would be better here. Anyway, we just plot this mean value in the graph now. How do we do that? Let's give this mean a name, let's call it mean.Ozone, so mean.Ozone equals this thing. We don't need this line, airquality$Ozone, anymore, so just delete what we don't need, and let's already comment our script: use the hash and write "calculate mean ozone concentration, NAs removed". Okay, so we have this here now, and given that we've already opened the plot window, all we have to do is use the abline command and say h = mean.Ozone. What could that mean? It means plot a horizontal line, h for horizontal, at the mean value of ozone. So let's do it, Ctrl+R, and there you go: this is what the null model would look like if there were no relationship between solar radiation and ozone. All right, and the next thing is fairly easy to
do: we use a linear model to fit a regression line, so "use lm to fit a regression line through these data". Be reminded that, given that we have this Courier font here, the l in lm looks a bit like a 1. So please be very careful when typing that you really use lm, because no object name in R is allowed to start with a number; if you type 1m that would not be good, use l instead.

Okay, so the principal structure will be lm, response variable, tilde, explanatory variable. What's the response? In our case the response variable is what we measured, what we're interested in: this is the ozone concentration, and it happens to be called Ozone. So Ozone tilde Solar.R, data = airquality. We're not done yet: we want to give this lm object a name, so call it model1 equals lm(Ozone ~ Solar.R, data = airquality). That's all we need in order to set up a linear regression model. Run it; you can see that there is no error popping up on the right-hand side, so we've done it correctly.

If we just type model1 now, we see what is stored in this object. First of all there is the call, which is what we typed in order to fit the linear model. You see the formula here, a formula object; everything in R is an object. And we see the so-called coefficients. What would that be, for example this thing here called (Intercept)? Let's look again at our graph. The intercept is the value of ozone concentration for zero solar radiation. Just imagine we would fit a line through here by eye: this intercept would be potentially zero, potentially a bit higher than zero, and what the computer estimates is an ozone concentration of around 18.6. And what's this other term here, called Solar.R? This is the slope, delta y over delta x, so the increase of ozone concentration with increasing solar radiation. So, before we inspect this object further, this
model1 here, let's plot the line: abline(model1), and let's give it a different colour, col = "red", and see what happens. It's very important that you keep this plot window open, so you always start with a plot and then add your lines; otherwise you will get an error message. Here you go: the abline command on this model has produced a line that goes through here. What you can see immediately is that the variance increases with the mean: with higher solar radiation the spread of the ozone values increases, and we should be able to see that also when we look at the residuals.

So let's look at the residuals of this model: plot(model1). This is something important now: if we plot model1, it produces a series of plots for model inspection. If we do that, the first thing that happens is that the mouse cursor changes its shape; it's awaiting something, and you see on the right-hand side "Waiting to confirm page change...". We have to do something now, so the best thing is to click on the plot. We can see now, as I just said, that the spread of the residuals increases with the fitted values. So we see a pattern in this graph, indicating that our linear regression model may not be perfectly representing the true relationship between ozone and solar radiation. We can see there is something wrong, and we'll cure that in some of the later videos. This is the most important plot to look at: the residuals versus the fitted values. The fitted values are the ozone concentrations predicted by the model, and the residuals are the deviations from these fitted values.

We have to click again to see further plots. The next one shows whether our residuals are approximately normally distributed or not. If they were normally distributed, they would all lie on this line here. What we can see clearly is that the deviations are above this line, indicating that the distribution is shaped differently
than the normal distribution. This is not so important for producing good predictions, but nevertheless it's good to check, and later on in the course we will find ways to cure these things without even having to transform the response variable. Okay, the next plots are not that important, so I'll just close the graphics device for now.

Okay, what have we done so far? We've produced this model and we've also plotted the line. I want to show you some more things now. A very useful tool is the so-called termplot: termplot(model1). Using this produces a plot showing us the relationship between solar radiation and ozone concentration, so we can see the effect that solar radiation has on the ozone concentration immediately, without further problems.

Another thing we want to do, of course, is summarize our model with summary(model1). We particularly want to know how sure we are about the estimates, and this is what the summary tells us. First of all, the summary again shows us the call, so what we typed in order to fit the model. Then it shows us a summary of the residuals. The median is not zero, although it should be zero: the median is about minus eight, the minimum is about minus fifty, and the maximum is plus 119, so we see that the residuals are clearly not symmetrically distributed around zero. Then we get the most important part of the summary, which is the coefficients. We get the intercept and its standard error, and the slope and its standard error, and these t values result from dividing each estimate by the corresponding standard error. The p value on the right-hand side shows the probability of observing a t value larger in absolute value than this one, given the t distribution, if the true coefficient were zero. It shows that we are quite confident that our intercept is different from zero, which is not surprising, and for solar radiation we see that the slope is also different from zero. Okay, then we find the
R-squared value, which is 0.12. The adjusted R-squared value is adjusted for the degrees of freedom used up by the model, and the F-statistic gives us the p-value of the overall regression. It happens that this p-value is exactly the same as the one for the slope up here, just that this one has slightly fewer digits printed.

Okay, so summarizing what we've done today: we've used the airquality data set to perform a linear regression using the lm command. The lm command always works by putting in the response variable first, followed by a tilde and then the explanatory variable. In later videos we'll introduce more variables and nonlinear terms, and I will show you that different cures are available for variance heterogeneity, that is, for the residual pattern we observed here, this increasing spread of the residuals with increasing fitted values. But for today, as this is only an introduction to lm, I thank you very much for listening and I hope to see you again soon in one of the next videos. Bye-bye.
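For reference, the modelling steps described in the captions can be collected into one script. This is a sketch under a few assumptions: the names mean.Ozone and model1 follow the captions, while coefs and t.by.hand are hypothetical helper names added here, and the assignment arrow <- is used where the video types an equals sign (both work).

```r
data(airquality)

# Calculate mean ozone concentration, NAs removed (the null model).
mean.Ozone <- mean(airquality$Ozone, na.rm = TRUE)

# Always start with a plot, then add lines to the open graphics device.
plot(Ozone ~ Solar.R, data = airquality)
abline(h = mean.Ozone)            # horizontal line at the mean

# Fit the linear regression: response ~ explanatory variable.
model1 <- lm(Ozone ~ Solar.R, data = airquality)
abline(model1, col = "red")       # fitted regression line

model1            # the call and the coefficients (intercept and slope)
summary(model1)   # residuals, standard errors, t and p values, R-squared

# The t values are the estimates divided by their standard errors.
coefs <- coef(summary(model1))
t.by.hand <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Model inspection: residuals vs fitted values, then the normal Q-Q plot.
plot(model1, which = 1)
plot(model1, which = 2)

# Effect of solar radiation on ozone as estimated by the model.
termplot(model1)
```

Selecting single diagnostic plots with which avoids the page-change prompt mentioned in the video; calling plot(model1) without it steps through the whole series.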
Info
Channel: Christoph Scherber
Views: 428,782
Rating: 4.889462 out of 5
Keywords: R software, statistical software, programming, statistics, linear model, regression, console, script window, residuals, model diagnostics, Linear Regression, Statistics (Field Of Study), R (Programming Language)
Id: Xh6Rex3ARjc
Length: 19min 21sec (1161 seconds)
Published: Thu Sep 05 2013