Multiple regression using Excel and SAS Enterprise Guide

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
recall that in linear regression what we're doing conceptually is fitting a line to a set of points and that line allows us to estimate the relationship between in this case two variables a single predictor variable fly-ash and our response variable which in this case is the strength of concrete now viewing this graphically helps us conceptually but actually how we do regression is run a routine which gives us a tableau or a set of output now recall that the output can be split into two parts the top part tells us about the overall quality of the model how well does the linear model explain the observed data the second part of the regression output is the equation of the line the equation of the model and this typically includes some confidence intervals around the slope of the line which can be very useful when doing root-cause analysis what we're going to talk about in this lesson is multiple regression multiple regression is a situation in which we want to explain a response variable in terms of more than one explanatory or predictor variable now of course we can't show this graphically here's one I stole off the internet that shows a response variable as a linear function of two predictor variables and the result is no line but is in fact a plane in space but if we go to three or four or five predictor variables we can't draw this which is too bad because we often do multiple regression using quite a large number of variables and what we're going to do is show you how to do this in Excel you will recall this concrete data what we want to do here is predict or explain compressive strength as a function of a whole bunch of different inputs like Samantha slag fly ash and so on the only one we're not going to be able to use in this situation is air entrainment because it's not a continuous variable it is a categorical yes/no variable so I'll show you how we do this as before we hit data data analysis and we select regression and as before we use the same response variable the difference here is is when we select our X range we select all the variables okay including the headings and I will indicate that I've selected the labels here and I'll leave my confidence interval confidence level but I will put the output range right beside my model so I hit OK and I get a tableau and you'll see right off the bat that this tableau is quite a bit more complicated in the sense that now my equation the coefficients table contains 1 coefficient for each of the variables so for example no number that corresponding to the experiment number so experiment number one experiment number two experiment number three now we might assume that that should have zero impact on the strength of the concrete on the other hand we could imagine a situation in in practice running experiment where the machines get dirty or the operator gets tired or the measurement equipment starts to fail and as the experiment goes on we see some kind of trend in our measurements this would be a very bad thing so it's a good thing to test for it now fortunately when we look at the p-value or the impact of experiment number on the compressive strength of concrete we see now it's it's quite high which suggests to us that this slope includes zero or the slope is zero and if we if we go over to our 95% confidence intervals which is always an easy way to verify this we see that we're 95% certain that the slope of experiment number is somewhere between a positive number and a negative number in other words it seems to have no impact but other things in our equation other variables have slightly higher more significant slopes although looking at this you'd go really I've got a whole bunch of nothing most of these slopes aren't significant and one of the reasons for this is that we've got too many things in our model and we'll talk about mile model refinement later but this is the basic of multiple regression in Excel this linear regression model that we just did is a kitchen sink model we have every possible variable involved in the model even though we have theoretical reasons for thinking that some of them won't be relevant and we can see if we look at our results that some of them appear not to be statistically significant so what we can do is take out the non significant variables the non significant explanatory variables but this is a bit trickier than you might think because we can't simply take out all the insignificant variables right now we can't simply go through the list of p-values and identify all the ones that are greater than say something like a 5% threshold and remove them because when we take one variable out of the regression everything changes so what this means is we have to do this in a stepwise or iterative fashion that is we have to remove one variable rerun the regression model and then see how that worked so which one do we take out first well the rule of fellow I'm going to use is I'm going to look at these p-values and I'm going to pick the one with the highest p-values as the SAS material describes there's different more scientific ways of going about this but this is pretty much the only way to do it in Excel so what I'm going to do first is create a copy of each iteration because I don't want to throw out data you're gonna have to you see that I'm going to end up having to delete some columns here so I create a copy and I white might want to call this model too and then I go and I find slag the column for slag and the easiest thing to do is just click on the see the column header and go delete that simply deletes the entire column but now this is going to make it more difficult for me to select my ranges using the tool for regression so I can insert an empty column here on the on the far left and you'll see why when I actually go through data analysis so now if I go a regression you'll see that I don't have to redo my explanatory variable or my Y range because even though I deleted a column I added a column so there's just a little trick here now on the input range I do have to make one change but I'm just going to type that in I'm starting now in column B because column a is empty and that should be fine I should be able to just run that it'll overwrite my existing model which is fine and then I can go see what happened now right off the bat you should see there's quite a difference here whereas when I had slag in the model all sorts of things were not statistically significant many of these variables now are highly statistically significant 10 to the 27 10 to the 23 but if we use our rule of thumb and keep going now and say we use a 5% cut off for significance we might say you know look BSP or the super polymerize is pretty high oh this one's even higher the experimental number in fact that's not even close to being statistically significant we should clearly take that one out so I do the same thing i I create a copy of it I add it to the end I rename it model 3 and then I take out one column and add an empty column so now I'm starting from C column C so I can do data analysis regression the only thing I have to change here is column C and I can hit okay and overwrite it and check my results and if I look at my results now all my p-values with the exact with the exception of fine aggregate are all significant fine aggregate is sort of borderline so we'd have to decide whether we want to keep it in or not again a tool like SAS has a better methodology for removing variables to get the smallest yet most accurate model possible now that we've done a multiple regression in Excel we can see what the same process looks like using a specialized tool like SAS Enterprise guide so I'm going to use the same data here the same concrete strength data and I'm going to import it into SAS and we can see hopefully it's identical so now what we do is go up to tasks and regression linear regression and we get this dialog and as before compressive strength is our dependent variable and I'm gonna select all D variables as my dependent variable but you'll notice that air entrainment isn't added now air entrainment if you recall was a yes/no variable it was a categorical variable and by default all regression programs will not accept categorical variables they need numerical variables we can deal with that later but right now we can't include a category categorical variable in our regression now we can just leave this as it is now that I've set up the basic model and run it and it will go away for a little while quite a while actually and it will come back with a full model so we see just as before we get an R square and we get our parameter estimates and I call this a kitchen sink model because it has every variable in it now as before like we had in Excel we see here if we look at the p-values that a lot of these are not significant that is there's a relatively high probability that the parameter estimate is zero so what we want to do just like we did in Excel is refine the model now it turns out this is a lot easier to do in a tool like SAS than in Excel so I'm gonna hit modify task and the second item on the far left is the model selection method and by default the model selection is no selection it just gives you the full kitchen sink model whatever you specify but there are several iterative techniques that in a very stepwise fashion refine the model and as I said before there's no real science to this there's good rules of thumb and bad rules of thumb so them the simplest is probably what we did in excel start with the full model and then take out one variable at a time until all the remaining variables are significant at some level now there's more sophisticated techniques in sass and those are covered in the south's material but right now backward elimination will give us a flavor of this now to stay in the model by default it it requires a ten percent significance level I'm going to change this to five percent again there's no real really solid reason to do this you'll see that it's a trade-off between a model that is small but not particularly accurate in terms of its predictions or one that's really predictive but it's very complicated and which probably has over learned in the training data that's kind of a technical point over learning but basically we want to strike a balance between a simple model and an accurate model so I hit run and this will take a little bit longer to run because it's doing different models now what the SAS output is is a blow-by-blow analysis so it starts off by saying in step zero all the variables in their model and your r-squared is 0.8 two seven one exactly what we had before now as we take various out the r-squared can only go down can't by removing information have a more accurate model and that's where in regression you often see the adjusted r-squared as being important the adjusted r-squared penalize --is the model for having more terms more variables in it so as we scroll down here we see that slag is removed very similar to what we did in Excel and then the experimental number is removed fine aggregate or sand is removed and at each stage my r-squared is going down slightly but not that much finally it says all the models all the variables left are significant at the 5% level and then it gives you a whole bunch of graphs showing you these trade offs these various measures of trade-offs between accuracy and model complexity and we don't really need these but you can basically see that these tell you various measures of model quality at different steps of the process so if you were doing this for a living you'd pay a lot of attention to these to say you know at what point should I stop taking variables out of the model one annoyance at least in this version of SAS Enterprise guide is that I don't get a full table of results for some reason I get all this diagnostic diagnostic information which can be quite useful especially this graph here where we see the quality of the model prediction along the horizontal axis and fit along the vertical axis these are great diagnostic graphs but I don't get a final version of the model what I have to do to get around that I found is go to modify task and where it says statistics click on two things standardized regression coefficients and confidence limits for the parameter estimates now the confidence limits are something that we get by default in Excel so it makes sense that we'd want to see these here as well we're sort of used to this information and it's extremely useful from a business point of view the standardized regression coefficients is something a little bit different but we have to have both these checked it turns out in this version of Enterprise guide to get this tableau when we're doing its back step refinement so if I scroll down now I get all my steps I get all my model fit Diagnostics but here I get my final model with my r-squared my adjusted r-squared my parameter estimates my 95 percent confidence limits and this thing called a standardized estimate and we're not going to talk about standardized estimates too much in this course it's conceptually very simple instead of being expressed in terms of units you know how many units in kiloPascals of strength am I going to get am I going to gain with every kilogram of fly ash this is expressed entirely in terms of standard deviations how many standard deviations change in strength occur with each standard deviation change in my various variables so these are now directly comparable when looking at cement fly ash and water I can see which of these has the biggest impact and in this case it looks to be this one in terms of standard deviations but again mccann be useful for root cause analysis to know what's driving the phenomena of interest in this case concrete strength but more typically we like to use the units of business that is kilograms of fly ash the kilo Pascal's or mega Pascal's of strength so I put that out there just as a workaround for the missing output problem
Info
Channel: Michael Brydon
Views: 923
Rating: 5 out of 5
Keywords: Multiple regression, Excel, SAS Enterprise Guide
Id: AwZHp45oD3A
Channel Id: undefined
Length: 19min 36sec (1176 seconds)
Published: Wed May 22 2019
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.