SAS Statistics - ANOVA (Module 02)

Video Statistics and Information

Captions
Hi everyone, my name is King IV, and this is Introduction to SAS Statistics. This is the second lesson in a series of six lessons on the topic of SAS statistics, and in today's lesson we'll be covering ANOVA, analysis of variance. Let's go ahead and get started.

So the problem that we want to investigate is: does weight status impact cholesterol? Or does it correlate with cholesterol, which is probably the better way to phrase it. What we're going to do first is use PROC TTEST, and we're essentially going to be looking for outliers outside the normal bell curve using these t-tests. We're going to be creating a new data set called work.heart, sourcing our data from the sashelp.heart data, and then we're going to run a filter to get rid of the underweight observations in this data set, so weight status not equal to Underweight. That way we'll have just two levels in our categorical variable, weight status.

So here we're going to add a title, "PROC TTEST: Cholesterol vs. Weight Status", and we're going to be using PROC TTEST with data equals work.heart. We're going to be using the class weight status and the variable cholesterol. Oh, I spelled it wrong up here. And then we're going to run it. Let's go ahead and check it out. Oh, yep, sorry, I have to run the whole thing, because we hadn't created the data set yet.

PROC TTEST is a really great procedure; it runs a lot of different procedures within it. So what we're going to do first is assess the three assumptions for any ANOVA analysis. The first one is: are the samples independently selected? In this case we'll trust that they are, because we're assuming the study independently selected them. Two: is the data normally distributed within each of the populations? We take a look here and we'll see that it's approximately normally distributed. The blue line is the normal distribution and the red line is the actual distribution, and you can also look at the Q-Q plots to see that it is fairly normally distributed. There are some outliers out there, but for the most part I would say it is normally distributed.

The last assumption is: are the variances equal? In this case we're going to use the F statistic to determine whether or not they are equal. The null hypothesis is that they are equal, and in this case we actually reject the null hypothesis because the p-value is very small. That means we'll have to use the Satterthwaite method, which is for unequal variances. Basically, we reject the null hypothesis that they are equal, which means that they are unequal. So we take a look here and we see that we have a fairly high absolute t value, and the p-value is very small as well, which means that we reject the null hypothesis that normal weight status and overweight status have equal means for cholesterol. You can also see the variances here and the differences as well. So that's good, that's interesting. Let's go ahead and move on to the next one.

Now we're going to be moving on to one-way ANOVA. In this case there's actually going to be more than two levels, and what we actually want to assess is the impact of your cholesterol status on your age at death. So a little bit of a gruesome topic, but an important study that we should probably investigate. So here we're going to turn ODS graphics on, and we're going to be using PROC GLM. I'm going to call this "One-Way ANOVA: Age at Death vs. Cholesterol Status". So PROC GLM, data is equal to, in this case we're going to be using the entire data set, sashelp.heart, and I also want to make sure that we're plotting the diagnostics, unpacked. There's a number of different plots that you could apply, but these are the ones that can help us do the assessment. Then we're going to add class Chol_Status, and our model is going to be, in this case, AgeAtDeath equals Chol_Status, which is our explanatory variable. Then we're going to be using means Chol_Status, with hovtest, to assess some of our assumptions. Let's go ahead and add that, and now let's run our model.

Okay, so if we go up to the very top, because there's quite a bit to look at, you'll see the different levels: Borderline, Desirable, and High. You'll see the number of observations that were actually read and used; obviously, most of the people haven't died yet. Before we get to this area, we're going to go and assess our assumptions. The first one is independence, which we can't assess here; for this one we have to assume that the researchers independently selected the observations. Two, we'll have to check whether or not the residuals, or the errors, are normally distributed. This plot of residuals by predicted for age at death is a good comparison. You'll see here what the differences are; you want similar amounts below and above the line, which is where there are no differences, and you don't want to see any clear pattern. There's a slight tightening of the residuals, but it's not really that significant. I think we can safely assess that the errors are normally distributed. And here, again using the Q-Q plot for the quantiles of the age at death, you'll see that it looks fairly normally distributed. There is a bit of tailing off here, but nothing too serious.

The third assumption is: are the variances equal? The null hypothesis of Levene's test assumes that the variances are equal for each of the groups, and we'll see here that the p-value supports that, where we don't reject the null hypothesis. That means the variances are equal, so that's good.

So let's go ahead and move on to see whether or not there are actually any differences. You'll see here that there is a difference between the different [cholesterol] statuses, because the p-value is less than 0.05, which is our alpha, which means that, with 95% confidence, they are different. But you'll see here the R-squared is very low, which makes sense, because your age at death is probably not just impacted by, or not just correlated with, your cholesterol status. There's probably a number of different components to it. So that's good, that's interesting. Let's go ahead and move on.

Now that we know that your cholesterol status impacts your age at death, we want to know which status increases or decreases your age at death. The way we're going to do that is with post hoc pairwise comparisons. So here we're going to put ODS graphics on, and we're going to title it "Post Hoc Pairwise Comparison". You'll see the first part's pretty similar, because obviously we're assessing the different pairs of cholesterol status here. Then we're going to be using lsmeans Chol_Status with pdiff, and for this one we're going to tell it to use all the different comparisons. In this case we're going to be using two different methods: Tukey's, and also Dunnett's method. For Dunnett's we want to use control, and Dunnett's method asks us to put a baseline, so here the baseline is going to be Borderline, which is where someone has normal cholesterol, and the adjustment we're going to be using is Dunnett's method. Let's run that, with ODS graphics off, and hopefully it should produce some interesting results. There's quite a bit to look at, so let's go ahead and get started. So this is Tukey's.
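The post hoc step dictated above could be sketched in SAS roughly as follows. This is a reconstruction from the narration, not the presenter's exact code: the variable names Chol_Status and AgeAtDeath are the ones found in sashelp.heart, and the option spellings (pdiff=all, adjust=tukey, pdiff=control, adjust=dunnett) are assumed from what is said rather than shown.

```sas
/* Sketch reconstructed from the narration; not the presenter's exact code. */
ods graphics on;

title 'Post Hoc Pairwise Comparison';
proc glm data=sashelp.heart;
    class Chol_Status;
    model AgeAtDeath = Chol_Status;
    /* Tukey adjustment: compare every pair of cholesterol statuses */
    lsmeans Chol_Status / pdiff=all adjust=tukey;
    /* Dunnett adjustment: compare each status against the Borderline baseline */
    lsmeans Chol_Status / pdiff=control('Borderline') adjust=dunnett;
run;
quit;
title;

ods graphics off;
```

With ODS graphics on, each lsmeans statement should also produce its own comparison plot, which is presumably what the walkthrough of the output refers to.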
So, okay, there's actually a bit more up here to look at, but let's just start here, because the rest we already looked at before. Here are the different cholesterol statuses: Borderline, Desirable, High. You'll see the mean age at death for each of them, and the number associated with each. Here's the pairwise comparison. You'll see, for example, that two is statistically different from one, because the p-value is less than the alpha of 0.05, and two is also different from three. And you'll see again that, for example, three is not different from one, but three is different from two, et cetera.

You'll see here the different ages at death for each of the cholesterol statuses, and here's another plot, called a diffogram, that shows whether or not the differences are significant. The red lines indicate pairs that are not significant when compared to each other, which is kind of what we expected based off the previous graphs.

You'll see here that Dunnett's method is pretty similar, but what it does instead is control for one of the cholesterol statuses, saying that this is the baseline, which is Borderline, and then it looks to see whether or not there's a difference between the populations. You'll see here that there's only a difference between Borderline and Desirable. There's not actually a statistically significant difference between Borderline and High cholesterol, which is supported up here as well: Borderline and High are not different, but the rest are. You'll also see that this line right here is the baseline, and this blue box tells you that if you're within the blue box, you're statistically not different from the baseline, which here again is Borderline. The High cholesterol, even though it's higher, is not high enough to support that it's really different from the Borderline cholesterol. But you'll see that the Desirable cholesterol status actually has a statistically significantly lower mean than if you had Borderline. So having good cholesterol in this case actually explains, or is partly correlated with, having a lower age at death, but having high cholesterol is statistically not different from having borderline cholesterol.

So hopefully you've learned something. If you have any questions or comments, feel free to leave them in the comment section below, and don't forget to subscribe. I look forward to speaking to you next time. Thank you.
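For reference, the first two steps of the lesson (the two-sample t-test and the one-way ANOVA) could be sketched in SAS as below. This is a reconstruction from the captions rather than the presenter's exact code; the variable names Weight_Status, Cholesterol, Chol_Status and AgeAtDeath are those in sashelp.heart, and the plots= and hovtest= spellings are assumptions based on the phrases "diagnostics unpack" and "hovtest".

```sas
/* Sketch reconstructed from the captions; not the presenter's exact code. */

/* Step 1: drop the Underweight group so Weight_Status has two levels */
data work.heart;
    set sashelp.heart;
    where Weight_Status ne 'Underweight';
run;

title 'PROC TTEST: Cholesterol vs. Weight Status';
proc ttest data=work.heart;
    class Weight_Status;   /* grouping variable with two levels */
    var Cholesterol;       /* response variable being compared  */
run;

/* Step 2: one-way ANOVA on the full data set */
ods graphics on;
title 'One-Way ANOVA: Age at Death vs. Cholesterol Status';
proc glm data=sashelp.heart plots=diagnostics(unpack);
    class Chol_Status;
    model AgeAtDeath = Chol_Status;
    /* Levene's test for homogeneity of variance across the groups */
    means Chol_Status / hovtest=levene;
run;
quit;
title;
```

PROC TTEST reports the pooled and Satterthwaite results together with the equality-of-variances F test, so no extra option is needed to reach the Satterthwaite row discussed in the lesson.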
Info
Channel: SAF Business Analytics
Views: 47,624
Rating: 4.9111109 out of 5
Keywords: SAS Institute (Business Operation), Computer Science (Field Of Study), Analytics, Advanced Analytics, SAS TUTORIAL, Regression, Box Plot, ANOVA, Model Selection
Id: KThq7kQ8O9M
Length: 13min 13sec (793 seconds)
Published: Sun Sep 13 2015