SAS Statistics - Descriptive Statistics (Module 01)

Captions

Hi everyone, my name is King IV and this is Introduction to SAS Statistics. In these lessons I'm going to be using Base SAS 9.4, 32-bit. I also teach at the University of Waterloo in Computer Science and Business Analytics. These are meant to be short instructional videos on how to apply stats, or how to use different procedures to answer the statistical questions that you have and perform the type of analysis that you want. This isn't a stats class. I'm not going to go over the theory and the different components; I'll point you to a couple of different resources that you can check out, but I'm under the assumption that you at least know the basics of stats.

In these videos I'm going to be covering six different topics. The first one is descriptive statistics, which is this video. We're going to follow that with ANOVA, which is analysis of variance, and then linear regression, logistic regression, model selection, and predictive models and scoring. If you haven't checked out my previous playlist, Introduction to SAS, I'm going to put a little card up so you can check it out. That's a good basis to help you better understand SAS programming, so that you can better work through these six topics, understand the different procedures, and set up your data. But enough rambling, let's go ahead and get started.

Here I have SAS 9.4 open, hopefully that's no surprise to anybody. Before we dive into the different procedures that we're going to be performing, let's better understand the data set that we're working with. I'm going to go to the Sashelp library, and it's a data set called Heart, I believe. There you go. This is based off of the Framingham Heart Study, and you can see here it has a bunch of different variables and observations, including the status, the cause of death,
cholesterol, weight status, height, all those different components. When we look at this data we may have some ideas about different statistics that we want to run. For example, is weight correlated with cholesterol? If you have a higher weight, does that correlate with having higher cholesterol? That's really the purpose of descriptive statistics: one, to better understand your data, and two, to develop what I call some hypotheses based off your analysis before you go in and start doing your regression and logistic regression, building your models, and doing the scoring, and really not getting anywhere because you've gone down the wrong path or you've analyzed the wrong contributing variables for your resulting variable. But enough of that, let's go ahead and get started.

Here I'm going to be using PROC MEANS, because PROC MEANS is a really useful way to develop descriptive statistics off your data. If you don't already know the syntax, go ahead and check out my previous videos, where I covered PROC MEANS. In this case we're going to be using the data set Sashelp.Heart: Sashelp is the library, and the data set is Heart, not Cars. This is where you usually tell SAS which categorical variables you're going to use to split the data, and which numeric variable you want to analyze, but let's leave that out for now just so you can see what the default looks like. Sometimes the default is good to look at. You can see here the number of observations, the mean, the standard deviation, the minimum, and the maximum for each variable, so you can check that out and get a better understanding. See here, the cholesterol mean is 227 and the standard deviation is 44. The standard
deviation, if you don't know, is basically the variability around the mean for the different observations.

Let's go ahead and modify our code here. Let's add a title: here we're going to do descriptive statistics on cholesterol, based on weight status. Here we're going to throw in a couple of different statistics. I'm going to throw in N, which is the number of observations used in the analysis. I'm going to use MEAN, I'm going to use the standard error, and I'm going to use CLM, which is the confidence limits for the mean; it's going to give us our 95 percent confidence interval. Then I'm going to divide the data by weight status, and I'm going to be analyzing the variable cholesterol. There you go, hard word to spell.

You can see the title up here, looks good, and the different weight statuses. You'll see N Obs, which basically means how many observations fell into this category, Normal for weight status. But then you'll see there's another N, which is 1430. The reason for that is that there are a bunch of missing values for cholesterol. Maybe they couldn't collect the information, maybe the data wasn't available, maybe the machine was broken, whatever. PROC MEANS excludes those when it actually runs the analysis. That's important to know: you really need to know what these procedures do when they handle missing data or data errors, just so you can interpret your results correctly. This would be different if it set each of these 40 values to, for example, 0 or 100, versus where it actually excludes them.

You can see the mean here, the standard error, and the confidence limits using 95 percent confidence. You can see there's a pretty tight distribution around the mean, and you can see that
because the standard error is fairly small compared to the mean. You can also see up here, if you go to cholesterol, that the minimum was 96 and the maximum was 568, which tells us that there's a pretty big range in this data. That's going to give us a clue about what our kurtosis value is, which is likely going to be a positive number. This is some of the interpretation that you can get as you build out your results. That way you have an expectation, not necessarily that you think you already have the answer, but at least an expectation so that you don't overlook any errors when you're doing your analysis.

You can see here the confidence limits for the mean. If you go to Underweight and look at the upper 95 percent limit, it's approximately the mean plus two standard errors: 207 plus 3 times 2 is going to give you the 213. You can see it's actually below the lower 95 percent limit of the mean for Normal, and the upper limit for Normal is below the lower 95 percent limit for Overweight. That gives us an indication that maybe there's a correlation between weight status and cholesterol after all, a hypothesis that we can start exploring later on.

OK, perfect. Let's go ahead and move on to the next one. In this case I'm going to put ODS GRAPHICS ON, which allows us to actually output some of the plots that we're going to be producing. I'm also going to throw up a title; let's call it "Histogram and Probability Plot for Cholesterol by Weight Status". Within PROC MEANS there's a whole bunch of other statistics that you could have included as well; I'm only showing you a few so you get a taste of it.

Here we're going to be using PROC UNIVARIATE, which is very powerful for descriptive statistics. We're going to be using the same data set, and we're going to put weight status in the CLASS statement. The CLASS variable
is always going to be a categorical variable, which basically means we can split the data based off that category. Then we're going to be using the variable cholesterol. Based off the title, we're going to be doing a histogram, and here we need to define our numeric variable list, so in this case it's cholesterol. I'm also going to put in the INSET statement, which is basically what other stats you want to throw on top of the plot, in this case the skewness and the kurtosis. Then I'm going to copy this, because I want basically the same thing, except here I want PROBPLOT, which is going to be the probability plot. Let's just throw on ODS GRAPHICS OFF at the end. I like to run my ODS GRAPHICS statement by itself first to make sure that it's been captured in the log, which it has. Perfect, let's go ahead and run this.

You can see there's a whole bunch of analysis that gets run, which is again super interesting and a really good way to better understand your data. Let's go all the way to the top, where you see the title; that's why it's useful to have titles. You can see it's going to give some descriptive statistics up front. It's going to tell you in this case that the weight status is Normal and that your variable is cholesterol. You see your mean, your sum of observations, your variance, your kurtosis, your skewness, and here some data around the mean, median, and mode. Obviously you want to know that, because for an ideal normally distributed variable the mean, the mode, and the median would all be the same, and here they're roughly the same. You can see here your standard deviation, your variance, your range, and your interquartile range. A lot of times people rely on the range, but the range
can be impacted by outliers. If you have one huge outlier, it can produce a really large range, so I actually prefer looking at the interquartile range, which is the range between the third quartile, 243 in this case, and the first quartile, 188, so in this case 55. The range, by contrast, is the difference between the max and the min. You see here this quantile table: the 99th percentile is 342 and the 100th percentile is 568, which is really high. That tells us that there are some outliers on the higher end. You can also see some of the stats here that we're going to go over in some of the future videos, as well as the highest amounts. It's going to be the same thing for Overweight and Underweight, so I'm going to skip that for now.

Then here we have some histograms, and you'll see our inset, which is basically the stats that we asked for. You can see the skewness, which is positive, and you can see that in the plot because it has some higher amounts out here rather than over here. Normally you would have a normal curve here that would tell you that. You can also see that the kurtosis is positive. We already expected this, because our interquartile range and our standard error gave some indication that there is a pretty heavy concentration around the mean, more than normal. That's for Normal, and that's for Overweight, and you can see here that for Underweight it follows more of a normal distribution.

Then you see the probability plot, which plots the normal percentiles against the actual cholesterol. Ideally it would be a linear relationship. You can see up here there are some outliers, which tell us that maybe there are some other contributing factors that we need to consider. And you can see over here the plot for Underweight. So again, lots of useful descriptives, which tell us that there probably is a
relationship, a correlation, between weight status and cholesterol.

Let's go and do one of my favorite graphics, which is a box-and-whisker plot. Here I'm going to use PROC SGPLOT, and then we're going to be using VBOX, which basically says that it's going to be a vertical box-and-whisker plot. In this case we're going to be again analyzing cholesterol, and we're going to be using an option called CATEGORY, splitting it based off the weight status again. Before we run that, I'm going to throw up a title here, and I'll call this "Box and Whisker Plot for Cholesterol by Weight Status". Perfect, let's run these components.

You can see here that the line inside the box represents the median. The lower end of the box is the first quartile, which is the 25th percentile, and the upper end is the third quartile, which is the 75th percentile, so this tells you that 50% of your data lies within this range. You'll also see these whiskers, which is where the name comes from. They are based on the interquartile range: by default they extend to the most extreme observations within 1.5 times the interquartile range of the box. A good way to look at this is to say that anything outside of the whiskers is generally an outlier. That's not always the case, because some values will fall outside the whiskers even in clean data, but it gives you an indication that maybe there are some outliers. You see up here in the Normal group that this amount is actually higher than the highest amount in Overweight, which maybe tells you a couple of things. One, maybe you have a data entry error, and maybe this should be 200-something. Or two, there are other explanatory
variables that explain cholesterol beyond just your weight and your weight status, which is actually probably the case. Those are some things that we need to consider as we go ahead and do our analysis. I'm going to leave that there. There's obviously a lot more that you can do with descriptive statistics, but I'll leave it there so you can go and explore and try things out on your own. If you have any comments or questions, feel free to leave them in the comment section below. I look forward to speaking to you next time. Thank you.
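
The captions do not include the code typed on screen. The following is a minimal sketch of the four steps described in the narration, not the presenter's exact program. It assumes the Sashelp.Heart sample data set and its variable names `Cholesterol` and `Weight_Status`; the title text follows the narration where it is stated and is otherwise approximate.

```sas
/* Sketch reconstructed from the narration; the on-screen code is an assumption. */
/* Assumes Sashelp.Heart with variables Cholesterol and Weight_Status.           */

/* 1. Default PROC MEANS output: N, mean, std dev, min, max for numeric variables */
proc means data=sashelp.heart;
run;

/* 2. Selected statistics for cholesterol, split by weight status.               */
/*    CLM requests the confidence limits for the mean (95% by default).          */
title "Descriptive Statistics on Cholesterol by Weight Status";
proc means data=sashelp.heart n mean stderr clm;
  class Weight_Status;
  var Cholesterol;
run;

/* 3. Histogram and probability plot, each with a skewness/kurtosis inset.       */
/*    Each INSET statement applies to the plot statement just before it.         */
ods graphics on;
title "Histogram and Probability Plot for Cholesterol by Weight Status";
proc univariate data=sashelp.heart;
  class Weight_Status;
  var Cholesterol;
  histogram Cholesterol;
  inset skewness kurtosis;
  probplot Cholesterol;
  inset skewness kurtosis;
run;
ods graphics off;

/* 4. Vertical box-and-whisker plot of cholesterol, one box per weight status    */
title "Box and Whisker Plot for Cholesterol by Weight Status";
proc sgplot data=sashelp.heart;
  vbox Cholesterol / category=Weight_Status;
run;

/* Clear the title so it does not carry over to later output */
title;
```

Running the steps one at a time, as in the video, makes it easier to match each block of output to the part of the narration that interprets it.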

Info

Channel: SAF Business Analytics

Views: 93,944

Rating: 4.9369087 out of 5

Keywords: SAS Institute (Business Operation), Computer Science (Field Of Study), Analytics, Advanced Analytics, SAS TUTORIAL, Regression, Box Plot, ANOVA, Model Selection

Id: HMOWriqdQTI

Length: 17min 2sec (1022 seconds)

Published: Sun Sep 13 2015