Introduction to Stata - Thinking like Stata

Captions
Welcome to the second video of this introduction to Stata. In this video, I'll share tips and tools so you can begin to think like Stata and feel comfortable with Stata. Stata is a programming language, and as with any language, there are rules. We'll put the concepts into practice when we explore some basic commands, so let's begin.

Syntax. What really is the definition of syntax? Syntax is the set of rules governing a language's sentence structure, and it's your guide in Stata, too, for all commands. The syntax can be easily found with the help command. So here we are back in Stata. In the command line, type the command help help and hit Enter. This means that we want help finding the syntax of the command help, which we have over here. Here's the syntax: you have help with the rest of the syntax components. As you can see over here, the command itself, help, is bolded and is usually the first word of the syntax. You can see that the h here is underlined, which is Stata's shorthand, meaning you can type h to represent the command help. Afterward, in these brackets here, it says command or topic name. We wanted to see the command syntax for help, so we typed help help. The brackets here represent optional components, which I'll explain later. Do you see this comma? It's very important in Stata because it separates the main command over here from the options of the command, which fall after the comma.

Here are Stata's five major grammar rules. The first is that one equal sign is not the same as two equal signs: the double equal sign appears after if in a command. Now, if is great for restricting results, because if is the conditional: it introduces an expression, which we'll be seeing later. But Stata allows only select commands to use it, and you would check by using the help command. Commas: they're required for some commands to separate the main command from the command options, and again you would find that out with the help command. Spaces: Stata is sometimes picky about them, and I
haven't quite figured out a hard-and-fast rule for spaces, but if Stata is giving you an error, play around with the spaces, because that might be the error. Lastly, capitalization: Stata is case-sensitive. Stata does not understand capitalized commands whatsoever, but Stata can understand capitalized variables if you already named your variables with capital letters.

Here are today's five basic commands: summarize, codebook, count, tabulate, and tabstat. Let's start with summarize. I pulled up the syntax for summarize by typing help summarize, or you could also type h su in shorthand. Remember, the components in brackets are all optional. The order of the components really matters. varlist here just means a variable list, and the options over here after the comma are listed under the syntax.

Let's explore the first command. It's a standalone, as you can see above, because the rest of the components are in brackets; it's okay for summarize to stand alone. I put sum instead of su, and that's okay with Stata as well: so long as you have the root command letters, the su, you can add further letters of the command name after su and Stata will understand the command. For this one, sum variable1 variable2, detail: variable1 and variable2 make up the variable list, or the list of variables. I did not have any of the other components here, but I wanted the detail option, so I put a comma and then detail. Lastly, this command is perfectly fine because the variable list was optional, so the next component is if, and that's okay. After if usually comes what's called an expression, and you don't have to memorize this right away either: it's an expression where a variable name is equal, with the double equal sign, to a number.

Now let's move on to the next command, codebook. Here is the syntax for codebook. Again, all the components after the command are in brackets. So here we are back in Stata. Let's try it: codebook variable1 if variable2, with one equal sign, is 1. And if we replace
the variables with the actual variable names, we can make a pseudo-command here. Let's try it: codebook, where variable1, let's say, is marital for marital status, then if, where variable2 could be sex, equals, with one equal sign, 1. Immediately we see that this is not allowed. Anytime you see red, that denotes an error in Stata. But watch: if we put two equal signs, it will work. And here's a trick in Stata: if you want to retype the former command, you can go directly to the Review window and click on that command. Immediately the same text appears in the Command window, and all we wanted to change was to add an extra equal sign. Once that happened, we see that the marital variable indeed came up, and it's only looking at people who have the value of 1 for sex, namely males. You can see the total here is 220, while this data set actually has a total of 790 observations, so already we can tell that Stata has sifted out the females and we're only looking at males.

What I really love about codebook, and why I decided to include it in this tutorial, is that it can show you the number of missing values. In this case, no one had a missing marital status, and as you can see, the numeric values of marital status and the frequencies are here along with their labels. codebook is a very nice way to get a general numeric overview of your variable: again, it shows you the frequencies, the values, the labels, and whether there are any missing values.

Now here's where you can see the red and green buttons in action. If I type the command codebook by itself, because I haven't specified which variables to show, it will show all 14 of my variables. So go ahead, type codebook, and hit Enter. It starts with my first listed variable, called ID, then sex, then age, et cetera, and it will show me the codebook information for all my variables in the order in my Variables window. So we have a few options: we can click the green button to show more, like so; we can also click more at the bottom here to show more; or we can even hit the Enter key in the
command box, like so. However, for our purposes we don't want to see everything here, so we'll hit the red X button to break the output. Now it allows us to enter another command, because it shows the period in the output window, ready for the next command.

Now here's the command count, and here's the syntax for count. Now we're back in Stata with a practice data set, and we want to see how the command count works. In our example, the wrong example, I had count, comma, if variable1, let's choose marital, equals equals a number, let's say 1. Again you see the red: that's an error, and Stata did not understand the command. Now if we take out the comma, it gives a count. We can also count the number of people who have the numeric value of 2, and similarly we can count for 3 and 4, because the variable marital has the values 1, 2, 3, and 4.

Here's tabulate. There are two flavors of tabulate. The first is the one-way table of frequencies. One-way means looking at one variable's frequencies; for example, we can tab the categorical variable marital status and see how many people are single, how many are married, widowed, et cetera. A two-way table of frequencies will look at two categorical variables in one table, showing the frequencies for each cell. Not surprisingly, there are different syntax rules depending on which table you want.

In Stata here, I have opened up my practice data set, which has a lot of categorical variables. A categorical variable is a variable whose numbers, such as 1 or 2, actually represent something else. For example, 0 could be female and 1 could represent male, or 1 could represent single, 2 could represent married, et cetera. So let's check out the tab command with one categorical variable and the option row, and let's choose a categorical variable called marital for marital status. We put the comma to separate the main command from the command options, and we type row. See, the option row is not allowed. Now let's check the two-way table with the option row.
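The one-way versus two-way contrast can be sketched in a short do-file. The practice data set from the video is not available here, so this sketch assumes the auto data set that ships with Stata, with rep78 and foreign standing in for categorical variables such as marital and sex. capture noisily is there only so that a do-file keeps running past the rejected command.

```stata
* Assumption: auto.dta (shipped with Stata) stands in for the practice data
sysuse auto, clear

* One-way table: frequencies of a single categorical variable
tabulate rep78

* One-way table with the row option: Stata rejects this
* (capture noisily shows the error but lets a do-file continue)
capture noisily tabulate rep78, row

* Two-way table with row percentages: each row adds up to 100
tabulate rep78 foreign, row
```

Typed interactively, the same commands work without capture noisily; the second one simply stops with the error in red.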
Immediately it shows that there is a key for each cell. The first number represents the actual frequency, or the number of people in those categories, and the second number represents the row percentage. As you can see, for each row all the percentages add up to 100, because this is the row percentage over here and over here. That's why the one-way table can't show row percentages: it doesn't have two variables to compare to each other.

Here's tabstat, a useful command for showing you more specific statistics. Notice what is in the brackets in the syntax and what isn't. I've also put the main options here, because tabstat as a standalone command with variables isn't very powerful; for tabstat, using the options is extremely helpful. The by option here allows you to group statistics by a variable, while the statistics option below that, which has an underlined s for shorthand, is where you tell Stata to give you the specific statistics of interest.

Let's see what tabstat does by itself without options. We type tabstat and choose any list of variables: age and maybe age at first sexual intercourse. Not very impressive, right? It only shows you the mean of the variables you list. Now the mean is useful only for variables representing real numbers, called numeric variables. You can't use it meaningfully with categorical variables such as sex: if 0 meant female and 1 meant male, then the mean, or the average of the two, would be a half, which doesn't mean either male or female. But if you took the mean of the ages, we'd find that Stata computes the average age of all subjects, which is 33, and that does make sense.

Now let's try tabstat with the by option. Let's try tabstat age, by(sex). See how I picked a numeric variable for my tabstat main function but separated it by the categorical variable sex. Now let's get fancier and use the statistics option, the s option. First we'd like to see what statistical options we have, so we type help tabstat, which for
shorthand can just be h tabstat. From here we look at the options, see the s shorthand, and click on statname. Over here we have a whole list of all the possible statistics you can get Stata to report: the mean, the count, the variance, the standard error of the mean, the first percentile, the median, the interquartile range, et cetera. So let's choose a couple, or as many as we want, to put into our command. I can click back on the Review window, click on the command I'd like to have again, and just change it from here, so I can put s afterward. Now in Stata, options such as by and statistics, or row for the tab command, can be in any order after the comma: by can come second and statistics can come first. But let's just choose some statistics to report: say median, iqr for the interquartile range, and we can even pick the 99th percentile. And here you go: we see that age has been grouped into females and males, with the total number of each here. At the 50th percentile, the median, 30 is the median for both males and females. The interquartile range is a little bit higher for males, and the 99th percentile shows that the female age is higher than the male age.

Now what if we switch the order of by and s, these options? We can just highlight this, move it to the end, and hit Enter. It shows the same thing, so Stata, in my experience, does not discriminate between different orders after the comma.

I'd like to add that the summarize command can also show you many of the same things, but there's a trick to that: you have to type the option detail. Remember, summarize can be sum or su, and we can choose however many variables we'd like. Again, let's choose age. If I just type that, it shows the number of observations for age, the mean age, the standard deviation, and the minimum and maximum. But if we type detail as the option, and the shorthand for detail is just the letter d, we have a whole
range of information here. We have the percentiles, from the first percentile to the 10th percentile to the median; these represent the percentiles. We have the smallest values of age: the youngest subject is one year old and the oldest subject is 90 years old. These represent the data points which are the largest and smallest. Again, we have the number of observations, the mean, and the standard deviation, but we also get the variance, skewness, and kurtosis.

This concludes the second video in the Stata introductory tutorial.
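The grammar rules and options covered in this video can be sketched in one short do-file. As an assumption, it uses the auto data set that ships with Stata instead of the practice data set from the video, with mpg as the numeric variable and foreign and rep78 as the categorical ones. capture noisily is added only so the do-file continues past the commands that the video shows Stata rejecting.

```stata
* Assumption: auto.dta (shipped with Stata) stands in for the practice data
sysuse auto, clear

* if needs a double equal sign; as in the video, codebook rejects a single one
capture noisily codebook rep78 if foreign = 1
codebook rep78 if foreign == 1

* A comma before if is rejected: the comma only introduces options
capture noisily count, if foreign == 1
count if foreign == 1

* summarize on its own, then with the detail option (shorthand: su ..., d)
summarize mpg
su mpg, d

* tabstat: options may come in either order after the comma
tabstat mpg, by(foreign) statistics(median iqr p99)
tabstat mpg, s(median iqr p99) by(foreign)
```

The two tabstat lines should print the same table, which matches the observation that option order after the comma does not matter.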
Info
Channel: UCSF GSI
Views: 44,816
Rating: 4.844523 out of 5
Keywords: Stata
Id: jTtIREfhyEY
Length: 14min 39sec (879 seconds)
Published: Mon Apr 28 2014