Stata Tutorial: Intro Data Cleaning with Panel Data

Video Statistics and Information

Video

Captions Word Cloud

Reddit Comments

Captions

[Music] hi everybody we're gonna take a quick look here at some data cleaning topics some kind of good habits to get into when you're starting a project some of which will apply specifically to panel data so let's take a look at some of the topics we will discuss here always the first thing you want to do is start a do file start recording your commands that you find useful both for your benefit and for other researchers who want to replicate your work the use of the preserve and restore commands renaming your variables so you and readers of your work can readily see what it is that you're doing labeling your variables again the same idea you want to organize that information sometimes it's going to be helpful to create a macro that collects variables into a list if you're gonna use the same variables again and again we'll take a look at that dealing with the problem of missing observations how to encode a string variable so you can actually use it in analysis and then with panel data one of the big things is figuring out how many observations you have per entity over time and sometimes you're gonna want that to match right so you're gonna want to have a perfectly balanced panel we'll see an easy way to take a look at that information and then a quick way to take a look for outliers so let's switch over to Stata here where I've I've already started without you so the data set we're gonna use for our example here is from the World Bank open data source so to access this first you have to install the WB open data package so SSC installed WB open data and then for the example here we're gonna use what's listed as topic number 15 so these are social indicators across countries and over time and you're going to want to use the long data format so that's gonna organize our data so we have two columns that heat that together I should say uniquely identify each observation so it's gonna have the time and the country identifier so it takes a few minutes it's a fairly large dataset going back I think to 1961 for all countries across the world so it'll take a minute for it to download but go ahead and and do that and you should see something that looks like this so we have all of these indicators across countries regions and years and let's expand that out a little bit and let's kind of imagine that we've already decided which variables we're going to want to focus in on so we'll kind of just hone in on a few of these and then imagine that we're going to be using those for our future analysis right so as we mentioned a good idea before you do anything else right is to start your do file editor and start recording all of the commands that you're going to find useful all right so our first one is going to be generating be the dataset to begin with and I'll go ahead and just for the sake of example here I do that so we just copy and paste and then when you're following along you would want to do this for each subsequent command and we toggle back over to Stata so the next idea is we're going to be making a lot of changes to this raw data right and we're not always sure that these changes are gonna be what we actually want to do right so you might want to have to you might want to be able to revert back to the starting point so a good can land to use at this point would be the preserve command right so once we hit enter here and use that preserve command that's basically a bookmark right so we can make all of these changes generate new variables all these things and if we want to go back to this point we can type restore and it'll bring the data back to its original formulation and you can go ahead and do this multiple times right so once you get a point where okay I'm pretty sure this is going to be where I you know I'm solid up to this point you can hit preserve and then you can make additional changes without the risk of losing that that kind of bookmarked spot okay okay so in this case with this World Bank data the names of the variables here are fairly clumsy so we might want to start to rename those variables triangles so let's imagine that we have if we kind of scroll down here say the variable of interest here is female life expectancy right so that's going to be our dependent variable so we might want to rename that variable bring that over here so that's gonna be our Y variable we could just rename it Y but then we'll know what it is but other people reading our work won't know so we might say Y underscore life expectancy underscore female something like that a little easier to deal with and that slots right into the same place where it was right and let's say we want to try to predict female life expectancy say with three X exogenously riyals from and let's say again you would actually put a lot more thought into it but just for the sake of our example here where we have this literacy rate if we pull that over a little bit more this is a literacy rate gender parity index right so it's measuring within a given country at a given year to what extent is literacy equalized across the gym so let's say we're going to take that literacy rate equality index and rename that that's gonna be our next one variable literate and let's say our again this is a gender equality index primary secondary school enrollment so as females experience greater gender equality in literacy rates and school enrollment do we see a gain in life expectancy that would be the question we'd be addressing there so we'll call this X to underscore enrollment and one more just for the the sake of our example here let's take Valle nerble employment percentage and that'll be our X 3 variable so again the format rename the original namespace and then your new name okay so now we've got four variables that we're gonna be using and then a whole bunch of variables that were not gonna be using right so again we we've put in that preserve command so we can do this without risk of of any permanent damage right but we should be able to see the the usefulness of getting rid of everything that we're not gonna be using so we could either say drop and then list all the variables that we don't want or keep and list all the variables that we do want so we're going to want all the identifying information right so we're gonna want our country code country name region region name and we'll keep these as well we haven't used those we're not gonna use those definitely want our year and then our X 1 X 2 X 3 just double clicking here and then our Y variable so we're gonna keep all of those get rid of everything else now we've got a nice clean dataset and again if we decided we made a mistake and want some of those variables back we could just hit restore and get back to the beginning okay so in this case it's not a not a huge deal we've only got three variable variables that are going to be on our the right-hand side of our regression hazard but if we had a longer list it might be helpful to use a global or a macro to collect those into a list right so we can do global and create an object called X list right that's gonna take each of these variables and we just list them out with just a space in between them so we don't need to type all of those out right and if we want to refer to that say we want the summary statistics we can go dollar sign X list and it would be just like if we listed those out individually and another nice thing about this naming convention x1 x2 x3 underscore is that creating this list we didn't really need to do that we could do something like there's some X and then an asterisk where that stands for okay we want stage to call up all the variables that start with X and then have different values after that and in this case all the variables start with X are the variables in our list right so a nice way to to organize things there okay now in this case we see that the variable label column here is all filled out so there's nice descriptions of all of our variables that's not always going to be the case and when there are variable labels missing or maybe incomplete before you get too far down the road here it's a real good idea to fill that in right and you can call this up with the the drop down menu or we can just use the label variable command so let's say we want to change the label for our X so it goes label variable rx1 underscore lit read and then whatever we want to throw into that variable label the proper description of the variable we put in quotes type it in and that's going to appear here right so again you want to make sure that if it's a month from now a year from now next time you open up this data set or run this new file you're gonna be able to exactly replicate and understand what you did and likewise when you hand this over to a supervisor or to somebody evaluating your research they're gonna be able to see exactly what you're doing okay now since this is an old data let's go ahead and browse our data here and like we said this is panel data in long form right so we have our country identifier here in alphabetical order so starting with Afghanistan and then our year identifier and we're gonna need Stata to be able to identify the country identifier right so in this case it says this country code but note that it is in red letters right so that means it's being it has been input as a string or a text variable so we can't be utilized as a numerical identifier so we need to encode that text or string into a numeric identifier and the command there is in code so we'll say encode and then the variable of interest country code that we want to generate a whole new variable that will be numeric in nature let's call it code one okay so what we should see now is a new variable at the bottom of the list here and if we browse it it's gonna look exactly like the original country code it'll be the last one here but it's no longer in red okay so now we can do something like this we can use our xt set command we're now we're telling state of what those panel identifiers are right and they're going to be part one the entity or cross-sectional identifies and year as the time identifier in an era are from 1960 to 2019 and that wouldn't have worked with the original data okay so now we're ready to go for our panel data even though we had that string variable to start with okay and now that's aroused that data again you'll notice that certainly with Afghanistan here there's a lot of missing values and as we scroll down there's a lot of missing observations in every variable for every country so what we have is somewhat of an unbalanced panel right so once we eliminate those missing observations we want to be able to see okay how many observations are are actually left right so it's kind of up to you the order that you might want to do this I would probably eliminate the missing observations first right if these are the variables that are going to be used in our regression any missing observations going to be eliminated in the analysis anyway right so there are other commands that we could do here but we can just say drop if say the value of our x1 lit rate is equal to missing so we go equal equal dot so exactly equal to a missing observation and here we get 13,000 observations that are eliminated and do the same thing for each of our other variables so what we'll be left with is just those rows that have numerical values for all four of our our variables improper our x3 equals equals dot as a missing value and then lastly drop if our Y variable is missing okay so now let's do that some command again so let's summarize our X list and our Y variable as well and so for each one we have 1500 observations that are remaining and now in terms of that panel dimensionality right so a real handy variable that we might want to create here is a count variable so how many observations are in the data set for each cross-sectional entity in this case how many years exist for each individual country so we can we're going to use the generate command but let's use the prefix buy code one so we're gonna break this out by our country code identifier and then colon and then we're gonna generate a variable called count that's equal to underscore capital in so normally if we just did generate count equals underscore capital in it would tell us the sample size how many total observations are in our data set but here by breaking out at breaking it out by country by code one it gives us the number of time periods per cross-sectional entity and let's now let's just take a look at what we've just done always submit your data to the eyeball test train see what it is that we're looking at so after all those observations were eliminated Afghanistan was left with only one observation Albania four observations 27 etcetera so we have a lot of countries that don't have too many observations left so now we might want to we're gonna have to decide what's gonna be our cutoff right so since we have a panel if you wanted to exactly balance the panel we would eliminate all the countries that don't have the maximum number of observations you might want to be a little bit less stringent than that we could do something like this we could so let's drop if our current variable is say less than or equal to ten all right so every country that's gonna be in the remaining sample should have at least ten time periods to get a little bit of dynamics going on in the panel something like that so that'll be up to you but that's a real easy way to to account for that to control for that so we can also use now some of our panel specific summary statistics so if we go XT some and then let's say for our Y variable right well it gives us all the usual summary statistics the mean the standard deviation the min and the max but it gives it in the kind of three available dimensions the overall so both time periods end and across entities and then looking at variation between entities and within entities over time and so these observations kind of tell us the degree to which our our panel is balanced or unbalanced so t-bar so it's the average number of time periods per country here is twenty four point four four the fact that that's not a whole number and it's not equal to the maximum number of time periods between 1960 and 2019 tells us that we don't have a balanced panel but we have again an average of about twenty four twenty four years of per each country all right the last thing i said we could take a look at here is looking for outliers right so we've kind of got an already an idea here but if we just do a summaries of all of our remaining observations for our x variables and for our y variable here again what we're looking for with an outlier is not a even a surprisingly small or large number it would have to be a value that is basically impossible right given the definition of your variable so we're looking for coding errors right we're looking for something that is order of magnitude different from the mean so for the most part our variables here are percentages that's our Y variable versus index numbers the X variables and so these are all more or less reasonable sorry that vulnerable employment that's also a percentage these all seem reasonable based on the min and the max you could also get the visual by using the box plot right so we can call that up as a graph command with the option box and you could do this for all the variables at once and kind of do them side by side the scale gets a little messed up there but if we can just do it one at a time and we've got our kind of interquartile range here and we'd be looking for observations that would appear as little dots that are well outside those outer fences red plus or minus two interquartile ranges which we don't seem to see here right but that's what you'd be looking for we would also want to call up all of our normal a data visualization tools we get our histograms get our scatter plots that's another video entirely the point is it now we've got a data set that is ready to go okay what we should have been doing hopefully is recording all these commands in our do file we've got it all organized we've got the observations that we need and now we're ready to do some analysis of the next video after this we'll actually do some fixed effects random effects panel data regression so I will see you then hopefully this was helpful [Music]

Info

Channel: Mike Jonas Econometrics

Views: 17,619

Rating: 4.975831 out of 5

Keywords: Data, econometrics, stata, panel data, longitudinal data, data cleaning, statatraining, stata tutorial

Id: Fb4RzzG6moE

Channel Id: undefined

Length: 22min 7sec (1327 seconds)

Published: Tue Dec 31 2019