Multilevel Mixed-Effects Modeling Using MATLAB

Video Statistics and Information

Captions
Hello, and welcome to Multilevel Mixed-Effects Modeling Using MATLAB. My name is Shashank Prasanna. I am the product manager for the statistics and machine learning products here at MathWorks.

Here's the agenda for the rest of today's presentation. We'll start with a high-level discussion of linear mixed-effects models: what they are and why we care about them. We'll also discuss a little bit of the theory behind linear mixed-effects models and the standard assumptions behind these models. After that, we're going to take a deeper dive into fitting linear mixed-effects models using the fitlme function in the Statistics Toolbox. We'll go through an example showing how to solve these problems in MATLAB. The second half of this presentation will focus on panel data regression using some economic data. We'll discuss what a panel is and again go through a MATLAB example that covers fitting various panel regression models. This will be followed by some key takeaways, ending with a Q&A session. The only prerequisite for this presentation is basic familiarity with simple linear regression. With that said and done, let's get started.

An idea that is central to this topic is the concept of grouped data. For example, consider that you want to analyze SAT scores of students across the country. This data can be grouped by the school the students attend. The schools may then be grouped by the states they are in, states by region, and so on. Students in a particular school or state are affected by school- or state-specific factors, such as the quality of education in a school or the education laws in a state. Grouping information is important for making meaningful inferences on this type of data, and grouped data commonly occur in experimental and observational studies.

With that in mind, let's take a minute or two to quickly go over the concepts of linear mixed-effects models. If you are already familiar with this topic, this can serve as a quick refresher. So what are linear mixed-effects models? A linear mixed-effects model can simply be defined as an
extension of linear models for data that are collected and summarized in groups. Consider, for example, that you have GDP data for different states collected over time. If you were to fit a simple linear model with an intercept and slope, you would likely get a fit like this. This is obviously not a great model, because it completely ignores the grouped nature of the data. In mixed-effects terminology, such models are known as fixed-effects models. They are called fixed because the estimated parameters beta_0 and beta_1 do not vary with observations or groups.

Now consider a mixed-effects model. Let's break this down and take a closer look so we can relate it to the fixed-effects model on the left. The first half of this equation looks familiar and represents fixed effects. Like we saw before, the coefficients beta_00 and beta_10 do not vary with observations or groups. The second part of this equation represents what are called random effects. Here we have a random-effect coefficient for the intercept, b_0j, and a random-effect coefficient for the slope, b_1j. The subscript j indicates that these coefficients are associated with a particular group, or observational or experimental unit; in our case it's the state. What makes them special and different from fixed effects is that they're random variables with a prior distribution, and they represent a random variation around the fixed effects for each group. Models that combine fixed effects and random effects are called mixed-effects models, and they're very well suited for grouped data.

So we've talked about what linear mixed-effects models are. Why should you use them? In short, they are simply a more accurate representation of grouped data. Linear mixed-effects models can make better predictions and forecasts, even when the groups have unequal numbers of observations. Linear mixed-effects models fit fewer parameters compared to fixed-effects models and are less subject to problems like overfitting. You can also make inferences about the population of
groups rather than just the sample used to fit the model.

Let's now go through a code example and see how we can fit these models in MATLAB. For the first example, the objective is to model and forecast the GDP of states in the U.S. We'll consider the response to be the log of GDP by state. The predictors are the years from 1970 to 1986 for each state, and the grouping variable here is state; we have 48 different states. The approach we'll take is to fit simple fixed-effects, or ordinary least-squares, models to begin with, then fit linear mixed-effects models, perform diagnostics, compare different models, and see how we can improve the forecast results. So let's jump into MATLAB and get started.

I'd like to start by showing you the final results of our example. What I have here is an HTML report that was automatically generated by MATLAB from my example script. It includes a table of contents with hyperlinks to sections, and all the visualizations that were generated by my analysis. Let's scroll all the way to the bottom and take a look at the final results. The figure shows a comparison of GDP forecasts for two different approaches. The left is the fixed-effects model and the right is a linear mixed-effects model. The solid circles indicate observations that were used to fit the model, and empty circles indicate observations that we deleted from the model. The forecast confidence interval shows that the linear mixed-effects model gives you a much more confident forecast, even in the presence of missing data. Linear mixed-effects models also give more accurate forecasts and are less affected by missing data compared to fixed-effects models. So let's jump into MATLAB and see how we can go ahead and fit these models.

So this is MATLAB. For those of you who haven't seen it before, it's a fairly interactive environment with a number of different windows. We will talk about each of these as and when they become relevant. Let's start with the Current Folder window here. This shows all the files
and folders on my computer. The data we want to work with today is called public data, and it's a spreadsheet. So let's open it up outside of MATLAB to see what the data looks like. The data set consists of several different economic variables, including GDP, for each of these 48 different states, and also contains the time intervals for each of these observations. You can find more information about this data set at the following reference.

So let's start by pulling this data into MATLAB. The easiest way to do that is to right-click and choose Import Data; alternatively, you can drag and drop this file into the Command Window. This is the Import Tool, which can be used to pull data from spreadsheets, text files, CSV files, and so on. It also automatically recognizes the headers and extracts them for you. For this example, I am interested in the state, the year, and the GDP data, so let me go ahead and pull this data. I'll choose table as the data container, which is well suited to this sort of heterogeneous data, where you have textual as well as numeric data. When I'm ready, I can go ahead and pull this into MATLAB.

You can see that the workspace has now been populated with the data set we just imported. The workspace shows you all the different data types and variables you'll be working with as part of your analysis. They're always available at your fingertips, and you can open them by double-clicking on them. A variable opens in the Variable Editor, where you can take a look at what the data looks like. We can confirm that we have three different variables in this table, which are the state, year, and GDP.

So let's start by preprocessing this data. First, I want to convert the state into a categorical array, because it is my grouping variable. As the next step, I want to log-transform the GDP data so I can go ahead and fit my regression model. Now, before we start fitting regression models, let me go ahead and pull up some visualization to take a look at
what the data looks like. I can do that directly from the Variable Editor. Here I'll go ahead and choose my predictor, which is the year; the response, the log of GDP; and state, which is the grouping variable. Under the Plots tab, I can pull up the scatter plot. So let's take a look at what the scatter plot looks like. On the y-axis is the log of the GDP data, and we see that we have the response variable for all 48 states recorded from 1970 all the way to 1986. It looks like the average value of the log of GDP varies from state to state. Speaking of averages, I can also pull up a box plot to get a sense of how the GDP values for the states compare to each other. You see that the confidence intervals don't really overlap, and this is an indication that we have to include some sort of grouping information in the regression model.

It is a good idea to start with a simple model and improve it based on its results. The first model we'll be exploring is a simple linear regression model with the log of GDP as the response variable, and we'll estimate the intercept term as well as the slope term for the centered year. Those with experience in analyzing regression models may have already guessed why we need to center the year. This is because all the data were collected between 1970 and 1986. Without centering, the intercept represents the GDP at year 0, and this causes a high negative correlation between the estimates of the slope and the intercept. We can remove this correlation by centering the year as follows.

The fitlm function in the Statistics Toolbox can be used to fit simple linear regression models. If you are unfamiliar with this function, you can use the function browser for function recommendations, or pull up the help documentation to learn more. The documentation is a great resource to learn how to use a function and what its inputs are, and also to take a look at code examples, which you can directly copy and paste into
MATLAB, execute, and get started. Let's go ahead and execute this model and see what the results look like. The output on the command line shows that the coefficient estimates are statistically significant. We also have the goodness-of-fit statistics here, which are R-squared and adjusted R-squared, and these show that you can certainly improve on this model.

One way to deal with grouped data is to go ahead and fit separate linear regression models for each group. This section of code will do exactly that and show a visualization of the confidence intervals for the coefficient estimates of the intercept and slope for each group. Let's zoom in to take a closer look at some of these states. What you see here is that the confidence intervals for each state's estimates of the intercept and slope don't overlap for all the states. They overlap for some states, but if you look at all 48 states, the confidence intervals give us a clear indication that random effects are needed to account for state-to-state variation in the intercept and slope, since the confidence intervals don't overlap.

Another approach to incorporating grouping information into a regression model is to introduce the group as a categorical variable, which means that we'll introduce a dummy variable for each level in the grouping variable. In this example, we'll still use fitlm to fit an ordinary least-squares model. The response is the same log of GDP, and in addition to the intercept and slope terms, we also introduce state into the regression model, along with an interaction term between state and year. What fitlm does is introduce dummy variables for each of these two grouping terms. Let's run this and take a look at the results. Compared to the previous model, we've now introduced 48 new parameters into the model for the state and another 48 parameters for the interaction term between state and year. Although the goodness-of-fit
statistics here show that this is indeed a much better model than the model without the grouping information, there are disadvantages to fitting this type of model. The number of independent variables in the model increases linearly with the number of levels in the grouping variable, and this can result in a loss of degrees of freedom. This can also be a problem when dealing with a small set of observations or when you have a large number of levels in your grouping variables. To address some of these issues, we can fit a linear mixed-effects model next.

The function used to fit a linear mixed-effects model is called fitlme, and the interface is almost identical to fitlm. The only difference is that we've specified the random effects here. The way to read this is that we've introduced a random effect for the intercept as well as the slope, with state as the grouping variable. You can introduce multiple independent random effects in separate sets of parentheses in this manner. Specifying random effects in a single set of parentheses indicates that these two random effects could be correlated. So let's run this and take a look at the results.

The output on the command line is similar to fitlm. We have some information about the model and the formula we are estimating, along with the fixed-effects coefficients. Under the fixed-effects section, you can see that we have estimates for the intercept as well as the slope, and the fit shows that these are statistically significant. But I'd like to draw your attention to the random-effects section of this display. Unlike fixed effects, the random-effects coefficients themselves are not parameters in the model. However, they can be extracted from the model, and we'll see that in a moment. The model is actually parameterized by the prior covariance of the random effects, and their estimates are shown here. The confidence intervals of these estimates do not include zero, and this means that the random-effects term is significant. However, if you want to formally test the
significance of random effects, you can do that by running the compare function. The compare function allows you to compare two different linear mixed-effects models. In this case, we'll be comparing a model with no random effects, where we only have fixed effects for the intercept and slope, with the model we estimated in the previous section, which has the following random effects. The compare function performs a theoretical likelihood ratio test. The p-value of zero here indicates that introducing random effects into the model significantly improves the model.

The random-effects coefficients, as I mentioned, are not really parameters of the model, but they can be extracted using the randomEffects function. The vector of random effects B is also known as the best linear unbiased predictors, or BLUPs. We can also extract the covariance parameters using this function. What I want to do here is take a closer look at the random-effects coefficients for the intercept and slope and see if there's a correlation between them. Here's a scatter plot of the random effects for the slope against the random effects for the intercept. The plot shows that there's almost no correlation between them; if anything, there may be a mild negative correlation, and we can confirm this by looking at the covariance parameters here.

If you cannot easily describe your model using a formula, or you want greater control over your model specification, you can fit a linear mixed-effects model by specifying the design matrices in the standard form. In this piece of code, I am attempting to reproduce the results from the previous section, only this time, instead of providing the equation, I will construct the fixed-effects design matrix X by specifying the intercept and slope terms, and also the random-effects design matrix Z by again specifying the intercept and slope and allowing them to be correlated. We can use the fitlmematrix function to fit a linear mixed-effects model under
this specification. I won't compare numbers, but the results will be identical.

This brings us to the last and most interesting section of this example. What we'll do is fit a fixed-effects model for a subset of the 48 states. We'll make this a little more interesting by deleting the majority of the observations for a subset of these states. Again, we'll use these models to forecast GDP into the future, even in the presence of missing information. Recall that all the observations were from 1970 to 1986, and we've deleted 14 out of the 17 observations. The linear mixed-effects model we'll be using is the same as the one we saw before.

So let's take a look at the results. The solid circles here indicate the observations that were used to fit the model, and the empty circles indicate observations that were excluded from the model but are shown here for visual validation of the forecast. The first plot, on the left, shows the forecast of a fixed-effects model that incorporates the group-specific information using dummy variables for state and a state-and-year interaction term. The second plot, on the right, shows the forecast of a linear mixed-effects model with a random slope and intercept. The forecast confidence interval shows that the linear mixed-effects model gives a much more confident forecast, even in the presence of missing data. Take a closer look at the state of Wyoming here. Due to the small number of observations available to fit this model, the fixed-effects approach captures a local trend and forecasts a GDP that goes downhill. The linear mixed-effects model, on the other hand, forecasts an increasing GDP, which agrees with the long-term trend and makes sense intuitively as well.

That brings us to the end of the first example, so let's go back to the slides and do a quick recap. We saw that fitlme allows you to fit linear mixed-effects models that give you more accurate results for grouped data than fitlm. When you have missing observations for groups, you can get better prediction and
forecast accuracy. The fitlme function fits fewer parameters for grouped data; this can reduce overfitting and give better predictions. Finally, you can draw more accurate inferences on data compared to fixed-effects models.

There are three common ways to fit grouped data. The easiest option, which you can see in the first column, is to fit separate regression models for each group independently. The second option is to fit a model that includes group dummy variables. We can also fit a linear mixed-effects model, like we've seen in the example we just explored in MATLAB. So here is a comparison table that shows the advantages and disadvantages of each approach. Linear mixed-effects models model group variation as a random variable that generalizes to the entire population of group levels. Linear mixed-effects models fit fewer parameters and have increased model accuracy. Separate models are easier to interpret, but they may not adequately represent the data compared to linear mixed-effects models. We saw from the example that linear mixed-effects models work very well when groups have missing observations. And finally, linear mixed-effects models account for correlation within groups as well. The fitlme function allows you to fit linear mixed-effects models that give you more accurate results for grouped data than fitlm.

In this next example, we'll see how we can perform panel data regression using fitlme. Panel data is widely used in economic studies, so let's take a look at what panel data is. A panel consists of data collected by observing many subjects over multiple time periods. Subjects, or cross-sectional units, could be individuals, households, firms, or even states or countries. The observations themselves could be GDP for states or countries, unemployment rate for states, individual salary, and so on. There are several common approaches to fitting this type of data. In the next example, we'll explore some of the popular approaches for fitting panel data regression models.

The objective of this example is to investigate the
productivity of public capital on the state's economic output, or the state's GDP. While the previous example was focused on prediction and forecasting, this example will focus on building a model for inference. The data we'll be working with is GDP by state along with other economic data; we'll see that in a little more detail in the following slide. Consider that you're an economist or a policymaker tasked with this problem. The approach we'll take is to first fit and compare various panel regression models, analyze the effect of public capital on the state's GDP, and publish our findings in a report that you can share with a colleague or a supervisor.

We'll use the Cobb-Douglas production function as the regression model. This is a popular model in economics for modeling production. The state GDP is the response, and the economic variables in the table here are the predictors of the model. We also have two grouping variables, state and region. State has 48 states, so 48 levels, and region has nine different levels. With that said and done, let's jump into MATLAB and get started.

Let's go ahead and clear the workspace and start with a clean slate. Again, the first step is to load some data, and I have a script here that does that for me. It was automatically generated by the Import Tool and pulls in all the important economic variables we need for this analysis from the spreadsheet we saw in the first example. I'll also perform some preprocessing steps, such as converting state, region, and year into categorical variables and log-transforming some of these variables so that we can fit the Cobb-Douglas production function. We can take a look at the variables in the Variable Editor to see if this looks OK and whether we have all the variables we require for this analysis. This looks good to me. We have all the data, and we don't have any missing observations. This type of panel data set is called a balanced or complete panel. This means
that all the observations for each state are measured at the same time points, in this case 1970 to 1986, and we have all the observations for every single state. But it is common for economic panels to be unbalanced or incomplete. Let me show you what that means.

I will go ahead and create a copy of this data set; let's call it panels. Consider that you have some information missing for certain states. Here I'll delete some information for Alabama and a few years for Arizona, and I can repeat this process. But regardless of how many observations I delete, it's hard to recognize that you have missing observations for certain years when the data is represented in this form. This type of representation is called stacked form. So what we are going to do now is unstack this data, and we can do that by calling the unstack function. We provide the panel we are working with and the indicator variable with which we want to unstack the panel. Let me run this, and we can take a look at the unstacked panel. An unstacked panel is an alternative way of looking at the same data. We have the repeated observations shown here as rows and all the cross-sections shown here as columns, and we can clearly see that we have missing observations for the states where we deleted them.

With economic data, some states may have GDP or other economic variables going back several more years than other states, and that can give rise to this sort of unbalanced or incomplete panel. The reason I'm mentioning this is that linear mixed-effects models are great tools that are very well suited to fitting balanced as well as unbalanced panels, and they give much better results compared to alternative methods of performing panel regression.

The first model we'll explore is an ordinary least-squares model, or OLS model. We will again use fitlm to fit this model, and this is the Cobb-Douglas production function without any grouping information. OLS models are also called pooled OLS regression
models, since they combine the cross-section and time-series aspects of the data. They're also referred to as population-average models, and the assumption here is that all the statistical requirements for OLS are met. Let's take a look at the coefficient estimates on the command line. The OLS model reports that public capital is productive and plays an economically significant role in the state's output. The inference here can be that, at the state level, public capital has a significant positive impact on the level of output and does indeed belong in the production function. The p-values show that the fit is statistically significant. As we discussed before, it should be noted that pooled regression models such as these lead to underestimated standard errors and inflated t-statistics, and therefore the resulting inference may not be valid.

The panel data fixed-effects model addresses some of these challenges. In panel data terminology, these models are also referred to as least-squares with dummy variables models. All the state-specific information is incorporated as dummy variables, and as we saw before, fitlm automatically does this for you under the hood. Such models are ideal only when you have a moderate number of cross-sectional units, or a moderate number of levels in your grouping variable. In contrast to the OLS model, the least-squares with dummy variables model reports that the public capital estimate is not economically significant in the state's output. However, in addition to the disadvantages of introducing dummy variables into the regression model, which we already discussed in the previous example, here we notice that the estimate for public capital is also not statistically significant.

The next panel regression model we'd like to explore is called a one-way random-effects model. Here we retain all the fixed effects as is and introduce a random effect only for the intercept term. Again, in contrast to the OLS model, a mixed-effects model with a state-specific random
effect finds that public capital is economically insignificant in the state's private production, but again we notice that the fixed-effects coefficient is not statistically significant. In panel data analysis, mixed-effects models such as these are also known as error component models, random coefficient regression models, covariance structure models, and multilevel models.

An important aspect of fitting an accurate model is to introduce the right set of predictors. We can attempt to improve upon this existing random-effects model by introducing three new predictors: highway, water, and utilities. Economically, these are constituents of public capital; that is, public capital is the sum of these three variables. Let's take a look at the results of this model. Here we notice that the estimates for the log of highway, water, and utilities are statistically significant. The log of utilities appears to have a negative economic impact on the state's output. To formally test whether this is indeed a better model than the previous model, we can perform the theoretical likelihood ratio test and see what the p-value looks like. The p-value of the likelihood ratio test is close to zero. This is an indication that this model is a significant improvement over the previous model.

The next model we'll explore is called a two-way random-effects model. A two-way random-effects model introduces random effects for state as well as random effects for time. Economically, this could account for events that are specific to a year and that affect the state's output. We can again compare this model with the previous model to see if it is an improvement, and the likelihood ratio test shows that this is indeed a significant improvement over the previous model.

So far, we've built models that account for state-specific contextual information by introducing random effects for each state. This model can be extended to introduce random effects for state as well as region. In this example, state is nested within region. In other words,
groups of states form a region, and no state-level observation is part of more than one region. This can be done by introducing a random effect for region, which has nine levels, and a random effect for the interaction term between state and region, which has 48 levels. We'll retain the random effect for the years, so this is also a two-way random-effects model. The likelihood ratio test shows that this is indeed an improvement over the previous model, but it is only marginally significant based on the default significance level. That brings us to the end of this example.

But consider that you're an economist or a policymaker who's performing some of this analysis to see what the effect of public capital is on the state's output. You may want to build these regression models and share the results with colleagues or other economists. What I have here in this folder is another script, which performs the same analysis as the one we saw. The only difference is that I have included a lot of comments that explain what we're doing in each section. I also have markup for equations and other visualizations. Now, if I want to share this result with someone, I can automatically publish the script in MATLAB. All I have to do is navigate to the Publish tab and choose the format I want to share these results in. I'll choose PDF and hit Publish. What MATLAB does is automatically run through this file section by section, execute the code, capture all the visualizations, embed them into a report, and render any LaTeX equations and other HTML markup as well. Finally, what you get is a beautifully formatted report you can share with colleagues or anyone else who would be interested in the analysis you've been performing.

So here is the report. We have a table of contents that was automatically populated by MATLAB, and all the entries are hyperlinked, so we can navigate directly to a section and take a look at the
equation we're fitting and all the analysis results automatically embedded into the document. At the end of this document, I've also included a comparison table, which an economist would find useful. You can take a look at all the coefficient estimates for public capital and other economic variables based on each of these different models. The numbers in parentheses indicate the standard error for each of these estimates. The models we fit and the analysis were based on this literature, so be sure to check it out if you are interested in this topic.

We saw that we could specify, fit, and compare various types of panel regression models. We explored fixed-effects models, one-way random-effects and two-way random-effects models, and also nested and hierarchical models. MATLAB offers a convenient regression interface whether you are fitting ordinary least-squares models or mixed-effects models. We saw that there are plenty of visual diagnostics that are easily accessible at the click of a button. You can also compare and improve models visually or by performing statistical hypothesis tests. Finally, we can publish and share results automatically from MATLAB without spending additional effort documenting the results.

If you are interested in learning more, please take a look at the product documentation. There are plenty of examples that can help you get started. Also, feel free to visit the product page; there are lots of videos and examples on related topics you may find useful. And finally, if you found this MathWorks webinar useful, there are a number of others that I would strongly recommend. You can find these on the product page by navigating to the webinar section. One last piece of information: all the code, data sets, and examples that we used in today's presentation will be available on MATLAB Central. So thanks for listening in.
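
For reference, the two models that the talk describes verbally (the pooled fixed-effects model and the model with a random intercept and slope per state) can be written out as a sketch. The talk only names the coefficients; the symbols y_ij (log GDP of state j at observation i), x_ij (centered year), the error term, the covariance matrix Psi, and the normality assumptions are not stated in the talk and are assumed here as the conventional formulation.

```latex
% Fixed-effects (pooled) model: one intercept and one slope shared by all states
\[
  y_{ij} = \beta_0 + \beta_1 x_{ij} + \varepsilon_{ij}
\]

% Mixed-effects model: shared fixed effects plus state-specific random deviations
\[
  y_{ij} = \underbrace{\beta_{00} + \beta_{10}\, x_{ij}}_{\text{fixed effects}}
         + \underbrace{b_{0j} + b_{1j}\, x_{ij}}_{\text{random effects}}
         + \varepsilon_{ij},
  \qquad
  \begin{pmatrix} b_{0j} \\ b_{1j} \end{pmatrix} \sim \mathcal{N}(0, \Psi),
  \quad
  \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\]
```

Under this reading, the model is parameterized by the fixed effects, Psi, and the residual variance, while the b_0j and b_1j are predicted per state rather than estimated as free parameters, which matches the speaker's remark that the random-effects coefficients are not themselves parameters of the model.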
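
The first example's workflow can be sketched in a few lines of MATLAB. This is a minimal illustration, not the presenter's script: the data are simulated (the spreadsheet used in the talk is not reproduced here), the variable names `state`, `yearC`, and `logGDP` and all numeric values are hypothetical, and the likelihood ratio test uses a random-intercept baseline rather than the fixed-effects-only baseline mentioned in the talk. It assumes the Statistics and Machine Learning Toolbox is installed.

```matlab
% Simulated stand-in for the state GDP panel (names and numbers are hypothetical).
rng(0);                                   % reproducible random numbers
nStates = 48;                             % number of groups, as in the talk
years   = (1970:1986)';                   % 17 yearly observations per state
nYears  = numel(years);

state = repelem((1:nStates)', nYears);    % group index for every row
year  = repmat(years, nStates, 1);
yearC = year - mean(year);                % center the predictor, as in the talk

b0 = 0.8  * randn(nStates, 1);            % state-specific intercept deviations
b1 = 0.01 * randn(nStates, 1);            % state-specific slope deviations
logGDP = 10 + b0(state) + (0.03 + b1(state)) .* yearC ...
         + 0.02 * randn(numel(year), 1);  % observation noise

tbl = table(categorical(state), yearC, logGDP, ...
    'VariableNames', {'state', 'yearC', 'logGDP'});

% Pooled ordinary least squares: one intercept and one slope for all states.
lm = fitlm(tbl, 'logGDP ~ 1 + yearC');

% Random intercept only (baseline), then correlated random intercept and slope.
lmeIntercept = fitlme(tbl, 'logGDP ~ 1 + yearC + (1 | state)');
lmeMixed     = fitlme(tbl, 'logGDP ~ 1 + yearC + (1 + yearC | state)');

% Likelihood ratio test of the nested mixed-effects models.
results = compare(lmeIntercept, lmeMixed);
disp(results);

% Predicted random effects (BLUPs) and estimated covariance parameters.
B   = randomEffects(lmeMixed);            % two coefficients per state
psi = covarianceParameters(lmeMixed);     % psi{1}: 2-by-2 random-effects covariance
```

Placing both terms inside one set of parentheses, `(1 + yearC | state)`, lets the intercept and slope deviations be correlated; writing `(1 | state) + (-1 + yearC | state)` instead would model them as independent.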
Info
Channel: MATLAB
Views: 10,192
Rating: 5 out of 5
Keywords: MATLAB, Simulink, MathWorks
Id: -XVVjwSqbZo
Length: 34min 49sec (2089 seconds)
Published: Sun Apr 30 2017