Hierarchical Linear Models I: Introduction

Video Statistics and Information

Captions
Welcome to this series of lectures on hierarchical linear models. The intent of the lectures is to introduce you to methodologies that are appropriate for modeling data that have been clustered in some manner. These models can be appropriate for both cross-sectional studies and longitudinal research, provided the data can be considered nested or clustered. We encounter clustered data, for example, when we look at voters who are nested within countries, states, or counties, or workers who are nested within different firms. We also encounter clustering with complex survey data collected by means other than simple random sampling: with cluster sampling we first identify a primary sampling unit and then do our random sampling within those clusters. In longitudinal designs, the multiple time points at which we collect data are themselves nested within individuals; in that case the time points are the lower level and the nesting occurs within individuals. We can also extend this to multiple levels of nesting: we may have workers nested within firms, with those firms in turn nested within sectors, and likewise in the longitudinal case we may have time points nested within individuals, with those individuals nested within, say, different regions.

The problem that emerges when we have clustered data is that we end up violating the assumptions required by the traditional linear model. First, the observations are no longer independent. Individuals within the same cluster share something in common, shared experiences within that particular cluster, or, in the case of longitudinal data, we have multiple observations on the same individual, and the idiosyncrasies of each individual probably affect every one of those observations at the different time points. Because of these within-cluster correlations or dependencies, we can no longer treat each observation or data point as independent, which is what the traditional regression model or the traditional analysis of variance approach requires. We are also likely to encounter between-group heteroscedasticity: each cluster has a different sample size and likely a different level of variability. Since we have different variances in different clusters, we no longer meet the homoscedasticity assumption of regression, also known as the homogeneity of variance assumption in the context of analysis of variance. With this assumption violated, we need to move on to other methods. It is also the case that with clustered data we may have explanatory variables whose effects change depending on the environment we are considering: if we have five clusters, we may have five different regression lines, one for each cluster, with the slopes changing from cluster to cluster. What HLM allows us to do is make maximal use of the data so that we get the most accurate estimates possible of those different slopes, without throwing away degrees of freedom as we would if we isolated each sample and ran a separate regression.
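To make that within-cluster dependence concrete, here is a minimal simulation sketch (not from the lecture; the cluster sizes and variance components are made-up values) in which two members of the same cluster are strongly correlated rather than independent:

```python
import numpy as np

rng = np.random.default_rng(0)
n_clusters, n_per_cluster = 20, 30          # hypothetical sizes
cluster_sd, within_sd = 2.0, 1.0            # hypothetical variance components

# Every member of a cluster shares that cluster's random effect.
cluster_effects = rng.normal(0.0, cluster_sd, n_clusters)
y = np.repeat(cluster_effects, n_per_cluster) + rng.normal(0.0, within_sd, n_clusters * n_per_cluster)

# Intraclass correlation: the expected correlation between two members of the same cluster.
icc = cluster_sd**2 / (cluster_sd**2 + within_sd**2)
print("theoretical ICC:", icc)              # 0.8 here, so observations are far from independent

# Empirical check: correlate the first and second halves of each cluster's members.
half = n_per_cluster // 2
by_cluster = y.reshape(n_clusters, n_per_cluster)
print("observed within-cluster correlation:",
      round(np.corrcoef(by_cluster[:, :half].ravel(), by_cluster[:, half:].ravel())[0, 1], 2))
```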
Historically, when researchers have encountered clustered data, they have taken three different approaches. The first is to aggregate everything: you simply take the mean of all of your variables and then fit the regression line to the means, so that you no longer have clustered observations; you are just fitting lines to the summary statistics, the means. The problem is that this runs the risk of the ecological fallacy, which occurs when we attempt to model micro behavior at the macro level but the macro relationship ends up being completely opposite to what is going on at the micro level; we will see an example of this on the next slide. A second approach is to treat macro variables as micro variables: we assume that each individual in a cluster has the same value on the cluster-level variable, so if we have students in schools, each student in the same school gets the same score on variables related to the school. This is essentially what we will do with HLM when we code our data, but we will fit a more sophisticated model that deals with the clusters appropriately, because if we simply did that kind of coding and then fit the usual linear regression, we would violate the assumptions we just mentioned. A third approach is to run a model separately for each group, so with five clusters we would fit five different regressions. This is problematic because by separating the samples we end up with several small samples and underpowered tests in each of them, and we ignore the fact that we have quite a bit of data not just in one sample but in the others, and the other samples can be used to inform what is going on in one specific sample. This is a very useful result that occurs only when we fit hierarchical linear models: rather than isolating each sample in a separate analysis, we are more likely to get significant results, that is, we have more power than with a small n, because we can borrow information from the other groups to inform our inferences about what is going on in one specific group.

I said that when we aggregate everything we run the risk of committing the ecological fallacy, and that is illustrated in this picture, where we have two groups, Group A represented by circles and Group B represented by triangles. The relationship between x and y is pretty clearly positive: as x gets larger, so does y. But say we instead take the averages in each group, highlighted here as the darkened circle and the darkened triangle; we average over all of the circles and all of the triangles and then fit our regression line to these two aggregated means. When we do so, we end up with a negative line, which is not representative of anything going on in our data; we end up with the opposite of the relationship that truly and accurately summarizes how x relates to y.
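A small simulation sketch of that reversal (not the lecture's figure; the two groups, their intercepts, and their slopes are invented for illustration): within each group the slope is positive, yet the line through the two group means is negative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical groups: within each, y rises with x (slope +1),
# but Group B sits at higher x and lower average y than Group A.
x_a = rng.uniform(0, 3, 50)
y_a = 8 + 1.0 * x_a + rng.normal(0, 0.5, 50)     # Group A
x_b = rng.uniform(5, 8, 50)
y_b = 0 + 1.0 * x_b + rng.normal(0, 0.5, 50)     # Group B

# Micro level: both within-group slopes are clearly positive.
print("slope in A:", np.polyfit(x_a, y_a, 1)[0])
print("slope in B:", np.polyfit(x_b, y_b, 1)[0])

# Macro level: aggregate each group to its mean and fit a line through the two means.
agg_slope = (y_b.mean() - y_a.mean()) / (x_b.mean() - x_a.mean())
print("slope through the group means:", agg_slope)   # negative: the ecological fallacy
```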
As another illustration of what can go wrong when we fit a single regression model to what is really clustered data, consider this scatterplot, where we have several dots and at first treat them as all coming from one group, not distinguishing the possibility that some dots come from one group and the others from another. If we do so, we fit the regression line shown here, the line that minimizes the sum of squared errors. The line passes more or less through the center of gravity of the points without actually hitting any of them, which is what we would expect when minimizing the sum of squared errors. But if these data points actually represent two different groups, then the line we estimated by pooling everything is an inferior representation of what is going on in the data. Here we have Group A with the darkened circles and Group B with the hollow circles, and there are clearly two different relationships between our independent and dependent variables: for Group A there is a rather strong positive relationship, while for Group B there is a very weak relationship, if any. Ideally we would have a model that distinguishes between these two relationships, allowing the slope to be one thing in one group and something else in the other. What HLM allows us to do is come up with precisely these kinds of lines, but in a manner that does not throw away information from the other groups, so that we get better inferences and better descriptions of the true relationship between our independent variables and our dependent variable. In the pooled case, fitting the model to all of the data simultaneously without distinguishing between groups, we ended up with a slope of 0.291; specifically, the prediction was ŷ = 2.182 + 0.291x. When we instead fit a different model to each group, we got two very different slope estimates: 0.5 for the first group, where the slope is rather steep, and 0.1 for the second, where the relationship is not nearly as strong. The pooled fit was a rather poor representation of what was going on. In general, we can do a much better job of explaining the relationship between x and y when we take into account the clustering in the data and allow for contextual effects, such that the independent variable has a different effect depending on which group we are talking about.

We can derive the hierarchical linear model from a very familiar regression-style framework. We begin at the micro level, where the outcome y for individual i in the jth cluster equals some intercept alpha plus some slope beta times our independent variable, plus whatever error is left over: y_ij = α_j + β_j·x_ij + ε_ij. We make the usual error assumptions from a regression model: the errors are independent with constant variance. But note that we have subscripted the intercept and the slope with j; this accounts for the nesting and allows a separate intercept and a separate slope for each group. And since we are saying that we can have a different α and a different β for each group, we can actually model those differences. We write out the intercept first: α_j = γ_00 + r_1j, an overall average intercept plus a disturbance.
That r_1j term represents whether being in the jth group pushes you above or below the overall average intercept; γ_00 summarizes the average intercept across all groups, and r_1j is the disturbance representing whether you are higher or lower because you are in group j. A similar interpretation applies to the slope: β_j = γ_10 + r_2j. There is an overall average slope, represented by γ_10, and again a disturbance term; r_2j tells us that being in the jth group pushes the slope up or down, that is, the slope gets bigger or smaller depending on which group you are in. We use γ_10 as the overall summary, and r_2j captures the variability between the different groups. When we look at these miniature regressions for α and for β, we can think of the disturbance terms as the kinds of error terms we would expect in a usual regression model. When we then substitute these mini equations into the original micro model, as in the final equation on this slide, we end up with a more complicated error term: we have γ_00 for the intercept and γ_10 for the slope on the independent variable, and everything else gets thrown to the end and called error, because it is the error from the different components of the model: y_ij = γ_00 + γ_10·x_ij + (r_1j + r_2j·x_ij + ε_ij).

With this composite error term we can see a few things. Most of all, the observations are no longer independent, and they are no longer homoscedastic. They are not independent because there are error terms common to all members of the same cluster: when you are in cluster one, when j equals one, you get the same value of the disturbance term no matter which person you are in that group; everyone gets exactly the same disturbance. Second, homoscedasticity is clearly violated: in the second term, r_2j is multiplied by x, so as x changes, the term r_2j·x changes with it, which means the overall error term is heteroscedastic; it depends on the value of x. Because we are violating the assumptions of the usual linear model, we require more sophisticated modeling techniques that make the adjustments we need to get correct standard errors and make correct inferences. The good news is that once you master software that can estimate hierarchical linear models, that software will make these adjustments to the standard errors you use for your inferences. You can then interpret the models much as you would a traditional regression model, and many researchers who employ these models simply stop there, interpreting the results just as they would a regression. But as we will see as we progress through these lectures, there is quite a bit of additional information we can extract with hierarchical linear modeling that is not available from other methods, in particular the traditional regression model. So we can use HLM simply as a means of getting corrected standard errors that account for the clustering, or we can really leverage the potential of these models and extract further information about the group-to-group variability.
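The lecture points to SPSS, HLM, and Mplus for estimation; as a rough sketch of what fitting this kind of random-intercept, random-slope model looks like in a different general-purpose tool, here is a hypothetical example using Python's statsmodels, with simulated data and made-up parameter values:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# Simulate hypothetical clustered data: each of 30 clusters gets its own
# intercept (gamma_00 + r_1j) and its own slope (gamma_10 + r_2j).
rows = []
for j in range(30):
    a_j = 2.0 + rng.normal(0, 1.0)      # cluster-specific intercept
    b_j = 0.5 + rng.normal(0, 0.3)      # cluster-specific slope
    x = rng.uniform(0, 10, 25)
    y = a_j + b_j * x + rng.normal(0, 1.0, 25)
    rows.append(pd.DataFrame({"y": y, "x": x, "cluster": j}))
df = pd.concat(rows, ignore_index=True)

# Random-intercept, random-slope model: the fixed effects estimate gamma_00 and gamma_10,
# and the variance components summarize the cluster-level disturbances.
model = smf.mixedlm("y ~ x", data=df, groups=df["cluster"], re_formula="~x")
result = model.fit()
print(result.summary())   # fixed effects near 2.0 and 0.5, plus intercept and slope variance components
```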
When researchers, particularly those coming from the social sciences, first approach HLM, some confusion often emerges, especially when trying to use software to carry out the estimation, and there are two primary reasons why HLM gets confusing. The first is that the terminology is inconsistent. When we talk about HLM, hierarchical linear models, or multilevel models, we are using the terminology common in disciplines like sociology and political science, and also in education research. Other disciplines, such as psychology or biostatistics, may be more familiar with the terms mixed or mixed effects models, or sometimes variance components models. In economics, particularly when economists study panel data, data collected from panels over time, there tends to be an emphasis on what are called random effects models, which ultimately are pretty much the same as a mixed or mixed effects model, which is pretty much the same as an HLM or a multilevel model. All of these terms generally represent the same thing, and different researchers may use very different language to discuss exactly the same model; in some cases software may even contain multiple commands that estimate what is essentially the same model. So we need to understand what we mean when we talk about variance components or random effects, how those feed into mixed effects models, and why we call HLM or multilevel models a type of mixed effects model, and we will discuss that in great detail as the lectures progress.

The second source of confusion is that these types of models were developed independently, again depending on discipline. In the late 1990s and early 2000s the social sciences began discovering multilevel models and talking about HLM as a "new" methodology for handling the nested kinds of data we frequently encounter in the social sciences. This view was popularized by education research, which made these models accessible to social scientists by producing textbooks such as the Raudenbush and Bryk textbook from 2002, which is now somewhat dated but still an excellent resource for diving in and getting to know HLM. These textbooks tended to emphasize HLM as a generalization of regression models, deriving HLM as we did on the previous slides, beginning with a micro-level regression model and then adding mini regressions to model the varying intercepts and varying slopes. In the behavioral sciences and in clinical research, however, these models have been around much longer. The idea of random effects was developed in the context of analysis of variance: factors, that is, treatments or some kind of categorical variable influencing the outcome, could be either fixed, meaning we are looking at all the levels that could possibly be of interest, or random, meaning the levels represent a random draw from all possible levels. For example, if we want to study the effect of a doctor implementing some kind of treatment, the doctors we happen to have are a random draw from a larger population. Random effects were used to differentiate this kind of context, the one with doctors, from the context where we have fixed effects, that is, all the levels we could possibly be considering.
This tradition of random effects ANOVA has gone on for decades and has led to mixed effects models, that is, models with both fixed and random effects, whose random effects we summarize with variance components, hence the names for these types of models that we saw on the previous slide. It turns out that the software, or the commands available in commercial software packages, generally comes out of this analysis of variance tradition, and it becomes difficult for somebody with a social science background to map the analysis of variance methodology, random effects ANOVA or linear mixed models, onto the regression-based approach they are used to seeing in social science textbooks. The hope is that this class, this series of lectures, will enable you to make sense of these differences, to connect one discipline or one approach to the other, and to bridge one type of terminology to the terminology you are more accustomed to. This, I think, is the primary confusion when learning HLM: social science HLM textbooks tend to derive multilevel models as a generalization of regression models, but nearly all software documentation discusses HLM from an ANOVA perspective. This has caused many researchers to miss the fact that software they have already licensed is fully capable of estimating these models. You do not necessarily have to shell out money for a license for, say, HLM (which is the name of a software package) or for Mplus, another package that can estimate these models. Mplus is excellent software, but it has a steep learning curve, compared with using something more familiar like SPSS, which turns out to be fully capable of estimating these types of models because SPSS has commands for mixed models. Once we understand the mapping of HLM to random effects models and mixed models, we can make full use of the software available to us.

I have said that HLM can be derived from a regression framework and that it was also developed in the context of analysis of variance. There is a third way to think of HLM, and that is as a latent variable model. A latent variable model can be thought of as something like a confirmatory factor analysis or a structural equation model, where we have variables that we do not observe directly; we have to use indirect measures and bring the power of those separate measures together to inform the unobserved characteristics we are really interested in. We will be discussing HLM as a random effects model, where the random effects are parts of the model that are not estimated directly but rather summarized via their variances and covariances. That is, the random effects in our models will not yield the kinds of regression-style estimates we are used to interpreting in a regression framework; rather, we get variance components that summarize how much variability we can expect when moving from one cluster to another. Since we do not observe these effects directly, we can think of the random effects as being like the latent variables we would encounter in a confirmatory factor model or a structural equation model. Because of this, software for estimating structural equation models can be used to estimate multilevel models; Mplus is a good example, and in fact Mplus can estimate not only SEM and not only HLM but can combine the two, so you can fit fully multilevel structural equation models.
This approach of conceiving of HLM as a latent variable model is more common in the longitudinal case. That is not to say we cannot use it in the cross-sectional case, but generally it has been used more often when dealing with multiple observations nested within a single person. In this case, whatever our outcome is, it is the outcome at time t for person i, and it depends on a person-varying intercept and a person-varying slope corresponding to time. In other words, we subscript α and β with i, just as we subscripted with j in the regression derivation, representing that each individual can have a different intercept and a different slope on time. This type of model allows us to look at different starting points in a time trajectory as well as different slopes of change over time, the speed at which scores change for one individual versus another. We can then substitute in for the varying α the overall average intercept plus its disturbance, and likewise for the slope on time, the amount of change we expect with time, which equals the overall average change over time plus a disturbance that allows some people to change more rapidly and others less rapidly. Substituting these mini regressions into the original micro model, we end up with something like what we saw on the previous slide, except that instead of x we are more explicitly defining the predictor as time.

Because this approach can be thought of as a latent variable model, we can use the type of path diagram one frequently encounters in the SEM literature. What is a little different here, when we are looking at this particular example of a linear trend in what is called a latent growth curve model, is that we do not actually estimate loadings on the factors as we would in a confirmatory factor model. Instead we constrain the loadings: the loadings on the intercept factor are all fixed to 1, and the loadings on the slope latent variable are fixed to whatever the gaps in time are, in this case 0, 1, 2, 3, 4. At the very top we have a sort of second-level constant, represented by the triangle with a 1, pointing to both the intercept and the slope, and the parameters α1 and α2 jutting out from that triangle represent the overall average intercept and the overall average slope we would expect. The ψ11 and ψ22 corresponding to each of those factors are the variance components, that is, how much person-to-person variability we expect in the intercept and how much in the slope. We can even estimate a covariance between the two, represented by ψ21, and that covariance can tell us, for example, whether people who start out with low scores tend to improve the most, that is, a low intercept but a high slope, or whether those who start out high also tend to change the most. So we can extract quite a bit of information when we have hierarchically structured data in the longitudinal context, whether we talk about it as an HLM in a regression-style framework or as a latent growth curve model drawn with a path diagram, as we would in any other latent variable modeling.
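Written out as equations, the linear latent growth curve model just described looks roughly like this (the disturbance labels u_{0i} and u_{1i} are added here for readability and are not the lecture's notation; the α1, α2 means and ψ terms are the ones named above):

```latex
% Linear latent growth curve model, occasions t = 0, 1, 2, 3, 4
\begin{aligned}
y_{ti} &= 1 \cdot \alpha_i + t \cdot \beta_i + \varepsilon_{ti}
  && \text{(intercept loadings fixed to 1; slope loadings fixed to } 0,1,2,3,4\text{)} \\
\alpha_i &= \alpha_1 + u_{0i}, && \operatorname{Var}(u_{0i}) = \psi_{11} \\
\beta_i  &= \alpha_2 + u_{1i}, && \operatorname{Var}(u_{1i}) = \psi_{22}, \quad
  \operatorname{Cov}(u_{0i}, u_{1i}) = \psi_{21}
\end{aligned}
```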
We can even generalize this to introduce a quadratic slope. This is the case where we expect the change over time not to be a straight line, but perhaps to start out slow and then speed up, or start out fast and then slow down. The quadratic term allows us to model this, and it is simply a matter of adding one more latent variable, one more factor, whose loadings are what we had for the linear slope but squared, just as we would expect when including a quadratic term in any model. We can keep generalizing this latent growth curve model much further, to include both time-varying and time-invariant predictors, and so on.

If you want to learn more about hierarchical linear models beyond what is covered in these lectures, there are several excellent textbooks available. This course draws heavily from the Raudenbush and Bryk textbook, which is worth picking up for greater detail not only on examples but also on the underlying estimation involved in fitting hierarchical linear models. Another book that is a little dated but still quite useful is the Snijders and Bosker text, Multilevel Analysis. If you are interested in latent growth curve modeling, the latent variable approach to hierarchical linear modeling, you can check out the little green Sage book by Preacher et al. Finally, the Heck and Thomas book is a little more recent, and what is useful about it is that it has examples of HLM in the latent variable framework not only for longitudinal data but also for cross-sectional data, with some Mplus syntax that will help you fit those models if you happen to have a license for Mplus.

The prerequisites for understanding these lectures are not particularly demanding. We assume you have familiarity with basic statistical inference, so that if I say something is statistically significant you know what it means; that you have had a college-level algebra course, so you know a little about algebraic manipulations, rules for exponents, logarithms, and so on; and that you have had at least one course in regression, because we will be comparing HLM to regression in all of the lectures. Familiarity with more advanced statistical methods like logistic regression and Poisson models will be helpful for the later lectures but is not strictly required; we will introduce that material as needed. Understanding matrix algebra will also be useful; at the very least you should understand what a variance-covariance or correlation matrix is, and if you do not know what those things are, you should stop right now, google them, and come back. Calculus you do not necessarily need to know. We will encounter it when talking about how estimation is done, about what it means to find maximum likelihood estimates, and it will also be relevant when we talk about quadratic growth models like the one we just saw for latent growth curve modeling, because the best way to interpret quadratic trends is to know a little differential calculus and figure out what the rate of change is at a given time. We will review some of this material here, and the bulk of any mathematical background that may be new to you will be introduced as we need it for the different topics.

Let's start out with some college algebra, or even some high school algebra, reviewing the rules for exponents.
If we raise, for example, x to the power a and then in turn raise that to the power b, we may as well write that as x raised to the power a times b. You can go down and read the different rules for exponents, but the one I would like to draw your attention to is the very last one: when x is raised to the power negative a, we may as well rewrite that as 1 over x raised to the power a. This is useful because when we turn to matrix algebra shortly, we will talk about matrix division as really being matrix multiplication, where we multiply one matrix by the reciprocal of the other, and we will use this kind of notation with the negative exponent. So keep in mind that when you see x to the negative one, you could just as well write 1 over x.

Logarithms can be an intimidating topic, but they become very useful, especially when we talk about fitting models by maximum likelihood. A logarithm is simply the reversal of an exponent; the first rule here defines it: y is the base-b logarithm of x if and only if x equals b raised to the power y. If that seems a little too abstract, what is important is recognizing that we have some very useful rules, in particular the rule that the log of a product of two numbers can be rewritten as the sum of their logs: log base b of x times y equals log base b of x plus log base b of y. The reason this is useful is that when we fit our models we will use the method of maximum likelihood, and without logarithms, maximum likelihood would require us to take the product of a large number of numbers, one for each observation we have. That product can end up being an astronomically small or large number that our computers cannot represent well. So instead we take the log of the likelihood function, turning the multiplication problem into an addition problem, and because the values that optimize the log of a function are the same values that optimize the original function, this ends up being very useful for us.
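A quick sketch of why the log matters computationally (not from the lecture; the likelihood contributions below are just made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(3)

# 10,000 hypothetical likelihood contributions, one per observation,
# each a density value somewhere below 1.
contributions = rng.uniform(0.05, 0.4, 10_000)

# Multiplying thousands of numbers less than 1 underflows to zero on a computer...
print("product of contributions:", np.prod(contributions))        # 0.0

# ...but log(a*b) = log(a) + log(b), so the log-likelihood is just a sum,
# which the computer handles easily; the parameter values that maximize the
# log-likelihood are the same ones that maximize the likelihood itself.
print("log-likelihood (sum of logs):", np.sum(np.log(contributions)))
```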
Matrix algebra is also an important concept when learning about hierarchical linear models. I have seen attempts by teachers to teach HLM without introducing any matrices, but I think that is a lost cause; at some point you just have to accept that you are dealing with matrices. You can think of a matrix as a collection of numbers, or even as a generalization of numbers to multiple dimensions: if we have the number nine, it exists in one dimension, and a matrix is a number that exists in multiple dimensions. However you think about it, we like matrices first of all because they allow us to write really large collections of numbers in a very simplified format. It is very popular, for example, to write a regression model as the matrix y (which we will actually call a vector, for reasons we will define shortly) equaling the product of the matrix X, all of our independent variables, times β, the weights or coefficients we estimate in a regression model, plus a final vector, a type of matrix, that corresponds to the error terms: y = Xβ + ε. No matter how many independent variables or observations we have, we can always summarize a regression model with this notation.

If we think of a matrix as a generalization of a number, it makes sense that we can apply the usual arithmetic operations to matrices: we can add them, subtract them, multiply them, and even do division, although division will actually be multiplication of one matrix by what we call the inverse, the matrix version of the reciprocal, of the other. When we perform these operations the matrices must have appropriate dimensions: A and B, for example, must have matching dimensions for addition, and for multiplication the number of columns in the first matrix must equal the number of rows in the second. If that is not the case, we can sometimes take the transpose of a matrix, rewriting the columns as rows and the rows as columns, which gives us a matrix shaped so that the multiplication works; that way we could compute A times B', that is, A times B transpose, if the original B matrix was not of the correct dimensions but its transpose was. This also means we can write the square of a matrix: for a number we would write, say, 2 to the 2nd power, and the equivalent for a matrix A is A times A'; C in this case equals the square of matrix A. Note from the previous page that we write matrix division as we did with the rule for exponents: to do the division we use the negative-1 exponent to indicate the inverse of a matrix. Having no inverse is the matrix equivalent of being zero: just as it is impossible to divide a number by zero, we cannot perform matrix division when a matrix has no inverse. But if there is an inverse (it is not always the case, but if there is one), then we perform matrix division by multiplying the first matrix by the inverse of the second. When you see two matrices written in this fashion, you can think of it as division performed on the matrices.

Some notes to keep in mind when dealing with matrices: we always write matrices in bold, while single numbers, which we call scalars, are written in plain face. I think this is one of the reasons matrix algebra looks a little intimidating; with everything in bold, it seems like the author of the textbook or the slides is yelling at you. What I recommend, if you start to get confused or intimidated by all the boldface, is to take out the bold and rewrite everything as though you were dealing with scalars, just regular numbers. Just remember that one matrix times the inverse of another is really division, and something like A times A' is really squaring a number. If you make these substitutions, matrix algebra becomes a little less intimidating.
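A small numpy sketch of these operations follows; the closing least-squares formula, beta-hat = (X'X)^(-1) X'y, is the standard one and is shown here only to illustrate the transpose and inverse notation, not something derived in the lecture:

```python
import numpy as np

rng = np.random.default_rng(4)

# A small design matrix X (with an intercept column) and outcome vector y.
n = 100
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
y = 3.0 + 0.7 * X[:, 1] + rng.normal(0, 1.0, n)

# Transpose and "matrix division" via the inverse: beta_hat = (X'X)^(-1) X'y.
XtX = X.T @ X                                   # X' X
beta_hat = np.linalg.inv(XtX) @ (X.T @ y)
print("estimated intercept and slope:", beta_hat)   # close to 3.0 and 0.7

# A matrix times its own transpose plays the role of "squaring" the matrix.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(A @ A.T)
```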
I used the term vector previously when talking about the regression model written in matrix form. A matrix with one row but multiple columns, or one column but multiple rows, is called a vector; a vector is simply a special type of matrix with just one row or one column. And finally, just as it is possible to perform arithmetic on matrices, it is also possible to perform calculus operations on them, should you pursue calculus far enough to be doing calculus on matrices.

Speaking of calculus, there is a little bit we will need to know, so we will introduce now the notion of nonlinear functions, which becomes relevant in the two cases mentioned previously: maximum likelihood, and interpreting quadratic growth models. When we talk about nonlinear functions, we are moving out of the world of algebra, where we deal primarily with linearities, with straight lines, into the world of lines that have some curve to them. Here we have a linear relationship represented by the dashed line and a nonlinear relationship represented by the curved line. High school algebra gives us all the tools we need to understand the linear relationship, but if we want to make full use of the information in the curved line, we have to turn to calculus. In calculus we recognize that with a curved line, the rate of change, what we would think of as the slope in the linear context, depends on where we are on that particular line. The rate of change is illustrated here at x equal to 1.5: we can think of it as the slope of the line that is tangent to, that touches, the curved line at x = 1.5. That slope will be different if we are higher up on the graph, so the rate of change changes as we move across values of x. We can also have a nonlinear function that is not just curved but moves up and then back down; envision a hill, or, in the opposite case, a U-shaped relationship. When we have a hill there is a definite top, and at the top of the hill we are at the maximum of that particular function; likewise, with a U-shaped line we can find the minimum, the lowest point of the function. We denote the rate of change, the slope of the function at a given point on x, as the derivative, and we write it with the letter d in what looks like a fraction: d f(x) / d x. That is how we represent the derivative, the rate of change, the slope of a nonlinear function when we have just one variable, and when that derivative equals zero, we are at the maximum, the top of the hill, or the minimum, the bottom of that U-shaped relationship. You may also see derivatives written with a stylized Greek delta, ∂, which is how we refer to partial derivatives; those are derivatives defined just as usual, but focusing on one variable in the relationship even though other variables may be relevant as well.

It helps to illustrate these concepts, so here we have a relationship that is hill-shaped, and at the very top of the hill we have the maximum.
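As a tiny worked example (an invented hill-shaped curve, not the function drawn on the slide):

```latex
% A hill-shaped function, its derivative, and the maximum where the derivative is zero
f(x) = -(x - 2)^2 + 5, \qquad
\frac{d\,f(x)}{d\,x} = -2(x - 2), \qquad
\frac{d\,f(x)}{d\,x} = 0 \iff x = 2 \quad \text{(the top of the hill, where } f(2) = 5\text{)}
```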
Really, when we refer to maximum likelihood estimation, which we will in a few lectures when we talk about how these models are actually fit, where we get the numbers that are reported in the output of our software and placed into the tables of our journal articles, the maximum likelihood estimates are the estimates that maximize the likelihood of the observed data. We can think of this as trying out different parameter values, in this case values below zero and values above zero, and it turns out that the best fit, the most likely parameter estimate, is the one sitting right at zero, because the point at zero is where we reach the top of the hill. We know we are at the top of the hill, by the way, because the derivative, the slope of the line tangent to that point, is itself equal to zero: zero slope, no change, a flat line. So we know we are at the maximum. What makes maximum likelihood estimation a little more intimidating is that we are never estimating just one parameter; in the case of HLM we will estimate parameters for the fixed effects and also variance components to describe our random effects. So we have multiple estimates to come up with, but really this is an optimization problem in multiple dimensions, and we ultimately want to get to the top of the hill to find the optimal values of the coefficients we seek to estimate. This is also how we solve for the equation of a traditional regression, although there we are finding a minimum rather than a maximum: you can do regression from a maximum likelihood perspective, but more commonly we introduce it as the result of minimizing the sum of squares. In other words, we have a function whose value is the sum of squares, and we want to find the coefficient estimates that minimize it, the lowest point of that particular function; when we are at that point, we have recovered the regression estimates, the OLS or least squares estimates. It is also how we perform maximum likelihood, of course, as we saw on the previous slides, except that instead of finding a minimum we are finding a maximum. It is just recasting the problem a little differently with a different function, but either way we are optimizing: finding the top of the hill, just as in regression we find the bottom of the hill.
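Here is a minimal sketch of that hill-climbing idea (not from the lecture): it recovers the mean of a simulated normal sample by scanning candidate values and keeping the one with the largest log-likelihood, which agrees with the sample mean up to the grid resolution.

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(loc=10.0, scale=2.0, size=500)     # hypothetical sample; sd treated as known

# Try many candidate values of the mean and compute the log-likelihood of the
# observed data under each one (normal density with sd fixed at 2 for simplicity).
candidates = np.linspace(5, 15, 2001)
log_lik = np.array([
    np.sum(-0.5 * np.log(2 * np.pi * 2.0**2) - (data - mu) ** 2 / (2 * 2.0**2))
    for mu in candidates
])

# The maximum likelihood estimate is the candidate at the top of the hill,
# i.e. where the log-likelihood peaks (and its derivative would equal zero).
mle = candidates[np.argmax(log_lik)]
print("ML estimate of the mean:", round(mle, 3))
print("sample mean            :", round(data.mean(), 3))
```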
Now, this may all seem a little intimidating if you have not encountered it before, but you do not need to know how to perform these operations or do the optimization yourself; the computer will do that work for you. What you should know is that matrix algebra notation is a means of summarizing collections of numbers, and that we can perform the same kinds of operations on matrices as we do on ordinary numbers, so we should not be intimidated by the bold print representing matrices. It is also the case that we will encounter differential calculus, but we just need to keep in mind that differential calculus, finding derivatives, is all about finding the slope of a nonlinear function, and it is particularly useful for optimization because we know we have found the maximum or the minimum, depending on what we are interested in, when we have arrived at the point on that curved line where the slope equals zero. So long as you have an intuition for these operations, an intuition for what a matrix is and for what differential calculus does, the more complicated parts of the subsequent lectures should be quite a bit easier to follow.
Info
Channel: Methods Consultants of Ann Arbor
Views: 54,859
Rating: 4.9616613 out of 5
Keywords: HLM, multilevel models, mixed models, random effects, variance components
Id: 2w7Q4Wjn1uM
Length: 42min 38sec (2558 seconds)
Published: Tue May 05 2015