025. Handling Missing Data in Longitudinal Models

Captions
Hello everyone, and welcome to another STAT 437 lecture video. At this point we are actually done introducing the different types of models for longitudinal data: we have gone through marginal models for linear outcomes, marginal models for other types of outcomes, linear mixed effects models, and most recently transition models for categorical or binary data, where we computed the transition probabilities using standard logistic regression. So we have seen several different ways of fitting longitudinal data, we have interpreted those models, tested hypotheses with them, and talked about their strengths and weaknesses. In the lecture after today's topic we will review everything we covered for longitudinal data before we start talking about survival analysis. Survival analysis is not really distinct from longitudinal data; it is a specific type of it, and we will see more modelling there. But in terms of following one response over time and fitting models to it, you already have quite a wide range of tools.

One thing we have not dealt with at all is what these data look like in the real world. One of the most common issues you will run into when you take the methods from this course and apply them to real data sets is that data are often messy, and probably the most common way they are messy is through missingness: responses that you should have collected for an individual are simply not in the data set. How do you deal with that? That is the focus of today's lecture and the next one, where we look at missing data for longitudinal models both in concept and in practice. I am going to skip a full theoretical deep dive of this material; I will post some supplementary notes that explore it in a little more depth. The only reason I am not doing the deep dive is that you could realistically fill an entire course with the methods we will talk about here, and I think it is better to introduce them so that you have them in your toolbox, and then show you how to actually use them in R, than to attempt a full theoretical treatment. If you are interested, there are plenty of good books on missing data, or you can ask me and I will point you toward material you might like. For now, the goal is to understand conceptually what we are doing with missingness, and then in the next few lectures to see practically how to go about it.

With that, let's open up the slides and talk about missingness for longitudinal data. In the most recent lecture we analyzed data from the financial crisis study. The data had an ID, an age, a number of other baseline factors, and then the outcomes y2, y3, y4, and so on. Those were the data we analyzed, and the IDs started at 10, then jumped to 17, and then to 29.
If you were paying close attention, that may have seemed strange, and the reason is that we actually collected a whole lot more data than that. The real data had additional individuals being measured: ID 2, for instance, was excluded from our data set because ID 2 did not have measurements at times 6 and 8, so there are NA values there. ID 11 was missing almost everything except at times 6 and 7. If you look at any of the real data sets from the course, downloaded from their true source, you will often see these patterns emerge: lots of missing data in longitudinal studies, partly because it is hard to follow people for long periods of time. Missingness is a problem across statistics, but it is a particularly large problem in longitudinal studies. So it is worth asking: when we analyzed the data and ignored this, keeping only the people who responded completely, compared to the true data that were collected, what did we lose? What were we giving up?

Missing data refer to any observations which we intended to collect but which were not recorded in the data file, for absolutely any reason. Whether we did not record them because we never actually observed them, because the paper the response was written down on got destroyed, or because the participant decided they did not like our study and left, any reason they are not recorded counts as missing data. It is a pervasive problem across all of statistics; any data set you download is likely to have missingness in it. But it is particularly common in longitudinal studies, which makes sense: if you start following someone when they are 10 years old and try to keep them in the study location for ten years, people are going to move away, die of other causes, or lose interest in helping you out.

When we talk about missingness, we generally want to classify how the missingness came to be, and then we can explore different methods depending on that classification; some types of missingness are a bigger problem than others. Intuitively, if some of the papers you record responses on get randomly destroyed, that has nothing to do with the underlying process: it is a shame to lose the data, but the loss is unrelated to the process itself. Whereas if people who have really bad outcomes, say severe side effects, become more likely to drop out of your study, then the results become biased, because the people left in the study are those who happen to be responding well to the treatment. If you only analyze the people with complete responses, everyone who would have had a negative response has dropped out, and you are left with an overly positive sample.
So we have to be really careful about where the missingness is coming from, and to do that we classify the different missing data mechanisms. To do this we define some notation. Take R_ij to be an observation indicator: if Y_ij is observed we set R_ij = 1, and otherwise we set it to 0. We assume R_ij is available for everyone at all time points; it simply records whether we observed that value or not. Thinking of R_ij as a random variable, the way the missingness is introduced is described by the distribution of these R_ij's. We will also assume, for now, that we are only concerned with missingness in the outcome Y_ij. In theory you might have missing covariates as well, and those can be a problem; some of the methods we look at can handle them, others cannot, but for now we focus on missingness in the outcome. By way of notation, we partition the outcome Y_i into Y_i^O and Y_i^M, where Y_i^O is the observed component and Y_i^M is the missing component. We think of both as random variables, but it is up to us to decide whether their distributions are the same. Y_i has some distribution that we are assuming, but the observed values and the missing values may follow different distributions: in the situation where individuals who are responding poorly go missing, Y_i^M will have a worse distribution, in terms of the outcome, than Y_i^O. So that is the notation.
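To make the notation concrete, here is a minimal sketch in R; the data frame dat and its column names are hypothetical, invented purely for illustration, and the 0/1 matrix R plays the role of the R_ij indicators.

```r
# Hypothetical wide-format data: one row per individual, outcomes y1-y4,
# with NA marking measurements we intended to take but did not record.
dat <- data.frame(
  id  = 1:6,
  age = c(34, 41, 29, 55, 38, 47),
  y1  = c(5.1, 4.8, 6.0, 5.5, 4.9, 5.7),
  y2  = c(5.3, NA,  6.2, 5.4, NA,  5.9),
  y3  = c(5.0, NA,  6.1, NA,  NA,  6.0),
  y4  = c(5.2, NA,  NA,  NA,  NA,  6.1)
)

# R_ij: 1 if Y_ij was observed, 0 otherwise
R <- 1 * !is.na(dat[, c("y1", "y2", "y3", "y4")])

colMeans(R)                        # proportion observed at each occasion
y2_obs  <- dat$y2[R[, "y2"] == 1]  # the Y_i^O component at time 2
y2_miss <- dat$y2[R[, "y2"] == 0]  # the Y_i^M component at time 2 (all NA here)
```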
In terms of the actual missing data mechanisms, we tend to think about a hierarchy. The best case for us is when we can say the data are missing completely at random, abbreviated MCAR. We define MCAR through the probability of the missingness indicators: if, given the observed outcomes, the missing outcomes, and any additional covariates, that distribution depends only on the additional covariates, so it does not depend on the outcome at all, then we say the data are missing completely at random. It is "completely at random" because, as far as the outcome is concerned, the missingness is completely random. For instance, suppose a hard drive failure is the cause of certain missing data: a group of researchers are collecting samples on patients and one of them drops their laptop in a lake. The data that go missing are completely random with respect to the outcome; those individuals are representative of all other individuals, and the missingness is not informative at all. That is the ideal case, and we will see that if data are MCAR we can basically just ignore the missingness.

We say data are missing at random, abbreviated MAR, if instead of being totally random, the missingness depends only on things we observe: the conditional distribution given the observed outcomes, the missing outcomes, and the additional covariates depends only on the observed outcomes and the covariates. In other words, the missing values themselves are not informative about the missingness mechanism. For instance, suppose you are trialling a new experimental drug to see whether it helps the outcome, and anyone whose observed outcome is too high over consecutive periods is removed from the study on ethical grounds, so they can be assigned a treatment that is known to work. With blood pressure medication, say, good alternatives exist that keep blood pressure regulated, so if a patient on the new medication shows blood pressure that is too high at consecutive visits, they are removed from the study. In that case we can predict whether someone is missing purely on the basis of the values we observed, say at times 1 and 2. So missing completely at random does not depend on the outcomes at all, and missing at random depends only on the outcomes we have observed.

There is one final, obvious way to extend this, and that is to say the data are not missing at random, abbreviated NMAR; you will also sometimes see MNAR, for "missing not at random," and the two are used interchangeably. If, say, individuals who smoke heavily are less likely to keep responding to your smoking questionnaire, that is an example of data that are not missing at random, because the missing values themselves, how much someone is smoking, are predictive of whether they actually responded. You can see this as a hierarchy: data which are missing completely at random also satisfy the properties for missing at random, and data which are missing at random also satisfy the (general) conditions for not missing at random, which is simply the unrestricted distribution we are working from. Intuitively, we prefer data to be missing completely at random if they are going to be missing at all; failing that, we hope they are missing at random; and otherwise we are stuck dealing with data that are not missing at random. We will think of those as three layers, and for the most part we will only deal with completely-at-random or at-random data, not because those are the most common, but because those are the ones we can address best; it is out of necessity.
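As a concrete illustration of the distinction, here is a small simulated sketch (entirely made-up numbers) in which being observed at time 2 depends only on the observed time-1 outcome, so the data are MAR but not MCAR; notice how the mean among the observed values is pulled away from the truth.

```r
set.seed(437)
n  <- 500
y1 <- rnorm(n)                          # observed for everyone
y2 <- 0.8 * y1 + rnorm(n, sd = 0.6)     # the value we *would* have measured

# MAR: the chance of observing y2 depends only on the observed y1,
# not on the unseen y2 itself.
p_obs  <- plogis(1.5 - 1.0 * y1)
r2     <- rbinom(n, 1, p_obs)
y2_obs <- ifelse(r2 == 1, y2, NA)

mean(y2)                    # mean in the full (partly unobserved) sample
mean(y2_obs, na.rm = TRUE)  # mean among observed values only: biased, because
                            # high-y1 (and hence high-y2) people go missing more
```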
In addition to the mechanism, we can also think about the pattern of missingness. Picture three different data sets, where red marks indicate observations we did not make, that is, missing values. There are a few different patterns going on, and the pattern of missingness also turns out to be important.

In the first example we have what is called a dropout pattern: someone starts in the study, we make some observations for them, and then eventually they drop out, after which we make no more observations for them. The second person is observed up to time 4 and then drops out; the next two are observed the whole time; another person is observed through the end of time 3 and is then gone forever. A dropout pattern is quite a nice pattern to deal with, because once someone is gone we know they are gone forever, and we only need to deal with their first dropout time.

The second pattern is not dropout: one person is missing at time 4 and then comes back, with a value observed at time 5. But we can call this pattern monotone, because if we rearrange the columns a little we recover something that looks like dropout. If we move column 4 to where column 5 is, column 3 to where column 4 was, and column 5 to where column 3 was, so the columns are ordered 1, 2, 5, 3, 4, then we get a dropout pattern: as soon as someone is unobserved they are never observed again, and everyone else is fully observed. We call that a monotone missing pattern, because we can order the columns so that the missingness behaves monotonically. Dropout patterns are of course monotone, because that is essentially how they are defined.

The last pattern is non-monotone. It looks quite similar to the second case, but if you try to rearrange it, and I encourage you to try this yourself, you will always find some individual who drops out for a period, is re-observed, and then drops out again. Because of that, this is not a monotone missing pattern. It will turn out that monotone patterns are a lot easier to deal with than non-monotone patterns, because of a sequential modelling procedure we will see later; when we have monotone missingness we are happy, because it is easier to handle. So we can think about either the mechanisms or the patterns of the missingness.
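One rough way to check the monotone idea in code (a homemade sketch, not a standard function): reorder the outcome columns from most to least observed and see whether every row then looks like a dropout row, a run of 1s followed only by 0s.

```r
# 0/1 observation-indicator matrix: rows = people, columns = time points
R <- rbind(c(1, 1, 1, 1),
           c(1, 1, 0, 0),
           c(1, 1, 1, 0),
           c(1, 0, 1, 1))   # this last row makes the pattern non-monotone

is_monotone <- function(R) {
  ord  <- order(colSums(R), decreasing = TRUE)       # most-observed columns first
  Rord <- R[, ord, drop = FALSE]
  all(apply(Rord, 1, function(r) all(diff(r) <= 0))) # 1s then 0s in every row?
}

is_monotone(R)          # FALSE: someone leaves, comes back, then leaves again
is_monotone(R[1:3, ])   # TRUE: a pure dropout pattern
```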
We have talked a lot about classifying missingness, so it is worth taking a moment to ask why we even care. What are the impacts of missingness? Intuitively you get the sense that missing data are a problem, but in what way? What happens if we simply ignore the fact that we have missing data?

In the previous analysis of the financial crisis data we did a complete case analysis. A complete case analysis ignores the missingness and keeps only the people for whom we have all of the observations. It turns out that a complete case analysis is valid if the data are missing completely at random: if the missingness is not at all informative about the outcomes, you can just analyze the complete cases. However, you might not want to, for two reasons. First, it is unnecessarily inefficient. When you keep only the complete responders, you throw away a lot of partial responders: if 50% of your individuals have at least some missingness, you might be throwing away half of your sample even if each of them is only missing a little. That means more variability in your results than necessary; you are artificially limiting your sample size. Second, we cannot actually test whether data are missing completely at random. There might be a good reason to suspect it, again thinking of the hard drive failure scenario, but in the absence of that, MCAR is a very strong assumption, and if it is wrong, the results of a complete case analysis are biased and entirely unreliable, and we cannot say much about how.

So we might turn to an available data analysis instead. The idea is that instead of keeping only the people who responded completely, we use the full set of responses that anyone provided: if someone has 70% of their responses recorded, we use those 70%. This is more efficient than a complete case analysis, and if the data are missing completely at random it is still valid. The downside, of course, is that it makes the data inherently unbalanced. If the design calls for 10 measurements per person, balanced data means every individual has exactly those 10 measurements taken at the same occasions, so the first measurement of every person is comparable, and so on. If some people only have 6 of those 10, the balanced-data assumption no longer holds, and any technique that requires balanced data, including correlation structures that require it, will not work. So you are limiting the types of analyses you can run with an available data analysis, and you are still relying on the missing completely at random assumption.
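As a quick sketch of the difference between the two analyses, reusing the hypothetical dat from the earlier sketch: a complete case analysis keeps only fully observed individuals in the wide data, while an available data analysis works in long format and drops only the individual missing measurements.

```r
# Complete case: only individuals with every outcome recorded survive
dat_cc <- dat[complete.cases(dat[, c("y1", "y2", "y3", "y4")]), ]

# Available data: reshape to long format and keep every observed (i, j) pair
long   <- reshape(dat, direction = "long",
                  varying = c("y1", "y2", "y3", "y4"), v.names = "y",
                  timevar = "time", times = 1:4, idvar = "id")
dat_av <- long[!is.na(long$y), ]

nrow(dat_cc)   # individuals kept by the complete case analysis
nrow(dat_av)   # person-time observations kept by the available data analysis
```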
If the data are missing at random or not missing at random, the available data analysis and the complete case analysis will not, in general, work. If, however, you are using a likelihood-based technique, so the transition models or the linear mixed effects models fit by likelihood, you will be valid whether the data are missing at random or missing completely at random, so long as the models are correctly specified. As long as you have correctly specified the likelihood, you are valid under those first two mechanisms, which is a genuinely useful thing to know, and the assumption of correct specification is one we are already making to use those models in the first place. So likelihood-based procedures have some built-in resistance to missingness; something like generalized estimating equations does not have that resilience.

In all other situations, for the purposes of our course, the results will be biased and unreliable. If you have any missingness that is not missing at random, you already have an unreliable analysis; if the data are not missing completely at random and you are not using likelihood techniques and have not dealt with the missingness in another way, you get biased and invalid inference. And the direction of the bias and the way the inference breaks are not necessarily predictable, because they depend on how the data are missing, on the actual models at play, and everything else. You cannot even say something like "we will just be overly conservative"; the missingness can totally invalidate the analysis. It is like using the wrong tool for the job, and it becomes a really big problem very quickly.

So what are some general techniques we can use to handle the missingness? We will talk about four families of techniques, and we have already briefly touched on two of them. The complete case analysis keeps just the complete responders in the data frame: you drop everyone who did not respond fully, and, as we said, if the data are MCAR this is valid. In general I would only suggest a complete case analysis when you are quickly checking that your model-fitting code works, not when you need any actual results. Slightly better is the available data analysis, which uses all of the observations; GEEs, for instance, are essentially built for an available data analysis and handle unbalanced data easily. If you are willing to assume missing completely at random, my recommendation is the available data analysis: under MCAR it is valid and you are not being inefficient, but you need a really strong justification for why the data might be missing completely at random. In all other situations you will want one of the remaining two families. Weighting techniques generate pseudo data sets in which each individual we did observe is weighted to account for people who are like that person but whom we did not observe; we will discuss these in more depth shortly. The last class of techniques we will cover in detail is imputation techniques, which fill in values for the missing data: at a basic level we predict the missing values, fill them in, pretend we actually observed them, and, provided we adjust the standard errors appropriately, everything works out nicely. Those are the four overarching classes. The first two, complete case and available data, should really only be relied on when you just need a quick analysis; the weighting techniques and imputation techniques are both quite powerful, and we will explore them now.
As for handling data that are not missing at random: we will not talk about that at all for the rest of this class. The general reason is that you need to jointly model the distribution of Y_i and R_i, that is, model the missingness alongside the observations, and if you are not modelling Y_i and R_i jointly you cannot handle not-missing-at-random data. As a general rule those techniques are beyond the scope of this course, but again, there is a full literature on missing data if this interests you. These problems are genuinely important to think about, because not-missing-at-random data occur all the time; we are artificially limiting ourselves because we need to start with what is easy and what is possible. Just be aware that what we are discussing will not help with data that are not missing at random, and that is an important case to consider.

On to weighting techniques. We talked briefly about dropout when introducing the different patterns: dropout occurs when someone leaves the study and never returns. We can define a dropout indicator D_i that equals the first period at which the individual is no longer observed, by taking one plus the sum of their observation indicators, D_i = 1 + sum_j R_ij. If someone drops out at time 3, so that R_i1 = R_i2 = 1 and the indicators are zero afterward, then D_i = 3, the first point at which they were not observed. If someone is observed for the full study, D_i = K + 1, which we read as "they did not drop out until after the study was already over." With these D_i terms we can start talking about dropout, and we will treat the weighting techniques as specific to dropout; technically they can be used without dropout, but they are easiest to understand in this setting.

We can conceptualize weighting techniques in terms of the probability of inclusion in the sample. We define probabilities pi_ij, where pi_ij is the conditional probability that the dropout indicator is after time j, given that we know it is at least time j: pi_ij = P(D_i > j | D_i >= j). We index by i to indicate that these can differ across individuals. In essence, pi_ij is the probability that individual i lasts beyond time j, given that they were still available in the study at that point. If you wanted the probability that they last beyond time K, you are asking for the probability that they were fully observed in the study; the probability that they are observed past time 2, given that they made it at least to time 2, is essentially the probability that they did not drop out at time 2. So pi_ij gives the probability that individual i was still under observation at time j, given that they made it through time j - 1.
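Under a dropout pattern the indicator D_i is trivial to compute from the observation indicators; a minimal sketch with a toy 0/1 matrix:

```r
# Rows follow a dropout pattern: once a 0 appears, only 0s follow. K = 4 here.
R <- rbind(c(1, 1, 1, 1),   # observed the whole study: D_i = K + 1 = 5
           c(1, 1, 1, 0),   # drops out at time 4
           c(1, 1, 0, 0),   # drops out at time 3
           c(1, 0, 0, 0))   # drops out at time 2

D <- 1 + rowSums(R)         # D_i = 1 + sum_j R_ij, exactly as defined above
D
```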
Any probability like this should make you think of logistic regression: we could fit a model to estimate these probabilities from our data without any problem (a small sketch of this appears after the example below). The interpretation is that a low pi_ij means you were unlikely to be observed in our sample, and a high pi_ij means you were quite likely to be observed.

Think about a general population and our observed data. On the left-hand side, picture one hundred dots representing the true population, made up of two different types of individuals. On the right-hand side we have the observed data: the lighter circles are the people we did not observe, and the darker ones are the values we actually did observe. Working it out, there are 100 people in this population, 40 of them blue and 60 magenta. Of the blue people we observed 20 out of 40, so they had a 50% chance of being observed. Of the magenta individuals we observed 3 out of 60, so they had a 5% chance of being observed. On a bigger scale, our true population and observed data will look something like this: classes of people who are similar to one another, some of whom we observe and some of whom we do not.

We can then take our observed data and create a pseudo-population from it. In the observed data we saw 20 blue individuals, but we know there should have been 40, so what if we let each observed blue individual count for two people? Then each of the 20 counts for two, giving us our 40. Similarly, we observed only 3 magenta individuals and we know there were supposed to be 60, so we let each of them count for 20 people, and 20 times 3 gives us 60. Another way of getting there is to take the inverse probability: we estimated pi for the blue group to be one half, and 1 / (1/2) = 2, so if each observed blue individual counts for two, we have re-weighted the blues up to their representation in the true population. The same goes for the magenta group: 1 / 0.05 = 20, so if each observed magenta individual counts for 20 people, we have re-weighted them up to the 60% of the total population they actually make up.
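Here is the promised sketch of estimating the pi_ij's with logistic regression. Everything below is simulated and the variable names are my own, not from the lecture: we build a "risk set" data frame with one row per person per occasion they reached, where stay = 1 if they remained under observation at that occasion, and fit a pooled logistic regression on a time effect, a baseline covariate, and the previously observed outcome.

```r
set.seed(437)
n <- 300; K <- 4
age <- rnorm(n, 40, 10)
y   <- matrix(NA, n, K)
obs <- matrix(1, n, K)                   # R_ij: everyone observed at time 1
y[, 1] <- 0.05 * age + rnorm(n)
for (j in 2:K) {
  y[, j] <- 0.05 * age + 0.6 * y[, j - 1] + rnorm(n)
  # dropout hazard depends on the last *observed* outcome, so this is MAR
  stay     <- rbinom(n, 1, plogis(2 - 0.7 * y[, j - 1]))
  obs[, j] <- obs[, j - 1] * stay        # once gone, gone for good
}

# One row per (person, occasion j) among people still observed at time j - 1
risk_data <- do.call(rbind, lapply(2:K, function(j) {
  at_risk <- obs[, j - 1] == 1
  data.frame(id = which(at_risk), time = j, age = age[at_risk],
             y_lag = y[at_risk, j - 1], y = y[at_risk, j],
             stay = obs[at_risk, j])
}))
# (the y column is only truly observed on rows with stay == 1)

# Pooled logistic model for pi_ij = P(still observed at j | observed at j - 1)
drop_fit <- glm(stay ~ factor(time) + age + y_lag,
                family = binomial, data = risk_data)
risk_data$pi_hat <- predict(drop_fit, type = "response")
```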
In general, we can apply this to longitudinal data by taking pi_i to be the probability that an individual did not drop out of the study at all: the probability that the dropout indicator is larger than K, which is just the product of all of the pi_ij's. If we take one over that probability, we get a weight, an inverse probability weight. It says that, for people like individual i in our sample, there was a pi_i probability of actually being observed at the end of our study, of never having dropped out. If that probability is 5%, we are saying that for every 20 people like this person in the original study, we only observed one of them. Similarly, if someone had a 90% chance of being observed, then for every 10 such people in the original population, 9 out of 10 are still observed at the end. When pi_i is low the weight is high: that person has to count for more people, because people like them were rare to actually see, which means there are many people in the base population whom we did not observe at the end of the study but should have.

If we re-weight in this way and then simply perform a complete case analysis, then in the re-weighted data set the missingness is properly accounted for whether the data are missing completely at random or missing at random, and we get reliable results, assuming the probabilities pi_i are estimated correctly. The rationale is that we have taken the observed data set and, by re-weighting, made it look like our true population again; if we have re-weighted correctly, so that each person counts for the right number of individuals, then in a large enough data set we get consistent and unbiased results.

That was with a complete case analysis, but we can do this slightly more efficiently. Instead of giving each person a single weight, we can give each observation they actually contributed its own weight. Rather than using the probability that D_i is greater than K, we ask, at time j, what is the probability that D_i is greater than j, that dropout happened after time j, which is just the product of the probabilities up to and including time j, in the same way as at time K on the previous slide. If we take the inverse of that, we get a weight for their observation at time j, and weighting each observation this way lets us do an available data analysis while keeping more of each person's responses around. Keeping more individuals and observations in the sample provides a more efficient way of correcting the estimators.
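Continuing the sketch above, the occasion-specific weights are one over the cumulative product of the fitted probabilities up to and including each occasion (time 1 is always observed here, so its weight would simply be 1):

```r
# w_ij = 1 / (pi_i2 * pi_i3 * ... * pi_ij), via a within-person cumulative product
risk_data   <- risk_data[order(risk_data$id, risk_data$time), ]
risk_data$w <- 1 / ave(risk_data$pi_hat, risk_data$id, FUN = cumprod)
```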
Concretely, we can do this with generalized estimating equations. Define a weight matrix W_i as a diagonal matrix of dimension K_i, the full number of intended measurement occasions, whose diagonal elements are the observation indicator times the weight at time j, R_ij * w_ij. That means that if the individual was observed at time j, the diagonal element is their weight w_ij, and if they were not observed it is zero. We then plug this weight matrix into the GEE estimating equations: where we would normally have D_i' V_i^{-1} multiplying the residual term (Y_i - mu_i), we instead use D_i' V_i^{-1} W_i (Y_i - mu_i). Doing so performs that weighted available data analysis, and we call this inverse probability weighted generalized estimating equations, IPW-GEE. We need either X_i to be fully observed or V_i to be diagonal for this to work, but that is not a big assumption, because we never needed V_i to be correctly specified in the first place: if X_i has any missingness, we can simply take V_i to be diagonal. As long as the weight matrix W_i is correctly specified and the mean structure is correctly specified, this works with missing at random or missing completely at random data.

So that is the general idea of weighting techniques: construct pseudo-populations by re-weighting individuals according to how likely they were to remain under observation, and correspondingly how many individuals in the full population each observed individual should represent.
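Putting the pieces together, here is a rough sketch of such a weighted GEE fit using the geepack package; this is my choice of tooling, not something from the lecture, and the mean model is purely illustrative. The independence working correlation plays the role of a diagonal V_i, and in a real analysis the fully observed time-1 rows would be included with weight 1.

```r
library(geepack)

obs_rows <- subset(risk_data, stay == 1)   # keep only observed person-occasions
ipw_fit  <- geeglm(y ~ time + age, id = id, data = obs_rows,
                   family = gaussian, corstr = "independence", weights = w)
summary(ipw_fit)

# Caveat: these standard errors treat the weights as known; properly accounting
# for the fact that the pi_ij's were themselves estimated takes additional work.
```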
Imputation techniques are perhaps a little more intuitive if you are not used to thinking in terms of weighting, but they are quite computationally intensive. The idea with imputation is that we want to estimate the missing values, the Y_i^M, based on the observed values Y_i^O and the covariates X_i. Generally speaking, we then compute the parameters of interest as though the imputed values were the truth. We will typically use what is called multiple imputation, which involves estimating the missing values and computing the parameters of interest many times over, and then averaging those results to get our final estimates. So we predict the missing values, use the predictions to compute parameters, repeat that a number of times, and average; as long as we fill the values in sensibly, this works out.

We need to choose between single and multiple imputation, and we need to choose how to actually impute. I am not going to spend any time on single versus multiple imputation: you should essentially never use single imputation. It is not effective, it underestimates your standard errors, and it is not a technique that makes much sense; the only time you might use it is to get a rough sense of whether your procedure gives reasonable results. In general, always use multiple imputation. The idea is to perform the imputation process m different times, and each time estimate our beta parameter, giving beta hat_k for each of k = 1, ..., m, and then take beta hat to be the simple average of those.

The nice part is that we can also get an estimated variance of beta hat. This comes from a somewhat ugly-looking expression, but it breaks down into two components. The first component is the average of the estimated covariances from each of the imputed data sets: each beta hat_k comes with a way of estimating its variance, and we average those. The second term is based on the spread of the individual estimates around the overall average, so you can think of it as adding on a between-imputation covariance. If you look at the coefficient in front of that second term, it goes to zero fairly quickly: it has m in the numerator and roughly m squared in the denominator, and if the imputation process has little variation the term itself is also small, but we need it there to account for the correct covariance of our estimators.
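For the pooling step, here is a small sketch of the standard combining rules that this expression corresponds to, written for a single coefficient; betas and vars are assumed to hold the m point estimates and their estimated variances from the m imputed data sets.

```r
# Rubin-style pooling of one coefficient across m imputed data sets
pool_scalar <- function(betas, vars) {
  m     <- length(betas)
  b_bar <- mean(betas)                        # overall point estimate
  w_bar <- mean(vars)                         # average within-imputation variance
  b_btw <- sum((betas - b_bar)^2) / (m - 1)   # between-imputation variance
  total <- w_bar + (1 + 1 / m) * b_btw        # total variance of the pooled estimate
  c(estimate = b_bar, variance = total)
}

pool_scalar(betas = c(1.9, 2.1, 2.0, 2.2, 1.8), vars = rep(0.04, 5))  # toy numbers
```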
The remaining decision is how to actually perform the imputation: how do we predict the missing values? In general we will fit regression models to do this. For example, at time 2 we can model the expected value of Y_i2 given the first observed value and the covariates with a GLM: some link function g applied to that expectation equals z_i1' gamma_2, where z_i1 collects the first observation and the covariates we actually have. Fitting that GLM gives predictions, and we can write Y_i2 hat for the predicted value at the second occasion based on the observed data, and use it to fill in values for everyone we did not observe at time 2.

To ensure we are not making deterministic predictions, because if you fit a GLM or a linear regression and always predict the mean you get the same prediction every time, which would defeat the purpose of doing the imputation multiple times, we do not plug in Y_i2 hat directly. Instead we sample from the corresponding distribution. With a linear regression we think of the outcome as normally distributed, so we draw from a normal distribution with mean Y_i2 hat; if Y_i is a binary response and we used logistic regression, Y_i2 hat is the probability that it equals one, so we generate a binary response with that success probability. That lets us generate outcomes to fill in for anyone with a missing value. We then repeat the process at time 3, conditioning on Y_i1, Y_i2, and X_i, using the filled-in Y_i2 hat values for the individuals who had them missing, and continue for all of the occasions up to K. That gives us our estimate for one imputed data set, say beta hat_1, and then we repeat the whole thing m times for the multiple imputation.

The problem with the procedure just outlined, where we fit a regression model, sample from the distribution, and repeat, is that it underestimates the amount of variability. Thinking back to your regression course, when you plug in a predicted value you end up with more variance than if the value were fixed, and here we are essentially treating the Y_ij hats as fixed after estimating them. So instead, the proper procedure draws the regression coefficients themselves from their posterior distributions. If you are familiar with Bayesian inference, that sentence will make sense; if not, that is okay: you can think of it as saying that instead of treating the regression parameters as fixed, we give them a distribution, use the regression fit to estimate its mean and variance, and then sample the coefficients from the relevant distribution. In practice this requires some Bayesian understanding, but the idea is that by sampling the coefficients and then sampling the outcomes from the resulting distribution, we correctly account for all of the variability that is there, rather than using the predictions directly.

There is another procedure called predictive mean matching, and the idea is quite similar. You fit the regression equations just as before, sample the random coefficients, and use them to generate predicted values Y_ij hat for everyone, whether their value was observed or not. Then, for every individual with a missing value, you take the kappa nearest individuals who had observed values: you look only at the observed people and ask which of them have predictions closest to this missing person's prediction. If an individual's missing value is predicted to be, say, 0.65, you find the observed individuals whose predictions are closest to 0.65; kappa is just a number you specify, it could be the three closest or the ten closest, that is up to you. From those kappa closest observations you randomly sample one, borrow their observed value, and plug that in as your imputed value instead. You repeat this for every occasion j, and repeat the whole process m times.
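In practice you will rarely code this by hand: the mice package in R implements exactly this kind of chained, regression-based imputation, with predictive mean matching as its default method for numeric variables. A rough sketch of the workflow, pretending that dat from earlier is a realistically sized version of that data frame (the analysis model here is purely illustrative):

```r
library(mice)

# m = 10 imputed data sets using predictive mean matching; in a real analysis
# you would stop mice from using the id column as a predictor (e.g. via its
# predictorMatrix argument) and would impute the outcomes in time order.
imp  <- mice(dat, m = 10, method = "pmm", seed = 437)

fits <- with(imp, lm(y4 ~ y1 + age))   # fit the analysis model in each data set
summary(pool(fits))                    # Rubin's rules pooling of the m fits
```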
The key difference is that we fit the same types of regression models, but instead of using the predicted means to randomly sample from the outcome distribution, we use them to match individuals with those who actually had observed values. This has a couple of benefits over the standard regression technique. First, you always end up with a value that was actually plausible: a regression equation can predict values that make no sense, because it does not know what the underlying scenario looks like, but if you borrow from someone whose value was actually observed, you know it was a possible value to observe. You are essentially saying, "you are the closest person in our data set, so let's pretend the observation was yours." Second, you can misspecify the model slightly and the procedure is still valid. In the regression case you must correctly specify the outcome model to make valid predictions; here, as long as the model correctly matches people together, so that the people it says are closest really are the most similar, the procedure still works even if the predictions themselves are not exactly right in an overall sense. So in general we prefer predictive mean matching, although the plain regression version is perhaps more intuitive.

For both of these there is the whole idea of sampling from posterior distributions, which is more mathematically involved than we are getting into. If it helps, the way I personally think about these regression-based imputation techniques is this: we fit a regression model to predict the outcome at stage 2 from the stage 1 outcome and the covariates, take those values, and pretend they were observed; then we fit a regression model at stage 3 using stages 1 and 2 and the covariates, and predict those values; with matching, instead of using the predictions directly, we pair people off with their nearest observed neighbours and borrow actually observed values. We then need to be careful, and that is where the fancier mathematics comes in, to correctly account for all of the variability in the sample, but conceptually it is just sequential regression models, and the R code will handle the details for us.

The last point I want to bring up about imputation is that likelihood itself can be seen as an imputation procedure. We said that any likelihood-based procedure, as long as the likelihood is correctly specified, is valid with MCAR or MAR data, and the basic reason is that the conditional distribution of the outcome given X_i is the same among the observed people, the missing people, and the full sample. In that sense, likelihood procedures are acceptable.
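As a concrete example of that point, fitting a linear mixed effects model by maximum likelihood to the available long-format data needs no explicit missing-data handling and remains valid under MAR, provided the model is correctly specified. A sketch reusing the dat_av long format from the earlier sketch (toy-sized here, so think of it as a stand-in for a real data set):

```r
library(lme4)

# One row per observed (person, time) pair; REML = FALSE requests maximum likelihood
mixed_fit <- lmer(y ~ time + age + (1 | id), data = dat_av, REML = FALSE)
summary(mixed_fit)
```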
You will also see people in the literature talking about expectation-maximization (EM) algorithms, which let you make the likelihood an imputation technique explicitly and estimate these models in somewhat more generality. I am not going to require you to fit an EM algorithm; I just wanted to point out that the reason likelihood works with MAR or MCAR data is that it is effectively a special case of the imputation techniques we are using. If you are using a maximum likelihood technique, you do not need to think about imputation directly; if you want a technique that does not have a likelihood interpretation, GEE for instance, you can use imputation, or you can use the inverse probability weighted GEE itself.

As a general overview, that is the very fast tour of handling missing data in longitudinal analysis. Missing data are a pervasive issue in longitudinal studies, and ignoring missingness at best causes a loss of efficiency and at worst completely invalidates your analysis, so it is something you should definitely be concerned with. We can categorize missingness as missing completely at random, where it does not depend on the outcomes; missing at random, where it depends only on the observed information; or not missing at random, where it depends on both the observed and the unobserved values. We can also talk about monotone patterns; everything we saw today essentially assumes monotone missingness. Think about how the imputation technique relies on that: we have to fit those sequential models, and if the pattern is not monotone you end up unable to predict all of the values in order, because observed and missing values are interleaved. If it is monotone we can impute exactly as discussed; if it is non-monotone there are similar methods, they just require a little more care in the models you fit. A complete case analysis or an available data analysis is valid only under missing completely at random. Weighting techniques generate pseudo-populations based on how likely someone was to be observed, and they account for missing at random data; in particular, we saw how to adapt the GEE estimating equations to weight each individual, with the intuition that each person we do observe counts for themselves plus some people who are like them but whom we did not observe. Imputation techniques fit sequential regression models and predict the outcomes; we have to be a little careful about how the variances are accounted for, but at the base level it is a collection of regression models used to predict the missing values.

That covers our very brief discussion of missingness in longitudinal data. In the next lecture we will see the implementation of some of these methods with an R package, and we will see where we actually need to accommodate missingness at all, because, again, if our likelihood is correctly specified we may not need to handle the missingness explicitly. What I will introduce there is valid beyond the scope of longitudinal data itself; it applies to essentially any analysis you want to run.
After that, we are moving on to survival analysis: we will do one wrap-up lecture on longitudinal data analysis, and then we move on. As always, if you have any questions about what was covered today, and I know it is a lot of information thrown at you, definitely feel free to ask. If you have questions about anything we have covered in longitudinal analysis so far, I encourage you to reach out or ask on Teams, and otherwise I will see you all in the next lecture.
Info
Channel: Dr. Dylan Spicker
Views: 2,916
Id: ObN-uZLtYSA
Length: 51min 9sec (3069 seconds)
Published: Mon Jan 24 2022