Statistical Methods Series: Mixed Models

Captions
- We are delighted to present Dr. Bolker. So this presentation, I think, has... The total number of people that have registered for this is over 800, and that speaks a lot about Ben. Looking at his website, in his own words, he wears multiple hats. He's a mathematical, statistical, and computational biologist. He has a bachelor's in physics and mathematics from Yale and a PhD in zoology from Cambridge University. Currently, he's a professor in the Departments of Mathematics & Statistics and Biology at McMaster, and he's the director of the School of Computational Science and Engineering. The things I would like to highlight about Ben are that, if you have ever had any question regarding random effects or mixed models and you looked it up on Stack Overflow, you've probably run across one of his answers. He has approximately 3,400 answers on Stack Overflow, and he has been a co-developer of many packages that many of us use: lme4, glmmTMB, bbmle, and others. These are packages that have become the bread and butter for many of the analyses that we do. From a very personal perspective, I'm very happy to have Ben here because back in the day, when he was teaching at UF, I think it was around 2007, I took a course with him. That's how I started getting into R, basic statistics, MLE. So that makes this particularly special. But anyway, without further ado, Ben, the floor is yours. - All right. Thank you very much. It's an honor. It's a little scary. I said in a tweet last night that I have this many people in intro bio lectures, but I think mostly they're not paying attention. And looking at the participant list, I see lots of people I know, so hello everyone. I am going to get started, because I, of course, was ambitious about what I wanted to share with you. I'm gonna be ignoring the chat and assume that Jody will let me know if something is actually burning. I'm going to share my entire screen. And so Jody, will you put the links in the chat for people? - [Jody] Yep, I will. - So all the materials are here, if you want to follow along. I want to start out by saying just a couple of quick overview things. This is intended to be about one third overview and two thirds nitty-gritty coding, but there's some conceptual stuff mixed in with the coding as well. This is an extremely heterogeneous audience, and I apologize in advance if you're either overwhelmed or bored, but I had to do that. I'm just doing the best I can, and you can ask super basic questions or super advanced questions, and I'll try to answer them. So assuming that everybody can see this and that it's big enough, I'm gonna start by just going over some foundations that, again, I hope most people have at least heard of, but if you haven't, this should get you on board. I'll mention, in the context of forecasting, that mixed models are not specifically a forecasting tool, but they're a pretty general, powerful framework, and you certainly can use them for forecasting. So some highlights from the overview that went on the website: they're an extension of linear and generalized linear models. They're for dealing with observations that are measured within groups, which could be field sites, years, blocks, genotypes, species; I'll talk a little more about that.
And you can think about them either as accounting for non-independence or correlation, you can think about them as a way to estimate variation among groups, or you can think about them as a way to estimate the effects of each group without using more statistical power than you need to. And they're most useful when you have a lot of groups, and especially when you have different amounts of data, different amounts of noise, or different numbers of observations in each group. So people will talk about "the random effect of year." When they say that, they typically mean that year is the grouping variable, and that has to be a discrete categorical variable. And often, implicitly, there is something that varies across the groups. In most mixed model packages in R, this is notation invented by Doug Bates: it's denoted (f | g), and it's read as "the effect of f varies across the groups defined by g." If you're only interested in variation in the intercept, the baseline, then f is 1, and we would call that an intercept-only model. And what we're trying to do when we estimate the effects of f, so in the example I'm going to give, the effects of increasing fire frequency for each biome, is estimate effects for each group, but shrink them, or pool them, towards the population average. Again, all of this is very quick, and if you've never heard of any of this stuff before, just hang on, because I don't have time to do this properly. So hopefully, this is reminding you. There is not a really hard and fast rule about what is or is not a random effect, and I'll refer you to a couple of references; all of the references are at the end. Random effects don't let you calculate p-values for the difference of a particular group from another group or the difference of a particular group from the mean. They do give you an estimate of the value of that group, so for prediction or management, that should be sufficient. They're also interesting if you're an evolutionary biologist and you want to quantify the variation, for example, across species or across genotypes. They're handy because you can make predictions for new groups that weren't sampled in your data. They're handy because they share information across groups. They're useful when you have groups with not very much information: you don't have to throw them away. You can use those samples, and they get automatically weighted according to the amount of information they contain. In the classical definition, the groups must be randomly sampled from some larger population. That's a part of the definition I don't feel as strongly about. You could also think about random effects as what you use when you have a categorical variable, something that comes in discrete groups, that you're not interested in but are trying to control for. The groups are exchangeable, and the approach I'm gonna talk about works best when you have at least five or six groups; it's fine if there's a small number of observations per group, an average of at least one, and it's useful when the groups are unbalanced. I'm gonna be using lme4, which is probably the most widely used mixed model package in R, and I'm gonna be using an extension called gamm4 that adds some generalized additive model machinery, a little bit, to deal with spatial autocorrelation. I have a real grab bag of which other tools from the R universe I use.
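As a minimal sketch of the (f | g) notation and the intercept-only versus random-slopes distinction just described (the data frame dat and the columns y, f, and g are placeholder names, not from the talk):

```r
library(lme4)

## intercept-only: the baseline of y varies across the groups defined by g
m_int   <- lmer(y ~ f + (1 | g), data = dat)

## random slopes: the effect of f (and the intercept) varies across g
m_slope <- lmer(y ~ f + (f | g), data = dat)

## per-group estimates, shrunk ("pooled") toward the population average
ranef(m_slope)
```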
So there will be a little bit of tidyverse stuff in here, but I'm not a complete tidyverse convert. And I'll be using some additional packages, and I'll try to notate functions from those extra packages with the double colon (package::function). And I think this link is somewhere else as well: this is a Google sheet where I and other people have started to try to compile a big list of the R mixed model ecosystem. That's it for the stats intro. Here's the science intro. I've been gradually working on a paper for several years with Max Moritz and Enric Batllori, where they got hold of a big synthetic global terrestrial dataset on species richness, primary productivity, and fire, and I can answer specific questions later, but I'm gonna go over this pretty quickly. The middle row here is essentially our response variables, and in the paper, we're analyzing all three of these species richness patterns. I'm gonna focus in this example on the bird richness data, because all three analyses pretty much go in parallel; there's not that much extra to be learned about the methodology by doing all three of them. The predictor variables are... okay, sorry. We have three different geographic scales that we're interested in. We have large scale: these are called floristic realms, and they're essentially kind of evolutionary units, biogeographic large-scale units. Biomes, which, as people here are ecologists, everybody should be familiar with: tropical forest, mangrove, desert and shrub, boreal forest, tundra, et cetera. The ecoregions are our sampling units, and these were developed by Olson et al. in 2001, essentially as individual ecological units, and I can't really go into the definition much more. There are about 700 of these, so they're smaller in scale than either the realms or the biomes. For the predictor variables, we included the size of the ecoregion, because that obviously has an effect on species diversity, although that's not our primary interest. We're interested in the mean annual net primary productivity, in grams of carbon per meter squared, and the coefficient of variation, the inter-annual variability, of NPP. We're also interested in, and this is kind of what's new about this analysis, the fraction of NPP consumed by fire, I'm gonna call it "fire eaten" later on, on average, and the inter-annual variability of that fire consumption variable. Okay, I think that's everything I have to say about the data. So we're going to try to estimate the effects of NPP, of fire consumption, of their inter-annual variabilities (these are coefficients of variation), and of all the two-way interactions, on species richness. But we're gonna try to account for variability in those effects across realms, across biomes, and across biomes within realms. So that's why we need mixed models. We have a large-scale discrete grouping variable. We have a medium-scale discrete grouping variable. We're also gonna include the biome-by-realm interaction, which instead of being "tropical grasslands" or "Neotropics" is "tropical grasslands in the Neotropics," and then the ecoregion is the sampling unit. I want to take a breath and talk briefly about a conceptual issue which comes up, which is nesting and crossing of random effects. Realm and biome are what we would consider crossed random effects: there's more than one tropical grassland.
Sorry, there's more than one realm, there's more than one biome, and each biome can occur in each realm. To go back to a kind of non-ecological example, a nested random effect would be one where, if we have classes within schools, "class one" in school one is just a label; it doesn't have anything to do with "class one" in school two. So each of these subunits is a separate subunit that doesn't share anything with the other units with the same label. If we have crossed random effects, tropical forest in the Neotropics has something in common with tropical forests in other realms. So in this case, the biomes and the realms are crossed, and that's what gets us our biome-by-realm level. I'm gonna skip that; it's not as important. A little bit more terminology. I'm almost done with the overview and I'm not horribly behind schedule yet, so that's wonderful. If somebody comes into your office and says, oh, I ran a model with a random effect of species, what they usually mean is that they're considering the variation in the intercept across species. If I said a random effect of biome, I would typically mean that the average species richness varies across biomes. This is actually a very special case of random effects. Arguably, most random effects ought to be random-slopes models: models that consider the variation of an ecological effect across the levels of the grouping variable. When we have more than one parameter in that term, however, so we're going to fit a model, eventually, that allows the intercept and all four of the main effects to vary across biomes or across realms, that's gonna mean that we need to estimate, for example, the variation in the effect of fire and the variation in the effect of inter-annual variability in fire, and the correlation between those two effects, which means that when we have four effects and an intercept, we're gonna end up having to estimate 15 parameters. If we cut that down, and we assume that biomes that are especially fire-sensitive are not also especially sensitive to NPP, then we could choose to assume that the random effects are all independent of each other, which would cut the number of parameters down to five. Most of our variables are gonna be... sorry, the NPP and fire variables are gonna be log-scaled. The coefficients of variation are all essentially unitless already because they're proportions. We're gonna center everything. This is gonna make everything easier and it's gonna avoid us screwing up. This is probably in my top-10 tell-people-to-read-it papers; it's just explaining why you should scale and center your variables. And because these are gonna be log-log models, we're gonna estimate effects of logged things on logged species richnesses, so the coefficients are gonna be interpretable as elasticities. I wanna finish with a couple of cartoons or schematics. This one is from Gelman and Hill's book on regression modeling, showing that you have simple models that are easy but don't do what you want, and complex models that you want but that break or don't work, and there's a similar picture from Uriarte and Yackulic showing that we have simple models that are definitely wrong, although if you have good eyes, you can see they say "how likely is it that the model is wrong," and it's certain on both sides, which I like.
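To make the nested/crossed distinction and the correlated-versus-independent random-slopes trade-off described above concrete, here is a rough lme4-syntax sketch (the data frames schools and ecoreg and their columns are illustrative placeholders, not the dataset from the talk):

```r
library(lme4)

## nested: "class 1" in one school has nothing to do with "class 1" in another
m_nested  <- lmer(score ~ 1 + (1 | school / class), data = schools)
## ... equivalent to (1 | school) + (1 | school:class)

## crossed: the same biome label is shared across realms, so biome, realm,
## and their combination each get their own term
m_crossed <- lmer(log_rich ~ 1 + (1 | biome) + (1 | realm) + (1 | biome:realm),
                  data = ecoreg)

## correlated random slopes: 5 SDs + 10 correlations = 15 parameters per level
m_corr <- lmer(log_rich ~ npp + npp_cv + fire + fire_cv +
                 (1 + npp + npp_cv + fire + fire_cv | biome), data = ecoreg)

## independent random slopes (double bar): just the 5 variances
m_diag <- lmer(log_rich ~ npp + npp_cv + fire + fire_cv +
                 (1 + npp + npp_cv + fire + fire_cv || biome), data = ecoreg)
```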
But we're going from a simple model to a complex model, and we're basically trying to get as far to the right as we can. We're trying to fit the model that makes us happiest as ecologists without overfitting too much or breaking the machinery. Okay. Jody, any urgent questions so far? - [Jody] Nope, you're doing just fine. - Okay. So I apologize for not doing this in RStudio, but that's too big. Oh, that's still too big. That's way too big. Hang on. Hang on. Of course, we have technical difficulties. Well, do not adjust your set. All right. So this is the R Markdown file that I've been presenting from, and I'm gonna just zoom down here. These are all the notes I showed you. I'm gonna zoom down here until I get to the actual code. I'm going to start by just loading a whole bunch of packages. I'm trying to illustrate best practices here, like annotating what I'm using the packages for and putting all the packages up front, so that when my code breaks because I don't have a package, I know that right away. I'm grabbing the data here. Jody, should I make this console viewable, or should I make it one larger? - [Jody] Try one larger. Let's see what it looks like. - Yeah, that's probably good. So there's probably a bunch of stuff in here that I don't need, but it includes the bird log diversity, which is gonna be my response variable, and it includes scaled, logged versions, or just scaled versions, of the predictors, including the coefficients of variation and the fraction of NPP eaten by fire. And I also have the biome and the realms and the biome interacting with the floristic realms. So, starting from the left side of the diagram, I'm going to fit a model using pretty much the standard R syntax for the fixed effects. I forgot to scale and log the area up front, so I'm including that, and I'm only including an intercept, a variation in the intercept across biomes. So that's the expected mean logged species diversity at the average values of all the predictors, and the ^2 here is putting in all the pairwise interactions of these variables. That's not too bad. That's quick, anyway, and it didn't complain at me. It's best to check your diagnostics as early as you can, because when you look at diagnostics, you see things. You want to look at the diagnostics before you get to look at the p-values and the confidence intervals, so that you don't say, oh, I really like that model, let's hope I can keep it. This is what you get if you call plot on an lmer model, and it's just plotting fitted values against residuals. I hope that this is flat, and there's a smoothing line here, and that doesn't look too bad. Unlike linear models, if I want to plot something like a scale-location plot, which is the square root of the absolute value of the residuals against the fitted values, I have to do it myself; it's not quite as easy. And there might be a little bit of a tendency for decreasing variance at higher values of the fitted value, but this isn't something that's gonna worry me a lot. And I'm gonna look at the influence plot from the car package, which is giving me the leverage, the potential influence of a point, on the x-axis and the size of the residual on the y-axis, and the bubbles here are Cook's distance, which is essentially the product of the residuals and the hat values. I apologize, I think the labels on the plots are quite small, but there's nothing I can do about that in a hurry, so I'll try to tell you what's on the axes.
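A rough sketch of the fit-and-diagnose sequence just described (the data frame ecoreg and its column names are stand-ins for the real dataset, not the actual script from the talk):

```r
library(lme4)
library(car)   # influencePlot()

## intercept-only random effect of biome; ^2 adds all pairwise interactions
m0 <- lmer(log_bird_rich ~ scale(log(area)) +
             (npp + npp_cv + fire + fire_cv)^2 + (1 | biome),
           data = ecoreg)

## default merMod plot: fitted values vs. residuals, with a smoother
plot(m0, type = c("p", "smooth"))

## scale-location plot has to be built by hand for merMod objects
plot(m0, sqrt(abs(resid(.))) ~ fitted(.), type = c("p", "smooth"))

## leverage vs. residual, with bubble size proportional to Cook's distance
car::influencePlot(m0)
```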
And potentially interesting points are numbered, and it also prints out the residuals, the hat values, and the Cook's distances for values above a threshold. The largest Cook's distance here is 0.1; 0.5 is a typical rule of thumb for when you should worry about a Cook's distance. I'm gonna come back in a little bit and talk about DHARMa, which is a newer, fancier way of plotting the residuals, particularly effective when you're dealing with binary data. It shows a Q-Q plot, so an estimate of the goodness of fit of the distribution of the residuals, and on the right-hand side, it's showing a predicted-versus-residuals plot, except that its scale is different, because it's scaled so that the predictions go from zero to one and the residuals go from zero to one. These points are supposed to be uniformly distributed in this plane. The lines are quantile regressions: the middle line is the median of the residuals, and the outer lines are the 25th and 75th percentiles of the residuals. This looks alarming; I will come back to it later. I shouldn't really be looking at the diagnostics yet anyway, because this model is just something I'm fitting to see if something really simple works, so I'm not gonna worry about that too much at the moment. If we want to plot coefficient plots, which again I shouldn't be doing yet, this is showing me, again, the labels are too small, but this is showing me all of the coefficients, their estimated values, and their 95% confidence intervals. And I would say this is actually a much better way to look at things than the summary, which encourages you to stare at the stars in the p-value column. Because I've scaled all my predictors, this tells me about the magnitudes. You can't see this, but that's NPP. That's got a very large effect, larger than everything else. That's not a big surprise. It does look like, preliminarily, we might have some significant effects here. Actually, let me try. That's a little better. Can I try to bump it up one more? Probably. Much better. This is the same plot, except that I've used a utility function that I defined in the utils.R file that I posted to just sort the effects, well, not by magnitude, but going from positive to negative. There's an argument for showing the absolute values here, but then it gets confusing. But now we can more clearly see that NPP is the big thing, fire is the next strongest effect, and then we get inter-annual variability of NPP, and then we start getting all the two-way interactions. Okay, that's all too easy. So I'm gonna take the model and, instead of a random intercept only, I'm gonna use the update function, which I really love, which says, okay, drop the (1 | biome) random effect you had here, and instead add variation across biomes in all four of the main effects and the intercept. And that takes noticeably longer, and it's warning me about a singular fit. So good, things are getting nice and complicated. I'll come back to that in a minute. The nice thing about update is that you get to see just what changes. That can be a problem: if you have a really long list of models, you might forget what was in your base model, but in general, this reduces the amount of code and makes it easier to see what's going on. I think I need to speed up a little bit.
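A sketch of the update step and the follow-up checks (m0 is the hypothetical intercept-only fit from the previous sketch; dotwhisker and broom.mixed are assumed here as the plotting machinery for the model comparison):

```r
library(broom.mixed)   # tidiers that let dwplot handle merMod objects
library(dotwhisker)    # dwplot()

## swap the intercept-only term for random slopes of all four main effects
m1 <- update(m0, . ~ . - (1 | biome) +
                  (1 + npp + npp_cv + fire + fire_cv | biome))

isSingular(m1)   # TRUE means a zero variance or a +/-1 correlation somewhere
VarCorr(m1)      # standard deviations and correlations of the random effects

## fixed-effect estimates of the two fits, side by side
dwplot(list(intercept_only = m0, random_slopes = m1), effects = "fixed")
```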
dwplot is also really great for comparing fits of models. I'm only looking at the fixed effects here, I'm not looking at the variance across biomes, but just looking at the fixed effects, I can see that adding in the random slopes didn't make a huge difference. Things bounced around; some things that were significant before aren't significant. But if you're not obsessed with p-values, you can say, this is giving me approximately the same message that I had before. Okay. The maximal model, which I wanted to mention, is the model that includes all of the effects for which you could possibly measure variation. This is gonna take about 30 seconds, so I'll set it going while I talk about it. It's essentially every term in the fixed effects, every predictor where you've measured different values of that predictor in each group. So I can measure the variation in the effect of fire across biomes because I have multiple ecoregions in each biome and, well, that was faster than I thought, each biome has a different fire consumption variable. So those are all identifiable. This is not actually the maximal model, because the maximal model would also include all of the two-way interactions, but that would be a little bit crazy. Going over this, though: unfortunately, the formula syntax doesn't let me compact this quite as much as I'd like to, but I have all four of those effects, and the intercept is included by default, varying across all three of my geographic scales. And what you're gonna see... so I can ask. I have a utility function that says, is that singular? Singular means a variance is estimated as zero, or a correlation is estimated as plus one or minus one, or, in a little bit of a brain explosion, if you have a five-dimensional random variable, it might have collapsed to three dimensions. So I'm trying to estimate a five-by-five covariance matrix, but I really only have the data to estimate three independent dimensions, and the other two collapse to zero. And it can be quite tricky. So this is the variance-covariance structure that was estimated for that model. For each group, I have the grouping variable, I have each of the effects, I have the standard deviation of each effect, and then I have the correlations among every pair of effects. So there are 45 parameters here. That sounds like a bad idea. You can see that the correlations are very large. There's only one of them that's actually listed as minus one. None of these standard deviations are zero, but it's very easy in these large models to get a fit where all the standard deviations are positive and all the correlations are between plus and minus one, and yet, because it's a high-dimensional thing that has collapsed to fewer dimensions, you can't just look at it and see, and that's why you need this criterion; I could get into more technical details if people want. So the maximal model, the idea, I should have explained this a little bit before: it usually doesn't work, because it's usually too complicated for the data you have. It can also sometimes not work for other reasons. In an example that I linked to here, there are nest boxes that are measured in two different seasons. And in a linear model, I say, well, I measured each nest box in both seasons.
So I should be able to estimate the variability in that effect, but because there's only one measurement of each nest box in each season, that turns out to be confounded with the residual variance: there are as many random effect values as there are residuals. Okay, what do we do about this? There's a big argument about how you should deal with models that are too complicated, and it basically boils down to two schools of thought. One is that you should make the model as complicated as you can: if you don't run into any technical problems, if there's no singularity, no convergence warnings, everything looks wonderful, then, I would argue, you should use the maximal model. Barr et al. say: take your maximal model and throw stuff away until it at least works computationally. Bates, Vasishth, Matuschek, and other people say no, you should use a data-driven approach: essentially, do some kind of stepwise procedure, some kind of model selection, not on the fixed effects, but selecting among the possible random effects, and see what is the best model. Very briefly, singularity and lack of convergence are different things. Singularity means you have a zero variance lurking in there somewhere. A convergence warning means the computer thinks there might be something wrong: it tried to estimate the gradient or the second derivatives at the optimum, and things look a little bit funky, and maybe there's something wrong. There's a very sad story I could tell you over beer, but there are lots of false positives from the convergence warnings. The gold standard is to run your model with a bunch of different optimizers and see whether different numerical methods all give you answers that are close enough, and "close enough" is a subjective, what-aspect-of-the-model-are-you-interested-in question. Once you've done that, you can say: I ran all of these different optimizers, and even though some of them are giving me convergence warnings, I'm basically getting the same story I'm interested in from all of them, so I'm gonna ignore the convergence warnings and go ahead. There are 27 possible combinations. Oops, right. I could try to fit the full five-by-five covariance matrix, that is, try to fit 15 parameters at each of the three levels, and I just showed you that that doesn't work very well. I could assume that they're all independent, which means I only have to estimate five parameters: the variability in diversity, the variability in the effect of fire, the variability in the effect of NPP, and so forth. Or I could just say, screw it, I can't estimate anything, I'm gonna use an intercept-only model. So I have three possible structures, each of which could apply at each of the geographic levels I'm interested in: the biome, the realm, and the biome-by-realm interaction. So I have 27 models to look at. Generally, I don't like fitting a whole bunch of models and seeing which one is best. There are ways around this, but they're much more complicated. So this is some overly complicated code that I'm gonna just talk about briefly. What it's doing is constructing a formula: either a formula with a 1 in it, a formula with these variables separated from the grouping variable by a single bar, which means we're looking at a correlated model, or all of these variables separated from the grouping variable by a double bar, which means they're independent of each other, and I'm squashing those all into a data frame. So I'll just show you what the first...
I hate it when that happens, excuse me. Oh, good. Perfect. We will be back in a moment. So I'm gonna just show you briefly: the first row of that data frame is the simplest model, (1 | biome), (1 | realm), (1 | interaction); the last row, well, you can't even see it, but it's the complicated case for all three of them. So there are 27 rows in that data frame, and now I'm gonna run through and fit the model for every row. The whole thing takes about 10 minutes to run on this reasonably fast machine, so I'm gonna cheat and read in the pre-computed set of 27 models. I'm gonna look at the AIC for each model, I'm gonna see whether each model is singular or not, and I'm gonna see whether each model has any warnings (the has-warning function is in the utilities file). And this is what the whole thing looks like: which model it is, what the AIC is, whether or not it's a singular fit, whether or not I have convergence warnings. Speeding up a little bit, I'm gonna find the best model. The best model turns out to be number 19 and, hmm, oh, let me just make sure I got the right one there. Sorry about that, folks. Huh, well, we'll see what happens. Oh, no, that's right, sorry, I just misremembered. I could look at this again. This is the summary. No, this is just the printout. This has independent effects at the biome-by-realm interaction level and intercepts only at the biome and floristic realm levels. Some quick diagnostics. These are gonna be small again. There's my fitted-versus-residuals plot. There's my scale-location plot; it doesn't look very different. There's my influence plot; nothing terribly alarming, though there are some larger Cook's distances here. And this is what DHARMa says. The fit of the distribution still looks good, but that's actually not something I care about a huge amount; it looks like I've got a problem here. I'm gonna claim that that's a false positive, that it's actually not something I need to worry about, because, okay, sorry: this is plotting against the fitted values. If I plot it against NPP instead, I still have a problem. So this is NPP on the x-axis and residuals on the y-axis, and the quantile regressions are the lines, and this looks like I've got a big problem. DHARMa computes the residuals ignoring the random effect component, and that's a good default, because in a lot of cases the random effects can lead to artifacts in the residuals that make you worried for no good reason. But I'm going to argue... I'm gonna show you some complicated code, but I'm basically just gonna show you the picture here. On the left-hand side, these are residuals plotted without including any of the random effects, NPP on the x-axis, residuals on the y-axis. These are now regular residuals, not residuals scaled between zero and one, so the pattern is not quite the same, and this is the loess fit. But if I were looking at this, I would say we've clearly got a problem. These are the residuals taking the random effects into account, and most of the problem goes away. So this is basically saying that we're underpredicting some hot spots and overpredicting some cold spots, but once we include the among-biome and among-biome-by-realm variation, we're actually accounting for that. Okay, I am actually almost done. So I think that's mostly a model I'm comfortable with.
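A condensed, hypothetical stand-in for the two pieces of machinery described above: the loop that builds and fits the 27 candidate random-effects structures, and the DHARMa residual checks. Names follow the earlier sketches; the real code in the talk's materials is more elaborate.

```r
library(lme4)

## three candidate structures for a given grouping variable
re_term <- function(type, grp) {
  switch(type,
         intercept = sprintf("(1 | %s)", grp),
         corr      = sprintf("(1 + npp + npp_cv + fire + fire_cv | %s)", grp),
         diag      = sprintf("(1 + npp + npp_cv + fire + fire_cv || %s)", grp))
}

grid <- expand.grid(biome = c("intercept", "corr", "diag"),
                    realm = c("intercept", "corr", "diag"),
                    biome_realm = c("intercept", "corr", "diag"),
                    stringsAsFactors = FALSE)   # 27 combinations

fits <- Map(function(b, r, br) {
  ff <- reformulate(c("scale(log(area))",
                      "(npp + npp_cv + fire + fire_cv)^2",
                      re_term(b, "biome"), re_term(r, "realm"),
                      re_term(br, "biome:realm")),
                    response = "log_bird_rich")
  lmer(ff, data = ecoreg)
}, grid$biome, grid$realm, grid$biome_realm)

summary_tab <- data.frame(grid,
                          AIC = sapply(fits, AIC),
                          singular = sapply(fits, isSingular))
m_best <- fits[[which.min(summary_tab$AIC)]]
```

And the DHARMa checks on the selected model (simulateResiduals and plotResiduals are real DHARMa functions; passing re.form through to simulate.merMod is how one would condition on the fitted random effects):

```r
library(DHARMa)

sr <- simulateResiduals(m_best)        # scaled (0-1) simulation residuals
plot(sr)                               # QQ panel plus residual-vs-predicted panel
plotResiduals(sr, form = ecoreg$npp)   # residuals against a single predictor

## the default simulations ignore the fitted random effects; re-simulating
## conditional on them shows whether the apparent pattern is absorbed by
## the biome / realm terms
sr_cond <- simulateResiduals(m_best, re.form = NULL)
plotResiduals(sr_cond, form = ecoreg$npp)
```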
The last thing that I want to check is geographic, spatial autocorrelation. So I'm extracting the residuals, and I'm doing a plot of the residuals with positive residuals in blue and negative residuals in red, and the size and the intensity both increasing with the magnitude of the residuals. Please ignore that big red data point there; very recently, I've been thinking I might want to get rid of that. The salient point here: these are the residuals, and there's clearly spatial pattern left. I am clearly overpredicting in a bunch of the Neotropical rainforest and underpredicting in most of the southern Neotropics and in Central America, even though I've taken the realms and the biomes and the biomes-by-realms into account. So I need to deal with that, and there are fancier, more formal ways to do this. I'm switching to the gamm4 package. gamm4 is basically lme4, but you can stick in the various smooth functions that the mgcv package provides. If you have no idea what I'm talking about, I'm sorry. Remarkably, this is a little bit of a pain, because if I want to fit a spatial model to these values, I have to do something about projection and the fact that I've got data all over a sphere; Kamchatka is actually close to Alaska even though it doesn't look like it on this picture. gamm4 allows me to add a spherical smooth to my model. So I'm taking the formula that I had, adding a spherical smooth to it, and fitting the model, and that's actually gonna take a little while; now I'm stuck with it. I will show you that it doesn't make a huge difference in my conclusions, but it's nice to have taken care of it. This is the comparison dot-whisker plot between gamm4 and lme4; it didn't make a huge difference. gamm4 is kind of a pain to work with; the utilities that I included here actually make it a little bit easier. And I'm gonna switch back to the pictures for the last few minutes, just to accelerate slightly. So this is the coefficient plot that we put in the paper, which is basically the same thing that you were looking at, with nicer labels and a different order, and showing the results for all three taxa. You can compute R-squared values, and there's a package, which I've hacked a little bit, that will compute the R squared of the model and the partial R squared for each term. There are double dangerous-bend signs in here to indicate that there are some kind of funny things: R-squared values are very hard for mixed models, and I'm not actually absolutely comfortable with them, but roughly speaking, they mirror the absolute values of the coefficients. So NPP is the biggest, fire is the next biggest; area, which had a negative coefficient, now has a positive R squared, and it's the next largest effect. This is what the paper shows when we look at it for all three taxa. emmeans, effects, and ggeffects: there's a predict method for these models, but it definitely makes it easier to get the predictions if you use one of these add-on packages. So I'm constructing a vector of 51 NPP values from min to max, and then I'm asking emmeans to extract the predictions, and plotting the points and the predictions. There's also a package on GitHub for getting the partial residuals, which basically does the subtraction for this plot, and I really will be done in one minute, I think. This plot shows NPP on the x-axis, log NPP, and predicted log bird diversity or observed log bird diversity.
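As a hedged sketch of the gamm4 spherical-smooth step described a few lines up: bs = "sos" is mgcv's spline-on-the-sphere basis, gamm4 combines it with lme4-style random effects, and the coordinate column names (lat, lon, in degrees) plus the simplified intercepts-only random part are assumptions for illustration:

```r
library(gamm4)

m_spat <- gamm4(log_bird_rich ~ scale(log(area)) +
                  (npp + npp_cv + fire + fire_cv)^2 +
                  s(lat, lon, bs = "sos"),          # spline on the sphere
                random = ~ (1 | biome) + (1 | realm) + (1 | biome:realm),
                data = ecoreg)

summary(m_spat$gam)   # the smooth term lives in the mgcv half of the object
fixef(m_spat$mer)     # the fixed effects live in the lme4 half
```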
But the variability in this point cloud includes all the variability in all the other predictors that we're just squashing into two dimensions now, so arguably, it looks a little bit worse than it should. The partial residuals take the predicted values, for each data point, of all of the non-focal predictors and subtract them. So this is, in principle, if my model is okay, the variation only due to NPP plus residual variation. I haven't looked at the random effects very much here, because that wasn't the primary goal for this model. At the biome-by-realm level, we have one random effect for every biome-realm combination. So this is all of the variability of the effect of the inter-annual variability of NPP in a particular biome by floristic realm combination. For the biome level and the realm level, we only have an intercept, so it's a lot easier to look at. We would have liked to dissect this variation a lot more and look at the across-biome variation in the effect of fire, but honestly, even though this is a very big, painstakingly synthesized dataset, it's not really big enough. We don't really have a lot of signal left at these levels to examine. We're mostly using the random effects as a statistical independence term, not to explore the relationships. I will take questions now. - All right. Thank you, Ben. So I'm looking at the pollev.com link, FI2021, and I'm ordering the questions according to the ones that were most upvoted. So the first one: is it valid to use AICc or other information criteria to choose between allowing variation of intercepts only versus also allowing slopes to vary? - I don't think anybody really knows. I'm not aware that anybody's really evaluated that carefully. I would say there are two issues. One issue is that we actually already know that the AIC is a little bit suspect for this case, because the derivation of the AIC assumes that the models are all in the middle of their parameter space, that none of the models are at the edge of the overall parameter space, where the overall parameter space is the variance-covariance space, and you can't have a variance less than zero. So some of the models, the ones with zero variances in them, are not technically valid even for doing AIC comparisons. In general, as with p-values, that tends to make AIC a little bit conservative: it's likely to say that these extra terms are not necessary, so you would perhaps throw them away unnecessarily. I guess what I'm worried about with AICc is that you're kind of stacking issues. We don't really know how well AICc works for nonlinear models, for GLMs; AICc was derived for simple linear models, so that, sorry, I'm not being very clear here. I think it should be valid, but if you're dealing with a dataset that's small enough that you need AICc, then I'm a little bit worried. It would tend to make you more conservative, and that might make you very conservative, and I might worry about being that conservative. That was a long, not really clear answer. Next. - Can you just elaborate a little bit more on that and say how to determine whether you should choose a model that has variation of intercepts only versus allowing slopes to vary? - So Barr et al. don't think you should. Barr et al. think you should use the most complicated model that works, that computes adequately and isn't singular.
Matuschek et al. do some simulations with, I think, a p-value cutoff of 0.1 or 0.2, so they use a kind of loose p-value criterion. That's what I did when I was doing this analysis, and I don't know that I can pull it up immediately. I think you do, in any case, want to check, if you can, that your selection of random-effects terms isn't making a huge difference. I'm gonna share my screen for a second. This is part of the bigger, more complicated, messy workflow that I did for the whole project: these are the fixed effects for all 27 models, showing which ones are singular and not singular, and showing the variability of the fixed effects across all of them. So, I mean, I think I would hesitantly use AIC to select these things, and I might use AIC precisely because it's the least conservative of AIC, BIC, and AICc, and therefore it's gonna lead you to include the random effects in your model when they're on the edge, which I think is actually a good idea. - Fantastic. All right. The other question: what are your thoughts on the pros and cons of fitting these types of models in a Bayesian versus maximum likelihood framework? - That's a simple question with a short answer, no. I think if I could snap my fingers and fit a Bayesian model with the same speed, convenience, and simplicity as fitting a frequentist model, I would generally go with the Bayesian model all the time. In the extras of my slides, I said the Bayesian approach is slower, but you get more. So it might take 10 or 20 or 100 times longer to fit the model, but it makes it easy to regularize. It avoids the "which random effects am I going to put in?" question: you put them all in, appropriately regularized, you put a strong enough prior on all of them that it's sensible, and you don't run into the boundaries and you don't get zeros and you don't get bad stuff happening. It's nice to be able to include informative priors, and they handle uncertainty better. Most of the confidence intervals that I showed you in the predictive plots don't account for the uncertainty in the random effects component of the model, so they're probably a little bit too narrow, and the only way around that in the frequentist case is parametric bootstrapping, which is really slow. So, yeah. - Great. Another question here is, what are the best practices for when you have covariates that are measured at different scales? For example, you could have some predictors that are measured at the biome level and some that are measured at the ecoregion level. - What's really nice about these models is that that's not really something you have to worry about. You can pretty much toss in the predictors, allow for variability at whatever levels are appropriate and identifiable, and the model structure will take care of it. And by the way, if I'm saying something that's absolute garbage, I'm hoping Denis will stop me. - No garbage; I haven't heard anything. So the question here is: every time I apply a mixed effects model, I receive a singular fit warning. How do I cope with a singular fit? - So, unfortunately, you have lots of choices. The model with a singular fit is probably overfitted, is probably too complicated, and might be lowering your power. But as far as we know, it's actually a well-posed statistical answer; it just has some of the variances set to zero.
So I guess I would clarify: this is for an intercept-only model, where you've already simplified the random effect as much as you can. If you haven't simplified the model, if you're trying to fit a five-by-five covariance matrix, then try making it diagonal, try making it intercept-only. I mentioned in my extras that there's an option called compound symmetry, where you assume that all the pairwise correlations are the same. That's another way to simplify it. And Maeve McGillycuddy in New South Wales is working on... so there are other ways to simplify, but let's get back to assuming you have an intercept-only model. You can simply say, hey, the best estimate of this variance is zero, and if that was the last random effect in my model, then I guess I'm gonna fit a linear model or a generalized linear model. I've estimated the among-group variance as zero, and so leaving it out of my model is basically gonna give me the same answer as leaving it in my model and having lme4 yell at me that it's singular. You could also regularize. There's a package called blme, which is a Bayesian overlay, and this is in the extras as well, which basically adds the weakest possible regularizing prior, a prior on the variance terms that prevents them from being estimated as exactly zero, and you could go forward with that. It's a little bit of an inferential no man's land, because you're regularizing the model, but you're estimating the most likely value; you're not estimating a posterior mean as you normally would in Bayesian statistics, you're not sampling, it's just doing an optimization. So it's a possibility. If you've got one intercept-level random effect in your model and it's singular, I would be tempted to carefully explain to the reviewers that that variability was estimated as zero, so I'm taking it out of the model. - Great. There are some questions also regarding how you scaled the different-- - So the statistician's answer would be, well, I guess you'd better get some better data then. Sorry. Questions. - Yeah. That's right. So there's a question regarding how you scaled your variables. It seems like you log-scaled some of the variables, but what about other types of scaling? For instance, the question asks about Z-score scaling for the NPP data. - Right, we actually did both. So I actually took the log... we went back and forth and we tried a couple of different things, and I will comment that none of this affects... well, non-linear transformations certainly affect the fit of the model; fitting on the log scale or the linear scale are asking different questions. Centering and scaling, so Z-scaling, or centering alone, change the parameter estimates and the interpretation of the parameters, but they don't change the predictions, they don't change the overall fit. What we actually did was log transform NPP and fire, not log transform the coefficients of variation, because they're essentially already unitless, they're already ratios of standard deviations to means, and then we centered everything and scaled everything. But, and there's an argument about this, in the presentation of the coefficients we actually removed the scaling, because we wanted them to be interpretable as elasticities. It's a little tricky. Z-scaling is sort of always a good default, but if you've thought about the default carefully and decided that you don't want to use it, go for it.
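A tiny sketch of the transformation just described, log first, then center and scale (the *_raw column names are placeholders for the unscaled variables):

```r
## log-transform NPP and fire, leave the already-unitless CVs unlogged,
## then center and scale everything
ecoreg <- transform(ecoreg,
                    npp     = drop(scale(log(npp_raw))),
                    fire    = drop(scale(log(fire_raw))),
                    npp_cv  = drop(scale(npp_cv_raw)),
                    fire_cv = drop(scale(fire_cv_raw)))
```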
- So essentially, you did the Z-score even after taking the logs, right? First the log, and then the Z-score. - Yes, and the question is what that makes it. So what is this variable? Well, this variable is the deviation around the geometric mean, measured in units of one standard deviation of the log. Okay, that's nice. The point is that we can then compare the coefficients directly, and that's a measurement of magnitude. The reason we decided in the paper to unscale them when presenting the coefficients is that we're also presenting the R-squared values, which are essentially a unitless measure of magnitude. And actually, sorry, I'll say one more thing, and then we can get on to the next question. Schielzeth's 2010 paper says this very nicely: if we were fitting regular old linear models, then the squared Z-scored parameter applied to the Z-scored predictor would be exactly the partial R squared. But these are mixed models, so the R squared is more complicated. - Fantastic. There's apparently a lot of interest regarding your comment that we should be careful about, or should not, test differences between groups using mixed effects models. So I think the question is really wanting you to expand on why that would be the case. - Sure. So this gets back to sort of proper statistical philosophy, which I don't really do, but I'll give you my understanding. When we estimate the random effect, and I'm talking about something like the variation in the effect of fire across ecosystems, we're treating that variation as a random variable. In the frequentist world, there's a difference between a parameter and a random variable: a parameter is a thing that has a true value, and a random variable is a thing that has a distribution. So when Doug Bates talks about the random effects, he likes to call them the conditional modes of the distribution of this effect, meaning this is a random variable, but given the parameters I've estimated and given the data I've seen, that conditional distribution moves to look like the data, and we can estimate a conditional mean and a conditional standard deviation. But those things don't have the same inferential properties as a proper frequentist parameter and a proper frequentist estimate, which has a sampling distribution and a bunch of stuff we can prove things about. In the Bayesian world, and this is another thing that's nice about the Bayesian world, a parameter is a parameter is a parameter, and either it has a hyperparameter or not. So if we wanted to test statistical differences between groups, we could do that, but then the Bayesians don't like p-values, so you're kind of back where you started. I want to share my screen for one more second to make one more conceptual point, which is that I think people are way more concerned than they should be about testing. You know, who cares that... let's say I take these standard deviations, these 95% intervals of the conditional distributions, and I say, oh, well, tropical or subtropical moist forest is significantly different from the average of all biomes, but tropical grasslands aren't. I don't think that's very interesting. This is purely opinion, but I would want to have a conversation with you about why you would want to know that one particular group was significantly different from another particular group.
You might have a good reason, but you might not. "I know, my supervisor told me to." That's the best reason, really. - Yeah, I think there were a couple of questions that were actually focused on that topic of how to test differences between groups and all that. - And, I mean, another way to put it: estimating group effects as a random effect means you get to pretend that you're only estimating one parameter instead of estimating a parameter for every level. For an intercept random effect, you're estimating a variance; there's one top-level parameter in the model, which is the variance. All the values of the individual groups don't count against your parameter count, and the no-free-lunch part, what that costs you, is that you can't make inferences about differences among the groups. - Great. So another question here has to do with the marginal and conditional R-squared values. What are the arguments against them, your double dangerous-bend road signs? Are there any useful alternatives that can be presented to approximate goodness of fit for mixed models? - So I think the conditional and marginal R-squareds actually are not bad, and they're generally only weird when you're in one of those overfitted-model, pushing-the-data-too-hard situations. So in general, I think the conditional and marginal R-squareds are pretty reasonable, and I wouldn't argue about them. The double dangerous bends here have more to do with the partial R-squared values, which I computed, which are not included in the standard Nakagawa, Schielzeth, Johnson, Lefcheck, et cetera, conditional and marginal R-squared framework. I'm computing an R squared for each effect, and that's done with, I think I didn't include the reference, but there's a series of papers by, I think, Byron Jaeger, Edwards, and others, where they come up with some frameworks; they're very nice papers, and I can try to post them in the chat or send them along later. They come up with frameworks for estimating partial R-squareds, and there are a few different ways to do it. And when I used the way that I thought looked most appropriate, I got a total model R squared that was smaller than the R squared for one of the parameters. I actually talked to the author and he said, yeah, I see that happening sometimes too, and I'm not quite sure I understand why that happens. And so I backed off to a different method, which gives me more sensible answers, but that may explain a little bit why I'm expressing caution about using these. - All right. There's a question here: does it make sense to regress conditional modes of random terms against another variable measured at the group level? So, for example, if you have a random intercept that varies from species to species, regressing that intercept as a function of, say, species traits. - That's generally a bad idea. There are papers by, I think, Jarrod Hadfield and other people. So the conditional modes more traditionally go by the jargon of BLUPs, best linear unbiased predictors, but Doug Bates doesn't like that term because in the GLMM framework, they're not linear and they're not unbiased and they might not be best. So he likes to call them conditional modes. But for searching the literature, Hadfield's paper is probably the one I would look for; I'm not sure whether there's a reference to this in the GLMM FAQ. That's generally dicey.
I would try... I can't say off the top of my head, but generally, if you can, incorporate the question you're interested in, the relationship you're interested in, as part of the model. My answer is: I can see the question you're asking, I know that there are pitfalls in taking the conditional modes or BLUPs as input variables for another analysis, and I'm not quite sure how to address the question that this person wants to address. - So is your comment perhaps that, ideally, that person would include the species traits directly in the first model rather than-- - Yes, but I'm not sure. Yes, but I'm not sure if they would get an answer to the question they want. So there's an American Naturalist paper, 2010, Hadfield is the first author; the title is "The misuse of BLUP in ecology and evolution." I don't remember the details, but that's where I would start. - So I'm looking here: is it valid to do mean separation using emmeans with glmer in lme4? - Repeat the question, which I probably still won't understand, but I'll try again. - Yeah, it just asks whether it's valid to do mean separation using emmeans with glmer in lme4. - So I will take a guess at what the question means. By mean separation, you're saying something like: let me take the predicted values for some pair of means and see whether they're significantly different from each other or not. emmeans is reliable for GLMMs with, well, a couple of caveats. One is that emmeans uncertainties in general ignore the uncertainty in the estimates of the variances, like I said before about the predictive ribbons that I was drawing, so it probably underestimates the uncertainty slightly. And I was gonna talk about bias correction, but that's not actually relevant. Other than that, the underlying linear machinery of GLMMs is sufficiently the same as that of LMMs that doing those comparisons will probably work. But the caveat there is that I'm not absolutely sure what mean separation means in this context, so I might've answered the wrong question. - No problem. Another question here, and I think this is gonna be our last question, because it's almost 1:30: should you always take coefficients from the full model, or use model averaging via AIC, or take them from the best model estimated by model selection? - What a big can of worms. Sorry. If you are predicting and you don't care about your confidence intervals being a little bit too narrow, then model averaging is a good idea. If you actually want to estimate effect sizes and do inference and calculate p-values, then, in my opinion, you should always use basically the biggest model that works, or the AIC-selected model. I think you should never do selection on the focal parameters; so, if you're interested in the fixed effects, don't do selection on which fixed effects to include and then use those coefficients. So: model averaging or some other form of regularized estimation if you're trying to do prediction and don't care about good uncertainty estimates, and the full model otherwise. - Great. All right. So I think that was great. Thank you, Ben, for agreeing to present, and thanks everybody for participating in the event. I would just like to make a final comment that this is a seminar series, so we're gonna have additional presenters, and we hope you can join us for the seminars to come. The next one is on December 6th at 9:00 a.m.
because our presenter is coming from Australia, and he's gonna be talking about species archetype models and regions of common profile models. So thank you all, and thanks again, Ben. That was really great. - Thank you. I talked fast enough, right? - It went well. - Goodbye. Well, goodbye to everybody. Hello and goodbye to the people I know. Enjoy your afternoon or evening or whatever it is where you are. - All right. Bye-bye.
Info
Channel: Ecological Forecasting
Views: 611
Keywords: ecology, statistics, mixed models, R code
Id: iFikMDuNeVM
Length: 79min 36sec (4776 seconds)
Published: Wed Nov 03 2021