- We are delighted to present Dr. Bolker. So this presentation, I think, has... The total number of people
that have registered for this is over 800, and that
speaks a lot about Ben. So by looking at his website, according to his own words,
he wears multiple hats. He's a mathematical, statistical, and computational biologist. He has a bachelor's in physics
and mathematics from Yale and a PhD in zoology from
Cambridge University. Currently, he's a professor in the Department of Math
and Statistics and in Biology at McMaster, and he's the Director of the School of Computational Science and Engineering. So the thing I would like to
highlight about Ben is really, if you have ever had any question
regarding random effects, mixed models, and you looked
it up on Stack Overflow, probably you've run across one of his answers. He has approximately 3,400 answers on Stack Overflow, and he has been a co-developer of many packages that many of us use: lme4, glmmTMB, bbmle. And so these are packages that have become the bread and butter for many of the analyses that we do. So from a very personal perspective, I'm very happy to have Ben
here, because back in the day, when he was teaching at UF, I think it was like 2007, I took a course with him. That's how I started getting into basic statistics and MLE. So that makes this particularly special. But anyway, without further
ado, Ben, the floor is yours. - All right. Thank you very much. It's an honor. It's a little scary. I said in a tweet last night that I have this many people
in intro bio lectures, but I think mostly, they're
not paying attention. And looking at the participant list, I see lots of people I
know, and hello everyone. I am going to get started 'cause I, of course, was ambitious about what I wanted to share with you. I tried to and I'm gonna
be ignoring the chat and assume that Jody will let me know if something is actually burning. I'm going to share my entire screen. And so Jody, will you put the
links in the chat for people? - [Jody] Yep, I will. - So all the materials are here, if you want to follow along. I want to start out by saying just a couple
of quick overview things. This is intended to be
about one third overview and two thirds nitty-gritty coding, but there's some conceptual stuff mixed in with the coding as well. This is an extremely
heterogeneous audience, and I apologize in advance if you're either overwhelmed
or bored, but I had to do that. I'm just doing the best I can, and you can ask super basic questions or super advanced questions,
and I'll try to answer them. So assuming that everybody can see this and that it's big enough, I'm gonna start by just
going over some foundations that again, I hope most
people have at least heard of, but if you haven't, this
should get you on board. What I said, and I'll mention
in the context of forecasting that mixed models are not
specifically a forecasting tool, but they're a pretty
general powerful framework and you certainly can
use them for forecasting. So some highlights from what I said in the overview that went on the website, they're an extension of linear and generalized linear models. They're dealing with observations that are measured within groups, which could be field sites, years, blocks, genotypes, species, I'll talk
a little more about that. And you can think about them either as accounting for
non-independence or correlation, you can think about them as a way to estimate
variation among groups, or you can think about them as a way to estimate the
effects of each group without using more statistical
power than you need to. And they're most useful when
you have a lot of groups and especially when you have
different amounts of data, different amounts of noise, or different numbers of
observations in each group. So people will talk about
the random effect of year. And when they say that, they mean typically that year is the grouping variable, and that has to be a discrete categorical variable. And often, implicitly, there is something that varies across groups. In most mixed model packages in R, this is notation invented by Doug Bates: it's denoted as (f | g), and it's read as "the effect of f varies across the groups defined by g." If you're only interested in variation in the intercept, the baseline, then f is 1, and we would call that an intercept-only model.
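To make that notation concrete, a minimal sketch in lme4 might look like this (the data frame dd, response y, predictor f, and grouping variable g are hypothetical names):

```r
library(lme4)

## hypothetical data frame 'dd' with response y, predictor f, grouping factor g
m_int   <- lmer(y ~ f + (1 | g), data = dd)  # intercept-only: baseline varies across groups
m_slope <- lmer(y ~ f + (f | g), data = dd)  # effect of f (and the intercept) varies across groups
```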
And what we're trying to do when we estimate the effects of f, so the effects in the example I'm going to give, the effects of increasing
fire frequency for each biome, we're estimating effects for each group, but we're shrinking them or pooling them towards
the population average. Again, all of this is very quick, and if you've never heard of
this, any of this stuff before, just hang on, because I don't
have time to do this properly. So hopefully, this is reminding you. There is not a really hard and fast rule about this is a random effect or this is not a random effect, and I'll refer you to
a couple of references. All of the references are at the end. Random effects don't let
you calculate p values for the difference of a particular
group from another group or the difference of a
particular group from the mean. They do give you an estimate
of the value of that group, so for prediction or management, that should be sufficient. They're also interesting if
you're an evolutionary biologist and you want to quantify
the variation, for example, across species or across genotypes. They're handy because
you can make predictions for new groups that weren't
sampled in your data. They're handy because they
share information across groups. They're useful when you have groups with not very much information. You don't have to throw them away. You can use those samples,
but weight them according... They get automatically weighted according to the amount of
information they contain. In the classical definition, the groups must be randomly sampled from some larger population. That's something that I
don't feel as strongly about that part of definition. You could also think about it when you have a variable
that you're not interested in that you're trying to control
for that's categorical that comes in discrete groups. The groups are exchangeable, and this approach that
I'm gonna talk about works best when you have
at least five or six groups. It's fine if there's a small number of observations per group, an average of at least one, and it's useful when the groups are unbalanced. I'm gonna be using lme4, which is probably the most widely used mixed model package in R, and I'm gonna be using
an extension called gamm4 that adds some generalized
additive model machinery, a little bit to deal with
spatial auto-correlation. I have a real grab bag
of which other tools from the R universe I use. So there will be a little bit of tidyverse stuff in here, but I'm not a complete tidyverse convert. And I'll be using some
additional packages. And I'll try to notate functions that use those extended
packages with double colon. And I think this link is
somewhere else as well. This is a Google sheet where I
and other people have started to try to compile a big list
of the R mixed model ecosystem. That's it for the stats intro. Here's the science intro. I've been gradually working
on a paper for several years with Max Moritz and Enric Batllori, where they got ahold of a big synthetic global terrestrial dataset
on species richness, primary productivity, and fire, and I can answer specific questions later, but I'm gonna go over this pretty quickly. The middle row here is essentially
our response variables, and in the paper, we're analyzing all three of these
species richness patterns. I'm gonna focus in this example on the bird richness data, because all three analyses really pretty much go in parallel. There's not that much
extra to be learned about the methodology by doing all three of them. The predictor variables are... okay, sorry. We have three different geographic scales that we're interested in. We have the large scale. These are called floristic realms, and they're essentially
kind of evolutionary units, biogeographic large-scale units. Biomes, which as people
here are ecologists, everybody should be familiar with: tropical forests,
mangrove, desert and shrub, boreal forest, tundra, et cetera. The eco regions are our sampling units, and these were developed
by Olson et al. in 2001, essentially as individual
ecological units, and I can't really go into
the definition very much more. There are about 700 of these, so they're smaller in scale than either the realms or the biomes. For the predictor variables we're interested in, we included the size of the eco region, because that obviously has an
effect on species diversity, although that's not our primary interest. We're interested in the mean, the annual grams of
carbon per meter squared of net primary productivity and the coefficient of variation, the inter-annual variability of NPP. We're also interested in, and this is kind of what's
new about this analysis, the fraction of NPP consumed by fire ("eaten" is the term I'm gonna use later on) on average, and the inter-annual variability of that fire consumption variable. Okay, I think that's
everything I have to say about the data. So we're going to try to estimate the effects of NPP, fire consumption, their inter-annual variabilities (these are coefficients of variation), and all of the two-way interactions on species richness. But we're gonna try to
account for variability in those effects across
realms, across biomes, and across biomes within realms. So that's why we need mixed models. We have a large-scale discrete grouping variable. We have a medium-scale discrete grouping variable. We're also gonna include the
biome by realm interaction, which instead of being tropical
grasslands or Neotropics is tropical grasslands in the Neotropics, and then the eco region
is the sampling unit. I want to talk briefly. I want to take a breath. I want to talk briefly about a conceptual issue which comes up, which is nesting and
crossing of random effects. Realm and biome are what we would consider crossed random effects: there's more than one realm, there's more than one biome, and each biome can occur in each realm. To go back to a kind of non-ecological example, a nested random effect would be one where, if we have classes within schools, "class one" in school one is just a label. It doesn't have anything to do with class one in school two. So class one within each of these subunits is a separate subunit that doesn't share anything with the other units with the same label. If we have crossed random effects, tropical forest in the Neotropics
has something in common with tropical forests in other realms. So in this case, the biomes and the realms are crossed, and that's what gets us our biome-by-realm level. I'm gonna skip the next bit; it's not as important.
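A rough sketch of how crossed versus nested grouping variables are written in lme4 (school, class, realm, biome, and the other names here are hypothetical):

```r
## crossed: every biome can occur in every realm, plus a biome-by-realm term
m_crossed <- lmer(y ~ x + (1 | realm) + (1 | biome) + (1 | realm:biome), data = dd)

## nested: "class 1" is only meaningful within its own school
m_nested <- lmer(y ~ x + (1 | school/class), data = dd)  # expands to (1 | school) + (1 | school:class)
```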
A little bit more terminology; I'm almost done with the overview, and I'm not horribly behind schedule yet, so that's wonderful. When people say "a random effect", if somebody comes
into your office and says, oh, I ran a model with a
random effect of species, what they usually mean is that they're considering the variation in the intercept across species. If I said a random effect of
biome, I would typically mean that the average species
richness varies across biomes. This is actually a very
special case of random effects. Arguably, most random effects ought to be random-slopes models, models that consider the
variation of an ecological effect across the levels of
the grouping variable. When we have more than one
parameter in that term, however (we're going to fit a model, eventually, that allows the intercept and all four of the main effects to vary across biomes or across realms), that's gonna mean that we
need to estimate, for example, the variation in the effect of fire and the variation in the effect of inter-annual variability in fire, and the correlation
between those two effects, which means that when we have
four effects and an intercept, we're gonna end up having
to estimate 15 parameters. If we cut that down, and we assume that biomes that are especially fire-sensitive are not also especially sensitive to NPP, then we could choose to assume that the random effects were all independent of each other, which would cut the number of parameters down to five.
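As a sketch, with hypothetical variable names, the two versions look like this; the single bar estimates the full covariance matrix (15 parameters for four slopes plus an intercept), and the double bar drops the correlations (5 parameters):

```r
## correlated random slopes: 5 SDs + 10 correlations = 15 covariance parameters
m_corr <- lmer(y ~ npp + fire + npp_cv + fire_cv +
                 (npp + fire + npp_cv + fire_cv | biome), data = dd)

## independent ("double-bar") random slopes: 5 variances, no correlations
m_diag <- lmer(y ~ npp + fire + npp_cv + fire_cv +
                 (npp + fire + npp_cv + fire_cv || biome), data = dd)
```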
I'm gonna work with... most of our variables... sorry, the NPP and fire variables are gonna be log-scaled. The coefficients of variation are all essentially unitless already because they're proportions. We're gonna center everything. This is gonna make everything easier, and it's gonna avoid us screwing up. This is probably in my top 10 "tell people to read it" papers; it's just explaining why you should scale and center your variables. And because these are log-log models, we're estimating effects of logged things on logged species richness, so the coefficients are gonna be interpretable as elasticities.
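A minimal sketch of that kind of preprocessing, with hypothetical column names:

```r
## log-transform NPP and fire, then center and scale everything;
## the CVs are already unitless proportions, so they are only centered and scaled
dd$log_npp  <- scale(log(dd$npp))
dd$log_fire <- scale(log(dd$fire))
dd$npp_cv   <- scale(dd$npp_cv)
dd$fire_cv  <- scale(dd$fire_cv)
```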
I wanna finish with a couple of cartoons or schematics. This is from Gelman and Hill's book on regression modeling,
simple models that are easy, but don't do what you want, complex models that you want
that break or don't work, and there's a similar picture
from Uriarte and Yackulic showing that we have simple models that are definitely wrong, although if you have
good eyes, you can see, they say, how likely is it
that the model is wrong, and it's certain on both
sides, which I like. But we're going from a simple
model to a complex model, and we're basically trying to get as far to the right as we can. We're trying to fit the model that makes us happiest as ecologists that doesn't overfit too
much or break the machinery. Okay. Jody, any urgent questions so far? - [Jody] Nope, you're doing just fine. - Okay. So I apologize for not
doing this in RStudio, but that's too big. Oh, that's still too big. That's way too big. Hang on. Hang on. Of course, we have technical difficulties. Well, do not adjust your set. All right. So I'm now, this is the R markdown file that I've been presenting from, and I'm gonna just zoom down here. These are all the notes you
had, all the notes I showed you. I'm gonna zoom down here until
I get to the actual code. And I'm going to start by just dumping in a whole bunch of packages. I'm trying to illustrate
best practices here, like annotating what I'm
using the packages for and putting all the packages upfront so that when my code breaks, I know, 'cause I don't have a package,
I know that right away. I'm grabbing the data here. Jody, should I make this console viewable, or should I make it one larger? - [Jody] Try one larger. Let's see what it looks like. - Yeah, that's probably good. So there's probably a bunch of stuff that I don't need here, but it includes the bird log diversity, which is gonna be my response variable, and it includes scaled, logged versions or scaled versions of the coefficients of variation. EFI is the fire consumption ("eaten") variable. And I also have the biome and the realms and the biome interacting
with the floristic realms. So this, starting from the
left side of the diagram, I'm going to fit a model using
pretty much the standard R syntax for the fixed effects. I forgot to scale and log the area upfront, so I'm including that, and I'm only including an intercept, a variation in the intercept across biomes. So that's the expected, the mean logged species diversity at the average values of all the predictors, and the squared term here is putting in all the pairwise interactions of these variables.
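A sketch of that first, intercept-only model (the variable names are hypothetical; `(...)^2` expands to all main effects plus all pairwise interactions):

```r
m0 <- lmer(log_bird_rich ~ scale(log(area)) +
             (log_npp + log_fire + npp_cv + fire_cv)^2 +
             (1 | biome),
           data = dd)
```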
That's not too bad. That's quick anyway, and it didn't complain at me. It's best to check your
diagnostics as early as you can, because when you look at
diagnostics, you see things. You want to look at the diagnostics before you get to look at the p values and the confidence intervals
so that you don't say, oh, I really like that model. Let's hope I can keep it. This is what you get if
you do plot() on an lmer model, and it's just plotting fitted values against residuals. And I hope that this is flat, and there's a smoothing line here, and that doesn't look too bad. Unlike linear models, if I want to plot something
like a scale-location plot, which is the square root of the absolute value of the residuals against the fitted values, I have to do it myself; it's not quite as easy. And there might be a little bit of a tendency for decreasing variance at higher values of the fitted value, but this isn't something that's gonna worry me a lot.
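A do-it-yourself version of that scale-location plot might look something like this (plot() on a merMod only gives fitted versus residuals by default):

```r
sl_x <- fitted(m0)
sl_y <- sqrt(abs(residuals(m0)))
plot(sl_x, sl_y, xlab = "fitted values", ylab = "sqrt(|residuals|)")
lines(lowess(sl_x, sl_y), col = "red")  # smooth trend; roughly flat is what we hope for
```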
is giving me the leverage, the potential influence of
a residual on the x-axis and the size of the
residual on the y-axis, and the bubbles here are Cook's distance, which is essentially the
product of the residuals and the hat values. I apologize. I think the labels on the
plots are quite small, but there's nothing I can
do about that in a hurry. So I'll try to tell
you what's on the axes. And potentially, interesting
points are numbered, and it also prints out the
residuals, the hat values, and the Cook's distances for values above a threshold. The largest of these Cook's distances is 0.1; 0.5 is a typical rule of thumb for when you should worry
about a Cook's distance. I'm gonna come back in a little bit and talk about DHARMa, which is a newer fancier way
of plotting the residuals, particularly effective when
you're dealing with binary data. It shows a Q-Q plot, so an estimate of the goodness of fit of the distribution of the residuals, and on the right-hand side, it's showing a predicted versus residuals plot, except that its scale is different, because it's scaled so that the predictions
go from zero to one and the residuals go from zero to one. This is supposed to be a uniform. These points are supposed to be uniformly distributed in this plane. The lines are quantile regressions. So this is the median. The middle line is the
median of the residuals, and the outer lines are the 25th and 75th quartiles of the residuals.
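The basic DHARMa workflow being described is roughly:

```r
library(DHARMa)
sim <- simulateResiduals(m0)  # simulation-based residuals, scaled to [0, 1]
plot(sim)                     # left: QQ plot; right: scaled residuals vs. predictions with quantile regressions
```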
This looks alarming; I will come back later. I shouldn't really be looking at the diagnostics yet anyway, because this model is
supposed to be just something that I'm trying to see if
something really simply works. So I'm not gonna worry about
that too much at the moment. If we want to plot coefficient plots, which again, I shouldn't be doing yet, this is showing me again,
the labels are too small, but this is showing me
all of the coefficients and their estimated values, and their 95% confidence intervals. And I would say this is actually a much better way to look at things than the summary, which encourages you to stare at the stars
on the p value column. So this gets... Because I've scaled all my predictors, this tells me about the magnitudes. You can't see this, but that's NPP. So that's got a very large. That's larger than everything else. That's not a big surprise. It does look like preliminarily, we might have some
significant effects here. Actually, let me try... that's a little better. Can I try to bump it up one more? Probably. Much better. I have done that, yeah. This is the same plot, except that I've used a
utility function that I defined in the utils.r that I posted to just sort the effects by magnitude, well, not by magnitude, but I mean, they're going
positive to negative. There's an argument for showing
the absolute values here, but then it gets confusing. But now we can more clearly
pick out that NPP is the big thing, fire is the next strongest effect, and then we get inter-annual
variability of NPP, and then we start getting
all the two-way interactions. Okay, that's all too easy. So I'm gonna take the model and, instead of a random intercept, I'm gonna subtract. So this is the update function, which I really love, which says, okay, drop the (1 | biome) random effect you had here, and instead add variation across biomes in all four of the main effects and the intercept.
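A sketch of that update() call, with the same hypothetical variable names as above:

```r
m1 <- update(m0,
             . ~ . - (1 | biome) +
               (log_npp + log_fire + npp_cv + fire_cv | biome))
```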
And that takes noticeably longer, and it's warning me about a singular fit. So, good, things are getting
nice and complicated. I'll come back to that in a minute. The nice thing about update is that you get to see just what changes. That can be a problem. If you have a really
long list of models, you might forget what
was in your base model, but in general, this
reduces the amount of code and makes it easier to
see what's going on. I think I need to speed up a little bit. dwplot is also really great
for comparing fits of models. I'm only looking at
the fixed effects here. I'm not looking at the
variance across biomes, but just looking at the fixed effects, I can see that adding in the random slopes didn't make a huge difference. Things bounced around. Things that were significant
before aren't significant. But if you're not obsessed
with p values, you can say, this is giving me
approximately the same message that I had before. Okay. The maximum model, which I
thought I wanted to mention: the maximum model is the model that includes all of the effects for which you could possibly measure variation. And this is
gonna take about 30 seconds. So I'll set it going
while I talk about it. So it's essentially every
term in the fixed effects, every predictor you have where you've measured different values of that predictor in each group. So I can measure the variation in the effect of fire across biomes because I have multiple
eco regions in each biome and each, well, that was
faster than I thought, each biome has a different
fire consumption variable. So those are all identifiable. This is not actually the maximum model because the maximum
model would also include all of the two way interactions, but that would be a little bit crazy. Going over this though, I have, and unfortunately, the formula syntax doesn't let me compact this
quite as much as I should, as I'd like to, but I have
all four of those effects and the intercept is included by default, varying across all three
of my geographic scales. And what you're gonna see, so I can ask. I have a utility function
that says, is that singular? Singular means a variance
is estimated as zero or a correlation is estimated
as plus one or minus one, or in a little bit of brain explosion, if you have a five-dimensional
random variable, it might have collapsed
to three dimensions. So I'm trying to estimate a five-by-five covariance matrix, but I really only have the data to estimate three independent dimensions, and the other two collapse to zero. And it can be quite tricky when you... So this is the variance
covariance structure that was estimated for that model. So for each group, I have the grouping variable, I have each of the effects, I have the standard deviation of each effect, and then I have the correlations among every pair of effects. So there are 45 parameters here. That sounds like a bad idea.
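In lme4 you can check this sort of thing directly (m_max here stands for the maximal model just described):

```r
isSingular(m_max)       # TRUE if any variance is 0 or any correlation is +/- 1
VarCorr(m_max)          # SDs and correlations for each grouping variable
summary(rePCA(m_max))   # PCA of the random-effects covariance; near-zero components signal collapsed dimensions
```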
You can see that the correlations are very large. There's only one of them that's actually listed as minus one. None of these standard
deviations are zero, but it's very easy in
these large scale models. You can get a model where all the standard
deviations are positive and all the correlations are
between plus and minus one. But because it's a high-dimensional thing that has collapsed to fewer dimensions, you can't just look at it and see, and that's why you need this criterion; I could get into more technical details if people want. So the maximum model,
the idea I explained, I should have explained
this a little bit before. It usually doesn't work because it's usually too
complicated for the data you have. It can also sometimes not work. I have an example that I linked to here: there are nest boxes that are measured in two different seasons. And in a linear model, if
I try to estimate the... I say, well, I measure each
nest box in both seasons. So I should be able to
estimate that variability in that effect, but because
there's only one measurement of each nest box in each season, that turns out to be confounded
with the residual variance. There are as many random effect values as there are residuals. Okay, what do we do about this? There's a big argument about how you should
try to deal with models that are too complicated, and they basically boil down
to two schools of thought. One is that you should make the model as complicated as you can. If you don't run into
any technical problems, if there's no singularity,
no convergence warnings, everything looks wonderful,
then I would argue you should use the maximal model. Barr et al. say keep it maximal: take your maximal model and throw stuff away until it at least works computationally. Bates, Vasishth, Matuschek, and other people say, no, you should use a data-driven approach. Essentially, do some kind of stepwise or some kind of model selection. Don't do the selection
on the fixed effects, but select among the
possible random effects and see what is the best model. Very briefly, singularity
and lack of convergence are different things. Singularity means you have a zero variance lurking in there somewhere. Convergence warning
means the computer thinks there might be something wrong. You know, it tried to
estimate the gradient or the second derivatives at the optimum and things look a little bit funky and maybe there's something wrong. There's a very sad story
to tell you over beer, but there are lots of false positives from the convergence warnings. And the gold standard is to run your model with a bunch of different optimizers and see whether the different numerical methods all give you answers that are close enough, where "close enough" is a subjective, what-aspect-of-the-model-are-you-interested-in question. Once you run them, you say: I ran all of these different optimizers, and even though some of them are giving me convergence warnings, I'm basically getting the same story I'm interested in from all of the different optimizers, so I'm gonna ignore the convergence warnings and go ahead.
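lme4's allFit() automates that gold-standard check; a sketch (m1 is a previously fitted model):

```r
af <- allFit(m1)       # refit the same model with all available optimizers
summary(af)$fixef      # compare fixed-effect estimates across optimizers
```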
There are 27 possible combinations. Oops, right. I could try to fit the full five-by-five covariance matrix, I could try to fit 15 parameters
that doesn't work very well. I could assume that
they're all independent, which means I only have to
estimate five parameters, the variability in diversity, variability in the effect of fire, variability in the effect
of NPP and so forth, or I could just say, screw it, I can't estimate anything, I'm just gonna use an intercept-only model. So I have three possible structures, each of which could apply at each of the geographic levels I'm interested in: the biome, the realm, and
the biome by realm interaction. So I have 27 models to look at. Generally, I don't like
fitting a whole bunch of models and seeing which one is best. There are ways around this, but they're much more complicated. So this is some overly complicated code that I'm gonna just talk about briefly. What it's doing is constructing a formula. It's either constructing a
formula with a 1 in it, a formula with these variables separated from the grouping variable by a single bar, which means we're looking at a correlated model, or all of these variables separated from the grouping variable by a double bar, which means they're independent of each other, and I'm squashing those all into a data frame.
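The idea of that formula-building code, in stripped-down form (all names are hypothetical):

```r
slopes <- "log_npp + log_fire + npp_cv + fire_cv"
make_re <- function(type, group) {
  switch(type,
         intercept   = sprintf("(1 | %s)", group),            # intercept only
         correlated  = sprintf("(%s | %s)", slopes, group),    # single bar
         independent = sprintf("(%s || %s)", slopes, group))   # double bar
}
## 3 structures x 3 grouping variables = 27 candidate random-effects specifications
grid <- expand.grid(biome       = c("intercept", "correlated", "independent"),
                    realm       = c("intercept", "correlated", "independent"),
                    biome_realm = c("intercept", "correlated", "independent"),
                    stringsAsFactors = FALSE)
## e.g. the random-effects part of the first (simplest) candidate model:
paste(make_re(grid$biome[1], "biome"),
      make_re(grid$realm[1], "realm"),
      make_re(grid$biome_realm[1], "biome_realm"), sep = " + ")
```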
So I'll just show you what the first... I hate it when that happens, excuse me. Oh, good. Perfect. We will be back in a moment. So I'm gonna just show you briefly: the first row of that dataset is the simplest model, 1 by biome, 1 by realm, 1 by interaction; and the last row of that dataset...
case for all three of them. So there's 27 rows in that data frame. And now I'm gonna run through and run the model basically
for every row in that dataset. The whole thing takes
about 10 minutes to run on this reasonably fast machine. So I'm gonna cheat and read in the pre-computed
set of 27 models. I'm gonna look at the AIC for each model, I'm gonna see whether each
model is singular or not, and I'm gonna see whether
each model has any warnings (that "has warnings" helper is in the utilities file). And this is what the whole thing looks like: which model it is, what the AIC is, whether or not it's a singular fit, whether or not I have convergence warnings. Speeding up a little bit, I'm gonna find the best model. The best model turns out to
be number 19 and hmm, oh. Let me just make sure I
got the right one there. Sorry about that, folks. Huh, well, we'll see what happens. Oh, no, that's right. Sorry, I just misremembered. It has intercept. I could look at this again. This is the summary. No, this is just, yes,
this is the printout. This has independent effects at the biome realm
interaction level and sorry. Independent values at this level and intercepts only at the
biome and floristic realm. Some quick diagnostics. These are gonna be small again. There's my fitted versus residual. There's my scale location plot. It doesn't look very different. There's my influence plot. Nothing terribly alarming. There are some larger
Cook's distances here. And this is what DHARMa says. So the fit of the distribution still looks good, but that's actually not
something I care about a huge amount, but it looks like I've
got a problem with biomes. I'm gonna claim that
that's a false positive. That that's actually not
something I need to worry about because, okay, sorry, I'm gonna plot this. This is plotting against
the fitted values. I'm gonna plot this. If I plot it against NPP
instead, I still have a problem. So this is NPP on the x-axis
and residuals on the y-axis and quantile regressions are the lines, and this looks like
I've got a big problem. DHARMa computes the residuals, ignoring the random effect component. And that's a good default
because in a lot of cases, the random effects can lead
to artifacts in the residuals that make you worried for no good reason. But I'm going to argue otherwise. I'm gonna show you some complicated code, but I'm basically just gonna
show you the picture here. So this is on the left-hand side. These are residuals plotted, without including any
of the random effects. NPP on the x-axis,
residuals on the y-axis. These are now regular residuals, not residuals scaled between zero and one. So the pattern is not quite the same, and this is the loess fit. But clearly, if
I were looking at this, I would say we've clearly got a problem. These are the residuals taking the random effects into account, and most of the problem goes away. So this is basically saying that we're underpredicting some hot spots and we're underpredicting some cold spots, but once we include the among biome and among biome by realm variation, we're actually accounting for that. Okay, I am actually almost done. So I think that's mostly a
model I'm comfortable with. The last thing that I want
to check is geographic, spatial autocorrelation. So I'm extracting the residuals, and I'm doing a plot of the residuals with positive residuals in blue and negative residuals in red, and the size and the intensity both increasing with the magnitude of the residuals. You should please ignore that
big red data point there. I actually very recently, I think I might want to get rid of that. The salient point here,
these are the residuals, and there's clearly
spatial pattern left here. I am clearly overpredicting in a bunch of the Neotropical rainforest and underpredicting in most
of the Southern Neotropics and underpredicting in Central America, even though I've taken
the realms and the biomes and the biomes by realms into account. So I need to deal with that, and there are fancier, more formal ways to do this. I'm switching to the gamm4 package. gamm4 is basically lme4, but you can stick in the
various smooth functions that the mgcv package provides. If you have no idea what I'm
talking about, I'm sorry. This is a little bit of a pain, because if I want to fit a spatial model to these values, I have to do something about projection and the fact that I've got data all over a sphere; these points are actually... Kamchatka is actually close to Alaska even though it doesn't look
like it on this picture. gamm4 allows me to add a spherical smooth to my model. So I'm taking the formula that I had, adding a spherical smooth to it, and fitting the model.
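A sketch of that gamm4 call, assuming latitude/longitude columns and the hypothetical names used above ("sos" is mgcv's spline-on-the-sphere basis):

```r
library(gamm4)
m_sp <- gamm4(log_bird_rich ~ scale(log(area)) +
                (log_npp + log_fire + npp_cv + fire_cv)^2 +
                s(lat, lon, bs = "sos"),
              random = ~ (1 | biome) + (1 | realm) + (1 | biome_realm),
              data = dd)
summary(m_sp$gam)  # the smooth term; m_sp$mer holds the lme4 side of the fit
```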
And that's actually gonna take a little while; now I'm stuck with it. I will show you that it doesn't make a huge
difference in my conclusions, but it's nice to have taken care of it. This is the comparison dot whisker plot between gamm4 and lme4,
didn't make a huge difference. Gamm4 is kind of a pain to work with. The utilities that I included here actually make it a little
bit easier to work with. And I'm gonna switch back to the pictures for the last few minutes,
just to accelerate slightly. So this is the coefficient
plot that we put in the paper, which is basically the same
thing that you were looking at with nicer labels and a different order and showing the results
for all three taxa. You can compute R squared values, and there's a package which
I've hacked a little bit which will do this, which will compute the
R squared of the model and the partial R squared for each term. There are double dangerous-bend signs in here
some kind of funny things. R squares are very hard for mixed models and I'm not actually
absolutely comfortable, but roughly speaking, they mirror the absolute values
of the coefficient value. So NPP is the biggest fire
is the next biggest area, which was a negative value
is now a positive R squared and it's the next largest effect. This is what the paper says when we look at it for all three taxa. Emmeans at effects and ggeffects, there's a predict method for these models, but it definitely makes it
easier to get the predictions if you use one of these add-on packages. So I'm constructing a vector of 51 NPP values from min to max, and then I'm asking emmeans to extract the predictions, and plotting the points and the predictions.
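A sketch of getting those predictions with emmeans (hypothetical names; 51 evenly spaced values of log NPP):

```r
library(emmeans)
npp_vec <- seq(min(dd$log_npp), max(dd$log_npp), length.out = 51)
pred <- as.data.frame(emmeans(m1, ~ log_npp, at = list(log_npp = npp_vec)))
plot(dd$log_npp, dd$log_bird_rich)              # the points
lines(pred$log_npp, pred$emmean, col = "blue")  # the predictions
```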
There's also a package on GitHub for getting the partial residuals, which is basically a subtraction, and I really will be done in one minute, I think. This plot shows NPP on
the x-axis, log NPP, and predicted log bird diversity or observed log bird diversity. But the variability in this point cloud includes all the variability
in all the other predictors that we're just squashing
into two dimensions now. So arguably, it looks a little
bit worse than it should. The partial residuals are
taking the predicted values for each data point of all
of the non-focal predictors and subtracting them. So this is in principle, the variation. If my model is okay, it's the variation only due
to NPP and residual variation. I haven't looked at random
effects very much here because that wasn't the
primary goal for this model. For the random effects at the biome-by-realm level, we have one for every biome-realm combination. So this is all of the variability of the effect of the inter-annual variability of NPP in a particular biome by floristic realm combination. For the biome level and the realm level, we only have an intercept, so it's a lot easier to look at. We would have liked to look at the... we would have liked to dissect
this variation a lot more and look at the across biome variation in the effect of fire. But honestly, even
though this is a very big painstakingly synthesized dataset, it's not really big enough. We don't really have a lot of signal left at the level of these to examine. We're mostly using the random effects as a statistical independence term, not to explore the relationships. I will take questions now. - All right. Thank you, Ben. So I'm looking at the link
at pollev.com, FI2021, and I'm ordering the questions
according to the ones that were most upvoted. So the first one. Is it valid to use AICC or
other information criteria to choose between allowing
variation of intercept only versus also allowing slopes to vary? - I don't think anybody really knows. I don't think any... I'm not aware that anybody's really
evaluated that carefully. I would say there were two issues. One issue is that we actually
already know that the AIC is a little bit suspect for
this case, because the AIC, the derivation of the AIC assumes that the models are all in the middle of their parameter space, and none of the models,
that variation, sorry. That none of the models are at the edge of the overall parameter space where the overall parameter space is the variance covariance space, and you can't have a
variance less than zero. So some of the models with
zero variances in them are not technically valid,
even for doing AIC comparisons. In general for p values, that tends to make AIC a
little bit conservative. So it's likely to say that
these extra terms are not necessary. So you would perhaps throw
them away unnecessarily. I guess what I'm worried about with AICC is that you're kind of stacking issues. We don't really know how well AICC works for nonlinear, for GLMs. We're not really... AICC was derived for simple linear models so that, sorry, I'm not
being very clear here. I think it should be valid, but if you're dealing with a dataset that's small enough that you need AICC, then I'm a little bit worried. It would tend to make
you more conservative, and that might make you very conservative, and I might worry about
being that conservative. That was a long, not really clear answer. Next. - Can you just elaborate
a little bit more on that and say how to determine if you should choose a model that has variation of intercepts only versus allowing slopes to vary as well? - So Barr et al. don't think you should. Barr et al. think you should
use the most complicated model that works, that computes adequately and isn't singular. Matuschek et al. do some simulations
cutoff of 0.1 or 0.2. So they use a kind of
loose p value criterion. I did when I was doing this analysis, and I don't know that I
can pull it up immediately. I think you do, in any case, want to check, if you can, that your selection of random variables isn't making a huge difference. I'm gonna share my screen for a second. This is part of the
bigger, more complicated, messy workflow that I did
for the whole project. These are the fixed
effects for all 27 models, showing which ones are
singular and not singular and showing the variability
of the fixed effect across all of these. So I mean, I think I
would hesitantly use AIC to select these things, and I might use AIC precisely because it's the least
conservative of AIC, BIC, and AICC, and therefore it's gonna lead you to include the random
effects in your model when they're on the edge, which I think is actually a good idea. - Fantastic. All right. The other question. What are your thoughts
on the pros and cons of fitting these types of models in a Bayesian versus maximum
likelihood framework? - That's a simple question
without a short answer. I think the Bayesian models are... I think if I could snap my fingers
with the same speed, convenience, and simplicity as fitting a frequentist model, I would generally go with the
Bayesian model all the time. I have in my extras of my slides, I said, Bayesian approach
is slower but you get more. So it might take 10 or
20 or 100 times longer to fit the model, but it
makes it easy to regularize. It avoids the "which random effects am I going to put in?" question. You put them all in, appropriately regularized; you put a strong enough prior on all of them that it's sensible, and you
don't run into the boundaries and you don't get zeros and you don't get bad stuff happening. It's nice to be able to
include informative priors and they handle uncertainty better. Most of the confidence intervals that I showed you in the predictive plots don't account for the uncertainty in the random effects
component of the model. So they're probably a
little bit too narrow, and the only way around
that in the frequentist case is parametric bootstrapping,
which is really slow. So yeah. - Great. Another question here is, what are the best practices
for when you have covariates that are measured at different scales? For example, you could
have some predictors that are measured at the biome level and some that are measured
at eco region level. - What's really nice about these models is that you can pretty much, that's not really something
you have to worry about. You can pretty much
toss in the predictors, allow for variability at whatever levels are
appropriate and identifiable, and the model structure
will take care of it. And by the way, if I'm saying something
that's absolute garbage, I'm hoping Denis will stop me. - So no garbage. I haven't heard anything. But so the question here is, every time I apply the
mixed effects model, I receive a singular fit warning. How do I cope with a singular fit? - So unfortunately, you
have lots of choices. The model with a singular
fit is probably overfitted, is probably too complicated,
might be lowering your power. But as far as we know, it's actually a well-posed
statistical answer. It just has some of the
variances set to zero. So and I guess I would clarify, if this is an intercept-only model, if you've already
simplified the random effect as much as you can. So if you haven't simplified the model, if you're trying to fit a
five-by-five covariance matrix, then try making it diagonal,
try making it intercept-only. I mentioned in my extras, there's an option called compound symmetry, where you could assume that all the pairwise correlations are the same. That's another way to simplify it. And Maeve McGillycuddy in New South Wales is working on... so there are other ways to simplify, but let's get back to assuming you have an intercept-only model. You can simply say, hey, the best estimate
of this variance is zero, and if that was the last
random effect in my model, then I guess I'm gonna fit a generalized, I'm gonna fit a linear model or a generalized linear model. I've estimated the among
group variance is zero, and so leaving it out of my model is basically gonna give me the same answer as leaving it in my model and having lme4 yell at
me that it's singular. You could also regularize. So there's a package called blme, which is a Bayesian overlay (this is in the extras as well), which basically adds the weakest possible regularizing prior, a prior on the variance terms that prevents them from being estimated as exactly zero, and you could go forward with that.
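A sketch of that regularized fit; blme::blmer() is a drop-in replacement for lmer() (the model and data names are hypothetical):

```r
library(blme)
## same formula as lmer(), but with a weak default prior on the covariance parameters,
## which keeps variances away from exactly zero
m_blme <- blmer(y ~ x + (1 | g), data = dd)
```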
It's a little bit of an inferential no-man's-land, because you're regularizing the model, but you're estimating the most likely value; you're not estimating a
mean as you normally would in Bayesian statistics. You're not sampling. It's just doing an optimization. So it's a possibility. I would be tempted to, if
you got one intercept level random effect in your
model and it's singular, I would carefully explain to the reviewers that that variability
was estimated as zero so I'm taking it out of the model. - Great. There are some questions also regarding how you scaled the different-- - So the statistician's
answer would be, well, I guess you better get
some better data then. Sorry. Questions. - Yeah. That's right. So there's a question regarding how you scaled your variables. So it seems like you log-scaled
some of the variables, but what about other types of scaling? For instance, the question asks, Z-score scaling for the NPP data. - Right, we actually did both. So I actually took the log. And so we went back and forth and we tried a couple of different things and I will comment that
none of this affects, well, non-linear transformations certainly affect the fit of the model. Fitting on the log scale
or the linear scale are asking different questions. Centering and scaling, so
Z-scaling or centering alone change the parameter estimates and the interpretation of the parameters, but they don't change the predictions, they don't change the overall fit. What we actually did was
log transform NPP and fire, not log transform the
coefficients of variation because they're essentially
already unitless. They're already ratios of
standard deviations to means, and then we centered everything
and scaled everything. But there's an argument about whether in the
presentation of the coefficients, we actually removed the scaling because we wanted them to be
interpretable as elasticities. It's a little tricky. The default is that Z-scaling is
sort of always a good idea, but if you've thought
about the default carefully and decided that you don't
want to use the default, go for it. - So essentially, you did the Z-score even after taking the logs, right? First the log, and then the Z-score. - And the question is what that makes it. So what is this variable? Well, this variable is the deviation around the geometric mean measured in units of one
standard deviation of the log. Okay, that's nice. The only point is that
we can then put them on a common scale; we can then compare the
coefficients directly, and that's a measurement of magnitude. The reason we decided in
the paper to unscale them when presenting the coefficients is that we're also presenting
the R squared values, which are essentially a measure, a unitless measure of magnitude. And there's actually,
if we were doing, sorry. I'll say one more thing, and then we can get on
to the next question. If we were doing... and Schielzeth's 2010 paper says this very nicely... if we were fitting regular old linear models, then the squared Z-scored parameter applied to the Z-scored predictor would be exactly the partial R squared. But with mixed models, the R squared is more complicated. - Fantastic. There's apparently a lot of
interest regarding your comment
differences between groups using mixed effects models. So I think the question
is really wanting you to expand on why that would be the case. - Sure. So this gets back to sort of
proper statistical philosophy, which I don't really do, but
I'll give you my understanding. The reason that we can get away, so when we estimate the random effect, so I'm talking about
something like the variation in the effect of fire across ecosystems, we're treating that variation
as a random variable. In the frequentist world, there's a difference between a parameter and a random variable. And a parameter is a thing
that has a true value, and a random variable is a
thing that has a distribution. So when Doug Bates talks
about the random effects, he likes to call them
the conditional modes of the distribution of this effect, meaning this is a random variable, but given the parameters I've estimated and given the data I've seen, that conditional distribution
moves to look like the data, and we can estimate a conditional mean and a conditional standard deviation, but those things don't have
the same inferential properties as a proper frequentist parameter and a proper frequentist estimate, which has a sampling distribution and a bunch of stuff we
could prove things about. In the Bayesian world, this is another thing that's
nice about the Bayesian world, in the Bayesian world, a parameter is a parameter is a parameter, and either it has a
hyper parameter or not. So if we wanted to test
statistical differences between groups, we could do that, but now the Bayesians don't like p values so you're kind of back where you started. I want to share my screen
for one more second to make one more conceptual point, which is that I think people
are way more concerned than they should be about testing. You know, who cares that... Let's say I take these
standard deviations, these 95% intervals of the
conditional distributions. And I say, oh, well, tropical
or subtropical moist forest is significantly different
from the average of all biomes, but tropical grasslands aren't. I don't think that's a very interesting... This is purely opinion,
but I would question, I would want to have a
conversation with you about why you would want to know that one particular group
was significantly different from another particular group. You might have a good
reason, but you might not. I know that my supervisor told me to. That's the best reason really. - Yeah, I think there
are a couple of questions that were actually focused on that topic of how to test difference
between groups and all that. - And I mean, another way
to put it is there's no... What estimating random
effects as a random effect means you get to pretend that you're only estimating one parameter instead of estimating a
parameter for every level. So you're estimating a variance for an intercept level, for
an intercept random effect. There's one parameter, one top
level parameter in the model, which is the variance. And you don't... All the values of the individual groups don't count against your parameter count and the no free lunch, what that costs, is that you can't make inferences about differences among the groups. - Great. So another question here has to do with the R squared values
in marginal and conditional. What are the arguments against these, your double dangerous-bend road signs? Are there any useful alternatives
that can be presented to approximate goodness
of fit for mixed models? - So I think the conditional
and marginal R squared values actually are not bad. And they're generally only weird when you're in one of those
overfitted model, trying, pushing the data too hard situations. So in general, I think
that the conditional and marginal R squareds
are pretty reasonable, and I would never argue about them. The double dangerous bands here have more to do with the
partial R squared values, which I computed, which are not included in the
standard Nakagawa, Schielzeth, Johnson, Lefcheck, et cetera, conditional and marginal
R squared framework. I'm computing an R
squared for each parameter, for each effect, and that's done with... I think I didn't include the reference, but there's a series of papers by, I think, Jaeger and Edwards and others, where they come up with some frameworks; they're very nice papers, and I can try to post them in the chat or send them along later. They come up with frameworks for estimating partial R squareds, and there are a few
different ways to do it. And when I used the way that I thought looked most appropriate, I got a total model R squared that was smaller than the R squared for one of the parameters. And I actually talked to the author, and he said, yeah, when I do this, I see that
happening sometimes too. I'm not quite sure I
understand why that happens. And so I backed off to a different method, which gives me more sensible answers, but that may explain a little bit why I'm expressing
caution about using these. - All right. There's a question here: does it make sense to regress the conditional modes of random terms against another variable measured at the group level? So for example, if you have a random intercept that varies from species to species, regressing that intercept as a function of, say, species traits. - That's generally a bad idea. There's papers by, I think, Jarrod
Hadfield and other people. So the conditional modes more traditionally go
by the jargon of BLUP, which is best linear unbiased predictor, but Doug Bates doesn't like it because in the GLMM framework, they're not linear and
they're not unbiased and they might not be best. So he likes to call
them conditional modes, but for searching the literature, again, I think Hadfield's paper is probably the one I would find. I'm sure that there's a reference. I'm not sure that there's
a reference to this in the GLMM FAQ. That's generally dicey. I would try... I can't say
off the top of my head, but generally, if you can incorporate the question you're interested in, the relationship you're interested in as part of the model, yeah. My answer is I can see the
question you're asking. I know that there are pitfalls in taking the conditional modes or BLUPs as input variables for another analysis, and I'm not quite sure how to answer to address the question that
this person wants to address. - So is it your comments perhaps ideally that person would include a species traits directly into the first
model rather than-- - Yes, but I'm not sure. Yes, but I'm not sure if
they would get an answer to the question they want. So there's an American Naturalist 2010 paper; Hadfield is the first author on the article. The title is "The misuse of BLUP in ecology and evolution." And I don't remember the details, but that's where I would start. - So I'm looking here. Is it valid to use means separation using emmeans with glmer in lme4? - Repeat the question, which I probably still won't
understand, but I'll try again. - Yeah, it just says, if it's
valid to use mean separation using emmeans within glmer in lme4. - So I will take a guess
at what the question means. So emmeans, so if I mean separation, you're saying something about, let me take the predicted
values for some pair of means and see whether they're
significantly different from each other or not. Emmeans is reliable for glmms with, well, a couple of caveats. One is that emmeans in general, uncertainties ignore the uncertainty in the estimates of the variance, like I said before about
the predictive ribbons that I was drawing. So it probably underestimates
the uncertainty slightly. And I was gonna talk
about bias correction, but that's not actually relevant. Other than that, the
underlying linear machinery of glmms is sufficiently
the same as that of LMMs that doing those comparisons
will probably work. But the caveat there is
I'm not absolutely sure what mean separation
means in this context. So I might've answered the wrong question. - No problem. Another question here, and I think this is gonna
be our last question, because it's almost 1:30. Should you always take
coefficients from the full model or use model averaging via AIC or from the best model
estimated by model selection? - What a big can of worms. Sorry. If you are predicting and you don't care about
your confidence intervals being a little bit too narrow, then model averaging is a good idea. If you actually want to
estimate effect sizes and do inference and calculate p values, then you should always, in my opinion, you should always use
basically the biggest model that works or the AIC selected model. I think you should never do selection on the focal parameters. So like, if you're interested
in the fixed effects, doing selection on which
fixed effects to include and then using those coefficients; don't do that. So: model averaging or some other form of regularized estimation if you're trying to do prediction and you don't care about good uncertainty estimates, and the full model otherwise. - Great. All right. So I think that was great. Thank you, Ben, for agreeing to present, and thanks everybody for
participating in the event. I would just like to make a final comment that this is a seminar series, so we're gonna have additional presenters, and we hope you can join us for the seminars that we're gonna have. The next one is on
December 6th at 9:00 a.m. because our presenter is
coming from Australia, and he's gonna be talking
about species archetype models and regions of common profile models. So thank you all, and thanks again, Ben. That was really great. - Thank you. I talked fast enough, right? - It went well.
- Goodbye. Well, goodbye to everybody. Hello and goodbye to the people I know. Enjoy your afternoon or evening or whatever it is where you are. - All right. Bye-bye.