<- read.csv(here::here("data", "potato_tuber_size.csv")) potato
14 Variance & Variance Components
Mixed, hierarchical or multilevel models provide the advantage of being able to estimate the variance of random variables and model correlations within the grouping structure of random variables. Instead of looking at a variable as a collection of specific levels to estimate, random effects view variables as being a random draw from a probability distribution.
The decision of how to designate a variable as random or fixed depends on (1) your experimental aims with regard to inference and (2) your data structure. 1 There is a philosophical consider and practical consideration. The philosophical approach is that random variable represents a small sample of a population you want to make inference on. The practical consideration is that when there are few sample levels, the random estimation procedure is not very reliable for those conditions. Ben Bolker has an excellent summary that we strongly recommend that you read to learn more about this. Below is an excerpt from that on the consequences of too few levels to estimate a random effect:
1 This animation show how accounting for grouping structure as random effects can impact the results of a linear model.
Treating factors with small numbers of levels as random will in the best case lead to very small and/or imprecise estimates of random effects; in the worst case it will lead to various numerical difficulties such as lack of convergence, zero variance estimates, etc.. (A small simulation exercise shows that at least the estimates of the standard deviation are downwardly biased in this case; it’s not clear whether/how this bias would affect the point estimates of fixed effects or their estimated confidence intervals.) In the classical method-of-moments approach these problems may not arise (because the sums of squares are always well defined as long as there are at least two units), but the underlying problems of lack of power are there nevertheless.
14.1 Variance component estimation for 2+ random effects
14.2 Multiple crossed variance components
This is a common scenario: an experiment is conducted and their are multiple grouping levels, none of which are nested within each other, but the observations have multiple group memberships. A common example is a multi-environmental trial where multiple crop genotypes are evaluated at multiple locations and years. We can consider genotypes, locations and years as random effects depending on our experimental aims. Very few R packges can handle crossed random effects, but lme4 can!
The data used in this example is a collection of potato field trials conducted across 3 locations and 17 years. Each year, the same potato clones[^vc-1] are evaluated across all 3 locations, but each year, different clones are evaluated. Some clones are evaluted for multiple years, and a small number are evaluted each year. These potato clones were evaluated for the length-to-width ratio (LxW). We want to know how much location, year and clone contribute to this trait.
[^vc-1] A clone is a genetically distinct potato genotype that is vegetatively propagated. It may be a released variety or an experimental breeding line.
year | year of trial, 16 levels |
state | location of trial, 3 levels |
clone | potato genotype |
LxW | length-to-width ratio |
Number of observations for each location and year:
table(potato$year, potato$state)
Total number of clones evaluated:
length(unique(potato$clone))
Total counts for how often invividual clones were evaluated:
|> count(clone, name = "Frequency of Evaluation") |> count(`Frequency of Evaluation`) potato
Distribution of the dependent variable, LxW or length-to-width ratio:
hist(potato$LxW, ylab=NULL, xlab = NULL, main = "Potato Clone LxW Ratio", col = "turquoise", breaks = 20)
14.2.1 Model fitting
For this analysis, location (“state”) is a fixed effect because it only has 3 levels, and the researchers are only interested in understanding clone performance in those locations (Idaho, Oregon and Washington). The remaining effects are random because (1) they represent a sample of the full population that inference is desired for, and (2) each has a sufficient number of levels to make estimation feasible.
<- lmer(LxW ~ state + (1|year) + (1|state:year) + (1|clone),
potato_m1 data = potato)
::check_model(potato_m1, check = c('qq', 'linearity'), detrend=FALSE) performance
14.2.2 Inference on random effects
The variance components for each random effect can be extracted most easily using the tidy()
function from broom.mixed. We can also a bit of extra code to calculate the percent of variance explained by each component.
<- tidy(potato_m1, "ran_pars", scales = "vcov") |>
var_comps ::select(group, variance = "estimate") |>
dplyrmutate(`percent variance` = round(variance/sum(variance) * 100, 1))
var_comps
::: note-information, collapse=false ## log likelihood ratio tests ANOVA as classically defined (F-tests contrasting the between group and within group variation) is not an option for random effects. There are several ways to test if random effects are impactful on the model overall. One of the most reliable and popular methods is the log likelihood ratio test. In brief, a reduced model is refit from a full specified model omitting a random variable. The log likelihood from the two models (the full specified and reduced models) are compared and a p-value is computed for that difference given the change in number of parameters estimated. The null hypothesis is that the models are equivalent and the alternative hypothesis is that the models are not equivalent. Hence, low p-values provide evidence that the omitted factor is impactful on the dependent variable of interest. :::
The function ranova()
in lmerTest conducts log-likelihood ratio tests for all random effects in a model.[^vc-2]
ranova(potato_m1)
[^vc-2] It is also possible to conduct likelihood tests manually by constructing reduced models and comparing it to the fully specificed (or more specified) model. In such cases, the fixed effects need to be identical between models, and they need to fit using maximimun likelihood instead of REML.
It is possible to make inferences on specific levels of random effects. This take a different analytical approach than fixed effects and has a different term, predictions, or more specifically Best Linear Unbiased Predictions, commonly called “BLUPs” (rhymes with “cups”). The estimated marginal means from fixed effects are technically BLUEs, or Best Linear Unbiased Estimates. By their nature, BLUPs “shrink” the estimates towards zero, reducing their overall spread compared to fixed effects. VSNi has a short summary on BLUPs and BLUEs.
Recall that random effects are distributed with a mean of zero and a standard deviation, \(\sigma_b\) that is estimated in the model fitting procedure. We can add the overall model intercept to all BLUPs in order to shift them to a scale that may be more intuitive for some. Since this constant (the model intercept) was added to all BLUPs, the overall relationship between the them (in this example, potato clones) is unchanged.
# the model intercept
= fixef(potato_m1)[1]
intercept
# all random effects from the model
<- REextract(potato_m1)
random_effects
# filter to clone BLUPs and add the intercept
<- filter(random_effects, groupFctr == "clone") |>
clone_re rename(clone = groupID, BLUP = "(Intercept)", SE = "(Intercept)_se") |>
mutate(blup_adj = BLUP + intercept)
hist(clone_re$BLUP, ylab=NULL, xlab = NULL, main = "Clone Random Effects", col = "turquoise")
Below are the BLUPs (in red) and standard errors for each clone, arranged from lowest to highest length-to-width ratio. The horizontal gray dotted line indicates the average clone effect.
ggplot(clone_re, aes(x = reorder(clone, blup_adj), y = blup_adj)) +
geom_hline(yintercept = intercept, color = "gray", linetype = 2) +
geom_linerange(aes(ymax = blup_adj + SE, ymin = blup_adj - SE)) +
geom_point(size = 0.7, col = "red3") +
theme_classic(base_size = 14) +
ggtitle("length-to-width ratio") +
ylab("BLUP + intercept") +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.x = element_blank())
14.3 Nested variance components
It is possible to have a nested data set: learners in a classroom and classrooms in institutions, cows in dairies and dairies in counties or other regional indicators. These end up being truly hierarchical models, where we expect that there is some covariance between observations at the different levels. Cows in the same dairy are likely to be more correlated with each other than cows from another dairy. But how correlated? And how much does each grouping level influence the variation of whatever dependent variable is being studied.
This example is looking at pay at institutions of higher education in Washington. All annual earnings less than $25,000 were excluded from the data set, and only earners with a minimum of four years of data were included. Although this is public information, all individual names were anonymized.[^vc-3]
[^vc-3] This data set (source) was incompletely curated and hence no substantive conclusions should be drawn from this particular analysis. This example is for teaching and demonstration purposes only.
For this data set, we are concerned with what is contributing to employee salaries over time - to what extent did year, the institution itself and the individual employee contribute to salary changes over time. While this reflects a complete data set for Washington State (all institutions, all employees), the interest is inference for other U.S. higher education institutions. This is an example where we will use random slopes to examine how an individual’s salary changed over time. The data set:
|——-|——————————-| | agency | educational institution, 36 levels | | id | anonymized employee within each agency | | salary | annual salary | | year | year of salary, set to 0 for 2019 |
This is a rather large data set, and it requires quite a bit of computing power to analyze. Hence, the analysis was done locally, and the data set and model object were saved as a .RData file. Now let’s load this object into an R session.
load(here::here("data/salary_models.RData"))
14.3.1 Model fitting
Salary is very skewed, where the 70% of salaries are less than $100,000, but there are a few very high salaries, up to a maximum value of $3.1 million. We will log transform the data in a log-linear model for analysis. This is a special case of generalized linear mixed model (GLMM) and the only GLMM model used because it still follow the assumption of normally-distributed residuals terms when a the dependent variable is log-transformed.
hist(log(wa_salary$salary), ylab=NULL, xlab = NULL, main = "Washington Higher Ed Salaries", col = "turquoise")
boxplot(log(salary) ~ year, data = wa_salary, ylab=NULL, xlab = NULL, main = "Salaries by Year", col = "turquoise")
The analysis is derived from this model:
<- lmer(log(salary) ~ factor(year0) + (year0|agency/id),
m1 data = wa_salary, REML = FALSE)
Random intercepts and slopes were fit for agency and id nested within agency.
14.3.2 Inference
<- tidy(m1, "ran_pars", scales = "vcov") |>
var_comps_salary ::select(group, variance = "estimate") |>
dplyrmutate(`percent variance` = round(variance/sum(variance) * 100, 1))
var_comps_salary
# the model intercept
= fixef(m1)[1]
intercept
# all random effects from the model
<- REextract(m1)
random_effects
# filter to clone BLUPs and add the intercept
<- filter(random_effects, groupFctr == "id:agency") |>
clone_re rename(employee = groupID, BLUP = "(Intercept)", SE = "(Intercept)_se") |>
mutate(blup_adj = BLUP + intercept)
14.4 Variance estimation under heteroscedasticity
In the “general” linear model days, when a premium was placed on the i.i.d. error paradigm, if we did reject \(H_0\), it would set off a minor crisis in the form of a hunt for a variance stabilizing transformation. In contemporary modeling, we simply proceed with inference on estimable functions using the equal variance model.
– Walt Stroup, Marina Ptukhiuna and Julie Garai (2024), Generalized Linear Mixed Models, \(2^{nd}\) Ed, Section 7.2.3.1
In previous sections, we have assumed the error terms or residuals were “i.i.d.”, that is “independently and identically distributed. This means they were shared the same distribution (identical) and were uncorrelated (independent). Longitudinal studies, that is, those with repeated measures, do have correlated residuals, so we relax the independence assumption and model those correlations. However, residuals can be unexpectedly related to the their observations, particular treatments or the order data was gathered from the experimental units (among other causes). As mentioned in the previous quote, we now have tools for handling this rather than trying to transform the data. Below are examples on how to model heteroscedasticity.
14.4.1 Case 1: unequal variance due to a factor
This data set is from a set of canola variety trials conducted in a single year across multiple locations. The trials included 38 varieties that were evaluated at 9 locations using a RCBD design.
<- read.csv(here::here("data", "MET_trial_variance.csv")) |>
var_ex1 mutate(block = as.character(block)) |>
::drop_na() tidyr
Exploratory data visualizations indicate that the dependent variable, seed yield, varied greatly overall and certain locations had smaller variance compared to others.
hist(var_ex1$yield)
boxplot(yield ~ site, data = var_ex1)
The study is not fully crossed; all sites did not include all varieties, although there is substantial overlap. As a result, only variety and the site-by-variety interaction are included in the statistical model.
<- lme(yield ~ site:variety + variety,
m1_a random = ~ 1 |site/block,
na.action = na.exclude,
data = var_ex1)
The residual plot indicates some association between the residuals and fitted values.
plot(m1_a)
We can add a term to model the variance by site.
\[Var(\epsilon_{ij}) = \sigma^2 \delta^2_{s_{ij}} \] Details on the implementation can be found in (Pinheiro and Bates 2000).
<- update(m1_a, weights = varIdent(form = ~1|site)) m1_b
The function varIdent()
is used to set the stratifying variable at which to estimate variance. Like many functions in R, there are additional arguments to consider for more complex scenarios (type ?varIdent
in an R console to check).
<- update(m1_a, weights = varIdent(form = ~1|site)) m1_b
is equivalent to
<- lme(yield ~ site:variety + variety,
m1_b random = ~ 1 |site/block,
weights = varIdent(form = ~1|site),
na.action = na.exclude,
data = var_ex1)
The residual plot is now much cleaner. The result is a better-fitting model and with that, better inference for variety at the site level.
plot(m1_b)
anova(m1_a, m1_b)
14.5 Coefficient of Variation
The coefficient of variation can be manually calculated as thus:
\[ \frac {\sigma}{\mu} * 100 \]
<- mean(var_ex1$yield, na.rm = TRUE)
m1_ave = sigma(m1_b)/m1_ave*100
m1_cv round(m1_cv, 1)
However, in cases of unequal variance, the overall error term can be larger than expected under homoscedasticity (with varIdent()
) or much much smaller (e.g. with varPower()
). Interpret the coefficient of variation from mixed models with caution.