library(nlme); library(performance); library(emmeans)
library(dplyr); library(broom.mixed); library(ggplot2)
11 Repeated measures mixed models
In the previous chapters we covered how to run linear mixed models for different experiment designs. All of the examples in those chapters were independent-measures designs, where each subject was assigned to a different treatment. Now we will move on to experiments with repeated measures random effects.
Studies that involve repeated observations of the same experimental units require a repeated measures component to properly model the correlations across time within each experimental unit. This is common in studies that are evaluated across multiple time periods. For example, if samples are collected over several time periods from the same subject, we have to account for the repeated measures effect while analyzing the main effects.
In these models, the ‘iid’ assumption (independently and identically distributed errors) is violated, so we need to introduce specialized covariance structures that can account for the correlations between error terms.
There are several types of covariance structures:
Structure name | R function | Assumption |
---|---|---|
Autoregressive (AR1) | corAR1() | observations that are closer in time are more correlated than observations that are farther apart |
Compound symmetry | corCompSymm() | correlation is equal for all time gaps between observations |
Unstructured | corSymm() | correlations differ for each time gap |
To read more about selecting an appropriate covariance structure based on your data, please refer to this link.
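To build intuition, here is a small illustration (not from the original analysis) of what an AR1 correlation matrix looks like for four equally spaced time points, using an arbitrary correlation parameter rho = 0.6:
# AR1 correlation: rho^|i - j| for time points i and j (rho = 0.6 is chosen only for illustration)
rho <- 0.6
round(outer(1:4, 1:4, function(i, j) rho^abs(i - j)), 2)
Each additional step apart in time multiplies the correlation by rho, so distant observations are less correlated; under compound symmetry, all off-diagonal entries would instead be equal.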
The repeated measures syntax in nlme follows this convention: form = ~ time|grouping. You can also use form = ~ 1|grouping, in which case the order of the observations within each group is used as the time index. The default starting value (value) is zero, and if fixed = FALSE (the current nlme default), this value will be allowed to change during the model fitting process.
There are several other options in the nlme machinery (search “cor” for more options and details on the syntax).
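As a minimal illustration of this syntax (using hypothetical variable names time and subject rather than columns from any data set in this chapter):
# AR1 correlation among observations ordered by 'time' within each 'subject'
cs_example <- corAR1(form = ~ time|subject, value = 0.2, fixed = FALSE)
# compound symmetry with the same grouping
cs_example2 <- corCompSymm(form = ~ time|subject, value = 0.2, fixed = FALSE)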
Fitting models with correlated observations requires packages such as ‘mmrm’ and ‘nlme’. The lmer() function from the ‘lme4’ package allows random effects only and cannot model correlated residuals.
In this tutorial we will analyze data with repeated measures from different experiment designs, including a randomized complete block design, a split-plot design, and a split-split plot design.
For the examples used in this chapter we will fit models using the mmrm and nlme (lme()) packages. So, let’s start with loading the required libraries for this analysis:
library(mmrm); library(nlme); library(performance); library(emmeans)
library(dplyr); library(broom.mixed); library(ggplot2)
12 Example Analysis
First, we will start with an example from a randomized complete block design with repeated measures.
12.1 RCBD Repeated Measures
The example shown below contains data from a sorghum trial laid out as a randomized complete block design (5 blocks) with a variety treatment effect (4 varieties). The response variable ‘y’ is the leaf area index assessed in five consecutive weeks on each plot.
We need to have time as both a numeric and a factor variable. In the model, week is used as a factor (factweek) to assess the week effect. For the correlation structure, week needs to be numeric (week).
dat <- agriTutorial::sorghum %>%
  mutate(week = as.numeric(factweek),
         block = as.character(varblock))
block | blocking unit |
Replicate | replication unit |
Week | Time points when data was collected |
variety | treatment factor, 4 levels |
y | leaf area index (response variable) |
12.1.1 Data Integrity Checks
Let’s do a preliminary data check, including evaluating the data structure, distribution of treatments, number of missing values, and distribution of the response variable.
str(dat)
'data.frame': 100 obs. of 9 variables:
$ y : num 5 4.84 4.02 3.75 3.13 4.42 4.3 3.67 3.23 2.83 ...
$ variety : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
$ Replicate: Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 2 2 2 2 2 ...
$ factweek : Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5 1 2 3 4 5 ...
$ factplot : Factor w/ 20 levels "1","2","3","4",..: 1 1 1 1 1 2 2 2 2 2 ...
$ varweek : int 1 2 3 4 5 1 2 3 4 5 ...
$ varblock : int 1 1 1 1 1 2 2 2 2 2 ...
$ week : num 1 2 3 4 5 1 2 3 4 5 ...
$ block : chr "1" "1" "1" "1" ...
In this data, we have factplot and factweek as factor variables, block as a character variable, and y and week as numeric.
table(dat$variety, dat$block)
1 2 3 4 5
1 5 5 5 5 5
2 5 5 5 5 5
3 5 5 5 5 5
4 5 5 5 5 5
The cross tabulation shows an equal number of observations for each variety in each block.
ggplot(data = dat, aes(y = y, x = factweek, fill = variety)) +
geom_boxplot() +
#scale_fill_brewer(palette="Dark2") +
scale_fill_viridis_d(option = "F") +
theme_bw()
It looks like variety ‘1’ has the lowest leaf area index and shows a drastic reduction over the weeks compared to the other varieties.
One last step before we fit the model is to look at the distribution of the response variable.
hist(dat$y, main = "", xlab = "leaf area index")
12.1.2 Model Building
Let’s fit the basic model first using lme()
from the nlme package.
lm1 <- lme(y ~ variety + factweek + variety:factweek,
           random = ~ 1|block/factplot,
           data = dat,
           na.action = na.exclude)
The model fitted above doesn’t account for the repeated measures effect. To account for the variation caused by repeated measurements, we can model the correlation among responses for a given subject, which in this case is the plot (factplot).
By adding this correlation structure, we keep each plot independent but allow AR1 or compound symmetry correlations among the responses within a given subject. Here the time variable is week, and it must be numeric.
cs1 <- corAR1(form = ~ week|block/factplot, value = 0.2, fixed = FALSE)
cs2 <- corCompSymm(form = ~ week|block/factplot, value = 0.2, fixed = FALSE)
In the code chunk above, we defined two correlation structures: AR1 and compound symmetry. Next we will update the model lm1 with each of these structures. In nlme, see the help page for corClasses to learn more about the available correlation structure classes.
lm2 <- update(lm1, corr = cs1)
lm3 <- update(lm1, corr = cs2)
Now let’s compare how model fit differs among the model with no correlation structure (lm1), the model with the AR1 correlation structure (lm2), and the model with the compound symmetry structure (lm3). We will compare these models using anova() or the compare_performance() function from the ‘performance’ package.
anova(lm1, lm2, lm3)
Model df AIC BIC logLik Test L.Ratio p-value
lm1 1 23 18.837478 73.62409 13.58126
lm2 2 24 -2.347391 54.82125 25.17370 1 vs 2 23.18487 <.0001
lm3 3 24 20.837478 78.00612 13.58126
result <- compare_performance(lm1, lm2, lm3)
Some of the nested models seem to be identical and probably only vary in
their random effects.
print_md(result)
Name | Model | AIC (weights) | AICc (weights) | BIC (weights) | R2 (cond.) | R2 (marg.) | ICC | RMSE | Sigma |
---|---|---|---|---|---|---|---|---|---|
lm1 | lme | -50.5 (<.001) | -36.0 (<.001) | 9.4 (<.001) | 0.99 | 0.37 | 0.98 | 0.10 | 0.13 |
lm2 | lme | -77.5 (>.999) | -61.5 (>.999) | -15.0 (>.999) | 0.97 | 0.41 | 0.95 | 0.15 | 0.18 |
lm3 | lme | -48.5 (<.001) | -32.5 (<.001) | 14.0 (<.001) | 0.98 | 0.37 | 0.98 | 0.11 | 0.14 |
We prefer to choose the model with the lower AIC and BIC values. In this scenario, we will move forward with the lm2 model containing the AR1 structure.
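Before moving on, it can be helpful to inspect the estimated autocorrelation parameter (Phi) of the selected model. This is an optional check, not part of the original workflow, and the exact printout depends on your nlme version:
# estimated AR1 correlation parameter (Phi) stored in the fitted lme object
lm2$modelStruct$corStruct
# confidence intervals for the variance-covariance components, including Phi
intervals(lm2, which = "var-cov")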
Let’s run tidy() on the lm2 model to look at the estimates for the random and fixed effects.
tidy(lm2)
Warning in tidy.lme(lm2): ran_pars not yet implemented for multiple levels of
nesting
# A tibble: 20 × 7
effect term estimate std.error df statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 fixed (Intercept) 4.24 0.291 64 14.6 5.44e-22
2 fixed variety2 0.906 0.114 12 7.94 4.05e- 6
3 fixed variety3 0.646 0.114 12 5.66 1.05e- 4
4 fixed variety4 0.912 0.114 12 8.00 3.78e- 6
5 fixed factweek2 -0.196 0.0571 64 -3.44 1.04e- 3
6 fixed factweek3 -0.836 0.0755 64 -11.1 1.60e-16
7 fixed factweek4 -1.16 0.0867 64 -13.3 4.00e-20
8 fixed factweek5 -1.54 0.0943 64 -16.3 1.57e-24
9 fixed variety2:factweek2 0.0280 0.0807 64 0.347 7.30e- 1
10 fixed variety3:factweek2 0.382 0.0807 64 4.73 1.26e- 5
11 fixed variety4:factweek2 -0.0140 0.0807 64 -0.174 8.63e- 1
12 fixed variety2:factweek3 0.282 0.107 64 2.64 1.03e- 2
13 fixed variety3:factweek3 0.662 0.107 64 6.20 4.55e- 8
14 fixed variety4:factweek3 0.388 0.107 64 3.64 5.55e- 4
15 fixed variety2:factweek4 0.228 0.123 64 1.86 6.77e- 2
16 fixed variety3:factweek4 0.744 0.123 64 6.06 7.86e- 8
17 fixed variety4:factweek4 0.390 0.123 64 3.18 2.28e- 3
18 fixed variety2:factweek5 0.402 0.133 64 3.01 3.70e- 3
19 fixed variety3:factweek5 0.672 0.133 64 5.04 4.11e- 6
20 fixed variety4:factweek5 0.222 0.133 64 1.66 1.01e- 1
12.1.3 Check Model Assumptions
check_model(lm2, check = c('normality', 'linearity'))
12.1.4 Inference
The ANOVA table below suggests a highly significant effect of variety, week, and the variety × week interaction.
anova(lm2, type = "marginal")
numDF denDF F-value p-value
(Intercept) 1 64 212.10509 <.0001
variety 3 12 28.28895 <.0001
factweek 4 64 74.79758 <.0001
variety:factweek 12 64 7.03546 <.0001
We can estimate the marginal means for the variety and week effects and their interaction using the emmeans() function.
mean_1 <- emmeans(lm2, ~ variety)
NOTE: Results may be misleading due to involvement in interactions
mean_1
variety emmean SE df lower.CL upper.CL
1 3.50 0.288 4 2.70 4.29
2 4.59 0.288 4 3.79 5.39
3 4.63 0.288 4 3.84 5.43
4 4.61 0.288 4 3.81 5.40
Results are averaged over the levels of: factweek
Degrees-of-freedom method: containment
Confidence level used: 0.95
mean_2 <- emmeans(lm2, ~ variety*factweek)
mean_2
variety factweek emmean SE df lower.CL upper.CL
1 1 4.24 0.291 4 3.43 5.05
2 1 5.15 0.291 4 4.34 5.96
3 1 4.89 0.291 4 4.08 5.70
4 1 5.15 0.291 4 4.35 5.96
1 2 4.05 0.291 4 3.24 4.85
2 2 4.98 0.291 4 4.17 5.79
3 2 5.07 0.291 4 4.27 5.88
4 2 4.94 0.291 4 4.14 5.75
1 3 3.41 0.291 4 2.60 4.21
2 3 4.59 0.291 4 3.79 5.40
3 3 4.71 0.291 4 3.91 5.52
4 3 4.71 0.291 4 3.90 5.51
1 4 3.09 0.291 4 2.28 3.89
2 4 4.22 0.291 4 3.41 5.03
3 4 4.48 0.291 4 3.67 5.28
4 4 4.39 0.291 4 3.58 5.20
1 5 2.70 0.291 4 1.89 3.51
2 5 4.01 0.291 4 3.20 4.82
3 5 4.02 0.291 4 3.21 4.83
4 5 3.83 0.291 4 3.03 4.64
Degrees-of-freedom method: containment
Confidence level used: 0.95
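A common follow-up, shown here as an optional sketch, is to run pairwise contrasts on these marginal means using the emmeans machinery:
# pairwise comparisons among varieties, averaged over weeks (Tukey-adjusted by default)
pairs(mean_1)
# pairwise comparisons among varieties within each week
pairs(emmeans(lm2, ~ variety | factweek))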
Here is a quick tip to make sure you are fitting the model correctly: keep two time variables in your data, one numeric (e.g. ‘day’ as a number) and one as a factor/character (e.g. ‘day_factor’). The numeric variable is used for fitting the correlation structure, and the factor/character variable is used in the model statement to evaluate the effect of time on the response variable.
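For example, with a hypothetical data frame df containing a single time column day, you could create both versions like this:
# hypothetical example: keep both a factor and a numeric version of the time variable
df$day_factor <- as.factor(df$day)            # factor version: used in the fixed-effects formula
df$day_num    <- as.numeric(df$day_factor)    # numeric version: used in the correlation form (~ day_num|subject)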
12.2 Split Plot Repeated Measures
Recall that we evaluated a split-plot design in Chapter 5. In this example we will use the same methodology from Chapter 5 and update it with a repeated measures component.
Next, let’s load “Yield” data. It is located here.
Yield <- read.csv(here::here("data/Yield.csv"))
This example contains yield data in a split-plot design. The yield data was collected repeatedly from the same Reps over 5 Sample_times. In this data set, we have:
Rep | replication unit |
Variety | Main plot, 2 levels |
Fertilizer | Split plot, 3 levels |
Yield | crop yield |
Sample_time | time points for data collection |
12.2.1 Data Integrity Checks
First, we need to look at the class of the variables in the data set.
str(Yield)
'data.frame': 120 obs. of 6 variables:
$ Sample_time: int 1 1 1 1 1 1 1 1 1 1 ...
$ Variety : chr "VAR1" "VAR1" "VAR1" "VAR1" ...
$ Fertilizer : int 1 1 1 1 2 2 2 2 3 3 ...
$ Rep : int 1 2 3 4 1 2 3 4 1 2 ...
$ pH : num 7.07 7.06 7.08 7.09 7.13 7.12 7.15 7.14 7.18 7.18 ...
$ Yield : num 0.604 0.595 3.145 3.091 2.415 ...
We will now convert Variety, Fertilizer, and Rep into factors. In addition, we need to create a new factor variable (Sample_time1) to analyze the time effect.
For lme(), independent variables in character or factor form work fine, but for mmrm() the independent variables must be factors. Thus, for the sake of consistency, we will use independent variables of factor class.
Yield$Variety <- factor(Yield$Variety)
Yield$Fertilizer <- factor(Yield$Fertilizer)
Yield$Sample_time1 <- factor(Yield$Sample_time)
Yield$Rep <- factor(Yield$Rep)
To fit the model, we also need to create a new variable named ‘plot’ that contains a unique value for each plot; the plot is the subject on which the repeated measures were taken. The plot variable is needed to model the variation within each plot over the sampling times, so it will be used as the subject of the repeated measures. The subject variable can be a factor or numeric, but the time variable (e.g., year or sample time) has to be a factor.
## creating a plot variable
Yield$plot <- factor(paste(Yield$Rep, Yield$Fertilizer, Yield$Variety, sep='-'))
Yield$Rep2 <- factor(paste(Yield$Rep, Yield$Variety, sep='-'))
table(Yield$plot)
1-1-VAR1 1-1-VAR2 1-2-VAR1 1-2-VAR2 1-3-VAR1 1-3-VAR2 2-1-VAR1 2-1-VAR2
5 5 5 5 5 5 5 5
2-2-VAR1 2-2-VAR2 2-3-VAR1 2-3-VAR2 3-1-VAR1 3-1-VAR2 3-2-VAR1 3-2-VAR2
5 5 5 5 5 5 5 5
3-3-VAR1 3-3-VAR2 4-1-VAR1 4-1-VAR2 4-2-VAR1 4-2-VAR2 4-3-VAR1 4-3-VAR2
5 5 5 5 5 5 5 5
table(Yield$Fertilizer, Yield$Variety)
VAR1 VAR2
1 20 20
2 20 20
3 20 20
Looks like a well balanced design with 2 variety treatments and 3 fertilizer treatments.
Before fitting a model, let’s check the distribution of the response variable.
hist(Yield$Yield)
12.2.2 Model fit
These data can be analyzed using either nlme or mmrm.
Using lme() from the nlme package:
Let’s say we want to fit a model using the AR1 structure as shown in the RCBD repeated measures example. Previously, we used lme() from the nlme package to fit the model. In this example, along with lme() we will also use the mmrm() function from the mmrm package. In addition, instead of the summary() function we will use the tidy() function from the ‘broom.mixed’ package to look at the estimates of the fixed and random effects. tidy() supports a tidy workflow by providing standardized verbs that return the estimates, standard errors, confidence intervals, etc.
corr_str1 = corAR1(form = ~ Sample_time|Rep/Variety/plot, value = 0.2, fixed = FALSE)

mod <- lme(Yield ~ Sample_time1*Variety*Fertilizer,
           random = ~ 1|Rep/Variety/plot,
           corr = corr_str1,
           data = Yield, na.action = na.exclude)
#summary(mod)
tidy(mod)
# A tibble: 30 × 7
effect term estimate std.error df statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 fixed (Intercept) 1.86 0.708 72 2.63 0.0105
2 fixed Sample_time12 0.515 0.688 72 0.748 0.457
3 fixed Sample_time13 0.787 0.674 72 1.17 0.247
4 fixed Sample_time14 1.35 0.675 72 2.00 0.0496
5 fixed Sample_time15 2.84 0.675 72 4.21 0.0000731
6 fixed VarietyVAR2 -0.996 0.861 3 -1.16 0.331
7 fixed Fertilizer2 1.27 0.861 12 1.47 0.167
8 fixed Fertilizer3 2.07 0.861 12 2.40 0.0333
9 fixed Sample_time12:VarietyVAR2 0.739 0.974 72 0.759 0.451
10 fixed Sample_time13:VarietyVAR2 0.269 0.954 72 0.282 0.779
# ℹ 20 more rows
fit1 <- mmrm(
  formula = Yield ~ Sample_time1*Variety*Fertilizer + ar1(Sample_time1|Rep/plot),
  data = Yield)
tidy(fit1)
# A tibble: 30 × 6
term estimate std.error df statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2.86 0.464 12.7 6.16 0.0000387
2 Sample_time12 0.656 0.310 1.81 2.12 0.182
3 Sample_time13 1.40 0.414 2.29 3.39 0.0636
4 Sample_time14 1.46 0.484 2.87 3.01 0.0605
5 Sample_time15 2.47 0.549 3.14 4.50 0.0186
6 VarietyVAR2 -1.07 0.656 12.7 -1.63 0.128
7 Fertilizer2 1.67 0.656 12.7 2.55 0.0245
8 Fertilizer3 0.595 0.656 12.7 0.908 0.381
9 Sample_time12:VarietyVAR2 -0.591 0.438 1.81 -1.35 0.321
10 Sample_time13:VarietyVAR2 -0.412 0.586 2.29 -0.704 0.546
# ℹ 20 more rows
12.2.3 Model diagnostics
We will use check_model() from the ‘performance’ package to evaluate the fit of the model fitted using nlme (mod). However, the mmrm model class doesn’t work with the performance package, so we will evaluate the diagnostics for fit1 by plotting the residuals using base R functions.
check_model(mod, check = c('normality', 'linearity'))
plot(fitted(fit1), residuals(fit1), xlab = "fitted values", ylab = "residuals")
qqnorm(residuals(fit1)); qqline(residuals(fit1))
These diagnostic plots look good. The linearity and homogeneity of variance plots show no trend, and the points in the normal Q-Q plots fall roughly on a straight line, so we can be satisfied with the model fit.
12.2.4 Inference
anova(mod, type = "marginal")
numDF denDF F-value p-value
(Intercept) 1 72 6.899272 0.0105
Sample_time1 4 72 5.318690 0.0008
Variety 1 3 1.338879 0.3310
Fertilizer 2 12 2.936073 0.0916
Sample_time1:Variety 4 72 0.998154 0.4143
Sample_time1:Fertilizer 8 72 8.158884 <.0001
Variety:Fertilizer 2 12 0.237417 0.7923
Sample_time1:Variety:Fertilizer 8 72 0.731698 0.6631
#car::Anova(fit1, type = "III")
#Anova.mmrm(fit1, type = "III")
Next, we can estimate marginal means and confidence intervals for the independent variables using emmeans()
.
emmeans(mod,~ Fertilizer)
NOTE: Results may be misleading due to involvement in interactions
Fertilizer emmean SE df lower.CL upper.CL
1 2.44 0.472 3 0.937 3.94
2 6.69 0.472 3 5.192 8.20
3 8.06 0.472 3 6.554 9.56
Results are averaged over the levels of: Sample_time1, Variety
Degrees-of-freedom method: containment
Confidence level used: 0.95
emmeans(mod,~ Variety)
NOTE: Results may be misleading due to involvement in interactions
Variety emmean SE df lower.CL upper.CL
VAR1 6.09 0.438 3 4.69 7.48
VAR2 5.37 0.438 3 3.98 6.77
Results are averaged over the levels of: Sample_time1, Fertilizer
Degrees-of-freedom method: containment
Confidence level used: 0.95
emmeans(fit1,~ Fertilizer)
NOTE: Results may be misleading due to involvement in interactions
Fertilizer emmean SE df lower.CL upper.CL
1 3.27 0.246 8.02 2.70 3.83
2 7.93 0.246 8.02 7.37 8.50
3 8.51 0.246 8.02 7.94 9.07
Results are averaged over the levels of: Sample_time1, Variety, Rep
Confidence level used: 0.95
emmeans(fit1,~ Variety)
NOTE: Results may be misleading due to involvement in interactions
Variety emmean SE df lower.CL upper.CL
VAR1 6.89 0.2 8.02 6.42 7.35
VAR2 6.25 0.2 8.02 5.79 6.71
Results are averaged over the levels of: Sample_time1, Fertilizer, Rep
Confidence level used: 0.95
For further exploration of estimated marginal means and contrast statements, see the documentation for the emmeans package.
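For instance, pairwise fertilizer contrasts from the mmrm fit above can be requested with contrast() (an illustrative sketch using the objects already fitted in this section):
# pairwise fertilizer contrasts, averaged over sampling time, variety, and rep
contrast(emmeans(fit1, ~ Fertilizer), method = "pairwise")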
12.3 Split-split plot repeated measures
Recall that we evaluated a split-split plot experiment design in Chapter 5, where we had one factor in the main plot, another in the subplot, and a third factor in the sub-subplot. In this example we will be adding a repeated measures component.
phos <- read.csv(here::here("data", "split_split_repeated.csv"))
plot | experimental unit |
block | replication unit |
Ptrt | Main plot, 2 levels |
Inoc | Split plot, 2 levels |
Cv | Split-split plot, 5 levels |
time | time points for data collection |
P_leaf | leaf phosphorous content |
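The rest of this example follows the same workflow as the split plot case above. As a hedged sketch of how the model could be set up with lme(), assuming the column names listed in the table (the AR1 structure and nesting shown here are illustrative choices, not the chapter’s final analysis):
# illustrative sketch only: convert design variables to factors and build a numeric time index
phos$block <- factor(phos$block)
phos$plot  <- factor(phos$plot)
phos$time_factor <- factor(phos$time)             # factor version for the fixed effects
phos$time_num    <- as.numeric(phos$time_factor)  # numeric version for the correlation structure

# AR1 correlation across time points within each sub-subplot (the repeated-measures subject)
cs_phos <- corAR1(form = ~ time_num|block/Ptrt/Inoc/plot, value = 0.2, fixed = FALSE)

mod_phos <- lme(P_leaf ~ time_factor*Ptrt*Inoc*Cv,
                random = ~ 1|block/Ptrt/Inoc/plot,
                corr = cs_phos,
                data = phos, na.action = na.exclude)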
12.4 Factorial Design Repeated Measures
As explained in Chapter 6, factorial designs involve examining several factors simultaneously. In this example we will use the gregory.cotton data set from the agridat package:
https://search.r-project.org/CRAN/refmans/agridat/html/gregory.cotton.html
data3 <- agridat::gregory.cotton
str(data3)
'data.frame': 144 obs. of 6 variables:
$ yield : num 0.99 1.34 1.26 1.44 1.4 1.36 1.23 1.28 1.56 1.64 ...
$ year : Factor w/ 2 levels "Y1","Y2": 1 1 1 1 1 1 1 1 1 1 ...
$ nitrogen: Factor w/ 2 levels "N0","N1": 1 1 1 1 1 1 1 1 1 1 ...
$ date : Factor w/ 4 levels "D1","D2","D3",..: 1 1 1 1 1 1 1 1 1 2 ...
$ water : Factor w/ 3 levels "I1","I2","I3": 1 1 1 2 2 2 3 3 3 1 ...
$ spacing : Factor w/ 3 levels "S1","S2","S3": 1 2 3 1 2 3 1 2 3 1 ...
table(data3$nitrogen, data3$date)
D1 D2 D3 D4
N0 18 18 18 18
N1 18 18 18 18
The biggest advantage of mixed models is their incredible flexibility. They handle clustered individuals as well as repeated measures (even in the same model). They handle crossed random factors as well as nested ones.
The biggest disadvantage of mixed models, at least for someone new to them, is their incredible flexibility. It’s easy to mis-specify a mixed model, and this is a place where a little knowledge is definitely dangerous.