Field Guide to the R Mixed Model Wilderness
  1. 4  Model Prep & Workflow
  • Preface
  • 1  Introduction
  • 2  Zen and the Art of Statistical Analysis
  • 3  Mixed Model Background
  • 4  Model Prep & Workflow
  • Experiment designs
    • 5  Randomized Complete Block Design
    • 6  Factorial RCBD Design
    • 7  Split Plot Design
    • 8  Split-Split Plot Design
    • 9  Strip Plot Design
    • 10  Incomplete Block Design
    • 11  Latin Square Design
  • 12  Repeated Measures
  • 13  Marginal Means and Contrasts
  • 14  Variance and Variance Components
  • 15  Troubleshooting
  • 16  Additional Resources
  • References

Table of contents

  • 4.0.1 Step 1: Define the Research Question(s)
  • 4.1 Example experiment
    • 4.1.1 Step 2: Conduct data integrity checks
    • 4.1.2 Step 3: Fit the Model
    • 4.1.3 Step 4: Check model assumptions
    • 4.1.4 Step 5: Conduct inference
  • View source
  • Report an issue

4  Model Preparation and Flow

This chapter provides a brief introduction to the steps involved in data analysis using linear mixed models and some thoughts on data quality and data interpretation.

Note

The steps explained below were applied to all example analyses.

Analyzing data using linear mixed models involves several key steps, from data preparation to model interpretation. Here’s a structured approach:

graph TD
  A[Define Research Question] --> B[Evaluate Data Quality]
  B --> C[Fit Model]
  C --> D[Check Model Assumptions]
  D -- Met --> E[Conduct Inference]
  D -- Unmet --> F[Check data or consider others models]

4.0.1 Step 1: Define the Research Question(s)

It is important to define what is your question that you want to answer with your research data because it directly influences the model specification, interpretation, and validity. Ideally, tgis was determined prior to design of the experiment and data acquisition. However, things change as an experiment unfolds, and sometimes experimental aims become lost in translation between the original grant proposal for a project and when the project is finally implemented. Before embarking on an analysis, write down the goals of analysis in the most precise terms possible. Examples:

How much does barley yield change as the result of this new fertilizer source compared to another source?

What is the estimated change in milk yield with each unit increase in cow parity?

Does an experimental drug has a stronger effect on reducing cholesterol rates where a difference of 5 units or more is considered a meaningful result?

What are the relative contributions of location, year and cultivar influencing barley yield across southern Idaho?

The first two examples are asking for specific inference on the estimated effects of a particular intervention. The third example is a hypothesis test, and the fourth example is also inferential, but focused on variance instead of point estimates.

4.1 Example experiment

For the following steps, imagine that someone conducted a field trial to study wheat yield response to different nitrogen fertilizer treatments. In this case, we want to analyze how wheat yield responded to the fertilizer treatments using a randomized complete block design. Thus, we have identified the dependent variable (yield) and the fixed effects (nitrogen ferilizer rates) and the random effect (block). A study like this could have other goals such as what is the optimal nitrogen fertilizer rate, how much does yield change as a function of added nitrogen, or what mathematical model best describes this relationship. It is helpful to define these goals before embarking on an analysis.

4.1.1 Step 2: Conduct data integrity checks

The first thing is to make sure the data is what we expect. Here are the steps to verify our data:

4.1.1.1 Check the structure of the data

In this step, we need to make sure that class of variables in data is as expected. For example, replication/block and year are often in the numeric format after import into R. But, for analysis the replication unit and other variables that are not truly numerical1 needs to be in a ‘character’ or ‘factor’ format. The dependent variable must also be in the correct format for its intended data type. For this tutorial, the dependent variable is expected to be numeric in all cases2, although categorical outcomes are certainly plausible for many studies.

1 No one views the year 2026 as 2,026.

2 Since this tutorial is exclusively focused on general linear models and is not addressing generalized scenarios at all.

Warning

Most R import functions (e.g. read.csv(), read_csv(), read_excel()) interprets continuously varying numbers (correctly) as numerical variables, but if they do not, that is often a signal that there is an unexpected character in a particular column of data that is incorrectly read as non-numerical. Perhaps a comma is present when a period is expected, or a cells reads as “N/A” instead of “NA”. This is an indicator to double check your data to ensure there are no errors or unexpected input in it.

We can use the command str() in R to look at the class of each variable in the data set.

str(dataset)

This will print the data class of each variable present in the data set and a few example values for each variable.

The example code below shows the conversion of rep from numeric to factor class and the conversion of yield from character to numeric class.

dataset$block <- as.factor(dataset$block)
dataset$rep <- as.character(dataset$rep)
dataset$yield <- as.numeric(dataset$yield)
TipFactor versus character class

Often, factor and character object classes are used interchangeably. However, we need to keep in mind the differences between these two classes. A character variable represents data stored in a text or “string” format. A factor variable is a categorical variable type with values stored as set levels. Most linear modeling packages in R expect categorical independent variables to be formatted as factors, but, many will also automatically convert character variables to factors.

4.1.1.2 Inspect the independent variables

Running a cross-tabulation across treatments and replications is generally sufficient to make sure ensure the expected levels of these factors are present in the data.

table(dataset$trt, dataset$block)

The output from this code will give us the number of observations in each replication for a given treatment. It’s helpful to check (1) all the expected levels of categorical data are present and no additional levels are present (e.g., “high”, “High” or “High”); (2) the counts are as expected; and (3) what the balance of treatments is. Are they perfectly balanced? Slightly unbalanced? Very unbalanced? Is one treatment combination missing altogether? Depending on the context, these are resolvable issues.The main primary goal is to verify what the data looks like after import compared to your personal understanding of the data.

4.1.1.3 Check the extent of missing data

colSums(is.na(dataset))

This will give you the number of missing values in each variable. Missing data is a normal part of experiments and in most cases, is not a problem. However, if there is more missingness than expected, it is good to check that the data are correctly entered and that nothing went wrong during data import.

TipMissingness versus zeros

We occasionally see users substitute missing data with zeros. This is not recommended at all unless there is a clear reason. If I planted an experiment and one plot failed to produce any crop at all due to a field conditions (e.g. drought, disease), it is reasonable to assign a zero to that plot for a variable we expected to measure such as total biomass. However, if I failed to plant anything in a plot due to user error, assigning a zero is not an accurate description of what occurred. Likewise, it not appropriate to treat true zeros as missing. Sometimes highly unusual conditions can occur, for example, a deer eats the entirety of a plot. In these unusual instances, it is up to the researcher to decide if these conditions reflect a common reality they want incorporated into the final estimates (assign the plot as “zero” or the closes appropriate value), or exclude that (set the data for that plot as missing).

4.1.1.4 Inspect the dependent variable and all other continuous variables

Check the dependent variable and all continuous variables (i.e. covariates) to ensure their distributions are following expectations. A histogram is often sufficient to accomplish this. The goal of this step is to verify the distribution of the variable and to make sure there are no anomalies in the data such as zero-inflation, right or left skewness, or any extreme observations (high or low). This is designed to be a quick check, so there is no need to spend time prettifying these plots.

hist(data$yield)

Data are not expected to be normally distributed at this point, so don’t bother running any normality tests like the Shapiro-Wilk test. This histogram is a check to ensure that the data are entered correctly and appear valid. It requires a mixture of domain knowledge and statistical training to know this, but over time, if you look at these plots regularly, you will gain a feel for what your data should look like at this stage.

4.1.1.5 Next Steps

The purpose of these checks is to help us find any data errors that ought to be fixed prior to model fitting. They are designed to be done quickly and should be conducted for every analysis if you have not previously inspected your data as thus. We do this before every analysis and often discover surprising things! It is best to discover these things early, since they are likely to impact the final analysis.

If do you identify issues, check your data and correct as necessary. Once a data set passes these checks, we can move to next step: model fitting.

4.1.2 Step 3: Fit the Model

4.1.2.1 Notes on formula syntax and expectations

In this guide, we use lme4 and nlme packages to fit linear mixed models. These packages have overlapping functionality and follow a similar framework and syntax. The general framework for lmer() and lme() models is to specify the response variable, fixed factors, and random factors and follow the R generic function for formula(). For this demonstration, let’s assume the dependent variable is ‘yield’, ‘tr’t is a fixed effect and ’block’ is a random effect.

The code below shows the R syntax for a mixed model with one fixed and one random effect:

  • lme4
  • nlme
model_lmer <- lmer(yield ~ trt + (1|block),
                   data = dataset, 
                   na.action = na.exclude)
model_lme <- lme(yield ~ trt ,
                  random = ~ 1|block,
                  data = data1, 
                  na.action = na.exclude)

The parentheses are used to indicate a random effect, and this particular notation (1|block) indicates that a ‘random intercept’ model is being fit 3. This is the most common approach. It means there is one fit for each block. Here, note that random effects are specified differently in the lmer() and nlme() models.

3 Please refer to Chapter 3

Tipna.action = na.exclude

You may have noticed the final argument for na.action in the model statement.

The argument na.action = na.exclude provides instructions for how to handle missing data. The option na.exclude removes the missing data points before proceeding with the analysis. When any observation-level model outputs is generated (e.g. predictions, residuals), they are padded in the appropriate place to account for missing data. This is handy because it makes it easier to add those results to the original data set if so desired.

We use the argument na.action = na.exclude as instruction for how to handle missing data: conduct the analysis, adjusting as needed for the missing data, and when prediction or residuals are output, padding them in the appropriate places for missing data so they can be easily merged into the main data set if need be.

Even when there are no missing data and this step is not necessary, it is a good habit to be in.

4.1.3 Step 4: Check model assumptions

Linear mixed models rely on independence of the observations, normality and homoscedasticity of residuals and linearity between the dependent and independent variables. Violating these assumptions can lead to biased estimates, incorrect inference, and model convergence issues. Always check model assumptions, and if they are unmet, consider alternative model specifications.

Note‘iid’ assumption for residuals

In these model, the error terms, \(\epsilon\) are assumed to be “iid”, that is, independently and identically distributed. This means they are expected to be normally distributed with a mean of 0 standard deviation of \(\sigma\), or more specifically, \(\sigma \mathbf{I}\), which refers to constant variance and a covariance of zero between residuals.

In this guide, we well be testing these assumptions by using graphical methods. This can be done in two ways:

4.1.3.1 Original method

We can use base plot() function in R for ‘lme4’ and ‘nlme’ modelling objects to check the homoscedasticity (residuals vs. fitted values plot) of residuals.

  • lme4
  • nlme
plot(model_lmer, resid(., scaled=TRUE) ~ fitted(.), 
     xlab = "fitted values", ylab = "studentized residuals")
plot(model_lme, resid(., scaled=TRUE) ~ fitted(.), 
     xlab = "fitted values", ylab = "studentized residuals")

In this output, we expect to see a plot with random and uniform distribution of points. If we notice any specific pattern in the distribution of points, we need to look into the model structure and response variable distribution closely.

To check the normality of the residuals, we need to extract the residuals first using the resid() function and then generating a qq-plot:

  • lme4
  • nlme
qqnorm(resid(model_lmer), main = NULL); qqline(resid(model_lmer))
qqnorm(resid(model_lme), main = NULL); qqline(resid(model_lme))

Interpretiong qq-plots: if residuals falls closely along the 45-degree qq-line, it suggests normality of the residuals. It is common see mild deviations at the ends of the distribution, and since these models are robust to modest deviations from normality, that may not be a concern. However, if there is strong deviation of residual points from the qq-line, consider a different model that fits the data better. How to do this is beyond the scope of this guide.

4.1.3.2 New Method

Nowadays, we can take advantage of the performance package package, which provides a comprehensive suite of diagnostic plots.

The diagnostic plots we created above can be created using one function check_model() from the performance package.

Note

Read the documentation for check_model() to find what other checks this function can do. If you would like to check all assumptions at once you can use the argument check = "all".

  • lme4
  • nlme
check_model(model_lmer, check = c('normality', 'linearity'))
check_model(model_lme, check = c('normality', 'linearity'))

4.1.4 Step 5: Conduct inference

After verifying the assumptions of model, we can move to the inference of model. Based on analysis goals, we can either conduct analysis of variance using anova() function and/or we can estimate marginal means using emmeans() function from the ‘emmeans’ package. We can also run post hoc comparisons to evaluate the pairwise comparison or contrasts using estimated means.

WarningModel assumptions & statistical inference

Please remember that conclusions should not be drawn from models that do not meet linear mixed model assumptions.

3  Mixed Model Background
5  Randomized Complete Block Design

© {year}

Uidaho Logo

  • View source
  • Report an issue