str(data)
4 Model Preparation and Flow
This chapter provides a brief introduction to the steps involved in data analysis using linear mixed models and some thoughts on data quality and data interpretation.
The steps explained below were applied to all the chapters.
Analyzing data using linear mixed models involves several key steps, from data preparation to model interpretation. Here’s a structured approach:
4.1 Define the Research Question
It’s important to define what is your question that you want to answer with your research data because it directly influences the model specification, interpretation, and validity. For example, We conducted a randomized field trial to study wheat yield response to different nitrogen fertilizer treatments. In this case, we want to analyze how wheat yield responded to the fertilizer treatments. Next step involves identifying the dependent variable (yield), and determine the fixed (treatment) and random (replications) effects in the experiment design.
4.2 Data integrity checks
The first thing is to make sure the data is what we expect. Here are the steps to verify our data:
- Check structure of the data
In this step, we need to make sure that class of variables in data is as expected e.g. replication/block are generally in the numeric format. But, for analysis the blocks/replications needs to be in a ‘character’ or ‘factor’ format. The response variable must be in a ‘numeric’ format. Sometimes the class of response variable appears to be a character, it’s important to check that and convert it to numeric format to avoid errors in model fitting.
We can use the base code str()
in R to look at the class of each variable in the data set.
This code will create an output showing the data class of each variable present in the data.
The code chunk below can be used to change the class of variable from numeric to character/factor or vice-versa. Here the example code below shows the conversion of rep from numeric to factor class and conversion of yield from character to numeric class.
$rep <- as.factor(data$rep)
data$yield <- as.numeric(data$yield) data
Often, factor and character data class are used interchangeable. But we need to keep in mind the difference in these two classes. A character variable represents data stored in a text format. A factor variable is categorical variable type with values stored as levels. =factor v character
Likewise, there is a difference in ‘integer’ and ‘numeric’ data class. A ‘integer’ class stores data as a whole number. A ‘numeric’ class stores decimal values as well (this is a default data class for numbers in R).
- Inspect the independent variables
Running a cross tabulations across treatments and replications is generally enough to make sure the expected levels are present in the data.
table(data$trt, data$rep)
The output from this code will give us the number of observations in each replication for a given treatment. It’s supposed to have equal number of observations across treatments. Unbalanced independent variables can affect the fitting, interpretation and performance of mixed model which can result in biased estimates for fixed effects.
- Check the extent of missing data
It’s important to check the extent of missing values in the data.
colSums(is.na(data))
This will give you a number of missing values in each variable. It’s an easy identification to
This is not needed for the data sets in this tutorial that have already been comprehensively examined, but it is helpful to check that the extent of missing values displayed in an R session is what you expect.
Having extreme values in the data can lead to biased estimates, model convergence issues, and incorrect variance estimates. Properly handling missing data through imputation ensures more reliable results.
If there were independent variables with a continuous distribution (a covariate), plot those data.
- Inspect the dependent variable
Last, check the dependent variable to ensure its distribution is following expectations. A histogram is often quite sufficient to accomplish this. This is designed to be a quick check, so no need to spend time making the plot look good.
The code below does not test for the normality of the response variable. Here we are creating a histogram to look at count of a given values of the response variable.
The target of this step is to verify the distribution of the variable and to make sure there are no anomalies in the data such as zero-inflation, right or left skewness, or any extreme high/low observations.
hist(data$response)
Data are not expected to be normally distributed at this point, so don’t bother running any Shapiro-Wilk tests. This histogram is a check to ensure that the data are entered correctly and they appear valid. It requires a mixture of domain knowledge and statistical training to know this, but over time, if you look at these plots with regularity, you will gain a feel for what your data should look like at this stage.
These are not complicated checks. They are designed to be done quickly and should be done for every analysis if you have not previously inspected your data as thus. We do this before every analysis and often discover surprising things! Best to discover these things early, since they are likely to impact the final analysis.
Once our data passes through these pre-checks we can move to next step is building the model. the structure of data variables,
4.3 Model building - Formula syntax and expectations
In this guide we used lme4 and nlme packages fit linear mixed model.
The general framework for lmer()
and lme()
models include specifying the response variable, fixed factors, and random factors. For this demonstration, let’s assume response variable = yield, fixed factor = treatment, random factor = block.
The code below shows the R syntax mixed model with one fixed and one random effect:
<- lmer(response ~ fixed + (1|random),
model_lmer data = data1,
na.action = na.exclude)
<- lme(response ~ fixed,
model_lme random = ~ 1|random,
data = data1,
na.action = na.exclude)
The parentheses are used to indicate a random effect, and this particular notation (1|block)
indicates that a ‘random intercept’ model is being fit (Please refer to Chapter 3). This is the most common approach. It means there is one overall effect fit for each block. Here, note that random effects are specified differently in the lmer()
and nlme()
models.
na.action = na.exclude
You may have noticed the final argument for na.action
in the model statement:
model_lmer <- lmer(response ~ fixed + (1|random),
data = data1,
na.action = na.exclude)
The argument na.action = na.exclude
provides instructions for how to handle missing data. na.exclude
removes the missing data points before proceeding with the analysis. When any obervation-levels model outputs is generated (e.g. predictions, residuals), they are padded in the appropriate place to account for missing data. This is handy because it makes it easier to add those results to the original data set if so desired.
Even when there are no missing data and this step is not necessary, it’s a good habit to be in.
We use the argument na.action = na.exclude
as instruction for how to handle missing data: conduct the analysis, adjusting as needed for the missing data, and when prediction or residuals are output, please pad them in the appropriate places for missing data so they can be easily merged into the main data set if need be.
4.4 Model Assumptions
Linear mixed models rely on linearity, independence, normality, homoscedasticity of residuals. Violating these assumptions can lead to biased estimates, inefficient inference, and convergence issues. Always diagnose model assumptions using residual plots, tests, and alternative specifications in R.
In this guide, we well be testing these assumptions by using graphical methods. This can be done in two ways:
4.4.1 Original method
We can use base plot()
function in R for lme4 and nlme objects to check the homoscedasticity (residuals vs. fitted values plot) of residuals.
plot(model_lmer, resid(., scaled=TRUE) ~ fitted(.),
xlab = "fitted values", ylab = "studentized residuals")
plot(model_lme, resid(., scaled=TRUE) ~ fitted(.),
xlab = "fitted values", ylab = "studentized residuals")
In this output, we expect to see a plot with random and uniform distribution of points. If we notice any specific pattern in the distribution of points, we need to look into the model structure and response variable distribution closely.
To check the normality of the residuals, we need to extract the residuals first using the resid()
function and then generate a qq-plot and line:
qqnorm(resid(model_lmer), main = NULL); qqline(resid(model_lmer))
qqnorm(resid(model_lme), main = NULL); qqline(resid(model_lme))
The interpretation of the qq-plots generated from this code chunk is : if residuals falls closely along the 45-degree qq-line, it suggests normality of the residuals. However, if there is strong deviation of residual points from the qq-line, consider model adjustments or data transformation.
4.4.2 New Method
Nowadays, we can take advantage of the performance package package, which provides a comprehensive suite of diagnostic plots.
The diagnostic plots we created above can be created using one function check_model()
from the performance package.
Please look for check_model()
in help tab to find what other checks you can perform using this function. If you would like to check all assumptions you can use the argument check = "all"
.
check_model(model_lmer, check = c('normality', 'linearity'))
check_model(model_lme, check = c('normality', 'linearity'))
4.5 Inference
After verifying the assumptions of model, we can move to the inference of model. Based on analysis goals, we can either conduct analysis of variance using anova()
function or we can estimate marginal means using emmeans()
function from the emmeans package. Further, we can also run a post-hoc comparison to evaluate the pairwise comparison or contrasts using estimated means.
Please remember that conclusions cannot be drawn from the model that doesn’t meet linear mixed model assumptions.