library(dplyr)
<- read.csv(here::here("data", "trial_data.csv")) %>%
variety_trials mutate(trial = gsub("_H_", "_H-", trial)) %>%
::separate(trial, c("program", "crop", "location", "year"),
tidyrsep = "_", remove = FALSE)
Aggregating & Summarising Data
At the end of this lesson, you should:
- be able to aggregate data and perform actions on those aggregated data using
group_by()
andsummarise()
- understand when to use
rowwise()
for operations
You may find yourself wanting to calculate summary statistics across a grouping variable. To do this, a data set needs to be split up by that variable, a summary statistic calculated, and the resulting data recombined, or ‘split-apply-combine’. There’s some nice tools to do this in the dplyr package.
Prep Work
First, load libraries & import data:
Basic grouping & aggregation
The group_by
will group data and then any statistic can be calculated or summary action can be done on that grouped data using summarise()
.
The basic formula:
%>% group_by(variable) %>% summarise(new_var = ...) mydata
This data set has several categorical variables that can be used for grouping:
str(variety_trials)
'data.frame': 1882 obs. of 10 variables:
$ trial : chr "SWIdahoCereals_H-S_PAR_2018" "SWIdahoCereals_H-S_PAR_2018" "SWIdahoCereals_H-S_PAR_2018" "SWIdahoCereals_H-S_PAR_2018" ...
$ program : chr "SWIdahoCereals" "SWIdahoCereals" "SWIdahoCereals" "SWIdahoCereals" ...
$ crop : chr "H-S" "H-S" "H-S" "H-S" ...
$ location : chr "PAR" "PAR" "PAR" "PAR" ...
$ year : chr "2018" "2018" "2018" "2018" ...
$ rep : int 1 2 3 4 1 2 3 4 1 2 ...
$ variety : chr "12SB0197" "12SB0197" "12SB0197" "12SB0197" ...
$ yield : num 71.7 108.6 81.7 103.8 65.3 ...
$ grain_protein: num 9.83 9.6 11.27 10.35 10.23 ...
$ test_weight : num 62.1 64.2 65.6 64.3 62.8 65.2 65.1 65.6 65 65.3 ...
The function tally()
counts observations:
%>% group_by(trial) %>% tally() variety_trials
# A tibble: 28 × 2
trial n
<chr> <int>
1 SWIdahoCereals_H-S_PAR_2018 88
2 SWIdahoCereals_H-S_WEI_2018 96
3 SWIdahoCereals_H-W_PAR_2017 120
4 SWIdahoCereals_H-W_PAR_2018 80
5 SWIdahoCereals_H-W_WEI_2018 48
6 SWIdahoCereals_HRS_PAR_2016 60
7 SWIdahoCereals_HRS_PAR_2017 32
8 SWIdahoCereals_HRS_PAR_2019 60
9 SWIdahoCereals_HRS_PAR_2020 48
10 SWIdahoCereals_HRW_PAR_2019 44
# ℹ 18 more rows
Let’s group by crop and pull out the mean yield and standard deviation.
<- variety_trials %>% group_by(crop) %>%
yield_crop summarise(yield_mean = mean(yield, na.rm = TRUE),
yield_sd = sd(yield, na.rm = TRUE),
yield_min = min(yield, na.rm = TRUE),
yield_max = max(yield, na.rm = TRUE),
total = n())
yield_crop
# A tibble: 8 × 6
crop yield_mean yield_sd yield_min yield_max total
<chr> <dbl> <dbl> <dbl> <dbl> <int>
1 H-S 57.6 34.3 16.6 119. 184
2 H-W 84.4 42.2 4.17 199. 248
3 HRS 111. 38.5 56.0 498. 200
4 HRW 136. 45.0 66.2 197. 100
5 HWS 116. 32.0 56.6 253. 132
6 HWW 115. 38.5 68.7 192. 76
7 SWS 104. 42.4 12.1 219. 316
8 SWW 94.0 43.9 0.705 201. 626
summarise()
only returns a single value back for each group. If you want more than that (e.g. to run a linear model on each group), there are other tools for that. This is intended to be addressed in Lesson ‘repeating actions’.
Grouping across multiple variables
Let’s examine how many crops and years there are using the table()
command:
table(variety_trials$crop, variety_trials$year)
2016 2017 2018 2019 2020
H-S 0 0 184 0 0
H-W 0 120 128 0 0
HRS 60 32 0 60 48
HRW 0 0 0 44 56
HWS 44 32 0 24 32
HWW 0 0 0 44 32
SWS 40 40 132 56 48
SWW 0 160 206 108 152
This tells us how many rows of data occur for each variable combination. This information can help inform us how to group information.
You can group by as many conditions as you want:
%>% group_by(crop, year) %>%
variety_trials summarise(protein_na = sum(is.na(grain_protein))) %>% arrange(desc(protein_na))
`summarise()` has grouped output by 'crop'. You can override using the
`.groups` argument.
# A tibble: 24 × 3
# Groups: crop [8]
crop year protein_na
<chr> <chr> <int>
1 SWW 2017 160
2 H-W 2018 128
3 H-W 2017 120
4 H-S 2018 96
5 SWW 2018 90
6 SWS 2018 64
7 SWS 2017 40
8 HRS 2017 32
9 HWS 2017 32
10 HRW 2019 4
# ℹ 14 more rows
You can group by a numeric variable. If you do that, dplyr will look for common values to group observations. This can be successful when there are repeat ‘integers’ (e.g. year, replicate), but if all values are unique (which is often the case with floating point numbers), then the number of groups is the number of observations.
Summarising across multiple variables
Use across()
to conduct the same summary action(s) across multiple columns.
%>% group_by(trial) %>%
variety_trials summarise(across(c(yield, grain_protein), ~ mean(.x, na.rm = TRUE)))
# A tibble: 28 × 3
trial yield grain_protein
<chr> <dbl> <dbl>
1 SWIdahoCereals_H-S_PAR_2018 91.0 10.9
2 SWIdahoCereals_H-S_WEI_2018 27.0 NaN
3 SWIdahoCereals_H-W_PAR_2017 121. NaN
4 SWIdahoCereals_H-W_PAR_2018 63.1 NaN
5 SWIdahoCereals_H-W_WEI_2018 28.4 NaN
6 SWIdahoCereals_HRS_PAR_2016 111. 14.3
7 SWIdahoCereals_HRS_PAR_2017 150. NaN
8 SWIdahoCereals_HRS_PAR_2019 106. 11.7
9 SWIdahoCereals_HRS_PAR_2020 90.6 12.7
10 SWIdahoCereals_HRW_PAR_2019 88.2 8.20
# ℹ 18 more rows
%>% group_by(trial) %>%
variety_trials summarise(across(c(yield, grain_protein), ~ sd(.x, na.rm = TRUE)))
# A tibble: 28 × 3
trial yield grain_protein
<chr> <dbl> <dbl>
1 SWIdahoCereals_H-S_PAR_2018 16.9 1.00
2 SWIdahoCereals_H-S_WEI_2018 5.52 NA
3 SWIdahoCereals_H-W_PAR_2017 26.3 NA
4 SWIdahoCereals_H-W_PAR_2018 7.64 NA
5 SWIdahoCereals_H-W_WEI_2018 11.2 NA
6 SWIdahoCereals_HRS_PAR_2016 24.3 0.908
7 SWIdahoCereals_HRS_PAR_2017 75.3 NA
8 SWIdahoCereals_HRS_PAR_2019 10.2 0.780
9 SWIdahoCereals_HRS_PAR_2020 13.2 1.14
10 SWIdahoCereals_HRW_PAR_2019 10.2 0.612
# ℹ 18 more rows
Row-wise summaries
Many operations in R are already vectorized across rows, but when they are not, you can use rowwise()
to implement that.
Field disease scoring may benefit from this system. Often, several measurements are made on a single experimental unit (usually a plot), and those measurements are averaged together to create a final disease incidence score. Here is how to do that with rowwise()
.
First, simulate a set of disease scores between 0 and 100 (indicating percent infection).
set.seed(208)
# step 1: generate a set of possible scores: 0, 10, 20,...100
<- c(0:10 * 10L)
score_range # sample those possible scores to generate 50 data points
<- sample(score_range, 50, replace = TRUE)
scores # arrange those 50 data points into a datafrmae of 5 columns, each column reflecting 10 observations
<- data.frame(plot = letters[1:10],
disease_df score1 = scores[1:10],
score5 = scores[11:20],
score3 = scores[21:30],
score4 = scores[31:40],
score2 = scores[41:50])
disease_df
plot score1 score5 score3 score4 score2
1 a 20 20 40 30 40
2 b 10 10 40 70 20
3 c 80 70 40 10 30
4 d 80 100 30 30 30
5 e 60 100 20 70 60
6 f 40 0 30 100 70
7 g 20 90 50 40 90
8 h 20 40 80 0 0
9 i 100 90 90 100 50
10 j 80 40 70 100 60
Data sets exist like this. A person might have a set of 10 experimental plots to evaluate for some trait. The trait assay protocol may require that multiple observations be gathered per plot (from a statistical standpoint, this is a technical replicate, not a true replicate) and then reduced to a single number per plot using a simple mean. Row-wise functions can accomplish this.
<- disease_df %>% rowwise() %>%
disease_df_sum mutate(score_final = mean(score1:score2),
max_score = max(score1:score2))
disease_df_sum
# A tibble: 10 × 8
# Rowwise:
plot score1 score5 score3 score4 score2 score_final max_score
<chr> <int> <int> <int> <int> <int> <dbl> <int>
1 a 20 20 40 30 40 30 40
2 b 10 10 40 70 20 15 20
3 c 80 70 40 10 30 55 80
4 d 80 100 30 30 30 55 80
5 e 60 100 20 70 60 60 60
6 f 40 0 30 100 70 55 70
7 g 20 90 50 40 90 55 90
8 h 20 40 80 0 0 10 20
9 i 100 90 90 100 50 75 100
10 j 80 40 70 100 60 70 80
<- disease_df %>% select(score1, score4, score5) %>%
disease_df_sum1 rowwise() %>%
mutate(score_final = mean(c(score1, score4)),
median_score = median(c(score1, score4)))
disease_df_sum1
# A tibble: 10 × 5
# Rowwise:
score1 score4 score5 score_final median_score
<int> <int> <int> <dbl> <dbl>
1 20 30 20 25 25
2 10 70 10 40 40
3 80 10 70 45 45
4 80 30 100 55 55
5 60 70 100 65 65
6 40 100 0 70 70
7 20 40 90 30 30
8 20 0 40 10 10
9 100 100 90 100 100
10 80 100 40 90 90
It is possible use mutate()
or summary()
commands on a grouped data frame. A summary()
call will return one value per group + summary function (e.g. mean
). A mutate()
call will return one value per row + summary function. All the previous examples in this lesson used summary()
. Here is one example using mutate()
:
%>%
variety_trials select(trial, rep, variety, crop, yield) %>% group_by(crop) %>%
mutate(relative_yield = yield/mean(yield, na.rm=TRUE)) %>%
arrange(desc(yield)) %>% head(15)
# A tibble: 15 × 6
# Groups: crop [6]
trial rep variety crop yield relative_yield
<chr> <int> <chr> <chr> <dbl> <dbl>
1 SWIdahoCereals_HRS_PAR_2017 1 WB9411 HRS 498. 4.50
2 SWIdahoCereals_HRS_PAR_2017 3 12SB0197 HRS 297. 2.68
3 SWIdahoCereals_HWS_PAR_2017 3 Dayn HWS 253. 2.18
4 SWIdahoCereals_SWS_PAR_2017 1 UI Stone SWS 219. 2.12
5 SWIdahoCereals_SWS_PAR_2017 4 UI Stone SWS 217. 2.09
6 SWIdahoCereals_SWS_PAR_2017 4 WA8277 SWS 204. 1.97
7 SWIdahoCereals_SWW_PAR_2017 1 Bobtail SWW 201. 2.14
8 SWIdahoCereals_H-W_PAR_2017 3 WA8269 H-W 199. 2.36
9 SWIdahoCereals_HWS_PAR_2017 1 LCS Star HWS 198. 1.71
10 SWIdahoCereals_HRW_PAR_2020 2 LCS Jet HRW 197. 1.45
11 SWIdahoCereals_HRW_PAR_2020 3 LCS Rocket HRW 196. 1.44
12 SWIdahoCereals_SWW_PAR_2017 2 Agripro Legion SWW 194. 2.06
13 SWIdahoCereals_H-W_PAR_2017 4 NSA10-2196 H-W 193. 2.29
14 SWIdahoCereals_HWS_PAR_2017 4 LCS Star HWS 193. 1.66
15 SWIdahoCereals_HRW_PAR_2020 2 Scorpio HRW 193. 1.42
In this case, the mean value used for calculating ‘relative_yield’ is the group mean.