Aggregating & Summarising Data

Learning Goals

At the end of this lesson, you should:

be able to aggregate data and perform actions on those aggregated data using group_by() and summarise()
understand when to use rowwise() for operations

You may find yourself wanting to calculate summary statistics across a grouping variable. To do this, a data set needs to be split up by that variable, a summary statistic calculated, and the resulting data recombined, or ‘split-apply-combine’. There’s some nice tools to do this in the dplyr package.

Prep Work

First, load libraries & import data:

library(dplyr)

variety_trials <- read.csv(here::here("data", "trial_data.csv")) %>% 
  mutate(trial = gsub("_H_", "_H-", trial)) %>% 
  tidyr::separate(trial, c("program", "crop", "location", "year"),
                  sep = "_", remove = FALSE)

Basic grouping & aggregation

The group_by will group data and then any statistic can be calculated or summary action can be done on that grouped data using summarise().

The basic formula:

mydata %>% group_by(variable) %>% summarise(new_var = ...)

This data set has several categorical variables that can be used for grouping:

str(variety_trials)

'data.frame':   1882 obs. of  10 variables:
 $ trial        : chr  "SWIdahoCereals_H-S_PAR_2018" "SWIdahoCereals_H-S_PAR_2018" "SWIdahoCereals_H-S_PAR_2018" "SWIdahoCereals_H-S_PAR_2018" ...
 $ program      : chr  "SWIdahoCereals" "SWIdahoCereals" "SWIdahoCereals" "SWIdahoCereals" ...
 $ crop         : chr  "H-S" "H-S" "H-S" "H-S" ...
 $ location     : chr  "PAR" "PAR" "PAR" "PAR" ...
 $ year         : chr  "2018" "2018" "2018" "2018" ...
 $ rep          : int  1 2 3 4 1 2 3 4 1 2 ...
 $ variety      : chr  "12SB0197" "12SB0197" "12SB0197" "12SB0197" ...
 $ yield        : num  71.7 108.6 81.7 103.8 65.3 ...
 $ grain_protein: num  9.83 9.6 11.27 10.35 10.23 ...
 $ test_weight  : num  62.1 64.2 65.6 64.3 62.8 65.2 65.1 65.6 65 65.3 ...

The function tally() counts observations:

variety_trials %>% group_by(trial) %>% tally()

# A tibble: 28 × 2
   trial                           n
   <chr>                       <int>
 1 SWIdahoCereals_H-S_PAR_2018    88
 2 SWIdahoCereals_H-S_WEI_2018    96
 3 SWIdahoCereals_H-W_PAR_2017   120
 4 SWIdahoCereals_H-W_PAR_2018    80
 5 SWIdahoCereals_H-W_WEI_2018    48
 6 SWIdahoCereals_HRS_PAR_2016    60
 7 SWIdahoCereals_HRS_PAR_2017    32
 8 SWIdahoCereals_HRS_PAR_2019    60
 9 SWIdahoCereals_HRS_PAR_2020    48
10 SWIdahoCereals_HRW_PAR_2019    44
# ℹ 18 more rows

Let’s group by crop and pull out the mean yield and standard deviation.

yield_crop <- variety_trials %>% group_by(crop) %>% 
  summarise(yield_mean = mean(yield, na.rm = TRUE),
            yield_sd = sd(yield, na.rm = TRUE),
            yield_min = min(yield, na.rm = TRUE),
            yield_max = max(yield, na.rm = TRUE),
            total = n())

yield_crop

# A tibble: 8 × 6
  crop  yield_mean yield_sd yield_min yield_max total
  <chr>      <dbl>    <dbl>     <dbl>     <dbl> <int>
1 H-S         57.6     34.3    16.6        119.   184
2 H-W         84.4     42.2     4.17       199.   248
3 HRS        111.      38.5    56.0        498.   200
4 HRW        136.      45.0    66.2        197.   100
5 HWS        116.      32.0    56.6        253.   132
6 HWW        115.      38.5    68.7        192.    76
7 SWS        104.      42.4    12.1        219.   316
8 SWW         94.0     43.9     0.705      201.   626

Note

summarise() only returns a single value back for each group. If you want more than that (e.g. to run a linear model on each group), there are other tools for that. This is intended to be addressed in Lesson ‘repeating actions’.

Grouping across multiple variables

Let’s examine how many crops and years there are using the table() command:

table(variety_trials$crop, variety_trials$year)

     
      2016 2017 2018 2019 2020
  H-S    0    0  184    0    0
  H-W    0  120  128    0    0
  HRS   60   32    0   60   48
  HRW    0    0    0   44   56
  HWS   44   32    0   24   32
  HWW    0    0    0   44   32
  SWS   40   40  132   56   48
  SWW    0  160  206  108  152

This tells us how many rows of data occur for each variable combination. This information can help inform us how to group information.

You can group by as many conditions as you want:

variety_trials %>% group_by(crop, year) %>% 
  summarise(protein_na = sum(is.na(grain_protein))) %>% arrange(desc(protein_na))

`summarise()` has grouped output by 'crop'. You can override using the
`.groups` argument.

# A tibble: 24 × 3
# Groups:   crop [8]
   crop  year  protein_na
   <chr> <chr>      <int>
 1 SWW   2017         160
 2 H-W   2018         128
 3 H-W   2017         120
 4 H-S   2018          96
 5 SWW   2018          90
 6 SWS   2018          64
 7 SWS   2017          40
 8 HRS   2017          32
 9 HWS   2017          32
10 HRW   2019           4
# ℹ 14 more rows

FYI

You can group by a numeric variable. If you do that, dplyr will look for common values to group observations. This can be successful when there are repeat ‘integers’ (e.g. year, replicate), but if all values are unique (which is often the case with floating point numbers), then the number of groups is the number of observations.

Summarising across multiple variables

Use across() to conduct the same summary action(s) across multiple columns.

variety_trials %>% group_by(trial) %>% 
  summarise(across(c(yield, grain_protein), ~ mean(.x, na.rm = TRUE)))

# A tibble: 28 × 3
   trial                       yield grain_protein
   <chr>                       <dbl>         <dbl>
 1 SWIdahoCereals_H-S_PAR_2018  91.0         10.9 
 2 SWIdahoCereals_H-S_WEI_2018  27.0        NaN   
 3 SWIdahoCereals_H-W_PAR_2017 121.         NaN   
 4 SWIdahoCereals_H-W_PAR_2018  63.1        NaN   
 5 SWIdahoCereals_H-W_WEI_2018  28.4        NaN   
 6 SWIdahoCereals_HRS_PAR_2016 111.          14.3 
 7 SWIdahoCereals_HRS_PAR_2017 150.         NaN   
 8 SWIdahoCereals_HRS_PAR_2019 106.          11.7 
 9 SWIdahoCereals_HRS_PAR_2020  90.6         12.7 
10 SWIdahoCereals_HRW_PAR_2019  88.2          8.20
# ℹ 18 more rows

variety_trials %>% group_by(trial) %>% 
  summarise(across(c(yield, grain_protein), ~ sd(.x, na.rm = TRUE)))

# A tibble: 28 × 3
   trial                       yield grain_protein
   <chr>                       <dbl>         <dbl>
 1 SWIdahoCereals_H-S_PAR_2018 16.9          1.00 
 2 SWIdahoCereals_H-S_WEI_2018  5.52        NA    
 3 SWIdahoCereals_H-W_PAR_2017 26.3         NA    
 4 SWIdahoCereals_H-W_PAR_2018  7.64        NA    
 5 SWIdahoCereals_H-W_WEI_2018 11.2         NA    
 6 SWIdahoCereals_HRS_PAR_2016 24.3          0.908
 7 SWIdahoCereals_HRS_PAR_2017 75.3         NA    
 8 SWIdahoCereals_HRS_PAR_2019 10.2          0.780
 9 SWIdahoCereals_HRS_PAR_2020 13.2          1.14 
10 SWIdahoCereals_HRW_PAR_2019 10.2          0.612
# ℹ 18 more rows

Practice Problems

As usual, consider how these data aggregation functions can support your own work.
For some of these exercises, you may need to use other dplyr functions.

Count the number of observations for each ‘variety’ in the ‘variety_trials’ data set.
Use ‘variety-trials’ data, group data by ‘variety’ and calculate the summary statistics (mean, sd, min, and max) for ‘grain_protein’.
Use ‘variety-trials’ data, group data by ‘crop’ and ‘year’ variables and calculate the mean and standard deviation across ‘test_weight’ and ‘grain_protein’.

Row-wise summaries

Many operations in R are already vectorized across rows, but when they are not, you can use rowwise() to implement that.

Field disease scoring may benefit from this system. Often, several measurements are made on a single experimental unit (usually a plot), and those measurements are averaged together to create a final disease incidence score. Here is how to do that with rowwise().

First, simulate a set of disease scores between 0 and 100 (indicating percent infection).

set.seed(208)

# step 1: generate a set of possible scores: 0, 10, 20,...100
score_range <- c(0:10 * 10L)
# sample those possible scores to generate 50 data points
scores <- sample(score_range, 50, replace = TRUE)
# arrange those 50 data points into a datafrmae of 5 columns, each column reflecting 10 observations
disease_df <- data.frame(plot = letters[1:10],
                         score1 = scores[1:10],
                         score5 = scores[11:20],
                         score3 = scores[21:30],
                         score4 = scores[31:40],
                         score2 = scores[41:50])
disease_df

   plot score1 score5 score3 score4 score2
1     a     20     20     40     30     40
2     b     10     10     40     70     20
3     c     80     70     40     10     30
4     d     80    100     30     30     30
5     e     60    100     20     70     60
6     f     40      0     30    100     70
7     g     20     90     50     40     90
8     h     20     40     80      0      0
9     i    100     90     90    100     50
10    j     80     40     70    100     60

Data sets exist like this. A person might have a set of 10 experimental plots to evaluate for some trait. The trait assay protocol may require that multiple observations be gathered per plot (from a statistical standpoint, this is a technical replicate, not a true replicate) and then reduced to a single number per plot using a simple mean. Row-wise functions can accomplish this.

disease_df_sum <- disease_df %>% rowwise() %>% 
  mutate(score_final = mean(score1:score2),
        max_score = max(score1:score2))
disease_df_sum

# A tibble: 10 × 8
# Rowwise: 
   plot  score1 score5 score3 score4 score2 score_final max_score
   <chr>  <int>  <int>  <int>  <int>  <int>       <dbl>     <int>
 1 a         20     20     40     30     40          30        40
 2 b         10     10     40     70     20          15        20
 3 c         80     70     40     10     30          55        80
 4 d         80    100     30     30     30          55        80
 5 e         60    100     20     70     60          60        60
 6 f         40      0     30    100     70          55        70
 7 g         20     90     50     40     90          55        90
 8 h         20     40     80      0      0          10        20
 9 i        100     90     90    100     50          75       100
10 j         80     40     70    100     60          70        80

disease_df_sum1 <- disease_df %>% select(score1, score4, score5) %>% 
rowwise() %>% 
  mutate(score_final = mean(c(score1, score4)),
        median_score = median(c(score1, score4)))
disease_df_sum1

# A tibble: 10 × 5
# Rowwise: 
   score1 score4 score5 score_final median_score
    <int>  <int>  <int>       <dbl>        <dbl>
 1     20     30     20          25           25
 2     10     70     10          40           40
 3     80     10     70          45           45
 4     80     30    100          55           55
 5     60     70    100          65           65
 6     40    100      0          70           70
 7     20     40     90          30           30
 8     20      0     40          10           10
 9    100    100     90         100          100
10     80    100     40          90           90

Practice Problem

Import “weather_data.csv”, select ‘station, ’tmin_F’ and ‘tmax_F’ using a select() function. Calculate the sum, mean, and median for ‘tmin_F’ and ‘tmax_F’ using rowwise() function.

Putting it all together

It is possible use mutate() or summary() commands on a grouped data frame. A summary() call will return one value per group + summary function (e.g. mean). A mutate() call will return one value per row + summary function. All the previous examples in this lesson used summary(). Here is one example using mutate():

variety_trials %>% 
  select(trial, rep, variety, crop, yield) %>% group_by(crop) %>%
  mutate(relative_yield = yield/mean(yield, na.rm=TRUE)) %>% 
  arrange(desc(yield)) %>% head(15)

# A tibble: 15 × 6
# Groups:   crop [6]
   trial                         rep variety        crop  yield relative_yield
   <chr>                       <int> <chr>          <chr> <dbl>          <dbl>
 1 SWIdahoCereals_HRS_PAR_2017     1 WB9411         HRS    498.           4.50
 2 SWIdahoCereals_HRS_PAR_2017     3 12SB0197       HRS    297.           2.68
 3 SWIdahoCereals_HWS_PAR_2017     3 Dayn           HWS    253.           2.18
 4 SWIdahoCereals_SWS_PAR_2017     1 UI Stone       SWS    219.           2.12
 5 SWIdahoCereals_SWS_PAR_2017     4 UI Stone       SWS    217.           2.09
 6 SWIdahoCereals_SWS_PAR_2017     4 WA8277         SWS    204.           1.97
 7 SWIdahoCereals_SWW_PAR_2017     1 Bobtail        SWW    201.           2.14
 8 SWIdahoCereals_H-W_PAR_2017     3 WA8269         H-W    199.           2.36
 9 SWIdahoCereals_HWS_PAR_2017     1 LCS Star       HWS    198.           1.71
10 SWIdahoCereals_HRW_PAR_2020     2 LCS Jet        HRW    197.           1.45
11 SWIdahoCereals_HRW_PAR_2020     3 LCS Rocket     HRW    196.           1.44
12 SWIdahoCereals_SWW_PAR_2017     2 Agripro Legion SWW    194.           2.06
13 SWIdahoCereals_H-W_PAR_2017     4 NSA10-2196     H-W    193.           2.29
14 SWIdahoCereals_HWS_PAR_2017     4 LCS Star       HWS    193.           1.66
15 SWIdahoCereals_HRW_PAR_2020     2 Scorpio        HRW    193.           1.42

In this case, the mean value used for calculating ‘relative_yield’ is the group mean.