Introduction

factcuratoR provides functions that help standardize and curate variety testing data sets for the WAVE project.

The goal of the WAVE program curation is to generate trial data and trial metadata that conform to the controlled vocabulary codebooks.

The codebooks specify the variables required for each file (each as a single column in a data frame, matrix, tibble, etc.) and the accepted values for each variable. See the Introduction to codebooks vignette for more information on the structure and formatting of the codebooks.

Overview of validation functions

The validation functions check that the data conforms to the controlled vocabulary codebooks. There are functions to validate:

  • variable names. The functions return tables for the curator to check that required variables are present and that names are standardized.
  • variable values. The functions return tables for the curator to check that values for a single variable match controlled vocabularies or are in the accepted value range.

Finally, standardize_cols_by_cb() enables curators to standardize the files (select and order columns) according to the standards established in the codebooks.
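As a rough illustration of what "select and order columns" means, the base R sketch below keeps only the columns named in a codebook and arranges them in codebook order. The codebook_cols vector and the order_by_codebook() helper are hypothetical stand-ins invented for this example; they are not the arguments or internals of standardize_cols_by_cb().

```r
# Hypothetical codebook column order (illustration only).
codebook_cols <- c("trial", "variety", "entry", "plot", "rep")

# Hypothetical helper: keep the codebook columns that exist in the data,
# in codebook order. Columns that are not in the codebook are dropped.
order_by_codebook <- function(df, cols) {
  keep <- cols[cols %in% names(df)]
  df[, keep, drop = FALSE]
}

raw <- data.frame(rep = 1:2,
                  notes = c("a", "b"),
                  variety = c("AAC Wildfire", "variety_1"),
                  trial = "trial_x")

standardized <- order_by_codebook(raw, codebook_cols)
names(standardized)
```

The column notes is dropped because it is not in the hypothetical codebook, and the remaining columns follow the codebook order rather than the order in the raw file.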

First, load factcuratoR and point to the folder containing the main codebook, which must be named codebooks_all_db.csv for the validation functions to find it.

library(factcuratoR)
# dplyr supplies %>%, filter(), select() and relocate(), which are used below
library(dplyr)

rlang::check_installed("here")

codebook_folder <- here::here(
  "tests/testthat/test_controlled_vocab")

knitroutputfolder <- here::here("inst/extdata/intro_validation", "output")

Create some test data

test_data <- data.frame(location = c("Aberdeen", "Soda Springs", NA, "location_x"),
                        year = c(rep(2020, 3), NA),
                        variety = c("variety_1", "AAC Wildfire", "", NA),
                        rep_temp = 1:4,
                        sourcefile = "test")

Validate trial data

It may be easiest to assign the most cleaned-up version of the data to a new variable (e.g. df_validate below) so that the calls to the validation functions don’t need to be updated every time a more cleaned-up version of the data is available.

df_validate <- test_data

Check column names

The function validate_colnames() compares the variable names in the data with those in the codebook. The argument codebook_name = "trial_data" indicates which codebook in codebooks_all_db.csv should be used for this step.

colname_valid <- validate_colnames(df_validate, 
                                   codebook_name = "trial_data", 
                                   db_folder = codebook_folder)
#> Joining with `by = join_by(comment)`
#> Status:
#> # A tibble: 3 × 4
#>   comment                                 n required   req
#>   <chr>                               <int> <lgl>    <int>
#> 1 exists in both data and codebook        1 NA          NA
#> 2 not present in codebook: trial_data     4 NA          NA
#> 3 not present in data                    27 TRUE         6

The main goal of validating column names is to bring the req value in row 3 of the status table (the number of required columns that are not present in the data) to zero. A value of zero indicates that all the required columns are present in the data. The second row of the summary counts the columns in the data that are not in the codebook.
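The three status categories amount to set comparisons between the column names in the data and those in the codebook. The base R sketch below reproduces that logic for the test data, using only the required trial_data columns listed in the full report further down. It is an illustration of the idea, not the implementation of validate_colnames().

```r
# Column names of the test data created above.
data_cols <- c("location", "year", "variety", "rep_temp", "sourcefile")

# Required trial_data columns, copied from the full report.
required_cols <- c("trial", "variety", "entry", "plot", "rep", "range", "row")

# Names found in both the data and the codebook.
in_both <- intersect(data_cols, required_cols)

# Required names that are missing from the data.
missing_required <- setdiff(required_cols, data_cols)

# Names in the data that the codebook does not know about.
not_in_codebook <- setdiff(data_cols, required_cols)

length(missing_required)
```

Six required columns are missing here, which corresponds to the req value of 6 in the status output. Curation of the column names is finished when missing_required is empty.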

If required columns appear to be missing in the initial check, they may be present in the data set under a different name. This is often the case for many variables, because collaborators each have their own names for common variables. When this happens, creating a file to rename columns en masse is the best solution. See facthelpeR for functions to help with renaming columns. Determining whether a variable is captured by the existing controlled vocabulary is a human decision based on your own knowledge and, when needed, consultation with other project participants. When in doubt, ask.
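Renaming en masse can be pictured as applying a lookup table of old and new names. The sketch below uses base R only. The rename_key table, the rename_by_key() helper, and the mappings in it are hypothetical examples and are not functions or data from facthelpeR.

```r
# Hypothetical lookup table: one row per collaborator-specific name.
rename_key <- data.frame(
  old_name = c("rep_temp", "cultivar"),
  new_name = c("rep", "variety")
)

# Hypothetical helper: rename every column that appears in the key
# and leave all other columns untouched.
rename_by_key <- function(df, key) {
  idx <- match(names(df), key$old_name)
  found <- !is.na(idx)
  names(df)[found] <- key$new_name[idx[found]]
  df
}

collab <- data.frame(cultivar = "AAC Wildfire", rep_temp = 1, year = 2020)
renamed <- rename_by_key(collab, rename_key)
names(renamed)
```

Keeping the key in a file (for example a CSV read with read.csv()) leaves a record of every renaming decision and lets the same key be reused across source files.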

If a required variable is truly missing, then you will need to find the data. Looking at annual reports is a good starting point, and you may have to eventually request this information from the collaborator if you cannot find it elsewhere.

Sometimes there are new variables that are not captured by the existing vocabulary. In that case, we need to decide whether the variable should be added to the database. The answer is usually “yes”, but the decision generally requires a discussion with project participants. If the decision is to add a new variable, add it to the controlled vocabulary using the formatting described in the Introduction to codebooks vignette.

Full report of validating column names

knitr::kable(colname_valid)
colname_data colname_codebook required col_num comment
NA trial TRUE 1 not present in data
NA entry TRUE 3 not present in data
NA plot TRUE 4 not present in data
NA range TRUE 6 not present in data
NA rep TRUE 5 not present in data
NA row TRUE 7 not present in data
NA heading_date NA NA not present in data
NA yield_bu_acre NA NA not present in data
NA yield_lb_plot NA NA not present in data
NA yield_corrected NA NA not present in data
NA stand NA NA not present in data
NA moisture NA NA not present in data
NA height NA NA not present in data
NA falling_number NA NA not present in data
NA grain_protein NA NA not present in data
NA test_weight NA NA not present in data
NA test_weight_cleaned NA NA not present in data
NA lodging NA NA not present in data
NA plump_percent_6_64 NA NA not present in data
NA thin_percent_5_64 NA NA not present in data
NA yield_g_plot NA NA not present in data
NA yield_lb_acre NA NA not present in data
NA yield_kg_ha NA NA not present in data
NA hessian_fly_percent NA NA not present in data
NA emergence NA NA not present in data
NA fdk_rep_xx NA NA not present in data
NA inf_spikelet_rep_xx NA NA not present in data
location NA NA NA not present in codebook: trial_data
year NA NA NA not present in codebook: trial_data
rep_temp NA NA NA not present in codebook: trial_data
sourcefile NA NA NA not present in codebook: trial_data
variety variety TRUE 2 exists in both data and codebook

To interact with the variable names that still need to be fixed

colname_valid_check <- colname_valid %>% 
  filter(comment == "not present in codebook: trial_data")

col_info <- find_col_info(df_validate, 
                       cols_check = colname_valid_check$colname_data, 
                       by_col = sourcefile)

knitr::kable(col_info)
n contained_in variable example
3 test location Aberdeen
4 test rep_temp 1
4 test sourcefile test
3 test year 2020

Check column contents

The function confront_data() is a wrapper around validate::confront() that checks that column contents match controlled vocabularies or are in the accepted value range.

Note: The argument blends = TRUE checks for blends in the variety column.

colcontent_valid <- confront_data(df_validate, 
                                  df_type = "trial_data",
                                  db_folder = codebook_folder)
#> Warning: Some issues left to resolve 
#> # A tibble: 2 × 5
#>   required fails   nNA error warning
#>   <lgl>    <int> <int> <int>   <int>
#> 1 TRUE         1     1     6       0
#> 2 NA           0     0    21       0

The summary output for validating column contents reports the number of columns that have fails (that is, columns with values that do not match the controlled vocabularies or are not within the accepted range) and the number of columns that have NA values. A column returns error = TRUE if it is not present in the data.

The goal of validating column contents is to achieve zero fails for all columns and, for required columns, zero NA values (so that each observation has a value).

Fails can be resolved either by correcting or standardizing the contents in the raw data or by adding new controlled vocabulary entries to the codebooks. For example, if the name “variety_1” is a real variety name, then it should be added to the cultivar codebook.
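To make the meaning of passes, fails, and nNA concrete, the base R sketch below counts them for the variety column of the test data and applies a range rule written in the same form as the rep expression in the full report. The one-entry vocab vector is an assumed stand-in for the cultivar codebook, and the rep_values vector is made up for illustration. This mimics the checks by hand and is not how confront_data() is implemented.

```r
variety <- c("variety_1", "AAC Wildfire", "", NA)

# Assumption: a one-entry stand-in for the cultivar codebook.
vocab <- "AAC Wildfire"

# Membership check that keeps NA as NA instead of counting it as a fail.
in_vocab <- ifelse(is.na(variety), NA, variety %in% vocab)

counts <- c(passes = sum(in_vocab, na.rm = TRUE),
            fails  = sum(!in_vocab, na.rm = TRUE),
            nNA    = sum(is.na(in_vocab)))
counts

# Range rule: whole numbers from 1 to 20, with a small numeric tolerance.
rep_values <- c(1, 4, 0, 2.5, 21)
in_range <- rep_values - 1 >= -1e-08 &
  rep_values - 20 <= 1e-08 &
  rep_values %% 1 == 0
in_range
```

Under this assumed vocabulary, "variety_1" and the empty string count as fails and the NA is counted separately, so the fix is either to correct those values in the raw data or to add a genuine variety name to the codebook.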

Full report of validating column contents

colcontent_summary <- colcontent_valid[["summary"]]
knitr::kable(colcontent_summary)
required name items passes fails nNA error warning expression crop_type
TRUE variety 4 1 2 1 FALSE FALSE variety %vin% db[["cultivar.csv"]][["variety"]] NA
TRUE trial 0 0 0 0 TRUE FALSE is.character(trial) NA
TRUE entry 0 0 0 0 TRUE FALSE entry - 0 >= -1e-08 & entry - 500 <= 1e-08 & entry%%1 == 0 NA
TRUE plot 0 0 0 0 TRUE FALSE plot - 1 >= -1e-08 & plot - 1e+06 <= 1e-08 & plot%%1 == 0 NA
TRUE range 0 0 0 0 TRUE FALSE range - 1 >= -1e-08 & range - 200 <= 1e-08 & range%%1 == 0 NA
TRUE rep 0 0 0 0 TRUE FALSE rep - 1 >= -1e-08 & rep - 20 <= 1e-08 & rep%%1 == 0 NA
TRUE row 0 0 0 0 TRUE FALSE row - 1 >= -1e-08 & row - 200 <= 1e-08 & row%%1 == 0 NA
NA heading_date 0 0 0 0 TRUE FALSE heading_date - 1 >= -1e-08 & heading_date - 300 <= 1e-08 & heading_date%%1 == 0 NA
NA yield_bu_acre 0 0 0 0 TRUE FALSE yield_bu_acre - 0 >= -1e-08 & yield_bu_acre - 500 <= 1e-08 NA
NA yield_lb_plot 0 0 0 0 TRUE FALSE yield_lb_plot - 0 >= -1e-08 & yield_lb_plot - 50 <= 1e-08 NA
NA yield_corrected 0 0 0 0 TRUE FALSE yield_corrected - 0 >= -1e-08 & yield_corrected - 500 <= 1e-08 NA
NA stand 0 0 0 0 TRUE FALSE stand - 0 >= -1e-08 & stand - 100 <= 1e-08 NA
NA moisture 0 0 0 0 TRUE FALSE moisture - 0 >= -1e-08 & moisture - 30 <= 1e-08 NA
NA height 0 0 0 0 TRUE FALSE height - 0 >= -1e-08 & height - 200 <= 1e-08 NA
NA falling_number 0 0 0 0 TRUE FALSE falling_number - 0 >= -1e-08 & falling_number - 600 <= 1e-08 & falling_number%%1 == 0 NA
NA grain_protein 0 0 0 0 TRUE FALSE grain_protein - 0 >= -1e-08 & grain_protein - 100 <= 1e-08 NA
NA test_weight 0 0 0 0 TRUE FALSE test_weight - 0 >= -1e-08 & test_weight - 100 <= 1e-08 NA
NA test_weight_cleaned 0 0 0 0 TRUE FALSE test_weight_cleaned - 0 >= -1e-08 & test_weight_cleaned - 100 <= 1e-08 NA
NA lodging 0 0 0 0 TRUE FALSE lodging - 0 >= -1e-08 & lodging - 100 <= 1e-08 NA
NA plump_percent_6_64 0 0 0 0 TRUE FALSE plump_percent_6_64 - 0 >= -1e-08 & plump_percent_6_64 - 100 <= 1e-08 NA
NA thin_percent_5_64 0 0 0 0 TRUE FALSE thin_percent_5_64 - 0 >= -1e-08 & thin_percent_5_64 - 100 <= 1e-08 NA
NA yield_g_plot 0 0 0 0 TRUE FALSE is.numeric(yield_g_plot) NA
NA yield_lb_acre 0 0 0 0 TRUE FALSE is.numeric(yield_lb_acre) NA
NA yield_kg_ha 0 0 0 0 TRUE FALSE is.numeric(yield_kg_ha) NA
NA hessian_fly_percent 0 0 0 0 TRUE FALSE hessian_fly_percent - 0 >= -1e-08 & hessian_fly_percent - 100 <= 1e-08 NA
NA emergence 0 0 0 0 TRUE FALSE emergence - 0 >= -1e-08 & emergence - 100 <= 1e-08 NA
NA fdk_rep_xx 0 0 0 0 TRUE FALSE fdk_rep_xx - 0 >= -1e-08 & fdk_rep_xx - 100 <= 1e-08 NA
NA inf_spikelet_rep_xx 0 0 0 0 TRUE FALSE inf_spikelet_rep_xx - 0 >= -1e-08 & inf_spikelet_rep_xx - 100 <= 1e-08 NA

To check for validation fails interactively

var <- c("variety")
colcontent_violate <- 
  validate::violating(df_validate, colcontent_valid[[2]][var]) %>% 
  relocate(matches(var))

knitr::kable(colcontent_violate)
   variety    location  year  rep_temp  sourcefile
1  variety_1  Aberdeen  2020  1         test
3             NA        2020  3         test

Validate metadata

The steps for validating the metadata are the same as for the trial data, except that the codebook is "trials_metadata" (passed as codebook_name to validate_colnames() and as df_type to confront_data()).

metadata_validate <- test_data

Check variable names

metadata_colname_valid <- 
  validate_colnames(
    metadata_validate, 
    "trials_metadata",
    db_folder = codebook_folder) %>%
    select(comment, colname_data, colname_codebook, required, col_num) 
#> Joining with `by = join_by(comment)`
#> Status:
#> # A tibble: 3 × 4
#>   comment                                      n required   req
#>   <chr>                                    <int> <lgl>    <int>
#> 1 exists in both data and codebook             2 NA          NA
#> 2 not present in codebook: trials_metadata     3 NA          NA
#> 3 not present in data                         25 TRUE        10

knitr::kable(metadata_colname_valid)
comment colname_data colname_codebook required col_num
not present in data NA trial TRUE 1
not present in data NA nursery TRUE 3
not present in data NA latitude TRUE 8
not present in data NA longitude TRUE 9
not present in data NA irrigation TRUE 10
not present in data NA planting_date TRUE 11
not present in data NA harvest_date TRUE 12
not present in data NA plot_length TRUE 14
not present in data NA plot_width TRUE 15
not present in data NA tillage TRUE 31
not present in data NA row_spacing_in FALSE 16
not present in data NA npks_lb_acre FALSE 17
not present in data NA chemical_trts FALSE 18
not present in data NA seed_rate_per_acre FALSE 19
not present in data NA seed_trt FALSE 20
not present in data NA agronomic_notes FALSE 21
not present in data NA soil_type FALSE 23
not present in data NA soil_om FALSE 24
not present in data NA ph FALSE 25
not present in data NA n_lbs_acre FALSE 26
not present in data NA p_ppm FALSE 27
not present in data NA k_ppm FALSE 28
not present in data NA s_ppm FALSE 29
not present in data NA free_lime_pct FALSE 30
not present in data NA unpublished FALSE 33
not present in codebook: trials_metadata variety NA NA NA
not present in codebook: trials_metadata rep_temp NA NA NA
not present in codebook: trials_metadata sourcefile NA NA NA
exists in both data and codebook location location TRUE 6
exists in both data and codebook year year TRUE 4

Check variable values

metadata_colcontent_valid <- confront_data(metadata_validate, 
                                           df_type = "trials_metadata",
                                           db_folder = codebook_folder)
#> Warning: Some issues left to resolve 
#> # A tibble: 2 × 5
#>   required fails   nNA error warning
#>   <lgl>    <int> <int> <int>   <int>
#> 1 FALSE        0     0    15       0
#> 2 TRUE         1     2    10       0

metadata_colcontent_summary <- metadata_colcontent_valid[["summary"]]

knitr::kable(metadata_colcontent_summary)
required name items passes fails nNA error warning expression
TRUE year 4 3 0 1 FALSE FALSE year - 1995 >= -1e-08 & year - 2024 <= 1e-08 & year%%1 == 0
TRUE location 4 2 1 1 FALSE FALSE location %vin% db[["locations.csv"]][["location"]]
TRUE trial 0 0 0 0 TRUE FALSE is.character(trial)
TRUE nursery 0 0 0 0 TRUE FALSE nursery %vin% db[["nursery.csv"]][["nursery"]]
TRUE latitude 0 0 0 0 TRUE FALSE latitude - 42 >= -1e-08 & latitude - 49 <= 1e-08
TRUE longitude 0 0 0 0 TRUE FALSE longitude - -124.77 >= -1e-08 & longitude - -111.05 <= 1e-08
TRUE irrigation 0 0 0 0 TRUE FALSE irrigation %vin% c("irrigated", "dryland")
TRUE planting_date 0 0 0 0 TRUE FALSE grepl("[0-9]{4}-|/[0-9]{2}-|/[0-9]{2}", planting_date)
TRUE harvest_date 0 0 0 0 TRUE FALSE grepl("[0-9]{4}-|/[0-9]{2}-|/[0-9]{2}", harvest_date)
TRUE plot_length 0 0 0 0 TRUE FALSE plot_length - 5 >= -1e-08 & plot_length - 200 <= 1e-08 | plot_length %vin% c(-9)
TRUE plot_width 0 0 0 0 TRUE FALSE plot_width - 2 >= -1e-08 & plot_width - 50 <= 1e-08 | plot_width %vin% c(-9)
TRUE tillage 0 0 0 0 TRUE FALSE tillage %vin% c("conventional", "no till", "conservation")
FALSE row_spacing_in 0 0 0 0 TRUE FALSE row_spacing_in - 1 >= -1e-08 & row_spacing_in - 50 <= 1e-08 | row_spacing_in %vin% c(-9)
FALSE npks_lb_acre 0 0 0 0 TRUE FALSE is.character(npks_lb_acre)
FALSE chemical_trts 0 0 0 0 TRUE FALSE nchar(as.character(chemical_trts)) <= 20
FALSE seed_rate_per_acre 0 0 0 0 TRUE FALSE seed_rate_per_acre %vin% c(7e+05, 1e+06)
FALSE seed_trt 0 0 0 0 TRUE FALSE seed_trt %vin% c(TRUE, FALSE)
FALSE agronomic_notes 0 0 0 0 TRUE FALSE nchar(as.character(agronomic_notes)) <= 500
FALSE soil_type 0 0 0 0 TRUE FALSE nchar(as.character(soil_type)) <= 50
FALSE soil_om 0 0 0 0 TRUE FALSE soil_om - 0 >= -1e-08 & soil_om - 100 <= 1e-08
FALSE ph 0 0 0 0 TRUE FALSE ph - 0 >= -1e-08 & ph - 14 <= 1e-08
FALSE n_lbs_acre 0 0 0 0 TRUE FALSE n_lbs_acre - 0 >= -1e-08 & n_lbs_acre - 500 <= 1e-08
FALSE p_ppm 0 0 0 0 TRUE FALSE p_ppm - 0 >= -1e-08 & p_ppm - 200 <= 1e-08
FALSE k_ppm 0 0 0 0 TRUE FALSE k_ppm - 0 >= -1e-08 & k_ppm - 1000 <= 1e-08
FALSE s_ppm 0 0 0 0 TRUE FALSE s_ppm - 0 >= -1e-08 & s_ppm - 200 <= 1e-08
FALSE free_lime_pct 0 0 0 0 TRUE FALSE free_lime_pct - 0 >= -1e-08 & free_lime_pct - 100 <= 1e-08
FALSE unpublished 0 0 0 0 TRUE FALSE unpublished %vin% c(TRUE, FALSE)

To check for validation fails interactively

metadata_var <- c("location")
metadata_colcontent_violate <- 
  validate::violating(metadata_validate, metadata_colcontent_valid[[2]][metadata_var]) %>% 
  relocate(matches(metadata_var))

knitr::kable(metadata_colcontent_violate)
   location    year  variety  rep_temp  sourcefile
4  location_x  NA    NA       4         test