Introduction to the validation functions
Jacqueline Tay
Nov 12, 2024
Source: vignettes/intro_validation.Rmd
Introduction
factcuratoR provides sets of functions that help standardize and curate variety testing data sets for the WAVE project.
The goal of the WAVE program curation is to generate trial data and trial metadata that conform to the controlled vocabulary codebooks.
The codebooks specify the variables required for each file (each as a single column in a data frame, matrix, tibble, etc.) and the accepted values for each variable. See the Introduction to codebooks vignette for more information on the structure and formatting of the codebooks.
Overview of validation functions
The validation functions check that the data conforms to the controlled vocabulary codebooks. There are functions to validate:
- variable names. The functions return tables for the curator to check that required variables are present and that names are standardized.
- variable values. The functions return tables for the curator to check that values for a single variable match controlled vocabularies or are in the accepted value range.
Finally, standardize_cols_by_cb() enables curators to standardize the files (select and order columns) according to the standards established in the codebooks.
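As a rough illustration of what "select and order columns" means, the base R sketch below keeps only the columns named in a codebook and arranges them in codebook order. The codebook_cols vector and the raw data frame are hypothetical, and this is not the implementation of standardize_cols_by_cb().

```r
# Hypothetical codebook column order (illustrative only, not read from codebooks_all_db.csv)
codebook_cols <- c("trial", "variety", "entry", "plot", "rep")

# Hypothetical raw data with an extra column and a different column order
raw <- data.frame(
  rep = 1:2,
  notes = c("a", "b"),
  variety = c("AAC Wildfire", "variety_1"),
  trial = c("t1", "t1")
)

# intersect() keeps the order of its first argument, so the result follows the codebook
keep <- intersect(codebook_cols, names(raw))
standardized <- raw[, keep, drop = FALSE]

names(standardized)
```

Columns that are in the data but not in the codebook (notes here) are dropped, and codebook columns missing from the data are simply absent from the result.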
First, load factcuratoR and point to the folder containing the main codebook, which must be named codebooks_all_db.csv for the validation functions to find it.
library(factcuratoR)
rlang::check_installed("here")
codebook_folder <- here::here(
"tests/testthat/test_controlled_vocab")
knitroutputfolder <- here::here("inst/extdata/intro_validation", "output")
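Because the validation functions look for a file with that exact name, it can help to confirm the file is present before going further. The sketch below is one way to do that in base R. It uses a temporary stand-in folder and a placeholder file so that it runs anywhere; demo_folder and codebook_path are hypothetical names.

```r
# Stand-in folder so the sketch runs anywhere; in practice, use codebook_folder from above
demo_folder <- file.path(tempdir(), "demo_codebooks")
dir.create(demo_folder, showWarnings = FALSE)

# Placeholder file standing in for the real main codebook
codebook_path <- file.path(demo_folder, "codebooks_all_db.csv")
writeLines("placeholder", codebook_path)

# Fail early with a clear message if the file name or location is wrong
codebook_found <- file.exists(codebook_path)
if (!codebook_found) {
  stop("codebooks_all_db.csv not found in: ", demo_folder)
}
```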
Create some test data
test_data <- data.frame(location = c("Aberdeen", "Soda Springs", NA, "location_x"),
year = c(rep(2020, 3), NA),
variety = c("variety_1", "AAC Wildfire", "", NA),
rep_temp = 1:4,
sourcefile = "test")
Validate trial data
It may be easiest to assign the most cleaned-up version of the data to a new variable (e.g. df_validate below) so that the calls to the validation functions don’t need to be updated every time there is a more cleaned-up version of the data.
df_validate <- test_data
Check column names
The function validate_colnames() compares the variable names in the data against those in the codebook. The argument codebook_name = "trial_data" indicates which codebook in codebooks_all_db.csv should be used for this step.
colname_valid <- validate_colnames(df_validate,
codebook_name = "trial_data",
db_folder = codebook_folder)
#> Joining with `by = join_by(comment)`
#> Status:
#> # A tibble: 3 × 4
#>   comment                                 n required   req
#>   <chr>                               <int> <lgl>    <int>
#> 1 exists in both data and codebook        1 NA          NA
#> 2 not present in codebook: trial_data     4 NA          NA
#> 3 not present in data                    27 TRUE         6
The main goal of validating column names is to bring the value in the req column of row 3 (not present in data) down to zero. A value of zero indicates that all the required columns are present in the data. The second row in the summary highlights columns in the data that are not in the codebook.
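The logic behind the three summary rows can be sketched with base R set operations. The codebook excerpt below is hypothetical and much shorter than the real trial_data codebook, and the code only illustrates the comparison; it is not how validate_colnames() is implemented.

```r
# Hypothetical codebook excerpt: variable names and whether each is required
codebook <- data.frame(
  variable = c("trial", "variety", "entry", "heading_date"),
  required = c(TRUE, TRUE, TRUE, NA)
)

# Column names of the test data used in this vignette
data_cols <- c("location", "year", "variety", "rep_temp", "sourcefile")

in_both       <- intersect(data_cols, codebook$variable)
only_data     <- setdiff(data_cols, codebook$variable)
only_codebook <- setdiff(codebook$variable, data_cols)

# Number of required codebook variables that are missing from the data
n_required_missing <- sum(
  codebook$required[codebook$variable %in% only_codebook],
  na.rm = TRUE
)

status <- data.frame(
  comment = c("exists in both data and codebook",
              "not present in codebook",
              "not present in data"),
  n = c(length(in_both), length(only_data), length(only_codebook)),
  req = c(NA, NA, n_required_missing)
)
status
```

In this sketch the curation target is n_required_missing reaching zero.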
If a required column appears to be missing in the initial check, it may be present in the data set under a different name. This is often the case, because collaborators each have their own names for common variables. When this happens, creating a file to rename columns en masse is the best solution. See facthelpeR for functions to help with renaming columns. Determining whether a variable is captured by the existing controlled vocabulary is a human decision based on your own knowledge and, when needed, consultation with other project participants. When in doubt, ask.
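A rename file is essentially a two-column lookup table from collaborator names to codebook names. The base R sketch below shows the idea with a hypothetical rename_key and a hypothetical collaborator data frame; it does not use the facthelpeR functions, whose interface is not shown here.

```r
# Hypothetical rename table: collaborator column names mapped to codebook names.
# In practice this could be kept in a csv file and read with read.csv().
rename_key <- data.frame(
  old = c("Loc", "Yr", "cultivar"),
  new = c("location", "year", "variety")
)

# Hypothetical collaborator data with non-standard column names
collab <- data.frame(Loc = "Aberdeen", Yr = 2020, cultivar = "AAC Wildfire", rep = 1)

# Look up each column name in the key; names without a match are left unchanged
idx <- match(names(collab), rename_key$old)
names(collab)[!is.na(idx)] <- rename_key$new[idx[!is.na(idx)]]

names(collab)
```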
If a required variable is truly missing, then you will need to find the data. Looking at annual reports is a good starting point, and you may have to eventually request this information from the collaborator if you cannot find it elsewhere.
Sometimes there are also new variables not captured by the existing vocabulary. In that case, we need to decide whether the variable should be added to the database. The answer is usually “yes”, but the decision generally calls for a discussion with project participants. If the decision is to add a new variable, add it to the controlled vocabulary using the formatting described in the Introduction to codebooks vignette.
Full report of validating column names
knitr::kable(colname_valid)
colname_data | colname_codebook | required | col_num | comment |
---|---|---|---|---|
NA | trial | TRUE | 1 | not present in data |
NA | entry | TRUE | 3 | not present in data |
NA | plot | TRUE | 4 | not present in data |
NA | range | TRUE | 6 | not present in data |
NA | rep | TRUE | 5 | not present in data |
NA | row | TRUE | 7 | not present in data |
NA | heading_date | NA | NA | not present in data |
NA | yield_bu_acre | NA | NA | not present in data |
NA | yield_lb_plot | NA | NA | not present in data |
NA | yield_corrected | NA | NA | not present in data |
NA | stand | NA | NA | not present in data |
NA | moisture | NA | NA | not present in data |
NA | height | NA | NA | not present in data |
NA | falling_number | NA | NA | not present in data |
NA | grain_protein | NA | NA | not present in data |
NA | test_weight | NA | NA | not present in data |
NA | test_weight_cleaned | NA | NA | not present in data |
NA | lodging | NA | NA | not present in data |
NA | plump_percent_6_64 | NA | NA | not present in data |
NA | thin_percent_5_64 | NA | NA | not present in data |
NA | yield_g_plot | NA | NA | not present in data |
NA | yield_lb_acre | NA | NA | not present in data |
NA | yield_kg_ha | NA | NA | not present in data |
NA | hessian_fly_percent | NA | NA | not present in data |
NA | emergence | NA | NA | not present in data |
NA | fdk_rep_xx | NA | NA | not present in data |
NA | inf_spikelet_rep_xx | NA | NA | not present in data |
location | NA | NA | NA | not present in codebook: trial_data |
year | NA | NA | NA | not present in codebook: trial_data |
rep_temp | NA | NA | NA | not present in codebook: trial_data |
sourcefile | NA | NA | NA | not present in codebook: trial_data |
variety | variety | TRUE | 2 | exists in both data and codebook |
To inspect the variable names that still need to be fixed:
library(dplyr) # provides %>% and filter(); not needed if already attached
colname_valid_check <- colname_valid %>%
filter(comment == "not present in codebook: trial_data")
col_info <- find_col_info(df_validate,
cols_check = colname_valid_check$colname_data,
by_col = sourcefile)
knitr::kable(col_info)
n | contained_in | variable | example |
---|---|---|---|
3 | test | location | Aberdeen |
4 | test | rep_temp | 1 |
4 | test | sourcefile | test |
3 | test | year | 2020 |
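A similar per-column summary can be approximated in base R. The sketch below counts values and pulls one example per column. It assumes that n is the number of non-missing values, which matches the counts shown above but is not confirmed here, and it is not the implementation of find_col_info().

```r
# Subset of the test data from this vignette
test_data <- data.frame(
  location = c("Aberdeen", "Soda Springs", NA, "location_x"),
  year = c(rep(2020, 3), NA),
  rep_temp = 1:4,
  sourcefile = "test"
)
cols_check <- c("location", "year", "rep_temp")

# For each column: count non-missing values and keep the first one as an example
col_summary <- do.call(rbind, lapply(cols_check, function(col) {
  values <- test_data[[col]]
  present <- !is.na(values)
  data.frame(
    variable = col,
    n = sum(present),
    example = as.character(values[present][1])
  )
}))
col_summary
```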
Check column contents
The function confront_data() is a wrapper around validate::confront() that checks that column contents match controlled vocabularies or are in the accepted value range.
Note: The argument blends = TRUE will check for blends in the variety column.
colcontent_valid <- confront_data(df_validate,
df_type = "trial_data",
db_folder = codebook_folder)
#> Warning: Some issues left to resolve
#> # A tibble: 2 × 5
#> required fails nNA error warning
#> <lgl> <int> <int> <int> <int>
#> 1 TRUE 1 1 6 0
#> 2 NA 0 0 21 0
The summary output for validating column contents reports the number of columns that have fails (that is, columns with values that do not match the controlled vocabulary or fall outside the accepted range) and the number of columns that have NA values. A column returns error = TRUE if it is not present in the data.
The goal of validating column contents is to achieve zero fails for all columns and for required columns, achieve zero NA values (so that each observation has a value).
Fails can be resolved either by correcting or standardizing the contents of the raw data or by adding new controlled vocabulary to the codebooks. For example, if the name “variety_1” is a real variety name, then it should be added to the cultivar codebook.
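To make the passes, fails, and NA counts concrete, the base R sketch below checks the variety column of the test data against a vocabulary, before and after "variety_1" is added. The vocabulary vectors and the count_results() helper are hypothetical; the real vocabulary lives in the cultivar codebook and the real check is run by confront_data().

```r
# Variety column from the test data in this vignette
variety <- c("variety_1", "AAC Wildfire", "", NA)

# Hypothetical helper: count passes, fails, and NA results for a vocabulary check
count_results <- function(values, vocab) {
  # Missing values are reported as NA rather than as a fail
  in_vocab <- ifelse(is.na(values), NA, values %in% vocab)
  c(
    items  = length(values),
    passes = sum(in_vocab, na.rm = TRUE),
    fails  = sum(!in_vocab, na.rm = TRUE),
    nNA    = sum(is.na(in_vocab))
  )
}

# Hypothetical vocabularies before and after adding "variety_1"
before <- count_results(variety, c("AAC Wildfire"))
after  <- count_results(variety, c("AAC Wildfire", "variety_1"))

before
after
```

Adding the name to the vocabulary removes one fail. The empty string still fails and the NA still counts as missing, so both have to be fixed in the data itself.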
Full report of validating column contents
colcontent_summary <- colcontent_valid[["summary"]]
knitr::kable(colcontent_summary)
required | name | items | passes | fails | nNA | error | warning | expression | crop_type |
---|---|---|---|---|---|---|---|---|---|
TRUE | variety | 4 | 1 | 2 | 1 | FALSE | FALSE | variety %vin% db[["cultivar.csv"]][["variety"]] | NA |
TRUE | trial | 0 | 0 | 0 | 0 | TRUE | FALSE | is.character(trial) | NA |
TRUE | entry | 0 | 0 | 0 | 0 | TRUE | FALSE | entry - 0 >= -1e-08 & entry - 500 <= 1e-08 & entry%%1 == 0 | NA |
TRUE | plot | 0 | 0 | 0 | 0 | TRUE | FALSE | plot - 1 >= -1e-08 & plot - 1e+06 <= 1e-08 & plot%%1 == 0 | NA |
TRUE | range | 0 | 0 | 0 | 0 | TRUE | FALSE | range - 1 >= -1e-08 & range - 200 <= 1e-08 & range%%1 == 0 | NA |
TRUE | rep | 0 | 0 | 0 | 0 | TRUE | FALSE | rep - 1 >= -1e-08 & rep - 20 <= 1e-08 & rep%%1 == 0 | NA |
TRUE | row | 0 | 0 | 0 | 0 | TRUE | FALSE | row - 1 >= -1e-08 & row - 200 <= 1e-08 & row%%1 == 0 | NA |
NA | heading_date | 0 | 0 | 0 | 0 | TRUE | FALSE | heading_date - 1 >= -1e-08 & heading_date - 300 <= 1e-08 & heading_date%%1 == 0 | NA |
NA | yield_bu_acre | 0 | 0 | 0 | 0 | TRUE | FALSE | yield_bu_acre - 0 >= -1e-08 & yield_bu_acre - 500 <= 1e-08 | NA |
NA | yield_lb_plot | 0 | 0 | 0 | 0 | TRUE | FALSE | yield_lb_plot - 0 >= -1e-08 & yield_lb_plot - 50 <= 1e-08 | NA |
NA | yield_corrected | 0 | 0 | 0 | 0 | TRUE | FALSE | yield_corrected - 0 >= -1e-08 & yield_corrected - 500 <= 1e-08 | NA |
NA | stand | 0 | 0 | 0 | 0 | TRUE | FALSE | stand - 0 >= -1e-08 & stand - 100 <= 1e-08 | NA |
NA | moisture | 0 | 0 | 0 | 0 | TRUE | FALSE | moisture - 0 >= -1e-08 & moisture - 30 <= 1e-08 | NA |
NA | height | 0 | 0 | 0 | 0 | TRUE | FALSE | height - 0 >= -1e-08 & height - 200 <= 1e-08 | NA |
NA | falling_number | 0 | 0 | 0 | 0 | TRUE | FALSE | falling_number - 0 >= -1e-08 & falling_number - 600 <= 1e-08 & falling_number%%1 == 0 | NA |
NA | grain_protein | 0 | 0 | 0 | 0 | TRUE | FALSE | grain_protein - 0 >= -1e-08 & grain_protein - 100 <= 1e-08 | NA |
NA | test_weight | 0 | 0 | 0 | 0 | TRUE | FALSE | test_weight - 0 >= -1e-08 & test_weight - 100 <= 1e-08 | NA |
NA | test_weight_cleaned | 0 | 0 | 0 | 0 | TRUE | FALSE | test_weight_cleaned - 0 >= -1e-08 & test_weight_cleaned - 100 <= 1e-08 | NA |
NA | lodging | 0 | 0 | 0 | 0 | TRUE | FALSE | lodging - 0 >= -1e-08 & lodging - 100 <= 1e-08 | NA |
NA | plump_percent_6_64 | 0 | 0 | 0 | 0 | TRUE | FALSE | plump_percent_6_64 - 0 >= -1e-08 & plump_percent_6_64 - 100 <= 1e-08 | NA |
NA | thin_percent_5_64 | 0 | 0 | 0 | 0 | TRUE | FALSE | thin_percent_5_64 - 0 >= -1e-08 & thin_percent_5_64 - 100 <= 1e-08 | NA |
NA | yield_g_plot | 0 | 0 | 0 | 0 | TRUE | FALSE | is.numeric(yield_g_plot) | NA |
NA | yield_lb_acre | 0 | 0 | 0 | 0 | TRUE | FALSE | is.numeric(yield_lb_acre) | NA |
NA | yield_kg_ha | 0 | 0 | 0 | 0 | TRUE | FALSE | is.numeric(yield_kg_ha) | NA |
NA | hessian_fly_percent | 0 | 0 | 0 | 0 | TRUE | FALSE | hessian_fly_percent - 0 >= -1e-08 & hessian_fly_percent - 100 <= 1e-08 | NA |
NA | emergence | 0 | 0 | 0 | 0 | TRUE | FALSE | emergence - 0 >= -1e-08 & emergence - 100 <= 1e-08 | NA |
NA | fdk_rep_xx | 0 | 0 | 0 | 0 | TRUE | FALSE | fdk_rep_xx - 0 >= -1e-08 & fdk_rep_xx - 100 <= 1e-08 | NA |
NA | inf_spikelet_rep_xx | 0 | 0 | 0 | 0 | TRUE | FALSE | inf_spikelet_rep_xx - 0 >= -1e-08 & inf_spikelet_rep_xx - 100 <= 1e-08 | NA |
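Most of the numeric expressions in the table follow one pattern: a lower bound and an upper bound, each compared with a small tolerance of 1e-08, plus x %% 1 == 0 where whole numbers are expected. The sketch below wraps that pattern in a hypothetical helper, in_integer_range(), using the bounds shown for rep (1 to 20). The tolerance presumably guards against floating-point rounding; that reading is an assumption.

```r
# Hypothetical helper reproducing the pattern of the table expressions
in_integer_range <- function(x, lower, upper, tol = 1e-08) {
  x - lower >= -tol & x - upper <= tol & x %% 1 == 0
}

# Values at the bounds, outside the bounds, non-integer, and missing
rep_values <- c(1, 20, 0, 21, 2.5, NA)
rep_ok <- in_integer_range(rep_values, lower = 1, upper = 20)
rep_ok

# Why a tolerance helps: floating-point sums can overshoot a bound very slightly
exact_compare    <- (0.1 + 0.2) <= 0.3
tolerant_compare <- (0.1 + 0.2) - 0.3 <= 1e-08
```

A missing value yields NA rather than TRUE or FALSE, which is why NA values are counted separately from fails.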
Validate metadata
The steps for validating the metadata are the same as for validating the trial data, except that the codebook is set to "trials_metadata" (through codebook_name in validate_colnames() and df_type in confront_data()).
metadata_validate <- test_data
Check variable names
metadata_colname_valid <-
validate_colnames(
metadata_validate,
"trials_metadata",
db_folder = codebook_folder) %>%
select(comment, colname_data, colname_codebook, required, col_num)
#> Joining with `by = join_by(comment)`
#> Status:
#> # A tibble: 3 × 4
#>   comment                                      n required   req
#>   <chr>                                    <int> <lgl>    <int>
#> 1 exists in both data and codebook             2 NA          NA
#> 2 not present in codebook: trials_metadata     3 NA          NA
#> 3 not present in data                         25 TRUE        10
knitr::kable(metadata_colname_valid)
comment | colname_data | colname_codebook | required | col_num |
---|---|---|---|---|
not present in data | NA | trial | TRUE | 1 |
not present in data | NA | nursery | TRUE | 3 |
not present in data | NA | latitude | TRUE | 8 |
not present in data | NA | longitude | TRUE | 9 |
not present in data | NA | irrigation | TRUE | 10 |
not present in data | NA | planting_date | TRUE | 11 |
not present in data | NA | harvest_date | TRUE | 12 |
not present in data | NA | plot_length | TRUE | 14 |
not present in data | NA | plot_width | TRUE | 15 |
not present in data | NA | tillage | TRUE | 31 |
not present in data | NA | row_spacing_in | FALSE | 16 |
not present in data | NA | npks_lb_acre | FALSE | 17 |
not present in data | NA | chemical_trts | FALSE | 18 |
not present in data | NA | seed_rate_per_acre | FALSE | 19 |
not present in data | NA | seed_trt | FALSE | 20 |
not present in data | NA | agronomic_notes | FALSE | 21 |
not present in data | NA | soil_type | FALSE | 23 |
not present in data | NA | soil_om | FALSE | 24 |
not present in data | NA | ph | FALSE | 25 |
not present in data | NA | n_lbs_acre | FALSE | 26 |
not present in data | NA | p_ppm | FALSE | 27 |
not present in data | NA | k_ppm | FALSE | 28 |
not present in data | NA | s_ppm | FALSE | 29 |
not present in data | NA | free_lime_pct | FALSE | 30 |
not present in data | NA | unpublished | FALSE | 33 |
not present in codebook: trials_metadata | variety | NA | NA | NA |
not present in codebook: trials_metadata | rep_temp | NA | NA | NA |
not present in codebook: trials_metadata | sourcefile | NA | NA | NA |
exists in both data and codebook | location | location | TRUE | 6 |
exists in both data and codebook | year | year | TRUE | 4 |
Check variable values
metadata_colcontent_valid <- confront_data(metadata_validate,
df_type = "trials_metadata",
db_folder = codebook_folder)
#> Warning: Some issues left to resolve
#> # A tibble: 2 × 5
#> required fails nNA error warning
#> <lgl> <int> <int> <int> <int>
#> 1 FALSE 0 0 15 0
#> 2 TRUE 1 2 10 0
metadata_colcontent_summary <- metadata_colcontent_valid[["summary"]]
knitr::kable(metadata_colcontent_summary)
required | name | items | passes | fails | nNA | error | warning | expression |
---|---|---|---|---|---|---|---|---|
TRUE | year | 4 | 3 | 0 | 1 | FALSE | FALSE | year - 1995 >= -1e-08 & year - 2024 <= 1e-08 & year%%1 == 0 |
TRUE | location | 4 | 2 | 1 | 1 | FALSE | FALSE | location %vin% db[["locations.csv"]][["location"]] |
TRUE | trial | 0 | 0 | 0 | 0 | TRUE | FALSE | is.character(trial) |
TRUE | nursery | 0 | 0 | 0 | 0 | TRUE | FALSE | nursery %vin% db[["nursery.csv"]][["nursery"]] |
TRUE | latitude | 0 | 0 | 0 | 0 | TRUE | FALSE | latitude - 42 >= -1e-08 & latitude - 49 <= 1e-08 |
TRUE | longitude | 0 | 0 | 0 | 0 | TRUE | FALSE | longitude - -124.77 >= -1e-08 & longitude - -111.05 <= 1e-08 |
TRUE | irrigation | 0 | 0 | 0 | 0 | TRUE | FALSE | irrigation %vin% c("irrigated", "dryland") |
TRUE | planting_date | 0 | 0 | 0 | 0 | TRUE | FALSE | grepl("[0-9]{4}-|/[0-9]{2}-|/[0-9]{2}", planting_date) |
TRUE | harvest_date | 0 | 0 | 0 | 0 | TRUE | FALSE | grepl("[0-9]{4}-|/[0-9]{2}-|/[0-9]{2}", harvest_date) |
TRUE | plot_length | 0 | 0 | 0 | 0 | TRUE | FALSE | plot_length - 5 >= -1e-08 & plot_length - 200 <= 1e-08 | plot_length %vin% c(-9) |
TRUE | plot_width | 0 | 0 | 0 | 0 | TRUE | FALSE | plot_width - 2 >= -1e-08 & plot_width - 50 <= 1e-08 | plot_width %vin% c(-9) |
TRUE | tillage | 0 | 0 | 0 | 0 | TRUE | FALSE | tillage %vin% c("conventional", "no till", "conservation") |
FALSE | row_spacing_in | 0 | 0 | 0 | 0 | TRUE | FALSE | row_spacing_in - 1 >= -1e-08 & row_spacing_in - 50 <= 1e-08 | row_spacing_in %vin% c(-9) |
FALSE | npks_lb_acre | 0 | 0 | 0 | 0 | TRUE | FALSE | is.character(npks_lb_acre) |
FALSE | chemical_trts | 0 | 0 | 0 | 0 | TRUE | FALSE | nchar(as.character(chemical_trts)) <= 20 |
FALSE | seed_rate_per_acre | 0 | 0 | 0 | 0 | TRUE | FALSE | seed_rate_per_acre %vin% c(7e+05, 1e+06) |
FALSE | seed_trt | 0 | 0 | 0 | 0 | TRUE | FALSE | seed_trt %vin% c(TRUE, FALSE) |
FALSE | agronomic_notes | 0 | 0 | 0 | 0 | TRUE | FALSE | nchar(as.character(agronomic_notes)) <= 500 |
FALSE | soil_type | 0 | 0 | 0 | 0 | TRUE | FALSE | nchar(as.character(soil_type)) <= 50 |
FALSE | soil_om | 0 | 0 | 0 | 0 | TRUE | FALSE | soil_om - 0 >= -1e-08 & soil_om - 100 <= 1e-08 |
FALSE | ph | 0 | 0 | 0 | 0 | TRUE | FALSE | ph - 0 >= -1e-08 & ph - 14 <= 1e-08 |
FALSE | n_lbs_acre | 0 | 0 | 0 | 0 | TRUE | FALSE | n_lbs_acre - 0 >= -1e-08 & n_lbs_acre - 500 <= 1e-08 |
FALSE | p_ppm | 0 | 0 | 0 | 0 | TRUE | FALSE | p_ppm - 0 >= -1e-08 & p_ppm - 200 <= 1e-08 |
FALSE | k_ppm | 0 | 0 | 0 | 0 | TRUE | FALSE | k_ppm - 0 >= -1e-08 & k_ppm - 1000 <= 1e-08 |
FALSE | s_ppm | 0 | 0 | 0 | 0 | TRUE | FALSE | s_ppm - 0 >= -1e-08 & s_ppm - 200 <= 1e-08 |
FALSE | free_lime_pct | 0 | 0 | 0 | 0 | TRUE | FALSE | free_lime_pct - 0 >= -1e-08 & free_lime_pct - 100 <= 1e-08 |
FALSE | unpublished | 0 | 0 | 0 | 0 | TRUE | FALSE | unpublished %vin% c(TRUE, FALSE) |
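In the expressions for plot_length, plot_width, and row_spacing_in, the | before %vin% c(-9) is R's OR operator, not a table column separator: a value passes if it is within the range or equals -9. The sketch below reproduces the plot_length rule in base R, using %in% in place of %vin%. The value -9 is presumably a code for an unknown measurement; that interpretation is an assumption.

```r
# Hypothetical plot lengths: in range, out of range, and the special value -9
plot_length <- c(5, 20, 200, 250, -9, 3)

# Range check with tolerance, OR the special value -9.
# & binds tighter than | in R, so the parentheses only make the grouping explicit.
plot_length_ok <- (plot_length - 5 >= -1e-08 & plot_length - 200 <= 1e-08) |
  plot_length %in% c(-9)

plot_length_ok
```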
# Check for any fails interactively:
metadata_var <- c("location")
metadata_colcontent_violate <-
validate::violating(test_data, metadata_colcontent_valid[[2]][metadata_var]) %>%
relocate(matches(var))
knitr::kable(metadata_colcontent_violate)
  | variety | location | year | rep_temp | sourcefile |
---|---|---|---|---|---|
4 | NA | location_x | NA | 4 | test |
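The same kind of row lookup can be sketched in base R, without the validate package, by keeping the rows whose location is neither missing nor in the vocabulary. The location_vocab vector is hypothetical; the real vocabulary comes from the locations codebook.

```r
# Hypothetical vocabulary of accepted locations
location_vocab <- c("Aberdeen", "Soda Springs")

# Test data from this vignette
test_data <- data.frame(
  location = c("Aberdeen", "Soda Springs", NA, "location_x"),
  year = c(rep(2020, 3), NA),
  variety = c("variety_1", "AAC Wildfire", "", NA),
  rep_temp = 1:4,
  sourcefile = "test"
)

# A row fails when the value is present but not in the vocabulary;
# rows with a missing location are not treated as fails here
fails <- !is.na(test_data$location) & !(test_data$location %in% location_vocab)
violating_rows <- test_data[fails, ]

violating_rows
```

Only row 4 is returned. Row 3 has a missing location, so it shows up in the NA count rather than among the fails.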