# Data assertion and checks via testthat

by Hunter Glanz and Nicholas Horton

## Reproducibility and Replicability

On May 7, 2019, the National Academies of Sciences, Engineering, and Medicine published the article “New report examines reproducibility and replicability in science.” The report recommends “ways that researchers, academic institutions, journals, and funders should help strengthen rigor and transparency in order to improve the reproducibility and replicability of scientific research.”

Reproducibility is at the core of data acumen and needs to be stressed at all levels of the data science curriculum. In today’s entry, we discuss regression testing in the context of data analysis. (Here “regression” refers to returning to a former or less developed state, not to a statistical model.) Typically, software testing is used to ensure that programmatic outputs remain consistent as updates and changes are made to a software package. In our context, we describe how the testthat package in R can help verify assumptions about the underlying data as a minimal standard for ensuring that analyses using those data are appropriate.

## The testthat package

The goal of the testthat package is to make testing R code less painful and tedious by

• providing functions that make it easy to describe what you expect a function to do, including catching errors, warnings, and messages;

• integrating into your existing workflow, whether it’s informal testing on the command line, building test suites, or using R CMD check;

• displaying test progress visually, showing a pass, fail, or error for every expectation, possibly in color.
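As a minimal sketch of the first point, the block below tests a small function for its return value, an error, a warning, and a message. The `to_celsius()` helper is hypothetical and invented purely for illustration; only `test_that()` and the `expect_*()` functions come from testthat.

```r
library(testthat)

# Hypothetical helper, invented for illustration only.
to_celsius <- function(f) {
  if (!is.numeric(f)) stop("f must be numeric")
  if (any(f < -459.67)) warning("temperature below absolute zero")
  message("converting ", length(f), " value(s)")
  (f - 32) * 5 / 9
}

test_that("to_celsius behaves as described", {
  # Return value (messages silenced so only the result is checked).
  expect_equal(suppressMessages(to_celsius(212)), 100)
  # Non-numeric input should raise an error.
  expect_error(to_celsius("hot"), "must be numeric")
  # Physically impossible input should raise a warning.
  expect_warning(suppressMessages(to_celsius(-500)), "absolute zero")
  # Every call should announce how many values it converts.
  expect_message(to_celsius(32), "converting 1 value")
})
```

Each expectation states one thing the function is supposed to do, which is the same habit the data checks later in this entry rely on.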

A thorough walkthrough of code testing can be found in Hadley Wickham’s book on R Packages, and is definitely worth a read no matter your experience level with code testing.

## Data assertion and checks

The testthat package can also be used to help with data consistency checking or data validation, which is part of the Extract, Transform, Load (ETL) process, by embedding assertions and checks into a data analysis workflow.

Here we consider an example where data are loaded from the fivethirtyeight package (see The fivethirtyeight R Package: “Tame Data” Principles for Introductory Statistics and Data Science Courses).

Let’s see how we might ensure consistency checking for the biopics dataset (the raw data behind the 538 “‘Straight Outta Compton’ Is The Rare Biopic Not About White Dudes” blog post).

    library(fivethirtyeight)
    library(tidyverse)
    library(testthat)
    glimpse(biopics)
    ## Rows: 761
    ## Columns: 14
    ## $ title              <chr> "10 Rillington Place", "12 Years a Slave", "127 ...
    ## $ site               <chr> "tt0066730", "tt2024544", "tt1542344", "tt283307...
    ## $ country            <chr> "UK", "US/UK", "US/UK", "Canada", "US", "US", "U...
    ## $ year_release       <int> 1971, 2013, 2010, 2014, 1998, 2008, 2002, 2013, ...
    ## $ box_office         <dbl> NA, 5.67e+07, 1.83e+07, NA, 5.37e+05, 8.12e+07, ...
    ## $ director           <chr> "Richard Fleischer", "Steve McQueen", "Danny Boy...
    ## $ number_of_subjects <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 3, 3, 3, 1, ...
    ## $ subject            <chr> "John Christie", "Solomon Northup", "Aron Ralsto...
    ## $ type_of_subject    <chr> "Criminal", "Other", "Athlete", "Other", "Other"...
    ## $ race_known         <chr> "Unknown", "Known", "Unknown", "Known", "Unknown...
    ## $ subject_race       <chr> NA, "African American", NA, "White", NA, "Asian ...
    ## $ person_of_color    <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, T...
    ## $ subject_sex        <chr> "Male", "Male", "Male", "Male", "Male", "Male", ...
    ## $ lead_actor_actress <chr> "Richard Attenborough", "Chiwetel Ejiofor", "Jam...

    inspectdf::inspect_types(biopics)
    ## # A tibble: 4 x 4
    ##   type        cnt  pcnt col_name
    ##   <chr>     <int> <dbl> <named list>
    ## 1 character    10 71.4  <chr [10]>
    ## 2 integer       2 14.3  <chr [2]>
    ## 3 logical       1  7.14 <chr [1]>
    ## 4 numeric       1  7.14 <chr [1]>

    table(biopics$country)
    ##
    ##       Canada    Canada/UK           UK           US    US/Canada        US/UK
    ##           18           13          146          489           11           82
    ## US/UK/Canada
    ##            2

    length(table(biopics$country))
    ## [1] 7

Imagine that we are planning to analyse the country variable, which designates the country of origin for the biopic and has seven distinct levels. The testthat package can help us to confirm assertions about the variable.

    countrycheck <- c("Canada", "Canada/UK", "UK", "US", "US/Canada", "US/UK", "US/UK/Canada")
    testthat::expect_setequal(biopics$country, countrycheck)

When we test against the list of all countries, we do not get an error. However, when we compare against the smaller country list, the expect_setequal() function will tell us which entries in the country vector are not in our smaller test set.

    countrysmall <- c("Canada", "UK", "US")
    testthat::expect_setequal(biopics$country, countrysmall)
    ## Error: biopics$country[c(2, 3, 10, 11, 13, 14, 15, 19, 20, ...)] absent from countrysmall

    biopics$country[2]
    ## [1] "US/UK"
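The error message reports row positions rather than the offending values themselves. One possible complement, sketched here on a toy vector standing in for the country column (the values are illustrative, not the full dataset), is to list the distinct unexpected values with base R's `setdiff()`. The names `country`, `allowed_small`, and `allowed_full` are assumptions made for this sketch.

```r
library(testthat)

# Toy vector standing in for biopics$country (illustrative values only).
country <- c("UK", "US/UK", "US/UK", "Canada", "US")

allowed_small <- c("Canada", "UK", "US")
allowed_full  <- c(allowed_small, "US/UK")

# Distinct values found in the data but missing from the smaller list.
unexpected <- setdiff(unique(country), allowed_small)
unexpected

# With the complete list, both forms of the check pass silently.
expect_length(setdiff(unique(country), allowed_full), 0)
expect_setequal(country, allowed_full)
```

Note that `expect_setequal()` is a two-sided comparison: it also fails when the reference list contains a value that never appears in the data. The `setdiff()` form checks only one direction, which may or may not be what an analysis needs.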

Other sanity checks can be added. Here we can (incorrectly) assume that all of the biopics date from later than 1950.

    range(biopics$year_release)
    ## [1] 1915 2014

    expect_lt(max(biopics$year_release), 2015)
    expect_gt(min(biopics$year_release), 1950)
    ## Error: min(biopics$year_release) is not strictly more than 1950. Difference: -35

We see that the earliest biopic was released in 1915.
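Range checks like these can be grouped with type and missing-value checks inside a single `test_that()` block. The sketch below uses a small toy data frame named `films` (a hypothetical stand-in with a handful of illustrative rows, not the real dataset) and bounds chosen to match the range reported above.

```r
library(testthat)

# Toy stand-in for two biopics columns (illustrative values only).
films <- data.frame(
  year_release = c(1971L, 2013L, 2010L, 1915L),
  box_office   = c(NA, 5.67e7, 1.83e7, NA)
)

test_that("year_release and box_office look plausible", {
  # Release year should be a complete integer column within known bounds.
  expect_type(films$year_release, "integer")
  expect_false(anyNA(films$year_release))
  expect_gte(min(films$year_release), 1915)
  expect_lt(max(films$year_release), 2015)
  # box_office contains missing values, so drop them before comparing.
  expect_gt(min(films$box_office, na.rm = TRUE), 0)
})
```

Without `na.rm = TRUE`, `min()` would return `NA` for the box office column, so it is worth deciding explicitly how each check should treat missing values.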

Taking time to ensure that variables and datasets correspond to what is described in the codebook is an important component of data validation. Students can and should incorporate such checks into their data ingestion workflow (and they should be required to do so as part of their projects and analyses).
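One way such checks might be built into an ingestion step is to wrap them in a function that either stops with an error or hands the data back unchanged. This is only a sketch: `check_films()` is a hypothetical helper, the toy data frame is illustrative, and the allowed countries and year bound are assumptions that a real project would take from its codebook.

```r
library(testthat)

# Hypothetical helper: stops with an error if any check fails,
# otherwise returns the data unchanged so later steps can use it.
check_films <- function(df) {
  expect_s3_class(df, "data.frame")
  expect_true(all(c("country", "year_release") %in% names(df)))
  expect_setequal(df$country, c("Canada", "UK", "US"))
  expect_gt(min(df$year_release), 1900)
  invisible(df)
}

# Toy data standing in for a freshly loaded file (illustrative values only).
films <- data.frame(
  country          = c("UK", "US", "Canada"),
  year_release     = c(1971L, 2013L, 2014L),
  stringsAsFactors = FALSE
)

# Passes silently; the checked data are then available for analysis.
checked <- check_films(films)
```

Calling the function immediately after the data are read means a changed or corrupted file halts the analysis at the point of ingestion rather than producing quietly wrong results later.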