Data Integrity Tests for R

Quick note: in this document I use all caps to denote things that you’d have to fill in for your own applications.

Testing Idioms

The easiest tool for testing in R is the stopifnot() function. stopifnot() accepts TRUE/FALSE statements. If the statement is true, R will march forward; if the statement is false, R will stop and throw an error, drawing the programmer’s attention to the problem.

For example:

# Make sure VARIABLE is numeric: 
stopifnot( is.numeric(VARIABLE) )

There is one nuance, however: stopifnot wants something that’s either TRUE or FALSE, but truth statements applied to the columns of a data.frame actually generate vectors with a TRUE / FALSE value for each row.

So when working with a data.frame, stopifnot will throw an error unless ALL the values are TRUE. If you want to test for whether SOME values are TRUE, you can use any().

# Make sure some values are not zero

stopifnot( any( df$VARIABLE != 0) )

stopifnot can also easily be combined with other functions. To check that a vector has 100 entries, for example:

stopifnot( length(VECTOR) == 100 )
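
To make the ALL versus SOME distinction concrete, here is a minimal sketch on a made-up data frame. The names toy, x, all_result and any_result are hypothetical, and tryCatch() is used only so the example can record a failure without halting.

```r
# Hypothetical toy data: one column containing a single zero
toy <- data.frame(x = c(0, 1, 2))

# A comparison on a column returns one TRUE/FALSE per row
row_checks <- toy$x != 0
print(row_checks)

# stopifnot() needs every element to be TRUE, so this check fails.
# tryCatch() captures the error so the script can keep going here.
all_result <- tryCatch({
  stopifnot( toy$x != 0 )
  "passed"
}, error = function(e) "failed")

# Wrapping the comparison in any() passes if at least one row is TRUE
any_result <- tryCatch({
  stopifnot( any(toy$x != 0) )
  "passed"
}, error = function(e) "failed")

print(c(all = all_result, any = any_result))
```

In real code you would normally leave out the tryCatch() wrapper, since the whole point is for a failed test to stop the script.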

 

Warning: Tests and Interactive Programming:

One really important thing to know about: if you run an R file using the source command, an error will stop R in its tracks. But if you’re highlighting a block of code and sending it to R by typing “Command-Return” (with the standard R GUI text-editor or in RStudio) or clicking the Run button in RStudio, R will throw an error then keep going!

There is a weird workaround available, but if you don’t want to use it, you need to either (a) watch your output, or (b) make sure to run your code using the source("file.R") command or by clicking the Source button in RStudio instead of the Run button.

To see, try the following code:

x <- 'test'
stopifnot( is.numeric(x) )
print( 'hello world' )

If you run this interactively, you’ll see a red error message, but the print statement will still run. If you save this file and execute it with the source command / Source button, the print statement will never be run.
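
The source() half of that claim can be checked from within R itself. The sketch below is only an illustration under a few assumptions: it writes a variant of the example to a temporary file, replaces the print statement with an assignment to a hypothetical marker variable (reached_end), and sources the file into a scratch environment so the result is easy to inspect.

```r
# Write a variant of the three-line example to a temporary file.
# The last line sets a marker instead of printing.
script <- tempfile(fileext = ".R")
writeLines(c(
  "x <- 'test'",
  "stopifnot( is.numeric(x) )",
  "reached_end <- TRUE"
), script)

# Evaluate the file in its own environment so we can look inside afterwards
env <- new.env()

# source() stops at the first error; tryCatch() only records that it happened
source_failed <- tryCatch({
  source(script, local = env)
  FALSE
}, error = function(e) TRUE)

# The line after the failed test never ran, so the marker was never created
last_line_ran <- exists("reached_end", envir = env, inherits = FALSE)

print(source_failed)
print(last_line_ran)
```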

 

 

When To Write Tests

The best way to get into writing tests is to think about how you check your data interactively to make stuff work. After a merge or plyr command, most people pause to browse the data and/or watch the code step by step, or do a set of quick tabs or plots.  But these are not systematic, and you generally only do them once (when you’re first writing the code).

A great way to write tests is to think about what you’re looking for when you do these interactive tests and convert the logic of those interactive interrogations into systematic assert statements. That way they’ll be baked into your code, and will be executed every time your code runs! [1]

  • After merge: Nowhere are problems with data made clearer than in a merge. ALWAYS add tests after a merge! More on that below.
  • After complicated manipulations: If you have to think more than a little about how to get R to do something, there’s a chance you missed something. Add a test or two to make sure you did it right! Personally, for example, I almost never use plyr commands without adding tests. It’s just not a natural way to think about things, so I know I may have screwed up (and often have!).
  • Before dropping observations: Dropping observations masks problems. Before you drop observations, add a test to, say, count the number of observations you expect to drop.
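
As a rough sketch of the last point, the check before a drop might look like the following. The data frame, the income column and the expected count of 2 are all made up for illustration.

```r
# Hypothetical toy data: two of five rows have a missing income
toy <- data.frame(id = 1:5, income = c(10, NA, 30, NA, 50))

# Flag the rows about to be dropped and test the count BEFORE dropping
to_drop <- is.na(toy$income)
expected_drops <- 2  # assumed known from how the data were collected
stopifnot( sum(to_drop) == expected_drops )

# Only now drop them, then confirm the row count moved as expected
n_before <- nrow(toy)
toy <- toy[ !to_drop, ]
stopifnot( nrow(toy) == n_before - expected_drops )
```

If a later version of the data has more missing values than expected, the first stopifnot will fail instead of the extra rows quietly disappearing.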

Test Examples

General Tests

Test number of observations is right:

stopifnot( nrow(MY_DF) == VALUE )

Check that a variable that should have no missing values has none:

stopifnot( ! any( is.na(df$VARIABLE) ) )

Check my unique identifier is actually unique.

stopifnot( ! any(duplicated(MY_DF$VARIABLE) ) )

# or, more succinctly:

stopifnot( ! anyDuplicated(MY_DF$VARIABLE) )

Make sure the values of gender are reasonable (this assumes GENDER is coded 0/1, so its mean is a share). Note this is a “reasonableness” test, not an absolute test. It’s possible this would fail and the data are fine, but this way, if there’s a problem, it will be flagged so you can check.

stopifnot( 0.4 < mean(MY_DF$GENDER) & mean(MY_DF$GENDER) < 0.6 )
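
One detail worth knowing about a test like this: stopifnot() treats NA as a failure, and mean() returns NA if any value is missing. The sketch below illustrates that on made-up data (toy and its 0/1 coding are assumptions), again using tryCatch() only to record the outcome.

```r
# Hypothetical toy data: GENDER coded 0/1 with one missing value
toy <- data.frame(GENDER = c(0, 1, 1, 0, NA))

# mean() returns NA when any value is missing, and stopifnot() fails on NA
with_na <- tryCatch({
  stopifnot( 0.4 < mean(toy$GENDER) & mean(toy$GENDER) < 0.6 )
  "passed"
}, error = function(e) "failed")

# If missing values are expected, compute the share on the non-missing rows
share <- mean(toy$GENDER, na.rm = TRUE)
without_na <- tryCatch({
  stopifnot( 0.4 < share & share < 0.6 )
  "passed"
}, error = function(e) "failed")

print(c(with_na = with_na, without_na = without_na))
```

Whether failing on missing values is what you want depends on the variable; if it should never be missing, the failure is a feature.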

 

Post-merge checks

Personally, I hate that R doesn’t have an analogue of the _merge variable from Stata to check which rows merged and which didn’t (though some external libraries implement this), so I hack it myself and check its integrity as follows:

df1$MERGE1 = 1
df2$MERGE2 = 2

# If you want a 1:1 merge, make sure the id is unique!

stopifnot( ! anyDuplicated(df1$ID_VARIABLE) )
stopifnot( ! anyDuplicated(df2$ID_VARIABLE) )

df = merge(df1, df2, by='ID_VARIABLE', all=TRUE)

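# Rows missing from one side come back as NA; recode those to 0
df$MERGE1[ is.na(df$MERGE1) ] = 0
df$MERGE2[ is.na(df$MERGE2) ] = 0

# MERGE is 1 if only in df1, 2 if only in df2, 3 if in both
df$MERGE = df$MERGE1 + df$MERGE2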

# Check that no observations from df2 failed to match with df1. 

stopifnot( all(df$MERGE != 2) )

# Say I know 3 obs in df1 will have failed to merge, and those three are OK, but ONLY those 3. 

stopifnot( sum(df$MERGE == 1) == 3 )
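
If you do this a lot, the same hack can be wrapped in a small function. This is just one possible sketch: merge_with_indicator is a hypothetical helper (not part of base R or any package), it assumes a single key column shared by both data frames, and the toy data are made up.

```r
# Hypothetical helper: outer merge plus a Stata-style indicator column.
# MERGE is 1 if the row is only in left, 2 if only in right, 3 if in both.
merge_with_indicator <- function(left, right, by) {
  # Enforce a 1:1 merge by requiring unique keys on both sides
  stopifnot( ! anyDuplicated(left[[by]]) )
  stopifnot( ! anyDuplicated(right[[by]]) )

  left$MERGE1 <- 1
  right$MERGE2 <- 2
  out <- merge(left, right, by = by, all = TRUE)

  # Rows missing from one side come back as NA; recode those to 0
  out$MERGE1[ is.na(out$MERGE1) ] <- 0
  out$MERGE2[ is.na(out$MERGE2) ] <- 0
  out$MERGE <- out$MERGE1 + out$MERGE2

  # Drop the two temporary marker columns
  out$MERGE1 <- NULL
  out$MERGE2 <- NULL
  out
}

# Toy data: ids 1 and 2 exist only in df1, ids 3 and 4 are in both
df1 <- data.frame(ID = c(1, 2, 3, 4), a = c("w", "x", "y", "z"))
df2 <- data.frame(ID = c(3, 4), b = c(TRUE, FALSE))

merged <- merge_with_indicator(df1, df2, by = "ID")
print(table(merged$MERGE))

# Nothing from df2 failed to match, and exactly 2 rows are df1-only
stopifnot( all(merged$MERGE != 2) )
stopifnot( sum(merged$MERGE == 1) == 2 )
```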


After Regressions

Make sure the number of observations is correct and that observations you expect to be in the regression aren’t dropping out due to missing values or something.

model = lm(VAR1 ~ VAR2, data = df)
stopifnot( nobs(model) == 24 )
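
To see why this test matters, here is a small sketch on made-up data with one missing value. It assumes R’s default na.action setting, under which lm() quietly drops incomplete rows; the data frame and variable names are hypothetical.

```r
# Hypothetical toy data: six rows, one with a missing regressor
toy <- data.frame(
  VAR1 = c(1.0, 2.1, 2.9, 4.2, 5.1, 6.0),
  VAR2 = c(1, 2, 3, 4, NA, 6)
)

model <- lm(VAR1 ~ VAR2, data = toy)

# With the default na.action, the row with the NA is dropped without any message
rows_complete <- sum(complete.cases(toy[, c("VAR1", "VAR2")]))
stopifnot( nobs(model) == rows_complete )

# A test that expects every row to be used catches the silent drop
check <- tryCatch({
  stopifnot( nobs(model) == nrow(toy) )
  "passed"
}, error = function(e) "failed")
print(check)
```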

 

 

 

  1. Like many things in my life, credit for this way of thinking goes to Adriane Fresh.