Data Integrity Tests for Stata

Quick note: in this document I use all caps to denote things that you’d have to fit in for your own applications. For example, VARIABLE1 is a generic variable name, and CONDITION is a generic condition you might want to include (like age < 100). Also, Stata accepts return scalars (like r(N) with or without macro quotes (` and ‘). For formatting reasons that are hard to explain, I don’t use the macro quotes here; you are welcome to use them in your code code if you want!

Testing Idioms

Key to testing in Stata is the assert command, the syntax for which is quite sample: type assert then a statement. If the statement is true, Stata will march forward; if the statement is false, Stata will stop and throw an error, drawing the programmers attention to the problem. For example:

* Make sure VARIABLE is never missing:
assert VARIABLE != .

* Make sure VARIABLE1 is not missing if VARIABLE2 is not missing
assert VARIABLE1 != . if VARIABLE2 != .

There are, however, situations where the syntax of assert is limited. assert always tries to test things row by row, just like most Stata commands. But if you want to test a more general property of your dataset or don’t want an absolutist test (a certain number of known exceptions are permissible), you need to combine assert with a helper command. For example:

* Make sure no more than 10 missing observations of
* VARIABLE1. Note r(N) is a local macro where "count"
* puts the number it displays. 

count if VARIABLE1 == .
assert r(N) <= 10

* Make sure your data has no duplicate values of
* VARIABLE2. Here we use the two return values 
* from the "duplicates report" command. 

duplicates report VARIABLE2
assert r(N) == r(unique_value)

If you aren’t sure what values are made available through a command or how those values are called, type return list or ereturn list after you’re run the relevant command (like summarize, duplicates report, or regress).

When To Write Tests

The best way to get into writing tests is to think about how you check your data interactively to make stuff work. After a merge or egen, most people pause to browse the data and/or watch the code step by step, or do a set of quick tabs or plots. But these are not systematic, and you generally only do them once (when you’re first writing the code).

A great way to write tests is to think about what you’re looking for when you do these interactive tests and convert the logic of those interactive interrogations into systematic assert statements. That way they’ll be baked into your code, and will be executed every time your code runs!¹

After merge: No where are problems with data made more clear then in a merge. ALWAYS add tests after a merge! More on that below.
After complicated commands (like bysort and egen): If you have to think more than a little about how to get Stata to do something, there’s a chance you missed something. Add a test or two to make sure you did it right! Personally, for example, I almost never use bysort or egen without adding tests — it’s just not a natural way to think about things, so I know I may have screwed up (and often have!).
Before drop if statements: Dropping observations masks problems. Before you drop variables, add a test to, say, count the number of observations you expect to drop (count if CONDITION, then assert r(N) < NUMBER_YOU_EXPECT_TO_DROP, then drop if CONDITION`)
When using conditions (if statements) with more than one or two simple clauses

Test Examples

General Tests

Test number of observations is right:

count 
assert r(N) == VALUE

Check var that should have no missing has no missing.

assert MY_VAR != .

Check my unique identifier is actually unique.

duplicates report MY_IDENTIFIER
assert r(N) == r(unique_value)

In a panel, make sure each ID group has exactly one observation for each of five years

bysort ID: egen test_obs = count(YEAR)
bysort ID: egen test_min = min(YEAR)
bysort ID: egen test_max = max(YEAR)

assert test_obs == 5 & test_min == 1990 & test_max == 1995
drop test_*

Make sure values of gender have a reasonable value. Note this is a “reasonableness” test, not an absolute test. It’s possible this would fail and the data is ok, but this way if there’s a problem your attention will be flagged so you can check.

sum GENDER
assert r(mean) >= 0.4 & r(mean) < 0.6

Check that a new variable (a Herfindahl) has reasonable values. Do so by checking has right range of values, and that places with more groups have lower values (because we’ve all gotten these backwards in our lives at least once!). For context, assume data is long (I’m calculating Herf within each group identified with ID, and within each ID there are grows for each GROUP).

Like the test above, this is not an absolutist test — it COULD be the correlation is negative if group sizes are really strange. But this will PROBABLY pass and if it fails, you definitely want to check to make sure you know why it’s failing.

assert MY_HERF >= 0 & MY_HERF <= 1 
bysort ID: egen NUM_GROUPS = count(GROUP) 
corr NUM_GROUPS MY_HERF assert r(rho) < 0

One annoyance of Stata is that replace doesn’t give you access to the number of changes that are made, so you can’t easily test for whether your replace command made changes. But there are ways to check replace when important — here’s an admittedly verbose way of making sure your replace command is doing what you expect. I don’t use it for singular replace commands, but it is useful when I’m running a loop with a replace inside and I want to make sure it’s doing the right thing at each stage of the loop. Here’s an example where I want to make sure at each step there is exactly 1 and only 1 change made:

forvalues x in VAR1 VAR2 VAR3 VAR4 {

    count if `x' == NEW_VALUE & CONDITION
    local pre_replace_count = r(N)

    replace `x' = NEW_VALUE if CONDITION

    count if `x' == NEW_VALUE & CONDITION
    assert r(N) - `pre_replace_count' == 1

}

Post-merge checks

After a merge, make sure that there are no observations in second dataset not in first.

assert _m != 2

Imagine we know 5 observations in original data won’t merge and that’s ok. Add check that ONLY 5 fail.

count if _m == 1
assert r(N) < 6
drop _m

After Regressions

Make sure num obs is correct and obs you expect in regression aren’t dropping out due to missing values or something.

reg Y1 X1 X2 
assert e(N) == 5

Like many things in my life, credit for this way of thinking goes to Adriane Fresh