Excluding missing or bad data in R: not as easy as it should be!

Part of an ongoing series of posts about functions in my R package (https://github.com/Deleetdk/kirkegaard).

Suppose you have a list or a simple vector (lists are vectors) with some data. However, some of it is missing or bad in various ways: NA, NULL, NaN, Inf (or -Inf). Usually, we want to get rid of these datapoints, but it can be difficult with the built-in functions. R’s built-in functions for handling missing (or bad) data are:

is.na
is.nan
is.infinite / is.finite
is.null

Unfortunately, they are not consistently vectorized and some of them match multiple types. For instance:

x = list(1, NA, 2, NULL, 3, NaN, 4, Inf) #example list
is.na(x)
#> [1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE

So, is.na actually matches NaN as well. What about is.nan?

is.nan(x)
#> Error in is.nan(x) : default method not implemented for type 'list'

So is.nan has no method for lists at all, even though it works element-wise on atomic vectors. Applying it to each element ourselves makes things worse:

sapply(x, is.nan)
#> [[1]]
#> [1] FALSE
#> 
#> [[2]]
#> [1] FALSE
#> 
#> [[3]]
#> [1] FALSE
#> 
#> [[4]]
#> logical(0)
#> 
#> [[5]]
#> [1] FALSE
#> 
#> [[6]]
#> [1] TRUE
#> 
#> [[7]]
#> [1] FALSE
#> 
#> [[8]]
#> [1] FALSE

Note that calling is.nan on NULL returns an empty logical vector (logical(0)) instead of FALSE. Because the results no longer all have length one, sapply cannot simplify them and returns a list instead of a logical vector we can subset with. is.infinite behaves the same way: it fails on a list and gives logical(0) for NULL.
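One way around both problems is to use vapply with a guard, so that every element yields exactly one TRUE or FALSE. This is a minimal sketch under my own assumptions, not code from the package, and the helper name is_nan_element is hypothetical.

```r
x <- list(1, NA, 2, NULL, 3, NaN, 4, Inf)

# is.nan is vectorized over atomic vectors; it only lacks a method for lists
atomic_check <- is.nan(c(1, NA, NaN, Inf))

# hypothetical helper: TRUE only for a length-one numeric NaN
is_nan_element <- function(el) {
  is.numeric(el) && length(el) == 1 && is.nan(el)
}

# vapply demands one logical per element, so NULL can no longer give logical(0)
nan_flags <- vapply(x, is_nan_element, logical(1))
print(nan_flags)

# the result is a plain logical vector, so it can be used for subsetting
x_without_nan <- x[!nan_flags]
```

Unlike sapply, vapply raises an error if any element returns something other than a single logical, so a silent switch to list output cannot happen.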

Suppose you want a robust function for removing missing data that also lets you choose exactly which kinds of bad values to exclude. I could not find such a function, so I wrote one. Testing it:

are_equal(exclude_missing(x), list(1, 2, 3, 4))
#> [1] TRUE
are_equal(exclude_missing(x, .NA = F), list(1, NA, 2, 3, 4))
#> [1] TRUE
are_equal(exclude_missing(x, .NULL = F), list(1, 2, NULL, 3, 4))
#> [1] TRUE
are_equal(exclude_missing(x, .NaN = F), list(1, 2, 3, NaN, 4))
#> [1] TRUE
are_equal(exclude_missing(x, .Inf = F), list(1, 2, 3, 4, Inf))
#> [1] TRUE

So, in every case it excludes only the types we ask it to exclude, and it does not fail on lists the way the base R functions do.
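For readers who want to see how such a function could work, here is a rough sketch. It is not the implementation from the package: the name exclude_missing_sketch and the internal helper is_bad are hypothetical, and I am assuming that only NULL and length-one atomic elements should ever be dropped.

```r
# Hypothetical sketch, not the package's exclude_missing()
exclude_missing_sketch <- function(x, .NA = TRUE, .NULL = TRUE,
                                   .NaN = TRUE, .Inf = TRUE) {
  is_bad <- function(el) {
    if (is.null(el)) return(.NULL)
    # assumption: longer vectors and nested lists are always kept
    if (!is.atomic(el) || length(el) != 1) return(FALSE)
    # test NaN before NA, because is.na() is also TRUE for NaN
    if (is.numeric(el) && is.nan(el)) return(.NaN)
    if (is.na(el)) return(.NA)
    if (is.numeric(el) && is.infinite(el)) return(.Inf)
    FALSE
  }
  # one logical per element, so the result can be used for subsetting
  drop <- vapply(x, is_bad, logical(1))
  x[!drop]
}

x <- list(1, NA, 2, NULL, 3, NaN, 4, Inf)
cleaned <- exclude_missing_sketch(x)
kept_nan <- exclude_missing_sketch(x, .NaN = FALSE)
```

The order of the checks matters: because is.na is TRUE for NaN as well, NaN has to be tested first, otherwise .NaN = FALSE would have no effect while .NA is TRUE.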

Edited

Turns out that there are more problems:

is.na(list(NA))
#> [1] TRUE

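So is.na returns TRUE for a list element that is a single NA. The documentation for is.na does describe this: on a list, an element counts as missing only if it is a length-one atomic vector whose value is NA or NaN. Still, I don't think this should happen, since it makes is.na behave differently on lists than the other functions above.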
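A small experiment shows where is.na draws the line on lists. The variable name na_flags is just for illustration.

```r
# only length-one atomic elements that are NA get flagged;
# a longer vector containing NA, NULL, and a non-missing string do not
na_flags <- is.na(list(NA, c(NA, 1), NULL, "a", NA_character_))
print(na_flags)
```

So a vector such as c(NA, 1) stored inside a list is not reported as missing, even though it contains an NA.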
