This is part of an ongoing series of posts about functions in my R package (https://github.com/Deleetdk/kirkegaard).
Suppose you have a list or a simple vector (lists are vectors) with some data. However, some of it is missing or bad in various ways: NA, NULL, NaN, Inf (or -Inf). Usually, we want to get rid of these datapoints, but it can be difficult with the built-in functions. R’s built-in functions for handling missing (or bad) data are:
- is.na
- is.nan
- is.infinite / is.finite
- is.null
Unfortunately, they are not consistently vectorized and some of them match multiple types. For instance:
x = list(1, NA, 2, NULL, 3, NaN, 4, Inf) #example list
is.na(x)
#> [1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
So, is.na actually matches NaN as well. What about is.nan?
is.nan(x)
#> Error in is.nan(x) : default method not implemented for type 'list'
So is.nan does not accept lists at all. Applying it to each element with sapply is worse:
sapply(x, is.nan)
#> [[1]]
#> [1] FALSE
#>
#> [[2]]
#> [1] FALSE
#>
#> [[3]]
#> [1] FALSE
#>
#> [[4]]
#> logical(0)
#>
#> [[5]]
#> [1] FALSE
#>
#> [[6]]
#> [1] TRUE
#>
#> [[7]]
#> [1] FALSE
#>
#> [[8]]
#> [1] FALSE
Note that calling is.nan on NULL returns an empty logical vector (logical(0)) instead of FALSE. Because of that one zero-length result, sapply cannot simplify its output and returns a list rather than a logical vector we could use for subsetting. is.infinite behaves the same way: it has no method for lists and gives logical(0) for NULL.
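One way around the logical(0) problem is to guard the test so that it always returns a single TRUE or FALSE, and to use vapply instead of sapply so that the result type is enforced. This is only a minimal sketch: is_nan_elem is a hypothetical helper name, not something from base R or from the package.

```r
# Hypothetical helper (not part of base R or the kirkegaard package):
# TRUE only for length-one numeric elements that are NaN.
is_nan_elem = function(el) {
  is.numeric(el) && length(el) == 1 && is.nan(el)
}

x = list(1, NA, 2, NULL, 3, NaN, 4, Inf)

# vapply requires exactly one logical value per element,
# so the result is always a plain logical vector
nan_idx = vapply(x, is_nan_elem, logical(1))
nan_idx
#> [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE

# the index can now be used for subsetting
length(x[!nan_idx])
#> [1] 7
```

The && operator short-circuits, so is.nan is never reached for NULL or for non-numeric elements.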
Suppose you want a function that handles missing data robustly and also lets you choose exactly which kinds of bad values to drop. I could not find such a function, so I wrote one. Testing it:
are_equal(exclude_missing(x), list(1, 2, 3, 4))
#> [1] TRUE
are_equal(exclude_missing(x, .NA = F), list(1, NA, 2, 3, 4))
#> [1] TRUE
are_equal(exclude_missing(x, .NULL = F), list(1, 2, NULL, 3, 4))
#> [1] TRUE
are_equal(exclude_missing(x, .NaN = F), list(1, 2, 3, NaN, 4))
#> [1] TRUE
are_equal(exclude_missing(x, .Inf = F), list(1, 2, 3, 4, Inf))
#> [1] TRUE
So in every case it excludes only the types we want to exclude, and it does not fail on lists the way the base R functions do.
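For readers who want to see how this kind of selective filtering can be done, here is a rough sketch. It is not the package's actual code: drop_bad is a hypothetical name, and the real exclude_missing may be implemented quite differently. It borrows the .NA, .NULL, .NaN and .Inf argument names from the calls above and assumes each is a single TRUE or FALSE.

```r
# Hypothetical re-implementation for illustration only; the real
# exclude_missing in the kirkegaard package may work differently.
drop_bad = function(x, .NA = TRUE, .NULL = TRUE, .NaN = TRUE, .Inf = TRUE) {
  # classify one element; every branch returns a single TRUE or FALSE
  is_bad = function(el) {
    if (is.null(el)) return(.NULL)
    # only length-one atomic values are treated as candidate data points
    if (!is.atomic(el) || length(el) != 1) return(FALSE)
    # NaN must be tested before NA, because is.na(NaN) is TRUE
    if (is.numeric(el) && is.nan(el)) return(.NaN)
    if (is.na(el)) return(.NA)
    if (is.numeric(el) && is.infinite(el)) return(.Inf)
    FALSE
  }
  x[!vapply(x, is_bad, logical(1))]
}

x = list(1, NA, 2, NULL, 3, NaN, 4, Inf)
identical(drop_bad(x), list(1, 2, 3, 4))
#> [1] TRUE
identical(drop_bad(x, .NaN = FALSE), list(1, 2, 3, NaN, 4))
#> [1] TRUE
```

Testing NaN before NA is what gives the specificity: since is.na matches NaN too, checking in the other order would drop NaN whenever .NA is TRUE.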
Edited
Turns out that there are more problems:
is.na(list(NA))
#> [1] TRUE
So, for some reason, is.na looks inside the list and returns TRUE for the NA element instead of treating the list itself as a non-missing object. I don't think this should happen.
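For context, the help page for is.na describes this elementwise treatment of lists: an element counts as missing only if it is a length-one atomic vector whose single value is NA or NaN. A small example, offered as an illustration of that rule rather than a full account of it (the variable name res is arbitrary):

```r
# is.na on a list is evaluated element by element:
# only length-one atomic elements that are NA (or NaN) give TRUE
res = is.na(list(NA, c(NA, 1), NULL, "a"))
res
#> [1]  TRUE FALSE FALSE FALSE
```

The second element contains an NA but has length two, so it is reported as FALSE.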