{"id":5839,"date":"2016-03-09T17:47:54","date_gmt":"2016-03-09T16:47:54","guid":{"rendered":"http:\/\/emilkirkegaard.dk\/en\/?p=5839"},"modified":"2016-03-09T18:02:39","modified_gmt":"2016-03-09T17:02:39","slug":"excluding-missing-or-bad-data-in-r-not-as-easy-as-it-should-be","status":"publish","type":"post","link":"https:\/\/emilkirkegaard.dk\/en\/2016\/03\/excluding-missing-or-bad-data-in-r-not-as-easy-as-it-should-be\/","title":{"rendered":"Excluding missing or bad data in R: not as easy as it should be!"},"content":{"rendered":"<p><em>On-going series of posts about functions in my R package (https:\/\/github.com\/Deleetdk\/kirkegaard).<\/em><\/p>\n<p>Suppose you have a list or a simple vector (<a href=\"http:\/\/adv-r.had.co.nz\/Data-structures.html#vectors\">lists are vectors<\/a>) with some data. However, some of it is missing or bad in various ways: NA, NULL, NaN, Inf (or -Inf). Usually, we want to get rid of these data points, but this can be difficult with the built-in functions. R&#8217;s built-in functions for handling missing (or bad) data are:<\/p>\n<ul>\n<li><em>is.na<\/em><\/li>\n<li><em>is.nan<\/em><\/li>\n<li><em>is.infinite \/ is.finite<\/em><\/li>\n<li><em>is.null<\/em><\/li>\n<\/ul>\n<p>Unfortunately, they are not consistently vectorized and some of them match multiple types. For instance:<\/p>\n<pre>x = list(1, NA, 2, NULL, 3, NaN, 4, Inf) #example list\r\nis.na(x)\r\n#&gt; [1] FALSE\u00a0 TRUE FALSE FALSE FALSE\u00a0 TRUE FALSE FALSE<\/pre>\n<p>So, <em>is.na<\/em> actually matches <em>NaN<\/em> as well. What about <em>is.nan<\/em>?<\/p>\n<pre>is.nan(x)\r\n#&gt; Error in is.nan(x) : default method not implemented for type 'list'<\/pre>\n<p>It turns out that <em>is.nan<\/em> has no method for lists, so it is not vectorized over them. 
But it gets worse:<\/p>\n<pre>sapply(x, is.nan)\r\n#&gt; [[1]]\r\n#&gt; [1] FALSE\r\n#&gt; \r\n#&gt; [[2]]\r\n#&gt; [1] FALSE\r\n#&gt; \r\n#&gt; [[3]]\r\n#&gt; [1] FALSE\r\n#&gt; \r\n#&gt; [[4]]\r\n#&gt; logical(0)\r\n#&gt; \r\n#&gt; [[5]]\r\n#&gt; [1] FALSE\r\n#&gt; \r\n#&gt; [[6]]\r\n#&gt; [1] TRUE\r\n#&gt; \r\n#&gt; [[7]]\r\n#&gt; [1] FALSE\r\n#&gt; \r\n#&gt; [[8]]\r\n#&gt; [1] FALSE<\/pre>\n<p>Note that calling <em>is.nan<\/em> on <em>NULL<\/em> returns an empty logical vector (<em>logical(0)<\/em>) instead of <em>FALSE<\/em>. Because the results no longer all have length one, <em>sapply<\/em> returns a list instead of a logical vector we can subset with. <em>is.infinite<\/em> behaves the same way: it is not vectorized over lists and gives <em>logical(0)<\/em> for <em>NULL<\/em>.<\/p>\n<p>Suppose you want a robust function for handling missing data that can also target specific types. I could not find such a function, so I wrote one. Testing it:<\/p>\n<pre>are_equal(exclude_missing(x), list(1, 2, 3, 4))\r\n#&gt; [1] TRUE\r\nare_equal(exclude_missing(x, .NA = FALSE), list(1, NA, 2, 3, 4))\r\n#&gt; [1] TRUE\r\nare_equal(exclude_missing(x, .NULL = FALSE), list(1, 2, NULL, 3, 4))\r\n#&gt; [1] TRUE\r\nare_equal(exclude_missing(x, .NaN = FALSE), list(1, 2, 3, NaN, 4))\r\n#&gt; [1] TRUE\r\nare_equal(exclude_missing(x, .Inf = FALSE), list(1, 2, 3, 4, Inf))\r\n#&gt; [1] TRUE<\/pre>\n<p>So, in every case it excludes only the types we asked it to exclude, and it does not fail due to the lack of vectorization in the base R functions.<\/p>\n<p><strong>Edited<\/strong><\/p>\n<p>It turns out that there are more problems:<\/p>\n<p id=\"rstudio_console_output\" class=\"GCWXI2KCJKB\" tabindex=\"0\"><span class=\"GCWXI2KCPJB ace_keyword\">is.na(list(NA))<br \/>\n#&gt; <\/span>[1] TRUE<\/p>\n<p class=\"GCWXI2KCJKB\" tabindex=\"0\">So, for some reason, <em>is.na<\/em> returns <em>TRUE<\/em> when given a list containing <em>NA<\/em>. 
This shouldn&#8217;t happen I think.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>On-going series of posts about functions in my R package (https:\/\/github.com\/Deleetdk\/kirkegaard ). Suppose you have a list or a simple vector (lists are vectors) with some data. However, some of it is missing or bad in various ways: NA, NULL, NaN, Inf (or -Inf). Usually, we want to get rid of these datapoints, but it [&hellip;]<\/p>\n","protected":false},"author":17,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2089],"tags":[2313,1979],"class_list":["post-5839","post","type-post","status-publish","format-standard","hentry","category-programming","tag-missing-data","tag-r","entry"],"_links":{"self":[{"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/posts\/5839","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/users\/17"}],"replies":[{"embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/comments?post=5839"}],"version-history":[{"count":5,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/posts\/5839\/revisions"}],"predecessor-version":[{"id":5844,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/posts\/5839\/revisions\/5844"}],"wp:attachment":[{"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/media?parent=5839"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/categories?post=5839"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/tags?post=5839"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}