{"id":5412,"date":"2015-08-30T06:01:21","date_gmt":"2015-08-30T05:01:21","guid":{"rendered":"http:\/\/emilkirkegaard.dk\/en\/?p=5412"},"modified":"2019-04-02T09:21:31","modified_gmt":"2019-04-02T08:21:31","slug":"converting-a-data-frame-to-a-numerical-matrix-in-r-not-so-easy","status":"publish","type":"post","link":"https:\/\/emilkirkegaard.dk\/en\/2015\/08\/converting-a-data-frame-to-a-numerical-matrix-in-r-not-so-easy\/","title":{"rendered":"Converting a data.frame to a numerical matrix in R, not so easy!"},"content":{"rendered":"<h3>Edit 2019<\/h3>\n<p>You don&#8217;t really need the approach below. Look at the `model.matrix()` function, which converts data frames into matrices for <strong>glmnet<\/strong> and similar functions.<\/p>\n<h3>Original post<\/h3>\n<p>Sometimes you need to use a function that wants a numeric matrix as input. One such function is cv.glmnet(), which performs lasso regression with cross-validation and is <em><a href=\"https:\/\/www.goodreads.com\/book\/show\/17397466-an-introduction-to-statistical-learning\">very cool<\/a><\/em>. Unfortunately, it is picky about how it wants the input data. Here are some lines of my code:<\/p>\n<pre>fit_cv = cv.glmnet(x = as.matrix(temp_df[predictors]), #predictor vars matrix\r\n                   y = as.matrix(temp_df[dependent]), #dep var matrix\r\n                   weights = weights_, #weights\r\n                   alpha = alpha_) #type of shrinkage<\/pre>\n<p>We see that x must be a matrix of the predictors and y must be a matrix with the dependent variable (usually just one). The weights and alpha are optional, but since I am working with aggregate data I almost always use weights. 
Alpha controls the kind of shrinkage used.<\/p>\n<p>All well and good, until it isn&#8217;t. In my case, the predictor data.frame contains some factor variables. R actually uses integer codes as its internal representation of these, but displays them as strings. For instance:<\/p>\n<pre id=\"rstudio_console_output\" class=\"GEM3DMTCFGB\" tabindex=\"0\"><span class=\"GEM3DMTCLFB ace_keyword\">DF = data.frame(a = 1:3, b = letters[10:12],\r\n<\/span> <span class=\"GEM3DMTCLFB ace_keyword\">               c = seq(as.Date(\"2004-01-01\"), by = \"week\", len = 3),\r\n<\/span> <span class=\"GEM3DMTCLFB ace_keyword\">               stringsAsFactors = TRUE)<\/span><\/pre>\n<p class=\"GEM3DMTCFGB\" tabindex=\"0\">Which prints out like this:<\/p>\n<pre id=\"rstudio_console_output\" class=\"GEM3DMTCFGB\" tabindex=\"0\"><span class=\"GEM3DMTCLGB ace_keyword\">&gt; <\/span><span class=\"GEM3DMTCLFB ace_keyword\">DF\r\n<\/span>  a b          c\r\n1 1 j 2004-01-01\r\n2 2 k 2004-01-08\r\n3 3 l 2004-01-15<\/pre>\n<p class=\"GEM3DMTCFGB\" tabindex=\"0\">However, if we use my as.matrix solution from above, we get:<\/p>\n<pre id=\"rstudio_console_output\" class=\"GEM3DMTCFGB\" tabindex=\"0\"><span class=\"GEM3DMTCLGB ace_keyword\">&gt; <\/span><span class=\"GEM3DMTCLFB ace_keyword\">as.matrix(DF)\r\n<\/span>     a   b   c           \r\n[1,] \"1\" \"j\" \"2004-01-01\"\r\n[2,] \"2\" \"k\" \"2004-01-08\"\r\n[3,] \"3\" \"l\" \"2004-01-15\"<\/pre>\n<p class=\"GEM3DMTCFGB\" tabindex=\"0\">Which is not what we wanted. It gave us a character matrix, which cv.glmnet() then rejects with a nonsensical error. That bad error message made me spend some time finding the actual problem. Save yourself and others time. 
Always write good error messages for functions that will be used more than a couple of times!<\/p>\n<p class=\"GEM3DMTCFGB\" tabindex=\"0\">Is there some easy built in way to solve the problem?<\/p>\n<pre id=\"rstudio_console_output\" class=\"GEM3DMTCFGB\" tabindex=\"0\"><span class=\"GEM3DMTCLGB ace_keyword\">&gt; <\/span><span class=\"GEM3DMTCLFB ace_keyword\">as.numeric(DF)\r\n<\/span><span class=\"GEM3DMTCPFB ace_constant\">Error: (list) object cannot be coerced to type 'double'<\/span><\/pre>\n<p class=\"GEM3DMTCFGB\" tabindex=\"0\">The easiest solution did not work.<\/p>\n<pre id=\"rstudio_console_output\" class=\"GEM3DMTCFGB\" tabindex=\"0\"><span class=\"GEM3DMTCLGB ace_keyword\">&gt; <\/span><span class=\"GEM3DMTCLFB ace_keyword\">as.numeric(DF$b)\r\n<\/span>[1] 1 2 3<\/pre>\n<p class=\"GEM3DMTCFGB\" tabindex=\"0\">However, it does work for a single column. So maybe we can just try using it on all the columns:<\/p>\n<pre id=\"rstudio_console_output\" class=\"GEM3DMTCFGB\" tabindex=\"0\"><span class=\"GEM3DMTCLGB ace_keyword\">&gt; <\/span><span class=\"GEM3DMTCLFB ace_keyword\">apply(DF, 2, as.numeric)\r\n<\/span>     a  b  c\r\n[1,] 1 NA NA\r\n[2,] 2 NA NA\r\n[3,] 3 NA NA\r\n<span class=\"GEM3DMTCPFB ace_constant\">Warning messages:\r\n<\/span><span class=\"GEM3DMTCPFB ace_constant\">1: <\/span><span class=\"GEM3DMTCPFB ace_constant\">In apply(DF, 2, as.numeric) :<\/span><span class=\"GEM3DMTCPFB ace_constant\"> NAs introduced by coercion\r\n<\/span><span class=\"GEM3DMTCPFB ace_constant\">2: <\/span><span class=\"GEM3DMTCPFB ace_constant\">In apply(DF, 2, as.numeric) :<\/span><span class=\"GEM3DMTCPFB ace_constant\"> NAs introduced by coercion<\/span><\/pre>\n<p class=\"GEM3DMTCFGB\" tabindex=\"0\">What? 
What is going on?<\/p>\n<pre id=\"rstudio_console_output\" class=\"GEM3DMTCFGB\" tabindex=\"0\"><span class=\"GEM3DMTCLGB ace_keyword\">&gt; <\/span><span class=\"GEM3DMTCLFB ace_keyword\">apply(as.matrix(DF), 2, as.numeric)\r\n<\/span>     a  b  c\r\n[1,] 1 NA NA\r\n[2,] 2 NA NA\r\n[3,] 3 NA NA\r\n<span class=\"GEM3DMTCPFB ace_constant\">Warning messages:\r\n<\/span><span class=\"GEM3DMTCPFB ace_constant\">1: <\/span><span class=\"GEM3DMTCPFB ace_constant\">In apply(as.matrix(DF), 2, as.numeric) :<\/span><span class=\"GEM3DMTCPFB ace_constant\"> NAs introduced by coercion\r\n<\/span><span class=\"GEM3DMTCPFB ace_constant\">2: <\/span><span class=\"GEM3DMTCPFB ace_constant\">In apply(as.matrix(DF), 2, as.numeric) :<\/span><span class=\"GEM3DMTCPFB ace_constant\"> NAs introduced by coercion<\/span><\/pre>\n<p class=\"GEM3DMTCFGB\" tabindex=\"0\">It looks like apply() does a silent as.matrix(), which turns every column into character and then causes the NAs. OK. How do we convert just the factor columns then? Maybe try some of the fancier built-in conversion calls:<\/p>\n<pre id=\"rstudio_console_output\" class=\"GEM3DMTCFGB\" tabindex=\"0\"><span class=\"GEM3DMTCLGB ace_keyword\">&gt; <\/span><span class=\"GEM3DMTCLFB ace_keyword\">as.matrix.data.frame(DF)\r\n<\/span>     a   b   c           \r\n[1,] \"1\" \"j\" \"2004-01-01\"\r\n[2,] \"2\" \"k\" \"2004-01-08\"\r\n[3,] \"3\" \"l\" \"2004-01-15\"<\/pre>\n<p class=\"GEM3DMTCFGB\" tabindex=\"0\">Nope.<\/p>\n<pre id=\"rstudio_console_output\" class=\"GEM3DMTCFGB\" tabindex=\"0\"><span class=\"GEM3DMTCLGB ace_keyword\">&gt; <\/span><span class=\"GEM3DMTCLFB ace_keyword\">as.data.frame.matrix(DF)\r\n<\/span>  a b     c\r\n1 1 j 12418\r\n2 2 k 12425\r\n3 3 l 12432<\/pre>\n<p class=\"GEM3DMTCFGB\" tabindex=\"0\">Closer: this time the date got converted, but the factor got converted to character, not integers. 
We could do a loop:<\/p>\n<pre id=\"rstudio_console_output\" class=\"GEM3DMTCFGB\" tabindex=\"0\"><span class=\"GEM3DMTCLGB ace_keyword\">&gt; <\/span><span class=\"GEM3DMTCLFB ace_keyword\">for (col_idx in seq_along(DF)) {\r\n<\/span><span class=\"GEM3DMTCLGB ace_keyword\">+ <\/span><span class=\"GEM3DMTCLFB ace_keyword\">  DF[col_idx] = as.numeric(DF[[col_idx]])\r\n<\/span><span class=\"GEM3DMTCLGB ace_keyword\">+ <\/span><span class=\"GEM3DMTCLFB ace_keyword\">}\r\n<\/span><span class=\"GEM3DMTCLGB ace_keyword\">&gt; <\/span><span class=\"GEM3DMTCLFB ace_keyword\">DF = as.matrix(DF)\r\n<\/span><span class=\"GEM3DMTCLGB ace_keyword\">&gt; <\/span><span class=\"GEM3DMTCLFB ace_keyword\">str(DF)\r\n<\/span> num [1:3, 1:3] 1 2 3 1 2 ...\r\n - attr(*, \"dimnames\")=List of 2\r\n  ..$ : NULL\r\n  ..$ : chr [1:3] \"a\" \"b\" \"c\"<\/pre>\n<p class=\"GEM3DMTCFGB\" tabindex=\"0\">Which works, but now it is getting silly. Maybe some implicit loops:<\/p>\n<pre id=\"rstudio_console_output\" class=\"GEM3DMTCFGB\" tabindex=\"0\"><span class=\"GEM3DMTCLGB ace_keyword\">&gt; <\/span><span class=\"GEM3DMTCLFB ace_keyword\">lapply(DF, as.numeric)\r\n<\/span>$a\r\n[1] 1 2 3\r\n$b\r\n[1] 1 2 3\r\n$c\r\n[1] 12418 12425 12432<\/pre>\n<p class=\"GEM3DMTCFGB\" tabindex=\"0\">Closer, but it returns a list, not a matrix. Maybe just try converting:<\/p>\n<pre id=\"rstudio_console_output\" class=\"GEM3DMTCFGB\" tabindex=\"0\"><span class=\"GEM3DMTCLGB ace_keyword\">&gt; <\/span><span class=\"GEM3DMTCLFB ace_keyword\">as.matrix(lapply(DF, as.numeric))\r\n<\/span>  [,1]     \r\na Numeric,3\r\nb Numeric,3\r\nc Numeric,3<\/pre>\n<p class=\"GEM3DMTCFGB\" tabindex=\"0\">But no no, life isn&#8217;t that easy. 
What about as.data.frame?<\/p>\n<pre id=\"rstudio_console_output\" class=\"GEM3DMTCFGB\" tabindex=\"0\"><span class=\"GEM3DMTCLGB ace_keyword\">&gt; <\/span><span class=\"GEM3DMTCLFB ace_keyword\">as.data.frame(lapply(DF, as.numeric))\r\n<\/span>  a b     c\r\n1 1 1 12418\r\n2 2 2 12425\r\n3 3 3 12432<\/pre>\n<p class=\"GEM3DMTCFGB\" tabindex=\"0\">Huh, that works, but as.matrix didn&#8217;t. Oh well, just one final step:<\/p>\n<pre id=\"rstudio_console_output\" class=\"GEM3DMTCFGB\" tabindex=\"0\"><span class=\"GEM3DMTCLGB ace_keyword\">&gt; <\/span><span class=\"GEM3DMTCLFB ace_keyword\">DF = as.matrix(as.data.frame(lapply(DF, as.numeric)))\r\n<\/span><span class=\"GEM3DMTCLGB ace_keyword\">&gt; <\/span><span class=\"GEM3DMTCLFB ace_keyword\">str(DF)\r\n<\/span> num [1:3, 1:3] 1 2 3 1 2 ...\r\n - attr(*, \"dimnames\")=List of 2\r\n  ..$ : NULL\r\n  ..$ : chr [1:3] \"a\" \"b\" \"c\"<\/pre>\n<p class=\"GEM3DMTCFGB\" tabindex=\"0\">We got what we wanted!<\/p>\n<p class=\"GEM3DMTCFGB\" tabindex=\"0\">Sometimes, R does not make your life easy.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Edit 2019 You don&#8217;t really need the approach below. Look at the `model.matrix()` function, which converts data frames into matrices for glmnet and similar functions. Original post Sometimes you need to use a function that wants a numeric matrix as input. 
One such function is cv.glmnet(), which performs lasso regression with cross-validation, which is very [&hellip;]<\/p>\n","protected":false},"author":17,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[2206,2208,753,2207,1979],"class_list":["post-5412","post","type-post","status-publish","format-standard","hentry","category-computer","tag-data-frame","tag-factor","tag-matrix","tag-numeric","tag-r","entry"],"_links":{"self":[{"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/posts\/5412","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/users\/17"}],"replies":[{"embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/comments?post=5412"}],"version-history":[{"count":3,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/posts\/5412\/revisions"}],"predecessor-version":[{"id":7802,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/posts\/5412\/revisions\/7802"}],"wp:attachment":[{"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/media?parent=5412"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/categories?post=5412"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/tags?post=5412"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}