{"id":5453,"date":"2015-09-07T15:16:58","date_gmt":"2015-09-07T14:16:58","guid":{"rendered":"http:\/\/emilkirkegaard.dk\/en\/?p=5453"},"modified":"2015-09-07T15:16:58","modified_gmt":"2015-09-07T14:16:58","slug":"easy-plotting-of-kmeans-cluster-analysis-with-ggplot2","status":"publish","type":"post","link":"https:\/\/emilkirkegaard.dk\/en\/2015\/09\/easy-plotting-of-kmeans-cluster-analysis-with-ggplot2\/","title":{"rendered":"Easy plotting of kmeans cluster analysis with ggplot2"},"content":{"rendered":"<p>I looked around for a nice function that just plots the results of kmeans() using ggplot2. I could not find one, so I wrote my own.<\/p>\n<p>The example data I&#8217;m using is <em>real GDP growth<\/em>. I am not sure exactly how it is defined, but the source file can be found here: <a href=\"http:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/OECD-real-economic-growth.pdf\">OECD real economic growth<\/a>. The datafile: <a href=\"http:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/GDP.ods\">GDP<\/a>. Because some data are missing, I impute them using <em>irmi()<\/em> from <em>VIM<\/em>.<\/p>\n<p>One thing about kmeans is that it is a non-deterministic algorithm because it depends on the initial positions of the cluster means. In the default settings, these are picked at random. This means that results will not always be identical. Even when they are, the cluster numbers may change between runs. Such is life. We can get around the first problem by running the method a large number of times and picking the best of the solutions found (the one with the lowest total within-cluster sum of squares). I set the default for this to 100 runs.<\/p>\n<p>kmeans can use raw data if desired. Generally, however, this is not desired: we want all the variables to count equally, not differ in importance because their scales are different. The default setting is thus to standardize the data first.<\/p>\n<p>Then there is the question of how to plot the solution. The usual solution is a scatter plot. I used that as well. 
The scatter plot is based on the first two factors as found by <em>fa()<\/em> (<em>psych<\/em> package).<\/p>\n<p>Finally, I color code the clusters and add rownames to the points for easy identification.<\/p>\n<p>The results:<\/p>\n<p><a href=\"http:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/cluster_GDP1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-5457\" src=\"http:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/cluster_GDP1-1024x653.png\" alt=\"cluster_GDP1\" width=\"720\" height=\"459\" \/><\/a><a href=\"http:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/cluster_GDP2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-5456\" src=\"http:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/cluster_GDP2-1024x653.png\" alt=\"cluster_GDP2\" width=\"720\" height=\"459\" \/><\/a><\/p>\n<p>In the first plot, I have clustered the countries. In the second, I have clustered the years. The latter is done simply by first transposing the dataset.<\/p>\n<p><strong>R code<\/strong><\/p>\n<pre>library(pacman)\r\np_load(kirkegaard, psych, plyr, stringr, weights, VIM, ggplot2, readODS)\r\n\r\n#read from file\r\nd = read.ods(\"GDP.ods\", 1)\r\n\r\n#fix names\r\nrow_names = d$A[-1]\r\ncol_names = d[1, -1]\r\nd = d[-1, -1]\r\n\r\n#recode NA\r\nd[d==\"\"] = NA\r\n\r\n#to numeric\r\nd = as.data.frame(lapply(d, as.numeric))\r\n\r\n#impute NA\r\nd = irmi(d, noise = FALSE)\r\n\r\n#names\r\nrownames(d) = row_names\r\ncolnames(d) = col_names\r\n\r\n#transposed version\r\nd2 = as.data.frame(t(d))\r\n\r\n#some cors\r\nround(wtd.cors(d), 2)\r\nround(wtd.cors(t(d)), 2)\r\n\r\nplot_kmeans = function(df, clusters, runs = 100, standardize = TRUE) {\r\n  library(psych)\r\n  library(ggplot2)\r\n\r\n  #standardize\r\n  if (standardize) df = std_df(df)\r\n\r\n  #cluster: try 'runs' random starts and keep the best solution\r\n  tmp_k = kmeans(df, centers = clusters, nstart = runs)\r\n\r\n  #factor\r\n  tmp_f = fa(df, 2, rotate = 
\"none\")\r\n\r\n  #collect data\r\n  tmp_d = data.frame(matrix(ncol=0, nrow=nrow(df)))\r\n  tmp_d$cluster = as.factor(tmp_k$cluster)\r\n  tmp_d$fact_1 = as.numeric(tmp_f$scores[, 1])\r\n  tmp_d$fact_2 = as.numeric(tmp_f$scores[, 2])\r\n  tmp_d$label = rownames(df)\r\n\r\n  #plot\r\n  g = ggplot(tmp_d, aes(fact_1, fact_2, color = cluster)) + geom_point() + geom_text(aes(label = label), size = 3, vjust = 1, color = \"black\")\r\n  return(g)\r\n}\r\n\r\nplot_kmeans(d, 2)\r\nggsave(\"cluster_GDP1.png\")\r\nplot_kmeans(d2, 2)\r\nggsave(\"cluster_GDP2.png\")<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>I looked around to see if I could find a nice function for just plotting the results of kmeans() using ggplot2. I could not find that. So I wrote my own function. The example data I&#8217;m using is real GDP growth, not sure exactly what it is, but the file can be found here:\u00a0OECD real [&hellip;]<\/p>\n","protected":false},"author":17,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1766,2089],"tags":[2219,2220,1979],"class_list":["post-5453","post","type-post","status-publish","format-standard","hentry","category-math-science","category-programming","tag-cluster-analysis","tag-ggplot2","tag-r","entry"],"_links":{"self":[{"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/posts\/5453","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/users\/17"}],"replies":[{"embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/comments?post=5453"}],"version-history":[{"count":1,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/p
osts\/5453\/revisions"}],"predecessor-version":[{"id":5458,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/posts\/5453\/revisions\/5458"}],"wp:attachment":[{"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/media?parent=5453"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/categories?post=5453"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/tags?post=5453"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}