{"id":5471,"date":"2015-09-08T21:44:18","date_gmt":"2015-09-08T20:44:18","guid":{"rendered":"http:\/\/emilkirkegaard.dk\/en\/?p=5471"},"modified":"2015-09-08T21:44:18","modified_gmt":"2015-09-08T20:44:18","slug":"web-scraping-with-r-using-rvest-spatial-autocorrelation","status":"publish","type":"post","link":"https:\/\/emilkirkegaard.dk\/en\/2015\/09\/web-scraping-with-r-using-rvest-spatial-autocorrelation\/","title":{"rendered":"Web scraping with R using rvest &#038; spatial autocorrelation"},"content":{"rendered":"<p><strong>Scraping with R<\/strong><\/p>\n<p>Although other languages are probably more suitable for web scraping (e.g. Python with <a href=\"https:\/\/en.wikipedia.org\/wiki\/Scrapy\">Scrapy<\/a>), R does have some scraping capabilities. Unsurprisingly, the ever awesome Hadley has written a great package for this: rvest. A simple tutorial and demonstration of it can be found <a href=\"http:\/\/blog.rstudio.org\/2014\/11\/24\/rvest-easy-web-scraping-with-r\/\">here<\/a>, which is the one I used. To do web scraping efficiently, one needs to be familiar with CSS selectors (<a href=\"http:\/\/www.w3schools.com\/css\/default.asp\">this website<\/a> is useful even if <a href=\"http:\/\/www.quora.com\/Why-do-people-hate-W3schools-com\">hated by some<\/a>) and probably also regex (try <a href=\"http:\/\/regexone.com\/\">this interactive tutorial<\/a>).<\/p>\n<p><strong>What to scrape? &amp; spatial autocorrelation<\/strong><\/p>\n<p>There are billions of things one could scrape and use for studies. I could spend all my time doing this if I didn&#8217;t have so many other projects that I need to work on and finish up. In this case I am in need of spatial data for the <em>Admixture in the Americas<\/em> project because a commenter told us to look into <a href=\"http:\/\/www.r-bloggers.com\/correction-for-spatial-and-temporal-auto-correlation-in-panel-data-using-r-to-estimate-spatial-hac-errors-per-conley\/\">the problem of spatial autocorrelation<\/a>. 
Essentially, spatial autocorrelation is when your data concerns units that have some kind of location. This can be an actual location on Earth (or in space), or in some kind of abstract space such as <a href=\"https:\/\/en.wikipedia.org\/wiki\/Phylogenetics\">phylogenetic space<\/a>. Generally speaking, units that are closer to each other tend to be more similar (positive autocorrelation). I&#8217;ve read that this can make some predictors seem like they work when in fact they don&#8217;t. I haven&#8217;t seen an example of this yet, but I will try to see if I can simulate some data to confirm the problem (and maybe make a Shiny app?). A good rule of thumb for statistics is that <em>if you can&#8217;t simulate data that show a given phenomenon, you don&#8217;t understand it well enough<\/em>. The visualization below was offered in the R tutorial above.<\/p>\n<p><a href=\"http:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/spatial-autocor.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-5472\" src=\"http:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/spatial-autocor.png\" alt=\"spatial autocor\" width=\"444\" height=\"165\" \/><\/a><\/p>\n<p>Back to the topic. To control for spatial autocorrelation in an analysis, one needs some actual measures of each unit&#8217;s location. Since our units are countries and regions within countries, one needs latitude and longitude numbers. We looked around a bit and found some lists of these numbers. However, they did not cover all our units. We could only find data for US states and sovereign countries. 
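<\/p>\n<p>As an aside, once one has latitude and longitude for each unit, the spatial proximities needed to deal with autocorrelation can be computed as great-circle distances. Below is a minimal sketch using the standard haversine formula (the coordinates in the example call are rough values, just for illustration):<\/p>\n<pre>#great-circle distance in km between two points given in decimal degrees\r\n#(haversine formula; 6371 km is the mean radius of the Earth)\r\nhaversine = function(lat1, lon1, lat2, lon2, r = 6371) {\r\n  to_rad = pi \/ 180\r\n  dlat = (lat2 - lat1) * to_rad\r\n  dlon = (lon2 - lon1) * to_rad\r\n  a = sin(dlat \/ 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon \/ 2)^2\r\n  2 * r * asin(sqrt(a))\r\n}\r\n\r\n#e.g. Bogota to Mexico City, roughly 3,200 km\r\nhaversine(4.60, -74.08, 19.43, -99.13)<\/pre>\n<p>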
Our dataset however also includes Mexican and Brazilian states, Colombian departments and various overseas territories.<\/p>\n<p>Since we are really interested in the people living in these areas, the best measure would be the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Center_of_population\">center of population<\/a>, which is the demographic equivalent of <a href=\"https:\/\/en.wikipedia.org\/wiki\/Center_of_mass\">center of mass<\/a> in physics. This however requires access to lots of population density information, which we could likewise not find. Theoretically, one could find a list of the largest cities of each unit and then calculate a weighted <a href=\"https:\/\/en.wikipedia.org\/wiki\/Spherical_geometry\">spherical geometrical<\/a> <a href=\"https:\/\/en.wikipedia.org\/wiki\/Central_tendency\">central tendency<\/a> of this and use that. This would probably be a good measure although it is sensitive to differences in rates of urbanicity (proportion of people living in cities). One probably could obtain this information using Wikipedia and do the calculations, but it would be fairly time consuming and I suspect for rather little gain (hence a new research proposal: for some units, obtain both center of population and capital location data as well as some other data, and compare results). I decided on a simpler solution: finding the capital of each unit and using its location as an estimate. Because capitals are often located near the center of population for political reasons and because they are often the largest city as well, they should make okay approximations.<\/p>\n<p><strong>The actual scraping code<\/strong><\/p>\n<p>Below is the R code I wrote. Note that it is done fairly verbosely. One could write functions to make it more compact, but since I was just learning this stuff, I did it the straightforward way. The code is commented, so you (and myself, when I read this post at some later point!) 
should be able to figure out what the code does.<\/p>\n<pre>library(pacman)\r\np_load(kirkegaard, plyr, magrittr, stringr, psych, rvest)\r\n\r\n\r\n# Colombian location scrape -----------------------------------------------\r\ncol_deps = html(\"https:\/\/en.wikipedia.org\/wiki\/Departments_of_Colombia\")\r\n\r\n#get names of deps and their capitals\r\ndep_table = col_deps %&gt;% \r\n  html_node(\".wikitable\") %&gt;%\r\n  html_table(., fill = T)\r\n\r\n#get links to each capital\r\nlink_caps = col_deps %&gt;%\r\n  html_node(\".wikitable\") %&gt;%\r\n  html_nodes(\"td:nth-child(3) a\") %&gt;%\r\n  html_attr(., \"href\")\r\n\r\n#visit each capital page\r\nfor (cap in seq_along(link_caps)) {\r\n  print(str_c(\"fetching \", cap, \" of \", length(link_caps)))\r\n  \r\n  cap_page = html(str_c(\"https:\/\/en.wikipedia.org\", link_caps[cap]))\r\n  link_geohack = cap_page %&gt;%\r\n    html_node(\".mergedbottomrow td a\") %&gt;%\r\n    html_attr(\"href\")\r\n  \r\n  #get geohack\r\n  geohack = html(str_c(\"https:\", link_geohack))\r\n  \r\n  #get lat and lon\r\n  latlon = geohack %&gt;%\r\n    html_nodes(\".geo span\") %&gt;%\r\n    html_text()\r\n  \r\n  #save\r\n  dep_table[cap + 1, \"lat\"] = latlon[1]\r\n  dep_table[cap + 1, \"lon\"] = latlon[2]\r\n}\r\n\r\n#fill in the first one (Bogota) manually\r\ndep_table[1, \"lat\"] = 4.598056\r\ndep_table[1, \"lon\"] = -74.075833\r\n\r\n#export\r\nwrite_clipboard(dep_table)\r\n\r\n\r\n\r\n# world capitals ----------------------------------------------------------\r\n\r\nlist_caps = html(\"https:\/\/en.wikipedia.org\/wiki\/List_of_national_capitals_in_alphabetical_order\")\r\n\r\n#get the table\r\ntable = list_caps %&gt;% \r\n  html_node(\".wikitable\") %&gt;%\r\n  html_table()\r\n\r\n#get the capital links\r\ncap_links = list_caps %&gt;% \r\n  html_node(\".wikitable\") %&gt;%\r\n  
html_nodes(\"td:nth-child(1) a:nth-child(1)\") %&gt;%\r\n  html_attr(\"href\")\r\n\r\n#remove the wrong entry\r\ncap_links = cap_links[-169]\r\n\r\n#visit each capital\r\nfor (cap in seq_along(cap_links)) {\r\n  message(str_c(\"fetching \", cap, \" out of \", length(cap_links)))\r\n  \r\n  #check if already have data\r\n  if (!is.na(table[cap, \"lat\"])) {\r\n    msg = str_c(\"already have data. Skipping\")\r\n    message(msg)\r\n    next\r\n  }\r\n  \r\n  \r\n  #fetch capital page\r\n  url = str_c(\"https:\/\/en.wikipedia.org\", cap_links[cap])\r\n  message(url)\r\n  tried = try({\r\n    cap_page = html(url)\r\n  })\r\n  \r\n  #page fetch failed for that capital, skip\r\n  if (any(class(tried) == \"try-error\")) {\r\n    msg = str_c(\"error in html fetch \", cap, \". Skipping\")\r\n    message(msg)\r\n    next\r\n  }\r\n  \r\n  #get link to geohack\r\n  tried = try({\r\n    all_links = cap_page %&gt;%\r\n      html_nodes(\"a\") %&gt;%\r\n      html_attr(\"href\")\r\n  })\r\n  \r\n  #is there a link to wmflabs?\r\n  if (!any(str_detect(all_links, \"tools.wmflabs.org\"), na.rm = T)) {\r\n    msg = str_c(\"No link to wmflabs anywhere. 
Skipping.\")\r\n    message(msg)\r\n    next\r\n  }\r\n  \r\n  #fetch url\r\n  which_url = which(str_detect(all_links, \"tools.wmflabs.org\"))[1] #get the first if there are multiple\r\n  url = str_c(\"https:\", all_links[which_url])\r\n  \r\n  #visit geohack\r\n  geohack = html(url)\r\n  \r\n  #get lat and lon\r\n  latlon = geohack %&gt;%\r\n    html_nodes(\".geo span\") %&gt;%\r\n    html_text()\r\n  \r\n  #save\r\n  table[cap, \"lat\"] = latlon[1]\r\n  table[cap, \"lon\"] = latlon[2]\r\n  msg = str_c(\"Success!\")\r\n  message(msg)\r\n}\r\n\r\n\r\n# States of Mexico --------------------------------------------------------\r\nurl = \"https:\/\/en.wikipedia.org\/wiki\/States_of_Mexico\"\r\nstate_Mexico = html(url)\r\n\r\n#get main table\r\nMex_table = state_Mexico %&gt;%\r\n  html_node(\".wikitable\") %&gt;%\r\n  html_table()\r\n\r\n#links to capitals\r\ncap_links = state_Mexico %&gt;%\r\n  html_node(\".wikitable\") %&gt;%\r\n  html_nodes(\"td:nth-child(4) a\") %&gt;%\r\n  html_attr(\"href\")\r\n\r\n#visit each state page\r\ncap_pages = lapply(cap_links, function(x) {\r\n  html(str_c(\"https:\/\/en.wikipedia.org\", x))\r\n})\r\n\r\n#fetch all the links for each page\r\ncap_pages_links = lapply(cap_pages, function(x) {\r\n  html_nodes(x, \"a\") %&gt;%\r\n    html_attr(\"href\")\r\n})\r\n\r\n#fetch the geohack links\r\ngeohack_links = lapply(cap_pages_links, function(x) {\r\n  which_url = which(str_detect(x, \"tools.wmflabs.org\"))[1]\r\n  x[which_url]\r\n})\r\n\r\n#geohack pages\r\ngeohack_pages = lapply(geohack_links, function(x) {\r\n  url = str_c(\"https:\", x)\r\n  html(url)\r\n})\r\n\r\n#lat and lon\r\nlatlon = ldply(geohack_pages, function(x) {\r\n  html_nodes(x, \".geo span\") %&gt;%\r\n    html_text()\r\n})\r\ncolnames(latlon) 
= c(\"lat\", \"lon\")\r\n\r\n#add to df\r\nMex_table = cbind(Mex_table, latlon)\r\n\r\n#export\r\nwrite_clipboard(Mex_table, 99)\r\n\r\n\r\n# US states ---------------------------------------------------------------\r\n#main list\r\nUS_states_page = html(\"https:\/\/en.wikipedia.org\/wiki\/List_of_states_and_territories_of_the_United_States\")\r\n\r\n#get the table\r\nUS_table = US_states_page %&gt;%\r\n  html_node(\".wikitable\") %&gt;%\r\n  html_table()\r\n\r\n#get links to the state capitals\r\ncap_links = US_states_page %&gt;%\r\n  html_node(\".wikitable\") %&gt;%\r\n  html_nodes(\"td:nth-child(3) a\") %&gt;%\r\n  html_attr(\"href\")\r\n\r\n#fetch pages\r\ncap_pages = lapply(cap_links, function(x) {\r\n  url = str_c(\"https:\/\/en.wikipedia.org\", x)\r\n  html(url)\r\n})\r\n\r\n#fetch geohack pages\r\ngeohacks = lapply(cap_pages, function(x) {\r\n  #find links\r\n  links = html_nodes(x, \"a\") %&gt;%\r\n    html_attr(\"href\")\r\n  \r\n  #find the right one\r\n  which_link = which(str_detect(links, \"tools.wmflabs.org\"))\r\n  link = links[which_link][1]\r\n  \r\n  url = str_c(\"https:\", link)\r\n  html(url)\r\n})\r\n\r\n#fetch latlon\r\nlatlon = ldply(geohacks, function(x) {\r\n  html_nodes(x, \".geo span\") %&gt;%\r\n    html_text()\r\n})\r\ncolnames(latlon) = c(\"lat\", \"lon\")\r\n\r\n#add to df\r\nUS_table = cbind(US_table, latlon)\r\n\r\n#export\r\nwrite_clipboard(US_table, 99)\r\n\r\n\r\n# Brazil ------------------------------------------------------------------\r\n\r\n#the main page we get info from\r\nstates_page = html(\"https:\/\/en.wikipedia.org\/wiki\/States_of_Brazil\")\r\n\r\n#get the table into R\r\nBRA_table = states_page %&gt;%\r\n  html_node(\".wikitable\") %&gt;%\r\n  html_table()\r\n\r\n#get links to state caps\r\ncap_links = states_page %&gt;%\r\n  html_node(\".wikitable\") %&gt;%\r\n  
html_nodes(\"td:nth-child(4) a\") %&gt;%\r\n  html_attr(\"href\")\r\n\r\n#get each cap page\r\ncap_pages = lapply(cap_links, function(x) {\r\n  url = str_c(\"https:\/\/en.wikipedia.org\", x)\r\n  html(url)\r\n})\r\n\r\n#fetch geohack pages\r\ngeohacks = lapply(cap_pages, function(x) {\r\n  #find links\r\n  links = html_nodes(x, \"a\") %&gt;%\r\n    html_attr(\"href\")\r\n  \r\n  #find the right one\r\n  which_link = which(str_detect(links, \"tools.wmflabs.org\"))\r\n  link = links[which_link][1]\r\n  \r\n  url = str_c(\"https:\", link)\r\n  html(url)\r\n})\r\n\r\n#fetch latlon\r\nlatlon = ldply(geohacks, function(x) {\r\n  html_nodes(x, \".geo span\") %&gt;%\r\n    html_text()\r\n})\r\ncolnames(latlon) = c(\"lat\", \"lon\")\r\n\r\nBRA_table$lat = latlon[[1]]\r\nBRA_table$lon = latlon[[2]]\r\n\r\nwrite_clipboard(BRA_table)\r\nwrite_clipboard(latlon)\r\n\r\n\r\n# clean up the tables and merge ---------------------------------------------------------\r\n#instead of merging into mega 5 times, we first clean up each dataset and then merge once\r\n\r\n##Colombia\r\n#rename Amazonas because a state with the identical name exists in Brazil\r\ndep_table[2, 2] = \"Amazonas (COL)\"\r\n\r\n#abbreviations\r\nCOL_abbrev = as_abbrev(dep_table$Department)\r\n\r\n#subset to the cols we want\r\ncol_clean = dep_table[c(2, 3, 7, 8)]\r\n\r\n#rownames\r\nrownames(col_clean) = COL_abbrev\r\n\r\n##World\r\nworld_clean = table[c(2, 1, 4, 5)]\r\n\r\n#some have multiple listed capitals, we just pick the first one mentioned\r\nworld_clean$City %&lt;&gt;% str_extract(\"[^\\\\(]*\") %&gt;% #get all text before parenthesis\r\n  str_trim #trim whitespace\r\n\r\n#remove comments from country names\r\nworld_clean$Country %&lt;&gt;%\r\n  str_replace_all(\"\\\\[.*?\\\\]\", \"\") %&gt;%\r\n  str_trim\r\n\r\n#abbreviations\r\nrownames(world_clean) = 
as_abbrev(world_clean$Country)\r\n\r\n##Mexico\r\n#subset cols\r\nMEX_clean = Mex_table[c(1, 4, 10, 11)]\r\n\r\n#remove numbers from the names\r\nMEX_clean$State = MEX_clean$State %&gt;% str_replace_all(\"\\\\d\", \"\")\r\n\r\n#abbrevs\r\nrownames(MEX_clean) = MEX_clean$State %&gt;% as_abbrev\r\n\r\n##Brazil\r\nBRA_clean = BRA_table[c(2, 4, 16, 17)]\r\n\r\n#add identifiers to ambiguous names\r\nBRA_clean[4, \"State\"] = \"Amazonas (BRA)\"\r\nBRA_clean[7, \"State\"] = \"Distrito Federal (BRA)\"\r\n\r\n#abbrevs\r\nrownames(BRA_clean) = BRA_clean$State %&gt;% as_abbrev\r\n\r\n##USA\r\nUSA_clean = US_table[c(1, 3, 11, 12)]\r\n\r\n#clean names\r\nUSA_clean$State = USA_clean$State %&gt;% str_replace_all(\"\\\\[.*?\\\\]\", \"\")\r\n\r\n#abbrevs\r\nrownames(USA_clean) = USA_clean$State %&gt;% as_abbrev(., georgia = \"s\")\r\n\r\n# merge into megadataset --------------------------------------------------\r\n#give them the same colnames\r\ncolnames(col_clean) = colnames(MEX_clean) = colnames(BRA_clean) = colnames(USA_clean) = colnames(world_clean) = c(\"Names2\", \"Capital\", \"lat\", \"lon\")\r\n\r\n#stack by row\r\nall_clean = rbind(col_clean, MEX_clean, BRA_clean, USA_clean, world_clean)\r\n\r\n#load mega\r\nmega = read_mega(\"Megadataset_v2.0l.csv\")\r\n\r\n#merge into mega\r\nmega2 = merge_datasets(mega, all_clean, 1)\r\n\r\n#save\r\nwrite_mega(mega2, \"Megadataset_v2.0m.csv\")\r\n\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Scraping with R Although other languages are probably more suitable for web scraping (e.g. Python with Scrapy), R does have some scraping capabilities. Unsurprisingly, the ever awesome Hadley has written a great package for this: rvest. A simple tutorial and demonstration of it can be found here, which is the one I used. 
To do [&hellip;]<\/p>\n","protected":false},"author":17,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2089],"tags":[2046,1979,2226,2227,1400],"class_list":["post-5471","post","type-post","status-publish","format-standard","hentry","category-programming","tag-admixture-in-the-americas","tag-r","tag-spatial-autocorrelation","tag-web-scraping","tag-wikipedia","entry"],"_links":{"self":[{"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/posts\/5471","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/users\/17"}],"replies":[{"embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/comments?post=5471"}],"version-history":[{"count":1,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/posts\/5471\/revisions"}],"predecessor-version":[{"id":5473,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/posts\/5471\/revisions\/5473"}],"wp:attachment":[{"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/media?parent=5471"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/categories?post=5471"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/tags?post=5471"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}