Scraping with R

Although other languages are probably more suitable for web scraping (e.g. Python with Scrapy), R does have some scraping capabilities. Unsurprisingly, the ever awesome Hadley has written a great package for this: rvest. A simple tutorial and demonstration of it can be found here, which I the one I used. To do web scraping efficiently, one needs to be familiar with CSS selectors (this website is useful even if hated by some) and probably also regex (try this interactive tutorial).

What to scrape? & spatial autocorrelation

There are billions of things one could scrape and use for studies. I could spend all my time doing this if I didn’t have so many other projects that I need to work on and finish up. In this case I am in need of spatial data for the Admixture in the Americas project because a commenter told us to look into the problem of spatial autocorrelation. Essentially, spatial aurocorrelation is when your data concerns units that have some kind of location. This can be an actual location on Earth (or in space), or in some kind of abstract space such as phylogenetic space. Generally speaking, units that are closer to each other tend to be more similar (positive autocorrelation). I’ve read that this can make some predictors seem like they work when in fact they don’t. I haven’t seen an example of this yet, but I will try to see if I can simulate some data to confirm the problem (and maybe make a Shiny app?). A good rule of thumb for statistics is that if you can’t simulate data that show a given phenomenon, you don’t understand it well enough. The below visualization was offered in the R tutorial above.

spatial autocor

Back to the topic. To control for spatial autocorrelation in an analysis, one needs some actual measures of each unit’s location. Since our units are countries and regions within countries, one needs latitude and longitude numbers. We looked around a bit and found some lists of these numbers. However, they did not cover all our units. We could only find data for US states and sovereign countries. Our dataset however also includes Mexican and Brazilian states, Colombian departments and various overseas territories.

Since we are really interested in the people living in these areas, the best measure would be the center of population, which is the demographic equivalent of center of mass in physics. This however requires having access to lots of population density information, which we could likewise not find. Theoretically, one could find a list of the largest cities of each unit and then calculate a weighted spherical geometrical central tendency of this and use that. This would probably be a good measure altho it is sensitive to differences in rates of urbanicity (proportion of people living in cities). One probably could obtain this information using Wikipedia and do the calculations, but it would be fairly time consuming and I suspect for fairly little gain (hence a new research proposal: for some units obtain both center of population and capital location data as well as some other data, compare results.). I decided on a simpler solution: finding the capital of each unit and using the location of that as a estimate. Because capitals are often located near the center of population for political reasons and because they are often the largest city as well, they should make okay approximations.

The actual scraping code

Below is the R code I wrote. Note that it is done fairly verbosely. One could create functions to make it simpler, but since I was just learning this stuff, I did it a simpler way. The code is commented, so you (and myself, when I read this post at some later point!) should be able to figure out what the code does.

library(pacman)
p_load(kirkegaard, stringr, psych, rvest)


# Colombian location scrape -----------------------------------------------
col_deps = html("https://en.wikipedia.org/wiki/Departments_of_Colombia")

#get names of deps and their capitals
dep_table = col_deps %>% 
  html_node(".wikitable") %>%
  html_table(., fill = T)

#get links to each capital
link_caps = col_deps %>%
  html_node(".wikitable") %>%
  html_nodes("td:nth-child(3) a") %>%
  html_attr(., "href")

#visit each capital page
for (cap in seq_along(link_caps)) {
  print(str_c("fetching ", cap, " of ", length(link_caps)))
  
  cap_page = html(str_c("https://en.wikipedia.org", link_caps[cap]))
  link_geohack = cap_page %>%
    html_node(".mergedbottomrow td a") %>%
    html_attr("href")
  
  #get geohack
  geohack = html(str_c("https:", link_geohack))
  
  #get lat and lon
  latlon = geohack %>%
    html_nodes(".geo span") %>%
    html_text()
  
  #save
  dep_table[cap + 1, "lat"] = latlon[1]
  dep_table[cap + 1, "lon"] = latlon[2]
}

#scrape the last one manually
dep_table[1, "lat"] = 4.598056
dep_table[1, "lon"] = -74.075833

#export
write_clipboard(dep_table)



# world capitals ----------------------------------------------------------

list_caps = html("https://en.wikipedia.org/wiki/List_of_national_capitals_in_alphabetical_order")

#get the table
table = list_caps %>% 
  html_node(".wikitable") %>%
  html_table()

#get the capital links
cap_links = list_caps %>% 
  html_node(".wikitable") %>%
  html_nodes("td:nth-child(1) a:nth-child(1)") %>%
  html_attr("href")

#remove the wrong entry
cap_links = cap_links[-169]

#visit each capital
for (cap in seq_along(cap_links)) {
  message(str_c("fetching ", cap, " out of ", length(cap_links)))
  
  #check if already have data
  if (!is.na(table[cap, "lat"])) {
    msg = str_c("already have data. Skipping")
    message(msg)
    next
  }
  
  
  #fetch capital page
  url = str_c("https://en.wikipedia.org", cap_links[cap])
  message(url)
  tried = try({
    cap_page = html(url)
  })
  
  #no coordinates for that capital, skip
  if (any(class(tried) == "try-error")) {
    msg = str_c("error in html fetch ", cap, ". Skipping")
    message(msg)
    next
  }
  
  #get link to geohack
  tried = try({
    all_links = cap_page %>%
      html_nodes("a") %>%
      html_attr("href")
  })
  
  #is there a link to wmflabs?
  if(!any(str_detect(all_links, "tools.wmflabs.org"), na.rm = T)) {
    msg = str_c("No link to wmflabs anywhere. Skipping.")
    message(msg)
    next
  }
  
  #fetch url
  which_url = which(str_detect(all_links, "tools.wmflabs.org"))[1] #get the first if there are multiple
  url = str_c("https:", all_links[which_url])
  
  #visit geohack
  geohack = html(url)
  
  #get lat and lon
  latlon = geohack %>%
    html_nodes(".geo span") %>%
    html_text()
  
  #save
  table[cap, "lat"] = latlon[1]
  table[cap, "lon"] = latlon[2]
  msg = str_c("Success!")
  message(msg)
}


# States of Mexico --------------------------------------------------------
url = "https://en.wikipedia.org/wiki/States_of_Mexico"
state_Mexico = html(url)

#get main table
Mex_table = state_Mexico %>%
  html_node(".wikitable") %>%
  html_table()

#links to capitals
cap_links = state_Mexico %>%
  html_node(".wikitable") %>%
  html_nodes("td:nth-child(4) a") %>%
  html_attr("href")

#visit each state page
cap_pages = lapply(cap_links, function(x) {
  html(str_c("https://en.wikipedia.org", x))
})

#fetch all the links for each page
cap_pages_links = lapply(cap_pages, function(x) {
  html_nodes(x, "a") %>%
    html_attr("href")
})

#fetch the geohack links
geohack_links = lapply(cap_pages_links, function(x) {
  which_url = which(str_detect(x, "tools.wmflabs.org"))[1]
  x[which_url]
})

#geohack pages
geohack_pages = lapply(geohack_links, function(x) {
  url = str_c("https:", x)
  html(url)
})

#lat and lon
latlon = ldply(geohack_pages, function(x) {
  html_nodes(x, ".geo span") %>%
    html_text()
})
colnames(latlon) = c("lat", "lon")

#add to df
Mex_table = cbind(Mex_table, latlon)

#export
write_clipboard(Mex_table, 99)


# US states ---------------------------------------------------------------
#main list
US_states_page = html("https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States")

#get the table
US_table = US_states_page %>%
  html_node(".wikitable") %>%
  html_table()

#get links to the state capitals
cap_links = US_states_page %>%
  html_node(".wikitable") %>%
  html_nodes("td:nth-child(3) a") %>%
  html_attr("href")

#fetch pages
cap_pages = lapply(cap_links, function(x) {
  url = str_c("https://en.wikipedia.org", x)
  html(url)
})

#fetch geohack pages
geohacks = lapply(cap_pages, function(x) {
  #find links
  links = html_nodes(x, "a") %>%
    html_attr("href")
  
  #find the right one
  which_link = which(str_detect(links, "tools.wmflabs.org"))
  link = links[which_link][1]
  
  url = str_c("https:", link)
  html(url)
})

#fetch latlon
latlon = ldply(geohacks, function(x) {
  html_nodes(x, ".geo span") %>%
    html_text()
})
colnames(latlon) = c("lat", "lon")

#add to df
US_table = cbind(US_table, latlon)

#export
write_clipboard(US_table, 99)


# Brazil ------------------------------------------------------------------

#the main page we get info from
states_page = html("https://en.wikipedia.org/wiki/States_of_Brazil")

#get the table into R
BRA_table = states_page %>%
  html_node(".wikitable") %>%
  html_table

#get links to state caps
cap_links = states_page %>%
  html_node(".wikitable") %>%
  html_nodes("td:nth-child(4) a") %>%
  html_attr("href")

#get each cap page
cap_pages = lapply(cap_links, function(x) {
  url = str_c("https://en.wikipedia.org", x)
  x = html(url)
})

#fetch geohack pages
geohacks = lapply(cap_pages, function(x) {
  #find links
  links = html_nodes(x, "a") %>%
    html_attr("href")
  
  #find the right one
  which_link = which(str_detect(links, "tools.wmflabs.org"))
  link = links[which_link][1]
  
  url = str_c("https:", link)
  html(url)
})

#fetch latlon
latlon = ldply(geohacks, function(x) {
  html_nodes(x, ".geo span") %>%
    html_text()
})
colnames(latlon) = c("lat", "lon")

BRA_table$lat = latlon[[1]]
BRA_table$lon = latlon[[2]]

write_clipboard(BRA_table)
write_clipboard(latlon)


# clean up the tables and merge ---------------------------------------------------------
#instead of merging into mega 5 times, we first clean up each dataset and then marge once

##Colombia
#rename Amazon because a state with the identical name exists for Brazil
dep_table[2, 2] = "Amazonas (COL)"

#abbreviations
COL_abbrev = as_abbrev(dep_table$Department)

#subset to the cols we want
col_clean = dep_table[c(2, 3, 7, 8)]

#rownames
rownames(col_clean) = COL_abbrev

##World
world_clean = table[c(2, 1, 4, 5)]

#some have multiple listed capitals, we just pick the first one mentioned
world_clean$City %<>% str_extract("[^\\(]*") %>% #get all text before parenthesis
  str_trim #trim whitespace

#remove comments from country names
world_clean$Country %<>%
  str_replace_all("\\[.*?\\]", "") %>%
  str_trim

#abbreviations
rownames(world_clean) = as_abbrev(world_clean$Country)

##Mexico
#subset cols
MEX_clean = Mex_table[c(1, 4, 10, 11)]

#remove numbers from the names
MEX_clean$State = MEX_clean$State %>% str_replace_all("\\d", "")

#abbrevs
rownames(MEX_clean) = MEX_clean$State %>% as_abbrev

##Brazil
BRA_clean = BRA_table[c(2, 4, 16, 17)]

#add identifiers to ambiguous names
BRA_clean[4, "State"] = "Amazonas (BRA)"
BRA_clean[7, "State"] = "Distrito Federal (BRA)"

#abbrevs
rownames(BRA_clean) = BRA_clean$State %>% as_abbrev

##USA
USA_clean = US_table[c(1, 3, 11, 12)]

#clean names
USA_clean$State = USA_clean$State %>% str_replace_all("\\[.*?\\]", "")

#abbevs
rownames(USA_clean) = USA_clean$State %>% as_abbrev(., georgia = "s")

# merge into megadataset --------------------------------------------------
#give them the same colnames
colnames(col_clean) = colnames(MEX_clean) = colnames(BRA_clean) = colnames(USA_clean) = colnames(world_clean) = c("Names2", "Capital", "lat", "lon")

#stack by bow
all_clean = rbind(col_clean, MEX_clean, BRA_clean, USA_clean, world_clean)

#load mega
mega = read_mega("Megadataset_v2.0l.csv")

#merge into mega
mega2 = merge_datasets(mega, all_clean, 1)

#save
write_mega(mega2, "Megadataset_v2.0m.csv")