In trying to merge some data, I was confronted with a problem of matching up strings where the author had mutilated them. He had done so in two ways: cutting them off at the 8th character or using personal abbreviations. The first one is relatively easy to deal with. The second one is not.
So I looked around a bit and found that there are others who had similar problems:
- http://stats.stackexchange.com/questions/3425/how-to-quasi-match-two-vectors-of-strings-in-r
- http://stackoverflow.com/questions/2231993/merging-two-data-frames-using-fuzzy-approximate-string-matching-in-r
The large table below shows the matching results. The leftmost column has the strings I need to match. The list to be matched against is the list of country and regional names and their ISO 3 abbreviations here: https://osf.io/59dr7/
It looks like for most cases the shortened version worked fine. 165 of 193 matches were found all of which were correct.
The agrep (with max distance = .1, the default), found a match in 175 cases, so only a little improvement there. But it gets worse, in many cases, it disagrees with the stricter matching method and gets it wrong. There is no case where it is correct over the simpler method. Strangely, in all 10 cases where the simple method failed, agrep got it right. But it got it wrong in the easier cases. In some cases, it is truly bizarre: given the string “United S”, it goes for “United Republic of Tanzania” instead of the much easier “United States”. Strangely, a common error is preferring a subset/longer version over an exact match. No human would make this error. E.g. given “Moldova”, it prefers “Moldova, Republic” of over just “Moldova”.
There are a number of different errors it makes. In the comments below I have noted the type of error (my judgment).
For the moment, I would caution the use of this algorithm.
| Country | Genetic_distance_SA | to_short_result | agrep_result | best_match | agreement | filled_in | comments | |
| Norway | 1455.52 | Norway | Norway | Norway | TRUE | Norway | ||
| Netherla | 1453.28 | Netherlands | Netherlands | Netherlands | TRUE | Netherlands | ||
| Ireland | 1940.31 | Ireland | Iceland | Ireland | FALSE | Ireland | prefers substitution over exact | |
| Liechste | 1511.83 | Liechtenstein | Liechtenstein | FALSE | Liechtenstein | correct | ||
| Germany | 1484.92 | Germany | Germany | Germany | TRUE | Germany | ||
| Sweeden | 1453.79 | Sweden | Sweden | FALSE | Sweden | correct | ||
| Switzerl | 1557.96 | Switzerland | Switzerland | Switzerland | TRUE | Switzerland | ||
| Iceland | 1932.36 | Iceland | Iceland | Iceland | TRUE | Iceland | ||
| Denmark | 1472.52 | Denmark | Denmark | Denmark | TRUE | Denmark | ||
| Belgium | 1940.31 | Belgium | Belgium | Belgium | TRUE | Belgium | ||
| Austria | 1465.7 | Austria | Australia | Austria | FALSE | Austria | prefers part deleted | |
| France | 1896.22 | France | France | France | TRUE | France | ||
| Slovenia | 1292.32 | Slovenia | Slovenia | Slovenia | TRUE | Slovenia | ||
| Finland | 2420.3 | Finland | Finland | Finland | TRUE | Finland | ||
| Spain | 1929.44 | Spain | Saint Barth<U+FFFD>lemy | Spain | FALSE | Spain | no idea | |
| Italy | 1961.9 | Italy | Italy | Italy | TRUE | Italy | ||
| Luxembur | 1929.98 | TRUE | Luxembourg | |||||
| Czech Re | 1524.73 | Czech Republic | Czech Republic | Czech Republic | TRUE | Czech Republic | ||
| U. K. | 1916.91 | TRUE | UK | |||||
| Greece | 1283.05 | Greece | Greece | Greece | TRUE | Greece | ||
| Cyprus | 1288.53 | Cyprus | Cyprus | Cyprus | TRUE | Cyprus | ||
| Estonia | 2302.86 | Estonia | Estonia | Estonia | TRUE | Estonia | ||
| Slovakia | 1573.38 | Slovakia | Slovakia | Slovakia | TRUE | Slovakia | ||
| Malta | 1912.52 | Malta | Gibraltar | Malta | FALSE | Malta | prefers subset + substitution over exact | |
| Poland | 1905.67 | Poland | Poland | Poland | TRUE | Poland | ||
| Lithuani | 2389.28 | Lithuania | Lithuania | Lithuania | TRUE | Lithuania | ||
| Portugal | 1949.34 | Portugal | Portugal | Portugal | TRUE | Portugal | ||
| Latvia | 2256.69 | Latvia | Latvia | Latvia | TRUE | Latvia | ||
| Croatia | 1289.64 | Croatia | Croatia | Croatia | TRUE | Croatia | ||
| Romania | 1928.4 | Romania | Romania | Romania | TRUE | Romania | ||
| Bulgaria | 1399.01 | Bulgaria | Bulgaria | Bulgaria | TRUE | Bulgaria | ||
| Serbia | 1421.01 | Serbia | Serbia | Serbia | TRUE | Serbia | ||
| Russia | 1975.49 | Russia | Russia | Russia | TRUE | Russia | ||
| Albania | 1301.47 | Albania | Albania | Albania | TRUE | Albania | ||
| Macedoni | 1334.51 | Macedonia | Macedonia | Macedonia | TRUE | Macedonia | ||
| Armenia | 1558.32 | Armenia | Armenia | Armenia | TRUE | Armenia | ||
| Moldova | 1527.95 | Moldova | Moldova, Republic of | Moldova | FALSE | Moldova | prefers longer | |
| Botswana | 347.18 | Botswana | Botswana | Botswana | TRUE | Botswana | ||
| South Af | 0 | South Africa | South Africa | South Africa | TRUE | South Africa | ||
| Ghana | 395.9 | Ghana | Ghana | Ghana | TRUE | Ghana | ||
| Eq Guine | 373.2 | TRUE | Equatorial Guinea | |||||
| Congo | 452.9 | Congo | Congo | Congo | TRUE | Congo | ||
| Kenya | 366.78 | Kenya | Kenya | Kenya | TRUE | Kenya | ||
| Cameroon | 319.62 | Cameroon | Cameroon | Cameroon | TRUE | Cameroon | ||
| Tanzania | 352.54 | Tanzania | Tanzania | Tanzania | TRUE | Tanzania | ||
| Nigeria | 342.24 | Nigeria | Nigeria | Nigeria | TRUE | Nigeria | ||
| Uganda | 358.75 | Uganda | Uganda | Uganda | TRUE | Uganda | ||
| Zambia | 352.54 | Zambia | Gambia | Zambia | FALSE | Zambia | prefers substitution over exact | |
| Sudan | 316.95 | Sudan | Sudan | Sudan | TRUE | Sudan | ||
| Zimbabwe | 352.54 | Zimbabwe | Zimbabwe | Zimbabwe | TRUE | Zimbabwe | ||
| Ethiopia | 705.3 | Ethiopia | Ethiopia | Ethiopia | TRUE | Ethiopia | ||
| Guinea | 395.9 | Guinea | Guinea | Guinea | TRUE | Guinea | ||
| CentAfrR | 469.7 | TRUE | Central African Republic | |||||
| SierraLe | 395.9 | TRUE | Sierra Leone | |||||
| Mozambiq | 355.75 | Mozambique | Mozambique | Mozambique | TRUE | Mozambique | ||
| CongoDR | 410.17 | Congo Republic of | Congo Republic of | FALSE | Congo Republic of | wrong but excuseable; Congo Democratic Republic | ||
| Andorra | 1912.83 | Andorra | Andorra | Andorra | TRUE | Andorra | ||
| Angola | 353.49 | Angola | Angola | Angola | TRUE | Angola | ||
| Belarus | 1949.85 | Belarus | Belarus | Belarus | TRUE | Belarus | ||
| Benin | 394.54 | Benin | Benin | Benin | TRUE | Benin | ||
| Bosnia | 1337.37 | Bosnia | Bosnia and Herzegovina | Bosnia | FALSE | Bosnia | prefers longer | |
| BurkinaF | 378.36 | Burkina Faso | Burkina Faso | FALSE | Burkina Faso | correct | ||
| Burundi | 362.98 | Burundi | Burundi | Burundi | TRUE | Burundi | ||
| Cape Ver | 963.09 | Cape Verde | Cape Verde | Cape Verde | TRUE | Cape Verde | ||
| Chad | 537.81 | Chad | Chad | Chad | TRUE | Chad | ||
| Comoros | 352.54 | Comoros | Comoros | Comoros | TRUE | Comoros | ||
| IvoryCoa | 468.01 | TRUE | Ivory Coast | |||||
| Djibouti | 750.88 | Djibouti | Djibouti | Djibouti | TRUE | Djibouti | ||
| Eritrea | 665.96 | Eritrea | Eritrea | Eritrea | TRUE | Eritrea | ||
| Gabon | 360.07 | Gabon | Gabon | Gabon | TRUE | Gabon | ||
| Gambia | 395.9 | Gambia | Gambia | Gambia | TRUE | Gambia | ||
| Georgia | 1613.7 | Georgia | Georgia | Georgia | TRUE | Georgia | ||
| Guinea-B | 395.9 | Guinea-Bissau | Guinea-Bissau | Guinea-Bissau | TRUE | Guinea-Bissau | ||
| Lesotho | 352.54 | Lesotho | Lesotho | Lesotho | TRUE | Lesotho | ||
| Liberia | 395.9 | Liberia | Liberia | Liberia | TRUE | Liberia | ||
| Malawi | 352.54 | Malawi | Malawi | Malawi | TRUE | Malawi | ||
| Mali | 430.58 | Mali | Australia | Mali | FALSE | Mali | prefers subset + substitution over exact | |
| Mauritan | 681.23 | Mauritania | Mauritania | Mauritania | TRUE | Mauritania | ||
| Namibia | 419.32 | Namibia | Namibia | Namibia | TRUE | Namibia | ||
| Niger | 315.05 | Niger | Niger | Niger | TRUE | Niger | ||
| Rwanda | 364.26 | Rwanda | Rwanda | Rwanda | TRUE | Rwanda | ||
| SaoTomeP | 339.89 | TRUE | Sao Tome and Principe | |||||
| Senegal | 395.9 | Senegal | Senegal | Senegal | TRUE | Senegal | ||
| Seychell | 1709.06 | Seychelles | Seychelles | Seychelles | TRUE | Seychelles | ||
| Somalia | 500.74 | Somalia | Somalia | Somalia | TRUE | Somalia | ||
| Swazilan | 400.17 | Swaziland | Swaziland | Swaziland | TRUE | Swaziland | ||
| Togo | 395.9 | Togo | Togo | Togo | TRUE | Togo | ||
| Ukraine | 1947.94 | Ukraine | Ukraine | Ukraine | TRUE | Ukraine | ||
| Australi | 1971.39 | Australia | Australia | Australia | TRUE | Australia | ||
| United S | 1792.33 | United States | United Republic of Tanzania | United States | FALSE | United States | bizarre | |
| New Zeal | 2061.12 | New Zealand | New Zealand | New Zealand | TRUE | New Zealand | ||
| Canada | 1958.73 | Canada | Canada | Canada | TRUE | Canada | ||
| Japan | 2176.04 | Japan | Japan | Japan | TRUE | Japan | ||
| Hong Kon | 2674.63 | Hong Kong | Hong Kong | Hong Kong | TRUE | Hong Kong | ||
| Korea | 2399.11 | Korea | Korea Democratic People’s Republic of | Korea | FALSE | Korea | prefers subset + substitution over exact | |
| Israel | 1539.63 | Israel | Israel | Israel | TRUE | Israel | ||
| Singapor | 2459.04 | Singapore | Singapore | Singapore | TRUE | Singapore | ||
| Qatar | 1733.83 | Qatar | Qatar | Qatar | TRUE | Qatar | ||
| Hungary | 2432.96 | Hungary | Hungary | Hungary | TRUE | Hungary | ||
| Bahrain | 971.99 | Bahrain | Bahrain | Bahrain | TRUE | Bahrain | ||
| Chile | 2279.52 | Chile | Chile | Chile | TRUE | Chile | ||
| Argentin | 1994.59 | Argentina | Argentina | Argentina | TRUE | Argentina | ||
| Barbados | 468.02 | Barbados | Barbados | Barbados | TRUE | Barbados | ||
| Uruguay | 1918.61 | Uruguay | Uruguay | Uruguay | TRUE | Uruguay | ||
| Cuba | 1370.91 | Cuba | Aruba | Cuba | FALSE | Cuba | prefers subset + substitution over exact | |
| Saudi Ar | 1468.3 | Saudi Arabia | Saudi Arabia | Saudi Arabia | TRUE | Saudi Arabia | ||
| Mexico | 2024.64 | Mexico | Mexico | Mexico | TRUE | Mexico | ||
| Malaysia | 1922.77 | Malaysia | Malaysia | Malaysia | TRUE | Malaysia | ||
| Trinidad | 1024.1 | Trinidad and Tobago | Trinidad and Tobago | Trinidad and Tobago | TRUE | Trinidad and Tobago | ||
| Kuwait | 1081.15 | Kuwait | Kuwait | Kuwait | TRUE | Kuwait | ||
| Lebanon | 1543.46 | Lebanon | Lebanon | Lebanon | TRUE | Lebanon | ||
| Venezuel | 1280.81 | Venezuela, Bolivarian Republic of | Venezuela, Bolivarian Republic of | Venezuela, Bolivarian Republic of | TRUE | Venezuela, Bolivarian Republic of | ||
| Mauritiu | 1792.48 | Mauritius | Mauritius | Mauritius | TRUE | Mauritius | ||
| Jamaica | 595.5 | Jamaica | Jamaica | Jamaica | TRUE | Jamaica | ||
| Peru | 2096.08 | Peru | Hviderusland | Peru | FALSE | Peru | prefers subset + substitution over exact | |
| Dominica | 521.08 | Dominica | Dominica | Dominica | TRUE | Dominica | ||
| SaintLuc | 497.7 | TRUE | Saint Lucia | |||||
| Ecuador | 2228.58 | Ecuador | Ecuador | Ecuador | TRUE | Ecuador | ||
| Brazil | 1875.81 | Brazil | Brazil | Brazil | TRUE | Brazil | ||
| SaintVin | 395.9 | TRUE | Saint Vincent | |||||
| Colombia | 1973.6 | Colombia | Colombia | Colombia | TRUE | Colombia | ||
| Iran | 1945.07 | Iran | France | Iran | FALSE | Iran | prefers subset + substitution over exact | |
| Tonga | 2390.38 | Tonga | Tonga | Tonga | TRUE | Tonga | ||
| Turkey | 2167.95 | Turkey | Turkey | Turkey | TRUE | Turkey | ||
| Belize | 1481.26 | Belize | Belize | Belize | TRUE | Belize | ||
| Tunisia | 203.38 | Tunisia | Tunisia | Tunisia | TRUE | Tunisia | ||
| Jordan | 1539.63 | Jordan | Jordan | Jordan | TRUE | Jordan | ||
| SriLanka | 1783.84 | TRUE | Sri Lanka | |||||
| DomRep | 1206.72 | TRUE | Dominican Republic | |||||
| W. Samoa | 2388.58 | W. Samoa | W. Samoa | W. Samoa | TRUE | W. Samoa | ||
| Fiji | 2534.15 | Fiji | Fiji | Fiji | TRUE | Fiji | ||
| China | 2646.26 | China | China | China | TRUE | China | ||
| Thailand | 2068.81 | Thailand | Thailand | Thailand | TRUE | Thailand | ||
| Surinam | 1562.55 | Suriname | Suriname | Suriname | TRUE | Suriname | ||
| Paraguay | 2243.61 | Paraguay | Paraguay | Paraguay | TRUE | Paraguay | ||
| Bolivia | 2410.22 | Bolivia | Bolivia, Plurinational State of | Bolivia | FALSE | Bolivia | prefers longer | |
| Philipin | 2628.84 | Philipines | Philipines | Philipines | TRUE | Philipines | ||
| Egypt | 1401.52 | Egypt | Egypt | Egypt | TRUE | Egypt | ||
| Syria | 1590.05 | Syria | Syria | Syria | TRUE | Syria | ||
| Honduras | 1979.74 | Honduras | Honduras | Honduras | TRUE | Honduras | ||
| Indonesi | 2602.63 | Indonesia | Indonesia | Indonesia | TRUE | Indonesia | ||
| VietNam | 2264.3 | Viet Nam | Viet Nam | FALSE | Viet Nam | correct, but odd, vs. Vietnam | ||
| Morocco | 191.55 | Morocco | Morocco | Morocco | TRUE | Morocco | ||
| Guatemal | 2040.41 | Guatemala | Guatemala | Guatemala | TRUE | Guatemala | ||
| Irak | 1625.4 | Irak | Iran | Irak | FALSE | Irak | prefers substitution over exact | |
| India | 1888.5 | India | India | India | TRUE | India | ||
| Laos | 3012.42 | Laos | Lao People’s Democratic Republic | Laos | FALSE | Laos | prefers longer | |
| Pakistan | 1901.47 | Pakistan | Pakistan | Pakistan | TRUE | Pakistan | ||
| Madagasc | 1678.96 | Madagascar | Madagascar | Madagascar | TRUE | Madagascar | ||
| Papua | 3115.88 | Papua New Guinea | Papua New Guinea | FALSE | Papua New Guinea | correct | ||
| Yemen | 1190.13 | Yemen | Yemen | Yemen | TRUE | Yemen | ||
| Nepal | 2030.11 | Nepal | Nepal | Nepal | TRUE | Nepal | ||
| CookIsla | 2437.7 | TRUE | Cook Islands | |||||
| Macau | 2660.44 | Macau | Macao | Macau | FALSE | Macau | prefers substitution over exact | |
| Marianas | 2437.7 | Mariana Isl. | Mariana Isl. | FALSE | Mariana Isl. | correct | ||
| Marshall | 2437.7 | Marshall Islands | Marshall Islands | Marshall Islands | TRUE | Marshall Islands | ||
| NCaledon | 2437.7 | TRUE | New Caledonia | |||||
| Taiwan | 2673.71 | Taiwan | Taiwan, Province of China | Taiwan | FALSE | Taiwan | prefers longer | |
| PuertoRi | 1654.5 | TRUE | Puerto Rico | |||||
| Afghanis | 1962.74 | Afghanistan | Afghanistan | Afghanistan | TRUE | Afghanistan | ||
| Algeria | 185.65 | Algeria | Algeria | Algeria | TRUE | Algeria | ||
| Antigua/ | 491.84 | Antigua and Barbuda | Antigua and Barbuda | FALSE | Antigua and Barbuda | correct | ||
| Azerbaij | 2190.35 | Azerbaijan | Azerbaijan | Azerbaijan | TRUE | Azerbaijan | ||
| Bahamas | 594.17 | Bahamas | Bahamas | Bahamas | TRUE | Bahamas | ||
| Banglade | 1897.24 | Bangladesh | Bangladesh | Bangladesh | TRUE | Bangladesh | ||
| Bhutan | 2082.28 | Bhutan | Bhutan | Bhutan | TRUE | Bhutan | ||
| Brunei | 1904.48 | Brunei | Brunei Darussalam | Brunei | FALSE | Brunei | prefers longer | |
| Burma | 2138.54 | Burma | Burma | Burma | TRUE | Burma | ||
| Cambodia | 2254.37 | Cambodia | Cambodia | Cambodia | TRUE | Cambodia | ||
| Costa Ri | 1938.1 | Costa Rica | Costa Rica | Costa Rica | TRUE | Costa Rica | ||
| El Salva | 1016.14 | El Salvador | El Salvador | El Salvador | TRUE | El Salvador | ||
| Grenada | 537.25 | Grenada | Grenada | Grenada | TRUE | Grenada | ||
| Guyana | 1379.76 | Guyana | Guyana | Guyana | TRUE | Guyana | ||
| Haiti | 434.51 | Haiti | Haiti | Haiti | TRUE | Haiti | ||
| Kazakhst | 2122.18 | Kazakhstan | Kazakhstan | Kazakhstan | TRUE | Kazakhstan | ||
| Kiribati | 2281.44 | Kiribati | Kiribati | Kiribati | TRUE | Kiribati | ||
| Korea (N | 2399.11 | Korea North | Korea North | FALSE | Korea North | correct | ||
| Kyrgysta | 2143.13 | TRUE | Kyrgyzstan | |||||
| Libya | 185.65 | Libya | Libano | Libya | FALSE | Libya | prefers subset, deletion, insertion over exact | |
| Maldives | 1836.17 | Maldives | Maldives | Maldives | TRUE | Maldives | ||
| Micrones | 2437.7 | Micronesia, Federated States of | Micronesia, Federated States of | Micronesia, Federated States of | TRUE | Micronesia, Federated States of | ||
| Mongolia | 2542.15 | Mongolia | Mongolia | Mongolia | TRUE | Mongolia | ||
| Nicaragu | 1856.28 | Nicaragua | Nicaragua | Nicaragua | TRUE | Nicaragua | ||
| Oman | 1594.25 | Oman | Cayman Islands | Oman | FALSE | Oman | prefers subset + substitution over exact | |
| Panama | 1809.12 | Panama | Panama | Panama | TRUE | Panama | ||
| SKittsNe | 469.61 | TRUE | Saint Kitts and Nevis | |||||
| Solomon | 3050.76 | Solomon Islands | Solomon Islands | FALSE | Solomon Islands | |||
| Tajikist | 2000.53 | Tajikistan | Tajikistan | Tajikistan | TRUE | Tajikistan | ||
| TimorLes | 2602.63 | TRUE | Timor–Leste | |||||
| Turkmeni | 2212.49 | Turkmenistan | Turkmenistan | Turkmenistan | TRUE | Turkmenistan | ||
| UArabEm | 1286.35 | TRUE | United Arab Emirates | |||||
| Uzbekist | 2193.47 | Uzbekistan | Uzbekistan | Uzbekistan | TRUE | Uzbekistan | ||
| Vanuatu | 2385.83 | Vanuatu | Vanuatu | Vanuatu | TRUE | Vanuatu |
The R code:
gn = read.csv("genetic_distance.csv", encoding = "UTF-8", stringsAsFactors = F)
gn$abbrev = as_abbrev(gn$Country)
trans = read.csv("countrycodes.csv", sep=";", encoding = "UTF-8", stringsAsFactors = F)
trans$shorter = str_sub(trans$Names, 1, 8)
intersect(trans$shorter, gn$Country)
matches = data.frame(source_names = gn$Country,
to_short = pmatch (gn$Country, trans$shorter)
)
agrep(gn$Country[4], trans$shorter)
best_matches = matrix(nrow = nrow(gn))
for (idx in seq_along(gn$Country)) {
match_idx = agrep(gn$Country[idx], trans$shorter, max.distance = .1, useBytes = T)
#skip on no match
if (length(match_idx) == 0) next
#insert match
best_matches[idx] = match_idx
}
matches$agrep = best_matches
matches$to_short_result = trans[matches$to_short, "Names"]
matches$agrep_result = trans[matches$agrep, "Names"]
for (idx in 1:nrow(matches)) {
if (!is.na(matches[idx, "to_short_result"])) {
matches[idx, "best_match"] = matches[idx, "to_short_result"]
}
if (is.na(matches[idx, "to_short_result"])) {
matches[idx, "best_match"] = matches[idx, "agrep_result"]
}
}
#output
write.table(matches, "clipboard", na = "", sep = "\t")