Reading up on the huge animal breeding literature gives a useful background to one’s thinking about what selection on humans will do in the future (embryo selection and direct editing à la CRISPR).


I made the above infographic some time ago, maybe 1-2 years. It is still pretty accurate. The newest data for genome sequencing does not look much different.

Steve Hsu has been following some of the animal breeding literature, e.g. Frontiers in cattle genomics.

I dug around a bit and found some reviews. They mentioned various interesting experiments. Of course, the most interesting experiment is still the Russian domesticated fox experiment (I want one of these!). Recently, there was an interesting one about breeding for brain size in guppies.


There are also the famous rat maze ability experiments. Solving mazes is g-loaded in humans (Jensen, 1980, book). A good review is Tolman and Tryon: Early research on the inheritance of the ability to learn.


The newest and most interesting part in relation to humans is selection using genomic predictors alone. There is a recent, easy-to-read review: Understanding genomic selection in poultry breeding.

selection for eggs

Because the animal breeding field has been going for so long, one finds 100s if not 1000s of these types of graphs, yet they are still exciting. One might wonder: is there nothing one cannot select for? It seems no matter the trait, evolution finds a way. Dawkins seems to agree:

Political opposition to eugenic breeding of humans sometimes spills over into the almost certainly false assertion that it is impossible. Not only is it immoral, you may hear it said, it wouldn’t work. Unfortunately, to say that something is morally wrong, or politically undesirable, is not to say that it wouldn’t work. I have no doubt that, if you set your mind to it and had enough time and enough political power, you could breed a race of superior body-builders, or high-jumpers, or shot-putters; pearl fishers, sumo wrestlers, or sprinters; or (I suspect, although now with less confidence because there are no animal precedents) superior musicians, poets, mathematicians or wine-tasters. The reason I am confident about selective breeding for athletic prowess is that the qualities needed are so similar to those that demonstrably work in the breeding of racehorses and carthorses, of greyhounds and sledge dogs. The reason I am still pretty confident about the practical feasibility (though not the moral or political desirability) of selective breeding for mental or otherwise uniquely human traits is that there are so few examples where an attempt at selective breeding in animals has ever failed, even for traits that might have been thought surprising. Who would have thought, for example, that dogs could be bred for sheep-herding skills, or ‘pointing’, or bull-baiting?

[from The Greatest Show on Earth]

Selection for High and Low Fatness in Swine


Also interesting is that selective breeding makes it possible to estimate heritability directly from the observed response to selection (realized heritability), not just from family relationships.
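As a minimal sketch of the arithmetic: the breeder's equation, R = h² × S, can be rearranged so that realized heritability is the response to selection divided by the selection differential. The numbers below are made up for illustration and are not taken from any of the experiments mentioned here.

```r
# Realized heritability from a selection experiment (breeder's equation: R = h2 * S)
# All numbers are hypothetical, chosen only to illustrate the calculation.
pop_mean = 100        # trait mean in the parental generation
selected_mean = 110   # trait mean of the individuals chosen as parents
offspring_mean = 104  # trait mean of their offspring

S = selected_mean - pop_mean    # selection differential
R = offspring_mean - pop_mean   # response to selection
h2_realized = R / S             # realized heritability
print(h2_realized)
```

In a real experiment one would typically accumulate S and R over several generations and use the slope of cumulative response on cumulative selection differential.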


I think we will see some interesting humans in the future. The reason is this: embryo selection is very close and genetic engineering is fairly close. If some countries ban them, others will allow them. Or one can sail or fly to a seastead. Or use any number of black market solutions that will inevitably spring up. Probably, not all jurisdictions will ban it, so there will be reproductive havens+tourism just like there are tax havens and even suicide havens. I don’t think Western governments will dare to force abortions on pregnant returnees, so there is nothing much they can do at that point. There is also of course the near-impossibility of proving that a fetus is a result of embryo selection, not normal fertilization. After all, embryo selection is just choosing between actual possibilities (hopefully, philosophy readers will allow me the flagrant abuse of modal terminology). If everybody starts having healthier children by using this technology, there will be no way to prove that a particular couple ‘cheated’. It is only in the aggregate one can prove that something is going on. A particular couple may just have been lucky. As for direct editing, it may be possible to spot genetically, but I doubt this will happen.

In the EU, I suspect the legality of this practice will come down to legal interpretation. The EU has a CHARTER OF FUNDAMENTAL RIGHTS OF THE EUROPEAN UNION, in which one can read:

Article 3
Right to the integrity of the person
1.   Everyone has the right to respect for his or her physical and mental integrity.
2.   In the fields of medicine and biology, the following must be respected in particular:
(a) the free and informed consent of the person concerned, according to the procedures laid down by law;
(b) the prohibition of eugenic practices, in particular those aiming at the selection of persons;
(c) the prohibition on making the human body and its parts as such a source of financial gain;
(d) the prohibition of the reproductive cloning of human beings.

But given that selection of persons is widely done for e.g. Down’s syndrome, (b) is clearly ignored in practice. (c) is also ignored e.g. for sperm and egg selling, altho they call it donation (with a nice monetary benefit in return). So, the best hope is that embryo selection for medical reasons will sneak into practice and become so standard that it would seem outlandish to ban it. This is well underway. When the public comes to accept it, the judges will probably make up some legal reason to interpret (b) narrowly, e.g. as referring only to forced sterilization. One may be able to find support for this in the background work for this charter, altho I haven’t looked into it.

Given that the technology will likely come into wide-scale practice within the next couple of decades, what remains to be researched more, a lot more, is how people will actually make choices. When prospective parent(s) have to decide which embryos to implant, there will be trade-offs. With a limited choice of embryos, one cannot simultaneously maximize all desirable traits and minimize all undesirable traits. There will probably be clear trends in this: few will select against intelligence, few will select short boys, few will select nasty diseases, most will select for health and happiness. People like Helen Henderson are not common:

I can say, without hesitation, that my life has been richer because I have MS. How can anyone who has no experience with disabilities understand that?

[From Future Human Evolution.]

If they still try to have children with horrible genetic diseases, the government will probably (should?) step in and ban it.

Still, there will be lots of variation. This variation in selective pressure between people should, together with strong assortative mating, result in divergence of human lines. This will be somewhat akin to dog, cat and horse breeds. Assortative mating is apparently so strong that people even choose pets that are similar to themselves: Self seeks like: many humans choose their dog pets following rules used for assortative mating.


We truly live in interesting times. :)

If you want to read more like this, there was also recently the double paper: Eugenics, Ready or Not I, II. (I could not find a link to part 2.)


In trying to merge some data, I was confronted with a problem of matching up strings where the author had mutilated them. He had done so in two ways: cutting them off at the 8th character or using personal abbreviations. The first one is relatively easy to deal with. The second one is not.

So I looked around a bit and found that there are others who had similar problems:

The large table below shows the matching results. The leftmost column has the strings I need to match. The list to be matched against is the list of country and regional names and their ISO 3 abbreviations here:

It looks like for most cases the shortened version worked fine. 165 of 193 matches were found, all of which were correct.

agrep (with max distance = .1, the default) found a match in 175 cases, so only a little improvement there. But it gets worse: in many cases it disagrees with the stricter matching method and gets it wrong. There is no case where both methods returned a match and agrep was the one that got it right. Strangely, in all 10 cases where the simple method failed and agrep found a match, agrep got it right (one of them, CongoDR, is wrong but excusable). But it got it wrong in the easier cases. In some cases, it is truly bizarre: given the string “United S”, it goes for “United Republic of Tanzania” instead of the much easier “United States”. Strangely, a common error is preferring a subset/longer version over an exact match. No human would make this error. E.g. given “Moldova”, it prefers “Moldova, Republic of” over just “Moldova”.

There are a number of different errors it makes. In the comments below I have noted the type of error (my judgment).

For the moment, I would caution against using this algorithm.
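Part of the odd behavior may come from how agrep works: it returns the indices of all elements that contain an approximate match (the pattern is matched against substrings of each element), listed in table order, not ranked by closeness. So taking the first hit can pick “Iceland” for “Ireland”. The following is a minimal sketch of a stricter alternative, not the code used for the table: the helper match_name and the candidate list are hypothetical, and it tries an exact match, then a unique prefix match, and only then falls back to the smallest edit distance from adist.

```r
# Hypothetical helper (not the code used for the table):
# exact match first, then unique prefix match, then smallest edit distance.
match_name = function(x, candidates) {
  # 1. exact match
  hit = match(x, candidates)
  if (!is.na(hit)) return(candidates[hit])
  # 2. unique prefix match (handles names cut off at the 8th character)
  hit = pmatch(x, candidates)
  if (!is.na(hit)) return(candidates[hit])
  # 3. fall back to the candidate with the smallest Levenshtein distance
  d = adist(x, candidates)
  candidates[which.min(d)]
}

# small made-up candidate list for illustration
candidates = c("Iceland", "Ireland", "Sweden", "United Republic of Tanzania",
               "United States", "Mali", "Australia")

res = sapply(c("Ireland", "United S", "Sweeden", "Mali"),
             match_name, candidates = candidates)
print(res)
```

Personal abbreviations such as “CentAfrR” would still need manual handling, since neither prefix matching nor plain edit distance is designed for them.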

Country Genetic_distance_SA to_short_result agrep_result best_match agreement filled_in comments
Norway 1455.52 Norway Norway Norway TRUE Norway
Netherla 1453.28 Netherlands Netherlands Netherlands TRUE Netherlands
Ireland 1940.31 Ireland Iceland Ireland FALSE Ireland prefers substitution over exact
Liechste 1511.83 Liechtenstein Liechtenstein FALSE Liechtenstein correct
Germany 1484.92 Germany Germany Germany TRUE Germany
Sweeden 1453.79 Sweden Sweden FALSE Sweden correct
Switzerl 1557.96 Switzerland Switzerland Switzerland TRUE Switzerland
Iceland 1932.36 Iceland Iceland Iceland TRUE Iceland
Denmark 1472.52 Denmark Denmark Denmark TRUE Denmark
Belgium 1940.31 Belgium Belgium Belgium TRUE Belgium
Austria 1465.7 Austria Australia Austria FALSE Austria prefers part deleted
France 1896.22 France France France TRUE France
Slovenia 1292.32 Slovenia Slovenia Slovenia TRUE Slovenia
Finland 2420.3 Finland Finland Finland TRUE Finland
Spain 1929.44 Spain Saint Barthélemy Spain FALSE Spain no idea
Italy 1961.9 Italy Italy Italy TRUE Italy
Luxembur 1929.98 TRUE Luxembourg
Czech Re 1524.73 Czech Republic Czech Republic Czech Republic TRUE Czech Republic
U. K. 1916.91 TRUE UK
Greece 1283.05 Greece Greece Greece TRUE Greece
Cyprus 1288.53 Cyprus Cyprus Cyprus TRUE Cyprus
Estonia 2302.86 Estonia Estonia Estonia TRUE Estonia
Slovakia 1573.38 Slovakia Slovakia Slovakia TRUE Slovakia
Malta 1912.52 Malta Gibraltar Malta FALSE Malta prefers subset + substitution over exact
Poland 1905.67 Poland Poland Poland TRUE Poland
Lithuani 2389.28 Lithuania Lithuania Lithuania TRUE Lithuania
Portugal 1949.34 Portugal Portugal Portugal TRUE Portugal
Latvia 2256.69 Latvia Latvia Latvia TRUE Latvia
Croatia 1289.64 Croatia Croatia Croatia TRUE Croatia
Romania 1928.4 Romania Romania Romania TRUE Romania
Bulgaria 1399.01 Bulgaria Bulgaria Bulgaria TRUE Bulgaria
Serbia 1421.01 Serbia Serbia Serbia TRUE Serbia
Russia 1975.49 Russia Russia Russia TRUE Russia
Albania 1301.47 Albania Albania Albania TRUE Albania
Macedoni 1334.51 Macedonia Macedonia Macedonia TRUE Macedonia
Armenia 1558.32 Armenia Armenia Armenia TRUE Armenia
Moldova 1527.95 Moldova Moldova, Republic of Moldova FALSE Moldova prefers longer
Botswana 347.18 Botswana Botswana Botswana TRUE Botswana
South Af 0 South Africa South Africa South Africa TRUE South Africa
Ghana 395.9 Ghana Ghana Ghana TRUE Ghana
Eq Guine 373.2 TRUE Equatorial Guinea
Congo 452.9 Congo Congo Congo TRUE Congo
Kenya 366.78 Kenya Kenya Kenya TRUE Kenya
Cameroon 319.62 Cameroon Cameroon Cameroon TRUE Cameroon
Tanzania 352.54 Tanzania Tanzania Tanzania TRUE Tanzania
Nigeria 342.24 Nigeria Nigeria Nigeria TRUE Nigeria
Uganda 358.75 Uganda Uganda Uganda TRUE Uganda
Zambia 352.54 Zambia Gambia Zambia FALSE Zambia prefers substitution over exact
Sudan 316.95 Sudan Sudan Sudan TRUE Sudan
Zimbabwe 352.54 Zimbabwe Zimbabwe Zimbabwe TRUE Zimbabwe
Ethiopia 705.3 Ethiopia Ethiopia Ethiopia TRUE Ethiopia
Guinea 395.9 Guinea Guinea Guinea TRUE Guinea
CentAfrR 469.7 TRUE Central African Republic
SierraLe 395.9 TRUE Sierra Leone
Mozambiq 355.75 Mozambique Mozambique Mozambique TRUE Mozambique
CongoDR 410.17 Congo Republic of Congo Republic of FALSE Congo Republic of wrong but excusable; Congo Democratic Republic
Andorra 1912.83 Andorra Andorra Andorra TRUE Andorra
Angola 353.49 Angola Angola Angola TRUE Angola
Belarus 1949.85 Belarus Belarus Belarus TRUE Belarus
Benin 394.54 Benin Benin Benin TRUE Benin
Bosnia 1337.37 Bosnia Bosnia and Herzegovina Bosnia FALSE Bosnia prefers longer
BurkinaF 378.36 Burkina Faso Burkina Faso FALSE Burkina Faso correct
Burundi 362.98 Burundi Burundi Burundi TRUE Burundi
Cape Ver 963.09 Cape Verde Cape Verde Cape Verde TRUE Cape Verde
Chad 537.81 Chad Chad Chad TRUE Chad
Comoros 352.54 Comoros Comoros Comoros TRUE Comoros
IvoryCoa 468.01 TRUE Ivory Coast
Djibouti 750.88 Djibouti Djibouti Djibouti TRUE Djibouti
Eritrea 665.96 Eritrea Eritrea Eritrea TRUE Eritrea
Gabon 360.07 Gabon Gabon Gabon TRUE Gabon
Gambia 395.9 Gambia Gambia Gambia TRUE Gambia
Georgia 1613.7 Georgia Georgia Georgia TRUE Georgia
Guinea-B 395.9 Guinea-Bissau Guinea-Bissau Guinea-Bissau TRUE Guinea-Bissau
Lesotho 352.54 Lesotho Lesotho Lesotho TRUE Lesotho
Liberia 395.9 Liberia Liberia Liberia TRUE Liberia
Malawi 352.54 Malawi Malawi Malawi TRUE Malawi
Mali 430.58 Mali Australia Mali FALSE Mali prefers subset + substitution over exact
Mauritan 681.23 Mauritania Mauritania Mauritania TRUE Mauritania
Namibia 419.32 Namibia Namibia Namibia TRUE Namibia
Niger 315.05 Niger Niger Niger TRUE Niger
Rwanda 364.26 Rwanda Rwanda Rwanda TRUE Rwanda
SaoTomeP 339.89 TRUE Sao Tome and Principe
Senegal 395.9 Senegal Senegal Senegal TRUE Senegal
Seychell 1709.06 Seychelles Seychelles Seychelles TRUE Seychelles
Somalia 500.74 Somalia Somalia Somalia TRUE Somalia
Swazilan 400.17 Swaziland Swaziland Swaziland TRUE Swaziland
Togo 395.9 Togo Togo Togo TRUE Togo
Ukraine 1947.94 Ukraine Ukraine Ukraine TRUE Ukraine
Australi 1971.39 Australia Australia Australia TRUE Australia
United S 1792.33 United States United Republic of Tanzania United States FALSE United States bizarre
New Zeal 2061.12 New Zealand New Zealand New Zealand TRUE New Zealand
Canada 1958.73 Canada Canada Canada TRUE Canada
Japan 2176.04 Japan Japan Japan TRUE Japan
Hong Kon 2674.63 Hong Kong Hong Kong Hong Kong TRUE Hong Kong
Korea 2399.11 Korea Korea Democratic People’s Republic of Korea FALSE Korea prefers subset + substitution over exact
Israel 1539.63 Israel Israel Israel TRUE Israel
Singapor 2459.04 Singapore Singapore Singapore TRUE Singapore
Qatar 1733.83 Qatar Qatar Qatar TRUE Qatar
Hungary 2432.96 Hungary Hungary Hungary TRUE Hungary
Bahrain 971.99 Bahrain Bahrain Bahrain TRUE Bahrain
Chile 2279.52 Chile Chile Chile TRUE Chile
Argentin 1994.59 Argentina Argentina Argentina TRUE Argentina
Barbados 468.02 Barbados Barbados Barbados TRUE Barbados
Uruguay 1918.61 Uruguay Uruguay Uruguay TRUE Uruguay
Cuba 1370.91 Cuba Aruba Cuba FALSE Cuba prefers subset + substitution over exact
Saudi Ar 1468.3 Saudi Arabia Saudi Arabia Saudi Arabia TRUE Saudi Arabia
Mexico 2024.64 Mexico Mexico Mexico TRUE Mexico
Malaysia 1922.77 Malaysia Malaysia Malaysia TRUE Malaysia
Trinidad 1024.1 Trinidad and Tobago Trinidad and Tobago Trinidad and Tobago TRUE Trinidad and Tobago
Kuwait 1081.15 Kuwait Kuwait Kuwait TRUE Kuwait
Lebanon 1543.46 Lebanon Lebanon Lebanon TRUE Lebanon
Venezuel 1280.81 Venezuela, Bolivarian Republic of Venezuela, Bolivarian Republic of Venezuela, Bolivarian Republic of TRUE Venezuela, Bolivarian Republic of
Mauritiu 1792.48 Mauritius Mauritius Mauritius TRUE Mauritius
Jamaica 595.5 Jamaica Jamaica Jamaica TRUE Jamaica
Peru 2096.08 Peru Hviderusland Peru FALSE Peru prefers subset + substitution over exact
Dominica 521.08 Dominica Dominica Dominica TRUE Dominica
SaintLuc 497.7 TRUE Saint Lucia
Ecuador 2228.58 Ecuador Ecuador Ecuador TRUE Ecuador
Brazil 1875.81 Brazil Brazil Brazil TRUE Brazil
SaintVin 395.9 TRUE Saint Vincent
Colombia 1973.6 Colombia Colombia Colombia TRUE Colombia
Iran 1945.07 Iran France Iran FALSE Iran prefers subset + substitution over exact
Tonga 2390.38 Tonga Tonga Tonga TRUE Tonga
Turkey 2167.95 Turkey Turkey Turkey TRUE Turkey
Belize 1481.26 Belize Belize Belize TRUE Belize
Tunisia 203.38 Tunisia Tunisia Tunisia TRUE Tunisia
Jordan 1539.63 Jordan Jordan Jordan TRUE Jordan
SriLanka 1783.84 TRUE Sri Lanka
DomRep 1206.72 TRUE Dominican Republic
W. Samoa 2388.58 W. Samoa W. Samoa W. Samoa TRUE W. Samoa
Fiji 2534.15 Fiji Fiji Fiji TRUE Fiji
China 2646.26 China China China TRUE China
Thailand 2068.81 Thailand Thailand Thailand TRUE Thailand
Surinam 1562.55 Suriname Suriname Suriname TRUE Suriname
Paraguay 2243.61 Paraguay Paraguay Paraguay TRUE Paraguay
Bolivia 2410.22 Bolivia Bolivia, Plurinational State of Bolivia FALSE Bolivia prefers longer
Philipin 2628.84 Philipines Philipines Philipines TRUE Philipines
Egypt 1401.52 Egypt Egypt Egypt TRUE Egypt
Syria 1590.05 Syria Syria Syria TRUE Syria
Honduras 1979.74 Honduras Honduras Honduras TRUE Honduras
Indonesi 2602.63 Indonesia Indonesia Indonesia TRUE Indonesia
VietNam 2264.3 Viet Nam Viet Nam FALSE Viet Nam correct, but odd, vs. Vietnam
Morocco 191.55 Morocco Morocco Morocco TRUE Morocco
Guatemal 2040.41 Guatemala Guatemala Guatemala TRUE Guatemala
Irak 1625.4 Irak Iran Irak FALSE Irak prefers substitution over exact
India 1888.5 India India India TRUE India
Laos 3012.42 Laos Lao People’s Democratic Republic Laos FALSE Laos prefers longer
Pakistan 1901.47 Pakistan Pakistan Pakistan TRUE Pakistan
Madagasc 1678.96 Madagascar Madagascar Madagascar TRUE Madagascar
Papua 3115.88 Papua New Guinea Papua New Guinea FALSE Papua New Guinea correct
Yemen 1190.13 Yemen Yemen Yemen TRUE Yemen
Nepal 2030.11 Nepal Nepal Nepal TRUE Nepal
CookIsla 2437.7 TRUE Cook Islands
Macau 2660.44 Macau Macao Macau FALSE Macau prefers substitution over exact
Marianas 2437.7 Mariana Isl. Mariana Isl. FALSE Mariana Isl. correct
Marshall 2437.7 Marshall Islands Marshall Islands Marshall Islands TRUE Marshall Islands
NCaledon 2437.7 TRUE New Caledonia
Taiwan 2673.71 Taiwan Taiwan, Province of China Taiwan FALSE Taiwan prefers longer
PuertoRi 1654.5 TRUE Puerto Rico
Afghanis 1962.74 Afghanistan Afghanistan Afghanistan TRUE Afghanistan
Algeria 185.65 Algeria Algeria Algeria TRUE Algeria
Antigua/ 491.84 Antigua and Barbuda Antigua and Barbuda FALSE Antigua and Barbuda correct
Azerbaij 2190.35 Azerbaijan Azerbaijan Azerbaijan TRUE Azerbaijan
Bahamas 594.17 Bahamas Bahamas Bahamas TRUE Bahamas
Banglade 1897.24 Bangladesh Bangladesh Bangladesh TRUE Bangladesh
Bhutan 2082.28 Bhutan Bhutan Bhutan TRUE Bhutan
Brunei 1904.48 Brunei Brunei Darussalam Brunei FALSE Brunei prefers longer
Burma 2138.54 Burma Burma Burma TRUE Burma
Cambodia 2254.37 Cambodia Cambodia Cambodia TRUE Cambodia
Costa Ri 1938.1 Costa Rica Costa Rica Costa Rica TRUE Costa Rica
El Salva 1016.14 El Salvador El Salvador El Salvador TRUE El Salvador
Grenada 537.25 Grenada Grenada Grenada TRUE Grenada
Guyana 1379.76 Guyana Guyana Guyana TRUE Guyana
Haiti 434.51 Haiti Haiti Haiti TRUE Haiti
Kazakhst 2122.18 Kazakhstan Kazakhstan Kazakhstan TRUE Kazakhstan
Kiribati 2281.44 Kiribati Kiribati Kiribati TRUE Kiribati
Korea (N 2399.11 Korea North Korea North FALSE Korea North correct
Kyrgysta 2143.13 TRUE Kyrgyzstan
Libya 185.65 Libya Libano Libya FALSE Libya prefers subset, deletion, insertion over exact
Maldives 1836.17 Maldives Maldives Maldives TRUE Maldives
Micrones 2437.7 Micronesia, Federated States of Micronesia, Federated States of Micronesia, Federated States of TRUE Micronesia, Federated States of
Mongolia 2542.15 Mongolia Mongolia Mongolia TRUE Mongolia
Nicaragu 1856.28 Nicaragua Nicaragua Nicaragua TRUE Nicaragua
Oman 1594.25 Oman Cayman Islands Oman FALSE Oman prefers subset + substitution over exact
Panama 1809.12 Panama Panama Panama TRUE Panama
SKittsNe 469.61 TRUE Saint Kitts and Nevis
Solomon 3050.76 Solomon Islands Solomon Islands FALSE Solomon Islands
Tajikist 2000.53 Tajikistan Tajikistan Tajikistan TRUE Tajikistan
TimorLes 2602.63 TRUE Timor–Leste
Turkmeni 2212.49 Turkmenistan Turkmenistan Turkmenistan TRUE Turkmenistan
UArabEm 1286.35 TRUE United Arab Emirates
Uzbekist 2193.47 Uzbekistan Uzbekistan Uzbekistan TRUE Uzbekistan
Vanuatu 2385.83 Vanuatu Vanuatu Vanuatu TRUE Vanuatu


The R code:

library(stringr) #for str_sub()

#as_abbrev() is not base R; it is assumed to come from the author's own package
gn = read.csv("genetic_distance.csv", encoding = "UTF-8", stringsAsFactors = F)
gn$abbrev = as_abbrev(gn$Country)

trans = read.csv("countrycodes.csv", sep=";", encoding = "UTF-8", stringsAsFactors = F)
trans$shorter = str_sub(trans$Names, 1, 8) #cut names off at the 8th character

intersect(trans$shorter, gn$Country) #names that match exactly

#partial matching against the shortened names
matches = data.frame(source_names = gn$Country,
                     to_short = pmatch(gn$Country, trans$shorter))

agrep(gn$Country[4], trans$shorter) #try a single case

#fuzzy matching with agrep
best_matches = rep(NA_integer_, nrow(gn))
for (idx in seq_along(gn$Country)) {
  match_idx = agrep(gn$Country[idx], trans$shorter, max.distance = .1, useBytes = T)
  #skip on no match
  if (length(match_idx) == 0) next
  #insert match; agrep returns all matches in table order, so keep the first
  best_matches[idx] = match_idx[1]
}

matches$agrep = best_matches

#look up the full names for both methods
matches$to_short_result = trans[matches$to_short, "Names"]
matches$agrep_result = trans[matches$agrep, "Names"]

#prefer the stricter match, fall back to agrep when there is none
for (idx in 1:nrow(matches)) {
  if (!is.na(matches[idx, "to_short_result"])) {
    matches[idx, "best_match"] = matches[idx, "to_short_result"]
  }
  if (is.na(matches[idx, "to_short_result"])) {
    matches[idx, "best_match"] = matches[idx, "agrep_result"]
  }
}

write.table(matches, "clipboard", na = "", sep = "\t")

John Fuerst suggested that I write a meta-analysis, review and methodology paper on the S factor. That seems like a decent idea once I get some more studies done (data are known to exist on France (another level), Japan (analysis done, writing pending), Denmark, Sweden and Turkey (reanalysis of Lynn’s data done, but there is much more data)).

However, before doing that it seems okay to post my check list here in case someone else is planning on doing a study.

A methodology paper is perhaps not too bad an idea. Here’s a quick check list of what I usually do:
  1. Find some country for which there exist administrative divisions that number preferably at least 10 and as many as possible.
  2. Find cognitive data for these divisions. Usually this is only available for fairly large divisions, like states but may sometimes be available for smaller divisions. One can sometimes find real IQ test data, but usually one will have to rely on scholastic ability tests such as PISA. Often one will have to use a regional or national variant of this.
  3. Find socioeconomic outcome data for these divisions. This can usually be found at some kind of official statistics bureau’s website. These websites often have English language editions for non-English-speaking countries. Sometimes they don’t and one has to rely on clever guessing and Google Translate. If the country has a diverse ethnoracial demographic, obtain data for this as well. If possible, try to obtain data for multiple levels of administrative divisions and time periods so one can see changes over levels or time. Sometimes data will be available for a variety of years, so one can do a longitudinal study. Other times one will have to average all the years for each variable.
  4. If there are lots of variables to choose from, then choose a diverse mix of variables. Avoid variables that are overly dependent on local natural environment, such as the presence of a large body of water.
  5. Use the redundancy algorithm to remove the most redundant variables. I usually use a threshold of |.90|, such that if a pair of variables in the dataset correlate >= that level, then remove one of them. One can also average them if they are e.g. gendered versions, such as life expectancy or mean income by gender.
  6. Use the mixedness algorithms to detect if any cases are structural outliers, i.e. that they don’t fit the factor structure of the remaining cases. Create parallel datasets without the problematic cases.
  7. Factor analyze the dataset with outliers with ordinary factor analysis (FA), rank order and robust FA. Use ordinary FA on the dataset without the structural outliers. Plot all the FA loading sets using the loadings plotter function. Make note of variables that change their loadings between analyses, and variables that load in unexpected ways.
  8. Extract the S factors and examine their relationship to the ethnoracial variables and cognitive scores.
  9. If the country has seen substantial immigration over the recent decades, it may be a good idea to regress out the effect of this demographic and examine the loadings.
  10. Write up the results. Use lots of loading plots and scatter plots with names.
  11. After you have written a draft, contact natives to get their opinion. Maybe you missed something important about the country. People who speak the local language are also useful when gathering data, but generally, you will have to do things yourself.
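Step 5 can be sketched roughly as below. This is only an illustration of the idea, assuming a numeric data frame with one column per variable. The function name remove_redundant and the simulated data are hypothetical, and it is not the redundancy function I actually use.

```r
# Hypothetical sketch of step 5: drop one variable from each pair with |r| >= threshold.
remove_redundant = function(df, threshold = .90) {
  repeat {
    r = abs(cor(df, use = "pairwise.complete.obs"))
    diag(r) = 0                      # ignore self-correlations
    if (max(r) < threshold) break    # nothing left at or above the threshold
    # locate the most strongly correlated pair and drop its later column
    pair = which(r == max(r), arr.ind = TRUE)[1, ]
    df = df[, -max(pair), drop = FALSE]
  }
  df
}

# simulated example: b is a near-duplicate of a, c is unrelated
set.seed(1)
x = rnorm(50)
d = data.frame(a = x,
               b = x + rnorm(50, sd = .05),
               c = rnorm(50))
kept = names(remove_redundant(d))
print(kept)
```

Which member of a pair to drop is a judgment call. Here the later column is dropped for simplicity, but averaging the two (e.g. for gendered versions of the same variable) is the alternative mentioned in the list.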


If I missed something, let me know.

A recent paper informs us that we have now found a small number of SNPs that explain skin color in European samples.

In the International Visible Trait Genetics (VisiGen) Consortium, we investigated the genetics of human skin color by combining a series of genome-wide association studies (GWAS) in a total of 17,262 Europeans with functional follow-up of discovered loci. Our GWAS provide the first genome-wide significant evidence for chromosome 20q11.22 harboring the ASIP gene being explicitly associated with skin color in Europeans. In addition, genomic loci at 5p13.2 (SLC45A2), 6p25.3 (IRF4), 15q13.1 (HERC2/OCA2), and 16q24.3 (MC1R) were confirmed to be involved in skin coloration in Europeans. In follow-up gene expression and regulation studies of 22 genes in 20q11.22, we highlighted two novel genes EIF2S2 and GSS, serving as competing functional candidates in this region and providing future research lines. A genetically inferred skin color score obtained from the 9 top-associated SNPs from 9 genes in 940 worldwide samples (HGDP-CEPH) showed a clear gradual pattern in Western Eurasians similar to the distribution of physical skin color, suggesting the used 9 SNPs as suitable markers for DNA prediction of skin color in Europeans and neighboring populations, relevant in future forensic and anthropological investigations.


All 9 SNPs listed in Table 1 were used to construct a genetically inferred skin color score in 940 samples from 54 worldwide populations (HGDP-CEPH samples), which showed a spatial distribution with a clear gradual increase in skin darkness from Northern Europe to Southern Europe to Northern Africa, the Middle East and Western Asia (Figure S2); in agreement with the known distribution of skin color across these geographic regions. Outside of these geographic regions, the inferred skin color score appeared rather similar (i.e., failing to discriminate), despite the known phenotypic skin color difference between generally lighter Asians/Native Americans and darker Africans. This demonstrates that although these 9 SNPs can explain skin color variation among Europeans, they cannot explain existing skin color differences between Asians/Native Americans and Africans. Therefore, these differences in skin color variation may partly be due to different DNA variants not identifiable by this European study with restricted genetic origin.

The same general problem may apply to the Piffer results. Perhaps the SNPs found only affect cognitive ability within European samples (or Eurasian, because there is one Chinese replication). This sounds like a case of epistasis, where the other necessary gene(s) for the identified SNPs to have an effect on cognitive ability have substantial frequencies in European populations, but don’t exist or are very rare in non-European populations.

As far as I know, this is a possible but unlikely scenario. It will perhaps serve as one of the remaining areas that non-hereditarians can point to and say that there is still reasonable doubt. The solution is to perform GWAS on African subjects. Luckily, a large number of such subjects live in or near (relatively) affluent countries in the Americas.

Many airports have free wifi services. The problem with these is that they are time-limited, usually to 1 to 3 hours. This can be very annoying if one is stuck in an airport for an extended period, as I am right now.

Non-technical solution

If you have spent your time on one device, you can simply switch to a new one. If you have brought a smartphone, tablet and a laptop, you can use the time on each of these.

This solution may be sufficient in some situations.

Technical solution

The wifi services rely on your computer’s MAC address to identify you. They keep track of these, so when you have used all the time on a given MAC, it will be temporarily blocked from using the internet again.

The solution is simple: we switch to a new MAC address every time the current one has expired. How do we do this? The built-in network controller can change the MAC address, but this did not work for me. Instead I downloaded macchanger using:

sudo apt-get install macchanger

This is a small program that lets you easily change MAC addresses. I found a ton of guides, but they did not fully work.

Here’s my current routine.

  1. Disable the wifi by clicking turn off in the dock menu.
  2. Delete the previous connection to the network.
  3. Open a terminal as root.
  4. Type:
    macchanger -s wlan0

    to show the current MAC address.

  5. Type:
    macchanger -a wlan0

    to get a new similar MAC address.

  6. Re-do step (4) to see that it worked.
  7. Turn on wifi.
  8. Connect to the network.
  9. Enjoy internet for as long as it lasts, then start over from step (1).

I’m not sure if everything here is strictly necessary, but this works for me.

Due to lengthy discussion over at Unz concerning the good performance of some African groups in the UK, it seems worth it to review the Danish and Norwegian results. Basically, some African groups perform better on some measures than native British. The author is basically arguing that this disproves global hereditarianism. I think not.

The over-performance relative to home country IQ of immigrants from some African countries is not restricted to the UK. In my studies of immigrants in Denmark and Norway, I found the same thing. It is very clear that there are strong selection effects for some countries, but not others, and that this is a large part of the reason why the correlations between home country IQ and performance in the host country are not higher. If the selection effect were constant across countries, it would not affect the correlations. But because it differs between countries, it essentially creates noise in the correlations.
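The point about constant versus varying selection can be illustrated with a small simulation. This is a sketch with made-up numbers, not the Danish or Norwegian data: the variable names, the effect sizes and the number of simulated countries are all assumptions chosen for illustration.

```r
# Simulated illustration (made-up numbers, not the Danish/Norwegian data):
# a constant selection effect leaves the correlation unchanged,
# a selection effect that varies between countries lowers it.
set.seed(1)
n = 200                                                  # simulated countries
home_iq = rnorm(n, mean = 85, sd = 10)                   # home country IQ
performance = (home_iq - 85) / 10 + rnorm(n, sd = .5)    # performance without selection

constant_selection = performance + 1                     # same boost for every country
varying_selection = performance + rnorm(n, sd = 2)       # boost differs by country

r_none = cor(home_iq, performance)
r_constant = cor(home_iq, constant_selection)
r_varying = cor(home_iq, varying_selection)
print(round(c(none = r_none, constant = r_constant, varying = r_varying), 2))
```

Adding the same constant to every country only shifts the scores, so the correlation is identical. Country-specific selection adds variance unrelated to home country IQ, which pulls the correlation down.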

Two plots:


The codes are ISO-3 codes. So e.g. NGA is Nigeria, GHA is Ghana, KEN is Kenya and so on. These groups perform fairly well compared to their home country IQ, both in Norway and Denmark. But Somalia does not, and the performance of several MENAP immigrant groups is abysmal.

The scores on the Y axis are S factor scores for their performance in these countries. They are general factors extracted from measures of income, educational attainment, use of social benefits, crime and the like. The S scores correlate .77 between the countries. For details, see the papers concerning the data:

I did not use the scores from the papers, I redid the analysis. The code is posted below for those curious. The kirkegaard package is my personal package. It is on github. The megadataset file is on OSF.


library(pacman) #provides p_load()
p_load(kirkegaard, ggplot2, psych, VIM) #psych for fa(), VIM for irmi()

M = read_mega("Megadataset_v2.0e.csv")

DK = M[111:135] #fetch danish data
DK = DK[miss_case(DK) <= 4, ] #keep cases with 4 or fewer missing
DK = irmi(DK, noise = F) #impute the missing
DK.S = fa(DK) #factor analyze
DK_S_scores = data.frame(DK.S = as.vector(DK.S$scores) * -1) #save scores, reversed
rownames(DK_S_scores) = rownames(DK) #add rownames

M = merge_datasets(M, DK_S_scores, 1) #merge to mega

ggplot(M, aes(LV2012estimatedIQ, DK.S)) + 
  geom_point() +
  geom_text(aes(label = rownames(M)), vjust = 1, alpha = .7) +
  geom_smooth(method = "lm", se = F)

# Norway ------------------------------------------------------------------

NO_work = cbind(M[""], #for work data

NO_income = cbind(M["Norway.Income.index.2009"], #for income data

#make DF
NO = cbind(M["NorwayViolentCrimeAdjustedOddsRatioSkardhamar2014"],

#get 5 year means
NO[""] = apply(NO_work[1:5],1,mean,na.rm=T) #get means, ignore missing
NO["OutOfWork.2010to2014.women"] = apply(NO_work[6:10],1,mean,na.rm=T) #get means, ignore missing

#get means for income and add to DF
NO["Income.index.2009to2012"] = apply(NO_income,1,mean,na.rm=T) #get means, ignore missing

plot_miss(NO) #view the pattern of missing data

NO = NO[miss_case(NO) <= 3, ] #keep those with 3 datapoints or fewer missing
NO = irmi(NO, noise = F) #impute the missing

NO_S = fa(NO) #factor analyze
NO_S_scores = data.frame(NO_S = as.vector(NO_S$scores) * -1) #save scores, reverse
rownames(NO_S_scores) = rownames(NO) #add rownames

M = merge_datasets(M, NO_S_scores, 1) #merge with mega

ggplot(M, aes(LV2012estimatedIQ, NO_S)) +
  geom_point() +
  geom_text(aes(label = rownames(M)), vjust = 1, alpha = .7) +
  geom_smooth(method = "lm", se = F)


cor(M$NO_S, M$DK.S, use = "pair")



A reanalysis of Carl (2015) revealed that the inclusion of London had a strong effect on the S loading of crime and poverty variables. S factor scores from a dataset without London and redundant variables were strongly related to IQ scores, r = .87. The Jensen coefficient for this relationship was .86.



Carl (2015) analyzed socioeconomic inequality across 12 regions of the UK. In my reading of his paper, I thought of several analyses that Carl had not done. I therefore asked him for the data and he shared it with me. For a fuller description of the data sources, refer back to his article.

Redundant variables and London

Including (nearly) perfectly correlated variables can skew an extracted factor. For this reason, I created an alternative dataset where variables that correlated above |.90| were removed. The following pairs of strongly correlated variables were found:

  1. median.weekly.earnings and log.weekly.earnings r=0.999
  2. GVA.per.capita and log.GVA.per.capita r=0.997
  3. R.D.workers.per.capita and log.weekly.earnings r=0.955
  4. log.GVA.per.capita and log.weekly.earnings r=0.925
  5. economic.inactivity and children.workless.households r=0.914

In each case, the first of the pair was removed from the dataset. However, this resulted in a dataset with 11 cases and 11 variables, which is impossible to factor analyze. For this reason, I left in the last pair.
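The search for redundant pairs can be sketched in base R as follows. This is a minimal illustration, not the code used for the analysis: the function name find_redundant_pairs and the toy data are hypothetical, and only the |.90| threshold is taken from the text.

```r
# Hypothetical helper: list variable pairs whose absolute correlation exceeds a threshold.
find_redundant_pairs = function(df, threshold = .90) {
  r = cor(df, use = "pairwise.complete.obs")
  r[lower.tri(r, diag = TRUE)] = NA            #keep each pair only once
  idx = which(abs(r) > threshold, arr.ind = TRUE)
  out = data.frame(var1 = rownames(r)[idx[, "row"]],
                   var2 = colnames(r)[idx[, "col"]],
                   r = r[idx],
                   stringsAsFactors = FALSE)
  out[order(-abs(out$r)), ]                    #strongest pairs first
}

# toy data: earnings and its log are nearly perfectly correlated
set.seed(1)
earnings = runif(12, 400, 700)
toy = data.frame(earnings = earnings,
                 log_earnings = log(earnings),
                 crime = rnorm(12))

pairs_found = find_redundant_pairs(toy)
pairs_found

# drop the first variable of each pair, as described in the text
toy_reduced = toy[, !(colnames(toy) %in% pairs_found$var1), drop = FALSE]
```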

Furthermore, because capitals are known to sometimes strongly affect results (Kirkegaard, 2015a, 2015b, 2015d), I also created two further datasets without London: one with the redundant variables, one without. Thus, there were 4 datasets:

  1. A dataset with London and redundant variables.
  2. A dataset with redundant variables but without London.
  3. A dataset with London but without redundant variables.
  4. A dataset without London and redundant variables.

Factor analysis

Each of the four datasets was factor analyzed. Figure 1 shows the loadings.


Figure 1: S factor loadings in four analyses.

Removing London strongly affected the loading of the crime variable, which changed from moderately positive to moderately negative. The poverty variable also saw a large change, from slightly negative to strongly negative. Both changes are in the direction towards a purer S factor (desirable outcomes with positive loadings, undesirable outcomes with negative loadings). Removing the redundant variables did not have much effect.

As a check, I investigated whether these results were stable across 30 different factor analytic methods.1 They were: all loadings and scores correlated near 1.00. For my analysis, I used those extracted with the combination of minimum residuals and regression.


Due to London’s strong effect on the loadings, one should check that the two methods developed for finding such cases can identify it (Kirkegaard, 2015c). Figure 2 shows the results from these two methods (mean absolute residual and change in factor size):

Figure 2: Mixedness metrics for the complete dataset.

As can be seen, London was identified as a far outlier using both methods.

S scores and IQ

Carl’s dataset also contains IQ scores for the regions. These correlate .87 with the S factor scores from the dataset without London and redundant variables. Figure 3 shows the scatter plot.

Figure 3: Scatter plot of S and IQ scores for regions of the UK.

However, it is possible that IQ is not really related to the latent S factor, just the other variance of the extracted S scores. For this reason I used Jensen’s method (method of correlated vectors) (Jensen, 1998). Figure 4 shows the results.

Figure 4: Jensen’s method for the S factor’s relationship to IQ scores.

Jensen’s method thus supported the claim that IQ scores and the latent S factor are related.
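For readers unfamiliar with Jensen's method, here is a minimal sketch on simulated data. It correlates two vectors: the indicators' loadings on the general factor, and the indicators' correlations with the criterion (here IQ). Everything in it is an assumption made for illustration: the data are simulated, and the first principal component from prcomp() is used as a simple stand-in for the factor analysis, which is not the same as the fa() extraction used in the actual analysis.

```r
# Sketch of Jensen's method (method of correlated vectors) on simulated data.
set.seed(1)
n = 200                                    #hypothetical number of units
latent_s = rnorm(n)                        #simulated latent S
true_loadings = c(.9, .7, .5, .3, -.4, -.8)
indicators = sapply(true_loadings,
                    function(l) l * latent_s + sqrt(1 - l^2) * rnorm(n))
colnames(indicators) = paste0("ind", 1:6)
iq = .8 * latent_s + .6 * rnorm(n)         #criterion related to the latent factor only

# vector 1: loadings on the general factor (first principal component as a stand-in)
pc = prcomp(indicators, scale. = TRUE)
loadings = pc$rotation[, 1] * pc$sdev[1]
if (loadings["ind1"] < 0) loadings = -loadings #the sign of a component is arbitrary

# vector 2: each indicator's correlation with the criterion
r_with_iq = as.vector(cor(indicators, iq))

# the Jensen coefficient is the correlation between the two vectors
jensen_coef = cor(loadings, r_with_iq)
jensen_coef
```

When the criterion is related to the indicators only through the latent factor, as in this simulation, the coefficient comes out close to 1.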

Discussion and conclusion

My reanalysis revealed some interesting results regarding the effect of London on the loadings. This was made possible by data sharing, which demonstrates the importance of this practice (Wicherts & Bakker, 2012).

Supplementary material

R source code and datasets are available at the OSF.


Carl, N. (2015). IQ and socioeconomic development across Regions of the UK. Journal of Biosocial Science, 1–12.

Jensen, A. R. (1998). The g factor: the science of mental ability. Westport, Conn.: Praeger.

Kirkegaard, E. O. W. (2015a). Examining the S factor in Mexican states. The Winnower. Retrieved from

Kirkegaard, E. O. W. (2015b). Examining the S factor in US states. The Winnower. Retrieved from

Kirkegaard, E. O. W. (2015c). Finding mixed cases in exploratory factor analysis. The Winnower. Retrieved from

Kirkegaard, E. O. W. (2015d). The S factor in Brazilian states. The Winnower. Retrieved from

Revelle, W. (2015). psych: Procedures for Psychological, Psychometric, and Personality Research (Version 1.5.4). Retrieved from

Wicherts, J. M., & Bakker, M. (2012). Publish (your data) or (let the data) perish! Why not publish your data too? Intelligence, 40(2), 73–76.

1 There are 6 different extraction and 5 scoring methods supported by the fa() function from the psych package (Revelle, 2015). Thus, there are 6*5 = 30 combinations.

Some time ago, I stumbled upon this paper:
Searls, D. T., Mead, N. A., & Ward, B. (1985). The relationship of students’ reading skills to TV watching, leisure time reading, and homework. Journal of Reading, 158-162.

Sample is very large:

To enlarge on such information, the National Assessment of Educational Progress (NAEP) gathered data on the TV viewing habits of 9, 13, and 17 year olds across the U.S. during its 1979-80 assessment of reading skills. In this survey, 21,208 9 year olds, 30,488 13 year olds, and 25,551 17 year olds responded to questions about their backgrounds and to a wide range of items probing their reading comprehension skills. These data provide information on the amount of TV watched by different groups of students and allow comparisons of reading skills and TV watching.

The relationship turns out to be interestingly nonlinear:

[Image: reading comprehension by TV watching and age]

For understanding, it is better to visualize the data anew:


I will just pretend that reading comprehension is cognitive ability, usually a fair approximation.

So, if we follow the smarties: at age 9 they watch a fair amount of TV (3-4 hours per day), at 13 they watch about half of that (1-2 hours), and at 17 they barely watch it (<1 hour).

Developmental hypothesis: TV is interesting but only to persons at a certain cognitive ability level. Young smart children fit in the target group, but as they age and become smarter, they grow out of the target group and stop watching.

Alternative hypotheses?

R code

The code for the plot above.

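library(reshape2) #for melt()
library(ggplot2)

#relative reading comprehension by TV watching category; rows are ages 9, 13, 17
d = data.frame(c(1.5, 2.2, 2.3),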
               c(3, 3, 1.3),
               c(5.2, .2, -2.2),
               c(-1.7, -6.9, -8.1))

colnames(d) = c("<1 hour", "1-2 hours", "3-4 hours", ">4 hours")
d$age = factor(c("9", "13", "17"), levels = c("9", "13", "17"))

d = melt(d, id.vars = "age")


ggplot(d, aes(age, value)) +
  geom_point(aes(color = variable)) +
  ylab("Relative reading comprehension score") +
  scale_color_discrete(name = "TV watching per day") +
  scale_shape_discrete(guide = "none")


A dataset was compiled with 17 diverse socioeconomic variables for 32 departments of Colombia and the capital district. Factor analysis revealed an S factor. Results were robust to data imputation and removal of a redundant variable. 14 of 17 variables loaded in the expected direction. Extracted S factors correlated about .50 with the cognitive ability estimate. The Jensen coefficient for the S factor for this relationship was .60.



The general socioeconomic factor is the mathematical construct associated with the idea that positive outcomes tend to go along with other positive outcomes, and likewise for the negative. Mathematically, this shows up as a factor where the desirable outcomes load positively and where the undesirable outcomes load negatively. As far as I know, Kirkegaard (2014b) was the first to report such a factor, although Lynn (1979) was close to the same idea. The factor is called s at the individual level, and S when found in aggregated data.

By now, S factors have been found between countries (Kirkegaard, 2014b), twice between country-of-origin groups within countries (Kirkegaard, 2014a), numerous times within countries (reviewed in (Kirkegaard, 2015c)) and also at the level of first names (Kirkegaard & Tranberg, 2015). This paper analyzes data for 33 Colombian departments including the capital district.

Data sources

Most of the data were found via the English-language website which is an aggregator of statistical information concerning countries and their divisions. A second source was a Spanish-language report (DANE, 2011). One variable had to be found on Wikipedia (“List of Colombian departments by GDP,” 2015). Finally, HDI2010 was found in a Spanish-language UN report (United Nations Development Programme & UNDP Colombia, 2011).

Variables were selected according to two criteria: 1) they must be socioeconomically important and 2) they must not be strongly dependent on local climatic conditions. For instance, fishermen per capita would be a variable that fails both criteria, since it is not generally seen as socioeconomically important and is dependent on having access to a body of water.

The included variables are:

  • SABER, verbal scores
  • SABER, math scores
  • Acute malnutrition, %
  • Chronic malnutrition, %
  • Low birth weight, %
  • Access to clean water, %
  • The presence of a sewerage system, %
  • Immunization coverage, %
  • Child mortality, rate
  • Infant mortality, rate
  • Life expectancy at birth
  • Total fertility rate
  • Births that occur in a health clinic, %
  • Unemployment, %
  • GDP per capita
  • Poverty, %
  • GINI
  • Domestic violence, rate
  • Urbanicity, %
  • Population, absolute number
  • HDI 2010

SABER is a local academic achievement test similar to PISA.

Missing data

When collecting the data, I noticed that quite a number of the variables have missing data. The matrixplot is shown in Figure 1.


Figure 1: Matrix plot for the dataset.

The red fields indicate missing data (NA). The greyscale fields indicate high (dark) and low (light) values in each variable. We see that the same departments tend to have missing data.

Redundant variables and imputation

Very highly correlated variables cause problems for factor analysis and result in ‘double weighting’ of some variables. For this reason I used the algorithm I developed to find the most highly correlated pairs of variables and remove one of them automatically (Kirkegaard, 2015a). I used a rule of thumb that variables which correlate at >.90 should be removed. There was only one such pair (infant mortality and child mortality, r = .922; removed infant mortality).

I imputed the missing data using the irmi() function from the VIM package (Templ, Alfons, Kowarik, & Prantner, 2015). This was done without noise to make the results replicable. I had no interest in trying to estimate standard errors, so multiple imputation was unnecessary (Donders, van der Heijden, Stijnen, & Moons, 2006).

To check whether results were comparable across methods, datasets were saved with every combination of imputation and removal of the redundant variable, thus creating 4 datasets.

Factor analysis

I carried out factor analysis on the 4 datasets. The factor loadings plot is shown in Figure 2.

Figure 2: Factor loadings plot.

Results were similar across methods. Per S factor theory, the desirable variables should have positive loadings and the undesirable negative loadings. This was not entirely the case. Three variables that are generally considered undesirable loaded positively: unemployment rate, low birth weight and domestic violence.

Unemployment rate and crime have been found to load in the wrong direction before when analyzing state-like units. It may be due to the welfare systems being better in the higher S departments, making it possible to survive without working.

It is said that cities breed crime and since urbanicity has a very high positive S loading, the crime result may be a side-effect of that. Alternatively, the legal system may be better (e.g. less corrupt) in the higher S departments making it more likely for crimes to be reported. This is perhaps especially so for crimes against women.

The result with low birth weight is stranger given that higher birth weight is a known correlate of higher educational levels and cognitive ability (Shenkin, Starr, & Deary, 2004). One of the other variables suggests an answer: in the lower S departments, a large fraction (30-40%) of births are home-births, and it seems likely that this would result in fewer reports of low birth weights.

Generally, the results are consistent with those from other countries; 14 of 17 variables loaded in the expected direction.

Mixed cases

Mixed cases are cases that do not fit the factor structure of a dataset. I have previously developed two methods for detecting such cases (Kirkegaard, 2015b). Neither method indicated any strong mixed cases in the unimputed, unreduced dataset or the imputed, reduced dataset. Removing the least congruent case would only improve the factor size by 1.2 percentage points, and the case with the greatest mean absolute residual had a value of only .89.

Unlike in previous analyses, the capital district was kept because it did not appear to be a structural outlier.

Cognitive ability, S and HDI

The two cognitive variables correlated at .84, indicating the presence of the aggregate general cognitive ability factor (G factor; Rindermann, 2007). They were averaged to form an estimate of the G factor.

The correlations between S factors, HDI and cognitive ability are shown in Table 1.

 Table 1: Correlation matrix for cognitive ability, S factor and HDI. Correlations below diagonal are weighted by the square root of population size.

Weighted and unweighted correlations were approximately the same. The imputed and trimmed S factor was nearly identical to the HDI values, even though the HDI values are from 2010 and the data the S factor is based on are from 2005. Results are fairly similar to those found in other countries.
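For those curious how a correlation weighted by the square root of population size can be computed, here is a minimal sketch using cov.wt() from base R. The data are simulated and the variable names are hypothetical; this is not necessarily the function used for Table 1.

```r
# Sketch: unweighted vs. weighted correlation, weights = sqrt(population).
# Simulated data; values are not from the Colombian dataset.
set.seed(1)
n = 33
population = round(runif(n, 5e4, 7e6))  #hypothetical population sizes
g = rnorm(n)                            #hypothetical cognitive ability scores
s = .5 * g + rnorm(n, sd = .85)         #hypothetical S factor scores

w = sqrt(population)
w = w / sum(w)                          #normalize the weights to sum to 1

r_unweighted = cor(g, s)
r_weighted = cov.wt(cbind(g, s), wt = w, cor = TRUE)$cor["g", "s"]

# with equal weights, the weighted correlation reduces to the ordinary one
r_equal = cov.wt(cbind(g, s), wt = rep(1 / n, n), cor = TRUE)$cor["g", "s"]

round(c(unweighted = r_unweighted, weighted = r_weighted), 2)
```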

Figure 3 shows a scatter plot of S factor (reduced, imputed dataset) and cognitive ability.


Figure 3: Scatter plot of S factor scores and cognitive ability.

Jensen’s method

Finally, as a robustness test, I used Jensen’s method (method of correlated vectors (Frisby & Beaujean, 2015; Jensen, 1998)) to see if cognitive abilities’ association with the S factor scores was due to the latent trait. Figure 4 shows the Jensen plot.

Figure 4: Jensen plot for S factor loadings and cognitive ability.

The correlation was .60, which is satisfactory given the relatively few variables (N=16).


  • I don’t speak Spanish, so I may have overlooked some variables that should have been included in the analysis. There may also be translation errors as I had to rely on the translations found on the websites I used.
  • No educational attainment variables were included despite these often having very strong loadings. None were available in the data sources I consulted.
  • Data were missing for many cases and had to be imputed.

Supplementary material

Data files, R source code and high quality figures are available in the Open Science Framework repository.


A factor analysis was carried out on 6 socioeconomic variables for 506 census tracts of Boston. An S factor was found with positive loadings for median value of owner-occupied homes and average number of rooms in these; negative loadings for crime rate, pupil-teacher ratio, NOx pollution, and the proportion of the population of ‘lower status’. The S factor scores were negatively correlated with the estimated proportion of African Americans in the tracts, r = -.36 [CI95 -0.43; -0.28]. This estimate was biased downwards due to a data error that could not be corrected for.

The general socioeconomic factor (s/S1) is a similar construct to that of general cognitive ability (GCA; g factor, intelligence, etc.) (Gottfredson, 2002; Jensen, 1998). For ability data, it has been repeatedly found that performance on any cognitive test is positively related to performance on any other test, no matter which format (pen and paper, read aloud, computerized) or type (verbal, spatial, mathematical, figural, or reaction time-based) has been tried. The S factor is similar. It has been repeatedly found that desirable socioeconomic outcomes tend to be positively related to other desirable socioeconomic outcomes, and undesirable outcomes positively related to other undesirable outcomes. When this pattern is found, one can extract a general factor such that the desirable outcomes have positive loadings and the undesirable outcomes have negative loadings. In a sense, this is the latent factor that underlies the frequently used term “socioeconomic status” except that it is broader and not just restricted to income, occupation and educational attainment, but also includes e.g. crime and health.

So far, S factors have been found for country-level (Kirkegaard, 2014b), state/regional-level (e.g. Kirkegaard, 2015), country of origin-level for immigrant groups (Kirkegaard, 2014a) and first name-level data (Kirkegaard & Tranberg, In preparation). The S factors found have not always been strictly general in the sense that sometimes an indicator loads in the ‘wrong direction’, meaning that either an undesirable variable loads positively (typically crime rates), or a desirable outcome loads negatively. These findings should not be seen as outliers to be explained away, but rather as results to be explained in some coherent fashion. For instance, crime rates may load positively despite crime being undesirable because the justice system may be better in the higher S states, or because urbanicity tends to create crime and urbanicity usually has a positive loading. To understand why some indicators sometimes load in the wrong direction, it is important to examine data at many levels. This paper extends the S factor to a new level, that of census tracts in the US.

Data source
While taking a video course on statistical learning based on James, Witten, Hastie, & Tibshirani (2013), I noted that a dataset used as an example would be useful for an S factor analysis. The dataset concerns 506 census tracts of Boston and includes the following variables (Harrison & Rubinfeld, 1978):

  • Median value of owner-occupied homes
  • Average number of rooms in owner units.
  • Proportion of owner units built before 1940.
  • Proportion of the population that is ‘lower status’: “proportion of adults without some high school education and proportion of male workers classified as laborers”.
  • Crime rate.
  • Proportion of residential land zoned for lots greater than 25k square feet.
  • Proportion of nonretail business acres.
  • Full value property tax rate.
  • Pupil-teacher ratios for schools.
  • Whether the tract bounds the Charles River.
  • Weighted distance to five employment centers in the Boston region.
  • Index of accessibility to radial highways.
  • Nitrogen oxide concentration. A measure of air pollution.
  • Proportion of African Americans.

See the original paper for a more detailed description of the variables.

This dataset has become very popular as a demonstration dataset in machine learning and statistics, which shows the benefits of data sharing (Wicherts & Bakker, 2012). As Gilley & Pace (1996) note, “Essentially, a cottage industry has sprung up around using these data to examine alternative statistical techniques.” However, as they re-checked the data, they found a number of errors. The corrected data can be downloaded here, which is the dataset used for this analysis.

The proportion of African Americans
The variable concerning African Americans has been transformed by the following formula: 1000(x - .63)^2, where x is the proportion. Because one has to take the square root to reverse the effect of taking the square, some information is lost. For example, if we begin with the dataset {2, -2, 2, 2, -2, -2} and take the square of these and get {4, 4, 4, 4, 4, 4}, it is impossible for someone to reverse this transformation and get the original values because they cannot tell whether 4 results from -2 or 2 being squared.

In case of the actual data, the distribution is shown in Figure 1.

Figure 1: Transformed data for the proportion of blacks by census tract.

Due to the transformation, the values around 400 actually mean that the proportion of blacks is around 0. The function for back-transforming the values is shown in Figure 2.

Figure 2: The transformation function.

We can now see the problem of back-transforming the data. If the transformed data contains a value between 0 and about 140, then we cannot tell with certainty what the original value was. For instance, a transformed value of 100 might correspond to an original proportion of .31 or .95.

To get a feel for the data, one can use the Racial Dot Map explorer and look at Boston. Figure 3 shows the Boston area color-coded by racial groups.

Figure 3: Racial dot map of Boston area.

As can be seen, the races tend to live rather separately, with large areas dominated by one group. From looking at it, it seems that Whites and Asians mix more with each other than with the other groups, and that African Americans and Hispanics do the same. One might expect this result based on the groups’ relative differences in S factor and GCA (Fuerst, 2014). Still, this should be examined by numerical analysis, a task which is left for another investigation.

Still, we are left with the problem of how to back-transform the data. The conservative choice is to use only the left side of the function. This is conservative because any proportion above .63 will get back-transformed to a lower value. E.g. .80 will become .46, a serious error. This is the method used for this analysis.
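The arithmetic can be checked with a few lines of R. This is only a sketch: the function names are hypothetical, while the formula and the example values are the ones given in the text.

```r
# Forward transformation used in the dataset: 1000 * (x - .63)^2
transform_b = function(x) 1000 * (x - .63)^2

# The two candidate back-transformations (hypothetical helper names)
back_left = function(b) .63 - sqrt(b / 1000)  #conservative choice: left side of the function
back_right = function(b) .63 + sqrt(b / 1000) #right side, for proportions above .63

transform_b(0)                       #a proportion of 0 gives a value of about 397
transform_b(1)                       #a proportion of 1 gives a value of about 137

c(back_left(100), back_right(100))   #a value of 100 may come from about .31 or about .95

back_left(transform_b(.80))          #a true proportion of .80 is recovered as .46
```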

Factor analysis
Of the variables in the dataset, there is the question of which to use for S factor analysis. In general when doing these analyses, I have sought to include variables that measure something socioeconomically important and which are not strongly influenced by the local natural environment. For instance, the dummy variable concerning the Charles River fails on both counts. I chose the following subset:

  • Median value of owner-occupied homes
  • Average number of rooms in owner units.
  • Proportion of the population that is ‘lower status’.
  • Crime rate.
  • Pupil-teacher ratios for schools.
  • Nitrogen oxide concentration. A measure of air pollution.

These concern important but different things. Figure 4 shows the loadings plot for the factor analysis (reversed).2


Figure 4: Loadings plot for the S factor.

The S factor was confirmed for this data without exceptions, in that all indicator variables loaded in the expected direction. The factor was moderately strong, accounting for 47% of the variance.

Relationship between S factor and proportions of African Americans
Figure 5 shows a scatter plot of the relationship between the back-transformed proportion of African Americans and the S factor.

Figure 5: Scatter plot of S scores and the back-transformed proportion of African Americans by census tract in Boston.

We see that there is wide variation in S factor scores even among tracts with no or very few African Americans. These low S scores may be due to Hispanics or simply reflect the wide variation within Whites (there were few Asians back then). The correlation between proportion of African Americans and S is -.36 [CI95 -0.43; -0.28].

We see that many points have very low S scores (roughly -3 to -1.5). Some of these points may actually be census tracts with very high proportions of African Americans that were back-transformed incorrectly.

The value of r = -.36 should not be interpreted as an estimate of effect size of ancestry on S factor for census tracts in Boston because the proportions of the other sociological races were not used. A multiple regression or similar method with all sociological races as the predictors is necessary to answer this question. Still, the result above is in the expected direction based on known data concerning the mean GCA of African Americans, and the relationship between GCA and socioeconomic outcomes (Gottfredson, 1997).

The back-transformation process likely introduced substantial error in the results.

Data are relatively old and may not reflect reality in Boston as it is now.

Supplementary material
Data, high quality figures and R source code are available at the Open Science Framework repository.



1 Capital S is used when the data are aggregated, and small s is used when the data are at the individual level. This follows the nomenclature of Rindermann (2007).

2 It is described as reversed because the analysis gave positive loadings for undesirable outcomes and negative loadings for desirable outcomes. This is because the analysis includes more indicators of undesirable outcomes and the factor analysis will choose the direction to which most indicators point as the positive one. This can easily be reversed by multiplying by -1.