Two datasets of socioeconomic data was obtained from different sources. Both were factor analyzed and revealed a general factor (S factor). These factors were highly correlated with each other (.79 to .95), HDI (.68 to .93) and with cognitive ability (PISA; .70 to .78). The federal district was a strong outlier and excluding it improved results.

Method of correlated vectors was strongly positive for all 4 analyses (r’s .78 to .92 with reversing).


In a number of recent articles (Kirkegaard 2015a,b,c,d,e), I have analyzed within-country regional data to examine the general socioeconomic factor, if it exists in the dataset (for the origin of the term, see e.g. Kirkegaard 2014). This work was inspired by Lynn (2010) whose datasets I have also reanalyzed. While doing work on another project (Fuerst and Kirkegaard, 2015*), I needed an S factor for Mexican states, if such exists. Since I was not aware of any prior analysis of this country in this fashion, I decided to do it myself.

The first problem was obtaining data for the analysis. For this, one needs a number of diverse indicators that measure important economic and social matters for each Mexican state. Mexico has 31 states and a federal district, so one can use a decent number of indicators to examine the S factor. Mexico is a Spanish speaking country and English comprehension is fairly poor. According to Wikipedia, only 13% of people speak English there. Compare with 86% for Denmark, 64% for Germany and 35% for Egypt.

S factor analysis 1 – Wikipedian data

Data source and treatment

Unlike for the previous countries, I could not easily find good data available in English. As a substitute, I used data from Wikipedia:

These come from various years, are sometimes not given per person, and often have no useful source given. So they are of unknown veracity, but they are probably fine for a first look. The HDI is best thought of as a proxy for the S factor, so we can use it to examine construct validity.

Some variables had data for multiple time-points and they were averaged.

Some data was given in raw numbers. I calculated per capita versions of them using the population data also given.


The variables above minus HDI and population size were factor analyzed using minimum residuals to extract 1 factor. The loadings plot is shown below.


The literacy variables had a near perfect loading on S (.99). Unemployment unexpectedly loaded positively and so did homicides per capita altho only slightly. This could be because unemployment benefits are only in existence in the higher S states such that going unemployed would mean starvation. The homicide loading is possibly due to the drug war in the country.

Analysis 2 – Data obtained from INEG

Data source and treatment

Since the results based on Wikipedia data was dubious, I searched further for more data. I found it on the Spanish-language statistical database, Instituto Nacional De Estadística Y Geografía, which however had the option of showing poorly done English translations. This is not optimal as there are many translation errors which may result in choosing the wrong variable for further analysis. If any Spanish-speaker reads this, I would be happy if they would go over my chosen variables and confirm that they are correct. I ended up with the following variables:

  1. Cost of crime against individuals and households
  2. Cost of crime on economic units
  3. Annual percentage change of GDP at 2008 prices
  4. Crime prevalence rate per 10,000 economic units
  5. Crime prevalence rate per hundred thousand inhabitants aged 18 years and over, by state
  6. Dark figure of crime on economic units
  7. Dark figure (crimes not reported and crimes reported that were not investigated)
  8. Doctors per 100 000 inhabitants
  9. Economic participation of population aged 12 to 14 years
  10. Economic participation of population aged 65 and over
  11. Economic units.
  12. Economically active population. Age 15 and older
  13. Economically active population. Unemployed persons. Age 15 and older
  14. Electric energy users
  15. Employed population by income level. Up to one minimum wage. Age 15 and older
  16. Employed population by income level. More than 5 minimum wages. Age 15 and older
  17. Employed population by income level. Do not receive income. Age 15 and older
  18. Fertility rate of adolescents aged 15 to 19 years
  19. Female mortality rate for cervical cancer
  20. Global rate of fertility
  21. Gross rate of women participation
  22. Hospital beds per 100 thousand inhabitants
  23. Inmates in state prisons at year end
  24. Life expectancy at birth
  25. Literacy rate of women 15 to 24 years
  26. Literacy rate of men 15 to 24 years
  27. Median age
  28. Nurses per 100 000 inhabitants
  29. Percentage of households victims of crime
  30. Percentage of births at home
  31. Percentage of population employed as professionals and technicians
  32. Prisoners rate (per 10,000 inhabitants age 18 and over)
  33. Rate of maternal mortality (deaths per 100 thousand live births)
  34. Rate of inhabitants aged 18 years and over that consider their neighborhood or locality as unsafe, per hundred thousand inhabitants aged 18 years and over
  35. Rate of inhabitants aged 18 years and over that consider their state as unsafe, per hundred thousand inhabitants aged 18 years and over
  36. Rate sentenced to serve a sentence (per 1,000 population age 18 and over)
  37. State Gross Domestic Product (GDP) at constant prices of 2008
  38. Total population
  39. Total mortality rate from respiratory diseases in children under 5 years
  40. Total mortality rate from acute diarrheal diseases (ADD) in population under 5 years
  41. Unemployment rate of men
  42. Unemployment rate of women
  43. Households
  44. Inhabited housings with available computer
  45. Inhabited housings that have toilet
  46. Inhabited housings that have a refrigerator
  47. Inhabited housings with available water from public net
  48. Inhabited housings that have drainage
  49. Inhabited housings with available electricity
  50. Inhabited housings that have a washing machine
  51. Inhabited housings with television
  52. Percentage of housing with piped water
  53. Percentage of housing with electricity
  54. Proportion of population with access to improved sanitation, urban and rural
  55. Proportion of population with sustainable access to improved sources of water supply, in urban and rural areas

There are were data for multiple years for most of them. I used all data from the last 10 years, approximately. For all data with multiple years, I calculated the mean value.

For data given in raw numbers, I calculated the appropriate per unit measures (per person, per economically active person (?), per household).

A matrix plot for all the S factor relevant data (e.g. not population size) is shown below. It shows missing data in red, as well as the relative difference between datapoints. Thus, cells that are completely white or black are outliers compared to the other data.


One variable (inmates per person) has a few missing datapoints.

Multiple other variables had strong outliers. I examined these to determine if they were real or due to data error.

Inspection revealed that the GDP per person data was clearly incorrect for one state (Campeche) but I could not find the source of error. The data is the same as on the website and did not match the data on Wikipedia. I deleted it to be safe.

The GDP change outlier seems to be real (Campeche) which has negative growth. According to this site, it is due to oil fields closing.

The rest of the outliers were hard to say something about due to the odd nature of the data (“dark crime”?), or were plausible. E.g. Mexico City (aka Federal District, the capital) was an outlier on nurses and doctors per capita, but this is presumably due to many large hospitals being located there.

Some data errors of my own were found and corrected but there is no guarantee there are not more. Compiling a large set of data like this frequently results in human errors.

Factor analysis

Since there were only 32 cases — 31 states + federal district — and 47 variables (excluding the bogus GDP per capita one), this gives problems for factor analysis. There are various recommendations, but almost none of them are met by this dataset (Zhao, 2009). To test limits, I decided to try factor analyzing all of the variables. This produced warnings:

The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
In factor.scores, the correlation matrix is singular, an approximation is used
Warning messages:
1: In cor.smooth(R) : Matrix was not positive definite, smoothing was done
2: In cor.smooth(R) : Matrix was not positive definite, smoothing was done
3: In cor.smooth(r) : Matrix was not positive definite, smoothing was done
4: In cor.smooth(r) : Matrix was not positive definite, smoothing was done

Warnings such these do not always mean that the result is nonsense, but they often do. For that reason, I wanted to extract an S factor with a smaller number of variables. From the 47, I selected the following 21 variables as generally representative and interpretable:

  1. GDP.change,              #Economic
  2. Unemploy.men.rate,
  3. Unemploy.women.rate,
  4. Low.income.peap,
  5. High.income.peap,
  6. Prof.tech.employ.pct,
  7. crime.rate.per.adult,   #crime
  8. Inmates.per.pers,
  9. Unsafe.neighborhood.percept.rate,
  10. Has.water.net.per.hh,    #material goods
  11. Elec.pct,
  12. Has.wash.mach.per.hh,
  13. Doctors.per.pers,      #Health
  14. Nurses.per.pers,
  15. Hospital.beds.per.pers,
  16. Total.fertility,
  17. Home.births.pct,
  18. Maternal.death.rate,
  19. Life.expect,
  20. Women.participation,   #Gender equality
  21. Lit.young.women        #education

Note that peap = per economically active person, hh = household.

The selection was made by my judgment call and others may choose different variables.

Automatic reduction of dataset

As a robustness check and evidence against a possible claim that I picked the variables such as to get an S factor that most suited my prior beliefs, I decided to find an automatic method of selecting a subset of variables for factor analysis. I noticed that in the original dataset, some variables overlapped near perfectly. This would mean that whatever they measure, it would get measured twice or more when extracting a factor. Highly correlated variables can also create nonsense solutions, especially when extracting more than 1 factor.

Another piece of insight comes from the fact that for cognitive data, general factors extracted from a less broad selection of subtests are worse measures of general cognitive ability than those from broader selections (Johnson et al, 2008).

Lastly, subtests from different domains tend to be less correlated than those from the same domain (hence the existence of group factors).

Combining all this, it seems a decent idea that to reduce a dataset by 1 variable, one should calculate all the intercorrelations and find the highest one. Then one should remove one of the variables responsible for it. One can do this repeatedly to remove more than 1 variable from a dataset. Concerning the question of which of the two variables to remove, I can think of three ways: always removing the first, always the second, choosing at random. I implemented all three settings and chose the second as the default. This is because in many datasets the first of a set of highly correlated variables is usually the ‘primary one’, E.g. unemployment, unemployment men, unemployment women. The algorithm also outputs step-by-step information concerning which variables was removed and what their correlation was.

Having written the R code for the algorithm, I ran it on the Mexican dataset. I wanted to obtain a solution using the largest possible number of variables without getting a warning from the factor extraction function. So I first removed 1 variable, and then ran the factor analysis. When I received an error, I removed another, and so on. After having removed 20 variables, I no longer received an error. This left the analysis with 27 variables, or 6 more than my chosen selection. The output from the reduction algorithm was:

> s3 = remove.redundant(s, 20)
[1] "Dropping variable number 1"
[1] "Most correlated vars are Good.water.prop and Piped.water.pct r=0.997"
[1] "Dropping variable number 2"
[1] "Most correlated vars are Piped.water.pct and Has.water.net.per.hh r=0.996"
[1] "Dropping variable number 3"
[1] "Most correlated vars are Fertility.teen and Total.fertility r=0.99"
[1] "Dropping variable number 4"
[1] "Most correlated vars are Good.sani.prop and Has.drainage.per.hh r=0.984"
[1] "Dropping variable number 5"
[1] "Most correlated vars are Victims.crime.households and crime.rate.per.adult r=0.97"
[1] "Dropping variable number 6"
[1] "Most correlated vars are Nurses.per.pers and Doctors.per.pers r=0.962"
[1] "Dropping variable number 7"
[1] "Most correlated vars are Lit.young.men and Lit.young.women r=0.938"
[1] "Dropping variable number 8"
[1] "Most correlated vars are Elec.pct and Has.elec.per.hh r=0.938"
[1] "Dropping variable number 9"
[1] "Most correlated vars are Has.wash.mach.per.hh and Has.refrig.per.household r=0.926"
[1] "Dropping variable number 10"
[1] "Most correlated vars are Prisoner.rate and Inmates.per.pers r=0.901"
[1] "Dropping variable number 11"
[1] "Most correlated vars are Unemploy.women.rate and Unemploy.men.rate r=0.888"
[1] "Dropping variable number 12"
[1] "Most correlated vars are Women.participation and Has.computer.per.household r=0.877"
[1] "Dropping variable number 13"
[1] "Most correlated vars are Hospital.beds.per.pers and Doctors.per.pers r=0.87"
[1] "Dropping variable number 14"
[1] "Most correlated vars are Has.computer.per.household and Prof.tech.employ.pct r=0.868"
[1] "Dropping variable number 15"
[1] "Most correlated vars are Unemploy.men.rate and Unemployed.15plus.peap r=0.866"
[1] "Dropping variable number 16"
[1] "Most correlated vars are Has.tv.per.hh and Has.elec.per.hh r=0.864"
[1] "Dropping variable number 17"
[1] "Most correlated vars are Has.elec.per.hh and Has.drainage.per.hh r=0.851"
[1] "Dropping variable number 18"
[1] "Most correlated vars are Median.age and Prof.tech.employ.pct r=0.846"
[1] "Dropping variable number 19"
[1] "Most correlated vars are Home.births.pct and Low.income.peap r=0.806"
[1] "Dropping variable number 20"
[1] "Most correlated vars are Life.expect and Has.water.net.per.hh r=0.796

In my opinion the output shows that the function works. In most cases, the pair of variables found was either a (near-)double measure e.g. percent of population with electricity and percent of households with electricity, or closely related e.g. literacy in men and women. Sometimes however, the pair did not seem to be closely related, e.g. women’s participation and percent of households with a computer.

Since this dataset selected the variable with missing data, I used the irmi() function from the VIM package to impute the missing data (Templ et al, 2014).

Factor loadings: stability

The factor loading plots are shown below.

S_self_all S_self_automatic S_self_chosen

Each analysis relied upon a unique but overlapping selection of variables. Thus, it is possible to correlate the loadings of the overlapping parts for each analysis. This is a measure of loading stability in different factor analytic environments, as also done by Ree and Earles (1993) for general cognitive ability factor (g factor). The correlations were .98, 1.00, .98 (n’s 21, 27, 12), showing very high stability across datasets. Note that it was not possible to use the loadings from the Wikipedian data factor analysis because the variables were not strictly speaking overlapping.

Factor loadings: interpretation

Examining the factor loadings reveals some things of interest. Generally for all analyses, whatever that is generally considered good loads positively, and whatever considered bad loads negatively.

Unemployment (together, men, women) has positive loadings, whereas it ‘should’ have negative loadings. This is perhaps because the lower S factor states have more dysfunctional or no social security nets such that not working means starvation, and that this keeps people from not working. This is merely a conjecture because I don’t know much about Mexico. Hopefully someone more knowledgeable than me will read this and have a better answer.

Crime variables (crime rate, victimization, inmates/prisoner per capita, sentencing rate) load positively whereas it should load negatively. This pattern has been found before, see Kirkegaard (2015e) for a review of S factor studies and crime variables.

Factor scores

Next I correlated the factor scores from all 4 analysis with each other as well as HDI and cognitive ability as measured by PISA tests (the cognitive data is from Fuerst and Kirkegaard, 2015*; the HDI data from Wikipedia). The correlation matrix is shown below.

“regression” method S.all S.chosen S.automatic S.wiki HDI Cognitive ability
S.all 1.00 -0.08 -0.02 0.08 -0.17 -0.12
S.chosen -0.08 1.00 0.93 0.84 0.93 0.65
S.automatic -0.02 0.93 1.00 0.89 0.88 0.74
S.wiki 0.08 0.84 0.89 1.00 0.76 0.78
HDI -0.17 0.93 0.88 0.76 1.00 0.53
Cognitive ability -0.12 0.65 0.74 0.78 0.53 1.00


Strangely, despite the similar factor loadings, the factor scores from the factor extracted from all the variables had about no relation to the others. This probably indicates that the factor scoring method could not handle this type of odd case. The default scoring method for the factor analysis is “regression”, but there are a few others. Bartlett’s method yielded results for S.all that fit with the other factors, while none of the others did. See the psych package documentation for details (Revelle, 2015). I changed the extraction method for all the other analyses to Bartlett’s to remove method specific variance. The new correlation table is shown below:

Bartlett’s method S.all S.chosen S.automatic S.wiki HDI.mean Cognitive ability
S.all 1.00 0.79 0.88 0.88 0.68 0.74
S.chosen 0.79 1.00 0.95 0.87 0.93 0.70
S.automatic 0.88 0.95 1.00 0.88 0.89 0.74
S.wiki 0.88 0.87 0.88 1.00 0.75 0.78
HDI.mean 0.68 0.93 0.89 0.75 1.00 0.53
Cognitive ability 0.74 0.70 0.74 0.78 0.53 1.00


Intriguingly, now all the correlations are stronger. Perhaps Bartlett’s method is better for handling this type of extraction involving general factors from datasets with low case to variable ratios. It certainly deserves empirical investigation, including reanalysis of prior datasets. I reran the earlier parts of this paper with the Bartlett method. It did not substantially change results. The correlations between loadings across analysis increased a bit (to .98, 1.00, .99).

One possibility however is that the stronger results is just due to Bartlett’s method creating outliers that happen to lie on the regression line. This did not seem to be the case, see scatterplots below.


S factor scores and cognitive ability

The next question is to what degree the within country differences in Mexico can be explained by cognitive ability. The correlations are in the above table as well, they are in the region .70 to .78 for the various S factors. In other words, fairly high. One could plot all of them vs. cognitive ability, but that would give us 4 plots. Instead, I plot only the S factor from my chosen variables since this has the highest correlation with HDI and thus the best claim for construct validity. It is also the most conservative option because of the 4 S factors, it has the lowest correlation with cognitive ability. The plot is shown below:


We see that the federal district is a strong outlier, just like in the study with US states and Washington DC (Kirkegaard, 2015c). One should then remove it and rerun all the analyses. This includes the S factor extractions because the presence of a strong ‘mixed case’ (to be explained further in a future publication) affects the S factor extracted (see again, Kirkegaard, 2015c).

Analyses without Federal District

I reran all the analyses without the federal district. Generally, this did not change much with regards to loadings. Crime and unemployment still had positive loadings.

The loadings correlations across analyses increased to 1.00, 1.00, 1.00.

S.all S.chosen S.automatic S.wiki HDI mean Cognitive ability
S.all 1.00 0.99 0.98 0.93 0.85 0.78
S.chosen 0.99 1.00 0.98 0.94 0.88 0.80
S.automatic 0.98 0.98 1.00 0.90 0.90 0.75
S.wiki 0.93 0.94 0.90 1.00 0.75 0.77
HDI mean 0.85 0.88 0.90 0.75 1.00 0.56
Cognitive ability 0.78 0.80 0.75 0.77 0.56 1.00


The factor score correlations increased meaning that the federal district outlier was a source of discrepancy between the extraction methods. This can be seen in the scatterplots above in that  there is noticeable variation in how far from the rest the federal district lies. After this is resolved, the S factors from the INEG dataset are in near-perfect agreement (.99, .98, .98) while the one from Wikipedia data is less so but still respectable (.93, .94, .90). Correlations with cognitive ability also improved a bit.

Method of correlated vectors

In line with earlier studies, I examine whether the measures that are better measures of the latent S factor are also correlated more highly with the criteria variable, cognitive ability.

MCV_S_all MCV_S_automatic MCV_S_chosen MCV_S_wiki

The MCV results are strong: .90 .78 .85 and .92 for the analysis with all variables, chosen variables, automatically chosen variables and Wikipedian variables respectively. Note that these are for the analyses without the federal district, but they were similar with it too.

Discussion and conclusion

Generally, the present analysis reached similar findings to those before, especially with the one about US states. Cognitive ability was a very strong correlate of the S factors, especially once the federal district outlier was removed before the analysis. Further work is needed to find out why unemployment and crime variables sometimes load positively in S factor analyses with regions or states as the unit of analysis.

MCV analysis supported the idea that cognitive ability is related to the S factor, not just some non-S factor source of variance also present in the dataset.

Supplementary material

Data files, R code, figures are available at the Open Science Framework repository.


  • Fuerst, J. and Kirkegaard, E. O. W. (2015*). Admixture in the Americas part 2: Regional and National admixture. (Publication venue undecided.)
  • Johnson, W., Nijenhuis, J. T., & Bouchard Jr, T. J. (2008). Still just 1g: Consistent results from five test batteries. Intelligence, 36(1), 81-95.
  • Kirkegaard, E. O. W. (2014). The international general socioeconomic factor: Factor analyzing international rankings. Open Differential Psychology.
  • Kirkegaard, E. O. W. (2015a). S and G in Italian regions: Re-analysis of Lynn’s data and new data. The Winnower.
  • Kirkegaard, E. O. W. (2015b). Indian states: G and S factors. The Winnower.
  • Kirkegaard, E. O. W. (2015c). Examining the S factor in US states. The Winnower.
  • Kirkegaard, E. O. W. (2015d). The S factor in China. The Winnower.
  • Kirkegaard, E. O. W. (2015e). The S factor in the British Isles: A reanalysis of Lynn (1979). The Winnower.
  • Lynn, R. (2010). In Italy, north–south differences in IQ predict differences in income, education, infant mortality, stature, and literacy. Intelligence, 38(1), 93-100.
  • Ree, M. J., & Earles, J. A. (1991). The stability of g across different methods of estimation. Intelligence, 15(3), 271-278.
  • Revelle, W. (2015). psych: Procedures for Psychological, Psychometric, and Personality Research. CRAN
  • Templ, M., Alfons A., Kowarik A., Prantner, B. (2014). VIM: Visualization and Imputation of Missing Values. CRAN
  • Zhao, N. (2009). The Minimum Sample Size in Factor Analysis. Encorewiki.

* = not yet published, year is expected publication year.


Blog commenter Lion of the Judah-sphere has claimed that the SAT does not correlate as well with comprehensive IQ tests as said IQ tests correlate with one another. At first I assumed he was wrong, but my recent analysis suggesting Harvard undergrads have an average Wechsler IQ of 122, really makes me wonder.

While an IQ of 122 (white norms) is 25 points above the U.S. mean of 97, it seems very low for a group of people who averaged 1490 out of 1600 on the SAT. According to my formula, since 1995 a score of 1490 on the SAT equated to an IQ of 141. But my formula was based on modern U.S. norms; because demographic changes have made the U.S. mean IQ 3 points below the white American mean (and made the U.S. standard deviation 3.4 percent larger than the white SD), converting to white norms reduces Harvard’s SAT IQ equivalent to 139.

In general, research correlating the SAT with IQ has been inconsistent, with correlations ranging from 0.4 to 0.8. I think much depends on the sample. Among people who took similar courses in similar high schools, the SAT is probably an excellent measure of IQ. But considering the wide range of schools and courses American teenagers have experienced, the SAT is not, in my judgement, a valid measure of IQ. Nor should it be. Universities should not be selecting students based on biological ability, but rather on acquired academic skills.

The lower values are due to restriction of range, e.g. Frey and Detterman (2004). When corrected, the value goes up to .7-.8 range. Also .54 using ICAR60 (Condon and Revelle, 2014) without correction for reliability or restriction.

As for the post, I can think of few things:

1. The sample recruited is likely not representative of Harvard. Probably mostly social sci/humanities students, who have lower scores.

2. Regression towards the mean means that the Harvard student body won’t be as exceptional on their second measurement as on their first. This is because some of the reason they were so high was just good luck.

3. The SAT is teachable to some extent which won’t transfer well to other tests. This reduces the correlation between between SAT and other GCA tests.

4. Harvard uses affirmative action which lowers the mean SAT of the students a bit. It seems to be about 1500.

SAT has an approx. mean of ~500 per subtest, ceiling of 800 and SD of approx. 100. So a 1500 total score is 750+750=(500+250)+(500+250)=(500+2.5sd)+(500+2.5sd), or about 2.5 SD above the mean. Test-retest reliability/stability for a few months is around .86 (mean of values here research.collegeboard.org/sites/default/files/publications/2012/7/researchreport-1982-7-test-disclosure-retest-performance-sat.pdf, n≈3700).

The interesting question is how much regression towards the mean we can expect? I decided to investigate using simulated data. The idea is basically that we first generate some true scores (per classical test theory), and then make two measurements of them. Then using the first measurement, we make a selection that has a mean 2.5 SD above the mean, then we check how well this group performs on the second testing.

In R, the way we do this, is to simulate some randomly distributed data, and then create new variables that are a sum of true score and error for the measurements. This presents us with a problem.

How much error to add?

We can solve this question either by trying some values, or analytically. Analytically, it is like this:

Cor (test1 x test2) = cor(test1 x true score) * cor(test2 x true score)

The correlation of each testing is due to their common association with the true scores. To simulate a testing, we need the correlation between testing and true score. Since test-rest is .863, we take the square root and get .929. The squared value of a correlation is the amount of variance it explains, so if we square it get back to where we were before. Since the total variance is 1, we can calculate the remaining variance as 1-.863=.137. We can take the square root of this to get the correlation, which is .37. Since the correlations are those we need when adding, we have to weigh the true score by .929 and the error by .370 to get the measurements such that they have a test-retest correlation of .863.

One could also just try some values until one gets something that is about right. In this case, just weighing the true score by 1 and error by .4 produces nearly the same result. Trying out a few values like this is faster than doing the math. In fact, I initially tried out some values to hit the approximate value but then decided to solve it analytically as well.

How much selection to make?

This question is more difficult analytically. I have asked some math and physics people and they could not solve it analytically. The info we have is the mean value of the selected group, which is about 2.5, relative to a standard normal population. Assuming we make use of top-down selection (i.e. everybody above a threshold gets selected, no one below), where must we place our threshold to get a mean of 2.5? It is not immediately obvious. I solved it by trying some values and calculating the mean trait value in the selected group. It turns out that to get a selected group with a mean of 2.5, the threshold must be 2.14.

Since this group is selected for having positive errors and positive true scores, their true scores and second testing scores will be lower. How much lower? 2.32 and 2.15 according to my simulation. I’m not sure why the second measurement scores are lower than the true scores, perhaps someone more learned in statistics can tell me.

So there is still some way down to an IQ of 122 from 132.3 (2.15*15+100). However, we may note that they used a shorter WAIS-R form, which correlates .91 with the full-scale. Factoring this in reduces our estimate to 129.4. Given the selectivity noted above, this is not so unrealistic.

Also, the result can apparently be reached simply by 2.5*.86. I was aware that this might work, but wasn’t sure so I tested it (the purpose of this post). One of the wonders of statistical software like this is that one can do empirical mathematics. :)

R code


#size and true score
n.size = 1e6
true.score = rnorm(n.size)

#add error to true scores
test.1 = .929*true.score + .370*rnorm(n.size)
test.2 = .929*true.score + .370*rnorm(n.size)
SAT = data.frame(true.score,
#verify it is correct

#select a subsample
selected = filter(SAT, test.1>2.14) #selected sample
describe(selected) #desc. stats


I ran into trouble trying to remove the effects of one variable on other variables doing the writing my reanalysis of the Noble et al 2015 paper. It was not completely obvious which exact method to use.

The authors in their own paper used OLS aka. standard linear regression. However, since I wanted to plot the results at the case-level, this was not useful to me. Before doing this I did not really understand how partial correlations worked, or what semi-partial correlations were, or how multiple regression works exactly. I still don’t understand the last, but the others I will explain with example results.

Generated data

Since we are cool, we use R.

We begin by loading some stuff I need and generating some data:

#load libs
 library(pacman) #install this first if u dont have it
 p_load(Hmisc, psych, ppcor ,MASS, QuantPsyc, devtools)
 #Generate data
 df = as.data.frame(mvrnorm(200000, mu = c(0,0), Sigma = matrix(c(1,0.50,0.50,1), ncol = 2), empirical=TRUE))
[1] 0.5

The correlation from this is exactly .500000, as chosen above.

Then we add a gender specific effect:

df$gender = as.factor(c(rep("M",100000),rep("F",100000)))
 df[1:100000,"V2"] = df[1:100000,"V2"]+1 #add 1 to males
 cor(df[1:2])[1,2] #now theres error due to effect of maleness!

[1] 0.4463193

The correlation has been reduced a bit as expected. It is time to try and undo this and detect the real correlation of .5.

#multiple regression
 model = lm(V1 ~ V2+gender, df) #standard linear model
 coef(model)[-1] #unstd. betas
 lm.beta(model) #std. betas
V2        genderM 
0.4999994 -0.5039354
V2        genderM 
0.5589474 -0.2519683 

The raw beta coefficient is on spot, but the standardized is not at all. This certainly makes interpretation more difficult if the std. beta does not correspond in size with correlations.

#split samples
 df.genders = split(df, df$gender) #split by gender
 cor(df.genders$M[1:2])[1,2] #males
 cor(df.genders$F[1:2])[1,2] #females
[1] 0.4987116
[1] 0.5012883

These are both approximately correct. Note however that these are the values not for the total sample as before, but for each subsample. If the sumsamples differ in their correlation with, it will show up here clearly.

#partial correlation
df$gender = as.numeric(df$gender) #convert to numeric
partial.r(df, c(1:2),3)[1,2]
[1] 0.5000005

This is using an already made function for doing partial correlations. Partial correlations are calculated from the residuals of both variables. This means that they correlate whatever remains of the variables after everything that could be predicted from the controlling variable is removed.

#residualize both
df.r2 = residuals.DF(df, "gender")
[1] 0.5000005

This should be the exact same as the above, but done manually. And it is.

spcor(df)$estimate #hard to interpet output
spcor.test(df$V1, df$V2, df$gender)[1] #partial out gender from V2
spcor.test(df$V2, df$V1, df$gender)[1] #partial out gender from V1
V1 V2 gender
V1 1.0000000 0.4999994 -0.2253951
V2 0.4472691 1.0000000 0.4479416
gender -0.2253103 0.5005629 1.0000000
1 0.4999994
1 0.4472691

Semi-partial correlations aka part correlations are the same as above, except that only one of the two variables is residualized by the controlled variable. The above has two different already made functions for calculating these. The first works on an entire data.frame, and outputs semi-partials for all variables controlling for all other variables. The output is a bit tricky to read however, and there is no explanation in the help. One has to read it by looking at each row, this is the original variable correlated with the residualized variables in each col.
The two calls below are using the other function one where has to specify which two variables to correlate and which variables to control for. The second variable called is the one that gets residualized by the control variables.
We see that the results are as they should be. Controlling V1 for gender has about zero effect. This is because gender has no effect on this variable aside from a very small chance effect (r=-0.00212261). Controlling V2 for gender has the desired effect of returning a value very close to .5, as it should.

#residualize only V2
df.r1.V1= df.r2 #copy above
df.r1.V1$V1 = df$V1 #fetch orig V1
[1] 0.4999994
#residualize only V1
df.r1.V2= df.r2 #copy above
df.r1.V2$V2 = df$V2 #fetch orig V2
[1] 0.4472691

These are the two manual ways of doing the same as above. We get the exact same results, so that is good.

So where does this lead us? Well, apparently using multiple regression to control variables is a bad idea since it results in difficult to interpret results.


Remains to be done:

  • Admixture analysis (doing)
  • Proofreading and editing
  • Deciding how to control for age and scanner (technical question)


I explore a large (N≈1000), open dataset of brain measurements and find a general factor of brain size (GBSF) that covers all regions except possibly the amygdala (loadings near-zero, 3 out of 4 negative). It is very strongly correlated with total brain size volume and surface area (rs>.9). The factor was (near)identical across genders after adjustments for age were made (factor congruence 1.00).

GBSF had similar correlations to cognitive measures as did other aggregate brain size measures: total cortical area and total brain volume. I replicated the finding that brain measures were associated with parental income and educational attainment.



A recent paper by Noble et al (2015) has gotten considerable attention in the media. An interesting fact about the paper is that most of the data was published in the paper, perhaps inadvertently. I was made aware of this fact by an observant commenter, FranklinDMadoff, on the blog of James Thompson (Psychological Comments). In this paper I make use of the same data, revisit their conclusions as well do some of my own.

The abstract of the paper reads:

Socioeconomic disparities are associated with differences in cognitive development. The extent to which this translates to disparities in brain structure is unclear. We investigated relationships between socioeconomic factors and brain morphometry, independently of genetic ancestry, among a cohort of 1,099 typically developing individuals between 3 and 20 years of age. Income was logarithmically associated with brain surface area. Among children from lower income families, small differences in income were associated with relatively large differences in surface area, whereas, among children from higher income families, similar income increments were associated with smaller differences in surface area. These relationships were most prominent in regions supporting language, reading, executive functions and spatial skills; surface area mediated socioeconomic differences in certain neurocognitive abilities. These data imply that income relates most strongly to brain structure among the most disadvantaged children.

The results are not all that interesting, but the dataset is very large for a neuroscience study, the median sample size of median samples sizes in a meta-analysis of 49 meta-analysis is 116.5 (Button et al, 2013; based on the data in their Table 1). Furthermore, they provide a very large set of different, non-overlapping brain measurements which are useful for a variety of analyses, and they provide genetic admixture data which can be used for admixture mapping.

Why their results are as expected

The authors give their results (positive relationships between various brain size measures and parental educational and economic variables) environmental interpretations. For instance:

It is possible that, in these regions, associations between parent education and children’s brain surface area may be mediated by the ability of more highly educated parents to earn higher incomes, thereby having the ability to purchase more nutritious foods, provide more cognitively stimulating home learning environments, and afford higher quality child care settings or safer neighborhoods, with more opportunities for physical activity and less exposure to environmental pollutants and toxic stress3, 37. It will be important in the future to disambiguate these proximal processes by measuring home, family and other environmental mediators21.

However, one could also expect the relationship to be due to general cognitive ability (GCA; aka. general intelligence) and its relationship to favorable educational and economic outcomes, as well as brain measures. Figure 1 illustrates this expected relationship:

Figure 1

Figure 1 – Relationships between variables

The purple line is the one the authors are often arguing for based on their observed positive relationships. As can be seen in the figure, this positive relationship is also expected because of parental education/income’s positive relationship to parental GCA, which is related to parental brain properties which are highly heritable. Based on these well-known relationships, we can estimate some expected correlations. The true score relationship between adult educational attainment and GCA is somewhere around .56 (Strenze, 2007).

The relationship between GCA and whole brain size is around .24-.28, depending on whether one wants to use the unweighted mean, n-weighted mean or median, and which studies one includes of those collected by Pietschnig et al (2014). I used healthy samples (as opposed to clinical) and only FSIQ. This value is uncorrected for measurement error of the IQ test, typically assumed to be around .90. If we choose the conservative value of .24 and then correct with .90, we get .27 as an estimated true score correlation.

The heritability of whole brain size is very high. Bouchard (2014) summarized a few studies: one found cerebral total h^2 of = .89, another whole-brain grey matter .82, whole-brain white matter .87, and a third total brain volume .80. Perhaps there is some publication bias in these numbers, so we can choose .80 as an estimate. We then correct this for measurement error and get .89. None of the previous studies were corrected for restriction of range which is fairly common because most studies use university students (Henrich et al, 2010) who average perhaps 1 standard deviation above the population mean in GCA. If we multiply these numbers we get an estimate of r=.13 between parental education and total brain volume or a similar measure. As for income, the expected correlation is somewhat lower because the relationship between GCA and income is weaker, perhaps .23 (Strenze, 2007). This gives .05. However, Strenze did not account for the non-linearity of the income x GCA relationship, so it is probably somewhat higher.

Initial analyses

Analysis was done in R. Code file, figures, and data are available in supplementary material

Collecting the data

The authors did not just publish one datafile with comments about the variables, but instead various excel files were attached to parts of the figures. There are 6 such files. They all contain the same number of cases and they overlap completely (as can be seen by the subjectID column). The 6 files however do not overlap completely in their columns and some of them have unique columns. These can all be merged into one dataset.

Dealing with missing data

The original authors dealt with this simply by relying on the complete cases only. This method can bias the results when the data is not missing completely at random. Instead, it is generally better to impute missing data (Sterne et al, 2009). Figure 1 shows the matrixplot of the data file.


The red areas mean missing data, except in the case of nominal variables which are for some reason always colored red (an error I think). Examining the structure of missing data showed that it was generally not possible to impute the data, since many cases were missing most of their values. One will have to exclude these cases. Doing so reduces the sample size from 1500 to 1068. The authors report having 1099 complete cases, but I’m not sure where the discrepancy arises.

Dealing with gender

Since males have much larger brain volumes than females, even after adjustment for body size, there is the question of how to deal with gender (no distinction is being made here between sex and gender). The original authors did this by regressing the effect out. However, in my experience, regression does not always accomplish this perfectly, so when possible one should just split the sample by gender and calculate results in each one-gender sample. One cannot do the sampling splitting when one is interested in the specific regression effect of gender, or when the resulting samples would be too small.

Dealing with age

This problem is tricky. The original authors used age and age2 to deal with age in a regression model. However, since I wanted to visualize the relationships between variables, this option was not useful to me because it would only give me the summary statistics with the effects of age, not the data. Instead, I calculated the residuals for all variables of interest after they were regressed on age, age2 and age3. The cubic age was used to further catch non-linear effects of age, as noted by e.g. Jensen (2006: 132-133).

Dealing with scanning site

One peculiar feature of the study not discussed by the authors was the relatively effect of different scanners on their results, see e.g. their Table 3. To avoid scanning site influencing the results, I also regressed this out (as a nominal variable with 13 levels).

Dealing with size

The dataset does not have size measures thus making it impossible to adjust for body size. This is problematic as it is known that body size correlates with GCA both within and between species. We are interested in differences in brain size holding body size equal. This cannot be done in the present study.

Factor analyzing brain size measurements

Why would one want to factor analyze brain measures?

The short answer is the same as that to the question: why would one want to factor analyze cognitive ability data? The answer: To explore the latent relationships in the data not immediately obvious. A factor analysis will reveal whether there is a general factor of some domain, which can be a theoretically very important discovery (Dalliard, 2013; Jensen, 1998:chapter 2). If there is no general factor, this will also be revealed and may be important as well. This is not to say that general factors or the lack thereof are the only interesting thing about the factor structure, multifactor structures are also interesting, whether orthogonal (uncorrelated) or as part of a hierarchical solution (Jensen, 2002).

The long answer is that human psychology is fundamentally a biological fact, a matter of brain physics and chemistry. This is not to say that important relationships can not fruitfully be described better at higher-levels (e.g. cognitive science), but that ultimately the origin of anything mental is biology. This fact should not be controversial except among the religious, for it is merely the denial of dualism, of ghosts, spirits, gods and other immaterial beings. As Jensen (1997) wrote:

Although the g factor is typically the largest component of the common factor variance, it is the most “invisible.” It is the only “factor of the mind” that cannot possibly be described in terms of any particular kind of knowledge or skill, or any other characteristics of psychometric tests. The fact that psychometric g is highly heritable and has many physical and brain correlates means that it is not a property of the tests per se. Rather, g is a property of the brain that is reflected in observed individual differences in the many types of behavior commonly referred to as “cognitive ability” or “intelligence.” Research on the explanation of g, therefore, must necessarily extend beyond psychology and psychometrics. It is essentially a problem for brain neurophysiology. [my emphasis]

If GCA is a property of the brain, or at least that there is an analogous general brain performance factor, it may be possible to find it with the same statistical methods that found the GCA. Thus, to find it, one must factor analyze a large, diverse sample of brain measurements that are known to correlate individually with GCA in the hope that there will be a general factor which will correlate very strongly with GCA. There is no guarantee as I see it that this will work, as I see it, but it is something worth trying.

In their chapter on brain and intelligence, Colom and Thompson (2011) write:

The interplay between genes and behavior takes place in the brain. Therefore, learning the language of the brain would be crucial to understand how genes and behavior interact. Regarding this issue, Kovas and Plomin (2006) proposed the so -called “ generalist genes ” hypothesis, on the basis of multivariate genetic research findings showing significant genetic overlap among cognitive abilities such as the general factor of intelligence ( g ), language, reading, or mathematics. The hypothesis has implication for cognitive neuroscience, because of the concepts of pleiotropy (one gene affecting many traits) and polygenicity (many genes affecting a given trait). These genetic concepts suggest a “ generalist brain ” : the genetic influence over the brain is thought to be general and distributed.

Which brain measurements have so far been found to correlate with GCA (or its IQ proxy)?

Below I have compiled a list of brain measurements that have at some point been found to be correlated with GCA IQ scores:

  • Brain evoked potentials: habituation time (Jensen, 1998:155)
  • Brain evoked potentials: complexity of waveform (Deary and Carol, 1997)
  • Brain intracellular pH-level (Jensen, 1998:162)
  • Brain size: total and brain regions (Jung and Haier, 2007)
  • Of the above, grey matter and white matter separate
  • Cortical thickness (Deary et al, 2010)
  • Cortical development (Shaw, P. et al. 2006)
  • Nerve conduction velocity (Deary and Carol, 1997)
  • Brain wave (EEG) coherence (Jensen, 2002)
  • Event related desynchronization of brain waves (Jensen, 2002)
  • White matter lesions (Turken et al, 2008)
  • Concentrations of N-acetyl aspartate (Jung, et al. 2009)
  • Water diffusion parameters (Deary et al, 2010)
  • White matter integrity (Deary et al, 2010)
  • White matter network efficiency (Li et al. 2009)
  • Cortical glucose metabolic rate during mental activity / Neural efficiency (Neubauer et al, 2009)
  • Uric acid level (Jensen, 1998:162)
  • Density of various regions (Frangou et al 2004)
  • White matter fractional anisotropy (Navas‐Sánchez et al 2014; Kim et al 2014)
  • Reliable responding to changing inputs (Euler et al, 2015)

Most of the references above lead to the reviews I relied upon (Deary and Carol, 1997; Jensen, 1998, 2002; Deary et al, 2010). There are surely more, and probably a large number of the above are false-positives. Some I could not find a direct citation for. We cannot know which are false positives until large datasets are compiled with these measures as well as a large number of cognitive tests. A simple WAIS battery won’t suffice, there needs to be elementary cognitive tests too, and other tests that vary more in content, type and g-loading. This is necessary if we are to use the method of correlated vectors as this does not work well without diversity in factor indicators. It is also necessary if we are to examine non-GCA factors.

My hypothesis is that if there is a general brain factor, then it will have a hierarchical structure similar to GCA. Figure 2 shows a hypothetical structure of this.

Figure 2

Notes: Where squares at latent variables and circles are observed variables. I am aware this is opposite of normal practice (e.g. Beaujean, 2014) but text is difficult to fit into circles.

Of these, the speed factor has to do with speed of processing which can be enhanced in various ways (nerve conduction velocity, higher ‘clock’ frequency). Efficiency has to do with efficient use of resources (primarily glucose). Connectivity has to do with better intrabrain connectivity, either by having more connections, less problematic connections or similar. Size has to do with having more processing power by scaling up the size. Some areas may matter more than others for this. Integrity has to do with withstanding assaults, removing garbage (which is known to be the cause of many neurodegenerative diseases) and the like. There are presumably more factors, and some of mine may need to be split.

Previous studies and the present study

Altho factor analysis is common in differential psychology and related fields, it is somewhat rare outside of those. And when it is used, it is often done in ways that are questionable (see e.g. controversy surrounding Hampshire et al (2012): Ashton et al (2014a), Hampshire et al (2014), Ashton et al (2014b), Haier et al (2014a), Ashton et al (2014c), Haier et al (2014b)). On the other hand, factor analytic methods have been used in a surprisingly diverse collection of scientific fields (Jöreskog 1996; Cudeck and MacCallum, 2012).

I am only familiar with one study applying factor analysis to different brain measures and it was a fairly small study at n=132 (Pennington et al, 2000). They analyzed 13 brain regions and reported a two-factor solution. It is worth quoting their methodology section:

Since the morphometric analyses yield a very large number of variables per subject, we needed a data reduction strategy that fit with the overall goal of exploring the etiology of individual differences in the size of major brain structures. There were two steps to this strategy: (1) selecting a reasonably small set of composite variables that were both comprehensive and meaningful; and (2) factor analyzing the composite variables. To arrive at the 13 composite variables discussed earlier, we (1) picked the major subcortical structures identified by the anatomic segmentation algorithms, (2) reduced the set of possible cortical variables by combining some of the pericallosal partitions as described earlier, and (3) tested whether it was justifiable to collapse across hemispheres. In the total sample, there was a high degree of correlation (median R=.93, range=.82-.99) between the right and left sides of any given structure; it thus seemed reasonable to collapse across hemispheres in creating composites. We next factor-analyzed the 13 brain variables in the total sample of 132 subjects, using Principal Components factor analysis with Varimax rotation (Maxwell & Delaney, 1990). The criteria for a significant factor was an eigenvalue>l.0, with at least two variables loading on the factor.

The present study makes it possible to perform a better analysis. The sample is about 8 times larger and has 27 non-overlapping measurements of brain size, broadly speaking. The major downside of the variables in the present study is that the cerebral is not divided into smaller areas as done in their study. Given the very large sample size, one could use 100 variables or more.

The available brain measures are:

  1. cort_area.ctx.lh.caudalanteriorcingulate
  2. cort_area.ctx.lh.caudalmiddlefrontal
  3. cort_area.ctx.lh.fusiform
  4. cort_area.ctx.lh.inferiortemporal
  5. cort_area.ctx.lh.middletemporal
  6. cort_area.ctx.lh.parsopercularis
  7. cort_area.ctx.lh.parsorbitalis
  8. cort_area.ctx.lh.parstriangularis
  9. cort_area.ctx.lh.rostralanteriorcingulate
  10. cort_area.ctx.lh.rostralmiddlefrontal
  11. cort_area.ctx.lh.superiortemporal
  12. cort_area.ctx.rh.caudalanteriorcingulate
  13. cort_area.ctx.rh.caudalmiddlefrontal
  14. cort_area.ctx.rh.fusiform
  15. cort_area.ctx.rh.parsopercularis
  16. cort_area.ctx.rh.parsorbitalis
  17. cort_area.ctx.rh.parstriangularis
  18. cort_area.ctx.rh.rostralanteriorcingulate
  19. cort_area.ctx.rh.rostralmiddlefrontal
  20. vol.Left.Cerebral.White.Matter
  21. vol.Left.Cerebral.Cortex
  22. vol.Left.Hippocampus
  23. vol.Left.Amygdala
  24. vol.Right.Cerebral.White.Matter
  25. vol.Right.Cerebral.Cortex
  26. vol.Right.Hippocampus
  27. vol.Right.Amygdala

I am not expert in neuroscience, but as far as I know, the above measurements are independent and thus suitable for factor analysis. They reported additional aggregate measures such as total surface area and total volume. They also reported total cranial volume, which permits the calculations of another two brain measurements: the non-brain volume of the cranium (subtracting total brain volume from total intracranial volume), and the proportion of intracranial volume used for brain.

The careful reader has perhaps noticed something bizarre about the dataset, namely that there is an unequal number of left hemisphere (“lh”) and right hemisphere (“rh”) regions (11 vs. 8). I have no idea why this is, but it is somewhat problematic in factor analysis since this weights some variables twice as well as weighting the left side a bit more.

The present dataset is inadequate for properly testing the general brain factor hypothesis because it only has measurements from one domain: size. The original authors may have more measurements they did not publish. However, one can examine the existence of the brain size factor, as a prior test of the more general hypothesis.

Age and overall brain size

As an initial check, I plotted the relationship between total brain size measures and age. These are shown in Figure 3 and 4.

Figure 3 Figure 4

Curiously, these show that the size increase only occurs up to about age 8 and 10, or so. I was under the impression that brain size continued to go up until the body in general stopped growing, around 15-20 years. This study does not appear to be inconsistent with others (e.g. Giedd, 1999). The relationship is clearly non-linear, so one will need to use the age corrections described above. To see if the correction worked, we plot the total size variables and age. There should be near-zero correlation. Results in Figures 5 and 6.

Figure 5 Figure 6

Instead we still see a slight correlation for both genders, both apparently due to a single outlier. Very odd. I examined these outliers (IDs: P0009 and P0010) but did not see anything special about them. I removed them and reran the residualization from the original data. This produced new outliers similar to before (with IDs following them). When I removed them, new ones. I figure it is due to some error with the residualization process. Indeed, a closer look revealed that the largest outliers (positive and negative) were always the first two indexes. I thus removed these before doing more analyses. The second largest outliers had no particular index. I tried removing more age outliers, but it was not possible to completely remove the correlations between age and the other variables (usually remained near r=.03). Figure 6a shows the same as Figure 6 just without the two outliers.

Figure 6a

The genders are somewhat displaced on the age variable, but if one looks at the x-axis, one an see that this is in fact a very, very small difference.

General brain size factor with and without residualization

Results for the factor analysis without residualization are shown in Figure 7. I used the fa() function from the psych package with default settings: 1 factor extracted with the minimum residuals method. Previous studies have shown factor extraction method to be of little importance as long as it isn’t principal components with a smaller number of variables (Kirkegaard, 2014).

Figure 7

We see that the factors are quite similar (factor congruence .95) but that the male factor is quite a bit stronger (var% M/F 26 vs. 16). This suggests that the factor either works differently in the genders, or there is error in the results. If it is error, we should see an improvement after removing some of it. Figure 8 shows the same plot using the residualized data.

Figure 8

The results were more similar now and stronger for both genders (var% M/F = 34 vs. 33).

The amygdala results are intriguing, suggesting that this region does not increase in size along with the rest of the brain. The right amygdala even had negative loadings in both genders.

Using all that’s left

The next thing one might want to do is extract multiple factors. I tried extracting various solutions with nfactors 3-5. These however are bogus models due to the near-1 correlation between the brain sides. This results in spurious factors that load on just 2 variables (left and right versions) with loadings near 1. One could solve this by either averaging those with 2 measurements, or using only those from the left side. It makes little difference because they correlate so highly. It should be noted tho that doing this means one can’t see any lateralization effects such as that suggested for the right amygdala.

I redid all the results using the left side variables only. Figure 9 shows the results.

Figure 9

Now all regions had positive loadings and the var% increased a bit for both genders to 36/36. Factor congruence was 1.00, even for the non-residualized data. It thus seems that the missing measures of the right side or the use of near-doubled measures had a negative impact on the results as well.

One can calculate other measures of factor strength/internal reliability, such as the average intercorrelation, Cronbach’s alpha, Guttman’s G6. These are shown in Table 1.

Table 1- Internal reliability measures
Sample Mean r Alpha (raw) Alpha (std.) G6
Male .33 .48 .88 .90
Female .34 .45 .89 .90


Multiple factors

We are now ready to explore factor solutions with more factors. Three different methods suggested extracted at most 5 factors both datasets (using nScree() from nFactors package). I extracted solutions for 2 to 6 factors for each dataset, the last included by accident. All of these were extracted with oblique rotation method of oblimin thus possibly returning correlated factors. The prediction from a hierarchical model is clear: factors extracted in this way should be correlated. Figures 10 to 14 show the factor loadings of these solutions.

Figure 10 Figure 11 Figure 12 Figure 13

Figure 14

So it looks like results very pretty good with 4 factors and not too good with the others. The problem with this method is that the factors extracted may be similar/identical but not in the same order and with the same name. This means that the plots above may plot the wrong factors together which defeats the entire purpose. So what we need is an automatic method of pairing up the factors correctly if possible. The exhaustive method is trying all the pairings of factors for each number of factors to extract, and then calculating some summary metrics or finding the best overall pairing combination. This would involve quite a lot of comparisons, since e.g. one can pair up set 2 sets of, say, 5 factors in 5*4/2 ways (10).

I settled for a quicker solution. For each factor solution pair, I calculated all the cross-analysis congruence factors. Then for each factor, I found the factor from the other analysis it had the highest congruence with and saved this information. This method can miss some okay but not great solutions, but I’m not overly concerned about those. In a good fit, the factors found in each analysis should map 1 to 1 to each other such that their highest congruence is with the analog factor from the other analysis.

From this information, I calculated the mean of the best congruence pairs, the minimum, and whether there was a mismatch. A mismatch occurs when two or more factors from one analysis maps to (has the highest congruence) with the same factor from the other analysis. I calculated three metrics for all the analyses performed above. The results are shown in Table 2.

Table 2 – Cross-analysis comparison metrics
Factors.extracted mean.best.cong min.best.cong factor.mismatch
2 0.825 0.73 FALSE
3 0.713 0.37 TRUE
4 0.960 0.93 FALSE
5 0.720 0.35 TRUE
6 0.765 0.58 FALSE


As can be seen, the two analyses with 4 factors were a very good match. Those with 3 and 5 terrible as they produced factor mismatches. The analyses with 2 and 6 were also okay.

The function for going thru all the oblique solutions for two samples also returns the necessary information to match up the factors if they need reordering. If there is a mismatch, this operation is nonsensical, so I won’t re-do all the plots. The plot above with 4 factors just happens to already be correctly ordered. This however need not be the case. The only plot that needs to be redone is that with 6 factors. It is shown in Figure 15.

Figure 15

Compare with figure 14 above. One might wonder whether the 4 or 6 factor solutions are the best. In this case, the answer is the 4 factor solutions because the female 6 factor solution is illegal — one factor loading is above 1 (“a Heywood case”). At present, given the relatively few regional measures, and the limitation to only volume and surface measures, I would not put too much effort into theorizing about the multifactor structure found so far. It is merely a preliminary finding and may change drastically when more measures are added or measures are sampled differently.

A more important finding from all the multifactor solutions was that all produced correlated factors, which indicates a general factor.

Aggregate measures and the general brain size factor

So, the general brain size factor (GBSF) may exist, but is it useful? At first, we may want to correlate the various aggregate variables. Results are in Table 3.

Table 3 – Correlations between aggregate brain measures
cort_area.ctx.total cort_area.ctx.lh.total vol.WholeBrain vol.IntracranialVolume GBSF
cort_area.ctx.total 0.997 0.869 0.746 0.953
cort_area.ctx.lh.total 0.997 0.867 0.751 0.953
vol.WholeBrain 0.832 0.832 0.822 0.923
vol.IntracranialVolume 0.638 0.642 0.798 0.776
GBSF 0.950 0.950 0.905 0.711

Notes: Correlations above diagonal are males, below females.

The total areas of the brain are almost symmetrical: the correlation of the total surface area and left side only is a striking .997. Intracranial volume is a decent proxy (.822) for whole brain volume, but is somewhat worse for total surface area (.746). GBSF has very strong correlations with the surface areas (.95), but not quite as strong as the analogous situation in cognitive data: IQ and extracted general factor (GCA factor) usually correlate .99 with a reasonable sample of subtests: Ree and Earles (1991) reported that an average GCA factor correlated .991 with an unweighted sum score in a sample of >9k, Kirkegaard (2014b) found a .99 correlation between extracted GCA and an unweighted sum in a Dutch university sample of ~500.

Correlations with cognitive measures

The authors have data for 4 cognitive tests, however, data are only public for 2 of them. These are in the authors’ words:

Flanker inhibitory control test (N = 1,074).
The NIH Toolbox Cognition Battery version of the flanker task was adapted from the Attention Network Test (ANT). Participants were presented with a stimulus on the center of a computer screen and were required to indicate the left-right orientation while inhibiting attention to the flankers (surrounding stimuli). On some trials the orientation of the flankers was congruent with the orientation of the central stimulus and on the other trials the flankers were incongruent. The test consisted of a block of 25 fish trials (designed to be more engaging and easier to see to make the task easier for children) and a block of 25 arrow trials, with 16 congruent and 9 incongruent trials in each block, presented in pseudorandom order. Participants who responded correctly on 5 or more of the 9 incongruent trials then proceeded to the arrows block. All children age 9 and above received both the fish and arrows blocks regardless of performance. The inhibitory control score was based on performance on both congruent and incongruent trials. A two-vector method was used that incorporated both accuracy and reaction time (RT) for participants who maintained a high level of accuracy (>80% correct), and accuracy only for those who did not meet this criteria. Each vector score ranged from 0 to 5, for a maximum total score of 10 (M = 7.67, s.d. = 1.86).
List sorting working memory test (N = 1,084).
This working memory measure requires participants to order stimuli by size. Participants were presented with a series of pictures on a computer screen and heard the name of the object from a speaker. The test was divided into the One-List and Two-List conditions. In the One-List condition, participants were told to remember a series of objects (food or animals) and repeat them in order, from smallest to largest. In the Two-List condition, participants were told to remember a series of objects (food and animals, intermixed) and then again report the food in order of size, followed by animals in order of size. Working memory scores consisted of combined total items correct on both One-List and Two-List conditions, with a maximum of 28 points (M = 17.71, s.d. = 5.39).

I could not locate a factor analytic study for the Flanker test, so I don’t know how g-loaded it is. Working memory (WM) is known to have a strong relationship to GCA (Unsworth et al, 2014). The WM variable should probably be expected to be the most g-loaded of the two. The implication given the causal hypothesis of brain size for GCA is that the WM test should show higher correlations to the brain measures. Figures X and X show the histograms for the cognitive measures.


Note that the x-values do not have any interpretation as they are the residual raw values, not raw values. For the Flanker test, we see that it is bimodal. It seems that a significant part of the sample did not understand the test and thus did very poorly. One should probably either remove them or use a non-parametric measure if one wanted to rely on this variable. I decided to remove them since the sample was sufficiently large that this wasn’t a big problem. The procedure reduced the skew from -1.3/-1.1 to -.2/-.1 respectively for the male and female samples. The sample sizes were reduced from 548/516 to 522/487 respectively. One could plausibly combine them into one measure which would perhaps be a better estimate of GCA than either of them alone. This would be the case if their g-loading was about similar. If however, one is much more g-loaded than the other, it would degrade the measurement towards a middle level. I combined the two measures by first normalizing them (to put them on the same scale) and then averaging them.

Given the very high correlations between the GBSF of these data and the other aggregate measures, it is not expected that the GBSF will correlate much more strongly with cognitive measures than the other aggregate brain measures. Table X shows the correlations.

Table X – Correlations between cognitive measures and aggregate brain size measures
Variable WM Flanker Combined WM Flanker Combined
Males Females
Flanker 0.407 0.393
WM.Flanker.mean 0.830 0.847 0.824 0.845
cort_area.ctx.total 0.302 0.138 0.235 0.236 0.201 0.237
cort_area.ctx.lh.total 0.302 0.137 0.235 0.239 0.203 0.238
vol.WholeBrain 0.263 0.103 0.201 0.158 0.120 0.146
vol.IntracranialVolume 0.213 0.101 0.170 0.154 0.101 0.137
GBSF 0.311 0.147 0.252 0.223 0.181 0.218


As for the GBSF, given that it is a ‘distillate’ (Jensen’s term), one would expect it to have slightly higher correlations with the cognitive measures than the merely unweighted ‘sum’ measures. This was the case for males, but not females. In general, the female correlations were weaker, especially the whole brain volume x WM (.263 vs. .158). Despite the large sample sizes, this difference is not very certain, the 95% confidence intervals are -.01 to .22. A larger sample is necessary to examine this question. The finding is intriguing is that if real, it would pose an alternative solution to the Ankney-Rushton anomaly, that is, the fact that males have greater brain size and this is related to IQ scores, but do not consistently perform better on IQ tests (Jackson and Rushton, 2006). Note however that the recent large meta-analysis of brain size x IQ studies did not find an effect of gender, so perhaps the above results are a coincidence (Pietschnig et al 2014).

We also see that the total cortical area variables were stronger correlates of cognitive measures than whole brain volume, but a larger sample is necessary to confirm this pattern.

Lastly, we see a moderately strong correlation between the two cognitive measures (≈.4). The combined measure was a weaker correlate of the criteria variables, which is what is expected if the Flanker test was a relatively weaker test of GCA than the WM one.

Correlations with parental education and income

It is time to revisit the results reported by the original authors, namely correlations between educational/economic variables and brain measures. I think the correlations between specific brain regions and criteria variables is mostly a fishing expedition of chance results (multiple testing) and of no particular interest unless strong predictions can be made before looking at the data. For this reason, I present only correlations with the aggregate brain measures, as seen in Table X.

Table x – Correlations between educational/economic variables and other variables
Variable ED ln_Inc Income ED ln_Inc Income
Males Females
WM 0.131 0.192 0.175 0.170 0.229 0.174
Flanker 0.163 0.180 0.188 0.118 0.131 0.106
WM.Flanker.mean 0.168 0.215 0.206 0.178 0.215 0.165
cort_area.ctx.total 0.104 0.217 0.207 0.128 0.173 0.154
cort_area.ctx.lh.total 0.108 0.215 0.208 0.133 0.170 0.152
vol.WholeBrain 0.103 0.190 0.195 0.064 0.112 0.078
vol.IntracranialVolume 0.126 0.157 0.159 0.086 0.104 0.100
GBSF 0.109 0.206 0.204 0.100 0.157 0.137
ED 0.559 0.542 0.561 0.513
ln_Inc 0.559 0.866 0.561 0.855
Income 0.542 0.866 0.513 0.855


Here the correlations of the combined cognitive measure was higher than WM, unlike before, so perhaps the diagnosis from before was wrong. In general, the correlations of income and brain measures were stronger than that for education. This is despite the fact that GCA is more strongly correlated to educational attainment than income. This was however not the same in this sample: correlations of WM and Flanker were stronger with the economic variables. Perhaps there is more range restriction in the educational variable than the income one. An alternative environmental interpretation is that it is the affluence that causes the larger brains.

If we recall the theoretic predictions of the strength of the correlations, the incomes are stronger than expected (actual .19/.09 M/F, predicted about .05), while the educational ones are a bit weaker than expected (actual .1/.6, predicted about .13). However, the sample sizes are not larger enough for these results to be certain enough to question the theory.

Racial admixture

To me surprise, the sample had racial admixture data. This is surprising because such data has been available to testing the genetic hypothesis of group differences for many years, apparently without anyone publishing something on the issue. As I argued elsewhere, this is odd given that a good dataset would be able to decisively settle the ‘race and intelligence’ controversy (Dalliard, 2014; Rote and Rodgers, 2005; Rushton and Jensen, 2005). It is actually very good evidence for the genetic hypothesis because if it was false, and these datasets showed it, it would have been a great accomplishment for a mainstream scientist to publish a paper decisively showing that it was indeed false. However, if it was true, then any mainstream scientist could not publish it without risking personal assaults, getting fired and possibly pulled in court as were academics who previously researched that topic (Gottfredson, 2005; Intelligence 1998; Nyborg 2011; 2003).

The genomic data however appeared to be an either/or (1 or 0) variable in the released data files. Oddly, some persons had no value for any racial group. It turns out that the data was merely rounded in the spreadsheet file. This explained why some persons had 0 for all groups: These persons did not belong at least 50% to any racial group, and thus they were assigned a 0 in every case.

I can think of two ways to count the number of persons in the main categories. One can count the total ‘summed’ persons. In this way, if person A has 50% ancestry from race R, and person B has 30%, this would sum to .8 persons. One can think of it as the number of pure-breed persons’ worth of ancestry from that that group. Another way is to count everybody as 1 who is above some threshold for ancestry. I chose to use 20% and 80% for thresholds, which correspond with persons with substantial ancestry from that racial cluster, and persons with mostly ancestry from that cluster. One could choose other values of course, and there is a degree of arbitrariness, but it is not important what the particular values are.

Results are in Table X.

Racial group European African Amerindian East Asian Oceanian Central Asia Sum
Summed ancestry ‘persons’ 686.364 134.2714 48.31457 163.49868 8.59802 26.95408 1068.00075
Persons with >20% 851 187 89 238 8 30 1403
Persons with >80% 647 105 3 121 0 21 897


Note that the number 1068 is the exact number of persons in the complete sample, which means that the summed ancestry for all groups has an error of a mere .00075.

Another way of understanding the data is to plot histograms of each racial group. These are shown below in Figures X to X.

Race_European_histogramRace_African_histogram Race_Amerindian_histogram Race_East_Asian_histogram   Race_Oceanian_histogramRace_Central_Asian_histogram


Since European ancestry is the largest, the other plots are mostly empty except for the 0% bar. But we do see a fair amount of admixture in the dataset.

Regression, residualization, correlation and power

There are a couple of different methods one could use to examine the admixture data. A simple correlation is justified when dealing with a group that only has 2 sources of ancestry. This is the easiest case to handle. For this to work, the groups most have a different genotypic mean of the trait in question (GCA and brain size variables in this case) and there must be a substantially admixtured population. Even given a large hypothesized genotypic difference, the expected correlation is actually quite small. For African Americans (such as those in the sample), their European ancestry% is about 15-25% depending on the exact sample. The standard deviation of their European ancestry% is not always reported, but one can calculate it if one has some data, which we do.

The first problem with this dataset is that there are no sociological race categories (“white”, “African American”, “Hispanic”, “Asian” etc.), but only genomic data. This means that to get an African American subsample, we must create one based on actual actual ancestry. There are two criteria that needs to be met for inclusion in that group: 1) the person must be substantially African, 2) the person must be mostly a mix of European and African ancestry. Going with the values from before, this means that the person must be at least 20% African, and at least 80% combined European and African.

Dealing with scanner and site

There are a variety of ways to use the data and they may or may not give similar results. First is the question of which variables to control for. In the earlier sections of this paper, I controlled for Age, Age2, Age3, Scanner (12 different). For producing the above ancestry plots and results I did not control for anything. Controlling the ancestry variables for scanner is problematic as people from different races live in different places. Controlling for this thus removes the racial differences for no reason. One could similarly control for site where the scanner is (I did not do this earlier). We can compare this to scanner by a contingency table, as shown in Table X below:

Table X – Contingency table of scanner site and scanner #
Site/scanner 0 1 10 11 12 2 3 4 5 6 7 8 9
Cornel 0 0 0 96 0 0 0 0 0 0 0 0 0
Davis 0 0 0 0 0 0 0 0 0 0 114 0 0
Hawaii 0 0 0 0 0 0 0 0 0 202 0 0 0
KKI 0 0 0 0 0 0 103 0 0 0 0 0 0
MGH 0 0 0 0 0 0 0 0 115 0 0 0 13
UCLA 0 0 27 0 22 0 0 10 0 0 0 0 0
UCSD 109 93 0 0 0 0 0 0 0 0 0 0 0
UMMS 0 0 0 0 0 56 0 0 0 0 0 0 0
Yale 0 0 0 0 0 0 0 0 0 0 0 108 0


As we can see, these are clearly inter-dependent, given the obvious fact that the scanners have a particular location and was not moved around (all columns have only 1 cell with value>0). Some sites however have multiple scanners, some have only one. E.g. UCSD has two scanners (#0 and #1), while KKI has only one (#3).

Controlling for scanner however makes sense if we are looking at brain size variables, as this removes differences between measurements due to differences in the scanning equipment or (post-)processing. So perhaps one would want to control brain measurements for scanner and age effects, but only control the remaining variables for age affects.

Dealing with gender

As before

To be continued…



  • Ashton, M. C., Lee, K., & Visser, B. A. (2014a). Higher-order g versus blended variable models of mental ability: Comment on Hampshire, Highfield, Parkin, and Owen (2012). Personality and Individual Differences, 60, 3-7.
  • Ashton, M. C., Lee, K., & Visser, B. A. (2014b). Orthogonal factors of mental ability? A response to Hampshire et al. Personality and Individual Differences, 60, 13-15.
  • Ashton, M. C., Lee, K., & Visser, B. A. (2014c). Further response to Hampshire et al. Personality and Individual Differences, 60, 18-19.
  • Beaujean, A. A. (2014). Latent Variable Modeling Using R: A Step by Step Guide: A Step-by-Step Guide. Routledge.
  • Bouchard Jr, T. J. (2014). Genes, Evolution and Intelligence. Behavior genetics, 44(6), 549-577.
  • Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376.
  • Colom, R., & Thompson, P. M. (2011). Intelligence by Imaging the Brain. The Wiley-Blackwell handbook of individual differences, 3, 330.
  • Cudeck, R., & MacCallum, R. C. (Eds.). (2012). Factor analysis at 100: Historical developments and future directions. Routledge.
  • Dalliard, M. (2013). Is Psychometric g a Myth?. Human Varieties.
  • Dalliard, M. (2014). The Elusive X-Factor: A Critique of J. M. Kaplan’s Model of Race and IQ. Open Differential Psychology.
  • Deary, I. J., & Caryl, P. G. (1997). Neuroscience and human intelligence differences. Trends in Neurosciences, 20(8), 365-371.
  • Deary, I. J., Penke, L., & Johnson, W. (2010). The neuroscience of human intelligence differences. Nature Reviews Neuroscience, 11(3), 201-211.
  • Dekaban, A.S. and Sadowsky, D. (1978). Changes in brain weights during the span of human life: relation of brain weights to body heights and body weights, Ann. Neurology, 4:345-356.
  • Euler, M. J., Weisend, M. P., Jung, R. E., Thoma, R. J., & Yeo, R. A. (2015). Reliable Activation to Novel Stimuli Predicts Higher Fluid Intelligence. NeuroImage.
  • Frangou, S., Chitins, X., & Williams, S. C. (2004). Mapping IQ and gray matter density in healthy young people. Neuroimage, 23(3), 800-805.
  • Giedd, J. N., Blumenthal, J., Jeffries, N. O., Castellanos, F. X., Liu, H., Zijdenbos, A., … & Rapoport, J. L. (1999). Brain development during childhood and adolescence: a longitudinal MRI study. Nature neuroscience, 2(10), 861-863.
  • Gottfredson, L. S. (2005). Suppressing intelligence research: Hurting those we intend to help. In R. H. Wright & N. A. Cummings (Eds.), Destructive trends in mental health: The well-intentioned path to harm (pp. 155-186). New York: Taylor and Francis.
  • Haier, R. J., Karama, S., Colom, R., Jung, R., & Johnson, W. (2014a). A comment on “Fractionating Intelligence” and the peer review process. Intelligence, 46, 323-332.
  • Haier, R. J., Karama, S., Colom, R., Jung, R., & Johnson, W. (2014b). Yes, but flaws remain. Intelligence, 46, 341-344.
  • Hampshire, A., Highfield, R. R., Parkin, B. L., & Owen, A. M. (2012). Fractionating human intelligence. Neuron, 76(6), 1225-1237.
  • Hampshire, A., Parkin, B., Highfield, R., & Owen, A. M. (2014). Response to:“Higher-order g versus blended variable models of mental ability: Comment on Hampshire, Highfield, Parkin, and Owen (2012)”. Personality and Individual Differences, 60, 8-12.
  • Henrich, J., Heine, S. J., & Norenzayan, A. (2010). The weirdest people in the world?. Behavioral and brain sciences, 33(2-3), 61-83.
  • Intelligence. (1998). Special issue dedicated to Arthur Jensen. Volume 26, Issue 3.
  • Jackson, D. N., & Rushton, J. P. (2006). Males have greater g: Sex differences in general mental ability from 100,000 17-to 18-year-olds on the Scholastic Assessment Test. Intelligence, 34(5), 479-486.
  • Jensen, A. R., & Weng, L. J. (1994). What is a good g?. Intelligence, 18(3), 231-258.
  • Jensen, A. R. (1997). The psychometrics of intelligence. In H. Nyborg (Ed.), The scientific study of human nature: Tribute to Hans J. Eysenck at eighty. New York: Elsevier. Pp. 221—239.
  • Jensen, A. R. (1998). The g Factor: The Science of Mental Ability. Preager.
  • Jensen, A. R. (2002). Psychometric g: Definition and substantiation. The general factor of intelligence: How general is it, 39-53.
  • Jung, R. E. & Haier, R. J. (2007). The Parieto-Frontal Integration Theory (P-FIT) of intelligence: converging neuroimaging evidence. Behav. Brain Sci. 30, 135–154; discussion 154–187.
  • Jung, R. E. et al. (2009). Imaging intelligence with proton magnetic resonance spectroscopy. Intelligence 37, 192–198.
  • Jöreskog, K. G. (1996). Applied factor analysis in the natural sciences. Cambridge University Press.
  • Kim, S. E., Lee, J. H., Chung, H. K., Lim, S. M., & Lee, H. W. (2014). Alterations in white matter microstructures and cognitive dysfunctions in benign childhood epilepsy with centrotemporal spikes. European Journal of Neurology, 21(5), 708-717.
  • Kirkegaard, E. O. W. (2014a). The international general socioeconomic factor: Factor analyzing international rankings. Open Differential Psychology.
  • Kirkegaard, E. O. W. (2014b). The personal Jensen coefficient does not predict grades beyond its association with g. Open Differential Psychology.
  • Li, Y. et al. (2009). Brain anatomical network and intelligence. PLoS Comput. Biol. 5, e1000395.
  • Navas‐Sánchez, F. J., Alemán‐Gómez, Y., Sánchez‐Gonzalez, J., Guzmán‐De‐Villoria, J. A., Franco, C., Robles, O., … & Desco, M. (2014). White matter microstructure correlates of mathematical giftedness and intelligence quotient. Human brain mapping, 35(6), 2619-2631.
  • Neubauer, A. C. & Fink, A. (2009). Intelligence and neural efficiency. Neurosci. Biobehav. Rev. 33, 1004–1023.
  • Noble, K. G., Houston, S. M., Brito, N. H., Bartsch, H., Kan, E., Kuperman, J. M., … & Sowell, E. R. (2015). Family income, parental education and brain structure in children and adolescents. Nature Neuroscience.
  • Nyborg, H. (2003). The sociology of psychometric and bio-behavioral sciences: A case study of destructive social reductionism and collective fraud in 20th century academia. Nyborg H.(Ed.). The scientific study of general intelligence. Tribute to Arthur R. Jensen, 441-501.
  • Nyborg, H. (2011). The greatest collective scientific fraud of the 20th century: The demolition of differential psychology and eugenics. Mankind Quarterly, Spring Issue.
  • Pennington, B. F., Filipek, P. A., Lefly, D., Chhabildas, N., Kennedy, D. N., Simon, J. H., … & DeFries, J. C. (2000). A twin MRI study of size variations in the human brain. Journal of Cognitive Neuroscience, 12(1), 223-232.
  • Pietschnig, J., Penke, L., Wicherts, J. M., Zeiler, M., & Voracek, M. (2014). Meta-Analysis of Associations Between Human Brain Volume And Intelligence Differences: How Strong Are They and What Do They Mean?. Available at SSRN 2512128.
  • Ree, M. J., & Earles, J. A. (1991). The stability of g across different methods of estimation. Intelligence, 15(3), 271-278.
  • Rowe, D. C., & Rodgers, J. E. (2005). Under the skin: On the impartial treatment of genetic and environmental hypotheses of racial differences. American Psychologist, 60(1), 60.
  • Rushton, J. P., & Jensen, A. R. (2005). Thirty years of research on race differences in cognitive ability. Psychology, public policy, and law, 11(2), 235.
  • Shaw, P. et al. (2006). Intellectual ability and cortical development in children and adolescents. Nature 440, 676–679 (2006).
  • Sterne, J. A., White, I. R., Carlin, J. B., Spratt, M., Royston, P., Kenward, M. G., … & Carpenter, J. R. (2009). Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. Bmj, 338, b2393.
  • Strenze, T. (2007). Intelligence and socioeconomic success: A meta-analytic review of longitudinal research. Intelligence, 35(5), 401-426.
  • Turken, A. et al. (2008). Cognitive processing speed and the structure of white matter pathways: convergent evidence from normal variation and lesion studies. Neuroimage 42, 1032–1044
  • Unsworth, N., Fukuda, K., Awh, E., & Vogel, E. K. (2014). Working memory and fluid intelligence: Capacity, attention control, and secondary memory retrieval. Cognitive psychology, 71, 1-26.


This is a post in the on-going series of comments on studies of international/transracial adoption. A global genetic/hereditarian model of cognitive differences and their socioeconomic effects implies that adoptees from different populations/countries/regions should show the usual group differences in the above mentioned traits and outcomes, all else equal. All else is of course not equal since adoptees from different regions can be adopted at different ages, experience different environment leading up to the adoption, possibly experience different environments after adoption thru no cause of their own (discrimination), and so on. It is not a strict test: finding the usual group differences can be explained by non-genetic factors, and finding no differences or unexpected ones could be consistent with a genetic model given strong non-genetic effects such as differences in adoption practices between origin countries/regions/populations. However, were such differences to be found relatively consistently in sufficiently powered studies, it would be an important prediction verified by the genetic model, broadly speaking. The question thus remains: what do the studies show?

Bruce et al (2009) – Disinhibited Social Behavior Among Internationally Adopted Children

Their abstract reads:

Postinstitutionalized children frequently demonstrate persistent socioemotional difficulties. For example, some postinstitutionalized children display an unusual lack of social reserve with unfamiliar adults. This behavior, which has been referred to as indiscriminate friendliness, disinhibited attachment behavior, and disinhibited social behavior, was examined by comparing children internationally adopted from institutional care to children internationally adopted from foster care and children raised by their biological families. Etiological factors and behavioral correlates were also investigated. Both groups of adopted children displayed more disinhibited social behavior than the nonadopted children. Of the etiological factors examined, only the length of time in institutional care was related to disinhibited social behavior. Disinhibited social behavior was not significantly correlated with general cognitive ability, attachment-related behaviors, or basic emotion abilities. However, this behavior was negatively associated with inhibitory control abilities even after controlling for the length of time in institutional care. These results suggest that disinhibited social behavior might reflect underlying deficits in inhibitory control.

While this does not seem immediately relevant, the authors do investigate IQ. The study design is a three-way comparison between adoptees in foster homes, institutional care and non-adoptees. The sample are small: 40 x 40 x 40. These are children who spent most of their lives first in foster homes and were then adopted, children who spent most of their lives in institutional care and then adopted, and biological children of the adoptive families for comparison. The groups were not equal in origin composition:

However, because countries have institutional or foster systems to care for wards of the state, the institutional care and foster care groups differed in terms of country of origin. The institutional care group was primarily from Eastern Europe (45%) and China (43%), whereas the foster care group was primarily from South Korea (80%).

Their cognitive measure is:

General cognitive ability.To provide an estimate of the children’s general cognitive functioning, each child was administered the vocabulary and block design subtests of the Wechsler Intelligence Scale for Children, 3rd edition (Wechsler, 1991). These subtests are considered the best measures of verbal and nonverbal intelligence, respectively, and are highly correlated with the full scale intelligence quotient (Sattler, 1992). Raw scores on the subtests were converted into age-normed scaled scores. The scaled scores were then summed and transformed into full scale intelligence quotient equivalents.

They give the mean FSIQ by the above three groups, not origin groups. These are:

Group Institutional care Foster care Control
FSIQ 102.68 (16.25) 109.37 (12.93) 117.11 (15.88)

Note: numbers in parentheses are SDs.

The high scores is presumably due to the FLynn effect. The IQ test is from 1991, but the study is from 2009, so there has been 18 years for raw score gains compared to the normative sample. An alternative idea is that the families were just above average SES/IQ which boosts the IQ scores of younger children. Nearly 100% of the adoptive families were Caucasian (presumably European) and were part of the Minnesota International Adoption Project Registry. According to Wiki, the non-Hispanic % of this state is 83%, so Caucasians are a bit overrepresented. In general, these elevation effects are not important when comparing groups within a study.

I contacted the first author to ask if she would give me some more data, and she obliged:

Please find the requested information below. For each region of origin, I provided the number of children, mean, and standard deviation for the Block Design Standard Score, Vocabulary Standard Score, and full scale IQ equivalent. I did not provide these statistics for regions with less than 5 children. Please let me know if you have any questions

Origin N FSIQ Vocabulary Block design
Institutional care
Eastern Europe (e.g., Russia, Romania, Ukraine) 18 94.84 (15.070) 8.39 (2.615) 8.39 (2.615)
China 17 111.77 (13.774) 11.76 (2.969) 12.29 (3.274)
Foster care
South Korea 31/32 110.38 (12.457) 12.10 (2.970) 11.53 (2.940)
Guatemala 5 104.64 (16.609) 12.40 (3.050) 9.20 (3.033)

Notes: Sample size for S. Korea was 31, 31, 32. No explanation given. Numbers in parentheses are SDs.

Again we see that East Asians do well, altho not better than the control children (mean=117). Eastern European do less well, but it is hard to say the exact expected mean since it is not stated how many come from which countries. The IQ of Lynn and Vanhanen (2012) gives 91 as the IQ of Romania, 96.6 for Russia and 94.3 for Ukraine. The Guatemalans certainly do better than expected (79 IQ), but N=5, and the demographics of Guatemala is very mixed making it possible that the adoptees actually had predominantly European ancestry.

The email reply

Hello Emil,
Please find the requested information below. For each region of origin, I provided the number of children, mean, and standard deviation for the Block Design Standard Score, Vocabulary Standard Score, and full scale IQ equivalent. I did not provide these statistics for regions with less than 5 children. Please let me know if you have any questions
Thank you for your interest in our publication,

Institutional care group:
Eastern Europe (e.g., Russia, Romania, Ukraine)
N Mean Std. Deviation
Block Design Standard Score 18 9.83 3.884
Vocabulary Standard Score 18 8.39 2.615
IQ equivalent 18 94.84 15.070

N Mean Std. Deviation
Block Design Standard Score 17 12.29 3.274
Vocabulary Standard Score 17 11.76 2.969
IQ equivalent 17 111.77 13.774

Foster care group:
South Korea
N Mean Std. Deviation
Block Design Standard Score 32 11.53 2.940
Vocabulary Standard Score 31 12.10 2.970
IQ equivalent 31 110.38 12.457

N Mean Std. Deviation
Block Design Standard Score 5 9.20 3.033
Vocabulary Standard Score 5 12.40 3.050
IQ equivalent 5 104.64 16.609


Bruce, J., Tarullo, A. R., & Gunnar, M. R. (2009). Disinhibited social behavior among internationally adopted children. Development and psychopathology, 21(01), 157-171.

I reanalyze data reported by Richard Lynn in a 1979 paper concerning IQ and socioeconomic variables in 12 regions of the United Kingdom as well as Ireland. I find a substantial S factor across regions (66% of variance with MinRes extraction). I produce a new best estimate of the G scores of regions. The correlation of this with the S scores is .79. The MCV with reversal correlation is .47.

The interdisciplinary academic field examining the effect of general intelligence on large scale social phenomena has been called social ecology of intelligence by Richard Lynn (1979, 1980) and sociology of intelligence by Gottfredson (1998). One could also call it cognitive sociology by analogy with cognitive epidemiology (Deary, 2010; Special issue in Intelligence Volume 37, Issue 6, November–December 2009; Gottfredson, 2004). Whatever the name, it is a field that has received renewed attention recently. Richard Lynn and co-authors report data on Italy (Lynn 2010a, 2010b, 2012a, Piffer and Lynn 2014, see also papers by critics), Spain (Lynn 2012b), China (Lynn and Cheng, 2013) and India (Lynn and Yadav, 2015). Two of his older studies cover the British Isles and France (Lynn, 1979, 1980).

A number of my recent papers have reanalyzed data reported by Lynn, as well as additional data I collected. These cover Italy, India, United States, and China (Kirkegaard 2015a, 2015b, 2015c, 2015d). This paper reanalyzes Lynn’s 1979 paper.

Cognitive data and analysis

Lynn’s paper contains 4 datasets for IQ data that covers 11 regions in Great Britain. He further summarizes some studies that report data on Northern Ireland and the Republic of Ireland, so that his cognitive data covers the entire British Isles. Lynn only uses the first 3 datasets to derive a best estimate of the IQs. The last dataset does not report cognitive scores as IQs, but merely percentages of children falling into certain score intervals. Lynn converts these to a mean (method not disclosed). However, he is unable to convert this score to the IQ scale since the inter-personal standard deviation (SD) is not reported in the study. Lynn thus overlooks the fact that one can use the inter-regional SD from the first 3 studies to convert the 4th study to the common scale. Furthermore, using the intervals one could presumably estimate the inter-personal SD, altho I shall not attempt this. The method for converting the mean scores to the IQ score is this:

  1. Standardize the values by subtracting the mean and dividing by the inter-regional SD.
  2. Calculate the inter-regional SD in the other studies, and find the mean of these. Do the same for the inter-regional means.
  3. Multiple the standardized scores by the mean inter-regional SD from the other studies and add the inter-regional mean.

However, I did not use this method. I instead factor analyzed the four 4 IQ datasets as given and extracted 1 factor (extraction method = MinRes). All factor loadings were strongly positive indicating that G could be reliably measured among the regions. The factor score from this analysis was put on the same scale as the first 3 studies by the method above. This is necessary because the IQs for Northern Ireland and the Republic of Ireland are given on that scale. Table 1 shows the correlations between the cognitive variables. The correlations between G and the 4 indicator variables are their factor loadings (italic).

Table 1 – Correlations between cognitive datasets
Vernon.navy Vernon.army Douglas Davis G Lynn.mean
Vernon.navy 1 0.66 0.92 0.62 0.96 0.92
Vernon.army 0.66 1 0.68 0.68 0.75 0.89
Douglas 0.92 0.68 1 0.72 0.99 0.93
Davis 0.62 0.68 0.72 1 0.76 0.74
G 0.96 0.75 0.99 0.76 1 0.96
Lynn.mean 0.92 0.89 0.93 0.74 0.96 1


It can be noted that my use of factor analysis over simply averaging the datasets had little effect. The correlation of Lynn’s method (mean of datasets 1-3) and my G factor is .96.

Socioeconomic data and analysis

Lynn furthermore reports 7 socioeconomic variables. I quote his description of these:

“1. Intellectual achievement: (a) first-class honours degrees. All first-class honours graduates of the year 1973 were taken from all the universities in the British Isles (with the exception of graduates of Birkbeck College, a London College for mature and part-time students whose inclusion would bias the results in favour of London). Each graduate was allocated to the region where he lived between the ages of 11 and 18. This information was derived from the location of the graduate’s school. Most of the data were obtained from The Times, which publishes annually lists of students obtaining first-class degrees and the schools they attended. Students who had been to boarding schools were written to requesting information on their home residence. Information from the Republic of Ireland universities was obtained from the college records.

The total number of students obtaining first-class honours degrees was 3477, and information was obtained on place of residence for 3340 of these, representing 96 06 per cent of the total.
There are various ways of calculating the proportions of first-class honours graduates produced by each region. Probably the most satisfactory is to express the numbers of firsts in each region per 1000 of the total age cohorts recorded in the census of 1961. In this year the cohorts were approximately 9 years old. The reason for going back to 1961 for a population base is that the criterion taken for residence is the school attended and the 1961 figures reduce the distorting effects of subsequent migration between the regions. However, the numbers in the regions have not changed appreciably during this period, so that it does not matter greatly which year is taken for picking up the total numbers of young people in the regions aged approximately 21 in 1973. (An alternative method of calculating the regional output of firsts is to express the output as a percentage of those attending university. This method yields similar figures.)

2. Intellectual achievement: (b) Fellowships of the Royal Society. A second measure of intellectual achievement taken for the regions is Fellowships of the Royal Society. These are well-known distinctions for scientific work in the British Isles and are open equally to citizens of both the United Kingdom and the Republic of Ireland. The population consists of all Fellows of the Royal Society elected during the period 1931-71 who were born after the year 1911. The number of individuals in this population is 321 and it proved possible to ascertain the place of birth of 98 per cent of these. The Fellows were allocated to the region in which they were born and the numbers of Fellows born in each region were then calculated per million of the total population of the region recorded in the census of 1911. These are the data shown in Table 2. The year 1911 was taken as the population base because the majority of the sample was born between the years 1911-20, so that the populations in 1911 represent approximately the numbers in the regions around the time most of the Fellows were born. (The populations of the regions relative to one another do not change greatly over the period, so that it does not make much difference to the results which census year is taken for the population base.)

3. Per capita income. Figures for per capita incomes for the regions of the United Kingdom are collected by the United Kingdom Inland Revenue. These have been analysed by McCrone (1965) for the standard regions of the UK for the year 1959/60. These results have been used and a figure for the Republic of Ireland calculated from the United Nations Statistical Yearbook.

4. Unemployment. The data are the percentages of the labour force unemployed in the regions for the year 1961 (Statistical Abstracts of the UK and of Ireland).

5. Infant mortality. The data are the numbers of deaths during the first year of life expressed per 1000 live births for the year 1961 (Registrar Generals’ Reports).

6. Crime. The data are offences known to the police for 1961 and expressed per 1000 population (Statistical Abstracts of the UK and of Ireland).

7. Urbanization. The data are the percentages of the population living in county boroughs, municipal boroughs and urban districts in 1961 (Census).”

Lynn furthermore reports historical achievement scores as well as an estimate of inter-regional migration (actually change in population which can also be due to differential fertility). I did not use these in my analysis but they can be found in the datafile in the supplementary material.

Since there are 13 regions in total and 7 variables, I can analyze all variables at once and still almost conform to the rule of thumb of having a case-to-variable ratio of 2 (Zhao, 2009). Table 2 shows the factor loadings from this factor analysis as well as the correlation with G for each socioeconomic variable.

Table 2 – Correlations between S, S indicators, and G
Variable S G
Fellows.RS 0.92 0.92
First.class 0.55 0.58
Income 0.99 0.72
Unemployment -0.85 -0.79
Infant.mortality -0.68 -0.69
Crime 0.83 0.52
Urbanization 0.88 0.64
S 1 0.79


The crime variable had a strong positive loading on the S factor and also a positive correlation with the G factor. This is in contrast to the negative relationship found at the individual-level between the g factor and crime variables at about r=-.2 (Neisser 1996). The difference in mean IQ between criminal and non-criminal samples is usually around 7-15 points depending on which criminal group (sexual, violent and chronic offenders score lower than other offenders; Guay et al, 2005). Beaver and Wright (2011) found that IQ of countries was also negatively related to crime rates, r’s range from -.29 to -.58 depending on type of crime variable (violent crimes highest). At the level of country of origin groups, Fuerst and Kirkegaard (2014a) found that crime variables had strong negative loadings on the S factor (-.85 and -.89) and negative correlations with country of origin IQ. Altho not reported in the paper, Kirkegaard (2014b) found that the loading of 2 crime variables on the S factor in Norway among country of origin groups was -.63 and -.86 (larceny and violent crime; calculated using the supplementary material using the fully imputed dataset). Kirkegaard (2015a) found S loadings of .16 and -.72 of total crime and intentional homicide variables in Italy. Among US states, Kirkegaard (2015c) found S loadings of -.61 and -.71 for murder rate and prison rate. The scatter plot is shown in Figure 1.

Figure 1 – Scatter plot of regional G and S












So, the most similar finding in previous research is that from Italy. There are various possible explanations. Lynn (1979) thinks it is due to large differences in urbanization (which loads positively in multiple studies, .88 in this study). There may be some effect of the type of crime measurement. Future studies could examine this question by employing many different crime variables. My hunch is that it is a combination of differences in urbanization (which increases crime), immigration of crime prone persons into higher S areas, and differences in the justice system between areas.

Method of correlated vectors (MCV)

As done in the previous analysis of S factors, I performed MCV analysis to see whether the G factor was the reason for the association with the G factor score. S factor indicators with negative loadings were reversed to avoid inflating the result (these are marked with “_r” in the plot). The result is shown in Figure 2.

Figure 2 – MCV scatter plot








As in the previous analyses, the relationship was positive even after reversal.

Per capita income and the FLynn effect

An interesting quote from the paper is:

This interpretation [that the first factor of his factor analysis is intelligence] implies that the mean population IQs should be regarded as the cause of the other variables. When causal relationships between the variables are considered, it is obvious that some of the variables are dependent on others. For instance, people do not become intelligent as a consequence of getting a first-class honours degree. Rather, they get firsts because they are intelligent. The most plausible alternative causal variable, apart from IQ, is per capita income, since the remaining four are clearly dependent variables. The arguments against positing per capita income as the primary cause among this set of variables are twofold. First, among individuals it is doubtful whether there is any good evidence that differences in income in affluent nations are a major cause of differences in intelligence. This was the conclusion reached by Burt (1943) in a discussion of this problem. On the other hand, even Jencks (1972) admits that IQ is a determinant of income. Secondly, the very substantial increases in per capita incomes that have taken place in advanced Western nations since 1945 do not seem to have been accompanied by any significant increases in mean population IQ. In Britain the longest time series is that of Burt (1969) on London schoolchildren from 1913 to 1965 which showed that the mean IQ has remained approximately constant. Similarly in the United States the mean IQ of large national samples tested by two subtests from the WISC has remained virtually the same over a 16 year period from the early 1950s to the mid-1960s (Roberts, 1971). These findings make it doubtful whether the relatively small differences in per capita incomes between the regions of the British Isles can be responsible for the mean IQ differences. It seems more probable that the major causal sequence is from the IQ differences to the income differences although it may be that there is also some less important reciprocal effect of incomes on IQ. This is a problem which could do with further analysis.

Compare with Lynn’s recent overview of the history of the FLynn effect (Lynn, 2013).


  • Beaver, K. M.; Wright, J. P. (2011). “The association between county-level IQ and county-level crime rates”. Intelligence 39: 22–26. doi:10.1016/j.intell.2010.12.002
  • Deary, I. J. (2010). Cognitive epidemiology: Its rise, its current issues, and its challenges. Personality and individual differences, 49(4), 337-343.
  • Guay, J. P., Ouimet, M., & Proulx, J. (2005). On intelligence and crime: A comparison of incarcerated sex offenders and serious non-sexual violent criminals. International journal of law and psychiatry, 28(4), 405-417.
  • Gottfredson, L. S. (1998). Jensen, Jensenism, and the sociology of intelligence. Intelligence, 26(3), 291-299.
  • Gottfredson, L. S. (2004). Intelligence: is it the epidemiologists’ elusive” fundamental cause” of social class inequalities in health?. Journal of personality and social psychology, 86(1), 174.
  • Intelligence, Special Issue: Intelligence, health and death: The emerging field of cognitive epidemiology. Volume 37, Issue 6, November–December 2009
  • Kirkegaard, E. O. W., & Fuerst, J. (2014a). Educational attainment, income, use of social benefits, crime rate and the general socioeconomic factor among 71 immigrant groups in Denmark. Open Differential Psychology.
  • Kirkegaard, E. O. W. (2014b). Crime, income, educational attainment and employment among immigrant groups in Norway and Finland. Open Differential Psychology.
  • Kirkegaard, E. O. W. (2015a). S and G in Italian regions: Re-analysis of Lynn’s data and new data. The Winnower.
  • Kirkegaard, E. O. W. (2015b). Indian states: G and S factors. The Winnower.
  • Kirkegaard, E. O. W. (2015c). Examining the S factor in US states. The Winnower.
  • Kirkegaard, E. O. W. (2015d). The S factor in China. The Winnower.
  • Lynn, R. (1979). The social ecology of intelligence in the British Isles. British Journal of Social and Clinical Psychology, 18(1), 1-12.
  • Lynn, R. (1980). The social ecology of intelligence in France. British Journal of Social and Clinical Psychology, 19(4), 325-331.
  • Lynn, R. (2010a). In Italy, north–south differences in IQ predict differences in income, education, infant mortality, stature, and literacy. Intelligence, 38(1), 93-100.
  • Lynn, R. (2010b). IQ differences between the north and south of Italy: A reply to Beraldo and Cornoldi, Belacchi, Giofre, Martini, and Tressoldi. Intelligence, 38(5), 451-455.
  • Lynn, R. (2012a). IQs in Italy are higher in the north: A reply to Felice and Giugliano. Intelligence, 40(3), 255-259.
  • Lynn, R. (2012b). North-south differences in Spain in IQ, educational attainment, per capita income, literacy, life expectancy and employment. Mankind Quarterly, 52(3/4), 265.
  • Lynn, R. (2013). Who discovered the Flynn effect? A review of early studies of the secular increase of intelligence. Intelligence, 41(6), 765-769.
  • Lynn, R., & Cheng, H. (2013). Differences in intelligence across thirty-one regions of China and their economic and demographic correlates. Intelligence, 41(5), 553-559.
  • Lynn, R., & Yadav, P. (2015). Differences in cognitive ability, per capita income, infant mortality, fertility and latitude across the states of India. Intelligence, 49, 179-185.
  • Neisser, Ulric et al. (February 1996). “Intelligence: Knowns and Unknowns”. American Psychologist 52 (2): 77–101, 85.
  • Piffer, D., & Lynn, R. (2014). New evidence for differences in fluid intelligence between north and south Italy and against school resources as an explanation for the north–south IQ differential. Intelligence, 46, 246-249.
  • Zhao, N. (2009). The Minimum Sample Size in Factor Analysis. Encorewiki.


This is another relatively small international adoption study. 159 children were adopted in homes in the Netherlands from Sri Lanka, Korea and Colombia.

Participants are described as:

The present study examines the development and adjustment of 159 adopted children at age 7 years. The largest group, 129 adopted children, was selected from 2 studies, starting when the child was aged 5 months. In these studies a short-term early intervention was implemented in three sessions at home between 6 and 9 months in an experimental group, and results were compared with a control group. The families for this experiment were randomly recruited through Dutch adoption organizations, and not selected on (future) problems. Also, to avoid selection, the parents were not aware of the intervention when they entered the study. They were requested to participate in a study examining the development of adopted children. The results of the intervention study were reported elsewhere (Juffer, Hoksbergen, Riksen-Walraven, & Kohnstamm, 1997 ; Stams, Juffer, Van IJzendoorn, & Hoksbergen, in press). The intervention was not repeated during the following years. The original studies involved 70 mixed families, i.e., adoptive families with biological children, and 90 all-adoptive families, i.e., adoptive families without biological children. As intervention effects were found at age 7 in a small intervention group of 20 mixed families (Stams et al., in press), we decided to omit this group from the present study. The remaining sample consisted of 55 intervention and 74 control families. An additional group of 30 families, matched on the original criteria, was randomly recruited from one adoption agency at age 7, serving as a post-test-only group. The absence of intervention or testing effects on any of the outcome measures was confirmed in preliminary analyses, contrasting intervention with control groups, and control groups with the post-test-only group recruited at age 7, respectively.

The adoptive parents were Caucasian white, and in all families the mother was the primary caregiver. The families were predominantly from middle-class or upper middle-class backgrounds. The attrition rate was 8 %, that is, 11 of 140 participants from the original studies. The major reasons for declining were disinterest or health problems of family members. Four mothers had died of incurable illnesses. A series of separate Bonferroni-corrected statistical tests confirmed the absence of differential attrition in the total sample with respect to child background variables, such as age at placement, and family background variables, such as socioeconomic status or family type (with or without biological children).

The children, 73 boys and 86 girls, were adopted from Sri Lanka (N=108), South Korea (N=37), and Colombia (N=14). The infants from Sri Lanka were in the care of their biological mother until their adoption placement at a mean age of 7 weeks (SD=3). Korean and Colombian infants stayed in an institution or foster home after separation from their biological mother at birth, until their adoption placement at a mean age of 15 weeks (SD=4). In comparison with adoptions from Romania, for example (O’Connor et al., 2000 ; Rutter et al., 1998), the material conditions in the Korean and Colombian institutions were relatively favorable, as these homes received substantial support from a Dutch adoption agency. However, little is known about the quality of care, whereas one may assume that frequent changes of caretakers and nonoptimal child–caretaker ratios, often found in institutions, resulted in less favorable socioemotional conditions (O’Connor, Bredenkamp, Rutter, & the ERA Study Team, 1999). Little is known about the child-rearing conditions of the Sri Lankian infants after birth. Based on anecdotal evidence from parent reports, pre- and post-natal care for the relinquishing mother and her baby were far from optimal in Sri Lanka, and the health condition of the mother was deplorable in many cases (Juffer, 1993).

The paper is primarily about some socioemotional variables of no particular interest to me. However, they also accessed cognitive ability with the following test:

Intelligence. Intelligence was measured with the abbreviated Revised Amsterdam Child Intelligence Test (RACIT). Bleichrodt, Drenth, Zaal, and Resing (1987) found empirical evidence for convergent validity, as the RACIT correlatedr¯ ±86 with the Wechsler Intelligence Scale for Children-Revised (WISC-R). At age 7, the abbreviated RACIT correlatedr¯±92 with the full RACIT. The abbreviated RACIT showed a somewhat lower test–retest reliability, namely, r¯±86 versus r¯±88, and a somewhat lower internal consistency, namely, α¯±90 versusα¯±94, than the full RACIT. The abbreviated RACIT does not seem to underestimate or overestimate the level of individual intelligence.

In the present study, we used the abbreviated RACIT, which consisted of the following subtests : flexibility of closure (α¯ ±84), paired associates (split-half reliability¯±77), perceptual reasoning (split-half reliability¯±73), vocabulary (α¯±74), inductive reasoning (α¯±86), ideational fluency (α¯±81). The reliability of the abbreviated RACIT was .91 (N¯163), and was estimated on the basis of the number of subtests, the reliabilities of the subtests, and the correlations between the subtests (Nunally, 1978). The raw scores were transformed to standardized intelligence scores with a mean of 100 (SD15). The standardized scores were derived from a representative sample of 1415 children between age 4 and 11, drawn from the Dutch general school population in 1982 (Bleichrodt et al., 1987).

The results of the testing at age 7 are:

table 8 stams

The adoptive families in adoption studies are usually somewhat above average which boosts the IQ of especially younger children. This along with the FLynn effect is probably responsible for the higher than 100 scores. But among the groups, we see Koreans on top as is usually seen in these studies.


As Jason Malloy has mentioned, it is strange that in the race intelligence debates, people usually cite the same few studies over and over:

Shortly after writing that post, I decided that more needed to be written about transracial adoption research as a behavior genetic experiment. Arthur Jensen, Richard Lynn, and J. Philippe Rushton have all cited the Minnesota Transracial Adoption Study, as well as several IQ studies of transracially adopted Asians, in support of the hereditarian position. And Richard Nisbett has referenced several other adoption studies that suggest no racial gaps. However, I suspected there was more data for transracially adopted children than what this small cadre of scientists had already discussed (at the very least for important variables other than intelligence); research that could give us a more complete picture of what these unusual children become, and what this can tell us about the causes of ethnic differences in socially valued outcomes.

One way of finding hard to find studies is going thru articles that cite popular reviews of the topic. Sometimes, this is not possible if the reviews have thousands of citations. However, sometimes, it is. In this case, I used the review: Kim, W. J. (1995). International adoption: A case review of Korean children. Child Psychiatry and Human Development, 25(3), 141-154. Then simply look up the studies that cite it (101 results on Scholar). We are looking for studies that report country or area of origin as well as relevant criteria variables such as IQ scores, GPA, educational attainment/achievement, income, socioeconomic status/s factor, crime rates, use of public benefits and so on.

One such study is: Lindblad, F., Hjern, A., & Vinnerljung, B. (2003). Intercountry Adopted Children as Young Adults—‐A Swedish Cohort Study. American Journal of Orthopsychiatry, 73(2), 190-202. The abstract is promising:

In a national cohort study, the family and labor market situation, health problems, and education of 5,942 Swedish intercountry adoptees born between 1968 and 1975 were examined and compared with those of the general population, immigrants, and a siblings group—all age matched—in national registers from 1997 to 1999. Adoptees more often had psychiatric problems and were longtime recipients of social welfare. Level of education was on par with that of the general population but lower when adjusted for socioeconomic status.

The sample consists of:

There were 5,942 individuals in the adoptee study group: 3,237 individuals were born in the Far East (2,658 were born in South Korea), 1,422 in South Asia, 871 in Latin America, and 412 in Africa. In the other study groups there were 1,884 siblings, 8,834 European immigrants, 3,544 non-European immigrants, and 723,154 individuals in the general population.

So, by the usual standards, this is a very large study. We are interested in the region of birth results. They are in two tables:

Table 7Table 8

We can note that the Far East — i.e. mostly South Korean, presumably the rest are North Korean (?), Chinese, Japanese, Vietnamese (?) — usually gets the better outcomes. They were less often married, mixed results for living with parents, more likely to have a university degree, less likely to have only primary school, more likely to be in the workforce, less likely to be unemployed, less likely to receive welfare, mixed results for hospital admissions for substance abuse, much less likely to be admitted for alcohol abuse (likely to be due to Asian alcohol flush syndrome), less likely to be admitted for a psychiatric diagnosis, and less likely to receive disability pension.

It would probably have been better if one could aggregate the results and look at the general socioeconomic factor instead. It is not possible to do so with the above results, since there are only 4 cases and 11 variables. One could calculate a score by choosing some or all of the variables. Or one could assign them factor loadings manually and then calculate scores. I calculated a unit-weighted score based on all but the first two indicators (married and living with parents since these are not socioeconomically important). Two indicators (uni degree and workforce) were reversed (by 1/OR). I also calculated the median score which is resistant to outliers (e.g. the alcohol abuse indicator). Results:


Socioeconomic outcomes by region of origin, and estimated S scores
Group Latin America Africa South Asia Far East
Uni.degree 2.50 1.43 1.67 1
Only.primary.ed 1.60 1.50 1.00 1
Workforce 1.43 1.43 1.11 1
Unemployed 1.30 0.90 1.30 1
Welfare.use 1.90 1.50 1.30 1
Substance.abuse 2.70 0.70 1.00 1
Alcohol.abuse 4.50 4.90 3.60 1
Psychiatric.diag 1.50 1.40 1.20 1
Disability.pension 1.30 1.80 1.30 1
Mean.S 2.08 1.73 1.50 1
Median.S 1.60 1.43 1.30 1


It is interesting to see that the Africans did better than the Latin Americans. Perhaps there is something strange going on. Perhaps the Latin Americans are from countries with high African% admixture. Or perhaps it’s some kind of selection effect.

In their discussion they write:

There were considerable differences between adoptees from different geographical regions with better outcomes in many respects for children from the Far East, in this context mainly South Korea. Sim­ilar positive adjustment results concerning Asian adoptees have been presented previously. For in­ stance, an excellent prognosis concerning adjustment and identity development in Chinese adoptees in Britain was described (Bagley, 1993). A Dutch group recently presented data about academic achievement and intelligence in 7-year-old children adopted in in­ fancy (Stams, Juffer, Rispens, & Hoksbergen, 2000). The South Korean group had high IQs with 31% above a score of 120. Pre- and postnatal care before adoption seems to be particularly well organized in South Korea (Kim, 1995), which may be one important reason for the positive outcome. The differences among the geographic regions may also, however, be due to a large number of other factors such as differ­ences in nutrition, motives behind the adoption, qual­ity of care in the orphanage-foster home before the adoption, genetic dispositions, and Swedish preju­dices against “foreign-looking” people. Another ex­planation may be a larger number of younger infants in the South Korean group. However, that is not pos­sible to verify from our register data.

The usual cultural explanations.

I have also contacted the Danish statistics office to hear if they have Danish data.



Item-level data from Raven’s Standard Progressive Matrices was compiled for 12 diverse groups from previously published studies. The method of correlated vectors was used on every possible pair of groups with available data (45 comparisons). Depending on exact method chosen, the mean and mean MCV correlation was about .46/51. Only 2/1 of 45 were negative. Spearman’s hypothesis is confirmed for item-level data from the Standard Progressive Matrices.

Introduction and method

The method of correlated vectors (MCV) is a statistical method invented by Arthur Jensen (1998, p. 371, see also appendix B). The purpose of it is to measure to which degree a latent variable is responsible for an observed correlation between an aggregate measure and a criteria variable. Jensen had in mind the general factor of cognitive ability data (the g factor) as measured by various IQ tests and their subtests, and criteria variables such as brain size, however the method is applicable to any latent trait (e.g. general socioeconomic factor Kirkegaard, 2014a, b). When this method is applied to group differences, particularly ethnoracial ones, it is called Spearman’s hypothesis (SH) because Spearman was the first to note it in his 1927 book.

By now, several large studies and meta-analysis of MCV results for group differences have been published (te Nijenhuis et al (2015a, 2015b, 2014), Jensen (1985)). These studies generally support the hypothesis. Almost all studies use subtest loadings instead of item loadings. This is probably because psychologists are reluctant to share their data (Wicherts, et al, 2006) and as a result there are few open datasets available to use for this purpose. Furthermore, before the introduction of modern computers and the internet, it was impractical to share item-level data. There are advantages and disadvantages to using item-level data over subtest-level data. There are more items than subtests which means that the vectors will be longer and thus sampling error will be smaller. On the other hand, items are less reliable and less pure measures of the g factor which introduces both error and more non-g ability variance.

The recent study by Nijenhuis et al (2015a) however, employed item-level data from Raven’s Standard Progressive Matrices (SPM) and included a diverse set of samples (Libyan, Russian, South African, Roma from Serbia, Moroccan and Spanish). The authors did not use their collected data to its full extent, presumably because they were comparing the groups (semi-)manually. To compare all combinations with a dataset of e.g. 10 groups means that one has to do 45 comparisons (10*9/2). However, this task can easily be overcome with programming skills, and I thus saw an opportunity to gather more data regarding SH.

The authors did not provide the data in the paper despite it being easy to include it in tables. However, the data was available from the primary studies they cited in most cases. Thus, I collected the data from their data sources (it can be found in the supplementary material). This resulted in data from 12 samples of which 10 had both difficulty and item-whole correlations data. Table 1 gives an overview of the datasets:

Table 1- Overview of samples
Short name Race Selection N Year Ref Country Description
A1 African Undergraduates 173 2000 Rushton and Skuy 2000 South Africa University of the Witwatersrand and the Rand Afrikaans University in Johannesburg, South Africa
W1 European Undergraduates 136 2000 Rushton and Skuy 2000 South Africa University of the Witwatersrand and the Rand Afrikaans University in Johannesburg, South Africa
W2 European Std 7 classes 1056 1992 Owen 1992 South Africa 20 schools in the Pretoria-Witwatersrand-Vereeniging (PWV) area and 10 schools in the Cape Peninsula
C1 Colored (African European) Std 7 classes 778 1992 Owen 1992 South Africa 20 coloured schools in the Cape Peninsula
I1 Indian Std 7 classes 1063 1992 Owen 1992 South Africa 30 schools selected at random from the list of high schools in and around Durban
A2 African Std 7 classes 1093 1992 Owen 1992 South Africa Three schools in the PWV area and 25 schools in KwaZulu (Natal)
A3 African First year Engineering students 198 2002 Rushton et al 2002 South Africa First-year students from the Faculties of Engineering and the Built Environment at the University of the Witwatersrand
I2 Indian First year Engineering students 58 2002 Rushton et al 2002 South Africa First-year students from the Faculties of Engineering and the Built Environment at the University of the Witwatersrand
W3 European First year Engineering students 86 2002 Rushton et al 2002 South Africa First-year students from the Faculties of Engineering and the Built Environment at the University of the Witwatersrand
R1 Roma Adults ages 16 to 66 231 2004.5 Rushton et al 2007 Serbia The communities (i.e., Drenovac, Mirijevo, and Rakovica) are in the vicinity of Belgrade
W4 European Adults ages 18 to 65 258 2012 Diaz et al 2012 Spain Mainly from the city of Valencia
NA1 North African Adults ages 18 to 50 202 2012 Diaz et al 2012 Morocco Casablanca, Marrakech, Meknes and Tangiers


Item-whole correlations and item loadings

The data in the papers did usually not contain the actual factor loadings of the items. Instead, they contained the item-whole correlations. The authors argue that one can use these because of the high correlation of unweighted means with extracted g-factors (often, r=.99, e.g. Kirkegaard, in review). Some studies did provide both loadings and item-whole correlations, yet the authors did not correlate them to see how good proxies the item-whole correlations are for the loadings. I calculated this for the 4 studies that included both metrics. Results are shown in Table 2.

Table 2 – Item-whole correlations x g-loadings in 4 studies.
W2 C1 I1 A2
W2 0.549 0.099 0.327 0.197
C1 0.695 0.900 0.843 0.920
I1 0.616 0.591 0.782 0.686
A2 0.626 0.882 0.799 0.981

Note: Within sample correlations between item-whole correlations and item factor loadings are in the diagonal, marked with italic.

As can be seen, the item-whole correlations were not in all cases great proxies for the actual loadings.

To further test this idea, I calculated the item-whole correlations and the factor loadings (first factor, minimum residuals) in the open Wicherts dataset (N=500ish, Dutch university students, see Wicherts and Bakker 2012) tested on Raven’s Advanced Progressive Matrices. The correlation was .89. Thus, aside from the odd result in the W2 sample, item-whole correlations were a reasonable proxy for the factor loadings.

Item difficulties across samples

If two groups are tested on the same test and this test measures the same trait in both groups, then even if the groups have different mean trait levels, the order of difficulty of the items or subtests should be similar. Rushton et al (2000, 2002, 2007) have examined this in previous studies and found it generally to be the case. Table 3 below shows the cross-sample correlations of item difficulties.

Table 3 – Intercorrelations between item difficulties in 12 samples
A1 W1 W2 C1 I1 A2 A3 I2 W3 R1 NA1 W4
A1 1 0.88 0.98 0.96 0.99 0.86 0.96 0.89 0.79 0.89 0.95 0.93
W1 0.88 1 0.93 0.79 0.87 0.65 0.96 0.97 0.94 0.7 0.92 0.95
W2 0.98 0.93 1 0.95 0.98 0.82 0.97 0.92 0.84 0.86 0.96 0.95
C1 0.96 0.79 0.95 1 0.98 0.94 0.89 0.81 0.69 0.95 0.92 0.87
I1 0.99 0.87 0.98 0.98 1 0.88 0.95 0.88 0.79 0.91 0.95 0.92
A2 0.86 0.65 0.82 0.94 0.88 1 0.76 0.68 0.56 0.97 0.82 0.76
A3 0.96 0.96 0.97 0.89 0.95 0.76 1 0.96 0.9 0.8 0.95 0.96
I2 0.89 0.97 0.92 0.81 0.88 0.68 0.96 1 0.92 0.72 0.91 0.92
W3 0.79 0.94 0.84 0.69 0.79 0.56 0.9 0.92 1 0.6 0.88 0.91
R1 0.89 0.7 0.86 0.95 0.91 0.97 0.8 0.72 0.6 1 0.86 0.8
NA1 0.95 0.92 0.96 0.92 0.95 0.82 0.95 0.91 0.88 0.86 1 0.97
W4 0.93 0.95 0.95 0.87 0.92 0.76 0.96 0.92 0.91 0.8 0.97 1


The mean intercorrelation is.88. This is quite remarkable given the diversity of the samples.

Item-whole correlations across samples

Given the above, one might expect similar results for the item-whole correlations. This however is not so. Results are in Table 4.

Table 4 – Intercorrelations between item-whole correlations in 10 samples
A1 W1 W2 C1 I1 A2 A3 I2 W3 R1
A1 1 -0.2 0.59 0.58 0.73 0.54 0.27 0.04 -0.3 0.57
W1 -0.2 1 0.17 -0.59 -0.25 -0.68 0.42 0.51 0.55 -0.55
W2 0.59 0.17 1 0.44 0.79 0.29 0.61 0.25 0.02 0.39
C1 0.58 -0.59 0.44 1 0.79 0.94 0.01 -0.25 -0.49 0.78
I1 0.73 -0.25 0.79 0.79 1 0.69 0.42 0.09 -0.33 0.63
A2 0.54 -0.68 0.29 0.94 0.69 1 -0.13 -0.3 -0.52 0.77
A3 0.27 0.42 0.61 0.01 0.42 -0.13 1 0.26 0.37 0.02
I2 0.04 0.51 0.25 -0.25 0.09 -0.3 0.26 1 0.34 -0.21
W3 -0.3 0.55 0.02 -0.49 -0.33 -0.52 0.37 0.34 1 -0.49
R1 0.57 -0.55 0.39 0.78 0.63 0.77 0.02 -0.21 -0.49 1

Note: The last two samples, NA1 and W4, did not have item-whole correlation data.

The reason for this state of affairs is that the factor loadings change when the group mean trait level changes. For many samples, most of the items were too easy (passing rates at or very close to 100%). When there is no variation in a variable, one cannot calculate a correlation to some other variable. This means that for a substantial number of items for multiple samples, there was missing data for the items.

The lack of cross-sample consistency in item-whole correlations may also explain the weak MCV results in Diaz et al, 2012 since they used g-loadings from another study instead of from their own samples.

Spearman’s hypothesis using one static vector of estimated factor loadings

Some of the sample had rather low sample sizes (I2, N=58, W3, N=86). Thus one might get the idea to use the item-whole correlations from one or more of the large samples for comparisons involving other groups. In fact, given the instability of item-whole correlations across sample as can be seen in Table 4, this is a bad idea. However, for sake of completeness, I calculated the results based on this anyway. As the best estimate of factor loadings, I averaged the item-whole correlations data from the four largest samples (W2, C1, I1 and A2).

Using this vector of item-whole correlations, I used MCV on every possible sample comparison. Because there were 12 samples, this number is 66. MCV analysis was done by subtracting the lower scoring sample’s item difficulties from the higher scoring sample’s thus producing a vector of the sample difference on each item. This vector I correlated with the vector of item-whole correlations.  The results are shown in Table 5.

Table 5 – MCV correlations of group differences across 12 samples using 1 static item-whole correlations
A1 W1 W2 C1 I1 A2 A3 I2 W3 R1 NA1 W4
A1 NA -0.15 0.2 0.42 0.1 0.83 -0.03 -0.12 -0.26 0.8 -0.35 -0.32
W1 -0.15 NA -0.31 0.07 -0.14 0.47 -0.29 -0.19 -0.4 0.4 0.31 0.06
W2 0.2 -0.31 NA 0.56 0.4 0.86 -0.27 -0.28 -0.38 0.83 -0.35 -0.46
C1 0.42 0.07 0.56 NA 0.53 0.88 0.23 0.11 -0.06 0.64 -0.23 -0.05
I1 0.1 -0.14 0.4 0.53 NA 0.88 -0.02 -0.1 -0.24 0.83 -0.45 -0.29
A2 0.83 0.47 0.86 0.88 0.88 NA 0.66 0.52 0.32 0.2 0.42 0.4
A3 -0.03 -0.29 -0.27 0.23 -0.02 0.66 NA -0.17 -0.41 0.61 0.43 -0.43
I2 -0.12 -0.19 -0.28 0.11 -0.1 0.52 -0.17 NA -0.37 0.46 0.33 -0.25
W3 -0.26 -0.4 -0.38 -0.06 -0.24 0.32 -0.41 -0.37 NA 0.23 -0.05 -0.11
R1 0.8 0.4 0.83 0.64 0.83 0.2 0.61 0.46 0.23 NA 0.36 0.32
NA1 -0.35 0.31 -0.35 -0.23 -0.45 0.42 0.43 0.33 -0.05 0.36 NA -0.03
W4 -0.32 0.06 -0.46 -0.05 -0.29 0.4 -0.43 -0.25 -0.11 0.32 -0.03 NA


As one can see, the results are all over the place. The mean MCV correlation is .12.

Spearman’s hypothesis using a variable vector of estimated factor loadings

Since item-whole correlations varied from sample to sample, another idea is to use the samples’ item-whole correlations. I used the unweighted mean of the item-whole correlations for each item (te Nijenhuis et al used a weighted mean). In some cases, only one sample has item-whole correlations for some items (because the other sample had no variance on the item, i.e. 100% get it right). In these cases, one can choose to use the value from the remaining sample, or one can ignore the item and calculated MCV based on the remaining items. I calculated results using both methods, they are shown in Table 6 and 7.

Table 6 – MCV correlations of group differences across 10 samples using variable item-whole correlations, method 1
A1 W1 W2 C1 I1 A2 A3 I2 W3 R1
A1 NA 0.79 0.39 0.29 0.05 0.7 0.48 0.41 0.37 0.71
W1 0.79 NA 0.8 0.5 0.76 0.51 0.79 0.4 0.6 0.54
W2 0.39 0.8 NA 0.68 0.63 0.85 0.43 0.47 0.5 0.79
C1 0.29 0.5 0.68 NA 0.52 0.88 0.32 0.3 -0.09 0.67
I1 0.05 0.76 0.63 0.52 NA 0.84 0.47 0.4 0.31 0.79
A2 0.7 0.51 0.85 0.88 0.84 NA 0.57 0.43 -0.03 0.22
A3 0.48 0.79 0.43 0.32 0.47 0.57 NA 0.38 0.66 0.64
I2 0.41 0.4 0.47 0.3 0.4 0.43 0.38 NA 0.6 0.44
W3 0.37 0.6 0.5 -0.09 0.31 -0.03 0.66 0.6 NA 0.2
R1 0.71 0.54 0.79 0.67 0.79 0.22 0.64 0.44 0.2 NA


Table 6 – MCV correlations of group differences across 10 samples using variable item-whole correlations, method 2
A1 W1 W2 C1 I1 A2 A3 I2 W3 R1
A1 NA 0.42 0.4 0.33 0.06 0.72 0.48 0.3 0.15 0.74
W1 0.42 NA 0.72 0.14 0.52 0.18 0.7 0.44 0.65 0.35
W2 0.4 0.72 NA 0.68 0.63 0.85 0.44 0.53 0.5 0.79
C1 0.33 0.14 0.68 NA 0.52 0.88 0.39 0.19 -0.08 0.67
I1 0.06 0.52 0.63 0.52 NA 0.84 0.51 0.4 0.28 0.79
A2 0.72 0.18 0.85 0.88 0.84 NA 0.62 0.3 0.02 0.22
A3 0.48 0.7 0.44 0.39 0.51 0.62 NA 0.42 0.55 0.67
I2 0.3 0.44 0.53 0.19 0.4 0.3 0.42 NA 0.58 0.35
W3 0.15 0.65 0.5 -0.08 0.28 0.02 0.55 0.58 NA 0.08
R1 0.74 0.35 0.79 0.67 0.79 0.22 0.67 0.35 0.08 NA


Nearly all results are positive using either method. The results are slightly stronger when ignoring items where both samples do not have item-whole correlation data. A better way to visualize the results is to use a histogram with a density curve inputted, as shown in Figure 1 and 2.

SH method 1

Figure 1 – Histogram of MCV results using method 1

SH method 2

Figure 2 – Histogram of MCV results using method 2

Note: The vertical line shows the mean value.

The mean/median result for method 1 was .51/.50, and .46/.48 for method 2. Almost all MCV results were positive, there were only 2/45 that were negative for method 1, and 1/45 for method 2.

Mean MCV value by sample and moderator analysis

It is interesting to examine the mean MCV value by sample. They are shown in Table 7.

Table 7 – MCV correlation means, SDs, and medians by sample
Sample mean SD median
A1 0.46 0.24 0.41
W1 0.63 0.15 0.60
W2 0.62 0.18 0.63
C1 0.45 0.29 0.50
I1 0.53 0.26 0.52
A2 0.55 0.31 0.57
A3 0.53 0.15 0.48
I2 0.43 0.08 0.41
W3 0.35 0.28 0.37
R1 0.56 0.23 0.64


There is no obvious racial pattern. Instead, one might expect the relatively lower result of some samples to be due to sampling error. MCV is extra sensitive to sampling error. If so, the mean correlation should be higher for the larger samples. To see if this was the case, I calculated the rank-order correlation between sample size and sample mean MCV, r=.45/.65 using method 1 or 2 respectively. Rank-order was used because the effect of sample size on sampling error is non-linear. Figure 3 shows the scatter plot of this.


Figure 3 – Sample size as a moderator variable at the sample mean-level


One can also examine sample size as a moderating variable as the comparison-level. This increases the number of datapoints to 45. I used the harmonic mean of the 2 samples as the sample size metric. Figure 4 shows a scatter plot of this.


Figure 4 – Sample size as a moderator variable at the comparison-level


The rank-order correlations are .45/.44 using method 1/2 data. We can see in the plot that the results from the 6 largest comparisons (harmonic mean sample size>800) range from .52 to .88 with a mean of .74/.73 and SD of .15 using method 1/2 results. For the smaller studies (harmonic mean sample size<800), the results range from -.09/-.08 to .89/.79 with a mean of .48/.43 and SD of .22/.23 using method 1/2 results. The results from the smaller studies vary more, as expected with their higher sampling error.

I also examine the group difference size as a moderator variable. I computed this as the difference between the mean item difficulty by the groups. However, it had a near-zero relationship to the MCV results (rank-order r=.03, method 1 data).

Discussion and conclusion

Spearman’s hypothesis has been decisively confirmed using item-level data from Raven’s Standard Progressive Matrices. The analysis presented here can easily be extended to cover more datasets, as well as item-level data from other IQ tests. Researchers should compile such data into open datasets so they can be used for future studies.

It is interesting to note the consistency of results within and across samples that differ in race. Race differences in general intelligence as measured by the SPM appear to be just like those within races.

Supplementary material

R code and dataset is available at the Open Science Framework repository.


  • Dıaz, A., & Sellami, K. Infanzó n, E., Lanzó n, T., & Lynn, R.(2012). A comparative study of general intelligence in Spanish and Moroccan samples. Spanish Journal of Psychology, 15(2), 526-532.
  • Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger.
  • Jensen, A. R. (1985). The nature of the black–white difference on various psychometric tests: Spearman’s hypothesis. Behavioral and Brain Sciences, 8(02), 193-219.
  • Kirkegaard, E. O. (2014a). The international general socioeconomic factor: Factor analyzing international rankings. Open Differential Psychology.
  • Kirkegaard, E. O. (2014b). Crime, income, educational attainment and employment among immigrant groups in Norway and Finland. Open Differential Psychology.
  • Kirkegaard, E. O. W. (in review). Examining the ICAR and CRT tests in a Danish student sample. Open Differential Psychology.
  • Owen, K. (1992). The suitability of Raven’s Standard Progressive Matrices for various groups in South Africa. Personality and Individual Differences, 13(2), 149-159.
  • Rushton, J. P., Čvorović, J., & Bons, T. A. (2007). General mental ability in South Asians: Data from three Roma (Gypsy) Communities in Serbia. Intelligence, 35, 1-12.
  • Rushton, J. P., Skuy, M., & Fridjhon, P. (2002). Jensen effects among African, Indian, and White engineering students in South Africa on Raven’s Standard Progressive Matrices. Intelligence, 30, 409-423.
  • Rushton, J. P., & Skuy, M. (2000). Performance on Raven’s Matrices by African and White university students in South Africa. Intelligence, 28, 251-265.
  • Spearman, C. (1927). The abilities of man.
  • te Nijenhuis, J., Al-Shahomee, A. A., van den Hoek, M., Grigoriev, A., and Repko, J. (2015a). Spearman’s hypothesis tested comparing Libyan adults with various other groups of adults on the items of the Standard Progressive Matrices. Intelligence. Volume 50, May–June 2015, Pages 114–117
  • te Nijenhuis, J., David, H., Metzen, D., & Armstrong, E. L. (2014). Spearman’s hypothesis tested on European Jews vs non-Jewish Whites and vs Oriental Jews: Two meta-analyses. Intelligence, 44, 15-18.
  • te Nijenhuis, J., van den Hoek, M., & Armstrong, E. L. (2015b). Spearman’s hypothesis and Amerindians: A meta-analysis. Intelligence, 50, 87-92.
  • Wicherts, J. M., Borsboom, D., Kats, J., & Molenaar, D. (2006). The poor availability of psychological research data for reanalysis. American Psychologist, 61(7), 726.
  • Wicherts, J. M., & Bakker, M. (2012). Publish (your data) or (let the data) perish! Why not publish your data too?. Intelligence, 40(2), 73-76.



In this commentary I explain how mean differences between normal distributions give rise to different percentages of the populations being above or below a given threshold, depending on where the threshold is.


Research uncovers flawed IQ scoring system” is the headline on phys.org, which often posts news about research from other fields. It concerns a study by Harrison et al (2015). The researchers have allegedly “uncovered anomalies and issues with the Wechsler Adult Intelligence Scale-Fourth Edition (WAIS-IV), one of the most widely used intelligence tests in the world”. An important discovery, if true. Let’s hear it from the lead researcher:

“Looking at the normal distribution of scores, you’d expect that only about five per cent of the population should get an IQ score of 75 or less,” says Dr. Harrison. “However, while this was true when we scored their tests using the American norms, our findings showed that 21 per cent of college and university students in our sample had an IQ score this low when Canadian norms were used for scoring.”

How can it be? To learn more, we delve into the actual paper titled: Implications for Educational Classification and Psychological Diagnoses Using the Wechsler Adult Intelligence Scale–Fourth Edition With Canadian Versus American Norms.

The paper

First they summarize a few earlier studies on Canada and the US. The Canadians obtained higher raw scores. Of course, this was hypothesized to be due to differences in ethnicity and educational achievement factors. However, this did not quite work out, so Harrison et al decided to investigate it more (they had already done so in 2014). Their method consists of taking the scores from a large mixed sample consisting of healthy people — i.e. with no diagnosis, 11% — and people with various mental disorders (e.g. 53.5% with ADHD), and then scoring this group on both the American and the Canadian norms. What did they find?

Blast! The results were similar to the results from the previous standardization studies! What happened? To find out, Harrison et al do a thorough examination of various subgroups in various ways. No matter which age group they compare, the result won’t go away. They also report the means and Cohen’s d for each subtest and aggregate measure — very helpful. I reproduce their Table 1 below:

Score M (US)
p d r
FSIQ 95.5 12.9 88.1 14.4 <.001 0.54 0.99
GAI 98.9 14.7 92.5 16 <.001 0.42 0.99
Index Scores
Verbal Comprehension 97.9 15.1 91.8 16.3 <.001 0.39 0.99
Perceptual Reasoning 99.9 14.1 94.5 15.9 <.001 0.36 0.99
Working Memory 90.5 12.8 83.5 13.8 <.001 0.53 0.99
Processing Speed 95.2 12.9 90.4 14.1 <.001 0.36 0.99
Subtest Scores
Verbal Subtests
Vocabulary 9.9 3.1 8.7 3.3 <.001 0.37 0.99
Similarities 9.7 3 8.5 3.3 <.001 0.38 0.98
Information 9.2 3.1 8.5 3.3 <.001 0.22 0.99
Arithmetic 8.2 2.7 7.4 2.7 <.001 0.3 0.99
Digit Span 8.4 2.5 7.1 2.7 <.001 0.5 0.98
Performance Subtests
Block Design 9.8 3 8.9 3.2 <.001 0.29 0.99
Matrix Reasoning 9.8 2.9 9.1 3.2 <.001 0.23 0.99
Visual Puzzles 10.5 2.9 9.4 3.1 <.001 0.37 0.99
Symbol Search 9.3 2.8 8.5 3 <.001 0.28 0.99
Coding 8.9 2.5 8.2 2.6 <.001 0.27 0.98


Sure enough, the scores are lower using the Canadian norms. And very ‘significant’ too. A mystery.

Next, they go on to note how this sometimes changes the classification of individuals into 7 arbitrarily chosen intervals of IQ scores, and how this differs between subtests. They spend a lot of e-ink noting percents about this or that classification. For instance:

“Of interest was the percentage of individuals who would be classified as having a FSIQ below the 10th percentile or who would fall within the IQ range required for diagnosis of ID (e.g., 70 ± 5) when both normative systems were applied to the same raw scores. Using American norms, 13.1% had an IQ of 80 or less, and 4.2% had an IQ of 75 or less. By contrast, when using Canadian norms, 32.3% had an IQ of 80 or less, and 21.2% had an IQ of 75 or less.”

I wonder if some coherent explanation can be found for all these results. In their discussion they ask:

“How is it that selecting Canadian over American norms so markedly lowers the standard scores generated from the identical raw scores? One possible explanation is that more extreme scores occur because the Canadian normative sample is smaller than the American (cf. Kahneman, 2011).”

If the reader was unsure, yes, this is Kahneman’s 2011 book about cognitive biases and dual process theory.

They have more suggestions about the reason:

“One cannot explain this difference simply by saying it is due to the mature students in the sample who completed academic upgrading, as the score differences were most prominent in the youngest cohorts. It is difficult to explain these findings simply as a function of disability status, as all participants were deemed otherwise qualified by these postsecondary institutions (i.e., they had met normal academic requirements for entry into regular postsecondary programs). Furthermore, in Ontario, a diagnosis of LD is given only to students with otherwise normal thinking and reasoning skills, and so students with such a priori diagnosis would have had otherwise average full scale or general abilities scores when tested previously. Performance exaggeration seems an unlikely cause for the findings, as the students’ scores declined only when Canadian norms were applied. Finally, although no one would argue that a subset of disabled students might be functioning below average, it is difficult to believe that almost half of these postsecondary students would fall in this IQ range given that they had graduated from high school with marks high enough to qualify for acceptance into bona fide postsecondary programs. Whatever the cause, our data suggest that one must question both the representativeness of the Canadian normative sample in the younger age ranges and the accuracy of the scores derived when these norms are applied.”

And finally they conclude with a recommendation not to use the Canadian norms for Canadians because this results in lower IQs:

Overall, our findings suggest a need to examine more carefully the accuracy and applicability of the WAIS-IV Canadian norms when interpreting raw test data obtained from Canadian adults. Using these norms appears to increase the number of young adults identified as intellectually impaired and could decrease the number who qualify for gifted programming or a diagnosis of LD. Until more research is conducted, we strongly recommend that clinicians not use Canadian norms to determine intellectual impairment or disability status. Converting raw scores into Canadian standard scores, as opposed to using American norms, systematically lowers the scores of postsecondary students below the age of 35, as the drop in FSIQ was higher for this group than for older adults. Although we cannot know which derived scores most accurately reflect the intellectual abilities of young Canadian adults, it certainly seems implausible that almost half of postsecondary students have FSIQ scores below the 16th percentile, calling into question the accuracy of all other derived WAIS-IV Canadian scores in the classification of cognitive abilities.

Are you still wondering what it going on?

Populations with different mean IQs and cut-offs

Harrison et al seems to have inadvertently almost rediscovered the fact that Canadians are smarter than Americans. They don’t quite make it to this point even when faced with obvious and strong evidence (multiple standardization samples). They somehow don’t realize that using the norms from these standardization samples will reproduce the differences found in those samples, and won’t really report anything new.

Their numerous differences in percents reaching this or that cut-off are largely or entirely explained by simple statistics. They have two populations which have an IQ difference of 7.4 points (95.5 – 88.1 from Table 1) or 8.1 points (15 * .54 d from Table 1). Now, if we plot these (I used a difference of 7.5 IQ) and choose some arbitrary cut-offs, like those between arbitrarily chosen intervals, we see something like this:


Except that I cheated and chose all the cut-offs. The brown and the green lines are the ratios between the densities (read off the second y-axis). We see that around 100, they are generally low, but as we get further from the means, they get a lot larger. This simple fact is not generally appreciated. It’s not a new problem, Arthur Jensen spent much of a chapter in his behemoth 1980 book on the topic, he quotes for instance:

“In the construction trades, new apprentices were 87 percent white and 13 percent black. [Blacks constitute 12 percent of the U.S. population.] For the Federal Civil Service, of those employees above the GS-5 level, 88.5 percent were white, 8.3 percent black, and women account for 30.1 of all civil servants. Finally, a 1969 survey of college teaching positions showed whites with 96.3 percent of all posi­ tions. Blacks had 2.2 percent, and women accounted for 19.1 percent. (U.S. Commission on Civil Rights, 1973)”

Sounds familiar? Razib Khan has also written about it. Now, let’s go back to one of the quotes:

“Using American norms, 13.1% had an IQ of 80 or less, and 4.2% had an IQ of 75 or less. By contrast, when using Canadian norms, 32.3% had an IQ of 80 or less, and 21.2% had an IQ of 75 or less. Most notably, only 0.7% (2 individuals) obtained a FSIQ of 70 or less using American norms, whereas 9.7% had IQ scores this low when Canadian norms were used. At the other end of the spectrum, 1.4% of the students had FSIQ scores of 130 or more (gifted) when American norms were used, whereas only 0.3% were this high using Canadian norms.”

We can put these in a table and calculate the ratios:

IQ threshold Percent US Percent CAN US/CAN CAN/US
130 1.4 0.3 4.67 0.21
80 13.1 32.3 0.41 2.47
75 4.2 21.2 0.20 5.05
70 0.7 9.7 0.07 13.86


And we can also calculate the expected values based on the two populations (with means of 95.5 and 88) above:

IQ threshold Percent US Percent CAN US/CAN CAN/US
130 1.07 0.26 4.12 0.24
80 15.07 29.69 0.51 1.97
75 8.59 19.31 0.44 2.25
70 4.46 11.51 0.39 2.58


This is fairly close right? The only outlier (in italic) is the much lower than expected value for <70 IQ using US norms, perhaps a sampling error. But overall, this is a pretty good fit to the data. Perhaps we have our explanation.

What about those (mis)classification values in their Table 2? Well, for similar reasons that I won’t explain in detail, these are simply a function of the difference between the groups in that variable, e.g. Cohen’s d. In fact, if we correlate the d vector and the “% within same classification” we get a correlation of -.95 (-.96 using rank-orders).

MCV analysis

Incidentally, the d values report in their Table 1 are useful for using the method of correlated vectors. In a previous study comparing US and Canadian IQ data, Dutton and Lynn (2014) compared WAIS-IV standardization data. They found a mean difference of .31 d, or 4.65 IQ, which was reduced to 2.1 IQ if the samples were matched on education, ethnicity and sex. An interesting thing was that the difference between the countries was largest on the most g-loading subtests. When this happens, it is called a Jensen effect (or that it has a positive Jensen coefficient, Kirkegaard 2014). The value in their study was .83, which is on the high side (see e.g. te Nijenhuis et al, 2015).

I used the same loadings as used in their study (McFarland, 2013), and found a correlation of .24 (.35 with rank-order), substantially weaker.

Supplementary material

The R code and data files can be found in the Open Science Framework repository.


  • Harrison, A. G., Holmes, A., Silvestri, R., Armstrong, I. T. (2015). Implications for Educational Classification and Psychological Diagnoses Using the Wechsler Adult Intelligence Scale–Fourth Edition With Canadian Versus American Norms. Journal of Psychoeducational Assessment. 1-13.
  • Jensen, A. R. (1980). Bias in Mental Testing.
  • Kirkegaard, E. O. (2014). The personal Jensen coefficient does not predict grades beyond its association with g. Open Differential Psychology.
  • McFarland, D. (2013). Model individual subtests of the WAIS IV with multiple latent
    factors. PLoSONE. 8(9): e74980. doi:10.1371/journal.pone.0074980
  • te Nijenhuis, J., van den Hoek, M., & Armstrong, E. L. (2015). Spearman’s hypothesis and Amerindians: A meta-analysis. Intelligence, 50, 87-92.