This is a post in the on-going series of comments on studies of international/transracial adoption. A global genetic/hereditarian model of cognitive differences and their socioeconomic effects implies that adoptees from different populations/countries/regions should show the usual group differences in the above mentioned traits and outcomes, all else equal. All else is of course not equal since adoptees from different regions can be adopted at different ages, experience different environment leading up to the adoption, possibly experience different environments after adoption thru no cause of their own (discrimination), and so on. It is not a strict test: finding the usual group differences can be explained by non-genetic factors, and finding no differences or unexpected ones could be consistent with a genetic model given strong non-genetic effects such as differences in adoption practices between origin countries/regions/populations. However, were such differences to be found relatively consistently in sufficiently powered studies, it would be an important prediction verified by the genetic model, broadly speaking. The question thus remains: what do the studies show?

Bruce et al (2009) – Disinhibited Social Behavior Among Internationally Adopted Children

Their abstract reads:

Postinstitutionalized children frequently demonstrate persistent socioemotional difficulties. For example, some postinstitutionalized children display an unusual lack of social reserve with unfamiliar adults. This behavior, which has been referred to as indiscriminate friendliness, disinhibited attachment behavior, and disinhibited social behavior, was examined by comparing children internationally adopted from institutional care to children internationally adopted from foster care and children raised by their biological families. Etiological factors and behavioral correlates were also investigated. Both groups of adopted children displayed more disinhibited social behavior than the nonadopted children. Of the etiological factors examined, only the length of time in institutional care was related to disinhibited social behavior. Disinhibited social behavior was not significantly correlated with general cognitive ability, attachment-related behaviors, or basic emotion abilities. However, this behavior was negatively associated with inhibitory control abilities even after controlling for the length of time in institutional care. These results suggest that disinhibited social behavior might reflect underlying deficits in inhibitory control.

While this does not seem immediately relevant, the authors do investigate IQ. The study design is a three-way comparison between adoptees in foster homes, institutional care and non-adoptees. The sample are small: 40 x 40 x 40. These are children who spent most of their lives first in foster homes and were then adopted, children who spent most of their lives in institutional care and then adopted, and biological children of the adoptive families for comparison. The groups were not equal in origin composition:

However, because countries have institutional or foster systems to care for wards of the state, the institutional care and foster care groups differed in terms of country of origin. The institutional care group was primarily from Eastern Europe (45%) and China (43%), whereas the foster care group was primarily from South Korea (80%).

Their cognitive measure is:

General cognitive ability.To provide an estimate of the children’s general cognitive functioning, each child was administered the vocabulary and block design subtests of the Wechsler Intelligence Scale for Children, 3rd edition (Wechsler, 1991). These subtests are considered the best measures of verbal and nonverbal intelligence, respectively, and are highly correlated with the full scale intelligence quotient (Sattler, 1992). Raw scores on the subtests were converted into age-normed scaled scores. The scaled scores were then summed and transformed into full scale intelligence quotient equivalents.

They give the mean FSIQ by the above three groups, not origin groups. These are:

Group Institutional care Foster care Control
FSIQ 102.68 (16.25) 109.37 (12.93) 117.11 (15.88)

Note: numbers in parentheses are SDs.

The high scores is presumably due to the FLynn effect. The IQ test is from 1991, but the study is from 2009, so there has been 18 years for raw score gains compared to the normative sample. An alternative idea is that the families were just above average SES/IQ which boosts the IQ scores of younger children. Nearly 100% of the adoptive families were Caucasian (presumably European) and were part of the Minnesota International Adoption Project Registry. According to Wiki, the non-Hispanic % of this state is 83%, so Caucasians are a bit overrepresented. In general, these elevation effects are not important when comparing groups within a study.

I contacted the first author to ask if she would give me some more data, and she obliged:

Please find the requested information below. For each region of origin, I provided the number of children, mean, and standard deviation for the Block Design Standard Score, Vocabulary Standard Score, and full scale IQ equivalent. I did not provide these statistics for regions with less than 5 children. Please let me know if you have any questions

Origin N FSIQ Vocabulary Block design
Institutional care
Eastern Europe (e.g., Russia, Romania, Ukraine) 18 94.84 (15.070) 8.39 (2.615) 8.39 (2.615)
China 17 111.77 (13.774) 11.76 (2.969) 12.29 (3.274)
Foster care
South Korea 31/32 110.38 (12.457) 12.10 (2.970) 11.53 (2.940)
Guatemala 5 104.64 (16.609) 12.40 (3.050) 9.20 (3.033)

Notes: Sample size for S. Korea was 31, 31, 32. No explanation given. Numbers in parentheses are SDs.

Again we see that East Asians do well, altho not better than the control children (mean=117). Eastern European do less well, but it is hard to say the exact expected mean since it is not stated how many come from which countries. The IQ of Lynn and Vanhanen (2012) gives 91 as the IQ of Romania, 96.6 for Russia and 94.3 for Ukraine. The Guatemalans certainly do better than expected (79 IQ), but N=5, and the demographics of Guatemala is very mixed making it possible that the adoptees actually had predominantly European ancestry.

The email reply

Hello Emil,
Please find the requested information below. For each region of origin, I provided the number of children, mean, and standard deviation for the Block Design Standard Score, Vocabulary Standard Score, and full scale IQ equivalent. I did not provide these statistics for regions with less than 5 children. Please let me know if you have any questions
Thank you for your interest in our publication,

Institutional care group:
Eastern Europe (e.g., Russia, Romania, Ukraine)
N Mean Std. Deviation
Block Design Standard Score 18 9.83 3.884
Vocabulary Standard Score 18 8.39 2.615
IQ equivalent 18 94.84 15.070

N Mean Std. Deviation
Block Design Standard Score 17 12.29 3.274
Vocabulary Standard Score 17 11.76 2.969
IQ equivalent 17 111.77 13.774

Foster care group:
South Korea
N Mean Std. Deviation
Block Design Standard Score 32 11.53 2.940
Vocabulary Standard Score 31 12.10 2.970
IQ equivalent 31 110.38 12.457

N Mean Std. Deviation
Block Design Standard Score 5 9.20 3.033
Vocabulary Standard Score 5 12.40 3.050
IQ equivalent 5 104.64 16.609


Bruce, J., Tarullo, A. R., & Gunnar, M. R. (2009). Disinhibited social behavior among internationally adopted children. Development and psychopathology, 21(01), 157-171.

I reanalyze data reported by Richard Lynn in a 1979 paper concerning IQ and socioeconomic variables in 12 regions of the United Kingdom as well as Ireland. I find a substantial S factor across regions (66% of variance with MinRes extraction). I produce a new best estimate of the G scores of regions. The correlation of this with the S scores is .79. The MCV with reversal correlation is .47.

The interdisciplinary academic field examining the effect of general intelligence on large scale social phenomena has been called social ecology of intelligence by Richard Lynn (1979, 1980) and sociology of intelligence by Gottfredson (1998). One could also call it cognitive sociology by analogy with cognitive epidemiology (Deary, 2010; Special issue in Intelligence Volume 37, Issue 6, November–December 2009; Gottfredson, 2004). Whatever the name, it is a field that has received renewed attention recently. Richard Lynn and co-authors report data on Italy (Lynn 2010a, 2010b, 2012a, Piffer and Lynn 2014, see also papers by critics), Spain (Lynn 2012b), China (Lynn and Cheng, 2013) and India (Lynn and Yadav, 2015). Two of his older studies cover the British Isles and France (Lynn, 1979, 1980).

A number of my recent papers have reanalyzed data reported by Lynn, as well as additional data I collected. These cover Italy, India, United States, and China (Kirkegaard 2015a, 2015b, 2015c, 2015d). This paper reanalyzes Lynn’s 1979 paper.

Cognitive data and analysis

Lynn’s paper contains 4 datasets for IQ data that covers 11 regions in Great Britain. He further summarizes some studies that report data on Northern Ireland and the Republic of Ireland, so that his cognitive data covers the entire British Isles. Lynn only uses the first 3 datasets to derive a best estimate of the IQs. The last dataset does not report cognitive scores as IQs, but merely percentages of children falling into certain score intervals. Lynn converts these to a mean (method not disclosed). However, he is unable to convert this score to the IQ scale since the inter-personal standard deviation (SD) is not reported in the study. Lynn thus overlooks the fact that one can use the inter-regional SD from the first 3 studies to convert the 4th study to the common scale. Furthermore, using the intervals one could presumably estimate the inter-personal SD, altho I shall not attempt this. The method for converting the mean scores to the IQ score is this:

  1. Standardize the values by subtracting the mean and dividing by the inter-regional SD.
  2. Calculate the inter-regional SD in the other studies, and find the mean of these. Do the same for the inter-regional means.
  3. Multiple the standardized scores by the mean inter-regional SD from the other studies and add the inter-regional mean.

However, I did not use this method. I instead factor analyzed the four 4 IQ datasets as given and extracted 1 factor (extraction method = MinRes). All factor loadings were strongly positive indicating the existence of the G factor among the regions. The factor score from this analysis was put on the same scale as the first 3 studies by the method above. This is necessary because the IQs for Northern Ireland and the Republic of Ireland are given on that scale. Table 1 shows the correlations between the cognitive variables. The correlations between G and the 4 indicator variables are their factor loadings (italic).

Table 1 – Correlations between cognitive datasets Douglas Davis G Lynn.mean 1 0.66 0.92 0.62 0.96 0.92 0.66 1 0.68 0.68 0.75 0.89
Douglas 0.92 0.68 1 0.72 0.99 0.93
Davis 0.62 0.68 0.72 1 0.76 0.74
G 0.96 0.75 0.99 0.76 1 0.96
Lynn.mean 0.92 0.89 0.93 0.74 0.96 1


It can be noted that my use of factor analysis over simply averaging the datasets had little effect. The correlation of Lynn’s method (mean of datasets 1-3) and my G factor is .96.

Socioeconomic data and analysis

Lynn furthermore reports 7 socioeconomic variables. I quote his description of these:

“1. Intellectual achievement: (a) first-class honours degrees. All first-class honours graduates of the year 1973 were taken from all the universities in the British Isles (with the exception of graduates of Birkbeck College, a London College for mature and part-time students whose inclusion would bias the results in favour of London). Each graduate was allocated to the region where he lived between the ages of 11 and 18. This information was derived from the location of the graduate’s school. Most of the data were obtained from The Times, which publishes annually lists of students obtaining first-class degrees and the schools they attended. Students who had been to boarding schools were written to requesting information on their home residence. Information from the Republic of Ireland universities was obtained from the college records.

The total number of students obtaining first-class honours degrees was 3477, and information was obtained on place of residence for 3340 of these, representing 96 06 per cent of the total.
There are various ways of calculating the proportions of first-class honours graduates produced by each region. Probably the most satisfactory is to express the numbers of firsts in each region per 1000 of the total age cohorts recorded in the census of 1961. In this year the cohorts were approximately 9 years old. The reason for going back to 1961 for a population base is that the criterion taken for residence is the school attended and the 1961 figures reduce the distorting effects of subsequent migration between the regions. However, the numbers in the regions have not changed appreciably during this period, so that it does not matter greatly which year is taken for picking up the total numbers of young people in the regions aged approximately 21 in 1973. (An alternative method of calculating the regional output of firsts is to express the output as a percentage of those attending university. This method yields similar figures.)

2. Intellectual achievement: (b) Fellowships of the Royal Society. A second measure of intellectual achievement taken for the regions is Fellowships of the Royal Society. These are well-known distinctions for scientific work in the British Isles and are open equally to citizens of both the United Kingdom and the Republic of Ireland. The population consists of all Fellows of the Royal Society elected during the period 1931-71 who were born after the year 1911. The number of individuals in this population is 321 and it proved possible to ascertain the place of birth of 98 per cent of these. The Fellows were allocated to the region in which they were born and the numbers of Fellows born in each region were then calculated per million of the total population of the region recorded in the census of 1911. These are the data shown in Table 2. The year 1911 was taken as the population base because the majority of the sample was born between the years 1911-20, so that the populations in 1911 represent approximately the numbers in the regions around the time most of the Fellows were born. (The populations of the regions relative to one another do not change greatly over the period, so that it does not make much difference to the results which census year is taken for the population base.)

3. Per capita income. Figures for per capita incomes for the regions of the United Kingdom are collected by the United Kingdom Inland Revenue. These have been analysed by McCrone (1965) for the standard regions of the UK for the year 1959/60. These results have been used and a figure for the Republic of Ireland calculated from the United Nations Statistical Yearbook.

4. Unemployment. The data are the percentages of the labour force unemployed in the regions for the year 1961 (Statistical Abstracts of the UK and of Ireland).

5. Infant mortality. The data are the numbers of deaths during the first year of life expressed per 1000 live births for the year 1961 (Registrar Generals’ Reports).

6. Crime. The data are offences known to the police for 1961 and expressed per 1000 population (Statistical Abstracts of the UK and of Ireland).

7. Urbanization. The data are the percentages of the population living in county boroughs, municipal boroughs and urban districts in 1961 (Census).”

Lynn furthermore reports historical achievement scores as well as an estimate of inter-regional migration (actually change in population which can also be due to differential fertility). I did not use these in my analysis but they can be found in the datafile in the supplementary material.

Since there are 13 regions in total and 7 variables, I can analyze all variables at once and still almost conform to the rule of thumb of having a case-to-variable ratio of 2 (Zhao, 2009). Table 2 shows the factor loadings from this factor analysis as well as the correlation with G for each socioeconomic variable.

Table 2 – Correlations between S, S indicators, and G
Variable S G
Fellows.RS 0.92 0.92
First.class 0.55 0.58
Income 0.99 0.72
Unemployment -0.85 -0.79
Infant.mortality -0.68 -0.69
Crime 0.83 0.52
Urbanization 0.88 0.64
S 1 0.79


The crime variable had a strong positive loading on the S factor and also a positive correlation with the G factor. This is in contrast to the negative relationship found at the individual-level between the g factor and crime variables at about r=-.2 (Neisser 1996). The difference in mean IQ between criminal and non-criminal samples is usually around 7-15 points depending on which criminal group (sexual, violent and chronic offenders score lower than other offenders; Guay et al, 2005). Beaver and Wright (2011) found that IQ of countries was also negatively related to crime rates, r’s range from -.29 to -.58 depending on type of crime variable (violent crimes highest). At the level of country of origin groups, Fuerst and Kirkegaard (2014a) found that crime variables had strong negative loadings on the S factor (-.85 and -.89) and negative correlations with country of origin IQ. Altho not reported in the paper, Kirkegaard (2014b) found that the loading of 2 crime variables on the S factor in Norway among country of origin groups was -.63 and -.86 (larceny and violent crime; calculated using the supplementary material using the fully imputed dataset). Kirkegaard (2015a) found S loadings of .16 and -.72 of total crime and intentional homicide variables in Italy. Among US states, Kirkegaard (2015c) found S loadings of -.61 and -.71 for murder rate and prison rate.

So, the most similar finding in previous research is that from Italy. There are various possible explanations. Lynn (1979) thinks it is due to large differences in urbanization (which loads positively in multiple studies, .88 in this study). There may be some effect of the type of crime measurement. Future studies could examine this question by employing many different crime variables. My hunch is that it is a combination of differences in urbanization (which increases crime), immigration of crime prone persons into higher S areas, and differences in the justice system between areas.

Method of correlated vectors (MCV)

As done in the previous analysis of S factors, I performed MCV analysis to see whether the G factor was the reason for the association with the G factor score. S factor indicators with negative loadings were reversed to avoid inflating the result (these are marked with “_r” in the plot). The result is shown in Figure 1.


As in the previous analyses, the relationship was positive even after reversal.

Per capita income and the FLynn effect

An interesting quote from the paper is:

This interpretation [that the first factor of his factor analysis is intelligence] implies that the mean population IQs should be regarded as the cause of the other variables. When causal relationships between the variables are considered, it is obvious that some of the variables are dependent on others. For instance, people do not become intelligent as a consequence of getting a first-class honours degree. Rather, they get firsts because they are intelligent. The most plausible alternative causal variable, apart from IQ, is per capita income, since the remaining four are clearly dependent variables. The arguments against positing per capita income as the primary cause among this set of variables are twofold. First, among individuals it is doubtful whether there is any good evidence that differences in income in affluent nations are a major cause of differences in intelligence. This was the conclusion reached by Burt (1943) in a discussion of this problem. On the other hand, even Jencks (1972) admits that IQ is a determinant of income. Secondly, the very substantial increases in per capita incomes that have taken place in advanced Western nations since 1945 do not seem to have been accompanied by any significant increases in mean population IQ. In Britain the longest time series is that of Burt (1969) on London schoolchildren from 1913 to 1965 which showed that the mean IQ has remained approximately constant. Similarly in the United States the mean IQ of large national samples tested by two subtests from the WISC has remained virtually the same over a 16 year period from the early 1950s to the mid-1960s (Roberts, 1971). These findings make it doubtful whether the relatively small differences in per capita incomes between the regions of the British Isles can be responsible for the mean IQ differences. It seems more probable that the major causal sequence is from the IQ differences to the income differences although it may be that there is also some less important reciprocal effect of incomes on IQ. This is a problem which could do with further analysis.

Compare with Lynn’s recent overview of the history of the FLynn effect (Lynn, 2013).


  • Beaver, K. M.; Wright, J. P. (2011). “The association between county-level IQ and county-level crime rates”. Intelligence 39: 22–26. doi:10.1016/j.intell.2010.12.002
  • Deary, I. J. (2010). Cognitive epidemiology: Its rise, its current issues, and its challenges. Personality and individual differences, 49(4), 337-343.
  • Guay, J. P., Ouimet, M., & Proulx, J. (2005). On intelligence and crime: A comparison of incarcerated sex offenders and serious non-sexual violent criminals. International journal of law and psychiatry, 28(4), 405-417.
  • Gottfredson, L. S. (1998). Jensen, Jensenism, and the sociology of intelligence. Intelligence, 26(3), 291-299.
  • Gottfredson, L. S. (2004). Intelligence: is it the epidemiologists’ elusive” fundamental cause” of social class inequalities in health?. Journal of personality and social psychology, 86(1), 174.
  • Intelligence, Special Issue: Intelligence, health and death: The emerging field of cognitive epidemiology. Volume 37, Issue 6, November–December 2009
  • Kirkegaard, E. O. W., & Fuerst, J. (2014a). Educational attainment, income, use of social benefits, crime rate and the general socioeconomic factor among 71 immigrant groups in Denmark. Open Differential Psychology.
  • Kirkegaard, E. O. W. (2014b). Crime, income, educational attainment and employment among immigrant groups in Norway and Finland. Open Differential Psychology.
  • Kirkegaard, E. O. W. (2015a). S and G in Italian regions: Re-analysis of Lynn’s data and new data. The Winnower.
  • Kirkegaard, E. O. W. (2015b). Indian states: G and S factors. The Winnower.
  • Kirkegaard, E. O. W. (2015c). Examining the S factor in US states. The Winnower.
  • Kirkegaard, E. O. W. (2015d). The S factor in China. The Winnower.
  • Lynn, R. (1979). The social ecology of intelligence in the British Isles. British Journal of Social and Clinical Psychology, 18(1), 1-12.
  • Lynn, R. (1980). The social ecology of intelligence in France. British Journal of Social and Clinical Psychology, 19(4), 325-331.
  • Lynn, R. (2010a). In Italy, north–south differences in IQ predict differences in income, education, infant mortality, stature, and literacy. Intelligence, 38(1), 93-100.
  • Lynn, R. (2010b). IQ differences between the north and south of Italy: A reply to Beraldo and Cornoldi, Belacchi, Giofre, Martini, and Tressoldi. Intelligence, 38(5), 451-455.
  • Lynn, R. (2012a). IQs in Italy are higher in the north: A reply to Felice and Giugliano. Intelligence, 40(3), 255-259.
  • Lynn, R. (2012b). North-south differences in Spain in IQ, educational attainment, per capita income, literacy, life expectancy and employment. Mankind Quarterly, 52(3/4), 265.
  • Lynn, R. (2013). Who discovered the Flynn effect? A review of early studies of the secular increase of intelligence. Intelligence, 41(6), 765-769.
  • Lynn, R., & Cheng, H. (2013). Differences in intelligence across thirty-one regions of China and their economic and demographic correlates. Intelligence, 41(5), 553-559.
  • Lynn, R., & Yadav, P. (2015). Differences in cognitive ability, per capita income, infant mortality, fertility and latitude across the states of India. Intelligence, 49, 179-185.
  • Neisser, Ulric et al. (February 1996). “Intelligence: Knowns and Unknowns”. American Psychologist 52 (2): 77–101, 85.
  • Piffer, D., & Lynn, R. (2014). New evidence for differences in fluid intelligence between north and south Italy and against school resources as an explanation for the north–south IQ differential. Intelligence, 46, 246-249.
  • Zhao, N. (2009). The Minimum Sample Size in Factor Analysis. Encorewiki.


This is another relatively small international adoption study. 159 children were adopted in homes in the Netherlands from Sri Lanka, Korea and Colombia.

Participants are described as:

The present study examines the development and adjustment of 159 adopted children at age 7 years. The largest group, 129 adopted children, was selected from 2 studies, starting when the child was aged 5 months. In these studies a short-term early intervention was implemented in three sessions at home between 6 and 9 months in an experimental group, and results were compared with a control group. The families for this experiment were randomly recruited through Dutch adoption organizations, and not selected on (future) problems. Also, to avoid selection, the parents were not aware of the intervention when they entered the study. They were requested to participate in a study examining the development of adopted children. The results of the intervention study were reported elsewhere (Juffer, Hoksbergen, Riksen-Walraven, & Kohnstamm, 1997 ; Stams, Juffer, Van IJzendoorn, & Hoksbergen, in press). The intervention was not repeated during the following years. The original studies involved 70 mixed families, i.e., adoptive families with biological children, and 90 all-adoptive families, i.e., adoptive families without biological children. As intervention effects were found at age 7 in a small intervention group of 20 mixed families (Stams et al., in press), we decided to omit this group from the present study. The remaining sample consisted of 55 intervention and 74 control families. An additional group of 30 families, matched on the original criteria, was randomly recruited from one adoption agency at age 7, serving as a post-test-only group. The absence of intervention or testing effects on any of the outcome measures was confirmed in preliminary analyses, contrasting intervention with control groups, and control groups with the post-test-only group recruited at age 7, respectively.

The adoptive parents were Caucasian white, and in all families the mother was the primary caregiver. The families were predominantly from middle-class or upper middle-class backgrounds. The attrition rate was 8 %, that is, 11 of 140 participants from the original studies. The major reasons for declining were disinterest or health problems of family members. Four mothers had died of incurable illnesses. A series of separate Bonferroni-corrected statistical tests confirmed the absence of differential attrition in the total sample with respect to child background variables, such as age at placement, and family background variables, such as socioeconomic status or family type (with or without biological children).

The children, 73 boys and 86 girls, were adopted from Sri Lanka (N=108), South Korea (N=37), and Colombia (N=14). The infants from Sri Lanka were in the care of their biological mother until their adoption placement at a mean age of 7 weeks (SD=3). Korean and Colombian infants stayed in an institution or foster home after separation from their biological mother at birth, until their adoption placement at a mean age of 15 weeks (SD=4). In comparison with adoptions from Romania, for example (O’Connor et al., 2000 ; Rutter et al., 1998), the material conditions in the Korean and Colombian institutions were relatively favorable, as these homes received substantial support from a Dutch adoption agency. However, little is known about the quality of care, whereas one may assume that frequent changes of caretakers and nonoptimal child–caretaker ratios, often found in institutions, resulted in less favorable socioemotional conditions (O’Connor, Bredenkamp, Rutter, & the ERA Study Team, 1999). Little is known about the child-rearing conditions of the Sri Lankian infants after birth. Based on anecdotal evidence from parent reports, pre- and post-natal care for the relinquishing mother and her baby were far from optimal in Sri Lanka, and the health condition of the mother was deplorable in many cases (Juffer, 1993).

The paper is primarily about some socioemotional variables of no particular interest to me. However, they also accessed cognitive ability with the following test:

Intelligence. Intelligence was measured with the abbreviated Revised Amsterdam Child Intelligence Test (RACIT). Bleichrodt, Drenth, Zaal, and Resing (1987) found empirical evidence for convergent validity, as the RACIT correlatedr¯ ±86 with the Wechsler Intelligence Scale for Children-Revised (WISC-R). At age 7, the abbreviated RACIT correlatedr¯±92 with the full RACIT. The abbreviated RACIT showed a somewhat lower test–retest reliability, namely, r¯±86 versus r¯±88, and a somewhat lower internal consistency, namely, α¯±90 versusα¯±94, than the full RACIT. The abbreviated RACIT does not seem to underestimate or overestimate the level of individual intelligence.

In the present study, we used the abbreviated RACIT, which consisted of the following subtests : flexibility of closure (α¯ ±84), paired associates (split-half reliability¯±77), perceptual reasoning (split-half reliability¯±73), vocabulary (α¯±74), inductive reasoning (α¯±86), ideational fluency (α¯±81). The reliability of the abbreviated RACIT was .91 (N¯163), and was estimated on the basis of the number of subtests, the reliabilities of the subtests, and the correlations between the subtests (Nunally, 1978). The raw scores were transformed to standardized intelligence scores with a mean of 100 (SD15). The standardized scores were derived from a representative sample of 1415 children between age 4 and 11, drawn from the Dutch general school population in 1982 (Bleichrodt et al., 1987).

The results of the testing at age 7 are:

table 8 stams

The adoptive families in adoption studies are usually somewhat above average which boosts the IQ of especially younger children. This along with the FLynn effect is probably responsible for the higher than 100 scores. But among the groups, we see Koreans on top as is usually seen in these studies.



Item-level data from Raven’s Standard Progressive Matrices was compiled for 12 diverse groups from previously published studies. The method of correlated vectors was used on every possible pair of groups with available data (45 comparisons). Depending on exact method chosen, the mean and mean MCV correlation was about .46/51. Only 2/1 of 45 were negative. Spearman’s hypothesis is confirmed for item-level data from the Standard Progressive Matrices.

Introduction and method

The method of correlated vectors (MCV) is a statistical method invented by Arthur Jensen (1998, p. 371, see also appendix B). The purpose of it is to measure to which degree a latent variable is responsible for an observed correlation between an aggregate measure and a criteria variable. Jensen had in mind the general factor of cognitive ability data (the g factor) as measured by various IQ tests and their subtests, and criteria variables such as brain size, however the method is applicable to any latent trait (e.g. general socioeconomic factor Kirkegaard, 2014a, b). When this method is applied to group differences, particularly ethnoracial ones, it is called Spearman’s hypothesis (SH) because Spearman was the first to note it in his 1927 book.

By now, several large studies and meta-analysis of MCV results for group differences have been published (te Nijenhuis et al (2015a, 2015b, 2014), Jensen (1985)). These studies generally support the hypothesis. Almost all studies use subtest loadings instead of item loadings. This is probably because psychologists are reluctant to share their data (Wicherts, et al, 2006) and as a result there are few open datasets available to use for this purpose. Furthermore, before the introduction of modern computers and the internet, it was impractical to share item-level data. There are advantages and disadvantages to using item-level data over subtest-level data. There are more items than subtests which means that the vectors will be longer and thus sampling error will be smaller. On the other hand, items are less reliable and less pure measures of the g factor which introduces both error and more non-g ability variance.

The recent study by Nijenhuis et al (2015a) however, employed item-level data from Raven’s Standard Progressive Matrices (SPM) and included a diverse set of samples (Libyan, Russian, South African, Roma from Serbia, Moroccan and Spanish). The authors did not use their collected data to its full extent, presumably because they were comparing the groups (semi-)manually. To compare all combinations with a dataset of e.g. 10 groups means that one has to do 45 comparisons (10*9/2). However, this task can easily be overcome with programming skills, and I thus saw an opportunity to gather more data regarding SH.

The authors did not provide the data in the paper despite it being easy to include it in tables. However, the data was available from the primary studies they cited in most cases. Thus, I collected the data from their data sources (it can be found in the supplementary material). This resulted in data from 12 samples of which 10 had both difficulty and item-whole correlations data. Table 1 gives an overview of the datasets:

Table 1- Overview of samples
Short name Race Selection N Year Ref Country Description
A1 African Undergraduates 173 2000 Rushton and Skuy 2000 South Africa University of the Witwatersrand and the Rand Afrikaans University in Johannesburg, South Africa
W1 European Undergraduates 136 2000 Rushton and Skuy 2000 South Africa University of the Witwatersrand and the Rand Afrikaans University in Johannesburg, South Africa
W2 European Std 7 classes 1056 1992 Owen 1992 South Africa 20 schools in the Pretoria-Witwatersrand-Vereeniging (PWV) area and 10 schools in the Cape Peninsula
C1 Colored (African European) Std 7 classes 778 1992 Owen 1992 South Africa 20 coloured schools in the Cape Peninsula
I1 Indian Std 7 classes 1063 1992 Owen 1992 South Africa 30 schools selected at random from the list of high schools in and around Durban
A2 African Std 7 classes 1093 1992 Owen 1992 South Africa Three schools in the PWV area and 25 schools in KwaZulu (Natal)
A3 African First year Engineering students 198 2002 Rushton et al 2002 South Africa First-year students from the Faculties of Engineering and the Built Environment at the University of the Witwatersrand
I2 Indian First year Engineering students 58 2002 Rushton et al 2002 South Africa First-year students from the Faculties of Engineering and the Built Environment at the University of the Witwatersrand
W3 European First year Engineering students 86 2002 Rushton et al 2002 South Africa First-year students from the Faculties of Engineering and the Built Environment at the University of the Witwatersrand
R1 Roma Adults ages 16 to 66 231 2004.5 Rushton et al 2007 Serbia The communities (i.e., Drenovac, Mirijevo, and Rakovica) are in the vicinity of Belgrade
W4 European Adults ages 18 to 65 258 2012 Diaz et al 2012 Spain Mainly from the city of Valencia
NA1 North African Adults ages 18 to 50 202 2012 Diaz et al 2012 Morocco Casablanca, Marrakech, Meknes and Tangiers


Item-whole correlations and item loadings

The data in the papers did usually not contain the actual factor loadings of the items. Instead, they contained the item-whole correlations. The authors argue that one can use these because of the high correlation of unweighted means with extracted g-factors (often, r=.99, e.g. Kirkegaard, in review). Some studies did provide both loadings and item-whole correlations, yet the authors did not correlate them to see how good proxies the item-whole correlations are for the loadings. I calculated this for the 4 studies that included both metrics. Results are shown in Table 2.

Table 2 – Item-whole correlations x g-loadings in 4 studies.
W2 C1 I1 A2
W2 0.549 0.099 0.327 0.197
C1 0.695 0.900 0.843 0.920
I1 0.616 0.591 0.782 0.686
A2 0.626 0.882 0.799 0.981

Note: Within sample correlations between item-whole correlations and item factor loadings are in the diagonal, marked with italic.

As can be seen, the item-whole correlations were not in all cases great proxies for the actual loadings.

To further test this idea, I calculated the item-whole correlations and the factor loadings (first factor, minimum residuals) in the open Wicherts dataset (N=500ish, Dutch university students, see Wicherts and Bakker 2012) tested on Raven’s Advanced Progressive Matrices. The correlation was .89. Thus, aside from the odd result in the W2 sample, item-whole correlations were a reasonable proxy for the factor loadings.

Item difficulties across samples

If two groups are tested on the same test and this test measures the same trait in both groups, then even if the groups have different mean trait levels, the order of difficulty of the items or subtests should be similar. Rushton et al (2000, 2002, 2007) have examined this in previous studies and found it generally to be the case. Table 3 below shows the cross-sample correlations of item difficulties.

Table 3 – Intercorrelations between item difficulties in 12 samples
A1 W1 W2 C1 I1 A2 A3 I2 W3 R1 NA1 W4
A1 1 0.88 0.98 0.96 0.99 0.86 0.96 0.89 0.79 0.89 0.95 0.93
W1 0.88 1 0.93 0.79 0.87 0.65 0.96 0.97 0.94 0.7 0.92 0.95
W2 0.98 0.93 1 0.95 0.98 0.82 0.97 0.92 0.84 0.86 0.96 0.95
C1 0.96 0.79 0.95 1 0.98 0.94 0.89 0.81 0.69 0.95 0.92 0.87
I1 0.99 0.87 0.98 0.98 1 0.88 0.95 0.88 0.79 0.91 0.95 0.92
A2 0.86 0.65 0.82 0.94 0.88 1 0.76 0.68 0.56 0.97 0.82 0.76
A3 0.96 0.96 0.97 0.89 0.95 0.76 1 0.96 0.9 0.8 0.95 0.96
I2 0.89 0.97 0.92 0.81 0.88 0.68 0.96 1 0.92 0.72 0.91 0.92
W3 0.79 0.94 0.84 0.69 0.79 0.56 0.9 0.92 1 0.6 0.88 0.91
R1 0.89 0.7 0.86 0.95 0.91 0.97 0.8 0.72 0.6 1 0.86 0.8
NA1 0.95 0.92 0.96 0.92 0.95 0.82 0.95 0.91 0.88 0.86 1 0.97
W4 0.93 0.95 0.95 0.87 0.92 0.76 0.96 0.92 0.91 0.8 0.97 1


The mean intercorrelation is.88. This is quite remarkable given the diversity of the samples.

Item-whole correlations across samples

Given the above, one might expect similar results for the item-whole correlations. This however is not so. Results are in Table 4.

Table 4 – Intercorrelations between item-whole correlations in 10 samples
A1 W1 W2 C1 I1 A2 A3 I2 W3 R1
A1 1 -0.2 0.59 0.58 0.73 0.54 0.27 0.04 -0.3 0.57
W1 -0.2 1 0.17 -0.59 -0.25 -0.68 0.42 0.51 0.55 -0.55
W2 0.59 0.17 1 0.44 0.79 0.29 0.61 0.25 0.02 0.39
C1 0.58 -0.59 0.44 1 0.79 0.94 0.01 -0.25 -0.49 0.78
I1 0.73 -0.25 0.79 0.79 1 0.69 0.42 0.09 -0.33 0.63
A2 0.54 -0.68 0.29 0.94 0.69 1 -0.13 -0.3 -0.52 0.77
A3 0.27 0.42 0.61 0.01 0.42 -0.13 1 0.26 0.37 0.02
I2 0.04 0.51 0.25 -0.25 0.09 -0.3 0.26 1 0.34 -0.21
W3 -0.3 0.55 0.02 -0.49 -0.33 -0.52 0.37 0.34 1 -0.49
R1 0.57 -0.55 0.39 0.78 0.63 0.77 0.02 -0.21 -0.49 1

Note: The last two samples, NA1 and W4, did not have item-whole correlation data.

The reason for this state of affairs is that the factor loadings change when the group mean trait level changes. For many samples, most of the items were too easy (passing rates at or very close to 100%). When there is no variation in a variable, one cannot calculate a correlation to some other variable. This means that for a substantial number of items for multiple samples, there was missing data for the items.

The lack of cross-sample consistency in item-whole correlations may also explain the weak MCV results in Diaz et al, 2012 since they used g-loadings from another study instead of from their own samples.

Spearman’s hypothesis using one static vector of estimated factor loadings

Some of the sample had rather low sample sizes (I2, N=58, W3, N=86). Thus one might get the idea to use the item-whole correlations from one or more of the large samples for comparisons involving other groups. In fact, given the instability of item-whole correlations across sample as can be seen in Table 4, this is a bad idea. However, for sake of completeness, I calculated the results based on this anyway. As the best estimate of factor loadings, I averaged the item-whole correlations data from the four largest samples (W2, C1, I1 and A2).

Using this vector of item-whole correlations, I used MCV on every possible sample comparison. Because there were 12 samples, this number is 66. MCV analysis was done by subtracting the lower scoring sample’s item difficulties from the higher scoring sample’s thus producing a vector of the sample difference on each item. This vector I correlated with the vector of item-whole correlations.  The results are shown in Table 5.

Table 5 – MCV correlations of group differences across 12 samples using 1 static item-whole correlations
A1 W1 W2 C1 I1 A2 A3 I2 W3 R1 NA1 W4
A1 NA -0.15 0.2 0.42 0.1 0.83 -0.03 -0.12 -0.26 0.8 -0.35 -0.32
W1 -0.15 NA -0.31 0.07 -0.14 0.47 -0.29 -0.19 -0.4 0.4 0.31 0.06
W2 0.2 -0.31 NA 0.56 0.4 0.86 -0.27 -0.28 -0.38 0.83 -0.35 -0.46
C1 0.42 0.07 0.56 NA 0.53 0.88 0.23 0.11 -0.06 0.64 -0.23 -0.05
I1 0.1 -0.14 0.4 0.53 NA 0.88 -0.02 -0.1 -0.24 0.83 -0.45 -0.29
A2 0.83 0.47 0.86 0.88 0.88 NA 0.66 0.52 0.32 0.2 0.42 0.4
A3 -0.03 -0.29 -0.27 0.23 -0.02 0.66 NA -0.17 -0.41 0.61 0.43 -0.43
I2 -0.12 -0.19 -0.28 0.11 -0.1 0.52 -0.17 NA -0.37 0.46 0.33 -0.25
W3 -0.26 -0.4 -0.38 -0.06 -0.24 0.32 -0.41 -0.37 NA 0.23 -0.05 -0.11
R1 0.8 0.4 0.83 0.64 0.83 0.2 0.61 0.46 0.23 NA 0.36 0.32
NA1 -0.35 0.31 -0.35 -0.23 -0.45 0.42 0.43 0.33 -0.05 0.36 NA -0.03
W4 -0.32 0.06 -0.46 -0.05 -0.29 0.4 -0.43 -0.25 -0.11 0.32 -0.03 NA


As one can see, the results are all over the place. The mean MCV correlation is .12.

Spearman’s hypothesis using a variable vector of estimated factor loadings

Since item-whole correlations varied from sample to sample, another idea is to use the samples’ item-whole correlations. I used the unweighted mean of the item-whole correlations for each item (te Nijenhuis et al used a weighted mean). In some cases, only one sample has item-whole correlations for some items (because the other sample had no variance on the item, i.e. 100% get it right). In these cases, one can choose to use the value from the remaining sample, or one can ignore the item and calculated MCV based on the remaining items. I calculated results using both methods, they are shown in Table 6 and 7.

Table 6 – MCV correlations of group differences across 10 samples using variable item-whole correlations, method 1
A1 W1 W2 C1 I1 A2 A3 I2 W3 R1
A1 NA 0.79 0.39 0.29 0.05 0.7 0.48 0.41 0.37 0.71
W1 0.79 NA 0.8 0.5 0.76 0.51 0.79 0.4 0.6 0.54
W2 0.39 0.8 NA 0.68 0.63 0.85 0.43 0.47 0.5 0.79
C1 0.29 0.5 0.68 NA 0.52 0.88 0.32 0.3 -0.09 0.67
I1 0.05 0.76 0.63 0.52 NA 0.84 0.47 0.4 0.31 0.79
A2 0.7 0.51 0.85 0.88 0.84 NA 0.57 0.43 -0.03 0.22
A3 0.48 0.79 0.43 0.32 0.47 0.57 NA 0.38 0.66 0.64
I2 0.41 0.4 0.47 0.3 0.4 0.43 0.38 NA 0.6 0.44
W3 0.37 0.6 0.5 -0.09 0.31 -0.03 0.66 0.6 NA 0.2
R1 0.71 0.54 0.79 0.67 0.79 0.22 0.64 0.44 0.2 NA


Table 6 – MCV correlations of group differences across 10 samples using variable item-whole correlations, method 2
A1 W1 W2 C1 I1 A2 A3 I2 W3 R1
A1 NA 0.42 0.4 0.33 0.06 0.72 0.48 0.3 0.15 0.74
W1 0.42 NA 0.72 0.14 0.52 0.18 0.7 0.44 0.65 0.35
W2 0.4 0.72 NA 0.68 0.63 0.85 0.44 0.53 0.5 0.79
C1 0.33 0.14 0.68 NA 0.52 0.88 0.39 0.19 -0.08 0.67
I1 0.06 0.52 0.63 0.52 NA 0.84 0.51 0.4 0.28 0.79
A2 0.72 0.18 0.85 0.88 0.84 NA 0.62 0.3 0.02 0.22
A3 0.48 0.7 0.44 0.39 0.51 0.62 NA 0.42 0.55 0.67
I2 0.3 0.44 0.53 0.19 0.4 0.3 0.42 NA 0.58 0.35
W3 0.15 0.65 0.5 -0.08 0.28 0.02 0.55 0.58 NA 0.08
R1 0.74 0.35 0.79 0.67 0.79 0.22 0.67 0.35 0.08 NA


Nearly all results are positive using either method. The results are slightly stronger when ignoring items where both samples do not have item-whole correlation data. A better way to visualize the results is to use a histogram with a density curve inputted, as shown in Figure 1 and 2.

SH method 1

Figure 1 – Histogram of MCV results using method 1

SH method 2

Figure 2 – Histogram of MCV results using method 2

Note: The vertical line shows the mean value.

The mean/median result for method 1 was .51/.50, and .46/.48 for method 2. Almost all MCV results were positive, there were only 2/45 that were negative for method 1, and 1/45 for method 2.

Mean MCV value by sample and moderator analysis

It is interesting to examine the mean MCV value by sample. They are shown in Table 7.

Table 7 – MCV correlation means, SDs, and medians by sample
Sample mean SD median
A1 0.46 0.24 0.41
W1 0.63 0.15 0.60
W2 0.62 0.18 0.63
C1 0.45 0.29 0.50
I1 0.53 0.26 0.52
A2 0.55 0.31 0.57
A3 0.53 0.15 0.48
I2 0.43 0.08 0.41
W3 0.35 0.28 0.37
R1 0.56 0.23 0.64


There is no obvious racial pattern. Instead, one might expect the relatively lower result of some samples to be due to sampling error. MCV is extra sensitive to sampling error. If so, the mean correlation should be higher for the larger samples. To see if this was the case, I calculated the rank-order correlation between sample size and sample mean MCV, r=.45/.65 using method 1 or 2 respectively. Rank-order was used because the effect of sample size on sampling error is non-linear. Figure 3 shows the scatter plot of this.


Figure 3 – Sample size as a moderator variable at the sample mean-level


One can also examine sample size as a moderating variable as the comparison-level. This increases the number of datapoints to 45. I used the harmonic mean of the 2 samples as the sample size metric. Figure 4 shows a scatter plot of this.


Figure 4 – Sample size as a moderator variable at the comparison-level


The rank-order correlations are .45/.44 using method 1/2 data. We can see in the plot that the results from the 6 largest comparisons (harmonic mean sample size>800) range from .52 to .88 with a mean of .74/.73 and SD of .15 using method 1/2 results. For the smaller studies (harmonic mean sample size<800), the results range from -.09/-.08 to .89/.79 with a mean of .48/.43 and SD of .22/.23 using method 1/2 results. The results from the smaller studies vary more, as expected with their higher sampling error.

I also examine the group difference size as a moderator variable. I computed this as the difference between the mean item difficulty by the groups. However, it had a near-zero relationship to the MCV results (rank-order r=.03, method 1 data).

Discussion and conclusion

Spearman’s hypothesis has been decisively confirmed using item-level data from Raven’s Standard Progressive Matrices. The analysis presented here can easily be extended to cover more datasets, as well as item-level data from other IQ tests. Researchers should compile such data into open datasets so they can be used for future studies.

It is interesting to note the consistency of results within and across samples that differ in race. Race differences in general intelligence as measured by the SPM appear to be just like those within races.

Supplementary material

R code and dataset is available at the Open Science Framework repository.


  • Dıaz, A., & Sellami, K. Infanzó n, E., Lanzó n, T., & Lynn, R.(2012). A comparative study of general intelligence in Spanish and Moroccan samples. Spanish Journal of Psychology, 15(2), 526-532.
  • Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger.
  • Jensen, A. R. (1985). The nature of the black–white difference on various psychometric tests: Spearman’s hypothesis. Behavioral and Brain Sciences, 8(02), 193-219.
  • Kirkegaard, E. O. (2014a). The international general socioeconomic factor: Factor analyzing international rankings. Open Differential Psychology.
  • Kirkegaard, E. O. (2014b). Crime, income, educational attainment and employment among immigrant groups in Norway and Finland. Open Differential Psychology.
  • Kirkegaard, E. O. W. (in review). Examining the ICAR and CRT tests in a Danish student sample. Open Differential Psychology.
  • Owen, K. (1992). The suitability of Raven’s Standard Progressive Matrices for various groups in South Africa. Personality and Individual Differences, 13(2), 149-159.
  • Rushton, J. P., Čvorović, J., & Bons, T. A. (2007). General mental ability in South Asians: Data from three Roma (Gypsy) Communities in Serbia. Intelligence, 35, 1-12.
  • Rushton, J. P., Skuy, M., & Fridjhon, P. (2002). Jensen effects among African, Indian, and White engineering students in South Africa on Raven’s Standard Progressive Matrices. Intelligence, 30, 409-423.
  • Rushton, J. P., & Skuy, M. (2000). Performance on Raven’s Matrices by African and White university students in South Africa. Intelligence, 28, 251-265.
  • Spearman, C. (1927). The abilities of man.
  • te Nijenhuis, J., Al-Shahomee, A. A., van den Hoek, M., Grigoriev, A., and Repko, J. (2015a). Spearman’s hypothesis tested comparing Libyan adults with various other groups of adults on the items of the Standard Progressive Matrices. Intelligence. Volume 50, May–June 2015, Pages 114–117
  • te Nijenhuis, J., David, H., Metzen, D., & Armstrong, E. L. (2014). Spearman’s hypothesis tested on European Jews vs non-Jewish Whites and vs Oriental Jews: Two meta-analyses. Intelligence, 44, 15-18.
  • te Nijenhuis, J., van den Hoek, M., & Armstrong, E. L. (2015b). Spearman’s hypothesis and Amerindians: A meta-analysis. Intelligence, 50, 87-92.
  • Wicherts, J. M., Borsboom, D., Kats, J., & Molenaar, D. (2006). The poor availability of psychological research data for reanalysis. American Psychologist, 61(7), 726.
  • Wicherts, J. M., & Bakker, M. (2012). Publish (your data) or (let the data) perish! Why not publish your data too?. Intelligence, 40(2), 73-76.



Research uncovers flawed IQ scoring system” is the headline on, which often posts news about research from other fields. It concerns a study by Harrison et al (2015). The researchers have allegedly “uncovered anomalies and issues with the Wechsler Adult Intelligence Scale-Fourth Edition (WAIS-IV), one of the most widely used intelligence tests in the world”. An important discovery, if true. Let’s hear it from the lead researcher:

“Looking at the normal distribution of scores, you’d expect that only about five per cent of the population should get an IQ score of 75 or less,” says Dr. Harrison. “However, while this was true when we scored their tests using the American norms, our findings showed that 21 per cent of college and university students in our sample had an IQ score this low when Canadian norms were used for scoring.”

How can it be? To learn more, we delve into the actual paper titled: Implications for Educational Classification and Psychological Diagnoses Using the Wechsler Adult Intelligence Scale–Fourth Edition With Canadian Versus American Norms.

The paper

First they summarize a few earlier studies on Canada and the US. The Canadians obtained higher raw scores. Of course, this was hypothesized to be due to differences in ethnicity and educational achievement factors. However, this did not quite work out, so Harrison et al decided to investigate it more (they had already done so in 2014). Their method consists of taking the scores from a large mixed sample consisting of healthy people — i.e. with no diagnosis, 11% — and people with various mental disorders (e.g. 53.5% with ADHD), and then scoring this group on both the American and the Canadian norms. What did they find?

Blast! The results were similar to the results from the previous standardization studies! What happened? To find out, Harrison et al do a thorough examination of various subgroups in various ways. No matter which age group they compare, the result won’t go away. They also report the means and Cohen’s d for each subtest and aggregate measure — very helpful. I reproduce their Table 1 below:

Score M (US)
p d r
FSIQ 95.5 12.9 88.1 14.4 <.001 0.54 0.99
GAI 98.9 14.7 92.5 16 <.001 0.42 0.99
Index Scores
Verbal Comprehension 97.9 15.1 91.8 16.3 <.001 0.39 0.99
Perceptual Reasoning 99.9 14.1 94.5 15.9 <.001 0.36 0.99
Working Memory 90.5 12.8 83.5 13.8 <.001 0.53 0.99
Processing Speed 95.2 12.9 90.4 14.1 <.001 0.36 0.99
Subtest Scores
Verbal Subtests
Vocabulary 9.9 3.1 8.7 3.3 <.001 0.37 0.99
Similarities 9.7 3 8.5 3.3 <.001 0.38 0.98
Information 9.2 3.1 8.5 3.3 <.001 0.22 0.99
Arithmetic 8.2 2.7 7.4 2.7 <.001 0.3 0.99
Digit Span 8.4 2.5 7.1 2.7 <.001 0.5 0.98
Performance Subtests
Block Design 9.8 3 8.9 3.2 <.001 0.29 0.99
Matrix Reasoning 9.8 2.9 9.1 3.2 <.001 0.23 0.99
Visual Puzzles 10.5 2.9 9.4 3.1 <.001 0.37 0.99
Symbol Search 9.3 2.8 8.5 3 <.001 0.28 0.99
Coding 8.9 2.5 8.2 2.6 <.001 0.27 0.98


Sure enough, the scores are lower using the Canadian norms. And very ‘significant’ too. A mystery.

Next, they go on to note how this sometimes changes the classification of individuals into 7 arbitrarily chosen intervals of IQ scores, and how this differs between subtests. They spend a lot of e-ink noting percents about this or that classification. For instance:

“Of interest was the percentage of individuals who would be classified as having a FSIQ below the 10th percentile or who would fall within the IQ range required for diagnosis of ID (e.g., 70 ± 5) when both normative systems were applied to the same raw scores. Using American norms, 13.1% had an IQ of 80 or less, and 4.2% had an IQ of 75 or less. By contrast, when using Canadian norms, 32.3% had an IQ of 80 or less, and 21.2% had an IQ of 75 or less.”

I wonder if some coherent explanation can be found for all these results. In their discussion they ask:

“How is it that selecting Canadian over American norms so markedly lowers the standard scores generated from the identical raw scores? One possible explanation is that more extreme scores occur because the Canadian normative sample is smaller than the American (cf. Kahneman, 2011).”

If the reader was unsure, yes, this is Kahneman’s 2011 book about cognitive biases and dual process theory.

They have more suggestions about the reason:

“One cannot explain this difference simply by saying it is due to the mature students in the sample who completed academic upgrading, as the score differences were most prominent in the youngest cohorts. It is difficult to explain these findings simply as a function of disability status, as all participants were deemed otherwise qualified by these postsecondary institutions (i.e., they had met normal academic requirements for entry into regular postsecondary programs). Furthermore, in Ontario, a diagnosis of LD is given only to students with otherwise normal thinking and reasoning skills, and so students with such a priori diagnosis would have had otherwise average full scale or general abilities scores when tested previously. Performance exaggeration seems an unlikely cause for the findings, as the students’ scores declined only when Canadian norms were applied. Finally, although no one would argue that a subset of disabled students might be functioning below average, it is difficult to believe that almost half of these postsecondary students would fall in this IQ range given that they had graduated from high school with marks high enough to qualify for acceptance into bona fide postsecondary programs. Whatever the cause, our data suggest that one must question both the representativeness of the Canadian normative sample in the younger age ranges and the accuracy of the scores derived when these norms are applied.”

And finally they conclude with a recommendation not to use the Canadian norms for Canadians because this results in lower IQs:

Overall, our findings suggest a need to examine more carefully the accuracy and applicability of the WAIS-IV Canadian norms when interpreting raw test data obtained from Canadian adults. Using these norms appears to increase the number of young adults identified as intellectually impaired and could decrease the number who qualify for gifted programming or a diagnosis of LD. Until more research is conducted, we strongly recommend that clinicians not use Canadian norms to determine intellectual impairment or disability status. Converting raw scores into Canadian standard scores, as opposed to using American norms, systematically lowers the scores of postsecondary students below the age of 35, as the drop in FSIQ was higher for this group than for older adults. Although we cannot know which derived scores most accurately reflect the intellectual abilities of young Canadian adults, it certainly seems implausible that almost half of postsecondary students have FSIQ scores below the 16th percentile, calling into question the accuracy of all other derived WAIS-IV Canadian scores in the classification of cognitive abilities.

Are you still wondering what it going on?

Populations with different mean IQs and cut-offs

Harrison et al seems to have inadvertently almost rediscovered the fact that Canadians are smarter than Americans. They don’t quite make it to this point even when faced with obvious and strong evidence (multiple standardization samples). They somehow don’t realize that using the norms from these standardization samples will reproduce the differences found in those samples, and won’t really report anything new.

Their numerous differences in percents reaching this or that cut-off are largely or entirely explained by simple statistics. They have two populations which have an IQ difference of 7.4 points (95.5 – 88.1 from Table 1) or 8.1 points (15 * .54 d from Table 1). Now, if we plot these (I used a difference of 7.5 IQ) and choose some arbitrary cut-offs, like those between arbitrarily chosen intervals, we see something like this:


Except that I cheated and chose all the cut-offs. The brown and the green lines are the ratios between the densities (read off the second y-axis). We see that around 100, they are generally low, but as we get further from the means, they get a lot larger. This simple fact is not generally appreciated. It’s not a new problem, Arthur Jensen spent much of a chapter in his behemoth 1980 book on the topic, he quotes for instance:

“In the construction trades, new apprentices were 87 percent white and 13 percent black. [Blacks constitute 12 percent of the U.S. population.] For the Federal Civil Service, of those employees above the GS-5 level, 88.5 percent were white, 8.3 percent black, and women account for 30.1 of all civil servants. Finally, a 1969 survey of college teaching positions showed whites with 96.3 percent of all posi­ tions. Blacks had 2.2 percent, and women accounted for 19.1 percent. (U.S. Commission on Civil Rights, 1973)”

Sounds familiar? Razib Khan has also written about it. Now, let’s go back to one of the quotes:

“Using American norms, 13.1% had an IQ of 80 or less, and 4.2% had an IQ of 75 or less. By contrast, when using Canadian norms, 32.3% had an IQ of 80 or less, and 21.2% had an IQ of 75 or less. Most notably, only 0.7% (2 individuals) obtained a FSIQ of 70 or less using American norms, whereas 9.7% had IQ scores this low when Canadian norms were used. At the other end of the spectrum, 1.4% of the students had FSIQ scores of 130 or more (gifted) when American norms were used, whereas only 0.3% were this high using Canadian norms.”

We can put these in a table and calculate the ratios:

IQ threshold Percent US Percent CAN US/CAN CAN/US
130 1.4 0.3 4.67 0.21
80 13.1 32.3 0.41 2.47
75 4.2 21.2 0.20 5.05
70 0.7 9.7 0.07 13.86


And we can also calculate the expected values based on the two populations (with means of 95.5 and 88) above:

IQ threshold Percent US Percent CAN US/CAN CAN/US
130 1.07 0.26 4.12 0.24
80 15.07 29.69 0.51 1.97
75 8.59 19.31 0.44 2.25
70 4.46 11.51 0.39 2.58


This is fairly close right? The only outlier (in italic) is the much lower than expected value for <70 IQ using US norms, perhaps a sampling error. But overall, this is a pretty good fit to the data. Perhaps we have our explanation.

What about those (mis)classification values in their Table 2? Well, for similar reasons that I won’t explain in detail, these are simply a function of the difference between the groups in that variable, e.g. Cohen’s d. In fact, if we correlate the d vector and the “% within same classification” we get a correlation of -.95 (-.96 using rank-orders).

MCV analysis

Incidentally, the d values report in their Table 1 are useful for using the method of correlated vectors. In a previous study comparing US and Canadian IQ data, Dutton and Lynn (2014) compared WAIS-IV standardization data. They found a mean difference of .31 d, or 4.65 IQ, which was reduced to 2.1 IQ if the samples were matched on education, ethnicity and sex. An interesting thing was that the difference between the countries was largest on the most g-loading subtests. When this happens, it is called a Jensen effect (or that it has a positive Jensen coefficient, Kirkegaard 2014). The value in their study was .83, which is on the high side (see e.g. te Nijenhuis et al, 2015).

I used the same loadings as used in their study (McFarland, 2013), and found a correlation of .24 (.35 with rank-order), substantially weaker.

Supplementary material

The R code and data files can be found in the Open Science Framework repository.


  • Harrison, A. G., Holmes, A., Silvestri, R., Armstrong, I. T. (2015). Implications for Educational Classification and Psychological Diagnoses Using the Wechsler Adult Intelligence Scale–Fourth Edition With Canadian Versus American Norms. Journal of Psychoeducational Assessment. 1-13.
  • Jensen, A. R. (1980). Bias in Mental Testing.
  • Kirkegaard, E. O. (2014). The personal Jensen coefficient does not predict grades beyond its association with g. Open Differential Psychology.
  • McFarland, D. (2013). Model individual subtests of the WAIS IV with multiple latent
    factors. PLoSONE. 8(9): e74980. doi:10.1371/journal.pone.0074980
  • te Nijenhuis, J., van den Hoek, M., & Armstrong, E. L. (2015). Spearman’s hypothesis and Amerindians: A meta-analysis. Intelligence, 50, 87-92.

In my previous post, I reviewed data about education and beliefs about nuclear energy/power. In this post, I report some results for intelligence measured by wordsum. The wordsum is a 10 word vocabulary test widely given in US surveys. It correlates .71 with a full-scale IQ test, so it’s a decent proxy for group data, although not that good for individual level. It is short and innocent enough that it can be given as part of larger surveys and has strong enough psychometric properties that it is useful to researchers. Win-win.

The data

The data comes from the ANES 2012 survey. A large survey given to about 6000 US citizens. The datafile has 2240 variables of various kinds.

Method and results

The analysis was done in R.

The datafile does not come with a summed wordsum score, so one must do this oneself, which involves recoding the 10 variables. In doing so I noticed that one of the words (#5, labeled “g”) has a very strong distractor: 43% pick item 5 (the correct), 49% pick item 2. This must clearly be either a very popular confusion (which makes it not a confusion at all because if a word is commonly used to mean x, that is what it means in addition to other meanings if any), or an ambiguous word, or two reasonable answers. Clearly, something is strange. For this reason, I downloaded the questionnaire itself. However, the word options has been redacted!


I don’t know if this is for copyright reasons or test-integrity reasons. The first would be very silly, since it is merely 10 lists of 1+6 words, hardly copyrightable. The second is more open to discussion. How many people actually check the questionnaire so that they can get the word quiz correct? Can’t be that many, but perhaps a few. Anyway, if the alternative is to change the words every year to avoid possible cheating, this is the better option. Changing words would mean that between year results were not so easily comparable, even if care is taken in choosing similar words. Still, for the above reasons, I decided to analyze the test with and without the odd item.

First I recoded the data. Any response other than the right one was coded as 0, and the right one as 1. For the policy and science questions, I recoded the missing/soft refusal and (hard?) refusal into NA (missing data).

Second, for each case, I summed the number of correct responses. Histograms:

hist9 hist10

The test shows a very strong ceiling effect. They need to add more difficult words. They can do this without ruining comparison-ability with older datasets, since one can just exclude the newer harder words in the analysis if one wants to do longitudinal comparisons.

Third, I examined internal reliability using alpha() from psych. This revealed that the average inter-item correlation would increase if the odd item was dropped, tho the effect was small: .25 to .27.

Fourth, I used both standard and item response theory factor analysis on 10 and 9 item datasets and extracted 1 factor. Examining factor loadings showed that the odd item had the lowest loading using both methods (classic: median=.51, item=.37; IRT: median=.68, item=.44).

Fifth, I correlated the scores from all of these methods to each other: classical FA on 10 and 9 items, IRT FA on 10 and 9 items, unit-weighted sum of 10 and 9 items.


Surprisingly, the IRT vs. non-IRT scores did not correlate that strongly, only around .90 (range .87 to .94). The others usually correlated near 1 (range .96 to .99). The scores with and without the odd item correlated strongly tho, so the rest of the analyses I used the 10-item version. Due to the rather ‘low’ correlations between the estimates, I retained both the IRT and classic FA scores for further analysis.

Sixth, I converted both these factor scores to standard IQ format (mean=100, sd=15). Since the output scores of IRT FA was not normal, I normalized it first using scale().

Seventh, I plotted the answer categories with their mean IQ and 99% confidence intervals. For each plot, I plotted both the IQ estimates. Classical FA scores in black, IRT scores in red.



We see as expected that nuclear proponents (people who think we should have more) have higher IQ than those who think we should have fewer or the same. Note that this is a non-linear result. The group who thinks we should have fewer have a slightly higher IQ than those who favor the status quo, altho this difference is pretty slight. If it is real, it is perhaps because it is easier to favor the status quo, than to have an opinion on which way society should change in line with cognitive miser theory (Renfrow and Howard, 2013).

The result for the question of whether global warming is happening or not is interesting. There is almost not effect of intelligence on getting such an obvious question right.

The next two questions were also given to people who said that global warming was not happening. When it was, they prefixed it with “Assuming it’s happening”. Here we see a strong negative effect on thinking that the consequences will be good, as well as weak effects on the cause of global warming.


Renfrow, D. G., & Howard, J. A. (2013). Social psychology of gender and race. In Handbook of Social Psychology (pp. 491-531). Springer Netherlands.

R code and data

Because the data is not given in a good format, I upload it here in such: anes_timeseries_2012

##Nuclear, WORDSUM and Educational attainment

#load data
d = read.csv("anes_timeseries_2012.csv")
wordsum = subset(d, select=c(wordsum_setb,wordsum_setd,wordsum_sete,wordsum_setf,

wordsum$wordsum_setb[wordsum$wordsum_setb != 5] = 0 #everything else than to 0
wordsum$wordsum_setb[wordsum$wordsum_setb == 5] = 1 #correct to 1
table(wordsum$wordsum_setb) #verify
wordsum$wordsum_setd[wordsum$wordsum_setd != 3] = 0 #everything else than to 0
wordsum$wordsum_setd[wordsum$wordsum_setd == 3] = 1 #correct to 1
table(wordsum$wordsum_setd) #verify
wordsum$wordsum_sete[wordsum$wordsum_sete != 1] = 0 #everything else than to 0
table(wordsum$wordsum_sete) #verify
wordsum$wordsum_setf[wordsum$wordsum_setf != 3] = 0 #everything else than to 0
wordsum$wordsum_setf[wordsum$wordsum_setf == 3] = 1 #correct to 1
table(wordsum$wordsum_setf) #verify
wordsum$wordsum_setg[wordsum$wordsum_setg != 5] = 0 #everything else than to 0
wordsum$wordsum_setg[wordsum$wordsum_setg == 5] = 1 #correct to 1
table(wordsum$wordsum_setg) #verify
wordsum$wordsum_seth[wordsum$wordsum_seth != 4] = 0 #everything else than to 0
wordsum$wordsum_seth[wordsum$wordsum_seth == 4] = 1 #correct to 1
table(wordsum$wordsum_seth) #verify
wordsum$wordsum_setj[wordsum$wordsum_setj != 1] = 0 #everything else than to 0
wordsum$wordsum_setj[wordsum$wordsum_setj == 1] = 1 #correct to 1
table(wordsum$wordsum_setj) #verify
wordsum$wordsum_setk[wordsum$wordsum_setk != 1] = 0 #everything else than to 0
wordsum$wordsum_setk[wordsum$wordsum_setk == 1] = 1 #correct to 1
table(wordsum$wordsum_setk) #verify
wordsum$wordsum_setl[wordsum$wordsum_setl != 4] = 0 #everything else than to 0
wordsum$wordsum_setl[wordsum$wordsum_setl == 4] = 1 #correct to 1
table(wordsum$wordsum_setl) #verify
wordsum$wordsum_seto[wordsum$wordsum_seto != 2] = 0 #everything else than to 0
wordsum$wordsum_seto[wordsum$wordsum_seto == 2] = 1 #correct to 1
table(wordsum$wordsum_seto) #verify
d$envir_nuke = mapvalues(d$envir_nuke, c(-9,-8),c(NA,NA)) #remove nonresponse
d$envir_gwarm = mapvalues(d$envir_gwarm, c(-9,-8),c(NA,NA)) #remove nonresponse
d$envir_gwhow = mapvalues(d$envir_gwhow, c(-9,-8),c(NA,NA)) #remove nonresponse
d$envir_gwgood = mapvalues(d$envir_gwgood, c(-9,-8),c(NA,NA)) #remove nonresponse

wordsum10 = data.frame(wordsum10=apply(wordsum,1,sum)) #sum all
wordsum9 = data.frame(wordsum9=apply(wordsum[-5],1,sum)) #sum with potentially broken item

hist(unlist(wordsum10),freq = FALSE,xlab="Wordsum score",main="Histogram of wordsum10")
hist(unlist(wordsum9),freq = FALSE,xlab="Wordsum score",main="Histogram of wordsum9")

alpha(wordsum) #reliabilities
alpha(wordsum[-5]) #minus g
round(cor(wordsum),2) #cors

wordsum10.fa = irt.fa(wordsum) #all items IRT FA
wordsum9.fa = irt.fa(wordsum[-5]) #minus g item
wordsum10.1 = score.irt(wordsum10.fa$fa,wordsum)
wordsum9.1 = score.irt(wordsum9.fa$fa,wordsum[-5])

wordsum10.fa.n = fa(wordsum)
wordsum10.n.1 = wordsum10.fa.n$scores
wordsum9.fa.n = fa(wordsum[-5])
wordsum9.n.1 = wordsum9.fa.n$scores


factor.scores = cbind(wordsum10.n.1,
colnames(factor.scores) = c("fa.10","fa.9","fa.irt.10","fa.irt.9","sum10","sum9")

#policy questions
policy = subset(d, select=c(envir_nuke, envir_gwarm, envir_gwgood, envir_gwhow)) #fetch
policy$GI1 = factor.scores$fa.10*15+100 #first estimate
policy$GI2 = scale(factor.scores$fa.irt.10)*15+100 #second

#policy and mean GI
plotmeans(GI1 ~ envir_nuke, policy, p=.99, legends=c("More","Fewer","Same"),
          main="Should US have more or fewer nuclear power plants",
          xlab="", ylab="Wordsum IQ")
plotmeans(GI2 ~ envir_nuke, policy, p=.99, legends=NULL,xaxt="n",yaxt="n",xlab="",ylab="",col = "red")

#Global warming
plotmeans(GI1 ~ envir_gwarm, policy, p=.99, legends=c("Has probably been happening","Probably hasn't been happening"),
          main="Is global warming happening or not",
          xlab="", ylab="Wordsum IQ")
plotmeans(GI2 ~ envir_gwarm, policy, p=.99, legends=NULL,xaxt="n",yaxt="n",xlab="",ylab="",col = "red")

#Rising temperatures good or bad
plotmeans(GI1 ~ envir_gwgood, policy, p=.99, legends=c("Good","Bad","Neither good nor bad"),
          main="Rising temperatures good or bad",
          xlab="", ylab="Wordsum IQ")
plotmeans(GI2 ~ envir_gwgood, policy, p=.99, legends=NULL,xaxt="n",yaxt="n",xlab="",ylab="",col = "red")

#Rising temperatures good or bad
plotmeans(GI1 ~ envir_gwhow, policy, p=.99, legends=NA,
          main="Anthropogenic climate change",
          xlab="", ylab="Wordsum IQ")
mtext(c("Mostly by human activity",
        "Mostly by natural causes",
        text.splitter("About equally by human activity and natural causes",25)),
        side=1, at=c(1:3), cex=.8, padj=1)
plotmeans(GI2 ~ envir_gwhow, policy, p=.99, legends=NULL,xaxt="n",yaxt="n",xlab="",ylab="",col = "red")

Nuclear energy often gets bad press. However, journalists are mostly very leftist, scientifically ill-educated women, so perhaps they are not quite the right demographic to tell us about this issue. So far I have not found any published studies about the relationship between attitudes towards nuclear energy and general intelligence, however there is good reason to believe it is positive. The EU is to nice to survey the opinions of EU citizens on nuclear energy every few years, and they include measurement of various variables, but not general intelligence or general science knowledge.

Three surveys of European attitudes towards nuclear power and correlates

nuc18 nuc17 nuc16 nuc15 nuc14 nuc13 nuc12 nuc11 nuc10 nuc9 nuc8 nuc7 nuc6 nuc5 nuc4 nuc3 nuc2 nuc1

Generally, these find:

  1. Men are more positive about nuclear power
  2. The better educated are more positive about nuclear power
  3. The better educated are less in doubt about nuclear power
  4. Education has inconsistent relationship to opposition to nuclear power
  5. Self-rated knowledge is related to more positive attitude to nuclear power
  6. Those with more experience about nuclear power are more positive about nuclear power

(4) might seem inconsistent with (1-3), but it is not. Usually the questions have three broad categories: positive, negative, don’t know. When the education level increases, negative stays about the same (sometimes increases, sometimes decreases), while don’t know always decreases and positive nearly always increases. Thus, the simplest explanation for this is that higher education move people from the don’t know category to the positive category, while it has no clear effect those who are negative about nuclear power.

Survey in the United Kingdom

Most of the questions were too specific to be of use, however one was useful, is post-Fukushima and still found the usual results:


What do experts think?

I found an old review (1984) of expert opinion. They write:

In contrast to the public, most “opinion leaders,” particularly energy experts, support further development of nuclear power. This support is revealed both in opinion polls and in technical studies of the risks of nuclear power. A March 1982 poll of Congress found 76 percent of members supported expanded use of nuclear power (50. In a survey conducted for Connecticut Mutual Life Insurance Co. in 1980, leaders in religion, business, the military, government, science, education, and law perceived the benefits of nuclear power as greater than the risks (19). Among the categories of leaders surveyed, scientists were particularly supportive of nuclear power. Seventyfour percent of scientists viewed the benefits of nuclear power as greater than risks, compared with only 55 percent of the rest of the public.

In a recent study, a random sample of scientists was asked about nuclear power (62). Of those polled, 53 percent said development should proceed rapidly, 36 percent said development should proceed slowly, and 10 percent would halt development or dismantle plants. When a second group of scientists with particular expertise in energy issues was given the same choices, 70 percent favored proceeding rapidly and 25 percent favored proceeding slowly with the technology. This second sample included approximately equal numbers of scientists from 71 disciplines, ranging from air pollution to energy policy to thermodynamics. About 10 percent of those polled in this group worked in disciplines directly related to nuclear energy, so that the results might be somewhat biased. Support among both groups of scientists was found to result from concern about the energy crisis and the belief that nuclear power can make a major contribution to national energy needs over the next 20 years. Like scientists, a majority of engineers continued to support nuclear power after the accident at Three Mile Island (69).

Of course, not all opinion leaders are in favor of the current U.S. program of nuclear development. Leaders of the environmental movement have played a major role in the debate about reactor safety and prominent scientists are found on both sides of the debate. A few critics of nuclear power have come from the NRC and the nuclear industry, including three nuclear engineers who left General Electric in order to demonstrate their concerns about safety in 1976. However, the majority of those with the greatest expertise in nuclear energy support its further development.

Analysis of public opinion polls indicates that people’s acceptance or rejection of nuclear power is more influenced by their view of reactor safety than by any other issue (57). As discussed above, accidents and events at operating plants have greatly increased public concern about the possibility of a catastrophic accident. Partially in response to that concern, technical experts have conducted a number of studies of the likelihood and consequences of such an accident. However, rather than reassuring the public about nuclear safety, these studies appear to have had the opposite effect. By painting a picture of the possible consequences of an accident, the studies have contributed to people’s view of the technology as exceptionally risky, and the debate within the scientific community about the study methodologies and findings has increased public uncertainty.

And recently, much publicity was given to a study showing the discrepancy between public opinion and scientific opinion on various topics, and it included nuclear power:


A 20% percent point different is no small matter and is similar to the older studies described above.

General intelligence and nuclear power

The OKCupid dataset contains only one question related to nuclear power among the first 2400 questions or so (those in the dataset). The question is the 2216th most commonly answered question, that is, not very commonly answered at all. Since people who answer >2000 questions on a dating site are a very self-selected group, there is likely to be some heavy range restriction.


Aside from the “I don’t know”-category, the differences are quite small:

[1] "Question ID: q59519"
[1] "How do you feel about nuclear energy?"
[1] "I'm not sure, there are pros and cons."
[1] "n's =" "1015" 
[1] "means =" "2.3"    
[1] "I don't care, whatever keeps my light bulbs lit."
[1] "n's =" "52"   
[1] "means =" "1.37"   
[1] "No.  It is a danger to public safety."
[1] "n's =" "335"  
[1] "means =" "2.31"   
[1] "Yes.  It is efficient, safe, and clean."
[1] "n's =" "591"  
[1] "means =" "2.49"

However, samples are quite big. The 99% confidence interval is -0.323 to -0.037, which is getting close to 0, so we need more data to be quite certain, but it’s unlikely to have happened by chance [(D|H0)=very low]. How large is the difference in some more useful unit? It’s .18 points on a 3-point scale (3 IQ-type questions). The overall SD in the total dataset for this 3-point scale is .96 (mean=2.12, N=28k). However, for this question, it is only .88 due to range restriction (mean=2.34, N=1993). In other words, in SD units it is .18/.88=0.20, which is 3 IQ points, not correcting for anything. Perhaps after corrections this would be something like 5 IQ points between pro and con-nuclear power people in this dataset.

Future research

I’d like more data with measures of scientific knowledge, or probability reasoning or some such and various energy policy opinions including nuclear. Of course, there should also be general intelligence data.


Europeans and Nuclear Safety – 2010 – PDF

Europeans and Nuclear Safety – 2007 – PDF

Attitudes towards radioactive waste (EU) – 2008 – PDF

Public Attitudes to Nuclear Power (OECD summary) – 2010 – PDF

New Nuclear Watch Europe – Nuclear Energy Survey (United Kingdom) – 2014 – PDF

Public Attitudes Toward Nuclear Power – 1984 – PDF (part of Nuclear Power in an Age of Uncertainty. Washington, D. C.: U.S. Congress, Office of Technology Assessment, OTA-E-216, February 1984.

In reply to and his working paper here:

We are discussing his working paper over email, and I had some reservations about his factor analysis. I decided to run the analyses I wanted myself, but it turned into a longer project which should be placed in a short paper instead of in a private email.

I fetched the data from his source. The raw data did not have variable names, so was unwieldy to work with. I opened the SPSS file, and it did have variable names. Then I exported the CSV with the desired variables (see supp. material). Then I had to recoded the variables so that the true answers are coded as 1, false answers as 0, and missing as NA. This took some time. I followed his coding procedure for most cases (see his STATE file and my R code below).

How many factors to extract

It seems that he relies on some kind of method for determining the number of factors to extract, presumably Eigenvalue>1. I always use three different methods using the nFactors package. Using all 22 variables (note that he did not this all of them at once), all methods agreed to extract 5 factors (at max). Here’s the factor solutions for extracting 1 thru 5 factors and their intercorrelations:

Factor analyses with 1-5 factors and their correlations

[1] "Factor analysis, extracting 1 factors using oblimin and MinRes"

smokheal  0.129
condrift  0.347
rmanmade  0.445
earthhot  0.348
oxyplant  0.189
lasers    0.514
atomsize  0.441
antibiot  0.401
dinosaur  0.323
light     0.384
earthsun  0.515
suntime   0.581
dadgene   0.227
getdrug   0.290
whytest   0.423
probno4   0.396
problast  0.423
probreq   0.349
probif3   0.416
evolved   0.306
bigbang   0.315
onfaith  -0.296

SS loadings    3.191
Proportion Var 0.145
[1] "Factor analysis, extracting 2 factors using oblimin and MinRes"

         MR1    MR2   
smokheal  0.121       
condrift  0.345       
rmanmade  0.368  0.136
earthhot  0.363       
oxyplant  0.172       
lasers    0.518       
atomsize  0.461       
antibiot  0.323  0.133
dinosaur  0.323       
light     0.375       
earthsun  0.587       
suntime   0.658       
dadgene   0.145  0.130
getdrug   0.211  0.130
whytest   0.386       
probno4          0.705
problast         0.789
probreq   0.162  0.305
probif3   0.108  0.514
evolved   0.348       
bigbang   0.367       
onfaith  -0.266       

                 MR1   MR2
SS loadings    2.617 1.569
Proportion Var 0.119 0.071
Cumulative Var 0.119 0.190
     MR1  MR2
MR1 1.00 0.35
MR2 0.35 1.00
[1] "Factor analysis, extracting 3 factors using oblimin and MinRes"

         MR2    MR1    MR3   
condrift                0.346
rmanmade  0.173  0.170  0.232
earthhot         0.187  0.220
oxyplant                0.100
lasers           0.256  0.320
atomsize         0.208  0.312
antibiot  0.168  0.150  0.198
dinosaur         0.119  0.250
light            0.240  0.169
earthsun         0.737       
suntime          0.754       
dadgene   0.147              
getdrug   0.152         0.149
whytest   0.108  0.143  0.294
probno4   0.708              
problast  0.781              
probreq   0.324              
probif3   0.532              
evolved                 0.562
bigbang                 0.525
onfaith                -0.307

                 MR2   MR1   MR3
SS loadings    1.646 1.444 1.389
Proportion Var 0.075 0.066 0.063
Cumulative Var 0.075 0.140 0.204
     MR2  MR1  MR3
MR2 1.00 0.29 0.25
MR1 0.29 1.00 0.43
MR3 0.25 0.43 1.00
[1] "Factor analysis, extracting 4 factors using oblimin and MinRes"

         MR4    MR2    MR1    MR3   
condrift  0.180                0.234
rmanmade  0.387                     
earthhot  0.262         0.102       
oxyplant  0.116                     
lasers    0.490                     
atomsize  0.435                     
antibiot  0.485                     
dinosaur  0.312                     
light     0.274         0.142       
earthsun                0.797       
suntime                 0.719       
dadgene   0.234                     
getdrug   0.273                     
whytest   0.438                     
probno4          0.695              
problast         0.817              
probreq   0.180  0.275              
probif3   0.139  0.487              
evolved                        0.685
bigbang                        0.554
onfaith  -0.141               -0.230

                 MR4   MR2   MR1   MR3
SS loadings    1.511 1.501 1.204 0.915
Proportion Var 0.069 0.068 0.055 0.042
Cumulative Var 0.069 0.137 0.192 0.233
     MR4  MR2  MR1  MR3
MR4 1.00 0.39 0.57 0.42
MR2 0.39 1.00 0.23 0.12
MR1 0.57 0.23 1.00 0.27
MR3 0.42 0.12 0.27 1.00
[1] "Factor analysis, extracting 5 factors using oblimin and MinRes"

         MR2    MR1    MR3    MR5    MR4   
condrift                0.209         0.299
rmanmade  0.104                0.120  0.379
earthhot                              0.367
oxyplant                              0.220
lasers                         0.195  0.361
atomsize                       0.273  0.207
antibiot                       0.401  0.108
dinosaur                       0.204  0.131
light                                 0.423
earthsun         0.504                0.186
suntime          1.007                     
dadgene                        0.277       
getdrug                        0.373       
whytest                        0.504       
probno4   0.701                            
problast  0.816                            
probreq   0.272                0.174       
probif3   0.487                0.107       
evolved                 0.753              
bigbang                 0.483         0.165
onfaith                -0.225 -0.152       

                 MR2   MR1   MR3   MR5   MR4
SS loadings    1.501 1.291 0.919 0.874 0.871
Proportion Var 0.068 0.059 0.042 0.040 0.040
Cumulative Var 0.068 0.127 0.169 0.208 0.248
     MR2  MR1  MR3  MR5  MR4
MR2 1.00 0.20 0.11 0.38 0.28
MR1 0.20 1.00 0.21 0.41 0.44
MR3 0.11 0.21 1.00 0.32 0.30
MR5 0.38 0.41 0.32 1.00 0.50
MR4 0.28 0.44 0.30 0.50 1.00


We see that in the 1-factor solution, all variables load in the expected direction, and we can speak of a general scientific knowledge factor. This is the one we want to use for other analyses. We see that faith loads negatively. This variable is not a true/false question, and thus should be excluded from any actual measurement of the general scientific knowledge factor.

Increasing the number of factors to extract simply divides this general factor into correlated parts. E.g. in the 2-factor solution, we see a probability factor that correlates .35 with the remaining semi-general factor. In solution 3, we see MR2 as the probability factor, MR3 as the knowledge related to religious beliefs factor and MR1 as the remaining items. Intercorrelations are .29, .25 and .43. This pattern continues until the 5th solution which still produces 5 correlated factors: MR2 is the probability factor, MR1 is an astronomy factor, MR3 is the one having to do with religious beliefs, MR5 looks like a medicine/genetics factor, and MR4 is the rest.

Just because scree tests etc. tell you to extract >1 factor does not mean that there is no general factor. This is the old fallacy made in the study of cognitive ability. See discussion in Jensen 1998 (chapter 3). It is sometimes still made e.g. Hampshire, et al (2012). Generally, as one increases the number of variables, the suggested number of factors to extract goes up. This does not mean that there is no general factor, just that with increasing number of variables, one can see a more fine-grained structure in the data than one can with only e.g. 5 variables.

Should we use them or not?

Before discussing whether one should theoretically use them or not, one can measure if it makes much of a difference. One can do this by extracting the general factor with and without the items in questions. I did this, also excluding the onfaith item. Then I correlated the scores from these two analysis: r=.992. In other words, it hardly matters whether one includes these religious-tinged items or not. The general factor is measured quite well already without them and they do not substantially change the factor scores. However, since adding more indicator items/variables generally reduces measurement error of a latent trait/factor, I would include them in my analyses.

How many factors should we extract and use?

There is also the question of how many factors one should extract. The answer is that it depends on what one wants to do. As Zigerell points out in a review comment of this paper on Winnower:

For example, for diagnostic purposes, if we know only that students A, B, and C miss 3 items on a test of general science knowledge, then the only remediation is more science; but we can provide more tailored remediation if we have separate components so that we observe that, say, A did poorly only on the religion-tinged items, B did poorly only on the probability items, and C did poorly only on the astronomy items.

For remedial education, it is clearly preferable to extract the highest number of interpretable factors because this gives the most precise information where knowledge is lacking for a given person. In regression analysis where we want to control for scientific knowledge, one should use the general factor.


Hampshire, A., Highfield, R. R., Parkin, B. L., & Owen, A. M. (2012). Fractionating human intelligence. Neuron, 76(6), 1225-1237.

Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger.

Supplementary material

Datafile: science_data

R code

library(plyr) #for mapvalues

data = read.csv("science_data.csv") #load data

#Coding so that 1 = true, 0 = false
data$smokheal = mapvalues(data$smokheal, c(9,7,8,2),c(NA,0,0,0))
data$condrift = mapvalues(data$condrift, c(9,7,8,2),c(NA,0,0,0))
data$earthhot = mapvalues(data$earthhot, c(9,7,8,2),c(NA,0,0,0))
data$rmanmade = mapvalues(data$rmanmade, c(9,7,8,1,2),c(NA,0,0,0,1)) #reverse
data$oxyplant = mapvalues(data$oxyplant, c(9,7,8,2),c(NA,0,0,0))
data$lasers = mapvalues(data$lasers, c(9,7,8,2,1),c(NA,0,0,1,0)) #reverse
data$atomsize = mapvalues(data$atomsize, c(9,7,8,2),c(NA,0,0,0))
data$antibiot = mapvalues(data$antibiot, c(9,7,8,2,1),c(NA,0,0,1,0)) #reverse
data$dinosaur = mapvalues(data$dinosaur, c(9,7,8,2,1),c(NA,0,0,1,0)) #reverse
data$light = mapvalues(data$light, c(9,7,8,2,3),c(NA,0,0,0,0))
data$earthsun = mapvalues(data$earthsun, c(9,7,8,2),c(NA,0,0,0))
data$suntime = mapvalues(data$suntime, c(9,7,8,2,3,1,4,99),c(0,0,0,0,1,0,0,NA))
data$dadgene = mapvalues(data$dadgene, c(9,7,8,2),c(NA,0,0,0))
data$getdrug = mapvalues(data$getdrug, c(9,7,8,2,1),c(NA,0,0,1,0)) #reverse
data$whytest = mapvalues(data$whytest, c(1,2,3,4,5,6,7,8,9,99),c(1,0,0,0,0,0,0,0,0,NA))
data$probno4 = mapvalues(data$probno4, c(9,8,2,1),c(NA,0,1,0)) #reverse
data$problast = mapvalues(data$problast, c(9,8,2,1),c(NA,0,1,0)) #reverse
data$probreq = mapvalues(data$probreq, c(9,8,2),c(NA,0,0))
data$probif3 = mapvalues(data$probif3, c(9,8,2,1),c(NA,0,1,0)) #reverse
data$evolved = mapvalues(data$evolved, c(9,7,8,2),c(NA,0,0,0))
data$bigbang = mapvalues(data$bigbang, c(9,7,8,2),c(NA,0,0,0))
data$onfaith = mapvalues(data$onfaith, c(9,1,2,3,4,7,8),c(NA,1,1,0,0,0,0))

#How many factors to extract?
nScree(data[complete.cases(data),]) #use complete cases only

#extract factors
library(psych) #for factor analysis
for (num in 1:5) {
  print(paste0("Factor analysis, extracting ",num," factors using oblimin and MinRes"))
  fa = fa(data,num) #extract factors
  print(fa$loadings) #print
  if (num>1){ #print factor cors
    phi = round(fa$Phi,2) #round to 2 decimals
    colnames(phi) = rownames(phi) = colnames(fa$scores) #set names
    print(phi) #print

#Does it make a difference?
fa.all = fa(data[1:21]) #no onfaith
fa.noreligious = fa(data[1:19]) #no onfaith, bigbang, evolved
cor(fa.all$scores,fa.noreligious$scores, use="pair") #correlation, ignore missing cases

Also posted on the Project Polymath blog.

An interesting study has been published:

This is relevant to the study of polymathy, which of course involves making broader contributions to academic areas. The authors’ own abstract is actually not very good, so here is mine: They sent a personality questionnaire to two random sample of scientists (diabetes researchers). This field was chosen because it is large and old, thus providing researchers with lots of researchers to analyze. They out a couple of thousand of these questionnaires and received received 748 and 478 useful answers. They then hired some other company to provide researcher information regarding the researchers. To measure depth vs. breadth, they used the keywords associated with the articles. More different keywords, means more breadth.
They used this information as well as other measures and their personality measures in four regression models:

polymath_tableS3 polymath_tableS2 polymath_table2 polymath_table1

The difference between the sets of regression models is the use of total publications vs. centrality as a control. These variables also correlate .52, so it not surprisingly made little difference.

They also report the full correlation matrix:


Of note in the results: Their measures of depth and breadth correlated strongly (.59), so this makes things more difficult. Preferably, one would want a single dimension to measure these along, not two highly positively correlated dimensions. The authors claimed to do this, but didn’t:

The two dependent variables, depth and breadth, were correlated positively (r = 0.59), and therefore we analyzed them separately (in each case, controlling for the other) rather than using the same predictive model. Discriminant validity is sup- ported by roughly 65% of variance unshared. At the same time, sharing 35% variance renders the statistical tests somewhat conservative, making the many significant and distinguishing relationships particularly noteworthy.

Openness (5 factor model) correlated positively with both depth and breadth, perhaps just because these are themselves correlated. Thus it seems preferable to control for the other depth/breadth measure when modeling. In any case, O seems to be related to creative output in these data. Conscientiousness had negligible betas, perhaps because they control for centrality/total publications thru which the effect of C is likely to be mediated. They apparently did not use the other scales of the FFM inventory, or at least give the impression they didn’t. Maybe they did and didn’t report because near-zero results (publication bias).

Their four other personality variables correlated in the expected directions. Exploration and learning goal orientation with breadth and performance goal orientation and competitiveness with depth.

Since the correlation matrix is published, one can do path and factor analysis on the data, but cannot run more regression models without case-level data. Perhaps the authors will supply it (they generally won’t).

The reporting on results in the main article is lacking. They report test-statistics without sample sizes and proper (d or r, or RR or something) effect sizes, a big no-no:

Study 1. In a simple test of scientists’ appraisals of deep, specialized studies vs. broader studies that span multiple domains, we created brief hypothetical descriptions of two studies (Fig. 1; see details in Supporting Information). Counterbalancing the sequence of the descriptions in a sample separate from our primary (Study 2) sample, we found that these scientists considered the broader study to be riskier (means = 4.61 vs. 3.15; t = 12.94, P < 0.001), a less significant opportunity (5.17 vs. 5.83; t = 6.13, P < 0.001), and of lower potential importance (5.35 vs. 5.72; t = 3.47, P < 0.001). They reported being less likely to pursue the broader project (on a 100% probability scale, 59.9 vs. 73.5; t = 14.45, P < 0.001). Forced to choose, 64% chose the deep project and 33% (t = 30.12, P < 0.001) chose the broad project (3% were missing). These results support the assumptions underlying our Study 2 predictions, that the perceived risk/return trade-off generally favors choosing depth over breadth.

Since they don’t mean the SDs, one cannot calculate r or d from their data I think. Unless one can get it from the t-values (not sure). One can of course calculate odds ratios using their mean values, but I’m not sure this would be a meaningful statistic (not a ratio scale, maybe not even an interval scale).

Their model fitting comparison is pretty bad, since they only tried their preferred model vs. an implausible straw man model:

Study 2. We conducted confirmatory factor analysis to assess the adequacy of the measurement component of the proposed model and to evaluate the model relative to alternative models (21). A six-factor model, in which items measuring our six self-reported dispositional variables loaded on separate correlated factors, had a significant χ 2 test [χ 2 (175) = 615.09, P < 0.001], and exhibited good fit [comparative fit index (CFI) = 0.90, root mean square error of approximation (RMSEA) = 0.07]. Moreover, the six-factor model’s standardized loadings were strong and significant, ranging from 0.50 to 0.93 (all P < 0.01). We compared the hypothesized measurement model to a one-factor model (22) in which all of the items loaded on a common factor [χ 2 (202) = 1315.5, P < 0.001, CFI = 0.72, RMSEA = 0.17] and found that the hypothesized six-factor model fit the data better than the one-factor model [χ 2 (27) = 700.41, P < 0.001].

Not quite sure how this was done. Too little information given. Did they use item-level modeling or? It sort of sounds like it. Since the data isn’t given, one cannot confirm this, or do other item-level modeling. For instance, if I were to analyze it, I would probably have the items of their competitiveness and performance scales load on a common latent factor (r=.39), as well as the items from the exploration and learning scales on their latent factor, maybe try with openness too (r’s .23, .30, .17).

Of other notes in their correlations: Openness is correlated with being in academia vs. non-academia (r=.22), so there is some selection going on not just with general intelligence there.

One can study a given human trait at many levels. Probably the most common is the individual level. The next-most common the inter-national, and the least common perhaps the intra-national. This last one can be done at various level too, e.g. state, region, commune, and city. These divisions usually vary by country.

The study of general intelligence (GI) at these higher levels has been called the ecology of intelligence by Richard Lynn (1979, 1980) and the sociology of intelligence by Gottfredson (1998). Lynn’s two old papers cited before actually contain quite a bit of data which can be re-analyzed too. I will do so in a future post. I also decided that this series of posts will have to turn into one big paper with a review and meta-analysis. There are strong patterns in the data not previously explored or synthesized by researchers.

Lynn has published a number of papers on the regions of Italy (2010a, 2010b, 2012 and Lynn and Piffer 2014) and it is this topic I turn to in this post.

Lynn’s 2010 data

True to his style, Lynn (2010a) contains the raw data used for his analysis. It is fortunate because it means anyone can re-analyze them. His paper contains the following variables:

  1. 3x PISA subtests: reading, math, science
  2. An average of these PISA scores
  3. An IQ derived from the average
  4. Stature for 1855, 1910, 1927 and 1980
  5. Per capita income for 1970 and 2003
  6. Infant mortality for 1955 and 1999
  7. Literacy 1880
  8. Years of education 1951, 1971 and 2001
  9. Latitude.

These data are given for 12 Italian regions.

Lynn himself merely did correlational analysis and discussed the results. The data however can be usefully factor analyzed to extract a G (from the three PISA subtests) and S factor (from all the socioeconomic variables). I imported the data into R.

Lynn’s choice of variables is quite odd. They are not all from the same years, presumably because he picked them from various other papers instead of going to the Italian statistics website to fetch some himself. This opens the question of how to analyze them. I did this: I did a factor analysis (MinRes, default settings for fa() from the psych package) on the new socioeconomic data only, old data only, and all of it. The two factor analyses of the limited datasets did not reveal anything interesting not shown in the full analysis, so I only show results from the full analysis. Note that by doing this, I broke the rule of thumb about the number of variables per case (at least 2) because there are 7 variables in my analysis but only 13 cases with full data. The loading plot is:


This plot reveals no surprises.

The loadings for the G factor with the PISA subtests were all .99, so it is pointless to post a plot. The scatter plot for G and S is:


And MCV with reversing:


New data

Being dissatisfied with the data Lynn reported, I decided to collect more data. The PISA 2012 results have PISA scores for more regions than before which allows for an analysis with more cases. This also means that one can use more variables in the factor analysis. The new PISA data has 22 regions, so one can use about 11 variables. However, due to some missing data, only 21 regions were available for analysis (Südtirol had some missing data). So I settled on using 10 variables.

To get data for the analysis, I followed the approach taken in the previous post on the S factor in US states. I went to the official statistics bank, IStat, and fetched data for the regions. Like before, for MCV to work well, one needs a diverse selection of variables, so that there is diversity in their S loadings (not just direction of loading). I settled on the following 10 variables:

  1. Political participation index, 9 years
  2. Percent with normal weight, 9 years
  3. Percent smokers, 10 years
  4. Intentional homicide rate, 4 years
  5. Total crime rate, 4 years
  6. Unemployment, 10 years
  7. Life expectancy males, 10 years
  8. Total fertility rate, 10 years
  9. Interpersonal trust index, 5 years
  10. No savings percent, 10 years

For all variables, I calculated the mean for all years. I fetched the last 10 years for all data when available.

For cognitive data, I fetched the regional scores for reading, mathematics and science subtests from PISA 2012, Annex B2.

Factor analysis

I proceeded like above. The loadings plot is:


There are two odd results. Total crime rate has a slight positive loading (.16) while intentional homicide rate has a strong negative loading (-.72). Lynn reported a similar finding in his 1980 paper on Britain. He explained it as being due to urbanization, which increases population density which increases crime rates (more opportunities, more interpersonal conflicts). An alternative hypothesis is that the total crime rate is being increased by immigrants who live mostly in the north. Perhaps one can get crime rates for natives only to test this. A third hypothesis is that it has to do with differences in the legal system, for instance, prosecutor practice in determining which actions to pull into the legal system.

The second odd result is that fertility has a positive loading. Generally, it has been found that fertility has a slight negative correlation with GI and s factor at the individual level, see e.g. Lynn (2011). It has also been found that internationally, GI has a strong negative relationship, -.5 to -.7 depending on measure, to fertility (Shatz, 2008; Lynn and Harvey, 2008). I also found something similar, -.5, when I examined Danish immigrant groups by country of origin (Kirkegaard, 2014). However, if one examines European countries only, one sees that fertility is relatively ‘high’ (a bit below 2) in the northern countries (Nordic countries, UK), and low in the southern and eastern countries. This means that the correlation of fertility between countries in Europe and IQ (e.g. PISA) is positive. Maybe this has some relevance to the current finding. Maybe immigrants are pulling the fertility up in the northern regions.

There is little to report from the factor analysis of PISA results. All loadings between .98 and .99.

Scatter plot of S and G


MCV with reversing


Inter-dataset scatter plots

To examine the inter-dataset stability of factor scores:

S_S2 G_G2

For one case, the Lynn dataset had data for a merged region. I merged the two regions in the new dataset to match it up against the one from Lynn’s. This is the conservative choice. One could have used Lynn’s data for both regions instead which would have increased the sample size by 1.


The results for the regional G and S in Italian regions is especially strong. They rival even the international S factor in their correlation with the G estimates. Italy really is a very divided country. Stability across datasets was very strong too, so Lynn’s odd choice of data was not inflating the results.

MCV worked better in the dataset with more and more diverse indicator variables for S, as would be expected if the correlation was artificially low in the first dataset due to restriction of range in the S loadings.

Supplementary material

All project files (R source code, data files, plots) are available on the Open Science Framework repository.

Thanks to Davide Piffer for catching an error + help in matching the regions up from the two datasets.


  • Gottfredson, L. S. (1998). Jensen, Jensenism, and the sociology of intelligence. Intelligence, 26(3), 291-299.
  • Kirkegaard, E. O. (2014). Criminality and fertility among danish immigrant populations. Open Differential Psychology.
  • Lynn, R. (1979). The social ecology of intelligence in the British Isles. British Journal of Social and Clinical Psychology, 18(1), 1-12.
  • Lynn, R. (1980). The social ecology of intelligence in France. British Journal of Social and Clinical Psychology, 19(4), 325-331.
  • Lynn, R., & Harvey, J. (2008). The decline of the world’s IQ. Intelligence, 36(2), 112-120.
  • Lynn, R. (2010a). In Italy, north–south differences in IQ predict differences in income, education, infant mortality, stature, and literacy. Intelligence, 38(1), 93-100.
  • Lynn, R. (2010b). IQ differences between the north and south of Italy: A reply to Beraldo and Cornoldi, Belacchi, Giofre, Martini, and Tressoldi. Intelligence, 38(5), 451-455.
  • Lynn, R. (2011). Dysgenics: Genetic deterioration in modern populations. Second edition. Westport CT.
  • Lynn, R. (2012). IQs in Italy are higher in the north: A reply to Felice and Giugliano. Intelligence, 40(3), 255-259.
  • Piffer, D., & Lynn, R. (2014). New evidence for differences in fluid intelligence between north and south Italy and against school resources as an explanation for the north–south IQ differential. Intelligence, 46, 246-249.
  • Shatz, S. M. (2008). IQ and fertility: A cross-national study. Intelligence, 36(2), 109-111.