This is a post in the ongoing series of comments on studies of international/transracial adoption. A global genetic/hereditarian model of cognitive differences and their socioeconomic effects implies that adoptees from different populations/countries/regions should show the usual group differences in the above-mentioned traits and outcomes, all else equal. All else is of course not equal, since adoptees from different regions can be adopted at different ages, experience different environments leading up to the adoption, possibly experience different environments after adoption thru no cause of their own (discrimination), and so on. It is not a strict test: finding the usual group differences can be explained by non-genetic factors, and finding no differences or unexpected ones could be consistent with a genetic model given strong non-genetic effects such as differences in adoption practices between origin countries/regions/populations. However, were such differences to be found relatively consistently in sufficiently powered studies, it would be an important verified prediction of the genetic model, broadly speaking. The question thus remains: what do the studies show?

Bruce et al (2009) – Disinhibited Social Behavior Among Internationally Adopted Children

Their abstract reads:

Postinstitutionalized children frequently demonstrate persistent socioemotional difficulties. For example, some postinstitutionalized children display an unusual lack of social reserve with unfamiliar adults. This behavior, which has been referred to as indiscriminate friendliness, disinhibited attachment behavior, and disinhibited social behavior, was examined by comparing children internationally adopted from institutional care to children internationally adopted from foster care and children raised by their biological families. Etiological factors and behavioral correlates were also investigated. Both groups of adopted children displayed more disinhibited social behavior than the nonadopted children. Of the etiological factors examined, only the length of time in institutional care was related to disinhibited social behavior. Disinhibited social behavior was not significantly correlated with general cognitive ability, attachment-related behaviors, or basic emotion abilities. However, this behavior was negatively associated with inhibitory control abilities even after controlling for the length of time in institutional care. These results suggest that disinhibited social behavior might reflect underlying deficits in inhibitory control.

While this does not seem immediately relevant, the authors do investigate IQ. The study design is a three-way comparison between adoptees from foster homes, adoptees from institutional care, and non-adoptees. The samples are small: 40 x 40 x 40. These are children who spent most of their lives first in foster homes and were then adopted, children who spent most of their lives in institutional care and were then adopted, and biological children of the adoptive families for comparison. The groups were not equal in origin composition:

However, because countries have institutional or foster systems to care for wards of the state, the institutional care and foster care groups differed in terms of country of origin. The institutional care group was primarily from Eastern Europe (45%) and China (43%), whereas the foster care group was primarily from South Korea (80%).

Their cognitive measure is:

General cognitive ability. To provide an estimate of the children’s general cognitive functioning, each child was administered the vocabulary and block design subtests of the Wechsler Intelligence Scale for Children, 3rd edition (Wechsler, 1991). These subtests are considered the best measures of verbal and nonverbal intelligence, respectively, and are highly correlated with the full scale intelligence quotient (Sattler, 1992). Raw scores on the subtests were converted into age-normed scaled scores. The scaled scores were then summed and transformed into full scale intelligence quotient equivalents.

They give the mean FSIQ by the above three groups, not origin groups. These are:

Group Institutional care Foster care Control
FSIQ 102.68 (16.25) 109.37 (12.93) 117.11 (15.88)

Note: numbers in parentheses are SDs.

The high scores are presumably due to the FLynn effect. The IQ test is from 1991, but the study is from 2009, so there have been 18 years of raw score gains relative to the normative sample. An alternative idea is that the families were simply above average in SES/IQ, which boosts the IQ scores of younger children. Nearly 100% of the adoptive families were Caucasian (presumably European) and were part of the Minnesota International Adoption Project Registry. According to Wiki, the non-Hispanic White % of this state is 83%, so Caucasians are a bit overrepresented. In general, these elevation effects are not important when comparing groups within a study.
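To get a feel for the size of the norm-aging effect, here is a rough sketch. The rate of 0.3 IQ points per year is an assumption brought in for illustration (a commonly cited round figure), not something reported in the study, and the 18 years is simply the gap used in the text. The point of the sketch is that subtracting a constant shifts every group mean equally and leaves the gaps between groups untouched.

```python
# Rough norm-aging (Flynn effect) adjustment for the three group means.
# ASSUMPTION: 0.3 IQ points of gain per year; this rate is NOT from the study.
ASSUMED_GAIN_PER_YEAR = 0.3
YEARS_SINCE_NORMING = 2009 - 1991  # the 18 years mentioned in the text

reported_means = {
    "Institutional care": 102.68,
    "Foster care": 109.37,
    "Control": 117.11,
}

# Expected inflation of scores due to outdated norms, under the assumed rate.
expected_inflation = ASSUMED_GAIN_PER_YEAR * YEARS_SINCE_NORMING

adjusted_means = {
    group: round(mean - expected_inflation, 2)
    for group, mean in reported_means.items()
}

# Gaps between groups are the same before and after the adjustment.
gap_before = reported_means["Control"] - reported_means["Institutional care"]
gap_after = adjusted_means["Control"] - adjusted_means["Institutional care"]

for group, mean in adjusted_means.items():
    print(f"{group}: {mean}")
print(f"Control minus institutional gap: {gap_before:.2f} vs {gap_after:.2f}")
```

Under this assumed rate the control group would still sit well above 100 after adjustment, which fits the idea that above-average families also contribute to the elevation.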

I contacted the first author to ask if she would give me some more data, and she obliged:

Please find the requested information below. For each region of origin, I provided the number of children, mean, and standard deviation for the Block Design Standard Score, Vocabulary Standard Score, and full scale IQ equivalent. I did not provide these statistics for regions with less than 5 children. Please let me know if you have any questions

Origin N FSIQ Vocabulary Block design
Institutional care
Eastern Europe (e.g., Russia, Romania, Ukraine) 18 94.84 (15.070) 8.39 (2.615) 9.83 (3.884)
China 17 111.77 (13.774) 11.76 (2.969) 12.29 (3.274)
Foster care
South Korea 31/32 110.38 (12.457) 12.10 (2.970) 11.53 (2.940)
Guatemala 5 104.64 (16.609) 12.40 (3.050) 9.20 (3.033)

Notes: Sample size for S. Korea was 31, 31, 32. No explanation given. Numbers in parentheses are SDs.

Again we see that East Asians do well, altho not better than the control children (mean=117). Eastern Europeans do less well, but it is hard to say the exact expected mean since it is not stated how many come from which countries. The national IQ dataset of Lynn and Vanhanen (2012) gives 91 as the IQ of Romania, 96.6 for Russia and 94.3 for Ukraine. The Guatemalans certainly do better than expected (79 IQ), but N=5, and the demographics of Guatemala are very mixed, making it possible that the adoptees actually had predominantly European ancestry.
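How much weight each group mean can bear depends on its sample size. As a minimal sketch, the standard error of each mean can be computed from the N and SD in the author's reply. The 1.96 multiplier is a normal approximation chosen here for simplicity; for N=5 a t-based interval would be wider still.

```python
import math

# (N, mean FSIQ equivalent, SD) per origin group, taken from the email reply.
groups = {
    "Eastern Europe": (18, 94.84, 15.070),
    "China": (17, 111.77, 13.774),
    "South Korea": (31, 110.38, 12.457),
    "Guatemala": (5, 104.64, 16.609),
}


def standard_error(n, sd):
    """Standard error of a mean from the group size and SD."""
    return sd / math.sqrt(n)


def approx_interval(n, mean, sd, z=1.96):
    """Approximate 95% interval (normal approximation, an assumption)."""
    half_width = z * standard_error(n, sd)
    return (mean - half_width, mean + half_width)


summary = {
    name: (standard_error(n, sd), approx_interval(n, mean, sd))
    for name, (n, mean, sd) in groups.items()
}

for name, (se, (low, high)) in summary.items():
    print(f"{name}: SE = {se:.2f}, approx. interval = {low:.1f} to {high:.1f}")
```

The Guatemalan standard error comes out more than three times that of the South Korean group, which is why the N=5 result should be read with caution.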

The email reply

Hello Emil,
Please find the requested information below. For each region of origin, I provided the number of children, mean, and standard deviation for the Block Design Standard Score, Vocabulary Standard Score, and full scale IQ equivalent. I did not provide these statistics for regions with less than 5 children. Please let me know if you have any questions
Thank you for your interest in our publication,

Institutional care group:
Eastern Europe (e.g., Russia, Romania, Ukraine)
N Mean Std. Deviation
Block Design Standard Score 18 9.83 3.884
Vocabulary Standard Score 18 8.39 2.615
IQ equivalent 18 94.84 15.070

China
N Mean Std. Deviation
Block Design Standard Score 17 12.29 3.274
Vocabulary Standard Score 17 11.76 2.969
IQ equivalent 17 111.77 13.774

Foster care group:
South Korea
N Mean Std. Deviation
Block Design Standard Score 32 11.53 2.940
Vocabulary Standard Score 31 12.10 2.970
IQ equivalent 31 110.38 12.457

Guatemala
N Mean Std. Deviation
Block Design Standard Score 5 9.20 3.033
Vocabulary Standard Score 5 12.40 3.050
IQ equivalent 5 104.64 16.609


Bruce, J., Tarullo, A. R., & Gunnar, M. R. (2009). Disinhibited social behavior among internationally adopted children. Development and psychopathology, 21(01), 157-171.

I reanalyze data reported by Richard Lynn in a 1979 paper concerning IQ and socioeconomic variables in 12 regions of the United Kingdom as well as Ireland. I find a substantial S factor across regions (66% of variance with MinRes extraction). I produce a new best estimate of the G scores of regions. The correlation of this with the S scores is .79. The MCV with reversal correlation is .47.

The interdisciplinary academic field examining the effect of general intelligence on large scale social phenomena has been called social ecology of intelligence by Richard Lynn (1979, 1980) and sociology of intelligence by Gottfredson (1998). One could also call it cognitive sociology by analogy with cognitive epidemiology (Deary, 2010; Special issue in Intelligence Volume 37, Issue 6, November–December 2009; Gottfredson, 2004). Whatever the name, it is a field that has received renewed attention recently. Richard Lynn and co-authors report data on Italy (Lynn 2010a, 2010b, 2012a, Piffer and Lynn 2014, see also papers by critics), Spain (Lynn 2012b), China (Lynn and Cheng, 2013) and India (Lynn and Yadav, 2015). Two of his older studies cover the British Isles and France (Lynn, 1979, 1980).

A number of my recent papers have reanalyzed data reported by Lynn, as well as additional data I collected. These cover Italy, India, United States, and China (Kirkegaard 2015a, 2015b, 2015c, 2015d). This paper reanalyzes Lynn’s 1979 paper.

Cognitive data and analysis

Lynn’s paper contains 4 datasets of IQ data that cover 11 regions in Great Britain. He further summarizes some studies that report data on Northern Ireland and the Republic of Ireland, so that his cognitive data cover the entire British Isles. Lynn only uses the first 3 datasets to derive a best estimate of the IQs. The last dataset does not report cognitive scores as IQs, but merely percentages of children falling into certain score intervals. Lynn converts these to a mean (method not disclosed). However, he is unable to convert this score to the IQ scale since the inter-personal standard deviation (SD) is not reported in the study. Lynn thus overlooks the fact that one can use the inter-regional SD from the first 3 studies to convert the 4th study to the common scale. Furthermore, using the intervals one could presumably estimate the inter-personal SD, altho I shall not attempt this. The method for converting the mean scores to the IQ scale is this:

  1. Standardize the values by subtracting the mean and dividing by the inter-regional SD.
  2. Calculate the inter-regional SD in the other studies, and find the mean of these. Do the same for the inter-regional means.
  3. Multiply the standardized scores by the mean inter-regional SD from the other studies and add the mean of the inter-regional means.
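The three steps above can be sketched as follows. The function name and the example numbers are hypothetical and only illustrate the procedure. The sketch also assumes the sample SD (n-1 denominator), which the text does not specify.

```python
from statistics import mean, stdev


def rescale_to_common_scale(scores, reference_studies):
    """Put regional scores from one study on the scale of the other studies.

    scores: regional means from the study to convert (hypothetical name).
    reference_studies: list of lists of regional means already on the IQ scale.
    ASSUMPTION: sample SD (n-1 denominator) is used throughout.
    """
    # Step 1: standardize using the study's own inter-regional mean and SD.
    own_mean = mean(scores)
    own_sd = stdev(scores)
    z_scores = [(x - own_mean) / own_sd for x in scores]

    # Step 2: average the inter-regional SDs and means of the other studies.
    target_sd = mean([stdev(study) for study in reference_studies])
    target_mean = mean([mean(study) for study in reference_studies])

    # Step 3: multiply by the target SD and add the target mean.
    return [z * target_sd + target_mean for z in z_scores]


# Made-up illustration data, NOT values from Lynn (1979).
fourth_study = [52.0, 48.5, 50.1, 47.2]
reference_studies = [
    [101.0, 99.0, 100.5, 98.0],
    [102.0, 98.5, 100.0, 97.5],
    [101.5, 99.5, 100.2, 98.8],
]

rescaled = rescale_to_common_scale(fourth_study, reference_studies)
print([round(x, 2) for x in rescaled])
```

The rescaled values keep the rank order of the regions and simply take on the average inter-regional mean and SD of the reference studies.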

However, I did not use this method. I instead factor analyzed the 4 IQ datasets as given and extracted 1 factor (extraction method = MinRes). All factor loadings were strongly positive, indicating that G could be reliably measured among the regions. The factor score from this analysis was put on the same scale as the first 3 studies by the method above. This is necessary because the IQs for Northern Ireland and the Republic of Ireland are given on that scale. Table 1 shows the correlations between the cognitive variables. The correlations between G and the 4 indicator variables are their factor loadings (italic).

Table 1 – Correlations between cognitive datasets
Variable [unnamed 1] [unnamed 2] Douglas Davis G Lynn.mean
[unnamed 1] 1 0.66 0.92 0.62 0.96 0.92
[unnamed 2] 0.66 1 0.68 0.68 0.75 0.89
Douglas 0.92 0.68 1 0.72 0.99 0.93
Davis 0.62 0.68 0.72 1 0.76 0.74
G 0.96 0.75 0.99 0.76 1 0.96
Lynn.mean 0.92 0.89 0.93 0.74 0.96 1


It can be noted that my use of factor analysis over simply averaging the datasets had little effect. The correlation of Lynn’s method (mean of datasets 1-3) and my G factor is .96.
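To illustrate what extracting a single factor from the four datasets looks like, here is a minimal sketch using the correlations among them from Table 1. Note the swap in technique: this computes the first principal component by power iteration, not the MinRes factor used in the analysis, so the loadings will not match the G column of Table 1 exactly. The first two datasets are labelled generically because their names are not shown in Table 1 as reproduced here.

```python
import math

# Correlations among the 4 IQ datasets, read off Table 1.
labels = ["first dataset", "second dataset", "Douglas", "Davis"]
R = [
    [1.00, 0.66, 0.92, 0.62],
    [0.66, 1.00, 0.68, 0.68],
    [0.92, 0.68, 1.00, 0.72],
    [0.62, 0.68, 0.72, 1.00],
]


def first_component(matrix, iterations=200):
    """Largest eigenvalue and its eigenvector via power iteration."""
    n = len(matrix)
    vector = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        product = [
            sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)
        ]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    # Rayleigh quotient gives the eigenvalue for the unit-length vector.
    eigenvalue = sum(
        vector[i] * sum(matrix[i][j] * vector[j] for j in range(n))
        for i in range(n)
    )
    return eigenvalue, vector


eigenvalue, vector = first_component(R)

# Component loadings: eigenvector scaled by the square root of the eigenvalue.
loadings = [math.sqrt(eigenvalue) * x for x in vector]
variance_share = eigenvalue / len(R)

for label, loading in zip(labels, loadings):
    print(f"{label}: {loading:.2f}")
print(f"Share of variance: {variance_share:.2f}")
```

All four loadings come out strongly positive under this method as well, in line with the statement that a general factor can be measured across the regions.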

Socioeconomic data and analysis

Lynn furthermore reports 7 socioeconomic variables. I quote his description of these:

“1. Intellectual achievement: (a) first-class honours degrees. All first-class honours graduates of the year 1973 were taken from all the universities in the British Isles (with the exception of graduates of Birkbeck College, a London College for mature and part-time students whose inclusion would bias the results in favour of London). Each graduate was allocated to the region where he lived between the ages of 11 and 18. This information was derived from the location of the graduate’s school. Most of the data were obtained from The Times, which publishes annually lists of students obtaining first-class degrees and the schools they attended. Students who had been to boarding schools were written to requesting information on their home residence. Information from the Republic of Ireland universities was obtained from the college records.

The total number of students obtaining first-class honours degrees was 3477, and information was obtained on place of residence for 3340 of these, representing 96 06 per cent of the total.
There are various ways of calculating the proportions of first-class honours graduates produced by each region. Probably the most satisfactory is to express the numbers of firsts in each region per 1000 of the total age cohorts recorded in the census of 1961. In this year the cohorts were approximately 9 years old. The reason for going back to 1961 for a population base is that the criterion taken for residence is the school attended and the 1961 figures reduce the distorting effects of subsequent migration between the regions. However, the numbers in the regions have not changed appreciably during this period, so that it does not matter greatly which year is taken for picking up the total numbers of young people in the regions aged approximately 21 in 1973. (An alternative method of calculating the regional output of firsts is to express the output as a percentage of those attending university. This method yields similar figures.)

2. Intellectual achievement: (b) Fellowships of the Royal Society. A second measure of intellectual achievement taken for the regions is Fellowships of the Royal Society. These are well-known distinctions for scientific work in the British Isles and are open equally to citizens of both the United Kingdom and the Republic of Ireland. The population consists of all Fellows of the Royal Society elected during the period 1931-71 who were born after the year 1911. The number of individuals in this population is 321 and it proved possible to ascertain the place of birth of 98 per cent of these. The Fellows were allocated to the region in which they were born and the numbers of Fellows born in each region were then calculated per million of the total population of the region recorded in the census of 1911. These are the data shown in Table 2. The year 1911 was taken as the population base because the majority of the sample was born between the years 1911-20, so that the populations in 1911 represent approximately the numbers in the regions around the time most of the Fellows were born. (The populations of the regions relative to one another do not change greatly over the period, so that it does not make much difference to the results which census year is taken for the population base.)

3. Per capita income. Figures for per capita incomes for the regions of the United Kingdom are collected by the United Kingdom Inland Revenue. These have been analysed by McCrone (1965) for the standard regions of the UK for the year 1959/60. These results have been used and a figure for the Republic of Ireland calculated from the United Nations Statistical Yearbook.

4. Unemployment. The data are the percentages of the labour force unemployed in the regions for the year 1961 (Statistical Abstracts of the UK and of Ireland).

5. Infant mortality. The data are the numbers of deaths during the first year of life expressed per 1000 live births for the year 1961 (Registrar Generals’ Reports).

6. Crime. The data are offences known to the police for 1961 and expressed per 1000 population (Statistical Abstracts of the UK and of Ireland).

7. Urbanization. The data are the percentages of the population living in county boroughs, municipal boroughs and urban districts in 1961 (Census).”

Lynn furthermore reports historical achievement scores as well as an estimate of inter-regional migration (actually change in population which can also be due to differential fertility). I did not use these in my analysis but they can be found in the datafile in the supplementary material.

Since there are 13 regions in total and 7 variables, I can analyze all variables at once and still almost conform to the rule of thumb of having a case-to-variable ratio of 2 (Zhao, 2009). Table 2 shows the factor loadings from this factor analysis as well as the correlation with G for each socioeconomic variable.

Table 2 – Correlations between S, S indicators, and G
Variable S G
Fellows.RS 0.92 0.92
First.class 0.55 0.58
Income 0.99 0.72
Unemployment -0.85 -0.79
Infant.mortality -0.68 -0.69
Crime 0.83 0.52
Urbanization 0.88 0.64
S 1 0.79


The crime variable had a strong positive loading on the S factor and also a positive correlation with the G factor. This is in contrast to the negative relationship found at the individual level between the g factor and crime variables at about r=-.2 (Neisser 1996). The difference in mean IQ between criminal and non-criminal samples is usually around 7-15 points depending on which criminal group (sexual, violent and chronic offenders score lower than other offenders; Guay et al, 2005). Beaver and Wright (2011) found that IQ of counties was also negatively related to crime rates, r’s range from -.29 to -.58 depending on type of crime variable (violent crimes highest). At the level of country of origin groups, Kirkegaard and Fuerst (2014a) found that crime variables had strong negative loadings on the S factor (-.85 and -.89) and negative correlations with country of origin IQ. Altho not reported in the paper, Kirkegaard (2014b) found that the loadings of 2 crime variables on the S factor in Norway among country of origin groups were -.63 and -.86 (larceny and violent crime; calculated from the supplementary material using the fully imputed dataset). Kirkegaard (2015a) found S loadings of .16 and -.72 for total crime and intentional homicide variables in Italy. Among US states, Kirkegaard (2015c) found S loadings of -.61 and -.71 for murder rate and prison rate. The scatter plot of G and S is shown in Figure 1.

Figure 1 – Scatter plot of regional G and S

So, the most similar finding in previous research is that from Italy. There are various possible explanations. Lynn (1979) thinks it is due to large differences in urbanization (which loads positively in multiple studies, .88 in this study). There may be some effect of the type of crime measurement. Future studies could examine this question by employing many different crime variables. My hunch is that it is a combination of differences in urbanization (which increases crime), immigration of crime prone persons into higher S areas, and differences in the justice system between areas.

Method of correlated vectors (MCV)

As done in the previous analyses of S factors, I performed MCV analysis to see whether the S factor was the reason for the association between the S indicators and the G factor score. S factor indicators with negative loadings were reversed to avoid inflating the result (these are marked with “_r” in the plot). The result is shown in Figure 2.

Figure 2 – MCV scatter plot

As in the previous analyses, the relationship was positive even after reversal.
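The MCV calculation can be sketched directly from the rounded values in Table 2: correlate the vector of S loadings with the vector of correlations with G, after flipping the sign of both values for indicators with negative loadings. The function names are my own for this illustration, and results from rounded table values can differ slightly from those based on the full data.

```python
import math

# Indicator: (loading on S, correlation with G), copied from Table 2.
indicators = {
    "Fellows.RS": (0.92, 0.92),
    "First.class": (0.55, 0.58),
    "Income": (0.99, 0.72),
    "Unemployment": (-0.85, -0.79),
    "Infant.mortality": (-0.68, -0.69),
    "Crime": (0.83, 0.52),
    "Urbanization": (0.88, 0.64),
}


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mcv(table, reverse_negative=True):
    """Correlate S loadings with G correlations across indicators."""
    loadings, g_correlations = [], []
    for loading, r_with_g in table.values():
        if reverse_negative and loading < 0:
            # Reverse the indicator so that its loading becomes positive.
            loading, r_with_g = -loading, -r_with_g
        loadings.append(loading)
        g_correlations.append(r_with_g)
    return pearson(loadings, g_correlations)


r_reversed = mcv(indicators)
r_unreversed = mcv(indicators, reverse_negative=False)

print(f"MCV with reversal: {r_reversed:.2f}")
print(f"MCV without reversal: {r_unreversed:.2f}")
```

With reversal the result is close to the .47 reported in the abstract. Without reversal the correlation is much higher, because the mix of positive and negative loadings stretches the range of both vectors; this is the inflation the reversal is meant to avoid.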

Per capita income and the FLynn effect

An interesting quote from the paper is:

This interpretation [that the first factor of his factor analysis is intelligence] implies that the mean population IQs should be regarded as the cause of the other variables. When causal relationships between the variables are considered, it is obvious that some of the variables are dependent on others. For instance, people do not become intelligent as a consequence of getting a first-class honours degree. Rather, they get firsts because they are intelligent. The most plausible alternative causal variable, apart from IQ, is per capita income, since the remaining four are clearly dependent variables. The arguments against positing per capita income as the primary cause among this set of variables are twofold. First, among individuals it is doubtful whether there is any good evidence that differences in income in affluent nations are a major cause of differences in intelligence. This was the conclusion reached by Burt (1943) in a discussion of this problem. On the other hand, even Jencks (1972) admits that IQ is a determinant of income. Secondly, the very substantial increases in per capita incomes that have taken place in advanced Western nations since 1945 do not seem to have been accompanied by any significant increases in mean population IQ. In Britain the longest time series is that of Burt (1969) on London schoolchildren from 1913 to 1965 which showed that the mean IQ has remained approximately constant. Similarly in the United States the mean IQ of large national samples tested by two subtests from the WISC has remained virtually the same over a 16 year period from the early 1950s to the mid-1960s (Roberts, 1971). These findings make it doubtful whether the relatively small differences in per capita incomes between the regions of the British Isles can be responsible for the mean IQ differences. 
It seems more probable that the major causal sequence is from the IQ differences to the income differences although it may be that there is also some less important reciprocal effect of incomes on IQ. This is a problem which could do with further analysis.

Compare with Lynn’s recent overview of the history of the FLynn effect (Lynn, 2013).


  • Beaver, K. M., & Wright, J. P. (2011). The association between county-level IQ and county-level crime rates. Intelligence, 39, 22-26. doi:10.1016/j.intell.2010.12.002
  • Deary, I. J. (2010). Cognitive epidemiology: Its rise, its current issues, and its challenges. Personality and individual differences, 49(4), 337-343.
  • Guay, J. P., Ouimet, M., & Proulx, J. (2005). On intelligence and crime: A comparison of incarcerated sex offenders and serious non-sexual violent criminals. International journal of law and psychiatry, 28(4), 405-417.
  • Gottfredson, L. S. (1998). Jensen, Jensenism, and the sociology of intelligence. Intelligence, 26(3), 291-299.
  • Gottfredson, L. S. (2004). Intelligence: Is it the epidemiologists’ elusive “fundamental cause” of social class inequalities in health? Journal of personality and social psychology, 86(1), 174.
  • Intelligence, Special Issue: Intelligence, health and death: The emerging field of cognitive epidemiology. Volume 37, Issue 6, November–December 2009
  • Kirkegaard, E. O. W., & Fuerst, J. (2014a). Educational attainment, income, use of social benefits, crime rate and the general socioeconomic factor among 71 immigrant groups in Denmark. Open Differential Psychology.
  • Kirkegaard, E. O. W. (2014b). Crime, income, educational attainment and employment among immigrant groups in Norway and Finland. Open Differential Psychology.
  • Kirkegaard, E. O. W. (2015a). S and G in Italian regions: Re-analysis of Lynn’s data and new data. The Winnower.
  • Kirkegaard, E. O. W. (2015b). Indian states: G and S factors. The Winnower.
  • Kirkegaard, E. O. W. (2015c). Examining the S factor in US states. The Winnower.
  • Kirkegaard, E. O. W. (2015d). The S factor in China. The Winnower.
  • Lynn, R. (1979). The social ecology of intelligence in the British Isles. British Journal of Social and Clinical Psychology, 18(1), 1-12.
  • Lynn, R. (1980). The social ecology of intelligence in France. British Journal of Social and Clinical Psychology, 19(4), 325-331.
  • Lynn, R. (2010a). In Italy, north–south differences in IQ predict differences in income, education, infant mortality, stature, and literacy. Intelligence, 38(1), 93-100.
  • Lynn, R. (2010b). IQ differences between the north and south of Italy: A reply to Beraldo and Cornoldi, Belacchi, Giofre, Martini, and Tressoldi. Intelligence, 38(5), 451-455.
  • Lynn, R. (2012a). IQs in Italy are higher in the north: A reply to Felice and Giugliano. Intelligence, 40(3), 255-259.
  • Lynn, R. (2012b). North-south differences in Spain in IQ, educational attainment, per capita income, literacy, life expectancy and employment. Mankind Quarterly, 52(3/4), 265.
  • Lynn, R. (2013). Who discovered the Flynn effect? A review of early studies of the secular increase of intelligence. Intelligence, 41(6), 765-769.
  • Lynn, R., & Cheng, H. (2013). Differences in intelligence across thirty-one regions of China and their economic and demographic correlates. Intelligence, 41(5), 553-559.
  • Lynn, R., & Yadav, P. (2015). Differences in cognitive ability, per capita income, infant mortality, fertility and latitude across the states of India. Intelligence, 49, 179-185.
  • Neisser, U., et al. (1996). Intelligence: Knowns and unknowns. American Psychologist, 51(2), 77-101 (see p. 85).
  • Piffer, D., & Lynn, R. (2014). New evidence for differences in fluid intelligence between north and south Italy and against school resources as an explanation for the north–south IQ differential. Intelligence, 46, 246-249.
  • Zhao, N. (2009). The Minimum Sample Size in Factor Analysis. Encorewiki.


This is another relatively small international adoption study. 159 children were adopted into homes in the Netherlands from Sri Lanka, Korea and Colombia.

Participants are described as:

The present study examines the development and adjustment of 159 adopted children at age 7 years. The largest group, 129 adopted children, was selected from 2 studies, starting when the child was aged 5 months. In these studies a short-term early intervention was implemented in three sessions at home between 6 and 9 months in an experimental group, and results were compared with a control group. The families for this experiment were randomly recruited through Dutch adoption organizations, and not selected on (future) problems. Also, to avoid selection, the parents were not aware of the intervention when they entered the study. They were requested to participate in a study examining the development of adopted children. The results of the intervention study were reported elsewhere (Juffer, Hoksbergen, Riksen-Walraven, & Kohnstamm, 1997 ; Stams, Juffer, Van IJzendoorn, & Hoksbergen, in press). The intervention was not repeated during the following years. The original studies involved 70 mixed families, i.e., adoptive families with biological children, and 90 all-adoptive families, i.e., adoptive families without biological children. As intervention effects were found at age 7 in a small intervention group of 20 mixed families (Stams et al., in press), we decided to omit this group from the present study. The remaining sample consisted of 55 intervention and 74 control families. An additional group of 30 families, matched on the original criteria, was randomly recruited from one adoption agency at age 7, serving as a post-test-only group. The absence of intervention or testing effects on any of the outcome measures was confirmed in preliminary analyses, contrasting intervention with control groups, and control groups with the post-test-only group recruited at age 7, respectively.

The adoptive parents were Caucasian white, and in all families the mother was the primary caregiver. The families were predominantly from middle-class or upper middle-class backgrounds. The attrition rate was 8 %, that is, 11 of 140 participants from the original studies. The major reasons for declining were disinterest or health problems of family members. Four mothers had died of incurable illnesses. A series of separate Bonferroni-corrected statistical tests confirmed the absence of differential attrition in the total sample with respect to child background variables, such as age at placement, and family background variables, such as socioeconomic status or family type (with or without biological children).

The children, 73 boys and 86 girls, were adopted from Sri Lanka (N=108), South Korea (N=37), and Colombia (N=14). The infants from Sri Lanka were in the care of their biological mother until their adoption placement at a mean age of 7 weeks (SD=3). Korean and Colombian infants stayed in an institution or foster home after separation from their biological mother at birth, until their adoption placement at a mean age of 15 weeks (SD=4). In comparison with adoptions from Romania, for example (O’Connor et al., 2000 ; Rutter et al., 1998), the material conditions in the Korean and Colombian institutions were relatively favorable, as these homes received substantial support from a Dutch adoption agency. However, little is known about the quality of care, whereas one may assume that frequent changes of caretakers and nonoptimal child–caretaker ratios, often found in institutions, resulted in less favorable socioemotional conditions (O’Connor, Bredenkamp, Rutter, & the ERA Study Team, 1999). Little is known about the child-rearing conditions of the Sri Lankian infants after birth. Based on anecdotal evidence from parent reports, pre- and post-natal care for the relinquishing mother and her baby were far from optimal in Sri Lanka, and the health condition of the mother was deplorable in many cases (Juffer, 1993).

The paper is primarily about some socioemotional variables of no particular interest to me. However, they also assessed cognitive ability with the following test:

Intelligence. Intelligence was measured with the abbreviated Revised Amsterdam Child Intelligence Test (RACIT). Bleichrodt, Drenth, Zaal, and Resing (1987) found empirical evidence for convergent validity, as the RACIT correlated r = .86 with the Wechsler Intelligence Scale for Children-Revised (WISC-R). At age 7, the abbreviated RACIT correlated r = .92 with the full RACIT. The abbreviated RACIT showed a somewhat lower test–retest reliability, namely, r = .86 versus r = .88, and a somewhat lower internal consistency, namely, α = .90 versus α = .94, than the full RACIT. The abbreviated RACIT does not seem to underestimate or overestimate the level of individual intelligence.

In the present study, we used the abbreviated RACIT, which consisted of the following subtests: flexibility of closure (α = .84), paired associates (split-half reliability = .77), perceptual reasoning (split-half reliability = .73), vocabulary (α = .74), inductive reasoning (α = .86), ideational fluency (α = .81). The reliability of the abbreviated RACIT was .91 (N = 163), and was estimated on the basis of the number of subtests, the reliabilities of the subtests, and the correlations between the subtests (Nunally, 1978). The raw scores were transformed to standardized intelligence scores with a mean of 100 (SD = 15). The standardized scores were derived from a representative sample of 1415 children between age 4 and 11, drawn from the Dutch general school population in 1982 (Bleichrodt et al., 1987).

The results of the testing at age 7 are:

[Table 8 from Stams et al.; image not reproduced]

The adoptive families in adoption studies are usually somewhat above average which boosts the IQ of especially younger children. This along with the FLynn effect is probably responsible for the higher than 100 scores. But among the groups, we see Koreans on top as is usually seen in these studies.


As Jason Malloy has mentioned, it is strange that in the race intelligence debates, people usually cite the same few studies over and over:

Shortly after writing that post, I decided that more needed to be written about transracial adoption research as a behavior genetic experiment. Arthur Jensen, Richard Lynn, and J. Philippe Rushton have all cited the Minnesota Transracial Adoption Study, as well as several IQ studies of transracially adopted Asians, in support of the hereditarian position. And Richard Nisbett has referenced several other adoption studies that suggest no racial gaps. However, I suspected there was more data for transracially adopted children than what this small cadre of scientists had already discussed (at the very least for important variables other than intelligence); research that could give us a more complete picture of what these unusual children become, and what this can tell us about the causes of ethnic differences in socially valued outcomes.

One way of finding hard to find studies is going thru articles that cite popular reviews of the topic. Sometimes, this is not possible if the reviews have thousands of citations. However, sometimes, it is. In this case, I used the review: Kim, W. J. (1995). International adoption: A case review of Korean children. Child Psychiatry and Human Development, 25(3), 141-154. Then simply look up the studies that cite it (101 results on Scholar). We are looking for studies that report country or area of origin as well as relevant criteria variables such as IQ scores, GPA, educational attainment/achievement, income, socioeconomic status/s factor, crime rates, use of public benefits and so on.

One such study is: Lindblad, F., Hjern, A., & Vinnerljung, B. (2003). Intercountry Adopted Children as Young Adults – A Swedish Cohort Study. American Journal of Orthopsychiatry, 73(2), 190-202. The abstract is promising:

In a national cohort study, the family and labor market situation, health problems, and education of 5,942 Swedish intercountry adoptees born between 1968 and 1975 were examined and compared with those of the general population, immigrants, and a siblings group—all age matched—in national registers from 1997 to 1999. Adoptees more often had psychiatric problems and were longtime recipients of social welfare. Level of education was on par with that of the general population but lower when adjusted for socioeconomic status.

The sample consists of:

There were 5,942 individuals in the adoptee study group: 3,237 individuals were born in the Far East (2,658 were born in South Korea), 1,422 in South Asia, 871 in Latin America, and 412 in Africa. In the other study groups there were 1,884 siblings, 8,834 European immigrants, 3,544 non-European immigrants, and 723,154 individuals in the general population.

So, by the usual standards, this is a very large study. We are interested in the region of birth results. They are in two tables:

[Tables 7 and 8 from Lindblad et al. (2003)]

We can note that the Far East — i.e. mostly South Korean, presumably the rest are North Korean (?), Chinese, Japanese, Vietnamese (?) — usually gets the better outcomes. They were less often married, mixed results for living with parents, more likely to have a university degree, less likely to have only primary school, more likely to be in the workforce, less likely to be unemployed, less likely to receive welfare, mixed results for hospital admissions for substance abuse, much less likely to be admitted for alcohol abuse (likely to be due to Asian alcohol flush syndrome), less likely to be admitted for a psychiatric diagnosis, and less likely to receive disability pension.

It would probably have been better if one could aggregate the results and look at the general socioeconomic factor instead. It is not possible to do so with the above results, since there are only 4 cases and 11 variables. One could calculate a score by choosing some or all of the variables. Or one could assign them factor loadings manually and then calculate scores. I calculated a unit-weighted score based on all but the first two indicators (married and living with parents since these are not socioeconomically important). Two indicators (uni degree and workforce) were reversed (by 1/OR). I also calculated the median score which is resistant to outliers (e.g. the alcohol abuse indicator). Results:


Socioeconomic outcomes by region of origin, and estimated S scores
Group Latin.America Africa South.Asia Far.East
Uni.degree 2.50 1.43 1.67 1
Only.primary.ed 1.60 1.50 1.00 1
Workforce 1.43 1.43 1.11 1
Unemployed 1.30 0.90 1.30 1
Welfare.use 1.90 1.50 1.30 1
Substance.abuse 2.70 0.70 1.00 1
Alcohol.abuse 4.50 4.90 3.60 1
Psychiatric.diag 1.50 1.40 1.20 1
Disability.pension 1.30 1.80 1.30 1
Mean.S 2.08 1.73 1.50 1
Median.S 1.60 1.43 1.30 1
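The scoring step can be sketched in a few lines of Python. This is only an illustration, not the code used for the post: the values are copied from the table above (university degree and workforce already reversed as 1/OR), and the function name `s_scores` is made up for the example.

```python
from statistics import mean, median

# Odds ratios relative to the Far East group, copied from the table above.
# Each list holds the nine indicators used for the score, in table order
# (the first value is the reversed university-degree odds ratio).
odds_ratios = {
    "Latin America": [2.50, 1.60, 1.43, 1.30, 1.90, 2.70, 4.50, 1.50, 1.30],
    "Africa":        [1.43, 1.50, 1.43, 0.90, 1.50, 0.70, 4.90, 1.40, 1.80],
    "South Asia":    [1.67, 1.00, 1.11, 1.30, 1.30, 1.00, 3.60, 1.20, 1.30],
    "Far East":      [1.00] * 9,
}

def s_scores(values):
    """Return the unit-weighted (plain mean) and median score.

    Higher values mean worse outcomes relative to the reference group.
    """
    return round(mean(values), 2), round(median(values), 2)

scores = {group: s_scores(values) for group, values in odds_ratios.items()}
for group, (mean_s, median_s) in scores.items():
    print(f"{group:14s} mean={mean_s:.2f} median={median_s:.2f}")
```

The median moves much less than the mean when the alcohol abuse indicator is included, which is why both are reported.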


It is interesting to see that the Africans did better than the Latin Americans. Perhaps there is something strange going on. Perhaps the Latin Americans are from countries with a high percentage of African admixture. Or perhaps it’s some kind of selection effect.

In their discussion they write:

There were considerable differences between adoptees from different geographical regions with better outcomes in many respects for children from the Far East, in this context mainly South Korea. Similar positive adjustment results concerning Asian adoptees have been presented previously. For instance, an excellent prognosis concerning adjustment and identity development in Chinese adoptees in Britain was described (Bagley, 1993). A Dutch group recently presented data about academic achievement and intelligence in 7-year-old children adopted in infancy (Stams, Juffer, Rispens, & Hoksbergen, 2000). The South Korean group had high IQs with 31% above a score of 120. Pre- and postnatal care before adoption seems to be particularly well organized in South Korea (Kim, 1995), which may be one important reason for the positive outcome. The differences among the geographic regions may also, however, be due to a large number of other factors such as differences in nutrition, motives behind the adoption, quality of care in the orphanage-foster home before the adoption, genetic dispositions, and Swedish prejudices against “foreign-looking” people. Another explanation may be a larger number of younger infants in the South Korean group. However, that is not possible to verify from our register data.

The usual cultural explanations.

I have also contacted the Danish statistics office to hear if they have Danish data.



Item-level data from Raven’s Standard Progressive Matrices were compiled for 12 diverse groups from previously published studies. The method of correlated vectors was used on every possible pair of groups with available data (45 comparisons). Depending on the exact method chosen, the mean MCV correlation was about .51 or .46, and only 2 or 1 of the 45 correlations were negative. Spearman’s hypothesis is confirmed for item-level data from the Standard Progressive Matrices.

Introduction and method

The method of correlated vectors (MCV) is a statistical method invented by Arthur Jensen (1998, p. 371, see also appendix B). Its purpose is to measure the degree to which a latent variable is responsible for an observed correlation between an aggregate measure and a criterion variable. Jensen had in mind the general factor of cognitive ability (the g factor) as measured by various IQ tests and their subtests, and criterion variables such as brain size. However, the method is applicable to any latent trait (e.g. the general socioeconomic factor; Kirkegaard, 2014a, b). When this method is applied to group differences, particularly ethnoracial ones, the prediction being tested is called Spearman’s hypothesis (SH) because Spearman was the first to note it in his 1927 book.

By now, several large studies and meta-analyses of MCV results for group differences have been published (te Nijenhuis et al, 2014, 2015a, 2015b; Jensen, 1985). These studies generally support the hypothesis. Almost all studies use subtest loadings instead of item loadings. This is probably because psychologists are reluctant to share their data (Wicherts et al, 2006) and as a result there are few open datasets available to use for this purpose. Furthermore, before the introduction of modern computers and the internet, it was impractical to share item-level data. There are advantages and disadvantages to using item-level data over subtest-level data. There are more items than subtests, which means that the vectors will be longer and thus sampling error will be smaller. On the other hand, items are less reliable and less pure measures of the g factor, which introduces both error and more non-g ability variance.

The recent study by te Nijenhuis et al (2015a), however, employed item-level data from Raven’s Standard Progressive Matrices (SPM) and included a diverse set of samples (Libyan, Russian, South African, Roma from Serbia, Moroccan and Spanish). The authors did not use their collected data to its full extent, presumably because they were comparing the groups (semi-)manually. Comparing all combinations in a dataset of e.g. 10 groups means that one has to do 45 comparisons (10*9/2). However, this task is easily handled with programming skills, and I thus saw an opportunity to gather more data regarding SH.
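As a small illustration of why this is easy to automate, the pairs can be enumerated directly. This is a Python sketch with placeholder group labels, not the code used for the analysis.

```python
from itertools import combinations

def all_pairs(groups):
    """Return every unordered pair of groups, i.e. n*(n-1)/2 comparisons."""
    return list(combinations(groups, 2))

# Hypothetical placeholder labels for 10 groups
groups = [f"G{i}" for i in range(1, 11)]
pairs = all_pairs(groups)
print(len(pairs))  # 45
```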

The authors did not provide the data in the paper despite it being easy to include it in tables. However, the data was available from the primary studies they cited in most cases. Thus, I collected the data from their data sources (it can be found in the supplementary material). This resulted in data from 12 samples of which 10 had both difficulty and item-whole correlations data. Table 1 gives an overview of the datasets:

Table 1 – Overview of samples
Short name | Race | Selection | N | Year | Ref | Country | Description
A1 | African | Undergraduates | 173 | 2000 | Rushton and Skuy 2000 | South Africa | University of the Witwatersrand and the Rand Afrikaans University in Johannesburg, South Africa
W1 | European | Undergraduates | 136 | 2000 | Rushton and Skuy 2000 | South Africa | University of the Witwatersrand and the Rand Afrikaans University in Johannesburg, South Africa
W2 | European | Std 7 classes | 1056 | 1992 | Owen 1992 | South Africa | 20 schools in the Pretoria-Witwatersrand-Vereeniging (PWV) area and 10 schools in the Cape Peninsula
C1 | Colored (African European) | Std 7 classes | 778 | 1992 | Owen 1992 | South Africa | 20 coloured schools in the Cape Peninsula
I1 | Indian | Std 7 classes | 1063 | 1992 | Owen 1992 | South Africa | 30 schools selected at random from the list of high schools in and around Durban
A2 | African | Std 7 classes | 1093 | 1992 | Owen 1992 | South Africa | Three schools in the PWV area and 25 schools in KwaZulu (Natal)
A3 | African | First year Engineering students | 198 | 2002 | Rushton et al 2002 | South Africa | First-year students from the Faculties of Engineering and the Built Environment at the University of the Witwatersrand
I2 | Indian | First year Engineering students | 58 | 2002 | Rushton et al 2002 | South Africa | First-year students from the Faculties of Engineering and the Built Environment at the University of the Witwatersrand
W3 | European | First year Engineering students | 86 | 2002 | Rushton et al 2002 | South Africa | First-year students from the Faculties of Engineering and the Built Environment at the University of the Witwatersrand
R1 | Roma | Adults ages 16 to 66 | 231 | 2004.5 | Rushton et al 2007 | Serbia | The communities (i.e., Drenovac, Mirijevo, and Rakovica) are in the vicinity of Belgrade
W4 | European | Adults ages 18 to 65 | 258 | 2012 | Diaz et al 2012 | Spain | Mainly from the city of Valencia
NA1 | North African | Adults ages 18 to 50 | 202 | 2012 | Diaz et al 2012 | Morocco | Casablanca, Marrakech, Meknes and Tangiers


Item-whole correlations and item loadings

The data in the papers usually did not contain the actual factor loadings of the items. Instead, they contained the item-whole correlations. The authors argue that one can use these because of the high correlation of unweighted means with extracted g-factors (often, r=.99, e.g. Kirkegaard, in review). Some studies did provide both loadings and item-whole correlations, yet the authors did not correlate them to see how good proxies the item-whole correlations are for the loadings. I calculated this for the 4 studies that included both metrics. Results are shown in Table 2.

Table 2 – Item-whole correlations x g-loadings in 4 studies.
W2 C1 I1 A2
W2 0.549 0.099 0.327 0.197
C1 0.695 0.900 0.843 0.920
I1 0.616 0.591 0.782 0.686
A2 0.626 0.882 0.799 0.981

Note: Within-sample correlations between item-whole correlations and item factor loadings are on the diagonal.

As can be seen, the item-whole correlations were not in all cases great proxies for the actual loadings.

To further test this idea, I calculated the item-whole correlations and the factor loadings (first factor, minimum residuals) in the open Wicherts dataset (N=500ish, Dutch university students, see Wicherts and Bakker 2012) tested on Raven’s Advanced Progressive Matrices. The correlation was .89. Thus, aside from the odd result in the W2 sample, item-whole correlations were a reasonable proxy for the factor loadings.
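The comparison can be illustrated on simulated data. The Python sketch below is not the analysis above: it simulates continuous (not dichotomous) item scores from assumed loadings, and it uses the first principal component of the correlation matrix (found by power iteration) as a stand-in for a minimum-residuals factor. All names and numbers are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def first_component_loadings(corr, iterations=200):
    """Loadings on the first principal component, via power iteration."""
    k = len(corr)
    v = [1.0] * k
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(a * a for a in w))
        v = [a / norm for a in w]
    # Rayleigh quotient gives the eigenvalue for the unit-length vector v
    eigenvalue = sum(v[i] * sum(corr[i][j] * v[j] for j in range(k))
                     for i in range(k))
    return [a * math.sqrt(eigenvalue) for a in v]

# Simulate item scores: item = loading * g + unique part (assumed model)
rng = random.Random(1)
true_loadings = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
data = []
for _ in range(1000):
    g = rng.gauss(0, 1)
    data.append([l * g + math.sqrt(1 - l * l) * rng.gauss(0, 1)
                 for l in true_loadings])

items = list(zip(*data))                 # one tuple of scores per item
totals = [sum(person) for person in data]
item_whole = [pearson(item, totals) for item in items]
corr = [[pearson(a, b) for b in items] for a in items]
loadings = first_component_loadings(corr)

r = pearson(item_whole, loadings)
print(round(r, 2))
```

With well-behaved simulated items like these the two vectors agree closely. Real items with ceiling effects, as in some of the samples here, need not behave as nicely.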

Item difficulties across samples

If two groups are tested on the same test and this test measures the same trait in both groups, then even if the groups have different mean trait levels, the order of difficulty of the items or subtests should be similar. Rushton et al (2000, 2002, 2007) have examined this in previous studies and found it generally to be the case. Table 3 below shows the cross-sample correlations of item difficulties.

Table 3 – Intercorrelations between item difficulties in 12 samples
A1 W1 W2 C1 I1 A2 A3 I2 W3 R1 NA1 W4
A1 1 0.88 0.98 0.96 0.99 0.86 0.96 0.89 0.79 0.89 0.95 0.93
W1 0.88 1 0.93 0.79 0.87 0.65 0.96 0.97 0.94 0.7 0.92 0.95
W2 0.98 0.93 1 0.95 0.98 0.82 0.97 0.92 0.84 0.86 0.96 0.95
C1 0.96 0.79 0.95 1 0.98 0.94 0.89 0.81 0.69 0.95 0.92 0.87
I1 0.99 0.87 0.98 0.98 1 0.88 0.95 0.88 0.79 0.91 0.95 0.92
A2 0.86 0.65 0.82 0.94 0.88 1 0.76 0.68 0.56 0.97 0.82 0.76
A3 0.96 0.96 0.97 0.89 0.95 0.76 1 0.96 0.9 0.8 0.95 0.96
I2 0.89 0.97 0.92 0.81 0.88 0.68 0.96 1 0.92 0.72 0.91 0.92
W3 0.79 0.94 0.84 0.69 0.79 0.56 0.9 0.92 1 0.6 0.88 0.91
R1 0.89 0.7 0.86 0.95 0.91 0.97 0.8 0.72 0.6 1 0.86 0.8
NA1 0.95 0.92 0.96 0.92 0.95 0.82 0.95 0.91 0.88 0.86 1 0.97
W4 0.93 0.95 0.95 0.87 0.92 0.76 0.96 0.92 0.91 0.8 0.97 1


The mean intercorrelation is .88. This is quite remarkable given the diversity of the samples.

Item-whole correlations across samples

Given the above, one might expect similar results for the item-whole correlations. This however is not so. Results are in Table 4.

Table 4 – Intercorrelations between item-whole correlations in 10 samples
A1 W1 W2 C1 I1 A2 A3 I2 W3 R1
A1 1 -0.2 0.59 0.58 0.73 0.54 0.27 0.04 -0.3 0.57
W1 -0.2 1 0.17 -0.59 -0.25 -0.68 0.42 0.51 0.55 -0.55
W2 0.59 0.17 1 0.44 0.79 0.29 0.61 0.25 0.02 0.39
C1 0.58 -0.59 0.44 1 0.79 0.94 0.01 -0.25 -0.49 0.78
I1 0.73 -0.25 0.79 0.79 1 0.69 0.42 0.09 -0.33 0.63
A2 0.54 -0.68 0.29 0.94 0.69 1 -0.13 -0.3 -0.52 0.77
A3 0.27 0.42 0.61 0.01 0.42 -0.13 1 0.26 0.37 0.02
I2 0.04 0.51 0.25 -0.25 0.09 -0.3 0.26 1 0.34 -0.21
W3 -0.3 0.55 0.02 -0.49 -0.33 -0.52 0.37 0.34 1 -0.49
R1 0.57 -0.55 0.39 0.78 0.63 0.77 0.02 -0.21 -0.49 1

Note: The last two samples, NA1 and W4, did not have item-whole correlation data.

The reason for this state of affairs is that the factor loadings change when the group mean trait level changes. For many samples, most of the items were too easy (passing rates at or very close to 100%). When there is no variation in a variable, one cannot calculate a correlation to some other variable. This means that for a substantial number of items for multiple samples, there was missing data for the items.

The lack of cross-sample consistency in item-whole correlations may also explain the weak MCV results in Diaz et al, 2012 since they used g-loadings from another study instead of from their own samples.

Spearman’s hypothesis using one static vector of estimated factor loadings

Some of the samples had rather small sample sizes (I2, N=58; W3, N=86). Thus one might get the idea to use the item-whole correlations from one or more of the large samples for comparisons involving other groups. In fact, given the instability of item-whole correlations across samples, as can be seen in Table 4, this is a bad idea. However, for the sake of completeness, I calculated the results based on this anyway. As the best estimate of factor loadings, I averaged the item-whole correlations from the four largest samples (W2, C1, I1 and A2).

Using this vector of item-whole correlations, I used MCV on every possible sample comparison. Because there were 12 samples, this number is 66. MCV analysis was done by subtracting the lower scoring sample’s item difficulties from the higher scoring sample’s thus producing a vector of the sample difference on each item. This vector I correlated with the vector of item-whole correlations.  The results are shown in Table 5.
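The MCV step described above can be sketched as follows. This is a Python illustration on made-up toy numbers, assuming item difficulties are expressed as pass rates; the function name `mcv` is hypothetical and this is not the R code from the supplementary material.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mcv(pass_rates_a, pass_rates_b, loadings):
    """Correlate the item-wise sample difference with the loading vector.

    The lower scoring sample (by mean pass rate) is subtracted from the
    higher scoring one, so the argument order does not matter.
    """
    mean_a = sum(pass_rates_a) / len(pass_rates_a)
    mean_b = sum(pass_rates_b) / len(pass_rates_b)
    if mean_a >= mean_b:
        high, low = pass_rates_a, pass_rates_b
    else:
        high, low = pass_rates_b, pass_rates_a
    differences = [h - l for h, l in zip(high, low)]
    return pearson(differences, loadings)

# Hypothetical toy data: the gap grows linearly with the item loading,
# so the MCV correlation is (up to rounding) perfect.
loadings = [0.2, 0.3, 0.4, 0.5, 0.6]
low_sample = [0.90, 0.80, 0.70, 0.60, 0.50]
high_sample = [p + 0.5 * g for p, g in zip(low_sample, loadings)]

print(round(mcv(low_sample, high_sample, loadings), 3))
```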

Table 5 – MCV correlations of group differences across 12 samples using 1 static item-whole correlations
A1 W1 W2 C1 I1 A2 A3 I2 W3 R1 NA1 W4
A1 NA -0.15 0.2 0.42 0.1 0.83 -0.03 -0.12 -0.26 0.8 -0.35 -0.32
W1 -0.15 NA -0.31 0.07 -0.14 0.47 -0.29 -0.19 -0.4 0.4 0.31 0.06
W2 0.2 -0.31 NA 0.56 0.4 0.86 -0.27 -0.28 -0.38 0.83 -0.35 -0.46
C1 0.42 0.07 0.56 NA 0.53 0.88 0.23 0.11 -0.06 0.64 -0.23 -0.05
I1 0.1 -0.14 0.4 0.53 NA 0.88 -0.02 -0.1 -0.24 0.83 -0.45 -0.29
A2 0.83 0.47 0.86 0.88 0.88 NA 0.66 0.52 0.32 0.2 0.42 0.4
A3 -0.03 -0.29 -0.27 0.23 -0.02 0.66 NA -0.17 -0.41 0.61 0.43 -0.43
I2 -0.12 -0.19 -0.28 0.11 -0.1 0.52 -0.17 NA -0.37 0.46 0.33 -0.25
W3 -0.26 -0.4 -0.38 -0.06 -0.24 0.32 -0.41 -0.37 NA 0.23 -0.05 -0.11
R1 0.8 0.4 0.83 0.64 0.83 0.2 0.61 0.46 0.23 NA 0.36 0.32
NA1 -0.35 0.31 -0.35 -0.23 -0.45 0.42 0.43 0.33 -0.05 0.36 NA -0.03
W4 -0.32 0.06 -0.46 -0.05 -0.29 0.4 -0.43 -0.25 -0.11 0.32 -0.03 NA


As one can see, the results are all over the place. The mean MCV correlation is .12.

Spearman’s hypothesis using a variable vector of estimated factor loadings

Since item-whole correlations varied from sample to sample, another idea is to use the item-whole correlations of the two samples being compared. I used the unweighted mean of the two samples’ item-whole correlations for each item (te Nijenhuis et al used a weighted mean). In some cases, only one sample has an item-whole correlation for an item (because the other sample had no variance on the item, i.e. 100% got it right). In these cases, one can choose to use the value from the remaining sample, or one can ignore the item and calculate MCV based on the remaining items. I calculated results using both methods; they are shown in Table 6 and 7.
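The two ways of handling missing item-whole correlations can be sketched like this. It is a Python illustration with made-up values; `None` marks an item with no variance in a sample, and the function and parameter names are hypothetical.

```python
def pair_loadings(loadings_a, loadings_b, use_remaining):
    """Build the comparison-specific loading vector for two samples.

    Items with data in both samples get the unweighted mean. If only one
    sample has a value, it is either used as-is (use_remaining=True) or
    the item is dropped (None) so that it is skipped in the MCV step.
    """
    combined = []
    for a, b in zip(loadings_a, loadings_b):
        if a is not None and b is not None:
            combined.append((a + b) / 2)
        elif use_remaining and (a is not None or b is not None):
            combined.append(a if a is not None else b)
        else:
            combined.append(None)
    return combined

# Hypothetical item-whole correlations for four items in two samples
sample_a = [0.25, None, 0.50, None]
sample_b = [0.50, 0.25, None, None]

print(pair_loadings(sample_a, sample_b, use_remaining=True))
print(pair_loadings(sample_a, sample_b, use_remaining=False))
```

Items left as `None` would then be removed from both the loading vector and the difference vector before correlating them.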

Table 6 – MCV correlations of group differences across 10 samples using variable item-whole correlations, method 1
A1 W1 W2 C1 I1 A2 A3 I2 W3 R1
A1 NA 0.79 0.39 0.29 0.05 0.7 0.48 0.41 0.37 0.71
W1 0.79 NA 0.8 0.5 0.76 0.51 0.79 0.4 0.6 0.54
W2 0.39 0.8 NA 0.68 0.63 0.85 0.43 0.47 0.5 0.79
C1 0.29 0.5 0.68 NA 0.52 0.88 0.32 0.3 -0.09 0.67
I1 0.05 0.76 0.63 0.52 NA 0.84 0.47 0.4 0.31 0.79
A2 0.7 0.51 0.85 0.88 0.84 NA 0.57 0.43 -0.03 0.22
A3 0.48 0.79 0.43 0.32 0.47 0.57 NA 0.38 0.66 0.64
I2 0.41 0.4 0.47 0.3 0.4 0.43 0.38 NA 0.6 0.44
W3 0.37 0.6 0.5 -0.09 0.31 -0.03 0.66 0.6 NA 0.2
R1 0.71 0.54 0.79 0.67 0.79 0.22 0.64 0.44 0.2 NA


Table 7 – MCV correlations of group differences across 10 samples using variable item-whole correlations, method 2
A1 W1 W2 C1 I1 A2 A3 I2 W3 R1
A1 NA 0.42 0.4 0.33 0.06 0.72 0.48 0.3 0.15 0.74
W1 0.42 NA 0.72 0.14 0.52 0.18 0.7 0.44 0.65 0.35
W2 0.4 0.72 NA 0.68 0.63 0.85 0.44 0.53 0.5 0.79
C1 0.33 0.14 0.68 NA 0.52 0.88 0.39 0.19 -0.08 0.67
I1 0.06 0.52 0.63 0.52 NA 0.84 0.51 0.4 0.28 0.79
A2 0.72 0.18 0.85 0.88 0.84 NA 0.62 0.3 0.02 0.22
A3 0.48 0.7 0.44 0.39 0.51 0.62 NA 0.42 0.55 0.67
I2 0.3 0.44 0.53 0.19 0.4 0.3 0.42 NA 0.58 0.35
W3 0.15 0.65 0.5 -0.08 0.28 0.02 0.55 0.58 NA 0.08
R1 0.74 0.35 0.79 0.67 0.79 0.22 0.67 0.35 0.08 NA


Nearly all results are positive using either method. The results are slightly stronger when ignoring items where both samples do not have item-whole correlation data. A better way to visualize the results is to use a histogram with a superimposed density curve, as shown in Figure 1 and 2.

Figure 1 – Histogram of MCV results using method 1

Figure 2 – Histogram of MCV results using method 2

Note: The vertical line shows the mean value.

The mean/median result for method 1 was .51/.50, and .46/.48 for method 2. Almost all MCV results were positive: only 2 of 45 were negative for method 1, and 1 of 45 for method 2.
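These summary numbers can be checked against the rounded values in the method 1 table above. The Python sketch below types in the upper triangle of that table by hand, so it is only as good as the two-decimal rounding there.

```python
from statistics import mean, median

# Upper triangle of the method 1 MCV table (rows A1 ... W3), as printed above
method1 = [
    0.79, 0.39, 0.29, 0.05, 0.70, 0.48, 0.41, 0.37, 0.71,  # A1
    0.80, 0.50, 0.76, 0.51, 0.79, 0.40, 0.60, 0.54,        # W1
    0.68, 0.63, 0.85, 0.43, 0.47, 0.50, 0.79,              # W2
    0.52, 0.88, 0.32, 0.30, -0.09, 0.67,                   # C1
    0.84, 0.47, 0.40, 0.31, 0.79,                          # I1
    0.57, 0.43, -0.03, 0.22,                               # A2
    0.38, 0.66, 0.64,                                      # A3
    0.60, 0.44,                                            # I2
    0.20,                                                  # W3
]

n_negative = sum(1 for r in method1 if r < 0)
print(len(method1), round(mean(method1), 2), median(method1), n_negative)
```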

Mean MCV value by sample and moderator analysis

It is interesting to examine the mean MCV value by sample. They are shown in Table 8.

Table 8 – MCV correlation means, SDs, and medians by sample
Sample mean SD median
A1 0.46 0.24 0.41
W1 0.63 0.15 0.60
W2 0.62 0.18 0.63
C1 0.45 0.29 0.50
I1 0.53 0.26 0.52
A2 0.55 0.31 0.57
A3 0.53 0.15 0.48
I2 0.43 0.08 0.41
W3 0.35 0.28 0.37
R1 0.56 0.23 0.64


There is no obvious racial pattern. Instead, one might expect the relatively lower result of some samples to be due to sampling error. MCV is extra sensitive to sampling error. If so, the mean correlation should be higher for the larger samples. To see if this was the case, I calculated the rank-order correlation between sample size and sample mean MCV, r=.45/.65 using method 1 or 2 respectively. Rank-order was used because the effect of sample size on sampling error is non-linear. Figure 3 shows the scatter plot of this.
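A rank-order (Spearman) correlation of this kind can be sketched in plain Python as below. The sample sizes come from Table 1 and the means from the by-sample table above. Because those means are rounded to two decimals (which creates one tie), the result is close to, but not exactly, the reported value.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Order: A1, W1, W2, C1, I1, A2, A3, I2, W3, R1
sample_n = [173, 136, 1056, 778, 1063, 1093, 198, 58, 86, 231]
mean_mcv = [0.46, 0.63, 0.62, 0.45, 0.53, 0.55, 0.53, 0.43, 0.35, 0.56]

rho = spearman(sample_n, mean_mcv)
print(round(rho, 2))
```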


Figure 3 – Sample size as a moderator variable at the sample mean-level


One can also examine sample size as a moderating variable at the comparison level. This increases the number of datapoints to 45. I used the harmonic mean of the two samples’ sizes as the sample size metric. Figure 4 shows a scatter plot of this.


Figure 4 – Sample size as a moderator variable at the comparison-level


The rank-order correlations are .45/.44 using method 1/2 data. We can see in the plot that the results from the 6 largest comparisons (harmonic mean sample size>800) range from .52 to .88 with a mean of .74/.73 and SD of .15 using method 1/2 results. For the smaller studies (harmonic mean sample size<800), the results range from -.09/-.08 to .89/.79 with a mean of .48/.43 and SD of .22/.23 using method 1/2 results. The results from the smaller studies vary more, as expected with their higher sampling error.
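The comparison-level sample size metric is easy to reproduce from Table 1. A minimal Python sketch using the standard library’s `harmonic_mean`; it recovers the 45 comparisons and the 6 with a harmonic mean above 800.

```python
from itertools import combinations
from statistics import harmonic_mean

# Sample sizes from Table 1 for the 10 samples with item-whole correlations
sample_sizes = {"A1": 173, "W1": 136, "W2": 1056, "C1": 778, "I1": 1063,
                "A2": 1093, "A3": 198, "I2": 58, "W3": 86, "R1": 231}

# Harmonic mean of the two sample sizes for every pair of samples
pair_sizes = {(a, b): harmonic_mean([sample_sizes[a], sample_sizes[b]])
              for a, b in combinations(sample_sizes, 2)}

large = sorted(pair for pair, n in pair_sizes.items() if n > 800)
print(len(pair_sizes), len(large))
print(large)
```

The harmonic mean is pulled toward the smaller of the two samples, so a comparison only counts as large when both samples are large.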

I also examined the size of the group difference as a moderator variable. I computed this as the difference between the groups’ mean item difficulties. However, it had a near-zero relationship to the MCV results (rank-order r=.03, method 1 data).

Discussion and conclusion

Spearman’s hypothesis has been decisively confirmed using item-level data from Raven’s Standard Progressive Matrices. The analysis presented here can easily be extended to cover more datasets, as well as item-level data from other IQ tests. Researchers should compile such data into open datasets so they can be used for future studies.

It is interesting to note the consistency of results within and across samples that differ in race. Race differences in general intelligence as measured by the SPM appear to be just like those within races.

Supplementary material

The R code and dataset are available at the Open Science Framework repository.


  • Díaz, A., Sellami, K., Infanzón, E., Lanzón, T., & Lynn, R. (2012). A comparative study of general intelligence in Spanish and Moroccan samples. Spanish Journal of Psychology, 15(2), 526-532.
  • Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger.
  • Jensen, A. R. (1985). The nature of the black–white difference on various psychometric tests: Spearman’s hypothesis. Behavioral and Brain Sciences, 8(02), 193-219.
  • Kirkegaard, E. O. (2014a). The international general socioeconomic factor: Factor analyzing international rankings. Open Differential Psychology.
  • Kirkegaard, E. O. (2014b). Crime, income, educational attainment and employment among immigrant groups in Norway and Finland. Open Differential Psychology.
  • Kirkegaard, E. O. W. (in review). Examining the ICAR and CRT tests in a Danish student sample. Open Differential Psychology.
  • Owen, K. (1992). The suitability of Raven’s Standard Progressive Matrices for various groups in South Africa. Personality and Individual Differences, 13(2), 149-159.
  • Rushton, J. P., Čvorović, J., & Bons, T. A. (2007). General mental ability in South Asians: Data from three Roma (Gypsy) Communities in Serbia. Intelligence, 35, 1-12.
  • Rushton, J. P., Skuy, M., & Fridjhon, P. (2002). Jensen effects among African, Indian, and White engineering students in South Africa on Raven’s Standard Progressive Matrices. Intelligence, 30, 409-423.
  • Rushton, J. P., & Skuy, M. (2000). Performance on Raven’s Matrices by African and White university students in South Africa. Intelligence, 28, 251-265.
  • Spearman, C. (1927). The abilities of man.
  • te Nijenhuis, J., Al-Shahomee, A. A., van den Hoek, M., Grigoriev, A., & Repko, J. (2015a). Spearman’s hypothesis tested comparing Libyan adults with various other groups of adults on the items of the Standard Progressive Matrices. Intelligence, 50, 114-117.
  • te Nijenhuis, J., David, H., Metzen, D., & Armstrong, E. L. (2014). Spearman’s hypothesis tested on European Jews vs non-Jewish Whites and vs Oriental Jews: Two meta-analyses. Intelligence, 44, 15-18.
  • te Nijenhuis, J., van den Hoek, M., & Armstrong, E. L. (2015b). Spearman’s hypothesis and Amerindians: A meta-analysis. Intelligence, 50, 87-92.
  • Wicherts, J. M., Borsboom, D., Kats, J., & Molenaar, D. (2006). The poor availability of psychological research data for reanalysis. American Psychologist, 61(7), 726.
  • Wicherts, J. M., & Bakker, M. (2012). Publish (your data) or (let the data) perish! Why not publish your data too?. Intelligence, 40(2), 73-76.



“Research uncovers flawed IQ scoring system” is the headline on a news site that often posts about research from other fields. It concerns a study by Harrison et al (2015). The researchers have allegedly “uncovered anomalies and issues with the Wechsler Adult Intelligence Scale-Fourth Edition (WAIS-IV), one of the most widely used intelligence tests in the world”. An important discovery, if true. Let’s hear it from the lead researcher:

“Looking at the normal distribution of scores, you’d expect that only about five per cent of the population should get an IQ score of 75 or less,” says Dr. Harrison. “However, while this was true when we scored their tests using the American norms, our findings showed that 21 per cent of college and university students in our sample had an IQ score this low when Canadian norms were used for scoring.”

How can it be? To learn more, we delve into the actual paper titled: Implications for Educational Classification and Psychological Diagnoses Using the Wechsler Adult Intelligence Scale–Fourth Edition With Canadian Versus American Norms.

The paper

First they summarize a few earlier studies on Canada and the US. The Canadians obtained higher raw scores. Of course, this was hypothesized to be due to differences in ethnicity and educational achievement factors. However, this did not quite work out, so Harrison et al decided to investigate it more (they had already done so in 2014). Their method consists of taking the scores from a large mixed sample consisting of healthy people — i.e. with no diagnosis, 11% — and people with various mental disorders (e.g. 53.5% with ADHD), and then scoring this group on both the American and the Canadian norms. What did they find?

Blast! The results were similar to the results from the previous standardization studies! What happened? To find out, Harrison et al do a thorough examination of various subgroups in various ways. No matter which age group they compare, the result won’t go away. They also report the means and Cohen’s d for each subtest and aggregate measure — very helpful. I reproduce their Table 1 below:

Score M (US) SD (US) M (CAN) SD (CAN) p d r
FSIQ 95.5 12.9 88.1 14.4 <.001 0.54 0.99
GAI 98.9 14.7 92.5 16 <.001 0.42 0.99
Index Scores
Verbal Comprehension 97.9 15.1 91.8 16.3 <.001 0.39 0.99
Perceptual Reasoning 99.9 14.1 94.5 15.9 <.001 0.36 0.99
Working Memory 90.5 12.8 83.5 13.8 <.001 0.53 0.99
Processing Speed 95.2 12.9 90.4 14.1 <.001 0.36 0.99
Subtest Scores
Verbal Subtests
Vocabulary 9.9 3.1 8.7 3.3 <.001 0.37 0.99
Similarities 9.7 3 8.5 3.3 <.001 0.38 0.98
Information 9.2 3.1 8.5 3.3 <.001 0.22 0.99
Arithmetic 8.2 2.7 7.4 2.7 <.001 0.3 0.99
Digit Span 8.4 2.5 7.1 2.7 <.001 0.5 0.98
Performance Subtests
Block Design 9.8 3 8.9 3.2 <.001 0.29 0.99
Matrix Reasoning 9.8 2.9 9.1 3.2 <.001 0.23 0.99
Visual Puzzles 10.5 2.9 9.4 3.1 <.001 0.37 0.99
Symbol Search 9.3 2.8 8.5 3 <.001 0.28 0.99
Coding 8.9 2.5 8.2 2.6 <.001 0.27 0.98


Sure enough, the scores are lower using the Canadian norms. And very ‘significant’ too. A mystery.

Next, they go on to note how this sometimes changes the classification of individuals into 7 arbitrarily chosen intervals of IQ scores, and how this differs between subtests. They spend a lot of e-ink noting percents about this or that classification. For instance:

“Of interest was the percentage of individuals who would be classified as having a FSIQ below the 10th percentile or who would fall within the IQ range required for diagnosis of ID (e.g., 70 ± 5) when both normative systems were applied to the same raw scores. Using American norms, 13.1% had an IQ of 80 or less, and 4.2% had an IQ of 75 or less. By contrast, when using Canadian norms, 32.3% had an IQ of 80 or less, and 21.2% had an IQ of 75 or less.”

I wonder if some coherent explanation can be found for all these results. In their discussion they ask:

“How is it that selecting Canadian over American norms so markedly lowers the standard scores generated from the identical raw scores? One possible explanation is that more extreme scores occur because the Canadian normative sample is smaller than the American (cf. Kahneman, 2011).”

If the reader was unsure, yes, this is Kahneman’s 2011 book about cognitive biases and dual process theory.

They have more suggestions about the reason:

“One cannot explain this difference simply by saying it is due to the mature students in the sample who completed academic upgrading, as the score differences were most prominent in the youngest cohorts. It is difficult to explain these findings simply as a function of disability status, as all participants were deemed otherwise qualified by these postsecondary institutions (i.e., they had met normal academic requirements for entry into regular postsecondary programs). Furthermore, in Ontario, a diagnosis of LD is given only to students with otherwise normal thinking and reasoning skills, and so students with such a priori diagnosis would have had otherwise average full scale or general abilities scores when tested previously. Performance exaggeration seems an unlikely cause for the findings, as the students’ scores declined only when Canadian norms were applied. Finally, although no one would argue that a subset of disabled students might be functioning below average, it is difficult to believe that almost half of these postsecondary students would fall in this IQ range given that they had graduated from high school with marks high enough to qualify for acceptance into bona fide postsecondary programs. Whatever the cause, our data suggest that one must question both the representativeness of the Canadian normative sample in the younger age ranges and the accuracy of the scores derived when these norms are applied.”

And finally they conclude with a recommendation not to use the Canadian norms for Canadians because this results in lower IQs:

Overall, our findings suggest a need to examine more carefully the accuracy and applicability of the WAIS-IV Canadian norms when interpreting raw test data obtained from Canadian adults. Using these norms appears to increase the number of young adults identified as intellectually impaired and could decrease the number who qualify for gifted programming or a diagnosis of LD. Until more research is conducted, we strongly recommend that clinicians not use Canadian norms to determine intellectual impairment or disability status. Converting raw scores into Canadian standard scores, as opposed to using American norms, systematically lowers the scores of postsecondary students below the age of 35, as the drop in FSIQ was higher for this group than for older adults. Although we cannot know which derived scores most accurately reflect the intellectual abilities of young Canadian adults, it certainly seems implausible that almost half of postsecondary students have FSIQ scores below the 16th percentile, calling into question the accuracy of all other derived WAIS-IV Canadian scores in the classification of cognitive abilities.

Are you still wondering what is going on?

Populations with different mean IQs and cut-offs

Harrison et al seem to have inadvertently almost rediscovered the fact that Canadians are smarter than Americans. They don’t quite make it to this point even when faced with obvious and strong evidence (multiple standardization samples). They somehow don’t realize that using the norms from these standardization samples will reproduce the differences found in those samples, and won’t really report anything new.

Their numerous differences in percents reaching this or that cut-off are largely or entirely explained by simple statistics. They have two populations which have an IQ difference of 7.4 points (95.5 – 88.1 from Table 1) or 8.1 points (15 * .54 d from Table 1). Now, if we plot these (I used a difference of 7.5 IQ) and choose some arbitrary cut-offs, like those between arbitrarily chosen intervals, we see something like this:


Except that I cheated and chose all the cut-offs. The brown and the green lines are the ratios between the densities (read off the second y-axis). We see that around 100, they are generally low, but as we get further from the means, they get a lot larger. This simple fact is not generally appreciated. It’s not a new problem: Arthur Jensen spent much of a chapter in his behemoth 1980 book on the topic. He quotes for instance:

“In the construction trades, new apprentices were 87 percent white and 13 percent black. [Blacks constitute 12 percent of the U.S. population.] For the Federal Civil Service, of those employees above the GS-5 level, 88.5 percent were white, 8.3 percent black, and women account for 30.1 of all civil servants. Finally, a 1969 survey of college teaching positions showed whites with 96.3 percent of all positions. Blacks had 2.2 percent, and women accounted for 19.1 percent. (U.S. Commission on Civil Rights, 1973)”

Sounds familiar? Razib Khan has also written about it. Now, let’s go back to one of the quotes:

“Using American norms, 13.1% had an IQ of 80 or less, and 4.2% had an IQ of 75 or less. By contrast, when using Canadian norms, 32.3% had an IQ of 80 or less, and 21.2% had an IQ of 75 or less. Most notably, only 0.7% (2 individuals) obtained a FSIQ of 70 or less using American norms, whereas 9.7% had IQ scores this low when Canadian norms were used. At the other end of the spectrum, 1.4% of the students had FSIQ scores of 130 or more (gifted) when American norms were used, whereas only 0.3% were this high using Canadian norms.”

We can put these in a table and calculate the ratios:

IQ threshold   Percent US   Percent CAN   US/CAN   CAN/US
130            1.4          0.3           4.67     0.21
80             13.1         32.3          0.41     2.47
75             4.2          21.2          0.20     5.05
70             0.7          9.7           0.07     13.86


And we can also calculate the expected values based on the two populations (with means of 95.5 and 88) above:

IQ threshold   Percent US   Percent CAN   US/CAN   CAN/US
130            1.07         0.26          4.12     0.24
80             15.07        29.69         0.51     1.97
75             8.59         19.31         0.44     2.25
70             4.46         11.51         0.39     2.58


This is fairly close, right? The only outlier is the much lower than expected value for <70 IQ using US norms (0.7% observed vs. 4.46% expected), perhaps a sampling error. But overall, this is a pretty good fit to the data. Perhaps we have our explanation.
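The expected percentages can be reproduced in a few lines of R. This is a minimal sketch, assuming both groups are normally distributed with an SD of 15 and the means given above (95.5 and 88); the function name `tail_pct` and the column names are my own choices for illustration.

```r
# Percent of a normal population at or beyond an IQ cut-off.
# Cut-offs above 100 use the upper tail (e.g. gifted, >= 130);
# cut-offs below 100 use the lower tail (e.g. <= 70).
tail_pct <- function(cutoff, mean, sd = 15) {
  lower <- pnorm(cutoff, mean = mean, sd = sd)
  100 * ifelse(cutoff > 100, 1 - lower, lower)
}

cutoffs <- c(130, 80, 75, 70)
expected <- data.frame(
  cutoff  = cutoffs,
  pct_us  = tail_pct(cutoffs, mean = 95.5),  # sample mean under US norms
  pct_can = tail_pct(cutoffs, mean = 88)     # sample mean under Canadian norms
)
expected$us_can <- expected$pct_us / expected$pct_can
expected$can_us <- expected$pct_can / expected$pct_us
print(round(expected, 2))
```

The ratios here are computed from unrounded percentages, so they can differ slightly from ratios computed from the rounded table entries.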

What about those (mis)classification values in their Table 2? Well, for similar reasons that I won’t explain in detail, these are simply a function of the difference between the groups in that variable, e.g. Cohen’s d. In fact, if we correlate the d vector and the “% within same classification” we get a correlation of -.95 (-.96 using rank-orders).

MCV analysis

Incidentally, the d values reported in their Table 1 are useful for using the method of correlated vectors. In a previous study comparing US and Canadian IQ data, Dutton and Lynn (2014) compared WAIS-IV standardization data. They found a mean difference of .31 d, or 4.65 IQ, which was reduced to 2.1 IQ if the samples were matched on education, ethnicity and sex. An interesting thing was that the difference between the countries was largest on the most g-loading subtests. When this happens, it is called a Jensen effect (or that it has a positive Jensen coefficient, Kirkegaard 2014). The value in their study was .83, which is on the high side (see e.g. te Nijenhuis et al, 2015).

I used the same loadings as used in their study (McFarland, 2013), and found a correlation of .24 (.35 with rank-order), substantially weaker.
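For readers unfamiliar with the method, the computation itself is just a correlation between two vectors across subtests. The sketch below shows the mechanics only: the loadings and d values are simulated placeholders, not the McFarland loadings or the Harrison et al. values, and the function name `mcv` is hypothetical.

```r
# Method of correlated vectors: correlate subtest g-loadings with the
# subtest group differences (d). All values here are SIMULATED placeholders.
set.seed(1)
n_subtests <- 10
g_loading <- runif(n_subtests, min = 0.4, max = 0.8)
d_value   <- 0.5 * g_loading + rnorm(n_subtests, sd = 0.1)

mcv <- function(loadings, d) {
  c(pearson  = cor(loadings, d, method = "pearson"),
    spearman = cor(loadings, d, method = "spearman"))  # rank-order version
}

result <- mcv(g_loading, d_value)
print(round(result, 2))
```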

Supplementary material

The R code and data files can be found in the Open Science Framework repository.


  • Harrison, A. G., Holmes, A., Silvestri, R., Armstrong, I. T. (2015). Implications for Educational Classification and Psychological Diagnoses Using the Wechsler Adult Intelligence Scale–Fourth Edition With Canadian Versus American Norms. Journal of Psychoeducational Assessment. 1-13.
  • Jensen, A. R. (1980). Bias in Mental Testing.
  • Kirkegaard, E. O. (2014). The personal Jensen coefficient does not predict grades beyond its association with g. Open Differential Psychology.
  • McFarland, D. (2013). Model individual subtests of the WAIS IV with multiple latent factors. PLoSONE. 8(9): e74980. doi:10.1371/journal.pone.0074980
  • te Nijenhuis, J., van den Hoek, M., & Armstrong, E. L. (2015). Spearman’s hypothesis and Amerindians: A meta-analysis. Intelligence, 50, 87-92.

In my previous post, I reviewed data about education and beliefs about nuclear energy/power. In this post, I report some results for intelligence measured by wordsum. The wordsum is a 10 word vocabulary test widely given in US surveys. It correlates .71 with a full-scale IQ test, so it’s a decent proxy for group data, although not that good at the individual level. It is short and innocent enough that it can be given as part of larger surveys and has strong enough psychometric properties that it is useful to researchers. Win-win.

The data

The data come from the ANES 2012 survey, a large survey given to about 6000 US citizens. The datafile has 2240 variables of various kinds.

Method and results

The analysis was done in R.

The datafile does not come with a summed wordsum score, so one must do this oneself, which involves recoding the 10 variables. In doing so I noticed that one of the words (#5, labeled “g”) has a very strong distractor: 43% pick option 5 (the correct one), 49% pick option 2. This must clearly be either a very popular confusion (which makes it not a confusion at all because if a word is commonly used to mean x, that is what it means in addition to other meanings if any), or an ambiguous word, or two reasonable answers. Clearly, something is strange. For this reason, I downloaded the questionnaire itself. However, the word options have been redacted!


I don’t know if this is for copyright reasons or test-integrity reasons. The first would be very silly, since it is merely 10 lists of 1+6 words, hardly copyrightable. The second is more open to discussion. How many people actually check the questionnaire so that they can get the word quiz correct? Can’t be that many, but perhaps a few. Anyway, if the alternative is to change the words every year to avoid possible cheating, this is the better option. Changing words would mean that between-year results were not so easily comparable, even if care is taken in choosing similar words. Still, for the above reasons, I decided to analyze the test with and without the odd item.

First I recoded the data. Any response other than the right one was coded as 0, and the right one as 1. For the policy and science questions, I recoded the missing/soft refusal and (hard?) refusal into NA (missing data).
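The item recoding can also be written compactly with an answer key. This is a minimal sketch rather than the code I actually ran: the key values are the correct options used in the R code at the end of this post, while the helper name `score_items` and the two toy respondents are made up for illustration.

```r
# Answer key: option number of the correct response for each item
# (the same values used in the recoding at the end of this post).
key <- c(wordsum_setb = 5, wordsum_setd = 3, wordsum_sete = 1,
         wordsum_setf = 3, wordsum_setg = 5, wordsum_seth = 4,
         wordsum_setj = 1, wordsum_setk = 1, wordsum_setl = 4,
         wordsum_seto = 2)

# Score raw responses: 1 = correct option, 0 = anything else.
score_items <- function(responses, key) {
  scored <- lapply(names(key), function(item) {
    as.integer(responses[[item]] == key[[item]])
  })
  names(scored) <- names(key)
  as.data.frame(scored)
}

# Toy data (NOT ANES data): respondent 1 answers everything correctly,
# respondent 2 has a negative nonresponse code on every item.
toy <- data.frame(matrix(c(key, rep(-9, length(key))), nrow = 2, byrow = TRUE))
names(toy) <- names(key)

scored <- score_items(toy, key)
print(rowSums(scored))  # number correct per respondent
```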

Second, for each case, I summed the number of correct responses. Histograms:

[Figures: histograms of the wordsum9 and wordsum10 scores]

The test shows a very strong ceiling effect. They need to add more difficult words. They can do this without ruining comparability with older datasets, since one can just exclude the newer harder words in the analysis if one wants to do longitudinal comparisons.

Third, I examined internal reliability using alpha() from psych. This revealed that the average inter-item correlation would increase if the odd item was dropped, tho the effect was small: .25 to .27.
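For those who want to see what is being computed, the average inter-item correlation and Cronbach's alpha can be calculated in base R. This is a sketch on simulated 0/1 items driven by one latent ability, not the ANES data, and the function names are my own; alpha() from psych reports these and more.

```r
# Mean of the off-diagonal inter-item correlations.
mean_interitem_r <- function(items) {
  r <- cor(items)
  mean(r[lower.tri(r)])
}

# Cronbach's alpha from item variances and the variance of the sum score.
cronbach_alpha <- function(items) {
  k <- ncol(items)
  item_var <- apply(items, 2, var)
  total_var <- var(rowSums(items))
  (k / (k - 1)) * (1 - sum(item_var) / total_var)
}

# SIMULATED data: 10 binary items, easier items have lower difficulty.
set.seed(42)
n <- 2000
ability <- rnorm(n)
difficulty <- seq(-2, 0.5, length.out = 10)
items <- sapply(difficulty, function(b) as.integer(ability + rnorm(n) > b))

print(mean_interitem_r(items))
print(cronbach_alpha(items))
# Drop one item and compare, as was done for the odd item above
print(mean_interitem_r(items[, -5]))
```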

Fourth, I used both standard and item response theory factor analysis on 10 and 9 item datasets and extracted 1 factor. Examining factor loadings showed that the odd item had the lowest loading using both methods (classic: median=.51, item=.37; IRT: median=.68, item=.44).

Fifth, I correlated the scores from all of these methods to each other: classical FA on 10 and 9 items, IRT FA on 10 and 9 items, unit-weighted sum of 10 and 9 items.


Surprisingly, the IRT vs. non-IRT scores did not correlate that strongly, only around .90 (range .87 to .94). The others usually correlated near 1 (range .96 to .99). The scores with and without the odd item correlated strongly tho, so for the rest of the analyses I used the 10-item version. Due to the rather ‘low’ correlations between the estimates, I retained both the IRT and classic FA scores for further analysis.

Sixth, I converted both these factor scores to standard IQ format (mean=100, sd=15). Since the output scores of IRT FA were not standardized (mean 0, sd 1), I standardized them first using scale().
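The conversion to the IQ metric is a linear rescaling. A minimal sketch, with a hypothetical helper `to_iq` and made-up input scores:

```r
# Convert arbitrary scores to the IQ metric (mean 100, sd 15).
to_iq <- function(scores) {
  z <- as.numeric(scale(scores))  # center to mean 0, rescale to sd 1
  z * 15 + 100
}

raw <- c(-1.2, -0.3, 0.1, 0.4, 2.0)  # made-up factor scores
iq <- to_iq(raw)
print(round(iq, 1))
```

Note that scale() only shifts and stretches the scores. It does not change the shape of the distribution, so a skewed set of scores stays skewed.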

Seventh, I plotted the answer categories with their mean IQ and 99% confidence intervals. For each plot, I plotted both the IQ estimates. Classical FA scores in black, IRT scores in red.
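The numbers behind such a plot are group means with confidence intervals. Below is a sketch of that calculation in base R, assuming t-based intervals; the data are simulated and the helper name `mean_ci` is hypothetical.

```r
# Mean and t-based confidence interval for one group.
mean_ci <- function(x, level = 0.99) {
  x <- x[!is.na(x)]
  n <- length(x)
  se <- sd(x) / sqrt(n)
  half <- qt(1 - (1 - level) / 2, df = n - 1) * se
  c(mean = mean(x), lower = mean(x) - half, upper = mean(x) + half, n = n)
}

# SIMULATED data (not ANES): IQ scores and an answer category.
set.seed(7)
toy <- data.frame(
  iq = rnorm(300, mean = 100, sd = 15),
  answer = factor(sample(c("More", "Fewer", "Same"), 300, replace = TRUE))
)

# One row per answer category: mean, 99% CI bounds and group size.
ci_table <- t(sapply(split(toy$iq, toy$answer), mean_ci))
print(round(ci_table, 1))
```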



We see as expected that nuclear proponents (people who think we should have more) have higher IQ than those who think we should have fewer or the same. Note that this is a non-linear result. The group who thinks we should have fewer has a slightly higher IQ than those who favor the status quo, altho this difference is pretty slight. If it is real, it is perhaps because it is easier to favor the status quo than to have an opinion on which way society should change, in line with cognitive miser theory (Renfrow and Howard, 2013).

The result for the question of whether global warming is happening or not is interesting. There is almost no effect of intelligence on getting such an obvious question right.

The next two questions were also given to people who said that global warming was not happening. When they were, the questions were prefixed with “Assuming it’s happening”. Here we see a strong negative effect on thinking that the consequences will be good, as well as weak effects on the cause of global warming.


Renfrow, D. G., & Howard, J. A. (2013). Social psychology of gender and race. In Handbook of Social Psychology (pp. 491-531). Springer Netherlands.

R code and data

Because the data are not distributed in a convenient format, I have uploaded a CSV version here: anes_timeseries_2012

##Nuclear, WORDSUM and Educational attainment

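#load packages
library(psych)  #alpha(), irt.fa(), score.irt(), fa()
library(plyr)   #mapvalues()
library(gplots) #plotmeans()

#load data
d = read.csv("anes_timeseries_2012.csv")
wordsum = subset(d, select=c(wordsum_setb,wordsum_setd,wordsum_sete,wordsum_setf,
                             wordsum_setg,wordsum_seth,wordsum_setj,wordsum_setk,
                             wordsum_setl,wordsum_seto)) #the 10 items recoded below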

wordsum$wordsum_setb[wordsum$wordsum_setb != 5] = 0 #everything else than to 0
wordsum$wordsum_setb[wordsum$wordsum_setb == 5] = 1 #correct to 1
table(wordsum$wordsum_setb) #verify
wordsum$wordsum_setd[wordsum$wordsum_setd != 3] = 0 #everything else than to 0
wordsum$wordsum_setd[wordsum$wordsum_setd == 3] = 1 #correct to 1
table(wordsum$wordsum_setd) #verify
wordsum$wordsum_sete[wordsum$wordsum_sete != 1] = 0 #everything else to 0; the correct option is already coded 1
table(wordsum$wordsum_sete) #verify
wordsum$wordsum_setf[wordsum$wordsum_setf != 3] = 0 #everything else than to 0
wordsum$wordsum_setf[wordsum$wordsum_setf == 3] = 1 #correct to 1
table(wordsum$wordsum_setf) #verify
wordsum$wordsum_setg[wordsum$wordsum_setg != 5] = 0 #everything else than to 0
wordsum$wordsum_setg[wordsum$wordsum_setg == 5] = 1 #correct to 1
table(wordsum$wordsum_setg) #verify
wordsum$wordsum_seth[wordsum$wordsum_seth != 4] = 0 #everything else than to 0
wordsum$wordsum_seth[wordsum$wordsum_seth == 4] = 1 #correct to 1
table(wordsum$wordsum_seth) #verify
wordsum$wordsum_setj[wordsum$wordsum_setj != 1] = 0 #everything else than to 0
wordsum$wordsum_setj[wordsum$wordsum_setj == 1] = 1 #correct to 1
table(wordsum$wordsum_setj) #verify
wordsum$wordsum_setk[wordsum$wordsum_setk != 1] = 0 #everything else than to 0
wordsum$wordsum_setk[wordsum$wordsum_setk == 1] = 1 #correct to 1
table(wordsum$wordsum_setk) #verify
wordsum$wordsum_setl[wordsum$wordsum_setl != 4] = 0 #everything else than to 0
wordsum$wordsum_setl[wordsum$wordsum_setl == 4] = 1 #correct to 1
table(wordsum$wordsum_setl) #verify
wordsum$wordsum_seto[wordsum$wordsum_seto != 2] = 0 #everything else than to 0
wordsum$wordsum_seto[wordsum$wordsum_seto == 2] = 1 #correct to 1
table(wordsum$wordsum_seto) #verify
d$envir_nuke = mapvalues(d$envir_nuke, c(-9,-8),c(NA,NA)) #remove nonresponse
d$envir_gwarm = mapvalues(d$envir_gwarm, c(-9,-8),c(NA,NA)) #remove nonresponse
d$envir_gwhow = mapvalues(d$envir_gwhow, c(-9,-8),c(NA,NA)) #remove nonresponse
d$envir_gwgood = mapvalues(d$envir_gwgood, c(-9,-8),c(NA,NA)) #remove nonresponse

wordsum10 = data.frame(wordsum10=apply(wordsum,1,sum)) #sum all
wordsum9 = data.frame(wordsum9=apply(wordsum[-5],1,sum)) #sum without the potentially broken item

hist(unlist(wordsum10),freq = FALSE,xlab="Wordsum score",main="Histogram of wordsum10")
hist(unlist(wordsum9),freq = FALSE,xlab="Wordsum score",main="Histogram of wordsum9")

alpha(wordsum) #reliabilities
alpha(wordsum[-5]) #minus g
round(cor(wordsum),2) #cors

wordsum10.fa = irt.fa(wordsum) #all items IRT FA
wordsum9.fa = irt.fa(wordsum[-5]) #minus g item
wordsum10.1 = score.irt(wordsum10.fa$fa,wordsum)
wordsum9.1 = score.irt(wordsum9.fa$fa,wordsum[-5])

wordsum10.fa.n = fa(wordsum)
wordsum10.n.1 = wordsum10.fa.n$scores
wordsum9.fa.n = fa(wordsum[-5])
wordsum9.n.1 = wordsum9.fa.n$scores


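#reconstructed from the column names below; assumes the first column of the
#score.irt() output holds the theta estimates
factor.scores = cbind(wordsum10.n.1, wordsum9.n.1,
                      wordsum10.1[,1], wordsum9.1[,1],
                      wordsum10, wordsum9)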
colnames(factor.scores) = c("fa.10","fa.9","fa.irt.10","fa.irt.9","sum10","sum9")

#policy questions
policy = subset(d, select=c(envir_nuke, envir_gwarm, envir_gwgood, envir_gwhow)) #fetch
policy$GI1 = factor.scores$fa.10*15+100 #first estimate
policy$GI2 = scale(factor.scores$fa.irt.10)*15+100 #second

#policy and mean GI
plotmeans(GI1 ~ envir_nuke, policy, p=.99, legends=c("More","Fewer","Same"),
          main="Should US have more or fewer nuclear power plants",
          xlab="", ylab="Wordsum IQ")
plotmeans(GI2 ~ envir_nuke, policy, p=.99, legends=NULL,xaxt="n",yaxt="n",xlab="",ylab="",col = "red")

#Global warming
plotmeans(GI1 ~ envir_gwarm, policy, p=.99, legends=c("Has probably been happening","Probably hasn't been happening"),
          main="Is global warming happening or not",
          xlab="", ylab="Wordsum IQ")
plotmeans(GI2 ~ envir_gwarm, policy, p=.99, legends=NULL,xaxt="n",yaxt="n",xlab="",ylab="",col = "red")

#Rising temperatures good or bad
plotmeans(GI1 ~ envir_gwgood, policy, p=.99, legends=c("Good","Bad","Neither good nor bad"),
          main="Rising temperatures good or bad",
          xlab="", ylab="Wordsum IQ")
plotmeans(GI2 ~ envir_gwgood, policy, p=.99, legends=NULL,xaxt="n",yaxt="n",xlab="",ylab="",col = "red")

#Cause of global warming (text.splitter() below is a helper not defined in this listing)
plotmeans(GI1 ~ envir_gwhow, policy, p=.99, legends=NA,
          main="Anthropogenic climate change",
          xlab="", ylab="Wordsum IQ")
mtext(c("Mostly by human activity",
        "Mostly by natural causes",
        text.splitter("About equally by human activity and natural causes",25)),
        side=1, at=c(1:3), cex=.8, padj=1)
plotmeans(GI2 ~ envir_gwhow, policy, p=.99, legends=NULL,xaxt="n",yaxt="n",xlab="",ylab="",col = "red")

I had heard good things about this book, sort of. It has been cited a lot. Enough that I would be willing to read it, given that the author has written at least one interesting paper (Political Diversity Will Improve Social Psychological Science). Generally, it is written in popsci style, with very few statistics, making it impossible to easily judge how much certainty to assign to different studies mentioned in the text. Generally, I was not impressed and did not learn much, tho not all was necessarily bad. Clearly, he wrote this book in an attempt to appeal to many different people. Perhaps he succeeded, but appeals that work well on large parts of the population rarely work well on me.

In any case, there are some parts worth quoting and commenting on:

The results were as clear as could be in support of Shweder. First, all four of my Philadelphia groups confirmed Turiel’s finding that Americans make a big distinction between moral and conventional violations. I used two stories taken directly from Turiel’s research: a girl pushes a boy off a swing (that’s a clear moral violation) and a boy refuses to wear a school uniform (that’s a conventional violation). This validated my methods. It meant that any differences I found on the harmless taboo stories could not be attributed to some quirk about the way I phrased the probe questions or trained my interviewers. The upper-class Brazilians looked just like the Americans on these stories. But the working-class Brazilian kids usually thought that it was wrong, and universally wrong, to break the social convention and not wear the uniform. In Recife in particular, the working-class kids judged the uniform rebel in exactly the same way they judged the swing-pusher. This pattern supported Shweder: the size of the moral-conventional distinction varied across cultural groups.

Emil’s law: Whenever a study reports that socioeconomic status correlates with X, it is mostly due to its relationship to intelligence, and often socioeconomic status is non-causally related to X.

Wilson used ethics to illustrate his point. He was a professor at Harvard, along with Lawrence Kohlberg and the philosopher John Rawls, so he was well acquainted with their brand of rationalist theorizing about rights and justice.15 It seemed clear to Wilson that what the rationalists were really doing was generating clever justifications for moral intuitions that were best explained by evolution. Do people believe in human rights because such rights actually exist, like mathematical truths, sitting on a cosmic shelf next to the Pythagorean theorem just waiting to be discovered by Platonic reasoners? Or do people feel revulsion and sympathy when they read accounts of torture, and then invent a story about universal rights to help justify their feelings?
Wilson sided with Hume. He charged that what moral philosophers were really doing was fabricating justifications after “consulting the emotive centers” of their own brains.16 He predicted that the study of ethics would soon be taken out of the hands of philosophers and “biologicized,” or made to fit with the emerging science of human nature. Such a linkage of philosophy, biology, and evolution would be an example of the “new synthesis” that Wilson dreamed of, and that he later referred to as consilience—the “jumping together” of ideas to create a unified body of knowledge.17
Prophets challenge the status quo, often earning the hatred of those in power. Wilson therefore deserves to be called a prophet of moral psychology. He was harassed and excoriated in print and in public.18 He was called a fascist, which justified (for some) the charge that he was a racist, which justified (for some) the attempt to stop him from speaking in public. Protesters who tried to disrupt one of his scientific talks rushed the stage and chanted, “Racist Wilson, you can’t hide, we charge you with genocide.”19

For more on the history of sociobiology, see:

But yes, human rights bug me. There is no such thing as an ethical right ‘out there’. Human rights are completely made up. While some of them are useful as a model for civil rights, they are nothing more. Worse, human rights keep getting added which are inconsistent, vague and redundant. See e.g.

I say this as someone who strongly believes in having strong civil rights, especially regarding freedom of expression, assembly, due process and the like. However, since pushing for new human rights attracts social justice warriors, this of course means that the new rights not only conflict with previous rights (e.g. freedom of expression), but also obviously concern matters that should be a matter of national policy (e.g. resource redistribution), not super-national courts and their creeping interpretations. See e.g. this document: or even worse, this one: A EUROPEAN FRAMEWORK NATIONAL STATUTE FOR THE PROMOTION OF TOLERANCE. The irony is that tolerance consists exactly in letting others do what they want: Wiktionary: The ability or practice of tolerating; an acceptance or patience with the beliefs, opinions or practices of others; a lack of bigotry.

Psychopaths do have some emotions. When Hare asked one man if he ever felt his heart pound or stomach churn, he responded: “Of course! I’m not a robot. I really get pumped up when I have sex or when I get into a fight.”29 But psychopaths don’t show emotions that indicate that they care about other people. Psychopaths seem to live in a world of objects, some of which happen to walk around on two legs. One psychopath told Hare about a murder he committed while burglarizing an elderly man’s home:
I was rummaging around when this old geezer comes down stairs and … uh … he starts yelling and having a fucking fit … so I pop him one in the, uh, head and he still doesn’t shut up. So I give him a chop to the throat and he … like … staggers back and falls on the floor. He’s gurgling and making sounds like a stuck pig! [laughs] and he’s really getting on my fucking nerves so I … uh … boot him a few times in the head. That shut him up … I’m pretty tired by now so I grab a few beers from the fridge and turn on the TV and fall asleep. The cops woke me up [laughs].30


This is the sort of bad thinking that a good education should correct, right? Well, consider the findings of another eminent reasoning researcher, David Perkins.21 Perkins brought people of various ages and education levels into the lab and asked them to think about social issues, such as whether giving schools more money would improve the quality of teaching and learning. He first asked subjects to write down their initial judgment. Then he asked them to think about the issue and write down all the reasons they could think of—on either side—that were relevant to reaching a final answer. After they were done, Perkins scored each reason subjects wrote as either a “my-side” argument or an “other-side” argument.
Not surprisingly, people came up with many more “my-side” arguments than “other-side” arguments. Also not surprisingly, the more education subjects had, the more reasons they came up with. But when Perkins compared fourth-year students in high school, college, or graduate school to first-year students in those same schools, he found barely any improvement within each school. Rather, the high school students who generate a lot of arguments are the ones who are more likely to go on to college, and the college students who generate a lot of arguments are the ones who are more likely to go on to graduate school. Schools don’t teach people to reason thoroughly; they select the applicants with higher IQs, and people with higher IQs are able to generate more reasons.
The findings get more disturbing. Perkins found that IQ was by far the biggest predictor of how well people argued, but it predicted only the number of my-side arguments. Smart people make really good lawyers and press secretaries, but they are no better than others at finding reasons on the other side. Perkins concluded that “people invest their IQ in buttressing their own case rather than in exploring the entire issue more fully and evenhandedly.”22

Cite is: Perkins, D. N., M. Farady, and B. Bushey. 1991. “Everyday Reasoning and the Roots of Intelligence.” In Informal Reasoning and Education, ed. J. F. Voss, D. N. Perkins, and J. W. Segal, 83–105. Hillsdale, NJ: Lawrence Erlbaum.

From Plato through Kant and Kohlberg, many rationalists have asserted that the ability to reason well about ethical issues causes good behavior. They believe that reasoning is the royal road to moral truth, and they believe that people who reason well are more likely to act morally.
But if that were the case, then moral philosophers—who reason about ethical principles all day long—should be more virtuous than other people. Are they? The philosopher Eric Schwitzgebel tried to find out. He used surveys and more surreptitious methods to measure how often moral philosophers give to charity, vote, call their mothers, donate blood, donate organs, clean up after themselves at philosophy conferences, and respond to emails purportedly from students.48 And in none of these ways are moral philosophers better than other philosophers or professors in other fields.
Schwitzgebel even scrounged up the missing-book lists from dozens of libraries and found that academic books on ethics, which are presumably borrowed mostly by ethicists, are more likely to be stolen or just never returned than books in other areas of philosophy.49 In other words, expertise in moral reasoning does not seem to improve moral behavior, and it might even make it worse (perhaps by making the rider more skilled at post hoc justification). Schwitzgebel still has yet to find a single measure on which moral philosophers behave better than other philosophers.

Oh dear.

The anthropologists Pete Richerson and Rob Boyd have argued that cultural innovations (such as spears, cooking techniques, and religions) evolve in much the same way that biological innovations evolve, and the two streams of evolution are so intertwined that you can’t study one without studying both.65 For example, one of the best-understood cases of gene-culture coevolution occurred among the first people who domesticated cattle. In humans, as in all other mammals, the ability to digest lactose (the sugar in milk) is lost during childhood. The gene that makes lactase (the enzyme that breaks down lactose) shuts off after a few years of service, because mammals don’t drink milk after they are weaned. But those first cattle keepers, in northern Europe and in a few parts of Africa, had a vast new supply of fresh milk, which could be given to their children but not to adults. Any individual whose mutated genes delayed the shutdown of lactase production had an advantage. Over time, such people left more milk-drinking descendants than did their lactose-intolerant cousins. (The gene itself has been identified.)66 Genetic changes then drove cultural innovations as well: groups with the new lactase gene then kept even larger herds, and found more ways to use and process milk, such as turning it into cheese. These cultural innovations then drove further genetic changes, and on and on it went.

Why is this anyway? Why don’t we just keep expressing this gene? Is there any reason?

In an interview in 2000, the paleontologist Stephen Jay Gould said that “natural selection has almost become irrelevant in human evolution” because cultural change works “orders of magnitude” faster than genetic change. He next asserted that “there’s been no biological change in humans in 40,000 or 50,000 years. Everything we call culture and civilization we’ve built with the same body and brain.”77

I wonder, was Gould right about anything? Another thing. Did Gould invent his Punctuated Equilibrium theory because it postulates these change-free periods which can be conveniently claimed for humans the last 100k years or so in order to keep his denial of racial differences consistent with evolution or?

Religion is therefore well suited to be the handmaiden of groupishness, tribalism, and nationalism. To take one example, religion does not seem to be the cause of suicide bombing. According to Robert Pape, who has created a database of every suicide terrorist attack in the last hundred years, suicide bombing is a nationalist response to military occupation by a culturally alien democratic power.62 It’s a response to boots and tanks on the ground—never to bombs dropped from the air. It’s a response to contamination of the sacred homeland. (Imagine a fist punched into a beehive, and left in for a long time.)

This sounds interesting.

The problem is not just limited to politicians. Technology and changing residential patterns have allowed each of us to isolate ourselves within cocoons of like-minded individuals. In 1976, only 27 percent of Americans lived in “landslide counties”—counties that voted either Democratic or Republican by a margin of 20 percent or more. But the number has risen steadily; in 2008, 48 percent of Americans lived in a landslide county.77 Our counties and towns are becoming increasingly segregated into “lifestyle enclaves,” in which ways of voting, eating, working, and worshipping are increasingly aligned. If you find yourself in a Whole Foods store, there’s an 89 percent chance that the county surrounding you voted for Barack Obama. If you want to find Republicans, go to a county that contains a Cracker Barrel restaurant (62 percent of these counties went for McCain).78

This sounds more like assortative relocation + greater amount of relocation. Now, if only there were more local democratic power, then people living in these different areas could self-govern and stop arguing about conflicts that would never arise. E.g. healthcare systems: each smaller area could decide on its own system. I like to quote from Uncontrolled:

This leads then to a call for “states as laboratories of democracy” federalism in matters of social policy, or in a more formal sense, a call for subsidiarity—the principle that matters ought to be handled by the smallest competent authority. After all, the typical American lives in a state that is a huge political entity governing millions of people. As many decisions as possible ought to be made by counties, towns, neighborhoods, and families (in which parents have significant coercive rights over children). In this way, not only can different preferences be met, but we can learn from experience how various social arrangements perform.

Nuclear energy often gets bad press. However, journalists are mostly very leftist, scientifically ill-educated women, so perhaps they are not quite the right demographic to tell us about this issue. So far I have not found any published studies about the relationship between attitudes towards nuclear energy and general intelligence; however, there is good reason to believe it is positive. The EU is nice enough to survey the opinions of EU citizens on nuclear energy every few years, and they include measurement of various variables, but not general intelligence or general science knowledge.

Three surveys of European attitudes towards nuclear power and correlates


Generally, these find:

  1. Men are more positive about nuclear power
  2. The better educated are more positive about nuclear power
  3. The better educated are less in doubt about nuclear power
  4. Education has inconsistent relationship to opposition to nuclear power
  5. Self-rated knowledge is related to more positive attitude to nuclear power
  6. Those with more experience about nuclear power are more positive about nuclear power

(4) might seem inconsistent with (1-3), but it is not. Usually the questions have three broad categories: positive, negative, don’t know. When the education level increases, negative stays about the same (sometimes increases, sometimes decreases), while don’t know always decreases and positive nearly always increases. Thus, the simplest explanation for this is that higher education moves people from the don’t know category to the positive category, while it has no clear effect on those who are negative about nuclear power.

Survey in the United Kingdom

Most of the questions were too specific to be of use. However, one was useful: it is post-Fukushima and still found the usual results:


What do experts think?

I found an old review (1984) of expert opinion. They write:

In contrast to the public, most “opinion leaders,” particularly energy experts, support further development of nuclear power. This support is revealed both in opinion polls and in technical studies of the risks of nuclear power. A March 1982 poll of Congress found 76 percent of members supported expanded use of nuclear power (50). In a survey conducted for Connecticut Mutual Life Insurance Co. in 1980, leaders in religion, business, the military, government, science, education, and law perceived the benefits of nuclear power as greater than the risks (19). Among the categories of leaders surveyed, scientists were particularly supportive of nuclear power. Seventy-four percent of scientists viewed the benefits of nuclear power as greater than risks, compared with only 55 percent of the rest of the public.

In a recent study, a random sample of scientists was asked about nuclear power (62). Of those polled, 53 percent said development should proceed rapidly, 36 percent said development should proceed slowly, and 10 percent would halt development or dismantle plants. When a second group of scientists with particular expertise in energy issues was given the same choices, 70 percent favored proceeding rapidly and 25 percent favored proceeding slowly with the technology. This second sample included approximately equal numbers of scientists from 71 disciplines, ranging from air pollution to energy policy to thermodynamics. About 10 percent of those polled in this group worked in disciplines directly related to nuclear energy, so that the results might be somewhat biased. Support among both groups of scientists was found to result from concern about the energy crisis and the belief that nuclear power can make a major contribution to national energy needs over the next 20 years. Like scientists, a majority of engineers continued to support nuclear power after the accident at Three Mile Island (69).

Of course, not all opinion leaders are in favor of the current U.S. program of nuclear development. Leaders of the environmental movement have played a major role in the debate about reactor safety and prominent scientists are found on both sides of the debate. A few critics of nuclear power have come from the NRC and the nuclear industry, including three nuclear engineers who left General Electric in order to demonstrate their concerns about safety in 1976. However, the majority of those with the greatest expertise in nuclear energy support its further development.

Analysis of public opinion polls indicates that people’s acceptance or rejection of nuclear power is more influenced by their view of reactor safety than by any other issue (57). As discussed above, accidents and events at operating plants have greatly increased public concern about the possibility of a catastrophic accident. Partially in response to that concern, technical experts have conducted a number of studies of the likelihood and consequences of such an accident. However, rather than reassuring the public about nuclear safety, these studies appear to have had the opposite effect. By painting a picture of the possible consequences of an accident, the studies have contributed to people’s view of the technology as exceptionally risky, and the debate within the scientific community about the study methodologies and findings has increased public uncertainty.

And recently, much publicity was given to a study showing the discrepancy between public opinion and scientific opinion on various topics, and it included nuclear power:


A 20 percentage point difference is no small matter and is similar to the older studies described above.

General intelligence and nuclear power

The OKCupid dataset contains only one question related to nuclear power among the first 2400 questions or so (those in the dataset). The question is the 2216th most commonly answered question, that is, not very commonly answered at all. Since people who answer >2000 questions on a dating site are a very self-selected group, there is likely to be some heavy range restriction.


Aside from the “I don’t know”-category, the differences are quite small:

[1] "Question ID: q59519"
[1] "How do you feel about nuclear energy?"
[1] "I'm not sure, there are pros and cons."
[1] "n's =" "1015" 
[1] "means =" "2.3"    
[1] "I don't care, whatever keeps my light bulbs lit."
[1] "n's =" "52"   
[1] "means =" "1.37"   
[1] "No.  It is a danger to public safety."
[1] "n's =" "335"  
[1] "means =" "2.31"   
[1] "Yes.  It is efficient, safe, and clean."
[1] "n's =" "591"  
[1] "means =" "2.49"

However, the samples are quite big. The 99% confidence interval for the difference in means between the con and pro groups is -0.323 to -0.037, which is getting close to 0, so we need more data to be quite certain, but it’s unlikely to have happened by chance [P(D|H0)=very low]. How large is the difference in some more useful unit? It’s .18 points on a 3-point scale (3 IQ-type questions). The overall SD in the total dataset for this 3-point scale is .96 (mean=2.12, N=28k). However, for this question, it is only .88 due to range restriction (mean=2.34, N=1993). In other words, in SD units it is .18/.88=0.20, which is 3 IQ points, not correcting for anything. Perhaps after corrections this would be something like 5 IQ points between pro- and con-nuclear power people in this dataset.
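The unit conversion above can be retraced in a few lines of R. This is only a sketch of the arithmetic using the means and the restricted SD quoted in the text; the variable names are mine, and the factor of 15 is the conventional SD of the IQ scale.

```r
# Group means on the 3-item scale, taken from the output above
mean_pro <- 2.49   # "Yes. It is efficient, safe, and clean."
mean_con <- 2.31   # "No. It is a danger to public safety."

# SD among people who answered this question (range restricted)
sd_restricted <- 0.88

raw_diff  <- mean_pro - mean_con       # difference in raw points
d         <- raw_diff / sd_restricted  # difference in SD units
iq_points <- d * 15                    # IQ scale has SD = 15 by convention

print(round(c(raw_diff = raw_diff, d = d, iq_points = iq_points), 2))
```

No correction for range restriction or unreliability is applied here, which matches the "not correcting for anything" figure in the text.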

Future research

I’d like more data with measures of scientific knowledge, probabilistic reasoning or some such, together with opinions on various energy policies, including nuclear. Of course, there should also be general intelligence data.


Europeans and Nuclear Safety – 2010 – PDF

Europeans and Nuclear Safety – 2007 – PDF

Attitudes towards radioactive waste (EU) – 2008 – PDF

Public Attitudes to Nuclear Power (OECD summary) – 2010 – PDF

New Nuclear Watch Europe – Nuclear Energy Survey (United Kingdom) – 2014 – PDF

Public Attitudes Toward Nuclear Power – 1984 – PDF (part of Nuclear Power in an Age of Uncertainty. Washington, D. C.: U.S. Congress, Office of Technology Assessment, OTA-E-216, February 1984).

In reply to and his working paper here:

We are discussing his working paper over email, and I had some reservations about his factor analysis. I decided to run the analyses I wanted myself, but it turned into a longer project which should be placed in a short paper instead of in a private email.

I fetched the data from his source. The raw data did not have variable names, so it was unwieldy to work with. I opened the SPSS file, which did have variable names, and exported a CSV with the desired variables (see supp. material). Then I had to recode the variables so that true answers are coded as 1, false answers as 0, and missing as NA. This took some time. I followed his coding procedure for most cases (see his Stata file and my R code below).
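To make the recoding rule concrete, here is a minimal base R sketch. The helper `recode_item` and the toy vector `raw` are hypothetical and only for illustration; the actual recoding uses `mapvalues` in the R code at the end of the post. The sketch assumes, as that code does for most items, that code 9 is treated as missing and that codes 7 and 8 are scored as incorrect.

```r
# Hypothetical helper: score one item as 1 (correct), 0 (incorrect) or NA (missing).
# `correct` is the raw code of the right answer; `missing_codes` become NA.
recode_item <- function(x, correct, missing_codes = 9) {
  out <- as.integer(x == correct)   # 1 if the right answer was given, else 0
  out[x %in% missing_codes] <- NA   # explicit missing codes become NA
  out
}

raw <- c(1, 2, 7, 8, 9, 1)          # toy raw responses, not real data

print(recode_item(raw, correct = 1))  # item where code 1 is the right answer
print(recode_item(raw, correct = 2))  # "reverse" item where code 2 is right
```

Items with more response options (such as suntime or whytest in the code below) need their own mapping, so a single helper like this would not cover every variable.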

How many factors to extract

It seems that he relies on some kind of method for determining the number of factors to extract, presumably Eigenvalue>1. I always use three different methods using the nFactors package. Using all 22 variables (note that he did not use all of them at once), all methods agreed to extract 5 factors (at most). Here are the factor solutions for extracting 1 thru 5 factors and their intercorrelations:

Factor analyses with 1-5 factors and their correlations

[1] "Factor analysis, extracting 1 factors using oblimin and MinRes"

smokheal  0.129
condrift  0.347
rmanmade  0.445
earthhot  0.348
oxyplant  0.189
lasers    0.514
atomsize  0.441
antibiot  0.401
dinosaur  0.323
light     0.384
earthsun  0.515
suntime   0.581
dadgene   0.227
getdrug   0.290
whytest   0.423
probno4   0.396
problast  0.423
probreq   0.349
probif3   0.416
evolved   0.306
bigbang   0.315
onfaith  -0.296

SS loadings    3.191
Proportion Var 0.145
[1] "Factor analysis, extracting 2 factors using oblimin and MinRes"

         MR1    MR2   
smokheal  0.121       
condrift  0.345       
rmanmade  0.368  0.136
earthhot  0.363       
oxyplant  0.172       
lasers    0.518       
atomsize  0.461       
antibiot  0.323  0.133
dinosaur  0.323       
light     0.375       
earthsun  0.587       
suntime   0.658       
dadgene   0.145  0.130
getdrug   0.211  0.130
whytest   0.386       
probno4          0.705
problast         0.789
probreq   0.162  0.305
probif3   0.108  0.514
evolved   0.348       
bigbang   0.367       
onfaith  -0.266       

                 MR1   MR2
SS loadings    2.617 1.569
Proportion Var 0.119 0.071
Cumulative Var 0.119 0.190
     MR1  MR2
MR1 1.00 0.35
MR2 0.35 1.00
[1] "Factor analysis, extracting 3 factors using oblimin and MinRes"

         MR2    MR1    MR3   
condrift                0.346
rmanmade  0.173  0.170  0.232
earthhot         0.187  0.220
oxyplant                0.100
lasers           0.256  0.320
atomsize         0.208  0.312
antibiot  0.168  0.150  0.198
dinosaur         0.119  0.250
light            0.240  0.169
earthsun         0.737       
suntime          0.754       
dadgene   0.147              
getdrug   0.152         0.149
whytest   0.108  0.143  0.294
probno4   0.708              
problast  0.781              
probreq   0.324              
probif3   0.532              
evolved                 0.562
bigbang                 0.525
onfaith                -0.307

                 MR2   MR1   MR3
SS loadings    1.646 1.444 1.389
Proportion Var 0.075 0.066 0.063
Cumulative Var 0.075 0.140 0.204
     MR2  MR1  MR3
MR2 1.00 0.29 0.25
MR1 0.29 1.00 0.43
MR3 0.25 0.43 1.00
[1] "Factor analysis, extracting 4 factors using oblimin and MinRes"

         MR4    MR2    MR1    MR3   
condrift  0.180                0.234
rmanmade  0.387                     
earthhot  0.262         0.102       
oxyplant  0.116                     
lasers    0.490                     
atomsize  0.435                     
antibiot  0.485                     
dinosaur  0.312                     
light     0.274         0.142       
earthsun                0.797       
suntime                 0.719       
dadgene   0.234                     
getdrug   0.273                     
whytest   0.438                     
probno4          0.695              
problast         0.817              
probreq   0.180  0.275              
probif3   0.139  0.487              
evolved                        0.685
bigbang                        0.554
onfaith  -0.141               -0.230

                 MR4   MR2   MR1   MR3
SS loadings    1.511 1.501 1.204 0.915
Proportion Var 0.069 0.068 0.055 0.042
Cumulative Var 0.069 0.137 0.192 0.233
     MR4  MR2  MR1  MR3
MR4 1.00 0.39 0.57 0.42
MR2 0.39 1.00 0.23 0.12
MR1 0.57 0.23 1.00 0.27
MR3 0.42 0.12 0.27 1.00
[1] "Factor analysis, extracting 5 factors using oblimin and MinRes"

         MR2    MR1    MR3    MR5    MR4   
condrift                0.209         0.299
rmanmade  0.104                0.120  0.379
earthhot                              0.367
oxyplant                              0.220
lasers                         0.195  0.361
atomsize                       0.273  0.207
antibiot                       0.401  0.108
dinosaur                       0.204  0.131
light                                 0.423
earthsun         0.504                0.186
suntime          1.007                     
dadgene                        0.277       
getdrug                        0.373       
whytest                        0.504       
probno4   0.701                            
problast  0.816                            
probreq   0.272                0.174       
probif3   0.487                0.107       
evolved                 0.753              
bigbang                 0.483         0.165
onfaith                -0.225 -0.152       

                 MR2   MR1   MR3   MR5   MR4
SS loadings    1.501 1.291 0.919 0.874 0.871
Proportion Var 0.068 0.059 0.042 0.040 0.040
Cumulative Var 0.068 0.127 0.169 0.208 0.248
     MR2  MR1  MR3  MR5  MR4
MR2 1.00 0.20 0.11 0.38 0.28
MR1 0.20 1.00 0.21 0.41 0.44
MR3 0.11 0.21 1.00 0.32 0.30
MR5 0.38 0.41 0.32 1.00 0.50
MR4 0.28 0.44 0.30 0.50 1.00


We see that in the 1-factor solution, all variables load in the expected direction, and we can speak of a general scientific knowledge factor. This is the one we want to use for other analyses. We see that onfaith loads negatively. This variable is not a true/false question, and thus should be excluded from any actual measurement of the general scientific knowledge factor.

Increasing the number of factors to extract simply divides this general factor into correlated parts. E.g. in the 2-factor solution, we see a probability factor that correlates .35 with the remaining semi-general factor. In solution 3, we see MR2 as the probability factor, MR3 as the knowledge related to religious beliefs factor and MR1 as the remaining items. Intercorrelations are .29, .25 and .43. This pattern continues until the 5th solution which still produces 5 correlated factors: MR2 is the probability factor, MR1 is an astronomy factor, MR3 is the one having to do with religious beliefs, MR5 looks like a medicine/genetics factor, and MR4 is the rest.

Just because scree tests etc. tell you to extract >1 factor does not mean that there is no general factor. This is the old fallacy made in the study of cognitive ability. See the discussion in Jensen (1998, chapter 3). It is sometimes still made, e.g. by Hampshire et al. (2012). Generally, as one increases the number of variables, the suggested number of factors to extract goes up. This does not mean that there is no general factor, just that with an increasing number of variables, one can see a more fine-grained structure in the data than one can with only e.g. 5 variables.
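A small simulation can illustrate the point. This is a sketch on made-up data, not a reanalysis of the science items: I generate 12 hypothetical items that all depend on one general factor plus one of three group factors, and then apply the Eigenvalue>1 rule using base R only. The loadings of 0.5 and the sample size are arbitrary choices.

```r
set.seed(1)
n <- 5000                                    # simulated persons
g <- rnorm(n)                                # general factor
group_of_item <- rep(1:3, each = 4)          # 3 group factors, 4 items each
group_scores  <- matrix(rnorm(n * 3), n, 3)  # group factor scores

# Each item = general part + group part + unique noise (total variance 1)
items <- sapply(seq_along(group_of_item), function(j) {
  0.5 * g + 0.5 * group_scores[, group_of_item[j]] + sqrt(0.5) * rnorm(n)
})

eig <- eigen(cor(items))
n_kaiser <- sum(eig$values > 1)              # Eigenvalue>1 rule

# Loadings on the first principal component (sign is arbitrary, so orient it)
first_loadings <- eig$vectors[, 1] * sqrt(eig$values[1])
if (sum(first_loadings) < 0) first_loadings <- -first_loadings

print(n_kaiser)
print(round(first_loadings, 2))
```

In this setup the rule suggests more than one factor, yet every item still loads positively on the first component, because the general factor was built into the data. The number of factors a criterion suggests says nothing by itself about whether a general factor exists.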

Should we use the religious-tinged items or not?

Before discussing whether one should theoretically use them or not, one can measure whether it makes much of a difference. One can do this by extracting the general factor with and without the items in question. I did this, also excluding the onfaith item. Then I correlated the scores from these two analyses: r=.992. In other words, it hardly matters whether one includes these religious-tinged items or not. The general factor is measured quite well already without them and they do not substantially change the factor scores. However, since adding more indicator items/variables generally reduces measurement error of a latent trait/factor, I would include them in my analyses.
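One rough way to see why more items help is the Spearman-Brown formula, which gives the reliability of a sum score of k parallel items with average inter-item correlation r as k·r / (1 + (k-1)·r). The sketch below is an illustration only: the items here are not strictly parallel, factor scores are not sum scores, and the value r = 0.15 is a hypothetical number I picked, not one estimated from the data.

```r
# Spearman-Brown: reliability of a k-item composite of parallel items
spearman_brown <- function(k, r) {
  (k * r) / (1 + (k - 1) * r)
}

r_assumed <- 0.15                      # hypothetical average inter-item correlation

rel_19 <- spearman_brown(19, r_assumed)  # without evolved and bigbang
rel_21 <- spearman_brown(21, r_assumed)  # with them

print(round(c(items_19 = rel_19, items_21 = rel_21), 3))
```

Under these assumptions the gain from two extra items is small but positive, which is in line with the very high correlation between the two sets of factor scores.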

How many factors should we extract and use?

There is also the question of how many factors one should extract. The answer is that it depends on what one wants to do. As Zigerell points out in a review comment of this paper on Winnower:

For example, for diagnostic purposes, if we know only that students A, B, and C miss 3 items on a test of general science knowledge, then the only remediation is more science; but we can provide more tailored remediation if we have separate components so that we observe that, say, A did poorly only on the religion-tinged items, B did poorly only on the probability items, and C did poorly only on the astronomy items.

For remedial education, it is clearly preferable to extract the highest number of interpretable factors because this gives the most precise information where knowledge is lacking for a given person. In regression analysis where we want to control for scientific knowledge, one should use the general factor.
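Zigerell's example can be written out as a toy calculation. The three students, the nine items, and the domain labels below are all hypothetical; the point is only that identical total scores can hide different subscale profiles.

```r
# Hypothetical test: 9 items, 3 per domain
domain <- rep(c("religion", "probability", "astronomy"), each = 3)

# Hypothetical responses (1 = correct, 0 = incorrect); each student misses 3 items
responses <- rbind(
  A = c(0, 0, 0, 1, 1, 1, 1, 1, 1),  # misses only the religion-tinged items
  B = c(1, 1, 1, 0, 0, 0, 1, 1, 1),  # misses only the probability items
  C = c(1, 1, 1, 1, 1, 1, 0, 0, 0)   # misses only the astronomy items
)

total <- rowSums(responses)                           # general score
subscores <- t(rowsum(t(responses), group = domain))  # per-domain scores

print(total)
print(subscores)
```

The totals are the same for all three students, so a single general score cannot tell them apart, while the domain scores show where each one needs help.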


Hampshire, A., Highfield, R. R., Parkin, B. L., & Owen, A. M. (2012). Fractionating human intelligence. Neuron, 76(6), 1225-1237.

Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger.

Supplementary material

Datafile: science_data

R code

library(plyr) #for mapvalues

data = read.csv("science_data.csv") #load data

#Coding so that 1 = true, 0 = false
data$smokheal = mapvalues(data$smokheal, c(9,7,8,2),c(NA,0,0,0))
data$condrift = mapvalues(data$condrift, c(9,7,8,2),c(NA,0,0,0))
data$earthhot = mapvalues(data$earthhot, c(9,7,8,2),c(NA,0,0,0))
data$rmanmade = mapvalues(data$rmanmade, c(9,7,8,1,2),c(NA,0,0,0,1)) #reverse
data$oxyplant = mapvalues(data$oxyplant, c(9,7,8,2),c(NA,0,0,0))
data$lasers = mapvalues(data$lasers, c(9,7,8,2,1),c(NA,0,0,1,0)) #reverse
data$atomsize = mapvalues(data$atomsize, c(9,7,8,2),c(NA,0,0,0))
data$antibiot = mapvalues(data$antibiot, c(9,7,8,2,1),c(NA,0,0,1,0)) #reverse
data$dinosaur = mapvalues(data$dinosaur, c(9,7,8,2,1),c(NA,0,0,1,0)) #reverse
data$light = mapvalues(data$light, c(9,7,8,2,3),c(NA,0,0,0,0))
data$earthsun = mapvalues(data$earthsun, c(9,7,8,2),c(NA,0,0,0))
data$suntime = mapvalues(data$suntime, c(9,7,8,2,3,1,4,99),c(0,0,0,0,1,0,0,NA))
data$dadgene = mapvalues(data$dadgene, c(9,7,8,2),c(NA,0,0,0))
data$getdrug = mapvalues(data$getdrug, c(9,7,8,2,1),c(NA,0,0,1,0)) #reverse
data$whytest = mapvalues(data$whytest, c(1,2,3,4,5,6,7,8,9,99),c(1,0,0,0,0,0,0,0,0,NA))
data$probno4 = mapvalues(data$probno4, c(9,8,2,1),c(NA,0,1,0)) #reverse
data$problast = mapvalues(data$problast, c(9,8,2,1),c(NA,0,1,0)) #reverse
data$probreq = mapvalues(data$probreq, c(9,8,2),c(NA,0,0))
data$probif3 = mapvalues(data$probif3, c(9,8,2,1),c(NA,0,1,0)) #reverse
data$evolved = mapvalues(data$evolved, c(9,7,8,2),c(NA,0,0,0))
data$bigbang = mapvalues(data$bigbang, c(9,7,8,2),c(NA,0,0,0))
data$onfaith = mapvalues(data$onfaith, c(9,1,2,3,4,7,8),c(NA,1,1,0,0,0,0))

#How many factors to extract?
library(nFactors) #for nScree
nScree(data[complete.cases(data),]) #use complete cases only

#extract factors
library(psych) #for factor analysis
for (num in 1:5) {
  print(paste0("Factor analysis, extracting ",num," factors using oblimin and MinRes"))
  fa.fit = fa(data,num) #extract factors; named fa.fit so it does not mask psych's fa()
  print(fa.fit$loadings) #print
  if (num>1){ #print factor cors
    phi = round(fa.fit$Phi,2) #round to 2 decimals
    colnames(phi) = rownames(phi) = colnames(fa.fit$scores) #set names
    print(phi) #print
  }
}

#Does it make a difference?
fa.all = fa(data[1:21]) #no onfaith
fa.noreligious = fa(data[1:19]) #no onfaith, bigbang, evolved
cor(fa.all$scores,fa.noreligious$scores, use="pair") #correlation, ignore missing cases