Van Ijzendoorn, M. H., Juffer, F., & Poelhuis, C. W. K. (2005). Adoption and cognitive development: a meta-analytic comparison of adopted and nonadopted children’s IQ and school performance. Psychological bulletin, 131(2), 301.

It turns out that someone already did a meta-analysis of adoption studies and cognitive ability. It does not solely include cross-country transracial adoptions, but it does include some. They report both country of origin and country of adoption, so it is fairly easy to find the studies that one wants to take a closer look at. It is fairly inclusive in what counts as cognitive development, e.g. school results and language tests count, as well as regular IQ tests. They report standardized differences (d), so results are easy to understand.

They do not present aggregated results by country of origin however, so one would have to do that oneself. I haven’t done it (yet?), but the method to do so is this:

  1. Obtain the country IQs for all countries in the study. These are readily available from Lynn & Vanhanen (2012) or in the international megadataset.
  2. Score all the outcomes by using the adoptive country’s IQ. E.g. if the US has a score of 97, and Koreans adopted to that country get a d score of .16 in “School results” as they do in the first study listed, then this corresponds to a school IQ performance of 97 - (.16 × 15) = 97 - 2.4 = 94.6. Note that this assumes that the comparison sample is unselected (not smarter than average). This is likely false because adoptive parents tend to be higher social class and presumably smarter, so they would send their (adoptive) children to above average schools. Also be careful about norm comparisons because they often use older norms and the Flynn effect thus results in higher IQ scores for the adoptees.
  3. Copy relevant study characteristics from the table, e.g. comparison group, sample sizes, age of assessment and type of outcome (school, language, IQ, etc.).
  4. Repeat steps 2-3 for all studies.
  5. BONUS: Look for additional studies. Do this by a) contacting the authors of recent papers and the meta-analysis, b) searching Google Scholar or another academic search engine, and c) looking thru the studies that cite the included studies for more relevant ones.
  6. BONUS: Get someone else to independently repeat steps (2-3) for the same studies. This checks interrater consistency.
  7. Aggregate results (weighted mean of various kinds).
  8. Correlate aggregated results with origin countries’ IQs to check for spatial transferability, a prediction of genetic models.
  9. Do regression analyses to see if study characteristics predict outcomes.
  10. Write it up and submit to Open Differential Psychology (good guy) or Intelligence (career-focused bad guy). Write up to Winnower or Human Varieties if lazy or too busy.
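Steps 2, 7 and 8 can be sketched in a few lines of R. This is only a minimal sketch, not the analysis itself. The helper `d_to_iq()` and the column names are my own inventions, and every row of the data frame is made up for illustration (only the standalone call reproduces the worked example from step 2). It assumes an IQ SD of 15 and that a positive d means the adoptees scored below the comparison group.

```r
# Convert a standardized difference (d) to the IQ scale of the adoptive country.
# Assumes SD = 15 and that positive d means adoptees score below the comparison group.
d_to_iq <- function(adoptive_iq, d, sd = 15) {
  adoptive_iq - d * sd
}

# Worked example from step 2: adoptive country IQ 97, d = .16
d_to_iq(97, 0.16)  # 94.6

# Made-up rows for illustration only; real values must come from the table
# and from a country IQ dataset.
studies <- data.frame(
  origin      = c("A", "A", "B", "C"),
  origin_iq   = c(105, 105, 85, 95),
  adoptive_iq = c(97, 100, 100, 98),
  d           = c(0.16, 0.00, 0.50, 0.20),
  n           = c(135, 240, 60, 100)
)
studies$adoptee_iq <- d_to_iq(studies$adoptive_iq, studies$d)

# Step 7: sample-size weighted mean adoptee IQ by origin country
agg <- do.call(rbind, lapply(split(studies, studies$origin), function(x) {
  data.frame(
    origin     = x$origin[1],
    origin_iq  = x$origin_iq[1],
    adoptee_iq = weighted.mean(x$adoptee_iq, x$n),
    n          = sum(x$n)
  )
}))

# Step 8: correlate aggregated adoptee IQ with origin country IQ
cor(agg$origin_iq, agg$adoptee_iq)
```

Weighting by sample size is only one of the "various kinds" of weighted mean one could use; inverse-variance weights would be the more standard meta-analytic choice.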

The main results table

More likely, you are too lazy to do the above, but you want a sneak peek at the results. Here’s the main table from the paper.

Study | Country/region of study | Country/region of child’s origin | Age at assessment (years) | Age at adoption (months) | N adoption | N comparison | Preadoption status | Comparison group | Outcome (d)
Andresen (1992) | Norway | Korea | 12-18 | 12-24 | 135 | 135 | Not reported | Classmates | School results 0.16; Language 0.09
Benson et al. (1994) | United States | United States | 12-18 | < 15 | 881 | Norm | Not reported | Norm group | School results -0.36
Berg-Kelly & Eriksson (1997) | Sweden | Korea/India | 12-18 | < 12 | 125 | 9204 | Not reported | General population | School results 0.03 f/-0.04 m; Language -0.02 f/-0.05 m
Bohman (1970) | Sweden | Not reported | 12-18 | < 12 | 160 | 1819 | Not reported | Classmates | School results 0.09 f/0.07 m; Language 0.02 f/-0.02 m; Learning problems 0.00
Brodzinsky et al. (1984) | United States | United States | 4-12 | < 12 | 130 | 130 | Not reported | General population | School competence 0.62 f/0.51 m
Brodzinsky & Steiger (1991) | United States | Not reported | 9-19 | | 441 | 6753 | Not reported | Population % | School failure 0.76
Bunjes & de Vries (1988) | Netherlands | Korea | 4-12 | 12-24 | 118 | 236 | Not reported | Classmates | School results 0.24; Language 0.22
Castle et al. (2000) | England | England | 4-12 | < 12 | 52 | Norm | Not reported | Standardized scores | School results -0.47; IQ 0.47
Clark & Hanisee (1982) | United States | Vietnam | 0-4 | 12-24 | 25 | Norm | Not reported | Standardized scores | IQ -2.42
Colombo et al. (1992) | Chile | Chile | 4-12 | 0-12 | 16 | 11 | Undernutrition | Biological siblings | IQ -1.16
Cook et al. (1997) | Europe | Not reported | 4-8 | 12-24 | 131 | 125 | Not reported | General population | School competence 0.56 f/0.16 m
Dalen (2001) | Norway | Korea | 12-18 | 0-12 | 193 | 193 | Not reported | Classmates | School results 0.47 (Colombia), -0.07 (Korea); Language 0.43 (Colombia), -0.05 (Korea); Learning problems 0.50
Dennis (1973) | United States | Lebanon | 2-18 | > 24 | 85 | 51 | Institute | Institute children | IQ -1.28 (intraracial), -1.36 (transracial)
De Jong (2001) | New Zealand | Romania/Russia | 4-15 | 12-24 | 116 | Norm | Some problems | General population | School competence 0.65
Duyme (1988) | France | France | 12-18 | < 12 | 87 | 14951 | Not reported | General population | School results 0.00
Fan et al. (2002) | United States | United States | 12-18 | | 514 | 17241 | Not reported | General population | School grades -0.02
Feigelman (1997) | United States | Not reported | 8-21 | | 101 | 6258 | Not reported | General population | Education level -0.03
Fisch et al. (1976) | United States | United States | 4-12 | < 12 | 94 | 188 | No problems | General population | IQ 0.00; School results 0.50; Language 0.52
Frydman & Lynn | Belgium | Korea | 4-12 | 12-24 | 19 | Norm | Not reported | Standardized scores | IQ -1.68
Gardner et al. (1961) | United States | Not reported | 12-18 | < 12 | 29 | 29 | Not reported | Classmates | School achievement 0.09
Geerars et al. (1995) | Netherlands | Thailand | 12-18 | < 12 | 68 | Norm | Not reported | Population % | School results 0.19
Hoopes et al. (1970) | United States | United States | 12-18 | | 100 | 100 | 1-2 shifts in placement | General population | IQ 0.12
Hoopes (1982) | United States | United States | 4-12 | < 12 | 260 | 68 | Nothing special | General population | IQ 0.18
Horn et al. (1979) | United States | United States | 3-26 | < 1 | 469 | 164 | No problems | Environment siblings | IQ 0.17/0.34/-0.05
W. J. Kim et al. (1992) | United States | Not reported | 12-18 | | 43 | 43 | Not reported | General population | School results 0.74
W. J. Kim et al. (1999) | United States | Korea | 4-12 | < 12 | 18 | 9 | Nothing special | Environment siblings | School competence -0.39
Lansford et al. (2001) | United States | Not reported | 12-18 | | 111 | 200 | Not reported | General population | School grades 0.46
Leahy (1935) | United States | United States | 5-14 | < 6 | 194 | 194 | Not reported | General population | School grades 0.00; IQ -0.06
Levy-Shiff et al. (1997) | Israel | Israel, South America | 7-13 | < 3 | 50, 50 | Norm | Not reported | Standardized scores | IQ -1.10 f/-2.00 m
Lien et al. (1977) | United States | Korea | 12-18 | > 24 | 240 | Norm | Undernutrition | Standardized scores | IQ 0.00
Lipman et al. (1992) | Canada | Not reported | 4-16 | | 104 | 3185 | Not reported | General population | School performance -0.05 f/0.16 m
McGuinness & Pallansch (2000) | United States | Soviet Union | 4-12 | > 24 | 105 | 1000 | Long time in orphanages | Norm group | School competence 0.46
Moore (1986) | United States | United States | 7-10 | 12-24 | 23 | Norm | Not reported | Standardized scores | IQ -0.00 f/-1.00 m
Morison & Ellwood (2000) | Canada | Romania | 4-12 | 12-24 | 59 | 35 | Orphanages | General population | IQ 1.45 (combined)
Neiss & Rowe (2000) | United States | 75% United States | 12-18 | | 392 | 392 | Not reported | General population | IQ 0.08
O’Connor et al. (2000) | England | Romania | 6 | 0-42 | 207 | Norm | Orphanage | Standardized scores | IQ -0.56 (combined)
Palacios & Sanchez (1996) | Spain | Spain | 4-12 | > 24 | 210 | 314 | Not reported | Institute children | School competence -0.18
Pinderhughes (1998) | United States | United States | 8-15 | 24-48 | 66 | 33 | Older children | General population | School competence 0.64 (combined)
Plomin & DeFries (1985) | United States | United States | 1 | 0-5 | 182 | 182 | Not reported | General population | IQ 0.14
Priel et al. (2000) | Israel | 75% Israel | 8-12 | 12-24 | 50 | 80 | Not reported | General population | School competence 0.77 f/1.12 m
Rosenwald (1995) | Australia | 73% Korea, Asia, South America | 4-16 | < 12 | 283 | 2583 | Not reported | General population | School performance -0.18
Scarr & Weinberg (1976) | United States | 88% United States | 4-16 | < 12 | 176 | 145 | Not reported | Environment siblings | IQ 0.75 (combined)
Schiff et al. (1978) | France | France | 4-12 | < 12 | 32 | 20 | Not reported | Biological siblings | School results -0.70
Segal (1997) | United States | United States | 4-12 | < 12 | 6 | 6 | Not reported | Environment siblings | IQ -1.14; IQ 2.67
Sharma et al. (1996) | United States | 81% United States | 12-18 | 12-24 | 4682 | 4682 | Not reported | General population | School results 0.37 (combined)
Sharma et al. (1998) | United States | United States | 12-18 | < 12 | 629 | 72 | Not reported | Environment | School competence -0.45 f/-0.61 m
Silver (1970) | United States | Not reported | 4-12 | < 3 | 10 | 70 | Not reported | General population | Learning problems 1.21
Silver (1989) | United States | Not reported | 4-12 | | 39 | Perc. | Not reported | General population | Learning problems 1.38
Skodak & Skeels (1949) | United States | Not reported | 12-18 | < 6 | 100 | 100 | Not reported | Standardized scores | IQ -1.12
Smyer et al. (1998) | Sweden | Not reported | Adults | < 12 | 60 | 60 | Not reported | Biological (twin siblings) | Education level -0.82
Stams et al. (2000) | Netherlands | Sri Lanka | 4-12 | < 6 | 159 | Norm | Not reported | Standardized scores | School results 0.33; IQ -0.34 f/-0.73 m; Learning problems -0.05
Teasdale & Owen (1986) | Denmark | Not reported | Adults | < 12 | 302 | 4578 | Not reported | General population | IQ 0.35; Education level 0.32
Tizard & Hodges (1978) | England | Not reported | 8 | > 24 | 25 | 14 | Not reported | Restored children | IQ -0.40 (older), -0.62 (younger)
Tsitsikas et al. (1988) | Greece | Greece | 5-6 | < 12 | 72 | 72 | Not reported | Classmates | IQ 0.64; School performance 0.29; Language 0.30
Verhulst et al. (1990) | Netherlands | Europe | 12-18 | > 24 | 2148 | 933 | Not reported | General population | Perc. special education 0.25 f/0.29 m
Versluis-den Bieman & Verhulst (1995) | Netherlands | Europe | 12-18 | > 24 | 1538 | Norm | Not reported | General population | School competence 0.28 f/0.41 m
Wattier & Frydman (1985) | Belgium | Korea 89% | 4-12 | 12-24 | 28 | Norm | Not reported | Standardized scores | IQ -0.06
Westhues & Cohen (1997) | Canada | Korea 40%, India 40%, South America | 12-18 | 12-24 | 134 | 83 | Not reported | Environment siblings | School performance 0.13
Wickes & Slate (1997) | United States | Korea | > 18 | > 36 | 174 | Norm | Not reported | Norm group | School results 0.09 f/0.07 m; Language 0.07 f/0.03 m
Winick et al. (1975) | United States | Korea | 4-12 | > 24 | 112 | Norm | Malnourished | Standardized scores | School performance 0.00; IQ 0.00
Witmer et al. (1963) | United States | United States | 12-18 | < 12 | 484 | 484 | Nothing special | Classmates | School performance 0.00; IQ 0.00

Often I want to get the mean value for a case across a number of columns, usually years. This however gets repetitive because the base mean() function cannot handle data like that. Other times, one wants to standardize the data first, e.g. when the scales are not the same across variables. Lastly, often one wants to use just a few columns, usually marked by a special name. Before, these tasks were time-consuming. Now they are easy.

Consider the iris dataset:

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

It has 4 numeric and 1 factor variable. Let’s say we want the means by variable for the first four. The simplest idea is:

> mean(iris[1:4])
[1] NA
Warning message:
In mean.default(iris[1:4]) :
  argument is not numeric or logical: returning NA

Alas, it doesn’t work. However, df_func() handles it:

> df_func(iris[1:4]) %>% head
[1] 2.550 2.375 2.350 2.350 2.550 2.850

If we want to standardize the variables first:

> df_func(iris[1:4], standardize = T) %>% head
[1] -0.6322189 -0.9793858 -0.9392153 -0.9984393 -0.6050527 -0.2041362

Maybe we want the median instead:

> df_func(iris[1:4], func = median) %>% head
[1] 2.45 2.20 2.25 2.30 2.50 2.80

What if we want to match columns by a pattern? The string “Petal” matches two variables:

> df_func(iris, pattern = "Petal") %>% head
[1] 0.80 0.80 0.75 0.85 0.80 1.05

If we try to use a non-numeric variable, we get an informative error:

> df_func(iris)
Error in df_func(iris) : Some variables were not numeric!

Likewise, if we use pattern matching but the pattern doesn’t match anything:

> df_func(iris, pattern = "sadaiasd")
Error in df_func(iris, pattern = "sadaiasd") : 
  No columns matched the pattern!

Finally, the function ignores missing data by default, but one can change this if needed.
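For readers curious about what such a function involves, here is a minimal sketch in base R. It is not the package’s actual code: the name `row_summary()` and the argument `ignore_NA` are hypothetical, chosen only to mirror the behavior shown above (column selection by pattern, a numeric check, optional standardization, and a row-wise summary function).

```r
# Hypothetical re-implementation for illustration; not the package's actual code.
row_summary <- function(df, func = mean, pattern = NULL,
                        standardize = FALSE, ignore_NA = TRUE) {
  # Optionally keep only the columns whose names match the pattern
  if (!is.null(pattern)) {
    keep <- grepl(pattern, names(df))
    if (!any(keep)) stop("No columns matched the pattern!")
    df <- df[, keep, drop = FALSE]
  }
  # Refuse non-numeric input rather than silently coercing it
  if (!all(vapply(df, is.numeric, logical(1)))) {
    stop("Some variables were not numeric!")
  }
  m <- as.matrix(df)
  # Optionally z-score each column so that scales are comparable
  if (standardize) m <- scale(m)
  # Apply the summary function across the columns of each row
  unname(apply(m, 1, func, na.rm = ignore_NA))
}

head(row_summary(iris[1:4]))
head(row_summary(iris, pattern = "Petal"))
head(row_summary(iris[1:4], func = median))
```

Note that the summary function must accept an `na.rm` argument for this sketch to work, which `mean` and `median` both do.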

Get the package from GitHub.

I found this one a long time ago and tweeted it, but apparently forgot to blog it.

Odenstad, A., Hjern, A., Lindblad, F., Rasmussen, F., Vinnerljung, B., & Dalen, M. (2008). Does age at adoption and geographic origin matter? A national cohort study of cognitive test performance in adult inter-country adoptees. Psychological Medicine, 38(12), 1803-1814.

Background Inter-country adoptees run risks of developmental and health-related problems. Cognitive ability is one important indicator of adoptees’ development, both as an outcome measure itself and as a potential mediator between early adversities and ill-health. The aim of this study was to analyse relations between proxies for adoption-related circumstances and cognitive development.
Method Results from global and verbal scores of cognitive tests at military conscription (mandatory for all Swedish men during these years) were compared between three groups (born 1968–1976): 746 adoptees born in South Korea, 1548 adoptees born in other non-Western countries and 330 986 non-adopted comparisons in the same birth cohort. Information about age at adoption and parental education was collected from Swedish national registers.
Results South Korean adoptees had higher global and verbal test scores compared to adoptees from other non-European donor countries. Adoptees adopted after age 4 years had lower test scores if they were not of Korean ethnicity, while age did not influence test scores in South Koreans or those adopted from other non-European countries before the age of 4 years. Parental education had minor effects on the test performance of the adoptees – statistically significant only for non-Korean adoptees’ verbal test scores – but was prominently influential for non-adoptees.
Conclusions Negative pre-adoption circumstances may have persistent influences on cognitive development. The prognosis from a cognitive perspective may still be good regardless of age at adoption if the quality of care before adoption has been ‘good enough’ and the adoption selection mechanisms do not reflect an overrepresentation of risk factors – both requirements probably fulfilled in South Korea.

I summarize and comment on the findings below:

Which adoptees?

In total, 2294 inter-country adoptees were born outside the Western countries (Europe, North America and Australia) and adopted before age 10 years. Of these, 746 were born in South Korea [Korean adoptee (KA) group]. The remaining 1548 individuals were born in other countries, Non-Korean adoptee (NKA) group. India was the most common country of origin, followed by Thailand, Chile, Ethiopia, Colombia and Sri Lanka. These were the only donor countries for which the number of adoptees included in this study exceeded 100. The non-adopted population (NAP) group consisted of non-adopted individuals born in Sweden (n=330 896).

Unfortunately, no more detailed information is given, so an origin country IQ x adoptee IQ study (spatial transferability) cannot be done.

Main results


We see that Korean adoptees do better than Swedes, even on the verbal test. The advantage stops being statistically significant (p<alpha) when they control for various things. Notice that the disadvantage for non-Koreans becomes larger after control (their scores decrease and the Swedes’ scores increase).

Age at adoption matters, but apparently only for non-Koreans

[Figure: age at adoption]

This is in line with environmental cumulative disadvantage for non-Koreans. Alternatively, it is due to selection bias in that the less bright children (in the origin countries) are adopted later.

Perhaps the Koreans were placed with the better parents and this made them smarter?

Maybe, but the data show that adoptive parents’ education matters little for the adoptees, even the transracial ones.

[Figure: parental education and IQ]

Notice the clear relationship between child IQ and parental education for the non-adopted population. Then notice the lack of a clear pattern among the adoptees. There may be a slight upward trend for Koreans, but it is weak (only .22 between lowest and highest education, giving a d≈.10) and it is not found for non-Koreans (the middle education level had the highest scores).

Still, one could claim that in Korea, smarter/normal children are given up for adoption, while in the other non-Western countries, this isn’t the case or even the opposite is the case. This study cannot address this possibility.

This study is much larger than other studies and also has a comparison group. The main problem with it is that it does not report data for more countries of origin. Only the (superior) Koreans are singled out.

One idea for a series of blog posts is that I could write about new functions in my R package. Often I just push these without letting anyone know, but I guess it could be useful to make an introduction for them (the more interesting ones anyway) here.

Function description: Adds delta (difference) columns to a data.frame. These are made from one primary variable and a number of secondary variables. Variables can be given either by indices or by name. If no secondary variables are given, all numeric variables are used.


> iris %>% head %>% df_add_delta(1)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species delta_Sepal.Length_Sepal.Width
1          5.1         3.5          1.4         0.2  setosa                            1.6
2          4.9         3.0          1.4         0.2  setosa                            1.9
3          4.7         3.2          1.3         0.2  setosa                            1.5
4          4.6         3.1          1.5         0.2  setosa                            1.5
5          5.0         3.6          1.4         0.2  setosa                            1.4
6          5.4         3.9          1.7         0.4  setosa                            1.5
  delta_Sepal.Length_Petal.Length delta_Sepal.Length_Petal.Width
1                             3.7                            4.9
2                             3.5                            4.7
3                             3.4                            4.5
4                             3.1                            4.4
5                             3.6                            4.8
6                             3.7                            5.0

So, we see that three variables were created based on a prefix and a separator (both configurable). The difference scores are given in natural units, but can also be standardized automatically if desired.

Three delta vars were made because we chose var 1 as the primary and it automatically selects the remaining numeric vars as secondaries.
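The core logic is simple enough to sketch in base R. The following is an illustration only, not the package’s actual implementation: the name `add_delta()` and its arguments `prefix` and `sep` are hypothetical, and the standardization option is left out.

```r
# Hypothetical sketch; the package's actual implementation may differ.
add_delta <- function(df, primary, secondary = NULL,
                      prefix = "delta", sep = "_") {
  # Allow variables to be given by index or by name
  if (is.numeric(primary)) primary <- names(df)[primary]
  if (is.numeric(secondary)) secondary <- names(df)[secondary]
  # Default: every other numeric column is a secondary variable
  if (is.null(secondary)) {
    num <- names(df)[vapply(df, is.numeric, logical(1))]
    secondary <- setdiff(num, primary)
  }
  # Add one difference column (primary minus secondary) per secondary variable
  for (s in secondary) {
    new_name <- paste(prefix, primary, s, sep = sep)
    df[[new_name]] <- df[[primary]] - df[[s]]
  }
  df
}

add_delta(head(iris), 1)
```

As in the output above, the difference is primary minus secondary, so the first row gives 5.1 - 3.5 = 1.6 for Sepal.Width.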

It seems that no one has integrated this literature yet. I will take a quick stab at it here. It could be expanded into a proper paper later if someone wants to and has the time to do that.


Lee Jussim (also blog) has done a tremendous job of reviewing the stereotype literature in recent years. In general he has found that stereotypes are mostly moderately to very accurate. On the other hand, self-fulfilling prophecies are probably real but fairly limited (e.g. work best when teachers don’t know their students well yet), especially in comparison to stereotype accuracy. Of course, these findings are exactly the opposite of what social psychologists, taken as a group, have been telling us for years.

The best short review of the literature is their book chapter The Unbearable Accuracy of Stereotypes. A longer treatment can be found in his 2012 book Social Perception and Social Reality: Why Accuracy Dominates Bias and Self-Fulfilling Prophecy (libgen).

Occupational success and cognitive ability

Society is more or less a semi-stable hierarchy based on mostly inherited personality traits, cognitive ability as well as some family-based advantage. This shows up in the examination of surnames over time in many countries, as documented in Gregory Clark’s book The Son Also Rises: Surnames and the History of Social Mobility (libgen). One example:

[Figure: Sweden stability]

Briefly put, surnames are kind of an extended family and they tend to keep their standing over time. They regress towards the mean (not the statistical kind!), but slowly. This is due to outmarrying (marrying people from lower classes) and genetic regression (i.e. predicted via the breeder’s equation and due to the fact that narrow heritability and shared environment do not add up to 1).

It also shows up when educational attainment is directly examined with behavioral genetic methods. We reviewed the literature recently:

How do we find out whether g is causally related to later socioeconomic status? There are at least five lines of evidence: First, g and socioeconomic status correlate in adulthood. This has consistently been found for so many years that it hardly bears repeating[22, 23]. Second, in longitudinal studies, childhood g is a good correlate of adult socioeconomic status. A recent meta-analysis of longitudinal studies found that g was a better correlate of adult socioeconomic status and income than was parental socioeconomic status[24]. Third, there is a genetic overlap of causes of g and socioeconomic status and income[25, 26, 27, 28]. Fourth, multiple regression analyses show that IQ is a good predictor of future socioeconomic status, income and more, even controlling for parental income and the like[29]. Fifth, comparisons between full-siblings reared together show that those with higher IQ tend to do better in society. This cannot be attributed to shared environmental factors since these are the same for both siblings[30, 31].

I’m not aware of any behavioral genetic study of occupational success itself, but that may exist somewhere. (The scientific literature is basically a very badly standardized, difficult to search database.) But clearly, occupational success is closely related to income, educational attainment, cognitive ability and certain personality traits, all of which show substantial heritability and some of which are known to correlate genetically.

Occupations and cognitive ability

An old line of research shows that there is indeed a stable hierarchy in occupations’ mean and minimum cognitive ability levels. One good review of this is Meritocracy, Cognitive Ability, and the Sources of Occupational Success, a working paper from 2002. I could not find a more recent version. The paper itself is somewhat antagonistic towards the idea (the author hates psychometricians, and in particular dislikes Herrnstein and Murray, as well as Jensen) but it does neatly summarize a lot of findings.

[Figures: occupations and IQ, 1-7]

The last one is from Gottfredson’s book chapter g, jobs, and life (her site, better version).

Occupations and cognitive ability in preparation

Furthermore, we can go a step back from the above and find SAT scores (almost an IQ test) by college majors (more numbers here). These later result in people working in different occupations, altho the connection is not always a simple one-to-one, but somewhere between many-to-many and one-to-one, we might call it a few to a few. Some occupations only recruit persons with particular degrees — doctors must have degrees in medicine — while others are flexible within limits. Physics majors often don’t work with physics at their level of competence, but instead work as secondary education teachers, in the finance industry, as programmers, as engineers and of course sometimes as physicists of various kinds such as radiation specialists at hospitals and meteorologists. But still, physicists don’t often work as child carers or psychologists, so there is in general a strong connection between college majors and occupations.

There is some stereotype research into college majors. For instance, a recently popularized study showed that beliefs about intellectual requirements of college majors correlated with female% of the field, as in, the fields perceived to be more difficult had fewer women. In fact, the perceived difficulty of the field probably mostly just proxies the actual difficulty of the field, as measured by the mean SAT/ACT score of the students. However, no one seems to have actually correlated the SAT scores with the perceived difficulty, which is the correlation that is the most relevant for stereotype accuracy research.

There is a catch, however. If one analyses the SAT subtests vs. gender%, one sees that it is mostly the quantitative part of the SAT that gives rise to the SAT x gender% correlation. One can also see that the gender% correlates with median income by major.

[Figures: SAT quantitative score by college major and gender; SAT verbal score by college major and gender]

Stereotypes about occupations and their cognitive ability

Finally, we get to the central question. If we ask people to estimate the cognitive ability of persons by occupation and then correlate this with the actual cognitive ability, what do we get? Jensen summarizes some results in his 1980 book Bias in Mental Testing (p. 339). I mark the most important passages.

People’s average ranking of occupations is much the same regardless of the basis on which they were told to rank them. The well-known Barr scale of occupations was constructed by asking 30 “psychological judges” to rate 120 specific occupations, each definitely and concretely described, on a scale going from 0 to 100 according to the level of general intelligence required for ordinary success in the occupation. These judgments were made in 1920. Forty-four years later, in 1964, the National Opinion Research Center (NORC), in a large public opinion poll, asked many people to rate a large number of specific occupations in terms of their subjective opinion of the prestige of each occupation relative to all of the others. The correlation between the 1920 Barr ratings based on the average subjectively estimated intelligence requirements of the various occupations and the 1964 NORC ratings based on the average subjective opined prestige of the occupations is .91. The 1960 U.S. Census of Population: Classified Index of Occupations and Industries assigns each of several hundred occupations a composite index score based on the average income and educational level prevailing in the occupation. This index correlates .81 with the Barr subjective intelligence ratings and .90 with the NORC prestige ratings.

Rankings of the prestige of 25 occupations made by 450 high school and college students in 1946 showed the remarkable correlation of .97 with the rankings of the same occupations made by students in 1925 (Tyler, 1965, p. 342). Then, in 1949, the average ranking of these occupations by 500 teachers college students correlated .98 with the 1946 rankings by a different group of high school and college students. Very similar prestige rankings are also found in Britain and show a high degree of consistency across such groups as adolescents and adults, men and women, old and young, and upper and lower social classes. Obviously people are in considerable agreement in their subjective perceptions of numerous occupations, perceptions based on some kind of amalgam of the prestige image and supposed intellectual requirements of occupations, and these are highly related to such objective indices as the typical educational level and average income of the occupation. The subjective desirability of various occupations is also a part of the picture, as indicated by the relative frequencies of various occupational choices made by high school students. These frequencies show scant correspondence to the actual frequencies in various occupations; high-status occupations are greatly overselected and low-status occupations are seldom selected.

How well do such ratings of occupations correlate with the actual IQs of the persons in the rated occupations? The answer depends on whether we correlate the occupational prestige ratings with the average IQs in the various occupations or with the IQs of individual persons. The correlations between average prestige ratings and average IQs in occupations are very high— .90 to .95—when the averages are based on a large number of raters and a wide range of rated occupations. This means that the average of many people’s subjective perceptions conforms closely to an objective criterion, namely, tested IQ. Occupations with the highest status ratings are the learned professions—physician, scientist, lawyer, accountant, engineer, and other occupations that involve high educational requirements and highly developed skills, usually of an intellectual nature. The lowest-rated occupations are unskilled manual labor that almost any able-bodied person could do with very little or no prior training or experience and that involves minimal responsibility for decisions or supervision.

The correlation between rated occupational status and individual IQs ranges from about .50 to .70 in various studies. The results of such studies are much the same in Britain, the Netherlands, and the Soviet Union as in the United States, where the results are about the same for whites and blacks. The size of the correlation, which varies among different samples, seems to depend mostly on the age of the persons whose IQs are correlated with occupational status. IQ and occupational status are correlated .50 to .60 for young men ages 18 to 26 and about .70 for men over 40. A few years can make a big difference in these correlations. The younger men, of course, have not all yet attained their top career potential, and some of the highest-prestige occupations are not even represented in younger age groups. Judges, professors, business executives, college presidents, and the like are missing occupational categories in the studies based on young men, such as those drafted into the armed forces (e.g., the classic study of Harrell & Harrell, 1945).
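The gap between the two kinds of correlation (.90 to .95 for occupation means vs. about .50 to .70 for individuals) is what one expects when within-occupation variation is averaged away. A small simulation can illustrate the point. Everything here is simulated: the number of occupations, the SDs and the noise in the ratings are arbitrary assumptions of mine, not estimates from the studies Jensen cites.

```r
set.seed(1)
n_occ <- 50      # number of occupations (arbitrary)
n_per <- 200     # persons per occupation (arbitrary)

# True mean IQ of each occupation, and a noisy status rating of each occupation
occ_mean_iq <- rnorm(n_occ, mean = 100, sd = 10)
occ_rating  <- as.numeric(scale(occ_mean_iq)) + rnorm(n_occ, sd = 0.3)

# Individuals vary around their occupation's mean
person <- data.frame(
  occ = rep(seq_len(n_occ), each = n_per),
  iq  = rep(occ_mean_iq, each = n_per) + rnorm(n_occ * n_per, sd = 12)
)
person$rating <- occ_rating[person$occ]

# Correlation at the level of occupation means
obs_mean_iq <- tapply(person$iq, person$occ, mean)
r_aggregate <- cor(occ_rating, as.numeric(obs_mean_iq))

# Correlation at the level of individual persons
r_individual <- cor(person$rating, person$iq)

c(aggregate = r_aggregate, individual = r_individual)
```

The aggregate correlation comes out higher because averaging removes the individual differences within each occupation, which the occupation-level rating cannot predict.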

I predict that there is a lot of delicious low-hanging, ripe research fruit ready for harvest in this area if one takes a day or ten to dig up some data and read thru older papers, books and reports.

See also: Rationality and bias test results

I stumbled upon another political bias test:

Basically, you will be given 10 questions: 5 about why US-style conservatives favor/oppose a given policy, and 5 about why US-style liberals favor/oppose the same policy. It is a multiple-choice format. Your score is then calculated simply by comparison with the answers given to the same survey by people who say they are conservative/liberal.

Does it work? I guess. It has a sampling problem in that people who take political surveys on the internet aren’t going to be representative of these political groups, and so if one is answering about the actual political group, one might get it slightly wrong. Another problem is that the score is based only on 5 items, which must mean fairly large sampling error. Of course, other problems include it using a 1-dimensional model of political opinions despite the evidence saying that this isn’t a good model of the data and that it is US-centric.

Despite that:

You identified yourself as slightly liberal. Your score in correctly choosing conservatives’ reasoning (based on answers given by self-identified conservatives to date) was 4 out of 5 questions.
Your score in identifying conservatives’ reasoning is as good as or better than 97 percent of others who are slightly liberal.

The result is fairly in line with my prior results. I picked slightly liberal because by US standards I’m slightly towards less freedom on the economic distribution dimension, but by the Danish standard, I’m somewhat towards more freedom.

Tattoos and piercing. I haven’t found any evidence that this relates to intelligence or even creativity. On the other hand, what underlying factor of openness would it be an indication of?

I have. A long time ago, I tried to find a study of this. The only meaningful study I found was a small study of Croatian veterans:

Pozgain, I., Barkic, J., Filakovic, P., & Koic, O. (2004). Tattoo and personality traits in Croatian veterans. Yonsei medical journal, 45, 300-305.

The study has N≈100 and found a difference in IQ scores of about 5 IQ points. Not very convincing.

OKCupid data

In a still unpublished project, we scraped public data from OKCupid. We did this over several months, so the dataset has about N=70k. The dataset contains the public questions and users’ public answers to them, as well as profile information. Each question is multiple choice with 2 to 4 options.

Some of the questions can be used to make a rudimentary cognitive test with 3-5 items that has a reasonable sample size. This can then be used to calculate a mean cognitive score for each answer category of every question. Plots of the relevant questions are shown below. For interpretation, the SD of cognitive score is about 1, so the differences can be thought of as d values. There is some selection for cognitive ability (OKCupid is a more serious dating site mainly used by college students and graduates), so probably population-wide results would be a bit stronger in general. Worse, this selection gets stronger as the sample size decreases because smarter people tend to answer more questions. The effect is fairly small tho.
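A minimal sketch of the scoring idea described above, using simulated data only. The variable names (`answer`, `items`, `cog_score`), the number of items and the size of the group difference are assumptions made for illustration; nothing here comes from the actual dataset.

```r
set.seed(1)
n <- 5000

# Hypothetical answers to one multiple-choice question
answer <- sample(c("Yes", "No", "Unsure"), n, replace = TRUE)

# Assumed latent ability with a small difference for one answer category
ability <- rnorm(n) - 0.3 * (answer == "Yes")

# Four noisy right/wrong items that each partly reflect ability
items <- sapply(1:4, function(i) as.integer(ability + rnorm(n) > 0))

# Sum score, standardized so that SD = 1 and differences read as d values
cog_score <- as.numeric(scale(rowSums(items)))

# Mean cognitive score and sample size by answer category
means_by_answer <- tapply(cog_score, answer, mean)
print(round(means_by_answer, 2))
print(table(answer))
```

With so few items the score is unreliable, so observed differences between categories will be smaller than the differences in the underlying ability.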

Tattoo results





Piercing results






See first: Some methods for measuring and correcting for spatial autocorrelation

Piffer’s method

Piffer’s method to examine the between group heritability of cognitive ability and height uses polygenic scores (either by simple mean or with factor analysis) based on GWAS findings to see if they predict phenotypes for populations. The prior studies (e.g. the recent one we co-authored) have relied on the 1000 Genomes and ALFRED public databases of genetic data. These datasets however do not have that high resolution, N’s = 26 and 50. These do not include fine-grained European populations. However, it is known that there is quite a bit of variation in cognitive ability within Europe. If a genetic model is true, then one should be able to see this when using genetic data. Thus, one should try to obtain frequency data for a large set of SNPs for more populations, and crucially, these populations must be linked with countries so that the large amount of country-level data can be used.

Genomic autocorrelation

The above would be interesting and probably one could find some data to use. However, another idea is to just rely on the overall differences between European populations, e.g. as measured by Fst values. Overall genetic differentiation should be a useful proxy for genetic differentiation in the causal variants for cognitive ability especially within Europe. Furthermore, because k nearest spatial neighbor regression is local, it should be possible to use it on a dataset with Fst values for all populations, not just Europeans.

Since I have already written the R code to analyze data like this, I just need some Fst tables, so if you know of any such tables, please send me an email.


There is another table in this paper:

The study is also interesting in that they note that the SNPs that distinguish Europeans the most are likely to be genic, that is, the SNPs are located within a gene. This is a sign of selection, not drift. See also the same finding in

There are few good things to say about this book. The best thing is on the very first page:

Some things never change: the sun rises in the morning. The sun sets in the evening. A lot of linguists don’t like statistics. I have spent a decade teaching English language and linguistics at various universities, and, invariably, there seems to be a widespread aversion to all things numerical. The most common complaint is that statistics is ‘too difficult’. The fact is, it isn’t. Or at least, it doesn’t need to be.

I laughed out loud on the train when reading this.

Not only do linguists not like numbers, they are not very good with them. This is exemplified with this textbook, which contains so many errors I get tired of writing notes in my PDF file.

By definition, the two approaches of analysis also differ in another respect: qualitative research is inductive, that is, theory is derived from the research results. Qualitative analysis is often used in preliminary studies in order to evaluate the research area. For example, we may want to conduct focus groups and interviews and analyse them qualitatively in order to build a picture of what is going on. We can then use these findings to explore these issues on a larger scale, for example by using a survey, and conduct a quantitative analysis. However, it has to be emphasized that qualitative research is not merely an auxiliary tool to data analysis, but a valuable approach in itself!

Quantitative research, on the other hand, is deductive: based on already known theory we develop hypotheses, which we then try to prove (or disprove) in the course of our empirical investigation. Accordingly, the decision between qualitative and quantitative methodology will have a dramatic impact on how we go about our research. Figures 2.1 and 2.2 outline the deductive and inductive processes graphically.

At the beginning of the quantitative-deductive process stands the hypothesis (or theory). As outlined below, a hypothesis is a statement about a particular aspect of reality, such as ‘the lower the socio-economic class, the more non-standard features a speaker’s language shows’. The hypothesis is based on findings of previous research, and the aim of our study is to prove or disprove it. Based on a precisely formulated hypothesis, or research question, we develop a methodology, that is, a set of instruments which will allow us to measure reality in such a way that the results allow us to prove the hypothesis right or wrong. This also includes the development of adequate analytic tools, which we will use to analyse our data once we have collected it. Throughout this book, I will keep reminding readers that it is paramount that hypothesis/research question and methodological analytical framework must be developed together and form a cohesive and coherent unit. In blunt terms, our data collection methods must enable us to collect data which actually fits our research question, as do our data analysis tools. This will become clearer in the course of this and subsequent chapters. The development of the methodological-analytical framework can be a time-consuming process; especially in complex studies we need to spend considerable time and effort developing a set of reliable and valid (see below) tools.

The author has perhaps heard some summary of Popper’s ideas about the hypothetico-deductive method. However, Popper aside, science is very much quantitative and inductive. The entire point of meta-analysis is to obtain good summary effects and to check whether results generalize across contexts (i.e. validity generalization à la Hunter and Schmidt).

But that is not all. He then even gets the standard Popperian view wrong. When we confirm a prediction (we pretend to do so, when there is no pre-registration), we don’t prove the theory. That would be an instance of affirming the consequent.

As for qualitative methods being deductive, that makes little sense. I have taken such classes; they usually involve interviews, which are certainly empirical and inductive.

So, in what situations can (and should) we not use a quantitative but a qualitative method? Remember from above that quantitative research is deductive: we base our question and hypotheses on already existing theories, while qualitative research is deductive and is aimed at developing theories. Qualitative research might come with the advantage of not requiring any maths, but the advantage is only a superficial one: not only do we not have a profound theoretical basis, but we also very often do not know what we get – for inexperienced researchers this can turn into a nightmare.

This passage made me very confused because he used deductive twice by error.

[some generally sensible things about correlation and cause]. In order to be able to speak of a proper causal relationship, our variables must show three characteristics:

  • a They must correlate with each other, that is, their values must co-occur in a particular pattern: for example, the older a speaker, the more dialect features you find in their speech (see Chapter Seven).
  • b There must be a temporal relationship between the two variables X and Y, that is, Y must occur after X. In our word-recognition example in Section 2.4, this would mean that for speed to have a causal effect on performance, speed must be increased first, and drop in participants’ performance occurs afterwards. If performance decreases before we increase the speed, it is highly unlikely that there is causality between the two variables. The two phenomena just co-occur by sheer coincidence.
  • c The relationship between X and Y must not disappear when controlled for a third variable.

It has been said that causation implies correlation, but I think this is not strictly true. Another variable could have a statistical suppressor effect, making the correlation between X and Y disappear.

Furthermore, correlation is a measure of a linear relationship, but causation could be non-linear and thus show a non-linear statistical pattern.

Finally, (c) is just wrong. For instance, if X causes A, …, Z and Y, and we control for A, …, Z, the correlation between X and Y will greatly diminish or disappear because we are indirectly controlling for X. A brief simulation:

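library(psych) #provides partial.r()
set.seed(1) #for reproducibility; exact values vary slightly by seed
N = 1000
X = rnorm(N) #define X first: data.frame() cannot refer to its own columns
d_test = data.frame(X = X,
  Y = rnorm(N) + X, #X causes Y
  A1 = rnorm(N) + X, #and also A1-3
  A2 = rnorm(N) + X,
  A3 = rnorm(N) + X)

cor(d_test$X, d_test$Y) #zero-order rXY
partial.r(d_test, 1:2, 3:5) #rXY with A1-3 partialled out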

The rXY is about .71, but the partial correlation of XY is only about .44 (the expected values in this setup are 1/√2 ≈ .71 and 1/√5 ≈ .45). If we added enough control variables we could eventually make rXY drop to ~0.
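To illustrate the last point, here is a sketch that repeats the simulation with a growing number of control variables. It uses base R only (the partial correlation is computed as the correlation of regression residuals), and the function name `partial_r_xy` is made up for this example. In this particular setup, where every variable is X plus unit-variance noise, the expected partial correlation with k controls works out to 1/√(k+2).

```r
set.seed(1)
N <- 10000
X <- rnorm(N)
Y <- rnorm(N) + X  # X causes Y

# Partial correlation of X and Y given k variables that X also causes
partial_r_xy <- function(k) {
  if (k == 0) return(cor(X, Y))
  A <- sapply(seq_len(k), function(i) rnorm(N) + X)
  # Residualize X and Y on the controls, then correlate the residuals
  rx <- resid(lm(X ~ A))
  ry <- resid(lm(Y ~ A))
  cor(rx, ry)
}

ks <- c(0, 3, 10, 50)
observed <- sapply(ks, partial_r_xy)
expected <- 1 / sqrt(ks + 2)  # derivation holds for this setup only
print(round(data.frame(k = ks, observed, expected), 2))
```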

While most people usually consider the first two points when analysing causality between two variables, less experienced researchers (such as undergraduate students) frequently make the mistake to ignore the effect third (and often latent) variables may have on the relationship between X and Y, and take any given outcome for granted. That this can result in serious problems for linguistic research becomes obvious when considering the very nature of language and its users. Any first year linguistics student learns that there are about a dozen sociolinguistic factors alone which influence they way we use language, among them age, gender, social, educational and regional background and so on. And that before we even start thinking about cognitive and psychological aspects. If we investigate the relationship between two variables, how can we be certain that there is not a third (or fourth of fifth) variable influencing whatever we are measuring? In the worst case, latent variables will affect our measure in such a way that it threatens its validity – see below.

This is a minor complaint about language, but I don’t understand why the author calls hidden/missing/omitted variables latent variables, a term which has an altogether different meaning in statistics!

One step up from ordinal data are variables on interval scales. Again, they allow us to label cases and to put them into a meaningful sequence. However, the differences between individual values are fixed. A typical example are grading systems that evaluate work from A to D, with A being the best and D being the worst mark. The differences between A and B are the same as the difference between B and C and between C and D. In the British university grading system, B, C and D grades (Upper Second, Lower Second and Third Class) all cover a stretch of 10 percentage points and would therefore be interval; however, with First Class grades stretching from 70 to 100 per cent and Fail from 30 downwards, this order is somewhat sabotaged. In their purest form, and in order to avoid problems in statistical analysis, all categories in an interval scale must have the same distance from each other.

This is somewhat unclear. What interval variable really means is that the intervals are equal in meaning. A difference of 10 points means the same whether it is between IQ 100 and 110, or between 70 and 80. (If you are not convinced that IQ scores are at least approximately interval-scale, read Jensen 1980 chapter 4).

Laws are hypotheses or a set of hypotheses which have been proven correct repeatedly. We may think of the hypothesis: ‘Simple declarative sentences in English have subject-verb-object word order’. If we analyse a sufficient number of simple English declarative sentences, we will find that this hypothesis proves correct repeatedly, hence making it a law. Remember, however, based on the principle of falsifiability of hypotheses, if we are looking at a very large amount of data, we should find at least a few instances where the hypothesis is wrong – the exception to the rule. And here it becomes slightly tricky: since our hypothesis about declarative sentences is a definition, we cannot really prove it wrong. The definition says that declarative sentences must have subject-verb-object (SVO) word order. This implies that a sentence that does not conform to SVO is not a declarative; we cannot prove the hypothesis wrong, as if it is wrong we do not have a declarative sentence. In this sense it is almost tautological. Hence, we have to be very careful when including prescriptive definitions into our hypotheses.

This paragraph mixes up multiple things. Laws can be both statistical and absolute. Perhaps the most commonly referred to laws are Newton’s laws of motion, which turned out to be wrong and thus are not laws at all. They were however thought to be absolute, thus not admitting any counterexamples. Any solid counter-example disproves a universal generalization.

But then he starts discussing falsifiability and how his example hypothesis was not falsifiable, apparently. The initial wording does not make it seem like a definition however.

A few general guidelines apply universally with regard to ethics: first, participants in research must consent to taking part in the study, and should also have the opportunity to withdraw their consent at any time. Surreptitious data collection is, in essence, ‘stealing’ data from someone and is grossly unethical – both in the moral and, increasingly, in the legal sense.

The author is apparently unaware of how any large internet site works. They all gather data from users without explicit consent. The guideline would also basically ban data scraping, which is increasingly used in quantitative linguistics.

The probably most popular statistical tool is the arithmetic mean-, most of us will probably know it as the ‘average’ – a term that is, for various reasons we will ignore for the time being, not quite accurate. The mean tells us the ‘average value’ of our data. It is a handy tool to tell us where about our data is approximately located. Plus, it is very easy to calculate – I have heard of people who do this in their head! In order to calculate a reliable and meaningful mean, our data must be on a ratio scale. Think about it: how could we possible calculate the ‘average’ of a nominal variable measuring, for example, respondents’ sex? One is either male or female, so any results along the lines of ‘the average sex of the group is 1.7’ makes no sense whatsoever.

Interval scale is sufficient for calculating means and the like.

[Figure: the book’s illustration of the median and quartiles]

The illustration is wrong. Q2 (median) should be in the middle. Q4 (4*25=100th centile), should be at the end.
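A quick check of where the quartiles fall, using made-up data (the integers 1 to 101) and R's default quantile definition:

```r
x <- 1:101

# Q1, Q2 (median), Q3 and Q4 (the 100th centile)
q <- quantile(x, probs = c(0.25, 0.50, 0.75, 1.00))
print(q)

# Q2 coincides with the median and Q4 with the maximum
print(q[["50%"]] == median(x))
print(q[["100%"]] == max(x))
```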

A second approach to probabilities is via the relative frequency of an event. We can approximate the probability of a simple event by looking at its relative frequency, that is, how often a particular outcome occurs when the event if repeated multiple times. For example, when we toss a coin multiple times and report the result, the relative frequency of either outcome to occur will gradually approach 0.5 – given that it is an ‘ideal’ coin without any imbalance. Note that it is important to have a very large number of repetitions: if we toss a coin only 3 times, we will inevitably get 2 heads and 1 tail (or vice versa) but we obviously must not conclude that the probability for heads is P(head) = 2/3 (0.66). However, if we toss our coin say 1,000 times, we are likely to get something like 520 heads and 480 tails. P(heads) is then 0.52 – much closer to the actual probability of 0.5.

Apparently, the author forgot one can also get 3 of a kind.
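For a fair coin tossed 3 times, the binomial distribution gives the probability of each outcome. A small sketch:

```r
heads <- 0:3

# Probability of 0, 1, 2 and 3 heads in 3 tosses of a fair coin
p <- dbinom(heads, size = 3, prob = 0.5)
names(p) <- paste(heads, "heads")
print(p)

# Three of a kind means all tails or all heads
p_three_of_a_kind <- p[["0 heads"]] + p[["3 heads"]]
print(p_three_of_a_kind)
```

So a 2-1 split is the most likely result, but it is far from inevitable: a quarter of the time all three tosses come out the same.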

Lastly, there may be cases where we are interested in a union event that includes a conditional event. For example, in our Active sample we know that 45% are non-native speakers, that is, P(NNS)=0.4.

Numerical inconsistency.

Excel (or any other software you might use) has hopefully given you a result of p = 0.01492. But how to interpret it? As a guideline, the closer the p-value (of significance level – we will discuss this later) is to zero, the less likely our variables are to be independent. Our p-value is smaller than 0.05, which is a very strong indicator that there is a relationship between type of words produced and comprehension/production. With our result from the chi-test, we can now confidently say that in this particular study, there is a rather strong link between the type of word and the level of production and comprehension. And while this conclusion is not surprising (well, at least not for those familiar with language acquisition), it is a statistically sound one. Which is far better than anecdotal evidence or guessing.

A p value of .01 is not a “very strong indicator”. Furthermore, it is not an effect size measure (although it is probably correlated with effect size). Concluding large effect size (“strong link”) from small p value is a common fallacy.
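To see why a small p value does not imply a large effect, compare the p values for a tiny correlation in a huge sample and a large correlation in a small sample. This sketch uses the standard t-test for a Pearson correlation; the helper name `p_from_r` and the chosen r and n values are made up for illustration.

```r
# Two-sided p value for a Pearson correlation r based on n pairs
p_from_r <- function(r, n) {
  t <- r * sqrt((n - 2) / (1 - r^2))
  2 * pt(-abs(t), df = n - 2)
}

p_small_effect_big_n <- p_from_r(0.05, 10000)  # tiny effect, huge sample
p_big_effect_small_n <- p_from_r(0.50, 10)     # large effect, small sample

print(p_small_effect_big_n)
print(p_big_effect_small_n)
```

The tiny effect is highly "significant" and the large one is not, because the p value depends on sample size as well as effect size.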

With the result of our earlier calculations (6.42) and our df (1) we now go to the table of critical values. Since df=1, we look in the first row (the one below the header row). We can see that our 6.42 falls between 3.841 and 6.635. Looking in the header row, we can see that it hence falls between a significance level of 0.05 and 0.01, so our result is significant on a level of 0.05. If you remember, Excel has given us a p value of 0.014 – not quite 0.01, but very close. As before, with p<0.05, we can confidently reject the idea that the variables are independent, but suggest that they are related in some way.

Overconfidence in p-values.

In an ‘ideal’ world, our variables are related in such a way that we can connect all individual points to a straight line. In this case, we speak of a p e r f e c t c o r r e l a t i o n between the two variables. As we can see in Figures 7.1 and 7.2, the world isn’t ideal and the line is not quite straight, so the correlation between age of onset and proficiency is not perfect – although it is rather strong! Before we think about calculating the correlation, we shall consider five basic principles about the correlation of variables:

1 the Pearson correlation coefficient r can have any value between (and including) -1 and 1, that is, -1 < r <, 1.

2 r = 1 indicates a p e r f e c t p o s it iv e c o r r e l a t i o n , that is, the two variables increase in a linear fashion, that is, all data points lie is a straight line – just as if they were plotted with the help of a ruler.

3 r = -1 indicates a p e r f e c t n e g a t iv e c o r r e l a t i o n , that is, while one variable increases, the other decreases, again, linearly.

4 r=0 indicates that the two variables to not correlate at all. That is, there is no relationship between them.

5 The Pearson correlation only works for normally distributed (or ‘parametric’) data (see Chapter Six) which is on a ratio scale (see Chapter Two) – we discuss solutions for non-parametric distributed data in Chapter Nine.

(Ignore the formatting error.)

(4) is wrong. Correlation of 0 does not mean no relationship. The relationship could be non-linear (note that (2-3) themselves specify linearity).
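A simulated example of a strong relationship with a Pearson correlation near 0. Here Y is almost entirely determined by X, but through a quadratic function; the data are made up for illustration.

```r
set.seed(1)
x <- rnorm(10000)
y <- x^2 + rnorm(10000, sd = 0.1)  # y depends strongly on x, but not linearly

r_linear <- cor(x, y)          # close to 0
r_squared_term <- cor(x^2, y)  # close to 1

print(round(c(r_linear = r_linear, r_squared_term = r_squared_term), 2))
```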

We said above that the correlation coefficient does not tell us anything about causality. However, for reasons whose explanations are beyond the scope of this book, if we square the correlation coefficient, that is, multiply it with itself, the so-called R2 (R squared) tells us how much the independent variable accounts for the outcome of the dependent variable. In other words, causality can be approximated via R2 (Field 2000: 90). As reported, Johnson and Newport (1991) found that age of onset and L2 proficiency correlate significantly with r = -0.63. In this situation it is clear that if there was indeed causality, it would be age influencing L2 proficiency – and not vice versa. Since we are not provided with any further information or data about third variables, we cannot calculate the partial correlation and have to work with r=-0.63. If we square r, we get -0.63 x -0.63 = 0.3969. This can be interpreted in such a way that age of onset can account for about 40 per cent of the variability in L2 acquisition.

This is pretty good, but it also shows us that there must be something else influencing the acquisition process. So, we can only say that age has a comparatively strong influence on proficiency, but we cannot say that is causes it. The furthest we could go with R2 is to argue that if there is causality between age and proficiency, age only causes around 40% ; but to be on the safe side, we should omit any reference to causality altogether.

If you think about it carefully, if age was the one and only variable influencing (or indeed causing) L2 acquisition, the correlation between the two variables should be perfect, that is, r=1 (or r=-1) – otherwise we would never get an R2 of 1! In fact, Birdsong and Molis (2001) replicated Johnson and Newport’s earlier (1989) study and concluded that exogenous factors (such as expose and use of the target language) also influence the syntactic development. And with over 60% of variability unaccounted for, there is plenty of space for those other factors!

A very confused text. Does correlation tell us about causality or not? Partially? A correlation between X and Y increases the probability that there is causality between X and Y, but not by that much.

Throughout this chapter, I have repeatedly used the concept of significance, but I have yet to provide a reasonable explanation for it. Following the general objectives of this book of being accessible and not to scare off readers mathematically less well versed, I will limit my explanation to a ‘quick and dirty’ definition.

Statistical significance in this context has nothing to do with our results being of particular important – just because Johnson and Newport’s correlation coefficient comes with a significance value of 0.001 does not mean it is automatically groundbreaking research. Significance in statistics refers to the probability of our results being a fluke or not; it shows the likelihood that our result is reliable and has not just occurred through the bizarre constellation of individual numbers. And as such it is very important: when reporting measures such as the Pearson’s r, we must also give the significance value as otherwise our result is meaningless. Imagine an example where we have two variables A and B and three pairs of scores, as below:

[table “” not found /]

If we calculate the Pearson coefficient, we get r=0.87, this is a rather strong positive relationship. However, if you look at our data more closely, how can we make this statement be reliable? We have only three pairs of scores, and for variable B, two of the three scores are identical (viz. 3), with the third score is only marginally higher. True, the correlation coefficient is strong and positive, but can we really argue that the higher A is the higher is B? With three pairs of scores we have 1 degree of freedom, and if we look up the coefficient in the table of critical values for df=1, we can see that our coefficient is not significant, but quite simply a mathematical fluke.

Statistical significance is based on probability theory: how likely is something to happen or not. Statistical significance is denoted with p, with p fluctuating between zero and 1 inclusive, translating into zero to 100 per cent. The smaller p, the less likely our result is to be a fluke.

For example, for a Pearson correlation we may get r=0.6 and p=0.03. This indicates that the two variables correlate moderately strongly (see above), but it also tells us that the probability of this correlation coefficient occurring by mere chance is only 3%. To turn it round, we can be 97% confident that our r is a reliable measure for the association between the two variables. Accordingly, we might get a result of r=0.98 and p=0.48. Again, we have a very strong relationship; however, in this case the probability that the coefficient is a fluke is 48% – not a very reliable indicator! In this case, we cannot speak of a real relationship between the variables, but the coefficient merely appears as a result of a random constellation of numbers. We have seen a similar issue when discussing two entirely different data sets having the same mean. In linguistics, we are usually looking for significance levels of 0.05 or below (i.e. 95% confidence), although we may want to go as high as p=0.1 (90% confidence). Anything beyond this should make you seriously worried. In the following chapter, we will come across the concept of significance again when we try and test different hypotheses.

Some truth mixed with falsehoods. A p value is the probability of getting the observed result or a more extreme one, given that the null hypothesis (r=0 in this case) is true. It is not a measure of scientific importance.

Nor is it a measure of the reliability of finding the value again if we re-do the experiment. That probability is usually much lower, and it depends on statistical power/sample size.

No, a p value >.05 does not imply it was a true negative. It could be a false negative. Especially when sample size is 3!
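A sketch of how little a sample of 3 can tell us. The first part computes the two-sided p value for the book's r=0.87 with n=3 using the standard t-test for a correlation. The second part is a simulation under an assumed true correlation of 0.5 (a value chosen only for illustration) that counts how often samples of size 3 reach p < .05. The names `p_from_r`, `rho` and `power_n3` are made up for this example.

```r
# Two-sided p value for a Pearson correlation r based on n pairs
p_from_r <- function(r, n) {
  t <- r * sqrt((n - 2) / (1 - r^2))
  2 * pt(-abs(t), df = n - 2)
}

p_n3 <- p_from_r(0.87, 3)
print(round(p_n3, 2))  # about .33, far from significant

# Simulation: the true correlation is real (rho = 0.5), but n = 3
set.seed(1)
rho <- 0.5
sim_p <- replicate(5000, {
  x <- rnorm(3)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(3)
  cor.test(x, y)$p.value
})
power_n3 <- mean(sim_p < 0.05)  # share of samples detecting the true effect
print(power_n3)
```

A non-significant result here says almost nothing about whether the relationship is real, since most samples of this size miss a true effect.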

The interpretation of the result is similar to that of a simple regression: in the first table, the Regression Statistics tell us that our four independent variables in the model can account for 95% of the variability of vitality, that is, the four variables together have a massive impact on vitality. This is not surprising, seeing that vitality is a compound index calculated from those very variables. In fact, the result for R2 should be 100%: with vitality being calculated from the four independent variables, there cannot be any other (latent) variable influencing it! This again reminds us that we are dealing with models – a simplified version of reality. The ANOVA part shows us that the model is highly significant, with p=0.00.

p values cannot be 0.

The task of our statistical test, and indeed our entire research, is to prove or disprove H0. Paradoxically, in reality most of the time we are interested in disproving H0, that is, to show that in fact there is a difference in the use of non-standard forms between socio-economic classes. This is called the alternative hypothesis H1 or HA. Hence:

One cannot prove the null hypothesis using NHST. The best one can do with frequentist stats is to show that the confidence interval is narrow and lies close to 0.

The most important part for us is the ‘P(F<=f) one-tail’ row: it tells us the significance level. Here p=0.23. This tells us that the variances of our two groups are not statistically significantly different; in other words, even though the variance for CG is nominally higher, we can disregard this and assume that the variances are equal. This is important for our choice of t-test.

I seem to recall that the approach of first testing for equal variances and then choosing a t-test is problematic, and that it is better to just always use the robust (unequal variances) test. I don’t recall the source tho.
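In R, the unequal-variances (Welch) version is what `t.test()` runs by default, so no preliminary variance test is needed. A sketch with simulated scores for two groups; the group names echo the book's CG/EG, but the means, SDs and sample sizes are made up.

```r
set.seed(1)
cg <- rnorm(30, mean = 50, sd = 12)  # control group, larger variance
eg <- rnorm(30, mean = 55, sd = 8)   # experimental group

welch <- t.test(eg, cg)                      # default: var.equal = FALSE
student <- t.test(eg, cg, var.equal = TRUE)  # classic equal-variances test

print(welch$method)
print(c(welch_df = unname(welch$parameter),
        student_df = unname(student$parameter)))
print(c(welch_p = welch$p.value, student_p = student$p.value))
```

When the variances really are equal the two tests give nearly the same answer, so little is lost by always using the Welch version.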

Is this really sufficient information to argue that the stimulus ‘knowledge activation’ results in better performance? If you spend a minute or so thinking about it, you will realise that we have only compared the two post-test scores. It does not actually tell us how the stimulus influenced the result. Sadighi and Zare have also compared the pre-test scores of both groups and found no significant difference. That is, both groups were at the same proficiency level before the stimulus was introduced. Yet, we still need a decent measure that tells us the actual change that occurs between EG and CG, or rather: within EG and CG. And this is where the dependent t-test comes into play.

Another instance of p > alpha ergo no differences fallacy.

The really interesting table is the second one, especially the three rightmost columns labelled F, P-value and Fcrit. We see straightaway that the p-value of p=0.25 is larger than p = 0.05, our required significance level. Like the interpretation of the t-test, we use this to conclude that there is no significant difference in means between the treatment groups. The means may look different, but statistically, they are not. In yet other words, the different type of treatment our students receive does not result in different performance.

Same as above, but note that the author got it right in the very first sentence!

As with most things in research, there are various ways of conducting statistical meta-analyses. I will discuss here the method developed by Hunter and Schmidt (for a book-length discussion, see Hunter and Schmidt 2004: – not for the faint-hearted!). The Hunter-Schmidt method uses the Pearson correlation coefficient r as the effect size. If the studies in your pool do not use r (and quite a few don’t), you can convert other effect sizes into r – see, for example, Field and Wright (2006). Hunter and Schmidt base their method on the assumption of random effect sizes – it assumes that different populations have different effect sizes.

Given the number of errors, I was not expecting to see a reference to that book! Even if the author is mathematically incompetent, he gets the good stuff right. :P

Our true effect size of r= -0.68 lies between these boundaries, so we can consider the result to be significant at a 95% confidence level. So, overall, we can say that our meta-analysis of the AoO debate indicated that there is indeed a strong negative correlation between age of onset and attainment, and that this relationship is statistically significant.

That’s not how confidence intervals work…
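The point estimate always lies inside its own confidence interval, so that fact says nothing about significance. What matters is whether the interval excludes 0. Here is a sketch using the ordinary Fisher z interval for a single correlation. This is not the Hunter-Schmidt procedure, and the sample sizes (50 and 5) are made up since the book's are not given here.

```r
# Fisher z confidence interval for a Pearson correlation
ci_r <- function(r, n, level = 0.95) {
  z <- atanh(r)                        # Fisher transformation
  se <- 1 / sqrt(n - 3)                # standard error on the z scale
  crit <- qnorm(1 - (1 - level) / 2)
  tanh(z + c(-1, 1) * crit * se)       # back-transform to the r scale
}

ci_large <- ci_r(-0.68, 50)
ci_small <- ci_r(-0.68, 5)

# The estimate is inside both intervals...
print(ci_large)
print(ci_small)

# ...but only the larger sample gives an interval that excludes 0
print(c(large_excludes_0 = ci_large[2] < 0,
        small_excludes_0 = ci_small[2] < 0))
```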

I had the impression that, since recognition of [problem] dates back at least to [person from a long time ago], there was a voluminous literature and [statistics to deal with the problem] was a solved problem, so I’m a little troubled that you seem to be trying to invent your own methods and aren’t citing much in the way of prior work.

This anonymous critique is saying that I’m not building on top of what there already is but instead re-inventing the wheel, perhaps even a square one. Underlying the criticism is a view of scientific progress as accumulating knowledge over time. We know more stuff now than we used to (but some things we think we know now we still get wrong!) and this is because new scientists don’t just start finding out how everything works (the goal of science) from scratch, but instead read what research has already been done and try to build on top of that. At least that is the general idea. However, we know from actual scientific practice that scientists often don’t build on top of prior work, perhaps because the body of prior work is already so large that having an overview of it is beyond current human cognitive capacity. Alternatively, it may be because the prior work is often inaccessible, badly structured, not searchable, etc. Other times scientists are just lazy.

The first problem is in principle unsolvable because improving human cognitive ability/capacity will accelerate the accumulation of knowledge. However, we will (very soon) improve upon the present situation (Shulman & Bostrom, 2014).

The second problem is faster to fix, requiring either ubiquitous open access or guerrilla open access. The first option is coming along fast for new material, but won’t solve it for old material already locked down by copyright. Probably Big Copyright is going to lobby for extending copyright protection further, which means that even just waiting for copyright to expire is not a legal option.

A delicious example of scientists not building on top of relevant prior works is the concept of construct proliferation (Reeve & Basalik, 2014), which is when we invent a new word/concept to cover the same region in conceptual space as previous concepts already covered. This is itself a redundant copy of the earlier term construct redundancy. This meta-problem is fairly obvious, so my guess is that there is a long list of terms for it, thus illustrating itself.

Yet I argue the opposite…

Given the above, why would one willingly want to not read the earlier literature/build on top of prior work on a topic before trying to find solutions? There are some possible reasons:

One reason is personal. Perhaps one just really likes the experience of finding an interesting problem and coming up with solutions. This is closely related to a couple of concepts: openness to experience, typical intellectual engagement, need for cognition, epistemic curiosity (and more), see (Mussel, 2010) and (Stumm, Hell, & Chamorro-Premuzic, 2011). Incidentally these also show strong concept overlap (this is yet another term to refer to the situation where multiple concepts cover some of the same area in conceptual space, however it is different in that it is explicitly continuous instead of categorical).

A career reason to invent new constructs is a desire to make a name for yourself and get a good job. A well-tested way to do that is to introduce a new concept and accompanying questionnaire that others then hopefully use. This can result in hundreds or thousands of citations. For instance, the original paper for need for cognition has 5063 citations on Scholar since 1982 / 153 per year, the original paper for typical intellectual engagement has 410 citations since 1992 / 18 per year, and that for epistemic curiosity has 156 citations since 2003 / 13 per year. The later papers do have lower citation counts per year, perhaps indicating some conceptual satiation, but the papers are still way above the norm. To put it another way, since it is clearly unnecessary to read much of the relevant prior work to get published, one may as well skip this.

Scientifically speaking, neither of the above two reasons is relevant. The first has more to do with a personality disposition towards solving new problems, whereas the second is due, to some degree, to perverse incentives.

Exploratory bridge building

Are there any good scientific reasons to sometimes start from scratch? I think so. Think of it this way: Many scientific questions can be approached in multiple ways. We can build a large analogy out of that idea.

Imagine a many-dimensional space where some regions are impassable or slow to pass, and where there are one or more regions or points from which useful resources can be extracted. We, the bridge engineers, start somewhere in this space (all in the same place) and have to find resources but we don’t know exactly where they will be found, so we don’t know exactly which directions to move in. Furthermore, imagine that we can build bridges (vectors) in this space by adding them together and that we can only move on the bridges (or in them). This means that one can now travel in a particular direction, at least slowly. If the resources are far away from the beginning position, it is easy to see that one could never reach them without adding vectors together. This forms the basis of the general preference for building on prior work.

How do we know which direction to build bridges in if we don’t know where the resources are? We can expand the analogy further by saying that no one has the ability to see further than a short distance. Instead, what engineers have is a noisy measure of how close their current position is to the nearest resource, and their measures don’t even agree perfectly with each other. Noisy here means that the measures are only roughly correct, to varying degrees and with different biases. Sometimes what appears to be a good general direction towards a resource ends up in a resource-poor dead end, i.e. all directions that move closer to the nearby resources lead thru impassable or difficult-to-pass regions.

Those familiar with evolutionary biology should now see where I’m going with this. If we reduce my analogy, we can say that approaches to answers in science can end up in local maximums in the science fitness landscape. When this happens, one has to go back and move in a new direction somewhere.

Still, this leaves us with the question of how far we should move back. Often it may be necessary to go back only some of the way and start a new branch of the same root bridge from that point. Sometimes, however, a very early part of the bridge moved into a region that can only result in slow progress or even a dead end. When this happens, one has to start over entirely.
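A toy version of this can be sketched in a few lines. The landscape below is a made-up one-dimensional stand-in for the many-dimensional space: it has a low peak near position 20 and a higher one near position 70, and the starting position stands in for the choice of initial direction. The greedy climber only ever steps to a better neighbouring position, which is my simplification of the short-sighted engineer, not anything derived from real data.

```python
def fitness(x):
    """Hypothetical landscape: a small peak at x=20 and a larger one at x=70."""
    local_peak = 5.0 - 0.25 * abs(x - 20)
    global_peak = 10.0 - 0.25 * abs(x - 70)
    return max(local_peak, global_peak, 0.0)

def climb(start, lo=0, hi=100):
    """Greedy hill climbing: step to the better neighbour until none is better."""
    x = start
    while True:
        neighbours = [n for n in (x - 1, x + 1) if lo <= n <= hi]
        best = max(neighbours, key=fitness)
        if fitness(best) <= fitness(x):
            return x  # no neighbour improves on x: a (possibly local) maximum
        x = best

first_try = climb(10)      # the original bridge
partial_back = climb(30)   # backing up only part of the way
fresh_start = climb(50)    # starting over somewhere else entirely
print(first_try, partial_back, fresh_start)
```

With these numbers, backing up only part of the way leads to the same small peak as the first attempt, whereas the fresh start reaches the higher one. How far back is far enough depends entirely on where the valley between the peaks lies, which is exactly what the short-sighted engineer cannot see.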

Decision making

Because all engineers are short sighted, it is impossible for them to know when it is time to start over. Worse than that, engineers have a kind of tunnel vision such that once they have traveled out on a given bridge from their homeland, they will be less capable of spotting good directions to build other root bridges from. In other words, once one has learned of a particular approach to a problem, it can be difficult to go “back to basics” and start over with new ideas. One needs a pair of fresh eyes. The only way to do this is to get an engineer who has never been to this space before, avoid informing him of the already built bridges, let him choose where to build his first bridge, and let him work on it for some time to see if he ends up in a dead end or a previously unknown resource-rich area. Even if the engineers have already found one good resource region, they might wonder whether there are more. Finding more resources probably requires moving in a new direction from the beginning or at least from an early part of the bridge.


It is clear that as a large team project neither extreme solution is optimal: 1) always building on prior work, or 2) never building on prior work. Instead, some balance must be found where some, probably most, engineers are dedicated to building on top of the fairly recent prior work, but some engineers should try to backtrack and see if they can find a better route to a currently known resource area or identify new areas.

Who should start new bridges? We may posit that the engineers vary in their psychological attributes in ways that affect how efficiently they build on prior bridges or start their own root bridges/branches. In that case, engineers who are particularly good at spotting new directions and working on their own bridge alone would be good for the role of pioneer/Rambo engineers. Even if the engineers do not differ in their efficiency at building new branches/roots versus building on top of prior work, if only a few engineers are inclined to work alone, perhaps finding new resources (reason #1), the team is in the optimal situation where most build on fairly recent prior work but some don’t.


Given the abstractness of the space bridge engineer analogy, one should probably do a visualization, or maybe even a small computer game. The latter is beyond my coding ability for the time being and the former requires more time than I have.