Van Ijzendoorn, M. H., Juffer, F., & Poelhuis, C. W. K. (2005). Adoption and cognitive development: a meta-analytic comparison of adopted and nonadopted children’s IQ and school performance. Psychological bulletin, 131(2), 301.

It turns out that someone already did a meta-analysis of adoption studies and cognitive ability. It does not solely include cross-country transracial adoptions, but it does include some. They report both country of origin and country of adoption, so it is fairly easy to find the studies that one wants to take a closer look at. It is fairly inclusive in what counts as cognitive development, e.g. school results and language tests count, as well as regular IQ tests. They report standardized differences (d), so results are easy to understand.

They do not present aggregated results by country of origin, however, so one would have to do that oneself. I haven’t done it (yet?), but the method to do so is this:

  1. Obtain the country IQs for all countries in the study. These are readily available from Lynn & Vanhanen (2012) or in the international megadataset.
  2. Score all the outcomes by using the adoptive country’s IQ. E.g. if the US has a score of 97, and Koreans adopted into that country get a d score of .16 in “School results”, as they do in the first study listed, then this corresponds to a school IQ performance of 97 – .16 × 15 = 97 – 2.4 = 94.6. Note that this assumes that the comparison sample is unselected (not smarter than average). This is likely false because adoptive parents tend to be higher social class and presumably smarter, so they would send their (adoptive) children to above average schools. Also be careful about norm comparisons because they often use older norms and the Flynn effect thus results in higher IQ scores for the adoptees.
  3. Copy relevant study characteristics from the table, e.g. comparison group, sample sizes, age of assessment and type of outcome (school, language, IQ, etc.).
  4. Repeat steps (2-3) for all studies.
  5. BONUS: Look for additional studies. Do this by a) contacting the authors of recent papers and the meta-analysis, b) searching for more using Google Scholar or another academic search engine, c) looking thru the studies that cite the included studies for more relevant studies.
  6. BONUS: Get someone else to independently repeat steps (2-3) for the same studies. This checks interrater consistency.
  7. Aggregate results (weighted mean of various kinds).
  8. Correlate aggregated results with origin countries’ IQs to check for spatial transferability, a prediction of genetic models.
  9. Do regression analyses to see if study characteristics predict outcomes.
  10. Write it up and submit to Open Differential Psychology (good guy) or Intelligence (career-focused bad guy). Write up to Winnower or Human Varieties if lazy or too busy.
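
Steps 2, 7 and 8 can be sketched in a few lines of Python. This is only a minimal illustration, assuming an IQ scale with SD 15 and that a positive d means the adoptees score below the comparison group, as in the worked example in step 2. The function names and the choice of sample-size weights are illustrative, not taken from the meta-analysis.

```python
from math import sqrt

IQ_SD = 15  # assumed SD of the IQ scale


def d_to_iq(adoptive_country_iq, d):
    """Step 2: express an outcome on the IQ scale of the adoptive country.

    Assumes a positive d means the adoptees score below the comparison
    group, as in the worked example (97 - 0.16 * 15 = 94.6).
    """
    return adoptive_country_iq - d * IQ_SD


def weighted_mean(values, weights):
    """Step 7: aggregate scores, e.g. weighting each study by sample size."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def pearson_r(xs, ys):
    """Step 8: correlate aggregated adoptee scores with origin-country IQs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Worked example from step 2: adoptive-country IQ of 97, d = .16
print(round(d_to_iq(97, 0.16), 1))  # 94.6
```

Other weighting schemes (e.g. inverse variance) would slot into `weighted_mean` by changing the weights passed in.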

The main results table

More likely, you are too lazy to do the above, but you want to take a sneak peek at the results. Here’s the main table from the paper.

Study Country/region of study Country/region of child’s origin Age at assessment (years) Age at adoption (months) N Adoption N Comparison Preadoption status Comparison group Outcome (d)
Andresen (1992) Norway Korea 12-18 12-24 135 135 Not reported Classmates School results 0.16 Language 0.09
Benson et al. (1994) United States United States 12-18 < 15 881 Norm Not reported Norm group School results —0.36
Berg-Kelly & Eriksson (1997) Sweden Korea/India 12-18 < 12 125 9204 Not reported General population School results 0.03 f/—0.04 m Language —0.02 f/—0.05 m
Bohman (1970) Sweden Not reported 12-18 < 12 160 1819 Not reported Classmates School results 0.09 f/0.07 m Language 0.02 f/—0.02 m Learning problems 0.00
Brodzinsky et al. (1984) United States United States 4-12 < 12 130 130 Not reported General population School competence 0.62 f/0.51 m
Brodzinsky & Steiger (1991) United States Not reported 9-19 441 6753 Not reported Population % School failure 0.76
Bunjes & de Vries (1988) Netherlands Korea 4-12 12-24 118 236 Not reported Classmates School results 0.24 Language 0.22
Castle et al. (2000) England England 4-12 < 12 52 Norm Not reported Standardized scores School results —0.47, IQ 0.47
Clark & Hanisee (1982) United States Vietnam 0-4 12-24 25 Norm Not reported Standardized scores IQ -2.42
Colombo et al. (1992) Chile Chile 4-12 0-12 16 ii Undernutrition Biological siblings IQ -1.16
Cook et al. (1997) Europe Not reported 4-8 12-24 131 125 Not reported General population School competence 0.56 f/0.16 m
Dalen (2001) Norway Korea 12-18 0-12 193 193 Not reported Classmates School results 0.47 (Colombia), -0.07 (Korea) Language 0.43 (Colombia), -0.05 (Korea) Learning problems 0.50
Dennis (1973) United States Lebanon 2-18 > 24 85 51 Institute Institute children IQ —1.28 (intraracial), —1.36 (transracial)
De Jong (2001) New Zealand Romania/Russia 4-15 12-24 116 Norm Some problems General population School competence 0.65
Duyme (1988) France France 12-18 < 12 87 14951 Not reported General population School results 0.00
Fan et al. (2002) United States United States 12-18 514 17241 Not reported General population School grades —0.02
Feigelman (1997) United States Not reported 8-21 101 6258 Not reported General population Education level —0.03
Fisch et al. (1976) United States United States 4-12 < 12 94 188 No problems General population IQ 0.00 School results 0.50 Language 0.52
Frydman & Lynn Belgium Korea 4-12 12-24 19 Norm Not reported Standardized scores IQ -1.68
Gardner et al. (1961) United States Not reported 12-18 < 12 29 29 Not reported Classmates School achievement 0.09
Geerars et al. (1995) Netherlands Thailand 12-18 < 12 68 Norm Not reported Population % School results 0.19
Hoopes et al. (1970) United States United States 12-18 100 100 1-2 shifts in placement General population IQ 0.12
Hoopes (1982) United States United States 4-12 < 12 260 68 Nothing special General population IQ 0.18
Horn et al. (1979) United States United States 3-26 < 1 469 164 No problems Environment siblings IQ 0.17/0.34/—0.05
W. J. Kim et al. (1992) United States Not reported 12-18 43 43 Not reported General population School results 0.74
W. J. Kim et al. (1999) United States Korea 4-12 < 12 18 9 Nothing special Environment siblings School competence —0.39
Lansford et al. (2001) United States Not reported 12-18 111 200 Not reported General population School grades 0.46
Leahy (1935) United States United States 5-14 < 6 194 194 Not reported General population School grades 0.00 IQ -0.06
Levy-Shiff et al. (1997) Israel Israel South America 7-13 < 3 5050 Norm Not reported Standardized scores IQ -1.10 f/-2.00 m
Lien et al. (1977) United States Korea 12-18 > 24 240 Norm Undernutrition Standardized scores IQ 0.00
Lipman et al. (1992) Canada Not reported 4-16 104 3185 Not reported General population School performance —0.05 f/0.16 m
McGuinness & Pallansch (2000) United States Soviet Union 4-12 > 24 105 1000 Long time in orphanages Norm group School competence 0.46
Moore (1986) United States United States 7-10 12-24 23 Norm Not reported Standardized scores IQ -0.00 f/—1.00 m
Morison & Ellwood (2000) Canada Romania 4-12 12-24 59 35 Orphanages General population IQ 1.45 (combined)
Neiss & Rowe (2000) United States 75% United States 12-18 392 392 Not reported General population IQ 0.08
O’Connor et al. (2000) England Romania 6 0-42 207 Norm Orphanage Standardized scores IQ —0.56 (combined)
Palacios & Sanchez (1996) Spain Spain 4-12 > 24 210 314 Not reported Institute children School competence —0.18
Pinderhughes (1998) United States United States 8-15 24—48 66 33 Older children General population School competence 0.64 (combined)
Plomin & DeFries (1985) United States United States 1 0-5 182 182 Not reported General population IQ 0.14
Priel et al. (2000) Israel 75% Israel 8-12 12-24 50 80 Not reported General population School competence 0.77 f/1.12 m
Rosenwald (1995) Australia 73% Korea Asia South America 4-16 < 12 283 2583 Not reported General population School performance -0.18
Scarr & Weinberg (1976) United States 88% United States 4-16 < 12 176 145 Not reported Environment siblings IQ 0.75 (combined)
Schiff et al. (1978) France France 4-12 <12 32 20 Not reported Biological siblings School results —0.70
Segal (1997) United States United States 4-12 < 12 6 6 Not reported Environment siblings IQ -1.14 IQ 2.67
Sharma et al. (1996) United States 81% United States 12-18 12-24 4682 4682 Not reported General population School results 0.37 (combined)
Sharma et al. (1998) United States United States 12-18 < 12 629 72 Not reported Environment School competence —0.45 f/—0.61 m
Silver (1970) United States Not reported 4-12 < 3 10 70 Not reported General population Learning problems 1.21
Silver (1989) United States Not reported 4-12 39 Perc. Not reported General population Learning problems 1.38
Skodak & Skeels (1949) United States Not reported 12-18 < 6 100 100 Not reported Standardized scores IQ -1.12
Smyer et al. (1998) Sweden Not reported Adults < 12 60 60 Not reported Biological (twin siblings) Education level —0.82
Stams et al. (2000) Netherlands Sri Lanka 4-12 < 6 159 Norm Not reported Standardized scores School results 0.33 IQ -0.34 f/-0.73 m Learning problems -0.05
Teasdale & Owen (1986) Denmark Not reported Adults < 12 302 4578 Not reported General population IQ 0.35 Education level 0.32
Tizard & Hodges (1978) England Not reported 8 > 24 25 14 Not reported Restored children IQ —0.40 (older), —0.62 (younger)
Tsitsikas et al. (1988) Greece Greece 5-6 < 12 72 72 Not reported Classmates IQ 0.64, school performance 0.29 Language 0.30
Verhulst et al. (1990) Netherlands Europe 12-18 > 24 2148 933 Not reported General population Perc. special education 0.25 f/0.29 m
Versluis-den Bieman & Verhulst (1995) Netherlands Europe 12-18 > 24 1538 Norm Not reported General population School competence 0.28 f/0.41 m
Wattier & Frydman (1985) Belgium Korea 89% 4-12 12-24 28 Norm Not reported Standardized scores IQ -0.06
Westhues & Cohen (1997) Canada Korea 40% India 40% South America 12-18 12-24 134 83 Not reported Environment siblings School performance 0.13
Wickes & Slate (1997) United States Korea > 18 > 36 174 Norm Not reported Norm group School results 0.09 f/0.07 m Language 0.07 f/0.03 m
Winick et al. (1975) United States Korea 4-12 > 24 112 Norm Malnourished Standardized scores School performance 0.00 IQ 0.00
Witmer et al. (1963) United States United States 12-18 < 12 484 484 Nothing special Classmates School performance 0.00 IQ 0.00

I found this one a long time ago and tweeted it, but apparently forgot to blog it.

Odenstad, A., Hjern, A., Lindblad, F., Rasmussen, F., Vinnerljung, B., & Dalen, M. (2008). Does age at adoption and geographic origin matter? A national cohort study of cognitive test performance in adult inter-country adoptees. Psychological Medicine, 38(12), 1803-1814.

Background Inter-country adoptees run risks of developmental and health-related problems. Cognitive ability is one important indicator of adoptees’ development, both as an outcome measure itself and as a potential mediator between early adversities and ill-health. The aim of this study was to analyse relations between proxies for adoption-related circumstances and cognitive development.
Method Results from global and verbal scores of cognitive tests at military conscription (mandatory for all Swedish men during these years) were compared between three groups (born 1968–1976): 746 adoptees born in South Korea, 1548 adoptees born in other non-Western countries and 330 986 non-adopted comparisons in the same birth cohort. Information about age at adoption and parental education was collected from Swedish national registers.
Results South Korean adoptees had higher global and verbal test scores compared to adoptees from other non-European donor countries. Adoptees adopted after age 4 years had lower test scores if they were not of Korean ethnicity, while age did not influence test scores in South Koreans or those adopted from other non-European countries before the age of 4 years. Parental education had minor effects on the test performance of the adoptees – statistically significant only for non-Korean adoptees’ verbal test scores – but was prominently influential for non-adoptees.
Conclusions Negative pre-adoption circumstances may have persistent influences on cognitive development. The prognosis from a cognitive perspective may still be good regardless of age at adoption if the quality of care before adoption has been ‘good enough’ and the adoption selection mechanisms do not reflect an overrepresentation of risk factors – both requirements probably fulfilled in South Korea.

I summarize and comment on the findings below:

Which adoptees?

In total, 2294 inter-country adoptees were born outside the Western countries (Europe, North America and Australia) and adopted before age 10 years. Of these, 746 were born in South Korea [Korean adoptee (KA) group]. The remaining 1548 individuals were born in other countries, Non-Korean adoptee (NKA) group. India was the most common country of origin, followed by Thailand, Chile, Ethiopia, Colombia and Sri Lanka. These were the only donor countries for which the number of adoptees included in this study exceeded 100. The non-adopted population (NAP) group consisted of non-adopted individuals born in Sweden (n=330 896).

Unfortunately, no more detailed information is given, so an origin country IQ x adoptee IQ study (spatial transferability) cannot be done.

Main results


We see that Korean adoptees do better than Swedes, even on the verbal test. The superiority stops being statistically significant (p<alpha) when they control for various things. Notice that the disadvantage for non-Koreans becomes larger after control (their scores decrease and the Swedes’ scores increase).

Age at adoption matters, but apparently only for non-Koreans

[Figure: age at adoption]

This is in line with environmental cumulative disadvantage for non-Koreans. Alternatively, it is due to selection bias in that the less bright children (in the origin countries) are adopted later.

Perhaps the Koreans were placed with the better parents and this made them smarter?

Maybe, but the data show that this isn’t important, even for transracial adoptees.

[Figure: parental education and IQ]

Notice the clear relationship between child IQ and parental education for the non-adopted population. Then notice the lack of a clear pattern among the adoptees. There may be a slight upward trend (for Koreans), but it is weak (only .22 between lowest and highest education for Koreans, giving a d≈.10) and not found for non-Koreans (middle education-level had highest scores).

Still, one could claim that in Korea, smarter/normal children are given up for adoption, while in the non-Korean, non-Western origin countries, this isn’t the case or even the opposite is the case. This study cannot address this possibility.

This study is much larger than other studies and also has a comparison group. The main problem with it is that it does not report data for more countries of origin. Only the (superior) Koreans are singled out.

It seems that no one has integrated this literature yet. I will take a quick stab at it here. It could be expanded into a proper paper later in case someone wants to and has the time to do that.


Lee Jussim (also blog) has done a tremendous job at reviewing the stereotype literature in recent years. In general he has found that stereotypes are mostly moderately to very accurate. On the other hand, self-fulfilling prophecies are probably real but fairly limited (e.g. work best when teachers don’t know their students well yet), especially in comparison to stereotype accuracy. Of course, these findings are exactly the opposite of what social psychologists, taken as a group, have been telling us for years.

The best short review of the literature is their book chapter The Unbearable Accuracy of Stereotypes. A longer treatment can be found in his 2012 book Social Perception and Social Reality: Why Accuracy Dominates Bias and Self-Fulfilling Prophecy (libgen).

Occupational success and cognitive ability

Society is more or less a semi-stable hierarchy based on mostly inherited personality traits, cognitive ability as well as some family-based advantage. This shows up in the examination of surnames over time in many countries, as documented in Gregory Clark’s book The Son Also Rises: Surnames and the History of Social Mobility (libgen). One example:

[Figure: Sweden stability]

Briefly put, surnames are kind of an extended family and they tend to keep their standing over time. They regress towards the mean (not the statistical kind!), but slowly. This is due to outmarrying (marrying people from lower classes) and genetic regression (i.e. predicted via breeder’s equation and due to the fact that narrow heritability and shared environment do not add up to 1).
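
As a reminder of what the breeder’s equation predicts, here is a minimal sketch. It covers only the textbook form R = h² × S (response equals narrow heritability times selection differential). The numbers in the usage line are hypothetical and chosen for easy arithmetic, not estimates from Clark’s data, and the function names are illustrative.

```python
def breeders_equation(h2, selection_differential):
    """Expected response to selection: R = h2 * S."""
    return h2 * selection_differential


def expected_offspring_mean(pop_mean, parent_mean, h2):
    """Expected offspring mean given the parental mean and narrow heritability.

    The parents deviate from the population mean by S; the offspring are
    expected to retain only the fraction h2 of that deviation.
    """
    s = parent_mean - pop_mean
    return pop_mean + breeders_equation(h2, s)


# Hypothetical illustration: population mean 100, parents averaging 115,
# assumed narrow heritability of 0.5.
print(expected_offspring_mean(100, 115, 0.5))  # 107.5
```

Any shared-environment contribution would have to be added on top of this; the sketch leaves it out.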

It also shows up when educational attainment is directly examined with behavioral genetic methods. We reviewed the literature recently:

How do we find out whether g is causally related to later socioeconomic status? There are at least five lines of evidence: First, g and socioeconomic status correlate in adulthood. This has consistently been found for so many years that it hardly bears repeating[22, 23]. Second, in longitudinal studies, childhood g is a good correlate of adult socioeconomic status. A recent meta-analysis of longitudinal studies found that g was a better correlate of adult socioeconomic status and income than was parental socioeconomic status[24]. Third, there is a genetic overlap of causes of g and socioeconomic status and income[25, 26, 27, 28]. Fourth, multiple regression analyses show that IQ is a good predictor of future socioeconomic status, income and more, even controlling for parental income and the like[29]. Fifth, comparisons between full-siblings reared together show that those with higher IQ tend to do better in society. This cannot be attributed to shared environmental factors since these are the same for both siblings[30, 31].

I’m not aware of any behavioral genetic study of occupational success itself, but that may exist somewhere. (The scientific literature is basically a very badly standardized, difficult to search database.) But clearly, occupational success is closely related to income, educational attainment, cognitive ability and certain personality traits, all of which show substantial heritability and some of which are known to correlate genetically.

Occupations and cognitive ability

An old line of research shows that there is indeed a stable hierarchy in occupations’ mean and minimum cognitive ability levels. One good review of this is Meritocracy, Cognitive Ability, and the Sources of Occupational Success, a working paper from 2002. I could not find a more recent version. The paper itself is somewhat antagonistic against the idea (the author hates psychometricians, in particular dislikes Herrnstein and Murray, as well as Jensen) but it does neatly summarize a lot of findings.

[Figures: occu IQ 1 to occu IQ 7]

The last one is from Gottfredson’s book chapter g, jobs, and life (her site, better version).

Occupations and cognitive ability in preparation

Furthermore, we can go a step back from the above and find SAT scores (almost an IQ test) by college majors (more numbers here). These later result in people working in different occupations, altho the connection is not a simple one-to-one mapping. It lies somewhere between many-to-many and one-to-one; we might call it few-to-few. Some occupations only recruit persons with particular degrees (doctors must have degrees in medicine), while others are flexible within limits. Physics majors often don’t work with physics at their level of competence, but instead work as secondary education teachers, in the finance industry, as programmers, as engineers and of course sometimes as physicists of various kinds such as radiation specialists at hospitals and meteorologists. But still, physicists don’t often work as child carers or psychologists, so there is in general a strong connection between college majors and occupations.

There is some stereotype research into college majors. For instance, a recently popularized study showed that beliefs about the intellectual requirements of college majors correlated with the female% of the field: the fields perceived to be more difficult had fewer women. In fact, the perceived difficulty of a field probably mostly just proxies its actual difficulty, as measured by the mean SAT/ACT score of the students. However, no one seems to have actually correlated the SAT scores with the perceived difficulty, which is the correlation that is the most relevant for stereotype accuracy research.

There is a catch, however. If one analyses the SAT subtests vs. gender%, one sees that it is mostly the quantitative part of the SAT that gives rise to the SAT x gender% correlation. One can also see that the gender% correlates with median income by major.

[Figures: quant-by-college-major-gender, verbal-by-college-major-gender]

Stereotypes about occupations and their cognitive ability

Finally, we get to the central question. If we ask people to estimate the cognitive ability of persons by occupation and then correlate this with the actual cognitive ability, what do we get? Jensen summarizes some results in his 1980 book Bias in Mental Testing (p. 339). I mark the most important passages.

People’s average ranking of occupations is much the same regardless of the basis on which they were told to rank them. The well-known Barr scale of occupations was constructed by asking 30 “psychological judges” to rate 120 specific occupations, each definitely and concretely described, on a scale going from 0 to 100 according to the level of general intelligence required for ordinary success in the occupation. These judgments were made in 1920. Forty-four years later, in 1964, the National Opinion Research Center (NORC), in a large public opinion poll, asked many people to rate a large number of specific occupations in terms of their subjective opinion of the prestige of each occupation relative to all of the others. The correlation between the 1920 Barr ratings based on the average subjectively estimated intelligence requirements of the various occupations and the 1964 NORC ratings based on the average subjective opined prestige of the occupations is .91. The 1960 U.S. Census of Population: Classified Index of Occupations and Industries assigns each of several hundred occupations a composite index score based on the average income and educational level prevailing in the occupation. This index correlates .81 with the Barr subjective intelligence ratings and .90 with the NORC prestige ratings.

Rankings of the prestige of 25 occupations made by 450 high school and college students in 1946 showed the remarkable correlation of .97 with the rankings of the same occupations made by students in 1925 (Tyler, 1965, p. 342). Then, in 1949, the average ranking of these occupations by 500 teachers college students correlated .98 with the 1946 rankings by a different group of high school and college students. Very similar prestige rankings are also found in Britain and show a high degree of consistency across such groups as adolescents and adults, men and women, old and young, and upper and lower social classes. Obviously people are in considerable agreement in their subjective perceptions of numerous occupations, perceptions based on some kind of amalgam of the prestige image and supposed intellectual requirements of occupations, and these are highly related to such objective indices as the typical educational level and average income of the occupation. The subjective desirability of various occupations is also a part of the picture, as indicated by the relative frequencies of various occupational choices made by high school students. These frequencies show scant correspondence to the actual frequencies in various occupations; high-status occupations are greatly overselected and low-status occupations are seldom selected.

How well do such ratings of occupations correlate with the actual IQs of the persons in the rated occupations? The answer depends on whether we correlate the occupational prestige ratings with the average IQs in the various occupations or with the IQs of individual persons. The correlations between average prestige ratings and average IQs in occupations are very high— .90 to .95—when the averages are based on a large number of raters and a wide range of rated occupations. This means that the average of many people’s subjective perceptions conforms closely to an objective criterion, namely, tested IQ. Occupations with the highest status ratings are the learned professions—physician, scientist, lawyer, accountant, engineer, and other occupations that involve high educational requirements and highly developed skills, usually of an intellectual nature. The lowest-rated occupations are unskilled manual labor that almost any able-bodied person could do with very little or no prior training or experience and that involves minimal responsibility for decisions or supervision.

The correlation between rated occupational status and individual IQs ranges from about .50 to .70 in various studies. The results of such studies are much the same in Britain, the Netherlands, and the Soviet Union as in the United States, where the results are about the same for whites and blacks. The size of the correlation, which varies among different samples, seems to depend mostly on the age of the persons whose IQs are correlated with occupational status. IQ and occupational status are correlated .50 to .60 for young men ages 18 to 26 and about .70 for men over 40. A few years can make a big difference in these correlations. The younger men, of course, have not all yet attained their top career potential, and some of the highest-prestige occupations are not even represented in younger age groups. Judges, professors, business executives, college presidents, and the like are missing occupational categories in the studies based on young men, such as those drafted into the armed forces (e.g., the classic study of Harrell & Harrell, 1945).
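
The aggregate-level comparison Jensen describes (average many raters’ judgments per occupation, then correlate those averages with the occupations’ mean tested IQs) can be sketched as follows. This is a minimal illustration, not a reanalysis: the function names are hypothetical, and the ratings and IQ values in the usage lines are made-up placeholders, not data from any of the studies cited.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def stereotype_accuracy(ratings_by_occupation, mean_iq_by_occupation):
    """Correlate average rated intelligence with mean tested IQ per occupation.

    ratings_by_occupation: dict occupation -> list of ratings from many raters
    mean_iq_by_occupation: dict occupation -> mean tested IQ
    Only occupations present in both inputs are used.
    """
    occupations = sorted(set(ratings_by_occupation) & set(mean_iq_by_occupation))
    avg_ratings = [mean(ratings_by_occupation[o]) for o in occupations]
    mean_iqs = [mean_iq_by_occupation[o] for o in occupations]
    return pearson_r(avg_ratings, mean_iqs)


# Placeholder values for illustration only (occupations A, B, C are not real data)
ratings = {"A": [80, 90], "B": [50, 60], "C": [20, 40]}
iqs = {"A": 120, "B": 105, "C": 90}
print(round(stereotype_accuracy(ratings, iqs), 2))
```

The individual-level correlations Jensen mentions (.50 to .70) would instead pair each person’s own IQ with the rated status of their occupation, so they need person-level data.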

I predict that there is a lot of delicious low-hanging, ripe research fruit ready for harvest in this area if one takes a day or ten to dig up some data and read thru older papers, books and reports.

Tattoos and piercing. I haven’t found any evidence that this relates to intelligence or even creativity. On the other hand, what underlying factor of openness would it be an indication of?

I have. A long time ago, I tried to find a study of this. The only meaningful study I found was a small study of Croatian veterans:

Pozgain, I., Barkic, J., Filakovic, P., & Koic, O. (2004). Tattoo and personality traits in Croatian veterans. Yonsei medical journal, 45, 300-305.

The study has N≈100 and found a difference of about 5 IQ points. Not very convincing.

OKCupid data

In a still unpublished project, we scraped public data from OKCupid. We did this over several months, so the dataset has about N=70k. The dataset contains the public questions and users’ public answers to them, as well as profile information. Each question is multiple choice with 2 to 4 options.

Some of the questions can be used to make a rudimentary cognitive test with 3-5 items that has a reasonable sample size. This can then be used to calculate a mean cognitive score by answer category for all questions. Plots of the relevant questions are shown below. For interpretation, the SD of cognitive score is about 1, so the differences can be thought of as d values. There is some selection for cognitive ability (OKCupid is a more serious dating site mainly used by college students and graduates), so probably population-wide results would be a bit stronger in general. Worse, this selection gets stronger as the sample size decreases because smarter people tend to answer more questions. The effect is fairly small tho.
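
The general procedure can be sketched like this. It is a minimal illustration under stated assumptions, not the code used for the project: the data layout (nested dicts), the scoring rule (fraction of answered items correct) and all names are hypothetical, and the usage data are placeholders.

```python
from collections import defaultdict
from statistics import mean, pstdev


def cognitive_scores(item_answers, answer_key):
    """Score each user as the fraction of answered test items they got right.

    item_answers: dict user -> dict item -> chosen option
    answer_key:   dict item -> correct option
    Users who answered none of the items get no score.
    """
    scores = {}
    for user, answers in item_answers.items():
        graded = [answers[i] == answer_key[i] for i in answer_key if i in answers]
        if graded:
            scores[user] = sum(graded) / len(graded)
    return scores


def standardize(scores):
    """Rescale to mean 0 and SD 1 so group differences read roughly as d values."""
    vals = list(scores.values())
    m, sd = mean(vals), pstdev(vals)
    return {u: (s - m) / sd for u, s in scores.items()}


def mean_score_by_answer(z_scores, question_answers):
    """Mean standardized score for each answer option of one question.

    question_answers: dict user -> chosen option for that question
    """
    groups = defaultdict(list)
    for user, option in question_answers.items():
        if user in z_scores:
            groups[option].append(z_scores[user])
    return {option: mean(vals) for option, vals in groups.items()}


# Placeholder data: two test items and one other question with options "x"/"y"
key = {"q1": "a", "q2": "b"}
items = {
    "u1": {"q1": "a", "q2": "b"},
    "u2": {"q1": "a", "q2": "c"},
    "u3": {"q1": "b", "q2": "c"},
    "u4": {"q1": "b", "q2": "b"},
}
other_question = {"u1": "x", "u2": "x", "u3": "y", "u4": "y"}
z = standardize(cognitive_scores(items, key))
print(mean_score_by_answer(z, other_question))
```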

Tattoo results





Piercing results






See first: Some methods for measuring and correcting for spatial autocorrelation

Piffer’s method

Piffer’s method to examine the between group heritability of cognitive ability and height uses polygenic scores (either by simple mean or with factor analysis) based on GWAS findings to see if they predict phenotypes for populations. The prior studies (e.g. the recent one we co-authored) have relied on the 1000 Genomes and ALFRED public databases of genetic data. These datasets, however, do not have very high resolution (N’s = 26 and 50 populations) and do not include fine-grained European populations. However, it is known that there is quite a bit of variation in cognitive ability within Europe. If a genetic model is true, then one should be able to see this when using genetic data. Thus, one should try to obtain frequency data for a large set of SNPs for more populations, and crucially, these populations must be linked with countries so that the large amount of country-level data can be used.

Genomic autocorrelation

The above would be interesting and probably one could find some data to use. However, another idea is to just rely on the overall differences between European populations, e.g. as measured by Fst values. Overall genetic differentiation should be a useful proxy for genetic differentiation in the causal variants for cognitive ability especially within Europe. Furthermore, because k nearest spatial neighbor regression is local, it should be possible to use it on a dataset with Fst values for all populations, not just Europeans.
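
The idea of replacing spatial distance with genetic distance can be sketched as below. This is a minimal illustration of k nearest neighbor prediction on an Fst matrix, not the R code mentioned in this post: the data layout, the function name and the leave-one-out design are assumptions, and it needs at least two populations.

```python
def knn_predict(fst, values, k=3):
    """Predict each population's value as the mean of its k genetically
    nearest neighbours (smallest Fst), excluding the population itself.

    fst:    dict population -> dict population -> Fst distance
    values: dict population -> observed value (e.g. a country-level score)
    """
    predictions = {}
    for pop in values:
        # rank all other populations by genetic distance to this one
        neighbours = sorted(
            (other for other in values if other != pop),
            key=lambda other: fst[pop][other],
        )[:k]
        predictions[pop] = sum(values[n] for n in neighbours) / len(neighbours)
    return predictions


# Placeholder distances and values for three hypothetical populations
fst = {
    "A": {"B": 0.1, "C": 0.5},
    "B": {"A": 0.1, "C": 0.4},
    "C": {"A": 0.5, "B": 0.4},
}
values = {"A": 1.0, "B": 2.0, "C": 10.0}
print(knn_predict(fst, values, k=1))  # {'A': 2.0, 'B': 1.0, 'C': 2.0}
```

Correlating these predictions with the observed values then gives a measure of genomic autocorrelation, analogous to the spatial case.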

Since I have already written the R code to analyze data like this, I just need some Fst tables, so if you know of any such tables, please send me an email.


There is another table in this paper:

The study is also interesting in that they note that the SNPs that distinguish Europeans the most are likely to be genic, that is, the SNPs are located within a gene. This is a sign of selection, not drift. See also the same finding in

It is often said that polygenic traits are based on tons of causal variants each of which has a very small effect size. What is less often discussed is the distribution of these effect sizes, although this has some implications.

The first implication is statistical: we may want to modify our hyperprior if using a Bayesian approach. I’m not sure what the equivalent solution would be using a frequentist approach. I suspect the frequentist approach is based on assuming a normal distribution of the effects we are looking at and then testing them against the null hypothesis, i.e. looking at p values. Theoretically, the detection of SNPs may improve if we use an appropriate model.

The second implication is that to find even most of them, we need very, very large samples. The smaller effects probably can never be found because there are too few humans around to sample! Their signals are too weak in the noise. One could get around this by increasing the human population or simply collecting data over time as some humans die and new ones are born. Both have problems.

But just what does the distribution of betas look like?

Based on the current results, just what does the distribution look like? To find out, I downloaded the supplementary materials from Rietveld et al (2013). I used the EduYears one because college is a dichotomized version of this and dichotomization is bad. The datafile contains the SNP name (rs-number), effect allele, EAF (“frequency of the effect allele from the HapMap2-CEU sample”), beta, standard error and p value for each of the SNPs they examined, N=2.3 x 10^6.

From these values, we calculate the absolute beta because we are interested in effect size, but not direction. Direction is irrelevant because one could just ‘reverse’ the allele.

One can plot the data in various ways. Perhaps the most obvious is a histogram, shown below.


We see that most SNPs have effect sizes near zero. Another way is to cut the betas into k bins, calculate the midpoint of each bin and the number of betas in them.


The result is fairly close to the histogram above. It is clear that this is not linear. One can’t even see the difference between the numbers for about half the bins. We can fix this by using a log scale for the y-axis:


We get the expected fairly straight line. It is however not exactly straight. Should it be? Is it a fluke? How do we quantify straightness/linearity?
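One rough way to put a number on straightness is to regress the log10 bin counts on the bin midpoints and test whether adding a quadratic term improves the fit. The sketch below uses made-up bin counts with some built-in curvature, not the Rietveld data; `sim`, `fit_lin` and `fit_quad` are illustrative names only.

```r
# Made-up bin counts that decline roughly exponentially, with slight curvature
sim = data.frame(midpoint = seq(0.001, 0.029, length.out = 15))
sim$N = round(3e5 * exp(-250 * sim$midpoint - 4000 * sim$midpoint^2))

# Straight line vs. straight line plus curvature, both on the log10 scale
fit_lin = lm(log10(N) ~ midpoint, data = sim)
fit_quad = lm(log10(N) ~ midpoint + I(midpoint^2), data = sim)

# R^2 of the straight line, and an F test for the added quadratic term
r2_lin = summary(fit_lin)$r.squared
curv_test = anova(fit_lin, fit_quad)
print(r2_lin)
print(curv_test)
```

A high R² for the linear fit together with a small p value for the quadratic term would indicate a line that is nearly, but not exactly, straight. With real bin counts, empty bins would need to be dropped before taking logs.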

Perhaps if we increase our resolution, we would see something more. Let’s try 50 bins:


Now we get a bizarre result. Some of the bins are empty! Usually this means sampling, coding, or data error. I checked and could not find a problem on my end, and it is not sampling error for the smaller betas. Perhaps they used some internal rounding system that prevents betas in certain regions. It is pretty weird. Here’s what the table output looks like:

> table(r$cut_50)

(-3.5e-05,0.0007]   (0.0007,0.0014]   (0.0014,0.0021]   (0.0021,0.0028]   (0.0028,0.0035]   (0.0035,0.0042] 
           174315            340381            321445                 0            292916            258502 
  (0.0042,0.0049]   (0.0049,0.0056]   (0.0056,0.0063]    (0.0063,0.007]    (0.007,0.0077]   (0.0077,0.0084] 
                0            217534            177858            139775                 0            107282 
  (0.0084,0.0091]   (0.0091,0.0098]   (0.0098,0.0105]   (0.0105,0.0112]   (0.0112,0.0119]   (0.0119,0.0126] 
            80258                 0             58967             42998                 0             30249 
  (0.0126,0.0133]    (0.0133,0.014]    (0.014,0.0147]   (0.0147,0.0154]   (0.0154,0.0161]   (0.0161,0.0168] 
            21929             14894                 0              9733              6899                 0 
  (0.0168,0.0175]   (0.0175,0.0182]   (0.0182,0.0189]   (0.0189,0.0196]   (0.0196,0.0203]    (0.0203,0.021] 
             4757              3305                 0              2535              1322               912 
   (0.021,0.0217]   (0.0217,0.0224]   (0.0224,0.0231]   (0.0231,0.0238]   (0.0238,0.0245]   (0.0245,0.0252] 
                0               502               319                 0               174               133 
  (0.0252,0.0259]   (0.0259,0.0266]   (0.0266,0.0273]    (0.0273,0.028]    (0.028,0.0287]   (0.0287,0.0294] 
                0                85                47                33                 0                14 
  (0.0294,0.0301]   (0.0301,0.0308]   (0.0308,0.0315]   (0.0315,0.0322]   (0.0322,0.0329]   (0.0329,0.0336] 
                5                 0                 4                 2                 0                 1 
  (0.0336,0.0343]    (0.0343,0.035] 
                1                 1

Thus we see that some of them are inexplicably empty. Why are there no betas with values between .0021 and .0028?
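One candidate explanation, offered here as an assumption and not checked against the data file, is that the betas are reported to three decimals. Values on a 0.001 grid cannot fill bins that are narrower than 0.001, and 50 bins over a range of about 0.035 are only 0.0007 wide. The sketch below uses hypothetical grid values (`grid_betas`) to show how many bins end up empty for each number of cuts.

```r
# Hypothetical absolute betas on a 0.001 grid, mimicking rounding to 3 decimals
grid_betas = rep(seq(0, 0.035, by = 0.001), times = 10)

# Number of empty bins for each number of cuts
n_bins = c(10, 20, 30, 40, 50)
empty_bins = sapply(n_bins, function(k) {
  sum(table(cut(grid_betas, breaks = k)) == 0)
})
names(empty_bins) = n_bins
print(empty_bins)
```

Under this assumption, empty bins appear only once the bin width drops below the grid spacing, i.e. for more than 35 bins over this range.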

We can try investigating some other number of cuts. I tried 10, 20, 30, 40 and 50. Only 40 and 50 have the problem. 30 is fine:


The pattern at the 50% higher resolution (30/20=1.5) is still somewhat curved, although probably not with a low p value.

Frequency-corrected betas?

This is an idea I had while writing this post. Correlations and other linear models are affected by base rates as well as betas. Unless they corrected for this (I don’t remember), some of the SNPs with lower betas probably have stronger true effects that appear weak because their base rates are too high or too low. One could correct for this restriction of range if desired, which may change conclusions somewhat. The correction would estimate the betas the SNPs would have if they all had the same frequency.

Is there support for this idea? A simple test is to correlate frequency with absolute beta. This value should be negative. It is: r = -.006 [CI95: -.007 to -.005].
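A correlation with its confidence interval can be obtained with `cor.test` from base R. The sketch below runs on simulated stand-in data, so the numbers it prints say nothing about the real file; the column names `EAF` and `Beta` are assumed to match the data file. Because the argument concerns frequencies that are either too high or too low, the sketch also folds EAF into a minor allele frequency (`MAF`), which is an added assumption and not something done in the post.

```r
# Simulated stand-in for the GWAS table (not the real data)
set.seed(1)
n = 5000
sim = data.frame(EAF = runif(n, 0.01, 0.99), Beta = rnorm(n, 0, 0.005))
sim$Abs_Beta = abs(sim$Beta)

# Pearson correlation between effect allele frequency and absolute beta
ct = cor.test(sim$EAF, sim$Abs_Beta)
print(ct$estimate)
print(ct$conf.int)

# Variant: fold the frequency so that 0.1 and 0.9 count as equally extreme
sim$MAF = pmin(sim$EAF, 1 - sim$EAF)
ct_maf = cor.test(sim$MAF, sim$Abs_Beta)
print(ct_maf$estimate)
```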

R code

# IO and libs -------------------------------------------------------------
library(pacman) #provides p_load
p_load(stringr, kirkegaard, psych, plyr, ggplot2, magrittr) #magrittr provides %>%

#load data
r = read.table("SSGAC_EduYears_Rietveld2013_publicrelease.txt", sep = "\t", header = T)

# calculations ------------------------------------------------------------
#absolute values
#since we dont care about direction
r$Abs_Beta =  abs(r$Beta)

#find cut midpoints
#feature is missing
midpoints <- function(x, dp=2){
  #strip the brackets, then take the number before/after the comma
  lower <- as.numeric(gsub(",.*", "", gsub("\\(|\\[|\\)|\\]", "", x)))
  upper <- as.numeric(gsub(".*," , "", gsub("\\(|\\[|\\)|\\]", "", x)))
  return(round(lower+(upper-lower)/2, dp))
}

#make new dfs
cut_vec = c(10, 20, 30, 40, 50)
d_list = llply(cut_vec, function(x) {
  #add cuts to r
  tmp_var = str_c("cut_", x)
  r[tmp_var] = cut(r$Abs_Beta, breaks = x)
  #make a new df based of the table
  data.frame(N = table(r[[tmp_var]]) %>% as.numeric,
             midpoint = table(r[[tmp_var]]) %>% names %>% midpoints(., dp = 99))
}, .progress = "text")
names(d_list) = str_c("cut_", cut_vec) #add names

# plots --------------------------------------------------------------------
ggplot(r, aes(Abs_Beta)) + geom_histogram() + xlab("Absolute beta coefficient")

#loop plot
for (i in seq_along(d_list)) {
  #fetch data
  tmp_d = d_list[[i]]
  #linear scale
  p_lin = ggplot(tmp_d, aes(midpoint, N)) + geom_point() + geom_smooth() + ylab("Number of SNPs") + xlab("Midpoint of range")
  name = str_c(names(d_list)[i], "_beta_N_linear.png")
  ggsave(name, plot = p_lin) #saving step is assumed; it is missing from the listing
  try({ #we try because log transformation can give an error
    p_log = ggplot(tmp_d, aes(midpoint, N)) + geom_point() + geom_smooth() + ylab("Number of SNPs") + xlab("Midpoint of range") + scale_y_log10() + geom_smooth(method = "lm", se = F, color = "red")
    name = str_c(names(d_list)[i], "_beta_N_log.png")
    ggsave(name, plot = p_log) #assumed, as above
  })
}
# investigate -------------------------------------------------------------

Some readers may have fun reading this. I have not edited anything. Due to the conversion some extra lines were added, but his emails genuinely contain lots of oddly placed line breaks. The odd use of e.g. bold is his (Jayl is a male name, apparently).

The topic of the email is “University of Toronto, Canada”.

Jayl Feynman <> October 5, 2015 at 23:01
I am a cognitive neuroscience researcher from Toronto and wanted to make comment if you don’t mind on one of your studies titled “Sex differences in g and chronometric tests

Study –  file:///C:/Users/user/Downloads/article2.pdf

Males average 100g more brain than women, and brain size is known to correlate with general intelligence (g), leading to the possibility that men average somewhat higher in g than women

Intelligence is a reflection of the efficacy of the networks responsible for cognition and not merely the absolute size of the human brain. Hence it’s quality over quantity. The best model for predicting the neuroscience of intelligence is the P-Fit theory which you could read about it here:

The P-Fit theory also contradicts the claim that G factor does not change as Parietal-frontal integration changes through experience and connectivity.

It has also been shown that brain size correlates with IQ with a strength of about .38
Correlation is not causation. Even Einstein’s brain was smaller than average:

A critique of IQ testing

1. IQ tests such as Raven progressive matrices are very much flawed because almost all of the questions are spatial in nature (picture completion,block design etc)  and do not really touch on the verbal part of intelligence. To date, I have never seen an IQ test that measure verbal comprehension or fluency Measuring spatial ability also cannot be predicative in measuring ability in comprehending literature, novels, books and just about everyday language use.
2. Psychologists do not distinguish between different forms of information when measuring IQ. Different forms of information such as spatial and verbal information uses different areas of the brain for example pictures shapes and even arithmetic or digits use an area of the brain known as the Parietal lobe. Numeracy and digits which Psychologists think are verbal  actually happens to be spatial since the brain represents them as  quantity of space through mental imagery. Verbal information on the other hand such as words, sentences, language and meaning are processed by two areas of the brain known as Brocha’s and Wernicke’s area. Therefore IQ does not measure verbal intelligence especially in processing, reasoning and understanding the use of language.
3. Sample restriction and recruitment bias. The methods in which members of the each group enter the study sample is not in the control of the investigator in charge of the study. For example, in college or school studies it could easily be that both average and less average females enter the study while higher level males enter the study. This would obviously obscure mean differences.

4. Variability hypothesis and bias. This hypothesis holds that males exhibit greater variation than females in many cognitive ability domains, which may explain their overrepresentation in the tails of ability distributions and creates the appearance of mean differences in incomplete or selected samples. The male variability is obviously higher in the right tail of college samples therefore college sex differences in IQ is inaccurate in measuring IQs of the general population.

5. Intelligence is an emergent property of anatomically distinct networks of the brain each of which has it’s own distinct capacity. Therefore to measure general intelligence one would have to measure the distinct capacity of each brain network on which a singular IQ testing cannot do. You would need multiple separate tests to measure the capacity of each network.


Emil Ole William Kirkegaard <> October 6, 2015 at 02:36
To: Jayl Feynman <>

I googled your name yet was unable to find anything about your academic standing. LinkedIn is the only result and states that you are “Entrapreneur at MySelf (as an independent consultant)”. There is no mention of your name on the university website and I could locate no publications in your name on Google Scholar. The only things that come up are the publications of a physicist doing work on e.g. solar physics. Your email comes from a non-university address whereas academics usually use their institutional email. They also usually have a signature, whereas you have none.

So, taken all together, the evidence seems to show that you are not who you say you are. I’ll make the assumption that you are a random internet person who is uncomfortable with sex differences in cognitive ability. After all, many people are.

Anyway, I will reply, just for fun. :)

The paper you ‘link to’ (actually you gave the file’s location on your own computer) is a submission that was never published. You can read the submission thread to see why.

Intelligence is a reflection of the efficacy of the networks responsible for cognition and not merely the absolute size of the human brain. Hence it’s quality over quantity. The best model for predicting the neuroscience of intelligence is the P-Fit theory which you could read about it here:
Correlation is not causation. Even Einstein’s brain was smaller than average:

Why do you presume I don’t know what P-FIT is? If so, linking me to unreliable secondary sources is even more strange. Good rule of thumb: if you want to introduce a researcher to something, cite the primary literature. In this case:

Jung, R. E., & Haier, R. J. (2007). The Parieto-Frontal Integration Theory (P-FIT) of intelligence: converging neuroimaging evidence. Behavioral and Brain Sciences, 30(02), 135-154.

There are also some newer relevant work, e.g. by Roberto Colom.

In any case, I don’t think anyone made the claim that size explains all variation in cognitive ability, only that it explains some of it. Given the evidence, it would be extremely surprising if it did not. The very thing that makes humans unique is their high cognitive ability which increased over recent evolutionary time along with the brain size (i.e. we see it by increasing skull sizes; relative to body size).

If brain size was a non-causal correlate, one has the odd job of explaining why it was selected for by evolution just over the same time span. Note that brain tissue is extremely metabolically expensive. The brain accounts for about 20% of the rest metabolic rate yet takes up only about 2% by weight. Increased brain size causes substantial problems with childbirth which kills off a large number of women. A non-causal theory is that evolution still selected for larger brains despite all these costs. Presumably, this is why pretty much no one serious subscribes to that idea.

As for Einstein’s brain. You are citing a single case which cannot disprove an imperfect correlation, so the case is a non-starter. In this case, the case has an obvious explanation: the brain shrinks as we age (matching the decline in absolute scale cognitive ability).

The P-Fit theory also contradicts the claim that G factor does not change as Parietal-frontal integration changes through experience and connectivity.

No one claims a stability of 1 for GCA, hence change in brain structure is clearly consistent with relatively stable GCA. Also note that our measures of GCA measure relative standing (they are deviation scores), not absolute ability. So in fact, no change in GCA is consistent with change in the underlying brain structure if all members of the cohort have the exactly same brain changes (this isn’t true, but I’m just pointing it that the argument to be valid, needs this unstated premise).

This looks to be the stuff you really wanted to say. IQ criticism.
1. IQ tests such as Raven progressive matrices are very much flawed because almost all of the questions are spatial in nature (picture completion,block design etc) and do not really touch on the verbal part of intelligence. To date, I have never seen an IQ test that measure verbal comprehension or fluency Measuring spatial ability also cannot be predicative in measuring ability in comprehending literature, novels, books and just about everyday language use.

2. Psychologists do not distinguish between different forms of information when measuring IQ. Different forms of information such as spatial and verbal information uses different areas of the brain for example pictures shapes and even arithmetic or digits use an area of the brain known as the Parietal lobe. Numeracy and digits which Psychologists think are verbal actually happens to be spatial since the brain represents them as quantity of space through mental imagery. Verbal information on the other hand such as words, sentences, language and meaning are processed by two areas of the brain known as Brocha’s and Wernicke’s area. Therefore IQ does not measure verbal intelligence especially in processing, reasoning and understanding the use of language.

In fact, RPM does not have picture completion or block design. These are both subtests of the WAIS.

You say you have never seen an IQ test measuring these, which means that you have not looked hard because WAIS, probably the most widely used test of all, has a an entire subscale called verbal comprehension consisting of 3-4 tests, depending on which version of WAIS.

The last claim about lack of cross-area predictive validity is widely known to be false. Indeed, it has been known for decades, even going back to Spearman’s time in the 1930s. This is what is called indifference of the indicator, it doesn’t matter that much just which mental test is used to assess GCA, as long as the g-loading is strong, the predictive validity will be similarly strong. A great general resource is still:

Arthur Jensen. 1980. Bias in Mental Testing.

3. Sample restriction and recruitment bias. The methods in which members of the each group enter the study sample is not in the control of the investigator in charge of the study. For example, in college or school studies it could easily be that both average and less average females enter the study while higher level males enter the study. This would obviously obscure mean differences.

4. Variability hypothesis and bias. This hypothesis holds that males exhibit greater variation than females in many cognitive ability domains, which may explain their overrepresentation in the tails of ability distributions and creates the appearance of mean differences in incomplete or selected samples. The male variability is obviously higher in the right tail of college samples therefore college sex differences in IQ is inaccurate in measuring IQs of the general population.

Sometimes the researcher do get to decide, especially when they collect their own data. Other times, they use available datasets which may have recruitment bias. The fact that you cite research written by top researchers (Earl Hunt, Ian Deary) in the field discussing this means that they are aware of the problem.

In any case, there are general population samples too that find differences in variance and some that do not, same as for mean differences. This is why the question is currently undecided.

5. Intelligence is an emergent property of anatomically distinct networks of the brain each of which has it’s own distinct capacity. Therefore to measure general intelligence one would have to measure the distinct capacity of each brain network on which a singular IQ testing cannot do. You would need multiple separate tests to measure the capacity of each network.

You seem unaware of the g factor and seem to posit something like the Thompson’s sampling theory. The networks are not independent and attempts to make tests that do note correlate have all failed despite decades of attempts. Maybe read:

In any case, WAIS does try to measure a diverse set of cognitive abilities. This is a good idea because it results in a better measurement of GCA, which is what is responsible for the validity of the tests. See review at:



Jayl Feynman <> October 6, 2015 at 04:20
To: Emil Ole William Kirkegaard <>
 “The very thing that makes humans unique is their high cognitive ability which increased over recent evolutionary time along with the brain size (i.e. we see it by increasing skull sizes; relative to body size)

Actually our overall brains have been shrinking for the past 30,000 years.

On the other hand, our frontal lobes are growing and have gotten much bigger.
The prefrontal cortex is slightly larger relative to the rest of the brain in humans than in most other primates while also having larger volume of white matter to go alongside within it.

As for brain relative to body size, humans have the same as a mouse.  The size of  specific brains areas (prefrontal cortex, hippocampus, parietal cortex etc) specified by the P-Fit, are better correlations for intelligence than overall brain size.

“If brain size was a non-causal correlate, one has the odd job of explaining why it was selected for by evolution just over the same time span”

Because bigger brains equals to more lateralization (asymmetry) of brain functions.


 Lateralized brain allows dual attention to the tasks of feeding (right eye and left eye hemisphere) and vigilance for predators (left eye and right hemisphere). Hence it was most likely selected because dual attention was an advantage for human survival.


Jayl Feynman <> October 6, 2015 at 06:03
To: Emil Ole William Kirkegaard <>
Also, I see you have been following this study:

Using SATs to generalize about the broader population is hardly reasonable. Out of 3.3 million high school graduates per year, only 1.3 million take the SATs while only 65% of those 3.3 million actually enroll in college. It seems the SATs are probably just male variance in effect.


Jayl Feynman <> October 8, 2015 at 22:54
To: Emil Ole William Kirkegaard <>
So you didn’t respond huh? Well then I will just end with this, after I looking over different test samples I have come to the conclusion that IQ tests are nonsense. The questions are almost all spatial in nature while having a quantitative and visuo-spatial section is redundant as they both measure mental rotation and spatial visualization. IQ test should thus be re-named SIQ or Spatial intelligence quotient because that is precisely what it measures. School academics are probably better predictors for future success more than this pop quiz. So I guess psychology have been getting it wrong for the last 100 years, but what can you expect from a field of study that only looks for correlations right? Well you can find correlations in anything for example I correlate with Psychology with bigotry since it produce the most out of any field example Rushton, Lynn and probably you.

Chao :P


I guess he thinks e.g. these are spatial items:

[Images: verbal items; logical items; syntactic]

(All from Jensen’s Bias in Mental Testing).

The criticism is particularly odd because I had already told him about the actual composition of the WAIS.

Notice the links to files on his own computer. The same novice computer mistake made by the feminists writing for the UN.

I was not able to find the source paper for the claim that brain sizes got smaller during the last 30,000 years. Can someone find it? Did body size shrink as well? If so, then the brain-to-body size ratio may have increased over the period, or stayed the same.

I haven’t heard of the hypothesis that brain size evolved due to lateralization before, but the review article he linked to seems interesting:

The real reason I didn’t reply was that I was busy traveling home from the USA to Denmark. Still, it doesn’t seem worth my time.

A PDF of this paper without formatting errors can be downloaded here.


I review recent findings in human behavioral genetics and their implications for selective breeding and estimation of genotypic racial differences in polygenic traits.

Key words: behavioral genetics, cognitive ability, GWAS, intelligence, IQ, race, selective breeding, embryo selection, genetic engineering, educational attainment

1. Polygenic scores from all SNPs vs. p<α SNPs

A recent paper (1) used polygenic scores derived from the Rietveld results (2) to score a non-overlapping sample of European Americans (EA) and African Americans (AA). They found that polygenic scores predicted educational outcomes for samples at r’s = .18 and .11 for EAs and AAs respectively. In terms of variance, this corresponds to 3.24% and 1.21%, respectively. This is small, but not useless. They don’t report confidence intervals, only p value inequalities, so it isn’t so easy to see how precise these estimates are (3). The p value inequalities for the two results are p<.001 and <0.01. Note that sample sizes are different too. The main results table is shown below.
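Since the paper reports no confidence intervals, one could approximate them with the Fisher z transformation. The sketch below is illustrative only: `r_ci` is a hypothetical helper, and the sample sizes (917 EAs, 677 AAs) are taken from the analytic sample description quoted later in this post, on the assumption that the correlations were computed on those samples. The method also assumes independent observations, which sibling data violate, so the real intervals would be somewhat wider.

```r
# Approximate confidence interval for a correlation via Fisher's z
r_ci = function(r, n, level = 0.95) {
  z = atanh(r)                      # Fisher z transformation
  se = 1 / sqrt(n - 3)              # standard error on the z scale
  crit = qnorm(1 - (1 - level) / 2)
  tanh(c(lower = z - crit * se, upper = z + crit * se))
}

# Percent of variance corresponding to the reported correlations
r_vals = c(EA = 0.18, AA = 0.11)
print(r_vals^2 * 100)

# Approximate intervals; the n values are assumptions (see above)
ci_ea = r_ci(0.18, 917)
ci_aa = r_ci(0.11, 677)
print(ci_ea)
print(ci_aa)
```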


These findings are interesting because they use polygenic scores instead of scores derived from just the findings that surpass the NHST threshold, i.e. those that have a p value below the alpha value (p<α).¹ Using the full set of betas instead of just the set with p<α results in better predictions. It has even been found that differential weighting of the SNPs does not have a major effect on the predictive power of the derived polygenic scores (4).


This should be seen in light of conceptually related results in psychometrics where it has been shown that it doesn’t matter much if one uses g factor scores, simple sums or even randomly weighted subtests (5). The general mathematical explanation for this is that when one creates a linear combination (i.e. adds together) many variables, the common variance (‘the signal’) adds up while the unshared variance (‘the noise’) does not. Thus, the more variables one averages, the more the signal stands out from the noise (simplifying a bit). The general idea goes back at least to 1910, when Spearman and Brown independently derived a formula for it (6). Their papers were published in the same journal, even in the same issue (7,8). Another example of multiple discovery/invention.
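The Spearman-Brown formula itself is short enough to write out. In the sketch below, `r` is the reliability of a single component (or the average correlation between components) and `k` is the number of components combined; the function name and the example values are illustrative and not taken from any of the cited papers.

```r
# Spearman-Brown prophecy formula: reliability of a composite of k components,
# each with reliability (or average intercorrelation) r
spearman_brown = function(r, k) {
  k * r / (1 + (k - 1) * r)
}

# Weak individual indicators still add up to a strong composite
sb_vals = spearman_brown(0.1, c(1, 10, 100))
print(round(sb_vals, 3))
```

Even when each component carries little signal, the composite approaches perfect reliability as the number of components grows, which is the same logic as summing many SNPs with tiny effects.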

Focusing on the number of SNPs with p<α for a trait is the wrong metric. One should instead think of the correlation (or other effect size measure if the outcome data is categorical) found between polygenic scores and outcomes in cross-validation samples. Thinking of SNPs where p<α is dichotomous thinking instead of continuous thinking. When using dichotomous models for phenomena that really are continuous, one will get threshold effects that bias the effect sizes downwards.
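The general cost of dichotomizing a continuous variable can be illustrated with a small simulation. This is a sketch on simulated data with an arbitrary true correlation of .3, showing the general attenuation and not the specific p<α case; all variable names are made up.

```r
# Simulate two continuous variables with a true correlation of about .3
set.seed(1)
n = 10000
x = rnorm(n)
y = 0.3 * x + rnorm(n, sd = sqrt(1 - 0.3^2))

# Dichotomize x at the median and compare correlations with y
x_split = as.numeric(x > median(x))
r_cont = cor(x, y)
r_dich = cor(x_split, y)
print(c(continuous = r_cont, dichotomized = r_dich, ratio = r_dich / r_cont))
```

For normally distributed data split at the median, the expected attenuation factor is about .80, so roughly a fifth of the correlation is lost.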

2. Inconvenient results can be made to go away (maybe)

Since the study contained both an EA and an AA sample with mean IQs of 105.1 and 94.3, it should be possible to derive polygenic scores for members of both groups and compare the mean of the groups. This would be a test of the genetic hypothesis for the well known cognitive ability difference between the groups (9–11). There are two things worth noting, however.

First, the group difference is only 10.8 IQ, smaller than the usual gap found. There is some question as to whether the gap has been changing over time (12,13). Some newer samples find smaller gaps especially those based on WORDSUM scores (14), while others find standard (~1 SD, 15 IQ) sized gaps (15). The smaller than expected gap in the samples may result from selection bias in the AA sample (presumably it is difficult to recruit very low S inner city AAs for scientific studies). Note that results are generally weaker for this sample, which is expected given restriction of range.

Second, not all persons in the groups were genotyped. Those that were had lower mean IQs of 103.9 and 91.6, respectively. This gives a gap of 12.3 points. Note that the reason the EA score is not 100, is that the overall Add Health sample mean is set to ~100 (100.6).

Despite these caveats, the polygenic scores would be interesting to see. However, the authors decided to standardize the results within each group, such that the mean of the polygenic scores was 0 for both groups. This of course makes any group difference impossible to see. They provide the following rationale:

The 917 European Americans (EAs) in our analytic sample are in 386 sibling pairs and 12 sibling trios, with an additional 109 singletons. The 677 African Americans (AAs) are in 100 sibling pairs and four trios, with an additional 465 singletons. Table 1 shows characteristics of the EA and AA sibling pairs study participants who provided genetic data and constitute our analytic sample. The table also shows characteristics of the full Add Health EA and AA samples for comparison. The EAs in our analytic sample are largely comparable to the full population of EA respondents in the Add Health study. The AAs in our sample are less educated, have less educated parents, and score lower on the verbal intelligence measure as compared to all AA Add Health participants. The bulk of our analysis is focused on the EA sample because the original Rietveld et al. (2013) GWAS was conducted on European-descent individuals. Replication of polygenic scores discovered in EA samples among AA samples may be compromised because LD differences in the groups lead to less precision among AA samples. Accordingly, large-scale GWASs of educational attainment in African Americans will be needed to better quantify genetic influences on attainment in this population. Nevertheless, in the interest of testing the extent to which findings made in European-descent individuals replicate in a different population, we conduct several analyses of the AA sample. Due to the small number of AA sibling pairs in the data, sibling analyses are conducted only in EAs.

The rationale is not entirely unreasonable, but not sufficient reason not to standardize the polygenic scores for both samples together. In my opinion, the reason they provide should be taken into account when interpreting the results, but is not sufficient for not showing the results. My guess is that they did calculate the scores for both groups and compared them. Upon finding that the AA sample had a lower mean polygenic score than the EA sample, they decided that result was too toxic to publish. Reverse publication bias in effect. See also this post. A respected academic acquaintance of mine contacted the authors but they refused to share the results.

Lastly, one can use the combined sample to investigate whether the data shows a Simpson’s paradox pattern. The lack of such a pattern is a central finding of Fuerst’s and my upcoming paper (16). Jensen’s default hypothesis (17) predicts the absence of such a pattern since the same genetic causes are postulated to be involved in the within race differences as those between them.

3. Polygenic scores and sibling pairs

Another interesting aspect of the study is that they have sibling data. Since siblings receive a random mix of genes from their parents, they will differ in their genotypic values for polygenic traits. This was also found in this sample: “The mean sibling difference in polygenic scores in the EA sample was 0.8.” (they did not calculate this for the AA sample, stating that it was too small). In other words, the difference between siblings is nearing the size of the mean difference in the whole population. The same result is known to be true for siblings and IQ scores. The mean sibling difference is about 11 IQ compared to a full sample mean difference of 17 IQ (17). This gives a ratio of 11/17 = .65. Since the polygenic scores are standardized, we know that the mean difference between two random members of the population is 1.13 (Fuerst posted the formula here, but I’m not sure about the source). This gives a ratio of .8/1.13 = .71. These ratios are pretty close, as they should be.
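The 1.13 figure matches a standard result for the normal distribution: the expected absolute difference between two independent draws is 2σ/√π. Assuming that this is the formula referred to, the numbers can be checked as follows (`mean_abs_diff` is a hypothetical helper name), taking the sibling difference relative to the population difference in both cases.

```r
# Expected absolute difference between two independent draws from N(mu, sd^2)
mean_abs_diff = function(sd = 1) {
  2 * sd / sqrt(pi)
}

pop_diff_z = mean_abs_diff(1)    # standardized scores
pop_diff_iq = mean_abs_diff(15)  # IQ metric, SD = 15

# Sibling difference relative to the population difference
ratio_pgs = 0.8 / pop_diff_z
ratio_iq = 11 / 17
print(round(c(pop_diff_z = pop_diff_z, pop_diff_iq = pop_diff_iq,
              ratio_pgs = ratio_pgs, ratio_iq = ratio_iq), 2))
```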

We care about sibling comparisons because they by design control for shared environment effects, so that we don’t need to control them statistically (18). The authors found that results held within sibling pairs, an important finding. The table below is from their paper:


As we will see below, this has another important practical implication.

4. Genetic engineering and causal variants

Since socially valued outcomes have non-zero heritability (19), it means that it is in theory possible to improve the outcomes by genetic means, just as we have done for animals. I see two main routes to do this: selection among possible children and direct editing.

The first method is widely used but so far only for a small number of traits. When two persons want a child, the usual method involves having sex and producing a fetus. As mentioned above, this fetus will have a random combination of genes from the parents. If the same parents produce a different combination we call it a sibling.

Selective abortion involves screening fetuses in the womb for anomalies and aborting ones sufficiently undesirable. Probably the most common target for this is Down’s syndrome, which is substantially reduced due to the high rate of abortions when it is detected (20,21). For Denmark, the abortion rate given detection is 99%.

Selective abortion is better than nothing but it is not a good method. Not only is it painful for the woman, but it is inefficient because one has to wait until one can perform a prenatal screening. At that point, the fetus is many weeks old. Furthermore, abortions can result in infertility.

Embryo selection is the natural extension of the same idea. Instead of selectively aborting fetuses, we select between embryos (fertilized eggs). Essentially, we choose an embryo and implant it. This illustration shows how this works.

The second and best option for genetic engineering is to edit the genes directly. In that way one could potentially create a genome free of known flaws. This would involve using something like CRISPR.

The problem with direct editing is that we need to know the actual causal variants. This is not required for selection among possible children. Here it is sufficient that we can make predictions. The difference here is that the SNPs we know are in most cases probably not the causal variants. Instead, they are proxies for the causal variants because they are in linkage disequilibrium (LD) with them. In simple terms, the reason for this is that the mixing of gene variants from sexual reproduction (meiosis) happens at random, but in chunks. Thus, gene variants that are located closer to each other in the genome tend to travel together during splits. This means that they get correlated, which we call LD.

Since practical use of embryo selection requires working on sibling embryos, it is necessary that we can make genomic predictions among siblings that work. The new paper showed that we can do this for educational attainment.

5. Replicability of GWAS results across racial groups

There are two matters. The first is to which degree the genetic architecture of polygenic traits is similar across racial groups, i.e. if the same genes cause traits across populations or if there is substantial race-level gene-gene interaction (epistasis). The second is the degree to which SNP betas derived from one race can be used to make valid predictions for another race.

For polygenic traits that have been under selection for many thousands of years (e.g. cognitive ability or height (22)), I think substantial race-level gene-gene interaction is implausible. Such interactions are however plausible for traits that involve a small number of genes and show substantial race differences, such as those for hair, eye and skin color.

LD patterns change over time. Since the LD patterns change independently and randomly in each population, they will tend to become different with time.

If the GWAS SNPs owe their predictive power to being actual causal variants, then LD is irrelevant and they should predict the relevant outcome in any racial group. If however they owe their predictive power wholly or partly to just being statistically related to causal variants, they should be relatively worse predictors in racial groups that are most distantly related. One can investigate this by comparing the predictive power of GWAS betas derived from one population on another population. Since there are by now 1000s of GWAS, meta-analyses have in fact made such comparisons, mostly for disease traits. Two reviews found substantial cross-validity for the Eurasian population (Europeans and East Asians), and less for Africans (usually African Americans) (23,24). The first review relied only on SNPs with p<α and found weaker results. This is expected because restricting the comparison to SNPs that pass the significance threshold introduces a threshold effect, as discussed earlier.

The second review (from 2013; 299 included GWAS) found much stronger results, probably because it included more SNPs and also adjusted for statistical power. Doing so, they found that ~100% of SNPs replicate in other European samples when accounting for statistical power, ~80% in East Asian samples but only ~10% in the African American sample (not adjusted for statistical power, which was ~60% on average). There were fairly few GWAS for AAs however, so some caution is needed in interpreting the number. Still, this throws some doubt on the usefulness of GWAS results from Europeans or Asians used on African samples (or vice versa).
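The proxy argument above can be illustrated with a toy simulation. This is only a sketch under strong assumptions: a single causal SNP, a single tag SNP, and made-up LD values of .9 and .3 for the two populations. The function name `simulate_tag_validity` and all parameter values are hypothetical, and the "copying" mechanism is just a convenient way to induce a chosen correlation, not a model of how LD arises.

```r
# Toy model: a tag SNP predicts a trait only through its correlation (LD)
# with an unobserved causal SNP. All numbers are hypothetical.
simulate_tag_validity <- function(n, ld, beta = 0.3) {
  causal <- rbinom(n, size = 2, prob = 0.5)   # causal allele count (0/1/2)
  copied <- runif(n) < ld                     # with probability ld the tag copies the causal SNP
  tag <- ifelse(copied, causal, rbinom(n, size = 2, prob = 0.5))
  trait <- beta * as.numeric(scale(causal)) + rnorm(n)
  c(ld = cor(tag, causal),
    tag_validity = cor(tag, trait),
    causal_validity = cor(causal, trait))
}

set.seed(1)
pop_a <- simulate_tag_validity(1e5, ld = 0.9)  # discovery population, strong LD
pop_b <- simulate_tag_validity(1e5, ld = 0.3)  # distant population, weak LD
print(round(rbind(pop_a, pop_b), 2))
```

In this setup the causal SNP predicts the trait about equally well in both populations, while the tag SNP's validity shrinks roughly in proportion to its LD with the causal SNP.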

Which brings us back to…

6. Low cross-validity of GWAS betas and polygenic scores for educational attainment in AAs

Despite the relatively weak evidence for European sample derived GWAS betas in Africans, the study mentioned in the beginning of this review (1) still found a reliable polygenic correlation of .11 in AAs. However, AAs are an admixed group that are about 75-85% African and 25-15% European (25,26). The exact admixture proportions depend on the selectivity of the sample. Bryc et al used the 23andme database which represents individuals willing to pay to have their genomes sequenced. Since this requires both money (the price is about $100 for US citizens) and interest in genetic results, this will lead to selection for S (27) and cognitive ability. Both traits are known to correlate with European admixture at the individual, region and country levels (16), which would then result in higher proportions of European admixture in the AA sample. Shriver et al’s sample is more representative and found mean proportions of 78.7% and 18.6% for African and European ancestry respectively.

If we make the assumption that the polygenic correlation for educational attainment in the AA sample is purely due to the European admixture, we can make a prediction for the effect size, namely that it should be about 20% of the size of that for Europeans. I’m not sure but I think that in this case one should use the proportion of variance, not the correlation coefficient. Recall that these were 3.24% and 1.21% (r’s .18 and .11), which gives a ratio of .37. This is higher than the expected value of .186. This means that there is an excess validity of .187 in the African part of their genome under the null model. We can use this to estimate the cross-racial validity. Since we have accounted for AAs’ European admixture, the rest of the predictive power must come from the African admixture (ignoring Native American admixture for simplicity), which constitutes 78.7%. This gives an estimated cross-racial validity ratio of about .24 (.187/.787). In a pure African sample, this corresponds to an estimated correlation coefficient of .09 (sqrt(.18^2 * .24)). Future studies will reveal how far off these estimates are, but most importantly, they are quantitative predictions, not merely qualitative (directional) (28).
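The arithmetic in the paragraph above can be retraced in a few lines of R. This is just a sketch of the same back-of-the-envelope calculation using the numbers given in the text; the variable names are my own, and the unrounded intermediate values differ slightly from the rounded ones quoted above.

```r
# Inputs taken from the text
r_eur    <- 0.18    # polygenic correlation in the European sample
r_aa     <- 0.11    # polygenic correlation in the AA sample
prop_eur <- 0.186   # mean European ancestry (Shriver et al)
prop_afr <- 0.787   # mean African ancestry (Shriver et al)

# Ratio of variance explained, AA relative to European
var_ratio <- r_aa^2 / r_eur^2

# Validity not accounted for by European admixture under the null model
excess <- var_ratio - prop_eur

# Cross-racial validity ratio attributed to the African part of the genome
validity_ratio <- excess / prop_afr

# Implied correlation in an unadmixed African sample
r_african <- sqrt(r_eur^2 * validity_ratio)

print(round(c(var_ratio = var_ratio, excess = excess,
              validity_ratio = validity_ratio, r_african = r_african), 3))
```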

7. Poor African-Eurasian cross-validity and the Piffer method

The findings related to the relatively poor, but non-zero cross-validity of GWAS betas between European and African samples throw some doubt on the SNP evidence found by Piffer in his studies of the population/country IQ and cognitive ability SNP factors (29). If the betas for the SNPs identified in European sample GWAS do not work well as predictors for Africans, they would be equally unsuitable for estimating mean genotypic cognitive ability from SNP frequencies. Thus, further research is needed to more precisely estimate the cross-racial validity of GWAS betas, especially with regards to African vs. Eurasian samples.


1. Domingue BW, Belsky DW, Conley D, Harris KM, Boardman JD. Polygenic Influence on Educational Attainment. AERA Open. 2015 Jul 1;1(3):2332858415599972.

2. Rietveld CA, Medland SE, Derringer J, Yang J, Esko T, Martin NW, et al. GWAS of 126,559 Individuals Identifies Genetic Variants Associated with Educational Attainment. Science. 2013 Jun 21;340(6139):1467–71.

3. Cumming G. The New Statistics: Why and How. Psychol Sci. 2014 Jan 1;25(1):7–29.

4. Kirkpatrick RM, McGue M, Iacono WG, Miller MB, Basu S. Results of a “GWAS Plus:” General Cognitive Ability Is Substantially Heritable and Massively Polygenic. PLoS ONE. 2014 Nov 10;9(11):e112390.

5. Ree MJ, Carretta TR, Earles JA. In Top-Down Decisions, Weighting Variables does Not Matter: A Consequence of Wilks’ Theorem. Organ Res Methods. 1998 Oct 1;1(4):407–20.

6. Carroll JB. Human cognitive abilities: A survey of factor-analytic studies [Internet]. Cambridge University Press; 1993 [cited 2015 Jun 3]. Available from:

7. Spearman C. Correlation Calculated from Faulty Data. Br J Psychol 1904-1920. 1910 Oct 1;3(3):271–95.

8. Brown W. Some Experimental Results in the Correlation of Mental Abilities. Br J Psychol 1904-1920. 1910 Oct 1;3(3):296–322.

9. Fuerst J. Ethnic/Race Differences in Aptitude by Generation in the United States: An Exploratory Meta-analysis. Open Differ Psychol [Internet]. 2014 Jul 26 [cited 2014 Oct 13]; Available from:

10. Rushton JP, Jensen AR. Thirty years of research on race differences in cognitive ability. Psychol Public Policy Law. 2005;11(2):235–94.

11. Fuerst J. The facts that need to be explained [Internet]. Unwelcome Discovery. 2012 [cited 2015 Aug 31]. Available from:

12. Fuerst J. Secular Changes in the Black-White Cognitive Ability Gap [Internet]. Human Varieties. 2013 [cited 2015 Aug 31]. Available from:

13. Malloy J. The Onset and Development of B-W Ability Differences: Early Infancy to Age 3 (Part 1) [Internet]. Human Varieties. 2013 [cited 2015 Aug 31]. Available from:

14. Hu M. An update on the secular narrowing of the black-white gap in the Wordsum vocabulary test (1974-2012) [Internet]. 2014 [cited 2015 Aug 31]. Available from:

15. Frisby CL, Beaujean AA. Testing Spearman’s hypotheses using a bi-factor model with WAIS-IV/WMS-IV standardization data. Intelligence. 2015 Jul;51:79–97.

16. Fuerst J, Kirkegaard EOW. Admixture in the Americas. In London, UK; 2015. Available from:

17. Jensen AR. The g factor: the science of mental ability. Westport, Conn.: Praeger; 1998.

18. Murray C. IQ and income inequality in a sample of sibling pairs from advantaged family backgrounds. Am Econ Rev. 2002;339–43.

19. Polderman TJC, Benyamin B, de Leeuw CA, Sullivan PF, van Bochoven A, Visscher PM, et al. Meta-analysis of the heritability of human traits based on fifty years of twin studies. Nat Genet. 2015 May 18;47(7):702–9.

20. Natoli JL, Ackerman DL, McDermott S, Edwards JG. Prenatal diagnosis of Down syndrome: a systematic review of termination rates (1995-2011). Prenat Diagn. 2012 Feb;32(2):142–53.

21. de Graaf G, Buckley F, Skotko BG. Estimates of the live births, natural losses, and elective terminations with Down syndrome in the United States. Am J Med Genet A. 2015 Apr;167A(4):756–67.

22. Joshi PK, Esko T, Mattsson H, Eklund N, Gandin I, Nutile T, et al. Directional dominance on stature and cognition in diverse human populations. Nature. 2015 Jul 23;523(7561):459–62.

23. Ntzani EE, Liberopoulos G, Manolio TA, Ioannidis JPA. Consistency of genome-wide associations across major ancestral groups. Hum Genet. 2011 Dec 20;131(7):1057–71.

24. Marigorta UM, Navarro A. High Trans-ethnic Replicability of GWAS Results Implies Common Causal Variants. PLoS Genet [Internet]. 2013 Jun [cited 2015 Aug 31];9(6). Available from:

25. Bryc K, Durand EY, Macpherson JM, Reich D, Mountain JL. The genetic ancestry of African Americans, Latinos, and European Americans across the United States. Am J Hum Genet. 2015 Jan 8;96(1):37–53.

26. Shriver MD, Parra EJ, Dios S, Bonilla C, Norton H, Jovel C, et al. Skin pigmentation, biogeographical ancestry and admixture mapping. Hum Genet. 2003 Feb 11;112(4):387–99.

27. Kirkegaard EOW, Fuerst J. Educational attainment, income, use of social benefits, crime rate and the general socioeconomic factor among 71 immigrant groups in Denmark. Open Differ Psychol [Internet]. 2014 May 12 [cited 2014 Oct 13]; Available from:

28. Velicer WF, Cumming G, Fava JL, Rossi JS, Prochaska JO, Johnson J. Theory Testing Using Quantitative Predictions of Effect Size. Appl Psychol Psychol Appl. 2008 Oct;57(4):589–608.

29. Piffer D. A review of intelligence GWAS hits: their relationship to country IQ and the issue of spatial autocorrelation [Internet]. 2015 [cited 2015 Aug 2]. Available from:


1 For GWAS the alpha value is usually set at 5*10^-8. The number comes from correcting the standard α=.05 (a 5% theoretical false positive rate) for multiple testing when using SNP data, assuming roughly one million independent tests: .05 * 1e-6 = 5e-8.

I’ve always considered myself a very rational and fairly unbiased person. Being aware of the general tendency for people to overestimate themselves (see also visualization of the Dunning-Kruger effect), this of course reduces my confidence in my own estimates of these. So what better to do than take some actual tests? I have previously taken the short test of estimation ability found in What intelligence tests miss and got 5/5 right. This is actually slight evidence of underconfidence since I was supposed to give 80% confidence intervals. This of course means that I should have had 1 error, not 0. Still, with 5 items, the precision is too low to say with much certainty whether I’m actually underconfident or not, but it shows that I’m unlikely to be strongly overconfident. Underconfidence is expected for smarter people. A project of mine is to make a much longer test with confidence intervals so as to give more precise estimates. It should be fairly easy to find a lot of numbers for stuff and have people give 80% confidence intervals for the numbers. Stuff like the depth of the deepest ocean, height of tallest mountain, age of oldest living organism, age of the Earth/universe, dates for various historical events such as ending of WW2, beginning of American Civil war, and so on.
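To see why 5/5 is only slight evidence, one can compute how likely that result is for a perfectly calibrated person. A minimal sketch, assuming the five items are independent; the 95% coverage used for the "underconfident" comparison is a made-up value chosen for illustration.

```r
hits  <- 5
items <- 5

# Probability of 5/5 if the 80% intervals are perfectly calibrated (0.8^5)
p_calibrated <- dbinom(hits, size = items, prob = 0.80)

# Same probability for a hypothetical underconfident person whose
# "80%" intervals actually contain the true value 95% of the time
p_underconfident <- dbinom(hits, size = items, prob = 0.95)

# How much more likely the observed result is under underconfidence
likelihood_ratio <- p_underconfident / p_calibrated

print(round(c(p_calibrated = p_calibrated,
              p_underconfident = p_underconfident,
              likelihood_ratio = likelihood_ratio), 3))
```

A calibrated person gets 5/5 about a third of the time, so the result shifts the odds only modestly. A longer test would discriminate much better.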

However, I recently saw an article about a political bias test. I think I’m fairly unbiased. As a result of this, my beliefs don’t really fit into any mainstream political theory. This is as expected because the major political ideologies were invented before we understood much about anything, thus making it unlikely that they would tend to get everything right. More likely, they would probably get some things right and some things wrong.

Here’s my test results for political bias:

[Screenshots: political bias test results]

So in centiles: >= 99th for knowledge of American politics. This is higher than I expected (around 95th). Since I’m not a US citizen, presumably the test has some bias against me. For bias, the centile is <= 20th. This result did not surprise me. However, since there is a huge floor effect, this test needs more items to be more useful.

Next up, I looked at the website and saw that they had a number of other potentially useful tests. One is about common misconceptions. Now since I consider myself a scientific rationalist, I should do fairly well on this. Also because I have read somewhat extensively on the issue (Wikipedia list, Snopes and 50 myths of pop psych).

Unfortunately, they present the results in a verbose form. Pasting 8 images would be excessive. I will paste some of the relevant text:

1. Brier score

Your total performance across all quiz and confidence questions:

This measures your overall ability on this test. This number above, known as a “Brier” score, is a combination of two data points:

How many answers you got correct
Whether you were confident at the right times. That means being more confident on questions you were more likely to be right about, and less confident on questions you were less likely to get right.

The higher this score is, the more answers you got correct AND the more you were appropriately confident at the right times and appropriately uncertain at the right times.

Your score is above average. Most people’s Brier’s scores fall in the range of 65-80%. About 5% of people got a score higher than yours. That’s a really good score!
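For reference, the textbook Brier score is the mean squared difference between the stated probabilities and the outcomes, where lower is better. The site evidently rescales it into a percentage where higher is better, and that rescaling is not documented here, so the sketch below shows only the standard form. The forecasts and outcomes are made up for illustration.

```r
# Hypothetical forecasts: stated probability that each answer is correct
stated_prob <- c(0.9, 0.6, 0.5, 0.8)
# Hypothetical outcomes: 1 = answer was correct, 0 = answer was wrong
outcome     <- c(1,   0,   1,   1)

# Standard Brier score: mean squared error of the probabilities (0 is perfect)
brier <- mean((stated_prob - outcome)^2)
print(brier)
```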

2. Overall accuracy

Answers you got correct: 80%

Out of 30 questions, you got 24 correct. Great work B.S. detector! You performed above average. Looks like you’re pretty good at sorting fact from fiction. Most people correctly guess between 16 and 21 answers, a little better than chance.

Out of the common misconceptions we asked you about, you correctly said that 12/15 were actually B.S. That’s pretty good!

No centiles are provided, so it is not evident how this compares to others.

3. Total points

[Screenshot: total points]

As for your points, this is another way of showing your ability to detect Fact vs. B.S. and your confidence accuracy. The larger the score, the better you are at doing both! Most people score between 120 and 200 points. Looks like you did very well, ending at 204 points.

4. Reliability of confidence intervals

Reliability of your confidence levels: 89.34%

Were you confident at the right times? To find out, we took a portion of your earlier Brier score to determine just how reliable your level of confidence was. It looks like your score is above average. About 10% of people got a score higher than yours.

This score measures the link between the size of your bet and the chance you got the correct answer. If you were appropriately confident at the right times, we’d expect you to bet a lot of points more often when you got the answer correct than when you didn’t. If you were appropriately uncertain at the right times, we’d expect you to typically bet only a few points when you got the answer wrong.

You can interpret this score as measuring the ability of your gut to distinguish between things that are very likely true, versus only somewhat likely true. Or in other words, this score tries to answer the question, “When you feel more confident in something, does that actually make it more likely to be true?”

5. Confidence and accuracy

[Screenshot: confidence vs. accuracy chart]

When you bet 1-3 points your confidence was accurate. You were a little confident in your answers and got the answer correct 69.23% of the time. Nice work!

When you bet 4-7 points you were underconfident. You were fairly confident in your answers, but you should have been even more confident because you got the answer correct 100% of the time!

When you bet 8-10 points your confidence was accurate. You were extremely confident in your answer and indeed got the answer correct 100% of the time. Great work!

So, again there is some evidence of underconfidence. E.g. for those I betted 0 points, I still had 60% accuracy, tho it should have been 50%.

6. Overall confidence

Your confidence: very underconfident

You tended to be very underconfident in your answers overall. Let’s explore what that means.

In the chart above, your betting average has been translated into a new score called your “average confidence.” This represents roughly how confident you were in each of your answers.

People who typically bet close to 0 points would have an average confidence near 50% (i.e. they aren’t confident at all and don’t think they’ll do much better than chance).
People who typically bet about 5 points would have an average confidence near 75% (i.e. they’re fairly confident; they might’ve thought there was a ¼ chance of being wrong).
People who typically bet 10 points would have an average confidence near 100% (i.e. they are extremely confident; they thought there was almost no chance of being wrong).

The second bar is the average number of questions you got correct. You got 24 questions correct, or 80%.

If you are a highly accurate better, then the two bars above should be about equal. That is, if you got an 80% confidence score, then you should have gotten about 80% of the questions correct.

We said you were underconfident because on average you bet at a confidence level of 69.67% (i.e. you bet 3.93 on average), but in reality you did better than that, getting the answer right 80% of the time.
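The three anchor points given above (0 points = 50%, 5 points = 75%, 10 points = 100%) are consistent with a simple linear mapping from bet size to confidence. The sketch below assumes that mapping, which is my inference and not something the site states. The function name `bet_to_confidence` is hypothetical.

```r
# Assumed linear mapping from bet size (0-10 points) to confidence (50-100%)
bet_to_confidence <- function(bet) 50 + 5 * bet

anchors  <- bet_to_confidence(c(0, 5, 10))   # the three anchor points from the text
avg_conf <- bet_to_confidence(3.93)          # average bet reported by the site
accuracy <- 24 / 30 * 100                    # percent of answers correct

# Positive gap = underconfidence (accuracy higher than stated confidence)
gap <- accuracy - avg_conf
print(c(avg_conf = avg_conf, accuracy = accuracy, gap = gap))
```

With the rounded average bet of 3.93 this gives 69.65%, close to the 69.67% the site reports, which presumably uses the unrounded average.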

In general, results were in line with my predictions: high ability + general overestimation + imperfect correlation of self-rated ability and actual ability results in underconfidence. My earlier result indicated some underconfidence as well. The longer test gave the same result. Apparently, I need to be more confident in myself. This is despite the fact that I scored 98 and 99 on the assertiveness facet on the OCEAN test on two different test taking sessions with some months in between.

I did take their additional rationality test, but since this was just based on pop psych Kahneman-style points, it doesn’t seem very useful. It also uses typological thinking because it classifies people into 16 classes, clearly wrong-headed. It found my weakest side to be planning fallacy, but this isn’t actually the case because I’m pretty good at getting papers and projects done on time.

Let’s test that. Since we don’t have the original data, we can’t use that. We can however use open datasets. I like to use Wicherts’ dataset. So let’s analyze!

#p_load comes from pacman; ggplot2 is needed for scale_x_continuous
library(pacman)
p_load(kirkegaard, psych, stringr, ggplot2)

#load wicherts data
w = read.csv("wicherts_data.csv", sep = ";")
w = subset(w, select = c("ravenscore", "lre_cor", "nse_cor", "voc_cor", "van_cor", "hfi_cor", "ari_cor", "Zgpa"))

#remove missing
w = na.omit(w)

w = std_df(w)

#subset the subtests
w_subtests = subset(w, select = c("ravenscore", "lre_cor", "nse_cor", "voc_cor", "van_cor", "hfi_cor", "ari_cor"))

#factor analyze
fa = fa(w_subtests)

plot_loadings(fa) + scale_x_continuous(breaks = seq(0, 1, .05))

#save scores
w_subtests$g = as.numeric(fa$scores)

w_res = residualize_DF(w_subtests, "g")

#include GPA
w_res$GPA = w$Zgpa


#predict GPA
fits = lm_beta_matrix(dependent = "GPA",
                      predictors = colnames(w_res)[-9],
                      data = w_res, return_models = "n")


#why does the last model have NA for one variable?
model = str_c("GPA ~ ", str_c(colnames(w_res)[-9], collapse = " + "))
fit = lm(model, w_res)




            ravenscore  lre_cor  nse_cor  voc_cor  van_cor  hfi_cor  ari_cor        g      GPA
ravenscore           1   -0.125   -0.201   -0.115    0.057    0.022   -0.294        0   -0.028
lre_cor         -0.125        1   -0.316   -0.143   -0.195    -0.23   -0.288        0   -0.074
nse_cor         -0.201   -0.316        1   -0.206   -0.381   -0.103    0.099        0   -0.116
voc_cor         -0.115   -0.143   -0.206        1    0.198   -0.141   -0.207        0     0.08
van_cor          0.057   -0.195   -0.381    0.198        1   -0.081   -0.406        0    0.137
hfi_cor          0.022    -0.23   -0.103   -0.141   -0.081        1   -0.233        0    0.037
ari_cor         -0.294   -0.288    0.099   -0.207   -0.406   -0.233        1        0    0.007
g                    0        0        0        0        0        0        0        1    0.334
GPA             -0.028   -0.074   -0.116     0.08    0.137    0.037    0.007    0.334        1


Model fits

Model # ravenscore lre_cor nse_cor voc_cor van_cor hfi_cor ari_cor g r2.adj.
1 -0.028 -0.003
2 -0.074 0.002
3 -0.116 0.01
4 0.08 0.003
5 0.137 0.015
6 0.037 -0.002
7 0.007 -0.003
8 0.334 0.108
9 -0.038 -0.079 0
10 -0.054 -0.127 0.009
11 -0.019 0.078 0
12 -0.036 0.139 0.013
13 -0.029 0.038 -0.005
14 -0.029 -0.001 -0.006
15 -0.028 0.334 0.106
16 -0.123 -0.155 0.02
17 -0.064 0.071 0.004
18 -0.049 0.127 0.014
19 -0.069 0.021 -0.001
20 -0.078 -0.015 -0.001
21 -0.074 0.334 0.111
22 -0.104 0.059 0.01
23 -0.075 0.108 0.017
24 -0.114 0.026 0.007
25 -0.118 0.019 0.007
26 -0.116 0.334 0.119
27 0.055 0.126 0.015
28 0.087 0.05 0.002
29 0.086 0.025 0
30 0.08 0.334 0.112
31 0.141 0.049 0.014
32 0.168 0.075 0.017
33 0.137 0.334 0.124
34 0.041 0.017 -0.005
35 0.037 0.334 0.107
36 0.007 0.334 0.105
37 -0.082 -0.14 -0.177 0.023
38 -0.029 -0.068 0.067 0.001
39 -0.043 -0.054 0.129 0.013
40 -0.038 -0.074 0.021 -0.003
41 -0.049 -0.09 -0.033 -0.003
42 -0.038 -0.079 0.334 0.109
43 -0.046 -0.115 0.052 0.008
44 -0.052 -0.086 0.107 0.016
45 -0.054 -0.125 0.026 0.007
46 -0.053 -0.127 0.004 0.006
47 -0.054 -0.127 0.334 0.119
48 -0.03 0.052 0.128 0.012
49 -0.02 0.085 0.05 -0.001
50 -0.013 0.083 0.021 -0.003
51 -0.019 0.078 0.334 0.109
52 -0.038 0.143 0.05 0.012
53 -0.017 0.166 0.07 0.014
54 -0.036 0.139 0.334 0.122
55 -0.027 0.04 0.009 -0.008
56 -0.029 0.038 0.334 0.104
57 -0.029 -0.001 0.334 0.103
58 -0.115 -0.146 0.034 0.018
59 -0.097 -0.119 0.072 0.021
60 -0.125 -0.157 -0.008 0.017
61 -0.127 -0.155 -0.014 0.017
62 -0.123 -0.155 0.334 0.129
63 -0.043 0.051 0.118 0.013
64 -0.055 0.078 0.036 0.001
65 -0.062 0.072 0.004 0
66 -0.064 0.071 0.334 0.113
67 -0.039 0.133 0.039 0.012
68 -0.024 0.158 0.065 0.014
69 -0.049 0.127 0.334 0.123
70 -0.072 0.018 -0.009 -0.005
71 -0.069 0.021 0.334 0.108
72 -0.078 -0.015 0.334 0.108
73 -0.068 0.046 0.102 0.015
74 -0.099 0.065 0.036 0.008
75 -0.106 0.065 0.031 0.007
76 -0.104 0.059 0.334 0.119
77 -0.069 0.114 0.039 0.015
78 -0.07 0.139 0.071 0.017
79 -0.075 0.108 0.334 0.126
80 -0.116 0.031 0.026 0.004
81 -0.114 0.026 0.334 0.116
82 -0.118 0.019 0.334 0.116
83 0.063 0.129 0.057 0.015
84 0.067 0.158 0.085 0.017
85 0.055 0.126 0.334 0.124
86 0.098 0.061 0.042 0
87 0.087 0.05 0.334 0.111
88 0.086 0.025 0.334 0.109
89 0.183 0.075 0.099 0.018
90 0.141 0.049 0.334 0.123
91 0.168 0.075 0.334 0.126
92 0.041 0.017 0.334 0.104
93 -0.078 -0.135 -0.171 0.017 0.02
94 -0.076 -0.116 -0.144 0.064 0.023
95 -0.082 -0.144 -0.179 -0.012 0.02
96 -0.099 -0.158 -0.181 -0.05 0.022
97 -0.082 -0.14 -0.177 0.334 0.133
98 -0.036 -0.048 0.045 0.121 0.011
99 -0.028 -0.059 0.074 0.035 -0.002
100 -0.033 -0.072 0.064 -0.01 -0.003
101 -0.029 -0.068 0.067 0.334 0.11
102 -0.042 -0.044 0.134 0.039 0.011
103 -0.026 -0.032 0.154 0.053 0.011
104 -0.043 -0.054 0.129 0.334 0.122
105 -0.048 -0.085 0.012 -0.029 -0.006
106 -0.038 -0.074 0.021 0.334 0.106
107 -0.049 -0.09 -0.033 0.334 0.107
108 -0.046 -0.079 0.039 0.102 0.014
109 -0.045 -0.11 0.058 0.035 0.006
110 -0.04 -0.115 0.056 0.018 0.005
111 -0.046 -0.115 0.052 0.334 0.118
112 -0.052 -0.08 0.113 0.039 0.014
113 -0.034 -0.078 0.133 0.059 0.015
114 -0.052 -0.086 0.107 0.334 0.125
115 -0.051 -0.125 0.028 0.011 0.003
116 -0.054 -0.125 0.026 0.334 0.116
117 -0.053 -0.127 0.004 0.334 0.115
118 -0.03 0.059 0.132 0.057 0.012
119 -0.005 0.066 0.158 0.084 0.014
120 -0.03 0.052 0.128 0.334 0.122
121 -0.007 0.096 0.06 0.039 -0.003
122 -0.02 0.085 0.05 0.334 0.108
123 -0.013 0.083 0.021 0.334 0.106
124 -0.013 0.182 0.074 0.095 0.015
125 -0.038 0.143 0.05 0.334 0.122
126 -0.017 0.166 0.07 0.334 0.123
127 -0.027 0.04 0.009 0.334 0.101
128 -0.091 -0.112 0.03 0.071 0.018
129 -0.115 -0.145 0.034 0.001 0.014
130 -0.117 -0.146 0.033 -0.005 0.014
131 -0.115 -0.146 0.034 0.334 0.127
132 -0.094 -0.116 0.075 0.01 0.017
133 -0.081 -0.11 0.092 0.032 0.018
134 -0.097 -0.119 0.072 0.334 0.13
135 -0.132 -0.158 -0.014 -0.019 0.014
136 -0.125 -0.157 -0.008 0.334 0.126
137 -0.127 -0.155 -0.014 0.334 0.127
138 -0.03 0.059 0.123 0.049 0.012
139 -0.011 0.065 0.155 0.08 0.014
140 -0.043 0.051 0.118 0.334 0.123
141 -0.045 0.085 0.044 0.022 -0.002
142 -0.055 0.078 0.036 0.334 0.111
143 -0.062 0.072 0.004 0.334 0.11
144 0.012 0.189 0.08 0.106 0.015
145 -0.039 0.133 0.039 0.334 0.122
146 -0.024 0.158 0.065 0.334 0.123
147 -0.072 0.018 -0.009 0.334 0.105
148 -0.059 0.054 0.108 0.047 0.014
149 -0.061 0.058 0.135 0.08 0.017
150 -0.068 0.046 0.102 0.334 0.125
151 -0.1 0.076 0.048 0.044 0.006
152 -0.099 0.065 0.036 0.334 0.117
153 -0.106 0.065 0.031 0.334 0.117
154 -0.059 0.157 0.065 0.092 0.018
155 -0.069 0.114 0.039 0.334 0.124
156 -0.07 0.139 0.071 0.334 0.127
157 -0.116 0.031 0.026 0.334 0.114
158 0.083 0.175 0.09 0.117 0.021
159 0.063 0.129 0.057 0.334 0.124
160 0.067 0.158 0.085 0.334 0.127
161 0.098 0.061 0.042 0.334 0.109
162 0.183 0.075 0.099 0.334 0.128
163 -0.072 -0.112 -0.139 0.015 0.063 0.019
164 -0.079 -0.138 -0.174 0.015 -0.009 0.016
165 -0.1 -0.158 -0.182 -0.002 -0.05 0.018
166 -0.078 -0.135 -0.171 0.017 0.334 0.13
167 -0.075 -0.115 -0.143 0.065 0.003 0.019
168 -0.083 -0.127 -0.152 0.052 -0.018 0.019
169 -0.076 -0.116 -0.144 0.064 0.334 0.133
170 -0.106 -0.173 -0.19 -0.034 -0.063 0.019
171 -0.082 -0.144 -0.179 -0.012 0.334 0.13
172 -0.099 -0.158 -0.181 -0.05 0.334 0.132
173 -0.035 -0.035 0.053 0.125 0.048 0.01
174 -0.01 -0.015 0.063 0.153 0.075 0.011
175 -0.036 -0.048 0.045 0.121 0.334 0.121
176 -0.024 -0.054 0.077 0.038 0.009 -0.005
177 -0.028 -0.059 0.074 0.035 0.334 0.108
178 -0.033 -0.072 0.064 -0.01 0.334 0.107
179 -0.01 0.008 0.186 0.078 0.1 0.012
180 -0.042 -0.044 0.134 0.039 0.334 0.12
181 -0.026 -0.032 0.154 0.053 0.334 0.121
182 -0.048 -0.085 0.012 -0.029 0.334 0.104
183 -0.044 -0.07 0.046 0.107 0.046 0.012
184 -0.022 -0.067 0.053 0.131 0.072 0.014
185 -0.046 -0.079 0.039 0.102 0.334 0.124
186 -0.034 -0.108 0.067 0.044 0.032 0.003
187 -0.045 -0.11 0.058 0.035 0.334 0.116
188 -0.04 -0.115 0.056 0.018 0.334 0.115
189 -0.028 -0.066 0.152 0.062 0.082 0.015
190 -0.052 -0.08 0.113 0.039 0.334 0.124
191 -0.034 -0.078 0.133 0.059 0.334 0.125
192 -0.051 -0.125 0.028 0.011 0.334 0.113
193 0.004 0.083 0.176 0.091 0.118 0.018
194 -0.03 0.059 0.132 0.057 0.334 0.122
195 -0.005 0.066 0.158 0.084 0.334 0.124
196 -0.007 0.096 0.06 0.039 0.334 0.106
197 -0.013 0.182 0.074 0.095 0.334 0.125
198 -0.083 -0.105 0.035 0.076 0.018 0.015
199 -0.065 -0.095 0.042 0.098 0.046 0.016
200 -0.091 -0.112 0.03 0.071 0.334 0.128
201 -0.118 -0.146 0.032 -0.002 -0.006 0.011
202 -0.115 -0.145 0.034 0.001 0.334 0.124
203 -0.117 -0.146 0.033 -0.005 0.334 0.124
204 -0.053 -0.089 0.12 0.039 0.059 0.015
205 -0.094 -0.116 0.075 0.01 0.334 0.127
206 -0.081 -0.11 0.092 0.032 0.334 0.128
207 -0.132 -0.158 -0.014 -0.019 0.334 0.124
208 0.044 0.094 0.195 0.11 0.144 0.019
209 -0.03 0.059 0.123 0.049 0.334 0.122
210 -0.011 0.065 0.155 0.08 0.334 0.124
211 -0.045 0.085 0.044 0.022 0.334 0.108
212 0.012 0.189 0.08 0.106 0.334 0.125
213 -0.044 0.075 0.157 0.082 0.11 0.019
214 -0.059 0.054 0.108 0.047 0.334 0.124
215 -0.061 0.058 0.135 0.08 0.334 0.127
216 -0.1 0.076 0.048 0.044 0.334 0.116
217 -0.059 0.157 0.065 0.092 0.334 0.128
218 0.083 0.175 0.09 0.117 0.334 0.131
219 -0.071 -0.109 -0.136 0.017 0.065 0.007 0.016
220 -0.078 -0.12 -0.145 0.011 0.056 -0.011 0.016
221 -0.072 -0.112 -0.139 0.015 0.063 0.334 0.13
222 -0.117 -0.188 -0.201 -0.024 -0.045 -0.077 0.016
223 -0.079 -0.138 -0.174 0.015 -0.009 0.334 0.127
224 -0.1 -0.158 -0.182 -0.002 -0.05 0.334 0.128
225 -0.09 -0.142 -0.163 0.038 -0.014 -0.032 0.016
226 -0.075 -0.115 -0.143 0.065 0.003 0.334 0.13
227 -0.083 -0.127 -0.152 0.052 -0.018 0.334 0.13
228 -0.106 -0.173 -0.19 -0.034 -0.063 0.334 0.129
229 0.023 0.056 0.101 0.201 0.118 0.16 0.016
230 -0.035 -0.035 0.053 0.125 0.048 0.334 0.12
231 -0.01 -0.015 0.063 0.153 0.075 0.334 0.121
232 -0.024 -0.054 0.077 0.038 0.009 0.334 0.105
233 -0.01 0.008 0.186 0.078 0.1 0.334 0.122
234 -0.009 -0.046 0.072 0.155 0.08 0.106 0.016
235 -0.044 -0.07 0.046 0.107 0.046 0.334 0.123
236 -0.022 -0.067 0.053 0.131 0.072 0.334 0.124
237 -0.034 -0.108 0.067 0.044 0.032 0.334 0.114
238 -0.028 -0.066 0.152 0.062 0.082 0.334 0.125
239 0.004 0.083 0.176 0.091 0.118 0.334 0.128
240 0.015 -0.034 0.08 0.168 0.09 0.121 0.016
241 -0.083 -0.105 0.035 0.076 0.018 0.334 0.125
242 -0.065 -0.095 0.042 0.098 0.046 0.334 0.126
243 -0.118 -0.146 0.032 -0.002 -0.006 0.334 0.121
244 -0.053 -0.089 0.12 0.039 0.059 0.334 0.126
245 0.044 0.094 0.195 0.11 0.144 0.334 0.129
246 -0.044 0.075 0.157 0.082 0.11 0.334 0.13
247 -0.071 -0.109 -0.136 0.017 0.065 0.007 0.016
248 -0.071 -0.109 -0.136 0.017 0.065 0.007 0.334 0.127
249 -0.078 -0.12 -0.145 0.011 0.056 -0.011 0.334 0.127
250 -0.117 -0.188 -0.201 -0.024 -0.045 -0.077 0.334 0.127
251 -0.09 -0.142 -0.163 0.038 -0.014 -0.032 0.334 0.127
252 0.023 0.056 0.101 0.201 0.118 0.16 0.334 0.127
253 -0.009 -0.046 0.072 0.155 0.08 0.106 0.334 0.127
254 0.015 -0.034 0.08 0.168 0.09 0.121 0.334 0.127
255 -0.071 -0.109 -0.136 0.017 0.065 0.007 0.334 0.127

[Bonus points to whoever can explain why the last ari_cor has a missing value in the last model. I checked. It is not a problem with my function. I don’t know.]
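One candidate explanation, which I have not checked against the Wicherts data: regression-based factor scores are an exact linear combination of the standardized subtests, so once each subtest is residualized on those scores, the same weighted sum of the residuals is exactly zero. The seven residuals then span only six dimensions, and lm reports NA for the last one it encounters. Model 247 (all seven subtest residuals, no g) shows the same six betas and R2 adj. as model 219, which is consistent with this. The sketch below reproduces the pattern with simulated data and base R's factanal standing in for psych's fa; the loadings, sample size and variable names are all made up.

```r
set.seed(1)
n <- 500
g_true <- rnorm(n)
loadings <- c(.8, .7, .6, .6, .5, .5, .4)   # hypothetical loadings

# Seven simulated subtests driven by one common factor
X <- sapply(loadings, function(l) l * g_true + sqrt(1 - l^2) * rnorm(n))
colnames(X) <- paste0("test", 1:7)
Z <- scale(X)

# Regression factor scores: a linear combination of the columns of Z
fa_fit <- factanal(Z, factors = 1, scores = "regression")
g_hat  <- as.numeric(fa_fit$scores)

# Residualize every subtest on the factor scores
res <- apply(Z, 2, function(z) resid(lm(z ~ g_hat)))

# The residual matrix has 7 columns but only rank 6
rank_res <- qr(res)$rank

# Full model: all residuals plus g; lm reports NA for the redundant predictor
gpa <- 0.4 * g_true + rnorm(n)
d   <- data.frame(res, g = g_hat, gpa = gpa)
fit <- lm(gpa ~ ., data = d)
n_na <- sum(is.na(coef(fit)))

print(rank_res)
print(coef(fit))
```

If this is what happens in the real data, the NA is not a bug but a consequence of residualizing on scores computed from the same variables.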

So Timofey Pnin is right. The beta does stay exactly the same across models, at least to 3 digits.

We may also note that adding the other predictors did not have much effect: g alone (model #8) R2 adj. = .108, best model according to R2 adj. = .133 (#97). Notice how this model has negative betas for the other predictors. In other words, one is better off with lower scores. Surely that can’t be right. It is probably just a fluke due to overfitting…

Testing overfitting

We can test overfitting using lasso regression (read this book, seriously, it’s a great book!). Because the cross-validation used to tune the lasso assigns folds at random, the results vary from run to run, so we repeat it a large number of times and examine the overall results.

#lasso regression
fits_2 = MOD_repeat_cv_glmnet(df = w_res,
                              dependent = "GPA",
                              predictors = colnames(w_res)[-9],
                              runs = 500)


Lasso results

                 ravenscore  lre_cor  nse_cor  voc_cor  van_cor  hfi_cor  ari_cor        g
mean                      0        0        0        0        0        0        0    0.096
median                    0        0        0        0        0        0        0    0.104
sd                        0        0        0        0    0.001        0        0    0.033
fraction_zeroNA           1        1        1        1    0.996        1        1     0.01


The lasso confirms our suspicions. The non-g variables were fairly useless; their apparent usefulness was due to overfitting. g retained its usefulness in 99% of the runs. The most promising of the other candidates (van_cor) was only useful in 0.4% of runs.

This is probably worth writing into a short paper. Contact me if you are willing to do this. I will help you, but I don’t have time to write it all myself.