intelligence / IQ / cognitive ability

Are PISA items biased?

The use of the Programme for International Student Assessment (PISA) across nations, cultures, and languages has been criticized. The key criticisms point to the linguistic and cultural biases potentially underlying the design of reading comprehension tests, raising doubts about the legitimacy of comparisons across economies. Our research focused on the type and magnitude of invariance or non-invariance in the PISA Reading Comprehension test by language, culture, and economic development relative to performance of the Australian English-speaking reference group used to develop the tests. Multi-Group Confirmatory Factor Analysis based on means and covariance structure (MACS) modeling was used to establish a dMACS effect size index for each economy for the degree of non-invariance. Only three wealthy, English-speaking countries had scalar invariance with Australia. Moderate or large effects were observed in just 31% of the comparisons. PISA index of economic, social and cultural status had a moderate inverse correlation with dMACS suggesting that socioeconomic resourcing of education played a significant role in measurement invariance, while educational practice and language factors seemed to play a further small role in non-invariance. Alternative approaches to reporting PISA results consistent with non-invariance are suggested.

This is both a neat and a frustrating study. PISA does not give all students all the items. Students get random booklets/parcels of some subset of items, so there is a massive missing-data design (like SAPA), but with groups. The authors wanted to examine literacy for whatever reason, and they picked booklet 11, which has 28 literacy items translated into various languages. One can get a feel for the items because PISA makes new ones for every wave and releases the old ones. Here are the 2018 items.

(It is pretty outdated. No one buys ringtones with SMS anymore. That’s like 2003 tier behavior. Made by boomers?)

Anyway, so they end up with those 28 items with complete data for 32,704 people from 55 countries. They fit a simple g-factor model to these 28 items, and using Australia as the Chosen People, they compute the dMACS score (a metric of how differently the items work across groups) for each Australia–focal country comparison. Their result looks like this:

OK, so a lot of countries show non-invariance, which is of course expected, though most show only small amounts overall. Annoyingly, the authors don’t compute the effect of this on the score differences between the countries. If we find that items don’t work similarly, and we are using scores from these items for all kinds of research, then we are pretty interested in how this test bias actually affects the comparisons! Seemingly, the authors disagree. They do note that all the countries with small measurement non-invariance relative to Australia are wealthy countries, and that the gaps don’t seem to relate strongly to script or language family, though sharing the same language probably has some effect. The authors then say:

Alternately, grouping countries into clusters of “countries-like-me,” assuming they use the same language and/or share the same educational culture, might be a better way to identify invariance of responding to test items and to report results. Restricting reports to those nations would likely be defensible and informative. For example, it may be conventional to argue that East Asian societies which use Mandarin (i.e., Singapore, China, Macao, and Taiwan) and having strong dependence on testing and public examinations and shared cultural approaches to schooling and testing should be compared, independent of other East Asian societies which have different writing scripts and languages, despite having similar cultural histories and forces (i.e., Japan, Korea, and Hong Kong). However, this study has suggested that grouping and reporting performances among East Asian societies may be defensible, since the range of dMACS relative to Australia for all seven economies was just 0.136 to 0.199. Likewise, one could imagine Nordic countries (i.e., Finland, Sweden, Norway, Iceland, and Denmark) or continental Western European countries choosing to allow inter-country comparisons because of their similarities in ESCS, despite differences in language. Nonetheless, separate studies that demonstrate that such natural geographic and cultural groupings were defensible requires conducting parallel analyses using one or more of the contributing nations as a reference point.

Yeah, one could cluster the countries based on this. But since the authors didn’t actually compute the entire distance matrix, this can’t be done; one can only guess. Naturally, when they were doing the above, all they had to do was write a loop to compute the entire matrix of pairwise distances, but … no. Another obvious idea is to take items from another domain (PISA has three others: math, science, and problem solving) and do the same. Do these show the same results? One could also do this across waves of PISA to check for trends, since some countries grew closer in e.g. wealth between the first wave (2000) and the newest (2018). Doing these two things would be a pretty obvious and important study: it would be the first large-scale analysis of whether national IQ gaps are massively affected by measurement non-invariance or not.

There is one prior study (Wu et al 2007) that looked at measurement invariance in these scholastic test data, but it also did not compute the effects on mean scores. For reference, this can easily be done using the metrics in the mirt package’s empirical_ES() function. Meade 2010 provides a nice introduction to these metrics, but the one we want here is the Expected Test Score Standardized Difference (ETSSD), which is the expected effect size, in Cohen’s d units, of the deviations from measurement invariance for the items in question. These differences will probably be small because the non-invariance varies at random by item. One item will happen to be easier for one group, harder for another, have a slightly higher loading in a third, a lower loading in a fourth, and so on. These effects will cancel out if they are not systematic, so the estimates of mean differences by group will not be biased.
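To see why non-systematic non-invariance should wash out at the test level, here is a minimal simulation (purely hypothetical numbers, not the PISA items): each item gets a random intercept shift (DIF) against the focal group, and the induced bias in the mean test score is just the average of those shifts.

```python
import random

random.seed(1)

n_items, n_sims = 28, 2000

def mean_score_shift(item_shifts):
    # bias in the group's expected mean test score induced by per-item
    # intercept DIF (on the average-item scale)
    return sum(item_shifts) / len(item_shifts)

# non-systematic DIF: each item's shift is drawn at random around zero
random_bias = [
    mean_score_shift([random.gauss(0, 0.3) for _ in range(n_items)])
    for _ in range(n_sims)
]
avg_random_bias = sum(random_bias) / n_sims

# systematic DIF: every item shifted against the focal group by the same amount
systematic_bias = mean_score_shift([-0.3] * n_items)

print(round(abs(avg_random_bias), 2))  # close to 0: random DIF cancels out
print(round(systematic_bias, 2))       # -0.3: systematic DIF does not
```

This is exactly the distinction ETSSD captures at the test level: only the systematic component survives aggregation across items.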

So, to answer the question in the title: yes, the items show pervasive bias, as expected, but we don’t know what the impact of this is because the authors did not compute these metrics.

Oh on a technical note:

It is pretty weird that the fit of the items is better in the combined sample than in the Australian sample alone. The authors showed that the items don’t work entirely the same across groups, and for some groups, not at all the same. This means that combining all the data should lead to worse fit, yet somehow they get a much better fit in the combined sample! Note that this fit difference is not about different factor structures: all the data are fit with a g-only model with all items loading, and the differences they investigated are only in loadings/slopes and intercepts (metric and scalar invariance). 🤔 Do all of these fit metrics have a sample-size bias? It’s the best idea I can think of. One could just subset at random from the total sample down to the size of the Australian sample to check this idea out.
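Some back-of-the-envelope arithmetic (with hypothetical df and misfit values) shows why sample size matters for some fit statistics but not others: the ML chi-square is (N − 1) times the minimized fit function, so it grows linearly with N for a fixed amount of misfit, while RMSEA is constructed to divide that N-dependence back out.

```python
import math

def chisq(f_min, n):
    # ML test statistic: chi2 = (N - 1) * F_min, so it scales with N
    return (n - 1) * f_min

def rmsea(chi2, df, n):
    # RMSEA divides the N-dependence back out
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

df, f_min = 350, 0.5          # hypothetical model df and amount of misfit
for n in (2000, 32704):       # roughly: one country vs the pooled sample
    c = chisq(f_min, n)
    print(n, round(c, 1), round(rmsea(c, df, n), 3))
```

So any index built directly on chi-square without this correction will look worse in the larger sample for the same misfit; the puzzle is which indices the authors reported and how they behave in the other direction.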

Book review Genetics / behavioral genetics intelligence / IQ / cognitive ability Sociology

New paper out: Human Biodiversity for Beginners: A Review of Charles Murray’s Human Diversity

Human Diversity is Charles Murray’s latest book. This review evaluates the claims made in the book and places both the author’s theses and their criticisms in their historical context. It concludes that this book is valuable as an updated summary of current knowledge about psychological differences (in the averages) between genders, races, and social classes. As such it is a useful introduction into the field for everyone interested in it.

The long-awaited review. Actually, I read the book back in … (checks GoodReads) … January 4th. But here it is. Some other reviews by informed readers:

These are reviews by people I reckon have some informed knowledge of the field, either because they are themselves scientists (at least sometimes), or because they are avid readers of the literature.

For contrast, some reviews by big media; I don’t know any of these authors:

I didn’t read any of these reviews before writing my own, so mine is not colored by theirs.

intelligence / IQ / cognitive ability

Gap closers

There are a lot of these people.

  • Ferguson, R. F. (1998). Can schools narrow the Black–White test score gap?
  • Kober, N. (2001). It Takes More Than Testing: Closing the Achievement Gap. A Report of the Center on Education Policy.
  • Bainbridge, W. L., & Lasley, T. J. (2002). Demographics, diversity, and K-12 accountability: The challenge of closing the achievement gap. Education and Urban Society, 34(4), 422-437.
  • Thernstrom, A., & Thernstrom, S. (2004). No excuses: Closing the racial gap in learning. Simon and Schuster.
  • Wenglinsky, H. (2004). Closing the racial achievement gap: The role of reforming instructional practices. Education Policy Analysis Archives, 12, 64.
  • Murnane, R. J., Willett, J. B., Bub, K. L., McCartney, K., Hanushek, E., & Maynard, R. (2006). Understanding trends in the black-white achievement gaps during the first years of school. Brookings-Wharton Papers on Urban Affairs, 97-135.

Closing the black-white achievement gap when schools remain segregated by race and income is extraordinarily difficult. To succeed, schools that serve concentrations of poor children must be staffed with skilled, experienced teachers who have learned to work together to provide large amounts of consistent, coordinated, high-quality instruction. Closing the gap is the greatest educational challenge facing the United States today.

  • Paige, R., & Witty, E. (2009). The black-white achievement gap: Why closing it is the greatest civil rights issue of our time. AMACOM Div American Mgmt Assn.
  • Ferguson, R., Stellar, A., Schools, B. C. P., & Morganton, N. C. (2010). Toward excellence with equity: An emerging vision for closing the achievement gap. Evidence-based Practice Articles, 56.
  • Jencks, C., & Phillips, M. (Eds.). (2011). The Black-White test score gap. Brookings Institution Press.
  • Blackford, K., & Khojasteh, J. (2013). Closing the achievement gap: Identifying strand score differences. American Journal of Educational Studies, 6(2), 5.
  • Webb, M., & Thomas, R. (2015, January). Teachers’ Perceptions of Educators’ and Students’ Role in Closing the Achievement Gap. In National Forum of Teacher Education Journal (Vol. 25, No. 3).

An impossible educational challenge, sorry.

Immigration intelligence / IQ / cognitive ability Social psychology

New paper out: Public Preferences and Reality: Crime Rates among 70 Immigrant Groups in the Netherlands

We estimated crime rates among 70 origin-based immigrant groups in the Netherlands for the years 2005-2018. Results indicated that crime rates have overall been falling for each group in the period of study, and in the country as a whole, with about a 50% decline since 2005. Immigrant groups varied widely in crime rates, with East Asian countries being lower and Muslim countries, as well as Dutch (ex-)colonial possessions in the Caribbean, being higher than the natives. We found that national IQ and Muslim percentage of population of the origin countries predicted relative differences in crime rates, r’s = .64 and .45, respectively, in line with previous research both in the Netherlands and in other European countries. Furthermore, we carried out a survey of 200 persons living in the Netherlands to measure their preferences for immigrants for each origin country in terms of getting more or fewer persons from each country. Following Carl (2016), we computed a mean opposition metric for each country. This correlated strongly with actual crime rates we found, r’s = .46 and .57, for population weighted and unweighted results, respectively. The main outliers in the regression were the Dutch (ex-)colonial possessions, and if these are excluded, the correlations increase to .68 and .66, respectively. Regressions with plausible confounders (Muslim percentage, geographical fixed effects) showed that crime rates continued to be a useful predictor of opposition to specific countries. The results were interpreted as being in line with a rational voter preference for less crime-prone immigrants.

Main results

Video walkthrough

Genetics / behavioral genetics intelligence / IQ / cognitive ability Science

Sesardić’s conjecture: preliminary evidence in favor

In Making sense of Heritability, Sesardić wrote:

On the less obvious side, a nasty campaign against H could have the unintended effect of strengthening H [hereditarianism] epistemically, and making the criticism of H look less convincing. Simply, if you happen to believe that H is true and if you also know that opponents of H will be strongly tempted to “play dirty,” that they will be eager to seize upon your smallest mistake, blow it out of all proportion, and label you with Dennett’s “good epithets,” with a number of personal attacks thrown in for good measure, then if you still want to advocate H, you will surely take extreme care to present your argument in the strongest possible form. In the inhospitable environment for your views, you will be aware that any major error is a liability that you can hardly afford, because it will more likely be regarded as a reflection of your sinister political intentions than as a sign of your fallibility. The last thing one wants in this situation is the disastrous combination of being politically denounced (say, as a “racist”) and being proved to be seriously wrong about science. Therefore, in the attempt to make themselves as little vulnerable as possible to attacks they can expect from their uncharitable and strident critics, those who defend H will tread very cautiously and try to build a very solid case before committing themselves publicly. As a result, the quality of their argument will tend to rise, if the subject matter allows it.22

It is different with those who attack H. They are regarded as being on the “right” side (in the moral sense), and the arguments they offer will typically get a fair hearing, sometimes probably even a hearing that is “too fair.” Many a potential critic will feel that, despite seeing some weaknesses in their arguments, he doesn’t really want to point them out publicly or make much of them because this way, he might reason, he would just play into the hands of “racists” and “right-wing ideologues” that he and most of his colleagues abhor. 23 Consequently, someone who opposes H can expect to be rewarded with being patted on the back for a good political attitude, while his possible cognitive errors will go unnoticed or unmentioned or at most mildly criticized.

Now, given that an advocate of H and an opponent of H find themselves in such different positions, who of the two will have more incentive to invest a lot of time and hard work to present the strongest possible defense of his views? The question answers itself. In the academic jungle, as elsewhere, it is the one who anticipates trouble who will spare no effort to be maximally prepared for the confrontation.

If I am right, the pressure of political correctness would thus tend to result, ironically, in politically incorrect theories becoming better developed, more carefully articulated, and more successful in coping with objections. On the other hand, I would predict that a theory with a lot of political support would typically have a number of scholars flocking to its defense with poorly thought out arguments and with speedily generated but fallacious “refutations” of the opposing view. 24 This would explain why, as Ronald Fisher said, “the best causes tend to attract to their support the worst arguments” (Fisher 1959: 31).

Example? Well, the best example I can think of is the state of the debate about heritability. Obviously, the hypothesis of high heritability of human psychological variation – and especially the between-group heritability of IQ differences – is one of the most politically sensitive topics in contemporary social science. The strong presence of political considerations in this controversy is undeniable, and there is no doubt about which way the political wind is blowing. When we turn to discussions in this context that are ostensibly about purely scientific issues two things are striking. First, as shown in previous chapters, critics of heritability very often rely on very general, methodological arguments in their attempts to show that heritability values cannot be determined, are intrinsically misleading, are low, are irrelevant, etc. Second, these global methodological arguments – although defended by some leading biologists, psychologists, and philosophers of science – are surprisingly weak and unconvincing. Yet they continue to be massively accepted, hailed as the best approach to the nature–nurture issue, and further transmitted, often with no detailed analysis or serious reflection.

Footnotes are:

22 This is not guaranteed, of course. For example, biblical literalists who think that the world was created 6,000 years ago can expect to be ridiculed as irrational, ignorant fanatics. So, if they go public, it is in their strong interest to use arguments that are not silly, but the position they have chosen to advocate leaves them with no good options. (I assume that it is not a good option to suggest, along the lines of Philip Gosse’s famous account, that the world was created recently, but with the false traces of its evolutionary history that never happened.)

23 A pressure in the opposite direction would not have much force. It is notorious that in the humanities and social science departments, conservative and other right-of-center views are seriously under-represented (cf. Ladd & Lipset 1975; Redding 2001).

24 I am speaking of tendencies here, of course. There would be good and bad arguments on both sides.

The Fisher (1959) reference given is actually about probability theory; in that context, Fisher is not writing about genetics or biology and their relation to politics.

Did no one come up with this idea before? That seems unlikely. Robert Plomin and Thomas Bouchard came close in their 1987 chapters in the book Arthur Jensen: Consensus And Controversy (see also this earlier post). Bouchard:

One might fairly claim that this chapter does not constitute a critical appraisal of the work of Arthur Jensen on the genetics of human abilities, but rather a defense. If a reader arrives at that conclusion he or she has overlooked an important message. Since Jensen rekindled the flames of the heredity vs environment debate in 1969, human behavior genetics has undergone a virtual renaissance. Nevertheless, a tremendous amount of energy has been wasted. In my discussions of the work of Kamin, Taylor, Farber, etc., I have often been as critical of them as they have been of the hereditarian program. While I believe that their criticisms have failed and their conclusions are false, I also believe that their efforts were necessary. They were necessary because human behavior genetics has been an insufficiently self-critical discipline. It adopted the quantitative models of experimental plant and animal genetics without sufficient regard for the many problems involved in justifying the application of those models in human research. Furthermore, it failed to deal adequately with most of the issues that are raised and dealt with by meta-analytic techniques. Human behavior geneticists have, until recently, engaged in inadequate analyses. Their critics, on the other hand, have engaged in pseudo-analyses. Much of the answer to the problem of persuading our scientific colleagues that behavior is significantly influenced by genetic processes lies in a more critical treatment of our own data and procedures. The careful and systematic use of meta-analysis, in conjunction with our other tools, will go a long way toward accomplishing this goal. It is a set of tools and a set of attitudes that Galton would have been the first to apply in his own laboratory.

More behavioral genetic data on IQ have been collected since Jensen’s 1969 monograph than in the fifty years preceding it. As mentioned earlier, I would argue that much of this research was conducted because of Jensen’s monograph and the controversy and criticism it aroused.

A decade and a half ago Jensen clearly and forcefully asserted that IQ scores are substantially influenced by genetic differences among individuals. No telling criticism has been made of his assertion, and newer data consistently support it. No other finding in the behavioral sciences has been researched so extensively, subjected to so much scrutiny, and verified so consistently.

Chris Brand also has a chapter in this book; perhaps it has something relevant, but I don’t recall it well.

To return to Sesardić, his contention is that non-scientific opposition to some scientific claim will result in so-called double standards: higher standards for proponents of the claim, so that if reality supports the claim, higher-quality evidence will be gathered and published. It is the reverse for critics of the claim: they will face less scrutiny, so their published arguments and evidence will tend to be poorer. Do we have some kind of objective way to test this claim? We do. We can measure scientific rigor by scientific field or subfield, and compare. Probably the most left-wing field of psychology is social psychology, and it has massive issues with the replication crisis. Intelligence and behavioral genetics research, on the other hand, is among the least left-wing areas of psychology (nearly a 50-50 balance of self-placed politics) and has no such big problems; by the conjecture, it should thus have high rigor. A simple way to measure rigor is to compile data about statistical power by field. This is sometimes calculated as part of meta-analyses. Sean Last has compiled such values, reproduced below.

| Citation | Discipline | Mean/Median Power |
| --- | --- | --- |
| Button et al. (2013) | Neuroscience | 21% |
| Button et al. (2013) | Brain imaging | 8% |
| Smaldino and McElreath (2016) | Social and behavioral sciences | 24% |
| Szucs and Ioannidis (2017) | Cognitive neuroscience | 14% |
| Szucs and Ioannidis (2017) | Psychology | 23% |
| Szucs and Ioannidis (2017) | Medical | 23% |
| Mallet et al. (2017) | Breast cancer | 16% |
| Mallet et al. (2017) | Glaucoma | 11% |
| Mallet et al. (2017) | Rheumatoid arthritis | 19% |
| Mallet et al. (2017) | Alzheimer’s | 9% |
| Mallet et al. (2017) | Epilepsy | 24% |
| Mallet et al. (2017) | Multiple sclerosis | 24% |
| Mallet et al. (2017) | Parkinson’s | 27% |
| Lortie-Forgues and Inglis (2019) | Education | 23% |
| Nuijten et al. (2018) | Intelligence | 49% |
| Nuijten et al. (2018) | Intelligence – group differences | 57% |

The main issue with this is that the numbers concern either median or mean power, with some inconsistency across fields. The median is usually lower, so one could convert the values using their mean observed ratio.
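For intuition about what those percentages mean, here is a quick normal-approximation power calculation for a two-sided two-sample test (the effect size and sample sizes are hypothetical, chosen only to mimic a typical small lab study):

```python
from math import erf, sqrt

def phi(x):
    # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power_two_sample(d, n_per_group):
    # normal approximation to the power of a two-sided two-sample test
    z = 1.959964  # critical z for alpha = .05, two-sided
    ncp = d * sqrt(n_per_group / 2.0)
    return 1.0 - phi(z - ncp) + phi(-z - ncp)

# d = 0.35 with 20 per group: power in the ~20% range seen in the table;
# reaching ~80% power for the same effect needs over six times the sample
print(round(power_two_sample(0.35, 20), 2))
print(round(power_two_sample(0.35, 130), 2))
```

In other words, a field whose median power is ~20% is running studies that would miss a real medium-small effect four times out of five.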

I should very much like someone to do a more detailed study of this. I imagine one would do the following:

  1. Acquire a large dataset of scientific articles, including title, authors, abstract, keywords, fulltext, and references. This can be done either via Scihub (difficult) or by mining open access journals (probably easy).
  2. Use algorithms to extract the data of interest. Studies calculating power usually rely on so-called focal analyses, i.e., the main or important statistical tests. These are hard to identify using simple algorithms, but such algorithms can extract all reported tests (those with a standardized format, that is!). Check out the work by Nuijten et al linked above. A better idea is to get a dataset of manually extracted data, and then train a neural network to extract them as well. I think one can succeed in training such an algorithm to be at least as accurate as human raters. When this is done, one can use it on every paper one has data about. Furthermore, one should look into additional automated measures of scientific rigor or quality. This can be relatively simple stuff like counting table, figure, and reference density, the presence of robustness tests, or mentions of key terms such as “statistical power” and “publication bias”. It can also be more complicated, such as an algorithm that predicts whether a paper will likely replicate, trained on data from large replication studies. A prototype of such an algorithm has been developed and reached an AUC of .77!
  3. Relate the measures of scientific rigor to indicators of political view of authors, or the conclusions of the paper, or the topic in general. Control for any obvious covariates such as year of publication.
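Step 2 is feasible because APA-formatted statistics are machine-readable. A minimal statcheck-style sketch (the example text is made up; a real extractor would also need patterns for F, r, χ², etc.):

```python
import re

# statcheck-style pattern for APA-formatted t-tests: "t(df) = value, p op value"
T_TEST = re.compile(
    r"t\s*\(\s*(\d+)\s*\)\s*=\s*(-?\d+\.?\d*)\s*,\s*p\s*([<=>])\s*(\.\d+)"
)

text = ("The effect was significant, t(28) = 2.31, p = .028, "
        "while the control contrast was not, t(30) = 1.02, p = .316.")

results = T_TEST.findall(text)
for df, t_value, relation, p_value in results:
    print(df, t_value, relation, p_value)
```

With the degrees of freedom and test statistic in hand, one can recompute the p-value and flag inconsistencies, which is exactly what the Nuijten et al tooling does at scale.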
Genomics intelligence / IQ / cognitive ability

Predicting IQ from genetics: how far have we come? [January 2020]

Results from Allegrini et al 2019

Counting significant hits was always a dumb way to measure progress in genomic prediction of a trait. Breeders working with animals and plants never bothered with this approach; they used ridge regression for best predictive power (a “two cultures” problem, no doubt). Researchers in human genetics are starting to catch up, implementing a clever elastic net (Enet) approach for array data files (Qian et al 2019, called snpnet, based on glmnet). We are still waiting for this method to be widely used. It is possible to do summary-statistics-based Enet too (Mak et al 2017, called lassosum), but again, not many have done it yet.
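The breeders’ approach is just penalized regression. A pure-Python sketch of ridge regression on toy genotype data (all numbers hypothetical) shows the key difference from hit counting: every SNP contributes, shrunk jointly toward zero, with no significance filter.

```python
import random

random.seed(0)

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def solve(A, b):
    # Gaussian elimination with partial pivoting; A must be square
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def ridge(X, y, lam):
    # beta = (X'X + lam*I)^(-1) X'y: all predictors shrunk jointly,
    # no per-SNP significance filtering
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    for i in range(len(XtX)):
        XtX[i][i] += lam
    Xty = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]
    return solve(XtX, Xty)

# toy data: 50 individuals x 5 SNPs coded 0/1/2, sparse true effects
n, p = 50, 5
X = [[random.choice([0, 1, 2]) for _ in range(p)] for _ in range(n)]
true_beta = [0.5, -0.3, 0.2, 0.0, 0.0]
y = [sum(b * x for b, x in zip(true_beta, row)) + random.gauss(0, 0.5)
     for row in X]

beta_hat = ridge(X, y, lam=1.0)
print([round(b, 2) for b in beta_hat])  # should track true_beta, shrunk a bit
```

At real scale (hundreds of thousands of individuals, millions of SNPs) the same estimator is computed with specialized solvers, which is what snpnet and lassosum provide for the elastic net case.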

That being said, we still see a lot of progress owing to larger datasets and some improvements in using the output from the ‘single-variant-at-a-time’ regressions used in regular GWASs. A brief summary. I focus on the TEDS sample (a big UK twin sample with good DNA and cognitive testing) because this is the largest dataset with great cognitive testing that was not used to train the GWASs. Someone could use the new subset of UK Biobank with improved cognitive testing to replicate the below (Cox et al 2019, n = 29k).

  • Davies, G., Marioni, R. E., Liewald, D. C., Hill, W. D., Hagenaars, S. P., Harris, S. E., … & Cullen, B. (2016). Genome-wide association study of cognitive functions and educational attainment in UK Biobank (N= 112 151). Molecular psychiatry, 21(6), 758.
    • Polygenic score analyses indicate that up to 5% of the variance in cognitive test scores can be predicted in an independent cohort.
  • Selzam, S., Krapohl, E., von Stumm, S., O’Reilly, P. F., Rimfeld, K., Kovas, Y., … & Plomin, R. (2017). Predicting educational achievement from DNA. Molecular psychiatry, 22(2), 267.
    • We found that EduYears GPS explained greater amounts of variance in educational achievement over time, up to 9% at age 16, accounting for 15% of the heritable variance. This is the strongest GPS prediction to date for quantitative behavioral traits.
    • Not quite intelligence, but closer to intelligence (g) than to educational attainment.
  • Krapohl, E., Patel, H., Newhouse, S., Curtis, C. J., von Stumm, S., Dale, P. S., … & Plomin, R. (2018). Multi-polygenic score approach to trait prediction. Molecular psychiatry, 23(5), 1368.
    • The MPS approach predicted 10.9% variance in educational achievement, 4.8% in general cognitive ability and 5.4% in BMI in an independent test set, predicting 1.1%, 1.1%, and 1.6% more variance than the best single-score predictions.
  • Allegrini, A. G., Selzam, S., Rimfeld, K., von Stumm, S., Pingault, J. B., & Plomin, R. (2019). Genomic prediction of cognitive traits in childhood and adolescence. Molecular psychiatry, 24(6), 819.
    • In a representative UK sample of 7,026 children at ages 12 and 16, we show that we can now predict up to 11% of the variance in intelligence and 16% in educational achievement.
    • As above, educational achievement.

As it so happens, there is a paper for each year, letting one see four years of progress.

Important caveat to the above! These predictions are not done on sibling pairs. When they are (Selzam et al 2019), the validity is reduced by ~50%. This indicates some kind of training problem with the GWASs: they either train on family-related variance, detect population structure and use that, or something more complicated.

intelligence / IQ / cognitive ability Stereotypes

Stereotype threat: current evidence in January 2020

Stereotype threat is:

a situational predicament in which people are or feel themselves to be at risk of conforming to stereotypes about their social group.[1][2] Stereotype threat is purportedly a contributing factor to long-standing racial and gender gaps in academic performance. It may occur whenever an individual’s performance might confirm a negative stereotype because stereotype threat is thought to arise from a particular situation, rather than from an individual’s personality traits or characteristics. Since most people have at least one social identity which is negatively stereotyped, most people are vulnerable to stereotype threat if they encounter a situation in which the stereotype is relevant. Situational factors that increase stereotype threat can include the difficulty of the task, the belief that the task measures their abilities, and the relevance of the stereotype to the task. Individuals show higher degrees of stereotype threat on tasks they wish to perform well on and when they identify strongly with the stereotyped group. These effects are also increased when they expect discrimination due to their identification with a negatively stereotyped group.[3] Repeated experiences of stereotype threat can lead to a vicious circle of diminished confidence, poor performance, and loss of interest in the relevant area of achievement.[4]

At least, that’s the theory. What is the evidence? It’s the usual thing: a bunch of small studies with various p-hacking issues, and then some larger ones with null results. I summarize the large-sample studies and meta-analyses. There is also a published review by skeptical academics:

The stereotype threat literature primarily comprises lab studies, many of which involve features that would not be present in high-stakes testing settings. We meta-analyze the effect of stereotype threat on cognitive ability tests, focusing on both laboratory and operational studies with features likely to be present in high stakes settings. First, we examine the features of cognitive ability test metric, stereotype threat cue activation strength, and type of nonthreat control group, and conduct a focal analysis removing conditions that would not be present in high stakes settings. We also take into account a previously unrecognized methodological error in how data are analyzed in studies that control for scores on a prior cognitive ability test, which resulted in a biased estimate of stereotype threat. The focal sample, restricting the database to samples utilizing operational testing-relevant conditions, displayed a threat effect of d = −.14 (k = 45, N = 3,532, SDδ = .31). Second, we present a comprehensive meta-analysis of stereotype threat. Third, we examine a small subset of studies in operational test settings and studies utilizing motivational incentives, which yielded d-values ranging from .00 to −.14. Fourth, the meta-analytic database is subjected to tests of publication bias, finding nontrivial evidence for publication bias. Overall, results indicate that the size of the stereotype threat effect that can be experienced on tests of cognitive ability in operational scenarios such as college admissions tests and employment testing may range from negligible to small.

Sex: females and math

For sex differences, they picked women and math as the claim to attempt to secure. The reason for this choice is that women’s relatively worse math performance is a major factor in their lower STEM representation, which feminists desperately want to change. Jelte Wicherts’ former PhD student, Paulette Flore, basically destroyed this idea with her dissertation. Some of it has been published as articles:

Although the effect of stereotype threat concerning women and mathematics has been subject to various systematic reviews, none of them have been performed on the sub-population of children and adolescents. In this meta-analysis we estimated the effects of stereotype threat on performance of girls on math, science and spatial skills (MSSS) tests. Moreover, we studied publication bias and four moderators: test difficulty, presence of boys, gender equality within countries, and the type of control group that was used in the studies. We selected study samples when the study included girls, samples had a mean age below 18years, the design was (quasi-)experimental, the stereotype threat manipulation was administered between-subjects, and the dependent variable was a MSSS test related to a gender stereotype favoring boys. To analyze the 47 effect sizes, we used random effects and mixed effects models. The estimated mean effect size equaled -0.22 and significantly differed from 0. None of the moderator variables was significant; however, there were several signs for the presence of publication bias. We conclude that publication bias might seriously distort the literature on the effects of stereotype threat among schoolgirls. We propose a large replication study to provide a less biased effect size estimate.

And then she did this replication study:

The effects of gender stereotype threat on mathematical test performance in the classroom have been extensively studied in several cultural contexts. Theory predicts that stereotype threat lowers girls’ performance on mathematics tests, while leaving boys’ math performance unaffected. We conducted a large-scale stereotype threat experiment in Dutch high schools (N = 2064) to study the generalizability of the effect. In this registered report, we set out to replicate the overall effect among female high school students and to study four core theoretical moderators, namely domain identification, gender identification, math anxiety, and test difficulty. Among the girls, we found neither an overall effect of stereotype threat on math performance, nor any moderated stereotype threat effects. Most variance in math performance was explained by gender, domain identification, and math identification. We discuss several theoretical and statistical explanations for these findings. Our results are limited to the studied population (i.e. Dutch high school students, age 13–14) and the studied domain (mathematics).

Various groups and GRE-like scores

A little-known report (12 years old, 13 citations!) presents some strong evidence, based on two previous papers:

The figures speak for themselves:

These are all based on large samples in high-stakes tests, i.e. real-life, important contexts.

Academic stereotypes and tracking — in China

Educational tracks create differential expectations of student ability, raising concerns that the negative stereotypes associated with lower tracks might threaten student performance. The authors test this concern by drawing on a field experiment enrolling 11,624 Chinese vocational high school students, half of whom were randomly primed about their tracks before taking technical skill and math exams. As in almost all countries, Chinese students are sorted between vocational and academic tracks, and vocational students are stereotyped as having poor academic abilities. Priming had no effect on technical skills and, contrary to hypotheses, modestly improved math performance. In exploring multiple interpretations, the authors highlight how vocational tracking may crystallize stereotypes but simultaneously diminishes stereotype threat by removing academic performance as a central measure of merit. Taken together, the study implies that reminding students about their vocational or academic identities is unlikely to further contribute to achievement gaps by educational track.


Normie authors:

In many regions around the world students with certain immigrant backgrounds underachieve in educational settings. This paper provides a review and meta-analysis on one potential source of the immigrant achievement gap: stereotype threat, a situational predicament that may prevent students from performing up to their full abilities. A meta-analysis of 19 experiments suggests an overall mean effect size of 0.63 (random effects model) in support of stereotype threat theory. The results are complemented by moderator analyses with regard to circulation (published or unpublished research), cultural context (US versus Europe), age of immigrants, type of stereotype threat manipulation, dependent measures, and means for identification of immigrant status; evidence on the role of ethnic identity strength is reviewed. Theoretical and practical implications of the findings are discussed.

Their funnel plot says it all:

Black-White gap in USA

An in-depth analysis of the first and most famous paper is given by Ulrich Schimmack.

Wicherts has an old meta-analysis that he is hiding for whatever reason.

But we will soon get another big replication, similar to Flore’s above:

According to stereotype threat theory, the possibility of confirming a negative group stereotype can evoke feelings of threat, leading people to underperform in the domains in which they are stereotyped as lacking ability. This theory has immense theoretical and practical implications, but many studies supporting it include small samples and varying operational definitions of “stereotype threat”. We address the first challenge by leveraging a network of psychology labs to recruit a large Black student sample (anticipated N = 2,700) from multiple US sites (anticipated N = 27). We address the second challenge by identifying three threat-increasing and three threat-decreasing procedures that could plausibly affect performance and use an adaptive Bayesian design to determine which “stereotype threat” operationalization yields the strongest evidence for underperformance. This project has the potential to advance our knowledge of a scientifically and socially important topic: whether and under what conditions stereotype threat affects current US Black students.

Which of course I am looking forward to!

A reasonable prior is that anything from social psychology is most likely bullshit. More so the more left-wing friendly it is. Stereotype threat gets a double bad prior here. The evidence for it is laughably bad, so a reasonable person’s posterior will be close to 0.

intelligence / IQ / cognitive ability

Ken Richardson claims FAQ

A much more in-depth reply to one of Richardson’s papers was written by Zeke Jeffrey here.

Since philosophy blogger RaceRealist is on a mission to promote Ken Richardson, it seems we need to discuss his various claims a bit more. Hence, this post is an FAQ about his claims.

Is there agreement on the definition of “intelligence” and how to measure it?

“As we approach the centenary of the first practical intelligence test, there is still little scientific agreement about how human intelligence should be described, whether IQ tests actually measure it, and if they don’t, what they actually do measure.” (Richardson 2002)

Richardson cherry-picks some quotes to make his point. However, there is plenty of agreement about the description of intelligence and how to measure it. See the results of surveys going back to the 1980s: Snyderman & Rothman 1987, 1988; Reeve & Charles 2008 (in Scott Alexander’s words, “97% of expert psychologists and 85% of applied psychologists agree that IQ tests measure cognitive ability ‘reasonably well’”). A typical mainstream statement of the definition is that offered by Gottfredson 1994:

A very general mental capability that, among other things, involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience. It is not merely book learning, a narrow academic skill, or test-taking smarts. Rather, it reflects a broader and deeper capability for comprehending our surroundings—”catching on,” “making sense” of things, or “figuring out” what to do.

One can also consult every mainstream textbook on the topic.

Do IQ tests measure social class or intelligence?

IQ tests are merely clever numerical surrogates for social class. The numerous correlations evoked in support of g arise from this. (Richardson 1999)

It suggests that all of the population variance in IQ scores can be described in terms of a nexus of sociocognitive-affective factors that differentially prepares individuals for the cognitive, affective and performance demands of the test—in effect that the test is a measure of social class background, and not one of the ability for complex cognition as such. (Richardson 2002)

There are easy and obvious ways to test this idea. First, if IQ tests measure social class, they should be very strongly related to social class/status (SES), either one’s own or one’s parents’, as in r = .90 or so (measuring the same thing minus some measurement error). In fact, IQ correlates about .35 with parental SES (Hanscombe et al 2012) and about .60 with one’s own adult SES (Strenze 2007 presents a meta-analysis but lacks a composite SES measure; we reported a value of .55 in our Argentina study, Kirkegaard & Fuerst 2017).

Second, one can look at siblings, who of course are known to vary widely in IQ, differing on average by 12 IQ, whereas random people differ by 17 IQ. Many large studies (e.g. Frisell et al., 2012; Hegelund et al., 2019; Murray, 2002, Aghion et al 2018) have shown that these sibling differences in IQ predict outcomes well, sometimes as well as between families, usually with some loss (maybe 15% on average).
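The 12-vs-17 figures follow directly from the normal distribution: if sibling IQs correlate at roughly .5 (a round-number assumption, not a figure from the studies cited), the expected absolute difference comes out at about 12 points, versus about 17 for random pairs. A quick simulation sketch:

```python
import math
import random

random.seed(1)
N = 200_000

# Random pairs: two independent draws from N(100, 15).
rand_diff = sum(abs(random.gauss(100, 15) - random.gauss(100, 15))
                for _ in range(N)) / N

# Sibling pairs: IQs correlated at roughly r = .5 (shared genes and home),
# built from a shared component plus an individual component.
r = 0.5
sib_diff = 0.0
for _ in range(N):
    shared = random.gauss(0, 1)
    a = 100 + 15 * (math.sqrt(r) * shared + math.sqrt(1 - r) * random.gauss(0, 1))
    b = 100 + 15 * (math.sqrt(r) * shared + math.sqrt(1 - r) * random.gauss(0, 1))
    sib_diff += abs(a - b)
sib_diff /= N

print(round(rand_diff, 1))  # ~16.9, the "17 IQ" figure for random people
print(round(sib_diff, 1))   # ~12.0, the "12 IQ" figure for siblings
```

Analytically, the mean absolute difference is sd(diff) × √(2/π): 15√2 × 0.798 ≈ 16.9 for independent pairs, and 15 × 0.798 ≈ 12.0 for siblings at r = .5.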

Third, as an easier alternative to the second point above, one can simply adjust for parental SES in a regression and see whether IQ still predicts outcomes. The Bell Curve (Herrnstein & Murray, 1994) famously did this and found that IQ is usually the stronger predictor of the two. These results were also recently replicated for SAT and university education outcomes by Higdem et al 2016:

In Table 1, compare lines 1 vs. 2 to see the influence of parental SES. The model R2 hardly changes (.268 to .273) and the beta change for SAT is minor (.194 to .173). Table 2 shows the partial correlations, i.e. the statistical dependency on parental SES was removed from both variables, and the remaining correlations are shown by subgroup. They are all substantial. Note that range restriction causes the correlations to differ by group.
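The partialling logic can be checked with the standard partial correlation formula. The numbers below are hypothetical round values in the ballpark of the correlations discussed above, not figures from Higdem et al.:

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y with z partialled out of both."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Hypothetical round numbers for illustration:
r_iq_outcome  = 0.50  # IQ with some educational outcome
r_iq_ses      = 0.35  # IQ with parental SES (cf. Hanscombe et al 2012)
r_ses_outcome = 0.30  # parental SES with the outcome

print(round(partial_corr(r_iq_outcome, r_iq_ses, r_ses_outcome), 2))  # 0.44
```

The point: with a parent-SES/IQ correlation of only about .35, partialling SES out barely dents IQ's predictive validity (here .50 drops only to .44), which is exactly the Table 1 vs. Table 2 pattern.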

Immigration intelligence / IQ / cognitive ability

Studies using national IQs for predicting immigration outcomes

Below I list all the studies I am aware of that use national IQs in studies of immigrants, usually as an estimate of their IQ levels in the host country. Most of these studies are authored by me and coauthors, but not the first two from 2010, which were seemingly independently conceived by two sets of academics. I count 14 papers in this literature so far, and 1 unpublished meta-analysis of these.

Search was done this way:

  1. Looked over the studies citing Jones & Schneider 2010 and Vinogradov & Kolvereid 2010 on Google Scholar, looking for anything that seemed relevant.

  2. Copied in all my own studies, using my website front page as the list.

  3. Looked in Lynn’s Race difference book (2nd ed) for any mentions of immigration/immigrants. Since Lynn is the originator of national IQs, he presumably would be on the lookout for any study using them.

So it is possible I missed studies that do not cite the two 2010 studies, but do cite some of Lynn’s work. Furthermore, it’s possible there are some studies that use other datasets to do essentially the same thing, e.g. Altinok’s scores, Rindermann’s scores, or other datasets.

  • Jones, G., & Schneider, W. J. (2010). IQ in the production function: Evidence from immigrant earnings. Economic Inquiry, 48(3), 743-755.

We show that a country’s average IQ score is a useful predictor of the wages that immigrants from that country earn in the United States, whether or not one adjusts for immigrant education. Just as in numerous microeconomic studies, 1 IQ point predicts 1% higher wages, suggesting that IQ tests capture an important difference in cross‐country worker productivity. In a cross‐country development accounting exercise, about one‐sixth of the global inequality in log income can be explained by the effect of large, persistent differences in national average IQ on the private marginal product of labor. This suggests that cognitive skills matter more for groups than for individuals. (JEL J24, J61, O47)

The level of self-employment varies significantly among immigrants from different countries of origin. The objective of this research is to examine the relationship between home-country national intelligence and self-employment rates among first generation immigrants in Norway. Empirical secondary data on self-employment among immigrants from 117 countries residing in Norway in 2008 was used. The relevant hypothesis was tested using hierarchical regression analysis. The immigrants’ national intelligence was found to be significantly positively associated with self-employment. However, the importance of national IQ for self-employment among immigrants decreases with the duration of residence in Norway. The study concludes with practical implications and suggestions for future research.

Many recent studies have corroborated Lynn and Vanhanen’s worldwide compilation of national IQs; however, no one has attempted to estimate the mean IQ of an immigration population based on its countries of origin. This paper reports such a study based on the Danish immigrant population and IQ data from the military draft. Based on Lynn and Vanhanen’s estimates, the Danish immigrant population was estimated to have an average 89.9 IQ in 2013Q2, and the IQ from the draft was 86.3 in 2003Q3 (against a ‘Danish’ IQ of 100). However, after taking account of two error sources, the discrepancy between the measured IQ and the estimated IQ was reduced to a mere 0.4 IQ. The study thus strongly validates Lynn and Vanhanen’s national IQs.

We discuss the global hereditarian hypothesis of race differences in g and test it on data from the NLSF. We find that migrants’ country-of-origin IQ predicts GPA and SAT/ACT.

Criminality rates and fertility vary wildly among Danish immigrant populations by their country of origin. Correlational and regression analyses show that these are very predictable (R’s about .85 and .5) at the group level with national IQ, Islam belief, GDP and height as predictors.

A previous study found that criminality among immigrant groups in Denmark was highly predictable by their countries of origin’s prevalence of Muslims, IQ, GDP and height. This study replicates the study for Norway with similar results.

We obtained data from Denmark for the largest 70 immigrant groups by country of origin. We show that three important socioeconomic variables are highly predictable from the Islam rate, IQ, GDP and height of the countries of origin. We further show that there is a general immigrant socioeconomic factor and that country of origin national IQs, Islamic rates, and GDP strongly predict immigrant general socioeconomic scores.

I present new predictive analyses for crime, income, educational attainment and employment among immigrant groups in Norway and crime in Finland. Furthermore I show that the Norwegian data contains a strong general socioeconomic factor (S) which is highly predictable from country-level variables (National IQ .59, Islam prevalence -.71, international general socioeconomic factor .72, GDP .55), and correlates highly (.78) with the analogous factor among immigrant groups in Denmark. Analyses of the prediction vectors show very high correlations (generally ±.9) between predictors which means that the same variables are relatively well or weakly predicted no matter which predictor is used. Using the method of correlated vectors shows that it is the underlying S factor that drives the associations between predictors and socioeconomic traits, not the remaining variance (all correlations near unity).

We argue that if immigrants have a different mean general intelligence (g) than their host country and if immigrants generally retain their mean level of g, then immigration will increase the standard deviation of g. We further argue that inequality in g is an important cause of social inequality, so increasing it will increase social inequality. We build a demographic model to analyze change in the mean and standard deviation of g over time and apply it to data from Denmark. The simplest model, which assumes no immigrant gains in g, shows that g has fallen due to immigration from 97.1 to 96.4, and that for the same reason standard deviation has increased from 15.04 to 15.40, in the time span 1980 to 2014.
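The arithmetic behind such a demographic model is just mixture statistics: the mean of a mixture is the weighted mean of the group means, and the variance picks up an extra between-group term. A minimal sketch with illustrative weights and means (not the paper's actual inputs):

```python
import math

def mixture_stats(groups):
    """groups: list of (weight, mean, sd); weights must sum to 1.
    Mixture variance = sum of w*(sd^2 + mean^2) minus mixture mean squared."""
    mu = sum(w * m for w, m, _ in groups)
    var = sum(w * (s**2 + m**2) for w, m, s in groups) - mu**2
    return mu, math.sqrt(var)

# Illustrative: 90% natives at mean 97.3, 10% immigrants at mean 88,
# both with SD 15 (hypothetical round numbers, not the paper's data).
mu, sd = mixture_stats([(0.9, 97.3, 15.0), (0.1, 88.0, 15.0)])
print(round(mu, 2), round(sd, 2))  # 96.37 15.26
```

Note that the mixture SD exceeds 15 even though each group's SD is 15: the gap between group means adds between-group variance, which is exactly the mechanism behind the paper's rise from 15.04 to 15.40.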

Immigrants can be classified into groups based on their country of origin. Group-level data concerning immigrant crime by country of origin was obtained from a 2005 Dutch-language report and were from 2002. There are data for 57 countries of origin. The crime rates were correlated with country of origin predictor variables: national IQ, prevalence of Islam and general socioeconomic factor (S). For males aged 12-17 and 18-24, the mean correlation with IQ, Islam, and S was, respectively, -.51, .37, and -.42. When subsamples split into 1st and 2nd generations were used, the mean correlation was -.74, .34, and -.40. A general crime factor among young persons was extracted. The correlations with the predictors for this variable were -.80, .34, and -.43. The results were similar when weighing the observations by the population of each immigrant group in the Netherlands. The results were also similar when using crime rates controlled for differences in household income. Some groups increased their crime rates from the 1st to 2nd generation, while for others the reverse happened.

Two datasets with grade point average by country of origin or parents’ country of origin are presented (N=13 and 19). Correlation analyses show that GPA is highly predictable from country-level variables: National IQ (.40 to .64), age heaping 1900 (.32 to .53), Islam prevalence (-.72 to -.75), average years of schooling (.41 to .74) and general socioeconomic factor (S) in both Denmark (.72 to .87) and internationally (.38 to .68). Examination of the gap sizes in GPA between natives and immigrants shows that these are roughly the size one would expect based on the estimated general cognitive ability differences between the groups.

Number of suspects per capita were estimated for immigrants in Germany grouped by citizenship (n=83). These were correlated with national IQs (r=-.53) and Islam prevalence in the home countries (r=.49). Multivariate analyses revealed that the mean age and sex distribution of the groups in Germany were confounds.

The German data lacked age and sex information for the crime data and so it was not possible to adjust for age and sex using subgroup analyses. For this reason, an alternative adjustment method was developed. This method was tested on the detailed Danish data which does have the necessary information to carry out subgroup analyses. The new method was found to give highly congruent results with the subgrouping method.

The German crime data were then adjusted for age and sex using the developed method and the resulting values were analyzed with respect to the predictors. They were moderately to strongly correlated with national IQs (.46) and Islam prevalence in the home country (.35). Combining national IQ, Islam% and distance to Germany resulted in a model with a cross-validated r2 of 20%, equivalent to a correlation of .45. If two strong outliers were removed, this rose to 25%, equivalent to a correlation of .50.
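The conversions quoted above between cross-validated R² and a correlation are just square roots, easy to verify:

```python
import math

# A cross-validated R-squared converts to a correlation as r = sqrt(R2).
for r2 in (0.20, 0.25):
    print(f"R2 = {r2:.2f}  ->  r = {math.sqrt(r2):.2f}")
# R2 = 0.20  ->  r = 0.45
# R2 = 0.25  ->  r = 0.50
```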

Employment rates for 11 country of origin groups living in the three Scandinavian countries are presented. Analysis of variance showed that differences in employment rates are highly predictable (adjusted multiple R = .93). This predictability was mostly due to origin countries (eta = .89), not sex (eta = .25) and host country (eta = .20). Furthermore, national IQs of the origin countries predicted employment rates well across all host countries (r’s = 0.74 [95%CI: 0.30, 0.92], 0.75 [0.30, 0.92], 0.66 [0.14, 0.89] for Denmark, Norway and Sweden, respectively), and so did Muslim % of the origin countries (r’s = -0.80 [-0.94, -0.43], -0.78 [-0.94, -0.37], -0.58 [-0.87, -0.01]).

The relationships between national IQs, Muslim% in origin countries and estimates of net fiscal contributions to public finances in Denmark (n=32) and Finland (n=11) were examined. The analyses showed that the fiscal estimates were near-perfectly correlated between countries (r = .89 [.56 to .98], n=9), and were well-predicted by national IQs (r’s .89 [.49 to .96] and .69 [.45 to .84]), and Muslim% (r’s -.75 [-.93 to -.27] and -.73 [-.86 to -.51]). Furthermore, general socioeconomic factor scores for Denmark were near-perfectly correlated with the fiscal estimates (r = .86 [.74 to .93]), especially when one outlier (Syria) was excluded (.90 [.80 to .95]). Finally, the monetary returns to higher country of origin IQs were estimated to be 917/470 Euros/person-year for a 1 IQ point increase, and -188/-86 for a 1% increase in Muslim%.

The European Union has seen an increased number of asylum seekers and economic migrants over the past few years. There will be requests to assess some of these individuals to see if they have an intellectual disability (ID). If this is to be done using the current internationally recognized definitions of ID, we will need to be confident that the IQ tests we have available are able to accurately measure the IQs of people from developing countries. The literature showing substantial differences in the mean measured IQs of different countries is considered. It is found that, although there are numerous problems with these studies, the overall conclusion that there are substantial differences in mean measured IQ is sound. However, what is not clear is whether there are large differences in true intellectual ability between different countries, how predictive IQ scores are of the ability of an individual from a developing country to cope, and whether or not an individual’s IQ would increase if they go from a developing country to a developed one. Because of these uncertainties, it is suggested that a diagnosis of ID should not be dependent on an IQ cut-off point when assessing people from developing countries.

This is borderline with regards to inclusion. It does not look at predicting differential immigrant group performance using national IQs, but it does discuss the national IQs at length with regards to immigration.

Not a paper yet, but I have done a meta-analysis on most of these results, which was presented at LCI 2017. It is available on Youtube. For technical output, see

Genetics / behavioral genetics intelligence / IQ / cognitive ability

New paper out: Racial and ethnic group differences in the heritability of intelligence: A systematic review and meta-analysis (Pesta et al 2020)

So, our big Scarr-Rowe meta-analysis dropped recently. I was traveling at the time, so there was a bit of a delay in posting this. I also recorded a long video covering the reasons why this kind of study is important.

Via meta-analysis, we examined whether the heritability of intelligence varies across racial or ethnic groups. Specifically, we tested a hypothesis predicting an interaction whereby those racial and ethnic groups living in relatively disadvantaged environments display lower heritability and higher environmentality. The reasoning behind this prediction is that people (or groups of people) raised in poor environments may not be able to realize their full genetic potentials. Our sample (k = 16) comprised 84,897 Whites, 37,160 Blacks, and 17,678 Hispanics residing in the United States. We found that White, Black, and Hispanic heritabilities were consistently moderate to high, and that these heritabilities did not differ across groups. At least in the United States, Race/Ethnicity × Heritability interactions likely do not exist.

Main table:

There is also the table of matched samples (i.e. subsamples from the same studies where we calculate differences in ACE parameters), but this table is unwieldy and basically shows the same thing.
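For readers unfamiliar with ACE models: the classic Falconer method recovers the variance components from MZ and DZ twin correlations. A minimal sketch with hypothetical correlations (not values from the paper):

```python
def falconer_ace(r_mz, r_dz):
    """Classic Falconer estimates of the ACE variance components
    from MZ and DZ twin correlations."""
    a2 = 2 * (r_mz - r_dz)   # A: additive genetic variance (heritability)
    c2 = 2 * r_dz - r_mz     # C: shared (family) environment
    e2 = 1 - r_mz            # E: nonshared environment + measurement error
    return a2, c2, e2

# Hypothetical twin correlations for illustration:
a2, c2, e2 = falconer_ace(r_mz=0.75, r_dz=0.45)
print(round(a2, 2), round(c2, 2), round(e2, 2))  # 0.6 0.15 0.25
```

The Scarr-Rowe hypothesis tested in the paper amounts to predicting that a² is lower (and c² or e² higher) in the more disadvantaged groups; the meta-analysis found no such group difference.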