Population genetics

Species genetic variation and number of subspecies/races/breeds/breeding populations/clusters etc.

A straightforward research idea: compare the recognized count of subspecies (races/breeds/breeding populations/clusters, etc.) with measures of genetic variation. The most obvious metric is Fst, but it is not optimal. There is some literature on this:

But surely there are newer data since these reviews. One paper I saw a few years ago:

More than a decade of DNA barcoding encompassing about five million specimens covering 100,000 animal species supports the generalization that mitochondrial DNA clusters largely overlap with species as defined by domain experts. Most barcode clustering reflects synonymous substitutions. What evolutionary mechanisms account for synonymous clusters being largely coincident with species? The answer depends on whether variants are phenotypically neutral. To the degree that variants are selectable, purifying selection limits variation within species and neighboring species may have distinct adaptive peaks. Phenotypically neutral variants are only subject to demographic processes—drift, lineage sorting, genetic hitchhiking, and bottlenecks. The evolution of modern humans has been studied from several disciplines with detail unique among animal species. Mitochondrial barcodes provide a commensurable way to compare modern humans to other animal species. Barcode variation in the modern human population is quantitatively similar to that within other animal species. Several convergent lines of evidence show that mitochondrial diversity in modern humans follows from sequence uniformity followed by the accumulation of largely neutral diversity during a population expansion that began approximately 100,000 years ago. A straightforward hypothesis is that the extant populations of almost all animal species have arrived at a similar result consequent to a similar process of expansion from mitochondrial uniformity within the last one to several hundred thousand years.

Since they looked at species without clusters, one would need a different dataset. But it should be possible to find this somewhere.
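If such a dataset were assembled, the comparison itself is simple. Below is a minimal sketch, with made-up species and values as placeholders; a rank correlation seems the natural choice since subspecies counts are small integers:

```python
# Sketch: correlate recognized subspecies counts with genetic
# differentiation (Fst). All data below are hypothetical placeholders,
# not real estimates for any species.

def ranks(xs):
    """Average ranks (1-based), handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of tied positions i..j, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# species: (recognized subspecies count, mean pairwise Fst) -- placeholders
data = {
    "species_A": (1, 0.02),
    "species_B": (3, 0.08),
    "species_C": (5, 0.15),
    "species_D": (2, 0.05),
    "species_E": (8, 0.22),
}
counts = [v[0] for v in data.values()]
fsts = [v[1] for v in data.values()]
print(round(spearman(counts, fsts), 2))  # 1.0 for these monotone placeholders
```

With real data one would of course want many more species, and probably a measure less sensitive than Fst to within-species sampling schemes, as noted above.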


Empirical effect size comparison studies (2020 overview)

Since I keep having to look for these, here’s a compilation for ease of reference.

Discusses empirical guidelines for interpreting the magnitude of correlation coefficients, a key index of effect size, in psychological studies. The author uses the work of J. Cohen (see record 1987-98267-000), in which operational definitions were offered for interpreting correlation coefficients, and examines two meta-analytic reviews (G. J. Meyer et al., see record 2001-00159-003; and M. W. Lipsey et al., see record 1994-18340-001) to arrive at the empirical guidelines.

This article compiles results from a century of social psychological research, more than 25,000 studies of 8 million people. A large number of social psychological conclusions are listed alongside meta-analytic information about the magnitude and variability of the corresponding effects. References to 322 meta-analyses of social psychological phenomena are presented, as well as statistical effect-size summaries. Analyses reveal that social psychological effects typically yield a value of r equal to .21 and that, in the typical research literature, effects vary from study to study in ways that produce a standard deviation in r of .15. Uses, limitations, and implications of this large-scale compilation are noted.

Journal editors and academy presidents are increasingly calling on researchers to evaluate the substantive, as opposed to the statistical, significance of their results. To measure the extent to which these calls have been heeded, I aggregated the meta-analytically derived effect size estimates obtained from 965 individual samples. I then surveyed 204 studies published in the Journal of International Business Studies. I found that the average effect size in international business research is small, and that most published studies lack the statistical power to detect such effects reliably. I also found that many authors confuse statistical with substantive significance when interpreting their research results. These practices have likely led to unacceptably high Type II error rates and invalid inferences regarding real-world effects. By emphasizing p values over their effect size estimates, researchers are under-selling their results and settling for contributions that are less than what they really have to offer. In view of this, I offer four recommendations for improving research and reporting practices.

Effect size information is essential for the scientific enterprise and plays an increasingly central role in the scientific process. We extracted 147,328 correlations and developed a hierarchical taxonomy of variables reported in Journal of Applied Psychology and Personnel Psychology from 1980 to 2010 to produce empirical effect size benchmarks at the omnibus level, for 20 common research domains, and for an even finer grained level of generality. Results indicate that the usual interpretation and classification of effect sizes as small, medium, and large bear almost no resemblance to findings in the field, because distributions of effect sizes exhibit tertile partitions at values approximately one-half to one-third those intuited by Cohen (1988). Our results offer information that can be used for research planning and design purposes, such as producing better informed non-nil hypotheses and estimating statistical power and planning sample size accordingly. We also offer information useful for understanding the relative importance of the effect sizes found in a particular study in relationship to others and which research domains have advanced more or less, given that larger effect sizes indicate a better understanding of a phenomenon. Also, our study offers information about research domains for which the investigation of moderating effects may be more fruitful and provide information that is likely to facilitate the implementation of Bayesian analysis. Finally, our study offers information that practitioners can use to evaluate the relative effectiveness of various types of interventions.

Individual differences researchers very commonly report Pearson correlations between their variables of interest. Cohen (1988) provided guidelines for the purposes of interpreting the magnitude of a correlation, as well as estimating power. Specifically, r = 0.10, r = 0.30, and r = 0.50 were recommended to be considered small, medium, and large in magnitude, respectively. However, Cohen’s effect size guidelines were based principally upon an essentially qualitative impression, rather than a systematic, quantitative analysis of data. Consequently, the purpose of this investigation was to develop a large sample of previously published meta-analytically derived correlations which would allow for an evaluation of Cohen’s guidelines from an empirical perspective. Based on 708 meta-analytically derived correlations, the 25th, 50th, and 75th percentiles corresponded to correlations of 0.11, 0.19, and 0.29, respectively. Based on the results, it is suggested that Cohen’s correlation guidelines are too exigent, as <3% of correlations in the literature were found to be as large as r = 0.50. Consequently, in the absence of any other information, individual differences researchers are recommended to consider correlations of 0.10, 0.20, and 0.30 as relatively small, typical, and relatively large, in the context of a power analysis, as well as the interpretation of statistical results from a normative perspective.

This study compiles information from more than 250 meta-analyses conducted over the past 30 years to assess the magnitude of reported effect sizes in the organizational behavior (OB)/human resources (HR) literatures. Our analysis revealed an average uncorrected effect of r = .227 and an average corrected effect of ρ = .278 (SDρ = .140). Based on the distribution of effect sizes we report, Cohen’s effect size benchmarks are not appropriate for use in OB/HR research as they overestimate the actual breakpoints between small, medium, and large effects. We also assessed the average statistical power reported in meta-analytic conclusions and found substantial evidence that the majority of primary studies in the management literature are statistically underpowered. Finally, we investigated the impact of the file drawer problem in meta-analyses and our findings indicate that the file drawer problem is not a significant concern for meta-analysts. We conclude by discussing various implications of this study for OB/HR researchers.

A number of recent research publications have shown that commonly used guidelines for interpreting effect sizes suggested by Cohen (1988) do not fit well with the empirical distribution of those effect sizes, and tend to overestimate them in many research areas. This study proposes empirically derived guidelines for interpreting effect sizes for research in social psychology, based on analysis of the true distributions of the two types of effect size measures widely used in social psychology (correlation coefficient and standardized mean differences). Analysis was carried out on the empirical distribution of 9884 correlation coefficients and 3580 Hedges’ g statistics extracted from studies included in 98 published meta-analyses. The analysis reveals that the 25th, 50th, and 75th percentiles corresponded to correlation coefficients values of 0.12, 0.25, and 0.42 and to Hedges’ g values of 0.15, 0.38, and 0.69, respectively. This suggests that Cohen’s guidelines tend to overestimate medium and large effect sizes. It is recommended that correlation coefficients of 0.10, 0.25, and 0.40 and Hedges’ g of 0.15, 0.40, and 0.70 should be interpreted as small, medium, and large effects for studies in social psychology. The analysis also shows that more than half of all studies lack sufficient sample size to detect a medium effect. This paper reports the sample sizes required to achieve appropriate statistical power for the identification of small, medium, and large effects. This can be used for performing appropriately powered future studies when information about exact effect size is not available.

Randomized field experiments designed to better understand the production of human capital have increased exponentially over the past several decades. This chapter summarizes what we have learned about various partial derivatives of the human capital production function, what important partial derivatives are left to be estimated, and what—together—our collective efforts have taught us about how to produce human capital in developed countries. The chapter concludes with a back of the envelope simulation of how much of the racial wage gap in America might be accounted for if human capital policy focused on best practices gleaned from randomized field experiments.

[He split results by various groups, so hard to summarize]

In this meta-study, we analyzed 2,442 effect sizes from 131 meta-analyses in intelligence research, published from 1984 to 2014, to estimate the average effect size, median power, and evidence for bias in this multidisciplinary field. We found that the average effect size in intelligence research was a Pearson’s correlation of .26, and the median sample size was 60. We estimated the power of each primary study by using the corresponding meta-analytic effect as a proxy for the true effect. The median estimated power across all studies was 51.7%, with only 31.7% of the studies reaching a power of 80% or higher. We documented differences in average effect size and median estimated power between different types of intelligence studies (correlational studies, studies of group differences, experiments, toxicology, and behavior genetics). Across all meta-analyses, we found evidence for small study effects, potentially indicating publication bias and overestimated effects. We found no differences in small study effects between different study types. We also found no convincing evidence for the decline effect, US effect, or citation bias across meta-analyses. We conclude that intelligence research does show signs of low power and publication bias, but that these problems seem less severe than in many other scientific fields.
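The power-estimation approach this abstract describes can be sketched with the Fisher z approximation; this is my own illustration of the method, not the authors' code:

```python
# Sketch of the power-estimation approach described above: treat the
# meta-analytic correlation as the true effect and compute the power of
# a two-sided test of r = 0 via the Fisher z approximation.
from statistics import NormalDist
import math

def power_r(r, n, alpha=0.05):
    """Approximate power of a two-sided test of a Pearson correlation."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    # Fisher z-transform: atanh(r) is ~normal with SE = 1/sqrt(n - 3)
    z = math.atanh(r) * math.sqrt(n - 3)
    return nd.cdf(z - z_crit) + nd.cdf(-z - z_crit)

# Typical values from the abstract: r = .26, median n = 60
print(f"{power_r(0.26, 60):.1%}")  # roughly 52%, close to the reported 51.7%
```

That the approximation lands so close to the reported median power is a useful sanity check on the abstract's numbers.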

This article presents a methodological review of 54 meta-analyses of the effectiveness of clinical psychological treatments, using standardized mean differences as the effect size index. We statistically analyzed the distribution of the number of studies of the meta-analyses, the distribution of the sample sizes in the studies of each meta-analysis, the distribution of the effect sizes in each of the meta-analyses, the distribution of the between-studies variance values, and the Pearson correlations between effect size and sample size in each meta-analysis. The results are presented as a function of the type of standardized mean difference: posttest standardized mean difference, standardized mean change from pretest to posttest, and standardized mean change difference between groups. These findings will help researchers design future Monte Carlo and theoretical studies on the performance of meta-analytic procedures, based on the manipulation of realistic model assumptions and parameters of the meta-analyses. Furthermore, the analysis of the distribution of the mean effect sizes through the meta-analyses provides a specific guide for the interpretation of the clinical significance of the different types of standardized mean differences within the field of the evaluation of clinical psychological interventions.

Researchers typically use Cohen’s guidelines of Pearson’s r = .10, .30, and .50, and Cohen’s d = 0.20, 0.50, and 0.80 to interpret observed effect sizes as small, medium, or large, respectively. However, these guidelines were not based on quantitative estimates and are only recommended if field-specific estimates are unknown. This study investigated the distribution of effect sizes in both individual differences research and group differences research in gerontology to provide estimates of effect sizes in the field. Effect sizes (Pearson’s r, Cohen’s d, and Hedges’ g) were extracted from meta-analyses published in 10 top-ranked gerontology journals. The 25th, 50th, and 75th percentile ranks were calculated for Pearson’s r (individual differences) and Cohen’s d or Hedges’ g (group differences) values as indicators of small, medium, and large effects. A priori power analyses were conducted for sample size calculations given the observed effect size estimates. Effect sizes of Pearson’s r = .12, .20, and .32 for individual differences research and Hedges’ g = 0.16, 0.38, and 0.76 for group differences research were interpreted as small, medium, and large effects in gerontology. Cohen’s guidelines appear to overestimate effect sizes in gerontology. Researchers are encouraged to use Pearson’s r = .10, .20, and .30, and Cohen’s d or Hedges’ g = 0.15, 0.40, and 0.75 to interpret small, medium, and large effects in gerontology, and recruit larger samples.

There are a growing number of large-scale educational randomized controlled trials (RCTs). Considering their expense, it is important to reflect on the effectiveness of this approach. We assessed the magnitude and precision of effects found in those large-scale RCTs commissioned by the UK-based Education Endowment Foundation and the U.S.-based National Center for Educational Evaluation and Regional Assistance, which evaluated interventions aimed at improving academic achievement in K–12 (141 RCTs; 1,222,024 students). The mean effect size was 0.06 standard deviations. These sat within relatively large confidence intervals (mean width = 0.30 SDs), which meant that the results were often uninformative (the median Bayes factor was 0.56). We argue that our field needs, as a priority, to understand why educational RCTs often find small and uninformative effects.

Effect sizes are the currency of psychological research. They quantify the results of a study to answer the research question and are used to calculate statistical power. The interpretation of effect sizes—when is an effect small, medium, or large?—has been guided by the recommendations Jacob Cohen gave in his pioneering writings starting in 1962: Either compare an effect with the effects found in past research or use certain conventional benchmarks. The present analysis shows that neither of these recommendations is currently applicable. From past publications without pre-registration, 900 effects were randomly drawn and compared with 93 effects from publications with pre-registration, revealing a large difference: Effects from the former (median r = 0.36) were much larger than effects from the latter (median r = 0.16). That is, certain biases, such as publication bias or questionable research practices, have caused a dramatic inflation in published effects, making it difficult to compare an actual effect with the real population effects (as these are unknown). In addition, there were very large differences in the mean effects between psychological sub-disciplines and between different study designs, making it impossible to apply any global benchmarks. Many more pre-registered studies are needed in the future to derive a reliable picture of real population effects.
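The percentile-benchmark approach used by several of the studies above is easy to sketch. Assuming one has a pool of meta-analytically derived correlations (the values below are illustrative placeholders, not any study's data), the stdlib does the cut-point work:

```python
# Sketch of the percentile-benchmark approach used in several of the
# studies above: take a pool of meta-analytic correlations and read off
# the 25th/50th/75th percentiles as empirical "small/typical/large".
# The correlations below are illustrative placeholders, not real data.
import statistics

meta_rs = [0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.19, 0.22,
           0.25, 0.28, 0.29, 0.33, 0.38, 0.45, 0.52]

# statistics.quantiles with n=4 returns the three quartile cut points
q25, q50, q75 = statistics.quantiles(meta_rs, n=4, method="inclusive")
print(f"small ~ {q25:.2f}, typical ~ {q50:.2f}, large ~ {q75:.2f}")
```

The studies differ mainly in which literatures feed the pool and whether corrected or uncorrected correlations are used, which is why their benchmarks differ somewhat.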



New paper out: Human Biodiversity for Beginners: A Review of Charles Murray’s Human Diversity

Human Diversity is Charles Murray’s latest book. This review evaluates the claims made in the book and places both the author’s theses and their criticisms in their historical context. It concludes that this book is valuable as an updated summary of current knowledge about psychological differences (in the averages) between genders, races, and social classes. As such it is a useful introduction into the field for everyone interested in it.

The long-awaited review. Actually, I read the book back on… (checks GoodReads)… January 4th. But here it is. Some other reviews by informed readers:

These are reviews by people I reckon have some informed knowledge of the field, either because they are themselves scientists (at least sometimes), or because they are avid readers of the literature.

For contrast, some reviews by big media; I don’t know any of the authors:

I didn’t read any of these reviews before writing my own, so mine is not colored by theirs.


Gap closers

There are a lot of these people:

  • Ferguson, R. F. (1998). Can schools narrow the Black–White test score gap?
  • Kober, N. (2001). It Takes More Than Testing: Closing the Achievement Gap. A Report of the Center on Education Policy.
  • Bainbridge, W. L., & Lasley, T. J. (2002). Demographics, diversity, and K-12 accountability: The challenge of closing the achievement gap. Education and Urban Society, 34(4), 422-437.
  • Thernstrom, A., & Thernstrom, S. (2004). No excuses: Closing the racial gap in learning. Simon and Schuster.
  • Wenglinsky, H. (2004). Closing the racial achievement gap: The role of reforming instructional practices. Education Policy Analysis Archives, 12, 64.
  • Murnane, R. J., Willett, J. B., Bub, K. L., McCartney, K., Hanushek, E., & Maynard, R. (2006). Understanding trends in the black-white achievement gaps during the first years of school. Brookings-Wharton papers on urban affairs, 97-135.

Closing the black-white achievement gap when schools remain segregated by race and income is extraordinarily difficult. To succeed, schools that serve concentrations of poor children must be staffed with skilled, experienced teachers who have learned to work together to provide large amounts of consistent, coordinated, high-quality instruction. Closing the gap is the greatest educational challenge facing the United States today.

  • Paige, R., & Witty, E. (2009). The black-white achievement gap: Why closing it is the greatest civil rights issue of our time. AMACOM Div American Mgmt Assn.
  • Ferguson, R., Stellar, A., Schools, B. C. P., & Morganton, N. C. (2010). Toward excellence with equity: An emerging vision for closing the achievement gap. Evidence-based Practice Articles, 56.
  • Jencks, C., & Phillips, M. (Eds.). (2011). The Black-White test score gap. Brookings Institution Press.
  • Blackford, K., & Khojasteh, J. (2013). Closing the achievement gap: Identifying strand score differences. American Journal of Educational Studies, 6(2), 5.
  • Webb, M., & Thomas, R. (2015, January). Teachers’ Perceptions of Educators’ and Students’ Role in Closing the Achievement Gap. In National Forum of Teacher Education Journal (Vol. 25, No. 3).

An impossible educational challenge, sorry.


New paper out: Public Preferences and Reality: Crime Rates among 70 Immigrant Groups in the Netherlands

We estimated crime rates among 70 origin-based immigrant groups in the Netherlands for the years 2005-2018. Results indicated that crime rates have overall been falling for each group in the period of study, and in the country as a whole, with about a 50% decline since 2005. Immigrant groups varied widely in crime rates, with East Asian countries being lower and Muslim countries, as well as Dutch (ex-)colonial possessions in the Caribbean, being higher than the natives. We found that national IQ and Muslim percentage of population of the origin countries predicted relative differences in crime rates, r’s = .64 and .45, respectively, in line with previous research both in the Netherlands and in other European countries. Furthermore, we carried out a survey of 200 persons living in the Netherlands to measure their preferences for immigrants for each origin country in terms of getting more or fewer persons from each country. Following Carl (2016), we computed a mean opposition metric for each country. This correlated strongly with actual crime rates we found, r’s = .46 and .57, for population weighted and unweighted results, respectively. The main outliers in the regression were the Dutch (ex-)colonial possessions, and if these are excluded, the correlations increase to .68 and .66, respectively. Regressions with plausible confounders (Muslim percentage, geographical fixed effects) showed that crime rates continued to be a useful predictor of opposition to specific countries. The results were interpreted as being in line with a rational voter preference for less crime-prone immigrants.

Main results

Video walkthrough


Sesardić’s conjecture: preliminary evidence in favor

In Making sense of Heritability, Sesardić wrote:

On the less obvious side, a nasty campaign against H could have the unintended effect of strengthening H [hereditarianism] epistemically, and making the criticism of H look less convincing. Simply, if you happen to believe that H is true and if you also know that opponents of H will be strongly tempted to “play dirty,” that they will be eager to seize upon your smallest mistake, blow it out of all proportion, and label you with Dennett’s “good epithets,” with a number of personal attacks thrown in for good measure, then if you still want to advocate H, you will surely take extreme care to present your argument in the strongest possible form. In the inhospitable environment for your views, you will be aware that any major error is a liability that you can hardly afford, because it will more likely be regarded as a reflection of your sinister political intentions than as a sign of your fallibility. The last thing one wants in this situation is the disastrous combination of being politically denounced (say, as a “racist”) and being proved to be seriously wrong about science. Therefore, in the attempt to make themselves as little vulnerable as possible to attacks they can expect from their uncharitable and strident critics, those who defend H will tread very cautiously and try to build a very solid case before committing themselves publicly. As a result, the quality of their argument will tend to rise, if the subject matter allows it.22

It is different with those who attack H. They are regarded as being on the “right” side (in the moral sense), and the arguments they offer will typically get a fair hearing, sometimes probably even a hearing that is “too fair.” Many a potential critic will feel that, despite seeing some weaknesses in their arguments, he doesn’t really want to point them out publicly or make much of them because this way, he might reason, he would just play into the hands of “racists” and “right-wing ideologues” that he and most of his colleagues abhor. 23 Consequently, someone who opposes H can expect to be rewarded with being patted on the back for a good political attitude, while his possible cognitive errors will go unnoticed or unmentioned or at most mildly criticized.

Now, given that an advocate of H and an opponent of H find themselves in such different positions, who of the two will have more incentive to invest a lot of time and hard work to present the strongest possible defense of his views? The question answers itself. In the academic jungle, as elsewhere, it is the one who anticipates trouble who will spare no effort to be maximally prepared for the confrontation.

If I am right, the pressure of political correctness would thus tend to result, ironically, in politically incorrect theories becoming better developed, more carefully articulated, and more successful in coping with objections. On the other hand, I would predict that a theory with a lot of political support would typically have a number of scholars flocking to its defense with poorly thought out arguments and with speedily generated but fallacious “refutations” of the opposing view. 24 This would explain why, as Ronald Fisher said, “the best causes tend to attract to their support the worst arguments” (Fisher 1959: 31).

Example? Well, the best example I can think of is the state of the debate about heritability. Obviously, the hypothesis of high heritability of human psychological variation – and especially the between-group heritability of IQ differences – is one of the most politically sensitive topics in contemporary social science. The strong presence of political considerations in this controversy is undeniable, and there is no doubt about which way the political wind is blowing. When we turn to discussions in this context that are ostensibly about purely scientific issues two things are striking. First, as shown in previous chapters, critics of heritability very often rely on very general, methodological arguments in their attempts to show that heritability values cannot be determined, are intrinsically misleading, are low, are irrelevant, etc. Second, these global methodological arguments – although defended by some leading biologists, psychologists, and philosophers of science – are surprisingly weak and unconvincing. Yet they continue to be massively accepted, hailed as the best approach to the nature–nurture issue, and further transmitted, often with no detailed analysis or serious reflection.

Footnotes are:

22 This is not guaranteed, of course. For example, biblical literalists who think that the world was created 6,000 years ago can expect to be ridiculed as irrational, ignorant fanatics. So, if they go public, it is in their strong interest to use arguments that are not silly, but the position they have chosen to advocate leaves them with no good options. (I assume that it is not a good option to suggest, along the lines of Philip Gosse’s famous account, that the world was created recently, but with the false traces of its evolutionary history that never happened.)

23 A pressure in the opposite direction would not have much force. It is notorious that in the humanities and social science departments, conservative and other right-of-center views are seriously under-represented (cf. Ladd & Lipset 1975; Redding 2001).

24 I am speaking of tendencies here, of course. There would be good and bad arguments on both sides.

The Fisher (1959) reference given is actually about probability theory; in that context, Fisher is not writing about genetics or biology and their relation to politics.

Did no one come up with this idea before? Seems unlikely. Robert Plomin and Thomas Bouchard came close in their 1987 chapters in the same book, Arthur Jensen: Consensus and Controversy (see also this earlier post). Bouchard:

One might fairly claim that this chapter does not constitute a critical appraisal of the work of Arthur Jensen on the genetics of human abilities, but rather a defense. If a reader arrives at that conclusion he or she has overlooked an important message. Since Jensen rekindled the flames of the heredity vs environment debate in 1969, human behavior genetics has undergone a virtual renaissance. Nevertheless, a tremendous amount of energy has been wasted. In my discussions of the work of Kamin, Taylor, Farber, etc., I have often been as critical of them as they have been of the hereditarian program. While I believe that their criticisms have failed and their conclusions are false, I also believe that their efforts were necessary. They were necessary because human behavior genetics has been an insufficiently self-critical discipline. It adopted the quantitative models of experimental plant and animal genetics without sufficient regard for the many problems involved in justifying the application of those models in human research. Furthermore, it failed to deal adequately with most of the issues that are raised and dealt with by meta-analytic techniques. Human behavior geneticists have, until recently, engaged in inadequate analyses. Their critics, on the other hand, have engaged in pseudo-analyses. Much of the answer to the problem of persuading our scientific colleagues that behavior is significantly influenced by genetic processes lies in a more critical treatment of our own data and procedures. The careful and systematic use of meta-analysis, in conjunction with our other tools, will go a long way toward accomplishing this goal. It is a set of tools and a set of attitudes that Galton would have been the first to apply in his own laboratory.


More behavioral genetic data on IQ have been collected since Jensen’s 1969 monograph than in the fifty years preceding it. As mentioned earlier, I would argue that much of this research was conducted because of Jensen’s monograph and the controversy and criticism it aroused.

A decade and a half ago Jensen clearly and forcefully asserted that IQ scores are substantially influenced by genetic differences among individuals. No telling criticism has been made of his assertion, and newer data consistently support it. No other finding in the behavioral sciences has been researched so extensively, subjected to so much scrutiny, and verified so consistently.

Chris Brand also has a chapter in this book; perhaps it has something relevant, but I don’t recall it well.

To return to Sesardić, his contention is that non-scientific opposition to a scientific claim will produce so-called double standards: higher standards for proponents of the claim, so that if reality supports the claim, higher-quality evidence will be gathered and published. It is the reverse for critics of the claim: they face less scrutiny, so their published arguments and evidence will tend to be poorer. Do we have some objective way to test this claim? We do. We can measure scientific rigor by field or subfield and compare. Probably the most left-wing field of psychology is social psychology, and it has massive issues with the replication crisis. Intelligence and behavioral genetics research, on the other hand, is among the least left-wing areas of psychology (nearly a 50-50 balance of self-placed politics) and has no such big problems, consistent with the prediction of high rigor. A simple way to measure rigor is to compile data on statistical power by field; this is sometimes calculated as part of meta-analyses. Sean Last has compiled such values, reproduced below.

Citation                          Discipline                        Mean/Median Power
Button et al. (2013)              Neuroscience                      21%
Button et al. (2013)              Brain Imaging                      8%
Smaldino and McElreath (2016)     Social and Behavioral Sciences    24%
Szucs and Ioannidis (2017)        Cognitive Neuroscience            14%
Szucs and Ioannidis (2017)        Psychology                        23%
Szucs and Ioannidis (2017)        Medical                           23%
Mallet et al. (2017)              Breast Cancer                     16%
Mallet et al. (2017)              Glaucoma                          11%
Mallet et al. (2017)              Rheumatoid Arthritis              19%
Mallet et al. (2017)              Alzheimer’s                        9%
Mallet et al. (2017)              Epilepsy                          24%
Mallet et al. (2017)              MS                                24%
Mallet et al. (2017)              Parkinson’s                       27%
Lortie-Forgues and Inglis (2019)  Education                         23%
Nuijten et al. (2018)             Intelligence                      49%
Nuijten et al. (2018)             Intelligence – Group Differences  57%

The main issue with this table is that the numbers are a mix of median and mean power, inconsistently across fields. The median is usually lower, so one could convert the values using their mean observed ratio.
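To make the table concrete: statistical power is a direct calculation. A minimal sketch below uses the normal approximation to the two-sided two-sample t-test; the effect size and sample sizes are illustrative, not taken from any of the cited papers.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample t-test under the
    normal approximation (adequate for n per group above ~30)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    ncp = d * sqrt(n_per_group / 2)  # noncentrality parameter
    # Probability of landing beyond either critical value
    return nd.cdf(ncp - z_crit) + nd.cdf(-ncp - z_crit)

# A typical social-psychology effect (d = 0.4) with n = 30 per group:
print(round(power_two_sample(0.4, 30), 2))   # ~0.34, in the range of the table above
print(round(power_two_sample(0.4, 200), 2))  # ~0.98, what adequate samples buy
```

With a true d of 0.4 and 30 subjects per group, the study detects the effect only about a third of the time, which is roughly where several fields in the table sit.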

I should very much like someone to do a more detailed study of this. I imagine one would proceed as follows:

  1. Acquire a large dataset of scientific articles, including title, authors, abstract, keywords, fulltext, and references. This can be done either via Scihub (difficult) or by mining open access journals (probably easy).
  2. Use algorithms to extract the data of interest. Studies calculating power usually rely on so-called focal analyses, i.e., the main or most important statistical tests. These are hard to identify with simple algorithms, but simple algorithms can extract all reported tests (those in standardized format, that is!); check out the work by Nuijten et al linked above. A better idea is to get a dataset of manually extracted data and then train a neural network to extract the rest. I think one could train such an algorithm to be at least as accurate as human raters; once that is done, it can be run on every paper in the dataset. Furthermore, one should look into additional automated measures of scientific rigor or quality. These can be relatively simple, like counting table, figure, and reference density, the presence of robustness tests, or mentions of key terms such as “statistical power” and “publication bias”. They can also be more complicated, such as an algorithm that predicts whether a paper will likely replicate, trained on data from large replication studies. Such a prototype algorithm has been developed and reached an AUC of .77!
  3. Relate the measures of scientific rigor to indicators of the authors’ political views, the conclusions of the paper, or the topic in general. Control for obvious covariates such as year of publication.
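Step 2 can be illustrated with the simplest version of the statcheck-style approach: a regex that picks standardized APA-format test reports out of fulltext. A minimal sketch follows; the text and pattern are illustrative only, and real extraction (as in the Nuijten et al work) also recomputes p-values from the reported statistics, which requires a t-distribution CDF from a library such as scipy.

```python
import re

# Matches APA-style t-test reports like "t(28) = 2.10, p = .045"
APA_T = re.compile(
    r"t\((?P<df>\d+)\)\s*=\s*(?P<t>-?\d+\.?\d*)\s*,\s*"
    r"p\s*(?P<op>[=<>])\s*(?P<p>\.\d+)"
)

text = """We found a group difference, t(28) = 2.10, p = .045,
but no interaction, t(28) = 0.90, p > .05."""

# Pull out (df, t, comparison operator, reported p) for each hit
hits = [(m["df"], m["t"], m["op"], m["p"]) for m in APA_T.finditer(text)]
print(hits)  # [('28', '2.10', '=', '.045'), ('28', '0.90', '>', '.05')]
```

This only catches tests reported in the standardized format; the proposal above is precisely to train a model on hand-coded papers so the non-standardized majority can be captured too.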

Stereotype accuracy: summary of some studies

Now that the video is up, here are some other studies I came across.

Has a pretty straightforward summary of stereotypes and stereotyping:

Organizations experience high levels of inefficiency when decisions are based on inaccurate stereotypes. As humans are dependent upon stereotypes in their daily information processing, a critical issue is the identification of conditions that produce more accurate stereotypes. This article delineates a social cognition model of stereotyping and identifies the factors involved in developing more accurate stereotypes. The model is applied to gender stereotypes to indicate how these stereotypes may be modified. Managerial implications and future research issues are identified with the anticipation that these ideas will provide guidelines as to how to stimulate more accurate stereotypes in organizations.

From the introduction:

Humans are dependent upon stereotypes to reduce their information processing demands. Unfortunately this dependence creates a number of problems for organizations and individuals. Inaccurate stereotypes lead to inefficient and uneconomical decisions, and create major barriers in the advancement of minority status individuals. [some citation] Stereotypes can not be eliminated; thus, a critical issue for organizations is identifying conditions that propagate more accurate stereotypes.

Within a social cognition framework stereotypes function to reduce information processing demands, define group membership, and/or predict behavior based on group membership [6 citations]. Stereotyping has a negative connotation because it is often (a) a source or excuse for social injustice, (b) based on relatively little information, (c) resistant to change even with new information, (d) rarely accurately applied to specific individuals [2 citations]. However, stereotyping is not a negative process; rather it is a neutral, subconscious cognitive process that increases the efficiency of interpreting environmental information. Stereotypes often reflect accurate generalizations about large social categories [4 citations]. In recognition that stereotypes are developed and employed subconsciously, have some degree of accuracy, and can produce social injustice, stereotyping in this study is defined as a neutral, necessary cognitive process that can lead to inaccuracies and/or negative consequences.

[left out citations because PDF not OCR’d]

Accuracy of participants’ ratings of gender differences on 77 behaviors and traits was assessed by correlating participants’ ratings with actual gender differences based on meta-analyses. Accuracy at the group level was impressively high in 5 samples of participants. Accuracy of individuals showed wide variability, suggesting that ability to accurately describe gender differences is an individual difference. Analysis of correlations between individual accuracy and a battery of psychological measures indicated that accuracy was negatively related to a tendency to accept and use stereotypes, negatively related to a rigid cognitive style, and positively related to measures of interpersonal sensitivity.

They collected data from 5 samples, totaling 708 students, massive for its time.

Some tables of interest. First, overall accuracy. Not sure why there is no value for all data combined, which would be marginally better. Samples 4-5 each have ~200 students evenly split by sex.
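For the record, group-level accuracy here just means correlating perceived gender differences with the meta-analytic ones. A minimal sketch with made-up effect sizes (not the paper’s data):

```python
from statistics import mean, stdev

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

# Hypothetical effect sizes (d) for five traits -- NOT the paper's numbers
actual    = [0.8, 0.2, -0.4, 0.5, -0.1]   # meta-analytic gender differences
perceived = [0.6, 0.3, -0.2, 0.7,  0.0]   # one rater's judgments

print(round(pearson(actual, perceived), 2))  # → 0.95
```

A rater who gets the rank order and rough magnitudes right scores near 1 even if every single judgment is somewhat off, which is why group-level accuracy can be “impressively high” while individual judgments vary.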

Accuracy across trait types:

Individual accuracy summary stats:

Finally, correlations of individual accuracy:


The Profile of Nonverbal Sensitivity (PONS) test (Rosenthal et al., 1979) measures ability to identify an encoder’s intended message through nonverbal cues. In this test, a female encoder acts out affective scenes that are each edited to 2 s; these scenes are presented to the test taker in different nonverbal channels, which may consist of facial cues, gestural cues, voice tone cues (but not linguistic cues), and combinations of these. The test taker responds on a multiple-choice answer sheet. In Sample 3 we used the 40-item silent face and body short form of the PONS; in Sample 4 we used the full-length PONS, consisting of 220 items for all nonverbal channels. In the present study, Cronbach’s alpha was .22 for the short form and .83 for the full-length test.

So, it’s an older, longer, and probably better version of the modern Reading the Mind in the Eyes test. It probably also measures emotional intelligence. It’s the only decent positive predictor in this sample, while social dominance was a negative predictor:

Social Dominance Scale (Samples 3, 4, and 5). This 14-item instrument measures the belief that some groups are superior to others (Pratto, Sidanius, Stallworth, & Malle, 1994). Sample items are “Some people are just more deserving than others” and “It is important that we treat other countries as equals” (reverse scored). Although not directly measuring stereotyping, the scale is described by its authors as significantly correlated with measures of ethnic prejudice and sexism. Higher values indicate more endorsement of the concept of group superiority (median α = .82).

Word embeddings are a powerful machine-learning framework that represents each English word by a vector. The geometric relationship between these vectors captures meaningful semantic relationships between the corresponding words. In this paper, we develop a framework to demonstrate how the temporal dynamics of the embedding helps to quantify changes in stereotypes and attitudes toward women and ethnic minorities in the 20th and 21st centuries in the United States. We integrate word embeddings trained on 100 y of text data with the US Census to show that changes in the embedding track closely with demographic and occupation shifts over time. The embedding captures societal shifts—e.g., the women’s movement in the 1960s and Asian immigration into the United States—and also illuminates how specific adjectives and occupations became more closely associated with certain populations over time. Our framework for temporal analysis of word embedding opens up a fruitful intersection between machine learning and quantitative social science.
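The kind of measurement this abstract describes can be sketched in a few lines: compare an occupation vector’s cosine similarity to gendered group words. A toy example with hand-made 3-d vectors (the real work uses high-dimensional embeddings trained separately per decade; every word and number here is illustrative):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Toy 3-d "embeddings" -- hand-made for illustration only
vec = {
    "she":      [0.9, 0.1, 0.0],
    "he":       [0.1, 0.9, 0.0],
    "nurse":    [0.8, 0.2, 0.1],
    "engineer": [0.2, 0.8, 0.1],
}

def association(occupation):
    """Positive => occupation sits closer to 'she' than to 'he'."""
    return cosine(vec[occupation], vec["she"]) - cosine(vec[occupation], vec["he"])

print(round(association("nurse"), 2))     # positive in this toy setup
print(round(association("engineer"), 2))  # negative in this toy setup
```

Tracking this association score across embeddings trained on different decades of text is what lets the authors chart how occupation–gender stereotypes shifted over the century.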


A similar study is this one: Bolukbasi et al 2016.

Interest in stereotype accuracy over time? Ngram viewer seems to indicate not too much disparity.

But the academic literature gives another picture:

  • 3570 “inaccurate stereotypes” vs. 441 “accurate stereotypes”: ratio = 8.1
  • 677 “inaccurate stereotype” vs. 333 “accurate stereotype”: ratio = 2.0

Reproductive genetics

IVF tech by country: historical growth

Some time ago, I made a plot of this that Steve Hsu ended up posting. However, here’s a newer one. The public datafile is here. The latest published report is here, which covers data for 2014. They publish a paper every year or so, reporting on roughly 5-year-old data. I see no 2019 report with the 2015 data, which is odd.


  • Patchy reporting for many countries.
  • Various complicated distinctions between treatment types: IVF (‘whole thing’) vs. ICSI (‘put sperm into egg’) vs. FER vs … . The all-inclusive term is assisted reproductive technology (ART).
  • Their tables aren’t exactly the same format every year, so I have merged them by manually moving numbers around. I might have made a mistake somewhere.

A bit tricky to plot this at once. You can copy my code from the R notebook and try yourself!
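For what it’s worth, the wrangling step is mostly about going from patchy per-year tables to one long-format table before plotting. A minimal sketch in Python (the actual code is in the R notebook; the countries and cycle counts below are made up):

```python
# Patchy per-year country tables, keyed by year -- numbers are made up
tables = {
    2012: {"Denmark": 15000, "France": 85000},
    2013: {"Denmark": 15500},               # France missing this year
    2014: {"Denmark": 16000, "France": 90000},
}

# Flatten to long format: one (country, year, cycles) row per observation.
# Missing country-years simply produce no row, which plotting libraries
# handle gracefully as gaps in a line.
long_rows = [
    (country, year, n)
    for year, by_country in sorted(tables.items())
    for country, n in sorted(by_country.items())
]
print(long_rows)
```

Long format sidesteps the patchy-reporting problem: each series just has holes where a country failed to report, rather than forcing every table into a common set of columns.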


New video out: Stereotype Accuracy and Inaccuracy

Published a new video summarizing the stereotype accuracy and bias evidence. Well, some of it! I also discuss various implications of taking accuracy seriously in a Bayesian framework.

75 minutes long!

Genomics intelligence / IQ / cognitive ability

Predicting IQ from genetics: how far have we come? [January 2020]

Results from Allegrini et al 2019

Counting significant hits was always a dumb way to measure progress in genomic prediction of a trait. Breeders working with animals and plants never bothered with this approach; they used ridge regression for best predictive power (a “two cultures” problem, no doubt). Researchers in human genetics are starting to catch up, implementing a clever elastic net (Enet) approach for array datafiles (Qian et al 2019, called snpnet, based on glmnet). We are still waiting for this method to be widely used. It is possible to do a summary-statistics-based Enet too (Mak et al 2017, called lassosum), but again, not many have done it yet.
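For intuition about why the breeders’ approach beats hit-counting: ridge regression shrinks all effects jointly instead of thresholding on significance. A minimal two-predictor sketch with made-up genotype data, solving the ridge normal equations (XᵀX + λI)β = Xᵀy by hand (real implementations like snpnet solve this for hundreds of thousands of SNPs at biobank scale):

```python
def ridge_2d(X, y, lam):
    """Ridge solution (X'X + lam*I)^-1 X'y for exactly two predictors,
    solving the 2x2 system in closed form (no intercept; data assumed centered)."""
    a = sum(x[0] * x[0] for x in X) + lam
    b = sum(x[0] * x[1] for x in X)
    d = sum(x[1] * x[1] for x in X) + lam
    c1 = sum(x[0] * yi for x, yi in zip(X, y))
    c2 = sum(x[1] * yi for x, yi in zip(X, y))
    det = a * d - b * b
    return ((d * c1 - b * c2) / det, (a * c2 - b * c1) / det)

# Toy centered "genotypes" (two SNPs) and phenotype -- made up
X = [[1, 0], [0, 1], [-1, 1], [1, -1], [-1, 0], [0, -1]]
y = [0.9, 0.4, -0.5, 0.5, -0.9, -0.4]

print(ridge_2d(X, y, lam=0.0))  # OLS estimates
print(ridge_2d(X, y, lam=5.0))  # same signs, shrunk toward zero
```

The penalty λ trades a little bias for a lot of variance reduction, which is exactly what you want when predictors vastly outnumber observations, as in GWAS data.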

That being said, we still see a lot of progress owing to larger datasets and some improvements in using the output from the ‘single-variant-at-a-time’ regression used in regular GWASs. A brief summary. I focus on the TEDS sample (a big UK twin sample with good DNA data and cognitive testing) because it is the largest dataset with great cognitive testing that was not used to train the GWASs. It’s possible someone could use the new subset of UK Biobank with improved cognitive testing to replicate the below (Cox et al 2019, n=29k).

  • Davies, G., Marioni, R. E., Liewald, D. C., Hill, W. D., Hagenaars, S. P., Harris, S. E., … & Cullen, B. (2016). Genome-wide association study of cognitive functions and educational attainment in UK Biobank (N= 112 151). Molecular psychiatry, 21(6), 758.
    • Polygenic score analyses indicate that up to 5% of the variance in cognitive test scores can be predicted in an independent cohort.
  • Selzam, S., Krapohl, E., von Stumm, S., O’Reilly, P. F., Rimfeld, K., Kovas, Y., … & Plomin, R. (2017). Predicting educational achievement from DNA. Molecular psychiatry, 22(2), 267.
    • We found that EduYears GPS explained greater amounts of variance in educational achievement over time, up to 9% at age 16, accounting for 15% of the heritable variance. This is the strongest GPS prediction to date for quantitative behavioral traits.
    • Not quite intelligence, but closer to intelligence (g) than to educational attainment.
  • Krapohl, E., Patel, H., Newhouse, S., Curtis, C. J., von Stumm, S., Dale, P. S., … & Plomin, R. (2018). Multi-polygenic score approach to trait prediction. Molecular psychiatry, 23(5), 1368.
    • The MPS approach predicted 10.9% variance in educational achievement, 4.8% in general cognitive ability and 5.4% in BMI in an independent test set, predicting 1.1%, 1.1%, and 1.6% more variance than the best single-score predictions.
  • Allegrini, A. G., Selzam, S., Rimfeld, K., von Stumm, S., Pingault, J. B., & Plomin, R. (2019). Genomic prediction of cognitive traits in childhood and adolescence. Molecular psychiatry, 24(6), 819.
    • In a representative UK sample of 7,026 children at ages 12 and 16, we show that we can now predict up to 11% of the variance in intelligence and 16% in educational achievement.
    • As above, educational achievement.

As it so happens, there is a paper for each year, letting one see four years of progress.

An important caveat: these predictions are not done on sibling pairs. When they are (Selzam et al 2019), the validity is reduced by ~50%. This indicates some kind of training problem with the GWASs, which either train on family-related variance, pick up population structure, or something more complicated.
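This caveat can be illustrated with a toy simulation: if the polygenic score partly captures a family-level confound, it predicts well between families but worse within sibling pairs, where the shared confound cancels out. All parameters below are made up, and the simulation illustrates the direction of the effect, not the ~50% magnitude:

```python
import random
from statistics import mean, stdev

random.seed(1)

def corr(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

# Simulate sibling pairs: phenotype = direct genetic effect + shared
# family effect + noise. The "PGS" captures the direct effect AND the
# family effect (standing in for population structure etc.).
pairs = []
for _ in range(5000):
    fam = random.gauss(0, 1)           # shared family-level confound
    sibs = []
    for _ in range(2):
        g = random.gauss(0, 1)         # sibling-specific genetic value
        pgs = g + fam                  # confounded polygenic score
        pheno = g + fam + random.gauss(0, 1)
        sibs.append((pgs, pheno))
    pairs.append(sibs)

# Between-family (population) prediction: one sibling per family
pop_r = corr([p[0][0] for p in pairs], [p[0][1] for p in pairs])

# Within-family prediction: sibling differences cancel the family effect
dif_r = corr([p[0][0] - p[1][0] for p in pairs],
             [p[0][1] - p[1][1] for p in pairs])

print(round(pop_r, 2), round(dif_r, 2))  # within-family r is clearly lower
```

The gap between the two correlations is exactly what the sibling design detects: whatever part of the score’s validity vanishes within families was never a direct genetic effect on the individual.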