A partial test of DUF1220 for population differences in intelligence?

You might have heard of the DUF1220 hypothesis. It goes something like this:

  • DUF1220 is a copy number variant poorly tagged by arrays, and thus would not be captured well by typical GWASs for education/IQ.
  • Comparative species data suggests strong selection for DUF1220 with increased intelligence/brain size.
  • There’s some data showing a relationship between IQ in humans and DUF1220 copy number.
  • Thus, the hypothesis is plausible, and hereditarians would expect that, if DUF1220 is causal for intelligence, its copy number should show population differences, as the regular SNP-based polygenic scores do (Piffer 2018).

The between-species plot certainly looks impressive. The background papers are:

From Keeney et al

The individual human data:

  • Davis, J. M., Searles, V. B., Anderson, N., Keeney, J., Raznahan, A., Horwood, L. J., … & Sikela, J. M. (2015). DUF1220 copy number is linearly associated with increased cognitive function as measured by total IQ and mathematical aptitude scores. Human genetics, 134(1), 67-75.

Sample: 59 individuals (41 males and 18 females) whose ages ranged from 6 to 22.

DUF1220 protein domains exhibit the greatest human lineage-specific copy number expansion of any protein-coding sequence in the genome, and variation in DUF1220 copy number has been linked to both brain size in humans and brain evolution among primates. Given these findings, we examined associations between DUF1220 subtypes CON1 and CON2 and cognitive aptitude. We identified a linear association between CON2 copy number and cognitive function in two independent populations of European descent. In North American males, an increase in CON2 copy number corresponded with an increase in WISC IQ (R2 = 0.13, p = 0.02), which may be driven by males aged 6–11 (R2 = 0.42, p = 0.003). We utilized ddPCR in a subset as a confirmatory measurement. This group had 26–33 copies of CON2 with a mean of 29, and each copy increase of CON2 was associated with a 3.3-point increase in WISC IQ (R2 = 0.22, p = 0.045). In individuals from New Zealand, an increase in CON2 copy number was associated with an increase in math aptitude ability (R2 = 0.10 p = 0.018). These were not confounded by brain size. To our knowledge, this is the first study to report a replicated association between copy number of a gene coding sequence and cognitive aptitude. Remarkably, dosage variations involving DUF1220 sequences have now been linked to human brain expansion, autism severity and cognitive aptitude, suggesting that such processes may be genetically and mechanistically inter-related. The findings presented here warrant expanded investigations in larger, well-characterized cohorts.

So, not at all convincing. Might be true, but these data look supremely p-hacked.
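
As a rough sanity check on the headline result, the p-value implied by a reported R2 can be back-calculated for a given sample size. The sketch below uses the Fisher z approximation and assumes n = 41, i.e. that all 41 males in the sample had WISC scores; the paper's actual n for that test may differ. The second function gives the chance of at least one p < 0.05 among k independent tests of a true null. The k values are purely illustrative, since the number of comparisons actually tried is not known.

```python
from math import atanh, sqrt
from statistics import NormalDist


def p_from_r2(r2: float, n: int) -> float:
    """Approximate two-sided p-value for a simple-regression R^2 with n pairs.

    Uses the Fisher z-transformation of r = sqrt(R^2). This is a normal
    approximation, not the exact t-test.
    """
    z = atanh(sqrt(r2)) * sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def familywise_error(k: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive among k independent null tests."""
    return 1.0 - (1.0 - alpha) ** k


# ASSUMPTION: n = 41 (all males in the sample); not confirmed by the abstract.
print(f"R2 = 0.13, n = 41: p ~ {p_from_r2(0.13, 41):.3f}")

# HYPOTHETICAL numbers of subgroup/outcome combinations tried.
for k in (1, 4, 8):
    print(f"{k} tests: P(at least one p < 0.05) = {familywise_error(k):.2f}")
```

Under these assumptions the implied p-value lands near the reported 0.02, and slicing a sample of 59 by sex, age band, DUF1220 subtype, and outcome quickly raises the chance of a spurious hit.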

What about population-level counts in the 1000 Genomes data? It turns out someone did a study:


DUF1220 protein domains found primarily in Neuroblastoma BreakPoint Family (NBPF) genes show the greatest human lineage-specific increase in copy number of any coding region in the genome. There are 302 haploid copies of DUF1220 in hg38 (~160 of which are human-specific) and the majority of these can be divided into 6 different subtypes (referred to as clades). Copy number changes of specific DUF1220 clades have been associated in a dose-dependent manner with brain size variation (both evolutionarily and within the human population), cognitive aptitude, autism severity, and schizophrenia severity. However, no published methods can directly measure copies of DUF1220 with high accuracy and no method can distinguish between domains within a clade.

Here we describe a novel method for measuring copies of DUF1220 domains and the NBPF genes in which they are found from whole genome sequence data. We have characterized the effect that various sequencing and alignment parameters and strategies have on the accuracy and precision of the method and defined the parameters that lead to optimal DUF1220 copy number measurement and resolution. We show that copy number estimates obtained using our read depth approach are highly correlated with those generated by ddPCR for three representative DUF1220 clades. By simulation, we demonstrate that our method provides sufficient resolution to analyze DUF1220 copy number variation at three levels: (1) DUF1220 clade copy number within individual genes and groups of genes (gene-specific clade groups) (2) genome wide DUF1220 clade copies and (3) gene copy number for DUF1220-encoding genes.

To our knowledge, this is the first method to accurately measure copies of all six DUF1220 clades and the first method to provide gene specific resolution of these clades. This allows one to discriminate among the ~300 haploid human DUF1220 copies to an extent not possible with any other method. The result is a greatly enhanced capability to analyze the role that these sequences play in human variation and disease.

So, they developed a method to count DUF1220 copies from sequence data, which is difficult because it’s a strange kind of variation (their tool is publicly available). Then they tested it on the public 1000 Genomes data, but:

Approximately 25 individuals were randomly chosen from each of the CEU, YRI, CHB, JPT, MXL, CLM, PUR, ASW, LWK, CHS, TSI, IBS, FIN, and GBR populations for a total of 324 individuals.

For some reason they used only ~25 persons from each population, despite more being freely available. 🤔 Someone should redo this with all the data. Their counts are shown only in the supplements:

The plot doesn’t seem to show any consistent pattern. Maybe one would emerge had they merged the populations into continental groups. They don’t provide the values in a table, so we can’t easily do it ourselves.
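
If per-sample counts were available (say, from rerunning the authors’ tool), merging the 14 populations into continental groups would be straightforward. A minimal sketch: the mapping follows the 1000 Genomes Project’s own super-population grouping, while the input format and the example values are made-up placeholders, not data from the paper.

```python
from collections import defaultdict
from statistics import mean

# 1000 Genomes population code -> continental super-population.
SUPERPOP = {
    "CEU": "EUR", "TSI": "EUR", "IBS": "EUR", "FIN": "EUR", "GBR": "EUR",
    "YRI": "AFR", "LWK": "AFR", "ASW": "AFR",
    "CHB": "EAS", "JPT": "EAS", "CHS": "EAS",
    "MXL": "AMR", "CLM": "AMR", "PUR": "AMR",
}


def continental_means(samples):
    """Return {super-population: (n, mean copy number)}.

    `samples` is an iterable of (population_code, copy_number) pairs;
    this input format is an assumption, not the tool's actual output.
    """
    grouped = defaultdict(list)
    for pop, count in samples:
        grouped[SUPERPOP[pop]].append(count)
    return {sp: (len(vals), mean(vals)) for sp, vals in grouped.items()}


# PLACEHOLDER values for illustration only; these are not real measurements.
toy = [("CEU", 30.0), ("GBR", 32.0), ("YRI", 28.0), ("CHB", 29.0)]
for sp, (n, avg) in sorted(continental_means(toy).items()):
    print(f"{sp}: n={n}, mean={avg:.1f}")
```

Note that ASW and the three AMR populations are admixed, so one might prefer to analyze them separately rather than pool them.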


Possible interpretations:

  • Maybe the samples were too small to reveal differences in the counts. [Conspiracy hat on] Perhaps this was done on purpose to hide them.
  • Maybe DUF1220 is just a fluke. After all, the human IQ study looks p-hacked, and it’s a candidate-gene finding, so the prior is low. That stuff about being selected? Well, humans have about 20k genes, so something is bound to show a pattern like that.
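
To get a feel for the small-sample possibility, here is a rough power calculation using a normal approximation. It returns the smallest true difference between two group means that a two-sided test would detect with 80% power. The difference is expressed in units of the within-group standard deviation, because the actual SD of the copy numbers is not given here; the group sizes of 25 and 100 are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_diff(sd: float, n_per_group: int,
                        alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest mean difference detectable between two equal-sized groups.

    Normal approximation to the two-sample test; assumes equal SDs.
    """
    z = NormalDist()
    se = sd * sqrt(2.0 / n_per_group)  # standard error of the difference
    return (z.inv_cdf(1.0 - alpha / 2.0) + z.inv_cdf(power)) * se


# sd = 1.0 means the result is in within-group SD units (an assumption).
for n in (25, 100):
    print(f"n = {n} per group: detectable difference ~ "
          f"{min_detectable_diff(1.0, n):.2f} SD")
```

With 25 per group, only differences of roughly 0.8 SD or more would be reliably detected; pooling to about 100 per group brings that down to roughly 0.4 SD.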

Things to do:

  • Calculate the DUF1220 counts in the full 1000 Genomes dataset and in other public sequence data such as the Simons Genome Diversity Project panel. The authors’ code for analyzing 1000 Genomes is public too.

Also, noting my prior conditional prediction:

Added 9th March 2019

Blogger Half-Assed Science has also previously blogged about this DUF1220 study. They carried out some regressions, which found not much of interest, but did not rerun the analysis on the raw data to get a larger sample.

To really examine the issue, one would need a large dataset with diverse people, sequencing data, and IQ/SES. The Simons Genome Diversity Project has a lot of data but no phenotypes, so one would have to assume population means. A better option is to apply for access to the 100,000 Genomes Project, which I guess has some phenotypes, though the website is not very clear. Another option is UK10K, which has 10k sequenced genomes; I can’t see what phenotypes they have either.

Perhaps a better approach is to redo the ancient genomes paper (Woodley et al 2017), adding the DUF1220 counts. Do they increase during human evolution? Does the count correlate with absolute latitude? An easy enough research question, just waiting for someone with moderate technical skills and some courage.

This Post Has 4 Comments

    1. Emil Kirkegaard

      Nice. I actually found these papers like half a year ago, but apparently forgot to blog my analysis. Someone needs to run the 1000 genomes code to make sure we aren’t just being misled by sampling error within groups due to the authors only using 15 of each or whatever.

      1. Bonner Tal

        Yeah, someone should.

        By the way, I dropped you a mail, a few weeks back, about one of your recent papers. Possibly went to the spam folder.

        1. Emil Kirkegaard

          Found it. Forgot to reply, sorry.

          There is some bug with my WordPress where half the comments people leave get lost, and I can’t even see them except thru the admin panel. I have been unable to fix this bug, but seems related to WordPress posts changing URLs after being edited after initial publication.
