On the validity of polygenic scores, in general and in Africans

(See 2015 post)

There’s a very interesting new preprint out:

This study investigates the creation of polygenic scores (PGSs) for human population research. PGSs are a linear, usually weighted, combination of risk alleles that estimate the cumulative genetic risk of an individual for a particular trait. While conceptually simple, there are numerous ways to estimate PGSs, not all achieving the same end goals. In this paper, we systematically investigate the impact of four key decisions in the building of PGSs from published genome-wide association meta-analysis results: 1) whether to use single nucleotide polymorphisms (SNPs) assessed by imputation, 2) criteria for selecting which SNPs to include in the score, 3) whether to account for linkage disequilibrium (LD), and 4) if accounting for LD, which type of method best captures the correlation structure among SNPs (i.e. clumping vs. pruning). Using the Health and Retirement Study (HRS), a nationally representative, population-based longitudinal panel study of Americans over the age of 50, we examine the predictive ability as well as the variability and co-variability in PGSs arising from these different estimation approaches. We examine four traits with large published and replicated genome-wide association studies (height, body mass index, educational attainment, and depression). Our central finding demonstrates PGSs that include all available SNPs either explain the most amount of variation in an outcome or are not significantly different than the PGSs that does. Thus, for reproducibility through rigor and transparency, we recommend that researchers include a PGS with all available SNPs as a reference, and provide substantial justification for using alternative methods.

So, they carried out a method-variance study of polygenic score construction. One of their findings is that one should use all the SNPs, not just those with p < alpha. The rest also carry signal, even if we cannot say specifically which variants carry it. More importantly, the signal is strong enough that when we aggregate many poor estimates, we get a better score. This is basically the Spearman-Brown prophecy formula in action:

$$\rho^{*}_{xx'} = \frac{n\,\rho_{xx'}}{1 + (n-1)\,\rho_{xx'}}$$

(where n is the number of “tests” combined (see below), ρxx’ is the reliability of the current “test”, and ρ*xx’ is the predicted reliability of the combination.)

What it means is that when we aggregate (average) many individual datapoints, each consisting of signal (true score) plus random noise, the more we average, the better the average score becomes: the noise cancels out (being, by definition, uncorrelated with anything) and what is left is the signal. The formula makes it possible to predict how well this works. I wish they would stop publishing GWASs with titles like “GWAS of nk persons reveals m variants associated with T”, where n is the sample size in thousands, m is the number of p < alpha variants, and T is a random trait.
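The noise-cancellation logic is easy to verify by simulation. A minimal sketch (the single-test reliability of 0.2 and the 16 “tests” are arbitrary illustration values, not from the paper): averaging 16 parallel tests should lift reliability from 0.2 to 16·0.2 / (1 + 15·0.2) = 0.8, exactly as Spearman-Brown predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_tests = 10_000, 16
rel_single = 0.2  # reliability of one "test": share of variance that is true score

# each test = common true score + independent noise, scaled to unit variance
true_score = rng.standard_normal(n_people)
tests = (np.sqrt(rel_single) * true_score[:, None]
         + np.sqrt(1 - rel_single) * rng.standard_normal((n_people, n_tests)))

# reliability of the average of n tests = its squared correlation with the true score
empirical = np.corrcoef(tests.mean(axis=1), true_score)[0, 1] ** 2

# Spearman-Brown prediction
predicted = n_tests * rel_single / (1 + (n_tests - 1) * rel_single)

print(round(empirical, 2), predicted)  # both ≈ 0.8
```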

[Digression: also true for factor loadings, meaning that we can give them random weights and it still works out about the same. This is why total IQ scores work just as well as g factor scores.]

To put this in perspective, this need to move from focusing on ‘hits’ to focusing on validity is part of the more general movement away from dichotomous, NHST-style thinking towards continuous thinking, which Gelman sometimes discusses.

R2 vs. r metrics for measuring accuracy

Perhaps the most interesting (to me) part of the paper is found in the appendix.

[Figure from the paper’s appendix: R2 of the polygenic scores in the White and Black HRS subsamples]

So, it does appear that there is very poor cross-race validity for polygenic scores across 3 traits and 4 GWASs (I left out the trait which had near-zero validity for Whites too). However, things may not be quite as bleak as they appear. The reported values are R2-type metrics, not r-type metrics. Squaring is a non-linear transformation that does not reflect the predictive utility of the polygenic scores. For instance, if we could attain an R2 of 25% for a trait, this is a correlation of .50, meaning that we already have 50% of perfect predictive validity. The correlation is the important metric to use because it is what counts in reality, e.g. if we want to predict disease risk/treatment effectiveness (medical genetics) or likely work proficiency (employment testing). The figure below shows the relationship.


The disadvantage of R2 as an effect size metric is especially salient for small values (below R2 = .50 or so). An R2 of 0.01 (1%) sounds very small compared to one of .05 (5%), but take the square roots and you see that they are not so different after all: r’s of .10 and .22, only 2.2 times the utility, not 5 times. This actually matters quite a lot because traits don’t have heritabilities of 100%, but somewhere below that. For educational attainment (as crudely measured), the heritability is about 40%. That’s the maximal accuracy for predictions based on genetics alone, and it corresponds to r = .63, which is quite good. Now go back and ask: what R2 do we need to get 50% of the maximal possible genetic predictive validity, i.e. r = .315? Only R2 = 9.9%. This is about what has already been achieved in one sample!
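A quick check of the arithmetic above (a sketch using only the numbers from the text):

```python
import math

# R2 of 1% vs. 5%: the correlations differ by ~2.2x, not 5x
r1, r5 = math.sqrt(0.01), math.sqrt(0.05)
print(round(r1, 2), round(r5, 2), round(r5 / r1, 1))  # 0.1 0.22 2.2

# heritability of educational attainment ~40% sets the genetic ceiling
r_max = math.sqrt(0.40)        # ≈ 0.63
r_half = r_max / 2             # 50% of the maximal validity, ≈ 0.316
print(round(r_half ** 2, 3))   # R2 needed: 0.1, i.e. ~10%
```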

[The primary use of R2 metrics is when one is reasoning about standardized values, where the variances must sum to 1. Example: I want to simulate a standardized variable, Y, that is caused by X with standardized beta .50 and nothing else. How much noise (residual) do I need to add to make the standard deviation/variance 1? To get that value, we need R2. So, square .50 to get .25 (variance accounted for by X); 1 − .25 = .75 (variance left for the residual); take the square root to get the residual path, .87.]
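The recipe in the bracket can be checked by simulation (a sketch; the seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

beta = 0.5
resid_path = np.sqrt(1 - beta ** 2)  # sqrt(.75) ≈ .87

x = rng.standard_normal(n)
y = beta * x + resid_path * rng.standard_normal(n)

# Y should come out standardized, with corr(X, Y) = .50
print(round(y.std(), 2), round(np.corrcoef(x, y)[0, 1], 2))  # ≈ 1.0 and ≈ 0.5
```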

Validity in Africans?

The scores appear not to work in Africans, but by now we’re a bit more skeptical. In this case, a further disadvantage of R2 is that when correlations are low, the R2 values are very small and thus hard to read off figures! However, I tried my best to estimate the values and got these:

| Trait   | Blacks R2 | Whites R2 | Blacks r | Whites r | Ratio B/W r |
|---------|-----------|-----------|----------|----------|-------------|
| Height  | 0.005     | 0.070     | 0.071    | 0.265    | 0.267       |
| BMI     | 0.018     | 0.058     | 0.134    | 0.241    | 0.557       |
| EA 2013 | 0.003     | 0.029     | 0.050    | 0.170    | 0.294       |
| EA 2016 | 0.010     | 0.036     | 0.100    | 0.190    | 0.527       |


So, as best as I can make out, the polygenic correlations for Blacks are about 40% of those for Whites. If we assume that Blacks are 20% European and Whites are 100% European, then we would expect the scores to have 20% of the European validity from that ancestry alone. We thus observe an excess of validity (.411 − .200 = .211), which we can attribute to validity in the African ancestry. However, there’s only 80% African ancestry left, so we have to adjust for that by dividing by .8, yielding 26.4%. This is our estimate of the relative predictive validity of the polygenic scores in African ancestry. So, it is not quite 0%, and the scores may be useful in practice.
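The adjustment can be spelled out in a few lines (the 20% European admixture figure is the assumption stated above, not an estimate from the paper):

```python
# B/W r ratios from the table above
ratios = {"Height": 0.267, "BMI": 0.557, "EA 2013": 0.294, "EA 2016": 0.527}
mean_ratio = sum(ratios.values()) / len(ratios)   # ≈ 0.411

eur_admixture = 0.20                              # assumed European ancestry in Blacks
excess = mean_ratio - eur_admixture               # validity beyond what admixture buys
african_validity = excess / (1 - eur_admixture)   # rescale to the 80% African ancestry

print(round(mean_ratio, 3), round(african_validity, 3))  # ≈ 0.411 and ≈ 0.264
```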

[Note: In my previous post, I used the R2 values for the comparisons instead. And I think that was a mistake. However, I’m not quite sure which method is right. So if someone knows, please let me know!]

The implications of this as I see it are:

One has to be very cautious when interpreting the results of Piffer-style analyses. If the SNPs don’t proxy the causal variants well in Africans, then looking at the SNP frequencies in Africans does not make much sense. If these SNPs are random with respect to African causal variation, the noise would tend to move the African frequencies towards 50%, and this may allow an adjusted method to work. For now, I’m pretty skeptical. This is basically what I said last time, but now I’m more skeptical. I wish there were a national-level dataset of frequency estimates; this is entirely necessary for doing genetically informed national-level social science.

The issue may be solvable by using more SNPs. More SNPs means higher SNP density, which means that SNPs will tend to be located closer on the genome to the causal variants and thus be better proxies for them. One possibility is that the causal variants are in fact to a large degree non-deletion copy number variation (short repeats) that is hard to spot using arrays (like the Huntington’s architecture), perhaps in promoter regions of genes that matter. Some argue that the prevalence of such variation has been underestimated.

Given that we need rare SNPs to capture African causal variation, and that we can probably assume the causal variation is quite similar across human groups (i.e. bodies fundamentally work the same way, just with slightly different tuning parameters), this seems to imply that the causal variants are mostly rare, maybe deleterious (per Hsu), variants. That we can use common SNPs to proxy them in Europeans could be interpreted as meaning merely that we can find patterns in the random LD patterns/noise that are statistically linked to variation in the causal variants. Much of this random variation post-dates the out-of-Africa migration, and thus does not work for Africans.

This interpretation dovetails nicely with the recent n = 700k GWAS of height that found largish effects of rare variants. If true, this is good news for genetic engineering. The feasibility of genetic engineering depends on 1) the rate of problematic off-target hits (i.e. messing up something else), and 2) the effect sizes and number of variants. The fewer variants we need to edit, the safer it is to edit them without major risk of problematic off-target edits. Per the height paper, some alleles had effect sizes of 2 cm. With winner’s curse/the decline effect/regression towards the mean, maybe 1 cm. Still, the height SD is about 7 cm, so that’s a rather large effect size of about .15 per allele, or .30 per locus (because there are 2 alleles). Assuming the same holds for intelligence, and that we can edit e.g. 10 loci without danger, that’s 3 SD, or 45 IQ! But only for a person who had the bad variant on both alleles at all 10 rare-variant loci. So, the utility depends on the distribution of the rare variants too. But surely this is pretty exciting! :)
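A back-of-envelope sketch of that arithmetic (exact division gives a bit under the rounded 3 SD / 45 IQ figures above):

```python
effect_per_allele_cm = 1.0   # ~2 cm reported, halved for winner's curse
height_sd_cm = 7.0

d_allele = effect_per_allele_cm / height_sd_cm  # ≈ 0.14 SD (rounded to .15 above)
d_locus = 2 * d_allele                          # two editable alleles per locus
n_loci = 10                                     # assumed number of safely editable loci

gain_sd = n_loci * d_locus   # ≈ 2.9 SD (≈ 3 with the rounded per-locus value)
gain_iq = gain_sd * 15       # if the same architecture held for IQ
print(round(gain_sd, 2), round(gain_iq, 1))  # ≈ 2.86 and ≈ 42.9
```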