Gwern has discussed this before, but here’s another take inspired by this Twitter exchange:

I envision the following way of finding out:

  1. Gather a large cross-sectional dataset of MZ twins measured on a large number of varied traits, habits etc. of interest.
  2. Calculate all the between family associations using one twin randomly selected from each pair.
  3. Calculate all the within MZ pair associations. Adjust these for restriction of range.
  4. Do RCTs to examine the causal size of the associations (for those where this is possible).
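
Steps (2) and (3) can be sketched in a few lines. This is only a minimal illustration on simulated data, not an analysis of any real registry: the data layout (one array per trait with shape `(n_pairs, 2)`), the function names, and the use of difference-score correlations for the within-pair association are all assumptions. The post does not say which range-restriction correction is meant; Thorndike's Case 2 formula is shown as one standard candidate. The simulated association is purely familial (no causal effect of x on y), so it should show up between families and vanish within pairs.

```python
import numpy as np

def between_family_r(x, y, rng):
    """Step (2): correlate x and y using one randomly chosen twin per pair."""
    n_pairs = x.shape[0]
    pick = rng.integers(0, 2, size=n_pairs)  # 0 or 1 for each pair
    rows = np.arange(n_pairs)
    return np.corrcoef(x[rows, pick], y[rows, pick])[0, 1]

def within_pair_r(x, y):
    """Step (3): correlate within-pair difference scores (twin 1 minus twin 2)."""
    dx = x[:, 0] - x[:, 1]
    dy = y[:, 0] - y[:, 1]
    return np.corrcoef(dx, dy)[0, 1]

def correct_range_restriction(r, sd_unrestricted, sd_restricted):
    """Thorndike Case 2 correction (an assumed choice, not named in the post)."""
    u = sd_unrestricted / sd_restricted
    return r * u / np.sqrt(1 + r**2 * (u**2 - 1))

# Hypothetical data: a shared family factor drives both traits,
# and x has no causal effect on y.
rng = np.random.default_rng(0)
n_pairs = 2700
family = rng.normal(size=(n_pairs, 1))       # identical for both twins in a pair
x = family + rng.normal(size=(n_pairs, 2))   # trait x, one column per twin
y = family + rng.normal(size=(n_pairs, 2))   # trait y, one column per twin

between_r = between_family_r(x, y, rng)  # expected to be clearly positive
within_r = within_pair_r(x, y)           # expected to be close to zero
print(f"between-family r = {between_r:.2f}, within-pair r = {within_r:.2f}")
```

With real data, the same two functions would be looped over every pair of traits, giving one between-family and one within-pair estimate per association to set against the RCT results.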

Then we compare the results from steps (2), (3) and (4). Data of type (2) is easy to get, but not very informative about causality. Data of type (3) is harder to get, but we’re getting quite good at building large twin registries, so this is being solved. The UK TEDS project has 5,400 MZs in its public dataset, so 2,700 pairs, more than sufficient for this. (4) will be hard to do. One will have to select a smaller subset of behaviors or traits that are suitable for interventions and systematically try these, then compare with the other two results.
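
To give a rough sense of why 2,700 pairs is plenty, here is a small sketch of the confidence interval around a within-pair correlation at that sample size. It assumes the standard Fisher z approximation and a hypothetical observed correlation of 0.10; the function name is made up for illustration.

```python
import math

def r_confidence_interval(r, n, z_crit=1.96):
    """Approximate 95% CI for a correlation r from n observations (Fisher z)."""
    z = math.atanh(r)            # Fisher transformation
    se = 1 / math.sqrt(n - 3)    # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Hypothetical small within-pair correlation, with n = number of MZ pairs.
lo, hi = r_confidence_interval(0.10, 2700)
print(f"95% CI: {lo:.3f} to {hi:.3f}")
```

Under these assumptions the interval is roughly ±0.04 around the estimate, so even small within-pair associations can be told apart from zero.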

Doing this would be a major undertaking similar to the Reproducibility Project. However, the results would be very informative about how skeptical we should be about causal interpretations in general (see this unsystematic collection), as well as provide information about how informative MZ associations are for causality. It could be that nearly all MZ associations reflect causality, though they may be upwardly biased for the effect size. If they do provide non-trivial evidence of causality (e.g. 50% chance of ≥50% of effect size being causal), they could be very useful for identifying plausible candidates for intervention RCTs.

It is possible to carry out steps (1-3) with the public TEDS data. But their collection of traits (p=176) is very narrow in topic (most concern ability and auxiliary variables) and many are not suitable for interventions. One that they examined and that was suitable, the effect of teachers on ability, behavior and motivation (comparing twins who did and did not share the same teachers), produced near-zero effects (Kovas et al 2007, 2015, Coventry et al 2009).


  • We use one twin from each pair instead of averaging the pair, because averaging would increase the signal-to-noise ratio and make the results harder to compare with other studies.