Disclaimer: Some not too structured thoughts.

It’s commonly said that correlation does not imply causation. That is true (see Gwern’s analysis), but does causation imply correlation? Specifically, if “→” means causes and “~~” means correlates with, does X→Y imply X~~Y? It may seem obvious that the answer is yes, but it is not so clear.

Before going into that, consider transitivity. Wikipedia:

In mathematics, a binary relation R over a set X is transitive if whenever an element a is related to an element b, and b is in turn related to an element c, then a is also related to c.

Is causality transitive? It seems that the answer should be yes. If A causes B, and B causes C, then A causes C. With symbols:

  1. A→B
  2. B→C
  3. ⊢ A→C

(the ⊢ symbol means therefore). If causality is transitive, and causality implies correlation, then we may guess that transitivity holds for correlation too. Does it? Sort of.

The transitivity of correlations

We might more precisely say that it is partially transitive. If A~~B at 1.0, and B~~C at 1.0, then A~~C at 1.0. However, for correlations with absolute values below 1, transitivity does not hold exactly: A~~B at 0.7 and B~~C at 0.7 does not imply that A~~C is 0.7, or 0.7² = 0.49 for that matter (the latter is the value predicted by path model tracing rules). Instead, there is a range of possible values, with 0.49 being the most likely. As usual, Jensen gives the answer, in one or more places in his numerous writings. The ranges are given in Jensen (1980, p. 302; Bias in Mental Testing):

[discussion of types of validity] Concurrent validity rests on the soundness of the inference that, since the first test correlates highly with the second test and the second test correlates with the criterion, the first test is also correlated with the criterion. It is essentially this question: If we know to what extent A is correlated with B, and we know to what extent B is correlated with C, how precisely can we infer to what extent A is correlated with C? The degree of risk in this inference can be best understood in terms of the range within which the actual criterion validity coefficient would fall when a new test is validated in terms of its correlation with a validated test. Call the scores on the unvalidated test U, scores on the validated test V, and measures on the criterion C. Then rVC, the correlation between V and C, is the criterion validity of test V; and rUV, the correlation between U and V, is the concurrent validity of test U. The crucial question, then, is what precisely can we infer concerning rUC, that is, the probable criterion validity of test U?

If we know rVC and rUV, the upper and lower limits of the possible range of values of rUC are given by the following formulas [combined to one]: rUC = rUV × rVC ± √[(1 − rUV²)(1 − rVC²)]

It may come as a sad surprise to many to see how very wide is the range of possible values of rUC for any given combination of values of rVC and rUV. The ranges of rUC are shown in Table 8.1, from which it is clear that concurrent validity inspires confidence only when the two tests are very highly correlated and the one test has a quite high criterion validity. Because it is rare to find criterion validities much higher than about .50, one can easily see the risk in depending on coefficients of concurrent validity. The risk is greatly lessened, however, when the two tests are parallel forms or one is a shortened form of the other, because both tests will then have approximately the same factor composition, which means that all the abilities measured by the first test that are correlated with the criterion also exist in the second test. The two tests should thus have fairly comparable correlations with the criterion, which is a necessary inference to justify concurrent validity.


So, the next time you see someone arguing through (multiple) steps of transitivity for correlations, beware! Given A~~B at 0.7 and B~~C at 0.7, it is still possible that A~~C is 0.0!
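Jensen's combined formula gives the bounds directly: r_AC must lie within r_AB·r_BC ± √((1 − r_AB²)(1 − r_BC²)). A quick Python check of the 0.7/0.7 case:

```python
from math import sqrt

def r_ac_bounds(r_ab, r_bc):
    """Possible range of r_AC given r_AB and r_BC (Jensen 1980, p. 302)."""
    center = r_ab * r_bc                          # the path-tracing prediction
    slack = sqrt((1 - r_ab**2) * (1 - r_bc**2))
    return center - slack, center + slack

lo, hi = r_ac_bounds(0.7, 0.7)
print(round(lo, 2), round(hi, 2))  # -> -0.02 1.0
```

So with two correlations of 0.7, A~~C is only constrained to roughly the interval [−0.02, 1].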

Visually, one can think of it in terms of variance explained and overlapping circles. If the A and B circles overlap 50% (which is r≈.71) and the same for B and C, then what A~~C is depends on whether the area where A overlaps B is also the area where B overlaps C. Because the overlaps are about 50% each (0.7² = 0.49), both complete overlap (r = 1) and almost no overlap (a slight negative correlation) are possible.

Because this argument often comes up, I should probably make a visualization.

Back to causation

Now, how does transitivity work for causation? It turns out that it depends on the exact concept we are using. For instance, suppose that A causes higher C and C causes higher Y. We would probably say that A causes higher Y. However, suppose that A also causes higher D and D causes lower Y. Does A cause higher or lower Y? We might say that it depends on the strength of the causal paths. In this way, we are talking about A’s net (main) effect on Y, which may be positive, negative or null depending on the strengths of the other causal paths. One might also take the view that A causes both higher and lower Y. This view is especially tempting when the causal paths differ between individuals. Suppose that for half the population, A causes higher C and C causes higher Y, and for the other half A causes higher D and D causes lower Y. One might instead say that A’s effect is an interaction with whatever differentiates the two halves of the population (e.g. gender).

Title says it all. Apparently, some think this is not the case, but it is a straightforward application of Bayes’ theorem. When I first learned of Bayes’ theorem years ago, I thought of this point. Back then I also believed that stereotypes are irrelevant when one has individualized information. Alas, that is incorrect. Neven Sesardic, in his excellent and highly recommended book Making Sense of Heritability (2005; download), explained it very clearly, so I will quote his account in full:

A standard reaction to the suggestion that there might be psychological differences between groups is to exclaim “So what?” Whatever these differences, whatever their origin, people should still be treated as individuals, and this is the end of the matter.

There are several problems with this reasoning. First of all, group membership is often a part of an individual’s identity. Therefore, it may not be easy for individuals to accept the fact of a group difference if it does not reflect well on their group. Of course, whichever characteristic we take, there will usually be much overlap, the difference will be only statistical (between group averages), any group will have many individuals that outscore most members of other groups, yet individuals belonging to the lowest-scoring group may find it difficult to live with this fact. It is not likely that the situation will become tolerable even if it is shown that it is not a product of social injustice. As Nathan Glazer said: “But how can a group accept an inferior place in society, even if good reasons for it are put forth? It cannot” (Glazer 1994: 16). In addition, to the extent that the difference turns out to be heritable there will be more reason to think that it will not go away so easily (see chapter 5). It will not be readily eliminable through social engineering. It will be modifiable in principle, but not locally modifiable (see section 5.3 for the explanation of these terms). All this could make it even more difficult to accept it.

Next, the statement that people should be treated as individuals is certainly a useful reminder that in many contexts direct knowledge about a particular person eclipses the informativeness of any additional statistical data, and often makes the collection of this kind of data pointless. The statement is fine as far as it goes, but it should not be pushed too far. If it is understood as saying that it is a fallacy to use the information about an individual’s group membership to infer something about that individual, the statement is simply wrong. Exactly the opposite is true: it is a fallacy not to take this information into account.

Suppose we are interested in whether John has characteristic F. Evidence E (directly relevant for the question at hand) indicates that the probability of John having F is p. But suppose we also happen to know that John is a member of group G. Now elementary probability theory tells us that if we want to get the best estimate of the probability that John has F we have to bring the group information to bear on the issue. In calculating the desired probability we have to take into account (a) that John is a member of G, and (b) what proportion of G has F. Neglecting these two pieces of information would mean discarding potentially relevant information. (It would amount to violating what Carnap called “the requirement of total evidence.”) It may well happen that in the light of this additional information we would be forced to revise our estimate of probability from p to p∗. Disregarding group membership is at the core of the so-called “base rate fallacy,” which I will describe using Tversky and Kahneman’s taxicab scenario (Tversky & Kahneman 1980).

In a small city, in which there are 90 green taxis and 10 blue taxis, there was a hit-and-run accident involving a taxi. There is also an eyewitness who told the police that the taxi was blue. The witness’s reliability is 0.8, which means that, when he was tested for his ability to recognize the color of the car under the circumstances similar to those at the accident scene, his statements were correct 80 percent of the time. To reduce verbiage, let me introduce some abbreviations: B = the taxi was blue; G = the taxi was green; WB = witness said that the taxi was blue.

What we know about the whole situation is the following:

(1) p(B) = 0.1 (the prior probability of B, before the witness’s statement is taken into account)

(2) p(G) = 0.9 (the prior probability of G)

(3) p(WB/B) = 0.8 (the reliability of the witness, or the probability of WB, given B)

(4) p(WB/G) = 0.2 (the probability of WB, given G)

Now, given all this information, what is the probability that the taxi was blue in that particular situation? Basically we want to find p(B/WB), the posterior probability of B, i.e., the probability of B after WB is taken into account. People often conclude, wrongly, that this probability is 0.8. They fail to take into consideration that the proportion of blue taxis is pretty low (10 percent), and that the true probability must reflect that fact. A simple rule of elementary probability, Bayes’ theorem, gives the formula to be applied here:

p(B/WB) = p(B) × p(WB/B) / [p(B) × p(WB/B) + p(G) × p(WB/G)].

Therefore, the correct value for p(B/WB) is 0.31, which shows that the usual guess (0.8 or close to it) is wide of the mark.

It is easier to understand that 0.31 is the correct answer by looking at Figure 6.1. Imagine that the situation with the accident and the witness repeats itself 100 times. Obviously, we can expect that the taxi involved in the accident will be blue in 10 cases (10 percent), while in the remaining 90 cases it will be green. Now consider these two different kinds of cases separately. In the top section (blue taxis), the witness recognizes the true color of the car 80 percent of the times, which means in 8 out of 10 cases. In the bottom section (green taxis), he again recognizes the true color of the car 80 percent of the times, which here means in 72 out of 90 cases. Now count all those cases where the witness declares that the taxi is blue, and see how often he is right about it. Then simply divide the number of times he is right when he says “blue” with the overall number of times he says “blue,” and this will immediately give you p(B/WB). The witness gives the answer “blue” 8 times in the upper section (when the taxi is indeed blue), and 18 times in the bottom section (when the taxi is actually green). Therefore, our probability is: 8/(8 + 18) = 0.31.

It may all seem puzzling. How can it be that the witness says the taxi is blue, his reliability as a witness is 0.8, and yet the probability that the taxi is blue is only 0.31? Actually there is nothing wrong with the reasoning. It is the lower prior frequency of blue taxis that brings down the probability of the taxi being blue, and that is that. Bayes’ theorem is a mathematical truth. Its application in this kind of situation is beyond dispute. Any remaining doubt will be dispelled by inspecting Figure 6.1 and seeing that if you trust the witness when he says “blue” you will indeed be more often wrong than right. But notice that you have excellent reasons to trust the witness if he says “green” because in that case he will be right 97 percent of the time! It all follows from the difference in prior probabilities for “blue” and “green.” There is a consensus that neglecting prior probabilities (or base rates) is a logical fallacy.

But if neglecting prior probabilities is a fallacy in the taxicab example, then it cannot stop being a fallacy in other contexts. Oddly enough, many people’s judgment actually changes with context, particularly when it comes to inferences involving social groups. The same move of neglecting base rates that was previously condemned as the violation of elementary probability rules is now praised as reasonable, whereas applying the Bayes’ theorem (previously recommended) is now criticized as a sign of irrationality, prejudice and bigotry.

A good example is racial or ethnic profiling, the practice that is almost universally denounced as ill advised, silly, and serving no useful purpose. This is surprising because the inference underlying this practice has the same logical structure as the taxicab situation. Let me try to show this by representing it in the same format as Figure 6.1. But first I will present an example with some imagined data to prepare the ground for the probability question and for the discussion of group profiling.

Suppose that there is a suspicious characteristic E such that 2 percent of terrorists (T) have E but only 0.002 percent of non-terrorists (−T) have E. This already gives us two probabilities: p(E/T) = 0.02; p(E/−T) = 0.00002. How useful is E for recognizing terrorists? How likely is it that someone is T if he has E? What is p(T/E)? Bayes’ theorem tells us that the answer depends on the percentage of terrorists in a population. (Clearly, if everybody is a terrorist, then p(T/E) = 1; if no one is a terrorist, then p(T/E) = 0; if some people are T and some −T, then 1 > p(T/E) > 0.) To activate the group question, suppose that there are two groups, A and B, that have different percentages of terrorists (1 in 100, and 1 in 10,000, respectively). This translates into different probabilities of an arbitrary member of a group being a terrorist. In group A, p(T) = 0.01 but in group B, p(T) = 0.0001. Now for the central question: what will p(T/E) be in A and in B? Figures 6.2a and 6.2b provide the answer.



In group A, the probability of a person with characteristic E being a terrorist is 0.91. In group B, this probability is 0.09 (more than ten times lower). The group membership matters, and it matters a lot.

Test your intuitions with a thought experiment: in an airport, you see a person belonging to group A and another person from group B. Both have suspicious trait E but they go in opposite directions. Whom will you follow and perhaps report to the police? Will you (a) go by probabilities and focus on A (committing the sin of racial or ethnic profiling), or (b) follow political correctness and flip a coin (and feel good about it)? It would be wrong to protest here and refuse to focus on A by pointing out that most As are not terrorists. This is true but irrelevant. Most As that have E are terrorists (91 percent of them, to be precise), and this is what counts. Compare that with the other group, where out of all Bs that have E, less than 10 percent are terrorists.

To recapitulate, since the two situations (the taxicab example and the social groups example) are similar in all relevant aspects, consistency requires the same answer. But the resolute answer is already given in the first situation. All competent people speak with one voice here, and agree that in this kind of situation the witness’s statement is only part of the relevant evidence. The proportion of blue cars must also be taken into account to get the correct probability that the taxi involved in the accident was blue. Therefore, there is no choice but to draw the corresponding conclusion in the second case. E is only part of the relevant evidence. The proportion of terrorists in group A (or B) must also be taken into account to get the correct probability that an individual from group A (or B) is a terrorist.

The “must” here is a conditional “must,” not a categorical imperative. That is, you must take into account prior probabilities if you want to know the true posterior probability. But sometimes there may be other considerations, besides the aim to know the true probability. For instance, it may be thought unfair or morally unacceptable to treat members of group A differently from members of group B. After all, As belong to their ethnic group without any decision on their part, and it could be argued that it is unjust to treat every A as more suspect just because a very small proportion of terrorists among As happens to be higher than an even lower proportion of terrorists among Bs. Why should some people be inconvenienced and treated worse than others only because they share a group characteristic, which they did not choose, which they cannot change, and which is in itself morally irrelevant?

I recognize the force of this question. It pulls in the opposite direction from Bayes’ theorem, urging us not to take into account prior probabilities. The question which of the two reasons (the Bayesian or the moral one) should prevail is very complex, and there is no doubt that the answer varies widely, depending on the specific circumstances and also on the answerer. I will not enter that debate at all because it would take us too far away from our subject.

The point to remember is that when many people say that “an individual can’t be judged by his group mean” (Gould 1977: 247), that “as individuals we are all unique and population statistics do not apply” (Venter 2000), that “a person should not be judged as a member of a group but as an individual” (Herrnstein & Murray 1994: 550), these statements sound nice and are likely to be well received but they conflict with the hard fact that a group membership sometimes does matter. If scholars wear their scientific hats when denying or disregarding this fact, I am afraid that rather than convincing the public they will more probably damage the credibility of science.

It is of course an empirical question how often and how much the group information is relevant for judgments about individuals in particular situations, but before we address this complicated issue in specific cases, we should first get rid of the wrong but popular idea that taking group membership into consideration (when thinking about individuals) is in itself irrational or morally condemnable, or both. On the contrary, in certain decisions about individuals, people “would have to be either saints or idiots not to be influenced by the collective statistics” (Genovese 1995: 333).
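Sesardic's two calculations are easy to verify. A minimal Python check of both the taxicab and the profiling numbers, using his stated probabilities:

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """Bayes' theorem for a binary hypothesis H and evidence E."""
    num = prior * p_e_given_h
    return num / (num + (1 - prior) * p_e_given_not_h)

# taxicab: p(B) = .1, p(WB/B) = .8, p(WB/G) = .2
print(round(posterior(0.1, 0.8, 0.2), 2))          # -> 0.31

# profiling: p(E/T) = .02, p(E/-T) = .00002
print(round(posterior(0.01, 0.02, 0.00002), 2))    # group A -> 0.91
print(round(posterior(0.0001, 0.02, 0.00002), 2))  # group B -> 0.09
```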


Lee Jussim (politically incorrect social psychologist; blog) in his interesting book Social perception and social reality (2012; download), notes the same fact. In fact, he spends an entire chapter on the question of how people integrate stereotypes with individualized information and whether this increases accuracy. He begins:

Stereotypes and Person Perception: How Should People Judge Individuals?
“Should” might mean many things. It might mean, “What would be the most moral thing to do?” Or, “What would be the legal thing to do, or the most socially acceptable thing to do, or the least offensive thing to do?” I do not use it here, however, to mean any of these things. Instead, I use the term “should” here to mean “what would lead people to be most accurate?” It is possible that being as accurate as possible would be considered by some people to be immoral or even illegal (see Chapters 10 and 15). Indeed, a wonderful turn of phrase, “forbidden base-rates,” was coined (Tetlock, 2002) to capture the very idea that, sometimes, many people would be outraged by the use of general information about groups to reach judgments that would be as accurate as possible (a “base-rate” is the overall prevalence of some characteristic in a group, usually presented as a percentage; e.g., “0.7% of Americans are in prison” is a base-rate reflecting Americans’ likelihood of being in prison). The focus in this chapter is exclusively on accuracy and not on morality or legality.

Philip Tetlock (famous for his forecasting tournaments) in the quoted article above, writes:

The SVPM [The sacred value-protection model] maintains that categorical proscriptions on cognition can also be triggered by blocking the implementation of relational schemata in sensitive domains. For example, forbidden base rates can be defined as any statistical generalization that devout Bayesians would not hesitate to insert into their likelihood computations but that deeply offends a moral community. In late 20th-century America, egalitarian movements struggled to purge racial discrimination and its residual effects from society (Sniderman & Tetlock, 1986). This goal was justified in communal-sharing terms (“we all belong to the same national family”) and in equality-matching terms (“let’s rectify an inequitable relationship”). Either way, individual or corporate actors who use statistical generalizations (about crime, academic achievement, etc.) to justify disadvantaging already disadvantaged populations are less likely to be lauded as savvy intuitive statisticians than they are to be condemned for their moral insensitivity.

So this is not some exotic idea, it is recognized by several experts.

I don’t have any particular opinion regarding the morality of using involuntary group memberships in one’s assessments, but in terms of epistemic rationality (making correct judgments), the case is clear: one must take into account group memberships when making judgments about individuals.

In reviewing our upcoming target article in Mankind Quarterly (submission version here), Gerhard Meisenberg wrote:

“One possibility that you don’t seem to discuss is that there are true and large correlations of the Euro% variable and the geographic variables, but that the geographic variables are measured much more precisely than the Euro% variable. In that case, regression models will produce independent effects of the more precisely measured geographic variables even if they have no causal effects at all, because they capture some of the variance that is not captured by the Euro% variable due to its inaccurate measurement”.

I had noticed this problem before and tried to build a simulation to show it. Unfortunately, I ran into problems due to not knowing about mathematical statistics (see here).

Suppose the situation is this:

[Figure: path diagram of the causal network with differential measurement error]

So, we have a causal network where X1 causes both X2 and Y. We also model the measurement error aspect of each variable.

Now, suppose that we don’t know whether X1 or X2 causes Y, and that we plug them into a multiple regression model together with Y as the outcome. If we measured all variables without error, the model would find that X1 predicts Y and that X2 doesn’t.

However, suppose that we can measure X2 and Y without error, but X1 only with some error. What would we find? We would find that X2 seems to predict Y, despite our knowing that it has no causal effect on Y at all.

For instance, we can simulate this situation in R using the following code:

# differential measurement error ------------------------------------------
library(pacman)
p_load(magrittr, lavaan, semPlot, kirkegaard)

n = 1000
X1 = rnorm(n)
X1_orb = (X1 + rnorm(n) * .6) %>% scale() #observed X1: lots of measurement error
X2 = (X1 + rnorm(n)) %>% scale()
X2_orb = X2 #observed X2: no measurement error
Y = (X1 + rnorm(n)) %>% scale()
Y_orb = Y #observed Y: no measurement error
d = data.frame(X1, X1_orb, X2, X2_orb, Y, Y_orb)

#true scores
lm("Y ~ X1 + X2", data = d) %>% lm_CI()

#observed scores
lm("Y_orb ~ X1_orb + X2_orb", data = d) %>% lm_CI()

What do the results look like?

   Beta   SE CI.lower CI.upper
X1 0.68 0.03     0.62     0.74
X2 0.01 0.03    -0.05     0.07

       R2   R2 adj. 
0.5100199 0.5090370
       Beta   SE CI.lower CI.upper
X1_orb 0.48 0.03     0.42     0.54
X2_orb 0.22 0.03     0.16     0.28

       R2   R2 adj. 
0.4112067 0.4100255

Thus, we see that in the first case, multiple regression found X2 to be a useless predictor (beta = .01); in the second case, it found X2 to be a useful predictor with a beta of .22. In both cases X1 was a useful predictor, but it was weaker in the second case (as expected due to the worse measurement).

Multiple regression can also be visualized using path models. So let’s try that for the same models as above. R:

#path models
model = "Y ~ X1 + X2"
fit = sem(model, data = d)
semPaths(fit, whatLabels = "std")

model = "Y_orb ~ X1_orb + X2_orb"
fit = sem(model, data = d)
semPaths(fit, whatLabels = "std", nCharNodes = 0)

The plots:

[Figures: path model plots for the two models]

Thus, we see that these show about the same numbers as the multiple regressions. I think the small differences are due to the different fitting methods (sem() uses maximum likelihood by default). So, we have a general problem for our modeling: when our measures contain measurement error, this can distort our findings, and not just by shrinking the estimates, as happens with zero-order correlations.

One could set up some causal networks and try all the combinations of measurement error in variables to see how it affects results in general. From the above, I would guess that it tends to make simple systems seem more complicated by spreading out the predictive ability across more variables that are correlated with the true causal variables (false positives).
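The mechanism can also be sketched outside R. A minimal Python simulation (all names hypothetical) that varies only the amount of error in X1's measure and watches X2's spurious beta grow:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000  # large n so sampling noise is negligible

X1 = rng.standard_normal(n)
X2 = X1 + rng.standard_normal(n)  # caused by X1, no effect on Y
Y = X1 + rng.standard_normal(n)   # caused by X1 only

def std_betas(y, preds):
    """Standardized OLS betas (variables z-scored, no intercept needed)."""
    Z = np.column_stack([(p - p.mean()) / p.std() for p in preds])
    y = (y - y.mean()) / y.std()
    return np.linalg.lstsq(Z, y, rcond=None)[0]

for err in [0.0, 0.3, 0.6, 1.0]:  # SD of error added to X1's measure
    X1_obs = X1 + rng.standard_normal(n) * err
    b1, b2 = std_betas(Y, [X1_obs, X2])
    print(f"error={err}: beta_X1={b1:.2f}, beta_X2={b2:.2f}")
```

With err = 0, X2's beta is ~0; with err = 0.6 it is ~0.21, close to the .22 found above; as error grows further, the two predictors become indistinguishable.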

The problem

Apache gives an error 500 (internal server error) when trying to access a website running Django 1.9 using WSGI, set up based on the tutorial at www.digitalocean.com/community/tutorials/how-to-serve-django-applications-with-apache-and-mod_wsgi-on-ubuntu-14-04. Running the site using the development server (manage.py runserver) works fine.

Looking in the apache error log reveals:

[stuff] mod_wsgi (pid=20304): Target WSGI script '/django/XY_django/XY/XY/wsgi.py' cannot be loaded as Python module. 
[stuff] mod_wsgi (pid=20304): Exception occurred processing WSGI script '/django/XY_django/XY/XY/wsgi.py'.

[python traceback]

[stuff] ImportError: No module named ‘XY’

[this error is repeated twice]

Various resources mention the same or similar problems:

  1. stackoverflow.com/questions/6454564/target-wsgi-script-cannot-be-loaded-as-python-module
  2. stackoverflow.com/questions/9462212/import-error-no-module-named-django
  3. lots of others

Things I tried

  1. Wrong ownership of django dir and files.
    • Was root, changed to www-data.
    • No change.
  2. Wrong version of wsgi.
    • Ran:
    • sudo apt-get remove libapache2-mod-python libapache2-mod-wsgi
      sudo apt-get install libapache2-mod-wsgi-py3
    • No, the correct version was already in use.
  3. Wrong code in the static block in the apache config.
    • No, the error persists after commenting it out.
  4. WSGIDaemonProcess python-path is wrong.
    • I noticed that the path given in the tutorial has two folders separated by :, not just one:
    • WSGIDaemonProcess django python-path=/home/user/myproject:/home/user/myproject/myprojectenv/lib/python2.7/site-packages (tutorial)
      WSGIDaemonProcess django python-path=/django/env/lib/python3.4/site-packages (mine)
    • Replaced mine with:
    • WSGIDaemonProcess django python-path=/django/XY_django/XY:/django/env/lib/python3.4/site-packages
    • Worked!

The problem:

  • To install a virtualenv (virtual environment for python) under a common (not user) folder, so that it is accessible to other users.

If we try without using sudo, we get a permission error:

[go into virtualenv called env]
pip3 install fibo
>[lots of stuff]
>PermissionError: [Errno 13] Permission denied: '/django/env/lib/python3.4/site-packages/Fibo.py'

If we try with sudo, it installs it in the wrong location. To see why, check which pip3 sudo uses:

sudo which pip3

I.e., sudo calls the system-wide pip3, which installs into the system-wide location. This isn’t accessible from within the virtualenv. What to do? Some links I found:

Of these, the first does not address the problem of running pip, which isn’t a python script. The second gives a command that doesn’t work (-E option does not work for me). The third addresses an issue that is almost right, but the solution appears not to be related.

After some time, it occurred to me to call the right pip3 with sudo.

sudo -H env/bin/pip install django
[install stuff]
sudo -H env/bin/pip list
>Django (1.9.1)
>pip (8.0.0)
>setuptools (19.4)
>wheel (0.26.0)


Better solution

Just change the permissions so that all users can edit the files. Then one does not need sudo at all.

sudo chmod -R o+rwx /directory


Apparently, only a few studies have examined this question and they are not easily available. Because we obtained these, it makes sense to share our results. The datafile is on Google Drive. We will fill in more results as we find them.

So far results reveal nothing surprising: MZ correlations are larger than DZ correlations. The h2 using Falconer’s formula is about 60%. Total sample ≈ 750 pairs.
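For reference, Falconer's formula is h2 = 2(rMZ − rDZ). A minimal sketch (the twin correlations below are illustrative stand-ins consistent with h2 ≈ .60, not the actual values from the datafile):

```python
def falconer_h2(r_mz, r_dz):
    """Falconer's formula: h2 = 2 * (r_MZ - r_DZ)."""
    return 2 * (r_mz - r_dz)

# illustrative twin correlations, not the dataset's actual numbers
print(round(falconer_h2(0.75, 0.45), 2))  # -> 0.6
```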

It is common to talk about traits being monogenic or polygenic. We say that Sickle-cell disease is monogenic because its heritable variation among humans can be accounted for by a single locus of genetic variation. Or more accurately, we say that 100% of the heritable variation can be accounted for by variation in that genetic locus (assuming the simple monogenic scenario, I did not look into the details). We say that height is strongly or highly polygenic because it seems like we need thousands of loci of genetic variation to account for the heritable variation among humans. The largest study that I know of, Wood et al 2014, identified 697 variants (with p < alpha) and these accounted for only about 20% of the heritable variation in human height. Furthermore, as explored in a prior post, the distribution of effect sizes of variants follows a power law-like distribution.

[Figure: distribution of SNP effect sizes. The x-axis has the beta of the SNP; the y-axis has the log10 of the frequency.]

Due to the way statistical power works, we first find the variants with the largest effects. Making the assumption that we find the SNPs in the exact order of their effect size (only roughly true, and empirically resolvable by those with access to data), we should be able to derive estimates of the number of SNPs needed to explain any given proportion of the heritable variation. We probably cannot estimate this with certainty if the proportion is set at 1 (100%), but we should be able to find decent estimates for e.g. 10%, 20% and 50%. Thus, we can introduce a specific measure of the degree of polygenicity: the number of variants needed to explain n% of the heritable variance. I propose the term n% genetic cardinality for this concept. This puts all traits on the same scale, and it should be calculable for any trait with a large GWAS.

An alternative measure would be to use the distribution parameters of the fitted power law, but this would be more complicated to understand and estimate. The advantage is that it would be less noisy.

Can we estimate some numbers for a few traits just to showcase the concept? Perhaps. There are some traits where we have found SNPs that explain most of the variation. If we set n = 50, then perhaps we can find the values in papers discussing these traits.

Liu et al 2010 studied eye color in the Dutch and found that a 17 SNP model predicted about 50% of the variance in a cross-validation sample. In fact, a single SNP accounted for ~46% of the variance by itself.


This means that for this trait, if we set n to anything ~45 or below, the number will be 1. Most traits are not like this. However, I was not able to find more easy examples like this one; most studies just report a bunch of p values, which are near useless.

Liu et al 2015 report on skin color variation in Europeans and find that the top 9 SNPs explain 3-16% (depending on measure and sample).

[Figure: skin color results from Liu et al 2015]

Bergholdt et al 2012 write that “The estimated proportion of heritability explained by currently identified loci is >80%”. However, I could not find support for this in the given reference (Wray et al 2010). I could not even find the number of SNPs they are talking about.

What is needed is for researchers to publish plots that show cumulative R2 (explained variance, preferably in an independent sample) by number of SNPs, ordered by decreasing effect size. This would allow for easy comparison of the genetic cardinality of traits.


  • If one looked at a broader (racially heterogeneous) sample, the genetic cardinality numbers would generally be smaller. It is easier to explain the variation between the white and black hair of Europeans and Africans than it is to explain the smaller differences between white and brown hair among Europeans. Thus, the population must be kept roughly constant for the numbers to be comparable.
  • I don’t have time to try to figure out how to estimate the genetic cardinality from a published table of beta values. This should be possible if one is willing to make an assumption about the degree to which the effects are independent. I.e., one would first sort the SNPs by beta, then calculate the R2 values, then calculate the cumulative sum of the R2s, and then fit the power law distribution to the number of SNPs used and the cumulative R2. This uses the assumption of no overlap in R2. The degree of overlap could be determined if one had access to case-level data, although the overlap would itself be a function of the number of SNPs. Complicated.
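The first steps of that procedure (everything before the power law fit) can be sketched directly. Assuming standardized genotypes and phenotype and independent SNPs, each beta contributes roughly beta² to R²; the betas below are invented for illustration:

```python
# Sort SNPs by effect size, convert each beta to an R^2 contribution
# (beta^2, assuming standardized variables and independent SNPs),
# then take the cumulative sum -- the no-overlap assumption from the text.
betas = [0.10, 0.30, 0.05, 0.20, 0.15]  # hypothetical published betas

r2_sorted = sorted((b * b for b in betas), reverse=True)

cum_r2 = []
total = 0.0
for contribution in r2_sorted:
    total += contribution
    cum_r2.append(total)

# cum_r2[k-1] is the variance explained by the top k SNPs;
# a power law could then be fitted to (k, cum_r2[k-1]) pairs.
print(cum_r2)
```

With case-level data one would instead regress on the top k SNPs jointly, which would capture the overlap that this sketch assumes away.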

My situation is this:

  • I run Linux Mint from Windows 8.1 using VirtualBox.
  • I have disabled numlock (and the other *lock keys) in Windows by rebinding my keyboard using the Microsoft Keyboard Layout Creator.
  • Windows boots up with numlock on so my numpad works.
  • In Virtualbox, Mint boots up without numlock on.
  • I cannot turn it on in Mint because the key is unbound.
  • Don’t want to re-enable the bind in Windows.
  • Linux Mint apparently lacks an on-screen keyboard solution like the one Windows has.

The solution I found was to install numlockx, an application that can turn on numlock from the terminal:

sudo apt-get install numlockx
/usr/bin/numlockx on

Problem solved. To enable it at startup, one can use:

sudo sed -i 's|^exit 0.*$|# Numlock enable\n[ -x /usr/bin/numlockx ] \&\& numlockx on\n\nexit 0|' /etc/rc.local

I haven’t tested this part yet because I have yet to restart.
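Short of restarting, the sed command can be sanity-checked by running it against a throwaway copy of rc.local. This sketch assumes the file ends with a plain `exit 0` line, as the default one does:

```shell
# Simulate the rc.local edit on a temporary file instead of the real one.
tmp=$(mktemp)
printf '#!/bin/sh\nexit 0\n' > "$tmp"

# Same substitution as above, pointed at the copy.
sed -i 's|^exit 0.*$|# Numlock enable\n[ -x /usr/bin/numlockx ] \&\& numlockx on\n\nexit 0|' "$tmp"

# Count how many numlockx lines were inserted (should be exactly 1).
patched=$(grep -c 'numlockx on' "$tmp")
echo "numlockx lines added: $patched"
rm "$tmp"
```

If the count is 1 and `exit 0` is still the last line of the copy, the real edit should behave the same way on reboot.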

This may have some interest. Basically, typologists cannot into statistics, and it shows. On the other hand, it means there is a lot of low-hanging fruit for someone with skills in statistical programming.

PDF on OSF: osf.io/p2f9u/

This was handed in as a paper for typology class. Quite likely the last class I will take in linguistics. I don’t plan on actually getting the master’s degree.

Consider the model below:

[Figure: General model for immigrant group traits and outcomes]

Something much like this has been my intuitive working model for thinking about immigrant groups’ traits and socioeconomic outcomes. I will explain the model in this post and refer back to it or use the material in some upcoming paper (nothing planned).

The model shows the home country/country of origin and two destination countries. The model is not limited to just two destination countries, but I did not draw more to avoid making the figure larger. Using more can be worthwhile in some cases, as explained below.

Familial (or intergenerational) traits are those traits that run in families. This term includes both genetic and shared environmental effects. Because most children grow up with their parents (I assume), it does not matter whether the parents’ traits → children’s traits route is genetic or environmental. This means that both psychological traits (mostly genetic) and cultural traits (mostly shared environmental), such as specific religion, are included.

When persons leave (emigrate) their home country, there is some selection: people who decide to leave are not random. Sometimes, it is not easy to leave because the government actively tries to restrict its citizens from leaving. This is shown in the model as the Emigration selection→Emigrant group familial traits link. Emigration selection seems to be mostly positive in the real world: the better off and smarter emigrate more than the poorer and less bright.

When the immigrants then move to other countries, there is Immigration selection, because the destination countries usually don’t just let in whoever wants to move there. Immigration selection can have both positive and negative effects. Countries that receive refugees but try not to receive others have negative selection, while those that try to pick only the best potential immigrants have positive selection. Often countries have elements of both. Immigration selection and Emigrant group familial traits jointly lead to Immigrant group familial traits in a particular destination country.

Note that immigration selection is unique to each destination country but can be similar across some countries. This would show up as correlated immigration selection scores. There is also immigration selection that doesn’t happen in the destination country, namely selection due to geographical distance. For this reason I placed the Immigration selection node half inside the destination country boxes. With a more complex model, one could split these if desired.

Worse, it is possible that immigration selection in a given country depends on the origin country, i.e. a country-country interaction selection. This wasn’t included in the above model. Examples of this are easy to find. For instance, within the EU (well, it’s complicated), there is relatively free movement of EU citizens, but not so for persons coming in from outside the EU.

Socioeconomic outcomes: Human capital model + luck

The S factor score of the home country (the general factor of socioeconomic outcomes, which one can think of as roughly equal to the Human Development Index, just broader) is modeled as being the outcome of the Population familial traits and Environmental and historical luck. I think it is mostly the former. Perhaps the most obvious example of environmental luck is having valuable natural resources within your borders, today especially oil. But note that even this is somewhat complicated, because borders can change by use of ‘bigger army diplomacy’ or by simply purchasing more land, so one could strategically buy or otherwise acquire land that has valuable resources on it, making it not a strict environmental effect.

Other things could be access to water, sunlight, wind, earthquakes, mountains, large bodies of inland water and rivers, an active underground, arable land, living close to peaceful (or not so peaceful) neighbors, and so on. These things can promote or retard economic development. Having suitable rivers means that one can get cheap and (well, mostly) safe energy from them. Countries without such resources have to look elsewhere, which may cost more. These factors are not always strictly environmental, but some amount of their variance is more or less randomly distributed among countries. Some are more lucky than others.

There are some who argue that countries that were colonized are better off now because of it, so that would count as historical luck. However, being colonized is not just an environmental effect, because it means that foreign powers were able to defeat your forces overwhelmingly for decades. If they were able to, you probably had a poor military, which is linked to general technological development. There is some environmental component to whether you have a history of communism, but communism seems to still have negative effects on economic growth decades later.

For immigrant groups inside a host country, however, environmental effects with country-wide scope cannot account for differences between groups. These differences are thus due to familial effects only (to a good approximation). To be sure, the other people living in the destination/host country, Other group familial traits, probably have some effect on the Immigrant group familial traits as well, such as religion and language. These familial traits and the Other group S then jointly cause the Immigrant group S. Open Borders advocates often talk about one aspect of this effect:

Wage differences are a revealing metric of border discrimination. When a worker from a poorer country moves to a richer one, her wages might double, triple, or rise even tenfold. These extreme wage differences reflect restrictions as stifling as the laws that separated white and black South Africans at the height of Apartheid. Geographical differences in wages also signal opportunity—for financially empowering the migrants, of course, but also for increasing total world output. On the other side of discrimination lies untapped potential. Economists have estimated that a world of open borders would double world GDP.

Paths estimated in studies

A path model is always complete, which means that all causal routes are explicitly specified. All the remaining links are non-causal, but nodes can still be substantially correlated. For instance, there is no link between the home country’s Country S and Immigrant group S, but these are strongly correlated in practice. I previously reported correlations between home Country S and Immigrant group S of .54 and .72 for Denmark and Norway.

There is no link between home country Population familial traits and Immigrant group familial traits, but there is only one node in between (Emigrant group familial traits), so it seems reasonable to try to correlate these two nodes. A few studies have looked at this type of correlation. For instance, John Fuerst has looked at GRE/GMAT scores and the like for immigrant groups in the US. These are taken as a proxy for cognitive ability, probably the most important component of the psychological traits part of familial traits. In that paper, Fuerst found correlations of .78 and .81 between these scores and country cognitive ability using Lynn and Vanhanen’s dataset.

Rindermann and Thompson have reported correlations between immigrant group cognitive ability (a component of Immigrant group familial traits) and native population cognitive ability (a component of Other group familial traits).

Most of my studies have looked at the nodes Population familial traits (sub-components Islam belief and cognitive ability) and Immigrant group S (or sub-components like crime if S was not available). Often this results in large correlations: .54 and .59 for Denmark and Norway (depending on how one deals with missing data, the use of weighted correlations, etc.). Note that in the model the first does cause the second, but there are a few intermediate steps and other variables, especially Emigrant selection (which differs by country of origin and thus reduces the correlation) and Immigrant selection (which has no effect on the correlation).

There is much to be done. If one could obtain estimates of multiple nodes in a causal chain, one could use mediation analysis to see whether mediation is plausible. E.g., right now we have Immigrant group S for two countries and cognitive ability for 100s of countries of origin, so if we could obtain immigrant group cognitive ability, one could test the mediating role of the latter. With the current data, one can also check whether country of origin cognitive ability mediates the relationship between immigrant group S and country of origin S, which it should partly, according to the model. I say partly because the mediation holds only to the extent that familial cognitive ability is a cause.
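The basic mediation check can be sketched with a partial correlation: if country of origin cognitive ability fully mediates the origin S → immigrant S relationship, the correlation between the two S scores should shrink toward zero once cognitive ability is partialed out. All data below are simulated, not real country scores:

```python
import math
import random

def pearson(x, y):
    """Plain Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def partial_corr(x, y, z):
    """Correlation of x and y controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Simulate full mediation: both S scores are driven by cognitive ability plus noise.
random.seed(1)
ca = [random.gauss(0, 1) for _ in range(500)]        # origin-country cognitive ability
s_origin = [c + random.gauss(0, 0.5) for c in ca]    # origin-country S
s_immig = [c + random.gauss(0, 0.5) for c in ca]     # immigrant-group S

print(round(pearson(s_origin, s_immig), 2))          # sizable raw correlation
print(round(partial_corr(s_origin, s_immig, ca), 2)) # near zero under full mediation
```

With real data one would expect only partial shrinkage, since familial cognitive ability is just one of the causal components in the model.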