There are few good things to say about this book. The best thing is on the very first page:

Some things never change: the sun rises in the morning. The sun sets in the evening. A lot of linguists don’t like statistics. I have spent a decade teaching English language and linguistics at various universities, and, invariably, there seems to be a widespread aversion to all things numerical. The most common complaint is that statistics is ‘too difficult’. The fact is, it isn’t. Or at least, it doesn’t need to be.

I laughed out loud on the train when reading this.

Not only do linguists not like numbers, they are not very good with them either. This is exemplified by this textbook, which contains so many errors that I got tired of writing notes in my PDF file.

By definition, the two approaches of analysis also differ in another respect: qualitative research is inductive, that is, theory is derived from the research results. Qualitative analysis is often used in preliminary studies in order to evaluate the research area. For example, we may want to conduct focus groups and interviews and analyse them qualitatively in order to build a picture of what is going on. We can then use these findings to explore these issues on a larger scale, for example by using a survey, and conduct a quantitative analysis. However, it has to be emphasized that qualitative research is not merely an auxiliary tool to data analysis, but a valuable approach in itself!

Quantitative research, on the other hand, is deductive: based on already known theory we develop hypotheses, which we then try to prove (or disprove) in the course of our empirical investigation. Accordingly, the decision between qualitative and quantitative methodology will have a dramatic impact on how we go about our research. Figures 2.1 and 2.2 outline the deductive and inductive processes graphically.

At the beginning of the quantitative-deductive process stands the hypothesis (or theory). As outlined below, a hypothesis is a statement about a particular aspect of reality, such as ‘the lower the socio-economic class, the more non-standard features a speaker’s language shows’. The hypothesis is based on findings of previous research, and the aim of our study is to prove or disprove it. Based on a precisely formulated hypothesis, or research question, we develop a methodology, that is, a set of instruments which will allow us to measure reality in such a way that the results allow us to prove the hypothesis right or wrong. This also includes the development of adequate analytic tools, which we will use to analyse our data once we have collected it. Throughout this book, I will keep reminding readers that it is paramount that hypothesis/research question and methodological analytical framework must be developed together and form a cohesive and coherent unit. In blunt terms, our data collection methods must enable us to collect data which actually fits our research question, as do our data analysis tools. This will become clearer in the course of this and subsequent chapters. The development of the methodological-analytical framework can be a time-consuming process; especially in complex studies we need to spend considerable time and effort developing a set of reliable and valid (see below) tools.

The author has perhaps heard some summary of Popper’s ideas about the hypothetical deductive method. However, Popper aside, science is very much quantitative and inductive. The entire point of meta-analysis is to obtain good summary effects and to check whether results generalize across contexts (i.e. validity generalization à la Hunter and Schmidt).

But that is not all. He then even gets the standard Popperian view wrong. When we confirm a prediction (or pretend to, when there is no pre-registration), we do not thereby prove the theory; that would be an instance of affirming the consequent.

As for qualitative methods being deductive: that makes little sense. I have taken such classes; they usually involve interviews, which are certainly empirical and inductive.

So, in what situations can (and should) we not use a quantitative but a qualitative method? Remember from above that quantitative research is deductive: we base our question and hypotheses on already existing theories, while qualitative research is deductive and is aimed at developing theories. Qualitative research might come with the advantage of not requiring any maths, but the advantage is only a superficial one: not only do we not have a profound theoretical basis, but we also very often do not know what we get – for inexperienced researchers this can turn into a nightmare.

This passage confused me because he used ‘deductive’ twice by mistake.

[some generally sensible things about correlation and cause]. In order to be able to speak of a proper causal relationship, our variables must show three characteristics:

  • a They must correlate with each other, that is, their values must co-occur in a particular pattern: for example, the older a speaker, the more dialect features you find in their speech (see Chapter Seven).
  • b There must be a temporal relationship between the two variables X and Y, that is, Y must occur after X. In our word-recognition example in Section 2.4, this would mean that for speed to have a causal effect on performance, speed must be increased first, and drop in participants’ performance occurs afterwards. If performance decreases before we increase the speed, it is highly unlikely that there is causality between the two variables. The two phenomena just co-occur by sheer coincidence.
  • c The relationship between X and Y must not disappear when controlled for a third variable.

It has been said that causation implies correlation, but I think this is not strictly true. Another variable could have a statistical suppressor effect, making the correlation between X and Y disappear.

Furthermore, correlation is a measure of a linear relationship, but causation could be non-linear and thus show a non-linear statistical pattern.

Finally, (c) is just wrong. For instance, if X causes A, …, Z and Y, and we control for A, …, Z, the correlation between X and Y will greatly diminish or disappear because we are indirectly controlling for X. A brief simulation:

library(psych)  # for partial.r

N = 1000
X = rnorm(N)  #define X first so the other columns can be built from it
d_test = data.frame(X = X,
                    Y = rnorm(N) + X,   #X causes Y
                    A1 = rnorm(N) + X,  #and also A1-A3
                    A2 = rnorm(N) + X,
                    A3 = rnorm(N) + X)

cor(d_test$X, d_test$Y)      #the zero-order correlation of X and Y
partial.r(d_test, 1:2, 3:5)  #the X-Y correlation with A1-A3 partialed out

The rXY is .71, but the partial correlation of XY is only .44. If we added enough control variables we could eventually make rXY drop to ~0.

While most people usually consider the first two points when analysing causality between two variables, less experienced researchers (such as undergraduate students) frequently make the mistake to ignore the effect third (and often latent) variables may have on the relationship between X and Y, and take any given outcome for granted. That this can result in serious problems for linguistic research becomes obvious when considering the very nature of language and its users. Any first year linguistics student learns that there are about a dozen sociolinguistic factors alone which influence they way we use language, among them age, gender, social, educational and regional background and so on. And that before we even start thinking about cognitive and psychological aspects. If we investigate the relationship between two variables, how can we be certain that there is not a third (or fourth of fifth) variable influencing whatever we are measuring? In the worst case, latent variables will affect our measure in such a way that it threatens its validity – see below.

This is a minor complaint about language, but I don’t understand why the author calls hidden/missing/omitted variables ‘latent variables’, which has an altogether different meaning in statistics!

One step up from ordinal data are variables on interval scales. Again, they allow us to label cases and to put them into a meaningful sequence. However, the differences between individual values are fixed. A typical example are grading systems that evaluate work from A to D, with A being the best and D being the worst mark. The differences between A and B are the same as the difference between B and C and between C and D. In the British university grading system, B, C and D grades (Upper Second, Lower Second and Third Class) all cover a stretch of 10 percentage points and would therefore be interval; however, with First Class grades stretching from 70 to 100 per cent and Fail from 30 downwards, this order is somewhat sabotaged. In their purest form, and in order to avoid problems in statistical analysis, all categories in an interval scale must have the same distance from each other.

This is somewhat unclear. What interval variable really means is that the intervals are equal in meaning: a difference of 10 points is the same whether it is between IQ 100 and 110 or between IQ 70 and 80. (If you are not convinced that IQ scores are at least approximately interval-scale, read Jensen 1980, chapter 4.)

Laws are hypotheses or a set of hypotheses which have been proven correct repeatedly. We may think of the hypothesis: ‘Simple declarative sentences in English have subject-verb-object word order’. If we analyse a sufficient number of simple English declarative sentences, we will find that this hypothesis proves correct repeatedly, hence making it a law. Remember, however, based on the principle of falsifiability of hypotheses, if we are looking at a very large amount of data, we should find at least a few instances where the hypothesis is wrong – the exception to the rule. And here it becomes slightly tricky: since our hypothesis about declarative sentences is a definition, we cannot really prove it wrong. The definition says that declarative sentences must have subject-verb-object (SVO) word order. This implies that a sentence that does not conform to SVO is not a declarative; we cannot prove the hypothesis wrong, as if it is wrong we do not have a declarative sentence. In this sense it is almost tautological. Hence, we have to be very careful when including prescriptive definitions into our hypotheses.

This paragraph mixes up multiple things. Laws can be either statistical or absolute. Perhaps the most commonly referred-to laws are Newton’s laws of motion, which turned out to be wrong and thus are not laws at all. They were, however, thought to be absolute, thus not admitting any counterexamples, and any solid counterexample disproves a universal generalization.

But then he starts discussing falsifiability and how his example hypothesis is apparently not falsifiable. The initial wording does not make it seem like a definition, however.

A few general guidelines apply universally with regard to ethics: first, participants in research must consent to taking part in the study, and should also have the opportunity to withdraw their consent at any time. Surreptitious data collection is, in essence, ‘stealing’ data from someone and is grossly unethical – both in the moral and, increasingly, in the legal sense.

The author is apparently unaware of how any large internet site works: they all gather data from users without explicit consent. This guideline would also basically ban data scraping, which is increasingly used in quantitative linguistics.

The probably most popular statistical tool is the arithmetic mean; most of us will probably know it as the ‘average’ – a term that is, for various reasons we will ignore for the time being, not quite accurate. The mean tells us the ‘average value’ of our data. It is a handy tool to tell us where about our data is approximately located. Plus, it is very easy to calculate – I have heard of people who do this in their head! In order to calculate a reliable and meaningful mean, our data must be on a ratio scale. Think about it: how could we possible calculate the ‘average’ of a nominal variable measuring, for example, respondents’ sex? One is either male or female, so any results along the lines of ‘the average sex of the group is 1.7’ makes no sense whatsoever.

Interval scale is sufficient for calculating means and the like; a ratio scale is not required.

The illustration of the quartiles is wrong: Q2 (the median) should be in the middle, and Q4 (the 4×25 = 100th centile) should be at the end.

A second approach to probabilities is via the relative frequency of an event. We can approximate the probability of a simple event by looking at its relative frequency, that is, how often a particular outcome occurs when the event is repeated multiple times. For example, when we toss a coin multiple times and report the result, the relative frequency of either outcome to occur will gradually approach 0.5 – given that it is an ‘ideal’ coin without any imbalance. Note that it is important to have a very large number of repetitions: if we toss a coin only 3 times, we will inevitably get 2 heads and 1 tail (or vice versa) but we obviously must not conclude that the probability for heads is P(heads) = 2/3 (0.66). However, if we toss our coin say 1,000 times, we are likely to get something like 520 heads and 480 tails. P(heads) is then 0.52 – much closer to the actual probability of 0.5.

Apparently, the author forgot that one can also get three of a kind.

Lastly, there may be cases where we are interested in a union event that includes a conditional event. For example, in our Active sample we know that 45% are non-native speakers, that is, P(NNS)=0.4.

Numerical inconsistency: 45% would be 0.45, not 0.4.

Excel (or any other software you might use) has hopefully given you a result of p = 0.01492. But how to interpret it? As a guideline, the closer the p-value (of significance level – we will discuss this later) is to zero, the less likely our variables are to be independent. Our p-value is smaller than 0.05, which is a very strong indicator that there is a relationship between type of words produced and comprehension/production. With our result from the chi-test, we can now confidently say that in this particular study, there is a rather strong link between the type of word and the level of production and comprehension. And while this conclusion is not surprising (well, at least not for those familiar with language acquisition), it is a statistically sound one. Which is far better than anecdotal evidence or guessing.

A p value of .01 is not a “very strong indicator”. Furthermore, it is not an effect size measure (although it probably correlates with effect size). Concluding a large effect size (“strong link”) from a small p value is a common fallacy.
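To illustrate (a simulated example of my own, not from the book): with a large enough sample, even a trivially small association produces a tiny p value.

set.seed(1)
n = 1e6
#two binary variables with a deliberately tiny association
x = rbinom(n, 1, 0.5)
y = rbinom(n, 1, 0.5 + 0.01 * x)  #only a 1 percentage point difference
chisq.test(table(x, y), correct = FALSE)  #p is minuscule...
sqrt(chisq.test(table(x, y), correct = FALSE)$statistic / n)  #...but phi (the effect size) is only about .01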

With the result of our earlier calculations (6.42) and our df (1) we now go to the table of critical values. Since df=1, we look in the first row (the one below the header row). We can see that our 6.42 falls between 3.841 and 6.635. Looking in the header row, we can see that it hence falls between a significance level of 0.05 and 0.01, so our result is significant on a level of 0.05. If you remember, Excel has given us a p value of 0.014 – not quite 0.01, but very close. As before, with p<0.05, we can confidently reject the idea that the variables are independent, but suggest that they are related in some way.

Overconfidence in p-values.

In an ‘ideal’ world, our variables are related in such a way that we can connect all individual points to a straight line. In this case, we speak of a perfect correlation between the two variables. As we can see in Figures 7.1 and 7.2, the world isn’t ideal and the line is not quite straight, so the correlation between age of onset and proficiency is not perfect – although it is rather strong! Before we think about calculating the correlation, we shall consider five basic principles about the correlation of variables:

1 the Pearson correlation coefficient r can have any value between (and including) -1 and 1, that is, -1 ≤ r ≤ 1.

2 r = 1 indicates a perfect positive correlation, that is, the two variables increase in a linear fashion, that is, all data points lie on a straight line – just as if they were plotted with the help of a ruler.

3 r = -1 indicates a perfect negative correlation, that is, while one variable increases, the other decreases, again, linearly.

4 r=0 indicates that the two variables do not correlate at all. That is, there is no relationship between them.

5 The Pearson correlation only works for normally distributed (or ‘parametric’) data (see Chapter Six) which is on a ratio scale (see Chapter Two) – we discuss solutions for non-parametric distributed data in Chapter Nine.


(4) is wrong. A correlation of 0 does not mean there is no relationship; the relationship could be non-linear (as (2) and (3) themselves hint, since r only captures linear relationships).
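A quick R illustration of this point (my own example):

set.seed(1)
x = rnorm(10000)
y = x^2        #y is completely determined by x, but the relationship is non-linear
cor(x, y)      #Pearson r is close to 0
plot(x, y)     #the scatter plot makes the dependence obvious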

We said above that the correlation coefficient does not tell us anything about causality. However, for reasons whose explanations are beyond the scope of this book, if we square the correlation coefficient, that is, multiply it with itself, the so-called R2 (R squared) tells us how much the independent variable accounts for the outcome of the dependent variable. In other words, causality can be approximated via R2 (Field 2000: 90). As reported, Johnson and Newport (1991) found that age of onset and L2 proficiency correlate significantly with r = -0.63. In this situation it is clear that if there was indeed causality, it would be age influencing L2 proficiency – and not vice versa. Since we are not provided with any further information or data about third variables, we cannot calculate the partial correlation and have to work with r=-0.63. If we square r, we get -0.63 x -0.63 = 0.3969. This can be interpreted in such a way that age of onset can account for about 40 per cent of the variability in L2 acquisition.

This is pretty good, but it also shows us that there must be something else influencing the acquisition process. So, we can only say that age has a comparatively strong influence on proficiency, but we cannot say that it causes it. The furthest we could go with R2 is to argue that if there is causality between age and proficiency, age only causes around 40%; but to be on the safe side, we should omit any reference to causality altogether.

If you think about it carefully, if age was the one and only variable influencing (or indeed causing) L2 acquisition, the correlation between the two variables should be perfect, that is, r=1 (or r=-1) – otherwise we would never get an R2 of 1! In fact, Birdsong and Molis (2001) replicated Johnson and Newport’s earlier (1989) study and concluded that exogenous factors (such as exposure and use of the target language) also influence the syntactic development. And with over 60% of variability unaccounted for, there is plenty of space for those other factors!

A very confused text. Does correlation tell us about causality or not? Partially? A correlation between X and Y increases the probability that there is causality between X and Y, but not by that much.

Throughout this chapter, I have repeatedly used the concept of significance, but I have yet to provide a reasonable explanation for it. Following the general objectives of this book of being accessible and not to scare off readers mathematically less well versed, I will limit my explanation to a ‘quick and dirty’ definition.

Statistical significance in this context has nothing to do with our results being of particular important – just because Johnson and Newport’s correlation coefficient comes with a significance value of 0.001 does not mean it is automatically groundbreaking research. Significance in statistics refers to the probability of our results being a fluke or not; it shows the likelihood that our result is reliable and has not just occurred through the bizarre constellation of individual numbers. And as such it is very important: when reporting measures such as the Pearson’s r, we must also give the significance value as otherwise our result is meaningless. Imagine an example where we have two variables A and B and three pairs of scores, as below:

[Table not reproduced in this copy: three pairs of scores for variables A and B.]

If we calculate the Pearson coefficient, we get r=0.87, this is a rather strong positive relationship. However, if you look at our data more closely, how can we make this statement be reliable? We have only three pairs of scores, and for variable B, two of the three scores are identical (viz. 3), with the third score is only marginally higher. True, the correlation coefficient is strong and positive, but can we really argue that the higher A is the higher is B? With three pairs of scores we have 1 degree of freedom, and if we look up the coefficient in the table of critical values for df=1, we can see that our coefficient is not significant, but quite simply a mathematical fluke.

Statistical significance is based on probability theory: how likely is something to happen or not. Statistical significance is denoted with p, with p fluctuating between zero and 1 inclusive, translating into zero to 100 per cent. The smaller p, the less likely our result is to be a fluke.

For example, for a Pearson correlation we may get r=0.6 and p=0.03. This indicates that the two variables correlate moderately strongly (see above), but it also tells us that the probability of this correlation coefficient occurring by mere chance is only 3%. To turn it round, we can be 97% confident that our r is a reliable measure for the association between the two variables. Accordingly, we might get a result of r=0.98 and p=0.48. Again, we have a very strong relationship; however, in this case the probability that the coefficient is a fluke is 48% – not a very reliable indicator! In this case, we cannot speak of a real relationship between the variables, but the coefficient merely appears as a result of a random constellation of numbers. We have seen a similar issue when discussing two entirely different data sets having the same mean. In linguistics, we are usually looking for significance levels of 0.05 or below (i.e. 95% confidence), although we may want to go as high as p=0.1 (90% confidence). Anything beyond this should make you seriously worried. In the following chapter, we will come across the concept of significance again when we try and test different hypotheses.

Some truth mixed with falsehoods. The p value is the probability of getting the result, or a more extreme one, given the null hypothesis (r=0 in this case). It is not a measure of scientific importance.

It is also not a measure of how likely we are to find the same value again if we redo the experiment. That probability depends heavily on statistical power/sample size.

No, a p value >.05 does not imply that it was a true negative. It could be a false negative, especially when the sample size is 3!
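A quick simulation of my own makes the point: even with a large true correlation, n = 3 gives almost no power, so p > .05 will mostly be false negatives in that situation.

set.seed(1)
#true correlation of .8, but only 3 pairs of scores per 'study'
p_values = replicate(10000, {
  x = rnorm(3)
  y = .8 * x + rnorm(3, sd = sqrt(1 - .8^2))
  cor.test(x, y)$p.value
})
mean(p_values < .05)  #the proportion of significant results (power) is very low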

The interpretation of the result is similar to that of a simple regression: in the first table, the Regression Statistics tell us that our four independent variables in the model can account for 95% of the variability of vitality, that is, the four variables together have a massive impact on vitality. This is not surprising, seeing that vitality is a compound index calculated from those very variables. In fact, the result for R2 should be 100%: with vitality being calculated from the four independent variables, there cannot be any other (latent) variable influencing it! This again reminds us that we are dealing with models – a simplified version of reality. The ANOVA part shows us that the model is highly significant, with p=0.00.

p values cannot be exactly 0; an output of p=0.00 just means the software rounded the value (one should report something like p < 0.001 instead).

The task of our statistical test, and indeed our entire research, is to prove or disprove H0. Paradoxically, in reality most of the time we are interested in disproving H0, that is, to show that in fact there is a difference in the use of non-standard forms between socio-economic classes. This is called the alternative hypothesis H1, or HA. Hence:

One cannot prove the null hypothesis using NHST. The best one can do with frequentist stats is to show that the confidence interval is narrow and lies tightly around 0.

The most important part for us is the ‘P(F<=f) one-tail’ row: it tells us the significance level. Here p=0.23. This tells us that the variances of our two groups are not statistically significantly different; in other words, even though the variance for CG is nominally higher, we can disregard this and assume that the variances are equal. This is important for our choice of t-test.

I seem to recall that this test-then-choose approach to t-tests is problematic and that it is better to just always use the robust (Welch) test. I don’t recall the source tho.

Is this really sufficient information to argue that the stimulus ‘knowledge activation’ results in better performance? If you spend a minute or so thinking about it, you will realise that we have only compared the two post-test scores. It does not actually tell us how the stimulus influenced the result. Sadighi and Zare have also compared the pre-test scores of both groups and found no significant difference. That is, both groups were at the same proficiency level before the stimulus was introduced. Yet, we still need a decent measure that tells us the actual change that occurs between EG and CG, or rather: within EG and CG. And this is where the dependent t-test comes into play.

Another instance of the ‘p > alpha, ergo no difference’ fallacy.

The really interesting table is the second one, especially the three rightmost columns labelled F, P-value and Fcrit. We see straightaway that the p-value of p=0.25 is larger than p = 0.05, our required significance level. Like the interpretation of the t-test, we use this to conclude that there is no significant difference in means between the treatment groups. The means may look different, but statistically, they are not. In yet other words, the different type of treatment our students receive does not result in different performance.

Same as above, but note that the author got it right in the very first sentence!

As with most things in research, there are various ways of conducting statistical meta-analyses. I will discuss here the method developed by Hunter and Schmidt (for a book-length discussion, see Hunter and Schmidt 2004: – not for the faint-hearted!). The Hunter-Schmidt method uses the Pearson correlation coefficient r as the effect size. If the studies in your pool do not use r (and quite a few don’t), you can convert other effect sizes into r – see, for example, Field and Wright (2006). Hunter and Schmidt base their method on the assumption of random effect sizes – it assumes that different populations have different effect sizes.

Given the number of errors, I was not expecting to see a reference to that book! Even if the author is mathematically incompetent, he gets the good stuff right. :P

Our true effect size of r= -0.68 lies between these boundaries, so we can consider the result to be significant at a 95% confidence level. So, overall, we can say that our meta-analysis of the AoO debate indicated that there is indeed a strong negative correlation between age of onset and attainment, and that this relationship is statistically significant.

That’s not how confidence intervals work: the point estimate always lies inside its own confidence interval. To claim significance, one would instead check whether the interval excludes zero.

Earlier posts: Something about certainty, proofs in math, induction/abduction, Is the summed cubes equal to the squared sum of counting integer series?

The things I’m going to say about math are equally true for logic, so whenever I write “math”, just mentally substitute “math and logic”.

Here I’m going to be talking about something different than the normal empiricist approach to philosophy of math. Briefly speaking, that approach states that mathematical truths are just like truths in empirical science because they really are knowable in kind of the same way and there is no special math way of knowing things. The knowability can be either thru inductive generalizations (enumerative induction) or thru a coherentist view of epistemic justification (which I adhere to and which is closely related to consilience).

When a sound proof isn’t enough

I asked a mathy friend of mine to find out whether all numbers of the series:

S: 9, 99, 999, 9999, …

are divisible by 3. Let’s call this proposition P. A moment’s reflection should reveal the answer to be yes, but it is not so easy to give a formal proof of why this is the case*.

My mathy friend came up with this:

[Image of the friend’s proof not reproduced here.]
What he is calling cross sum is a translation of the Danish tværsum, but in English that term seems to refer to something unrelated. However, I managed to find the English term: digit sum. The recursive version is the digital root.

I’m not sure it is quite formatted correctly (e.g. ai should just be a, I think), but the idea is something like this:

  1. Each of the numbers in S can be constructed with the summation equation given, a=9 and for k→∞. E.g. for k=2, the summation is 9*10^0 + 9*10^1 + 9*10^2 = 9 + 90 + 900 = 999.
  2. 10 modulo 9 is 1, which is just to say that dividing 10 by 9 gives a remainder of 1.
  3. Some equations with digital roots and modulus which aim to show that the digital root of each member of S is the same (?).
  4. Finally, because we know that having a digital root of 9 means a number is divisible by 3, then all members of S are divisible by 3.

I’m not sure why he is using modulus, but presumably there is some strong connection between modulus and digital roots that I’m not familiar with. A reader of this post can probably come up with a better proof. :)

However, let’s assume that we have a proof that we think is sound (valid+has true premises). How certain should we be in our belief that P is true (by which I just mean, what probability should we assign to P being true), given that we think we have a sound proof? The answer isn’t 100%. To see why, we need to consider that uncertainty in a conclusion can come from (at least) three sources: 1) uncertainty inherent in the inference, 2) uncertainty in the premises, and 3) uncertainty about the inference type.

The first source is the one most familiar to scientists: since we (usually) can’t measure the entire population of some class of interest (e.g. humans, some of which are dead!), we need to rely on a subset of the population, which we call a sample. To put it very briefly, statistical testing is concerned with the question of how to infer and how certain we can be about the properties of the population. This source is absent when dealing with deductive arguments about math.

The second source is our certainty in the premises. Assuming a roughly linear transfer of justification from premises (in conjoint form) to conclusion in deductive arguments means that the certainty in a conclusion we derive from a set of premises cannot be stronger than our certainty in our premises. In empirical domains, we are never 100% certain about our premises since these themselves are infested with these sources of uncertainty. However, for math, this brings us back to the question of how certain we are about the assumptions we use in proofs. Generally, these can be deduced from simpler principles until we in the end get down to the foundations of math. This again brings us to the question of which epistemology is right: foundationalism or coherentism (or exotic infinitism, but few take that seriously)? Can we really be 100% certain about the foundations of math? Weren’t some very smart people wrong in the past about this question (some of them must be, since they disagree)? What does this show about how certain we should be?

The third source is our human fallibility in even knowing which structure the argument has. Who has not made countless errors during their lifetime in getting mathematical proofs right? Thus, we have ample empirical evidence of our own failing to correctly identify which arguments are valid and which are not. The same applies to inductive arguments.

The place for experiments in math

If the above is convincing, it means that we cannot be certain about mathematical conclusions, even when we have formal proofs.

Often, we use a suitable counterexample to show that a mathematical claim is false. For instance, if someone claimed that naive set theory is consistent (i.e. has no inconsistencies), we would point the person to Russell’s paradox. However, coming up with this counterexample wasn’t easy, it remained undiscovered for many years. Counterexamples in math are, as far as I know, usually found thru expert intuition. While this sometimes works, it isn’t a good method. A better method is to systematically search for counterexamples.

How could we do that? Well, we could simply try a lot of numbers. This would be impractical for humans, but not necessarily for computers. Thus, when we have a mathematical claim of the type all Xs are Ys, we can try to disprove it by generating lots of cases and checking whether they have the property that is claimed (i.e. there should be no case with X&~Y). Sometimes we can even do this in an exhaustive or locally exhaustive (e.g. try all numbers between 1 and 1000) way. Other times the search space is too large for modern computers, so we would need to use sampling. This of course means that we introduce type 1 uncertainty discussed above.

Still, if continued search as described above fails to disprove a mathematical claim, my contention is that this increases our certainty about the claim, just as it does in regular empirical science. In my conversations with mathy people, they were surprisingly unwilling to accept this conclusion, which is why I wrote up this longer blogpost.

Experimental code for members of S

So, are all members of S divisible by 3?

# divisible by 3 ------------------------------------------------------------
library(stringr)  # for str_c
library(magrittr) # for %>%

depth = 20
results = numeric()
for (i in 1:depth) {
  #repeat 9 i times
  x_rep = rep(9, i)
  #place next to each other, then convert to numeric
  x = str_c(x_rep, collapse = "") %>% as.numeric
  #save the remainder after division by 3
  results[i] = x %% 3
}
#note: numbers with more than ~15 digits exceed exact double precision,
#so for much larger depths a big-integer package would be safer
results
#> [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

So, the first 20 members of S are divisible by 3. This increases the strength of my belief in P, even if the proof above (or something similar) turns out to work.

My intuitive proof

The reason I said that a moment’s reflection should be sufficient is that the following informal proof springs immediately to my mind:

  1. All members of S are multiples of the first member of S.
    Specifically, the vector of factors to use is F: 1, 11, 111, 1111, …
  2. If n is divisible by k, then n*i is also divisible by k, where n, k, and i are all non-zero integers.
  3. 9 is divisible by 3.
  4. Thus, all members of S are divisible by 3.

Maybe someone can find a way to prove (1) and (2).
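Premise (2) follows directly from the definition of divisibility: if n = m*k for an integer m, then n*i = (m*i)*k. Premise (1) can at least be checked empirically; here is a quick sketch of my own for the first few members (staying within exact double precision):

library(stringr)  # for str_c

members  = sapply(1:8, function(i) as.numeric(str_c(rep(9, i), collapse = "")))
repunits = sapply(1:8, function(i) as.numeric(str_c(rep(1, i), collapse = "")))
all(members == 9 * repunits)  #TRUE: each member of S is 9 times the corresponding member of F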

Other R code

Some other R code I wrote for this post, but didn’t end up discussing in the post.

# digital root ------------------------------------------------
library(stringr)  # for str_split
library(magrittr) # for %>%

digital_root = function(x) {
  #convert to chr
  x_str = as.character(x)
  #split into digits, unlist, convert back to num
  x_split = str_split(x_str, "") %>% unlist %>% as.numeric
  #get sum of digits
  x_sum = sum(x_split)
  #if the sum has more than 1 digit, call digital_root on it again
  #otherwise return the sum
  if (x_sum > 9) {
    digital_root(x_sum)
  } else {
    x_sum
  }
}

digital_root(9) == 9
digital_root(99) == 9
digital_root(999) == 9
digital_root(12) == 3
digital_root(123) == 6
digital_root(1234) == 1
# distributive? -----------------------------------------------------------
#check whether digital_root(i + j) equals digital_root(i) + digital_root(j)
depth = 100
results = matrix(nrow = depth, ncol = depth)
for (i in 1:depth) {
  for (j in 1:depth) {
    ij = i + j
    results[i, j] = digital_root(ij) == digital_root(i) + digital_root(j)
  }
}
mean(results)  #proportion of pairs for which the identity holds
Published to The Winnower 2015/10/20.

A properly formatted PDF of this paper can be downloaded here (not sized to A4 sheets).

Two new approaches to spatial autocorrelation (SAC) were examined for their ability to measure SAC: correlation of distance scores and k nearest spatial neighbor regression. Furthermore, the methods were compared to a more traditional measure of SAC, Moran’s I. All methods were found to give similar results as measured by their intercorrelations.

Three approaches, correlation of distance scores, k nearest spatial neighbor regression and spatial local regression, were used on simulated datasets to see if they could successfully distinguish between a true and a spurious cause. The correlation of distances approach was found to be lacking, while the other two methods clearly distinguished between the spurious and true causes and gave similar results.

Key words: spatial autocorrelation, Moran’s I, correlation of distances, k nearest spatial neighbor regression, KNSNR, spatial local regression, SLR.

1. Introduction

Much research analyzes data from countries, states, weather stations or other units that have a location. Spatial autocorrelation (SAC) is when there are spatial patterns in the dataset (Hassall & Sherratt, 2011; Radil, 2011). This can be positive (nearby cases are similar), neutral (neighboring cases have no particular relationship, i.e. absence of SAC), or negative (nearby cases are dissimilar). Figure 1 illustrates this.

Figure 1: Illustrations of spatial autocorrelation. From (Radil, 2011).

The presence of SAC in a dataset means that the cases are not independent which means that the degrees of freedom are overestimated. There are methods for correcting the degrees of freedom, but these are not the focus of this paper (Gelade, 2008; Hassall & Sherratt, 2011). Instead, the purpose of this article is to explore ways to measure and correct for SAC with respect to effect sizes.

2. Correlation of distances method

A conceptually simple method for examining SAC is the correlation of distance scores method (CD). CD was used by Davide Piffer (Piffer, 2015) but is so simple that it would be surprising if it had not been thought of before. The first step is to calculate distance scores for all cases for all variables of interest. Note that the kind of distance score depends on the type of variable. Piffer analyzed, among other things, phylogenetic distances (genetic distance between two organisms or classes of organisms) and used Fst values. However, if one’s data concern e.g. countries, one would use spherical geometric distances. If one is analyzing data from flatland (or any hyperspace with flat planes), one could use euclidean distances. For single metric variables, one could use the absolute difference. Alternatively, one could combine several non-spatial variables and employ various well-researched distance measures such as complete linkage (James, Witten, Hastie, & Tibshirani, 2013, p. 401).

The distance scores result in a new dataset which has N(N-1)/2 cases (choose 2 of N, order irrelevant), where N is the number of cases in the original dataset.
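To make the procedure concrete, here is a minimal sketch of the CD computation on made-up flatland data (the toy data-generating step and the variable names are my own, not the paper's):

set.seed(1)
n = 100
d = data.frame(x = runif(n, 0, 100),
               y = runif(n, 0, 100))
d$predictor = d$x / 50 + rnorm(n)   #toy predictor, loosely tied to location
d$outcome = d$predictor + rnorm(n)  #toy outcome

#pairwise distance scores: euclidean for the spatial coordinates,
#absolute differences for the single metric variables
spatial_dist = as.vector(dist(d[c("x", "y")]))
predictor_dist = as.vector(dist(d$predictor))
outcome_dist = as.vector(dist(d$outcome))

#each vector has N(N-1)/2 = 4950 entries; the CD correlations:
cor(spatial_dist, outcome_dist)
cor(predictor_dist, outcome_dist)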

The idea behind the method is that given two variables that are linearly related, pairs of cases that are far from each other on one variable should also be far from each other on another variable. To better understand this, it is worth looking at some example data. Figure 2 shows the spatial distribution of some datapoints in flatland. An outcome variable also exists and is shown by color coding.

Figure 2: Dataset 1. Flatland and outcome.

There are clear spatial patterns in the data: 1) we see 5 fairly distinct clusters, 2) neighbors tend to have similar outcomes, and 3) points with higher x and y values tend to have higher outcome values.

Using the distance scores, one can examine (2) with a standard Pearson correlation, as shown in Figure 3.

Figure 3: Scatter plot of distance scores for spatial and outcome.

As expected, we see a positive correlation, i.e. cases closer to each other do have more similar outcomes. We also see substantial non-linearity in the plot. This is because the cases are not uniformly spatially distributed and because distances between all pairs of cases are used.

If one wants to control for SAC, one can use standard statistical methods such as multiple regression or semi-partial correlations. Suppose one has a predictor that is spatially distributed as shown in Figure 4.

Figure 4: Dataset 1. Flatland and predictor.

We see that there is some similarity to Figure 2, but it is not immediately obvious whether the relationship is due to a pattern at the cluster level or whether the pattern is general. In the original data the correlation between predictor and outcome is .66, and using CD it is .36.

The semi-partial correlation between predictor and outcome is .13, a fair bit lower than .36, which gives us some evidence that the relationship owes much of its size to the cluster pattern. If one instead uses multiple regression the standardized betas are .16 and .36 for predictor and spatial distance, respectively. In fact, the true correlation of predictor and outcome once SAC is taken into account is near 0. The relationship between them is purely due to their common association with the clusters we see (common cause scenario).
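Continuing the hypothetical sketch above, the SAC control itself is just standard regression machinery applied to the distance vectors:

#semi-partial correlation: residualize the outcome distances on spatial distance,
#then correlate the residuals with the predictor distances
outcome_resid = resid(lm(outcome_dist ~ spatial_dist))
cor(predictor_dist, outcome_resid)

#multiple regression on standardized distance scores gives the betas directly
coef(lm(scale(outcome_dist) ~ scale(predictor_dist) + scale(spatial_dist)))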

2.1. A second dataset

What about a similar situation where the relationship between predictor and outcome is not spurious? Consider the data in Figures 5 and 6.

Figure 5: Dataset 2. Flatland and outcome.


Figure 6: Dataset 2. Flatland and predictor.

We see the same overall pattern as before, but it is subtly different. The correlation in the original dataset between predictor and outcome is nearly the same, .68 vs. .66 before. However, using the distance scores, the correlation between outcome and spatial distance is only .17 vs. .45 before. The distance-based correlation between predictor and outcome is .43 vs. .36 before.

The semi-partial correlation for predictor and outcome correcting the latter for spatial distance is .39 vs. .13 before. A fairly large difference. Using multiple regression the betas are .42 and .04 for predictor and spatial distance, respectively.

Clearly, the distance scores approach shows that the association between predictor and outcome is different in the two datasets.

2.2. The relationship between correlations of distances and correlations in the original data

Before we go further, it is worth examining the relationship between correlations of distance data and correlations of the original data. From my experimentation, correlations of distance scores seem to be an r2-type statistic (Hunter & Schmidt, 2004, p. 189). This can be seen if one simulates normally distributed data with a known correlation and then correlate their distance scores. The results are shown in Figure 7.

Figure 7: Comparison of r, CD and r2.

As can be seen, the correlation of distance scores is almost exactly the same as the square of the correlation, or conversely, the square root of the correlation of distance scores is almost the same as the original correlation score.
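A sketch of that simulation (my own code; MASS::mvrnorm generates data with a known population correlation):

library(MASS)  # for mvrnorm

set.seed(1)
r_values = seq(0, .9, .1)
cd_values = sapply(r_values, function(r) {
  xy = mvrnorm(1000, mu = c(0, 0), Sigma = matrix(c(1, r, r, 1), 2))
  cor(as.vector(dist(xy[, 1])), as.vector(dist(xy[, 2])))
})
round(cbind(r = r_values, r_squared = r_values^2, CD = cd_values), 2)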

Going back to the results in the previous section, this means that we need to take the square root of the semi-partial correlations and standardized betas to get an r-type measure. These values are shown in Table 1.



                                                         Dataset 1   Dataset 2
r (predictor x outcome), original data                      .66         .68
sqrt. r (predictor x outcome), distance data                .60         .66
sqrt. r (spatial x outcome), distance data                  .67         .41
sqrt. rsp (predictor x outcome, spatial), distance data     .36         .62
sqrt. beta (predictor x outcome), distance data             .40         .65
sqrt. beta (spatial x outcome), distance data               .60         .20

Table 1: Table of numbers from datasets 1 and 2.

There are several things worth noting. 1) The square root conversion seems to work, because the distance-based correlations are similar to the ones based on the original data. 2) Distance-based correlations were higher for the first dataset, as expected. 3) Semi-partial correlations of distance data showed that the predictor had a substantial size (.36) in the first dataset despite this relationship being entirely spurious. However, they did correctly indicate a stronger effect in the second dataset (.62). 4) Similarly, multiple regression showed that the predictor had substantial validity in the first dataset and, conversely, that spatial distance had some validity in the second dataset. Neither should be the case. On the positive side, in both cases the true causes had larger effect sizes (.60>.40; .65>.20).

3. k nearest spatial neighbors regression

Given the problems of the CD method, I developed a new method. It is a variant of the general approach of using the neighboring points to predict values, called k nearest neighbors (KNN) (James et al., 2013). The implementations of KNN I could find for R did not support spatial data, so I wrote my own implementation, which I dub k nearest spatial neighbors regression (KNSNR). The approach is fairly simple:

  1. For each case, calculate the distance (spherical, euclidean or otherwise) to every other case.

  2. For each case, find the k nearest neighbor cases.

  3. For each case, calculate the mean value of the neighbors’ scores and use that as the prediction.

One can alter k in (2) to tune the method. This value should probably be found thru cross-validation (James et al., 2013).

(3) can be altered to make more advanced versions such as taking into account the relative distances among the nearest neighbors (closer neighbors are more important) or the case weights (if using aggregated data or uncertain data).

The function I wrote will output one of four things: 1) predicted values, 2) correlation of actual values with predicted values, 3) residuals, and 4) the correlation of a predictor with the residuals of the outcome. Each has its use.
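The paper's own implementation is not reproduced here, but a minimal sketch of the core KNSNR prediction step (euclidean distances, unweighted neighbor means; function and argument names are mine) looks like this:

knsn_predict = function(coords, values, k = 3) {
  #step 1: distances between every pair of cases
  d = as.matrix(dist(coords))
  diag(d) = Inf  #a case is not its own neighbor
  sapply(seq_along(values), function(i) {
    #step 2: indices of the k nearest neighbors of case i
    nn = order(d[i, ])[1:k]
    #step 3: prediction = the mean of the neighbors' values
    mean(values[nn])
  })
}

The SAC in a variable can then be estimated as cor(values, knsn_predict(coords, values, k)), and the residuals (values minus predictions) can be correlated with a predictor for semi-partial control.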

The correlation of actual values with predicted values is a measure of the SAC in a given variable. It corresponds to the correlation of a given variable and spatial distance when using distance data. Which k is the right one to use? It depends. What k does here is change the zoom level. The value at each k tells you how much SAC there is in the dataset given a particular level of zooming in on every case. Figure 8 shows the amount of SAC in the outcome variable according to the first 100 values of k for multiple datasets.

Figure 8: The amount of SAC in datasets 1-6 according to KNSNR at a k 1-100.

In general, k should be larger than 1. k=3 often works well. k reached plateaus for datasets 1-2 around k=10. For these datasets k=50 makes more sense because there are 50 cases in each cluster. At all levels, however, we see that SAC is stronger for the odd numbered datasets, just as was found using CD.

The predicted scores from KNSNR can be used in a multiple regression model where they can compete with other variables. The residuals can be correlated with a predictor to easily perform a semi-partial correlation with SAC controlled for in the chosen variable. The semi-partial correlation between predictor and outcome, controlling the outcome for SAC, is .01 in the first dataset while it is .42 in the second dataset (using k=50). In other words, the method tells us that without SAC in the outcome the predictor doesn’t work in the first dataset, but it does in the second. The semi-partial correlations of spatial prediction x outcome controlled for predictor are .34 and .01, for the first and second datasets respectively. In other words, spatial proximity retains predictive value once the predictor has been controlled, showing that there is an additional unmeasured cause with strong SAC. All four results are good because they are in line with how the data were generated.

4. A different kind of spatial autocorrelation

The previous examples were easy in the sense that the data pretty clearly contained clusters of points. One could have easily classified the cases into the regional/island clusters and used a more traditional approach to the question: multi-level analysis. In simple terms, one would look at whether the relationship found in the pooled data holds when analyzing the data within each cluster. Table 2 shows these results.



[Table 2 values not preserved in this copy; it gave the predictor x outcome correlation within each of the five clusters, for datasets 1 and 2.]

Table 2: Correlations inside clusters in datasets 1-2.

Clearly, the datasets are markedly different in that there is a near-zero relationship inside the clusters in the first, but there are fairly strong relationships in the second.

When analyses give discrepant results across levels, it is termed Simpson’s paradox. Not understanding this concept has led to bogus claims of sex discrimination at least twice (Albers, 2015; Kievit, Frankenhuis, Waldorp, & Borsboom, 2013). An interactive visualization of the phenomenon can be found at: emilkirkegaard.dk/understanding_statistics/?app=Simpson_paradox

Clearly we did not need to use SAC to spot the difference between datasets 1-2. However, the datasets are unusual in that they permit very easy clustering of the cases. Real life datasets are often more difficult to deal with. Consider datasets 3-4 shown in Figures 9-12.

Figure 9: Dataset 3. Flatland and outcome.


Figure 10: Dataset 3. Flatland and predictor.


Figure 11: Dataset 4. Flatland and outcome.


Figure 12: Dataset 4. Flatland and predictor.

These datasets are variations of datasets 1-2, yet now the clusters are not easy to spot. One would not be able to use a simple clustering algorithm on the spatial data to sort them into 5 clusters correctly. If one cannot do that, one cannot use the standard multi-level approach because it requires discrete levels.

The correlations between the predictors and outcomes are in fact identical to before: .66 and .68 respectively. These correlations are generated in the same way as before, meaning that the first is a spurious, SAC-induced correlate, while the second is a true cause. The datasets still contain a sizable amount of SAC as was seen in Figure 8, so one might still wonder how this affects the results. Using the semi-partial SAC control approach, the results are .15 and .46 respectively. Should the number .15 be closer to 0? Not exactly.

To understand why, I need to be more explicit about how the data were simulated. In the odd numbered datasets, the outcome and predictor variables are both a function of the clusters, which serve as a SAC variable (SACV). In the even numbered datasets, the predictor is a function of the clusters + noise and the outcome is a function of the predictor + noise. Figure 13 shows the path models.


Figure 13: Path models for the example datasets.

The difference between datasets 1-2 and 3-4 is solely that the within cluster spatial variation was increased from standard deviation=6 to 15. This weakens the SAC in the data and thus the effect of controlling for SAC. The results are thus in line with expectations.

Still, we would like paired examples that appear similar – i.e. have similar r (predictor x outcome) – but where the first effect is spurious and the second is real, and where our methods correctly identify this. To do this, I created two more datasets. This time there are no a priori clusters.

The generation proceeded like this:

  1. For each case:

    1. Sample two values from the uniform distribution of numbers between 1 and 100 and use these as x and y coordinates.

    2. Sample a number from a standard normal distribution and use it as the score for the SACV.

  2. Induce SAC in the SACV using a KNSN-type approach.

After this, I proceeded similarly to before, namely by creating two variables, predictor and outcome, according to the path models shown in Figure 13. Noise was added to hit a target uncorrected correlation of .66, similar to datasets 1-4. Because the SACV has strong SAC, controlling for SAC decreases the correlation in the spurious cause scenario, because the path goes thru the SACV, but not in the true cause scenario, where causation is directly from predictor to outcome.

It is worth going into more detail about how SAC was induced. The algorithm works as follows:

  1. For each iteration:

    1. For each case:

      1. Find the k nearest spatial neighbors.

      2. Find the mean value of the chosen variable for these neighbors.

    2. For each case:

      1. Change the value of the datapoint to (1-w) times its current value and w times the mean value of its neighbors.

The two passes, (a) and (b), must proceed in this fashion; otherwise the order of the cases would affect the algorithm. The algorithm requires values for three parameters: i, k and w. i determines the number of iterations the algorithm goes thru, k controls the number of neighbors taken into account and w controls the relative weight given to the value from the neighbors. A smaller k means the SAC pattern will be more local, while both i and w make the SAC stronger, but in different ways. The parameters used for datasets 5 and 6 were i=20, k=10, w=1/3.
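A re-implementation sketch of this algorithm (my own code, not the paper's; the usage lines below it use guessed sample sizes and noise levels, since the paper only reports the i, k and w values):

induce_sac = function(coords, v, i = 20, k = 10, w = 1/3) {
  d = as.matrix(dist(coords))
  diag(d) = Inf
  for (iteration in 1:i) {
    #first pass: compute the k-nearest-neighbor mean for every case,
    #using only the current values (so case order does not matter)
    neighbor_means = sapply(seq_along(v), function(case) {
      nn = order(d[case, ])[1:k]
      mean(v[nn])
    })
    #second pass: update all cases at once
    v = (1 - w) * v + w * neighbor_means
  }
  v
}

#usage sketch in the spirit of datasets 5-6
set.seed(1)
n = 250
coords = data.frame(x = runif(n, 1, 100), y = runif(n, 1, 100))
sacv = induce_sac(coords, rnorm(n), i = 20, k = 10, w = 1/3)
predictor = sacv + rnorm(n)          #predictor is a function of the SACV plus noise
outcome_spurious = sacv + rnorm(n)   #dataset 5 style: outcome also caused by the SACV
outcome_true = predictor + rnorm(n)  #dataset 6 style: outcome caused by the predictor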

Figures 14 to 17 show the spatial layout as well as predictor and outcome values for datasets 5-6.

Figure 14: Example 5. Flatland and outcome.


Figure 15: Example 5. Flatland and predictor.



Figure 16: Example 6. Flatland and outcome.

Figure 17: Example 6. Flatland and predictor.

From just inspecting the spatial plots, it is not easy to see a marked difference between datasets 5 and 6. Both datasets clearly show some degree of SAC. Figures 18 and 19 show the regressions between predictor and outcome.

Figure 18: Scatter plot for predictor and outcome in dataset 5.


Figure 19: Scatter plot for predictor and outcome in dataset 6.

Altho not identical, there do not seem to be marked differences between the datasets. Thus, one could not tell the situations apart by simple inspection. There are also no obvious clusters one could classify cases into and use multi-level analysis.

However, if we examine the strength of SAC in the datasets, they are fairly different as shown in Table 3.



[Table 3 values not preserved in this copy; it gave the SAC of the coordinates, the SACV, the predictor, the outcome and the model residuals, for datasets 5 and 6.]

Table 3: SAC in datasets 5 and 6. SAC calculated using k=10 because data were generated using k=10.

Unsurprisingly, the coordinate variables and the SACV show extreme levels of SAC. The predictors diverge a bit in SAC, while the difference in SAC for the outcome variables is fairly large. This is as expected given the way the data were generated. In dataset 6, the outcome variable is 2 steps away from the SACV, while it is only 1 step away in dataset 5. Also worth noting is the SAC in the model residuals in dataset 5, which Hassall and Sherratt (2011) discuss as indicating a problem. It indicates that one or more causes not included in the model are themselves spatially autocorrelated.

The correlations with the predictor after controlling the outcome for SAC are .08 and .49 for datasets 5 and 6, respectively, showing that the method can successfully detect which association is due to an uncontrolled, spatially autocorrelated common cause.

4.1. Control outcome, predictor or both for SAC?

So far I have only presented results where the outcome was controlled for SAC. However, one might also control the predictor or both. It is not immediately clear to me which one is the best. For this reason, I tried all three side by side. Table 4 shows the results.

[Table 4 values not preserved in this copy.]

Table 4: KNSN regression results, controlling either or both variables with three different k values. The number is the value of k used; the letter indicates which variable was controlled for SAC: p=predictor, o=outcome, b=both.

We see some interesting patterns. When k is 3 or 10, controlling both variables tends to reduce the correlation further in the datasets with spurious causes, but not in the datasets with real causes. When using k=50, controlling both actually tends to make the results stronger for both true and spurious causes.

The most important comparison is that between datasets 5 and 6 because these are the most realistic datasets. When using an appropriate k (3 or 10), we see that controlling both reduces the spurious correlations (.18 to .07; .08 to .03) but either leaves unchanged or increases the true cause correlation (.57 to .57; .49 to .55). For this reason, it would seem best to control both predictor and outcome variables provided one has a reasonable idea about the correct k to use.

4.1.1. Multiple regression

There is another way to control for a variable, namely multiple regression. However, because the predicted values from KNSNR were often very similar (r>.7) to the predictor, this was not very useful. Therefore, if one does use the MR approach, one should pay attention to multicollinearity in the models, e.g. by means of the variance inflation factor (Field, 2013; Gordon, 2015).

5. Spatial local regression

Local regression is a method related to k nearest neighbor regression. The idea of local regression is that instead of fitting one model to all the predictor data, one fits a model for every case and weights the nearby cases more than the distant cases. Afterwards, the models are combined. The result of this is a smooth fit to the data. Figure 20 shows an illustration of this.

Figure 20: Local regression. At each point, a model is fit. Only nearby cases are used to fit and they are weighted by their distance such that the more distant cases weigh less. Copied from (James et al., 2013, p. 285).

In the above case, the distance measure for the cases is their absolute difference on the predictor variable. However, one could also use spatial distance, a method one might call spatial local regression (SLR). In this way, one would fit a regression for each case and use only the cases that are spatially nearby and finally aggregate/meta-analyze the results. The rationale here is that SAC will have only a weak influence when one examines cases near each other.

For the regression betas, one could use a weighted mean with unit-weights or inverse distance weights. The latter is theoretically superior, but I implemented both options. The implementation can also output the vector of beta values in case someone wants to experiment using other central tendency measures.
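Here is a minimal sketch of the idea under my own assumptions (lat/lon coordinate columns, and the aggregation weight taken as the inverse of the mean distance to the neighbors used in each local fit); the actual implementation may differ:

slr_beta = function(d, x, y, k = 3, inverse_weights = TRUE) {
  dist_mat = as.matrix(dist(d[c("lat", "lon")])) #pairwise spatial distances
  betas = numeric(nrow(d))
  wts = numeric(nrow(d))
  for (i in seq_len(nrow(d))) {
    nb = order(dist_mat[i, ])[2:(k + 1)] #the k spatially nearest cases
    fit = lm(reformulate(x, y), data = d[c(i, nb), ]) #local regression around case i
    betas[i] = coef(fit)[2] #the local slope
    wts[i] = if (inverse_weights) 1 / mean(dist_mat[i, nb]) else 1
  }
  weighted.mean(betas, wts, na.rm = TRUE) #aggregate the local slopes
}
#e.g. slr_beta(d1, "predictor", "outcome", k = 3) #d1 is hypothetical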

The results are surprisingly similar to those from KNSNR. Table 5 shows results for the predictor x outcome relationship in datasets 1-6 using SLR.

Table 5: Spatial local regression results for the predictor x outcome relationship in datasets 1-6. The number refers to the chosen k value. The word refers to the use of weights.

Recall that datasets with odd numbers have spurious causes while those with even numbers have true causes. For k=3 and 10, we see strong contrasts between the results from the even- and odd-numbered datasets, but not for k=50. It appears that using k=50 leaves a strong spurious effect in place. The use of weights does not seem to have much effect, which is somewhat surprising, altho it does slightly reduce the size of the spurious cause correlations.

To further investigate this, I calculated the same numbers as shown in Table 5, but for k = 3 to 50 using only inverse distance weighting. Results are shown in Figure 21.

Figure 21: Spatial local regression results for datasets 1-6 for varying values of k. Inverse distance used as weight.

We see that at lower k values, all correlations decrease, but the spurious ones decrease more. As k increases towards N of the sample, the spurious and true cause correlations converge.

6. Comparison of measures of SAC

It is worth comparing the two measures of SAC examined in this paper to the more standard methods in the field. Two methods are widely used to measure the amount of SAC in a dataset: Moran’s I and Geary’s C (Gelade, 2008; Hassall & Sherratt, 2011; Radil, 2011). These two measures are described as approximately inversely related and often only the first is used. I also used only the first because I was unable to understand the implementations of the latter in R (e.g. in spdep package). Both Moran’s I and Geary’s C are global measures of SAC, altho local variations exist (Radil, 2011).
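For reference, a minimal sketch of computing Moran’s I with spdep, assuming a data frame d with lon/lat coordinate columns and the variable of interest in d$outcome (these names are my assumptions):

library(pacman)
p_load(spdep)
coords = as.matrix(d[c("lon", "lat")]) #coordinate matrix
nb = knn2nb(knearneigh(coords, k = 10)) #neighbor list from the 10 nearest neighbors
lw = nb2listw(nb, style = "W") #row-standardized spatial weights
moran.test(d$outcome, lw) #global Moran's I with a significance test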

Because KNSNR tends to hit a ceiling quickly (e.g. in Table 2), the analysis here is focused on variables where this doesn’t happen. For this purpose, the outcome, the predictor and the residuals from predicting the former from the latter are examined with all three methods across all datasets. Results from different datasets with the ‘same’ variable were appended. Because KNSNR requires a tuning parameter, 3 values were chosen for this (3, 10, and 50). A square root version of CD was also added to see if the non-linearity inherent in the CD correlations would substantially change the relationship to the other measures.

To increase the sample size and diversity of datasets, more datasets were created. Dataset 7 is a variant of dataset 6, but where SAC is also induced into the outcome and predictor variables. Dataset 8 uses the distance to the center as the basis for the predictor values, and the outcome values are a noisy version of the predictor. Datasets 9 and 10 are variants of dataset 8, but where the distances to the points (75;75) and (100;100) are used, respectively. Dataset 11 had no SAC induced and functions as a null dataset. Dataset 12 has datapoints perfectly distributed in a grid pattern, perfect negative SAC for the predictor and uneven SAC for the outcome. The supplementary materials contain scatter plots for these additional datasets. Table 6 shows the SAC values according to the 5 measures and Table 7 shows their intercorrelations.


Dataset and variable

Table 6: SAC measures of three variables in 11 datasets.


          Morans_I  cd      cd_sqrt  knsn_3  knsn_10  knsn_50
Morans_I     —      0.889   0.897    0.737   0.728    0.849
cd         0.892     —      0.961    0.501   0.502    0.652
cd_sqrt    0.892    0.999    —       0.598   0.574    0.758
knsn_3     0.811    0.643   0.641     —      0.718    0.838
knsn_10    0.733    0.583   0.593    0.874    —       0.638
knsn_50    0.932    0.822   0.822    0.752   0.708     —

Table 7: Intercorrelations among 6 SAC measures in 12 datasets. Pearson correlations above diagonal. Spearman’s rank-order below.1

All correlations between the measures were sizable as one would expect. The CD measure showed downward-biasing non-linearity because the rank-order correlations were noticeably stronger.

Behind the strong intercorrelations lie some surprising inconsistencies. Consider ex1_outcome vs. ex7_outcome_predictor_resids: CD gives 0.452 and 0.088, while knsn_3 gives 0.741 and 0.976. What this seems to mean is that there is strong local SAC (which knsn_3 measures) in the residuals but weak global SAC (which CD measures).

The KNSNR results for dataset 12 need special attention. Because the points are distributed in an exact grid pattern, many pairs of points are equidistant. This poses a problem for KNN-type approaches because they have to pick the k nearest neighbors. In this case, the computer chooses one based on its position in the distance matrix, which is an arbitrary choice. If one is examining data distributed in a perfect grid pattern, extra attention should be given to the choice of k. Notice also that cases close to the edges have no neighbors in that direction (this isn’t Pacman world), so the algorithm must choose neighbors further away. This can give strange results.

7. Comparison of correction methods

In the previous sections, three different methods for controlling for SAC were discussed. It is useful to do a straight-up comparison with the best choice of parameters. Table 8 shows the uncontrolled and SAC-controlled correlations in datasets 1-6 using correlation of distances (using the semi-partial, partial and multiple regression sub-methods), KNSNR (using all three options with regard to which variable(s) to control for SAC, k=10) and SLR (using inverse weights, k=3).

Table 8: SAC corrected results across methods.

We see that for datasets 1-4, the CD approaches fail to correct for the spurious correlation since the correlations after control are still sizable. For the same datasets we see that KNSNR and SLR perform ok to well. For the most important datasets, 5-6, the CD approaches barely make a difference, but the other two methods make a very large difference.

8. Discussion and conclusion

Table 9 shows an overview of SAC methods and their uses.



	Can induce SAC?	Can measure SAC?	Can control for SAC?
Correlation of distances	Possibly	Yes	Yes
k nearest spatial neighbor	Yes	Yes	Yes
Spatial local regression	No	No	Yes
Moran’s I and similar	No	Yes	No

Table 9: Overview of SAC methods and their uses.

Correlation of distances (Piffer, 2015) is a clever idea, and a potentially powerful approach because it can be used to both measure and control for SAC and possibly induce SAC too. However, the version of it examined in this paper (“simple CD”) seems to be fairly useless in the sense that it failed to correct for spurious causation in simulated datasets.

It is worth examining whether limiting CD to the nearest k neighbors or neighbors within a radius of a given number of kilometers makes it a better measure. Another option is to weigh the datapoints by weights that diminish as the distance increases (e.g. inverse). Both suggested modifications attempt to deal with the problem that results from the globalness of the simple CD.

A second problem with CD approaches is that because distance scores are based on two datapoints, any measurement error in the datapoints will have a compound effect on the distance scores, but the matter is complicated (Trafimow, 2015). I’m not sure if this has an effect on the conversion back to an r-type measure as described in this paper, but it might.

A third problem with CD approaches is that the number of distances quickly becomes very large: with n cases there are n(n−1)/2 pairs, so for a dataset with 10,000 cases the number of distance scores is 49,995,000. This makes for fairly resource-intensive calculations. Random sampling of datapoints could probably be used to offset this problem at the price of some sampling error.

KNSNR approaches are powerful since they can be used to all three ends. The main weakness seems to be that they require a tuning parameter. Probably some approach could be found to choose this through re-sampling methods similar to those used to find tuning parameters for e.g. lasso regression (James et al., 2013). One problem with KNSNR is that it tends to reach a ceiling fairly quickly.
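One rough sketch of such an approach — not something done in this paper — is to pick the k that best predicts each case from the mean of its k nearest spatial neighbors, leave-one-out style; the column names are assumptions:

choose_k = function(d, var, ks = 2:50) {
  dist_mat = as.matrix(dist(d[c("lat", "lon")]))
  rmse = sapply(ks, function(k) {
    preds = sapply(seq_len(nrow(d)), function(i) {
      nb = order(dist_mat[i, ])[2:(k + 1)] #skip the case itself
      mean(d[[var]][nb])
    })
    sqrt(mean((d[[var]] - preds)^2)) #leave-one-out prediction error for this k
  })
  ks[which.min(rmse)] #the k with the lowest error
}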

SLR approaches seem more limited because they can only be used to correct for SAC. They do however seem to give similar results to KNSNR approaches, so they could probably be used alongside KNSNR in a multi-method approach to controlling for SAC. Like KNSNR approaches, SLR requires a tuning parameter, but it seems always better to use k=3: the smaller the k, the smaller the effect of SAC in the data on the estimate, but the more sampling error.

Still, the present findings call for further research into all three approaches.

8.1. Limitations

Only 12 test datasets were examined. Testing the methods against each other on a larger and more diverse set of datasets might change conclusions somewhat.

Of the traditional measures of SAC, only Moran’s I was used in the comparisons. Further research should compare the proposed measures with other already developed measures such as Geary’s C, Gi, Gi* and local Moran’s I (Radil, 2011, p. 19).

Another method for correcting for SAC used by Hassall and Sherratt (2011), spatial eigenvector mapping, could not be examined because I could not find a suitable R implementation.

Supplementary material and acknowledgments

The R source code for this paper is available at osf.io/f2pnr/files/

Thanks to Davide Piffer for helpful comments.


Albers, C. (2015). NWO, Gender bias and Simpson’s paradox. Retrieved from blog.casperalbers.nl/science/nwo-gender-bias-and-simpsons-paradox/

Field, A. P. (2013). Discovering statistics using IBM SPSS statistics (4th edition). Los Angeles: Sage.

Gelade, G. A. (2008). The geography of IQ. Intelligence, 36(6), 495–501. doi.org/10.1016/j.intell.2008.01.004

Gordon, R. A. (2015). Regression analysis for the social sciences (Second edition). New York: Routledge, Taylor & Francis Group.

Hassall, C., & Sherratt, T. N. (2011). Statistical inference and spatial patterns in correlates of IQ. Intelligence, 39(5), 303–310. doi.org/10.1016/j.intell.2011.05.001

Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis correcting error and bias in research findings. Thousand Oaks, Calif.: Sage. Retrieved from site.ebrary.com/id/10387875

James, G., Witten, D., Hastie, T., & Tibshirani, R. (Eds.). (2013). An introduction to statistical learning: with applications in R. New York: Springer.

Kievit, R., Frankenhuis, W. E., Waldorp, L., & Borsboom, D. (2013). Simpson’s paradox in psychological science: a practical guide. Quantitative Psychology and Measurement, 4, 513. doi.org/10.3389/fpsyg.2013.00513

Piffer, D. (2015). A review of intelligence GWAS hits: Their relationship to country IQ and the issue of spatial autocorrelation. Intelligence, 53, 43–50. doi.org/10.1016/j.intell.2015.08.008

Radil, S. M. (2011). Spatializing social networks: making space for theory in spatial analysis. University of Illinois at Urbana-Champaign. Retrieved from www.ideals.illinois.edu/handle/2142/26222

Trafimow, D. (2015). A defense against the alleged unreliability of difference scores. Cogent Mathematics, 2(1), 1064626. doi.org/10.1080/23311835.2015.1064626


1 The reader with a sense for detail might wonder how the rank-order correlations for CD and CD_sqrt differ. Taking the square root of a vector of numbers does not change their order, so how can the rank-order correlations be different? The explanation is that sometimes the square root is undefined because the original number was negative. The correlations are based on pair-wise, not case-wise complete data, which results in a slightly different set of cases used for the correlations for CD and CD_sqrt.

A correction to this was made 2015-11-10. See openpsych.net/forum/showthread.php?tid=251&pid=3675#pid3675

Post your review over at the OP forum: openpsych.net/forum/showthread.php?tid=251
Some methods for analyzing and correcting for spatial autocorrelation

Emil O. W. Kirkegaard

A method for examining spatial autocorrelation based on distance scores was examined and found to be deficient. Because of this an alternative measure was proposed: k nearest spatial neighbor regression. Methods were examined across simulated datasets to see if they could successfully remove SAC and distinguish between a true and a spurious cause. Furthermore, the methods were compared to a more traditional measure of spatial autocorrelation, Moran’s I.

Key words: spatial autocorrelation, Moran’s I, correlation of distances, k nearest spatial neighbor, knsn

PDF + files: osf.io/ea2bm/

I looked around to see if I could find a nice function for just plotting the results of kmeans() using ggplot2. I could not find that. So I wrote my own function.

The example data I’m using is real GDP growth, not sure exactly what it is, but the file can be found here: OECD real economic growth. The datafile: GDP. Because some data is missing, I use imputation using irmi() from VIM.

One thing about kmeans is that it is a non-deterministic algorithm because it depends on the initial positions of the cluster means. In the default settings, these are picked at random. This means that results won’t always be completely identical. Even when they are, the clusters may change numbers. Such is life. We can get around that by running the method a large number of times and picking the best of the solutions found. I set the default for this to 100 runs.

kmeans can use raw data if desired. Generally, however, this is not desirable: we want all the variables to count equally, not to differ in importance because their scales are different. The default setting is thus to standardize the data first.

Then there is the question of how to plot the solution. The usual solution is a scatter plot, and I used that as well. The scatter plot is based on the first two factors as found by fa() (psych package).

Finally, I color code the clusters and add rownames to the points for easy identification.

The results:


In the first plot, I have clustered the countries. In the second I have clustered the years. This is simply done by first transposing the dataset.

R code

library(pacman) #p_load below comes from pacman
p_load(kirkegaard, psych, plyr, stringr, weights, VIM, ggplot2, readODS)

#read from file
d = read.ods("GDP.ods", 1)

#fix names
row_names = d$A[-1]
col_names = d[1, -1]
d = d[-1, -1]

#recode NA
d[d==""] = NA

#to numeric
d = as.data.frame(lapply(d, as.numeric))

#impute NA
d = irmi(d, noise = F)

rownames(d) = row_names
colnames(d) = col_names

#transposed version
d2 = as.data.frame(t(d))

#some cors
round(wtd.cors(d), 2)
round(wtd.cors(t(d)), 2)

plot_kmeans = function(df, clusters, runs = 100, standardize = T) {
  if (standardize) df = std_df(df) #standardize variables so they count equally
  tmp_k = kmeans(df, centers = clusters, nstart = runs) #best of 'runs' random starts
  tmp_f = fa(df, 2, rotate = "none") #first two factors for the plot axes
  #collect data
  tmp_d = data.frame(matrix(ncol=0, nrow=nrow(df)))
  tmp_d$cluster = as.factor(tmp_k$cluster)
  tmp_d$fact_1 = as.numeric(tmp_f$scores[, 1])
  tmp_d$fact_2 = as.numeric(tmp_f$scores[, 2])
  tmp_d$label = rownames(df)
  #plot
  g = ggplot(tmp_d, aes(fact_1, fact_2, color = cluster)) + geom_point() + geom_text(aes(label = label), size = 3, vjust = 1, color = "black")
  return(g)
}

plot_kmeans(d, 2)
plot_kmeans(d2, 2)


Two methods are presented that allow for identification of mixed cases in the extraction of general factors. Simulated data is used to illustrate them.

General factors can be extracted from datasets where all or nearly all of the variables are correlated. At the case level, such general factors are decreased in size if there are mixed cases present. A mixed case is an ‘inconsistent’ case according to the factor structure of the data.

A simple way of illustrating what I’m talking about is using the matrixplot() function from the VIM package to R (Templ, Alfons, Kowarik, & Prantner, 2015) with some simulated data.

For simulated dataset 1, start by imagining that we are measuring a general factor and that all our indicator variables have a positive loading on this general factor, but that this loading varies in strength. Furthermore, there is no error of measurement and there is only one factor in the data (no group factors, i.e. no hierarchical or bi-factor structure, (Jensen & Weng, 1994)). I have used datasets with 50 cases and 25 variables to avoid the excessive sampling error of small samples and to keep a realistic number of cases compared to the datasets examined in S factor studies (e.g. Kirkegaard, 2015). The matrix plot is shown in Figure 1.

Figure 1: Matrix plot of dataset 1

No real data looks like this, but it is important to understand what to look for. Every indicator is on the x-axis and the cases are on the y-axis. The cases are colored by their relative values, where darker means higher values. So in this dataset we see that any case that does well on any particular indicator does just as well on every other indicator. All the indicators have the same factor loading of 1, and the proportion of variance explained is also 1 (100%), so there is little point in showing the loadings plot.

To move towards realism, we need to complicate this simulation in some way. The first way is to introduce some measurement error. The amount of error introduced determines the factor loadings and hence the size of the general factor. In dataset 2, the error amount is .5, and the signal multiplier varies from .05 to .95 all of which are equally likely (uniform distribution). The matrix and the loadings plots are shown in Figures 2 and 3.

Figure 2: Matrix plot for dataset 2


Figure 3: Loadings plot for dataset 2

By looking at the matrix plot we can still see a fairly simple structure. Some cases are generally darker (whiter) than others, but there is also a lot of noise which is of course the error we introduced. The loadings show quite a bit of variation. The size of this general factor is .45.

The next complication is to introduce the possibility of negative loadings (these are consistent with a general factor, as long as they load in the right direction, (Kirkegaard, 2014)). We go back to the simplified case of no measurement error for simplicity. Figures 4 and 5 show the matrix and loadings plots.

Figure 4: Matrix plot for dataset 3

Figure 5: Loadings plot for dataset 3

The matrix plot looks odd, until we realize that some of the indicators are simply reversed. The loadings plot shows this reversal. One could easily get back to a matrix plot like that in Figure 1 by reversing all indicators with a negative loading (i.e. multiplying by -1). However, the possibility of negative loadings does increase the complexity of the matrix plots.

For the 4th dataset, we begin with dataset 2 and create a mixed case. This we do by setting its value on every indicator to 2, a strong positive value (98th centile given a standard normal distribution). Figure 6 shows the matrix plot. I won’t bother with the loadings plot because it is not strongly affected by a single mixed case.

Figure 6: Matrix plot for dataset 4

Can you guess which case it is? Perhaps not. It is #50 (top line). One might expect it to be the same hue all the way. This however ignores the fact that the values in the different indicators vary due to sampling error. So a value of 2 is not necessarily at the same centile or equally far from the mean in standard units in every indicator, but it is fairly close which is why the color is very dark across all indicators.

For datasets with general factors, the highest value of a case tends to be on the most strongly loaded indicator (Kirkegaard, 2014b), but this information is not easy to use in an eye-balling of the dataset. Thus, it is not so easy to identify the mixed case.

Now we complicate things further by adding the possibility of negative loadings. This gets us data roughly similar to that found in S factor analysis (there are still no true group factors in the data). Figure 7 shows the matrix plot.

Figure 7: Matrix plot for dataset 5

Just looking at the dataset, it is fairly difficult to detect the general factor, but in fact the variance explained is .38. The mixed case is easy to spot now (#50) since it is the only case that is consistently dark across indicators, which is odd given that some of them have negative loadings. It ‘shouldn’t’ happen. The situation is however somewhat extreme in the mixedness of the case.

Automatic detection

Eye-balling figures and data is a powerful tool for quick analysis, but it cannot give precise numerical values used for comparison between cases. To get around this I developed two methods for automatic identification of mixed cases.

Method 1
A general factor only exists when multidimensional data can be usefully compressed, informationally speaking, to 1-dimensional data (factor scores on the general factor). I encourage readers to consult the very well-made visualization of principal component analysis (almost the same as factor analysis) at this website. In this framework, mixed cases are those that are not well described or predicted by a single score.

Thus, it seems to me that we can use this information as a measure of the mixedness of a case. The method is as follows (a rough R sketch is given after the list):

  1. Extract the general factor.
  2. Extract the case-level scores.
  3. For each indicator, regress it unto the factor scores. Save the residuals.
  4. Calculate a suitable summary metric, such as the mean absolute residual and rank the cases.
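A minimal sketch of this in R, using fa() from the psych package (the function name and the data frame df are placeholders, not the exact code used here):

library(pacman)
p_load(psych)
mixedness_resid = function(df) {
  f = fa(df, 1) #extract the general factor
  scores = as.numeric(f$scores) #case-level factor scores
  resids = sapply(df, function(ind) resid(lm(ind ~ scores))) #regress each indicator on the scores
  rowMeans(abs(resids)) #mean absolute residual per case
}
#e.g. which.max(mixedness_resid(df)) gives the most mixed case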

Using this method on dataset 5 in fact does identify case 50 as the most mixed one. Mixedness varies between cases due to sampling error. Figure 8 shows the histogram.

Figure 8: Histogram of absolute mean residuals from dataset 5

The outlier on the right is case #50.

How extreme does a mixed case need to be for this method to find it? We can try reducing its mixedness by assigning it less extreme values. Table 1 shows the effects of doing this.


Mixedness values	Mean absolute residual

Table 1: Mean absolute residual and mixedness

So we see that when it is 2 and 1.5, it is clearly distinguishable from the rest of the cases, but 1 is about the limit of this since the second-highest value is .80. Below this, the other cases are similarly mixed, just due to the randomness introduced by measurement error.

Method 2
Since mixed cases are poorly described by a single score, they don’t fit well with the factor structure in the data. Generally, this should result in the proportion of variance explained increasing when they are removed. Thus the method is as follows (a rough R sketch is given after the list):

  1. Extract the general factor from the complete dataset.
  2. For every case, create a subset of the dataset where this case is removed.
  3. Extract the general factors from each subset.
  4. For each analysis, extract the proportion of variance explained and calculate the difference to that using the full dataset.
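A minimal sketch of this in R, again using fa() from psych and computing the proportion of variance explained as the mean squared loading (a placeholder sketch, not the exact code used here):

mixedness_var = function(df) {
  prop_var = function(x) {
    f = fa(x, 1)
    sum(f$loadings^2) / ncol(x) #proportion of variance explained by the general factor
  }
  full = prop_var(df)
  sapply(seq_len(nrow(df)), function(i) prop_var(df[-i, ]) - full) #improvement when case i is dropped
}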

Using this method on the dataset also used above correctly identifies the mixed case. The histogram of results is shown in Figure 9.

Figure 9: Histogram of differences in proportion of variance to the full analysis

Like with method 1, we then redo this analysis for other levels of mixedness. Results are shown in Table 2.


Mixedness values	Improvement in proportion of variance

Table 2: Improvement in proportion of variance and mixedness

We see the same as before, in that both 2 and 1.5 are clearly identifiable as being an outlier in mixedness, while 1 is not since the next-highest value is .45.

Large scale simulation with the above methods could be used to establish distributions to generate confidence intervals from.

It should be noted that the improvement in proportion of variance is not independent of the number of cases (more cases means that a single case is less important, and non-linearly so), so the value cannot simply be used to compare cases across datasets of different sizes without correcting for this problem. Correcting it is however beyond the scope of this article.

Comparison of methods

The results from both methods should have some positive relationship. The scatter plot is shown in Figure 10.

Figure 10: Scatter plot of method 1 and 2

We see that the true mixedness case is a strong outlier with both methods — which is good because it really is a strong outlier. The correlation is strongly inflated because of this: r=.70 with it included, but only .26 without. The relative lack of a positive relationship without the true outlier in mixedness is perhaps due to range restriction in mixedness in the dataset, since the only mixedness besides that of case 50 is due to measurement error. Whatever the exact interpretation, I suspect it doesn’t matter since the goal is to find the true outliers in mixedness, not to agree on the relative ranks of the cases with relatively little mixedness.1

I have implemented both above methods in R. They can be found in my unofficial psych2 collection of useful functions located here.

Supplementary material
Source code and figures are available at the Open Science Framework repository.


Jensen, A. R., & Weng, L.-J. (1994). What is a good g? Intelligence, 18(3), 231–258. doi.org/10.1016/0160-2896(94)90029-9

Kirkegaard, E. O. W. (2014a). The international general socioeconomic factor: Factor analyzing international rankings. Open Differential Psychology. Retrieved from openpsych.net/ODP/2014/09/the-international-general-socioeconomic-factor-factor-analyzing-international-rankings/

Kirkegaard, E. O. W. (2014b). The personal Jensen coefficient does not predict grades beyond its association with g. Open Differential Psychology. Retrieved from openpsych.net/ODP/2014/10/the-personal-jensen-coefficient-does-not-predict-grades-beyond-its-association-with-g/

Kirkegaard, E. O. W. (2015). Examining the S factor in US states. The Winnower. Retrieved from thewinnower.com/papers/examining-the-s-factor-in-us-states

Templ, M., Alfons, A., Kowarik, A., & Prantner, B. (2015, February 19). VIM: Visualization and Imputation of Missing Values. CRAN. Retrieved from cran.r-project.org/web/packages/VIM/index.html


1 Concerning the agreement about rank-order, it is about .4 both with and without case 50. But this is based on a single simulation and I’ve seen some different values when re-running it. A large scale simulation is necessary.


Blog commenter Lion of the Judah-sphere has claimed that the SAT does not correlate as well with comprehensive IQ tests as said IQ tests correlate with one another. At first I assumed he was wrong, but my recent analysis suggesting Harvard undergrads have an average Wechsler IQ of 122, really makes me wonder.

While an IQ of 122 (white norms) is 25 points above the U.S. mean of 97, it seems very low for a group of people who averaged 1490 out of 1600 on the SAT. According to my formula, since 1995 a score of 1490 on the SAT equated to an IQ of 141. But my formula was based on modern U.S. norms; because demographic changes have made the U.S. mean IQ 3 points below the white American mean (and made the U.S. standard deviation 3.4 percent larger than the white SD), converting to white norms reduces Harvard’s SAT IQ equivalent to 139.

In general, research correlating the SAT with IQ has been inconsistent, with correlations ranging from 0.4 to 0.8. I think much depends on the sample. Among people who took similar courses in similar high schools, the SAT is probably an excellent measure of IQ. But considering the wide range of schools and courses American teenagers have experienced, the SAT is not, in my judgement, a valid measure of IQ. Nor should it be. Universities should not be selecting students based on biological ability, but rather on acquired academic skills.

The lower values are due to restriction of range, e.g. Frey and Detterman (2004). When corrected, the value goes up to the .7-.8 range. There is also a value of .54 using the ICAR60 (Condon and Revelle, 2014) without correction for reliability or restriction of range.

As for the post, I can think of few things:

1. The sample recruited is likely not representative of Harvard. Probably mostly social sci/humanities students, who have lower scores.

2. Regression towards the mean means that the Harvard student body won’t be as exceptional on their second measurement as on their first. This is because some of the reason they were so high was just good luck.

3. The SAT is teachable to some extent, and the training does not transfer well to other tests. This reduces the correlation between SAT and other GCA tests.

4. Harvard uses affirmative action which lowers the mean SAT of the students a bit. It seems to be about 1500.

The SAT has an approx. mean of ~500 per subtest, a ceiling of 800 and an SD of approx. 100. So a 1500 total score is 750+750 = (500+2.5 SD)+(500+2.5 SD), or about 2.5 SD above the mean. Test-retest reliability/stability over a few months is around .86 (mean of the values here: research.collegeboard.org/sites/default/files/publications/2012/7/researchreport-1982-7-test-disclosure-retest-performance-sat.pdf, n≈3700).

The interesting question is how much regression towards the mean we can expect? I decided to investigate using simulated data. The idea is basically that we first generate some true scores (per classical test theory), and then make two measurements of them. Then using the first measurement, we make a selection that has a mean 2.5 SD above the mean, then we check how well this group performs on the second testing.

In R, the way we do this, is to simulate some randomly distributed data, and then create new variables that are a sum of true score and error for the measurements. This presents us with a problem.

How much error to add?

We can solve this question either by trying some values, or analytically. Analytically, it is like this:

cor(test1, test2) = cor(test1, true score) × cor(test2, true score)

The correlation between the two testings is due to their common association with the true scores. To simulate a testing, we need the correlation between the testing and the true score. Since the test-retest correlation is .863, we take the square root and get .929. The squared value of a correlation is the amount of variance it explains, so squaring .929 gets us back to where we were before. Since the total variance is 1, the remaining (error) variance is 1-.863=.137. We take the square root of this to get the corresponding correlation, which is .370. Since correlations are the weights we need when adding, we weigh the true score by .929 and the error by .370 so that the two measurements have a test-retest correlation of .863.

One could also just try some values until one gets something that is about right. In this case, just weighing the true score by 1 and error by .4 produces nearly the same result. Trying out a few values like this is faster than doing the math. In fact, I initially tried out some values to hit the approximate value but then decided to solve it analytically as well.

How much selection to make?

This question is more difficult analytically. I have asked some math and physics people and they could not solve it analytically. The info we have is the mean value of the selected group, which is about 2.5, relative to a standard normal population. Assuming we use top-down selection (i.e. everybody above a threshold gets selected, no one below), where must we place our threshold to get a mean of 2.5? It is not immediately obvious. I solved it by trying some values and calculating the mean trait value in the selected group. It turns out that to get a selected group with a mean of 2.5, the threshold must be about 2.14.
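For those who want to check this, here is a quick way to find the threshold numerically, essentially what I did by trying values:

set.seed(1)
z = rnorm(1e7) #a large standard normal sample
mean_above = function(cut) mean(z[z > cut]) #mean of the selected group for a given cutoff
cuts = seq(2.0, 2.3, 0.01)
cuts[which.min(abs(sapply(cuts, mean_above) - 2.5))] #about 2.14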

Since this group is selected for having positive errors and positive true scores, their true scores and second testing scores will be lower. How much lower? 2.32 and 2.15 according to my simulation. I’m not sure why the second measurement scores are lower than the true scores, perhaps someone more learned in statistics can tell me.

So there is still some way down to an IQ of 122 from 132.3 (2.15*15+100). However, we may note that they used a shorter WAIS-R form, which correlates .91 with the full-scale. Factoring this in reduces our estimate to 129.4. Given the selectivity noted above, this is not so unrealistic.

Also, the result can apparently be reached simply by 2.5*.86. I was aware that this might work, but wasn’t sure so I tested it (the purpose of this post). One of the wonders of statistical software like this is that one can do empirical mathematics. :)

R code


#load libs (filter is from dplyr, describe from psych)
p_load(dplyr, psych)

#size and true score
n.size = 1e6
true.score = rnorm(n.size)

#add error to true scores
test.1 = .929*true.score + .370*rnorm(n.size)
test.2 = .929*true.score + .370*rnorm(n.size)
SAT = data.frame(true.score, test.1, test.2)
#verify it is correct
round(cor(SAT), 3) #test.1 x test.2 should be about .86

#select a subsample
selected = filter(SAT, test.1>2.14) #selected sample
describe(selected) #desc. stats


I ran into trouble trying to remove the effects of one variable on other variables while writing my reanalysis of the Noble et al. (2015) paper. It was not completely obvious which exact method to use.

The authors in their own paper used OLS, aka. standard linear regression. However, since I wanted to plot the results at the case level, this was not useful to me. Before doing this I did not really understand how partial correlations worked, what semi-partial correlations were, or how multiple regression works exactly. I still don’t understand the last, but the others I will explain with example results.

Generated data

Since we are cool, we use R.

We begin by loading some stuff I need and generating some data:

#load libs
 library(pacman) #install this first if u dont have it
 p_load(Hmisc, psych, ppcor, MASS, QuantPsyc, devtools)
 #Generate data
 df = as.data.frame(mvrnorm(200000, mu = c(0,0), Sigma = matrix(c(1,0.50,0.50,1), ncol = 2), empirical=TRUE))
 cor(df)[1,2] #check the correlation
[1] 0.5

The correlation from this is exactly .500000, as chosen above.

Then we add a gender specific effect:

df$gender = as.factor(c(rep("M",100000),rep("F",100000)))
 df[1:100000,"V2"] = df[1:100000,"V2"]+1 #add 1 to males
 cor(df[1:2])[1,2] #now theres error due to effect of maleness!

[1] 0.4463193

The correlation has been reduced a bit as expected. It is time to try and undo this and detect the real correlation of .5.

#multiple regression
 model = lm(V1 ~ V2+gender, df) #standard linear model
 coef(model)[-1] #unstd. betas
 lm.beta(model) #std. betas
V2        genderM 
0.4999994 -0.5039354
V2        genderM 
0.5589474 -0.2519683 

The raw beta coefficient is spot on, but the standardized one is not at all. This certainly makes interpretation more difficult when the std. beta does not correspond in size with the correlations.

#split samples
 df.genders = split(df, df$gender) #split by gender
 cor(df.genders$M[1:2])[1,2] #males
 cor(df.genders$F[1:2])[1,2] #females
[1] 0.4987116
[1] 0.5012883

These are both approximately correct. Note however that these are the values not for the total sample as before, but for each subsample. If the subsamples differ in their correlations, it will show up here clearly.

#partial correlation
df$gender = as.numeric(df$gender) #convert to numeric
partial.r(df, c(1:2),3)[1,2]
[1] 0.5000005

This is using an already made function for doing partial correlations. Partial correlations are calculated from the residuals of both variables. This means that they correlate whatever remains of the variables after everything that could be predicted from the controlling variable is removed.

#residualize both
df.r2 = residuals.DF(df, "gender")
cor(df.r2[1:2])[1,2] #correlate the residualized variables
[1] 0.5000005

This should be the exact same as the above, but done manually. And it is.

spcor(df)$estimate #hard to interpret output
spcor.test(df$V1, df$V2, df$gender)[1] #partial out gender from V2
spcor.test(df$V2, df$V1, df$gender)[1] #partial out gender from V1
V1 V2 gender
V1 1.0000000 0.4999994 -0.2253951
V2 0.4472691 1.0000000 0.4479416
gender -0.2253103 0.5005629 1.0000000
1 0.4999994
1 0.4472691

Semi-partial correlations, aka part correlations, are the same as above, except that only one of the two variables is residualized by the controlled variable. The code above uses two different ready-made functions for calculating these. The first works on an entire data.frame and outputs semi-partials for all variables controlling for all other variables. The output is a bit tricky to read, however, and there is no explanation in the help: one has to read it row by row, where each row is the original variable correlated with the residualized version of the variable in each column.
The two spcor.test() calls use the other function, where one has to specify which two variables to correlate and which variables to control for. The second variable in the call is the one that gets residualized by the control variables.
We see that the results are as they should be. Controlling V1 for gender has about zero effect. This is because gender has no effect on this variable aside from a very small chance effect (r=-0.00212261). Controlling V2 for gender has the desired effect of returning a value very close to .5, as it should.

#residualize only V2
df.r1.V1 = df.r2 #copy above
df.r1.V1$V1 = df$V1 #fetch orig V1
cor(df.r1.V1[1:2])[1,2] #correlate
[1] 0.4999994
#residualize only V1
df.r1.V2 = df.r2 #copy above
df.r1.V2$V2 = df$V2 #fetch orig V2
cor(df.r1.V2[1:2])[1,2] #correlate
[1] 0.4472691

These are the two manual ways of doing the same as above. We get the exact same results, so that is good.

So where does this lead us? Well, apparently using multiple regression to control variables is a bad idea since it results in difficult to interpret results.


In this commentary I explain how mean differences between normal distributions give rise to different percentages of the populations being above or below a given threshold, depending on where the threshold is.


“Research uncovers flawed IQ scoring system” is the headline on phys.org, which often posts news about research from other fields. It concerns a study by Harrison et al. (2015). The researchers have allegedly “uncovered anomalies and issues with the Wechsler Adult Intelligence Scale-Fourth Edition (WAIS-IV), one of the most widely used intelligence tests in the world”. An important discovery, if true. Let’s hear it from the lead researcher:

“Looking at the normal distribution of scores, you’d expect that only about five per cent of the population should get an IQ score of 75 or less,” says Dr. Harrison. “However, while this was true when we scored their tests using the American norms, our findings showed that 21 per cent of college and university students in our sample had an IQ score this low when Canadian norms were used for scoring.”

How can it be? To learn more, we delve into the actual paper titled: Implications for Educational Classification and Psychological Diagnoses Using the Wechsler Adult Intelligence Scale–Fourth Edition With Canadian Versus American Norms.

The paper

First they summarize a few earlier studies on Canada and the US. The Canadians obtained higher raw scores. Of course, this was hypothesized to be due to differences in ethnicity and educational achievement factors. However, this did not quite work out, so Harrison et al decided to investigate it more (they had already done so in 2014). Their method consists of taking the scores from a large mixed sample consisting of healthy people — i.e. with no diagnosis, 11% — and people with various mental disorders (e.g. 53.5% with ADHD), and then scoring this group on both the American and the Canadian norms. What did they find?

Blast! The results were similar to the results from the previous standardization studies! What happened? To find out, Harrison et al do a thorough examination of various subgroups in various ways. No matter which age group they compare, the result won’t go away. They also report the means and Cohen’s d for each subtest and aggregate measure — very helpful. I reproduce their Table 1 below:

Score	M (US)	SD (US)	M (CAN)	SD (CAN)	p	d	r
FSIQ 95.5 12.9 88.1 14.4 <.001 0.54 0.99
GAI 98.9 14.7 92.5 16 <.001 0.42 0.99
Index Scores
Verbal Comprehension 97.9 15.1 91.8 16.3 <.001 0.39 0.99
Perceptual Reasoning 99.9 14.1 94.5 15.9 <.001 0.36 0.99
Working Memory 90.5 12.8 83.5 13.8 <.001 0.53 0.99
Processing Speed 95.2 12.9 90.4 14.1 <.001 0.36 0.99
Subtest Scores
Verbal Subtests
Vocabulary 9.9 3.1 8.7 3.3 <.001 0.37 0.99
Similarities 9.7 3 8.5 3.3 <.001 0.38 0.98
Information 9.2 3.1 8.5 3.3 <.001 0.22 0.99
Arithmetic 8.2 2.7 7.4 2.7 <.001 0.3 0.99
Digit Span 8.4 2.5 7.1 2.7 <.001 0.5 0.98
Performance Subtests
Block Design 9.8 3 8.9 3.2 <.001 0.29 0.99
Matrix Reasoning 9.8 2.9 9.1 3.2 <.001 0.23 0.99
Visual Puzzles 10.5 2.9 9.4 3.1 <.001 0.37 0.99
Symbol Search 9.3 2.8 8.5 3 <.001 0.28 0.99
Coding 8.9 2.5 8.2 2.6 <.001 0.27 0.98


Sure enough, the scores are lower using the Canadian norms. And very ‘significant’ too. A mystery.

Next, they go on to note how this sometimes changes the classification of individuals into 7 arbitrarily chosen intervals of IQ scores, and how this differs between subtests. They spend a lot of e-ink noting percents about this or that classification. For instance:

“Of interest was the percentage of individuals who would be classified as having a FSIQ below the 10th percentile or who would fall within the IQ range required for diagnosis of ID (e.g., 70 ± 5) when both normative systems were applied to the same raw scores. Using American norms, 13.1% had an IQ of 80 or less, and 4.2% had an IQ of 75 or less. By contrast, when using Canadian norms, 32.3% had an IQ of 80 or less, and 21.2% had an IQ of 75 or less.”

I wonder if some coherent explanation can be found for all these results. In their discussion they ask:

“How is it that selecting Canadian over American norms so markedly lowers the standard scores generated from the identical raw scores? One possible explanation is that more extreme scores occur because the Canadian normative sample is smaller than the American (cf. Kahneman, 2011).”

If the reader was unsure, yes, this is Kahneman’s 2011 book about cognitive biases and dual process theory.

They have more suggestions about the reason:

“One cannot explain this difference simply by saying it is due to the mature students in the sample who completed academic upgrading, as the score differences were most prominent in the youngest cohorts. It is difficult to explain these findings simply as a function of disability status, as all participants were deemed otherwise qualified by these postsecondary institutions (i.e., they had met normal academic requirements for entry into regular postsecondary programs). Furthermore, in Ontario, a diagnosis of LD is given only to students with otherwise normal thinking and reasoning skills, and so students with such a priori diagnosis would have had otherwise average full scale or general abilities scores when tested previously. Performance exaggeration seems an unlikely cause for the findings, as the students’ scores declined only when Canadian norms were applied. Finally, although no one would argue that a subset of disabled students might be functioning below average, it is difficult to believe that almost half of these postsecondary students would fall in this IQ range given that they had graduated from high school with marks high enough to qualify for acceptance into bona fide postsecondary programs. Whatever the cause, our data suggest that one must question both the representativeness of the Canadian normative sample in the younger age ranges and the accuracy of the scores derived when these norms are applied.”

And finally they conclude with a recommendation not to use the Canadian norms for Canadians because this results in lower IQs:

Overall, our findings suggest a need to examine more carefully the accuracy and applicability of the WAIS-IV Canadian norms when interpreting raw test data obtained from Canadian adults. Using these norms appears to increase the number of young adults identified as intellectually impaired and could decrease the number who qualify for gifted programming or a diagnosis of LD. Until more research is conducted, we strongly recommend that clinicians not use Canadian norms to determine intellectual impairment or disability status. Converting raw scores into Canadian standard scores, as opposed to using American norms, systematically lowers the scores of postsecondary students below the age of 35, as the drop in FSIQ was higher for this group than for older adults. Although we cannot know which derived scores most accurately reflect the intellectual abilities of young Canadian adults, it certainly seems implausible that almost half of postsecondary students have FSIQ scores below the 16th percentile, calling into question the accuracy of all other derived WAIS-IV Canadian scores in the classification of cognitive abilities.

Are you still wondering what it going on?

Populations with different mean IQs and cut-offs

Harrison et al seem to have inadvertently almost rediscovered the fact that Canadians are smarter than Americans. They don’t quite make it to this point even when faced with obvious and strong evidence (multiple standardization samples). They somehow don’t realize that using the norms from these standardization samples will reproduce the differences found in those samples, and won’t really report anything new.

Their numerous differences in percents reaching this or that cut-off are largely or entirely explained by simple statistics. They have two populations which have an IQ difference of 7.4 points (95.5 – 88.1 from Table 1) or 8.1 points (15 * .54 d from Table 1). Now, if we plot these (I used a difference of 7.5 IQ) and choose some arbitrary cut-offs, like those between arbitrarily chosen intervals, we see something like this:


Except that I cheated and chose all the cut-offs. The brown and the green lines are the ratios between the densities (read off the second y-axis). We see that around 100 they are generally low, but as we get further from the means, they get a lot larger. This simple fact is not generally appreciated. It’s not a new problem: Arthur Jensen spent much of a chapter in his behemoth 1980 book on the topic. He quotes, for instance:

“In the construction trades, new apprentices were 87 percent white and 13 percent black. [Blacks constitute 12 percent of the U.S. population.] For the Federal Civil Service, of those employees above the GS-5 level, 88.5 percent were white, 8.3 percent black, and women account for 30.1 of all civil servants. Finally, a 1969 survey of college teaching positions showed whites with 96.3 percent of all positions. Blacks had 2.2 percent, and women accounted for 19.1 percent. (U.S. Commission on Civil Rights, 1973)”

Sounds familiar? Razib Khan has also written about it. Now, let’s go back to one of the quotes:

“Using American norms, 13.1% had an IQ of 80 or less, and 4.2% had an IQ of 75 or less. By contrast, when using Canadian norms, 32.3% had an IQ of 80 or less, and 21.2% had an IQ of 75 or less. Most notably, only 0.7% (2 individuals) obtained a FSIQ of 70 or less using American norms, whereas 9.7% had IQ scores this low when Canadian norms were used. At the other end of the spectrum, 1.4% of the students had FSIQ scores of 130 or more (gifted) when American norms were used, whereas only 0.3% were this high using Canadian norms.”

We can put these in a table and calculate the ratios:

IQ threshold Percent US Percent CAN US/CAN CAN/US
130 1.4 0.3 4.67 0.21
80 13.1 32.3 0.41 2.47
75 4.2 21.2 0.20 5.05
70 0.7 9.7 0.07 13.86


And we can also calculate the expected values based on the two populations (with means of 95.5 and 88) above:

IQ threshold Percent US Percent CAN US/CAN CAN/US
130 1.07 0.26 4.12 0.24
80 15.07 29.69 0.51 1.97
75 8.59 19.31 0.44 2.25
70 4.46 11.51 0.39 2.58
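These expected values are just normal-distribution tail areas and can be reproduced with pnorm():

us = c(1 - pnorm(130, 95.5, 15), pnorm(c(80, 75, 70), 95.5, 15)) * 100 #US percentages
can = c(1 - pnorm(130, 88, 15), pnorm(c(80, 75, 70), 88, 15)) * 100 #Canadian percentages
round(data.frame(threshold = c(130, 80, 75, 70), us, can, us_can = us/can, can_us = can/us), 2)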


This is fairly close right? The only outlier (in italic) is the much lower than expected value for <70 IQ using US norms, perhaps a sampling error. But overall, this is a pretty good fit to the data. Perhaps we have our explanation.

What about those (mis)classification values in their Table 2? Well, for similar reasons that I won’t explain in detail, these are simply a function of the difference between the groups in that variable, e.g. Cohen’s d. In fact, if we correlate the d vector and the “% within same classification” we get a correlation of -.95 (-.96 using rank-orders).

MCV analysis

Incidentally, the d values reported in their Table 1 are useful for applying the method of correlated vectors. In a previous study comparing US and Canadian IQ data, Dutton and Lynn (2014) compared WAIS-IV standardization data. They found a mean difference of .31 d, or 4.65 IQ, which was reduced to 2.1 IQ when the samples were matched on education, ethnicity and sex. An interesting thing was that the difference between the countries was largest on the most g-loaded subtests. When this happens, it is called a Jensen effect (or one says it has a positive Jensen coefficient, Kirkegaard 2014). The value in their study was .83, which is on the high side (see e.g. te Nijenhuis et al, 2015).

I used the same loadings as used in their study (McFarland, 2013), and found a correlation of .24 (.35 with rank-order), substantially weaker.

Supplementary material

The R code and data files can be found in the Open Science Framework repository.


  • Harrison, A. G., Holmes, A., Silvestri, R., Armstrong, I. T. (2015). Implications for Educational Classification and Psychological Diagnoses Using the Wechsler Adult Intelligence Scale–Fourth Edition With Canadian Versus American Norms. Journal of Psychoeducational Assessment. 1-13.
  • Jensen, A. R. (1980). Bias in Mental Testing.
  • Kirkegaard, E. O. (2014). The personal Jensen coefficient does not predict grades beyond its association with g. Open Differential Psychology.
  • McFarland, D. (2013). Model individual subtests of the WAIS IV with multiple latent factors. PLoS ONE, 8(9): e74980. doi:10.1371/journal.pone.0074980
  • te Nijenhuis, J., van den Hoek, M., & Armstrong, E. L. (2015). Spearman’s hypothesis and Amerindians: A meta-analysis. Intelligence, 50, 87-92.


Researcher choice in the reporting of regression models allows for questionable research practices. Here I present a function for R that reports all possible regression models given a set of predictors and a dependent variable. I illustrate this function on two datasets of artificial data.


In an email to L.J Zigerell I wrote:

I had a look at your new paper here: rap.sagepub.com/content/2/1/2053168015570996 [Zigerell, 2015]

Generally, I agree about the problem. Pre-registration or always reporting all comparisons are the obvious solutions. Personally, I will try to pre-register all my survey-type studies from now on. The second is problematic in that there are sometimes quite a lot of ways to test a given hypothesis or estimate an effect with a dataset, your paper mentions a few. In many cases, it may not be easy for the researcher to report all of them by doing the analyses manually. I see two solutions: 1) tools for making automatic comparisons of all test methods, and 2) sampling of test methods. The first is preferable if it can be done, and when it cannot, one can fall back to the second. This is not a complete fix because there may be ways to estimate an effect using a dataset that the researcher did not even think of. If the dataset is not open, there is no way for others to conduct such tests. Open data is necessary.

I have one idea for how to make an automatic comparison of all possible ways to analyze data. In your working paper, you report only two regression models (Table 1). One controlling for 3 and one for 6 variables in MR. However, given the choice of these 6 variables, there are 2^6-1 (63) ways to run the MR analysis (the 64th is the empty model, I guess which could be used to estimate the intercept but nothing else). You could have tried all of them, and reported only the ones that gave the result you wanted (in line with the argument in your paper above). I’m not saying you did this of course, but it is possible. There is researcher degree of freedom about which models to report. There is a reason for this too which is that normally people run these models manually and running all 63 models would take a while doing manually (hard coding them or point and click), and they would also take up a lot of space to report if done via the usual table format.

You can perhaps see where I’m going with this. One can make a function that takes as input the dependent variable, the set of independent variables and the dataset, and then returns all the results for all possible ways to include these into MRs. One can then calculate descriptive statistics (mean, median, range, SD) for the effect sizes (std. betas) of the different variables to see how stable they are depending on which other variables are included. This would be a great way to combat the problem of which models to report when using MR, I think. Better, one can plot the results in a visually informative and appealing way.

With a nice interactive figure, one can also make it possible for users to try all of them, or see results for only specific groups of models.

I have now written the code for this. I tested it with two cases, a simple and a complicated one.
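To give an idea of how such a function can work, here is a minimal sketch (not the actual function, which is linked at the end of this post); it returns the standardized betas of every predictor in every possible model:

library(pacman)
p_load(QuantPsyc)
all_models = function(df, dep, preds) {
  #all non-empty subsets of the predictors
  combos = unlist(lapply(seq_along(preds), function(m) combn(preds, m, simplify = FALSE)), recursive = FALSE)
  betas = matrix(NA, nrow = length(combos), ncol = length(preds), dimnames = list(NULL, preds))
  for (i in seq_along(combos)) {
    fit = lm(reformulate(combos[[i]], dep), data = df) #fit this subset of predictors
    betas[i, combos[[i]]] = lm.beta(fit)[combos[[i]]] #std. betas for the included predictors
  }
  as.data.frame(betas)
}
#e.g. all_models(df, "y", c("a", "b", "c", "d", "e", "f")) returns a 63 x 6 beta matrix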

Simple case

In the simple case, we have three variables:

a = (normally distributed) noise
b = noise
y = a+b

Then I standardized the data so that betas from regressions are standardized betas. Correlation matrix:


The small departure from expected values is sampling error (n=1000 in these simulations). The beta matrix is:


We see the expected results. The correlations alone are the same as their betas together because the predictors are nearly uncorrelated (r=.01) and correlate with y at about the same level (r=.70 and .72). The correlations and betas are around .71 because math: each predictor contributes half of y’s variance, so r = √.5 ≈ .71.

Complicated case

In this case we have 6 predictors, some of which are correlated. The data are generated as follows (a small R sketch of this generation is given after the list):

a = noise
b = noise
c = noise
d = c+noise
e = a+b+noise
f = .3*a+.7*c+noise
y = .2a+.3b+.5c
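A small sketch of how one could generate such data (my own reconstruction from the spec above; the noise terms are standard normals and n=1000 as in the simple case):

n = 1000
a = rnorm(n); b = rnorm(n); c = rnorm(n) #independent noise variables
d = c + rnorm(n)
e = a + b + rnorm(n)
f = .3*a + .7*c + rnorm(n)
y = .2*a + .3*b + .5*c
df = as.data.frame(scale(data.frame(a, b, c, d, e, f, y))) #standardize so betas are std. betas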

Correlation matrix:


And the beta matrix is:


So we see that the full model (model 63) finds that the betas are the same as the correlations while the other second-order variables get betas of 0 altho they have positive correlations with y. In other words, MR is telling us that these variables have zero effect when taking into account a, b and c, and each other — which is true.

By inspecting the matrix, we can see how one can be an unscrupulous researcher by exploiting researcher freedom (Simmons, 2011). If one likes the variable d, one can try all these models (or just a few of them manually), and then selectively report the ones that give strong betas. In this case model 60 looks like a good choice, since it controls for a lot yet still produces a strong beta for the favored variable. Or perhaps choose model 45, in which beta=.59. Then one can plausibly write in the discussion section something like this:

Prior research and initial correlations indicated that d may be a potent explanatory factor of y, r=.56 [significant test statistic]. After controlling for a, b and e, the effect was mostly unchanged, beta=.59 [significant test statistic]. However, after controlling for f as well, the effect was somewhat attenuated but still substantial, beta=.49 [significant test statistic]. The findings thus support theory T about the importance of d in understanding y.

One can play the game the other way too. If one dislikes a, one might focus on the models that find a weak effect of a, such as 31, where the beta is -.04 [non-significant test statistic] when controlling for e and f.

It can also be useful to examine the descriptive statistics of the beta matrix:


The range and sd (standard deviation) are useful as measures of how much the effect size of a variable varies from model to model. We see that among the true causes (a, b, c) the sd and range are smaller than among the non-causal correlates (d, e, f). Among the true causes, the weaker causes have larger ranges and sds. Perhaps one can find a way to adjust for this to get an effect-size-independent measure of how much the beta varies from model to model. The mad (median absolute deviation, a robust alternative to the sd) looks like a very promising candidate for detecting the true causal variables. It is very low (.01, .01 and .03) for the true causes, and at least 4.67 times larger for the non-causal correlates (.03 and .14).
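A sketch of how to get such descriptives from a beta matrix like the one returned by the sketch above (beta_mat is a placeholder name):

data.frame(mean = colMeans(beta_mat, na.rm = TRUE),
           median = apply(beta_mat, 2, median, na.rm = TRUE),
           range = apply(beta_mat, 2, function(x) diff(range(x, na.rm = TRUE))),
           sd = apply(beta_mat, 2, sd, na.rm = TRUE),
           mad = apply(beta_mat, 2, mad, na.rm = TRUE)) #mad uses R's default scaling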

In any case, I hope to have shown how researchers can use freedom of choice in model selection and reporting to inflate or deflate the effect sizes of variables they like/dislike. There are two ways to deal with this. Researchers must report the betas from all models with all available variables in a study (this table can get large, because there are 2^n-1 models, where n is the number of predictors), e.g. using my function or an equivalent one; or better, the data must be made available for reanalysis by others.

Source code

The function is available from the psych2 repository at Github. The R code for this paper is available on the Open Science Framework repository.


Simmons JP,  Nelson LD,  Simonsohn U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science 22(11): 1359–1366.

Zigerell, L. J. (2015). Inferential selection bias in a study of racial bias: Revisiting ‘Working twice as hard to get half as far’. Research & Politics, 2(1), 2053168015570996.