Two methods are presented that allow for identification of mixed cases in the extraction of general factors. Simulated data is used to illustrate them.

General factors can be extracted from datasets where all or nearly all of the variables are correlated. At the case level, such general factors are decreased in size if mixed cases are present. A mixed case is an ‘inconsistent’ case according to the factor structure of the data.

A simple way of illustrating what I’m talking about is using the matrixplot() function from the VIM package for R (Templ, Alfons, Kowarik, & Prantner, 2015) with some simulated data.

For simulated dataset 1, start by imagining that we are measuring a general factor and that all our indicator variables have a positive loading on this general factor, but that this loading varies in strength. Furthermore, there is no measurement error and there is only one factor in the data (no group factors, i.e. no hierarchical or bi-factor structure; Jensen & Weng, 1994). I have used datasets with 50 cases and 25 variables to avoid the excessive sampling error of small samples and to keep a realistic number of cases compared to the datasets examined in S factor studies (e.g. Kirkegaard, 2015). The matrix plot is shown in Figure 1.

Figure 1: Matrix plot of dataset 1

No real data looks like this, but it is important to understand what to look for. Every indicator is on the x-axis and the cases are on the y-axis. The cases are colored by their relative values, where darker means higher values. So in this dataset we see that any case that does well on any particular indicator does just as well on every other indicator. All the indicators have the same factor loading of 1, and the proportion of variance explained is also 1 (100%), so there is little point in showing the loadings plot.

To move towards realism, we need to complicate this simulation in some way. The first way is to introduce some measurement error. The amount of error introduced determines the factor loadings and hence the size of the general factor. In dataset 2, the error amount is .5, and the signal multiplier varies from .05 to .95 all of which are equally likely (uniform distribution). The matrix and the loadings plots are shown in Figures 2 and 3.
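A quick sketch of how such a dataset could be generated (Python for illustration; the post's own R simulation code is in the supplementary material, and the variable names here are mine):

```python
import random

random.seed(1)
n_cases, n_vars = 50, 25
error_amount = 0.5

# one general factor: each case has a single true score
factor = [random.gauss(0, 1) for _ in range(n_cases)]

# each indicator gets a signal multiplier drawn uniformly from .05 to .95
signal = [random.uniform(0.05, 0.95) for _ in range(n_vars)]

# indicator value = signal multiplier * factor score + error
data = [[signal[j] * factor[i] + error_amount * random.gauss(0, 1)
         for j in range(n_vars)] for i in range(n_cases)]
```

Stronger signal multipliers translate directly into higher factor loadings for those indicators.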

Figure 2: Matrix plot for dataset 2


Figure 3: Loadings plot for dataset 2

By looking at the matrix plot we can still see a fairly simple structure. Some cases are generally darker or lighter than others, but there is also a lot of noise, which is of course the error we introduced. The loadings show quite a bit of variation. The size of this general factor is .45.

The next complication is to introduce the possibility of negative loadings (these are consistent with a general factor, as long as they load in the right direction; Kirkegaard, 2014a). For simplicity, we go back to the case of no measurement error. Figures 4 and 5 show the matrix and loadings plots.

Figure 4: Matrix plot for dataset 3

Figure 5: Loadings plot for dataset 3

The matrix plot looks odd, until we realize that some of the indicators are simply reversed. The loadings plot shows this reversal. One could easily get back to a matrix plot like that in Figure 1 by reversing all indicators with a negative loading (i.e. multiplying by -1). However, the possibility of negative loadings does increase the complexity of the matrix plots.
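Reversing the negatively loaded indicators is straightforward in practice; a minimal sketch (Python for illustration, with hypothetical variable names):

```python
def reverse_negative(data, loadings):
    """Multiply every indicator (column) with a negative loading by -1."""
    return [[-x if loadings[j] < 0 else x for j, x in enumerate(row)]
            for row in data]

# toy example: two indicators, the second with a negative loading
data = [[1.0, -2.0], [-0.5, 0.5]]
loadings = [0.8, -0.6]
print(reverse_negative(data, loadings))  # second column is flipped in sign
```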

For the 4th dataset, we begin with dataset 2 and create a mixed case. We do this by setting its value on every indicator to 2, a strongly positive value (98th centile given a standard normal distribution). Figure 6 shows the matrix plot. I won’t bother with the loadings plot because it is not strongly affected by a single mixed case.

Figure 6: Matrix plot for dataset 4

Can you guess which case it is? Perhaps not. It is #50 (the top line). One might expect it to be the same hue all the way across. This however ignores the fact that the values on the different indicators vary due to sampling error. So a value of 2 is not necessarily at the same centile, or equally far from the mean in standard units, on every indicator, but it is fairly close, which is why the color is very dark across all indicators.

For datasets with general factors, the highest value of a case tends to be on the most strongly loaded indicator (Kirkegaard, 2014b), but this information is not easy to use in an eye-balling of the dataset. Thus, it is not so easy to identify the mixed case.

Now we complicate things further by adding the possibility of negative loadings. This gets us data roughly similar to that found in S factor analysis (there are still no true group factors in the data). Figure 7 shows the matrix plot.

Figure 7: Matrix plot for dataset 5

Just looking at the dataset, it is fairly difficult to detect the general factor, but in fact the variance explained is .38. The mixed case is easy to spot now (#50), since it is the only case that is consistently dark across indicators, which is odd given that some of them have negative loadings. It ‘shouldn’t’ happen. The case’s mixedness is however somewhat extreme.

Automatic detection

Eye-balling figures and data is a powerful tool for quick analysis, but it cannot give precise numerical values used for comparison between cases. To get around this I developed two methods for automatic identification of mixed cases.

Method 1
A general factor only exists when multidimensional data can be usefully compressed, informationally speaking, to 1-dimensional data (factor scores on the general factor). I encourage readers to consult the very well-made visualization of principal component analysis (almost the same as factor analysis) at this website. In this framework, mixed cases are those that are not well described or predicted by a single score.

Thus, it seems to me that we can use this information as a measure of the mixedness of a case. The method is:

  1. Extract the general factor.
  2. Extract the case-level scores.
  3. For each indicator, regress it onto the factor scores. Save the residuals.
  4. Calculate a suitable summary metric, such as the mean absolute residual and rank the cases.
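The steps above can be sketched as follows (Python for illustration; the function names are mine, not the psych2 implementation, and the factor scores are assumed to be already extracted):

```python
def ols_fit(x, y):
    """Simple least-squares regression of y on x: returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def mixedness(data, scores):
    """Mean absolute residual per case after regressing every
    indicator onto the factor scores (steps 3-4)."""
    n_cases, n_vars = len(data), len(data[0])
    result = [0.0] * n_cases
    for j in range(n_vars):
        col = [data[i][j] for i in range(n_cases)]
        a, b = ols_fit(scores, col)
        for i in range(n_cases):
            result[i] += abs(col[i] - (a + b * scores[i]))
    return [r / n_vars for r in result]
```

Ranking the cases by the returned values puts the most mixed case at the top.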

Using this method on dataset 5 in fact does identify case 50 as the most mixed one. Mixedness varies between cases due to sampling error. Figure 8 shows the histogram.

Figure 8: Histogram of absolute mean residuals from dataset 5

The outlier on the right is case #50.

How extreme does a mixed case need to be for this method to find it? We can try reducing its mixedness by assigning it less extreme values. Table 1 shows the effects of doing this.


Table 1: Mean absolute residual and mixedness (columns: mixedness values; mean absolute residual)

So we see that at mixedness values of 2 and 1.5, the case is clearly distinguishable from the rest, but 1 is about the limit, since the second-highest value is .80. Below this, the other cases are similarly mixed, just due to the randomness introduced by measurement error.

Method 2
Since mixed cases are poorly described by a single score, they don’t fit well with the factor structure in the data. Generally, this should result in the proportion of variance increasing when they are removed. Thus the method is:

  1. Extract the general factor from the complete dataset.
  2. For every case, create a subset of the dataset where this case is removed.
  3. Extract the general factors from each subset.
  4. For each analysis, extract the proportion of variance explained and calculate the difference to that using the full dataset.
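The steps above can be sketched as follows (Python for illustration; as a stand-in for the factor analysis used in the post, it takes the largest eigenvalue of the correlation matrix, found by power iteration, over the number of variables as the proportion of variance explained):

```python
import math

def corr_matrix(data):
    """Pearson correlation matrix of a cases-by-variables dataset."""
    n, p = len(data), len(data[0])
    cols = [[row[j] for row in data] for j in range(p)]
    means = [sum(c) / n for c in cols]
    dev = [[x - m for x in c] for c, m in zip(cols, means)]
    ss = [math.sqrt(sum(d * d for d in col)) for col in dev]
    return [[sum(a * b for a, b in zip(dev[i], dev[j])) / (ss[i] * ss[j])
             for j in range(p)] for i in range(p)]

def prop_var_first(data, iters=500):
    """Proportion of variance on the first component: largest eigenvalue
    of the correlation matrix (power iteration) divided by p."""
    R = corr_matrix(data)
    p = len(R)
    v = [1.0 / (i + 1) for i in range(p)]  # asymmetric start vector
    for _ in range(iters):
        w = [sum(R[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(v[i] * sum(R[i][j] * v[j] for j in range(p)) for i in range(p))
    return lam / p

def mixedness_loo(data):
    """Steps 2-4: improvement in proportion of variance when each case
    is left out, relative to the full dataset."""
    base = prop_var_first(data)
    return [prop_var_first(data[:i] + data[i + 1:]) - base
            for i in range(len(data))]
```

The case whose removal improves the proportion of variance the most is the most mixed one.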

Using this method on the dataset also used above correctly identifies the mixed case. The histogram of results is shown in Figure 9.

Figure 9: Histogram of differences in proportion of variance to the full analysis

As with method 1, we redo this analysis for the other levels of mixedness. Results are shown in Table 2.


Table 2: Improvement in proportion of variance and mixedness (columns: mixedness values; improvement in proportion of variance)

We see the same as before: both 2 and 1.5 are clearly identifiable as outliers in mixedness, while 1 is not, since the next-highest value is .45.

Large scale simulation with the above methods could be used to establish distributions to generate confidence intervals from.

It should be noted that the improvement in proportion of variance is not independent of the number of cases (more cases means that a single case is less important, and non-linearly so), so the value cannot simply be used to compare across datasets of different sizes without correcting for this problem. Correcting it is however beyond the scope of this article.

Comparison of methods

The results from both methods should have some positive relationship. The scatter plot is shown in Figure 10.

Figure 10: Scatter plot of method 1 and 2

We see that the truly mixed case is a strong outlier with both methods — which is good because it really is a strong outlier. The correlation is strongly inflated because of this: r = .70 with it included, but only .26 without. The relative lack of a positive relationship without the true outlier in mixedness is perhaps due to restriction of range in mixedness in the dataset, since the only mixedness aside from case 50’s is that due to measurement error. Whatever the exact interpretation, I suspect it doesn’t matter, since the goal is to find the true outliers in mixedness, not to agree on the relative ranks of the cases with relatively little mixedness.1

I have implemented both above methods in R. They can be found in my unofficial psych2 collection of useful functions located here.

Supplementary material
Source code and figures are available at the Open Science Framework repository.


Jensen, A. R., & Weng, L.-J. (1994). What is a good g? Intelligence, 18(3), 231–258. doi.org/10.1016/0160-2896(94)90029-9

Kirkegaard, E. O. W. (2014a). The international general socioeconomic factor: Factor analyzing international rankings. Open Differential Psychology. Retrieved from openpsych.net/ODP/2014/09/the-international-general-socioeconomic-factor-factor-analyzing-international-rankings/

Kirkegaard, E. O. W. (2014b). The personal Jensen coefficient does not predict grades beyond its association with g. Open Differential Psychology. Retrieved from openpsych.net/ODP/2014/10/the-personal-jensen-coefficient-does-not-predict-grades-beyond-its-association-with-g/

Kirkegaard, E. O. W. (2015). Examining the S factor in US states. The Winnower. Retrieved from thewinnower.com/papers/examining-the-s-factor-in-us-states

Templ, M., Alfons, A., Kowarik, A., & Prantner, B. (2015, February 19). VIM: Visualization and Imputation of Missing Values. CRAN. Retrieved from cran.r-project.org/web/packages/VIM/index.html


1 Concerning the agreement about rank-order, it is about .4 both with and without case 50. But this is based on a single simulation and I’ve seen some different values when re-running it. A large scale simulation is necessary.


Blog commenter Lion of the Judah-sphere has claimed that the SAT does not correlate as well with comprehensive IQ tests as said IQ tests correlate with one another. At first I assumed he was wrong, but my recent analysis suggesting Harvard undergrads have an average Wechsler IQ of 122, really makes me wonder.

While an IQ of 122 (white norms) is 25 points above the U.S. mean of 97, it seems very low for a group of people who averaged 1490 out of 1600 on the SAT. According to my formula, since 1995 a score of 1490 on the SAT equated to an IQ of 141. But my formula was based on modern U.S. norms; because demographic changes have made the U.S. mean IQ 3 points below the white American mean (and made the U.S. standard deviation 3.4 percent larger than the white SD), converting to white norms reduces Harvard’s SAT IQ equivalent to 139.

In general, research correlating the SAT with IQ has been inconsistent, with correlations ranging from 0.4 to 0.8. I think much depends on the sample. Among people who took similar courses in similar high schools, the SAT is probably an excellent measure of IQ. But considering the wide range of schools and courses American teenagers have experienced, the SAT is not, in my judgement, a valid measure of IQ. Nor should it be. Universities should not be selecting students based on biological ability, but rather on acquired academic skills.

The lower values are due to restriction of range, e.g. Frey and Detterman (2004). When corrected, the value goes up to .7-.8 range. Also .54 using ICAR60 (Condon and Revelle, 2014) without correction for reliability or restriction.

As for the post, I can think of few things:

1. The sample recruited is likely not representative of Harvard. Probably mostly social sci/humanities students, who have lower scores.

2. Regression towards the mean means that the Harvard student body won’t be as exceptional on their second measurement as on their first. This is because some of the reason they were so high was just good luck.

3. The SAT is teachable to some extent, and the training won’t transfer well to other tests. This reduces the correlation between the SAT and other GCA tests.

4. Harvard uses affirmative action which lowers the mean SAT of the students a bit. It seems to be about 1500.

The SAT has an approximate mean of 500 per subtest, a ceiling of 800, and an SD of approximately 100. So a 1500 total score is 750 + 750 = (500 + 250) + (500 + 250) = (500 + 2.5 SD) + (500 + 2.5 SD), i.e. about 2.5 SD above the mean. Test-retest reliability/stability over a few months is around .86 (mean of the values here: research.collegeboard.org/sites/default/files/publications/2012/7/researchreport-1982-7-test-disclosure-retest-performance-sat.pdf, n≈3700).

The interesting question is how much regression towards the mean we can expect. I decided to investigate using simulated data. The idea is basically that we first generate some true scores (per classical test theory), and then make two measurements of them. Then, using the first measurement, we make a selection that has a mean 2.5 SD above the mean, and then we check how well this group performs on the second testing.

In R, the way we do this, is to simulate some randomly distributed data, and then create new variables that are a sum of true score and error for the measurements. This presents us with a problem.

How much error to add?

We can solve this question either by trying some values, or analytically. Analytically, it is like this:

cor(test1, test2) = cor(test1, true score) × cor(test2, true score)

The correlation between the two testings is due to their common association with the true scores. To simulate a testing, we need the correlation between testing and true score. Since the test-retest correlation is .863, we take the square root and get .929. The squared value of a correlation is the amount of variance it explains, so squaring .929 gets us back to where we were. Since the total variance is 1, the remaining (error) variance is 1 − .863 = .137, and its square root gives the corresponding correlation, .370. Since it is the correlations we need as weights when adding, we weight the true score by .929 and the error by .370 to get measurements with a test-retest correlation of .863.
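The arithmetic can be verified directly: writing each test as a·true + b·error with a² + b² = 1, the implied test-retest correlation is a². A quick check (Python for illustration):

```python
import math

retest = 0.863
a = math.sqrt(retest)      # weight on the true score, ~.929
b = math.sqrt(1 - retest)  # weight on the error term, ~.370

# each simulated test is a*true + b*error, so its variance is a^2 + b^2 = 1
# and the correlation between two such tests is a^2 = .863
print(round(a, 3), round(b, 3))
print(round(a * a, 3))
```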

One could also just try some values until one gets something that is about right. In this case, just weighing the true score by 1 and error by .4 produces nearly the same result. Trying out a few values like this is faster than doing the math. In fact, I initially tried out some values to hit the approximate value but then decided to solve it analytically as well.

How much selection to make?

This question is more difficult analytically. I have asked some math and physics people and they could not solve it analytically. The info we have is the mean value of the selected group, which is about 2.5, relative to a standard normal population. Assuming we make use of top-down selection (i.e. everybody above a threshold gets selected, no one below), where must we place our threshold to get a mean of 2.5? It is not immediately obvious. I solved it by trying some values and calculating the mean trait value in the selected group. It turns out that to get a selected group with a mean of 2.5, the threshold must be 2.14.
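The trial-and-error search can also be automated. For top-down selection from a standard normal, the mean of the selected group equals the inverse Mills ratio, φ(t)/(1 − Φ(t)), which rises smoothly with the threshold t, so one can simply bisect on t (Python for illustration):

```python
import math

def phi(x):   # standard normal density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):   # standard normal CDF
    return (1 + math.erf(x / math.sqrt(2))) / 2

def selected_mean(t):
    """Mean of the cases above threshold t under a standard normal."""
    return phi(t) / (1 - Phi(t))

def threshold_for_mean(target, lo=-5.0, hi=5.0):
    """Bisection: find t so the selected group's mean equals target."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if selected_mean(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(threshold_for_mean(2.5), 2))  # about 2.14
```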

Since this group is selected for having positive errors and positive true scores, their true scores and second-testing scores will be lower. How much lower? 2.32 and 2.15 according to my simulation. I’m not sure why the second measurement scores are lower than the true scores; perhaps someone more learned in statistics can tell me.

So there is still some way down to an IQ of 122 from 132.3 (2.15*15+100). However, we may note that they used a shorter WAIS-R form, which correlates .91 with the full-scale. Factoring this in reduces our estimate to 129.4. Given the selectivity noted above, this is not so unrealistic.

Also, the result can apparently be reached simply by 2.5*.86. I was aware that this might work, but wasn’t sure so I tested it (the purpose of this post). One of the wonders of statistical software like this is that one can do empirical mathematics. :)

R code


#load libs (filter is from dplyr, describe from psych)
library(dplyr)
library(psych)

#size and true score
n.size = 1e6
true.score = rnorm(n.size)

#add error to true scores
test.1 = .929*true.score + .370*rnorm(n.size)
test.2 = .929*true.score + .370*rnorm(n.size)
SAT = data.frame(true.score, test.1, test.2)

#verify the test-retest correlation is about .863
cor(SAT)["test.1", "test.2"]

#select a subsample
selected = filter(SAT, test.1>2.14) #selected sample
describe(selected) #desc. stats


I ran into trouble trying to remove the effects of one variable on other variables while writing my reanalysis of the Noble et al (2015) paper. It was not completely obvious which exact method to use.

The authors used OLS, aka standard linear regression, in their own paper. However, since I wanted to plot the results at the case level, this was not useful to me. Before doing this I did not really understand how partial correlations worked, or what semi-partial correlations were, or how multiple regression works exactly. I still don’t understand the last, but the others I will explain with example results.

Generated data

Since we are cool, we use R.

We begin by loading some stuff I need and generating some data:

#load libs
library(pacman) #install this first if u dont have it
p_load(Hmisc, psych, ppcor, MASS, QuantPsyc, devtools)

#generate data
df = as.data.frame(mvrnorm(200000, mu = c(0,0), Sigma = matrix(c(1,0.50,0.50,1), ncol = 2), empirical=TRUE))
cor(df)[1,2] #verify the chosen correlation
[1] 0.5

The correlation from this is exactly .500000, as chosen above.

Then we add a gender specific effect:

df$gender = as.factor(c(rep("M",100000),rep("F",100000)))
 df[1:100000,"V2"] = df[1:100000,"V2"]+1 #add 1 to males
cor(df[1:2])[1,2] #now there's error due to the effect of maleness!

[1] 0.4463193

The correlation has been reduced a bit as expected. It is time to try and undo this and detect the real correlation of .5.

#multiple regression
 model = lm(V1 ~ V2+gender, df) #standard linear model
 coef(model)[-1] #unstd. betas
 lm.beta(model) #std. betas
V2        genderM 
0.4999994 -0.5039354
V2        genderM 
0.5589474 -0.2519683 

The raw beta coefficient is spot on, but the standardized one is not at all. This certainly makes interpretation more difficult when the std. beta does not correspond in size with the correlations.

#split samples
 df.genders = split(df, df$gender) #split by gender
 cor(df.genders$M[1:2])[1,2] #males
 cor(df.genders$F[1:2])[1,2] #females
[1] 0.4987116
[1] 0.5012883

These are both approximately correct. Note however that these are not the values for the total sample as before, but for each subsample. If the subsamples differ in their correlations, it will show up here clearly.

#partial correlation
df$gender = as.numeric(df$gender) #convert to numeric
partial.r(df, c(1:2),3)[1,2]
[1] 0.5000005

This is using an already made function for doing partial correlations. Partial correlations are calculated from the residuals of both variables. This means that they correlate whatever remains of the variables after everything that could be predicted from the controlling variable is removed.

#residualize both
df.r2 = residuals.DF(df, "gender")
cor(df.r2[1:2])[1,2]
[1] 0.5000005

This should be the exact same as the above, but done manually. And it is.
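The same value can also be recovered from the textbook formula for a first-order partial correlation, using the three pairwise correlations from this simulation (r(V1,V2) = .4463, r(V1,gender) = −.0021, r(V2,gender) ≈ .4472; Python for illustration):

```python
import math

def partial_r(r12, r13, r23):
    """First-order partial correlation of variables 1 and 2, controlling 3."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# approximate correlations taken from this simulation's output
print(round(partial_r(0.4463, -0.0021, 0.4472), 3))  # close to .5
```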

spcor(df)$estimate #hard to interpet output
spcor.test(df$V1, df$V2, df$gender)[1] #partial out gender from V2
spcor.test(df$V2, df$V1, df$gender)[1] #partial out gender from V1
V1 V2 gender
V1 1.0000000 0.4999994 -0.2253951
V2 0.4472691 1.0000000 0.4479416
gender -0.2253103 0.5005629 1.0000000
1 0.4999994
1 0.4472691

Semi-partial correlations, aka part correlations, are the same as above, except that only one of the two variables is residualized by the controlling variable. The above uses two different premade functions for calculating these. The first works on an entire data.frame and outputs semi-partials for all variables controlling for all other variables. The output is a bit tricky to read, however, and there is no explanation in the help: one has to read it row by row, where each row is the original variable correlated with the residualized version of the variable in each column.
The two calls below use the other function, where one has to specify which two variables to correlate and which variables to control for. The second variable called is the one that gets residualized by the control variables.
We see that the results are as they should be. Controlling V1 for gender has about zero effect, because gender has no effect on this variable aside from a very small chance effect (r=-0.00212261). Controlling V2 for gender has the desired effect of returning a value very close to .5, as it should.

#residualize only V2
df.r1.V1 = df.r2 #copy above
df.r1.V1$V1 = df$V1 #fetch orig V1
cor(df.r1.V1[1:2])[1,2]
[1] 0.4999994

#residualize only V1
df.r1.V2 = df.r2 #copy above
df.r1.V2$V2 = df$V2 #fetch orig V2
cor(df.r1.V2[1:2])[1,2]
[1] 0.4472691

These are the two manual ways of doing the same as above. We get the exact same results, so that is good.

So where does this lead us? Well, apparently using multiple regression to control for variables is a bad idea, since it produces difficult-to-interpret results.


In this commentary I explain how mean differences between normal distributions give rise to different percentages of the populations being above or below a given threshold, depending on where the threshold is.


“Research uncovers flawed IQ scoring system” is the headline on phys.org, which often posts news about research from other fields. It concerns a study by Harrison et al (2015). The researchers have allegedly “uncovered anomalies and issues with the Wechsler Adult Intelligence Scale-Fourth Edition (WAIS-IV), one of the most widely used intelligence tests in the world”. An important discovery, if true. Let’s hear it from the lead researcher:

“Looking at the normal distribution of scores, you’d expect that only about five per cent of the population should get an IQ score of 75 or less,” says Dr. Harrison. “However, while this was true when we scored their tests using the American norms, our findings showed that 21 per cent of college and university students in our sample had an IQ score this low when Canadian norms were used for scoring.”

How can it be? To learn more, we delve into the actual paper titled: Implications for Educational Classification and Psychological Diagnoses Using the Wechsler Adult Intelligence Scale–Fourth Edition With Canadian Versus American Norms.

The paper

First they summarize a few earlier studies on Canada and the US. The Canadians obtained higher raw scores. Of course, this was hypothesized to be due to differences in ethnicity and educational achievement factors. However, this did not quite work out, so Harrison et al decided to investigate it more (they had already done so in 2014). Their method consists of taking the scores from a large mixed sample consisting of healthy people — i.e. with no diagnosis, 11% — and people with various mental disorders (e.g. 53.5% with ADHD), and then scoring this group on both the American and the Canadian norms. What did they find?

Blast! The results were similar to the results from the previous standardization studies! What happened? To find out, Harrison et al do a thorough examination of various subgroups in various ways. No matter which age group they compare, the result won’t go away. They also report the means and Cohen’s d for each subtest and aggregate measure — very helpful. I reproduce their Table 1 below:

Score M (US) SD (US) M (CAN) SD (CAN) p d r
FSIQ 95.5 12.9 88.1 14.4 <.001 0.54 0.99
GAI 98.9 14.7 92.5 16 <.001 0.42 0.99
Index Scores
Verbal Comprehension 97.9 15.1 91.8 16.3 <.001 0.39 0.99
Perceptual Reasoning 99.9 14.1 94.5 15.9 <.001 0.36 0.99
Working Memory 90.5 12.8 83.5 13.8 <.001 0.53 0.99
Processing Speed 95.2 12.9 90.4 14.1 <.001 0.36 0.99
Subtest Scores
Verbal Subtests
Vocabulary 9.9 3.1 8.7 3.3 <.001 0.37 0.99
Similarities 9.7 3 8.5 3.3 <.001 0.38 0.98
Information 9.2 3.1 8.5 3.3 <.001 0.22 0.99
Arithmetic 8.2 2.7 7.4 2.7 <.001 0.3 0.99
Digit Span 8.4 2.5 7.1 2.7 <.001 0.5 0.98
Performance Subtests
Block Design 9.8 3 8.9 3.2 <.001 0.29 0.99
Matrix Reasoning 9.8 2.9 9.1 3.2 <.001 0.23 0.99
Visual Puzzles 10.5 2.9 9.4 3.1 <.001 0.37 0.99
Symbol Search 9.3 2.8 8.5 3 <.001 0.28 0.99
Coding 8.9 2.5 8.2 2.6 <.001 0.27 0.98


Sure enough, the scores are lower using the Canadian norms. And very ‘significant’ too. A mystery.

Next, they go on to note how this sometimes changes the classification of individuals into 7 arbitrarily chosen intervals of IQ scores, and how this differs between subtests. They spend a lot of e-ink noting percents about this or that classification. For instance:

“Of interest was the percentage of individuals who would be classified as having a FSIQ below the 10th percentile or who would fall within the IQ range required for diagnosis of ID (e.g., 70 ± 5) when both normative systems were applied to the same raw scores. Using American norms, 13.1% had an IQ of 80 or less, and 4.2% had an IQ of 75 or less. By contrast, when using Canadian norms, 32.3% had an IQ of 80 or less, and 21.2% had an IQ of 75 or less.”

I wonder if some coherent explanation can be found for all these results. In their discussion they ask:

“How is it that selecting Canadian over American norms so markedly lowers the standard scores generated from the identical raw scores? One possible explanation is that more extreme scores occur because the Canadian normative sample is smaller than the American (cf. Kahneman, 2011).”

If the reader was unsure, yes, this is Kahneman’s 2011 book about cognitive biases and dual process theory.

They have more suggestions about the reason:

“One cannot explain this difference simply by saying it is due to the mature students in the sample who completed academic upgrading, as the score differences were most prominent in the youngest cohorts. It is difficult to explain these findings simply as a function of disability status, as all participants were deemed otherwise qualified by these postsecondary institutions (i.e., they had met normal academic requirements for entry into regular postsecondary programs). Furthermore, in Ontario, a diagnosis of LD is given only to students with otherwise normal thinking and reasoning skills, and so students with such a priori diagnosis would have had otherwise average full scale or general abilities scores when tested previously. Performance exaggeration seems an unlikely cause for the findings, as the students’ scores declined only when Canadian norms were applied. Finally, although no one would argue that a subset of disabled students might be functioning below average, it is difficult to believe that almost half of these postsecondary students would fall in this IQ range given that they had graduated from high school with marks high enough to qualify for acceptance into bona fide postsecondary programs. Whatever the cause, our data suggest that one must question both the representativeness of the Canadian normative sample in the younger age ranges and the accuracy of the scores derived when these norms are applied.”

And finally they conclude with a recommendation not to use the Canadian norms for Canadians because this results in lower IQs:

Overall, our findings suggest a need to examine more carefully the accuracy and applicability of the WAIS-IV Canadian norms when interpreting raw test data obtained from Canadian adults. Using these norms appears to increase the number of young adults identified as intellectually impaired and could decrease the number who qualify for gifted programming or a diagnosis of LD. Until more research is conducted, we strongly recommend that clinicians not use Canadian norms to determine intellectual impairment or disability status. Converting raw scores into Canadian standard scores, as opposed to using American norms, systematically lowers the scores of postsecondary students below the age of 35, as the drop in FSIQ was higher for this group than for older adults. Although we cannot know which derived scores most accurately reflect the intellectual abilities of young Canadian adults, it certainly seems implausible that almost half of postsecondary students have FSIQ scores below the 16th percentile, calling into question the accuracy of all other derived WAIS-IV Canadian scores in the classification of cognitive abilities.

Are you still wondering what is going on?

Populations with different mean IQs and cut-offs

Harrison et al seem to have inadvertently almost rediscovered the fact that Canadians are smarter than Americans. They don’t quite make it to this point even when faced with obvious and strong evidence (multiple standardization samples). They somehow don’t realize that using the norms from these standardization samples will reproduce the differences found in those samples, and won’t really report anything new.

Their numerous differences in percents reaching this or that cut-off are largely or entirely explained by simple statistics. They have two populations which have an IQ difference of 7.4 points (95.5 – 88.1 from Table 1) or 8.1 points (15 * .54 d from Table 1). Now, if we plot these (I used a difference of 7.5 IQ) and choose some arbitrary cut-offs, like those between arbitrarily chosen intervals, we see something like this:


Except that I cheated and chose all the cut-offs. The brown and the green lines are the ratios between the densities (read off the second y-axis). We see that around 100, they are generally low, but as we get further from the means, they get a lot larger. This simple fact is not generally appreciated. It’s not a new problem; Arthur Jensen spent much of a chapter on the topic in his behemoth 1980 book. He quotes, for instance:

“In the construction trades, new apprentices were 87 percent white and 13 percent black. [Blacks constitute 12 percent of the U.S. population.] For the Federal Civil Service, of those employees above the GS-5 level, 88.5 percent were white, 8.3 percent black, and women account for 30.1 percent of all civil servants. Finally, a 1969 survey of college teaching positions showed whites with 96.3 percent of all positions. Blacks had 2.2 percent, and women accounted for 19.1 percent. (U.S. Commission on Civil Rights, 1973)”

Sounds familiar? Razib Khan has also written about it. Now, let’s go back to one of the quotes:

“Using American norms, 13.1% had an IQ of 80 or less, and 4.2% had an IQ of 75 or less. By contrast, when using Canadian norms, 32.3% had an IQ of 80 or less, and 21.2% had an IQ of 75 or less. Most notably, only 0.7% (2 individuals) obtained a FSIQ of 70 or less using American norms, whereas 9.7% had IQ scores this low when Canadian norms were used. At the other end of the spectrum, 1.4% of the students had FSIQ scores of 130 or more (gifted) when American norms were used, whereas only 0.3% were this high using Canadian norms.”

We can put these in a table and calculate the ratios:

IQ threshold   Percent US   Percent CAN   US/CAN   CAN/US
130                   1.4           0.3     4.67     0.21
80                   13.1          32.3     0.41     2.47
75                    4.2          21.2     0.20     5.05
70                    0.7           9.7     0.07    13.86


And we can also calculate the expected values based on the two populations (with means of 95.5 and 88) above:

IQ threshold   Percent US   Percent CAN   US/CAN   CAN/US
130                  1.07          0.26     4.12     0.24
80                  15.07         29.69     0.51     1.97
75                   8.59         19.31     0.44     2.25
70                   4.46         11.51     0.39     2.58


This is fairly close, right? The only outlier is the much lower than expected value for IQ of 70 or less using US norms, perhaps due to sampling error. But overall, this is a pretty good fit to the data. Perhaps we have our explanation.
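The expected percentages above can be computed directly from the normal distribution function; a minimal sketch in R (the helper names are mine):

```r
#Expected percent below/above a cut-off for normal populations with SD 15,
#means 95.5 (US norms) and 88 (Canadian norms).
below = function(cutoff, mean) round(100 * pnorm(cutoff, mean, 15), 2)
above = function(cutoff, mean) round(100 * pnorm(cutoff, mean, 15, lower.tail = FALSE), 2)

above(130, 95.5) #1.07, gifted under US norms
above(130, 88)   #0.26, under Canadian norms
below(70, 95.5)  #4.46
below(70, 88)    #11.51
```

The same two calls with cut-offs 75 and 80 give the remaining cells of the expected-values table.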

What about those (mis)classification values in their Table 2? Well, for similar reasons that I won’t explain in detail, these are simply a function of the difference between the groups in that variable, e.g. Cohen’s d. In fact, if we correlate the d vector and the “% within same classification” we get a correlation of -.95 (-.96 using rank-orders).

MCV analysis

Incidentally, the d values reported in their Table 1 are useful for applying the method of correlated vectors. In a previous study comparing US and Canadian IQ data, Dutton and Lynn (2014) compared WAIS-IV standardization data. They found a mean difference of .31 d, or 4.65 IQ, which was reduced to 2.1 IQ when the samples were matched on education, ethnicity and sex. An interesting finding was that the difference between the countries was largest on the most g-loaded subtests. When this happens, it is called a Jensen effect (or we say that it has a positive Jensen coefficient; Kirkegaard, 2014). The value in their study was .83, which is on the high side (see e.g. te Nijenhuis et al., 2015).

I used the same loadings as used in their study (McFarland, 2013), and found a correlation of .24 (.35 with rank-order), substantially weaker.

Supplementary material

The R code and data files can be found in the Open Science Framework repository.


  • Harrison, A. G., Holmes, A., Silvestri, R., & Armstrong, I. T. (2015). Implications for Educational Classification and Psychological Diagnoses Using the Wechsler Adult Intelligence Scale–Fourth Edition With Canadian Versus American Norms. Journal of Psychoeducational Assessment, 1–13.
  • Jensen, A. R. (1980). Bias in Mental Testing.
  • Kirkegaard, E. O. (2014). The personal Jensen coefficient does not predict grades beyond its association with g. Open Differential Psychology.
  • McFarland, D. (2013). Model individual subtests of the WAIS IV with multiple latent factors. PLoS ONE, 8(9): e74980. doi:10.1371/journal.pone.0074980
  • te Nijenhuis, J., van den Hoek, M., & Armstrong, E. L. (2015). Spearman’s hypothesis and Amerindians: A meta-analysis. Intelligence, 50, 87–92.


Researcher choice in the reporting of regression models allows for questionable research practices. Here I present a function for R that reports all possible regression models given a set of predictors and a dependent variable. I illustrate this function on two datasets of artificial data.


In an email to L. J. Zigerell I wrote:

I had a look at your new paper here: rap.sagepub.com/content/2/1/2053168015570996 [Zigerell, 2015]

Generally, I agree about the problem. Pre-registration or always reporting all comparisons are the obvious solutions. Personally, I will try to pre-register all my survey-type studies from now on. The second is problematic in that there are sometimes quite a lot of ways to test a given hypothesis or estimate an effect with a dataset, your paper mentions a few. In many cases, it may not be easy for the researcher to report all of them by doing the analyses manually. I see two solutions: 1) tools for making automatic comparisons of all test methods, and 2) sampling of test methods. The first is preferable if it can be done, and when it cannot, one can fall back to the second. This is not a complete fix because there may be ways to estimate an effect using a dataset that the researcher did not even think of. If the dataset is not open, there is no way for others to conduct such tests. Open data is necessary.

I have one idea for how to make an automatic comparison of all possible ways to analyze data. In your working paper, you report only two regression models (Table 1), one controlling for 3 variables and one for 6 variables in MR. However, given the choice of these 6 variables, there are 2^6-1 (63) ways to run the MR analysis (the 64th is the empty model, which I guess could be used to estimate the intercept but nothing else). You could have tried all of them and reported only the ones that gave the result you wanted (in line with the argument in your paper above). I’m not saying you did this, of course, but it is possible. There is researcher degree of freedom about which models to report. There is a reason for this too: normally people run these models manually, and running all 63 models would take a while by hand (hard-coding them or point-and-click); they would also take up a lot of space if reported in the usual table format.

You can perhaps see where I’m going with this. One can make a function that takes as input the dependent variable, the set of independent variables and the dataset, and then returns all the results for all possible ways to include these into MRs. One can then calculate descriptive statistics (mean, median, range, SD) for the effect sizes (std. betas) of the different variables to see how stable they are depending on which other variables are included. This would be a great way to combat the problem of which models to report when using MR, I think. Better, one can plot the results in a visually informative and appealing way.

With a nice interactive figure, one can also make it possible for users to try all of them, or see results for only specific groups of models.

I have now written the code for this. I tested it with two cases, a simple and a complicated one.
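The actual function lives in the psych2 repository (see Source code below); as a minimal sketch, it might look like this. It standardizes the data so that the betas are standardized betas (the function name is hypothetical):

```r
#Sketch: run all 2^k - 1 regression models and collect standardized betas.
#Returns one row per model, with NA where a predictor was not included.
all_models_betas = function(df, dependent, predictors) {
  df = as.data.frame(scale(df)) #standardize, so betas are std. betas
  k = length(predictors)
  betas = matrix(NA, nrow = 2^k - 1, ncol = k, dimnames = list(NULL, predictors))
  for (i in 1:(2^k - 1)) {
    subset = predictors[as.logical(intToBits(i))[1:k]] #predictors in model i
    model = lm(reformulate(subset, response = dependent), data = df)
    betas[i, subset] = coef(model)[subset] #store betas, leave the rest NA
  }
  as.data.frame(betas)
}
```

Each row of the result corresponds to one subset of predictors, so the last row is the full model.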

Simple case

In the simple case, we have three variables:

a = (normally distributed) noise
b = noise
y = a+b

Then I standardized the data so that betas from regressions are standardized betas. Correlation matrix:


The small departure from expected values is sampling error (n=1000 in these simulations). The beta matrix is:


We see the expected results. The correlations alone are nearly the same as the betas from the joint model because a and b are nearly uncorrelated (r=.01) and correlate with y at about the same level (r=.70 and .72). The correlations and betas are around .71 because each predictor accounts for half the variance of y, and sqrt(.5) ≈ .71.

Complicated case

In this case we have 6 predictors some of which are correlated.

a = noise
b = noise
c = noise
d = c+noise
e = a+b+noise
f = .3*a+.7*c+noise
y = .2a+.3b+.5c

Correlation matrix:


And the beta matrix is:


So we see that the full model (model 63) gives a, b and c betas equal to their correlations with y, while the other, second-order variables get betas of 0 although they have positive correlations with y. In other words, MR is telling us that these variables have zero effect when taking into account a, b, c and each other, which is true.
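One can verify this by simulating the setup above and fitting the full model; the second-order variables get coefficients of essentially zero (unstandardized here, but the point is the same):

```r
#Simulate the complicated case and fit the full model (model 63).
set.seed(1); n = 1000
a = rnorm(n); b = rnorm(n); c = rnorm(n)
d = c + rnorm(n)            #non-causal correlate of c
e = a + b + rnorm(n)        #non-causal correlate of a and b
f = .3*a + .7*c + rnorm(n)  #non-causal correlate of a and c
y = .2*a + .3*b + .5*c      #y depends only on a, b, c
full = lm(y ~ a + b + c + d + e + f)
round(coef(full), 3) #a, b, c recover .2, .3, .5; d, e, f come out ~0
```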

By inspecting the matrix, we can see how one can be an unscrupulous researcher by exploiting researcher freedom (Simmons et al., 2011). If one likes the variable d, one can try all these models (or just a few of them manually), and then selectively report the ones that give strong betas. In this case model 60 looks like a good choice, since it controls for a lot yet still produces a strong beta for the favored variable. Or perhaps choose model 45, in which beta=.59. Then one can plausibly write something like this in the discussion section:

Prior research and initial correlations indicated that d may be a potent explanatory factor of y, r=.56 [significant test statistic]. After controlling for a, b and e, the effect was mostly unchanged, beta=.59 [significant test statistic]. However, after controlling for f as well, the effect was somewhat attenuated but still substantial, beta=.49 [significant test statistic]. The findings thus support theory T about the importance of d in understanding y.

One can play the game the other way too. If one dislikes a, one might focus on the models that find a weak effect of a, such as 31, where the beta is -.04 [non-significant test statistic] when controlling for e and f.

It can also be useful to examine the descriptive statistics of the beta matrix:


The range and sd (standard deviation) are useful as measures of how much the effect size of a variable varies from model to model. We see that among the true causes (a, b, c) the sd and range are smaller than among the non-causal correlates (d, e, f). Among the true causes, the weaker causes have larger ranges and sds. Perhaps one can find a way to adjust for this to get an effect-size-independent measure of how much the beta varies from model to model. The mad (median absolute deviation, a robust alternative to the sd) looks like a very promising candidate for detecting the true causal variables. It is very low (.01, .01 and .03) for the true causes, and at least 4.67 times larger for the non-causal correlates (.03 and .14).
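These summary statistics can be computed per predictor from the model-by-predictor beta matrix; a minimal sketch, assuming `betas` is a data frame with one row per model and NAs where a predictor was excluded (the function name is hypothetical):

```r
#Sketch: per-predictor descriptive statistics across all models.
beta_stats = function(betas) {
  data.frame(mean   = colMeans(betas, na.rm = TRUE),
             median = apply(betas, 2, median, na.rm = TRUE),
             sd     = apply(betas, 2, sd, na.rm = TRUE),
             mad    = apply(betas, 2, mad, na.rm = TRUE),
             range  = apply(betas, 2, function(x) diff(range(x, na.rm = TRUE))))
}
```

A stable (causal) predictor will show a small sd, mad and range; an unstable correlate will show large ones.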

In any case, I hope to have shown how researchers can use freedom in model choice and reporting to inflate or deflate the effect sizes of variables they like or dislike. There are two ways to deal with this: researchers must report the betas from all possible models (this table can get large, because there are 2^n-1 models where n is the number of predictors), e.g. using my function or an equivalent one; or, better, the data must be made available for reanalysis by others.

Source code

The function is available from the psych2 repository at Github. The R code for this paper is available on the Open Science Framework repository.


Simmons JP,  Nelson LD,  Simonsohn U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science 22(11): 1359–1366.

Zigerell, L. J. (2015). Inferential selection bias in a study of racial bias: Revisiting ‘Working twice as hard to get half as far’. Research & Politics, 2(1), 2053168015570996.

Some time ago, I wrote on Reddit:

There are two errors that I see quite frequently:

  1. Concluding from the fact that a statistically significant difference was found that a socially, scientifically or otherwise significant difference was found. The reason this won’t work is that any minute difference will be statistically significant if N is large enough. Some datasets have N=1e6, so very small differences between groups can be detected reliably. This does not mean they are worth any attention. The general problem is the lack of focus on effect sizes.
  2. Concluding from the fact that a difference was not statistically significant that there is no difference in that trait. The error is ignoring the possibility of a false negative: there is a difference, but the sample size is too small to reliably detect it, or sampling fluctuation caused it to be smaller than usual in the present sample. Together with the misuse of p values, one often sees statements like “men and women differed in trait1 (p<0.04) but did not differ in trait2 (p>0.05)”, as if the p value difference of .01 had some magical significance.
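The first error is easy to demonstrate in R: with a million cases per group, even a trivial difference of d = .01 comes out highly ‘significant’:

```r
#Two groups that differ by a trivial d = .01, with N = 1e6 per group.
set.seed(1)
g1 = rnorm(1e6, mean = 0)
g2 = rnorm(1e6, mean = .01)
t.test(g1, g2)$p.value #a tiny p value despite a negligible effect size
```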

These are rather obvious (to me), so I don’t know why I keep reading papers (Wassell et al, 2015) that go like this:

2.1. Experiment 1

In experiment 1 participants filled in the VVIQ2 and reported their current menstrual phase by counting backward the appropriate number of days from the next onset of menstruation. We grouped female participants according to these reports. Fig. 2A shows the mean VVIQ2 score for males and females in the follicular and mid luteal phases (males: M = 56.60, SD = 10.39, follicular women: M = 60.11, SD = 8.84, mid luteal women: M = 69.38, SD = 8.52). VVIQ2 scores varied between menstrual groups, as confirmed by a significant one-way ANOVA, F(2, 52) = 8.63, p < .001, η2 = .25. Tukey post hoc comparisons revealed that mid luteal females reported more vivid imagery than males, p < .001, d = 1.34, and follicular females, p < .05, d = 1.07, while males and follicular females did not differ, p = .48, d = 0.37. These data suggest a possible link between sex hormone concentration and the vividness of mental imagery.

A normal interpretation of the above has the authors making the fallacy. It is even contradictory: an effect size of d=.37 is a medium-small effect, yet in the same sentence they state that the groups did not differ (i.e. d=0).

However, later on they write:

VVIQ2 scores were found to significantly correlate with imagery strength from the binocular rivalry task, r = .37, p < .01. As is evident in Fig. 3A, imagery strength measured by the binocular rivalry task varied significantly between menstrual groups, F(2, 55) = 8.58, p < .001, η2 = .24, with mid luteal females showing stronger imagery than both males, p < .05, d = 1.03, and late follicular females, p < .001, d = 1.26. These latter two groups’ scores did not differ significantly, p = .51, d = 0.34. Together, these findings support the questionnaire data, and the proposal that imagery differences are influenced by menstrual phase and sex hormone concentration.

Now the authors are back to phrasing it in a way that cannot be taken as the fallacy. Sometimes it gets more silly. One paper, Kleisner et al (2014), which received quite a lot of attention in the media, is based on this kind of subgroup analysis, where the effect had p<.05 for one gender but not the other. The typical source of this silliness is the relatively small sample size of most studies combined with the authors’ use of exploratory subgroup analysis (which they pretend is hypothesis-driven). Gender, age, and race are the typical groups explored, alone and in combination.

Probably, it would be best if scientists stopped using “significant” to talk about lowish p values. There is a very large probability that the public will misunderstand it. (There was a good study about this recently, but I can’t find it again. Help!)


Kleisner, K., Chvátalová, V., & Flegr, J. (2014). Perceived intelligence is associated with measured intelligence in men but not women. PloS one, 9(3), e81237.

Wassell, J., Rogers, S. L., Felmingam, K. L., Bryant, R. A., & Pearson, J. (2015). Sex hormones predict the sensory strength and vividness of mental imagery. Biological Psychology.

I made this: emilkirkegaard.shinyapps.io/Understanding_restriction_of_range/


# ui.R
shinyUI(fluidPage(
  titlePanel(title, windowTitle = title),
  sidebarLayout(
    sidebarPanel(
      helpText("Get an intuitive understanding of restriction of range using this interactive plot. The slider below limits the dataset to those within the limits."),
      sliderInput("limits",
        label = "Restriction of range",
        min = -5, max = 5, value = c(-5, 5), step = .1),
      helpText("Note that these are Z-values. A Z-value of +/- 2 corresponds to the 98th or 2nd centile, respectively.")
    ),
    mainPanel(
      plotOutput("plot"),
      verbatimTextOutput("text")
    )
  )
))

# server.R
library(lattice) #for xyplot()
shinyServer(function(input, output) {
  output$plot <- renderPlot({
    lower.limit = input$limits[1] #lower limit
    upper.limit = input$limits[2] #upper limit
    #adjust data object
    data["X.restricted"] = data["X"] #copy X
    data[data[,1] < lower.limit | data[,1] > upper.limit, "X.restricted"] = NA #remove values
    group = data.frame(rep("Included", nrow(data))) #create group var
    colnames(group) = "group" #rename
    levels(group$group) = c("Included", "Excluded") #add second factor level
    group[is.na(data["X.restricted"])] = "Excluded" #is NA?
    data["group"] = group #add to data
    xyplot(Y ~ X, data, type = c("p","r"), col.line = "darkorange", lwd = 1,
           groups = group, auto.key = TRUE)
  })
  output$text <- renderPrint({
    lower.limit = input$limits[1] #lower limit
    upper.limit = input$limits[2] #upper limit
    #adjust data object
    data["X.restricted"] = data["X"] #copy X
    data[data[,1] < lower.limit | data[,1] > upper.limit, "X.restricted"] = NA #remove values
    group = data.frame(rep("Included", nrow(data))) #create group var
    colnames(group) = "group" #rename
    levels(group$group) = c("Included", "Excluded") #add second factor level
    group[is.na(data["X.restricted"])] = "Excluded" #is NA?
    data["group"] = group #add to data
    cors = cor(data[1:3], use = "pairwise")
    r = round(cors[3,2], 2)
    #print output
    print(paste0("The correlation in the full dataset is .50, the correlation in the restricted dataset is ", r))
  })
})

# global.R
data = read.csv("data.csv", row.names = 1) #load data
title = "Understanding restriction of range"

Restriction of range occurs when the variance in some variable is reduced compared to the true population variance. This lowers the correlation between that variable and other variables. It is a common problem in research on students, who are selected for general intelligence (GI) and hence have lower variance in it. This means that correlations between GI and whatever else is measured in student samples are too low.

There are some complicated ways to correct for restriction of range. The usual formula is this:

r̂_XY = r_xy * (S_X/s_x) / sqrt(1 - r_xy^2 + r_xy^2 * (S_X/s_x)^2)

which is also known as Thorndike’s case 2, or Pearson’s 1903 formula. Capital X and Y denote the unrestricted variables, lower-case x and y the restricted ones; S_X and s_x are the corresponding standard deviations, and the hat on r means estimated.

However, in a paper under review I used a much simpler formula, namely corrected r = uncorrected r / (SD_restricted/SD_unrestricted), which seemed to give about the right results. But I wasn’t sure this was legit, so I did some simulations.

First, I selected a large range of true population correlations (.1 to .8) and a large range of selectivity (.1 to .9), then I generated very large datasets with each population correlation. Then for each restriction, I cut off the datapoints where the one variable was below the cutoff point, and calculated the correlation in that restricted dataset. Then I calculated the corrected correlation. Then I saved both pieces of information.
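For a single cell of this design (true r = .50 with the bottom 50% cut off), the procedure looks like this; a minimal sketch:

```r
#One cell of the simulation: true r = .50, bottom 50% cut off.
library(MASS) #for mvrnorm()
set.seed(1)
data = mvrnorm(1e6, mu = c(0, 0), Sigma = matrix(c(1, .5, .5, 1), ncol = 2),
               empirical = TRUE) #population r is exactly .50
rdata = data[data[, 1] > qnorm(.5), ] #keep the top half on the first variable
r = cor(rdata)[1, 2] #restricted correlation, about .33
s = sd(rdata[, 1])   #restricted SD, about .60
r / s                           #simple correction, overshoots to about .54
r / sqrt(r^2 + s^2 - s^2 * r^2) #Thorndike's case 2, recovers about .50
```

These two corrections correspond to one cell each in the two corrected tables below.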

This gives us these correlations in the restricted samples (N=1,000,000)

cor/restriction R 0.1 R 0.2 R 0.3 R 0.4 R 0.5 R 0.6 R 0.7 R 0.8 R 0.9
r 0.1 0.09 0.08 0.07 0.07 0.06 0.06 0.05 0.05 0.04
r 0.2 0.17 0.15 0.14 0.13 0.12 0.11 0.10 0.09 0.08
r 0.3 0.26 0.23 0.22 0.20 0.19 0.17 0.16 0.14 0.12
r 0.4 0.35 0.32 0.29 0.27 0.26 0.24 0.22 0.20 0.17
r 0.5 0.44 0.40 0.37 0.35 0.33 0.31 0.28 0.26 0.23
r 0.6 0.53 0.50 0.47 0.44 0.41 0.38 0.36 0.33 0.29
r 0.7 0.64 0.60 0.57 0.54 0.51 0.48 0.45 0.42 0.37
r 0.8 0.75 0.71 0.68 0.65 0.63 0.60 0.56 0.53 0.48


The true population correlation is in the left-margin. The amount of restriction in the columns. So we see the effect of restricting the range.

Now, here’s the corrected correlations by my method:

cor/restriction R 0.1 R 0.2 R 0.3 R 0.4 R 0.5 R 0.6 R 0.7 R 0.8 R 0.9
r 0.1 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.09
r 0.2 0.20 0.20 0.20 0.20 0.21 0.21 0.20 0.20 0.20
r 0.3 0.30 0.31 0.31 0.31 0.31 0.31 0.30 0.30 0.29
r 0.4 0.41 0.41 0.42 0.42 0.42 0.42 0.42 0.42 0.42
r 0.5 0.52 0.53 0.53 0.54 0.54 0.55 0.55 0.56 0.56
r 0.6 0.63 0.65 0.66 0.67 0.68 0.69 0.70 0.70 0.72
r 0.7 0.76 0.79 0.81 0.83 0.84 0.86 0.87 0.89 0.90
r 0.8 0.89 0.93 0.97 1.01 1.04 1.07 1.10 1.13 1.17


Now, the first 3 rows are fairly close, deviating by at most .01, but the rest deviate progressively more. The discrepancies are these:

cor/restriction R 0.1 R 0.2 R 0.3 R 0.4 R 0.5 R 0.6 R 0.7 R 0.8 R 0.9
r 0.1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.01
r 0.2 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.00 0.00
r 0.3 0.00 0.01 0.01 0.01 0.01 0.01 0.00 0.00 -0.01
r 0.4 0.01 0.01 0.02 0.02 0.02 0.02 0.02 0.02 0.02
r 0.5 0.02 0.03 0.03 0.04 0.04 0.05 0.05 0.06 0.06
r 0.6 0.03 0.05 0.06 0.07 0.08 0.09 0.10 0.10 0.12
r 0.7 0.06 0.09 0.11 0.13 0.14 0.16 0.17 0.19 0.20
r 0.8 0.09 0.13 0.17 0.21 0.24 0.27 0.30 0.33 0.37


So, if we can figure out how to predict the values in these cells from the two values in the row and column, one can make a simpler way to correct for restriction.

Or, we can just use the correct formula, and then we get:

cor/restriction R 0.1 R 0.2 R 0.3 R 0.4 R 0.5 R 0.6 R 0.7 R 0.8 R 0.9
r 0.1 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.09
r 0.2 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.21 0.20
r 0.3 0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.30
r 0.4 0.40 0.40 0.40 0.40 0.40 0.40 0.40 0.39 0.39
r 0.5 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.49
r 0.6 0.60 0.60 0.60 0.60 0.60 0.60 0.60 0.60 0.60
r 0.7 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.71
r 0.8 0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80


With discrepancies:

cor/restriction R 0.1 R 0.2 R 0.3 R 0.4 R 0.5 R 0.6 R 0.7 R 0.8 R 0.9
r 0.1 0 0 0 0 0 0 0 0 -0.01
r 0.2 0 0 0 0 0 0 0 0.01 0
r 0.3 0 0 0 0 0 0 0 0 0
r 0.4 0 0 0 0 0 0 0 -0.01 -0.01
r 0.5 0 0 0 0 0 0 0 0 -0.01
r 0.6 0 0 0 0 0 0 0 0 0
r 0.7 0 0 0 0 0 0 0 0 0.01
r 0.8 0 0 0 0 0 0 0 0 0


Pretty good!

Also, I need to re-do my paper.

R code:


library(MASS)  #for mvrnorm()
library(Hmisc) #for rcorr()
library(psych) #for describe()

pop.cors = seq(.1,.8,.1) #population correlations to test
restrictions = seq(.1,.9,.1) #restriction of ranges in centiles
sample = 1000000 #sample size

#empty dataframes for results
results = data.frame(matrix(nrow=length(pop.cors), ncol=length(restrictions)))
colnames(results) = paste("R", restrictions)
rownames(results) = paste("r", pop.cors)
results.c = results #simple correction
results.c2 = results #correct formula

#and fetch!
for (pop.cor in pop.cors){ #loop over population cors
  data = mvrnorm(sample, mu = c(0,0), Sigma = matrix(c(1,pop.cor,pop.cor,1), ncol = 2),
                 empirical = TRUE) #generate data
  rowname = paste("r", pop.cor) #get current row name
  for (restriction in restrictions){ #loop over restrictions
    colname = paste("R", restriction) #get current col name
    z.cutoff = qnorm(restriction) #find cut-off
    rows.to.keep = data[,1] > z.cutoff #which rows to keep
    rdata = data[rows.to.keep,] #cut away data
    cor = rcorr(rdata)$r[1,2] #get cor
    results[rowname,colname] = cor #add cor to results
    sd = describe(rdata)$sd[1] #find restricted sd
    cor.c = cor/sd #corrected cor, simple formula
    results.c[rowname,colname] = cor.c #add cor to results
    cor.c2 = cor/sqrt(cor^2+sd^2-sd^2*cor^2) #correct formula
    results.c2[rowname,colname] = cor.c2 #add cor to results
  }
}

#how much are they off by?
discre = results.c
for (num in 1:length(pop.cors)){
  cor = pop.cors[num]
  discre[num,] = discre[num,]-cor
}

discre2 = results.c2
for (num in 1:length(pop.cors)){
  cor = pop.cors[num]
  discre2[num,] = discre2[num,]-cor
}

A person on ResearchGate asked the following question:

How can I correlate ordinal variables (attitude Likert scale) with continuous ratio data (years of experience)?
Currently, I am working on my dissertation which explores learning organisation characteristics at HEIs. One of the predictor demographic variables is the indication of the years of experience. Respondents were asked to fill in the gap the number of years. Should I categorise the responses instead? as for example:
1. from 1 to 4 years
2. from 4 to 10
and so on?
or is there a better choice/analysis I could apply?

My answer may also be of interest to others, so I post it here as well.

Normal practice is to treat Likert scales as continuous variables even though they are not. As long as there are >=5 options, the bias from discreteness is not large.

I simulated the situation for you. I generated two variables with continuous random data from two normal distributions with a correlation of .50, N=1000. Then I created likert scales of varying levels from the second variable. Then I correlated all these variables with each other.

Correlations of continuous variable 1 with:

continuous2 0.5
likert10 0.482
likert7 0.472
likert5 0.469
likert4 0.432
likert3 0.442
likert2 0.395

So you see, introducing discreteness biases correlations toward zero, but not by much as long as the Likert scale has >=5 levels. You can correct for the bias by multiplying by the correction factor if desired:

Correction factor:

continuous2 1
likert10 1.037
likert7 1.059
likert5 1.066
likert4 1.157
likert3 1.131
likert2 1.266

Psychologically, if your data do not make sense as an interval scale, i.e. if the difference between options 1 and 2 is not the same as between options 3 and 4, then you should use Spearman’s correlation instead of Pearson’s. However, it will rarely make much of a difference.

Here’s the R code.

#load library
library(MASS) #for mvrnorm()
#simulate dataset of 2 variables with correlation of .50, N=1000
simul.data = mvrnorm(1000, mu = c(0,0), Sigma = matrix(c(1,0.50,0.50,1), ncol = 2), empirical = TRUE)
simul.data = as.data.frame(simul.data); colnames(simul.data) = c("continuous1","continuous2")
#divide into bins of equal length
simul.data["likert10"] = as.numeric(cut(unlist(simul.data[2]), breaks=10))
simul.data["likert7"] = as.numeric(cut(unlist(simul.data[2]), breaks=7))
simul.data["likert5"] = as.numeric(cut(unlist(simul.data[2]), breaks=5))
simul.data["likert4"] = as.numeric(cut(unlist(simul.data[2]), breaks=4))
simul.data["likert3"] = as.numeric(cut(unlist(simul.data[2]), breaks=3))
simul.data["likert2"] = as.numeric(cut(unlist(simul.data[2]), breaks=2))