Introduction

“Research uncovers flawed IQ scoring system” is the headline on phys.org, which often posts news about research from other fields. It concerns a study by Harrison et al (2015). The researchers have allegedly “uncovered anomalies and issues with the Wechsler Adult Intelligence Scale-Fourth Edition (WAIS-IV), one of the most widely used intelligence tests in the world”. An important discovery, if true. Let’s hear it from the lead researcher:

“Looking at the normal distribution of scores, you’d expect that only about five per cent of the population should get an IQ score of 75 or less,” says Dr. Harrison. “However, while this was true when we scored their tests using the American norms, our findings showed that 21 per cent of college and university students in our sample had an IQ score this low when Canadian norms were used for scoring.”

How can this be? To learn more, we delve into the actual paper, titled Implications for Educational Classification and Psychological Diagnoses Using the Wechsler Adult Intelligence Scale–Fourth Edition With Canadian Versus American Norms.

The paper

First, they summarize a few earlier studies comparing Canadian and American samples. The Canadians obtained higher raw scores. Of course, this was hypothesized to be due to differences in ethnic composition and educational attainment. However, this did not quite work out, so Harrison et al decided to investigate further (they had already done so in 2014). Their method consists of taking the scores from a large mixed sample of healthy people (i.e. with no diagnosis, 11%) and people with various mental disorders (e.g. 53.5% with ADHD), and then scoring this group on both the American and the Canadian norms. What did they find?

Blast! The results were similar to those from the previous standardization studies! What happened? To find out, Harrison et al thoroughly examine various subgroups in various ways. No matter which age group they compare, the result won’t go away. They also report the means and Cohen’s d for each subtest and aggregate measure — very helpful. I reproduce their Table 1 below:

Score M (US) SD (US) M (CAN) SD (CAN) p d r
FSIQ 95.5 12.9 88.1 14.4 <.001 0.54 0.99
GAI 98.9 14.7 92.5 16 <.001 0.42 0.99
Index Scores
Verbal Comprehension 97.9 15.1 91.8 16.3 <.001 0.39 0.99
Perceptual Reasoning 99.9 14.1 94.5 15.9 <.001 0.36 0.99
Working Memory 90.5 12.8 83.5 13.8 <.001 0.53 0.99
Processing Speed 95.2 12.9 90.4 14.1 <.001 0.36 0.99
Subtest Scores
Verbal Subtests
Vocabulary 9.9 3.1 8.7 3.3 <.001 0.37 0.99
Similarities 9.7 3 8.5 3.3 <.001 0.38 0.98
Information 9.2 3.1 8.5 3.3 <.001 0.22 0.99
Arithmetic 8.2 2.7 7.4 2.7 <.001 0.3 0.99
Digit Span 8.4 2.5 7.1 2.7 <.001 0.5 0.98
Performance Subtests
Block Design 9.8 3 8.9 3.2 <.001 0.29 0.99
Matrix Reasoning 9.8 2.9 9.1 3.2 <.001 0.23 0.99
Visual Puzzles 10.5 2.9 9.4 3.1 <.001 0.37 0.99
Symbol Search 9.3 2.8 8.5 3 <.001 0.28 0.99
Coding 8.9 2.5 8.2 2.6 <.001 0.27 0.98

 

Sure enough, the scores are lower using the Canadian norms. And very ‘significant’ too. A mystery.

Next, they go on to note how this sometimes changes the classification of individuals into 7 arbitrarily chosen intervals of IQ scores, and how this differs between subtests. They spend a lot of e-ink reporting percentages for this or that classification. For instance:

“Of interest was the percentage of individuals who would be classified as having a FSIQ below the 10th percentile or who would fall within the IQ range required for diagnosis of ID (e.g., 70 ± 5) when both normative systems were applied to the same raw scores. Using American norms, 13.1% had an IQ of 80 or less, and 4.2% had an IQ of 75 or less. By contrast, when using Canadian norms, 32.3% had an IQ of 80 or less, and 21.2% had an IQ of 75 or less.”

I wonder if some coherent explanation can be found for all these results. In their discussion they ask:

“How is it that selecting Canadian over American norms so markedly lowers the standard scores generated from the identical raw scores? One possible explanation is that more extreme scores occur because the Canadian normative sample is smaller than the American (cf. Kahneman, 2011).”

If the reader was unsure, yes, this is Kahneman’s 2011 book about cognitive biases and dual process theory.

They have more suggestions about the reason:

“One cannot explain this difference simply by saying it is due to the mature students in the sample who completed academic upgrading, as the score differences were most prominent in the youngest cohorts. It is difficult to explain these findings simply as a function of disability status, as all participants were deemed otherwise qualified by these postsecondary institutions (i.e., they had met normal academic requirements for entry into regular postsecondary programs). Furthermore, in Ontario, a diagnosis of LD is given only to students with otherwise normal thinking and reasoning skills, and so students with such a priori diagnosis would have had otherwise average full scale or general abilities scores when tested previously. Performance exaggeration seems an unlikely cause for the findings, as the students’ scores declined only when Canadian norms were applied. Finally, although no one would argue that a subset of disabled students might be functioning below average, it is difficult to believe that almost half of these postsecondary students would fall in this IQ range given that they had graduated from high school with marks high enough to qualify for acceptance into bona fide postsecondary programs. Whatever the cause, our data suggest that one must question both the representativeness of the Canadian normative sample in the younger age ranges and the accuracy of the scores derived when these norms are applied.”

And finally they conclude with a recommendation not to use the Canadian norms for Canadians because this results in lower IQs:

Overall, our findings suggest a need to examine more carefully the accuracy and applicability of the WAIS-IV Canadian norms when interpreting raw test data obtained from Canadian adults. Using these norms appears to increase the number of young adults identified as intellectually impaired and could decrease the number who qualify for gifted programming or a diagnosis of LD. Until more research is conducted, we strongly recommend that clinicians not use Canadian norms to determine intellectual impairment or disability status. Converting raw scores into Canadian standard scores, as opposed to using American norms, systematically lowers the scores of postsecondary students below the age of 35, as the drop in FSIQ was higher for this group than for older adults. Although we cannot know which derived scores most accurately reflect the intellectual abilities of young Canadian adults, it certainly seems implausible that almost half of postsecondary students have FSIQ scores below the 16th percentile, calling into question the accuracy of all other derived WAIS-IV Canadian scores in the classification of cognitive abilities.

Are you still wondering what is going on?

Populations with different mean IQs and cut-offs

Harrison et al seem to have inadvertently almost rediscovered the fact that Canadians are smarter than Americans. They don’t quite make it to this point, even when faced with obvious and strong evidence (multiple standardization samples). They somehow don’t realize that using the norms from these standardization samples will reproduce the differences found in those samples, and so won’t really reveal anything new.

Their numerous differences in percentages reaching this or that cut-off are largely or entirely explained by simple statistics. They have two populations with an IQ difference of 7.4 points (95.5 − 88.1, from Table 1) or 8.1 points (15 × .54 d, from Table 1). Now, if we plot these (I used a difference of 7.5 IQ) and choose some arbitrary cut-offs, like those between the arbitrarily chosen intervals, we see something like this:

[Figure: two overlapping normal IQ distributions 7.5 points apart, with density-ratio curves on a second y-axis]

Except that I cheated and chose all the cut-offs. The brown and the green lines are the ratios between the densities (read off the second y-axis). Around 100 the ratios are generally modest, but as we get further from the means, they become much larger. This simple fact is not generally appreciated. It is not a new problem either: Arthur Jensen spent much of a chapter on it in his behemoth 1980 book. He quotes, for instance:

“In the construction trades, new apprentices were 87 percent white and 13 percent black. [Blacks constitute 12 percent of the U.S. population.] For the Federal Civil Service, of those employees above the GS-5 level, 88.5 percent were white, 8.3 percent black, and women account for 30.1 percent of all civil servants. Finally, a 1969 survey of college teaching positions showed whites with 96.3 percent of all positions. Blacks had 2.2 percent, and women accounted for 19.1 percent. (U.S. Commission on Civil Rights, 1973)”

Sound familiar? Razib Khan has also written about this. Now, let’s go back to one of the quotes:

“Using American norms, 13.1% had an IQ of 80 or less, and 4.2% had an IQ of 75 or less. By contrast, when using Canadian norms, 32.3% had an IQ of 80 or less, and 21.2% had an IQ of 75 or less. Most notably, only 0.7% (2 individuals) obtained a FSIQ of 70 or less using American norms, whereas 9.7% had IQ scores this low when Canadian norms were used. At the other end of the spectrum, 1.4% of the students had FSIQ scores of 130 or more (gifted) when American norms were used, whereas only 0.3% were this high using Canadian norms.”

We can put these in a table and calculate the ratios:

IQ threshold Percent US Percent CAN US/CAN CAN/US
130 1.4 0.3 4.67 0.21
80 13.1 32.3 0.41 2.47
75 4.2 21.2 0.20 5.05
70 0.7 9.7 0.07 13.86

 

And we can also calculate the expected values based on the two populations (with means of 95.5 and 88) above:

IQ threshold Percent US Percent CAN US/CAN CAN/US
130 1.07 0.26 4.12 0.24
80 15.07 29.69 0.51 1.97
75 8.59 19.31 0.44 2.25
70 4.46 11.51 0.39 2.58

 

This is fairly close, right? The only outlier (in italics) is the much lower than expected value for an IQ of 70 or less using US norms, perhaps due to sampling error. But overall, this is a pretty good fit to the data. Perhaps we have our explanation.
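As a sketch, the expected percentages above can be reproduced directly from the normal CDF, assuming SD 15 and the two means used above (95.5 and 88):

#expected percentages at or below each cut-off for two normal populations
#with means 95.5 (US norms) and 88 (Canadian norms), both with SD 15
cutoffs = c(80, 75, 70)
pct.us = 100 * pnorm(cutoffs, mean = 95.5, sd = 15)
pct.can = 100 * pnorm(cutoffs, mean = 88, sd = 15)
round(data.frame(cutoffs, pct.us, pct.can, CAN.US = pct.can / pct.us), 2)

#and the right tail at 130 or more (the gifted range)
round(100 * c(US = 1 - pnorm(130, 95.5, 15), CAN = 1 - pnorm(130, 88, 15)), 2)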

What about those (mis)classification values in their Table 2? Well, for similar reasons that I won’t explain in detail, these are simply a function of the size of the difference between the groups on that variable, i.e. Cohen’s d. In fact, if we correlate the d vector with the “% within same classification” vector, we get a correlation of -.95 (-.96 using rank orders).

MCV analysis

Incidentally, the d values reported in their Table 1 are useful for the method of correlated vectors (MCV). In a previous study comparing US and Canadian IQ data, Dutton and Lynn (2014) compared WAIS-IV standardization data. They found a mean difference of .31 d, or 4.65 IQ points, which was reduced to 2.1 IQ points when the samples were matched on education, ethnicity and sex. An interesting finding was that the difference between the countries was largest on the most g-loaded subtests. When this happens, it is called a Jensen effect (or the comparison is said to have a positive Jensen coefficient; Kirkegaard, 2014). The value in their study was .83, which is on the high side (see e.g. te Nijenhuis et al, 2015).

I used the same g-loadings as in their study (McFarland, 2013) and found a correlation of .24 (.35 with rank order), which is substantially weaker.
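For concreteness, here is a minimal sketch of the MCV calculation using the subtest d values from Table 1. The g-loadings below are placeholders for illustration only, not the actual McFarland (2013) estimates:

#method of correlated vectors: correlate the vector of subtest d values
#with the vector of subtest g-loadings
d = c(Vocabulary = .37, Similarities = .38, Information = .22, Arithmetic = .30,
      Digit.Span = .50, Block.Design = .29, Matrix.Reasoning = .23,
      Visual.Puzzles = .37, Symbol.Search = .28, Coding = .27)
g.loadings = c(.75, .72, .70, .68, .60, .65, .68, .66, .55, .52) #hypothetical values
cor(d, g.loadings) #Pearson MCV coefficient
cor(d, g.loadings, method = "spearman") #rank-order version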

Supplementary material

The R code and data files can be found in the Open Science Framework repository.

References

  • Harrison, A. G., Holmes, A., Silvestri, R., & Armstrong, I. T. (2015). Implications for Educational Classification and Psychological Diagnoses Using the Wechsler Adult Intelligence Scale–Fourth Edition With Canadian Versus American Norms. Journal of Psychoeducational Assessment, 1–13.
  • Jensen, A. R. (1980). Bias in Mental Testing.
  • Kirkegaard, E. O. (2014). The personal Jensen coefficient does not predict grades beyond its association with g. Open Differential Psychology.
  • McFarland, D. (2013). Model individual subtests of the WAIS IV with multiple latent factors. PLoS ONE, 8(9), e74980. doi:10.1371/journal.pone.0074980
  • te Nijenhuis, J., van den Hoek, M., & Armstrong, E. L. (2015). Spearman’s hypothesis and Amerindians: A meta-analysis. Intelligence, 50, 87–92.

In an email to L. J. Zigerell I wrote:

I had a look at your new paper here: rap.sagepub.com/content/2/1/2053168015570996

Generally, I agree about the problem. Pre-registration, or always reporting all comparisons, are the obvious solutions. Personally, I will try to pre-register all my survey-type studies from now on. The second solution is problematic in that there are sometimes quite a lot of ways to test a given hypothesis or estimate an effect with a dataset; your paper mentions a few. In many cases, it may not be easy for the researcher to report all of them by doing the analyses manually. I see two solutions: 1) tools for making automatic comparisons of all test methods, and 2) sampling of test methods. The first is preferable when it can be done, and when it cannot, one can fall back on the second. This is not a complete fix, because there may be ways to estimate an effect using a dataset that the researcher did not even think of. And if the dataset is not open, there is no way for others to conduct such tests. Open data is necessary.

I have one idea for how to make an automatic comparison of all possible ways to analyze data. In your working paper, you report only two regression models (Table 1), one controlling for 3 variables and one for 6 variables in multiple regression (MR). However, given the choice of these 6 variables, there are 2^6−1 (63) ways to run the MR analysis (the 64th is the empty model, which I guess could be used to estimate the intercept but nothing else). You could have tried all of them and reported only the ones that gave the result you wanted (in line with the argument in your paper). I’m not saying you did this, of course, but it is possible: there is researcher degree of freedom in which models to report. One reason for this is that people normally run these models manually, and running all 63 models by hand (hard-coding them or pointing and clicking) would take a while; the results would also take up a lot of space if reported in the usual table format.

You can perhaps see where I’m going with this. One can make a function that takes as input the dependent variable, the set of independent variables and the dataset, and then returns the results for every possible way of entering those variables into a multiple regression. One can then calculate descriptive statistics (mean, median, range, SD) for the effect sizes (standardized betas) of the different variables to see how stable they are depending on which other variables are included. This would be a great way to combat the problem of which models to report when using MR, I think. Better yet, one can plot the results in a visually informative and appealing way.

With a nice interactive figure, one can also make it possible for users to try all of them, or see results for only specific groups of models.

I have now written the code for this. I tested it with two cases, a simple and a complicated one.

Simple case

In the simple case, we have two predictors, a and b, which are independently generated random normal numbers, and y = .5a + .5b. I then standardized the data so that the betas from the regressions are standardized betas. Correlation matrix:

[correlation matrix output]

The small departure from expected values is sampling error (n=1000 in these simulations). The beta matrix is:

[beta matrix output]

We see the expected results. The zero-order correlations are the same as the betas in the joint model because the two predictors are nearly uncorrelated (r=.01) and correlate with y at about the same level (r=.70 and .72). The correlations and betas are all around .71 because of how y was constructed (see below).
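Concretely, for independent standard normal predictors with y = .5a + .5b:

$$\operatorname{SD}(y) = \sqrt{0.5^2 + 0.5^2} \approx 0.707, \qquad r_{ay} = r_{by} = \frac{0.5}{0.707} \approx 0.71,$$

and because a and b are uncorrelated, the standardized betas in the joint model equal these zero-order correlations.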

Complicated case

In this case we have 6 predictors, some of which are correlated, and y = .2a + .3b + .5c. So we expect y-correlations of .32, .49 and .81. The other variables are various combinations of the first three and noise (see code below). Correlation matrix:

[correlation matrix output]

Correlations are close to the expected values.

And the beta matrix is:

[beta matrix output]

So we see that in the full model (model 63) the betas for a, b and c equal their correlations with y, while the other, second-order variables get betas of 0 even though they have positive y-correlations. In other words, MR is telling us that these second-order variables have no incremental linear effect once a, b, c and each other are taken into account, which is true.

By inspecting the matrix, we can see how an unscrupulous researcher could exploit this freedom (Simmons et al, 2011). If one likes the variable d, one can try all these models (or just a few of them manually) and then selectively report the ones that give strong betas. In this case, model 60 looks like a good choice, since it controls for a lot yet still produces a strong beta for the favored variable. Or perhaps choose model 45, in which beta=.59. Then one can plausibly write something like this in the discussion section:

Prior research and initial correlations indicated that d may be a potent explanatory factor of y, r=.56 [significant test statistic]. After controlling for a, b and e, the effect was mostly unchanged, beta=.59 [significant test statistic]. However, after controlling for f as well, the effect was somewhat attenuated but still substantial, beta=.49 [significant test statistic]. The findings thus support theory T about the importance of d in understanding y.

One can play the game the other way too. If one dislikes a, one might focus on the models that find a weak effect of a, such as 31, where the beta is -.04 [non-significant test statistic] when controlling for e and f.

It can also be useful to examine the descriptive statistics:

[descriptive statistics output]

The range and SD are useful as measures of how much the effect size of a variable varies from model to model. We see that among the true causes (a, b, c) the SDs and ranges are smaller than among the non-causal correlates (d, e, f). Among the true causes, the weaker causes have larger ranges and SDs. Perhaps one can find a way to adjust for this to get an effect-size-independent measure of how much the beta varies from model to model. The MAD (median absolute deviation) looks like a very promising candidate for detecting the true causal variables. It is very low (.01, .01 and .03) for the true causes, and at least 4.67 times larger (.14 or more) for the non-causal correlates.

In any case, I hope to have shown how researchers can use this freedom in model choice and reporting to inflate or deflate the effect sizes of variables they like or dislike. There are two ways to deal with it: researchers must report the betas from all possible models (this table can get large, since there are 2^n−1 models where n is the number of variables), e.g. using my function or an equivalent one; or, better, the data must be available for reanalysis by others.

R code:

library(psych)

inde.vars = c("a","b") #simple case
inde.vars = c("a","b","c","d","e","f") #complicated case (overwrites the line above; keep whichever case you want)

beta.matrix = lm.all.models("y",inde.vars,data) #run function! (requires the data and function defined below)
View(round(beta.matrix,2))
View(round(describe(beta.matrix),2))

#Generate simple case data
rows = 1000 #how many cases
data = data.frame(matrix(nrow = rows,ncol=0)) #dataframe
data["a"] = rnorm(rows) #
data["b"] = rnorm(rows) #
data["y"] = .5*data["a"]+.5*data["b"] #dependent var
data = data.frame(scale(data)) #standardize data
View(round(cor(data),2))

#Generate complicated case data
rows = 1000 #how many cases
data = data.frame(matrix(nrow = rows,ncol=0)) #dataframe
data["a"] = rnorm(rows) #
data["b"] = rnorm(rows) #
data["c"] = rnorm(rows) #
data["d"] = data["c"]+rnorm(rows) #worse version of c
data["e"] = data["a"]+data["b"]+rnorm(rows) #combination of a+b and noise
data["f"] = .3*data["a"]+.7*data["c"]+rnorm(rows) #a and c and noise
data["y"] = .2*data["a"]+.3*data["b"]+.5*data["c"] #dependent var made of a,b,c
data = data.frame(scale(data)) #standardize data
View(round(cor(data),2))

# The function
lm.all.models = function(dependent, vector.of.indep, dataset=NULL) {
  #find all the combinations
  library(gtools) #for combinations()
  num.inde = length(vector.of.indep) #how many indeps?
  
  sets = list() #list of all combinations
  for (num.choose in 1:num.inde) { #loop over numbers of variables to choose
    temp.sets = combinations(num.inde,num.choose) #all combinations of picking r out of n
    temp.sets = split(temp.sets, seq.int(nrow(temp.sets))) #as a list
    sets = c(sets,temp.sets)
  }
  #return(sets) #for debugging
  
  #create all the models
  models = character() #empty character vector for the model formula strings
  for (set in sets) { #loop over each possible combination
    preds = vector.of.indep[set] #fetch predictor names for this combination
    preds = paste0(preds, collapse = " + ") #add " + " between predictors
    model = paste0(dependent," ~ ",preds) #join into a model formula string
    models = c(models,model) #add to vector
  }
  #print(models) #debug
  
  #run each model
  betas = data.frame(matrix(ncol=num.inde,nrow=length(models))) #DF for betas
  colnames(betas) = vector.of.indep #colnames
  for (model.idx in 1:length(models)) { #loop over the index of each model
    #print(model.idx)
    #print(models[model.idx])
    lm.fit = lm(as.formula(models[model.idx]), data = dataset) #fit the model
    lm.fit.betas = lm.fit$coefficients[-1] #get betas, remove intercept
    #print(lm.fit.betas)
    
    #insert data
    for (beta.idx in 1:length(lm.fit.betas)) { #loop over each beta
      beta.names = as.vector(names(lm.fit.betas)) #get the names
      beta.name = beta.names[beta.idx] #get the name
      #print(beta.names)
      #print(beta.name)
      betas[model.idx,beta.name] = lm.fit.betas[beta.idx] #insert beta in the right place
    }
  }
  return(betas)
}

References

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.

Some time ago, I wrote on Reddit:

There are two errors that I see quite frequently:

  1. Concluding from the fact that a statistically significant difference was found that a socially, scientifically or otherwise meaningful difference was found. The reason this does not work is that any minute difference will be statistically significant if N is large enough. Some datasets have N=1e6, so very small differences between groups can be detected reliably. This does not mean they are worth any attention. The general problem is the lack of focus on effect sizes.
  2. Concluding from the fact that a difference was not statistically significant that there is no difference in that trait. The error is ignoring the possibility of a false negative: there is a difference, but the sample size is too small to detect it reliably, or sampling fluctuation made it smaller than usual in the present sample. Together with the misuse of p values, one often sees things like “men and women differed in trait1 (p<0.04) but did not differ in trait2 (p>0.05)”, as if the p value difference of .01 had some magical significance. (A small simulation of both errors follows this list.)
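A quick simulation sketch of both errors (the specific means and sample sizes are just illustrative):

#error 1: with a huge N, a trivially small difference comes out "significant"
n = 1e6
x1 = rnorm(n, mean = 0); y1 = rnorm(n, mean = 0.005) #true d = 0.005, negligible
t.test(x1, y1)$p.value #typically well below .05 despite a meaningless effect size

#error 2: with a small N, a real difference often fails to reach p < .05
n = 20
x2 = rnorm(n, mean = 0); y2 = rnorm(n, mean = 0.4) #true d = 0.4, a real difference
t.test(x2, y2)$p.value #often above .05 (low power), which is not evidence that d = 0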

These are rather obvious (to me), so I don’t know why I keep reading papers (Wassell et al, 2015) that go like this:

2.1. Experiment 1

In experiment 1 participants filled in the VVIQ2 and reported their current menstrual phase by counting backward the appropriate number of days from the next onset of menstruation. We grouped female participants according to these reports. Fig. 2A shows the mean VVIQ2 score for males and females in the follicular and mid luteal phases (males: M = 56.60, SD = 10.39, follicular women: M = 60.11, SD = 8.84, mid luteal women: M = 69.38, SD = 8.52). VVIQ2 scores varied between menstrual groups, as confirmed by a significant one-way ANOVA, F(2, 52) = 8.63, p < .001, η2 = .25. Tukey post hoc comparisons revealed that mid luteal females reported more vivid imagery than males, p < .001, d = 1.34, and follicular females, p < .05, d = 1.07, while males and follicular females did not differ, p = .48, d = 0.37. These data suggest a possible link between sex hormone concentration and the vividness of mental imagery.

A natural reading of the above has the authors committing the fallacy. It is even contradictory: an effect size of d=.37 is a medium-small effect, yet in the same sentence they state that the groups did not differ (i.e., d=0).

However, later on they write:

VVIQ2 scores were found to significantly correlate with imagery strength from the binocular rivalry task, r = .37, p < .01. As is evident in Fig. 3A, imagery strength measured by the binocular rivalry task varied significantly between menstrual groups, F(2, 55) = 8.58, p < .001, η2 = .24, with mid luteal females showing stronger imagery than both males, p < .05, d = 1.03, and late follicular females, p < .001, d = 1.26. These latter two groups’ scores did not differ significantly, p = .51, d = 0.34. Together, these findings support the questionnaire data, and the proposal that imagery differences are influenced by menstrual phase and sex hormone concentration.

Now the authors are back to phrasing it in a way that cannot be read as the fallacy. Sometimes it gets sillier. One paper, Kleisner et al (2014), which received quite a lot of attention in the media, is based on exactly this kind of subgroup analysis, where the effect had p<.05 for one gender but not the other. The typical source of this silliness is the relatively small sample size of most studies combined with exploratory subgroup analysis (which the authors pretend is hypothesis-driven). Gender, age and race are the typical subgroups explored, alone and in combination.

It would probably be best if scientists stopped using “significant” to talk about lowish p values; there is a very large probability that the public will misunderstand it. (There was a good study about this recently, but I can’t find it again; help!)

References

Kleisner, K., Chvátalová, V., & Flegr, J. (2014). Perceived intelligence is associated with measured intelligence in men but not women. PloS one, 9(3), e81237.

Wassell, J., Rogers, S. L., Felmingham, K. L., Bryant, R. A., & Pearson, J. (2015). Sex hormones predict the sensory strength and vividness of mental imagery. Biological Psychology.

I made this: emilkirkegaard.shinyapps.io/Understanding_restriction_of_range/

Source:

# ui.R
shinyUI(fluidPage(
  titlePanel(title, windowTitle = title),
  
  sidebarLayout(
    sidebarPanel(
      helpText("Get an intuitive understanding of restriction of range using this interactive plot. The slider below limits the dataset to those within the limits."),
      
      sliderInput("limits",
        label = "Restriction of range",
        min = -5, max = 5, value = c(-5, 5), step=.1),
      
      helpText("Note that these are Z-values. A Z-value of +/- 2 corresponds roughly to the 98th or 2nd centile, respectively.")
      ),
    
    
    mainPanel(
      plotOutput("plot"),width=8,
      
      textOutput("text")
      )
  )
))
# server.R
shinyServer(
  function(input, output) {
    output$plot <- renderPlot({
      #limits
      lower.limit = input$limits[1] #lower limit
      upper.limit = input$limits[2]  #upper limit
      
      #adjust data object
      data["X.restricted"] = data["X"] #copy X
      data[data[,1]<lower.limit | data[,1]>upper.limit,"X.restricted"] = NA #remove values
      group = data.frame(rep("Included",nrow(data))) #create group var
      colnames(group) = "group" #rename
      levels(group$group) = c("Included","Excluded") #add second factor level
      group[is.na(data["X.restricted"])] = "Excluded" #is NA?
      data["group"] = group #add to data
      
      #plot
      xyplot(Y ~ X, data, type=c("p","r"), col.line = "darkorange", lwd = 1,
             group=group, auto.key = TRUE)
    })
    
    output$text <- renderPrint({
      #limits
      lower.limit = input$limits[1] #lower limit
      upper.limit = input$limits[2]  #upper limit
      
      #adjust data object
      data["X.restricted"] = data["X"] #copy X
      data[data[,1]<lower.limit | data[,1]>upper.limit,"X.restricted"] = NA #remove values
      group = data.frame(rep("Included",nrow(data))) #create group var
      colnames(group) = "group" #rename
      levels(group$group) = c("Included","Excluded") #add second factor level
      group[is.na(data["X.restricted"])] = "Excluded" #is NA?
      data["group"] = group #add to data
      
      #correlations
      cors = cor(data[1:3], use="pairwise")
      r = round(cors[3,2],2)
      #print output
      str = paste0("The correlation in the full dataset is .50, the correlation in the restricted dataset is ",r)
      print(str)
    })
    
  }
)
#global.R
library("lattice")
data = read.csv("data.csv",row.names = 1) #load data
title = "Understanding restriction of range"

Restriction of range is when the variance of some variable is reduced relative to its true population variance. This lowers the correlation between that variable and other variables. It is a common problem in research on students, who are selected for general intelligence (GI) and hence have lower variance on it. This means that correlations between GI and other variables found in student samples are too low.

There are some complicated ways to correct for restriction of range. The usual formula used is this:

$$\hat{r}_{XY} \;=\; \frac{r_{xy}\,\dfrac{S_X}{s_x}}{\sqrt{1 - r_{xy}^{2} + r_{xy}^{2}\,\dfrac{S_X^{2}}{s_x^{2}}}}$$

which is also known as Thorndike’s case 2, or Pearson’s 1903 formula. Capital subscripts (X, Y) refer to the unrestricted variables, lowercase (x, y) to the restricted ones; S and s are the corresponding standard deviations, and the hat on r means it is an estimate.

However, in a paper under review I used a much simpler formula, namely corrected r = uncorrected r / (SD_restricted / SD_unrestricted), which seemed to give about the right results. But I wasn’t sure this was legitimate, so I did some simulations.

First, I selected a large range of true population correlations (.1 to .8) and a large range of selectivity (.1 to .9), and generated a very large dataset for each population correlation. For each restriction level, I cut off the datapoints where one variable was below the cutoff point and calculated the correlation in the restricted dataset. Then I calculated the corrected correlation and saved both pieces of information.

This gives us these correlations in the restricted samples (N=1,000,000)

cor/restriction R 0.1 R 0.2 R 0.3 R 0.4 R 0.5 R 0.6 R 0.7 R 0.8 R 0.9
r 0.1 0.09 0.08 0.07 0.07 0.06 0.06 0.05 0.05 0.04
r 0.2 0.17 0.15 0.14 0.13 0.12 0.11 0.10 0.09 0.08
r 0.3 0.26 0.23 0.22 0.20 0.19 0.17 0.16 0.14 0.12
r 0.4 0.35 0.32 0.29 0.27 0.26 0.24 0.22 0.20 0.17
r 0.5 0.44 0.40 0.37 0.35 0.33 0.31 0.28 0.26 0.23
r 0.6 0.53 0.50 0.47 0.44 0.41 0.38 0.36 0.33 0.29
r 0.7 0.64 0.60 0.57 0.54 0.51 0.48 0.45 0.42 0.37
r 0.8 0.75 0.71 0.68 0.65 0.63 0.60 0.56 0.53 0.48

 

The true population correlation is in the left margin; the amount of restriction (the proportion of cases cut off from the bottom of the selection variable) is in the columns. So we see the effect of restricting the range.

Now, here’s the corrected correlations by my method:

cor/restriction R 0.1 R 0.2 R 0.3 R 0.4 R 0.5 R 0.6 R 0.7 R 0.8 R 0.9
r 0.1 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.09
r 0.2 0.20 0.20 0.20 0.20 0.21 0.21 0.20 0.20 0.20
r 0.3 0.30 0.31 0.31 0.31 0.31 0.31 0.30 0.30 0.29
r 0.4 0.41 0.41 0.42 0.42 0.42 0.42 0.42 0.42 0.42
r 0.5 0.52 0.53 0.53 0.54 0.54 0.55 0.55 0.56 0.56
r 0.6 0.63 0.65 0.66 0.67 0.68 0.69 0.70 0.70 0.72
r 0.7 0.76 0.79 0.81 0.83 0.84 0.86 0.87 0.89 0.90
r 0.8 0.89 0.93 0.97 1.01 1.04 1.07 1.10 1.13 1.17

 

Now, the first 3 rows are fairly close, deviating by at most .01, but the rest deviates progressively more. The discrepancies are these:

cor/restriction R 0.1 R 0.2 R 0.3 R 0.4 R 0.5 R 0.6 R 0.7 R 0.8 R 0.9
r 0.1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.01
r 0.2 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.00 0.00
r 0.3 0.00 0.01 0.01 0.01 0.01 0.01 0.00 0.00 -0.01
r 0.4 0.01 0.01 0.02 0.02 0.02 0.02 0.02 0.02 0.02
r 0.5 0.02 0.03 0.03 0.04 0.04 0.05 0.05 0.06 0.06
r 0.6 0.03 0.05 0.06 0.07 0.08 0.09 0.10 0.10 0.12
r 0.7 0.06 0.09 0.11 0.13 0.14 0.16 0.17 0.19 0.20
r 0.8 0.09 0.13 0.17 0.21 0.24 0.27 0.30 0.33 0.37

 

So, if we can figure out how to predict the values in these cells from the two values in the row and column, one can make a simpler way to correct for restriction.

Or, we can just use the correct formula, and then we get:

cor/restriction R 0.1 R 0.2 R 0.3 R 0.4 R 0.5 R 0.6 R 0.7 R 0.8 R 0.9
r 0.1 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.09
r 0.2 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.21 0.20
r 0.3 0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.30
r 0.4 0.40 0.40 0.40 0.40 0.40 0.40 0.40 0.39 0.39
r 0.5 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.49
r 0.6 0.60 0.60 0.60 0.60 0.60 0.60 0.60 0.60 0.60
r 0.7 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.71
r 0.8 0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80

 

With discrepancies:

cor/restriction R 0.1 R 0.2 R 0.3 R 0.4 R 0.5 R 0.6 R 0.7 R 0.8 R 0.9
r 0.1 0 0 0 0 0 0 0 0 -0.01
r 0.2 0 0 0 0 0 0 0 0.01 0
r 0.3 0 0 0 0 0 0 0 0 0
r 0.4 0 0 0 0 0 0 0 -0.01 -0.01
r 0.5 0 0 0 0 0 0 0 0 -0.01
r 0.6 0 0 0 0 0 0 0 0 0
r 0.7 0 0 0 0 0 0 0 0 0.01
r 0.8 0 0 0 0 0 0 0 0 0

 

Pretty good!
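As a cross-check, single cells of these tables can also be computed analytically rather than by simulation. Here is a sketch for the r = .5, 50% restriction cell, using standard truncated-normal results (my own addition, not part of the simulation code below):

#analytic check of one cell: population r = .5, bottom 50% of X removed
r = .5; cut = qnorm(.5) #true correlation and selection cut-off on X
lambda = dnorm(cut) / (1 - pnorm(cut)) #mean of X above the cut-off
sd.x = sqrt(1 - lambda * (lambda - cut)) #SD of X in the selected group (~0.60)
r.res = r * sd.x / sqrt(1 - r^2 + r^2 * sd.x^2) #restricted correlation (~0.33)
r.res / sd.x #simple correction: slightly too high (~0.55, cf. 0.54 in the table above)
r.res / sqrt(r.res^2 + sd.x^2 - sd.x^2 * r.res^2) #Thorndike case 2: ~0.50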

Also, I need to re-do my paper.


R code:

library(MASS)
library(Hmisc)
library(psych)

pop.cors = seq(.1,.8,.1) #population correlations to test
restrictions = seq(.1,.9,.1) #restriction of ranges in centiles
sample = 1000000 #sample size

#empty dataframe for results
results = data.frame(matrix(nrow=length(pop.cors),ncol=length(restrictions)))
colnames(results) = paste("R",restrictions)
rownames(results) = paste("r",pop.cors)
results.c = results
results.c2 = results

#and fetch!
for (pop.cor in pop.cors){ #loop over population cors
  data = mvrnorm(sample, mu = c(0,0), Sigma = matrix(c(1,pop.cor,pop.cor,1), ncol = 2),
                 empirical = TRUE) #generate data
  rowname = paste("r",pop.cor) #get current row names
  for (restriction in restrictions){ #loop over restrictions
    colname = paste("R",restriction) #get current col names
    z.cutoff = qnorm(restriction) #find cut-off
    rows.to.keep = data[,1] > z.cutoff #which rows to keep
    rdata = data[rows.to.keep,] #cut away data
    cor = rcorr(rdata)$r[1,2] #get cor
    results[rowname,colname] = cor #add cor to results
    sd = describe(rdata)$sd[1] #find restricted sd
    cor.c = cor/sd #corrected cor, simple formula
    results.c[rowname,colname] = cor.c #add cor to results
    
    cor.c2 = cor/sqrt(cor^2+sd^2-sd^2*cor^2) #correct formula
    results.c2[rowname,colname] = cor.c2 #add cor to results
  }
}

#how much are they off by?
discre = results.c
for (num in 1:length(pop.cors)){
  cor = pop.cors[num]
  discre[num,] = discre[num,]-cor
}

discre2 = results.c2
for (num in 1:length(pop.cors)){
  cor = pop.cors[num]
  discre2[num,] = discre2[num,]-cor
}

A person on ResearchGate asked the following question:

How can I correlate ordinal variables (attitude Likert scale) with continuous ratio data (years of experience)?
Currently, I am working on my dissertation which explores learning organisation characteristics at HEIs. One of the predictor demographic variables is the indication of the years of experience. Respondents were asked to fill in the gap the number of years. Should I categorise the responses instead? as for example:
1. from 1 to 4 years
2. from 4 to 10
and so on?
or is there a better choice/analysis I could apply?

My answer may also be of interest to others, so I post it here as well.

Normal practice is to treat Likert scales as continuous variables even though they are not. As long as there are >= 5 options, the bias from discreteness is not large.

I simulated the situation for you. I generated two continuous variables from normal distributions with a correlation of .50, N = 1000. Then I created Likert scales with varying numbers of levels from the second variable. Then I correlated all these variables with each other.

Correlations of continuous variable 1 with:

continuous2 0.5
likert10 0.482
likert7 0.472
likert5 0.469
likert4 0.432
likert3 0.442
likert2 0.395

So you see, introducing discreteness biases correlations towards zero, but not by much as long as the Likert scale has >= 5 levels. You can correct for the bias by multiplying by the correction factor if desired:

Correction factor:

continuous2 1
likert10 1.037
likert7 1.059
likert5 1.066
likert4 1.157
likert3 1.131
likert2 1.266
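
For reference, the correction factor is just the correlation with the continuous variable divided by the correlation with each discretized version; a quick sketch using the values listed above:

#correction factor = continuous correlation / correlation with each Likert version
cors = c(continuous2 = 0.500, likert10 = 0.482, likert7 = 0.472, likert5 = 0.469,
         likert4 = 0.432, likert3 = 0.442, likert2 = 0.395)
round(cors["continuous2"] / cors, 3)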

Psychologically, if your data do not make sense as an interval scale, i.e. if the difference between options 1 and 2 is not the same as between options 3 and 4, then you should use Spearman’s correlation instead of Pearson’s. However, it will rarely make much of a difference.

Here’s the R code.

#load library
library(MASS)
#simulate dataset of 2 variables with correlation of .50, N=1000
simul.data = mvrnorm(1000, mu = c(0,0), Sigma = matrix(c(1,0.50,0.50,1), ncol = 2), empirical = TRUE)
simul.data = as.data.frame(simul.data); colnames(simul.data) = c("continuous1","continuous2")
#divide into bins of equal length
simul.data["likert10"] = as.numeric(cut(unlist(simul.data[2]),breaks=10))
simul.data["likert7"] = as.numeric(cut(unlist(simul.data[2]),breaks=7))
simul.data["likert5"] = as.numeric(cut(unlist(simul.data[2]),breaks=5))
simul.data["likert4"] = as.numeric(cut(unlist(simul.data[2]),breaks=4))
simul.data["likert3"] = as.numeric(cut(unlist(simul.data[2]),breaks=3))
simul.data["likert2"] = as.numeric(cut(unlist(simul.data[2]),breaks=2))
#correlations
round(cor(simul.data),3)

R can tell us:

DF.numbers = data.frame(cubesum=numeric(),sumsquare=numeric()) #initial dataframe
for (n in 1:100){ #loop and fill in
  DF.numbers[n,"cubesum"] = sum((1:n)^3)
  DF.numbers[n,"sumsquare"] = sum(1:n)^2
}

library(car) #for the scatterplot() function
scatterplot(cubesum ~ sumsquare, DF.numbers,
            smoother=FALSE, #no moving average
            labels = rownames(DF.numbers), id.n = nrow(DF.numbers), #labels
            log = "xy", #logscales
            main = "Cubesum is identical to sumsquare, proven by induction")

#check that the two columns are numerically identical
all.equal(DF.numbers$cubesum, DF.numbers$sumsquare)

 

[Figure: cubesum plotted against sumsquare on log-log axes]

One can increase the number in the loop to test more numbers. I did test it with 1:10000, and it was still true.
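For reference, the identity being checked numerically is the classical sum-of-cubes identity (which can indeed be proved by induction):

$$\sum_{k=1}^{n} k^3 \;=\; \left(\sum_{k=1}^{n} k\right)^{2} \;=\; \frac{n^2(n+1)^2}{4}$$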

Actually I’m busy doing an exam paper for linguistics class, but it turned out to be not so difficult, so I spent some time on Khan Academy doing probability and statistics courses. I want to master that stuff, especially the things I don’t currently know the details of, like regression.

Anyway, I stumbled onto a comment asking about the way the standard deviation is calculated: why not just use the absolute value instead of squaring the deviations and taking the square root afterwards? I actually tried that once, and it gives different results! I tried it out because the teacher’s notes said it would give the same results. Pretty neat discovery, IMO.
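A quick check of that in R (any small sample will do):

x = c(2, 4, 4, 4, 5, 5, 7, 9) #an arbitrary small sample
sd(x) #standard deviation (root of the average squared deviation, with n-1)
mean(abs(x - mean(x))) #mean absolute deviation: clearly not the same number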

Anyway, the other one has a name as well: en.wikipedia.org/wiki/Absolute_deviation

Here’s a paper that argues that we should really return to the MD (mean deviation). I didn’t understand all the math, but it sure is easier to calculate and its meaning easier to grasp, although it is probably too difficult to switch now that most of statistics is based on the SD. Still cool, though.

Revisiting a 90-year-old debate: the advantages of the mean deviation

ABSTRACT: This paper discusses the reliance of numerical analysis on the concept of the standard deviation, and its close relative the variance. It suggests that the original reasons why the standard deviation concept has permeated traditional statistics are no longer clearly valid, if they ever were. The absolute mean deviation, it is argued here, has many advantages over the standard deviation. It is more efficient as an estimate of a population parameter in the real-life situation where the data contain tiny errors, or do not form a completely perfect normal distribution. It is easier to use, and more tolerant of extreme values, in the majority of real-life situations where population parameters are not required. It is easier for new researchers to learn about and understand, and also closely linked to a number of arithmetic techniques already used in the sociology of education and elsewhere. We could continue to use the standard deviation instead, as we do presently, because so much of the rest of traditional statistics is based upon it (effect sizes, and the F-test, for example). However, we should weigh the convenience of this solution for some against the possibility of creating a much simpler and more widespread form of numeric analysis for many.

Keywords: variance, measuring variation, political arithmetic, mean deviation, standard deviation, social construction of statistics

It also has an odd use of “social construction”, which annoyed me when reading it.

I was researching a different topic and came across this paper. I was rewatching the Everything is a Remix series, then looked up some more related links and came across these videos. One of them mentioned this article.

Complex to the ear but simple to the mind (Nicholas J Hudson)

Abstract:

Background: The biological origin of music, its universal appeal across human cultures and the cause of its beauty
remain mysteries. For example, why is Ludwig Van Beethoven considered a musical genius but Kylie Minogue is
not? Possible answers to these questions will be framed in the context of Information Theory.
Presentation of the Hypothesis: The entire life-long sensory data stream of a human is enormous. The adaptive
solution to this problem of scale is information compression, thought to have evolved to better handle, interpret
and store sensory data. In modern humans highly sophisticated information compression is clearly manifest in
philosophical, mathematical and scientific insights. For example, the Laws of Physics explain apparently complex
observations with simple rules. Deep cognitive insights are reported as intrinsically satisfying, implying that at some
point in evolution, the practice of successful information compression became linked to the physiological reward
system. I hypothesise that the establishment of this “compression and pleasure” connection paved the way for
musical appreciation, which subsequently became free (perhaps even inevitable) to emerge once audio
compression had become intrinsically pleasurable in its own right.
Testing the Hypothesis: For a range of compositions, empirically determine the relationship between the
listener’s pleasure and “lossless” audio compression. I hypothesise that enduring musical masterpieces will possess
an interesting objective property: despite apparent complexity, they will also exhibit high compressibility.
Implications of the Hypothesis: Artistic masterpieces and deep Scientific insights share the common process of
data compression. Musical appreciation is a parasite on a much deeper information processing capacity. The
coalescence of mathematical and musical talent in exceptional individuals has a parsimonious explanation. Musical
geniuses are skilled in composing music that appears highly complex to the ear yet transpires to be highly simple
to the mind. The listener’s pleasure is influenced by the extent to which the auditory data can be resolved in the
simplest terms possible.

Interesting, but it is way too short on data. It’s not that difficult to acquire some data to test this hypothesis. Various open source lossless compressors are freely available; I’m thinking particularly of FLAC encoders. Then one needs a huge library of music, and some sort of quality ranking of that music. If the hypothesis is correct, the best music should come out on top, at least relatively within genres, or within bands, etc. I think I will test this myself.
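If I do, the core measurement could look something like this sketch. It assumes the flac command-line encoder is installed and that a folder of .wav files and some external quality ranking exist; all of these are assumptions for illustration, not data I have:

#compute a lossless compression ratio per track using the flac encoder
wav.dir = "music_wav" #hypothetical folder of .wav files
wav.files = list.files(wav.dir, pattern = "\\.wav$", full.names = TRUE)

compression.ratio = sapply(wav.files, function(wav) {
  flac.file = sub("\\.wav$", ".flac", wav) #output path
  system2("flac", c("--best", "-s", "-f", "-o", flac.file, wav)) #lossless encode
  file.size(flac.file) / file.size(wav) #lower ratio = more compressible
})

#quality.rank would be some external ranking of the same tracks (assumed data)
#cor(compression.ratio, quality.rank, method = "spearman")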