Review: What Intelligence Tests Miss: The Psychology of Rational Thought (Stanovich, 2009)

MOBI on libgen

I’ve seen this book cited quite a few times and when looking for what to read next, it seemed like on okay choice. The book is written in typical popscience style: no crucial statistical information about the studies is mentioned, so it is impossible for the skeptical reader to know which claims to believe and which not to.

For instance, he spends quite a while talking about how IQ/SAT etc. do not correlate strongly with rationality measures. Rarely does he mention the exact effect size. He does not mention whether it is measured as a correlation of IQ with single item rationality measures. Single items have lower reliability which reduces correlations, and are usually dichotomous which also lowers (Pearson) correlations (simulation results here, TL;DR multiple by 1.266 for dichotomous items). He does not say whether it was university students, which lower correlations as they are selected for g and rationality (maybe). The OKCupid dataset happens to contain a number of items on rationality items (e.g. astrology), I have already noted on Twitter that these correlate with g in the expected direction (religiousness).

Otherwise the book feels like reading Kahneman’s Thinking Fast and Slow. It covers the most well-known heuristics and how they sometimes lead astray (representativeness, ease of recall, framing effect, status quo bias, planning bias, etc.).

The book can be read by researchers with some gain in knowledge, but don’t expect that much. For the serious newcomer, it is better to read a textbook on the topic (unfortunately, I don’t know any, as I have yet to read one myself — regrettably!). For the curious layperson, I guess it is okay.

Therefore, it came as something of a surprise when scores on various college placement exams and Armed Forces tests that the president had taken over the years were converted into an estimated IQ score. The president’s [Bush #2] score was approximately 120-roughly the same as that of Bush’s opponent in the 2004 presidential election, John Kerry, when Kerry’s exam results from young adulthood were converted into IQ scores using the same formulas. These results surprised many critics of the president (as well as many of his supporters), but 1, as a scientist who studies individual differences in cognitive skills, was not surprised.

Virtually all commentators on the president’s cognition, including sympathetic commentators such as his onetime speechwriter David Frum, admit that there is something suboptimal about the president’s thinking. The mistake they make is assuming that all intellectual deficiencies are reflected in a lower IQ score.

In a generally positive portrait of the president, Frum nonetheless notes that “he is impatient and quick to anger; sometimes glib, even dogmatic; often uncurious and as a result ill-informed” (2003, p. 272). Conservative commentator George Will agrees, when he states that in making Supreme Court appointments, the president “has neither the inclination nor the ability to make sophisticated judgments about competing approaches to construing the Constitution” (2005, P. 23)

Seems fishy. One obvious idea is that he has had some kind of brain damage since his recorded score. Since it is based on a SAT score, it is possible that he had considerable help on the SAT test. It is true that SAT prepping does not generally work well and has diminishing returns, but surely Bush had quite a lot of help as he comes from a very rich and prestigious family. (I once read a recent meta-analysis of SAT prepping/coaching, but I can’t find it again. Mean effect size was about .25, which corresponds to 3.75 IQ.)

See also:

Actually, we do not have to speculate about the proportion of high-IQ people with these beliefs. Several years ago, a survey of paranormal beliefs was given to members of a Mensa club in Canada, and the results were instructive. Mensa is a club restricted to high-IQ individuals, and one must pass IQ-type tests to be admitted. Yet 44 percent of the members of this club believed in astrology, 51 percent believed in biorhythms, and 56 percent believed in the existence of extraterrestrial visitors-all beliefs for which there is not a shred of evidence.

Seems fishy too. Maybe MENSA just attracts irrational smart people. I know someone who is in Danish MENSA, so I can perhaps do a new survey.

Rational thinking errors appear to arise from a variety of sources -it is unlikely that anyone will propose a psychometric g of rationality. Irrational thinking does not arise from a single cognitive problem, but the research literature does allow us to classify thinking into smaller sets of similar problems. Our discussion so far has set the stage for such a classification system, or taxonomy. First, though, I need to introduce one additional feature in the generic model of the mind outlined in Chapter 3.

But that is exactly what I will propose. What is the factor structure of rationality? Is there a general factor, is it hierarchical? Is rationality perhaps a second-order factor of g? I get inspiration from study study of ‘Emotional Intelligence’ as a second-stratum factor (MacCann et al, 2014).

The next category (defaulting to the autonomous mind and not engaging at all in Type 2 processing) is the most shallow processing tendency of the cognitive miser. The ability to sustain Type 2 processing is of course related to intelligence. But the tendency to engage in such processing or to default to autonomous processes is a property of the reflective mind that is not assessed on IQ tests. Consider the Levesque problem (“Jack is looking at Anne but Anne is looking at George”) as an example of avoiding Type 2 processing. The subjects who answer this problem correctly are no higher in intelligence than those who do not, at least in a sample of university students studied by Maggie Toplak in my own laboratory.

This sure does sound like a 1 item correct/wrong item correlated with IQ scores from a selected g group. He says “no higher” but perhaps his sample was too small too and what he meant was that the difference was not significant. Samples for this kind are usually pretty small.

Theoretically, one might expect a positive correlation between intelligence and the tendency of the reflective mind to initiate Type 2 processing because it might be assumed that those of high intelligence would be more optimistic about the potential efficacy of Type 2 processing and thus be more likely to engage in it. Indeed, some insight tasks do show a positive correlation with intelligence, one in particular being the task studied by Shane Frederick and mentioned in Chapter 6: A bat and a ball cost $I.Io in total. The bat costs $I more than the ball. How much does the ball cost? Nevertheless, the correlation between intelligence and a set of similar items is quite modest, .43-.46, leaving plenty of room for performance dissociations of the type that define dysrationalia 14 Frederick has found that large numbers of high-achieving students at MIT, Princeton, and Harvard when given this and other similar problems rely on this most primitive of cognitive miser strategies.

The sum of the 3 CRT items (one mentioned above) correlated r=.50 with the 16 item ICAR sample test in my student (age ~18, n=72) data. These items do not perform differently when factor analyzed with the entire item set.

In numerous place he complains that society cares too much about IQ in selection even tho he admits that there is substantial evidence for it works. He also admits that there is no standard test for rationality and cites no evidence that selecting for rationality will improve outcomes (e.g. job performance, GPA in college, prevention of drop-out in training programs), it is difficult to see what he has to complain about. He should have been less bombastic. Yes, we should try rationality measures, but calling for wide scale use before proper validation is very premature.



MacCann, C., Joseph, D. L., Newman, D. A., & Roberts, R. D. (2014). Emotional intelligence is a second-stratum factor of intelligence: Evidence from hierarchical and bifactor models. Emotion, 14(2), 358.

Gender distribution of comedians over time

It is a long time ago since I did this project. I did not write about it here before but it is a pity since the results are thus not ‘out there’. I put the project page here in 2012 (!). In short, I wrote python code to crawl Wikipedia lists. I figured out a way to decide whether a person was male or female. This was done using gendered pronouns which exist in English. I.e., the crawler fetches the full-text of the article, and counts “he”, “his”, “him”, “she”, “her”. It assigns the gender with the most pronouns. This method seems rather reliable in my informal testing.

I specifically wrote it to look at comedians because I had read a study of comedians (Greengross et al 2012). They gave personality and a vocabulary test (from the Multidimensional Aptitude Battery, r=.62 with WAIS-R) to a sample of 31 comedians and psychology 400 students. The comedians scored 1.34 d above the students. Some care must be taken with this result. The comedians were much older and vocabulary raw scores go up with age (mean age 38.9 vs. 20.5). The authors do not state that they were age-corrected. Psychology students are not very bright and this was a sample from New Mexico with lots of Hispanics. We can safely conclude that comedians are smarter than the student body and the general population of New Mexico, but can’t say much about exactly. We can hazard a guess at student body (maybe 107 IQ) + age corrected d (maybe 15 IQ), so we end with an estimate of 122 IQ.

There are various other tables of interest that don’t need much explaining, which I will paste below:


As of writing this, I found another older study (Janus, 1975). I will just quote:

The data to support the above theses were gathered through psychological case studies, in-depth interviews with many of the leading comedians in the United States today, and psychological tests. [n addition to a clinical interview, the instruments used were the Wechsler Adult Intelligence Scale, Machover Human Figure Drawing Test, graphological analysis, earliest memories, and recurring dreams.

Population consisted of 55 professional comedians. In order to be considered in this study, comedians had to be full-time professional stand-up comedians. Most of the subjects earned salaries of six figures or over, from comedy alone. In order to make the sample truly representative, each comedian had to be nationally known and had to have been in the field full time for at least ten years. The average time spent in full- time comedy for the subjects was twenty-five years. The group consisted of fifity-one men and four women. They represented all major religions, many geographic areas, and diverse socioeconomic backgrounds. Comedians were interviewed in New York, California, and points in between. Their socioeconomic backgrounds, family hierarchy, demographic information, religious influences, and analytic material were investigated. Of the population researched, 85 percent came from lower-class homes, 10 percent from lower-middle-class homes, and 5 percent from middle-class and upper-middle-class homes. All subjects participated voluntarily, received no remuneration, and were personally interviewed by the author.

I.Q. scores ranged from 115 to 160+. For a population at large, I.Q. scores in the average range are from 90 to 110. I.Q. scores in the bright-average range of intelligence, that is, from 10g to 115, were scored by only three subjects. The remainder scored above 125, with the mean score being 138. The vocabulary subtest was utilized. Several subjects approached it as a word-association test, but all regarded it as a challenge. Since these are verbal people, they were highly motivated. The problem was not one of getting them to respond, it was one of continuously allaying their anxiety, and re- assuring them they they were indeed doing well.

So, a very high mean was found. WAIS was published in 1955, so there is approximately 20 years of FLynn gains in raw scores, presumably uncorrected for. According to a new meta-analysis of FLynn gains (Trahan et al 2014), the mean gain is 2.31 per decade. So we are assuming about a gain of 4.6 IQ here. But then again, the verbal test for the students was published in 1984, so there may be some gain there as well (FLynn effects supposedly showed down recent in Western countries). Perhaps a net gain in favor of the old study by 4 IQ. In that case, we get estimates of 134 and 122. With samples of 31 and 55, different subtests, sampling procedure etc., this is surely reasonable. We can take a weighted mean and say best estimate for professional comedians is about 129.7, or about +2SD. It seems a bit wild, are comedians really on average as smart as fysicists?

EDIT: There is another study by Janus (1978). Same test:

[N=14] Intelligence scores ranged from 112 to 144 plus. (The range of average IQ is from 90 to 110.) Four subjects scored in the bright average range–i.e., 108 to 115. The remaining subjects scored above 118 with a mean score of 126. Two subjects scored above 130. The mean score for male comics was 138. The subjects approached the testing with overenthusiasm, in some cases bordering on frenzy. Despite the brightness of the group, all subjects needed constant reassurance and positive feedback.

So 126, with ~5 IQ because of FLynn effect. New weighted mean is 128.5 IQ.

Perhaps we should test it. If you want to test it with me, write me an email/tweet. We will design a questionnaire and give it to your local sample of comedians. One can e.g. try to convince professional comedian organizations (e.g. Danish here, N=35) to forward it to their members.

So what did I find?

I did the scraping twice. One time at first in 2012, and then again later when I was reminded of the project in May 2014. Now I have been reminded of it again. The very basic stats is that there were 1106 comedians found, of which the gender distribution was this (the “other” is unknown gender, which was 1 person).

What about the change over time? The code fetches their birth year if mentioned on their Wikipedia page. Then I limited the data to US comedians (66% of the sample). This was done because if we are looking for ways to explain it, we need to restrict ourselves to some more homogenous subset. What explains the change in gender distribution in Saudi Arabia at time t1 may not also explain it in Japan.

Next we get a common scientific conflict of interest: that between precision of estimate and detail. Essentially what we need is a moving average since most or all years have too few comedians for a reliable estimate (very zigzaggy lines on the plot). So we must decide how large a moving average to use. A larger will give more precision in estimate, but less detail. I decided to try a few different options (5, 10, 15, 20). To avoid extreme zigzagginess, I only plotted them if there were >=20 persons in the interval. This plots look like this:

So in general we see a decline in the proportion of male comedians. But it is not not going straight down. There is a local minimum in 1960 or so, and a local maximum in 1980 or so. How to explain these?

I tried abortion rate (not much data before 1973) and total fertility rate (plenty of data) but was not convinced by the results. One can also inflate or deflate the numbers according to which moving interval one chooses. One can even try all the possible sizes of intervals and the delays to see which gives the best match. I did some of this semi-manually using spreadsheets, but it has a very high chance of overfitting. One would need to do some programming to try all of them in a reasonable time.

I wrote some of this stuff in a paper, but never finished it. It can now be found at its OSF repository.


Newer dataset from May 2014.

Older dataset dated to 2012.

Python code. This includes code to crawl Wikipedia with and quite a lot of other raw data output files.


Greengross, G., Martin, R. A., & Miller, G. (2012). Personality traits, intelligence, humor styles, and humor production ability of professional stand-up comedians compared to college students. Psychology of Aesthetics, Creativity, and the Arts, 6(1), 74.

Janus, S. S. (1975). The great comedians: Personality and other factors. The American Journal of Psychoanalysis, 35(2), 169-174.

Janus, S. S., Bess, B. E., & Janus, B. R. (1978). The great comediennes: Personality and other factors. The American Journal of Psychoanalysis, 38(4), 367-372.

Trahan, L. H., Stuebing, K. K., Fletcher, J. M., & Hiscock, M. (2014). The Flynn effect: A meta-analysis.

The personal Jensen coefficient, useful for detecting teaching to the test?

In my previous paper, I examined whether a personal Jensen coefficient could predict GPA beyond the general factor (or just the normal summed score). I found this not to be the case for a Dutch university student sample (n ≈ 300). One thing I did find, however, was that the personal Jensen coefficient was correlate with the g factor: r=.35.

Moreover, Piffer’s alternative metric, the g advantage coefficient (g factor score minus unit-weighted score) had a very strong correlation with the summed score r=.88. This measure is arguably thus a more reliable measure.

While neither of these predicted GPA beyond g, they may have another use. When there is teaching to the test, the subtests that increase the most are those that are the least g-loaded (see this). So, this should have an effect on these two measures, making them weaker or negative: the highest scores tending to be on the least g-loaded subtests. Thus, it may be practically useful to detect cheating on tests, although perhaps only at the group level.

Unfortunately, I don’t have any dataset with test-retest gains or direct training, but one could simulate gains that are negatively related to the g loadings, and then calculate the personal Jensen coefficient and Piffer’s g advantage coefficient.

Maybe I will update this post with the results of such a simulation.

A general assortative mating factor?: An idea in need of a dataset

I was talking with my girlfriend about how on some areas we don’t match, and on others we match well, and then it occurred to me that there may be a general assortative mating factor. I.e. if one takes a lot of variables: personality, intelligence, socioeconomic, interests, then calculates the partner correlations, then one could maybe extract a somewhat general factor of many of these. The method is correlate the partner trait correlations with each other for each trait. I.e., are couples who are more similar in intelligence, also more similar in socioeconomic variables? Likely. Are they also more similar in interests? Maybe slightly? Are people who are more similar in height more similar in intelligence? Seems doubtful since the intelligence x height cor is only .2 or so. But maybe.

What is needed is a dataset with, say, >100 couples and >10 diverse variables of interest. Anyone know of such a dataset?

Predicting immigrant performance: Does inbreeding have incremental validity over IQ and Islam?

So, she came up with:

So I decided to try it out, since I’m taking a break from reading Lilienfeld which I had been doing that for 5 hours straight or so.

So the question is whether inbreeding measures have incremental validity over IQ and Islam, which I have previously used to examine immigrant performance in a number of studies.

So, to get the data into R, I OCR’d the PDF in Abbyy FineReader since this program allows for easy copying of table data by row or column. I only wanted column 1-2 and didn’t want to deal with the hassle of importing it with spreadsheet problems (which need a consistent separator, e.g. comma or space). Then I merged it with the megadataset to create a new version, 2.0d.

Then I created a subset of the data with variables of interest, and renamed them (otherwise results would be unwieldy). Intercorrelations are:

row.names Cousin% CoefInbreed IQ Islam
1 Cousin% 1.00 0.52 -0.59 0.78 -0.76
2 CoefInbreed 0.52 1.00 -0.28 0.40 -0.55
3 IQ -0.59 -0.28 1.00 -0.27 0.54
4 Islam 0.78 0.40 -0.27 1.00 -0.71
5 -0.76 -0.55 0.54 -0.71 1.00


Spearman’ correlations, which are probably better due to the non-normal data:

row.names Cousin% CoefInbreed IQ Islam
1 Cousin% 1.00 0.91 -0.63 0.67 -0.73
2 CoefInbreed 0.91 1.00 -0.55 0.61 -0.76
3 IQ -0.63 -0.55 1.00 -0.23 0.72
4 Islam 0.67 0.61 -0.23 1.00 -0.61
5 -0.73 -0.76 0.72 -0.61 1.00


The fairly high correlations of inbreeding measures with IQ and Islam mean that their contribution will likely be modest as incremental validity.

However, let’s try modeling them. I create 7 models of interest and compile the primary measure of interest from them, R2 adjusted, into an object. Looks like this:

row.names R2 adj.
1 ~ IQ+Islam 0.5472850
2 ~ IQ+Islam+CousinPercent 0.6701305
3 ~ IQ+Islam+CoefInbreed 0.7489312
4 ~ Islam+CousinPercent 0.6776841
5 ~ Islam+CoefInbreed 0.7438711
6 ~ IQ+CousinPercent 0.5486674
7 ~ IQ+CoefInbreed 0.4979552


So we see that either of them adds a fair amount of incremental validity to the base model (line 1 vs. 2-3). They are in fact better than IQ if one substitutes them in (1 vs. 4-5). They can also substitute for Islam, but only with about the same predictive power (1 vs 6-7).

Replication for Norway

Replication for science is important. Let’s try Norwegian data. The Finnish and Dutch data are well-suited for this (too few immigrant groups, few outcome variables i.e. only crime)

Pearson intercorrelations:

row.names CousinPercent CoefInbreed IQ Islam
1 CousinPercent 1.00 0.52 -0.59 0.78 -0.78
2 CoefInbreed 0.52 1.00 -0.28 0.40 -0.46
3 IQ -0.59 -0.28 1.00 -0.27 0.60
4 Islam 0.78 0.40 -0.27 1.00 -0.72
5 -0.78 -0.46 0.60 -0.72 1.00



row.names CousinPercent CoefInbreed IQ Islam
1 CousinPercent 1.00 0.91 -0.63 0.67 -0.77
2 CoefInbreed 0.91 1.00 -0.55 0.61 -0.71
3 IQ -0.63 -0.55 1.00 -0.23 0.75
4 Islam 0.67 0.61 -0.23 1.00 -0.47
5 -0.77 -0.71 0.75 -0.47 1.00


These look fairly similar to Denmark.

And the regression results:

row.names R2 adj.
1 ~ IQ+Islam 0.5899682
2 ~ IQ+Islam+CousinPercent 0.7053999
3 ~ IQ+Islam+CoefInbreed 0.7077162
4 ~ Islam+CousinPercent 0.6826272
5 ~ Islam+CoefInbreed 0.6222364
6 ~ IQ+CousinPercent 0.6080922
7 ~ IQ+CoefInbreed 0.5460777


Fairly similar too. If added, they have incremental validity (line 1 vs. 2-3). They perform better than IQ if substituted but not as much as in the Danish data (1 vs. 4-5). They can also substitute for Islam (1 vs. 6-7).

How to interpret?

Since inbreeding does not seem to have any direct influence on behavior that is reflected in the S factor, it is not so easy to interpret these findings. Inbreeding leads to various health problems and lower g in offspring, the latter which may have some effect. However, presumably, national IQs already reflect the lowered IQ from inbreeding, so there should be no additional effect there beyond national IQs. Perhaps inbreeding results in other psychological problems that are relevant.

Another idea is that inbreeding rates reflect non-g psychological traits that are relevant to adapting to life in Denmark. Perhaps it is a useful measure of clanishness, would be reflected in hostility towards integration in Danish society (such as getting an education, or lack of sympathy/antipathy towards ethnic Danes and resulting higher crime rates against them), which would be reflected in the S factor.

The lack of relatively well established causal routes for interpreting the finding makes me somewhat cautious about how to interpret this.


##Code for mergining cousin marriage+inbreeding data with megadataset
inbreed = read.table("clipboard", sep="\t",header=TRUE, row.names=1) #load data from clipboard
source("merger.R") #load mega functions
mega20d = read.mega("Megadataset_v2.0d.csv") #load latest megadataset
names = as.abbrev(rownames(inbreed)) #get abbreviated names
rownames(inbreed) = names #set them as rownames

#merge and save
mega20e = merge.datasets(mega20d,inbreed,1) #merge to create v. 2.0e
write.mega(mega20e,"Megadataset_v2.0e.csv") #save it

#select subset of interesting data = subset(mega20e, selec=c("Weighted.mean.consanguineous.percentage.HobenEtAl2010",
colnames( = c("CousinPercent","CoefInbreed","IQ","Islam","") #shorter var names
rcorr = rcorr(as.matrix( #correlation object
View(round(rcorr$r,2)) #view correlations, round to 2
rcorr.S = rcorr(as.matrix(,type = "spearman") #spearman correlation object
View(round(rcorr.S$r,2)) #view correlations, round to 2

#Multiple regression
library(QuantPsyc) #for beta coef
results = = NA, nrow=0, ncol = 1)) #empty matrix for results
colnames(results) = "R2 adj."
models = c(" ~ IQ+Islam", #base model,
           " ~ IQ+Islam+CousinPercent", #1. inbreeding var
           " ~ IQ+Islam+CoefInbreed", #2. inbreeding var
           " ~ Islam+CousinPercent", #without IQ
           " ~ Islam+CoefInbreed", #without IQ
           " ~ IQ+CousinPercent", #without Islam
           " ~ IQ+CoefInbreed") #without Islam

for (model in models){ #run all the models
  fit.model = lm(model, #fit model
  sum.stats = summary(fit.model) #summary stats object
  summary(fit.model) #summary stats
  lm.beta(fit.model) #standardized betas
  results[model,] = sum.stats$adj.r.squared #add result to results object
View(results) #view results

##Let's try Norway too = subset(mega20e, selec=c("Weighted.mean.consanguineous.percentage.HobenEtAl2010",

colnames( = c("CousinPercent","CoefInbreed","IQ","Islam","") #shorter var names
rcorr = rcorr(as.matrix( #correlation object
View(round(rcorr$r,2)) #view correlations, round to 2
rcorr.S = rcorr(as.matrix(,type = "spearman") #spearman correlation object
View(round(rcorr.S$r,2)) #view correlations, round to 2

results = = NA, nrow=0, ncol = 1)) #empty matrix for results
colnames(results) = "R2 adj."
models = c(" ~ IQ+Islam", #base model,
           " ~ IQ+Islam+CousinPercent", #1. inbreeding var
           " ~ IQ+Islam+CoefInbreed", #2. inbreeding var
           " ~ Islam+CousinPercent", #without IQ
           " ~ Islam+CoefInbreed", #without IQ
           " ~ IQ+CousinPercent", #without Islam
           " ~ IQ+CoefInbreed") #without Islam

for (model in models){ #run all the models
  fit.model = lm(model, #fit model
  sum.stats = summary(fit.model) #summary stats object
  summary(fit.model) #summary stats
  lm.beta(fit.model) #standardized betas
  results[model,] = sum.stats$adj.r.squared #add result to results object
View(results) #view results

Age differences in the WISC-IV has a positive Jensen coefficient, maybe

Group differences in cognitive scores have generally been found to be g-loaded, i.e. the differences are larger on the items/subtests that load more strongly on the general factor. This is generally called a Jensen effect, and its opposite an anti-Jensen effect. However this can cause linguistic trouble when dealing with (near)-zero correlations or when dealing with effects of unknown direction, at which point we don’t know if we should call them “Jensen effects” or “anti-Jensen effects”. For that reason, I use the term “Jensen coefficient” which can easily be referred to as positive, negative or near-zero.

Generally when studies report factor structure of cognitive data, they remove the effects of age and gender and do not generally report the correlations between age and subtests. Recently, I saw this paper about the standardization of the WISC-IV in Vietnam, where the authors do report them. They differ by subtest. So, this immediately leads someone like me to propose that the effect should be larger on the more g-loaded tests. This is based on the idea that as one grows up, one really gets smarter i.e. increases general intelligence. So the vector correlation should be positive. The Vietnamese study does however not report the g-loadings. So, I have resorted to getting these from some other papers on the same test, in the English language version.

The datafile is here. It has g-loadings from 6 papers yielding 8 estimates. Some papers report more than one because they model the data with more than one model. E.g. four-factor vs. five-factor hierarchical model. The correlations between the g-loadings of these studies and the subtest x age correlation from the Vietnamese study range between .272 and .528, with a median of .427 and mean of .422. If one uses the average g-loading across studies, the correlation with age x subtest is .441.* Using Spearman correlation, it is also .441.

wisc g-loading age

If one removes the Symbol Search outlier, Spearman r=.29, so the relationship is not entirely due to that.

As usual, this research is hampered by a lack of data sharing. 1000s of studies use the WISC and have age data too, but don’t share the data or report the necessary results so one can calculate the correlation. Furthermore, the relatively small selection of subtests make the MCV method error-prone. It would be much better if one had e.g. 20 subtests of more different g-loadings, e.g. reaction time tests.

It is also possible that a large change in some non-g ability can throw the MCV results off. General intelligence is probably not the only ability that changes as one grows up. MCV is sensitive to these other abilities changing too.

Where to go from here

Next steps:

  1. Find more studies reporting g-loadings of WISC-IV subtests.
  2. Find more studies that report age x subtest correlations.
  3. Find open datasets where (1-2) can be calculated.
  4. Write to authors and ask them if they can provide results for (1-2) or send data (3).
  5. Find other commonly used tests for children and do (1-4). Also interesting are age declines later on.

I have contacted some authors.

* Google Drive Sheets calculates the r as .439 instead. I don’t know why.


##R code for doing the analyses and plotting = read.table("clipboard", sep="\t",header=TRUE, row.names=1) #load data from clipboard
library(Hmisc) #needed for rcorr
rcorr(as.matrix( #get correlations
cor(, use="pair") #use the other function to verify
rcorr(as.matrix(, type = "spearman") #spearman

library(car) #for scatterplot
scatterplot(r.x.age ~ avg..g.loading,, smoother=FALSE, id.n=nrow(,
            main = "MCV: WISC-IV g-loading and subtest x age correlation\nSpearman r = .441",
            xlab = "Average g-loading (mean of 8 datapoints)",
            ylab = "Score x age (1 datapoint)") #plot it
wisc.data2 =[-10,] #exclude outlier symbol search
rcorr(as.matrix(wisc.data2), type = "spearman") #spearman


Bodin, D., Pardini, D. A., Burns, T. G., & Stevens, A. B. (2009). Higher order factor structure of the WISC-IV in a clinical neuropsychological sample. Child Neuropsychology, 15(5), 417-424.
Chen, H., Keith, T., Chen, Y., & Chang, B. (2009). What does the WISC-IV measure? Validation of the scoring and CHC-based interpretative approaches. Journal of Research in Education Sciences, 54(3), 85-108.
Keith, T. Z., Fine, J. G., Taub, G. E., Reynolds, M. R., & Kranzler, J. H. (2006). Higher order, multisample, confirmatory factor analysis of the Wechsler Intelligence Scale for Children—Fourth Edition: What does it measure. School Psychology Review, 35(1), 108-127.
Dang, H. M., Weiss, B., Pollack, A., & Nguyen, M. C. (2011). Adaptation of the Wechsler Intelligence Scale for Children-IV (WISC-IV) for Vietnam. Psychological studies, 56(4), 387-392.
Weiss, L. G., Keith, T. Z., Zhu, J., & Chen, H. (2013). WISC-IV and clinical validation of the four-and five-factor interpretative approaches. Journal of Psychoeducational Assessment, 31(2), 114-131.
Watkins, M. W. (2006). Orthogonal higher order structure of the Wechsler Intelligence Scale for Children–. Psychological Assessment, 18(1), 123.
Watkins, M. W., Wilson, S. M., Kotz, K. M., Carbone, M. C., & Babula, T. (2006). Factor structure of the Wechsler Intelligence Scale for Children–Fourth Edition among referred students. Educational and Psychological Measurement, 66(6), 975-983.

Intelligence, income inequality and prison rates: It’s complicated

There was some talk on Twitter around prison rates and inequality:

And IQ and inequality:

But then what about prison data beyond those given above? I have downloaded the newest data from here ICPS (rate data, not totals).

Now, what about all three variables?

#load mega20d as the datafile
ineqprisoniq = subset(mega20d, select=c("Fact1_inequality","LV2012estimatedIQ","PrisonRatePer100000ICPS2015"))
rcorr(as.matrix(ineqprisoniq),type = "spearman")
                            Fact1_inequality LV2012estimatedIQ PrisonRatePer100000ICPS2015
Fact1_inequality                        1.00             -0.51                        0.22
LV2012estimatedIQ                      -0.51              1.00                        0.16
PrisonRatePer100000ICPS2015             0.22              0.16                        1.00

                            Fact1_inequality LV2012estimatedIQ PrisonRatePer100000ICPS2015
Fact1_inequality                         275               119                         117
LV2012estimatedIQ                        119               275                         193
PrisonRatePer100000ICPS2015              117               193                         275

So IQ is slightly positively related to prison rates and so is equality. Positive? Isn’t it bad having people in prison? Well, if the alternative is having them dead… because the punishment for most crimes is death. Although one need not be excessive as the US is. Somewhere in the middle is perhaps best?

What if we combine them into a model?

model = lm(PrisonRatePer100000ICPS2015 ~ Fact1_inequality+LV2012estimatedIQ,ineqprisoniq)
summary = summary(model)
prediction =
colnames(prediction) = "Predicted"
ineqprisoniq = merge.datasets(ineqprisoniq,prediction,1)
scatterplot(PrisonRatePer100000ICPS2015 ~ Predicted, ineqprisoniq,
> summary

lm(formula = PrisonRatePer100000ICPS2015 ~ Fact1_inequality + 
    LV2012estimatedIQ, data = ineqprisoniq)

    Min      1Q  Median      3Q     Max 
-153.61  -75.05  -31.53   44.62  507.34 

                  Estimate Std. Error t value Pr(>|t|)   
(Intercept)       -116.451     88.464  -1.316  0.19069   
Fact1_inequality    31.348     11.872   2.640  0.00944 **
LV2012estimatedIQ    3.227      1.027   3.142  0.00214 **
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 113.6 on 114 degrees of freedom
  (158 observations deleted due to missingness)
Multiple R-squared:  0.09434,	Adjusted R-squared:  0.07845 
F-statistic: 5.938 on 2 and 114 DF,  p-value: 0.003523

> lm.beta(model)
Fact1_inequality LV2012estimatedIQ 
        0.2613563         0.3110241

This is a pretty bad model (var%=8), but the directions held from before but were stronger. Standardized betas .25-.31. The R2 seems to be awkwardly low to me given the betas.

More importantly, the residuals are clearly not normal as can be seen above. The QQ-plot is:


It is concave, so data distribution isn’t normal. To get diagnostic plots, simply use “plot(model)”.

Perhaps try using rank-order data:

ineqprisoniq =,2,rank,na.last="keep")) #rank order the data

And then rerunning model gives:

> summary

lm(formula = PrisonRatePer100000ICPS2015 ~ Fact1_inequality + 
    LV2012estimatedIQ, data = ineqprisoniq)

     Min       1Q   Median       3Q      Max 
-100.236  -46.753   -8.507   46.986  125.211 

                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.08557   18.32052   0.059    0.953    
Fact1_inequality   0.84766    0.16822   5.039 1.78e-06 ***
LV2012estimatedIQ  0.50094    0.09494   5.276 6.35e-07 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 54.36 on 114 degrees of freedom
  (158 observations deleted due to missingness)
Multiple R-squared:  0.2376,	Adjusted R-squared:  0.2242 
F-statistic: 17.76 on 2 and 114 DF,  p-value: 1.924e-07

> lm.beta(model)
 Fact1_inequality LV2012estimatedIQ 
        0.4757562         0.4981808

Much better R2, directions the same but betas are stronger, and residuals look normalish from the above. QQ plot shows them not to be even now.


Prediction plots based off the models:

prison prison_rank

So is something strange going on with the IQ, inequality and prison rates? Perhaps something nonlinear. Let’s plot them by IQ bins:

bins = cut(unlist(ineqprisoniq["LV2012estimatedIQ"]),5) #divide IQs into 5 bins
ineqprisoniq["IQ.bins"] = bins
plotmeans(PrisonRatePer100000ICPS2015 ~ IQ.bins, ineqprisoniq,
          main = "Prison rate by national IQ bins",
          xlab = "IQ bins (2012 data)", ylab = "Prison rate per 100000 (2014 data)")


That looks like “bingo!” to me. We found the pattern.

What about inequality? The trouble is that the inequality data is horribly skewed with almost all countries have a low and near identical inequality compared with the extremes. The above will (does not) work well. I tried with different bins numbers too. Results look something like this:

bins = cut(unlist(ineqprisoniq["Fact1_inequality"]),5) #divide IQs into 5 bins
ineqprisoniq["inequality.bins"] = bins
plotmeans(PrisonRatePer100000ICPS2015 ~ inequality.bins, ineqprisoniq,
          main = "Prison rate by national inequality bins",
          xlab = "inequality bins", ylab = "Prison rate per 100000 (2014 data)")


So basically, the most equal countries to the left have low rates, somewhat higher in the unequal countries within the main group and varying and on average lowish among the very unequal countries (African countries without much infrastructure?).

Perhaps this is why the Equality Institute limited their analyses to the group on the left, otherwise they don’t get the nice clear pattern they want. One can see it a little bit if one uses a high number of bins and ignores the groups to the right. E.g. 10 bins:


Among the 3 first groups, there is a slight upward trend.

Review: Race (John Baker)

I had seen references to this book in a number of places which got me curious. I am somewhat hesitant to read older books since I know much of what they discuss is dated and has been superseded by newer science. Sometimes, however, science (or the science culture) has gone wrong so one may actually learn more reading an older book than a newer one. Since fewer people read older books, one can sometimes find relevant but forgotten facts in them. Lastly, they can provide much needed historical information about the development of thinking about some idea or of some field. All of these remarks are arguably relevant to the race/population genetics controversy.

Still, I did not read the book immediately altho I had a PDF of it. I ended up starting to read it more or less at random due to a short talk I had with John Fuerst about it (we are writing together on racial admixture, intelligence and socioeconomic outcomes in the Americas and also wrote a paper on immigrant performance in Denmark).

So, the book really is dated. It spends hundreds of pages on arcane fysical anthropology which requires one to master human anatomy. Most readers don’t master this discipline, so these parts of the book are virtually un-understandable. However, they do provide one with the distinct impression of how one did fysical anthropology in old times. Lots of observations of cranium, other bones, noses, eyes+lids, teeth, lips, buttocks, etc., and then try to find clusters in these data manually. No wonder they did not reach that high agreement. The data are too scarce to find clusters and humans not sufficiently good at cluster analysis at the intuitive level. Still, they did notice some patterns that are surely correct, such as the division between various African populations, Ainu vs. Japanese, that Europeans are Asians are closer related, that Afghans etc. belong to the European supercluster etc. Clearly, these pre-genetic ideas were not all totally wrong headed. Here’s the table of Races+Subraces from the end of the book. They seem reasonably in line with modern evidence.


Some quotes:

The story of 7 ‘kinds’ of mosquitoes.

[Dobzhansky’s definition = ‘Species in sexual cross-fertilizing organisms can be defined as groups of populations which are reproductively isolated to the extent that the exchange of genes between them is absent or so slow that the genetic differences are not diminished or swamped.’]

Strict application of Dobzhansky’s definition results in certain very similar animals being assigned to different species. The malarial mosquitoes and their relatives provide a remarkable example of this. The facts are not only extreme­ly interesting from the purely scientific point of view, but also of great practical importance in the maintenance of public health in malarious districts. It was discovered in 1920 that one kind of the genus Anopheles, called elutus, could be distinguished from the well-known malarial mosquito, A. maculipennis, by certain minute differences in the adult, and by the fact that its its eggs looked different; but for our detailed knowledge of this subject we are mainly indebted to one Falleroni, a retired inspector of public health in Italy, who began in 1924 to breed Anopheles mosquitoes as a hobby. He noticed that several different kinds of eggs could be distinguished, that the same female always laid eggs having the same appearance, and that adult females derived from those eggs produced eggs of the same type. He realized that although the adults all appeared similar, there were in fact several different kinds, which he could recognize by the markings on their eggs. Falleroni named several different kinds after his friends, and the names he gave are the accepted ones today in scientific nomenclature.

It was not until 1931 that the matter came to the attention of L. W. Hackett, who, with A. Missiroli, did more than anyone else to unravel the details of this curious story.(449,447.448] The facts are these. There are in Europe six different kinds of Anopheles that cannot be distinguished with certainty from one another in the adult state, however carefully they are examined under the microscope by experts; a seventh kind, elutus, can be distinguished by minor differences if its age is known. The larvae of two of the kinds can be distinguished from one another by minute differences (in the type of palmate hair on the second segment, taken in conjunction with the number of branches of hair no. 2 on the fourth and fifth segments). Other supposed differences between the kinds, apart from those in the eggs, have been shown to be unreal.

In nature the seven kinds are not known to interbreed, and it is therefore necessary, under Dobzhansky’s definition, to regard them all as separate species.

The mates of six of the seven species have the habit of ‘swarming’ when ready to copulate. They join in groups of many individuals, humming, high in the air; suddenly the swarm bursts asunder and rejoins. The females recognize the swarms of males of their own species, and are attracted towards them. Each female dashes in, seizes a male, and flies off, copulating.

With the exceptions mentioned, the only visible differences between the species occur at the egg-stage. The eggs of six of the seven species are shown in Fig. 8 (p. 76).

6 anopheles

It will be noticed that each egg is roughly sausage-shaped, with an air-filled float at each side, which supports it in the water in which it is laid. The eggs of the different species are seen to differ in the length and position of the floats. The surface of the rest of the egg is covered all over with microscopic finger-shaped papillae, standing up like the pile of a carpet. It is these papillae that are responsible for the distinctive patterns seen on the eggs of the different species. Where the papillae are long and their tips rough, light is reflected to give a whitish appearance; where they are short and smooth, light passes through to reveal the underlying surface of the egg, which is black. The biological significance of these apparently trivial differences is unknown.

From the point of view of the ethnic problem the most interesting fact is this. Although the visible differences between the species are trivial and confined or almost confined to the egg-stage, it is evident that the nervous and sensory systems are different, for each species has its own habits. The males of one species (atroparvus) do not swarm. It has already been mentioned that the females recognize the males of their own species. Some of the species lay their eggs in fresh water, others in brackish. The females of some species suck the blood of cattle, and are harmless to man; those of other species suck the blood of man, and in injecting their saliva transmit malaria to him.

Examples could be quoted of other species that are distinguishable from one another by morphological differences no greater than those that separate the species of Anopheles; but the races of a single species—indeed, the subraces of a single race—are often distinguished from one another, in their typical forms, by obvious differences, affecting many parts of the body. It is not the case that species are necessarily very distinct, and races very similar. [p. 74ff]

Nature is very odd indeed! More on Wiki.

Some very strange examples of abnormalities of this sort have been recorded by reputable authorities. Buffon quotes two examples of an ‘amour violent’ between a dog and a sow. In one case the dog was a large spaniel on the property of the Comte de Feuillee, in Burgundy. Many persons witnessed ‘the mutual ardour of these two animals; the dog even made prodigious and oft-repeated efforts to copulate with the sow, but the unsuitability of their reproductive organs prevented their union.’ Another example, still more remarkable, occurred on Buffon’s own property. A miller kept a mare and a bull in the same stable. These two animals developed such a passion for one another that on all occasions when the mare was on heat, over a period of several years, the bull copulated with her three or four times a day, whenever he was free to do so. The act was witnessed by all the inhabitants of the place. [p. 92]

Of smelly Japanese:

There is, naturally enough, a correlation between the development of the axillary organ and the smelliness of the secretion of this gland (and probably this applies also to the a glands of the genito-anal region). Briefly, the Europids and Negrids are smelly, the Mongolids scarcely or not at all. so far as the axillary secretion is concerned. Adachi. who has devoted more study to this subject than anyone else, has summed up his findings in a single, short sentence: ‘The Mongolids are essentially an odourless or very slightly smelly race with dry ear-wax.’(5] Since most of the Japanese are free or almost free from axillary smell, they are very sensitive to its presence, of which they seem to have a horror. About 10% of Japanese have smelly axillae. This is attributed to remote Ainuid ancestry, since the Ainu are invariably smelly, like most other Europids, and a tendency to smelliness is known to be inherited among the Japanese. 151 The existence of the odour is regarded among Japanese as a disease, osmidrosis axillae which warrants (or used to warrant) exemption from military service. Certain doctors specialize in its treatment, and sufferers are accustomed to enter hospital. [p. 173]

Japan always take these things to a new level.

Measurements of adult stature, made on several thousand pairs of persons, show a rather close correspondence with these figures, namely, 0 507, 0-322, 0-543, and 0-287 respectively.(172) It will be noticed that the correlations are all somewhat higher than one would expect; that is to say, the members of each pair are, on average, rather more nearly of the same height than the simple theory would suggest. This is attributed in the main to the tendency towards assortative mating, the reality of which had already been recognized by Karl Pearson and Miss Lee in their paper published in 1903. [p. 462]

I didn’t know assortative mating was recognized so far back. This may be a good source to understand the historical development of understanding of assortative mating.

The reference is: Pearson, K. &  Lee,  A.,  1903.  ‘On  the  laws  of  inheritance  in  man.  I.  Inheritance  of  physical characters.’  Biometrika,  2, 357—462.

Definition of intelligence?

What has been said on p. 496 may now be rewritten in the form of a short definition of intelligence, in the straightforward, everyday sense of that word. It is the ability to perceive, comprehend, and reason, combined with the capacity to choose worth-while subjects for study, eagerness to acquire, use, transmit, and (if possible) add to knowledge and understanding, and the faculty for sustained effort towards these ends (cf. p. 438). One might say briefly that a person is intelligent in so far as his cognitive ability and personality tend towards productiveness through mental activity. [p. 495ff]

Baker prefers a broader definition of “intelligence” which includes certain non-cognitive parts. He uses “cognitive ability” like many people do now a days use “general cognitive ability”.

And now surely at the end of the book, the evil master-racist privileged white male John Baker tells us what to do with the information we just learned in the book:

Here, on reaching the end of the book, 1 must repeat some words that I wrote years ago when drafting the Introduction (p. 6), for there is nothing in the whole work that would tend to contradict or weaken them:
Every ethnic taxon of man includes many persons capable of living responsible and useful lives in the communities to which they belong, while even in those taxa that are best known for their contributions to the world’s store of intellectual wealth, there are many so mentally deficient that they would be inadequate members of any society. It follows that no one can claim superiority simply because he or she belongs to a particular ethnic taxon. [p. 534]

So, clearly according to our anti-racist heroes, Baker tells us to revel in our (sorry Jayman if you are reading!) European master ancestry, right?

edited: removed joke because public image -_-

Do different intelligence tests measure the same g?

The most direct way to test this is to employ latent variable modeling (SEM/CFA) and correlate the general factors from different IQ batteries. Below I quote the abstracts from all such studies I am aware of.


Johnson, W., Bouchard Jr, T. J., Krueger, R. F., McGue, M., & Gottesman, I. I. (2004). Just one g: consistent results from three test batteries. Intelligence, 32(1), 95-107.

The concept of a general intelligence factor or g is controversial in psychology. Although the controversy swirls at many levels, one of the most important involves g’s identification and measurement in a group of individuals. If g is actually predictive of a range of intellectual performances, the factor identified in one battery of mental ability tests should be closely related to that identified in another dissimilar aggregation of abilities. We addressed the extent to which this prediction was true using three mental ability batteries administered to a heterogeneous sample of 436 adults. Though the particular tasks used in the batteries reflected varying conceptions of the range of human intellectual performance, the g factors identified by the batteries were completely correlated (correlations were .99, .99, and 1.00). This provides further evidence for the existence of a higher-level g factor and suggests that its measurement is not dependent on the use of specific mental ability tasks.

Johnson, W., Nijenhuis, J. T., & Bouchard Jr, T. J. (2008). Still just 1g: Consistent results from five test batteries. Intelligence, 36(1), 81-95.

In a recent paper, Johnson, Bouchard, Krueger, McGue, and Gottesman (2004) addressed a long-standing debate in psychology by demonstrating that the g factors derived from three test batteries administered to a single group of individuals were completely correlated. This finding provided evidence for the existence of a unitary higher-level general intelligence construct whose measurement is not dependent on the specific abilities assessed. In the current study we constructively replicated this finding utilizing five test batteries. The replication is important because there were substantial differences in both the sample and the batteries administered from those in the original study. The current sample consisted of 500 Dutch seamen of very similar age and somewhat truncated range of ability. The batteries they completed included many tests of perceptual ability and dexterity, and few verbally oriented tests. With the exception of the g correlations involving the Cattell Culture Fair Test, which consists of just four matrix reasoning tasks of very similar methodology, all of the g correlations were at least .95. The lowest g correlation was .77. We discuss the implications of this finding.

The Cattell battery is a nonverbal battery with only 4 subtests. Lower correlation likely due to psychometric sampling error.


Keith, T. Z., Kranzler, J. H., & Flanagan, D. P. (2001). What does the Cognitive Assessment System (CAS) measure? Joint confirmatory factor analysis of the CAS and the Woodcock-Johnson Tests of Cognitive Ability. School Psychology Review, 30(1), 89-119.

Results of recent research by Kranzler and Keith (1999) raised important questions concerning the construct validity of the Cognitive Assessment System (CAS; Naglieri & Das, 1997), a new test of intelligence based on the planning, attention, simultaneous, and sequential (PASS) processes theory of human cognition. Their results indicated that the CAS lacks structural fidelity, leading them to hypothesize that the CAS Scales are better understood from the perspective of Cattell-Horn-Carroll (CHC) theory as measures of psychometric g, processing speed, short-term memory span, and fluid intelligence/broad visualization. To further examine the constructs measured by the CAS, this study reports the results of the first joint confirmatory factor analysis (CFA) of the CAS and a test of intelligence designed to measure the broad cognitive abilities of CHC theory—the Wood-cock-Johnson Tests of Cognitive Abilities-3rd Edition (WJ III; Woodcock, McGrew, & Mather, 2001). In this study, 155 general education students between 8 and 11 years of age (M = 9.81) were administered the CAS and the WJ III. A series of joint CFA models was examined from both the PASS and the CHC theoretical perspectives to determine the nature of the constructs measured by the CAS. Results of these analyses do not support the construct validity of the CAS as a measure of the PASS processes. These results, therefore, question the utility of the CAS in practical settings for differential diagnosis and intervention planning. Moreover, results of this study and other independent investigations of the factor structure of preliminary batteries of PASS tasks and the CAS challenge the viability of the PASS model as a theory of individual differences in intelligence.

The correlation between g factors was .98.

Floyd, R. G., Bergeron, R., Hamilton, G., & Parra, G. R. (2010). How do executive functions fit with the Cattell–Horn–Carroll model? Some evidence from a joint factor analysis of the Delis–Kaplan executive function system and the Woodcock–Johnson III tests of cognitive abilities. Psychology in the Schools, 47(7), 721-738.

This study investigated the relations among executive functions and cognitive abilities through a joint exploratory factor analysis and joint confirmatory factor analysis of 25 test scores from the Delis–Kaplan Executive Function System and the Woodcock–Johnson III Tests of Cognitive Abilities. Participants were 100 children and adolescents recruited from general education classrooms. Principal axis factoring followed by an oblique rotation yielded a six-factor solution. The Schmid–Leiman transformation was then used to examine the relations between specific cognitive ability factors and a general factor. A variety of hypothesis-driven models were also tested using confirmatory factor analysis. Results indicated that all tests measure the general factor, and 24 tests measure at least one of five broad cognitive ability factors outlined by the Cattell–Horn–Carroll theory of cognitive abilities. These results, with limitations considered, add to the body of evidence supporting the confluence of measures of executive functions and measures of cognitive abilities derived from individual testing.

Correlations between latent g’s were .99 and 1.00.

Floyd, R. G., Reynolds, M. R., Farmer, R. L., Kranzler, J. H., & Volpe, R. (2013). Are the General Factors From Different Child And Adolescent Intelligence Tests the Same? Results From a Five-Sample, Six-Test Analysis. School Psychology Review, 42(4).
Psychometric g is the largest, most general, and most predictive factor underlying individual differences across cognitive tasks included in intelligence tests. Given that the overall score from intelligence tests is interpreted as an index of psychometric g, we examined the correlations between general factors extracted from individually administered intelligence tests using data from five samples of children and adolescents (n = 83 to n = 200) who completed at least two of six intelligence tests. We found strong correlations between the general factors indicating that these intelligence tests measure the same construct, psychometric g. A total of three general-factor correlations exceeded .95, but two other correlations were somewhat lower (.89 and .92). In addition, specific ability factors correlated highly across tests in most (but not all) cases. School psychologists and other professionals should know that psychometric g and several specific abilities are measured in remarkably similar ways across a wide array of intelligence tests.

Lower results may be due to sampling error and temporal changes related to growth.

Other constructs

Results like these are not limited to the general intelligence construct. Using ordinary explorative factor analysis, I found that the general factor (S factor) extracted from the 54 variables of the Social Progress Index correlated .98 with that extracted from the 42 variable Democracy Index. Perhaps this would be closer to 1.0 had I used structural equation modeling (a question left for another time).

Kirkegaard, E. O. W. (2014). The international general socioeconomic factor: Factor analyzing international rankings. Open Differential Psychology.

The inconsistency of studies of gender differences in cognitive abilities: due to using different methods?

I read this study:

Palejwala, M. H., & Fine, J. G. (2015). Gender differences in latent cognitive abilities in children aged 2 to 7. Intelligence, 48, 96-108.
It reminded me that these studies rarely test what happens when one uses a bunch of different methods on the same dataset. Nyborg (2003) wrote years ago that different results were probably to a high degree due to this. There seems to have been no change in research practices since then.
Due to the lack of data sharing, it is generally not possible for researchers inclined to perform a such study on the data they used. One is limited to either gathering some data oneself or finding some open dataset.
Wicherts and Bakker (2012) have provided researchers with an open dataset. It is not at all perfect: unrepresentative; psych students, young and mostly female, and medium-sized; N=400-500 (depending on treatment of missing data).
Obviously, the study cannot be used to determine the size of any difference in the general population. However, it should be sufficient for researchers to see how different methods compare. One can try summing latent variables. One can use EFA in various ways (different extraction methods, Schmid-Leiman transformed or not, hierarchical or not) to extract latent traits and compare the score means and variances. One can do it with CFA/SEM with different models (hierarchical, bi-factor). How would all these results compare? The data is public, so who wants to do this study with me?
The closest study of this kind is maybe Steinmayr et al 2010. But it used non-public data, and did not use all the available methods. For instance, it did not use latent models with a g factor at all, only 5 primary factors (?!).
The above is not to say that using different samples does not alter results too. There are various ways of excluding participants, e.g. for handicaps (fysical, mental, both) which surely change both means and variances. Worse, most data concern only a rather small number of subtests (because they rely on commercial tests, also bad!) which were often picked to minimize gender differences (or at least balance them out which means that total summed score is useless).
It would be better if some super/master dataset could be collected with say 60 very different mental tests: elementary cognitive tests, Piagetian (do these work on adults? or too strong ceiling effect?), matrix, vocabulary, number series, picture completion, analogies, digit span, learning tests, maze solving, get inspiration from Jensen (1980, Chapter 4), … with a minimum of 3 per type of test so a latent group/type factor can be estimated if one exists) and then all researchers could study the same dataset. This is how science should work. Methods and data must be completely open.
Nyborg, H. (2003). Sex differences in g. The scientific study of general intelligence, 187-222.
Wicherts, J. M., & Bakker, M. (2012). Publish (your data) or (let the data) perish! Why not publish your data too?. Intelligence, 40(2), 73-76.
Jensen, A. R. (1980). Bias in mental testing.
Steinmayr, R., Beauducel, A., & Spinath, B. (2010). Do sex differences in a faceted model of fluid and crystallized intelligence depend on the method applied?. Intelligence, 38(1), 101-110.