Review: What Intelligence Tests Miss: The Psychology of Rational Thought (Stanovich, 2009)

MOBI on libgen

I’ve seen this book cited quite a few times, and when looking for what to read next, it seemed like an okay choice. The book is written in typical pop-science style: no crucial statistical information about the studies is mentioned, so it is impossible for the skeptical reader to know which claims to believe and which not to.

For instance, he spends quite a while talking about how IQ/SAT etc. do not correlate strongly with rationality measures. Rarely does he mention the exact effect size. He does not mention whether it is measured as a correlation of IQ with single-item rationality measures. Single items have lower reliability, which reduces correlations, and are usually dichotomous, which also lowers (Pearson) correlations (simulation results here; TL;DR multiply by 1.266 for dichotomous items). He does not say whether the subjects were university students, which lowers correlations since they are selected for g and rationality (maybe). The OKCupid dataset happens to contain a number of rationality items (e.g. astrology), and I have already noted on Twitter that these correlate with g in the expected direction (religiousness).
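The dichotomization attenuation is easy to check by simulation. A minimal sketch with made-up data (not the simulation linked above): dichotomize one variable of a bivariate normal pair at its median and compare the resulting Pearson (point-biserial) correlation with the continuous one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 100_000, 0.5

# bivariate normal pair with true correlation rho
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

r_full = np.corrcoef(x, y)[0, 1]           # continuous-continuous r
y_dich = (y > np.median(y)).astype(float)  # median split, as for a pass/fail item
r_dich = np.corrcoef(x, y_dich)[0, 1]      # point-biserial r

print(r_full / r_dich)  # ~1.25 for a median split, close to the 1.266 cited
```

The theoretical factor for a median split is about 1.253; splits away from the median attenuate more, which is presumably why the simulated average comes out a bit higher.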

Otherwise the book feels like reading Kahneman’s Thinking, Fast and Slow. It covers the most well-known heuristics and how they sometimes lead us astray (representativeness, ease of recall, framing effect, status quo bias, planning bias, etc.).

The book can be read by researchers with some gain in knowledge, but don’t expect that much. For the serious newcomer, it is better to read a textbook on the topic (unfortunately, I don’t know any, as I have yet to read one myself — regrettably!). For the curious layperson, I guess it is okay.

Therefore, it came as something of a surprise when scores on various college placement exams and Armed Forces tests that the president had taken over the years were converted into an estimated IQ score. The president’s [Bush #2] score was approximately 120, roughly the same as that of Bush’s opponent in the 2004 presidential election, John Kerry, when Kerry’s exam results from young adulthood were converted into IQ scores using the same formulas. These results surprised many critics of the president (as well as many of his supporters), but I, as a scientist who studies individual differences in cognitive skills, was not surprised.

Virtually all commentators on the president’s cognition, including sympathetic commentators such as his onetime speechwriter David Frum, admit that there is something suboptimal about the president’s thinking. The mistake they make is assuming that all intellectual deficiencies are reflected in a lower IQ score.

In a generally positive portrait of the president, Frum nonetheless notes that “he is impatient and quick to anger; sometimes glib, even dogmatic; often uncurious and as a result ill-informed” (2003, p. 272). Conservative commentator George Will agrees, when he states that in making Supreme Court appointments, the president “has neither the inclination nor the ability to make sophisticated judgments about competing approaches to construing the Constitution” (2005, p. 23).

Seems fishy. One obvious idea is that he has suffered some kind of brain damage since his recorded score. Since the estimate is based on a SAT score, it is possible that he had considerable help on the SAT. It is true that SAT prepping does not generally work well and has diminishing returns, but surely Bush had quite a lot of help, as he comes from a very rich and prestigious family. (I once read a recent meta-analysis of SAT prepping/coaching, but I can’t find it again. The mean effect size was about .25 d, which corresponds to 3.75 IQ.)


Actually, we do not have to speculate about the proportion of high-IQ people with these beliefs. Several years ago, a survey of paranormal beliefs was given to members of a Mensa club in Canada, and the results were instructive. Mensa is a club restricted to high-IQ individuals, and one must pass IQ-type tests to be admitted. Yet 44 percent of the members of this club believed in astrology, 51 percent believed in biorhythms, and 56 percent believed in the existence of extraterrestrial visitors, all beliefs for which there is not a shred of evidence.

Seems fishy too. Maybe Mensa just attracts irrational smart people. I know someone who is in Danish Mensa, so I can perhaps do a new survey.

Rational thinking errors appear to arise from a variety of sources; it is unlikely that anyone will propose a psychometric g of rationality. Irrational thinking does not arise from a single cognitive problem, but the research literature does allow us to classify thinking into smaller sets of similar problems. Our discussion so far has set the stage for such a classification system, or taxonomy. First, though, I need to introduce one additional feature in the generic model of the mind outlined in Chapter 3.

But that is exactly what I will propose. What is the factor structure of rationality? Is there a general factor? Is it hierarchical? Is rationality perhaps a second-order factor of g? I take inspiration from the study of ‘Emotional Intelligence’ as a second-stratum factor (MacCann et al., 2014).

The next category (defaulting to the autonomous mind and not engaging at all in Type 2 processing) is the most shallow processing tendency of the cognitive miser. The ability to sustain Type 2 processing is of course related to intelligence. But the tendency to engage in such processing or to default to autonomous processes is a property of the reflective mind that is not assessed on IQ tests. Consider the Levesque problem (“Jack is looking at Anne but Anne is looking at George”) as an example of avoiding Type 2 processing. The subjects who answer this problem correctly are no higher in intelligence than those who do not, at least in a sample of university students studied by Maggie Toplak in my own laboratory.

This sure does sound like a single correct/wrong item correlated with IQ scores in a group selected for g. He says “no higher”, but perhaps his sample was too small and what he meant was that the difference was not significant. Samples for this kind of study are usually pretty small.
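How weak a single pass/fail item is for detecting an IQ correlation can itself be simulated. A rough sketch with hypothetical parameters (true latent correlation of .20 with IQ, dichotomized into one item, n = 80 students): the share of samples reaching p < .05 stays well below 1.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, rho, reps = 80, 0.20, 2000  # small sample, modest true latent correlation

hits = 0
for _ in range(reps):
    iq = rng.standard_normal(n)
    # latent tendency to solve the problem, correlated rho with IQ
    latent = rho * iq + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    solved = (latent > 0).astype(float)  # dichotomized into one pass/fail item
    r, p = stats.pearsonr(iq, solved)
    hits += p < 0.05

power = hits / reps
print(power)  # mostly non-significant despite a real underlying correlation
```

So a true effect of the size discussed above would routinely come out as “no higher in intelligence” in samples of this size.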

Theoretically, one might expect a positive correlation between intelligence and the tendency of the reflective mind to initiate Type 2 processing because it might be assumed that those of high intelligence would be more optimistic about the potential efficacy of Type 2 processing and thus be more likely to engage in it. Indeed, some insight tasks do show a positive correlation with intelligence, one in particular being the task studied by Shane Frederick and mentioned in Chapter 6: A bat and a ball cost $1.10 in total. The bat costs $1 more than the ball. How much does the ball cost? Nevertheless, the correlation between intelligence and a set of similar items is quite modest, .43-.46, leaving plenty of room for performance dissociations of the type that define dysrationalia. Frederick has found that large numbers of high-achieving students at MIT, Princeton, and Harvard when given this and other similar problems rely on this most primitive of cognitive miser strategies.

The sum of the 3 CRT items (one mentioned above) correlated r=.50 with the 16-item ICAR sample test in my student data (age ~18, n=72). These items do not perform differently when factor analyzed together with the entire item set.

In numerous places he complains that society cares too much about IQ in selection, even though he admits that there is substantial evidence that it works. Since he also admits that there is no standard test for rationality, and cites no evidence that selecting for rationality will improve outcomes (e.g. job performance, college GPA, prevention of drop-out in training programs), it is difficult to see what he has to complain about. He should have been less bombastic. Yes, we should try rationality measures, but calling for wide-scale use before proper validation is very premature.



MacCann, C., Joseph, D. L., Newman, D. A., & Roberts, R. D. (2014). Emotional intelligence is a second-stratum factor of intelligence: Evidence from hierarchical and bifactor models. Emotion, 14(2), 358.

Gender distribution of comedians over time

It has been a long time since I did this project. I did not write about it here before, which is a pity since the results are thus not ‘out there’. I put the project page here in 2012 (!). In short, I wrote Python code to crawl Wikipedia lists. I figured out a way to decide whether a person was male or female, using the gendered pronouns of English: the crawler fetches the full text of the article and counts “he”, “his”, “him” vs. “she”, “her”, then assigns the gender with the most pronouns. This method seems rather reliable in my informal testing.
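A minimal sketch of that pronoun-counting heuristic (not the original crawler code; the function name is mine):

```python
import re

def infer_gender(article_text: str) -> str:
    """Crude gender guess from pronoun counts in an article's full text:
    whichever pronoun set occurs more often wins."""
    words = re.findall(r"[a-z]+", article_text.lower())
    male = sum(w in {"he", "his", "him"} for w in words)
    female = sum(w in {"she", "her", "hers"} for w in words)
    if male > female:
        return "male"
    if female > male:
        return "female"
    return "unknown"

print(infer_gender("She began her career in 1999; critics adored her."))  # female
```

Articles about one person mention that person's pronouns far more often than anyone else's, which is why such a blunt count works as well as it does.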

I specifically wrote it to look at comedians because I had read a study of comedians (Greengross et al 2012). They gave personality tests and a vocabulary test (from the Multidimensional Aptitude Battery, r=.62 with WAIS-R) to a sample of 31 comedians and 400 psychology students. The comedians scored 1.34 d above the students. Some care must be taken with this result. The comedians were much older (mean age 38.9 vs. 20.5), and vocabulary raw scores go up with age; the authors do not state that the scores were age-corrected. Psychology students are not very bright, and this was a sample from New Mexico with lots of Hispanics. We can safely conclude that comedians are smarter than the student body and the general population of New Mexico, but we can’t say much more exactly. We can hazard a guess: student body mean (maybe 107 IQ) + age-corrected d (maybe 15 IQ), so we end with an estimate of 122 IQ.

There are various other tables of interest that don’t need much explaining, which I will paste below:


As of writing this, I found another older study (Janus, 1975). I will just quote:

The data to support the above theses were gathered through psychological case studies, in-depth interviews with many of the leading comedians in the United States today, and psychological tests. In addition to a clinical interview, the instruments used were the Wechsler Adult Intelligence Scale, Machover Human Figure Drawing Test, graphological analysis, earliest memories, and recurring dreams.

Population consisted of 55 professional comedians. In order to be considered in this study, comedians had to be full-time professional stand-up comedians. Most of the subjects earned salaries of six figures or over, from comedy alone. In order to make the sample truly representative, each comedian had to be nationally known and had to have been in the field full time for at least ten years. The average time spent in full-time comedy for the subjects was twenty-five years. The group consisted of fifty-one men and four women. They represented all major religions, many geographic areas, and diverse socioeconomic backgrounds. Comedians were interviewed in New York, California, and points in between. Their socioeconomic backgrounds, family hierarchy, demographic information, religious influences, and analytic material were investigated. Of the population researched, 85 percent came from lower-class homes, 10 percent from lower-middle-class homes, and 5 percent from middle-class and upper-middle-class homes. All subjects participated voluntarily, received no remuneration, and were personally interviewed by the author.

I.Q. scores ranged from 115 to 160+. For a population at large, I.Q. scores in the average range are from 90 to 110. I.Q. scores in the bright-average range of intelligence, that is, from 109 to 115, were scored by only three subjects. The remainder scored above 125, with the mean score being 138. The vocabulary subtest was utilized. Several subjects approached it as a word-association test, but all regarded it as a challenge. Since these are verbal people, they were highly motivated. The problem was not one of getting them to respond, it was one of continuously allaying their anxiety, and reassuring them that they were indeed doing well.

So, a very high mean was found. The WAIS was published in 1955, so there are approximately 20 years of Flynn gains in the raw scores, presumably uncorrected for. According to a new meta-analysis of Flynn gains (Trahan et al 2014), the mean gain is 2.31 IQ per decade, so we are assuming a gain of about 4.6 IQ here. But then again, the verbal test for the students was published in 1984, so there may be some gain there as well (Flynn effects supposedly slowed down recently in Western countries). Perhaps a net gain in favor of the old study of 4 IQ. In that case, we get estimates of 134 and 122. With samples of 31 and 55, different subtests, different sampling procedures etc., this disagreement is surely reasonable. We can take a weighted mean and say the best estimate for professional comedians is about 129.7, or about +2SD. It seems a bit wild: are comedians really on average as smart as physicists?
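The Flynn adjustment above is simple arithmetic; a sketch, using the 2.31 IQ/decade figure from Trahan et al (2014) and the rough 1955/1975 test dates:

```python
# Flynn-effect adjustment for the Janus (1975) WAIS mean of 138
gain_per_decade = 2.31               # Trahan et al. (2014) meta-analytic estimate
decades_stale = (1975 - 1955) / 10   # WAIS norms were ~20 years old at testing
flynn_gain = gain_per_decade * decades_stale
print(flynn_gain)                    # ~4.6 IQ points of uncorrected inflation

net_gain = 4  # after allowing for the 1984 student test's own (smaller) staleness
print(138 - net_gain)                # adjusted comedian mean: 134
```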

EDIT: There is another study by Janus (1978). Same test:

[N=14] Intelligence scores ranged from 112 to 144 plus. (The range of average IQ is from 90 to 110.) Four subjects scored in the bright average range–i.e., 108 to 115. The remaining subjects scored above 118 with a mean score of 126. Two subjects scored above 130. The mean score for male comics was 138. The subjects approached the testing with overenthusiasm, in some cases bordering on frenzy. Despite the brightness of the group, all subjects needed constant reassurance and positive feedback.

So 126, with ~5 IQ because of the Flynn effect. The new weighted mean is 128.5 IQ.

Perhaps we should test it. If you want to test it with me, write me an email/tweet. We will design a questionnaire and give it to your local sample of comedians. One can e.g. try to convince professional comedian organizations (such as the Danish one here, N=35) to forward it to their members.

So what did I find?

I did the scraping twice: first in 2012, and then again when I was reminded of the project in May 2014. Now I have been reminded of it again. The very basic stats are that 1106 comedians were found, of which the gender distribution was this (the “other” is unknown gender, which was 1 person).

What about the change over time? The code fetches a comedian’s birth year if it is mentioned on their Wikipedia page. Then I limited the data to US comedians (66% of the sample). This was done because if we are looking for ways to explain the change, we need to restrict ourselves to a more homogeneous subset: what explains the change in gender distribution in Saudi Arabia at time t1 may not also explain it in Japan.

Next we face a common scientific trade-off: that between precision of estimate and detail. Essentially we need a moving average, since most or all years have too few comedians for a reliable estimate (very zigzaggy lines on the plot). So we must decide how large a moving window to use. A larger one will give more precision, but less detail. I decided to try a few different options (5, 10, 15, 20). To avoid extreme zigzagginess, I only plotted an interval if there were >=20 persons in it. The plots look like this:
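The moving-average logic can be sketched like this (a hypothetical reimplementation, not my original code; it reports a window’s male share only when the window holds at least min_n people):

```python
import numpy as np

def male_share_by_window(birth_years, is_male, window=10, min_n=20):
    """Moving share of male comedians by birth year.
    Larger windows: more precision, less detail; sparse windows are skipped."""
    birth_years = np.asarray(birth_years)
    is_male = np.asarray(is_male, dtype=float)
    half = window // 2
    shares = {}
    for center in range(int(birth_years.min()), int(birth_years.max()) + 1):
        in_window = (birth_years >= center - half) & (birth_years < center - half + window)
        if in_window.sum() >= min_n:  # only report reliable estimates
            shares[center] = is_male[in_window].mean()
    return shares

# toy data: 30 comedians born in 1950, 70% male
shares = male_share_by_window([1950] * 30, [1] * 21 + [0] * 9)
print(shares[1950])  # 0.7
```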

So in general we see a decline in the proportion of male comedians. But it is not going straight down. There is a local minimum around 1960, and a local maximum around 1980. How to explain these?

I tried abortion rate (not much data before 1973) and total fertility rate (plenty of data), but was not convinced by the results. One can also inflate or deflate the numbers according to which moving interval one chooses. One can even try all the possible interval sizes and delays to see which gives the best match. I did some of this semi-manually using spreadsheets, but it has a very high chance of overfitting. One would need to do some programming to try all of them in a reasonable time.

I wrote some of this stuff in a paper, but never finished it. It can now be found at its OSF repository.


Newer dataset from May 2014.

Older dataset dated to 2012.

Python code. This includes code to crawl Wikipedia and quite a lot of other raw data output files.


Greengross, G., Martin, R. A., & Miller, G. (2012). Personality traits, intelligence, humor styles, and humor production ability of professional stand-up comedians compared to college students. Psychology of Aesthetics, Creativity, and the Arts, 6(1), 74.

Janus, S. S. (1975). The great comedians: Personality and other factors. The American Journal of Psychoanalysis, 35(2), 169-174.

Janus, S. S., Bess, B. E., & Janus, B. R. (1978). The great comediennes: Personality and other factors. The American Journal of Psychoanalysis, 38(4), 367-372.

Trahan, L. H., Stuebing, K. K., Fletcher, J. M., & Hiscock, M. (2014). The Flynn effect: A meta-analysis. Psychological Bulletin, 140(5), 1332.

Admixture in the Americas: Admixture among US Blacks and Hispanics and academic achievement

Some time ago a new paper came out from the 23andme people reporting admixture among US ethnoracial groups (Bryc et al, 2014). Per our still ongoing admixture project (current draft here), one could see if admixture predicts academic achievement (or IQ, if such were available). We (that is, John) put together achievement data (reading and math scores) from the NAEP along with the admixture data here.

Descriptive stats

Admixture studies do not work well if there is little or no variation within groups. So let’s first examine the variation. For Blacks:

                      vars  n mean   sd median trimmed  mad  min  max range  skew kurtosis   se
BlackAfricanAncestry     1 31 0.74 0.04   0.74    0.74 0.03 0.64 0.83  0.19 -0.03    -0.38 0.01
BlackEuropeanAncestry    1 31 0.23 0.04   0.24    0.23 0.03 0.15 0.34  0.19  0.09    -0.30 0.01


So we see that there is little American admixture in Blacks, because the African and European components add up to close to 100% (23+74=97). In fact, the correlation between African and European ancestry in Blacks is -.99. This also means that multiple regression is useless here because of collinearity.

White admixture data is also not very useful. It is almost exclusively European:

                      vars  n mean sd median trimmed mad  min max range  skew kurtosis se
WhiteEuropeanAncestry    1 51 0.99  0   0.99    0.99   0 0.98   1  0.02 -0.95     0.74  0

What about Hispanics (some sources call them Latinos)?

                       vars  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
LatinoEuropeanAncestry    1 34 0.73 0.07   0.72    0.73 0.05 0.57 0.90  0.33 0.34     0.22 0.01
LatinoAfricanAncestry     1 34 0.09 0.05   0.08    0.08 0.06 0.01 0.22  0.21 0.51    -0.69 0.01
LatinoAmericanAncestry    1 34 0.10 0.05   0.09    0.10 0.03 0.04 0.21  0.17 0.80    -0.47 0.01

Hispanics are fairly admixed. Overall, they are mostly European, but the ranges of African and American ancestry are quite wide. Furthermore, due to the three-way variation, multiple regression should work. The ancestry intercorrelations are: -.42 (Afro x Amer), -.21 (Afro x Euro), -.50 (Amer x Euro). There must also be another source, because 73+9+10 is only 92%. Where’s the last 8% of admixture from?

Admixture x academic achievement correlations: Blacks

row.names BlackAfricanAncestry BlackAmericanAncestry BlackEuropeanAncestry
1 Math2013B -0.32 0.09 0.29
2 Math2011B -0.27 0.21 0.25
3 Math2009B -0.30 0.09 0.28
4 Math2007B -0.12 0.27 0.08
5 Math2005B -0.28 0.26 0.23
6 Math2003B -0.30 0.15 0.26
7 Math2000B -0.36 -0.08 0.34
8 Read2013B -0.25 0.14 0.22
9 Read2011B -0.33 0.22 0.30
10 Read2009B -0.40 -0.03 0.41
11 Read2007B -0.26 0.14 0.24
12 Read2005B -0.43 0.33 0.39
13 Read2003B -0.42 0.09 0.38
14 Read2002B -0.30 -0.10 0.27


Summarizing these results:

     vars  n  mean   sd median trimmed  mad   min   max range  skew kurtosis   se
Afro    1 14 -0.31 0.08  -0.30   -0.32 0.05 -0.43 -0.12  0.31  0.48     0.10 0.02
Amer    1 14  0.13 0.13   0.14    0.13 0.11 -0.10  0.33  0.43 -0.32    -1.07 0.03
Euro    1 14  0.28 0.08   0.28    0.29 0.06  0.08  0.41  0.33 -0.49     0.11 0.02

So we see the expected directions and order: for Blacks (who are mostly African), American admixture is positively correlated with achievement, and European admixture more positively still. There is quite a bit of variation over the years. It is possible that this mostly reflects ‘noise’, e.g. changes in educational policies in the states, or just sampling error. It is also possible that the changes are due to admixture changes within states over time.

Admixture x academic achievement correlations: Hispanics

row.names LatinoAfricanAncestry LatinoAmericanAncestry LatinoEuropeanAncestry
1 Math13H 0.20 -0.13 -0.10
2 Math11H 0.27 0.02 -0.02
3 Math09H 0.29 -0.32 0.04
4 Math07H 0.36 -0.14 -0.01
5 Math05H 0.38 -0.08 0.00
6 Math03H 0.37 -0.23 -0.08
7 Math00H 0.30 -0.09 -0.05
8 Read2013H 0.18 -0.44 0.33
9 Read2011H 0.21 -0.26 0.33
10 Read2009H 0.19 -0.44 0.33
11 Read2007H 0.13 -0.32 0.23
12 Read2005H 0.38 -0.30 0.23
13 Read2003H 0.32 -0.34 0.18
14 Read2002H 0.24 -0.23 0.08

And summarizing:

     vars  n  mean   sd median trimmed  mad   min  max range  skew kurtosis   se
Afro    1 14  0.27 0.08   0.28    0.28 0.12  0.13 0.38  0.25 -0.10    -1.49 0.02
Amer    1 14 -0.24 0.14  -0.24   -0.24 0.15 -0.44 0.02  0.46  0.17    -1.13 0.04
Euro    1 14  0.11 0.16   0.06    0.11 0.19 -0.10 0.33  0.43  0.23    -1.68 0.04

We do not see the expected results per the genetic model. Among Hispanics, who are 73% European, African admixture has a positive relationship to academic achievement. American admixture is negatively correlated, and European positively, but more weakly than African. The only thing in line with the genetic model is that European is positive. On the other hand, the results are not in line with a null model either, because then we would expect them to fluctuate around 0.

Note that the European admixture numbers are only positive for the reading tests. The reading tests are presumably those mostly affected by language bias (many Hispanics speak Spanish as a first language). If anything, the math results are worse for the genetic model.

General achievement factors

We can eliminate some of the noise in the data by extracting a general achievement factor for each group. I do this by first removing the cases with no data at all, and then imputing the rest.
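A sketch of that impute-then-extract step in Python (the analysis itself used R’s irmi() and fa(); here I substitute simple column-mean imputation and the first principal component as the general factor):

```python
import numpy as np

def general_factor_scores(X):
    """First principal component scores from a cases x tests matrix
    that may contain missing values (NaN)."""
    X = np.asarray(X, dtype=float)
    X = X[~np.isnan(X).all(axis=1)]           # drop cases with no data at all
    col_means = np.nanmean(X, axis=0)
    X = np.where(np.isnan(X), col_means, X)   # mean-impute the rest
    X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each test
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    scores = X @ vt[0]                        # project on the first component
    return scores / scores.std()
```

With NAEP-like data, one would feed in the 14 test columns per group and correlate the resulting scores with the ancestry estimates.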

Then we get the correlation like before. This should be fairly close to the means above:

 LatinoAfricanAncestry LatinoAmericanAncestry LatinoEuropeanAncestry 
                  0.28                  -0.36                   0.22

The European result is stronger with the general factor from the imputed dataset, but the order is the same.

We can do the same for the Black data to see if the imputation+factor analysis screws up the results:

 BlackAfricanAncestry BlackAmericanAncestry BlackEuropeanAncestry 
                -0.35                  0.20                  0.31

These results are similar to before (-.31, .13, .28) with the American result somewhat stronger.


Perhaps if we plot the results, we can figure out what is going on. We can plot either the general achievement factor, or specific results. Let’s do both:

Reading2013 plots

(scatterplots: African, American, and European ancestry x Reading 2013)

Math2013 plots

(scatterplots: African, American, and European ancestry x Math 2013)

General factor plots

(scatterplots: African, American, and European ancestry x general achievement factor)

These did not help me understand it. Maybe they make more sense to someone who understands US demographics and history better.

Multiple regression

As mentioned above, the Black data should be mostly useless for multiple regression due to high collinearity. But the Hispanic data should be better. I ran models using two of the three ancestry estimates at a time, since one cannot use all three at once (they sum to nearly 1, so they are collinear).

Generally, the independent variables did not reach statistical significance. Using the general achievement factor as the dependent variable, the standardized betas are:

LatinoAfricanAncestry LatinoAmericanAncestry
             0.1526765             -0.2910413
LatinoAfricanAncestry LatinoEuropeanAncestry
             0.3363636              0.2931108
LatinoAmericanAncestry LatinoEuropeanAncestry
           -0.32474678             0.06224425

The first model is relative to European ancestry, the second to American, and the third to African. The results are not even consistent with each other: in the first, African>European; in the third, European>African. All results do show that the others>American, though.
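The inconsistency is what one would expect from compositional predictors: because the three ancestries sum to (nearly) 1, each beta is only interpretable relative to the omitted component. A toy sketch with made-up shares:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 34
# hypothetical compositional ancestry shares that sum to 1
afro = rng.uniform(0.01, 0.22, n)
amer = rng.uniform(0.04, 0.21, n)
euro = 1 - afro - amer
y = 2 * euro - 1 * amer + rng.normal(0, 0.01, n)  # toy achievement outcome

def betas(*predictors):
    """OLS slopes; an intercept is included."""
    X = np.column_stack([np.ones(n), *predictors])
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

print(betas(afro, amer))  # slopes relative to the omitted European share
print(betas(afro, euro))  # different slopes: now relative to American
```

Substituting euro = 1 - afro - amer shows why: the same generating model implies y ≈ 2 - 2·afro - 3·amer in the first parameterization but y ≈ -1 + afro + 3·euro in the second, so the sign on African ancestry flips without anything changing in the data.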

The remainder

There is something odd about the data: the ancestry components don’t sum to 1. I calculated the sum of the ancestry estimates and subtracted it from 1. Here are the results:

(dotcharts of the remainder ancestry by state for Blacks and Hispanics)

To these we can add simple descriptive stats:

                        vars  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
BlackRemainderAncestry     1 31 0.02 0.00   0.02    0.02 0.00 0.01 0.03  0.02 1.35     1.18 0.00
LatinoRemainderAncestry    1 34 0.08 0.05   0.07    0.07 0.03 0.02 0.34  0.32 3.13    12.78 0.01


So we see that there is a sizable other-ancestry proportion for Hispanics and a small one for Blacks. Presumably, the large outlier of Hawaii reflects Asian admixture from the Japanese, Chinese, Filipino, and Native Hawaiian clusters; at least, these are the largest groups according to Wikipedia. For Blacks, the remainder ancestry is presumably Asian admixture as well.

Do these remainders correlate with academic achievement? For Blacks, r = .39 (p = .03), and for Hispanics, r = -.24 (p = .18). So for Blacks the correlation is in the expected direction and the stronger of the two; for Hispanics it is weaker and goes the other way.

Partial correlations

What about partialing out the remainders?

LatinoAfricanAncestry LatinoAmericanAncestry LatinoEuropeanAncestry
            0.21881404            -0.33114612             0.09329413
BlackAfricanAncestry BlackAmericanAncestry BlackEuropeanAncestry
           -0.2256171             0.1189219             0.2185139


Not much has changed. European correlation has become weaker for Hispanics. For Blacks, results are similar to before.
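For reference, the partialing above is the standard first-order partial correlation (the partial.r used in the R code below is from my own function library); a minimal sketch:

```python
import numpy as np

def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y with z partialed out of both."""
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

print(partial_corr(0.5, 0.5, 0.5))  # 0.333...: controlling z weakens r
```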

Proposed explanations?

The African results are in line with genetic models. The Hispanic results are not, but they aren’t in line with the null model either. Perhaps it has something to do with generational effects. Perhaps one could find the % of first-generation Hispanics by state and add it to the regression model, or control for it using partial correlations.

Other ideas? Before calculating the results, John wrote:

Language, generation, and genetic assimilation are all confounded, so I thought it best to not look at them.

He may be right.

R code

data = read.csv("BryceAdmixNAEP.tsv", sep="\t",row.names=1)
library(car) # for vif and scatterplot
library(psych) # for describe and fa
library(Hmisc) # for rcorr
library(VIM) # for imputation (irmi)
library(QuantPsyc) # for lm.beta
library(devtools) # for source_url
#load mega functions

#descriptive stats

black.model = "Math2013B ~ BlackAfricanAncestry+BlackAmericanAncestry"
black.model = "Read2013B ~ BlackAfricanAncestry+BlackAmericanAncestry"
black.model = "Math2013B ~ BlackAfricanAncestry+BlackEuropeanAncestry"
black.model = "Read2013B ~ BlackAfricanAncestry+BlackEuropeanAncestry" #run one model string at a time = lm(black.model, data) #fit the last assigned model

hispanic.model = "Math2013H ~ LatinoAfricanAncestry+LatinoAmericanAncestry"
hispanic.model = "Read2013H ~ LatinoAfricanAncestry+LatinoAmericanAncestry"
hispanic.model = "Math2013H ~ LatinoAfricanAncestry+LatinoEuropeanAncestry"
hispanic.model = "Read2013H ~ LatinoAfricanAncestry+LatinoEuropeanAncestry"
hispanic.model = "hispanic.ach.factor ~ LatinoAfricanAncestry+LatinoAmericanAncestry"
hispanic.model = "hispanic.ach.factor ~ LatinoAfricanAncestry+LatinoEuropeanAncestry"
hispanic.model = "hispanic.ach.factor ~ LatinoAmericanAncestry+LatinoEuropeanAncestry"
hispanic.model = "hispanic.ach.factor ~ LatinoAfricanAncestry+LatinoAmericanAncestry+LatinoEuropeanAncestry" #run one model string at a time = lm(hispanic.model, data) #fit the last assigned model

cors = round(rcorr(as.matrix(data))$r,2) #all correlations, round to 2 decimals

 = cors[10:23,1:3] #Black admixture x Achv.
hist(unlist([,1])) #hist for afri x achv
hist(unlist([,2])) #amer x achv
hist(unlist([,3])) #euro x achv
desc = rbind(Afro=describe(unlist([,1])), #descp. stats afri x achv
             Amer=describe(unlist([,2])), #amer x achv
             Euro=describe(unlist([,3]))) #euro x achv

admixture.cors.white = cors[24:25,4:6] #White admixture x Achv.

admixture.cors.hispanic = cors[26:39,7:9] #Hispanic admixture x Achv.
desc = rbind(Afro=describe(unlist(admixture.cors.hispanic[,1])), #descp. stats afri x achv
             Amer=describe(unlist(admixture.cors.hispanic[,2])), #amer x achv
             Euro=describe(unlist(admixture.cors.hispanic[,3]))) #euro x achv

##Examine hispanics by scatterplots
scatterplot(Read2013H ~ LatinoAfricanAncestry, data,
            smoother=FALSE, id.n=nrow(data))
scatterplot(Read2013H ~ LatinoEuropeanAncestry, data,
            smoother=FALSE, id.n=nrow(data))
scatterplot(Read2013H ~ LatinoAmericanAncestry, data,
            smoother=FALSE, id.n=nrow(data))
scatterplot(Math2013H ~ LatinoAfricanAncestry, data,
            smoother=FALSE, id.n=nrow(data))
scatterplot(Math2013H ~ LatinoEuropeanAncestry, data,
            smoother=FALSE, id.n=nrow(data))
scatterplot(Math2013H ~ LatinoAmericanAncestry, data,
            smoother=FALSE, id.n=nrow(data))
#General factor
scatterplot(hispanic.ach.factor ~ LatinoAfricanAncestry, data,
            smoother=FALSE, id.n=nrow(data))
scatterplot(hispanic.ach.factor ~ LatinoEuropeanAncestry, data,
            smoother=FALSE, id.n=nrow(data))
scatterplot(hispanic.ach.factor ~ LatinoAmericanAncestry, data,
            smoother=FALSE, id.n=nrow(data))

##Imputed and aggregated data
 = data[26:39] #subset hispanic ach data =[rowSums( < ncol(,] #remove empty cases
miss.table( #examine missing data = irmi(, noise.factor = 0) #impute the rest
#factor analysis
fact.hispanic = fa( #get common ach factor
fact.scores = fact.hispanic$scores; colnames(fact.scores) = "hispanic.ach.factor"
data = merge.datasets(data,fact.scores,1) #merge it back into data
cors[7:9,"hispanic.ach.factor"] #results for general factor

 = data[10:23] #subset black ach data =[rowSums( < ncol(,] #remove empty cases = irmi(, noise.factor = 0) #impute the rest
#factor analysis = fa( #get common ach factor
fact.scores =$scores; colnames(fact.scores) = "black.ach.factor"
data = merge.datasets(data,fact.scores,1) #merge it back into data
cors[1:3,"black.ach.factor"] #results for general factor

##Admixture totals
Hispanic.admixture = subset(data, select=c("LatinoAfricanAncestry","LatinoAmericanAncestry","LatinoEuropeanAncestry"))
Hispanic.admixture = Hispanic.admixture[,] #complete cases
Hispanic.admixture.sum = data.frame(apply(Hispanic.admixture, 1, sum))
colnames(Hispanic.admixture.sum)="Hispanic.admixture.sum" #fix name
describe(Hispanic.admixture.sum) #stats

#add data back to dataframe
LatinoRemainderAncestry = 1-Hispanic.admixture.sum #get remainder
colnames(LatinoRemainderAncestry) = "LatinoRemainderAncestry" #rename
data = merge.datasets(LatinoRemainderAncestry,data,2) #merge back

#plot it
LatinoRemainderAncestry = LatinoRemainderAncestry[order(LatinoRemainderAncestry,decreasing=FALSE),,drop=FALSE] #reorder
dotchart(as.matrix(LatinoRemainderAncestry),cex=.7) #plot, with smaller text

Black.admixture = subset(data, select=c("BlackAfricanAncestry","BlackAmericanAncestry","BlackEuropeanAncestry"))
Black.admixture = Black.admixture[,] #complete cases
Black.admixture.sum = data.frame(apply(Black.admixture, 1, sum))
colnames(Black.admixture.sum)="Black.admixture.sum" #fix name
describe(Black.admixture.sum) #stats

#add data back to dataframe
BlackRemainderAncestry = 1-Black.admixture.sum #get remainder
colnames(BlackRemainderAncestry) = "BlackRemainderAncestry" #rename
data = merge.datasets(BlackRemainderAncestry,data,2) #merge back

#plot it
BlackRemainderAncestry = BlackRemainderAncestry[order(BlackRemainderAncestry,decreasing=FALSE),,drop=FALSE] #reorder
dotchart(as.matrix(BlackRemainderAncestry),cex=.7) #plot, with smaller text

#simple stats for both

#make subset with remainder data and achievement
remainders = subset(data, select=c("black.ach.factor","BlackRemainderAncestry",
                                   "hispanic.ach.factor","LatinoRemainderAncestry"))
View(rcorr(as.matrix(remainders))$r) #correlations?

#Partial correlations
partial.r(data, c(7:9,40), c(43))[4,] #partial out remainder for Hispanics
partial.r(data, c(1:3,41), c(42))[4,] #partial out remainder for Blacks


Bryc, K., Durand, E. Y., Macpherson, J. M., Reich, D., & Mountain, J. L. (2014). The Genetic Ancestry of African Americans, Latinos, and European Americans across the United States. The American Journal of Human Genetics.

The personal Jensen coefficient, useful for detecting teaching to the test?

In my previous paper, I examined whether a personal Jensen coefficient could predict GPA beyond the general factor (or just the normal summed score). I found this not to be the case for a Dutch university student sample (n ≈ 300). One thing I did find, however, was that the personal Jensen coefficient correlated with the g factor: r = .35.

Moreover, Piffer’s alternative metric, the g advantage coefficient (g factor score minus unit-weighted score), had a very strong correlation with the summed score: r = .88. This arguably makes it the more reliable measure.

While neither of these predicted GPA beyond g, they may have another use. When there is teaching to the test, the subtests that increase the most are those that are the least g-loaded (see this). This should affect both measures, making them weaker or negative, since the highest scores will tend to be on the least g-loaded subtests. Thus, they may be practically useful for detecting cheating on tests, although perhaps only at the group level.

Unfortunately, I don’t have any dataset with test-retest gains or direct training, but one could simulate gains that are negatively related to the g loadings, and then calculate the personal Jensen coefficient and Piffer’s g advantage coefficient.
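Such a simulation is easy to sketch. Below is a minimal version (in Python rather than the R used elsewhere in this post; the data-generating model and all parameters are invented for illustration): subtest gains are made inversely proportional to the g-loadings, and the personal Jensen coefficient (computed here as each person's correlation between subtest scores and g-loadings) is compared before and after the gains.

```python
import numpy as np

rng = np.random.default_rng(1)
n_people, n_tests = 500, 10
g_load = np.linspace(0.3, 0.8, n_tests)      # hypothetical g-loadings

g = rng.normal(size=(n_people, 1))           # latent general intelligence
spec = rng.normal(size=(n_people, n_tests))  # subtest-specific variance
scores = g * g_load + spec * np.sqrt(1 - g_load**2)  # standardized subtest scores

def personal_jensen(scores, loadings):
    """Per-person correlation between subtest scores and g-loadings."""
    return np.array([np.corrcoef(row, loadings)[0, 1] for row in scores])

pj_before = personal_jensen(scores, g_load)

# "Teaching to the test": gains concentrated on the least g-loaded subtests
gains = 1.5 * (1 - g_load)
pj_after = personal_jensen(scores + gains, g_load)

print(pj_before.mean(), pj_after.mean())
```

In this toy setup, concentrating gains on the least g-loaded subtests pushes the mean personal Jensen coefficient downward, which is the signature one would look for in teaching-to-the-test data.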

Maybe I will update this post with the results of such a simulation.

A general assortative mating factor?: An idea in need of a dataset

I was talking with my girlfriend about how we match well in some areas and poorly in others, and it occurred to me that there may be a general assortative mating factor. That is, if one takes a lot of variables (personality, intelligence, socioeconomic status, interests) and computes each couple's similarity on every trait, one could maybe extract a somewhat general factor from these similarities. The method is to correlate the couple similarities across traits. I.e., are couples who are more similar in intelligence also more similar in socioeconomic variables? Likely. Are they also more similar in interests? Maybe slightly. Are people who are more similar in height also more similar in intelligence? That seems doubtful, since the intelligence x height correlation is only about .2. But maybe.

What is needed is a dataset with, say, >100 couples and >10 diverse variables of interest. Anyone know of such a dataset?
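In the meantime, the proposed analysis can be sketched on simulated data (Python; the data-generating model and all numbers are invented): compute each couple's similarity on every trait, here the negative absolute partner difference, and check whether a general factor emerges from the similarity intercorrelations.

```python
import numpy as np

rng = np.random.default_rng(2)
n_couples, n_traits = 200, 6

# Simulate a per-couple "matching tendency": well-matched couples are
# similar across all traits, which generates the hypothesized general factor.
matching = rng.normal(size=(n_couples, 1))
partner_a = rng.normal(size=(n_couples, n_traits))
diffs = rng.normal(size=(n_couples, n_traits)) * np.exp(-matching)
partner_b = partner_a + diffs

similarity = -np.abs(partner_a - partner_b)       # higher = more similar
sim_cors = np.corrcoef(similarity, rowvar=False)  # trait x trait similarity correlations

# First eigenvalue of the similarity correlation matrix as a stand-in
# for the strength of a general assortative mating factor
eigvals = np.linalg.eigvalsh(sim_cors)
print("proportion of variance, first component:", eigvals[-1] / n_traits)
```

If there were no general factor, the similarity correlations would hover around zero and the first eigenvalue would stay near 1; in this simulation the common matching tendency pushes it well above that.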

Predicting immigrant performance: Does inbreeding have incremental validity over IQ and Islam?


So I decided to try it out, since I’m taking a break from reading Lilienfeld, which I had been doing for about 5 hours straight.

So the question is whether inbreeding measures have incremental validity over IQ and Islam, which I have previously used to examine immigrant performance in a number of studies.

So, to get the data into R, I OCR’d the PDF in ABBYY FineReader, since this program allows for easy copying of table data by row or column. I only wanted columns 1-2 and didn’t want to deal with the hassle of importing it via a spreadsheet (which needs a consistent separator, e.g. comma or space). Then I merged it with the megadataset to create a new version, 2.0d.

Then I created a subset of the data with variables of interest, and renamed them (otherwise results would be unwieldy). Intercorrelations are:

row.names     Cousin% CoefInbreed    IQ Islam     S
1 Cousin%        1.00        0.52 -0.59  0.78 -0.76
2 CoefInbreed    0.52        1.00 -0.28  0.40 -0.55
3 IQ            -0.59       -0.28  1.00 -0.27  0.54
4 Islam          0.78        0.40 -0.27  1.00 -0.71
5 S             -0.76       -0.55  0.54 -0.71  1.00


Spearman correlations, which are probably better given the non-normal data:

row.names     Cousin% CoefInbreed    IQ Islam     S
1 Cousin%        1.00        0.91 -0.63  0.67 -0.73
2 CoefInbreed    0.91        1.00 -0.55  0.61 -0.76
3 IQ            -0.63       -0.55  1.00 -0.23  0.72
4 Islam          0.67        0.61 -0.23  1.00 -0.61
5 S             -0.73       -0.76  0.72 -0.61  1.00
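
The preference for Spearman here can be illustrated with a toy example (Python, made-up numbers): a single extreme case can dominate the Pearson correlation, while the rank-based Spearman correlation is unaffected by such monotone distortions.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100.0])  # one extreme value
y = np.array([2, 1, 4, 3, 6, 5, 8, 7, 10, 9.0])   # near-monotone with x

pearson = np.corrcoef(x, y)[0, 1]                 # pulled around by the outlier
spearman = stats.spearmanr(x, y).correlation      # uses ranks only
print(round(pearson, 2), round(spearman, 2))
```

The relationship is almost perfectly monotone, which Spearman picks up (about .94), while Pearson is dragged down to about .48 by the single extreme x value.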


The fairly high correlations of the inbreeding measures with IQ and Islam mean that their incremental validity will likely be modest.

However, let’s try modeling them. I created 7 models of interest and compiled the primary measure of interest, adjusted R², into an object. It looks like this:

row.names                      R2 adj.
1 S ~ IQ+Islam               0.5472850
2 S ~ IQ+Islam+CousinPercent 0.6701305
3 S ~ IQ+Islam+CoefInbreed   0.7489312
4 S ~ Islam+CousinPercent    0.6776841
5 S ~ Islam+CoefInbreed      0.7438711
6 S ~ IQ+CousinPercent       0.5486674
7 S ~ IQ+CoefInbreed         0.4979552


So we see that either of them adds a fair amount of incremental validity to the base model (line 1 vs. 2-3). They are in fact better than IQ if one substitutes them in (1 vs. 4-5). They can also substitute for Islam, but only with about the same predictive power (1 vs 6-7).
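The model-comparison logic can be sketched as follows (Python, synthetic data; the variable names mirror the post, but nothing here is the actual dataset): fit OLS models with and without the extra predictor and compare adjusted R², which penalizes the extra parameter, so a correlated extra variable only helps if it carries genuine incremental information.

```python
import numpy as np

def adj_r2(y, X):
    """OLS adjusted R-squared with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    r2 = 1 - resid.var() / y.var()
    n, k = X1.shape                      # k counts the intercept too
    return 1 - (1 - r2) * (n - 1) / (n - k)

rng = np.random.default_rng(3)
n = 70
iq = rng.normal(size=n)
islam = 0.3 * iq + rng.normal(size=n)     # predictors intercorrelate
inbreed = -0.5 * iq + rng.normal(size=n)  # as in the real data
S = iq - islam + 0.5 * inbreed + rng.normal(size=n)

base = adj_r2(S, np.column_stack([iq, islam]))
full = adj_r2(S, np.column_stack([iq, islam, inbreed]))
print(base, full)  # incremental validity if full > base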

Replication for Norway

Replication is important in science. Let’s try Norwegian data. The Finnish and Dutch data are not well-suited for this (too few immigrant groups, and few outcome variables, i.e. only crime).

Pearson intercorrelations:

row.names       CousinPercent CoefInbreed    IQ Islam     S
1 CousinPercent          1.00        0.52 -0.59  0.78 -0.78
2 CoefInbreed            0.52        1.00 -0.28  0.40 -0.46
3 IQ                    -0.59       -0.28  1.00 -0.27  0.60
4 Islam                  0.78        0.40 -0.27  1.00 -0.72
5 S                     -0.78       -0.46  0.60 -0.72  1.00



Spearman correlations:

row.names       CousinPercent CoefInbreed    IQ Islam     S
1 CousinPercent          1.00        0.91 -0.63  0.67 -0.77
2 CoefInbreed            0.91        1.00 -0.55  0.61 -0.71
3 IQ                    -0.63       -0.55  1.00 -0.23  0.75
4 Islam                  0.67        0.61 -0.23  1.00 -0.47
5 S                     -0.77       -0.71  0.75 -0.47  1.00


These look fairly similar to Denmark.

And the regression results:

row.names                      R2 adj.
1 S ~ IQ+Islam               0.5899682
2 S ~ IQ+Islam+CousinPercent 0.7053999
3 S ~ IQ+Islam+CoefInbreed   0.7077162
4 S ~ Islam+CousinPercent    0.6826272
5 S ~ Islam+CoefInbreed      0.6222364
6 S ~ IQ+CousinPercent       0.6080922
7 S ~ IQ+CoefInbreed         0.5460777


Fairly similar too. If added, they have incremental validity (line 1 vs. 2-3). They perform better than IQ if substituted but not as much as in the Danish data (1 vs. 4-5). They can also substitute for Islam (1 vs. 6-7).

How to interpret?

Since inbreeding does not seem to have any direct influence on behavior that is reflected in the S factor, these findings are not easy to interpret. Inbreeding leads to various health problems and lower g in offspring, the latter of which may have some effect. However, national IQs presumably already reflect the IQ lowered by inbreeding, so there should be no additional effect there beyond national IQs. Perhaps inbreeding results in other, relevant psychological problems.

Another idea is that inbreeding rates reflect non-g psychological traits that are relevant to adapting to life in Denmark. Perhaps inbreeding is a useful measure of clannishness, which would be reflected in hostility toward integration into Danish society (such as not getting an education, or antipathy toward ethnic Danes and resulting higher crime rates against them), which in turn would be reflected in the S factor.

The lack of well-established causal routes for interpreting the finding makes me somewhat cautious about it.


##Code for merging cousin marriage + inbreeding data with megadataset
inbreed = read.table("clipboard", sep="\t",header=TRUE, row.names=1) #load data from clipboard
source("merger.R") #load mega functions
mega20d = read.mega("Megadataset_v2.0d.csv") #load latest megadataset
names = as.abbrev(rownames(inbreed)) #get abbreviated names
rownames(inbreed) = names #set them as rownames

#merge and save
mega20e = merge.datasets(mega20d,inbreed,1) #merge to create v. 2.0e
write.mega(mega20e,"Megadataset_v2.0e.csv") #save it

#select subset of interesting data = subset(mega20e, select=c("Weighted.mean.consanguineous.percentage.HobenEtAl2010",
colnames( = c("CousinPercent","CoefInbreed","IQ","Islam","S") #shorter var names
rcorr = rcorr(as.matrix( #correlation object
View(round(rcorr$r,2)) #view correlations, round to 2
rcorr.S = rcorr(as.matrix(,type = "spearman") #spearman correlation object
View(round(rcorr.S$r,2)) #view correlations, round to 2

#Multiple regression
library(QuantPsyc) #for beta coef
results = = NA, nrow = 0, ncol = 1)) #empty data frame for results
colnames(results) = "R2 adj."
models = c("S ~ IQ+Islam", #base model
           "S ~ IQ+Islam+CousinPercent", #1. inbreeding var
           "S ~ IQ+Islam+CoefInbreed", #2. inbreeding var
           "S ~ Islam+CousinPercent", #without IQ
           "S ~ Islam+CoefInbreed", #without IQ
           "S ~ IQ+CousinPercent", #without Islam
           "S ~ IQ+CoefInbreed") #without Islam

for (model in models){ #run all the models
  fit.model = lm(model, #fit model
  sum.stats = summary(fit.model) #summary stats object
  print(sum.stats) #print summary stats
  print(lm.beta(fit.model)) #print standardized betas
  results[model,] = sum.stats$adj.r.squared #add result to results object
}
View(results) #view results

##Let's try Norway too = subset(mega20e, select=c("Weighted.mean.consanguineous.percentage.HobenEtAl2010",

colnames( = c("CousinPercent","CoefInbreed","IQ","Islam","S") #shorter var names
rcorr = rcorr(as.matrix( #correlation object
View(round(rcorr$r,2)) #view correlations, round to 2
rcorr.S = rcorr(as.matrix(,type = "spearman") #spearman correlation object
View(round(rcorr.S$r,2)) #view correlations, round to 2

results = = NA, nrow = 0, ncol = 1)) #empty data frame for results
colnames(results) = "R2 adj."
models = c("S ~ IQ+Islam", #base model
           "S ~ IQ+Islam+CousinPercent", #1. inbreeding var
           "S ~ IQ+Islam+CoefInbreed", #2. inbreeding var
           "S ~ Islam+CousinPercent", #without IQ
           "S ~ Islam+CoefInbreed", #without IQ
           "S ~ IQ+CousinPercent", #without Islam
           "S ~ IQ+CoefInbreed") #without Islam

for (model in models){ #run all the models
  fit.model = lm(model, #fit model
  sum.stats = summary(fit.model) #summary stats object
  print(sum.stats) #print summary stats
  print(lm.beta(fit.model)) #print standardized betas
  results[model,] = sum.stats$adj.r.squared #add result to results object
}
View(results) #view results

Scott O. Lilienfeld is a great researcher

Some researchers are just more interesting to you than others. So when I find one who has written something very interesting, I try to find their other papers to see if they have produced more interesting material. Lilienfeld is such a person. He writes about science and pseudoscience in psychology, especially clinical psychology. He has a number of papers on a variety of dubious ideas in psychology, such as repressed memories. He also writes about the public’s perception of psychology.

PubMed lists 123 papers under his name and Scholar lists 381 publications, so he is certainly pretty productive. Here’s a collection of interesting material:

Of interest also are his books, of which I’ve already read two:

Age differences in the WISC-IV have a positive Jensen coefficient, maybe

Group differences in cognitive scores have generally been found to be g-loaded, i.e. the differences are larger on the items/subtests that load more strongly on the general factor. This is generally called a Jensen effect, and its opposite an anti-Jensen effect. However, this terminology causes trouble when dealing with (near-)zero correlations or with effects of unknown direction, where we don’t know whether to speak of a “Jensen effect” or an “anti-Jensen effect”. For that reason, I use the term “Jensen coefficient”, which can easily be described as positive, negative, or near-zero.
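Computationally, a Jensen coefficient is just the correlation between two vectors indexed by subtest: the g-loadings and some per-subtest effect (a group gap, an age correlation, etc.). A minimal illustration (Python, made-up numbers):

```python
import numpy as np

# Hypothetical values for six subtests (not from any real battery)
g_loadings = np.array([0.75, 0.70, 0.65, 0.55, 0.45, 0.40])
group_gap  = np.array([0.80, 0.65, 0.70, 0.50, 0.40, 0.35])  # effect size d per subtest

# The Jensen coefficient: correlation of effects with g-loadings
jensen = np.corrcoef(g_loadings, group_gap)[0, 1]
print(round(jensen, 2))  # positive: larger gaps on more g-loaded subtests
```

A positive value here is a Jensen effect in the traditional terminology; the same calculation with an age-correlation vector in place of the group gap gives the age analysis discussed below.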

Generally, when studies report the factor structure of cognitive data, they remove the effects of age and gender, and they do not report the correlations between age and the subtests. Recently, I saw this paper about the standardization of the WISC-IV in Vietnam, where the authors do report them. The correlations differ by subtest, which immediately leads someone like me to propose that the effect should be larger on the more g-loaded subtests. This is based on the idea that as one grows up, one really does get smarter, i.e. increases in general intelligence. So the vector correlation should be positive. The Vietnamese study, however, does not report the g-loadings, so I have resorted to getting these from some other papers on the English-language version of the same test.

The datafile is here. It has g-loadings from 6 papers yielding 8 estimates. Some papers report more than one because they model the data with more than one model. E.g. four-factor vs. five-factor hierarchical model. The correlations between the g-loadings of these studies and the subtest x age correlation from the Vietnamese study range between .272 and .528, with a median of .427 and mean of .422. If one uses the average g-loading across studies, the correlation with age x subtest is .441.* Using Spearman correlation, it is also .441.

[Scatterplot: average WISC-IV subtest g-loading vs. subtest x age correlation]

If one removes the Symbol Search outlier, Spearman r=.29, so the relationship is not entirely due to that.

As usual, this research is hampered by a lack of data sharing. Thousands of studies use the WISC and have age data too, but they don’t share the data or report the results needed to calculate the correlation. Furthermore, the relatively small selection of subtests makes the MCV method error-prone. It would be much better to have e.g. 20 subtests with a wider spread of g-loadings, e.g. including reaction time tests.

It is also possible that a large change in some non-g ability can throw the MCV results off. General intelligence is probably not the only ability that changes as one grows up, and MCV is sensitive to these other abilities changing too.

Where to go from here

Next steps:

  1. Find more studies reporting g-loadings of WISC-IV subtests.
  2. Find more studies that report age x subtest correlations.
  3. Find open datasets where (1-2) can be calculated.
  4. Write to authors and ask them if they can provide results for (1-2) or send data (3).
  5. Find other commonly used tests for children and do (1-4). Also interesting are age declines later on.

I have contacted some authors.

* Google Drive Sheets calculates the r as .439 instead. I don’t know why.


##R code for doing the analyses and plotting = read.table("clipboard", sep="\t",header=TRUE, row.names=1) #load data from clipboard
library(Hmisc) #needed for rcorr
rcorr(as.matrix( #get correlations
cor(, use="pair") #use the other function to verify
rcorr(as.matrix(, type = "spearman") #spearman

library(car) #for scatterplot
scatterplot(r.x.age ~ avg..g.loading,, smoother=FALSE, id.n=nrow(,
            main = "MCV: WISC-IV g-loading and subtest x age correlation\nSpearman r = .441",
            xlab = "Average g-loading (mean of 8 datapoints)",
            ylab = "Score x age (1 datapoint)") #plot it
wisc.data2 =[-10,] #exclude outlier symbol search
rcorr(as.matrix(wisc.data2), type = "spearman") #spearman


Bodin, D., Pardini, D. A., Burns, T. G., & Stevens, A. B. (2009). Higher order factor structure of the WISC-IV in a clinical neuropsychological sample. Child Neuropsychology, 15(5), 417-424.
Chen, H., Keith, T., Chen, Y., & Chang, B. (2009). What does the WISC-IV measure? Validation of the scoring and CHC-based interpretative approaches. Journal of Research in Education Sciences, 54(3), 85-108.
Keith, T. Z., Fine, J. G., Taub, G. E., Reynolds, M. R., & Kranzler, J. H. (2006). Higher order, multisample, confirmatory factor analysis of the Wechsler Intelligence Scale for Children—Fourth Edition: What does it measure. School Psychology Review, 35(1), 108-127.
Dang, H. M., Weiss, B., Pollack, A., & Nguyen, M. C. (2011). Adaptation of the Wechsler Intelligence Scale for Children-IV (WISC-IV) for Vietnam. Psychological studies, 56(4), 387-392.
Weiss, L. G., Keith, T. Z., Zhu, J., & Chen, H. (2013). WISC-IV and clinical validation of the four-and five-factor interpretative approaches. Journal of Psychoeducational Assessment, 31(2), 114-131.
Watkins, M. W. (2006). Orthogonal higher order structure of the Wechsler Intelligence Scale for Children–Fourth Edition. Psychological Assessment, 18(1), 123.
Watkins, M. W., Wilson, S. M., Kotz, K. M., Carbone, M. C., & Babula, T. (2006). Factor structure of the Wechsler Intelligence Scale for Children–Fourth Edition among referred students. Educational and Psychological Measurement, 66(6), 975-983.

Review: Faking Science (Diederik Stapel)

FakingScience-20141214 <– PDF

In general, this book was a fun and quick read. It gets somewhat repetitive with his descriptions of how bad he feels about his actions and how he was mistreated by some, stalked by the media, etc. It is worth reading if you care about psychology as a field. Surely there are more Stapels out there. The best solution for weeding them out and keeping them away is: 1) replication studies, both by the same and independent authors; 2) open data, so fraudulent data can more easily be spotted; 3) more meta-analyses of published studies that examine publication bias and estimate effect sizes given the bias (currently not a widespread practice); and 4) more statistics! :)

I preferred to drink wine, and was inclined to totally embrace social psychology, but I did drink beer from time to time, and deep inside I felt that social psychology had a pretty optimistic view of the degree to which human behavior could be influenced. It seemed to suggest that you could get people to behave in the way you wanted just by engineering the right environment. Hadn’t social psychologists learned anything from all the failed attempts of the past to build the ideal society? Didn’t they know about all the failed hippy communities, communes, sects, and kibbutzim? If those had taught us one thing, it’s that you can’t construct a society from scratch. Thinking about it, the idea that “you are your environment” was ridiculous. In any given situation—a school, a church, a neighborhood—you’ll always find lots of different people, behaving in lots of different ways. So where did those differences come from? How could social psychology explain them, if not by reference to personality factors? Somewhere, deep inside each of us, doesn’t there have to be some core element that accounts for these differences? In the end, aren’t we all just, simply, our… selves?

I thought I might find an answer to these questions during the classes on personality studies. In the social psychology classes that I’d taken, a series of teachers, all social psychologists themselves, had enthusiastically promoted the “situationist” viewpoint, explaining how behavior is principally affected by environmental factors, and so I was expecting the same kind of thing in the personality classes, with personality psychologists mounting a vigorous defense of the idea that personality factors were the main determinants of behavior. So I was rather shocked when this didn’t happen.

During the personality studies classes in Amsterdam, we were introduced, somewhat ironically, by our lecturer (usually dressed in a flowing coat and with a carefully combed curl in his blond hair) to Benjamin Kouwer’s book The Game of Personality. In this book, Kouwer puts forward the proposition that there is no such thing as personality. What we call personality is, according to Kouwer, an empty and meaningless concept, that we have been attempting to catch in our hands with any number of useless theories, because things would be so much nicer and more convenient if it actually existed. This was the first and only time that I heard a teacher spend an entire series of classes defending the position that what he was supposed to be teaching us about didn’t exist. It was a refreshing way of putting things into perspective, which of course raised the question of how valid the concepts that were studied by all the other types of psychology were too.

Kouwer wrote his masterwork about personality theories in 1963. The book, which opened with the ominous sentence “Man doubts nothing so much as himself and his fellow men,” is written in a hilariously sarcastic tone; Kouwer’s fellow psychologist, Barendregt, described it on its publication as a “critical examination of the chaos that reigns over everything that can be described by the term ‘personality’”. Kouwer describes, step by step, from the ancient Greeks through to modern American experimental psychologists, everything that has been written, claimed, and theorized about personality. He looks at the insights of psychologists, physicians, philosophers, sociologists, astrologers, and other types of characterologists. Theophrastus, Schopenhauer, Skinner, Freud, Jung, Sartre: nobody is safe from Kouwer’s pithy analysis. His book describes and critiques the dozens (perhaps hundreds) of theories about “the true self” that people have cobbled together over the last few thousand years, and concludes, in a withering final chapter, that The True Self simply doesn’t exist. There is no core personality, no kernel of human existence. If you have a personality test that measures “honesty”, it turns out not to be very useful for predicting whether someone will cheat at cards, crib in an exam, or invent fake data for a research project. Nobody is always honest and does the right thing; nobody is always dishonest and does the wrong thing. Nobody behaves the same way all the time.

So Stapel read a wildly outdated book and decided that was it for personality research? Never mind all those empirical studies finding personality traits to be stable over time, in animals too? Never mind the plethora of intelligence data? These data were already sufficient in the 1960s. Stapel must be pretty stupid or wildly naive.

As I dug deeper into social psychology, I found myself increasingly convinced by the idea that people’s feelings, perceptions, and behavior are principally influenced by the present situation. To use the jargon, I became a situationist, as opposed to a personologist. Apart from the idea that behavior is largely driven by situational factors, another thing that social psychology taught me is that it’s human nature to greatly underestimate the influence of the situation and overstate the effect of personality traits. Even though it’s often very clear from their behavior that people are always adapting themselves to different situations, contexts, and environments, we still tend to feel that behavior is actually a function of someone’s personality or motivation. Why? Because it gives us a feeling of control and certainty, of understanding. John does something. Why? Because he’s John.

Stapel apparently does not notice the contradiction between situationism and claims about human nature. Are some traits genetically determined/influenced or not? Earlier, he talks about how it makes sense from an evolutionary view:

This account fits well with an evolutionary perspective. In order to survive in constantly changing circumstances, people have to keep adapting. In fact survival is, more or less, the same as adaptation. People play different roles; as they do so, they change, and grow, and their personality takes on new facets. Although the physical body that seems to contain the personality may remain recognizable, the personality itself cannot be fully grasped, because it’s always changing with the context. It’s like the ship of Theseus: when all the wooden parts of the ship have been replaced over time, is it still the same ship? Nothing is left of the original ship that Theseus first sailed out of the harbor, but it’s still his ship.

Now, Mr. Stapel, how do things evolve? Through genetics, surely. Which means that traits both: 1) are heritable to some degree, and 2) vary between people. It cannot all be situation; there would be nothing for selection to work with. Evolution requires genetically caused diversity in traits to work.

Why do people use stereotypes, no matter how inappropriate, to explain and judge how others behave? Because we’re lazy, and social categories (gays, Mexicans, Muslims) and the stereotypes that go along with them (effeminate, lazy, terrorists) help to make the world seem simpler. We like our shortcuts. We know that “chair” goes with “sit”, so it’s nice to be able to put together other pairings like “woman/emotional”, “man/competitive”, “child/innocent”, “soldier/tough”, or “professor/smart”, even if they’re not always (if at all) correct.

Stereotypes are not inappropriate. Stereotypes are just beliefs about group statistics, typically means. No one really believes that there is no within-group variance, i.e. no one thinks groups are categorical. I can find no evidence of the often-lamented black-and-white thinking about groups. For much more about stereotypes, see Lee Jussim’s excellent book.

Scientific research is a bit like solving a jigsaw puzzle. You think up how the puzzle should look, find the pieces, identify the holes, cut new pieces, and see if everything fits. At the level of millimeters all this fine detail can sometimes feel a bit pointless, but the aim is to have each larger section of the puzzle, an inch or a few inches across, look good. When you’ve finished all the cutting and trimming and fitting, when you take a step back and look at the whole picture, you don’t see which pieces are original and which you had to improvise. All you see is a single, coherent picture. It’s a nice feeling to see something that you imagined come to life before you. It’s great to come up with an idea, develop it theoretically, and validate it empirically. It’s absolutely fantastic when you discover— perhaps after a bit of trial and error—that your idea works.

But sometimes it doesn’t work. Sometimes an experiment goes wrong. Sometimes—in fact, pretty often—the results don’t come out the way you hoped. Sometimes reality just doesn’t want to go along with your theoretical analysis, no matter how logical and carefully formulated that might be. It’s frustrating, but that’s how it is. Sometimes you’re just wrong; you have to go back to the drawing board and try again, a bit harder this time. But if you don’t find what you were expecting, while everybody else seems to have no trouble, it’s even more frustrating. Let’s say the literature is full of discussions about effect X—for example, if you have people read a text in which the word “friendly” occurs a few times, their opinions of others become more positive, but if you replace “friendly” with the names of individual people who are perceived to be friendly, like “Gandhi” or “Mandela”, their opinions of others become more negative. You’d like to show this “X effect” yourself. So you read the literature on X very carefully, you do exactly what the “Methods” section of the article says you should do, and you get… Y. Not X. Oh. Now what? Well, let’s run it again. Nope, still Y. Now what? Back to the literature, read it again twice, check everything, change the materials a bit, run it again. Y. Now what?

I was doing something wrong. Clearly, there was something in the recipe for the X effect that I was missing. But what? I decided to ask the experts, the people who’d found the X effect and published lots of articles about it. Maybe they could send me their materials? I wrote some letters. To my surprise, in most cases I received a prompt and comprehensive reply. My colleagues from around the world sent me piles of instructions, questionnaires, papers, and software.
Now I saw what was going on. In most of the packages there was a letter, or sometimes a yellow Post-It note stuck to the bundle of documents, with extra instructions:
“Don’t do this test on a computer. We tried that and it doesn’t work. It only works if you use pencil-and-paper forms.”
“This experiment only works if you use ‘friendly’ or ‘nice’. It doesn’t work with ‘cool’ or ‘pleasant’ or ‘fine’. I don’t know why.”
“After they’ve read the newspaper article, give the participants something else to do for three minutes. No more, no less. Three minutes, otherwise it doesn’t work.”
“This questionnaire only works if you administer it to groups of three to five people. No more than that.”
I certainly hadn’t encountered these kinds of instructions and warnings in the articles and research reports that I’d been reading. This advice was informal, almost under-the-counter, but it seemed to be a necessary part of developing a successful experiment.
Had all the effect X researchers deliberately omitted this sort of detail when they wrote up their work for publication? I don’t know. Perhaps they did; or perhaps they were just following the rules of the trade: You can’t bore your readers with every single detail of the methodology you used. That would be ridiculous. Perhaps

And the answer just screams at anyone who has read their Schmidt and Hunter: these results are all based on small studies with large amounts of publication bias. The negative results don’t get published. These findings are not reliable, and the true effect sizes, if they exist at all, are not very large. These researchers do not understand validity generalization. His chapter 4 is a long story of how not to do science.

A few weeks ago I was in the paper—in fact, I was in all the papers. I had published a study which showed that messy streets lead to greater intolerance. In a messy environment, people are more likely to resort to stereotypes of others because trash makes you want to clear it up, and the use of stereotypes lets you feel like you’re clearing things up. Stereotypes bring clarity to a messy world. Women are emotional, men are aggressive, New Yorkers are in a hurry, Southerners are hospitable. Stereotypes make the world predictable, and we like that, especially if the world currently looks dirty and unkempt.

The publication of this study caused a sensation. It was published in the most prestigious journal of them all, Science, and it made headlines around the world. The idea that physical disorder activates the need for mental order and so leads to stereotyping and intolerance was innovative and exciting. It might explain why there’s more interpersonal conflict in run-down neighborhoods and it suggested an elegant way to combat racism and other forms of discrimination: clear up the mess, throw out the trash.

The coolest aspect of the study was the way in which it combined careful laboratory research with field studies that people could relate to. In the lab we had students sit in front of a computer and look at photographs, words, and symbols that depicted greater or lesser degrees of disorder, before asking them to fill in some questionnaires. In the field, we stood in clean or dirty railway stations, or on messy or neat street corners, and interviewed unsuspecting passers-by about their opinions on immigrants, gay people, foreigners, men, and women. The idea was very simple, the lab work was impressive, and the field studies were models of cunning design.

What made this research especially attractive was that it followed logically from decades of research into stereotyping. Every social psychologist knows that the need for structure is one of the driving factors behind the human tendency to stereotype and discriminate against others. There were already dozens of studies showing that the need for structure (“I want certainty”) is directly coupled to the use of stereotypes. The more structure you need, the more likely you are to judge people based on preconceptions. So it was only logical that this would still hold if the need for mental order was caused by physical disorder. Anyone could have thought of that. Maybe. But I was the one who had actually come up with the idea. In fact, I hadn’t just come up with the idea; I’d come up with all the data myself. It was a clever, simple, logical, and obvious idea, but the empirical tests were completely imaginary. The lab research hadn’t been carried out. The field studies never happened.

How much money did this fraud waste? How many ghetto programs were initiated based on this idea that if we just clean up the environment, the people will turn into productive citizens? I know there are lots of programs like these in Denmark. No one knows, because they are local projects: no one compiles a list of their costs or of what effects, if any, they had.

How do these people think these environments got bad to begin with? Who breaks the things in ghettos?

I got the idea that physical disorder might create the need for structure when, by chance, I came across “broken windows theory” in the literature. This theory argues that there is a relationship between the bad state of repair of housing in poor neighborhoods and the other social problems to be found there. Because I knew that the need for structure is one of the main motivations for people to stereotype each other as undesirables, I immediately saw the connection. It was only logical that neighborhoods with lots of broken windows, liquor stores, empty homes and dilapidated buildings would have social problems. All that urban decay pushes people to use stereotypes and other forms of prejudice to “clean things up” in their heads, thus restoring some structure. It was a brilliant idea.

I asked one of my students, a good photographer who I knew lived in a disheveled squat, to take some pictures of houses with and without broken windows, neat streets and disorderly ones, walls with and without graffiti, and any other contrast between trash and non-trash she could think of. A few days later she brought me dozens of photos. I selected a few of them and devised a questionnaire. We showed people a set of photos—showing either orderly or disorderly scenes—and then had them answer questions about different social groupings.

There was a measurable “chaos leads to stereotyping” effect—the people who had seen the disorderly photos gave more prejudiced answers—but it wasn’t very strong.

I decided to try another approach. Instead of a succession of photos, I made a collage with a house, a tree, a car, and some people where everything looked normal, and another one where everything was out of place. That worked as well, but the effect was still very small. It worked for some stereotypes, but not for all of them.

I tried again with photos of walls and houses, but this time there was no difference at all between the groups. I had lost the effect, and I couldn’t bring it back. Such a beautiful, obvious, logical, and (especially) simple effect, and I couldn’t find it. I decided to give up. I obviously had no talent for simplicity.

More underpowered studies with small or null population effect sizes.

I’ve come up with a nice idea for a series of experiments with an American colleague, but that project isn’t making much progress. Here’s how it works: people sit at a computer screen and we show them brief flashes of either very attractive or very ugly faces. The images show up for such a short time that the participants can’t really make out the faces, but they know that they’ve seen something because of the flash. Every flash occurs in a different corner of the screen, and the subjects have to press a button to say whether they saw it on the left or the right. After that, we tell them that the task is finished, but we ask them just to sign a piece of paper, “for the record.” The idea is that seeing pictures of ugly people, even unconsciously as the image is flashed in front of them for a fraction of a second, will make people feel good about themselves, which will make their signature larger, whereas flashes of attractive people will make their signature smaller—in other words, there’ll be a contrast effect. The size of someone’s signature is a subtle, implicit way to measure how positive their self-image is. If your signature becomes bigger after seeing the flashed images, you see yourself in a more positive light, and if it becomes smaller, your self-image has become more negative.

The initial results were very promising, but the last experiment I ran to try and demonstrate this automatic, unconscious social comparison effect failed completely. That’s incomprehensible, and it still hurts. It was an elegant idea, and everyone expected it to work, given the number of comparable effects in the literature.

In the only way my painful back will allow, I stand in front of my computer in the kitchen and start to write. It’s a great story, with lots of experiments, all of which turn out fine. After just four days, the article is written. All the failures are behind me.

A wildly implausible hypothesis turns out not to be confirmed by an experiment? How surprising. Which world are these social psychologists living in? If Stapel was a hotshot and he believed all these things, surely such beliefs are common in the field.

I drive to Zwolle and then to Groningen. I can picture the scene, a few months ago, at the start of the summer, just before the last day of school. There are dozens, if not hundreds, of students, their faces showing their concentration but also smiling, filling in my questionnaires. They’re wearing summer dresses or shorts, with thongs on their feet. They’re sitting in silence, at white tables laid out in rows of five, working hard in the name of science. Some have their tongues sticking out of their mouths, others are pressing their pencils so hard into the paper that they nearly snap, but they’re all trying their very best. I can see them: circling answers, shading boxes, making crosses. I can see it with my own eyes. I didn’t have to come here today, because they’re all here. All here, answering questions in the name of science.

Thongs on their feet?!? Mistranslation?

The October report felt to me like an attempt at an exorcism. What I discovered in it was not just a description or an explanation of something evil; it was an attempt to destroy that evil, root and branch. I saw myself portrayed as an arrogant, manipulative con artist, an evil genius, a wicked researcher who, very deliberately, following a nefarious master plan, had set out to deceive as many people as possible. Really? Sure, I’d told a whole load of lies, and I’m going to have to accept the punishment that goes with that for the rest of my life. But did I really have a plan? If I had, wouldn’t I have gone about it in a more careful, smart, calculating way? My fraud was a mess, my fake data always put together in a hurry, full of statistical errors and little quirks that made it easy to spot, if you looked even moderately carefully at it. Wouldn’t a thoughtful, rational person make a better, neater, and less obvious job of it? Apparently I’d deliberately surrounded myself with weak, easily manipulated research students so that I could make them extra-dependent on me and get them to go along with my evil little schemes. Really? In fact, all students had to go through an intensive, “any doubt—out” selection process, with three or four other people besides me, and only the best few making it through. They’d suggested that I’d managed to surround myself with poor-quality researchers. Really? Many of the people I’d published with were senior academics and leading psychologists, some of them world-famous. They said I’d gotten rid of people who’d dared to criticize me. Really? Who? Who did I get rid of, and when? What was their complaint? Did they leave because of me or for some other reason? Sure, not everyone ended up with a paid gig, but was that because they’d criticized me, or vice versa?
The fact that I’d invited my colleagues to dinner at our house, organized drinks receptions or barbecues, or an occasional trip to the theater, was cited as evidence of my manipulative behavior. Seriously? We worked hard together, sometimes we let our hair down, and sometimes we tried to do a bit of team-building. Was I the only person doing this with my team? Doesn’t anyone else ever go out for the evening with people from work? Don’t other groups of researchers like to have a good time occasionally? But in my case, it seems I was doing it to butter up my colleagues, to get them on my side, and especially to get them to keep quiet if they found out that anything sketchy was going on. Really? Was I really so calculating? Did I do all that just to maintain my web of lies? Apparently so. I found out later that one of the members of the Committee, in a discussion with some of my former colleagues, had become somewhat emotional and described me as “just like any other criminal”.

This is another interesting case of one person’s modus ponens being another person’s modus tollens. Stapel is arguing that:

  • If the people he had published with were leading social psychologists, then he did not surround himself with poor-quality researchers.
  • The people he had published with were leading social psychologists.
  • So, he did not surround himself with poor-quality researchers.

Now, I’m more inclined (based on this book and other failures of social psychology) to reason the other way:

  • He surrounded himself with poor-quality researchers.
  • The people he had published with were leading social psychologists.
  • So, leading social psychologists were poor-quality researchers.
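For readers who want the structure spelled out, one way to formalize the reversal is as follows (the letters P and G are my own labels, not from the book):

```latex
% Let P = "the coauthors were leading social psychologists"
%     G = "the coauthors were good researchers"
% Implicit background premise: P \rightarrow G.
%
% Stapel (modus ponens):
%   P \rightarrow G,\quad P \;\vdash\; G
%
% The reversal (in the spirit of modus tollens): grant P, but take
% \neg G as the better-established fact, and reject the conditional:
%   P,\quad \neg G \;\vdash\; \neg(P \rightarrow G)
% i.e., being a leading social psychologist does not guarantee being
% a good researcher.
```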



Review: The Sports Gene (David Epstein)

Incidentally, the Wikipedia page was very poor, so I had to rewrite it before writing this review.

Generally, this was an interesting read that taught me a lot, which probably has to do with my not caring much about sports. Some parts can be boring if you don’t care about or know much about e.g. baseball. The choice of topics is pretty US-centric.

The science in the book comes mostly through interviews with experts and some summarizing of studies. Rarely is sufficient detail given about a study for one to make an informed decision about whether to trust it. Usually, no sample sizes, p-values, effect sizes, etc. are mentioned. To be fair, it was written as a popular science book, so this criticism is somewhat unfair.

Some quotes:

When scientists at Washington University in St. Louis tested him, Pujols, the greatest hitter of an era, was in the sixty-sixth percentile for simple reaction time compared with a random sample of college students.

College students are above average in g, which means above average (i.e., faster) reaction time. Presumably, they tested simple reaction time, which correlates about .2 with g. College students are perhaps at 115 IQ on average, and this university is apparently a top one, so perhaps the mean IQ there is 120-125, meaning that these students are about 0.334 d above the population mean on reaction time (unless they were physical education students, in which case they may be even higher). Being at the 66th centile of that sample is not bad, then.
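The back-of-envelope calculation can be made explicit. A minimal sketch, assuming normality throughout and using the upper end (125) of the IQ range guessed above:

```python
from statistics import NormalDist

# Assumptions from the text: students' mean IQ ~125 (population mean 100,
# SD 15), and simple reaction time correlates ~.2 with g.
student_g_advantage_d = (125 - 100) / 15            # ~1.67 SD above the population mean
rt_g_correlation = 0.2
student_rt_advantage_d = student_g_advantage_d * rt_g_correlation  # ~0.33 d, the figure in the text

# Pujols sits at the 66th centile of the student sample:
pujols_d_vs_students = NormalDist().inv_cdf(0.66)   # ~0.41 SD above the student mean

# Relative to the general population, the two offsets add:
pujols_d_vs_population = student_rt_advantage_d + pujols_d_vs_students
pujols_centile_vs_population = NormalDist().cdf(pujols_d_vs_population)
```

On these assumptions Pujols ends up roughly 0.75 d above the general-population mean, i.e. around the 77th centile, which is the sense in which a 66th-centile score among selected students “is not bad.”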

Jason Gulbin, the physiologist who worked on Australia’s Olympic skeleton experiment, says that the word “genetics” has become so taboo in his talent-identification field that “we actively changed our language here around genetic work that we’re doing from ‘genetics’ to ‘molecular biology and protein synthesis.’ It was, literally, ‘Don’t mention the g-word.’ Any research proposals we put in, we don’t mention the genetics if we can help it. It’s: ‘Oh, well, if you’re doing molecular biology and protein synthesis, well, that’s all right.’” Never mind that it’s the same thing.

Studying race? NAZI NAZI!!! Studying population genetics? No problem, carry on.

This story is fascinating. Perhaps the best example of how categorical thinking about gender leads to real-life problems.

Several scientists I spoke with about the theory insisted that they would have no interest in investigating it because of the inevitably thorny issue of race involved. One of them told me that he actually has data on ethnic differences with respect to a particular physiological trait, but that he would never publish the data because of the potential controversy. Another told me he would worry about following Cooper and Morrison’s line of inquiry because any suggestion of a physical advantage among a group of people could be equated to a corresponding lack of intellect, as if athleticism and intelligence were on some kind of biological teeter-totter. With that stigma in mind, perhaps the most important writing Cooper did in Black Superman was his methodical evisceration of any supposed inverse link between physical and mental prowess. “The concept that physical superiority could somehow be a symptom of intellectual inferiority only developed when physical superiority became associated with African Americans,” Cooper wrote. “That association did not begin until about 1936.” The idea that athleticism was suddenly inversely proportional to intellect was never a cause of bigotry, but rather a result of it. And Cooper implied that more serious scientific inquiry into difficult issues, not less, is the appropriate path.

How very familiar. Better not hurt those feelings! At least they should publish the data in some anonymized form so others can examine them.

There is a university called Lehigh… Le High… geddit??

In 2010, Heather Huson, a geneticist then studying at the University of Alaska, Fairbanks—and a dogsled racer since age seven—tested dogs from eight different racing kennels. To Huson’s surprise, Alaskan sled dogs have been so thoroughly bred for specific traits that analysis of microsatellites—repeats of small sequences of DNA—proved Alaskan huskies to be an entirely genetically distinct breed, as unique as poodles or labs, rather than just a variation of Alaskan malamutes or Siberian huskies.
Huson and colleagues discovered genetic traces of twenty-one dog breeds, in addition to the unique Alaskan husky signature. The research team also established that the dogs had widely disparate work ethics (measured via the tension in their tug lines) and that sled dogs with better work ethics had more DNA from Anatolian shepherds—a muscular, often blond breed of dog originally prized as a guardian of sheep because it would eagerly do battle with wolves. That Anatolian shepherd genes uniquely contribute to the work ethic of sled dogs was a new finding, but the best mushers already knew that work ethic is specifically bred into dogs.
“Yeah, thirty-eight years ago in the Iditarod there were dogs that weren’t enthused about doing it, and that were forced to do it,” Mackey says. “I want to be out there and have the privilege of going along for the ride because they want to go, because they love what they do, not because I want to go across the state of Alaska for my satisfaction, but because they love doing it. And that’s what’s happened over forty years of breeding. We’ve made and designed dogs suited for desire.”

Admixture studies in dogs, a useful precedent to cite to ease the pain for newcomers.

In one tank are mice missing oxytocin receptors. They are used in the study of pain, but the mice also have deficits in social recognition. Put them with mice they grew up with and they won’t recognize them. In another corner is a tank of raven-haired mice that were bred to be prone to head pain, that is, migraines. They spend a lot of time scratching their foreheads and shuddering, and they are apparently justified in using the old headache excuse to avoid mating. “This experiment has taken years,” says Jeffrey Mogil, head of the lab, of the work that seeks to help develop migraine treatments, “because they breed really, really badly.”

How did they get ethics approval for this???

As Pitsiladis put it, to be a world-beater, “you absolutely must choose your parents correctly.” He was being facetious, of course, because we can’t choose our parents. Nor do humans tend to couple with conscious knowledge of one another’s gene variants. We pair up more in the manner of a roulette ball that bounces off a few pockets before settling into one of many suitable spots. Williams suggests, hypothetically, that if humanity is to produce an athlete with more “correct” sports genes, one approach is to weight the genetic roulette ball with more lineages in which parents and grandparents are outstanding athletes and thus probably harbor a large number of good athleticism genes. Yao Ming—at 7’5″, once the tallest active player in the NBA—was born from China’s tallest couple, a pair of ex–basketball players brought together by the Chinese basketball federation. As Brook Larmer writes in Operation Yao Ming: “Two generations of Yao Ming’s forebears had been singled out by authorities for their hulking physiques, and his mother and father were both drafted into the sports system against their will.” Still, the witting merger of athletes in pursuit of superstar progeny is rare.

Sure we do! Some do it quite consciously, e.g. using dating sites that match on overall similarity.