pumpkinperson.com/2015/04/13/is-the-sat-an-iq-test/

Blog commenter Lion of the Judah-sphere has claimed that the SAT does not correlate as well with comprehensive IQ tests as said IQ tests correlate with one another. At first I assumed he was wrong, but my recent analysis suggesting Harvard undergrads have an average Wechsler IQ of 122, really makes me wonder.

While an IQ of 122 (white norms) is 25 points above the U.S. mean of 97, it seems very low for a group of people who averaged 1490 out of 1600 on the SAT. According to my formula, since 1995 a score of 1490 on the SAT equated to an IQ of 141. But my formula was based on modern U.S. norms; because demographic changes have made the U.S. mean IQ 3 points below the white American mean (and made the U.S. standard deviation 3.4 percent larger than the white SD), converting to white norms reduces Harvard’s SAT IQ equivalent to 139.

…

In general, research correlating the SAT with IQ has been inconsistent, with correlations ranging from 0.4 to 0.8. I think much depends on the sample. Among people who took similar courses in similar high schools, the SAT is probably an excellent measure of IQ. But considering the wide range of schools and courses American teenagers have experienced, the SAT is not, in my judgement, a valid measure of IQ. Nor should it be. Universities should not be selecting students based on biological ability, but rather on acquired academic skills.

The lower values are due to restriction of range, e.g. in Frey and Detterman (2004). When corrected, the correlation goes up to the .7-.8 range. Condon and Revelle (2014) also report .54 using the ICAR60, without correction for reliability or range restriction.

As for the post, I can think of a few things:

1. The sample recruited is likely not representative of Harvard. Probably mostly social sci/humanities students, who have lower scores.

2. Regression towards the mean means that the Harvard student body won’t be as exceptional on their second measurement as on their first. This is because part of the reason they scored so high the first time was just good luck.

3. The SAT is teachable to some extent, and the taught gains won’t transfer well to other tests. This reduces the correlation between the SAT and other GCA tests.

4. Harvard uses affirmative action, which lowers the mean SAT of the students a bit. The mean seems to be about 1500.

The SAT has a mean of approx. 500 per subtest, a ceiling of 800 and an SD of approx. 100. So a 1500 total score is 750+750 = (500+250)+(500+250) = (500+2.5 SD)+(500+2.5 SD), or about 2.5 SD above the mean. Test-retest reliability/stability over a few months is around .86 (mean of the values in research.collegeboard.org/sites/default/files/publications/2012/7/researchreport-1982-7-test-disclosure-retest-performance-sat.pdf, n≈3700).

The interesting question is how much regression towards the mean we can expect. I decided to investigate using simulated data. The idea is basically that we first generate some true scores (per classical test theory), and then make two measurements of them. Using the first measurement, we select a group whose mean is 2.5 SD above the population mean, and then we check how well this group performs on the second testing.

In R, we do this by simulating some normally distributed data and then creating new variables that are the sum of true score and error for each measurement. This presents us with a problem.

**How much error to add?**

We can solve this question either by trying some values, or analytically. Analytically, it is like this:

Cor (test1 x test2) = cor(test1 x true score) * cor(test2 x true score)

The correlation between the two testings is due to their common association with the true scores. To simulate a testing, we need the correlation between a testing and the true score. Since the test-retest correlation is .863, we take the square root and get .929. The squared value of a correlation is the share of variance it explains, so squaring .929 gets us back to .863. Since the total variance is 1, the remaining (error) variance is 1-.863=.137, and its square root, .370, is the correlation between a testing and its error. These correlations are the weights we need when summing: weighting the true score by .929 and the error by .370 gives measurements with a test-retest correlation of .863.
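Written out as a worked equation, the weights follow directly from the test-retest value. This is only a sketch: the symbols $T$ (standardized true score), $E_i$ (independent standard normal errors), $X_i$ (the two testings) and the weights $a$, $b$ are notation assumed here, not taken from the post.

```latex
\begin{aligned}
X_i &= a\,T + b\,E_i, \qquad a^2 + b^2 = 1 \\
\operatorname{cor}(X_1, X_2) &= \operatorname{cor}(X_1, T)\,\operatorname{cor}(X_2, T) = a^2 = 0.863 \\
a &= \sqrt{0.863} \approx 0.929 \\
b &= \sqrt{1 - 0.863} = \sqrt{0.137} \approx 0.370
\end{aligned}
```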

One could also just try some values until one gets something that is about right. In this case, just weighting the true score by 1 and the error by .4 produces nearly the same result. Trying out a few values like this is faster than doing the math. In fact, I initially tried out some values to hit the approximate value but then decided to solve it analytically as well.
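The trial-and-error route can be checked with a few lines of R. This is a minimal sketch, assuming a smaller sample than the main simulation and an arbitrary seed; the variable names are illustrative only. With weights of 1 and .4 the expected test-retest correlation is 1/(1+.4²).

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
n <- 1e5                         # smaller than the main simulation
true_score <- rnorm(n)
error_weight <- 0.4

# Two testings: true score weighted by 1, error weighted by .4
test_1 <- true_score + error_weight * rnorm(n)
test_2 <- true_score + error_weight * rnorm(n)

simulated_r <- cor(test_1, test_2)
expected_r  <- 1 / (1 + error_weight^2)   # shared variance / total variance
round(c(simulated = simulated_r, expected = expected_r), 3)
```

Note that with these weights the measurements no longer have unit variance (it is 1.16), so a selection threshold would have to be rescaled accordingly.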

**How much selection to make?**

This question is more difficult analytically. I have asked some math and physics people and they could not solve it analytically. The info we have is the mean value of the selected group, which is about 2.5, relative to a standard normal population. Assuming we make use of top-down selection (i.e. everybody above a threshold gets selected, no one below), where must we place our threshold to get a mean of 2.5? It is not immediately obvious. I solved it by trying some values and calculating the mean trait value in the selected group. It turns out that to get a selected group with a mean of 2.5, the threshold must be 2.14.
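As a cross-check on the trial-and-error value, the mean of a standard normal variable truncated from below has a known closed form, dnorm(c) / (1 - pnorm(c)), often called the inverse Mills ratio. The threshold can then be found with a numerical root finder. The sketch below assumes top-down selection on a standard normal first-test score; the function and variable names are made up for illustration.

```r
# Mean of a standard normal truncated below at `threshold`:
# E[X | X > c] = dnorm(c) / (1 - pnorm(c))
truncated_mean <- function(threshold) {
  dnorm(threshold) / pnorm(threshold, lower.tail = FALSE)
}

target_mean <- 2.5   # mean first-test score of the selected group, in SD units

# Find the threshold at which the truncated mean equals the target
solution <- uniroot(function(threshold) truncated_mean(threshold) - target_mean,
                    interval = c(0, 4))
threshold <- solution$root
round(threshold, 2)
```

This should land very close to the 2.14 found by trying values.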

Since this group is selected partly for having positive errors on the first testing, their mean true score and mean second-testing score will be lower than their mean first-testing score of 2.5. How much lower? They are 2.32 and 2.15, respectively, according to my simulation. I’m not sure why the second measurement scores are lower than the true scores; perhaps someone more learned in statistics can tell me.
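One possible explanation, offered as a sketch rather than a definitive answer: in this simulation the true score has an SD of 1 but enters each testing with a weight of only .929, and the second error is independent of the selection, so its mean in the selected group is zero. The symbols below ($T$ for the true score, $X_1$ and $X_2$ for the two testings, $E_2$ for the second error) are notation assumed here.

```latex
\begin{aligned}
E[T \mid \text{selected}] &= \operatorname{cor}(X_1, T)\; E[X_1 \mid \text{selected}] = 0.929 \times 2.5 \approx 2.32 \\
E[X_2 \mid \text{selected}] &= 0.929\; E[T \mid \text{selected}] + 0.370\; E[E_2 \mid \text{selected}] \\
&= 0.929 \times 2.32 + 0.370 \times 0 \approx 2.16
\end{aligned}
```

That is close to the simulated 2.15, and it equals $0.863 \times 2.5$, the test-retest correlation times the first-testing mean.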

So there is still some way down to an IQ of 122 from 132.3 (2.15*15+100). However, we may note that they used a shorter WAIS-R form, which correlates .91 with the full-scale. Factoring this in reduces our estimate to 129.4. Given the selectivity noted above, this is not so unrealistic.

Also, the result can apparently be reached simply by 2.5*.86. I was aware that this might work, but wasn’t sure so I tested it (the purpose of this post). One of the wonders of statistical software like this is that one can do empirical mathematics. :)

**R code**

    ## SAT AND IQ FOR HARVARD
    library(dplyr)  # filter()
    library(psych)  # describe() is not in dplyr; psych's version is assumed here
    # size and true score
    n.size = 1e6
    true.score = rnorm(n.size)
    # add error to true scores
    test.1 = .929*true.score + .370*rnorm(n.size)
    test.2 = .929*true.score + .370*rnorm(n.size)
    SAT = data.frame(true.score, test.1, test.2)
    # verify it is correct
    cor(SAT)
    # select a subsample
    selected = filter(SAT, test.1 > 2.14)  # selected sample
    describe(selected)                     # desc. stats

**References**

- Condon, D. M., & Revelle, W. (2014). The International Cognitive Ability Resource: Development and initial validation of a public-domain measure. *Intelligence*, *43*, 52-64.
- Frey, M. C., & Detterman, D. K. (2004). Scholastic assessment or g? The relationship between the scholastic assessment test and general cognitive ability. *Psychological Science*, *15*(6), 373-378.

Normal practice is to treat Likert scales as continuous variables even though they are not. As long as there are >=5 options, the bias from discreteness is not large.

I simulated the situation for you. I generated two variables with continuous random data from two normal distributions with a correlation of .50, N=1000. Then I created likert scales of varying levels from the second variable. Then I correlated all these variables with each other.

Correlations of continuous variable 1 with:

| Variable | Correlation |
|---|---|
| continuous2 | 0.5 |
| likert10 | 0.482 |
| likert7 | 0.472 |
| likert5 | 0.469 |
| likert4 | 0.432 |
| likert3 | 0.442 |
| likert2 | 0.395 |

So you see, introducing discreteness biases correlations towards zero, but not by much as long as the Likert scale has >=5 levels. You can correct for the bias by multiplying by the correction factor if desired:

Correction factor:

| Variable | Correction factor |
|---|---|
| continuous2 | 1 |
| likert10 | 1.037 |
| likert7 | 1.059 |
| likert5 | 1.066 |
| likert4 | 1.157 |
| likert3 | 1.131 |
| likert2 | 1.266 |
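The correction factors are simply the correlation with the undiscretized variable divided by each observed correlation. A small sketch in R, using the correlations reported above; the variable names and the .30 example value are hypothetical.

```r
# Observed correlations with continuous variable 1, as reported above
observed <- c(continuous2 = 0.500, likert10 = 0.482, likert7 = 0.472,
              likert5 = 0.469, likert4 = 0.432, likert3 = 0.442,
              likert2 = 0.395)

# Correction factor = correlation with the continuous version / observed correlation
correction <- observed[["continuous2"]] / observed
round(correction, 3)

# Example use: a hypothetical observed correlation of .30 on a 5-point scale
corrected_example <- 0.30 * correction[["likert5"]]
round(corrected_example, 3)
```

These factors come from one simulation with normal data, equal-width bins and a true correlation of .50, so they are best read as rough guides rather than general constants.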

Psychologically, if your data does not make sense as an interval scale, i.e. if the difference between options 1-2 is not the same as between options 3-4, then you should use Spearman’s correlation instead of Pearson’s. However, it will rarely make much of a difference.
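To see how little the choice tends to matter, both coefficients can be computed on the same data with the `method` argument of `cor()`. This is a minimal sketch under assumed settings (arbitrary seed, N=1000, a 5-level scale cut into equal-width bins); it builds the correlated variable by hand instead of using MASS.

```r
set.seed(1)   # arbitrary seed, for reproducibility only
n <- 1000
continuous1 <- rnorm(n)
# Second variable built to correlate about .50 with the first
continuous2 <- 0.5 * continuous1 + sqrt(1 - 0.5^2) * rnorm(n)
# Five equal-width bins, mimicking a 5-point Likert item
likert5 <- as.numeric(cut(continuous2, breaks = 5))

pearson_r  <- cor(continuous1, likert5, method = "pearson")
spearman_r <- cor(continuous1, likert5, method = "spearman")
round(c(pearson = pearson_r, spearman = spearman_r), 3)
```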

Here’s the R code.

    # load library
    library(MASS)
    # simulate dataset of 2 variables with correlation of .50, N=1000
    simul.data = mvrnorm(1000, mu = c(0,0), Sigma = matrix(c(1,0.50,0.50,1), ncol = 2), empirical = TRUE)
    simul.data = as.data.frame(simul.data)
    colnames(simul.data) = c("continuous1","continuous2")
    # divide into bins of equal length
    simul.data["likert10"] = as.numeric(cut(unlist(simul.data[2]),breaks=10))
    simul.data["likert7"] = as.numeric(cut(unlist(simul.data[2]),breaks=7))
    simul.data["likert5"] = as.numeric(cut(unlist(simul.data[2]),breaks=5))
    simul.data["likert4"] = as.numeric(cut(unlist(simul.data[2]),breaks=4))
    simul.data["likert3"] = as.numeric(cut(unlist(simul.data[2]),breaks=3))
    simul.data["likert2"] = as.numeric(cut(unlist(simul.data[2]),breaks=2))
    # correlations
    round(cor(simul.data),3)