The book is on Libgen (free download).

Since I have ventured into criminology as part of my ongoing research program into the spatial transferability hypothesis (that psychological traits are stable when people move around, including between countries) and the studies of immigrant groups by country of origin, I thought it was a good idea to actually read some criminology. Since there was a recent book covering genetically informative studies, this seemed like a decent choice, especially because it was also available on Libgen for free! :)

So basically it is a debate book with a number of topics. For each topic, someone (or a group of someones) will argue for or explain the non-genetic theories/hypotheses, while another someone will sum up the genetically informative studies (i.e. behavioral genetics studies of crime) or at least the biologically informed ones (e.g. neurological correlates of crime).

Initially, I read the sociological chapters too, until I decided they were a waste of time. Then I just read the biosocial ones. If you are wondering about the origin of that term, as opposed to the more commonly used synonym sociobiological, its use was mostly a move to avoid political backlash. One of the biosocial authors explained it like this to me:

In terms of the name biosocial (versus sociobiological), I think the name change happened accidentally. But there was somewhat of a reason, I guess. EO Wilson and sociobiological thought was so hated amongst sociologists and criminologists, none of us would have gotten a job had we labelled ourselves sociobiologists. Though it was no great secret that sociobiology gave birth to our field. In some ways, it was purely a semantic way to fend off attacks. Even so, there are some distinctions between us and old school sociobiology (use of behavior genetic techniques, etc.).

The book suffers from the widespread problem in social science of not giving effect size numbers. This is more of a problem for the sociological chapters, but it is true of the biosocial ones as well. If effect sizes are not reported, one cannot compare the importance of the alleged causes! Note that behavioral genetics results inherently include effect sizes: even the simplest ACE model fit will output effect sizes for additive genetics, shared environment, and unshared environment plus error.

Even if you don’t plan to read much of this, I recommend reading the highly entertaining chapter: The Role of Intelligence and Temperament in Interpreting the SES-Crime Relationship by Anthony Walsh, Charlene Y. Taylor, and Ilhong Yun.

From the interactive visualization I previously published to foster an intuitive understanding of the concept:

Tail effects are when there are large differences between groups at the extremes (tails) of distributions. This happens when the distributions differ in either the mean or the standard deviation (or both), even when these differences are quite small. Below we see a density plot of two normal distributions with different means, as well as a threshold value (vertical line). The table below the plot shows various summary statistics about the distributions with regard to the threshold. Try playing around with the numbers on the left and see how the results change.
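As a rough sketch of the arithmetic behind this (in R, with made-up distribution parameters), a small mean difference produces increasingly lopsided group ratios the further out in the tail one looks:

```r
# Sketch: tail effects from a modest mean difference.
# Two normal distributions differing by 0.5 SD; compare the share of
# each group beyond increasingly extreme thresholds.
set.seed(1)
a = rnorm(1e6, mean = 0, sd = 1)
b = rnorm(1e6, mean = 0.5, sd = 1)

for (thr in c(1, 2, 3)) {
  ratio = mean(b > thr) / mean(a > thr)
  cat(sprintf("threshold %d SD: group ratio beyond it = %.1f\n", thr, ratio))
}
# The ratio grows as the threshold moves further into the tail,
# even though the mean difference is small.
```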

One of the pleasures of reading a very broad selection of science is that one discovers connections between fields that are not commonly connected. Sometimes these connections give rise to important new interdisciplinary fields or understanding; sometimes they just give you a nice feeling of seeing the same concept in different circumstances.

Tail effects are often discussed in differential psychology because of the continued interest in group differences. These can concern any trait: cognitive, interests, emotional, or personality, and any grouping: social, economic, gender, or racial. However, tail phenomena are more general than group differences; the two distributions can come from any kind of data, including the same data from different times.

In the last year or so I have taken an increased interest in climate science. The reason is basically this, and that science denialism annoys me and incentivizes me to explore a new area of science. In fact, the whole reason I got started on science in general was that I was debating with creationists on a forum. Debating creationists effectively requires a fairly broad knowledge of science and philosophy. One must understand enough physics and chemistry to explain how radiometric dating works, enough cosmology to explain facts related to the big bang, enough geology to explain plate tectonics, enough geology and paleontology to explain the distribution of fossils, enough evolutionary biology and genetics to explain the general ideas of evolution, and finally enough philosophy (logic, critical thinking, epistemology, philosophy of language) to spot logical errors and language and debating tricks. It isn't exactly easy. Just listing all these areas took a number of minutes; reading the Wikipedia articles would take many hours.

Anyway, since I had been debating climate skeptics recently on a Danish-language conservative-libertarian-nationalist news aggregator, I have encountered many odd claims, which require a fairly deep knowledge of various areas of climate science. E.g. to explain the facts related to Climategate, one needs an idea of temperature reconstruction with tree rings and their dating, scientific graphing, and the methods used to combine the data (principal components analysis, another method from psychometrics :) ). An issue that sometimes comes up is extreme weather. Since there was some discussion of this, including of the very definition, I decided to find a review article:
Zwiers, F. W., Alexander, L. V., Hegerl, G. C., Knutson, T. R., Kossin, J. P., Naveau, P., … & Zhang, X. (2013). Climate extremes: challenges in estimating and understanding recent changes in the frequency and intensity of extreme climate and weather events. In Climate Science for Serving Society (pp. 339-389). [odd journal name, but paper seems decent]

The following visual explanation of extreme weather is found in the paper:

[Figure from the paper: visual explanation of extreme weather.]

Which showcases the general applicability of the concept. :)

What is age heaping?

Number heaping is a common human tendency: we tend to round numbers to the nearest 5 or 10. Age heaping is the tendency of innumerate people to round their age to the nearest 5 or 10, presumably because they can't subtract to infer their current age from their birth year and the current year. Psychometrically speaking, this is a very easy mathematical test, so why is it useful? Surely everybody but small children can do it now? Yes. However, in the past, not all adults even in Western countries could do this. One can locate legal documents and tombstones from those times and analyze the amount of age heaping. The figure below shows an example of age heaping in old Italian data.

[Figure: age heaping in old Italian data.]

Source: “Uniting Souls” and Numeracy Skills. Age Heaping in the First Italian National Censuses, 1861-1881. A’Hearn, Delfino & Nuvolari – Valencia, 13/06/2013.

Since we know that the distribution of people's ages is really locally smooth (that is, the number of people aged 59 or 61 should be about the same as the number aged 60), we can calculate indexes of how much heaping there is and use them as a crude numeracy measure. Economic historians have been doing this for some time, so we now have some fairly comprehensive datasets for age heaping.
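One widely used heaping index is Whipple's index; a minimal sketch in R (with simulated ages, not one of the historical datasets):

```r
# Sketch: Whipple's index, a standard age heaping measure.
# Ratio of reported ages ending in 0 or 5 (ages 23-62) to the
# one-fifth expected under smooth reporting, times 100.
# 100 = no heaping; 500 = everyone reports an age ending in 0 or 5.
whipple = function(ages) {
  ages = ages[ages >= 23 & ages <= 62]
  100 * sum(ages %% 5 == 0) / (length(ages) / 5)
}

# Numerate population: roughly uniform ages
smooth_ages = sample(23:62, 1e5, replace = TRUE)
round(whipple(smooth_ages))  # close to 100

# Innumerate population: half the reports rounded to the nearest 5
heaped_ages = c(smooth_ages, round(smooth_ages / 5) * 5)
whipple(heaped_ages) > whipple(smooth_ages)  # TRUE
```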

Is it a useful correlate?

If you read the source above you will see that age heaping in the 1800s shows the expected north/south Italy pattern, but this is just one case. Does it work in general? The answer is yes. Below I plot some of the age heaping datasets against Lynn and Vanhanen's (2012) national IQs:

[Figures: age heaping in 1800, 1820, 1850, 1870 and 1890 plotted against national IQs.]

The problem with the data is this: the older datasets cover fewer countries and the newer datasets show strong ceiling effects (lots of countries very close to 100 on the x-axis). The ceiling effects are because the test is too easy. Still, the data covers a sufficiently large number of countries to be useful for modern comparisons. For instance, we can predict immigrant performance in Scandinavian countries based on their numeracy ability in the 1800s. Below I plot general socioeconomic performance (a general factor of education, income, use of social benefits and crime in Denmark in 2012) and age heaping in 1890:


The actual correlations are shown below:

          AH1800  AH1820  AH1850  AH1870  AH1890  LV12 IQ  S in DK
AH1800      1      0.95    0.94    0.96    0.90     0.85     0.61
AH1820     0.95     1      0.94    0.94    0.76     0.62     0.67
AH1850     0.94    0.94     1      0.99    0.84     0.73     0.59
AH1870     0.96    0.94    0.99     1      0.96     0.64     0.56
AH1890     0.90    0.76    0.84    0.96     1       0.52     0.73
LV12 IQ    0.85    0.62    0.73    0.64    0.52      1       0.54
S in DK    0.61    0.67    0.59    0.56    0.73     0.54      1


And the sample sizes:

          AH1800  AH1820  AH1850  AH1870  AH1890  LV12 IQ  S in DK
AH1800      31      25      22      22      24       29       24
AH1820      25      45      37      22      36       43       27
AH1850      22      37      45      27      37       43       30
AH1870      22      22      27      62      56       61       34
AH1890      24      36      37      56     109      107       50
LV12 IQ     29      43      43      61     107      203       68
S in DK     24      27      30      34      50       68       70


Great, where can I find the datasets?

Fortunately, they are freely available. The easiest solution is probably just to download the worldwide megadataset, which contains a number of the age heaping variables and lots of other variables for you to play around with:

Alternatively, you can get Baten's age heaping data directly:

R code

#this assumes you have loaded the megadataset as DF.supermega and the
#needed packages, e.g. p_load(kirkegaard, ggplot2, stringr, weights)
#the last variable name ("") is missing in the original; it should be
#the S in Denmark variable
temp = subset(DF.supermega, select = c("AH1800", "AH1820", "AH1850", "AH1870", "AH1890", "LV2012estimatedIQ", ""))
write_clipboard(wtd.cors(temp), digits = 2)

#plot each age heaping variable against national IQ and save to file
for (year in c("AH1800", "AH1820", "AH1850", "AH1870", "AH1890")) {
  p = ggplot(DF.supermega, aes_string(year, "LV2012estimatedIQ")) + geom_point() + geom_smooth(method = lm) + geom_text(aes(label = rownames(temp)))
  name = str_c(year, "_IQ.png")
  ggsave(name, p) #save each plot
}

#the y variable is missing in the original; the S in Denmark scores
#(DK.S below) are assumed
ggplot(DF.supermega, aes(AH1890, DK.S)) + geom_point() + geom_smooth(method = lm) + geom_text(aes(label = rownames(temp)))

Note that perhaps there should be scare quotes around “human” in the title. Would humans with a 1,000 SD increase in (general) cognitive ability (CA) really be human?

Steve Hsu discusses his rough estimation that we can increase CA in humans by around 1,000 SD by basically turning all the current alleles with negative effects into their positive or neutral variants. While the argument seems sound enough to me, I can think of some problems.

Trait level x gene interactions

One problem is the possibility of trait level x gene interactions. For instance, suppose that a large number of genes affect pathway X to CA in a roughly linear fashion (i.e. what we find using familial studies and GCTA methods). This could be brain nerve conduction velocity (BNCV), for which there is some evidence that it is related to CA (TE Reed, AR Jensen, 1993). One seemingly mostly forgotten study did find evidence that the correlation between IQ and NCV is genetic (FV Rijsdijk, DI Boomsma, 1997). There is a physical limit on how fast BNCV can be, such that the closer we get to the physical limit, the smaller the increase we get from altering another negative allele to its positive version. This would be roughly equivalent to the situation in physics with the speed of light: a given amount of energy converted to kinetic energy will result in a smaller increase in m/s the closer we get to the speed of light (the physical limit).
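The speed-of-light analogy can be made concrete with a few lines of R; the numbers are purely illustrative:

```r
# Sketch of the analogy: near a hard physical limit, equal increments
# of the input buy ever smaller increments of the output.
# Relativistic speed as a function of kinetic energy (in units of mc^2):
# v/c = sqrt(1 - 1/(1 + KE)^2)
v_over_c = function(ke) sqrt(1 - 1 / (1 + ke)^2)

ke = 1:10                   # equal energy steps
dv = diff(v_over_c(ke))     # speed gained per step
all(diff(dv) < 0)           # TRUE: each step buys less speed than the last
```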

In the comments, Hsu invokes the history of artificial selection on e.g. oil content to argue against trait level x gene interactions. See also: Animal breeding, human breeding. Brains are much more complicated than simple oil content, though.

Brain size and BNCV

I seem to recall that due to the relatively low BNCV, there is a limit on the practical size of the brain. We know that brain size correlates with CA at around .25 (large meta-analysis), which perhaps after corrections for errors will be .35 (restriction of range and measurement error; see Understanding Statistics). The problem arises because internal brain communication becomes slower as brain size increases (brief discussion here), which presumably in the end results in a smaller (possibly negative in the long run!) increase in CA from changing the alleles that result in larger brains. Solving this could require more modularization, which presumably would affect the factor structure of cognitive abilities, resulting in a weaker general factor.

Brain size and reproduction

When selecting for one trait, one will simultaneously select for a number of other genetically correlated traits. With cognitive ability, one of them is brain size. However, due to the physical limitation on space in women's wombs, we cannot just scale up brains indefinitely. The relatively large human head size already results in complications with giving birth in current humans. The birth problem has probably been a relatively strong selective force against higher CA.

We can of course use Cesareans now to avoid between-the-legs birth, so it is not really a problem, but it adds costs to the reproduction process. In the long run, if we scale up brain size a lot, we would need to scale up women's interior space to accommodate the larger fetus. Note that if we just increase the size of women overall, this would result in a smaller brain-to-body ratio, which is what really matters. So it won't be so easy to deal with this problem.

The final (biological reproduction) solution is to stop using women for reproduction: artificial wombs/uterus. This technology is however not being aggressively pursued as far as I know, so it is probably a number of decades away.

Of course, we will want to switch to some other substrate at some point too. :)

Due to a lengthy discussion over at Unz concerning the good performance of some African groups in the UK, it seems worthwhile to review the Danish and Norwegian results. Basically, some African groups perform better on some measures than the native British do. The author is basically arguing that this disproves global hereditarianism. I think not.

The over-performance relative to home country IQ of some African countries is not restricted to the UK. In my studies of immigrants in Denmark and Norway, I found the same thing. It is very clear that there are strong selection effects for some countries, but not others, and that this is a large part of the reason why the home country IQ x host country performance correlations are not higher. If the selection effect were constant across countries, it would not affect the correlations. But because it differs between countries, it essentially creates noise in the correlations.
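A toy simulation in R (all numbers made up) shows why country-varying selection attenuates the correlation while constant selection does not:

```r
# Sketch: country-varying selection attenuates the home IQ x
# host-country performance correlation; constant selection does not.
set.seed(1)
n = 50                               # hypothetical origin countries
home_iq = rnorm(n, 90, 10)

constant_sel = 5                     # same selectivity everywhere
varying_sel  = rnorm(n, 5, 10)       # selectivity differs by country

perf_const = home_iq + constant_sel
perf_vary  = home_iq + varying_sel

cor(home_iq, perf_const)             # exactly 1: a constant shift
cor(home_iq, perf_vary)              # noticeably below 1
```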

Two plots:


The codes are ISO-3 codes. So e.g. NGA = Nigeria, GHA = Ghana, KEN = Kenya, and so on. They perform fairly well compared to their home country IQ, both in Norway and Denmark. But Somalia does not, and the performance of several MENAP immigrant groups is abysmal.

The scores on the Y axis are S factor scores for their performance in these countries. They are general factors extracted from measures of income, educational attainment, use of social benefits, crime and the like. The S scores correlate .77 between the countries. For details, see the papers concerning the data:

I did not use the scores from the papers; I redid the analysis. The code is posted below for those curious. The kirkegaard package is my personal package; it is on GitHub. The megadataset file is on OSF.


library(pacman) #provides p_load
p_load(kirkegaard, ggplot2, psych, VIM) #psych for fa(), VIM for irmi()

M = read_mega("Megadataset_v2.0e.csv")

DK = M[111:135] #fetch danish data
DK = DK[miss_case(DK) <= 4, ] #keep cases with 4 or fewer missing
DK = irmi(DK, noise = F) #impute the missing
DK.S = fa(DK) #factor analyze
DK_S_scores = data.frame(DK.S = as.vector(DK.S$scores) * -1) #save scores, reversed
rownames(DK_S_scores) = rownames(DK) #add rownames

M = merge_datasets(M, DK_S_scores, 1) #merge to mega

ggplot(M, aes(LV2012estimatedIQ, DK.S)) + 
  geom_point() +
  geom_text(aes(label = rownames(M)), vjust = 1, alpha = .7) +
  geom_smooth(method = "lm", se = F)

# Norway ------------------------------------------------------------------

#NB: several column-name lists in this block were truncated in the
#original post; "" marks a missing variable name

NO_work = cbind(M[""]) #for work data; the 10 yearly out-of-work columns are truncated

NO_income = cbind(M["Norway.Income.index.2009"]) #for income data; the remaining year columns are truncated

#make DF
NO = cbind(M["NorwayViolentCrimeAdjustedOddsRatioSkardhamar2014"]) #the remaining columns are truncated

#get 5 year means
NO["OutOfWork.2010to2014.men"] = apply(NO_work[1:5],1,mean,na.rm=T) #get means, ignore missing; variable name missing in the original, inferred from the women's line below
NO["OutOfWork.2010to2014.women"] = apply(NO_work[6:10],1,mean,na.rm=T) #get means, ignore missing

#get means for income and add to DF
NO["Income.index.2009to2012"] = apply(NO_income,1,mean,na.rm=T) #get means, ignore missing

plot_miss(NO) #visualize missing data

NO = NO[miss_case(NO) <= 3, ] #keep those with 3 datapoints or fewer missing
NO = irmi(NO, noise = F) #impute the missing

NO_S = fa(NO) #factor analyze
NO_S_scores = data.frame(NO_S = as.vector(NO_S$scores) * -1) #save scores, reverse
rownames(NO_S_scores) = rownames(NO) #add rownames

M = merge_datasets(M, NO_S_scores, 1) #merge with mega

ggplot(M, aes(LV2012estimatedIQ, NO_S)) +
  geom_point() +
  geom_text(aes(label = rownames(M)), vjust = 1, alpha = .7) +
  geom_smooth(method = "lm", se = F)


cor(M$NO_S, M$DK.S, use = "pair")



A reanalysis of (Carl, 2015) revealed that the inclusion of London had a strong effect on the S loadings of the crime and poverty variables. S factor scores from a dataset without London and redundant variables were strongly related to IQ scores, r = .87. The Jensen coefficient for this relationship was .86.



Carl (2015) analyzed socioeconomic inequality across 12 regions of the UK. In my reading of his paper, I thought of several analyses that Carl had not done. I therefore asked him for the data and he shared it with me. For a fuller description of the data sources, refer back to his article.

Redundant variables and London

Including (nearly) perfectly correlated variables can skew an extracted factor. For this reason, I created an alternative dataset where variables that correlated above |.90| were removed. The following pairs of strongly correlated variables were found:

  1. median.weekly.earnings and log.weekly.earnings r=0.999
  2. GVA.per.capita and log.GVA.per.capita r=0.997
  3. R.D.workers.per.capita and log.weekly.earnings r=0.955
  4. log.GVA.per.capita and log.weekly.earnings r=0.925
  5. economic.inactivity and children.workless.households r=0.914

In each case, the first of the pair was removed from the dataset. However, this resulted in a dataset with 11 cases and 11 variables, which is impossible to factor analyze. For this reason, I left in the last pair.

Furthermore, because capitals are known to sometimes strongly affect results (Kirkegaard, 2015a, 2015b, 2015d), I also created two further datasets without London: one with the redundant variables, one without. Thus, there were 4 datasets:

  1. A dataset with London and redundant variables.
  2. A dataset with redundant variables but without London.
  3. A dataset with London but without redundant variables.
  4. A dataset without London and redundant variables.

Factor analysis

Each of the four datasets was factor analyzed. Figure 1 shows the loadings.


Figure 1: S factor loadings in four analyses.

Removing London strongly affected the loading of the crime variable, which changed from moderately positive to moderately negative. The poverty variable also saw a large change, from slightly negative to strongly negative. Both changes are in the direction towards a purer S factor (desirable outcomes with positive loadings, undesirable outcomes with negative loadings). Removing the redundant variables did not have much effect.

As a check, I investigated whether these results were stable across 30 different factor analytic methods.1 They were; all loadings and scores correlated near 1.00. For my analysis, I used those extracted with the combination of minimum residuals and regression scoring.
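A sketch of what this robustness check looks like, using psych's fa() on a stand-in dataset (mtcars) rather than the paper's data:

```r
# Sketch: run fa() with every combination of extraction and scoring
# method, then compare the resulting factor scores.
library(psych)

fms = c("minres", "uls", "wls", "gls", "pa", "ml")                      # extraction
sms = c("regression", "Thurstone", "tenBerge", "Anderson", "Bartlett")  # scoring

scores = list()
for (fm in fms) {
  for (sm in sms) {
    f = fa(mtcars, nfactors = 1, fm = fm, scores = sm)
    scores[[paste(fm, sm)]] = as.vector(f$scores)
  }
}

# If the methods agree, all pairwise score correlations should be
# near 1 in absolute value (factor signs are arbitrary)
min(abs(cor(as.data.frame(scores))))
```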


Due to London’s strong effect on the loadings, one should check that the two methods developed for finding such cases can identify it (Kirkegaard, 2015c). Figure 2 shows the results from these two methods (mean absolute residual and change in factor size):

Figure 2: Mixedness metrics for the complete dataset.

As can be seen, London was identified as a far outlier using both methods.

S scores and IQ

Carl’s dataset also contains IQ scores for the regions. These correlate .87 with the S factor scores from the dataset without London and redundant variables. Figure 3 shows the scatter plot.

Figure 3: Scatter plot of S and IQ scores for regions of the UK.

However, it is possible that IQ is not really related to the latent S factor, just the other variance of the extracted S scores. For this reason I used Jensen’s method (method of correlated vectors) (Jensen, 1998). Figure 4 shows the results.

Figure 4: Jensen’s method for the S factor’s relationship to IQ scores.

Jensen’s method thus supported the claim that IQ scores and the latent S factor are related.
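In outline, Jensen's method just correlates two vectors: the indicators' factor loadings and the indicators' correlations with the criterion. A minimal R sketch with hypothetical object names (not the paper's code):

```r
# Sketch of Jensen's method (method of correlated vectors).
# `indicators` = data frame of socioeconomic variables,
# `loadings`   = their S factor loadings,
# `iq`         = the criterion variable (all names hypothetical).
jensen_method = function(indicators, loadings, iq) {
  r_with_iq = sapply(indicators, function(x) cor(x, iq, use = "pair"))
  cor(loadings, r_with_iq)  # the Jensen coefficient
}
```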

Discussion and conclusion

My reanalysis revealed some interesting results regarding the effect of London on the loadings. This was made possible by data sharing, demonstrating the importance of this practice (Wicherts & Bakker, 2012).

Supplementary material

R source code and datasets are available at the OSF.


Carl, N. (2015). IQ and socioeconomic development across Regions of the UK. Journal of Biosocial Science, 1–12.

Jensen, A. R. (1998). The g factor: the science of mental ability. Westport, Conn.: Praeger.

Kirkegaard, E. O. W. (2015a). Examining the S factor in Mexican states. The Winnower. Retrieved from

Kirkegaard, E. O. W. (2015b). Examining the S factor in US states. The Winnower. Retrieved from

Kirkegaard, E. O. W. (2015c). Finding mixed cases in exploratory factor analysis. The Winnower. Retrieved from

Kirkegaard, E. O. W. (2015d). The S factor in Brazilian states. The Winnower. Retrieved from

Revelle, W. (2015). psych: Procedures for Psychological, Psychometric, and Personality Research (Version 1.5.4). Retrieved from

Wicherts, J. M., & Bakker, M. (2012). Publish (your data) or (let the data) perish! Why not publish your data too? Intelligence, 40(2), 73–76.

1There are 6 different extraction and 5 scoring methods supported by the fa() function from the psych package (Revelle, 2015). Thus, there are 6*5 = 30 combinations.

Some time ago, I stumbled upon this paper:
Searls, D. T., Mead, N. A., & Ward, B. (1985). The relationship of students’ reading skills to TV watching, leisure time reading, and homework. Journal of Reading, 158-162.

The sample is very large:

To enlarge on such information, the National Assessment of Educational Progress (NAEP) gathered data on the TV viewing habits of 9, 13, and 17 year olds across the U.S. during its 1979-80 assessment of reading skills. In this survey, 21,208 9 year olds, 30,488 13 year olds, and 25,551 17 year olds responded to questions about their backgrounds and to a wide range of items probing their reading comprehension skills. These data provide information on the amount of TV watched by different groups of students and allow comparisons of reading skills and TV watching.

The relationship turns out to be interestingly nonlinear:

[Figure: reading comprehension by TV watching and age.]

For understanding, it is better to visualize the data anew:


I will just pretend that reading comprehension is cognitive ability, usually a fair approximation.

So, if we follow the smarties: at 9 they watch a fair amount of TV (3-4 hours per day); at 13, they watch about half of that (1-2); and at age 17, they barely watch it (<1).

Developmental hypothesis: TV is interesting but only to persons at a certain cognitive ability level. Young smart children fit in the target group, but as they age and become smarter, they grow out of the target group and stop watching.

Alternative hypotheses?

R code

The code for the plot above.

library(ggplot2)
library(reshape2) #for melt()

d = data.frame(c(1.5, 2.2, 2.3),
               c(3, 3, 1.3),
               c(5.2, .2, -2.2),
               c(-1.7, -6.9, -8.1))

colnames(d) = c("<1 hour", "1-2 hours", "3-4 hours", ">4 hours")
d$age = factor(c("9", "13", "17"), levels = c("9", "13", "17"))

d = melt(d, id.vars = "age") #reshape to long format for plotting


ggplot(d, aes(age, value)) +
  geom_point(aes(color = variable)) +
  ylab("Relative reading comprehension score") +
  scale_color_discrete(name = "TV watching per day") +
  scale_shape_discrete(guide = F)


A dataset was compiled with 17 diverse socioeconomic variables for the 32 departments of Colombia and the capital district. Factor analysis revealed an S factor. Results were robust to data imputation and removal of a redundant variable. 14 of 17 variables loaded in the expected direction. Extracted S factors correlated about .50 with the cognitive ability estimate. The Jensen coefficient for this relationship was .60.



The general socioeconomic factor is the mathematical construct associated with the idea that positive outcomes tend to go along with other positive outcomes, and likewise for the negative. Mathematically, this shows up as a factor where the desirable outcomes load positively and where the undesirable outcomes load negatively. As far as I know, (Kirkegaard, 2014b) was the first to report such a factor, although Lynn (1979) was close to the same idea. The factor is called s at the individual level, and S when found in aggregated data.

By now, S factors have been found between countries (Kirkegaard, 2014b), twice between country-of-origin groups within countries (Kirkegaard, 2014a), numerous times within countries (reviewed in (Kirkegaard, 2015c)) and also at the level of first names (Kirkegaard & Tranberg, 2015). This paper analyzes data for 33 Colombian departments including the capital district.

Data sources

Most of the data were found via an English-language website that aggregates statistical information concerning countries and their divisions. A second source was a Spanish-language report (DANE, 2011). One variable had to be found on Wikipedia (“List of Colombian departments by GDP,” 2015). Finally, HDI2010 was found in a Spanish-language UN report (United Nations Development Programme & UNDP Colombia, 2011).

Variables were selected according to two criteria: 1) they must be socioeconomically important and 2) they must not be strongly dependent on local climatic conditions. For instance, fishermen per capita would be a variable that fails both criteria, since it is not generally seen as socioeconomically important and is dependent on having access to a body of water.

The included variables are:

  • SABER, verbal scores
  • SABER, math scores
  • Acute malnutrition, %
  • Chronic malnutrition, %
  • Low birth weight, %
  • Access to clean water, %
  • The presence of a sewerage system, %
  • Immunization coverage, %
  • Child mortality, rate
  • Infant mortality, rate
  • Life expectancy at birth
  • Total fertility rate
  • Births that occur in a health clinic, %
  • Unemployment, %
  • GDP per capita
  • Poverty, %
  • GINI
  • Domestic violence, rate
  • Urbanicity, %
  • Population, absolute number
  • HDI 2010

SABER is a local academic achievement test similar to PISA.

Missing data

When collecting the data, I noticed that quite a number of the variables have missing data. The matrixplot is shown in Figure 1.


Figure 1: Matrix plot for the dataset.

The red fields indicate missing data (NA). The greyscale fields indicate high (dark) and low values in each variable. We see that the same departments tend to miss data.

Redundant variables and imputation

Very highly correlated variables cause problems for factor analysis and result in ‘double weighting’ of some variables. For this reason I used the algorithm I developed to find the most highly correlated pairs of variables and remove one of them automatically (Kirkegaard, 2015a). I used a rule of thumb that variables which correlate at >.90 should be removed. There was only one such pair (infant mortality and child mortality, r = .922; removed infant mortality).
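The algorithm can be sketched as follows; the actual implementation is in the kirkegaard package, so this is only a simple stand-in:

```r
# Sketch of the redundancy-removal algorithm: repeatedly find the most
# correlated pair of variables above the cutoff and drop one member.
remove_redundant = function(df, cutoff = .90) {
  repeat {
    r = abs(cor(df, use = "pair"))
    diag(r) = 0
    if (max(r) < cutoff) return(df)
    worst = which(r == max(r), arr.ind = TRUE)[1, ]
    df = df[-worst["row"]]  # drop one variable from the worst pair
  }
}
```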

I imputed the missing data using the irmi() function from the VIM package (Templ, Alfons, Kowarik, & Prantner, 2015). This was done without noise to make the results replicable. I had no interest in trying to estimate standard errors, so multiple imputation was unnecessary (Donders, van der Heijden, Stijnen, & Moons, 2006).

To check whether results were comparable across methods, datasets were saved with every combination of imputation and removal of the redundant variable, thus creating 4 datasets.

Factor analysis

I carried out factor analysis on the 4 datasets. The factor loadings plot is shown in Figure 2.

Figure 2: Factor loadings plot.

Results were similar across methods. Per S factor theory, the desirable variables should have positive loadings and the undesirable ones negative loadings. This was not entirely the case. 3 variables that are generally considered undesirable loaded positively: unemployment rate, low birth weight and domestic violence.

Unemployment rate and crime have been found to load in the wrong direction before when analyzing state-like units. It may be due to the welfare systems being better in the higher S departments, making it possible to survive without working.

It is said that cities breed crime and since urbanicity has a very high positive S loading, the crime result may be a side-effect of that. Alternatively, the legal system may be better (e.g. less corrupt) in the higher S departments making it more likely for crimes to be reported. This is perhaps especially so for crimes against women.

The result for low birth weight is stranger, given that higher birth weight is a known correlate of higher educational levels and cognitive ability (Shenkin, Starr, & Deary, 2004). One of the other variables suggests an answer: in the lower S departments, a large fraction (30-40%) of births are home births, and it seems likely that this would result in fewer reports of low birth weights.

Generally, the results are consistent with those from other countries; 14 of 17 variables loaded in the expected direction.

Mixed cases

Mixed cases are cases that do not fit the factor structure of a dataset. I have previously developed two methods for detecting such cases (Kirkegaard, 2015b). Neither method indicated any strong mixed cases in either the unimputed, unreduced dataset or the imputed, reduced dataset. Removing the least congruent case would improve the factor size by only 1.2 percentage points, and the case with the greatest mean absolute residual had a value of only .89.
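The two checks can be sketched roughly as follows (psych's fa() assumed; this is not the code used in the paper):

```r
# Rough sketch of the two mixedness checks.
library(psych)

factor_size = function(df) {
  f = fa(df)
  mean(f$loadings^2)  # mean squared loading = share of variance explained
}

mixedness = function(df) {
  full = factor_size(df)
  # 1) change in factor size when each case is dropped
  delta = sapply(seq_len(nrow(df)), function(i) factor_size(df[-i, ]) - full)
  # 2) mean absolute residual of each case from the 1-factor model
  f = fa(df)
  resid = scale(df) - f$scores %*% t(f$loadings)
  data.frame(delta_factor_size = delta, mean_abs_resid = rowMeans(abs(resid)))
}
```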

Unlike in previous analyses, the capital district was kept, because it did not appear to be a structural outlier.

Cognitive ability, S and HDI

The two cognitive variables correlated at .84, indicating the presence of the aggregate general cognitive ability factor (G factor; Rindermann, 2007). They were averaged to form an estimate of the G factor.

The correlations between the S factors, HDI and cognitive ability are shown in Table 1.

Table 1: Correlation matrix for cognitive ability, S factor and HDI. Correlations below diagonal are weighted by the square root of population size.

Weighted and unweighted correlations were approximately the same. The imputed and trimmed S factor was nearly identical to the HDI values, despite the HDI values being from 2010 while the data the S factor is based on are from 2005. Results are fairly similar to those found in other countries.

Figure 3 shows a scatter plot of S factor (reduced, imputed dataset) and cognitive ability.


Figure 3: Scatter plot of S factor scores and cognitive ability.

Jensen’s method

Finally, as a robustness test, I used Jensen’s method (method of correlated vectors; Frisby & Beaujean, 2015; Jensen, 1998) to see if cognitive ability’s association with the S factor scores was due to the latent trait. Figure 4 shows the Jensen plot.

Figure 4: Jensen plot for S factor loadings and cognitive ability.

The correlation was .60, which is satisfactory given the relatively few variables (N=16).
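The method of correlated vectors itself is simple: correlate the indicators’ factor loadings with their correlations with the criterion. A minimal Python sketch with hypothetical numbers:

```python
import numpy as np

def jensen_method(loadings, r_with_criterion):
    """Method of correlated vectors: correlate the vector of factor
    loadings with the vector of indicator-criterion correlations."""
    return np.corrcoef(loadings, r_with_criterion)[0, 1]

# hypothetical S loadings and indicator-cognitive ability correlations
loadings = np.array([0.8, 0.7, 0.6, -0.5, 0.4, -0.3])
r_ca = np.array([0.6, 0.5, 0.5, -0.3, 0.2, -0.1])
r_mcv = jensen_method(loadings, r_ca)
```

A strongly positive value indicates that indicators loading more strongly on the S factor are also more strongly related to the criterion, consistent with the latent trait driving the association.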


  • I don’t speak Spanish, so I may have overlooked some variables that should have been included in the analysis. There may also be translation errors, as I had to rely on the translations found on the websites I used.
  • No educational attainment variables were included despite these often having very strong loadings. None were available in the data sources I consulted.
  • Data was missing for many cases and had to be imputed.

Supplementary material

Data files, R source code and high quality figures are available in the Open Science Framework repository.


A factor analysis was carried out on 6 socioeconomic variables for 506 census tracts of Boston. An S factor was found with positive loadings for the median value of owner-occupied homes and the average number of rooms in these; negative loadings for crime rate, pupil-teacher ratio, NOx pollution, and the proportion of the population of ‘lower status’. The S factor scores were negatively correlated with the estimated proportion of African Americans in the tracts, r = -.36 [CI95 -0.43; -0.28]. This estimate was biased downwards due to a data error that could not be corrected for.

The general socioeconomic factor (s/S1) is a construct similar to that of general cognitive ability (GCA; g factor, intelligence, etc.; Gottfredson, 2002; Jensen, 1998). For ability data, it has been repeatedly found that performance on any cognitive test is positively related to performance on any other test, no matter which format (pen and paper, read aloud, computerized) or type (verbal, spatial, mathematical, figural, or reaction time-based) is tried. The S factor is similar. It has been repeatedly found that desirable socioeconomic outcomes tend to be positively related to other desirable socioeconomic outcomes, and undesirable outcomes positively related to other undesirable outcomes. When this pattern is found, one can extract a general factor such that the desirable outcomes have positive loadings and the undesirable outcomes negative loadings. In a sense, this is the latent factor that underlies the frequently used term “socioeconomic status”, except that it is broader: it is not restricted to income, occupation and educational attainment, but also includes e.g. crime and health.

So far, S factors have been found for country-level (Kirkegaard, 2014b), state/regional-level (e.g. Kirkegaard, 2015), country-of-origin-level for immigrant groups (Kirkegaard, 2014a) and first-name-level data (Kirkegaard & Tranberg, In preparation). The S factors found have not always been strictly general in the sense that sometimes an indicator loads in the ‘wrong direction’, meaning that either an undesirable variable loads positively (typically crime rates), or a desirable outcome loads negatively. These findings should not be seen as outliers to be explained away, but rather as something to be explained in some coherent fashion. For instance, crime rates may load positively despite crime being undesirable because the justice system may be better in the higher S states, or because urbanicity tends to increase crime and urbanicity usually has a positive loading. To understand why some indicators sometimes load in the wrong direction, it is important to examine data at many levels. This paper extends the S factor to a new level, that of census tracts in the US.

Data source
While taking a video course on statistical learning based on James, Witten, Hastie, & Tibshirani (2013), I noted that a dataset used as an example would be useful for an S factor analysis. The dataset concerns 506 census tracts of Boston and includes the following variables (Harrison & Rubinfeld, 1978):

  • Median value of owner-occupied homes
  • Average number of rooms in owner units.
  • Proportion of owner units built before 1940.
  • Proportion of the population that is ‘lower status’: “proportion of adults without some high school education and proportion of male workers classified as laborers”.
  • Crime rate.
  • Proportion of residential land zoned for lots greater than 25k square feet.
  • Proportion of nonretail business acres.
  • Full value property tax rate.
  • Pupil-teacher ratios for schools.
  • Whether the tract bounds the Charles River.
  • Weighted distance to five employment centers in the Boston region.
  • Index of accessibility to radial highways.
  • Nitrogen oxide concentration. A measure of air pollution.
  • Proportion of African Americans.

See the original paper for a more detailed description of the variables.

This dataset has become very popular as a demonstration dataset in machine learning and statistics, which shows the benefits of data sharing (Wicherts & Bakker, 2012). As Gilley & Pace (1996) note: “Essentially, a cottage industry has sprung up around using these data to examine alternative statistical techniques.” However, when they re-checked the data, they found a number of errors. The corrected data can be downloaded here; that is the dataset used for this analysis.

The proportion of African Americans
The variable concerning African Americans has been transformed by the formula 1000(x − .63)². Because one must take a square root to undo a squaring, some information is lost. For example, if we begin with the dataset {2, -2, 2, 2, -2, -2} and square it to get {4, 4, 4, 4, 4, 4}, it is impossible to reverse the transformation and recover the original, because one cannot tell whether a given 4 results from -2 or 2 being squared.
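The information loss can be seen directly. A minimal sketch of the published formula, with x as a proportion:

```python
# The published transformation: t = 1000 * (x - .63)^2.
def transform(x):
    return 1000 * (x - 0.63) ** 2

# Two different proportions, equidistant from .63, map to the same
# transformed value, so the squaring cannot be undone unambiguously.
t_low = transform(0.31)     # .31 is .32 below .63
t_high = transform(0.95)    # .95 is .32 above .63
```

Both proportions land on the same transformed value (about 102.4), so the original value cannot be recovered from the transformed one alone.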

In case of the actual data, the distribution is shown in Figure 1.

Figure 1: Transformed data for the proportion of blacks by census tract.

Due to the transformation, the values around 400 actually mean that the proportion of blacks is around 0. The function for back-transforming the values is shown in Figure 2.

Figure 2: The transformation function.

We can now see the problem with back-transforming the data. If a transformed value lies between 0 and about 140, we cannot tell with certainty which original value it came from, because both branches of the parabola cover that range (the right branch tops out at 1000(1 − .63)² ≈ 137). For instance, a transformed value of 100 might correspond to an original proportion of .31 or .95.

To get a feel for the data, one can use the Racial Dot Map explorer and look at Boston. Figure 3 shows the Boston area color-coded by racial groups.

Figure 3: Racial dot map of Boston area.

As can be seen, the races tend to live rather separately, with large areas dominated by one group. From looking at the map, it seems that Whites and Asians mix more with each other than with the other groups, and that African Americans and Hispanics do the same. One might expect this result based on the groups’ relative differences in S factor and GCA (Fuerst, 2014). Still, this should be examined by numerical analysis, a task left for another investigation.

Still, we are left with the problem of how to back-transform the data. The conservative choice is to use only the left side of the function. This is conservative because any proportion above .63 will be back-transformed to a lower value; e.g., .80 becomes .46, a serious error. This is the method used for this analysis.
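A minimal sketch of this conservative back-transform, using the left branch x = .63 − sqrt(t/1000):

```python
import math

def back_transform_left(t):
    """Back-transform t = 1000*(x - .63)^2, assuming the left branch,
    i.e. that the original proportion was at most .63."""
    return 0.63 - math.sqrt(t / 1000)

# A proportion of .80 forward-transforms to 28.9 and comes back as .46,
# illustrating the error this choice introduces for proportions above .63.
t = 1000 * (0.80 - 0.63) ** 2
x_back = back_transform_left(t)
```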

Factor analysis
Of the variables in the dataset, there is the question of which to use for the S factor analysis. In general when doing these analyses, I have sought to include variables that measure something socioeconomically important and that are not strongly influenced by the local natural environment. For instance, the dummy variable concerning the Charles River fails on both counts. I chose the following subset:

  • Median value of owner-occupied homes
  • Average number of rooms in owner units.
  • Proportion of the population that is ‘lower status’.
  • Crime rate.
  • Pupil-teacher ratios for schools.
  • Nitrogen oxide concentration. A measure of air pollution.

These concern important but distinct things. Figure 4 shows the loadings plot for the factor analysis (reversed).2


Figure 4: Loadings plot for the S factor.

The S factor was confirmed for this data without exceptions, in that all indicator variables loaded in the expected direction. The factor was moderately strong, accounting for 47% of the variance.
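The extraction step can be sketched in Python (the original analysis used R’s factor analysis routines; here the first principal component of the correlation matrix stands in for the first factor, and the data are hypothetical):

```python
import numpy as np

def first_factor(X):
    """Loadings and variance explained of the first component of the
    correlation matrix, standing in for the first factor. The sign of
    the factor is arbitrary and can be reversed by multiplying by -1."""
    R = np.corrcoef(X, rowvar=False)
    evals, evecs = np.linalg.eigh(R)              # eigenvalues in ascending order
    loadings = evecs[:, -1] * np.sqrt(evals[-1])  # largest component's loadings
    var_explained = evals[-1] / R.shape[0]        # share of total variance
    return loadings, var_explained

# hypothetical data: 6 indicators driven by one common factor, two
# 'desirable' (positive sign) and four 'undesirable' (negative sign)
rng = np.random.default_rng(3)
common = rng.normal(size=(200, 1))
signs = np.array([1, 1, -1, -1, -1, -1])
X = signs * common + rng.normal(scale=0.7, size=(200, 6))

loadings, var_explained = first_factor(X)
if loadings[0] < 0:        # orient so the first (desirable) indicator
    loadings = -loadings   # loads positively, as in footnote 2
```

With four of six indicators ‘undesirable’, the raw factor may point the undesirable way; the final sign flip implements the reversal described in footnote 2.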

Relationship between S factor and proportions of African Americans
Figure 5 shows a scatter plot of the relationship between the back-transformed proportion of African Americans and the S factor.

Figure 5: Scatter plot of S scores and the back-transformed proportion of African Americans by census tract in Boston.

We see that there is wide variation in S factor even among tracts with no or very few African Americans. These low S scores may be due to Hispanics or may simply reflect the wide variation among Whites (there were few Asians back then). The correlation between the proportion of African Americans and S is -.36 [CI95 -0.43; -0.28].

We also see that many of the very low S points lie in the range -3 to -1.5. Some of these points may actually be census tracts with very high proportions of African Americans that were back-transformed incorrectly.

The value of r = -.36 should not be interpreted as an estimate of the effect size of ancestry on the S factor for census tracts in Boston, because the proportions of the other sociological races were not used. A multiple regression or similar method with all sociological races as predictors is necessary to answer this question. Still, the result above is in the expected direction based on known data concerning the mean GCA of African Americans and the relationship between GCA and socioeconomic outcomes (Gottfredson, 1997).

The back-transformation process likely introduced substantial error in the results.

Data are relatively old and may not reflect reality in Boston as it is now.

Supplementary material
Data, high quality figures and R source code are available at the Open Science Framework repository.



1 Capital S is used when the data are aggregated, and small s when the data are at the individual level. This follows the nomenclature of Rindermann (2007).

2 The factor is said to be reversed because the analysis gave positive loadings to undesirable outcomes and negative loadings to desirable ones. This happens because the analysis includes more indicators of undesirable outcomes, and the factor analysis chooses the direction toward which most indicators point as the positive one. It is easily reversed by multiplying the scores by -1.

It has been found that workers who hail from higher socioeconomic classes have higher earnings even within the same profession. An environmental cause was offered as an explanation. I show that this effect is expected solely for statistical reasons.

Friedman and Laurison (2015) offer data about the earnings of persons employed in the higher professions by their social class of origin. They find that those who originate from a higher social class earn more. I reproduce their figure below.


They posit an environmental explanation of this:

In doing so, we have purposively borrowed the ‘glass ceiling’ concept developed by feminist scholars to explain the hidden barriers faced by women in the workplace. In a working paper recently published by LSE Sociology, we argue that it is also possible to identify a ‘class ceiling’ in Britain which is preventing the upwardly mobile from enjoying equivalent earnings to those from upper middle-class backgrounds.

There is also a longer working paper by the same authors, but I did not read that. A link to it can be found in the previously mentioned source.

A simplified model of the situation
How do persons advance to the higher professions? We know that the occupational hierarchy is basically a (general) cognitive ability hierarchy (GCA; Gottfredson, 1997), and presumably also a hierarchy of relevant non-cognitive traits such as conscientiousness, although I am not familiar with a study of the latter.

A simple way to model the situation is as a threshold system where no one below a given threshold gets into the profession and everybody above it does. This is of course not like reality, but reality does have a threshold of sorts which increases up the hierarchy. [Insert the figure from one of Gottfredson’s papers that shows the minimum IQ by occupation, but I can’t seem to locate it. Help!] The effect of GCA is probably more like a probabilistic function akin to a cumulative distribution function, such that below a certain cognitive level virtually no one in the profession is found.

Simulating this exactly is a bit complicated, but we can approximate it reasonably by using a simple cut-off value such that everybody above gets in and everybody below does not; see Gordon (1997) for a similar case with belief in conspiracy theories.

A simulation
One could perhaps solve this analytically, but it is easier to simulate it, so we do that. I used the following procedure:

  1. We make three origin groups with mean IQs of 90, 100, and 110.
  2. We simulate a large number (1e4) of random persons from these groups.
  3. We plot these to get an overview of the data.
  4. We find the subgroup of each group with IQ > 115, which we take as the minimum for some high level profession.
  5. We calculate the mean IQ of each subgroup.

The plot looks like this:


The vertical lines are the cut-off threshold (black) and the three subgroup means (in their corresponding colors). As can be seen, the subgroup means are not the same despite the same threshold being applied. The values are, respectively: 121.74, 122.96, and 125.33. The differences between these are not large in the present simulation, but they may be sufficient to bring about differences that are detectable in a large dataset. The values depend on how far the population mean is from the threshold and on the standard deviation of the population (all 15 in the above simulation). The further the threshold is from the population mean, the closer the mean of the supra-threshold subgroup will be to the threshold value. For subgroups whose mean is far below the threshold, it will be nearly identical to it. For instance, the mean IQ of those with IQ >= 150 is about 153.94 (based on a sample of 10e7 cases, mean 100, sd 15).
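The simulation described above can be reproduced in a few lines (the original was in R; this is a Python sketch, and the exact subgroup means will vary slightly from run to run):

```python
import numpy as np

rng = np.random.default_rng(0)
cutoff = 115                                  # minimum IQ for the profession
sub_means = {}
for mu in (90, 100, 110):                     # the three origin groups
    iq = rng.normal(loc=mu, scale=15, size=100_000)
    selected = iq[iq > cutoff]                # those who get in
    sub_means[mu] = selected.mean()           # mean IQ of the subgroup
# the subgroup means all exceed the cut-off and rise with the group
# mean, roughly 121.4, 122.9 and 125.3 in expectation
```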

It should be noted that if one also considers measurement error, this effect will be stronger, since persons from the lower IQ groups regress further down; their observed IQs are more likely to have been inflated by measurement error. One can correct for this bias, but it is not often done (Jensen, 1980).
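The correction amounts to the classical regression of observed scores toward the group mean. A minimal sketch, assuming a known reliability:

```python
def estimated_true_score(observed, group_mean, reliability):
    """Expected true score given an observed score: the estimate
    regresses toward the group mean by a factor of the reliability."""
    return group_mean + reliability * (observed - group_mean)

# two persons both observed at IQ 115, from groups with means 90 and 110:
# the one from the lower-mean group regresses further down
low = estimated_true_score(115, 90, 0.9)     # ≈ 112.5
high = estimated_true_score(115, 110, 0.9)   # ≈ 114.5
```

So two persons with the same observed score just above a threshold have different expected true scores depending on their group of origin, strengthening the selection effect above.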

Supplementary material
R source code is available at the Open Science Framework repository.


  • Friedman, S., & Laurison, D. (2015). Introducing the ‘class’ ceiling. British Politics and Policy blog.
  • Gordon, R. A. (1997). Everyday life as an intelligence test: Effects of intelligence and intelligence context. Intelligence, 24(1), 203-320.
  • Gottfredson, L. S. (1997). Why g matters: The complexity of everyday life. Intelligence, 24(1), 79-132.
  • Jensen, A. R. (1980). Bias in Mental Testing. New York: Free Press.