Due to a lengthy discussion over at Unz concerning the good performance of some African groups in the UK, it seems worthwhile to review the Danish and Norwegian results. Basically, some African groups perform better on some measures than the native British do. The author argues that this disproves global hereditarianism. I think not.

The over-performance relative to home country IQ of some African countries is not restricted to the UK. In my studies of immigrants in Denmark and Norway, I found the same thing. It is very clear that there are strong selection effects for some countries but not others, and that this is a large part of the reason why the correlations between home country IQ and performance in the host country are not higher. If the selection effect were constant across countries, it would not affect the correlations. But because it differs between countries, it essentially creates noise in the correlations.
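The attenuation argument can be illustrated with a toy simulation (sketched here in Python with made-up numbers, not the actual immigrant data): if each group's host-country performance is its home-country IQ plus a selection offset, a constant offset leaves the correlation intact, while offsets that vary by country attenuate it.

```python
import random

def pearson(xs, ys):
    # plain Pearson correlation
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(1)
home_iq = [random.gauss(90, 8) for _ in range(60)]  # hypothetical home-country IQs

# constant selection: every origin country over-performs by the same 5 points
constant = [iq + 5 for iq in home_iq]

# variable selection: the offset differs by origin country (0 to 15 points)
variable = [iq + random.uniform(0, 15) for iq in home_iq]

r_constant = pearson(home_iq, constant)  # a constant shift changes nothing
r_variable = pearson(home_iq, variable)  # attenuated by the between-country 'noise'
print(round(r_constant, 3), round(r_variable, 3))
```

With the constant offset the correlation stays at 1.00; with the country-varying offsets it drops noticeably.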

Two plots:


The codes are ISO-3 codes, so e.g. NGA is Nigeria, GHA is Ghana, KEN is Kenya and so on. These groups perform fairly well compared to their home country IQ, both in Norway and Denmark. But Somalia does not, and the performance of several MENAP immigrant groups is abysmal.

The scores on the Y axis are S factor scores for their performance in these countries. They are general factors extracted from measures of income, educational attainment, use of social benefits, crime and the like. The S scores correlate .77 between the countries. For details, see the papers concerning the data:

I did not use the scores from the papers; I redid the analysis. The code is posted below for those curious. The kirkegaard package is my personal package; it is on GitHub. The megadataset file is on OSF.


library(pacman) #p_load() comes from the pacman package
p_load(kirkegaard, ggplot2)

M = read_mega("Megadataset_v2.0e.csv")

DK = M[111:135] #fetch danish data
DK = DK[miss_case(DK) <= 4, ] #keep cases with 4 or fewer missing
DK = irmi(DK, noise = F) #impute the missing
DK.S = fa(DK) #factor analyze
DK_S_scores = data.frame(DK.S = as.vector(DK.S$scores) * -1) #save scores, reversed
rownames(DK_S_scores) = rownames(DK) #add rownames

M = merge_datasets(M, DK_S_scores, 1) #merge to mega

ggplot(M, aes(LV2012estimatedIQ, DK.S)) + 
  geom_point() +
  geom_text(aes(label = rownames(M)), vjust = 1, alpha = .7) +
  geom_smooth(method = "lm", se = F)

# Norway ------------------------------------------------------------------

#NOTE: the three cbind() calls below were truncated when this post was
#published; only the first column of each is shown here
NO_work = cbind(M["Norway.OutOfWork.2010Q2.men"]) #work data: 5 yearly columns for men, then 5 for women

NO_income = cbind(M["Norway.Income.index.2009"]) #income data: 4 yearly columns (2009-2012)

#make DF
NO = cbind(M["NorwayViolentCrimeAdjustedOddsRatioSkardhamar2014"]) #crime data, plus the derived columns below
#get 5 year means
NO["OutOfWork.2010to2014.men"] = apply(NO_work[1:5],1,mean,na.rm=T) #get means, ignore missing
NO["OutOfWork.2010to2014.women"] = apply(NO_work[6:10],1,mean,na.rm=T) #get means, ignore missing

#get means for income and add to DF
NO["Income.index.2009to2012"] = apply(NO_income,1,mean,na.rm=T) #get means, ignore missing

plot_miss(NO) #visualize missing data

NO = NO[miss_case(NO) <= 3, ] #keep those with 3 datapoints or fewer missing
NO = irmi(NO, noise = F) #impute the missing

NO_S = fa(NO) #factor analyze
NO_S_scores = data.frame(NO_S = as.vector(NO_S$scores) * -1) #save scores, reverse
rownames(NO_S_scores) = rownames(NO) #add rownames

M = merge_datasets(M, NO_S_scores, 1) #merge with mega

ggplot(M, aes(LV2012estimatedIQ, NO_S)) +
  geom_point() +
  geom_text(aes(label = rownames(M)), vjust = 1, alpha = .7) +
  geom_smooth(method = "lm", se = F)


cor(M$NO_S, M$DK.S, use = "pair")



A reanalysis of (Carl, 2015) revealed that the inclusion of London had a strong effect on the S loadings of the crime and poverty variables. S factor scores from a dataset without London and redundant variables were strongly related to IQ scores, r = .87. The Jensen coefficient for this relationship was .86.



Carl (2015) analyzed socioeconomic inequality across 12 regions of the UK. In my reading of his paper, I thought of several analyses that Carl had not done. I therefore asked him for the data and he shared it with me. For a fuller description of the data sources, refer back to his article.

Redundant variables and London

Including (nearly) perfectly correlated variables can skew an extracted factor. For this reason, I created an alternative dataset where variables that correlated above |.90| were removed. The following pairs of strongly correlated variables were found:

  1. median.weekly.earnings and log.weekly.earnings r=0.999
  2. GVA.per.capita and log.GVA.per.capita r=0.997
  3. R.D.workers.per.capita and log.weekly.earnings r=0.955
  4. log.GVA.per.capita and log.weekly.earnings r=0.925
  5. economic.inactivity and children.workless.households r=0.914

In each case, the first of the pair was removed from the dataset. However, this resulted in a dataset with 11 cases and 11 variables, which is impossible to factor analyze. For this reason, I left in the last pair.
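The pruning rule can be sketched as follows (in Python; an illustrative sketch, not the R code actually used, and only three of the five pairs above are included to keep the example short — recall that in the analysis itself the last pair was left in).

```python
def drop_redundant(names, corr, threshold=0.90):
    """Drop the first variable of every pair whose |r| exceeds the threshold.

    `corr` maps frozenset({a, b}) -> correlation; unlisted pairs count as 0.
    Returns the surviving variable names in their original order."""
    dropped = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if abs(corr.get(frozenset((a, b)), 0.0)) > threshold:
                dropped.add(a)  # remove the first member of the pair
                break
    return [n for n in names if n not in dropped]

# correlations taken from the pairs listed above
corr = {
    frozenset(("median.weekly.earnings", "log.weekly.earnings")): 0.999,
    frozenset(("GVA.per.capita", "log.GVA.per.capita")): 0.997,
    frozenset(("economic.inactivity", "children.workless.households")): 0.914,
}
names = ["median.weekly.earnings", "log.weekly.earnings",
         "GVA.per.capita", "log.GVA.per.capita",
         "economic.inactivity", "children.workless.households", "IQ"]
print(drop_redundant(names, corr))
```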

Furthermore, because capitals are known to sometimes strongly affect results (Kirkegaard, 2015a, 2015b, 2015d), I also created two further datasets without London: one with the redundant variables, one without. Thus, there were 4 datasets:

  1. A dataset with London and redundant variables.
  2. A dataset with redundant variables but without London.
  3. A dataset with London but without redundant variables.
  4. A dataset without London and redundant variables.

Factor analysis

Each of the four datasets was factor analyzed. Figure 1 shows the loadings.


Figure 1: S factor loadings in four analyses.

Removing London strongly affected the loading of the crime variable, which changed from moderately positive to moderately negative. The poverty variable also saw a large change, from slightly negative to strongly negative. Both changes are in the direction towards a purer S factor (desirable outcomes with positive loadings, undesirable outcomes with negative loadings). Removing the redundant variables did not have much effect.

As a check, I investigated whether these results were stable across 30 different factor analytic methods.1 They were: all loadings and scores correlated near 1.00. For my analysis, I used those extracted with the combination of minimum residuals and regression.
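The stability check itself just correlates the loading vectors produced by the different method combinations; a minimal Python sketch with hypothetical loading vectors (not the actual ones):

```python
from itertools import combinations

def pearson(xs, ys):
    # plain Pearson correlation
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)

# hypothetical loading vectors from three extraction methods
loadings = {
    "minres": [0.91, 0.85, -0.78, 0.66, -0.59],
    "ml":     [0.90, 0.86, -0.77, 0.65, -0.60],
    "pa":     [0.92, 0.84, -0.79, 0.67, -0.58],
}

# the worst pairwise agreement; near 1.00 indicates stability across methods
worst = min(pearson(loadings[a], loadings[b])
            for a, b in combinations(loadings, 2))
print(round(worst, 3))
```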


Due to London’s strong effect on the loadings, one should check that the two methods developed for finding such cases can identify it (Kirkegaard, 2015c). Figure 2 shows the results from these two methods (mean absolute residual and change in factor size):

Figure 2: Mixedness metrics for the complete dataset.

As can be seen, London was identified as a far outlier using both methods.

S scores and IQ

Carl’s dataset also contains IQ scores for the regions. These correlate .87 with the S factor scores from the dataset without London and redundant variables. Figure 3 shows the scatter plot.

Figure 3: Scatter plot of S and IQ scores for regions of the UK.

However, it is possible that IQ is not really related to the latent S factor, but only to the non-S variance in the extracted scores. For this reason I used Jensen’s method (method of correlated vectors) (Jensen, 1998). Figure 4 shows the results.

Figure 4: Jensen’s method for the S factor’s relationship to IQ scores.

Jensen’s method thus supported the claim that IQ scores and the latent S factor are related.
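Jensen's method here amounts to correlating the vector of S loadings with the vector of each indicator's correlation with IQ. A minimal Python sketch with hypothetical vectors (not the actual ones):

```python
def pearson(xs, ys):
    # plain Pearson correlation
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)

def jensen_coefficient(loadings, criterion_correlations):
    """Method of correlated vectors: correlate each indicator's factor
    loading with that indicator's correlation with the criterion (IQ)."""
    return pearson(loadings, criterion_correlations)

# hypothetical vectors for five indicators
s_loadings = [0.90, 0.75, -0.80, 0.60, -0.70]
r_with_iq  = [0.80, 0.65, -0.72, 0.50, -0.66]

jc = jensen_coefficient(s_loadings, r_with_iq)
print(round(jc, 2))
```

A high coefficient indicates that the indicators most central to S are also the ones most strongly related to IQ.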

Discussion and conclusion

My reanalysis revealed some interesting results regarding the effect of London on the loadings. This was made possible by data sharing, demonstrating the importance of this practice (Wicherts & Bakker, 2012).

Supplementary material

R source code and datasets are available at the OSF.


Carl, N. (2015). IQ and socioeconomic development across Regions of the UK. Journal of Biosocial Science, 1–12. doi.org/10.1017/S002193201500019X

Jensen, A. R. (1998). The g factor: the science of mental ability. Westport, Conn.: Praeger.

Kirkegaard, E. O. W. (2015a). Examining the S factor in Mexican states. The Winnower. Retrieved from thewinnower.com/papers/examining-the-s-factor-in-mexican-states

Kirkegaard, E. O. W. (2015b). Examining the S factor in US states. The Winnower. Retrieved from thewinnower.com/papers/examining-the-s-factor-in-us-states

Kirkegaard, E. O. W. (2015c). Finding mixed cases in exploratory factor analysis. The Winnower. Retrieved from thewinnower.com/papers/finding-mixed-cases-in-exploratory-factor-analysis

Kirkegaard, E. O. W. (2015d). The S factor in Brazilian states. The Winnower. Retrieved from thewinnower.com/papers/the-s-factor-in-brazilian-states

Revelle, W. (2015). psych: Procedures for Psychological, Psychometric, and Personality Research (Version 1.5.4). Retrieved from cran.r-project.org/web/packages/psych/index.html

Wicherts, J. M., & Bakker, M. (2012). Publish (your data) or (let the data) perish! Why not publish your data too? Intelligence, 40(2), 73–76. doi.org/10.1016/j.intell.2012.01.004

1There are 6 different extraction methods and 5 scoring methods supported by the fa() function from the psych package (Revelle, 2015). Thus, there are 6*5 = 30 combinations.


A dataset of 127 variables concerning socioeconomic outcomes for US states was analyzed. Of these, 81 were used in a factor analysis. The analysis revealed a general socioeconomic factor. This factor correlated .961 with one from a previous analysis of socioeconomic data for US states.



It has repeatedly been found that desirable outcomes tend to be associated with other desirable outcomes and likewise for undesirable outcomes. When this is the case, one can extract a general factor — the general socioeconomic factor (S factor) — such that the desirable outcomes load positively and the undesirable outcomes negatively. This pattern has been found at the country level (1), within country divisions of many countries (2–10), at the city district level (11), at the level of first names (12) and at the level of country of origin groups in two countries (13,14).

A previous study has found that the pattern holds for US states too (7). However, a new and larger dataset has been found, so it is worth examining whether the pattern holds in it, and if so, how strongly correlated the extracted factor scores are between the datasets. This would function as a kind of test-retest reliability.

Data sources

The previous study (7) of the S factor among US states used a dataset of 25 variables compiled from various official statistics found at The 2012 Statistical Abstract website. The current study relies upon a dataset compiled by Measure of America, a website that visualizes social inequality. The datasets their maps rely upon can be downloaded from the site.

As in earlier studies, I excluded the capital district. I also excluded the data for the US as a whole, since it is not a state like the other cases.

The dataset contains a total of 127 variables. However, not all of these are useful for examining the S factor:

  • 4 variables are the composite indexes calculated by Measure of America. These are fairly similar to the Human Development Index scores, except that they are scaled differently.
  • 6 variables concern the population sizes in percent of 6 sociological race categories: Non-Hispanic White, Latino, African American, Asian, Amerindian (Native American) and other.
  • 1 variable contains the total population size for each state.
  • A number of variables were not given in a form adjusted for population size e.g. per capita, percent or rate per 100k persons. These variables were excluded: Rape (total number), Homeless Population (total number), Medicare Recipients (thousands), Medicaid Recipients (thousands), Army Recruits (total), Total Military Casualties in Operations Enduring Freedom and Iraqi Freedom to April 2010, Prisoners State or Federal Jurisdiction (total number), Women in Congressional Delegation (total), Men in Congressional Delegation (total), Carcinogen Releases (pounds), Lead Releases (pounds), Dioxin Releases (grams), Superfund Sites (total), Protected Forest (acres), and Protected Farm and Ranch Land (acres).
  • 1 variable was excluded due to being heavily reliant on local natural environment (presence of water and forests): Farming fishing and forestry occupations (%).
  • 1 variable was excluded because most of its data was missing: State Earned Income Tax Credit (% of federal Earned Income Tax Credit).

The variables that were not given in per population format almost always had a sibling variable that was given in a suitable format and which was included in the analysis. After these exclusions, 101 variables remained for analysis.

Missing data

An analysis of missing data showed that some variables still had missing data. Because the dataset had more variables than cases, it was not possible to impute the missing data using multiple regression as commonly done in these analyses. For this reason, these variables were excluded. After this, 93 variables remained for analysis.

Duplicated, reverse-coded and highly redundant variables

An analysis of correlations among variables showed that 2 of them had duplicates (r = 1): Diabetes (% age 18 and older) and Low-Birth-Weight Infants (% of all infants). I’m not sure why this is the case.

Furthermore, 4 variables had a reverse-coded sibling (r = -1):

  1. Less Than High School (%) + At Least High School Diploma (%)
  2. 4th Graders Reading Below Proficiency (%) + 4th Grade National Assessment of Educational Progress in Reading (% at or above proficient)
  3. Urban Population (%) + Rural Population (%)
  4. Public High School Graduation Rate (%) + High School Freshmen Not Graduating After 4 Years (%).

Finally, some variables were so strongly related to other variables that keeping both would perhaps result in factor analytic errors or heavily influence the resulting factor. I decided to use a threshold of |.9| as the limit. If any pair of variables correlated at this level or above, one of them was excluded. There were 6 pairs of variables like this, and the first of each pair was excluded:

  1. Poverty Rate (% below federal poverty threshold) + Child Poverty (% living in families below the poverty line), r = .985.
  2. Poverty Rate (% below federal poverty threshold) + Children Under 6 Living in Poverty (%), r = .968.
  3. Management professional and related occupations (%) + At Least Bachelor’s Degree (%), r = .925.
  4. Preschool Enrollment (% enrolled ages 3 and 4) + 3- and 4-year-olds Not Enrolled in Preschool (%), r = -.925.
  5. Army Recruits (per 1000 youth) + Army Recruits (per 1000 youth), r = .914.
  6. Graduate Degree (%) + At Least Bachelor’s Degree (%), r= .910.

The army recruit variable seems to be a duplicate, but the numbers are not identical for all cases. The two preschool enrollment variables seem to be meant to be a reverse-coding of each other, but they don’t correlate perfectly negatively.

After exclusion of these variables, there were 81 remaining.

Factor analysis

Next I extracted a general factor from the data. Since one previous study had found instability across extraction methods when extracting factors from datasets with more variables than cases (2), I examined the stability across all possible extraction and scoring methods, 30 in total (6 extraction methods, 5 scoring methods). 11 of these 30 methods did not result in an error, though they gave warnings. There was no loading instability or scoring instability across methods: all correlations >.996.1 I saved the results from the minres+regression combination.

Inspection of the loadings revealed no important variables with the ‘wrong loading’ i.e., either a desirable outcome but with a negative loading or an undesirable outcome with a positive loading. Some variables are debatable. E.g. binge drinking in adults has a loading of .566, but this could be seen as a good thing (sufficient free time and money to spend it drinking large quantities of alcohol), or a bad thing (binge drinking is bad for one’s health). Figure 1 shows the loadings plot.

Figure 1: Loadings on the S factor. Some variable names were too long and were cut at the 40th character. Consult the main data file to see the full name.

Factor scores

The extracted factor scores were compared with previously obtained similar measures:

  • HDI2010 scores calculated from HDI2002 scores found in (16).
  • Measure of America’s own American Human Development Index found in the dataset.
  • The S factor scores from the previous study of US states (7).

The correlation matrix is shown in Table 1.

Table 1: Correlation matrix of S and HDI scores. Weighted correlations below the diagonal (sqrt of population).

The correlation between the previously obtained S factor and the new one was very strong at .961. The two different HDI measures had the lowest correlation. This is the expected result if they are the worst approximations of the S factor. Note however that the HDI2010 is rescaled from 2002 data, whereas the AHDI and the current S factor are based on 2010 data. The previous S factor is based on data averaged over approximately the last 10 years.
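The weighted correlations reported below the diagonal of Table 1 can be computed with a case-weighted version of Pearson's r; a generic Python sketch (in the actual table, the weights are the square roots of state populations):

```python
def weighted_pearson(xs, ys, ws):
    """Pearson correlation with case weights."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    cov = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    vx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    vy = sum(w * (y - my) ** 2 for w, y in zip(ws, ys))
    return cov / (vx * vy) ** 0.5

# with equal weights this reduces to the ordinary correlation
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]
print(round(weighted_pearson(xs, ys, [1, 1, 1, 1]), 3))
```

Rescaling all weights by a constant leaves the result unchanged, so only the relative weights matter.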


Finally, factorial mixedness was examined using two methods detailed in a previous paper (17). In short, mixedness is when cases are incongruent with the overall factor structure found for the data. The methods showed convergent results (r = .65). Figure 2 shows the results.

Figure 2: Factorial mixedness in cases.

If one was doing a more detailed study, one could examine the residuals at the case level and see if one can find the reasons for why an outlier state is an outlier. In the case of Alaska, the residuals for each variable are shown in Table 2.

Table 2: Residuals per variable for Alaska.

The meaning of the numbers is this: it is the number of standard deviations that Alaska is above or below the level expected from its score on the S factor (-.24), i.e. how much it deviates from the expected level. We see that Alaska spends much more on transportation per person than expected (more than 6 standard deviations). This is presumably because it is located very far north compared to the other states and has the lowest population density. It also spends more energy per citizen, again presumably related to the climate. I’m not sure why rape is so common, however.
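One simple way to compute such case-level residuals, assuming standardized variables and a single-factor model, is the observed z-score minus the loading times the factor score. A Python sketch with made-up loadings and observed values, using Alaska's S score of -.24 from the text:

```python
def case_residuals(z_scores, loadings, factor_score):
    """Residual z-score per variable for one case: the observed standardized
    value minus the value predicted from the factor model."""
    return [z - loading * factor_score
            for z, loading in zip(z_scores, loadings)]

# hypothetical: three standardized variables for a case with S = -0.24
loadings = [0.8, -0.6, 0.7]
observed = [6.0, 0.1, -0.2]  # first variable is far above its prediction
print([round(r, 2) for r in case_residuals(observed, loadings, -0.24)])
```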

One could examine the other outlier states in a similar fashion, but this is left as an exercise to the reader.

Discussion and conclusion

The present analysis used a much larger dataset (81 very diverse variables) than the previous study of the S factor in US states (25 variables), yet the findings were almost identical (r = .961). This should probably be interpreted as showing that the S factor can be measured very reliably when a sufficient number and diversity of socioeconomic variables are used. It should be noted, however, that many of the variables in the two datasets overlapped in content, e.g. expected life span at birth.

Supplementary material

Data files and source code are available on OSF.


1. Kirkegaard EOW. The international general socioeconomic factor: Factor analyzing international rankings. Open Differ Psychol [Internet]. 2014 Sep 8 [cited 2014 Oct 13]; Available from: openpsych.net/ODP/2014/09/the-international-general-socioeconomic-factor-factor-analyzing-international-rankings/

2. Kirkegaard EOW. Examining the S factor in Mexican states. The Winnower [Internet]. 2015 Apr 19 [cited 2015 Apr 23]; Available from: thewinnower.com/papers/examining-the-s-factor-in-mexican-states

3. Kirkegaard EOW. S and G in Italian regions: Re-analysis of Lynn’s data and new data. The Winnower [Internet]. 2015 Apr 23 [cited 2015 Apr 23]; Available from: thewinnower.com/papers/s-and-g-in-italian-regions-re-analysis-of-lynn-s-data-and-new-data

4. Kirkegaard EOW. The S factor in the British Isles: A reanalysis of Lynn (1979). The Winnower [Internet]. 2015 Mar 28 [cited 2015 Apr 23]; Available from: thewinnower.com/papers/the-s-factor-in-the-british-isles-a-reanalysis-of-lynn-1979

5. Kirkegaard EOW. Indian states: G and S factors. The Winnower [Internet]. 2015 Apr 23 [cited 2015 Apr 23]; Available from: thewinnower.com/papers/indian-states-g-and-s-factors

6. Kirkegaard EOW. The S factor in China. The Winnower [Internet]. 2015 Apr 23 [cited 2015 Apr 23]; Available from: thewinnower.com/papers/the-s-factor-in-china

7. Kirkegaard EOW. Examining the S factor in US states. The Winnower [Internet]. 2015 Apr 23 [cited 2015 Apr 23]; Available from: thewinnower.com/papers/examining-the-s-factor-in-us-states

8. Kirkegaard EOW. The S factor in Brazilian states. The Winnower [Internet]. 2015 Apr 30 [cited 2015 May 1]; Available from: thewinnower.com/papers/the-s-factor-in-brazilian-states

9. Kirkegaard EOW. The general socioeconomic factor among Colombian departments. The Winnower [Internet]. 2015 Jun 16 [cited 2015 Jun 16]; Available from: thewinnower.com/papers/1390-the-general-socioeconomic-factor-among-colombian-departments


11. Kirkegaard EOW. An S factor among census tracts of Boston. The Winnower [Internet]. 2015 Jun 2 [cited 2015 Jun 2]; Available from: thewinnower.com/papers/an-s-factor-among-census-tracts-of-boston

12. Kirkegaard EOW, Tranberg B. What is a good name? The S factor in Denmark at the name-level. The Winnower [Internet]. 2015 Jun 4 [cited 2015 Jun 6]; Available from: thewinnower.com/papers/what-is-a-good-name-the-s-factor-in-denmark-at-the-name-level

13. Kirkegaard EOW. Crime, income, educational attainment and employment among immigrant groups in Norway and Finland. Open Differ Psychol [Internet]. 2014 Oct 9 [cited 2014 Oct 13]; Available from: openpsych.net/ODP/2014/10/crime-income-educational-attainment-and-employment-among-immigrant-groups-in-norway-and-finland/

14. Kirkegaard EOW, Fuerst J. Educational attainment, income, use of social benefits, crime rate and the general socioeconomic factor among 71 immigrant groups in Denmark. Open Differ Psychol [Internet]. 2014 May 12 [cited 2014 Oct 13]; Available from: openpsych.net/ODP/2014/05/educational-attainment-income-use-of-social-benefits-crime-rate-and-the-general-socioeconomic-factor-among-71-immmigrant-groups-in-denmark/

15. Revelle W. psych: Procedures for Psychological, Psychometric, and Personality Research [Internet]. 2015 [cited 2015 Apr 29]. Available from: cran.r-project.org/web/packages/psych/index.html

16. Stanton EA. Inequality and the Human Development Index [Internet]. ProQuest; 2007 [cited 2015 Jun 25]. Available from: www.google.com/books?hl=en&lr=&id=87oZlFPLCykC&oi=fnd&pg=PR5&dq=INEQUALITY+AND+THE+HUMAN+DEVELOPMENT+INDEX+&ots=l1FCqCH_fZ&sig=7WbAexkbLK8zwxudyDqF72SfDTw

17. Kirkegaard EOW. Finding mixed cases in exploratory factor analysis. The Winnower [Internet]. 2015 Apr 28 [cited 2015 May 1]; Available from: thewinnower.com/papers/finding-mixed-cases-in-exploratory-factor-analysis


1 The factor analysis was done with the fa() function from the psych package (15). The cross-method check was done with a home-made function, see the supplementary material.

Some time ago, I stumbled upon this paper:
Searls, D. T., Mead, N. A., & Ward, B. (1985). The relationship of students’ reading skills to TV watching, leisure time reading, and homework. Journal of Reading, 158-162.

The sample is very large:

To enlarge on such information, the National Assessment of Educational Progress (NAEP) gathered data on the TV viewing habits of 9, 13, and 17 year olds across the U.S. during its 1979-80 assessment of reading skills. In this survey, 21,208 9 year olds, 30,488 13 year olds, and 25,551 17 year olds responded to questions about their backgrounds and to a wide range of items probing their reading comprehension skills. These data provide information on the amount of TV watched by different groups of students and allow comparisons of reading skills and TV watching.

The relationship turns out to be interestingly nonlinear:

[Table: reading comprehension scores by TV watching and age]

For understanding, it is better to visualize the data anew:


I will just pretend that reading comprehension is cognitive ability, usually a fair approximation.

So, if we follow the smarties: at 9 they watch a fair amount of TV (3-4 hours per day); at 13, they watch about half of that (1-2); and at age 17, they barely watch it (<1).

Developmental hypothesis: TV is interesting but only to persons at a certain cognitive ability level. Young smart children fit in the target group, but as they age and become smarter, they grow out of the target group and stop watching.

Alternative hypotheses?

R code

The code for the plot above.

library(ggplot2)
library(reshape2) #for melt()

d = data.frame(c(1.5, 2.2, 2.3),
               c(3, 3, 1.3),
               c(5.2, .2, -2.2),
               c(-1.7, -6.9, -8.1))

colnames(d) = c("<1 hour", "1-2 hours", "3-4 hours", ">4 hours")
d$age = factor(c("9", "13", "17"), levels = c("9", "13", "17"))

d = melt(d, id.vars = "age")


ggplot(d, aes(age, value)) +
  geom_point(aes(color = variable)) +
  ylab("Relative reading comprehension score") +
  scale_color_discrete(name = "TV watching per day") +
  scale_shape_discrete(guide = F)

Since James Thompson is posting statistics, here are some for comparison.

Note that these statistics cover all sites hosted on this server, including both the Danish and English blogs, lyddansk.dk, legaliser.nu, Understanding Statistics, and a host of other subsites that can be found via the old front page. Note also that the large traffic numbers are due to the PDFs hosted on the site. Many visitors never really visit the site itself, they just download the PDFs (fine by me), but it inflates the statistics.

Click the image, then click download. The images are huge and cannot be shown on one screen.

2015, so far


Upon reading about the obscene costs of journals, e.g. in this post, I decided to write to my local university library and ask. They responded with this:

(Translated from Danish:)

Hi Emil

You sent us a question:

I am interested in finding out how much money AU spends every year on buying access to academic journals. I have looked through some budgets but did not find anything.

Do you know where that information can be found?

The question has been forwarded to me, as it is probably not something that is published as a separate figure. I can tell you that AU and SB together, as AU Library, spend about 45 million kroner per year on electronic resources, of which e-journals make up by far the largest part; on top of that come databases, encyclopedias, and ordinary e-books.

Best regards

Lilian Madsen

Area Director, Process Area

If we assume that "the large majority" means 90%, then the number is 40.5 million DKK a year. A DKK is about .15 USD, so this is about 6.075 million USD. Put this together with the number of students, currently 43,600, and one can calculate the cost per student per year: about 930 DKK, or about 139.3 USD. If you think about it, this is a crazy price to pay: the marginal cost per extra student is near zero.
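The arithmetic written out (the 90% share is the assumption stated above):

```python
journal_spend_dkk = 45_000_000 * 0.90  # assume 'the large majority' = 90%
dkk_per_usd = 0.15
students = 43_600

per_student_dkk = journal_spend_dkk / students
per_student_usd = per_student_dkk * dkk_per_usd
print(round(per_student_dkk), round(per_student_usd, 1))
```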

Aarhus is currently the largest university in Denmark, but it is about the same size as Copenhagen, which has 40,866 students.


A dataset was compiled with 17 diverse socioeconomic variables for 32 departments of Colombia and the capital district. Factor analysis revealed an S factor. Results were robust to data imputation and removal of a redundant variable. 14 of 17 variables loaded in the expected direction. Extracted S factors correlated about .50 with the cognitive ability estimate. The Jensen coefficient for this relationship was .60.



The general socioeconomic factor is the mathematical construct associated with the idea that positive outcomes tend to go along with other positive outcomes, and likewise for the negative. Mathematically, this shows up as a factor where the desirable outcomes load positively and where the undesirable outcomes load negatively. As far as I know, (Kirkegaard, 2014b) was the first to report such a factor, although Lynn (1979) was close to the same idea. The factor is called s at the individual level, and S when found in aggregated data.

By now, S factors have been found between countries (Kirkegaard, 2014b), twice between country-of-origin groups within countries (Kirkegaard, 2014a), numerous times within countries (reviewed in (Kirkegaard, 2015c)) and also at the level of first names (Kirkegaard & Tranberg, 2015). This paper analyzes data for 33 Colombian departments including the capital district.

Data sources

Most of the data were found via the English-language website Knoema.com which is an aggregator of statistical information concerning countries and their divisions. A second source was a Spanish-language report (DANE, 2011). One variable had to be found on Wikipedia (“List of Colombian departments by GDP,” 2015). Finally, HDI2010 was found in a Spanish-language UN report (United Nations Development Programme & UNDP Colombia, 2011).

Variables were selected according to two criteria: 1) they must be socioeconomically important and 2) they must not be strongly dependent on local climatic conditions. For instance, fishermen per capita would be a variable that fails both criteria, since it is not generally seen as socioeconomically important and is dependent on having access to a body of water.

The included variables are:

  • SABER, verbal scores
  • SABER, math scores
  • Acute malnutrition, %
  • Chronic malnutrition, %
  • Low birth weight, %
  • Access to clean water, %
  • The presence of a sewerage system, %
  • Immunization coverage, %
  • Child mortality, rate
  • Infant mortality, rate
  • Life expectancy at birth
  • Total fertility rate
  • Births that occur in a health clinic, %
  • Unemployment, %
  • GDP per capita
  • Poverty, %
  • GINI
  • Domestic violence, rate
  • Urbanicity, %
  • Population, absolute number
  • HDI 2010

SABER is a local academic achievement test similar to PISA.

Missing data

When collecting the data, I noticed that quite a number of the variables have missing data. The matrixplot is shown in Figure 1.


Figure 1: Matrix plot for the dataset.

The red fields indicate missing data (NA). The greyscale fields indicate high (dark) and low values within each variable. We see that the same departments tend to be missing data.

Redundant variables and imputation

Very highly correlated variables cause problems for factor analysis and result in ‘double weighing’ of some variables. For this reason I used the algorithm I developed to find the most highly correlated pairs of variables and remove one of them automatically (Kirkegaard, 2015a). I used a rule of thumb that variables which correlate at >.90 should be removed. There was only one such pair (infant mortality and child mortality, r = .922; removed infant mortality).

I imputed the missing data using the irmi() function from the VIM package (Templ, Alfons, Kowarik, & Prantner, 2015). This was done without noise to make the results replicable. I had no interest in trying to estimate standard errors, so multiple imputation was unnecessary (Donders, van der Heijden, Stijnen, & Moons, 2006).
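irmi() performs iterative model-based imputation; as a far simpler stand-in illustrating deterministic (noise-free) imputation, one can fill each missing value with its column mean (a Python sketch, not what irmi actually does):

```python
def impute_means(columns):
    """Minimal deterministic imputation: replace missing values (None)
    with the column mean, so repeated runs give identical results."""
    out = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        mean = sum(present) / len(present)
        out[name] = [mean if v is None else v for v in values]
    return out

# hypothetical two-variable example with one missing value in each column
cols = {"gdp": [1.0, None, 3.0], "life_exp": [70.0, 72.0, None]}
print(impute_means(cols))
```

Because no noise is added, the procedure is replicable, which is the same motivation given above for running irmi() without noise.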

To check whether results were comparable across methods, datasets were saved with every combination of imputation and removal of the redundant variable, thus creating 4 datasets.

Factor analysis

I carried out factor analysis on the 4 datasets. The factor loadings plot is shown in Figure 2.

Figure 2: Factor loadings plot.

Results were similar across methods. Per S factor theory, the desirable variables should have positive loadings and the undesirable negative loadings. This was not entirely the case. 3 variables that are generally considered undesirable loaded positively: unemployment rate, low birth weight and domestic violence.

Unemployment rate and crime have been found to load in the wrong direction before when analyzing state-like units. It may be because the welfare systems are better in the higher S departments, making it possible to survive without working.

It is said that cities breed crime and since urbanicity has a very high positive S loading, the crime result may be a side-effect of that. Alternatively, the legal system may be better (e.g. less corrupt) in the higher S departments making it more likely for crimes to be reported. This is perhaps especially so for crimes against women.

The result for low birth weight is stranger given that higher birth weight is a known correlate of higher educational levels and cognitive ability (Shenkin, Starr, & Deary, 2004). One of the other variables suggests an answer: in the lower S departments, a large fraction (30-40%) of births are home births, and it seems likely that this results in fewer reports of low birth weights.

Generally, the results are consistent with those from other countries; 14 of 17 variables loaded in the expected direction.

Mixed cases

Mixed cases are cases that do not fit the factor structure of a dataset. I have previously developed two methods for detecting such cases (Kirkegaard, 2015b). Neither method indicated any strong mixed cases in either the unimputed, unreduced dataset or the imputed, reduced dataset. Removing the least congruent case would improve the factor size by only 1.2 percentage points, and the case with the greatest mean absolute residual had a value of only .89.
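As a rough illustration of the mean absolute residual check, here is a Python sketch on simulated one-factor data. The actual methods are implemented in R (Kirkegaard, 2015b); the data and the incongruent case below are made up.

```python
import numpy as np

def mean_abs_residuals(X, loadings, scores):
    """Mean absolute residual per case under a one-factor model.

    X should be standardized; the fitted value for case i on
    variable j is scores[i] * loadings[j].
    """
    fitted = np.outer(scores, loadings)
    return np.abs(X - fitted).mean(axis=1)

# Simulated one-factor data: 100 cases, 5 indicators
rng = np.random.default_rng(2)
f = rng.normal(size=100)                    # true factor scores
lam = np.array([0.8, 0.7, 0.6, -0.5, 0.9])  # true loadings
X = np.outer(f, lam) + rng.normal(scale=0.3, size=(100, 5))

mar = mean_abs_residuals(X, lam, f)

# A deliberately incongruent ("mixed") case gets a larger residual
X_mixed = X.copy()
X_mixed[0] = np.array([2.0, -2.0, 2.0, 2.0, -2.0])
mar_mixed = mean_abs_residuals(X_mixed, lam, f)
```

Cases whose mean absolute residual stands far above the rest are candidates for being structural outliers.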

Unlike in previous analyses, the capital district was kept because it did not appear to be a structural outlier.

Cognitive ability, S and HDI

The two cognitive variables correlated at .84, indicating the presence of the aggregate general cognitive ability factor (G factor; Rindermann, 2007). They were averaged to form an estimate of the G factor.

The correlations between the S factors, HDI and cognitive ability are shown in Table 1.

Table 1: Correlation matrix for cognitive ability, S factor and HDI. Correlations below diagonal are weighted by the square root of population size.

Weighted and unweighted correlations were approximately the same. The imputed and trimmed S factor was nearly identical to the HDI values, even though the HDI values are from 2010 while the data underlying the S factor are from 2005. Results are fairly similar to those found in other countries.

Figure 3 shows a scatter plot of S factor (reduced, imputed dataset) and cognitive ability.


Figure 3: Scatter plot of S factor scores and cognitive ability.

Jensen’s method

Finally, as a robustness test, I used Jensen’s method (method of correlated vectors (Frisby & Beaujean, 2015; Jensen, 1998)) to see if cognitive abilities’ association with the S factor scores was due to the latent trait. Figure 4 shows the Jensen plot.

Figure 4: Jensen plot for S factor loadings and cognitive ability.

The correlation was .60, which is satisfactory given the relatively few variables (N=16).
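The computation behind Jensen's method is simple: correlate the vector of S factor loadings with the vector of correlations between each indicator and the criterion (here, cognitive ability). A Python sketch with made-up vectors, not the paper's actual values:

```python
import numpy as np

def jensen_method(loadings, criterion_correlations):
    """Method of correlated vectors: correlate the vector of factor
    loadings with the vector of indicator x criterion correlations."""
    return np.corrcoef(loadings, criterion_correlations)[0, 1]

# Hypothetical vectors for illustration: indicators that load more
# strongly on S also correlate more with cognitive ability, so the
# MCV result comes out strongly positive
s_loadings = np.array([0.9, 0.8, 0.7, 0.5, -0.4, -0.6])
r_with_ca  = np.array([0.6, 0.7, 0.4, 0.3, -0.1, -0.5])
r_mcv = jensen_method(s_loadings, r_with_ca)
```

A high positive MCV correlation is taken as evidence that the association runs through the latent factor rather than through specific indicators.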


Limitations

  • I don’t speak Spanish, so I may have overlooked some variables that should have been included in the analysis. There may also be translation errors, as I had to rely on the translations found on the websites I used.
  • No educational attainment variables were included despite these often having very strong loadings. None were available in the data sources I consulted.
  • Data was missing for many cases and had to be imputed.

Supplementary material

Data files, R source code and high quality figures are available in the Open Science Framework repository.


We present and analyze data from a dataset of 2358 Danish first names and socioeconomic outcomes not previously made available to the public (“Navnehjulet”, the Name Wheel). We visualize the data and show that there is a general socioeconomic factor with indicator loadings in the expected directions (positive: income, owning your own place; negative: having a criminal conviction, being without a job). This result holds after controlling for age and for each gender alone. It also holds when analyzing the data in age bins. The factor loading of being married depends on analysis method, so it is more difficult to interpret.

A pseudofertility is calculated based on the population size for the names for the years 2012 and 2015. This value is negatively correlated with the S factor score, r = -.35 [95CI: -.39; -.31], but the relationship seems to be somewhat non-linear, and there is an upward trend at the very high end of the S factor. The relationship is strongly driven by relatively uncommon names that have high pseudofertility and low to very low S scores. The n-weighted correlation is -.21 [95CI: -.25; -.17]. This dysgenic pseudofertility was mostly driven by Arabic and African names.

All data and R code is freely available.



It has been noted that good outcomes tend to go together, but to our knowledge, the factor structure of such relationships had not been examined until recently (Kirkegaard, 2014c). When it has been examined, it has repeatedly been found that there is a general socioeconomic factor to which good outcomes nearly always have positive loadings and bad outcomes negative loadings.1 Recent studies have examined S factors at the national, regional/state and country of origin-level; see Kirkegaard (2015c) for a review of regional/state-level studies, and Kirkegaard (2014a) for country of origin-level studies. In this paper we exploit a unique dataset to examine the S factor at the name-level in Denmark.

The dataset
Last year the Danish newspaper Ugebrevet A4 published an interactive infographic called “Navnehjulet” (“the Name Wheel”). It’s simple: you just enter a first name and it shows you some numbers about that name. The data was initially bought from Statistics Denmark and is based on 2012 data. There is no option available to download the dataset. A screenshot of the Name Wheel is shown in Figure 1.

Figure 1: A screenshot of the Name Wheel with “Emil” entered.

The more technical aspects of the scraping (“automatic downloading of the data”) are covered elsewhere (Tranberg, 2015), here we focus on the data and the statistical analyses.

The statistical information shown for each name varies (presumably due to data availability), but in the cases with full data, it includes:

  1. Number of persons with the name.
  2. 3 most common job types.
  3. 3 most common living areas.
  4. Average age.
  5. Percentages who rent and who own their home. Note that these do not always sum to 100%.
  6. Percentage with at least one conviction in the last 5 years.
  7. Average monthly income in DKK.
  8. Marital status (married, cohabiting, registered partner2, single).
  9. Employee rate.
  10. Student rate.
  11. Outside the job market rate.
  12. Independent rate.
  13. Unemployment rate.
  14. Chief executive rate.

Of note is that the unemployment variable includes only those who spent at least half the year without work or who received dagpenge (a kind of unemployment benefit). The outside the job market variable includes heterogeneous groups: førtidspensionister (pre-time retirees), folkepensionister (ordinary retirees), efterlønsmodtagere (another type of pre-time retirement), kontanthjælpsmodtagere (another type of unemployment benefit), and andre (others). As such, this last variable is a mixture of situations that are normal (ordinary pension, efterløn) and some which are used by unproductive members of society (førtidspension, kontanthjælp). Thus, interpretation of that variable is not straightforward. There is a more detailed description of the variables available at the website. We have taken a copy of this in case the site goes down (see supplementary material; in Danish).

We downloaded the data for all variables for each of the 2358 names in the database. The gender of the names was usually not marked, but because they were sorted by gender, we could easily assign them genders. The gender distribution is 1266 females and 1092 males, or 54% female. This is a higher female percentage than the actual population (50.3%3). This seems to be due to females simply having a greater diversity of names. Table 1 shows the top 20 most common names by gender.



Table 1: Top 20 most common names by gender.

As can be seen, the top 20 most common female names cover fewer persons in total than the top 20 male names, by 21%.

A few names had their gender explicitly marked; this was because they were unisex names. Such names were quite rare (36 pairs).

Missing data

There is quite a bit of missing data: 20% of names have at least some missing data. For this reason we examined the distribution of missing data to see if some of it could fruitfully be imputed (Donders, van der Heijden, Stijnen, & Moons, 2006). The matrix plot is shown in Figure 2.4


Figure 2: Matrix plot for missing data.

Note: Not all cases are shown due to insufficient resolution of the image.

We see that data is not missing at random but that some cases tend to have a lot of missing data. We also see that some variables have no missing data (unisex, number, age, conviction).

Which kind of cases have missing data? It cannot be seen from the above, but the missingness is strongly related to the number of persons with that name, which is not surprising. The data is limited to names where there are 100 or more persons. To see the relationship, we sort the data by number of persons and replot the matrix plot; Figure 3.


Figure 3: Matrix plot of missing data, cases sorted by number of persons with the name.

Another way to examine missingness is to examine the distribution of cases by the number of missing cases. A histogram of this is shown in Figure 4.


Figure 4: Histogram of cases by number of missing datapoints.

While about 20% of the cases have 13 missing datapoints, a small number of names (71) have only 2 missing datapoints. These can be imputed to slightly increase the sample size.
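The selection rule can be sketched as follows; a Python illustration with made-up names and values (the analysis itself was done in R, and simple mean imputation stands in for the actual imputation method):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the name data: one name misses many variables,
# others miss only one or two
df = pd.DataFrame({
    "income":     [30000, np.nan, 28000, np.nan],
    "conviction": [0.02,  np.nan, np.nan, 0.05],
    "own.place":  [55.0,  np.nan, 60.0,  48.0],
}, index=["NameA", "NameB", "NameC", "NameD"])

n_missing = df.isna().sum(axis=1)
imputable = df[n_missing <= 2].copy()           # keep cases with <= 2 missing
imputable = imputable.fillna(imputable.mean())  # mean imputation as a stand-in
excluded = df[n_missing > 2]                    # dropped entirely
```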

Getting an overview of the data
Before running numerical analyses on data, it is important to get a solid overview of it, because one can rapidly identify patterns by eye that may go unnoticed by numerical analyses. For instance, relying on correlations can miss important non-linear patterns, which can easily be identified by eye if the data are plotted with a moving average or similar (Lubinski & Humphreys, 1996).

The classic example of this is Anscombe’s quartet: four bivariate datasets which have (almost) the same mean of x and y, variance/standard deviation, correlation and regression coefficients (intercept and slope). However, plotting the data reveals that they are very different.
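This is easy to verify numerically; a short Python check using the standard Anscombe values:

```python
import numpy as np

# Anscombe's quartet: four x,y datasets with (almost) identical summary
# statistics but very different shapes when plotted
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    ([8] * 7 + [19] + [8] * 3,
     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

for x, y in quartet:
    x, y = np.array(x), np.array(y)
    r = np.corrcoef(x, y)[0, 1]
    print(f"mean_y={y.mean():.2f}  var_y={y.var(ddof=1):.2f}  r={r:.2f}")
```

All four lines report mean_y of about 7.50, var_y of about 4.12 and r of about 0.82, yet a scatter plot shows four very different shapes (a linear cloud, a curve, a line with one outlier, and a vertical cluster).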

Histograms are the easiest way to get a quick overview of the data structure. We plot selected histograms in Figures 5-8. The rest are available in the supplementary material.


Figure 5: Histogram of number of persons per name. Note that the x-axis is log-scale.

We see a power law distribution in that most names have only a few persons with them, while a few have many thousands. The top 20 by gender were shown in Table 1. The mean and median number of persons per name are 2209 and 316. Since the data only include names with at least 100 persons, showing the least common names is not particularly interesting. The curious reader can consult the supplementary material (results/number_ranks.csv).


Figure 6: Histogram of ages.

The distribution of the mean age of names is a fat normal distribution. Top 5 youngest: Elliot, Milas, Noam, Storm, Mynte (MMMMF); oldest: Valborg, Hertha, Dagny, Magna, Erna (all F).


Figure 7: Histogram of incomes.

The income distribution is fairly normal with a long right tail. Presumably, a few very rich people with uncommon names result in those names having very high mean incomes. The top scorers are: Renè (M), Leise (F), Frants (M), Heine (M) and Thorleif (M). The bottom scorers are dominated by names whose bearers are very young and thus have very low incomes, e.g. Alberte (mean age 8, mean income 4893 DKK). These are of little interest, so we shall not discuss them.


Figure 8: Histogram of mean convictions past 5 years.

It is clear that some names are much more criminal than others, the top scorers are: Alaa, Ferhat, Walid, Rachid, Fadi (all male). The female top scorer is Vesna (top #51). These names are all foreign, mostly Arabic, except the female name which is Slovenian according to www.meaning-of-names.com. This result is expected because persons from Muslim countries are highly overrepresented in crime statistics (Kirkegaard & Fuerst, 2014).

Variables by age and gender
Since the mean age of the names has central importance to the other variables (e.g. income) and since gender is a suitable dichotomous variable, we plot the other variables by age and gender. These are shown in Figures 9 to 17.


Figure 9: Income by age and gender.

We see the familiar pattern in that men earn more money than women. The difference is stable until about age 45 where it increases. Interpretation is difficult because the data is cross-sectional, not longitudinal and hence there are both age and cohort differences between the names. Still, one would expect something to happen at about that age that increases the difference.


Figure 10: Convictions by age and gender.

It is well-known that crime tends to be committed by younger males, and we see the same pattern here. Recall that this is the percentage of persons with the name who have at least one conviction in the last 5 years. Thus, it has a bit of lag, which is probably why it is fairly high even for men in their 40s: they could have gotten their conviction at, say, age 35.


Figure 11: Being outside the job market by age and gender.

This variable is the odd one comprising both regular pensions as well as some unemployment benefits and other benefits given to people who cannot/won’t work (e.g. who had a work accident, have severe psychological problems, are just lazy). As expected, it goes up heavily with age as people go on pension.


Figure 12: Being independently employed by age and gender.

There are known gender differences in rates of self-employment, and we see them here as well at all ages. Self-employment seems to increase somewhat over the lifespan, peaking perhaps around age 45-50.


Figure 13: Marital status by age and gender.

This one is interesting in that it has an odd pattern at old age. Our guess is that the men who are married tend to live longer which explains the male pattern, while the female pattern is explained by the fact that women live longer than men and their husbands die off before them, leaving them widowed (unmarried). In discussion with EOWK, A. J. Figueredo suggested that it may be due to serial monogamy. Simply put, some men divorce their aging wives and marry a younger one. This would tend to keep men married at older ages as well as decreasing the marriage rate of older women.


Figure 14: Owning a home by age and gender.

This one is odd in that around age 30 more women have their own home, but men catch up later. One could think of it as men making an earlier investment of their resources in career, while women are more interested in getting a home. And when men’s careers get going at age 45 and above, they acquire their homes. Again, because the data is cross-sectional, it is difficult to say.


Figure 15: No job by age and gender.

This is the variable for ‘pure’ unemployment. The gender difference is only slight at early to mid ages, while it reverses in direction at older ages. It is somewhat odd that it is highest around age 35.


Figure 16: Being a student by age and gender.

Girls and women generally acquire more formal education than men, and we see this in the data here as well.


Figure 17: Being an executive by age and gender.

Finally, there is a clear gender and age pattern in being an executive. Males are more likely at all ages, but there is an increase around age 45, especially for men. This is presumably the explanation of the pattern seen for income in Figure 9.

Is there an S factor among names?
Some of the variables are (almost) linearly dependent on each other. Own.place and rent.place sum to nearly 100, so using both in an analysis would perhaps cause problems. The same is true for the 4 civil status variables (married, cohabiting, reg. partnership, single), and the 6 employment variables (no.job, employee, out.of.work.market, student, independent, executive). To be safe, one should probably not pick more than one from each of these three sets.

To do a factor analysis we must however pick some of them. We decided on the following: no.job, own.place, married, conviction and income. The expectation is that no.job and conviction will have negative loadings, while own.place and income will have positive, and marriage perhaps somewhat positive (Herrnstein & Murray, 1994).

What we want to measure is the general socioeconomic status factor (if it exists). However, gender can disrupt the analysis. This is because men earn more money but are also more criminal. This may lead to gender specific variance, which is error in the factor analysis. One could regress out the effect of gender, but we instead divide the dataset into two which also allows for easier interpretation of the results.

Age has a strong influence on the variables, which may disrupt results. For instance, a very young name will have low income and a low conviction rate, which will result in high mixedness (Kirkegaard, 2015b). For this reason, we use both the original variables for analysis and a version where the effect of age has been partialed out. To do this, we regress every variable on age, age² and age³.
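The residualization step can be sketched as follows; a Python illustration on made-up data (the actual analysis is in R):

```python
import numpy as np

def partial_out_age(values, age):
    """Residualize a variable on age, age^2 and age^3 via OLS.

    Age is centered before taking powers purely for numerical stability;
    the residuals are the same either way since the column space is the same.
    """
    a = age - age.mean()
    X = np.column_stack([np.ones_like(a), a, a**2, a**3])
    beta, *_ = np.linalg.lstsq(X, values, rcond=None)
    return values - X @ beta

# Toy example: income depends non-linearly on age, plus noise
rng = np.random.default_rng(0)
age = rng.uniform(20, 70, size=300)
income = 200 * age - 1.5 * age**2 + rng.normal(scale=500, size=300)

resid = partial_out_age(income, age)
# After residualization the linear relation with age is numerically zero
```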

Some cases had some missing datapoints (refer back to Figure 2). We imputed the cases with 2 or fewer missing datapoints and excluded the rest.

Correlation matrices

Before looking at the factor analysis results, we will look at the correlation matrices by gender and together, as well as with and without partialing out the effects of age; Tables 2-4.

Table 2: Correlation matrix of S variables for both genders. Above diag., age partialed out.

Table 3: Correlation matrix of S variables for men. Above diag., age partialed out.

Table 4: Correlation matrix of S variables for women. Above diag., age partialed out.

Below the diagonal, one can see that the linear effect of age is often substantial, while above the diagonal, the linear effect of age is zero, meaning that the partialization generally worked, at least linearly speaking. Generally, the relationships were similar across genders. There are some exceptions. To make them easier to see, Table 5 shows the delta (difference) correlation matrix.

Table 5: Delta correlation matrix for genders. Higher values mean men’s correlations are stronger.

The largest difference for the age-partialed data is the relationship between being married and having no job (recall that this does not include those pensioned). Among female names, there is a strong relationship between unemployment and being married, perhaps because women are more often reliant on their husbands (being homemakers) than the reverse; both correlations were positive. It could also have something to do with Muslim immigrants (about 10% of the population), who are often married and among whom a large fraction of the women are unemployed.

Factor analyses

The loadings plots are shown in Figure 18.


Figure 18: Loadings plot for factor analyses.

The factors were not particularly strong, as shown in Table 6.

Table 6: Variance explained by S factors.

The factors decreased in size after correcting for age, which could be because age was inflating the factor size, or because the correction was too strong. The gender difference in the marriage indicator is strong: about 0 vs. about .5 after age correction. Notice that own.place has loadings near 1, so the S factor is about equal to that variable in these datasets. This is probably indicator sampling error that would be corrected if more indicators of greater diversity were available.5 Some previous S factor studies have found the same when only a few indicators were used, e.g. Kirkegaard (2015a, first analysis).

Still, the factor loadings are in the expected directions for all variables in all analyses.

Given the similar factor loadings, one would also expect the extracted factor scores to be similar, which Table 7 shows them to be.

Table 7: Correlations between S factors across analyses.

Note: The apparently missing values are because the data does not overlap. There are no scores for men in the S factor analyses with only women.

Using age bins instead

In the above analyses, we have analyzed data for all ages, both with and without partialing out the effects of age. However, age may be insufficiently dealt with by the chosen correction method, and its effect may be so strong that not correcting for it leads to spurious results. Hence we employed a third method: age bins. The dataset is large enough that we can split it up into age groups, say ages 20-25, and analyze each subgroup separately. While this does not entirely remove the age effect, it is less likely to introduce spurious over-correction effects.

Concretely, we analyzed subgroups within 5-year brackets starting at age 20-25 and stopping at age 50-55. We did this for both genders together, and separately. The analysis procedure is the same as above, namely extracting the general factor and examining the loadings and factor sizes. Figures 19-21 show the factor loadings by age bin for both genders together, and for each gender separately.
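The binning step can be sketched as follows; a Python illustration on made-up name-level data (the actual analysis is in R):

```python
import numpy as np
import pandas as pd

# Toy name-level data; 'age' is the mean age of a name's bearers
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "age": rng.uniform(18, 60, size=500),
    "income": rng.normal(30000, 5000, size=500),
})

# 5-year brackets from 20-25 up to 50-55, as in the text; ages outside
# the brackets fall out as NaN
edges = np.arange(20, 60, 5)  # 20, 25, ..., 55
df["bin"] = pd.cut(df["age"], bins=edges, right=False)

# Each bin would then be factor-analyzed separately; here we just
# summarize per bin to show the grouping
summary = (df.dropna(subset=["bin"])
             .groupby("bin", observed=True)["income"]
             .agg(["size", "mean"]))
```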


Figure 19: Factor loadings by age bins, both genders together


Figure 20: Factor loadings by age bins, males only


Figure 21: Factor loadings by age bins, females only

The most conspicuous finding is the marriage loadings, which are now negative! Apparently, the positive loadings from before were an age confound. The exception is the last two age groups, where the marriage indicator is positive, especially for the last group. The odd finding that for 50-55 year olds crime has a loading around 0 is presumably sampling error as well as a reflection of the fact that crime among people in their 50s is fairly rare. When the base rate is low, correlations become weaker, and factor loadings are based on the correlation patterns in the data (Ferguson, 2009). The sample sizes are not terribly impressive, 126 to 257, with the smallest for the last two groups. The gender-specific samples are about half that.

For the male data, the marriage loadings are about 0. The two last age bins are again positive. The other four loadings are somewhat stronger in males with criminality actually having stronger (negative) correlations than unemployment. This is presumably because crime is more common among males which means the correlations are stronger.

Finally, for the female data, marriage loadings are more strongly negative except for the last two age bins, same as with the male data.

Figure 22 shows the factor strength by age bin and gender, together and separate.


Figure 22: Factor sizes by age bin and gender

Generally, the male-only analyses had the strongest S factors (6/7), with the female-only analyses being above the mixed-gender one (5/7). One might interpret this as being due to the lower base rate of crime among females making the correlations with the crime variable smaller, which makes the factor smaller. The mixed-gender analyses usually had smaller factors, perhaps because of the mixedness that results from combining the genders, as discussed earlier.

Pseudofertility and the S factor

Since the Name Wheel data contains the count of persons with each name in 2012, if we could find data for a later year for the same names, we could calculate a name-wise ‘fertility’, which we shall call pseudofertility. It is the growth (or decrease) in the number of persons with each name in Denmark. This may be due to actual births, immigration or name changes. This pseudofertility can then be compared to the S factor score for each name to see if there is any relationship. A somewhat negative relationship is expected because low S immigrant names increase their numbers via higher than average fertility (at least in the first generation; Kirkegaard, 2014b) and immigration.

The Danish statistical agency (Danish Statistics) maintains a web page where one can look up any first or last name and see how many people have that name in the current year and the last year. Using a method similar to that used to scrape the data from the Name Wheel, we scraped the count data for the years 2014 and 2015 for every name in our dataset. From these data, we calculated the pseudofertility as the fractional increase (or decrease) of each name over both the period 2012-2015 and 2014-2015. The first should give a more reliable number since it covers a few years rather than just one. Their correlation is .95 (no outliers), so reliability was very high.
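The pseudofertility computation is simple; a Python sketch with hypothetical counts (not the actual scraped data):

```python
import pandas as pd

# Hypothetical counts per name for 2012 and 2015
counts = pd.DataFrame({
    "n2012": [1000, 250, 4000, 120],
    "n2015": [1100, 300, 3900, 150],
}, index=["NameA", "NameB", "NameC", "NameD"])

# Pseudofertility: the fractional increase (or decrease) in the number
# of persons with the name over the period
counts["pseudofertility"] = (counts["n2015"] - counts["n2012"]) / counts["n2012"]
```

A value of 0.10 means 10% more persons had the name in 2015 than in 2012; negative values mean the name shrank.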

Figure 23 shows the scatter plot of pseudofertility 2012-2015 and S factor score (age adjusted, both genders together).


Figure 23: Pseudofertility 2012-2015 and S factor scores. Point sizes are proportional to the number of persons with the name.

Overall, there is a medium-sized negative relationship, r = -.35 [95CI: -.39; -.31], between pseudofertility and S factor score (age-controlled). As can be seen in the plot, this is mainly due to the names left of 0 S (the below average). There appears to be an upward trend at the other end, but there are relatively few datapoints, so it may be a fluke. The point sizes show that the names creating the trend are relatively uncommon (few people have those names, relatively speaking). The largest names cluster around S [0-1.5]. For this reason, we also calculated the weighted correlation which is -.21 [95CI: -.25; -.17], so the effect is still reliable but substantially smaller as expected from the inspection of the plot.
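The n-weighted correlation is a weighted Pearson correlation, where the weights in this paper are the numbers of persons per name. A minimal Python sketch:

```python
import numpy as np

def weighted_corr(x, y, w):
    """Pearson correlation with case weights w."""
    w = np.asarray(w, dtype=float)
    mx = np.average(x, weights=w)
    my = np.average(y, weights=w)
    cov = np.average((x - mx) * (y - my), weights=w)
    vx = np.average((x - mx) ** 2, weights=w)
    vy = np.average((y - my) ** 2, weights=w)
    return cov / np.sqrt(vx * vy)

# Sanity check on made-up data: with equal weights this reduces to the
# ordinary Pearson correlation
rng = np.random.default_rng(4)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)
r_plain = np.corrcoef(x, y)[0, 1]
r_equal = weighted_corr(x, y, np.ones(200))
```

With the person counts as weights, uncommon names contribute little, which is why the weighted estimate (-.21) is smaller in magnitude than the unweighted one (-.35).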

We plotted the figure in very high resolution using vector graphics so that one can zoom in on any given region. The reader can examine the pseudofertility_names.svg file in the supplementary material to explore the figure. Looking at the names in the region creating the negative slope reveals them to be almost exclusively immigrant names from Arabic or African countries, e.g.: Mohammad, Hossein, Mostafa, Sayed, Malika, Mana, Slawomir, Omar (names from the region north of the moving average near S = -1.5). Unfortunately, the dataset does not contain information about the immigration status of each name, so we could not exclude the immigrant names to see whether the ‘dysgenic’ relationship holds without them.

Thus, the name data reveals a small ‘dysgenic’ effect on S in line with modeling by (Kirkegaard & Tranberg, 2015). If the trend were to continue, and assuming that everything else is equal, then the average level of socioeconomic status would fall in Denmark and there would be increasing socioeconomic inequality.

Discussion and conclusion
Despite being a new level of analysis (at least to us), the results were generally in line with those from more ‘traditional’ country, regional/state-level and origin country-level analyses.

This dataset contained first names, but one could also analyze last names which are more familial in nature. Such data was not available at the Name Wheel website, but it could probably be acquired from the statistical agency if one is willing to pay.

The dataset is especially useful for researchers wishing to investigate the (in)accuracy of stereotypes of names, see e.g. (Jussim, Cain, Crawford, Harber, & Cohen, 2009; Jussim, 2012).


As mentioned earlier, the data are an odd kind of cross-sectional data which makes it difficult to infer causality. A given difference observed between names with a mean age of 20 and 40, could be either an effect of age (being 20 versus 40), a cohort effect (being born in 1995 versus 1975), or something more complicated.

The mean age of a name is tricky to interpret since the distribution of ages of persons with the name is not shown. This could be a normal distribution if the name was fashionable at some point and then faded out. However, it could also be bi-modal. For instance, if a name was fashionable in 1965 and again in 1995, there would be two groups of persons, one aged about 50 and one aged about 20. If the two groups are about equally large, the mean age of the name would be about 35, despite few people with the name being that age.

Aside from the extra population data from Danish Statistics, the dataset only has data from one year (2012). It would be better if data for more than one year was available. Both to avoid fluke effects, but also to examine e.g. the effects of macroeconomics on the relationships between the variables.

To our knowledge, this is a new kind of grouped data and so methods for analyzing it have not been well-tested. This should give some extra caution about the inferences drawn from it.

Supplementary material

Data, source code and figures are available at the Open Science Framework repository.



1 Note that sometimes a factor is reversed such that the good outcomes have negative loadings, and the bad outcomes have positive loadings. This reversing is quite arbitrary and depends on the balance of good and bad variables included in the analysis: a preponderance of bad variables means that the factor will be reversed. If the factor is thus reversed, one can just multiply all loadings by -1 to unreverse it.

2 This is a pre-2012 category as an alternative to marriage for same sex couples. One can no longer attain this legal status, but one can retain it if one acquired it before 2012. See Borger.dk (Danish).

4 This plot is made using the matrixplot() function from the VIM package (Templ, Alfons, Kowarik, & Prantner, 2015). The 5 character/string variables are left out because, due to a bug in the function, such variables are always shown as missing all data, whereas in fact none of them had any missing data in this case.

5 Indicator sampling error is meant to be a generalized version of Jensen’s “psychometric sampling error”, see e.g. (Kranzler & Jensen, 1991).

A factor analysis was carried out on 6 socioeconomic variables for 506 census tracts of Boston. An S factor was found with positive loadings for median value of owner-occupied homes and average number of rooms in these; negative loadings for crime rate, pupil-teacher ratio, NOx pollution, and the proportion of the population of ‘lower status’. The S factor scores were negatively correlated with the estimated proportion of African Americans in the tracts r = -.36 [CI95 -0.43; -0.28]. This estimate was biased downwards due to data error that could not be corrected for.

The general socioeconomic factor (s/S1) is a construct similar to that of general cognitive ability (GCA; g factor, intelligence, etc.; Gottfredson, 2002; Jensen, 1998). For ability data, it has been repeatedly found that performance on any cognitive test is positively related to performance on any other test, no matter which format (pen and paper, read aloud, computerized) or type (verbal, spatial, mathematical, figural, or reaction time-based) has been tried. The S factor is similar. It has been repeatedly found that desirable socioeconomic outcomes tend to be positively related to other desirable socioeconomic outcomes, and undesirable outcomes positively related to other undesirable outcomes. When this pattern is found, one can extract a general factor such that the desirable outcomes have positive loadings and the undesirable outcomes have negative loadings. In a sense, this is the latent factor that underlies the frequently used term “socioeconomic status”, except that it is broader: not restricted to income, occupation and educational attainment, but also including e.g. crime and health.

So far, S factors have been found for country-level (Kirkegaard, 2014b), state/regional-level (e.g. Kirkegaard, 2015), country of origin-level for immigrant groups (Kirkegaard, 2014a) and first name-level data (Kirkegaard & Tranberg, In preparation). The S factors found have not always been strictly general in the sense that sometimes an indicator loads in the ‘wrong direction’, meaning that either an undesirable variable loads positively (typically crime rates), or a desirable variable loads negatively. These findings should not be seen as outliers to be explained away, but rather as facts to be explained in some coherent fashion. For instance, crime rates may load positively despite crime being undesirable because the justice system may be better in the higher S states, or because urbanicity tends to create crime and urbanicity usually has a positive loading. To understand why some indicators sometimes load in the wrong direction, it is important to examine data at many levels. This paper extends the S factor to a new level, that of census tracts in the US.

Data source
While taking a video course on statistical learning based on James, Witten, Hastie, & Tibshirani (2013), I noted that a dataset used as an example would be useful for an S factor analysis. The dataset concerns 506 census tracts of Boston and includes the following variables (Harrison & Rubinfeld, 1978):

  • Median value of owner-occupied homes
  • Average number of rooms in owner units.
  • Proportion of owner units built before 1940.
  • Proportion of the population that is ‘lower status’: “proportion of adults without some high school education and proportion of male workers classified as laborers”.
  • Crime rate.
  • Proportion of residential land zoned for lots greater than 25k square feet.
  • Proportion of nonretail business acres.
  • Full value property tax rate.
  • Pupil-teacher ratios for schools.
  • Whether the tract bounds the Charles River.
  • Weighted distance to five employment centers in the Boston region.
  • Index of accessibility to radial highways.
  • Nitrogen oxide concentration. A measure of air pollution.
  • Proportion of African Americans.

See the original paper for a more detailed description of the variables.

This dataset has become very popular as a demonstration dataset in machine learning and statistics, which shows the benefits of data sharing (Wicherts & Bakker, 2012). As Gilley & Pace (1996) note: “Essentially, a cottage industry has sprung up around using these data to examine alternative statistical techniques.” However, as they re-checked the data, they found a number of errors. The corrected data can be downloaded here, which is the dataset used for this analysis.

The proportion of African Americans
The variable concerning African Americans has been transformed by the following formula: 1000(x − .63)². Because one has to take the square root to reverse the effect of taking the square, some information is lost. For example, if we begin with the dataset {2, -2, 2, 2, -2, -2} and square these values to get {4, 4, 4, 4, 4, 4}, it is impossible to reverse this transformation and recover the original, because one cannot tell whether a given 4 results from -2 or 2 being squared.
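The information loss can be illustrated with a couple of lines of R, using the toy vector from the text:

```r
x = c(2, -2, 2, 2, -2, -2)  # original data
y = x^2                     # squaring: all values become 4, the signs are lost
sqrt(y)                     # returns only the positive roots: 2 2 2 2 2 2
```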

In case of the actual data, the distribution is shown in Figure 1.

Figure 1: Transformed data for the proportion of blacks by census tract.

Due to the transformation, the values around 400 actually mean that the proportion of blacks is around 0. The function for back-transforming the values is shown in Figure 2.

Figure 2: The transformation function.

We can now see the problem of back-transforming the data. If the transformed data contain a value between 0 and about 140, then we cannot tell with certainty which original value it came from. For instance, a transformed value of 100 might correspond to an original proportion of .31 or .95.

To get a feel for the data, one can use the Racial Dot Map explorer and look at Boston. Figure 3 shows the Boston area color-coded by racial groups.

Boston race
Figure 3: Racial dot map of Boston area.

As can be seen, the races tend to live rather separately, with large areas dominated by one group. From looking at it, it seems that Whites and Asians mix more with each other than with the other groups, and that African Americans and Hispanics do the same. One might expect this result based on the groups’ relative differences in S factor and GCA (Fuerst, 2014). Still, this should be examined by numerical analysis, a task which is left for another investigation.

Still, we are left with the problem of how to back-transform the data. The conservative choice is to use only the left side of the function. This is conservative because any proportion above .63 will get back-transformed to a lower value. E.g. .80 will become .46, a serious error. This is the method used for this analysis.
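As a minimal sketch, the conservative back-transformation can be written as an R function (the function name is mine, not from the original analysis):

```r
# The original transformation was y = 1000 * (x - .63)^2.
# Conservative choice: always take the left branch (x <= .63), so any
# true proportion above .63 is mapped to its mirror image below .63.
back_transform = function(y) 0.63 - sqrt(y / 1000)

back_transform(1000 * (0.80 - 0.63)^2)  # returns .46, not the true .80
```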

Factor analysis
Of the variables in the dataset, there is the question of which to use for the S factor analysis. In general when doing these analyses, I have sought to include variables that measure something socioeconomically important and which are not strongly influenced by the local natural environment. For instance, the dummy variable concerning the Charles River fails on both counts. I chose the following subset:

  • Median value of owner-occupied homes
  • Average number of rooms in owner units.
  • Proportion of the population that is ‘lower status’.
  • Crime rate.
  • Pupil-teacher ratios for schools.
  • Nitrogen oxide concentration. A measure of air pollution.

These concern important but different things. Figure 4 shows the loadings plot for the factor analysis (reversed).2


Figure 4: Loadings plot for the S factor.

The S factor was confirmed for this data without exceptions, in that all indicator variables loaded in the expected direction. The factor was moderately strong, accounting for 47% of the variance.
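A minimal sketch of this analysis, using the uncorrected Boston dataset that ships with the MASS package (the analysis above used the Gilley & Pace corrected data) and the fa() function from the psych package:

```r
library(MASS)   # ships the (uncorrected) Boston dataset
library(psych)  # fa() for factor analysis

# The six chosen indicators, by their MASS::Boston column names
indicators = Boston[c("medv", "rm", "lstat", "crim", "ptratio", "nox")]
S_fa = fa(indicators)                    # one-factor solution (the default)
S_scores = as.vector(S_fa$scores) * -1   # reverse so that high = desirable
```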

Relationship between S factor and proportions of African Americans
Figure 5 shows a scatter plot of the relationship between the back-transformed proportion of African Americans and the S factor.

Figure 5: Scatter plot of S scores and the back-transformed proportion of African Americans by census tract in Boston.

We see that there is wide variation in the S factor even among tracts with no or very few African Americans. These low S scores may be due to Hispanics or may simply reflect the wide variation within Whites (there were few Asians back then). The correlation between the proportion of African Americans and S is -.36 [CI95 -0.43; -0.28].

We see that many very low S points lie in the range S = -3 to -1.5. Some of these points may actually be census tracts with very high proportions of African Americans that were back-transformed incorrectly.

The value of r = -.36 should not be interpreted as an estimate of effect size of ancestry on S factor for census tracts in Boston because the proportions of the other sociological races were not used. A multiple regression or similar method with all sociological races as the predictors is necessary to answer this question. Still, the result above is in the expected direction based on known data concerning the mean GCA of African Americans, and the relationship between GCA and socioeconomic outcomes (Gottfredson, 1997).
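The suggested multiple regression could be sketched as below. The racial proportion columns do not exist in this dataset, so simulated stand-in data are used; all names and coefficients are purely illustrative:

```r
set.seed(1)
n = 506  # same number of tracts as the Boston data
tracts = data.frame(
  black    = runif(n),  # hypothetical proportions of each sociological race
  hispanic = runif(n),
  asian    = runif(n)
)
tracts$S = -1.5 * tracts$black - 1.0 * tracts$hispanic + rnorm(n)  # toy outcome
# All sociological races entered as predictors at once
fit = lm(S ~ black + hispanic + asian, data = tracts)
summary(fit)
```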

Limitations
The back-transformation process likely introduced substantial error in the results.

The data are relatively old and may not reflect Boston as it is now.

Supplementary material
Data, high quality figures and R source code are available at the Open Science Framework repository.



1 Capital S is used when the data are aggregated, and small s is used when the data are at the individual level. This follows the nomenclature of Rindermann (2007).

2 The loadings are said to be reversed because the analysis gave positive loadings to undesirable outcomes and negative loadings to desirable outcomes. This happens because the analysis includes more indicators of undesirable outcomes, and factor analysis chooses the direction to which most indicators point as the positive one. This can easily be reversed by multiplying the loadings by -1.

A friend of mine and his brother just received their 23andme results.



In a table they look like this (I have added myself for comparison):

Macrorace Bro1 Bro2 Emil
European 52.6 53 99.8
MENA 42.5 41.3 0.2
South Asian 2.8 3.4 0
East Asian & Amerindian 1.1 0.7 0
Sub-Saharan African 0.5 0.5 0
Oceanian 0.5 0 0
Unassigned 0 1.1 0.1
Sum 100 100 100.1
Mesorace Bro1 Bro2 Emil
Northern 51.5 51.5 91.3
Southern 1 1.2 0
Ashkenazi 0.1 0 2.9
Eastern 0 0 4
Common European 0.1 0.4 1.5
Middle Eastern 42 40.8 0
North African 0.3 0.2 0.2
Common MENA 0.2 0.3 0
South Asian 2.8 3.4 0
East Asian & Amerindian
East Asian 0.7 0.4 0
Southeast Asian 0.2 0 0
Amerindian 0 0.1 0
Common East Asian & Amerindian 0.1 0.1 0
Sub-Saharan African
East 0.3 0.3 0
West 0.2 0.4 0
Central & South 0 0 0
Common Sub-Sahara African 0.1 0.1 0
Oceanian 0 0 0
Unassigned 0.5 1.1 0.1
Sum 100.1 100.3 100
Microrace Bro1 Bro2 Emil
Scandinavian 21.3 24.2 37.3
French & German 10.5 14.9 0.8
British and Irish 8.9 4.9 11
Finnish 0 0 0.3
Common Northern 10.7 7.5 42
Italian 0.9 0.8 0
Sardinian 0 0 0
Iberian 0 0 0
Balkan 0 0 0
Common Southern 0.1 0.4 0
Ashkenazi 0.1 0 2.9
Eastern 0 0 4
Common European 0.1 0.4 1.5
Middle Eastern 42 40.8 0
North African 0.3 0.2 0.2
Common MENA 0.2 0.3 0
South Asian 2.8 3.4 0
East Asian & Amerindian
East Asian
Japanese 0.2 0 0
Mongolian 0.1 0.2 0
Korean 0 0 0
Yakut 0 0 0
Chinese 0 0 0
Common East Asian 0.5 0.2 0
Southeast Asian 0.2 0 0
Amerindian 0 0.1 0
Common East Asian & Amerindian 0.1 0.1 0
Sub-Saharan African
East 0.3 0.3 0
West 0.2 0.4 0
Central & South 0 0 0
Common Sub-Sahara African 0.1 0.1 0
Oceanian 0 0 0
Unassigned 0.5 1.1 0.1
Sum 100.1 100.3 100.1


Note that I have used data from all three zoom levels. Sometimes people will ask the nonsensical question “How many races are there?” Well, it depends on how much you want to zoom in. 23andme supports three zoom-levels. I have called the groups identified macro-, meso- and microraces.

So we see that the siblings are almost but not exactly the same. As Jason Malloy has pointed out, this is a very important fact because it allows for a sibling-control study akin to Murray (2002). In this design, researchers find full-siblings, measure some predictor variable(s) from each sibling and compare them on the outcome variable(s). This is an important design because it removes the common environment (between family effects) confound that makes interpretation of regression results difficult, e.g. those in The Bell Curve (Herrnstein and Murray, 1994). Murray (2002) used each sibling’s IQ to predict socioeconomic outcomes in adulthood (age 30-38): income, marriage and birth out of wedlock. I reproduce the tables below:


The results are similar to the results from regression modeling presented in The Bell Curve. In other words, for this question, the effects were not due to the common environment confound.

The same design can be used for the question of whether racial ancestry predicts outcome variables such as general cognitive ability (g factor, IQ, etc.), income, educational attainment and crime rates. Since siblings differ somewhat in their ancestry (as was shown in the tables and figures above), if the genetic hypothesis for the trait is true, the differences in ancestry will slightly predict differences in the trait.

In practice, for this to work, one will need a large sample of sibling sets (pairs, triples, etc.). To make it easy, they should not be admixtures of more than 2 genetic clusters/races. So e.g. African Americans in the US are good for this purpose, as they are mostly a mix of European and African genes, but there are other similar groups in the world: Coloreds in South Africa, Greenlanders in Denmark and Greenland (Moltke et al, 2015), admixed Hawaiians, and basically everybody in South America (see the admixture project, part I).
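The design described above can be sketched in R as follows: within each full-sibling pair, regress the within-pair difference in the outcome on the within-pair difference in ancestry, which removes the shared family environment. The data are simulated and all names are illustrative:

```r
set.seed(1)
n = 1000
# Simulated full-sibling pairs; ancestry = e.g. proportion European
sibs = data.frame(
  ancestry1 = runif(n, 0.6, 0.9),
  ancestry2 = runif(n, 0.6, 0.9)
)
# Toy outcomes under a genetic hypothesis: ancestry raises the trait a bit
sibs$outcome1 = 10 * sibs$ancestry1 + rnorm(n)
sibs$outcome2 = 10 * sibs$ancestry2 + rnorm(n)
# Within-pair differences cancel out the common (between-family) environment
diff_fit = lm(I(outcome1 - outcome2) ~ I(ancestry1 - ancestry2), data = sibs)
summary(diff_fit)
```

Under the genetic hypothesis the within-pair slope should be positive and detectable given a large enough sample of sibling sets.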