The problem:

  • To install a virtualenv (virtual environment for python) under a common (not user) folder, so that it is accessible to other users.

If we try without using sudo, we get a permission error:

[go into virtualenv called env]
pip3 install fibo
>[lots of stuff]
>PermissionError: [Errno 13] Permission denied: '/django/env/lib/python3.4/site-packages/'

If we try with sudo, it installs it in the wrong location:

sudo which pip3

I.e. it is installing it in the system-wide location. This isn’t accessible from within the virtualenv. What to do? Some links I found:

Of these, the first does not address the problem of running pip, which isn’t a python script. The second gives a command that doesn’t work (-E option does not work for me). The third addresses an issue that is almost right, but the solution appears not to be related.

After some time, it occurred to me to call the right pip3 with sudo.

sudo -H env/bin/pip install django
[install stuff]
sudo -H env/bin/pip list
>Django (1.9.1)
>pip (8.0.0)
>setuptools (19.4)
>wheel (0.26.0)
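
To double-check which environment a given interpreter (and hence its pip) belongs to, one can ask Python itself. This is only a minimal sketch: the function name `in_virtualenv` is made up for illustration, and it assumes Python 3, where older virtualenv releases set `sys.real_prefix` and newer ones make `sys.prefix` differ from `sys.base_prefix`.

```python
import sys


def in_virtualenv():
    """Return True if the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.prefix differ from sys.base_prefix.
    base = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return base != sys.prefix


if __name__ == "__main__":
    print("prefix:", sys.prefix)
    print("in virtualenv:", in_virtualenv())
```

Running this with env/bin/python and again with the system python3 under sudo should show two different prefixes, which is the same mismatch that made the plain sudo pip3 call install into the wrong place.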


Better solution

Just change the permissions so that all users can edit the files. Then one does not need sudo at all.

sudo chmod -R o+rwx /directory


I often share material on Twitter when I find something interesting. This is the primary purpose of Twitter for me: share and find interesting links. Right now, the process is a bit cumbersome. I have to switch tabs and paste the URL manually. Sometimes I need to also copy the title which means another set of tab-switches. It is not much time, but it adds up in the long run.

Thus, in the spirit of Automate the boring stuff, I wanted to see if I could find a way to speed it up a bit. Essentially, what I need is a pop-up menu or some equivalent so that I can quickly share material and with useful defaults (html title and URL as defaults).

I tried (search results):

Very often, I also share images from the URL in question. Thus, it would be nice to have a system that lets me choose which images from the site to use (if any). Facebook has a similar feature when one posts (one can choose the picture for the thumbnail).

While I’m at it, I want to see if Twitter can otherwise be made better. One frequent problem is that I need to search a person’s tweets (often my own) to find something I think or know is there. However, Twitter lacks satisfactory search functionality, so one has to rely on Google custom searches (for instance, to find my tweets about publication bias) or find a better solution. I did not find a better solution.

Items on my wishlist for Twitter

It seems useful to write up functionalities that I want for Twitter.

  • Limit search to one or more users.
    Current workaround involves using Google custom search, but this is not easy to expand to multiple users. For instance, if I know it was a tweet in my feed but don’t know who posted it, I would want to search all tweets from all users that I follow.
  • Filter tweets by type by person.
    Twitter has 3 filters built in: tweets, tweets and replies, and media content. However, the first category actually includes retweets as well. Some people post good tweets but retweet a bunch of stuff I don’t care about, which decreases the average interestingness of their output that I see. This could be solved by filtering out retweets from those persons, but that is currently not possible.
  • Filter out promotions.
    Get rid of ads. Twitter Inc. itself will not like that idea, but it is useful to me.
  • Relative importance of users.
    Some people post things that are more interesting to me than others. However, Twitter offers only two options: see stuff from that user or don’t see stuff. Facebook has introduced a tiered approach to following people’s updates: normal, see first and close friends (one can also hide stuff from friends while keeping them on the friend list, but this is not very applicable to Twitter).


A dataset of 127 variables concerning socioeconomic outcomes for US states was analyzed. Of these, 81 were used in a factor analysis. The analysis revealed a general socioeconomic factor. This factor correlated .961 with one from a previous analysis of socioeconomic data for US states.



It has repeatedly been found that desirable outcomes tend to be associated with other desirable outcomes and likewise for undesirable outcomes. When this is the case, one can extract a general factor — the general socioeconomic factor (S factor) — such that the desirable outcomes load positively and the undesirable outcomes negatively. This pattern has been found at the country level (1), within country divisions of many countries (2–10), at the city district level (11), at the level of first names (12) and at the level of country of origin groups in two countries (13,14).

A previous study has found that the pattern holds for US states too (7). However, a new and larger dataset has been found, so it is worth examining whether the pattern holds in it, and if so, how strongly correlated the extracted factor scores are between the datasets. This would function as a kind of test-retest reliability.

Data sources

The previous study (7) of the S factor among US states used a dataset of 25 variables compiled from various official statistics found at The 2012 Statistical Abstract website. The current study relies upon a dataset compiled by Measure of America, a website that visualizes social inequality. It is possible to download the datasets their maps rely upon here.

As done with earlier studies, I excluded the capital district. I also excluded the data for US as a whole since it was not a state like the other cases.

The dataset contains a total of 127 variables. However, not all of these are useful for examining the S factor:

  • 4 variables are the composite indexes calculated by Measure of America. These are fairly similar to the Human Development Index scores, except that they are scaled differently.
  • 6 variables concern the population sizes in percent of 6 sociological race categories: Non-Hispanic White, Latino, African American, Asian, Amerindian (Native American) and other.
  • 1 variable contains the total population size for each state.
  • A number of variables were not given in a form adjusted for population size e.g. per capita, percent or rate per 100k persons. These variables were excluded: Rape (total number), Homeless Population (total number), Medicare Recipients (thousands), Medicaid Recipients (thousands), Army Recruits (total), Total Military Casualties in Operations Enduring Freedom and Iraqi Freedom to April 2010, Prisoners State or Federal Jurisdiction (total number), Women in Congressional Delegation (total), Men in Congressional Delegation (total), Carcinogen Releases (pounds), Lead Releases (pounds), Dioxin Releases (grams), Superfund Sites (total), Protected Forest (acres), and Protected Farm and Ranch Land (acres).
  • 1 variable was excluded due to being heavily reliant on local natural environment (presence of water and forests): Farming fishing and forestry occupations (%).
  • 1 variable was excluded because most of its data was missing: State Earned Income Tax Credit (% of federal Earned Income Tax Credit).

The variables that were not given in per population format almost always had a sibling variable that was given in a suitable format and which was included in the analysis. After these exclusions, 101 variables remained for analysis.

Missing data

An analysis of missing data showed that some variables still had missing data. Because the dataset had more variables than cases, it was not possible to impute the missing data using multiple regression as commonly done in these analyses. For this reason, these variables were excluded. After this, 93 variables remained for analysis.

Duplicated, reverse-coded and highly redundant variables

An analysis of correlations among variables showed that 2 of them had duplicates (r = 1): Diabetes (% age 18 and older) and Low-Birth-Weight Infants (% of all infants). I’m not sure why this is the case.

Furthermore, 4 variables had a reverse-coded sibling (r = -1):

  1. Less Than High School (%) + At Least High School Diploma (%)
  2. 4th Graders Reading Below Proficiency (%) + 4th Grade National Assessment of Educational Progress in Reading (% at or above proficient)
  3. Urban Population (%) + Rural Population (%)
  4. Public High School Graduation Rate (%) + High School Freshmen Not Graduating After 4 Years (%).

Finally, some variables were so strongly related to other variables that keeping both would perhaps result in factor analytic errors or heavily influence the resulting factor. I decided to use a threshold of |.9| as the limit. If any pair of variables correlated at this level or above, one of them was excluded. There were 6 pairs of variables like this and the first of the pair was excluded:

  1. Poverty Rate (% below federal poverty threshold) + Child Poverty (% living in families below the poverty line), r = .985.
  2. Poverty Rate (% below federal poverty threshold) + Children Under 6 Living in Poverty (%), r = .968.
  3. Management professional and related occupations (%) + At Least Bachelor’s Degree (%), r = .925.
  4. Preschool Enrollment (% enrolled ages 3 and 4) + 3- and 4-year-olds Not Enrolled in Preschool (%), r = -.925.
  5. Army Recruits (per 1000 youth) + Army Recruits (per 1000 youth), r = .914.
  6. Graduate Degree (%) + At Least Bachelor’s Degree (%), r = .910.

The army recruit variable seems to be a duplicate, but the numbers are not identical for all cases. The two preschool enrollment variables seem to be meant to be a reverse-coding of each other, but they don’t correlate perfectly negatively.
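
The redundancy rule above can be sketched in a few lines. This is not the code used for the analysis (that is in the supplementary material), just a minimal illustration: the helper names and the toy numbers are made up, and only the |.9| threshold and the "drop the first of the pair" rule come from the text.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def redundant_pairs(data, threshold=0.9):
    """Return (first, second, r) for every variable pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(data, 2):
        r = pearson(data[a], data[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


def drop_first_of_each_pair(data, threshold=0.9):
    """Exclude the first variable of each highly correlated pair."""
    to_drop = {a for a, _, _ in redundant_pairs(data, threshold)}
    return {name: values for name, values in data.items() if name not in to_drop}


# Toy data (made up for illustration, not from the study).
toy = {
    "poverty": [10, 12, 15, 20, 25],
    "child_poverty": [14, 17, 20, 27, 33],
    "urban": [80, 60, 75, 50, 90],
}
kept = drop_first_of_each_pair(toy)
print(sorted(kept))
```

In the toy data only the two poverty variables correlate above the threshold, so the first of them is dropped and the other two variables are kept.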

After exclusion of these variables, there were 81 remaining.

Factor analysis

Next I extracted a general factor from the data. Since one previous study had found instability across extraction methods when extracting factors from datasets with more variables than cases (2), I examined the stability across all possible combinations of extraction and scoring methods, 30 in total (6 extraction methods × 5 scoring methods). 11 of these 30 combinations did not result in an error, though they gave warnings. There was no loading instability or scoring instability across methods: all correlations were > .996.¹ I saved the results from the minres+regression combination.

Inspection of the loadings revealed no important variables with the ‘wrong loading’ i.e., either a desirable outcome but with a negative loading or an undesirable outcome with a positive loading. Some variables are debatable. E.g. binge drinking in adults has a loading of .566, but this could be seen as a good thing (sufficient free time and money to spend it drinking large quantities of alcohol), or a bad thing (binge drinking is bad for one’s health). Figure 1 shows the loadings plot.

Figure 1: Loadings on the S factor. Some variable names were too long and were cut at the 40th character. Consult the main data file to see the full name.

Factor scores

The extracted factor scores were compared with previously obtained similar measures:

  • HDI2010 scores calculated from HDI2002 scores found in (16).
  • Measure of America’s own American Human Development Index found in the dataset.
  • The S factor scores from the previous study of US states (7).

The correlation matrix is shown in Table 1.

Table 1: Correlation matrix of S and HDI scores. Weighted correlations below the diagonal (sqrt of population).

The correlation between the previously obtained S factor and the new one was very strong at .961. The two different HDI measures had the lowest correlation. This is the expected result if they are the worst approximations of the S factor. Note however that the HDI2010 is rescaled from 2002 data, whereas the AHDI and current S factor are based on 2010 data. The previous S factor is based on data from approximately the last 10 years that were averaged.
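
For readers unfamiliar with weighted correlations, here is a minimal sketch of a Pearson correlation with case weights, using the square root of population as the weight as in Table 1. The function name and all numbers are made up for illustration; the actual analysis code is in the supplementary material and may differ.

```python
from math import sqrt


def weighted_corr(x, y, w):
    """Pearson correlation of x and y with non-negative case weights w."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / sqrt(vx * vy)


# Made-up scores and populations for four hypothetical states.
s_new = [0.5, -1.0, 1.2, 0.1]
s_old = [0.4, -0.8, 1.0, 0.3]
population = [5_000_000, 700_000, 20_000_000, 1_200_000]
weights = [sqrt(p) for p in population]
r_w = weighted_corr(s_new, s_old, weights)
print(round(r_w, 3))
```

Taking the square root dampens the influence of the largest states compared with weighting by raw population, while still giving them more say than the smallest ones.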


Finally, factorial mixedness was examined using two methods detailed in a previous paper (17). In short, mixedness is when cases are incongruent with the overall factor structure found for the data. The methods showed convergent results (r = .65). Figure 2 shows the results.

Figure 2: Factorial mixedness in cases.

If one were doing a more detailed study, one could examine the residuals at the case level to see why an outlier state is an outlier. In the case of Alaska, the residuals for each variable are shown in Table 2.

Table 2: Residuals per variable for Alaska.

The numbers mean the following: each is the number of standard deviations by which Alaska lies above or below its expected level on that variable, given its score on the S factor (-.24). We see that the state of Alaska spends much more on transportation per person than expected (more than 6 standard deviations). This is presumably because it is located very far north compared to the other states and has the lowest population density. It also uses more energy per citizen, again presumably related to the climate. I’m not sure why rape is so common, however.
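
One way such case-level residuals can be computed is sketched below. This is an assumption about the general approach (standardize the variable, regress it on the S scores, keep what is left over), not a copy of the actual code in the supplementary material; the function names and numbers are made up.

```python
from math import sqrt


def zscores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def residuals_given_s(variable, s_scores):
    """Residual of each case, in SD units of the variable, after a simple
    linear regression of the standardized variable on the standardized S."""
    zv, zs = zscores(variable), zscores(s_scores)
    # With both sides standardized, the regression slope equals Pearson's r.
    r = sum(a * b for a, b in zip(zv, zs)) / len(zv)
    return [a - r * b for a, b in zip(zv, zs)]


# Made-up numbers: five hypothetical states, the last one an outlier.
s_scores = [-1.0, -0.5, 0.0, 0.5, 1.0]
transport_spending = [100, 110, 120, 130, 400]
res = residuals_given_s(transport_spending, s_scores)
print([round(v, 2) for v in res])
```

A large positive residual means the case scores higher on the variable than its S score would predict, and a large negative one means it scores lower.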

One could examine the other outlier states in a similar fashion, but this is left as an exercise to the reader.

Discussion and conclusion

The present analysis used a much larger dataset (81 very diverse variables) than the previous study of the S factor in US states, which used 25, yet the findings were almost identical (r = .961). This probably means that the S factor can be measured very reliably when an appropriate number and diversity of socioeconomic variables are used. It should be noted, however, that many of the variables in the two datasets overlapped in content, e.g. expected life span at birth.

Supplementary material

Data files and source code are available on OSF.


1. Kirkegaard EOW. The international general socioeconomic factor: Factor analyzing international rankings. Open Differ Psychol [Internet]. 2014 Sep 8 [cited 2014 Oct 13]; Available from:

2. Kirkegaard EOW. Examining the S factor in Mexican states. The Winnower [Internet]. 2015 Apr 19 [cited 2015 Apr 23]; Available from:

3. Kirkegaard EOW. S and G in Italian regions: Re-analysis of Lynn’s data and new data. The Winnower [Internet]. 2015 Apr 23 [cited 2015 Apr 23]; Available from:

4. Kirkegaard EOW. The S factor in the British Isles: A reanalysis of Lynn (1979). The Winnower [Internet]. 2015 Mar 28 [cited 2015 Apr 23]; Available from:

5. Kirkegaard EOW. Indian states: G and S factors. The Winnower [Internet]. 2015 Apr 23 [cited 2015 Apr 23]; Available from:

6. Kirkegaard EOW. The S factor in China. The Winnower [Internet]. 2015 Apr 23 [cited 2015 Apr 23]; Available from:

7. Kirkegaard EOW. Examining the S factor in US states. The Winnower [Internet]. 2015 Apr 23 [cited 2015 Apr 23]; Available from:

8. Kirkegaard EOW. The S factor in Brazilian states. The Winnower [Internet]. 2015 Apr 30 [cited 2015 May 1]; Available from:

9. Kirkegaard EOW. The general socioeconomic factor among Colombian departments. The Winnower [Internet]. 2015 Jun 16 [cited 2015 Jun 16]; Available from:


11. Kirkegaard EOW. An S factor among census tracts of Boston. The Winnower [Internet]. 2015 Jun 2 [cited 2015 Jun 2]; Available from:

12. Kirkegaard EOW, Tranberg B. What is a good name? The S factor in Denmark at the name-level. The Winnower [Internet]. 2015 Jun 4 [cited 2015 Jun 6]; Available from:

13. Kirkegaard EOW. Crime, income, educational attainment and employment among immigrant groups in Norway and Finland. Open Differ Psychol [Internet]. 2014 Oct 9 [cited 2014 Oct 13]; Available from:

14. Kirkegaard EOW, Fuerst J. Educational attainment, income, use of social benefits, crime rate and the general socioeconomic factor among 71 immigrant groups in Denmark. Open Differ Psychol [Internet]. 2014 May 12 [cited 2014 Oct 13]; Available from:

15. Revelle W. psych: Procedures for Psychological, Psychometric, and Personality Research [Internet]. 2015 [cited 2015 Apr 29]. Available from:

16. Stanton EA. Inequality and the Human Development Index [Internet]. ProQuest; 2007 [cited 2015 Jun 25]. Available from:

17. Kirkegaard EOW. Finding mixed cases in exploratory factor analysis. The Winnower [Internet]. 2015 Apr 28 [cited 2015 May 1]; Available from:


1 The factor analysis was done with the fa() function from the psych package (15). The cross-method check was done with a home-made function, see the supplementary material.

Since James Thompson is posting statistics, here are some for comparison.

Note that the statistics for this covers all sites hosted by this server, so that includes: both Danish and English blogs,,, Understanding Statistics, as well as a host of other subsites that can be found via the old front page. Note that the large traffic is due to the PDFs hosted on the site. Lots of visitors never really visit the site, just download the PDFs — fine by me — but it inflates the statistics.

Click the image, then click download. The images are huge, not small, and cannot be shown on one screen.

2015, so far

Upon reading about the obscene costs of journals, e.g. in this post, I decided to write to my local university library and ask. They responded with this:

(Translated from Danish.)

Hi Emil

You sent us a question:

I am interested in finding out how much money AU spends each year on buying access to academic journals. I have looked through some budgets but found nothing.

Do you know where one can find that information?

And the question has been passed on to me, since it is probably not something that is published as a figure of its own. I can tell you that AU and SB, together as AU Library, spend about 45 million kroner per year on electronic resources, of which e-journals make up the large majority; on top of that come databases, encyclopedias and more ordinary e-books.

Kind regards


Lilian Madsen

Area Director, Process Area

If we assume that the large majority means 90%, then the number is 40.5 million DKK a year. A DKK is about .15 USD, so this is about 6.075 million USD. Put this together with the number of students, currently 43,600, and one can calculate a cost per student per year. This value is about 929 DKK or about 139.3 USD per student per year. If you think about it, this is a crazy price to pay. The marginal cost per extra student is near zero.
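
The arithmetic can be laid out explicitly. The 90% share is the assumption stated above, the exchange rate is the rough one used in the text, and the variable names are mine.

```python
total_dkk = 45_000_000      # yearly spend on electronic resources (from the reply)
journal_share = 0.90        # assumption: "the large majority" taken as 90%
dkk_to_usd = 0.15           # approximate exchange rate used in the text
students = 43_600           # number of students given in the text

journals_dkk = total_dkk * journal_share
journals_usd = journals_dkk * dkk_to_usd
per_student_dkk = journals_dkk / students
per_student_usd = journals_usd / students

print(f"journals: {journals_dkk:,.0f} DKK = {journals_usd:,.0f} USD")
print(f"per student: {per_student_dkk:.0f} DKK = {per_student_usd:.1f} USD")
```

This reproduces the per-student figures above to within rounding. Changing `journal_share` shows how sensitive the result is to what "the large majority" means.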

Aarhus is currently the largest university in Denmark, but it is about the same size as Copenhagen which has 40,866 students.

We present and analyze data from a dataset of 2358 Danish first names and socioeconomic outcomes not previously made available to the public (“Navnehjulet”, the Name Wheel). We visualize the data and show that there is a general socioeconomic factor with indicator loadings in the expected directions (positive: income, owning your own place; negative: having a criminal conviction, being without a job). This result holds after controlling for age and for each gender alone. It also holds when analyzing the data in age bins. The factor loading of being married depends on analysis method, so it is more difficult to interpret.

A pseudofertility is calculated based on the population size for the names for the years 2012 and 2015. This value is negatively correlated with the S factor score, r = -.35 [95CI: -.39; -.31], but the relationship seems to be somewhat non-linear and there is an upward trend at the very high end of the S factor. The relationship is strongly driven by relatively uncommon names that have high pseudofertility and low to very low S scores. The n-weighted correlation is -.21 [95CI: -.25; -.17]. This dysgenic pseudofertility was mostly driven by Arabic and African names.

All data and R code are freely available.



It has been noted that good outcomes tend to go together, but to our knowledge, the factor structure of such relationships had not been examined until recently (Kirkegaard, 2014c). When it has been examined, it has repeatedly been found that there is a general socioeconomic factor on which good outcomes nearly always have positive loadings and bad outcomes have negative loadings.¹ Recent studies have examined S factors at the national, regional/state and country of origin-level; see Kirkegaard (2015c) for a review of regional/state-level studies, and Kirkegaard (2014a) for country of origin-level studies. In this paper we exploit a unique dataset to examine the S factor at the name-level in Denmark.

The dataset
Last year the Danish newspaper Ugebrevet A4 published an interactive infographic called “Navnehjulet” (“the Name Wheel”). It’s simple: you just enter a first name and it shows you some numbers about that name. The data was initially bought from Statistics Denmark and is based on 2012 data. There is no option available to download the dataset. A screenshot of the Name Wheel is shown in Figure 1.

Figure 1: A screenshot of the Name Wheel with “Emil” entered.

The more technical aspects of the scraping (“automatic downloading of the data”) are covered elsewhere (Tranberg, 2015); here we focus on the data and the statistical analyses.

The statistical information shown for each name varies (presumably due to data availability), but in the cases with full data, it includes:

  1. Number of persons with the name.
  2. 3 most common job types.
  3. 3 most common living areas.
  4. Average age.
  5. Percentages who rent and who own their home. Note that these do not always sum to 100%.
  6. Percentage with at least one conviction in the last 5 years.
  7. Average monthly income in DKK.
  8. Marital status (married, cohabiting, registered partner², single).
  9. Employee rate.
  10. Student rate.
  11. Outside the job market rate.
  12. Independent rate.
  13. Unemployment rate.
  14. Chief executive rate.

Of note is that the unemployment variable includes only those who spent at least half the year without work or who received dagpenge (a kind of unemployment benefit). The outside the job market variable includes heterogeneous groups: førtidspensionister (pre-time retirees), folkepensionister (ordinary retirees), efterlønsmodtagere (another type of pre-time retirement), kontanthjælpsmodtagere (another type of unemployment benefit), and andre (others). As such, this last variable is a mixture of situations that are normal (ordinary pension, efterløn) and some which are used by unproductive members of society (førtidspension, kontanthjælp). Thus, interpretation of that variable is not straightforward. There is a more detailed description of the variables available at the website. We have taken a copy of this in case the site goes down (see supplementary material; in Danish).

We downloaded the data for all variables for each of the 2358 names in the database. The gender of the names was usually not marked, but because they were sorted by gender, we could easily assign them genders. The gender distribution is 1266 females and 1092 males, or 54% female. This is a higher female percentage than in the actual population (50.3%³). This seems to be due to females simply having a greater diversity of names. Table 1 shows the top 20 most common names by gender.



[Table body not recovered. Column headers: Name (F), Name (M).]

Table 1: Top 20 most common names by gender.

As can be seen, the top 20 most common female names have a smaller sum than the male sum, by 21%.

A few names had their gender marked because they were unisex names. Such names were quite rare (36 pairs).

Missing data

There is quite a bit of missing data: 20% of names have at least some missing data. For this reason we examined the distribution of missing data to see if some of it could fruitfully be imputed (Donders, van der Heijden, Stijnen, & Moons, 2006). The matrix plot is shown in Figure 2.⁴


Figure 2: Matrix plot for missing data.

Note: Not all cases are shown due to insufficient resolution of the image.

We see that data is not missing at random but that some cases tend to have a lot of missing data. We also see that some variables have no missing data (unisex, number, age, conviction).

Which kind of cases have missing data? It cannot be seen from the above, but the missingness is strongly related to the number of persons with that name, which is not surprising. The data is limited to names where there are 100 or more persons. To see the relationship, we sort the data by number of persons and replot the matrix plot; Figure 3.


Figure 3: Matrix plot of missing data, cases sorted by number of persons with the name.

Another way to examine missingness is to examine the distribution of cases by their number of missing datapoints. A histogram of this is shown in Figure 4.


Figure 4: Histogram of cases by number of missing datapoints.

While about 20% of the cases have 13 missing datapoints, a small number of cases (71) have only 2 missing datapoints. These can be imputed to slightly increase the sample size.
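
The bookkeeping behind this step is simple: count the missing datapoints per case, tabulate the counts, and separate the cases that are complete enough to impute from the rest. The sketch below is in Python rather than the R actually used, with made-up cases and illustrative field names; the cutoff of 2 missing datapoints is the one described in the text.

```python
from collections import Counter


def missing_count(row):
    """Number of missing datapoints (None) in one case."""
    return sum(value is None for value in row.values())


def split_by_missingness(rows, max_missing=2):
    """Split cases into those complete enough to impute and those to exclude."""
    keep = [r for r in rows if missing_count(r) <= max_missing]
    drop = [r for r in rows if missing_count(r) > max_missing]
    return keep, drop


# Made-up cases; the field names are illustrative, not the dataset's own.
rows = [
    {"name": "A", "income": 25000, "married": 40.0, "student": 3.1},
    {"name": "B", "income": None, "married": None, "student": 2.0},
    {"name": "C", "income": None, "married": None, "student": None},
]
histogram = Counter(missing_count(r) for r in rows)
keep, drop = split_by_missingness(rows)
print(dict(histogram), [r["name"] for r in keep], [r["name"] for r in drop])
```

The actual imputation of the kept cases is a separate step and is not shown here.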

Getting an overview of the data
Before running numerical analyses on data, it is important to get a solid overview of it. This is because one can rapidly identify patterns by eye that may go unnoticed by numerical analyses. For instance, relying on correlations can miss important non-linear patterns, which can easily be identified by eye if the data are plotted using a moving average or similar (Lubinski & Humphreys, 1996).

The classic example of this is Anscombe’s quartet, 4 bivariate datasets which have (almost) the same mean of x and y, variance/standard deviation, correlation and regression coefficients (intercept and slope). However, plotting the data reveals that they are very different.
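
To see this concretely, the summary statistics of the four datasets can be computed directly. The sketch below uses the published quartet values and plain Python; the helper name `summary` is mine.

```python
from math import sqrt

# Anscombe's (1973) four datasets.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summary(x, y):
    """Mean of x, mean of y, Pearson r, and OLS slope and intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return mx, my, sxy / sqrt(sxx * syy), slope, my - slope * mx


summaries = {name: summary(x, y) for name, (x, y) in quartet.items()}
for name, stats in summaries.items():
    print(name, [round(v, 2) for v in stats])
```

All four rows print essentially the same numbers, even though one dataset is roughly linear, one is curved, one is a straight line with a single outlier, and one has all but one point at the same x value.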

Histograms are the easiest way to get a quick overview of the data structure. We plot selected histograms in Figures 5-8. The rest are available in the supplementary material.


Figure 5: Histogram of number of persons per name. Note that the x-axis is log-scale.

We see a power law distribution in that most of the names are borne by only a few persons, while a few are borne by many thousands. The top 20 by gender were shown in Table 1. The mean and median number of persons per name are 2209 and 316, respectively. Since the data is limited to names with at least 100 persons, showing the least common names is not particularly interesting. The curious reader can consult the supplementary material (results/number_ranks.csv).


Figure 6: Histogram of ages.

The distribution of the mean age of names is a fat normal distribution. Top 5 youngest: Elliot, Milas, Noam, Storm, Mynte (MMMMF); oldest: Valborg, Hertha, Dagny, Magna, Erna (all F).


Figure 7: Histogram of incomes.

The income distribution is fairly normal with a long right tail. Presumably, a few very rich people with uncommon names result in those names having very high incomes. The top scorers are: Renè (M), Leise (F), Frants (M), Heine (M) and Thorleif (M). The bottom scorers are dominated by names whose bearers are very young and thus have very low incomes, e.g. Alberte (mean age 8, mean income 4893 DKK). These are of little interest so we shall not mention them.


Figure 8: Histogram of mean convictions past 5 years.

It is clear that some names are much more criminal than others. The top scorers are: Alaa, Ferhat, Walid, Rachid, Fadi (all male). The female top scorer is Vesna (top #51). These names are all foreign, mostly Arabic, except the female name, which is Slovenian. This result is expected because persons from Muslim countries are highly overrepresented in crime statistics (Kirkegaard & Fuerst, 2014).

Variables by age and gender
Since the mean age of the names has central importance to the other variables (e.g. income) and since gender is a suitable dichotomous variable, we plot the other variables by age and gender. These are shown in Figures 9 to 17.


Figure 9: Income by age and gender.

We see the familiar pattern in that men earn more money than women. The difference is stable until about age 45 where it increases. Interpretation is difficult because the data is cross-sectional, not longitudinal and hence there are both age and cohort differences between the names. Still, one would expect something to happen at about that age that increases the difference.


Figure 10: Convictions by age and gender.

It is well-known that crime tends to be committed by younger males, and we see the same pattern here. Recall that this is the percentage of persons with the name who have at least one conviction in the last 5 years. Thus, it has a bit of lag, which is probably why it is fairly high even for men in their 40s: they could have gotten their conviction at age 35.


Figure 11: Being outside the job market by age and gender.

This variable is the odd one comprising both regular pensions as well as some unemployment benefits and other benefits given to people who cannot/won’t work (e.g. who had a work accident, have severe psychological problems, are just lazy). As expected, it goes up heavily with age as people go on pension.


Figure 12: Being independently employed by age and gender.

There are known gender differences in rates of self-employment, and we see them here as well at all ages. Self-employment seems to increase a bit over the lifespan, reaching its maximum perhaps around 45-50.


Figure 13: Marital status by age and gender.

This one is interesting in that it has an odd pattern at old age. Our guess is that the men who are married tend to live longer which explains the male pattern, while the female pattern is explained by the fact that women live longer than men and their husbands die off before them, leaving them widowed (unmarried). In discussion with EOWK, A. J. Figueredo suggested that it may be due to serial monogamy. Simply put, some men divorce their aging wives and marry a younger one. This would tend to keep men married at older ages as well as decreasing the marriage rate of older women.


Figure 14: Owning a home by age and gender.

This one is odd in that around age 30 more women have their own home, but men catch up later. One could think of it as men making an earlier investment of their resources into career, while women are more interested in getting a home. And when men’s careers get going at age 45 and above, they acquire their homes. Again, due to the cross-sectional data, it is difficult to say.


Figure 15: No job by age and gender.

This is the variable for ‘pure’ unemployment. The gender difference is only slight at early to mid ages, while it reverses in direction at older ages. It is somewhat odd that it is highest around age 35.


Figure 16: Being a student by age and gender.

Girls and women generally acquire more formal education than boys and men, and we see it in the data here as well.


Figure 17: Being an executive by age and gender.

Finally, there is a clear gender and age pattern in being an executive. Males are more likely at all ages, but there is an increase around age 45, especially for men. This is presumably the explanation of the pattern seen for income in Figure 9.

Is there an S factor among names?
Some of the variables are (almost) linearly dependent on each other. The percentages who own and who rent their home sum to nearly 100, so using both in an analysis would perhaps cause problems. The same is true for the 4 civil status variables (married, cohabiting, reg. partnership, single), and the 6 employment variables (no.job, employee, outside the job market, student, independent, executive). To be safe, one should probably not pick more than one from each of these three sets.

To do a factor analysis we must however pick some of them. We decided on the following: no.job, owning one’s home, married, conviction and income. The expectation is that no.job and conviction will have negative loadings, while home ownership and income will have positive, and marriage perhaps somewhat positive (Herrnstein & Murray, 1994).

What we want to measure is the general socioeconomic status factor (if it exists). However, gender can disrupt the analysis. This is because men earn more money but are also more criminal. This may lead to gender specific variance, which is error in the factor analysis. One could regress out the effect of gender, but we instead divide the dataset into two which also allows for easier interpretation of the results.

Age has a strong influence on the variables which may disrupt results. For instance, a very young name will have lower income and a low conviction rate, which will result in high mixedness (Kirkegaard, 2015b). For this reason, we use both the original variables for analysis and a version of them where the effect of age has been partialed out. To do this, we regress every variable on age, age² and age³ and use the residuals.
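As a rough illustration of what partialing out age with a cubic polynomial can look like, here is a minimal sketch. The function name and the use of NumPy are our assumptions for illustration, not the code in the supplementary material. Age is standardized before building the polynomial terms purely for numerical stability; this does not change the residuals.

```python
import numpy as np

def partial_out_age(values, age):
    """Return the residuals of `values` after regressing on age, age^2 and age^3.

    Hypothetical helper for illustration; not taken from the original analysis code.
    """
    values = np.asarray(values, dtype=float)
    age = np.asarray(age, dtype=float)
    # Standardize age so that the cubic terms stay well conditioned.
    z = (age - age.mean()) / age.std()
    # Design matrix: intercept, linear, quadratic and cubic terms.
    X = np.column_stack([np.ones_like(z), z, z**2, z**3])
    coefs, *_ = np.linalg.lstsq(X, values, rcond=None)
    # The residuals are the age-adjusted version of the variable.
    return values - X @ coefs
```

By construction the residuals have zero linear correlation with age, which is the pattern one would expect to see in the age-partialed part of a correlation matrix.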

Some cases had some missing datapoints (refer back to Figure 2). We imputed the cases with 2 or fewer missing datapoints and excluded the rest.
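The exclusion rule can be sketched as follows. Note that the text does not say which imputation method was used, so the column-mean imputation below is only a placeholder assumption, and the function name is hypothetical.

```python
import numpy as np

def filter_and_impute(data, max_missing=2):
    """Keep rows with at most `max_missing` missing values and fill the gaps.

    Column-mean imputation is a stand-in; the actual method is not stated in the text.
    """
    data = np.asarray(data, dtype=float)
    # Count missing datapoints per case (row).
    keep = np.isnan(data).sum(axis=1) <= max_missing
    kept = data[keep].copy()
    # Fill each remaining gap with the mean of its column among the kept cases.
    col_means = np.nanmean(kept, axis=0)
    rows, cols = np.where(np.isnan(kept))
    kept[rows, cols] = col_means[cols]
    return kept, keep
```

The returned boolean mask makes it easy to see which cases were excluded.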

Correlation matrices

Before looking at the factor analysis results, we will look at the correlation matrices by gender and together, as well as with and without partialing out the effects of age; Tables 2-4.

Table 2: Correlation matrix of S variables for both genders. Above diag., age partialed out.

Table 3: Correlation matrix of S variables for men. Above diag., age partialed out.

Table 4: Correlation matrix of S variables for women. Above diag., age partialed out.

Below the diagonal, one can see that the linear effect of age is often substantial, while above the diagonal, the linear effect of age is zero, meaning that generally the partialization worked, at least linearly speaking. Generally, the relationships were similar across gender. There are some exceptions. To make them easier to see, Table 5 shows the delta (difference) correlation matrix.

Table 5: Delta correlation matrix for genders. Higher means men’s correlations are stronger.
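For readers who want to build matrices with this layout themselves, here is a minimal sketch. The function names are hypothetical, and the sign convention for the delta matrix (men minus women) is our assumption based on the caption.

```python
import numpy as np

def combined_corr(raw, adjusted):
    """Correlations of `raw` below the diagonal, of `adjusted` above it."""
    r_raw = np.corrcoef(raw, rowvar=False)
    r_adj = np.corrcoef(adjusted, rowvar=False)
    # Take the strict lower triangle from one matrix and the strict upper from the other.
    out = np.tril(r_raw, k=-1) + np.triu(r_adj, k=1)
    np.fill_diagonal(out, 1.0)
    return out

def delta_corr(men, women):
    """Difference between two correlation matrices (assumed: men minus women)."""
    return np.corrcoef(men, rowvar=False) - np.corrcoef(women, rowvar=False)
```

Both functions expect cases in rows and variables in columns, with the same variable order in both inputs.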

The largest difference for the age-partialed data is the relationship between being married and having no job (recall that this does not include those pensioned). Among female names, there is a strong relationship between unemployment and being married. This is perhaps because women are more often reliant on their husbands (being homemakers) than the reverse, although both correlations were positive. It could also have something to do with Muslim immigrants (about 10% of the population) who are often married and where a large fraction of the women are unemployed.

Factor analyses

The loadings plots are shown in Figure 18.


Figure 18: Loadings plot for factor analyses.

The factors were not particularly strong, as shown in Table 6.


Table 6: Variance explained by S factors.

The factors decreased in size after correcting for age, which could be because age was inflating the factor size, or because the correction was too strong. The gender difference in the marriage indicator is strong: about 0 vs. about .5 after age correction. Notice that the has loadings near 1, so the S factor is about equal to variable in these datasets. It is probably an indicator sampling error that would be corrected if more indicators of greater diversity were available.5 Some previous S factor studies have found the same when only a few indicators were used, e.g. Kirkegaard (2015a, first analysis).

Still, the factor loadings are in the expected directions for all variables in all analyses.

Given the similar factor loadings, one would also expect the extracted factor scores to be similar, which Table 7 shows them to be.

Table 7: Correlations between S factors across analyses.

Note: The apparently missing values are because the data does not overlap. There are no scores for men in the S factor analyses with only women.

Using age bins instead

In the above analyses, we have analyzed data for all ages both with and without partialing the effects of age out. However, age may be insufficiently dealt with by the chosen correction method, and its effect may be so strong that not correcting for it also leads to spurious results. Hence we employed a third method, that of age bins. The dataset is large enough that we can split it up into age groups, say 20-25, and analyze each subgroup separately. While this does not entirely remove the age effect, it is less likely to introduce spurious over-correction effects.

Concretely, we analyzed subgroups within 5-year brackets, starting at age 20-25 and stopping at age 50-55. We do this for both genders together, and separately. The analysis procedure is the same as above, namely extracting the general factor, examining the loadings and the factor sizes. Figures 19-21 show the factor loadings by age bin for both genders together, and each separately.


Figure 19: Factor loadings by age bins, both genders together


Figure 20: Factor loadings by age bins, males only


Figure 21: Factor loadings by age bins, females only
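The binning procedure can be sketched roughly as below. To keep the example dependency-free, it uses the first principal component of the correlation matrix as a stand-in for the general factor; this is a simplification and not necessarily the extraction method used for the figures. All names are hypothetical, and the rule for which side of a bin edge a name falls on is an assumption.

```python
import numpy as np

def first_component_loadings(data):
    """Loadings on the first principal component and its share of variance."""
    r = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(r)  # eigenvalues in ascending order
    loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])
    # The sign of a component is arbitrary; orient it so the loadings sum to >= 0.
    if loadings.sum() < 0:
        loadings = -loadings
    return loadings, eigvals[-1] / eigvals.sum()

def loadings_by_age_bin(data, age, edges):
    """Run the analysis separately within each [lo, hi) age bracket."""
    data = np.asarray(data, dtype=float)
    age = np.asarray(age, dtype=float)
    results = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (age >= lo) & (age < hi)
        # Skip bins with too few cases to estimate a correlation matrix.
        if mask.sum() > data.shape[1]:
            results[(lo, hi)] = first_component_loadings(data[mask])
    return results
```

With edges 20, 25, ..., 55 this gives the seven brackets from 20-25 to 50-55.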

The most conspicuous finding is the marriage loadings which are now negative! Apparently, the positive loadings from before were an age confound. The exception is the last two age groups where the marriage indicator is positive, especially for the last group. The odd finding that for 50-55 year olds, crime has a loading around 0 is presumably sampling error as well as reflecting the fact that crime among people in their 50s is fairly rare. When the base rate is low, correlations become weaker and factor loadings are based on the correlation patterns in the data (Ferguson, 2009). The sample sizes are not terribly impressive, 126 to 257, and are smallest for the last two groups. The single-gender samples are about half that size.

For the male data, the marriage loadings are about 0. The two last age bins are again positive. The other four loadings are somewhat stronger in males with criminality actually having stronger (negative) correlations than unemployment. This is presumably because crime is more common among males which means the correlations are stronger.

Finally, for the female data, marriage loadings are more strongly negative except for the last two age bins, same as with the male data.

Figure 22 shows the factor strength by age bin and gender, together and separate.


Figure 22: Factor sizes by age bin and gender

Generally the male-only analyses had the strongest S factors (6/7), with the female-only analyses being above the one with both (5/7). One might interpret this as being due to the lower base rate of crime making the correlations with the crime variable smaller for females which makes the factor size smaller. The mixed-gender analyses usually had smaller factors, perhaps because of the mixedness that results from this as discussed earlier.

Pseudofertility and the S factor

Since the Name Wheel data contains the count of persons with each name in 2012, if we could find some data for a later year for the same names, we could calculate a name-wise ‘fertility’, which we shall call pseudofertility. It is the growth (or decrease) in number of persons with each name in Denmark. This may be due to actual births, immigration or name-changes. This pseudofertility can then be compared to the S factor score for each name to see if there is any relationship. A somewhat negative relationship is expected due to low S immigrant names increasing their number via higher than average fertility (at least in the first generation; Kirkegaard, 2014b) and immigration.

The Danish Statistics agency (Danish Statistics) maintains a web page where one can look up any first or last name and see how many people have that name in the current year and last year. Using a method similar to that used to scrape the data from the Name Wheel, we scraped the count data for the years 2014 and 2015 for every name in our dataset. From these data, we calculated the pseudofertility as the fractional increase (or decrease) of each name over both the period 2012-2015 and 2014-2015. The first should give a more reliable number since it’s over a few years as opposed to the second which is over 1 year only. Their correlation is .95 (no outliers), so reliability was very high.
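As a small worked example of the fractional increase, consider the sketch below. The function name and the counts in the comments are made up for illustration and are not taken from the dataset.

```python
def pseudofertility(count_start, count_end):
    """Fractional change in the number of persons with a name between two years."""
    if count_start <= 0:
        raise ValueError("count_start must be positive")
    return (count_end - count_start) / count_start

# Hypothetical counts: a name going from 200 to 250 persons grew by 25%,
# while a name going from 100 to 90 persons shrank by 10%.
growth = pseudofertility(200, 250)
decline = pseudofertility(100, 90)
```

A value of 0 means the number of persons with the name did not change over the period.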

Figure 23 shows the scatter plot of pseudofertility 2012-2015 and S factor score (age adjusted, both genders together).


Figure 23: Pseudofertility 2012-2015 and S factor scores. Point sizes are proportional to the number of persons with the name.

Overall, there is a medium-sized negative relationship, r = -.35 [95CI: -.39; -.31], between pseudofertility and S factor score (age-controlled). As can be seen in the plot, this is mainly due to the names left of 0 S (the below average). There appears to be an upward trend at the other end, but there are relatively few datapoints, so it may be a fluke. The point sizes show that the names creating the trend are relatively uncommon (few people have those names, relatively speaking). The largest names cluster around S [0-1.5]. For this reason, we also calculated the weighted correlation which is -.21 [95CI: -.25; -.17], so the effect is still reliable but substantially smaller as expected from the inspection of the plot.
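A weighted Pearson correlation can be computed along these lines. This is a minimal sketch with a hypothetical function name; we assume here that the weights are the number of persons with each name, in line with the point sizes in the plot.

```python
import numpy as np

def weighted_corr(x, y, w):
    """Pearson correlation of x and y with case weights w."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    # Weighted means.
    mx = np.average(x, weights=w)
    my = np.average(y, weights=w)
    # Weighted covariance and variances around those means.
    cov = np.average((x - mx) * (y - my), weights=w)
    vx = np.average((x - mx) ** 2, weights=w)
    vy = np.average((y - my) ** 2, weights=w)
    return cov / np.sqrt(vx * vy)
```

With equal weights this reduces to the ordinary correlation, so any difference between the two numbers reflects how much the rare names drive the unweighted result.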

We plotted the figure in very high resolution using vector graphics so that one can zoom in on any given region. The reader can examine the pseudofertility_names.svg file in the supplementary material to explore the figure. Looking at the names in the region creating the negative slope reveals them to be almost exclusively immigrant names from Arabic or African countries, e.g.: Mohammad, Hossein, Mostafa, Sayed, Malika, Mana, Slawomir, Omar (names from the region north of the moving average near S = -1.5). Unfortunately the dataset does not contain information about the immigration status of each name, so we could not exclude all of them and see if the ‘dysgenic’ relationship holds without immigrants.

Thus, the name data reveals a small ‘dysgenic’ effect on S in line with modeling by Kirkegaard & Tranberg (2015). If the trend were to continue, and assuming that everything else is equal, then the average level of socioeconomic status would fall in Denmark and there would be increasing socioeconomic inequality.

Discussion and conclusion
Despite being a new level of analysis (at least to us), the results were generally in line with those from more ‘traditional’ country, regional/state-level and origin country-level analyses.

This dataset contained first names, but one could also analyze last names which are more familial in nature. Such data was not available at the Name Wheel website, but it could probably be acquired from the statistical agency if one is willing to pay.

The dataset is especially useful for researchers wishing to investigate the (in)accuracy of stereotypes of names, see e.g. (Jussim, Cain, Crawford, Harber, & Cohen, 2009; Jussim, 2012).


As mentioned earlier, the data are an odd kind of cross-sectional data which makes it difficult to infer causality. A given difference observed between names with a mean age of 20 and 40, could be either an effect of age (being 20 versus 40), a cohort effect (being born in 1995 versus 1975), or something more complicated.

The mean age of the names is tricky to interpret since the distribution of age of persons with the name is not shown. This could be a normal distribution if the name was fashionable at some point but then faded out. However, it could also be bi-modal. For instance, if a name was fashionable in 1965 and in 1995, there would be two groups of persons. One aged about 50 and one aged about 20. If they are about evenly distributed the mean age of the name would be about 35 despite few people with the name being that age.

Aside from the extra population data from Danish Statistics, the dataset only has data from one year (2012). It would be better if data for more than one year were available, both to avoid fluke effects and to examine e.g. the effects of macroeconomics on the relationships between the variables.

To our knowledge, this is a new kind of grouped data and so methods for analyzing it have not been well-tested. This should give some extra caution about the inferences drawn from it.

Supplementary material

Data, source code and figures are available at the Open Science Framework repository.



1 Note that sometimes a factor is reversed such that the good outcomes have negative loadings, and the bad outcomes have positive loadings. This reversing is quite arbitrary and depends on the balance of good and bad variables included in the analysis. A preponderance of bad variables means that the factor will be reversed. If the factor is thus reversed, one can just multiply all loadings by -1 to unreverse it.

2 This is a pre-2012 category as an alternative to marriage for same sex couples. One can no longer attain this legal status, but one can retain it if one acquired it before 2012. See (Danish).

4 This plot is made using the matrixplot() function from the VIM package (Templ, Alfons, Kowarik, & Prantner, 2015). The 5 character/string variables are left out because due to a bug in the function, such variables are always shown as missing all data, whereas in fact in this case none of them had any missing data.

5 Indicator sampling error is meant to be a generalized version of Jensen’s “psychometric sampling error”, see e.g. (Kranzler & Jensen, 1991).

A friend of mine and his brother just received their 23andme results.



In a table they look like this (I have added myself for comparison):

Macrorace                           Bro1  Bro2  Emil
European                            52.6    53  99.8
MENA                                42.5  41.3   0.2
South Asian                          2.8   3.4     0
East Asian & Amerindian              1.1   0.7     0
Sub-Saharan African                  0.5   0.5     0
Oceanian                             0.5     0     0
Unassigned                             0   1.1   0.1
Sum                                  100   100 100.1

Mesorace                            Bro1  Bro2  Emil
Northern                            51.5  51.5  91.3
Southern                               1   1.2     0
Ashkenazi                            0.1     0   2.9
Eastern                                0     0     4
Common European                      0.1   0.4   1.5
Middle Eastern                        42  40.8     0
North African                        0.3   0.2   0.2
Common MENA                          0.2   0.3     0
South Asian                          2.8   3.4     0
East Asian & Amerindian
  East Asian                         0.7   0.4     0
  Southeast Asian                    0.2     0     0
  Amerindian                           0   0.1     0
  Common East Asian & Amerindian     0.1   0.1     0
Sub-Saharan African
  East                               0.3   0.3     0
  West                               0.2   0.4     0
  Central & South                      0     0     0
  Common Sub-Sahara African          0.1   0.1     0
Oceanian                               0     0     0
Unassigned                           0.5   1.1   0.1
Sum                                100.1 100.3   100

Microrace                           Bro1  Bro2  Emil
Scandinavian                        21.3  24.2  37.3
French & German                     10.5  14.9   0.8
British and Irish                    8.9   4.9    11
Finnish                                0     0   0.3
Common Northern                     10.7   7.5    42
Italian                              0.9   0.8     0
Sardinian                              0     0     0
Iberian                                0     0     0
Balkan                                 0     0     0
Common Southern                      0.1   0.4     0
Ashkenazi                            0.1     0   2.9
Eastern                                0     0     4
Common European                      0.1   0.4   1.5
Middle Eastern                        42  40.8     0
North African                        0.3   0.2   0.2
Common MENA                          0.2   0.3     0
South Asian                          2.8   3.4     0
East Asian & Amerindian
  East Asian
    Japanese                         0.2     0     0
    Mongolian                        0.1   0.2     0
    Korean                             0     0     0
    Yakut                              0     0     0
    Chinese                            0     0     0
    Common East Asian                0.5   0.2     0
  Southeast Asian                    0.2     0     0
  Amerindian                           0   0.1     0
  Common East Asian & Amerindian     0.1   0.1     0
Sub-Saharan African
  East                               0.3   0.3     0
  West                               0.2   0.4     0
  Central & South                      0     0     0
  Common Sub-Sahara African          0.1   0.1     0
Oceanian                               0     0     0
Unassigned                           0.5   1.1   0.1
Sum                                100.1 100.3 100.1


Note that I have used data from all three zoom levels. Sometimes people will ask the nonsensical question “How many races are there?” Well, it depends on how much you want to zoom in. 23andme supports three zoom-levels. I have called the groups identified macro-, meso- and microraces.

So we see that the siblings are almost but not exactly the same. As Jason Malloy has pointed out, this is a very important fact because it allows for a sibling-control study akin to Murray (2002). In this design, researchers find full-siblings, measure some predictor variable(s) from each sibling and compare them on the outcome variable(s). This is an important design because it removes the common environment (between family effects) confound that makes interpretation of regression results difficult, e.g. those in The Bell Curve (Herrnstein and Murray, 1994). Murray (2002) used each sibling’s IQ to predict socioeconomic outcomes at adulthood (age 30-38): income, marriage and birth out of wedlock. I reproduce the tables below:


The results are similar to the results from regression modeling presented in The Bell Curve. In other words, for this question, the effects were not due to the common environment confound.

The same design can be used for the question of whether racial ancestry predicts outcome variables such as general cognitive ability (g factor, IQ, etc.), income, educational attainment and crime rate. Since siblings differ somewhat in their ancestry (as was shown in the tables and figures above), if the genetic hypothesis for the trait is true, the differences in ancestry will slightly predict the level of the trait.

In practice for this to work, one will need a large sample of sibling sets (pairs, triples, etc.). To make it easy, they should not have admixture from more than 2 genetic clusters/races. So e.g. African Americans in the US are good for this purpose as they are mostly a mix of European and African genes, but there are other similar groups in the world: Colored in South Africa, Greenlanders in Denmark and Greenland (Moltke et al, 2015), admixed Hawaiians, basically everybody in South America (see admixture project, part I).
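Statistically, a sibling-control design boils down to a within-family (fixed effects) estimate: each sibling's predictor and outcome are expressed as deviations from their own family's mean, so anything shared by the whole family drops out. Here is a minimal sketch of that estimator. The function and argument names are mine, and this is not necessarily how Murray (2002) set up his analysis.

```python
import numpy as np

def within_family_slope(predictor, outcome, family):
    """Slope of outcome on predictor using only differences between siblings."""
    predictor = np.asarray(predictor, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    family = np.asarray(family)
    x_c = np.empty_like(predictor)
    y_c = np.empty_like(outcome)
    for fam in np.unique(family):
        members = family == fam
        # Subtract the family mean, which removes everything siblings share.
        x_c[members] = predictor[members] - predictor[members].mean()
        y_c[members] = outcome[members] - outcome[members].mean()
    denom = (x_c ** 2).sum()
    if denom == 0:
        raise ValueError("no within-family variation in the predictor")
    return (x_c * y_c).sum() / denom
```

Because sibling differences in the predictor are small, the denominator is small too, which is why a large sample of sibling sets is needed to get a stable estimate.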


For years I have looked for a good alternative to Skype. Skype has some nice features including:

  1. Group conversations
  2. Voice and video chat
  3. A chat history
  4. Good interface
  5. Ease of use

It also has some nasty features I diswant:

  1. Closed source
  2. Ads
  3. NSA et al spying

So, the quest is to find something that has 1-5 from the first list and nothing from the second list. Over the years, there have been various proposals: Hemlis (defunct), Cryptocat, Jitsi, Pidgin, and so on. The EFF has a list here. However, it does not include Tox.

Tox has all five good features and none of the bad ones. It has multiple cross-platform clients. It beats Jitsi and Pidgin in good interface and ease of use, especially set-up. Personally, I like the qTox client, but you may have another preference.

Adding people is pretty easy: they simply add your ID and you get a request, like with Skype. No need to set up servers, or make accounts etc.

If you want to reach me, my Tox ID is: 1728E9D22CDDDDBE314E002843E7F57A20365D40CF0F6B26803AD68A163E82710C78A5213A9B. Be sure to use some specific message so you don’t look like a bot.

It’s a really annoying ‘feature’.

I’m searching for a way to disable the very annoying number formatting in Libre Office Calc. Whenever I enter some number or a string containing numbers, LO is trying to format a date out of it.


I’ve found some so-called-solutions, but none of these works.

  • Format cells to “text” – works, but only as long as one pastes or deletes from/to a specific cell. For example, if I paste some content, the formatting is lost again.
  • Start typing with a single quote '. Not an option in daily working routine, just to enter a numeric string.
  • Deselect Tools > AutoCorrect > Options > Apply Numbering – there is no such option in Libre Office (at least not in version 3.5).

Use cases:

  • 2-3 means “two to three whatever”
  • 5.2 is the code following 5.1
  • 6. refers to the sixth item

All those values are translated to some random date in Libre Office by default. I guess they were on drugs when implementing such a bug into the program.

Is there a global setting to turn off that featurebug?

There are two main workarounds: 1) manually adding ' in front of numbers, which will force treatment of them as a character string. 2) setting the format to “text” before entering text.

Neither of these is a good solution for everyday work. For instance, if you paste in data from somewhere else, it will not generally have ' in front, and it will also override the format you chose. A last trick here is to use “paste special” and then choose the types, which can be a good workaround too.

It’s not a new complaint:

Developers don’t seem to understand the users’ frustration. Instead they write stuff like this:

At first I thought you meant you must handle cell formatting for each cell individually, which of course is false. (You select all the cells and choose “text” as a data type.)
Now if I understand correctly, you want a way to disable automatic date recognition globally, for all spreadsheets? Or at least, you want to know why that is not in the preferences section?
I can at least make a guess at the last question. It is probably for the same reason why there’s not an option to globally turn off automatic recognition of formulas. Because that’s what Calc is for….
In the vast, vast majority of cases, a user will expect that if he types in a date, it will be “understood” by his software as a date. If it doesn’t get “recognized”, he’s going to think “Hmmm, this software isn’t very good.” He’s not going to think, “Hmmm, there must be a global setting somewhere that’s been switched off so that dates don’t get recognized,” and then go hunting for that setting. (If that did happen and he actually found the setting, his inevitable question would be, “Why on earth is that even an option? Who would want to turn off the date recognition for all spreadsheets?” And we’d have to tell him, “Well, it was this guy Swingletree….” ;)

This is a typical example of developers being out of contact with normal users. For normal users (>95% of users), this auto-conversion is more of a bug than a feature, which is why people want to turn it off completely, and then just manually tell Calc when to interpret something as a date.