Due to a lengthy discussion over at Unz concerning the good performance of some African groups in the UK, it seems worth reviewing the Danish and Norwegian results. Basically, some African groups perform better on some measures than the native British do. The author argues that this disproves global hereditarianism. I think not.

The over-performance relative to home country IQ of some African countries is not restricted to the UK. In my studies of immigrants in Denmark and Norway, I found the same thing. It is very clear that there are strong selection effects for some countries but not others, and that this is a large part of the reason why the correlations between home country IQ and performance in the host country are not higher. If the selection effect were constant across countries, it would not affect the correlations. But because it differs between countries, it essentially adds noise to the correlations.
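This point can be illustrated with a small simulation (hypothetical numbers, not the actual immigrant data): a selection boost that is the same for every origin country leaves the correlation untouched, while one that varies by country attenuates it.

```r
set.seed(1)
n = 50                                  #number of origin countries
home_iq = rnorm(n, 85, 10)              #home country mean IQs
constant_sel = home_iq + 10             #same selection boost everywhere
varying_sel = home_iq + rnorm(n, 10, 8) #boost differs by country
cor(home_iq, constant_sel)              #exactly 1: correlation unaffected
cor(home_iq, varying_sel)               #attenuated: varying selection acts as noise
```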

Two plots:

[Figures: NO_S_IQ.png and DK_S_IQ.png]

The codes are ISO-3 codes, so e.g. NGA is Nigeria, GHA is Ghana, KEN is Kenya, and so on. These groups perform fairly well compared to their home country IQ, both in Norway and Denmark. But Somalia does not, and the performance of several MENAP immigrant groups is abysmal.

The scores on the Y axis are S factor scores for their performance in these countries. They are general factors extracted from measures of income, educational attainment, use of social benefits, crime and the like. The S scores correlate .77 between the two countries. For details, see the papers concerning the data.

I did not use the scores from the papers; I redid the analysis. The code is posted below for those curious. The kirkegaard package is my personal package; it is on GitHub. The megadataset file is on OSF.


 

library(pacman)
p_load(kirkegaard, psych, VIM, ggplot2) #psych for fa(), VIM for irmi()

M = read_mega("Megadataset_v2.0e.csv")

DK = M[111:135] #fetch danish data
DK = DK[miss_case(DK) <= 4, ] #keep cases with 4 or fewer missing
DK = irmi(DK, noise = F) #impute the missing
DK.S = fa(DK) #factor analyze
DK_S_scores = data.frame(DK.S = as.vector(DK.S$scores) * -1) #save scores, reversed
rownames(DK_S_scores) = rownames(DK) #add rownames

M = merge_datasets(M, DK_S_scores, 1) #merge to mega

#plot
ggplot(M, aes(LV2012estimatedIQ, DK.S)) + 
  geom_point() +
  geom_text(aes(label = rownames(M)), vjust = 1, alpha = .7) +
  geom_smooth(method = "lm", se = F)
ggsave("DK_S_IQ.png")


# Norway ------------------------------------------------------------------

NO_work = cbind(M["Norway.OutOfWork.2010Q2.men"], #for work data
                M["Norway.OutOfWork.2011Q2.men"],
                M["Norway.OutOfWork.2012Q2.men"],
                M["Norway.OutOfWork.2013Q2.men"],
                M["Norway.OutOfWork.2014Q2.men"],
                M["Norway.OutOfWork.2010Q2.women"],
                M["Norway.OutOfWork.2011Q2.women"],
                M["Norway.OutOfWork.2012Q2.women"],
                M["Norway.OutOfWork.2013Q2.women"],
                M["Norway.OutOfWork.2014Q2.women"])

NO_income = cbind(M["Norway.Income.index.2009"], #for income data
                  M["Norway.Income.index.2010"],
                  M["Norway.Income.index.2011"],
                  M["Norway.Income.index.2012"])

#make DF
NO = cbind(M["NorwayViolentCrimeAdjustedOddsRatioSkardhamar2014"],
           M["NorwayLarcenyAdjustedOddsRatioSkardhamar2014"],
           M["Norway.tertiary.edu.att.bigsamples.2013"])


#get 5 year means
NO["OutOfWork.2010to2014.men"] = rowMeans(NO_work[1:5], na.rm = TRUE) #mean of 5 years, ignore missing
NO["OutOfWork.2010to2014.women"] = rowMeans(NO_work[6:10], na.rm = TRUE) #mean of 5 years, ignore missing

#get means for income and add to DF
NO["Income.index.2009to2012"] = rowMeans(NO_income, na.rm = TRUE) #mean of 4 years, ignore missing

plot_miss(NO) #visualize the missing data

NO = NO[miss_case(NO) <= 3, ] #keep those with 3 datapoints or fewer missing
NO = irmi(NO, noise = F) #impute the missing

NO_S = fa(NO) #factor analyze
NO_S_scores = data.frame(NO_S = as.vector(NO_S$scores) * -1) #save scores, reverse
rownames(NO_S_scores) = rownames(NO) #add rownames

M = merge_datasets(M, NO_S_scores, 1) #merge with mega

#plot
ggplot(M, aes(LV2012estimatedIQ, NO_S)) +
  geom_point() +
  geom_text(aes(label = rownames(M)), vjust = 1, alpha = .7) +
  geom_smooth(method = "lm", se = F)
ggsave("NO_S_IQ.png")

sum(!is.na(M$NO_S))
sum(!is.na(M$DK.S))

cor(M$NO_S, M$DK.S, use = "pair")

 

Abstract

A reanalysis of Carl (2015) revealed that the inclusion of London had a strong effect on the S loadings of the crime and poverty variables. S factor scores from a dataset without London and without redundant variables were strongly related to IQ scores, r = .87. The Jensen coefficient for this relationship was .86.

 

Introduction

Carl (2015) analyzed socioeconomic inequality across 12 regions of the UK. In my reading of his paper, I thought of several analyses that Carl had not done. I therefore asked him for the data and he shared it with me. For a fuller description of the data sources, refer back to his article.

Redundant variables and London

Including (nearly) perfectly correlated variables can skew an extracted factor. For this reason, I created an alternative dataset where variables that correlated above |.90| were removed. The following pairs of strongly correlated variables were found:

  1. median.weekly.earnings and log.weekly.earnings r=0.999
  2. GVA.per.capita and log.GVA.per.capita r=0.997
  3. R.D.workers.per.capita and log.weekly.earnings r=0.955
  4. log.GVA.per.capita and log.weekly.earnings r=0.925
  5. economic.inactivity and children.workless.households r=0.914

In each case, the first of the pair was removed from the dataset. However, this resulted in a dataset with 11 cases and 11 variables, which cannot be meaningfully factor analyzed. For this reason, I left in the last pair.

Furthermore, because capitals are known to sometimes strongly affect results (Kirkegaard, 2015a, 2015b, 2015d), I also created two further datasets without London: one with the redundant variables, one without. Thus, there were 4 datasets:

  1. A dataset with London and redundant variables.
  2. A dataset with redundant variables but without London.
  3. A dataset with London but without redundant variables.
  4. A dataset without London and redundant variables.

Factor analysis

Each of the four datasets was factor analyzed. Figure 1 shows the loadings.


Figure 1: S factor loadings in four analyses.

Removing London strongly affected the loading of the crime variable, which changed from moderately positive to moderately negative. The poverty variable also saw a large change, from slightly negative to strongly negative. Both changes are in the direction towards a purer S factor (desirable outcomes with positive loadings, undesirable outcomes with negative loadings). Removing the redundant variables did not have much effect.

As a check, I investigated whether these results were stable across 30 different factor analytic methods.1 They were; all loadings and scores correlated near 1.00. For my analysis, I used those extracted with the combination of minimum residuals and regression scoring.
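The robustness check can be sketched as follows: refit fa() with each extraction method and compare the resulting loadings. This is illustrated on simulated data, not the original dataset, and covers only five of the extraction methods.

```r
library(psych)

set.seed(1)
X = as.data.frame(matrix(rnorm(100 * 6), ncol = 6))
X[2:6] = X[2:6] + X[[1]]                       #induce a common factor
fms = c("minres", "ml", "wls", "gls", "pa")    #extraction methods to compare
loads = sapply(fms, function(fm) fa(X, fm = fm)$loadings[, 1])
loads = sweep(loads, 2, sign(loads[1, ]), "*") #align arbitrary signs
round(cor(loads), 3)                           #all near 1: methods agree
```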

Mixedness

Due to London’s strong effect on the loadings, one should check that the two methods developed for finding such cases can identify it (Kirkegaard, 2015c). Figure 2 shows the results from these two methods (mean absolute residual and change in factor size):

Figure 2: Mixedness metrics for the complete dataset.

As can be seen, London was identified as a far outlier using both methods.
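For the curious, the change-in-factor-size method can be sketched like this: drop each case in turn, refit the factor analysis, and record the change in the proportion of variance explained (illustrative code only; the actual analyses use the kirkegaard package).

```r
library(psych)

#for each case: refit without it and record the change in factor size
factor_size_change = function(df) {
  base = fa(df)$Vaccounted["Proportion Var", 1]
  sapply(rownames(df), function(case) {
    fa(df[rownames(df) != case, ])$Vaccounted["Proportion Var", 1] - base
  })
}
```

A case whose removal increases the factor size the most is the strongest candidate for being a mixed case.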

S scores and IQ

Carl’s dataset also contains IQ scores for the regions. These correlate .87 with the S factor scores from the dataset without London and redundant variables. Figure 3 shows the scatter plot.

Figure 3: Scatter plot of S and IQ scores for regions of the UK.

However, it is possible that IQ is not related to the latent S factor itself, only to the remaining variance of the extracted S scores. For this reason I used Jensen’s method (method of correlated vectors) (Jensen, 1998). Figure 4 shows the results.

Figure 4: Jensen’s method for the S factor’s relationship to IQ scores.

Jensen’s method thus supported the claim that IQ scores and the latent S factor are related.
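Jensen’s method itself is simple enough to sketch: correlate each indicator’s factor loading with that indicator’s correlation with the criterion. The demo below uses simulated data with known loadings, not Carl’s dataset.

```r
#Jensen's method (method of correlated vectors), sketched
jensen_coef = function(indicators, loadings, criterion) {
  r_crit = apply(indicators, 2, cor, y = criterion, use = "pairwise")
  cor(loadings, r_crit) #correlate loadings with criterion correlations
}

#demo on simulated data with known loadings
set.seed(1)
f = rnorm(500)                            #latent S
true_load = c(.9, .7, .5, .3)
ind = sapply(true_load, function(l) l * f + rnorm(500, 0, sqrt(1 - l^2)))
iq = .8 * f + rnorm(500, 0, .6)           #criterion related to latent S
jensen_coef(ind, true_load, iq)           #strongly positive coefficient
```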

Discussion and conclusion

My reanalysis revealed some interesting results regarding the effect of London on the loadings. This was made possible by data sharing, demonstrating the importance of this practice (Wicherts & Bakker, 2012).

Supplementary material

R source code and datasets are available at the OSF.

References

Carl, N. (2015). IQ and socioeconomic development across Regions of the UK. Journal of Biosocial Science, 1–12. doi.org/10.1017/S002193201500019X

Jensen, A. R. (1998). The g factor: the science of mental ability. Westport, Conn.: Praeger.

Kirkegaard, E. O. W. (2015a). Examining the S factor in Mexican states. The Winnower. Retrieved from thewinnower.com/papers/examining-the-s-factor-in-mexican-states

Kirkegaard, E. O. W. (2015b). Examining the S factor in US states. The Winnower. Retrieved from thewinnower.com/papers/examining-the-s-factor-in-us-states

Kirkegaard, E. O. W. (2015c). Finding mixed cases in exploratory factor analysis. The Winnower. Retrieved from thewinnower.com/papers/finding-mixed-cases-in-exploratory-factor-analysis

Kirkegaard, E. O. W. (2015d). The S factor in Brazilian states. The Winnower. Retrieved from thewinnower.com/papers/the-s-factor-in-brazilian-states

Revelle, W. (2015). psych: Procedures for Psychological, Psychometric, and Personality Research (Version 1.5.4). Retrieved from cran.r-project.org/web/packages/psych/index.html

Wicherts, J. M., & Bakker, M. (2012). Publish (your data) or (let the data) perish! Why not publish your data too? Intelligence, 40(2), 73–76. doi.org/10.1016/j.intell.2012.01.004

1 There are 6 extraction methods and 5 scoring methods supported by the fa() function from the psych package (Revelle, 2015). Thus, there are 6 × 5 = 30 combinations.

Some time ago, I stumbled upon this paper:
Searls, D. T., Mead, N. A., & Ward, B. (1985). The relationship of students’ reading skills to TV watching, leisure time reading, and homework. Journal of Reading, 158-162.

The sample is very large:

To enlarge on such information, the National Assessment of Educational Progress (NAEP) gathered data on the TV viewing habits of 9, 13, and 17 year olds across the U.S. during its 1979-80 assessment of reading skills. In this survey, 21,208 9 year olds, 30,488 13 year olds, and 25,551 17 year olds responded to questions about their backgrounds and to a wide range of items probing their reading comprehension skills. These data provide information on the amount of TV watched by different groups of students and allow comparisons of reading skills and TV watching.

The relationship turns out to be interestingly nonlinear:

[Figure: reading comprehension by amount of TV watching, by age group]

For understanding, it is better to visualize the data anew:

[Figure: tv_age_reading_comprehension.png]

I will just pretend that reading comprehension is cognitive ability, usually a fair approximation.

So, if we follow the smarties: at 9 they watch a fair amount of TV (3-4 hours per day); at 13, they watch about half of that (1-2 hours); and at 17, they barely watch it (<1 hour).

Developmental hypothesis: TV is interesting but only to persons at a certain cognitive ability level. Young smart children fit in the target group, but as they age and become smarter, they grow out of the target group and stop watching.

Alternative hypotheses?

R code

The code for the plot above.

library(reshape2) #needed for melt()
library(ggplot2)

d = data.frame(c(1.5, 2.2, 2.3),
               c(3, 3, 1.3),
               c(5.2, .2, -2.2),
               c(-1.7, -6.9, -8.1))
d

colnames(d) = c("<1 hour", "1-2 hours", "3-4 hours", ">4 hours")
d$age = factor(c("9", "13", "17"), levels = c("9", "13", "17"))

d = melt(d, id.vars = "age")

d

ggplot(d, aes(age, value)) +
  geom_point(aes(color = variable)) +
  ylab("Relative reading comprehension score") +
  scale_color_discrete(name = "TV watching per day") +
  scale_shape_discrete(guide = F)

Abstract

A dataset was compiled with 17 diverse socioeconomic variables for the 32 departments of Colombia and the capital district. Factor analysis revealed an S factor. Results were robust to data imputation and to removal of a redundant variable. 14 of 17 variables loaded in the expected direction. The extracted S factors correlated about .50 with the cognitive ability estimate. The Jensen coefficient for this relationship was .60.

 

Introduction

The general socioeconomic factor is the mathematical construct associated with the idea that positive outcomes tend to go along with other positive outcomes, and likewise for the negative. Mathematically, this shows up as a factor where the desirable outcomes load positively and the undesirable outcomes load negatively. As far as I know, Kirkegaard (2014b) was the first to report such a factor, although Lynn (1979) came close to the same idea. The factor is called s at the individual level, and S when found in aggregated data.

By now, S factors have been found between countries (Kirkegaard, 2014b), twice between country-of-origin groups within countries (Kirkegaard, 2014a), numerous times within countries (reviewed in (Kirkegaard, 2015c)) and also at the level of first names (Kirkegaard & Tranberg, 2015). This paper analyzes data for 33 Colombian departments including the capital district.

Data sources

Most of the data were found via the English-language website Knoema.com which is an aggregator of statistical information concerning countries and their divisions. A second source was a Spanish-language report (DANE, 2011). One variable had to be found on Wikipedia (“List of Colombian departments by GDP,” 2015). Finally, HDI2010 was found in a Spanish-language UN report (United Nations Development Programme & UNDP Colombia, 2011).

Variables were selected according to two criteria: 1) they must be socioeconomically important and 2) they must not be strongly dependent on local climatic conditions. For instance, fishermen per capita would be a variable that fails both criteria, since it is not generally seen as socioeconomically important and is dependent on having access to a body of water.

The included variables are:

  • SABER, verbal scores
  • SABER, math scores
  • Acute malnutrition, %
  • Chronic malnutrition, %
  • Low birth weight, %
  • Access to clean water, %
  • The presence of a sewerage system, %
  • Immunization coverage, %
  • Child mortality, rate
  • Infant mortality, rate
  • Life expectancy at birth
  • Total fertility rate
  • Births that occur in a health clinic, %
  • Unemployment, %
  • GDP per capita
  • Poverty, %
  • GINI
  • Domestic violence, rate
  • Urbanicity, %
  • Population, absolute number
  • HDI 2010

SABER is a local academic achievement test similar to PISA.

Missing data

When collecting the data, I noticed that quite a number of the variables have missing data. The matrixplot is shown in Figure 1.


Figure 1: Matrix plot for the dataset.

The red fields indicate missing data (NA). The greyscale fields indicate high (dark) and low values in each variable. We see that the same departments tend to be missing data across variables.

Redundant variables and imputation

Very highly correlated variables cause problems for factor analysis and result in ‘double weighting’ of some variables. For this reason I used the algorithm I developed to find the most highly correlated pairs of variables and remove one of them automatically (Kirkegaard, 2015a). I used a rule of thumb that variables which correlate at >.90 should be removed. There was only one such pair (infant mortality and child mortality, r = .922; infant mortality was removed).
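A minimal version of this step might look like the following (a sketch, not the actual kirkegaard package code; variable names are hypothetical):

```r
#drop one variable from each pair correlated beyond the threshold
remove_redundant = function(df, threshold = .90) {
  r = cor(df, use = "pairwise")
  r[upper.tri(r, diag = TRUE)] = 0       #consider each pair only once
  too_high = which(abs(r) > threshold, arr.ind = TRUE)
  drop = unique(rownames(r)[too_high[, "row"]])
  df[, !(colnames(df) %in% drop), drop = FALSE]
}

#demo: two near-duplicate mortality variables
set.seed(1)
child = rnorm(30)
d = data.frame(child = child, infant = child + rnorm(30, 0, .1), other = rnorm(30))
colnames(remove_redundant(d))            #one of the redundant pair is dropped
```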

I imputed the missing data using the irmi() function from the VIM package (Templ, Alfons, Kowarik, & Prantner, 2015). This was done without noise to make the results replicable. I had no interest in trying to estimate standard errors, so multiple imputation was unnecessary (Donders, van der Heijden, Stijnen, & Moons, 2006).

To check whether results were comparable across methods, datasets were saved with every combination of imputation and removal of the redundant variable, thus creating 4 datasets.

Factor analysis

I carried out factor analysis on the 4 datasets. The factor loadings plot is shown in Figure 2.

Figure 2: Factor loadings plot.

Results were similar across methods. Per S factor theory, the desirable variables should have positive loadings and the undesirable ones negative loadings. This was not entirely the case; 3 variables that are generally considered undesirable loaded positively: unemployment rate, low birth weight and domestic violence.

Unemployment rate and crime have been found to load in the wrong direction before when analyzing state-like units. This may be because the welfare systems are better in the higher S departments, making it possible to survive without working.

It is said that cities breed crime, and since urbanicity has a very high positive S loading, the crime result may be a side effect of that. Alternatively, the legal system may be better (e.g. less corrupt) in the higher S departments, making it more likely for crimes to be reported. This is perhaps especially so for crimes against women.

The result for low birth weight is stranger, given that higher birth weight is a known correlate of higher educational levels and cognitive ability (Shenkin, Starr, & Deary, 2004). One of the other variables suggests an answer: in the lower S departments, a large fraction (30-40%) of births are home births, and it seems likely that this results in fewer reports of low birth weights.

Generally, the results are consistent with those from other countries; 14 of 17 variables loaded in the expected direction.

Mixed cases

Mixed cases are cases that do not fit the factor structure of a dataset. I have previously developed two methods for detecting such cases (Kirkegaard, 2015b). Neither method indicated any strong mixed cases in either the unimputed, unreduced dataset or the imputed, reduced dataset. Removing the least congruent case would improve the factor size by only 1.2 percentage points, and the case with the greatest mean absolute residual had a value of only .89.

Unlike in previous analyses, the capital district was kept because it did not appear to be a structural outlier.

Cognitive ability, S and HDI

The two cognitive variables correlated at .84, indicating the presence of the aggregate general cognitive ability factor (G factor; Rindermann, 2007). They were averaged to form an estimate of the G factor.

The correlations between the S factors, HDI and cognitive ability are shown in Table 1.

       S      S.ri   HDI    CA
S      —      0.99   0.84   0.54
S.ri   0.99   —      0.85   0.49
HDI    0.84   0.87   —      0.44
CA     0.51   0.58   0.60   —

Table 1: Correlation matrix for cognitive ability, S factors and HDI. Correlations below the diagonal are weighted by the square root of population size.

Weighted and unweighted correlations were approximately the same. The imputed and trimmed S factor was nearly identical to the HDI values, even though the HDI values are from 2010 and the data underlying the S factor are from 2005. Results are fairly similar to those found in other countries.
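The weighted correlations can be computed with stats::cov.wt; a sketch with hypothetical data:

```r
#correlation weighted by (the square root of) population size
wcor = function(x, y, w) {
  cov.wt(cbind(x, y), wt = w / sum(w), cor = TRUE)$cor[1, 2]
}

set.seed(1)
pop = runif(33, 1e5, 8e6)                #hypothetical department populations
x = rnorm(33)
y = x + rnorm(33)
wcor(x, y, sqrt(pop))                    #weighted analogue of cor(x, y)
```

With equal weights this reduces exactly to the ordinary Pearson correlation.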

Figure 3 shows a scatter plot of S factor (reduced, imputed dataset) and cognitive ability.


Figure 3: Scatter plot of S factor scores and cognitive ability.

Jensen’s method

Finally, as a robustness test, I used Jensen’s method (method of correlated vectors; Frisby & Beaujean, 2015; Jensen, 1998) to see whether cognitive ability’s association with the S factor scores was due to the latent trait. Figure 4 shows the Jensen plot.

Figure 4: Jensen plot for S factor loadings and cognitive ability.

The correlation was .60, which is satisfactory given the relatively few variables (N=16).

Limitations

  • I don’t speak Spanish, so I may have overlooked some variables that should have been included in the analysis. There may also be translation errors, as I had to rely on the translations found on the websites I used.
  • No educational attainment variables were included despite these often having very strong loadings. None were available in the data sources I consulted.
  • Data was missing for many cases and had to be imputed.

Supplementary material

Data files, R source code and high quality figures are available in the Open Science Framework repository.

References

Abstract
A factor analysis was carried out on 6 socioeconomic variables for 506 census tracts of Boston. An S factor was found with positive loadings for median value of owner-occupied homes and average number of rooms in these; negative loadings for crime rate, pupil-teacher ratio, NOx pollution, and the proportion of the population of ‘lower status’. The S factor scores were negatively correlated with the estimated proportion of African Americans in the tracts r = -.36 [CI95 -0.43; -0.28]. This estimate was biased downwards due to data error that could not be corrected for.

Introduction
The general socioeconomic factor (s/S1) is a construct similar to that of general cognitive ability (GCA; g factor, intelligence, etc.; Gottfredson, 2002; Jensen, 1998). For ability data, it has been repeatedly found that performance on any cognitive test is positively related to performance on any other test, no matter which format (pen and pencil, read aloud, computerized) or type (verbal, spatial, mathematical, figural, or reaction time-based) has been tried. The S factor is similar. It has been repeatedly found that desirable socioeconomic outcomes tend to be positively related to other desirable socioeconomic outcomes, and undesirable outcomes positively related to other undesirable outcomes. When this pattern is found, one can extract a general factor such that the desirable outcomes have positive loadings and the undesirable outcomes have negative loadings. In a sense, this is the latent factor that underlies the frequently used term “socioeconomic status”, except that it is broader: not restricted to income, occupation and educational attainment, but also including e.g. crime and health.

So far, S factors have been found for country-level data (Kirkegaard, 2014b), state/regional-level data (e.g. Kirkegaard, 2015), country-of-origin-level data for immigrant groups (Kirkegaard, 2014a) and first-name-level data (Kirkegaard & Tranberg, In preparation). The S factors found have not always been strictly general, in the sense that sometimes an indicator loads in the ‘wrong direction’: either an undesirable variable loads positively (typically crime rates), or a desirable one loads negatively. These findings should not be seen as outliers to be explained away, but rather as something to be explained in some coherent fashion. For instance, crime rates may load positively despite crime being undesirable because the justice system may be better in the higher S states, or because urbanicity tends to create crime and urbanicity usually has a positive loading. To understand why some indicators sometimes load in the wrong direction, it is important to examine data at many levels. This paper extends the S factor to a new level, that of census tracts in the US.

Data source
While taking a video course on statistical learning based on James, Witten, Hastie, & Tibshirani (2013), I noted that a dataset used as an example would be useful for an S factor analysis. The dataset concerns 506 census tracts of Boston and includes the following variables (Harrison & Rubinfeld, 1978):

  • Median value of owner-occupied homes
  • Average number of rooms in owner units.
  • Proportion of owner units built before 1940.
  • Proportion of the population that is ‘lower status’: “proportion of adults without some high school education and proportion of male workers classified as laborers”.
  • Crime rate.
  • Proportion of residential land zoned for lots greater than 25k square feet.
  • Proportion of nonretail business acres.
  • Full value property tax rate.
  • Pupil-teacher ratios for schools.
  • Whether the tract bounds the Charles River.
  • Weighted distance to five employment centers in the Boston region.
  • Index of accessibility to radial highways.
  • Nitrogen oxide concentration. A measure of air pollution.
  • Proportion of African Americans.

See the original paper for a more detailed description of the variables.

This dataset has become very popular as a demonstration dataset in machine learning and statistics, which shows the benefits of data sharing (Wicherts & Bakker, 2012). As Gilley & Pace (1996) note, “Essentially, a cottage industry has sprung up around using these data to examine alternative statistical techniques.” However, when they re-checked the data, they found a number of errors. The corrected data can be downloaded here; that is the dataset used for this analysis.

The proportion of African Americans
The variable concerning African Americans has been transformed by the following formula: 1000(x − .63)². Because one has to take the square root to reverse the squaring, some information is lost. For example, if we begin with the dataset {2, −2, 2, 2, −2, −2} and square the values, we get {4, 4, 4, 4, 4, 4}. It is impossible to reverse this transformation and recover the original, because one cannot tell whether a given 4 results from −2 or 2 being squared.

In case of the actual data, the distribution is shown in Figure 1.

Figure 1: Transformed data for the proportion of blacks by census tract.

Due to the transformation, the values around 400 actually mean that the proportion of blacks is around 0. The function for back-transforming the values is shown in Figure 2.

Figure 2: The transformation function.

We can now see the problem with back-transforming the data. If a transformed value lies between 0 and about 140, we cannot tell with certainty which original value produced it. For instance, a transformed value of 100 might correspond to an original proportion of .31 or .95.

To get a feel for the data, one can use the Racial Dot Map explorer and look at Boston. Figure 3 shows the Boston area color-coded by racial groups.

Figure 3: Racial dot map of Boston area.

As can be seen, the races tend to live rather separately, with large areas dominated by one group. From looking at it, it seems that Whites and Asians mix more with each other than with the other groups, and that African Americans and Hispanics do the same. One might expect this result based on the groups’ relative differences in S factor and GCA (Fuerst, 2014). Still, this should be examined by numerical analysis, a task which is left for another investigation.

Still, we are left with the problem of how to back-transform the data. The conservative choice is to use only the left side of the function. This is conservative because any proportion above .63 will be back-transformed to a lower value; e.g., .80 will become .46, a serious error. This is the method used for this analysis.
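In code, the conservative left-branch back-transformation solves B = 1000(x − .63)² for the solution below .63:

```r
#invert B = 1000(x - .63)^2 using only the branch below .63
back_transform = function(B) .63 - sqrt(B / 1000)

back_transform(100)                      #about .31 (could also have been .95)
back_transform(1000 * (0.80 - 0.63)^2)   #gives .46, not the true .80
```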

Factor analysis
Of the variables in the dataset, there is the question of which to use for S factor analysis. In general when doing these analyses, I have sought to include variables that measure something socioeconomically important and which is not strongly influenced by the local natural environment. For instance, the dummy variable concerning the River Charles fails on both counts. I chose the following subset:

  • Median value of owner-occupied homes
  • Average number of rooms in owner units.
  • Proportion of the population that is ‘lower status’.
  • Crime rate.
  • Pupil-teacher ratios for schools.
  • Nitrogen oxide concentration. A measure of air pollution.

These variables concern important but different things. Figure 4 shows the loadings plot for the factor analysis (reversed).2


Figure 4: Loadings plot for the S factor.

The S factor was confirmed for this data without exceptions, in that all indicator variables loaded in the expected direction. The factor was moderately strong, accounting for 47% of the variance.

Relationship between S factor and proportions of African Americans
Figure 5 shows a scatter plot of the relationship between the back-transformed proportion of African Americans and the S factor.

Figure 5: Scatter plot of S scores and the back-transformed proportion of African Americans by census tract in Boston.

We see that there is wide variation in the S factor even among tracts with no or very few African Americans. These low S scores may be due to Hispanics or may simply reflect the wide variation among Whites (there were few Asians back then). The correlation between the proportion of African Americans and S is −.36 [CI95: −.43; −.28].

We see that many very low S points lie around S = −3 to −1.5. Some of these points may actually be census tracts with very high proportions of African Americans that were back-transformed incorrectly.

Discussion
The value of r = -.36 should not be interpreted as an estimate of effect size of ancestry on S factor for census tracts in Boston because the proportions of the other sociological races were not used. A multiple regression or similar method with all sociological races as the predictors is necessary to answer this question. Still, the result above is in the expected direction based on known data concerning the mean GCA of African Americans, and the relationship between GCA and socioeconomic outcomes (Gottfredson, 1997).

Limitations
The back-transformation process likely introduced substantial error in the results.

Data are relatively old and may not reflect reality in Boston as it is now.

Supplementary material
Data, high quality figures and R source code are available at the Open Science Framework repository.

References

Footnotes

1 Capital S is used when the data are aggregated, and small s is used when it is individual level data. This follows the nomenclature of (Rindermann, 2007).

2 The factor is said to be reversed because the analysis gave positive loadings for undesirable outcomes and negative loadings for desirable ones. This happens because the analysis includes more indicators of undesirable outcomes, and factor analysis chooses the direction toward which most indicators point as the positive one. This can easily be reversed by multiplying the loadings and scores by −1.

Abstract
It has been found that workers who hail from higher socioeconomic classes have higher earnings even in the same profession. An environmental cause was offered as an explanation of this. I show that this effect is expected solely for statistical reasons.

Introduction
Friedman and Laurison (2015) offer data about the earnings of persons employed in the higher professions by their social class of origin. They find that those who originate from a higher social class earn more. I reproduce their figure below.

[Figure: reproduced from Friedman and Laurison (2015): earnings by social class of origin]

They posit an environmental explanation of this:

In doing so, we have purposively borrowed the ‘glass ceiling’ concept developed by feminist scholars to explain the hidden barriers faced by women in the workplace. In a working paper recently published by LSE Sociology, we argue that it is also possible to identify a ‘class ceiling’ in Britain which is preventing the upwardly mobile from enjoying equivalent earnings to those from upper middle-class backgrounds.

There is also a longer working paper by the same authors, but I did not read that. A link to it can be found in the previously mentioned source.

A simplified model of the situation
How do persons advance to professions? We know that the occupational hierarchy is basically a hierarchy of (general) cognitive ability (GCA; Gottfredson, 1997), and presumably also of various relevant non-cognitive traits such as industriousness/conscientiousness, although I am not familiar with a study of this.

A simple way to model the situation is as a threshold system where no one below a given threshold gets into the profession and everybody above it does. This is of course not exactly like reality, though reality does have thresholds which increase up the hierarchy. [Insert the figure from one of Gottfredson’s papers that shows the minimum IQ by occupation, but I can’t seem to locate it. Help!] The effect of GCA is probably better described by a probabilistic function akin to a cumulative distribution function, such that below a certain cognitive level virtually no one gets into the profession.

Simulating this is a bit complicated, but we can approximate it reasonably by using a simple cut-off value, such that everybody above gets in and everybody below does not; see Gordon (1997) for a similar approach applied to belief in conspiracy theories.

A simulation
One could perhaps solve this analytically, but it is easier to simulate it, so we do that. I used the following procedure:

  1. We make three groups of origin with 90, 100, and 110 IQ.
  2. We simulate a large number (1e4) of random persons from these groups.
  3. We plot these to get an overview of the data.
  4. We find the subgroup of each group with IQ > 115, which we take as the minimum for some high level profession.
  5. We calculate the mean IQ of each subgroup.
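The procedure above can be sketched in R. This is a minimal re-implementation, not the author's original script (which is in the OSF repository); exact values will vary slightly with the random seed.

```r
set.seed(1)                      # arbitrary seed; values vary slightly by seed
n <- 1e4
group_means <- c(90, 100, 110)   # the three origin groups
cutoff <- 115                    # minimum IQ for the profession

# For each group: simulate persons, keep those above the cutoff,
# and compute the mean IQ of the selected subgroup.
subgroup_means <- sapply(group_means, function(m) {
  iq <- rnorm(n, mean = m, sd = 15)
  mean(iq[iq > cutoff])
})
round(subgroup_means, 2)  # roughly 122, 123 and 125
```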

The plot looks like this:

thresholds

The vertical lines are the cut-off threshold (black) and the three means (in their corresponding colors). As can be seen, the means in the subgroups are not the same despite the same threshold being applied. The values are respectively: 121.74, 122.96, and 125.33. The differences between these are not large for the present simulation, but they may be sufficient to bring about differences that are detectable in a large dataset. The values depend on how far the population mean is from the threshold and the standard deviation of the population (all 15 in the above simulation). The further away the threshold is from the mean, the closer the mean of the subgroup above the threshold will be to the threshold value. For subgroups far away, it will be nearly identical. For instance, the mean IQ of those with >=150 is about 153.94 (based on a sampling with 10e7 cases, mean 100, sd 15).
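These subgroup means can also be checked analytically: the mean of a normal distribution truncated from below equals the population mean plus the standard deviation times the inverse Mills ratio. A quick check in base R:

```r
# Mean of N(mu, sigma) conditional on exceeding a cutoff (truncated normal).
trunc_mean <- function(cutoff, mu = 100, sigma = 15) {
  z <- (cutoff - mu) / sigma
  mu + sigma * dnorm(z) / pnorm(z, lower.tail = FALSE)  # inverse Mills ratio
}

trunc_mean(150)  # about 153.9, matching the sampled value above
trunc_mean(115)  # about 122.9, close to the simulated middle group
```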

It should be noted that if one also considers measurement error, this effect will be stronger, since persons from lower IQ groups regress further down on retesting. This is because those of them who pass the threshold are more likely to have done so partly thanks to positive measurement error in their initial scores. One can correct for this bias, but it is not often done (Jensen, 1980).
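A small simulation illustrates the measurement error point. The reliability of .8 is an arbitrary assumption for this sketch; observed scores are modeled as true scores plus error, and cases selected from lower-scoring groups overestimate their true level more.

```r
set.seed(1)
n <- 1e5
rel <- 0.8  # assumed test reliability (arbitrary for this sketch)

# For a group with a given mean: simulate true and observed IQs, select on
# observed IQ > 115, and return how much the selected cases' observed scores
# overestimate their true scores (the expected regression on retesting).
regression_gap <- function(group_mean) {
  true_iq  <- rnorm(n, group_mean, 15 * sqrt(rel))
  observed <- true_iq + rnorm(n, 0, 15 * sqrt(1 - rel))  # total sd stays 15
  sel <- observed > 115
  mean(observed[sel]) - mean(true_iq[sel])
}

regression_gap(90)   # larger overestimate for the lower-mean group
regression_gap(110)  # smaller overestimate for the higher-mean group
```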

Supplementary material
R source code is available at the Open Science Framework repository.

References

  • Friedman, S and Laurison, D. (2015). Introducing the ‘class’ ceiling. British Politics and Policy blog.
  • Gordon, R. A. (1997). Everyday life as an intelligence test: Effects of intelligence and intelligence context. Intelligence, 24(1), 203-320.
  • Gottfredson, L. S. (1997). Why g matters: The complexity of everyday life. Intelligence, 24(1), 79-132.
  • Jensen, A. R. (1980). Bias in Mental Testing. New York: Free Press.

Abstract
Sizeable S factors were found across 3 different datasets (from years 1991, 2000 and 2010), which explained 56 to 71% of the variance. Correlations of extracted S factors with cognitive ability were strong, ranging from .69 to .81 depending on which year, analysis and dataset one chooses. Method of correlated vectors supported the interpretation that the latent S factor was primarily responsible for the association (r’s .71 to .81).

Introduction
Many recent studies have examined within-country regional correlates of (general) cognitive ability (also known as (general) intelligence, general mental ability, g). This has been done for the British Isles (Lynn, 1979; Kirkegaard, 2015g), France (Lynn, 1980), Italy (Lynn, 2010; Kirkegaard, 2015e), Spain (Lynn, 2012), Portugal (Almeida, Lemos, & Lynn, 2011), India (Kirkegaard, 2015d; Lynn & Yadav, 2015), China (Kirkegaard, 2015f; Lynn & Cheng, 2013), Japan (Kura, 2013), the US (Kirkegaard, 2015b; McDaniel, 2006; Templer & Rushton, 2011), Mexico (Kirkegaard, 2015a) and Turkey (Lynn, Sakar, & Cheng, 2015). This paper examines data for Brazil.

Data
Cognitive data
Data from PISA was used as a substitute for IQ test data. PISA and IQ correlate very strongly (>.9; Rindermann, 2007) across nations and presumably also across regions, altho this hasn’t been thoroly investigated to my knowledge.

Socioeconomic data
As opposed to some of my prior analyses, there was no dataset to build on top of. For this reason, I tried to find an English-language database for Brazil with a comprehensive selection of variables. Altho I found some resources, they did not allow for easy download and compilation of state-level data, which I needed. Instead, I relied upon the Portuguese-language site, Atlasbrasil.org.br, which has a comprehensive data explorer with a convenient download function for state-level data. I used Google Translate to find my way around the site.

Using the data explorer, I selected a broad range of variables. The goal was to cover most important areas of socioeconomic development and avoid variables of little importance or which are heavily influenced by local climate factors (e.g. amount of rainforest). The following variables were selected:

  1. Gini coefficient
  2. Activity rate age 25-29
  3. Unemployment rate age 25-29
  4. Public sector workers%
  5. Farmers%
  6. Service sector workers%
  7. Girls age 10-17 with child%
  8. Life expectancy
  9. Households without electricity%
  10. Infant mortality rate
  11. Child mortality rate
  12. Survive to 40%
  13. Survive to 60%
  14. Total fertility rate
  15. Dependency ratio
  16. Aging rate
  17. Illiteracy age 11-14 %
  18. Illiteracy age 25 and above %
  19. Age 6-17 in school %
  20. Attendance higher education %
  21. Income per capita
  22. Mean income lowest quintile
  23. Pct extremely poor
  24. Richest 10 pct income
  25. Bad walls%
  26. Bad water and sanitation%
  27. HDI
  28. HDI income
  29. HDI life expectancy
  30. HDI education
  31. Population
  32. Population rural

Variables were available only for three time points: 1991, 2000 and 2010. I selected all three with the intention of checking the stability of results over different time periods.

Most data was already in an appropriate per unit measure so it was not necessary to do extensive conversions as with the Mexican data (Kirkegaard, 2015a). I calculated fraction of the population living in rural areas by dividing the rural population by the total population.

Note that the data explorer also has data at a lower level, that of municipalities. It could be used in the future to see whether the S factor holds at a lower level of aggregation.

S factor loadings
I split the data into three datasets, one each for 1991, 2000 and 2010.

I extracted S factors using the fa() function with default parameters from the psych package (Revelle, 2015).

S factor in 1991
Due to missing data, there were only 21 indicators available for this year. The data could not be imputed since it was missing for all cases for these variables. The loadings plot is shown in Figure 1.

S.1991.loadings

Figure 1: Loadings plot for S factor for the data from 1991

All indicators were in the expected direction aside from perhaps “aging rate”, which is somewhat unclear and would perhaps be expected to have a positive loading.

S factor in 2000
Less missing data, 26 variables. Loadings plot shown in Figure 2.

S.2000.loadings

Figure 2: Loadings plot for S factor for the 2000 data

All indicators were in the expected direction.

S factor for 2010
27 variables. Factor analysis gave an error for this dataset, which meant that I had to remove at least one variable.1 This left me with the question of which variable(s) to exclude. Similar to the previous analysis of Mexican states (Kirkegaard, 2015a), I used an automatic method. After removing one variable, the factor analysis worked and gave no warning. The excluded variable was child mortality, which correlated near perfectly with another variable (infant mortality, r=.992), so little indicator sampling error should be introduced by this deletion. The loadings plot is shown in Figure 3.

S.2010.1.loadings

Figure 3: Loadings plot for S factor for the 2010 data, minus one variable

Oddly, survival to 60 and 40 now have negative loadings, altho one would expect them to correlate highly with life expectancy, which has a loading near 1. In fact, the correlations between life expectancy and the survival variables were -.06 and -.21, which makes one wonder what these variables are measuring. Excluding them does not substantially change results, but does increase the amount of variance explained to .60.

Out of curiosity, I also tried versions where I deleted 5 and 10 variables, but this did not change much in the plots, so I won’t show them. Interested readers can consult the source code.

Mixed cases
To examine whether there are any cases with strong mixedness — cases that are incongruent with the factor structure in the data — I developed two methods which are presented elsewhere (Kirkegaard, 2015c). Briefly, the first method measures the mixedness of the case by quantifying how predictable indicator scores are from the factor score for each case (mean absolute residual, MAR). The second quantifies how much the size of the general factor changes after exclusion of each individual case (improvement in proportion of variance, IPV). Both methods were found to be useful at finding a strongly mixed case in simulation data.
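A self-contained sketch of the first method (MAR) follows. The author's implementation uses factor scores from his own package; here the first principal component stands in for the general factor, which is an assumption of this sketch, as are all the names in it.

```r
# Mean absolute residual (MAR) per case: how badly each case's standardized
# indicator values are predicted from its general factor score.
mixedness_MAR <- function(df) {
  z <- scale(df)                        # standardize the indicators
  pc <- prcomp(z)                       # first PC as stand-in for the factor
  scores   <- pc$x[, 1, drop = FALSE]
  loadings <- pc$rotation[, 1, drop = FALSE]
  predicted <- scores %*% t(loadings)   # model-implied indicator values
  rowMeans(abs(z - predicted))          # higher = more mixed case
}
```

Feeding it a dataset with one planted incongruent case ranks that case highest on MAR.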

I applied both methods to the Brazilian datasets. For the second method, I had to create two additional reduced datasets since the factor analysis could not run with the resulting combinations of cases and indicators.

There are two ways one can examine the results: 1) by looking at the top (or bottom) most mixed cases for each method; 2) by looking at the correlations between results from the methods. The first is interesting if Brazilian state-level inequality in S has particular interest, while the second is more relevant for checking that the methods really work — they should give congruent results if mixed cases are present.

Top mixed cases
For each method and each analysis, I extracted the names of the top 5 mixed states. They are shown in Table 1.

 

             Position_1         Position_2         Position_3         Position_4         Position_5
m1.1991      Amapá              Acre               Distrito Federal   Roraima            Rondônia
m1.2000      Amapá              Roraima            Acre               Distrito Federal   Rondônia
m1.2010.1    Roraima            Distrito Federal   Amapá              Amazonas           Acre
m1.2010.5    Roraima            Distrito Federal   Amapá              Acre               Amazonas
m1.2010.10   Roraima            Distrito Federal   Amapá              Acre               Amazonas
m2.1991      Amapá              Rondônia           Acre               Roraima            Amazonas
m2.2000.1    Amapá              Rondônia           Roraima            Paraíba            Ceará
m2.2010.2    Amapá              Roraima            Distrito Federal   Pernambuco         Sergipe
m2.2010.5    Amapá              Roraima            Distrito Federal   Piauí              Bahia
m2.2010.10   Distrito Federal   Roraima            Amapá              Ceará              Tocantins

Table 1: Top 5 mixed cases by method and dataset

As can be seen, there is quite a bit of agreement across years, datasets, and methods. If one were to do a more thoro investigation of socioeconomic differences across Brazilian states, one should examine these states for unusual patterns. One could do this using the residuals for each indicator by case from the first method (these are available from FA.residuals() in psych2). A quick look at the 2010.1 data for Amapá shows that the state is roughly in the middle regarding state-level S (score = -.26, rank 15 of 27). Farmers do not constitute a large fraction of the population (only 9.9%, rank 4, behind only the states with large cities: Federal District, Rio de Janeiro, and São Paulo). Given that farmers% has a strong negative loading (-.77) and given the state’s S score, one would expect the state to have relatively more farmers than it has; the mean across all states in that dataset is 17.2%.

Much more could be said along these lines, but I will rather refrain since I don’t know much about the country and can’t read the language very well. Perhaps a researcher who is a Brazilian native could use the data to make a more detailed analysis.

Correlations between methods and datasets
To test whether the results were stable across years, data reductions, and methods, I correlated all the mixedness metrics. Results are in Table 2.

 

            m1.1991  m1.2000  m1.2010.1  m1.2010.5  m1.2010.10  m2.1991  m2.2000.1  m2.2010.2  m2.2010.5
m1.1991
m1.2000     0.88
m1.2010.1   0.81     0.85
m1.2010.5   0.77     0.87     0.98
m1.2010.10  0.70     0.79     0.93       0.96
m2.1991     0.48     0.64     0.45       0.48       0.40
m2.2000.1   0.41     0.58     0.34       0.39       0.27        0.87
m2.2010.2   0.53     0.63     0.66       0.66       0.51        0.58     0.68
m2.2010.5   0.32     0.49     0.60       0.64       0.51        0.49     0.59       0.86
m2.2010.10  0.42     0.44     0.66       0.65       0.59        0.32     0.44       0.75       0.76

Table 2: Correlation table for mixedness metrics across datasets and methods.

There is method-specific variance, since the correlations within methods (the top-left and bottom-right squares) are stronger than those across methods. Still, all correlations are positive, Cronbach’s alpha is .87, Guttman’s lambda 6 is .98 and the mean correlation is .61.

S and HDI correlations
HDI
Previous S factor studies have found that the HDI (Human Development Index) is basically a non-linear proxy for the S factor (Kirkegaard, 2014, 2015a). This is not surprising since the HDI is calculated from longevity, education and income, all three of which are known to have strong S factor loadings. The actual derivation of HDI values is somewhat complex. One might expect them simply to average the three indicators, or to extract the general factor, but no. Instead they do complicated things (WHO, 2014).

For longevity (life expectancy at birth), they introduce floors and ceilings at 25 and 85 years. According to data from the WHO (WHO, 2012), no country has values outside these limits, altho Japan is close (84 years).

For education, it is actually an average of two measures: years of education by 25 year olds and expected years of schooling for children entering school age. These measures also have artificial limits of 0-18 and 0-15 respectively.

For gross national income, they use the log values and also artificial limits of 100-75,000 USD.

Moreover, these are not simply combined by standardizing (i.e. rescaling so the mean is 0 and standard deviation is 1) the values and adding them or taking the mean. Instead, a value is calculated for every indicator using the following formula:

dimension index = (actual value − minimum value) / (maximum value − minimum value)
Equation 1: HDI dimension index formula

Note that for education, this formula is used twice and the results averaged.

Finally, the three dimensions are combined using a geometric mean:

HDI = (index_health × index_education × index_income)^(1/3)
Equation 2: HDI index combination formula

The effect of using a geometric mean, as opposed to the normal arithmetic mean, is that if a single indicator is low, the overall level is strongly reduced, whereas with the arithmetic mean only the sum of the indicators matters, not their spread. If the indicators all have the same value, the geometric and arithmetic means are identical.

For instance, if the indicators are .7, .7, .7, the arithmetic mean is (.7+.7+.7)/3 = .7, and the geometric mean is (.7 × .7 × .7)^(1/3) = 0.343^(1/3) = .7. However, if the indicators are 1, .7, .4, the arithmetic mean is still (1+.7+.4)/3 = .7, but the geometric mean is (1 × .7 × .4)^(1/3) = 0.28^(1/3) ≈ 0.654, which is a bit lower than .7.
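Putting Equations 1 and 2 together, a toy HDI can be computed in a few lines of R. The input values (life expectancy of 74 years, 8 mean and 12 expected years of schooling, income of 10,000 USD) are made up for illustration; the bounds are the ones stated above.

```r
# Equation 1: rescale a raw value onto [0, 1] given its artificial bounds.
dim_index <- function(actual, min, max) (actual - min) / (max - min)

health    <- dim_index(74, 25, 85)            # life expectancy, bounds 25-85
education <- mean(c(dim_index(8, 0, 18),      # years of education by 25 year olds
                    dim_index(12, 0, 15)))    # expected years of schooling
income    <- dim_index(log(10000), log(100), log(75000))  # log income, 100-75,000

# Equation 2: geometric mean of the three dimension indices.
HDI <- (health * education * income)^(1/3)
round(HDI, 3)

dim_index(84, 25, 85)  # Japan's life expectancy of 84 gives about 0.983
```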

S and HDI correlations
I used the previously extracted factor scores and the HDI data. I also extracted S factors from the HDI datasets (3 variables)2 to see how these compared with the complex HDI value derivation. Finally, I correlated the S factors from non-HDI data, S factors from HDI data, HDI values and cognitive ability scores. Results are shown in Table 3.

 

            HDI.1991  HDI.2000  HDI.2010  HDI.S.1991  HDI.S.2000  HDI.S.2010  S.1991  S.2000  S.2010.1  S.2010.5  S.2010.10  CA2012
HDI.1991              0.95      0.92      0.98        0.95        0.93        0.96    0.93    0.86      0.89      0.90       0.59
HDI.2000    0.97                0.97      0.94        0.99        0.96        0.94    0.98    0.93      0.95      0.96       0.66
HDI.2010    0.94      0.98                0.93        0.98        0.99        0.93    0.97    0.94      0.97      0.98       0.65
HDI.S.1991  0.98      0.96      0.94                  0.95        0.94        0.98    0.92    0.84      0.88      0.90       0.54
HDI.S.2000  0.97      1.00      0.98      0.97                    0.97        0.95    0.98    0.92      0.95      0.97       0.65
HDI.S.2010  0.95      0.98      0.99      0.95        0.98                    0.94    0.96    0.94      0.96      0.97       0.66
S.1991      0.96      0.96      0.94      0.97        0.97        0.96                0.92    0.86      0.90      0.91       0.60
S.2000      0.95      0.98      0.96      0.93        0.99        0.97        0.97            0.96      0.98      0.98       0.69
S.2010.1    0.89      0.94      0.94      0.86        0.94        0.95        0.92    0.97              0.99      0.96       0.76
S.2010.5    0.91      0.95      0.96      0.89        0.96        0.97        0.93    0.98    0.99                0.98       0.72
S.2010.10   0.93      0.96      0.98      0.93        0.97        0.98        0.93    0.97    0.96      0.98                 0.71
CA2012      0.67      0.73      0.71      0.60        0.72        0.74        0.69    0.78    0.81      0.79      0.75

Table 3: Correlation matrix for S, HDI and cognitive ability scores. Pearson’s below the diagonal, rank-order above.

All results were very strongly correlated no matter which dataset or scoring method was used. Cognitive ability scores were strongly correlated with all S factor measures. The best estimate of the relationship between the S factor and cognitive ability is probably the correlation with S.2010.1, since this dataset is closest in time to the cognitive dataset and the S factor is extracted from the most variables. This is also the highest value (.81), but that may be a coincidence.

It is worth noting that the rank-order correlations were somewhat weaker. This usually indicates that an outlier case is increasing the Pearson correlation. To investigate this, I plot the S.2010.1 and CA2012 variables, see Figure 4.

CA_S_2010_1
Figure 4: Scatter plot of S factor and cognitive ability

The scatter plot however does not seem to reveal any outliers inflating the correlation.

Method of correlated vectors
To examine whether the S factor was plausibly the cause of the pattern seen in the S factor scores (it is not necessarily so), I used the method of correlated vectors with reversing. Results are shown in Figures 5-7.
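For readers unfamiliar with it, MCV with reversing amounts to a few lines: correlate the vector of S loadings with the vector of indicator × cognitive ability correlations, after flipping the sign of negatively loaded indicators. A minimal sketch:

```r
# Method of correlated vectors with reversing: indicators that load more
# strongly on S should also correlate more strongly with cognitive ability.
MCV <- function(loadings, cors_with_ca) {
  flip <- sign(loadings)               # reverse negatively loaded indicators
  cor(loadings * flip, cors_with_ca * flip)
}

# Hypothetical vectors for illustration (not the Brazilian results); these
# are perfectly collinear after reversing, so MCV = 1.
MCV(c(.9, .8, -.7, .6, -.5), c(.8, .7, -.6, .5, -.4))
```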

MCV_1991
Figure 5: MCV for the 1991 dataset

MCV_2000
Figure 6: MCV for the 2000 dataset

MCV_2010_1
Figure 7: MCV for the 2010 dataset

The first result seems to be driven by a few outliers, but the second and third seems decent enough. The numerical results were fairly consistent (.71, .75, .81).

Discussion and conclusion
Generally, the results were in line with earlier studies. Sizeable S factors were found across 3 (or 6, if one counts the mini-HDI ones) different datasets, which explained 56 to 71% of the variance. There seems to be a decrease over time, which is intriguing as it may eventually lead to the ‘destruction’ of the S factor. It may also be due to differences between the datasets across the years, since they were not entirely comparable. I did not examine the issue in depth.

Correlations of S factors and HDIs with cognitive ability were strong, ranging from .60 to .81 depending on which year, analysis and dataset one chooses, and on whether one uses the HDI values. Correlations were stronger when they came from the larger datasets, perhaps because these were better measures of latent S. MCV supported the interpretation that the latent S factor was primarily responsible for the association (r’s .71 to .81).

Future studies should examine to which degree cognitive ability and S factor differences are explainable by ethnoracial factors, e.g. racial ancestry, as done by Kirkegaard (2015b).

Limitations
There are some problems with this paper:

  • I cannot read Portuguese and this may have resulted in including some incorrect variables.
  • There was a lack of crime variables in the datasets, altho these have central importance for sociology. None were available in the data source I used.

Supplementary material
R source code, data and figures can be found in the Open Science Framework repository.

References

Footnotes

1 Error in min(eigens$values) : invalid ‘type’ (complex) of argument.

2 Factor loadings for HDI factor analysis were very strong, always >.9.

Abstract

Two datasets of socioeconomic data were obtained from different sources. Both were factor analyzed and revealed a general factor (S factor). These factors were highly correlated with each other (.79 to .95), with HDI (.68 to .93) and with cognitive ability (PISA; .70 to .78). The federal district was a strong outlier and excluding it improved results.

Method of correlated vectors was strongly positive for all 4 analyses (r’s .78 to .92 with reversing).

Introduction

In a number of recent articles (Kirkegaard 2015a,b,c,d,e), I have analyzed within-country regional data to examine the general socioeconomic factor, if it exists in the dataset (for the origin of the term, see e.g. Kirkegaard 2014). This work was inspired by Lynn (2010) whose datasets I have also reanalyzed. While doing work on another project (Fuerst and Kirkegaard, 2015*), I needed an S factor for Mexican states, if such exists. Since I was not aware of any prior analysis of this country in this fashion, I decided to do it myself.

The first problem was obtaining data for the analysis. For this, one needs a number of diverse indicators that measure important economic and social matters for each Mexican state. Mexico has 31 states and a federal district, so one can use a decent number of indicators to examine the S factor. Mexico is a Spanish speaking country and English comprehension is fairly poor. According to Wikipedia, only 13% of people speak English there. Compare with 86% for Denmark, 64% for Germany and 35% for Egypt.

S factor analysis 1 – Wikipedian data

Data source and treatment

Unlike for the previous countries, I could not easily find good data available in English. As a substitute, I used data from Wikipedia:

These come from various years, are sometimes not given per person, and often have no useful source given. So they are of unknown veracity, but they are probably fine for a first look. The HDI is best thought of as a proxy for the S factor, so we can use it to examine construct validity.

Some variables had data for multiple time-points and they were averaged.

Some data was given in raw numbers. I calculated per capita versions of them using the population data also given.

Results

The variables above minus HDI and population size were factor analyzed using minimum residuals to extract 1 factor. The loadings plot is shown below.

S_wiki

The literacy variables had a near perfect loading on S (.99). Unemployment unexpectedly loaded positively and so did homicides per capita altho only slightly. This could be because unemployment benefits are only in existence in the higher S states such that going unemployed would mean starvation. The homicide loading is possibly due to the drug war in the country.

Analysis 2 – Data obtained from INEG

Data source and treatment

Since the results based on the Wikipedia data were dubious, I searched further for more data. I found it in the Spanish-language statistical database of the Instituto Nacional De Estadística Y Geografía, which however has an option to show poorly done English translations. This is not optimal, as the many translation errors may result in choosing the wrong variable for further analysis. If any Spanish speaker reads this, I would be happy if they would go over my chosen variables and confirm that they are correct. I ended up with the following variables:

  1. Cost of crime against individuals and households
  2. Cost of crime on economic units
  3. Annual percentage change of GDP at 2008 prices
  4. Crime prevalence rate per 10,000 economic units
  5. Crime prevalence rate per hundred thousand inhabitants aged 18 years and over, by state
  6. Dark figure of crime on economic units
  7. Dark figure (crimes not reported and crimes reported that were not investigated)
  8. Doctors per 100 000 inhabitants
  9. Economic participation of population aged 12 to 14 years
  10. Economic participation of population aged 65 and over
  11. Economic units.
  12. Economically active population. Age 15 and older
  13. Economically active population. Unemployed persons. Age 15 and older
  14. Electric energy users
  15. Employed population by income level. Up to one minimum wage. Age 15 and older
  16. Employed population by income level. More than 5 minimum wages. Age 15 and older
  17. Employed population by income level. Do not receive income. Age 15 and older
  18. Fertility rate of adolescents aged 15 to 19 years
  19. Female mortality rate for cervical cancer
  20. Global rate of fertility
  21. Gross rate of women participation
  22. Hospital beds per 100 thousand inhabitants
  23. Inmates in state prisons at year end
  24. Life expectancy at birth
  25. Literacy rate of women 15 to 24 years
  26. Literacy rate of men 15 to 24 years
  27. Median age
  28. Nurses per 100 000 inhabitants
  29. Percentage of households victims of crime
  30. Percentage of births at home
  31. Percentage of population employed as professionals and technicians
  32. Prisoners rate (per 10,000 inhabitants age 18 and over)
  33. Rate of maternal mortality (deaths per 100 thousand live births)
  34. Rate of inhabitants aged 18 years and over that consider their neighborhood or locality as unsafe, per hundred thousand inhabitants aged 18 years and over
  35. Rate of inhabitants aged 18 years and over that consider their state as unsafe, per hundred thousand inhabitants aged 18 years and over
  36. Rate sentenced to serve a sentence (per 1,000 population age 18 and over)
  37. State Gross Domestic Product (GDP) at constant prices of 2008
  38. Total population
  39. Total mortality rate from respiratory diseases in children under 5 years
  40. Total mortality rate from acute diarrheal diseases (ADD) in population under 5 years
  41. Unemployment rate of men
  42. Unemployment rate of women
  43. Households
  44. Inhabited housings with available computer
  45. Inhabited housings that have toilet
  46. Inhabited housings that have a refrigerator
  47. Inhabited housings with available water from public net
  48. Inhabited housings that have drainage
  49. Inhabited housings with available electricity
  50. Inhabited housings that have a washing machine
  51. Inhabited housings with television
  52. Percentage of housing with piped water
  53. Percentage of housing with electricity
  54. Proportion of population with access to improved sanitation, urban and rural
  55. Proportion of population with sustainable access to improved sources of water supply, in urban and rural areas

There were data for multiple years for most of them. I used all data from approximately the last 10 years. For all data with multiple years, I calculated the mean value.

For data given in raw numbers, I calculated the appropriate per unit measures (per person, per economically active person (?), per household).

A matrix plot for all the S factor relevant data (e.g. not population size) is shown below. It shows missing data in red, as well as the relative difference between datapoints. Thus, cells that are completely white or black are outliers compared to the other data.

matrixplot

One variable (inmates per person) has a few missing datapoints.

Multiple other variables had strong outliers. I examined these to determine if they were real or due to data error.

Inspection revealed that the GDP per person data was clearly incorrect for one state (Campeche) but I could not find the source of error. The data is the same as on the website and did not match the data on Wikipedia. I deleted it to be safe.

The GDP change outlier seems to be real (Campeche) which has negative growth. According to this site, it is due to oil fields closing.

The rest of the outliers were hard to say something about due to the odd nature of the data (“dark crime”?), or were plausible. E.g. Mexico City (aka Federal District, the capital) was an outlier on nurses and doctors per capita, but this is presumably due to many large hospitals being located there.

Some data errors of my own were found and corrected but there is no guarantee there are not more. Compiling a large set of data like this frequently results in human errors.

Factor analysis

Since there were only 32 cases — 31 states + federal district — and 47 variables (excluding the bogus GDP per capita one), this gives problems for factor analysis. There are various recommendations, but almost none of them are met by this dataset (Zhao, 2009). To test limits, I decided to try factor analyzing all of the variables. This produced warnings:

The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
In factor.scores, the correlation matrix is singular, an approximation is used
Warning messages:
1: In cor.smooth(R) : Matrix was not positive definite, smoothing was done
2: In cor.smooth(R) : Matrix was not positive definite, smoothing was done
3: In cor.smooth(r) : Matrix was not positive definite, smoothing was done
4: In cor.smooth(r) : Matrix was not positive definite, smoothing was done

Warnings such as these do not always mean that the result is nonsense, but they often do. For that reason, I wanted to extract an S factor with a smaller number of variables. From the 47, I selected the following 21 variables as generally representative and interpretable:

  1. GDP.change,              #Economic
  2. Unemploy.men.rate,
  3. Unemploy.women.rate,
  4. Low.income.peap,
  5. High.income.peap,
  6. Prof.tech.employ.pct,
  7. crime.rate.per.adult,   #crime
  8. Inmates.per.pers,
  9. Unsafe.neighborhood.percept.rate,
  10. Has.water.net.per.hh,    #material goods
  11. Elec.pct,
  12. Has.wash.mach.per.hh,
  13. Doctors.per.pers,      #Health
  14. Nurses.per.pers,
  15. Hospital.beds.per.pers,
  16. Total.fertility,
  17. Home.births.pct,
  18. Maternal.death.rate,
  19. Life.expect,
  20. Women.participation,   #Gender equality
  21. Lit.young.women        #education

Note that peap = per economically active person, hh = household.

The selection was made by my judgment call and others may choose different variables.

Automatic reduction of dataset

As a robustness check and evidence against a possible claim that I picked the variables such as to get an S factor that most suited my prior beliefs, I decided to find an automatic method of selecting a subset of variables for factor analysis. I noticed that in the original dataset, some variables overlapped near perfectly. This would mean that whatever they measure, it would get measured twice or more when extracting a factor. Highly correlated variables can also create nonsense solutions, especially when extracting more than 1 factor.

Another piece of insight comes from the fact that for cognitive data, general factors extracted from a less broad selection of subtests are worse measures of general cognitive ability than those from broader selections (Johnson et al, 2008).

Lastly, subtests from different domains tend to be less correlated than those from the same domain (hence the existence of group factors).

Combining all this, it seems a decent idea that to reduce a dataset by 1 variable, one should calculate all the intercorrelations and find the highest one. Then one should remove one of the two variables responsible for it. One can do this repeatedly to remove more than 1 variable from a dataset. Concerning the question of which of the two variables to remove, I can think of three ways: always removing the first, always removing the second, or choosing at random. I implemented all three settings and chose the second as the default. This is because in many datasets the first of a set of highly correlated variables is usually the ‘primary’ one, e.g. unemployment, unemployment men, unemployment women. The algorithm also outputs step-by-step information concerning which variables were removed and what their correlation was.
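A minimal sketch of such an algorithm, assuming the "always drop the second variable" setting. The author's actual implementation is remove.redundant() in his personal package; the re-implementation and all names below are mine.

```r
# Repeatedly find the most correlated pair of variables and drop the
# second member of the pair, printing step-by-step information.
remove_redundant <- function(df, n_drop) {
  for (i in seq_len(n_drop)) {
    r <- abs(cor(df, use = "pairwise.complete.obs"))
    diag(r) <- 0                                   # ignore self-correlations
    idx  <- which(r == max(r), arr.ind = TRUE)[1, ]
    pair <- sort(idx)                              # column indices of the pair
    message(sprintf("Dropping %s (r = %.3f with %s)",
                    colnames(df)[pair[2]], max(r), colnames(df)[pair[1]]))
    df <- df[, -pair[2], drop = FALSE]             # drop the second variable
  }
  df
}

# Toy example: y is a perfect copy of x (times 2), so it is dropped first,
# leaving x and z.
df <- data.frame(x = 1:10, y = (1:10) * 2, z = c(5, 3, 8, 1, 9, 2, 7, 4, 10, 6))
colnames(remove_redundant(df, 1))
```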

Having written the R code for the algorithm, I ran it on the Mexican dataset. I wanted to obtain a solution using the largest possible number of variables without getting a warning from the factor extraction function. So I first removed 1 variable, and then ran the factor analysis. When I received an error, I removed another, and so on. After having removed 20 variables, I no longer received an error. This left the analysis with 27 variables, or 6 more than my chosen selection. The output from the reduction algorithm was:

> s3 = remove.redundant(s, 20)
[1] "Dropping variable number 1"
[1] "Most correlated vars are Good.water.prop and Piped.water.pct r=0.997"
[1] "Dropping variable number 2"
[1] "Most correlated vars are Piped.water.pct and Has.water.net.per.hh r=0.996"
[1] "Dropping variable number 3"
[1] "Most correlated vars are Fertility.teen and Total.fertility r=0.99"
[1] "Dropping variable number 4"
[1] "Most correlated vars are Good.sani.prop and Has.drainage.per.hh r=0.984"
[1] "Dropping variable number 5"
[1] "Most correlated vars are Victims.crime.households and crime.rate.per.adult r=0.97"
[1] "Dropping variable number 6"
[1] "Most correlated vars are Nurses.per.pers and Doctors.per.pers r=0.962"
[1] "Dropping variable number 7"
[1] "Most correlated vars are Lit.young.men and Lit.young.women r=0.938"
[1] "Dropping variable number 8"
[1] "Most correlated vars are Elec.pct and Has.elec.per.hh r=0.938"
[1] "Dropping variable number 9"
[1] "Most correlated vars are Has.wash.mach.per.hh and Has.refrig.per.household r=0.926"
[1] "Dropping variable number 10"
[1] "Most correlated vars are Prisoner.rate and Inmates.per.pers r=0.901"
[1] "Dropping variable number 11"
[1] "Most correlated vars are Unemploy.women.rate and Unemploy.men.rate r=0.888"
[1] "Dropping variable number 12"
[1] "Most correlated vars are Women.participation and Has.computer.per.household r=0.877"
[1] "Dropping variable number 13"
[1] "Most correlated vars are Hospital.beds.per.pers and Doctors.per.pers r=0.87"
[1] "Dropping variable number 14"
[1] "Most correlated vars are Has.computer.per.household and Prof.tech.employ.pct r=0.868"
[1] "Dropping variable number 15"
[1] "Most correlated vars are Unemploy.men.rate and Unemployed.15plus.peap r=0.866"
[1] "Dropping variable number 16"
[1] "Most correlated vars are Has.tv.per.hh and Has.elec.per.hh r=0.864"
[1] "Dropping variable number 17"
[1] "Most correlated vars are Has.elec.per.hh and Has.drainage.per.hh r=0.851"
[1] "Dropping variable number 18"
[1] "Most correlated vars are Median.age and Prof.tech.employ.pct r=0.846"
[1] "Dropping variable number 19"
[1] "Most correlated vars are Home.births.pct and Low.income.peap r=0.806"
[1] "Dropping variable number 20"
[1] "Most correlated vars are Life.expect and Has.water.net.per.hh r=0.796"

In my opinion the output shows that the function works. In most cases, the pair of variables found was either a (near-)duplicate measure, e.g. percent of population with electricity and percent of households with electricity, or closely related, e.g. literacy in men and women. Sometimes, however, the pair did not seem to be closely related, e.g. women’s participation and percent of households with a computer.

Since this variable selection included variables with missing data, I used the irmi() function from the VIM package to impute the missing data (Templ et al., 2014).

Factor loadings: stability

The factor loading plots are shown below.

(Figures: S_self_all, S_self_automatic, S_self_chosen)

Each analysis relied upon a unique but overlapping selection of variables. It is thus possible to correlate the loadings of the overlapping parts across analyses. This is a measure of loading stability in different factor analytic environments, as also done by Ree and Earles (1991) for the general cognitive ability factor (g). The correlations were .98, 1.00, .98 (n’s 21, 27, 12), showing very high stability across datasets. Note that it was not possible to use the loadings from the Wikipedian data factor analysis because the variables were not, strictly speaking, overlapping.
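The comparison itself is simple: subset the loadings to the shared variables and correlate. A toy base-R illustration (the variable names and loading values here are made up):

```r
# Correlate the loadings of variables shared between two analyses
load_a = c(income = 0.80, education = 0.75, crime = -0.60, unemployment = 0.30)
load_b = c(education = 0.70, income = 0.85, crime = -0.55, fertility = -0.40)
common = intersect(names(load_a), names(load_b))  #overlapping variables
cor(load_a[common], load_b[common])               #loading stability estimate
```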

Factor loadings: interpretation

Examining the factor loadings reveals some things of interest. Generally, across all analyses, whatever is generally considered good loads positively, and whatever is considered bad loads negatively.

Unemployment (overall, men, women) has positive loadings, whereas it ‘should’ have negative ones. Perhaps the lower-S states have more dysfunctional social security nets, or none at all, such that not working means starvation, which forces people to work. This is merely a conjecture, because I don’t know much about Mexico. Hopefully someone more knowledgeable than me will read this and have a better answer.

Crime variables (crime rate, victimization, inmates/prisoners per capita, sentencing rate) load positively, whereas they should load negatively. This pattern has been found before; see Kirkegaard (2015e) for a review of S factor studies and crime variables.

Factor scores

Next I correlated the factor scores from all 4 analyses with each other, as well as with HDI and cognitive ability as measured by PISA tests (the cognitive data are from Fuerst and Kirkegaard, 2015*; the HDI data from Wikipedia). The correlation matrix is shown below.

“regression” method  S.all  S.chosen  S.automatic  S.wiki    HDI  Cognitive ability
S.all                 1.00     -0.08        -0.02    0.08  -0.17              -0.12
S.chosen             -0.08      1.00         0.93    0.84   0.93               0.65
S.automatic          -0.02      0.93         1.00    0.89   0.88               0.74
S.wiki                0.08      0.84         0.89    1.00   0.76               0.78
HDI                  -0.17      0.93         0.88    0.76   1.00               0.53
Cognitive ability    -0.12      0.65         0.74    0.78   0.53               1.00

Strangely, despite the similar factor loadings, the factor scores from the factor extracted from all the variables had almost no relation to the others. This probably indicates that the factor scoring method could not handle this type of odd case. The default scoring method for the factor analysis is “regression”, but there are a few others. Bartlett’s method yielded scores for S.all that fit with the other factors, while none of the other methods did. See the psych package documentation for details (Revelle, 2015). I then changed the scoring method to Bartlett’s for all the other analyses as well, to remove method-specific variance. The new correlation table is shown below:
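For a one-factor model the Bartlett scores have a simple closed form: weight each standardized indicator by its loading divided by its uniqueness, then rescale. A base-R sketch of the formula (my own minimal implementation, not the psych code; for simplicity the check uses the known true loadings):

```r
# One-factor Bartlett factor scores from standardized data z,
# loadings lambda and uniquenesses psi
bartlett_scores = function(z, lambda, psi) {
  w = lambda / psi                               #weight by loading/uniqueness
  as.vector(as.matrix(z) %*% w / sum(lambda^2 / psi))
}

# Check on simulated one-factor data
set.seed(1)
f = rnorm(500)                                   #latent factor
lambda = c(0.8, 0.7, 0.6)
X = sapply(lambda, function(l) l * f + sqrt(1 - l^2) * rnorm(500))
s = bartlett_scores(scale(X), lambda, 1 - lambda^2)
cor(s, f)                                        #high correlation with the latent factor
```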

Bartlett’s method    S.all  S.chosen  S.automatic  S.wiki  HDI.mean  Cognitive ability
S.all                 1.00      0.79         0.88    0.88      0.68               0.74
S.chosen              0.79      1.00         0.95    0.87      0.93               0.70
S.automatic           0.88      0.95         1.00    0.88      0.89               0.74
S.wiki                0.88      0.87         0.88    1.00      0.75               0.78
HDI.mean              0.68      0.93         0.89    0.75      1.00               0.53
Cognitive ability     0.74      0.70         0.74    0.78      0.53               1.00

Intriguingly, now all the correlations are stronger. Perhaps Bartlett’s method is better at handling this type of extraction, involving general factors from datasets with low case-to-variable ratios. It certainly deserves empirical investigation, including reanalysis of prior datasets. I reran the earlier parts of this paper with the Bartlett method. It did not substantially change the results. The correlations between loadings across analyses increased a bit (to .98, 1.00, .99).

One possibility, however, is that the stronger results are just due to Bartlett’s method creating outliers that happen to lie on the regression line. This did not seem to be the case; see the scatterplots below.

(Figure: Correlation_matrix)

S factor scores and cognitive ability

The next question is to what degree the within-country differences in Mexico can be explained by cognitive ability. The correlations are in the table above; they are in the region of .70 to .78 for the various S factors, in other words fairly high. One could plot all of them against cognitive ability, but that would give us 4 plots. Instead, I plot only the S factor from my chosen variables, since it has the highest correlation with HDI and thus the best claim to construct validity. It is also the most conservative option because, of the 4 S factors, it has the lowest correlation with cognitive ability. The plot is shown below:

(Figure: CA_S_chosen)

We see that the federal district is a strong outlier, just like in the study with US states and Washington DC (Kirkegaard, 2015c). One should then remove it and rerun all the analyses. This includes the S factor extractions because the presence of a strong ‘mixed case’ (to be explained further in a future publication) affects the S factor extracted (see again, Kirkegaard, 2015c).

Analyses without Federal District

I reran all the analyses without the federal district. Generally, this did not change much with regards to loadings. Crime and unemployment still had positive loadings.

The loadings correlations across analyses increased to 1.00, 1.00, 1.00.

                     S.all  S.chosen  S.automatic  S.wiki  HDI mean  Cognitive ability
S.all                 1.00      0.99         0.98    0.93      0.85               0.78
S.chosen              0.99      1.00         0.98    0.94      0.88               0.80
S.automatic           0.98      0.98         1.00    0.90      0.90               0.75
S.wiki                0.93      0.94         0.90    1.00      0.75               0.77
HDI mean              0.85      0.88         0.90    0.75      1.00               0.56
Cognitive ability     0.78      0.80         0.75    0.77      0.56               1.00

The factor score correlations increased, meaning that the federal district outlier was a source of discrepancy between the analyses. This can be seen in the scatterplots above in that there is noticeable variation in how far from the rest the federal district lies. After this is resolved, the S factors from the INEG dataset are in near-perfect agreement (.99, .98, .98), while the one from the Wikipedian data is less so but still respectable (.93, .94, .90). Correlations with cognitive ability also improved a bit.

Method of correlated vectors

In line with earlier studies, I examined whether the indicators that are better measures of the latent S factor also correlate more highly with the criterion variable, cognitive ability.
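Concretely, MCV correlates two vectors: the indicators’ S loadings and the indicators’ correlations with the criterion. A self-contained base-R sketch on simulated data (using the first principal component as a stand-in for the factor; all names and values here are illustrative):

```r
# Method of correlated vectors on simulated data: indicators with higher
# S loadings should correlate more with the criterion (cognitive ability).
set.seed(1)
n = 1000
S = rnorm(n)                                   #latent S factor
qual = c(0.9, 0.8, 0.7, 0.6, 0.5)              #true loadings of 5 indicators
X = sapply(qual, function(q) q * S + sqrt(1 - q^2) * rnorm(n))
CA = 0.7 * S + sqrt(1 - 0.49) * rnorm(n)       #criterion variable

pc1 = prcomp(scale(X))$x[, 1]                  #first PC as factor proxy
loadings = cor(X, pc1)
if (mean(loadings) < 0) loadings = -loadings   #fix arbitrary PC sign
r_crit = cor(X, CA)                            #indicator-criterion correlations
cor(loadings, r_crit)                          #the MCV correlation
```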

(Figures: MCV_S_all, MCV_S_automatic, MCV_S_chosen, MCV_S_wiki)

The MCV results are strong: .90, .78, .85, and .92 for the analyses with all variables, chosen variables, automatically chosen variables, and Wikipedian variables, respectively. Note that these are for the analyses without the federal district, but they were similar with it included.

Discussion and conclusion

Generally, the present analysis reached similar findings to those before, especially with the one about US states. Cognitive ability was a very strong correlate of the S factors, especially once the federal district outlier was removed before the analysis. Further work is needed to find out why unemployment and crime variables sometimes load positively in S factor analyses with regions or states as the unit of analysis.

MCV analysis supported the idea that cognitive ability is related to the S factor, not just some non-S factor source of variance also present in the dataset.

Supplementary material

Data files, R code, figures are available at the Open Science Framework repository.

References

  • Fuerst, J. and Kirkegaard, E. O. W. (2015*). Admixture in the Americas part 2: Regional and National admixture. (Publication venue undecided.)
  • Johnson, W., te Nijenhuis, J., & Bouchard Jr, T. J. (2008). Still just 1 g: Consistent results from five test batteries. Intelligence, 36(1), 81-95.
  • Kirkegaard, E. O. W. (2014). The international general socioeconomic factor: Factor analyzing international rankings. Open Differential Psychology.
  • Kirkegaard, E. O. W. (2015a). S and G in Italian regions: Re-analysis of Lynn’s data and new data. The Winnower.
  • Kirkegaard, E. O. W. (2015b). Indian states: G and S factors. The Winnower.
  • Kirkegaard, E. O. W. (2015c). Examining the S factor in US states. The Winnower.
  • Kirkegaard, E. O. W. (2015d). The S factor in China. The Winnower.
  • Kirkegaard, E. O. W. (2015e). The S factor in the British Isles: A reanalysis of Lynn (1979). The Winnower.
  • Lynn, R. (2010). In Italy, north–south differences in IQ predict differences in income, education, infant mortality, stature, and literacy. Intelligence, 38(1), 93-100.
  • Ree, M. J., & Earles, J. A. (1991). The stability of g across different methods of estimation. Intelligence, 15(3), 271-278.
  • Revelle, W. (2015). psych: Procedures for Psychological, Psychometric, and Personality Research. CRAN
  • Templ, M., Alfons A., Kowarik A., Prantner, B. (2014). VIM: Visualization and Imputation of Missing Values. CRAN
  • Zhao, N. (2009). The Minimum Sample Size in Factor Analysis. Encorewiki.

* = not yet published, year is expected publication year.

pumpkinperson.com/2015/04/13/is-the-sat-an-iq-test/

Blog commenter Lion of the Judah-sphere has claimed that the SAT does not correlate as well with comprehensive IQ tests as said IQ tests correlate with one another. At first I assumed he was wrong, but my recent analysis suggesting Harvard undergrads have an average Wechsler IQ of 122, really makes me wonder.

While an IQ of 122 (white norms) is 25 points above the U.S. mean of 97, it seems very low for a group of people who averaged 1490 out of 1600 on the SAT. According to my formula, since 1995 a score of 1490 on the SAT equated to an IQ of 141. But my formula was based on modern U.S. norms; because demographic changes have made the U.S. mean IQ 3 points below the white American mean (and made the U.S. standard deviation 3.4 percent larger than the white SD), converting to white norms reduces Harvard’s SAT IQ equivalent to 139.

In general, research correlating the SAT with IQ has been inconsistent, with correlations ranging from 0.4 to 0.8. I think much depends on the sample. Among people who took similar courses in similar high schools, the SAT is probably an excellent measure of IQ. But considering the wide range of schools and courses American teenagers have experienced, the SAT is not, in my judgement, a valid measure of IQ. Nor should it be. Universities should not be selecting students based on biological ability, but rather on acquired academic skills.

The lower values are due to restriction of range; see e.g. Frey and Detterman (2004). When corrected, the value goes up to the .7-.8 range. The SAT also correlates .54 with the ICAR60 (Condon and Revelle, 2014), without correction for reliability or restriction of range.

As for the post, I can think of few things:

1. The sample recruited is likely not representative of Harvard. Probably mostly social sci/humanities students, who have lower scores.

2. Regression towards the mean means that the Harvard student body won’t be as exceptional on their second measurement as on their first. This is because some of the reason they were so high was just good luck.

3. The SAT is teachable to some extent, and the training won’t transfer well to other tests. This reduces the correlation between the SAT and other GCA tests.

4. Harvard uses affirmative action which lowers the mean SAT of the students a bit. It seems to be about 1500.

The SAT has an approximate mean of 500 per subtest, a ceiling of 800, and an SD of approximately 100. A total score of 1500 is thus 750 + 750, i.e. 250 points, or 2.5 SD, above the mean on each subtest. Test-retest reliability/stability over a few months is around .86 (mean of the values reported in research.collegeboard.org/sites/default/files/publications/2012/7/researchreport-1982-7-test-disclosure-retest-performance-sat.pdf, n≈3700).

The interesting question is how much regression towards the mean we can expect? I decided to investigate using simulated data. The idea is basically that we first generate some true scores (per classical test theory), and then make two measurements of them. Then using the first measurement, we make a selection that has a mean 2.5 SD above the mean, then we check how well this group performs on the second testing.

In R, the way we do this, is to simulate some randomly distributed data, and then create new variables that are a sum of true score and error for the measurements. This presents us with a problem.

How much error to add?

We can solve this question either by trying some values, or analytically. Analytically, it is like this:

cor(test1, test2) = cor(test1, true score) * cor(test2, true score)

The correlation between the two testings is due to their common association with the true score. To simulate a testing, we need the correlation between testing and true score. Since the test-retest correlation is .863, we take the square root and get .929. The squared value of a correlation is the amount of variance it explains, so squaring .929 gets us back to .863. Since the total variance is 1, the error variance is 1 - .863 = .137, and taking its square root gives the corresponding correlation, .370. So we weight the true score by .929 and the error by .370 to get measurements with a test-retest correlation of .863.
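The computation is short enough to spell out in R (numbers from above; the variable names are mine):

```r
# Derive the simulation weights from the test-retest correlation
r_tt = 0.863                     #test-retest correlation of the SAT
w_true = sqrt(r_tt)              #weight on the true score, about .929
w_error = sqrt(1 - r_tt)         #weight on the error term, about .370
# Each test is w_true*T + w_error*E with unit-variance T and E, so the
# correlation between two such tests is w_true^2 = r_tt
round(c(w_true, w_error, w_true^2), 3)  #0.929 0.370 0.863
```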

One could also just try some values until one gets something that is about right. In this case, just weighing the true score by 1 and error by .4 produces nearly the same result. Trying out a few values like this is faster than doing the math. In fact, I initially tried out some values to hit the approximate value but then decided to solve it analytically as well.

How much selection to make?

This question is more difficult analytically. I have asked some math and physics people and they could not solve it analytically. The info we have is the mean value of the selected group, which is about 2.5, relative to a standard normal population. Assuming we make use of top-down selection (i.e. everybody above a threshold gets selected, no one below), where must we place our threshold to get a mean of 2.5? It is not immediately obvious. I solved it by trying some values and calculating the mean trait value in the selected group. It turns out that to get a selected group with a mean of 2.5, the threshold must be 2.14.
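One can also find the threshold numerically in base R: the mean of a standard normal truncated below at t equals dnorm(t)/(1 - pnorm(t)), so we can search for the t that makes this 2.5 (a sketch; uniroot does the search):

```r
# Mean of a standard normal truncated below at threshold t
truncated_mean = function(t) dnorm(t) / pnorm(t, lower.tail = FALSE)

# Search for the threshold that gives a selected-group mean of 2.5
threshold = uniroot(function(t) truncated_mean(t) - 2.5, c(0, 5))$root
round(threshold, 2)  #about 2.14, matching the trial-and-error value
```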

Since this group is selected partly for having positive errors and partly for having high true scores, their true scores and second-testing scores will be lower. How much lower? 2.32 and 2.15 according to my simulation. The second measurement regresses more than the true score because it contains fresh error: the second testing correlates .863 with the first, while the true score correlates .929 with it, so the expected means are 2.5 * .929 = 2.32 and 2.5 * .863 = 2.16.

So there is still some way down to an IQ of 122 from 132.3 (2.15*15+100). However, we may note that they used a shorter WAIS-R form, which correlates .91 with the full-scale. Factoring this in reduces our estimate to 129.4. Given the selectivity noted above, this is not so unrealistic.

Also, the result can apparently be reached simply by 2.5*.86. I was aware that this might work, but wasn’t sure so I tested it (the purpose of this post). One of the wonders of statistical software like this is that one can do empirical mathematics. :)

R code

##SAT AND IQ FOR HARVARD
library(dplyr)
library(psych) #for describe()

#size and true score
n.size = 1e6
true.score = rnorm(n.size)

#add error to true scores (weights derived from test-retest r = .863)
test.1 = .929*true.score + .370*rnorm(n.size)
test.2 = .929*true.score + .370*rnorm(n.size)
SAT = data.frame(true.score,
                 test.1,
                 test.2)
#verify the correlations are about right
cor(SAT)

#select a subsample scoring above the threshold on the first testing
selected = filter(SAT, test.1 > 2.14) #selected sample
describe(selected) #desc. stats

References

  • Condon, D. M., & Revelle, W. (2014). The international cognitive ability resource: Development and initial validation of a public-domain measure. Intelligence, 46, 52-64.
  • Frey, M. C., & Detterman, D. K. (2004). Scholastic assessment or g? The relationship between the Scholastic Assessment Test and general cognitive ability. Psychological Science, 15(6), 373-378.