Examining the S factor in US states


Introduction and data sources

In my previous two posts, I analyzed the S factor in 33 Indian states and 31 Chinese regions. In both samples I found strongish S factors, and both correlated positively with cognitive estimates (IQ or G). In this post I use cognitive data from McDaniel (2006). He gives two sets of estimated IQs, one based on SAT-ACT and one based on NAEP. Unfortunately, they correlate only .58, so at least one of them is not a very accurate estimate of general intelligence.

His article also reports some correlations between these IQs and socioeconomic variables: Gross State Product per capita, median income and percent in poverty. However, the data for these variables are not given in the article, so I did not use them. I’m not quite sure where his data came from.

However, with cognitive data like this and the relatively large number of datapoints (50 or 51, depending on whether the District of Columbia is included), it is possible to do a rather good study of the S factor and its correlates. High quality data for US states are readily available, so results should be strong. Factor analysis requires a case-to-variable ratio of at least 2:1 to deliver reliable results (Zhao, 2009), so this means that one can do an S factor analysis with about 25 variables.

Thus, I set out to find about 25 diverse socioeconomic variables. There are two reasons to gather a very diverse sample of variables. First, for the method of correlated vectors to work (Jensen, 1998), there must be variation in the indicators’ loadings on the factor. Lack of variation causes restriction of range problems. Second, lack of diversity in the indicators of a latent variable leads to psychometric sampling error (Jensen & Weng, 1994; see the review post for general intelligence measures).

My primary source was The 2012 Statistical Abstract website. I simply searched for “state” and picked various measures. I tried to pick things that weren’t too dependent on geography. E.g., kilometers of coastline per capita would be a very bad choice, since it is not socioeconomic and is almost entirely determined by geographical factors. To increase reliability, I generally used all data for the last 10 years and averaged them. Curious readers should see the datafile for details.

I ended up with the following variables:

  1. Murder rate per 100k, 10 years
  2. Proportion with high school or more education, 4 years
  3. Proportion with bachelor or more education, 4 years
  4. Proportion with advanced degree or more, 4 years
  5. Voter turnout, presidential elections, 3 years
  6. Voter turnout, house of representatives, 6 years
  7. Percent below poverty, 10 years
  8. Personal income per capita, 1 year
  9. Percent unemployed, 11 years
  10. Internet usage, 1 year
  11. Percent smokers, male, 1 year
  12. Percent smokers, female, 1 year
  13. Physicians per capita, 1 year
  14. Nurses per capita, 1 year
  15. Percent with health care insurance, 1 year
  16. Percent in ‘Medicaid Managed Care Enrollment’, 1 year
  17. Proportion of population urban, 1 year
  18. Abortion rate, 5 years
  19. Marriage rate, 6 years
  20. Divorce rate, 6 years
  21. Incarceration rate, 2 years
  22. Gini coefficient, 10 years
  23. Top 1%, proportion of total income, 10 years
  24. Obesity rate, 1 year

Most of these are self-explanatory. For economic inequality, I found 6 different measures (here). Since I wanted diversity, I chose the Gini coefficient and the top 1% share because these correlated the least with each other and are well-known.

Aside from the above, I also fetched the racial proportions for each state, to see how they relate to the S factor (and to the various measures above, but to get those, run the analysis yourself).

I used R with RStudio for all analyses. Source code and data is available in the supplementary material.

Missing data

In large analyses like this there are nearly always some missing data. The matrixplot() looks like this:

matrixplot

(It does not seem possible to change the font size, so I have cut off the names at the 8th character.)

We see that there aren’t many missing values. I imputed all the missing values with the VIM package (deterministic imputation using multiple regression).
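For readers who want to reproduce this step, the imputation looks roughly like the following (a minimal sketch; ‘d’ is a hypothetical name for the data frame of socioeconomic variables):

library(VIM) #matrixplot(), irmi()

matrixplot(d) #inspect missingness and relative outliers (greytone = standardized value)
d.imp = irmi(d, noise.factor = 0) #deterministic regression imputation, exactly reproducible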

Extreme values

A useful feature of the matrixplot() is that it shows in greytone the relative outliers for each variable. We can see that some variables have some hefty outliers, which may be data errors. Therefore, I examined them.

The outliers in the two university degree variables are DC, surely because the government is based there and there is a huge lobbyist presence. For the marriage rate, the outlier is Nevada: many people go there to get married. The physician and nurse rates are also highest in DC, for the same reason (maybe one could make up some story about how politics causes health problems!).

After imputation, the matrixplot() looks like this:

matrixplot_after

It is pretty much the same as before, which means that we did not substantially change the data — good!

Factor analyzing the data

Then we factor analyze the data (socioeconomic data only). We plot the loadings (sorted) with a dotplot:

S_loadings_US

We see a wide spread of variable loadings. All but two of them load in the expected direction (socially valued outcomes positive, undesirable outcomes negative), showing the existence of the S factor. The first ‘exception’ is the abortion rate, which loads at +.60 but is often seen as a negative thing. This is however open to discussion: higher abortion rates might be interpreted as reflecting less backward religiousness or more freedom for women (both good in my view). The other is the marriage rate at -.19 (a weak loading), which I’m not sure how to interpret. In any case, for both of these it is debatable which direction is the socially desirable one.
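For those following along in R, the factor analysis and the sorted loadings plot can be done along these lines (a sketch; d.imp is the hypothetical imputed data frame from above):

library(psych) #fa()

fa.s = fa(d.imp) #one factor, minimum residuals by default
s.loadings = sort(fa.s$loadings[,1]) #sorted S loadings
dotchart(s.loadings, xlab = "Loading on S") #simple sorted loadings plot
s.scores = as.numeric(fa.s$scores) #S factor scores for the states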

Correlations with cognitive measures

And now comes the big question: does state S correlate with our IQ estimates? It does; the correlations are .14 (SAT-ACT) and .43 (NAEP). These are fairly low given our expectations. Perhaps we can work out what is happening if we plot the data:

S_IQ1 S_IQ2

Now we can see what is going on. First, the SAT-ACT estimates are pretty strange for three states: California, Arizona and Nevada. I note that these are three adjacent states, so it is quite possibly some kind of regional testing practice that’s throwing off the estimates. If someone knows, let me know. Second, DC is a huge outlier in S, as we might have expected from the short discussion of extreme values above. It’s basically a city-state, roughly half-composed of low-S (low SES) African Americans and half of an upper class connected to the government.

Dealing with outliers – Spearman’s correlation aka. rank-order correlation

There are various ways to deal with outliers. One simple way is to convert the data into ranked data, and just correlate those like normal. Pearson’s correlations assume that the data are normally distributed, which is often not the case with higher-level data (states, countries). Using rank-order gets us these:

S_IQ1_rank S_IQ2_rank

So the correlations improved a lot for the SAT-ACT IQs and a bit for the NAEP ones.
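For reference, the rank-order correlation is just a Pearson correlation computed on the ranked data; in R it is a one-liner (a sketch with hypothetical vector names):

#s.scores = S factor scores, iq.naep = NAEP-based IQ estimates (hypothetical names)
cor(s.scores, iq.naep) #Pearson
cor(s.scores, iq.naep, method = "spearman") #rank-order, same as cor(rank(s.scores), rank(iq.naep))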

Results without DC

Another idea is simply excluding the strange DC case, and then re-running the factor analysis. This procedure gives us these loadings:

S_loadings_noDC

(I have reversed the loadings, because the factor came out reversed, with e.g. education loading negatively.)

These are very similar to before; excluding DC did not substantially change the results (good). Actually, the factor is a bit stronger without DC throwing off the results (using minres: proportion of variance = 36% vs. 30%). The reason this happens is that DC is an odd case, scoring very high on some indicators (e.g. education) and very poorly on others (e.g. murder rate).
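In code, dropping DC and refitting the factor takes a couple of lines (a sketch; the exact row label used for DC in the data file is an assumption):

d.noDC = d.imp[rownames(d.imp) != "District of Columbia",] #drop the DC row
fa.noDC = fa(d.noDC) #refit the S factor
fa.noDC$loadings #compare with the loadings above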

The correlations are:

S_IQ1_noDC S_IQ2_noDC

So, not surprisingly, we see an increase in the effect sizes from before: .14 to .31 and .43 to .69.

Without DC and rank-order

Still, one may wonder what the results would be with rank-order and DC removed. Like this:

S_IQ1_noDC_rank S_IQ2_noDC_rank

So compared to before, effect size increased for the SAT-ACT IQ and decreased slightly for the NAEP IQ.

Now, one could also do regression with weights based on some metric of state population size, and this may further change the results. But I think it’s safe to say that the cognitive measures correlate in the expected direction, and that with the removal of one strange case, the better measure performs at about the expected level, with or without rank-order correlations.

Method of correlated vectors

The MCV (Jensen, 1998) can be used to test whether a specific latent variable underlying some data is responsible for the observed correlation between the factor score (or a factor score approximation such as IQ, an unweighted sum) and some criterion variable. Altho originally invented for use on cognitive test data and the general intelligence factor, I have previously used it in other areas (e.g. Kirkegaard, 2014). I also used it in the previous post on the S factor in India (but not China, because there was a lack of variation in the loadings of the socioeconomic variables on the S factor).

Using the dataset without DC, the MCV result for the NAEP dataset is:

MCV_NAEP_noDC

So, again we see that MCV can reach high r’s when there is a large number of diverse variables. But note that the value can be considered inflated because of the negative loadings of some variables. It is debatable whether one should reverse them.
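The MCV computation itself is just a correlation between two vectors: the indicators’ loadings on S and the indicators’ correlations with the criterion. A minimal sketch with hypothetical vector names:

#s.loadings = each indicator's loading on S
#r.naep = each indicator's correlation with NAEP IQ
round(cor(s.loadings, r.naep), 2) #MCV result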

Racial proportions of states and S and IQ

A last question is whether the states’ racial proportions predict their S scores and their IQ estimates. There are several problems with this. First, the actual genomic ancestry proportions within these racial groups vary by state (Bryc et al., 2015). Second, within ‘pure-breed’ groups, general intelligence varies by state too (this was shown in the testing of draftees in the US in WW1). Third, there is an ‘other’ group that also varies from state to state, presumably composed of different kinds of Asians (Japanese, Chinese, Indians, other SE Asians). Fourth, it is unclear how one should combine these proportions into an estimate used for correlation analysis, or how to model them otherwise. Standard multiple regression is unsuited for handling this kind of data, which has a perfect linear dependency: the proportions must add up to 1 (100%), while MR assumes that the ‘independent’ variables are, well, independent of each other. Surely some method exists that can handle this problem, but I’m not familiar with it. Given the four problems above, one should not expect near-perfect results, but one would probably expect most correlations to go in the right direction and to be of non-trivial size.

Perhaps the simplest way of analyzing this is with correlations. These are susceptible to confounding when, e.g., white% correlates differentially with the other racial proportions. However, they should get the basic directions right, if not the ordering of the effect sizes too.

Racial proportions, NAEP IQ and S

For this analysis I use only the NAEP IQs and exclude DC, as I believe this is the best sub-dataset to rely on. I correlate each racial proportion with the NAEP IQ and with the S factor. The results are:

Racial group   NAEP IQ       S
White             0.69    0.18
Black            -0.50   -0.42
Hispanic         -0.38   -0.08
Other            -0.26    0.20

 

For NAEP IQ, depending on what one thinks of the ‘other’ category, these have either exactly or roughly the order one expects: W>O>H>B. If one thinks “other” is mostly East Asian (Japanese, Chinese, Korean) with higher cognitive ability than Europeans, one would instead expect O>W>H>B. For S, however, the order is O>W>H>B, and the effect sizes are much weaker. In general, given the limitations above, these results are perhaps reasonable, if somewhat on the weak side for S.

Estimating state IQ from racial proportions using racial IQs

One way to utilize all four variables (white, black, hispanic and other) without having MR assign them weights is to assign weights based on known group IQs and then calculate a mean estimated IQ for each state.

Depending on which estimates for group IQs one accepts, one might use something like the following:

State IQ est. = White*100+Other*100+Black*85+Hispanic*90

Or if one thinks other is somewhat higher than whites (this is not entirely unreasonable, but recall that the NAEP includes reading tests which foreigners and Asians perform less well on), one might want to use 105 for the other group (#2). Or one might want to raise black and hispanic IQs a bit, perhaps to 88 and 93 (#3). Or do both (#4). I tried all of these variations, and the results are:

Variable   Race.IQ  Race.IQ2  Race.IQ3  Race.IQ4
Race.IQ       1.00      0.96      1.00      0.93
Race.IQ2      0.96      1.00      0.96      0.99
Race.IQ3      1.00      0.96      1.00      0.94
Race.IQ4      0.93      0.99      0.94      1.00
NAEP IQ       0.67      0.56      0.67      0.51
S             0.41      0.44      0.42      0.45

 

As far as I can tell, there is no strong reason to prefer any of these over the others. However, what we learn is that the correlation between the racial IQ estimate and the NAEP IQ is somewhere between .51 and .67, and the correlation between the racial IQ estimate and S is somewhere between .41 and .45. I think these are reasonable results given the problems with this analysis described above.
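For concreteness, the four weighting variants can be computed directly from the proportions (a sketch, assuming a data frame ‘races’ with columns White, Black, Hispanic and Other that sum to 1 for each state):

race.iq1 = with(races, White*100 + Other*100 + Black*85 + Hispanic*90) #baseline weights
race.iq2 = with(races, White*100 + Other*105 + Black*85 + Hispanic*90) #'other' set to 105
race.iq3 = with(races, White*100 + Other*100 + Black*88 + Hispanic*93) #black/hispanic raised
race.iq4 = with(races, White*100 + Other*105 + Black*88 + Hispanic*93) #both changes
cor(cbind(race.iq1, race.iq2, race.iq3, race.iq4)) #how similar the variants are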

Supplementary material

Data files and R source code available on the Open Science Framework repository.

References

Bryc, K., Durand, E. Y., Macpherson, J. M., Reich, D., & Mountain, J. L. (2015). The Genetic Ancestry of African Americans, Latinos, and European Americans across the United States. The American Journal of Human Genetics, 96(1), 37-53.

Jensen, A. R., & Weng, L. J. (1994). What is a good g?. Intelligence, 18(3), 231-258.

Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger.

Kirkegaard, E. O. W. (2014). The international general socioeconomic factor: Factor analyzing international rankings. Open Differential Psychology.

Zhao, N. (2009). The Minimum Sample Size in Factor Analysis. Encorewiki.org.

Two very annoying statistical fallacies with p-values

Some time ago, I wrote on Reddit:

There are two errors that I see quite frequently:

  1. Concluding from the fact that a statistically significant difference was found that a socially, scientifically or otherwise significant difference was found. The reason this doesn’t work is that any minute difference will be stat. sig. if N is large enough. Some datasets have N=1e6, so very small differences between groups can be detected reliably. This does not mean they are worth any attention. The general problem is the lack of focus on effect sizes.
  2. Concluding from the fact that a difference was not statistically significant that there is no difference in that trait. The error is ignoring the possibility of a false negative: there is a difference, but the sample size is too small to reliably detect it, or sampling fluctuation caused it to be smaller than usual in the present sample. Together with this misuse of p values, one often sees things like “men and women differed in trait1 (p<0.04) but did not differ in trait2 (p>0.05)”, as if the p value difference of .01 had some magical significance.
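A quick simulation illustrates both errors (a sketch; the effect sizes and sample sizes are arbitrary, chosen only to make the two points visible):

set.seed(1)
#error 1: a negligible difference (d = 0.02) is "statistically significant" with a huge N
t.test(rnorm(1e6, 0), rnorm(1e6, 0.02))$p.value #tiny p value, trivial effect
#error 2: a real difference (d = 0.4) often comes out "non-significant" with a small N
t.test(rnorm(20, 0), rnorm(20, 0.4))$p.value #frequently >.05, a false negative rather than evidence of no difference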

These are rather obvious (to me), so I don’t know why I keep reading papers (Wassell et al, 2015) that go like this:

2.1. Experiment 1

In experiment 1 participants filled in the VVIQ2 and reported their current menstrual phase by counting backward the appropriate number of days from the next onset of menstruation. We grouped female participants according to these reports. Fig. 2A shows the mean VVIQ2 score for males and females in the follicular and mid luteal phases (males: M = 56.60, SD = 10.39, follicular women: M = 60.11, SD = 8.84, mid luteal women: M = 69.38, SD = 8.52). VVIQ2 scores varied between menstrual groups, as confirmed by a significant one-way ANOVA, F(2, 52) = 8.63, p < .001, η2 = .25. Tukey post hoc comparisons revealed that mid luteal females reported more vivid imagery than males, p < .001, d = 1.34, and follicular females, p < .05, d = 1.07, while males and follicular females did not differ, p = .48, d = 0.37. These data suggest a possible link between sex hormone concentration and the vividness of mental imagery.

A natural interpretation of the above is that the authors are committing the fallacy. It is even contradictory: an effect size of d=.37 is a small-to-medium effect, yet in the same sentence they state that there is no effect (i.e. d=0).

However, later on they write:

VVIQ2 scores were found to significantly correlate with imagery strength from the binocular rivalry task, r = .37, p < .01. As is evident in Fig. 3A, imagery strength measured by the binocular rivalry task varied significantly between menstrual groups, F(2, 55) = 8.58, p < .001, η2 = .24, with mid luteal females showing stronger imagery than both males, p < .05, d = 1.03, and late follicular females, p < .001, d = 1.26. These latter two groups’ scores did not differ significantly, p = .51, d = 0.34. Together, these findings support the questionnaire data, and the proposal that imagery differences are influenced by menstrual phase and sex hormone concentration.

Now the authors are back to phrasing it in a way that cannot be taken as the fallacy. Sometimes it gets sillier. One paper, Kleisner et al (2014), which received quite a lot of attention in the media, is based on this kind of subgroup analysis, where the effect had p<.05 for one gender but not the other. The typical source of this silliness is the relatively small sample size of most studies combined with the authors’ use of exploratory subgroup analyses (which they pretend are hypothesis-driven in their testing). Gender, age, and race are the groups typically explored, alone and in combination.

Probably, it would be best if scientists stopped using “significant” to talk about lowish p values. There is a very large probability that the public will misunderstand it. (There was a good study recently about this, but I can’t find it again, help!)

References

Kleisner, K., Chvátalová, V., & Flegr, J. (2014). Perceived intelligence is associated with measured intelligence in men but not women. PloS one, 9(3), e81237.

Wassell, J., Rogers, S. L., Felmingham, K. L., Bryant, R. A., & Pearson, J. (2015). Sex hormones predict the sensory strength and vividness of mental imagery. Biological Psychology.

The S factor in China

Introduction

Richard Lynn has been publishing a number of papers on IQ in regions/areas/states within countries, along with various socioeconomic correlates. However, usually his and his co-authors’ analyses are limited to reporting the correlation matrix. This is a pity, because the data allow for a more interesting analysis with the S factor (see Kirkegaard, 2014). I have previously re-analyzed Lynn and Yadav (2015) in a blogpost to be published in a journal sometime ‘soon’. In this post I re-analyze the data reported in Lynn and Cheng (2013), as well as more data I obtained from the official Chinese statistical database.

Data sources

In their paper, they report 6 variables: 1) IQ, 2) sample size for IQ measurement, 3) % of population Ethnic Han, 4) years of education, 5) percent of higher education (percent with higher education?), and 6) GDP per capita. This only includes 3 socioeconomic variables — the bare minimum for S factor analyses — so I decided to see if I could find some more.

I spent some time on the database and found various useful variables:

  • Higher education per capita for 6 years
  • Medical technical personnel for 5 years
  • Life expectancy for 1 year
  • Proportion of population illiterate for 9 years
  • Internet users per capita for 10 years
  • Invention patents per capita for 10 years
  • Proportion of population urban for 9 years
  • Scientific personnel for 8 years

I used all available data for the last 10 years in all cases. This was done to increase the reliability of the measurement, just in case there was some unreliability, and to reduce transient effects. In general tho, regional differences were very consistent thruout the years, so this had little effect. One could do a factor analysis and use the factor scores, but this would make the scores hard to understand for the reader.

For the variables with data for multiple years, I calculated the average yearly intercorrelation to see how reliable the measures were. In all but one case, the average intercorrelation was >=.94, and in the last case it was .86. There would be little to gain from factor analyzing these data, and just averaging the years preserves interpretable data. Thus, I averaged the years for each variable to produce one variable per indicator. This left me with 11 socioeconomic variables.
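As an example of the procedure for a single indicator (a sketch; ‘internet.years’ is a hypothetical data frame with one column per year and one row per region):

year.cors = cor(internet.years, use = "pairwise") #correlations between the years
mean(year.cors[lower.tri(year.cors)]) #average yearly intercorrelation, the reliability check
internet = rowMeans(internet.years, na.rm = TRUE) #the averaged variable used in the analysis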

Examining the S factor and MCV

The next step was to factor analyze the 11 variables and see if one general factor emerged with the right directions of loadings. It did; the loadings are as follows:

S_loadings

All the loadings are in the expected direction. Aside from the one negative loading (illiteracy), they are all fairly strong. This means that MCV (method of correlated vectors) analysis is rather useless, since there is little inter-loading variation. One could probably fix this by going back to the databank and fetching some variables that are worse measures of S and that vary more.

Doing the MCV anyway results in r=.89 (inflated by the one negative loading). Excluding the negative loading gives r=.38, which is however solely due to the scientific personnel datapoint. To properly test it, one needs to fetch more data that varies more in its S loading.

MCV

S, IQ and Han%

We are now ready for the two main results, i.e. correlation of S with IQs and % ethnic Han.

S_Han S_IQ

The correlations are of moderate strength, r=.42 and r=.48. This is somewhat lower than found in analyses of Danish and Norwegian immigrant groups (Kirkegaard, 2014; r’s about .55), much lower than that found between countries (r=.86), and lower than that found in India (r=.61). The IQ result is mostly driven by the two large city areas of Beijing and Shanghai, so it is not that convincing. But the results are tentative and consistent with previous findings.

Han ethnicity seems to be a somewhat more reasonable predictor in this dataset. The effect may not be due to higher general intelligence; the Han may have some other qualities that cause them to do well, perhaps greater conscientiousness or rule-conformity, which is arguably rather important in authoritarian societies like China.

Supplementary material

The R code and datasets are available at the Open Science Framework repository for this study.

References

Understanding restriction of range with Shiny!

I made this: emilkirkegaard.shinyapps.io/Understanding_restriction_of_range/

Source:

# ui.R
shinyUI(fluidPage(
  titlePanel(title, windowTitle = title),
  
  sidebarLayout(
    sidebarPanel(
      helpText("Get an intuitive understanding of restriction of range using this interactive plot. The slider below limits the dataset to those within the limits."),
      
      sliderInput("limits",
        label = "Restriction of range",
        min = -5, max = 5, value = c(-5, 5), step=.1),
      
      helpText("Note that these are Z-values. A Z-value of +/- 2 corresponds to the 98th or 2th centile, respectively.")
      ),
    
    
    mainPanel(
      plotOutput("plot"),width=8,
      
      textOutput("text")
      )
  )
))
# server.R
shinyServer(
  function(input, output) {
    output$plot <- renderPlot({
      #limits
      lower.limit = input$limits[1] #lower limit
      upper.limit = input$limits[2]  #upper limit
      
      #adjust data object
      data["X.restricted"] = data["X"] #copy X
      data[data[,1]<lower.limit | data[,1]>upper.limit,"X.restricted"] = NA #remove values
      group = data.frame(rep("Included",nrow(data))) #create group var
      colnames(group) = "group" #rename
      levels(group$group) = c("Included","Excluded") #add second factor level
      group[is.na(data["X.restricted"])] = "Excluded" #is NA?
      data["group"] = group #add to data
      
      #plot
      xyplot(Y ~ X, data, type=c("p","r"), col.line = "darkorange", lwd = 1,
             group=group, auto.key = TRUE)
    })
    
    output$text <- renderPrint({
      #limits
      lower.limit = input$limits[1] #lower limit
      upper.limit = input$limits[2]  #upper limit
      
      #adjust data object
      data["X.restricted"] = data["X"] #copy X
      data[data[,1]<lower.limit | data[,1]>upper.limit,"X.restricted"] = NA #remove values
      group = data.frame(rep("Included",nrow(data))) #create group var
      colnames(group) = "group" #rename
      levels(group$group) = c("Included","Excluded") #add second factor level
      group[is.na(data["X.restricted"])] = "Excluded" #is NA?
      data["group"] = group #add to data
      
      #correlations
      cors = cor(data[1:3], use="pairwise")
      r = round(cors[3,2],2)
      #print output
      str = paste0("The correlation in the full dataset is .50, the correlation in the restricted dataset is ",r)
      print(str)
    })
    
  }
)
#global.R
library("lattice")
data = read.csv("data.csv",row.names = 1) #load data
title = "Understanding restriction of range"

Indian states: G and S factors


Differences in cognitive ability, per capita income, infant mortality, fertility and latitude across the states of India

Richard Lynn and Prateek Yadav (2015) have a new paper out in Intelligence reporting various cognitive measures, socioeconomic outcomes and environmental factors in some 33 states and areas of India. Their analyses consist entirely of reporting the correlation matrix, but they list the actual data in two tables as well. This means that someone like me can reanalyze it.

They have data for the following variables:

 

1.
Language Scores Class III (T1). These data consisted of the language scores of class III 11–12 year old school students in the National Achievement Survey (NAS) carried out in Cycle-3 by the National Council of Educational Research and Training (2013). The population sample comprised 104,374 students in 7046 schools across 33 states and union territories (UTs). The sample design for each state and UT involved a three-stage cluster design which used a combination of two probability sampling methods. At the first stage, districts were selected using the probability proportional to size (PPS) sampling principle in which the probability of selecting a particular district depended on the number of class 5 students enrolled in that district. At the second stage, in the chosen districts, the requisite number of schools was selected. PPS principles were again used so that large schools had a higher probability of selection than smaller schools. At the third stage, the required number of students in each school was selected using the simple random sampling (SRS) method. In schools where class 5 had multiple sections, an extra stage of selection was added with one section being sampled at random using SRS.

The language test consisted of reading comprehension and vocabulary, assessed by identifying the word for a picture. The test contained 50 items and the scores were analyzed using both Classical Test Theory (CTT) and Item Response Theory (IRT). The scores were transformed to a scale of 0–500 with a mean of 250 and standard deviation of 50. There were two forms of the test, one in English and the other in Hindi.

2.

Mathematics Scores Class III (T2). These data consisted of the mathematics scores of Class III school students obtained by the same sample as for the Language Scores Class III described above. The test consisted of identifying and using numbers, learning and understanding the values of numbers (including basic operations), measurement, data handling, money, geometry and patterns. The test consisted of 50 multiple-choice items scored from 0 to 500 with a mean score was set at 250 with a standard deviation of 50.

3.

Language Scores Class VIII (T3). These data consisted of the language scores of class VIII (14–15 year olds) obtained in the NAS (National Achievement Survey) a program carried out by the National Council of Educational Research and Training, 2013) Class VIII (Cycle-3).The sampling methodology was the same as that for class III described above. The population sample comprised 188,647 students in 6722 schools across 33 states and union territories. The test was a more difficult version of that for class III, and as for class III, scores were analyzed using both Classical Test Theory (CTT) and Item Response Theory (IRT), and were transformed to a scale of 0–500 with a mean 250.

4.

Mathematics Scores Class VIII (T4). These data consisted of the mathematics scores of Class VIII (14–15 year olds) school students obtained by the same sample as for the Language Scores Class VIII described above. As with the other tests, the scores were transformed to a scale of 0–500 with a mean 250 and standard deviation of 50.

5.

Science Scores Class VIII (T5). These data consisted of the science scores of Class VIII (14–15 year olds) school students obtained by the same sample as for the Language Scores Class VIII described above. As with the other tests, the scores were transformed to a scale of 0–500 with a mean 250 and standard deviation of 50. The data were obtained in 2012.

6.

Teachers’ Index (TI). This index measures the quality of the teachers and was taken from the Elementary State Education Report compiled by the District Information System for Education (DISE, 2013). The data were recorded in September 2012 for teachers of grades 1–8 in 35 states and union territories. The sample consisted of 1,431,702 schools recording observations from 199.71 million students and 7.35 million teachers. The teachers’ Index is constructed from the percentages of schools with a pupil–teacher ratio in primary greater than 35, and the percentages single-teacher schools, teachers without professional qualification, and female teachers (in schools with 2 and more teachers).

7.

Infrastructure Index (II). These data were taken from the Elementary State Education Report 2012–13 compiled by the District Information System for Education (2013). The sample was the same as for the Teachers’ Index described above. This index measures the infrastructure for education and was constructed from the percentages of schools with proper chairs and desks, drinking water, toilets for boys and girls, and with kitchens.

8.

GDP per capita (GDP per cap). These data are the net state domestic product of the Indian states in 2008–09 at constant prices given by the Reserve Bank of India (2013). Data are not available for the Union Territories.

9.

Literacy Rate (LR). This consists of the percentage of population aged 7 and above in given in the 2011 census published by the Registrar General and Census Commission of India (2011).

10.

Infant Mortality Rate (IMR). This consists of the number of deaths of infants less than one year of age per 1000 live births in 2005–06 given in the National Family Health Survey, Infant and Child Mortality given by the Indian Institute of Population Sciences (2006).

11.

Child Mortality Rate (CMR). This consists of the number of deaths of children 1–4 years of age per 1000 live births in the 2005–06 given by the Indian Institute of Population Sciences (2006).

12.

Life Expectancy (LE). This consists of the number of years an individual is expected to live after birth, given in a 2007 survey carried out by Population Foundation of India (2008).

13.

Fertility Rate (FR). This consists of the number of children born per woman in each state and union territories in 2012 given by Registrar General and Census Commission of India (2012).

14.

Latitude (LAT). This consists of the latitude of the center of the state.

15.

Coast Line (CL). This consists of whether states have a coast line or are landlocked and is included to examine whether the possession of a coastline is related to the state IQs.

16.

Percentage of Muslims (MS). This is included to examine a possible relation to the state IQs.

 

This article includes the R code, commented line for line, as a helping exercise for readers who are not familiar with R but who can perhaps be convinced to give it a chance! :)

library(devtools) #source_url
source_url("https://osf.io/j5nra/?action=download&version=2") #mega functions from OSF
#source("mega_functions.R")
library(psych) #various
library(car) #scatterplot
library(Hmisc) #rcorr
library(VIM) #imputation

This loads a variety of libraries that are useful.

Getting the data into R

cog = read.csv("Lynn_table1.csv",skip=2,header=TRUE,row.names = 1) #load cog data
socio = read.csv("Lynn_table2.csv",skip=2,header=TRUE,row.names = 1) #load socio data

The files are the two files one can download from ScienceDirect: Lynn_table1 and Lynn_table2. The code reads them assuming that values are separated by commas (CSV = comma-separated values), skips the first two lines because they do not contain data, uses the next line as headers, and uses the first column as row names.

Merging data into one object

Ideally, I’d like all the data in one object for easier use. However, since it comes in two, it has to be merged. For this purpose, I rely upon a dataset merger function I wrote some months ago to handle international data. It can however handle any merging of data where one wants to match rows by name from different datasets and combine them into one dataset. This function, merge_datasets(), is found in the mega_functions we imported earlier.

However, first, it is a good idea to make sure the names do match when they are supposed to. To check this we can type:

cbind(rownames(cog),rownames(socio))

I put the output into Excel to check for mismatches:

Andhra Pradesh Andhra Pradesh TRUE
Arunachal Pradesh Arunachal Pradesh TRUE
Bihar Bihar TRUE
Chattisgarh Chattisgarh TRUE
Goa Goa TRUE
Gujarat Gujarat TRUE
Haryana Haryana TRUE
Himanchal Pradesh Himanchal Pradesh TRUE
Jammu Kashmir Jammu & Kashmir FALSE
Jharkhand Jharkhand TRUE
Karnataka Karnataka TRUE
Kerala Kerala TRUE
Madhya Pradesh Madhya Pradesh TRUE
Maharashtra Maharashtra TRUE
Manipur Manipur TRUE
Meghalaya Meghalaya TRUE
Mizoram Mizoram TRUE
Nagaland Nagaland TRUE
Odisha Odisha TRUE
Punjab Punjab TRUE
Rajashthan Rajasthan FALSE
Sikkim Sikkim TRUE
Tamil Nadu TamilNadu FALSE
Tripura Tripura TRUE
Uttarkhand Uttarkhand TRUE
Uttar Pradesh Uttar Pradesh TRUE
West Bengal West Bengal TRUE
A & N Islands A & N Islands TRUE
Chandigarh Chandigarh TRUE
D & N Haveli D & N Haveli TRUE
Daman & Diu Daman & Diu TRUE
Delhi Delhi TRUE
Puducherry Puducherry TRUE

 

So we see that the order is the same; however, there are three names that don’t match even though they should. We can fix this discrepancy by using the row names of one dataset for the other:

rownames(cog) = rownames(socio) #use rownames from socio for cog

This makes the rownames of cog the same as those for socio. Now they are ready for merging.

Incidentally, since the order is the same, we could have simply merged with the command:

cbind(cog, socio)

However it is good to use merge_datasets() since it is so much more generally useful.

Missing and broken data

Next up, we examine missing data and broken data.

#examine missing data
miss.table(socio)
miss.table(cog)
table(miss.case(socio))
matrixplot(socio)

The first, miss.table(), is another custom function from mega_functions. It outputs the number of missing values per variable. The outputs are:

Lit  II  TI GDP IMR CMR FER  LE LAT  CL  MS 
  0   0   0   6   0   4   0   0   0   0   0
T1 T2 T3 T4 T5 CA 
 0  0  0  0  0  0

So we see that there are 10 missing values in the socio and 0 in cog.

Next we want to see how these are missing. We can do this e.g. by plotting it with a nice function like matrixplot() (from VIM) or by tabling the missing cases. Output:

 0  1  2 
27  2  4

matrixplot

So we see that there are a few cases that miss data from 1 or 2 variables. Nothing serious.

One could simply ignore this, but that would not be utilizing the data to the full extent possible. The correct solution is to impute data rather than remove cases with missing data.

However, before we do this, look at the TI variable above. The greyscale shows the standardized values of the datapoints. So in this variable we see that there is one very strong outlier. If we take a look back at the data table, we see that it is likely an input error. All the other datapoints have values between 0 and 1, but the one for Uttarkhand is 247,200.595… I don’t see how the input error happened, so the best way is to remove it:

#fix broken datapoint
socio["Uttarkhand","TI"] = NA

Then, we impute the missing data in the socio variable:

#impute data
socio2 = irmi(socio, noise.factor = 0) #no noise

The second parameter controls the noise used for multiple imputation, which we don’t use here. Setting it to 0 means that the imputation is deterministic and hence exactly reproducible for other researchers.

Finally, we can compare the non-imputed dataset to the imputed one:

#compare desp stats
describe(socio)
describe(socio2)
round(describe(socio)-describe(socio2),2) #discrepancy values, rounded

The output is large, so I won’t show it here, but it shows that the means, sd, range, etc. of the variables with and without imputation are similar which means that we didn’t completely mess up the data by the procedure.

Finally, we merge the data to one dataset:

#merge data
data = merge.datasets(cog,socio2,1) # merge above

Next, we want to do factor analysis to extract the general socioeconomic factor and the general intelligence factor from their respective indicators. And then we add them back to the main dataset:

#factor analysis
fa = fa(data[1:5]) #fa on cognitive data
fa
data["G"] = as.numeric(fa$scores)

fa2 = fa(data[7:14]) #fa on SE data
fa2
data["S"] = as.numeric(fa2$scores)

Columns 1-5 are the 5 cognitive measures. Columns 7:14 are the socioeconomic ones. One can disagree about the literacy variable, which could be taken as belonging to the cognitive variables rather than the socioeconomic ones. It is similar to the third cognitive variable, which is a language test. I follow the practice of the authors.

The output from the first factor analysis is:

    MR1     h2   u2 com
T1 0.40 0.1568 0.84   1
T2 0.10 0.0096 0.99   1
T3 0.46 0.2077 0.79   1
T4 0.93 0.8621 0.14   1
T5 0.92 0.8399 0.16   1
Proportion Var 0.42

This uses the default settings, i.e. the minimum residuals method. Since the extraction method typically does not matter (except for PCA on small datasets), this is fine.

All loadings are positive as expected, but T2 is only slightly so.

We put the factor scores back into the dataset and call it “G” (Rindermann, 2007).

The factor analysis output for socioeconomic variables is:

      MR1    h2   u2 com
Lit  0.79 0.617 0.38   1
II   0.36 0.128 0.87   1
TI   0.91 0.824 0.18   1
GDP  0.76 0.579 0.42   1
IMR -0.92 0.842 0.16   1
CMR -0.85 0.721 0.28   1
FER -0.84 0.709 0.29   1
LE   0.14 0.019 0.98   1
Proportion Var 0.55

Strong positive loadings for the proportion of the population that is literate (Lit), the teachers’ index (TI) and GDP; a medium positive loading for the infrastructure index (II); and a weak positive loading for life expectancy (LE). Strong negative loadings for the infant mortality rate (IMR), child mortality rate (CMR) and fertility (FER). All of these are in the expected directions.

Then we extract the factor scores and add them back to the dataset and call them “S”.

Correlations

Finally, we want to check out the correlations with G and S.

#Pearson results
results = rcorr2(data)
View(results$r)  #view all correlations
results$r[,18:19] #S and G correlations
results$n #sample size

#Spearman
results.s = rcorr2(data, type="spearman") #spearman
View(results.s$r) #view all correlations

#discrepancies
results.c = results$r-results.s$r

We look at both the Pearson and Spearman correlations because the data may not be normal and may have outliers; Spearman’s is resistant to these problems. The discrepancy values show how much larger the Pearson correlations are than the Spearman ones.

There are too many correlations to output here, so we focus on those involving G and S (columns 18:19).

Variable      G       S
T1         0.41    0.41
T2         0.10   -0.39
T3         0.48    0.16
T4         0.97    0.62
T5         0.96    0.53
CA         0.87    0.38
Lit        0.66    0.81
II         0.45    0.37
TI         0.40    0.93
GDP        0.40    0.78
IMR       -0.60   -0.94
CMR       -0.54   -0.87
FER       -0.56   -0.86
LE         0.01    0.14
LAT       -0.53   -0.34
CL        -0.63   -0.54
MS        -0.24   -0.08
G          1.00    0.59
S          0.59    1.00

 

So we see that G and S correlate at .59, fairly high and similar to previous within-country results with immigrant groups (.54 in Denmark, .59 in Norway; Kirkegaard, 2014a; Kirkegaard & Fuerst, 2014), but not quite as high as the between-country results (.86-.87; Kirkegaard, 2014b). Lynn and Yadav mention that data exist for France, Britain and the US. These can serve for reanalyses with respect to S factors at the regional/state level.

Finally, we may want to plot the main result:

#Plots
title = paste0("India: State G-factor correlates ",round(results$r["S","G"],2)," with state S-factor, N=",results$n["S","G"])
scatterplot(S ~ G, data, smoother=FALSE, id.n=nrow(data),
            xlab = "G, extracted from 5 indicators",
            ylab = "S, extracted from 11 indicates",
            main = title)

G_S

It would be interesting if one could obtain genomic admixture measures for each state and see how they relate, since this has been found repeatedly elsewhere and is a strong prediction from genetic theory.

Update

Lynn has sent me the correct datapoint: it is 0.595. The imputed value was around .72. I reran the analysis with this value, imputing the rest as before. It doesn’t change much; the new results are slightly stronger.

          Correct datapoint      Discrepancy scores
              G      S              G      S
T1         0.41   0.42           0.00  -0.01
T2         0.10  -0.37           0.00  -0.02
T3         0.48   0.18           0.00  -0.02
T4         0.97   0.63           0.00  -0.02
T5         0.96   0.54           0.00  -0.01
CA         0.87   0.40           0.00  -0.02
Lit        0.66   0.81           0.00  -0.01
II         0.45   0.37           0.00   0.00
TI         0.42   0.92          -0.02   0.01
GDP        0.40   0.78           0.00   0.00
IMR       -0.60  -0.95           0.00   0.00
CMR       -0.54  -0.87           0.00   0.00
FER       -0.56  -0.86           0.00  -0.01
LE         0.01   0.14           0.00   0.00
LAT       -0.53  -0.35           0.00   0.01
CL        -0.63  -0.54           0.00   0.00
MS        -0.24  -0.09           0.00   0.00
G          1.00   0.61           0.00  -0.01
S          0.61   1.00          -0.01   0.00

 

Method of correlated vectors

This study is special in that we have two latent variables, each with its own set of indicator variables. This means that we can use Jensen’s method of correlated vectors (MCV; Jensen, 1998), and also a new version, which I shall creatively dub “double MCV” (DMCV), using both latent factors instead of only one.

The method consists of correlating the factor loadings of a set of indicator variables for a factor with the correlations of each indicator variable with a criterion variable. Jensen used this with the general intelligence factor (g factor) and its subtests, with criterion variables such as inbreeding depression in IQ scores and brain size.

So, to do regular MCV in this study, we first choose either the S or the G factor. Then we correlate the loadings of each indicator with its correlation with the criterion variable, i.e. the factor we didn’t choose.

Doing this analysis is in fact very easy here, because the table of correlations with S and G above contains exactly the numbers we need to correlate.

## MCV
#Double MCV
round(cor(results$r[1:14,18],results$r[1:14,19]),2)
#MCV on G
round(cor(results$r[1:5,18],results$r[1:5,19]),2)
#MCV on S
round(cor(results$r[7:14,18],results$r[7:14,19]),2)

The results are: .87, .89, and .97. In other words, MCV gives a strong indication that it is the latent traits that are responsible for the observed correlations.

References

Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger.

Kirkegaard, E. O. W. (2014a). Crime, income, educational attainment and employment among immigrant groups in Norway and Finland. Open Differential Psychology.

Kirkegaard, E. O. W., & Fuerst, J. (2014). Educational attainment, income, use of social benefits, crime rate and the general socioeconomic factor among 71 immigrant groups in Denmark. Open Differential Psychology.

Kirkegaard, E. O. W. (2014b). The international general socioeconomic factor: Factor analyzing international rankings. Open Differential Psychology.

Lynn, R., & Yadav, P. (2015). Differences in cognitive ability, per capita income, infant mortality, fertility and latitude across the states of India. Intelligence, 49, 179-185.

Rindermann, H. (2007). The g‐factor of international cognitive ability comparisons: The homogeneity of results in PISA, TIMSS, PIRLS and IQ‐tests across nations. European Journal of Personality, 21(5), 667-706.

Simpler way to correct for restriction of range?

Restriction of range occurs when the variance in some variable is reduced compared to the true population variance. This lowers the correlation between this variable and other variables. It is a common problem in research on students, who are selected for general intelligence (GI) and hence have a lower variance. This means that correlations between GI and whatever else is measured in student samples are too low.

There are some complicated ways to correct for restriction of range. The usual formula used is this:

r̂_XY = r_xy * (S_X/s_x) / sqrt(1 - r_xy^2 + r_xy^2 * (S_X^2/s_x^2))

which is also known as Thorndike’s case 2, or Pearson’s 1903 formula. Capital XY are the unrestricted variables, xy the restricted. The hat on r means estimated.

However, in a paper under review I used a much simpler formula, namely corrected r = uncorrected r / (SD_restricted/SD_unrestricted), which seemed to give about the right results. But I wasn’t sure this was legit, so I did some simulations.

First, I selected a large range of true population correlations (.1 to .8) and a large range of selectivity (.1 to .9). Then I generated very large datasets with each population correlation. Then, for each restriction, I cut off the datapoints where one variable was below the cutoff point and calculated the correlation in that restricted dataset. Then I calculated the corrected correlation and saved both pieces of information.

This gives us these correlations in the restricted samples (N=1,000,000):

cor/restriction R 0.1 R 0.2 R 0.3 R 0.4 R 0.5 R 0.6 R 0.7 R 0.8 R 0.9
r 0.1 0.09 0.08 0.07 0.07 0.06 0.06 0.05 0.05 0.04
r 0.2 0.17 0.15 0.14 0.13 0.12 0.11 0.10 0.09 0.08
r 0.3 0.26 0.23 0.22 0.20 0.19 0.17 0.16 0.14 0.12
r 0.4 0.35 0.32 0.29 0.27 0.26 0.24 0.22 0.20 0.17
r 0.5 0.44 0.40 0.37 0.35 0.33 0.31 0.28 0.26 0.23
r 0.6 0.53 0.50 0.47 0.44 0.41 0.38 0.36 0.33 0.29
r 0.7 0.64 0.60 0.57 0.54 0.51 0.48 0.45 0.42 0.37
r 0.8 0.75 0.71 0.68 0.65 0.63 0.60 0.56 0.53 0.48

 

The true population correlation is in the left margin, and the amount of restriction is in the columns. So we see the effect of restricting the range.

Now, here’s the corrected correlations by my method:

cor/restriction R 0.1 R 0.2 R 0.3 R 0.4 R 0.5 R 0.6 R 0.7 R 0.8 R 0.9
r 0.1 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.09
r 0.2 0.20 0.20 0.20 0.20 0.21 0.21 0.20 0.20 0.20
r 0.3 0.30 0.31 0.31 0.31 0.31 0.31 0.30 0.30 0.29
r 0.4 0.41 0.41 0.42 0.42 0.42 0.42 0.42 0.42 0.42
r 0.5 0.52 0.53 0.53 0.54 0.54 0.55 0.55 0.56 0.56
r 0.6 0.63 0.65 0.66 0.67 0.68 0.69 0.70 0.70 0.72
r 0.7 0.76 0.79 0.81 0.83 0.84 0.86 0.87 0.89 0.90
r 0.8 0.89 0.93 0.97 1.01 1.04 1.07 1.10 1.13 1.17

 

Now, the first 3 rows are fairly close, deviating by at most .01, but the rest deviate progressively more. The discrepancies are these:

cor/restriction R 0.1 R 0.2 R 0.3 R 0.4 R 0.5 R 0.6 R 0.7 R 0.8 R 0.9
r 0.1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.01
r 0.2 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.00 0.00
r 0.3 0.00 0.01 0.01 0.01 0.01 0.01 0.00 0.00 -0.01
r 0.4 0.01 0.01 0.02 0.02 0.02 0.02 0.02 0.02 0.02
r 0.5 0.02 0.03 0.03 0.04 0.04 0.05 0.05 0.06 0.06
r 0.6 0.03 0.05 0.06 0.07 0.08 0.09 0.10 0.10 0.12
r 0.7 0.06 0.09 0.11 0.13 0.14 0.16 0.17 0.19 0.20
r 0.8 0.09 0.13 0.17 0.21 0.24 0.27 0.30 0.33 0.37

 

So, if we can figure out how to predict the values in these cells from the two values in the row and column, one can make a simpler way to correct for restriction.

Or, we can just use the correct formula, and then we get:

cor/restriction R 0.1 R 0.2 R 0.3 R 0.4 R 0.5 R 0.6 R 0.7 R 0.8 R 0.9
r 0.1 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.09
r 0.2 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.21 0.20
r 0.3 0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.30
r 0.4 0.40 0.40 0.40 0.40 0.40 0.40 0.40 0.39 0.39
r 0.5 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.49
r 0.6 0.60 0.60 0.60 0.60 0.60 0.60 0.60 0.60 0.60
r 0.7 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.71
r 0.8 0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80

 

With discrepancies:

cor/restriction R 0.1 R 0.2 R 0.3 R 0.4 R 0.5 R 0.6 R 0.7 R 0.8 R 0.9
r 0.1 0 0 0 0 0 0 0 0 -0.01
r 0.2 0 0 0 0 0 0 0 0.01 0
r 0.3 0 0 0 0 0 0 0 0 0
r 0.4 0 0 0 0 0 0 0 -0.01 -0.01
r 0.5 0 0 0 0 0 0 0 0 -0.01
r 0.6 0 0 0 0 0 0 0 0 0
r 0.7 0 0 0 0 0 0 0 0 0.01
r 0.8 0 0 0 0 0 0 0 0 0

 

Pretty good!

Also, I need to re-do my paper.


R code:

library(MASS)
library(Hmisc)
library(psych)

pop.cors = seq(.1,.8,.1) #population correlations to test
restrictions = seq(.1,.9,.1) #restriction of ranges in centiles
sample = 1000000 #sample size

#empty dataframe for results
results = data.frame(matrix(nrow=length(pop.cors),ncol=length(restrictions)))
colnames(results) = paste("R",restrictions)
rownames(results) = paste("r",pop.cors)
results.c = results
results.c2 = results

#and fetch!
for (pop.cor in pop.cors){ #loop over population cors
  data = mvrnorm(sample, mu = c(0,0), Sigma = matrix(c(1,pop.cor,pop.cor,1), ncol = 2),
                 empirical = TRUE) #generate data
  rowname = paste("r",pop.cor) #get current row names
  for (restriction in restrictions){ #loop over restrictions
    colname = paste("R",restriction) #get current col names
    z.cutoff = qnorm(restriction) #find cut-off
    rows.to.keep = data[,1] > z.cutoff #which rows to keep
    rdata = data[rows.to.keep,] #cut away data
    cor = rcorr(rdata)$r[1,2] #get cor
    results[rowname,colname] = cor #add cor to results
    sd = describe(rdata)$sd[1] #find restricted sd
    cor.c = cor/sd #corrected cor, simple formula
    results.c[rowname,colname] = cor.c #add cor to results
    
    cor.c2 = cor/sqrt(cor^2+sd^2-sd^2*cor^2) #correct formula
    results.c2[rowname,colname] = cor.c2 #add cor to results
  }
}

#how much are they off by?
discre = results.c
for (num in 1:length(pop.cors)){
  cor = pop.cors[num]
  discre[num,] = discre[num,]-cor
}

discre2 = results.c2
for (num in 1:length(pop.cors)){
  cor = pop.cors[num]
  discre2[num,] = discre2[num,]-cor
}

THC and driving: Causal modeling and statistical controls

I am a major proponent of drug legalization, and have also been following the research on drugs’ influence on driving skills. In media discourse, it is taken for granted that driving under the influence (DUI) is bad because it causes crashes. This is usually assumed to be true for drugs in general, but THC (cannabis) and alcohol get special attention. Unfortunately, most of the research on the topic is non-experimental, and so open to multiple causal interpretations. I will focus on the recently published report, Drug and Alcohol Crash Risk (US Dept. of Transportation), which I found via The Washington Post.

The study is a case-control design where they try to adjust for potential correlates and causal factors both by statistical means and by data-collection means. Specifically:

The case control crash risk study reported here is the first large-scale study in the United States to include drugs other than alcohol. It was designed to estimate the risk associated with alcohol- and drug-positive driving. Virginia Beach, Virginia, was selected for this study because of the outstanding cooperation of the Virginia Beach Police Department and other local agencies with our stringent research protocol. Another reason for selection was that Virginia Beach is large enough to provide a sufficient number of crashes for meaningful analysis. Data was collected from more than 3,000 crash-involved drivers and 6,000 control drivers (not involved in crashes). Breath alcohol measurements were obtained from a total of 10,221 drivers, oral fluid samples from 9,285 drivers, and blood samples from 1,764 drivers.

Research teams responded to crashes 24 hours a day, 7 days a week over a 20-month period. In order to maximize comparability, efforts were made to match control drivers to each crash-involved driver. One week after a driver involved in a crash provided data for the study, control drivers were selected at the same location, day of week, time of day, and direction of travel as the original crash. This allowed a comparison to be made between use of alcohol and other drugs by drivers involved in a crash with drivers not in a crash, resulting in an estimation of the relative risk of crash involvement associated with alcohol or drug use. In this study, the term marijuana is used to refer to drivers who tested positive for delta-9-tetrahydrocannabinal (THC). THC is associated with the psychoactive effects of ingesting marijuana. Drivers who tested positive for inactive cannabinoids were not considered positive for marijuana. More information on the methodology of this study and other methods of estimating crash risk is presented later in this Research Note.

So, by design, they control for location, day of week, time of day, and direction of travel. It is also good that they don’t conflate inactive metabolites with THC, as is commonly done.

The basic results are shown in Tables 1 and 3.

DUI_table1 DUI_table3

The first shows the raw data, so to speak. It can be seen that drug use while driving is fairly common at about 15% both in crash drivers and normal drivers. Since their testing probably didn’t detect all possible drugs, these are underestimates (assuming that the testing does not bias it with uneven false positive/false negative rates).

Now, the authors write:

These unadjusted odds ratios must be interpreted with caution as they do not account for other factors that may contribute to increased crash risk. Other factors, such as demographic variables, have been shown to have a significant effect on crash risk. For example, male drivers have a higher crash rate than female drivers. Likewise, young drivers have a higher crash rate than older drivers. To the extent that these demographic variables are correlated with specific types of drug use, they may account for some of the increased crash risk associated with drug use.

Table 4 examines the odds ratios for the same categories and classes of drugs, adjusted for the demographic variables of age, gender, and race/ethnicity. This analysis shows that the significant increased risk of crash involvement associated with THC and illegal drugs shown in Table 3 is not found after adjusting for these demographic variables. This finding suggests that these demographic variables may have co-varied with drug use and accounted for most of the increased crash risk. For example, if the THC-positive drivers were predominantly young males, their apparent crash risk may have been related to age and gender rather than use of THC.

Table 4 looks like this, and for comparison, Table 6 for alcohol:

DUI_table4 DUI_table6

The authors do not state anything outright false. But they mention only one causal model that fits the data, the one where THC’s role is non-causal. However, it is more proper to show both models openly:

Causal models of driving, drug use and demographic variables

The first model is the one discussed by the authors. Here demographic variables cause THC use and crashing, but THC use has no effect on crashing. THC use and crashing are statistically associated because they have a common cause. In the second model, demographic variables cause both THC use and crashing, and THC use also causes crashing. In both models, if one controls for demographic variables, the statistical association of THC use and crashing disappears. Hence, controlling for demographic variables cannot distinguish between these two important models.
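To see why the adjusted analysis alone cannot settle the matter, here is a toy simulation of the first model, in which a demographic variable causes both THC use and crashing while THC itself has no effect (all numbers are made up):

set.seed(1)
n = 100000
young.male = rbinom(n, 1, .3) #demographic confounder
thc = rbinom(n, 1, plogis(-3 + 2*young.male)) #THC use caused by demographics
crash = rbinom(n, 1, plogis(-4 + young.male)) #crashing caused by demographics only
coef(summary(glm(crash ~ thc, family = binomial))) #spurious positive THC 'effect'
coef(summary(glm(crash ~ thc + young.male, family = binomial))) #association disappears after adjustment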

However, they can test the second model by controlling for THC use and seeing if demographic variables are still associated with crashing. If they are not, the second model above is falsified (assuming no false negative/adequate statistical power).

Alcohol was still associated with crashing even controlling for demographic variables, which strengthens the case for its causal effect.

How common is alcohol driving?

Incidentally, some interesting statistics on DUI for alcohol:

The differences between the two studies in the proportion of drivers found to be alcohol-positive are likely to have resulted from the concentration of Roadside Survey data collection on weekend nighttime hours, while this study included data from all days of the week and all hours of the day. For example, in the 2007 Roadside Survey the percentage of alcohol-positive weekday daytime drivers was only 1.0 percent, while on weekend nights 12.4 percent of the drivers were alcohol-positive. In this study, 1.9 percent of weekday daytime drivers were alcohol- positive, while 9.4 percent of weekend nighttime drivers were alcohol-positive.

Assuming the causal model where alcohol increases crash risk is correct, this must result in quite a lot of extra deaths in traffic. Another reason to fund more research into safer vehicles.

Review: The Second Machine Age (Brynjolfsson & McAfee)

Goodreads. Libgen.

This book is background material for CGPGrey's great short film.

So, if you saw that and are more curious, perhaps this book is for you. If the film above is not interesting to you, the book will be useless. Generally, the film conveys the topic better than the book, but the book of course contains more information.

The main flaw of the book is that the authors speculate about various economic and educational changes without knowing about differential psychology and behavior genetics. For instance, they note that median income has been falling. They do not seem to consider that this may be partly due to a changing population in the US (relatively fewer Europeans, more Hispanics). Another example: they compare the average income over time of people with only a high school education to that of people with a college degree. They do not realize that, due to the increased uptake of college education, the mean GMA of people with only high school has been falling steadily. So the income trend does not necessarily have anything to do with educational attainment itself, as they think.

The most interesting section was this:

What This Problem Needs Are More Eyeballs and Bigger Computers

If this response is at least somewhat accurate—if it captures something about how innovation and economic growth work in the real world—then the best way to accelerate progress is to increase our capacity to test out new combinations of ideas. One excellent way to do this is to involve more people in this testing process, and digital technologies are making it possible for ever more people to participate. We’re interlinked by global ICT [Information and Communication Technology], and we have affordable access to masses of data and vast computing power. Today’s digital environment, in short, is a playground for large-scale recombination. The open source software advocate Eric Raymond has an optimistic observation: “Given enough eyeballs, all bugs are shallow.”20 The innovation equivalent to this might be, “With more eyeballs, more powerful combinations will be found.”

NASA experienced this effect as it was trying to improve its ability to forecast solar flares, or eruptions on the sun’s surface. Accuracy and plenty of advance warning are both important here, since solar particle events (or SPEs, as flares are properly known) can bring harmful levels of radiation to unshielded gear and people in space. Despite thirty-five years of research and data on SPEs, however, NASA acknowledged that it had “no method available to predict the onset, intensity or duration of a solar particle event.”21

The agency eventually posted its data and a description of the challenge of predicting SPEs on Innocentive, an online clearinghouse for scientific problems. Innocentive is ‘non-credentialist’; people don’t have to be PhDs or work in labs in order to browse the problems, download data, or upload a solution. Anyone can work on problems from any discipline; physicists, for example, are not excluded from digging in on biology problems.

As it turned out, the person with the insight and expertise needed to improve SPE prediction was not part of any recognizable astrophysics community. He was Bruce Cragin, a retired radio frequency engineer living in a small town in New Hampshire. Cragin said that, “Though I hadn’t worked in the area of solar physics as such, I had thought a lot about the theory of magnetic reconnection.”22 This was evidently the right theory for the job, because Cragin’s approach enabled prediction of SPEs eight hours in advance with 85 percent accuracy, and twenty-four hours in advance with 75 percent accuracy. His recombination of theory and data earned him a thirty-thousand-dollar reward from the space agency.

In recent years, many organizations have adopted NASA’s strategy of using technology to open up their innovation challenges and opportunities to more eyeballs. This phenomenon goes by several names, including ‘open innovation’ and ‘crowdsourcing,’ and it can be remarkably effective. The innovation scholars Lars Bo Jeppesen and Karim Lakhani studied 166 scientific problems posted to Innocentive, all of which had stumped their home organizations. They found that the crowd assembled around Innocentive was able to solve forty-nine of them, for a success rate of nearly 30 percent. They also found that people whose expertise was far away from the apparent domain of the problem were more likely to submit winning solutions. In other words, it seemed to actually help a solver to be ‘marginal’—to have education, training, and experience that were not obviously relevant for the problem. Jeppesen and Lakhani provide vivid examples of this:

[There were] different winning solutions to the same scientific challenge of identifying a food-grade polymer delivery system by an aerospace physicist, a small agribusiness owner, a transdermal drug delivery specialist, and an industrial scientist. . . . All four submissions successfully achieved the required challenge objectives with differing scientific mechanisms. . . .

[Another case involved] an R&D lab that, even after consulting with internal and external specialists, did not understand the toxicological significance of a particular pathology that had been observed in an ongoing research program. . . . It was eventually solved, using methods common in her field, by a scientist with a Ph.D. in protein crystallography who would not normally be exposed to toxicology problems or solve such problems on a routine basis.23

Like Innocentive, the online startup Kaggle also assembles a diverse, non-credentialist group of people from around the world to work on tough problems submitted by organizations. Instead of scientific challenges, Kaggle specializes in data-intensive ones where the goal is to arrive at a better prediction than the submitting organization’s starting baseline prediction. Here again, the results are striking in a couple of ways. For one thing, improvements over the baseline are usually substantial. In one case, Allstate submitted a dataset of vehicle characteristics and asked the Kaggle community to predict which of them would have later personal liability claims filed against them.24 The contest lasted approximately three months and drew in more than one hundred contestants. The winning prediction was more than 270 percent better than the insurance company’s baseline.

Another interesting fact is that the majority of Kaggle contests are won by people who are marginal to the domain of the challenge—who, for example, made the best prediction about hospital readmission rates despite having no experience in health care—and so would not have been consulted as part of any traditional search for solutions. In many cases, these demonstrably capable and successful data scientists acquired their expertise in new and decidedly digital ways.

Between February and September of 2012 Kaggle hosted two competitions about computer grading of student essays, which were sponsored by the Hewlett Foundation.* Kaggle and Hewlett worked with multiple education experts to set up the competitions, and as they were preparing to launch many of these people were worried. The first contest was to consist of two rounds. Eleven established educational testing companies would compete against one another in the first round, with members of Kaggle’s community of data scientists invited to join in, individually or in teams, in the second. The experts were worried that the Kaggle crowd would simply not be competitive in the second round. After all, each of the testing companies had been working on automatic grading for some time and had devoted substantial resources to the problem. Their hundreds of person-years of accumulated experience and expertise seemed like an insurmountable advantage over a bunch of novices.

They needn’t have worried. Many of the ‘novices’ drawn to the challenge outperformed all of the testing companies in the essay competition. The surprises continued when Kaggle investigated who the top performers were. In both competitions, none of the top three finishers had any previous significant experience with either essay grading or natural language processing. And in the second competition, none of the top three finishers had any formal training in artificial intelligence beyond a free online course offered by Stanford AI faculty and open to anyone in the world who wanted to take it. People all over the world did, and evidently they learned a lot. The top three individual finishers were from, respectively, the United States, Slovenia, and Singapore.

Quirky, another Web-based startup, enlists people to participate in both phases of Weitzman’s recombinant innovation—first generating new ideas, then filtering them. It does this by harnessing the power of many eyeballs not only to come up with innovations but also to filter them and get them ready for market. Quirky seeks ideas for new consumer products from its crowd, and also relies on the crowd to vote on submissions, conduct research, suggest improvements, figure out how to name and brand the products, and drive sales. Quirky itself makes the final decisions about which products to launch and handles engineering, manufacturing, and distribution. It keeps 70 percent of all revenue made through its website and distributes the remaining 30 percent to all crowd members involved in the development effort; of this 30 percent, the person submitting the original idea gets 42 percent, those who help with pricing share 10 percent, those who contribute to naming share 5 percent, and so on. By the fall of 2012, Quirky had raised over $90 million in venture capital financing and had agreements to sell its products at several major retailers, including Target and Bed Bath & Beyond. One of its most successful products, a flexible electrical power strip called Pivot Power, sold more than 373 thousand units in less than two years and earned the crowd responsible for its development over $400,000.

I take this to mean that: 1) polymathy/interdisciplinarity is not dead or dying at all; it is in fact very useful; 2) to make oneself very useful, one should focus on learning a bunch of unrelated methods for analyzing data, and when studying a field, one should attempt to use methods not commonly used in that field; 3) work related to AI, machine learning, etc. is the future (until we are completely unable to compete with computers).

Review: What Intelligence Tests Miss: The Psychology of Rational Thought (Stanovich, 2009)

www.goodreads.com/book/show/6251150-what-intelligence-tests-miss

MOBI on libgen

I’ve seen this book cited quite a few times, and when looking for what to read next, it seemed like an okay choice. The book is written in typical popscience style: no crucial statistical information about the studies is mentioned, so it is impossible for the skeptical reader to know which claims to believe and which not to.

For instance, he spends quite a while talking about how IQ/SAT etc. do not correlate strongly with rationality measures. Rarely does he mention the exact effect size. He does not mention whether it was measured as a correlation of IQ with single-item rationality measures. Single items have lower reliability, which reduces correlations, and are usually dichotomous, which also lowers (Pearson) correlations (simulation results here; TL;DR: multiply by 1.266 for dichotomous items). Nor does he say whether the samples were university students, which lowers correlations because such samples are selected for g and (maybe) rationality. The OKCupid dataset happens to contain a number of rationality items (e.g. astrology, religiousness); I have already noted on Twitter that these correlate with g in the expected direction.
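
To illustrate the dichotomization point, here is a minimal Python sketch with simulated data (the true correlation of .40 and the sample size are arbitrary assumptions): splitting a continuous rationality score at the median attenuates its Pearson correlation with IQ by a factor close to the 1.266 mentioned above.

# Dichotomizing one variable attenuates the Pearson correlation.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
r_true = 0.40

iq = rng.standard_normal(n)
rationality = r_true * iq + np.sqrt(1 - r_true**2) * rng.standard_normal(n)
item = (rationality > 0).astype(float)         # single pass/fail item (median split)

r_cont = np.corrcoef(iq, rationality)[0, 1]    # about .40
r_dich = np.corrcoef(iq, item)[0, 1]           # about .32
print(round(r_cont, 3), round(r_dich, 3), round(r_cont / r_dich, 3))  # ratio about 1.25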

Otherwise the book feels like reading Kahneman's Thinking, Fast and Slow. It covers the most well-known heuristics and how they sometimes lead us astray (representativeness, ease of recall, framing effect, status quo bias, planning bias, etc.).

The book can be read by researchers with some gain in knowledge, but don't expect that much. For the serious newcomer, it is better to read a textbook on the topic (unfortunately, I cannot recommend one, as I have yet to read one myself). For the curious layperson, I guess it is okay.

Therefore, it came as something of a surprise when scores on various college placement exams and Armed Forces tests that the president had taken over the years were converted into an estimated IQ score. The president’s [Bush #2] score was approximately 120, roughly the same as that of Bush’s opponent in the 2004 presidential election, John Kerry, when Kerry’s exam results from young adulthood were converted into IQ scores using the same formulas. These results surprised many critics of the president (as well as many of his supporters), but I, as a scientist who studies individual differences in cognitive skills, was not surprised.

Virtually all commentators on the president’s cognition, including sympathetic commentators such as his onetime speechwriter David Frum, admit that there is something suboptimal about the president’s thinking. The mistake they make is assuming that all intellectual deficiencies are reflected in a lower IQ score.

In a generally positive portrait of the president, Frum nonetheless notes that “he is impatient and quick to anger; sometimes glib, even dogmatic; often uncurious and as a result ill-informed” (2003, p. 272). Conservative commentator George Will agrees, when he states that in making Supreme Court appointments, the president “has neither the inclination nor the ability to make sophisticated judgments about competing approaches to construing the Constitution” (2005, p. 23).

Seems fishy. One obvious idea is that he has suffered some kind of brain damage since his recorded score. Since the estimate is based on an SAT score, it is also possible that he had considerable help on the SAT test. It is true that SAT prepping does not generally work well and has diminishing returns, but surely Bush had quite a lot of help, as he comes from a very rich and prestigious family. (I once read a recent meta-analysis of SAT prepping/coaching, but I can't find it again. The mean effect size was about .25, which corresponds to 3.75 IQ points.)
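
For reference, the conversion from the standardized effect size to IQ points is just a rescaling by the IQ standard deviation of 15:

$$ d \times SD_{IQ} = 0.25 \times 15 = 3.75 \ \text{IQ points} $$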

See also: en.wikipedia.org/wiki/U.S._Presidential_IQ_hoax

Actually, we do not have to speculate about the proportion of high-IQ people with these beliefs. Several years ago, a survey of paranormal beliefs was given to members of a Mensa club in Canada, and the results were instructive. Mensa is a club restricted to high-IQ individuals, and one must pass IQ-type tests to be admitted. Yet 44 percent of the members of this club believed in astrology, 51 percent believed in biorhythms, and 56 percent believed in the existence of extraterrestrial visitors - all beliefs for which there is not a shred of evidence.

Seems fishy too. Maybe MENSA just attracts irrational smart people. I know someone who is in Danish MENSA, so I can perhaps do a new survey.

Rational thinking errors appear to arise from a variety of sources - it is unlikely that anyone will propose a psychometric g of rationality. Irrational thinking does not arise from a single cognitive problem, but the research literature does allow us to classify thinking into smaller sets of similar problems. Our discussion so far has set the stage for such a classification system, or taxonomy. First, though, I need to introduce one additional feature in the generic model of the mind outlined in Chapter 3.

But that is exactly what I will propose. What is the factor structure of rationality? Is there a general factor? Is it hierarchical? Is rationality perhaps a second-order factor of g? I take inspiration from the study of ‘Emotional Intelligence’ as a second-stratum factor (MacCann et al., 2014).
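
As a rough illustration of what such a hierarchical analysis could look like, here is a Python sketch with simulated data; the domain names, loadings and sample size are all made up. Several cognitive and several rationality subtests are each summarized with a unit-weighted first-order score, and the correlation between the two domain scores is what a second-order (hierarchical) model would route through a general factor.

# Simulate two domains (cognitive, rationality) whose first-order factors
# both load on a common second-order factor g, then correlate the
# unit-weighted domain scores.
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
g = rng.standard_normal(n)                     # second-order factor

def simulate_domain(general_loading, n_tests):
    """n_tests indicators of a domain factor that itself loads on g."""
    domain_factor = general_loading * g + np.sqrt(1 - general_loading**2) * rng.standard_normal(n)
    return np.column_stack([
        0.7 * domain_factor + 0.7 * rng.standard_normal(n)
        for _ in range(n_tests)
    ])

cognitive = simulate_domain(0.9, 5)            # e.g. ICAR-type subtests
rationality = simulate_domain(0.6, 5)          # e.g. CRT, belief bias, framing tasks

cog_score = cognitive.mean(axis=1)
rat_score = rationality.mean(axis=1)
print(round(np.corrcoef(cog_score, rat_score)[0, 1], 2))   # driven entirely by g in this simulation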

The next category (defaulting to the autonomous mind and not engaging at all in Type 2 processing) is the most shallow processing tendency of the cognitive miser. The ability to sustain Type 2 processing is of course related to intelligence. But the tendency to engage in such processing or to default to autonomous processes is a property of the reflective mind that is not assessed on IQ tests. Consider the Levesque problem (“Jack is looking at Anne but Anne is looking at George”) as an example of avoiding Type 2 processing. The subjects who answer this problem correctly are no higher in intelligence than those who do not, at least in a sample of university students studied by Maggie Toplak in my own laboratory.

This sure does sound like a single right/wrong item being correlated with IQ scores in a group selected for g. He says "no higher", but perhaps his sample was also too small, and what he meant was that the difference was not significant. Samples for this kind of study are usually pretty small.

Theoretically, one might expect a positive correlation between intelligence and the tendency of the reflective mind to initiate Type 2 processing because it might be assumed that those of high intelligence would be more optimistic about the potential efficacy of Type 2 processing and thus be more likely to engage in it. Indeed, some insight tasks do show a positive correlation with intelligence, one in particular being the task studied by Shane Frederick and mentioned in Chapter 6: A bat and a ball cost $1.10 in total. The bat costs $1 more than the ball. How much does the ball cost? Nevertheless, the correlation between intelligence and a set of similar items is quite modest, .43-.46, leaving plenty of room for performance dissociations of the type that define dysrationalia.14 Frederick has found that large numbers of high-achieving students at MIT, Princeton, and Harvard when given this and other similar problems rely on this most primitive of cognitive miser strategies.

The sum of the 3 CRT items (one of them mentioned above) correlated r = .50 with the 16-item ICAR sample test in my student data (age ~18, n = 72). These items do not perform differently when factor analyzed together with the entire item set.
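
For readers who want to run the same kind of check on their own data, the analysis might look like the sketch below; the file name and column names are hypothetical, and this is the general form of the check rather than my actual script.

# Correlate a 3-item CRT sum with a 16-item ICAR sum, then inspect how all
# 19 items load on the first principal component of the pooled item set.
import numpy as np
import pandas as pd

df = pd.read_csv("student_items.csv")          # hypothetical file of 0/1 scored items
crt_cols = [f"crt_{i}" for i in range(1, 4)]
icar_cols = [f"icar_{i}" for i in range(1, 17)]

crt_sum = df[crt_cols].sum(axis=1)
icar_sum = df[icar_cols].sum(axis=1)
print("r(CRT, ICAR) =", round(crt_sum.corr(icar_sum), 2))

items = df[crt_cols + icar_cols]
R = np.corrcoef(items.to_numpy(), rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)           # eigenvalues in ascending order
pc1_loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])
print(pd.Series(np.abs(pc1_loadings), index=items.columns).round(2))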

In numerous places he complains that society cares too much about IQ in selection, even though he admits that there is substantial evidence that it works. He also admits that there is no standard test for rationality and cites no evidence that selecting for rationality would improve outcomes (e.g. job performance, college GPA, prevention of drop-out in training programs), so it is difficult to see what he has to complain about. He should have been less bombastic. Yes, we should try rationality measures, but calling for wide-scale use before proper validation is very premature.

References

MacCann, C., Joseph, D. L., Newman, D. A., & Roberts, R. D. (2014). Emotional intelligence is a second-stratum factor of intelligence: Evidence from hierarchical and bifactor models. Emotion, 14(2), 358.

As AI improves, what is the long-term solution to spam?

I’ve thought about this question a few times but haven’t found a good solution yet. The problem needs more attention.

AI and computer recognition of images is getting better. I'll go ahead and say that the probability of reaching the level where computers are as good as humans at any kind of CAPTCHA/verify-I'm-not-a-bot test within the next 20 years is very, very high. CAPTCHAs and other newer anti-bot measures cannot continue to work.

We don't want our comment sections, discussion forums and social media full of spamming bots, so how do we keep them out? I have three types of proposals for the online world.

Make sending cost something

Generally speaking, spam does not work well. The click-rate is very, very low but it pays because sending mass amounts of it is very cheap. In the analog world, we also see spam in our mail boxes (curiously, this is allowed but online spam is not!), but not quite as much as online. The cost of sending spam in the analog world keeps the amount down (printing and postage).

Participation in the online world is generally free after one has paid one's internet bill. Generally, stuff that isn't free on the internet is not used much. Free is the new normal. Free also means the barrier for participation is very low, which is good for poor people.

The idea is that we add a cost to writing comments, but keep it very low. Since spam only works when sending large amounts (e.g. 10,000,000 emails per year), and normal human usage does not require sending comparably large amounts (<1,000 emails in most cases), we could add a per-message cost which is prohibitively large for bots but (almost) negligible for humans, e.g. 0.01 USD per email sent. Thus, human usage would cost <10 USD per year, but botting would cost 100,000 USD.
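
The arithmetic is trivial but worth writing down, using the volumes from above:

# Yearly cost at 0.01 USD per message for a human vs. a mass spammer.
price_per_message = 0.01                         # USD
human_yearly = 1_000 * price_per_message         # 10 USD
spammer_yearly = 10_000_000 * price_per_message  # 100,000 USD
print(human_yearly, spammer_yearly)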

Who gets the money? One could make it decentralized, so that the blog/newspaper/media owner gets the money. In that way, discussing on a service also supports them, altho very little.

This could maybe work for email spam, but for highly read comment sections (e.g. on major newspapers or official forums for large computer games), the rather small price to pay for writing 1000 (or even 10) posts would not be a deterrent. Hence the pay-for-use proposal fails to deal with some situations.

Making the microtransactions work should not be a problem with cryptocurrencies, which can also be sent anonymously.

Verified users by initial cost

Another idea based on payment is to set up a service where users pay a small fee (e.g. 10 USD) to be registered. The account from this service can then be used to log into the comment sections (forums etc.) of other sites and comment for free. The payment can be made with cryptocurrency as before, so anonymity can be preserved. It is also possible to create multiple outward profiles from one registered account, so that a user cannot be tracked from site to site.

If an account is found to send spam, it can be disabled and the payment is wasted. The payment does not have to be large, but it needs to be sufficient to run the service. Perhaps one can outsource the spammer-or-not decision-making to a subset of users who wish to work for free (many services rely on a subset of users to provide their time, e.g. OKCupid).
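
One way the 'multiple outward profiles' part could work in practice is sketched below; the HMAC construction and all names are my own illustration of the idea, not something specified in the proposal.

# Derive a stable but unlinkable pseudonym per site from one paid account.
import hmac, hashlib, secrets

account_secret = secrets.token_bytes(32)       # created when the registration fee is paid

def profile_id(site_domain: str) -> str:
    """Stable pseudonym for one site; different sites cannot link them."""
    return hmac.new(account_secret, site_domain.encode(), hashlib.sha256).hexdigest()[:16]

print(profile_id("example-newspaper.com"))
print(profile_id("example-forum.org"))         # looks unrelated to the one above

# If the account is caught spamming, the service marks account_secret as
# banned and stops vouching for any profile derived from it.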

The proposal has the same problem as the one above in that it requires payment to participate.

Verified users without payment

A third proposal is to set up a service where one can make a profile for free, but which requires one to somehow prove that one is a real person. This could be done with confidential information about a person, e.g. a passport number plus access to a database to verify it. This would probably require cooperation with officials in each country. They will probably keep the information about who is who if they can, so it is difficult to see how privacy could be preserved with proposals of this type.

As before, accounts will still need to be deactivated if they are found to be spamming. If the government is involved, it will surely push for other grounds for deactivation: intellectual monopoly infringement, sex work related material, foul language, national security matters and so on. This makes it my least preferred type of solution.

Other solutions?

Generally, the goal is to preserve privacy, keep participation free of cost, and keep spam out. How can it be done? Or is online discussion doomed to be overrun by spam?