Clear Language, Clear Mind

March 27, 2019

Up-to-date introductions to psychology, stats, and genomics — March 2019

Filed under: Math/Statistics,Psychology — Tags: , — Emil O. W. Kirkegaard @ 23:54

A fellow emailed me asking for help on what to start reading to learn psychology and stats. The replication crisis means that most older psychology introductions should be viewed with distrust, and the various new streams in stats mean that much of the stuff in stats textbooks, while not wrong, is less than optimal. In an ideal world, there would be a new textbook on various psychology topics that covered all the important ideas that failed replication (stereotype threat, growth mindset etc.), and especially the things that did not fail (stereotype accuracy, IQ testing, behavioral genetics [not candidate or GxE]). Unfortunately, this doesn’t exist yet. However, the below are a reasonable start in my opinion.

Psychology:

I’m open to recommendations that cover more subfields, but I’m not aware of any post-crisis introduction to e.g. social psychology.

Stats:

Bonus, for genomics I recommend:

December 23, 2018

It works in practice, but does it work in (my) theory?

There’s a certain type of person that doesn’t produce any empirical contribution to “Reducing the heredity-environment uncertainty”. Instead, they contribute various theoretical arguments which they take to undermine the empirical data others give. Usually, these people have a background in philosophy or some other theoretical field. A recent example of this pattern is seen on Reddit, where Jinkinson Payne Smith (u/EverymorningWP) made this thread:

“Heritability and Heterogeneity: The Irrelevance of Heritability in Explaining Differences between Means for Different Human Groups or Generations” includes (on its page 398, section 2.1) some interesting paragraphs that decisively refute the claims of Neven Sesardic regarding “heritability”. One particularly relevant quote is this one: “The shortcomings I describe involve matters of logic and methodology; empirical considerations are beside the point.” So those who wish to use the “hitting-them-over-the-head” style* so common of behavior geneticists, involving the deflection of conceptual, logical criticisms to focus on narrow technical issues, should keep in mind that superficial empirical concerns are not the only ones worth taking seriously.

*The term “hitting them over the head” was coined by Aaron Panofsky in his 2014 book Misbehaving Science. As defined by Burt & Simons (2015, p. 104), “This approach involves dodging criticisms by misrepresenting arguments and insinuating that critics are politically motivated and reject scientific truths as well as focusing on a few “‘tractable’ empirical objections” while “ignoring the deeper theoretical objections””.

So: It works in practice, but does it work in (my) theory? These philosophy arguments are useless. Any physics professor knows this well because they get a lot of emails allegedly refuting relativity and quantum mechanics using thought experiments and logical arguments (like Time Cube). These arguments convince no one, even if one can’t find the error in the argument immediately (as with the ontological argument). It works the same way for these anti-behavioral-genetics theoretical arguments. If these arguments are to be taken seriously, their authors should 1) produce contrasting models, 2) derive empirically testable predictions from them, and 3) show that the data fit their model and do not fit the current behavioral/quantitative genetics models.

For a historical example of this, see Jensen’s reply (pp. 451ff) to Schönemann’s sophistry (chapter 18) along the same lines regarding an obscure and empirically irrelevant problem in factor analysis (factor score indeterminacy). An excerpt:

Components analysis and factor analysis were invented and developed by the pioneers of differential psychology as a means of dealing with substantive problems in the measurement and analysis of human abilities. The first generation of factor analysts—psychologists such as Spearman, Burt, and Thurstone—were first of all psychologists, with a primary interest in the structure and nature of individual differences. For them factor analysis was but one methodological means of advancing empirical research and theory in the domain of abilities. But in subsequent generations experts in factor analysis have increasingly become more narrowly specialized. They show little or no interest in psychology, but confine their thinking to the ‘pure mathematics’ of factor analysis, without reference to any issues of substantive or theoretical importance. For some it is methodology for methodology’s sake, isolated from empirical realities, and disdainful of substantive problems and ‘dirty data’. Cut off from its origin, which was rooted in the study of human ability, some of the recent esoterica in factor analysis seem like a sterile, self-contained intellectual game, good fun perhaps, but having scarcely more relevance to anything outside itself than the game of chess. Schönemann is impressive as one of the game’s grandmasters. The so-called ‘factor indeterminacy’ problem, which is an old issue recognized in Spearman’s time, has, thanks to Schönemann, been revived as probably the most esoteric weapon in the ‘IQ controversy’.

A useful follow-up here is also:

  • Jensen, A. R., & Weng, L.-J. (1994). “What is a good g?” Intelligence, 18, 231–258.

which shows that if one extracts the g factor in lots of different ways, factor scores from these all correlate .99 with each other, so the fact that one cannot precisely give the true scores for a given set of data is empirically irrelevant.
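
To illustrate the point with a quick sketch (simulated data and the psych package; nothing here comes from the Jensen & Weng paper itself), extracting a single factor three different ways yields essentially the same scores:

library(psych)

set.seed(1)
g <- rnorm(2000)                                    # latent general factor
x <- sapply(runif(10, .4, .8),                      # 10 tests with varied loadings
            function(l) l * g + rnorm(2000, sd = sqrt(1 - l^2)))

scores <- data.frame(
  pca = principal(x, nfactors = 1)$scores[, 1],     # first principal component
  pa  = fa(x, nfactors = 1, fm = "pa")$scores[, 1], # principal axis factoring
  ml  = fa(x, nfactors = 1, fm = "ml")$scores[, 1]  # maximum likelihood
)
round(cor(scores), 3)   # all pairwise correlations ~ .99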

I must say that I do feel some sympathy with Jinkinson’s approach. I am myself somewhat of a verbal-tilt person who used to study philosophy (for my bachelor’s degree), and who used to engage in some of these ‘my a priori argument beats your data’ type arguments. I eventually wised up; I probably owe some of this to my years of drinking with the good physicists at Aarhus University, who do not care much for such empirically void arguments.

December 21, 2018

You can’t ignore gene-environment correlations when looking for gene-environment interactions

Humans love interactions: they tell interesting stories (though no study has investigated this bias, AFAIK). However, statistics and nature hate interactions. Interactions in general have a low prior, and because people fail to appreciate this, reports of interactions generally fail to replicate. This is also true for gene-environment interactions (GxE), the love-child of any would-be behavioral genetics critic (strong interactions make standard ANOVA of family data very tricky and thus let critics retreat into ‘it’s too complicated, we don’t know anything’ territory).

Some large scale failures of replication

  • Duncan, L. E., & Keller, M. C. (2011). A critical review of the first 10 years of candidate gene-by-environment interaction research in psychiatry. American Journal of Psychiatry, 168(10), 1041-1049.

Objective

Gene-by-environment interaction (G×E) studies in psychiatry have typically been conducted using a candidate G×E (cG×E) approach, analogous to the candidate gene association approach used to test genetic main effects. Such cG×E research has received widespread attention and acclaim, yet cG×E findings remain controversial. The authors examined whether the many positive cG×E findings reported in the psychiatric literature were robust or if, in aggregate, cG×E findings were consistent with the existence of publication bias, low statistical power, and a high false discovery rate.
Method

The authors conducted analyses on data extracted from all published studies (103 studies) from the first decade (2000–2009) of cG×E research in psychiatry.
Results

Ninety-six percent of novel cG×E studies were significant compared with 27% of replication attempts. These findings are consistent with the existence of publication bias among novel cG×E studies, making cG×E hypotheses appear more robust than they actually are. There also appears to be publication bias among replication attempts because positive replication attempts had smaller average sample sizes than negative ones. Power calculations using observed sample sizes suggest that cG×E studies are underpowered. Low power along with the likely low prior probability of a given cG×E hypothesis being true suggests that most or even all positive cG×E findings represent type I errors.
Conclusion

In this new era of big data and small effects, a recalibration of views about “groundbreaking” findings is necessary. Well-powered direct replications deserve more attention than novel cG×E findings and indirect replications.

The authors also note evidence for publication bias among replication attempts, showing that positive replications had smaller average sample sizes than negative ones.
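
As a sketch of the kind of power calculation the abstract above refers to (the effect size and sample size below are my assumptions, not values from the paper), using the pwr package:

library(pwr)

# Power to detect an interaction explaining ~0.5% of the variance in a typical
# n = 300 candidate GxE study (1 numerator df for the interaction term).
pwr.f2.test(u = 1, v = 300 - 4, f2 = .005 / (1 - .005))   # power ≈ .23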

I know, I know. The authors present this one as a success, but it really isn’t.

The hypothesis that the S allele of the 5-HTTLPR serotonin transporter promoter region is associated with increased risk of depression, but only in individuals exposed to stressful situations, has generated much interest, research and controversy since first proposed in 2003. Multiple meta-analyses combining results from heterogeneous analyses have not settled the issue. To determine the magnitude of the interaction and the conditions under which it might be observed, we performed new analyses on 31 data sets containing 38 802 European ancestry subjects genotyped for 5-HTTLPR and assessed for depression and childhood maltreatment or other stressful life events, and meta-analysed the results. Analyses targeted two stressors (narrow, broad) and two depression outcomes (current, lifetime). All groups that published on this topic prior to the initiation of our study and met the assessment and sample size criteria were invited to participate. Additional groups, identified by consortium members or self-identified in response to our protocol (published prior to the start of analysis) with qualifying unpublished data, were also invited to participate. A uniform data analysis script implementing the protocol was executed by each of the consortium members. Our findings do not support the interaction hypothesis. We found no subgroups or variable definitions for which an interaction between stress and 5-HTTLPR genotype was statistically significant. In contrast, our findings for the main effects of life stressors (strong risk factor) and 5-HTTLPR genotype (no impact on risk) are strikingly consistent across our contributing studies, the original study reporting the interaction and subsequent meta-analyses. Our conclusion is that if an interaction exists in which the S allele of 5-HTTLPR increases risk of depression only in stressed individuals, then it is not broadly generalisable, but must be of modest effect size and only observable in limited situations.

You can’t ignore gene-environment correlations

But aside from the usual false positives due to fishing expeditions, there’s the deeper problem that ignoring gene-environment correlations/dependencies leads many methods to falsely detect gene-environment interactions. A number of methods papers have made this point, but it is still not widely acknowledged. Since gene-environment correlations are ubiquitous, this method problem is also ubiquitous.

Candidate gene × environment (G × E) interaction research tests the hypothesis that the effects of some environmental variable (e.g., childhood maltreatment) on some outcome measure (e.g., depression) depend on a particular genetic polymorphism. Because this research is inherently nonexperimental, investigators have been rightly concerned that detected interactions could be driven by confounders (e.g., ethnicity, gender, age, socioeconomic status) rather than by the specified genetic or environmental variables per se. In an attempt to eliminate such alternative explanations for detected G × E interactions, investigators routinely enter the potential confounders as covariates in general linear models. However, this practice does not control for the effects these variables might have on the G × E interaction. Rather, to properly control for confounders, researchers need to enter the covariate × environment and the covariate × gene interaction terms in the same model that tests the G × E term. In this manuscript, I demonstrate this point analytically and show that the practice of improperly controlling for covariates is the norm in the G × E interaction literature to date. Thus, many alternative explanations for G × E findings that investigators had thought were eliminated have not been.
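
A minimal simulation sketch of this point (all parameters invented): a confounder K both shifts the allele frequency and moderates the environment’s effect, so controlling for K’s main effect alone leaves a spurious G×E, while adding the K×E term removes it.

set.seed(1)
n <- 20000
K <- rbinom(n, 1, .5)                          # confounder, e.g. ancestry group
G <- rbinom(n, 2, ifelse(K == 1, .7, .3))      # genotype; allele frequency differs by K
E <- rnorm(n)                                  # environment
Y <- .5 * E + .5 * K * E + rnorm(n)            # true K-by-E effect, no G-by-E effect

coef(summary(lm(Y ~ G * E + K)))["G:E", ]      # spurious, clearly 'significant' G-by-E
coef(summary(lm(Y ~ G * E + K * E)))["G:E", ]  # near zero once K-by-E is modeled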

Gene-environment interactions have the potential to shed light on biological processes leading to disease and to improve the accuracy of epidemiological risk models. However, relatively few such interactions have yet been confirmed. In part this is because genetic markers such as tag SNPs are usually studied, rather than the causal variants themselves. Previous work has shown that this leads to substantial loss of power and increased sample size when gene and environment are independent. However, dependence between gene and environment can arise in several ways including mediation, pleiotropy, and confounding, and several examples of gene-environment interaction under gene-environment dependence have recently been published. Here we show that under gene-environment dependence, a statistical interaction can be present between a marker and environment even if there is no interaction between the causal variant and the environment. We give simple conditions under which there is no marker-environment interaction and note that they do not hold in general when there is gene-environment dependence. Furthermore, the gene-environment dependence applies to the causal variant and cannot be assessed from marker data. Gene-gene interactions are susceptible to the same problem if two causal variants are in linkage disequilibrium. In addition to existing concerns about mechanistic interpretations, we suggest further caution in reporting interactions for genetic markers.

Studying how genetic predispositions come together with environmental factors to contribute to complex behavioral outcomes has great potential for advancing our understanding of the development of psychopathology. It represents a clear theoretical advance over studying these factors in isolation. However, research at the intersection of multiple fields creates many challenges. We review several reasons why the rapidly expanding candidate gene-environment interaction (cGxE) literature should be considered with a degree of caution. We discuss lessons learned about candidate gene main effects from the evolving genetics literature and how these inform the study of cGxE. We review the importance of the measurement of the gene and environment of interest in cGxE studies. We discuss statistical concerns with modeling cGxE that are frequently overlooked. And we review other challenges that have likely contributed to the cGxE literature being difficult to interpret, including low power and publication bias. Many of these issues are similar to other concerns about research integrity (e.g., high false positive rates) that have received increasing attention in the social sciences. We provide recommendations for rigorous research practices for cGxE studies that we believe will advance its potential to contribute more robustly to the understanding of complex behavioral phenotypes.

We review gene × environment interaction (G×E) research in behavioral and psychiatric genetics. Two approaches to G×E are contrasted: a latent-variable approach that seeks to determine whether the heritability of a behavioral outcome varies by environmental exposure, and a candidate-gene × environment approach that seeks to determine whether genotypes are differentially sensitive to environmental conditions. Three major challenges to current G×E research are identified: (1) most published G×E findings are based on small samples and thus a high proportion are likely to be false-positive reports; (2) imprecision in the assessment of the phenotype, environment, and the genotype can significantly attenuate the power of a G×E study; and (3) a G×E is not an inherent property of the organism but rather a feature of a statistical model and so its identification depends on the structure of that model. The promise of genomic medicine is that interventions can be tailored to individual treatments, a form of G×E. Nonetheless, there is currently limited evidence of gene × intervention interactions in behavioral and psychiatric genetics. Future gene × intervention research will benefit from what we have learned from earlier G×E research and especially the need for large samples and the standardization of assessments to enable pooling of data across multiple studies.

October 3, 2018

The g factor and principal components regression

Jensen spent decades trying to convince people that the g factor (general intelligence) is the primary reason why IQ tests predict stuff, not whatever mental abilities the tests appear to measure or what the makers wanted to measure. E.g. as written in:

As for the tests themselves, and for many of the real-life tasks and demands on which performance is to some degree predictable from the most g-loaded tests, it appears generally that g is associated with the relative degree of complexity of the tests’ or tasks’ cognitive demands. It is well known that test batteries that measure IQ are good predictors of educational achievement and occupational level (Jensen, 1993a). Perhaps less well-known is the fact that g is the chief “active ingredient” in this predictive validity more than any of the specific knowledge and skills content of the tests. If g were statistically removed from IQ and scholastic aptitude tests, they would have no practically useful predictive validity. This is not to say that certain group factors (e.g., verbal, numerical, spatial, and memory) in these tests do not enhance the predictive validity, but their effect is relatively small compared to g.

It’s funny because if we go to another field, we find that the generalized version of this finding is not controversial at all, but a common assumption!

The principal components regression (PCR) approach involves constructing the first M principal components, Z1, …, ZM, and then using these components as the predictors in a linear regression model that is fit using least squares. The key idea is that often a small number of principal components suffice to explain most of the variability in the data, as well as the relationship with the response. In other words, we assume that the directions in which X1, …, Xp show the most variation are the directions that are associated with Y. While this assumption is not guaranteed to be true, it often turns out to be a reasonable enough approximation to give good results.

If the assumption underlying PCR holds, then fitting a least squares model to Z1, …, ZM will lead to better results than fitting a least squares model to X1, …, Xp, since most or all of the information in the data that relates to the response is contained in Z1, …, ZM, and by estimating only M ≪ p coefficients we can mitigate overfitting.

I wasn’t able to find anyone who had noticed this connection before, but it’s somewhat remarkable in hindsight. The active-ingredient status of g in cognitive data is just a specific case of the assumption underlying principal components regression holding. One can also quantify the validity of this assumption for a given domain by looking at how much of the predictive validity is concentrated in the first components (similar to this idea in genomics). My hypothesis a few years ago was that for personality, validity is quite distributed, such that using a few latent variables will not work so well compared to the entire set of personality variables (items). Functionally, this is basically the opposite of what the general factor of personality (GFP) people are saying. Revelle and colleagues have a recent paper showing this to be true. In a project underway, I have shown that cognitive low-level data (items with response-level data) also contain more validity (5 to 40% more; in the personality study, it was >100% more for items than for OCEAN), but not as much extra validity as the personality data. Thus, as usual, we find that Jensen was approximately correct: cognitive data are mostly predictive because of the g factor.
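
To make the connection concrete, here is a minimal simulation sketch (all numbers invented): a battery of tests shares a general factor that carries all of the criterion validity, and regressing the criterion on just the first principal component recovers about the same R² as regressing on all the tests.

set.seed(1)
n <- 5000; k <- 8
g     <- rnorm(n)                                 # latent general factor
tests <- sapply(1:k, function(i) .7 * g + rnorm(n, sd = sqrt(1 - .7^2)))
y     <- .5 * g + rnorm(n, sd = sqrt(1 - .5^2))   # criterion depends only on g

pc1 <- prcomp(tests, scale. = TRUE)$x[, 1]        # first principal component
summary(lm(y ~ pc1))$r.squared                    # ~ .22
summary(lm(y ~ tests))$r.squared                  # ~ .22, i.e. no real gain from all 8 tests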

May 29, 2018

Null hypothesis testing for loadings in unit weighted factor analysis

Filed under: Math/Statistics — Tags: , — Emil O. W. Kirkegaard @ 03:33

Just a minor stats point for an otherwise excellent paper.

They use unit-weighted factor analysis, which sounds fancy, but it is just averaging the z-scored versions of each indicator. What this does is assume that all variables load on a general factor, whereas ordinary exploratory factor analysis methods treat this as something to be tested. Thus, if one can make the assumption from theory or prior studies, one can dispense with the need to test it, and thus gain some precision. From their paper, we have this table.

In it, they test the null hypothesis of r = .00. However, because UWFA is used, this is incorrect: we expect a positive correlation between each indicator and the composite even under the null hypothesis of uncorrelated indicators. The expected correlation by chance is r = sqrt(1/k), where k is the number of indicators. This is because the composite variable gets its variance evenly from each indicator when the indicator intercorrelations are 0 (or uniform), and the indicator-composite correlation is the square root of the variance contribution. Here’s an empirical demonstration assuming k = 4, which means we expect r = sqrt(1/4) = sqrt(.25) = .50.

library(magrittr)  # %>% and set_colnames
library(dplyr)     # mutate

# Four uncorrelated indicators; the unit-weighted composite g = a + b + c + d
# correlates sqrt(1/4) = .50 with each of them.
MASS::mvrnorm(n = 10e3, mu = rep(0, 4), empirical = T, Sigma = diag(4)) %>%
  as.data.frame() %>% set_colnames(letters[1:4]) %>% mutate(g = a + b + c + d) %>%
  cor() %>% round(2)

    a   b   c   d   g
a 1.0 0.0 0.0 0.0 0.5
b 0.0 1.0 0.0 0.0 0.5
c 0.0 0.0 1.0 0.0 0.5
d 0.0 0.0 0.0 1.0 0.5
g 0.5 0.5 0.5 0.5 1.0

So, one would have to devise a test to find loadings that deviate from .50, either alone or as a group. For the group, one can presumably use some kind of modified ANOVA approach, and for a single loading, one can probably modify the usual test of the difference between two correlations. Or be lazy and type in a really large sample size, like I did (using this):

So the critical loading for p < .05 in this case is .79 (one-tailed; .825 two-tailed), meaning that 3 of the 4 correlations reported in Woodley et al are still p < .05.
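
For reference, here is a small sketch of the same calculation done directly in R via Fisher’s z, testing an observed loading against the null value of .50 rather than 0 (the n below is simply the value that reproduces the thresholds quoted above; it is not taken from the paper):

# Critical loading under H0: rho = .50, using the Fisher z transformation.
crit_r <- function(n, null_r = .50, alpha = .05, two_tailed = FALSE) {
  z_crit <- qnorm(1 - ifelse(two_tailed, alpha / 2, alpha))
  tanh(atanh(null_r) + z_crit / sqrt(n - 3))
}
crit_r(13)                      # ~ .79 one-tailed
crit_r(13, two_tailed = TRUE)   # ~ .825 two-tailed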

January 28, 2018

Making better use of the scientific literature: large-scale automatic retrieval of data from published figures

Filed under: Math/Statistics,Metascience — Tags: , , , — Emil O. W. Kirkegaard @ 00:46

Science is a set of related methods that aim at finding true patterns about the world. These methods are generally designed to remove noise from random circumstances (the traditional focus of statistics) and human biases. Current practices are not very good at the second part due to the innumerable ways human biases result in biased findings (see e.g. the discussion of social psychology in Crawford and Jussim’s new book). However, I feel confident that many of these biases can be strongly reduced with the advent of several new tools and practices: 1) the development of meta-analytic tools to properly summarize existing, possibly biased research (e.g. p-curve, z-curve, PET-PEESE, TIVA, R-index), 2) registered replication reports that remove outcome bias in peer review, 3) the increasing reliance on automated tools for checking the validity of scientific works (statcheck, GRIM, SPRITE etc.), 4) increasing awareness of the presence of very substantial ideological biases in the scientific community.

Here I want to suggest a new approach that attacks a different but related angle: lack of open research data. While there is a growing movement to publish all research data, still a large chunk of data remains unpublished. Most of this data will be lost as no backups of it exist and the authors eventually die or throw away their laptops etc. However, papers endure and papers have various visualizations of the data. Some of these are scatterplots (example on the right), which allow for automatic and complete retrieval of the underlying data (unless there’s overplotting).

 

Others are visualizations of mostly summary statistics, which allow for some form of data retrieval. For instance, boxplots (on the left) allow for retrieval of the 25th, 50th and 75th centiles and also of any outlying datapoints (and usually the 1.5 IQR whiskers). This kind of information can be used in meta-analyses and can also be combined with GRIM/SPRITE-type methods to verify the integrity of the underlying data.
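
As a sketch of how such retrieved summary statistics can be put to use (the formulas assume roughly normal data, and the quartile values are made up):

# Convert boxplot quartiles read off a figure into approximate mean and SD.
boxplot_to_moments <- function(q1, median, q3) {
  c(mean = (q1 + median + q3) / 3,   # simple quartile-based estimate of the mean
    sd   = (q3 - q1) / 1.349)        # IQR = 1.349 * SD for a normal distribution
}
boxplot_to_moments(q1 = 12, median = 15, q3 = 19)   # made-up readings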

Various other kinds of visualizations allow for all kinds of intermediate levels of data retrieval. For instance, scatterplots that vary the size of the dots by a third variable allow retrieval of at least some levels of that variable as well. Many newer PDFs contain vector graphics not raster graphics, and this allows for precise retrieval, not just approximate.

 

There already exists quite a large collection of published tools for data retrieval, including some open-source R or Python packages which could easily be integrated into any existing framework for data mining the scientific literature. I have found the following collections of tools:

  • https://academia.stackexchange.com/questions/7671/software-for-extracting-data-from-a-graph-without-having-to-click-on-every-singl
  • https://stats.stackexchange.com/questions/14437/software-needed-to-scrape-data-from-graph/72974
  • https://www.techatbloomberg.com/blog/scatteract-first-fully-automated-way-mining-data-scatter-plots/

January 10, 2018

Standard deviation of total SAT scores: not simply the sum of the standard deviations of subtests

Filed under: intelligence / IQ / cognitive ability,Math/Statistics — Tags: — Emil O. W. Kirkegaard @ 06:08

Someone on the SlateStarCodex subreddit:

AFAIK, SATs are normed to mean of 500 per section and standard deviation of 100. Assuming that math and verbal are highly correlated, that approximates to 1000 mean and 200 standard deviation for the SAT 1600 scale. The median reported score was 1490, and even the 10th percentile was 1320. Which means that half of SSC’ers are in the top percentile of intelligence, and 90% are in the top 5% of US population. This is about in line with reported IQ, where median was 137 (equivalent to 1493 SAT score using 100/15 IQ scale) and 10th percentile 124 (equivalent to 1320 SAT score).

Then someone else:

Standard deviations don’t add like that when you combine distributions, do they? I thought there’d be a square-sum-root step in there somewhere.

The second commenter is correct. In fact, the assumption the first commenter mentions cuts in the exact opposite direction.

No, they don’t. Variance is simply additive but only when variables are uncorrelated. When they are correlated, the variance grows faster (‘super-additive’).

https://en.wikipedia.org/wiki/Variance#Basic_properties

SAT subtests (M, V) presumably correlate at .73 or so. So the standard deviation for the combined score is about sqrt(10000 + 10000 + 2*7300) = sqrt(34600) ≈ 186. To make sure we did it right, let’s simulate some SAT-like data.

library(tidyverse)
library(magrittr)  # for set_colnames

# Simulate a million test-takers: M and V ~ N(500, 100), correlated .73.
sat = MASS::mvrnorm(n = 1e6, mu = c(500, 500),
                    Sigma = matrix(c(10000, 7300, 7300, 10000), nrow = 2),
                    empirical = T) %>%
  as.data.frame() %>%
  set_colnames(c("M", "V")) %>%
  mutate(total = M + V)

cor(sat)
         M    V total
M     1.00 0.73  0.93
V     0.73 1.00  0.93
total 0.93 0.93  1.00

psych::describe(sat)
      vars     n mean  sd median trimmed mad   min  max range skew kurtosis   se
M        1 1e+06  500 100    500     500 100   6.5  989   982    0     0.00 0.10
V        2 1e+06  500 100    500     500 100  -1.9  963   965    0    -0.01 0.10
total    3 1e+06 1000 186   1000    1000 186 118.9 1802  1684    0     0.00 0.19

Math checks out.

See also an interesting case study where this matters:

The Composite Score Extremity Effect

June 22, 2017

Regression Modeling Strategies (2nd ed.) – Frank Harrell (review)

Filed under: Book review,Math/Statistics,R — Tags: , — Emil O. W. Kirkegaard @ 19:11

I heard some good things about this book, and some of it is good. Surely, the general approach outlined in the introduction is pretty sound. He sets up the following principles:

  1. Satisfaction of model assumptions improves precision and increases statistical power.
  2. It is more productive to make a model fit step by step (e.g., transformation estimation) than to postulate a simple model and find out what went wrong.
  3. Graphical methods should be married to formal inference.
  4. Overfitting occurs frequently, so data reduction and model validation are important.
  5. In most research projects, the cost of data collection far outweighs the cost of data analysis, so it is important to use the most efficient and accurate modeling techniques, to avoid categorizing continuous variables, and to not remove data from the estimation sample just to be able to validate the model.
  6. The bootstrap is a breakthrough for statistical modeling, and the analyst should use it for many steps of the modeling strategy, including derivation of distribution-free confidence intervals and estimation of optimism in model fit that takes into account variations caused by the modeling strategy.
  7. Imputation of missing data is better than discarding incomplete observations.
  8. Variance often dominates bias, so biased methods such as penalized maximum likelihood estimation yield models that have a greater chance of accurately predicting future observations.
  9. Software without multiple facilities for assessing and fixing model fit may only seem to be user-friendly.
  10. Carefully fitting an improper model is better than badly fitting (and overfitting) a well-chosen one.
  11. Methods that work for all types of regression models are the most valuable.
  12. Using the data to guide the data analysis is almost as dangerous as not doing so.
  13. There are benefits to modeling by deciding how many degrees of freedom (i.e., number of regression parameters) can be “spent,” deciding where they should be spent, and then spending them.

Readers will recognize many of these from my writings. Not mentioned in the principles is that the book takes a somewhat anti-p-value stance (roughly ‘they have some uses but are widely misused, so beware!’) and a pro effect size estimation stance. Some of the chapters do follow these principles, but IMO the majority of the book does not really follow them. Mostly it is about endless variations on testing for non-linear effects of predictors, whereas in real life a lot of predictors will be boringly linear. There’s some decent stuff about overfitting, bootstrapping and penalized regression, but these topics have been covered better elsewhere (read Introduction to Statistical Learning). I did learn some new things, including on the applied side (e.g. the ease of applying cubic splines, something that would have been useful for this study), and the book comes with a companion R package (rms) so one can apply the ideas to one’s own research immediately. On the other hand, most of the graphics in the book are terrible base plot ones, and only some are ggplot2.
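
For readers curious about the spline point, here is a minimal sketch of the rms workflow with made-up data (rcs() fits a restricted cubic spline; the data and the knot count are arbitrary choices of mine):

library(rms)

# Made-up data with a nonlinear relationship
set.seed(1)
d <- data.frame(x = rnorm(500))
d$y <- sin(d$x) + rnorm(500, sd = .5)

dd <- datadist(d); options(datadist = "dd")

fit <- ols(y ~ rcs(x, 4), data = d)   # restricted cubic spline with 4 knots
anova(fit)                            # includes a test of the nonlinear terms
plot(Predict(fit, x))                 # fitted curve with confidence band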

This edition (published 2015; the first edition was 2001) needed more work before it should have been published, but it is still worth reading for people with an interest in post-replication-crisis statistics. Frank Harrell should team up with Andy Field, who’s a much better writer, and with someone with good ggplot2 skills (throw in Shiny too for extra quality). Then they could write a really good stats book.

March 21, 2017

The neuroscience of intelligence: very preliminary because of power failure and lack of multivariate studies

I don’t have time to provide extensive citations for this post, so some things are cited from memory. You should be able to locate the relevant literature, but otherwise just ask.

  • Haier, R. J. (2016). The Neuroscience of Intelligence (1 edition). New York, NY: Cambridge University Press.

Because I’m writing a neuroscience-related paper or two, it seemed like a good idea to read the recent book by Richard Haier. Haier is a rare breed of a combined neuroscientist and intelligence researcher, and he’s also the past ISIR president.

Haier is refreshingly honest about what the purpose of intelligence research is:

The challenge of neuroscience is to identify the brain processes necessary for intelligence and discover how they develop. Why is this important? The ultimate purpose of all intelligence research is to enhance intelligence.

While many talk about how important it is to understand something, understanding something is arguably just a preliminary goal on the road to what we really want: controlling it. In general, one can read the history of science as man’s attempt to control nature, and this requires having some rudimentary understanding of it. The understanding does not need to be causally deep, as long as one can make above-chance predictions. Newtonian physics is not the right model of how the world works, but it’s good enough to get to the Moon and build skyscrapers. Animal breeders historically had no good idea about how genetics worked, but they knew that when you breed stuff, you tend to get offspring similar to the stuff you bred, no matter whether this is corn or dogs.

Criticism of intelligence boosting studies

Haier criticizes a number of studies that attempted to raise intelligence. However, his criticisms are not quite on target. For example, in reply to the n-back training paradigm, he spends about a page covering criticism of the administration of one validation test:

The first devastating critique came quickly (Moody, 2009). Dr. Moody pointed out several serious flaws in the PNAS cover article that rendered the results uninterpretable. The most important was that the BOMAT used to assess fluid reasoning was administered in a flawed manner. The items are arranged from easy ones to very difficult ones. Normally, the test-taker is given 45 minutes to complete as many of the 29 problems as possible. This important fact was omitted from the PNAS report. The PNAS study allowed only 10 minutes to complete the test, so any improvement was limited to relatively easy items because the time limit precluded getting to the harder items that are most predictive of Gf, especially in a sample of college students with restricted range. This non-standard administration of the test transformed the BOMAT from a test of fluid intelligence to a test of easy visual analogies with, at best, an unknown relationship to fluid intelligence. Interestingly, the one training group that was tested on the RAPM showed no improvement. A crucial difference between the two tests is that the BOMAT requires the test-taker to keep 14 visual figures in working memory to solve each problem, whereas the RAPM requires holding only eight in working memory (one element in each matrix is blank until the problem is solved). Thus, performance on the BOMAT is more heavily dependent on working memory. This is the exact nature of the n-back task, especially as the version used for training included the spatial position of matrix elements quite similar to the format used in the BOMAT problems (see Textbox 5.1). As noted by Moody, “Rather than being ‘entirely different’ from the test items on the BOMAT, this [n-back] task seems well-designed to facilitate performance on that test.” When this flaw is considered along with the small samples and issues surrounding small change scores of single tests, it is hard to understand the peer review and editorial processes that led to a featured publication in PNAS which claimed an extraordinary finding that was contrary to the weight of evidence from hundreds of previous reports.

But he puts little emphasis on the fact that the original study had n=69 and p=.01 or something (judging from the confidence intervals):

Given publication bias and methodological degrees of freedom, this is very poor evidence indeed. It requires no elaborate explanation of the scoring of tests.

Haier does not cover the general history of attempts to increase intelligence, and this is a mistake, because those who don’t read history make the same mistakes over and over again. History supplies ample evidence that can inform our prior. I can’t think of any particular reason not to briefly cover this history, given that Spitz wrote a pretty good book on the topic.

WM/brain training is just the latest fad in a long series of cheap tricks to improve intelligence or other test performance. The pattern is:

  1. Small study, usually with marginal effects by NHST standards (p close to .05)
  2. Widespread media and political attention
  3. Follow-up/replication research takes 1-3 decades and is always negative.
  4. The results from (3) are usually ignored, and the fad continues until it slowly dies because researchers get bored of it, and switch to the next fad. Sometimes this dying can take a very long time.

[See also: Hsu’s more general pattern.]

Some immediate examples I can think of:

  • Early interventions for low intelligence children: endless variations. Proponents always cite early, tiny studies (these are: Perry Preschool Project (1962), The Abecedarian Project (1972), The Milwaukee Project (1960s)), and neglect large negative replications (e.g. Head Start Impact Study).
  • Interventions targeted at autistic, otherwise damaged, and regular dull children. History is full of charlatans peddling miracle treatments to sad parents (see Spitz’s review).
  • Pygmalion/self-fulfilling prophecy effects. Usually people only mention a single study from 1968. Seeing the pattern here?
  • Stereotype threat. Actually this is not even a boosting effect, but it is widely understood that way. This one is still working its way through the stages.
  • Mozart effect.
  • Hard to read fonts.

I think readers would better appreciate the current studies if they knew the historical record – 0% success rate despite >80 years of trying with massive funding and political support. Haier does kind of get to this stage, but only after going over the papers and not as explicitly as I do above:

Speaking of independent replication, none of the three studies discussed so far (the Mozart Effect, n-back training, and computer training) included any replication attempt in the original reports. There are other interesting commonalities among these studies. Each claimed a finding that overturned long-standing findings from many previous studies. Each study was based on small samples. Each study measured putative cognitive gains with single test scores rather than extracting a latent factor like g from multiple measures. Each study’s primary author was a young investigator and the more senior authors had few previous publications that depended on psychometric assessment of intelligence. In retrospect, is it surprising that numerous subsequent studies by independent, experienced investigators failed to replicate the original claims? There is a certain eagerness about showing that intelligence is malleable and can be increased with relatively simple interventions. This eagerness requires researchers to be extra cautious. Peer-reviewed publication of extraordinary claims requires extraordinary evidence, which is not apparent in Figures 5.1, 5.3, and 5.4. In my view, basic requirements for publication of “landmark” findings would start with replication data included along with original findings. This would save many years of effort and expense trying to replicate provocative claims based on fundamentally flawed studies and weak results. It is a modest proposal, but probably unrealistic given academic pressures to publish and obtain grants.

Meta-analyses are a good tool for estimating the entire body of evidence, but they feed on published studies, and when published studies produce biased effect size estimates due to researcher degrees of freedom and publication bias (plus small samples), the meta-analyses will tend to do so too. One can adjust for publication bias to some degree, but this works best when there’s a large body of research. One cannot adjust for researcher degrees of freedom. For a matter as important as increasing intelligence, there is only one truly convincing kind of evidence: large-scale, pre-registered trials with public data access. Preferably they should be announced in advance (see Registered Reports). This way it’s hard to just ‘forget’ to publish the results (common), swap outcomes (also common), or use creative statistics to get the precious p < .05 (probably common too).

Statistics of (neuro)science

In general, I think Haier should have started the book with a brief introduction to the replication crisis and the relevant statistics: why do we have it, and how do we fix it? This would of course also mean that he would have to be a lot more cautious in describing many of the earlier studies presented in the preceding chapters. Most of these studies were done on tiny samples, and we know that they suffer publication bias because we have a huge meta-analysis of brain size x intelligence showing the decline effect. There is no reason to expect the other reported associations to hold up any better, or at all.

Haier spends too much time noting apparent sex differences in small studies. These claims are virtually always based on the NHST subgroup fallacy: the idea that if some association is p < alpha for one group and p > alpha for another group, then we can conclude there’s a 1/0 interaction effect such that there is an effect in one population and none in the other.
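
A quick simulation sketch of the fallacy (all numbers are made up): both sexes have the same true slope, yet with small subgroups it is common for one to come out ‘significant’ and the other not, while the actual interaction test is significant only at the nominal rate.

set.seed(1)
sim <- replicate(2000, {
  n <- 40                                             # per sex; small-sample regime
  d <- data.frame(sex = rep(0:1, each = n), brain = rnorm(2 * n))
  d$iq <- .3 * d$brain + rnorm(2 * n)                 # same slope in both sexes
  p_f <- summary(lm(iq ~ brain, d[d$sex == 0, ]))$coefficients["brain", 4]
  p_m <- summary(lm(iq ~ brain, d[d$sex == 1, ]))$coefficients["brain", 4]
  p_int <- summary(lm(iq ~ brain * sex, d))$coefficients["brain:sex", 4]
  c(discordant = xor(p_f < .05, p_m < .05), interaction = p_int < .05)
})
rowMeans(sim)   # discordant 'significance' roughly half the time; interaction ~ .05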

To be fair, Haier does lay out 3 laws:

  1. No story about the brain is simple.

  2. No one study is definitive.

  3. It takes many years to sort out conflicting and inconsistent findings and establish a compelling weight of evidence.

which kind of make the same point. Tiny samples make every study non-conclusive (2), and we have to wait for the eventual replications and meta-analyses (3). Tiny samples combined with inappropriate NHST statistics give rise to pseudo-interactions in the literature which make (1) seem truer than it is (cf. the situational specificity hypothesis in I/O psych). Not to say that the brain is not complicated, but no need to add spurious interactions to the mix. This is not hypothetical: many papers reported such sex ‘interactions’ for brain size x intelligence, but the large meta-analysis by Pietschnig et al found no such moderator effect.

Beware meta-regression too. This is just regular regression and has the same problems: if you have few studies (say k = 15), use weights (study SEs), and try out many predictors (sex, age, subtest, country, year, ...), then it’s easy to find false-positive moderators. An early meta-analysis did in fact identify sex as a moderator (which Haier cites approvingly), and this turned out not to be so.
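
A small simulation sketch of how easily this happens (everything here is invented; there are no true moderators in the simulated studies):

set.seed(1)
false_pos <- replicate(2000, {
  k    <- 15                                   # number of studies
  se   <- runif(k, .05, .2)                    # study standard errors
  es   <- rnorm(k, mean = .3, sd = se)         # observed effects, no heterogeneity
  mods <- matrix(rnorm(k * 6), ncol = 6)       # 6 candidate moderators, all pure noise
  p <- apply(mods, 2, function(m)
    summary(lm(es ~ m, weights = 1 / se^2))$coefficients[2, 4])
  any(p < .05)                                 # at least one 'moderator' found?
})
mean(false_pos)   # roughly .25, despite nothing being there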

Going forward

I think the best approach to working out the neurobiology of intelligence is:

  1. Compile large, public-use datasets. We are getting there with some recent additions, but they are not public use: 1) PING dataset n=1500, 2) Human Connectome n=1200, 3) Brain Genomics Superstruct Project n=1570. Funding for neuroscience, almost any science, should be contingent on contributing the data to a large, public repository. Ideally, this should be a single, standardized dataset. Linguistics has a good precedent to follow: “The World Atlas of Language Structures (WALS) is a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) by a team of 55 authors.”.

  2. Include many diverse measures of neurobiology so we can do multivariate studies. Right now, the literature is extremely fragmented and no one knows the degree to which specific reported associations overlap or are merely proxies for each other. One can do meta-analytic modeling based on summary statistics, but this approach is very limited.

  3. Clean up the existing literature by first including measures already reported in the literature. While many of these may be false positives, a reported p < alpha finding is better evidence than nothing. Similarly, p > alpha findings in the literature may be false negatives. Only by testing them in large samples can we know. It would be bad to miss an important predictor just because some early study with 20 persons failed to find an association.

Actually boosting intelligence

After discussing nootropics (cognitive enhancing drugs, in his terms), Haier writes:

If I had to bet, the most likely path toward enhancing intelligence would be a genetic one. In Chapter 2 we discussed Doogie mice, a strain bred to learn maze problem-solving faster than other mice. In Chapter 4 we enumerated a few specific genes that might qualify as relevant for intelligence and we reviewed some possible ways those genes might influence the brain. Even if hundreds of intelligence-relevant genes are discovered, each with a small influence, the best case for enhancement would be if many of the genes worked on the same neurobiological system. In other words, many genes may exert their influence through a final common neurobiological pathway. That pathway would be the target for enhancement efforts (see, for example, the Zhao et al. paper summarized in Section 2.6). Similar approaches are taken in genetic research on disorders like autism and schizophrenia and many other complex behavioral traits that are polygenetic. Finding specific genes, as difficult as it is, is only a first step. Learning how those genes function in complex neurobiological systems is even more challenging. But once there is some understanding at the functional system level, then ways to intervene can be tested. This is the step where epigenetic influences can best be explicated. If you think the hunt for intelligence genes is slow and complex, the hunt for the functional expression of those genes is a nightmare. Nonetheless, we are getting better at investigations at the molecular functional level and I am optimistic that, sooner or later, this kind of research applied to intelligence will pay off with actionable enhancement possibilities. The nightmares of neuroscientists are the driving forces of progress.

None of the findings reported so far are advanced enough to consider actual genetic engineering to produce highly intelligent children. There is a recent noteworthy development in genetic engineering technology, however, with implications for enhancement possibilities. A new method for editing the human genome is called CRISPR/Cas9 (Clustered Regularly Interspaced Short Palindromic Repeats/Cas genes). I don’t understand the name either, but this method uses bacteria to edit the genome of living cells by making changes to targeted genes (Sander & Joung, 2014). It is noteworthy because many researchers can apply this method routinely so that editing the entire human genome is possible as a mainstream activity. Once genes for intelligence and how they function are identified, this kind of technology could provide the means for enhancement on a large scale. Perhaps that is why the name of the method was chosen to be incomprehensible to most of us. Keep this one on your radar too.

I think there may be gains to be had from nootropics, but to find them, we have to get serious: large-scale neuroscientific studies of intelligence must look for correlates that we know or think we can modify, such as the prevalence of certain molecules. Then large-scale pre-registered RCTs must be done on plausible candidates. In general, however, it seems more plausible that we can find ways to improve non-intelligence traits that nevertheless help. For instance, stimulant drugs generally enhance self-control in low doses. This shows up in higher educational attainment and fewer criminal convictions. These are very much real gains to be had.

Haier discusses various ways of directly stimulating the brain. In my speculative model of neuro-g, this would constitute an enhancement of the activity or connectivity factors. It seems possible that one can make some small gains this way, but I think the gains are probably larger for non-intelligence traits such as sustained attention and tiredness. If we could find a way to sleep more effectively, this would have insanely high social payoffs, so I recommend research into this.

(From: Nerve conduction velocity and cognitive ability: a large sample study)

For larger gains, genetics is definitely the most certain route (the only other alternative is neurosurgery with implants). Since we know genetic variation is very important for variation in intelligence (high heritability), all we have to do is tinker with that variation. Haier makes the usual mistake of focusing on direct editing approaches. Direct editing is hard because one must know the causal variants and be able to change them (not always easy!). So far we know of very few confirmed causal variants, and those we know are mostly bad: e.g. aneuploidies such as Down syndrome (trisomy 21). However, siblings show that it is quite possible to have the same parents and different genomes, so all we have to do is filter among possible children: embryo selection. Embryo selection does not require us to know the causal variants; it only requires predictive validity, something that’s much easier to attain. See Gwern’s excellent writings on the topic for more info.
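
To see why predictive validity alone is enough, here is a rough back-of-the-envelope sketch (all numbers are hypothetical, and it ignores the reduced within-family variance of polygenic scores and embryo viability, so it overstates realistic gains):

# Expected gain from picking the best of n embryos on a predictor that
# correlates r with the phenotype: gain ≈ r * E[max of n std. normals] * SD.
exp_max_std_normal <- function(n, sims = 1e5)
  mean(replicate(sims, max(rnorm(n))))

r <- sqrt(.10)                    # hypothetical predictor explaining 10% of variance
exp_max_std_normal(10) * r * 15   # ~ 7 IQ points from 1-in-10 selection (an upper bound)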

Socialist neuroscience?

Haier opines a typical HBD-Rawlsian approach to social policy:

Here is my political bias. I believe government has a proper role, and a moral imperative, to provide resources for people who lack the cognitive capabilities required for education, jobs, and other opportunities that lead to economic success and increased SES. This goes beyond providing economic opportunities that might be unrealistic for individuals lacking the requisite mental abilities. It goes beyond demanding more complex thinking and higher expectations for every student irrespective of their capabilities (a demand that is likely to accentuate cognitive gaps). It even goes beyond supporting programs for early childhood education, jobs training, affordable childcare, food assistance, and access to higher education. There is no compelling evidence that any of these things increase intelligence, but I support all these efforts because they will help many people advance in other ways and because they are the right thing to do. However, even if this support becomes widely available, there will be many people at the lower end of the g-distribution who do not benefit very much, despite best efforts. Recall from Chapter 1 that the normal distribution of IQ scores with a mean of 100 and a standard deviation of 15 estimates that 16% of people will score below an IQ of 85 (the minimum for military service in the USA). In the USA, about 51 million people have IQs lower than 85 through no fault of their own. There are many useful, affirming jobs available for these individuals, usually at low wages, but generally they are not strong candidates for college or for technical training in many vocational areas. Sometimes they are referred to as a permanent underclass, although this term is hardly ever explicitly defined by low intelligence. Poverty and near-poverty for them is a condition that may have some roots in the neurobiology of intelligence beyond anyone’s control.

The sentence you just read is the most provocative sentence in this book. It may be a profoundly inconvenient truth or profoundly wrong. But if scientific data support the concept, is that not a jarring reason to fund supportive programs that do not stigmatize people as lazy or unworthy? Is that not a reason to prioritize neuroscience research on intelligence and how to enhance it? The term “neuro-poverty” is meant to focus on those aspects of poverty that result mostly from the genetic aspects of intelligence. The term may overstate the case. It is a hard and uncomfortable concept, but I hope it gets your attention. This book argues that intelligence is strongly rooted in neurobiology. To the extent that intelligence is a major contributing factor for managing daily life and increasing the probability of life success, neuro-poverty is a concept to consider when thinking about how to ameliorate the serious problems associated with tangible cognitive limitations that characterize many individuals through no fault of their own.

Public policy and social justice debates might be more informed if what we know about intelligence, especially with respect to genetics, is part of the conversation. In the past, attempts to do this were met mostly with acrimony, as evidenced by the fierce criticisms of Arthur Jensen (Jensen, 1969; Snyderman & Rothman, 1988), Richard Herrnstein (1973), and Charles Murray (Herrnstein & Murray, 1994; Murray, 1995). After Jensen’s 1969 article, both IQ in the Meritocracy and The Bell Curve raised this prospect in considerable detail. Advances in neuroscience research on intelligence now offer a different starting point for discussion. Given that approaches devoid of neuroscience input have failed for 50 years to minimize the root causes of poverty and the problems that go with it, is it not time to consider another perspective?

Here is the second most provocative sentence in this book: The uncomfortable concept of “treating” neuro-poverty by enhancing intelligence based on neurobiology, in my view, affords an alternative, optimistic concept for positive change as neuroscience research advances. This is in contrast to the view that programs which target only social/cultural influences on intelligence can diminish cognitive gaps and overcome biological/genetic influences. The weight of evidence suggests a neuroscience approach might be even more effective as we learn more about the roots of intelligence. I am not arguing that neurobiology alone is the only approach, but it should not be ignored any longer in favor of SES-only approaches. What works best is an empirical question, although political context cannot be ignored. On the political level, the idea of treating neuro-poverty like it is a neurological disorder is supremely naive. This might change in the long run if neuroscience research ever leads to ways to enhance intelligence, as I believe it will. For now, epigenetics is one concept that might bridge both neuroscience and social science approaches. Nothing will advance epigenetic research faster than identifying specific genes related to intelligence so that the ways environmental factors influence those genes can be determined. There is common ground to discuss and that includes what we know about the neuroscience of intelligence from the weight of empirical evidence. It is time to bring “intelligence” back from a 45-year exile and into reasonable discussions about education and social policies without acrimony.

It’s a little odd that he ignores genetic interventions here, given his earlier mention of them. Aside from that, the focus on neurobiology is eminently sensible. If typical S approaches (e.g. even more income redistribution) do have causal effects on intelligence, this must be thru some pretty long causal pathway, so we cannot expect large effects from the changes we make in the income distribution. Neurobiology, however, is the direct biological substrate of intelligence, and thus one can expect much larger gains from interventions targeted at this domain, for the simple reason that it’s the direct causal antecedent of the thing we’re trying to manipulate – provided of course that any non-genetic intervention can work.

From a utilitarian/consequentialist perspective, government action to increase intelligence, if it works, is likely to have huge payoffs at many levels, so it is definitely something I can get behind – with the caveat that we get serious about it: open data, large-scale, preregistered RCTs.

Chronometric measures and the elusive ratio scale of intelligence

Haier quotes Jensen on chronometric measures:

At the end of his book, Jensen concluded, “… chronometry provides the behavioral and brain sciences with a universal absolute [ratio] scale for obtaining highly sensitive and frequently repeatable measurements of an individual’s performance on specially devised cognitive tasks. Its time has come. Let us get to work!” (p. 246). This method of assessing intelligence could establish actual changes due to any kind of proposed enhancement in a before and after research design. The sophistication of this method for measuring intelligence would diminish the gap with sophisticated genetic and neuroimaging methods.

But it does not follow. Performance on any given chronometric test is a function of multiple independent factors, some of which might change without the others doing so. According to a large literature, specific abilities generally have zero or near-zero predictive validity, so boosting them is not of much use. Chronometric tests measure intelligence, but they also measure other factors. When one repeats chronometric tests, there are gains (people’s reaction times do become faster), but this does not mean that intelligence increased; some other factor did.

Besides, it is easy enough to have a ratio scale. Most tests that are part of batteries are indeed on a ratio scale: if you get 0 items right you have no ability on that test. The trouble is aligning the ratio scale of a given test with the presumed ratio scale of intelligence. Anyone can make a test sufficiently hard so that no one can get any items right, but that doesn’t mean people who took the test have 0 intelligence.

Furthermore, chronometric measures are reversed: a 0 on a chronometric test is the best possible performance, not the worst. So while they are on a ratio scale – 0 ms is a true zero, and one can sensibly apply multiplicative operations – they are clearly not aligned with the hypothetical ratio scale of intelligence itself.

With this criticism in mind, chronometric tests are interesting for multiple reasons and deserve much further study:

  1. They are relatively resistant to training gains. One cannot simply memorize the items beforehand, which matters for high-stakes testing. There are practice gains on them, but these show diminishing returns, so if we want testing that is not heavily confounded with practice effects, we can give people plenty of practice trials first.

  2. They are resistant to motivation effects. Studies have so far not produced any association between subjects’ ratings of how hard they tried on a given trial and their actual performance on that trial.

  3. They are much closer to the neurobiology of intelligence, so their use is likely to advance our understanding of that area as well. I recommend taking brain measurements of people as they complete chronometric tests and seeing whether the two can be linked. EEG and similar methods are getting quite cheap, so this is feasible at large scale.

  4. Chronometric tests are resistant to ceiling problems. While one can approach the floor on simple reaction time tests, it is easy to add harder tests, which effectively raises the ceiling again, and one can do this indefinitely. It is quite likely that one could measure into the extreme high end of ability using simple predictive equations, something that is not generally possible with ordinary tests. This could be researched using e.g. the SMPY sample.

Unfortunately, I do not have the resources right now to pursue this area of research. It would require having a laboratory and some way to get people to participate in large numbers, Galton style.

Conclusion

All in all, this is a very readable book that summarizes research into intelligence for the intelligent layman, while also giving an overview of the neuroscientific findings. It is similar in spirit to Ritchie’s recent book. The main flaws are statistical, but they are not critical to the book in my opinion.

March 19, 2017

Number of siblings vs. total fertility rate

Filed under: Math/Statistics — Tags: , , — Emil O. W. Kirkegaard @ 15:56

This is commentary on:

For over a century, social scientists have predicted declines in religious beliefs and their replacement with more scientific/naturalistic outlooks, a prediction known as the secularization hypothesis. However, skepticism surrounding this hypothesis has been expressed by some researchers in recent decades. After reviewing the pertinent evidence and arguments, we examined some aspects of the secularization hypothesis from what is termed a biologically informed perspective. Based on large samples of college students in Malaysia and the USA, religiosity, religious affiliation, and parental fertility were measured using self-reports. Three religiosity indicators were factor analyzed, resulting in an index for religiosity. Results reveal that average parental fertility varied considerably according to religious groups, with Muslims being the most religious and the most fertile and Jews and Buddhists being the least. Within most religious groupings, religiosity was positively associated with parental fertility. While cross-sectional in nature, when our results are combined with evidence that both religiosity and fertility are substantially heritable traits, findings are consistent with view that earlier trends toward secularization (due to science education surrounding advancements in science) are currently being counter-balanced by genetic and reproductive forces. We also propose that the inverse association between intelligence and religiosity, and the inverse correlation between intelligence and fertility lead to predictions of a decline in secularism in the foreseeable future. A contra-secularization hypothesis is proposed and defended in the discussion. It states that secularism is likely to undergo a decline throughout the remainder of the twenty-first century, including Europe and other industrial societies.

To investigate fertility differentials, they rely on surveys that include a question about the number of siblings. This is a potentially problematic measure of fertility.

Total fertility rate (TFR) = the number of children born per woman over her entire reproductive life.

Suppose we have a population where all women have exactly 2 children (TFR=2). If we sample children at random from this population, the mean number of siblings will be 1.00 (sd=0). If we add 1 to take into account the respondent himself, we get 2.00. So, this seems to be a way to estimate the TFR using another type of data.

However, suppose we have a population with 5 kinds of women: those who have 0, 1, 2, 3 and 4 children. These are equally common, so each is 20%. Pretend that we have 100 women. The distribution looks like this:

  • 20 women with 0 children
  • 20 women with 1 child
  • 20 women with 2 children
  • 20 women with 3 children
  • 20 women with 4 children

The total population of children is 20 * 0 + 20 * 1 + 20 * 2 + 20 * 3 + 20 * 4 = 200. So, TFR=2.00. Now consider the case where we sample the children instead, and you can see the problem:

  • 80 of the children have 3 siblings
  • 60 of the children have 2 siblings
  • 40 of the children have 1 sibling
  • 20 of the children have 0 siblings

The mean number of siblings, counting everybody, is (80 * 3 + 60 * 2 + 40 * 1)/200 = 2. If we then add the respondent, we get 3. But the total fertility rate is only 2, not 3. This is because when you sample at the child level, you are more likely to sample from the larger sibships, and it is impossible to sample a child of a woman who had no children, so childless women are not counted at all. Thus, the number of siblings metric is not in general comparable to TFR.
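The arithmetic above can be checked with a minimal sketch in Python (an illustration only, not code from the post), by listing the 100 women and then sampling at the child level:

```python
# Worked example: 100 women with 0-4 children in equal proportions.
women = [0] * 20 + [1] * 20 + [2] * 20 + [3] * 20 + [4] * 20

tfr = sum(women) / len(women)                  # 200 children / 100 women = 2.0

# Sampling at the child level: each woman contributes n children,
# and each of those children reports n - 1 siblings.
siblings = [n - 1 for n in women for _ in range(n)]
mean_siblings = sum(siblings) / len(siblings)  # 400 / 200 = 2.0

print(tfr, mean_siblings, mean_siblings + 1)   # 2.0 2.0 3.0
```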

Two populations with different sibship distributions but equal TFRs may thus have different mean numbers of siblings, so the metric may be problematic for comparison purposes. To investigate how problematic it is in practice, one must either have data that allow calculating both metrics (which requires a familial design), or, if one has a realistic idea of the data generating process, simply simulate some data. My intuitive hunch is that the number of siblings will be a good metric for comparison purposes, because the shape of the distribution will not vary markedly between populations.
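As a toy illustration of the comparison problem (hypothetical populations, not data from the paper), two populations with identical TFRs but different sibship spreads give different child-level means:

```python
# Two hypothetical populations with TFR = 2 but different sibship distributions.
pop_a = [2] * 100             # every woman has exactly 2 children
pop_b = [0] * 50 + [4] * 50   # half childless, half with 4 children

def child_level_mean_siblings(women):
    """Mean number of siblings reported when sampling at the child level."""
    sibs = [n - 1 for n in women for _ in range(n)]
    return sum(sibs) / len(sibs)

for name, pop in [("A", pop_a), ("B", pop_b)]:
    print(name, sum(pop) / len(pop), child_level_mean_siblings(pop))
# A 2.0 1.0
# B 2.0 3.0
```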

Here’s some simple empirical support at the country level from EU Stat.

Basically, it follows something close to a Poisson distribution with a lambda that varies by group. The Poisson distribution is tricky to understand. It takes only a single parameter: the average number of events per unit, for instance per unit of time. In our case, it is children per woman. This way, we are treating the number of children a woman will have as independent of the number of children she already has, i.e. as a random event with some chance of occurring. As such, there is no maximum number of children a woman can have. This is unrealistic for at least three reasons: 1) it takes time to have a child, since pregnancy is long and the reproductive span is limited – if we take history as our guide, the maximum achieved seems to have been 27 births with 69 offspring; 2) there are genetic defects that cause untreatable infertility and thus always result in 0 children; 3) one can have multiple children in one birth (twins, triplets, etc.). Still, it’s a good approximation.
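For reference, the Poisson probability mass function, with λ interpreted here as the expected number of children per woman:

```latex
P(X = k) \;=\; \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots,
\qquad \text{with } \mathbb{E}[X] = \operatorname{Var}(X) = \lambda .
```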

So, I did some simple simulations and…

http://rpubs.com/EmilOWK/260043

So it turns out that the mean number of siblings exactly balances out and matches the TFR when simulating data using a Poisson distribution. Yes, I double-checked. There is some deeper math here I don’t understand, but empirical math is good enough for me.
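Here is a minimal sketch of such a simulation in Python (not the RPubs code linked above): draw Poisson fertility counts for many women, sample at the child level, and compare the mean sibling count with the TFR. The relevant identity, for what it’s worth, is that the size-biased version of a Poisson(λ) distribution is 1 + Poisson(λ), so a randomly sampled child’s number of siblings is again Poisson(λ) with mean λ = TFR.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 2.0                                   # TFR under the Poisson model
children_per_woman = rng.poisson(lam, size=1_000_000)

tfr = children_per_woman.mean()             # close to 2.0 by construction

# Sample at the child level: each woman appears once per child she has,
# and each sampled child reports (sibship size - 1) siblings.
sibship_sizes = np.repeat(children_per_woman, children_per_woman)
mean_siblings = (sibship_sizes - 1).mean()  # also close to 2.0

print(f"TFR: {tfr:.3f}, mean number of siblings: {mean_siblings:.3f}")
```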

PS. The authors are aware that this metric can be problematic.

Obviously, parental fertility cannot be equated with the fertility of couples, of women, or of populations (the most common bases for operationalizing fertility). The main distinction between these more common fertility measures and ours is that our measure over-estimates fertility by excluding all childless couples from being counted.

Judging from the simulation, it seems that it can, at least at the population level. But note that this does not imply that it works the same way as a comparative measure at the individual level; that I did not test.


Nathaniel Bechhofer worked out the proof.
