Defense against the psychometricians

This paper analyzes the theoretical, pragmatic, and substantive factors that have hampered the integration between psychology and psychometrics. Theoretical factors include the operationalist mode of thinking which is common throughout psychology, the dominance of classical test theory, and the use of “construct validity” as a catch-all category for a range of challenging psychometric problems. Pragmatic factors include the lack of interest in mathematically precise thinking in psychology, inadequate representation of psychometric modeling in major statistics programs, and insufficient mathematical training in the psychological curriculum. Substantive factors relate to the absence of psychological theories that are sufficiently strong to motivate the structure of psychometric models. Following the identification of these problems, a number of promising recent developments are discussed, and suggestions are made to further the integration of psychology and psychometrics.

The article is a highly entertaining attack on mainstream psychology methods by what one might call the method enthusiasts. I should say that I too like fancy methods, but I try to restrict myself to cases where they also make a practical difference (my coauthors constantly try to persuade me to use simpler methods so reviewers won't fret too much).

Thus, even though psychometric modeling has seen rapid and substantial developments in the past century, psychometrics, as a discipline, has not succeeded in penetrating mainstream psychological testing to an appreciable degree. This is striking. Measurement problems abound in psychology, as is evident from the literature on validity, and it would seem that the formalization of psychological theory in psychometric models offers great potential in elucidating, if not actually solving, such problems. Yet, in this regard, the potential of psychometrics has hardly been realized. In fact, the psychometric routines commonly followed by psychologists working in 2006 do not differ all that much from those of the previous generations. These consist mainly of computing internal consistency coefficients, executing principal components analyses, and eyeballing correlation matrices. As such, contemporary test analysis bears an uncanny resemblance to the psychometric state of the art as it existed in the 1950s.

I think the answers are quite simple: 1) simple classical test theory psychometrics works well enough in practice; IRT methods add little validity, and test bias is rarely important, and when it is, it's rather obvious (non-native language users); 2) psychologists don't like math and usually can't code, so the fancier methods are not accessible to them and they have little interest in them. Ordinary psychologists are interested in people, not things. Psychometrics is for autistic psychologists.

Many investigations into the structure of individual differences theorize in terms of latent variables, but rely on Principal Components Analyses (PCA) when it comes to the analysis of empirical data. However, the extraction of a principal components structure, by itself, will not ordinarily shed much light on the correspondence with a putative latent variable structure. The reason is that PCA is not a latent variable model but a data reduction technique. This is no problem as long as one does not go beyond the obvious interpretation of a principal component, which is that it is a conveniently weighted sumscore. Unfortunately, however, this is not the preferred interpretation among the enthusiastic users of principal components analysis.

A typical dead horse: PCA vs. latent variables. They have different math and assumptions, as the paper explains: (exploratory) factor analysis (FA) attempts to find reflective (causal) latent traits, while principal components analysis (PCA) just tries to summarize the data in an optimal way (each new dimension captures the maximum amount of variation in the data, assuming linearity and additivity). They might give different results, and one can engineer examples to produce very different results (one can also engineer FA to give strange results; see the entire career of Peter Schönemann). However, in practice they almost always agree (Jensen and Weng 1994; Kirkegaard 2014), so the matter is mostly pointless/only of academic interest.

The same is usually true for full-scale IQ vs. g factors, something I learned myself after being interested in the theory: I was surprised to find that these IQ scores usually correlate .99 with my fancy g factor scores. It is also true for item-level data: IRT-computed g factor scores usually correlate .99 with simple sums (i.e., the number of correct responses), e.g. Bursan et al 2018. With a small number of items, IRT scores can have some advantage, e.g. Kirkegaard et al 2017. IRT also deals with missing data in a more sophisticated fashion, which is sometimes useful, e.g. Kirkegaard et al 2016, but it's not a big advantage, because with CTT one can just use the mean of completed items instead of the sum. For Wordsum, as recently discussed, one can use IRT on the 10 items, but in my analysis this score correlates only about 5% more strongly with outcomes than the sum score, so the matter is mostly practically pointless.
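To make the agreement point concrete, here is a minimal toy simulation (my own sketch in Python, not anything from the article; the one-factor setup and the use of numpy/scikit-learn are my assumptions): generate ten tests loading on a single latent trait and compare the first principal component, the one-factor FA score, and the plain sum score.

```python
# Toy sketch (assumption: a one-factor world with linear, continuous tests).
# Under this setup the PCA score, the FA score, and the plain sum score
# correlate near 1 with one another (and highly with the true trait).
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(1)
n_persons, n_tests = 5_000, 10

theta = rng.normal(size=n_persons)                 # latent trait ("g")
loadings = rng.uniform(0.5, 0.8, size=n_tests)     # test loadings
noise = rng.normal(size=(n_persons, n_tests)) * np.sqrt(1 - loadings**2)
scores = theta[:, None] * loadings + noise         # observed test scores

pc1 = PCA(n_components=1).fit_transform(scores).ravel()
fa1 = FactorAnalysis(n_components=1).fit_transform(scores).ravel()
sum_score = scores.sum(axis=1)

# Signs of pc1/fa1 are arbitrary, so look at absolute correlations.
print(np.abs(np.corrcoef([pc1, fa1, sum_score, theta])).round(3))
```

Run as-is, the three score variants correlate in the high .90s with one another, which is the practical sense in which the PCA-vs-FA (and sum-vs-g) distinction rarely matters.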

Now it turns out that, with respect to the Big Five, CFA gives Big Problems. For instance, McCrae et al. (1996) found that a five factor model is not supported by the data, even though the tests involved in the analysis were specifically designed on the basis of the PCA solution. What does one conclude from this? Well, obviously, because the Big Five exist, but CFA cannot find them, CFA is wrong. "In actual analyses of personality data […] structures that are known to be reliable [from principal components analyses] showed poor fits when evaluated by CFA techniques. We believe this points to serious problems with CFA itself when used to examine personality structure" (McCrae et al., 1996).

I believe this rather points to serious problems in psychologists' interpretation of principal components; for it appears that, in the minds of leading scholars in personality research, extracting a set of principal components equals fitting a reflective measurement model (or something even better). The problem persists even though the difference between these courses of action has been clearly explicated in accessible papers published in general journals like Psychological Bulletin and Psychological Methods. Apparently, psychometric insights do not catch on easily.

One man’s modus ponens is another man’s modus tollens.

Tests of measurement invariance are conspicuously lacking, for instance, in some of the most influential studies on group differences in intelligence. Consider the controversial work of Herrnstein and Murray and of Lynn and Vanhanen. These researchers infer latent intelligence differences between groups from observed differences in IQ (across race and nationality, respectively) without having done a single test for measurement invariance. (It is also illustrative, in this context, that their many critics rarely note this omission.) What these researchers do instead is check whether correlations between test scores and criterion variables are comparable, or whether regressing some criterion on the observed test scores gives comparable regression parameters in the different groups (e.g., Herrnstein & Murray, 2002, p. 627). This is called prediction invariance. Prediction invariance is then interpreted as evidence for the hypothesis that the tests in question are unbiased.
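For readers who want the prediction-invariance idea spelled out: it amounts to a per-group regression comparison. Below is a minimal sketch (my own illustration in Python, not the paper's or Herrnstein and Murray's actual analysis); the toy data builds in a group mean difference on the test but an identical test-criterion relation, so the check comes out "invariant."

```python
# Hedged sketch (mine, not from the paper): "prediction invariance" checks
# whether regressing a criterion on observed test scores gives comparable
# slopes/intercepts (and correlations) in each group.
import numpy as np

def prediction_invariance_check(test, criterion, group):
    """Return per-group (slope, intercept, r) for criterion ~ test."""
    out = {}
    for g in np.unique(group):
        mask = group == g
        slope, intercept = np.polyfit(test[mask], criterion[mask], deg=1)
        r = np.corrcoef(test[mask], criterion[mask])[0, 1]
        out[g] = (round(slope, 3), round(intercept, 3), round(r, 3))
    return out

# Toy data: two groups differing in mean test score but sharing the same
# test-criterion relation, so the regression parameters match across groups.
rng = np.random.default_rng(2)
n = 2_000
group = np.repeat(["A", "B"], n)
test = rng.normal(loc=np.where(group == "A", 0.0, -1.0), scale=1.0)
criterion = 0.5 * test + rng.normal(scale=0.8, size=2 * n)
print(prediction_invariance_check(test, criterion, group))
```

The paper's complaint is that passing this kind of check is not the same thing as establishing measurement invariance in the latent-variable sense.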

Since this article was published, a large number of such measurement invariance tests have been done; there’s a new and nice (2019) review on /r/heredity by BasementInhabitant. They mostly show that MI ‘holds’ or has some unimportant deviation (e.g. one has to ‘free’ one test’s error variance or some such). So the old methods were fine, and the new methods gave about the same results. The general theme of this research is coming up with increasingly obscure ways in which tests could be biased without the usual methods showing it, while ignoring the lack of practical plausibility of these ideas.

To be fair: measurement invariance never completely holds for any socially important groups. It is just a matter of sample size/statistical power to find the small deviations. The hunt for MI failures is just a hunt for larger sample sizes, as per the usual NHST fetish in social science. In our study of 60k Jordanian kids (Bursan et al), we find that the IRT loadings correlate about .90 across sex, while one would expect .99 by sampling error alone. Does the test fail MI? Definitely. Does this affect any conclusion? No, because these item parameter differences did not relate to anything else. For item parameter differences to matter, they must relate to something else; otherwise they will just average out in the sum score. Is this a new conclusion? No: Jensen and McGurk (1987) did this kind of study using CTT (and proto-IRT) decades ago and found the same kind of result for the black-white gap.
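As a rough illustration of that kind of item-parameter comparison (my own sketch in Python, not code from Bursan et al or Jensen and McGurk, and the 2PL data-generating setup is my assumption): simulate two groups answering the same unbiased items with a group mean difference, estimate CTT item parameters per group (difficulty as proportion correct, discrimination as item-rest correlation), and correlate the parameter vectors across groups.

```python
# Hedged sketch: CTT-style item-parameter comparison across two groups.
# With unbiased items, the item parameters line up across groups even
# though the groups differ in mean ability.
import numpy as np

rng = np.random.default_rng(3)
n, k = 5_000, 20
a = rng.uniform(0.8, 2.0, size=k)   # item discriminations (2PL)
b = rng.normal(size=k)              # item difficulties (2PL)

def simulate(mean):
    """Binary responses for one group with the given mean ability."""
    theta = rng.normal(loc=mean, size=n)
    p = 1 / (1 + np.exp(-a * (theta[:, None] - b)))
    return (rng.random((n, k)) < p).astype(int)

def ctt_item_params(responses):
    """(persons, items) 0/1 matrix -> (difficulty, discrimination) per item."""
    difficulty = responses.mean(axis=0)
    total = responses.sum(axis=1)
    discrimination = np.array([
        np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]
        for j in range(k)
    ])
    return difficulty, discrimination

diff_a, disc_a = ctt_item_params(simulate(0.0))    # group A
diff_b, disc_b = ctt_item_params(simulate(-0.5))   # group B, lower mean
print("difficulty r across groups:", round(np.corrcoef(diff_a, diff_b)[0, 1], 3))
print("discrimination r across groups:", round(np.corrcoef(disc_a, disc_b)[0, 1], 3))
```

With unbiased items the cross-group correlations come out very high, limited mainly by sampling error; the point above is that with enormous real samples you will always detect some deviation, but it rarely relates to anything of substance.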

The above were the examples of failure used in the article; the rest is mostly the author’s speculation about origins (mostly philosophy-of-measurement stuff), which curiously does not mention the obvious issue of, well, the psychology of the psychologists.
