Interrater reliability in job performance ratings

And now for something completely different. Well, not that different, but not the usual topics. I've been working in human resources/education since February, and it's quite enjoyable work. My job largely consists of being asked how to examine the effects of our company training program. Ultimately, what we really want is a return on investment (ROI) calculation, hopefully positive, but you never know. Providing this value is tough! There is a myriad of problems: 1) job performance is really hard to measure, 2) most data don't come in a tidy RCT form, so causal inference is difficult, 3) job performance metrics aren't in dollar units, and 4) there are a lot of different simultaneous efforts targeting different skills, mindsets, etc.

On to the first issue, then: job performance measurement, or job performance appraisal. One might think this is just a matter of designing some objective test of job tasks (a work sample test) and testing people. That's one way, but such tests have low reliability (too few items/tasks) and are very expensive. For instance, giving a programmer a few tasks to complete might take all day. Consider the salary of a programmer, multiply that by the number of people you want to test, say, a few hundred, and you can see it's not a good idea. So most companies take a very pragmatic approach: they just have a single supervisor rate each underling on job performance. The question then becomes: how accurate are these ratings? The answer to that question is… complicated. We don't have any gold standard measurement of job performance. We only have a bunch of different suboptimal ones:

  • Objective
    • Metrics produced as part of daily job functioning, e.g. deals closed, customers served, sales volume, issues on Github closed
    • Work sample tests
    • Knowledge tests
    • Clock time data (overdue, absenteeism)
    • Promotions/demotions
  • Ratings by humans
    • Self
    • Other: underlings
    • Other: peers
    • Other: superiors

(If you are looking for reviews of this, try this or this or this or this paper.)

As you can figure, all of these have issues. This is collectively called the criterion problem. Anyway, on to the topic of this post: ratings. How reliable are they? How do we know? How should we adjust for their lack of perfect reliability? There have been three particularly interesting exchanges between researchers on these questions.

The general issue in these papers is that we want to know how well various things predict job performance, yet, because of the criterion problem, it's really difficult to get anything nice. Almost no studies measure job performance in more than one way. Most studies only use the weakest measurement method, single supervisor ratings. Most studies do not report the reliability of the measures used. Most studies are done on job incumbents only, not on the applicant pool. All in all, we face serious issues of measurement error in both predictors and outcomes, and we have to deal with range restriction too. And of course, science being what it is, there's almost no data sharing and most of the older data are probably lost forever, so all we have to go on is some hundreds of scattered studies with partially missing results. What to do…

Some people have taken the approach that we need to figure out how best to remove the known biases. A simple example: in study A, they measured IQ with a 200-item test with .90 reliability, and in study B, they did it with a shorter 100-item version with only .80 reliability. If the studies were otherwise identical and had infinite sample sizes, there would be an exact and known relation between the correlations each test shows with other variables. We can adjust for this lack of perfect reliability based on the assumption that the lack of 1.00 reliability is due to random errors (also called classical errors: normally distributed with mean 0 and some variance). Because random errors by definition do not relate to anything, we know they can only bias results downwards, and we simply have to multiply by a known constant to get at the theoretical value we would see if the tests were infinitely long and thus had a reliability of 1.00. Well, sort of. Not all measurement error is due to the test having finite length; some of it is due to other things. Semi-random factors like sleep deprivation, illness, hunger, bad mood, getting up on the wrong foot, etc. also have minor effects on test performance. These are the so-called transient errors. They aren't exactly random, but they are pretty random, and they would not carry over if we tested the same person twice with a few days in between.
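
To make that concrete, here is a minimal sketch in Python of the two classical formulas at work: the Spearman-Brown prediction for how reliability changes with test length, and the standard correction for attenuation. The .90/.80 reliabilities and item counts are the ones from the hypothetical studies A and B above; the true correlation of .40 is just a number I made up for illustration.

```python
import math

def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is shortened/lengthened by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

def disattenuate(r_observed, rel_x, rel_y=1.0):
    """Correct an observed correlation for unreliability, assuming purely random error."""
    return r_observed / math.sqrt(rel_x * rel_y)

# Study B's 100-item test is half the length of study A's 200-item test (rel = .90):
print(round(spearman_brown(0.90, 0.5), 2))   # -> 0.82, close to study B's .80

# A true correlation of .40 seen through each test:
print(round(0.40 * math.sqrt(0.90), 2))      # observed in study A, ~0.38
print(round(0.40 * math.sqrt(0.80), 2))      # observed in study B, ~0.36

# Correcting each observed value recovers (roughly) the same underlying .40:
print(round(disattenuate(0.38, 0.90), 2))    # ~0.40
print(round(disattenuate(0.36, 0.80), 2))    # ~0.40
```

The same disattenuation formula is what gets applied with the interrater reliabilities below, just with much lower reliability values.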

OK, back to job ratings. Suppose we can find some studies where multiple supervisors, or more accurately, superiors, rated the same employees on job performance; then we could see how much they agree on average. How much do they agree? There are a few meta-analyses on this question:

  • Salgado, J. F., & Moscoso, S. (1996). Meta-analysis of interrater reliability of job performance ratings in validity studies of personnel selection. Perceptual and Motor Skills, 83(3, Suppl.), 1195-1201.
    • Rating scales are most frequently used to assess the criterion in studies of validity in personnel selection. However, only a few articles report the interrater reliability for these scales. This paper presents four meta-analyses in which the inter-rater reliability was estimated for Civil Composite Criterion, Military Composite Criterion, Total Composite Criterion, and Single Civil Criterion. Mean reliabilities were .64, .53, .62, and .40, respectively. The implications of these findings for single and meta-analytic studies are discussed.
  • Viswesvaran, C., Ones, D. S., & Schmidt, F. L. (1996). Comparative analysis of the reliability of job performance ratings. Journal of Applied Psychology, 81(5), 557.
    • This study used meta-analytic methods to compare the interrater and intrarater reliabilities of ratings of 10 dimensions of job performance used in the literature; ratings of overall job performance were also examined. There was mixed support for the notion that some dimensions are rated more reliably than others. Supervisory ratings appear to have higher interrater reliability than peer ratings. Consistent with H. R. Rothstein (1990), mean interrater reliability of supervisory ratings of overall job performance was found to be .52. In all cases, interrater reliability is lower than intrarater reliability, indicating that the inappropriate use of intrarater reliability estimates to correct for biases from measurement error leads to biased research results. These findings have important implications for both research and practice.

You think these sound bad? Yeah, they are! But it gets worse. If we want to adjust results based on this terrible reliability, we have to assume that the non-shared variance between two raters is due to random error, or at least something random enough that it isn't shared between raters in general. But anything that is sometimes shared between raters is not random. Say, team extrovert rates other extroverts higher, and likewise for team introvert. This would produce some variance that isn't random, and the correction could overcorrect… or undercorrect; it's hard to say.
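
To see how a shared rater tendency interferes with the correction, here is a small simulation sketch. All parameters are made up by me, not taken from any of the papers: two raters each see true performance plus a shared "halo" trait they both reward, plus their own noise. In this setup, the standard correction recovers the predictor's correlation with whatever the raters agree on, which is not the same as its correlation with actual performance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000                                    # large n so sampling error is negligible

true_perf = rng.standard_normal(n)             # actual job performance
halo = rng.standard_normal(n)                  # trait both raters reward (e.g. likability)
predictor = 0.5 * true_perf + np.sqrt(1 - 0.5**2) * rng.standard_normal(n)  # true validity = .50

def rating():
    # each rater sees performance + the shared halo trait + idiosyncratic noise
    return 0.6 * true_perf + 0.4 * halo + 0.7 * rng.standard_normal(n)

r1, r2 = rating(), rating()
interrater = np.corrcoef(r1, r2)[0, 1]         # ~.51, what a reliability study would report
observed = np.corrcoef(predictor, r1)[0, 1]    # ~.30, validity against a single rater
corrected = observed / np.sqrt(interrater)     # ~.42, the standard correction

print(round(interrater, 2), round(observed, 2), round(corrected, 2))
# The corrected value (~.42) is the correlation with what the raters share
# (performance + halo), not with actual performance (.50). If the halo were
# itself correlated with the predictor, the correction could instead overshoot.
```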

My general read of the Murphy papers is that they pretty much stay at this conceptual level. Yeah, the correction method we are using relies on assumptions that are not entirely correct. We should use this other theory [his favorite, naturally]… which isn't applicable to the existing data, limited as they are, so we would be left without any useful estimates from large datasets. Umm, maybe I will stick with what we have, then. There's a bunch of sniff tests we can do to see if things are completely messed up, and they seem mostly OK. So maybe the model is wrong (all models are wrong), but it's not that wrong. His argument strikes me as similar to the people who keep arguing about factor analysis vs. principal component analysis. Yes, the mathematical theory is different, but the results are basically the same under normal conditions (also my paper here). Emphasis on under normal conditions; one can construct examples where they disagree a lot.

I like LeBreton et al 2014 more. They do engage in some of the same doubt-raising about the sensibility of adjusting for such low reliability, which isn't very convincing since there is no real alternative. But they also attempt an empirical takedown. They take the values from the latest Schmidt and Hunter psychometric meta-analysis and try to show that, using their opponents' own methods, adding a few more predictors means our variables explain basically 100% of the variance in job performance! Or even more, if we add some further plausible candidates based on guess-work correlations. I have respect for this argument. However, it does seem that some of the numbers they rely on were a bit wrong, and they mixed up applicant pool data with incumbent data and so on, so their argument needs some refinement. The defenders of the status quo seem not very interested in doing this exercise right; they settled for the lazy defense of merely pointing out errors in their opponents' work. Paul Sackett did have a useful idea: he came up with a proof/derivation of the interrater adjustment formula that doesn't rely on classical test theory, the favorite target of the critics.
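
For a sense of the mechanics of this argument, here is a toy calculation with purely hypothetical round numbers of my own; these are not LeBreton et al's actual figures. If several predictors are treated as roughly uncorrelated, the explained variance is just the sum of the squared corrected validities, and it passes 100% quickly.

```python
# Hypothetical illustration only: five predictors, each with a corrected
# validity of .45, treated as uncorrelated so R^2 is the sum of squares.
corrected_validities = [0.45, 0.45, 0.45, 0.45, 0.45]

r_squared_total = sum(r**2 for r in corrected_validities)
print(round(r_squared_total, 2))  # ~1.01, i.e. more than 100% of the variance "explained"
```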

I feel like we are in this situation because of a lack of proper data sharing. Even the meta-analyses in this field (industrial and organizational psychology) do not post tables of the studies they include, which means that third parties cannot reanalyze the data. The simplest approach I can think of here is to check whether the adjusted correlations are consistent with the correlations observed when more raters are used. Most studies use data with a single supervisor doing the ratings, but some use two supervisors. Do those studies find stronger correlations, as they should? Are the correlations stronger by the right amount? If so, this would suggest our correction approach is on the right track. It may or may not be possible to do this kind of sniff test with existing data; it's a matter of whether there are enough studies with these method variations.
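
Under the standard model, the expected size of that boost is easy to work out: averaging two supervisors raises the criterion reliability per Spearman-Brown, and observed validities scale with the square root of criterion reliability. A quick sketch, using the .52 estimate from Viswesvaran et al. and the illustrative function from earlier:

```python
import math

def spearman_brown(reliability, length_factor):
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

rel_1 = 0.52                          # single-supervisor interrater reliability
rel_2 = spearman_brown(rel_1, 2)      # reliability of the average of 2 supervisors, ~0.68
boost = math.sqrt(rel_2 / rel_1)      # expected ratio of observed validities, ~1.15

print(round(rel_2, 2), round(boost, 2))
# So two-supervisor studies should report validities roughly 15% larger than
# single-supervisor studies, all else equal; if they don't, the correction
# model has a problem.
```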