I am watching Brian Boutwell’s (Twitter, RG) talk at a recent conference and this got me thinking.

What are we measuring?

As far as I know, there are typically two outcome variables used in criminological studies:

  1. Official records of convictions.
  2. Self-reported criminal or anti-social behavior.

But exactly what trait are we trying to measure? It seems to me that we (or at least I!) are really interested in measuring something like the tendency to break laws in ways that harm other people. Harmful is here used in a broad sense. Stealing something may not always cause someone harm, but it does (usually) unfairly deprive them of their property. Stealing is not always wrong, but it is usually wrong. Let’s call the construct we want to measure harmful criminal behavior.

Measurement error: two types

Before going on, it is necessary to distinguish between the two types of measurement error in classical test theory:

  1. Random measurement error.
  2. Systematic measurement error.

Random measurement error is by definition error in measurement that is not correlated with anything else at all (sampling error aside). Conceptually we can think of it as adding random noise to our measurements. A simple, everyday example of this would be a study where we examine the relationship between height and GPA for elementary school students. Suppose we obtain access to a school and we measure the height of all the students using a measuring tape. Then we obtain their GPAs from the school administration. Random measurement error here would be if we used dice to pick random numbers and added/subtracted these to each student’s height.
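The height and GPA example can be sketched as a small simulation. This is only an illustration under made-up assumptions: the height distribution, the weak height-GPA slope, and the noise size are all invented here, and `pearson` and `simulate` are hypothetical helper names. The point it is meant to show is that adding random noise to the measured heights should shrink the observed correlation with GPA.

```python
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate(n=20000, noise_sd=8.0, seed=1):
    """Return (correlation using true heights, correlation using noisy heights)."""
    rng = random.Random(seed)
    # Assumed (invented) true heights in cm and a GPA weakly related to height.
    true_height = [rng.gauss(140.0, 8.0) for _ in range(n)]
    gpa = [0.03 * (h - 140.0) + rng.gauss(0.0, 1.0) for h in true_height]
    # Random measurement error: the "dice" noise, unrelated to height or GPA.
    measured = [h + rng.gauss(0.0, noise_sd) for h in true_height]
    return pearson(true_height, gpa), pearson(measured, gpa)

r_true, r_noisy = simulate()
print(round(r_true, 3), round(r_noisy, 3))
```

With these settings the noise variance equals the true height variance, so the noisy correlation should come out noticeably smaller than the one based on true heights.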

Systematic measurement error (also called bias) is different. Suppose we are measuring the ability of persons to sneak past a guard post because we want to recruit a team of James Bond-type super spies. We conduct the experiment by having people try to sneak past a guard post. Because we have a lot of people to test, our experiment is carried out all day, beginning in the early morning and ending in the evening. Each individual has to try three times to sneak past the guard post and we measure their ability as the number of times they sneaked past (so 0-3 are possible scores). We assign their trials in order of their birthdays: people born early in the year take their trials in the early morning. Because it is easier to see when the sun is higher in the sky, the individuals who happen to be born late or very early in the year have an advantage: it is more difficult for the guards to spot them when it is darker. Someone who successfully sneaked past the guards three times in the evening is not necessarily at the same skill level as someone who sneaked past three times around noon. There is a systematic error in the measurement of sneaking ability related to the time of testing, and it is furthermore related to each person’s birthday.
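The guard post experiment can also be sketched in code. Everything numeric here is an assumption made for illustration: the `darkness` function, the way birthdays map to test hours, and the success probability formula are invented, not taken from any real study. The sketch shows that people tested near dusk should score higher on average than people tested around noon, even though true skill is distributed identically in both groups.

```python
import random

def darkness(hour):
    """Assumed darkness in [0, 1]: 0 at noon, 1 at 6:00 and at 18:00."""
    return abs(hour - 12.0) / 6.0

def run_trials(n=5000, seed=2):
    """Simulate n people; return a list of (skill, test_hour, score) tuples."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        skill = rng.random()               # true sneaking ability, 0..1
        day_of_year = rng.randint(1, 365)  # birthday, independent of skill
        # Birthday order sets the test time: early birthdays at 6:00, late ones at 18:00.
        hour = 6.0 + 12.0 * (day_of_year - 1) / 364.0
        # Assumed success chance: skill helps, but so does darkness (the bias).
        p = min(1.0, 0.2 + 0.5 * skill + 0.3 * darkness(hour))
        score = sum(rng.random() < p for _ in range(3))  # 0..3 successful attempts
        rows.append((skill, hour, score))
    return rows

def mean_score(rows, lo, hi):
    """Average score of people tested between hours lo (inclusive) and hi (exclusive)."""
    picked = [score for _, hour, score in rows if lo <= hour < hi]
    return sum(picked) / len(picked)

rows = run_trials()
noon = mean_score(rows, 11.0, 13.0)
dusk = mean_score(rows, 17.0, 18.01)
print(round(noon, 2), round(dusk, 2))
```

Unlike the dice example, this error does not average out: it pushes scores in a predictable direction for a predictable group of people.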

Problems with official records

Using official records as a measure of harmful criminal behavior has a big problem: they often include convictions for things that aren’t wrong (e.g. drug use or sex work). Ideally, we don’t care about convictions for things like smoking cannabis because, in a sense, this isn’t a real crime: it’s just the government that is evil. In the same way, homosexual sex and even oral sex are no longer crimes, and were not real crimes back when they were illegal (overview of US ‘sodomy’ laws). There is a moral dimension to what one is trying to measure if one does not just want to go with the construct of ‘any behavior that the present-day state in this country happens to have criminalized’.

Furthermore, official records are based on court decisions (and pleas). Court decisions are in turn the result of the police taking up a case. If the police are biased — rightly or wrongly — in their decision about which cases to pursue, this will give rise to systematic measurement error.

Since the police do not have infinite resources, they will not pursue every case they know of. They probably won’t even pursue every case they know of that they think they can win in court. There is thus an inherent randomness in which cases they will pursue, i.e. random measurement error.

Worse, which cases the police pursue may depend on irrelevant things like whether the police leadership has set a goal for the number of cases of a given type that must be pursued every year. This practice seems to be fairly common, and yet it results in serious distortions in the use of police resources. In Denmark, the police often have these goals about biking violations (say, biking on the sidewalk). The result is that in December (if the goal is set on a yearly basis), if they are not close to meeting their goal, the police leadership will divert resources away from more important crimes, say, break-ins, to hand out fines to people breaking biking laws. They may also lower the bar as to what counts as a violation.

Even worse, they may focus on targeting violations that are not wrong but are easy to pursue. One police officer gave the following story (anonymously in order to prevent reprisals from the leaders!) in response to a parliamentary discussion of the topic:

“When we are told that we must write 120 bikers [hand out fines to] the next 14 days, then we don’t place ourselves in the pedestrian area while there are pedestrians, and when the bikers may cause problems. No, we take them in the morning when they bike thru the empty pedestrian area on their way to work, because then we get more quickly to the 120 number. In other words, we do it for the numbers’ sake and not for the sake of traffic safety.”

This kind of police behavior induces both random and systematic measurement error in the official records. For instance, people who happen to bike to work and whose work is on the opposite side of a pedestrian area are more likely to receive such fines.

Measurement error, self-rating and the heritability of personality traits

While personality is probably not really that simple to summarize, most research on personality uses some variant of the big five/OCEAN model (use this test). Using such measures, it has generally been found that the heritability of OCEAN traits is around 40%. Lots of room for environmental effects, surely. Unfortunately, most of the non-heritable variance is in the ‘everything else’ category.

But these results are based on self-rated personality and are not even corrected for random measurement error, which is usually easy enough to do. So, suppose we correct for random measurement error, then perhaps we get to 50% heritability. This is because (almost?) any kind of measurement error biases heritability downwards.
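A minimal sketch of why random error pulls heritability down, and how a correction would bring 40% up to 50%. The twin correlations (0.50 and 0.25) and the reliability of 0.80 are hypothetical numbers chosen so the arithmetic matches the 40% and 50% figures above; they are not from any particular study. The sketch uses Falconer's formula, which is a swapped-in simplification rather than the method of the studies discussed here, and all function names are made up for illustration.

```python
def falconer_h2(r_mz, r_dz):
    """Falconer's estimate: twice the difference between MZ and DZ twin correlations."""
    return 2.0 * (r_mz - r_dz)

def attenuate(r_true, reliability):
    """Observed twin correlation when both twins' scores have the given reliability."""
    return r_true * reliability

def disattenuate_h2(h2_observed, reliability):
    """Correct an observed heritability for random measurement error."""
    return h2_observed / reliability

# Hypothetical error-free twin correlations and an assumed test reliability.
r_mz_true, r_dz_true, reliability = 0.50, 0.25, 0.80

h2_true = falconer_h2(r_mz_true, r_dz_true)
h2_observed = falconer_h2(attenuate(r_mz_true, reliability),
                          attenuate(r_dz_true, reliability))
h2_corrected = disattenuate_h2(h2_observed, reliability)

print(round(h2_true, 2), round(h2_observed, 2), round(h2_corrected, 2))
```

Random error shrinks both twin correlations by the same factor, so their difference, and with it the heritability estimate, shrinks by that factor too. Dividing by the reliability undoes this.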

What about self-rating bias? Surely there are some personality traits that cause people to systematically rate themselves differently from how other people rate them, i.e. systematic measurement error. Even for height, a very simple trait, using self-reported height deflated heritability by about 4 percentage points compared with clinical measurement (from 91 to 87%), and clinical measurement is not free of random measurement error either. Furthermore, human height varies somewhat within a given day, which is a kind of systematic measurement error.

So, are other-ratings of personality better? There is a large meta-analysis showing that they are. Other-ratings have stronger correlations with actual criterion outcomes than self-ratings:

[Figures: other_rating_strangers, other_rating_academic, other_rating_workperf]

This suggests considerable systematic measurement error in the self-ratings. The counter-hypothesis: others’ ratings of one’s personality, while not actually more accurate than self-ratings, causally influence the chosen outcomes, such that it appears that other-ratings are better. E.g. teachers/supervisors give higher grades/performance ratings to those they incorrectly judge to be more open-minded due to some kind of halo effect. I don’t know of any research on this question.

Still, what do we find if we analyze the heritability of personality using other-ratings and especially the combination of self- and other-ratings? We get this:


A mean heritability of 81% for the OCEAN traits. Like the height study, there was evidence of heritable influence on systematic self-rating error (53% in this study; the height study found 36% but had limited precision).

Conclusion: measurement error and criminology

Back to criminology. We have seen that:

  1. Official records have serious problems with measuring the right construct (harmful criminal behavior), probably suffer from lots of random measurement error and probably some systematic measurement error.
  2. Self-ratings suffer from systematic measurement error.
  3. Measurement error biases estimates of heritability downwards.

We combine them and derive the conclusion: heritabilities of harmful criminal behavior are probably seriously underestimated.

Questions for future research:

  • Locate or do behavioral genetic studies of crime based on multiple methods and other-ratings. What do they show?
  • Find evidence to determine whether the higher validity of other ratings is due to their higher precision or due to causal halo effects.