Administration time and reliability in measurement

For some research designs, one pays participants according to the amount of time they spend on the task (like regular work). For other designs where one does not, participants can usually drop out if one annoys them too much. In both cases it is important to use brief measures. Still, when one uses a briefer measure, one usually loses some accuracy (i.e. gets more measurement error). Since measurement error causes all kinds of nasty methodological problems, one would like to avoid it as much as possible. So, there is a trade-off. It’s a psychological example of the more general class of speed vs. accuracy trade-offs. Here’s a paper about the trade-off in ants’ decision making.

Theoretically, one could avoid measurement error entirely by using a scale with an infinite number of items. But we cannot do that. Perhaps our research grant only allows us to use a scale that takes 5 minutes to administer, given the sample size we need. How do we get the best possible test that takes only 5 minutes? Well, to begin, we could try to abbreviate existing longer scales. It may or may not be feasible to try all the possible item combinations. If not, one will have to use some kind of non-exhaustive search and hope one does not end up in a local optimum. Still, here we are minimizing measurement error and item count, not directly minimizing administration time and measurement error, which is what we really want.
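To make the search concrete, here is a minimal sketch in Python of an exhaustive search that optimizes reliability directly under a time budget rather than under an item count. Everything in it is an assumption for illustration: the item names, loadings and answering times are made up, the single-factor model with standardized loadings is a simplification, and McDonald's omega is just one convenient reliability summary.

```python
from itertools import combinations

# Hypothetical item pool: (name, standardized factor loading, typical seconds to answer).
# None of these values come from real data.
ITEMS = [
    ("A", 0.70, 90),
    ("B", 0.60, 40),
    ("C", 0.55, 30),
    ("D", 0.75, 120),
    ("E", 0.50, 25),
    ("F", 0.65, 60),
]


def omega(loadings):
    """McDonald's omega for a one-factor model with standardized items."""
    common = sum(loadings) ** 2
    unique = sum(1 - l ** 2 for l in loadings)
    return common / (common + unique)


def best_scale(items, budget_seconds):
    """Try every non-empty subset that fits the time budget and keep the most reliable.

    Returns (omega, total_seconds, item_names), or None if no item fits.
    """
    best = None
    for k in range(1, len(items) + 1):
        for subset in combinations(items, k):
            total_time = sum(t for _, _, t in subset)
            if total_time > budget_seconds:
                continue  # too slow to administer within the budget
            rel = omega([l for _, l, _ in subset])
            if best is None or rel > best[0]:
                best = (rel, total_time, [name for name, _, _ in subset])
    return best


if __name__ == "__main__":
    print(best_scale(ITEMS, budget_seconds=300))
```

The number of subsets doubles with every item added to the pool, so this only works for small pools. For larger ones, one would need the non-exhaustive kind of search mentioned above.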

The distinction between item count and administration time matters because some items take longer to solve than others. Consider the following two items:

1, 2, 4, 7, 11. What is the next number in the series?

A, B, D, G, K. What is the next letter in the series?

Here we have two variants of the same idea: the testee must find the pattern in a series and predict the next symbol. The second item has merely converted the numbers into letters of the alphabet using the familiar A=1, B=2 etc. method. To solve the second item, one must be able to convert numbers to and from letters. This task is probably easier than the task of finding the pattern in the numbers, so the second item is probably only slightly more difficult than the first. The extra difficulty arises mainly from the probability that one makes a clerical error in the conversion process. Still, the second item takes a lot longer to solve because one has to repeat the alphabet in one’s head a number of times to convert between numbers and letters (at least, that’s how I do it). Suppose it takes twice as long as the first. The question then arises of whether we would get a better scale by giving 2 items in the straight-up numerical format rather than 1 item in the alphabetic format. It’s probably not so easy to answer this kind of question without empirical data from all three items. I say not so easy instead of impossible because Aiden Loe is working on automatic item generation, where one does try to predict the item parameters from the parameters used to generate the items. While his work has centered on item difficulty, there is no reason why it could not be extended to administration time as well.
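If one did have item parameters and timings, one way to frame the "2 numeric items vs. 1 alphabetic item" question is in terms of item information per unit of administration time. The sketch below assumes a two-parameter logistic (2PL) IRT model, in which item information at ability theta is a² · P · (1 − P) and adds up across locally independent items. The parameter values and timings are hypothetical, and treating the two numeric items as having identical parameters is a simplification.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of one 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


# Hypothetical parameters, not estimated from any data:
# the alphabetic item is slightly harder and takes twice as long.
NUMERIC = {"a": 1.2, "b": 0.0, "seconds": 30}
ALPHABETIC = {"a": 1.2, "b": 0.3, "seconds": 60}


def information_per_minute(theta, item):
    """Item information at theta, scaled to one minute of testing time."""
    info = item_information(theta, item["a"], item["b"])
    return 60.0 * info / item["seconds"]


if __name__ == "__main__":
    for theta in (-1.0, 0.0, 1.0, 2.0):
        # Information is additive, so two identical numeric items give twice the information.
        two_numeric = 2 * item_information(theta, NUMERIC["a"], NUMERIC["b"])
        one_alpha = item_information(theta, ALPHABETIC["a"], ALPHABETIC["b"])
        print(f"theta={theta:+.1f}  2 numeric: {two_numeric:.3f}  1 alphabetic: {one_alpha:.3f}")
```

Which option wins depends on the ability level one cares about and on the real parameters, which is why empirical data (or good predictions of item parameters) are needed.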

In our recent paper on creating the ICAR5, we were trying to create a shorter version of the ICAR. This was done specifically because we needed a briefer measure for practical purposes. However, we did not actually have administration time measures of the items, so we had to base our item choice on our own experiences. An interesting research project would be to construct a series of scales for different desired administration times (e.g. 1, 2, 5, 10 minutes) that measure general cognitive ability with as little measurement error as possible. My guess is that to do this (while avoiding the use of computerized adaptive testing and automatic item generation), one will have to try a large number of items and item combinations to construct the scales. Still, for current use, the ICAR5 seems to be a reasonable compromise between administration time and measurement error. As part of another project, we gave the ICAR5 test to a representative sample of about 500 adult Danes and the items seemed to work reasonably well. The three figures below show the item response theory output:

[Figures: item_curves (item curves), item_info (item information), test_info (test information)]