I have been complaining to colleagues about this one for several months. Of the various criticisms of this method, I don’t recall any that point out this problem. I don’t have time to run simulations right now, but I did put together some quick illustrations.
The same problem holds for the item-level analyses (these are done using the wrong method as well, but there’s an easy fix). Fortunately, the Vietnam Experience Study dataset (from a forthcoming Kirkegaard, Nyborg et al. collaboration; some output here) allows one to study the matter at both the test and the item level, using a mixed selection of about 20 tests (some from the WAIS, others from the ASVAB, others ad hoc) as well as item-level data (>100 items).