Buj (1981) is probably fraudulent and shouldn’t be used

There’s an odd old paper that is sometimes mentioned and, even worse, used for calculations:

Buj, Vinko. (1981). Average IQ values in various European countries. Personality and Individual Differences, 2(2), 168-169.

Paper doesn’t even have an abstract, and it’s 2 pages long, so let’s just quote it in full here:

VINKO BUJ Erziehungsberatungstelle des Landkreises Grafschaft Bentheim in Nordhorn, West Germany

(Received 22 July 1980)

The present study reports the averaged IQ scores in 21 European countries, collected during a period exceeding five years. An attempt was made to test a representative sample in all European capitals. It proved possible to do this in 16 capital cities, but in five countries it was necessary to choose the biggest or the second biggest town instead of the capital (Amsterdam, Bratislava, Hamburg, Zagreb and Zurich), the assumption being made that these cities would represent the whole country in a meaningful manner permitting of suitable comparisons to be made.

Participants of the study were men and women of 16 years or more, the assumption being made that the development of intelligence in most cases would be completed by that age. The scale used was Cattell’s Scale CFT3, a non-verbal, culture-fair test which is particularly appropriate for comparisons of different nationalities. The test has high g saturations and is a good measure of general mental capacity, particularly fluid general ability (Weiss, 1971). The test was validated and standardized in the United States, and all comparisons are with the standardization figures.

The sample was based on the testing of one person for each 40,000 inhabitants, and the sample may be considered as stratified because three subgroups of the population were represented proportionally: (1) Men and women; (2) Age of participants (distribution was into six classes, beginning with the 16-20 year olds, and then going up by 10 year steps to the over 60s); (3) Socio-economic status was also considered, using a threefold classification.

Probands were ascertained by the investigators through the use of population statistics, and were then tested in groups of about 50 persons. The investigations were carried out by qualified psychologists and students of psychology who were carefully instructed to use identical criteria of testing. In each case the experimenters were of course native born in the country in question. For all sampling purposes the breakdown was based on the total population of the country, not on the particular town or city in which the test was carried out. Altogether 10,737 persons were tested.

Results are shown in Table 1, which gives the number of persons tested in each city, the mean IQ value, the standard deviation, and the standard error. The arithmetic mean for the total population tested was 102.2, with a standard deviation of 18.7. The population was 48.8% male and 51.2% female. The first five countries are significantly differentiated from the mean in a positive direction, the bottom two countries in a negative direction. The other countries do not differ from the mean significantly.

The mean of the total population is 2.2 points of IQ above the expected value of 100, which, while significant statistically, is so largely by virtue of the very large numbers involved. The actual value is small, and is probably due to the fact that people living in cities have generally IQ values several points above those living in the country.

In addition to the populations reported, a further study was made on the same principles of 225 probands in Akkra, the capital of Ghana. The mean IQ value of this population was 82.2: this is significantly below the value for the populations in Table 1. Nothing can be said here about the causes of this difference.

Several points regarding Table 1 may call for comments. It is interesting that the mean value for this very large European sample is very similar to the standardization value for the American sample; the small difference of two points is very likely due to the fact that all the testing in the European countries was done in large towns, so that there will almost inevitably be a slight over-estimation of the general IQ for that country. The standard deviation for the total European sample is also higher than that of the American standardization group, a fact which may be due to a better sampling in the present study.

The mean values for the 21 countries, and the standard deviations observed, differ more than one might have expected (particularly the latter), and in addition the mean values for the different countries do not agree with what one might have expected on the basis of previous studies. Thus France unexpectedly comes at the bottom of the scale, below countries like Spain, Greece, Ireland and Portugal, which previous research would have suggested as being more likely candidates for that position. It should be noted, however, that the great majority of differences between countries are not significant, and that the majority do not differ from the mean value significantly. It is thus possible that sampling errors may be responsible for the remaining statistically ‘significant’ results, a few such almost inevitably arising when a large number of comparisons are being made.

The most curious feature of the table is the very great divergence of standard deviations from their mean, ranging from 11.6 for Norway to 34.7 for Bulgaria and Spain. It is difficult to explain these differences; they must certainly cast some doubt on the comparability of sample choice in the different countries. It is exceedingly difficult to standardize conditions, instructions, and motivational factors over a large number of different testers, organisers, and subjects, and any differences along these lines may contribute materially to observed differences in the results. In spite of these well known difficulties it seems worthwhile to publish the results obtained, without claiming a high degree of accuracy. They may serve as a baseline against which future investigations may be carried out, and with which future results may be compared.

And the table:

The criticism of the paper is right there in the paper itself. I don’t think it is possible to obtain such varying standard deviations, and especially not at these sample sizes. In fact, I don’t know of any other study that has ever reported a standard deviation of 35, let alone 4 of them! There is something extremely amiss with these data. Standard deviations are quite robust, and it’s almost impossible to obtain such large differences between samples. Case in point: Heiner Rindermann has done a lot of studies of the cognitive elite, based on PISA etc. data. The standard deviations of PISA samples are also reported by country, and have occasionally been used for research. They look like this:

You want to look at the SD column, where 91 is the average for OECD countries. The largest value is 108 (Israel) and the smallest is 71 (Dominican Republic). I don’t even think these differences are that real, but in any case, this is at most a 52% difference (i.e., Israel’s SD is 52% larger than DR’s). In the Buj data, the largest difference is from 11.6 (Norway) to 35.2 (Italy), i.e., a 203% difference! Before you say something about the two Italies: in the PISA data, Norway has an SD of 90 and Italy has one of 94, a 4% difference. Thus, it is hard to claim this earlier result was due to some extremely large real difference.
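To make the arithmetic behind these percentages explicit, here is a minimal Python sketch. The helper name `pct_larger` is made up for illustration; the SD values are the ones quoted above.

```python
def pct_larger(larger_sd, smaller_sd):
    """Return how many percent larger the first SD is than the second."""
    return (larger_sd / smaller_sd - 1) * 100


# SD pairs quoted in the text: (label, larger SD, smaller SD)
comparisons = [
    ("PISA: Israel vs. Dominican Republic", 108, 71),
    ("Buj: Italy vs. Norway", 35.2, 11.6),
    ("PISA: Italy vs. Norway", 94, 90),
]

for label, larger, smaller in comparisons:
    print(f"{label}: {pct_larger(larger, smaller):.0f}% larger")
```

Expressing each gap as a ratio of the two SDs is what lets the PISA scale (SDs around 90) be compared with the IQ scale (SDs around 15).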

The standard deviation results cannot be explained by small samples either, as the claimed sample size is 1380 for Italy and 100 for Norway. To give an idea of the sampling error of a standard deviation, it looks like this:

I simulated data based on sample sizes of 50, 100, 250, and 500, all with a true SD of 15. There are 10,000 repeats for each. As you can see, even with n=50, it is hardly possible to get even SD=20 or SD=10. Of 10,000 tries, 3 samples produced SDs below 10, and 4 produced SDs above 20, so the joint chance of being off by more than 5 in either direction is 7/10,000.
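A simulation along these lines can be sketched in a few lines of Python with NumPy. This is not the original script (which isn't shown here), so the normal distribution, the mean of 100, the seed, and all names are assumptions. The exact counts will vary from run to run and need not match the 3 and 4 quoted above. The sketch also reports the usual large-sample approximation for the standard error of an SD, SD / sqrt(2(n - 1)), for comparison.

```python
import numpy as np


def simulate_sample_sds(n, true_sd=15.0, repeats=10_000, seed=0):
    """Draw `repeats` normal samples of size n and return each sample's SD."""
    rng = np.random.default_rng(seed)  # fixed seed is an arbitrary choice
    samples = rng.normal(loc=100.0, scale=true_sd, size=(repeats, n))
    return samples.std(axis=1, ddof=1)  # ddof=1 gives the usual sample SD


results = {}
for n in (50, 100, 250, 500):
    sds = simulate_sample_sds(n)
    results[n] = {
        "mean_sd": float(sds.mean()),
        "sd_of_sd": float(sds.std(ddof=1)),
        # Large-sample approximation to the standard error of an SD
        "approx_se": float(15.0 / np.sqrt(2 * (n - 1))),
        "below_10": int((sds < 10).sum()),
        "above_20": int((sds > 20).sum()),
    }
    print(n, results[n])
```

The spread of the simulated SDs shrinks as n grows, so samples in the hundreds or thousands leave even less room for values like 11.6 or 35.2 if the true SD is anywhere near 15.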

Next, think about how weird it is for an unknown guy called Vinko Buj, seemingly a Croatian living in Germany in 1981, to gather these samples alone. THEN think about how weird it is that this is a 2 page paper! Who goes through all that effort to sample all these data and then writes a 2 page paper? The guy exists, as there is a Croatian Wikipedia page about him, and a bunch of other sources. Alas, he died last year, so we can’t ask about these data. Maybe his paper was a summary of his dissertation? No, his dissertation was:

Buj, V. (1977). Merkfähigkeit und Intelligenzleistung bei Hirnorganikern und bei visuell trainierten Verhaltensgestörten [Memory and intelligence in the brain-damaged and in visually trained behaviorally disordered]. Unpublished doctoral dissertation, University of Hamburg.

I don’t see a free copy anywhere but one can apparently buy it for 17 EUR. If someone can obtain this, please email it to me.

Apparently, he was also involved in a human rights lawsuit but it seems unrelated to this paper.

When Heiner Rindermann did his famous 2007 study on the international national g factor, he also discussed this anomalous paper:

Only one international comparison study has been carried out using a uniform intelligence test measured over a short time period under more or less standardised conditions. This is the study with the Cattell Culture Fair Test 3 (CFT3) non-verbal scale (Buj, 1981), probably conducted in the 1970s in 21 European countries and Ghana. The tests were administered in capital cities or in the biggest town in each country. But researchers believe the data from this study are of dubious quality: nobody knows the author; he did not work at a university; the way he collected so much data is unknown; the description of samples and testing procedure is scanty; and only one single two-page-long publication exists. The correlations with other measures, except PISA, are good (see below).

…

Correlations involving the methodically criticised CFT-study by Buj (1981) were also high, except for PISA (r = .01, corrected: r = .01, N = 21, Ghana not included). For example, the correlation with the unadjusted student assessment sum was r = .71 (adjusted: r = .70, N = 22, Ghana included; but Ghana excluded and only European countries: r_uncorr = .04 and r_corr = .08, N = 21). The Buj study showed high correlations only by including a country from Africa. The study IAEP-II, which has been challenged for low representativeness of pupils, showed even higher and more stable correlations with all variants of cognitive ability measurements at the macro-social level (for example, the correlation with unadjusted cognitive ability sum was r = .88, N = 19; adjusted: r = .89). The results for the old student assessment studies were similar (the correlation with unadjusted cognitive ability sum was r = .95, N = 19; adjusted: r = .95). Even very old data about the ‘intellectual’ order of immigrants 1921 to the USA (mainly from European countries) showed substantial correlations (for example, with unadjusted cognitive ability sum: rank r = .44, N = 14; adjusted rank r = .53).
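The pattern Rindermann describes, a high correlation that disappears once the single African country is dropped, is a generic property of correlations with one distant data point. Here is a minimal sketch with synthetic numbers. These are not Buj’s or Rindermann’s data: the 21 clustered points are constructed to be exactly uncorrelated, and the outlier values 82 and 70 are made up for illustration.

```python
import numpy as np

# Synthetic toy data, NOT the real country values: 21 clustered points whose
# x and y scores are constructed to be exactly uncorrelated.
a = np.tile([-1.0, 0.0, 1.0], 7)  # pattern for measure x
b = np.tile([1.0, -2.0, 1.0], 7)  # pattern for measure y, orthogonal to a
x = 100 + 3 * a
y = 100 + 3 * b

r_without = np.corrcoef(x, y)[0, 1]

# Add one hypothetical point far below the cluster on both measures.
x_all = np.append(x, 82.0)
y_all = np.append(y, 70.0)
r_with = np.corrcoef(x_all, y_all)[0, 1]

print(f"r without the outlier: {r_without:.2f}")
print(f"r with the outlier:    {r_with:.2f}")
```

With the outlier included, the correlation becomes strongly positive even though the 21 clustered points carry no relationship at all, which is why the European-only correlations are the informative ones.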

So the study’s mean IQs don’t even correlate with other data, except through the inclusion of one African country with a lowish IQ (and arguably not low enough at that). Martin Voracek, in a reply to Rindermann, defends the study though, by analogy with the Cyril Burt case:

Something old (and new). Rindermann mentions criticism regarding Buj’s (1981) IQ study (‘…the data from this study are of dubious quality: nobody knows the author; he did not work at a university; the way he collected so much data is unknown…’). This is somewhat reminiscent to posthumous accusations made against Sir Cyril Burt, concerning allegedly non-existent (i.e. fictitious) co-authors, which accusations were wrongful (Fletcher, 1991, pp. 266–276; Hearnshaw, 1979, pp. 242–243; Jensen, 1995). Vinko D. Buj (Croatian, born 1938) can be traced (and asked about study details). He appears in several recent (2006) online papers of Croatian newspapers (e.g. media coverage of Lynn & Vanhanen, 2006, http://www.slobodnadalmacija.hr/20060721/zadnjastrana01.asp; http://www.slobodnadalmacija.hr/20060701/novosti03.asp; http://vijesti.hrt.hr/arhiv/99/04/07/KUL.html), took his Ph.D. (psychology) from the University of Hamburg (Buj, 1977), with doctoral supervisor Manfred Amelang (Professor Emeritus, University of Heidelberg), a widely known differential psychologist. Although Buj (1981) is his single entry in the PsychINFO and ISI Web of Science databases and he never again published a similarly large-scaled study, further publications are found in the databases PubMed (Buj, 1983) and Psyndex (Buj, 1990; Buj, Specht, & Zuschlag, 1981). Buj (1981) made clear that his collection of IQ data in 21 European cities (N=10.737) took over 5 years, achieved through a network of native-born collaborators in the respective countries. It is entirely plausible that one nonfamous researcher, apparently a traveller, with good personal contacts abroad, could have carried out this during the mid-1970s. For comparison, the early David Buss amassed data from 37 study sites around the world (N=10.047; Buss,1989) over a comparable period, starting in 1982, when literally nobody outside the U.S. had e-mail access (Buss, 2003, p. 4; 2004, p. xxi). 
Richard Lynn (personal communication; February 19, 2007) used Buj’s (1981) data (Lynn & Vanhanen, 2002, 2006) because of the totality-of-evidence principle. However, since there is a plethora of other studies, inclusion or exclusion of Buj’s data does not impact national IQ estimates. On the whole, the criticism regarding Buj (1981) appears unjustified.

But Voracek ignores the impossible standard deviations, and the mean IQs not relating to other data. Cyril Burt was a famously secretive but resourceful person, with many other verified impressive achievements. Burt’s data also agree with other studies on the topic. David Buss is a leading researcher in evolutionary psychology (now), and surely had academic access back then. Buss’ results were replicated recently with flying colors. So the similarities are not that strong.

As for Voracek’s claim that Buj’s later papers on the topic provide an explanation and vindicate the data, I don’t think so. They could easily be post-hoc, post-fraud ‘oh by the way, here’s how I got the data’ papers, in the same way that the famous Dutch social psychologist Diederik Stapel did it.

But let’s look at them. What are Buj 1983, Buj 1990, and Buj et al. 1981 about? I checked all 44 citations of the Buj 1981 paper, and these papers are not among them, so apparently none of them cite the extraordinary 1981 paper. If these papers are relevant, why don’t they cite the prior study? Weird citation patterns are common in fraud. The paper titles are:

Buj, V., Specht, F., & Zuschlag, B. (1981). Erziehungs- und Familienberatung in der Bundesrepublik Deutschland [Educational and family counseling in the Federal Republic of Germany]. Zeitschrift für Klinische Psychologie, 10, 147–166.
Buj, V. D. (1983). Trial of an intelligence test for dogs. Tierärztliche Praxis, 11, 537–542.
Buj, V. D. (1990). Studie über minimale cerebrale Dysfunktion und Intelligenzleistung der Kinder von Raucherinnen und Nichtraucherinnen [A study on minimal brain dysfunction and intelligence scores among children of female smokers and nonsmokers]. Suchtgefahren, 36, 123–126.

The titles make it clear these papers are not so relevant. I couldn’t find any copies of these rather obscure German-language papers. If someone can obtain these, please email them to me.

Overall, the evidence against Buj 1981 is too strong. These numbers should not be used for anything.
