Some time ago, I wrote on Reddit:
There are two errors that I see quite frequently:
- Conclude from the fact that a statistically significant difference was found to that a significant socially, scientifically or otherwise difference was found. The reason this won’t work is that any minute difference will be stat.sig. if N is large enough. Some datasets have N=1e6, so very small differences between groups can be found reliably. This does not mean they are worth any attention. The general problem is the lack of focus on effect sizes.
- Conclude from the fact that a difference was not statistically significant to that there was no difference in that trait. The error being that they ignore the possibility of false negative; there is a difference, but sample size is too small to reliably detect it or sampling fluctuation caused it to be smaller than usual in the present sample. Together with the misuse of P values, one often sees stuff like “men and women differed in trait1 (p<0.04) but did not differ in trait2 (p>0.05), as if the p value difference of .01 has some magical significance.
These are rather obvious (to me), so I don’t know why I keep reading papers (Wassell et al, 2015) that go like this:
2.1. Experiment 1
In experiment 1 participants filled in the VVIQ2 and reported their current menstrual phase by counting backward the appropriate number of days from the next onset of menstruation. We grouped female participants according to these reports. Fig. 2A shows the mean VVIQ2 score for males and females in the follicular and mid luteal phases (males: M = 56.60, SD = 10.39, follicular women: M = 60.11, SD = 8.84, mid luteal women: M = 69.38, SD = 8.52). VVIQ2 scores varied between menstrual groups, as confirmed by a significant one-way ANOVA, F(2, 52) = 8.63, p < .001, η2 = .25. Tukey post hoc comparisons revealed that mid luteal females reported more vivid imagery than males, p < .001, d = 1.34, and follicular females, p < .05, d = 1.07, while males and follicular females did not differ, p = .48, d = 0.37. These data suggest a possible link between sex hormone concentration and the vividness of mental imagery.
A normal interpretation of the above has the authors as making the fallacy. It is even contradictory, an effect size of d=.37 is a medium-small effect, but in the same sentence they state that there is no effect (i.e. d=0).
However, later on they write:
VVIQ2 scores were found to significantly correlate with imagery strength from the binocular rivalry task, r = .37, p < .01. As is evident in Fig. 3A, imagery strength measured by the binocular rivalry task varied significantly between menstrual groups, F(2, 55) = 8.58, p < .001, η2 = .24, with mid luteal females showing stronger imagery than both males, p < .05, d = 1.03, and late follicular females, p < .001, d = 1.26. These latter two groups’ scores did not differ significantly, p = .51, d = 0.34. Together, these findings support the questionnaire data, and the proposal that imagery differences are influenced by menstrual phase and sex hormone concentration.
Now the authors are back to phrasing it in a way that cannot be taken as the fallacy. Sometimes it gets more silly. One paper, Kleisner et al (2014) which received quite a lot of attention in the media, is based on this kind of subgroup analysis where the effect had p<.05 for one gender but not the other. The typical source of this silliness is the relatively small sample size of most studies combined with the authors’ use of exploratory subgroup analysis (which they pretend to be hypothesis-driven in their testing). Gender, age, and race are the typical groups explored and in combination.
Probably, it best is scientists would stop using “significant” to talk about lowish p values. There is a very large probability that the public will misunderstand this. (There was agood study recently about this, but I can’t find it again, help!)
Kleisner, K., Chvátalová, V., & Flegr, J. (2014). Perceived intelligence is associated with measured intelligence in men but not women. PloS one, 9(3), e81237.