Some time ago, I wrote on Reddit:

There are two errors that I see quite frequently:

- Concluding from the fact that a statistically significant difference was found that a socially, scientifically, or otherwise meaningful difference was found. The reason this doesn’t work is that any minute difference will be statistically significant if N is large enough. Some datasets have N = 1e6, so very small differences between groups can be detected reliably. That does not mean they are worth any attention. The general problem is a lack of focus on effect sizes.
- Concluding from the fact that a difference was not statistically significant that there is no difference in that trait. The error is ignoring the possibility of a false negative: there is a difference, but the sample size is too small to detect it reliably, or sampling fluctuation made it smaller than usual in the present sample. Together with the misuse of p values, one often sees claims like “men and women differed in trait1 (p < .04) but did not differ in trait2 (p > .05)”, as if the p value difference of .01 had some magical significance.
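Both errors are easy to demonstrate by simulation. Here is a minimal sketch in R (the effect sizes and sample sizes are made up for illustration):

```r
# Error 1: with huge N, even a trivial difference is statistically significant.
set.seed(1)
a = rnorm(1e6, mean = 0.00)
b = rnorm(1e6, mean = 0.01)  # true d = 0.01, a negligible effect
t.test(a, b)$p.value         # tiny p value despite the tiny effect

# Error 2: with small N, a real medium-sized difference often is not significant.
x = rnorm(20, mean = 0.0)
y = rnorm(20, mean = 0.5)    # true d = 0.5, a medium effect
t.test(x, y)$p.value         # often > .05 at samples this small (a false negative)
```

The point of the sketch: the p value confounds effect size with sample size, so it cannot by itself tell you whether a difference matters.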

These errors are rather obvious (to me), so I don’t know why I keep reading papers (Wassell et al., 2015) that go like this:

> ## 2.1. Experiment 1
>
> In experiment 1 participants filled in the VVIQ2 and reported their current menstrual phase by counting backward the appropriate number of days from the next onset of menstruation. We grouped female participants according to these reports. Fig. 2A shows the mean VVIQ2 score for males and females in the follicular and mid luteal phases (males: M = 56.60, SD = 10.39; follicular women: M = 60.11, SD = 8.84; mid luteal women: M = 69.38, SD = 8.52). VVIQ2 scores varied between menstrual groups, as confirmed by a significant one-way ANOVA, F(2, 52) = 8.63, p < .001, η² = .25. Tukey post hoc comparisons revealed that mid luteal females reported more vivid imagery than males, p < .001, d = 1.34, and follicular females, p < .05, d = 1.07, while males and follicular females did not differ, p = .48, d = 0.37. These data suggest a possible link between sex hormone concentration and the vividness of mental imagery.

A natural reading of the above has the authors committing the fallacy. It is even self-contradictory: an effect size of d = 0.37 is a small-to-medium effect, yet in the same sentence they state that there is no effect (i.e., d = 0).

However, later on they write:

> VVIQ2 scores were found to significantly correlate with imagery strength from the binocular rivalry task, r = .37, p < .01. As is evident in Fig. 3A, imagery strength measured by the binocular rivalry task varied significantly between menstrual groups, F(2, 55) = 8.58, p < .001, η² = .24, with mid luteal females showing stronger imagery than both males, p < .05, d = 1.03, and late follicular females, p < .001, d = 1.26. These latter two groups’ scores did not differ significantly, p = .51, d = 0.34. Together, these findings support the questionnaire data, and the proposal that imagery differences are influenced by menstrual phase and sex hormone concentration.

Now the authors are back to phrasing it in a way that cannot be taken as the fallacy. Sometimes it gets sillier. One paper, Kleisner et al. (2014), which received quite a lot of attention in the media, is based on exactly this kind of subgroup analysis: the effect had p < .05 for one gender but not the other. The typical source of this silliness is the relatively small sample size of most studies combined with the authors’ use of exploratory subgroup analysis (which they present as if it were hypothesis-driven). Gender, age, and race are the typical subgroups explored, alone and in combination.

Probably it would be best if scientists stopped using “significant” to talk about lowish p values. There is a very large probability that the public will misunderstand it. (There was a good study about this recently, but I can’t find it again. Help!)

**References**

Kleisner, K., Chvátalová, V., & Flegr, J. (2014). Perceived intelligence is associated with measured intelligence in men but not women. *PloS one*, *9*(3), e81237.

Wassell, J., Rogers, S. L., Felmingham, K. L., & Pearson, J. (2015). Sex hormones predict the sensory strength and vividness of mental imagery. *Biological Psychology*, *107*, 61–68.

Normal practice is to treat Likert scales as continuous variables even though they are not. As long as there are >=5 options, the bias from discreteness is not large.

I simulated the situation for you. I generated two continuous variables from normal distributions with a correlation of .50, N = 1000. Then I created Likert scales with varying numbers of levels from the second variable. Then I correlated all these variables with each other.

Correlations of continuous variable 1 with:

| Variable    | *r*   |
|-------------|-------|
| continuous2 | 0.500 |
| likert10    | 0.482 |
| likert7     | 0.472 |
| likert5     | 0.469 |
| likert4     | 0.432 |
| likert3     | 0.442 |
| likert2     | 0.395 |

So you see, introducing discreteness biases correlations toward zero, but not by much as long as the Likert scale has >=5 levels. (The likert4/likert3 reversal is just sampling noise.) You can correct for the bias by multiplying by the correction factor, i.e. the ratio of the true correlation to the attenuated one, if desired:

Correction factors:

| Variable    | Factor |
|-------------|--------|
| continuous2 | 1.000  |
| likert10    | 1.037  |
| likert7     | 1.059  |
| likert5     | 1.066  |
| likert4     | 1.157  |
| likert3     | 1.131  |
| likert2     | 1.266  |
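These factors are just the true correlation divided by the attenuated one, so they can be computed directly from the table above:

```r
# correction factor = true r / observed (attenuated) r
r_true = 0.50
r_obs  = c(likert10 = 0.482, likert7 = 0.472, likert5 = 0.469,
           likert4 = 0.432, likert3 = 0.442, likert2 = 0.395)
round(r_true / r_obs, 3)
# likert10  likert7  likert5  likert4  likert3  likert2
#    1.037    1.059    1.066    1.157    1.131    1.266
```

Note that these factors are specific to this simulation; in practice one would derive them from an assumed latent distribution rather than from a known true correlation.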

Psychologically, if your data do not make sense as an interval scale, i.e. if the difference between options 1 and 2 is not the same as between options 3 and 4, then you should use Spearman’s correlation instead of Pearson’s. However, it will rarely make much of a difference.
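A quick sketch of that comparison (the variables here are illustrative, not the ones from the simulation above):

```r
set.seed(42)
x = rnorm(1000)
# a 5-level Likert-style discretization of a correlated variable
y = as.numeric(cut(x + rnorm(1000), breaks = 5))
cor(x, y, method = "pearson")
cor(x, y, method = "spearman")  # typically very close to the Pearson value
```

With reasonably symmetric data like this, the two coefficients agree closely; larger gaps appear mainly with strong skew or severe nonlinearity.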

Here’s the R code. (The original posting had curly “smart quotes” around the strings, which break when pasted into R; they are fixed below.)

```r
# load library
library(MASS)

# simulate a dataset of 2 variables with a correlation of .50, N = 1000
simul.data = mvrnorm(1000, mu = c(0, 0),
                     Sigma = matrix(c(1, 0.50, 0.50, 1), ncol = 2),
                     empirical = TRUE)
simul.data = as.data.frame(simul.data)
colnames(simul.data) = c("continuous1", "continuous2")

# divide the second variable into bins of equal length
simul.data["likert10"] = as.numeric(cut(unlist(simul.data[2]), breaks = 10))
simul.data["likert7"]  = as.numeric(cut(unlist(simul.data[2]), breaks = 7))
simul.data["likert5"]  = as.numeric(cut(unlist(simul.data[2]), breaks = 5))
simul.data["likert4"]  = as.numeric(cut(unlist(simul.data[2]), breaks = 4))
simul.data["likert3"]  = as.numeric(cut(unlist(simul.data[2]), breaks = 3))
simul.data["likert2"]  = as.numeric(cut(unlist(simul.data[2]), breaks = 2))

# correlations
round(cor(simul.data), 3)
```