Disclaimer: I only took a brief look. Doing the reading and writing this post took maybe 30 minutes. I may have missed something, but it is unlikely to have a major effect on the conclusions.

Whenever I hear of these things, I skip all the fluff (journalist interpretations/summaries) and go straight to the technical report/paper. In this case, it’s here: Philosophy for children. So what did they do?

So, this is a randomized trial using a delayed-treatment (waitlist) design. In this way, all the schools eventually get the treatment, which makes it easier to convince schools to sign up. One then does the comparison after the treatment group has received the treatment, but before the delay group gets theirs. The price is that one cannot do a longitudinal comparison, because there is no untreated control group left once the delay group has been treated.

So, we note that the randomization is at the school level and there are only 48 schools, so there’s good room for uneven assignment. They did check for this and didn’t find anything obvious, except for more non-UK natives in the treatment group.

I note that the balance of schools is uneven. Not sure why this is the case exactly. Perhaps they rolled the dice for each school independently, which will only tend to give a split near 50%. I would have drawn from a limited pool to force equal assignment at the school level (i.e. a 24-24 split instead of the actual 22-26).
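To put rough numbers on that guess, here is a minimal Python sketch. It assumes each school was assigned by an independent fair coin flip, which is only my speculation about what they did, and the function names are made up for illustration. It computes how often such flips give a given split, and shows the "limited pool" alternative that forces 24-24.

```python
import math
import random

N_SCHOOLS = 48  # number of schools in the trial, as stated above


def prob_split(n_treated, n_schools=N_SCHOOLS):
    """Probability that independent fair coin flips (an assumption)
    put exactly n_treated schools in the treatment arm."""
    return math.comb(n_schools, n_treated) / 2 ** n_schools


def prob_at_least_as_uneven(n_treated, n_schools=N_SCHOOLS):
    """Probability of a split at least as lopsided as n_treated vs. the rest."""
    gap = abs(2 * n_treated - n_schools)
    return sum(
        prob_split(k, n_schools)
        for k in range(n_schools + 1)
        if abs(2 * k - n_schools) >= gap
    )


def assign_from_pool(n_schools=N_SCHOOLS, seed=None):
    """Shuffle a fixed pool of labels so the arms are as equal as possible."""
    rng = random.Random(seed)
    half = n_schools // 2
    pool = ["treatment"] * half + ["control"] * (n_schools - half)
    rng.shuffle(pool)
    return pool


if __name__ == "__main__":
    print(f"P(exactly 24-24) = {prob_split(24):.3f}")
    print(f"P(at least as uneven as 22-26) = {prob_at_least_as_uneven(22):.3f}")
    print(assign_from_pool(seed=0).count("treatment"))
```

Under the coin-flip assumption, an exact 24-24 split happens only about 11% of the time, and a split at least as uneven as 22-26 happens about two-thirds of the time, so the observed imbalance is unremarkable if that is how they randomized.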

What were the outcomes? There were two: scholastic achievement and cognitive ability (CA, intelligence). The first was measured with 3 standardized tests on reading, math and writing. The second was measured via the CAT-4 battery, also used by e.g. Deary et al 2007.

So what did they find? I compiled their results into a table:

| Outcome | Subtype | Group | d | Pre-score – treatment | Post-score – treatment | Score change – treatment | Pre-score – control | Post-score – control | Score change – control | Pre-score difference |
|---|---|---|---|---|---|---|---|---|---|---|
| Achievement | Reading | All | 0.12 | -0.08 | -0.02 | 0.06 | 0.08 | 0.02 | -0.05 | -0.16 |
| Achievement | Writing | All | 0.03 | -0.07 | -0.05 | -0.05 | 0.07 | 0.06 | -0.02 | -0.14 |
| Achievement | Math | All | 0.10 | -0.09 | -0.04 | -0.04 | 0.08 | 0.04 | -0.04 | -0.17 |
| Cognitive ability | General | All | 0.07 | | | | | | | |
| Cognitive ability | Verbal | All | 0.08 | | | | | | | |
| Cognitive ability | Quantitative | All | -0.01 | | | | | | | |
| Cognitive ability | Non-verbal | All | 0.04 | | | | | | | |
| Cognitive ability | Spatial | All | 0.07 | | | | | | | |
| Achievement | Reading | FSM | 0.29 | -0.40 | -0.16 | 0.24 | -0.10 | -0.12 | -0.02 | -0.30 |
| Achievement | Writing | FSM | 0.17 | -0.36 | -0.25 | 0.12 | -0.10 | -0.12 | -0.02 | -0.26 |
| Achievement | Math | FSM | 0.20 | -0.36 | -0.28 | 0.09 | -0.03 | -0.11 | -0.08 | -0.33 |
| Cognitive ability | General | FSM | -0.02 | | | | | | | |
| Cognitive ability | Verbal | FSM | | | | | | | | |
| Cognitive ability | Quantitative | FSM | | | | | | | | |
| Cognitive ability | Non-verbal | FSM | | | | | | | | |
| Cognitive ability | Spatial | FSM | | | | | | | | |

(FSM = eligible for free school meals, i.e. a proxy for being poor.)

Overall, their results are fairly small: d’s .03 to .12 on the achievement tests, and .07 on the CA test.

It’s important to note what they expected to find. In their power analysis they write:

Power calculations make a number of assumptions that are not relevant here, but for illustration the estimate of sample size in the protocol was based on prior research evidence suggesting an effect size of 0.4. Assuming an intra-cluster correlation of 0.2 for the outcome scores, a minimum sample size of 480 pupils per arm would be needed (for 80% power to detect a difference of 0.4 with alpha of 5%) according to Lehr’s formula (Gorard 2013). In fact, the situation is better than this, because of the correlation between pre- and post-tests scores for both Key Stage and CAT data. Thus a sample of 48 schools with over 3000 pupils should easily provide sufficient traditional ‘power’ to detect an effect in terms of CAT4 and KS2 progress outcomes.

So, even their largest overall effect is only 30% of their expected effect size (.12 vs. .40).
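To see what that shortfall means for the sample size logic in the quote, here is a rough Python sketch of Lehr's rule of thumb (n per arm ≈ 16 / d² for 80% power at a two-sided alpha of 5%), inflated by the standard design effect for cluster randomization, 1 + (m − 1) × ICC. The report does not state the cluster size m in the quoted passage; the value of 20 below is my own guess, picked only because it reproduces their figure of 480 per arm. The function names are made up for illustration, and the sketch ignores the gain from the pre/post-test correlation that the authors mention.

```python
import math


def lehr_n_per_arm(d):
    """Lehr's rule of thumb: pupils per arm for 80% power, two-sided alpha 5%."""
    return 16 / d ** 2


def design_effect(cluster_size, icc):
    """Variance inflation from randomizing clusters instead of individuals."""
    return 1 + (cluster_size - 1) * icc


def clustered_n_per_arm(d, cluster_size, icc):
    """Lehr's n per arm multiplied by the design effect, rounded up."""
    n = lehr_n_per_arm(d) * design_effect(cluster_size, icc)
    # Round first so floating-point noise does not push an exact value up by one.
    return math.ceil(round(n, 6))


if __name__ == "__main__":
    ICC = 0.2          # from the quoted power analysis
    CLUSTER_SIZE = 20  # assumption: not stated in the quote
    for d in (0.40, 0.12):
        print(d, clustered_n_per_arm(d, CLUSTER_SIZE, ICC))
```

Because the required sample scales with 1 / d², an effect of .12 instead of .40 needs roughly 11 times as many pupils per arm under the same assumptions.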

They emphasize that the effect size is larger for the FSM group, which is true for the achievement tests but not for the CA test. I saw no formal interaction test (nothing turned up using text search for “interaction”) and they give no confidence intervals, so I don’t know if the difference they emphasize is statistically reliable or not.

Supposing it is real, there is a simple alternative explanation not considered in the study as far as I can tell: regression towards the mean (0 text search hits for “regress”). If you look at the pre-scores, the treatment group invariably had lower initial scores by accident of the randomization, especially so for the FSM group. As such, they are expected to regress towards the mean (i.e. 0). The control group likewise, but in the other direction. The more extreme pre-score differences for the FSM group should also produce more extreme effect sizes, which they did in fact do.
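The regression argument can be illustrated with a small simulation. This is a toy sketch, not a re-analysis of their data: it assumes no treatment effect at all, standardized scores, purely pupil-level noise (no school clustering), and a pre/post correlation of 0.7 that I made up. It repeats the randomization many times, keeps only the trials where the treatment arm happened to start lower than the control arm, and looks at the average difference in gains.

```python
import random
import statistics


def simulate_trial(rng, n_per_arm, r):
    """One trial with NO treatment effect.

    Returns (pre-score difference, gain difference), treatment minus control.
    Pre and post scores each have variance 1 and correlate r within pupils.
    """
    def arm_means():
        pre, post = [], []
        for _ in range(n_per_arm):
            ability = rng.gauss(0, 1)  # stable component shared by both tests
            pre.append(r ** 0.5 * ability + (1 - r) ** 0.5 * rng.gauss(0, 1))
            post.append(r ** 0.5 * ability + (1 - r) ** 0.5 * rng.gauss(0, 1))
        return statistics.fmean(pre), statistics.fmean(post)

    t_pre, t_post = arm_means()
    c_pre, c_post = arm_means()
    return t_pre - c_pre, (t_post - t_pre) - (c_post - c_pre)


def mean_gain_gap_given_low_start(n_trials=2000, n_per_arm=50, r=0.7,
                                  cutoff=-0.15, seed=1):
    """Average gain difference over trials where treatment started low."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_trials):
        pre_diff, gain_diff = simulate_trial(rng, n_per_arm, r)
        if pre_diff <= cutoff:  # treatment arm began below control by chance
            kept.append(gain_diff)
    return statistics.fmean(kept), len(kept)


if __name__ == "__main__":
    gap, n_kept = mean_gain_gap_given_low_start()
    print(f"kept {n_kept} trials, mean gain difference = {gap:.3f}")
```

In this setup the treatment arm shows a positive average "gain" relative to control even though nothing was done to it, purely because it started lower. How much of the real FSM difference this could explain depends on the true pre/post correlation and on how much of the baseline gap is stable school-level difference rather than noise, neither of which I know.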

So no, I’m not very convinced by these data, especially considering that these gains may be useless. After all, it’s easy enough to teach children stuff, but they forget it after some time, and the stuff you teach them may not be very useful in their later life.