Wanted: scientific immune system to identify weak studies getting lots of attention

In the interest of keeping the scientific enterprise oriented toward finding truth, it is important to reduce the impact of problematic studies in the scientific literature. Studies can be problematic in many ways (e.g. a social science study lacking a control for genetic confounding), but one problem that is relatively simple to identify automatically is low precision due to small sample size. Since such studies are too imprecise to tell us much about reality, they should be given very little attention. Unfortunately, something like the opposite might be true, with small studies with flashy results being given worldwide media attention.

Here’s my idea. We need a database, preferably public, of all scientific papers with full text, updated as they are published. It doesn’t need complete backlog coverage, because we are just trying to reduce incoming damage to the literature from uninformative papers; it is too late for the old ones. As each new paper comes out, it is put in the database along with metadata extracted from it, such as sample sizes, statistical tests, standard errors, and whatever other information one can find. Then we calculate some kind of overall paper informativeness score, which could be something as simple as the replication index or median observed power. We also monitor every paper’s altmetrics score, which tracks attention to papers. Then we identify papers with weak statistics and high attention. The scientific team seeing the results can then write a specific response to each such paper in an attempt to reduce its impact on the literature.
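
To make the flagging step concrete, here is a minimal sketch in Python. It assumes the p values have already been extracted from each paper, uses median observed power (computed with a two-sided z-test approximation) as the informativeness score, and treats the attention score as a single number. The `Paper` class, the `flag_papers` function, and both thresholds are hypothetical choices for illustration, not an existing tool, and the replication index is not implemented here.

```python
from dataclasses import dataclass
from statistics import NormalDist, median

_NORMAL = NormalDist()


def observed_power(p_value: float, alpha: float = 0.05) -> float:
    """Observed power implied by a reported two-sided p value.

    Assumption: the test is approximated as a two-sided z-test, and the
    observed effect is treated as if it were the true effect.
    """
    z_observed = _NORMAL.inv_cdf(1 - p_value / 2)
    z_critical = _NORMAL.inv_cdf(1 - alpha / 2)
    upper_tail = 1 - _NORMAL.cdf(z_critical - z_observed)
    lower_tail = _NORMAL.cdf(-z_critical - z_observed)
    return upper_tail + lower_tail


@dataclass
class Paper:
    title: str
    p_values: list[float]  # p values extracted from the full text
    attention: float       # hypothetical attention score (higher = more attention)

    @property
    def informativeness(self) -> float:
        """Median observed power across the paper's reported tests."""
        return median(observed_power(p) for p in self.p_values)


def flag_papers(papers, max_informativeness=0.5, min_attention=100.0):
    """Return weak-but-popular papers, most attention first.

    Both thresholds are arbitrary placeholders for this sketch.
    """
    flagged = [
        paper for paper in papers
        if paper.informativeness < max_informativeness
        and paper.attention >= min_attention
    ]
    return sorted(flagged, key=lambda paper: paper.attention, reverse=True)
```

Calling `flag_papers` on each day's new papers would give the response team a short list ordered by how much attention each weak paper is getting.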

I’ll give two case studies of what I have in mind.

Yet another trans-generational epigenetics study

So what’s the study? Besides being a trans-generational epigenetics study, itself a red flag, the study does not list the sample size anywhere, not even in the methods section. However, we can infer it from the reported degrees of freedom, which are 10. My guess is that it is a balanced 2×6 study, i.e. a study of 12 mice is causing worldwide media attention! Looking at the stats makes us even more worried. There are 10 p values reported exactly, these are: p = 0.17, p = 0.10, p = 0.07, p = 0.018, p = 0.04, p = 0.01, p = 0.01, p = 0.28, p = 0.19, p = 0.08. All values fall between 0.01 and 0.28, very suspicious! We can’t do a formal test for too little variation here because some of these tests concern related hypotheses, so the values are correlated by design, violating the independence assumption of the TIVA test. But we are still pretty skeptical because a study of n = 12 can produce pretty much any result imaginable and is basically useless.
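
The arithmetic behind this can be sketched in a few lines of Python. Two assumptions are mine: the degrees of freedom come from a comparison of two independent groups (so df = N - 2), and each p value is two-sided. The TIVA idea is that z-scores from independent tests should have a variance of about 1 or more, so a much smaller variance is a warning sign. Because the tests here are not independent, the variance below is only descriptive and is not a valid formal test.

```python
from statistics import NormalDist, variance

# Reported degrees of freedom in the paper.
reported_df = 10

# Assumption: two independent groups, so df = N - 2.
implied_n = reported_df + 2

# The 10 exactly reported p values, assumed to be two-sided.
p_values = [0.17, 0.10, 0.07, 0.018, 0.04, 0.01, 0.01, 0.28, 0.19, 0.08]

# Convert each two-sided p value to the absolute z-score it implies.
normal = NormalDist()
z_scores = [normal.inv_cdf(1 - p / 2) for p in p_values]

# Sample variance of the z-scores; TIVA compares this against 1.
z_variance = variance(z_scores)

print(f"Implied sample size: {implied_n}")
print(f"Variance of z-scores: {z_variance:.2f} (expected about 1 or more if independent)")
```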

Trusting co-partisans more

This study is being given attention by the heterodox community on Twitter (complete listing of people posting the link on Twitter).

The methods section:

2.1. Participants
American residents over 18 years of age who speak English were recruited on Amazon
Mechanical Turk. All participants provided demographic information (see supplementary
materials). 154 participants completed the first part of the task (Learning Stage), out of which
97 participants (34 females and 63 males, aged 20-58 years M = 34.81, SD = 9.59) completed
the second stage (Choice Stage). All participants were paid $2.50 for completing the first stage
of the experiment and were told they could earn a bonus of $2.50 to $7.50 based on their
performance. Thus, they had an incentive to perform well. Because in reality participant
performance was held constant at 50% all participants who completed the entire experiment
were paid a $5 bonus.

I.e. this is a design for which they could easily have collected more data. Why stop at such a low value? I am very suspicious of optional stopping. The splashy result is described as “t(96) = -2.10, p = .038, d = -.37”. I suggest there is no reason to read the rest of the paper until they bump the sample size.
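
To illustrate why optional stopping is worrying in general, here is a small simulation sketch in Python. It does not show that the authors of this paper did anything of the sort. It assumes there is no true effect, draws data from a standard normal distribution, and uses a z-test with known standard deviation for simplicity. The look schedule (every 10 participants from 20 to 100), the number of simulations, and the function names are all hypothetical choices.

```python
import random
from statistics import NormalDist

_NORMAL = NormalDist()


def two_sided_p(sample_sum: float, n: int) -> float:
    """Two-sided p value of a z-test of mean 0 for n draws with known SD 1."""
    z = sample_sum / n ** 0.5
    return 2 * (1 - _NORMAL.cdf(abs(z)))


def simulate_optional_stopping(n_sims=2000, looks=tuple(range(20, 101, 10)),
                               alpha=0.05, seed=1):
    """Return (false positive rate with peeking, false positive rate with one test).

    The null hypothesis is true in every simulated study, so any
    significant result is a false positive.
    """
    rng = random.Random(seed)
    max_n = max(looks)
    look_set = set(looks)
    hits_peeking = 0
    hits_fixed = 0
    for _ in range(n_sims):
        total = 0.0
        significant_at_any_look = False
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)
            # A researcher who peeks would stop and report at the first
            # look where p < alpha.
            if n in look_set and two_sided_p(total, n) < alpha:
                significant_at_any_look = True
        hits_peeking += significant_at_any_look
        # Fixed design: test only once, at the planned final sample size.
        hits_fixed += two_sided_p(total, max_n) < alpha
    return hits_peeking / n_sims, hits_fixed / n_sims


peeking_rate, fixed_rate = simulate_optional_stopping()
print(f"False positive rate with peeking: {peeking_rate:.3f}")
print(f"False positive rate with a single test: {fixed_rate:.3f}")
```

The single-test rate should stay near the nominal 5%, while the peeking rate comes out higher, which is why a barely significant p value at an oddly small sample size is a red flag.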
