Making better use of the scientific literature: large-scale automatic retrieval of data from published figures

Science is a set of related methods that aim to find true patterns about the world. These methods are generally designed to remove noise from random circumstances (the traditional focus of statistics) and from human biases. Current practices are not very good at the second part because of the innumerable ways human biases can distort findings (see e.g. the discussion of social psychology in Crawford and Jussim’s new book). However, I feel confident that many of these biases can be strongly reduced with the advent of several new tools and practices: 1) the development of meta-analytic tools to properly summarize existing, possibly biased research (e.g. p-curve, z-curve, PET-PEESE, TIVA, R-index), 2) registered replication reports that remove outcome bias from peer review, 3) increasing reliance on automated tools for checking the validity of scientific works (statcheck, GRIM, SPRITE etc.), and 4) increasing awareness of the very substantial ideological biases present in the scientific community.

Here I want to suggest a new approach that attacks a different but related problem: the lack of open research data. While there is a growing movement to publish all research data, a large chunk of it still remains unpublished. Most of this data will eventually be lost, since no backups exist and the authors eventually die, throw away their laptops, and so on. However, the papers endure, and papers contain various visualizations of the data. Some of these are scatterplots (example on the right), which allow for automatic and complete retrieval of the underlying data (unless there’s overplotting).

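Once the dot centres in a scatterplot have been located (e.g. by a blob detector), recovering the data is essentially just a linear mapping from pixel coordinates to data coordinates. A minimal sketch of that last step, with made-up calibration values and dot positions purely for illustration:

```python
# Minimal sketch: convert detected scatterplot points from pixel space to data
# space, assuming the dot centres have already been found and that two known
# axis tick positions per axis are available for calibration.
import numpy as np

def calibrate_axis(pixel_lo, pixel_hi, value_lo, value_hi):
    """Return a function mapping a pixel coordinate to a data value by linear
    interpolation between two known tick positions on that axis."""
    def to_data(pixel):
        frac = (pixel - pixel_lo) / (pixel_hi - pixel_lo)
        return value_lo + frac * (value_hi - value_lo)
    return to_data

# Hypothetical calibration: x tick at pixel 100 == 0.0, at pixel 700 == 10.0;
# y tick at pixel 500 == 0.0, at pixel 50 == 100.0 (pixel y runs downward).
x_map = calibrate_axis(100, 700, 0.0, 10.0)
y_map = calibrate_axis(500, 50, 0.0, 100.0)

# Detected dot centres in pixel coordinates (made-up values).
dots_px = np.array([[150, 460], [320, 300], [680, 80]])
data = np.column_stack([x_map(dots_px[:, 0]), y_map(dots_px[:, 1])])
print(data)  # recovered (x, y) pairs in data units
```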

Other figures mostly visualize summary statistics, which still allow some form of data retrieval. For instance, boxplots (on the left) allow retrieval of the 25th, 50th, and 75th centiles, any outlying datapoints, and usually the whisker positions at 1.5 IQR. This kind of information can be used in meta-analyses and can also be combined with GRIM/SPRITE-type methods to verify the integrity of the underlying data.
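As a rough sketch of both uses: quartiles read off a boxplot can be turned into the mean and SD inputs that meta-analytic software typically expects, and a reported mean can be screened with a GRIM-style consistency check. The mean/SD approximation below follows the commonly used formulas of Wan et al. (2014); treating that particular estimator as the right choice is an assumption on my part, and the example values are made up.

```python
def boxplot_to_mean_sd(q1, median, q3):
    """Crude mean/SD estimates from quartiles read off a boxplot
    (the SD approximation assumes roughly normal data)."""
    mean_est = (q1 + median + q3) / 3.0
    sd_est = (q3 - q1) / 1.35  # IQR of a normal distribution is about 1.35 * SD
    return mean_est, sd_est

def grim_consistent(reported_mean, n, decimals=2):
    """GRIM-style check: with n integer-valued observations the true mean must
    be a multiple of 1/n, so test whether the reported rounded mean is
    reachable from one. (Rounding edge cases are ignored in this sketch.)"""
    nearest_attainable = round(reported_mean * n) / n
    return round(nearest_attainable, decimals) == round(reported_mean, decimals)

print(boxplot_to_mean_sd(10.0, 14.0, 20.0))  # quartiles read from a figure
print(grim_consistent(3.48, 25))             # can 25 integer scores average 3.48?
```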

Various other kinds of visualizations allow for intermediate levels of data retrieval. For instance, scatterplots that vary the size of the dots by a third variable allow retrieval of at least some levels of that variable as well. Many newer PDFs contain vector graphics rather than raster graphics, which allows for precise rather than merely approximate retrieval.
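When the figure is stored as vector graphics, the marker coordinates are written out explicitly and can be read back without any image processing. A minimal sketch, assuming the figure has been converted to SVG (e.g. with a PDF-to-SVG converter) and that its markers are simple &lt;circle&gt; elements; real figures often encode markers as paths instead, and the file name here is hypothetical:

```python
# Minimal sketch: read exact marker positions (and radii, useful for bubble
# plots where size encodes a third variable) from an SVG version of a figure.
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

def extract_circles(svg_file):
    root = ET.parse(svg_file).getroot()
    markers = []
    for c in root.iter(SVG_NS + "circle"):
        markers.append((float(c.get("cx", 0)),
                        float(c.get("cy", 0)),
                        float(c.get("r", 0))))
    return markers  # exact (x, y, radius) in the figure's own coordinate system

# markers = extract_circles("figure1.svg")  # hypothetical file name
```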


There already exists quite a large collection of published tools for data retrieval, including some open source R and Python packages that could easily be integrated into any existing framework for data mining the scientific literature. I have found the following collections of tools:

  • https://academia.stackexchange.com/questions/7671/software-for-extracting-data-from-a-graph-without-having-to-click-on-every-singl
  • https://stats.stackexchange.com/questions/14437/software-needed-to-scrape-data-from-graph/72974
  • https://www.techatbloomberg.com/blog/scatteract-first-fully-automated-way-mining-data-scatter-plots/
