An algorithm that can detect duplicate images across thousands of research papers has been created with the hope of licensing it to journals and research-integrity offices.
A team of researchers from Syracuse University (NY, USA) has created an algorithm capable of analyzing more than 2.6 million images from research papers on PubMed to check for image fraud.
Currently, many journals perform random spot checks on submitted images, but very few have automated processes, so fraudulent images can slip through the cracks. The researchers, led by Daniel Acuna, realized that a routine, automated procedure to replace the laborious and time-consuming process of manually checking images in submitted manuscripts was long overdue.
The algorithm extracted images – including micrographs of cells and tissues, and gel blots – from 760,000 articles on the PubMed database of biomedical literature and created a digital fingerprint of each image by zooming in on the most feature-rich areas.
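To make the idea of an image fingerprint concrete, here is a minimal sketch in Python. It uses a simple "average hash", which is a stand-in technique and not the method the researchers describe: the article only says their fingerprints are built from feature-rich areas of each image. The function names, the 4x4 grid size and the toy image are all assumptions made for illustration. Unlike the researchers' algorithm, this toy fingerprint survives resizing and uniform brightening but not rotation.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels (0-255), row-major; hypothetical format


def downsample(image: Image, size: int = 4) -> List[List[float]]:
    """Shrink an image to size x size cells by averaging blocks of pixels.

    Assumes the image has at least `size` pixels on each side.
    """
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(size):
        r0, r1 = r * rows // size, (r + 1) * rows // size
        row = []
        for c in range(size):
            c0, c1 = c * cols // size, (c + 1) * cols // size
            block = [image[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def average_hash(image: Image, size: int = 4) -> int:
    """Return a size*size-bit fingerprint: bit is 1 where a cell is brighter than the mean."""
    cells = [v for row in downsample(image, size) for v in row]
    mean = sum(cells) / len(cells)
    bits = 0
    for v in cells:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bits in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Toy 8x8 image: a left-to-right gradient on top, the reverse gradient below.
original = [[x * 25 if y < 4 else 175 - x * 25 for x in range(8)] for y in range(8)]

# The same image doubled in size and uniformly brightened.
altered = [[p + 20 for p in row for _ in range(2)] for row in original for _ in range(2)]

# A genuinely different image: the original flipped top to bottom.
flipped = original[::-1]

print(hamming(average_hash(original), average_hash(altered)))
print(hamming(average_hash(original), average_hash(flipped)))
```

A small Hamming distance between two fingerprints suggests a possible duplicate; a large one suggests different images. Any real threshold would have to be tuned on known examples.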
The team was left with approximately 2 million images once common features such as arrows and flow chart components were eliminated. To limit the computational load, each image was compared only against images from publications by the same first and corresponding authors. The algorithm was able to detect duplicates even if the images had been resized, rotated or color-adjusted.
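The saving from comparing images only within the same authors' work can be sketched as follows. This is an illustrative Python example under stated assumptions, not the researchers' code: each image is represented by a hypothetical record of (image ID, author key, integer fingerprint), each image has a single author key, and two fingerprints count as a match when they differ in at most `max_distance` bits.

```python
from collections import defaultdict
from itertools import combinations
from typing import Iterable, Iterator, List, Tuple

# (image_id, author_key, fingerprint); all fields are hypothetical
Record = Tuple[str, str, int]


def hamming(a: int, b: int) -> int:
    """Count the bits in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def candidate_pairs(records: Iterable[Record]) -> Iterator[Tuple[Record, Record]]:
    """Yield only pairs of images that share an author key."""
    by_author = defaultdict(list)
    for rec in records:
        by_author[rec[1]].append(rec)
    for group in by_author.values():
        yield from combinations(group, 2)


def flag_duplicates(records: Iterable[Record], max_distance: int = 1) -> List[Tuple[str, str]]:
    """Return image-ID pairs whose fingerprints are within max_distance bits."""
    return [
        (a[0], b[0])
        for a, b in candidate_pairs(records)
        if hamming(a[2], b[2]) <= max_distance
    ]


records = [
    ("img1", "author_A", 0b1111),
    ("img2", "author_A", 0b1110),  # one bit away from img1
    ("img3", "author_B", 0b1111),  # identical to img1, but a different author
    ("img4", "author_A", 0b0000),
]

same_author = list(candidate_pairs(records))
all_pairs = list(combinations(records, 2))
print(len(same_author), "comparisons instead of", len(all_pairs))
print(flag_duplicates(records))
```

The trade-off is visible in the toy data: grouping by author cuts the number of comparisons, but a duplicate that appears in papers by unrelated authors (here `img1` and `img3`) is never compared and so cannot be flagged.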
The researchers manually examined a sample of 3750 images and estimated that 1.5% of the papers in the database contained suspicious images and 0.6% contained fraudulent images.
“The work shows that it is possible to use technology to detect duplicates,” commented Acuna.
Whilst the algorithm won’t be made public, it will be licensed to journals and research-integrity offices.
“It would be extremely helpful for a research-integrity office,” observed Lauran Qualkenbush, director of the Office for Research Integrity at Northwestern University (IL, USA).
“I am very hopeful my office will be a test site to figure out how to use Daniel’s tool this year.”
In order for the algorithm to be used successfully, publishers would need to create a shared database of all published images. Crossref’s Similarity Check service has enabled screening of submitted manuscripts for plagiarism using iThenticate software. However, plans for a publisher-wide system for image checking are currently non-existent – partly owing to a lack of technology.
However, earlier this year, Elsevier’s $1.2 million partnership with Humboldt University (Berlin, Germany) announced its intention to create a database of images from retracted publications, which could provide test images for researchers to further the development of automated image-screening software.
“The reasons why scientists may commit misconduct are still poorly understood, but regardless of motive the requirement to ‘tread carefully’ when handling such matters remains paramount,” concluded the authors.
Written By Abigail Sawyer
Updated 17 December, 2018
Source Acuna DE, Brookes PS, Kording KP. Bioscience-scale automated detection of figure element reuse. bioRxiv preprints. doi:10.1101/269415 (2018); https://www.nature.com/articles/d41586-018-02421-3