There’s a Commentary in this week’s Nature (free access this week) about detecting plagiarism in scientific papers using eTBLAST, a strange but seemingly effective hybrid of alignment search and heuristics originally designed to help search PubMed. Basically you give it a paragraph of text and it finds papers that contain similar words and phrases.
Here’s a summary from the News piece accompanying the story:
As many as 200,000 of the 17 million articles in the Medline database might be duplicates, either plagiarized or republished by the same author in different journals, according to a commentary published in Nature today [… the authors] used text-matching software to look for duplicate or highly-similar abstracts in more than 62,000 randomly selected Medline abstracts published since 1995.
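To make "text-matching" a bit more concrete, here's a toy sketch of one common way to score how similar two abstracts are: break each into overlapping three-word "shingles" and compute the Jaccard overlap of the two sets. To be clear, this is not how eTBLAST works (I haven't seen its internals); the function names, the shingle size and the 0.5 threshold are all my own assumptions for illustration.

```python
import re


def shingles(text, n=3):
    """Return the set of overlapping n-word windows in text (lowercased)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < n:
        # Too short for a full window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard similarity of the shingle sets of a and b, from 0.0 to 1.0."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def find_similar(query, abstracts, threshold=0.5, n=3):
    """Return (score, key) pairs for abstracts scoring at or above threshold.

    `abstracts` is a hypothetical dict mapping an ID to abstract text.
    Results are sorted with the highest score first.
    """
    hits = [(jaccard(query, text, n), key) for key, text in abstracts.items()]
    return sorted(((s, k) for s, k in hits if s >= threshold), reverse=True)
```

You'd call something like `find_similar(my_abstract, {"paper-a": "...", "paper-b": "..."})` and eyeball whatever comes back. Comparing every abstract against every other one this way wouldn't scale to 17 million records, which is presumably why real tools lean on indexing and heuristics first.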
This is the second place this week that I’ve read about bioinformatics techniques being applied to document processing; the other was Deepak’s post about IBM using the Teiresias algorithm to detect spam emails with great success. Does anyone know of any other bioinformatics algorithms that have been applied to non-biological problems? BLAST, surely, must have some other novel uses…
Anyway, the authors do mention that:
In general, the duplication of scientific articles has largely been ignored by the gatekeepers of scientific information — the publishers and database curators. Very few journal editors attempt to systematically detect duplicates at the time of submission.
Sort of. CrossRef – the academic publishing industry body that looks after DOIs for scientific papers, amongst other things – is building a plagiarism detection service called CrossCheck in association with the company that makes Turnitin, a popular piece of software used by high schools and colleges to make sure students don’t crib off each other. If you are going to submit exactly the same paper to multiple journals in the hope of getting multiple citations then do it now…