…historical online literature lacks the relevant structure and metadata
to make our task easier, but it is time that publishers thought ahead
about some of the advantages of online publishing.
I can’t agree more. I have sometimes heard the claim that within 5–10 years, more than 95% of the scientific literature will be read only by computers. Possible. However, the converse might be interesting to consider: what if 95% of scientific papers could be ‘written’ by computers? Even if this formulation is obviously provocative and unrealistic, the point is that harnessing the ‘network effect’ of the web may have two complementary components, one community-driven, the other computer-driven. On one hand, web 2.0 functionalities enable community-driven commenting, rating and even writing of scientific publications. On the other hand, semantic web technologies are expected to facilitate computer-driven integration of scientific data from multiple sources, which is likely to play an increasingly important role in science. Rather than mining thousands of unread papers, the scientist of the future may first search the web for relevant data and integrate it to generate – or ‘write’ – novel insight. In fact, the integration of large datasets already represents a major field of research in systems biology (see Chuang et al 2007, Xue et al 2007 or Mani et al 2008 as recent examples published in Mol Syst Biol).
It thus seems that, in addition to being web 2.0 enabled, new publishing models should ‘embed’ more structured data into online publications. In short, ‘papers’ could progressively transform into hybrid online objects that more closely resemble database records (see Timo Hannay’s post on this topic) or highly structured documents. At the extreme, one could even imagine publishing ‘naked’ datasets, without any ‘story’ around them. Of course, efficient data integration will require the data to be in a standard, structured format, and its quality will have to be well characterized. These are far from trivial requirements.
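To make the idea concrete, here is a toy sketch of what such a structured ‘data publication’ might look like: the dataset itself bundled with the metadata (provenance, format, quality) that machines would need to find and integrate it. All field names and values below are invented for illustration, loosely in the spirit of database records rather than any real standard:

```python
import json

# A hypothetical, minimal 'data publication' record: the dataset plus the
# structured metadata that would let other programs search, mine and
# integrate it directly, without text mining a narrative paper.
record = {
    "title": "Example kinase-substrate interaction dataset (hypothetical)",
    "authors": ["A. Researcher", "B. Colleague"],
    "format": "tab-separated values",
    # Quality metadata: the kind of standardized parameters that a
    # reviewer of a data-only publication might check.
    "quality": {"replicates": 3, "estimated_false_positive_rate": 0.05},
    "data": [
        {"kinase": "KIN1", "substrate": "SUB1", "score": 0.92},
        {"kinase": "KIN2", "substrate": "SUB2", "score": 0.87},
    ],
}

# Serializing to JSON yields a self-describing record that another
# program can consume as-is.
print(json.dumps(record, indent=2))
```

The point of the sketch is not the particular schema but the packaging: the ‘story’ is optional, while the data and its quality characterization travel together in a machine-readable form.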
Good old-fashioned papers are probably not going to disappear as publication units, in particular for high-impact studies reporting novel and deep insights. Nor is the point here to propose dumping every scientist’s hard drive onto the web: data-rich publications would appear only when the authors feel it appropriate. There may thus be an equilibrium to find between papers that will never be read except by a text mining engine and pure datasets, published as a resource, easier to search, mine and integrate. This dialectic may ultimately boil down to how well text mining and data integration technologies will perform in the future.
In any case, in the context of the current debate about the saturation of the peer-review system, I wonder whether a data-centric form of scientific publishing could help relieve some of the pressure. Review of datasets might be quicker and could rely more on standardized evaluation parameters. Combined with proper credit attribution mechanisms and metrics of impact, data-rich (or even data-only) publications may represent an alternative model complementing the traditional ‘paper’ format. It would prevent the loss of useful data otherwise buried in verbal descriptions and, most importantly, would hopefully stimulate web-wide integration of disparate datasets.