The Seven Stones

Fewer papers to read, more data to use…

In a nice post at bbgm, Deepak writes:

…historical online literature lacks the relevant structure and metadata to make our task easier, but it is time that publishers thought ahead about some of the advantages of online publishing.


I couldn’t agree more. I have sometimes heard the claim that within 5-10 years, more than 95% of the scientific literature will be read by computers only. Possible. However, the converse might be interesting to consider: what if 95% of scientific papers could be ‘written’ by computers? Even if this formulation is obviously provocative and unrealistic, the point is that harnessing the ‘network effect’ of the web may have two complementary components, one community-driven, the other computer-driven. On one hand, web 2.0 functionalities enable community-driven commenting, rating and even writing of scientific publications. On the other hand, semantic web technologies are expected to facilitate computer-driven integration of scientific data from multiple sources, which is likely to play an increasingly important role in science. Rather than mining thousands of unread papers, the scientist of the future may first search the web for relevant data and integrate it to generate, or ‘write’, novel insight. In fact, integration of large datasets already represents a major field of research in systems biology (see Chuang et al 2007, Xue et al 2007 or Mani et al 2008 as recent examples published in Mol Syst Biol).

It thus seems that, in addition to being web 2.0 enabled, new publishing models should ‘embed’ more structured data into online publications. In short, ‘papers’ could progressively transform into hybrid online objects that more closely resemble database records (see Timo Hannay’s post on this topic) or highly structured documents. At the extreme, one could even imagine publishing ‘naked’ datasets, without any ‘stories’ around them. Of course, efficient data integration will require the data to be in a standard, structured format, and its quality will have to be well characterized. None of these requirements is trivial.

Good old-fashioned papers are probably not going to disappear as publication units, in particular for high-impact studies reporting novel and deep insights. Nor is the point here to propose dumping every scientist’s hard drive onto the web. Data-rich publications would be published only when the authors feel it appropriate. There might thus be an equilibrium to find between papers that will never be read except by a text mining engine and pure datasets, published as a resource, that are easier to search, mine and integrate. This dialectic may ultimately boil down to how well text mining and data integration technologies will perform in the future.

In any case, within the context of the current debate about the saturation of the peer-review system, I wonder whether a data-centric form of scientific publishing could help relieve some of the pressure. Reviewing datasets might be quicker and could rely more on standardized evaluation parameters. If accompanied by proper credit attribution mechanisms and metrics of impact, data-rich (or even data-only) publications may represent an alternative model complementing the traditional ‘paper’ format. It would prevent the loss of useful data otherwise buried in verbal descriptions and, most importantly, would hopefully stimulate web-wide integration of disparate datasets.


    Roy Wollman said:

    This is a very valid point. Integration of datasets from different papers, or even the re-analysis of data from a single paper, could be very helpful. Unfortunately, the current publishing paradigm doesn’t support this at all. I am a computational biologist (which is a nice way of saying that a lot of the time I am a parasite on other people’s data…). Most of my work is collaborative, so if I have an idea for a new data integration or a different analysis of the raw data this is not a problem. But I also often resort to “reverse engineering” figures from papers to extract the raw data so that I can analyze it in a different way or in a different context. The lack of the raw data that was used to create the figures hinders the further progress that could result from having more “fresh eyes” looking at the same data.

    The solution is simple: each published paper should be accompanied by a compressed archive file of all the raw data used to generate its figures. It doesn’t have to be the entire hard drive or lab notebook. It could be just the final gels, images and tables that were used to create the figures the author decided are important for the “story”. This is already somewhat the case for papers with very large datasets (microarrays, large screens), but I don’t see a reason why it shouldn’t be the norm rather than the exception.

    Ideally you would want this raw data file to be nicely structured and machine readable, which would of course require a standard, and standards are hard to agree on. But the lack of standards is no excuse. First, unstructured data is better than no data at all. Second, many types of data have standards already (OME, SBML and MIAME, to name just a few). And finally, I think this falls into the ‘if you build it, they will come’ category. Without extensive sharing of raw data there is no need for standards. Forcing scientists to share raw data would create those standards, which would then evolve with time.

    Besides dataset integration and reanalysis, there is the additional “side benefit” of increased scientific integrity. Knowing that the entire world has access to your raw data could act as a deterrent. A higher chance of getting caught even after the paper is out would quickly reduce fraud and misconduct.

    The only reason I see not to do it is the additional cost of storage. However, with the price of online storage the way it is, this should really be minor. And if cost is a problem, I’m sure that Google would be happy to create “Google Scientific” and organize the scientific world’s raw data…
