London Blog

Tomorrow’s Giants: Data

The Tomorrow’s Giants conference, jointly hosted by Nature and the Royal Society, was held on 1 July 2010 and focussed on the future of UK science. The meeting was composed of three strands: careers, measurement and assessment, and the future of data. Today, I will summarise the main discussion points from the Data strand. All quotes are paraphrased rather than direct (I was typing as fast as I could, but may have mislaid the odd word). I’ve also combined views from various sessions (preparatory, feedback and panel debate) into one narrative.

During the afternoon panel discussion, David Willetts MP related an anecdote that put the data discussion into context. While he was having dinner with Eric Schmidt, CEO of Google, Schmidt suggested that the amount of data now created in one weekend is equal to all the information created in the whole of human history up until 2003. That, I think you’ll agree, is a staggering statistic (if true). What do we do with all this data? Should it be made freely available to all? Who decides? Can we balance data protection and availability? And who curates all this information? These and other questions led to plenty of interesting debate at the Tomorrow’s Giants conference.

Phil Campbell, Editor of Nature, synthesised these issues into three main themes:

1) Funding for curation of data – who will step up to the plate to tackle this?

2) Does the decentralised, cloud approach to collating data offer more than a monolithic, top-down approach?

3) Should publicly funded data be made public?

As soon as Phil stopped speaking, one audience member asked: “If I immediately make my data available, could I still publish in Nature at a later date?” Phil was of the view that authors should not be penalised by journals for sharing data, and that the practice was actively encouraged by Nature with respect to the human genome. I sensed that this is something of a grey area, however, and that different journals might take different views depending on the nature of the science.

It was then asked: “How can we ensure long-term infrastructure to allow all kinds of scientists to use the huge datasets we have today?” Two bridges need to be crossed: the differences in habits, attitudes and vocabulary between disciplines, and getting money to support databases and infrastructure. Indeed, we heard several times that it is relatively easy to get money to set up a database, but very hard to find funding to maintain and curate it. David Willetts MP suggested that institutions are needed to sustain these sorts of projects through economic ups and downs. There’s some merit in placing firm responsibility with one entity, but I was a little surprised no one championed ‘bottom-up’ approaches to maintaining a database. Grassroots methods for distributing the cost and effort of handling data seem, to me, potentially more robust to the whims of the economy than a centralised agency.

Returning to the theme of institutions housing data, one audience member wondered whether librarians might have a role to play. Could librarians procure data from scientists rather than the reverse, and then curate that data? Panelist Tony Hey of Microsoft Research whimsically noted that his students only go to libraries to chat and have coffee. He agreed that libraries could play a bigger role in collecting data from an institution, not all of which is published. He believes that the budget might be found as the traditional reliance on paper journals decreases. Much retraining would be needed, however.

David Willetts MP conveyed the same idea by analogy. “When data occupies physical space (e.g. books on shelves), its organisation is an editorial function. When you shift to a virtual space, there’s a question about how that editorial function should be brought about. Who is going to organise this stuff for us? So we need the modern electronic equivalent of shelving books.”

Certain scientific disciplines, such as crystallography, proteomics and genomics, are naturally data intensive, and have led the way in database growth and openness. It’s hard to imagine a free and open resource such as the Protein Data Bank, for example, being set up for drug discovery leads and other IP-sensitive areas of Big Pharma and Biotech. Most were agreed, therefore, that there can never be a one-size-fits-all solution for collecting and sharing scientific data. It can be difficult enough to convince people to share results, without forcing everyone to do it in the same way.

Finally, panelist Terence Kealey of the University of Buckingham related an interesting and apposite historical anecdote. Before the Royal Society was founded 350 years ago, he said, scientists would often publish secretly, getting their work notarised by a lawyer before hiding it away somewhere safe. They would then produce it only after someone else had made a claim, to gain priority. Robert Hooke even used latinised anagrams to publish his data in plain sight, while making it so obscure that only he could read it. The Royal Society changed all this by allowing its members to internally publish data and mutually benefit from each other’s findings. That model of publishing and sharing has now expanded to the whole world.


Read about the Careers strand here, and join in the debate in the Tomorrow’s Giants forum.
