Nascent

Tony Hey visits Nature

Last week we were extremely lucky to be visited by Tony Hey, VP for Technical Computing at Microsoft. Tony has previously been a physicist, a computer scientist, Dean of Engineering at the University of Southampton, and director of the UK’s e-Science Initiative. He’s one of the most interesting commentators on the impact of information technology on scientific research, a subject close to our hearts. Here are my rough notes from his talk to Nature staff.


“E-science” is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.

Different paradigms in science:

  • Experimental science
  • Theoretical science
  • Computational science
  • E-science: collaborative networked science

An e-science example: Project Neptune is a proposal to put a network of remote-controlled sensors on the seabed to monitor geological and biological activity.

Key elements of e-science include large data sets, distributed computation and interoperability.

Science publishing is also undergoing change: we can now search the literature and obtain visual overviews of it; ‘live documents’ are continually updated (with RSS feeds to alert readers to updates); and there are new measures of reputation and influence. Peer review is also going to be different in the future, perhaps including Amazon-like voting and social-networking approaches.
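
Since the notes mention RSS feeds alerting readers when a ‘live document’ changes, here is a minimal sketch of how such an alert might be polled. It uses the third-party feedparser library; the feed URL is a placeholder of my own, not a real Nature feed.

```python
# Minimal sketch: poll an RSS feed for updates to a 'live document'.
# Requires the third-party 'feedparser' package; the URL is a placeholder.
import feedparser

FEED_URL = "https://example.org/live-document/updates.rss"  # hypothetical feed

def latest_updates(feed_url: str, limit: int = 5):
    """Yield (published, title, link) for the most recent feed entries."""
    feed = feedparser.parse(feed_url)
    for entry in feed.entries[:limit]:
        yield entry.get("published", "n/a"), entry.get("title"), entry.get("link")

if __name__ == "__main__":
    for published, title, link in latest_updates(FEED_URL):
        print(f"{published}  {title}\n  {link}")
```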

An e-science example in astronomy: Astronomers divide themselves up by wavelength(!). But you need to combine information from all wavelengths to understand a single patch of sky. SkyServer.SDSS.org was built to combine data from 20 separate observatories. It also links to the relevant literature. The UK version is AstroGrid, which uses a wiki and is part of the UK e-Science Initiative.
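
To give a concrete sense of what querying a federated sky survey looks like, here is a hedged sketch of running SQL against SDSS SkyServer over HTTP. The endpoint URL and request parameters are assumptions (the exact search URL varies by data release, so check SkyServer.SDSS.org before running); the PhotoObj columns follow the published SDSS schema.

```python
# Sketch only: select a small patch of sky from SDSS SkyServer via SQL search.
# The endpoint below is an assumption -- consult SkyServer.SDSS.org for the
# current data release's search URL.
import requests

SKYSERVER_SQL_URL = "https://skyserver.sdss.org/dr16/SkyServerWS/SearchTools/SqlSearch"  # assumed

QUERY = """
SELECT TOP 10 objID, ra, dec, u, g, r, i, z
FROM PhotoObj
WHERE ra BETWEEN 180.0 AND 180.1
  AND dec BETWEEN 0.0 AND 0.1
"""

def run_query(sql: str) -> str:
    """Send a SQL query to SkyServer and return the CSV response as text."""
    resp = requests.get(SKYSERVER_SQL_URL,
                        params={"cmd": sql, "format": "csv"}, timeout=30)
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    print(run_query(QUERY))
```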

An e-science example in chemistry: The CombeChem project creates many chemical combinations and allows analysis of structures (e.g., against known structures). This can all be controlled remotely.

Digital lab books (also used by chemists) allow proper archiving and sharing between researchers, for example using Tablet PCs. Researchers can also receive messages on their PDAs to allow remote monitoring of experiments and more efficient use of their time.

Publishing in chemistry: It is now possible to create electronic versions of papers (e-prints) that are linked to the crystallographic database and are published locally (e.g., in institutional repositories). The eBank Project makes grey literature part of the overall digital library, and also ties into virtual education.

Key issues for e-science:

  • Data life cycle
  • Scholarly communication

Data Life Cycle

Currently there are multiple heterogeneous stages between acquisition and preservation (some using MS software, others not). CombeChem is an interesting case study:

  • End-to-end linking of data to information, and publishing at source.
  • Collecting data (and metadata) with regard to how it could eventually be used. Metadata needs special care.
  • In the chemistry lab, people and machines work together.

How do we get scientists to record data and metadata while they do experiments? We also need to validate data and capture provenance.
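
As an illustration of capturing data and metadata together at the point of measurement, here is a minimal sketch of a provenance-aware record. The field names and JSON-lines layout are my own assumptions, not the schema used by CombeChem or any real lab-book system.

```python
# Minimal sketch of recording a measurement together with its metadata and
# provenance at capture time. Field names are illustrative assumptions only.
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone

@dataclass
class MeasurementRecord:
    experiment_id: str
    operator: str
    instrument: str
    value: float
    units: str
    # Provenance: when and under what protocol the value was obtained.
    captured_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())
    protocol: str = "unspecified"

def record_measurement(rec: MeasurementRecord, path: str) -> None:
    """Append the record as one JSON line, so data and metadata travel together."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(rec)) + "\n")

if __name__ == "__main__":
    record_measurement(
        MeasurementRecord(
            experiment_id="exp-042", operator="a.researcher",
            instrument="uv-vis-01", value=0.532, units="absorbance",
            protocol="standard-dilution-v2",
        ),
        "lab_notebook.jsonl",
    )
```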

Scholarly Communication

There’s a revolution going on in scholarly communication. Probably the most at risk are the scholarly societies.

Science blogs will be an increasingly interesting vehicle. For example, you can put your lab notebook on a blog. This encourages and facilitates collaboration, and allows validation. For example, the Useful Chemistry blog records experiments, including ones that didn’t work.

There’s also OpenWetWare, a wiki that captures how experiments are done and shares this information with others, though it’s quite anarchic.

There’s also the concept of publications as live documents: click on a figure to get the underlying data, run a simulation, etc. In some areas databases are replacing (paper) publications as the medium of communication. These often take a great deal of effort to maintain: UniProt has 140 curators. How should they be paid for? Exporting and sharing information is becoming easier through the use of open XML standards, including in MS Office.
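
Since the notes point to open XML standards as the route to easier exporting and sharing, here is a minimal sketch of serialising a record to XML with Python’s standard library; the element names are illustrative and not taken from any particular standard.

```python
# Minimal sketch: export a flat data record as XML using only the standard
# library. Element names are illustrative, not from any particular schema.
import xml.etree.ElementTree as ET

def record_to_xml(record: dict, root_tag: str = "record") -> str:
    """Serialise a flat dict to a small XML document string."""
    root = ET.Element(root_tag)
    for key, value in record.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    print(record_to_xml({"compound": "caffeine", "melting_point_c": 238}))
```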

Open Access: There is an argument that taxpayers should be able to see the results of the research they fund. There have been declarations, policies and plans from the OECD and NIH, among others. Due to journal price rises, Tony’s library at Southampton could not afford to subscribe to all the journals where members of the department published. Physicists used to share preprints by post. Now they do so electronically through arXiv.org.

Open access [via self-archiving] is coming: there is proposed legislation in the US, and an EU petition has attracted 19,000 signatories. Increasingly, access to the latest information is through the web and preprints, not the journals.

Three prophets of open access:

Institutional repositories will take off: The Google Scholar rank for all institutions puts Harvard top in the world (not surprising) but Southampton top in the UK (surprising). Oxford and Cambridge will not be able to tolerate this relative lack of visibility. There are over 1,400 repositories worldwide; EPrints and DSpace are the most widely used platforms, and we won’t get the same software used everywhere. Repositories will contain grey literature, data sets, etc., as well as traditional papers. We also need more than OAI-PMH to enable interoperability; OAI-ORE now complements it to enable sharing and reuse.
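
For readers unfamiliar with OAI-PMH, here is a hedged sketch of harvesting Dublin Core records from a repository (both EPrints and DSpace expose this interface). The base URL is a placeholder; the verb, metadataPrefix and namespaces are standard OAI-PMH 2.0.

```python
# Sketch: harvest the first page of Dublin Core records from a repository's
# OAI-PMH interface. The base URL is a placeholder for a real endpoint.
import requests
import xml.etree.ElementTree as ET

OAI_BASE = "https://eprints.example.ac.uk/cgi/oai2"  # placeholder endpoint

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_titles(base_url: str):
    """Yield the dc:title of each record in the first ListRecords response."""
    resp = requests.get(base_url,
                        params={"verb": "ListRecords", "metadataPrefix": "oai_dc"},
                        timeout=30)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    for record in root.findall(".//oai:record", NS):
        title = record.find(".//dc:title", NS)
        if title is not None:
            yield title.text

if __name__ == "__main__":
    for t in list_titles(OAI_BASE):
        print(t)
```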

Other areas:

  • Search: Grokker provides useful categorised search.
  • New forms of peer review: For example, Faculty of 1000 from BioMed Central.
  • Sharing: Connotea.
  • Library 2.0: Need to make static catalogues active and ‘mashable’.
  • Preservation: “Digital information lasts for ever, or for 5 years, whichever comes first.” – Jeff Rothenberg. The new version of MS Office uses Office Open XML (OOXML), which is an open standard not controlled by MS but by a standards body.

Technical Computing at MS is involved in:

  • Advanced computing: New algorithms and tools (now).
  • High-productivity computing: Clusters, databases (0-5 years).
  • Radical computing: Breakthrough technologies, e.g., parallel programming (5-10 years).
