
February 28, 2007

Open Text Mining Interface - Update

We've posted here before (here and here) about OTMI, the Open Text Mining Interface specification that we are proposing as a means to disclose scholarly full text for text analysis purposes. This post details some recent updates.

Contact email

First things first. We've set up a dedicated mail address for all OTMI queries and feedback: otmi@nature.com. Please feel free to make use of this contact address for any issues you may want to raise about OTMI.

Wiki

We've also created an online forum, a wiki (powered by MediaWiki for those who may be interested), for collecting resources relating to OTMI and for hosting discussions and enabling user collaborations. The wiki can be found at http://opentextmining.org/. Again, do not hesitate to join in the discussion by adding to the existing pages or by creating new topics about OTMI of your own choosing.

Currently the wiki hosts the following resource categories:

  • Repository
  • Information
  • Resources
  • Web Resources

Repository

Which leads to the real meat of this post: the OTMI Repository. We have now made available in OTMI format two years' worth of full text (or otherwise everything published) for five of our titles:

Directories of available OTMI files are provided using OPML (or Outline Processor Markup Language) files for easy navigation:

OTMI files are available at both the issue and article levels:

Feedback

Some early feedback has included:

  • Tarballs for issue downloads
  • MD5 digests
  • Review regex

We originally posted just the loose OTMI files, one per article, but were asked if we could make tarballs ("*.tar.gz") of all the OTMI files available for download. That made a lot of sense, so we added that option. And now, as detailed above, both download options are available. We were also asked if we could provide MD5 digests of the tarballs. We've added those too.
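For anyone downloading the tarballs, a minimal sketch of checking a download against its published MD5 digest might look like the following. The member file names (`article-1.otmi` and so on) and the in-memory demo archive are assumptions made for illustration; they do not reflect the actual layout of the repository.

```python
import hashlib
import io
import tarfile


def md5_hexdigest(data: bytes) -> str:
    """Return the MD5 digest of the given bytes as a lowercase hex string."""
    return hashlib.md5(data).hexdigest()


def verify_tarball(data: bytes, expected_md5: str) -> bool:
    """Compare the digest of the downloaded bytes with the published digest."""
    return md5_hexdigest(data) == expected_md5.strip().lower()


def list_members(data: bytes) -> list[str]:
    """List the file names inside a gzip-compressed tarball held in memory."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as archive:
        return sorted(archive.getnames())


def build_demo_tarball() -> bytes:
    """Build a tiny tarball standing in for a downloaded issue (hypothetical names)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name in ("article-1.otmi", "article-2.otmi"):
            payload = b"<placeholder/>"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


if __name__ == "__main__":
    tarball = build_demo_tarball()
    published = md5_hexdigest(tarball)  # in practice, read from the posted digest
    print(verify_tarball(tarball, published), list_members(tarball))
```

In real use the bytes would come from the downloaded "*.tar.gz" file and the expected value from the digest posted alongside it.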

Another specific bit of feedback concerns the regex we use to split the full text into sentences, or "snippets" as we have called them.

"OK, so you seem to split in \.\s+(?=[A-Z])
Maybe you can substitute \.\s+(?=[A-Z]) with \.\.\s+(?=[A-Z]) before the
split?

And although other punctuation marks are not common in scientific text, maybe you can be more generic by using something like [\.\?\;]\s+[A-Z0-9]"

Well, we'll certainly look into that and we invite others to comment. Maybe something for the wiki?
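To help that discussion along, here is a small sketch comparing the two patterns. The first is the pattern quoted in the feedback. The second is one possible reading of the more generic suggestion, rewritten with a lookbehind and a lookahead so that neither the punctuation nor the first character of the next snippet is consumed by the split. That rewrite is our assumption, not something the correspondent proposed, and the function name `split_snippets` is purely illustrative.

```python
import re

# Pattern quoted in the feedback: split on a full stop plus whitespace
# when the next character is an upper-case letter.
ORIGINAL = re.compile(r"\.\s+(?=[A-Z])")

# Assumed variant of the generic suggestion: split on whitespace that follows
# '.', '?' or ';' and precedes an upper-case letter or a digit.
GENERIC = re.compile(r"(?<=[.?;])\s+(?=[A-Z0-9])")


def split_snippets(text: str, pattern: re.Pattern) -> list[str]:
    """Split text into snippets with the given pattern, dropping empty pieces."""
    return [piece for piece in pattern.split(text) if piece]


if __name__ == "__main__":
    sample = "We measured binding. Was it stable? 3 replicates agreed; Results follow."
    for name, pattern in (("original", ORIGINAL), ("generic", GENERIC)):
        print(name, split_snippets(sample, pattern))
```

Note that the original pattern drops the full stop at each split point, while the lookbehind variant keeps it. Both will still split wrongly after abbreviations, and allowing digits means text such as "Fig. 2" is split as well, so the choice is a trade-off rather than a clear win.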

We look forward to further feedback and to new and developing discussions on the wiki. And feel free to create new pages on the wiki. Perhaps even a page for advocacy?

February 21, 2007

Nature Network v2 is live

We've relaunched Nature Network, continuing the focus on Boston but enabling other cities too; London is next. It builds on the concept of social software for scientists and extends it to a global audience: anyone can sign up and take part in the conversations.

We've redesigned the application visually and added message board functionality, which is tightly integrated with the existing groups. On the site you can post local events and see local job listings, and each city has a local news magazine.

Nature Network draws on the best ideas in community software on the web. We are planning to continue iterating the functionality, but if you have particular ideas, then tell us on the forums. Some novel aspects of the site present themselves in tagging and in forums.

Tagging acts as bookmarking or favouriting internally: anyone can tag pretty much anything, and multiple people can tag the same thing with the same word. In this way users can tag content on Nature Network to make a meaningful library of useful information for themselves. Tagging is present across every content type and acts as internal navigation between content types. For example, you can easily see which people are interested in Psychology when you are reading message board topics on Psychology.

Every topic on the forums is taggable, so the tags become a core navigation device. They can be used to subdivide a forum into a particular facet, or you can draw on a tag across multiple forums. The former is helpful in busy forums where multiple subjects are addressed at the same time, e.g., a renewables forum might have wind and solar topics. The latter is useful for broader issues that might work as a forum on their own, ethics or a particular technique perhaps.
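The two navigation modes described above can be sketched with a small in-memory index. This is only an illustration of the idea, not how Nature Network is implemented; the names `Item`, `TagIndex`, `items` and `taggers` are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Item:
    """Any taggable piece of content (hypothetical model)."""
    kind: str                    # e.g. "topic" or "person"
    name: str
    forum: Optional[str] = None  # set only for forum topics


class TagIndex:
    """Maps each tag to the (user, item) pairs that carry it."""

    def __init__(self) -> None:
        self._by_tag = defaultdict(set)

    def tag(self, user: str, item: Item, tag: str) -> None:
        # Several users may apply the same tag to the same item.
        self._by_tag[tag.lower()].add((user, item))

    def items(self, tag: str, forum: Optional[str] = None) -> list:
        """Items with this tag, optionally restricted to a single forum."""
        found = {item for _, item in self._by_tag[tag.lower()]}
        if forum is not None:
            found = {item for item in found if item.forum == forum}
        return sorted(found, key=lambda item: (item.kind, item.name))

    def taggers(self, tag: str) -> list:
        """Users who have applied this tag to anything."""
        return sorted({user for user, _ in self._by_tag[tag.lower()]})


if __name__ == "__main__":
    index = TagIndex()
    index.tag("alice", Item("topic", "Turbine siting", "Renewables"), "wind")
    index.tag("bob", Item("topic", "Consent forms", "Psychology"), "ethics")
    index.tag("alice", Item("topic", "Land use", "Renewables"), "ethics")
    print(index.items("ethics"))                      # a tag drawn across forums
    print(index.items("ethics", forum="Renewables"))  # a facet within one forum
```

Passing a forum narrows a tag to a facet of that forum, while omitting it draws on the tag across every forum.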

The forums are very community focused. Anyone can reply to a topic, and asking a question joins you to the group that hosts the forum, thus drawing people into visible communities rather than anonymous postings.

Lastly, Nature Network is a social network, so everyone gets a public profile page, which they can keep from job to job. They can create their own network of friends, contacts and interesting people they meet. Activity on the site is relayed within this network, so it is easy to keep up to date with colleagues. All of this networking is under the user's control via privacy settings.

We hope that Nature Network will be a great place for scientists to meet one another and talk about science. It supports both those whose work is focused on a particular area and those whose work is interdisciplinary, allowing them to converse on a unique platform for scientists. Have a look and sign up.

February 19, 2007

Extending Connotea with scripts and mashups

We released an API for Connotea, our scientific bookmarking service, back in May of last year. It's your data, so it seemed fitting that we provide a way for you to get at it programmatically if you so desire. It's been cool to see people build proper applications on top of the functionality that the API provides (Robert Muetzelfeldt's MultiGuise is one such app).

People have also been using the API on a smaller scale to extend Connotea's functionality, often in conjunction with the Greasemonkey extension for Firefox (if you run Firefox and don't have Greasemonkey installed, I urge you to go and download it now).

Here are a few recent examples of scripts and plugins that let you do more with Connotea. Some of them were written by us and some were contributed by users (see the 'Connotea Tools' wiki page for more details):

Tools that anybody can use

Greasemonkey scripts

Perl / Ruby scripts

February 15, 2007

Tony Hey visits Nature

Last week we were extremely lucky to be visited by Tony Hey, VP for Technical Computing at Microsoft. Tony has previously been a physicist, a computer scientist, Dean of Engineering at the University of Southampton, and director of the UK's e-Science Initiative. He's one of the most interesting commentators on the impact of information technology on scientific research, a subject close to our hearts. Here are my rough notes from his talk to Nature staff.

"E-science" is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.

Different paradigms in science:

  • Experimental science
  • Theoretical science
  • Computational science
  • E-science: collaborative networked science

An e-science example: Project Neptune is a proposal to put a network of sensors on the seabed to monitor geological and biological activity. It includes remote-controlled sensors.

Key elements of e-science include large data sets, distributed computation and interoperability.

Science publishing is also undergoing change: We can now search and obtain visual overviews of the literature. 'Live documents' that are continually updated (with RSS feeds to alert readers to updates). New measures of reputation and influence. Peer review is also going to be different in the future, perhaps including Amazon-like voting and social-networking approaches.

An e-science example in astronomy: Astronomers divide themselves up by wavelength(!). But you need to combine information from all wavelengths to understand a single patch of sky. SkyServer.SDSS.org was built to combine data from 20 separate observatories. It also links to the relevant literature. The UK version is AstroGrid, which uses a wiki and is part of the UK e-Science Initiative.

An e-science example in chemistry: The CombeChem project creates many chemical combinations and allows analysis of structures (e.g., against known structures). This can all be controlled remotely.

Digital lab books (also used by chemists) allow proper archiving and sharing between researchers, for example using Tablet PCs. Researchers can also receive messages on their PDAs to allow remote monitoring of experiments and more efficient use of their time.

Publishing in chemistry: It is now possible to create electronic versions of papers (e-prints) that are linked to the crystallographic database and are published locally (e.g., in institutional repositories). The eBank Project makes grey literature part of the overall digital library, and also ties into virtual education.

Key issues for e-science:

  • Data life cycle
  • Scholarly communication

Data Life Cycle

Currently there are multiple heterogeneous stages between acquisition and preservation (some using MS software, others not). CombeChem is an interesting case study:

  • End-to-end linking of data to information, and publishing at source.
  • Collecting data (and metadata) with regard to how it could eventually be used. Metadata needs special care.
  • In the chemistry lab, people and machines work together.

How do we get scientists to record data and metadata while they do experiments? We also need to validate data and capture provenance.

Scholarly Communication

There's a revolution going on in scholarly communication. Probably the most at risk are the scholarly societies.

Science blogs will be an increasingly interesting vehicle. For example, you can put your lab notebook on a blog. This encourages and facilitates collaboration, and allows validation. For example, the Useful Chemistry blog records experiments, including ones that didn't work.

Also OpenWetWare, a wiki, captures how experiments are done and shares this information with others, though it's quite anarchic.

There's also the concept of publications as live documents: Click on a figure to get the underlying data, run a simulation, etc. In some areas databases are replacing (paper) publications as a medium of communication. These often take a great deal of effort to maintain: UniProt has 140 curators. How should they be paid for? Exporting and sharing information is becoming easier through use of open XML standards, including in MS Office.

Open Access: There is an argument that taxpayers should be able to see the results of the research they fund. There have been declarations, policies and plans from the OECD and NIH, among others. Due to journal price rises, Tony's library at Southampton could not afford to subscribe to all the journals where members of the department published. Physicists used to share preprints by post. Now they do so electronically through arXiv.org.

Open access [via self-archiving] is coming: There is proposed legislation in the US, and an EU petition attracted 19k signatories. Increasingly, access to the latest information is through the web and preprints, not the journals.

Three prophets of open access:

Institutional repositories will take off: The Google Scholar rank for all institutions puts Harvard top in the world (not surprising) but Southampton top in the UK (surprising). Oxford and Cambridge will not be able to tolerate this relative lack of visibility. There are over 1400 repositories worldwide. EPrints and DSpace are the most widely used platforms. We won't get the same software used everywhere. And repositories will contain grey literature, data sets, etc. as well as traditional papers. We also need more than OAI-PMH to enable interoperability, and now we have OAI-ORE too to enable sharing and reuse.
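As an aside to these notes, OAI-PMH harvesting works through plain HTTP requests that carry a `verb` argument, which is part of why it is so widely supported by repository software. A minimal sketch of building such requests follows. The base URL is a made-up placeholder and the function names are hypothetical; only the `ListRecords` verb and the `metadataPrefix`, `set`, `from` and `resumptionToken` arguments come from the protocol itself.

```python
from typing import Optional
from urllib.parse import urlencode

# Placeholder endpoint for illustration only; not a real repository.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    set_spec: Optional[str] = None,
    from_date: Optional[str] = None,
) -> str:
    """Build the first ListRecords request for a harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


def resume_url(base_url: str, resumption_token: str) -> str:
    """Build a follow-up request; the token replaces all other arguments."""
    params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    print(list_records_url(BASE_URL, from_date="2007-01-01"))
    print(resume_url(BASE_URL, "token123"))
```

A harvester would fetch the first URL, read the records from the XML response, and keep requesting with the returned resumption token until none is supplied.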

Other areas:

  • Search: Grokker provides useful categorised search.
  • New forms of peer review: For example, Faculty of 1000 from BioMed Central.
  • Sharing: Connotea.
  • Library 2.0: Need to make static catalogues active and 'mashable'.
  • Preservation: "Digital information lasts for ever, or for 5 years, whichever comes first." - Jeff Rothenberg. The new version of MS Office uses Office Open XML (OOXML), which is an open standard not controlled by MS but by a standards body.

Technical Computing at MS is involved in:

  • Advanced computing: New algorithms and tools (now).
  • High-productivity computing: Clusters, databases (0-5 years).
  • Radical computing: Breakthrough technologies, e.g., parallel programming (5-10 years).

"Nascent Web publishing efforts have their genesis in a burning need to say something, but their ultimate success comes from people wanting to listen, needing to hear each other’s voices, and answering in kind."
Rick Levine
The Cluetrain Manifesto
