
April 24, 2006

Open Text Mining Interface

Update: My colleague, Ben Lund, has posted about a new version of our OTMI demo here.

Every now and then a scientist contacts Nature asking for a machine-readable copy of our content (i.e., the XML) to use in text-mining research. We're usually happy to oblige, but there has to be a better way for everyone concerned, not least the poor researcher, who might have to contact any number of publishers and deal with many different content formats to conduct their work. Much better, surely, to have a common format in which all publishers can issue their content for text-mining and indexing purposes.

The Open Text Mining Interface (OTMI) is a suggestion from Nature about how we might achieve that. As described in my earlier post, I presented a brief summary of the idea at the Bio IT World conference in Boston. We've since been sharing the idea with some other publishers. This post is intended to provide a few written details and an update.

Our initial demo uses the 23 March issue of Nature (by happy coincidence, a wonderful special issue on the future of scientific computing). Embedded in the HTML of the abstract and full-text file for each article is a tag like this:

<link rel="OTMI" type="application/atom+xml" href="../otmi/otmi-nature04614.xml" />

which points to an OTMI file — a machine-readable representation of the text. (Technically, it's an Atom Entry document with various XML namespace extensions to allow us to include additional information.) As I write this, the example files for our test issue contain the following information:

  • Bibliographic details (of the kind you might also find in our table of contents RSS feeds)
  • Word vectors. That is, a list of all the words that appear in the article and the number of occurrences. (There's also a stop-word list of very common words that have been excluded.) This enables the construction of the most basic types of search index.
  • 'Snippets'. Basically sentences, presented out of order, which allows more sophisticated indexing and text mining (e.g., the kind that looks out for common constructions such as "A binds to B" or "X inhibits Y"), but not, of course, anything that looks across sentence boundaries.
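
For anyone curious how a crawler might pick up that link tag, here is a minimal Python sketch using only the standard library. It is an illustration rather than part of the demo: the class name, the function name and the example page URL are all made up, and the only thing taken from the demo is the rel="OTMI" link itself.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OTMILinkFinder(HTMLParser):
    """Collects the href of every <link rel="OTMI"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> also end up here, because
        # the default handle_startendtag() calls handle_starttag().
        if tag != "link":
            return
        attr = dict(attrs)
        if (attr.get("rel") or "").lower() == "otmi" and attr.get("href"):
            self.hrefs.append(attr["href"])


def find_otmi_links(html, page_url):
    """Return absolute URLs of the OTMI files advertised by an HTML page."""
    finder = OTMILinkFinder()
    finder.feed(html)
    # The href in the demo is relative, so resolve it against the page URL.
    return [urljoin(page_url, href) for href in finder.hrefs]


# Hypothetical page URL and trimmed-down HTML, for illustration only.
page_url = "http://example.org/journal/abs/nature04614.html"
page_html = (
    '<html><head><link rel="OTMI" type="application/atom+xml" '
    'href="../otmi/otmi-nature04614.xml" /></head><body></body></html>'
)
print(find_otmi_links(page_html, page_url))
# prints ['http://example.org/journal/otmi/otmi-nature04614.xml']
```

Matching on the rel value alone keeps the sketch short; a more careful crawler would probably also check the type attribute before fetching anything.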

Note that for both words and sentences (actually quite hard concepts to define in strict computational terms), the algorithms used to tokenize the text are defined in the OTMI file using regular expressions, so anyone, or anything, examining the file can in principle know exactly how the text was processed to create the respective lists. Note also that the word vectors will usually be redundant if you have the sentences, but we include both for the purposes of this demo (and who knows, maybe it's useful to some people if we provide both).
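
To make the word-vector and snippet ideas a little more concrete, here is a rough Python sketch of how such lists could be produced from an article's text. Everything specific in it is an assumption made for illustration: the two regular expressions, the tiny stop-word list and the sample text are not the ones used in our example files.

```python
import random
import re
from collections import Counter

# Assumed tokenization rules, for illustration only. A real OTMI file would
# declare its own regular expressions.
WORD_PATTERN = r"[A-Za-z][A-Za-z0-9'-]*"
SENTENCE_PATTERN = r"[^.!?]+[.!?]"

# A tiny, hypothetical stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}


def word_vector(text, stop_words=STOP_WORDS):
    """Count each word in the text, ignoring case and skipping stop words."""
    words = (w.lower() for w in re.findall(WORD_PATTERN, text))
    return Counter(w for w in words if w not in stop_words)


def snippets(text, seed=None):
    """Split the text into sentences and return them in shuffled order."""
    sentences = [s.strip() for s in re.findall(SENTENCE_PATTERN, text)]
    random.Random(seed).shuffle(sentences)
    return sentences


# Made-up sample text.
text = (
    "Kinase Raf binds to Ras. Aspirin inhibits cyclooxygenase. "
    "Raf is abundant in these cells."
)

vector = word_vector(text)
shuffled = snippets(text, seed=0)

# Because the tokenization rules are known, the word vector can be rebuilt
# from the snippets alone, which is why the two lists are usually redundant.
rebuilt = sum((word_vector(s) for s in shuffled), Counter())
assert rebuilt == vector
```

The final check is the redundancy point in miniature: given the snippets and the published word pattern, a consumer can recompute the counts without ever seeing the sentences in their original order.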

There are still a lot of things that could be improved here. For example:

  1. Allow for text from different sections of an article (e.g., abstract, figure legends) to be labelled as such.
  2. Allow for text to be presented in normal human-readable form for publishers who are willing to provide this.
  3. Add a list of cited articles, providing at least DOIs but perhaps other information too. This would, of course, open up the content to citation analysis.
  4. Add references to the OTMI files from the corresponding RSS feed items (and from the log-in page where content is access-controlled).
  5. Add references to a common stop-word list instead of repeating it in each OTMI file.
  6. Add rights information.
  7. Add references to associated data files and/or database entries.
  8. Provide an actual spec. ;)

We intend to make at least some of these changes (and perhaps others besides) over the coming weeks, so expect the example files to change before your eyes. (A good reason not to try any serious programming against them, BTW.) There's also an even more basic issue around whether an Atom entry document is the right starting point. For example, perhaps an RDF/XML format would be more useful, at least to some people.

The example of RSS shows how powerful a relatively simple common standard can be when it comes to aggregating content from multiple sources (even when it's messed up as badly as RSS ;). So maybe an approach like OTMI (or a better one dreamt up by someone else) can help those who want to index and text-mine scientific and other content. As with RSS, I think publishers might also come to see this as a kind of advert for their content because it should help interested readers to discover it. And on the basis that something is always better than nothing, it also doesn't force publishers to give away the human-readable form of their content: they can limit themselves to snippets or even just word vectors if they want to.

But the only chance we have of turning this into something useful for lots of people is to get plenty of feedback, so please post your comments below or send them to me: t.hannay[AT]nature.com. Thanks.

Update (27/4/06): There is an editorial about this in today's issue of Nature.

April 10, 2006

P2P in science

In response to the 'computing in science' post a couple of weeks ago, Anna Winterbottom asked about distributed computing and peer-to-peer networks, and whether we'd be covering them in this blog. I must admit to being pretty ignorant of these areas, especially in their applications to science, so I invited Anna to send me something that I could post. Here it is:

Controversy continues to surround peer-to-peer (P2P) networks as a result of ongoing court battles over music and film copyright. However, the efficacy of file sharing protocols such as BitTorrent in speeding up transfers of large data sets, like those involved in genome and phenome projects, is evident. In recent years, scientific uses of P2P have progressed from running large programs using distributed computing power, to developing analytical tools and facilitating multi-institutional collaboration.

Think, a collaborative P2P project begun in 2001, runs as a screen saver on thousands of personal computers, testing binding interactions of proteins against small-molecule drug candidates. Grid.Org now coordinates various similar projects.

The proliferation of bioinformatics software often means researchers struggle to keep up-to-date when selecting appropriate tools. Chinook, designed for self-administered P2P communities, uses Java or Perl to unify access to alignment software and facilitate comparisons of programs submitted using XML. Chinook's effectiveness was demonstrated by an assessment of transcription factor binding site discovery programs.

A UN Working Group on science recently highlighted the utility of P2P networks for academic collaboration (PDF, 370K). An attempt to put this into practice is LionShare, a project of the Pennsylvania State University, with collaborators including the open source authentication project, Shibboleth, MIT's Open Knowledge Project, and the P2P Working Group.

LionShare includes personal file servers and networking to support file sharing. To avoid the obvious disadvantage of P2P networks, that file availability depends on the users connected at a given time, 'peer servers' aggregate documents and provide persistent mirrors for files. Advanced search functions are being developed by building on Gnutella's protocol.

The project's most ambitious aim, however, is facilitating collaboration across academic institutions, using Shibboleth and Internet2's EduPerson to overcome traditional administrative barriers. While LionShare is free for any member of a higher education institute to download, unlike most P2P networks, users are required to identify themselves, rendering copyright violations less likely.

A few words about Anna: She is setting up a free website for scientists, www.firstauthor.org, and also does some editing work for WitH Ltd., an Oxford-based science communications company.

April 07, 2006

Science and Creationism

There's a great debate going on at the Nature Newsblog following the publication of the Tiktaalik papers (about fish with limb-like fins) in this week's issue of Nature.

Don't miss the contributions from Nature's own ever-excellent and irrepressible Henry Gee. Great stuff!

A Wiki for Connotea Users

Yesterday we released the first version of the Connotea Community Pages. This is a section of Connotea where users can write and edit pages about anything relating to Connotea, including documentation, usage tips, and feature requests. The Community Pages are a wiki, so any Connotea user can contribute to them, and each user also has their own personal profile page, where they can (if they choose) write some information about themselves.

The idea behind the Community Pages is to allow Connotea users to become more involved in the future direction of the site. We hope the wiki will enable users to collaborate with each other and us to decide on future developments, help them communicate, and let them swap ideas about how people are using Connotea.

We have added a small amount of content to get things started: please take a look and let us know what you think. All feedback would be very welcome, either in the comments or to connotea AT nature DOT com.

As this is my first post, I should introduce myself: I'm Joanna Scott, I work as an analyst in the Web Publishing department at Nature, and you can find out more about me from my Connotea profile.

April 06, 2006

Web 2.0 in Science

Earlier this week it was my privilege to chair a tutorial session at Bio IT World on Web 2.0 in science. The term "Web 2.0" has become burdened in recent months with an unfortunate combination of hype and cynicism. But it is the fate of influential ideas to be misunderstood, and despite their slightly nebulous nature, I find that Web 2.0 themes help a lot in thinking about how to get the most out of our new online environment. Lucky for me, then, that we had a dream-team panel of experts to explain what it's all about.

To kick things off, Tim O'Reilly — Mr. Web 2.0 himself — gave a summary of the concepts, in the process touching on areas in which he sees them applying to science, such as open access. He also talked about his habit of watching the alpha geeks and his astonishing (to me) track record in identifying and articulating emerging technological trends, including the emergence of the web and open-source software.

Next, Jim Ostell of the NCBI (hosts of what must be the biggest bioinformatics hub on earth) described the state of the art in biomedical web applications and how these already include some Web 2.0-like features (e.g., open APIs and the concept of the "productotype", a robust prototype application). He also talked about NCBI Bookshelf and described how online books can go beyond the printed page, for example by using animated figures. Jim rounded off by describing some really nice querying and visualisation tools available to study influenza, from anti-viral compounds in PubChem to viral genome analysis.

Third up was Nature's own Declan Butler. He started off by talking about the relative reluctance of scientists to embrace Web 2.0-like activities such as blogging. Declan is, of course, a keen blogger himself. He believes that blogs could become a useful supplement to more traditional forms of scientific communication such as journals and conferences. (I agree.) The second half of Declan's talk was about mashups, focusing on the amazing work that he's done pulling together data relevant to avian flu and making it available as overlays for Google Earth. His demos not only look fabulous, they have real scientific value and might even end up saving lives. This is scientific techno-geekery at its most sublime. :)

I brought up the rear with some examples of ways in which Web 2.0 thinking has infiltrated Nature. (And its influence on our online activities has really been profound.) First I talked about open APIs, starting with our RSS feeds, which we have enriched with all sorts of metadata to make them a kind of machine-readable catalogue of our content rather than just a simple alerting service. But even enriched RSS can take you only so far, so I also presented a new idea, called OTMI, for exposing information about the contents of an article in a form suitable for indexing and text-mining — more about this in another post. I then went on to talk about 'architectures of participation' — web sites that get better for everyone the more people use them. I described our various blogs and Connotea, our social bookmarking service for clinicians and scientists. I rounded off with a brief description of some forthcoming services: Nature Protocols, which will be a peer-reviewed journal but also a collaborative website, and Nature Network Boston, an online social space for scientists that will be launching in May.

There were loads of good questions during the Q&A that ensued, and from where I sat the room seemed to be packed out, so many thanks to all of those who came along and contributed to what I think was a stimulating session.

I have slides from some of the talks, but a technical problem is preventing me from uploading these now, so I'll do it later and post a comment below when I've done so.

Update (26/4/06): Here are slides from Jim (PPT PDF), Declan (PPT PDF) and me (PPT PDF).

"Nascent Web publishing efforts have their genesis in a burning need to say something, but their ultimate success comes from people wanting to listen, needing to hear each other’s voices, and answering in kind."
Rick Levine
The Cluetrain Manifesto
