Open Text Mining Interface version 0.2

A few weeks ago we floated OTMI as a suggested way of opening up subscriber-only articles to text mining research.

We received a lot of very useful feedback on that initial proposal, so we have updated the demo to incorporate some of your suggested changes and additions.

As before, each of the articles in the 23rd March 2006 issue of Nature on the Future of Scientific Computing has some code embedded in the HTML:

<link rel="OTMI" type="application/atom+xml" href="../otmi/otmi-nature04614.xml" />
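For anyone wanting to pick these links up automatically, here is a minimal sketch of how a client might find the OTMI file for an article page. It uses only the Python standard library. The names OTMILinkFinder and find_otmi_links, and the example.org address in the usage note, are made up for illustration, and the sketch assumes the attribute values in the page are written with ordinary straight quotes.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OTMILinkFinder(HTMLParser):
    """Collect the href of every <link rel="OTMI"> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> tags are routed through here as well.
        if tag != "link":
            return
        attr = dict(attrs)
        if (attr.get("rel") or "").lower() == "otmi" and attr.get("href"):
            self.hrefs.append(attr["href"])


def find_otmi_links(html, base_url):
    """Return absolute URLs of the OTMI files advertised in an HTML page."""
    finder = OTMILinkFinder()
    finder.feed(html)
    # The href is relative to the article page, so resolve it against that page.
    return [urljoin(base_url, href) for href in finder.hrefs]
```

Calling find_otmi_links(page_html, "http://example.org/journal/full/nature04614.html") on a page containing the link above would resolve the relative href against the page address.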

Those OTMI files have now been updated. In particular, we’ve made the following changes and additions:

Modified the method for extracting ‘snippets’

Snippets are now a slightly better approximation to sentences, but they’re still not perfect. The regular expression used to split the text into snippets is given in the OTMI file, as before.
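To show why regular-expression splitting only approximates sentences, here is a small sketch. The pattern below is not the one used in the OTMI files (that one is given in each file); it is a deliberately simple stand-in that breaks after a full stop, exclamation mark or question mark followed by whitespace. SNIPPET_BOUNDARY and split_snippets are hypothetical names.

```python
import re

# Assumed, illustrative pattern only: split on whitespace that follows ., ! or ?
SNIPPET_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_snippets(text, pattern=SNIPPET_BOUNDARY):
    """Split text into sentence-like snippets using a boundary pattern."""
    return [piece for piece in pattern.split(text.strip()) if piece]


if __name__ == "__main__":
    sample = "Computing is changing science. Will it continue? See Fig. 2 for details."
    for snippet in split_snippets(sample):
        print(repr(snippet))
```

With this stand-in pattern the abbreviation "Fig." is treated as the end of a sentence, so the last sentence comes out as two snippets. That is the kind of imperfection any such pattern will have to some degree.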

Split the text into sections

OTMI files can now label snippets, word vectors, and the other text formats we’ve added by the section of the article that they came from. Currently this labeling uses a controlled list of names: abstract, standfirst, body, firstpara, methods, and other. These labels can be nested, so a snippet can, for example, be in both the methods section and the body. Most of the current example files have just one section, which is labeled both body and other.

Added raw text and reduced text options

As well as offering the text digested as word vectors or snippets, OTMI files now include two extra text formats. The first, raw text, is the full text of an article or section with all mark-up removed. The second, reduced text, is the raw text with all stop words removed. There is obviously some redundancy between these different digests, but we are offering all of them to increase the range of options for publishers. In our example files, the abstract, standfirst and first paragraph sections contain all four forms, and other sections give only the word vector and snippet forms.
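The relationship between these digests can be sketched in a few lines. This is an illustration under stated assumptions, not the code used to build the example files: it assumes a word vector is simply a table of word counts, it uses a toy stoplist, and the function names are hypothetical.

```python
import re
from collections import Counter

# Toy stoplist for illustration only; the real one is pointed to from each OTMI file.
TOY_STOPLIST = {"the", "of", "and", "a", "is", "in", "to"}


def reduced_text(raw_text, stoplist=TOY_STOPLIST):
    """Return raw_text with stop words removed, keeping word order and punctuation."""
    kept = [
        word
        for word in raw_text.split()
        if word.strip(".,;:!?()\"'").lower() not in stoplist
    ]
    return " ".join(kept)


def word_vector(raw_text):
    """Count lower-cased words; assumes a word vector is a plain word-count table."""
    return Counter(re.findall(r"[a-z0-9']+", raw_text.lower()))


if __name__ == "__main__":
    raw = "The future of scientific computing is open."
    print(reduced_text(raw))
    print(word_vector(raw).most_common(3))
```

The reduced text keeps word order but loses the stop words, while the word vector discards order altogether, which is why a publisher might be comfortable releasing one form but not another.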

Added information about figures

In addition to sections, the new OTMI files contain a list of the titles and captions of all the figures in an article. In our example files these are given as raw text, but any of the other text forms can be used too.

Included references

Each OTMI file now has a list of DOIs for the article’s references. There’s also an ‘and n others’ element to account for references that don’t have a DOI.

Separated stoplist

The stoplist of common words is no longer included in every file, but is instead pointed to from each file. The stoplist is still somewhat limited, and is there mostly to illustrate the concept.

Added rights element

Each article now has basic copyright information given in the Atom entry section of the OTMI file.

Thanks go to everyone who made contributions that led to these enhancements. Our plan now is to look for further feedback on this latest suggestion, and then move on to formalising a specification and rolling out the demonstration more widely.

Please have a look at the latest version and then leave feedback, either in the comments here or via b dot lund at nature dot com.


