Open Text Mining Interface

Update: My colleague, Ben Lund, has posted about a new version of our OTMI demo here.

Every now and then a scientist contacts Nature asking for a machine-readable copy of our content (i.e., the XML) to use in text-mining research. We’re usually happy to oblige, but there has to be a better way for everyone concerned, not least the poor researcher, who might have to contact any number of publishers and deal with many different content formats to conduct their work. Much better, surely, to have a common format in which all publishers can issue their content for text-mining and indexing purposes.

The Open Text Mining Interface (OTMI) is a suggestion from Nature about how we might achieve that. As described in my earlier post, I presented a brief summary of the idea at the Bio IT World conference in Boston. We’ve since been sharing the idea with some other publishers. This post is intended to provide a few written details and an update.

Our initial demo uses the 23 March issue of Nature (by happy coincidence, a wonderful special issue on the future of scientific computing). Embedded in the HTML of the abstract and full-text file for each article is a tag like this:

<link rel="OTMI" type="application/atom+xml" href="../otmi/otmi-nature04614.xml" />

which points to an OTMI file — a machine-readable representation of the text. (Technically, it’s an Atom Entry document with various XML namespace extensions to allow us to include additional information.) As I write this, the example files for our test issue contain the following information:

  • Bibliographic details (of the kind you might also find in our table of contents RSS feeds)
  • Word vectors. That is, a list of all the words that appear in the article and the number of occurrences. (There’s also a stop-word list of very common words that have been excluded.) This enables the construction of the most basic types of search index.
  • ‘Snippets’. Basically sentences, presented out of order, which allows more sophisticated indexing and text mining (e.g., the kind that looks out for common constructions such as “A binds to B” or “X inhibits Y”), but not, of course, anything that looks across sentence boundaries.
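Before a program can read any of this, it has to find the OTMI file from the article page. Here is a minimal sketch in Python of how a client might do that, using only the standard library. The class and function names are my own invention for illustration, the example.org page URL is hypothetical, and the case-insensitive match on the rel value is an assumption rather than anything OTMI defines.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OTMILinkFinder(HTMLParser):
    """Collect the href of every <link rel="OTMI"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser also routes self-closing tags (<link ... />) here.
        if tag != "link":
            return
        attr = dict(attrs)
        # Assumption: treat the rel value case-insensitively.
        if (attr.get("rel") or "").upper() == "OTMI" and attr.get("href"):
            self.hrefs.append(attr["href"])


def find_otmi_links(html, page_url):
    """Return absolute URLs of the OTMI files advertised by a page."""
    finder = OTMILinkFinder()
    finder.feed(html)
    # The href in the demo is relative, so resolve it against the page URL.
    return [urljoin(page_url, href) for href in finder.hrefs]


if __name__ == "__main__":
    page = (
        '<html><head><link rel="OTMI" type="application/atom+xml" '
        'href="../otmi/otmi-nature04614.xml" /></head></html>'
    )
    # Hypothetical page URL, used only to show relative-link resolution.
    print(find_otmi_links(page, "http://example.org/journal/abs/nature04614.html"))
```

Resolving the relative href against the page URL matters because the tag in our demo points one directory up from the article page.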

Note that for both words and sentences (actually quite hard concepts to define in strict computational terms) the algorithms used to tokenize the text are defined in the OTMI file using regular expressions, so anyone, or anything, examining the file can in principle know exactly how the text was processed to create the respective lists. Note also that the word vectors will usually be redundant if you have the sentences, but we include both for the purposes of this demo (and who knows, maybe it’s useful to some people if we provide both).
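To make the idea concrete, here is a rough Python sketch of how word vectors and out-of-order snippets could be produced with regular expressions. The two patterns and the tiny stop-word list are placeholders I made up for illustration; they are not the ones in our example files, and a real consumer would read the patterns from the OTMI file itself. The sentence pattern in particular is crude and would stumble on abbreviations and decimal numbers.

```python
import random
import re
from collections import Counter

# Assumed patterns and stop words, for illustration only.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")
SENTENCE_RE = re.compile(r"[^.!?]+[.!?]")
STOP_WORDS = {"a", "the", "to", "of", "and", "in"}


def snippets(text, seed=None):
    """Split text into sentences and return them in shuffled order."""
    sentences = [s.strip() for s in SENTENCE_RE.findall(text) if s.strip()]
    random.Random(seed).shuffle(sentences)
    return sentences


def word_vector(texts):
    """Count lowercased words across the given texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        for word in WORD_RE.findall(text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts


if __name__ == "__main__":
    article = "A binds to B. X inhibits Y. A binds to C."
    shuffled = snippets(article, seed=1)
    print(shuffled)
    print(word_vector([article]))
    # The word vector can be rebuilt from the snippets alone.
    print(word_vector(shuffled) == word_vector([article]))
```

The last line shows why the word vectors are usually redundant: counting words over the shuffled snippets gives the same result as counting over the original text, provided the sentence pattern does not drop any of it.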

There are still a lot of things that could be improved here. For example:

  1. Allow for text from different sections of an article (e.g., abstract, figure legends) to be labelled as such.
  2. Allow for text to be presented in normal human-readable form for publishers who are willing to provide this.
  3. Add a list of cited articles, providing at least DOIs but perhaps other information too. This would, of course, open up the content to citation analysis.
  4. Add references to the OTMI files from the corresponding RSS feed items (and from the log-in page where content is access-controlled).
  5. Add references to a common stop-word list instead of repeating it in each OTMI file.
  6. Add rights information.
  7. Add references to associated data files and/or database entries.
  8. Provide an actual spec. 😉

We intend to make at least some of these changes (and perhaps others besides) over the coming weeks, so expect the example files to change before your eyes. (A good reason not to try any serious programming against them, BTW.) There’s also an even more basic issue around whether an Atom entry document is the right starting point. For example, perhaps an RDF/XML format would be more useful, at least to some people.

The example of RSS shows how powerful a relatively simple common standard can be when it comes to aggregating content from multiple sources (even when it’s messed up as badly as RSS ;). So maybe an approach like OTMI (or a better one dreamt up by someone else) can help those who want to index and text-mine scientific and other content. As with RSS, I think publishers might also come to see this as a kind of advert for their content because it should help interested readers to discover it. And on the basis that something is always better than nothing, it also doesn’t force publishers to give away the human-readable form of their content: they can limit themselves to snippets or even just word vectors if they want to.

But the only chance we have of turning this into something useful for lots of people is to get plenty of feedback, so please post your comments below or send them to me: t.hannay[AT] Thanks.

Update (27/4/06): There is an editorial about this in today’s issue of Nature.

