Open Text Mining Interface - Update

We've posted before about OTMI, the Open Text Mining Interface specification that we are proposing as a means of disclosing scholarly full text for text-analysis purposes. This post details some recent updates.

Contact email

First things first. We've set up a dedicated mail address for all OTMI queries and feedback: otmi@nature.com. Please feel free to make use of this contact address for any issues you may want to raise about OTMI.

Wiki

We've also set up a wiki (powered by MediaWiki, for those who may be interested) as a place to collect resources relating to OTMI, host discussions and enable collaboration between users. The wiki can be found at http://opentextmining.org/. Again, do not hesitate to join in the discussion by adding to the existing pages or by creating new OTMI topics of your own choosing.

Currently the wiki hosts the following resource categories:

  • Repository
  • Information
  • Resources
  • Web Resources

Repository

Which leads to the real meat of this post - the OTMI Repository. We have now made available in OTMI format two years' worth of full text (or, where a title has published for less than two years, everything it has published) for five of our titles:

Directories of available OTMI files are provided using OPML (or Outline Processor Markup Language) files for easy navigation:

OTMI files are available at both the issue and article levels:

Feedback

Some early feedback has included:

  • Tarballs for issue downloads
  • MD5 digests
  • Review regex

We originally posted just the loose OTMI files, one per article, but were asked if we could make tarballs ("*.tar.gz") of all the OTMI files available for download. That made a lot of sense, so we added that option. And now, as detailed above, both download options are available. We were also asked if we could provide MD5 digests of the tarballs. We've added those too.
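For anyone scripting their downloads, checking a tarball against its published digest takes only a few lines. What follows is a minimal Python sketch rather than anything that ships with OTMI: the function names are our own invention, and it assumes the digest is published as a plain hexadecimal string.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`.

    The file is read in chunks so that large tarballs do not have to
    fit in memory all at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tarball_matches(path, published_digest):
    """Compare a downloaded file against a published hex digest.

    Assumption: the published digest is a bare hex string; surrounding
    whitespace and letter case are ignored.
    """
    return md5_of_file(path) == published_digest.strip().lower()
```

A call such as `tarball_matches("issue.tar.gz", digest_text)` (both names made up for illustration) returns `True` only when the download is byte-for-byte identical to the file the digest was computed from. MD5 is fine for catching truncated or corrupted downloads, though it is not a safeguard against deliberate tampering.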

Another specific bit of feedback concerned the regex we use to split the full text into sentences, or "snippets" as we have called them:

"OK, so you seem to split in \.\s+(?=[A-Z])
Maybe you can substitute \.\s+(?=[A-Z]) with \.\.\s+(?=[A-Z]) before the split?

And although other punctuation marks are not common in scientific text, maybe you can be more generic by using something like [\.\?\;]\s+[A-Z0-9]"

Well, we'll certainly look into that and we invite others to comment. Maybe something for the wiki?
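To make the comparison concrete, here is a small Python sketch that applies both patterns to the same text. It is an illustration only, not the code we run in production. One labelled assumption: the suggested pattern [\.\?\;]\s+[A-Z0-9] would, if used directly in a split, swallow the first character of the following snippet, so the sketch wraps the trailing character class in a lookahead, as the current pattern does. The names `CURRENT_SPLIT`, `GENERIC_SPLIT` and `split_snippets` are hypothetical.

```python
import re

# Pattern quoted in the feedback: split on a full stop plus whitespace
# when the next character is an upper-case letter. The full stop and
# the whitespace are consumed by the split.
CURRENT_SPLIT = re.compile(r"\.\s+(?=[A-Z])")

# More generic variant suggested in the feedback. Assumption: the
# trailing [A-Z0-9] is turned into a lookahead so that the first
# character of the next snippet is kept.
GENERIC_SPLIT = re.compile(r"[.?;]\s+(?=[A-Z0-9])")


def split_snippets(text, pattern):
    """Split `text` on `pattern`, dropping any empty pieces."""
    return [piece for piece in pattern.split(text) if piece]


if __name__ == "__main__":
    sample = ("We measured p53 levels. Results were clear; "
              "3 replicates agreed. Why? More work is needed.")
    for name, pattern in (("current", CURRENT_SPLIT),
                          ("generic", GENERIC_SPLIT)):
        print(name, split_snippets(sample, pattern))
```

On the sample sentence the current pattern yields three snippets and the generic one yields five, because it also breaks at the semicolon and the question mark. Allowing digits after the break has a cost, though: with the generic pattern "See Fig. 2 for details." is cut in two after "Fig", whereas the current pattern leaves it whole.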

We look forward to further feedback and to new and developing discussions on the wiki. Feel free to create new pages there too. Perhaps even a page for advocacy?

Comments

This is very exciting, and I am happy to learn that this project did not die after earlier steps. I blogged about OTMI twice in relation to chemistry:

http://chem-bla-ics.blogspot.com/search?q=OTMI

In particular, I think that combining this with OSCAR (see the above link) will provide an excellent chemical fingerprint of an article.

Therefore, what are users of OTMI allowed to do? Specifically, are we allowed to take the OTMI, use the word list to define a set of molecules cited in this article, and freely (open access or even open data) distribute the latter information? To make this clear, this would not mean redistributing OTMI content, but only results from analysing the OTMI content.

Hi Egon,

That's a very good question. I think we (Nature) would be happy for people to do what you suggest, and it's precisely one of the things we're trying to achieve with OTMI. If chemists (among others) can use this approach to more easily identify papers of interest then I think we all stand to gain.

Hi,
First, thank you for providing this great resource.
The section containing the snippets is valuable information to have, especially when dealing with contextual information. I was wondering if you plan to provide this information for the whole repository or only for the recent entries. Thank you,
Sylvain Gaudan, EBI/EMBL UK.

Hi Sylvain:

At the moment we are very much awaiting general feedback in order to judge the overall reception of this initiative. We are hoping to generate OTMI files as part of our ongoing publishing schedule as soon as may be realistically feasible. We will, of course, update the wiki pages and post to the blog here when this is put in place. Once established as part of the regular publishing workflow it is likely that we would go back through the online archive and generate OTMI files for that content too. Again, we would expect to make announcements both on the wiki pages and on this blog.
