We’ve posted here before (here and here) about OTMI, the Open Text Mining Interface specification that we are proposing as a means to disclose scholarly full text for text analysis purposes. This post details some recent updates.
Contact email
First things first. We’ve set up a dedicated mail address for all OTMI queries and feedback: otmi@nature.com. Please feel free to make use of this contact address for any issues you may want to raise about OTMI.
Wiki
We’ve also created an online forum, a wiki (powered by MediaWiki for those who may be interested), for collecting resources relating to OTMI and for hosting discussions and enabling user collaborations. The wiki can be found at https://opentextmining.org/. Again, do not hesitate to join in the discussion by adding to the existing pages or by creating new topics about OTMI of your own choosing.
Currently the wiki hosts the following resource categories:
- Repository
- Information
- Resources
- Web Resources
Repository
Which leads to the real meat of this post – the OTMI Repository. So far we have made available in OTMI format two years’ worth of full text (or, where a title has published less than that, everything it has published) for five of our titles:
- Nature
- Nature Genetics
- Nature Reviews Drug Discovery
- Nature Structural & Molecular Biology
- The Pharmacogenomics Journal
Directories of available OTMI files are provided as OPML (Outline Processor Markup Language) files for easy navigation:
- Master file – master OPML file for all journals
  - https://www.nature.com/otmi/journals.opml
  - references journal OPML files (via attribute “opmlUri”)
- Journal file – OPML file for given journal – here The Pharmacogenomics Journal
  - e.g. https://www.nature.com/tpj/otmi/tpj.opml
  - references issue OPML files (via attribute “opmlUri”)
  - references issue OTMI tarball (via attribute “gzipUri”)
- Issue file – OPML file for per-issue OTMI files
  - e.g. https://www.nature.com/tpj/journal/v5/n1/otmi/otmi-manifest.opml
  - references article OTMI files (via attribute “otmiUri”)
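For anyone wanting to walk these directories programmatically, here is a minimal sketch in Python. Only the attribute names “opmlUri”, “gzipUri” and “otmiUri” come from the description above. The sample document, its example.org URLs and the helper name `collect_uris` are made up for illustration, and we are assuming the attributes sit on standard OPML `outline` elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical journal-level OPML, invented for this sketch.
# Only the attribute names opmlUri and gzipUri are taken from the post.
SAMPLE_OPML = """<opml version="1.1">
  <head><title>Example journal</title></head>
  <body>
    <outline text="Volume 5, Issue 1"
             opmlUri="https://example.org/v5/n1/otmi/otmi-manifest.opml"
             gzipUri="https://example.org/v5/n1/otmi/issue.tar.gz"/>
    <outline text="Volume 5, Issue 2"
             opmlUri="https://example.org/v5/n2/otmi/otmi-manifest.opml"
             gzipUri="https://example.org/v5/n2/otmi/issue.tar.gz"/>
  </body>
</opml>
"""


def collect_uris(opml_text, attribute):
    """Return the values of `attribute` from every outline element that has it."""
    root = ET.fromstring(opml_text)
    return [
        outline.attrib[attribute]
        for outline in root.iter("outline")
        if attribute in outline.attrib
    ]


# Issue manifests to descend into, and tarballs to download instead.
issue_manifests = collect_uris(SAMPLE_OPML, "opmlUri")
issue_tarballs = collect_uris(SAMPLE_OPML, "gzipUri")
print(issue_manifests)
print(issue_tarballs)
```

The same helper would be called with “otmiUri” on an issue file to list the article-level OTMI files.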
OTMI files are available at both the issue and article levels:
- Issue level – tarball of OTMI files for complete issue, with a corresponding MD5 digest file for each tarball
- Article level – OTMI file for individual article
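Checking a downloaded tarball against its MD5 digest could look something like the sketch below. We are assuming here that the digest file starts with the hexadecimal digest, optionally followed by a file name (as the `md5sum` tool writes it); the helper names and the stand-in bytes are illustrative only.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as a lowercase hex string."""
    return hashlib.md5(data).hexdigest()


def matches_digest(data: bytes, digest_file_text: str) -> bool:
    """Compare `data` against the first field of a digest file.

    Assumption: the digest file begins with the hex digest, optionally
    followed by a file name, as md5sum would write it.
    """
    fields = digest_file_text.split()
    if not fields:
        return False
    return md5_hex(data) == fields[0].lower()


# Stand-in bytes; in practice these would be the contents of the tarball.
payload = b"stand-in for the bytes of an issue tarball"
digest_line = md5_hex(payload) + "  issue.tar.gz"
print(matches_digest(payload, digest_line))
```

MD5 is useful here as a check against corrupted or truncated downloads rather than as a security measure.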
Feedback
Some early feedback has included:
- Tarballs for issue downloads
- MD5 digests
- Review regex
We originally posted just the loose OTMI files, one per article, but were asked if we could make tarballs (“*.tar.gz”) of all the OTMI files available for download. That made a lot of sense, so we added that option. And now, as detailed above, both download options are available. We were also asked if we could provide MD5 digests of the tarballs. We’ve added those too.
Another specific piece of feedback concerned the regex we use to split the full text into sentences, or “snippets” as we have called them.
"OK, so you seem to split in \.\s+(?=[A-Z])
Maybe you can substitute \.\s+(?=[A-Z]) with \.\.\s+(?=[A-Z]) before the
split?
And although other punctuation marks are not common in scientific text,
maybe you can be more generic by using something like [.?;]\s+[A-Z0-9]"
Well, we’ll certainly look into that and we invite others to comment. Maybe something for the wiki?
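To make the discussion a little more concrete, here is a rough Python sketch of the two behaviours. It is not the code we actually run. It reads the quoted split pattern as `\.\s+(?=[A-Z])` (a literal full stop, whitespace, then a capital letter), which is our assumption about the intended escaping. The second pattern is our own adaptation of the broader suggestion: it uses a lookbehind and a lookahead so that the punctuation mark and the first character of the next snippet are both kept.

```python
import re

# Assumed reading of the split pattern quoted above: a literal full stop,
# one or more whitespace characters, then a lookahead for a capital letter.
SPLIT_ON_PERIOD = re.compile(r"\.\s+(?=[A-Z])")

# Our adaptation of the broader suggestion, not the commenter's exact regex:
# split on whitespace that follows '.', '?' or ';' and precedes a capital
# letter or digit, so nothing but the whitespace is consumed.
GENERIC_SPLIT = re.compile(r"(?<=[.?;])\s+(?=[A-Z0-9])")


def split_snippets(text, pattern):
    """Split `text` on `pattern`, dropping any empty pieces."""
    return [piece for piece in pattern.split(text) if piece]


sample = (
    "Mice were treated daily. Is the effect dose dependent? "
    "12 animals responded. We saw no toxicity."
)

for snippet in split_snippets(sample, SPLIT_ON_PERIOD):
    print(repr(snippet))
print("---")
for snippet in split_snippets(sample, GENERIC_SPLIT):
    print(repr(snippet))
```

On this sample the first pattern drops the full stop it splits on and does not split after the question mark or before the digit, while the second keeps the punctuation and yields four snippets. Abbreviations such as "e.g. Smith" would still trip up either pattern.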
We look forward to further feedback and to new and developing discussions on the wiki. And feel free to create new pages on the wiki. Perhaps even a page for advocacy?