Open Text Mining Interface

Every now and then a scientist contacts Nature asking for a machine-readable copy of our content (i.e., the XML) to use in text-mining research. We're usually happy to oblige, but there has to be a better way for everyone concerned, not least the poor researcher, who might have to contact any number of publishers and deal with many different content formats to conduct their work. Much better, surely, to have a common format in which all publishers can issue their content for text-mining and indexing purposes.

The Open Text Mining Interface (OTMI) is a suggestion from Nature about how we might achieve that. As described in my earlier post, I presented a brief summary of the idea at the Bio IT World conference in Boston. We've since been sharing the idea with some other publishers. This post is intended to provide a few written details and an update.

Our initial demo uses the 23 March issue of Nature (by happy coincidence, a wonderful special issue on the future of scientific computing). Embedded in the HTML of the abstract and full-text file for each article is a tag like this:

<link rel="OTMI" type="application/atom+xml" href="../otmi/otmi-nature04614.xml" />

which points to an OTMI file — a machine-readable representation of the text. (Technically, it's an Atom Entry document with various XML namespace extensions to allow us to include additional information.) As I write this, the example files for our test issue contain the following information:

  • Bibliographic details (of the kind you might also find in our table of contents RSS feeds)
  • Word vectors. That is, a list of all the words that appear in the article and the number of occurrences. (There's also a stop-word list of very common words that have been excluded.) This enables the construction of the most basic types of search index.
  • 'Snippets'. Basically sentences, presented out of order, which allows more sophisticated indexing and text mining (e.g., the kind that looks out for common constructions such as "A binds to B" or "X inhibits Y"), but not, of course, anything that looks across sentence boundaries.
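For anyone wondering how a program might find the OTMI file from an article page, here is a minimal Python sketch. It assumes only what is shown above, namely that the page carries a `<link>` tag with `rel="OTMI"`. The class name, the function name and the page URL in the usage note are hypothetical, chosen purely for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OTMILinkFinder(HTMLParser):
    """Collect the href values of <link rel="OTMI" ...> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> are also routed here by default.
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower()
        if rel == "otmi" and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def find_otmi_links(html, page_url):
    """Return absolute URLs of the OTMI files advertised by an HTML page."""
    finder = OTMILinkFinder()
    finder.feed(html)
    # The href in the demo is relative, so resolve it against the page URL.
    return [urljoin(page_url, href) for href in finder.hrefs]
```

Calling `find_otmi_links(html, "http://example.org/journal/full/article.html")` on a page containing the tag above would resolve the relative `href` against that (made-up) page address. Fetching and parsing the Atom entry itself is left out, since the format is still changing.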

Note that for both words and sentences (actually quite hard concepts to define in strict computational terms) the algorithms used to tokenize the text are defined in the OTMI file using regular expressions, so anyone, or anything, examining the file can in principle know exactly how the text was processed to create the respective lists. Note also that the word vectors will usually be redundant if you have the sentences, but we include both for the purposes of this demo (and who knows, maybe it's useful to some people if we provide both).
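To make the idea concrete, here is a rough Python sketch of how word vectors and out-of-order snippets could be derived from regular expressions. The two patterns and the tiny stop-word list are assumptions made up for this example; they are not the expressions declared in the demo OTMI files. Sorting is just one simple way of discarding the original sentence order.

```python
import re
from collections import Counter

# Illustrative patterns only; a real OTMI file declares its own expressions.
WORD_PATTERN = re.compile(r"[A-Za-z0-9]+")
SNIPPET_SPLIT_PATTERN = re.compile(r"(?<=[.!?])\s+")
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def word_vector(text):
    """Count lower-cased words, leaving out the stop words."""
    words = (w.lower() for w in WORD_PATTERN.findall(text))
    return Counter(w for w in words if w not in STOP_WORDS)


def snippets(text):
    """Split text after sentence-ending punctuation and return the pieces sorted."""
    parts = [p.strip() for p in SNIPPET_SPLIT_PATTERN.split(text) if p.strip()]
    return sorted(parts)
```

Because the word pattern and stop-word list are published alongside the data, anyone with the snippets can recompute the vector, which is why the two are usually redundant.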

There are still a lot of things that could be improved here. For example:

  1. Allow for text from different sections of an article (e.g., abstract, figure legends) to be labelled as such.
  2. Allow for text to be presented in normal human-readable form for publishers who are willing to provide this.
  3. Add a list of cited articles, providing at least DOIs but perhaps other information too. This would, of course, open up the content to citation analysis.
  4. Add references to the OTMI files from the corresponding RSS feed items (and from the log-in page where content is access-controlled).
  5. Add references to a common stop-word list instead of repeating it in each OTMI file.
  6. Add rights information.
  7. Add references to associated data files and/or database entries.
  8. Provide an actual spec. ;)

We intend to make at least some of these changes (and perhaps others besides) over the coming weeks, so expect the example files to change before your eyes. (A good reason not to try any serious programming against them, BTW.) There's also an even more basic issue around whether an Atom entry document is the right starting point. For example, perhaps an RDF/XML format would be more useful, at least to some people.

The example of RSS shows how powerful a relatively simple common standard can be when it comes to aggregating content from multiple sources (even when it's messed up as badly as RSS ;). So maybe an approach like OTMI (or a better one dreamt up by someone else) can help those who want to index and text-mine scientific and other content. As with RSS, I think publishers might also come to see this as a kind of advert for their content because it should help interested readers to discover it. And on the basis that something is always better than nothing, it also doesn't force publishers to give away the human-readable form of their content: they can limit themselves to snippets or even just word vectors if they want to.

But the only chance we have of turning this into something useful for lots of people is to get plenty of feedback, so please post your comments below or send them to me: t.hannay[AT]nature.com. Thanks.

Update (27/4/06): There is an editorial about this in today's issue of Nature.

TrackBack

Listed below are links to weblogs that reference Open Text Mining Interface:

» Atom Newsreel from Mícheál Ó Foghlú's Weblog
Tim Bray on Atom Newsreel: I’ve been accumulating things Atomic to write about for a while, so here goes. Item: You’ll be able to blog from inside Microsoft Word 2007 via the Atom Publishing Protocol. Item: Sam Ruby has wrangled Planet to...

» Sentence ordering in OTMI from HubLog
The Biomedical Literature Mining Publications (BLIMP) archive has a fine collection of articles discussing the extraction of information from scientific papers. One in particular, A baseline feature set for learning rhetorical zones using full articles...

Comments

Hi Timo,
OTMI is really interesting as it puts semantic-web in the scientific abstracts.
Here are a few comments/suggestions:


  • Authors' names could be used to build social networks. I guess using a URI pointing to an author database (OK, this doesn't exist but..) rather than a literal could avoid any ambiguity (Hum, I don't know if Atom can use either a literal or something like rdf:resource).

  • Another ambiguity for a text-mining analysis: in biology many genes have homonyms. So otmi:snippet could contain a link to Unigene/Uniprot...

  • Important words in otmi:snippet could use links to SKOS

  • otmi:vector could use a URI pointing to connotea/tag rather than a literal?

  • A little bug in your parser: it breaks the sentence when it finds the dot in an acronym (e.g. "Fig." or "Ref.")


Thanks, Pierre.

Yes, I agree that in principle entities referred to in articles (e.g., people, genes and so on) could be formally identified, but it would require someone to make that identification. Currently we don't have that information without having someone go through and curate it manually, which is a massive task. I guess my dream is that someone (or more likely some collection of people) out there is clever enough to write software to do this job using the basic OTMI files, and that they would be willing to allow that information to be incorporated back into the OTMI system for use by others. That would be really exciting if it could be made to work, but I'm not holding my breath... ;)

You're right about the sentence tokenization. Actually, that's why we call them 'snippets' rather than sentences in the OTMI files themselves. Because full stops serve a number of grammatical purposes, it's actually quite hard to write a regular expression that reliably splits text only at sentence boundaries. That's why we took the approach of specifying the reg ex we used and leaving it at that. The snippets approximate to sentences, but they aren't exactly the same, as you've noticed. I guess we're assuming that this is not a big problem for most purposes, but if it is -- and more to the point if you can think of a better reg ex to use for this -- then please say.
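To illustrate the problem (and one crude workaround), here is a small Python sketch. Neither pattern is the reg ex used in the demo files; the naive splitter and the short abbreviation list are assumptions invented for this example.

```python
import re

# Naive rule: a full stop followed by whitespace ends a snippet.
NAIVE_SPLIT = re.compile(r"(?<=\.)\s+")

# Hypothetical, deliberately tiny list of abbreviations that end in a full stop.
ABBREVIATIONS = ("Fig.", "Ref.", "e.g.", "i.e.")


def naive_snippets(text):
    """Split after every full stop, including those inside abbreviations."""
    return NAIVE_SPLIT.split(text)


def abbreviation_aware_snippets(text):
    """Re-join pieces that were cut immediately after a known abbreviation."""
    result = []
    current = ""
    for piece in NAIVE_SPLIT.split(text):
        current = f"{current} {piece}" if current else piece
        if not current.endswith(ABBREVIATIONS):
            result.append(current)
            current = ""
    if current:
        result.append(current)
    return result
```

The second function fixes "Fig. 2" but wrongly merges two sentences whenever the first genuinely ends in one of the listed abbreviations, which shows why a single simple rule is unlikely to be right every time.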

There's a lot that can be said, but on the basic
topic of sentence boundary detection, you can
go to my search page at:

http://bionlp.org/search.html

Enter

sentence boundary detection

in the search field labeled "Search the ACL Anthology"

and you'll find 173 papers related to accurate determination
of sentence boundaries. My favorite is the long 2002 paper
by Mikheev that you can find by searching in the same way with

mikheev periods capitalized

More progress has been made in the intervening four years,
of course.

- Bob Futrelle

_______________________________________________________________
Robert P. Futrelle | Biological Knowledge Laboratory
Associate Professor | College of Computer and Information
| Science MS WVH202
Office: (617)-373-4239 | Northeastern University
Fax: (617)-373-5121 | 360 Huntington Ave.
futrelle@ccs.neu.edu | Boston, MA 02115
http://www.ccs.neu.edu/home/futrelle
http://www.bionlp.org http://www.diagrams.org

http://biologicalknowledge.com
mailto:biologicalknowledge@gmail.com
_______________________________________________________________

Hi Timo,

This is a wonderful idea; for some reason it puts me in mind of Douglas Adams's "Starship Titanic", its novelisation (which starts here) and its associated competition.

Keep up the good work


Chris

As I mentioned in HubMed's HubLog there are a couple of things (some of which are listed by Timo above) that are wanting in the present form of OTMI. A couple not covered:

- it seems to make assumptions about the nature of the text mining that will be applied. The stopwords and term frequency are completely redundant and the latter suggests a vector-space model view of text mining.

- By listing the sentences out-of-order (in alphabetical order and not in article order) and not including paragraph and other document structure, techniques which take advantage of the information clustering and flow that the structure of the article provides - which represent newer and more effective analysis techniques - cannot be applied. Even fairly traditional things like proximity search will not work using an OTMI source if the two words of interest are not in the same sentence.
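The proximity-search point can be shown with a short Python sketch. The word pattern, the function names and the example text in the usage note are all made up for illustration; nothing here is taken from the OTMI files themselves.

```python
import re

WORD = re.compile(r"[a-z0-9]+")


def within(words, first, second, max_gap):
    """True if `first` and `second` occur at most `max_gap` positions apart."""
    pos_first = [i for i, w in enumerate(words) if w == first]
    pos_second = [i for i, w in enumerate(words) if w == second]
    return any(abs(i - j) <= max_gap for i in pos_first for j in pos_second)


def proximity_in_text(text, first, second, max_gap):
    """Proximity search over running text, ignoring sentence boundaries."""
    return within(WORD.findall(text.lower()), first, second, max_gap)


def proximity_in_snippets(snippet_list, first, second, max_gap):
    """Proximity search when only isolated, unordered snippets are available."""
    return any(proximity_in_text(s, first, second, max_gap) for s in snippet_list)
```

For the text "We purified the kinase. Inhibitor X was then added.", a search for "kinase" within three words of "inhibitor" matches in the running text, but not once the two sentences are handed over as separate, reordered snippets.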

That said, this is a very good idea but the implementation needs some work. I would be interested in participating in the development of this idea.

constructively,

Glen

Glen Newton
CISTI Research
CISTI
National Research Council Canada
glen.newton@nrc-cnrc.gc.ca


It seems to me that this effort is misguided for the yet-to-be-published literature. At the point of manuscript submission there is a great opportunity to have the paper curated by the people who know the work the best, the authors, and quality checked by experts in the field, the reviewers. If all journals required that authors formulate an XML or similar document that contained all the significant conclusions from their manuscript, there would be little need for semantic mining of the text of the future literature. This document could be generated on-line using a controlled vocabulary, similar to that currently used by the various genome databases. For biology, most of this would consist of binary relationships, i.e., gene A down regulates gene B, enzyme C is inhibited by compound D, phenotype F maps to QTL G. This information would then be submitted to the NLM and kept as part of the Medline record for each paper. It would then be trivial to search for all compounds that inhibit a given enzyme and get the list of all relevant publications. The generation of protein interaction networks would be greatly facilitated.
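As a rough illustration of what such author-supplied conclusions might look like to a program, here is a Python sketch. The record layout, the relation names and the paper identifiers are all hypothetical; no such vocabulary or Medline field is being described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """One binary relationship stated by the authors of a paper."""
    subject: str
    relation: str
    obj: str
    paper_id: str  # hypothetical identifier for the source paper


def find_subjects(assertions, relation, obj):
    """Return sorted (subject, paper_id) pairs for a given relation and object."""
    return sorted(
        {(a.subject, a.paper_id) for a in assertions
         if a.relation == relation and a.obj == obj}
    )


# Placeholder records echoing the examples in the text above.
EXAMPLE_ASSERTIONS = [
    Assertion("gene A", "down-regulates", "gene B", "paper-1"),
    Assertion("compound D", "inhibits", "enzyme C", "paper-2"),
    Assertion("compound E", "inhibits", "enzyme C", "paper-3"),
]
```

With records of this kind, "all compounds that inhibit enzyme C" becomes a simple filter, `find_subjects(EXAMPLE_ASSERTIONS, "inhibits", "enzyme C")`, rather than a text-mining problem.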

There are a number of advantages to this approach. First, the most important part of the literature, the conclusions, would be captured in an unambiguous, machine-readable format. Second, this information would be contributed and checked by the most knowledgeable groups possible, the authors and reviewers. Third, in exchange for this extra effort, there would be rewards to the publishing process. Setting out the conclusions that the authors plan to prove in an explicit format would help many authors to be more organized and focused in their presentation. The reviewers would then be able to focus on how well the data supports these specific conclusions and there would be far less confusion as to what was actually being claimed by the authors. Conclusions that the reviewers felt were not supported by the data could be modified or deleted as part of the manuscript revision process. Furthermore, editors could more easily evaluate the impact and novelty of a manuscript. Since this conclusion information would be freely distributed, akin to the abstract, there would be no problems with current journal publishing business models.

Rather than try to have sophisticated algorithms try to tease the conclusions from convoluted papers in the literature, why not just require the authors do it explicitly and more accurately by stating their conclusions in a machine and humanly readable format at the beginning of the publishing process?

Brian makes a great point. I was thinking along the same lines as I was reading this article.

I agree that presenting the paper in a broken form cripples most kinds of text analysis one would want to do. In addition, inferred relationships would not be as accurate as those written or reviewed by the author. However, the best reason for doing the semantic mark-up up front is that it is another value-added service that a publisher can provide. Journal publishers provide a valuable editorial service, without which most of the scientific literature would be a jumbled mess. However, with more people accessing the online version of articles, and with initiatives such as the Public Library of Science, the pressure is on the subscriber-pays publishers to provide a valuable service that institutional customers will continue to pay subscription fees for.

Many of you are probably thinking that publishing today is hard enough, and takes long enough, already, without this extra step of complexity. Many authors still don't even provide lists of keywords for their papers. Therefore, it is the responsibility of the subscriber-pays journals, if they insist on holding to that revenue model, to develop a text-mining approach, rather than expect the authors to properly do the mark-up or the public to create the tool for them given imperfect source material. I know the folks at NPG are trying to provide a compromise solution that has the most chance at being adopted, but it feels a little insulting that you expect people to be excited about working on a mangled version of the source material. Of course, I don't see publishers jumping at the chance to hire software developers, nor would I expect the results to be very good, with the developers so detached from the scientists who want the tools.

There's a hybrid approach for those publishers who don't fancy the idea of hiring staff to implement text-mining, and it's inspired by the idea of open source software. Open-source software is created, as Eric Raymond says, to scratch an itch of the developer. Text-mining algorithms will be created to scratch an itch of the scientist, namely that of keeping on top of the overwhelming load of information being published. No one would be doing this if they could easily read and remember all the relevant papers in their field. So here's an idea: How about subscriber-pays publishers getting together to offer developers of text-mining algorithms personal subscriptions to the full text of their journals?

A fascinating suggestion.

I might add that the combination of an Atom-like RSS carrier (RSS 1.0) and the CML namespace have been suggested elsewhere as useful for such data mining, although this lacked the OTMI component.

Certainly, for explicit mining of molecules, both the InChI identifier and the CML molecular declaration would be useful additions.

See DOI: http://dx.doi.org/10.1186/1471-2105-6-180 and DOI: http://dx.doi.org/10.1186/1471-2105-6-141

where this is addressed in the context of Bioinformatics and also DOI http://dx.doi.org/10.1021/ci034244p

where the use of CML and RSS is elaborated.

We also suggested PRISM for mining seminar announcements in sciences;

http://dx.doi.org/10.1021/ci0504115

along with the likes of FOAF for mining "people".

Many thanks for all the comments. Please keep them coming. In the meantime, I'll try to respond to a few of them here.

Glen: We'd love to have your input if and when we decide that this is something worth pursuing further. In particular, if there are other formats that we can provide (that also have a good chance of widespread adoption) then we'd love to hear about them.

Brian: I think you're talking about something slightly different, albeit related. All that OTMI is trying to do is expose the information that publishers already have about already published material. If I've understood correctly, you are talking about asking authors to add to that information ahead of publication. This is also very important, and we're working on it (in an early-stage R&D kind of way). If we do manage to capture this kind of information (and it's turned out to be much, much harder than I thought it would be -- perhaps a subject for a future post) then we could certainly put it into the OTMI files. But right now it's generally not available and so isn't part of our current OTMI demo.

Grady: I'm under the impression that there's actually a lot you can do with the kind of information in the demo OTMI files, though as currently conceived you certainly can't use algorithms that look across sentence boundaries. If this is a showstopper for almost everyone who might want to use the data then there is indeed no point in doing this (at least in the way proposed). But if it turns out to be useful for at least some things (and at this early stage I think the jury's still out) then it presumably is worth doing. We're not expecting people to be excited, only asking if this would be useful, and we certainly don't intend to insult anyone. On the contrary, we're trying to find a way of improving the current situation, which seems to me to be much more inconvenient: text miners having to secure access to content publisher by publisher, then scrape the information from hundreds of different HTML and XML formats. The aim of OTMI, therefore, is precisely to open up the content to analysis and experimentation by anyone, an aim on which I think we're both agreed.

In case people miss the update at the foot of the post above, there is an editorial about OTMI in today's issue of Nature.

As for Brian's suggestion about having authors add knowledge at paper creation time,
I speculated about how this might be done in a 1995 paper:

http://www.ccs.neu.edu/home/futrelle/papers1102/DEXA95-RPF-Fridman.pdf

Futrelle, R. P., and N. Fridman. (1995)
"Principles and Tools for Authoring Knowledge-Rich Documents"
DEXA 95 (Database and Expert System Applications)
Workshop on Digital Libraries, London, UK pp. 357-362.

Elaborating a bit on Glen Newton's very important concerns about the assumptions underlying OTMI, I'd like to point out that much published work takes advantage of document information not provided in the draft OTMI -- for example, knowledge of which section of a document a text fragment came from (e.g. methods section versus a figure caption). For more details, see my review in Molecular Cell 21, 589-594, doi 10.1016/j.molcel.2006.02.012

Currently, there is no uniform way to identify document segments from the "full text" forms provided -- PDFs are impossible, SGML easy, and most HTML or XML formats somewhere in between. Another important issue is reformatting aspects of publications that are currently done in either non-standard or purely visual forms e.g. handling greek characters, equations, superscripts, etc. A valuable role for an OTMI would be to define a uniform and easily parsable set of meta-data and stand-off annotations to address these needs.

The idea that the OTMI should provide only an unordered (or reordered) set of either words or 'snippets' is an unfortunate one. Human beings get significant information from the ordering within text, and therefore it is possible that programs can, too. If the "out of order" approach is suggested only to address the concerns of "subscriber pays" publishers about misuse of their texts, it is absolutely the wrong way to go about doing that. Some form of rights management (DRM perhaps, or some other approach) is the proper way to protect the rights of publishers. Manipulating texts so that they are not human readable will fundamentally alter them in a way that will impede text mining research and undermine the goal of the OTMI. Reformatting the text and providing metadata in a standardized way would be very valuable to the text mining community. Other alterations in a misguided attempt to preserve publishers' rights would not.

Larry

In case people are interested, I'm collecting other mentions of OTMI from around the web in my Connotea account under the "OTMI" tag.

While we're on this topic, I'm going to throw out a related, but more specialized suggestion regarding where XML-based information could be exposed: gene expression data. Authors and journals could be convinced to include all relevant gene expression data (northerns, in situs, etc.) as supplemental data because it would greatly increase the chance of a publication being cited. That data could be accompanied by XML information that would allow its automated incorporation into existing gene expression databases, such as the one at Jackson Labs, at an increased rate and reduced cost.

Tags would need to include:
assay type
species
strain/cell line
stage/conditions
tissues/subcellular localization

And I'm sure others I'm not thinking about.

Great initiative! I agree with other readers that including as much data from the article as possible is a good thing, like experimental data, unique identifiers such as InChI's, or even full chemical structures using CML.

In the past I have written the CMLRSS plugins for Jmol and JChemPaint (DOI: http://dx.doi.org/10.1021/ci034244p)
and recently been working on a replacement for Bioclipse which integrates Jmol, JChemPaint and other open source bio- and chemoinformatics software.

As an indication that I think this is going the right direction, I've hacked up OTMI support in Bioclipse this morning. See the details and the obligatory screenshot in my chem-bla-ics blog.

Now, this OTMI support is mostly a proof-of-concept, and offers no nice table view of the vectors and snippets; it just shows the regular expressions and the numbers of both OTMI elements. Neither does it read the PRISM data at this moment, but I will not put effort into that until the format has become a bit more fixed.

Most information in biology can ultimately be associated with a unique Latin binomen, the species name, e.g. Drosophila melanogaster (fruit fly) or Homo sapiens (man). This fact supports the discrete compilation of coherent information sets for taxa (Nature 439, 6; 2006), provided that the information is online and open access, and makes initiatives such as the Open Text Mining Interface (OTMI; Nature 440, 1090; 2006) even more powerful. Scientific names can also be linked to original and subsequent species descriptions, both of which (usually the latter even more so) are rich sources of data that typically include detailed morphological characterizations, distributional data, phenological data, notes on biology, biotic associations, and more. More than ten million such descriptions exist in our heritage literature, and every year more than 20,000 new species are described. The relatively normal structure of these “taxon treatments” supports the development of formal mark-up schemas, such as the taxonx XML schema developed and applied by a joint research team based at the American Museum of Natural History and supported by the US-NSF (IIS-0241229) and German DFG.

Schemas such as taxonx specify the encoding of structural components of treatments (e.g., descriptions of species), enabling application of machine mark-up techniques to a diverse corpus of legacy material to identify species names, their associated treatments, and to encode the structure of data within the treatment. These data can be operated on or extracted for use in a wide range of general and specialized applications. The XML schema approach is modular in concept, permitting inclusion of domain-specific schemas within more general schemas such as those applied in text mining (e.g., OTMI) and archiving (e.g., the National Library of Medicine’s Journal Archiving and Interchange Schema). The American Museum of Natural History has recently made the entire corpus of its scientific publications available on the Web and plans are underway to undertake massive digital capture of the world’s biosystematics legacy literature through the Biodiversity Heritage Library Project. These efforts promise to expose > 10 million pages of descriptions covering the world’s species to current search mechanisms. Data derived from encoded taxonomic treatments can feed into fledgling repositories of biological names such as Zoobank (Nature 437, 477; 2005). Prospective publishers could also use the schema approach to provide well-demarcated, open access to the factual content of descriptions of new taxa while maintaining requisite copyright restrictions (Nature 439, 392; 2006).

Donat Agosti 1; Indra Neil Sarkar 2; Terry Catapano 2; Christie Stephenson 2; Robert A. Morris 3; Thomas D. Moritz 4; Norman F. Johnson 5; Klemens Boehm 6; Guido Sautter 6

1 Naturmuseum der Burgergemeinde Bern, Bern Switzerland; 2 American Museum of Natural History, New York, USA; 3 University of Massachusetts, Boston, USA, 4 Getty Research Institute, Los Angeles, USA; 5 Ohio State University, Columbus, USA, 6 Universität Karlsruhe, Karlsruhe, Germany. agosti@amnh.org

What a thought-provoking thread! I would add two points:

First, it is important to emphasize not only the HOW of marking up full-text of scientific papers, but also the WHY – that is, the multiple kinds of uses that this is intended to serve. I agree that NLP processing of the sentences/snippets should be done after the initial mark up occurs, but the initial mark-up needs to be designed in a way that anticipates and facilitates the kinds of subsequent processing that will be done. My group is approaching this not as a problem of mark up for one major use [e.g. machine readability] but as developing a basic infrastructure that can support several different types and layers of subsequent text-mining and semantic processing.

Second, in conjunction with the newly launched, BioMed Central hosted, Journal of Biomedical Discovery and Collaboration (http://www.j-biomed-discovery.com), I am starting to organize a conference on “The Future of the Scientific Paper” that will cover at least four areas: a) How scientists are writing papers differently today and into the future; b) the impact of publishing trends; c) the impact of the internet and computing; and d) the issues of standards in marking-up and processing full-text [including metadata, figures, supplementary files, etc.]. The exact list of co-sponsors is still being assembled, but the conference will have an affiliation with several University of Illinois units and may take place on either Chicago or Champaign-Urbana campuses. I sincerely welcome suggestions regarding potential topics and potential speakers – from around the nation or around the world.

Thanks a lot,

Neil Smalheiser

I really like the OTMI proposal. I think that the current version is a good draft and cannot wait to see more content in this format. The reason that I like OTMI is that it is the first attempt at making the full text of articles from subscriber-pay journals available for indexing or text mining purposes. The format is a compromise that I believe will work for many indexing and text mining applications.

Some responses expressed concern that shuffling the sentences will make the OTMI documents unsuitable to some text mining approaches. What are these approaches? Can those who expressed concern cite work that would be hindered with OTMI, but would work if the full text was intact?

In contrast to hypothetical future text mining applications, I know that OTMI will work for a number of practical text searching and text mining applications. For instance, it will definitely work for indexing full text articles at the sentence level. (A tool that supports sentence-level searches of Medline is http://twease.org/ developed in my laboratory.) A version of Twease searching full text articles is a practical application that could be deployed very quickly and be of significant benefit to the biomedical community. Indeed, as of today, most biomedical researchers can only search abstracts from Medline, and full text for a relatively small (but growing) number of open-access journals. If adopted throughout NPG and by other publishers, OTMI could spearhead the development of search tools able to retrieve sentences from articles that biomedical researchers would never have suspected existed had they searched only abstracts through PubMed. Google Scholar lets its users search the full text of some scientific articles, but the search methodology is opaque, and it is not clear what journals are being indexed or what is the performance of the tool for scientific searches. Extrapolating from help documentation posted on the Google Scholar web site, it appears that Google is negotiating access with individual publishers to retrieve articles for indexing purposes. Unfortunately, the complexity of full text access negotiations suggests that academic laboratories will never be able to compete with a company such as Google, if this is what is required to obtain full text material.
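A minimal Python sketch of sentence-level indexing over OTMI-style snippets might look like the following. It is not how Twease works; the word pattern, function names and article identifiers are assumptions made up for this illustration.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")


def build_index(snippets_by_article):
    """Map each word to the (article_id, snippet_position) pairs containing it."""
    index = defaultdict(set)
    for article_id, snippet_list in snippets_by_article.items():
        for position, snippet in enumerate(snippet_list):
            for word in set(WORD.findall(snippet.lower())):
                index[word].add((article_id, position))
    return index


def search(index, snippets_by_article, query):
    """Return (article_id, snippet) pairs whose snippet contains every query word."""
    terms = WORD.findall(query.lower())
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return [(a, snippets_by_article[a][p]) for a, p in sorted(hits)]
```

Since every match is confined to a single snippet, the shuffled order of the snippets makes no difference to this kind of search.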

For all these reasons, I believe that OTMI is a first step in the right direction. I think that those serious about text indexing and mining should try it out and see what its limits really are. I suspect that OTMI will help text mining research go a long way. I would be glad to hear from any publisher interested in supporting OTMI to support sentence-level indexing of their article collection.

I was referred to this blog by an editorial in the April 27 issue of Nature. The URL presented was ridiculously long and unmemorable. I would like to direct the author of the editorial to tinyurl.com.
