I presented a talk and a poster on OTMI at BioNLP 2007 the week before last (Friday, June 29). This was a one-day workshop attached to the ACL 2007 conference (45th Annual Meeting of the Association for Computational Linguistics), held in the quiet outskirts of Prague. There were around 80 present including speakers and delegates. The talk was well received and there were many questions (see below) which provide some food for thought. Many thanks to Kevin Cohen and Lynette Hirschman for inviting me.
I was fortunate enough to talk early in the morning while people were still lively (talk is here) and there were several questions afterwards, both in the Q&A and later during the breaks and the poster session at the end of the workshop. Some of the questions and observations are listed below:
- General. Seems that there is not enough appreciation that OTMI is being proposed as a standard framework and methodology for disclosing subscription full text for text mining. That is, most of the features are parametrized and it is up to individual publishers to determine e.g. whether a snippet is a paragraph or a phrase, whether snippets are randomized or not, etc.
- Random order. Questions were asked about whether the order really needs to be shuffled, and whether the snippets could be made larger, e.g. paragraph units. (See point above re publisher choice.)
- Stopwords. Feeling is that omitting stopwords is just needlessly destructive. Do we need to inflict this lossy transformation on the full text? (It is proper that the OTMI framework allows for this, but do we want to cripple the text thus effectively rendering certain text mining techniques inoperative?)
- Word vectors. Immediate feeling was that these are pretty much useless as anybody can count, but more practiced hands conceded that these could be a useful ‘entry level’ for non-specialists, i.e. the vectors could be used to determine a rough and ready document categorization. Related to this were questions on word vectors being made available for a document corpus rather than just the document in question, so that the document could be gauged against a corpus.
- Sections. There was positive feedback re our picking out key sections (methods, conclusions) although there are still questions about section ordering and section naming.
- Tables. Do we include table captions? Answer is no, and here I really don’t understand why not. Had we thought of making the actual table data available? I don’t know, but it probably represents an extra level of complexity because some kind of row/column ordering would need to be preserved.
- Figures. We include figure captions, but did we think about adding in (i.e. referencing) the figures themselves? (The figures are currently maintained behind a subscription firewall.)
- References. Are references linked back to the original text? I don’t think they are properly marked up to allow the reference to be paired off with the source text snippets. Doing so would make a lot of sense.
- Reuse policy. Are snippets of full text able to be reproduced along with annotations on a third-party website?
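To make the "parametrized" point above a little more concrete, here is a minimal Python sketch of the kind of publisher-side choices being discussed: snippet size, shuffling, stopword removal and word vectors. This is only an illustration and not the OTMI format or any actual OTMI tooling; the function names (`make_snippets`, `disclose`), the parameters and the tiny stopword list are all hypothetical.

```python
import random
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Lowercase the text and return its alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def make_snippets(text, unit="sentence"):
    """Split text into snippets: blank-line paragraphs or rough sentences."""
    if unit == "paragraph":
        parts = re.split(r"\n\s*\n", text)
    else:
        # Naive sentence split on whitespace that follows ., ! or ?
        parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def disclose(text, unit="sentence", shuffle=True, drop_stopwords=True, seed=None):
    """Return snippets and a word-count vector under the chosen settings.

    Every argument stands for a choice left to the individual publisher.
    """
    snippets = make_snippets(text, unit)
    if shuffle:
        # Randomize snippet order; a seed makes the shuffle repeatable.
        random.Random(seed).shuffle(snippets)
    tokens = tokenize(text)
    if drop_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return {"snippets": snippets, "word_vector": Counter(tokens)}
```

A publisher happy to expose ordered paragraphs with stopwords intact would call something like `disclose(text, unit="paragraph", shuffle=False, drop_stopwords=False)`, while a more cautious one would keep the defaults. The word vector is counted over the whole document here; counting over a corpus, as suggested at the workshop, would mean summing these counters across documents.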
We’re going to be looking into these questions and trying to come up with some good answers. Of course, we are always open to feedback, either from comments on this post, privately to the feedback address email@example.com or publicly to the OTMI discussion list firstname.lastname@example.org. And the OTMI wiki at http://opentextmining.org/ is always available for public input to the project.
(Note: Peter Corbett of the Unilever Centre for Molecular Informatics, Department of Chemistry, University of Cambridge has posted an account of the BioNLP 2007 workshop here.)