Lunch with Egon Willighagen
We had lunch yesterday with Egon Willighagen, who in his spare time runs the Chemical Blog space, now situated at http://cb.openmolecules.net/ (running on postgenomic code).
The chat over lunch was pretty good; it turns out that Egon's favorite molecule might be ascorbic acid. One of the topics that really animated Egon was how to link molecules to academic papers. By this I mean, for example, that if you search for a molecule in Google, or in some dedicated search engine, how does the search engine know which papers deal with that molecule? There are a couple of obstacles to solving this. One is that different fields use different terminology for molecules, especially as the molecules become large, so a plain text search for the name will not find all of the papers that you might be interested in. Another is that papers don't have semantic markup of molecules.
One solution to marking up molecules is to use an InChI (an IUPAC International Chemical Identifier). These have been championed by Peter Murray Rust and there is an extensive InChI FAQ available. The short story is that an InChI is a character string which uniquely describes a chemical substance. From any chemical structure you can generate an InChI.
Peter has a writeup on using InChI in blogs. If every chemical that appeared anywhere were marked up with its InChI, or every article referring to it were tagged with that InChI, then the findability problem would be solved by simple string searching.
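To make "simple string searching" concrete, here is a minimal sketch. The three "papers" and their ids are made up for illustration, and I am using ethanol and water only because their InChIs are short. The `1S` prefix marks the standard form of the identifier; older strings start with `InChI=1/` instead.

```python
# Hypothetical mini-corpus: each "paper" is plain text that may embed an InChI.
ETHANOL_INCHI = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

papers = {
    "paper-a": "We measured the density of ethanol "
               "(InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3) at 298 K.",
    "paper-b": "Ethyl alcohol was used as the solvent throughout.",
    "paper-c": "Water (InChI=1S/H2O/h1H2) was distilled twice.",
}


def find_papers(inchi, corpus):
    """Return the ids of documents whose text contains the exact InChI string."""
    return sorted(pid for pid, text in corpus.items() if inchi in text)


matches = find_papers(ETHANOL_INCHI, papers)
print(matches)
```

Only the paper that carries the identifier is found. The one that talks about "ethyl alcohol" without any markup is missed, which is exactly the terminology problem described above. A plain substring test is also naive in that a short InChI could match the start of a longer one, so a real index would probably compare whole identifiers.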
OK great, so what's the problem? For a start there is an alternative system, SMILES (the Simplified Molecular Input Line Entry System), a markdown for molecules if you like. There is a very good description of the syntax here and the KinasePro blog has a short comment on how many people use SMILES vs InChI. The bottom line is that more people use SMILES, but it seems easier to search Google with InChI. I'm not a chemist, but from my naive standpoint the SMILES syntax seems closer to the text description of chemistry that we know from school, whereas the InChI system is more rigorous and requires one further step of abstraction. It reminds me of the difference between LaTeX for math and MathML. MathML is a hell of a lot easier to write a parser for than LaTeX, as LaTeX can be quite expressive; however, no one writes raw MathML. Scientists are lazy, and that extra step of abstraction might be the reason why SMILES seems to be used more frequently at the moment.
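One practical consequence of that difference, as far as I understand it: the same molecule can be written as several different valid SMILES strings unless a canonicalisation step is applied, whereas InChI is designed to give one string per structure. The sketch below uses a hand-entered lookup table in place of a real cheminformatics toolkit (that table is my assumption, not computed), just to show why exact string matching behaves differently for the two.

```python
ETHANOL_INCHI = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
WATER_INCHI = "InChI=1S/H2O/h1H2"

# Three valid ways of writing ethanol in SMILES, plus water.
# Hand-entered stand-in for a toolkit that would compute the InChI.
smiles_to_inchi = {
    "CCO": ETHANOL_INCHI,
    "OCC": ETHANOL_INCHI,
    "C(O)C": ETHANOL_INCHI,
    "O": WATER_INCHI,
}


def group_by_inchi(mapping):
    """Group SMILES strings that refer to the same InChI."""
    groups = {}
    for smiles, inchi in mapping.items():
        groups.setdefault(inchi, []).append(smiles)
    return groups


groups = group_by_inchi(smiles_to_inchi)
for inchi, variants in groups.items():
    print(inchi, "<-", sorted(variants))
```

A search for the literal text `CCO` would miss a page that wrote `OCC`, while all three variants collapse onto a single InChI.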
Egon suggested as a solution that journals should require papers dealing with chemicals to include InChIs. He said that every tool for drawing chemicals (standard issue for anyone writing a paper on the subject) can now output the InChI with the click of a button. Sounds reasonable, seems easy, but there are problems with this approach. I have heard people say a few times, "You are Nature, you can make authors do anything in order to get a paper published, so why not get them to do x?" Well, for a start, that's an editorial decision, but even so, making more demands on scientists may not be the best decision when the process of publication is already pretty fraught and stressful. Even if we did this, what would we gain? A small selection of the literature would be marked up, but the vast majority of journals in the area would need to follow suit in order to gain full coverage. I am not arguing that we should not do x because other people are not doing x, but rather that this cannot be seen as a final solution to the problem. Journals are naturally shy of any step that can delay the publication of an article, and so I am also skeptical that we would see such obligatory requirements. Better, I think, to have this step as a voluntary one. Practically all journals allow supplementary information and I am sure all of them would accept InChIs as supplementary information.
Even then one is still left with the vast existing corpus of papers that are already published. Egon points out that no one uses the literature in this area from 50 years ago, as modern techniques have advanced so far that this literature is functionally of little use. The implication here is that 50 years in the future we will only need to go back as far as today's papers. Even so, there has to be value in seeing the evolution of an idea from its insertion into the literature right through to where it has led today, and Egon agreed with this.
So what can we do now to help make connections between papers and molecules? Peter Corbett, who works with Peter Murray Rust, is working on automated methods of getting computers to read chemistry papers and output semantic markup of them. Tools like this can begin to fill in the semantic blanks, both for papers from the past and for the current literature. Egon has now created RDF pages for molecules on openmolecules.net. These pages use the InChI in their structure, and now each molecule has its own web page. Egon's pages check Connotea and pull from Connotea the co-tags of InChI tags (Here is a short description of this). If we work on this a bit more we should be able to set up a system where if you tag a paper with an InChI, that paper could appear on Egon's pages. We got quite excited about this idea yesterday and are certainly going to discuss this further. It's a small start, but a start nonetheless.
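A rough sketch of the tagging idea, to show how little machinery it needs. Nothing here calls Connotea or openmolecules.net; the bookmark list, the URLs and the function name are all hypothetical, and I am simply assuming that an InChI tag can be recognised by its `InChI=` prefix.

```python
from collections import defaultdict

# Hypothetical bookmarks: (paper URL, set of tags attached to it).
bookmarks = [
    ("https://example.org/paper/1",
     {"InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "solvents", "thermodynamics"}),
    ("https://example.org/paper/2",
     {"InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "fermentation"}),
    ("https://example.org/paper/3",
     {"InChI=1S/H2O/h1H2", "solvents"}),
]


def build_index(items):
    """Map each InChI tag to the papers carrying it and to its co-tags."""
    papers_by_inchi = defaultdict(list)
    cotags_by_inchi = defaultdict(set)
    for url, tags in items:
        inchi_tags = {t for t in tags if t.startswith("InChI=")}
        other_tags = tags - inchi_tags
        for inchi in inchi_tags:
            papers_by_inchi[inchi].append(url)
            cotags_by_inchi[inchi] |= other_tags
    return papers_by_inchi, cotags_by_inchi


papers_by_inchi, cotags_by_inchi = build_index(bookmarks)
ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
print(papers_by_inchi[ethanol])
print(sorted(cotags_by_inchi[ethanol]))
```

A molecule page would then only need to look up its own InChI in such an index to list the papers tagged with it, along with the other tags people used.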

Comments
For a practical example of how InChIs, Web publishing, and chemical structure search come together:
http://depth-first.com/articles/2007/09/24/building-the-chemically-aware-web-totallysynthetic-and-inchimatic
Posted by: Rich Apodaca | October 11, 2007 11:54 AM
Well, Ian didn't invite me to lunch :( or I could have talked about the new InChIKey, a hashed counterpart of the InChI, which I blogged about here on CrossTech, the publisher technology forum. I would also have pointed to the RSC work on syndicating InChIs with their RSS feeds as part of Project Prospect, which they have blogged about on CrossTech here.
Certainly, InChIKey looks to be a very promising development and should be more friendly for both search engines and for users to manipulate. Ian notes that "Egon suggested as a solution that journals should require papers dealing with chemicals to include InChis." The more compact and robust form of the InChIKey is likely to be a significant help in getting users to provide and publishers to cite InChI identifiers.
Posted by: Tony Hammond | October 12, 2007 06:05 AM
Getting InChIs out from the chemical drawing is easily done now, but I don't think there will be a realistic way to get them into the authoring process until the tools offer a robust way to keep the InChIs in the right place (and validated). Certainly it's not a burden we could currently expect of the majority of authors, which is why RSC Project Prospect relies on a combination of text mining and input by skilled technical editors. It's quite hard to do in practice, but it's worth it when you see the results which won us the ALPSP/Charlesworth Publishing Innovation award this year. The InChIKey should help to promote acceptance and use as Tony suggests, along with common treatment of these standards across publishers.
Posted by: Richard Kidd | October 15, 2007 05:11 AM
I'm no chemist so hesitate to comment on how appropriate any particular identifier system is for the domain; however, I have to strongly disagree with one of the general themes of this post. The idea that utilizing a well-established entity identification system within the context of published documents is too high a hurdle for most authors to pass just seems a) wrong and b) horribly (and surprisingly) short-sighted. As one who has had the displeasure of playing the name game in bioinformatics research (ambiguous String in document A =? HUGO name =? Entrez gene =? Affymetrix Id =? UniProt_id =? = ? =?), the critical importance of establishing and using standards for identifying the subjects of scientific discourse just seems obvious.
The idea that this is too much of a burden for publishing authors doesn't make sense to me; there is already precedent for a similar but much more elaborate "burden" in the publication process. Many journals, including Nature, require the submission of any new DNA sequence discussed in a publication to a public sequence database which assigns a unique identifier that can be used in the paper.
Making this kind of activity a mandatory part of publishing is something that a) journals do have the power to do (it certainly won't stop scientists from trying to publish in Nature) and b) will clearly provide many benefits in terms of both the computational and the human use of the knowledge embodied in scientific documents.
If this isn't the final solution to the problem (of providing consistent object identification within future publications), what is?
Posted by: Benjamin Good | October 16, 2007 11:42 PM
I have commented on this posting in some detail here http://www.chemspider.com/blog/?p=204
I wish I’d been at that lunch as I’d have some comments to add in. I’ve extracted from Ian’s post below and italicized his words then commented below.
“One solution to marking up molecules is to use an InChi (an IUPAC International Chemical Identifier). These have been championed by Peter Murray Rust and there is an extensive InChi FAQ available. The short story is that an InCHi is a character string which uniquely describes a chemical substance. From any chemical structure you can generate an InChi.”
AW> Peter has been a great advocate and champion for InChI and has definitely evangelized the value. But we should not forget those who have pushed the development and executed on delivering it. Specifically, Steve Heller, Steve Stein, Dmitrii Tchekhovskoi (all associated with NIST) and Alan McNaught (associated with IUPAC). The InChI was originally called the IUPAC-NIST Chemical Identifier. I’ve spoken previously about heroes and these people are truly the heroes of InChI. The rest of us get to use it, talk about it and celebrate it…they had the vision AND executed on it.
“Peter has a writeup on using inCHi in blogs, and if every chemical that appeared everywhere was somehow marked up with it’s InChi, or the article referring to it tagged with them then the findability problem would be solved by simple string searching.”
AW> Yes, this is true. BUT it is limited. And people don’t appear to be talking about the limitations. Chemists don’t necessarily want to search only on an exact structure (and don’t get me started on all of the various layers that can be layered onto an InChIString - stereo, fixed hydrogens etc). They may want to search on substructure and similarity of structure and InChIs are going to have to be aggregated to allow this … I have blogged about an approach and Egon could help get us there!
“Egon suggested as a solution that journals should require papers dealing with chemicals to include InChis. He said that every tool for drawing chemicals (standard issue for anyone writing a paper on the subject) can now output the InChi with the click of a button you are Nature, you can make authors do anything in order to get a paper published so why not get them to do x. Well, for a start, that’s an editorial decision,
Journals are naturally shy of any step that can delay the publication time of an article, and so I am also skeptical that we would see such obligatory requirements. Better, I think, to have this step as a voluntary one. Practically all journals allow supplementary information and I am sure all of them would accept InChi as supplementary information.”
I agree with Egon…I’ve written almost a dozen peer-reviewed articles this year. The instructions for authors demand systematic nomenclature and the authors are responsible for it. Demand InChI. Alternatively, the majority of papers have structures embedded as OLE-compatible objects. Develop a tool (not difficult) to generate InChIs from them. By the way, the InChIs COULD be embedded directly inside a PDF (I managed a product that generated PDF files that were STRUCTURE-SEARCHABLE, as well as images that were structure searchable). Yes, there is work to be done BUT it can be done. The challenge, I believe, is to get the primary societies to throw down the gauntlet. RSC are already using InChIs in Project Prospect. If Chemical Abstracts Service were to utilize and index InChIs the American Chemical Society might be very interested in requiring InChIs for their manuscripts, whether directly embedded in the documents or as supplementary information. Rich Docherty over at TotallySynthetic has started tagging his posts with InChIKeys…not InChIStrings (I’ve talked about the value of this here and here).
“So what can we do now to help making connections between papers and molecules? Peter Corbett, who works with Peter Murray Rust, is working on automated methods of getting computers to read chemistry papers and output semantic markup of them.”
AW> Over at ChemSpider we are working with Will Griffiths, who developed ChemRefer. We have already extracted tens of thousands of chemical names and will be linking them up to ChemSpider structures to enable Open Access papers to be structure/substructure searchable. However, we’ve hit a bit of a hurdle…more details on this will follow shortly, but we have been asked to remove from the ChemRefer index thousands of articles that were indexed according to what we believe is a standard search engine policy. During our conversation today with the publisher, the conversion of chemical names to chemical structures to provide a structure-searchable index of the articles was deemed to be “re-purposing” of the Open Access articles and is NOT allowable. Peter Corbett and Peter Murray Rust are engaged in similar activities so will likely run into the same challenges. If they manage to get around this issue with this and other publishers then they will be working in a “permissive” role where they will need to get permission from publishers to perform semantic markup. Their semantic markup is also “re-purposing”. The “permissive challenge” is far away from Peter’s stance in terms of Open Data for all.
“Egon has now created rdf pages for molecules on openmolecules.net. These pages use the InChi in their structure, and now each molecule had it’s own web page.”
AW> We are now working with Egon to RDF our own ChemSpider pages. Watch this space…
“Egon’s pages check Connotea, and pull from Connotea co-tags of InChi tags (Here is a short description of this). If we work on this a bit more we should be able to set up a system where if you tag a paper with an InChi, that paper could appear on Egon’s pages.”
AW> Not only Egon’s pages…we will index directly into ChemSpider also. The molecules will become part of a close to 20 million structure index including analytical data. It is one big web of chemistry, it is all coming together now, and Egon is a good guy to have lunch with. Wish I was there….
Posted by: ChemSpiderMan | October 17, 2007 10:52 PM