Microattribution for community annotation of the human genome

Human Variome Project Planning Meeting 25 - 29 May 2008, San Feliu de Guixols, Costa Brava, Spain


PROBLEM BEING ADDRESSED
Microcitation is a way to incentivize public data deposition by extending the practice of citing journal articles to database entries and by providing quantitative citation for every unique author.

SYSTEMS AND PLANS
A pilot project, commissioning peer-refereed locus reviews as journal articles with microattribution for individual variants, was introduced in a recent Editorial and expanded upon in detail in this blog.

Each journal article should have a publicly accessible Supplementary Table 1 listing all the accessions cited in the article. The accessions must be indexed to a unique sequence indicating a nucleotide position (an ssID in NCBI) and a unique allelic state. Each string must have an author ID and a unique locator for the citing journal. Thus a citation string is formed as a list of parameters carried on a URL that resolves to the appropriate database:

(ss71650991, A, TSC2DB, doi1038/ng.123, NM_000548.2:c.138+1G>A, OMIM191100, Popfreq=ALFRED#XXX,)

Used as a URL, this resolves to: http://www.ncbi.nlm.nih.gov/SNP/snp_ss.cgi?subsnp_id=71650991
It resolves even though it does not cite all of the data parameters related to that accession in dbSNP, and even though the string also carries a parameter pointing to the accession number of population frequency information that was submitted to another database.
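
The resolution step described above can be sketched in a few lines of Python. This is only an illustration: it assumes the ss identifier is always the first field of the citation string, and the function name `dbsnp_url` is hypothetical. The URL pattern is the one quoted above.

```python
from urllib.parse import urlencode

# URL pattern taken from the example in the text.
DBSNP_SS_ENDPOINT = "http://www.ncbi.nlm.nih.gov/SNP/snp_ss.cgi"


def dbsnp_url(citation: str) -> str:
    """Resolve a citation string to its dbSNP submitted-SNP page.

    Assumption: the first comma-separated field is the ss identifier.
    """
    # Strip the surrounding parentheses, split on commas, and drop empty
    # fields such as the one left by a trailing comma.
    fields = [f.strip() for f in citation.strip("() ").split(",") if f.strip()]
    ss_id = fields[0]
    if not (ss_id.startswith("ss") and ss_id[2:].isdigit()):
        raise ValueError(f"expected an ss identifier first, got {ss_id!r}")
    # Only the numeric part is passed on; the other fields are ignored here.
    return f"{DBSNP_SS_ENDPOINT}?{urlencode({'subsnp_id': ss_id[2:]})}"


citation = (
    "(ss71650991, A, TSC2DB, doi1038/ng.123, "
    "NM_000548.2:c.138+1G>A, OMIM191100, Popfreq=ALFRED#XXX,)"
)
print(dbsnp_url(citation))
```

Only the ss identifier is needed to reach dbSNP; the remaining fields travel with the string for other resolvers to use.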

Microattribution can operate locally, with journals and databases each reporting quantitative citation of accessions. However, depositing the proposed Supplementary Table 1 in a central registry of cited accessions (at publication) has three great virtues. First, different users can create citation-counting interfaces to the same information. Second, if the site is a proxy, it can record all microattribution (web traffic and vendor information as well as microcitation). Finally, the central site can be mined for citations associated with unique author identities and with each author’s publications and database entries.
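
As a rough sketch of what mining such a registry could look like, the Python below counts citations per author and per accession from rows shaped like the proposed Supplementary Table 1. The record layout, the author ID format, the DOIs and the second ss identifier are all made-up placeholders, not an agreed schema.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class CitedAccession(NamedTuple):
    """One row of a deposited Supplementary Table 1 (hypothetical layout)."""
    ss_id: str       # accession identifier, e.g. "ss71650991"
    allele: str      # allelic state
    author_id: str   # unique author identity (placeholder format)
    citing_doi: str  # locator for the citing article (placeholder values)


def count_by_author(rows):
    """Number of citations credited to each author ID."""
    return Counter(row.author_id for row in rows)


def count_by_accession(rows):
    """Number of distinct articles citing each (ss_id, allele) pair."""
    articles = defaultdict(set)
    for row in rows:
        articles[(row.ss_id, row.allele)].add(row.citing_doi)
    return {key: len(dois) for key, dois in articles.items()}


# Placeholder registry contents, for illustration only.
registry = [
    CitedAccession("ss71650991", "A", "author:0001", "doi:example/1"),
    CitedAccession("ss71650991", "A", "author:0001", "doi:example/2"),
    CitedAccession("ss00000002", "T", "author:0002", "doi:example/2"),
]

print(count_by_author(registry))
print(count_by_accession(registry))
```

Because both counts are derived from the same deposited rows, different interfaces can present them differently without disagreeing about the underlying data.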

To anticipate storage problems, parameter-rich accessions (ssID, allele, phenotype_tableID, submitter, curator, LSDB_ID, PBD_ID, ArrayExpress_ID, GeneTests_ID, PharmGKB_ID, local_confidential_record) would be stored for frequent online access, whereas less intensively curated accessions (ssID, allele, submitter, platform) might be stored on hard disks for occasional searching.
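
One possible way to make that split mechanical is sketched below, under the assumption that a record counts as parameter-rich as soon as it carries any populated field beyond the minimal set. The tier names `online` and `archive` and the function `storage_tier` are hypothetical; the field names are the ones listed above.

```python
# Fields of a less intensively curated accession, as listed in the text.
MINIMAL_FIELDS = {"ssID", "allele", "submitter", "platform"}


def storage_tier(record: dict) -> str:
    """Return "online" for parameter-rich records, "archive" otherwise.

    Assumption: any populated field outside MINIMAL_FIELDS marks the
    record as curated enough to keep in frequently accessed storage.
    """
    populated = {key for key, value in record.items() if value not in (None, "")}
    return "online" if populated - MINIMAL_FIELDS else "archive"


sparse = {"ssID": "ss71650991", "allele": "A", "submitter": "TSC2DB", "platform": ""}
rich = dict(sparse, curator="example-curator", LSDB_ID="example-lsdb-id")

print(storage_tier(sparse))
print(storage_tier(rich))
```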

OpenURL conventions used by publishers in the CrossRef citation system already lay out rules for constructing parameter strings to be carried upon URLs. This group is also developing a publishers’ version of author disambiguation and there are already web-wide projects that could be tapped, like OpenID.
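
To illustrate what a named-parameter form of the citation string might look like, here is a small Python sketch using the standard library. The parameter names (`ss_id`, `allele`, `submitter`, `hgvs`, `omim`) are invented for the example and are not registered OpenURL keys; the values come from the citation string above.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical parameter names; values taken from the example citation.
params = {
    "ss_id": "ss71650991",
    "allele": "A",
    "submitter": "TSC2DB",
    "hgvs": "NM_000548.2:c.138+1G>A",
    "omim": "OMIM191100",
}

# urlencode percent-escapes characters such as "+", ":" and ">" so the
# HGVS name survives being carried on a URL.
query = urlencode(params)

# The same parameters in reverse order decode to the same mapping, so a
# resolver does not need to depend on field position.
reordered = urlencode(dict(reversed(list(params.items()))))
same_content = parse_qs(query) == parse_qs(reordered)

print(query)
print(same_content)
```

With named parameters, a new field can be added by one community without breaking parsers written by another.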

I suggest that parameter sets be nested within existing conventions to allow committees of publishers, microattribution activists, genome annotators, and Mendelian mutation curators to define and update parameter forms that work for their communities.

(citer defined)
..........(microattribution)
....................(g,e,n,_,m,e)
.................... | | | ... |
....................(g,e,n,o,m,_)
..........(microattribution)
(target defined)

COLLABORATION AND SHARING CAPACITY
Thanks to HVP, HGVS, HUGO, NCBI, EBI, UCSF, SNPedia, Genome Commons, and INSIGHT for their time and ideas. These ideas are not limited to the genome community, but the genome gives us a unique indexing system and an opportunity to demonstrate best scientific practice in accurate citation.

TOPIC SECTION
Publication, credit & incentives

Comments

Alf Eaton points out that I intended the strings to be good OpenURL, with approved parameter names, equals signs and parameter values. That way the order in which they are cited does not matter. Alf is right.

Naive question:
What is your recommendation for well-meaning but pressed-for-time researchers in human genetics who bring to light currently unannotated but apparently non-pathogenic variations? It happens all the time when sequencing a candidate gene, that a variation turns out to be present in some percentage of control chromosomes as well as in the disease cohort. Now that the meeting is past, is there an optimized way to make such data available so that it is of some use to others, even if not to oneself? Thanks in advance for your thoughts.
