Nature Genetics | Free Association

Towards a hermeneutics of quantum citation

FAQ

“I haven’t got time to read all this, what are you on about?”

Microattribution [editorial1] [editorial2] should be built into database entries, databases, genome browsers, journal articles, journals and social network face pages, probably even into your fridge.

The aim is to provide incentive for global community annotation of the human genome by giving database accessions the same citation conventions and indices that journal articles currently enjoy, so that genotype-phenotype-frequency annotations are counted as genotype-author-phenotype-frequency annotations. Microattributions should be complemented by high level peer-refereed reviews and a convenient browser interface. Annotators should form a social network that displays their publications, microattributions, affiliations and credentials. Vendors can sustain free access by sponsoring particular sets of content (e.g. loci, pathways, probesets) and via annotator endorsements. These models might be appropriate for a Variome browser, Variome reviews and a network of human genome annotators, HUGONet.

“Who are you anyway?”

We are the world, we are the children. Seriously though, there are a lot of clever, highly motivated people who weren’t around when the human genome project went down and who would like to get involved, perhaps by adopting a neglected region of the genome of local interest. There are also a lot of expert old lags hanging onto lovingly curated real pathogenic human mutation data collected in the pregenomic era that needs to be brought out to create a framework of human variation before the trace archive tsunami hits. Plenty of puzzled physicians who want to know which variants are disease-associated too.

“Are you trying to build your own database?”

No databases will be harmed in the making of the Variome project. We are building a filtering and highlighting site that credits databases and has a feed (Variome track) to Ensembl and UCSC browsers. The Variome server will database only the microattribution counter and the information required to mark variant ssIDs and annotations from NCBI as peer reviewed or not. It will be able to mark a region between genome coordinates as open, under review or published, and it will have a microattribution wiki for community comments. The MTS peer review database can be left up to commissioning journals.
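As a rough illustration of how little the Variome server would need to store, here is a minimal sketch in Python. The class names, field names and the in-memory dictionary are all assumptions made for illustration; nothing here describes an existing Variome implementation.

```python
from dataclasses import dataclass, field
from enum import Enum


class RegionStatus(Enum):
    """The three region states named in the answer above."""
    OPEN = "open"
    UNDER_REVIEW = "under review"
    PUBLISHED = "published"


@dataclass
class VariantRecord:
    ss_id: str                     # accession held at NCBI; only the key is stored locally
    peer_reviewed: bool = False    # reviewed / not reviewed flag
    microattributions: int = 0     # running citation counter
    comments: list = field(default_factory=list)  # wiki comments (hypothetical layout)


@dataclass
class Region:
    chrom: str
    start: int
    end: int
    status: RegionStatus = RegionStatus.OPEN


class VariomeStore:
    """Hypothetical in-memory stand-in for the Variome server."""

    def __init__(self):
        self.variants = {}
        self.regions = []

    def _record(self, ss_id):
        # Create the local record on first use.
        return self.variants.setdefault(ss_id, VariantRecord(ss_id))

    def cite(self, ss_id):
        """Increment and return the microattribution counter for one ssID."""
        rec = self._record(ss_id)
        rec.microattributions += 1
        return rec.microattributions

    def mark_reviewed(self, ss_id, reviewed=True):
        self._record(ss_id).peer_reviewed = reviewed

    def set_region_status(self, chrom, start, end, status):
        self.regions.append(Region(chrom, start, end, status))
```

The variant data themselves stay at NCBI and the LSDBs; the sketch keeps only counters, flags, region states and comments.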

“Won’t Ensembl just pick all the information directly from the LSDB and NCBI?”

Maybe, but even if intensive curation is enough to get the information right, does ending up on a browser track give the data producers enough credit to induce them to participate in the data transfer, checking and indexing process? Why not offer them microattribution and a publication as well?

“Why not just use the UCSC wiki browser?”

We would if it had a graphic track showing quantitative microattribution in published literature and tally of wiki comments loaded onto each ssID, links to the appropriate database entries, and the ability to close commenting across a region, and to mark a region as published. I think we would still need a server on which to store a local copy as the annotation committee (authors, editor and referees) worked on the regions to publish.

“But surely, all that is required is for variants to be submitted to NCBI and they will appear on the genome browser”

This has not happened. LSDBs are holding a large number of well-annotated variants (representing years of work and many grants) that have not been moved to a common indexing system. There is no pipeline from the clinical labs and we are about to be deluged with resequencing deposited in sequence trace archives with little human annotation.

“Why doesn’t NCBI just provide citation statistics for every ssID, author handle and annotation?”

They should. The microattribution concept means that every database entry should provide not only links to its forward citations in papers and database entries, but also a current count of them. If NCBI did this, we would still need the high level reviews to ensure locus annotation quality and give high level priority to the annotator community.

“Why not just provide browser feeds from the LSDBs?”

LSDB server capacity and maintenance are variable; the databases are in a variety of formats and the variants are not indexed to the genome; duplication of NCBI on a small scale is inefficient; and LSDB curators lack the resources to convert their data.

“Why is it not sufficient to use existing wiki (SNPedia) or wiki browser (UCSC)?”

Wiki comments, even from an approved annotator community, will vary in credibility and detail and even if in theory they are scrutinized by readers, it is not certain that they will be corrected unless there is a high level incentive (the Variome Review) to encourage correction.

“Why do we need a wiki at all, since anyone can submit variant annotations to NCBI curators?”

The activation barrier to wiki commenting is very low and because the results can be seen immediately, alert students, as well as experienced researchers, can readily make a big difference.

Curators need to sleep, whereas the wiki is automated and can take comments from all time zones.

“Why peer review? Surely the data producers together with NCBI curators can provide authoritative variant reports indexed to the genome?”

This works for the variant report alone, but doesn’t provide consistent quality control or scrutiny across the locus. David Ravine’s review of PKD1 variants (Nature Genetics, April 2007, which is the model for a Variome Review journal article) revealed that 5% of the published variants were wrong. Community scrutiny under a uniform, editorially controlled process not only provides high level reviews to bring credit to the data producers, but highlights which datasets can be used with confidence.

“Is there a better way to view genotype-author-phenotype compound statements and their associated accession numbers than a microattribution wiki browser or a database table?”

Undoubtedly. If you have built one, we will use it.

“I have a grant to build/have built/have thought of and am going to build/a better solution and/or I am a hugely influential funding body and we have a plan that doesn’t include you so you are wasting your time.”

Fine, we are providing only what is missing: respect and credit for database builders, curators, authors, data producers, cogs in big teams, funding bodies, research participants (with their permission) and yes, eventually even journals and editors. Maybe you’d like to be appreciated too?

“Microattribution is not even a new idea, it is obvious.”

Yes, isn’t it.

“Isn’t Gogol going to index teh everything anyway?”

Maybe, but the ssID is a pretty cool way to index data (thanks, Donna!) and wouldn’t you rather trust your peers to evaluate genome variants in the first instance, so that the plex will develop their genome tools in ways we will be able to use?

“Won’t your filter become redundant if all NCBI entries are correct and journals provide quantitative attribution via a whizzy interface?”

Yes, I look forward to this day. If this process is catalytic and every database and journal provides microattribution credit, the Variome Browser filter will have served its purpose.

“Why not organize the HVP along the lines of the DECIPHER network?”

The problems are fundamentally different. DECIPHER exists to catalog clinical importance of structural genome variation, starting from scratch. For mendelian mutations, existing annotator communities have done much of the work, but now need credit for reformatting and indexing their variation collections to the genome and scrutinizing the annotations, locus by locus.

“Surely you are duplicating the work of PharmGKB, HGMD (Cardiff) and OMIM?”

These are databases without the intention to create systematic review and journal articles.

None provides quantitative citation credit, although they reference the original sources of the variants.

“Aren’t you just duplicating the effort of HUGENET?”

HUGENET provides disease-centered meta-analysis of genetic epidemiology, whereas Variome integrates findings of rare variants in rare mendelian diseases, rare variants in common diseases and common variants in common diseases.

“How will you distinguish peer reviewed from entries that were curated but not reviewed?”

We make no distinction at the browser level between curated entries and those added by wiki, only between reviewed and non-reviewed information. The source will be evident upon clicking the link, since these lead to NCBI and to the locally databased wiki respectively.

NCBI entries and wiki comments will be visible on the Variome browser in gray; the information that survived collaborative annotation and peer review will be in black.

“Surely the peer review process will entail rewriting and correcting NCBI entries, rather than merely filtering out those that are wrong?”

Irretrievably wrong or unattributable entries will need to be excluded. If the original data producers are unwilling to reannotate them correctly, new entries will be made. If the original data are largely correct, an annotation (corrigendum) will be appended to the existing NCBI entry.

“If competing journals have commissioned reviews on different regions of the genome, how can we distinguish one annotation group from another (some authors will overlap)?”

This happens at the moment with papers. Editors, referees and authors behave remarkably ethically. Authors can be added at the editor’s and senior author’s discretion. With microattribution, if an author on a collaborative annotation makes fewer than one annotation between the locus genome coordinates, their author status may be questioned.

“Won’t this confuse journal impact factors? What microcitation measure is the right one?”

Professional bibliometrists, get to work! Since when was more information a bad thing? The existence of the $20 bill does not do away with a need for $1 bills.

“How will authors know if they are citing the right entity?”

This problem exists already in journal articles. Authors will cite a GEO GPL platform accession number rather than the appropriate experiment accession (GSE number). As journals become more database-like, the database part of the article will reference the correct compound (ssID, author, phenotype, frequency) to support the assertion made (this variant is always associated with this disease).

“Are you planning to provide a disease-centered interface, like AlzGene [link2]?”

These objectives are on the HVP wish list for Informatics but outside the remit of Publication and Credit. The physician community will probably want to design a pipeline from the clinical sequencing labs to NCBI and a disease-oriented browser interface that starts with the variants for which clinical tests are available, then moves on to a list of variants sorted by whether they are pathogenic or not. These projects will be enabled by – but are outside the immediate scope of – the microcitation proposal.

“Will other journals provide microcitation statistics?”

Hurry! Nature Genetics already does and we are now looking for better ways to display and use the information (data producer index, annotator index, collaboration index, mentorship index).

“What else could HUGONet members put on their face page?”

If you were the referee who provided the decisive experiment for a highly cited Nature paper, you might like to reveal that fact (after publication) and post your review on your face page, even if you didn’t want your comments publicly on the Nature web site. If you were sparsely linked within the network, but had a postdoc position you needed to fill, why not pay HUGO to link you up for a month to the highest priority part of the network (or to everyone within the network)? HUGO needs money and you need a postdoc.

“What about genes where tests for variants have been patented?”

Companies holding a lot of patents on particular genome regions (e.g. BRCA1) might be motivated to adopt that gene within the Variome system, providing information, advertising and funding. I see no reason why their annotations should not be subjected to peer review in the usual way, since journals are happy to publish corporate research so long as methods are transparent and materials are available.

“What kinds of annotation will be counted?”

Ideally all, but the intention is to highlight all forms of annotation that require human effort, so the citation counter highlights papers and reviews separately from annotations with phenotypic information and annotations without phenotypic information. The compound (ssID, author, phenotype, frequency) has more value than the compound (ssID, author, vendor, probe) or the fundamental microcitation particle (ssID, author) even if it is cited less often.

For example, the SNP Consortium will be recognized for its discovery of a SNP every time a user cites an ssID or its rsID synonym. Affymetrix will get a microcitation for every paper that lists a probe used in high throughput genotyping. A mendelian geneticist will be interested in a single clinical report (dbGAP entry) coupled to a sequence report (ssID) and frequency (NCBI annotation ID).
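One way to picture the counter is as a tally keyed by who is credited and by which kind of compound was cited. The sketch below is only that: the `Microcitation` tuple, the three category labels and the `tally` function are hypothetical names chosen for illustration, not a specification of any NCBI or Variome format.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Microcitation(NamedTuple):
    """One cited compound; (ss_id, author) alone is the basic particle."""
    ss_id: str
    author: str
    phenotype: Optional[str] = None
    frequency: Optional[float] = None
    vendor: Optional[str] = None
    probe: Optional[str] = None


def kind(citation):
    """Classify a citation by the richest compound it carries."""
    if citation.phenotype is not None and citation.frequency is not None:
        return "phenotype-frequency"
    if citation.vendor is not None and citation.probe is not None:
        return "vendor-probe"
    return "particle"


def tally(citations):
    """Count citations per (author, kind) so each kind is shown separately."""
    counts = Counter()
    for citation in citations:
        counts[(citation.author, kind(citation))] += 1
    return counts
```

Keeping the kinds in separate bins means a rarely cited phenotype-frequency annotation is not swamped by frequently cited probe lists.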

“What happens when the journals are no longer interested in publishing reviews?”

Some loci receive more interest than others. Annotations of genes with many mutations and phenotypes may comprise publications on their own, while others may be annotated mutations affecting a pathway or process. Annotation of neglected regions will accrue progressively more attention via the microattribution process and will be adopted by interested groups for publication.

In the spirit of working with all interested parties, it is not anticipated that Variome will launch a journal to house the reviews, but this is an option to discuss.

Comments

    Belinda Giardine said:

    From: Belinda M. Giardine [giardine@bx.psu.edu]

    http://hgwdev-giardine.cse.ucsc.edu/

    Sent: Friday, January 04, 2008 12:26 PM

    To: Axton, Myles

    Subject: RE: Microattribution reviews

    Here Myles (MA) has attempted to answer Belinda’s (BG) questions.

    >BG0) First problem: how is a SNP defined? As a location? As the location and the change that occurred? For example, are c.1A>T and c.1A>G the same SNP?

    Does strand count? What if it is the matching change on the opposite strand?

    MA0) SNPs, mutations, indels and other more complex variants must be defined by a unique sequence with the key nucleotide locally indicated within it. This sequence must bear an ssID. Similar sequences differing by one nucleotide can then represent different alleles at one nucleotide position. At the rsID level, complementary sequences representing the same nucleotide state need to be recognized as synonymous. A convention of preferring the ssID representing a sequence identical to a genome build’s forward sequence could help. Has NCBI already solved this problem for ssIDs?
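A minimal sketch of the strand question, in Python. It treats a variant as (5' flank, allele, 3' flank) and gives a sequence and its reverse complement the same key. Note the assumption: with no genome build to hand, the sketch picks the lexicographically smaller orientation as the canonical one, which is only a stand-in for the forward-strand convention suggested above. The function names are hypothetical.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_key(flank5, allele, flank3):
    """Return one key shared by a variant and its opposite-strand synonym.

    Assumption: the lexicographically smaller orientation stands in for
    'identical to the genome build's forward sequence'.
    """
    forward = (flank5.upper(), allele.upper(), flank3.upper())
    reverse = (
        reverse_complement(flank3).upper(),
        reverse_complement(allele).upper(),
        reverse_complement(flank5).upper(),
    )
    return min(forward, reverse)
```

Under this scheme the same change reported on either strand shares a key, while two different alleles at the same position (an A and a G, say) get different keys.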

    >BG1) Number of wiki comments on the SNP (link to table of wiki comments)

    Given the complexity above, is the count to be generated on the fly every time a SNP is viewed? What if someone edits/adds to an existing entry: does that count as 1 or more comments?

    MA1) The wiki browser I envisage is probably better described as a comment thread browser. That is, each user is free to edit their own entries in wiki style but each new observation about a variant appears as a new comment in the thread. The number of comments and the number of times each individual comments are displayed in real time or automatically updated daily, in table or browser graphic form (and can be seen per variant or per locus). Clicking on each nucleotide brings up the thread of comments so that their text can be read and their links followed.
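The counting rule described here (an edit to your own entry changes nothing, a new observation adds one) can be sketched as below. The tuple layout and the function name are assumptions for illustration.

```python
from collections import Counter


def comment_counts(comments):
    """Tally a comment thread three ways.

    comments: iterable of (ss_id, locus, commenter) tuples, one per new
    comment. Edits to an existing comment are not new tuples, so they
    do not change any count.
    """
    per_variant = Counter()
    per_locus = Counter()
    per_commenter = Counter()
    for ss_id, locus, commenter in comments:
        per_variant[ss_id] += 1
        per_locus[locus] += 1
        per_commenter[commenter] += 1
    return per_variant, per_locus, per_commenter
```

The three counters could be recomputed on the fly for a single variant or refreshed daily for a whole locus, whichever the browser prefers.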

    > BG2) Number of papers referencing the SNP (link to PubMed)

    If I remember correctly this isn’t in what HGVS is recommending LSDBs share.

    MA2) The problem is solved in principle by CrossRef, a Google search of full text content from participating journals http://www.crossref.org/crossrefsearch.html. While this doesn’t currently find rsIDs cited in Supplementary Information, it could be automated with a script that would count, tabulate and link to the doi of any paper that cites any of a list of rsIDs within the genome coordinates of a review.
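The script mentioned here might look something like the following sketch. It assumes the full text of each paper has already been fetched and is held in a dictionary keyed by doi; how that text is obtained (CrossRef or otherwise) is outside the sketch, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# rsIDs look like "rs" followed by digits, e.g. rs10757278.
RSID = re.compile(r"\brs\d+\b")


def tabulate_citations(papers, locus_rsids):
    """Map each rsID in the review region to the dois of papers citing it.

    papers: dict of doi -> full text (assumed to be fetched elsewhere).
    locus_rsids: rsIDs lying within the genome coordinates of the review.
    """
    wanted = set(locus_rsids)
    table = defaultdict(set)
    for doi, text in papers.items():
        # Count each paper once per rsID, however often it mentions it.
        for rsid in set(RSID.findall(text)) & wanted:
            table[rsid].add(doi)
    return {rsid: sorted(dois) for rsid, dois in table.items()}
```

The length of each list is the count to display, and the dois themselves provide the links.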

    > BG3) Number of database entries referencing the SNP (link to NCBI)

    I am not sure what you mean here? Number of entries in an LSDB? Or number of databases LSDB or genome-wide? Where does the link to NCBI come in?

    MA3) The ssID of a given variant will appear in a number of annotations within dbSNP. Other annotations

    For example, http://www.nature.com/ng/journal/v40/n2/abs/ng.72.html reports that rs10757278-G is associated with abdominal aneurysm and intracranial aneurysm but not with or type 2 diabetes, the paper references several other papers that associate the same allele of this SNP with coronary heart disease. Each of these observations would be a citable observation that would be indexed to this particular nucleotide identified by ssID. NCBI should make sure each of these observations could be separately evaluated and tabulated by the review committee annotating the locus “CDKN2A/B”.

    Question MA3a: I am assuming that a microattribution review will filter a subset of NCBI content that the referees find has passed their criteria for inclusion. Using the ssID will point to all the associated data, not merely those items the referees have selected.

    Does dbSNP provide enough structure for a microattribution browser or microattribution review to extract and tabulate each of these observations? Is the ssID enough indexing, or do we need a portable, citable “molecule” (ssID, 2×2 table comparing number of genotypes or allele count and phenotypes, author ID, links out) for each study?

    >BG 4) Simultaneously display open and accumulating wiki comments (gray)

    and frozen peer reviewed content (black) – as you say, it may be easier to display two parallel tracks initially.

    With the setup the way it is at UCSC 2 tracks will be better. I can actually get it so that the reviewed track can have all the functionality that the Locus Variants track now does. If you want the Locus Variants track could probably be used as an initial version of the reviewed track.

    I would have to check with others to make sure there are no objections or problems I didn’t think of.

    MA4) There are four deliverables for Variome Microattribution Reviews: i) a published locus review, ii) a published locus track, iii) trackback or counting of forward citations for the published locus annotations individually, and iv) a comment browser track. These can proceed in parallel as separate deliverables (up to three separate browser tracks and many tables), but will need to be coordinated and integrated.

    I think the hardest problem is iii). a) Clicking on “cite this” can provide an appropriate link and readily propagate a citation back to NCBI while also increasing the counter on the Variome server. b) But we still have the problem of ensuring that a journal or database that publishes the Variome ID referring to a Variome Review entry sends the doi of the citing paper back to our Variome microattribution counter. Would we have to do an annual search?

    >BG 5) Display the limits of the locus under annotation, coordinating author and deadline for publication.

    This I don’t understand at all. Sort of a terms of use?

    MA5) The idea is that any journal can commission a locus review, or an annotation team can initiate one, but we would want to reduce duplication of effort and conflict between teams annotating neighboring loci. The publication browser should show the limits of the locus being reviewed and an approximate timescale for submission and publication (it would not be right for a journal or annotation team to stake out a locus without delivering a review within, say, 9 months). All those with data to deposit in EBI or NCBI should do so and become an author by contacting the communicating author posted on the browser. The track need not declare which journal has commissioned the review, since it may not meet referee criteria at the journal.
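A small sketch of how the browser might check a new locus claim against existing ones. The class, the 270-day lifetime (standing in for "say, 9 months") and the inclusive-coordinate convention are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumed lifetime of a claim: roughly 9 months.
CLAIM_LIFETIME = timedelta(days=270)


@dataclass(frozen=True)
class LocusClaim:
    chrom: str
    start: int                  # inclusive genome coordinate
    end: int                    # inclusive genome coordinate
    communicating_author: str
    claimed_on: date

    def expired(self, today):
        """A claim lapses if no review has appeared within the lifetime."""
        return today > self.claimed_on + CLAIM_LIFETIME

    def overlaps(self, other):
        """True if both claims cover at least one shared position."""
        return (
            self.chrom == other.chrom
            and self.start <= other.end
            and other.start <= self.end
        )


def conflicts(new, existing, today):
    """Return live claims that overlap the proposed new claim."""
    return [c for c in existing if not c.expired(today) and c.overlaps(new)]
```

A non-empty result would prompt the new team to contact the communicating author of the existing claim rather than start a duplicate review.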