Free Association

March 31, 2008

Microattribution for community annotation of the human genome

Human Variome Project Planning Meeting 25 - 29 May 2008, San Feliu de Guixols, Costa Brava, Spain


PROBLEM BEING ADDRESSED
Microcitation is a way to incentivize public data deposition by extending the practice of citing journal articles to database entries and by providing quantitative citation for every unique author.

SYSTEMS AND PLANS
A pilot project commissioning peer-refereed locus reviews as journal articles, with microattribution for individual variants, was introduced in a recent Editorial and expanded upon in detail in this blog.

Each journal article should have a publicly accessible Supplementary Table 1 listing all the accessions cited in the article. The accessions must be indexed to a unique sequence indicating a nucleotide position (an ssID in NCBI) and a unique allelic state. Each string must have an author ID and a unique locator for the citing journal. Thus a citation string is formed as a list of parameters carried on a URL that resolves to the appropriate database:

(ss71650991, A, TSC2DB, doi1038/ng.123, NM_000548.2:c.138+1G>A, OMIM191100, Popfreq=ALFRED#XXX,)

Used as a URL, this resolves to: http://www.ncbi.nlm.nih.gov/SNP/snp_ss.cgi?subsnp_id=71650991
even though the URL does not cite all of the data parameters related to that accession in dbSNP. The string also carries a parameter pointing to the accession number of population frequency information that was submitted to another database.
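
A minimal sketch of how such a citation string might be resolved, assuming the first field is always the ssID and that the fields are comma-separated inside parentheses as in the example above. The function names are hypothetical; only the dbSNP URL pattern comes from the text.

```python
from urllib.parse import urlencode

# Base URL copied from the example above; everything else is illustrative.
DBSNP_SS_URL = "http://www.ncbi.nlm.nih.gov/SNP/snp_ss.cgi"


def parse_citation(citation):
    """Split a parenthesised, comma-separated citation string into fields."""
    inner = citation.strip().strip("()")
    # A trailing comma leaves an empty field, which is dropped here.
    return [field.strip() for field in inner.split(",") if field.strip()]


def dbsnp_url(citation):
    """Build the dbSNP URL for the ssID carried in the first field."""
    ss_id = parse_citation(citation)[0]
    if not (ss_id.startswith("ss") and ss_id[2:].isdigit()):
        raise ValueError(f"first field is not an ssID: {ss_id!r}")
    return DBSNP_SS_URL + "?" + urlencode({"subsnp_id": ss_id[2:]})


citation = ("(ss71650991, A, TSC2DB, doi1038/ng.123, "
            "NM_000548.2:c.138+1G>A, OMIM191100, Popfreq=ALFRED#XXX,)")
print(dbsnp_url(citation))
```

The remaining fields (allele, database, journal locator, and so on) are parsed but not used by the URL, which mirrors the point that the link does not need to carry every parameter.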

Microattribution can operate locally, with journals and databases each reporting quantitative citation of accessions. However, depositing the proposed Supplementary Table 1 in a central registry of cited accessions (at publication) has three great virtues. Firstly, different users can create citation counting interfaces to the same information. Secondly, if the site is a proxy, it can record all microattribution (web traffic and vendor information as well as microcitation). Finally, the central site can be mined for citations associated with unique author identities and with each author’s publications and database entries.
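
As a rough illustration of mining such a registry, the sketch below counts microcitations per author and per accession. The row layout (ssID, allele, author ID, citing article) is an assumption based on the fields described above, and every identifier in the sample data is invented.

```python
from collections import Counter

# Hypothetical registry rows: (ssID, allele, author_id, citing_article).
# All identifiers here are invented for illustration.
registry = [
    ("ss0000001", "A", "author:1", "article:X"),
    ("ss0000001", "A", "author:1", "article:Y"),
    ("ss0000002", "G", "author:2", "article:X"),
    ("ss0000003", "T", "author:1", "article:Z"),
]


def count_by_author(rows):
    """Number of times each author's accessions are cited."""
    return Counter(author for _, _, author, _ in rows)


def count_by_accession(rows):
    """Number of citations of each (ssID, allele) pair."""
    return Counter((ss_id, allele) for ss_id, allele, _, _ in rows)


print(count_by_author(registry))
print(count_by_accession(registry))
```

Because both counts are derived from the same rows, anyone could build a different counting interface over the same deposited table.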

To anticipate storage problems, parameter-rich accessions (ssID, allele, phenotype_tableID, submitter, curator, LSDB_ID, PBD_ID, ArrayExpress_ID, GeneTests_ID, PharmGKB_ID, local_confidential_record) would be stored for frequent online access, whereas less intensively curated accessions (ssID, allele, submitter, platform) might be stored on hard disks for occasional searching.

OpenURL conventions used by publishers in the CrossRef citation system already lay out rules for constructing parameter strings to be carried upon URLs. This group is also developing a publishers’ version of author disambiguation and there are already web-wide projects that could be tapped, like OpenID.

I suggest that parameter sets be nested within existing conventions to allow committees of publishers, microattribution activists, genome annotators, and mendelian mutation curators to define and update parameter forms that work for their communities.

(citer defined)
..........(microattribution)
....................(g,e,n,_,m,e)
.................... | | | ... |
....................(g,e,n,o,m,_)
..........(microattribution)
(target defined)

COLLABORATION AND SHARING CAPACITY
Thanks to HVP, HGVS, HUGO, NCBI, EBI, UCSF, SNPedia, Genome Commons, and INSIGHT for their time and ideas. These ideas are not limited to the genome community but we have a unique indexing system in the genome and have an opportunity to demonstrate best scientific practice in accurate citation.

TOPIC SECTION
Publication, credit & incentives

February 24, 2008

Help! I’m becoming more normal.

Association of common variants to diseases is still in a phase of rapid discovery. One immediate consequence is that the relative risk for an individual, predicted from very partial information, can change rapidly as more information is added. For example, three of the 18 risk predictions have changed in my profile on the deCODEme site since November:

Restless legs syndrome: OR 1.94>0.97, was 1, now 4 SNPs
Prostate cancer: OR 1.05>0.77, was 5, now 8 SNPs
Type 2 diabetes: OR 1.45>1.10, still 8 SNPs

In the first two cases, the new information is the association of new variants that can be added to the risk calculation. In the third case I do not know if this is a recalculation or the result of more studies on the same 8 SNPs. I’ll make a rash prediction and be prepared to be proven wrong, but I think the more common variants are added, the more instances of (individual, disease) will tend to OR=1.0. Maybe I am just indulging in the fallacy called “the law of averages”, but it is at least a conservative, testable hypothesis.

Full genotypes are now available for download, so the project is about to become interesting.

January 11, 2008

Do I need a personal trainer or a personal genetic counsellor?

David Hunter, Muin Khoury and Jeffrey Drazen’s Perspective “Letting the Genome out of the Bottle- will we get our wish?” (NEJM 2008. 358;2 105-107) goes further than we have done (below) in its skepticism of consumer genomics services like deCODEme and Navigenics. While Nature Genetics as a research journal can welcome the opportunities for public research participation and ever larger experiments and longitudinal studies, the doctors worry that they will soon be deluged with patients asking them what all this risk information means for them. The authors have some good points, but largely ignore the unpredictable motivational potential inherent in handing people their genomes and asking them to participate in finding out more about their variation and phenotypes. Sometimes, the best doctor will say “we don’t know yet, let’s find out together”. However, we do applaud efforts to build genetics into every stage of medical education and appreciate the authors making their editorial an opportunity to hammer that point.

I take some quotes from the article as the starting point for a few of my own thoughts:

“..premature attempts at popularizing genetic testing…”
Not so much premature as truly disruptive. Personal genomic testing confers a personal stake in the ongoing research effort and a huge incentive to find out more. A personal stake in finding something that was not previously known is the key to getting students into research and may well be a powerful tool to interest individuals in the details of their own health and functioning.

“transparent quality control monitoring”
It is true that the use of multiplex genotyping platforms for genetic epidemiology gains some buffering from large numbers of subjects and replication. At the individual level, it is hard to check for a single miscalled SNP, so it might be desirable to build some redundancy into the genotyping using haplotype information. For the genetics enthusiast, it would be reassuring to see the SNP calls in cluster plot format together with those of other anonymized participants.

“clinical validity…predictive value...the area in which the data are in the greatest flux.”
This is true for conventional exposures and markers too. Physicians no longer prescribe smoking for its “health benefits”, at least not to their patients. Blood pressure and BMI limits now elicit more vigorous treatment from physicians as better information has been gained. It is to be expected that an individual’s risk profile will change with each new study since new variants will be discovered that moderate or exacerbate their individual risk.

“full accounting of disease susceptibility awaits the identification of these multiple variants and their interaction in well-designed studies.”
I don’t think more retrospective studies will entirely solve this problem. In the meantime, individual genomics might be a great recruiting tool for a longitudinal study. Deliver a detailed genotype to the participants right at the beginning.

“assumption that interventions that have proven successful in the general population will behave the same way in a genetically at-risk population.”
Optimizing generally applicable interventions with the enthusiastic participation of the research subjects themselves may be the best way forward.

“interventions - such as smoking cessation, weight loss, increased physical activity and control of blood pressure – are likely to be broadly beneficial in relation to many diseases, regardless of a person’s genetic susceptibility to a specific disease.”
And we pay a physician for this information……why?
It may take years of personal experimentation with different drugs, doses and dose regimens to achieve a balance of blood pressure control and side effects. Would it not be better to have a few clues as to who is likely to achieve blood pressure control via breathing exercises or salt reduction, who by exercise and weight reduction, and who actually needs the beta blockers? OK, gene tests don’t help at the moment with ACE inhibitors or diuretic dosing, but they could when your atherosclerosis gets to the thrombosis stage and you are deciding how much rat poison you want to eat.

“patients who test negative may be falsely reassured”
I don’t think I am “falsely reassured” by normal cholesterol, kidney function and blood pressure readings. Individuals have a remarkable ability to use information in their own self interest, indeed to integrate family history and the evidence of their own eyes with the gene test information. For a stunningly candid example of a family affected by Alzheimer disease without the major genetic risk factor, read the Tangled Neuron.


“but a detailed consumer report may be beyond most physicians’ skill sets.”
It should not be beyond most physicians’ ability to explain the quantitative risks conferred by - and the research underlying – the predictors they currently use: BMI, cholesterol, blood pressure, age and sex. A physician should also be able to relate quantitative risks conferred by a family history containing one or more affected relatives. A physician should also be able to advise the patient whether or not to participate in research tests such as that for eg. C-reactive protein, with reasons not including “the insurer doesn’t cover that” or “taking additional tests will remain on your insurance record”.

“More information is needed…potential value that genomic profiles can add to that of simpler tools, such as family health history.”
Genomic profiles of families will be much more informative to individuals than their comparison with epidemiological studies. For many, the personal genomic profile is the stimulus to explore family history. Discussion of personal genetic risk factors may act to release family medical history kept private because it had no perceived use.

“encourage them to enroll in formal scientific studies.”
Maybe participation in personal genomics would have this desirable effect. With no personal stake and no current health problems, it may not occur to many individuals that their participation would be welcomed by scientists and the public alike. It would be interesting to compare the general health and health awareness of eg. Framingham Heart Study participants with nonparticipants; I just haven’t had the time to read any such sociology.

“genomic services will galvanize…translational research for the rational integration of genomic information into medical training and practice.”
Here the authors get to the heart of their worries: this is a disruptive technology that has caught doctors in the middle of their effort to bring genetics to its proper place in medical education and they might get deluged with questions they aren’t yet ready to answer. For more on the change of emphasis in medical education toward genetics and the individual, see David Valle’s article referenced below.

“better off spending their money on gym membership or a personal trainer….follow a diet and exercise regimen that we know will decrease his risk of heart disease and diabetes.”
Why not both? Different people have different motivation. Socially body conscious individuals may prefer the gym, the introverted but vain may prefer the attention of the trainer. Intellectually motivated people may choose their sport, or only exercise out of curiosity to see whether they can fulfil the promise of their “elite athlete” SNPs.


Now, some of my favorite quotes from “A Science of the Individual: Implications for a Medical School Curriculum” Childs, Wiener and Valle. Ann Rev. Genomics Hum Genet 2005. 6;313-330.


“Although hypertension is an elevation in systemic blood pressure, each patient reaches that phenotype by different paths, each determined by a unique combination of genetic makeup and experiences of the environment…..Individuality is also apparent in the treatment of hypertension.”

“contradiction between the singularity of the patient and the generality of treatments and prevention.”

“a tradition of typological thinking about biology and disease that does not accommodate variation.”


Finally, highlights from the “Risky Business” Editorial by my colleague, Alan Packer
(Nat Genet. 2007. 39;12 1415).

“The need for both physicians and their patients to be better educated about complex genetics has taken on added urgency of late.”

“With the possible exception of age-related macular degeneration, how much can we say with confidence about the spectrum of risk?”
Variants with low relative risk make poor classifiers. This point was made by Alan, and by the NEJM authors. So, individual genomics may not induce anyone to take a clinical test until the list of risk variants adds up to the point where it can identify some particularly unlucky individuals. In the meantime, it will have informed thousands by providing a personal stake in one of the most exciting areas of medical research and it may recruit enthusiastic participants in a massive longitudinal study that they will have funded partially from their own pockets.

That being said, they are participants, not patients, and the experiment will be conducted on their own terms!

December 21, 2007

For what it's worth

APOE.jpg

Source: deCODEme

December 20, 2007

Adventures in personal genomicsland

SNPedia
I recall a joke that probably plenty of folks have told; I heard it
from Francis Collins, the head of NIH's Genome Project.
A previously-married woman heads to bed for the first time with her
new beau, and to his surprise, she admits to being a virgin. When he
wonders why, she says, "Well, I was married to a genome biologist, and
every night, he just sat in bed and talked about how great our sex
life would be someday."


The Genetic Genealogist

myDNAchoice - Are Your Surfing Habits the Result of Your Genome?

The Gene Sherpa
One of these companies will get sued

The Personal Genome
For example, I share parts of my Y-chromosome with my father (I didn't
ask his permission to post parts of it online it either).

Wingedpig
Googling around, I found that the APOE gene on chromosome 19 is of
particular interest, specifically APOE e2, e3 and e4. In the Genome
Explorer, I can type in APOE, and it takes me to a listing of 19 SNPs
on the APOE gene. Ok, great. But I have no idea which one(s) of those
SNPs are the ones we're talking about and what the mutations are.
Without this last bit, the Genome Explorer is basically meaningless.

December 16, 2007

Everything you need to last you two lifetimes

I'm roadtesting personal genomics services, starting with deCODEme, since many of the genotype-phenotype associations they report have been published in a reputable journal. I am the guinea pig and here are the ground rules: I will reveal everything I believe to be useful to future research. If that seems too coy, please comment and I will answer truthfully. I reserve the right to move the detailed discussion elsewhere, since space in Nature Genetics is limited, even in the blog (thanks for the space, Alan, apologies if this seems TMI).

These services offer the opportunity for real people to participate in research and to address for the very first time the question, "I have this genotype, what will happen to me?". The tests offered are not clinical tests, so insurers, employers, physicians and family, please comment as fellow research participants and don't try to make more of these information services than they purport to be. By real people, I mean individuals with their own responses and interpretations of the research as it affects them, rather than the anonymized people genetic epidemiology uses to make its predictions.

The first figure below shows my thoughts on the subject before I started to look at the results. My initial impression was that I was not going to pay attention to SNPs that on published precedent suggest my lifetime risk of any condition is less than 30%. I guessed I would research any biological hypothesis in the 30-60% range and possibly seek a clinically approved genetic test and medical advice for any genetic prediction of elevated risk over 60%. Given the predictable response of my fellow commuters to eg. seat belts, anti-lock brakes and airbags, I feared I might compensate behaviorally if I got a hint of protective alleles (eg. ADH2*2, CCL3L1). Impressed by Andrew Niccol's prescient (if insufficiently palindromic) GATTACA I assumed that there might even be SNPs that would convince Uma Thurman to have my babies.


Umagraph003.jpg

The first unexpected problem was to identify myself. Since the website is very new and I don't have the raw Illumina SNP calls or any population samples with which to examine the cluster plots myself, I can't verify the raw data. Even if I could do so, I have little but consistency with other genotyping services to ensure I am looking at my own genes. From Genographic, I know I have Cambridge mitochondria (H, 16188G, 16311C, 16519C) and a R1b1c Y chromosome and from DNAPrint Genomics's "proprietary AIMs", I know only that I am mostly of European ancestry (which luckily tallies with the origins of my great great great grandparents: 12 German, 9 English, 5 Dutch, 2 Irish, 2 Welsh, 1 Swiss, 1 Scottish). Thanks to Daniel Gudbjartsson's recent paper, I also know I am quite likely to have brown hair and brown eyes. The main problem, apart from lack of SNP data to search on my own via Greg Lennon and Michael Cariaso's wonderful SNPedia site and the underlying literature, is the problem of self recognition.

Individual taste preferences might provide a solution, so I suggested the self-recognition problem might be solved via an olfactory SNP-social-network-wine-club. Another consumer genomics company does report on "taste-related" alleles but they haven't invited me to the party..... I guess the Icelanders are still running the gas chromatograph on the 70cl of wine the last foreign visitor brought into the country in their duty-free allowance.

Using the information at my disposal, I first plotted my risk ratios against the prevalence of the conditions.

ORvspoprisk.jpg

The rheumatoid arthritis results caught my attention since the combination of major and minor contributors and mix of risk and protection alleles is pretty typical of the other common diseases for which more than one locus has been implicated. Here are the actual results:
RASNPs.jpg

I next tried plotting my risk against the population mean risk, as in my sketch above. A decade of research reveals several things that I couldn't have predicted on the back of an index card. My zone of solidarity is, of course, a cone of solidarity, since the variance of the risk increases with the risk.

riskriskplot.jpg

Seen this way, SNPs associated with larger risks of rarer conditions fall into perspective. So what can I offer Uma? It goes without saying that I wouldn't kick her out of bed for eating cookies. In the very unlikely event that I were to do so, I would draw her attention to my elevated risk of "restless leg syndrome". I would not be eating the cookies myself, because of the theoretical worry of type 2 diabetes. Observing a BMI of 27 from a safe distance, she would not be particularly convinced by my predicted genetic chance of resisting the onset of middle-age spread.

November 24, 2007

Towards a hermeneutics of quantum citation

FAQ
“I haven’t got time to read all this, what are you on about?”
Microattribution [editorial1] [editorial2] should be built into database entries, databases, genome browsers, journal articles, journals and social network face pages, probably even into your fridge.
Providing incentive for global community annotation of the human genome. Giving database accessions the same citation conventions and indices that journal articles currently enjoy, so that genotype-phenotype-frequency annotations are counted as genotype-author-phenotype-frequency annotations. Microattributions should be complemented by high level peer-refereed reviews and a convenient browser interface. Annotators should form a social network that displays their publications, microattributions, affiliations and credentials. Vendors can sustain free access by sponsoring particular sets of content (eg. loci, pathways, probesets) and via annotator endorsements. These models might be appropriate for a Variome browser, Variome reviews and a network of human genome annotators, HUGONet.

“Who are you anyway?”
We are the world, we are the children. Seriously though, there are a lot of clever, highly motivated people who weren’t around when the human genome project went down and who would like to get involved, perhaps by adopting a neglected region of the genome of local interest. There are also a lot of expert old lags hanging onto lovingly curated real pathogenic human mutation data collected in the pregenomic era that needs to be brought out to create a framework of human variation before the trace archive tsunami hits. Plenty of puzzled physicians who want to know which variants are disease-associated too.

“Are you trying to build your own database?”
No databases will be harmed in the making of the Variome project. We are building a filtering and highlighting site that credits databases and has a feed (Variome track) to Ensembl and UCSC browsers. The Variome server will database only the microattribution counter and the information required to mark variant ssIDs and annotations from NCBI as peer reviewed or not. It will be able to mark a region between genome coordinates as open, under review or published, and it will have a microattribution wiki for community comments. The MTS peer review database can be left up to commissioning journals.
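
One way to picture the region marking is the sketch below. The three status values come from the answer above; the class and function names, the inclusive coordinates, and the choice to treat unmarked positions as open are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OPEN = "open"
    UNDER_REVIEW = "under review"
    PUBLISHED = "published"


@dataclass(frozen=True)
class Region:
    chrom: str
    start: int  # inclusive genome coordinate
    end: int    # inclusive genome coordinate
    status: Status


def status_at(regions, chrom, pos):
    """Status of the first region covering a position.

    Positions outside every marked region are treated as OPEN,
    which is an assumption of this sketch.
    """
    for region in regions:
        if region.chrom == chrom and region.start <= pos <= region.end:
            return region.status
    return Status.OPEN


# Invented coordinates, for illustration only.
regions = [
    Region("chr16", 1000, 2000, Status.UNDER_REVIEW),
    Region("chr16", 5000, 6000, Status.PUBLISHED),
]
print(status_at(regions, "chr16", 1500).value)
print(status_at(regions, "chr16", 3000).value)
```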

“Won’t Ensembl just pick all the information directly from the LSDB and NCBI?”
Maybe, but even if intensive curation is enough to get the information right, is ending up on a browser track giving enough credit to the data producers to induce them to participate in the data transfer, checking and indexing process? Why not offer them microattribution and a publication as well?

“Why not just use the UCSC wiki browser?”
We would if it had a graphic track showing quantitative microattribution in published literature and a tally of wiki comments loaded onto each ssID, links to the appropriate database entries, and the ability to close commenting across a region and to mark a region as published. I think we would still need a server on which to store a local copy as the annotation committee (authors, editor and referees) worked on the regions to publish.


“But surely, all that is required is for variants to be submitted to NCBI and they will appear on the genome browser”
This has not happened. LSDBs are holding a large number of well-annotated variants (representing years of work and many grants) that have not been moved to a common indexing system. There is no pipeline from the clinical labs and we are about to be deluged with resequencing deposited in sequence trace archives with little human annotation.

“Why doesn’t NCBI just provide citation statistics for every ssID, author handle and annotation?”
They should. The microattribution concept means that every database entry should not only provide the links to, but should provide a current count of its forward citations in papers and database entries. If NCBI did this, we would still need the high level reviews to ensure locus annotation quality and give high level priority to the annotator community.

“Why not just provide browser feeds from the LSDBs?”
LSDB server capacity and maintenance are variable; the databases are in a variety of formats and the variants are not indexed to the genome; duplication of NCBI on a small scale is inefficient; and LSDB curators lack the resources to convert their data.

“Why is it not sufficient to use existing wiki (SNPedia) or wiki browser (UCSC)?”
Wiki comments, even from an approved annotator community, will vary in credibility and detail and even if in theory they are scrutinized by readers, it is not certain that they will be corrected unless there is a high level incentive (the Variome Review) to encourage correction.

“Why do we need a wiki at all, since anyone can submit variant annotations to NCBI curators?”
The activation barrier to wiki commenting is very low and because the results can be seen immediately, alert students, as well as experienced researchers, can readily make a big difference.
Curators need to sleep, whereas the wiki is automated and can take comments from all time zones.

“Why peer review? Surely the data producers together with NCBI curators can provide authoritative variant reports indexed to the genome?”
This works for the variant report alone, but doesn’t provide consistent quality control or scrutiny across the locus. David Ravine’s review of PKD1 variants (Nature Genetics April 2007 -which is the model for a Variome Review journal article) revealed that 5% of the published variants were wrong. Community scrutiny under a uniform, editorially controlled process not only provides high level reviews to bring credit to the data producers, but highlights which datasets can be used with confidence.

“Is there a better way to view genotype-author-phenotype compound statements and their associated accession numbers than a microattribution wiki browser or a database table?”
Undoubtedly. If you have built one, we will use it.

“I have a grant to build/have built/have thought of and am going to build/a better solution and/or I am a hugely influential funding body and we have a plan that doesn’t include you so you are wasting your time.”
Fine, we are providing only what is missing: respect and credit for database builders, curators, authors, data producers, cogs in big teams, funding bodies, research participants (with their permission) and yes, eventually even journals and editors. Maybe you’d like to be appreciated too?

“Microattribution is not even a new idea, it is obvious.”
Yes, isn’t it.

“Isn’t Gogol going to index teh everything anyway?”
Maybe, but the ssID is a pretty cool way to index data (thanks, Donna!) and wouldn’t you rather trust your peers to evaluate genome variants in the first instance, so that the plex will develop their genome tools in ways we will be able to use?

“Won’t your filter become redundant if all NCBI entries are correct and journals provide quantitative attribution via a whizzy interface?”
Yes, I look forward to this day. If this process is catalytic and every database and journal provides microattribution credit, the Variome Browser filter will have served its purpose.

“Why not organize the HVP along the lines of the DECIPHER network?”
The problems are fundamentally different. DECIPHER exists to catalog clinical importance of structural genome variation, starting from scratch. For mendelian mutations, existing annotator communities have done much of the work, but now need credit for reformatting and indexing their variation collections to the genome and scrutinizing the annotations, locus by locus.

“Surely you are duplicating the work of PharmGKB, HGMD (Cardiff) and OMIM ?”
These are databases without the intention to create systematic review and journal articles.
None provides quantitative citation credit, although they reference the original sources of the variants.

“Aren’t you just duplicating the effort of HUGENET ?”
HUGENET provides disease-centered meta-analysis of genetic epidemiology, whereas Variome integrates findings of rare variants in rare mendelian diseases, rare variants in common diseases and common variants in common diseases.

“How will you distinguish peer reviewed from entries that were curated but not reviewed?”
We make no distinction at the browser level between curated entries and those added by wiki, only between reviewed and non-reviewed information. The source will be evident upon clicking the link, since these lead to NCBI and to the locally databased wiki respectively.
NCBI entries and wiki comments will be visible on the Variome browser in gray; the information that survived collaborative annotation and peer review will be in black.

“Surely the peer review process will entail rewriting and correcting NCBI entries, rather than merely filtering out those that are wrong?”
Irretrievably wrong or unattributable entries will need to be excluded. If the original data producers are unwilling to reannotate them correctly, new entries will be made. If the original data are largely correct, an annotation (corrigendum) will be appended to the existing NCBI entry.

“If competing journals have commissioned reviews on different regions of the genome, how can we distinguish one annotation group from another (some authors will overlap)?”
This happens at the moment with papers. Editors, referees and authors behave remarkably ethically. Authors can be added at the editor and senior author’s discretion. With microattribution, if an author on a collaborative annotation makes fewer than one annotation between the locus genome coordinates, their author status may be questioned.
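
The authorship check described here could be sketched as follows, assuming annotations are available as (author, genome position) pairs and that the locus coordinates are inclusive. The function name and sample data are hypothetical.

```python
def authors_without_annotations(authors, annotations, start, end):
    """List the authors with no annotation inside [start, end].

    `annotations` is an iterable of (author, position) pairs.
    """
    active = {author for author, pos in annotations if start <= pos <= end}
    return [author for author in authors if author not in active]


# Invented data, for illustration only.
authors = ["author:1", "author:2", "author:3"]
annotations = [("author:1", 1200), ("author:2", 9000), ("author:3", 1999)]
print(authors_without_annotations(authors, annotations, 1000, 2000))
```

The result is only a prompt for the editor and senior author, who keep the discretion described above.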

“Won’t this confuse journal impact factors? What microcitation measure is the right one?”
Professional bibliometrists, get to work! Since when was more information a bad thing? The existence of the $20 bill does not do away with a need for $1 bills.

“How will authors know if they are citing the right entity?”
This problem exists already in journal articles. Authors will cite a GEO GPL platform accession number rather than the appropriate experiment accession (GSE number). As journals become more database-like, the database part of the article will reference the correct compound (ssID, author, phenotype, frequency) to support the assertion made (this variant is always associated with this disease).

“Are you planning to provide a disease-centered interface, like AlzGene [link2]?”
These objectives are on the HVP wish list for Informatics but outside the remit of Publication and Credit. The physician community will probably want to design a pipeline from the clinical sequencing labs to NCBI and a disease-oriented browser interface that starts with the variants for which clinical tests are available, then moves on to a list of variants sorted by whether they are pathogenic or not. These projects will be enabled by, but are outside the immediate scope of, the microcitation proposal.

“Will other journals provide microcitation statistics?”
Hurry! Nature Genetics already does and we are now looking for better ways to display and use the information (data producer index, annotator index, collaboration index, mentorship index).

“What else could HUGONet members put on their face page?”
If you were the referee who provided the decisive experiment for a highly cited Nature paper, you might want to reveal that fact (after publication) and post your review on your face page, even if you didn’t want your comments publicly on the Nature web site. If you were sparsely linked within the network, but had a postdoc position you needed to fill, why not pay HUGO to link you up for a month to the highest priority part of the network (or to everyone within the network)? HUGO needs money and you need a postdoc.

“What about genes where tests for variants have been patented?”
Companies holding a lot of patents on particular genome regions (e.g. BRCA1) might be motivated to adopt that gene within the Variome system, providing information, advertising and funding. I see no reason why their annotations should not be subjected to peer review in the usual way, since journals are happy to publish corporate research so long as methods are transparent and materials are available.

“What kinds of annotation will be counted?”
Ideally all, but the intention was to highlight all forms of annotation that require human effort, so the citation counter highlights papers and reviews separately from annotations with phenotypic information and annotations without phenotypic information. The compound (ssID, author, phenotype, frequency) has more value than the compound (ssID, author, vendor, probe) or the fundamental microcitation particle (ssID, author) even if it is cited less often.
For example, the SNP Consortium will be recognized for its discovery of a SNP every time a user cites an ssID or its rsID synonym. Affymetrix will get a microcitation for every paper that lists a probe used in high throughput genotyping. A Mendelian geneticist will be interested in a single clinical report (dbGaP entry) coupled to a sequence report (ssID) and frequency (NCBI annotation ID).
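To make the counting concrete, here is a minimal sketch of how such compounds might be tallied per author. It is not a description of any existing NCBI or journal system: the record layout, the field names, the accession placeholders and the two category labels are all assumptions made for illustration.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Microcitation(NamedTuple):
    """One cited accession. Field names are illustrative, not a real schema."""
    ss_id: str
    author: str
    phenotype: Optional[str] = None
    frequency: Optional[str] = None
    vendor: Optional[str] = None
    probe: Optional[str] = None


def annotation_kind(citation: Microcitation) -> str:
    """Split annotations by whether they carry phenotypic information."""
    if citation.phenotype is not None:
        return "with phenotype"
    return "without phenotype"


def count_microcitations(citations):
    """Tally citations per (author, annotation kind)."""
    counts = Counter()
    for c in citations:
        counts[(c.author, annotation_kind(c))] += 1
    return counts


# Hypothetical records; the accession numbers are placeholders.
citations = [
    Microcitation("ss0000001", "SNP Consortium"),
    Microcitation("ss0000001", "Affymetrix", vendor="Affymetrix", probe="probe-1"),
    Microcitation("ss0000001", "Clinical lab A", phenotype="disease X", frequency="0.01"),
    Microcitation("ss0000002", "SNP Consortium"),
]

for (author, kind), n in sorted(count_microcitations(citations).items()):
    print(f"{author}: {n} annotation(s) {kind}")
```

Keeping the phenotype-bearing compounds in their own tally is what lets a rarely cited but richly annotated entry stand out from the bare (ssID, author) particle.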

“What happens when the journals are no longer interested in publishing reviews?”
Some loci receive more interest than others. Annotations of genes with many mutations and phenotypes may constitute publications on their own, while other genes may be annotated together as mutations affecting a shared pathway or process. Annotation of neglected regions will accrue progressively more attention via the microattribution process, and those regions will be adopted by interested groups for publication.
In the spirit of working with all interested parties, it is not anticipated that Variome will launch a journal to house the reviews, but this is an option to discuss.

September 25, 2007

Tiny bubbles

With at least the provisional success of genome-wide association studies to identify common disease-related variants now apparent, and highly significant P values floating up out of figures looking like nothing so much as a series of champagne flutes full to the brim, it’s interesting to take a look back at a paper that arguably provided the key impetus for the field. In their 1996 Science paper “The Future of Genetic Studies of Complex Human Diseases” (cited 2,231 times as of this writing), Neil Risch and Kathleen Merikangas asked:

“Has the genetic study of complex disorders reached its limits?”

Remarkably, this was the key question only 11 years ago. Of course, Risch & Merikangas were referring to the specific question of whether linkage studies would be adequate to detect variants of modest effect in the realm of complex disease. In his address upon receiving the Curt Stern award at the 2004 meeting of the American Society of Human Genetics, Risch described how the article came about:

“One colleague I worked with extensively, both in teaching and research, was Kathleen Merikangas, a psychiatric epidemiologist with interests in genetics. We spoke frequently about the state of the field of genetic epidemiology and where it was and should be going. We continued these discussions even after my move to Stanford in 1995. We began to develop an awareness that the linkage approach, although having some modest success in complex diseases, was unlikely to identify the large majority of genes. We were influenced by a news item that appeared in Science on July 14, 1995, entitled “Epidemiology Faces Its Limits”….Although the article did not discuss human genetics or genetic epidemiology, we realized that many of the comments could apply to the developing situation in human genetics as well”.

Interestingly, the author of that Science news article, Gary Taubes, is back in The New York Times magazine, with another piece pouring cold water on the field of epidemiology.

In any case, after penning a draft entitled “Human Genetics Facing Its Limits”, they realized that a more optimistic slant would be required, preferably one that offered an alternative approach. Again, from the Stern address:

“If we could have any tool to use for mapping disease genes, we wondered what would it be? Again, on the basis of my experience with HLA-associated diseases and my knowledge about disease associations with other blood-group systems, I knew that many of these associations, although highly significant statistically, would not produce substantial or robust linkage signals. Therefore, why not reverse the process of positional cloning? Instead of searching randomly through the genome by location, why not start with genetic variants and test them directly as candidates? The problem with candidate-gene association studies had been the limited number of candidates and, therefore, the low prior probability of a ‘hit’. But what if we could compile a list of all polymorphisms in the human genome?”

You know the rest. What’s remarkable about the Risch & Merikangas paper, beyond the power calculations showing the relative gain in power for association studies as opposed to linkage, was the authors’ prescience in outlining the key issues. They noted the stringent genome-wide significance level that would be required for testing on the order of 1 million variants, while also pointing out the likelihood that linkage disequilibrium would allow this number to be reduced substantially. They also implored investigators to preserve all of their samples for future large-scale testing, and it could be argued that the collection of samples is now the rate-limiting step in association studies. Concluding, they wrote:

“Thus, the primary limitation of genome-wide association tests is not a statistical one but a technological one. A large number of genes (up to 100,000) and polymorphisms…must first be identified, and an extremely large number of polymorphisms will need to be tested”.

And finally:

“The human genome project can have more than one reward. In addition to sequencing the entire human genome, it can lead to identification of polymorphisms for all the genes in the human genome and the diseases to which they contribute”.

This is a reminder to those of us who, in the wake of so many robust associations, assumed this had been seen as the reward from the very beginning. It wasn’t always obvious, it seems.


September 04, 2007

Cover puzzle

Anthony Edwards has produced an elegant representation of the genetic code in all its degenerate complexity for this month's cover, explained in Touching Base. Now you have a chance to use his device to solve a puzzle. Please post your solutions to the Nature Precedings website, or send them to me and I'll add them to this blog.

Problem: Find a ‘Gray code’ order for the codons.
When the numbers 1 to 2^n are written in binary form in their natural order the number of digits, 0 or 1, that change on proceeding from one number to the next varies. There exist, however, orderings in which only a single digit changes each time and the last number only differs from the first in respect of a single digit as well. These are known as Gray codes, the numbers forming a complete cycle.
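For readers who have not met Gray codes before, a minimal sketch of the best-known one, the binary reflected Gray code, may help. It assumes the 2^n values are taken as 0 to 2^n - 1, and the function names are mine; many other Gray codes exist.

```python
def binary_gray_code(n):
    """Return the binary reflected Gray code on n bits as a list of integers."""
    return [i ^ (i >> 1) for i in range(2 ** n)]


def bits_changed(a, b):
    """Number of binary digits in which a and b differ."""
    return bin(a ^ b).count("1")


codes = binary_gray_code(3)
for value in codes:
    print(format(value, "03b"))

# Every step, including the wrap-around from last to first, flips one digit.
steps = zip(codes, codes[1:] + codes[:1])
print(all(bits_changed(a, b) == 1 for a, b in steps))
```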
The same principle can be applied to the codon triplets, ordering all 64 in a cycle such that each differs from its predecessor in exactly one position. There are many such orders, each forming a Gray code. They differ in the extent to which they group together triplets that code for the same amino-acid.
One measure of success in forming such groups would be the number of times in the cycle that neighbouring triplets code for different amino-acids (or a stop signal). Since there are 21 of these the absolute minimum of such changes between neighbours is simply 21, but this may not be attainable.
It is easy to find a Gray code ordering with 25 changes by threading a regular route through the standard table of the genetic code. But can you find one with fewer? There’s a route through the Edwards–Venn diagram given in Figure 3 of ‘Picturing the genetic code’ (Nature Precedings doi:10.1038/npre.2007.682.1) with only 23 changes of amino-acid.
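If you would like to score your own candidate orderings, here is a minimal sketch of one way to do it. It builds a starting cycle with a reflected base-4 Gray code over the bases in the order U, C, A, G, then counts amino-acid changes around the cycle using the standard genetic code. This starting cycle is only one of the many possible orders, not Edwards' route and not necessarily the 25-change route mentioned above, and the function names are my own.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons in UCAG order with the third base varying
# fastest; '*' marks a stop signal.
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reflected_gray(length, symbols=BASES):
    """Cyclic ordering of all strings in which neighbours differ in one position."""
    if length == 0:
        return [""]
    shorter = reflected_gray(length - 1, symbols)
    order = []
    for i, s in enumerate(symbols):
        # Reverse every second block so the tail is unchanged across block joins.
        block = shorter if i % 2 == 0 else shorter[::-1]
        order.extend(s + rest for rest in block)
    return order


def is_gray_cycle(cycle):
    """Check that each codon differs from its predecessor in exactly one position."""
    pairs = zip(cycle, cycle[1:] + cycle[:1])
    return all(sum(x != y for x, y in zip(a, b)) == 1 for a, b in pairs)


def amino_acid_changes(cycle):
    """Count neighbouring pairs in the cycle that code for different amino-acids."""
    pairs = zip(cycle, cycle[1:] + cycle[:1])
    return sum(CODE[a] != CODE[b] for a, b in pairs)


cycle = reflected_gray(3)
print(is_gray_cycle(cycle), amino_acid_changes(cycle))
```

To test a solution of your own, pass your list of 64 codons to is_gray_cycle and amino_acid_changes in place of the generated cycle.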

P.S. Why ARE there two groups of serine codons, anyway?

June 07, 2007

Trust but verify

The Wellcome Trust Case Control Consortium presents associations to seven common diseases and I can't help asking, "so how have we done so far?" Judging by the reference list of the recent WTCCC paper in Nature, the results of 10 of the 13 Nature Genetics papers listed were replicated associations or reported associations replicated by this study.

In the data of the paper itself, evidence is presented that replicates the associations published in (at a quick but incomplete count) 6/17 past Nature Genetics papers, validating or revalidating 14/28 of the loci we have published. The reasons for failure are a rather uninstructive mixture. There are unimpressive p values for SNPs in common between studies that might indicate the initial report was a false positive, or signal a real population difference. There are many SNPs not used in both studies, a situation calling for comparison using proxy SNPs. Some previously reported loci are not considered in the WTCCC paper, and the status of these will have to be dug from the raw association data by dedicated meta-analysts.

Despite the awesome scale of the WTCCC scan, it is considered by many to be a hypothesis generating exercise and an accompanying Feature from Stephen Chanock, Teri Manolio and many colleagues warns of the necessity to replicate association results.

Replication is certainly powerful in the wake of such a big screen. The primary scan of nearly 500,000 SNPs across 17,000 people for seven diseases has so far resulted in evidence for 24 newly or multiply replicated common variant loci for three diseases that have been verified in just three replication papers.
