
The way we present genomic and proteomic data on the web sucks


IMHO the way we present genomic and proteomic data on the web sucks.

Part of this is down to the fact that our basic visualizations are a bit lame. Not the one-offs you see rendered for the front of magazine covers or science documentaries on TV, the everyday ones you get in journal articles and the big genome browsers. The ones scientists actually use. Pages like this. Seriously, how much of that page is actually useful to any one researcher? Does anybody wonder why people need training courses to use genome browsers? Does it really make full use of the web as an interactive medium?

This isn't a dig at Ensembl, incidentally. Ensembl is an excellent, progressive resource and their last application note hinted at cool things to come. It's more of a general complaint: why are we stuck representing genomic regions with flat images of lines and boxes (with the occasional flag or lollipop)? Can't anybody think of a better metaphor? Try zooming out on the Ensembl page above. The features that make sense as individual elements when you're looking at a single gene don't scale as lines and boxes and the page becomes a mess.

Speaking of zooming out, why is everything static? Ensembl have already brought in dynamic image loading to some extent and the GBrowse prototype from Ian Holmes' lab shows a lot of promise, but we're still a long way away from any Google Maps style scrollable, zoomable genomes. It should be painless to jump a megabase to the right or left of wherever you're currently looking in a genome browser, but that's not currently the case.
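To make the Google Maps comparison concrete, here is a minimal sketch of how a tiled genome view might work out which image tiles cover a region, and how a one-megabase jump becomes a simple coordinate shift. The tile width, the zoom scheme (each level doubles the bases per tile) and all function names are assumptions for illustration; this is not how Ensembl or GBrowse are implemented.

```python
TILE_WIDTH_PX = 256          # assumed tile width in pixels
BASES_PER_PX_AT_ZOOM0 = 1    # assumed: zoom level 0 shows one base per pixel


def bases_per_tile(zoom):
    """Each zoom-out step doubles the number of bases a tile covers."""
    return TILE_WIDTH_PX * BASES_PER_PX_AT_ZOOM0 * (2 ** zoom)


def tiles_for_region(start, end, zoom):
    """Return indices of the tiles covering the 0-based, half-open region [start, end)."""
    if end <= start:
        return []
    span = bases_per_tile(zoom)
    first = start // span
    last = (end - 1) // span
    return list(range(first, last + 1))


def pan(start, end, offset):
    """Shift the viewport by `offset` bases, clamping at the start of the chromosome."""
    width = end - start
    new_start = max(0, start + offset)
    return new_start, new_start + width


# Jump one megabase to the right of the current 100 kb view.
view = pan(1_000_000, 1_100_000, 1_000_000)
print(view, tiles_for_region(view[0], view[1], zoom=10))
```

The point of the tile scheme is that the client only ever requests the handful of tiles under the viewport, so panning costs the same no matter how far you jump.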

When was the last shift in the way that we represented sequence alignments? NCBI's BLAST just relaunched with a spiffy new interface, but the results still look like they'd be at home in a web browser circa 1990, which coincidentally was when sequence logos were first introduced.

Progress in visualizing biological networks is just as poor: is there ever a good reason to produce static images of large scale protein-protein interaction networks (where the text is too small, everything is a mess of connecting lines, attempts at grouping proteins together are - by necessity - one dimensional, etc. etc.)? Some authors think so. Network visualization software is already out there, somebody just needs to spend some time adapting it.

Ben Fry at MIT's Media Lab - now well known for Processing - produced some great stuff while a grad student, but that was back in 2004. Does anybody know of any other groups or individuals who are currently exploring better ways of looking at life science data?

Perhaps issues of aesthetics and design are a low priority for the scientific community online and it's more important to just get the data out there for people to use? As time goes on and we accumulate more and more data - while stuck with systems not designed to handle it - that could prove to be a problem.


Comments

ftp://ftp.ncbi.nih.gov/toolbox/gbench/

"Perhaps issues of aesthetics and design are a low priority for the scientific community online and it's more important to just get the data out there for people to use?"
90% of life science researchers use Word for preparing their manuscripts, and the only LaTeX they know is the polymer. Most are just coming around to the idea that the web is a good place to find information, so they're a good bit away from wanting well-structured data for their bio-mashup.

I completely agree with you, and I think tools such as the Simile project's Timeline are great, but you've got lots more work to do to convince the average researcher that this is even a good idea, much less something he should get involved in.

I agree that genome browsers leave a lot to be desired. I've always found Ensembl (and especially UCSC) to be way too busy and cluttered. For me, investing some effort paid off with Ensembl but I've never made it past a UCSC start page.

Gbrowse is clearer but awfully slow and suffers from the static problem that you highlight. One of the clearer displays is the JGI Integrated Microbial Genome resource: example, but again static.

As for standalone tools - gbench looks promising but is a bugger to compile from source. Artemis from the Sanger centre is pretty good, apart from being a RAM-guzzling Java app.

I guess what you're saying is - where is the Flickr of genome viewers? And the answer is not that the technology doesn't exist - a good AJAX programmer could hack up a genome browser tomorrow. It's that tools to store and visualise data are rarely developed in academia due to lack of funding, lack of interest and the perception that results (i.e. papers) are all that matters and anything else (useful tools) is a waste of time.

This one uses a Java applet to display genotyping data:

http://benfry.com/isometricblocks/

Pierre

Hello,

This is an interesting topic that was raised.

I agree with a previous post that there is a general lack of interest in investing effort and resources in tool development in the bioinformatics field.

There are some good display tools. Cytoscape (http://www.cytoscape.org) is a powerful network display tool and has gained widespread use in the research community.

Our (Columbia University - C2B2) open-source platform geworkbench has many visualization tools for data as diverse as sequence, microarray, SNP and genotype data:

http://wiki.c2b2.columbia.edu/workbench/index.php/Main_Page

best wishes,
Manjunath

Do you know Sockeye?
Have a look!
http://www.bcgsc.ca/project/bomge/sockeye/screenshots

IMHO Ensembl is just using dynamic loading because it's so damn slow.

Having written visualization tools for genomes (Genomorama, http://snp.lanl.gov/genomorama), protein structures (Qmol, http://www.dnastar.com/qmol/) and 2D data (PrestoPlot, http://server.ccl.net/cca/software/MS-WIN95-NT/PrestoPlot/index.shtml), I can attest that the presentation of genomic data has been, by far, the most challenging. The presentation of features at multiple scales (i.e. 10^5 bases for a mammalian gene, 10^4 bases for a pathogenicity island, 10^3 bases for a bacterial gene, 10^2 bases for a tRNA, 10^1 bases for a transcription factor binding site and 10^0 for a single nucleotide polymorphism) requires informative, legible and accurate semantic zooming. It seems to me that using a web browser for display limits responsiveness. Instead, we need a client-server platform where the dedicated client application (like Google Earth or World of Warcraft) seamlessly streams data from the server.
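As a rough illustration of the semantic zooming described above, the sketch below decides which feature types are worth drawing individually at a given zoom level. The feature sizes are the orders of magnitude listed in the paragraph; the dictionary, the two-pixel threshold, the default 1000-pixel view and the function name are assumptions made for this example, not taken from Genomorama or any other tool.

```python
# Hypothetical lookup: feature type -> approximate size in bases,
# using the orders of magnitude quoted in the text.
FEATURE_SCALE = {
    "mammalian_gene": 10**5,
    "pathogenicity_island": 10**4,
    "bacterial_gene": 10**3,
    "tRNA": 10**2,
    "tf_binding_site": 10**1,
    "snp": 10**0,
}

MIN_PIXELS = 2  # assumed: anything narrower is not drawn as its own glyph


def visible_feature_types(view_width_bases, view_width_px=1000):
    """Return feature types that would be at least MIN_PIXELS wide in this view."""
    bases_per_px = view_width_bases / view_width_px
    return [
        name
        for name, size in FEATURE_SCALE.items()
        if size / bases_per_px >= MIN_PIXELS
    ]


# A one-megabase view only has room for the largest features.
print(visible_feature_types(10**6))
```

A real browser would also need a rule for what replaces the hidden features (a density track, a count, a colour gradient), which is where most of the design effort goes.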

Regards,

Jason

Hi,

Also take a look at

http://www.bioviz.org.

This is a project my lab works on.

There's a link to an interactive genome viewer ("PlantIGB") that currently shows data for Arabidopsis. It's basically a customized form of the Integrated Genome Browser, open source software from http://genoviz.sourceforge.net.

There are also some links to some other programs that IMHO are very good and also fun to use, including JalView (for alignment editing) and TableView (for data-mining).

There is definitely some good visualization software "out there," but it's often not very well-known, and is usually supported as a labor of love by some dedicated individuals.

My two cents: it's time NIH and NSF started supporting a more diverse range of projects. Google maps for bioinformatics should be just the beginning. We need greater diversity in approaches and personnel, and a healthier, more open-minded community overall. To get that, NIH and NSF have to invest.

Best,

Ann Loraine

I've been doing some Flash stuff for presenting overviews of genome data as a way to get a big picture and then zoom in and link out to other higher resolution viewers (genome browsers currently). The goal was that it could be more interactive than the typical web-based tools we have.

You can see the Flash GViewer here:

http://www.gmod.org/flashgviewer

the docs, etc. and a demo version are here:
http://blog.gmod.org/nondrupal/FlashGViewer_forWeb/

Examples of implementation here:
http://gmed.bu.edu/index.php
http://rgd.mcw.edu/dportal/cardiovascular/

Circos is a very cool way of showing genome data. It's static though; a live version would be really neat.

http://mkweb.bcgsc.ca/circos/

One of the challenges with more interactive displays, particularly web-based ones, is the amount of data you might want to display. The GViewer copes with hundreds of features, but currently gets ugly with thousands. Effective ways to show portions of the data or stream it in would be very handy.
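One common way to show "portions of the data" when there are too many features is to fall back to a density histogram once the count in view passes some threshold. The sketch below is only an illustration of that idea: the function names, the threshold of 500 and the bin count are assumptions, and it is not how the Flash GViewer works.

```python
def density_histogram(feature_starts, region_start, region_end, n_bins):
    """Count features per bin over the half-open region [region_start, region_end)."""
    if n_bins <= 0 or region_end <= region_start:
        raise ValueError("need a non-empty region and at least one bin")
    counts = [0] * n_bins
    width = region_end - region_start
    for pos in feature_starts:
        if region_start <= pos < region_end:
            # Integer arithmetic keeps the bin index within 0..n_bins-1.
            counts[(pos - region_start) * n_bins // width] += 1
    return counts


def summarise(feature_starts, region_start, region_end, max_drawn=500, n_bins=100):
    """Return individual positions if few enough, otherwise a per-bin count."""
    in_view = [p for p in feature_starts if region_start <= p < region_end]
    if len(in_view) <= max_drawn:
        return ("features", sorted(in_view))
    return ("histogram", density_histogram(in_view, region_start, region_end, n_bins))


kind, payload = summarise(range(0, 100_000, 10), 0, 100_000)
print(kind, len(payload))
```

Because the server can compute the histogram, the client receives a fixed number of values per view rather than thousands of features.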

I love the ideas of Warcraft, etc. and game engines in general - they have spectacular graphics stuff built in and they're built for 3D, but they don't have the connectivity to the outside world (hyperlinks to our favourite databases, some sort of API to stream other types of data to layer onto the visualization, for example).

SecondLife is another potential place for some interesting visualization and collaboration tools for this data.

http://secondlife.com/

Check out The Gene Pool as one of the first places I've seen genomic information and SL mixed together. Here it's used as an education tool rather than as an active visualization tool, but it has possibilities.

The Gene Pool, SL location: Immaculate 212, 209, 22

SL does have the ability to link to outside resources but quantity of data is going to be the problem here too.

Simon.

If you feel this way, and genuinely believe that these tools will enhance research, you must let NIH/NSF know about it. I know from harsh experience that when preparing a large grant, software development is always underfunded, and the first area cut. Ultimately one is asked to perform 3 FTEs' worth of effort for a 25% position. Then, in review panels our peers (fairly) ask, should we line out research dollars for our research area and give them to someone we don't know to do software development?

The questions are fair, but until the community commits to funding better visualization tools by answering yes, the tools you ask for will not exist. As you noted, the technology exists, and people willing to do the development exist. Here is a free protein viewer
http://sirius.sdsc.edu/
we have struggled to support. Similar tools for genomes can be built with 300K of funding and input from users on what they need to see.

At Ensembl we really try - each release and each development cycle, user interface issues on the web site get discussed - sometimes we can do something about it, sometimes not. Just the engineering to make sure one has all the data straight and at least displayed is quite a challenge.

But, as the original article says, we can surely do better. About 6 months ago we introduced a drag-and-zoom functionality on contigview, and this, coupled with the AJAX for each window should give progressively "smoother" viewing experience. We have more histogram and colour gradient tracks. Yet more things are in the works...

...but don't forget that biology is inherently complex. No matter how one draws a diTag result, one has to know what it means to interpret it. Good visualisation helps (a lot), but so does integration of the data types into appropriate summaries (the promoter for this transcript is xxx to yyy, for example).

Finally, as one poster noted, speed is an issue. We've worked hard on this for a couple of years and you should have noticed a difference. Interestingly in many areas we are encountering misconfigured network issues on the internet that make Ensembl look slow - as people see the speed to other sites (eg, BBC) fast, they tend to blame Ensembl without investigating whether there is another network issue between them and Ensembl. We are putting in place more monitoring to understand this, but if you think Ensembl is slow for you, let us know and we will investigate!

Ewan

Visualization is only a part of the problem. The real issue is how to enable experimental biologists to perform genome-wide analyses on the data. You can visualize in any way you want, but if you want to find all SNP-containing exons and then extract corresponding alignments from five mammals and build phylogenetic trees - you hit the wall. A solution is to integrate tools with the data, so that biologists can take alignments, overlap them with exons (or any other feature) and then build trees (or do other analyses) without worrying that PAML will not take MAF files. At Penn State we try to do this with Galaxy,

http://main.g2.bx.psu.edu

which is rapidly catching on with the community.
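The first step of the analysis described in this comment, finding all SNP-containing exons, is an interval overlap query. Here is a minimal sketch of one way to do it, assuming exons are 0-based half-open (start, end) pairs on a single chromosome and SNPs are single positions in the same coordinate system. The function name and data layout are made up for this example, and it says nothing about how Galaxy implements the operation.

```python
from bisect import bisect_left


def exons_with_snps(exons, snp_positions):
    """Return the exons (start, end), half-open, that contain at least one SNP.

    Both inputs are assumed to be on the same chromosome and coordinate system.
    """
    snps = sorted(snp_positions)
    hits = []
    for start, end in exons:
        # Index of the first SNP at or after the exon start.
        i = bisect_left(snps, start)
        if i < len(snps) and snps[i] < end:
            hits.append((start, end))
    return hits


exons = [(100, 200), (300, 400), (500, 600)]
snps = [150, 450, 599]
print(exons_with_snps(exons, snps))
```

Sorting the SNPs once and using a binary search per exon keeps the query fast even for genome-wide SNP sets; the harder part, as the comment says, is getting the output into a form the next tool accepts.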

Would a google maps style genome browser actually help people do better genomics? I've been hearing a lot of people pushing for more ajax, more prettiness, more fluff in genome visualization, which seems like the wrong focus. Ben Fry's stuff is amazingly cool as *art*, but does it convey information any better? If we can develop a new visualization metaphor for genomics that really helps then great, but the focus really needs to be on usability. I think Ewan is right though that integrating data into visual summaries is promising. High throughput experiments are overwhelming when painted individually onto a genome, but with the right integrated visual presentation they could be much more informative. Rather than changing the 'linear' browser metaphor (which people are very comfortable with now), or adding more whiz and bang, we should focus on maximizing the information density in browser displays and look for ways to allow the browser to draw attention to interesting correlations and such.

-- jt

It's funny, at UCSC we are in the midst of a user survey on how to improve the site (http://www.surveymonkey.com/s.asp?u=881163743177). We listed 20 new possible features and asked the users to rank them. The feature "update the appearance of the browser website" currently is in 4th to last place. (The bottom three are doing updates of the yeast, fly, and worm browsers.) Of the people who take the time to write in comments we get about an even number of people asking for Ajax, and people saying please not to change the site's basic appearance. We do have nice Ajax support in our VisiGene tool for viewing in-situ images. Ironically the feature "add more images to VisiGene" is coming in 5th to last. The number of tracks we've got on the browser is getting large enough to be intimidating to the new users. However "establishing an editorial board to more tightly control what tracks are displayed" came in, you guessed it, 6th to last. What people seem to want is more data and more inclusive gene sets. Of course you do have to take such an online survey with a grain of salt, since it tells us a great deal about the people who use our site, and next to nothing about people who take one look at it and flee.

In some ways we're still playing catch up with what was perhaps the first genome browser, ACE. It was implemented under X windows. Anyone with a good Unix box with graphics, and a fast internet connection to the server (which these days includes almost everyone using a Mac) could get a very smooth display without the cycle that the basic CGI models Ensembl, NCBI, UCSC, WormBase, FlyBase, and SGD currently use. ACE is still very popular at the sequencing centers, but for the general public X windows support just was not widespread enough, so everything shifted to http based protocols.

There has been quite a bit of experimentation with Java-based views, such as the Integrated Genome Browser (http://www.affymetrix.com/support/developer/tools/download_igb.affx) and Apollo (http://www.fruitfly.org/annot/apollo/). These move the data to the Java client, and you can zoom and scroll very quickly, and select using the usual desktop program methods. Unfortunately getting the data to the client in the first place takes some time. For moderate sized data sets like ESTs the initial wait to download the data to get going can take minutes. For large sets like tiling microarrays it may take hours. For the most typical use, just looking at the data in the neighborhood of a gene or two, the approach in most common use is actually much faster.

The AJAX methods do theoretically make it possible to get near-desktop-program levels of interactivity and avoid a big up-front download. They are, at least currently, significantly more labor intensive to develop though, and also require extensive testing on each and every web browser. Platforms such as the Google Web Toolkit (GWT) are addressing this issue. We have some students experimenting with GWT now, and as it and similar tools mature and our experience with them grows hopefully we'll be able to take more advantage of AJAX without having to devote 75% of our staff to it. Meanwhile we'll probably be focusing on adding new genomes, adding new high throughput datasets, improving gene sets, and a new display that I hope will be popular, that essentially just abbreviates the introns.
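For readers wondering what a display that "abbreviates the introns" might involve, the sketch below maps exon coordinates onto a compressed display axis in which every intron is drawn at a small fixed width. This is a guess at the general idea only: the fixed width of 50, the function name and the coordinate conventions are assumptions, not a description of the UCSC implementation.

```python
INTRON_DISPLAY_WIDTH = 50  # assumed: each intron is drawn at most this wide


def compress_introns(exons, intron_width=INTRON_DISPLAY_WIDTH):
    """Map sorted, non-overlapping exons (start, end), half-open, to display coordinates.

    Exons keep their true length; each gap between consecutive exons is
    shrunk to at most `intron_width` display units.
    """
    display = []
    cursor = 0
    prev_end = None
    for start, end in exons:
        if prev_end is not None:
            cursor += min(intron_width, start - prev_end)
        display.append((cursor, cursor + (end - start)))
        cursor += end - start
        prev_end = end
    return display


# A 4.3 kb gene collapses to 550 display units.
print(compress_introns([(1000, 1200), (5000, 5100), (5120, 5300)]))
```

The trade-off is that distances on screen no longer reflect genomic distance, so a display like this needs some visual cue that the introns have been shortened.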


VisualComplexity has a nice gallery of biological network visualizations: http://www.visualcomplexity.com/vc/index.cfm?domain=Biology

Ewan is saying that Ensembl is fast. Hm. Well, type any gene symbol into Ensembl and UCSC at the same time and measure the time it takes to get to the genome browser: UCSC 10 seconds in total, Ensembl...difficult to compare, as there are several clicks necessary.
OK, so "Zoom out": UCSC: almost instantly, Ensembl...4-5 seconds. When you're doing this several hundred times a day, 4-5 seconds matter. I'm not starting with multicontigview here, as probably no one is using it anyways.

I also answered "ajax" when filling out the UCSC questionnaire. Jim points out that Ajax takes a lot of time to implement. Maybe, but what if you don't need "real" ajax? I guess that all the "ajax" you need is the "google maps"-like functionality to browse around on images, plus the zoom function that already exists in ensembl.

You can copy the zoom function from ensembl and the javascripts for browsing could be built upon Ian Holmes' project with Gbrowse.
You don't need GWT's extreme overhead (heck, GWT has its own toolchain).

For UCSC, I think you shouldn't ask users what they want as you quite certainly know it better. For instance, power users will always demand new features. The more tracks you add, the more difficult it will become to attract newbies. Your questionnaire will mostly be filled out by power users, however.

No kidding!

Many of us in the arts community have been saying these things for years. Having been trained as a biologist, I am always surprised and disheartened by comments that suggest that "it's all about the research." Keep in mind that genetics, genomics, and especially evolutionary biology are hardly the sole purview of scientists. There are many stakeholders, and making the research, the methods of research, and the information accessible, narrative, and engaging should be the first goal. It seems naive to suggest that these datasets are objective and do not portray any specific point of view.

It might be worth including, as Ann Loraine mentions, "greater diversity in approaches and personnel, and a healthier, more open-minded community overall." How will you support the inclusion of artists, designers, and storytellers in the development of visual bioinformatic resources?

One thing we've tried is to reconnect the database and the organism in the case of yeast. OrganelleView is a VRML browser for a protein localization database. It's not the end solution, but it's a heck of a lot more visually interesting than the database by itself.

Here at the Paterson Institute, we have started to use Exon Microarrays from Affymetrix, and a need grew for just the kind of visualisation tools you describe. We also needed to annotate the probesets on the arrays to see where they were hitting the genome. We came up with X:Map, our Web 2.0 Google Maps based genome viewer (http://xmap.picr.man.ac.uk).

James Taylor does however have a valid point: on its own, X:Map was not hugely useful for the work we wanted to do (it shows an easily accessible overview of what may be occurring, but lacks any real detail), so we also have a Bioconductor package, "exonmap", which allows statistical analysis of results and can generate plots of expression values across the genes of interest. http://genomebiology.com/2007/8/5/R79

Currently, the map is made up of 52GB of pre-rendered images at the 4 different zoom levels, and so isn't as flexible or dynamic as the Ensembl browser, but it is a different possible approach (and more suited to our needs).
