Science Online New York (SoNYC) encourages audience participation in the discussion of how science is carried out and communicated online. To celebrate our first birthday, we are handing the mic over to the audience so that anyone who would like to participate will get five minutes to show off their favourite online tool, application or website that makes science online fun. To complement the celebrations, we’re hosting a series of guest posts on Soapbox Science where a range of scientists share details about what’s in their online science toolkits. Why not let us know how they compare to the tools that you use in the comment threads?
Boris Adryan is a biologist by training (studies at Mainz, Germany, and Charleston, USA). He obtained a PhD at the Max-Planck-Institute for Biophysical Chemistry (Göttingen, Germany) for work on the development of the Drosophila tracheal system. Postdoctoral work at the University of Cambridge (UK) and the MRC Laboratory of Molecular Biology exposed him to both the wet- and dry-bench sides of modern genomics and computational biology. As a Royal Society University Research Fellow at the Cambridge Systems Biology Centre, his team of currently six biologists, computer scientists and mathematicians works on experimental and theoretical studies of transcriptional regulation and transcription factors. In his spare time he enjoys playing computer games with his children, running and boxing.
Constraints. Life is full of constraints. And software is, too. The trick is to work around them. So when I turned my back on the wet-bench and towards computational biology, I did this with the full intention of squeezing experiments into a time frame that was compatible with my five-minute attention span. Writing software from scratch to make it do exactly what I want was definitely not part of the plan. To me that’s like making your own Taq polymerase and purifying EcoRI in the old days. It’s exciting, it can give you that great feeling of accomplishment on a lonely Sunday afternoon in the lab, but at the end of the day, you want to address a research question and not bother about implementation details.
Well, nobody can escape their past. In my past 20 years as a programmer, I’ve seen the rise of object-oriented programming and ‘modularity’ is something that was hammered onto my forehead. These days, I organise my entire life as a computational biologist around little modules that I re-use in almost every workflow. Yes, sure, you may call me a one-trick pony, but in terms of productivity, call me a plough horse.
My core modules are ACQUISITION, COMPUTATION, VISUALISATION, and usually I glue those together with a few lines of Perl or the Unix command line. Here come the constraints again: To overcome the limitations of the software that I’m often “misusing”, I use my own scripts to shove data from one format into the next, and back again. I think every biologist who deals with lots of data, not only us computational folk, should know a few handy lines to quickly turn comma-separated files into tab-delimited, strip a table of empty quotes or grep some essential info.
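As an illustration of the kind of glue meant here, the following is a minimal sketch in Python rather than Perl (the language swap is my choice for readability, not what the post's author uses). It turns comma-separated text into tab-delimited text, lets quoted empty fields collapse into empty cells, and does a grep-style line filter. The function names `csv_to_tsv` and `grep_lines` and the sample row are made up for this example.

```python
import csv
import io
import re


def csv_to_tsv(csv_text):
    """Convert comma-separated text into tab-delimited text.

    The csv module parses quoted fields, so a quoted empty field ("")
    ends up as a plain empty cell in the output.
    """
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        # Trim stray whitespace around each cell while we are at it.
        writer.writerow([cell.strip() for cell in row])
    return out.getvalue()


def grep_lines(text, pattern):
    """Return the lines of `text` that match the regular expression."""
    regex = re.compile(pattern)
    return [line for line in text.splitlines() if regex.search(line)]


if __name__ == "__main__":
    sample = 'gene,symbol,note\n"FBgn0019650","toy",""\n'
    tsv = csv_to_tsv(sample)
    print(tsv, end="")
    print(grep_lines(tsv, r"^FBgn"))
```

On the command line the same job is usually a one-liner, but a small function like this is easier to re-use as a module in a longer workflow.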
ACQUISITION aka identifier hell. A few years ago I did a big analysis on Drosophila transcription factors. The simple task turned out to be not so simple: How are you going to compare the expression of the gene twin-of-eyeless, toy for those familiar with the arcane art of fly pushing, with other genes if in one release of the fly gene annotation it is called FBgn0004523, in the next one FBgn0019650, and Affymetrix claims to represent it on their microarray as CG11186 but that, in your table, only maps to some NCBI gene 43833? I have lost years in that hell! These days, I don’t care any more if I can get a handle on ALL of the data for my genes of interest. I go to dedicated portals that have done all the data collection and consolidation for me, even if that means that I might miss out on the newest datasets.
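To make the identifier problem concrete, here is a small Python sketch of the consolidation step that such portals do for you: every alias is mapped onto one primary identifier. The identifiers are the ones quoted above for toy; treating FBgn0019650 as the "current" one is an assumption for illustration only, and the helper names `build_alias_index` and `normalise` are hypothetical.

```python
def build_alias_index(records):
    """Map every alias (old or new) to its primary identifier.

    `records` is a dict of primary identifier -> list of aliases.
    """
    index = {}
    for primary, aliases in records.items():
        index[primary] = primary
        for alias in aliases:
            index[alias] = primary
    return index


def normalise(ids, index):
    """Split a list of identifiers into resolved and unresolved ones."""
    resolved = {}
    unresolved = []
    for raw in ids:
        key = raw.strip()
        if key in index:
            resolved[raw] = index[key]
        else:
            unresolved.append(raw)
    return resolved, unresolved


# Aliases for twin-of-eyeless as mentioned in the text. Which one counts
# as primary is an assumption made for this example.
KNOWN = {"FBgn0019650": ["FBgn0004523", "CG11186", "43833", "toy"]}

if __name__ == "__main__":
    index = build_alias_index(KNOWN)
    resolved, unresolved = normalise(["CG11186", "FBgn0004523", "FBgn9999999"], index)
    print(resolved)
    print(unresolved)
```

The hard part in real life is not this lookup but keeping the alias table complete and current, which is exactly the work worth outsourcing.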
By far the website I use most often for data acquisition is FlyMine, which is a one-stop-shop for all Drosophila genomic features. There are also other data available (a few selected gene expression datasets including the BDGP In situ Database and FlyAtlas, functional annotation from Gene Ontology, KEGG and Reactome, structural domain assignments from InterPro, phenotypic data from GenomeRNAi, and protein-protein as well as genetic interaction data from BioGRID etc). The best thing: They take care of all identifiers. No more hell. You can even upload a gene list with long outdated names and have it returned with the most recent identifiers, along with your data of interest. While FlyMine was originally developed for normal bench biologists and their team seem to invest a lot of time into the look-and-feel of the website, I’ve started using their API, which is available for most popular scripting languages and Java. You can handle, store, merge, and mix and match your lists on their website, but typically I just export them as tab-delimited files or via the API. In both cases, you get to determine the content and order of your data, which takes a lot of the ACQUISITION pain away. Thanks to the general InterMine framework there’s now a growing list of mines for model organisms becoming available, and they all support the exchange of gene lists between each other. The NIH-led modENCODE project also makes its data available in a mine of its own, called modMine.
UCSC Genome Browser: A powerful data source and genome browsing tool on its own, it allows dynamic loading of data from the user’s servers for visualisation on the fly.
COMPUTATION. There are basically two kinds of computation we do in our group: Sequence-related work, and counting and comparing; the latter including statistics. Applications for sequence-related work are as vast as protein and DNA sequences, and it’s difficult to give a good summary here. However, hidden away for the connoisseur, in the guts of the UCSC Genome Browser is a great set of tools for download (think BLAT or liftOver), which is very useful if you don’t want to strain their server too much. For counting and comparing, I recommend R. Over the past ten years, it really has become the de facto standard in the computational biology field, with hundreds of Bioconductor modules (“plug-ins”) from all fields of biology and a huge user base to support any queries. I personally prefer MySQL for filtering through big data, and the RMySQL module enables the cross-talk between R and the database.
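The "filter in the database, count and compare afterwards" pattern can be sketched as follows. This is not the R plus MySQL setup described above: it uses Python's built-in sqlite3 module as a stand-in so that the example runs without a database server. The table name, column names and all values are invented for illustration.

```python
import sqlite3


def load_expression(rows):
    """Create an in-memory table of (gene_id, tissue, signal) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE expression (gene_id TEXT, tissue TEXT, signal REAL)"
    )
    conn.executemany("INSERT INTO expression VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def count_genes_above(conn, threshold):
    """Count, per tissue, the distinct genes with signal above threshold."""
    cursor = conn.execute(
        "SELECT tissue, COUNT(DISTINCT gene_id) "
        "FROM expression WHERE signal > ? "
        "GROUP BY tissue ORDER BY tissue",
        (threshold,),
    )
    return cursor.fetchall()


# Made-up example values, not real measurements.
ROWS = [
    ("geneA", "head", 120.0),
    ("geneB", "head", 15.0),
    ("geneC", "head", 50.0),
    ("geneA", "gut", 8.0),
    ("geneC", "gut", 300.0),
]

if __name__ == "__main__":
    conn = load_expression(ROWS)
    print(count_genes_above(conn, 40.0))
    conn.close()
```

The point of the design is that the database does the heavy filtering, and only the small summary table travels on to the statistics step.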
Genesis: Originally developed as a microarray gene expression analysis tool, it can be used to cluster and visualise virtually any numerical data in matrix format. The visual appearance of the output is highly customisable with just a few mouse clicks.
VISUALISATION. Data integration, that’s what your brains are actually very good at. Especially when you work with big datasets. For summary statistics of numerical values I rely on R, although I’ve grown the odd gray hair over comparative histograms and the way R handles graphics – a combination of computed graphics and some mouse intervention for prettifications would be great. In most cases, once data is in some sort of matrix form, I use Genesis. It was originally developed for microarray analysis, but I use it for everything that requires visualisation and clustering. Unfortunately, Genesis insists on numerical values and I often find myself writing quick search-and-replace scripts so I can use it on arbitrary data. However, I really like how one can visually define clusters, and I have written a few parsers to extract group identities from Genesis .xml output files. For downstream analysis of gene function I then load those into Ontologizer, a sophisticated Gene Ontology statistics tool with a great GUI and even better command line support. A key feature of Ontologizer is how it handles the redundancies that one often observes in functional annotation gene lists, and it attempts to just show the biologically most relevant ones. Beware that Ontologizer requires Java 1.6 and therefore won’t run on Mac OS X < 10.6. My favourite tool for network visualisation and analysis is Cytoscape, as every visual feature (and there are a lot of them!) can be linked to user-defined parameters – once again, coming from tab-delimited files. For showing data in the context of the genome I use the Integrated Genome Browser (IGB) if it’s for my own use, because that runs pretty slick on a normal desktop machine as it really shows just the basic facts. For visually pleasing representations, I use the UCSC Genome Browser.
While many people use their ‘custom tracks’ to upload a few features using GFF or BED format, I really like the functionality that allows you to store genome-size bigWig or bigBed files on your own server and have your own data visualised within the genome browser on the fly, in the context of all the other data that UCSC provide. That’s truly amazing!
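For remotely hosted files, the browser is pointed at your server with a one-line custom track definition that carries a bigDataUrl setting. The Python sketch below only assembles such a line; the helper name `big_track_line`, the track name and the example.org URL are placeholders, and the exact set of supported settings should be checked against the UCSC documentation.

```python
def big_track_line(track_type, name, description, big_data_url):
    """Build a custom track line for a remotely hosted bigWig/bigBed file."""
    if track_type not in ("bigWig", "bigBed"):
        raise ValueError("track_type must be 'bigWig' or 'bigBed'")
    return (
        f'track type={track_type} name="{name}" '
        f'description="{description}" bigDataUrl={big_data_url}'
    )


if __name__ == "__main__":
    # Placeholder URL: replace it with the location of a file on your server.
    print(
        big_track_line(
            "bigWig",
            "My signal",
            "Signal track served from my own machine",
            "https://example.org/data/signal.bw",
        )
    )
```

The resulting line is what gets pasted into the custom track form; the data file itself stays on your server and only the region on screen is fetched.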
You can follow the online conversation on Twitter with the #ToolTales hashtag and you can read Mary Mangan’s Tool Tale here, Dr Peter Etchells’s Tool Tale here, Alan Cann’s here, Jerry Sheehan’s here, Anthony Salvagno’s here, Daniel Burgarth and Matt Leifer’s here, Zen Faulkes’s here, Jenn Cable’s here, Mike Biocchi’s here, Susanna Speier’s here, Derek Hennen’s here, Musa Akbari’s here, Benedict Noel’s here, Chris Surridge’s here and Gerd Moe-Behrens’s here.