Brenton Graveley is the John and Donna Krenicki Professor in Genomics and Personalized Healthcare, Associate Director of the UConn Institute for Systems Genomics, Department of Genetics and Developmental Biology at UConn Health, Farmington, CT, USA.
How open is your field to the sharing of data?
My main field is genomics, and it’s probably one of the fields that shares data the best. Genomics has a fairly open culture; not everyone in the entire field is open, but in general it is. There’s a very large amount of shared data, both from groups that wait until the day the paper’s published to share it and groups that release data prior to publication. I’m involved in the ENCODE project, and we deposit all of the data we generate as soon as it is produced and meets certain quality standards. There used to be a nine-month moratorium, an embargo on using that data in a publication for nine months, and in the current round we’ve lifted it. If somebody sees a dataset, they can download it, use it, and publish on it at will. There are many people in the genomics field who post pre-prints of their papers long before they actually appear in press. And so people in genomics are sharing not only their data, but also the results they’ve extracted from the data, prior to publication.
There’s a huge amount of reuse. The National Human Genome Research Institute is carefully tracking the papers it can identify that use modENCODE or ENCODE data, and determining whether the users of that data were part of the consortia or were outside users. As of now, the majority of papers published using that data are from people outside the consortia. So that data is getting used a lot, and not necessarily just by the people who generated it. That’s the entire goal of these types of projects: to generate large data resources for the entire community to use.
How big an issue is quality in the data that is being shared?
Quality is certainly an issue. With the vast majority of datasets out there, there is very little sense of the quality control that went into them before the decision to release. It’s often hard to determine what the actual quality is. Some of the larger projects do have data standards, but sometimes it can be hard to find what those standards are. The majority of what is out there, though, has very little specific quality control applied to it – certainly nothing that is used consistently from one group to the next.
Demonstrating quality control is really critical. If you knew data generated in your own lab was of poor quality you would either redo the experiment or just not use that data at all. So having the quality of a particular dataset known and demonstrated is critical, especially when it’s someone else’s data, otherwise you really have no idea what was done during the experiments. It’s often hard to know whether or not to trust particular datasets, just because you don’t know what the quality is.
Exactly what goes into the quality description of an experiment obviously depends on what that experiment is. At a minimum, you would want a description of the quality of the starting material and of the experiment itself. In genomics the sample material is either DNA or RNA, so you’d want some indication of the quality of that sample, of the quality of the library generated from that nucleic acid sample, and of the quality of the sequence data, because not all sequencing runs go perfectly smoothly. In an ideal world, you would have a particular experiment done more than once, with some indication of the quality or the reproducibility between those experiments.
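One of the sequence-level quality indicators mentioned above can be sketched in a few lines. This is a minimal illustration, not anything described in the interview itself: it assumes FASTQ-style records with Phred+33-encoded quality strings (the convention used by modern Illumina instruments), and the helper names are hypothetical.

```python
# Minimal sketch: summarising per-read quality from FASTQ-style records.
# Assumes Phred+33 encoding (ASCII 33 = Phred score 0), the modern
# Illumina convention; function names are hypothetical.

def mean_phred(quality_string, offset=33):
    """Mean Phred quality score of a single read's quality string."""
    scores = [ord(c) - offset for c in quality_string]
    return sum(scores) / len(scores)

def summarise_fastq(records):
    """records: iterable of (header, sequence, quality) tuples.

    Returns the mean of the per-read mean qualities, a crude
    single-number summary of a sequencing run.
    """
    means = [mean_phred(q) for _, _, q in records]
    return sum(means) / len(means)
```

In practice a real quality report would also track per-position quality, GC content, adapter contamination, and duplication rates, but even a crude summary like this makes a dataset's quality far more interpretable to an outside user than no annotation at all.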
Are there other barriers still to be overcome to data sharing in genomics?
There are definitely still barriers. In genomics in general, most of the data formats are fairly standardised already (e.g., SAM, BAM, bigWig), but certainly if you were to have any sort of large summary table of multiple experiments and gene expression values, those are just flat files for which there’s no standard format. And even though there are standards for some file formats, there can still be differences in the way that people generate them, as well as in how they use optional fields. There’s still variability; you can’t just download a file and run with it – you often need to take a particular file and manipulate it into a slightly different format. So there are still things to overcome in this area, and I think getting data standards across all of these different aspects is still an important thing.
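The variability in optional fields mentioned above is visible in the SAM format itself: the first eleven columns are fixed by the specification, while columns twelve onward are free-form TAG:TYPE:VALUE fields whose presence differs between aligners. As a minimal sketch (the fixed column names follow the SAM specification; the parser itself is a hypothetical illustration, not part of the interview):

```python
# Illustrative sketch of parsing one SAM alignment line.
# The 11 fixed columns are defined by the SAM specification;
# everything after them is an optional TAG:TYPE:VALUE field,
# and which tags appear varies from one tool to the next.
SAM_FIXED = ["qname", "flag", "rname", "pos", "mapq",
             "cigar", "rnext", "pnext", "tlen", "seq", "qual"]

def parse_sam_line(line):
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIXED, fields[:11]))
    # Columns 12+ are optional fields; store them as tag -> (type, value).
    record["optional"] = {}
    for opt in fields[11:]:
        tag, typ, value = opt.split(":", 2)
        record["optional"][tag] = (typ, value)
    return record
```

A consumer of files from two different aligners cannot assume any given tag (say, NM for edit distance) is present, which is exactly the kind of per-tool manipulation described above.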
What is the role for a product like Scientific Data in demonstrating and encouraging quality in the sharing of data?
Scientific Data has an important role. For instance, people submit genomic data to the GEO database or to the Short Read Archive, and neither of those actually stores any quality information related to the experiment. In that sense Scientific Data will enforce data quality, even if there’s not a consistent standard across data types or between different users. I don’t think every single Scientific Data publication will have the same data quality standards for a particular data type, but at least they will be documented for each dataset. So I think that’s a very important role.
The other important function, especially for people that generate these large datasets, is getting credit. In a lot of the projects that generate a large volume of data, the person that actually generated the data doesn’t get credit in the form of a citation, since it can be released to the public long before the data generators publish a paper including the data. If somebody else uses that data, they cite the GEO or the Short Read Archive accession number, which is hard to track and not something you can easily put on your CV. A Scientific Data publication is a place that people can cite, and will help to give traditional credit to the people that are actually generating the data, often graduate students and post-docs.
Interview by David Stuart, a freelance writer based in London, UK