
Data Matters: interview with Timothy Rowe

Timothy Rowe is Professor of Paleontology at the University of Texas at Austin, USA.

How much of the scientific data generated in your field is being shared at the moment?

A very small percentage. For a paleontologist there are two forms of data. One is the fundamental specimens in natural history collections, and we have a long history of requiring that specimens collected from the field be collected legally, documented and vouchered into museum collections. But increasingly, discoveries are being made not so much on the specimens themselves as on computed tomography (CT) data. CT has given us a non-destructive look inside these specimens, and it’s the CT data that are providing the source of discovery.

We actually don’t know how much data is being generated, except to say that the volume is increasing. More and more institutions are installing their own computed tomography scanners, and more researchers are getting access to hospital and medical scanners, but there’s no clear record of what has been generated or what is currently being generated. The trend, though, is clear: more and more people are using the data. My impression is that the volume of data is growing exponentially, but we’re not tracking it, and as far as sharing goes, that’s another element that’s not being tracked. From my own lab, we’ve generated on the order of 3,000 datasets on fossils and recent biological specimens, and every week I get a request for one or more full-resolution datasets, and we rarely say no. When we do, it’s just a case where we’re waiting on a researcher to finish a project before releasing the data.

What are the barriers to the wider sharing of CT data?

CT data are taking on the status of specimens themselves, but what we don’t have are the corresponding best practices. We still lack standards for their deposition in a repository, for metadata, and for accessibility policies that would let you download those same data to validate and verify my discovery, and potentially to reuse the data the way we continue to use specimens over and over again. This can be problematic in the sense that no one is born into this world understanding how CT data ought to be interpreted. Just having a dataset in hand is not to say that it has been interpreted properly. One of the fundamental tenets of science is to disclose one’s data so that you or anyone else can check it.
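To make the gap concrete, here is a minimal sketch in Python of the kind of metadata record a CT repository might require on deposit. Every field name and value below is an illustrative assumption, not an established schema; a community standard of this sort is exactly what Rowe says does not yet exist.

```python
import json

# Hypothetical deposit record for a CT dataset. All fields are assumptions
# chosen for illustration; no agreed community standard exists yet.
ct_metadata = {
    "specimen": {
        "taxon": "Alligator mississippiensis",   # example taxon
        "voucher": "TMM M-983",                  # hypothetical museum number
        "repository": "Texas Memorial Museum",
    },
    "scan": {
        "modality": "CT",
        "voxel_size_mm": [0.25, 0.25, 0.25],     # x, y, z spacing
        "dimensions_voxels": [512, 512, 1024],   # voxels per axis
        "energy_kv": 180,
        "scan_date": "2014-06-01",
    },
    "access": {
        "license": "CC BY 4.0",
        "embargo_until": None,  # release immediately, or a future date
    },
}

print(json.dumps(ct_metadata, indent=2))
```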

“No one is born into this world understanding how CT data ought to be interpreted.”

There’s also a critical piece of missing infrastructure. Even though we’re very free with the data we produce, it’s not an automated system. If you wanted a full-resolution dataset, you’d have to email us, and we’d have to move it to an ftp site for you to download. What I envisage is much more of a GenBank model, an automated model where, as we generate data, the full-resolution datasets are put out there on the web. The GenBank model is the one to study and to extend from gene sequences to 3-dimensional, and now 4-dimensional, scanning. We use a web-based interface like Digimorph to present an abstract of the data. For example, Digimorph now provides small Quicktime movies that show you the slice-by-slice quality of the CT data, or a 3D spinning model of the CT data, so you can get some idea of what it is you’re about to download. But one critical piece of infrastructure is missing: the repository, a place for everyone who’s generating scientific data using CT or MRI or any of the other technologies that produce 3D voxel data. Ideally, new data should be uploaded up front, just as one does with GenBank. The data should be made available to reviewers for validation, and then, once the article is accepted and goes public, the data become freely downloadable for validation and repurposing. Right now, we have mechanisms like FigShare and Data Dryad, and they store and dispense full-resolution datasets, but you only have a text interface, so you take a chance, download this big chunk of stuff, and then look at it. These are very important enterprises for data sharing, but it is unclear in what ways they are still experimental and destined to change, and in what ways they represent final solutions.
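As a rough illustration of what such a preview or “abstract” involves, the sketch below shrinks a 3D voxel volume into a small preview using numpy, so a user can inspect it before committing to a full download. The function name, downsampling factor, and synthetic volume are assumptions for illustration, not Digimorph’s actual pipeline.

```python
import numpy as np

def make_preview(volume: np.ndarray, factor: int = 4) -> np.ndarray:
    """Downsample a 3D voxel volume by an integer factor along each axis.

    A crude stand-in for a Digimorph-style dataset abstract: small enough
    to inspect before committing to the full-resolution download.
    """
    # Trim so each axis divides evenly, then average over factor-sized blocks.
    z, y, x = (dim - dim % factor for dim in volume.shape)
    v = volume[:z, :y, :x]
    return v.reshape(z // factor, factor,
                     y // factor, factor,
                     x // factor, factor).mean(axis=(1, 3, 5))

# Example: a synthetic 256^3 volume shrinks to 64^3 for quick viewing.
volume = np.random.rand(256, 256, 256).astype(np.float32)
preview = make_preview(volume)
print(preview.shape)  # (64, 64, 64)
```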

What is the role of a publication like Scientific Data?

One of the impediments to scientific advancement in the past has been that career advancement and promotion were tied not to data generation but to publication. Researchers in the past have generated 3-dimensional datasets, built 3D models, extracted information and written articles about it, and yet it’s all taken back down to type and to pixelated imagery, and the 3D data are largely discarded. With Scientific Data there’s now a creditable way of publishing one’s data. This means that young researchers who are generating data have an incentive to share the data, and they get credit for it towards their promotion. I think that’s a really huge advance in terms of the social underpinnings of modern science. Promotion and advancement are something younger people need to think about, and this is a citable way of getting your data out there and earning credit.

“With Scientific Data there’s now a creditable way of publishing one’s data.”

Scientific Data also fills a glaring gap in the infrastructure of science, which is the disclosure of scientific data. If you publish a paper based on scientific data, then those data must be made available for validation and reuse, and yet until now the mechanism was not there. Scientific Data is a huge step forward in filling that gap. If you think about the revolution in microbiology, the thing that really broke it wide open, in my view, is the repository. Once it was possible to sequence a gene and upload it, why would I then have to regenerate and re-sequence that same gene when I can download it in a few seconds? It’d be nice to see that same model working with the 3-dimensional data that scientists are generating with CT or MRI.
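For comparison, this is roughly what that GenBank workflow looks like in code today: a minimal sketch using Biopython’s Entrez interface. The accession number is an arbitrary public example, the email address is a placeholder you would replace with your own, and the script assumes network access and an installed biopython package.

```python
# Fetch a published sequence from GenBank in seconds, rather than
# regenerating and re-sequencing it -- the model Rowe describes.
from Bio import Entrez, SeqIO

Entrez.email = "you@example.org"  # NCBI asks for a contact address

# EU490707 is an arbitrary public accession used here as an example.
with Entrez.efetch(db="nucleotide", id="EU490707",
                   rettype="gb", retmode="text") as handle:
    record = SeqIO.read(handle, "genbank")

print(record.id, record.description)
print(len(record.seq), "bases")
```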

Learning how to interpret these datasets is a non-trivial task, but the published data articles are a huge step forward, and I really hope to see the journal do what it has promised: transform the field in a fundamental way, so that everyone, every student, is brought up from a very young age computing in 3 dimensions instead of 2.

Interview by David Stuart, a freelance writer based in London, UK
