Isaac Kohane is a Professor of Pediatrics and Health Sciences and Technology at Harvard Medical School, USA.
What are the current data sharing practices in your field?
I work in the multidisciplinary field of bioinformatics, making use of large clinical and genomic datasets to identify signifiers for a variety of conditions with genetic links: autism, major depression, rheumatoid arthritis and type 2 diabetes. Genetics has been among the leaders in data sharing, although historically clinical data has not been aggregated and shared to the same extent.
We need the new informatics practices to cross the boundaries. We need to extract data from the clinical records, where the costs of collecting the data are already sunk, and to encourage the sharing of data where the appetite is currently relatively slight. I have been part of two such projects. The i2b2 (Informatics for Integrating Biology & the Bedside) project combines codified and narrative data from existing clinical data with genomic data to design therapies for patients with genetic diseases, and SHRINE (Shared Health Research Information Network), like the old MP3 file-sharing systems, enables the aggregation of records from a large number of institutions.
Why is data reuse important in your field?
Reuse is important for both individual datasets and aggregations of datasets. By having more eyes looking at the same data we’re able to make rapid progress on individual diagnoses. In the CLARITY challenge at Harvard we made available the genome sequences and clinical data of three families, each with its own difficult, life-threatening disease. During the challenge multiple teams made the same new diagnosis for two of the children, a diagnosis that had not been made in the hospital where they had been cared for. In other cases it is only when you aggregate data that the value is realised, for example with the adverse effects of drugs. Although there may be very significant effects, at the level of the individual clinician they are not obvious at all; it’s only when you aggregate them to the institution-wide level that you see them. By looking at the data we’re able to see very large effects that are reproducible and that have led to some medications actually getting warnings placed on them by various local authorities.
The other aspect of data reuse is that, because the data is acquired during clinical care, the marginal costs are low. You can actually have a study involving 40,000 individuals with a particular disease at literally a tenth of the cost, and if you’re willing to use the discarded samples to do the genomics, for example, there again you have a factor of 10 improvement in cost. The importance of this cannot be overstated, and in the United States perhaps the best evidence of that is not only that the NIH is investing in data reuse, but also that we are seeing a huge number of new start-ups seeking to mine these data because they understand their value.
What are the biggest barriers to data sharing and reuse?
We need a culture of sharing, and incentives for it. Just as in genomics it took a bit of leadership to convince people that the likelihood of being scooped if they put their data out there was not going to be high, we need to do the same thing for all these large clinical databases. We need to do away with a culture of sitting on the data until we have mined every useful scientific grain out of it.
We also need to deal with the issue of identification. For clinical data or genomic data, the more useful the data, the more structure there is to the information, and the more identifiable the patient is. Everybody understands the genomic aspects, because the genomic sequence is fairly unique to the individual and so can lead to identification fairly easily, but it turns out that the peculiarities of each of us allow a person to be identified from a few pieces of information. If I know your zip code, your gender and your age, then I only have to know a few more characteristics and I know who you are. As we have large databases that can be mashed together to provide integrated analyses across multiple data types, you’ll see more and more opportunities for re-identification. In my opinion the solution is not to have increasingly draconian restrictions on the use of data; I think it’s being extremely clear about what constitutes good behaviour and bad behaviour, and putting in place penalties if bad behaviour occurs.
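As a rough illustration of the re-identification point above, the following Python sketch counts how many records in a small table become unique as more attributes are combined. The records are invented for illustration only, and the helper name `unique_fraction` is hypothetical; it is a minimal sketch of the idea, not a method used in any of the projects mentioned.

```python
from collections import Counter

# Hypothetical toy records (zip_code, gender, age), invented for illustration.
records = [
    ("02115", "F", 34),
    ("02115", "F", 41),
    ("02115", "M", 34),
    ("02116", "F", 51),
    ("02116", "F", 51),
    ("02116", "M", 67),
]


def unique_fraction(rows, columns):
    """Fraction of rows whose values in `columns` are shared by no other row."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


# Each added attribute leaves fewer people sharing the same combination.
zip_only = unique_fraction(records, [0])
zip_gender = unique_fraction(records, [0, 1])
all_three = unique_fraction(records, [0, 1, 2])

print(f"zip only:          {zip_only:.2f}")
print(f"zip + gender:      {zip_gender:.2f}")
print(f"zip + gender + age: {all_three:.2f}")
```

In this toy table nobody is unique by zip code alone, but most records are unique once gender and age are added, which is the pattern described above.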
How do you think a product like Scientific Data fits into the changing data ecosystem?
The data itself is going to be useful in the sense that making significant datasets available will accelerate the ability of other researchers to generate and test hypotheses that they would not have been able to otherwise. But the larger benefit is tangential, which is that two important sociologies will be established. Firstly, that there is actually a value to sharing the data, that it is both a public good and a personal achievement, and one that is recognised by a peer-review process as being of genuine value. The second sociological good is that, by having this occur again and again, Scientific Data will have demonstrated that a large number of our peers have done so and have not suffered any adverse consequences, and instead are recognised and thanked for the contribution of their important data.
There is an alternative model to this, which might yet come to pass. That is, rather than institutions viewing themselves as the institutions of record for medical and other patient data, increasingly there will be a number of other, possibly non-governmental, organisations that have the trust either of specific patient groups or of general populations of patients, to which people will altruistically contribute their own data. Patients will be making more of their data directly available under the equivalent of a Creative Commons license that says, please use my data under the following restrictions, or without restrictions. There is a possible future where the data will not actually be coming directly from the institutions or the researchers. Instead, the researchers will be using that publicly donated data and doing something interesting with it; the data will already have been in the public domain, and it will be those scientific derivatives that end up being published.
Interview by David Stuart, a freelance writer based in London, UK