Ben Lehner is a group leader at the EMBL/CRG Systems Biology Research Unit, in Barcelona, Spain.
Could you briefly introduce your own research?
My lab works on genetics, essentially. It’s a mixture of producing our own data, and using other people’s data. We’re a combined wet and dry lab, and we work with organisms and data from bacteria, through yeast, worms, all the way up to human clinical genetic data.
Broadly, how open do you think the human genomics community has been to sharing data?
I think there is a cultural history here that’s important. You can divide the human genomics community into two groups. There’s the group that is descended from genome sequencing, and these are people who have really pioneered the sharing of data. In the public Human Genome Project, all the data were released as the project went along. It was all shared. And this is the model that has been applied to many other genomics projects since then: HapMap, 1000 Genomes, ENCODE, GTEx, and The Cancer Genome Atlas, which is all patient data. So this is the community that’s descended from human genome sequencing, and these people are in turn descended from model organism research. Many of the original people – such as John Sulston, Richard Durbin and Bob Waterston – worked on C. elegans, where a culture of freely sharing data has been there from the very start, when Sydney Brenner founded the field. And this culture has propagated through genome sequencing to human genetics.
Then you have the rest of human genetics, which is essentially clinical genetics and genomics. And there the culture, historically, is very different: it’s one of keeping your data to yourself. Partly for good reasons, like patient confidentiality, and partly for bad reasons, namely the self-interest of the researchers. It’s a lot of work to collect these data, and there is a tendency to keep them to yourself.
So, I think at the moment what we have is a mixture of the two. There’s the core public projects, which are very open, and then the clinical projects which are, in general, very closed.
How is your research impacted directly by the accessibility of human genomics data, and perhaps these two different camps?
The data from the first camp massively facilitate my research, and the research of many other people. The Cancer Genome Atlas data are used over and over again throughout the world, and have provided far more insights into cancer than was ever intended, because they have been reused.
And, then a lot of the rest of the human genomics data we just cannot use at all, even though we would like to use it, because either you cannot get it, or it’s too difficult to get it, or when you do get it, it’s missing vital information. I think there are many, many people in the world who would benefit enormously by more human genetics data, including phenotype data, being available.
Can you give some more specifics on the real challenges you face when trying to access genetic or genomic data?
The main challenge is that the data described in many published papers is just not available. You just cannot get the underlying genomic data, or often you can get the genomic data but you can’t get any of the phenotypic data attached to it, any of the description of the patients, cases, controls, etc., the clinical data. So, that’s the first problem: most of it’s just not available in any way, unless you collaborate with the authors.
The second problem is that even when the data are available – through databases like dbGaP or EGA, the two main genotype–phenotype databases – getting them can be a slow and painful process, because you have to get approval. This normally involves the technology transfer offices of your institute interacting with another institute, which can mean slow legal wrangling just to get hold of the data. And then, finally, sometimes when you actually get hold of the data, you realize it’s not what you wanted in the first place. Because the data in these databases are so non-standardized, it’s not until you actually get access that you realize the bit you need is not there, or is in an unusable format, etc. So there is a real non-standardization problem as well, within these databases.
One final problem is something that I am surprised is so common: the patient consent forms on these datasets are often very restrictive. For example, there may be a leukaemia dataset that is only allowed to be used for leukaemia research, and if you want to use it in combination with other cancer genomics datasets, access is denied, because it’s only supposed to be used specifically for leukaemia research and not for general cancer research.
I think that’s a great point because it underlines the fact that researchers don’t just have to be open to sharing; they have to be open to sharing very early when they think about their consent forms.
Yes, exactly. The problem is that these consent forms were often written a long time ago, and in such a restrictive way, that the data become unusable by many people: even if the owners of the data actually release them, you’re not allowed to use them because of the patient consent form.
And a final point: some of these datasets are now huge in terms of computing storage. A major problem is that it can take months just to download them, and then thousands of dollars in computing storage charges to keep them locally. So another thing we have to work towards in the future is centralized storage, where people with access to datasets can analyse them in a centralized computing cloud, rather than everyone having to download them to their own secure servers and keep them there, which takes a lot of time and expense as well. That’s another direction to go in.
Right. I think that’s both a problem and pointer toward solutions in the future. So, what are other things that could be done to actually improve access and usability?
Well the main thing is that there needs to be a cultural change, so that sharing of data and open access of clinical data becomes expected, and ‘the norm’, just as it is for basic genomic datasets or gene expression datasets or proteomics datasets. These have to be released as a condition of publication in journals or have to be released as a condition of funding from agencies. This has to become the case for human genetic/genomic datasets.
In order for that to happen there is one very key thing which also has to happen. For the people that create and collect these datasets, which is an enormous amount of work and expensive (these are the clinical researchers and hospitals), there has to be a way for them to get credit for what they’re doing, so that they continue to do it, other than the current system, which is basically that they put their name on every paper that ever uses that data. There has to be an alternative way for them to get credit for their work, and to be able to get recognition, and continued recognition over time, for these collections they’ve built up. But, this has to be done in a way that the data are made available, so that the system of credit and career advancement is one that facilitates research, not restricts it, which is what actually happens at the moment. People keep the data to themselves, because they know that they need it to generate publications and citations for themselves.
We’ve encountered that some of these existing credit mechanisms, like co-authorship, can be a barrier. Have you ever encountered that in the past, where negotiating for co-authorship has become an issue?
It’s very common that you write to somebody asking if you can get a dataset from them for some specific purpose that’s not related to their research, and they say “Yes, fine, so long as we consider this a collaboration”. This is okay, but the world would be much better if the data were just open, free and people didn’t demand collaborations just for sending you data, basically.
I think there are good examples now. It’s not as if the people who sequenced the human genome didn’t get credit for what they were doing. These are all super-famous scientists, very successful. You could actually argue that by making the data available immediately, this had a bigger impact on their careers, because the data are being used over and over and over again, and it gets them more citations, more respect, more talk invitations, etc. This is also true for the public projects that have followed since, like ENCODE and TCGA cancer genome sequencing. The fact that the data are available from the start has helped the data generators and the people who initiated the projects. It hasn’t harmed them in any way.
Who do you think is responsible for creating this shift in scientific culture?
Primarily it has to be us, the scientists. However, we also have to acknowledge that there’s a huge conflict of interest here, in that many researchers think that keeping data to themselves ensures control over it and that they will get credit for anything that is done with it. So I don’t think just leaving it to the scientists is actually going to result in the change. The people who are best empowered to change this culture are actually the other stakeholders: patients and patient groups, funding agencies and charities, and, of course, governments and journals.
You can already see this a bit in the UK, because the Wellcome Trust demands that the research it funds is made openly available. And therefore, of course, everyone does it, because they want to be funded. If the NIH, the MRC, and especially the EU did more of this, it would immediately change what is happening. If, in addition to requiring open access publication, funders also required open access data as a condition of funding, this would change the environment overnight, I think.
Same goes for the journals. Just as they expect that you upload your expression data, your sequence data, your proteomics data, etc., to databases, and that you share all your reagents and materials if people ask for them, as a condition of publication, it should be the same for human genetics data. If the flagship journals – Nature, Science, Nature Genetics, etc. – enforced sharing of human genetic and phenotype data as a condition of publication, this would immediately change the situation and people would share their data more. Of course, this has to be done carefully – you can’t just release the data openly – it has to be done through controlled access. But there’s no reason why, once an institute has been evaluated as having safe and secure computing, it shouldn’t be entitled to access the data just as the data generator’s host institution is, if it’s just as secure.
There are already restricted-access databases that can fulfil this role – dbGaP and EGA are designed to provide controlled access for researchers, protecting the privacy of patients whilst still sharing data. We now just have to do it.
And, of course, there are journals like Scientific Data, where the whole point of the journal is to promote the long-term survival of large-scale datasets, to make them available, and to track and record reuse. You have a way to track how much the data are being used, and can hopefully feed this back to scientists and funding agencies so that they can see that their data are useful – not just as a publication count, but as a measure of how much the data are actually being used, which is important.
At the end of the day, we have to remind ourselves why we are funding or doing clinical research – this is to increase our understanding of human biology and of the causes of disease and how to prevent and treat them. Not sharing data slows all of this down. It isn’t an efficient way to spend money on science. I also personally think that it isn’t ethically acceptable either.
Interview conducted by Andrew Hufton, Managing Editor, Scientific Data