Scientific Data

Data Matters: Interview with Michael Milham

Michael P. Milham, MD, PhD, is an internationally recognized neuroscience researcher, clinician, and the founding director of the Center for the Developing Brain at the Child Mind Institute.

You have helped found and organize several major brain imaging data sharing projects, starting with the 1000 Functional Connectomes Project. Could you tell us a bit about how these projects got started?

That actually brings up a funny memory – back in 2008, some colleagues and I had just had a paper accepted in the journal Cerebral Cortex, which focused on establishing the test-retest reliability of resting-state fMRI. At that point I was in San Francisco, and went for a truly bizarre hair-cutting experience.

As I walked out the door of the barber’s, I rang my colleague Zarrar Shehzad and mentioned that we should start looking across imaging sites. Zarrar, a talented Research Assistant of mine at the time, asked “why would I do that?” My response was “you’re right! It’s not your thing”. I then called Bharat Biswal, a friend and colleague who helped to found resting-state fMRI – he was very excited to give it a try.

Data Matters presents a series of interviews with scientists, funders and librarians on topics related to data sharing and standards.

That following December in Magdeburg, Germany, there was the first Bi-Annual Resting State Brain Connectivity Workshop, where we managed to pull together five sites from ourselves and a few friends. We finished analysing the data roughly 30 minutes before it was presented! It was an interesting moment, as Bharat presented our findings. People had two questions – one was, ‘do you guys want us to help?’ The other common question was ‘how can we get access to that?’

Our response was to say yes to everyone who was happy to offer data. Following this, we started reaching out to people in the field to pull together a steering committee for what was to become the 1000 Functional Connectomes Project. We got folks talking on the steering committee and started trying to figure out what kind of model researchers would actually want. We reached out to anyone who’d published in the resting-state field, and received a huge range of responses. Most people were impressively open to the idea, but it also opened up dozens of questions. Who has the right to the data? What kind of data quarantines would be in place, if any? Should we curate analyses, and make people register them in advance?

A big administrative concern was how on earth we would handle all this. You don’t want to feel like anybody is trying to hold the data back. You don’t want to be the one judging what you think is good or not, because that’s of course for peer review to decide afterwards. Eventually, everybody came around to the same opinion – to just openly share the data. That’s where the Neuroimaging Informatics Tools and Resources Clearinghouse (NITRC) came in. When we contacted them, the team was extremely receptive to the idea of letting us use their website as an open data repository, and wanted to help make it work. We went ahead and released the data, publishing the scripts some months later. The data very quickly registered tens of thousands of downloads; the scripts also quickly got a few thousand downloads, suggesting a need for help with analysing large-scale datasets as well.

How did this lead to the Consortium on Reliability and Reproducibility?

That’s how things initially started. Then, with the International Neuroimaging Data-sharing Initiative (INDI), which launched in 2010, I had a growing number of people asking us ‘what next’? My initial thought was that the original dataset only provided information on participants’ age and sex, so we needed something more clinically relevant. We launched with the ADHD-200 as an example of an “INDI Retrospective Sharing” dataset, where you’re sharing datasets that were previously published on. Additionally, we launched the NKI-Rockland Sample as an example of “INDI Prospective Sharing”, where data is shared as it is acquired (i.e., before publication). The NKI-Rockland Sample is an effort targeting 1,000 subjects between the ages of 6 and 85, for which data is shared quarterly. ADHD-200 has led to about 90 publications by this point, and the NKI-Rockland Sample to around 50. That really pushed the open sharing model.

And then came the Consortium for Reliability and Reproducibility (CoRR). My colleague Xi-Nian Zuo, a former postdoc of mine, was the major impetus behind CoRR: he’s written various data analysis papers and a recent review on reliability, and has done an incredible job of raising questions about the issue. So we set about reaching out to sites in China, the US, and so forth. That’s how CoRR came about.

It’s striking how recent all of this is; that these tools didn’t exist before.

Things are rapidly emerging. To me, there are two ways to approach initiatives of this kind – one is the ‘build it and they will come’ model. People build large infrastructures in informatics and so forth, but don’t actually have data in hand. They simply hope somebody will come and use it. The other model is the ‘seat of your pants’-type model, where you get the data, figure out how to get it out there, and just keep getting better at what you do as you go along. That’s essentially been our approach.

I never thought that such a major part of what I do would revolve around data sharing. But the scientific questions data sharing opens up make a huge difference to our advances in the field. Reliability is absolutely crucial. Donald F. Klein, one of the godfathers of psychopharmacology, would often ask me ‘do you really think this would repeat twice? Is there any reliability to anything that you’re saying?’ That’s why we created that first NYU test-retest dataset, because we were tired of being asked that question!

In your experience, what works in terms of motivating scientists to share their data?

A combination of things. One is social pressure – I believe that scientists don’t want to feel left out. When they step into a culture where this is the social norm, they’ll conform to that norm. Another is the realization that people are able to do this with no detriment whatsoever to their careers. Researchers occasionally give me very heated commentaries about the kind of danger I’m creating for their postdocs and so forth by sharing data, and by asking them to share theirs. But I think people are realising that data sharing simply opens up a platform upon which much greater questions can be asked. The final thing is simply that a huge number of researchers are using shared data. Somewhere between 350 and 400 papers have come out using FCP data alone since 2010. As people use the data, and benefit from it, the virtues of sharing become more obvious.

“What you’re seeing now is a shift in the paradigm. Everyone is talking more and more about big data.”

Have the datasets collected through FCP and the other subsequent initiatives contributed to your own research?

Yes, we try to use the open data as much as we can. One example is the motion crisis for resting-state fMRI, where I supervised a paper by Yan et al. The team and I carried out a whole series of analyses to provide a more grounded line of enquiry, and tried to temper the crisis, by asking what the problem is, and how to best deal with it. In doing so, one of the biggest luxuries was to be able to sit there and draw from this amazing range of datasets, and be able to present 150 subjects from one site, 150 from another, etc. That’s the kind of science that you could only dream of a few years ago. So we’ve definitely made use of datasets like that. It’s really encouraging, and has definitely increased the scope of questions we can ask.

In your opinion, how important is effective data sharing in advancing our understanding of neuroscience?

It’s going to be essential. What you’re seeing now is a shift in the paradigm. Everyone is talking more and more about big data – when you talk about big data, it’s not that you want to power your sample so you can detect a trivial effect. That’s really not the goal with big data. The goal is to have samples that have higher representative value, that may be close to epidemiologic in some instances, but that can definitely capture a broad range of heterogeneity. And so, in that way, without data sharing it’s going to be virtually impossible to get those large scale samples that we really need. I think it’s crucial.

Interview by Matt Ward, Open Research, Nature Publishing Group
