Daniele Marinazzo is an associate professor at the University of Ghent in the department of Data Analysis of the Faculty of Psychology and Pedagogical Sciences.
He and his research group focus on methodological and computational aspects of neuroscience research. In particular, they develop, implement and validate methods rooted in statistical physics for the study of brain connectivity and activity, investigating how information is stored and transferred in complex networks and how these results translate to the brain. They usually validate their methodologies on publicly available data, and the code is always shared.
Daniele became aware of a dataset that could benefit his research when its author posted about it on Twitter. He was then able to use the data effectively in his research after the author published a Data Descriptor in Scientific Data. We caught up with Daniele to find out about his experiences of finding, sharing and using data.
Question 1: Data plays a large role in your research work. Where do you usually discover/source this data?
My research work involves on one side the implementation and validation of new methodologies and tools, and on the other their application to specific problems in clinical or experimental psychology.
For the first part I always prefer to use publicly available data, both because their quality is normally excellent and checked, and to allow other researchers to reproduce my results. For the second part, I either rely again on public data, or on my collaborators.
Normally, I know about the existence of these data by reading papers, or by regularly visiting known repositories (OpenfMRI, 1000 Functional Connectomes, etc.), or again I get to know about their existence from my Twitter feed.
Question 2: Is it always possible to use/reuse the data that you discover?
There’s an increasing amount of great data available, which allow researchers to answer many questions, and to ask novel ones themselves. Then of course it’s always difficult to find the perfect dataset that meets all your requirements. Sometimes, like in the present case, possible issues arising from the sole suboptimal parameter (the not-so-short repetition time) were addressed by comparing the results with another public dataset.
Question 3: You prefer to use data that is publicly available but what, if anything, do you do when the data you want to utilise is not openly available?
It depends. Sometimes I have a research idea, and start looking for some data that would help me address it. If the data exist and are not publicly available, I ask for them. If they are shared with me, normally I ask whether they can be openly shared with anyone. If they cannot be shared with me (as happened lately, because the project of the data collectors was on-going and that was a problem for them), well that’s life, I’ll just wait I guess. If data should be available but are not, sometimes I get a bit angry, but also happy when I’m heard.*
Then, there are cases in which seeing an open dataset gives me a new research idea. That’s great, and this simply does not happen when data are not there.
Question 4: What do you need as a researcher to ensure the data you find is useable and understandable?
Well first of all, the more data you collect, the better. Then of course the documentation and metadata are crucial (a complete list of the available data, how they were collected, details on the protocol, on the subjects, possible issues with data quality, etc). A good shared repository is not simply a file dump. Last, but not least, the format: I think that the neuroimaging field is going in the right direction in defining common standards, such as the Brain Imaging Data Structure and the Committee on Best Practices in Data Analysis and Sharing (COBIDAS) created by the OHBM council.
Question 5: You recently produced a paper on a study that utilised a dataset described in Scientific Data, how did you first come across that dataset?
I knew about that dataset before it was described in Scientific Data. Chris Gorgolewski (the first author of the paper) had first shared it via peer-to-peer (torrent) on Academic Torrents, and advertised it on Twitter. Chris then also tweeted about the Data Descriptor once it was published.
Question 6: Would you have been able to carry out your study if the data had not been openly available?
Without access to this data – providing high resolution (data were collected at 7T), physiological parameters, and a test-retest protocol – we would probably not have started looking into the question addressed in the paper.
Question 7: What would have been the impact on your research workflow if Gorgolewski had not published a Data Descriptor of his data?
The main idea was clear already when Gorgolewski advertised this dataset on Twitter. I knew that a 7T test-retest dataset with physiological parameters was being released, and this in the space of a tweet.
This particular dataset is so well organized that any researcher with some experience in fMRI can extract all the desired information. But a good and extensive description, such as the one in Gorgolewski’s Data Descriptor, is crucial in order to decide whether you want to use the data, to correctly run the analysis and to interpret the results. Furthermore, half of the work that you would have to do if only the data were available is already done (statistics on the processing parameters, on the reliability, etc.).
Question 8: What were the results of you being able to access and understand this dataset?
We were able to design and conduct our study, which aimed to investigate the variability in the estimation of the hemodynamic response function at rest in any region of the brain, at a fine scale. We were then able to spot some regions in which the physiological noise (cardiac and respiratory fluctuations) is most influential.
Question 9: How do you think the description and open availability of datasets benefits science and research in general?
I think it’s a great tool on so many levels. Nowadays, all the undergraduate students in my lab use openly available datasets for their projects, both the methodological and the applicative ones. And this is the case also for the PhD students, the postdocs and myself. And then of course open datasets provide a great tool to validate the feasibility of a new project. Just a few days ago we downloaded another dataset released on Scientific Data to test our analyses on different types of brain tumours. We would have waited several months to collect the same data here at the hospital, so we can now plan the research ahead. Then in turn we will share our data.
With the proper diffusion and description of the data, the step between an idea and its implementation is much more immediate. Also, open and annotated datasets provide excellent benchmarks for novel methodologies and technical approaches, which anyone else can in turn test and reproduce. It’s the highway towards robust and reproducible research, accessible to anyone.
*Disclaimer: I am an Academic Editor at PLOS ONE, and proud of the PLOS data policy.
The preprint version of Professor Marinazzo’s paper, Hemodynamic response function in resting brain: disambiguating neural events and autonomic effects, which utilised data described in Scientific Data, can be accessed on bioRxiv. You can find the rest of his publications on his website.
The Gorgolewski et al. Data Descriptor, A high resolution 7-Tesla resting-state fMRI test-retest dataset with cognitive and physiological measures, can be accessed on the Scientific Data website.
Interview by Mathias Astell, Scientific Data, Nature Publishing Group.