With Big Data come big problems and big responsibilities. Digital Science recently ran a #datadramas tag on Twitter, asking scientists about their own data dramas. The scary thing that both it and the Naturejobs poll show is that many scientists still use laptops and USB sticks to store data long term. To talk about this, we spoke to Susanna-Assunta Sansone, associate director of the University of Oxford e-Research Centre and a data consultant and honorary academic editor for Scientific Data, a new open access data publication by Nature Publishing Group that launched this week.
The first question to tackle is: what is big data? Sansone says people often only mean size and volume, but from her point of view big data is also about “variety and complexity. So, data is multidimensional. You have video, audio files, text files, you have physical specimens which you have recorded information about.”
Sansone is a biologist by training and now works with life sciences data, of which there is an incredible amount, especially in genomics research. How scientists manage all this data depends on the data type, says Sansone. “There are different tools for different data types. And there are different enablers like terminology or format, which work for different data types.” If you are a newcomer to the life sciences, this can be incredibly confusing. Some general tools are available, and the most widely used is Microsoft Excel. “It’s better than nothing, but there are better tools nowadays.”
Modern tools allow richer working environments, and there are different tools for different jobs. Some tools help you to collect information; others are there to move data into an analysis pipeline or to store information around queries. “It’s not just one solution that works for everyone, and there isn’t just one tool.” So traditional biologists are becoming data managers, but “if they are not data managers, they should work with people that have such an expertise. Nowadays it’s a really collaborative environment.”
There is plenty of opportunity, and need, for data curators. This is Sansone’s career path. “At the time there weren’t many courses to go to, to learn. So I learned by actually doing data curation myself.” Now, things are different, although there still aren’t many courses. “We really need biologists who love science, but who love to enable other people to do science. And that is what a data curator actually is.”
We have been running a Naturejobs poll on how our readers and listeners store data. At the time of recording, there were a total of 258 votes across the options available. Of these, 82 (32%) were for “on your laptop or pc”, 60 (23.3%) were for “on an external hard drive” and in third place was the notebook with 34 votes (13.2%). (The poll closed on Wednesday 28th May; here are the votes from the time of closing.) For Sansone, this was shocking but not surprising: “the point is, long term data storage and stewardship is key.” Imagine you leave your position in a lab, or drop the computer, or lose the USB stick that had all the data on it: “it’s not the way to store the data long term. We need to educate scientists that there are public repositories out there that are meant for long term storage.”
Data privacy and protection then become the issue. If your data is in a public repository, how do you stop others using it before you’ve had the chance to analyse it fully? Sansone says that there are repositories available that can help with this, like institutional repositories.
What about being given credit for the work that you’ve done? “There are data publication platforms where the moment you start sharing your data, you get credit for your data.”
Looking back at the poll results, where scientists are still using laptops and external hard drives, Sansone hopes that this is a step-wise approach. “At first, they write [their results] in the lab book. They then probably move it [the data] from the notebook into the laptop, hopefully. At least in an Excel spreadsheet as a starting point. And then they start to explore other ways to better store their data for the long term.” Cloud software like Google Docs can be useful when looking to share data, but it is still not a long term solution.
Some people also create their own websites and databases to store their data. But again, this isn’t permanent because “the website might disappear, the person might leave the lab and no-one maintains it.”
It is also important to consider the context in which the data is stored. On its own it has less value, says Sansone. “In the context of related data, they [the data] become more valuable.” If you can link with, and connect to, every other data set produced by researchers studying the same thing, “it makes your data more relevant.” To do this, you can use general repositories, like Dryad or Figshare.
Sansone’s advice is to think about making your data “visible, accessible and reusable, which is not trivial. This is of course linked to the concept of managing, and better notating and better sharing of your data.” Subject-specific repositories are better, but general ones are a step in the right direction, says Sansone.
It is also important to get recognition for the data that you have collected, says Sansone. “That’s why I really want to mention Scientific Data, which is the new NPG open-access data publication platform. It is particularly focussed on data descriptors; so the only content type is a descriptor of your data sets.”