Timothée Poisot recently travelled to London for MozFest 2017, “The world’s leading festival for the open Internet movement.” There, the quantitative and computational ecologist at the University of Montréal in Canada ran a session entitled “Scientific computing for the terabyte-less.” Here, he tells Naturejobs why life science research needn’t necessarily follow the Big Data model.
What do you mean by ‘terabyte-less’?
There is an expectation today that all scientists will have to deal with Big Data. But most datasets in my field would fit on a handful of floppy disks, and most scientists have similarly modest needs. Yet small data is poorly served by the training currently on offer. At the supercomputing center at the University of Montréal, where I work, we have some very introductory training material and some very advanced material, and nothing in between. We need a stepping stone: training aimed at the researchers whose needs fall between those two extremes.
What’s wrong with applying big data solutions to small data?
If you have a mosquito, do you slap it, or do you use a cannon? It’s the same idea here. Big data requires infrastructure and training, and maybe that scales well for large datasets, but it doesn’t really scale for very small datasets.
More importantly, big data is intimidating, at least for biologists who lack strong backgrounds in computer science or programming, and so they avoid these techniques. What we’ve been discussing is, what practices can we adopt from the big data world to train people who would otherwise be intimidated by this entirely different universe? This is also why the Department of Biological Sciences at the University of Montréal started a new MSc program in Quantitative and Computational Biology: we need to bring these methods into the training of future practitioners.
What training or tools should they have?
Assuming you have good data, good ideas, and a good hypothesis, the thing that will determine whether the science you do is robust is whether your code is robust. Because 90% of what we do right now is contingent upon the quality of the code that we write to handle the data, read the data, do the analysis, and so on.
Over the past few years a handful of papers have been retracted in ecology because the authors discovered mistakes in their code, which changed the results and the conclusions entirely. And there has been this growing realization that it could happen to any of us, because all of us write code but almost none of us were trained in how to write good code. And we have no working definition of what good code is.
So from my point of view, training should emphasize writing code that you (and the scientific community) can trust. What practices can we adopt, and what mindsets do we need to have as researchers, to write code that we can trust and that would never put us in the position of having to retract a paper?
Do you have any recommendations?
The first rule we came up with at MozFest was: embrace simplicity. Prefer a piece of code that is very simple and very modular, that does just one thing but does it really well, over a giant script with hundreds of functions. If it’s simple enough that you can understand and explain it, it’s much more likely to work.
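The interview doesn’t give a concrete example, but a minimal sketch of this rule, assuming Python and a made-up ecological task, might be a single small function that does one thing well:

```python
def species_richness(observations):
    """Count the number of distinct species in a list of observations.

    One job, one function: small enough to read, explain, and verify
    at a glance, instead of burying the logic in a long analysis script.
    """
    return len(set(observations))

print(species_richness(["wren", "robin", "wren", "heron"]))  # prints 3
```

The function name and the task are illustrative, not from the interview; the point is that a unit this small can be understood completely, which is what makes it trustworthy.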
Second, trust nothing. Don’t trust the code you write and don’t trust the code that other people write. Instead, test its behavior. Do you have training data that you can use and whose expected output you know? Does it pass the biological sniff test?
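One way to act on this advice, sketched here in Python with a hypothetical diversity function (not one named in the interview), is to pair the code with test cases whose expected output you know in advance, the “biological sniff test” in executable form:

```python
import math

def shannon_diversity(counts):
    """Shannon diversity index: H = -sum(p_i * ln p_i) over species proportions."""
    total = sum(counts)
    proportions = [c / total for c in counts]
    return -sum(p * math.log(p) for p in proportions if p > 0)

# Sniff tests with known expected output: a sample containing a single
# species has zero diversity, and two equally abundant species give ln(2).
assert shannon_diversity([10]) == 0.0
assert abs(shannon_diversity([5, 5]) - math.log(2)) < 1e-12
```

If either assertion fails, the code is wrong in a way a reader can diagnose immediately, which is exactly the trust the interview is asking for.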
Finally, consider reproducibility. How do you make sure that a program that runs on your machine is going to run on someone else’s machine, or on one of these cloud computing services, to give the same result everywhere? That is what we expect, but it’s incredibly difficult to actually do that in practice.
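A small first step toward that goal, again assuming Python (the interview names no tools), is simply to record the environment alongside every run, so that a result can be traced back to the interpreter and platform that produced it:

```python
import platform
import sys

# A minimal provenance record: log the exact interpreter version and the
# operating system next to each analysis, so that "it ran on my machine"
# at least comes with a description of what that machine was.
print("Python:", sys.version.split()[0])
print("OS:", platform.platform())
```

This does not by itself make an analysis reproducible, but it is cheap, uses only the standard library, and makes discrepancies between machines diagnosable rather than mysterious.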
It’s not really a question of tools or programming languages, but of mindset. And the mindset is: be cautious, and realize that none of us were trained as software engineers. So do things that are very simple, go step by step, and take the time to check everything that happens. That might slow you down, but it’s better to spend a couple more months on a project than to have to retract a paper.
How can people learn these skills?
As an instructor for both Software Carpentry and Data Carpentry, I absolutely recommend that people take their classes, if possible.
People can also create study groups. When I was a postdoc, we organized regular meetings where we discussed software and other quantitative matters. We ended up seeking out more documentation and giving informal presentations. And it was a success, in that it raised the level of computing culture among the biologists locally.
Scientific computing shouldn’t be defined by the needs of people with the most complex requirements. Most scientists will never need to connect to a supercomputer, and never need to think about big data. We should come up with a definition of scientific computing that works for the majority of people, and then recognize that a lot of these very advanced and very powerful and also very cool technologies are things that are useful for a minority.
By doing that, by reframing scientific computing in terms of the minimum skill set, we can make, I hope, a more welcoming environment, and also a less frightening experience, especially for new students.
Jeffrey Perkel is Technology Editor, Nature
10 Nov 2017: The title of this post has been changed.