Early in his graduate career, John Blischak found himself creating figures for his advisor’s grant application.
Blischak was using the programming language R to generate the figures, and as he iterated on and optimized his code, he ran into a familiar problem. Determined not to lose his work, he gave each new version a different filename (analysis_1, analysis_2, and so on) but failed to document how the versions had evolved.
“I had no idea what had changed between them,” says Blischak, who is now a postdoctoral scholar at the University of Chicago. “If the professor were to come back and say, ‘which version did you use to create this figure?’ I would have had no idea.”
Later, while attending a workshop on basic research computing skills, he discovered a better approach: Git.
Git is a free and open-source distributed version-control system. Written to manage development of the open-source Linux operating system, it allows large, geographically distributed teams of programmers to work independently on their own copies of the code, track changes with line-by-line granularity, merge those changes back into the main repository, and reverse them when necessary.
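To make "line-by-line granularity" concrete, here is a minimal sketch in Python. It does not use Git itself; it uses the standard library's difflib module to compare two versions of a script, which is roughly the kind of report a version-control system produces. The two R snippets and the file labels are hypothetical stand-ins for Blischak's analysis_1 and analysis_2.

```python
import difflib

def changed_lines(old_text, new_text):
    """Return only the added (+) and removed (-) lines between two versions."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="analysis_1",   # hypothetical label for the older version
        tofile="analysis_2",     # hypothetical label for the newer version
        lineterm="",
    )
    # Keep the change lines and drop the "---"/"+++" file headers.
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]

# Hypothetical contents of two versions of an R plotting script.
version_1 = 'data <- read.csv("counts.csv")\nplot(data$x, data$y)\n'
version_2 = 'data <- read.csv("counts.csv")\nplot(data$x, log(data$y))\n'

for line in changed_lines(version_1, version_2):
    print(line)
# -plot(data$x, data$y)
# +plot(data$x, log(data$y))
```

The unchanged first line never appears in the report. Only the line that differs is flagged, once as removed and once as added.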
But Git also facilitates scientific reproducibility across a wide range of disciplines, from archeology to zoology. When used in conjunction with GitHub, an online hosting service for Git repositories that doubles as a social network, the tool allows researchers to store and share their code, analysis scripts, and data, and to ensure analyses are always executed using the appropriate versions of the files. Other researchers can then access those files to see how the work was done and to apply it to their own studies. Those features advance research transparency, says Juan Antonio Vizcaíno, proteomics team leader at the European Bioinformatics Institute in Cambridge, UK.
Using Git, Blischak says, he no longer needed to maintain multiple copies of his files. “I just keep overwriting it and changing it and saving the snapshots. And if the professor comes back and says, ‘oh, you sent me an email back in March with this figure’, I can say, ‘okay, well, I’ll just go back to the March version of my code and I can recreate it’.”
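The idea of overwriting one file while keeping dated snapshots can be sketched with a toy model. This is not how Git is implemented, and the class name, method names, dates, and script contents below are all hypothetical. The sketch assumes snapshots are saved in chronological order.

```python
import hashlib
from datetime import date

class SnapshotStore:
    """Toy model of a version history: one file, many dated snapshots."""

    def __init__(self):
        self._history = []  # (date, snapshot id, content), oldest first

    def commit(self, when, content):
        # Git also labels snapshots with a hash, but it hashes more than
        # the file content; this short ID is a simplification.
        snapshot_id = hashlib.sha1(content.encode("utf-8")).hexdigest()[:7]
        self._history.append((when, snapshot_id, content))
        return snapshot_id

    def as_of(self, when):
        """Return the content of the latest snapshot made on or before `when`."""
        for snapshot_date, _, content in reversed(self._history):
            if snapshot_date <= when:
                return content
        raise LookupError(f"no snapshot on or before {when}")

store = SnapshotStore()
store.commit(date(2018, 3, 10), "plot(data$x, data$y)")       # hypothetical March version
store.commit(date(2018, 5, 2), "plot(data$x, log(data$y))")   # hypothetical later rewrite

march_version = store.as_of(date(2018, 3, 31))
print(march_version)
# plot(data$x, data$y)
```

There is only ever one working copy of the script, yet the March version remains retrievable after the later rewrite. That is the property Blischak describes.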
That said, Git is a tool that researchers love to hate, with a vexing and confusing command-line interface intended more for seasoned programmers than casual users.
Greg Wilson, cofounder of the Software Carpentry research computing workshops, is blunt in his assessment. He recognizes Git’s value and uses it himself, but “I hate Git…. It is one of the worst pieces of software to teach that I’ve come across in 35 years of teaching people software.” Still, he adds, mastering Git is as essential to modern research as learning to read English. Those who use Git and have become immune to its complexity, he jokes, suffer from Git-induced “Stockholm Syndrome”.
Titus Brown, a bioinformatician at the University of California at Davis, calls Git a “power tool”, one whose power comes at the cost of complexity.
Git is not the only tool of its type. Other options include Mercurial, a more user-friendly alternative. So, why is Git so popular? In a word, GitHub.
With some 28 million users and 85 million software repositories, GitHub is an elegant (and largely free) online portal built atop Git’s foundations, on which programmers and researchers can archive, share, discuss, and edit their code, manuscripts, documentation, and data. The site provides a convenient online home for projects, as well as a backup in case a user’s local copy is ever damaged.
Much more than a code warehouse, GitHub is effectively a social network for software development. Programmers can ‘fork’ any user’s public project (that is, make their own copy), modify it, and use that updated code — or any previous version thereof — on their own data. They can then make that updated code available to the community, a form of “permissionless” editing that is “tremendously powerful,” says Brown. Some journals even run their peer-review processes on GitHub.
“Communication and collaboration are the killer apps of version control,” wrote University of British Columbia statistician Jenny Bryan in a recent article. And that is as true for bench scientists as it is for programmers. In one recent example, Casey Greene of the University of Pennsylvania Perelman School of Medicine and Anthony Gitter of the University of Wisconsin, Madison, led a team of over 40 researchers who collaborated to write an extensive review on a form of artificial intelligence called ‘deep learning’. The project, which they called the ‘deep review’, was managed (and eventually automated) entirely on GitHub.
“Git is the price you have to pay in order to use GitHub,” Wilson says.
https://youtu.be/31XZYMjg93o
Brown’s lab uses GitHub as a way to manage collaborations, control access, run automated quality checks, and provide a ‘canonical’ copy of its code. And they use it to advance reproducibility, Brown says, by allowing team members to identify and retrieve precisely those versions of their code, scripts, and data they originally used to perform a particular analysis.
Tracy Teal, executive director of The Carpentries, an organization that develops and runs workshops on data and computational skills, says that Git and GitHub are even useful for those researchers who like to work solo. “Most researchers are primarily collaborating with themselves,” Teal explains. “So, we teach it from the perspective of being helpful to ‘future you’.”
On 4 June 2018, Microsoft Corp. announced plans to acquire GitHub for $7.5 billion, a deal whose implications for programmers and the scientific community remain unclear. But there are alternatives to GitHub, including GitLab and Atlassian’s Bitbucket, and both have reported sharp spikes in new users in the wake of Microsoft’s announcement.
Whatever platform you use, Git can be daunting for the uninitiated, but the basics are straightforward enough, and good tutorials are available online. The program is text-based, but several free graphical user interfaces are available, including SourceTree and GitKraken. And many programming tools, such as RStudio, feature Git integration as well.
But there’s no arguing the tool is complicated, and things can often go wrong. In that case, Blischak advises perseverance. “Appreciate the fact that it’s going to be a little complicated to start off. And don’t get discouraged if it seems a little overwhelming at first, because that’s how everybody feels.”
Indeed, getting a feel for the software in a non-mission-critical situation may be the best way to learn Git, says Teal; that way, instructors can walk you through the hiccups that inevitably arise. “I won’t pretend that it’s not a challenge. It’s just one that’s worth it,” she says.
Jeffrey M. Perkel is Technology Editor, Nature