Soapbox Science

Bookkeeping or science: what’s behind a paleo data compilation?

Guest blog by Darrell Kaufman, Northern Arizona University, US

Here Professor Kaufman talks about the importance of community-endorsed data compilations for accelerating discovery in paleoclimate science. He helped coordinate an international consortium that assembled “A global multiproxy database for temperature reconstructions of the Common Era”, which was published on 11 July in the journal Scientific Data.

A respected colleague of mine once told me that compiling data was a task better left to bookkeepers. He’d rather focus on ‘science.’ Granted, that was before the term ‘informatics’ had appeared on the scene, and prior to the massive buildup of data we now face. But his sentiment still rings in the inevitable criticism of my grant applications in which I propose to assemble a database of existing data: rehashing other people’s old data is one rung down on the intellectual ladder. If one needs to borrow data from a previous study, one just contacts the public data repositories.

I contend, however, that there’s more scientific ingenuity to a well-crafted ‘data compilation’ and more coordination behind a ‘community endorsed’ data product than meets the eye.

My experience with the development of the Past Global Changes (PAGES) 2k paleo-temperature dataset, which was just released through the journal Scientific Data, is a case in point. PAGES aims to improve understanding of past climate variability; within this program, PAGES2k focuses on the past 2000 years, the Common Era (CE), a period when climate was relatively similar to today and for which ‘paleo’ records (evidence from natural archives that attest to climate change prior to the instrumental period) are relatively abundant. But first, I need to define a ‘paleo data compilation’ and provide some background to the topic.

Motivations for a paleo data compilation

The first step to securing data resources for any scientific community is to establish a long-term, well-organized data repository. Fortunately for the paleoclimate community, several first-rate data repositories actively curate relevant data.

The theoretical next step is to mine the archives to extract the data needed to address a particular research question, such as the pattern of surface temperature change over the planet during the past two millennia, the purview of the PAGES2k project.

When doing so, however, my data-hungry colleagues and I have discovered that the repositories contain an uneven and incomplete sampling of the huge variety and long legacy of observational datasets that have been interpreted in terms of past climate. We also found that key metadata and other data needed to assess uncertainties are missing in many cases, hampering the reuse of the data.

The high proportion of unavailable, inconsistently formatted or incompletely documented data motivates some paleoclimate researchers to assemble datasets that target a particular scientific question. They might start with data from existing archives, but then add missing records, fill in fundamental information, provide quality control and standardize the format and vocabulary so the contents can be easily ingested by computers.
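As a rough illustration of what that standardization step can involve, here is a minimal Python sketch. The vocabulary table, the required fields and the kelvin-to-Celsius conversion are all hypothetical, invented for this example; they are not taken from the PAGES2k workflow.

```python
# Hypothetical controlled vocabulary: maps free-text archive labels
# to a single standard term. Not the PAGES2k vocabulary.
ARCHIVE_VOCAB = {
    "tree ring": "tree", "tree-ring": "tree",
    "coral": "coral", "corals": "coral",
    "ice core": "glacier ice", "glacier ice": "glacier ice",
    "lake sed": "lake sediment", "lake sediment": "lake sediment",
    "marine sed": "marine sediment", "marine sediment": "marine sediment",
}

# Hypothetical minimum metadata a record needs before it can be reused.
REQUIRED_FIELDS = ("site", "archive", "lat", "lon", "units")


def standardize(record):
    """Return (clean_record, problems) for one raw record (a dict)."""
    clean = dict(record)

    # Flag missing metadata instead of silently guessing it.
    problems = [f"missing {name}" for name in REQUIRED_FIELDS
                if record.get(name) in (None, "")]

    # Map the free-text archive label onto the controlled vocabulary.
    label = str(record.get("archive", "")).strip().lower()
    if label in ARCHIVE_VOCAB:
        clean["archive"] = ARCHIVE_VOCAB[label]
    elif label:
        problems.append(f"unknown archive term: {label!r}")

    # Convert values reported in kelvin to degrees Celsius.
    if record.get("units") == "K":
        clean["values"] = [round(v - 273.15, 2) for v in record.get("values", [])]
        clean["units"] = "degC"

    return clean, problems


raw = {"site": "Lake X", "archive": "Lake Sed", "lat": 60.0, "lon": 10.0,
       "units": "K", "values": [274.15, 275.15]}
clean, problems = standardize(raw)
```

The point of returning a list of problems alongside the cleaned record is that gaps get reported to a human expert rather than filled in automatically.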

These paleo data compilations accelerate scientific discovery on several fronts. For example, they often point to future scientific priorities through recognition of crucial gaps. They enable us to avoid overreliance on select records while providing an objective means to recognize aberrant or misinterpreted records through systematic comparison against the full body of other available records. And they lend themselves to quantitative analyses by multiple researchers who can apply a suite of different approaches to solve a research question, all using the exact same set of records.

Maximizing its scientific value

Such data compilations have great utility in paleoclimate science, but considering the enormous amount of work involved, how do we make the best use of limited resources to maximize the scientific outcome? Addressing this question has been one of the most rewarding scientific challenges of my career.

The iterative process involves identifying the most substantial scientific questions, then adapting them according to what can actually be answered based on the existing data, which isn’t known at the outset, and also fitting them to what is doable given the person power available to assemble the data, which is a major limitation. Additional questions, including those that are lurking beyond the scientific horizon, are brought into the mix, especially if they could be tackled through an incremental additional effort to expand the dataset.

The endeavor becomes more scientifically challenging in light of the large variety of information sources about past climate, including tree rings, coral, glacier ice, and marine and lake sediments, not to mention the complicated array of data that are used to establish the timelines that underlie the paleoclimate records. Organizing a disparate assortment of data into a uniform and unified database requires a wide-ranging appreciation of the variety of data and the questions that they might usefully address, now and into the future. It requires a coordinated community of collaborative specialists.

Promoting it as a community product

A bona fide ‘community endorsed’ data product, especially one that seeks to gather evidence about a worldwide phenomenon, is based on an extended international effort. The process must be open, but not a free-for-all. Participants need avenues for genuine engagement and explicit credit for their contributions.

The PAGES2k paleo temperature database project was coordinated through the PAGES International Project Office in Switzerland. PAGES is a core project of Future Earth and is funded jointly through the US National Science Foundation and Swiss Academy of Sciences.

PAGES actively maintains a large directory of paleoclimate scientists, all of whom were invited to participate in the data compilation. This process gathered 98 volunteers from 22 countries to represent the paleoclimate community. Most of the authors contributed new data and many certified records as appropriate for inclusion in the database, including whether they met some basic quality criteria.

Some authors also annotated individual records according to their expert judgment and, in some cases, included cautionary notes about alternative or evolving interpretations. Their expertise and their identity will travel along with the data, in hopes that the data will be used wisely. In my view, the inclusion of experts’ comments and other documentation needed for intelligent reuse of paleo data is the most important development in the PAGES2k database.

To incorporate these and other innovative features, the PAGES2k database is contained in the highly flexible Linked Paleo Data (LiPD) format, which was developed along with the database itself. LiPD can accommodate the unlimited variety of data types used by paleoclimate scientists, including chronological data. A systematic version scheme has been established to track revisions as new data are added or existing records are modified. The data have also been loaded onto the LinkedEarth data-management platform, which enables transparent discussions of the evolving interpretation and versioning of individual records, and is supported by the first paleoclimate ontology.
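To make the idea of annotations and version tracking travelling with the data more concrete, here is a minimal Python sketch. It is not the LiPD schema and not the actual PAGES2k versioning scheme: the class names, the fields and the three-part version counter are assumptions made up for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    author: str   # who made the expert judgment
    note: str     # e.g. a cautionary note on interpretation


@dataclass
class ProxyRecord:
    site: str
    archive: str
    years: list
    values: list
    annotations: list = field(default_factory=list)


@dataclass
class Compilation:
    # Hypothetical counter: (release, records changed, metadata changed)
    version: tuple = (1, 0, 0)
    records: dict = field(default_factory=dict)
    changelog: list = field(default_factory=list)

    @property
    def version_string(self):
        return ".".join(str(part) for part in self.version)

    def add_record(self, key, record):
        self.records[key] = record
        self._bump(1, f"added {key}")

    def annotate(self, key, author, note):
        # The note and its author are stored on the record itself.
        self.records[key].annotations.append(Annotation(author, note))
        self._bump(2, f"annotated {key}")

    def _bump(self, level, message):
        parts = list(self.version)
        parts[level] += 1
        for i in range(level + 1, len(parts)):
            parts[i] = 0  # lower-order counters reset
        self.version = tuple(parts)
        self.changelog.append((self.version_string, message))


db = Compilation()
db.add_record("lake_x", ProxyRecord("Lake X", "lake sediment", [1000, 1010], [0.1, -0.2]))
db.annotate("lake_x", "A. Expert", "Interpretation as summer temperature is under review.")
```

In this sketch every change leaves a changelog entry tied to a version string, so anyone reusing the data can say exactly which state of the compilation they analyzed.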

It’s difficult to foresee how data-handling practices will evolve in the rapidly changing, cyber-based, data-management landscape. Yet one thing is for certain: it will be in the direction of taking better care and making better use of our scientific data assets, regardless of whether it’s considered tedious bookkeeping or challenging science.

