Scientific Data

Data Matters: Interview with Anne Schöler

November 27, 2014 | 11:25 am | Posted by Sylvia Tippmann | Category: Data Matters

Anne Schöler is a post-doctoral fellow at the Helmholtz Zentrum in Munich in the Research Unit of Environmental Genomics.

Which broad research field do you call home?

I would consider myself a geneticist, so I am interested in the genomes of organisms on earth. I started out on mammalian genomics and recently moved to environmental genomics.

Which environment are you looking at?

I am particularly interested in soil, one of the most complex environments on earth.

Why is that?

Soils have developed over centuries and play many important roles – supporting plant growth, provision of clean drinking water, decomposition of waste – but it remains unclear exactly how soils carry out these functions and how the huge diversity of microorganisms that live in the soil relates to that.

Data Matters presents a series of interviews with scientists, funders and librarians on topics related to data sharing and standards.

So when you look at the ‘genome of soil’, you don’t look at one genome but at many?

Yes, in fact it’s estimated that one gram of soil harbors around 50,000 different microorganisms – so it’s basically impossible to select one out of them. But with the development of new sequencing techniques we can now look at all of them at the same time and try to explore the full diversity.

So, what is your research question?

Well, for the majority of the microorganisms in soil we don’t know ‘who they are’ – it’s likely that when we find new microorganisms we will also find new proteins. I am interested in using these proteins and metabolites to find new useful drugs or to help develop novel, more sustainable production methods for commodity chemicals.

What could be such a compound?

Intensive research has been focussing on cellulases, the complex group of enzymes that degrades cellulose, which is the most abundant organic polymer on earth. So identifying novel cellulases in soil that have a high activity could be useful to improve the industrial degradation of biomass to produce cellulosic ethanol.

Ok, so let’s get technical: What does your typical experiment look like?

We do sequencing, so we would sample some soil that we are interested in, for example agricultural soil, and then we extract all the DNA that is contained in this soil and sequence it.

What’s your raw data?

So we basically obtain a lot of little pieces of DNA that come from all the organisms that live in the soil. It’s like a puzzle, but we don’t have all the pieces.

Why not?

First of all, most data are annotated by comparing them to some kind of database, and these databases are not complete. Second, a very large and expensive effort is needed to fully sequence a soil. The majority of microorganisms in soil are thought to have a very low abundance, so it’s difficult to assess how much you have to sequence to obtain a complete metagenome. Depending on your question, it is generally not necessary to obtain a full picture; usually the dominant players are also the most important ones. Here, it makes a lot of sense to share money and sequencing efforts, especially if several researchers are interested in different specific pathways and have no competing interests.

How open is the community to sharing the sequencing data?

Very open, I would say, at least regarding published data. As with any genomics dataset, researchers are obliged to publish all raw sequencing data upon publication of the original research. This can be done in SRA or a specific database called MG-RAST. So the problem lies not so much with sharing the data, but more with judging if metagenomic data can be compared. Soil is a very complex habitat and harbours much more life than just microorganisms. Many environmental parameters (metadata) have an influence on the metagenome (such as the pH or how the DNA was extracted) and can bias the analysis. In order to compare metagenomes that were generated in different laboratories you have to take these biases into account. To date, however, it is unclear which soil parameters are the most important ones and would have to be identical in order to compare the soils. Therefore, supplying metadata is of key importance.

What does this mean for your research?

Depending on my research question these biases are more or less crucial. A lot of research is aimed at understanding the influence of environmental factors on microorganisms in the soil. Here we have to be very careful not to overlook any biases if we were to compare soils. If, for example, I want to compare the metagenomes of two soils in different climates, I cannot be certain that any differences I detect can really be explained by the change in climatic conditions if I don’t have all the metadata that tell me that the soils are really comparable.

Would you like your research community to move more into the direction of sharing data and how would you like a journal like Scientific Data to support this process?

I think journals could help educate the community about the importance of robust metadata and try to integrate metagenomic datasets into public databases. Public databases are biased towards culturable microbes. Integrating metagenomic data would provide a better representation of reality. Journals could also publish re-analysed data. You don’t always have to generate new data; it can also be highly useful to analyse existing datasets and find new connections between them.

Interview by Sylvia Tippmann, a freelance journalist based in London, UK
