
October 21, 2009

Google Wave Science Hack Day at Nature this Friday

I'm really happy to announce that we will be hosting a hack day on Friday for developing scientific applications in Google Wave. The event was thought up by Cameron Neylon, and we at Nature were able to find a room and can provide interweb access and coffee. The JISC DevSci project and Google will be providing pizza. If you have a Google Wave account you can check out the wave discussing this event.

We will have a number of our onsite developers taking part, some external people coming in, and quite a few people will be joining in remotely via Wave. It's going to be very interesting to see how a full day of collaboration through Wave works out.

The exact number of people coming in has not yet been finalised, but we do have some extra spaces. If you are a developer in the London area with a Wave account, or a scientist with some great ideas for apps that could work well for scientists, then please feel free to drop me a line and we will see if we have space to fit you in. You can email me at i.mulvany@nature.com. You can, of course, also pop in via Wave and say hi.

The hashtag for the event will be #swlhd (science wave london hack day) if you are into that kind of thing.

October 16, 2009

From Web 2.0 to the Global Database

I'm on my way home having just attended the 2009 Microsoft eScience Workshop at Carnegie Mellon University in Pittsburgh, where Tony Hey and his team at Microsoft Research also launched a book called The Fourth Paradigm. It's a collection of essays that provide relatively accessible accounts of the impact and potential of digital science, and has been published in memory of Jim Gray, a pioneer in this area.

I delivered a short talk summarising my essay, which was called "From Web 2.0 to the Global Database". I'm reproducing the text below, together with some of the slides I used to illustrate my talk.

(Update 20/10/09: Added link to book website.)

4thParadigm_Hannay.001.png

4thParadigm_Hannay.004.png

One of the most articulate of web commentators, Clay Shirky, put it best. During his "Lessons from Napster" talk at the O’Reilly Peer-to-Peer Conference in 2001, he invited his audience to consider the infamous prediction of IBM’s creator, Thomas Watson...

4thParadigm_Hannay.005.png

...that the world market for computers would plateau at somewhere around five [1].

4thParadigm_Hannay.006.png

No doubt some of the people listening that day were themselves carrying more than that number of computers on their laps or their wrists and in their pockets or their bags. And that was even before considering all the other computers about them in the room—inside the projector, the sound system, the air conditioners, and so on. But only when the giggling subsided did he land his killer blow. "We now know that that number was wrong," said Shirky. "He overestimated by four." Cue waves of hilarity from the assembled throng.

4thParadigm_Hannay.007.png

Shirky’s point, of course, was that the defining characteristic of the Web age is not so much the ubiquity of computing devices (transformational though that is) but rather their interconnectedness. We are rapidly reaching a time when any device not connected to the Internet will hardly seem like a computer at all. The network, as they say, is the computer.

4thParadigm_Hannay.008.png

This fact—together with the related observation that the dominant computing platform of our time is not Unix or Windows or Mac OS, but rather the Web itself—led Tim O’Reilly to develop a vision for what he once called an "Internet operating system" [2], which subsequently evolved into a meme now known around the world as "Web 2.0" [3].

4thParadigm_Hannay.009.png

Wrapped in that pithy (and now, unfortunately, overexploited) phrase are two important themes. First, Web 2.0 acted as a reminder that, despite the dot-com crash of 2001, the Web was—and still is—changing the world in profound ways. Second, it incorporated a series of best-practice themes (or "design patterns and business models") for maximizing and capturing this potential. These themes included:

4thParadigm_Hannay.010.png

The first of these has become widely seen as the most significant. The Web is more powerful than the platforms that preceded it because it is an open network and lends itself particularly well to applications that enable collaboration. As a result, the most successful Web applications use the network on which they are built to produce their own network effects, sometimes creating apparently unstoppable momentum. This is how a whole new economy can arise in the form of eBay. And how tiny craigslist and Wikipedia can take on the might of mainstream media and reference publishing, and how Google can produce excellent search results by surreptitiously recruiting every creator of a Web link to its cause.

If the Web 2.0 vision emphasizes the global, collaborative nature of this new medium, how is it being put to use in perhaps the most global and collaborative of all human endeavors, scientific research? Perhaps ironically, especially given the origins of the Web at CERN [4], scientists have been relatively slow to embrace approaches that fully exploit the Web, at least in their professional lives. Blogging, for example, has not taken off in the same way that it has among technologists, political pundits, economists, or even mathematicians. Furthermore, collaborative environments such as OpenWetWare and Nature Network have yet to achieve anything like mainstream status among researchers.

4thParadigm_Hannay.011.png

Physicists long ago learned to share their findings with one another using the arXiv preprint server, but only because it replicated habits that they had previously pursued by post and then e-mail. Life and Earth scientists, in contrast, have been slower to adopt similar services, such as Nature Precedings.

This is because the barriers to full-scale adoption are not only (or even mainly) technical, but also psychological and social. Old habits die hard, and incentive systems originally created to encourage information sharing through scientific journals can now have the perverse effect of discouraging similar activities by other routes.

4thParadigm_Hannay.012.png

Yet even if these new approaches are growing more slowly than some of us would wish, they are still growing. And though the timing of change is difficult to predict, the long-term trends in scientific research are unmistakable: greater specialization, more immediate and open information sharing, a reduction in the size of the "minimum publishable unit," productivity measures that look beyond journal publication records, a blurring of the boundaries between journals and databases, and reinventions of the roles of publishers and editors. Most important of all—and arising from this gradual but inevitable embrace of information technology—we will see an increase in the rate at which new discoveries are made and put to use. Laboratories of the future will indeed hum to the tune of a genuinely new kind of computationally driven, interconnected, Web-enabled science.

4thParadigm_Hannay.013.png

Look, for example, at chemistry. That granddaddy of all collaborative sites, Wikipedia, now contains a great deal of high-quality scientific information, much of it provided by scientists themselves. This includes rich, well organized, and interlinked information about many thousands of chemical compounds. Meanwhile, more specialized resources from both public and private initiatives, notably PubChem and ChemSpider, are growing in content, contributions, and usage despite the fact that chemistry has historically been a rather proprietary domain. (Or perhaps in part because of it, but that is a different essay.)

And speaking of proprietary domains, consider drug discovery. InnoCentive, a company spun off from Eli Lilly, has blazed a trail with a model of open, Web-enabled innovation that involves organizations reaching outside their walls to solve research-related challenges. Several other pharmaceutical companies that I have spoken with in recent months have also begun to embrace similar approaches, not principally as acts of goodwill but in order to further their corporate aims, both scientific and commercial.

4thParadigm_Hannay.014.png

In industry and academia alike, one of the most important forces driving the adoption of technologically enabled collaboration is sheer necessity. Gone are the days when a lone researcher could make a meaningful contribution to, say, molecular biology without access to the data, skills, or analyses of others. As a result, over the last couple of decades many fields of research, especially in biology, have evolved from a "cottage industry" model (one small research team in a single location doing everything from collecting the data to writing the paper) into a more "industrial" one (large, distributed teams of specialists collaborating across time and space toward a common end).

In the process, they are gathering vast quantities of data, with each stage in the progression being accompanied by volume increases that are not linear but exponential. The sequencing of genes, for example, has long since given way to whole genomes, and now to entire species [5] and ecosystems [6]. Similarly, one-dimensional protein-sequence data has given way to three-dimensional protein structures, and more recently to high-dimensional protein interaction datasets.

4thParadigm_Hannay.015.png

This brings changes that are not just quantitative but also qualitative. Chris Anderson has been criticized for his Wired article claiming that the accumulation and analysis of such vast quantities of data spells the end of science as we know it [7], but he is surely correct in his milder (but still very significant) claim that there comes a point in this process when "more is different." Just as an information retrieval algorithm like Google’s PageRank [8] required the Web to reach a certain scale before it could function at all, so new approaches to scientific discovery will be enabled by the sheer scale of the datasets we are accumulating.

4thParadigm_Hannay.016.png

But realizing this value will not be easy. Everyone concerned, not least researchers and publishers, will need to work hard to make the data more useful. This will involve a range of approaches, from the relatively formal, such as well-defined standard data formats and globally agreed identifiers and ontologies, to looser ones, like free-text tags [9] and HTML microformats [10]. These, alongside automated approaches such as text mining [11], will help to give each piece of information context with respect to all the others. It will also enable two hitherto largely separate domains—the textual, semi-structured world of journals and the numeric, highly structured world of databases—to come together into one integrated whole. As the information held in journals becomes more structured, as that held in many databases becomes more curated, and as these two domains establish richer mutual links, the distinction between them might one day become so fuzzy as to be meaningless.

Improved data structures and richer annotations will be achieved in large part by starting at the source: the laboratory. In certain projects and fields, we already see reagents, experiments, and datasets being organized and managed by sophisticated laboratory information systems. Increasingly, we will also see the researchers’ notes move from paper to screen in the form of electronic laboratory notebooks, enabling them to better integrate with the rest of the information being generated. In areas of clinical significance, these will also link to biopsy and patient information. And so, from lab bench to research paper to clinic, from one finding to another, we will join the dots as we explore terra incognita, mapping out detailed relationships where before we had only a few crude lines on an otherwise blank chart.

4thParadigm_Hannay.017.png

Scientific knowledge—indeed, all of human knowledge—is fundamentally connected [12], and the associations are every bit as enlightening as the facts themselves. So even as the quantity of data astonishingly balloons before us, we must not overlook an even more significant development that demands our recognition and support: that the information itself is also becoming more interconnected. One link, tag, or ID at a time, the world’s data are being joined together into a single seething mass that will give us not just one global computer, but also one global database. As befits this role, it will be vast, messy, inconsistent, and confusing. But it will also be of immeasurable value—and a lasting testament to our species and our age.

References

[1] C. Shirky, "Lessons from Napster," talk delivered at the O’Reilly Peer-to-Peer Conference, Feb. 15, 2001.

[2] T. O’Reilly, "Inventing the Future," 2002.

[3] T. O’Reilly, "What Is Web 2.0," 2005.

[4] T. Berners-Lee, Weaving the Web. San Francisco: HarperOne, 1999.

[5] "International Consortium Announces the 1000 Genomes Project."

[6] J. C. Venter et al., "Environmental genome shotgun sequencing of the Sargasso Sea," Science, vol. 304, pp. 66–74, 2004, doi:10.1126/science.1093857.

[7] C. Anderson, "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete," Wired, July 2008.

[8] S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," 1998.

[9] Wikipedia, "Tag (metadata)," http://en.wikipedia.org/wiki/Tag_(metadata)

[10] Wikipedia, "Microformat," http://en.wikipedia.org/wiki/Microformat

[11] Wikipedia, "Text mining," http://en.wikipedia.org/wiki/Text_mining

[12] E. O. Wilson, Consilience: The Unity of Knowledge. New York: Knopf, 1998.

October 04, 2009

Demo Web Clients for nature.com OpenSearch

opensearch-client-dc.jpg

[Update - 2009.10.05: This post (2. Clients) is one of three. See also: 1. Service, 3. Widgets.]

The previous post described the nature.com OpenSearch service. Prior to that I posted on our new desktop widgets which use one of the XML interfaces - specifically the RSS feed.

Here we wanted to also show what can be done in the browser itself. We've created a small gallery of demo clients which all use the text-based JSON interface (or rather JSONP, so that requests can be made across domains). You can find the demos here:

http://nurture.nature.com/opensearch/apps

These demo apps show how the JSON interface can be used to build very simple web clients for search. They make use of an early OpenSearch JavaScript library which has classes for OpenSearch and SRU responses. The demos show how to link back to the nature.com platform (using the DOI), how to locate metadata properties, how to use OpenSearch links for pagination, how to compare OpenSearch and SRU views, how to extract RDF triples, etc. They are simply intended to show how easy it is to access nature.com search remotely. We hope you find them fun to use.
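To make the JSONP point concrete: a JSONP response is simply a JSON document wrapped in a call to a function that the client names. The sketch below is in Python rather than the browser JavaScript the demos use, and it only shows how such a wrapper can be removed. The callback name and the payload fields are invented for illustration and should not be read as the actual nature.com response format.

```python
import json
import re

def unwrap_jsonp(text):
    """Strip a JSONP wrapper such as  callback({...});  and parse the JSON inside."""
    match = re.match(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", text, re.DOTALL)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))

# Hypothetical payload: the callback and field names are assumptions.
sample = 'handleResults({"title": "example search", "totalResults": 2});'
data = unwrap_jsonp(sample)
print(data["totalResults"])
```

In a browser the wrapper is never removed by hand: the response is loaded as a script and the named callback receives the parsed object directly.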


October 03, 2009

nature.com OpenSearch

opensearch-interfaces.png

[Update - 2009.10.05: This post (1. Service) is one of three. See also: 2. Clients, 3. Widgets.]

Earlier this week we soft-launched a new service: nature.com OpenSearch. Simply put, nature.com OpenSearch provides a structured resource discovery facility for content hosted on nature.com. In effect, this is a sister service to our regular nature.com search service which allows a user to query nature.com and browse the result sets. By contrast, the new service allows applications to query nature.com and to fetch the results back in formats of their choosing. The diagram above attempts to compare the existing user-oriented nature.com search service at a) with the new application-oriented nature.com OpenSearch service at b). Applications from widgets to web pages (and beyond) are the immediate clients of the service. (A companion post here already discussed the new nature.com search widgets which are one such application.)

In terms of interfacing to the service, machine-readable description documents are available for both OpenSearch and SRU (Search and Retrieval via URL) modes of access. These documents are referenced from autodiscovery links which are beginning to be added to all our nature.com web pages. Each web page thus links not only to our search, but more than that it provides the instructions on 'how to search'.
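For readers who have not met autodiscovery before, here is a rough sketch of how a client might find the description document from a page. It assumes the usual OpenSearch convention of a `<link rel="search">` element with the media type `application/opensearchdescription+xml`. The page fragment and its `href` are made up, and the sketch covers only the OpenSearch description, not the SRU one.

```python
from html.parser import HTMLParser

OPENSEARCH_TYPE = "application/opensearchdescription+xml"

class AutodiscoveryFinder(HTMLParser):
    """Collect the href of every OpenSearch autodiscovery <link> in a page."""

    def __init__(self):
        super().__init__()
        self.description_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "search" in rel_values and attributes.get("type") == OPENSEARCH_TYPE:
            self.description_urls.append(attributes.get("href"))

# Made-up page fragment; the href is a placeholder, not a real nature.com URL.
page = """
<html><head>
  <link rel="stylesheet" href="/style.css">
  <link rel="search" type="application/opensearchdescription+xml"
        href="http://example.org/opensearch/description.xml" title="Example search">
</head><body></body></html>
"""

finder = AutodiscoveryFinder()
finder.feed(page)
print(finder.description_urls)
```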

Query is either by simple search terms or by the query language CQL which is a high-level query language designed to be human readable and writable, and to be intuitive while maintaining the expressiveness of more complex languages. Result sets can be returned in a variety of media types, both text (HTML and JSON) and XML (SRU, ATOM, RSS). Media types are selectable using HTTP content negotiation or by using the specific parameter 'httpAccept'.
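As a sketch of the two ways of choosing a media type, the snippet below builds a request URL carrying 'httpAccept' and, alternatively, an Accept header for content negotiation. It makes no network calls. The endpoint is the service endpoint listed further down in this post, but the use of 'query' as the parameter name (as in SRU) and the exact media type strings the server accepts are assumptions here, so please check the technical documentation before relying on them.

```python
from urllib.parse import urlencode

ENDPOINT = "http://www.nature.com/opensearch/request"  # service endpoint from this post

def build_request_url(query, media_type=None):
    """Return a request URL; media_type, if given, is passed via 'httpAccept'."""
    params = {"query": query}  # 'query' as in SRU; assumed to apply here
    if media_type is not None:
        params["httpAccept"] = media_type
    return ENDPOINT + "?" + urlencode(params)

def build_accept_header(media_type):
    """Alternative: leave the URL alone and ask via HTTP content negotiation."""
    return {"Accept": media_type}

url = build_request_url("darwin and finches", "application/json")
print(url)
print(build_accept_header("application/rss+xml"))
```

The parameter form is handy where a client cannot set request headers, such as a plain link or a script tag.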

opensearch-querytype-results.png

And what does this all mean? Well, it really amounts to the ability to run off-platform search, i.e. I can now run my search over nature.com anywhere I choose to run it. For example, say I want to run it right here in this blog post, I can. Let's jig up a simple interface. What we'll do is to run a full-text keyword search and just list out the raw properties of the first item returned with no real attempt at styling. (The CQL checkbox just allows a CQL query to be input, otherwise the search terms are sent to the server as simple alternates.)
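In outline, the logic behind such an interface might look something like the following. This is a sketch under stated assumptions rather than the code used in this post: it reads "simple alternates" as joining the keywords with `or`, and it treats a result item as a plain dictionary of properties, which is a simplification of whatever the JSON response actually contains.

```python
def to_query(user_input, is_cql=False):
    """Pass CQL through untouched; otherwise OR the keywords together."""
    if is_cql:
        return user_input.strip()
    terms = user_input.split()
    # Quote each keyword so that words like "and" are not read as operators.
    return " or ".join('"%s"' % term.replace('"', '\\"') for term in terms)

def list_first_item(items):
    """Return 'name: value' lines for the first item, with no styling."""
    first = items[0] if items else {}
    return ["%s: %s" % (name, value) for name, value in first.items()]

print(to_query("stem cell"))
print(to_query("stem and cell", is_cql=True))
# Hypothetical item; the property names are placeholders.
print(list_first_item([{"title": "An example article", "doi": "10.0000/example"}]))
```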



We would be very pleased to receive any feedback on this service. You can post comments either here on the blog or else you can mail them direct to interfaces@nature.com.

The nature.com OpenSearch service has two access points:

User interface:
http://www.nature.com/opensearch
Service endpoint:
http://www.nature.com/opensearch/request

Technical documentation on the service is available here on our Librarian Gateway.

Special credits go to Ralph LeVan of OCLC for not only creating the excellent open-source oclcsrw software package but also updating the package to support HTTP content negotiation and to make it generally more responsive to OpenSearch requests, for example by allowing for alternate formats. We'd like especially to thank Ralph for suffering our numerous questions and for responding to our various requests for new features in a very timely fashion so as to make this service possible within the project timelines. Thanks are also due to Nawab Siddiqui of Nature Publishing Group for doing the actual hard graft in implementing this service for nature.com and also for leaping agilely into this new terrain.

October 01, 2009

Desktop Widgets: nature.com search

opensearch-widget-fliprollie.jpg

[Update - 2009.10.05: This post (3. Widgets) is one of three. See also: 1. Service, 2. Clients.]

The newly launched nature.com OpenSearch web service (which I'll discuss in a separate post) is an interface that provides distributed access to search on the nature.com platform. Specifically, the interface allows for structured queries from remote clients as well as for structured responses, and implements two compatible industry standards for search: OpenSearch and SRU (Search and Retrieval via URL).

As a practical demonstration of this distributed access we have developed a nature.com search desktop widget which is a small standalone app that runs on a user's desktop and interacts with the nature.com OpenSearch server by sending a simple URL request and receiving in response a regular RSS feed. This URL request closely mirrors the request strings in the OpenSearch URL templates that are now being linked to from a growing number of our web pages.
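To illustrate what an OpenSearch URL template is, here is a small sketch that fills one in. The placeholder syntax (`{searchTerms}`, with a trailing `?` marking an optional parameter such as `{startPage?}`) follows the OpenSearch 1.1 conventions. The template itself is hypothetical, and the real nature.com templates may well use different parameters.

```python
import re
from urllib.parse import quote

def fill_template(template, **values):
    """Substitute {name} and {name?} placeholders in an OpenSearch URL template."""
    def substitute(match):
        name, optional = match.group(1), match.group(2) == "?"
        if name in values:
            return quote(str(values[name]), safe="")
        if optional:
            return ""  # optional parameters may be left empty
        raise KeyError("missing required template parameter: " + name)
    return re.sub(r"\{([A-Za-z:]+)(\??)\}", substitute, template)

# Hypothetical template, for illustration only.
template = "http://example.org/search?q={searchTerms}&start={startPage?}&n={count?}"
print(fill_template(template, searchTerms="stem cells", count=10))
```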

The actual parameter set used in the query makes up a complete SRU request and the response from the nature.com OpenSearch server is a full SRU response which is hosted on an SRU extension format - here, RSS. Technically speaking, the widget is an SRU client and nature.com OpenSearch is an SRU server.
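On the receiving side, a client that only cares about the RSS layer can read the feed with any XML parser. The sketch below does this for a made-up RSS 2.0 feed. A real response from the server would also carry the SRU extension elements, which are left out here.

```python
import xml.etree.ElementTree as ET

# Made-up feed; a real response would also carry SRU extension elements.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example search results</title>
    <item>
      <title>First result</title>
      <link>http://example.org/articles/1</link>
    </item>
    <item>
      <title>Second result</title>
      <link>http://example.org/articles/2</link>
    </item>
  </channel>
</rss>"""

def list_items(rss_text):
    """Return (title, link) pairs for each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

for title, link in list_items(SAMPLE_RSS):
    print(title, link)
```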

The real value add, though, in the present widget/server exchange over and above the simple URL template interface commonly used by OpenSearch clients lies in the query language that it uses. This is CQL - the Contextual Query Language - which is a standards-based means for expressing queries and enables both keyword and fielded searches. The widget opens with a simple search input box (see A in the figure below) which allows for regular OpenSearch-style keyword searches. But click the forms icon to the right (circled in the figure) and a drop-down form appears (see B in the figure below) with the full nature.com search interface. You can now query based on any of the usual citation elements: title, author, date, volume, issue or page. Or if the DOI is known that can be used directly. Note that by default searches are run across the entire journal collection, although individual journal titles can be selected (and remembered) as a user preference by clicking the 'i' icon at bottom right of the form at B.

widget-close-open.png

You can also preview the actual CQL query that is sent to the server by peeking under the form using the peelback icon at bottom left in the form at B.
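As a rough idea of what such a query looks like, the sketch below turns a set of form fields into a CQL string by joining `index = "term"` clauses with `and`. The index names used (title, author, volume) are placeholders for illustration. The names the server actually recognises are given in its description documents, and the widget may build its query differently.

```python
def cql_term(value):
    """Quote a search term for CQL, escaping backslashes and double quotes."""
    return '"' + str(value).replace("\\", "\\\\").replace('"', '\\"') + '"'

def build_cql(keywords=None, **fields):
    """AND together an optional keyword clause and one clause per filled-in field."""
    clauses = []
    if keywords:
        clauses.append(cql_term(keywords))
    for index, value in fields.items():
        if value:  # skip fields the user left blank
            clauses.append("%s = %s" % (index, cql_term(value)))
    return " and ".join(clauses)

# The index names here are assumptions, not the server's actual indexes.
print(build_cql(title="stem cells", author="Smith", volume=""))
```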

This widget is available in two separate flavours:

  • Apple Dashboard (for Mac)
  • Yahoo! Widget (for PC and Mac)

To introduce the widget better we have uploaded a short (4 min) screencast to YouTube which demonstrates some of the basic features.

Many thanks to Andrew Mee at Nature Publishing Group who developed both of these packages and produced the screencast.

Further information about these widgets and download details are available on our widgets page at:

http://www.nature.com/widgets

We hope you enjoy using these widgets. Send any comments or feedback to interfaces@nature.com.

"Nascent Web publishing efforts have their genesis in a burning need to say something, but their ultimate success comes from people wanting to listen, needing to hear each other’s voices, and answering in kind."
Rick Levine
The Cluetrain Manifesto
