Nascent

December 14, 2009

Sensemaking in Multi-Fusion Environments

Jeff Jonas visited us again at the start of November and gave a talk about some of the new work that he is doing. Jeff is our first return speaker, and this time he gave us an update on his thinking about sensemaking systems and how that is affecting his ongoing work in developing a new technology.

Jeff mainly works on building sensemaking systems that can reconcile large amounts of data in real time. In brief, a sensemaking system is one that, in contrast to a data warehousing solution, does something active with each piece of data as it is acquired, rather than only storing the data for later re-use. Identity disambiguation is a problem that this class of systems has been applied to in the past; the new technique, however, will be more generally applicable. One of the difficulties with the sensemaking problem is that any individual piece of data that arrives is, on its own, hard to evaluate for relevance. Each piece of data needs somehow to be contextualised first. Jeff illustrated the underlying mechanics of such a system with an analogy to jigsaw puzzle solving.

When solving a jigsaw puzzle we make an assertion with each new puzzle piece that we pick up: it either fits perfectly in some place in the evolving solution space, or it belongs to a similar set of pieces but we don’t exactly know how yet, or we have no idea where it goes, so it is placed anywhere.

When asserting that the new piece fits into an existing piece we always favor the false negative: we never put pieces together unless we are really sure that they go together.

When we get a new connecting piece we reconsider whether, given what we now know, other pieces that we have already considered have a better placement.

Sometimes a new piece reverses an earlier assertion – e.g., determining where a piece belongs reveals a connected piece that, upon closer inspection, really did not belong. In this case, the misplaced piece is removed.

The working space needed during the process of solving the puzzle is much larger than the final solution space.

From this description it sounded to me as though Jeff’s system accumulates attributes of the things that we are interested in into folders, and then makes connections between these folders as new pieces of information come into the system.

One of the important characteristics of a usable sensemaking system is that it needs to be able to change its decision state as new information comes in. This is to ensure that the system does not drift from the truth as arriving new data invalidates earlier assertions.
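To make the folder idea a little more concrete, here is a minimal sketch in Python. It is my own toy illustration and not Jeff's system: the `Resolver` class, the `min_shared` threshold and the attribute strings are all assumptions made for the example. Each observation is a set of attributes; a new observation joins a folder only when it shares enough attributes with it (favoring the false negative), and a single connecting observation can pull previously separate folders together, which is one way the decision state can change as data arrives. It does not attempt the reversal step, where a misplaced piece is taken back out.

```python
class Resolver:
    """Toy folder-based resolver (illustrative only, not Jeff's system)."""

    def __init__(self, min_shared=2):
        # Hypothetical threshold: how many attributes must be shared before
        # two things are asserted to be the same. A higher value favors
        # false negatives over false positives.
        self.min_shared = min_shared
        self.folders = []  # each folder is a list of frozenset observations

    @staticmethod
    def _attributes(folder):
        # Union of every attribute seen in the folder so far.
        return frozenset().union(*folder)

    def add(self, observation):
        """Place a new observation and return the folder it ends up in."""
        obs = frozenset(observation)
        matches, others = [], []
        for folder in self.folders:
            shared = len(obs & self._attributes(folder))
            (matches if shared >= self.min_shared else others).append(folder)
        # A connecting observation merges every folder it matches, so
        # earlier placements are reconsidered in light of the new piece.
        merged = [obs] + [o for folder in matches for o in folder]
        self.folders = others + [merged]
        return merged
```

Feeding it two observations with nothing in common leaves two folders; a third observation that shares two attributes with each of them collapses everything into one folder.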

Systems like this end up expressing bias based on the observations they have received. So in theory, an organization could ingest slightly different data sets into different instances of the program in parallel and poll them for their views. One would be able to see dissent between these different instances.

A key issue in sensemaking is the ability of the system to count discrete objects. If you can’t count the discrete entities that you are interested in, then you can’t expect to produce high quality predictions. This is the key principle behind the new technology that Jeff is developing, in that this new work is a general counting engine. With this in mind he is currently looking for hard science problems that such an engine could be applied to, and this was one of the reasons for his visit to our offices, so if anyone reading has some ideas please post them and we’ll pass them along.

One good way to disambiguate things is being able to track their spacetime and life arcs. The same thing cannot be in two places at the same time (at least it can’t if it is large enough to not be concerned with quantum mechanical effects) and the path something takes over space and time (its life arc) can itself be a discriminating signature of identity.

Science produces very large data sets, and some of these data sets are produced quickly. Jeff hopes to be able to find problems that would benefit from the disambiguation techniques that he is working on. Trying to imagine which types of data in the scientific realm would be good candidates for this kind of analysis raises some interesting questions. Most science is produced through publication, which is a slow process and is not very real time. That said, PubMed indexed a new paper about every 40 seconds in 2009, which is quasi-real time. Often it’s not individual members of a class of objects in which we are interested. It’s not a given Higgs boson that interests us, but rather all characteristics of all Higgs bosons. That said, one of the most important jobs of detectors at particle accelerators is indeed to do exactly this kind of event disambiguation of particle trails, using spacetime paths as the key discriminating factor.
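The "cannot be in two places at the same time" rule can be sketched as a simple feasibility check. This is my own illustration under stated assumptions, not something Jeff showed: the `Sighting` type, the `could_be_same` function and the 900 km/h speed cap (roughly airliner speed) are all hypothetical. Two sightings can only belong to the same thing if the great-circle distance between them could have been covered in the time that separates them.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0


@dataclass(frozen=True)
class Sighting:
    lat: float  # degrees
    lon: float  # degrees
    t: float    # seconds since some shared epoch


def distance_km(a, b):
    """Great-circle distance between two sightings (haversine formula)."""
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlam = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def could_be_same(a, b, max_speed_kmh=900.0):
    """True if one thing could have produced both sightings.

    max_speed_kmh is an assumed upper bound on how fast the thing can move.
    """
    hours = abs(b.t - a.t) / 3600.0
    return distance_km(a, b) <= max_speed_kmh * hours
```

With these assumptions a sighting in London and one in New York an hour later are ruled out as the same thing, while the same pair ten hours apart is allowed. Passing the check only means the pairing is not impossible; it says nothing about whether it is likely.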

I wondered whether, in the context of scientifically interesting objects, one could try to do this disambiguation of paths by projecting into a more general higher-dimensional parameter space. Jeff was very clear on the point that as far as he was concerned spacetime and life arcs are the gold standard in this regard, and I’d have to agree with that. However, I think that the idea of using higher-dimensional parameter spaces has some merit.

As with his last visit, Jeff reserved his most thought-provoking idea till last. Quite recently he has been fascinated by the growing number of systems used by some companies to track mobile phone trails (life arcs). There are 600 billion transactions being generated daily in the US that contain geospatial data. Your travel patterns reveal where you spend your time and who you spend your time with, and they are highly predictive. The data is being de-identified and shared with third parties; however, re-identification of an individual is, in most cases, trivial.

This data can also be seen in real time. It can give real time analytics on the health of a store: how many people are visiting in real time, what is their average journey distance to get to that store, is that number going up or down? Jeff suggested a number of ways to raise consumer awareness of the power of this kind of information. He suggested that phone companies should provide information such as the first name and the first initial of the last name of the 10 individuals that you spend most of your time with when you are not at work or at home (notably: if there is a name on the list you do not recognize, they are probably following you). There has been some research into analysing these trails, but it’s clear that we are just beginning to scratch the surface on this.

October 21, 2009

Google Wave Science Hack Day at Nature this Friday

I'm really happy to announce that we will be hosting a hack day on Friday for developing scientific applications in Google Wave. The event was thought up by Cameron Neylon and we at Nature were able to find a room and are able to provide interweb access and coffee. The JISC DevSci project and Google will be providing pizza. If you have a Google Wave account you can check out the wave discussing this event.

We will have a number of our onsite developers taking part, some external people coming in, and quite a few people will be joining in remotely via Wave. It's going to be very interesting to see how a full day of collaboration through Wave works out.

The exact number of people coming in has not yet been finalised, but we do have some extra spaces, so if you are a developer with a Wave account in the London area or you are a scientist with some great ideas for apps that could work well for scientists then please feel free to drop me a line and we will see if we have space to fit you in. You can email me at i.mulvany@nature.com. You can, of course, pop in via Wave and say hi.

The hashtag for the event will be #swlhd (science wave london hack day) if you are into that kind of thing.

October 16, 2009

From Web 2.0 to the Global Database

I'm on my way home having just attended the 2009 Microsoft eScience Workshop at Carnegie Mellon University in Pittsburgh, where Tony Hey and his team at Microsoft Research also launched a book called The Fourth Paradigm. It's a collection of essays that provide relatively accessible accounts of the impact and potential of digital science, and has been published in memory of Jim Gray, a pioneer in this area.

I delivered a short talk summarising my essay, which was called "From Web 2.0 to the Global Database". I'm reproducing the text below, together with some of the slides I used to illustrate my talk.

(Update 20/10/09: Added link to book website.)

"From Web 2.0 to the Global Database" »

October 04, 2009

Demo Web Clients for nature.com OpenSearch

opensearch-client-dc.jpg

[Update - 2009.10.05: This post (2. Clients) is one of three. See also: 1. Service, 3. Widgets.]

The previous post described the nature.com OpenSearch service. Prior to that I posted on our new desktop widgets which use one of the XML interfaces - specifically the RSS feed.

Here we wanted to also show what can be done in the browser itself. We've created a small gallery of demo clients which all use the text-based JSON interface (or rather JSONP for cross-site scripting purposes). You can find the demos here:

http://nurture.nature.com/opensearch/apps
These demo apps show how the JSON interface can be used to build very simple web clients for search. They make use of an early OpenSearch JavaScript library which has classes for OpenSearch and SRU responses. The demos show how to link back to the nature.com platform (using the DOI), how to locate metadata properties, how to use OpenSearch links for pagination, how to compare OpenSearch and SRU views, how to extract RDF triples, etc. They are simply intended to show how easy it is to access nature.com search remotely. We hope you find them fun to use.
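For anyone who wants to poke at a JSONP response outside the browser, the callback wrapper has to be stripped before the payload can be parsed as JSON. The following is a minimal sketch of that general technique, not code from the demo gallery; the `unwrap_jsonp` helper and the field names in the usage note are hypothetical.

```python
import json
import re

# Matches `callbackName( ...payload... )` with an optional trailing semicolon.
_JSONP = re.compile(r"^\s*([A-Za-z_$][\w$.]*)\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def unwrap_jsonp(text):
    """Split a JSONP response into (callback name, parsed JSON payload)."""
    match = _JSONP.match(text)
    if match is None:
        raise ValueError("not a JSONP response")
    return match.group(1), json.loads(match.group(2))
```

For example, `unwrap_jsonp('handle({"title": "A", "doi": "10.1000/x"});')` gives back the callback name `handle` and a dictionary with the two made-up fields.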


October 03, 2009

nature.com OpenSearch

opensearch-interfaces.png

[Update - 2009.10.05: This post (1. Service) is one of three. See also: 2. Clients, 3. Widgets.]

Earlier this week we soft-launched a new service: nature.com OpenSearch. Simply put, nature.com OpenSearch provides a structured resource discovery facility for content hosted on nature.com. In effect, this is a sister service to our regular nature.com search service which allows a user to query nature.com and browse the result sets. By contrast, the new service allows applications to query nature.com and to fetch the results back in formats of their choosing. The diagram above attempts to compare the existing user-oriented nature.com search service at a) with the new application-oriented nature.com OpenSearch service at b). Applications from widgets to web pages (and beyond) are the immediate clients of the service. (A companion post here already discussed the new nature.com search widgets which are one such application.)

In terms of interfacing to the service, machine-readable description documents are available for both OpenSearch and SRU (Search and Retrieval via URL) modes of access. These documents are referenced from autodiscovery links which are beginning to be added to all our nature.com web pages. Each web page thus not only links to our search but also provides the instructions on 'how to search'.

Queries can be made either with simple search terms or in CQL, a high-level query language designed to be human readable and writable, and to be intuitive while maintaining the expressiveness of more complex languages. Result sets can be returned in a variety of media types, both text (HTML and JSON) and XML (SRU, Atom, RSS). Media types are selectable using HTTP content negotiation or by using the specific parameter 'httpAccept'.
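As a rough sketch of the two ways of selecting a media type, the Python below builds (but does not send) a search request in each style. The base URL is a placeholder and not the real service address; `query` and `maximumRecords` are standard SRU parameter names, but whether the service expects exactly this set is an assumption, as is the `dc.title` index used in the usage note. The description documents are the place to check what the service actually accepts.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint, NOT the real service address.
BASE_URL = "https://search.example.org/sru"


def build_search_url(query, media_type=None, maximum_records=10):
    """Build a search URL, optionally selecting the format via 'httpAccept'."""
    params = {"query": query, "maximumRecords": maximum_records}
    if media_type is not None:
        params["httpAccept"] = media_type
    return BASE_URL + "?" + urlencode(params)


def build_negotiated_request(query, media_type, maximum_records=10):
    """Build a request that selects the format via HTTP content negotiation."""
    url = build_search_url(query, maximum_records=maximum_records)
    return Request(url, headers={"Accept": media_type})
```

Calling `build_search_url('dc.title = "stem cells"', "application/json")` puts the media type in the URL itself, which is handy where headers cannot be set, such as a plain link. `build_negotiated_request` leaves the URL free of format details and puts the media type in the `Accept` header instead.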

opensearch-querytype-results.png

And what does this all mean? Well, it really amounts to the ability to run off-platform search, i.e. I can now run my search over nature.com anywhere I choose to run it. For example, say I want to run it right here in this blog post, I can. Let's jig up a simple interface. What we'll do is to run a full-text keyword search and just list out the raw properties of the first item returned with no real attempt at styling. (The CQL checkbox just allows a CQL query to be input, otherwise the search terms are sent to the server as simple alternates.)

"nature.com OpenSearch" »

"Nascent Web publishing efforts have their genesis in a burning need to say something, but their ultimate success comes from people wanting to listen, needing to hear each other’s voices, and answering in kind."
Rick Levine
The Cluetrain Manifesto
