
Jeff Jonas Web Seminar at Nature

On Friday the 4th of April Jeff Jonas came in to give the latest installment of our Tech Talks. Jeff is the chief scientist for IBM's Entity Analytics, but that is just one data point in what, over the course of his talk, turned out to be a very rich context.

He managed to jam in about 90 slides in 45 minutes, so I'm mostly going to paraphrase what he was saying in his presentation, as it went by so quickly.

As this is quite a long blog post I'll save you the trouble of reading it by giving away the ending right now: the main theme that Jeff talked about was data. Lots of data, almost mind-staggeringly huge volumes of data, and how to deal with it all. The answer is to construct a system in which each of the nodes (or sensors) reporting information provides that information in a format that can be stitched together in a contextually aware way. By stepping away from extracting a signal from a single datum, and instead building a way to look at the context in which that datum lives, you can solve interesting problems. That's kind of the big picture.

At the end of his talk he also entertained us with some of his thoughts on diverse topics, from the total surveillance state to how safe is the world really? The longer write up is below the break.

The longer version begins now. Jeff started out in the IT industry by looking at questions relating to identifying people who were trying to hide their identities. This was initially work for credit and collection agencies, who often deal with people falsifying records for lots of different reasons. Though he didn't say so explicitly, it was probably solving these kinds of problems that led to his very interesting take on dealing with data. This work soon found interested clients in the casinos of Las Vegas.

Las Vegas is a place where there is quite an incentive to cheat, in that a correctly worked scam can net a large profit for the perpetrators in a very short space of time. He showed a video of a table where one of the gamblers swapped his deck for the dealer's deck. These cards had been ordered in such a way that everyone at the table knew the ordering of the cards, enabling them to play a deterministic game and cheat the casino. It would probably be prudent for these kinds of people to try to hide their identities, and also prudent for the casino to try to recognise such people when they turn up at the casino's doorstep.

He detailed the lengths that some criminal organisations go to in order to introduce people who will cheat to the casinos in such a way that the casino will have no prior information about these people. Quite a way is the answer.

On the other hand there is a lot of data out there, and if you could solve the puzzle by stitching the data together then you might be able to stay one step ahead and intercept these scams before or while they happen, instead of finding out a lot later.

He devised a system called NORA (non-obvious relationship awareness) for these clients to tackle just this problem. Then in 1998 he was asked to speak about this work at a public NSA-hosted conference. It was after this that his company was acquired by IBM.

He described a conversation with a counterterrorism intelligence analyst where he asked her what she could wish for. She said that she wished she could get answers faster, to which Jeff replied, 'what are the chances that you can ask every smart question every day?'. The point here is that sometimes a question that is asked today needs to wait until some event happens in the future before it can have a meaningful or relevant answer. You probably can't ask that question every day, but if there was a way to put that question into storage and allow it to become active when the data that is required to give it relevance shows up, then this would be a useful way of dealing with the question. In fact what you are doing is treating the question like data. One of Jeff's key points is that you have to allow the data to find the data and the relevance must find the user. OK, so that might sound a bit cryptic, but the basic idea behind it is pretty straightforward.

The current situation is that organisations have huge piles of data, but that data tends to reside in separate relational silos, and these silos don't talk to one another. Moreover a query against one of these silos tends to match only exact terms, so if you are searching for 'Bill' and you have perhaps millions of names in your data set you are not going to get results for anyone called 'William', or 'Bil' or 'Billy', even though as humans we know that semantically these forms are all related. Aside from the lack of semantic reconciliation within one data set, different data sets rarely connect to one another. For instance a database containing fraud investigations is rarely connected to one’s own employee database. Jeff described this by saying that this data is isolated and as a result our perception of the data is isolated.
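To make the 'Bill' versus 'William' point concrete, here is a minimal sketch in Python of the difference between an exact-term query and one that attempts a crude semantic reconciliation first. The nickname table and the function names are my own illustrative assumptions, not anything Jeff showed, and a real system would need far richer name handling.

```python
# Hypothetical nickname table, for illustration only.
NICKNAMES = {"bill": "william", "bil": "william", "billy": "william"}


def canonical_first_name(name):
    """Map a first name to an assumed canonical form (lower-cased)."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def exact_match(query, names):
    """What a typical silo query does: match the exact term only."""
    return [n for n in names if n == query]


def reconciled_match(query, names):
    """Match on the canonical form, so related spellings are found."""
    target = canonical_first_name(query)
    return [n for n in names if canonical_first_name(n) == target]


names = ["William", "Bil", "Billy", "Bob"]
print(exact_match("Bill", names))       # nothing matches the exact term
print(reconciled_match("Bill", names))  # the related forms are found
```

The exact query comes back empty, while the reconciled query returns the three related forms and leaves 'Bob' alone.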

In order to solve a puzzle, such as determining connections between fraud rings and insiders, you need to give a context to your data so that you can begin to infer relationships within the data. He described this as creating persistent context.

OK, so what is the pseudo-algorithm for dealing with all of this messiness?

Let's take as an example a record for renting an apartment. This might contain a name, address, date of birth, and perhaps a phone number. Let's say that the name is 'Bill Weather'. Now let's take a record in the apartment’s eviction database. Again this may contain a name, date of birth, and address. Perhaps the name in this case is 'William Weather', but the address is the same as in the first case. Tying these together tells us that any time we encounter 'William Weather' or 'Billy Weather' at this address in the future, we are probably dealing with the same person, through the glue of the same address. In order to create a data engine that can do this kind of matching you have to extract key features (names, addresses, phones, etc.) from all of your sources into one store where semantic reconciliation is attempted on each new record in real time. These key features generally represent the 'who', 'what', 'where' and 'when' available on each individual observation (that is, the atomic level of the data).
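The apartment example can be sketched as a tiny single store that extracts key features from each record and tries to reconcile every new record on arrival. This is only a sketch under my own assumptions: the class and function names, the nickname table, the street address and the matching rule (same canonical name plus same normalised address) are all hypothetical, and it is not how NORA itself works.

```python
from dataclasses import dataclass, field

# Hypothetical nickname table, for illustration only.
NICKNAMES = {"bill": "william", "billy": "william"}


def name_key(full_name):
    """Reduce 'Bill Weather' to an assumed canonical (first, last) pair."""
    first, _, last = full_name.strip().lower().partition(" ")
    return (NICKNAMES.get(first, first), last)


def address_key(address):
    """Normalise case, commas and spacing so equal addresses compare equal."""
    return " ".join(address.lower().replace(",", " ").split())


@dataclass
class Entity:
    names: set = field(default_factory=set)
    addresses: set = field(default_factory=set)
    sources: list = field(default_factory=list)


class ContextStore:
    """One store for all sources; reconciliation happens on each ingest."""

    def __init__(self):
        self.entities = []
        self.index = {}  # (name_key, address_key) -> Entity

    def ingest(self, source, name, address):
        key = (name_key(name), address_key(address))
        entity = self.index.get(key)
        if entity is None:  # nothing to reconcile with: new entity
            entity = Entity()
            self.entities.append(entity)
            self.index[key] = entity
        entity.names.add(name)
        entity.addresses.add(address)
        entity.sources.append(source)
        return entity


store = ContextStore()
rental = store.ingest("rental", "Bill Weather", "12 Elm St, Springfield")
eviction = store.ingest("eviction", "William Weather", "12 elm st springfield")
print(rental is eviction, rental.sources)
```

Both records end up attached to the same entity, so the rental application now carries the context of the earlier eviction.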

You accumulate and store. You can also do the same for questions. A question might be structured as 'did this person buy anything in our store'. If you have no record of that person right now, instead of throwing away the question you add it to the data store, in its atomic form, and if any record for that person comes along you already have some indication that there is something interesting about this person. In effect you are getting rid of the distinction between questions and data, and replacing them with relationships, or contexts, about entities. After all, questions also usually concern people, places, times or events.
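Here is a minimal Python sketch of treating a question as data: a question with no answer yet is stored alongside the records, and a later record about the same person wakes it up. The class and method names are hypothetical, and matching people by lower-cased name is a deliberate oversimplification.

```python
from collections import defaultdict


class PersistentStore:
    """Keeps observations and unanswered questions in the same place."""

    def __init__(self):
        self.records = defaultdict(list)    # person key -> observations
        self.questions = defaultdict(list)  # person key -> stored questions

    def ask(self, person, question):
        """Store the question, and return whatever is already known."""
        key = person.lower()
        self.questions[key].append(question)
        return list(self.records.get(key, []))

    def observe(self, person, record):
        """Store a new record and return any questions it makes relevant."""
        key = person.lower()
        self.records[key].append(record)
        return [(q, record) for q in self.questions.get(key, [])]


store = PersistentStore()
known = store.ask("Bill Weather", "did this person buy anything in our store?")
alerts = store.observe("bill weather", {"event": "purchase"})
print(known, alerts)
```

The question returns nothing when it is first asked, but the purchase record that arrives later finds it, which is the 'data finds the data' idea in miniature.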

You won't know whether a piece of data is important until someone asks, but by melding everything together you build up persistent contexts.

Jeff compared this method to one of mining huge amounts of latent data, which he described as trying to boil the ocean. The flip side of Jeff's approach is that not only do you treat queries as data, but data then also become queries, as a new datum can tie together pieces in your puzzle and trigger a reconciliation of information which leads to insight about the data set. The trick is to take the approach of contextualising each datum as it arrives. This is the most efficient way to deal with data. By processing upon receipt you can begin to scale with new information. You don't have to go back periodically and try to mine through the ocean. This reminded me of a comment about fixing bugs in computer code. The most efficient place to fix a bug in computer code is just after you have typed it. Any delay past this point adds to the cost of fixing the bug.

He cited the example of a US federal agency that probably has multiple zettabytes of data lying around the place. There are not enough computers on Earth to sift through all of this data by brute force, and this is a problem that is getting worse as our capacity as a species to leave digital trails increases.

Another emergent aspect of this system is that queries find queries, and give a deeper picture about the information that you are dealing with. What you are doing is constructing context in an ongoing way, and when that context reaches a relevance threshold level you can publish insight.

Jeff said that bad data was good for solving problems like this because it helps to spread out the interaction of pieces of the puzzle. He said that if you polish all of your data you end up losing essential features of the data. So how do you treat ambiguities and false positives in the data? The answer seems to be to throw more data at the problem, and in the process of reconciling bits together you get rid of the ambiguities. He also said that orthogonal data sets were very important for gluing disparate data together.

He gave an example of where he was asked to find invented identities in a population. The given number of total individuals was known. Data from a variety of sources was ingested into the system, and each time a new name, or set of information regarding a person, was encountered a potential identity was created in the system. As data was poured in the number of possible people in the population at first grew to a multiple of the actual figure, before data reconciliation occurred and the potential numbers of individuals rapidly dropped down towards the real figure, with identifications of false identities popping out on the way down.
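The rise and collapse in the number of candidate identities can be imitated with a small union-find structure in Python. This is my own sketch rather than anything from the talk: every record starts life as a new candidate identity, and candidates are merged whenever they share a feature. The feature strings are invented, and merging on any single shared feature is far cruder than real reconciliation would be.

```python
class IdentityResolver:
    """Counts candidate identities as records arrive and get reconciled."""

    def __init__(self):
        self.parent = []         # union-find parent pointers, one per record
        self.feature_owner = {}  # feature -> first record that carried it
        self.count = 0           # current number of candidate identities

    def _find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def add(self, features):
        """Add a record; return the candidate count after reconciliation."""
        rid = len(self.parent)
        self.parent.append(rid)
        self.count += 1  # every new record is a potential new person
        for feature in features:
            owner = self.feature_owner.setdefault(feature, rid)
            a, b = self._find(owner), self._find(rid)
            if a != b:  # shared feature: merge the two candidates
                self.parent[b] = a
                self.count -= 1
        return self.count


resolver = IdentityResolver()
counts = [
    resolver.add(["name:bill weather", "phone:555-0100"]),
    resolver.add(["name:w. weather", "addr:12 elm st"]),
    resolver.add(["name:william weather", "passport:X1"]),
    resolver.add(["phone:555-0100", "addr:12 elm st", "passport:X1"]),
]
print(counts)
```

The first three records look like three different people, and it is the fourth record, which shares one feature with each of them, that collapses the count back to a single identity.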

Now, of course, these tools work where there is a lot of information about people, and indeed Jeff was talking about use cases in situations where the population of an entire country was being queried, which obviously raises questions about data, privacy and surveillance. As he was talking I was wondering about the issue of false positives, and the example of passenger no fly lists came to my mind, but then I realised that the example of the TSA no fly lists was a perfect example of not using the techniques that Jeff was describing for doing reconciliation of knowledge about people based on multiple silos of data.

In fact Jeff addressed questions of privacy directly in two ways. In the first case he asked how do you go about reconciling information in two data silos where you might not want to share all the information between both of the stores of data. He said that this was a big problem for government agencies where there are very strict regulations about data sharing between departments, so that one group looking at terrorist activity may not be able to access the database of a group looking at drug smuggling. To overcome this Jeff suggested that it is possible to share one-way hashed representations of portions of the data between data stores. The hashed representation is a unique non-reversible representation of a piece of information. If you give me one of these hashes there is nothing that I can do to reverse the transformation and extract the original information, but if you tell me the algorithm you used to create the one way hash, I can apply that algorithm to my data and see if any of my data produces the same hash that you gave me. If it does then I know that you have some information about something that I also have information about and it might be worth looking at cooperating on this particular item.
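The one-way hash exchange can be sketched in a few lines of Python. SHA-256 and the normalisation step are my assumptions (the talk did not name an algorithm), and the function names and sample values are invented for illustration.

```python
import hashlib


def one_way_hash(value):
    """Hash a normalised value; both parties must normalise identically."""
    normalised = " ".join(value.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def shared_items(my_values, their_hashes):
    """Return my values whose hashes also appear in the other party's set."""
    return {v for v in my_values if one_way_hash(v) in their_hashes}


# Agency A shares only hashes, never the underlying names.
agency_a_hashes = {one_way_hash(n) for n in ["William Weather", "Jane Doe"]}

# Agency B applies the same algorithm to its own data and looks for overlap.
agency_b_names = ["william  weather", "John Smith"]
print(shared_items(agency_b_names, agency_a_hashes))
```

Only the name that both sides hold shows up as a match, and neither side learns anything about the other's non-matching entries. One caveat: plain hashes of guessable values such as names can be attacked by hashing a dictionary of candidates, so in practice you would want a keyed hash (for example HMAC with a secret shared by the two parties) rather than bare SHA-256.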

He stressed that this technique does not do away with proper policy controls on how you manage access to your data, but rather what it can do is simplify the conversation you might want to have about looking into sharing information between different agents.

That was pretty much most of what Jeff talked about during his main talk. Then he started telling us about stuff that he had been thinking about. One of these things, unsurprisingly, is the emergence of the total surveillance society. His opinion is that this is both irresistible and inevitable and that it is being driven by consumer forces. The fact that people love GPS on their phones is just the start of what will be a series of technologies that will make total surveillance a reality. His other example was the idea of a pair of glasses that have an embedded RFID chip. The convenience of not being able to lose such items, owing to their constant traceability, will make people buy items like this simply due to our tendency to try to optimise our lives, and a consequence of this pursuit will be, in effect, the creation of a surveillance infrastructure.

Virtual worlds also came up, and Jeff said that he expects a significant amount of time to be spent in virtual worlds, and that we will be drawn into them through our need to conduct business there. He said that at the moment he finds them kind of boring, but indicators that they will be important are that they provide an immersive experience, the 100 dollar laptop is becoming a reality, and for billions of people the experience of life that is represented in virtual worlds is significantly more appealing than the circumstances that they are faced with in their day to day lives. If you have a business model where the very poor can gain access to a virtual world for a micro payment of a few cents a month, and you add this to a potential market of a couple of billion people, then you have a viable, indeed a compelling, business.

The last thing that Jeff talked about was how safe the world is. He pointed out two opposite trends here. The first is that at the moment the world is safer than it has ever been before. The current average life expectancy worldwide is 67, which is higher than at any point in the history of the world. He compared mortality rates in the 13th century from the Black Death to the kinds of threats that tend to be broadcast across the media today. The Black Death killed 17% of the population of the world. Jeff pointed out that even if you took the US and Europe and dropped them into the sea you would only manage to get rid of 5.5% of the population of the earth, so the kinds of threats that we worry over from terrorist activities today are incredibly minor within a historic perspective, and the reality is that there has been no better time to be alive.

In contrast, the cost of manufacturing tools for killing lots of people has been dropping as our technology advances. The cost of the first nuclear weapon was a significant percentage of US GDP, but now we can manufacture potentially lethal virus strains that could be more damaging, and at a fraction of the cost. He called this section 'more death, faster, cheaper'.

All in all, Jeff raised a lot of thinking points. At the end we had a chance to ask him a few questions. He said that the kind of work he is involved with is not just applicable to casinos and to government agencies, but to all manner of businesses. When asked how we might apply these ideas at Nature he advised that we ask ourselves what we do and what we are good at and then try to map these things onto the kinds of atomic questions that can be used for gluing lots of data together.

These ideas seem to have some applicability in science, for instance using the one-way hash idea to see if different labs are working on the same genes or chemicals without saying directly what those specific entities are. However, it's not clear whether science could deal with the cost involved in doing this.

Jeff writes prolifically on his blog about his ideas and he suggested the following posts as further reading regarding the topics that he touched on in his talk
