On Friday the 4th of April Jeff Jonas came in
to give the latest installment of our Tech Talks. Jeff is the
chief scientist for IBM’s Entity Analytics, but that is just one data
point in what, over the course of his talk, was revealed to be
a very rich context.
He managed to jam about 90 slides into 45 minutes, so I’m mostly going
to paraphrase what he said in his presentation, as it went by so quickly.
As this is quite a long blog post I’ll save you the trouble of reading
it by giving away the ending right now. The main theme that Jeff talked
about was data: lots of data, almost mind-staggeringly huge volumes of
data, and how to deal with it all. The answer is to construct a system in
which each of the nodes (or sensors) reporting information provides that
information in a form that can be stitched together in a contextually aware
way. By moving away from extracting a signal from one datum in isolation,
and instead building a way to look at the context in which that datum lives,
you can solve interesting problems. That’s the big picture.
At the end of his talk he also entertained us with some of his thoughts
on diverse topics, from the total surveillance state to how safe the
world really is. The longer write-up is below the break.
The longer version begins now. Jeff started out in the IT industry
by looking at questions relating to identifying people who were
trying to hide their identities. This was initially work for credit and
collection agencies, who often deal with people falsifying
records for lots of different reasons. Though he didn’t say so
explicitly, it was probably solving these kinds of problems that
led to his very interesting take on dealing with data. This work soon
found interested clients in the casinos of Las Vegas.
Las Vegas is a place where there is quite an incentive to cheat, in that
a correctly worked scam can net a large profit for the perpetrators in a
very short space of time. He showed a video of a table where one of the
gamblers swapped his own deck for the dealer’s deck. These cards had been
ordered in such a way that everyone at the table knew the ordering,
enabling them to play a deterministic game and cheat the casino. It would
probably be prudent for these kinds of people to try to hide their
identities, and also prudent for the casino to try to recognise such
people when they turn up on its doorstep.
He detailed the lengths that some criminal organisations go to in order
to introduce people who will cheat into a casino in such a way that the
casino has no prior information about them.
On the other hand, there is a lot of data out there, and if you could
solve the puzzle by stitching that data together then you might be able
to stay one step ahead and intercept these scams before or while they
happen, instead of finding out long afterwards.
He devised a system called NORA (Non-Obvious Relationship Awareness) for
these clients to tackle just this problem, and in 1998 he was asked to
speak about the work at a public NSA-hosted conference. It was after this
that his company was acquired by IBM.
He described a conversation with a counterterrorism intelligence analyst
in which he asked her what she would wish for. She said that she wished
she could get answers faster, to which Jeff replied, ‘what are the
chances that you can ask every smart question every day?’. The point here
is that sometimes a question asked today needs to wait until some event
happens in the future before it can have a meaningful or relevant answer.
You probably can’t ask that question every day, but if there were a way
to put the question into storage and allow it to become active when the
data required to give it relevance shows up, that would be a useful way
of dealing with it. In effect you are treating the question like data.
One of Jeff’s key points is that you have to allow the data to find the
data, and the relevance must find the user. OK, so that might sound a bit
cryptic, but the basic idea behind it is pretty straightforward.
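To make that a little less cryptic, here is a minimal sketch in Python of a question being stored like data and woken up when the relevant data finally arrives. The class names, the string key, and the `on_new_record` hook are my own inventions for illustration; this is not NORA’s actual design:

```python
from dataclasses import dataclass

@dataclass
class PendingQuestion:
    """A question stored like data: it waits for the data that answers it."""
    key: str    # the feature the question is about, e.g. a person's name
    text: str   # the question itself

class ContextStore:
    def __init__(self):
        self.records = {}     # key -> list of observed records
        self.questions = {}   # key -> list of questions still waiting

    def ask(self, question: PendingQuestion):
        # No matching data yet? Keep the question instead of discarding it.
        self.questions.setdefault(question.key, []).append(question)

    def on_new_record(self, key: str, record: dict):
        self.records.setdefault(key, []).append(record)
        # The new data "finds" every question that was waiting for it.
        for q in self.questions.pop(key, []):
            print(f"Relevant now: {q.text!r} matched by {record}")

store = ContextStore()
store.ask(PendingQuestion("bill weather", "did this person buy anything in our store?"))
store.on_new_record("bill weather", {"purchase": "chips", "date": "2008-04-02"})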
The current situation is that organisations have huge piles of data, but
that data tends to reside in separate relational silos, and these silos
don’t talk to one another. Moreover, a query against one of these silos
tends to match only exact terms, so if you are searching for ‘Bill’ among
perhaps millions of names in your data set you are not going to
get results for anyone called ‘William’, or ‘Bil’, or ‘Billy’, even
though as humans we know that semantically these forms are all related.
Aside from the lack of semantic reconciliation within one
data set, different data sets rarely connect to one another. For
instance, a database containing fraud investigations is rarely connected
to one’s own employee database. Jeff described this by saying that the
data is isolated and, as a result, our perception of the data is
fragmented.
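To make the ‘Bill’/‘William’ point concrete, here is a toy sketch of the kind of name-variant normalisation a reconciliation engine might apply before comparing records. The nickname table is my own hand-made stand-in, not anything from Jeff’s system:

```python
# A tiny, hand-made nickname table; a real system would use a much
# richer variant dictionary plus fuzzy matching for typos like 'Bil'.
NICKNAMES = {
    "bill": "william", "billy": "william", "will": "william",
    "bob": "robert", "bobby": "robert", "liz": "elizabeth",
}

def canonical(name: str) -> str:
    """Map a name to a canonical form so variants compare equal."""
    n = name.strip().lower()
    return NICKNAMES.get(n, n)

assert canonical("Bill") == canonical("William") == canonical("Billy")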
In order to solve a puzzle, such as determining connections between fraud
rings and insiders, you need to give a context to your data so that you
can begin to infer relationships within it. He described this as creating
persistent context.
OK, so what is the pseudo-algorithm for dealing with all of this data?
Let’s take as an example a record for renting an apartment. This might
contain a name, address, date of birth, and perhaps a phone number. Let’s
say the name is ‘Bill Weather’. Now let’s take a second record from the
same apartment complex. Again this may contain a name, date of birth, and
address. Perhaps the name in this case is ‘William Weather’, but the
address is the same as in the first record. Tying these together tells us
that any time we encounter ‘William Weather’ or ‘Billy Weather’ at this
address we are probably dealing with the same person, through the glue of
the shared address. In order to create a data engine that can do this
kind of matching you have to extract key features (names, addresses,
phone numbers, etc.) from all of your sources into one store where
semantic reconciliation is attempted on each new record in real time.
These key features generally represent the ‘who’, ‘what’, ‘where’ and
‘when’ available on each individual observation (i.e., the atomic level
of the data).
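A minimal sketch of that matching step might look like the following. The feature extraction, the two-feature match rule, and the record fields are all simplifications I’m assuming for illustration:

```python
# Sketch: reconcile records by shared key features (name variants, address).
# The feature extraction and the matching rule are deliberately naive.
NICKNAMES = {"bill": "william", "billy": "william"}

def features(record: dict) -> set:
    feats = set()
    if "name" in record:
        first, last = record["name"].lower().split()
        feats.add(("name", NICKNAMES.get(first, first) + " " + last))
    if "address" in record:
        feats.add(("address", record["address"].lower()))
    if "dob" in record:
        feats.add(("dob", record["dob"]))
    return feats

def same_entity(a: dict, b: dict) -> bool:
    # Two shared features (e.g. reconciled name + address) count as a match.
    return len(features(a) & features(b)) >= 2

rental = {"name": "Bill Weather", "address": "12 Elm St", "dob": "1970-01-01"}
other  = {"name": "William Weather", "address": "12 Elm St"}
print(same_entity(rental, other))   # True: name variant + shared address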
You accumulate and store. You can also do the same for questions. A
question might be structured as ‘did this person buy anything in our
store?’. If you have no record of that person right now, then instead of
throwing the question away, add it to the data store in its atomic
form; if a record for that person comes along later you already have
some indication that there is something interesting about them.
In effect you are getting rid of the distinction between questions and
data, and replacing both with relationships, or contexts, about entities.
After all, questions also usually concern people, places, times or things.
You won’t know whether a piece of data is important until someone asks,
but by melding everything together you build up persistent contexts.
Jeff contrasted this method with mining huge amounts of latent data after
the fact, which he described as trying to boil the ocean. The flip side
of Jeff’s approach is that not only do you treat queries as data, but
data also becomes queries, as a new datum can tie together pieces in your
puzzle and trigger a reconciliation of information that leads to insight
about the data set. The trick is to contextualise each datum as it
arrives. This is the most efficient way to deal with data: by processing
upon receipt you can scale with new information, and you don’t have to go
back periodically and try to boil the ocean. This reminded me of a
comment about fixing bugs in computer code: the most efficient place to
fix a bug is just after you have typed it, and any delay past that point
adds to the cost of fixing it.
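In code terms, the difference is roughly between a periodic batch re-scan and an ingest-time hook. Here is a toy framing of the latter (my own, not Jeff’s implementation), where each arriving record is indexed and simultaneously acts as a query against everything seen so far:

```python
# Contextualise each datum on arrival instead of periodically re-mining
# ("boiling the ocean"). `context` is a simple feature -> records index.
context: dict = {}

def ingest(record: dict):
    """Process upon receipt: index the record and let it act as a query."""
    matches = []
    for feat in record.items():   # each (field, value) pair is a feature
        matches.extend(context.get(feat, []))
        context.setdefault(feat, []).append(record)
    if matches:
        # The new datum just tied existing pieces of the puzzle together.
        print(f"{record} connects to {len(matches)} earlier record(s)")

ingest({"address": "12 Elm St", "name": "Bill Weather"})
ingest({"address": "12 Elm St", "phone": "555-0100"})   # fires on shared address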
He cited the example of a US federal agency that probably has multiple
zettabytes of data lying around the place. There are not enough computers
on Earth to sift through all of this data by brute force, and this is a
problem that is getting worse as our capacity as a species to leave
digital trails increases.
Another emergent aspect of this system is that queries find queries,
which gives a deeper picture of the information that you are dealing
with. What you are doing is constructing context in an ongoing way, and
when that context reaches a relevance threshold you can publish insight.
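One way to picture the threshold idea is a running score per entity, with insight published only once the score crosses a line. The scoring scheme below is purely hypothetical:

```python
# Hypothetical relevance scoring: each new connection adds to an entity's
# context score, and insight is published once a threshold is crossed.
THRESHOLD = 3
scores: dict = {}

def add_connection(entity: str, weight: int = 1):
    scores[entity] = scores.get(entity, 0) + weight
    if scores[entity] == THRESHOLD:
        print(f"Publish insight: {entity} has reached relevance {THRESHOLD}")

for _ in range(3):
    add_connection("bill weather")   # the third connection triggers publication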
Jeff said that bad data is actually good for solving problems like this,
because it helps to spread out the interaction of pieces of the puzzle;
if you polish all of your data you end up losing essential features of
it. So how do you treat ambiguities and false positives in the data? The
answer seems to be to throw more data at the problem: in the process of
reconciling bits together you get rid of the ambiguities. He also said
that orthogonal data sets are very important for gluing disparate data
together.
He gave an example of where he was asked to find invented identities in
a population whose total number of individuals was known. Data from
a variety of sources was ingested into the system, and each time a new
name, or set of information regarding a person, was encountered, a
potential identity was created. As data poured in, the number of possible
people in the population at first grew to a multiple of the actual
figure, before data reconciliation kicked in and the number of potential
individuals dropped rapidly back towards the real figure, with
identifications of false identities popping out on the way down.
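That overshoot-then-collapse behaviour falls naturally out of a merge structure such as union-find. Here is a hypothetical miniature of the dynamic (my own framing, not Jeff’s actual system):

```python
# Union-find over candidate identities: every new observation starts as a
# fresh identity, and reconciliation merges candidates that share features.
parent: dict = {}

def find(x: str) -> str:
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def merge(a: str, b: str):
    parent[find(a)] = find(b)

observations = ["bill weather", "william weather", "billy weather", "jane roe"]
for obs in observations:
    find(obs)                     # candidate count overshoots: 4 identities

merge("bill weather", "william weather")    # shared address, say
merge("billy weather", "william weather")   # shared date of birth, say
roots = {find(x) for x in parent}
print(len(roots))                 # collapses to the 2 real individuals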
Now, of course, these tools work where there is a lot of information
about people, and indeed Jeff was talking about use cases in which the
population of an entire country was being queried, which obviously raises
questions about data, privacy and surveillance. As he was talking I
wondered about the issue of false positives, and the example of passenger
no-fly lists came to mind; but then I realised that the TSA no-fly lists
are a perfect example of not using the techniques Jeff was describing for
reconciling knowledge about people across multiple silos of data.
In fact Jeff addressed questions of privacy directly in two ways. In the
first case he asked how you go about reconciling information across two
data silos when you might not want to share all the information between
the two stores. He said that this was a big problem for
government agencies where there are very strict regulations about data
sharing between departments, so that one group looking at terrorist
activity may not be able to access the database of a group looking at
drug smuggling. To overcome this Jeff suggested that it is
possible to share one-way hashed representations of portions of the data
between data stores. The hashed representation is a unique
non-reversible representation of a piece of information. If you give me
one of these hashes there is nothing that I can do to reverse the
transformation and extract the original information, but if you tell me
the algorithm you used to create the one-way hash, I can apply that
algorithm to my own data and see if any of it produces the same hash
that you gave me. If it does then I know that you have some information
about something that I also have information about and it might be worth
looking at cooperating on this particular item.
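As a concrete sketch of that comparison step, here is a small example using SHA-256. The shared salt, the normalisation, and the record values are assumptions on my part; real deployments involve considerably more careful value standardisation before hashing:

```python
import hashlib

# Both parties agree on the algorithm and a shared salt out of band,
# then exchange only digests. Neither side can reverse a digest.
SHARED_SALT = b"agreed-out-of-band"

def one_way(value: str) -> str:
    normalised = value.strip().lower().encode()
    return hashlib.sha256(SHARED_SALT + normalised).hexdigest()

# Agency A publishes hashes of the identifiers it holds.
published = {one_way("William Weather"), one_way("Jane Roe")}

# Agency B checks its own data against them without seeing A's values.
for name in ["william weather", "John Doe"]:
    if one_way(name) in published:
        print(f"Both agencies hold information on {name!r}; worth a conversation")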
He stressed that this technique does not do away with proper policy
controls on how you manage access to your data; rather, it can simplify
the conversation you might want to have about sharing information between
different agents.
That was pretty much most of what Jeff talked about during his main
talk; then he started telling us about other things he has been thinking
about. One of these, unsurprisingly, is the emergence of the total
surveillance society. His opinion is that this is both irresistible and
inevitable, and that it is being driven by consumer forces. The fact that
people love GPS on their phones is just the start of a series of
technologies that will make total surveillance a reality. His other
example was the idea of a pair of glasses with an embedded RFID chip. The
convenience of never losing such items, owing to their constant
traceability, will make people buy them simply through our tendency to
optimise our lives, and a consequence of this pursuit will be, in effect,
the creation of a surveillance infrastructure.
Virtual worlds also came up, and Jeff said that he expects a significant
amount of time to be spent in virtual worlds, and that we will be drawn
into them through our need to conduct business there. He said that at
the moment he finds them kind of boring, but the indicators that they
will be important are that they provide an immersive experience, that the
100-dollar laptop is becoming a reality, and that for billions of people
the experience of life represented in virtual worlds is significantly
more appealing than the circumstances they face in their day-to-day
lives. If you have a business model where the very poor can gain access
to a virtual world for a micropayment of a few cents a month, and you add
this to a potential market of a couple of billion people, then you have a
viable, indeed a compelling, business.
The last thing that Jeff talked about was how safe the world is. He
pointed out two opposing trends. The first is that at the moment the
world is safer than it has ever been: the current average life expectancy
worldwide is 67, which is higher than at any point in the history of the
world. He compared mortality rates from the Black Death in the 14th
century to the kinds of threats that tend to be broadcast across the
media today. The Black Death killed 17% of the population of the world;
Jeff pointed out that even if you took the US and Europe and dropped them
into the sea you would only manage to get rid of 5.5% of the population
of the earth. The kinds of threats from terrorist activity that we worry
over today are thus incredibly minor in historical perspective, and the
reality is that there has been no better time to be alive.
In contrast, the cost of manufacturing tools for killing lots of people
has been dropping as our technology advances. The cost of the first
nuclear weapon was a significant percentage of US GDP, but now
potentially lethal virus strains that could be even more damaging can be
manufactured at a fraction of that cost. He called this section ‘more
death, faster, cheaper’.
All in all, Jeff raised a lot of points to think about. At the end we had
a chance to ask him a few questions. He said that the kind of work he is
involved with is applicable not just to casinos and government agencies
but to all manner of businesses. When asked how we might apply these
ideas at Nature, he advised that we ask ourselves what we do and what we
are good at, and then try to map those things onto the kinds of atomic
questions that can be used to glue lots of data together.
These ideas seem to have some applicability in science, for instance
using the one-way hash idea to see whether different labs are working on
the same genes or chemicals without saying directly what those specific
entities are; however, it’s not clear whether science could bear the cost
involved in doing this.
Jeff writes prolifically on his blog about his ideas, and he suggested the following posts as further reading on the topics that he touched on in his talk:
- To Know Semantic Reconciliation is to Love Semantic Reconciliation
- More Data is Better, Proceed With Caution
- Ubiquitous Sensors? You Have Seen Nothing Yet
- Accumulating Context: Now or Never
- Virtual Reality: There Is No Place Like Home
- How to Use a Glue Gun to Catch a Liar
- It Turns Out Both Bad Data and a Teaspoon of Dirt May Be Good For You
- Why Faster Systems Can Make Organizations Dumber Faster
- You Won’t Have to Ask — The Data Will Find Data and Relevance Will Find the User
- To Anonymize or Not to Anonymize, That is the Question
- What Came First, the Query or the Data?
- More Death Cheaper in Future
- The World is Not a More Dangerous Place
- Six Ticks till Midnight: One Plausible Journey from Here to a Total Surveillance Society