When preparing a grant or publication, where can you turn for new ideas? You can bounce ideas off colleagues, search PubMed and Web of Science for related literature, and maybe take a trip down Google lane. But it’s difficult to get outside one’s particular area of expertise — to mine the opportunities at cross-disciplinary boundaries unless you know what you’re looking for. The developers of a new document search engine hope to make such cognitive leaps easier, finds Jeff Perkel.
According to CEO and company co-founder Brian Sager, Omnity is a document-matching engine that uses the entire text of a document, or rather a representation of its rarest words, as a search term. (Google, by comparison, caps search queries at 32 words.) By comparing that representation to those of millions of documents, Omnity claims, it can return unlikely matches.
Launched in May 2016, Omnity updated its tool on 13 December 2016 to accept multilingual document searches across up to 100 languages, including Japanese, Chinese, Spanish, and Arabic. Simply upload a machine-readable document in any of these languages, and the engine will return its results in English.
The English language, Sager explains, contains about 700,000 words. Of those, the top 100 represent 50% of all published words, and the top 7000 represent 90%, including “virtually all verbs and most adjectives.” The remainder – “the very long tail of rarely used words” – are what Sager calls “specialist nouns,” words like “nanoglobule” and “informatics” that mean precisely what they say, regardless of the language in which the document was written, or its topic. Those rare words create a signature that Sager calls a “semantic signal.”
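To make the arithmetic behind Sager's figures concrete, here is a minimal Python sketch that measures how much of a text its most frequent words account for, and what is left in the tail. The function names and the ten-word toy sample are illustrative assumptions; they are not Omnity's data or code, and the toy numbers do not reproduce the 50% and 90% figures quoted above.

```python
from collections import Counter


def coverage_of_top_k(tokens, k):
    """Fraction of all word occurrences accounted for by the k most frequent words."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(tokens)


def long_tail(tokens, k):
    """Distinct words that fall outside the k most frequent words, sorted alphabetically."""
    counts = Counter(tokens)
    common = {word for word, _ in counts.most_common(k)}
    return sorted(set(tokens) - common)


# Hypothetical toy sample: two very common words and two rarer ones.
sample_tokens = "the the the the of of of cell cell nanoglobule".split()
top_two_coverage = coverage_of_top_k(sample_tokens, 2)
tail_words = long_tail(sample_tokens, 2)
```

In this toy sample the two most frequent words cover 7 of the 10 occurrences, and the tail holds the remaining distinct words. On a real corpus, the same calculation with k set to 100 or 7000 would be the way to check figures like those Sager cites.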
Sager adds: “We make a mathematical equation of the statistical pattern of rare words that are distributed in a document, and then we scan through hundreds of millions of other documents that also have had mathematical equations constructed for each one. And then we look for matches where similar patterns of statistical distribution imply similar patterns of themes or topics in the documents. And we’ve found that to be true in every knowledge domain that we’ve tested.”
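Omnity has not published the details of its “mathematical equation,” so the following Python sketch is only one plausible reading of the idea Sager describes, not the company's method. It assumes that a document's signature is a count of its words that do not appear on a list of common words, and that two signatures are compared with cosine similarity. The common-word list, the function names, and the two-document corpus are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def rare_word_signature(text, common_words):
    """Count only the words that are absent from the common-word list."""
    return Counter(w for w in tokenize(text) if w not in common_words)


def cosine_similarity(sig_a, sig_b):
    """Cosine of the angle between two word-count signatures (0.0 if either is empty)."""
    dot = sum(sig_a[w] * sig_b[w] for w in sig_a.keys() & sig_b.keys())
    norm_a = math.sqrt(sum(v * v for v in sig_a.values()))
    norm_b = math.sqrt(sum(v * v for v in sig_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_matches(query_text, corpus, common_words):
    """Return (score, name) pairs for every corpus document, best match first."""
    query_sig = rare_word_signature(query_text, common_words)
    scored = [
        (cosine_similarity(query_sig, rare_word_signature(text, common_words)), name)
        for name, text in corpus.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical stand-in for the few thousand most common English words.
COMMON_WORDS = {"the", "of", "a", "and", "to", "in", "can", "that", "through"}

# Hypothetical two-document corpus.
corpus = {
    "battery": "nanoglobule electrolyte membranes in the battery",
    "cooking": "the recipe and the oven",
}

ranked = rank_matches("nanoglobule transport through the membranes", corpus, COMMON_WORDS)
```

Here the query shares two rare words with the “battery” document and none with the “cooking” document, so the former ranks first and the latter scores zero. A production system would need a far larger common-word list and a precomputed index to compare a query against hundreds of millions of signatures.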
Your mileage may vary. In my own, admittedly limited, testing, Omnity returned some odd results and failed to find what I would consider key resources. According to Sager, that could be due to residual semantic “noise,” which users can reduce by adding terms to or removing them from the ‘word cloud’ of rare terms extracted from the document, filtering out certain classes of hits (such as patents), and narrowing the search by clicking on particularly relevant hits in the results.
By default, Omnity includes some 15 terabytes of federal documents, including scientific papers, grant applications, patents, US Food and Drug Administration and Securities and Exchange Commission filings, legal rulings, selected web content, and more. Academic users can add up to 10 custom documents for free, and rotate them in and out as necessary; premium customers can add unlimited documents at a cost of about $0.05 per page.
According to Sager, the company’s search strategy avoids the difficulty often associated with intelligent document searching in that it reduces a document down to its isolated words, rather than trying to suss out either grammar or meaning. And the system can perform millions of pairwise comparisons between those documents in seconds.
To use the service, simply sign up for an account, then drag and drop a document into the search window. Omnity “ingests” the query document and distills (a biochemist by training, with a minor in linguistics, Sager prefers “purifies”) it down to its most unusual words. It then translates that signature into a mathematical representation, producing the query. The user is presented with a distance graph showing the most similar documents across a series of domains, including “NIH Scientific Papers,” US patents, Wikipedia, Answers.com, and company profiles. Other visualizations are also available, including representations of document release by time and location.
A researcher preparing a grant application could upload the introduction in order to find related work and potential collaborators, says Sager. Publishers could use the system to identify potential peer-reviewers for submitted manuscripts, and patent attorneys could mine it to identify prior art.
Perhaps most significantly for readers of Nature, researchers may be able to use the tool to step outside their comfort zones. In one example, Sager says, a corporate client looking for strategies to cross the blood-brain barrier used Omnity to identify a potential lead in, of all places, battery research.
“The neuroscientists would never have looked in battery literature, and people in battery literature would rarely look at the brain literature,” he says. “But in this case, the fact that we were looking at all of the words in these documents and focusing on the rare words let us find … lateral possibilities that wouldn’t have been possible when someone’s locked in from their window of expertise.”
Jeffrey Perkel is Technology Editor, Nature