Social Tagging for Science

Over the next few weeks we’re going to run a short series of guest posts from people working on the sharp edge of science 2.0 who we think are particularly cool and interesting.

Our first guest author is Ben Good, a grad student at the University of British Columbia, where he works on bioinformatics projects and semantic web goodness for the life sciences. As well as producing a number of cool apps and mashups, he’s almost certainly the first person to have used Greasemonkey in anger in a peer-reviewed journal article.


Social Tagging for Science, now with added meaning!

By now, most readers of Nascent will be familiar with the concept of social tagging as it is displayed in services like Connotea, Flickr, and Delicious. Readers may also be familiar with the concept of the semantic web, AKA the web of data, the giant global graph, web 3.0, linked data, the structured web and so on. At a high level, both concepts are fundamentally about the same thing: the creation and use of associations between terms (broadly speaking) and the things that are represented on the Web. However, they traditionally approach the problems of creating and representing those terms and associations in different ways. In this post I will explore an emerging species of social tagging system that, as a symbiosis of social tagging and the semantic web, seems to offer powerful enhancements to both.

You say ‘potatoe’, I say ‘potato’

You say ‘sonic hedgehog’, I say ‘SHH’

Social tagging systems effectively let anyone connect whatever string of symbols they like to whatever Web resource they like. When I bookmark a web page with Connotea, I am free to tag it however I choose. While this makes it very easy for me to quickly create metadata that can help to organize the personal yet public information in my bookmark collection, it has the unfortunate result that, in the context of the whole web or even just in the context of Connotea, the tags are ambiguous. When you tag ‘cell’ and I tag ‘cell’ we may mean completely different things and, of course, when you tag ‘SHH’ and I tag ‘sonic hedgehog’ we might actually mean the same thing. Not to mention that if I tag something with ‘koala’, I will never be able to find it with a search for ‘marsupial’. As a consequence, search and navigation through information collections organized by tags can sometimes leave much to be desired.

As information organization professionals have known for some time, there are quite effective ways to improve upon uncontrolled terms for the purposes of indexing. Controlled vocabularies of varying shapes and sizes are applied throughout the world where effective retrieval over large databases is required. (Such indexing is particularly important when the content indexed has no text suitable for automatic processing.) Terminological structures, such as the MeSH thesaurus and the Gene Ontology, provide professional indexers in many different domains with authoritative sets of terms, linked together with meaningful relationships, that facilitate the process of indexing for later retrieval.

As part of my laboratory’s research into socially distributed mechanisms for creating, maintaining, and using semantic web content for applications in bioinformatics, we are building tools to explore how the millions of terms already represented in Web-accessible controlled vocabularies can be used to enhance the process and resultant products of social tagging. Our approach is simple: provide the users of social tagging services with a way to use terms from controlled vocabularies, which we call ‘semantic tags’, as easily as they now use tags created by themselves. In principle, this could allow for much better organization of tagged collections through the inference and disambiguation made possible by the semantic tags. Tags that meant the same thing to their users at the time they were applied, like ‘SHH’, ‘sonic hedgehog’, and ‘hedgehog’, would mean the same thing at the time they were used for retrieval, with consequent improvements not only in search but also in ‘related-user’ and ‘related-tag’ functionality. This is fundamentally the same approach as that taken by ZigTag, one of the first companies to incorporate around the idea of semantic tagging.
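To make the idea concrete, here is a minimal Python sketch of how semantic tags could improve retrieval. It is an illustration only, not the Entity Describer's code: the `ex:` concept identifiers, the `BROADER` table, and the function names are all assumptions made up for this example. Each tag keeps the label its user typed together with the concept chosen at tagging time, so different labels can point at the same thing, and a broader-than relation lets a search for ‘marsupial’ find a bookmark tagged ‘koala’.

```python
# Hypothetical concept identifiers; a real system would take them from a
# controlled vocabulary such as MeSH or the Gene Ontology.
BROADER = {"ex:koala": "ex:marsupial"}  # narrower concept -> broader concept


def with_broader(concept):
    """Return the concept together with every broader concept above it."""
    found = set()
    while concept is not None and concept not in found:
        found.add(concept)
        concept = BROADER.get(concept)
    return found


def search(bookmarks, target):
    """Return URLs tagged with the target concept or any narrower concept."""
    hits = []
    for url, tags in bookmarks.items():
        concepts = set()
        for _label, concept in tags:
            concepts |= with_broader(concept)
        if target in concepts:
            hits.append(url)
    return sorted(hits)


# Each tag is (label the user typed, concept selected at tagging time).
bookmarks = {
    "http://example.org/a": [("SHH", "ex:sonic_hedgehog")],
    "http://example.org/b": [("sonic hedgehog", "ex:sonic_hedgehog")],
    "http://example.org/c": [("hedgehog", "ex:sonic_hedgehog")],
    "http://example.org/d": [("koala", "ex:koala")],
}

print(search(bookmarks, "ex:sonic_hedgehog"))  # bookmarks a, b and c
print(search(bookmarks, "ex:marsupial"))       # bookmark d
```

The point of storing the concept next to the label is that the user's intent is captured once, at tagging time, rather than guessed from the string at search time.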

Challenges of building a semantic social tagging system

To achieve such a happily meaningful state of social semantic symbiosis, we need effective ways to connect users to semantic tags in intuitive, non-alienating ways. The experience of using a semantic social tagging system needs to be as pleasingly simple and fast as that produced by the current free-text tagging systems. The challenges to creating such a system are not insignificant. Here are a few that we’ve faced in the development of an extension to Connotea called the Entity Describer (which is in its second version and in what you might call a permanent public alpha stage of development):

  • you need lots of tags

Once you have made the commitment to use a semantic tagging system, it is quite frustrating when you can’t find the term you want to use in the system.

  • you need to present the users with a fast, non-annoying way to use these tags during the tagging process

The first iteration of the Entity Describer was too slow to present the tags to the users when large terminologies were made available. Of course you need large terminologies to handle the first point, so…

  • you need to devise new interfaces that let the users take advantage of their new, meaning-enhanced tag collections

Without demonstrating improved functionality for the individual, users are left wondering “what was the point of that” as they wander off to other services that attend directly to their needs. Especially for researchers, the truly intriguing aspects of these systems emerge at the level of the collective. Because of this, there is a danger of forgetting the Delicious lesson and not focusing first and foremost on improving the experience of individual users. The trees must come before the forest.

  • you need a way for the users to contribute new semantic tags or to alter the definitions for existing tags

Without such a mechanism, the system cannot grow and change to meet the needs of the community. Ideally a social semantic tagging system should be both a beneficiary of and a contributor to semantic repositories.

How it works

To create a semantic tagging application that lets users author the tags for their posts (rather than trying to predict them), the user has to be presented with a way to quickly and easily find the tags they need. So far, all of the interfaces I’ve seen use some form of a type-ahead query over a centralized repository of semantic tags. The user begins to tag something as they normally would, but then they are presented with a list of candidate semantic tags based on what they type and are then allowed to select the tag that they mean to use. If the term they want cannot be found, they are, at a minimum, allowed to enter it as a normal, free-text tag.
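The type-ahead flow described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any of the systems mentioned here are implemented: the in-memory `REPOSITORY`, its `ex:` identifiers and descriptions, and the `suggest` and `choose_tag` functions are hypothetical. It keeps the labels sorted so that a binary search finds the first candidate for a prefix, and it falls back to a plain free-text tag when the user selects nothing.

```python
from bisect import bisect_left

# Hypothetical repository entries: (lowercased label, concept ID, description).
REPOSITORY = sorted([
    ("cell", "ex:cell_biological", "basic unit of living organisms"),
    ("cell", "ex:cell_prison", "room in a prison"),
    ("cell cycle", "ex:cell_cycle", "series of events leading to division"),
    ("koala", "ex:koala", "arboreal marsupial"),
    ("sonic hedgehog", "ex:sonic_hedgehog", "signalling protein and its gene"),
])
LABELS = [entry[0] for entry in REPOSITORY]


def suggest(prefix, limit=10):
    """Return up to `limit` entries whose label starts with the typed prefix."""
    prefix = prefix.strip().lower()
    if not prefix:
        return []
    # Binary search for the first label that is >= the prefix.
    start = bisect_left(LABELS, prefix)
    results = []
    for entry in REPOSITORY[start:]:
        if not entry[0].startswith(prefix) or len(results) == limit:
            break
        results.append(entry)
    return results


def choose_tag(typed, selected=None):
    """Use the semantic tag the user picked, else keep the free-text tag."""
    if selected is not None:
        return {"label": selected[0], "concept": selected[1]}
    return {"label": typed, "concept": None}


candidates = suggest("cel")
for label, concept, description in candidates:
    print(f"{label} ({description}) -> {concept}")

print(choose_tag("cel", selected=candidates[0]))
print(choose_tag("to-read"))  # no match selected: stays a free-text tag
```

Showing a short description next to each candidate is what lets the user tell the two senses of ‘cell’ apart before committing to one.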

To make the type-ahead work, you need a database of semantic tags (the bigger and faster the better) on the backend and some clever JavaScript on the front. The current development version of the Entity Describer draws its semantic tags and a lot of its JavaScript from Freebase, an “open shared database of the world’s knowledge”. Other semantic tagging efforts, like ZigTag, Fuzzzy, and the first version of the Entity Describer, rely on their own databases of semantic tags, but, for now, the size, the openness and the quality of the provided API made Freebase a natural choice. It already contains more than 3 million terms (many garnered from Wikipedia), anyone can load whatever terms they want into it (for example, someone loaded the whole Gene Ontology), anyone can edit the textual definitions and the relationships that exist between the terms (called ‘Topics’), and it provides a very fast JavaScript component for type-ahead search that can easily be embedded in external Web pages.

The Entity Describer works by embedding a Freebase search in the tagging form that appears in the Add2Connotea bookmarklet. Using this modified bookmarklet, users have the choice of specifying which Types of semantic tags to search for, searching through them all, or adding in their own free-text tags. Types can be used to limit the search to, for example, terms from the Gene Ontology Group or Anatomical Structures or films. When a bookmark is posted through the system, it is stored both in Connotea, using just the free-text forms of the tags, and in an RDF database that captures the semantic relationships that define these strings. This database is served as a SPARQL endpoint, which essentially functions as a generic HTTP API for accessing the collected data. Applications that provide access to this data are starting to be written and we welcome others to contribute their own.
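The dual storage step can be pictured with a small Python sketch. Everything in it is an assumption for illustration: the post does not describe the Entity Describer's actual RDF schema, so the `ex:taggedWith` and `ex:label` predicates, the `ex:` concept identifiers, and both functions are hypothetical, and a Python set of triples stands in for the RDF database. The `match` function imitates the simplest kind of SPARQL query, a single triple pattern in which any position may be left as a wildcard.

```python
def post_bookmark(url, tags, connotea_store, triple_store):
    """Record a bookmark in both stores.

    `tags` is a list of (label, concept) pairs; concept is None for a
    plain free-text tag.
    """
    # Connotea only sees the free-text form of every tag.
    connotea_store[url] = [label for label, _concept in tags]
    # The triple store additionally records what each semantic tag denotes.
    for label, concept in tags:
        if concept is not None:
            triple_store.add((url, "ex:taggedWith", concept))
            triple_store.add((concept, "ex:label", label))


def match(triple_store, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        triple for triple in triple_store
        if (subject is None or triple[0] == subject)
        and (predicate is None or triple[1] == predicate)
        and (obj is None or triple[2] == obj)
    )


connotea = {}
triples = set()
post_bookmark(
    "http://example.org/paper",
    [("SHH", "ex:sonic_hedgehog"), ("to-read", None)],
    connotea,
    triples,
)

print(connotea["http://example.org/paper"])
# Which resources were tagged with the sonic hedgehog concept?
print(match(triples, predicate="ex:taggedWith", obj="ex:sonic_hedgehog"))
```

Against a real endpoint, the last lookup would correspond roughly to a SPARQL query of the form `SELECT ?url WHERE { ?url ex:taggedWith ex:sonic_hedgehog }`, with the hypothetical `ex:` prefix bound to whatever namespace the database actually uses.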

So far, we’ve had a very positive experience using Freebase in this manner, but there are some problems.

  • data isn’t provided according to semantic web standards and thus requires translation machinery to import and export from information sources that do
  • once a vocabulary such as the Gene Ontology is loaded onto the Freebase platform, there is a chance that the definitions of the terms could diverge from those in the external source
  • as Freebase isn’t actually designed as a terminological resource per se, it does not, by default, contain support for common relationships such as broader-than and narrower-than

However, we feel that each of these problems can be dealt with effectively and that, compared with composing, maintaining, and hosting another semantic repository with similar functionality, the benefits of using Freebase far outweigh the costs.

“… and in the darkness bind them”

To conclude, the worlds of social tagging and the semantic web are rapidly crashing together in the form of a new generation of intelligent, personal/public information organization systems. In this new symbiosis of the often separate worlds of the social and the semantic web, social tagging systems can benefit from the increased precision and opportunities for reasoning provided by semantic representations and the semantic Web can benefit from the flow of user-generated content currently shaping the rest of the Web.

