Web publishing as a discipline has few tenets but I think release early, release often and don’t be afraid to fail are pretty sound. That was the philosophy behind Connotea when Timo and Ben Lund launched it in 2004 and it’s the spirit in which I’ve just put up an early version of Streamosphere.
Streamosphere is a pet side project which I’m running according to what I guess you could call the Paul Graham principles (it’d be disingenuous to say “as a start-up”, as most start-ups don’t have NPG-level resources. OTOH we lack a foosball table and free M&Ms). Think of it as a pre-alpha alpha.
The elevator pitch
Streamosphere lets you track scientific discussion on the web, in real time.
What it does
If you visit streamosphere.nature.com/preview.php#24 you’ll see a page of stacked timelines like these:
Each timeline shows discussion around a particular item, for now always a web page. The portrait on the left is of one of the people who first started talking about the item. The slice of time in which the discussion was active (people were leaving comments, tweeting, liking or bookmarking it) is coloured a shade of magnolia. Behind the active slice is a graph – this shows you how much activity there was at any one point.
Click on an item’s active slice to pop up more details about it including an activity breakdown and a selection of associated comments and tweets. If the item is a video or photograph it should be embedded in the popup. If the item description is in a foreign language hover your mouse cursor over it to get the English translation.
Streamosphere only ever shows the most active items in a given time period. Use the controls on the right hand side of the screen to see the most active items in the past few hours, day, week or month. You can also filter items by domain or by keywords in their description.
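To make the "most active items" idea concrete, here is a minimal sketch of ranking items by update count within a time window, with optional domain and keyword filters. It is an illustration under assumptions, not the code behind the site: the function name `most_active` and the update fields (`item`, `description`, `time`) are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta
from urllib.parse import urlsplit


def most_active(updates, now, window, limit=10, domain=None, keyword=None):
    """Rank item URLs by number of updates seen between now - window and now.

    Each update is a dict with hypothetical keys: 'item' (a URL),
    'description' (text) and 'time' (a datetime).
    """
    cutoff = now - window
    counts = Counter()
    for update in updates:
        # Ignore activity outside the requested slice of time.
        if not (cutoff <= update["time"] <= now):
            continue
        host = urlsplit(update["item"]).hostname or ""
        # Domain filter: match the domain itself or any subdomain of it.
        if domain and host != domain and not host.endswith("." + domain):
            continue
        # Keyword filter: case-insensitive substring match on the description.
        if keyword and keyword.lower() not in update["description"].lower():
            continue
        counts[update["item"]] += 1
    return counts.most_common(limit)
```

Calling it with `window=timedelta(hours=3)` or `window=timedelta(days=7)` corresponds to picking "past few hours" or "week" in the controls.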
In smaller time periods you’ll see some items that aren’t anything to do with science: recently there’s been stuff about Iran and a viral video, for example. I’m not sure if this is a bug or a feature, or how to filter out non-science stuff if that’s a requirement – suggestions welcome.
In the future I’d like to see the page update dynamically as new activity gets tracked but for now to refresh the page you need to reload or choose a new time period.
How it works
Streamosphere tracks ~ 4k accounts on half a dozen different social media sites including Friendfeed, Twitter and bookmarking services like Delicious. The account owners have all self-identified (sometimes implicitly) as scientists or people interested in science.
It uses a combination of polling, web hooks (via GNIP) and SUP feeds to aggregate public updates from tracked accounts as soon after they happen as possible. Average latency is ~ 3 minutes for Friendfeed and a few seconds for Twitter.
Right now there’s only one view on the data: by item. Items are the URIs associated with or mentioned in updates: if I tweet “I love http://lolcats.com” and you bookmark it on Delicious then the Streamosphere database will record a single item (lolcats.com) associated with two updates.
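The tweet-plus-bookmark example can be sketched as a small grouping step. This is a rough illustration only: the update fields (`text`, `url`), the helper names and the normalisation rule (lowercased host without `www.`, plus the path without a trailing slash) are all assumptions, not how the Streamosphere database actually keys its items.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Simple pattern for links mentioned in free text (an assumption, not exhaustive).
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def item_key(url):
    """Normalise a URL into a hypothetical item key."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def extract_urls(update):
    """Collect links from the update text plus an explicit 'url' field, if any."""
    urls = URL_RE.findall(update.get("text", ""))
    if update.get("url"):
        urls.append(update["url"])
    # Trim punctuation that tends to trail a link at the end of a sentence.
    return [u.rstrip(".,;:!?)") for u in urls]


def group_by_item(updates):
    """Map each item key to the list of updates that mention it."""
    items = defaultdict(list)
    for update in updates:
        # A set, so one update mentioning the same link twice counts once.
        for key in {item_key(u) for u in extract_urls(update)}:
            items[key].append(update)
    return dict(items)
```

Given the tweet “I love http://lolcats.com” and a Delicious bookmark of `http://lolcats.com/`, both updates end up under the single key `lolcats.com`.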
Items are currently always websites but in the future I’d like to add views for users and topics; these are non-trivial because of problems with account owner disambiguation and classifying short messages respectively.
Owner disambiguation relies on the Google Social Graph API. We need to disambiguate owners because otherwise the same person could post a single link on multiple services and Streamosphere would believe it’s amazingly popular.
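To show why this matters for the activity counts, here is a minimal sketch that counts people rather than accounts. It assumes a lookup table `identity_of` mapping a (service, account) pair to one canonical person id has already been built somehow (for instance from social graph data); that table, the field names and the function name are hypothetical, and no real API is called.

```python
def distinct_people(updates, identity_of):
    """Count the distinct people behind a list of updates.

    identity_of maps (service, account) to a canonical person id.
    Accounts missing from the table are treated as separate people.
    """
    people = set()
    for update in updates:
        account = (update["service"], update["account"])
        # Fall back to the account itself when no owner is known.
        people.add(identity_of.get(account, account))
    return len(people)
```

With this, one person posting the same link to Twitter, Friendfeed and Delicious contributes a count of one rather than three.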
Sometimes users have set up rules to automatically route updates from one service to another (e.g. they share an item on Google Reader which appears in their Friendfeed stream which gets pushed out to their Twitter account). Rules like this are the bane of Streamosphere’s existence – it’s non-trivial to detect this kind of routing and handle it correctly.
I’m collecting hashtags and tags and extracting key terms from all updates but don’t quite know what to do with them yet – I still need a good algorithm to detect trending topics. Links are extracted from updates but right now there’s no disambiguation for papers (Buggotea is alive and well in Streamosphere). There’s a best-effort attempt to resolve shortened URLs, though occasionally one will slip through.
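For illustration, hashtag collection might look something like the sketch below. The regular expressions and function names are assumptions, and the raw counts it produces are only a starting point, not the trending-topic algorithm that is still missing.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
# A '#' not preceded by a word character, followed by word characters.
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")


def hashtags(text):
    """Return lowercased hashtags from one update, ignoring URL fragments."""
    # Drop links first so that '#fragment' parts of URLs are not counted.
    without_urls = URL_RE.sub(" ", text)
    return [tag.lower() for tag in HASHTAG_RE.findall(without_urls)]


def hashtag_counts(texts):
    """Count how many updates mention each hashtag at least once."""
    counts = Counter()
    for text in texts:
        counts.update(set(hashtags(text)))
    return counts
```

Counting each hashtag once per update keeps a single update that repeats a tag from inflating its total.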
There’s no API but if anybody has a good use for the data I’m happy to set something up using GNIP or long polling to support real time updates if necessary – just send me a use case.