July 19, 2007

Appidemiology

Could you use mathematical models from epidemiology to predict how an app is going to spread within the Facebook population?

People already talk about apps spreading virally. Admittedly my knowledge of epidemiology comes almost entirely from Wikipedia and an old episode of Numb3rs, but it seems like it'd all fit:

Your Facebook friends are your friends in real life (theoretically, anyway) so they tend to be clustered geographically - there's the place you grew up, where you went to college, where your first job was... etc.

Apps spread primarily from friend to friend - either by exposure through a box on a user's profile page (medium level exposure) or through invites (high exposure) and news stories (low exposure).

There are seasonal trends - well, daily trends. Mondays and Fridays are hot Facebook app time, when immune systems (common sense? Other things to do?) are low and infection is most likely to spread.

An example of an acute infection would be Zombies - you install it because all your friends have it, send out loads of invites and pretty soon afterwards you realize that it's a bit rubbish and uninstall it (whereupon you're cured and are no longer infectious). Chronic infections would be things like Graffiti and SuperPoke that get used all the time. Latent infections would be something like Flixster that only kicks in when you make an occasional update.
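
To make the analogy concrete, here is a minimal sketch of an SIR-style (susceptible / infected / recovered) simulation on a friendship graph. It is only an illustration, not a validated epidemiological model: the function name `simulate_app_spread`, the toy graph and the two probabilities are all made up for the example. "Infected" means the app is installed and invites are going out, "recovered" means it has been uninstalled. The different exposure levels (profile box, invites, news stories) could be modelled by giving each channel its own install probability, but the sketch uses a single one.

```python
import random


def simulate_app_spread(friends, seeds, p_install, p_uninstall, steps, rng):
    """Discrete-time SIR-style simulation on a friendship graph.

    friends: dict mapping each user to a list of their friends
    seeds: users who have the app installed at the start
    States: 'S' never installed, 'I' installed and inviting, 'R' uninstalled
    Returns one {'S': n, 'I': n, 'R': n} count per step.
    """
    state = {user: "S" for user in friends}
    for user in seeds:
        state[user] = "I"

    history = []
    for _ in range(steps):
        new_state = dict(state)
        for user, status in state.items():
            if status != "I":
                continue
            # An installed app exposes each friend who has never had it.
            for friend in friends[user]:
                if state[friend] == "S" and rng.random() < p_install:
                    new_state[friend] = "I"
            # The user may get bored and uninstall (acute infection).
            if rng.random() < p_uninstall:
                new_state[user] = "R"
        state = new_state
        history.append({s: sum(1 for v in state.values() if v == s) for s in "SIR"})
    return history


# Hypothetical toy graph: two clusters joined by one well-connected user.
toy_graph = {
    "amy": ["ben", "cal", "dee"],
    "ben": ["amy", "cal"],
    "cal": ["amy", "ben"],
    "dee": ["amy", "eve", "fay"],
    "eve": ["dee", "fay"],
    "fay": ["dee", "eve"],
}

for step, counts in enumerate(
    simulate_app_spread(toy_graph, ["ben"], 0.4, 0.2, 10, random.Random(2007)), start=1
):
    print(step, counts)
```

A low `p_uninstall` behaves like the chronic case above, a high one like the acute case.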

Seriously, at the University of Texas Lauren Ancel Meyers is already using social graphs to model infectious disease:

“Each person within a community is represented as a point in the network,” Meyers explains. “The edges that connect a person to other people represent interactions that take place inside or outside of the home, including interactions that take place at school or work, while shopping or dining, while at a hospital, etc. The network thereby captures the diversity of human contacts that underlie the spread of disease.”

There are obvious problems with trying to capture all of the significant interactions inside real life communities. No such problems for Facebook, where interactions are all stored in convenient news feed form.

Some people may come into contact with very few people, but others may have many strands connecting them to other people in the community through their work or social habits. If this person becomes sick, he or she has the potential to become what researchers call a “superspreader,” someone who spreads disease to a lot of people in the community. Identifying potential superspreaders is one step in curbing an outbreak (ed: towards knowing who to market to.)
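
As a rough illustration of what "identifying potential superspreaders" could mean on a friend graph, the simplest possible proxy is to rank people by how many connections they have. This is only a sketch: the function name and the sample data are invented, and real network epidemiology uses much richer measures than raw friend counts.

```python
def rank_superspreaders(friends, top_n):
    """Return the top_n users with the most connections.

    friends: dict mapping each user to a list of their friends
    Ties are broken alphabetically so the result is deterministic.
    """
    return sorted(friends, key=lambda user: (-len(friends[user]), user))[:top_n]


# Hypothetical friend lists.
sample = {
    "ann": ["bob", "cat", "dan"],
    "bob": ["ann"],
    "cat": ["ann", "dan"],
    "dan": ["ann", "cat"],
}
print(rank_superspreaders(sample, 2))
```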

Any epidemiologists or Facebook app developers interested in investigating further? AOP Publishing Awards here we come...

July 17, 2007

Scientific blogging plug-ins

Are there any aspects of Blogger, Movable Type or WordPress that could be tweaked to make them more suitable for science blogging? Any plugins that'd be really useful?

I'm thinking of things like:

  • A plugin that took the work out of finding the free version of a paper (where available) - you just enter a DOI and the software automatically replaces it with the most appropriate final destination.
  • A bibliography / reference builder that tied into Connotea, CiteULike and / or Zotero
  • Automated submission of research-heavy posts to Precedings or WebCite
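
To give a feel for the first idea, here is a minimal sketch of the DOI-to-link step. It assumes the blogger (or some external service) supplies a lookup table of known free copies, here a plain dictionary called `free_copies`; that table, the function names and the example DOI are all hypothetical. When no free copy is known, the sketch falls back to the public DOI resolver at doi.org.

```python
from urllib.parse import quote

# Prefixes people commonly paste in front of a DOI.
DOI_PREFIXES = (
    "doi:",
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
)


def normalize_doi(raw):
    """Strip whitespace and any resolver or 'doi:' prefix from a DOI string."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in DOI_PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


def best_link(raw_doi, free_copies):
    """Prefer a known free copy; otherwise link to the DOI resolver."""
    doi = normalize_doi(raw_doi)
    if doi in free_copies:
        return free_copies[doi]
    return "https://doi.org/" + quote(doi, safe="/")


# Hypothetical DOI and free-copy table, for illustration only.
free_copies = {"10.1000/xyz123": "https://example.org/preprints/xyz123.pdf"}
print(best_link("doi:10.1000/xyz123", free_copies))
print(best_link("10.1000/abc456", free_copies))
```

The hard part of a real plugin would be populating the free-copy table, which this sketch deliberately leaves open.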

Any of those sound useful? Any other suggestions?

We were discussing this in Web Publishing because there've been two WordPress plugins of interest to the science blogging community released recently.

The first is Pierre Far's EasyPg which makes it easy to include the relevant openreview friendly markup in your blog posts so that it can be picked up by sites like Postgenomic and Chemical Blogspace.

The second is ConnoShow by Andrew Straw which lets you embed Connotea bookmarks in your posts. Andrew's plugin is pretty simple at the moment but as he points out the possibilities are intriguing:

Although the current version is perfectly functional, its simplicity means it only suggests what is possible. For example, automatic bibliography generation with in-text citations and custom styles seems a straightforward, if somewhat labor intensive, extension.

There are a lot of talented developers out there. Maybe some could get together to create an easy to install, open source 'so you want to run a science blog?' package of plug-ins for the popular blogging platforms?

Bill McCoy (Adobe) and Mike Culver (Amazon) visit Nature

Last Friday we had the pleasure of welcoming Bill McCoy, General Manager of e-Publishing Business at Adobe Systems, to speak about their next-generation digital publishing technologies.

A while back, we were also lucky enough to play host to Mike Culver, Web Services Evangelist at Amazon. I was travelling so much after Mike came to see us that I never got around to posting my notes, so I'm adding them below.

It was very eye-opening to hear from two organisations that are doing so much to reinvent the way that publishers (and others) can operate online.

Bill McCoy, General Manager, e-Publishing Business, Adobe Systems

eBooks have seen false dawns, but digital publishing is now finally reaching a tipping point:

  • Demand increasing: Over 1,000 libraries in the US are lending e-books. Take-up has exceeded expectations. Not necessarily in direct competition with paper books. Pragmatic Programmers (a tech publisher) make 30% of their turnover (and 80% of international sales) from e-books. In the future, consumers will expect content to be available in digital formats.
  • Supply increasing: Every single major trade and textbook publisher (at least in the US) is committed to providing their content in digital form.
  • Digital distribution alternatives multiplying. In particular, hardware is getting better.

But:

  • Print is still the vast majority of the revenue for most publishers.
  • The web is a good place to market and distribute, but in many cases the browser is not a good place to read the content. E.g., O'Reilly Safari finds that most users download content as PDFs. People want the 'downloading' experience rather than the 'browsing' experience.
  • Consumers and publishers still suffer from format confusion, client software and DRM hassles. For example, PDFs are good for preserving the print layout, but don't work well on small screens.

There's not going to be an 'iPod for e-books':

  • Technology constraints require compromise. This makes it more like the digital camera market.
  • Digital reading is gaining adoption in multiple contexts.
  • Text-based content is marketed and distributed in a wide variety of forms via a wide variety of channels. As a result, there will be no 'iTunes for books'.

Audience question: When will e-paper come of age?

E-ink is competing with cheaper LCD technology, which is also getting better all the time. So it may never hit the mass market. But e-ink is already best if you want long battery life and high resolution/contrast. Most reading right now takes place on generic devices, not specialised readers.

Q: How does e-book library lending work?

Two ways: Online viewing (not very successful) and downloadable content (strong take-up). Typically DRM-protected at publishers' insistence. Some publishers are against the whole idea of e-book lending, even with DRM. E.g., in the US Random House and Simon & Schuster (two of the big five) don't support it.

Standards are essential to a successful ecosystem:

  • PDF: paginated final-form content. Now controlled by an ISO-track standardisation group. Therefore an open standard.
  • IDPF (International Digital Publishing Forum) developing a similar standard, 'EPUB', for 'liquid' content. Based on XHTML, CSS, SVG, OpenType — the EPUB file is just a zip file containing these different components.
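
The "just a zip file" point is easy to see with Python's standard `zipfile` module. The sketch below builds a tiny EPUB-like archive in memory and lists its parts. The file contents are placeholders rather than a valid book, and the `OEBPS/...` names are merely conventional choices for the example. The one detail taken from the container format itself is the `mimetype` entry, which comes first and is stored uncompressed.

```python
import io
import zipfile


def build_epub_like_zip():
    """Build a tiny EPUB-like zip archive in memory and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry goes first and is stored without compression.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        # Placeholder components; a real book needs proper XML in each.
        archive.writestr("META-INF/container.xml", "<container/>",
                         compress_type=zipfile.ZIP_DEFLATED)
        archive.writestr("OEBPS/chapter1.xhtml", "<html><body>Hello</body></html>",
                         compress_type=zipfile.ZIP_DEFLATED)
        archive.writestr("OEBPS/style.css", "body { margin: 1em; }",
                         compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


def list_components(data):
    """Return the names of the files inside a zip archive given as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


for name in list_components(build_epub_like_zip()):
    print(name)
```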

Adobe ePublishing Solution:

  • Adobe Digital Editions, a new application that supports both of the above standards. Focused on reading, not authoring, so it's a light download (3MB). It incorporates Flash too — the first fusion of Adobe and Macromedia technology.
  • Authoring tool: InDesign CS3.
  • Hosted content-protection service: ADEPT (coming next month).
  • Broad approach: Cross-platform support, mass-market mobile and dedicated devices.

Demo of ADE: Starts on the 'library' screen. It looks nothing like Adobe Reader. Allows you to sort, group and search content. Allows highly composed, image-driven content and two-page layout. Simple panning and zooming of PDF content. Also allows personal annotations. (Shared annotations coming.) Also allows reflowing text to fit screen size using new 'liquid' EPUB file format. Showed creation of EPUB e-book file from InDesign file. You can also transform to other XML formats.

(A Q&A session followed in which I didn't take notes.)


Mike Culver, Web Services Evangelist, Amazon

Intro

What is Amazon?

  • Online retail business (over 64 million active customer accounts)
  • Merchant business (sell on Amazon.com as a merchant)
  • Technology business (100s of 1000s of Amazon Associates, >1.1m active seller accounts, >2220k developers use Amazon Web Services)

Mike comes from the technology business.

Web startups have a hard time scaling when they suddenly take off -- e.g., YouTube. (He shows some interesting Alexa stats that show Photobucket outpacing Flickr in terms of reach.) Amazon Web Services (AWS) changes all the fixed costs (e.g., CPUs, disks) into variable costs.

Most systems need computing power, storage and messaging (between systems). AWS packages these into different products:

Ecommerce (ECS)

Exposes Amazon's product data plus a shopping cart. Now in 4th major release. Surprised by high demand, and this made Amazon look more closely at Web Services. For example, TVmojo.com is driven by Amazon. There are many vendors that are better at serving niches than Amazon is. This service is free. In fact, there's commission on referral sales. See also Couchville, which uses the ECS service to provide information about movies (and allows people to buy DVDs).

Storage (S3)

Data storage in Amazon data centres. Data 'objects' are automatically duplicated across multiple locations. Supports both SOAP and REST. Can keep data private or make it public. Each object can be up to 5GB, but larger files can be split into chunks, and there's no upper limit on total storage. Now over 5 billion objects stored.
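
The chunking idea can be sketched without touching S3 at all. The helper below (a hypothetical name, not part of any Amazon library) just works out the byte ranges that a large file would be split into so that each piece fits under a per-object size limit; uploading each range as its own object is left to whatever client you use.

```python
def plan_chunks(total_size, max_object_size):
    """Split total_size bytes into (offset, length) ranges no larger than the limit."""
    if max_object_size <= 0:
        raise ValueError("max_object_size must be positive")
    chunks = []
    offset = 0
    while offset < total_size:
        length = min(max_object_size, total_size - offset)
        chunks.append((offset, length))
        offset += length
    return chunks


GB = 1024 ** 3
# A 12GB file with the 5GB object limit mentioned in the notes.
for offset, length in plan_chunks(12 * GB, 5 * GB):
    print(offset, length)
```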

Charges: $0.10 per GB to upload, $0.15 per GB/month, tiered download charges.

Mike then demos an interactive desktop tool that allows him to manage files on S3. He also shows how each of these objects (e.g., pictures) can be made available on the web if they are made public: each one has its own URL at s3.amazonaws.com. With the right DNS changes, you can also make this work with your own domain, even though it's actually served by Amazon.
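
For illustration, a public object's address can be assembled from its bucket name and key. The exact URL layouts below are my assumption, based on the s3.amazonaws.com host mentioned in the demo, and the bucket and key names are invented; check Amazon's documentation before relying on either form.

```python
from urllib.parse import quote


def s3_public_url(bucket, key, virtual_hosted=False):
    """Build a web address for a public S3 object (assumed URL layouts).

    bucket: bucket name (hypothetical in the example below)
    key: object key, e.g. a path-like file name
    virtual_hosted: put the bucket in the host name instead of the path
    """
    path = quote(key, safe="/")  # escape spaces etc., keep the slashes
    if virtual_hosted:
        return f"http://{bucket}.s3.amazonaws.com/{path}"
    return f"http://s3.amazonaws.com/{bucket}/{path}"


print(s3_public_url("my-photos", "holiday/charles bridge.jpg"))
print(s3_public_url("my-photos", "holiday/charles bridge.jpg", virtual_hosted=True))
```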

The Second Life client application is now downloaded from S3 because they have a lot of users and upgrade it often, so they needed a scalable solution.

Computing (EC2)

This allows similar upscaling (and downscaling) of computing capacity. It provides Linux servers with complete root access. Amazon is responsible for uptime but not backups.

EC2 can take jobs from SQS. He shows an example of an anonymous company whose software automatically monitors the queue length and fires up more servers when the queue gets too long (and vice versa).
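
The queue-driven scaling in that example boils down to a small decision rule. Here is a minimal sketch of one way it might look; the function names, the jobs-per-server figure and the server limits are assumptions for the example, not details from Mike's talk, and the code only computes the decision rather than calling any Amazon API.

```python
import math


def desired_servers(queue_length, jobs_per_server, min_servers=1, max_servers=20):
    """How many servers are wanted for the current queue length.

    jobs_per_server: how many queued jobs one server is expected to absorb
    The result is clamped between min_servers and max_servers.
    """
    needed = math.ceil(queue_length / jobs_per_server)
    return max(min_servers, min(max_servers, needed))


def scaling_action(current_servers, queue_length, jobs_per_server,
                   min_servers=1, max_servers=20):
    """Positive result: start that many servers. Negative: shut that many down."""
    target = desired_servers(queue_length, jobs_per_server, min_servers, max_servers)
    return target - current_servers


# Hypothetical readings of the queue length over time.
servers = 2
for queue_length in [15, 95, 400, 30, 0]:
    change = scaling_action(servers, queue_length, jobs_per_server=10)
    servers += change
    print(queue_length, change, servers)
```

A production version would presumably add some damping so that servers are not started and stopped on every small fluctuation.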

$0.10 per 'wall-clock hour' (i.e., about $72 a month) per server instance. Each virtual machine looks like a 1.7GB RAM, 160GB disk server.

Mechanical Turk

The "MTurk" allows you to request people to do things — usually small tasks for micropayments — via a Web Services interface.

For example, a UK company called Geospatial uses it to have people annotate images of roads. Also, price comparison sites use the MTurk to check that two products on offer at two different sites really are the same. In both cases, these are very hard programming challenges.

The MTurk also includes an eBay-like reputation system to make sure that 'employers' behave honestly. You can pay whatever you like for work, but as in any free market, if you pay too little then your tasks will never get done. He also shows the Sheep Market example. The owner got MTurk users to draw sheep for $0.02 each. Then he sells plates of 20 for $20. A more serious example is Casting Words, which does podcast transcriptions by farming the work out via the MTurk.

Questions:

How many MTurk workers are there? Don't know.

How do you stop MTurk being used to generate spam? It's against the terms of use, so Amazon will stop you. Another example is click fraud.

Does Amazon use the MTurk? Yes, they use it for a variety of purposes. One is UnSpun, which gives information such as "Best Irish Pubs in the Seattle Area".

Scientific uses of the MTurk? Scientific translation.

Competitors? No one is doing quite the same thing. For example, other storage options provide higher-level interfaces, not developer-level interfaces.

Ethical implications of the MTurk? Amazon takes 10% - that's the business model. You need an Amazon account to receive payment, which would eliminate developing country participants who might otherwise be exploited.

Investors have been skeptical about Amazon's move into infrastructure services? Yes, at first, but more recently they've been supportive.

July 12, 2007

GalaxyZoo

Alex Szalay, professor of astronomy at Johns Hopkins University (JHU) and an e-science visionary, sent us this message today.

I do not know if you had a chance to look at GalaxyZoo. It is being served up from JHU, was done in collaboration with Chris Lintott et al. from [the BBC TV programme] Sky At Night. Here are some recent press hits: The Times, the Washington Post, Nature and the BBC.

We are getting phenomenal traffic. We will hit 1M galaxy classifications before the end of day 2. Our servers are literally melting down. Another interesting example of how people are actually really interested in science if there is a right presentation, and also they can do something concrete.

Yup, that's one million image classifications in two days. A heart-warming example of online collaboration, and of hardcore science reaching out to, and inspiring, the mainstream.

July 10, 2007

Interviews

Jeff Gomez, Director of Internet Marketing at Holtzbrinck (Nature's ultimate parent company) has posted a great interview with Manolis Kelaidis, inventor of bLink, star of the O'Reilly Tools of Change Conference, and a good friend of Nature's.

Meanwhile, elsewhere, John Dupuis at York University in Toronto was kind enough to post this email interview with me. Boy, I made him wait for my responses — but we were in the middle of rolling out Nature Precedings and Scintilla, and John was very understanding.

I also had the pleasure the other day of speaking with Jon Udell, who's been a tech hero of mine since at least the time that his book, Practical Internet Groupware, opened my eyes fully to the collaborative potential of the Internet. Jon's posted it as a podcast. (I see he's blogged it here too.) I haven't had a chance to listen to the final cut yet, but I fear that I come over as rather incoherent unless Jon has worked some serious magic in the editing room.

OTMI at BioNLP 2007

I presented a talk and a poster on OTMI at BioNLP 2007 the week before last (Friday, June 29). This was a one-day workshop attached to the ACL 2007 (45th Annual Meeting of the Association for Computational Linguistics) conference, held in the quiet outskirts of Prague. There were around 80 people present including speakers and delegates. The talk was well received and there were many questions (see below) which provide some food for thought. Many thanks to Kevin Cohen and Lynette Hirschman for inviting me.

I was fortunate enough to talk early in the morning while people were still lively (talk is here) and there were several questions afterwards both in the Q&A and later during the breaks and the poster session at the end of the workshop. Some questions/observations are listed below:

  1. General. Seems that there is not enough appreciation that OTMI is being proposed as a standard framework and methodology for disclosing subscription full text for text mining. That is, most of the features are parametrized and it is up to individual publishers to determine e.g. whether a snippet is a paragraph or a phrase, whether snippets are randomized or not, etc.
  2. Random order. Questions asked about need to shuffle the order or can the size of the snippets be made larger, e.g. paragraph units? (See point above re publisher choice.)
  3. Stopwords. Feeling is that omitting stopwords is just needlessly destructive. Do we need to inflict this lossy transformation on the full text? (It is proper that the OTMI framework allows for this, but do we want to cripple the text thus effectively rendering certain text mining techniques inoperative?)
  4. Word vectors. Immediate feeling was that these are pretty much useless as anybody can count, but more practiced hands conceded that these could be a useful 'entry level' for non-specialists, i.e. the vectors could be used to determine a rough and ready document categorization. Related to this were questions on word vectors being made available for a document corpus rather than just the document in question, so that the document could be gauged against a corpus.
  5. Sections. There was positive feedback re our picking out key sections (methods, conclusions) although there are still questions about section ordering and section naming.
  6. Tables. Do we include table captions? Answer is no, and here I really don't understand why not. Had we thought of making the actual table data available? I don't know but probably represents an extra level of complexity because some kind of row/column ordering would need to be preserved.
  7. Figures. We include figure captions, but did we think about adding in (i.e. referencing) the figures themselves? (The figures are currently maintained behind a subscription firewall.)
  8. References. Are references linked back to the original text? I don't think they are properly marked up to allow the reference to be paired off with the source text snippets. This makes a lot of sense.
  9. Reuse policy. Are snippets of full text able to be reproduced along with annotations on a third-party website?
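
For readers who haven't met these ideas before, the transformations discussed in the points above (snippets, random ordering, stopword removal and word vectors) can be sketched in a few lines. This is purely illustrative and is not the OTMI format or reference code: the function names, the sentence-based snippet rule and the tiny stopword list are all assumptions, and each step is a switchable parameter in the spirit of point 1.

```python
import random
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "that"}


def tokens(text, drop_stopwords=True):
    """Lowercase word tokens, optionally without stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if drop_stopwords:
        words = [w for w in words if w not in STOPWORDS]
    return words


def word_vector(text, drop_stopwords=True):
    """Word counts for a text: the 'anybody can count' part."""
    return Counter(tokens(text, drop_stopwords))


def disclose(text, shuffle=True, drop_stopwords=True, rng=None):
    """Cut text into sentence snippets, optionally reduced and shuffled."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    snippets = [" ".join(tokens(s, drop_stopwords)) for s in sentences if s]
    if shuffle:
        (rng or random).shuffle(snippets)
    return {"snippets": snippets, "vector": word_vector(text, drop_stopwords)}


sample = "The cell divides. The cell grows in the dish."
print(disclose(sample, rng=random.Random(0)))
```

Comparing a document's `Counter` with counts built from a whole corpus would be one way to gauge it against that corpus, as raised in point 4.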

We're going to be looking into these questions and trying to come up with some real good answers. Of course, we are always open to feedback, either from comments on this post, privately to the feedback address otmi@nature.com or publicly to the OTMI discussion list otmi-discuss@crossref.org. And the OTMI wiki at http://opentextmining.org/ is always available for public input to the project.

(Note: Peter Corbett of the Unilever Centre for Molecular Informatics, Department of Chemistry, University of Cambridge has posted an account of the BioNLP 2007 workshop here.)
