
December 19, 2008

XMP Labelling for Nature

There was an interesting piece by Steve Mollman on CNN.com yesterday (Making sense of the 'semantic Web') which put forward this example:

"The kiosk takes advantage of the fact that MP3 files are "things" that have already been described in ways that machines can understand. That's because they have ID3 tags, which supply information on the artist and album. An MP3 file on an iPhone is already a semantic annotated object, which means it's easily read by a computer."
Now if that story had talked about a PDF instead of an MP3, and if "XMP packet" were substituted for "ID3 tags", then any scholarly article could lay claim to being a "semantic annotated object ... easily read by a computer".

Yesterday's issue of Nature was the first NPG title to go live with such marked-up PDFs. The screenshots below from Acrobat (File > Properties, Cmd-D / Ctrl-D) show what the user might see both with (bottom-left) and without (top-right) semantic markup.

pdf_props.png

Fair enough as far as that goes, but to a machine it's a whole other game. We now have a complete bibliographic record (including DOI) embedded in the PDF using structured markup. And, moreover, we also have a solid bedrock for adding in any additional metadata should the need arise. This semantic labelling is available on all new issues of Nature and will be added to other NPG titles over the coming months.

XMP as a labelling technology could well go a long way towards addressing concerns raised by Olivia Judson in an op-ed piece earlier this week in the New York Times: Defeating Bedlam. The author decries that "downloading papers from journal Web sites" means that "access to information is easier and faster than ever before ... but there’s been no obvious way to manage it once you’ve got it." Those days may soon be over.

Now with XMP, all manner of scholarly content - documents, images and other media types - can be properly labelled, and many programs (not just Zotero and Papers, which she reviews) can directly profit from the richness of semantic web descriptions.
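As a rough illustration of what "easily read by a computer" could mean in practice, here is a minimal Python sketch that scans a file's raw bytes for an XMP packet and pulls Dublin Core values out of it. It assumes the packet is stored uncompressed in the file and that the title and identifier are recorded as dc:title and dc:identifier; nothing above says which properties NPG actually uses, so treat those names, the helper functions and the sample DOI as illustrative only.

```python
import re
import xml.etree.ElementTree as ET

# Standard namespace URIs for RDF and Dublin Core.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# An XMP packet is wrapped in <?xpacket begin=...?> ... <?xpacket end=...?>.
PACKET_RE = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)


def extract_xmp(raw_bytes):
    """Return the XML body of the first XMP packet in raw_bytes, or None."""
    match = PACKET_RE.search(raw_bytes)
    return match.group(1) if match else None


def dc_values(xmp_xml, prop):
    """Return the text values of a Dublin Core property such as 'title'."""
    root = ET.fromstring(xmp_xml)
    values = []
    for desc in root.iter("{%s}Description" % NS["rdf"]):
        node = desc.find("dc:" + prop, NS)
        if node is None:
            continue
        items = node.findall(".//rdf:li", NS)
        if items:
            # Array-valued properties (rdf:Alt, rdf:Seq, rdf:Bag) hold rdf:li items.
            values.extend((li.text or "").strip() for li in items)
        elif node.text and node.text.strip():
            # Simple properties carry their value as plain text.
            values.append(node.text.strip())
    return values


# Hypothetical bytes standing in for a labelled PDF; the DOI is made up.
SAMPLE_PDF = (
    b"%PDF-1.4\n"
    b'<?xpacket begin="\xef\xbb\xbf" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/">'
    b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    b'<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b'<dc:title><rdf:Alt><rdf:li xml:lang="x-default">An example article'
    b"</rdf:li></rdf:Alt></dc:title>"
    b"<dc:identifier>doi:10.1234/example</dc:identifier>"
    b"</rdf:Description></rdf:RDF></x:xmpmeta>\n"
    b'<?xpacket end="w"?>\n'
    b"%%EOF\n"
)

if __name__ == "__main__":
    packet = extract_xmp(SAMPLE_PDF)
    print(dc_values(packet, "title"))
    print(dc_values(packet, "identifier"))
```

A byte scan like this is deliberately naive: it will miss a packet that sits inside a compressed stream, so a real tool would more likely go through a PDF library. The point is only that the record is ordinary XML once you have found it.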

December 12, 2008

Work for Nature, Go to SciFoo

A final reminder (from me, anyway) that entries for the Nature Network Blogging Challenge close on Monday, 5th January 2009. The winners get an all-expenses-paid trip to Science Foo Camp in summer 2009. For background reading, here's a pair of Times Higher Ed articles: one to explain why you should go to SciFoo and another to size up the competition.

Of course, for some of us, just being able to work at Nature feels like something of a win. ;) If you're the same, and if you're a metadata/ontology/text-mining kind of person then you might be interested in a post that we currently have open (yup, some organisations are still hiring). Further details below.

Nature is very busy organising its metadata and now seeks an XML Content Manager / Taxonomist. The person in this new position will lead the strategic creation of metadata for NPG's content, and help to define and map taxonomies for it. They will also have responsibility for the review and update of these taxonomies. Working with a team of subject matter experts, they will identify and create appropriate classification schemas. They will also work closely with NPG's technology and editorial teams to create the right tools, and to integrate content enrichment into our publishing workflows and onto our websites.

Responsibilities will include:

  • Primary metadata advisor for the company. Create and maintain NPG's taxonomy and classification systems, including subject analyses and identification of metadata.
  • Identify current and future metadata needs to support the overall company data strategy; help ensure that metadata management is properly represented in the overall data strategy.
  • Control the metadata lifecycle, promoting metadata through creation, development, and QA stages, enforcing quality standards at each step.
  • Provide subject matter expertise on metadata management tools.
  • Coordinate a team of subject matter experts.
  • Cross-link related metadata artifacts across the entire enterprise, creating descriptions of end-to-end data flows.
  • Manage supporting documentation for both internal and external users.

Experience: 5 or more years' data management experience (proven ability to design, organize, and manage taxonomies and metadata, data classification, metrics, etc). Strong understanding of the STM field. Good understanding of data modeling concepts and XML. Proven data expertise.

Personal attributes: Proven communicator; team player; innovator; facilitator; pragmatic. Education: Bachelor's degree or formal training in a related field.

If you're interested, please contact Amanda Ward (a DOT ward AT nature DOT com).

December 09, 2008

RSS subscriber numbers for science blogs

All your Google Reader stats are belong to us.

Following on from the blogroll data I took a look at the subscriber counts in Google Reader for each science blog that we track.

Some things to bear in mind before trying to make sense of the data:

  • Google Reader is one RSS reader of many. Anecdotally it accounts for between 30% and 70% (big range) of all RSS subscribers.
  • Some blogs are going to have more old school 'refresh the page' vs RSS traffic than others, so this doesn't answer any 'does x have more readers than y' type questions.
  • A subset of blogs have more than one feed (Atom and RSS variants, for example). I'm only counting one - the one submitted to Scintilla or Nature.com Blogs. That could significantly affect results for some blogs.
  • Number of subscribers != number of people actively reading their subscriptions
  • I've left the raw numbers out again, though you can probably work some out from the graph below.

That said, here are the top thirty science blogs by Google Reader subscriptions (#16 - #30 are below the fold):

Rank  Blog
#1    Pharyngula
#2    Mind Hacks
#3    Bad Astronomy
#4    Science Blog - Science news straight from the source
#5    Good Math, Bad Math
#6    Wired Science
#7    WSJ.com: Health Blog
#8    Cognitive Daily
#9    PsyBlog
#10   Not Even Wrong
#11   Environmental Graffiti
#12   BPS Research Digest
#13   The n-Category Cafe
#14   Roland Piquepaille's Technology Trends
#15   Cocktail Party Physics

And here's the obligatory long tailed graph:

blogsubs.png

Pharyngula is, again, way ahead of the pack - PZ is pretty much King of the Science Blogosphere any way you shake it (well, except one. And even then he's in the top ten). The Panda's Thumb is at #21 this time. Interestingly, the top ten blogs by blogroll are otherwise almost completely different to the top ten blogs by subscriber count (that's not to say that the blogrolled blogs aren't popular: they are, just not in the top ten most popular in this context).

2.5-5k subscribers might not seem an awful lot, but it's pretty good going. Boing Boing has around 28k subscribers in Reader (this seemed very low to me; perhaps they've changed feed address recently?). The Nature journal TOC is delivered to around 4,800 Reader users. This blog has 468. The BBC News science and technology feed goes to 6,722.
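To put those Reader figures next to the caveat that Google Reader accounts for somewhere between 30% and 70% of all RSS subscribers, here is a small back-of-the-envelope sketch in Python. The function name and the idea of dividing by the share are my own framing rather than anything measured, and the other caveats (multiple feeds, inactive subscribers) still apply.

```python
def estimate_total_subscribers(reader_count, low_share=0.30, high_share=0.70):
    """Return a (low, high) estimate of subscribers across all RSS readers.

    Assumes Google Reader holds between low_share and high_share of the total.
    """
    if not 0 < low_share <= high_share <= 1:
        raise ValueError("shares must satisfy 0 < low_share <= high_share <= 1")
    # A larger Reader share implies fewer subscribers elsewhere, so the
    # high share gives the low end of the range and vice versa.
    return round(reader_count / high_share), round(reader_count / low_share)


# Reader subscriber counts quoted in the post.
READER_COUNTS = {
    "Nature journal TOC": 4800,
    "This blog": 468,
    "BBC News science and technology": 6722,
}

for feed, count in READER_COUNTS.items():
    low, high = estimate_total_subscribers(count)
    print(f"{feed}: roughly {low:,} to {high:,} subscribers across all readers")
```

For this blog's 468 Reader subscribers that works out to somewhere between about 670 and 1,560 in total, which shows how wide the "big range" really is.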

For even more context - beware, not comparing like with like here so take with a pinch of salt - the average mid-tier science journal might print 450 copies (you could extrapolate wildly to get the number of actual readers by multiplying by ten or twelve).

Anyway, the top fifteen is a pretty mixed bag and I can't see many common themes. Four are neuroscience / psychology related (BPS Research Digest, Mind Hacks, PsyBlog and Cognitive Daily) and four are associated with publications of some sort in real life (Mind Hacks, Not Even Wrong, Wired, WSJ). Suggestions / analysis welcome.

Rank  Blog
#16   Bad Science
#17   Savage Minds: Notes and Queries in Anthropology - A Group Blog
#18   Social Science Statistics Blog
#19   The Loom
#20   Developing Intelligence
#21   The Panda's Thumb
#22   Uncertain Principles
#23   Mixing Memory
#24   Climate Progress
#25   john hawks weblog
#26   blog.khymos.org
#27   CogNews
#28   About.com Archaeology
#29   Respectful Insolence
#30   NHS Blog Doctor

December 05, 2008

Long Tails: just how we roll

(Interested in this kind of thing? Also check out Christina's forthcoming talk, via Bora.)

If we assume that people blogroll sites that they think (a) are good and (b) are relevant to their audience then it seems fair to also assume that by aggregating and analyzing blogrolls from blogs tagged in Nature.com Blogs (did I mention Nature.com Blogs already? Did you sign up?) we can come up with some sort of 'top ten blogs' in each area.

I wrote some scripts to do the relevant scraping. Here are the results. Note that ranks can be tied - The Loom and Aetiology share fourth place, for example:

Across all subject areas

Rank  Blog
1     Pharyngula
2     The Panda's Thumb
3     RealClimate
4     The Loom
4     Aetiology
5     A Blog Around The Clock
6     Cosmic Variance
7     Adventures in Ethics and Science
8     Respectful Insolence
8     The Intersection

I left out the raw numbers for now (at this stage it's just an experiment) but can tell you that Pharyngula and The Panda's Thumb are way ahead of the competition... they're on twice as many blogrolls as RealClimate (also a very popular choice).
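For the curious, the counting and ranking step might look something like the Python below. This is a reconstruction, not the actual scripts: the scraping itself is left out, the input structure is an assumption, and the blog names are placeholders. The tie handling follows what the table shows, where tied blogs share a rank and the next rank follows without a gap (often called dense ranking).

```python
from collections import Counter


def count_blogroll_mentions(blogrolls):
    """Count how many blogrolls each blog appears on.

    blogrolls maps a source blog to the list of blogs it links to.
    """
    counts = Counter()
    for linked in blogrolls.values():
        counts.update(set(linked))  # a duplicate link on one blogroll counts once
    return counts


def dense_rank(counts):
    """Rank blogs by count, highest first; equal counts share a rank."""
    ranked = []
    rank = 0
    previous = None
    # Sort by count descending, then by name so the output order is stable.
    for blog, n in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
        if n != previous:
            rank += 1
            previous = n
        ranked.append((blog, rank))
    return ranked


# Hypothetical scraped data: the blog names are placeholders, not real results.
BLOGROLLS = {
    "source-1": ["blog-a", "blog-b", "blog-c"],
    "source-2": ["blog-a", "blog-b", "blog-c"],
    "source-3": ["blog-a", "blog-d", "blog-d"],
}

for blog, rank in dense_rank(count_blogroll_mentions(BLOGROLLS)):
    print(rank, blog)
```

Here blog-b and blog-c are each on two blogrolls, so they share second place and blog-d comes third rather than fourth.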

It all looks very Long Tail'ish...

blogrolls_long_tail.png

... that is to say that there are a very small number of blogs on lots of blogrolls and a very large number of blogs on few blogrolls.

We've been tracking links between science blogs since Nature.com Blogs launched but there isn't enough data to do a proper comparison between blogroll popularity and incoming link popularity yet. I'd be interested to see what kind of correlation there is: do you link more often to the blogs on your blogroll? Are there some blogs that you add because you feel you should? Does much reciprocal blogrolling go on (almost certainly, I'd guess)? Is there a blogrolling equivalent to prominently displaying an unread copy of Ulysses on your bookshelf even though all you read is Stephen King?

(tables for individual subject areas are below the fold)

Chemistry

Rank  Blog
1     Molecule of the Day
1     Useful Chemistry
2     Developing Intelligence
2     The Sceptical Chymist
2     Genomicron
2     easternblot.net
3     Stoat
4     petermr's blog
4     The Culture of Chemistry
4     chem-bla-ics

Life Sciences

Rank  Blog
1     Pharyngula
2     The Panda's Thumb
3     Aetiology
4     A Blog Around The Clock
5     Living the Scientific Life (Scientist, Interrupted)
6     Eye on DNA
7     The Other 95%
7     Laelaps
8     Gene Expression
8     Not Exactly Rocket Science

Physics

Rank  Blog
1     Cosmic Variance
2     Bad Astronomy
3     Uncertain Principles
3     Biocurious
4     Good Math, Bad Math
4     Cocktail Party Physics
5     Angry Physics
5     Shtetl-Optimized
6     Backreaction
7     A Photon In The Darkness

Bioinformatics

Rank  Blog
1     business|bytes|genes|molecules
1     Public Rambling
1     What You're Doing Is Rather Desperate
2     nodalpoint.org - A bioinformatics weblog
3     The Hyphal Tip
3     Flags and Lollipops - Bioinformatics Blog
3     Omics! Omics!
3     The Seven Stones
4     Suicyte Notes
4     Microarray and bioinformatics
