Who comments on scientific papers - and why?
I drafted a long paragraph for here about how science publishers are generally rubbish at commenting implementations, but if you read journals online you already know this. Can you name three science publishers that allow online commenting? When last did you leave a comment on a paper? Have you ever?
We're planning on rolling out commenting on the rest of Nature.com relatively soon (the system is already up at News@Nature) so obviously we're interested in seeing how other publishers have handled things and what the results have been.
BioMedCentral (BMC) are one of the companies who have had some success on the commenting front. In publishing tech they're quietly innovative in lots of ways and they've allowed online commenting on all of their papers for the past five years. Over time they've built up a reasonably large database of user comments. Matt Cockerill at BMC kindly gave us access to a dump of all comments published since they launched in 2002 (the majority of BMC's data is open access). I figured it'd be interesting to do a bit of simple data mining.
So.... who leaves comments on BMC? How frequently? What are they about?
The basics
From November 2002 to July 2008 BMC journals accumulated 945 comments from 753 different users on 732 different papers.
How widespread is commenting at BMC?
BMC have published 37,916 papers since they launched. That means that (732 / 37,916) = ~ 2% of BMC papers have attracted comments. Most BMC journals publish low to medium impact papers (this is not a reflection of the quality of those papers, I hasten to point out); it'd be interesting to see if the percentage is any higher when you only look at their higher impact journals. I didn't have time to do that.
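For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the headline ratios from the figures quoted in this post. The variable and function names are made up for illustration; only the four input numbers come from the post.

```python
# Figures quoted in the post (November 2002 to July 2008).
papers_published = 37_916
papers_with_comments = 732
total_comments = 945
distinct_commenters = 753


def share(part: int, whole: int) -> float:
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


# Fraction of all published papers that attracted at least one comment.
commented_pct = share(papers_with_comments, papers_published)

# Average thread length on papers that have any comments at all.
comments_per_commented_paper = total_comments / papers_with_comments

# Average number of comments left by each distinct commenter.
comments_per_commenter = total_comments / distinct_commenters

print(f"Papers with at least one comment: {commented_pct:.1f}%")                # 1.9%
print(f"Comments per commented paper:     {comments_per_commented_paper:.2f}")  # 1.29
print(f"Comments per commenter:           {comments_per_commenter:.2f}")        # 1.25
```

The unrounded figure is about 1.9%, which is where the "~ 2%" comes from.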
It'd also be interesting to see how this compares with somewhere like PLoS One.
The graph below shows the number of comments attached to papers grouped by the year of publication. Perhaps unsurprisingly there's an upwards trend. I guess this is partly because more people are becoming familiar with commenting and partly because there are more papers to comment on.

(edit: Note that there are only 82 comments on papers published in 2008 so far, even though we're in July, i.e. halfway through the year. It'd be interesting to see what the average time between publication of the article and publication of the first comment is. Are most comments in any one calendar year on the papers published the year before?)
Who leaves comments?
~ 1/3rd of comments on papers are left by the authors themselves, in order to present supplemental information, to make readers aware of errors or typos in the manuscripts, or to reply to an earlier comment by somebody else (authors added their comment to an existing thread 37% of the time).
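The "left by the authors" figure comes from comparing each commenter's name with the paper's authorship data from CrossRef (see the reply to Christina in the comments). The exact matching rule isn't given, so the following is only a rough sketch of one way such a comparison could work: it assumes names arrive as "Given Family" strings and treats two names as the same person if the family name and first initial agree. The function names and sample names are hypothetical, and a rule this loose will produce false matches for common surnames.

```python
import unicodedata
from typing import List, Tuple


def normalise(name: str) -> Tuple[str, str]:
    """Reduce a 'Given Family' name to (family name, first initial).

    Assumption: the last whitespace-separated token is the family name.
    Accents are stripped and case is ignored so that 'Pérez' matches 'Perez'.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    parts = without_accents.lower().replace(".", " ").split()
    if not parts:
        return ("", "")
    return (parts[-1], parts[0][0])


def is_author_comment(commenter: str, authors: List[str]) -> bool:
    """True if the commenter's normalised name matches any listed author."""
    key = normalise(commenter)
    if key == ("", ""):
        return False
    return key in {normalise(author) for author in authors}


# Made-up example: the commenter signs with an initial and no accent.
paper_authors = ["Ana Pérez", "Jon Smith"]
print(is_author_comment("A. Perez", paper_authors))  # True
print(is_author_comment("Bo Li", paper_authors))     # False
```

A more careful version would also use affiliations or email addresses where available, since initial-plus-surname matching can't distinguish two different "J. Smith"s.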
64 people out of the 753 (~ 8%) commented on more than one paper. Only two dozen people have commented on more than two different papers. In other words it's not the same people commenting all the time (as you often find with blogs).
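Counts like "commented on more than one paper" can be derived from a flat list of (commenter, paper) pairs. The sketch below assumes that shape for the data; the real BMC dump's format isn't described here, and the sample rows are invented purely to show the counting.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def papers_per_commenter(comments: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each commenter to the set of distinct papers they commented on."""
    papers: Dict[str, Set[str]] = defaultdict(set)
    for commenter, paper_id in comments:
        papers[commenter].add(paper_id)
    return papers


def count_commenters_above(comments: Iterable[Tuple[str, str]], threshold: int) -> int:
    """Count commenters who commented on more than `threshold` distinct papers."""
    return sum(
        1 for papers in papers_per_commenter(comments).values() if len(papers) > threshold
    )


# Invented sample: several comments on one paper count as a single paper.
sample = [
    ("alice", "p1"), ("alice", "p1"), ("alice", "p2"),
    ("bob", "p1"),
    ("carol", "p3"), ("carol", "p4"), ("carol", "p5"),
]

more_than_one = count_commenters_above(sample, 1)
more_than_two = count_commenters_above(sample, 2)
print(more_than_one, more_than_two)  # 2 1
```

Using a set per commenter matters: someone who posts three replies in one thread has still only commented on one paper.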
What are the comments about?
Broadly speaking the comments fall into one of seven categories:

Updates from authors 24%
Authors providing corrections, updates and replies.
"As of 4/13/06, correspondence for Peter K. Rogan should be sent to:"
"Please DO NOT utilize the version of the software as currently available. It will not function as specified ... [we're working on a fix]"
"Please note that there is a typo in the phosphopeptide sequence in the METHODS section"
One reason this percentage is so high is that it seems authors are encouraged to comment on their own papers if they spot typos or other errors in their published manuscripts, as it can take BMC a wee while to sort out corrections in the PDF and HTML. By commenting, authors can make readers aware of mistakes sooner.
Requests for clarification 8%
Readers asking for more information from the authors.
"Were the PAO2 values directly estimated by sampling end tidal alveolar gas?"
Interpretation and see also... 22%
Readers suggesting how the results of a paper might be interpreted, often pointing towards additional relevant research.
"Recent research data suggests that the lethality of the H5N1 strain..."
"We would like to inform readers of this paper of our prior efforts in the area of PCR primer design"
"Researchers planning to follow the 'plan A' approach discussed in this paper may benefit from first checking out..."
Direct criticism 17%
Readers pointing out possible experimental flaws or errors.
"We think, however, that the authors have overlooked an important confounder when adjusting the relationship between exposure and allergic disease.."
"The fact that the network is trained on only 20 samples and validated on the rest 20 at the end of the flow schematic does not mean that a correct validation has been done"
Bonus material 17%
Things that could plausibly have been included in the paper. Quite common in the bioinformatics journals. Links to datasets, implementations and software downloads.
"The described method is available as an R script and can be found at..."
"The software described in this article is available online at http://dulci.org/sage/."
"I have made the test data we used for this paper available from http://biotext.org.uk/ on the Downloads page."
Other comments 8%
Appreciative comments.
"I feel that this is a timely commentary, addressing the issue of semantic enrichment of our scientific literature"
"This paper is, in my opinion, the by far most clear and up to the point paper I ever read on the analysis of microarray data"
Crazies 2%
Self explanatory.
Takeaway message?
The quality of comments at BMC is high and the vast majority add value to the paper, though the numbers involved are relatively low (would a larger audience reading higher impact papers be different?).
Perhaps unsurprisingly comments on papers are not like comments on blogs; they're far more formal (only 8% of comments were of the chatty, supportive variety) and it's not the same people coming back each time (with the exception of the crazy 2%).


Comments
Interesting. Did you code all of the comments? Where did the categories come from? Did you do a reliability check on your coding?
Posted by: Christina Pikas | July 22, 2008 01:16 PM
Hi Christina,
Good question. ;) The categories are arbitrary and the comments were coded by me (no reliability checks).
I just eyeballed a subset, noted down the common themes then coded the entire set. The themes / categories got revised as I went along.
The 'comments left by authors' %s come from comparing authorship data from CrossRef about the paper in question to the name of the person who left the comment.
If anybody fancies doing a more formal study then I'm happy to pass on what I've got. Matt C is keen for people to take a look at the BMC data and I'm pretty sure you could get PLoS comments too.
Euan
Posted by: Euan | July 22, 2008 02:53 PM
Very interesting breakdown, Euan, thanks. And thank you to BMC for making their data available. I have not looked at PLOS One's comments for a while, but when I last did, there were not that many either, and most of the ones I found were of the "author addition or correction" variety. Of course this was a "by eye" look rather than a proper analysis.
Do you know whether BMC provides any filter (moderation) before the comments are live on their website? Possibly not, if 2 per cent were crazies.
I fairly recently looked at Biology Direct, a BMC journal which publishes the referees' comments along with the papers, a nice feature. There were not many additional comments on those papers, but it was certainly interesting to read the referee reports.
Posted by: Maxine | July 22, 2008 03:13 PM
@Maxine - yes, we do filter comments, but we try to be as inclusive as possible and only exclude obviously crazy and/or off-topic comments. I'd be interested to see which comments were judged to make the 2% of craziness in the comments corpus, but I'd argue this is a decent percentage and shows we are not over-zealous with our filtering.
Posted by: Chris Leonard | July 22, 2008 03:53 PM
Hi Euan -
Neat study - Can you let us know the distribution by commenter? Do a small handful of commenters post the majority of comments, perhaps following the general 90-9-1 rule?
Posted by: Hilary | July 22, 2008 05:23 PM
Hilary, apart from the 2 %, Euan says not. I think you would expect this for this type of content, from the type of comments received (highly specific to the paper) and the breadth of subjects covered by the papers. On the other hand, if you are crazy you are crazy about a lot of different things!
Posted by: Maxine | July 23, 2008 04:02 AM
Great work, Euan, thanks! We need more of this sort of analysis!
Now how was that for a chatty, supportive comment? :-)
Posted by: Bjoern Brembs | July 23, 2008 07:06 AM
Hi Euan,
So that's 3 a week across all the BMC journals [why exactly are you bothering with a comments system? :) ] whereas there were 7 comments on your blog post within a day. Apart from some systems being hopelessly inconvenient to use, there also seems to be a mismatch between the archival nature of published papers and the transience of a post like this.
I suspect if you got authors to agree to attend to a blog-type system for a time limited period after publication you'd get a lot more activity. I've had pretty extensive mailing-list discussions on my own papers, but no on-line comments.
As you know we're in the pdf and web page annotation business too. Virtually all the interest in our system, a.nnotate.com, is in individual or group annotation (discussing documents, correcting drafts, making indexes). Almost no-one makes their annotated documents publicly accessible. In many cases of course it is a privacy or copyright issue, but, looking at our own comments, a lot are just "notes to self" or things where you disagree or don't understand. Most people are rather cautious of making such stuff public, but it does seem to be where much of the value of an annotation system lies.
Posted by: Robert Cannon | July 23, 2008 12:43 PM
Hi Robert,
By blog type system do you mean in terms of ease of use? Your time limited suggestion is interesting, though I think it'd be unforgivably unfashionable to ignore the long tail of people finding the paper more than, say, three months after it got published. ;)
I do agree that quite apart from anything else the barrier to leaving a comment should be as low as possible.
re: annotations, yeah. I wonder what the PLoS One public annotations data is like.
Posted by: Euan | July 23, 2008 01:08 PM
Yes, blog comments have certainly solved the ease of use problem (unlike most annotation systems - I wonder about the PLoS One data too). But another important consideration seems to be the "this is a blog: we're operating within blog rules" aspect which is far more conducive to joining the discussion than the "this is a peer-reviewed paper: your comment will be visible for eternity" aura around some journal comments.
I agree about the long tail problem, but it is a trade off between obligations on the authors and benefits for the readers. It would be great if authors were willing to keep an indefinite blog open on all their papers (thinking of my own, I would for some: I'd want to close others) but they'd probably be more willing to try it out if it was a time limited commitment.
P.S. this puzzled me before, but presumably the times on your blog posts are from the US east coast, not London time? It could do with an "EST" label or some such!
Posted by: Robert Cannon | July 23, 2008 03:47 PM
Comments on BMC articles, in my opinion, are useful, since they can highlight some points of the arguments, updating them, possibly. In addition commenting author, sometimes is not able to publish own ideas, submitting his (her) paper to BMJ directly. For instance, Apolipoproteins allow to predict CAD better than total cholesterol and LDL blood level. A possible comment on such article could underline the overlooked knowledge of inherited CAD real risk, characterized by coronary microcirculatory remodelling, wherein new born-pathological, subtype b) aspecific, type I, Endoarteriolar Blocking Devices play a central role. This accounts for the reason that NOT all individuals with diabetes, hypertension, dyslipidaemias, a.s.o., are suffering from CAD! (156. Stagnaro Sergio. Role of Coronary Endoarterial Blocking Devices in Myocardial Preconditioning - c007i. Lecture, V Virtual International Congress of Cardiology, 2007. http://www.fac.org.ar/qcvc/llave/c007i/stagnaros.php)
Posted by: Sergio Stagnaro MD | July 24, 2008 06:06 AM
Nice, thanks for the analysis. Can you take 2006 and compare papers that have comments with those without comments and check the average number of citations ? Do you know if we could get access to number of PDF downloads for BMC papers as well ? I would like to correlate those with citations.
Comments tend to pick up over the years when the communities get used to them. Eventually the problem tends to be on how to filter them. Still, it would be great to know from places like Amazon, Netflix, etc if there are some design tips on how to promote online discussion and rating.
Posted by: Pedro Beltrao | July 24, 2008 10:33 PM
Good ideas, Pedro. I don't know if BMC would be happy to share download figures for individual papers (with Nature, anyway ;)), but you could ask. I don't think they're publicly available, anyway.
After you commented I tried fetching citation counts from Scopus for all of the papers in the dataset, but only ~ 1/4 - 1/3rd are found (not found is not the same as having no citations) so the dataset is probably too small to draw any real conclusions from... :(
Posted by: Euan | July 25, 2008 10:36 AM