Action Potential

Turning web traffic into citations

Our June editorial discusses the relationship between web traffic and citations. Specifically, can one predict how well any particular paper is cited years after publication, based solely on the number of downloads it receives immediately following its appearance online? Our preliminary analysis suggests that this relationship not only exists, but is surprisingly strong.

I’ll leave you to read the editorial for more of the background as to why we examined this relationship, but I will repeat a few key things here. The main purpose of this post is to provide more of the details behind the data and analysis, and to initiate a good discussion.


Everyone has their own pet problem with impact factors, whether it be the calculation method, the non-reproducibility of the actual values, or the disagreement over what IFs really represent, just to name a few. Despite all of these concerns (and more), these numbers are typically used to rate the importance or prominence of a particular journal, and thus, by proxy, the importance of the individual papers published within. This is a seriously flawed use of association (see a previous Nature Neuroscience editorial discussing this concept), and it often leads scientists to equate the total number of citations with scientific impact, which is itself fraught with problems. Searching for an alternative measure of impact that is perhaps free of the “bias of authority” (citing a paper because it is from a famous lab) or the “lemming bias” (citing a paper just because everyone else seems to do so whenever broaching a particular subject) led us to explore readership.

The readership of a particular article should roughly reflect the outside interest in the topic and the perceived value of the experiments within. Readership can potentially be quantified through the examination of download statistics from the website where a manuscript is published. These download statistics can be viewed in the same way as the NY Times bestseller list in the sense that the data are indirect; these numbers don’t actually measure readership as much as they measure access and potential readership. In other words, in our case, we are assuming that everyone who downloads a paper (especially the PDF version) actually reads it. Definitely a leap of faith, but nonetheless, we took this caveat in hand and pressed forward.

The papers in our initial dataset were published online between January 2005 and November 2005 (N = 215 papers). Only research articles and reviews were considered. Our download statistics are compliant with COUNTER, an initiative that provides libraries and publishers with more consistent and credible usage data. For the purposes of the editorial and this post, the actual numbers have been transformed. Our citation data come from Scopus, although we probably could have used Google Scholar or Thomson products just as easily (several studies have found an equivalence in the citation listings between Thomson’s Web of Science and Google Scholar, and there is no reason to believe that Scopus would be any different [Belew, 2005; Pauly & Stergiou, 2005]). Both sets of data were accurate as of the end of March. For the web traffic data, total downloads within a particular time frame were calculated starting from the Advance Online Publication (AOP) date.
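For anyone who wants to try this on their own usage logs, here is a rough sketch (in Python) of how a per-paper table like ours could be assembled. This is an illustration only, not our actual pipeline; the file names and column names ("doi", "aop_date", "event_date", "format") are hypothetical.

    import pandas as pd

    def downloads_within(events: pd.DataFrame, papers: pd.DataFrame,
                         days: int, fmt: str = "PDF") -> pd.DataFrame:
        """Per-paper download counts of a given format within `days` of the AOP date."""
        merged = events.merge(papers[["doi", "aop_date"]], on="doi")
        in_window = (
            (merged["format"] == fmt)
            & (merged["event_date"] >= merged["aop_date"])
            & (merged["event_date"] < merged["aop_date"] + pd.Timedelta(days=days))
        )
        col = f"{fmt.lower()}_{days}d"
        return merged.loc[in_window].groupby("doi").size().rename(col).reset_index()

    # Hypothetical inputs: one row per paper (with its Scopus citation count),
    # and one row per COUNTER-reported download event.
    papers = pd.read_csv("papers_2005.csv", parse_dates=["aop_date"])
    events = pd.read_csv("downloads_2005.csv", parse_dates=["event_date"])

    # 90-day PDF downloads joined to the paper-level table; papers with no
    # downloads in the window get a count of zero.
    dataset = papers.merge(downloads_within(events, papers, days=90),
                           on="doi", how="left").fillna({"pdf_90d": 0})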

We initially noticed that immediate PDF downloads correlated better with eventual citation counts than did HTML downloads (R = 0.65 for PDF vs. 0.60 for HTML). Therefore, we focused on PDF downloads for the remainder of the analysis. It is important to note that this measurement is independent of citation or web traffic differences between fields or between different types of papers. The lowest-cited, least-downloaded paper contributes as much to the weakness or robustness of the correlation as does the most highly cited, heavily downloaded paper.
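For those curious about the mechanics, a minimal sketch of that comparison follows (illustrative only, not our actual workflow). The file and column names are made up, and correlating log-transformed counts is an assumption of the sketch, consistent with the log scales used in the figures.

    import numpy as np
    import pandas as pd
    from scipy.stats import pearsonr

    data = pd.read_csv("nn_2005_downloads_citations.csv")  # hypothetical per-paper table

    def log_corr(downloads, citations):
        # Pearson correlation on log10-transformed counts (+1 guards against zeros)
        return pearsonr(np.log10(downloads + 1), np.log10(citations + 1))[0]

    r_html = log_corr(data["html_90d"], data["citations"])  # ~0.60 in our data
    r_pdf = log_corr(data["pdf_90d"], data["citations"])    # ~0.65 in our data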

As the download time frame was extended, the correlation progressively increased up until 180 days post-AOP, but fast-forwarding to 1 year, the correlation dropped significantly (Fig. 1). This makes sense in retrospect, since web traffic typically declines with time (as a paper becomes “old news”), while citation rates increase with time. The divergence between these two measurements dramatically affects the correlation. This peak correlation between downloads and citations at 6 months was also observed in a previous study that examined the relationship between web traffic and citations for papers deposited in the arXiv pre-print server (Brody et al., 2006).
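Conceptually, the curve in Fig. 1 comes from recomputing the same correlation for successively longer download windows. A rough sketch of the idea (again illustrative; the specific window lengths and column names are assumptions):

    import numpy as np
    import pandas as pd
    from scipy.stats import pearsonr

    data = pd.read_csv("nn_2005_downloads_citations.csv")  # hypothetical per-paper table
    windows = [30, 60, 90, 120, 180, 365]                  # days post-AOP (illustrative)

    corr_by_window = {}
    for days in windows:
        x = np.log10(data[f"pdf_{days}d"] + 1)   # cumulative PDF downloads within the window
        y = np.log10(data["citations"] + 1)      # eventual citation counts
        corr_by_window[days] = pearsonr(x, y)[0]

    # In our data, the coefficient rises out to ~180 days and falls off by 1 year (Fig. 1)
    print(corr_by_window)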

[Image: time.png]

Figure 1 The correlation between downloads and citation counts increases up until 6 months, and then dramatically decreases at 1 year. Correlation coefficients are graphed as a function of time.

We next decided to see how well web download data could predict eventual citations. For this analysis, we calculated a linear best-fit equation for the data graphed in the editorial. We then took all papers published in Nature Neuroscience within the first 3 months of 2006 (N = 55 papers) and used their 90-day PDF download numbers as the ‘X’ input into the equation. This yielded a series of citation values defining a predicted linear best-fit line for the 2006 data. Comparing this line to the actual best-fit line for the data, we see that although they are different, the slopes are nearly identical, suggesting that our predicted values are systematically offset towards higher citation rates (Fig. 2). This offset could arise because the citation data are not yet mature for papers published so recently, with actual citations generally lagging behind those predicted by the model.
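In code, the prediction step amounts to fitting a line to the 2005 points and feeding the 2006 download numbers through that equation. A minimal sketch, assuming log-log axes as in the figures and hypothetical file and column names:

    import numpy as np
    import pandas as pd

    nn_2005 = pd.read_csv("nn_2005_downloads_citations.csv")     # hypothetical files
    nn_2006 = pd.read_csv("nn_2006_q1_downloads_citations.csv")

    # Best-fit line through the 2005 data: log10(citations) = slope * log10(downloads) + intercept
    x05 = np.log10(nn_2005["pdf_90d"] + 1)
    y05 = np.log10(nn_2005["citations"] + 1)
    slope, intercept = np.polyfit(x05, y05, deg=1)

    # Feed the 90-day PDF downloads of the 2006 papers into that equation
    x06 = np.log10(nn_2006["pdf_90d"] + 1)
    predicted = slope * x06 + intercept

    # Line actually fitted to the (still immature) 2006 citation counts, for comparison
    actual_slope, actual_intercept = np.polyfit(x06, np.log10(nn_2006["citations"] + 1), deg=1)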

[Image: predicted.png]

Figure 2 PDF downloads vs. citation counts for 2006 papers. The predicted line is derived from calculations using the 2005 best-fit equation.

Finally, we decided to test how this relationship would hold up in another discipline. Previous studies examining downloads vs. citations found that physics and math preprints (Brody et al., 2006) and a subset of the medical literature (Perneger, 2004) show a similar positive correlation between downloads and citation counts for individual papers. We extended our own study to papers published in Nature Genetics in 2005 (N = 168). Again, we found a strong correlation between immediate PDF downloads and eventual citation counts (R = 0.71) (Fig. 3). Thus, this relationship is likely to hold up across various disciplines and across journals with different impact factors, and it applies to preprints as well as published articles. With studies suggesting that open-access articles receive more citations than those published behind firewalls (Eysenbach, 2006), it would be interesting to determine how open-access articles (with a presumed higher readership, or at least potential readership) fare in this type of analysis.

[Image: ng_2005 corrected.jpg]

Figure 3 PDF downloads vs. citation counts for 2005 articles published in Nature Genetics.

We realize that this analysis is enticing at best, potentially providing a piece of an alternative solution for deciphering the impact of an individual paper. In this current scientific climate where tenure and grant funding decisions are influenced by flawed metrics like impact factor, it is important to make good use of all available technology in an attempt to realize a better system of measuring the scientific impact of any particular paper. This analysis is obviously preliminary and flawed in its own ways, but perhaps metrics such as paper downloads can find a place in a compilation of aggregated stats, painting a more accurate and informative picture of manuscript influence.

This analysis was conducted jointly with Hilary Spencer of Nature Precedings. We would like to thank Jamie Sampson for assistance in acquiring the download statistics.

_____________________________________________________________________________________________

 

UPDATE: Sorry it took so long, but here are the plots of the data (the NN data from the editorial and the NG data from above) without the log scales. This was by request.

Nature Neuroscience

[Image: Downloads citations scatter 90 (new)]

Nature Genetics

[Image: ng_2005 (new)]

Comments

  1. Dr J K Aronson said:

    I have two questions:

    1. What were the citation counts for papers that were not downloaded? You should plot them for comparison.

    2. What were the slopes of the regression lines on the log-log plots? A slope of unity shows a linear relation, but a slope other than unity means that the relation is a power function and is not linear. The data in Figure 1 in the editorial seem to have a slope that is different to unity, in contrast to the data in Figure 3 in the blog. It would be helpful to see the data plotted on Cartesian coordinates, rather than in double log format.

  2. Pedro Beltrao said:

    These are very interesting results, and really more should be done to emphasize that the impact factor of the journal where a paper is published is not a very good measure for evaluating individual researchers.

    The only problem with accepting the number of downloads, or any web-traffic-related metric, as an evaluation criterion is that it is much easier to game than the impact factor of a journal. Hopefully the COUNTER initiative is thinking ahead and including some tests that would prevent inflation of these metrics.

    Also, journals would facilitate the adoption of this metric if this information were made available for each paper published online.

  3. DrugMonkey said:

    Much agreed with your final para here Noah.

    This little effort of you all at NN also points out what should be obvious. Isn’t the statistic that authors most want to know the number of people actually reading their work? Publishers have been making articles available online for how long now?

    It seems as though the per-article download stats would be more readily discussed, studied and (perhaps) available to the authors by now.

  4. David Colquhoun said:

    Good for Aronson for spotting all the statistical weaknesses in the argument. The correlation is in fact not at all impressive, especially when you recall that it is r-squared that matters.

    The comment by Beltrao, “is much easier to game”, I find more disturbing. It is not a “game”; it is dishonesty. The major objection to using citations to assess individuals is that they don’t measure quality, but almost as important an objection is that they are an incentive to dishonesty, and will penalise honest science.

  5. Noah Gray said:

    @Dr. Aronson: All papers were both downloaded and cited, so you are seeing all of the data. The slopes are something other than unity, indicating that the relationship is not truly linear in the technical sense, but it is trending in that direction. The R-squared value for the goodness of fit of the linear relationship to the data is 0.41. The actual line falls below unity, suggesting, as in the prediction analysis, that these citation data are a little too young: there are fewer citations per paper than would be expected. My thought is that in a few years, the goodness of fit will steadily increase as the citation counts mature. For the record, the R-squared for the Nature Genetics data is 0.5. I am currently traveling and will plot the data without the log scales upon my return.

    @Pedro, I agree that download statistics can be manipulated. Therefore, any future use of this metric would need to seriously consider the problems of spiders, etc… However, this analysis does show a relationship between downloads and citations, suggesting that further thought towards making download stats more robust may be warranted.

    @David, Please see my response to Dr. Aronson for your R-squared values. Considering the variability of the data, I am surprised that they are even that high.

  6. Pedro Beltrao said:

    David Colquhoun – I am not a native speaker, so I might have used the expression incorrectly. “Gaming the system”, for me, means observing and using the rules of a system to obtain an outcome that was not the intention of those rules. In this case, scientists trying to inflate their value as measured by usage of their papers.

    I read your blog post, and if I understood your point, you are saying that evaluation metrics distort the objective of science, since individuals focus more on the metrics than on the production of science. The problem is that resources are limited and have to be allocated in some way. What would then be a fair way of allocating resources?

    Ideally, scientists should be evaluated by peers who know the work and can say how interesting it is. It should be a group of people large enough that the very subjective criteria of single individuals would not influence the outcome.

    I am guessing that this is not done because there is just not enough time/people to do this in every case.

    If metrics are necessary to help decide, then we should be thinking about the metrics that best reflect the interest/impact/usefulness of a work and that are not easily abused, since there will always be a fraction of people who are dishonest.

  7. Maxine said:

    A set of perspectives has just been published on all aspects of the use and misuse of bibliometric indicators. Please see this Nature Network forum for links and discussion: https://network.nature.com/london/forums/citation-science/1717

    One contributor is Dr Philip Campbell, Editor in Chief of Nature and Nature publications, who argues among other things that the performance metric should be the number of citations of a scientist’s individual papers, not of the journal in which he/she publishes. This is one answer to your question of what would be a better metric, Pedro, but the perspectives linked to in the NN forum provide much other food for thought.

  8. Nathaniel Blair said:

    Thanks for posting this, Noah; it is interesting and thought-provoking. As for the positive yet weak correlation between downloads and citations, I actually see that as suggesting that the downloads are more useful than they otherwise might be. If the correlation were much stronger, then there would be no distinct ‘signal’ to be gained from looking at the download counts, and we might as well stick to just the total citations. As it stands, a paper whose downloads diverge from its citations might be indicating some amount of influence missed by the citations.

    I have to admit, though, that I am sympathetic to both David’s and Pedro’s comments. As David wrote in his comment and linked post, it is very difficult (nigh impossible?) to truly reduce the impact of a paper to a meaningful and easily digestible metric. And to rely on some of these as benchmarks for performance is dangerous and can produce actions antithetical to good science.

    Yet I also understand what Pedro is saying: in the face of a science that is getting both bigger and more reductionist, it is hard to accurately assess the quality of a paper. The best people to really assess a paper are those directly in the (sub)field, and their judgements are subject to all sorts of bias, conscious or not.

    Still, though, as I’m currently reading James Surowiecki’s “The Wisdom of Crowds”, I can’t help but think that if there really were some way of aggregating all the individual assessments made by the relevant scientific group regarding a paper, we would get some strong information about its validity and impact. How that would actually be done isn’t clear, though I have some vague hope that Web 2.0 tools will ultimately address it.

  9. Sergio Stagnaro MD said:

    I admit that both this editorial and all the former comments on it are really fascinating and intriguing. However, to the best of my knowledge, the proposed solutions regarding citations, impact factor, and so on are not exhaustive as far as the importance of research is concerned. The problem remains an open question. For instance, determining which research and which researchers are the most outstanding is a problem that is difficult to resolve.