We were very lucky a couple of weeks ago to have David Lipman, director of the NCBI, come to visit us in London. David was kind enough to give a talk to assembled NPG staff. Here are my notes:
PubMed/Medline records have grown linearly since the late 1960s. But GenBank and other databases show closer to exponential growth. NCBI serves up to 1.4m users a day and these users are downloading ~2.25 Terabytes each day. Generally speaking, growth in usage parallels closely the growth in the amount of data in the database.
NCBI spends 90% of its budget on people. 20% is for basic research with the other 80% involved in ‘production’. 75% or more of the production side is involved with sequence data.
Main area of activity is the ‘Sequence Core’, which is composed of:
- Sequence repository (e.g., GenBank)
- Assembly, integration, annotation & curation (e.g., RefSeq)
- Comparison and classification (e.g., BLAST)
Other activities include:
- Retrieval Core (e.g., Entrez)
- Text Core (e.g., PubMed)
- Visualisation Core
- GAIN (with Pfizer)
- Framingham: Long-term study (since the 1940s) of a town in MA, especially heart data. Surviving participants are now being genotyped.
PMC started as an archive for journals that choose to deposit their content. Its XML DTD has been adopted by HighWire, JStor, PLoS, Atypon and others.
Portable PMC allows quick setup of a local mirror of PMC (e.g., Wellcome Trust and BL in the UK).
Literature Archiving Software Suite (LASS): Takes books and articles in NLM DTD and allows search, rendering for the web, etc. Now working on a Word-based authoring tool.
PMC submission system. PMC submission rate is still very low (<5% of NIH grantees), which was predictable because it's not mandatory and makes no difference to future funding. A lot of discussion now about whether it should be made mandatory. 80-85% of grantees know about the policy, but are often sketchy on the details.
Making Entrez more user-friendly
Real-world example of using Entrez: Searches in Entrez return results from across the various databases (e.g., “Fanconi renal failure”). The entry in OMIM (annotated bibliography) mentions a family in Wisconsin with a genetic version of the disease. One of the papers in OMIM locates the gene. Skipping to Entrez Gene, we can get information about this gene: it is hypothetical and there have been no experiments on it to date. Precomputed BLAST results show this protein to be very well conserved across eukaryotes but with little experimental data. Zeroing in on yeast to look for data shows that it is a sodium stress-response regulator. This makes sense given its apparent role in a renal disorder.
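For anyone who wants to try the same kind of cross-database search programmatically, here is a minimal sketch (mine, not something David showed). It assumes NCBI's public E-utilities `esearch.fcgi` endpoint with its `db` and `term` parameters; the helper names and the choice of databases are my own, and the code only builds the query URLs rather than fetching anything.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities ESearch endpoint (not mentioned in the talk).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db: str, term: str) -> str:
    """Build an ESearch URL for one Entrez database (hypothetical helper)."""
    return f"{EUTILS_BASE}?{urlencode({'db': db, 'term': term})}"


def cross_database_urls(term: str, dbs=("omim", "gene", "pubmed")) -> dict:
    """Build one search URL per database for the same query term."""
    return {db: esearch_url(db, term) for db in dbs}


if __name__ == "__main__":
    for db, url in cross_database_urls("Fanconi renal failure").items():
        print(db, url)
```

Fetching each URL would return an XML list of matching record IDs for that database, which is roughly the starting point for the manual walk-through above.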
This took about 10 mins to discover, but unfortunately most Entrez users wouldn’t do this. They follow a simpler, Google-like pattern of typing queries and browsing results. In this respect, scientists don’t use the web in a very different way to other types of users.
A while ago Entrez introduced a search term spellcheck (like Google’s). This works really well and a lot of people click on the link to a search that uses the correct spelling. In contrast, the “Links” option provided in GenBank is very obscure and little used. The expectation that scientists would work out how to use this sort of feature proved to be incorrect.
Now trying to determine which additional links are most valued by users and give these greater prominence. For example, PubMed’s “Related Articles” link is used by only 4% of users. If some information (e.g., partial titles of the related articles) were presented then the link ought to be used a lot more. NCBI will be trying out these kinds of changes with small proportions of users (e.g., 1% = 10k users a day) and measuring the effect.
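David didn't go into how NCBI picks the 1% of users, so the following is only a sketch of one common way to do it: hash a stable user identifier and treat the result as a number between 0 and 1. The function names, the experiment label and the idea of a per-user ID are all my assumptions, not NCBI's implementation.

```python
import hashlib


def in_experiment(user_id: str, experiment: str, fraction: float = 0.01) -> bool:
    """Deterministically place roughly `fraction` of users in an experiment.

    Hashing the experiment name together with the user ID means the same
    user always gets the same answer, and different experiments select
    different subsets of users.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a value in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


def click_through_rate(clicks: int, views: int) -> float:
    """Share of page views on which the link was clicked."""
    return clicks / views if views else 0.0


if __name__ == "__main__":
    # Hypothetical experiment label and synthetic user IDs.
    users = [f"user-{i}" for i in range(100_000)]
    selected = sum(in_experiment(u, "related-articles-titles") for u in users)
    print(f"{selected} of {len(users)} users see the new layout")
```

Comparing `click_through_rate` for the selected group against everyone else would then show whether displaying partial titles lifts usage above the current 4%.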
Taking full advantage of the connected information space requires more work by the server to determine what might be of most value to users (in the NCBI’s case, scientists) in any given context.