Working with Citehound projects
This section describes the typical process for loading a given bibliographical dataset into Citehound and running queries on it.
It is assumed here that you have already gone through the Installation section, which covers building up the project_base “infrastructure” of datasets. [1]
Setting off on a new project
The most common research workflow involving Citehound proceeds as follows:
1. Work on a solid literature search strategy
2. Run a thematic search on a literature database (e.g. Pubmed)
3. Download the bibliographical dataset
4. Import the bibliographical dataset to Citehound
5. Link a given bibliographical dataset to external datasets
6. Proceed with further analysis
Hopefully, the first item in this list does not come as a surprise. Citehound was not built to replace the work of comprehending the information. You still need to make the effort of composing the picture that emerges from the bibliography and working out how it fits into a given research context. What Citehound helps with is navigating through large volumes of academic papers to summarise, or highlight, a particular aspect of the collection. For more information around organising a literature review, please see this guide or any other similar introduction to “Structured Literature Reviews”.
The result of the search strategy is a set of rules (or “constraints”) that define whether or not a paper is acceptable for a given bibliographical dataset. This is a very important step in the whole analysis because it becomes the semantic thread that binds together the concepts that are presented throughout the papers.
These first two steps already determine a large part of the success of a research project on bibliographical data (or that of a literature review).
Steps 3 onwards simply upload and link the data within Citehound and allow you to navigate a large bibliographic dataset or produce the evidence for answering questions such as:
How does research on a given topic scale over the years, given a bibliographical dataset?
Who is the most prolific author?
Which other authors do they tend to work with?
…and others.
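To make these questions concrete, here is a minimal sketch that answers them over a handful of made-up records in plain Python. The record layout (`year`, `authors`) and the function names are assumptions chosen for illustration only; they are not Citehound's data model or query interface, which are covered in the section on queries.

```python
from collections import Counter

# Hypothetical toy records; Citehound's real data model is richer than this.
papers = [
    {"year": 2019, "authors": ["Ada", "Ben"]},
    {"year": 2020, "authors": ["Ada", "Cy"]},
    {"year": 2020, "authors": ["Ada", "Ben", "Dee"]},
]


def papers_per_year(papers):
    """Number of papers published in each year, ordered by year."""
    return dict(sorted(Counter(p["year"] for p in papers).items()))


def most_prolific_author(papers):
    """The (author, paper count) pair with the highest paper count."""
    counts = Counter(a for p in papers for a in p["authors"])
    return counts.most_common(1)[0]


def coauthors_of(papers, author):
    """Co-authors of `author`, most frequent collaborators first."""
    counts = Counter()
    for p in papers:
        if author in p["authors"]:
            counts.update(a for a in p["authors"] if a != author)
    return counts.most_common()
```

On a real bibliographical dataset the same three questions are asked of the linked data held in the project database rather than of an in-memory list.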
The rest of this guide describes steps 3 onwards focusing on bibliographical data originating from Pubmed. [2]
Creating a new Citehound “project”
When using Citehound, it is advised to allocate every different “literature review” or “bibliographical data research project” to its own underlying database.
Each such database is a Neo4j database, preloaded with the Citehound data model and a number of datasets.
The preloading process is outlined in the Installation section and results in project_base.
To create the (example) project pubmed_project_1:
> cadmin.py db create pubmed_project_1 --based-on project_base
This creates pubmed_project_1, and we are now ready to import a bibliographical dataset.
Importing a Pubmed bibliographical dataset into pubmed_project_1
Fig. 2 Importing Pubmed to Citehound
Citehound was originally developed to process XML files exported from Pubmed. The option to export a search “result set” as an XML file used to be available from Pubmed’s search page, but it no longer is. Unfortunately, the export options currently available from the search page produce datasets that are severely limited in terms of data processing.
Citehound includes a convenient tool that can download Pubmed data in XML format given a list of PMIDs [3].
Obtaining Pubmed XML data
To download a given set of publication data in XML format:
1. Run your query on PubMed.
2. Export your result set in PMID format (suppose it is saved in pubmed_articles.pmid).
3. Fetch the article data in XML format:
> cadmin.py fetch pubmedxml pubmed_articles.pmid > pubmed_articles.xml
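Before fetching, it can be worth checking the PMID file itself. The sketch below assumes the export is a plain text file with one numeric PMID per line; that layout, and the helper name `read_pmids`, are assumptions for illustration and not part of Citehound.

```python
from pathlib import Path


def read_pmids(path):
    """Return the unique PMIDs found in a text file, in first-seen order.

    Assumes one PMID per line; blank lines are ignored and anything that
    is not a plain run of digits raises ValueError.
    """
    seen = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        token = line.strip()
        if not token:
            continue
        if not (token.isascii() and token.isdigit()):
            raise ValueError(f"Not a PMID: {token!r}")
        seen.setdefault(token, None)  # a dict keeps insertion order
    return list(seen)
```

Calling `len(read_pmids("pubmed_articles.pmid"))` then gives the number of distinct articles you should expect to find in the downloaded XML.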
Note
Citehound uses a set of standard calls towards NCBI’s API to download the XML data of a set of PMIDs.
Without an API key, connections are rate limited to 3 requests per second; with an API key, the limit rises to 10 requests per second. A “request” here does not correspond to a single article but to a batch of articles: Citehound usually downloads bibliographical data in batches of up to 300 entries.
If you do have an NCBI API key, you can make it known to Citehound by setting the environment variable NCBI_API_KEY.
If Citehound finds NCBI_API_KEY, it queries NCBI at the higher rate; otherwise it falls back to staying within the lower rate.
For more information on how to obtain an NCBI API KEY, please see here
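The arithmetic behind these limits can be sketched as follows. The batch size and the two rates are the figures quoted in this note; the function names are hypothetical, and the code only estimates a lower bound on download time. It does not reproduce how Citehound actually schedules its requests, and it ignores network latency.

```python
import math
import os

BATCH_SIZE = 300      # upper bound on entries per request, as quoted above
RATE_WITHOUT_KEY = 3  # requests per second without an API key
RATE_WITH_KEY = 10    # requests per second with an API key


def allowed_rate(environ=os.environ):
    """Requests per second permitted, depending on NCBI_API_KEY being set."""
    return RATE_WITH_KEY if environ.get("NCBI_API_KEY") else RATE_WITHOUT_KEY


def batches(pmids, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` PMIDs."""
    for start in range(0, len(pmids), size):
        yield pmids[start:start + size]


def minimum_seconds(n_pmids, environ=os.environ):
    """Lower bound on download time implied by the rate limit alone."""
    n_requests = math.ceil(n_pmids / BATCH_SIZE)
    return n_requests / allowed_rate(environ)
```

For example, 30,000 PMIDs need at least 100 requests, so the rate limit alone implies a floor of roughly 33 seconds without a key and 10 seconds with one.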
Importing Pubmed XML data
Now, given the pubmed_articles.xml Pubmed XML file, importing it to Citehound is achieved by:
1. Make sure that your pubmed_project_1 is active
2. Import the dataset
> cadmin.py ingest data PUBMED pubmed_articles.xml
This concludes the import of a bibliographic dataset into Citehound.
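If the number of imported articles looks lower than expected, a quick check of the XML file against the original PMID list can help narrow things down. The sketch below assumes the file is a single well-formed XML document in the usual Pubmed layout, where each `PubmedArticle` element carries its identifier in `MedlineCitation/PMID`. The helper names are hypothetical and the code is independent of Citehound's own importer.

```python
import xml.etree.ElementTree as ET


def pmids_in_xml(path):
    """Collect the PMID of every <PubmedArticle> in a Pubmed XML file."""
    found = []
    # iterparse streams the file instead of building the whole tree first.
    for _event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "PubmedArticle":
            # Only the article's own PMID, not PMIDs of cited records.
            pmid = elem.findtext("MedlineCitation/PMID")
            if pmid:
                found.append(pmid.strip())
            elem.clear()  # drop the article's children once processed
    return found


def missing_pmids(requested, xml_path):
    """PMIDs that were requested but do not appear in the XML file."""
    return sorted(set(requested) - set(pmids_in_xml(xml_path)))
```

An empty list from `missing_pmids` means every requested PMID has a matching article in the file.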
Data linking
Fig. 3 Simplified diagram of the data linking process.
At this point, we have three different datasets in the system with minimal links between them. In order to link the newly imported Pubmed bibliographical dataset with ROR, you need to run a “data linking” step.
This is achieved with:
> cadmin.py db link
Very briefly, this script applies blocking on countries (that is, it only compares records that share a country) and then, for each country, runs a linkage step on that country’s institutions.
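The general idea of blocking followed by per-block linkage can be sketched as below. This is an illustration of the technique only: the record fields, the identifiers, the string similarity from `difflib` and the 0.8 threshold are all assumptions, and the actual matching performed by `cadmin.py db link` may differ.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_by_country(records):
    """Group records by country so comparisons stay within one block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["country"]].append(rec)
    return blocks


def similarity(a, b):
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link(affiliations, institutions, threshold=0.8):
    """Pair each affiliation with its best match from the same country."""
    inst_blocks = block_by_country(institutions)
    links = []
    for aff in affiliations:
        candidates = inst_blocks.get(aff["country"], [])
        if not candidates:
            continue  # nothing to compare against in this block
        best = max(candidates, key=lambda i: similarity(aff["name"], i["name"]))
        if similarity(aff["name"], best["name"]) >= threshold:
            links.append((aff["name"], best["id"]))
    return links


# Hypothetical records; the identifiers are placeholders, not real ROR ids.
institutions = [
    {"id": "inst-1", "name": "University of Oxford", "country": "GB"},
    {"id": "inst-2", "name": "University of Cambridge", "country": "GB"},
]
affiliations = [
    {"name": "university of oxford", "country": "GB"},
    {"name": "University of Oxford", "country": "FR"},
]
links = link(affiliations, institutions)
```

Blocking keeps the number of comparisons manageable: the second affiliation is never compared against the British institutions because its country differs, so it stays unlinked even though the names are identical.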
For more information about the topic of “Record Linkage”, start here
Conclusion
This concludes the process of importing and linking a Pubmed bibliographical dataset.
Onwards now, to working with queries.