Adding Open Citations Support
Open citations is a huge dataset that is available from this link.
While the full model is useful, the main thing to remember is that the data.csv file contains 5 columns:
- oci
A unique identifier of the resource described
- citing
The DOI of the citing paper
- cited
The DOI of the cited paper
- creation
The date on which the citation was created, which corresponds to the publication date of the citing paper
- timespan
The interval between the publication dates of the cited and citing papers, expressed as a duration
There are other data files that describe the rest of the model's entities, but for the purposes of deriving co-citation networks the simplest approach is to match the DOI of the citing paper and retrieve the DOIs it cites.
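As a rough illustration of that matching step, the sketch below reads a tiny stand-in for data.csv and groups the cited DOIs by citing DOI. The OCIs and DOIs in the sample are invented placeholders, and the helper name `cited_by_citing` is hypothetical. It holds everything in memory, so it only works for small extracts, not for the full file.

```python
import csv
import io
from collections import defaultdict

# Tiny stand-in for data.csv; the OCIs and DOIs are invented placeholders.
SAMPLE_CSV = """oci,citing,cited,creation,timespan
oci-1,10.1000/a,10.1000/x,2018-01-01,P1Y
oci-2,10.1000/a,10.1000/y,2018-01-01,P2Y
oci-3,10.1000/b,10.1000/x,2019-06-01,P2Y
"""


def cited_by_citing(csv_text):
    """Map each citing DOI to the list of DOIs it cites (hypothetical helper)."""
    references = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        references[row["citing"]].append(row["cited"])
    return dict(references)


references = cited_by_citing(SAMPLE_CSV)
print(references["10.1000/a"])
```

Two papers whose reference lists overlap (here `10.1000/a` and `10.1000/b` both cite `10.1000/x`) are the raw material for the citation network.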
So, the key problem now is how to do this efficiently.
Working with the OC dataset
The dataset is massive (~50GB at the time of writing) but sqlite seems to be able to handle it well.
Installation
Before starting the installation, make sure that you have at least 300GB of disk space available, then install sqlite3 (sudo apt-get install sqlite3) and, optionally, sqlitestudio.
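The disk space requirement can be checked up front. The following is a minimal sketch that assumes the 300GB figure means decimal gigabytes and that the working directory is where the files will live; the helper name `has_enough_space` is hypothetical.

```python
import shutil

# The 300GB figure from the text, read as decimal gigabytes (an assumption).
REQUIRED_BYTES = 300 * 10**9


def has_enough_space(path=".", required=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has `required` bytes free."""
    return shutil.disk_usage(path).free >= required


enough = has_enough_space(".")
print("enough free space" if enough else "not enough free space")
```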
Download the dataset data.zip
Decompress it. At this point the dataset is ~50GB (at the time of writing).
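Start sqlite3 with the name of the database file to create as its argument (without one, sqlite3 works on a temporary in-memory database) and run the following script: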
BEGIN;
CREATE TABLE OC_citations(
    "oci" TEXT,
    "citing" TEXT,
    "cited" TEXT,
    "creation" TEXT,
    "timespan" TEXT
);
CREATE INDEX onciting ON OC_citations(citing);
COMMIT;
This schema is not an entirely accurate representation of the data (every column is stored as plain TEXT), but it is sufficient for a simple test.
From within sqlite3:
`.mode csv`
`.import data.csv OC_citations`
This will create a database file of approximately 72GB. At this point data.csv can be deleted.
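The same import can also be scripted. The sketch below is one possible alternative to `.import`, not the method used above: it streams the CSV into the table in batches with Python's sqlite3 module. It assumes the file starts with a header row, uses an in-memory database and invented sample rows for demonstration, and the helper name `import_citations` is hypothetical. Unlike the script above, it builds the index after loading, which is usually faster than updating the index on every insert.

```python
import csv
import io
import itertools
import sqlite3


def import_citations(conn, csv_file, batch_size=50_000):
    """Stream CSV rows into OC_citations in batches; return the row count."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row (assumes the file has one)
    total = 0
    while True:
        batch = list(itertools.islice(reader, batch_size))
        if not batch:
            break
        with conn:  # commit one transaction per batch
            conn.executemany(
                "INSERT INTO OC_citations VALUES (?, ?, ?, ?, ?)", batch
            )
        total += len(batch)
    return total


# In-memory database for the demo; pass a file path for the real dataset.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE OC_citations("oci" TEXT, "citing" TEXT, "cited" TEXT, '
    '"creation" TEXT, "timespan" TEXT)'
)

# Invented placeholder rows standing in for data.csv.
sample = io.StringIO(
    "oci,citing,cited,creation,timespan\n"
    "oci-1,10.1000/a,10.1000/x,2018-01-01,P1Y\n"
    "oci-2,10.1000/b,10.1000/x,2019-06-01,P2Y\n"
)
imported = import_citations(conn, sample, batch_size=1)

# Build the index once, after the bulk load.
conn.execute("CREATE INDEX onciting ON OC_citations(citing)")
row_count = conn.execute("SELECT COUNT(*) FROM OC_citations").fetchone()[0]
print(imported, row_count)
```

For the real file, `open("data.csv", newline="")` would replace the `io.StringIO` sample.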
Querying
Use standard SQL to query the database file. Lookups that filter on the citing column are served by the onciting index.
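As an illustration, the sketch below runs two such queries from Python against a small in-memory copy of the table: a lookup of the DOIs cited by one paper, and a self-join that counts how often two DOIs are cited by the same paper. The sample rows are invented placeholders and the helper names `cited_dois` and `cocited_pairs` are hypothetical. On the full dataset the self-join would need to be restricted to a set of citing DOIs of interest, since the unrestricted join is very large.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real database file
conn.executescript("""
CREATE TABLE OC_citations(
    "oci" TEXT, "citing" TEXT, "cited" TEXT, "creation" TEXT, "timespan" TEXT
);
CREATE INDEX onciting ON OC_citations(citing);
-- Invented placeholder rows.
INSERT INTO OC_citations VALUES
    ('oci-1', '10.1000/a', '10.1000/x', '2018-01-01', 'P1Y'),
    ('oci-2', '10.1000/a', '10.1000/y', '2018-01-01', 'P2Y'),
    ('oci-3', '10.1000/b', '10.1000/x', '2019-06-01', 'P2Y');
""")


def cited_dois(conn, citing_doi):
    """Return the DOIs cited by `citing_doi`, sorted alphabetically."""
    rows = conn.execute(
        "SELECT cited FROM OC_citations WHERE citing = ? ORDER BY cited",
        (citing_doi,),
    )
    return [row[0] for row in rows]


def cocited_pairs(conn):
    """Return (doi_1, doi_2, count) for DOIs cited together by the same paper."""
    return conn.execute("""
        SELECT a.cited, b.cited, COUNT(*)
        FROM OC_citations AS a
        JOIN OC_citations AS b
          ON a.citing = b.citing AND a.cited < b.cited
        GROUP BY a.cited, b.cited
        ORDER BY a.cited, b.cited
    """).fetchall()


# Ask SQLite how it would run the lookup, to confirm the index is used.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT cited FROM OC_citations WHERE citing = ?",
    ("10.1000/a",),
).fetchall()

print(cited_dois(conn, "10.1000/a"))
print(cocited_pairs(conn))
```

The last column of each `plan` row is a human-readable description of the step; it should mention `onciting` when the index is being used.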