The PubMed data set is re-indexed and published once a year; the latest baseline (December 2016) can be downloaded with:
wget -N ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/*.gz -P /path/to/target/
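Each archive on the FTP server comes with a companion .md5 file, so it is worth verifying the download before building on it. Here is a minimal sketch in Python (the file name is illustrative, and I am assuming the .md5 file ends with the hex digest; adjust the parsing if your files look different):

import hashlib

def md5sum(path, chunk_size=1 << 20):
    # Hash the file in 1 MB chunks so large archives do not fill memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

archive = "/path/to/target/medline17n0001.xml.gz"  # illustrative file name
expected = open(archive + ".md5").read().split()[-1]  # last token is the digest
print("OK" if md5sum(archive) == expected else "checksum mismatch")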
Expect about 22 GB of compressed data. Make sure you get the latest version: after the next annual re-index (December 2017), the contents of this directory will change. Between re-indexes, PubMed publishes daily update files for new papers, which can be accessed through
wget -N ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/*.gz -P /path/to/target/
This directory accumulates daily update files until they are folded into the next re-indexed baseline in December.
With these two downloads, you have a copy of the published record of medical science from the last 100 years or so. Let's see what we can do with that.
First, have a look at the README.txt, which provides some general information about the dataset. There are also terms and conditions, which you should review depending on what you are planning to do with this data.
Next, we have to ask how to access the data. Right now it sits in .gz files, but reading about 22 GB of compressed data from disk every time we want to use it is not going to work.
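To get a feel for the contents without a database, you can stream a single archive directly. A minimal sketch (the file name is illustrative; the record tag is PubmedArticle in recent baselines, while older ones may use MedlineCitation instead):

import gzip
import xml.etree.ElementTree as ET

with gzip.open("/path/to/target/medline17n0001.xml.gz", "rb") as f:
    # iterparse handles one element at a time instead of loading the whole file.
    for event, elem in ET.iterparse(f, events=("end",)):
        if elem.tag == "PubmedArticle":
            pmid = elem.findtext(".//PMID")
            title = elem.findtext(".//ArticleTitle")
            print(pmid, title)
            elem.clear()  # drop processed elements to keep memory flat

This works for a quick look, but scanning every file for every query is exactly the problem a database solves.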
I am planning to add this dataset to my web application www.benty-fields.com. Since that application already uses elasticsearch, I will use elasticsearch to store and access this data as well. In the next blog post, I will describe how to parse the data into elasticsearch.
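To give an idea of where this is going, here is a minimal sketch of indexing one parsed article with the official Python client (the index name and field layout are my own choices for illustration, not how benty-fields actually does it; older client versions take body= instead of document=):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

article = {
    "pmid": "12345678",  # placeholder values for illustration
    "title": "Example article title",
    "abstract": "Example abstract text.",
}
# Use the PMID as the document id so re-indexing an updated record
# overwrites the old version instead of creating a duplicate.
es.index(index="pubmed", id=article["pmid"], document=article)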
cheers
Florian