Saturday, July 29, 2017

Elasticsearch and PubMed: Step 1 - Getting a local copy of PubMed

With this blog post I want to start a series that looks at the PubMed scientific dataset and applies machine learning to categorize it. PubMed is an online archive of scientific publications related to the medical sciences. As of today, it contains about 27 million papers. Here we will only deal with the metadata: the title, abstract, journal information, authors, and about 50 other attributes describing each publication. This first blog post focuses on how to obtain the dataset. I will describe the details of my project once we are set up with the data, which will probably take this and the next blog post.

The PubMed dataset is re-indexed and published once a year, and the latest baseline (December 2016) can be downloaded like this:

    wget -N ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/*.gz -P /path/to/target/

Expect about 22 GB of gzipped data. Make sure you get the latest version; after December 2017 this link might change. After each re-indexing, PubMed publishes daily updates for new papers, which can be accessed through
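With a download this size, it is worth checking that none of the files arrived truncated. NCBI ships an `.md5` sidecar file next to each archive; the sketch below recomputes each checksum and compares (I am assuming the sidecar naming `*.gz.md5` and the `MD5(filename)= <hex>` line format, so adjust if your copy looks different):

```python
import hashlib
from pathlib import Path

def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_baseline(directory):
    """Compare each *.gz against its *.gz.md5 sidecar file.

    The sidecar files contain a line like 'MD5(pubmed17n0001.xml.gz)= <hex>'.
    Returns the names of files whose checksum does not match.
    """
    bad = []
    for md5_file in Path(directory).glob("*.gz.md5"):
        expected = md5_file.read_text().strip().split("=")[-1].strip()
        gz_file = md5_file.with_suffix("")  # strip the trailing .md5
        if md5_of(gz_file) != expected:
            bad.append(gz_file.name)
    return bad
```

Re-downloading only the files this flags is much faster than pulling the whole baseline again.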

    wget -N ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/*.gz -P /path/to/target/

This link will include daily updates until they are included in the re-indexed dataset in December.
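Before building a full parser, it helps to peek inside one of these archives. Each `.gz` file is a single large XML document made up of `<PubmedArticle>` records, so a streaming parser keeps memory use flat no matter how big the file is. A minimal sketch using only the standard library (the tag names follow the PubMed XML format; `count_articles` is just an illustrative helper):

```python
import gzip
import xml.etree.ElementTree as ET

def count_articles(gz_path):
    """Stream one PubMed .gz archive and count <PubmedArticle> records.

    iterparse never holds the whole document in memory: each article
    element is cleared as soon as it has been counted.
    """
    count = 0
    with gzip.open(gz_path, "rb") as fh:
        for _, elem in ET.iterparse(fh, events=("end",)):
            if elem.tag == "PubmedArticle":
                count += 1
                elem.clear()  # free memory for the processed record
    return count
```

Running this over one baseline file should report on the order of tens of thousands of records per archive.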

With these two downloads, you have the metadata of essentially all published medical-science work of the last 100 years or so. Let's see what we can do with that.

First, have a look at the README.txt, which provides some general information about this dataset. There are also terms and conditions, which you should review depending on what you are planning to do with this data.

Next, we have to ask how to access the data. Right now it sits in .gz files, and reading tens of gigabytes from disk every time we want to use the data is not going to work.

I am planning to add this dataset to my web application www.benty-fields.com, and since this application uses Elasticsearch, I will use that database system to store and access the data. In the next blog post, I will describe how to parse the data into Elasticsearch.
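As a small preview of that parsing step, here is a sketch of how one `<PubmedArticle>` record could be flattened into a plain dict, which is the shape an Elasticsearch document ultimately takes. The element paths follow the PubMed XML format, but the field selection is just an example and `article_to_doc` is a hypothetical helper, not code from benty-fields:

```python
import xml.etree.ElementTree as ET

def article_to_doc(article):
    """Flatten one <PubmedArticle> element into a dict of the fields we
    care about (title, abstract, journal, authors). Missing elements
    simply come back as None or an empty list."""
    authors = []
    for author in article.findall(".//AuthorList/Author"):
        fore = author.findtext("ForeName") or ""
        last = author.findtext("LastName") or ""
        authors.append((fore + " " + last).strip())
    return {
        "pmid": article.findtext("MedlineCitation/PMID"),
        "title": article.findtext(".//ArticleTitle"),
        "abstract": article.findtext(".//Abstract/AbstractText"),
        "journal": article.findtext(".//Journal/Title"),
        "authors": authors,
    }
```

A dict like this can then be handed to an Elasticsearch bulk-indexing call, which is what the next post will cover in detail.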
cheers
Florian