Tuesday, September 19, 2017

Elasticsearch and PubMed: Step 3, Diving into the Data

In the last two blog posts, I explained how to get a local copy of all 30 million PubMed publications and how to parse and index this data set into an elasticsearch database. In this blog post, I will describe a small project which leverages this data set.

We will go through all publications (titles and abstracts) and count how many contain a given term. We will do that for different terms and plot the counts as a function of time. This will tell us which topics are currently trending in medical science. Here is a possible output of the program described in this post:
Figure 1: Number of PubMed publications with certain topics as a function of time.
This plot shows a very steep increase in publications about Ebola in 2014. We can also see that research on blood cancer or leukemia has decreased over the last 25 years, while research on prostate cancer has increased over the same period.

Elasticsearch uses a JSON-based query syntax, which can be a bit cumbersome, but the documentation is very good.

First, we write the elasticsearch query which counts the papers. There are a couple of different queries in elasticsearch which can do that, e.g. the match, term, and match_phrase queries.

The match query analyzes the search term and then compares it to the terms stored in the index:
{
    "query": {
        "match" : {
            "title" : "Cancer"
        }
    }
}
To analyze the term, elasticsearch runs the same analyzer we used to index the field. You might remember from the last blog post that we used the standard analyzer, which tokenizes the text (splits it into words) and puts everything in lower case. So the match query above will return all papers which mention the term 'Cancer' or 'cancer'.
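If you want to check exactly what an analyzer does to a term, you can ask elasticsearch directly via the analyze API. Here is a quick check, assuming the same elasticsearch-py client es that is used in the code further below:
res = es.indices.analyze(body={"analyzer": "standard", "text": "Cancer"})
print([token["token"] for token in res["tokens"]])
# prints: ['cancer']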

Note that the elasticsearch approach can be limiting. For example, there is no easy way to distinguish lower case and upper case anymore because of the pre-processing of the standard analyzer. You can use the term query to prevent the search term from being analyzed
{
    "query": {
        "term" : { 
            "title" : "Cancer" 
        }
    }
}
but this search will return zero results since 'Cancer' with a capital 'C' does not exist anywhere in the index. For our use case this does not matter, but if you want to keep the ability to distinguish upper and lower case, you need to use a different analyzer or mapping when indexing your data.
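One common option is a multi-field mapping which indexes the title twice: once analyzed and once verbatim as a keyword. This is just a sketch, not the mapping we used in the last post, and the exact layout depends on your elasticsearch version:
{
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "fields": {
                    "raw": { "type": "keyword" }
                }
            }
        }
    }
}
A term query on title.raw would then match 'Cancer' with a capital 'C' exactly.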

In some cases, we might be interested in finding multi-word terms or phrases, like 'drug resistance'. The match query is going to run the standard analyzer on this phrase, and besides lower-casing all words, the standard analyzer also tokenizes the phrase. So it turns 'drug resistance' into ['drug', 'resistance'] and looks for papers which contain either of the two words. We can additionally provide the and operator to enforce that documents contain both terms,
{
    "query": {
        "match" : {
            "title" : {
                "query": "drug resistance",
                "operator": "and"
            }
        }
    }
}
but this is still not exactly what we want, since it will return documents which have the words 'resistance' and 'drug' anywhere in the title. We want to enforce that these words appear next to each other. This can be achieved with the match_phrase query.
{
    "query": {
        "match_phrase":{
            "title": "drug resistance"
        }
    }
}
If you want to allow some leniency in how close the two words have to appear, you can provide the slop parameter. A slop of 2 would still count documents which have no more than 2 words between 'drug' and 'resistance'. With an infinite slop, one would recover the behavior of the match query with the and operator.
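For example, to still count documents with up to 2 words between 'drug' and 'resistance':
{
    "query": {
        "match_phrase":{
            "title": {
                "query": "drug resistance",
                "slop": 2
            }
        }
    }
}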

Besides making sure that the papers contain the specific term, we also want to limit the publication date to a certain range. We therefore create a bool-must query like this
{
    "query": {
        "bool": {
            "must": [{
                "range": {
                    "arxiv_date":{
                        "gte" : low_date.strftime('%Y-%m-%d'), 
                        "lte" : up_date.strftime('%Y-%m-%d'), 
                        "format": "yyyy-MM-dd"
                    }
                }
            }]
        }
    }
}
All conditions in a bool-must query must be true. The must clause in this example is a list, as shown by the [] brackets, so we can just append the match_phrase query from above.
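For example, with the match_phrase query from above appended to the must list (the dates are just placeholder values), the query looks like this:
{
    "query": {
        "bool": {
            "must": [{
                "range": {
                    "created_date":{
                        "gte" : "1992-01-01",
                        "lte" : "2017-01-01",
                        "format": "yyyy-MM-dd"
                    }
                }
            },
            {
                "match_phrase":{
                    "title": "drug resistance"
                }
            }]
        }
    }
}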

However, in our case, we just want to find papers which have the search term in the title OR abstract. So we do not want to put this directly in a bool-must query since this would require the term to be in the title AND abstract. Instead, we create a bool-should query within the bool-must query. The bool-should query requires only one of the conditions within it to be true. Here is the final code.
def get_doc(low_date, up_date, list_of_terms=()):
    # build one match_phrase query per term, for both title and abstract
    term_doc = []
    for sub_string in list_of_terms:
        term_doc.append({
            "match_phrase":{
                "title": sub_string
            }
        })
        term_doc.append({
            "match_phrase":{
                "abstract": sub_string
            }
        })
    doc = {
        "query": {
            "bool": {
                "must": [{
                    "range": {
                        "created_date":{
                            "gte" : low_date.strftime('%Y-%m-%d'), 
                            "lte" : up_date.strftime('%Y-%m-%d'), 
                            "format": "yyyy-MM-dd"
                        }
                    }
                },
                {
                    "bool": {
                        "should": term_doc
                    }
                }]
            }
        }
    }
    return doc
This function expects a list of terms so that we can search for synonyms like ['blood cancer', 'leukemia']. For each term, we create a match_phrase query for the title and the abstract field and add this list to the bool-should query.
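The snippets in this post assume the setup from the previous posts. Something along these lines should work, although the host, port, index name, and output folder are assumptions you might have to adapt:
import datetime

import matplotlib.colors as mcolors
import matplotlib.pyplot as plt
from elasticsearch import Elasticsearch

# assumed setup: adjust host, port, index name, and output folder
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
index_name = 'pubmed'
OUTPUT_FOLDER = './'

# example call: count all papers from 2016 which mention
# 'blood cancer' or 'leukemia' in the title or abstract
doc = get_doc(datetime.datetime(2016, 1, 1),
              datetime.datetime(2017, 1, 1),
              ['blood cancer', 'leukemia'])
res = es.search(index=index_name, size=0, body=doc)
print(res['hits']['total'])
Note that on elasticsearch 7 and later, res['hits']['total'] is a dictionary, so you would need res['hits']['total']['value'] instead.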

Now, all we have to do is run this query for different time steps and different terms and plot the results. Here is the code which loops over the last 25 years of PubMed publications and counts the number of papers. The count is always normalized to the total number of papers in the same time frame.
def get_paper_count(list_of_terms, timestep):
    # We start 25 years in the past
    start_date = datetime.datetime.utcnow() - datetime.timedelta(days=365*25)

    list_of_counts = []
    list_of_dates = []
    low_date = start_date
    # loop through the data in steps of timestep days
    while low_date < datetime.datetime.utcnow() - datetime.timedelta(days=10):
        up_date = low_date + datetime.timedelta(days=timestep)

        doc = get_doc(low_date, up_date)
        # we are only interested in the count -> size=0
        res = es.search(index=index_name, size=0, body=doc)
        norm = res['hits']['total']

        doc = get_doc(low_date, up_date, list_of_terms)
        # we are only interested in the count -> size=0
        res = es.search(index=index_name, size=0, body=doc)

        # norm should always be > 0, but just in case
        list_of_dates.append(low_date + datetime.timedelta(days=timestep/2))
        if norm > 0:
            list_of_counts.append(100.*float(res['hits']['total'])/float(norm))
        else:
            list_of_counts.append(0.)

        low_date = low_date + datetime.timedelta(days=timestep)
    return list_of_counts, list_of_dates 
Using this function, we can create the plot I showed at the beginning of this post like this
def create_trending_plot():
    timestep = 365 # average over 365 days
    
    # get a generic list of color names
    colors = dict(mcolors.BASE_COLORS, **mcolors.CSS4_COLORS)
    colors = list(colors)
    # get all possible line styles
    linestyles = ['-', '--', '-.', ':']

    list_of_queries = [['prostate cancer'], 
                       ['blood cancer', 'leukemia'], 
                       ['Ebola'],
                       ['alzheimer', 'dementia']]
    timestamp = datetime.datetime.utcnow()

    plt.clf()
    # The maximum number of terms is given by the available colors
    for i, list_of_terms in enumerate(list_of_queries[:len(colors)]):
        print "i = ", i, "term = ", list_of_terms
        list_of_counts, list_of_dates = get_paper_count(list_of_terms, timestep)
        plt.plot(list_of_dates, 
                 list_of_counts, 
                 color=colors[i], 
                 label=', '.join(list_of_terms),
                 linestyle=linestyles[i%len(linestyles)])
    plt.xlabel('Date [in steps of %d days]' % timestep)
    plt.title('Relative number of papers for topic vs. time')
    plt.ylabel('Relative number of papers [%]')
    plt.legend(loc='upper left', prop={'size': 7})
    plt.savefig(OUTPUT_FOLDER + "trending_pubmed_%s.png" % 
                timestamp.strftime('%m-%d-%Y-%H-%M-%f'))
    return 
I also created a small web application using this code at www.benty-fields.com/trending. The entire exercise can also be downloaded from GitHub. Let me know if you have any comments or questions below.
cheers
Florian
