We will go through all publications (title and abstract) and count how many publications contain a certain term. We will do that for different terms and plot them as a function of time. This will tell us which topics are currently trending in medical science. Here is a possible output of the program described in this post:
Elasticsearch uses a JSON-based syntax, which can indeed be a bit cumbersome, but the documentation is very good.
First, we write the Elasticsearch query which counts the papers. There are a couple of different queries in Elasticsearch which can do that, e.g. the match, term and match_phrase queries.
The match query analyzes the search term and then compares the resulting tokens to the terms stored in the index.
{ "query": { "match" : { "title" : "Cancer" } } }
Note that the Elasticsearch approach can be limiting. For example, there is no easy way to distinguish lower case from upper case anymore, because the standard analyzer lowercases all text before it is indexed. You can use the term query to prevent the search term from being analyzed
{ "query": { "term" : { "title" : "Cancer" } } }
but this search will return zero results since 'Cancer' with a capital 'C' does not exist anywhere in the index. For our case, this does not matter, but if you want to keep the ability to distinguish upper and lower case, you need to use a different analyzer when indexing your data.
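To make the difference between match and term more tangible, here is a toy emulation in plain Python. It does not talk to Elasticsearch at all: `toy_standard_analyzer`, `toy_match`, `toy_term` and the three titles are made up for illustration, and the real standard analyzer does more than splitting on non-word characters and lowercasing.

```python
import re


def toy_standard_analyzer(text):
    """Rough stand-in for the standard analyzer: tokenize and lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


# Hypothetical mini index: document id -> set of analyzed title tokens
titles = {
    1: "Cancer immunotherapy in mice",
    2: "Breast cancer screening",
    3: "Ebola outbreak",
}
index = {doc_id: set(toy_standard_analyzer(title)) for doc_id, title in titles.items()}


def toy_match(text):
    """Like match: the search text is analyzed before the lookup."""
    tokens = toy_standard_analyzer(text)
    return sorted(d for d, stored in index.items() if any(t in stored for t in tokens))


def toy_term(text):
    """Like term: the search text is looked up exactly as given."""
    return sorted(d for d, stored in index.items() if text in stored)


print(toy_match("Cancer"))  # [1, 2]
print(toy_term("Cancer"))   # [] because only 'cancer' is stored
print(toy_term("cancer"))   # [1, 2]
```

The term lookup only succeeds when the search text already looks like the stored tokens, which is why the capitalized 'Cancer' finds nothing.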
In some cases, we might be interested in finding multi-word terms or phrases, like 'drug resistance'. The match query is going to run the standard analyzer on this phrase and, besides lowercasing all words, the standard analyzer also tokenizes the phrase. So it turns 'drug resistance' into ['drug', 'resistance'] and looks for papers which contain at least one of the words. We can additionally provide the and operator to enforce that documents need to contain both terms,
{ "query": { "match" : { "title" : { "query": "drug resistance", "operator": "and" } } } }
but this is still not exactly what we want since this will return documents which have the words 'resistance' and 'drug' anywhere. We want to enforce that these words need to be together. This can be achieved with the match_phrase query.
{ "query": { "match_phrase":{ "title": "drug resistance" } } }
If you want to allow some leniency on how close the two words have to appear together, you can provide the slop parameter. A slop of 2 would still count documents which have no more than 2 words in between 'drug' and 'resistance'. With a very large slop one would effectively recover the match query with the and operator.
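As a small sketch, the two query bodies can be built from Python dictionaries. The helper names `all_words_query` and `phrase_query` are hypothetical. Note that the parameter is spelled `slop` in the query DSL, and that match_phrase takes it in the expanded form with a `query` key.

```python
import json


def all_words_query(field, text):
    """match query that requires every analyzed word, in any position."""
    return {"query": {"match": {field: {"query": text, "operator": "and"}}}}


def phrase_query(field, phrase, slop=0):
    """match_phrase query; slop=0 means the words must be adjacent and in order."""
    return {"query": {"match_phrase": {field: {"query": phrase, "slop": slop}}}}


# Strict phrase, and a more lenient version allowing up to 2 words in between
print(json.dumps(phrase_query("title", "drug resistance")))
print(json.dumps(phrase_query("title", "drug resistance", slop=2)))
print(json.dumps(all_words_query("title", "drug resistance")))
```

Either dictionary can be passed as the request body of a search, in the same way as the hand-written JSON.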
Besides making sure that the papers have the specific term, we also want to limit the publication date to a certain range. We therefore create a bool-must query like this (written as a Python dictionary, with low_date and up_date being datetime objects):
{ "query": { "bool": { "must": [{ "range": { "created_date":{ "gte" : low_date.strftime('%Y-%m-%d'), "lte" : up_date.strftime('%Y-%m-%d'), "format": "yyyy-MM-dd" } } }] } } }
However, in our case, we just want to find papers which have the search term in the title OR abstract. So we do not want to put this directly in a bool-must query since this would require the term to be in the title AND abstract. Instead, we create a bool-should query within the bool-must query. The bool-should query requires only one of the conditions within it to be true. Here is the final code.
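    def get_doc(low_date, up_date, list_of_terms=None):
        """Build the query: date range AND (any term in title OR abstract)."""
        # One match_phrase clause per term and field; a single hit is enough
        term_doc = []
        for sub_string in list_of_terms or []:
            term_doc.append({"match_phrase": {"title": sub_string}})
            term_doc.append({"match_phrase": {"abstract": sub_string}})
        doc = {
            "query": {
                "bool": {
                    "must": [
                        # Restrict the publication date to [low_date, up_date]
                        {"range": {"created_date": {
                            "gte": low_date.strftime('%Y-%m-%d'),
                            "lte": up_date.strftime('%Y-%m-%d'),
                            "format": "yyyy-MM-dd"}}},
                        # Without terms this list is empty and only the date range applies
                        {"bool": {"should": term_doc}},
                    ]
                }
            }
        }
        return doc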
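As an aside, the "phrase in title OR abstract" condition can also be written more compactly with a multi_match query of type phrase, which runs a phrase match against each listed field. This is an alternative to the should clauses used in this post, not what the code above does; the helper names `phrase_in_any_field` and `should_clauses` are hypothetical.

```python
def phrase_in_any_field(phrase, fields=("title", "abstract")):
    """One clause that matches if the phrase occurs in any of the fields."""
    return {"multi_match": {"query": phrase, "type": "phrase", "fields": list(fields)}}


def should_clauses(list_of_terms):
    """One clause per search term instead of one per term and field."""
    return [phrase_in_any_field(term) for term in list_of_terms]


clauses = should_clauses(["blood cancer", "leukemia"])
print(len(clauses))  # 2
print(clauses[0])
```

For pure counting, both variants should select the same documents; they would only differ in how the relevance score is computed, which we do not use here.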
Now, all we have to do is to call this query for different time steps and different terms and plot it. Here is the code which loops over the last 25 years of PubMed publications and counts the number of papers. The number of papers is always normalized to the total number of papers in the same time frame.
    import datetime

    # `es` (the Elasticsearch client) and `index_name` are assumed to be defined elsewhere


    def get_paper_count(list_of_terms, timestep):
        # We start 25 years in the past
        start_date = datetime.datetime.utcnow() - datetime.timedelta(days=365*25)
        list_of_counts = []
        list_of_dates = []
        low_date = start_date
        # loop through the data in steps of `timestep` days
        while low_date < datetime.datetime.utcnow() - datetime.timedelta(days=10):
            up_date = low_date + datetime.timedelta(days=timestep)
            # total number of papers in this time window (no terms given)
            doc = get_doc(low_date, up_date)
            # we are only interested in the count -> size=0
            res = es.search(index=index_name, size=0, body=doc)
            norm = res['hits']['total']
            # number of papers in this time window which contain one of the terms
            doc = get_doc(low_date, up_date, list_of_terms)
            res = es.search(index=index_name, size=0, body=doc)
            # norm should always be >0 but just in case
            if norm > 0:
                list_of_counts.append(100.*float(res['hits']['total'])/float(norm))
            else:
                list_of_counts.append(0.)
            # plot each value at the center of its time window
            list_of_dates.append(low_date + datetime.timedelta(days=timestep/2))
            low_date = low_date + datetime.timedelta(days=timestep)
        return list_of_counts, list_of_dates
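One caveat: the code above treats res['hits']['total'] as a plain number. As far as I know, newer Elasticsearch versions (7 and later) return an object with a `value` and a `relation` field there instead, and by default only count exactly up to 10,000 hits unless `track_total_hits` is set to true in the request. Here is a minimal sketch of a helper which accepts both response shapes; `hits_total` and `relative_share` are hypothetical names, and the responses below are hand-written stand-ins, not real query results.

```python
def hits_total(res):
    """Return the hit count from a search response, old or new format."""
    total = res["hits"]["total"]
    if isinstance(total, dict):
        # Newer format: {"value": N, "relation": "eq" or "gte"}
        return total["value"]
    # Older format: a plain integer
    return total


def relative_share(count, norm):
    """Percentage of papers matching the terms; 0 if the window is empty."""
    return 100.0 * count / norm if norm > 0 else 0.0


# Hand-written example responses in both shapes
old_style = {"hits": {"total": 250}}
new_style = {"hits": {"total": {"value": 250, "relation": "eq"}}}

print(hits_total(old_style), hits_total(new_style))  # 250 250
print(relative_share(hits_total(new_style), 10000))  # 2.5
```

If the relation comes back as "gte", the value is only a lower bound, so for a normalization like ours exact counting should be switched on.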
    import datetime

    import matplotlib.pyplot as plt
    from matplotlib import colors as mcolors

    # `OUTPUT_FOLDER` is assumed to be defined elsewhere


    def create_trending_plot():
        timestep = 365  # average over 365 days
        # Get a generic list of color names
        colors = list(dict(mcolors.BASE_COLORS, **mcolors.CSS4_COLORS))
        # the line styles we cycle through
        linestyles = ['-', '--', '-.', ':']
        # each inner list is one curve; its terms are combined with OR
        list_of_queries = [['prostate cancer'],
                           ['blood cancer', 'leukemia'],
                           ['Ebola'],
                           ['alzheimer', 'dementia']]
        timestamp = datetime.datetime.utcnow()
        plt.clf()
        # The maximum number of terms is given by the available colors
        for i, list_of_terms in enumerate(list_of_queries[:len(colors)]):
            print("i = ", i, "term = ", list_of_terms)
            list_of_counts, list_of_dates = get_paper_count(list_of_terms, timestep)
            plt.plot(list_of_dates,
                     list_of_counts,
                     color=colors[i],
                     label=', '.join(list_of_terms),
                     linestyle=linestyles[i % len(linestyles)])
        plt.xlabel('Date [in steps of %d days]' % timestep)
        plt.title('Relative number of papers for topic vs. time')
        plt.ylabel('Relative number of papers [%]')
        plt.legend(loc='upper left', prop={'size': 7})
        plt.savefig(OUTPUT_FOLDER + "trending_pubmed_%s.png" %
                    timestamp.strftime('%m-%d-%Y-%H-%M-%f'))
cheers
Florian