In this blog post, I will describe how I built a recommendation engine for arXiv papers for the web application
www.benty-fields.com.
benty-fields is used by researchers in physics, mathematics and computer science to follow the latest publications and organize journal clubs. The website collects data on how users interact with publications, for example whether they vote for a paper or put it in their personal library. Here I will use those data to build a recommendation engine.
There are two different use cases for this recommendation engine:
(1) When a user visits the website to read the latest publications, I want the list of publications ordered so that the most interesting publications for this user are at the top.
(2) I want to present the user with a list of publications under the title 'papers you might have missed'. This list should include papers from the last 3 months which the user has not shown any interest in yet (has not voted for or put in her/his library), but which might be of interest.
We will use exactly the same model for both cases. So let's get started.
First, we have to think about what features we want to use to train a model. The data we have available are the papers the user liked in the past. These papers have abstracts, titles, authors, publication dates and a lot more information we could use. Here we will restrict the analysis to the abstract, title and author list. I tested other features but the most useful feature seems to be the abstract, with the title and author list adding marginal information.
First we turn the list of abstracts, titles and authors into a feature vector:
from sklearn.feature_extraction.text import TfidfVectorizer

# We calculate the tf-idf for each feature, using a separate vectorizer
# (and therefore a separate vocabulary) for abstracts, titles and authors
vectorizer1 = TfidfVectorizer(analyzer="word", tokenizer=None,
                              stop_words="english", max_features=5000)
vectorizer2 = TfidfVectorizer(analyzer="word", tokenizer=None,
                              stop_words="english", max_features=5000)
vectorizer3 = TfidfVectorizer(analyzer="word", tokenizer=None,
                              stop_words="english", max_features=5000)

# fit_transform() serves two functions: First, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list
# of strings.
train_data_features1 = vectorizer1.fit_transform(list_of_abstracts)
train_data_features2 = vectorizer2.fit_transform(list_of_titles)
train_data_features3 = vectorizer3.fit_transform(list_of_authors)
I use TfidfVectorizer() from the scikit-learn package to turn the text into a matrix of papers vs. term frequency-inverse document frequency (tf-idf) scores. tf-idf weights how often a term appears in a document by how rare that term is across all documents. If a term appears in every document, its tf-idf score is very small, since the term does not have much distinctive power; if a term appears in only a few documents, it is very useful for separating papers into categories and therefore gets a large tf-idf score.
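To make this concrete, here is a small toy example (not part of the benty-fields code) showing how TfidfVectorizer down-weights a term that appears in every document. Depending on your scikit-learn version you may need get_feature_names() instead of get_feature_names_out().

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: 'galaxy' appears in every document, 'neutrino' in only one
docs = ["galaxy cluster survey",
        "galaxy rotation curves",
        "neutrino oscillation galaxy"]

toy_vectorizer = TfidfVectorizer(analyzer="word", stop_words="english")
tfidf_matrix = toy_vectorizer.fit_transform(docs)

# 'galaxy' receives a low weight in every row, while 'neutrino' gets
# a high weight in the one document where it occurs
print(toy_vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))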
Before I feed the abstract, title and author list into the vectorizer I pre-process the data:
import re
import nltk
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    # remove numbers
    text_wo_numbers = re.sub(r'\d+', '', text)
    # remove single letters
    text_wo_single_letters = re.sub(r'\b\w\b', ' ', text_wo_numbers)
    # remove punctuation
    text_wo_punctuation = ' '.join(re.findall(r'\w+',
                                              text_wo_single_letters))
    tokens = nltk.word_tokenize(text_wo_punctuation)
    stems = stem_tokens(tokens, stemmer)
    return stems

def pre_process_text(list_of_texts):
    output_list_of_texts = []
    for text in list_of_texts:
        output_list_of_texts.append(' '.join(tokenize(text)))
    return output_list_of_texts
This code uses regular expressions to remove numbers, punctuation and one-letter words from the abstracts and titles, and the nltk library to tokenize the text and transform the words into word stems (e.g.
equations -> equat), which ensures that equation and equations map to the same stem. I leave the removal of stop words to
TfidfVectorizer(). The author list is not processed in this way, since it does not make much sense to turn names into stems.
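As a quick check, running a made-up sentence through pre_process_text() shows the effect. The exact stems depend on the NLTK version, so the output below is only indicative, and word_tokenize() may require a one-time nltk.download('punkt').

sample = ["We solve 3 coupled equations for a single galaxy."]
print(pre_process_text(sample))
# prints something like: ['we solv coupl equat for singl galaxi']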
Next, we can use these features to train a model. I am using a Random Forest model, which I train for each user on the papers that user has liked, plus an equally sized set of negative examples (papers the user does not seem to be interested in). The negative papers are picked from the same time period as the positive papers, and I made sure that none of them appears in the positive paper set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Now we get the negative paper examples
negative_papers = get_negative_arxiv_papers(user, len(input_papers))
paper_sample = input_papers + negative_papers

# Labels: 1 for papers the user liked, 0 for the negative examples
y = np.array([1]*len(input_papers) + [0]*len(negative_papers))

# Build the feature matrix from abstracts, titles and authors
X1, X2, X3 = prepare_data(paper_sample, user, write=True)
X_train = np.c_[X1, X2, X3]

forest = RandomForestClassifier(n_estimators=100)
# forest now stores the fit information used later for the prediction
forest.fit(X_train, y)
The function
get_negative_arxiv_papers() is very specific to how I store the data, so I will not discuss its details here. If you are interested, the complete code is posted on
GitHub. The function
prepare_data() contains the data cleaning and vectorization steps discussed above.
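To keep this post self-contained, here is a minimal sketch of what prepare_data() does conceptually. The attribute names (paper.abstract, paper.title, paper.authors) and the pickle file name are assumptions made for this sketch, and the real implementation on GitHub differs in the details. The important point is that write=True fits and stores the vectorizers, while read=True reloads them so that prediction uses the same vocabulary as training.

import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

def prepare_data_sketch(papers, user, write=False, read=False):
    ''' Simplified sketch of prepare_data(): turn a list of papers into
        three tf-idf feature blocks (abstracts, titles, authors) '''
    abstracts = pre_process_text([p.abstract for p in papers])
    titles = pre_process_text([p.title for p in papers])
    authors = [p.authors for p in papers]  # author names are not stemmed

    vectorizer_file = ("%s/vectorizers_%d.pickle" %
                       (app.config['ML_FOLDER'], user.id))
    if read:
        # re-use the vocabularies learned at training time
        vec1, vec2, vec3 = pickle.load(open(vectorizer_file, "rb"))
        X1 = vec1.transform(abstracts).toarray()
        X2 = vec2.transform(titles).toarray()
        X3 = vec3.transform(authors).toarray()
    else:
        vec1 = TfidfVectorizer(analyzer="word", stop_words="english",
                               max_features=5000)
        vec2 = TfidfVectorizer(analyzer="word", stop_words="english",
                               max_features=5000)
        vec3 = TfidfVectorizer(analyzer="word", max_features=5000)
        X1 = vec1.fit_transform(abstracts).toarray()
        X2 = vec2.fit_transform(titles).toarray()
        X3 = vec3.fit_transform(authors).toarray()
        if write:
            pickle.dump((vec1, vec2, vec3), open(vectorizer_file, "wb"))
    return X1, X2, X3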
However, it is important to pick the correct set of negative papers. In my case I could, for example, pick any paper from the arXiv library. The problem is that users follow only certain arXiv categories, so the fact that a user who follows the category 'Statistics' has never shown any interest in papers from the category 'Optics' is not very surprising: the user has probably never seen those papers.
The model developed here is supposed to order the papers in the category the user is looking at, so it has to distinguish between papers within that user-specific category. Therefore I pick the negative papers from this category as well. It might be interesting to train a more general model and suggest papers outside the user's specified field of interest, but that is not the purpose of this model. Note that these choices impact the performance of the model significantly, so I ran a lot of tests to make sure I pick a representative negative dataset. It is important to pick the training data according to the question you are trying to answer.
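To illustrate this selection of negative papers (this is a sketch of the idea, not the actual get_negative_arxiv_papers() implementation; get_papers_in_categories() is a hypothetical helper standing in for the database query):

import random

def get_negative_papers_sketch(user, n_papers, positive_papers):
    ''' Sketch: draw negative examples from the arXiv categories the user
        follows and from the same time period as the positive papers,
        excluding every paper the user has interacted with '''
    first_date = min(p.date for p in positive_papers)
    last_date = max(p.date for p in positive_papers)
    # hypothetical helper querying all papers in the given categories
    # and date range
    candidates = get_papers_in_categories(user.arxiv_categories,
                                          first_date, last_date)
    positive_ids = {p.id for p in positive_papers}
    candidates = [p for p in candidates if p.id not in positive_ids]
    return random.sample(candidates, n_papers)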
Ok, now that we have a model, we can test it. One question we need to answer is when we actually want to use the model. For many users the available training data is very small, simply because many users are not very active. If the model is trained on such a small dataset, it might be better to put the model aside and instead recommend the most popular papers.
So when should we use the model to derive recommendations, and when is it better to fall back on the most popular papers? Here is a plot which shows the precision of the random forest model as a function of the size of the training dataset.
Figure 1: Precision reached for the Random Forest model as a function of the size of the training dataset for each user. The precision is calculated as the average of 10 cross-validation steps.
Figure 1 clearly shows that the precision has larger and larger error bars for smaller datasets, and as soon as the error bar includes a precision of 0.5, the model basically behaves like a random selection. From this plot, I determined that having fewer than 50 papers to train on does not result in a predictive model. I use this threshold to distinguish between active users, for whom I use the model to generate paper recommendations, and less active users, for whom I recommend the most popular papers.
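In code, this decision reduces to a simple threshold check. Here is a sketch: the threshold of 50 comes from Figure 1, and paper_click_data() is the helper (also used in the cross-validation code below) that returns the papers a user has interacted with.

MIN_TRAINING_PAPERS = 50  # below this the model is not predictive (Figure 1)

def choose_recommendation_strategy(user):
    ''' Sketch: decide between the per-user model and the
        popularity-based fallback '''
    n_liked = len(paper_click_data(user))
    if n_liked >= MIN_TRAINING_PAPERS:
        return 'personalized_model'
    return 'most_popular_papers'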
So here is the function which uses the random forest model to rank papers by predicted user interest:
import pickle
import numpy as np

def get_tf_idf_ranking(papers, user):
    ''' This function is used to rank papers by user interest '''
    X1, X2, X3 = prepare_data(papers, user, read=True)
    X_pred = np.c_[X1, X2, X3]
    model_file = ("%s/recommendation_model_%d.pickle" %
                  (app.config['ML_FOLDER'], user.id))
    forest = pickle.load(open(model_file, "rb"))
    # Use the model for prediction
    y_proba = forest.predict_proba(X_pred)
    return y_proba[:, 1]
I just load the random forest model and use its
predict_proba() function. The preparation of the data in
prepare_data() is very similar to what I did above, but now I read in the feature list instead of creating a new one.
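With the predicted probabilities in hand, both use cases from the beginning of the post reduce to sorting by score. For example (todays_papers and recent_unseen_papers are placeholders for the corresponding paper queries):

import numpy as np

# Use case (1): order today's listing, most interesting papers first
scores = get_tf_idf_ranking(todays_papers, user)
ordered_papers = [todays_papers[i] for i in np.argsort(scores)[::-1]]

# Use case (2): 'papers you might have missed' from the last 3 months,
# restricted to papers the user has not voted for or saved yet
scores = get_tf_idf_ranking(recent_unseen_papers, user)
suggestions = [recent_unseen_papers[i]
               for i in np.argsort(scores)[::-1]][:10]  # e.g. show the top ten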
My code to test the recommendation model is as follows:
import math
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import (precision_recall_curve, roc_curve, auc,
                             average_precision_score)

def get_auc_through_cross_validation(user):
    '''
    This function calculates the area under the curve (AUC)
    for the paper prediction model
    '''
    input_papers = paper_click_data(user)
    negative_papers = get_negative_arxiv_papers(user, len(input_papers))
    paper_sample = input_papers + negative_papers

    y = np.array([1]*len(input_papers) + [0]*len(negative_papers))
    X1, X2, X3 = prepare_data(paper_sample, user, write=False)
    X = np.c_[X1, X2, X3]

    metrics = {
        'num_cases': len(X),
        'curves': [],
        'aucs': [],
        'pr': []
    }
    cross_validation_steps = 10
    kf = KFold(n_splits=cross_validation_steps, shuffle=True)
    forest = RandomForestClassifier(n_estimators=100)

    for train, test in kf.split(X):
        X_train, X_test = X[train], X[test]
        y_train, y_test = y[train], y[test]
        forest.fit(X_train, y_train)
        probabilities = forest.predict_proba(X_test)[:, 1]

        precision, recall, thresholds = precision_recall_curve(y_test,
                                                               probabilities,
                                                               pos_label=1)
        # precision_recall_curve returns one threshold fewer than
        # precision/recall values, so we append 1 to align the arrays
        thresholds = np.append(thresholds, np.array([1]))
        false_positive_rate, true_positive_rate, thresholds2 = roc_curve(
            y_test, probabilities, pos_label=1)
        roc_auc = auc(false_positive_rate, true_positive_rate)
        print(roc_auc)

        if not math.isnan(roc_auc):
            av_pr = average_precision_score(y_test, probabilities)
            case_rate = []
            for threshold in thresholds:
                case_rate.append(np.mean(probabilities >= threshold))
            curves = {
                'thresholds': thresholds,
                'precision': precision,
                'recall': recall,
                'case_rate': case_rate,
                'fpr': false_positive_rate,
                'tpr': true_positive_rate
            }
            metrics['curves'].append(curves)
            metrics['aucs'].append(roc_auc)
            metrics['pr'].append(av_pr)

    plot_cross_validation_result(user, metrics)
    return metrics
This code calculates a large set of evaluation metrics, including Receiver Operating Characteristic (ROC) and precision-recall curves for each model.
Figure 2: Precision-recall curve and ROC curve for the Random Forest model of the most active user on benty-fields.com. The different lines correspond to the 10 cross-validation steps.
The area under the curve (AUC) tells us the predictive power of the model, with 1 being perfect and 0.5 representing a random selection. Our AUC of about 0.8 is pretty good, and since we have the cross-validation results, we can also calculate an error on this number.
The code to produce these plots is as follows:
import numpy as np
import matplotlib.pyplot as plt

def plot_cross_validation_result(user, results):
    # ROC curve plot
    # average values for the legend
    auc_av = "{0:.2f}".format(np.mean(results['aucs']))
    auc_sd = "{0:.2f}".format(np.std(results['aucs']))
    plt.clf()
    plt.figure(2)
    # plot each individual ROC curve
    for chart in results['curves']:
        plt.plot(chart['fpr'], chart['tpr'], color='b', alpha=0.5)
    plt.plot([0, 1], [0, 1], 'b--')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('Cross-validated ROC for user %d (sample size %d)' %
              (user.id, len(user.paper_click_objects)))
    plt.text(0.6, 0.1, r'AUC = {av} $\pm$ {sd}'.format(av=auc_av, sd=auc_sd))
    plt.savefig("%s/user_like_roc_curve2_%d.png" %
                (app.config['STATISTICS_FOLDER'], user.id))

    # Precision-recall plot
    pr_av = "{0:.2f}".format(np.mean(results['pr']))
    pr_sd = "{0:.2f}".format(np.std(results['pr']))
    plt.clf()
    plt.figure(4)
    for chart in results['curves']:
        plt.plot(chart['recall'], chart['precision'],
                 color='b', alpha=0.5)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Cross-validated prec/rec for user %d (sample size %d)' %
              (user.id, len(user.paper_click_objects)))
    plt.text(0.6, 0.9, r'AUC = {av} $\pm$ {sd}'.format(av=pr_av, sd=pr_sd))
    plt.savefig("%s/user_like_precision_recall_%d.png" %
                (app.config['STATISTICS_FOLDER'], user.id))
    return
Ok, that's it. A lot of code in this post, but let me know if you have any comments. I also posted the entire code, including some functions I skipped in this post, on
GitHub.
cheers
Florian