Using Topic model and LSH for finding similar articles in PubMed

In this work we analyse the topics of articles in PubMed and search for similar articles. The project was done for my Machine Learning course at Comenius University and also for our company, Black Swan Rational.

Links

The full project report.
The full project directory.
If interested, please ask for the PubMed dataset we used.

Task

Our aim was to find related articles in the PubMed article database.
PubMed is an open database of about 22 million scientific articles in medicine. We wanted to design something like a recommender system for scientific articles, in the sense of ”if you found this article interesting, you should also look at these”. We want to apply the results in our project ”scicurve.com” to link similar articles and perhaps visualize the resulting tree and clusters, i.e. topics.
Please note that almost all of the code was written by the author and is licensed under GPLv3.

Keywords

PubMed, Topic Model, LSH, LSI, gensim, python, c++, recommender system,
kNN, cosine similarity.

Our approach

First of all, as the dataset is quite big, we selected the 1.6M articles which have
citations (that's about 5% of the database). Then we parsed the articles from XML into the following datasets:

  • corpus.csv – tokenized abstracts
  • citations.csv – one-to-many mapping from each citing article to the articles it cites
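The XML-to-corpus step can be sketched as follows. This is a minimal illustration, not the project's actual parser: the element names follow PubMed's MedlineCitation schema, and the regex tokenizer is a stand-in for whatever normalization the real pipeline applied.

```python
import re
import xml.etree.ElementTree as ET

# A toy PubMed record; real records carry many more fields.
SAMPLE = """<PubmedArticle>
  <MedlineCitation>
    <PMID>12345</PMID>
    <Article>
      <Abstract>
        <AbstractText>Aspirin reduces the risk of stroke in most cases.</AbstractText>
      </Abstract>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""

def tokenize(text):
    # lowercase and keep alphabetic tokens only (illustrative choice)
    return re.findall(r"[a-z]+", text.lower())

def parse_article(xml_string):
    root = ET.fromstring(xml_string)
    pmid = root.findtext(".//PMID")
    abstract = root.findtext(".//AbstractText") or ""
    return pmid, tokenize(abstract)

pmid, tokens = parse_article(SAMPLE)
print(pmid, tokens)  # one row of corpus.csv: article id + tokenized abstract
```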

Then we transformed the corpus to:

  • dictionary.pkl – one-to-one mapping between each token (from corpus) and its unique
    integer identifier.
  • word_frequencies.csv – for each word, its frequency of use across all documents (from
    corpus).
  • docs_used.csv – filtered and transformed tokens; for example, numbers were omitted.
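The transformation above can be sketched with the standard library alone; in the project itself this is what gensim's corpora.Dictionary handles, and the min-frequency threshold below is an illustrative choice:

```python
from collections import Counter

# Two toy tokenized abstracts standing in for corpus.csv.
docs = [
    ["aspirin", "reduces", "stroke", "risk"],
    ["stroke", "patients", "risk", "factors"],
]

# word_frequencies: in how many documents each token appears
freq = Counter(tok for doc in docs for tok in set(doc))

# dictionary: token -> unique integer id, keeping only tokens seen
# in at least two documents (hypothetical filtering rule)
token2id = {tok: i for i, tok in
            enumerate(sorted(t for t, c in freq.items() if c >= 2))}

# docs_used: filtered tokens, represented here as lists of integer ids
docs_used = [[token2id[t] for t in doc if t in token2id] for doc in docs]
print(token2id, docs_used)
```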

After the data was preprocessed, we trained a topic model on docs_used with gensim using
LSI:

  • lsi.pkl – the trained model (it took about 16h).
  • topics.csv – most significant tokens for each topic.
  • vectors.csv – docs_used transformed into LSI space. These vectors should be compared
    with cosine similarity.
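The cosine similarity used to compare articles in LSI space is simply the dot product of the two vectors divided by the product of their norms. A minimal sketch, with made-up three-dimensional vectors standing in for the real LSI vectors:

```python
import math

def cosine_similarity(a, b):
    # dot product over the product of Euclidean norms
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

a = [0.8, 0.1, 0.3]  # hypothetical LSI vector of article 1
b = [0.7, 0.2, 0.2]  # hypothetical LSI vector of article 2
print(cosine_similarity(a, b))  # close to 1.0 for similar articles
```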

Finally, we developed two metrics, based on the LSI vectors and the citation graph, to calculate
the distance between two articles. To speed up the computation we used our own LSH implementation in C++. Note that we first tried a brute-force O(N²) kNN implementation to verify
that our metrics worked.

  • bf_dist.csv – tuples (dist, id1, id2) of the closest articles. For each bucket in LSH we
    ran O(n²) kNN, inserted all pairs into a global list and sorted it. Note that only 2% of all
    pairs were stored, as there were over 1,000,000,000 of them.
  • buckets.csv – for each LSH bucket, the list of articles which belong to it.
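For cosine similarity, LSH is commonly done with random hyperplanes: each hyperplane contributes one bit (the sign of the dot product), and the bit string is the bucket key. A minimal Python sketch of this idea; the project's actual implementation is in C++, and the hyperplane count of 8 is an illustrative parameter, not the project's:

```python
import random

random.seed(0)
DIM, N_PLANES = 3, 8
# random Gaussian hyperplane normals
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_PLANES)]

def lsh_bucket(vec):
    # one bit per hyperplane: the sign of the dot product with its normal
    bits = ["1" if sum(p * v for p, v in zip(plane, vec)) >= 0 else "0"
            for plane in planes]
    return "".join(bits)

# vectors with a small angle between them tend to land in the same bucket,
# so O(n²) kNN only needs to run within each bucket
print(lsh_bucket([0.8, 0.1, 0.3]))
```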

Note that we also had two smaller datasets of size 10,000 (small) and 100,000 (citation_small), which were used for testing and preliminary results.
Also note that we downloaded the PubMed dataset through their open API, which allows
downloading the whole dataset.
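PubMed records can be fetched through NCBI's E-utilities efetch endpoint. The sketch below only constructs the request URL, so it runs without network access; the batch of PMIDs is a made-up example:

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(pmids):
    # request a batch of PubMed records as XML
    params = {"db": "pubmed", "id": ",".join(map(str, pmids)), "retmode": "xml"}
    return EFETCH + "?" + urlencode(params)

url = efetch_url([12345, 67890])
print(url)
```

In practice one would fetch such URLs in batches (E-utilities accepts many ids per request) and feed the returned XML into the parsing step described above.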

Output

We compare the cosine-similarity distance with the exact citation-walk distance
on several hundred examples (note that it is computationally expensive to run BFS on such
a graph). We also show a few example abstract (i.e. text) pairs at different distances.
To illustrate the dataset, we include plots showing the distributions of the data we used and
transformed.
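The citation-walk distance is the BFS shortest-path length between two articles in the citation graph. A minimal sketch on a tiny hypothetical graph (edges are treated as undirected here, which is an assumption, not necessarily the project's choice):

```python
from collections import deque

# toy citation graph: article id -> linked article ids
graph = {
    1: [2, 3],
    2: [1, 4],
    3: [1],
    4: [2],
}

def citation_distance(graph, src, dst):
    # standard breadth-first search for shortest path length
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None  # articles not connected

print(citation_distance(graph, 3, 4))  # 3 -> 1 -> 2 -> 4, distance 3
```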

Author: Peter Csiba

Software Engineer at Google

4 thoughts on “Using Topic model and LSH for finding similar articles in PubMed”

  1. Hi Radim,
    Thank you very much! It looks great. We will definitely try FLANN and Annoy.
    I will let you know how they performed in comparison to my dummy LSH implementation.

  2. Hi Whille, thank you for asking.

    We selected 150 topics because:

    • Performance, especially for the next steps of our LSH kNN implementation.
    • Radim’s gensim tutorial uses 400 topics for the Wikipedia article dataset (http://radimrehurek.com/gensim/wiki.html). As we worked with the PubMed abstracts dataset, which is rather specialized, we decided to use only 150 topics. Moreover, when examining the generated topics, apart from a few with a clear meaning such as “genetics” or “molecules”, for many of them it was hard to find any meaning, so we decided that 150 was enough. We agree that a more precise approach would have been to try several numbers of topics and only then choose one.
