In this work we analyse topics of articles in PubMed and search for similar articles. This project is for my Machine Learning course at Comenius University and also for our company Black Swan Rational.
Our aim was to find related articles in the PubMed article database.
PubMed is an
open database of about 22 million scientific articles in medicine. We wanted to design something
like a recommender system for scientific articles, along the lines of "if you found this article
interesting, you should also look at ...". We want to apply the results to our project "scicurve.com"
to link similar articles and possibly to visualize the tree and clusters, i.e. topics.
Please note that almost all of the code was written by the author and is licensed under
Keywords: PubMed, topic model, LSH, LSI, gensim, Python, C++, recommender system,
kNN, cosine similarity.
First of all, because the dataset is quite large, we selected 1.6M articles that have
citations (that's about 5%). Then we parsed the articles from XML into the following datasets:
- corpus.csv – tokenized abstracts
- citations.csv – one-to-many mapping from one article which cited many
Then we transformed the corpus to:
- dictionary.pkl – one-to-one mapping between a token (from the corpus) and its unique
- word_frequencies.csv – for each word, its frequency of usage in all documents (from
- docs_used.csv – filtered and transformed tokens; for example, numbers were omitted.
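The preprocessing steps above can be illustrated with a minimal sketch. It is not the author's code: the tokenizer, the use of document frequency as the "frequency of usage", and the `min_df` threshold are all assumptions made for illustration, and the sample abstracts are invented.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_dictionary(docs):
    """Map each distinct token to a unique integer id (first-seen order)."""
    token_to_id = {}
    for doc in docs:
        for tok in doc:
            token_to_id.setdefault(tok, len(token_to_id))
    return token_to_id


def document_frequencies(docs):
    """Count in how many documents each token occurs (assumed definition)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


def filter_docs(docs, df, min_df=2):
    """Drop purely numeric tokens and tokens seen in fewer than min_df docs.

    min_df is a hypothetical threshold, not a value taken from the text.
    """
    return [[t for t in doc if not t.isdigit() and df[t] >= min_df]
            for doc in docs]


# Invented sample abstracts, for illustration only.
abstracts = [
    "Aspirin reduces risk of stroke in 120 patients",
    "Stroke risk and aspirin dosage",
    "Gene expression in 120 mice",
]
docs = [tokenize(a) for a in abstracts]
dictionary = build_dictionary(docs)
df = document_frequencies(docs)
docs_used = filter_docs(docs, df)
```

In this toy run the token `120` is removed because it is numeric, and words that occur in only one abstract are removed by the assumed frequency threshold.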
After the data was preprocessed, we trained a topic model on docs_used with gensim using
- lsi.pkl – the trained model (training took about 16 h).
- topics.csv – most significant tokens for each topic.
- vectors.csv – docs_used transformed into LSI space. These vectors should be compared
using cosine similarity.
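For reference, the cosine similarity of two vectors u and v is their dot product divided by the product of their Euclidean norms. The sketch below is a plain-Python illustration of that standard formula. Turning it into a distance as 1 minus the similarity is an assumption here; the two metrics developed in this work are not reproduced.

```python
import math


def cosine_similarity(u, v):
    """Return dot(u, v) / (|u| * |v|), or 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Assumed conversion from similarity to distance: 1 - similarity."""
    return 1.0 - cosine_similarity(u, v)
```

Because the measure depends only on the angle between the vectors, two documents whose LSI vectors differ only in length get similarity 1.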
Finally, we developed two metrics based on the LSI vectors and the citation graph to calculate
the distance between two articles. To speed up the computation we used our own LSH implementation in C++. Note that we first tried a brute-force O(N^2) kNN implementation to check
whether our metrics work.
- bf_dist.csv – tuples (dist, id1, id2) of the closest articles. For each LSH bucket we
ran an O(n^2) kNN, inserted all pairs into a global list, and sorted it. Note that only 2% of all
pairs were stored, as there were over 1,000,000,000 pairs.
- buckets.csv – for each LSH bucket, the list of articles that belong to it.
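The bucketing idea can be sketched in a few lines of Python. This is not the C++ implementation used in this work: it assumes random-hyperplane LSH (a common choice for cosine similarity), a plain 1 minus cosine distance, and hypothetical function names. Within each bucket it does the brute-force all-pairs comparison and collects (dist, id1, id2) tuples into one sorted list.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random Gaussian hyperplane normals (seeded for repeatability)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_planes)]


def signature(vec, planes):
    """One bit per hyperplane: the side of the plane on which vec lies."""
    return tuple(1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
                 for plane in planes)


def build_buckets(vectors, planes):
    """Group article ids by their LSH signature."""
    buckets = defaultdict(list)
    for doc_id, vec in vectors.items():
        buckets[signature(vec, planes)].append(doc_id)
    return buckets


def cosine_distance(u, v):
    """Assumed distance: 1 - cosine similarity (1.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def closest_pairs(vectors, buckets):
    """Brute-force all pairs inside each bucket, then sort by distance."""
    pairs = []
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.append((cosine_distance(vectors[a], vectors[b]), a, b))
    pairs.sort()
    return pairs


# Invented toy vectors: 1 and 2 point the same way, 3 points the opposite way.
vectors = {1: [1.0, 0.0, 0.0], 2: [2.0, 0.0, 0.0], 3: [-1.0, 0.0, 0.0]}
planes = make_hyperplanes(num_planes=8, dim=3)
buckets = build_buckets(vectors, planes)
pairs = closest_pairs(vectors, buckets)
```

Only pairs that share a bucket are compared, so the quadratic cost applies to the bucket sizes rather than to the whole collection. The price is that near neighbours separated by a hyperplane are missed.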
Note that we also had two smaller datasets of size 10,000 (small) and 100,000 (citation_small) which were used for testing purposes and preliminary results.
Also note that we downloaded the PubMed dataset through its open API; PubMed permits
using the API to download the whole dataset.
We compare the computed cosine-similarity distance with the exact citation walk distance
on several hundred examples (note that it is computationally expensive to run BFS on such
a graph). We also show examples of a few pairs of abstracts (i.e. texts) at different distances.
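As an illustration of the BFS mentioned above, the following sketch computes a walk distance between two articles. It assumes that the distance is the number of edges on a shortest path and that citation links can be followed in both directions; neither detail is confirmed by the text. The `max_depth` cut-off is a hypothetical parameter for bounding the cost on a large graph.

```python
from collections import defaultdict, deque


def build_graph(citations):
    """Build an undirected adjacency map from {citing_id: [cited_ids]}."""
    graph = defaultdict(set)
    for citing, cited_list in citations.items():
        for cited in cited_list:
            graph[citing].add(cited)
            graph[cited].add(citing)
    return graph


def citation_distance(graph, source, target, max_depth=None):
    """Shortest number of citation links between two articles, or None."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand nodes beyond the cut-off
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return depth + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return None


# Invented toy citation data, for illustration only.
citations = {1: [2, 3], 2: [4], 5: [6]}
graph = build_graph(citations)
```

Each query may visit a large part of the graph, which is consistent with running the exact comparison on only a few hundred examples.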
To illustrate the dataset we include plots showing distributions of the data we used and