Keyphrase extraction is the task of identifying single or multi-word expressions that represent the main topics of a document. There are two approaches to extracting topics (and/or keyphrases) from a text: supervised and unsupervised.
Supervised approach
This is a multi-label, multi-class classification problem, where the following features can be used as input:
- text converted to bag-of-words
- text is treated as a stream of vectors, which are pre-trained word embeddings
For bag-of-words, a linear SVM is a good classifier. For word embeddings, RNNs are a better fit.
Once the model is trained, it can be used to classify unseen text.
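The bag-of-words variant could look roughly like the sketch below. It assumes scikit-learn is available; the texts and labels are toy data invented for the illustration, and wrapping a linear SVM in a one-vs-rest classifier is just one common way to get multi-label output.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy training set (invented): each text may carry several topic labels.
texts = [
    "the team won the football match",
    "parliament passed the new budget law",
    "the minister opened the football stadium",
    "the striker scored two goals",
    "voters elected a new parliament",
]
labels = [["sports"], ["politics"], ["politics", "sports"], ["sports"], ["politics"]]

# Turn the label lists into a binary indicator matrix, one column per topic.
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(labels)

# Bag-of-words features feeding one linear SVM per topic.
model = make_pipeline(CountVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(texts, y)

# Classify unseen text and map the indicator row back to topic names.
predicted = binarizer.inverse_transform(model.predict(["the striker won the match"]))
print(predicted)
```

With so little data the prediction itself means nothing; the point is the shape of the pipeline: vectorize, binarize the labels, train, then map predictions back to topic names.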
Limitations:
- a large amount of data is needed to train classifiers
- the training dataset must be labeled
- labels/topics are predefined
- RNN-LSTM models are slow
Advantages:
- good accuracy
- extracted entities are not keyphrases from the text itself, but more general topics
Unsupervised approach
Adrien Bougouin proposed a method to extract keyphrases that represent topics of the document without prior model training.
My implementation of his algorithm can be found here. It goes like this:
- tokenize text and identify part-of-speech of each token
- identify the longest sequences of adjectives and nouns; these constitute the keyphrases
- convert each keyphrase to a term-frequency vector using bag-of-words, applying a stemmer for better compression
- find clusters of keyphrases using the Hierarchical Agglomerative Clustering (HAC) algorithm to form topics:
  - use the average linkage strategy
  - form clusters with a distance threshold of 0.74 (a good explanation of HAC can be found here)
- use clusters as graph vertices and the sum of distances between the keyphrases of each topic pair as the edge weight
- apply PageRank to identify the most prominent topics (implemented in networkx)
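The steps above can be sketched in plain Python. This is a minimal illustration under several assumptions, not the implementation linked above: tokens arrive already POS-tagged with Universal POS tags, `crude_stem` is a placeholder for a real stemmer, keyphrases are compared by Jaccard distance over stemmed words, edge weights sum the inverse gaps between keyphrase offsets, and a small power-iteration PageRank stands in for the networkx one so that the sketch has no dependencies. All function names are made up for the example.

```python
from itertools import combinations

NOUN_ADJ = {"NOUN", "ADJ", "PROPN"}  # assumption: Universal POS tag names

def crude_stem(word):
    """Placeholder stemmer: lowercase and strip a trailing plural 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def candidates(tagged):
    """Longest runs of adjectives/nouns -> {phrase: [start offsets]}."""
    found, run, start = {}, [], 0
    for i, (tok, tag) in enumerate(tagged + [("", "END")]):  # sentinel flushes last run
        if tag in NOUN_ADJ:
            if not run:
                start = i
            run.append(tok.lower())
        elif run:
            found.setdefault(" ".join(run), []).append(start)
            run = []
    return found

def jaccard_distance(a, b):
    """1 - |shared stems| / |all stems| for two phrases."""
    sa, sb = ({crude_stem(w) for w in p.split()} for p in (a, b))
    return 1 - len(sa & sb) / len(sa | sb)

def cluster(phrases, threshold=0.74):
    """Average-linkage agglomerative clustering, stopped at `threshold`."""
    topics = [[p] for p in phrases]
    def avg(t1, t2):
        return sum(jaccard_distance(a, b) for a in t1 for b in t2) / (len(t1) * len(t2))
    while len(topics) > 1:
        i, j = min(combinations(range(len(topics)), 2),
                   key=lambda ij: avg(topics[ij[0]], topics[ij[1]]))
        if avg(topics[i], topics[j]) > threshold:
            break
        topics[i] += topics.pop(j)  # i < j, so index i stays valid
    return topics

def topic_graph(topics, positions):
    """Complete graph over topics; weight = sum of 1/|offset gap| over phrase pairs."""
    weights = {}
    for i, j in combinations(range(len(topics)), 2):
        w = sum(1 / abs(pi - pj)
                for a in topics[i] for b in topics[j]
                for pi in positions[a] for pj in positions[b])
        weights[i, j] = weights[j, i] = w
    return weights

def pagerank(n, weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over `n` vertices."""
    scores = [1 / n] * n
    out = [sum(weights.get((i, j), 0) for j in range(n)) for i in range(n)]
    for _ in range(iterations):
        scores = [(1 - damping) / n + damping * sum(
                      scores[j] * weights[j, i] / out[j]
                      for j in range(n) if (j, i) in weights and out[j] > 0)
                  for i in range(n)]
    return scores

# Hand-tagged toy input (invented); a real run would use a POS tagger.
tagged = [("Topic", "NOUN"), ("ranking", "NOUN"), ("uses", "VERB"), ("a", "DET"),
          ("complete", "ADJ"), ("graph", "NOUN"), ("of", "ADP"), ("topics", "NOUN"),
          (".", "PUNCT"), ("Graph", "NOUN"), ("ranking", "NOUN"), ("scores", "VERB"),
          ("each", "DET"), ("topic", "NOUN"), (".", "PUNCT")]

positions = candidates(tagged)
topics = cluster(list(positions))
scores = pagerank(len(topics), topic_graph(topics, positions))
for score, topic in sorted(zip(scores, topics), key=lambda st: -st[0]):
    print(round(score, 3), topic)
```

On a real document the clustering and ranking would normally be delegated to a library; with networkx, for instance, the last step is a call to `networkx.pagerank(G, weight="weight")` on a weighted graph.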
For the top N topics, extract the most significant keyphrase that represents each topic. Possible strategies:
- use the keyphrase that is closest to the beginning of the text
- use the center of the cluster from step 4
- use the most frequent keyphrase
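The three strategies can be written as small selector functions. The sketch below makes a few assumptions: a topic is a list of keyphrases, `positions` maps each keyphrase to the token offsets where it occurs, and the "center" is taken to be the medoid, i.e. the keyphrase with the smallest total word-overlap distance to the others. The names and the sample data are invented for the illustration.

```python
def word_distance(a, b):
    """Jaccard distance over the words of two keyphrases (no stemming here)."""
    sa, sb = set(a.split()), set(b.split())
    return 1 - len(sa & sb) / len(sa | sb)

def first_occurring(topic, positions):
    """Keyphrase whose earliest occurrence is closest to the start of the text."""
    return min(topic, key=lambda p: min(positions[p]))

def cluster_center(topic, distance=word_distance):
    """Medoid: keyphrase with the smallest summed distance to the rest of the topic."""
    return min(topic, key=lambda p: sum(distance(p, q) for q in topic))

def most_frequent(topic, positions):
    """Keyphrase that occurs most often in the text."""
    return max(topic, key=lambda p: len(positions[p]))

# Invented example topic with the offsets of each keyphrase.
topic = ["keyphrase extraction", "keyphrase", "keyphrase candidate"]
positions = {
    "keyphrase extraction": [0, 40],
    "keyphrase": [12, 25, 60],
    "keyphrase candidate": [33],
}

print(first_occurring(topic, positions))
print(cluster_center(topic))
print(most_frequent(topic, positions))
```

The strategies can disagree: here the first-occurring keyphrase is the longer "keyphrase extraction", while the center and the most frequent keyphrase are both the bare "keyphrase".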
Limitations:
- topics are represented by words from the text itself
Advantages:
- fast
- no training needed
- unlike LDA, topics are easily interpreted by humans
Tests
I ran the code against the original article on this algorithm and it gave: document, topics, keyphrase, graph, method, topicrank, word, vertices, keyphrase candidate, datasets as keyphrases, which is pretty close to what the article is about.