April 16, 2018

Extracting keyphrases from texts: unsupervised algorithm TopicRank

Keyphrase extraction is the task of identifying single or multi-word expressions that represent the main topics of a document. There are two approaches to extracting topics (and/or keyphrases) from a text: supervised and unsupervised.

Supervised approach

This is a multi-label, multi-class classification task, where the following features can be used as input: a bag-of-words representation or word embeddings.

For bag-of-words, a linear SVM is a good classifier. For word embeddings, RNNs are a better fit.

Once the model is trained, it can be used to classify unseen text.

Limitations:

  • a large amount of data is needed to train the classifiers
  • the training dataset must be labeled
  • labels/topics are predefined
  • RNN-LSTMs are slow

Advantages:

  • good accuracy
  • extracted entities are not keyphrases from the text itself, but more general topics

Unsupervised approach

Adrien Bougouin and his co-authors proposed TopicRank, a method that extracts keyphrases representing the topics of a document without prior model training.

My implementation of the algorithm can be found here. It goes like this:

  • tokenize the text and identify the part of speech of each token
  • identify the longest sequences of adjectives and nouns; these are the candidate keyphrases
  • convert each keyphrase to a term-frequency vector using bag-of-words, applying a stemmer for better compression
  • find clusters of keyphrases using the Hierarchical Agglomerative Clustering (HAC) algorithm to form topics:
      • use the average linkage strategy
      • form clusters by cutting the hierarchy at a threshold of 0.74 (a good explanation of HAC can be found here)
  • use the clusters as graph vertices and, for each pair of topics, the sum of distances between their keyphrases as the edge weight
  • apply PageRank to identify the most prominent topics (implemented in networkx)
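
The steps above can be sketched in plain Python. This is a minimal illustration under several assumptions, not the implementation linked above: the text is assumed to be already POS-tagged with Penn Treebank tags, `stem` is a crude plural-stripping placeholder for a real stemmer, keyphrase distance is taken to be the Jaccard distance between stem sets, edge weights are sums of reciprocal token-offset distances (my reading of the TopicRank paper), and a small power-iteration PageRank stands in for networkx so that the example has no dependencies. All function names are hypothetical.

```python
from itertools import combinations, groupby

def stem(word):
    # Crude placeholder for a real stemmer (assumption): strips a plural "s".
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def extract_candidates(tagged):
    """Longest runs of nouns/adjectives -> {phrase: [start token offsets]}."""
    found = {}
    is_np = lambda item: item[1][1][:2] in ("NN", "JJ")  # Penn Treebank tags
    for keep, run in groupby(enumerate(tagged), key=is_np):
        run = list(run)
        if keep:
            phrase = " ".join(word.lower() for _, (word, _tag) in run)
            found.setdefault(phrase, []).append(run[0][0])
    return found

def cluster_topics(phrases, cutoff=0.74):
    """Average-linkage HAC over Jaccard distances between stem sets."""
    stems = {p: {stem(w) for w in p.split()} for p in phrases}
    def dist(a, b):
        return 1 - len(stems[a] & stems[b]) / len(stems[a] | stems[b])
    def linkage(c1, c2):
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))
    clusters = [[p] for p in phrases]
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        if linkage(clusters[i], clusters[j]) > cutoff:
            break  # the closest pair is too far apart: stop merging
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters

def topic_graph(clusters, positions):
    """Complete graph as a symmetric weight matrix between topics."""
    n = len(clusters)
    weights = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        w = sum(1.0 / abs(p - q)
                for a in clusters[i] for b in clusters[j]
                for p in positions[a] for q in positions[b])
        weights[i][j] = weights[j][i] = w
    return weights

def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration."""
    n = len(weights)
    out = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n + damping * sum(
                      scores[j] * weights[j][i] / out[j]
                      for j in range(n) if out[j])
                  for i in range(n)]
    return scores

# Hand-tagged toy input; a real pipeline would get this from a POS tagger.
tagged = [("Keyphrase", "NN"), ("extraction", "NN"), ("finds", "VBZ"),
          ("important", "JJ"), ("topics", "NNS"), (".", "."),
          ("Each", "DT"), ("keyphrase", "NN"), ("represents", "VBZ"),
          ("a", "DT"), ("topic", "NN"), (".", ".")]
positions = extract_candidates(tagged)
topics = cluster_topics(list(positions))
scores = pagerank(topic_graph(topics, positions))
for score, topic in sorted(zip(scores, topics), reverse=True):
    print(round(score, 3), topic)
```

The naive clustering loop recomputes every linkage on each pass, which is fine for the few dozen candidates of a single document but would be worth replacing with a library routine for anything larger.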

For each of the top N topics, extract the most significant keyphrase that represents it. Possible strategies:

  • use the keyphrase that appears closest to the beginning of the text
  • use the center of the cluster from step 4 (the clustering step)
  • use the most frequent keyphrase
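
The three strategies can be compared side by side with a short sketch. It assumes a topic is a list of keyphrases and that `positions` maps each keyphrase to the token offsets where it occurs; both structures and all function names are hypothetical. The "center" is interpreted here as the medoid, i.e. the keyphrase with the smallest total distance to the others, using Jaccard distance between word sets as an assumed metric.

```python
def first_occurring(topic, positions):
    # Keyphrase whose earliest occurrence is closest to the start of the text.
    return min(topic, key=lambda phrase: min(positions[phrase]))

def most_frequent(topic, positions):
    # Keyphrase with the largest number of occurrences.
    return max(topic, key=lambda phrase: len(positions[phrase]))

def jaccard_distance(a, b):
    # Assumed metric: 1 - |shared words| / |all words| of the two phrases.
    wa, wb = set(a.split()), set(b.split())
    return 1 - len(wa & wb) / len(wa | wb)

def medoid(topic, distance=jaccard_distance):
    # Keyphrase with the smallest total distance to the rest of the topic.
    return min(topic, key=lambda p: sum(distance(p, q) for q in topic))

# Made-up topic and offsets, for illustration only.
topic = ["keyphrase extraction", "keyphrase", "automatic keyphrase extraction"]
positions = {
    "keyphrase extraction": [12, 40],
    "keyphrase": [3, 25, 31],
    "automatic keyphrase extraction": [0],
}
print(first_occurring(topic, positions))
print(most_frequent(topic, positions))
print(medoid(topic))
```

On this made-up topic each strategy picks a different keyphrase, which shows that the choice of strategy affects the final output and is worth checking against real documents.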

Limitations:

  • topics are represented by words from the text itself

Advantages:

  • fast
  • no training needed
  • unlike LDA, topics are easily interpreted by humans

Tests

I ran the code against the original article on this algorithm and it gave: document, topics, keyphrase, graph, method, topicrank, word, vertices, keyphrase candidate, datasets as keyphrases, which is pretty close to what the article is about.

© Alexey Smirnov 2023