Graph-based Ranking Model for Text
Overview of TextRank keyword extraction based on Google's PageRank algorithm
Introduction
A graph-based ranking algorithm is a way of deciding on the importance of a vertex within a graph by taking into account that the global information is computed recursively from the entire graph rather than relying only on local vertex-specific information.
Applying such methods to lexical or semantic graphs extracted from natural language documents can be used for tasks ranging from automated extraction of keyphrases to extractive summarization and word sense disambiguation.
Ranking Algorithm
Let G = (V,E) be a directed graph with the set of vertices V and set of edges E, where E is a subset of VxV. For a given vertex V_i, let In(V_i) be the set of vertices that point to it (predecessors), and let Out(V_i) be the set of vertices that vertex V_i points to (successors). The score of a vertex V_i is defined as follows:
where d is a damping factor that can be set between 0 and 1.
TextRank
TextRank keyword extraction algorithm is fully unsupervised, and proceeds as follows:
The text is tokenized, and annotated with part of speech tags.
Lexical units that pass the syntactic filter (usually nouns and verbs) are added to the graph, and an edge is added between those lexical units that co-occur within a window of N words.
Each vertex is assigned an initial score of 1, and the ranking algorithm is run on the graph for several iterations until it converges.
Vertices are sorted in reversed order of their score, and the top T vertices are retained for post-processing.
During post-processing, retained lexical units are marked in the text, and sequences of adjacent units are collapsed into multi-word units.