Ranking

PageRankAnnoy

class datawords.ranking.PageRankAnnoy(metric_distance='euclidean', l2_norm=True, barrier=0.9, n_trees=10, n_jobs=-1, tqdm=True)

It uses the Annoy indexer and PageRank to rank a list of documents, where each document should be a string.

PageRank is a well-known algorithm from Google, originally used to find the “most important” links on the internet.

In the text world, there are different approaches to achieve the same goal. For instance, if two documents share the same words, they can be treated as connected nodes.

Because PageRank is a graph algorithm, the idea of nodes and edges is important.

Another approach is to measure the similarity between two documents (nodes): if they are similar enough (barrier), they are connected. This last method is the one used here.
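The idea can be illustrated with a minimal, pure-Python sketch. It is not the library's implementation: it replaces the Annoy index with a brute-force comparison, assumes cosine similarity with "similar enough" meaning similarity >= barrier, and uses a plain power-iteration PageRank. The helper names (cosine_similarity, build_adjacency, pagerank) and the damping value are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def build_adjacency(vectors, barrier):
    """Connect two documents when their similarity reaches the barrier (assumed rule)."""
    n = len(vectors)
    adjacency = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if cosine_similarity(vectors[i], vectors[j]) >= barrier:
                adjacency[i][j] = adjacency[j][i] = 1.0
    return adjacency

def pagerank(adjacency, damping=0.85, iterations=100):
    """Power-iteration PageRank over an unweighted adjacency matrix."""
    n = len(adjacency)
    scores = [1.0 / n] * n
    out_degree = [sum(row) for row in adjacency]
    for _ in range(iterations):
        # Score held by nodes without edges is spread evenly over all nodes.
        dangling = sum(scores[i] for i in range(n) if out_degree[i] == 0)
        new_scores = []
        for j in range(n):
            incoming = sum(
                scores[i] * adjacency[i][j] / out_degree[i]
                for i in range(n)
                if out_degree[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * (incoming + dangling / n))
        scores = new_scores
    return scores

# Toy document vectors: the first three point in similar directions, the last does not.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0]]
adjacency = build_adjacency(vectors, barrier=0.9)
scores = pagerank(adjacency)
```

In this toy example the first three documents end up connected to each other and receive higher scores than the isolated fourth one.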

__init__(metric_distance='euclidean', l2_norm=True, barrier=0.9, n_trees=10, n_jobs=-1, tqdm=True)

For the l2_norm parameter, refer to https://machinelearningmastery.com/vector-norms-machine-learning/

Parameters:
  • metric_distance – the options are “angular”, “euclidean”, “manhattan”, “hamming”, or “dot”. Annoy uses it as the measure to calculate the distance between texts.

  • l2_norm – True by default; using it is usually recommended.

  • barrier – threshold used as a filter to decide whether two texts are connected as nodes. The appropriate value depends on the chosen metric_distance.

  • n_trees – number of trees used by the Annoy index.

  • n_jobs – controls multithreading for the index.

  • tqdm – whether to use tqdm when the adjacency matrix is calculated.
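To make the l2_norm option concrete, here is a small sketch of L2 normalization, under the assumption that it is applied to each document vector before indexing. The function names are hypothetical and not part of datawords. For unit-length vectors, the squared Euclidean distance equals 2 - 2 * cos(angle), which is one reason a single barrier value is easier to choose after normalization.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

unit = l2_normalize([3.0, 4.0])  # length 5 before normalization
same_direction = l2_normalize([6.0, 8.0])  # same direction, twice as long

# Vectors pointing the same way collapse to the same point once normalized.
distance_after = euclidean_distance(unit, same_direction)
```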

static as_ndarray(X: List[ndarray]) ndarray
fit(X: ndarray)
transform()
fit_transform(X: ndarray)
rank(index: List[Any], top_n=5)

Given an index, builds a ranking using the scores created by PageRank.
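As a rough sketch of what such a ranking step could look like, the snippet below pairs each index entry with a score and keeps the top_n highest. This is an assumption about the behavior, not the library's code: rank_by_scores is a hypothetical helper, and the actual return format of rank is not specified here.

```python
def rank_by_scores(index, scores, top_n=5):
    """Return the top_n (label, score) pairs, highest score first.

    Assumes position i in `index` labels the document whose score is scores[i].
    """
    pairs = list(zip(index, scores))
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:top_n]

# Illustrative labels and scores, not real PageRank output.
titles = ["doc-a", "doc-b", "doc-c"]
example_scores = [0.2, 0.5, 0.3]
top = rank_by_scores(titles, example_scores, top_n=2)
```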

property scores
property edges
property adjacency