Indexes

TextIndex

TextIndex is a wrapper around Spotify’s AnnoyIndex.

class datawords.indexes.TextIndex(words_model: Word2VecHelper, id_mapper: Dict[int, Any], ann_trees: int = 10, distance_metric: str = 'angular', ix: Annoy | None = None)

__init__(words_model: Word2VecHelper, id_mapper: Dict[int, Any], ann_trees: int = 10, distance_metric: str = 'angular', ix: Annoy | None = None)

This index is opinionated: AnnoyIndex normally accepts any kind of vector, whereas this index accepts only texts, parses them, and performs searches.

AnnoyIndex only accepts integers as indices of the data. For that reason, an id_mapper is built during the build process.
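
The role of the id_mapper can be illustrated with a minimal sketch. It assumes the internal Annoy indices are simply the positions of the original ids in the input list; `build_id_mapper` is a hypothetical helper used for illustration only and is not part of datawords.

```python
from typing import Any, Dict, List


def build_id_mapper(ids: List[Any]) -> Dict[int, Any]:
    """Map consecutive integer indices (as Annoy requires) to the original ids."""
    # Assumption: the internal index is the position of the id in the list.
    return {internal: original for internal, original in enumerate(ids)}


id_mapper = build_id_mapper(["doc-a", "doc-b", "doc-c"])

# Integer results coming back from the index (here a made-up [2, 0])
# can then be translated back to ids from the original domain.
original_ids = [id_mapper[i] for i in [2, 0]]
```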

Parameters:

words_model (Word2VecHelper) – Words Model to be used for vectorizing texts.
id_mapper (Dict[int, Any]) – A dict where keys are the indices stored in Annoy and values are the real indices in the domain which the data belongs to.
distance_metric (str) – type of distance to be used: “angular”, “euclidean”, “manhattan”, “hamming”, or “dot”.
ix (AnnoyIndex) – AnnoyIndex object.

encode(txt: str) → ndarray

A wrapper around the encode method of the Word2Vec model. It takes a text and returns its vectorized version.

Parameters: txt (str) – full string sentence.
Returns: an array from text
Return type: np.ndarray

search(txt: str, top_n=5, include_distances=False)

Once the AnnoyIndex has been trained, it can perform searches over the index. It returns the original ids using the id_mapper built during the training process.

Parameters:

txt (str) – full text to search
top_n (int) – how many results to return.
include_distances (bool) – whether to include distances in the results.
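
The search flow described above (vector lookup, then translation through the id_mapper) can be sketched in plain Python. This is not how Annoy works internally: Annoy performs an approximate search over a forest of trees, while the stand-in below is an exact brute-force scan. The formula used is the one Annoy documents for its "angular" metric. `angular_distance`, `brute_force_search`, and the toy vectors are hypothetical names and data introduced for illustration.

```python
import math
from typing import Any, Dict, List, Sequence, Tuple


def angular_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Annoy's "angular" metric: sqrt(2 * (1 - cos(u, v))).

    Assumes neither vector is all zeros.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos = dot / norm
    # Clamp at 0 to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))


def brute_force_search(
    query: Sequence[float],
    vectors: Dict[int, Sequence[float]],
    id_mapper: Dict[int, Any],
    top_n: int = 5,
) -> List[Tuple[Any, float]]:
    """Return (original_id, distance) pairs for the top_n closest vectors."""
    scored = sorted(
        (angular_distance(query, vec), internal)
        for internal, vec in vectors.items()
    )
    return [(id_mapper[internal], dist) for dist, internal in scored[:top_n]]


# Toy data: integer keys play the role of Annoy's internal indices.
vectors = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
id_mapper = {0: "doc-a", 1: "doc-b", 2: "doc-c"}

results = brute_force_search([1.0, 0.1], vectors, id_mapper, top_n=2)
```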

classmethod build(ids: List[Any], *, getter: Callable, words_model_path: str | None = None, words_model: Word2VecHelper | None = None, distance_metric: str = 'angular', ann_trees: int = 10, n_jobs: int = -1) → TextIndex

Build the TextIndex. Use as follows:

def getter(id_: str) -> str:
    return db.get(id_)

ix = TextIndex.build(ids, getter=getter, words_model=wv)

Check the test cases for a better example.

Parameters:

ids (List[Any]) – a list of the original ids. These ids will be mapped to the internal ids used by Annoy.
getter (Callable) – a function which takes an id and returns the text to be encoded and indexed.
words_model_path (str) – a full path to the datawords.words.Word2VecHelper model.
words_model (Word2VecHelper) – optionally, a datawords.words.Word2VecHelper can be provided.
vector_size – size of the vectors to be indexed; it should match the vector_size of the word2vec model.
distance_metric (str) – type of distance to be used: “angular”, “euclidean”, “manhattan”, “hamming”, or “dot”.
ann_trees (int) – number of trees in the forest built by Annoy. More trees give higher precision when querying.
n_jobs (int) – specifies the number of threads used to build the trees. n_jobs=-1 uses all available CPU cores.

Returns:

TextIndex trained.

Return type:

TextIndex

classmethod load(fp: str | PathLike, words_model: Word2VecHelper | None = None) → TextIndex

Loads the TextIndex model.

Parameters:

fp (Union[str, os.PathLike]) – The path to the index. Each model is stored in a folder. The path should be to that folder.
words_model (Word2VecHelper) – Optional, the words model to be used.

Returns: TextIndex loaded.
Return type: TextIndex

save(fp: str | PathLike)

Save index to a folder.

Parameters: fp (Union[str, os.PathLike]) – The path to the index. Each model is stored in a folder. The path should be to that folder.

SQLiteIndex

SQLiteIndex indexes and searches documents using SQLite. Check the SQLite FTS documentation for the search query syntax.

class datawords.indexes.LiteDoc(id: str, text: str)

Used by SQLiteIndex

id: str

text: str

__init__(id: str, text: str) → None: Method generated by attrs for class LiteDoc.

class datawords.indexes.SQLiteIndex(sqlite=':memory:', stopwords={})

__init__(sqlite=':memory:', stopwords={})

SQLiteIndex allows storing documents and searching over them. It uses the fts5 module from SQLite. It also parses the text in an opinionated way.

Parameters:

sqlite (str) – path where the database will be stored. By default it is kept in memory.
stopwords (Set[str]) – set of stop words to be used by the parser.
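
The exact parsing rules are not described here. As a rough sketch of what stop-word filtering before indexing could look like, assuming lowercasing and splitting on non-word characters (both are assumptions, and `parse_text` is a hypothetical helper, not the library's parser):

```python
import re
from typing import List, Set


def parse_text(text: str, stopwords: Set[str]) -> List[str]:
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [tok for tok in tokens if tok not in stopwords]


tokens = parse_text(
    "The quick brown fox jumps over the lazy dog",
    stopwords={"the", "over"},
)
```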

add(doc: LiteDoc) → bool

Adds a document of type LiteDoc to the index. If the document already exists, it will not be stored.

Parameters: doc (LiteDoc) – A document
Returns: True if it was stored, False if not.
Return type: bool
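
The "store only if not already present" behaviour can be sketched with the standard sqlite3 module. The table layout and the `add_doc` helper below are assumptions made for illustration; the actual schema used by SQLiteIndex is not documented here.

```python
import sqlite3


def add_doc(conn: sqlite3.Connection, doc_id: str, text: str) -> bool:
    """Insert a document; return False when the id is already present."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO docs (id, text) VALUES (?, ?)",
        (doc_id, text),
    )
    conn.commit()
    # rowcount is 1 when a row was inserted and 0 when the insert was ignored.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
# Hypothetical schema: the primary key enforces one row per document id.
conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, text TEXT)")

first = add_doc(conn, "1", "hello world")
second = add_doc(conn, "1", "a duplicate id")
stored = conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0]
```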

add_batch(docs: List[LiteDoc]) → List[bool]

Adds documents in batch.

Parameters: docs (List[LiteDoc]) – A list of documents
Returns: for each document, True if it was stored or False if not.
Return type: List[bool]

list_docs(limit=10, offset=0) → List[LiteDoc]

list_ids(limit=10, offset=0) → List[str]: Lists the ids of the documents stored.

property total: int: Returns the total number of documents stored in the index.
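
The limit and offset arguments suggest page-style listing. A minimal sketch of that pattern with the standard sqlite3 module follows, assuming a plain `docs` table ordered by id; the schema and this stand-alone `list_ids` function are hypothetical and only mirror the method's signature.

```python
import sqlite3
from typing import List


def list_ids(conn: sqlite3.Connection, limit: int = 10, offset: int = 0) -> List[str]:
    """Return at most `limit` ids, skipping the first `offset` rows."""
    rows = conn.execute(
        "SELECT id FROM docs ORDER BY id LIMIT ? OFFSET ?",
        (limit, offset),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO docs (id, text) VALUES (?, ?)",
    [(f"doc-{i}", f"text {i}") for i in range(5)],
)

# Walk through the five stored ids two at a time.
page_1 = list_ids(conn, limit=2, offset=0)
page_2 = list_ids(conn, limit=2, offset=2)
page_3 = list_ids(conn, limit=2, offset=4)
```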

search(text: str, limit: int = 1) → List[LiteDoc]

Performs a search in the index.

Parameters:

text (str) – text to search.
limit (int) – how many results to retrieve.

Returns:

Documents found

Return type:

List[LiteDoc]
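
Since the index relies on SQLite's fts5 module, a search of this kind can be sketched directly with the standard sqlite3 module. The sketch assumes an SQLite build with FTS5 enabled; the table definition, the sample documents, and the stand-alone `search` function are illustrative assumptions rather than the library's implementation.

```python
import sqlite3
from typing import List, Tuple

conn = sqlite3.connect(":memory:")
# Requires an SQLite build with the FTS5 extension enabled.
# The id column is stored but not indexed for full-text matching.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(id UNINDEXED, text)")
conn.executemany(
    "INSERT INTO docs (id, text) VALUES (?, ?)",
    [
        ("1", "annoy builds approximate nearest neighbour trees"),
        ("2", "sqlite provides full text search"),
        ("3", "full text search with sqlite fts5"),
    ],
)


def search(conn: sqlite3.Connection, text: str, limit: int = 1) -> List[Tuple[str, str]]:
    """Run an FTS5 MATCH query and return up to `limit` (id, text) rows."""
    rows = conn.execute(
        "SELECT id, text FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT ?",
        (text, limit),
    )
    return rows.fetchall()


tree_hits = search(conn, "trees", limit=5)
sqlite_hits = search(conn, "sqlite", limit=5)
top_hit = search(conn, "sqlite", limit=1)
```

The text argument is passed to MATCH as-is, so it follows the FTS5 query syntax (for example, several bare words are combined with an implicit AND).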