Indexes

TextIndex

TextIndex is a wrapper around Spotify’s AnnoyIndex.

class datawords.indexes.TextIndex(words_model: Word2VecHelper, id_mapper: Dict[int, Any], ann_trees: int = 10, distance_metric: str = 'angular', ix: Annoy | None = None)
__init__(words_model: Word2VecHelper, id_mapper: Dict[int, Any], ann_trees: int = 10, distance_metric: str = 'angular', ix: Annoy | None = None)

This index is opinionated: AnnoyIndex usually accepts any kind of vector, whereas this index is prepared to accept only texts, parse them, and perform searches.

AnnoyIndex only accepts integers as indices of the data. For that reason, an id_mapper is built during the build process.
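
The mapping described above can be sketched in plain Python. This is only an illustration of the idea, not the library's code: build_id_mapper is a hypothetical helper, and the sequential numbering is an assumption.

```python
from typing import Any, Dict, Iterable


def build_id_mapper(ids: Iterable[Any]) -> Dict[int, Any]:
    """Assign a sequential integer to each domain id (hypothetical helper)."""
    return {ix: original_id for ix, original_id in enumerate(ids)}


id_mapper = build_id_mapper(["doc-a", "doc-b", "doc-c"])
# Annoy would store the vectors under 0, 1 and 2; the mapper translates
# an internal integer back into the id used by the application.
original = id_mapper[1]
```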

Parameters:
  • words_model (Word2VecHelper) – Words Model to be used for vectorizing texts.

  • id_mapper (Dict[int, Any]) – A dict where keys are the indices stored in Annoy and values are the real indices in the domain which the data belongs to.

  • distance_metric (str) – type of distance to be used: “angular”, “euclidean”, “manhattan”, “hamming”, or “dot”.

  • ix (AnnoyIndex) – AnnoyIndex object.
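
For the default “angular” metric, Annoy documents the distance as the Euclidean distance between normalized vectors, that is sqrt(2 * (1 - cos(u, v))). A small pure-Python sketch of that formula follows; angular_distance is an illustrative name, not part of datawords or Annoy.

```python
import math
from typing import Sequence


def angular_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Annoy-style angular distance: sqrt(2 * (1 - cos(u, v)))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))


same_direction = angular_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = angular_distance([1.0, 0.0], [0.0, 1.0])
opposite = angular_distance([1.0, 0.0], [-1.0, 0.0])
```

With this metric only the direction of a vector matters, so the distance ranges from 0 (same direction) to 2 (opposite direction).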

encode(txt: str) ndarray

A wrapper around the encode method of the Word2Vec model. It takes a text and returns its vectorized version.

Parameters:

txt (str) – full string sentence.

Returns:

An array built from the text.

Return type:

np.ndarray

search(txt: str, top_n=5, include_distances=False)

Once the AnnoyIndex has been trained, it can perform searches over the index. It returns the original ids using the id_mapper built during the training process.

Parameters:
  • txt (str) – full text to search

  • top_n (int) – How many results it returns.

  • include_distances (bool) – whether to include the distance of each result.
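
The way a search translates internal integers back into domain ids can be sketched as follows. This is a toy under stated assumptions: an exact scan with Euclidean distance stands in for Annoy's approximate lookup, toy_search is a hypothetical name, and the (ids, distances) return shape mirrors Annoy's own convention rather than anything confirmed for TextIndex.search.

```python
import math
from typing import Any, Dict, List, Sequence, Tuple, Union


def toy_search(
    query: Sequence[float],
    vectors: Dict[int, Sequence[float]],
    id_mapper: Dict[int, Any],
    top_n: int = 5,
    include_distances: bool = False,
) -> Union[List[Any], Tuple[List[Any], List[float]]]:
    """Exact nearest-neighbour scan standing in for Annoy's approximate one."""
    scored = sorted(
        (math.dist(query, vec), ix) for ix, vec in vectors.items()
    )[:top_n]
    # Translate the internal integer ids back into domain ids.
    ids = [id_mapper[ix] for _, ix in scored]
    if include_distances:
        return ids, [dist for dist, _ in scored]
    return ids


vectors = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [5.0, 5.0]}
id_mapper = {0: "doc-a", 1: "doc-b", 2: "doc-c"}
nearest = toy_search([0.9, 0.1], vectors, id_mapper, top_n=2)
```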

classmethod build(ids: Iterable, *, getter: Callable, words_model_path: str | None = None, words_model: Word2VecHelper | None = None, distance_metric: str = 'angular', ann_trees: int = 10, n_jobs: int = -1, progress_bar=True, total_ids=None) TextIndex

Build the TextIndex. Use as follows:

def getter(id_: str) -> str:
    return db.get(id_)

ix = TextIndex.build(ids, getter=getter, words_model=wv)

Check the test cases for a better example.

Parameters:
  • ids (List[Any]) – a list of the original ids. These ids will be mapped to the internal ids used by Annoy.

  • getter (Callable) – A function which takes an id and returns the text to be encoded and indexed.

  • words_model_path – The full path to the datawords.words.Word2VecHelper model.

  • words_model (Word2VecHelper) – optionally, a datawords.words.Word2VecHelper can be provided.

  • vector_size – size of the vector to be indexed. It should match the vector_size of the word2vec model.

  • distance_metric (str) – type of distance to be used: “angular”, “euclidean”, “manhattan”, “hamming”, or “dot”.

  • ann_trees (int) – builds a forest of ann_trees trees. More trees give higher precision when querying.

  • n_jobs (int) – specifies the number of threads used to build the trees. n_jobs=-1 uses all available CPU cores.

Returns:

The trained TextIndex.

Return type:

TextIndex

classmethod load(fp: str | PathLike, words_model: Word2VecHelper | None = None) TextIndex

Loads the TextIndex model.

Parameters:
  • fp (Union[str, os.PathLike]) – The path to the index. Each model is stored in a folder. The path should be to that folder.

  • words_model (Word2VecHelper) – Optional, the words model to be used.

Returns:

The loaded TextIndex.

Return type:

TextIndex

save(fp: str | PathLike)

Save index to a folder.

Parameters:

fp (Union[str, os.PathLike]) – The path to the index. Each model is stored in a folder. The path should be to that folder.

class datawords.indexes.TextIndexMeta(name: str, words_model_path: str, vector_size: int = 100, ann_trees: int = 10, distance_metric: str = 'angular', version: str = '0.7.3')
name: str
words_model_path: str
vector_size: int
ann_trees: int
distance_metric: str
version: str
__init__(name: str, words_model_path: str, vector_size: int = 100, ann_trees: int = 10, distance_metric: str = 'angular', version: str = '0.7.3') None

Method generated by attrs for class TextIndexMeta.

SQLiteIndex

SQLiteIndex indexes and searches documents using SQLite. Check the SQLite FTS search queries documentation for the query syntax.

class datawords.indexes.LiteDoc(id: str, text: str)

Represents a document indexed in the SQLiteIndex

id: str
text: str
__init__(id: str, text: str) None

Method generated by attrs for class LiteDoc.

class datawords.indexes.SQLiteIndex(sqlite=':memory:', stopwords={})
__init__(sqlite=':memory:', stopwords={})

SQLiteIndex allows storing documents and searching over them. It uses the fts5 module from SQLite. It also parses the text in an opinionated way.

Parameters:
  • sqlite (str) – path where the database will be stored. By default it is kept in memory.

  • stopwords (Set[str]) – set of stop words to be used by the parser.

journal_mode()
parse(txt) List[str]

Parses a text into a list of strings.
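
The exact parsing rules are not described here. As a rough idea of what a stop-word-aware parser could look like, the sketch below lowercases the text, splits it into word tokens, and drops the stop words. Both the rules and the name toy_parse are assumptions, not the behavior of SQLiteIndex.parse.

```python
import re
from typing import List, Set


def toy_parse(txt: str, stopwords: Set[str]) -> List[str]:
    """Lowercase, split into word tokens, and drop stop words (assumed rules)."""
    tokens = re.findall(r"\w+", txt.lower())
    return [tok for tok in tokens if tok not in stopwords]


tokens = toy_parse("The index and the search", {"the", "and"})
```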

add(doc: LiteDoc) bool

Adds a document of type LiteDoc to the index. If the document already exists, it will not be stored.

Parameters:

doc (LiteDoc) – A document

Returns:

True if it was stored, False otherwise.

Return type:

bool
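
The "store only if new" behavior can be illustrated with the standard sqlite3 module. This sketch is not the library's implementation: it assumes a plain table with a primary key and INSERT OR IGNORE, and Doc is a stand-in for LiteDoc built with dataclasses instead of attrs.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for LiteDoc (hypothetical; the real class is built with attrs)."""
    id: str
    text: str


def add(conn: sqlite3.Connection, doc: Doc) -> bool:
    """Store the document unless a document with the same id already exists."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO docs (id, text) VALUES (?, ?)",
        (doc.id, doc.text),
    )
    # rowcount is 1 when a row was inserted and 0 when the insert was ignored.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, text TEXT)")
first = add(conn, Doc("1", "hello world"))
second = add(conn, Doc("1", "another text"))
```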

classmethod build(ids: Iterable, *, getter: Callable, stopwords={}, progress_bar=True, total_ids=None, sqlite: str = ':memory:') SQLiteIndex
classmethod load(sqlite: str, stopwords={}) SQLiteIndex
add_batch(docs: List[LiteDoc]) List[bool]

Add documents in batch.

Parameters:

docs (List[LiteDoc]) – A list of documents

Returns:

For each document, True if it was stored, False otherwise.

Return type:

List[bool]

get_doc(id: str) LiteDoc
list_docs(limit=10, offset=0) List[LiteDoc]
list_ids(limit=10, offset=0) List[str]

Lists the ids of the stored documents.

property total: int

Returns the total number of documents stored in the index.

search(text: str, top_n: int = 5) List[LiteDoc]

Performs a search in the index. It parses the text and matches the words as:

MATCH '"{words}" *'

For a lower-level search query, use SQLiteIndex.search_query.

Parameters:
  • text (str) – text to search.

  • top_n (int) – how many results to retrieve.

Returns:

Documents found

Return type:

List[LiteDoc]
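
In FTS5, a quoted phrase followed by * is a prefix query: the last token of the phrase only needs to match the beginning of a word. The sketch below shows that query shape with the standard sqlite3 module. The table layout, the ranking, and the function name are assumptions for illustration, and it requires a SQLite build that includes fts5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumes the SQLite build bundled with Python includes the fts5 module.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(id UNINDEXED, text)")
conn.executemany(
    "INSERT INTO docs (id, text) VALUES (?, ?)",
    [("1", "full text search with sqlite"), ("2", "vector search with annoy")],
)


def search(words: str, top_n: int = 5) -> list:
    """Run a phrase-prefix query of the form "{words}" * against the table."""
    # Double any embedded quotes so the phrase stays a single FTS5 string.
    phrase = words.replace('"', '""')
    rows = conn.execute(
        "SELECT id FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT ?",
        (f'"{phrase}" *', top_n),
    )
    return [row[0] for row in rows]


hits = search("text sea")
```

Here "text sea" matches the first document because "sea" is a prefix of "search" and follows "text" directly.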

search_query(query: str, top_n: int = 5) List[LiteDoc]

Performs a search query against the SQLite database. For example:

'search AND (sqlite OR help)'

Parameters:
  • query (str) – query to search

  • top_n (int) – how many results to retrieve.

Returns:

Documents found

Return type:

List[LiteDoc]
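
The boolean query shown above can be tried directly against an FTS5 table with the standard sqlite3 module. This sketch only demonstrates the query syntax. The table layout and the function body are assumptions, not the library's code, and it requires a SQLite build that includes fts5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumes the SQLite build bundled with Python includes the fts5 module.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(id UNINDEXED, text)")
conn.executemany(
    "INSERT INTO docs (id, text) VALUES (?, ?)",
    [
        ("1", "search in sqlite"),
        ("2", "search help pages"),
        ("3", "search engines"),
    ],
)


def search_query(query: str, top_n: int = 5) -> list:
    """Pass a raw FTS5 query string straight to MATCH."""
    rows = conn.execute(
        "SELECT id FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT ?",
        (query, top_n),
    )
    return [row[0] for row in rows]


found = sorted(search_query("search AND (sqlite OR help)"))
```

FTS5 only recognizes the operators AND, OR, and NOT when they are written in upper case.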