Indexes
TextIndex
TextIndex is a wrapper around Spotify’s AnnoyIndex.
- class datawords.indexes.TextIndex(words_model: Word2VecHelper, id_mapper: Dict[int, Any], ann_trees: int = 10, distance_metric: str = 'angular', ix: Annoy | None = None)
- __init__(words_model: Word2VecHelper, id_mapper: Dict[int, Any], ann_trees: int = 10, distance_metric: str = 'angular', ix: Annoy | None = None)
This index is opinionated. AnnoyIndex normally accepts any kind of vector; this index instead accepts only texts, which it parses, vectorizes, and searches.
AnnoyIndex only accepts integers as item indices. For that reason, an id_mapper is created during the build process.
- Parameters:
words_model (Word2VecHelper) – Words Model to be used for vectorizing texts.
id_mapper (Dict[int, Any]) – A dict where keys are the indices stored in Annoy and values are the real indices in the domain which the data belongs to.
distance_metric (str) – type of distance to be used: “angular”, “euclidean”, “manhattan”, “hamming”, or “dot”.
ix (AnnoyIndex) – AnnoyIndex object.
- encode(txt: str) ndarray
A wrapper around the encode method of the Word2Vec model. It takes a text and returns its vectorized version.
- Parameters:
txt (str) – full string sentence.
- Returns:
An array with the vectorized text.
- Return type:
np.ndarray
- search(txt: str, top_n=5, include_distances=False)
Once the AnnoyIndex has been trained, it can perform searches over the index. It returns the original ids using the id_mapper built during the training process.
- Parameters:
txt (str) – Full text to search for.
top_n (int) – How many results to return.
include_distances (bool) – Whether to include distances in the results.
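The search flow described above (rank stored vectors by distance, then translate Annoy's integer ids back through id_mapper) can be sketched in plain Python. This is only an illustration under stated assumptions: it uses exact brute-force ranking as a stand-in for Annoy's approximate search, the toy vectors replace real word-vector encodings, and the `(id, distance)` result shape is an assumption rather than the documented return format. The formula for the "angular" metric is the one Annoy documents.

```python
import math
from typing import Any, Dict, List, Sequence, Tuple


def angular_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Annoy's "angular" metric: sqrt(2 * (1 - cos(u, v)))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos = dot / norm
    # max() guards against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))


def search(
    query: Sequence[float],
    vectors: List[Sequence[float]],
    id_mapper: Dict[int, Any],
    top_n: int = 5,
) -> List[Tuple[Any, float]]:
    """Exact nearest-neighbour search, translated back to domain ids."""
    scored = [(angular_distance(query, vec), i) for i, vec in enumerate(vectors)]
    scored.sort()
    return [(id_mapper[i], dist) for dist, i in scored[:top_n]]


# Toy data: positions 0..2 are the integer ids an Annoy index would hold.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
id_mapper = {0: "doc-a", 1: "doc-b", 2: "doc-c"}  # hypothetical domain ids

results = search([1.0, 0.1], vectors, id_mapper, top_n=2)
print(results)
```

The caller only ever sees domain ids; the integer positions stay internal to the index.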
- classmethod build(ids: List[Any], *, getter: Callable, words_model_path: str | None = None, words_model: Word2VecHelper | None = None, distance_metric: str = 'angular', ann_trees: int = 10, n_jobs: int = -1) TextIndex
Build the TextIndex. Use as follows:
    def getter(id_: str) -> str:
        return db.get(id_)

    ix = TextIndex.build(ids, getter=getter, words_model=wv)
Check the test cases for a better example.
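To make the role of the getter concrete, the following sketch walks through the kind of loop a build step could run: fetch the text for each id, vectorize it, and record which domain id sits at each integer position. It is not the library's implementation. `toy_encode`, `VOCAB`, `build_sketch`, and the dict-backed `db` are all hypothetical stand-ins, and the real build also trains an Annoy forest, which is omitted here.

```python
from typing import Any, Callable, Dict, List, Tuple

VOCAB = ["cat", "dog", "fish"]  # hypothetical toy vocabulary


def toy_encode(text: str) -> List[float]:
    """Stand-in for a word-vector model: counts of each vocabulary word."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in VOCAB]


def build_sketch(
    ids: List[Any], getter: Callable[[Any], str]
) -> Tuple[List[List[float]], Dict[int, Any]]:
    """Collect one vector per id and map integer positions to domain ids."""
    vectors: List[List[float]] = []
    id_mapper: Dict[int, Any] = {}
    for position, id_ in enumerate(ids):
        text = getter(id_)                # fetch the text for this domain id
        vectors.append(toy_encode(text))  # vectorize it
        id_mapper[position] = id_         # integer position -> domain id
    return vectors, id_mapper


db = {"a1": "cat cat dog", "b2": "fish"}  # hypothetical document store


def getter(id_: str) -> str:
    return db[id_]


vectors, id_mapper = build_sketch(["a1", "b2"], getter=getter)
print(vectors)
print(id_mapper)
```

Passing a getter instead of the texts themselves lets the caller decide where documents live (a database, files, an API) without the index needing to know.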
- Parameters:
ids (List[Any]) – A list of the original ids. These ids will be mapped to the internal ids used by Annoy.
getter (Callable) – A function that takes an id and returns the text to be encoded and indexed.
words_model_path – A full path to the datawords.words.Word2VecHelper.
words_model (Word2VecHelper) – Optionally, a datawords.words.Word2VecHelper could be provided.
vector_size – Size of the vector to be indexed. It should match the vector_size of the word2vec model.
distance_metric (str) – type of distance to be used: “angular”, “euclidean”, “manhattan”, “hamming”, or “dot”.
ann_trees (int) – Number of trees in the forest (Annoy's n_trees). More trees give higher precision when querying.
n_jobs (int) – specifies the number of threads used to build the trees. n_jobs=-1 uses all available CPU cores.
- Returns:
The trained TextIndex.
- Return type:
TextIndex
- classmethod load(fp: str | PathLike, words_model: Word2VecHelper | None = None) TextIndex
Loads the TextIndex model.
- Parameters:
fp (Union[str, os.PathLike]) – The path to the index. Each model is stored in a folder. The path should be to that folder.
words_model (Word2VecHelper) – Optional, the words model to be used.
- Returns:
The loaded TextIndex.
- Return type:
TextIndex
- save(fp: str | PathLike)
Save index to a folder.
- Parameters:
fp (Union[str, os.PathLike]) – The path to the index. Each model is stored in a folder. The path should be to that folder.
SQLiteIndex
SQLiteIndex indexes and searches documents using SQLite. See the SQLite FTS documentation for the search query syntax.
- class datawords.indexes.LiteDoc(id: str, text: str)
Used by SQLiteIndex.
- id: str
- text: str
- __init__(id: str, text: str) None
Method generated by attrs for class LiteDoc.
- class datawords.indexes.SQLiteIndex(sqlite=':memory:', stopwords={})
- __init__(sqlite=':memory:', stopwords={})
SQLiteIndex stores documents and searches over them. It uses the fts5 module from SQLite. It also parses the text in an opinionated way.
- Parameters:
sqlite (str) – Path where the database will be stored. By default it is kept in memory.
stopwords (Set[str]) – Set of stop words to be used by the parser.
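For readers unfamiliar with fts5, the sketch below shows the underlying SQLite mechanism using only the standard `sqlite3` module. The table layout, the `strip_stopwords` helper, and the sample documents are assumptions made for illustration; SQLiteIndex's actual schema and parsing rules may differ. It also assumes the Python build ships SQLite with FTS5 enabled.

```python
import sqlite3
from typing import Set


def strip_stopwords(text: str, stopwords: Set[str]) -> str:
    """Hypothetical parser step: lowercase the text and drop stop words."""
    return " ".join(tok for tok in text.lower().split() if tok not in stopwords)


conn = sqlite3.connect(":memory:")  # in-memory database, like the default above
# Hypothetical layout: "id" is stored but excluded from the full-text index.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(id UNINDEXED, text)")

stopwords = {"a", "the"}
raw_docs = [
    ("1", "Annoy builds a forest of trees"),
    ("2", "SQLite ships a full text search module"),
]
conn.executemany(
    "INSERT INTO docs (id, text) VALUES (?, ?)",
    [(doc_id, strip_stopwords(text, stopwords)) for doc_id, text in raw_docs],
)

# MATCH runs a full-text query; ORDER BY rank puts the best matches first.
rows = conn.execute(
    "SELECT id FROM docs WHERE docs MATCH ? ORDER BY rank", ("search",)
).fetchall()
print(rows)
```

Removing stop words before indexing keeps very common terms from matching nearly every document.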
- add(doc: LiteDoc) bool
Adds a document of type LiteDoc into the index. If the document already exists, it will not be stored.
- Parameters:
doc (LiteDoc) – A document
- Returns:
True if it was stored, False if not.
- Return type:
bool
- add_batch(docs: List[LiteDoc]) List[bool]
Add documents in batch.
- Parameters:
docs (List[LiteDoc]) – A list of documents
- Returns:
A list with one value per document: True if it was stored, False if not.
- Return type:
List[bool]
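The "store only if new" behaviour of add and add_batch can be sketched with the standard `sqlite3` module. This is an assumption about the semantics, not the library's code: the plain `docs` table and the free functions below are hypothetical, and documents are represented as `(id, text)` tuples instead of LiteDoc objects to keep the example dependency-free.

```python
import sqlite3
from typing import List, Tuple

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, text TEXT)")


def add(doc_id: str, text: str) -> bool:
    """Store the document unless a document with the same id exists."""
    exists = conn.execute(
        "SELECT 1 FROM docs WHERE id = ?", (doc_id,)
    ).fetchone()
    if exists:
        return False  # duplicate id: nothing is written
    conn.execute("INSERT INTO docs (id, text) VALUES (?, ?)", (doc_id, text))
    return True


def add_batch(docs: List[Tuple[str, str]]) -> List[bool]:
    """Apply add() to each document, returning one flag per input."""
    return [add(doc_id, text) for doc_id, text in docs]


first = add("1", "first document")
batch = add_batch([("1", "duplicate id"), ("2", "second document")])
print(first, batch)
```

Returning one flag per document lets the caller pair results with inputs by position and see exactly which ones were skipped.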
- list_ids(limit=10, offset=0) List[str]
Lists the ids of the stored documents.
- property total: int
Returns the total of documents stored in the index.
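The limit and offset arguments of list_ids, together with total, suggest a paging pattern. The sketch below shows that pattern against a plain SQLite table using the standard `sqlite3` module. The table, the insertion-order sort, and the free functions are assumptions for illustration; the real index may order ids differently.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO docs (id, text) VALUES (?, ?)",
    [(str(n), f"document {n}") for n in range(25)],
)


def list_ids(limit: int = 10, offset: int = 0) -> List[str]:
    """Return one page of ids in insertion order."""
    rows = conn.execute(
        "SELECT id FROM docs ORDER BY rowid LIMIT ? OFFSET ?", (limit, offset)
    ).fetchall()
    return [row[0] for row in rows]


def total() -> int:
    """Return the number of stored documents."""
    return conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0]


# Walk the whole table ten ids at a time.
pages = [list_ids(limit=10, offset=start) for start in range(0, total(), 10)]
print([len(page) for page in pages])
```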