Models

PhrasesModel

The PhrasesModel is used to build bigrams. It is a wrapper around Gensim Phrases.

class datawords.models.PhrasesModel(parser_conf: ParserConf, min_count: float = 1.0, threshold: float | None = None, max_vocab_size: int | None = None, connector_words=None, model=None)
__init__(parser_conf: ParserConf, min_count: float = 1.0, threshold: float | None = None, max_vocab_size: int | None = None, connector_words=None, model=None)
Parameters:
  • parser_conf (parsers.ParserConf) – configuration of the parser

  • min_count (float) – Ignore all words and bigrams with total collected count lower than this value.

  • threshold – Represent a score threshold for forming the phrases (higher means fewer phrases). A phrase of words a followed by b is accepted if the score of the phrase is greater than threshold. Heavily depends on concrete scoring-function, see the scoring parameter.

  • max_vocab_size (Optional[int]) – Maximum size (number of tokens) of the vocabulary. Used to control pruning of less common words, to keep memory under control. The default of 40M needs about 3.6GB of RAM. Increase or decrease max_vocab_size depending on how much memory you have available.

  • connector_words (Frozenset[str]) – Set of words that may be included within a phrase, without affecting its scoring. If none is provided, it will use the lang value from the parser_conf. By default datawords includes CONNECTOR_WORDS for English, Portuguese and Spanish.
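
How min_count and threshold interact can be sketched with the default scoring formula that Gensim Phrases uses. This is an illustration only: it assumes the wrapper leaves Gensim's default scorer in place, and the function names and the threshold value of 10.0 (Gensim's own default) are chosen for the example, not taken from datawords.

```python
def default_phrase_score(count_a: int, count_b: int, count_ab: int,
                         vocab_size: int, min_count: float) -> float:
    """Score a candidate bigram "a b" with Gensim's default scoring formula.

    score = (count(a, b) - min_count) * vocab_size / (count(a) * count(b))
    """
    denom = count_a * count_b
    if denom == 0:
        # A word that was never seen cannot form a phrase.
        return float("-inf")
    return (count_ab - min_count) * vocab_size / denom


def is_phrase(count_a: int, count_b: int, count_ab: int, vocab_size: int,
              min_count: float = 1.0, threshold: float = 10.0) -> bool:
    """Accept the bigram only if its score is greater than the threshold."""
    score = default_phrase_score(count_a, count_b, count_ab, vocab_size, min_count)
    return score > threshold


# A frequent pair clears the threshold, a rare pair does not.
frequent = is_phrase(count_a=50, count_b=40, count_ab=30, vocab_size=1000)
rare = is_phrase(count_a=50, count_b=40, count_ab=5, vocab_size=1000)
```

Raising threshold or min_count lowers the number of accepted bigrams, which matches the "higher means fewer phrases" note above.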

property model: FrozenPhrases
fit(X: Iterable)

This will train the phrase model. It needs an iterable.

Parameters:

X (Iterable) – An iterable which returns plain texts.

transform(X: Iterable) -> List[List[str]]

Transform a list of texts.

Parameters:

X (Iterable) – An iterable of plain texts.

Returns:

A list of phrases.

Return type:

List[List[str]]
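
The shape of the returned value can be illustrated with a toy stand-in. The sketch below is not the datawords or Gensim implementation: the helper names are hypothetical, whitespace splitting stands in for whatever tokenization parser_conf configures, and the set of accepted bigrams is supplied by hand instead of being learned by fit. It assumes the joined tokens use "_" as delimiter, which is Gensim's default.

```python
from typing import Iterable, List, Set, Tuple


def join_bigrams(tokens: List[str], phrases: Set[Tuple[str, str]],
                 delimiter: str = "_") -> List[str]:
    """Merge adjacent tokens that form a known bigram into a single token."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2  # both words were consumed by the phrase
        else:
            out.append(tokens[i])
            i += 1
    return out


def toy_transform(texts: Iterable[str],
                  phrases: Set[Tuple[str, str]]) -> List[List[str]]:
    """Return one list of tokens per input text."""
    return [join_bigrams(text.lower().split(), phrases) for text in texts]


result = toy_transform(["New York is big", "I love New York"],
                       {("new", "york")})
# result == [['new_york', 'is', 'big'], ['i', 'love', 'new_york']]
```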

parse(txt: str) -> List[str]

Parse a single text.

Parameters:

txt (str) – The text to parse.

Returns:

A list of words.

Return type:

List[str]

save(fp: str | PathLike)

Save phrase model to a folder.

Parameters:

fp (Union[str, os.PathLike]) – The path to the folder. Each model is stored in a folder. The path should be to that folder.

classmethod load(fp: str | PathLike) -> PhrasesModel

Load a PhrasesModel.

Parameters:

fp (Union[str, os.PathLike]) – The path to the model. Each model is stored in a folder. The path should be to that folder.

Returns:

PhrasesModel loaded.

Return type:

PhrasesModel

Word2VecHelper

A wrapper around Gensim Word2Vec

class datawords.models.Word2VecHelper(parser_conf: ParserConf, phrases_model=None, size: int = 100, window: int = 5, min_count: int = 1, workers: int = 1, epoch: int = 5, model: Word2Vec | None = None, using_kv=False, loaded_from=None)
__init__(parser_conf: ParserConf, phrases_model=None, size: int = 100, window: int = 5, min_count: int = 1, workers: int = 1, epoch: int = 5, model: Word2Vec | None = None, using_kv=False, loaded_from=None)
Parameters:
  • parser_conf (parsers.ParserConf) – configuration of the parser

  • min_count (Optional[float]) – Ignore all words and bigrams with total collected count lower than this value.

  • threshold – Represent a score threshold for forming the phrases (higher means fewer phrases). A phrase of words a followed by b is accepted if the score of the phrase is greater than threshold. Heavily depends on concrete scoring-function, see the scoring parameter.

  • max_vocab_size (Optional[int]) – Maximum size (number of tokens) of the vocabulary. Used to control pruning of less common words, to keep memory under control. The default of 40M needs about 3.6GB of RAM. Increase or decrease max_vocab_size depending on how much memory you have available.

  • connector_words (Frozenset[str]) – Set of words that may be included within a phrase, without affecting its scoring. If none is provided, it will use the lang value from the parser_conf. By default datawords includes CONNECTOR_WORDS for English, Portuguese and Spanish.

property vector_size: int
property wv: Word2Vec | KeyedVectors
fit(X: Iterable)

This will train the model. It needs an iterable.

Parameters:

X (Iterable) – An iterable which returns plain texts.
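
One way to supply such an iterable without loading the whole corpus into memory is a small re-iterable class. This is a sketch under assumptions: LineCorpus is a hypothetical helper, not part of datawords, and it assumes one document per line. A class with __iter__ is used instead of a plain generator because Gensim's Word2Vec reads the corpus more than once (vocabulary building plus each training epoch), and a generator would be exhausted after the first pass. Whether fit materializes its input internally is not stated here.

```python
from pathlib import Path
from typing import Iterator, Union


class LineCorpus:
    """Re-iterable stream of plain texts, one document per line (hypothetical helper)."""

    def __init__(self, path: Union[str, Path]) -> None:
        self.path = Path(path)

    def __iter__(self) -> Iterator[str]:
        # The file is reopened on every pass, so the corpus can be read repeatedly.
        with self.path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:  # skip blank lines
                    yield line
```

An instance such as LineCorpus("corpus.txt") could then be passed as X, assuming the file exists and holds one text per line.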

parse(sentence: str) -> List[str]

Parse a single text.

Parameters:

sentence (str) – The text to parse.

Returns:

A list of words.

Return type:

List[str]

encode(sentence: str) -> ndarray

Takes a sentence in plain text and encodes it as a vector.

vectorize(sentence: List[str]) -> ndarray

Get a vector from a list of words. If the sentence has words that are not found in the word2vec model, they are filled with zeros.
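
The zero-filling behaviour can be sketched as follows. This is an assumption-heavy illustration, not the datawords implementation: toy_vectorize is a hypothetical name, plain lists stand in for ndarray to keep the sketch dependency-free, and averaging the per-word vectors is only one common way to combine them. How the real method combines them is not stated here.

```python
from typing import Dict, List


def toy_vectorize(words: List[str], vectors: Dict[str, List[float]],
                  vector_size: int) -> List[float]:
    """Average per-word vectors, using a zero vector for unknown words."""
    zero = [0.0] * vector_size
    rows = [vectors.get(word, zero) for word in words]
    if not rows:
        # An empty sentence yields an all-zero vector in this sketch.
        return list(zero)
    # Column-wise mean over all rows (assumed aggregation).
    return [sum(column) / len(rows) for column in zip(*rows)]


toy_vectors = {"cat": [1.0, 3.0], "dog": [3.0, 1.0]}
known = toy_vectorize(["cat", "dog"], toy_vectors, vector_size=2)
mixed = toy_vectorize(["cat", "zzz"], toy_vectors, vector_size=2)
unknown = toy_vectorize(["zzz"], toy_vectors, vector_size=2)
```

In this sketch an out-of-vocabulary word pulls the average toward zero, and a sentence made only of unknown words maps to the zero vector.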

save(fp: str | PathLike)
classmethod load(fp: str | PathLike, keyed_vectors=False) -> Word2VecHelper