Parsers

The parsers module contains functions and classes for parsing text.

doc_parser

doc_parser is the most critical function in the package. It’s used internally by multiple classes.

This function takes a text and produces the tokens needed to feed different models.

datawords.parsers.doc_parser(txt: str, stop_words: FrozenSet[str], stemmer: Any | None = None, emo_codes=False, strip_accents=True, lower=True, numbers=False, parse_urls=False) → List[str]

Takes a text string and returns a list of words.

Note

If emo_codes or parse_urls is enabled, the order of the tokens changes: any emo codes or URLs found are moved to the end of the list.

This function is related to ParserConf.

Parameters:
  • txt (str) – text to be parsed.

  • stop_words (FrozenSet[str]) – a set of stop words. One can be obtained with load_stop2()

  • stemmer (Optional[Any]) – optional, a stemmer.

  • emo_codes (bool) – if True, emo codes are decoded into text.

  • strip_accents (bool) – replace accented letters with the same letter without the accent.

  • lower (bool) – convert the text to lowercase.

  • numbers (bool) – if True, keep numbers.

  • parse_urls (bool) – if True, keep URLs.

Returns:

a list of tokens

Return type:

List[str]
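The behaviour described above can be illustrated with a minimal, standard-library sketch. This is not the datawords implementation: simple_doc_parser, strip_accents_text and both regular expressions are hypothetical names introduced here, and the stemmer and emo_codes options are left out. It only shows how lowercasing, accent stripping, stop-word removal, number filtering and the "URLs go last" ordering could fit together.

```python
import re
import unicodedata
from typing import FrozenSet, List

# Hypothetical patterns, chosen for the sketch only.
URL_RE = re.compile(r"https?://\S+")
WORD_RE = re.compile(r"\w+")


def strip_accents_text(text: str) -> str:
    """Replace accented letters with their unaccented base letter."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def simple_doc_parser(
    txt: str,
    stop_words: FrozenSet[str],
    strip_accents: bool = True,
    lower: bool = True,
    numbers: bool = False,
    parse_urls: bool = False,
) -> List[str]:
    """Simplified stand-in for doc_parser (an assumption, not the real code)."""
    # Pull URLs out first so they are not split into word fragments.
    urls = URL_RE.findall(txt)
    txt = URL_RE.sub(" ", txt)
    if lower:
        txt = txt.lower()
    if strip_accents:
        txt = strip_accents_text(txt)
    tokens = []
    for tok in WORD_RE.findall(txt):
        if tok in stop_words:
            continue
        if tok.isdigit() and not numbers:
            continue
        tokens.append(tok)
    # Kept URLs are appended last, mirroring the ordering note above.
    if parse_urls:
        tokens.extend(urls)
    return tokens


tokens = simple_doc_parser(
    "The café at https://example.com opened in 2020",
    stop_words=frozenset({"the", "at", "in"}),
    parse_urls=True,
)
print(tokens)  # ['cafe', 'opened', 'https://example.com']
```

With numbers=True the token "2020" would be kept, in its original position before the URL.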

class datawords.parsers.ParserConf(lang: str = 'en', emo_codes: bool = False, strip_accents: bool = False, lower: bool = True, numbers: bool = True, parse_urls: bool = False, stopw_path: str | None = None, use_stemmer: bool = False, stemmer_class: str | None = None, phrases_model_path: str | None = None)

Related to doc_parser()

lang: str
emo_codes: bool
strip_accents: bool
lower: bool
numbers: bool
parse_urls: bool
stopw_path: str | None
use_stemmer: bool
stemmer_class: str | None
phrases_model_path: str | None
__init__(lang: str = 'en', emo_codes: bool = False, strip_accents: bool = False, lower: bool = True, numbers: bool = True, parse_urls: bool = False, stopw_path: str | None = None, use_stemmer: bool = False, stemmer_class: str | None = None, phrases_model_path: str | None = None) → None

Method generated by attrs for class ParserConf.
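As a rough illustration of how such a configuration object is typically used, the sketch below mirrors the documented fields and defaults with a standard-library dataclass. ParserConfSketch is a hypothetical stand-in so the example runs without datawords or attrs installed; it is not the real class. Note that the defaults listed for ParserConf (strip_accents=False, numbers=True) differ from those in the doc_parser signature (strip_accents=True, numbers=False).

```python
from dataclasses import asdict, dataclass, replace
from typing import Optional


@dataclass
class ParserConfSketch:
    """Hypothetical stand-in mirroring the documented ParserConf fields."""

    lang: str = "en"
    emo_codes: bool = False
    strip_accents: bool = False
    lower: bool = True
    numbers: bool = True
    parse_urls: bool = False
    stopw_path: Optional[str] = None
    use_stemmer: bool = False
    stemmer_class: Optional[str] = None
    phrases_model_path: Optional[str] = None


# Override only what differs from the defaults.
conf = ParserConfSketch(lang="es", strip_accents=True)

# A plain dict is convenient for storing the configuration, e.g. as JSON.
conf_dict = asdict(conf)

# Derive a variant without mutating the original object.
conf_urls = replace(conf, parse_urls=True)
```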

Sentences Parser

class datawords.parsers.ParserProto

Abstract class that every parser should conform to.

abstract parse(txt: str) → List[str]
class datawords.parsers.SentencesParser(lang='en', lower=True, phrases_model=None, stemmer: Any | None = None, emo_codes=False, strip_accents=False, numbers=True, stop_words=typing.FrozenSet[str], parse_urls=False)
__init__(lang='en', lower=True, phrases_model=None, stemmer: Any | None = None, emo_codes=False, strip_accents=False, numbers=True, stop_words=typing.FrozenSet[str], parse_urls=False)
parse(txt: str) → List[str]

Takes a txt, which can be a phrase, a document, or anything in between, and parses it using the phrase model or the parser.

export_conf() → ParserConf

Exports the parser configuration, omitting the stop words path and the stemmer path.

class datawords.parsers.SentencesIterator(data: Iterable, *, parser: ParserProto)
__init__(data: Iterable, *, parser: ParserProto)
parser_generator()
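The source does not describe what SentencesIterator yields, so the following is only a guess at the pattern: an abstract parser contract with a single parse method, and a re-iterable wrapper that parses each text lazily. ParserSketch, WhitespaceParser and SentencesIteratorSketch are hypothetical names; nothing here is taken from the datawords code.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, List


class ParserSketch(ABC):
    """Hypothetical equivalent of the ParserProto contract."""

    @abstractmethod
    def parse(self, txt: str) -> List[str]:
        ...


class WhitespaceParser(ParserSketch):
    """Toy parser: lowercase the text and split on whitespace."""

    def parse(self, txt: str) -> List[str]:
        return txt.lower().split()


class SentencesIteratorSketch:
    """Re-iterable wrapper that parses one text at a time (an assumption)."""

    def __init__(self, data: Iterable[str], *, parser: ParserSketch):
        self.data = data
        self.parser = parser

    def __iter__(self) -> Iterator[List[str]]:
        # A fresh generator per call, so the corpus can be traversed again
        # as long as the underlying data is itself re-iterable.
        for txt in self.data:
            yield self.parser.parse(txt)


corpus = ["Hello world", "Parsers yield tokens"]
sentences = SentencesIteratorSketch(corpus, parser=WhitespaceParser())
first_pass = list(sentences)
second_pass = list(sentences)
```

Implementing __iter__ rather than returning a one-shot generator matters for training code that makes more than one pass over the corpus.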

PhrasesModel

The PhrasesModel is used to build bigrams. It is a wrapper around Gensim Phrases.

class datawords.parsers.PhrasesModel(parser_conf: ParserConf, min_count: float = 1.0, threshold: float | None = None, max_vocab_size: int | None = None, connector_words=None, model=None, saved_path=None)
__init__(parser_conf: ParserConf, min_count: float = 1.0, threshold: float | None = None, max_vocab_size: int | None = None, connector_words=None, model=None, saved_path=None)
Parameters:
  • parser_conf (parsers.ParserConf) – configuration of the parser

  • min_count (float) – Ignore all words and bigrams with total collected count lower than this value.

  • threshold – Represent a score threshold for forming the phrases (higher means fewer phrases). A phrase of words a followed by b is accepted if the score of the phrase is greater than threshold. Heavily depends on concrete scoring-function, see the scoring parameter.

  • max_vocab_size (Optional[int]) – Maximum size (number of tokens) of the vocabulary. Used to control pruning of less common words, to keep memory under control. The default of 40M needs about 3.6GB of RAM. Increase/decrease max_vocab_size depending on how much available memory you have.

  • connector_words (FrozenSet[str]) – Set of words that may be included within a phrase, without affecting its scoring. If none is provided, the default set for the lang value in parser_conf is used. By default, datawords includes CONNECTOR_WORDS for English, Portuguese and Spanish.
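To make the roles of min_count and threshold more concrete, here is a stand-alone sketch of a count-based bigram score in the spirit of the default scoring commonly used for phrase detection. It does not call Gensim or datawords; score_bigrams and accepted_phrases are hypothetical helpers, and the exact normalisation inside Gensim may differ from this simplified formula.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]


def score_bigrams(
    sentences: Iterable[List[str]], min_count: float = 1.0
) -> Dict[Pair, float]:
    """Score each adjacent word pair with a simple count-based formula."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    scores: Dict[Pair, float] = {}
    for (a, b), count_ab in bigrams.items():
        # Pairs seen min_count times or fewer score zero or below.
        scores[(a, b)] = (
            (count_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        )
    return scores


def accepted_phrases(scores: Dict[Pair, float], threshold: float) -> List[Pair]:
    """Keep only the pairs whose score is greater than the threshold."""
    return sorted(pair for pair, s in scores.items() if s > threshold)


sentences = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "never", "sleeps"],
    ["a", "big", "city"],
]
scores = score_bigrams(sentences, min_count=1.0)
print(accepted_phrases(scores, threshold=1.0))   # [('new', 'york')]
print(accepted_phrases(scores, threshold=10.0))  # []
```

In this toy corpus only "new york" occurs more than once, so it is the only pair with a positive score; raising the threshold from 1.0 to 10.0 removes it as well, which is the "higher means fewer phrases" effect.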

property model: FrozenPhrases
fit(X: Iterable)

This will train the phrase model. It needs an iterable.

Parameters:

X (Iterable) – An iterable which returns plain texts.

parse(txt: str) → List[str]

Parses a single text.

Parameters:

txt (str) – text to be parsed.

Returns:

a list of words

Return type:

List[str]

export_conf() → PhrasesModelMeta
save(fp: str | PathLike)

Save phrase model to a folder.

Parameters:

fp (Union[str, os.PathLike]) – The path to the folder. Each model is stored in a folder. The path should be to that folder.

classmethod load(fp: str | PathLike) → PhrasesModel

Loads the PhrasesModel.

Parameters:

fp (Union[str, os.PathLike]) – The path to the model. Each model is stored in a folder, and the path should point to that folder.

Returns:

PhrasesModel loaded.

Return type:

PhrasesModel

Other functions

datawords.parsers.parser_from_conf(conf: ParserConf, *, stopw: StopWords | None = None, phrases: PhrasesModel | None = None)

Builds a SentencesParser from the ParserConf configuration. Already initialized stop words and phrases objects can also be passed in, to avoid creating multiple instances in the same process.

datawords.parsers.load_stop(lang='en', *, models_path=None, strip_accents=True) → StopWords

Open a list of stop words.

If models_path is omitted, it looks inside the datawords package itself. Currently, it supports en, pt and es.

datawords.parsers.generate_ngrams(tokens: List[str], n: int = 2, sep=' ') → List[str]

Generate n-grams from a list of tokens.

Parameters:
  • tokens (List[str]) – a list of words already parsed.

  • n (int) – size of each n-gram. 2 by default.

  • sep (str) – string to use as separator between words.

Returns:

a list of ngrams

Return type:

List[str]
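A minimal sketch of the sliding-window technique behind n-gram generation, assuming that n is the number of tokens per n-gram and that the tokens are joined with sep. ngrams_sketch is a hypothetical name; the actual generate_ngrams may handle edge cases differently.

```python
from typing import List


def ngrams_sketch(tokens: List[str], n: int = 2, sep: str = " ") -> List[str]:
    """Join every run of n consecutive tokens with sep."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # range() is empty when there are fewer than n tokens, so the result is [].
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words = ["natural", "language", "processing"]
bigrams = ngrams_sketch(words)
trigrams = ngrams_sketch(words, n=3, sep="_")
print(bigrams)   # ['natural language', 'language processing']
print(trigrams)  # ['natural_language_processing']
```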

datawords.parsers.norm_token(tkn: str) → str

An opinionated token normalizer. It lowercases the string, strips any accents, and keeps only the letters and numbers of the token.
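The three steps described above could be sketched with the standard library as follows. norm_token_sketch is a hypothetical re-creation based only on this description, so details such as which characters count as letters or numbers may differ from the real norm_token.

```python
import unicodedata


def norm_token_sketch(tkn: str) -> str:
    """Lowercase, strip accents, then keep only letters and digits."""
    lowered = tkn.lower()
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", lowered)
    no_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Drop punctuation, symbols and whitespace.
    return "".join(ch for ch in no_accents if ch.isalnum())


print(norm_token_sketch("Canción!"))  # cancion
print(norm_token_sketch("Café-123"))  # cafe123
```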