Parsers

The parsers module mostly contains functions and classes for parsing text.

doc_parser

doc_parser is the most critical function in the package. It’s used internally by multiple classes.

This function takes a text and produces the tokens needed to feed different models.

datawords.parsers.doc_parser(txt: str, stop_words: FrozenSet[str], stemmer: Any | None = None, emo_codes=False, strip_accents=True, lower=True, numbers=False, parse_urls=False) → List[str]

Takes a text string and returns a list of words.

Note

emo_codes and parse_urls alter the order of the tokens: any emo codes or URLs found are placed at the end of the list.

This function is related to ParserConf.

Parameters:

txt (str) – text to be parsed.
stop_words (FrozenSet[str]) – a set of stop words. It’s possible to get one using load_stop2()
stemmer (Optional[Any]) – optional, a stemmer.
emo_codes (bool) – if True, emo codes are decoded into text.
strip_accents (bool) – replace accented letters with their unaccented equivalents.
lower (bool) – convert text to lowercase.
numbers (bool) – if True, numbers are kept.
parse_urls (bool) – if True, URLs are kept.

Returns:

a list of tokens

Return type:

List[str]
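
The effect of the lower, strip_accents, numbers and stop_words parameters can be illustrated with a small standard-library sketch. The function toy_doc_parser below is hypothetical and is not the datawords implementation: it ignores stemmer, emo_codes and parse_urls, and its tokenization rule (a simple regular expression) is an assumption.

```python
import re
import unicodedata
from typing import FrozenSet, List


def toy_doc_parser(
    txt: str,
    stop_words: FrozenSet[str],
    strip_accents: bool = True,
    lower: bool = True,
    numbers: bool = False,
) -> List[str]:
    """Hypothetical stand-in for doc_parser; not the datawords implementation."""
    if lower:
        txt = txt.lower()
    if strip_accents:
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", txt)
        txt = "".join(c for c in decomposed if not unicodedata.combining(c))
    # \w+ matches runs of letters, digits and underscores (an assumed rule).
    tokens = re.findall(r"\w+", txt)
    if not numbers:
        # Drop tokens made only of digits.
        tokens = [t for t in tokens if not t.isdigit()]
    # Stop words are removed last, after normalization.
    return [t for t in tokens if t not in stop_words]


stop_words = frozenset({"the", "a", "in", "is"})
tokens = toy_doc_parser("The café in São Paulo is open 24 hours", stop_words)
print(tokens)  # ['cafe', 'sao', 'paulo', 'open', 'hours']
```

Passing numbers=True in this sketch keeps the token '24' in its original position.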

class datawords.parsers.ParserConf(lang: str = 'en', emo_codes: bool = False, strip_accents: bool = False, lower: bool = True, numbers: bool = True, parse_urls: bool = False, stopw_path: str | None = None, use_stemmer: bool = False, stemmer_class: str | None = None, phrases_model_path: str | None = None)

Related to doc_parser()

lang: str

emo_codes: bool

strip_accents: bool

lower: bool

numbers: bool

parse_urls: bool

stopw_path: str | None

use_stemmer: bool

stemmer_class: str | None

phrases_model_path: str | None

__init__(lang: str = 'en', emo_codes: bool = False, strip_accents: bool = False, lower: bool = True, numbers: bool = True, parse_urls: bool = False, stopw_path: str | None = None, use_stemmer: bool = False, stemmer_class: str | None = None, phrases_model_path: str | None = None) → None: Method generated by attrs for class ParserConf.

Sentences Parser

class datawords.parsers.ParserProto

Abstract class that every parser should conform to.

abstract parse(txt: str) → List[str]
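
As a rough illustration of what conforming to this interface means, the sketch below defines a local stand-in with the same parse signature and one concrete parser. ParserLike, WhitespaceParser and iter_parsed are hypothetical names used only for this example; they are not part of datawords.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, List


class ParserLike(ABC):
    """Local stand-in mirroring the documented parse(txt) -> List[str] signature."""

    @abstractmethod
    def parse(self, txt: str) -> List[str]:
        ...


class WhitespaceParser(ParserLike):
    """Hypothetical parser: lowercases the text and splits on whitespace."""

    def parse(self, txt: str) -> List[str]:
        return txt.lower().split()


def iter_parsed(data: Iterable[str], parser: ParserLike) -> Iterator[List[str]]:
    # Parse lazily, one text at a time, so the corpus need not fit in memory.
    for txt in data:
        yield parser.parse(txt)


parsed = list(iter_parsed(["Hello World", "Foo bar"], WhitespaceParser()))
print(parsed)  # [['hello', 'world'], ['foo', 'bar']]
```

Any code that depends only on the parse method, like iter_parsed here, can work with any parser that follows the same contract.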

class datawords.parsers.SentencesParser(lang='en', lower=True, phrases_model=None, stemmer: Any | None = None, emo_codes=False, strip_accents=False, numbers=True, stop_words=typing.FrozenSet[str], parse_urls=False)

__init__(lang='en', lower=True, phrases_model=None, stemmer: Any | None = None, emo_codes=False, strip_accents=False, numbers=True, stop_words=typing.FrozenSet[str], parse_urls=False)

parse(txt: str) → List[str]: Takes a text, which can be a phrase, a document, or anything in between, and parses it using either the phrases model or the parser.

export_conf() → ParserConf: It exports the parser configuration but omits stopw path and stemmer path.

class datawords.parsers.SentencesIterator(data: Iterable, *, parser: ParserProto)

__init__(data: Iterable, *, parser: ParserProto)

parser_generator()

PhrasesModel

The PhrasesModel is used to build bigrams. In this case, it’s a wrapper around Gensim Phrases.

class datawords.parsers.PhrasesModel(parser_conf: ParserConf, min_count: float = 1.0, threshold: float | None = None, max_vocab_size: int | None = None, connector_words=None, model=None, saved_path=None)

__init__(parser_conf: ParserConf, min_count: float = 1.0, threshold: float | None = None, max_vocab_size: int | None = None, connector_words=None, model=None, saved_path=None)

Parameters:

parser_conf (parsers.ParserConf) – configuration of the parser
min_count (float) – Ignore all words and bigrams with total collected count lower than this value.
threshold (Optional[float]) – Score threshold for forming phrases (higher means fewer phrases). A phrase of words a followed by b is accepted if the score of the phrase is greater than threshold. Heavily depends on the concrete scoring function, see the scoring parameter of Gensim Phrases.
max_vocab_size (Optional[int]) – Maximum size (number of tokens) of the vocabulary. Used to control pruning of less common words, to keep memory under control. Gensim's default of 40M needs about 3.6GB of RAM. Increase or decrease max_vocab_size depending on how much memory you have available.
connector_words (FrozenSet[str]) – Set of words that may be included within a phrase, without affecting its scoring. If none is provided, it will use the lang value from the parser_conf. By default datawords includes CONNECTOR_WORDS for English, Portuguese and Spanish.
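
To make the roles of min_count and threshold concrete, here is a minimal standard-library sketch of bigram scoring in the style of the default scorer described in the Gensim Phrases documentation: (count(a, b) - min_count) * vocab_size / (count(a) * count(b)). It is an assumption that PhrasesModel relies on that default scorer. The function bigram_scores is hypothetical, and it simplifies the vocabulary size to the number of distinct words.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def bigram_scores(
    sentences: Iterable[List[str]], min_count: float = 1.0
) -> Dict[Tuple[str, str], float]:
    """Score each adjacent word pair; higher means a stronger phrase candidate."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for sent in sentences:
        unigrams.update(sent)
        # zip(sent, sent[1:]) yields every pair of adjacent tokens.
        bigrams.update(zip(sent, sent[1:]))
    # Simplification: vocabulary size counts distinct words only.
    vocab_size = len(unigrams)
    return {
        (a, b): (n - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
    }


corpus = [
    ["new", "york", "is", "big"],
    ["new", "york", "is", "busy"],
    ["the", "city", "is", "big"],
]
scores = bigram_scores(corpus)
threshold = 1.5
# Keep only the pairs whose score is strictly greater than the threshold.
phrases = sorted(pair for pair, s in scores.items() if s > threshold)
print(phrases)  # [('new', 'york')]
```

In this toy corpus, pairs seen only once score 0 because of min_count, and raising threshold leaves fewer accepted phrases.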

property model: FrozenPhrases

fit(X: Iterable)

This will train the phrase model. It needs an iterable.

Parameters:

X (Iterable) – An iterable which returns plain texts.

parse(txt: str) → List[str]

Parse a single text.

Parameters:

txt (str) – the text to parse.

Returns:

a list of words

Return type:

List[str]

export_conf() → PhrasesModelMeta

save(fp: str | PathLike)

Save phrase model to a folder.

Parameters:

fp (Union[str, os.PathLike]) – The path to the folder. Each model is stored in a folder. The path should be to that folder.

classmethod load(fp: str | PathLike) → PhrasesModel

Loads a saved PhrasesModel.

Parameters:

fp (Union[str, os.PathLike]) – The path to the model. Each model is stored in a folder. The path should be to that folder.

Returns:

PhrasesModel loaded.

Return type:

PhrasesModel

Other functions

datawords.parsers.parser_from_conf(conf: ParserConf, *, stopw: StopWords | None = None, phrases: PhrasesModel | None = None): Builds a SentencesParser from a ParserConf configuration. Already initialized stop words and phrases objects can also be passed in, to avoid creating multiple instances in the same process.

datawords.parsers.load_stop(lang='en', *, models_path=None, strip_accents=True) → StopWords

Open a list of stop words.

If models_path is omitted, it looks inside the datawords package itself. Currently, it supports en, pt and es.

datawords.parsers.generate_ngrams(tokens: List[str], n: int = 2, sep=' ') → List[str]

Generate n-grams from a list of tokens.

Parameters:

tokens (List[str]) – a list of words already parsed.
n (int) – size of each n-gram, in tokens. 2 by default.
sep (str) – String to use as separator between words.

Returns:

a list of ngrams

Return type:

List[str]
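
The sliding-window idea behind n-gram generation can be sketched as follows. sketch_ngrams is a hypothetical re-implementation written for illustration; how the library handles edge cases, such as fewer than n tokens, is not stated here, and the empty-list result below is an assumption.

```python
from typing import List


def sketch_ngrams(tokens: List[str], n: int = 2, sep: str = " ") -> List[str]:
    """Hypothetical sketch: join each window of n consecutive tokens with sep."""
    # range() is empty when there are fewer than n tokens, giving [].
    return [sep.join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


words = ["natural", "language", "processing"]
bigrams = sketch_ngrams(words)
trigrams = sketch_ngrams(words, n=3, sep="_")
print(bigrams)   # ['natural language', 'language processing']
print(trigrams)  # ['natural_language_processing']
```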

datawords.parsers.norm_token(tkn: str) → str: An opinionated token normalizer. It lowercases the string, strips accents, and keeps only the letters and numbers of the token.
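
The three steps described above (lowercase, strip accents, keep letters and numbers) could look roughly like this using only the standard library. sketch_norm_token is a hypothetical function, and the exact rules used by norm_token may differ.

```python
import unicodedata


def sketch_norm_token(tkn: str) -> str:
    """Hypothetical sketch of the described steps; not the library code."""
    lowered = tkn.lower()
    # NFKD splits an accented letter into base letter plus combining mark.
    decomposed = unicodedata.normalize("NFKD", lowered)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Keep only alphanumeric characters.
    return "".join(c for c in no_accents if c.isalnum())


print(sketch_norm_token("Canción!"))  # cancion
print(sketch_norm_token("#Python3"))  # python3
```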