Parsers

The parsers module contains functions and classes for parsing text.

doc_parser

doc_parser is the most critical function in the package. It’s used internally by multiple classes.

This function takes a text and produces the tokens needed to feed different models.

datawords.parsers.doc_parser(txt: str, stop_words: Set[str], stemmer: Any | None = None, emo_codes=False, strip_accents=True, lower=True, numbers=False, parse_urls=False) → List[str]

Take a text string and return a list of words.

Note

emo_codes and parse_urls alter the order of the tokens: when emo codes or URLs are found in the text, they are placed at the end of the list.

This function is related to ParserConf.

Parameters:

txt (str) – text to be parsed.
stop_words (Set[str]) – a set of stop words. It’s possible to get one using load_stop2().
stemmer (Optional[Any]) – an optional stemmer.
emo_codes (bool) – if True, emo_codes will be decoded into text.
strip_accents (bool) – if True, replace accented letters with the same letter without the accent.
lower (bool) – if True, convert the text to lowercase.
numbers (bool) – if True, keep numbers.
parse_urls (bool) – if True, keep URLs.

Returns:

a list of tokens

Return type:

List[str]
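The reordering described in the note can be illustrated with a toy stand-in. This is a minimal sketch, not the code of doc_parser: the name toy_parser, the URL pattern and the word-splitting rule are all assumptions made for this example, and only the "URLs go last" behaviour is modelled.

```python
import re
from typing import List, Set

# Simplistic URL pattern, an assumption made for this sketch only.
URL_RE = re.compile(r"https?://\S+")


def toy_parser(txt: str, stop_words: Set[str], parse_urls: bool = False) -> List[str]:
    """Hypothetical stand-in for doc_parser; only models the URL reordering."""
    # Collect URLs first, keeping them exactly as written.
    urls = URL_RE.findall(txt)
    # Remove URLs from the running text before splitting it into words.
    plain = URL_RE.sub(" ", txt).lower()
    words = [w for w in re.findall(r"[a-z]+", plain) if w not in stop_words]
    # URLs found in the text are appended after the regular tokens.
    return words + urls if parse_urls else words


tokens = toy_parser(
    "See https://example.com for the docs", {"for", "the"}, parse_urls=True
)
print(tokens)
```

Although the URL appears in the middle of the sentence, it ends up as the last token, so code that relies on token positions should not assume the original word order when parse_urls (or emo_codes) is enabled.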

class datawords.parsers.ParserConf(lang: str = 'en', emo_codes: bool = False, strip_accents: bool = False, lower: bool = True, numbers: bool = True, parse_urls: bool = False, stopw_path: str | None = None, use_stemmer: bool = False, phrases_model_path: str | None = None)

Related to doc_parser()

lang: str

emo_codes: bool

strip_accents: bool

lower: bool

numbers: bool

parse_urls: bool

stopw_path: str | None

use_stemmer: bool

phrases_model_path: str | None

__init__(lang: str = 'en', emo_codes: bool = False, strip_accents: bool = False, lower: bool = True, numbers: bool = True, parse_urls: bool = False, stopw_path: str | None = None, use_stemmer: bool = False, phrases_model_path: str | None = None) → None: Method generated by attrs for class ParserConf.

Sentences Parser

class datawords.parsers.ParserProto

Abstract class that every parser should conform to.

abstract parse(txt: str) → List[str]

class datawords.parsers.SentencesParser(lang='en', lower=True, phrases_model=None, stemmer: Any | None = None, emo_codes=False, strip_accents=False, numbers=True, stop_words=typing.Set[str], parse_urls=False)

__init__(lang='en', lower=True, phrases_model=None, stemmer: Any | None = None, emo_codes=False, strip_accents=False, numbers=True, stop_words=typing.Set[str], parse_urls=False)

parse(txt: str) → List[str]

export_conf() → ParserConf: Exports the parser configuration, omitting the stop words path and the stemmer path.

class datawords.parsers.SentencesIterator(data: Iterable, *, parser: ParserProto)

__init__(data: Iterable, *, parser: ParserProto)

parser_generator()
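The relationship between an abstract parser and an iterator that applies it can be sketched as follows. The class names ParserSketch, WhitespaceParser and SentencesIteratorSketch are hypothetical and exist only in this example. The sketch also assumes that the iterator yields one token list per item of data, which this page does not state explicitly.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, List


class ParserSketch(ABC):
    """Hypothetical mirror of the ParserProto interface."""

    @abstractmethod
    def parse(self, txt: str) -> List[str]:
        ...


class WhitespaceParser(ParserSketch):
    """Simplest possible parser: lowercase and split on whitespace."""

    def parse(self, txt: str) -> List[str]:
        return txt.lower().split()


class SentencesIteratorSketch:
    """Assumed behaviour: yield one token list per item in data."""

    def __init__(self, data: Iterable[str], *, parser: ParserSketch):
        self.data = data
        self.parser = parser

    def __iter__(self) -> Iterator[List[str]]:
        # Parse lazily so large corpora are never held in memory at once.
        for txt in self.data:
            yield self.parser.parse(txt)


parsed = list(
    SentencesIteratorSketch(["Hello World", "Foo bar"], parser=WhitespaceParser())
)
print(parsed)
```

Because the iterator depends only on the parse method, any object that satisfies the abstract interface can be swapped in without changing the iteration code.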

Other functions

datawords.parsers.load_stop2(lang='en', *, models_path=None, strip_accents=True) → Set[str]

Open a list of stop words.

If models_path is omitted, the function looks for the word list bundled with the package. Currently it supports en, pt and es.

datawords.parsers.generate_ngrams(tokens: List[str], n: int = 2, sep=' ') → List[str]

Generate n-grams from a list of tokens.

Parameters:

tokens (List[str]) – a list of words already parsed.
n (int) – number of tokens in each n-gram. 2 by default.
sep (str) – string to use as separator between words.

Returns:

a list of ngrams

Return type:

List[str]
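To make the meaning of n and sep concrete, here is a minimal sketch of n-gram generation over a token list. It is written for this page and is not the library's implementation: the name ngrams_sketch is hypothetical, and the real generate_ngrams may behave differently in edge cases such as inputs with fewer than n tokens.

```python
from typing import List


def ngrams_sketch(tokens: List[str], n: int = 2, sep: str = " ") -> List[str]:
    """Join every run of n consecutive tokens with sep (hypothetical helper)."""
    # Sliding window of width n; the range is empty when len(tokens) < n.
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = ngrams_sketch(["natural", "language", "processing"])
trigrams = ngrams_sketch(["a", "b", "c", "d"], n=3, sep="_")
print(bigrams)
print(trigrams)
```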

datawords.parsers.norm_token(tkn: str) → str: An opinionated token normalizer. It lowercases the string, strips any accents and keeps only the letters and numbers of the token.
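The three steps named above (lowercase, strip accents, keep letters and numbers) can be sketched with the standard library. This is an illustration under assumptions, not the code of norm_token: the name norm_token_sketch is hypothetical, and restricting the result to ASCII letters and digits is a choice made for this example.

```python
import re
import unicodedata


def norm_token_sketch(tkn: str) -> str:
    """Lowercase, strip accents, keep only letters and digits (hypothetical)."""
    lowered = tkn.lower()
    # NFKD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", lowered)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop everything that is not an ASCII letter or digit (assumption).
    return re.sub(r"[^a-z0-9]", "", no_accents)


normalized = norm_token_sketch("Canción!")
print(normalized)
```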

datawords.parsers.apply_regex(reg_expr, word) → str