Parsers

The parsers module contains functions and classes for parsing text.

doc_parser

doc_parser is the most critical function in the package. It’s used internally by multiple classes.

This function takes a text and produces the tokens needed to feed different models.

datawords.parsers.doc_parser(txt: str, stop_words: Set[str], stemmer: Any | None = None, emo_codes=False, strip_accents=True, lower=True, numbers=False, parse_urls=False) → List[str]

Takes a text string and returns a list of words.

Note

emo_codes and parse_urls alter the order of the tokens. When emo codes or URLs are found in the text, they are placed at the end of the returned list.

This function is related to ParserConf.
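The reordering described in the note can be illustrated with a minimal sketch. It is not the library's implementation: `tokens_with_urls_last` and `URL_RE` are hypothetical names, and the URL pattern is a simplifying assumption. It only shows why a token that appears mid-sentence can end up last in the output.

```python
import re
from typing import List

# Hypothetical, deliberately simple URL pattern (an assumption for this sketch).
URL_RE = re.compile(r"https?://\S+")


def tokens_with_urls_last(txt: str) -> List[str]:
    """Split on whitespace, moving any URLs to the end of the token list."""
    urls = URL_RE.findall(txt)          # collect URLs in order of appearance
    remaining = URL_RE.sub(" ", txt)    # remove them from the running text
    words = remaining.lower().split()   # tokenize what is left
    return words + urls                 # URLs are appended after the words


print(tokens_with_urls_last("Read https://example.com for details"))
# ['read', 'for', 'details', 'https://example.com']
```

Code that relies on token positions (for example, n-gram generation) should take this reordering into account.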

Parameters:
  • txt (str) – text to be parsed.

  • stop_words (Set[str]) – a set of stop words. One can be loaded with load_stop2().

  • stemmer (Optional[Any]) – optional, a stemmer.

  • emo_codes (bool) – if True, emo codes are decoded into text.

  • strip_accents (bool) – if True, accented letters are replaced with the same letter without the accent.

  • lower (bool) – if True, the text is converted to lowercase.

  • numbers (bool) – if True, numbers are kept.

  • parse_urls (bool) – if True, URLs are kept.

Returns:

a list of tokens

Return type:

List[str]
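To make the effect of the flags concrete, here is a minimal sketch of a tokenizer that honours stop_words, strip_accents, lower and numbers. It is an illustration under stated assumptions, not the code of doc_parser: `sketch_parse` is a hypothetical name, tokens are found with a plain `\w+` regex, and stemming, emo codes and URLs are left out.

```python
import re
import unicodedata
from typing import List, Set


def sketch_parse(
    txt: str,
    stop_words: Set[str],
    strip_accents: bool = True,
    lower: bool = True,
    numbers: bool = False,
) -> List[str]:
    """Hypothetical stand-in that mimics a few doc_parser flags."""
    if lower:
        txt = txt.lower()
    if strip_accents:
        # Decompose accented letters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", txt)
        txt = "".join(c for c in decomposed if not unicodedata.combining(c))
    tokens = re.findall(r"\w+", txt)
    if not numbers:
        # Drop tokens made only of digits.
        tokens = [t for t in tokens if not t.isdigit()]
    # Remove stop words last, after normalization.
    return [t for t in tokens if t not in stop_words]


print(sketch_parse("The café opened in 2019", stop_words={"the", "in"}))
# ['cafe', 'opened']
print(sketch_parse("The café opened in 2019", stop_words={"the", "in"}, numbers=True))
# ['cafe', 'opened', '2019']
```

In this sketch the stop words are compared after lowercasing and accent stripping, so the stop-word set should be normalized the same way.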

class datawords.parsers.ParserConf(lang: str = 'en', emo_codes: bool = False, strip_accents: bool = False, lower: bool = True, numbers: bool = True, parse_urls: bool = False, stopw_path: str | None = None, use_stemmer: bool = False, phrases_model_path: str | None = None)

Related to doc_parser()

lang: str
emo_codes: bool
strip_accents: bool
lower: bool
numbers: bool
parse_urls: bool
stopw_path: str | None
use_stemmer: bool
phrases_model_path: str | None
__init__(lang: str = 'en', emo_codes: bool = False, strip_accents: bool = False, lower: bool = True, numbers: bool = True, parse_urls: bool = False, stopw_path: str | None = None, use_stemmer: bool = False, phrases_model_path: str | None = None) → None

Method generated by attrs for class ParserConf.
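The relation between ParserConf and doc_parser() can be pictured as a configuration object whose boolean fields line up with the function's keyword arguments. The sketch below uses a standard-library dataclass as a stand-in for the attrs class. `ConfSketch`, `DOC_PARSER_FLAGS` and `parser_kwargs` are hypothetical names, and the mapping itself is an assumption based only on the matching parameter names in the two signatures.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass
class ConfSketch:
    """Stand-in mirroring the ParserConf fields and defaults listed above."""
    lang: str = "en"
    emo_codes: bool = False
    strip_accents: bool = False
    lower: bool = True
    numbers: bool = True
    parse_urls: bool = False
    stopw_path: Optional[str] = None
    use_stemmer: bool = False
    phrases_model_path: Optional[str] = None


# Field names that also appear as keyword arguments of doc_parser.
DOC_PARSER_FLAGS = ("emo_codes", "strip_accents", "lower", "numbers", "parse_urls")


def parser_kwargs(conf: ConfSketch) -> Dict[str, bool]:
    """Pick the fields that could be forwarded to a doc_parser-like call."""
    values = asdict(conf)
    return {name: values[name] for name in DOC_PARSER_FLAGS}


print(parser_kwargs(ConfSketch(parse_urls=True)))
```

Note that the defaults in the two signatures differ: ParserConf defaults to strip_accents=False and numbers=True, while doc_parser defaults to strip_accents=True and numbers=False.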

Sentences Parser

class datawords.parsers.ParserProto

Abstract class that any parser should conform to.

abstract parse(txt: str) → List[str]
class datawords.parsers.SentencesParser(lang='en', lower=True, phrases_model=None, stemmer: Any | None = None, emo_codes=False, strip_accents=False, numbers=True, stop_words=typing.Set[str], parse_urls=False)
__init__(lang='en', lower=True, phrases_model=None, stemmer: Any | None = None, emo_codes=False, strip_accents=False, numbers=True, stop_words=typing.Set[str], parse_urls=False)
parse(txt: str) → List[str]
export_conf() → ParserConf

Exports the parser configuration, omitting the stop words path and the stemmer path.

class datawords.parsers.SentencesIterator(data: Iterable, *, parser: ParserProto)
__init__(data: Iterable, *, parser: ParserProto)
parser_generator()
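The pattern behind these classes (an abstract parser interface plus an iterator that applies a parser to each item) can be sketched as follows. This is an illustration only: `ParserSketch`, `WhitespaceParser` and `IteratorSketch` are hypothetical names, and the assumption that the iterator yields one token list per input text is not confirmed by the signatures above.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, List


class ParserSketch(ABC):
    """Stand-in for an abstract parser with a single parse() method."""

    @abstractmethod
    def parse(self, txt: str) -> List[str]:
        ...


class WhitespaceParser(ParserSketch):
    """Trivial concrete parser: lowercase, then split on whitespace."""

    def parse(self, txt: str) -> List[str]:
        return txt.lower().split()


class IteratorSketch:
    """Assumed behaviour: lazily yield one token list per input text."""

    def __init__(self, data: Iterable[str], *, parser: ParserSketch):
        self._data = data
        self._parser = parser

    def __iter__(self) -> Iterator[List[str]]:
        for txt in self._data:
            yield self._parser.parse(txt)


sentences = list(IteratorSketch(["Hello World", "Second line"], parser=WhitespaceParser()))
print(sentences)  # [['hello', 'world'], ['second', 'line']]
```

Because the iterator parses lazily, a large corpus can be streamed to a model without holding every token list in memory, provided the underlying data can be iterated more than once when the consumer needs several passes.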

Other functions

datawords.parsers.load_stop2(lang='en', *, models_path=None, strip_accents=True) → Set[str]

Open a list of stop words.

If models_path is omitted, the function looks for the list of words bundled with the package. Currently, it supports en, pt and es.

datawords.parsers.generate_ngrams(tokens: List[str], n: int = 2, sep=' ') → List[str]

Generate n-grams from a list of tokens.

Parameters:
  • tokens (List[str]) – a list of words already parsed.

  • n (int) – the size of each n-gram, in tokens. 2 by default.

  • sep (str) – string used as separator between the words of an n-gram.

Returns:

a list of ngrams

Return type:

List[str]
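As an illustration of the technique, a sliding-window n-gram generator can be written in a few lines. `ngrams_sketch` is a hypothetical name, and the behaviour of the real generate_ngrams on edge cases (for instance, fewer than n tokens) may differ from this sketch.

```python
from typing import List


def ngrams_sketch(tokens: List[str], n: int = 2, sep: str = " ") -> List[str]:
    """Join every run of n consecutive tokens with sep."""
    # The window starts at each index that still leaves n tokens available.
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams_sketch(["new", "york", "city"]))
# ['new york', 'york city']
print(ngrams_sketch(["new", "york", "city"], n=3, sep="_"))
# ['new_york_city']
```

In this sketch, an input shorter than n yields an empty list.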

datawords.parsers.norm_token(tkn: str) → str

An opinionated token normalizer. It lowercases the string, strips any accents and keeps only the letters and numbers of the token.
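The three steps described above can be sketched with the standard library. `norm_sketch` is a hypothetical name and this is not the code of norm_token. In particular, the use of Unicode NFKD decomposition and `str.isalnum` is an assumption about how "strip accents" and "keep letters and numbers" might be done.

```python
import unicodedata


def norm_sketch(tkn: str) -> str:
    """Lowercase, strip accents, and keep only alphanumeric characters."""
    lowered = tkn.lower()
    # Decompose accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", lowered)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Drop punctuation, symbols and whitespace.
    return "".join(c for c in no_accents if c.isalnum())


print(norm_sketch("Canción!"))  # cancion
print(norm_sketch("Wi-Fi 6"))   # wifi6
```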

datawords.parsers.apply_regex(reg_expr, word) → str