Parsers
The parsers module mostly contains functions and classes for parsing text.
doc_parser
doc_parser is the most critical function in the package. It is used internally by multiple classes.
This function takes a text and produces the tokens needed to feed different models.
- datawords.parsers.doc_parser(txt: str, stop_words: Set[str], stemmer: Any | None = None, emo_codes=False, strip_accents=True, lower=True, numbers=False, parse_urls=False) List[str]
Takes a text string and returns a list of words.
Note
emo_codes and parse_urls alter the order of the tokens: any emo codes or URLs found are placed at the end of the list.
This function is related to ParserConf.
- Parameters:
txt (str) – text to be parsed.
stop_words (Set[str]) – a set of stop words. One can be obtained with load_stop2().
stemmer (Optional[Any]) – optional, a stemmer.
emo_codes (bool) – if True, emo codes are decoded into text.
strip_accents (bool) – replace accented letters with the same letters without accents.
lower (bool) – convert the text to lowercase.
numbers (bool) – if True, keep numbers.
parse_urls (bool) – if True, keep URLs.
- Returns:
a list of tokens
- Return type:
List[str]
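The parameters above describe a fairly standard normalization pipeline. The sketch below is a standalone illustration of that idea using only the standard library. It is not the datawords implementation: the function name toy_doc_parser is hypothetical, the tokenization rule (runs of word characters) is an assumption, and the stemmer, emo_codes, and parse_urls options are left out because their exact behavior is not described here.

```python
import re
import unicodedata
from typing import List, Set


def toy_doc_parser(
    txt: str,
    stop_words: Set[str],
    strip_accents: bool = True,
    lower: bool = True,
    numbers: bool = False,
) -> List[str]:
    """Illustrative stand-in, NOT the datawords implementation."""
    if lower:
        txt = txt.lower()
    if strip_accents:
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", txt)
        txt = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Assumption: a token is any run of word characters.
    tokens = re.findall(r"\w+", txt)
    if not numbers:
        # Drop tokens made only of digits.
        tokens = [t for t in tokens if not t.isdigit()]
    # Remove stop words last, after normalization.
    return [t for t in tokens if t not in stop_words]


tokens = toy_doc_parser("El café costó 3 euros", stop_words={"el"})
print(tokens)  # ['cafe', 'costo', 'euros']
```

Note that stop words are compared after lowercasing and accent stripping in this sketch, so the stop-word set should be normalized the same way.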
- class datawords.parsers.ParserConf(lang: str = 'en', emo_codes: bool = False, strip_accents: bool = False, lower: bool = True, numbers: bool = True, parse_urls: bool = False, stopw_path: str | None = None, use_stemmer: bool = False, phrases_model_path: str | None = None)
Related to doc_parser().
- lang: str
- emo_codes: bool
- strip_accents: bool
- lower: bool
- numbers: bool
- parse_urls: bool
- stopw_path: str | None
- use_stemmer: bool
- phrases_model_path: str | None
- __init__(lang: str = 'en', emo_codes: bool = False, strip_accents: bool = False, lower: bool = True, numbers: bool = True, parse_urls: bool = False, stopw_path: str | None = None, use_stemmer: bool = False, phrases_model_path: str | None = None) None
Method generated by attrs for class ParserConf.
Sentences Parser
- class datawords.parsers.ParserProto
Abstract class that any parser should conform to.
- abstract parse(txt: str) List[str]
- class datawords.parsers.SentencesParser(lang='en', lower=True, phrases_model=None, stemmer: Any | None = None, emo_codes=False, strip_accents=False, numbers=True, stop_words=typing.Set[str], parse_urls=False)
- __init__(lang='en', lower=True, phrases_model=None, stemmer: Any | None = None, emo_codes=False, strip_accents=False, numbers=True, stop_words=typing.Set[str], parse_urls=False)
- parse(txt: str) List[str]
- export_conf() ParserConf
Exports the parser configuration, omitting the stop words path and the stemmer path.
- class datawords.parsers.SentencesIterator(data: Iterable, *, parser: ParserProto)
- __init__(data: Iterable, *, parser: ParserProto)
- parser_generator()
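The relationship between an abstract parser and an iterator that accepts one can be sketched as follows. This is a standalone illustration, not the library's code: ToyParserProto, WhitespaceParser, and ToySentencesIterator are hypothetical names, and the assumption that iterating yields one token list per input text is a guess about the intended use, since the behavior of parser_generator() is not documented here.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, List


class ToyParserProto(ABC):
    """Hypothetical mirror of the documented parse(txt) -> List[str] interface."""

    @abstractmethod
    def parse(self, txt: str) -> List[str]:
        ...


class WhitespaceParser(ToyParserProto):
    """Hypothetical parser: lowercases the text and splits on whitespace."""

    def parse(self, txt: str) -> List[str]:
        return txt.lower().split()


class ToySentencesIterator:
    """Hypothetical stand-in: lazily parses each item of `data`."""

    def __init__(self, data: Iterable[str], *, parser: ToyParserProto):
        self.data = data
        self.parser = parser

    def __iter__(self) -> Iterator[List[str]]:
        # Assumption: one list of tokens is produced per input text.
        for txt in self.data:
            yield self.parser.parse(txt)


sentences = list(
    ToySentencesIterator(["Hello World", "Foo bar"], parser=WhitespaceParser())
)
print(sentences)  # [['hello', 'world'], ['foo', 'bar']]
```

Because the iterator depends only on the parse method, any object that conforms to the abstract class can be swapped in without changing the iteration code.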
Other functions
- datawords.parsers.load_stop2(lang='en', *, models_path=None, strip_accents=True) Set[str]
Open a list of stop words.
If models_path is omitted, it looks for the list of words bundled internally. Currently, it supports en, pt and es.
- datawords.parsers.generate_ngrams(tokens: List[str], n: int = 2, sep=' ') List[str]
Generate n-grams from a list of tokens.
- Parameters:
tokens (List[str]) – a list of words already parsed.
n (int) – the number of tokens in each n-gram. 2 by default.
sep (str) – string to use as separator between words.
- Returns:
a list of ngrams
- Return type:
List[str]
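As an illustration of what n-gram generation over a token list looks like, here is a minimal sliding-window sketch. The name toy_generate_ngrams is hypothetical and the code is not taken from datawords; in particular, how the library handles inputs shorter than n is an assumption (this sketch returns an empty list).

```python
from typing import List


def toy_generate_ngrams(tokens: List[str], n: int = 2, sep: str = " ") -> List[str]:
    """Sliding-window n-grams; a sketch, not the datawords implementation."""
    # Each window starts at index i and spans n consecutive tokens.
    return [sep.join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


bigrams = toy_generate_ngrams(["new", "york", "city"])
print(bigrams)  # ['new york', 'york city']

trigrams = toy_generate_ngrams(["new", "york", "city"], n=3, sep="_")
print(trigrams)  # ['new_york_city']
```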
- datawords.parsers.norm_token(tkn: str) str
An opinionated token normalizer. It lowercases the string, strips any accents, and keeps only the letters and numbers of the token.
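The three steps described for the normalizer can be sketched with the standard library alone. The function toy_norm_token is a hypothetical stand-in and not the datawords code; it assumes that characters which are neither letters nor digits are dropped rather than replaced.

```python
import unicodedata


def toy_norm_token(tkn: str) -> str:
    """Lowercase, strip accents, keep letters and digits (illustrative only)."""
    lowered = tkn.lower()
    # Decompose accented characters so the accents become combining marks.
    decomposed = unicodedata.normalize("NFKD", lowered)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Assumption: anything that is not alphanumeric is removed.
    return "".join(c for c in no_accents if c.isalnum())


normalized = toy_norm_token("¡Canción#1!")
print(normalized)  # cancion1
```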
- datawords.parsers.apply_regex(reg_expr, word) str