Parsers
The parsers module mostly contains functions and classes for parsing text.
doc_parser
doc_parser is the most critical function in the package. It’s used internally by multiple classes.
This function takes a text and produces the tokens needed to feed different models.
- datawords.parsers.doc_parser(txt: str, stop_words: FrozenSet[str], stemmer: Any | None = None, emo_codes=False, strip_accents=True, lower=True, numbers=False, parse_urls=False) List[str]
Takes a string and returns a list of words.
Note
emo_codes and parse_urls alter the order of the tokens: if emo codes or URLs are found, they are placed at the end of the list.
This function is related to ParserConf.
- Parameters:
txt (str) – text to be parsed.
stop_words (FrozenSet[str]) – a set of stop words. It’s possible to get one using load_stop2().
stemmer (Optional[Any]) – optional, a stemmer.
emo_codes (bool) – if true, emo_codes will be decoded into text
strip_accents (bool) – replace accented letters with the same letters without accents.
lower (bool) – transform text to lowercase.
numbers (bool) – if True, numbers are kept.
parse_urls (bool) – if True, URLs are kept.
- Returns:
a list of tokens
- Return type:
List[str]
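The steps described above can be sketched with the standard library alone. This is a simplified stand-in, not the datawords implementation: the name simple_doc_parser, the two regular expressions and the exact order of the steps are assumptions made for illustration, and stemming and emo_codes are left out.

```python
import re
import unicodedata
from typing import FrozenSet, List

# Both patterns are illustrative assumptions, not the ones datawords uses.
URL_RE = re.compile(r"https?://\S+")
WORD_RE = re.compile(r"[^\W_]+")


def simple_doc_parser(
    txt: str,
    stop_words: FrozenSet[str],
    strip_accents: bool = True,
    lower: bool = True,
    numbers: bool = False,
    parse_urls: bool = False,
) -> List[str]:
    # Pull URLs out first so they are not split into word fragments.
    urls = URL_RE.findall(txt)
    txt = URL_RE.sub(" ", txt)
    if lower:
        txt = txt.lower()
    if strip_accents:
        # Decompose accented letters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", txt)
        txt = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = []
    for tok in WORD_RE.findall(txt):
        if tok in stop_words:
            continue
        if tok.isdigit() and not numbers:
            continue
        tokens.append(tok)
    # Mirror the note above: URLs, when kept, go at the end of the list.
    if parse_urls:
        tokens.extend(urls)
    return tokens


stops = frozenset({"the", "in", "see"})
text = "The café opened in 2020, see https://example.com/menu"
print(simple_doc_parser(text, stops, parse_urls=True))
print(simple_doc_parser(text, stops, numbers=True))
```

Note how the URL ends up last in the first call even though other tokens could follow it in the input, which is the reordering the note warns about.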
- class datawords.parsers.ParserConf(lang: str = 'en', emo_codes: bool = False, strip_accents: bool = False, lower: bool = True, numbers: bool = True, parse_urls: bool = False, stopw_path: str | None = None, use_stemmer: bool = False, stemmer_class: str | None = None, phrases_model_path: str | None = None)
Related to doc_parser().
- lang: str
- emo_codes: bool
- strip_accents: bool
- lower: bool
- numbers: bool
- parse_urls: bool
- stopw_path: str | None
- use_stemmer: bool
- stemmer_class: str | None
- phrases_model_path: str | None
- __init__(lang: str = 'en', emo_codes: bool = False, strip_accents: bool = False, lower: bool = True, numbers: bool = True, parse_urls: bool = False, stopw_path: str | None = None, use_stemmer: bool = False, stemmer_class: str | None = None, phrases_model_path: str | None = None) None
Method generated by attrs for class ParserConf.
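As a reading aid, the fields and defaults listed above can be mirrored with a standard-library dataclass. ParserConfSketch is a hypothetical stand-in (the real class is generated by attrs and lives in datawords.parsers); it only serves to show that ParserConf defaults differ from the doc_parser defaults for strip_accents and numbers.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ParserConfSketch:
    # Field names and defaults copied from the ParserConf signature above.
    lang: str = "en"
    emo_codes: bool = False
    strip_accents: bool = False  # doc_parser defaults to True
    lower: bool = True
    numbers: bool = True  # doc_parser defaults to False
    parse_urls: bool = False
    stopw_path: Optional[str] = None
    use_stemmer: bool = False
    stemmer_class: Optional[str] = None
    phrases_model_path: Optional[str] = None


# Override only what differs from the defaults.
conf = ParserConfSketch(lang="es", strip_accents=True)
conf_dict = asdict(conf)
print(conf_dict)
```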
Sentences Parser
- class datawords.parsers.ParserProto
Abstract class that any parser should conform to.
- abstract parse(txt: str) List[str]
- class datawords.parsers.SentencesParser(lang='en', lower=True, phrases_model=None, stemmer: Any | None = None, emo_codes=False, strip_accents=False, numbers=True, stop_words=typing.FrozenSet[str], parse_urls=False)
- __init__(lang='en', lower=True, phrases_model=None, stemmer: Any | None = None, emo_codes=False, strip_accents=False, numbers=True, stop_words=typing.FrozenSet[str], parse_urls=False)
- parse(txt: str) List[str]
It takes a txt, which could be a phrase, a doc, or anything in between, and parses it using the phrase model or the parser.
- export_conf() ParserConf
It exports the parser configuration, omitting the stop words path and the stemmer path.
- class datawords.parsers.SentencesIterator(data: Iterable, *, parser: ParserProto)
- __init__(data: Iterable, *, parser: ParserProto)
- parser_generator()
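The relationship between a parser and an iterator over texts can be sketched as follows. Everything here is a hypothetical stand-in written with the standard library: ParserProtoSketch mimics the abstract parse(txt) contract, WhitespaceParser is a toy implementation, and iter_parsed assumes that iterating yields one token list per input text, which the reference above does not state.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, List


class ParserProtoSketch(ABC):
    """Stand-in for the abstract contract: one text in, a list of tokens out."""

    @abstractmethod
    def parse(self, txt: str) -> List[str]:
        ...


class WhitespaceParser(ParserProtoSketch):
    """Toy parser: lowercases the text and splits it on whitespace."""

    def parse(self, txt: str) -> List[str]:
        return txt.lower().split()


def iter_parsed(data: Iterable[str], *, parser: ParserProtoSketch) -> Iterator[List[str]]:
    # Assumption: each text is parsed lazily, one token list per item.
    for txt in data:
        yield parser.parse(txt)


parsed = list(iter_parsed(["Hello World", "Parsers are fun"], parser=WhitespaceParser()))
print(parsed)
```

Keeping the parser behind a single parse method means any object that honors the contract can be swapped in without changing the code that consumes the tokens.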
PhrasesModel
The PhrasesModel is used to build bigrams. In this case, it’s a wrapper around Gensim Phrases.
- class datawords.parsers.PhrasesModel(parser_conf: ParserConf, min_count: float = 1.0, threshold: float | None = None, max_vocab_size: int | None = None, connector_words=None, model=None, saved_path=None)
- __init__(parser_conf: ParserConf, min_count: float = 1.0, threshold: float | None = None, max_vocab_size: int | None = None, connector_words=None, model=None, saved_path=None)
- Parameters:
parser_conf (parsers.ParserConf) – configuration of the parser
min_count (float) – Ignore all words and bigrams with total collected count lower than this value.
threshold (Optional[float]) – Score threshold for forming phrases (higher means fewer phrases). A phrase of words a followed by b is accepted if the score of the phrase is greater than threshold. Heavily depends on the concrete scoring function; see the scoring parameter of Gensim Phrases.
max_vocab_size (Optional[int]) – Maximum size (number of tokens) of the vocabulary. Used to control pruning of less common words, to keep memory under control. The default of 40M needs about 3.6GB of RAM. Increase/decrease max_vocab_size depending on how much available memory you have.
connector_words (FrozenSet[str]) – Set of words that may be included within a phrase, without affecting its scoring. If none is provided, the set for the lang value from the parser_conf is used. By default datawords includes CONNECTOR_WORDS for English, Portuguese and Spanish.
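To make the roles of min_count and threshold concrete, here is a small pure-Python sketch of bigram scoring in the style of the default scorer documented by Gensim: (count(a, b) - min_count) / count(a) / count(b) * vocabulary size. The function names are hypothetical, the vocabulary size is simplified to the number of distinct words, and connector words and vocabulary pruning are ignored, so the numbers will not necessarily match what PhrasesModel produces.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Bigram = Tuple[str, str]


def score_bigrams(sentences: Iterable[List[str]], min_count: float = 1.0) -> Dict[Bigram, float]:
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        # Adjacent pairs inside one sentence only.
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)  # simplification: distinct words only
    return {
        (a, b): (count - min_count) / unigrams[a] / unigrams[b] * vocab_size
        for (a, b), count in bigrams.items()
    }


def accepted_phrases(
    sentences: List[List[str]], min_count: float = 1.0, threshold: float = 1.0
) -> List[Bigram]:
    # A pair becomes a phrase only if its score is greater than the threshold.
    scores = score_bigrams(sentences, min_count)
    return sorted(pair for pair, score in scores.items() if score > threshold)


corpus = [
    ["new", "york", "city"],
    ["new", "york", "times"],
    ["the", "city", "times"],
]
scores = score_bigrams(corpus)
print(scores[("new", "york")])
print(accepted_phrases(corpus, threshold=1.0))
```

Raising threshold above the score of ("new", "york") would leave no accepted phrases, which illustrates why higher values mean fewer phrases.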
- property model: FrozenPhrases
- fit(X: Iterable)
This will train the phrase model. It needs an iterable.
- Parameters:
X (Iterable) – An iterable which returns plain texts.
- parse(txt: str) List[str]
It will parse only one text.
- Parameters:
txt (str) – the text to parse.
- Returns:
a list of words
- Return type:
List[str]
- export_conf() PhrasesModelMeta
- save(fp: str | PathLike)
Save phrase model to a folder.
- Parameters:
fp (Union[str, os.PathLike]) – The path to the folder. Each model is stored in a folder. The path should be to that folder.
- classmethod load(fp: str | PathLike) PhrasesModel
Loads the PhrasesModel.
- Parameters:
fp (Union[str, os.PathLike]) – The path to the model. Each model is stored in a folder. The path should be to that folder.
- Returns:
PhrasesModel loaded.
- Return type:
PhrasesModel
Other functions
- datawords.parsers.parser_from_conf(conf: ParserConf, *, stopw: StopWords | None = None, phrases: PhrasesModel | None = None)
It loads a SentencesParser based on the ParserConf configuration. It’s also possible to pass already initialized stop words and phrases objects, to avoid multiple instances in the same process.
- datawords.parsers.load_stop(lang='en', *, models_path=None, strip_accents=True) StopWords
Open a list of stop words.
If models_path is omitted, it will look internally in the datawords package. Currently, it supports en, pt and es.
- datawords.parsers.generate_ngrams(tokens: List[str], n: int = 2, sep=' ') List[str]
Generate n-grams from a list of tokens.
- Parameters:
tokens (List[str]) – a list of words already parsed.
n (int) – number of words in each n-gram. 2 by default.
sep (str) – string to use as separator between words.
- Returns:
a list of ngrams
- Return type:
List[str]
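A sliding-window version of this idea can be sketched in a few lines. generate_ngrams_sketch is a hypothetical reimplementation that follows the signature above; how the library handles edge cases, such as inputs shorter than n, is an assumption here.

```python
from typing import List


def generate_ngrams_sketch(tokens: List[str], n: int = 2, sep: str = " ") -> List[str]:
    # Slide a window of n tokens over the list and join each window with sep.
    # When the list is shorter than n, the range is empty and so is the result.
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words = ["natural", "language", "processing"]
print(generate_ngrams_sketch(words))
print(generate_ngrams_sketch(words, n=3, sep="_"))
```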
- datawords.parsers.norm_token(tkn: str) str
An opinionated token normalizer. It lowercases the string, strips any accents and keeps only the letters and numbers of the token.
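The three steps named in this description can be sketched with the standard library. norm_token_sketch is a hypothetical stand-in: the use of NFKD decomposition for accent stripping and str.isalnum for filtering are assumptions, and the library may treat some characters differently.

```python
import unicodedata


def norm_token_sketch(tkn: str) -> str:
    lowered = tkn.lower()
    # Decompose accented letters so the accents become separate combining marks.
    decomposed = unicodedata.normalize("NFKD", lowered)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep only letters and digits.
    return "".join(ch for ch in without_accents if ch.isalnum())


print(norm_token_sketch("Café!"))
print(norm_token_sketch("Año-2024"))
```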