pressagio.tokenizer

Several classes to tokenize text.

class pressagio.tokenizer.ForwardTokenizer(text, blankspaces=' \x0c\n\r\t\x0b\x85\xa0\u2009', separators='`~!@#$%^&*()_+=\|]}[{";:/?.>, <¡¿†¨„“”«»।॥ו–—―´’‘‚י0123456789ः')[source]

Tokenize text from the beginning to the end.

Methods

count_characters(self) Counts the number of unicode characters in the IO stream.
is_blankspace(self, char) Test if a character is a blankspace.
is_separator(self, char) Test if a character is a separator.
count_tokens  
has_more_tokens  
next_token  
progress  
reset_stream  
__init__(self, text, blankspaces=' \x0c\n\r\t\x0b\x85\xa0\u2009', separators='`~!@#$%^&*()_+=\|]}[{";:/?.>, <¡¿†¨„“”«»।॥ו–—―´’‘‚י0123456789ः')[source]

Constructor of the Tokenizer base class.

Parameters:
text : str

The text to tokenize.

blankspaces : str

The characters that represent empty spaces.

separators : str

The characters that separate token units (e.g. word boundaries).

count_characters(self)[source]

Counts the number of unicode characters in the IO stream.
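As a rough illustration of the forward tokenization behaviour described above, here is a self-contained sketch (not the library implementation): any character from the blankspaces or separators set ends the current token, and the separator set is shortened here for readability.

```python
# Sketch of forward tokenization: blankspaces and separators both end a
# token; only non-delimiter characters accumulate into the current token.
BLANKSPACES = " \x0c\n\r\t\x0b\x85\xa0\u2009"
SEPARATORS = "`~!@#$%^&*()_+=\\|]}[{\";:/?.>,<"  # shortened for the example

def forward_tokenize(text):
    tokens = []
    current = []
    for char in text:
        if char in BLANKSPACES or char in SEPARATORS:
            if current:  # a delimiter closes any token in progress
                tokens.append("".join(current))
                current = []
        else:
            current.append(char)
    if current:  # flush the final token
        tokens.append("".join(current))
    return tokens

print(forward_tokenize("Hello, world! 42"))  # → ['Hello', 'world', '42']
```

Note that in the library's real default, digits are also separators; this sketch omits them for clarity.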

class pressagio.tokenizer.NgramMap[source]

A memory efficient store for ngrams.

This class is optimized for memory consumption; it may be slower than other ngram stores. It is also optimized for a three-step process:

  1. Add all ngrams.
  2. Perform a cutoff operation (optional).
  3. Read list of ngrams.

It might not perform well for other use cases.
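The three-step workflow above can be sketched with a plain dict; this is an illustrative re-implementation of the described API, not the library's memory-optimized code, and the tab join between indices is this sketch's own choice.

```python
from collections import defaultdict

class SimpleNgramMap:
    """Illustrative sketch of the add/cutoff/items workflow."""

    def __init__(self):
        self.strings = {}               # token -> index (as a string)
        self.ngrams = defaultdict(int)  # joined indices -> frequency

    def add_token(self, token):
        # Intern the token and return its index as a string.
        if token not in self.strings:
            self.strings[token] = str(len(self.strings))
        return self.strings[token]

    def add(self, ngram_indices):
        # Count an ngram given as a list of string indices.
        self.ngrams["\t".join(ngram_indices)] += 1

    def cutoff(self, cutoff):
        # Drop all ngrams with a frequency at or below the cutoff.
        self.ngrams = {k: v for k, v in self.ngrams.items() if v > cutoff}

    def items(self):
        # Map indices back to the original tokens for iteration.
        index_to_token = {v: k for k, v in self.strings.items()}
        for key, count in self.ngrams.items():
            yield [index_to_token[i] for i in key.split("\t")], count

m = SimpleNgramMap()
for ngram in [("a", "b"), ("a", "b"), ("b", "c")]:
    m.add([m.add_token(t) for t in ngram])
m.cutoff(1)  # removes ("b", "c"), which occurred only once
print(list(m.items()))  # → [(['a', 'b'], 2)]
```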

Methods

add(self, ngram_indices) Add an ngram to the store.
add_token(self, token) Add a token to the internal string store.
cutoff(self, cutoff) Perform a cutoff on the ngram store.
items(self) Get the ngrams from the store.
__init__(self)[source]

Initialize internal data stores.

add(self, ngram_indices)[source]

Add an ngram to the store.

This will add a list of strings as an ngram to the ngram store. In the standard use case the strings are the token indices, which you can get from the add_token() method.

Parameters:
ngram_indices : list of str

The indices of the ngram's tokens, each as a string.

add_token(self, token)[source]

Add a token to the internal string store.

This will only add the token to the internal strings store. It will return an index that you can use to create your ngram.

The ngrams are represented as strings of the indices, so we return a string here so that the consumer does not have to do the conversion.

Parameters:
token : str

The token to add to the string store.

Returns:
str

The index of the token as a string.

cutoff(self, cutoff)[source]

Perform a cutoff on the ngram store.

This will remove all ngrams that have a frequency equal to or lower than the given cutoff.

Parameters:
cutoff : int

The cutoff value; all ngrams with a frequency at or below this value are removed.

items(self)[source]

Get the ngrams from the store.

Returns:
iterable of (tokens, count)

Each item is a pair: tokens is the list of original strings that were added to the store via add_token(), and count is the frequency of that ngram.

class pressagio.tokenizer.ReverseTokenizer(text, blankspaces=' \x0c\n\r\t\x0b\x85\xa0\u2009', separators='`~!@#$%^&*()_+=\|]}[{";:/?.>, <¡¿†¨„“”«»।॥ו–—―´’‘‚י0123456789ः')[source]

Tokenize text from the end to the beginning.

Methods

count_characters(self) Counts the number of unicode characters in the IO stream.
is_blankspace(self, char) Test if a character is a blankspace.
is_separator(self, char) Test if a character is a separator.
count_tokens  
has_more_tokens  
next_token  
progress  
reset_stream  
__init__(self, text, blankspaces=' \x0c\n\r\t\x0b\x85\xa0\u2009', separators='`~!@#$%^&*()_+=\|]}[{";:/?.>, <¡¿†¨„“”«»।॥ו–—―´’‘‚י0123456789ः')[source]

Constructor of the Tokenizer base class.

Parameters:
text : str

The text to tokenize.

blankspaces : str

The characters that represent empty spaces.

separators : str

The characters that separate token units (e.g. word boundaries).

count_characters(self)[source]

Counts the number of unicode characters in the IO stream.

class pressagio.tokenizer.Tokenizer(text, blankspaces=' \x0c\n\r\t\x0b\x85\xa0\u2009', separators='`~!@#$%^&*()_+=\|]}[{";:/?.>, <¡¿†¨„“”«»।॥ו–—―´’‘‚י0123456789ः')[source]

Base class for all tokenizers.

Methods

is_blankspace(self, char) Test if a character is a blankspace.
is_separator(self, char) Test if a character is a separator.
count_characters  
count_tokens  
has_more_tokens  
next_token  
progress  
reset_stream  
__init__(self, text, blankspaces=' \x0c\n\r\t\x0b\x85\xa0\u2009', separators='`~!@#$%^&*()_+=\|]}[{";:/?.>, <¡¿†¨„“”«»।॥ו–—―´’‘‚י0123456789ः')[source]

Constructor of the Tokenizer base class.

Parameters:
text : str

The text to tokenize.

blankspaces : str

The characters that represent empty spaces.

separators : str

The characters that separate token units (e.g. word boundaries).

is_blankspace(self, char)[source]

Test if a character is a blankspace.

Parameters:
char : str

The character to test.

Returns:
ret : bool

True if character is a blankspace, False otherwise.

is_separator(self, char)[source]

Test if a character is a separator.

Parameters:
char : str

The character to test.

Returns:
ret : bool

True if character is a separator, False otherwise.
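The two tests above reduce to membership checks against the strings passed to the constructor. A minimal sketch of that behaviour (not the library source; the separator set here is a hypothetical shortened default):

```python
# Sketch: a character is a blankspace/separator iff it appears in the
# respective string passed to the tokenizer's constructor.
def is_blankspace(char, blankspaces=" \x0c\n\r\t\x0b\x85\xa0\u2009"):
    return len(char) == 1 and char in blankspaces

def is_separator(char, separators="`~!@#$%^&*()_+="):
    return len(char) == 1 and char in separators

print(is_blankspace("\t"), is_separator("@"), is_separator("a"))
# → True True False
```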

pressagio.tokenizer.forward_tokenize_file(infile: str, ngram_size: int, lowercase: bool = False, cutoff: int = 0, ngram_map: pressagio.tokenizer.NgramMap = None)[source]

Tokenize a file and return an ngram store.

Parameters:
infile : str

The file to parse.

ngram_size : int

The size of the ngrams to generate.

lowercase : bool

Whether or not to lowercase all tokens.

cutoff : int

Perform a cutoff after parsing. We will only return ngrams that have a frequency higher than the cutoff.

ngram_map : NgramMap

Pass an existing NgramMap if you want to add the ngrams of the given file to the store. Will create a new NgramMap if None.

Returns:
NgramMap

The ngram map that allows you to iterate over the ngrams.
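The file-to-ngrams pipeline can be sketched as follows. This is a simplified, hedged re-implementation for illustration: it splits on whitespace only, returns a plain dict instead of an NgramMap, and the function name is this example's own.

```python
import tempfile
from collections import defaultdict

def forward_tokenize_file_sketch(infile, ngram_size, lowercase=False, cutoff=0):
    # Read the whole file, optionally lowercasing all tokens.
    with open(infile, encoding="utf-8") as f:
        text = f.read()
    if lowercase:
        text = text.lower()
    tokens = text.split()  # simplified: whitespace-only tokenization
    # Count every contiguous ngram of the requested size.
    counts = defaultdict(int)
    for i in range(len(tokens) - ngram_size + 1):
        counts[tuple(tokens[i:i + ngram_size])] += 1
    # Keep only ngrams with a frequency higher than the cutoff.
    return {k: v for k, v in counts.items() if v > cutoff}

with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("the cat sat on the cat")
print(forward_tokenize_file_sketch(tmp.name, 2))
# ('the', 'cat') appears twice; every other bigram appears once
```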

pressagio.tokenizer.forward_tokenize_files(infiles: List[str], ngram_size: int, lowercase: bool = False, cutoff: int = 0)[source]

Tokenize a list of files and return an ngram store.

Parameters:
infiles : List[str]

The files to parse.

ngram_size : int

The size of the ngrams to generate.

lowercase : bool

Whether or not to lowercase all tokens.

cutoff : int

Perform a cutoff after parsing. We will only return ngrams that have a frequency higher than the cutoff.

Returns:
NgramMap

The ngram map that allows you to iterate over the ngrams.