pressagio.tokenizer

Several classes to tokenize text.

class pressagio.tokenizer.ForwardTokenizer(text, blankspaces=' \x0c\n\r\t\x0b\x85\xa0\u2009', separators='`~!@#$%^&*()_+=\|]}[{";:/?.>, <¡¿†¨„“”«»।॥ו–—―´’‘‚י0123456789ः')[source]

Tokenize text from the beginning to the end.

Methods

count_characters(self) Counts the number of unicode characters in the IO stream.
is_blankspace(self, char) Test if a character is a blankspace.
is_separator(self, char) Test if a character is a separator.
count_tokens  
has_more_tokens  
next_token  
progress  
reset_stream  
__init__(self, text, blankspaces=' \x0c\n\r\t\x0b\x85\xa0\u2009', separators='`~!@#$%^&*()_+=\|]}[{";:/?.>, <¡¿†¨„“”«»।॥ו–—―´’‘‚י0123456789ः')[source]

Constructor of the Tokenizer base class.

Parameters:
text : str

The text to tokenize.

blankspaces : str

The characters that represent empty spaces.

separators : str

The characters that separate token units (e.g. word boundaries).

count_characters(self)[source]

Counts the number of unicode characters in the IO stream.
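As a rough illustration of the forward tokenization behaviour described above, here is a self-contained sketch (not the library implementation): any character from the blankspaces or separators set ends the current token, and the separator set is shortened here for readability.

```python
# Sketch of forward tokenization: blankspaces and separators both end a
# token; only non-delimiter characters accumulate into the current token.
BLANKSPACES = " \x0c\n\r\t\x0b\x85\xa0\u2009"
SEPARATORS = "`~!@#$%^&*()_+=\\|]}[{\";:/?.>,<"  # shortened for the example

def forward_tokenize(text):
    tokens = []
    current = []
    for char in text:
        if char in BLANKSPACES or char in SEPARATORS:
            if current:  # a delimiter closes any token in progress
                tokens.append("".join(current))
                current = []
        else:
            current.append(char)
    if current:  # flush the final token
        tokens.append("".join(current))
    return tokens

print(forward_tokenize("Hello, world! 42"))  # → ['Hello', 'world', '42']
```

Note that in the library's real default, digits are also separators; this sketch omits them for clarity.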

class pressagio.tokenizer.NgramMap[source]

A memory efficient store for ngrams.

This class is optimized for memory consumption; it may be slower than other ngram stores. It is also optimized for a three-step process:

  1. Add all ngrams.
  2. Perform a cutoff operation (optional).
  3. Read list of ngrams.

It might not perform well for other use cases.
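The three-step workflow above can be sketched with a plain dict; this is an illustrative re-implementation of the described API, not the library's memory-optimized code, and the tab join between indices is this sketch's own choice.

```python
from collections import defaultdict

class SimpleNgramMap:
    """Illustrative sketch of the add/cutoff/items workflow."""

    def __init__(self):
        self.strings = {}               # token -> index (as a string)
        self.ngrams = defaultdict(int)  # joined indices -> frequency

    def add_token(self, token):
        # Intern the token and return its index as a string.
        if token not in self.strings:
            self.strings[token] = str(len(self.strings))
        return self.strings[token]

    def add(self, ngram_indices):
        # Count an ngram given as a list of string indices.
        self.ngrams["\t".join(ngram_indices)] += 1

    def cutoff(self, cutoff):
        # Drop all ngrams with a frequency at or below the cutoff.
        self.ngrams = {k: v for k, v in self.ngrams.items() if v > cutoff}

    def items(self):
        # Map indices back to the original tokens for iteration.
        index_to_token = {v: k for k, v in self.strings.items()}
        for key, count in self.ngrams.items():
            yield [index_to_token[i] for i in key.split("\t")], count

m = SimpleNgramMap()
for ngram in [("a", "b"), ("a", "b"), ("b", "c")]:
    m.add([m.add_token(t) for t in ngram])
m.cutoff(1)  # removes ("b", "c"), which occurred only once
print(list(m.items()))  # → [(['a', 'b'], 2)]
```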

Methods

add(self, ngram_indices) Add an ngram to the store.
add_token(self, token) Add a token to the internal string store.
cutoff(self, cutoff) Perform a cutoff on the ngram store.
items(self) Get the ngrams from the store.
__init__(self)[source]

Initialize internal data stores.

add(self, ngram_indices)[source]

Add an ngram to the store.

This will add a list of strings as an ngram to the ngram store. In the standard use case the strings are the token indices, which you can get from the add_token() method.

Parameters:
ngram_indices : list of str

The indices of the ngram's tokens, each as a string.

add_token(self, token)[source]

Add a token to the internal string store.

This will only add the token to the internal strings store. It will return an index that you can use to create your ngram.

The ngrams are represented as strings of the indices, so we return a string here so that the consumer does not have to do the conversion.

Parameters:
token : str

The token to add to the string store.

Returns:
str

The index of the token as a string.

cutoff(self, cutoff)[source]

Perform a cutoff on the ngram store.

This will remove all ngrams that have a frequency equal to or lower than the given cutoff.

Parameters:
cutoff : int

The cutoff value; all ngrams with a frequency at or below this value are removed.

items(self)[source]

Get the ngrams from the store.

Returns:
iterable of (tokens, count)

Each item is a pair: tokens is the list of original strings that were added to the store via add_token(), and count is the frequency of that ngram.

class pressagio.tokenizer.ReverseTokenizer(text, blankspaces=' \x0c\n\r\t\x0b\x85\xa0\u2009', separators='`~!@#$%^&*()_+=\|]}[{";:/?.>, <¡¿†¨„“”«»।॥ו–—―´’‘‚י0123456789ः')[source]

Tokenize text from the end to the beginning.

Methods

count_characters(self) Counts the number of unicode characters in the IO stream.
is_blankspace(self, char) Test if a character is a blankspace.
is_separator(self, char) Test if a character is a separator.
count_tokens  
has_more_tokens  
next_token  
progress  
reset_stream  
__init__(self, text, blankspaces=' \x0c\n\r\t\x0b\x85\xa0\u2009', separators='`~!@#$%^&*()_+=\|]}[{";:/?.>, <¡¿†¨„“”«»।॥ו–—―´’‘‚י0123456789ः')[source]

Constructor of the Tokenizer base class.

Parameters:
text : str

The text to tokenize.

blankspaces : str

The characters that represent empty spaces.

separators : str

The characters that separate token units (e.g. word boundaries).

count_characters(self)[source]

Counts the number of unicode characters in the IO stream.

class pressagio.tokenizer.Tokenizer(text, blankspaces=' \x0c\n\r\t\x0b\x85\xa0\u2009', separators='`~!@#$%^&*()_+=\|]}[{";:/?.>, <¡¿†¨„“”«»।॥ו–—―´’‘‚י0123456789ः')[source]

Base class for all tokenizers.

Methods

is_blankspace(self, char) Test if a character is a blankspace.
is_separator(self, char) Test if a character is a separator.
count_characters  
count_tokens  
has_more_tokens  
next_token  
progress  
reset_stream  
__init__(self, text, blankspaces=' \x0c\n\r\t\x0b\x85\xa0\u2009', separators='`~!@#$%^&*()_+=\|]}[{";:/?.>, <¡¿†¨„“”«»।॥ו–—―´’‘‚י0123456789ः')[source]

Constructor of the Tokenizer base class.

Parameters:
text : str

The text to tokenize.

blankspaces : str

The characters that represent empty spaces.

separators : str

The characters that separate token units (e.g. word boundaries).

is_blankspace(self, char)[source]

Test if a character is a blankspace.

Parameters:
char : str

The character to test.

Returns:
ret : bool

True if character is a blankspace, False otherwise.

is_separator(self, char)[source]

Test if a character is a separator.

Parameters:
char : str

The character to test.

Returns:
ret : bool

True if character is a separator, False otherwise.
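The two tests above reduce to membership checks against the strings passed to the constructor. A minimal sketch of that behaviour (not the library source; the separator set here is a hypothetical shortened default):

```python
# Sketch: a character is a blankspace/separator iff it appears in the
# respective string passed to the tokenizer's constructor.
def is_blankspace(char, blankspaces=" \x0c\n\r\t\x0b\x85\xa0\u2009"):
    return len(char) == 1 and char in blankspaces

def is_separator(char, separators="`~!@#$%^&*()_+="):
    return len(char) == 1 and char in separators

print(is_blankspace("\t"), is_separator("@"), is_separator("a"))
# → True True False
```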

pressagio.tokenizer.forward_tokenize_file(infile: str, ngram_size: int, lowercase: bool = False, cutoff: int = 0, ngram_map: pressagio.tokenizer.NgramMap = None)[source]

Tokenize a file and return an ngram store.

Parameters:
infile : str

The file to parse.

ngram_size : int

The size of the ngrams to generate.

lowercase : bool

Whether or not to lowercase all tokens.

cutoff : int

Perform a cutoff after parsing. We will only return ngrams that have a frequency higher than the cutoff.

ngram_map : NgramMap

Pass an existing NgramMap if you want to add the ngrams of the given file to the store. Will create a new NgramMap if None.

Returns:
NgramMap

The ngram map that allows you to iterate over the ngrams.
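The file-to-ngrams pipeline can be sketched as follows. This is a simplified, hedged re-implementation for illustration: it splits on whitespace only, returns a plain dict instead of an NgramMap, and the function name is this example's own.

```python
import tempfile
from collections import defaultdict

def forward_tokenize_file_sketch(infile, ngram_size, lowercase=False, cutoff=0):
    # Read the whole file, optionally lowercasing all tokens.
    with open(infile, encoding="utf-8") as f:
        text = f.read()
    if lowercase:
        text = text.lower()
    tokens = text.split()  # simplified: whitespace-only tokenization
    # Count every contiguous ngram of the requested size.
    counts = defaultdict(int)
    for i in range(len(tokens) - ngram_size + 1):
        counts[tuple(tokens[i:i + ngram_size])] += 1
    # Keep only ngrams with a frequency higher than the cutoff.
    return {k: v for k, v in counts.items() if v > cutoff}

with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("the cat sat on the cat")
print(forward_tokenize_file_sketch(tmp.name, 2))
# ('the', 'cat') appears twice; every other bigram appears once
```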

pressagio.tokenizer.forward_tokenize_files(infiles: List[str], ngram_size: int, lowercase: bool = False, cutoff: int = 0)[source]

Tokenize a list of files and return an ngram store.

Parameters:
infiles : List[str]

The files to parse.

ngram_size : int

The size of the ngrams to generate.

lowercase : bool

Whether or not to lowercase all tokens.

cutoff : int

Perform a cutoff after parsing. We will only return ngrams that have a frequency higher than the cutoff.

Returns:
NgramMap

The ngram map that allows you to iterate over the ngrams.