pressagio.tokenizer

Several classes to tokenize text.
class pressagio.tokenizer.ForwardTokenizer(text, blankspaces=' \x0c\n\r\t\x0b\x85\xa0\u2009', separators='`~!@#$%^&*()_+=\|]}[{";:/?.>, <¡¿†¨„“”«»।॥ו–—―´’‘‚י0123456789ः')

Methods:

- count_characters(self): Count the number of unicode characters in the IO stream.
- is_blankspace(self, char): Test if a character is a blankspace.
- is_separator(self, char): Test if a character is a separator.
- count_tokens, has_more_tokens, next_token, progress, reset_stream
__init__(self, text, blankspaces=' \x0c\n\r\t\x0b\x85\xa0\u2009', separators='`~!@#$%^&*()_+=\|]}[{";:/?.>, <¡¿†¨„“”«»।॥ו–—―´’‘‚י0123456789ः')

Constructor of the Tokenizer base class.

Parameters:

- text (str): The text to tokenize.
- blankspaces (str): The characters that represent empty spaces.
- separators (str): The characters that separate token units (e.g. word boundaries).
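The two character classes drive tokenization: a character in either set ends the current token. A minimal sketch of the membership tests, using the default character sets from the signature above (plain Python, not the pressagio implementation):

```python
# Default character classes, as given in the ForwardTokenizer signature.
BLANKSPACES = ' \x0c\n\r\t\x0b\x85\xa0\u2009'
SEPARATORS = '`~!@#$%^&*()_+=\\|]}[{";:/?.>, <¡¿†¨„“”«»।॥ו–—―´’‘‚י0123456789ः'

def is_blankspace(char):
    # A blankspace is any single character from the blankspace set.
    return len(char) == 1 and char in BLANKSPACES

def is_separator(char):
    # A separator is any single character from the separator set.
    return len(char) == 1 and char in SEPARATORS

print(is_blankspace(' '))   # True
print(is_separator('!'))    # True
print(is_separator('a'))    # False
```

Note that the digits 0-9 are in the separator set, so numbers do not form tokens under the defaults.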
class pressagio.tokenizer.NgramMap

A memory efficient store for ngrams.

This class is optimized for memory consumption; it might be slower than other ngram stores. It is also optimized for a three step process:

1. Add all ngrams.
2. Perform a cutoff operation (optional).
3. Read the list of ngrams.

It might not perform well for other use cases.
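The three step process can be illustrated with a plain Counter standing in for NgramMap (the real class uses a more memory efficient representation):

```python
from collections import Counter

# Stand-in for NgramMap, illustrating the add / cutoff / read life cycle.
store = Counter()

# Step 1: add all ngrams.
for ngram in [("the", "cat"), ("the", "cat"), ("cat", "sat"), ("the", "dog")]:
    store[ngram] += 1

# Step 2: optional cutoff, keeping only ngrams with a frequency
# higher than the threshold.
cutoff = 1
store = Counter({ngram: n for ngram, n in store.items() if n > cutoff})

# Step 3: read the list of ngrams.
print(sorted(store.items()))  # [(('the', 'cat'), 2)]
```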
Methods:

- add(self, ngram_indices): Add an ngram to the store.
- add_token(self, token): Add a token to the internal string store.
- cutoff(self, cutoff): Perform a cutoff on the ngram store.
- items(self): Get the ngrams from the store.
add(self, ngram_indices)

Add an ngram to the store.

This will add a list of strings as an ngram to the ngram store. In our standard use case the strings are the indices of the tokens; you can get those from the add_token() method.

Parameters:

- ngram_indices (list of str): The indices of the ngram strings, each as a string.
add_token(self, token)

Add a token to the internal string store.

This will only add the token to the internal string store. It will return an index that you can use to create your ngram.

The ngrams are represented as strings of the indices, so we return a string here so that the consumer does not have to do the conversion.

Parameters:

- token (str): The token to add to the string store.

Returns:

- str: The index of the token as a string.
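The index-as-string contract can be sketched with a hypothetical minimal string store (the class and names below are illustrative, not the pressagio internals):

```python
class StringStore:
    # Hypothetical minimal string store mirroring the add_token contract:
    # each distinct token gets a numeric index, returned as a string.
    def __init__(self):
        self._indices = {}

    def add_token(self, token):
        if token not in self._indices:
            self._indices[token] = str(len(self._indices))
        return self._indices[token]

store = StringStore()
ngram_indices = [store.add_token(t) for t in ["the", "cat", "the"]]
print(ngram_indices)  # ['0', '1', '0'], ready to pass to NgramMap.add()
```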
class pressagio.tokenizer.ReverseTokenizer(text, blankspaces=' \x0c\n\r\t\x0b\x85\xa0\u2009', separators='`~!@#$%^&*()_+=\|]}[{";:/?.>, <¡¿†¨„“”«»।॥ו–—―´’‘‚י0123456789ः')

Methods:

- count_characters(self): Count the number of unicode characters in the IO stream.
- is_blankspace(self, char): Test if a character is a blankspace.
- is_separator(self, char): Test if a character is a separator.
- count_tokens, has_more_tokens, next_token, progress, reset_stream
__init__(self, text, blankspaces=' \x0c\n\r\t\x0b\x85\xa0\u2009', separators='`~!@#$%^&*()_+=\|]}[{";:/?.>, <¡¿†¨„“”«»।॥ו–—―´’‘‚י0123456789ः')

Constructor of the Tokenizer base class.

Parameters:

- text (str): The text to tokenize.
- blankspaces (str): The characters that represent empty spaces.
- separators (str): The characters that separate token units (e.g. word boundaries).
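The difference from ForwardTokenizer is the direction of iteration: assuming ReverseTokenizer yields the same tokens from the end of the text backwards (as the name suggests; the method lists are identical), the relationship can be sketched as:

```python
import re

TEXT = "the cat sat"

# Crude stand-in for tokenization: split on non-word characters.
forward_tokens = re.findall(r"\w+", TEXT)        # order of ForwardTokenizer
reverse_tokens = list(reversed(forward_tokens))  # assumed order of ReverseTokenizer

print(forward_tokens)  # ['the', 'cat', 'sat']
print(reverse_tokens)  # ['sat', 'cat', 'the']
```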
class pressagio.tokenizer.Tokenizer(text, blankspaces=' \x0c\n\r\t\x0b\x85\xa0\u2009', separators='`~!@#$%^&*()_+=\|]}[{";:/?.>, <¡¿†¨„“”«»।॥ו–—―´’‘‚י0123456789ः')

Base class for all tokenizers.

Methods:

- is_blankspace(self, char): Test if a character is a blankspace.
- is_separator(self, char): Test if a character is a separator.
- count_characters, count_tokens, has_more_tokens, next_token, progress, reset_stream
__init__(self, text, blankspaces=' \x0c\n\r\t\x0b\x85\xa0\u2009', separators='`~!@#$%^&*()_+=\|]}[{";:/?.>, <¡¿†¨„“”«»।॥ו–—―´’‘‚י0123456789ः')

Constructor of the Tokenizer base class.

Parameters:

- text (str): The text to tokenize.
- blankspaces (str): The characters that represent empty spaces.
- separators (str): The characters that separate token units (e.g. word boundaries).
pressagio.tokenizer.forward_tokenize_file(infile: str, ngram_size: int, lowercase: bool = False, cutoff: int = 0, ngram_map: pressagio.tokenizer.NgramMap = None)

Tokenize a file and return an ngram store.
Parameters:

- infile (str): The file to parse.
- ngram_size (int): The size of the ngrams to generate.
- lowercase (bool): Whether or not to lowercase all tokens.
- cutoff (int): Perform a cutoff after parsing. We will only return ngrams that have a frequency higher than the cutoff.
- ngram_map (NgramMap): Pass an existing NgramMap if you want to add the ngrams of the given file to the store. Will create a new NgramMap if None.

Returns:

- NgramMap: The ngram map that allows you to iterate over the ngrams.
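The documented behavior can be sketched in plain Python, with a crude regex standing in for the ForwardTokenizer and a dict for the NgramMap (this is not the pressagio implementation):

```python
import re
import tempfile
from collections import Counter

def sketch_tokenize_file(infile, ngram_size, lowercase=False, cutoff=0):
    # Plain-Python sketch of the documented behavior, not pressagio itself:
    # read the file, split it into tokens, count ngrams, apply the cutoff.
    with open(infile, encoding="utf-8") as f:
        text = f.read()
    if lowercase:
        text = text.lower()
    tokens = re.findall(r"\w+", text)  # crude stand-in for ForwardTokenizer
    counts = Counter(
        tuple(tokens[i:i + ngram_size])
        for i in range(len(tokens) - ngram_size + 1)
    )
    # Only ngrams with a frequency higher than the cutoff are kept.
    return {ngram: n for ngram, n in counts.items() if n > cutoff}

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("The cat sat. The cat slept.")
    path = f.name

print(sketch_tokenize_file(path, 2, lowercase=True, cutoff=1))
# {('the', 'cat'): 2}
```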
pressagio.tokenizer.forward_tokenize_files(infiles: List[str], ngram_size: int, lowercase: bool = False, cutoff: int = 0)

Tokenize a list of files and return an ngram store.
Parameters:

- infiles (list of str): The files to parse.
- ngram_size (int): The size of the ngrams to generate.
- lowercase (bool): Whether or not to lowercase all tokens.
- cutoff (int): Perform a cutoff after parsing. We will only return ngrams that have a frequency higher than the cutoff.

Returns:

- NgramMap: The ngram map that allows you to iterate over the ngrams.
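Passing several files accumulates the ngrams of all of them into one store before the cutoff, as the ngram_map parameter of forward_tokenize_file suggests. The accumulation step can be sketched with per-file Counters (illustrative data, not pressagio output):

```python
from collections import Counter

# Per-file ngram counts, standing in for the result of tokenizing
# each file (illustrative data, not pressagio output).
file1_ngrams = Counter({("the", "cat"): 2, ("cat", "sat"): 1})
file2_ngrams = Counter({("the", "cat"): 1, ("the", "dog"): 1})

# Accumulate everything into one store, then cut off once at the end.
combined = file1_ngrams + file2_ngrams
cutoff = 1
result = {ngram: n for ngram, n in combined.items() if n > cutoff}
print(result)  # {('the', 'cat'): 3}
```

Cutting off once at the end, rather than per file, matters: an ngram that is rare in every single file can still clear the threshold in aggregate.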