pressagio.tokenizer

Several classes to tokenize text.

class pressagio.tokenizer.Tokenizer(stream, blankspaces=u' \x0c\n\r\t\x0b\x85', separators=u'`~!@#$%^&*()_-+=\\|]}[{\'";:/?.>, <\u2020\u201e\u201c\u0964\u0965\u05d5\u2013\xb4\u2019\u2018\u201a\u05d90123456789\u0903')

Base class for all tokenizers.

Methods

count_characters()
count_tokens()
has_more_tokens()
is_blankspace(char) Test if a character is a blankspace.
is_separator(char) Test if a character is a separator.
next_token()
progress()
reset_stream()

__init__(stream, blankspaces=u' \x0c\n\r\t\x0b\x85', separators=u'`~!@#$%^&*()_-+=\\|]}[{\'";:/?.>, <\u2020\u201e\u201c\u0964\u0965\u05d5\u2013\xb4\u2019\u2018\u201a\u05d90123456789\u0903')

Constructor of the Tokenizer base class.

Parameters:

stream : str or io.IOBase

The stream to tokenize. Can be a filename or any open IO stream.

blankspaces : str

The characters treated as blankspaces (whitespace between tokens).

separators : str

The characters that separate token units (e.g. word boundaries).
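The default blankspaces value can be inspected character by character. The sketch below assumes the default reads ' \x0c\n\r\t\x0b\x85' once its backslash escapes are taken into account; DEFAULT_BLANKSPACES and describe are hypothetical names used only for illustration and are not part of pressagio.

```python
import unicodedata

# Assumed reading of the default `blankspaces` argument (escapes restored).
DEFAULT_BLANKSPACES = " \x0c\n\r\t\x0b\x85"


def describe(chars):
    """Return (code point, Unicode general category) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in chars]


# Print one line per default blankspace character.
for code_point, category in describe(DEFAULT_BLANKSPACES):
    print(code_point, category)
```

Under this reading the default covers the space, form feed, line feed, carriage return, horizontal tab, vertical tab, and the NEL control character (U+0085).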

is_blankspace(char)

Test if a character is a blankspace.

Parameters:

char : str

The character to test.

Returns:

ret : bool

True if the character is a blankspace, False otherwise.
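The documented behavior amounts to a membership test against the configured blankspaces string. A minimal standalone sketch follows, assuming that reading; it does not import pressagio and is not the library's implementation. The single-character guard is an addition of this sketch, included because Python's `in` operator on strings matches substrings.

```python
# Assumed default, taken from the constructor signature with escapes restored.
BLANKSPACES = " \x0c\n\r\t\x0b\x85"


def is_blankspace(char, blankspaces=BLANKSPACES):
    """Return True if `char` is one of the blankspace characters."""
    # Guard added for this sketch: "" and multi-character strings would
    # otherwise be matched as substrings by the `in` operator.
    if len(char) != 1:
        raise ValueError("expected a single character")
    return char in blankspaces


print(is_blankspace(" "))   # a plain space
print(is_blankspace("a"))   # an ordinary letter
```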

is_separator(char)

Test if a character is a separator.

Parameters:

char : str

The character to test.

Returns:

ret : bool

True if the character is a separator, False otherwise.
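To illustrate how the two predicates could mark token boundaries, the sketch below splits a string wherever a separator or blankspace occurs. It is a hypothetical illustration only: split_tokens is not a pressagio function, the SEPARATORS string is an assumed subset of the default shown in the signature, and the library's own next_token() may behave differently.

```python
# Assumed subset of the default separators (ASCII punctuation, digits,
# and the Devanagari danda and double danda).
SEPARATORS = "`~!@#$%^&*()_-+=\\|]}[{'\";:/?.>,<0123456789\u0964\u0965"
BLANKSPACES = " \x0c\n\r\t\x0b\x85"


def is_separator(char, separators=SEPARATORS):
    """Return True if `char` is one of the separator characters."""
    return char in separators


def is_blankspace(char, blankspaces=BLANKSPACES):
    """Return True if `char` is one of the blankspace characters."""
    return char in blankspaces


def split_tokens(text):
    """Hypothetical helper: collect runs of non-boundary characters."""
    tokens = []
    current = []
    for char in text:
        if is_separator(char) or is_blankspace(char):
            # A boundary character closes the token being built, if any.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(char)
    if current:
        tokens.append("".join(current))
    return tokens


print(split_tokens("Hello, world! 42 times"))
```

Because the default separators include the digits 0 through 9, numbers are dropped rather than returned as tokens in this sketch.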