Added Tokens
AddedToken
class tokenizers.AddedToken(content, single_word=False, lstrip=False, rstrip=False, normalized=True)
Parameters

- `content` (`str`) — The content of the token.
- `single_word` (`bool`, defaults to `False`) — Defines whether this token should only match single words. If `True`, this token will never match inside of a word. For example, the token `ing` would match on `tokenizing` if this option is `False`, but not if it is `True`. The notion of "inside of a word" is defined by the word boundaries pattern in regular expressions (i.e. the token should start and end with word boundaries).
- `lstrip` (`bool`, defaults to `False`) — Defines whether this token should strip all potential whitespace on its left side. If `True`, this token will greedily match any whitespace on its left. For example, if we try to match the token `[MASK]` with `lstrip=True` in the text `"I saw a [MASK]"`, we would match on `" [MASK]"` (note the space on the left).
- `rstrip` (`bool`, defaults to `False`) — Defines whether this token should strip all potential whitespace on its right side. If `True`, this token will greedily match any whitespace on its right. It works just like `lstrip`, but on the right.
- `normalized` (`bool`, defaults to `True` with `add_tokens()` and `False` with `add_special_tokens()`) — Defines whether this token should match against the normalized version of the input text. For example, with the added token `"yesterday"` and a normalizer in charge of lowercasing the text, the token could be extracted from the input `"I saw a lion Yesterday"`.
Represents a token that can be added to a Tokenizer. It can have special options that define the way it should behave.
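As an illustration, here is a minimal sketch of two of these options in action. The tiny `WordLevel` vocabulary and the lowercasing normalizer are assumptions made for the example, not part of the `AddedToken` API itself:

```python
from tokenizers import Tokenizer, AddedToken
from tokenizers.models import WordLevel
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace

# A tiny tokenizer built from scratch so the example is self-contained.
vocab = {"i": 0, "saw": 1, "a": 2, "lion": 3, "[UNK]": 4}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = Whitespace()

# normalized=True: the token is matched against the lowercased text,
# so the capitalized "Yesterday" in the input is still extracted.
tokenizer.add_tokens([AddedToken("yesterday", normalized=True)])
print(tokenizer.encode("I saw a lion Yesterday").tokens)
# ['i', 'saw', 'a', 'lion', 'yesterday']

# single_word=True: "ing" only matches as a standalone word,
# so it is not extracted from inside "tokenizing".
tokenizer.add_tokens([AddedToken("ing", single_word=True)])
print(tokenizer.encode("tokenizing").tokens)
# ['[UNK]']
```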