Quicktour
Let’s have a quick look at the 🤗 Tokenizers library features. The library provides an implementation of today’s most used tokenizers that is both easy to use and blazing fast.
Build a tokenizer from scratch
To illustrate how fast the 🤗 Tokenizers library is, let’s train a new tokenizer on wikitext-103 (516M of text) in just a few seconds. First things first, you will need to download this dataset and unzip it with:
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip unzip wikitext-103-raw-v1.zip
Training the tokenizer
In this tour, we will build and train a Byte-Pair Encoding (BPE) tokenizer. For more information about the different type of tokenizers, check out this guide in the 🤗 Transformers documentation. Here, training the tokenizer means it will learn merge rules by:
- Start with all the characters present in the training corpus as tokens.
- Identify the most common pair of tokens and merge it into one token.
- Repeat until the vocabulary (e.g., the number of tokens) has reached the size we want.
The main API of the library is the class
Tokenizer
, here is how
we instantiate one with a BPE model:
from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
To train our tokenizer on the wikitext files, we will need to
instantiate a [trainer]{.title-ref}, in this case a
BpeTrainer
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
We can set the training arguments like vocab_size
or min_frequency
(here
left at their default values of 30,000 and 0) but the most important
part is to give the special_tokens
we
plan to use later on (they are not used at all during training) so that
they get inserted in the vocabulary.
The order in which you write the special tokens list matters: here "[UNK]"
will get the ID 0,
"[CLS]"
will get the ID 1 and so forth.
We could train our tokenizer right now, but it wouldn’t be optimal.
Without a pre-tokenizer that will split our inputs into words, we might
get tokens that overlap several words: for instance we could get an
"it is"
token since those two words
often appear next to each other. Using a pre-tokenizer will ensure no
token is bigger than a word returned by the pre-tokenizer. Here we want
to train a subword BPE tokenizer, and we will use the easiest
pre-tokenizer possible by splitting on whitespace.
from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()
Now, we can just call the Tokenizer.train
method with any list of files we want to use:
files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
tokenizer.train(files, trainer)
This should only take a few seconds to train our tokenizer on the full
wikitext dataset! To save the tokenizer in one file that contains all
its configuration and vocabulary, just use the
Tokenizer.save
method:
tokenizer.save("data/tokenizer-wiki.json")
and you can reload your tokenizer from that file with the
Tokenizer.from_file
classmethod
:
tokenizer = Tokenizer.from_file("data/tokenizer-wiki.json")
Using the tokenizer
Now that we have trained a tokenizer, we can use it on any text we want
with the Tokenizer.encode
method:
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
This applied the full pipeline of the tokenizer on the text, returning
an Encoding
object. To learn more
about this pipeline, and how to apply (or customize) parts of it, check out this page.
This Encoding
object then has all the
attributes you need for your deep learning model (or other). The
tokens
attribute contains the
segmentation of your text in tokens:
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
Similarly, the ids
attribute will
contain the index of each of those tokens in the tokenizer’s
vocabulary:
print(output.ids)
# [27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35]
An important feature of the 🤗 Tokenizers library is that it comes with
full alignment tracking, meaning you can always get the part of your
original sentence that corresponds to a given token. Those are stored in
the offsets
attribute of our
Encoding
object. For instance, let’s
assume we would want to find back what caused the
"[UNK]"
token to appear, which is the
token at index 9 in the list, we can just ask for the offset at the
index:
print(output.offsets[9])
# (26, 27)
and those are the indices that correspond to the emoji in the original sentence:
sentence = "Hello, y'all! How are you 😁 ?"
sentence[26:27]
# "😁"
Post-processing
We might want our tokenizer to automatically add special tokens, like
"[CLS]"
or "[SEP]"
. To do this, we use a post-processor.
TemplateProcessing
is the most
commonly used, you just have to specify a template for the processing of
single sentences and pairs of sentences, along with the special tokens
and their IDs.
When we built our tokenizer, we set "[CLS]"
and "[SEP]"
in positions 1
and 2 of our list of special tokens, so this should be their IDs. To
double-check, we can use the Tokenizer.token_to_id
method:
tokenizer.token_to_id("[SEP]")
# 2
Here is how we can set the post-processing to give us the traditional BERT inputs:
from tokenizers.processors import TemplateProcessing
tokenizer.post_processor = TemplateProcessing(
single="[CLS] $A [SEP]",
pair="[CLS] $A [SEP] $B:1 [SEP]:1",
special_tokens=[
("[CLS]", tokenizer.token_to_id("[CLS]")),
("[SEP]", tokenizer.token_to_id("[SEP]")),
],
)
Let’s go over this snippet of code in more details. First we specify
the template for single sentences: those should have the form
"[CLS] $A [SEP]"
where
$A
represents our sentence.
Then, we specify the template for sentence pairs, which should have the
form "[CLS] $A [SEP] $B [SEP]"
where
$A
represents the first sentence and
$B
the second one. The
:1
added in the template represent the type IDs
we want for each part of our input: it defaults
to 0 for everything (which is why we don’t have
$A:0
) and here we set it to 1 for the
tokens of the second sentence and the last "[SEP]"
token.
Lastly, we specify the special tokens we used and their IDs in our tokenizer’s vocabulary.
To check out this worked properly, let’s try to encode the same sentence as before:
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["[CLS]", "Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?", "[SEP]"]
To check the results on a pair of sentences, we just pass the two
sentences to Tokenizer.encode
:
output = tokenizer.encode("Hello, y'all!", "How are you 😁 ?")
print(output.tokens)
# ["[CLS]", "Hello", ",", "y", "'", "all", "!", "[SEP]", "How", "are", "you", "[UNK]", "?", "[SEP]"]
You can then check the type IDs attributed to each token is correct with
print(output.type_ids)
# [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
If you save your tokenizer with Tokenizer.save
, the post-processor will be saved along.
Encoding multiple sentences in a batch
To get the full speed of the 🤗 Tokenizers library, it’s best to
process your texts by batches by using the
Tokenizer.encode_batch
method:
output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])
The output is then a list of Encoding
objects like the ones we saw before. You can process together as many
texts as you like, as long as it fits in memory.
To process a batch of sentences pairs, pass two lists to the
Tokenizer.encode_batch
method: the
list of sentences A and the list of sentences B:
output = tokenizer.encode_batch(
[["Hello, y'all!", "How are you 😁 ?"], ["Hello to you too!", "I'm fine, thank you!"]]
)
When encoding multiple sentences, you can automatically pad the outputs
to the longest sentence present by using
Tokenizer.enable_padding
, with the
pad_token
and its ID (which we can
double-check the id for the padding token with
Tokenizer.token_to_id
like before):
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")
We can set the direction
of the padding
(defaults to the right) or a given length
if we want to pad every sample to that specific number (here
we leave it unset to pad to the size of the longest text).
output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])
print(output[1].tokens)
# ["[CLS]", "How", "are", "you", "[UNK]", "?", "[SEP]", "[PAD]"]
In this case, the attention mask
generated by the
tokenizer takes the padding into account:
print(output[1].attention_mask)
# [1, 1, 1, 1, 1, 1, 1, 0]
Pretrained
Using a pretrained tokenizer
You can load any tokenizer from the Model Database Hub as long as a
tokenizer.json
file is available in the repository.
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
Importing a pretrained tokenizer from legacy vocabulary files
You can also import a pretrained tokenizer directly in, as long as you have its vocabulary file. For instance, here is how to import the classic pretrained BERT tokenizer:
from tokenizers import BertWordPieceTokenizer
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
as long as you have downloaded the file bert-base-uncased-vocab.txt
with
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt