Chapter 1 · Lesson 4
Tokenizers in practice
You will almost never train your own tokenizer; you load the one that ships with the model you are using. This lesson shows the two libraries you will actually meet, tiktoken for OpenAI-style models and transformers for open models, and ends with the most important rule in the chapter.
A model only understands the tokenizer it was trained with. The two are an inseparable pair.
Text to IDs and back with tiktoken
# Install first: pip install tiktoken
import tiktoken
# Load cl100k_base, the encoding used by GPT-4 and GPT-3.5-turbo.
enc = tiktoken.get_encoding("cl100k_base")
text = "Tokenization is the first step."
ids = enc.encode(text) # text → integer IDs
pieces = [enc.decode([i]) for i in ids] # inspect each piece
print(pieces)
# ['Token', 'ization', ' is', ' the', ' first', ' step', '.']
print(enc.decode(ids) == text) # IDs → text, lossless
# True
Run it and study the pieces (the exact integer IDs depend on the encoding version, so trust your own output over any tutorial). Three observations:
- Tokenization splits into Token + ization. It was not frequent enough in the training corpus to earn a single token, exactly as the BPE lesson predicts.
- The space belongs to the token. The pieces are ' is' and ' the', with a leading space, not 'is' and 'the'. This means 'hello' and ' hello' are different tokens with different IDs, a fact with real consequences we revisit in the final lesson.
- The period . is its own token. Punctuation usually is.
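Why does the space travel with the word? GPT-2-style byte-level tokenizers first cut the text into chunks with a regular expression, and that pattern attaches an optional leading space to each word before any merges run. The sketch below uses a simplified, ASCII-only stand-in for that pattern (the real one also handles Unicode letters and contractions). The name PRETOKENIZE is made up for this example.

```python
import re

# Simplified, ASCII-only stand-in for a GPT-2-style pre-tokenization pattern.
# Assumption: the real pattern also covers Unicode letters and contractions.
PRETOKENIZE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

chunks = PRETOKENIZE.findall("Tokenization is the first step.")
print(chunks)
# ['Tokenization', ' is', ' the', ' first', ' step', '.']
```

Merges then run inside each chunk, which is why a piece such as ' is' can keep its leading space but never spans two words.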
The same thing with Hugging Face
Open models like Llama or GPT-2 ship their tokenizer next to the model weights, and AutoTokenizer loads the matching one by name:
# Install first: pip install transformers
from transformers import AutoTokenizer
# Always load the tokenizer by the SAME name as the model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.tokenize("Tokenization is the first step."))
# ['Token', 'ization', 'Ġis', 'Ġthe', 'Ġfirst', 'Ġstep', '.']
The Ġ symbol surprises everyone the first time. It is purely display notation: GPT-2’s byte-level tokenizer maps every byte to a printable character, and the space byte comes out as Ġ. Decoding turns it back into an ordinary space. For this sentence the split is the same as tiktoken’s, but the integer IDs are completely different, which brings us to the rule.
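To see where Ġ comes from, here is a sketch of the byte-to-character table that GPT-2-style byte-level tokenizers use. It follows the scheme of GPT-2's reference encoder: bytes that are already printable keep their own character, and every other byte is moved to code point 256 and above. The function and variable names are illustrative, not part of any library.

```python
def byte_to_display_char():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    mapping = {b: chr(b) for b in printable}
    # Remaining bytes (space, newline, control bytes) get code points from 256 up.
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping

TABLE = byte_to_display_char()
INVERSE = {ch: b for b, ch in TABLE.items()}

def to_display(text):
    """Show text the way the tokenizer's vocabulary spells it."""
    return "".join(TABLE[b] for b in text.encode("utf-8"))

def from_display(shown):
    """Undo the display mapping and recover the original text."""
    return bytes(INVERSE[ch] for ch in shown).decode("utf-8")

print(to_display(" is"))               # Ġis
print(from_display("Ġis") == " is")    # True
```

The space is byte 32, the 33rd byte without a printable form, so it lands on code point 256 + 32 = 288, which is Ġ.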
Never mix tokenizer and model
Each tokenizer has its own vocabulary and its own ordered merge rules, so the same sentence produces entirely different IDs under different tokenizers. A given ID, say 318, means one specific text piece to GPT-2 and an unrelated piece, or nothing at all, to Llama.
Now recall what a language model is: a network trained on token IDs from exactly one tokenizer. Every weight it learned is organized around that one mapping. Feed it IDs produced by a different tokenizer and every input is scrambled. Crucially, as long as the IDs fall inside the model’s vocabulary range, the model does not crash and prints no warning. It just generates fluent-looking garbage. (IDs beyond the vocabulary size usually fail with an index error in the embedding lookup, which is the lucky case.) This is one of the classic silent bugs when people serve or fine-tune models by hand, and it is why AutoTokenizer.from_pretrained takes the model’s name: the API is designed to keep the pair matched.
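The silent failure can be reproduced without any model. In the sketch below, two hypothetical toy vocabularies (VOCAB_A and VOCAB_B, invented for illustration) hold the same seven pieces in a different order. IDs produced with one and read with the other are all valid, so nothing raises, yet the meaning is scrambled.

```python
# Two hypothetical toy vocabularies: same pieces, different ID assignments.
VOCAB_A = ["Token", "ization", " is", " the", " first", " step", "."]
VOCAB_B = [" the", ".", "Token", " step", " is", "ization", " first"]

def encode(pieces, vocab):
    """Look up each piece's ID (its position in the vocabulary list)."""
    return [vocab.index(piece) for piece in pieces]

def decode(ids, vocab):
    """Turn IDs back into text using the given vocabulary."""
    return "".join(vocab[i] for i in ids)

pieces = ["Token", "ization", " is", " the", " first", " step", "."]
ids = encode(pieces, VOCAB_A)

matched = decode(ids, VOCAB_A)       # same vocabulary on both sides
mismatched = decode(ids, VOCAB_B)    # IDs from A, read with B: no error raised

print(repr(matched))      # 'Tokenization is the first step.'
print(repr(mismatched))   # ' the.Token step isization first'
```

A real model is in the position of the second decode call: it receives well-formed IDs and has no way to tell that they were numbered by a different vocabulary.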
Verify it yourself
Three assertions that confirm this lesson’s claims on your machine, independent of tokenizer version:
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
# 1. Encoding then decoding returns the exact input.
s = "Tokenization is the first step."
assert enc.decode(enc.encode(s)) == s
# 2. A leading space changes the tokens.
assert enc.encode("hello") != enc.encode(" hello")
# 3. A rare word splits into more than one token.
assert len(enc.encode("unhappiness")) > 1
If you inspect a tokenizer’s full vocabulary, you will also find strange entries like <|endoftext|> that never appear when encoding ordinary text. Those are the subject of the next lesson: special tokens.