What is a token? — hexahype

A token is the smallest unit of text the model works with. It is not always a word. Depending on the tokenizer, a token can be a whole word, a piece of a word, a single character, or a byte. Why so many options? Because the people who design tokenizers face a real trade-off, and each choice sits at a different point on it. The clearest way to understand the modern answer is to try the two obvious extremes first and watch both fail.

Subword tokens are the compromise: common words stay whole, rare words split into known pieces, and nothing is ever unknown.

Attempt 1: one token per word

The natural first idea: give every word its own ID. “the” is 1, “cat” is 2, “sat” is 3, and so on. Two problems sink it.

First, the vocabulary explodes. The vocabulary is the complete, fixed list of tokens the model knows, decided once before training and frozen forever after. English alone has hundreds of thousands of words. Add names, places, typos, slang, technical jargon, and then every other language on earth, and the list never stops growing. A larger vocabulary makes the model itself larger, because, as we will see in lesson 6, the model stores a block of learned numbers for every single vocabulary entry.
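To make the size argument concrete, here is a rough arithmetic sketch. It assumes each vocabulary entry stores one vector of learned numbers, and the vector length of 768 is a hypothetical value chosen only for illustration; the function name is likewise made up for this example.

```python
def embedding_parameters(vocab_size: int, width: int) -> int:
    """Count the learned numbers needed for one vector of `width` per vocabulary entry."""
    return vocab_size * width

WIDTH = 768  # hypothetical vector length, not taken from any particular model

# Compare a character-sized, a subword-sized, and a word-sized vocabulary.
for vocab_size in (100, 50_000, 500_000):
    count = embedding_parameters(vocab_size, WIDTH)
    print(f"{vocab_size:>7,} entries -> {count:>11,} learned numbers")
```

The count grows in direct proportion to the vocabulary, so a list ten times longer costs ten times as many stored numbers before the model has learned anything else.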

Second, and worse: the frozen list guarantees missing words. Any word not on the list gets replaced by a single “unknown” token, written <UNK>. If a user types “blockchain” and that word did not exist when the vocabulary was built, the model receives <UNK> and every bit of information about the word is destroyed. The model cannot even guess from the spelling, because it never sees the spelling.
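A minimal sketch of the failure, assuming a toy four-entry vocabulary and simple whitespace splitting (the names `vocab`, `encode_words`, and `decode_words` are hypothetical, not part of any library):

```python
# Toy word-level tokenizer: the vocabulary is fixed up front and never grows.
UNK = "<UNK>"
vocab = {UNK: 0, "the": 1, "cat": 2, "sat": 3}  # hypothetical IDs

def encode_words(text: str) -> list[int]:
    """Map each whitespace-separated word to its ID, or to the <UNK> ID if missing."""
    return [vocab.get(word, vocab[UNK]) for word in text.lower().split()]

def decode_words(ids: list[int]) -> list[str]:
    """Map IDs back to vocabulary entries; the original spelling of unknown words is gone."""
    id_to_word = {i: w for w, i in vocab.items()}
    return [id_to_word[i] for i in ids]

ids = encode_words("the cat sat on the blockchain")
print(ids)                # [1, 2, 3, 0, 1, 0]
print(decode_words(ids))  # ['the', 'cat', 'sat', '<UNK>', 'the', '<UNK>']
```

Both “on” and “blockchain” collapse to the same ID 0, so after encoding there is no way to tell them apart, let alone recover how either was spelled.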

Attempt 2: one token per character

So swing to the opposite extreme: only single characters. Now the vocabulary is tiny, perhaps a hundred entries, and nothing is ever unknown, since every word is just a sequence of known characters. This actually works, and early models tried it. But it pays a heavy price in sequence length: “internationalization” becomes 20 tokens, one per letter, instead of 1 or 2, and every sentence grows several times longer. That hurts twice:

Compute cost grows. The model performs work for every token it processes, so longer sequences are slower and more expensive, and they fill up the context window faster.
Learning gets harder. The model must spend capacity discovering that those 20 characters form one unit with one meaning, instead of receiving that fact for free.
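The length penalty is easy to measure. The sketch below assumes the simplest possible character tokenizer, one that uses each character's Unicode code point as its ID; a real character-level model would more likely use a small lookup table, and `encode_chars` is a hypothetical name.

```python
def encode_chars(text: str) -> list[int]:
    """Character-level encoding: one token per character, with its code point as the ID."""
    return [ord(ch) for ch in text]

word = "internationalization"
print(len(encode_chars(word)))      # 20

sentence = "the cat sat on the mat"
print(len(sentence.split()))        # 6 tokens at one per word
print(len(encode_chars(sentence)))  # 22 tokens at one per character, spaces included
```

Even this short sentence is more than three times longer in characters than in words, and the model pays for every one of those extra positions.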

The compromise: subwords

Modern tokenizers use subword tokenization, the middle ground that takes the best of both:

"the"          → ["the"]                 common word, one token
"unhappiness"  → ["un", "happiness"]     rare word, two known pieces
"qzxv"         → ["q", "z", "x", "v"]    nonsense still encodes, character by character

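Common words stay whole, keeping sequences short. Rare words split into pieces the vocabulary does know. And in the worst case, a completely novel string falls all the way down to single characters or, in byte-level tokenizers, to raw bytes. With a byte-level fallback the <UNK> problem disappears entirely, because every string is a sequence of bytes and all 256 byte values are in the vocabulary: everything is encodable.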
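The three cases above can be reproduced with a toy splitter. This is a sketch under loud assumptions: the vocabulary is hand-written rather than learned, the splitting rule is greedy longest-match (one simple strategy, not necessarily what any given production tokenizer does), and the names `VOCAB` and `encode_subwords` are hypothetical.

```python
# Toy subword vocabulary (hypothetical): a few whole words, a few fragments,
# and every lowercase letter as a last-resort fallback.
VOCAB = {"the", "un", "happiness", "happy", "ness"} | set("abcdefghijklmnopqrstuvwxyz")

def encode_subwords(word: str) -> list[str]:
    """Greedy longest-match: at each position take the longest vocabulary entry that fits."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Only reached when even the single character is missing from VOCAB.
            raise ValueError(f"character {word[i]!r} is not in the vocabulary")
    return pieces

print(encode_subwords("the"))          # ['the']
print(encode_subwords("unhappiness"))  # ['un', 'happiness']
print(encode_subwords("qzxv"))         # ['q', 'z', 'x', 'v']
```

Note the `ValueError` branch: this toy version falls back only to lowercase letters, so a digit or an accented character would still fail. Falling back to bytes instead of characters is what removes that last gap.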

Notice what decides where the splits go: frequency. A word common enough in the tokenizer’s training data earns its own token. A rarer word gets assembled from frequent fragments. This gives a rule of thumb worth memorizing, because it predicts cost and model behavior everywhere: frequent text produces few tokens, rare or unusual text produces many.

How does a tokenizer learn which fragments are frequent? Not from a dictionary and not from a linguist. In the next lesson we build the counting algorithm that learns the splits automatically from raw text.