Byte pair encoding

Who decides that unhappiness splits into un + happiness? Not a linguist with a dictionary. The splits are learned automatically from data by a counting algorithm called byte pair encoding (BPE). It powers the tokenizers of GPT models, Llama, and most other modern LLMs, and it is simple enough to build completely in this lesson.

BPE has one rule: find the most frequent pair of adjacent symbols, merge it into a new symbol, repeat.

The recipe

Training a BPE tokenizer happens once, before the language model itself is trained:

Start with a base vocabulary of all individual characters (or bytes).
Go through a large text corpus and count every pair of adjacent symbols.
Take the most frequent pair, merge it into one new symbol, add that symbol to the vocabulary, and record the merge as a rule.
Repeat steps 2 and 3 until the vocabulary reaches a target size, for example 50,000 entries.

The output is two artifacts: the final vocabulary, and the ordered list of merge rules. When the tokenizer later meets new text, it does not count anything. It simply replays the recorded merges, in order, on the new text. That replay is what encode() does.

A worked example

Take a tiny corpus where “low” appears 5 times, “lower” 2 times, and “lowest” 2 times. Every word starts split into characters:

l o w (x5)    l o w e r (x2)    l o w e s t (x2)

Count pairs. The pair (l, o) appears in all 9 words, so it wins and merges into lo. Count again: now (lo, w) appears 9 times and merges into low:

low (x5)      low e r (x2)      low e s t (x2)

After just two merges, the frequent word “low” is a single token, while the rarer words are still in pieces. Run more merges and lower and lowest eventually become whole too, if the vocabulary budget allows. Frequency decides everything.

Build it yourself

The algorithm is short enough to state precisely as code, which is the clearest possible description of it. This trainer is real, runnable Python:

# A tiny BPE trainer. Educational, not optimized.
from collections import Counter

def get_pair_counts(words):
    # Count how often each adjacent symbol pair appears,
    # weighted by how often the word appears in the corpus.
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, words):
    # Replace every occurrence of the pair with one merged symbol.
    a, b = pair
    merged = {}
    for word, freq in words.items():
        new_word = word.replace(f"{a} {b}", f"{a}{b}")
        merged[new_word] = freq
    return merged

# Corpus: each word split into characters, with its frequency.
words = {
    "l o w": 5,        # "low" appears 5 times
    "l o w e r": 2,    # "lower" appears 2 times
    "l o w e s t": 2,  # "lowest" appears 2 times
}

num_merges = 4
for step in range(1, num_merges + 1):
    pairs = get_pair_counts(words)
    best = max(pairs, key=pairs.get)          # most frequent pair
    words = merge_pair(best, words)
    print(f"merge {step}: {best[0]} + {best[1]} -> {best[0] + best[1]}")
    print("  corpus now:", list(words.keys()))

Running it prints:

merge 1: l + o -> lo
  corpus now: ['lo w', 'lo w e r', 'lo w e s t']
merge 2: lo + w -> low
  corpus now: ['low', 'low e r', 'low e s t']
merge 3: low + e -> lowe
  corpus now: ['low', 'lowe r', 'lowe s t']
merge 4: lowe + r -> lower
  corpus now: ['low', 'lower', 'lowe s t']

One subtle detail in merges 3 and 4: after low exists, the pairs (low, e), (lowe, r), and (lowe, s) each appear exactly twice. They tie. This simple code breaks ties by taking the first pair it counted. Real implementations also have fixed tie-breaking rules, which is why the same tokenizer always produces exactly the same tokens for the same text. Determinism matters: the language model depends on it.

Three consequences worth remembering

The tokenizer mirrors its training data. If the corpus was mostly English, English words earn short, efficient tokens, while Hindi, Japanese, or rare programming languages get chopped into many small pieces. Same meaning, more tokens, more cost, less room in the context window.
Merges are frozen. After training, the rules never change. A word invented years later still encodes, just less efficiently, assembled from older pieces.
Splits are statistical, not grammatical. un + happiness looks like linguistics, but BPE has no concept of a prefix. The split exists because those fragments were frequent.

One last variant worth naming: most modern tokenizers run BPE on bytes rather than characters. The base vocabulary is then exactly the 256 possible byte values, which guarantees that any input whatsoever, including emoji and every script on earth, can be encoded. This is called byte-level BPE.

We can now build a tokenizer. In the next lesson we stop building toys and drive the real one used by GPT-class models from Python.