⬑ hexahype

Chapter 1 · Lesson 9 · 5 min read

Why LLMs act the way they do

Everything in this chapter pays off here. Six behaviors that puzzle every new LLM user are not mysteries and not bugs: each follows directly from the mechanics of tokens and embeddings you now know.

Most “weird” LLM behavior around cost, spelling, and languages is just tokenization showing through.

The six quirks

  1. Pricing and limits are in tokens, not words. A “128k context window” means 128,000 tokens, and API bills count tokens. For ordinary English, 1 token is roughly 0.75 words, so 128k tokens is about 96,000 words. The ratio is only a rule of thumb, and it degrades for code, unusual vocabulary, and other languages, which is the next quirk.

  2. Non-English text costs more. BPE merges are learned from frequency, and most training corpora are English-heavy, so English earned the short, efficient tokens. The same sentence in Hindi, Japanese, or Thai splits into several times more tokens. Same meaning, more tokens, higher bill, and less of the context window left for actual content.

  3. Letter-counting and spelling tasks are hard. Ask a model how many times “r” appears in “strawberry” and it may miscount. Lesson 1 explains why: the model never sees letters. It sees token pieces, perhaps str + aw + berry, and the individual characters inside a token are not directly visible to it. It is being asked about ink it cannot see.

  4. Arithmetic is wobbly. One contributing reason: numbers tokenize inconsistently. “1234” might be a single token while “1235” splits as “12” + “35”, depending purely on which digit sequences were frequent in the training data. Doing digit-level math on inputs that are not split at digit boundaries is needlessly hard. Many newer tokenizers split numbers into single digits or fixed-size digit groups, which is one reason model arithmetic has improved.

  5. Whitespace matters. As lesson 4 showed, 'hello' and ' hello' are different tokens, which by lesson 6 means different rows of the embedding matrix and therefore entirely different vectors. A stray space at the end of a prompt genuinely changes the input the model receives, and can change its continuation.

  6. Tokenizer and model are married. Every ID is an index into one specific vocabulary and one specific embedding matrix E. Encode with the wrong tokenizer and every lookup retrieves the wrong row: the model receives well-formed vectors that mean the wrong things, so it produces fluent garbage with no error message.
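
Quirk 6 can be made concrete with a toy sketch. The two vocabularies below are invented for illustration (they are not real tokenizers), and a plain list of strings stands in for the rows of the embedding matrix E. IDs produced with one vocabulary are looked up in the other: every index is valid, so nothing raises, but each lookup lands on the wrong row.

```python
# Toy illustration of a tokenizer/model mismatch.
# Both vocabularies are hypothetical; real ones hold tens of thousands of entries.
VOCAB_A = ["the", "cat", "sat", "on", "mat"]
VOCAB_B = ["mat", "on", "the", "sat", "cat"]  # same tokens, different ID assignment


def encode(words, vocab):
    """Map each word to its index (token ID) in the given vocabulary."""
    index = {token: i for i, token in enumerate(vocab)}
    return [index[w] for w in words]


def lookup(ids, vocab):
    """Stand-in for the embedding lookup: fetch the row each ID points to."""
    return [vocab[i] for i in ids]


sentence = ["the", "cat", "sat", "on", "the", "mat"]
ids = encode(sentence, VOCAB_A)

matched = lookup(ids, VOCAB_A)     # tokenizer and "model" agree
mismatched = lookup(ids, VOCAB_B)  # IDs from A read against B's rows

print(ids)         # [0, 1, 2, 3, 0, 4]
print(matched)     # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(mismatched)  # ['mat', 'on', 'the', 'sat', 'mat', 'cat']
```

The mismatched lookup completes without any exception, which is the point: nothing in the ID sequence itself records which vocabulary produced it.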

The chapter in one picture

"Hello world"
  → tokens, via BPE merges learned once from data       (lessons 2-4)
  → integer IDs, indices into the vocabulary            (lesson 4)
  → vectors, learned rows of E ∈ R^(V×d)                (lessons 6-7)
  → plus position: h = x + p                            (lesson 8)
  → into the model
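
The same pipeline can be traced in a few lines of plain Python. This is a minimal sketch under loud assumptions: the three-entry vocabulary is made up, the greedy longest-match splitter is a stand-in for BPE rather than BPE itself, and the values in E and P are random placeholders for what a real model would learn.

```python
import random

# Hypothetical toy vocabulary; a real tokenizer learns its pieces via BPE merges.
VOCAB = ["Hello", " world", "!"]
V = len(VOCAB)   # vocabulary size
d = 4            # embedding dimension (tiny, for readability)
MAX_LEN = 8      # longest sequence this toy supports

rng = random.Random(0)
# E has one row of d numbers per token (shape V x d). Random here, learned in practice.
E = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(V)]
# P has one row per position (shape MAX_LEN x d). Also placeholder values.
P = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(MAX_LEN)]


def tokenize(text):
    """Greedy longest-match split against VOCAB (a stand-in for BPE)."""
    pieces = []
    while text:
        # Raises ValueError if no vocabulary entry matches; fine for a toy.
        match = max((t for t in VOCAB if text.startswith(t)), key=len)
        pieces.append(match)
        text = text[len(match):]
    return pieces


def embed(text):
    """text -> tokens -> integer IDs -> rows of E -> h = x + p."""
    tokens = tokenize(text)
    ids = [VOCAB.index(t) for t in tokens]
    xs = [E[i] for i in ids]
    hs = [[x_k + p_k for x_k, p_k in zip(x, P[pos])] for pos, x in enumerate(xs)]
    return tokens, ids, hs


tokens, ids, hs = embed("Hello world")
print(tokens)  # ['Hello', ' world']
print(ids)     # [0, 1]
print(len(hs), len(hs[0]))  # 2 4
```

Note that the second token is ' world' with its leading space, matching the whitespace quirk above.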

Check your understanding

Predict the answers, then verify with the code from lesson 4:

  1. Will "HELLO" and "hello" produce the same token IDs? Run both through enc.encode and compare.
  2. Take any 100-word English paragraph. Predict its token count using the 0.75 rule (divide the word count by 0.75), then measure with len(enc.encode(text)). How close were you?
  3. Why can a byte-level BPE tokenizer never produce an unknown-token error, no matter what you type?
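
For exercise 2, the prediction step can be sketched as a small helper. The function names here are hypothetical, words are counted by splitting on whitespace (an assumption, since "word" has no single definition), and the 0.75 ratio is only the rule of thumb for ordinary English from quirk 1.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for ordinary English only


def estimate_tokens(text):
    """Predict a token count from the whitespace-separated word count."""
    n_words = len(text.split())
    return round(n_words / WORDS_PER_TOKEN)


def words_that_fit(context_tokens):
    """Rough number of English words that fit in a context window."""
    return round(context_tokens * WORDS_PER_TOKEN)


paragraph = " ".join(["word"] * 100)  # placeholder standing in for a 100-word text
predicted = estimate_tokens(paragraph)

print(predicted)                # 133
print(words_that_fit(128_000))  # 96000

# To verify, compare against the measured count using enc from lesson 4:
# measured = len(enc.encode(paragraph))
```

The gap between the predicted and measured counts is itself informative: it tends to widen for code, rare vocabulary, and non-English text.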

In the next chapter we follow these vectors into the model itself and meet the mechanism that made modern LLMs possible: attention.