Why models need numbers

Type “Hello world” into a chatbot and it replies in fluent text. But inside, the model never sees letters. A language model is a neural network, and a neural network can only do one thing: arithmetic on numbers. It multiplies numbers, adds numbers, and compares numbers, billions of times per response. Letters, words, and punctuation simply do not exist at that level. So every piece of text must be converted into numbers before the model can touch it, and the model’s numeric output must be converted back into text before you can read it.

Every language model reads and writes numbers, not text. The conversion layer is where this chapter lives.

The pipeline

The conversion happens in two distinct steps, and this chapter covers both:

Tokenization: break the text into small pieces called tokens, and map each piece to an integer ID using a fixed list called the vocabulary.
Embedding: turn each integer ID into a vector, a list of real numbers, that the model can learn from and compute with.

"Hello world" → ["Hello", " world"] → [9906, 1917] → vectors → model
   raw text         tokens              token IDs     embeddings
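
The diagram above can be sketched in a few lines of Python. This is a toy illustration, not a real tokenizer: the three-entry vocabulary, the greedy longest-match rule, the 4-dimensional vectors, and every function name are assumptions made for this example. Real tokenizers build their vocabulary with algorithms such as byte-pair encoding, and real embedding values are learned during training rather than drawn at random.

```python
import random

# Hypothetical toy vocabulary: text piece -> integer ID.
# Two of the IDs mirror the diagram; real vocabularies hold tens of thousands of entries.
VOCAB = {"Hello": 9906, " world": 1917, "!": 0}
ID_TO_PIECE = {token_id: piece for piece, token_id in VOCAB.items()}


def tokenize(text):
    """Split text into vocabulary pieces using greedy longest match (a simplification)."""
    pieces = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(text), pos, -1):
            if text[pos:end] in VOCAB:
                pieces.append(text[pos:end])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry matches text at position {pos}")
    return pieces


def encode(text):
    """Step 1, tokenization: text -> list of integer IDs."""
    return [VOCAB[piece] for piece in tokenize(text)]


def decode(ids):
    """The reverse of step 1: list of integer IDs -> text."""
    return "".join(ID_TO_PIECE[token_id] for token_id in ids)


# Step 2, embedding: one vector per ID. Random values stand in for learned ones.
EMBED_DIM = 4
_rng = random.Random(0)
EMBEDDINGS = {
    token_id: [_rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for token_id in VOCAB.values()
}


def embed(ids):
    """Look up the vector for each token ID."""
    return [EMBEDDINGS[token_id] for token_id in ids]


ids = encode("Hello world")   # [9906, 1917]
vectors = embed(ids)          # two vectors of four numbers each
```

Note that decode(encode(text)) returns the original text exactly, which is why the space belongs to the token " world" rather than being thrown away.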

Why two steps and not one? Because they solve different problems. Tokenization solves a bookkeeping problem: text is infinitely varied, and the model needs a finite, fixed menu of units it can recognize. Embedding solves a meaning problem: an integer ID is just a position in a list, and position in a list carries no information about what a word means. The first step makes text finite. The second step makes it meaningful.
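
A small sketch makes the "meaning problem" concrete. Everything in it is invented for illustration: the three IDs, the hand-written 3-dimensional vectors, and the choice of cosine similarity as the way to compare them. A real model learns its vectors during training and uses far more dimensions.

```python
import math

# Hypothetical IDs: arbitrary positions in a vocabulary list.
TOKEN_ID = {"cat": 101, "car": 102, "dog": 9000}

# Hypothetical hand-written embeddings, chosen so related words point the same way.
EMBEDDING = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}


def id_gap(a, b):
    """Distance between two token IDs. It says nothing about meaning."""
    return abs(TOKEN_ID[a] - TOKEN_ID[b])


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)


# By ID, "cat" sits right next to "car" and far from "dog".
print(id_gap("cat", "car"), id_gap("cat", "dog"))

# By embedding, "cat" is much closer to "dog" than to "car".
print(cosine_similarity(EMBEDDING["cat"], EMBEDDING["dog"]))
print(cosine_similarity(EMBEDDING["cat"], EMBEDDING["car"]))
```

The IDs put "cat" beside "car" only because of where they happen to sit in the list. The vectors are where similarity can actually be represented.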

On the output side, the pipeline runs in reverse. The model’s actual output is a probability for every token in its vocabulary, answering one question: how likely is each token to come next? One token is chosen from those probabilities, its ID is converted back to its text piece and appended to the output, and the loop repeats, one token at a time. When you watch a chatbot “type”, you are watching this loop run.
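
The loop described above can be sketched as follows. The four-token vocabulary, the probability table, and the names toy_model and generate are all hypothetical. The stand-in "model" looks only at the last token, whereas a real model conditions on the whole sequence, and it always picks the most likely token, whereas real systems often sample from the probabilities instead.

```python
# Hypothetical four-token vocabulary; a token's index in the list is its ID.
VOCAB = ["Hello", " world", "!", "<end>"]
END_ID = 3

# Stand-in for a trained network: for each last-seen token ID, one probability
# per vocabulary entry. The numbers are invented for this example.
NEXT_TOKEN_PROBS = {
    0: [0.05, 0.80, 0.10, 0.05],  # after "Hello"
    1: [0.05, 0.05, 0.70, 0.20],  # after " world"
    2: [0.05, 0.05, 0.10, 0.80],  # after "!"
}


def toy_model(ids):
    """Return a probability for every token in the vocabulary."""
    return NEXT_TOKEN_PROBS[ids[-1]]


def generate(prompt_ids, max_new_tokens=10):
    """Repeatedly pick the most likely next token and append its text piece."""
    ids = list(prompt_ids)
    text = "".join(VOCAB[token_id] for token_id in ids)
    for _ in range(max_new_tokens):
        probs = toy_model(ids)
        # Greedy choice: the ID with the highest probability.
        next_id = max(range(len(probs)), key=probs.__getitem__)
        if next_id == END_ID:
            break
        ids.append(next_id)
        text += VOCAB[next_id]  # convert the ID back to its text piece
    return text


print(generate([0]))  # Hello world!
```

Each pass through the for loop produces exactly one token, which is why generated text appears piece by piece.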

Why this chapter matters

These two steps sound like plumbing, but they leak into everything you will do with LLMs:

Providers charge per token, not per word, so cost estimation requires understanding tokens.
A “128k context window” is measured in tokens, and how much real text fits depends on the language and content.
Models famously miscount the letters in “strawberry”, and the reason is tokenization, not stupidity.
The same prompt costs more in Hindi than in English, for a reason that will be obvious by lesson 3.

By the end of the chapter, each of these will follow directly from mechanics you can explain.
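
The first two points in the list come down to simple arithmetic on token counts. The sketch below uses made-up numbers throughout: the price per million tokens, the two token counts, and the reading of "128k" as 128,000 tokens are assumptions for illustration, not figures from any provider.

```python
def estimate_cost_usd(n_tokens, usd_per_million_tokens):
    """Cost of processing n_tokens at a given per-million-token price."""
    return n_tokens / 1_000_000 * usd_per_million_tokens


def fits_in_context(n_tokens, context_window=128_000):
    """True if the token count fits in the window (128,000 is an assumed size)."""
    return n_tokens <= context_window


# Hypothetical price and token counts for the same prompt written in two languages.
PRICE_USD_PER_MILLION = 2.0
prompt_a_tokens = 1_000
prompt_b_tokens = 3_000

cost_a = estimate_cost_usd(prompt_a_tokens, PRICE_USD_PER_MILLION)
cost_b = estimate_cost_usd(prompt_b_tokens, PRICE_USD_PER_MILLION)

# The word count never enters the calculation, only the token count does.
print(cost_a, cost_b, fits_in_context(prompt_b_tokens))
```

Two prompts that say the same thing can therefore cost different amounts and fill the context window at different rates, purely because they split into different numbers of tokens.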

In the next lesson we define the basic unit precisely: what exactly is a token, and why is it not simply a word?