From token IDs to vectors

After tokenization, a sentence is a list of integers like [3404, 2065, 374]. Here is the problem with that: an ID is just a position in the vocabulary list, and the numbering is arbitrary. ID 3404 is not “closer” in meaning to ID 3405 than to ID 49000. The token that got number 3404 simply happened to be created at that point during BPE training. Before the model can compute anything useful, each ID must become a representation that can actually carry meaning. That representation is an embedding: a vector of real numbers, learned during training.

An embedding layer is a lookup table of learned vectors, one row per token, where tokens used in similar ways end up with similar rows.

The embedding matrix

The whole mechanism is one big matrix:

$$E \in \mathbb{R}^{V \times d}$$

Read it piece by piece: $E$ is the embedding matrix. $V$ is the vocabulary size, for example 50,257 tokens for GPT-2. $d$ is the embedding dimension, for example 768. The notation $\mathbb{R}^{V \times d}$ means “a table of real numbers with $V$ rows and $d$ columns”. So each of the 50,257 tokens owns exactly one row, and that row is a list of 768 numbers. This matrix is also why vocabulary size costs model size, as promised back in lesson 2: 50,257 rows times 768 numbers is already about 38.6 million stored values, just for the lookup table.
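The arithmetic is easy to check. The short sketch below uses the GPT-2 figures quoted above; the 4 bytes per value is an assumption (32-bit floats), and real deployments may store the numbers at a different precision.

```python
# Vocabulary size and embedding dimension quoted in the text for GPT-2.
V = 50_257
d = 768

# One stored number per (row, column) cell of E.
num_values = V * d
print(f"{num_values:,} values")  # 38,597,376 values

# Assumption: each value is a 32-bit float, i.e. 4 bytes.
bytes_fp32 = num_values * 4
print(f"{bytes_fp32 / 1e6:.1f} MB at float32")  # 154.4 MB at float32
```

Doubling either $V$ or $d$ doubles both numbers, which is the sense in which a larger vocabulary costs model size.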

Converting a token ID to a vector is nothing more than selecting a row:

$$x_i = E[t_i]$$

Here $t_i$ is the token ID at position $i$ of the input, and $x_i$ is its embedding: row number $t_i$ of the matrix. No arithmetic happens at all. It is a pure lookup, like opening a book to a page number. A sentence of 5 tokens goes in as 5 integers and comes out as 5 rows, a small table of 5 by 768 numbers, ready for the network.
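As a minimal sketch, the lookup can be written as plain array indexing in NumPy. The sizes are shrunk to $V = 10$ and $d = 4$ so the example stays small, the matrix is filled with random numbers as a stand-in for learned values, and the token IDs are made up for illustration.

```python
import numpy as np

# Toy sizes so the example is cheap to run; GPT-2 would use V = 50_257, d = 768.
V, d = 10, 4

rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))  # stand-in for a learned embedding matrix

# Hypothetical 5-token input; note that token 3 appears twice.
token_ids = np.array([3, 7, 3, 0, 9])

# Row selection: X[i] is row number token_ids[i] of E. No arithmetic involved.
X = E[token_ids]

print(X.shape)                     # (5, 4)
print(np.array_equal(X[0], E[3]))  # True: position 0 holds row 3 of E
print(np.array_equal(X[0], X[2]))  # True: the same token always gets the same row
```

The last check shows that the lookup depends only on the token ID, not on where the token sits in the sentence.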

The numbers are learned, not designed

This is the crucial fact of the lesson. Nobody sits down and decides that the vector for “cat” should contain a 0.7 in slot 312. At the start of training, every entry of $E$ is random noise, and the model’s predictions are gibberish. Training then works by gradient descent: the model predicts the next token, the prediction is scored against the real text, and every number in the model, including every entry of $E$, is nudged slightly in the direction that would have made the prediction better. This repeats across billions to trillions of tokens of training text.
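One such nudge can be sketched on a deliberately tiny model. Everything here is an assumption made for illustration: the model is just the embedding lookup followed by a separate, hypothetical output matrix `W` that turns a vector into one score per token, the training pair and learning rate are made up, and a real network has many layers in between.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                             # toy sizes, not GPT-2's
E = rng.normal(scale=0.1, size=(V, d))  # embedding matrix, random at the start
W = rng.normal(scale=0.1, size=(d, V))  # hypothetical output layer: vector -> one score per token

def loss_and_probs(E, W, t_in, t_next):
    """Cross-entropy loss for predicting t_next from the embedding of t_in."""
    logits = E[t_in] @ W                # shape (V,)
    logits = logits - logits.max()      # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[t_next]), probs

t_in, t_next = 2, 5                     # made-up training pair: token 2 is followed by token 5
lr = 0.1                                # learning rate (assumed value)

loss_before, probs = loss_and_probs(E, W, t_in, t_next)

# Gradient of the loss w.r.t. the scores: predicted probabilities minus the one-hot target.
d_logits = probs.copy()
d_logits[t_next] -= 1.0

x = E[t_in].copy()
grad_W = np.outer(x, d_logits)          # gradient for the output layer, same shape as W
grad_x = W @ d_logits                   # gradient for the looked-up row of E

E_old = E.copy()
W -= lr * grad_W                        # nudge the output layer
E[t_in] -= lr * grad_x                  # nudge the row of E that was used

loss_after, _ = loss_and_probs(E, W, t_in, t_next)
print(loss_before, loss_after)          # the second value should be smaller

changed_rows = np.any(E != E_old, axis=1)
print(changed_rows)                     # only row t_in has moved
```

In this sketch the output matrix is separate from $E$, so a single step moves only the row that was looked up. Models that reuse $E$ as the output layer update more of it at once, and over many steps every token that occurs in the training text has its row adjusted.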

A remarkable side effect falls out of this pressure. Tokens that appear in similar contexts, like “cat” and “dog”, are useful to the model in similar ways, so their rows get pushed toward similar values. Tokens with nothing in common drift apart. Nobody programs meaning in. Meaning emerges from usage.

What the dimensions mean

A natural follow-up question: what do the 768 individual numbers represent? Mostly, nothing readable. A few directions in the space sometimes align with human concepts like gender, tense, or sentiment, but in general the dimensions are not labeled and not interpretable one by one. What carries meaning is the overall direction the vector points, compared with the directions of other vectors.

If meaning lives in the direction of a vector, we need a precise way to compare directions. That is the next lesson.