Position matters — hexahype

“Dog bites man” and “Man bites dog” contain exactly the same tokens. After the embedding lookup of lesson 6, both sentences become the same three vectors, merely handed over in a different order. Surely the order is enough? Here is the catch: the transformer’s attention mechanism, which we study in the next chapter, treats its inputs as an unordered set. By itself it cannot tell position 1 from position 7. Without extra help, the model literally could not distinguish the harmless sentence from the headline.

Token embeddings say what a token is; positional information says where it is. The model needs both.
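To make the problem concrete, here is a minimal sketch in plain Python. The three-word vocabulary, the token IDs, and the two-dimensional embedding values are all made up for illustration, and words stand in for real tokens. It shows that the two sentences look up exactly the same rows and differ only in the order those rows are handed over.

```python
# Toy vocabulary and embedding table. The IDs and numbers are invented
# for illustration; a real model has tens of thousands of rows.
vocab = {"dog": 0, "bites": 1, "man": 2}
embedding = [
    [0.1, 0.3],    # row 0: "dog"
    [0.7, -0.2],   # row 1: "bites"
    [-0.4, 0.5],   # row 2: "man"
]

def embed(sentence):
    """Look up one embedding row per word, in sentence order."""
    return [embedding[vocab[word]] for word in sentence.lower().split()]

a = embed("Dog bites man")
b = embed("Man bites dog")

# Viewed as an unordered collection, the two sentences are identical.
same_vectors = sorted(a) == sorted(b)
# Viewed as an ordered sequence, they are not.
same_order = a == b

print(same_vectors, same_order)  # True False
```

A component that only ever looks at the collection of vectors, and never at their order, sees the first comparison and nothing else.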

Adding position

The classic fix is disarmingly simple, plain addition:

$$h_i = x_i + p_i$$

Piece by piece: $x_i$ is the token embedding at position $i$, the row looked up in lesson 6. $p_i$ is a positional embedding, another vector of the same dimension $d$, which encodes the fact “this is position $i$” and nothing else. Adding the two element by element gives $h_i$, the vector that actually enters the first layer of the model.

The effect: the word “dog” at position 1 and the same word at position 7 now produce two different input vectors, because the same token row had two different position vectors added to it. Order has become visible, stamped directly into the numbers. Each token effectively carries both pieces of information at once, what it is and where it stands.
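The addition can be sketched in a few lines of plain Python. The dimension $d = 4$, the token row for “dog”, and the two position vectors are invented numbers, and `add_position` is a hypothetical helper name, not part of any library.

```python
# Token embedding for "dog" (made-up values, d = 4).
x_dog = [0.2, -0.1, 0.5, 0.3]

# Positional embeddings for positions 1 and 7 (made-up values, same d).
p = {
    1: [0.01, 0.02, -0.03, 0.04],
    7: [-0.05, 0.06, 0.07, -0.08],
}

def add_position(x, p_i):
    """Element-wise sum h_i = x_i + p_i of two equal-length vectors."""
    assert len(x) == len(p_i), "token and position vectors must share dimension d"
    return [xv + pv for xv, pv in zip(x, p_i)]

h_1 = add_position(x_dog, p[1])  # "dog" at position 1
h_7 = add_position(x_dog, p[7])  # "dog" at position 7

print(h_1)
print(h_7)
print(h_1 != h_7)  # True: same token, different input vectors
```

The token row is identical in both calls; only the position vector changes, and that alone is enough to make the two inputs distinguishable.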

Where the position vectors come from

Three schemes cover almost every model you will meet:

Learned positions: a second lookup table, exactly like the token embedding matrix but indexed by position number instead of token ID, and learned the same way. GPT-2 does this. Its drawback: the table has a fixed number of rows, so the model has no position vector for anything beyond the last row and cannot natively handle sequences longer than the table.
Computed positions: vectors built from fixed mathematical patterns rather than learned. The original transformer paper used sine and cosine waves of different frequencies, giving every position a unique, deterministic fingerprint.
Rotary embeddings (RoPE): the most common choice in recent models, used by the Llama family among others. Instead of adding position at the input, RoPE injects it inside the attention computation by rotating vectors through an angle that depends on their position, which tends to handle long sequences more gracefully. The mechanics belong in the attention chapter.
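The first two schemes are easy to compare side by side. The sketch below is an illustration under stated assumptions, not any model's actual code: `D_MODEL`, `MAX_LEN`, and both function names are hypothetical, and random numbers stand in for trained weights in the learned table. The sinusoidal function follows the formula from the original transformer paper, with sine on even dimensions and cosine on odd ones, and assumes an even dimension.

```python
import math
import random

D_MODEL = 8    # embedding dimension (toy value, assumed even)
MAX_LEN = 16   # number of rows in the learned table (toy value)

random.seed(0)
# Learned positions: one row per position. Random values stand in for
# weights that would normally be learned during training.
learned_table = [
    [random.gauss(0.0, 0.02) for _ in range(D_MODEL)]
    for _ in range(MAX_LEN)
]

def learned_position(pos):
    """Look up the row for `pos`; fails once pos >= MAX_LEN."""
    return learned_table[pos]

def sinusoidal_position(pos, d_model=D_MODEL):
    """Computed positions: sin/cos pairs at geometrically spaced frequencies.

    PE(pos, 2k)   = sin(pos / 10000^(2k / d_model))
    PE(pos, 2k+1) = cos(pos / 10000^(2k / d_model))
    """
    vec = []
    for k in range(d_model // 2):
        angle = pos / (10000 ** (2 * k / d_model))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec

# Both schemes return a vector of the same dimension for a valid position.
print(len(learned_position(3)), len(sinusoidal_position(3)))

# The computed scheme has no table, so any position can be evaluated.
print(len(sinusoidal_position(1000)))

# The learned table simply runs out of rows.
try:
    learned_position(MAX_LEN)
except IndexError:
    print("no learned row for position", MAX_LEN)
```

Being able to compute a vector for position 1000 does not by itself mean a model trained on short sequences will use it well; the sketch only shows where the hard limit of the lookup table comes from.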

The schemes differ in machinery, but the goal of every one is identical: make order visible to a model that would otherwise be blind to it.

The pipeline is now complete: text becomes tokens, tokens become IDs, IDs become vectors, and position is stamped in. In the final lesson we cash in all of this theory and explain the everyday quirks of LLMs.