Measuring similarity

A trained embedding layer gives every token a vector, and tokens used in similar ways get similar vectors. To make “similar” precise, we need one formula, cosine similarity, and it is among the most widely used formulas in applied AI: it underlies semantic search, recommendation systems, and retrieval-augmented generation (RAG).

Cosine similarity is the cosine of the angle between two vectors: the closer their directions, the more similar the usage of the two tokens.

The formula

$$\text{sim}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$

Piece by piece: $a$ and $b$ are the two embedding vectors being compared. $a \cdot b$ is the dot product, computed by multiplying the vectors element by element and summing the results. $\lVert a \rVert$ is the length (also called the norm) of vector $a$, and $\lVert b \rVert$ the length of $b$. Dividing the dot product by both lengths cancels out how long the vectors are, leaving only the cosine of the angle between them.

The result always lands between -1 and 1:

near 1: the vectors point almost the same way, so the tokens behave similarly in language
near 0: unrelated directions
near -1: opposite directions
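
To make the formula concrete, here is a minimal sketch in plain Python. The function name `cosine_similarity` and the choice to raise an error for a zero-length vector are our assumptions, not part of the formula itself: with a length of zero the division is simply undefined, and a real system might handle that case differently.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    # Dot product: multiply matching elements, then add.
    dot = sum(x * y for x, y in zip(a, b))
    # Lengths (norms): square the elements, sum, take the square root.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: we refuse zero vectors because they have no direction.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [3.0, 0.0]))   # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 2.0]))   # perpendicular: 0.0
print(cosine_similarity([1.0, 0.0], [-5.0, 0.0]))  # opposite direction: -1.0
```

The three calls cover the three landmark values of the range. Note that the second vector in each pair has a different length from the first, and the result does not care.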

A worked example by hand

Vectors in real models have hundreds of dimensions, but the formula works the same way in two dimensions, where we can compute it on paper. Take $a = (1, 2)$ and $b = (2, 4)$. Then:

$$a \cdot b = 1 \cdot 2 + 2 \cdot 4 = 10, \qquad \lVert a \rVert = \sqrt{1^2 + 2^2} = \sqrt{5}, \qquad \lVert b \rVert = \sqrt{2^2 + 4^2} = \sqrt{20}$$

The first value is the dot product: multiply matching elements, then add. The other two are the lengths, each computed by squaring the elements, summing, and taking the square root. Dividing: $10 / (\sqrt{5} \cdot \sqrt{20}) = 10 / \sqrt{100} = 1$. A perfect 1, and rightly so: $b$ is exactly $a$ doubled, so the two point in precisely the same direction. Length differs, direction does not, and cosine similarity only sees direction.

Now keep $a = (1, 2)$ but take $b = (-2, 1)$. The dot product is $1 \cdot (-2) + 2 \cdot 1 = 0$, so the similarity is 0 regardless of the lengths: the vectors are perpendicular, fully unrelated directions.
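
The two hand calculations can be checked step by step. The sketch below uses a hypothetical helper, `cosine_steps`, that returns the intermediate values (dot product and both lengths) alongside the final similarity. Because square roots are computed in floating point, the first result may differ from 1 in the last decimal places, so we print rounded values.

```python
import math

def cosine_steps(a, b):
    """Return (dot product, length of a, length of b, cosine similarity)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot, norm_a, norm_b, dot / (norm_a * norm_b)

# The two pairs from the worked example above.
for a, b in [((1, 2), (2, 4)), ((1, 2), (-2, 1))]:
    dot, norm_a, norm_b, sim = cosine_steps(a, b)
    print(f"a={a} b={b} dot={dot} "
          f"|a|={norm_a:.4f} |b|={norm_b:.4f} sim={sim:.4f}")
```

For the first pair the printed lengths are the decimal forms of $\sqrt{5}$ and $\sqrt{20}$. For the second pair the dot product is 0, so the similarity is 0 whatever the lengths are.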

In a trained model this machinery typically yields what you would hope: the similarity between the vectors for “king” and “queen” is much higher than between “king” and “carrot”.
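
As an illustration of how such a comparison is used, the sketch below ranks words by their similarity to “king”. The three-dimensional vectors are made up by hand for this example and do not come from any trained model; the names `toy_vectors` and `rank_by_similarity` are likewise hypothetical. Only the ordering matters here, not the particular numbers.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors, NOT taken from a trained model.
toy_vectors = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.8, 0.9, 0.2],
    "carrot": [0.1, 0.0, 0.9],
}

def rank_by_similarity(query, vectors):
    """Sort every other word by cosine similarity to the query, best first."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

for word, score in rank_by_similarity("king", toy_vectors):
    print(f"{word}: {score:.3f}")
```

With these toy vectors “queen” ranks above “carrot”. Ranking candidates by similarity to a query vector is the same pattern that semantic search builds on, just with far more vectors and dimensions.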

A famous claim, with a caveat

You will often see the example $\text{king} - \text{man} + \text{woman} \approx \text{queen}$: the idea that a consistent direction in embedding space encodes the male/female relation, so vector arithmetic performs analogies. This is partly true. It works for some word pairs, especially in older word-level embedding models like word2vec, and fails for many others. Treat it as an illustration that directions can carry relations, not as a law.
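
Mechanically, the analogy is computed by doing the arithmetic on the vectors and then looking for the nearest remaining word by cosine similarity. The sketch below shows this on hand-made toy vectors that were deliberately constructed so the analogy works; they are not from word2vec or any trained model, so the result demonstrates the procedure rather than the claim. The function name `analogy` is hypothetical, and skipping the three input words when searching is an assumed convention.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors, built so that one direction separates man/woman.
# NOT taken from a trained model.
toy_vectors = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.1, 0.8],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "carrot": [0.1, 0.3, 0.3],
}

def analogy(a, b, c, vectors):
    """Find the word closest to vectors[a] - vectors[b] + vectors[c]."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Assumed convention: the three input words are not allowed as answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    best = max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))
    return best, cosine_similarity(target, vectors[best])

word, score = analogy("king", "man", "woman", toy_vectors)
print(word, round(score, 3))
```

Here the answer is “queen” by construction. In a real model nothing guarantees this, which is exactly the caveat above.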

One name, two things

A final clarification, because the word “embedding” names two related things and beginners mix them up. This chapter’s embeddings are token embeddings: rows of $E$, one vector per token, the entry point of a language model. Separately, dedicated embedding models produce one vector for an entire sentence or document; those power semantic search and RAG, and are built on top of token embeddings. A later chapter covers them. Until then, “embedding” means a row of the lookup table.
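
To pin down what “a row of the lookup table” means, here is a minimal sketch. The four-token vocabulary, the values in `E`, and the helper names are all made up for illustration; a real model's table has tens of thousands of rows and learned values.

```python
# Hypothetical vocabulary: each token maps to a row index of E.
vocab = {"the": 0, "king": 1, "queen": 2, "carrot": 3}

# Made-up embedding table E: one row (vector) per token.
E = [
    [0.2, 0.1, 0.0],  # "the"
    [0.9, 0.8, 0.1],  # "king"
    [0.8, 0.9, 0.2],  # "queen"
    [0.1, 0.0, 0.9],  # "carrot"
]

def token_embedding(token):
    """A token embedding is just the row of E at the token's index."""
    return E[vocab[token]]

def embed_tokens(tokens):
    """Embedding a sequence yields one vector per token, not one per sentence."""
    return [token_embedding(t) for t in tokens]

print(token_embedding("king"))
print(len(embed_tokens(["the", "king"])))
```

Two tokens in, two vectors out: nothing here produces a single vector for the whole phrase, which is the job of the dedicated embedding models mentioned above.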

One ingredient is still missing from our pipeline: the model does not yet know token order. Next lesson.