Chapter 1 · Lesson 7 · 5 min read
Measuring similarity
A trained embedding layer gives every token a vector, and tokens used in similar ways get similar vectors. To make “similar” precise, we need one formula, and it is among the most widely used formulas in applied AI: it powers semantic search, recommendation systems, and retrieval-augmented generation.
Cosine similarity is the cosine of the angle between two vectors: the closer their directions, the more similar the usage.
The formula
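$$\text{cosine similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$$
Piece by piece: $\mathbf{a}$ and $\mathbf{b}$ are the two embedding vectors being compared. $\mathbf{a} \cdot \mathbf{b}$ is the dot product, computed by multiplying the vectors element by element and summing the results. $\|\mathbf{a}\|$ is the length (also called the norm) of vector $\mathbf{a}$, and $\|\mathbf{b}\|$ the length of $\mathbf{b}$. Dividing the dot product by both lengths cancels out how long the vectors are, leaving only the cosine of the angle between them.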
The result always lands between -1 and 1:
- near 1: the vectors point almost the same way, the tokens behave similarly in language
- near 0: unrelated directions
- near -1: opposite directions
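The three cases above can be checked with a few lines of plain Python. This is a minimal sketch, not a production implementation: the function name `cosine_similarity` and the axis-aligned test vectors are chosen here for illustration, and the zero-vector check is an added assumption (the formula divides by the lengths, so it is undefined when either length is 0).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    # Dot product: multiply element by element, then sum.
    dot = sum(x * y for x, y in zip(a, b))
    # Lengths (norms): square the elements, sum, take the square root.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Illustrative vectors (assumed, not from a trained model).
same = cosine_similarity([1.0, 0.0], [5.0, 0.0])       # same direction
unrelated = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # perpendicular
opposite = cosine_similarity([1.0, 0.0], [-2.0, 0.0])  # opposite direction

print(same, unrelated, opposite)
```

The function works for any number of dimensions, so the same code applies to real embedding vectors with hundreds of entries.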
A worked example by hand
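Vectors in real models have hundreds of dimensions, but the formula behaves identically in two, where we can compute it on paper. Take any nonzero $\mathbf{a} = (a_1, a_2)$ and $\mathbf{b} = 2\mathbf{a} = (2a_1, 2a_2)$. Then:
$$\mathbf{a} \cdot \mathbf{b} = 2a_1^2 + 2a_2^2, \qquad \|\mathbf{a}\| = \sqrt{a_1^2 + a_2^2}, \qquad \|\mathbf{b}\| = \sqrt{4a_1^2 + 4a_2^2} = 2\sqrt{a_1^2 + a_2^2}$$
The first value is the dot product: multiply matching elements, then add. The other two are the lengths, each computed by squaring the elements, summing, and taking the square root. Dividing: $\frac{2(a_1^2 + a_2^2)}{\sqrt{a_1^2 + a_2^2} \cdot 2\sqrt{a_1^2 + a_2^2}} = 1$. A perfect 1, and rightly so: $\mathbf{b}$ is exactly $\mathbf{a}$ doubled, so the two point in precisely the same direction. Length differs, direction does not, and cosine similarity only sees direction.
Now keep $\mathbf{a}$ but take $\mathbf{c} = (-a_2, a_1)$. The dot product is $a_1 \cdot (-a_2) + a_2 \cdot a_1 = 0$, so the similarity is 0 regardless of the lengths: the vectors are perpendicular, fully unrelated directions.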
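The same paper computation can be replayed step by step in code. The sketch below uses concrete values chosen here for illustration, not taken from the lesson: $(3, 4)$, its double $(6, 8)$, and the perpendicular vector $(-4, 3)$. Any vectors with the same relationships would give the same similarities.

```python
import math

# Assumed illustrative values: b is a doubled, c is perpendicular to a.
a = (3.0, 4.0)
b = (6.0, 8.0)
c = (-4.0, 3.0)

def dot(u, v):
    # Multiply matching elements, then add.
    return sum(x * y for x, y in zip(u, v))

def length(u):
    # Square the elements, sum, take the square root.
    return math.sqrt(dot(u, u))

dot_ab = dot(a, b)                  # 3*6 + 4*8 = 50
len_a = length(a)                   # sqrt(9 + 16) = 5
len_b = length(b)                   # sqrt(36 + 64) = 10
sim_ab = dot_ab / (len_a * len_b)   # 50 / (5 * 10) = 1

dot_ac = dot(a, c)                  # 3*(-4) + 4*3 = 0
sim_ac = dot_ac / (len_a * length(c))

print(sim_ab, sim_ac)  # 1.0 0.0
```

Scaling any of the three vectors changes the lengths and the dot products, but the two similarities stay at 1 and 0.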
In a trained model this machinery yields exactly what you would hope: the similarity between the vectors for “king” and “queen” is much higher than between “king” and “carrot”.
A famous claim, with a caveat
You will often see the example $\text{king} - \text{man} + \text{woman} \approx \text{queen}$: the idea that a consistent direction in embedding space encodes the male/female relation, so vector arithmetic performs analogies. This is partly true. It works for some word pairs, especially in older word-level embedding models like word2vec, and fails for many others. Treat it as an illustration that directions can carry relations, not as a law.
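To see what the analogy arithmetic does mechanically, here is a toy sketch. The vectors are hand-constructed for this illustration (axes loosely standing for royalty, gender, and food), so the analogy works by design; they are not values from word2vec or any trained model, where the result is far less reliable.

```python
import math

# Hypothetical hand-built vectors: (royalty, gender, food).
toy_vectors = {
    "king":   [1.0,  1.0, 0.0],
    "queen":  [1.0, -1.0, 0.0],
    "man":    [0.0,  1.0, 0.0],
    "woman":  [0.0, -1.0, 0.0],
    "carrot": [0.0,  0.0, 1.0],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# king - man + woman, computed element by element.
target = [
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
]

# Rank the remaining words by cosine similarity to the target.
# The three query words are excluded, as is customary for this test.
candidates = {
    word: vec for word, vec in toy_vectors.items()
    if word not in ("king", "man", "woman")
}
best = max(candidates, key=lambda word: cosine_similarity(target, candidates[word]))

print(best)  # queen
```

Note that the answer is picked as the nearest remaining vector, not an exact match, which is part of why the trick looks more robust than it is.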
One name, two things
A final clarification, because the word “embedding” names two related things and beginners mix them up. This chapter’s embeddings are token embeddings: rows of the embedding matrix, one vector per token, the entry point of a language model. Separately, dedicated embedding models produce one vector for an entire sentence or document; those power semantic search and RAG, and are built on top of token embeddings. A later chapter covers them. Until then, “embedding” means a row of the lookup table.
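The difference in shape between the two can be sketched as follows. Everything here is hypothetical: the three-word vocabulary, the table values, and the helper names are made up for illustration, and averaging token vectors is only a crude stand-in to show "one vector per text". Dedicated embedding models are trained for that purpose and do considerably more than average.

```python
# Hypothetical toy vocabulary and lookup table; real tables are learned
# during training and have thousands of rows and hundreds of columns.
vocab = {"the": 0, "king": 1, "sleeps": 2}
embedding_table = [
    [0.1, 0.3],   # row 0: "the"
    [0.9, 0.7],   # row 1: "king"
    [0.2, -0.5],  # row 2: "sleeps"
]

def token_embedding(token):
    # A token embedding is just the row of the table at the token's index.
    return embedding_table[vocab[token]]

def mean_pooled(tokens):
    # Crude stand-in for a text-level vector: average the token rows.
    rows = [token_embedding(t) for t in tokens]
    dims = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dims)]

king_vec = token_embedding("king")                      # one vector per token
sentence_vec = mean_pooled(["the", "king", "sleeps"])   # one vector per text

print(king_vec)
print(sentence_vec)
```

Both results have the same number of dimensions, which is why the same cosine similarity formula applies to either kind of embedding.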
One ingredient is still missing from our pipeline: the model does not yet know token order. Next lesson.