Chapter 1 · Lesson 7 · 5 min read

Measuring similarity

A trained embedding layer gives every token a vector, and tokens used in similar ways get similar vectors. To make “similar” precise, we need one formula, and it is one of the most widely used in applied AI: it powers semantic search, recommendation systems, and retrieval-augmented generation.

Cosine similarity measures the angle between two vectors: same direction means similar usage.

The formula

$$\text{sim}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$

Piece by piece: $a$ and $b$ are the two embedding vectors being compared. $a \cdot b$ is the dot product, computed by multiplying the vectors element by element and summing the results. $\lVert a \rVert$ is the length (also called the norm) of vector $a$, and $\lVert b \rVert$ the length of $b$. Dividing the dot product by both lengths cancels out how long the vectors are, leaving only the cosine of the angle between them.

The result always lands between -1 and 1:

  • near 1: the vectors point almost the same way, so the tokens behave similarly in language
  • near 0: unrelated directions
  • near -1: opposite directions
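
The formula and its three landmark values can be checked with a few lines of plain Python. This is a minimal sketch, not a production implementation: the function name `cosine_similarity` and the test vectors are chosen here for illustration, and the zero-vector check is an added assumption (the formula divides by the lengths, so it is undefined when either length is 0).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    # Dot product: multiply matching elements, then add.
    dot = sum(x * y for x, y in zip(a, b))
    # Lengths (norms): square the elements, sum, take the square root.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Illustrative vectors (not from any model) for the three cases in the list.
same = cosine_similarity([1.0, 0.0], [3.0, 0.0])           # same direction
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 2.0])  # unrelated directions
opposite = cosine_similarity([1.0, 0.0], [-2.0, 0.0])      # opposite directions
print(same, perpendicular, opposite)  # prints: 1.0 0.0 -1.0
```

Note that the second vector in each pair has a different length from the first, yet the results are exactly the landmark values: only direction matters.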

A worked example by hand

Vectors in real models have hundreds of dimensions, but the formula behaves identically in two, where we can compute it on paper. Take $a = (1, 2)$ and $b = (2, 4)$. Then:

$$a \cdot b = 1 \cdot 2 + 2 \cdot 4 = 10, \qquad \lVert a \rVert = \sqrt{1^2 + 2^2} = \sqrt{5}, \qquad \lVert b \rVert = \sqrt{2^2 + 4^2} = \sqrt{20}$$

The first value is the dot product: multiply matching elements, then add. The other two are the lengths, each computed by squaring the elements, summing, and taking the square root. Dividing: $10 / (\sqrt{5} \cdot \sqrt{20}) = 10 / \sqrt{100} = 1$. A perfect 1, and rightly so: $b$ is exactly $a$ doubled, so the two point in precisely the same direction. Length differs, direction does not, and cosine similarity only sees direction.

Now keep $a = (1, 2)$ but take $b = (-2, 1)$. The dot product is $1 \cdot (-2) + 2 \cdot 1 = 0$, so the similarity is 0 regardless of the lengths: the vectors are perpendicular, fully unrelated directions.
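
Both hand calculations can be replayed step by step in Python. The helper names `dot` and `norm` and the variable names are chosen here for illustration. Because square roots are computed in floating point, the first result may differ from 1 in the last decimal place, so the sketch rounds before printing.

```python
import math

def dot(a, b):
    """Multiply matching elements, then add."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Length of a vector: square the elements, sum, take the square root."""
    return math.sqrt(dot(a, a))

a = (1, 2)
b_doubled = (2, 4)   # a scaled by 2: same direction
b_perp = (-2, 1)     # perpendicular to a

dot_doubled = dot(a, b_doubled)                           # 1*2 + 2*4 = 10
sim_doubled = dot_doubled / (norm(a) * norm(b_doubled))   # 10 / (sqrt(5) * sqrt(20))

dot_perp = dot(a, b_perp)                                 # 1*(-2) + 2*1 = 0
sim_perp = dot_perp / (norm(a) * norm(b_perp))            # 0 divided by anything nonzero

print(dot_doubled, round(sim_doubled, 6))
print(dot_perp, sim_perp)
```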

In a trained model this machinery yields what you would hope: the similarity between the vectors for “king” and “queen” is typically much higher than between “king” and “carrot”.
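
To see how such a comparison is carried out, here is a small sketch that ranks tokens by similarity to a query token. The three vectors are made up by hand for illustration and do not come from any trained model; real embeddings have hundreds of dimensions and their values are learned. The names `toy_embeddings` and `rank_by_similarity` are likewise hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors, NOT taken from a trained model.
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "carrot": [0.1, 0.05, 0.95],
}

def rank_by_similarity(query, embeddings):
    """Return (token, similarity) pairs for every other token, most similar first."""
    scores = [
        (token, cosine_similarity(embeddings[query], vector))
        for token, vector in embeddings.items()
        if token != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = rank_by_similarity("king", toy_embeddings)
for token, score in ranking:
    print(f"{token}: {score:.3f}")
```

With these toy values, "queen" ranks above "carrot". Ranking every candidate by cosine similarity to a query vector is the same basic operation that semantic search relies on.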

A famous claim, with a caveat

You will often see the example $\text{king} - \text{man} + \text{woman} \approx \text{queen}$: the idea that a consistent direction in embedding space encodes the male/female relation, so vector arithmetic performs analogies. This is partly true. It works for some word pairs, especially in older word-level embedding models like word2vec, and fails for many others. Treat it as an illustration that directions can carry relations, not as a law.
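
The mechanics of the analogy test can be sketched as follows. The 2-dimensional vectors are constructed by hand so that the arithmetic works out exactly (first coordinate loosely "royalty", second loosely "gender"); they are not from word2vec or any other model, and real embeddings are far noisier, which is where the failures come from. The function name `analogy` is hypothetical, and excluding the three input words from the candidates is a choice made in this sketch, since otherwise an input word is often the nearest match.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-built toy vectors, NOT from a trained model.
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "carrot": [-1.0, 0.2],
}

def analogy(a, b, c, embeddings):
    """Return the token whose vector is closest in direction to a - b + c."""
    target = [
        x - y + z
        for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # Leave out the three input words so they cannot win trivially.
    candidates = {t: v for t, v in embeddings.items() if t not in (a, b, c)}
    return max(candidates, key=lambda t: cosine_similarity(target, candidates[t]))

result = analogy("king", "man", "woman", toy)
print(result)  # prints: queen
```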

One name, two things

A final clarification, because the word “embedding” names two related things and beginners mix them up. This chapter’s embeddings are token embeddings: rows of $E$, one vector per token, the entry point of a language model. Separately, dedicated embedding models produce one vector for an entire sentence or document; those power semantic search and RAG, and are built on top of token embeddings. A later chapter covers them. Until then, “embedding” means a row of the lookup table.
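
As a reminder of what "a row of the lookup table" means in code, here is a minimal sketch. The three-token vocabulary, the 3 × 4 table `E`, and the helper `token_embedding` are all invented for illustration; a real model has tens of thousands of rows with learned values.

```python
# Hypothetical vocabulary mapping each token to its row index in E.
vocab = {"the": 0, "king": 1, "carrot": 2}

# Made-up embedding table: one row per token, four dimensions per row.
E = [
    [0.1, 0.0, -0.2, 0.3],   # row 0: "the"
    [0.9, 0.8, 0.1, -0.4],   # row 1: "king"
    [0.1, 0.05, 0.95, 0.2],  # row 2: "carrot"
]

def token_embedding(token):
    """Look up a token's embedding: its row of E."""
    return E[vocab[token]]

king_vector = token_embedding("king")
print(king_vector)  # prints: [0.9, 0.8, 0.1, -0.4]
```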

One ingredient is still missing from our pipeline: the model does not yet know token order. Next lesson.