HexaHype

How Language Models Actually Work · Chapter 2 · Lesson 3 · 6 min read

Sampling and decoding

The model has produced its probability table: “ mat” 0.32, “ floor” 0.11, “ roof” 0.07, and so on across the whole vocabulary. Somebody must now pick exactly one token. This choice is called decoding or sampling, it happens outside the neural network, and it is where the familiar API knobs like temperature live.

The model proposes a probability distribution; the decoding strategy disposes. Same model, different strategy, very different text.

Greedy decoding

The simplest strategy: always pick the highest-probability token. This is greedy decoding, and setting temperature to 0 in an API requests it. It is the right default for tasks with one correct answer, like extraction or classification. For open text it has a flaw: always taking the locally safest token produces bland, repetitive output, and the loop can get stuck repeating a phrase whose tokens keep ranking first.
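A minimal sketch of greedy decoding, using the three probabilities quoted at the top of this lesson as a toy table. The function name `greedy_pick` is a hypothetical helper for illustration, and a real vocabulary would hold tens of thousands of entries rather than three.

```python
# Toy probability table: the three values quoted in the lesson's opening example.
# A real table covers the whole vocabulary; this subset is for illustration only.
probs = {" mat": 0.32, " floor": 0.11, " roof": 0.07}


def greedy_pick(probs):
    """Return the token with the highest probability (hypothetical helper)."""
    return max(probs, key=probs.get)


print(repr(greedy_pick(probs)))  # ' mat'
```

Because nothing here is random, the same table always yields the same token, which is why greedy decoding suits tasks with one correct answer.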

Temperature

Sampling instead picks a token at random, weighted by the probabilities, so “ mat” wins 32% of the time. Temperature controls how much that randomness is allowed to explore, by rescaling the logits before softmax:

$$P(\text{token } j) = \frac{e^{z_j / T}}{\sum_{k=1}^{V} e^{z_k / T}}$$

Piece by piece: $z_j$ is the logit for token $j$, exactly as in lesson 1. $T$ is the temperature, a positive number that divides every logit before the usual softmax. Dividing by a small $T$ (below 1) stretches the gaps between logits, so the leader dominates even more. Dividing by a large $T$ (above 1) shrinks the gaps, flattening the distribution toward equal chances. The denominator sums over all $V$ tokens in the vocabulary and again makes everything sum to 1.

A worked example with three candidate tokens and logits 2.0, 1.0, 0.1:

T = 1.0   probabilities: 0.659, 0.242, 0.099    the raw distribution
T = 0.5   probabilities: 0.864, 0.117, 0.019    sharper, leader dominates
T = 2.0   probabilities: 0.502, 0.304, 0.194    flatter, underdogs viable

Low temperature means focused and predictable, high temperature means diverse and risky. (Values are rounded to three decimals; each row of exact softmax outputs sums to precisely 1.) Push $T$ high enough and the model starts picking genuinely poor tokens, which reads as incoherence. As $T$ approaches 0, sampling converges to greedy decoding.
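The worked example can be reproduced in a few lines. This is a sketch rather than any provider's implementation: the function name `softmax_with_temperature` is an assumption, and subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the result.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide each logit by T, then apply softmax (hypothetical helper)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat T = 0 as greedy decoding")
    scaled = [z / temperature for z in logits]
    # Subtracting the max avoids overflow in exp() and leaves the ratios unchanged.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
for t in (1.0, 0.5, 2.0):
    rounded = [round(p, 3) for p in softmax_with_temperature(logits, t)]
    print(f"T = {t}: {rounded}")
# T = 1.0: [0.659, 0.242, 0.099]
# T = 0.5: [0.864, 0.117, 0.019]
# T = 2.0: [0.502, 0.304, 0.194]
```

Note that the function rejects a temperature of exactly 0, since the formula would divide by zero; a decoder would normally special-case that setting as greedy decoding.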

One practical footnote: temperature 0 in real APIs is still not perfectly reproducible. Reasons live outside the math, such as tiny floating-point differences across hardware and batching. Treat “deterministic” as “almost”.
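One way to see how floating-point differences arise: floating-point addition is not associative, so summing the same numbers in a different order can give a slightly different result. The snippet below is a generic illustration of that fact, not a reproduction of what any particular hardware or batching scheme does.

```python
# The same three numbers, added in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)  # False
print(left_to_right, right_to_left)    # 0.6000000000000001 0.6
```

A difference this small rarely matters, but when two logits are almost tied it can be enough to change which token ranks first.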

Top-k and top-p

Pure temperature sampling has a tail problem: the vocabulary holds tens of thousands of terrible candidates, and even tiny probabilities occasionally win a random draw. Two filters cut the tail off before sampling:

  • Top-k keeps only the $k$ highest-probability tokens, for example $k = 50$, sets the rest to zero, and renormalizes. Blunt but effective. Its weakness: $k$ is fixed, while the real number of sensible continuations varies from one step to the next.
  • Top-p, also called nucleus sampling, keeps the smallest set of top tokens whose probabilities add up to at least $p$, for example $p = 0.9$. The set adapts: when the model is confident, the nucleus may contain 2 tokens; when many continuations are plausible, it may contain 200.
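Both filters can be sketched over a plain list of probabilities. The function names and the two toy distributions below are assumptions made for illustration; production decoders typically work on tensors of logits, but the logic is the same.

```python
def _keep_and_renormalize(probs, kept):
    """Zero every index not in `kept`, then rescale so the result sums to 1."""
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep the k highest-probability tokens (hypothetical helper)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _keep_and_renormalize(probs, set(order[:k]))


def top_p_filter(probs, p):
    """Keep the smallest top set whose probabilities reach at least p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _keep_and_renormalize(probs, kept)


def nucleus_size(probs, p):
    """Count how many tokens survive the top-p filter."""
    return sum(1 for x in top_p_filter(probs, p) if x > 0)


confident = [0.85, 0.10, 0.03, 0.02]      # one clear leader
uncertain = [0.22, 0.20, 0.20, 0.19, 0.19]  # many plausible continuations

print(nucleus_size(confident, 0.9))  # 2
print(nucleus_size(uncertain, 0.9))  # 5
```

With the same $p$, the nucleus holds 2 tokens for the confident distribution and 5 for the uncertain one, while a top-k filter would keep the same count in both cases.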

In practice, systems combine the knobs: a temperature plus a top-p filter is a common configuration, and many providers expose both settings in their APIs.
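A combined decoding step might look like the sketch below: rescale by temperature, cut the tail with top-p, then draw at random from what remains. The function name `sample_token`, its default values, and the choice to treat temperature 0 as greedy are assumptions for illustration, not a description of any specific provider's API.

```python
import math
import random


def sample_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Return the index of one sampled token (hypothetical helper)."""
    # Assumption: temperature 0 is special-cased as greedy decoding.
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])

    # Step 1: temperature-scaled softmax (max subtracted for numerical stability).
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Step 2: top-p filter, keeping the smallest top set that reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Step 3: weighted random draw; random.choices normalizes the weights itself.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so repeated runs draw the same sequence
draws = [sample_token([2.0, 1.0, 0.1], 0.7, 0.9, rng) for _ in range(1000)]
print(sorted(set(draws)))
```

With the lesson's logits and these settings, the two leading tokens already cover more than 0.9 of the probability mass, so the third token falls outside the nucleus and is never drawn.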

Choosing settings

Rules of thumb that follow from the mechanics:

  1. Deterministic tasks (extraction, code with one right answer, classification): temperature at or near 0.
  2. General assistant use: moderate temperature, around 0.7, with top-p around 0.9.
  3. Brainstorming and creative writing: temperature near 1 or slightly above.

One caution: sampling randomness is one reason a model can confidently produce wrong facts, since a plausible-sounding wrong token can win the draw. Lowering temperature reduces this but does not eliminate it; the deeper causes of hallucination are a later chapter’s topic.

The loop so far processes the prompt, then picks tokens one at a time. To understand what makes any of this work, and what it costs, we must finally open the transformer block itself. Next lesson.