Parallel processing and context size

Use any LLM API and you will notice generation has two speeds: a pause before the first token appears, then a steady stream. The pause and the stream are two genuinely different phases of computation, and both follow from how attention works.

Prompt tokens are processed all at once in parallel; generated tokens must be produced one at a time.

Prefill: the prompt in one pass

When your prompt arrives, the model does not read it word by word. All prompt tokens enter the network simultaneously, as the rows of one big matrix. Look back at the attention formula: $QK^{\top}$ computes every token’s attention to every (earlier) token in a single matrix multiplication. Nothing about processing token 50 requires finishing token 49 first, because all the tokens already exist. GPUs are built for exactly this kind of giant parallel arithmetic, which is why a 2,000-token prompt does not take 2,000 times longer than a 1-token prompt. This phase is called prefill, and it is what happens during the pause before the first token.
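To make the "one pass" idea concrete, here is a minimal sketch in plain Python. It assumes the standard scaled form of the score matrix, $QK^{\top}/\sqrt{d}$, followed by a causal mask and a row-wise softmax; the function name and the toy inputs are illustrative, not taken from any library. The nested loops stand in for what a GPU does as a single matrix multiplication. The thing to notice is that no row needs the result of any other row.

```python
import math

def causal_attention_weights(Q, K):
    """Attention weights for every prompt token, computed from Q and K alone.

    Q and K are lists of n vectors of length d (one row per token).
    Assumes the usual scaling by sqrt(d); this is an illustrative sketch.
    """
    n, d = len(Q), len(Q[0])
    # Every (query i, key j) score: conceptually one matrix product Q @ K^T.
    scores = [[sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d)
               for j in range(n)]
              for i in range(n)]
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]              # causal mask: token i sees tokens 0..i
        top = max(visible)                  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in visible]
        total = sum(exps)
        # Masked (future) positions get weight 0.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = causal_attention_weights(Q, K)
```

Each row of `W` depends only on `Q` and `K`, which are fully known the moment the prompt arrives. That independence is what lets hardware fill in all rows simultaneously.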

Decode: generation one token at a time

Generation cannot be parallelized the same way, for a simple reason: token 51 of the answer cannot be computed until token 50 has been chosen, because token 50 is part of its input. The autoregressive loop of lesson 1 is inherently sequential. This phase is called decode. It takes one forward pass per generated token, and it is the steady stream you watch.
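The sequential dependency can be sketched as a loop. Here `toy_next_token` is a hypothetical stand-in for a full forward pass plus sampling (a real model would go there); the only point is that each call needs every token chosen so far.

```python
def toy_next_token(tokens):
    """Hypothetical stand-in for a model forward pass plus sampling."""
    return (sum(tokens) * 31 + len(tokens)) % 100

def generate(prompt_tokens, max_new_tokens):
    """Autoregressive decode: one forward pass per generated token."""
    tokens = list(prompt_tokens)
    forward_passes = 0
    for _ in range(max_new_tokens):
        nxt = toy_next_token(tokens)    # depends on all tokens chosen so far
        forward_passes += 1
        tokens.append(nxt)              # the choice becomes input to the next step
    return tokens[len(prompt_tokens):], forward_passes

output, passes = generate([1, 2, 3], 5)
```

Five new tokens cost five passes, and step $k+1$ cannot start before step $k$ has appended its token.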

The two phases explain the two speed metrics providers commonly quote: time to first token is dominated by prefill (plus any queueing and network delay), and tokens per second measures decode.
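If you record when a request was sent and when each streamed token arrived, both metrics fall out of simple arithmetic. The sketch below assumes timestamps in seconds; `latency_metrics` is an illustrative helper, not part of any provider's SDK.

```python
def latency_metrics(request_time, token_times):
    """Return (time_to_first_token, tokens_per_second).

    request_time: when the request was sent, in seconds.
    token_times: arrival time of each generated token, in seconds.
    """
    if not token_times:
        raise ValueError("no tokens were generated")
    ttft = token_times[0] - request_time              # the pause
    stream_span = token_times[-1] - token_times[0]    # the steady stream
    # Tokens after the first, divided by the time spent streaming them.
    if stream_span > 0:
        tps = (len(token_times) - 1) / stream_span
    else:
        tps = float("nan")
    return ttft, tps

ttft, tps = latency_metrics(0.0, [1.0, 1.25, 1.5, 1.75, 2.0])
```

Measured from the client side, the first number also includes network and queueing time, so treat it as an upper bound on prefill.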

The quadratic cost of attention

Attention compares every token with every earlier token. With $n$ tokens, that is on the order of $n^2$ comparisons:

$$\text{cost} \propto n^2$$

Read it plainly: $n$ is the number of tokens in the sequence, and the work attention does grows with its square. Double the sequence and attention costs four times as much; ten times the sequence, a hundred times the cost. In a straightforward implementation, the memory for storing the attention score table grows the same way. Most other parts of the model, like the feed-forward network, grow only linearly in $n$, so as sequences get long, attention comes to dominate the bill.
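You can check the scaling by counting comparisons directly. The sketch below assumes each token attends to itself and to every earlier token, which gives $n(n+1)/2$ pairs; the function name is illustrative.

```python
def causal_pairs(n):
    """Number of (query, key) pairs when each token attends to itself
    and to all earlier tokens: 1 + 2 + ... + n."""
    return n * (n + 1) // 2

base = causal_pairs(1_000)
for n in (1_000, 2_000, 10_000):
    # Print the sequence length, the pair count, and the growth relative to n = 1,000.
    print(n, causal_pairs(n), round(causal_pairs(n) / base, 2))
```

The ratios come out just under 4 and just under 100 because of the lower-order $n/2$ term, which stops mattering as $n$ grows.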

The context window

The context window is the maximum number of tokens, prompt plus generated output combined, the model can handle in one sequence. It is a hard architectural limit, not a suggestion: text beyond it is simply not in the input, and the model behaves as if it never existed. Two forces set the limit. First, the quadratic cost makes very long sequences expensive to serve. Second, the model only learned to handle positions it saw during training; the positional schemes of lesson 1.8, particularly RoPE, determine how gracefully a model stretches beyond its training length, and much of the recent growth from 4k to 128k and million-token windows came from advances there plus attention variants that tame the quadratic term.

Practical consequence: in a long chat, the entire conversation re-enters the model every turn, consuming the window. As the limit approaches, applications typically trim or summarize old turns, often without telling you, which is why a long-running chat “forgets” its beginning.
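One simple trimming policy is to keep only the most recent turns that fit. The sketch below is a guess at what such a policy might look like, not how any particular application does it. `count_tokens` is a hypothetical whitespace tokenizer standing in for a real one, and `reserve_for_output` is an assumed parameter that leaves room for the reply, since prompt and output share the window.

```python
def count_tokens(text):
    """Hypothetical tokenizer: one token per whitespace-separated word."""
    return len(text.split())

def trim_history(turns, context_window, reserve_for_output):
    """Keep the newest turns that fit in the window, leaving room for the reply."""
    budget = context_window - reserve_for_output
    kept, used = [], 0
    for turn in reversed(turns):            # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > budget:
            break                           # this turn and everything older is dropped
        kept.append(turn)
        used += cost
    kept.reverse()                          # restore chronological order
    return kept

turns = ["my name is Ada", "hello Ada", "what is attention", "it compares tokens"]
kept = trim_history(turns, context_window=12, reserve_for_output=4)
```

With a budget of 8 tokens the opening turn no longer fits, so the model never sees the user's name on the next request. Real applications may summarize dropped turns instead of discarding them.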

There is also an obvious inefficiency hiding in the decode phase: each new token seemingly requires reprocessing every old token’s keys and values from scratch. The fix is the most important serving optimization in practice, and it is the next lesson.