hexahype
Start reading
Chapter 1 · Lesson 5 Browse lessons

Chapter 1 · Lesson 5 · 4 min read

Special tokens

Inspect a tokenizer’s vocabulary and you will find entries like <|endoftext|> that never appear when you encode ordinary text. These are special tokens: reserved IDs with control meanings. If ordinary tokens are the words of the model’s world, special tokens are its traffic signs, marking where things begin, end, and change speaker.

Special tokens are inserted by the training pipeline or the chat template, never produced by encoding raw user text.

The common ones

  • <|endoftext|> or </s>: marks the end of a document or sequence. During training, the corpus is billions of unrelated documents glued together, and this token is the glue. It teaches the model “what follows is unrelated to what came before”, so it does not blend a recipe into a news article.
  • <s> or [CLS]: marks the beginning of a sequence in some model families.
  • [PAD]: padding. Models process batches of many sequences at once for efficiency, and a batch must be rectangular, every row the same length. Shorter sequences are filled up with this dummy token, which the model is told to ignore.
  • [MASK]: used by BERT-style models during training. A real token is hidden behind it and the model must guess what was there.
  • <|user|>, <|assistant|>, <|system|> style markers: chat models need to know who said what.

Chat is built from special tokens

This last group deserves a closer look, because it explains how a text-completion engine becomes a chatbot. When you message an assistant, your text is not sent to the model alone. A chat template wraps the entire conversation in role markers, producing roughly:

<|system|> You are a helpful assistant. <|user|> What is BPE? <|assistant|>

The sequence ends right after the <|assistant|> marker, and the model simply continues the text from there. Because it was trained on millions of conversations in exactly this format, the most probable continuation after that marker is a helpful answer. “Being an assistant” is, mechanically, predicting what follows <|assistant|>.

A safety detail

What if a user typed <|assistant|> inside their own message, to inject a fake control token and confuse the model about who is speaking? Tokenizer libraries defend against exactly this: by default they refuse to encode special tokens from plain strings, splitting the typed characters into ordinary harmless tokens instead. The control IDs can only be inserted deliberately by the surrounding software.

Tokenization is now complete: text in, integer IDs out, with control markers where needed. But an ID is still just an index into a list, carrying no meaning of its own. In the next lesson we give tokens meaning with embeddings.