Table of contents
Everything in the tutorial, in order.
Chapter 1 · Tokenization & Embeddings
- 1.1 Why models need numbers · Language models compute with numbers, so text must be converted to numbers on the way in and back to text on the way out. · 3 min
- 1.2 What is a token? · A token is the smallest unit of text a model works with, and subword tokens are the compromise most modern models use. · 4 min
- 1.3 Byte pair encoding · BPE builds a tokenizer by repeatedly merging the most frequent pair of adjacent symbols, and the algorithm fits in thirty lines of Python. · 7 min
- 1.4 Tokenizers in practice · Encode and decode text with real production tokenizers, and learn why a model must only be used with the tokenizer it was trained with. · 6 min
- 1.5 Special tokens · Tokenizers reserve some IDs for control tokens, which the pipeline inserts rather than producing them from user text. · 4 min
- 1.6 From token IDs to vectors · The embedding layer is a learned lookup table that turns each token ID into a vector, and the vectors' meaning emerges from training. · 6 min
- 1.7 Measuring similarity · Cosine similarity compares the directions of two vectors, and after training, similar directions track similar meanings. · 5 min
- 1.8 Position matters · Positional information is added to token embeddings so the model can tell different word orders apart. · 4 min
- 1.9 Why LLMs act the way they do · Six everyday quirks of language models that follow directly from tokenization and embeddings. · 5 min