hexahype

Table of contents

Everything in the tutorial, in order.

Chapter 1 · Tokenization & Embeddings

  1. 1.1 Why models need numbers · Language models compute with numbers, so text must be converted to numbers before every interaction and back to text after it. · 3 min
  2. 1.2 What is a token? · A token is the smallest unit of text a model works with, and subword tokens are the compromise every modern model uses. · 4 min
  3. 1.3 Byte pair encoding · BPE builds a tokenizer by repeatedly merging the most frequent pair of symbols, and the algorithm fits in thirty lines of Python. · 7 min
  4. 1.4 Tokenizers in practice · Encode and decode text with real production tokenizers, and learn why a model must only be used with the tokenizer it was trained with. · 6 min
  5. 1.5 Special tokens · Tokenizers reserve IDs with control meanings, inserted by the pipeline rather than produced from user text. · 4 min
  6. 1.6 From token IDs to vectors · The embedding layer is a learned lookup table that turns each token ID into a vector, and the meaning of those vectors emerges from training. · 6 min
  7. 1.7 Measuring similarity · Cosine similarity compares the directions of two vectors, which after training tracks similarity of meaning. · 5 min
  8. 1.8 Position matters · Positional information is added to token embeddings so the model can distinguish one word order from another. · 4 min
  9. 1.9 Why LLMs act the way they do · Six everyday quirks of language models that follow directly from tokenization and embeddings. · 5 min