Module Overview
This module establishes how modern language models represent and process text. It covers the Transformer's core mechanics, from embeddings and attention through the three canonical architecture families, along with the tokenization strategies that prepare raw text for these models.
Learning Objectives
- Explain how embeddings map discrete tokens into a continuous semantic space.
- Distinguish self-attention, multi-head attention, masked multi-head attention, and cross-attention by purpose and information flow.
- Compare encoder-only, decoder-only, and encoder–decoder Transformers and identify the tasks each is best suited to.
- Explain why positional encoding is required and describe how it is injected.
- Select an appropriate tokenization strategy and explain its effect on vocabulary, sequence length, and multilingual coverage.
Topics Covered
Transformer Core Mechanics
- Embeddings: from discrete tokens to continuous vector space
- The attention mechanism: intuition and formulation
- Self-attention: queries, keys, and values
- Multi-head attention and the value of multiple representation subspaces
- Masked multi-head attention and causal (autoregressive) information flow
- Positional encoding and order awareness
- Encoder–decoder Transformers
- Encoder-only Transformers
- Decoder-only Transformers
- Cross-attention and its role in conditioning generation on encoded input
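
As a concrete reference point for the attention topics above, the following is a minimal sketch of single-head scaled dot-product attention with an optional causal mask, written in plain Python. It is illustrative only: the function names `softmax` and `attention` are hypothetical, the inputs are used directly as queries, keys, and values (a real Transformer layer would first apply learned projection matrices), and the causal mask assumes self-attention, where queries and keys come from the same sequence.

```python
import math

def softmax(scores):
    # Subtract the row maximum for numerical stability; exp(-inf) evaluates to 0.0.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of equal-length vectors.

    Returns (outputs, weights). With causal=True, position i may only attend
    to positions j <= i (assumes queries and keys index the same sequence).
    """
    d_k = len(keys[0])
    dim = len(values[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # hide future positions
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        row = softmax(scores)  # attention weights for position i
        weights.append(row)
        # Output is the attention-weighted sum of the value vectors.
        outputs.append([sum(w * v[c] for w, v in zip(row, values)) for c in range(dim)])
    return outputs, weights

# Toy sequence of three 2-dimensional token vectors (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = attention(x, x, x, causal=True)
```

With `causal=True`, the first position can attend only to itself, so its weight row puts all mass on position 0. Passing a different sequence as `keys` and `values` with `causal=False` gives the information flow of cross-attention, where each query position reads from an encoded input rather than from its own sequence.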
Tokenization Strategies
- Taxonomy of tokenization: word-level, subword, character-level, byte-level
- Byte-Pair Encoding (BPE)
- WordPiece
- SentencePiece
- Trade-offs: vocabulary size, sequence length, out-of-vocabulary handling, and multilingual/code coverage
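
To make the BPE topic above more tangible, here is a toy sketch of BPE training and encoding in plain Python. It is a simplification under stated assumptions: the helper names `learn_bpe`, `merge_pair`, and `encode` are hypothetical, the corpus is a made-up word list, words are split into characters with no end-of-word marker or byte-level fallback, and ties between equally frequent pairs are broken by first occurrence. Production tokenizers differ in all of these details.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word starts as a tuple of characters, weighted by frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def encode(word, merges):
    """Tokenize a word by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Made-up corpus for illustration.
words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(words, num_merges=4)
tokens = encode("lowest", merges)
```

Note that "lowest" never appears in the toy corpus, yet it can still be encoded from learned subword units, which is how subword vocabularies sidestep the out-of-vocabulary problem. Raising `num_merges` grows the vocabulary and shortens encoded sequences, which is the vocabulary-size versus sequence-length trade-off listed above.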
Key Concepts & Terminology
Attention weights, query/key/value projections, causal masking, context window, embedding dimension, subword vocabulary, special tokens, byte-level fallback.
Tools & Frameworks Referenced
Hugging Face Transformers; Hugging Face Tokenizers library (BPE, WordPiece, and Unigram models, the last being the model associated with SentencePiece).
Prerequisites
Intermediate Python and a working understanding of neural network training fundamentals (loss, gradients, overfitting).