Module Overview
This module introduces the vision half of multimodal systems. It explains how the Transformer was adapted to images, the architectural components specific to Vision Transformers (ViTs), the inductive-bias trade-off against convolutional neural networks (CNNs), and the pre-trained encoders that serve as the visual backbone of modern vision-language models (VLMs) and multimodal retrieval-augmented generation (RAG).
Learning Objectives
- Explain the conceptual transition from convolutional networks to Vision Transformers.
- Describe patch embedding, the CLS token, and 2D positional encoding.
- Interpret ViT attention maps and articulate the inductive-bias trade-off versus CNNs.
- Compare CLIP, SigLIP, and DINO/DINOv2 by training objective and use case.
Topics Covered
From CNN to ViT Architecture
- From CNN to Vision Transformer — the conceptual bridge
- ViT architecture deep dive
- Image as a sequence of patches
- Patch embedding
- The CLS token
- Positional encoding for 2D inputs
- Transformer encoder on visual tokens
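The front end of the pipeline listed above (patches, patch embedding, CLS token, positional encoding) can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `patchify` and `embed` are hypothetical, the 224x224 input, 16x16 patches, and 768-dimensional embedding are illustrative choices, and the weights are random placeholders for parameters a real model would learn.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (gh * gw, P * P * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed(image, patch_size, w_proj, cls_token, pos_embed):
    """Linear patch embedding, prepended CLS token, added positional embedding."""
    tokens = patchify(image, patch_size) @ w_proj          # (N, D)
    tokens = np.concatenate([cls_token, tokens], axis=0)   # (N + 1, D)
    return tokens + pos_embed                              # (N + 1, D)

# Illustrative sizes (assumptions, not taken from this module).
rng = np.random.default_rng(0)
P, D = 16, 768
image = rng.random((224, 224, 3))
N = (224 // P) ** 2                                   # 14 x 14 = 196 patches
w_proj = rng.normal(scale=0.02, size=(P * P * 3, D))  # placeholder for learned weights
cls_token = np.zeros((1, D))                          # placeholder for learned CLS token
pos_embed = rng.normal(scale=0.02, size=(N + 1, D))   # placeholder for learned positions

tokens = embed(image, P, w_proj, cls_token, pos_embed)
print(tokens.shape)  # (197, 768)
```

The resulting sequence of 197 tokens (196 patches plus CLS) is what a standard Transformer encoder would consume. Here the positional embedding is a single table indexed by flattened patch position, so any 2D structure has to be learned rather than built in.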
Attention Visualization & CNN vs ViT Trade-offs
- Attention maps — what a ViT sees
- CNN vs ViT — inductive bias and the core trade-off
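One common way to produce an attention map like those discussed above is to take the attention weights from the CLS token to every patch token and reshape them onto the patch grid. The sketch below assumes a single attention head and uses random query and key matrices in place of a real model's projections; `cls_attention_map` is a hypothetical helper name.

```python
import numpy as np

def cls_attention_map(q, k, grid_size):
    """Attention from the CLS token (row 0) to each patch, laid out as a 2D grid.

    q, k: (N + 1, d) query and key matrices for one head, with CLS at index 0.
    """
    d = q.shape[-1]
    scores = q[0] @ k.T / np.sqrt(d)    # (N + 1,) scaled dot-product scores
    scores = scores - scores.max()      # stabilise the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    # Drop the CLS-to-CLS weight and reshape the rest onto the patch grid.
    return weights[1:].reshape(grid_size, grid_size)

# Random stand-ins for one head of a ViT with a 14 x 14 patch grid (assumption).
rng = np.random.default_rng(0)
q = rng.normal(size=(197, 64))
k = rng.normal(size=(197, 64))
attn = cls_attention_map(q, k, grid_size=14)
print(attn.shape)  # (14, 14)
```

Upsampling this grid to the input resolution and overlaying it on the image gives the familiar heat-map view. Because every patch can attend to every other patch from the first layer, such maps can show global structure that a CNN's local receptive fields only reach in deeper layers.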
Pre-Trained Vision Encoders
- CLIP — contrastive language-image pre-training
- SigLIP — sigmoid loss variant
- DINO / DINOv2 — self-supervised vision
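The difference between the CLIP-style softmax objective and the SigLIP-style sigmoid objective can be made concrete with a small NumPy sketch. It assumes L2-normalised image and text embeddings in which row i of each matrix forms a matched pair. The function names are hypothetical, and the temperature, scale, and bias values are illustrative constants; in the actual models these quantities are learned.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def softmax_contrastive_loss(img, txt, temperature=0.07):
    """CLIP-style loss: each image must pick its text out of the batch, and vice versa."""
    logits = img @ txt.T / temperature
    diag = np.arange(len(img))
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_i2t + loss_t2i) / 2

def sigmoid_contrastive_loss(img, txt, scale=10.0, bias=-10.0):
    """SigLIP-style loss: every image-text pair is an independent binary decision."""
    logits = scale * (img @ txt.T) + bias
    labels = 2 * np.eye(len(img)) - 1   # +1 for matched pairs, -1 for all others
    # -log(sigmoid(z)) == log(1 + exp(-z)) == logaddexp(0, -z)
    return np.logaddexp(0.0, -labels * logits).sum() / len(img)

def l2_normalise(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Toy batch: text embeddings are noisy copies of the image embeddings.
rng = np.random.default_rng(0)
img = l2_normalise(rng.normal(size=(8, 32)))
txt = l2_normalise(img + 0.1 * rng.normal(size=(8, 32)))

print(softmax_contrastive_loss(img, txt))
print(sigmoid_contrastive_loss(img, txt))
```

The softmax version normalises over the whole batch, so each pair's loss depends on every other item in the batch. The sigmoid version scores each pair on its own, which removes that batch-wide normalisation step.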
Key Concepts & Terminology
Patch tokenisation, contrastive pre-training, sigmoid vs softmax contrastive loss, self-supervised representation learning, inductive bias, frozen vision backbone.
Tools & Frameworks Referenced
CLIP, SigLIP, DINO / DINOv2 pre-trained encoders.
Prerequisites
Modules 01–03 (Transformer foundations).