Module Overview
This module examines the Visual Language Model (VLM). It defines what a VLM can do and breaks the architecture into its three components: the visual encoder, the aligner/projector, and the language model backbone. The focus is on how the projector connects a frozen vision encoder to the language model backbone, and on how vision-language alignment is achieved through staged training.
Learning Objectives
- Define a Visual Language Model and its multimodal capabilities.
- Identify the three architectural components of a VLM.
- Explain how the aligner/projector maps visual tokens into LLM embedding space.
- Reason through the staged training strategy for vision-language alignment.
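To make the projector objective concrete, here is a minimal sketch of how visual tokens might be mapped into the LLM embedding space and joined with text embeddings to form a multimodal context. It assumes a single linear layer as the projector (real connectors are often a small MLP or another design), and every name and size here (`linear_projector`, `D_VISION`, `D_LLM`, the random weights) is hypothetical and chosen only for illustration.

```python
import random


def linear_projector(visual_tokens, weight, bias):
    """Map each visual token (length d_vision) to the LLM width (d_llm).

    Assumption: the projector is one linear layer, y = W x + b.
    """
    projected = []
    for token in visual_tokens:
        row = [
            sum(w * x for w, x in zip(weight_row, token)) + b
            for weight_row, b in zip(weight, bias)
        ]
        projected.append(row)
    return projected


random.seed(0)

# Toy sizes (hypothetical): real encoders and LLMs are far wider.
D_VISION, D_LLM = 4, 6
NUM_PATCHES, NUM_TEXT_TOKENS = 3, 2

# Stand-in for the output of a frozen vision encoder: one vector per image patch.
visual_tokens = [
    [random.gauss(0, 1) for _ in range(D_VISION)] for _ in range(NUM_PATCHES)
]

# Projector parameters: weight has shape (D_LLM, D_VISION), bias has length D_LLM.
weight = [[random.gauss(0, 0.1) for _ in range(D_VISION)] for _ in range(D_LLM)]
bias = [0.0] * D_LLM

# Stand-in for text token embeddings, already in the LLM embedding space.
text_embeddings = [
    [random.gauss(0, 1) for _ in range(D_LLM)] for _ in range(NUM_TEXT_TOKENS)
]

projected = linear_projector(visual_tokens, weight, bias)

# Multimodal context: projected visual tokens followed by text embeddings.
# Both now share the same width, so the LLM can consume them as one sequence.
multimodal_context = projected + text_embeddings
```

The point of the sketch is the shape change: each visual token goes from width `D_VISION` to width `D_LLM`, after which it can sit in the same sequence as the text embeddings.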
Topics Covered
Visual Language Models
- What is a Visual Language Model?
- The VLM architecture
- The visual encoder
- The aligner / projector
- The language model backbone
- How vision and language are connected
- Vision-language alignment training strategy (feature alignment, then visual instruction tuning)
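The staged training strategy in the last topic can be sketched as a schedule of which components are trainable in each stage. This is an illustration under stated assumptions, not a prescribed recipe: it assumes the vision encoder stays frozen throughout, that only the projector is trained during feature alignment, and that the projector and language model are both trained during instruction tuning. The names `Component` and `configure_stage` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    trainable: bool = False


# Assumed schedule (hypothetical): which components receive gradient updates.
TRAINABLE_BY_STAGE = {
    "feature_alignment": {"projector"},
    "instruction_tuning": {"projector", "language_model"},
}


def configure_stage(components, stage):
    """Freeze or unfreeze each component for the given training stage.

    Returns the sorted names of the components that are trainable.
    """
    if stage not in TRAINABLE_BY_STAGE:
        raise ValueError(f"unknown stage: {stage!r}")
    for component in components.values():
        component.trainable = component.name in TRAINABLE_BY_STAGE[stage]
    return sorted(name for name, c in components.items() if c.trainable)


components = {
    name: Component(name)
    for name in ("vision_encoder", "projector", "language_model")
}

# Stage 1: feature alignment. Only the projector learns.
stage1_trainable = configure_stage(components, "feature_alignment")

# Stage 2: visual instruction tuning. The language model is unfrozen as well.
stage2_trainable = configure_stage(components, "instruction_tuning")
```

In a deep learning framework the `trainable` flag would correspond to enabling or disabling gradient updates on each component's parameters. Some training recipes also unfreeze the vision encoder in later stages, which this sketch does not model.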
Key Concepts & Terminology
Visual tokens, projection/connector layer, frozen vs unfrozen backbones, feature alignment stage, visual instruction tuning, multimodal context.
Tools & Frameworks Referenced
Open VLM families that pair CLIP or SigLIP vision encoders with LLM backbones (discussed conceptually).
Prerequisites
Module 11 (vision foundations) and Modules 01–03 (Transformer foundations).