Module Overview
This is the core fine-tuning module. It covers parameter-efficient methods that make adaptation feasible on modest GPUs, the supervised fine-tuning (SFT) stage that typically opens a post-training pipeline, and the preference-alignment techniques that steer model behaviour toward human preferences.
Learning Objectives
- Explain the intrinsic-dimensionality rationale for low-rank adaptation.
- Compare LoRA, QLoRA, DoRA, AdaLoRA, and LoRA+ by mechanism and use case.
- Describe SFT as the usual first stage of a post-training pipeline.
- Articulate why SFT alone does not achieve alignment.
- Contrast PPO-based RLHF with DPO in terms of complexity and required components.
Topics Covered
Parameter-Efficient Fine-Tuning (PEFT)
- The intrinsic dimensionality insight
- LoRA — rank, alpha, and target modules
- QLoRA — 4-bit NF4, double quantization, paged optimizers
- DoRA — magnitude + direction decomposition
- AdaLoRA — adaptive rank allocation
- LoRA+ — separate learning rates for the A and B matrices
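To make the rank and alpha items above concrete, here is a minimal NumPy sketch of the LoRA update, assuming the common formulation W' = W + (alpha / r) · B · A with B initialised to zero. The function names, layer sizes, and initialisation scale are illustrative assumptions and are not taken from the PEFT library.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Apply a frozen weight W plus a low-rank update, without merging.

    Shapes: x (batch, d_in), W (d_out, d_in), A (r, d_in), B (d_out, r).
    """
    r = A.shape[0]
    scale = alpha / r  # alpha rescales the low-rank update relative to its rank
    return x @ W.T + scale * (x @ A.T) @ B.T

def merge_lora(W, A, B, alpha):
    """Fold the adapter into the base weight (adapter merging)."""
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)

# Illustrative sizes only (assumed, not from the module).
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 128, 8, 16

W = rng.normal(size=(d_out, d_in))           # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init: adapter starts as a no-op
x = rng.normal(size=(4, d_in))

full_params = W.size            # parameters updated by full fine-tuning of this layer
lora_params = A.size + B.size   # parameters updated by LoRA for this layer
print(full_params, lora_params)
```

With these sizes the adapter trains 1,536 parameters against 8,192 for the full matrix, and the gap widens as the layer dimensions grow while r stays small. Because B starts at zero, the adapted layer initially reproduces the base layer exactly.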
Supervised Fine-Tuning (SFT)
- SFT as the usual first stage of a post-training pipeline
- Instruction tuning (FLAN, Alpaca, OpenHermes)
- Chat / conversational fine-tuning
- Chain-of-Thought fine-tuning
- Domain-specific fine-tuning best practices
Preference Alignment
- Why SFT alone is not enough
- RLHF with PPO — reward model, critic, and KL penalty
- DPO — direct optimization without a separate reward model
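As a rough illustration of how DPO works from preference pairs and a reference model without training a separate reward model, the sketch below computes the standard DPO loss for a single pair. It assumes the summed log-probabilities of the chosen and rejected responses have already been computed under the policy and the frozen reference model; the function name, argument names, and numeric values are hypothetical.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin).

    Each argument is the total log-probability of a full response.
    beta controls how strongly the policy is held near the reference.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: margin is 0, so the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy raises the chosen response and lowers the rejected one
# relative to the reference: the loss drops below log(2).
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(baseline, improved)
```

The only models involved are the policy and the frozen reference, which is the main structural difference from PPO-based RLHF with its reward model and critic. In practice a library trainer such as the TRL DPO trainer handles batching and log-probability computation; this sketch shows only the per-pair objective.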
Key Concepts & Terminology
Low-rank adapters, adapter merging, reward modelling, KL regularisation, preference pairs, reference model, reward hacking.
Tools & Frameworks Referenced
Hugging Face PEFT (LoRA/QLoRA/DoRA/AdaLoRA), bitsandbytes (4-bit NF4 quantization), Hugging Face TRL (SFT and DPO trainers).
Prerequisites
Modules 04–05.