Module 06: SFT, PEFT and Preference Alignment

Adapting and aligning LLMs — PEFT (LoRA, QLoRA, DoRA, AdaLoRA), supervised fine-tuning, and preference alignment with RLHF and DPO.

May 28, 2026

Topics You Will Master

  • The intrinsic dimensionality insight behind parameter-efficient fine-tuning
  • LoRA and its variants: QLoRA, DoRA, AdaLoRA, LoRA+
  • Supervised fine-tuning for instruction, chat, and chain-of-thought
  • Why SFT alone is insufficient for alignment

Module Overview

This is the core fine-tuning module. It covers parameter-efficient methods that make adaptation feasible on modest GPUs, the supervised fine-tuning stage common to all post-training pipelines, and the preference-alignment techniques that shape model behaviour toward human preferences.

Learning Objectives

  • Explain the intrinsic-dimensionality rationale for low-rank adaptation.
  • Compare LoRA, QLoRA, DoRA, AdaLoRA, and LoRA+ by mechanism and use case.
  • Describe SFT as the usual first stage of a post-training pipeline.
  • Articulate why SFT alone does not achieve alignment.
  • Contrast RLHF-with-PPO against DPO in complexity and components.

Topics Covered

Parameter-Efficient Fine-Tuning (PEFT)

  • The intrinsic dimensionality insight
  • LoRA — rank, alpha, and target modules
  • QLoRA — 4-bit NF4, double quantization, paged optimizers
  • DoRA — magnitude + direction decomposition
  • AdaLoRA — adaptive rank allocation
  • LoRA+ — separate learning rates for the A and B matrices
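The low-rank idea behind these methods can be sketched in a few lines of plain Python. This is a minimal illustration, not the PEFT library's implementation: the helper names (`matvec`, `lora_forward`, `lora_param_counts`) and the toy dimensions are assumptions made for the example. It shows the frozen weight W plus a scaled update (alpha / r) * B A, with B initialised to zero so training starts from the pretrained behaviour.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha, r):
    """Compute y = W x + (alpha / r) * B (A x).

    W (d_out x d_in) stays frozen; only A (r x d_in) and B (d_out x r)
    would receive gradients during fine-tuning.
    """
    base = matvec(W, x)        # frozen pretrained path
    down = matvec(A, x)        # project down to r dimensions
    up = matvec(B, down)       # project back up to d_out dimensions
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, up)]

def lora_param_counts(d_out, d_in, r):
    """Return (full fine-tuning params, LoRA adapter params) for one matrix."""
    return d_out * d_in, r * (d_in + d_out)

# Toy dimensions, chosen only for illustration.
random.seed(0)
d_out, d_in, r, alpha = 6, 8, 2, 4
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]   # zero init: the adapter starts as a no-op
x = [random.gauss(0, 1) for _ in range(d_in)]

y_base = matvec(W, x)
y_lora = lora_forward(x, W, A, B, alpha, r)

# A 4096 x 4096 projection with rank 8 (sizes assumed for the example).
full, adapter = lora_param_counts(4096, 4096, 8)
print(full, adapter, full // adapter)
```

Because B starts at zero, `y_lora` equals `y_base` before any training step. The parameter count shows why the approach fits on modest GPUs: for the assumed 4096 x 4096 matrix, the rank-8 adapter trains 65,536 parameters instead of 16,777,216.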

Supervised Fine-Tuning (SFT)

  • SFT as the usual stage 1 of a post-training pipeline
  • Instruction tuning (FLAN, Alpaca, OpenHermes)
  • Chat / conversational fine-tuning
  • Chain-of-Thought fine-tuning
  • Domain-specific fine-tuning best practices
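One common SFT convention, which this outline does not spell out and is therefore an assumption here, is to compute the loss only on response tokens by masking prompt tokens with an ignore index of -100. The sketch below uses hypothetical helpers (`build_labels`, `masked_nll`) and made-up token ids; it assumes the per-token log-probabilities are already aligned with their target tokens, so the usual one-position shift is not shown.

```python
import math

IGNORE_INDEX = -100  # conventional "skip this position" label value

def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt in the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

def masked_nll(token_logprobs, labels):
    """Mean negative log-likelihood over positions that are not masked.

    token_logprobs[i] is the log-probability the model gave to labels[i];
    values at masked positions are ignored.
    """
    kept = [-lp for lp, lab in zip(token_logprobs, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)

# Made-up token ids: three prompt tokens, two response tokens.
input_ids, labels = build_labels([1, 2, 3], [4, 5])

# Made-up log-probs; the first three belong to prompt positions.
logprobs = [-9.0, -9.0, -9.0, math.log(0.5), math.log(0.25)]
loss = masked_nll(logprobs, labels)
print(labels, round(loss, 4))
```

Only the two response positions contribute to `loss`, so a poor fit on the prompt tokens (the -9.0 values) has no effect on the gradient signal.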

Preference Alignment

  • Why SFT alone is not enough
  • RLHF with PPO — reward model, critic, and KL penalty
  • DPO — direct optimization without a separate reward model
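The contrast between the two approaches can be made concrete with a small sketch. The function names and the numeric log-probabilities below are illustrative assumptions, and the code works on whole-sequence log-probabilities rather than a real model. `kl_shaped_reward` shows the kind of KL-penalised reward used in RLHF with PPO, which needs a separate reward model score; `dpo_loss` shows the DPO objective, which needs only the policy and a frozen reference model scored on a preference pair.

```python
import math

def neg_log_sigmoid(z):
    """Numerically stable -log(sigmoid(z))."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the policy (pol_*) or the frozen reference model (ref_*).
    """
    chosen_margin = pol_chosen - ref_chosen
    rejected_margin = pol_rejected - ref_rejected
    return neg_log_sigmoid(beta * (chosen_margin - rejected_margin))

def kl_shaped_reward(reward, pol_logp, ref_logp, beta=0.1):
    """Reward-model score minus a per-sample KL penalty (RLHF/PPO style)."""
    return reward - beta * (pol_logp - ref_logp)

# Policy identical to the reference: the loss is log(2).
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy raises the chosen response and lowers the rejected one.
loss_improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)

shaped = kl_shaped_reward(1.0, -10.0, -12.0, beta=0.1)
print(loss_at_init, loss_improved, shaped)
```

The DPO loss falls as the policy widens the gap between chosen and rejected responses relative to the reference, with no reward model or critic in the loop. In both cases beta controls how strongly the policy is held close to the reference model.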

Key Concepts & Terminology

Low-rank adapters, adapter merging, reward modelling, KL regularisation, preference pairs, reference model, reward hacking.

Tools & Frameworks Referenced

PEFT (LoRA/QLoRA/DoRA/AdaLoRA), bitsandbytes (NF4), Hugging Face TRL (SFT and DPO trainers).
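As a rough sketch of how these tools fit together for a QLoRA run, the settings below configure 4-bit NF4 quantization with double quantization and a LoRA adapter. All hyperparameter values and the `target_modules` names are assumptions for illustration (module names depend on the model architecture), and `build_qlora_configs` is a hypothetical helper. The library imports sit inside the function so the settings can be inspected without `torch`, `transformers`, or `peft` installed.

```python
# Illustrative values only; tune them for the model and task at hand.
QUANT_SETTINGS = {
    "load_in_4bit": True,               # store base weights in 4 bits
    "bnb_4bit_quant_type": "nf4",       # NormalFloat4 data type
    "bnb_4bit_use_double_quant": True,  # also quantize the quantization constants
}

LORA_SETTINGS = {
    "r": 16,                                 # adapter rank
    "lora_alpha": 32,                        # scaling numerator (alpha / r = 2)
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "v_proj"],  # assumed names; model dependent
    "task_type": "CAUSAL_LM",
}

# Paged optimizer name as accepted by the Hugging Face Trainer `optim` argument.
OPTIMIZER_NAME = "paged_adamw_8bit"

def build_qlora_configs():
    """Build the library config objects (requires torch, transformers, peft)."""
    import torch
    from transformers import BitsAndBytesConfig
    from peft import LoraConfig

    quant = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls
        **QUANT_SETTINGS,
    )
    lora = LoraConfig(**LORA_SETTINGS)
    return quant, lora
```

In a typical setup, the quantization config would be passed as `quantization_config` when loading the base model with `from_pretrained`, and the LoRA config would be applied with `peft.get_peft_model` before handing the model to a TRL trainer.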

Prerequisites

Modules 04–05.
