
Module 06: SFT, PEFT and Preference Alignment

Syllabus on adapting and aligning LLMs — parameter-efficient fine-tuning (LoRA, QLoRA, DoRA, AdaLoRA, LoRA+), supervised fine-tuning, and preference alignment with RLHF and DPO.

May 28, 2026 at 12:19 PM

Topics You Will Master

  • The intrinsic dimensionality insight behind parameter-efficient fine-tuning
  • LoRA and its variants: QLoRA, DoRA, AdaLoRA, LoRA+
  • Supervised fine-tuning for instruction, chat, and chain-of-thought
  • Why SFT alone is insufficient for alignment
  • Preference alignment: RLHF with PPO and Direct Preference Optimization (DPO)

Best For

Engineers ready to adapt base models efficiently and align them to human preferences on limited hardware.

Expected Outcome

The ability to select a PEFT method and an alignment strategy appropriate to a task, dataset, and compute budget.

Module Overview

This is the core fine-tuning module. It covers parameter-efficient methods that make adaptation feasible on modest GPUs, the supervised fine-tuning stage common to all post-training pipelines, and the preference-alignment techniques that shape model behaviour toward human preferences.

Learning Objectives

  • Explain the intrinsic-dimensionality rationale for low-rank adaptation.
  • Compare LoRA, QLoRA, DoRA, AdaLoRA, and LoRA+ by mechanism and use case.
  • Describe SFT as stage one of every post-training pipeline.
  • Articulate why SFT alone does not achieve alignment.
  • Contrast RLHF-with-PPO against DPO in complexity and components.

Topics Covered

Parameter-Efficient Fine-Tuning (PEFT)

  • The intrinsic dimensionality insight
  • LoRA — rank, alpha, and target modules
  • QLoRA — 4-bit NF4, double quantization, paged optimizers
  • DoRA — magnitude + direction decomposition
  • AdaLoRA — adaptive rank allocation
  • LoRA+ — separate learning rates for the A and B matrices
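
As a rough illustration of the rank, alpha, and target-module ideas listed above, the following minimal sketch applies a low-rank update to one frozen weight matrix in plain Python. The helper names (`matvec`, `lora_forward`, `trainable_params`), the toy dimensions, and the 4096 x 4096 layer size are assumptions chosen for illustration and are not taken from the PEFT library. It assumes the common convention of scaling the adapter output by alpha / r and initialising B to zero.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha):
    """Return W x + (alpha / r) * B (A x), where r is the adapter rank."""
    r = len(A)                           # A has shape (r, d_in)
    base = matvec(W, x)                  # frozen pretrained path
    low_rank = matvec(B, matvec(A, x))   # B has shape (d_out, r)
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]

def trainable_params(d_out, d_in, r):
    """Parameter count of the adapter pair: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

random.seed(0)
d_out, d_in, r, alpha = 6, 8, 2, 4       # toy sizes, illustrative only
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]    # zero init: the adapter starts as a no-op
x = [random.gauss(0, 1) for _ in range(d_in)]

y_base = matvec(W, x)
y_lora = lora_forward(W, A, B, x, alpha)

# Parameter arithmetic for a hypothetical 4096 x 4096 projection at rank 16.
full = 4096 * 4096
adapter = trainable_params(4096, 4096, 16)
print(f"full: {full}, adapter: {adapter}, ratio: {adapter / full:.4%}")
```

Because B starts at zero, the adapted layer reproduces the base layer exactly before training, and only the r * (d_in + d_out) adapter parameters would receive gradients.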

Supervised Fine-Tuning (SFT)

  • SFT as stage 1 of every post-training pipeline
  • Instruction tuning (FLAN, Alpaca, OpenHermes)
  • Chat / conversational fine-tuning
  • Chain-of-Thought fine-tuning
  • Domain-specific fine-tuning best practices
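
One detail that often comes up in instruction and chat fine-tuning is computing the loss only on response tokens. The sketch below shows that idea in plain Python; it is an assumption about a common setup, not a description of any particular trainer. The function names are hypothetical, and the ignore value of -100 follows the default `ignore_index` of PyTorch's cross-entropy loss.

```python
import math

IGNORE_INDEX = -100  # label value excluded from the loss (PyTorch's default ignore_index)

def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sft_loss(logits_per_position, labels):
    """Mean next-token negative log-likelihood over unmasked labels.

    logits_per_position[t] scores the token at position t + 1,
    so it is compared against labels[t + 1].
    """
    total, count = 0.0, 0
    for t in range(len(labels) - 1):
        target = labels[t + 1]
        if target == IGNORE_INDEX:
            continue
        total -= log_softmax(logits_per_position[t])[target]
        count += 1
    return total / count if count else 0.0

# Toy example: vocabulary of 4 tokens, uniform logits at every position.
input_ids, labels = build_labels(prompt_ids=[0, 1], response_ids=[2, 3])
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in input_ids]
loss = sft_loss(uniform_logits, labels)
print(labels, loss)
```

With uniform logits the loss equals log(4), and only the two response tokens contribute to the average.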

Preference Alignment

  • Why SFT alone is not enough
  • RLHF with PPO — reward model, critic, and KL penalty
  • DPO — direct optimization without a separate reward model
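
To make the contrast between the two approaches concrete, here is a minimal sketch of the two objectives' core terms, assuming sequence-level log-probabilities are already available. The function names and the beta value of 0.1 are illustrative assumptions. `kl_penalized_reward` uses the single-sample estimate log pi minus log pi_ref as the KL term, which is one common way the penalty is applied in PPO-based RLHF, and `dpo_loss` follows the published DPO objective for a single preference pair.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Reward-model score minus beta * (log pi - log pi_ref)."""
    return reward - beta * (logp_policy - logp_ref)

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)]) for one pair."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)

# Policy identical to the reference: the margin is 0 and the loss is log(2).
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen response: the loss drops.
loss_after = dpo_loss(-8.0, -14.0, -10.0, -12.0)

# Shaped reward for a sample the policy likes more than the reference does.
shaped = kl_penalized_reward(reward=1.0, logp_policy=-5.0, logp_ref=-6.0)
print(loss_at_init, loss_after, shaped)
```

Note that `dpo_loss` needs only log-probabilities from the policy and a frozen reference model, while the RLHF path additionally needs a reward-model score (and, with PPO, a critic) for every sampled response.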

Key Concepts & Terminology

Low-rank adapters, adapter merging, reward modelling, KL regularisation, preference pairs, reference model, reward hacking.
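
Adapter merging can be sketched as folding the scaled low-rank product back into the base weight, so that inference needs a single matrix. The example below is a toy illustration in plain Python, assuming the alpha / r scaling convention; the helper names and the numbers are hypothetical.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def merge_lora(W, A, B, alpha):
    """Return W + (alpha / r) * B A, where r is the number of rows of A."""
    scale = alpha / len(A)
    delta = matmul(B, A)                 # shape (d_out, d_in), rank at most r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]                   # frozen base weight, shape (2, 3)
A = [[1.0, 2.0, 3.0]]                    # adapter down-projection, rank 1
B = [[0.5],
     [-1.0]]                             # adapter up-projection
alpha = 2.0
x = [1.0, 1.0, 1.0]

merged = merge_lora(W, A, B, alpha)
y_merged = matvec(merged, x)

# The unmerged path: base output plus the scaled adapter output.
scale = alpha / len(A)
y_unmerged = [b + scale * l
              for b, l in zip(matvec(W, x), matvec(B, matvec(A, x)))]
print(merged, y_merged, y_unmerged)
```

Both paths give the same output for this input, which is why a merged adapter adds no extra matrix multiplications at inference time.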

Tools & Frameworks Referenced

PEFT (LoRA/QLoRA/DoRA/AdaLoRA), bitsandbytes (NF4), Hugging Face TRL (SFT and DPO trainers).

Prerequisites

Modules 04–05.
