Module Overview
This module treats data as the central lever of fine-tuning. It covers the mechanics of formatting and masking instruction data, then moves to scalable synthetic data generation and the quality-control techniques (LLM-as-Judge scoring, deduplication) needed to guard against model collapse and data poisoning.
Learning Objectives
- Format instruction and chat datasets correctly for a target model.
- Explain loss masking and what breaks without it.
- Describe the design of a deduplication and filtering pipeline.
- Generate synthetic instruction and preference data using self-instruct techniques.
- Apply LLM-as-Judge scoring and mitigate model collapse and data poisoning.
Topics Covered
Data Preparation for Fine-Tuning
- Dataset formats: instruction pairs, chat format
- Chat templates: ChatML, Llama-3, Mistral
- Loss masking: why it matters and what breaks without it
- Deduplication and filtering pipelines
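As a preview of the loss-masking topic, here is a minimal sketch of how labels are commonly built for one instruction example. It assumes the convention of marking ignored positions with -100, which is the default `ignore_index` of PyTorch's `CrossEntropyLoss`. The function name `build_labels` and the token ids are hypothetical and chosen only for illustration; a real pipeline would get its ids from the target model's tokenizer and chat template.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response tokens and mask the prompt in the labels.

    Only response positions keep their token id as a label, so the loss is
    computed on the response and not on the prompt.
    """
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


# Hypothetical token ids, for illustration only.
prompt_ids = [101, 7592, 2088]
response_ids = [2023, 2003, 102]

input_ids, labels = build_labels(prompt_ids, response_ids)
print(input_ids)
print(labels)
```

Without the mask, the prompt tokens would also contribute to the loss, so the model would be trained to reproduce instructions as well as answers.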
Synthetic Dataset Generation
- Why data is the leverage point
- The taxonomy of synthetic data
- Instruction dataset generation: Self-Instruct and Alpaca (scaling self-instruct)
- Preference dataset generation
- LLM-as-Judge scoring
- Tooling: distilabel, DataDreamer, Argilla
- Risks: model collapse and data poisoning
- End-to-end pipeline: building a domain SFT dataset
Key Concepts & Terminology
Instruction-tuning data, preference pairs (chosen/rejected), loss masking, near-duplicate detection, judge calibration and bias, distribution drift, model collapse.
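To make near-duplicate detection concrete, the sketch below compares texts by Jaccard similarity over word 3-gram shingles and keeps a text only if it is not too similar to one already kept. This is one simple approach, not necessarily the one used later in the module. The helper names, the 0.7 threshold, and the sample strings are all assumptions for illustration, and the pairwise comparison is quadratic, so large datasets typically rely on approximate methods such as MinHash with locality-sensitive hashing instead.

```python
def shingles(text, n=3):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(texts, threshold=0.7, n=3):
    """Keep each text unless it is at least `threshold` similar to a kept one."""
    kept, kept_shingles = [], []
    for text in texts:
        current = shingles(text, n)
        if all(jaccard(current, seen) < threshold for seen in kept_shingles):
            kept.append(text)
            kept_shingles.append(current)
    return kept


# Hypothetical samples, for illustration only.
samples = [
    "Explain what loss masking does during fine-tuning.",
    "Please explain what loss masking does during fine-tuning.",
    "Write a haiku about deduplication.",
]
kept = dedupe(samples)
print(kept)
```

In this example the second sample shares five of its six shingles with the first, so it is dropped, while the unrelated third sample is kept.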
Tools & Frameworks Referenced
distilabel, DataDreamer, Argilla, Hugging Face Datasets.
Prerequisites
Module 04 (lifecycle), familiarity with instruction-following LLMs.