Module Overview
This module closes the fine-tuning loop: how to measure whether fine-tuning worked, how to compress models for efficient serving, how to deploy them (including multiple LoRA adapters from one base model), and which frameworks accelerate the whole workflow.
Learning Objectives
- Justify evaluation as an integral stage of fine-tuning.
- Choose benchmark categories and apply LLM-as-judge methods responsibly.
- Compare quantization formats by output quality, inference speed, and hardware support.
- Select an inference framework and explain multi-adapter serving.
- Match a fine-tuning framework to skill level and configurability needs.
Topics Covered
Evaluation
- Why evaluation is part of the fine-tuning workflow
- Benchmark types: knowledge, reasoning, instruction-following
- LLM-as-judge (MT-Bench) and crowdsourced human pairwise voting (Chatbot Arena)
- Domain-specific evaluation design
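One widely used mitigation for position bias in pairwise LLM-as-judge comparisons is to query the judge twice with the answer order swapped and to count a win only when both verdicts agree. The sketch below illustrates that idea under stated assumptions: the `judge` callable and its "first" / "second" / "tie" return values are hypothetical, and the two toy judges are stand-ins for a real model call, not any library's API.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer)
# -> "first", "second", or "tie". A real judge would call an LLM here.
Judge = Callable[[str, str, str], str]


def debiased_verdict(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie" after judging both presentation orders."""
    forward = judge(question, answer_a, answer_b)   # A is shown first
    backward = judge(question, answer_b, answer_a)  # B is shown first
    if forward == "first" and backward == "second":
        return "A"  # A preferred in both orders
    if forward == "second" and backward == "first":
        return "B"  # B preferred in both orders
    return "tie"    # verdicts disagree or the judge abstained


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with pure position bias: always prefers the first slot."""
    return "first"


def longer_wins(question: str, first: str, second: str) -> str:
    """Toy judge with verbosity bias: prefers the longer answer."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


if __name__ == "__main__":
    print(debiased_verdict(always_first, "q", "x", "y"))
    print(debiased_verdict(longer_wins, "q", "long answer", "short"))
```

Swapping neutralizes the position-biased toy judge, which ends in a tie. It does not remove the length preference of the second toy judge, so other biases still need separate checks such as calibration against human labels.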
Quantization & Deployment Preparation
- GPTQ, AWQ, BNB NF4, FP8
- Merging LoRA adapters before serving
- Inference frameworks: vLLM, SGLang
- Serving multiple LoRA adapters from one base model
- GGUF format and llama.cpp
- Speculative decoding and inference acceleration
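Merging a LoRA adapter before serving amounts to folding the low-rank update into the base weight once, so inference needs no extra matrix multiplications. The sketch below shows the arithmetic in plain Python, assuming the common `alpha / r` scaling convention; the function names and the tiny matrices are illustrative only and do not reflect any particular library's implementation.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def apply(weight, vec):
    """Compute weight @ vec for a (d_out x d_in) matrix and a d_in vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def merge_lora(W, A, B, alpha):
    """Return W + (alpha / r) * B @ A, where A is (r x d_in) and B is (d_out x r)."""
    r = len(A)               # adapter rank = number of rows in A
    scale = alpha / r        # assumed scaling convention
    delta = matmul(B, A)     # low-rank update with the same shape as W
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Illustrative values: d_out = 2, d_in = 3, rank r = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 1.0]]
B = [[2.0], [-1.0]]
alpha = 2.0
x = [1.0, 2.0, 3.0]

# Unmerged path: base output plus the scaled adapter output.
scale = alpha / len(A)
base_out = apply(W, x)
adapter_out = apply(B, apply(A, x))
unmerged = [b + scale * a for b, a in zip(base_out, adapter_out)]

# Merged path: a single matrix-vector product with the folded weight.
merged = apply(merge_lora(W, A, B, alpha), x)

if __name__ == "__main__":
    print(unmerged, merged)
```

Both paths produce the same output, which is why a merged model can be served like an ordinary checkpoint. The trade-off is that a merged weight is tied to one adapter, whereas serving multiple adapters from one base model keeps `A` and `B` separate so they can be swapped per request.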
Tooling & Frameworks
- Hugging Face TRL — SFT, DPO, GRPO, ORPO
- Unsloth — consumer-GPU optimization
- Axolotl — full YAML configurability
- LLaMA-Factory — no-code web UI
- Managed fine-tuning: Together AI, AWS SageMaker
Key Concepts & Terminology
Post-training quantization, weight-only vs weight-and-activation quantization, activation-aware weight quantization, adapter hot-swapping, draft/verifier speculative decoding, judge bias and calibration.
Tools & Frameworks Referenced
vLLM, SGLang, llama.cpp, GPTQ, AWQ, bitsandbytes, GGUF, TRL, Unsloth, Axolotl, LLaMA-Factory, Together AI, AWS SageMaker.
Prerequisites
Module 06 (SFT/PEFT/alignment).