Module 07: Evaluation, Quantization and Deployment

Post-fine-tuning workflows — benchmark and LLM-as-judge evaluation, quantization (GPTQ, AWQ, NF4, FP8, GGUF), and serving with vLLM and llama.cpp.

May 28, 2026

Topics You Will Master

  • Why evaluation is part of the fine-tuning workflow, not an afterthought
  • Benchmark types and LLM-as-judge evaluation (MT-Bench, Chatbot Arena)
  • Quantization formats: GPTQ, AWQ, BNB NF4, FP8, GGUF
  • Serving fine-tuned models: vLLM, SGLang, llama.cpp, multi-adapter serving

Module Overview

This module closes the fine-tuning loop: how to measure whether fine-tuning worked, how to compress models for efficient serving, how to deploy them (including multiple LoRA adapters from one base model), and which frameworks accelerate the whole workflow.

Learning Objectives

  • Justify evaluation as an integral stage of fine-tuning.
  • Choose benchmark categories and apply LLM-as-judge methods responsibly.
  • Compare quantization formats by quality, speed, and hardware.
  • Select an inference framework and explain multi-adapter serving.
  • Match a fine-tuning framework to skill level and configurability needs.

Topics Covered

Evaluation

  • Why evaluation is part of the fine-tuning workflow
  • Benchmark types: knowledge, reasoning, instruction-following
  • LLM-as-judge: MT-Bench, Chatbot Arena
  • Domain-specific evaluation design
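As a small illustration of applying LLM-as-judge methods responsibly, the sketch below runs each pairwise comparison twice with the answer order swapped and only counts a win when both orderings agree. This is one common way to reduce position bias, not a prescribed method from this module. The `Judge` signature and the two toy judges are hypothetical stand-ins; a real judge would wrap a call to an LLM.

```python
from typing import Callable

# Hypothetical judge signature (an assumption for this sketch): it receives
# (prompt, first_answer, second_answer) and returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def debiased_verdict(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie"; a win counts only if it survives a position swap."""
    forward = judge(prompt, answer_a, answer_b)   # A is shown first
    backward = judge(prompt, answer_b, answer_a)  # B is shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"  # tied or order-dependent verdicts are treated as a tie


def win_rate(verdicts: list[str], side: str = "A") -> float:
    """Fraction of comparisons won by `side`, counting each tie as half a win."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    score = sum(1.0 if v == side else 0.5 if v == "tie" else 0.0 for v in verdicts)
    return score / len(verdicts)


# Toy judges for demonstration only; neither reflects real judge behaviour.
def always_first(prompt: str, first: str, second: str) -> str:
    """A maximally position-biased judge: whatever is shown first wins."""
    return "first"


def longer_is_better(prompt: str, first: str, second: str) -> str:
    """A length-biased judge that is at least consistent across orderings."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


if __name__ == "__main__":
    print(debiased_verdict(always_first, "q", "x", "y"))
    print(debiased_verdict(longer_is_better, "q", "long answer", "short"))
```

The swap neutralizes the purely position-biased judge, but the length-biased judge still produces a confident verdict. Order swapping addresses one bias only, so judge calibration against human labels remains necessary.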

Quantization & Deployment Preparation

  • GPTQ, AWQ, BNB NF4, FP8
  • Merging LoRA adapters before serving
  • Inference frameworks: vLLM, SGLang
  • Serving multiple LoRA adapters from one base model
  • GGUF format and llama.cpp
  • Speculative decoding and inference acceleration
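To make "merging LoRA adapters before serving" concrete, here is a minimal pure-Python sketch of the standard LoRA merge arithmetic, W_merged = W + (alpha / r) · B · A, on tiny matrices. The helper names (`matmul`, `matvec`, `merge_lora`) are illustrative and not taken from any library; real workflows would rely on the adapter library's own merge routine on tensors.

```python
Matrix = list[list[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain nested-list matrix product; x is (m, k), y is (k, n)."""
    inner = len(y)
    if any(len(row) != inner for row in x):
        raise ValueError("shape mismatch")
    cols = len(y[0])
    return [[sum(row[k] * y[k][j] for k in range(inner)) for j in range(cols)]
            for row in x]


def matvec(w: Matrix, x: list[float]) -> list[float]:
    """Matrix-vector product for a single input vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def merge_lora(w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """Fold a LoRA update into the base weight.

    w: (d_out, d_in) base weight
    a: (r, d_in) down-projection, b: (d_out, r) up-projection
    The update is scaled by alpha / r, following the usual LoRA convention.
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]
    a = [[1.0, 2.0]]        # rank r = 1
    b = [[0.5], [-1.0]]
    merged = merge_lora(w, a, b, alpha=2.0)
    x = [3.0, 4.0]
    # Merged weights give the same output as base path plus scaled adapter path.
    adapter_out = [2.0 * v for v in matvec(b, matvec(a, x))]
    unmerged = [base + add for base, add in zip(matvec(w, x), adapter_out)]
    print(merged, matvec(merged, x), unmerged)
```

Merging removes the extra adapter matrix multiplications at inference time. The trade-off is that a merged model can no longer swap adapters, which is why multi-adapter serving keeps the base weights and the adapters separate.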

Tooling & Frameworks

  • Hugging Face TRL — SFT, DPO, GRPO, ORPO
  • Unsloth — consumer-GPU optimization
  • Axolotl — full YAML configurability
  • LLaMA-Factory — no-code web UI
  • Managed fine-tuning: Together AI, AWS SageMaker

Key Concepts & Terminology

Post-training quantization, weight-only vs activation-aware quantization, adapter hot-swapping, draft/verifier speculative decoding, judge bias and calibration.
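For intuition about post-training, weight-only quantization, the sketch below shows the simplest possible baseline: symmetric round-to-nearest quantization with a single absmax scale. It is deliberately not GPTQ, AWQ, or NF4, which improve on this baseline in different ways. The function names and the example weights are illustrative assumptions.

```python
def quantize_absmax(weights: list[float], bits: int = 4) -> tuple[list[int], float]:
    """Map floats to signed integer codes using one shared scale.

    The largest-magnitude weight is mapped to the largest code, so every
    other weight lands within half a quantization step of its code.
    """
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit symmetric codes
    peak = max(abs(w) for w in weights)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.7, -0.33, 0.12, 0.0]
    codes, scale = quantize_absmax(weights, bits=4)
    restored = dequantize(codes, scale)
    errors = [abs(w - r) for w, r in zip(weights, restored)]
    print(codes, scale, max(errors))
```

The round-trip error per weight is bounded by half the scale, so a single outlier weight inflates the error for every other weight in the group. That sensitivity is one motivation for per-group scales and for the calibration-based formats covered in this module.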

Tools & Frameworks Referenced

vLLM, SGLang, llama.cpp, GPTQ, AWQ, bitsandbytes, GGUF, TRL, Unsloth, Axolotl, LLaMA-Factory, Together AI, AWS SageMaker.

Prerequisites

Module 06 (SFT/PEFT/alignment).
