Module Overview
This module is the evaluation and LLMOps capstone for the curriculum's technique modules. It replaces ad-hoc testing with structured evaluation harnesses, covers benchmarking and agent-native evaluation, secures tool execution in sandboxes, and builds CI/CD pipelines that gate deployment on eval, cost, and latency thresholds.
Learning Objectives
- Explain why structured eval harnesses are needed and where ad-hoc testing fails.
- Benchmark LLMs with standardised harnesses and interpret the results.
- Build agent-native evals with Inspect AI's Task/Solver/Scorer abstractions.
- Secure tool execution with sandboxes and apply agent harness patterns.
- Build eval-gated CI/CD pipelines with prompt regression tests and cost regression guards.
Topics Covered
Evaluation Fundamentals & LLM Benchmarking
- Evaluation harness fundamentals (what evals are, why ad-hoc testing fails)
- LLM benchmarking with lm-evaluation-harness (MMLU, GSM8K, TruthfulQA)
Agent-Native Evaluation & LLM-as-Judge
- Agent-native eval with Inspect AI (Task, Solver, Scorer abstractions)
- LLM-as-judge pipelines (model-graded scoring, calibration, bias)
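As one illustration of judge calibration, a judge model's verdicts can be compared against human labels on the same items using a chance-corrected agreement score. The sketch below computes Cohen's kappa in plain Python. The function name `cohen_kappa` and the "pass"/"fail" labels are assumptions made for this example, not part of any tool listed in this module.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Chance-corrected agreement between human labels and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")

    n = len(human)
    # Fraction of items where the judge and the human gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human) | set(judge)
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical labels for four eval items.
    human_labels = ["pass", "pass", "fail", "fail"]
    judge_labels = ["pass", "fail", "fail", "fail"]
    print(cohen_kappa(human_labels, judge_labels))
```

A kappa near 1 suggests the judge tracks human judgment closely, while a value near 0 suggests agreement no better than chance. What counts as acceptable depends on the task.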
Execution Sandboxes, State Management & Agent Harness Patterns
- Tool execution sandboxes (E2B, Modal, Docker; secure code execution)
- Agent state management and checkpointing (LangGraph persistence, interrupts)
- ReAct loop harness patterns (iteration guards, fallback handlers, token budgets)
- Multi-agent execution harnesses (supervisor-worker contracts)
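The iteration guard, fallback handler, and token budget ideas listed above can be sketched as a small loop wrapper. This is a minimal illustration under stated assumptions: `run_guarded_loop`, `LoopResult`, and the `step` callable (which returns an optional final answer plus the tokens it consumed) are hypothetical names, not the API of any framework named in this module.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class LoopResult:
    answer: str
    steps: int
    tokens_used: int
    stop_reason: str  # "final_answer", "token_budget", or "max_iterations"


def run_guarded_loop(
    step: Callable[[int], Tuple[Optional[str], int]],
    max_iterations: int = 8,
    token_budget: int = 4000,
    fallback: str = "Unable to complete the task within limits.",
) -> LoopResult:
    """Run an agent step function until it answers or a guard trips."""
    tokens_used = 0
    for i in range(1, max_iterations + 1):
        # Hypothetical contract: step returns (final_answer_or_None, tokens_spent).
        answer, step_tokens = step(i)
        tokens_used += step_tokens

        if answer is not None:
            return LoopResult(answer, i, tokens_used, "final_answer")
        if tokens_used >= token_budget:
            # Token budget guard: stop before spending more.
            return LoopResult(fallback, i, tokens_used, "token_budget")

    # Iteration guard: the loop never produced a final answer.
    return LoopResult(fallback, max_iterations, tokens_used, "max_iterations")


if __name__ == "__main__":
    # A stand-in step that answers on the third iteration, using 100 tokens each time.
    result = run_guarded_loop(lambda i: ("done" if i == 3 else None, 100))
    print(result)
```

Returning a `stop_reason` alongside the answer lets a harness count how often each guard trips, which is useful when tuning the limits.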
Prompt Regression, Trace-Based Testing & Agent CI/CD
- Prompt regression and snapshot testing (PromptFoo, golden datasets, versioning)
- Trace-based testing and observability (Opik / LangSmith assertions, flaky test detection)
- Agent CI/CD pipelines (GitHub Actions, eval gating on PRs, cost regression guards)
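An eval gate with cost and latency guards can be sketched as a script that compares the current run's metrics with a stored baseline and exits non-zero on any violation, which is what makes a CI job fail. The metric keys, threshold values, and the `check_gate` function below are assumptions chosen for illustration; a real pipeline would load these numbers from its own eval output.

```python
import sys
from typing import Dict, List


def check_gate(
    current: Dict[str, float],
    baseline: Dict[str, float],
    min_accuracy: float = 0.85,
    max_cost_increase: float = 0.10,
    max_p95_latency_s: float = 5.0,
) -> List[str]:
    """Return a list of human-readable gate failures (empty means pass)."""
    failures = []

    # Eval gate: absolute quality floor.
    if current["accuracy"] < min_accuracy:
        failures.append(
            f"accuracy {current['accuracy']:.3f} is below {min_accuracy:.3f}"
        )

    # Cost regression guard: relative to the baseline run.
    cost_limit = baseline["cost_usd"] * (1 + max_cost_increase)
    if current["cost_usd"] > cost_limit:
        failures.append(
            f"cost ${current['cost_usd']:.2f} exceeds limit ${cost_limit:.2f}"
        )

    # Latency guard: absolute ceiling on p95 latency.
    if current["p95_latency_s"] > max_p95_latency_s:
        failures.append(
            f"p95 latency {current['p95_latency_s']:.2f}s exceeds {max_p95_latency_s:.2f}s"
        )
    return failures


if __name__ == "__main__":
    # Hypothetical metrics; in CI these would come from the eval run's output files.
    baseline_metrics = {"accuracy": 0.90, "cost_usd": 1.00, "p95_latency_s": 3.0}
    current_metrics = {"accuracy": 0.88, "cost_usd": 1.25, "p95_latency_s": 3.2}

    problems = check_gate(current_metrics, baseline_metrics)
    for problem in problems:
        print(f"GATE FAILED: {problem}")
    sys.exit(1 if problems else 0)
```

Run as a step in a pull request workflow, the non-zero exit code blocks the merge until the regression is fixed or the thresholds are deliberately changed.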
Key Concepts & Terminology
Eval harness, golden dataset, model-graded scoring, judge calibration and bias, execution sandbox, iteration guard, eval gate, cost/latency regression guard.
Tools & Frameworks Referenced
lm-evaluation-harness, Inspect AI, PromptFoo, Opik / LangSmith, E2B, Modal, Docker, GitHub Actions.
Prerequisites
Modules 19–20 (agents), Module 07 (evaluation basics), Module 22 (prompting).