Module 26: Evaluation Harnesses and Agent CI/CD

Module Overview

This module is the evaluation and LLMOps capstone for the curriculum's technique modules. It replaces ad-hoc testing with structured evaluation harnesses, covers benchmarking and agent-native evaluation, secures tool execution in sandboxes, and builds CI/CD pipelines that gate deployment on eval, cost, and latency thresholds.

Learning Objectives

Explain why structured eval harnesses are needed and where ad-hoc testing fails.
Benchmark LLMs with standardised harnesses and interpret the results.
Build agent-native evals with Inspect AI's Task/Solver/Scorer abstractions.
Secure tool execution with sandboxes and apply agent harness patterns.
Build eval-gated CI/CD with prompt regression and cost regression guards.

Topics Covered

Evaluation Fundamentals & LLM Benchmarking

Evaluation harness fundamentals (what evals are, why ad-hoc testing fails)
LLM benchmarking with lm-evaluation-harness (MMLU, GSM8K, TruthfulQA)
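
To make the idea of an eval harness concrete, here is a minimal sketch of the core loop: a fixed dataset, a system under test, a scorer, and an aggregate metric. It is an illustration under assumptions, not the API of lm-evaluation-harness or any other tool; `run_eval`, `exact_match`, and the `input`/`expected` keys are hypothetical names chosen for this example.

```python
from typing import Callable, Dict, List

def exact_match(output: str, expected: str) -> bool:
    """Case- and whitespace-insensitive string comparison."""
    return output.strip().lower() == expected.strip().lower()

def run_eval(
    model: Callable[[str], str],
    dataset: List[Dict[str, str]],
    scorer: Callable[[str, str], bool] = exact_match,
) -> Dict[str, object]:
    """Run every case through the model and score it against the expected answer."""
    results = []
    for case in dataset:
        output = model(case["input"])
        results.append({
            "input": case["input"],
            "output": output,
            "passed": scorer(output, case["expected"]),
        })
    passed = sum(1 for r in results if r["passed"])
    accuracy = passed / len(results) if results else 0.0
    return {"accuracy": accuracy, "results": results}

# Toy stand-in for a model call (hypothetical; a real harness would call an LLM here).
def toy_model(prompt: str) -> str:
    answers = {"2 + 2 =": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "unknown")

golden = [
    {"input": "2 + 2 =", "expected": "4"},
    {"input": "Capital of France?", "expected": "paris"},
    {"input": "Largest planet?", "expected": "Jupiter"},
]
report = run_eval(toy_model, golden)
```

Unlike ad-hoc testing, the same dataset and scorer run on every change, so two runs produce comparable numbers and per-case results that can be diffed.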

Agent-Native Evaluation & LLM-as-Judge

Agent-native eval with Inspect AI (Task, Solver, Scorer abstractions)
LLM-as-judge pipelines (model-graded scoring, calibration, bias)
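
One common way to approach judge calibration is to compare the judge model's verdicts with human labels on the same items and correct for agreement that would occur by chance. The sketch below computes Cohen's kappa with the standard library only; the function name and the sample labels are illustrative assumptions, not part of any framework listed in this module.

```python
from collections import Counter
from typing import Hashable, Sequence

def cohen_kappa(judge: Sequence[Hashable], human: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between judge labels and human labels."""
    if len(judge) != len(human) or len(judge) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(judge)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(1 for j, h in zip(judge, human) if j == h) / n
    # Expected agreement if both raters labelled independently at their own rates.
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    expected = sum((judge_counts[l] / n) * (human_counts[l] / n) for l in labels)
    if expected == 1.0:
        # Both raters used a single identical label, so agreement is total.
        return 1.0
    return (observed - expected) / (1 - expected)

human_labels = ["pass", "pass", "fail", "fail"]
perfect_judge = ["pass", "pass", "fail", "fail"]
random_judge = ["pass", "fail", "pass", "fail"]

kappa_perfect = cohen_kappa(perfect_judge, human_labels)
kappa_random = cohen_kappa(random_judge, human_labels)
```

A kappa near 1 suggests the judge tracks human raters closely; a value near 0 means its agreement is no better than chance, even if raw agreement looks acceptable.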

Execution Sandboxes, State Management & Agent Harness Patterns

Tool execution sandboxes (E2B, Modal, Docker: secure code execution)
Agent state management and checkpointing (LangGraph persistence, interrupts)
ReAct loop harness patterns (iteration guards, fallback handlers, token budgets)
Multi-agent execution harnesses (supervisor-worker contracts)
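
The iteration guard, fallback handler, and token budget named above can be sketched as a thin wrapper around the agent loop. This is a minimal illustration under assumptions: `run_guarded`, `RunResult`, and the step contract (each step returns a final answer or `None`, plus the tokens it consumed) are hypothetical and not taken from LangGraph or any other framework.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical step contract: given the step index, return
# (final_answer_or_None, tokens_used_by_this_step).
StepFn = Callable[[int], Tuple[Optional[str], int]]

@dataclass
class RunResult:
    answer: str
    steps: int
    tokens: int
    stopped_by: str  # "finished", "max_steps", or "token_budget"

def run_guarded(
    step: StepFn,
    max_steps: int = 8,
    token_budget: int = 4000,
    fallback: str = "Unable to complete the task.",
) -> RunResult:
    """Run the loop until it finishes, exceeds the budget, or hits the step cap."""
    tokens = 0
    for i in range(max_steps):
        answer, used = step(i)
        tokens += used
        if tokens > token_budget:
            # Budget guard: stop and return the fallback instead of spending more.
            return RunResult(fallback, i + 1, tokens, "token_budget")
        if answer is not None:
            return RunResult(answer, i + 1, tokens, "finished")
    # Iteration guard: the loop never produced a final answer.
    return RunResult(fallback, max_steps, tokens, "max_steps")

# Toy steps standing in for real model and tool calls.
def finishes_on_third_step(i: int) -> Tuple[Optional[str], int]:
    return ("done" if i == 2 else None, 100)

def never_finishes(i: int) -> Tuple[Optional[str], int]:
    return (None, 100)

def expensive_step(i: int) -> Tuple[Optional[str], int]:
    return (None, 3000)

ok = run_guarded(finishes_on_third_step)
looped = run_guarded(never_finishes, max_steps=5)
over_budget = run_guarded(expensive_step, token_budget=4000)
```

Recording `stopped_by` alongside the answer lets an eval distinguish a genuine completion from a guarded exit, which matters when counting pass rates.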

Prompt Regression, Trace-Based Testing & Agent CI/CD

Prompt regression and snapshot testing (PromptFoo, golden datasets, versioning)
Trace-based testing and observability (Opik / LangSmith assertions, flaky test detection)
Agent CI/CD pipelines (GitHub Actions, eval gating on PRs, cost regression guards)
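
An eval gate with cost and latency regression guards can be reduced to a comparison between a stored baseline and the metrics of the candidate change. The sketch below is an assumption-laden illustration: the `Metrics` fields, the `gate` function, and the threshold values are hypothetical defaults chosen for the example, not values prescribed by this module or by GitHub Actions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Metrics:
    pass_rate: float       # fraction of eval cases passed, 0.0 to 1.0
    cost_usd: float        # total cost of the eval run
    p95_latency_s: float   # 95th percentile latency in seconds

def gate(
    baseline: Metrics,
    candidate: Metrics,
    min_pass_rate: float = 0.90,         # assumed threshold
    max_cost_increase: float = 0.10,     # allow up to 10% more cost than baseline
    max_latency_increase: float = 0.20,  # allow up to 20% more latency than baseline
) -> List[str]:
    """Return a list of human-readable failures; an empty list means the gate passes."""
    failures = []
    if candidate.pass_rate < min_pass_rate:
        failures.append(
            f"pass rate {candidate.pass_rate:.2f} is below {min_pass_rate:.2f}"
        )
    if candidate.cost_usd > baseline.cost_usd * (1 + max_cost_increase):
        failures.append(
            f"cost ${candidate.cost_usd:.2f} exceeds baseline ${baseline.cost_usd:.2f} "
            f"by more than {max_cost_increase:.0%}"
        )
    if candidate.p95_latency_s > baseline.p95_latency_s * (1 + max_latency_increase):
        failures.append(
            f"p95 latency {candidate.p95_latency_s:.2f}s exceeds baseline "
            f"{baseline.p95_latency_s:.2f}s by more than {max_latency_increase:.0%}"
        )
    return failures

baseline = Metrics(pass_rate=0.95, cost_usd=1.00, p95_latency_s=2.0)
good_change = Metrics(pass_rate=0.96, cost_usd=1.05, p95_latency_s=2.1)
bad_change = Metrics(pass_rate=0.85, cost_usd=1.30, p95_latency_s=2.1)

good_failures = gate(baseline, good_change)
bad_failures = gate(baseline, bad_change)
```

In a CI job, a script along these lines would typically exit with a non-zero status when the returned list is non-empty, which is what causes the pull request check to fail.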

Key Concepts & Terminology

Eval harness, golden dataset, model-graded scoring, judge calibration and bias, execution sandbox, iteration guard, eval gate, cost/latency regression guard.

Tools & Frameworks Referenced

lm-evaluation-harness, Inspect AI, PromptFoo, Opik / LangSmith, E2B, Modal, Docker, GitHub Actions.

Prerequisites

Modules 19-20 (agents), Module 07 (evaluation basics), Module 22 (prompting).
