
Module 26: Evaluation Harnesses and Agent CI/CD

Syllabus on evaluation and LLMOps CI/CD — structured eval harnesses, LLM benchmarking, agent-native evaluation with Inspect AI, LLM-as-judge, execution sandboxes, prompt regression, and eval-gated agent CI/CD.

May 28, 2026 at 12:01 PM

Topics You Will Master

  • Evaluation harness fundamentals and why ad-hoc testing fails
  • LLM benchmarking with lm-evaluation-harness (MMLU, GSM8K, TruthfulQA)
  • Agent-native evaluation with Inspect AI and LLM-as-judge pipelines
  • Execution sandboxes (E2B, Modal, Docker) and agent harness patterns
  • Prompt regression testing (PromptFoo) and eval-gated agent CI/CD

Best For

Engineers building the evaluation and CI/CD backbone that gates LLM and agent deployments.

Expected Outcome

The ability to design eval harnesses, secure agent execution, and build CI/CD pipelines that block regressions on quality, cost, and latency.

Module Overview

This module is the evaluation and LLMOps capstone for the curriculum's technique modules. It replaces ad-hoc testing with structured evaluation harnesses, covers benchmarking and agent-native evaluation, secures tool execution in sandboxes, and builds CI/CD pipelines that gate deployment on eval, cost, and latency thresholds.

Learning Objectives

  • Explain why structured eval harnesses are needed and where ad-hoc testing fails.
  • Benchmark LLMs with standardised harnesses and interpret the results.
  • Build agent-native evals with Inspect AI's Task/Solver/Scorer abstractions.
  • Secure tool execution with sandboxes and apply agent harness patterns.
  • Build eval-gated CI/CD with prompt regression and cost regression guards.

Topics Covered

Evaluation Fundamentals & LLM Benchmarking

  • Evaluation harness fundamentals (what evals are, why ad-hoc testing fails)
  • LLM benchmarking with lm-evaluation-harness (MMLU, GSM8K, TruthfulQA)
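
To make the idea of an eval harness concrete, the following is a minimal sketch in plain Python, not the API of lm-evaluation-harness or any other tool named in this module. It assumes a golden dataset of prompt/expected pairs, an exact-match scorer, and a hypothetical `toy_model` function standing in for a real LLM call; all names here are illustrative.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One golden-dataset entry: a prompt and the answer we expect."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> bool:
    """Scorer: exact match, ignoring case and surrounding whitespace."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(
    model: Callable[[str], str],
    cases: list[EvalCase],
    scorer: Callable[[str, str], bool] = exact_match,
) -> dict:
    """Run every case through the model and aggregate a pass rate."""
    failures = []
    for case in cases:
        output = model(case.prompt)
        if not scorer(output, case.expected):
            failures.append(
                {"prompt": case.prompt, "expected": case.expected, "got": output}
            )
    passed = len(cases) - len(failures)
    return {
        "total": len(cases),
        "passed": passed,
        "pass_rate": passed / len(cases) if cases else 0.0,
        "failures": failures,
    }


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call."""
    answers = {"2 + 2 = ?": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "I don't know")


GOLDEN = [
    EvalCase("2 + 2 = ?", "4"),
    EvalCase("Capital of France?", "paris"),
    EvalCase("Largest planet?", "Jupiter"),
]

report = run_eval(toy_model, GOLDEN)
print(f"{report['passed']}/{report['total']} passed")
for failure in report["failures"]:
    print("FAILED:", failure)
```

Unlike ad-hoc spot checks, the same dataset and scorer run on every change, and each failure is recorded with its input and output so a regression can be traced to a specific case.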

Agent-Native Evaluation & LLM-as-Judge

  • Agent-native eval with Inspect AI (Task, Solver, Scorer abstractions)
  • LLM-as-judge pipelines (model-graded scoring, calibration, bias)
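
Judge calibration can be illustrated with a small, library-free sketch. It assumes the judge model and a human annotator have each labelled the same outputs as "pass" or "fail"; the labels below are made up for illustration. It reports raw agreement, Cohen's kappa (agreement corrected for chance), and a simple leniency figure as one possible bias indicator. This is one way to check a judge, not a method prescribed by Inspect AI.

```python
from collections import Counter


def agreement_rate(judge: list[str], human: list[str]) -> float:
    """Fraction of items where judge and human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    observed = agreement_rate(judge, human)
    n = len(judge)
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    labels = set(judge_counts) | set(human_counts)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def leniency(judge: list[str], human: list[str], positive: str = "pass") -> float:
    """Judge pass rate minus human pass rate; above 0 means the judge is more lenient."""
    return judge.count(positive) / len(judge) - human.count(positive) / len(human)


# Illustrative labels only, not real evaluation data.
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge_labels = ["pass", "pass", "pass", "fail", "pass", "fail", "pass", "pass"]

agreement = agreement_rate(judge_labels, human_labels)
kappa = cohens_kappa(judge_labels, human_labels)
bias = leniency(judge_labels, human_labels)
print(f"agreement={agreement:.2f} kappa={kappa:.2f} leniency={bias:+.2f}")
```

In this toy data the judge agrees with the human on 6 of 8 items, but both disagreements are cases where the judge passed an output the human failed, which is the kind of systematic skew a calibration step is meant to surface.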

Execution Sandboxes, State Management & Agent Harness Patterns

  • Tool execution sandboxes (E2B, Modal, Docker — secure code execution)
  • Agent state management and checkpointing (LangGraph persistence, interrupts)
  • ReAct loop harness patterns (iteration guards, fallback handlers, token budgets)
  • Multi-agent execution harnesses (supervisor-worker contracts)
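
The ReAct loop harness pattern listed above (iteration guard, token budget, fallback handler) can be sketched without any agent framework. The `StepResult` shape and the `step` callback are hypothetical stand-ins for one reason/act turn of a real agent; the guard logic is the part being illustrated.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StepResult:
    """Outcome of one reason/act iteration (hypothetical shape)."""
    tokens_used: int
    final_answer: Optional[str] = None  # None means the agent wants another turn


def _outcome(answer: str, stopped_by: str, iterations: int, tokens: int) -> dict:
    return {
        "answer": answer,
        "stopped_by": stopped_by,
        "iterations": iterations,
        "tokens": tokens,
    }


def run_guarded_loop(
    step: Callable[[int], StepResult],
    max_iterations: int,
    token_budget: int,
    fallback: str,
) -> dict:
    """Run the agent loop until it answers, or a guard stops it."""
    tokens_spent = 0
    iterations = 0
    for iterations in range(1, max_iterations + 1):
        result = step(iterations)
        tokens_spent += result.tokens_used
        # An answer produced on this turn is accepted even if it used up the budget.
        if result.final_answer is not None:
            return _outcome(result.final_answer, "finished", iterations, tokens_spent)
        # Token budget guard: stop before paying for another turn.
        if tokens_spent > token_budget:
            return _outcome(fallback, "token_budget", iterations, tokens_spent)
    # Iteration guard: the loop ran out of turns without an answer.
    return _outcome(fallback, "max_iterations", iterations, tokens_spent)


FALLBACK = "Sorry, I could not complete this task."

finished = run_guarded_loop(
    lambda i: StepResult(100, "42" if i == 3 else None), 5, 1000, FALLBACK
)
looped = run_guarded_loop(lambda i: StepResult(100), 3, 1000, FALLBACK)
over_budget = run_guarded_loop(lambda i: StepResult(600), 5, 1000, FALLBACK)

for run in (finished, looped, over_budget):
    print(run["stopped_by"], run["iterations"], run["tokens"])
```

Returning the stop reason alongside the answer lets a harness count how often each guard fires, which is useful as an eval metric in its own right.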

Prompt Regression, Trace-Based Testing & Agent CI/CD

  • Prompt regression and snapshot testing (PromptFoo, golden datasets, versioning)
  • Trace-based testing and observability (Opik / LangSmith assertions, flaky test detection)
  • Agent CI/CD pipelines (GitHub Actions, eval gating on PRs, cost regression guards)
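
An eval gate with cost and latency regression guards can be reduced to a comparison between baseline and candidate metrics. The sketch below is an assumption-laden illustration: the metric names (`pass_rate`, `cost_usd`, `p95_latency_s`), the thresholds, and the numbers are all hypothetical, and it does not use the PromptFoo or GitHub Actions APIs.

```python
def gate(
    baseline: dict,
    candidate: dict,
    *,
    max_quality_drop: float = 0.02,      # absolute drop in pass rate allowed
    max_cost_increase: float = 0.10,     # relative increase allowed (10%)
    max_latency_increase: float = 0.10,  # relative increase allowed (10%)
) -> list[str]:
    """Return a list of human-readable violations; empty means the gate passes."""
    violations = []

    if candidate["pass_rate"] < baseline["pass_rate"] - max_quality_drop:
        violations.append(
            f"quality: pass rate fell from {baseline['pass_rate']:.2f} "
            f"to {candidate['pass_rate']:.2f}"
        )

    if candidate["cost_usd"] > baseline["cost_usd"] * (1 + max_cost_increase):
        violations.append(
            f"cost: rose from ${baseline['cost_usd']:.2f} "
            f"to ${candidate['cost_usd']:.2f}"
        )

    if candidate["p95_latency_s"] > baseline["p95_latency_s"] * (1 + max_latency_increase):
        violations.append(
            f"latency: p95 rose from {baseline['p95_latency_s']:.2f}s "
            f"to {candidate['p95_latency_s']:.2f}s"
        )

    return violations


# Illustrative numbers only: metrics from the main branch vs. the pull request.
BASELINE = {"pass_rate": 0.90, "cost_usd": 1.00, "p95_latency_s": 2.0}
CANDIDATE = {"pass_rate": 0.91, "cost_usd": 1.25, "p95_latency_s": 2.1}

violations = gate(BASELINE, CANDIDATE)
exit_code = 1 if violations else 0
for message in violations:
    print("GATE FAILED:", message)
print("exit code:", exit_code)
```

In a CI job the script would finish by passing `exit_code` to `sys.exit`, since a non-zero exit status is what causes a pipeline step to fail and block the merge. Here the candidate improves quality slightly but is blocked because cost rose by 25%.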

Key Concepts & Terminology

Eval harness, golden dataset, model-graded scoring, judge calibration and bias, execution sandbox, iteration guard, eval gate, cost/latency regression guard.

Tools & Frameworks Referenced

lm-evaluation-harness, Inspect AI, PromptFoo, Opik / LangSmith, E2B, Modal, Docker, GitHub Actions.

Prerequisites

Modules 19–20 (agents), Module 07 (evaluation basics), Module 22 (prompting).
