Project 05: EvalShip: Eval-Gated CI/CD with Auto-Rollback

Project Overview

EvalShip is the LLMOps shell that wraps every prior project. No code or prompt change reaches production without passing the eval-gated CI stages. Rollouts run blue/green, and monitoring triggers an automatic rollback when the error rate breaches its threshold.

Pipeline Stages (Hard Eval Gates)

Stage 1: Linting and unit tests.
Stage 2: Prompt regression against a golden baseline.
Stage 3: Evaluation-harness score gate per service.
Stage 4: Cost and performance regression guard.
Gate: The PR is blocked if any of Stages 1-4 fails, and a score-delta comment is posted on the PR.
Stage 5: On merge, build the container, push it, and run a rolling deploy.
Stage 6: Staging smoke tests against golden cases.
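
The PR-side gate (Stages 1-4) can be sketched as a small runner that executes every stage, blocks the merge if any stage fails, and assembles the score-delta comment. This is a minimal illustration, not the project's pipeline definition: `StageResult`, `run_pr_gate`, and the comment format are hypothetical names chosen for this example, and posting the comment to the CI platform is left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class StageResult:
    # Hypothetical result record; real CI stages would report richer data.
    name: str
    passed: bool
    score_delta: Optional[float] = None  # change versus baseline, if reported


def run_pr_gate(stages: List[Callable[[], StageResult]]) -> Tuple[bool, str]:
    """Run every PR stage and return (mergeable, comment_text)."""
    # All stages run even after a failure so the comment shows the full picture.
    results = [stage() for stage in stages]
    mergeable = all(r.passed for r in results)

    lines = []
    for r in results:
        status = "pass" if r.passed else "FAIL"
        delta = "" if r.score_delta is None else f" (score delta {r.score_delta:+.3f})"
        lines.append(f"{r.name}: {status}{delta}")
    return mergeable, "\n".join(lines)
```

Running all stages instead of stopping at the first failure costs CI time but gives the author every failing score in one comment; a fail-fast variant is equally reasonable.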

Eval Gate Thresholds

Minimum faithfulness and answer relevancy scores.
Zero prompt-snapshot regressions.
Tokens-per-request, cost, and latency percentiles within tolerance of baseline.
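
As an illustration, the three threshold rules above could be checked by a single function that returns the list of violations. The numeric values, the metric key names, and the choice of p95 as the latency percentile are all placeholder assumptions; the brief does not specify them.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GateConfig:
    # Placeholder thresholds, not taken from the project brief.
    min_faithfulness: float = 0.85
    min_answer_relevancy: float = 0.80
    max_snapshot_regressions: int = 0
    tolerance: float = 0.10  # allowed relative increase over baseline


def check_gate(
    metrics: Dict[str, float],
    baseline: Dict[str, float],
    cfg: GateConfig = GateConfig(),
) -> List[str]:
    """Return a list of violations; an empty list means the gate passes."""
    violations = []
    if metrics["faithfulness"] < cfg.min_faithfulness:
        violations.append("faithfulness below minimum")
    if metrics["answer_relevancy"] < cfg.min_answer_relevancy:
        violations.append("answer relevancy below minimum")
    if metrics["snapshot_regressions"] > cfg.max_snapshot_regressions:
        violations.append("prompt-snapshot regressions detected")

    # Resource metrics may grow by at most `tolerance` relative to baseline.
    for key in ("tokens_per_request", "cost_usd", "latency_p95_ms"):
        limit = baseline[key] * (1 + cfg.tolerance)
        if metrics[key] > limit:
            violations.append(f"{key} exceeds baseline tolerance")
    return violations
```

Returning every violation, not just the first, lets the PR comment list all failing thresholds at once.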

Stack

A CI/CD platform for pipeline orchestration.
A prompt-regression framework with a golden snapshot store and rubric-based assertions.
An agent-native evaluation framework for multi-step traces.
A standardised benchmarking harness for base model evaluation.
A RAG evaluation framework for retrieval-quality metrics.
Calibrated LLM-as-Judge with domain-specific prompts.
Multi-stage container builds for all services with per-service registries.
A container orchestration service for stateless APIs and a model-serving platform for fine-tuned models.
Blue/green deployment with gradual traffic shifting.
Cloud monitoring with auto-rollback on error-rate breach.
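
The last two items, gradual traffic shifting and rollback on an error-rate breach, can be sketched as one loop. This assumes two injected callables, `set_green_weight` and `green_error_rate`, which stand in for whatever the deployment platform and monitoring service actually expose; both names are hypothetical, and a real rollout would also wait for a soak period between steps.

```python
from typing import Callable, Iterable


def shift_traffic(
    steps: Iterable[int],
    set_green_weight: Callable[[int], None],
    green_error_rate: Callable[[], float],
    max_error_rate: float,
) -> bool:
    """Move traffic to the green environment step by step.

    Returns True if the rollout completed, False if it was rolled back.
    """
    for weight in steps:
        set_green_weight(weight)  # percentage of traffic sent to green
        # A real rollout would pause here to collect metrics before checking.
        if green_error_rate() > max_error_rate:
            set_green_weight(0)  # rollback: all traffic returns to blue
            return False
    return True
```

Because blue keeps running for the whole rollout, the rollback is only a weight change and needs no redeploy.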

Deployment & Observability

A tracing platform for per-call traces, token cost, and prompt versions.
A custom dashboard for LLM metrics.
Golden test cases committed as the regression baseline.
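
One possible shape for the committed golden cases is a JSON file that a regression check replays against the current prompt. The sketch below assumes a hypothetical schema (`id`, `input`, `must_include`) and substitutes a simple required-phrase check for the rubric-based assertions the project calls for; `generate` is a stand-in for the service under test.

```python
import json
from typing import Callable, List


def find_regressions(golden_json: str, generate: Callable[[str], str]) -> List[str]:
    """Return the ids of golden cases whose output misses a required phrase."""
    cases = json.loads(golden_json)
    failed = []
    for case in cases:
        output = generate(case["input"]).lower()
        # Simplified assertion: every required phrase must appear in the output.
        if not all(phrase.lower() in output for phrase in case["must_include"]):
            failed.append(case["id"])
    return failed
```

An empty result corresponds to zero regressions against the baseline; any returned id would fail the gate.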

Deliverables

Golden test cases per project committed to the repository.
Per-project regression configs with rubric, cost, and latency assertions.
Full CI pipeline definition with all stages.
Multi-stage container files for all project services.
All services live on the cloud with appropriate serving infrastructure.
Monitoring dashboard with LLM metrics and an auto-rollback alarm configured.
Live demo showing a prompt change triggering the eval gate and a rollback.

Prerequisites

Modules 24-26 (prompting, context, evaluation harnesses & agent CI/CD) and the four prior capstone projects.

