Project Overview
EvalShip is the LLMOps shell that wraps every prior project. No code or prompt change reaches production without passing the eval-gated CI stages; rollouts run blue/green, and monitoring triggers an automatic rollback on an error-rate breach.
Pipeline Stages (Hard Eval Gates)
- Stage 1 — Linting and unit tests.
- Stage 2 — Prompt regression against a golden baseline.
- Stage 3 — Evaluation-harness score gate per service.
- Stage 4 — Cost and performance regression guard.
- Gate — The PR is blocked if any of Stages 1 to 4 fails, and a score-delta comment is posted on the PR.
- Stage 5 — On merge: container build, push, blue/green deploy.
- Stage 6 — Staging smoke tests against golden cases.
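The pre-merge gate described above can be sketched in a few lines. This is a minimal illustration, not the project's actual pipeline definition: the names `evaluate_gate` and `format_pr_comment`, and the idea that each stage is a callable returning a boolean, are assumptions made for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: a stage is a name plus a check returning True on pass.
Stage = Tuple[str, Callable[[], bool]]


def evaluate_gate(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run every pre-merge stage and report which ones failed.

    The gate is open only if no stage failed.
    """
    failed = [name for name, check in stages if not check()]
    return len(failed) == 0, failed


def format_pr_comment(failed: List[str], score_delta: float) -> str:
    """Build the score-delta comment that would be posted on the PR."""
    status = "BLOCKED" if failed else "PASSED"
    lines = [
        f"Eval gate: {status}",
        f"Score delta vs baseline: {score_delta:+.3f}",
    ]
    lines += [f"- failed stage: {name}" for name in failed]
    return "\n".join(lines)


if __name__ == "__main__":
    # Stand-in checks; a real pipeline would invoke linters, test runners, and eval jobs.
    stages = [
        ("lint_and_unit_tests", lambda: True),
        ("prompt_regression", lambda: False),
        ("eval_score_gate", lambda: True),
        ("cost_perf_guard", lambda: True),
    ]
    gate_open, failed_stages = evaluate_gate(stages)
    print(format_pr_comment(failed_stages, score_delta=-0.042))
```

Running all stages before deciding, rather than stopping at the first failure, is a design choice here so that the PR comment can list every failing stage at once.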
Eval Gate Thresholds
- Minimum faithfulness and answer relevancy scores.
- Zero prompt-snapshot regressions.
- Tokens-per-request, cost, and latency percentiles within tolerance of baseline.
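As an illustration of how these thresholds might be checked, the sketch below compares one run's metrics against a baseline. The numeric values (0.80 minimum scores, 10% tolerance), the metric key names, and the `GateThresholds` class are placeholders assumed for the example; the source does not specify them.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GateThresholds:
    # Placeholder values, assumed for illustration only.
    min_faithfulness: float = 0.80
    min_answer_relevancy: float = 0.80
    # From the text: zero prompt-snapshot regressions are allowed.
    max_snapshot_regressions: int = 0
    # Allowed relative increase over baseline for tokens, cost, and latency.
    tolerance: float = 0.10


# Hypothetical metric names for the "within tolerance of baseline" checks.
REGRESSION_KEYS = ("tokens_per_request", "cost_usd", "latency_p95_ms")


def check_thresholds(
    metrics: Dict[str, float],
    baseline: Dict[str, float],
    thresholds: GateThresholds,
) -> List[str]:
    """Return a list of human-readable violations; an empty list means the gate passes."""
    violations = []
    if metrics["faithfulness"] < thresholds.min_faithfulness:
        violations.append("faithfulness below minimum")
    if metrics["answer_relevancy"] < thresholds.min_answer_relevancy:
        violations.append("answer relevancy below minimum")
    if metrics["snapshot_regressions"] > thresholds.max_snapshot_regressions:
        violations.append("prompt-snapshot regressions detected")
    for key in REGRESSION_KEYS:
        limit = baseline[key] * (1 + thresholds.tolerance)
        if metrics[key] > limit:
            violations.append(f"{key} exceeds baseline tolerance")
    return violations
```

A caller would block the PR whenever the returned list is non-empty and could include the list in the PR comment.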
Stack
- A CI/CD platform for pipeline orchestration.
- A prompt-regression framework with a golden snapshot store and rubric-based assertions.
- An agent-native evaluation framework for multi-step traces.
- A standardised benchmarking harness for base model evaluation.
- A RAG evaluation framework for retrieval-quality metrics.
- Calibrated LLM-as-Judge with domain-specific prompts.
- Multi-stage container builds for all services with per-service registries.
- A container orchestration service for stateless APIs and a model-serving platform for fine-tuned models.
- Blue/green deployment with gradual traffic shifting.
- Cloud monitoring with auto-rollback on error-rate breach.
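The last two items, gradual traffic shifting and auto-rollback on an error-rate breach, can be sketched together. This is a simplified model under stated assumptions: `shift_traffic`, the step percentages, and the `error_rate_for` callback are hypothetical, and a real deployment would rely on the cloud provider's own traffic-shifting and alarm features rather than a loop like this.

```python
from typing import Callable, Iterable, List, Tuple


def shift_traffic(
    steps: Iterable[int],
    error_rate_for: Callable[[int], float],
    max_error_rate: float,
) -> Tuple[int, List[int]]:
    """Move traffic from blue to green in steps, rolling back on a breach.

    steps: percentages of traffic sent to green, in increasing order.
    error_rate_for: returns the observed error rate at a given green percentage
        (a stand-in for querying a monitoring system).
    Returns (final green percentage, history of percentages applied).
    """
    history: List[int] = []
    for percent in steps:
        history.append(percent)
        if error_rate_for(percent) > max_error_rate:
            # Breach: send all traffic back to the blue environment.
            history.append(0)
            return 0, history
    final = history[-1] if history else 0
    return final, history


if __name__ == "__main__":
    # Simulated monitoring: errors spike once green receives half the traffic.
    observed = lambda pct: 0.01 if pct < 50 else 0.20
    print(shift_traffic([10, 50, 100], observed, max_error_rate=0.05))
```

Because the blue environment stays up until the shift completes, the rollback in this model is just a routing change, which is the main appeal of blue/green over an in-place update.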
Deployment & Observability
- Tracing platform for per-call traces, token cost, and prompt versions.
- A custom dashboard for LLM metrics.
- Golden test cases committed as the regression baseline.
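To show how per-call traces could feed the dashboard metrics, the sketch below groups trace records by prompt version and derives call count, total cost, and a p95 latency. The `CallTrace` fields, the per-1k-token prices passed in, and the nearest-rank percentile method are all assumptions made for this example, not the schema of any particular tracing platform.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CallTrace:
    # Hypothetical trace record; real tracing platforms define their own schema.
    prompt_version: str
    input_tokens: int
    output_tokens: int
    latency_ms: float


def nearest_rank_percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(
    traces: List[CallTrace],
    usd_per_1k_input: float,
    usd_per_1k_output: float,
) -> Dict[str, Dict[str, float]]:
    """Aggregate call count, total cost, and p95 latency per prompt version."""
    by_version: Dict[str, List[CallTrace]] = defaultdict(list)
    for trace in traces:
        by_version[trace.prompt_version].append(trace)

    summary: Dict[str, Dict[str, float]] = {}
    for version, group in by_version.items():
        cost = sum(
            t.input_tokens / 1000 * usd_per_1k_input
            + t.output_tokens / 1000 * usd_per_1k_output
            for t in group
        )
        summary[version] = {
            "calls": len(group),
            "cost_usd": cost,
            "latency_p95_ms": nearest_rank_percentile(
                [t.latency_ms for t in group], 95
            ),
        }
    return summary
```

Keying the summary on prompt version makes it possible to compare a new prompt against the previous one on cost and latency before and after a rollout.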
Deliverables
- Golden test cases per project committed to the repository.
- Per-project regression configs with rubric, cost, and latency assertions.
- Full CI pipeline definition with all stages.
- Multi-stage container files for all project services.
- All services live on the cloud: stateless APIs on the container orchestration service and fine-tuned models on the model-serving platform.
- Monitoring dashboard with LLM metrics and an auto-rollback alarm configured.
- Live demo showing a prompt change triggering the eval gate and a rollback.
Prerequisites
Modules 24–26 (prompting, context, evaluation harnesses & agent CI/CD) and the four prior capstone projects.