The Evaluator Framework — scientific rigor for evaluating AI agent output.
Define rubrics. Judge with LLMs. Calibrate against humans. Track over time.
Infographic-builder, presentation-builder, nanoppt, diagram-beautifier — each independently rebuilt rubrics, scenarios, judging logic, scorecards, and dashboards. None survived past their original project.
Every time a new model ships — GPT-5, Claude 4-7, Gemini 3.1 — the question “is it actually better?” required hand-rolling a comparison harness from scratch.
Without shared methodology, teams evaluated quality by intuition. No statistical rigor, no longitudinal tracking, no reproducible methodology.
When human quality assessment and eval scores disagree, the eval is wrong. Fix the rubric, not the scores.
Sourced from Eugene Yan, Hamel Husain, Anthropic, and Braintrust research. Enforced by every judge agent invocation.
The judge writes its analysis BEFORE committing to a numeric score. This ordering improves Spearman ρ with human judgments from 0.51 to 0.66.
Judge model must be from a different family than the generator. Same-family judging shows ~5–7% self-enhancement bias. Never grade your own homework.
Score each rubric dimension independently. No cross-contamination, no premature averaging. Aggregate only at the final step.
Every score must cite specific positive and negative evidence from the output. No score without a receipt.
Verdicts are emitted in a structured, machine-readable format: reliable parsing, no ambiguity. Downstream tooling (dashboards, scorecards, experiments) consumes them directly.
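The principles above can be illustrated with a minimal sketch of a verdict parser. It assumes the judge emits JSON, and every field name (`dimensions`, `justification`, `positive_evidence`, `negative_evidence`, `score`) is hypothetical rather than the bundle's actual schema. The sketch checks that the justification precedes the score, rejects scores with no cited evidence, keeps dimensions separate, and aggregates only at the end.

```python
import json
from dataclasses import dataclass


@dataclass
class DimensionVerdict:
    # Hypothetical schema; the real bundle may use different field names.
    dimension: str
    justification: str
    positive_evidence: list[str]
    negative_evidence: list[str]
    score: int


def parse_verdict(raw: str) -> list[DimensionVerdict]:
    """Parse a judge response and enforce the ordering and evidence rules."""
    payload = json.loads(raw)
    verdicts = []
    for item in payload["dimensions"]:
        keys = list(item.keys())  # json.loads preserves the emitted key order
        if keys.index("justification") > keys.index("score"):
            raise ValueError("justification must be written before the score")
        # Assumption: at least one piece of evidence is required per score.
        if not item["positive_evidence"] and not item["negative_evidence"]:
            raise ValueError("no score without a receipt")
        verdicts.append(DimensionVerdict(**item))
    return verdicts


def aggregate(verdicts: list[DimensionVerdict], weights: dict[str, float]) -> float:
    """Weighted mean across dimensions, computed only as the final step."""
    total_weight = sum(weights[v.dimension] for v in verdicts)
    return sum(v.score * weights[v.dimension] for v in verdicts) / total_weight
```

Keeping `aggregate` separate from `parse_verdict` means per-dimension scores stay available to dashboards and calibration before any averaging happens.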
The define-evals and run-evals recipes orchestrate 9 specialized agents through a staged pipeline with approval gates.
repo-scanner reads the project — bundle config, agents, tests, commits — to understand its shape and failure modes
rubric-designer proposes orthogonal dimensions, score-level definitions, weights, and verdict thresholds
scenario-generator builds a coverage matrix: 60% representative, 25% edge cases, 15% adversarial
judge agent scores outputs against the rubric using a single-judge, multi-jury, or adversarial-critics strategy
reporter writes scorecard CSV, generates a 4-view HTML dashboard, and tracks longitudinal trends
All state lives in your project's eval/ directory — rubric.yaml, scenarios, rounds, scorecard.csv, dashboard.html. Nothing is stored inside the evaluator bundle itself. Everything is committable to git.
One model scores all dimensions with justification-before-score. Lowest cost, good signal. The default strategy.
3–5 models score independently. Aggregation via majority vote, mean + variance flagging, or weighted consensus. High-variance dimensions are flagged for review.
Three specialized critics — fact-checker, consistency-auditor, quality-assessor — attack from different angles. A synthesizer maps findings back to rubric dimensions.
Direct A/B comparison between two variants with position-bias control. Both orderings are evaluated. Produces win-rate tables by archetype and dimension.
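For the multi-jury strategy, mean plus variance flagging might look like the following sketch. The function name, the input shape, and the variance threshold of 1.0 are all assumptions; the source does not specify how a dimension qualifies as high-variance.

```python
from statistics import mean, pvariance


def aggregate_jury(scores_by_dimension: dict[str, list[int]],
                   variance_threshold: float = 1.0) -> dict[str, dict]:
    """Combine independent juror scores per dimension and flag disagreement."""
    summary = {}
    for dimension, scores in scores_by_dimension.items():
        variance = pvariance(scores)  # population variance across jurors
        summary[dimension] = {
            "mean": mean(scores),
            "variance": variance,
            # Hypothetical rule: flag for human review above the threshold.
            "needs_review": variance > variance_threshold,
        }
    return summary
```

Three jurors scoring a dimension 1, 3, and 5 would average 3 but be flagged for review, whereas scores of 4, 4, and 5 would pass unflagged.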
| Known Bias | Magnitude | Mitigation |
|---|---|---|
| Position bias | ~40% inconsistency | Evaluate both orderings in pairwise |
| Verbosity bias | ~15% inflation | Reward conciseness in rubric |
| Self-enhancement | 5–7% boost | Cross-family judges (always) |
| Authority bias | Varies | Anonymize provenance |
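The position-bias mitigation in the table can be sketched as follows: run the pairwise judge in both orderings and accept a winner only when the two runs agree. The `judge` callable and its `"first"`/`"second"`/`"tie"` return values are a hypothetical interface, not the bundle's actual one.

```python
from typing import Callable

# Hypothetical judge interface: takes the two outputs in display order
# and returns "first", "second", or "tie".
Judge = Callable[[str, str], str]


def pairwise_verdict(judge: Judge, output_a: str, output_b: str) -> str:
    """Return "A", "B", "tie", or "inconsistent" after judging both orderings."""
    forward = judge(output_a, output_b)   # A is shown first
    backward = judge(output_b, output_a)  # B is shown first
    forward_winner = {"first": "A", "second": "B", "tie": "tie"}[forward]
    backward_winner = {"first": "B", "second": "A", "tie": "tie"}[backward]
    # A judge that simply favors a position disagrees with itself here.
    return forward_winner if forward_winner == backward_winner else "inconsistent"
```

A judge that always prefers whichever output it sees first produces `"inconsistent"` for every pair, which is the signal that its preference reflects position rather than quality.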
The calibration loop ensures the judge's scores actually match human judgment — not just statistically, but on the cases that matter.
The calibration-helper presents scenarios with judge scores hidden. The human scores each dimension using the rubric's level descriptors. No contamination from the judge's reasoning.
The calibration-analyzer computes Spearman ρ and Cohen's κ per dimension. Verdicts: CALIBRATED (ρ > 0.7), DRIFTING, or MISCALIBRATED. When miscalibrated, structured guidance is emitted for the rubric-designer to refine.
Target correlation for ranking agreement between human and judge scores
Target agreement for absolute scoring on individual dimensions
Minimum human-judge agreement rate across all scored scenarios
When a new model drops, don't guess. Declare a manifest, run a matrix, see results side-by-side.
Rows = scenarios, columns = variants, cells = score + verdict + thumbnail. Hover any cell to see the SUT prompt that produced it. Click to drill into per-dimension justifications.
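The scenario-by-variant layout can be sketched as a simple pivot over flat result records. The record keys (`scenario`, `variant`, `score`, `verdict`) are hypothetical and do not reflect the actual scorecard.csv columns.

```python
from collections import defaultdict


def build_matrix(results: list[dict]) -> dict[str, dict[str, tuple]]:
    """Pivot flat results into rows = scenarios, columns = variants."""
    matrix: dict[str, dict[str, tuple]] = defaultdict(dict)
    for record in results:
        # Each cell holds the score and verdict for one scenario/variant pair.
        matrix[record["scenario"]][record["variant"]] = (record["score"], record["verdict"])
    return dict(matrix)
```

Looking up `matrix[scenario][variant]` then gives the cell contents to render side by side.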
Reads the scorecard, identifies plateaus, high-variance dimensions, and coverage gaps. Proposes 3–7 ranked experiments by information gain. Proposes only — never executes.
The future improvement-loop bundle will call recipes/api/evaluate.yaml as a stable composition seam — ratchet loops, convergence detection, and automatic SUT mutation are explicitly out of scope for this bundle. Clean boundary by design.
Created May 7, 2026; v1.1 complete by May 13, 2026. Single contributor. Python + Amplifier recipes. Replaces evaluation logic previously rebuilt from scratch across 4+ projects.
Primary source: singh2/amplifier-bundle-evaluator repository — README.md, bundle.md, all 22 component YAML definitions in team-knowledge, GitHub API metadata.
gh api repos/singh2/amplifier-bundle-evaluator/commits --paginate: 51 commits, single contributor (singh2), created 2026-05-07, last pushed 2026-05-13
~/.amplifier/team-knowledge/: 9 agents, 8 recipes, 4 strategy recipes, 1 behavior, enumerated by type field
context/methodology.md (referenced in README), sourcing Eugene Yan, Hamel Husain, Anthropic, and Braintrust research
repos/singh2/amplifier-bundle-evaluator
Presentation category: Architecture & Philosophy
Accent color: #FF9800 (orange — analysis/measurement)
Template: amplifier-bundle-gitea-story.html design system
Date of research: May 2026
No claims in this deck are inferred or estimated. Every figure traces to a specific source file, API response, or documentation passage.