Architecture & Philosophy · Amplifier Bundle

How Do You Know It's Good?

The Evaluator Framework — scientific rigor for evaluating AI agent output.
Define rubrics. Judge with LLMs. Calibrate against humans. Track over time.

v1.1 · MIT License
May 2026 · singh2/amplifier-bundle-evaluator
The Problem

Every project reinvents eval

🔄

Repeated Rebuilds

Infographic-builder, presentation-builder, nanoppt, diagram-beautifier — each independently rebuilt rubrics, scenarios, judging logic, scorecards, and dashboards. None survived past their original project.

💨

Model Drops, No Signal

Every time a new model ships — GPT-5, Claude 4-7, Gemini 3.1 — the question “is it actually better?” required hand-rolling a comparison harness from scratch.

🎯

Vibes, Not Data

Without shared methodology, teams evaluated quality by intuition. No statistical rigor, no longitudinal tracking, no reproducible methodology.

When human quality assessment and eval scores disagree, the eval is wrong. Fix the rubric, not the scores.

— Core methodology principle, Evaluator Framework
Methodology

Five non-negotiable principles

Sourced from Eugene Yan, Hamel Husain, Anthropic, and Braintrust research. Enforced by every judge agent invocation.

1. Justification Before Score

The judge writes its analysis BEFORE committing to a numeric score. This ordering improves Spearman ρ against human judgments from 0.51 to 0.66.

2. Adversarial Model Selection

Judge model must be from a different family than the generator. Same-family judging shows ~5–7% self-enhancement bias. Never grade your own homework.

3. One Dimension at a Time

Score each rubric dimension independently. No cross-contamination, no premature averaging. Aggregate only at the final step.

4. Evidence-Based Scoring

Every score must cite specific positive and negative evidence from the output. No score without a receipt.

5. Structured JSON Output

Reliable parsing, no ambiguity. Downstream tooling — dashboards, scorecards, experiments — consumes machine-readable verdicts.
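As a minimal sketch, the principles above could be checked mechanically when a judge response is parsed. The JSON field names (`dimensions`, `justification`, `positive_evidence`, `negative_evidence`, `score`) and both function names are assumptions made for illustration, not the bundle's actual schema. This sketch accepts a score with at least one cited item; a stricter reading would require both evidence lists to be non-empty.

```python
import json
from dataclasses import dataclass


@dataclass
class DimensionVerdict:
    # Hypothetical verdict shape; the real schema may differ.
    dimension: str
    justification: str
    positive_evidence: list[str]
    negative_evidence: list[str]
    score: int


def parse_verdict(raw: str) -> list[DimensionVerdict]:
    """Parse a judge response and reject entries that break the principles."""
    verdicts = []
    for item in json.loads(raw)["dimensions"]:
        keys = list(item)  # json.loads keeps the key order the judge wrote
        # Principle 1: the justification must be written before the score.
        if keys.index("justification") > keys.index("score"):
            raise ValueError(f"{item['dimension']}: score precedes justification")
        # Principle 4: no score without cited evidence.
        if not (item["positive_evidence"] or item["negative_evidence"]):
            raise ValueError(f"{item['dimension']}: score has no evidence")
        verdicts.append(DimensionVerdict(**item))
    return verdicts


def weighted_total(verdicts: list[DimensionVerdict], weights: dict[str, float]) -> float:
    """Principle 3: dimensions are combined only at this final step."""
    total_weight = sum(weights[v.dimension] for v in verdicts)
    return sum(v.score * weights[v.dimension] for v in verdicts) / total_weight
```

Keeping the aggregate in its own function means each dimension's score and evidence stay inspectable until the very end.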

Pipeline

From repo to dashboard in five stages

The define-evals and run-evals recipes orchestrate 9 specialized agents through a staged pipeline with approval gates.

🔍

Scan

repo-scanner reads the project — bundle config, agents, tests, commits — to understand its shape and failure modes

📏

Design

rubric-designer proposes orthogonal dimensions, score-level definitions, weights, and verdict thresholds

🧪

Generate

scenario-generator builds a coverage matrix: 60% representative, 25% edge cases, 15% adversarial

⚖️

Judge

judge agent scores outputs against the rubric using single, multi-jury, or adversarial-critics strategy

📊

Report

reporter writes scorecard CSV, generates a 4-view HTML dashboard, and tracks longitudinal trends

All state lives in your project's eval/ directory — rubric.yaml, scenarios, rounds, scorecard.csv, dashboard.html. Nothing is stored inside the evaluator bundle itself. Everything is committable to git.
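To make the 60% / 25% / 15% coverage mix from the Generate stage concrete, here is a small sketch that turns it into whole scenario counts. The function name, the category keys, and the largest-remainder rounding rule are assumptions for illustration; the source does not say how scenario-generator rounds.

```python
def coverage_counts(total: int) -> dict[str, int]:
    """Split `total` scenarios 60/25/15 using largest-remainder rounding."""
    shares = {"representative": 60, "edge_case": 25, "adversarial": 15}
    # Integer arithmetic avoids floating-point rounding surprises.
    counts = {name: total * pct // 100 for name, pct in shares.items()}
    remainders = {name: total * pct % 100 for name, pct in shares.items()}
    leftover = total - sum(counts.values())
    # Give the leftover slots to the categories with the largest remainders.
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[name] += 1
    return counts
```

With 20 scenarios this yields 12 representative, 5 edge-case, and 3 adversarial; for totals that do not divide evenly, the counts still sum to the requested total.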

Judge Strategies

Four ways to judge, one contract

Single Judge

One model scores all dimensions with justification-before-score. Lowest cost, good signal. The default strategy.

Multi-Model Jury

3–5 models score independently. Aggregation via majority vote, mean + variance flagging, or weighted consensus. High-variance dimensions are flagged for review.
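A rough sketch of the "mean + variance flagging" aggregation, assuming each juror returns one integer score per dimension. The function name, the input shape, and the variance threshold of 1.0 are hypothetical; the source does not state the actual cutoff.

```python
from statistics import mean, pvariance


def aggregate_jury(scores_by_judge: dict[str, dict[str, int]],
                   variance_threshold: float = 1.0) -> dict[str, dict]:
    """Average each dimension across jurors and flag high disagreement."""
    dimensions = next(iter(scores_by_judge.values())).keys()
    result = {}
    for dim in dimensions:
        scores = [judge_scores[dim] for judge_scores in scores_by_judge.values()]
        variance = pvariance(scores)  # population variance across the jurors
        result[dim] = {
            "mean": mean(scores),
            "variance": variance,
            "needs_review": variance > variance_threshold,
        }
    return result
```

Dimensions are aggregated separately, so one contested dimension is flagged without hiding agreement on the others.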

Adversarial Critics

Three specialized critics — fact-checker, consistency-auditor, quality-assessor — attack from different angles. A synthesizer maps findings back to rubric dimensions.

Pairwise Head-to-Head

Direct A/B comparison between two variants with position-bias control. Both orderings are evaluated. Produces win-rate tables by archetype and dimension.
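Position-bias control can be sketched as running the same comparison in both orderings and only counting a win when the two runs agree. The `judge` callable, its "first"/"second" return values, and the "inconsistent" outcome are assumptions for illustration; the source does not say how disagreeing orderings are recorded.

```python
from typing import Callable


def pairwise_verdict(a: str, b: str, judge: Callable[[str, str], str]) -> str:
    """Compare outputs A and B in both orderings.

    `judge(first, second)` is a hypothetical callable that returns
    "first" or "second" for whichever output it prefers.
    """
    a_wins_forward = judge(a, b) == "first"    # A shown first
    a_wins_reverse = judge(b, a) == "second"   # A shown second
    if a_wins_forward and a_wins_reverse:
        return "A"
    if not a_wins_forward and not a_wins_reverse:
        return "B"
    return "inconsistent"  # the preference flipped with the ordering


def win_rate(verdicts: list[str]) -> float:
    """Share of consistent comparisons won by A (0.0 if none were consistent)."""
    decided = [v for v in verdicts if v != "inconsistent"]
    return sum(v == "A" for v in decided) / len(decided) if decided else 0.0
```

A judge that always prefers whichever output it sees first produces "inconsistent" here rather than a spurious win.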

Known Bias          Magnitude             Mitigation
Position bias       ~40% inconsistency    Evaluate both orderings in pairwise
Verbosity bias      ~15% inflation        Reward conciseness in rubric
Self-enhancement    5–7% boost            Cross-family judges (always)
Authority bias      Varies                Anonymize provenance

Calibration

Trust, but verify

The calibration loop ensures the judge's scores actually match human judgment — not just statistically, but on the cases that matter.

Blind Review

The calibration-helper presents scenarios with judge scores hidden. The human scores each dimension using the rubric's level descriptors. No contamination from the judge's reasoning.

Statistical Verification

The calibration-analyzer computes Spearman ρ and Cohen's κ per dimension. Verdicts: CALIBRATED (ρ > 0.7), DRIFTING, or MISCALIBRATED. When miscalibrated, structured guidance is emitted for the rubric-designer to refine.

Spearman ρ > 0.7

Target correlation for ranking agreement between human and judge scores

Cohen's κ > 0.7

Target agreement for absolute scoring on individual dimensions

> 75% Agreement

Minimum human-judge agreement rate across all scored scenarios
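For reference, both calibration statistics can be computed with the standard library alone. This is a sketch of the textbook definitions (Spearman ρ as Pearson correlation on average ranks, unweighted Cohen's κ), not the calibration-analyzer's actual code; all function names are hypothetical. Both functions divide by zero if a rater gives every scenario the same score, so real code would need to guard that case.

```python
from collections import Counter
from math import sqrt


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(human: list[float], judge: list[float]) -> float:
    """Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(human), average_ranks(judge)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    return cov / sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))


def cohens_kappa(human: list[int], judge: list[int]) -> float:
    """Agreement on exact scores, corrected for chance agreement."""
    n = len(human)
    observed = sum(a == b for a, b in zip(human, judge)) / n
    ch, cj = Counter(human), Counter(judge)
    expected = sum(ch[label] * cj[label] for label in ch) / (n * n)
    return (observed - expected) / (1 - expected)


def meets_targets(rho: float, kappa: float) -> bool:
    """Both targets stated above: rho > 0.7 and kappa > 0.7."""
    return rho > 0.7 and kappa > 0.7
```

ρ rewards getting the ordering right even when absolute scores are shifted, while κ only credits exact matches, which is why the two targets are tracked separately.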

Experiments

Model matrix — data, not vibes

When a new model drops, don't guess. Declare a manifest, run a matrix, see results side-by-side.

# eval/experiments/2026-05-claude-vs-gpt.yaml
id: 2026-05-claude-vs-gpt
hypothesis: "Does GPT-5.5 beat Claude on visual_quality?"
variants:
  - id: claude-baseline
    orchestrator_model: claude-opus-4-7
    generator_model: nano-banana
  - id: gpt55-variant
    orchestrator_model: gpt-5.5
    generator_model: gpt-image-2
adversarial: true  # judge family != generator family
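The "judge family != generator family" rule in the manifest could be checked before a run along these lines. This is a sketch under several assumptions: the manifest is already loaded into a dict, the judge model is passed in separately (the manifest shown does not name one), and the name-prefix-to-family table is invented for illustration. A model whose family cannot be resolved is treated as a violation, since the rule cannot be verified for it.

```python
# Hypothetical prefix-to-family table; extend it for the models you use.
FAMILY_PREFIXES = {
    "claude": "anthropic",
    "gpt": "openai",
    "gemini": "google",
    "nano-banana": "google",
}


def model_family(model: str) -> str:
    """Resolve a model name to a family by prefix, or "unknown"."""
    for prefix, family in FAMILY_PREFIXES.items():
        if model.startswith(prefix):
            return family
    return "unknown"


def adversarial_violations(manifest: dict, judge_model: str) -> list[str]:
    """Return the ids of variants whose generator shares the judge's family."""
    if not manifest.get("adversarial", False):
        return []  # the manifest did not ask for cross-family judging
    judge_family = model_family(judge_model)
    violations = []
    for variant in manifest["variants"]:
        generator_family = model_family(variant["generator_model"])
        unresolved = "unknown" in (judge_family, generator_family)
        if unresolved or judge_family == generator_family:
            violations.append(variant["id"])
    return violations
```

With the manifest above, judging with a `claude-` model passes both variants, while judging with a `gpt-` model flags `gpt55-variant`.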

Variant Matrix Dashboard

Rows = scenarios, columns = variants, cells = score + verdict + thumbnail. Hover any cell to see the system-under-test (SUT) prompt that produced it. Click to drill into per-dimension justifications.

Experiment Proposer

Reads the scorecard, identifies plateaus, high-variance dimensions, and coverage gaps. Proposes 3–7 ranked experiments by information gain. Proposes only — never executes.

Architecture

22 components, one coherent system

9 Agents

  • repo-scanner
  • rubric-designer
  • scenario-generator
  • judge
  • reporter
  • calibration-helper
  • calibration-analyzer
  • experiment-proposer
  • takeaway-synthesizer

8 Recipes

  • define-evals (3 approval gates)
  • run-evals (routine rounds)
  • run-experiment (variant matrix)
  • run-head-to-head (pairwise)
  • calibrate-rubric
  • api/evaluate (stable facade)
  • run-sut-recipe (internal)

4 Strategies

  • single-judge
  • multi-jury (3–5 models)
  • adversarial-critics (3 roles)
  • pairwise (position-bias control)

+ Composition

  • 1 behavior
  • 2 skills (methodology, rubrics)
  • Stable v1.x facade contract

The future improvement-loop bundle will call recipes/api/evaluate.yaml as a stable composition seam — ratchet loops, convergence detection, and automatic SUT mutation are explicitly out of scope for this bundle. Clean boundary by design.

Velocity

Built in six days

51 Commits
9 Agents
22 Components
4 Judge Strategies

Created May 7 — v1.1 complete by May 13, 2026. Single contributor. Python + Amplifier recipes. Replaces evaluation logic previously rebuilt from scratch across 4+ projects.

singh2/amplifier-bundle-evaluator · Python 3.11+ · MIT License
Sources & Methodology

How this deck was made

Primary source: singh2/amplifier-bundle-evaluator repository — README.md, bundle.md, all 22 component YAML definitions in team-knowledge, GitHub API metadata.

Presentation category: Architecture & Philosophy
Accent color: #FF9800 (orange — analysis/measurement)
Template: amplifier-bundle-gitea-story.html design system
Date of research: May 2026
No claims in this deck are inferred or estimated. Every figure traces to a specific source file, API response, or documentation passage.
