Vibes With a Number

Calibrated LLM judges for agent quality

The Floor Has Moved

Modern agents rarely break — they produce plausible output that ticks the obvious boxes

When broken output is rare, "did it run?" no longer tells you anything. The hard question is now "is it actually good?" — and that is what an evaluation framework has to answer.

And the obvious move is to hand an LLM a rubric and let it score.

"They produce plausible output that ticks the obvious boxes." (rubric-design.md, "The Floor Has Moved")

The Self-Diagnosis

The framework itself names calibration as the defining weakness of a model judge

Model-based graders are listed as non-deterministic and more expensive than code — but the weakness that matters is trust. An uncalibrated LLM judge is just vibes with a number on it.

So which project actually fixes that? Meet the framework.

"Requires calibration with human graders for accuracy." (Model-based graders: Weaknesses, demystifying-evals-for-ai-agents.md)

The Project

The Evaluator Framework is microsoft/amplifier-bundle-evaluation

"A one-stop-shop for evaluating AI agents, bundles, and recipes across the Amplifier ecosystem" — an evaluation mode plus a Python harness (amplifier_evaluation) that scores agents scientifically instead of by vibes.

Its first building block: a rubric that replaces gut feel with structure.

3-phase: auditor grader session

Active: internal / experimental; last pushed 2026-06-19

Mechanism · Rubric

A rubric is a set of named, weighted, observable criteria, not an opinion

Each criterion has a short name, a point value (its relative weight), and an observable question. Discrimination comes entirely from which criteria you choose and how you weight them.

But structure alone isn't enough — who scores it, and can they cheat?

score = points_awarded / points_possible (per criterion)
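As a minimal sketch of the rubric structure described above, the Python below models a criterion as a name, a point value, and an observable question, then computes points_awarded / points_possible. The class name, function names, and example criteria are hypothetical and chosen for illustration; how amplifier_evaluation actually aggregates per-criterion scores into an overall score is an assumption here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric row (hypothetical shape, not the bundle's own type)."""
    name: str       # short name
    points: float   # maximum points, which doubles as the relative weight
    question: str   # observable question the grader must answer


def criterion_score(criterion: Criterion, awarded: float) -> float:
    """points_awarded / points_possible for a single criterion."""
    return awarded / criterion.points


def rubric_score(rubric: list[Criterion], awarded: dict[str, float]) -> float:
    """Assumed aggregation: total points awarded over total points possible."""
    possible = sum(c.points for c in rubric)
    if possible <= 0:
        raise ValueError("rubric must have positive total points")
    earned = sum(awarded.get(c.name, 0.0) for c in rubric)
    return earned / possible


# Illustrative criteria only; the weights encode what matters most.
rubric = [
    Criterion("tests_pass", 40, "Does the test suite pass when run?"),
    Criterion("handles_errors", 35, "Does invalid input produce a clear error?"),
    Criterion("readme_accurate", 25, "Do the README commands work as written?"),
]
awarded = {"tests_pass": 40, "handles_errors": 20, "readme_accurate": 15}

print(f"{rubric_score(rubric, awarded):.2f}")  # prints 0.75
```

Because the point values are the weights, moving points between criteria changes the score without changing any individual judgment, which is why the choice and weighting of criteria carries the discrimination.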

Mechanism · Grader

The Grader is an impartial auditor that can't invent numbers or touch the work

One multi-turn auditor session explores the sandboxed Digital Twin Universe, then submits a structured rubric. Each criterion needs a point value in [0, max] plus a non-empty reasoning string — no free-floating numbers.
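The submission rule above (a point value in [0, max] plus a non-empty reasoning string for every criterion) can be sketched as a small validator. This is an illustration under assumptions, not the check in grader.py: the function name, the dictionary layout, and the error messages are all hypothetical.

```python
def validate_submission(
    max_points: dict[str, float],
    submission: dict[str, dict],
) -> list[str]:
    """Return a list of problems; an empty list means the submission is acceptable.

    max_points maps criterion name -> maximum points (hypothetical layout).
    submission maps criterion name -> {"points": number, "reasoning": str}.
    """
    errors: list[str] = []
    for name, limit in max_points.items():
        entry = submission.get(name)
        if entry is None:
            errors.append(f"{name}: missing")
            continue
        points = entry.get("points")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(points, bool) or not isinstance(points, (int, float)):
            errors.append(f"{name}: points must be a number")
        elif not 0 <= points <= limit:
            errors.append(f"{name}: points {points} outside [0, {limit}]")
        reasoning = entry.get("reasoning")
        if not isinstance(reasoning, str) or not reasoning.strip():
            errors.append(f"{name}: reasoning must be a non-empty string")
    # Scores for criteria that are not in the rubric are free-floating numbers.
    for name in sorted(submission.keys() - max_points.keys()):
        errors.append(f"{name}: not in rubric")
    return errors


max_points = {"tests_pass": 40, "handles_errors": 35}
good = {
    "tests_pass": {"points": 40, "reasoning": "All tests passed on a clean run."},
    "handles_errors": {"points": 20, "reasoning": "Empty input raised a bare KeyError."},
}
bad = {"tests_pass": {"points": 55, "reasoning": "  "}}

print(validate_submission(max_points, good))  # prints []
for problem in validate_submission(max_points, bad):
    print(problem)
```

A rejected submission would presumably be sent back to the grader for another attempt; the retry policy itself is not modeled here.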

Then how do we know the judge itself is right?

"You must NEVER modify the agent's code or files. Changing its output is like a teacher changing a student's exam." (grader.py, SYSTEM_INSTRUCTION; max 2 retries)

The Turn

Human judgment is the gold standard used to calibrate the judge

The framework prescribes closely calibrating LLM-as-judge graders with human experts, so there is little divergence between human and model grading. When a person and the model disagree, the divergence points at the rubric — not the verdict.

Which flips the fix on its head.

"Gold standard quality · Matches expert user judgment · Used to calibrate model-based graders." (Human graders: Strengths, demystifying-evals-for-ai-agents.md)
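One way to make "the divergence points at the rubric" concrete is to compare human and model scores criterion by criterion and rank the criteria by disagreement. The sketch below is hypothetical throughout: the function name, the data layout, and the choice of mean absolute difference as the measure are assumptions, and no such agreement metric is reported for the repo.

```python
def divergence_by_criterion(
    max_points: dict[str, float],
    human: list[dict[str, float]],
    model: list[dict[str, float]],
) -> dict[str, float]:
    """Mean absolute human-vs-model gap per criterion, as a fraction of its max points.

    human[i] and model[i] are the two gradings of the same output i.
    The result is ordered from most to least divergent criterion.
    """
    if not human or len(human) != len(model):
        raise ValueError("need the same non-zero number of human and model gradings")
    result = {}
    for name, limit in max_points.items():
        diffs = [abs(h[name] - m[name]) / limit for h, m in zip(human, model)]
        result[name] = sum(diffs) / len(diffs)
    return dict(sorted(result.items(), key=lambda kv: kv[1], reverse=True))


# Made-up scores for two outputs, graded once by a person and once by the model.
max_points = {"correctness": 50, "clarity": 20}
human = [{"correctness": 40, "clarity": 10}, {"correctness": 25, "clarity": 5}]
model = [{"correctness": 45, "clarity": 18}, {"correctness": 25, "clarity": 15}]

for name, gap in divergence_by_criterion(max_points, human, model).items():
    print(f"{name}: {gap:.2f}")
```

In this made-up data the model is consistently more generous on clarity, which suggests the clarity question is too loose to pin the grader down, so that criterion is the one to rewrite.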

The Payoff

Fix the rubric, not the score

Calibrate against two example outputs: one genuinely high-quality, one "competent slop." If the score gap is under 30 points, the rubric isn't discriminating — find the criterion the slop is over-credited on and tighten it, then re-walk until the gap reflects real quality.

Do that once and it becomes infrastructure everyone inherits.

Hold one high-quality and one "competent slop" output in mind.
If the gap is < 30 points, the rubric is not discriminating.
Find and tighten the criterion the slop is over-credited on.
Re-walk until the gap reflects the actual quality difference.
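The calibration walk above can be sketched as a small check. Two assumptions are labeled here and in the code: the 30-point threshold is read as points on a 0 to 100 scale, and "over-credited" is approximated by ranking criteria by the share of available points the slop output kept. Deciding which criterion is really at fault remains a human judgment; the function and field names are hypothetical.

```python
def calibration_gap(
    max_points: dict[str, float],
    high: dict[str, float],
    slop: dict[str, float],
    min_gap: float = 30.0,
) -> dict:
    """Compare a high-quality and a 'competent slop' grading under one rubric."""
    possible = sum(max_points.values())

    def to_100(awarded: dict[str, float]) -> float:
        # Assumption: the 30-point rule refers to a 0-100 scale.
        return 100.0 * sum(awarded[name] for name in max_points) / possible

    gap = to_100(high) - to_100(slop)
    # Heuristic: criteria where slop keeps the largest share of the points
    # are the first candidates to tighten.
    suspects = sorted(
        max_points,
        key=lambda name: slop[name] / max_points[name],
        reverse=True,
    )
    return {"gap": gap, "discriminating": gap >= min_gap, "suspects": suspects}


# Made-up example scores for the two calibration outputs.
max_points = {"correctness": 50, "depth": 30, "clarity": 20}
high = {"correctness": 45, "depth": 27, "clarity": 18}
slop = {"correctness": 40, "depth": 12, "clarity": 18}

report = calibration_gap(max_points, high, slop)
print(report["gap"], report["discriminating"], report["suspects"][0])
# prints 20.0 False clarity
```

After tightening the suspect criterion, re-grade both outputs and run the check again; the loop ends when the gap reflects the quality difference a human reviewer sees.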

What You Keep

Calibrated evaluation becomes reusable infrastructure

A shared benchmark, an evaluation mode, and a CLI mean the next project inherits trustworthy, calibrated scoring instead of hand-rolling its own rubrics and judging by vibes.

Results must be trusted — "there can be no workarounds."

27 benchmark tasks (amplifier-benchmark)

2 reference agents: amplifier-foundation, openai-codex-cli

Sources

Research Methodology

Feature status: Active · internal / experimental

Data as of: repo last pushed 2026-06-19 (pushedAt 2026-06-19T14:43:29Z)

Primary source: microsoft/amplifier-bundle-evaluation (created 2026-05-04). The calibrated rubric/judge subject maps to this bundle — the similarly named microsoft/amplifier-eval-harness explicitly does no quality scoring.

Research performed:

Repo & timeline: gh repo view microsoft/amplifier-bundle-evaluation --json createdAt,pushedAt; gh api repos/microsoft/amplifier-bundle-evaluation/commits
Grader behavior: sed -n '1,90p' src/amplifier_evaluation/grader/grader.py; grep -nE 'total_weight|overall|score =' grader.py
Rubric method: sed -n '20,30p;94,104p' context/methodology/rubric-design.md
Grader taxonomy: sed -n '73,81p;249p' context/deep_dives/demystifying-evals-for-ai-agents.md
Benchmark size: ls amplifier-benchmark/tasks | wc -l (27); ls amplifier-benchmark/agents (2)
No-scoring sibling: grep -n 'No quality scoring' amplifier-eval-harness/README.md

Primary contributors: David Koleczek (DavidKoleczek / DavidKoleczekMSFT) — grader, extractor, harness, CLI, benchmark, rubric methodology, most commits; Brian Krabach (bkrabach) — zero-cost install anchor refactor (#14).

Gaps: No measured judge-vs-human agreement metric exists in the repo — the calibration story is grounded in methodology docs and code behavior, not a specific agreement percentage. "Vibes" is narrative framing (paraphrase), not a repo quote. The grader-taxonomy tables are vendored from an external Anthropic engineering article (anthropic.com, Jan 09 2026) that the bundle adopts as context.