Prove It's Done

Flywheel's evidence-driven methodology for AI agents

The Reframe

Flywheel swaps "did you do the thing?" for "can you prove the thing works?"

Flywheel is an outcome-driven development methodology for AI agents (kenotron-ms/flywheel, v0.1.0). It replaces activity-based development with evidence loops: every task defines what "done" looks like before execution starts, and execution isn't complete until evidence closes the loop.

So why isn't finishing the activity enough? Because the activity can lie.

Activity-based: "Did you do the thing?" Files, functions, green tests.
Evidence loops: "Can you prove the thing works?" Evidence closes the loop.
No TDD: Theory of Success instead, a Goldilocks proof per task.
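To make "define done before execution starts" concrete, here is a minimal sketch of a task record. The `Task` class, its field names, and the localhost URL are hypothetical illustrations, not part of Flywheel itself; only the idea that a task carries a Theory of Success and a proof action comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    """Hypothetical task record: 'done' is defined before execution starts."""
    description: str
    theory_of_success: str          # what "done" looks like, stated up front
    proof_action: str               # the check expected to produce evidence
    evidence: Optional[str] = None  # raw output of the proof action, once run

    def loop_closed(self) -> bool:
        # The loop closes only when non-empty evidence exists, regardless of
        # how much activity (files, commits, green tests) has happened.
        return bool(self.evidence and self.evidence.strip())


task = Task(
    description="Add auth middleware",
    theory_of_success="Unauthenticated requests to /api/users get a 401",
    proof_action="curl -i http://localhost:3000/api/users",  # assumed URL
)
before = task.loop_closed()   # no evidence recorded yet
task.evidence = "HTTP/1.1 401 Unauthorized"
after = task.loop_closed()    # evidence recorded
print(before, after)
```

The sketch only checks that evidence exists; whether that evidence actually supports the Theory of Success is a separate judgment.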

Why Activity Lies

A green test suite proves the suite is green — not that the system works

Flywheel's philosophy names the trap in its own words: "Tests can pass and the thing can still be broken." Activity completion is only a proxy for outcomes — and, as the design doc states, "proxies drift."

And when the proxy drifts, every signal can read green while the outcome is broken.

"A green test suite proves the test suite is green — it doesn't prove the system works. Evidence is harder to fake." context/philosophy.md, line 11

The Failure Mode

Every activity signal can be green while the outcome is broken

This is the felt version of "proxies drift." Loops can close, commits can land, and agents can report while the system still does not work. Each green light measures effort, not results.

Therefore "done" has to mean something the activity can't fake.

"Loops can close, commits can land, agents can report — and the system still not work." context/philosophy.md, line 32

The Iron Law

Evidence is the deliverable — not the code, not a narration

This is Flywheel's stated Iron Law, and the phrase "evidence is the deliverable" appears 7 times across the repo. "Done" is redefined: token cost should be proportional to results, not effort expended. Agents report what they PROVED, not what they DID.

Which raises the question: how does an agent actually produce that proof?

"Evidence is the deliverable. Not the code, not a narration of what you did. The evidence speaks for itself." agents/implementer.md, line 86 (Iron Laws)

How Proof Is Produced

The implementer returns one status — and PROVEN demands the raw output, verbatim

The implementer agent returns exactly one of three status codes. PROVEN requires pasting the actual output of the proof action verbatim — nothing else. Contrast: instead of "I created the auth middleware file…", the good report is "PROVEN. Unauthenticated curl to /api/users returned 401."

But the agent producing the proof can't be the one who judges it.

PROVEN: Raw evidence. The actual proof-action output, pasted verbatim.
PROVEN_WITH_NOTES: Proven, with caveats worth flagging.
BLOCKED: Could not produce the proof. No narration, only status plus evidence.
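One way to picture the implementer's report is as a small data structure that refuses to claim proof without raw output. This is a sketch under assumptions: the three status names come from the text, but the `Report` class, its validation rules, and the rendered layout are hypothetical and are not taken from agents/implementer.md.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PROVEN = "PROVEN"
    PROVEN_WITH_NOTES = "PROVEN_WITH_NOTES"
    BLOCKED = "BLOCKED"


@dataclass(frozen=True)
class Report:
    """Hypothetical implementer report: a status plus raw evidence."""
    status: Status
    evidence: str    # proof-action output, pasted verbatim
    notes: str = ""  # caveats, used with PROVEN_WITH_NOTES

    def __post_init__(self) -> None:
        proven = self.status in (Status.PROVEN, Status.PROVEN_WITH_NOTES)
        # Assumed rule: a claim of proof without raw output is rejected.
        if proven and not self.evidence.strip():
            raise ValueError("a PROVEN status requires raw proof-action output")
        # Assumed rule: the notes variant must actually state its caveats.
        if self.status is Status.PROVEN_WITH_NOTES and not self.notes.strip():
            raise ValueError("PROVEN_WITH_NOTES requires the caveats to be stated")

    def render(self) -> str:
        lines = [self.status.value, self.evidence]
        if self.notes:
            lines.append(f"Notes: {self.notes}")
        return "\n".join(lines)


report = Report(Status.PROVEN, "HTTP/1.1 401 Unauthorized")
print(report.render())
```

A narrated report such as "I created the auth middleware file" has no place in this shape: there is a status, the evidence, and optionally the caveats.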

How Proof Is Judged

A separate verifier grades the evidence — and can route the task backward

The verifier is an evidence evaluator, not a code reviewer: it decides whether the proof action's output actually proves what the Theory of Success claims. It returns one of 5 verdicts and can send the task back, so nothing is automatically "done". Because evidence loops take the place of test-driven development, Flywheel's "No TDD" stance recurs across the repo: 9 markdown files mention "No TDD" or test-driven development.

And no task's verdict makes the whole system done — one gate decides that.

VERIFIED, NEEDS_MORE_PROOF, RETRY, REPLAN, RETHINK
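The idea that a verdict can route a task backward can be sketched as a lookup from verdict to next step. The five verdict names come from the text; the step names in `NEXT_STEP` are guesses inferred from the verdict names alone and are not confirmed by agents/verifier.md, so treat the mapping as purely illustrative.

```python
from enum import Enum


class Verdict(Enum):
    VERIFIED = "VERIFIED"
    NEEDS_MORE_PROOF = "NEEDS_MORE_PROOF"
    RETRY = "RETRY"
    REPLAN = "REPLAN"
    RETHINK = "RETHINK"


# Hypothetical routing table: where a task might go after each verdict.
# These step names are assumptions for illustration only.
NEXT_STEP = {
    Verdict.VERIFIED: "accept",
    Verdict.NEEDS_MORE_PROOF: "gather-more-evidence",
    Verdict.RETRY: "re-execute",
    Verdict.REPLAN: "revise-plan",
    Verdict.RETHINK: "revise-theory-of-success",
}


def route(verdict: Verdict) -> str:
    """Return the (assumed) next step for a verifier verdict."""
    return NEXT_STEP[verdict]


def task_is_done(verdict: Verdict) -> bool:
    # Only one of the five verdicts lets a task count as done.
    return verdict is Verdict.VERIFIED


for v in Verdict:
    print(v.value, "->", route(v), "| done:", task_is_done(v))
```

Whatever the real routing is, the point the sketch preserves is that four of the five verdicts send the task somewhere other than "done".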

The Payoff

Nothing ships until the Acceptance Gate answers "does the whole thing work?"

Ship mode enforces a non-optional Acceptance Gate. It does not re-check individual tasks; it asks one system-level question: does the whole thing work as intended? The gate is LOCKED: the cleanup-and-commit step can't run until it passes.

Even the commit that follows records what was PROVEN, not what was done.

The Acceptance Gate

"THE ACCEPTANCE GATE IS NOT OPTIONAL." A single system-level question: does the whole thing work as intended?

LOCKED UNTIL IT PASSES

modes/flywheel-ship.md, lines 32 / 45 / 90
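The lock can be pictured as a guard on the commit step. The sketch below is an assumption-laden illustration: `ShipSession`, `GateLocked`, and the `system_check` callable are hypothetical names, and the real behavior is defined in prose in modes/flywheel-ship.md rather than in code like this.

```python
from typing import Callable


class GateLocked(RuntimeError):
    """Raised when cleanup-and-commit is attempted before the gate passes."""


class ShipSession:
    """Hypothetical ship-mode session with a locked commit step."""

    def __init__(self, theory_of_success: str) -> None:
        self.theory_of_success = theory_of_success
        self.gate_passed = False

    def run_acceptance_gate(self, system_check: Callable[[], bool]) -> bool:
        # One system-level question, not a re-check of individual tasks.
        self.gate_passed = bool(system_check())
        return self.gate_passed

    def cleanup_and_commit(self) -> str:
        # The commit step stays locked until the gate has passed.
        if not self.gate_passed:
            raise GateLocked("acceptance gate has not passed")
        return f"committed: {self.theory_of_success}"


session = ShipSession("Unauthenticated requests to /api/users get a 401")
try:
    session.cleanup_and_commit()
    locked = False
except GateLocked:
    locked = True

session.run_acceptance_gate(lambda: True)  # stand-in for a real system check
result = session.cleanup_and_commit()
print(locked, result)
```

Putting the check inside the commit step itself, rather than trusting callers to run the gate first, is what makes the gate non-optional in this sketch.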

The Durable Shift

"Done" stops meaning "I did it" and starts meaning "I can prove it works"

The whole methodology drives to one change: the commit message summarizes what was PROVEN, not what was done — closing with "Acceptance gate: [Theory of Success] — PASSED." Activity is only ever a proxy; evidence is the deliverable.

Sources

STATUS: RELEASED v0.1.0

Research Methodology

Primary source: kenotron-ms/flywheel (public; gh repo view --json nameWithOwner → kenotron-ms/flywheel). Facts gathered from a fresh clone into /tmp/flywheel.

Repo shape: 39 files total, 32 markdown files; version 0.1.0 (package.json "flywheel-claude-code" + bundle.md). Delivered as a 4-agent / 4-mode Amplifier bundle plus 6 Claude Code skills.

Timeline & authorship: entire repo authored 2026-04-18 in 12 commits; sole author Ken <ken@flywheel.local> (12 of 12 commits).

Commands run:

gh repo clone kenotron-ms/flywheel /tmp/flywheel
grep -n 'activity-based' /tmp/flywheel/README.md
grep -n 'green test suite' /tmp/flywheel/context/philosophy.md; grep -rn 'proxies drift' /tmp/flywheel
grep -rio 'evidence is the deliverable' --include='*.md' /tmp/flywheel | wc -l → 7
grep -n 'Status Codes' /tmp/flywheel/agents/implementer.md (PROVEN / PROVEN_WITH_NOTES / BLOCKED)
grep -n 'Verdict Codes' /tmp/flywheel/agents/verifier.md (5 verdicts)
grep -rl 'No TDD\|no TDD\|test-driven' --include='*.md' /tmp/flywheel | wc -l → 9
grep -n -i 'acceptance gate\|LOCKED' /tmp/flywheel/modes/flywheel-ship.md
git rev-list --count HEAD → 12; git shortlog -sne HEAD; find /tmp/flywheel -type f | wc -l

Gaps: No git tags exist (v0.1.0 is in commit message + package.json/bundle.md only). "Ken <ken@flywheel.local>" is a local git email, not a verified GitHub account; only the namespace kenotron-ms is confirmed. Adoption metrics (stars, downloads, dependents) were not queried and are not claimed.