Adversarial quality for AI research.
Every claim sourced. Every inference challenged. Scores its own output.
March 2026 · Feature Journey
You can’t trace a single claim. Was this verified from a real source — or quietly invented to fill a gap?
High-stakes claims read the same as speculative ones. No signal for what the model is certain about versus guessing.
When evidence is missing, the model fills the silence with plausible prose. Missing evidence is hidden, not named.
You forward the brief. Someone asks where the revenue figure came from. You don’t know.
Instead of a single model doing everything in one pass with no accountability, the work splits across specialists.
Researcher: finds and sources evidence. Source tiering (primary / secondary / tertiary). Per-claim confidence ratings. Evidence gaps named explicitly — not hidden.
Data analyzer: draws labeled inferences from findings. Every conclusion traces to specific findings. Genuine synthesis — not summaries dressed as insight.
Writer: produces prose where facts read like facts and conclusions read like conclusions. Confidence-calibrated language. Provenance footer on every document.
Generic AI vs. research specialists.
Quality is measured, not assumed.
Fact-Checker: fetches source URLs and verifies claims against actual content. Catches fabrications and misrepresentations — not just missing citations.
Inference Challenger: runs a two-phase adversarial protocol. Independently derives conclusions from findings — then challenges each inference the pipeline produced.
Consistency Auditor: checks whether prose language matches confidence levels. Catches “is” when the evidence says “suggests.” Enforces hedge discipline.
Fetches every source URL. Reads the actual content. Grades each claim against what the source actually says.
Score = verified ÷ (total − unreachable) × 10. Unreachable sources don’t penalize — but fabrications do.
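The scoring rule can be sketched as a small function. This is an illustrative reconstruction from the formula above, not the pipeline's actual code:

```python
def fact_check_score(verified: int, total: int, unreachable: int) -> float:
    """Score = verified / (total - unreachable) * 10.

    Unreachable sources shrink the denominator instead of counting
    against the run; fabricated or misrepresented claims simply fail
    to verify, so they pull the score down.
    """
    reachable = total - unreachable
    if reachable <= 0:
        return 0.0  # nothing could be checked
    return min(10.0, verified / reachable * 10)

# Example: 15 claims, 2 sources unreachable, all 13 checkable claims
# verified -> 13 / 13 * 10 = 10.0
```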
Two phases. The hardest critic to satisfy.
Phase 1: the challenger reads only the findings and independently derives what conclusions a rigorous analyst would draw. No peeking at the actual inference.
Phase 2: the challenger then reads the pipeline’s actual inference and attacks it. Does it add genuine analytical lift? Or just restate what a reader could see from the findings alone?
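One way to picture the two-phase protocol. The real critic is an LLM prompt, not code; function names and prompt wording here are hypothetical, while the verdict labels (ROBUST, SUPPORTED_BUT_OBVIOUS) come from the results below:

```python
def challenge_inference(findings: str, pipeline_inference: str, llm) -> str:
    # Phase 1: blind derivation. The challenger sees only the findings,
    # never the pipeline's inference.
    independent = llm(
        f"Given only these findings, what conclusions would a rigorous "
        f"analyst draw?\n{findings}"
    )
    # Phase 2: attack. Now read the pipeline's inference and test whether
    # it adds lift beyond what Phase 1 already produced.
    verdict = llm(
        f"Findings: {findings}\n"
        f"Independent read: {independent}\n"
        f"Pipeline inference: {pipeline_inference}\n"
        f"Rate: ROBUST, SUPPORTED_BUT_OBVIOUS, or UNSUPPORTED."
    )
    return verdict
```

An inference that merely restates the findings will match the Phase 1 derivation and land at SUPPORTED_BUT_OBVIOUS; only genuine synthesis beyond it rates ROBUST.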
Language must match evidence. Medium confidence can’t say “is.” It must say “suggests.”
Claim uses stronger language than the evidence supports. “X dominates the market” when evidence says “X has a significant share.”
A qualified finding is stated as fact without a hedging qualifier. “Adoption is accelerating” with no “evidence suggests.”
A synthesized conclusion presented as an established finding. The “analysis indicates” qualifier is missing or stripped.
The ported hedge-by-default rule gives the writer explicit confidence-to-language mappings. This critic enforces them.
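A minimal sketch of what a confidence-to-language mapping and its check might look like. Only the medium-confidence rule (“suggests,” never “is”) and the “analysis indicates” qualifier are stated above; the rest of the table is an assumption:

```python
# Hypothetical hedge-by-default mapping; only the medium-confidence
# rule ("suggests", not "is") is specified in the source.
HEDGE_MAP = {
    "high":   ["is", "holds", "dominates"],          # plain assertion allowed
    "medium": ["suggests", "indicates", "appears"],  # must hedge
    "low":    ["may", "might", "could"],             # must hedge harder
}

def violates_hedge_rule(confidence: str, sentence: str) -> bool:
    """Flag a sentence that asserts at medium/low confidence
    without using any of the allowed hedge words."""
    if confidence == "high":
        return False
    hedges = HEDGE_MAP[confidence]
    return not any(h in sentence for h in hedges)

# "Adoption is accelerating" at medium confidence -> violation;
# "Evidence suggests adoption is accelerating" -> passes.
```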
Topic: Anthropic’s approach to AI safety. Pre-improvement vs. post-improvement.
| Critic | Before | After | Delta |
|---|---|---|---|
| Fact-Checker (40%) | 8.67 / 10 | 10.00 / 10 | +1.33 |
| Inference Challenger (35%) | 0.00 / 10 | 6.67 / 10 | +6.67 ← |
| Consistency Auditor (25%) | 9.68 / 10 | 9.55 / 10 | −0.13 |
| Composite | 5.89 / 10 | 8.72 / 10 | +2.83 |
The inference quality improvement drove the result. Consistency was already high — the hedge-by-default rule was a safety net, not a lift.
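The composite in the table is the weighted mean of the three critic scores using the weights shown in the Critic column; the arithmetic reproduces both rows:

```python
# Critic weights from the table: 40% / 35% / 25%.
WEIGHTS = {"fact_checker": 0.40, "inference": 0.35, "consistency": 0.25}

def composite(scores: dict) -> float:
    """Weighted mean of the three critic scores, rounded to 2 places."""
    return round(sum(WEIGHTS[k] * scores[k] for k in WEIGHTS), 2)

before = composite({"fact_checker": 8.67, "inference": 0.00, "consistency": 9.68})
after  = composite({"fact_checker": 10.00, "inference": 6.67, "consistency": 9.55})
# before -> 5.89, after -> 8.72, matching the table
```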
Inference quality score, before and after
Pre-improvement: all 3 inferences rated SUPPORTED_BUT_OBVIOUS — valid summaries, zero analytical lift.
Post-improvement: 2 of 3 rated ROBUST — genuine synthesis that adds insight no single source could provide.
Inference quality is the hardest dimension. It tests whether the pipeline thinks, not just summarizes.
Smart skip gates on the formatter steps. Zero quality impact.
Each formatter step now runs a fast bash check first. If the upstream specialist’s output is already well-formed — right headers, right structure — the LLM call is skipped entirely. When output is malformed, the formatter runs as normal.
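A sketch of the gate logic. The real check is a fast bash check, so this Python version is illustrative only, and the header names are hypothetical:

```python
def is_well_formed(doc: str) -> bool:
    """Fast structural check: required headers present, in order.
    Header names are hypothetical examples, not the pipeline's."""
    required = ["## Findings", "## Inferences", "## Confidence"]
    positions = [doc.find(h) for h in required]
    return all(p >= 0 for p in positions) and positions == sorted(positions)

def maybe_format(doc: str, format_with_llm) -> str:
    if is_well_formed(doc):
        return doc               # already well-formed: skip the LLM call
    return format_with_llm(doc)  # malformed: run the formatter as normal
```

The point of the gate is that the cheap structural check runs on every document, while the expensive LLM call runs only on the failures.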
The system reads intent. Asks once if unclear.
Researcher → Writer. For exploration and orientation. When you want to understand something, not stake your reputation on it.
Signals: “what is X”, “overview of Y”, “quick answer”, “just curious”
Full pipeline: research, URL verification, analysis, confidence calibration, writing. For work someone will push back on.
Signals: “write me a brief”, “deep dive”, “I need to send this to…”
If intent is ambiguous, the coordinator asks once: “Quick answer or deep research?” Then routes. No second question.
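The routing heuristic, sketched in Python. The signal phrases come from the lists above; the function names and route labels are hypothetical:

```python
QUICK_SIGNALS = ["what is", "overview of", "quick answer", "just curious"]
DEEP_SIGNALS  = ["write me a brief", "deep dive", "i need to send this to"]

def route(request: str, ask_user) -> str:
    text = request.lower()
    quick = any(s in text for s in QUICK_SIGNALS)
    deep  = any(s in text for s in DEEP_SIGNALS)
    if quick and not deep:
        return "researcher->writer"   # quick path
    if deep and not quick:
        return "full_pipeline"        # deep path
    # Ambiguous: ask exactly once, then route. No second question.
    answer = ask_user("Quick answer or deep research?")
    return "full_pipeline" if "deep" in answer.lower() else "researcher->writer"
```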
Every document ends with a quiet confidence signal:
Missing evidence is reported, not hidden. The document names what it doesn’t know.
High confidence: “X holds 60% share.” Low confidence: “Analysis suggests Y may be bifurcating.”
The consistency auditor enforces this across every document.
Run all 5 test topics in parallel. Full eval suite in ~50 minutes instead of ~250 minutes sequential. Three categories: well-documented, contested, thin-evidence.
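The parallel suite is a straightforward worker pool: with five topics running concurrently, wall time is roughly one run, not five. A minimal sketch, assuming each topic is an independent job:

```python
from concurrent.futures import ThreadPoolExecutor

def run_eval(topic: str) -> float:
    """Stand-in for one full pipeline + critics run (~50 min each).
    A real implementation would return the composite score."""
    return 0.0

def run_suite(topics: list[str]) -> dict[str, float]:
    # All topics in flight at once: ~50 min wall time for the suite
    # instead of ~250 min sequential.
    with ThreadPoolExecutor(max_workers=len(topics)) as pool:
        return dict(zip(topics, pool.map(run_eval, topics)))
```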
Lightweight quality logging baked into the pipeline. Every user run appends a one-line consistency score. Quality signal without manual review.
Pipeline caching means critics-only re-runs take ~15 minutes. Any prompt change can be measured before it ships. The feedback loop most AI tools don’t have.
The eval harness is the differentiator. Not just that the pipeline is good — but that you can prove it, measure it, and improve it.
Data as of: March 15, 2026
Feature status: Active — deployed on main, live for all @main users
Author: github.com/cpark4x
Key commits (all verifiable in git log):
3ffc03a — Port inference quality criteria + hedge-by-default rule to agent prompts
9a83f4b — Add eval-harness recipe (16 steps, 3 adversarial critics, composite scoring)
f35bcaa — Smoke test complete: composite 8.72/10 on well-documented topic
5acf89b — Smart skip gates on formatter steps (−6 min per run)
cd7e51e — Quick vs deep routing (intent detection + single clarifying question)
0aa5ca8 — Provenance footer on every writer document
A/B methodology: Both runs used identical cached pipeline output (same research, same topic). Only the writer and data-analyzer prompts differed. Critics ran independently. Scores extracted with portable sed, clamped 0–10.
Gaps: Single-topic test only (well-documented category). Inference scores are LLM-evaluated and have run-to-run variance. Fact-checker URL reachability varies by run (paywalls, rate limits).
Built by: Chris Park · @cpark4x