Built with Amplifier

Research Specialists

Adversarial quality for AI research.
Every claim sourced. Every inference challenged. Scores its own output.

March 2026  ·  Feature Journey

The Problem

AI research sounds authoritative.
You can’t verify any of it.

🔍

No sources

You can’t trace a single claim. Was this verified from a real source — or quietly invented to fill a gap?

📊

No confidence levels

High-stakes claims read the same as speculative ones. No signal for what the model is certain about versus guessing.

🕳

No gap reporting

When evidence is missing, the model fills the silence with plausible prose. Missing evidence is hidden, not named.

You forward the brief. Someone asks where the revenue figure came from. You don’t know.

The Architecture

Three specialists. One job each.

Not one model doing everything in a single pass with no accountability. Each specialist is answerable for its own step.

🔎

Researcher

Finds and sources evidence. Source tiering (primary / secondary / tertiary). Per-claim confidence ratings. Evidence gaps named explicitly — not hidden.

📐

Data Analyzer

Draws labeled inferences from findings. Every conclusion traces to specific findings. Genuine synthesis — not summaries dressed as insight.

✍️

Writer

Produces prose where facts read like facts and conclusions read like conclusions. Confidence-calibrated language. Provenance footer on every document.

The Output

Before and after

Generic AI

“OpenAI holds a dominant position in the enterprise AI market, with strong adoption among Fortune 500 companies. Anthropic is gaining ground, particularly among safety-conscious organizations.” No sources. No confidence. No gaps. Can’t audit a sentence.

Research Specialists

FINDING [HIGH CONFIDENCE]
OpenAI holds ~60% of enterprise API spend.
Sources: [S1] Menlo Ventures 2024 (primary) · [S2] Bloomberg Q3 2024 (secondary)

INFERENCE [traces to: F1, F3]
Market bifurcating along risk-tolerance lines.
Confidence: medium. Type: pattern.

EVIDENCE GAP
No independent Anthropic ARR data found. Vendor claims only. Treat as unverified.
Adversarial Evaluation

Three critics score every pipeline run.

Quality is measured, not assumed.

🔗

Fact-Checker

Fetches source URLs and verifies claims against actual content. Catches fabrications and misrepresentations — not just missing citations.

⚔️

Inference Challenger

Two-phase adversarial protocol. Independently derives conclusions from findings — then challenges each inference the pipeline produced.

⚖️

Consistency Auditor

Checks whether prose language matches confidence levels. Catches “is” when the evidence says “suggests.” Enforces hedge discipline.

researcher → data-analyzer → writer → fact-checker + inference-challenger + consistency-auditor → composite score
Critic 1 — Weight: 40%

The Fact-Checker

Fetches every source URL. Reads the actual content. Grades each claim against what the source actually says.

VERIFIED
Source exists and confirms the claim as stated
MISREPRESENTED
Source exists but the claim distorts what it says
FABRICATED
No source supports this claim. The model invented it.
UNREACHABLE
Source exists but couldn’t be fetched (paywall, rate limit)

Score = verified ÷ (total − unreachable) × 10. Unreachable sources don’t penalize — but fabrications do.
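The scoring rule above fits in a few lines. A minimal sketch (the verdict labels come from this section; the function name and list-of-verdicts input shape are illustrative, not the actual implementation):

```python
from collections import Counter

def fact_check_score(verdicts: list[str]) -> float:
    """Score = verified / (total - unreachable) * 10.

    Unreachable sources drop out of the denominator, so they neither
    help nor hurt. Fabricated or misrepresented claims stay in the
    denominator and pull the score down.
    """
    counts = Counter(verdicts)
    scorable = len(verdicts) - counts["UNREACHABLE"]
    if scorable == 0:
        return 0.0
    return counts["VERIFIED"] / scorable * 10

# 13 verified + 2 unreachable → 13 / 13 × 10 = 10.0
print(fact_check_score(["VERIFIED"] * 13 + ["UNREACHABLE"] * 2))
```

Note the asymmetry by design: a paywalled source costs nothing, but an invented one costs a full share of the score.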

Critic 2 — Weight: 35%

The Inference Challenger

Two phases. The hardest critic to satisfy.

Phase 1 — Independent Derivation

The challenger reads only the findings and independently derives what conclusions a rigorous analyst would draw. No peeking at the actual inference.

Phase 2 — Adversarial Challenge

The challenger then reads the pipeline’s actual inference and attacks it. Does it add genuine analytical lift? Or just restate what a reader could see from the findings alone?

ROBUST
Genuine synthesis. Adds insight beyond the raw evidence.
OBVIOUS
Valid, but a reader would reach this conclusion unaided.
WEAK
Unsupported or logically invalid leap from the evidence.
Critic 3 — Weight: 25%

The Consistency Auditor

Language must match evidence. Medium confidence can’t say “is.” It must say “suggests.”

CONFIDENCE_UPGRADE

Claim uses stronger language than the evidence supports. “X dominates the market” when evidence says “X has a significant share.”

MISSING_HEDGE

A qualified finding is stated as fact without a hedging qualifier. “Adoption is accelerating” with no “evidence suggests.”

INFERENCE_AS_FACT

A synthesized conclusion presented as an established finding. The “analysis indicates” qualifier is missing or stripped.

The ported hedge-by-default rule gives the writer explicit confidence-to-language mappings. This critic enforces them.
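A confidence-to-language mapping like the one the writer receives can be sketched as a lookup the auditor checks against. The tier names and verb sets below are illustrative assumptions; the actual mapping isn't reproduced in this piece:

```python
# Illustrative hedge-by-default mapping: each confidence tier has the
# strongest verbs it is allowed to use. (Hypothetical values.)
HEDGE_MAP = {
    "high":   {"is", "holds", "shows"},
    "medium": {"suggests", "indicates"},
    "low":    {"may", "might", "could"},
}

def allowed(confidence: str, verb: str) -> bool:
    """A claim may use language from its own tier or any weaker tier,
    never a stronger one — medium confidence can't say "is"."""
    tiers = ["high", "medium", "low"]
    start = tiers.index(confidence)
    return any(verb in HEDGE_MAP[t] for t in tiers[start:])

print(allowed("medium", "is"))        # → False: a CONFIDENCE_UPGRADE
print(allowed("medium", "suggests"))  # → True
```

A violation of this check is exactly what the auditor flags as CONFIDENCE_UPGRADE or MISSING_HEDGE.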

The A/B Test

Same research. Same topic. Different agent prompts.

Topic: Anthropic’s approach to AI safety. Pre-improvement vs. post-improvement.

Critic                        Before       After        Delta
Fact-Checker (40%)            8.67 / 10    10.00 / 10   +1.33
Inference Challenger (35%)    0.00 / 10    6.67 / 10    +6.67 ←
Consistency Auditor (25%)     9.68 / 10    9.55 / 10    −0.13
Composite                     5.89 / 10    8.72 / 10    +2.83

The inference quality improvement drove the result. Consistency was already high — the hedge-by-default rule was a safety net, not a lift.
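The composite is a straight weighted average of the three critic scores, using the 40/35/25 weights from the previous sections. A sketch that reproduces the table's composite numbers (names are illustrative):

```python
WEIGHTS = {
    "fact_checker": 0.40,
    "inference_challenger": 0.35,
    "consistency_auditor": 0.25,
}

def composite(scores: dict[str, float]) -> float:
    """Weighted average of per-critic scores, each on a 0-10 scale."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

before = {"fact_checker": 8.67, "inference_challenger": 0.00,
          "consistency_auditor": 9.68}
after  = {"fact_checker": 10.00, "inference_challenger": 6.67,
          "consistency_auditor": 9.55}

print(round(composite(before), 2))  # → 5.89
print(round(composite(after), 2))   # → 8.72
```

The weights explain the result: a +6.67 jump on a 35%-weighted critic moves the composite by about +2.3 on its own.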

The Headline Number
0 → 6.67

Inference quality score, before and after

Pre-improvement: all 3 inferences rated SUPPORTED_BUT_OBVIOUS — valid summaries, zero analytical lift.

Post-improvement: 2 of 3 rated ROBUST — genuine synthesis that adds insight no single source could provide.

Inference quality is the hardest dimension. It tests whether the pipeline thinks, not just summarizes.

Speed

22 minutes → 16 minutes

Smart skip gates on the formatter steps. Zero quality impact.

6 min
saved per run
~27%
faster pipeline
0
quality impact

Each formatter step now runs a fast bash check first. If the upstream specialist’s output is already well-formed — right headers, right structure — the LLM call is skipped entirely. When output is malformed, the formatter runs as normal.

analyze → check-format → format-analysis (skip) → write → check-format → format-writer (skip)
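The skip gate itself is cheap. A minimal sketch of the pre-check idea (the real gate is a bash script and its exact patterns aren't shown here; the header names below are hypothetical):

```python
import re

# Hypothetical section headers a well-formed specialist output must have.
REQUIRED_HEADERS = ["FINDING", "INFERENCE", "EVIDENCE GAP"]

def already_well_formed(output: str) -> bool:
    """Fast gate: if every required header already starts a line,
    the LLM formatter call is skipped entirely."""
    return all(re.search(rf"^{h}\b", output, re.M)
               for h in REQUIRED_HEADERS)

doc = "FINDING [HIGH]\n...\nINFERENCE [traces to: F1]\n...\nEVIDENCE GAP\n..."
print(already_well_formed(doc))  # → True: formatter step is skipped
```

Because the check only ever skips work when the output is already correct, it can't change quality, only latency.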
Routing

One question. Two paths.

The system reads intent. Asks once if unclear.

Quick — 3–5 minutes

Researcher → Writer. For exploration and orientation. When you want to understand something, not stake your reputation on it.

Signals: “what is X”, “overview of Y”, “quick answer”, “just curious”

🏗️

Deep — ~15 minutes

Full pipeline: research, URL verification, analysis, confidence calibration, writing. For work someone will push back on.

Signals: “write me a brief”, “deep dive”, “I need to send this to…”

If intent is ambiguous, the coordinator asks once: “Quick answer or deep research?” Then routes. No second question.
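The routing rule above can be sketched as keyword matching over the listed signals. This is a simplification: the real coordinator reads intent (presumably with the model itself) rather than string-matching, so treat the code as a sketch of the decision shape only:

```python
# Signal phrases taken from this section.
QUICK_SIGNALS = ("what is", "overview of", "quick answer", "just curious")
DEEP_SIGNALS  = ("write me a brief", "deep dive", "i need to send this")

def route(request: str) -> str:
    """Return 'quick', 'deep', or 'ask' (one clarifying question)."""
    text = request.lower()
    quick = any(s in text for s in QUICK_SIGNALS)
    deep = any(s in text for s in DEEP_SIGNALS)
    if quick and not deep:
        return "quick"   # Researcher → Writer, 3-5 minutes
    if deep and not quick:
        return "deep"    # full pipeline, ~15 minutes
    return "ask"         # "Quick answer or deep research?" — asked once

print(route("what is RLHF? just curious"))  # → quick
print(route("write me a brief on RLHF"))    # → deep
```

Anything ambiguous falls through to the single clarifying question, which matches the "asks once, then routes" rule.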

User Experience

Research you can forward without re-verifying.

📺

Provenance footer

Every document ends with a quiet confidence signal:

Based on 28 sourced findings from 12 sources. 3 analytical inferences drawn from cross-finding synthesis.
🕳

Honest gaps

Missing evidence is reported, not hidden. The document names what it doesn’t know.

EVIDENCE GAP No independent ARR data found. Vendor claims only. Treat as directional.
🎚️

Calibrated language

High confidence: “X holds 60% share.” Low confidence: “Analysis suggests Y may be bifurcating.”

The consistency auditor enforces this across every document.

Roadmap

What’s being built now

Batch evaluation

Run all 5 test topics in parallel. Full eval suite in ~50 minutes instead of ~250 minutes sequential. Three categories: well-documented, contested, thin-evidence.

📡

Usage signal

Lightweight quality logging baked into the pipeline. Every user run appends a one-line consistency score. Quality signal without manual review.

🧪

A/B framework

Pipeline caching means critics-only re-runs take ~15 minutes. Any prompt change can be measured before it ships. The feedback loop most AI tools don’t have.

The eval harness is the differentiator. Not just that the pipeline is good — but that you can prove it, measure it, and improve it.

Sources

Research Methodology

Data as of: March 15, 2026

Feature status: Active — deployed on main, live for all @main users

Author: github.com/cpark4x

Key commits (all verifiable in git log):

A/B methodology: Both runs used identical cached pipeline output (same research, same topic). Only the writer and data-analyzer prompts differed. Critics ran independently. Scores extracted with portable sed, clamped 0–10.

Gaps: Single-topic test only (well-documented category). Inference scores are LLM-evaluated and have run-to-run variance. Fact-checker URL reachability varies by run (paywalls, rate limits).

Built by: Chris Park  ·  @cpark4x
