The Agent Takes the Wheel

Bounded, dual-world desktop control for agents

The Gap

Agents master the terminal, then go blind at the desktop GUI

Agents already write code and run terminal commands with real autonomy — but they can't click a dialog or fill a form with any grounded, bounded discipline. amplifier-bundle-cua closes that gap as a clean-room, Amplifier-native bundle: tools, agents, and recipes, not a foreign runtime.

And it isn't a sketch — the evidence is next.

Terminal: agents can already act with autonomy
Desktop GUI: blind — no grounded way to click or type
The bundle: cua namespace, version 0.1.0 — tools + agents + recipes

Proof

Not a sketch — a working v0.1.0 backed by passing tests

Michael J. Jabbour built the whole bundle in 41 commits over 5 days (2026-03-06 to 2026-03-10), merged through 3 pull requests. The public repo michaeljabbour/amplifier-bundle-cua ships with 147 passing tests.

So what does all that capability actually hang off of?

41 commits over 5 days
147 tests passing (uv run pytest)
3 pull requests merged
1 sole author (all 41 commits)

The Primitive

The whole bundle hangs off one 'cua' tool

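A single tool named cua exposes a fixed set of 12 desktop actions: six that observe the desktop (screen, windows, cursor, accessibility tree) and six that act on it (click, type, press keys, scroll, move the cursor). Every capability described in the rest of this deck routes through this one surface.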

First question: how does the agent actually see the desktop?

# the single "cua" tool — 12 actions
observe        semantic_tree
screenshot     screen_info
window_info    cursor_position

click          double_click
type_text      key_press
scroll         move_cursor
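The listing above can be read as a closed vocabulary. The following is a minimal sketch of how such a fixed action set might be modeled, not the bundle's actual code: the names CuaAction, READ_ONLY_ACTIONS, parse_action, and mutates_desktop are hypothetical, and the read-only versus input split is an assumption read off the two groups in the listing.

```python
from enum import Enum


class CuaAction(str, Enum):
    """Hypothetical enum mirroring the 12 action names listed above."""

    # Perception actions: read desktop state without changing it.
    OBSERVE = "observe"
    SEMANTIC_TREE = "semantic_tree"
    SCREENSHOT = "screenshot"
    SCREEN_INFO = "screen_info"
    WINDOW_INFO = "window_info"
    CURSOR_POSITION = "cursor_position"
    # Input actions: change desktop state.
    CLICK = "click"
    DOUBLE_CLICK = "double_click"
    TYPE_TEXT = "type_text"
    KEY_PRESS = "key_press"
    SCROLL = "scroll"
    MOVE_CURSOR = "move_cursor"


# Assumed grouping, taken from the two-part listing; not confirmed by the source.
READ_ONLY_ACTIONS = frozenset(
    {
        CuaAction.OBSERVE,
        CuaAction.SEMANTIC_TREE,
        CuaAction.SCREENSHOT,
        CuaAction.SCREEN_INFO,
        CuaAction.WINDOW_INFO,
        CuaAction.CURSOR_POSITION,
    }
)


def parse_action(name: str) -> CuaAction:
    """Reject anything outside the fixed action set instead of guessing."""
    try:
        return CuaAction(name)
    except ValueError:
        raise ValueError(f"unknown cua action: {name!r}") from None


def mutates_desktop(action: CuaAction) -> bool:
    """True for actions that change desktop state and so need verification."""
    return action not in READ_ONLY_ACTIONS
```

A closed enum like this makes an unknown action name an immediate error rather than a silent no-op, which fits the fail-fast stance the deck describes later.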

Dual-World Perception

One Observation fuses the visual and the semantic worlds

A single Observation carries the visual world — screenshot, screen geometry, cursor, windows — together with the semantic accessibility tree of roles, labels, and values. Neither is treated as secondary; both are co-equal, first-class inputs.

Seeing is half of it — now the discipline of acting.

Visual world: screenshot, screen_info, cursor, focused window, windows
Semantic world: accessibility tree of roles / labels / values
Fused: one Observation — neither visual nor semantic is secondary
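To make the fused shape concrete, here is a rough sketch of what a dual-world observation record could look like. It is an illustration under assumptions, not the contents of the bundle's models.py: the class names SemanticNode and Observation, every field name, and the helper find_by_label are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SemanticNode:
    """One accessibility-tree node: role, label, value (hypothetical shape)."""

    role: str
    label: str = ""
    value: str = ""
    children: tuple["SemanticNode", ...] = ()


@dataclass(frozen=True)
class Observation:
    """Visual and semantic state captured together (hypothetical shape)."""

    # Visual world
    screenshot_png: bytes
    screen_size: tuple[int, int]
    cursor: tuple[int, int]
    focused_window: Optional[str]
    windows: tuple[str, ...]
    # Semantic world
    semantic_tree: SemanticNode


def find_by_label(node: SemanticNode, label: str) -> Optional[SemanticNode]:
    """Depth-first search of the semantic tree for an exact label match."""
    if node.label == label:
        return node
    for child in node.children:
        hit = find_by_label(child, label)
        if hit is not None:
            return hit
    return None
```

Keeping both worlds in one immutable record means a planner can cross-check them, for example confirming that a labeled node found in the tree belongs to the window that is actually focused on screen.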

The Loop

Fire exactly one atomic action, then re-observe to verify

A bounded observe-act-verify loop keeps action honest: Observe, Analyze, Plan, Act — exactly one atomic action — Verify by re-observing, then Decide. The agent never claims success without evidence.

But driving a real desktop this way carries real risk.

1. Observe: capture the dual-world state
2. Analyze & Plan: choose one atomic action
3. Act: fire exactly one action
4. Verify & Decide: re-observe, then loop or terminate
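The loop above can be sketched as a small driver function. This is a minimal illustration, assuming the caller supplies the four callbacks; run_bounded_loop, its parameters, and the outcome strings are hypothetical names, and the default of 20 actions simply echoes the budget the deck quotes for the bounded-task recipe.

```python
from typing import Any, Callable, Optional


def run_bounded_loop(
    observe: Callable[[], Any],
    plan: Callable[[Any], Optional[Any]],
    act: Callable[[Any], None],
    verify: Callable[[Any, Any, Any], bool],
    max_actions: int = 20,
) -> tuple[str, list[tuple[Any, bool]]]:
    """Observe, plan one action, act, re-observe, verify; stop at the budget."""
    history: list[tuple[Any, bool]] = []
    state = observe()
    for _ in range(max_actions):
        action = plan(state)  # Analyze & Plan: exactly one atomic action
        if action is None:  # the planner reports nothing left to do
            return "goal_reached", history
        act(action)  # Act: fire that single action
        new_state = observe()  # Verify: evidence comes from a fresh observation
        ok = verify(state, action, new_state)
        history.append((action, ok))
        if not ok:  # Decide: never continue on an unverified step
            return "verification_failed", history
        state = new_state
    return "budget_exhausted", history
```

The history list records each action with its verification result, so a claim of success is always backed by at least one re-observation.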

The Hazard

Autonomy on a real desktop is dangerous

Capability alone isn't enough. An agent that assumes success — or hallucinates desktop state it can't actually see — can do real damage. Unbounded desktop control is a hazard, not a feature.

So the bundle layers safety in, deliberately.

Assuming success without checking → silent failure
Hallucinating desktop state from memory → blind action
Silently faking data when a backend is missing → false confidence

Layered Safety

Structured uncertainty, human gates, and fail-fast honesty

Every action returns one of four normalized statuses. The bounded-task recipe gates both the plan and the result behind human approval, with a default budget of 20 actions and 2 retries per step. Backend detection fails fast: the macOS backend raises a RuntimeError rather than silently returning fixture data.

Put it together and a plain sentence becomes a workflow.

Statuses: success, failure, blocked, ambiguous

Human-in-the-loop: approval before AND after execution
Budgets: 20 actions, 2 retries per step by default
Fail-fast: raises a RuntimeError, never fakes data
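The three safety layers can be illustrated with a short sketch. It takes the four status names and the 20-action, 2-retry defaults from the text. Everything else is an assumption made for illustration: the names ActionStatus, Budget, run_step, and require_backend are hypothetical, as are the choices to retry only on failure and to charge each retry against the action budget.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class ActionStatus(str, Enum):
    """The four normalized statuses named in the text."""

    SUCCESS = "success"
    FAILURE = "failure"
    BLOCKED = "blocked"
    AMBIGUOUS = "ambiguous"


@dataclass
class Budget:
    """Defaults taken from the text; the accounting rule is an assumption."""

    max_actions: int = 20
    max_retries_per_step: int = 2
    actions_used: int = 0

    def spend(self) -> None:
        if self.actions_used >= self.max_actions:
            raise RuntimeError("action budget exhausted")
        self.actions_used += 1


def run_step(attempt: Callable[[], ActionStatus], budget: Budget) -> ActionStatus:
    """Run one step, retrying only on FAILURE (assumed policy).

    BLOCKED and AMBIGUOUS are returned at once so a human can decide.
    """
    status = ActionStatus.FAILURE
    for _ in range(1 + budget.max_retries_per_step):
        budget.spend()  # assumed: every attempt counts against the budget
        status = attempt()
        if status is not ActionStatus.FAILURE:
            return status
    return status


def require_backend(available: bool, name: str) -> None:
    """Fail fast instead of substituting fixture data for a missing backend."""
    if not available:
        raise RuntimeError(f"{name} backend unavailable; refusing to fake data")
```

Returning blocked and ambiguous without retrying keeps uncertainty visible to the caller instead of hiding it behind repeated attempts.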

The Payoff

"Open TextEdit and type hello world" — driven end to end

A single plain sentence resolves into a fully observed, approval-gated, verified desktop workflow that the agent drives itself. The workflow runs on a real macOS backend (Quartz input, screencapture, AXUIElement accessibility), with a deterministic fixture backend for CI.

Status: v0.1.0 — working & tested (fixture + macOS)

# a sentence becomes a safe, auditable workflow
"Open TextEdit and type hello world"

observe  → dual-world state captured
approve → human gates the plan
act     → one atomic action at a time
verify  → re-observe, confirm, decide
approve → human gates the result
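The two approval gates in the flow above can be sketched as a thin wrapper around planning and execution. This is an illustrative outline only, not the bundle's recipe format: run_gated_task, its callbacks, and the outcome labels are all hypothetical.

```python
from typing import Any, Callable


def run_gated_task(
    task: str,
    make_plan: Callable[[str], Any],
    execute: Callable[[Any], Any],
    approve: Callable[[str, Any], bool],
) -> dict:
    """Gate the plan before execution and the result after it."""
    plan = make_plan(task)
    if not approve("plan", plan):  # first gate: nothing runs without sign-off
        return {"outcome": "plan_rejected", "plan": plan, "result": None}
    result = execute(plan)  # the observe-act-verify loop would run here
    if not approve("result", result):  # second gate: human confirms the result
        return {"outcome": "result_rejected", "plan": plan, "result": result}
    return {"outcome": "accepted", "plan": plan, "result": result}
```

In this sketch a rejected plan returns before execute is ever called, so the desktop is untouched unless a human has signed off on the plan.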

Sources

Research Methodology

Data as of: HEAD 5f127bc, development window 2026-03-06 to 2026-03-10

Feature status: v0.1.0 — working and tested for fixture + macOS backends; Windows/Linux backends are stubs

Repository: michaeljabbour/amplifier-bundle-cua (public, default branch main). Cloned fresh from GitHub into /tmp/cua-src; not present locally under /home/ramparte/dev/ANext.

Research performed:

Repo/bundle: gh repo view michaeljabbour/amplifier-bundle-cua --json name,visibility,defaultBranchRef,isPrivate ; cat bundle.md (namespace cua, v0.1.0)
Commit count: git rev-list --count HEAD (41 commits)
Timeline: git log --reverse --format="%ai %h %s" (2026-03-06 → 2026-03-10)
Contributors: git log --format="%an <%ae>" | sort | uniq -c (Michael J. Jabbour, all 41)
Merges: git log --oneline --merges (PR #1, #2, #3)
Tests: uv run pytest tests/ -q (147 passed)
Actions/models/backends: read tool.py ; read models.py ; read backends/registry.py ; grep -c 'async def' backends/macos.py

Gaps: macOS backend was not executed against a live desktop (no macOS host available); macOS claims are grounded in source (17 async methods, Quartz/screencapture/AXUIElement) and the passing mocked test suite. GitHub-side PR metadata (reviews, CI runs) was not queried; PR facts come from local merge commits.

Primary contributor: Michael J. Jabbour — sole author, all 41 commits (100%).