Features · Amplifier Bundle

The Agent Takes
the Wheel

Amplifier agents can now see your screen, understand what's on it, and operate the desktop directly — clicks, keystrokes, scrolls, all of it.

Active · amplifier-bundle-cua
May 2026 · michaeljabbour/amplifier-bundle-cua
The Problem

AI stops at
the terminal

🚫

Text-Only Agents

AI agents live in the terminal. They can write code and run commands, but they cannot see what's on screen or interact with graphical applications.

👁️

Blind to the Desktop

A browser is open, a dialog is waiting, an app needs a click — but the agent has no way to perceive any of it. The visual world is invisible.

🎯

No Grounded Action

Even if an agent could see the screen, it would still need precise, bounded control, not hallucinated coordinates or unbounded click loops.

The desktop is the last frontier for AI agents. Every GUI application — from browsers to design tools to system preferences — requires eyes and hands the agent doesn't have.

The Solution

Dual-World
Perception

CUA gives agents two complementary views of the desktop — and the ability to act on both.

📸

Visual World

Raw screenshots of the current screen. The agent sees exactly what a human would see — layout, colors, overlapping windows, dialog boxes. Captured via native OS APIs.

🌳

Semantic World

The accessibility tree: a structured hierarchy of UI elements with roles, labels, values, and bounding boxes. The agent reads the logical structure beneath the pixels.

Use semantic data for identification, visual data for confirmation. When both worlds agree, confidence is high. When they disagree, re-observe.

— Dual-World Reasoning context, amplifier-bundle-cua
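As an illustration of the rule quoted above, here is a minimal sketch of how agreement between the two worlds might map onto the four confidence levels. The `Box` type, the `confidence` function, and the exact mapping (in particular treating a single-world match as "medium") are assumptions made for this example, not the bundle's actual data models or thresholds.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in screen pixels (hypothetical model)."""
    x: float
    y: float
    width: float
    height: float

    def contains(self, px: float, py: float) -> bool:
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)


def confidence(semantic_box: Box | None,
               visual_point: tuple[float, float] | None) -> str:
    """Assumed mapping from dual-world agreement to a confidence level."""
    if semantic_box is None and visual_point is None:
        return "none"    # target found in neither world
    if semantic_box is None or visual_point is None:
        return "medium"  # only one world located the target
    if semantic_box.contains(*visual_point):
        return "high"    # semantic bounds and visual location agree
    return "low"         # the worlds disagree: re-observe before acting
```

Here `semantic_box` stands for the bounding box reported by the accessibility tree, and `visual_point` for where the target appears to be in the screenshot.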
How It Works

Observe → Act → Verify

Every interaction follows a bounded, single-action loop — never blind, never unbounded.

1

Observe

Call cua observe to capture a full dual-world snapshot: screenshot, semantic tree, cursor position, window list, and focused application.

2

Analyze & Plan

Cross-reference visual and semantic data. Identify the target element. Decide the single next atomic action — a click, a keystroke, a scroll. Assess confidence: high, medium, low, or none.

3

Act

Execute exactly one action via the cua tool. 12 actions available: screenshot, click, double_click, type_text, key_press, scroll, move_cursor, cursor_position, screen_info, window_info, semantic_tree, and observe.

4

Verify

Re-observe to confirm the expected state change. If verified, proceed. If ambiguous, re-observe (max 2 retries). If failed, adapt or escalate. Every result returns a structured status: success, failure, blocked, or ambiguous.
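The four steps above can be sketched as a single bounded function. This is a simplified illustration, not the operator's real code: `run_step` and its three callables are hypothetical names, and only the status values and the two-retry limit come from the deck.

```python
from __future__ import annotations

from enum import Enum
from typing import Callable


class Status(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    BLOCKED = "blocked"
    AMBIGUOUS = "ambiguous"


MAX_RETRIES = 2  # from the deck: at most 2 re-observations per step


def run_step(observe: Callable[[], dict],
             act: Callable[[dict], None],
             verify: Callable[[dict], Status]) -> Status:
    """Run one observe -> act -> verify cycle (hypothetical helper)."""
    snapshot = observe()        # 1. capture the dual-world snapshot
    act(snapshot)               # 2-3. plan and execute exactly one action
    status = verify(observe())  # 4. re-observe and check the state change
    retries = 0
    while status is Status.AMBIGUOUS and retries < MAX_RETRIES:
        retries += 1
        status = verify(observe())  # re-observe only; never act again blindly
    return status
```

A caller would treat `FAILURE`, `BLOCKED`, or a still-`AMBIGUOUS` result as the cue to adapt or escalate rather than loop.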

Architecture

Four layers, clean separation

🧠

cua-planner

Decomposes multi-step tasks into atomic actions with budgets and recovery strategies

🤖

cua-operator

Executes the observe–act–verify loop with dual-world reasoning

🔧

cua Tool

12-action dispatcher — translates agent intent into OS calls

🖥️

Platform Backend

macOS (Quartz + AX), Windows and Linux (stubs), and Fixture for CI
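To make the separation between the tool layer and the platform layer concrete, here is a rough sketch of a dispatcher talking to a backend through a shared interface. The `Backend` protocol, the `FixtureBackend` class, and the `dispatch` function are hypothetical stand-ins written for this example; the bundle's real Target protocol and method signatures may differ, and only three of the 12 actions are shown.

```python
from __future__ import annotations

from typing import Any, Protocol


class Backend(Protocol):
    """Assumed minimal interface that every platform backend satisfies."""
    def click(self, x: int, y: int) -> None: ...
    def type_text(self, text: str) -> None: ...
    def cursor_position(self) -> tuple[int, int]: ...


class FixtureBackend:
    """In-memory backend for CI: records calls instead of touching the OS."""

    def __init__(self) -> None:
        self.cursor: tuple[int, int] = (0, 0)
        self.log: list[tuple[str, tuple[Any, ...]]] = []

    def click(self, x: int, y: int) -> None:
        self.cursor = (x, y)
        self.log.append(("click", (x, y)))

    def type_text(self, text: str) -> None:
        self.log.append(("type_text", (text,)))

    def cursor_position(self) -> tuple[int, int]:
        return self.cursor


def dispatch(backend: Backend, action: str, **params: Any) -> dict:
    """Translate an action name into a backend call with a structured result."""
    handlers = {
        "click": backend.click,
        "type_text": backend.type_text,
        "cursor_position": backend.cursor_position,
    }
    handler = handlers.get(action)
    if handler is None:
        return {"status": "failure", "error": f"unknown action: {action}"}
    return {"status": "success", "result": handler(**params)}
```

Because the dispatcher depends only on the interface, the same agent logic could run against a fixture in CI and a native backend on a real desktop.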

Safety Rules

Anti-thrash guards, no destructive actions without approval, confidence thresholds, action budgets (default: 20 per task).

Human Gates

The bounded-task recipe requires human approval before execution begins and after results are delivered.

Recovery

Re-observe before retry. Semantic fallback, visual fallback, structured escalation. Max 2 retries per step.
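The budget and anti-thrash rules could be enforced by a small guard that is consulted before every action. The sketch below is an assumption-laden illustration: `ActionGuard`, `BudgetExhausted`, and the `repeat_limit` of 3 are invented for this example, and only the default budget of 20 actions comes from the deck.

```python
from __future__ import annotations

from collections import deque
from typing import Any


class BudgetExhausted(Exception):
    """Raised when a task has used all of its allowed actions."""


class ActionGuard:
    """Hypothetical guard combining an action budget with a thrash check."""

    def __init__(self, max_actions: int = 20, repeat_limit: int = 3) -> None:
        self.remaining = max_actions
        # Remember only the last `repeat_limit` actions.
        self.recent: deque = deque(maxlen=repeat_limit)

    def check(self, action: str, **params: Any) -> None:
        """Call before each action; raises instead of letting the agent loop."""
        if self.remaining <= 0:
            raise BudgetExhausted("action budget exhausted")
        key = (action, tuple(sorted(params.items())))
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and all(k == key for k in self.recent):
            raise RuntimeError(f"thrash detected: {action} repeated unchanged")
        self.recent.append(key)
        self.remaining -= 1
```

With these defaults, a fourth identical click in a row is rejected, which would force a fresh observation or an escalation instead of another blind retry.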

Bundle Contents

A full Amplifier bundle

Agents

  • cua-operator — atomic observe/act/verify loops with dual-world reasoning
  • cua-planner — task decomposition, failure analysis, recovery planning, strategy adaptation

Recipes

  • observe-and-act — basic operator loop, observe and perform one action
  • bounded-task — multi-step execution with action budgets and approval gates
  • recovery-workflow — re-observe, diagnose, and adapt after failures

Behaviors

  • cua-core — screenshot, input, window info, semantic inspection primitives
  • cua-operator — dual-world reasoning and safety context
  • cua-recipes — repeatable CUA workflow patterns

Context Documents

  • dual-world-reasoning — when to prefer visual vs semantic data
  • action-confidence — high/medium/low/none confidence levels
  • recovery-patterns — re-observe, semantic fallback, escalation
By the Numbers

Built fast, built right

12 Desktop Actions
4 Platform Backends
41 Commits
~1:1 Test-to-Source Ratio

1,200 lines of source

Tool dispatcher, data models, Target protocol, and four backends (macOS, Windows, Linux, Fixture). macOS backend alone: 421 lines of Quartz and ApplicationServices integration.

1,273 lines of tests

12 test files covering conformance, fixture, integration, macOS, models, mount, registry, smoke, stubs, the target protocol, and the tool. No mocks; verification runs against real backends.

In Action

Natural language, full desktop control

The bounded-task recipe turns a sentence into a safe, auditable desktop workflow:

# Run a bounded desktop task with approval gates
recipes execute cua:recipes/bounded-task.yaml \
  --context '{
    "task_description": "Open Safari and navigate to example.com",
    "max_actions": "10"
  }'

# Stage 1: Planner observes screen, creates step-by-step plan
#   → Human approves the plan
# Stage 2: Operator executes one action at a time:
#   observe → click Dock icon → verify
#   observe → click address bar → verify
#   observe → type_text "example.com" → verify
#   observe → key_press "return" → verify
#   → Human approves the result

Every action is bounded, every result is verified, and humans stay in the loop. The agent never thrashes: it stops when the budget is exhausted and reports what happened.
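When the recipe is launched from a script rather than typed by hand, the JSON context is easier to get right if it is generated instead of hand-quoted. The helper below is a hypothetical convenience, not part of the bundle; it reuses only the command, recipe path, and context keys shown above, and keeps `max_actions` as a string because that is how the example passes it.

```python
from __future__ import annotations

import json
import shlex


def bounded_task_command(task_description: str, max_actions: int = 10) -> list[str]:
    """Build the argument list for the bounded-task recipe (hypothetical helper)."""
    context = {
        "task_description": task_description,
        "max_actions": str(max_actions),  # the deck's example passes a string
    }
    return [
        "recipes", "execute", "cua:recipes/bounded-task.yaml",
        "--context", json.dumps(context),
    ]


argv = bounded_task_command("Open Safari and navigate to example.com")
print(shlex.join(argv))  # shell-quoted form, safe to paste into a terminal
```

Passing `argv` as a list to a process launcher avoids shell quoting problems when the task description contains quotes or spaces.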

The Bigger Picture

Beyond the
Terminal

CUA closes the loop between AI agents and the entire desktop — every application, every interface, every workflow that lives outside the command line.

Any Application

Browsers, design tools, email clients, system preferences, IDEs — if it runs on the desktop, the agent can see and operate it.

Cross-Platform

macOS backend ships with full Quartz + Accessibility API support. Windows and Linux backends are stubbed and ready for implementation. Fixture backend for CI.

Composable

CUA integrates with the full Amplifier ecosystem — combine with other bundles to build workflows that span code, infrastructure, and desktop applications.

The agent doesn't just write code anymore. It fills out forms, clicks through dialogs, navigates browsers, and verifies visual state. The desktop is no longer off-limits.

Sources & Methodology

How this deck was built

All data in this deck comes from the actual codebase and its git history.

Primary contributor: Michael J. Jabbour (michaeljabbour) — sole author of all 41 commits

Methodology: Repository cloned to /tmp/amplifier-bundle-cua. All statistics gathered via git log, git shortlog, wc -l, and direct source file inspection. No data was estimated or inferred.
