Amplifier agents can now see your screen, understand what's on it, and operate the desktop directly — clicks, keystrokes, scrolls, all of it.
AI agents live in the terminal. They can write code and run commands, but they cannot see what's on screen or interact with graphical applications.
A browser is open, a dialog is waiting, an app needs a click — but the agent has no way to perceive any of it. The visual world is invisible.
Even if an agent could see the screen, it needs precise, bounded control — not hallucinated coordinates or unbounded click loops.
The desktop is the last frontier for AI agents. Every GUI application — from browsers to design tools to system preferences — requires eyes and hands the agent doesn't have.
CUA gives agents two complementary views of the desktop — and the ability to act on both.
Raw screenshots of the current screen. The agent sees exactly what a human would see — layout, colors, overlapping windows, dialog boxes. Captured via native OS APIs.
The accessibility tree: a structured hierarchy of UI elements with roles, labels, values, and bounding boxes. The agent reads the logical structure beneath the pixels.
Use semantic data for identification, visual data for confirmation. When both worlds agree, confidence is high. When they disagree, re-observe.
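The rule above can be sketched as a small agreement check between the two views. This is a minimal illustration, not CUA's data model: `SemanticElement`, `cross_check`, and the mapping of "only one view found the target" to medium confidence are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SemanticElement:
    """Hypothetical accessibility-tree node: role, label, and bounding box."""
    role: str
    label: str
    x: int
    y: int
    width: int
    height: int

    def contains(self, point: Tuple[int, int]) -> bool:
        """True if the point lies inside this element's bounding box."""
        px, py = point
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)

    def center(self) -> Tuple[int, int]:
        """A natural click target once the element is confirmed."""
        return (self.x + self.width // 2, self.y + self.height // 2)


def cross_check(element: Optional[SemanticElement],
                visual_hit: Optional[Tuple[int, int]]) -> str:
    """Derive a confidence level from the two views of the same target.

    element:    match found in the accessibility tree, if any
    visual_hit: point where the target appears in the screenshot, if any
    """
    if element is None and visual_hit is None:
        return "none"      # neither view found the target
    if element is None or visual_hit is None:
        return "medium"    # assumption: a single view is usable but weaker
    if element.contains(visual_hit):
        return "high"      # both views agree on the location
    return "low"           # views disagree: re-observe before acting
```

A caller would act on `element.center()` only when the result is high, and re-observe when it is low.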
Every interaction follows a bounded, single-action loop — never blind, never unbounded.
Call cua observe to capture a full dual-world snapshot: screenshot, semantic tree, cursor position, window list, and focused application.
Cross-reference visual and semantic data. Identify the target element. Decide the single next atomic action — a click, a keystroke, a scroll. Assess confidence: high, medium, low, or none.
Execute exactly one action via the cua tool. 12 actions available: screenshot, click, double_click, type_text, key_press, scroll, move_cursor, cursor_position, screen_info, window_info, semantic_tree, and observe.
Re-observe to confirm the expected state change. If verified, proceed. If ambiguous, re-observe (max 2 retries). If failed, adapt or escalate. Every result returns a structured status: success, failure, blocked, or ambiguous.
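The four steps above can be sketched as one bounded function. The status strings and the limit of 2 retries come from the deck; `run_step`, `StepResult`, and the two callbacks are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable

MAX_RETRIES = 2  # the deck allows at most 2 re-observations per step


@dataclass
class StepResult:
    status: str        # "success", "failure", "blocked", or "ambiguous"
    observations: int  # number of verification observations taken


def run_step(act: Callable[[], None],
             verify: Callable[[], str]) -> StepResult:
    """Execute one action, then verify it with a bounded number of re-observations.

    act:    performs exactly one atomic action (hypothetical callback)
    verify: re-observes the desktop and returns one of the four statuses
    """
    act()  # exactly one action per step, never a batch
    observations = 0
    status = "ambiguous"
    # One initial verification plus up to MAX_RETRIES re-observations.
    while observations <= MAX_RETRIES:
        status = verify()
        observations += 1
        if status != "ambiguous":
            break  # a definite answer ends the loop early
    return StepResult(status=status, observations=observations)
```

If the result is still ambiguous after the retries are used up, the caller decides whether to adapt or escalate; the function itself never loops further.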
Decomposes multi-step tasks into atomic actions with budgets and recovery strategies
Executes the observe–act–verify loop with dual-world reasoning
12-action dispatcher — translates agent intent into OS calls
macOS (Quartz + AX), Windows and Linux (stubbed), and Fixture for CI
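One way to picture how the dispatcher and backends fit together is shown below. The action names are the twelve listed in the deck; the `Target.perform` method, `FixtureTarget`, and `dispatch` are assumptions for the sketch, and the real Target protocol in target.py may look quite different.

```python
from typing import Any, Dict, List, Protocol, Tuple

# Action names as listed in the deck.
ACTIONS = (
    "screenshot", "click", "double_click", "type_text", "key_press", "scroll",
    "move_cursor", "cursor_position", "screen_info", "window_info",
    "semantic_tree", "observe",
)


class Target(Protocol):
    """Hypothetical backend interface: one entry point per platform."""

    def perform(self, action: str, params: Dict[str, Any]) -> Dict[str, Any]:
        ...


class FixtureTarget:
    """In-memory stand-in for a desktop, in the spirit of a CI fixture backend."""

    def __init__(self) -> None:
        self.log: List[Tuple[str, Dict[str, Any]]] = []

    def perform(self, action: str, params: Dict[str, Any]) -> Dict[str, Any]:
        # Record the call instead of touching a real screen.
        self.log.append((action, dict(params)))
        return {"status": "success", "action": action}


def dispatch(target: Target, action: str, **params: Any) -> Dict[str, Any]:
    """Validate the action name, then hand the call to the backend."""
    if action not in ACTIONS:
        return {"status": "failure", "action": action, "error": "unknown action"}
    return target.perform(action, params)
```

Because the dispatcher depends only on the protocol, the same agent logic could run against a recorded fixture in CI and a native backend on a real machine.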
Anti-thrash guards, no destructive actions without approval, confidence thresholds, action budgets (default: 20 per task).
The bounded-task recipe requires human approval before execution begins and after results are delivered.
Re-observe before retry. Semantic fallback, visual fallback, structured escalation. Max 2 retries per step.
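The guards listed above could be combined into a single pre-flight check, roughly as follows. Only the default budget of 20 is taken from the deck; `ActionGuard`, its `max_repeats` limit of 3, and the idea of detecting thrash by counting identical consecutive actions are assumptions for illustration.

```python
from dataclasses import dataclass

DEFAULT_BUDGET = 20  # per-task default stated in the deck


@dataclass
class ActionGuard:
    """Hypothetical guard: action budget, repeat limit, and approval check."""
    budget: int = DEFAULT_BUDGET
    max_repeats: int = 3   # assumed anti-thrash limit, not from the source
    used: int = 0
    last_action: str = ""
    repeats: int = 0

    def check(self, action: str, destructive: bool = False,
              approved: bool = False) -> str:
        """Return "ok" and consume budget, or "blocked" without consuming it."""
        if self.used >= self.budget:
            return "blocked"   # budget exhausted: stop and report
        if destructive and not approved:
            return "blocked"   # destructive actions need human approval
        repeats = self.repeats + 1 if action == self.last_action else 1
        if repeats > self.max_repeats:
            return "blocked"   # same action again and again: likely thrashing
        self.last_action, self.repeats = action, repeats
        self.used += 1
        return "ok"
```

Calling `check` before every dispatch keeps the limits in one place, so a blocked result can be reported as a structured status rather than discovered after the fact.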
Tool dispatcher, data models, Target protocol, and four backends (macOS, Windows, Linux, Fixture). macOS backend alone: 421 lines of Quartz and ApplicationServices integration.
12 test files: conformance, fixture, integration, macOS, models, mount, registry, smoke, stubs, target protocol, and tool tests. No mocks — real backend verification.
The bounded-task recipe turns a sentence into a safe, auditable desktop workflow:
Every action is bounded, every result is verified, and humans stay in the loop. The agent never thrashes — it stops when budget is exhausted and reports what happened.
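A bounded task with its two approval gates might be shaped like the sketch below. The gates (before execution, after results) and the budget of 20 follow the deck; `run_bounded_task`, the `approve` and `execute` callbacks, and the outcome labels are hypothetical and do not describe the actual recipe format.

```python
from typing import Callable, Dict, List, Tuple


def run_bounded_task(steps: List[str],
                     approve: Callable[[str], bool],
                     execute: Callable[[str], str],
                     budget: int = 20) -> Dict[str, object]:
    """Run planned steps between two approval gates, within an action budget.

    approve: asks a human to sign off on a stage ("plan" or "results")
    execute: runs one atomic step and returns its status string
    """
    executed: List[Tuple[str, str]] = []
    if not approve("plan"):              # gate 1: before execution begins
        return {"outcome": "rejected", "executed": executed, "accepted": False}

    outcome = "completed"
    for step in steps:
        if len(executed) >= budget:
            outcome = "budget_exhausted"  # stop and report, never thrash
            break
        status = execute(step)
        executed.append((step, status))
        if status != "success":
            outcome = "stopped"           # hand the failure back to the human
            break

    accepted = approve("results")         # gate 2: after results are delivered
    return {"outcome": outcome, "executed": executed, "accepted": accepted}
```

The returned dictionary doubles as the audit trail: it records every step that ran, its status, and why the task ended.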
CUA closes the loop between AI agents and the entire desktop — every application, every interface, every workflow that lives outside the command line.
Browsers, design tools, email clients, system preferences, IDEs — if it runs on the desktop, the agent can see and operate it.
macOS backend ships with full Quartz + Accessibility API support. Windows and Linux backends are stubbed and ready for implementation. Fixture backend for CI.
CUA integrates with the full Amplifier ecosystem — combine with other bundles to build workflows that span code, infrastructure, and desktop applications.
The agent doesn't just write code anymore. It fills out forms, clicks through dialogs, navigates browsers, and verifies visual state. The desktop is no longer off-limits.
All data in this deck comes from the actual codebase and its git history.
michaeljabbour/amplifier-bundle-cua, cloned and inspected directly
modules/tool-cua/ (counted via wc -l)
tests/ (counted via wc -l)
_ACTIONS list in modules/tool-cua/amplifier_module_tool_cua/tool.py
modules/tool-cua/amplifier_module_tool_cua/backends/
overview.dot in team-knowledge data and the Target protocol in target.py
context/dual-world-reasoning.md
context/action-confidence.md and context/recovery-patterns.md
bundle.md (name: cua, version: 0.1.0)
team_knowledge(operation="search") returned capability, behavior, agent, and recipe entries
Primary contributor: Michael J. Jabbour (michaeljabbour), sole author of all 41 commits
Methodology: Repository cloned to /tmp/amplifier-bundle-cua. All statistics gathered via git log, git shortlog, wc -l, and direct source file inspection. No data was estimated or inferred.