Computer-Use Agent (CUA)

The Computer-Use Agent (CUA) gives the model a controlled surface for driving real GUI applications: synthesized mouse and keyboard input, screenshots, and accessibility-tree reads. It runs on four platforms using native OS APIs for each, behind a policy layer that gates access per application.

Every backend operates without stealing focus. Synthesized clicks, keystrokes, and screenshots execute without raising or activating any window, without moving the user-visible cursor (where the platform allows decoupling), and without changing the foreground application. Each backend ships a focus_invariance_test to lock this contract. MouseMove on platforms where the synthesized cursor cannot be decoupled from the visible cursor returns CuaError::UnsupportedPlatform rather than break the invariant.

The op enum is locked at 11 variants (CUA_OP_LOCKED_VARIANT_COUNT = 11 in wcore-cua/src/op.rs). Ops are serialized as a tagged JSON union, for example { "kind": "left_click", "x": 100, "y": 200 }.

| Op | Kind tag | Description |
| --- | --- | --- |
| LeftClick | left_click | Single click at screen coords. Optional button (default left) and mods (modifier keys). |
| RightClick | right_click | Right click at screen coords. Optional mods. |
| DoubleClick | double_click | Double click at screen coords. Optional button. |
| MouseMove | mouse_move | Move the synthesized cursor pointer. Returns UnsupportedPlatform on platforms where the visible cursor cannot be decoupled from the synthesized one. |
| Scroll | scroll | Scroll at screen coords. dy positive scrolls down; dx positive scrolls right. |
| Type | type | Type a string of literal text (IME path; does not hold modifier keys). Control characters except \n and \t are rejected. |
| Key | key | Press a key combination, e.g. "cmd+shift+t" or "ctrl+z". |
| Screenshot | screenshot | Capture a region. region: full (default, all displays) or { x, y, width, height }. format: png. redact: bool (default false). |
| AxTree | ax_tree | Walk the accessibility tree for the frontmost application. |
| Wait | wait | Wait duration_ms milliseconds. Tracked as an op (not a host sleep) so the cancel-token wraps it consistently. |
| FrontmostApp | frontmost_app | Return the frontmost-application identifier (bundle ID on macOS, window class on X11, AumId on Windows). |
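As a rough illustration of the locked op surface, the sketch below models the eleven ops as a Rust enum with a `kind()` accessor for the JSON tag. It is not the definition from wcore-cua/src/op.rs: the field names `text` and `combo`, the integer types, and the `KIND_TAGS` array are assumptions, and optional fields such as button, mods, and region are left out for brevity.

```rust
/// Locked variant count, mirroring CUA_OP_LOCKED_VARIANT_COUNT from the text.
pub const CUA_OP_LOCKED_VARIANT_COUNT: usize = 11;

/// Kind tags from the table (hypothetical constant). The array length is tied
/// to the count, so adding or removing a tag without updating the constant
/// fails to compile.
pub const KIND_TAGS: [&str; CUA_OP_LOCKED_VARIANT_COUNT] = [
    "left_click", "right_click", "double_click", "mouse_move", "scroll",
    "type", "key", "screenshot", "ax_tree", "wait", "frontmost_app",
];

/// Illustrative op enum. Field names and types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum CuaOp {
    LeftClick { x: i32, y: i32 },
    RightClick { x: i32, y: i32 },
    DoubleClick { x: i32, y: i32 },
    MouseMove { x: i32, y: i32 },
    Scroll { x: i32, y: i32, dx: i32, dy: i32 },
    Type { text: String },
    Key { combo: String },
    Screenshot { redact: bool },
    AxTree,
    Wait { duration_ms: u64 },
    FrontmostApp,
}

impl CuaOp {
    /// Returns the `kind` tag used in the serialized JSON union.
    pub fn kind(&self) -> &'static str {
        match self {
            CuaOp::LeftClick { .. } => "left_click",
            CuaOp::RightClick { .. } => "right_click",
            CuaOp::DoubleClick { .. } => "double_click",
            CuaOp::MouseMove { .. } => "mouse_move",
            CuaOp::Scroll { .. } => "scroll",
            CuaOp::Type { .. } => "type",
            CuaOp::Key { .. } => "key",
            CuaOp::Screenshot { .. } => "screenshot",
            CuaOp::AxTree => "ax_tree",
            CuaOp::Wait { .. } => "wait",
            CuaOp::FrontmostApp => "frontmost_app",
        }
    }
}
```

Because the `match` in `kind()` is exhaustive, adding a twelfth variant would force an update here as well, which is one way a variant count can be kept locked in practice.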

Drag-and-drop is deliberately omitted from the v1 surface. Between the press and the release of a drag, focus and cursor state are observable, which breaks the background invariant. It may be added in a future revision behind a separate capability flag.

Platform selection is split between compile time and runtime. On macOS and Windows the target OS fixes the backend at compile time. On Linux, Platform::current() probes WAYLAND_DISPLAY at runtime: if it is set, the platform is LinuxWayland; otherwise it is LinuxX11.
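The detection logic can be sketched roughly as follows. This is a minimal illustration, not the engine's code: the `MacOs` and `Windows` variant names, the `Option` return type, and the `linux_platform` helper are assumptions introduced so the Linux decision can be exercised without touching the process environment.

```rust
use std::env;

/// Illustrative platform enum. LinuxWayland and LinuxX11 come from the text;
/// the other two variant names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Platform {
    MacOs,
    Windows,
    LinuxX11,
    LinuxWayland,
}

/// Hypothetical pure helper: decides the Linux backend from whether
/// WAYLAND_DISPLAY is set.
pub fn linux_platform(wayland_display_set: bool) -> Platform {
    if wayland_display_set {
        Platform::LinuxWayland
    } else {
        Platform::LinuxX11
    }
}

impl Platform {
    /// macOS and Windows are decided by the compile target; Linux is probed
    /// at runtime. Returning None on other targets is an assumption.
    pub fn current() -> Option<Platform> {
        if cfg!(target_os = "macos") {
            Some(Platform::MacOs)
        } else if cfg!(target_os = "windows") {
            Some(Platform::Windows)
        } else if cfg!(target_os = "linux") {
            let set = env::var_os("WAYLAND_DISPLAY").is_some();
            Some(linux_platform(set))
        } else {
            None
        }
    }
}
```

The sketch treats any value of WAYLAND_DISPLAY, including an empty one, as "set"; whether the real probe does the same is not stated in the text.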

| Platform | Struct | Input mechanism | Screenshot |
| --- | --- | --- | --- |
| macOS | MacOsBackend | CGEvent::post(CGEventTapLocation::HID) - HID layer, no window activation | CGDisplayCreateImage |
| Linux X11 | LinuxX11Backend | x11rb XTest fake_input - indistinguishable from physical input, never calls set_input_focus | xproto::get_image |
| Linux Wayland | LinuxWaylandBackend | wlrctl (or ydotool) subprocess via shell_command_argv | grim subprocess |
| Windows | WindowsBackend | UI Automation + SendInput | GDI BitBlt |

wlrctl and grim must be on PATH for the Wayland backend. If either binary is missing, the backend returns a typed Backend error describing the missing dependency rather than a silent no-op.

At registration time the linux_wayland module runs compositor_allows_background_input(). If the probe returns false (GNOME mutter, Hyprland restricted), the adapter refuses to construct a CuaTool and emits a clear error. The probe can be overridden in tests via environment variables:

WCORE_CUA_TEST_WAYLAND_PERMISSIVE=1 # simulate permissive compositor
WCORE_CUA_TEST_WAYLAND_RESTRICTED=1 # simulate restricted compositor

CuaPolicy (wcore-cua/src/policy.rs) sits between the tool and every backend call. It runs before any native input is synthesized.

Three rule kinds are checked in order:

  1. Forbidden apps: hard Reject. The model cannot drive these applications at all. Intended for password managers and credential stores. Match is case-insensitive against the frontmost-app identifier.

  2. Approval-required apps: Suspend on every op. The orchestration layer emits an ApprovalRequired event and waits for the host to approve before the op proceeds.

  3. Forbidden key combos: hard Reject on both CuaOp::Key AND CuaOp::Type. The check applies to both op kinds so the model cannot bypass the gate by submitting a forbidden combo as literal text.

[cua.policy]
forbidden_apps = ["1Password", "Keychain Access"]
require_approval_for_app = ["Terminal"]
forbidden_key_combos = ["cmd+q+system", "ctrl+alt+del"]
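The three rule kinds can be sketched as a single evaluation function. This is a simplified illustration under stated assumptions, not the code in wcore-cua/src/policy.rs: the `Decision` and `Op` types are hypothetical, approval-required apps are assumed to match case-insensitively like forbidden apps, combo matching is reduced to a lowercase substring test, and a pending Suspend is held until the hard-reject checks have run so that a forbidden combo still rejects.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Allow,
    Suspend,
    Reject,
}

/// Hypothetical, reduced view of an op: only the text-carrying kinds matter here.
pub enum Op<'a> {
    Key(&'a str),
    Type(&'a str),
    Other,
}

/// Field names follow the [cua.policy] keys shown above.
pub struct CuaPolicy {
    pub forbidden_apps: Vec<String>,
    pub require_approval_for_app: Vec<String>,
    pub forbidden_key_combos: Vec<String>,
}

/// Case-insensitive membership test against the frontmost-app identifier.
fn listed(list: &[String], app: &str) -> bool {
    list.iter().any(|entry| entry.eq_ignore_ascii_case(app))
}

impl CuaPolicy {
    pub fn evaluate(&self, frontmost_app: &str, op: &Op) -> Decision {
        // 1. Forbidden apps: hard Reject.
        if listed(&self.forbidden_apps, frontmost_app) {
            return Decision::Reject;
        }
        // 2. Approval-required apps: remember the Suspend, keep checking.
        let needs_approval = listed(&self.require_approval_for_app, frontmost_app);
        // 3. Forbidden key combos: applies to both Key and Type payloads.
        let text = match op {
            Op::Key(s) | Op::Type(s) => Some(*s),
            Op::Other => None,
        };
        if let Some(text) = text {
            let lowered = text.to_lowercase();
            let hit = self
                .forbidden_key_combos
                .iter()
                .any(|combo| lowered.contains(&combo.to_lowercase()));
            if hit {
                return Decision::Reject;
            }
        }
        if needs_approval {
            Decision::Suspend
        } else {
            Decision::Allow
        }
    }
}
```

With the configuration above, an op against "1password" would be rejected, any op against "Terminal" would suspend, and typing "Ctrl+Alt+Del" into "Terminal" would be rejected outright rather than suspended.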

CuaOp::Type payloads are checked for C0 control characters. Everything except \n (0x0A) and \t (0x09) is rejected outright. This blocks ANSI escape sequences (\x1b[2J), null bytes (\0), and BEL characters (\x07) that an LLM might otherwise send to manipulate a terminal.
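A minimal sketch of the control-character check, assuming "C0" means code points below U+0020 and that the check reports the first offending character. The function name is hypothetical.

```rust
/// Returns the first C0 control character that would cause a Type payload
/// to be rejected, or None if the payload is acceptable.
/// Only \n (0x0A) and \t (0x09) are allowed through.
pub fn first_rejected_control_char(text: &str) -> Option<char> {
    text.chars()
        .find(|&c| (c as u32) < 0x20 && c != '\n' && c != '\t')
}
```

For example, a payload starting with the ANSI clear-screen sequence is caught at its leading ESC (0x1B), while ordinary multi-line text with tabs passes.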

The forbidden-combo matcher normalizes before comparison so different spellings of the same shortcut all match the same entry:

| Input form | Normalized |
| --- | --- |
| ⌘Q | cmd+q |
| Cmd+Q | cmd+q |
| command-Q | cmd+q |
| ^Q | ctrl+q |
| ⌥⇧T | alt+shift+t |

The matcher also checks for substring matches on token boundaries, so Type("press cmd+q to quit") still triggers when cmd+q is in the forbidden list.
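One way such a normalizer and token-boundary matcher could look is sketched below. It reproduces the table's examples but is not the engine's matcher: the alias list, the treatment of `-` as a separator (which means a combo involving the minus key itself is not handled), and the choice of whitespace-delimited words as token boundaries are all assumptions.

```rust
/// Pushes the pending token, mapping long modifier names to short ones.
fn flush(current: &mut String, tokens: &mut Vec<String>) {
    if current.is_empty() {
        return;
    }
    let token = std::mem::take(current);
    let canonical: &str = match token.as_str() {
        "command" => "cmd",
        "control" => "ctrl",
        "option" | "opt" => "alt",
        other => other,
    };
    tokens.push(canonical.to_string());
}

/// Normalizes a shortcut spelling to lowercase tokens joined by '+'.
pub fn normalize_combo(input: &str) -> String {
    let mut tokens: Vec<String> = Vec::new();
    let mut current = String::new();
    for c in input.chars() {
        // Modifier glyphs are complete tokens on their own.
        let symbol = match c {
            '⌘' => Some("cmd"),
            '⌥' => Some("alt"),
            '⇧' => Some("shift"),
            '⌃' | '^' => Some("ctrl"),
            _ => None,
        };
        if let Some(name) = symbol {
            flush(&mut current, &mut tokens);
            tokens.push(name.to_string());
        } else if c == '+' || c == '-' || c.is_whitespace() {
            flush(&mut current, &mut tokens);
        } else {
            current.extend(c.to_lowercase());
        }
    }
    flush(&mut current, &mut tokens);
    tokens.join("+")
}

/// True if any whitespace-delimited word of `text` normalizes to a forbidden combo.
pub fn mentions_forbidden_combo(text: &str, forbidden: &[&str]) -> bool {
    let forbidden: Vec<String> = forbidden.iter().map(|f| normalize_combo(f)).collect();
    text.split_whitespace()
        .map(|word| normalize_combo(word.trim_matches(|c: char| ".,;:!?\"'()".contains(c))))
        .any(|word| forbidden.iter().any(|f| *f == word))
}
```

Matching whole words rather than raw substrings keeps a string such as "acmd+q" from triggering the rule, at the cost of missing spaced-out spellings like "Cmd + Q"; a production matcher would likely need to handle both.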

By default, the first op against any application the engine has not driven before routes to Suspend for HITL approval. Once the host approves, mark_app_seen(app_id) is called and the app is recorded in <data_dir>/wayland/cua/seen-apps.json. Subsequent ops on the same app skip the prompt.

The app-id check is a composite key of <plugin_id>::<lowercased-app-id>, so two different plugins sharing the same app name carry independent approval records.

[cua.policy]
# Disable first-time-per-app approval (not recommended for production)
first_time_per_app_approval = false
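The composite key and the first-time check can be illustrated with an in-memory set. This sketch assumes `mark_app_seen` also receives the plugin ID (the text shows only `app_id`) and omits persistence to seen-apps.json; the `SeenApps` type and `seen_key` helper are hypothetical.

```rust
use std::collections::HashSet;

/// Builds the composite key <plugin_id>::<lowercased-app-id>.
pub fn seen_key(plugin_id: &str, app_id: &str) -> String {
    format!("{}::{}", plugin_id, app_id.to_lowercase())
}

/// In-memory stand-in for the persisted seen-apps record.
#[derive(Default)]
pub struct SeenApps {
    keys: HashSet<String>,
}

impl SeenApps {
    /// True when this plugin has not yet been approved for this app.
    pub fn needs_first_time_approval(&self, plugin_id: &str, app_id: &str) -> bool {
        !self.keys.contains(&seen_key(plugin_id, app_id))
    }

    /// Records the approval so later ops on the same app skip the prompt.
    pub fn mark_app_seen(&mut self, plugin_id: &str, app_id: &str) {
        self.keys.insert(seen_key(plugin_id, app_id));
    }
}
```

Because the plugin ID is part of the key, approving "Terminal" for one plugin leaves a second plugin still subject to the first-time prompt.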

When the frontmost-app identifier cannot be resolved (empty string) and any app-scoped rule is configured, the op routes to Suspend rather than Allow. This matters because on Windows and Linux Wayland the frontmost-app probe has no production implementation, and on macOS the osascript probe can fail if TCC grants are missing or the login window is active. Treating an unresolved ID as “no restrictions apply” would silently disable all app-scoped gates.

App-independent checks (forbidden key combos, control-char denylist) still run and can hard-Reject before the Suspend fires.
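The fail-closed behavior for an unresolved app can be sketched as a small gate that runs before the app-scoped rules. The function name, its boolean inputs, and the use of `None` to mean "continue to the normal app-scoped rules" are assumptions made for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Suspend,
    Reject,
}

/// Hypothetical pre-check. `app_independent_violation` stands for the result
/// of the forbidden-combo and control-character checks.
pub fn pre_app_gate(
    frontmost_app: &str,
    app_scoped_rules_configured: bool,
    app_independent_violation: bool,
) -> Option<Decision> {
    // Hard rejects win, even when the app could not be identified.
    if app_independent_violation {
        return Some(Decision::Reject);
    }
    // Unresolved app plus any app-scoped rule: fail closed to Suspend.
    if frontmost_app.is_empty() && app_scoped_rules_configured {
        return Some(Decision::Suspend);
    }
    // Otherwise continue with the normal app-scoped evaluation.
    None
}
```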

When redact: true is set on a Screenshot op, the engine runs a two-pass redaction pipeline before returning the bytes:

  1. Heuristic (always runs) - detects “asterisk runs” typical of password fields by scanning for uniform-foreground-glyph rows. Blurs the bounding box of each detected run.

  2. OCR-backed (platform-specific) - extracts text regions from the PNG and blurs any that match sensitive-content patterns (email addresses, SSNs, credit card numbers, API key prefixes such as sk-, ghp_, xoxb-, etc.).
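As a small illustration of the pattern-matching half of the OCR pass, the sketch below flags a text region that contains a word starting with one of the API key prefixes named above. The function and constant names are hypothetical, only the three listed prefixes are covered, and the email, SSN, and credit card patterns are omitted.

```rust
/// Prefixes named in the text; the real list is longer ("etc.").
const API_KEY_PREFIXES: [&str; 3] = ["sk-", "ghp_", "xoxb-"];

/// True if any whitespace-delimited word in the OCR'd region starts with a
/// known prefix and has at least one character after it.
pub fn looks_like_api_key(region_text: &str) -> bool {
    region_text.split_whitespace().any(|word| {
        API_KEY_PREFIXES
            .iter()
            .any(|prefix| word.starts_with(prefix) && word.len() > prefix.len())
    })
}
```

A region that matches would then have its bounding box blurred; the blur itself depends on the image pipeline and is not shown.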

OCR backends by platform:

| Platform | Backend | Notes |
| --- | --- | --- |
| macOS | VNRecognizeTextRequest (Apple Vision) | Ships with the OS; no feature flag needed. |
| Windows | Windows.Media.Ocr | Ships with the OS; no feature flag needed. |
| Linux | leptess (Tesseract bindings) | Opt-in via redact-ocr Cargo feature; adds a large native-lib dependency. |
| Other | None | Heuristic pass only. |

Redaction is best-effort: if the OCR backend errors, the engine falls back to heuristic-only and logs a warning. A decode or encode failure in the pipeline passes the bytes through unchanged with a tracing::warn!. Redaction never blocks the tool result.
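The fallback behavior can be sketched with the two passes modeled as closures. This is an assumption-laden illustration: the function name and signatures are hypothetical, errors are plain strings, and `eprintln!` stands in for `tracing::warn!` so the example needs no external crates.

```rust
/// Runs the heuristic pass, then the OCR pass, never failing the caller.
/// Each pass takes image bytes and returns redacted bytes or an error.
pub fn redact_best_effort<H, O>(png: &[u8], heuristic: H, ocr: O) -> Vec<u8>
where
    H: Fn(&[u8]) -> Result<Vec<u8>, String>,
    O: Fn(&[u8]) -> Result<Vec<u8>, String>,
{
    // A decode or encode failure passes the original bytes through unchanged.
    let after_heuristic = match heuristic(png) {
        Ok(bytes) => bytes,
        Err(err) => {
            eprintln!("warn: redaction pipeline failed, passing through: {err}");
            return png.to_vec();
        }
    };
    // An OCR error falls back to the heuristic-only result.
    match ocr(&after_heuristic) {
        Ok(bytes) => bytes,
        Err(err) => {
            eprintln!("warn: OCR redaction failed, using heuristic only: {err}");
            after_heuristic
        }
    }
}
```

The function always returns bytes, which reflects the rule that redaction never blocks the tool result; the trade-off is that a failed pipeline can return an unredacted screenshot, so the warning is the only signal.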

Redaction is off by default; enable it per-tool via CuaToolSpec::redact_screenshots.

Each CuaSession carries a stable session_id and an optional sub_agent name. Sub-agents get independent sessions so in-flight modifier state (a held Shift key, Caps Lock) does not bleed across concurrent agents.

When the capabilities.computer_use flag is set on a JSON-stream session, the engine emits two typed events:

  • CuaEvent - per-completed-op trail carrying op kind, screen coordinates where applicable, and a human-readable summary.
  • CuaPolicyDenied - emitted on a policy violation, carrying op kind, frontmost app identifier, and the denial reason.

| Variable | Effect |
| --- | --- |
| WCORE_CUA_TEST_WAYLAND_PERMISSIVE=1 | Override compositor probe to return permissive (test/CI use). |
| WCORE_CUA_TEST_WAYLAND_RESTRICTED=1 | Override compositor probe to return restricted (test/CI use). |