Computer-Use Agent (CUA)
The Computer-Use Agent (CUA) gives the model a controlled surface for driving real GUI applications: synthesized mouse and keyboard input, screenshots, and accessibility-tree reads. It runs on four platforms using native OS APIs for each, behind a policy layer that gates access per application.
Background-mode invariant
Every backend operates without stealing focus. Synthesized clicks, keystrokes, and screenshots execute without raising or activating any window, without moving the user-visible cursor (where the platform allows decoupling), and without changing the foreground application. Each backend ships a focus_invariance_test to lock this contract. On platforms where the synthesized cursor cannot be decoupled from the visible cursor, MouseMove returns CuaError::UnsupportedPlatform rather than breaking the invariant.
Operation surface
The op enum is locked at 11 variants (CUA_OP_LOCKED_VARIANT_COUNT = 11 in wcore-cua/src/op.rs). Ops are serialized as a tagged JSON union - { "kind": "left_click", "x": 100, "y": 200 }.
| Op | Kind tag | Description |
|---|---|---|
| LeftClick | left_click | Single click at screen coords. Optional button (default left) and mods (modifier keys). |
| RightClick | right_click | Right click at screen coords. Optional mods. |
| DoubleClick | double_click | Double click at screen coords. Optional button. |
| MouseMove | mouse_move | Move the synthesized cursor pointer. Returns UnsupportedPlatform on platforms where the visible cursor cannot be decoupled from the synthesized one. |
| Scroll | scroll | Scroll at screen coords. dy positive scrolls down; dx positive scrolls right. |
| Type | type | Type a string of literal text (IME path; does not hold modifier keys). Control characters except \n and \t are rejected. |
| Key | key | Press a key combination, e.g. "cmd+shift+t" or "ctrl+z". |
| Screenshot | screenshot | Capture a region. region: full (default, all displays) or { x, y, width, height }. format: png. redact: bool (default false). |
| AxTree | ax_tree | Walk the accessibility tree for the frontmost application. |
| Wait | wait | Wait duration_ms milliseconds. Tracked as an op (not a host sleep) so the cancel-token wraps it consistently. |
| FrontmostApp | frontmost_app | Return the frontmost-application identifier (bundle ID on macOS, window class on X11, AumId on Windows). |
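
The table above can be mirrored as a Rust enum with one kind tag per variant. This is a minimal sketch, not the definition in wcore-cua/src/op.rs: the field names and integer types are assumptions, optional fields such as button and mods are omitted, and ALL_KINDS is a hypothetical helper. Sizing the array with CUA_OP_LOCKED_VARIANT_COUNT means adding or removing a tag without updating the constant fails to compile, which is one way a variant count can be kept locked.

```rust
/// Illustrative mirror of the op surface. Field sets are trimmed and the
/// types are assumptions; the real enum may differ.
#[derive(Debug, Clone, PartialEq)]
pub enum CuaOp {
    LeftClick { x: i32, y: i32 },
    RightClick { x: i32, y: i32 },
    DoubleClick { x: i32, y: i32 },
    MouseMove { x: i32, y: i32 },
    Scroll { x: i32, y: i32, dx: i32, dy: i32 },
    Type { text: String },
    Key { combo: String },
    Screenshot { redact: bool },
    AxTree,
    Wait { duration_ms: u64 },
    FrontmostApp,
}

pub const CUA_OP_LOCKED_VARIANT_COUNT: usize = 11;

/// Hypothetical helper: every kind tag, in table order. The array length is
/// tied to the locked count, so a mismatch is a compile error.
pub const ALL_KINDS: [&str; CUA_OP_LOCKED_VARIANT_COUNT] = [
    "left_click", "right_click", "double_click", "mouse_move", "scroll",
    "type", "key", "screenshot", "ax_tree", "wait", "frontmost_app",
];

impl CuaOp {
    /// The value of the "kind" field in the tagged JSON union.
    pub fn kind(&self) -> &'static str {
        match self {
            CuaOp::LeftClick { .. } => "left_click",
            CuaOp::RightClick { .. } => "right_click",
            CuaOp::DoubleClick { .. } => "double_click",
            CuaOp::MouseMove { .. } => "mouse_move",
            CuaOp::Scroll { .. } => "scroll",
            CuaOp::Type { .. } => "type",
            CuaOp::Key { .. } => "key",
            CuaOp::Screenshot { .. } => "screenshot",
            CuaOp::AxTree => "ax_tree",
            CuaOp::Wait { .. } => "wait",
            CuaOp::FrontmostApp => "frontmost_app",
        }
    }
}
```

Because the match in kind() is exhaustive, a new variant also forces a decision about its tag at compile time.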
No drag-and-drop in v1
Drag-and-drop is deliberately omitted from the v1 surface. A drag creates a window where focus and cursor state are observable between the press and the release, which breaks the background invariant. It may be added in a future revision behind a separate capability flag.
Platform backends
Backend selection combines compile-time and runtime detection. On macOS and Windows the target OS determines the backend at compile time. On Linux, Platform::current() probes WAYLAND_DISPLAY at runtime: if it is set, the backend is LinuxWayland; otherwise it is LinuxX11.
| Platform | Struct | Input mechanism | Screenshot |
|---|---|---|---|
| macOS | MacOsBackend | CGEvent::post(CGEventTapLocation::HID) - HID layer, no window activation | CGDisplayCreateImage |
| Linux X11 | LinuxX11Backend | x11rb XTest fake_input - indistinguishable from physical input, never calls set_input_focus | xproto::get_image |
| Linux Wayland | LinuxWaylandBackend | wlrctl (or ydotool) subprocess via shell_command_argv | grim subprocess |
| Windows | WindowsBackend | UI Automation + SendInput | GDI BitBlt |
wlrctl and grim must be on PATH for the Wayland backend. If either binary is missing, the backend returns a typed Backend error describing the missing dependency rather than a silent no-op.
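
The selection rule for this section (compile-time choice on macOS and Windows, a WAYLAND_DISPLAY probe on Linux) could look roughly like the sketch below. The Platform variant names other than LinuxWayland and LinuxX11, the Option return type, and the for_linux helper are assumptions made for illustration; "set" is interpreted here as "present in the environment", including an empty value.

```rust
use std::env;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Platform {
    MacOs,
    LinuxX11,
    LinuxWayland,
    Windows,
}

impl Platform {
    /// Hypothetical pure helper so the Linux rule can be tested without
    /// mutating the process environment.
    pub fn for_linux(wayland_display_set: bool) -> Platform {
        if wayland_display_set {
            Platform::LinuxWayland
        } else {
            Platform::LinuxX11
        }
    }

    /// cfg! is resolved at compile time, so only the Linux branch does any
    /// work at runtime. Returns None on targets without a backend
    /// (an assumption; the real API may return an error instead).
    pub fn current() -> Option<Platform> {
        if cfg!(target_os = "macos") {
            Some(Platform::MacOs)
        } else if cfg!(target_os = "windows") {
            Some(Platform::Windows)
        } else if cfg!(target_os = "linux") {
            let set = env::var_os("WAYLAND_DISPLAY").is_some();
            Some(Platform::for_linux(set))
        } else {
            None
        }
    }
}
```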
Wayland compositor probe
At registration time the linux_wayland module runs compositor_allows_background_input(). If the probe returns false (GNOME mutter, Hyprland restricted), the adapter refuses to construct a CuaTool and emits a clear error. The probe can be overridden in tests via environment variables:

    WCORE_CUA_TEST_WAYLAND_PERMISSIVE=1   # simulate permissive compositor
    WCORE_CUA_TEST_WAYLAND_RESTRICTED=1   # simulate restricted compositor

Application policy (CuaPolicy)
CuaPolicy (wcore-cua/src/policy.rs) sits between the tool and every backend call. It runs before any native input is synthesized.
Rule kinds
Three rule kinds are checked in order:
- Forbidden apps: hard Reject. The model cannot drive these applications at all. Intended for password managers and credential stores. Match is case-insensitive against the frontmost-app identifier.
- Approval-required apps: Suspend on every op. The orchestration layer emits an ApprovalRequired event and waits for the host to approve before the op proceeds.
- Forbidden key combos: hard Reject on both CuaOp::Key and CuaOp::Type. The check applies to both op kinds so the model cannot bypass the gate by submitting a forbidden combo as literal text.

    [cua.policy]
    forbidden_apps = ["1Password", "Keychain Access"]
    require_approval_for_app = ["Terminal"]
    forbidden_key_combos = ["cmd+q+system", "ctrl+alt+del"]

Control-character denylist on Type
CuaOp::Type payloads are checked for C0 control characters. Everything except \n (0x0A) and \t (0x09) is rejected outright. This blocks ANSI escape sequences (\x1b[2J), null bytes (\0), and BEL characters (\x07) that an LLM might otherwise send to manipulate a terminal.
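
A check of this shape fits in a few lines. The sketch below is an illustration rather than the code in policy.rs: first_forbidden_control is a hypothetical name, and "C0" is taken to mean U+0000 through U+001F, so DEL (0x7F) and the C1 range are out of scope here.

```rust
/// Returns the byte offset and value of the first C0 control character that
/// is neither '\n' nor '\t', or None if the payload is clean.
pub fn first_forbidden_control(text: &str) -> Option<(usize, char)> {
    text.char_indices().find(|&(_, c)| {
        // C0 range is U+0000..=U+001F; newline and tab are the only exceptions.
        ('\u{00}'..='\u{1f}').contains(&c) && c != '\n' && c != '\t'
    })
}
```

Note that a carriage return (0x0D) falls inside the rejected range under this rule, so a payload with Windows-style line endings would be refused.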
Key-combo normalization
The forbidden-combo matcher normalizes before comparison so different spellings of the same shortcut all match the same entry:
| Input form | Normalized |
|---|---|
| ⌘Q | cmd+q |
| Cmd+Q | cmd+q |
| command-Q | cmd+q |
| ^Q | ctrl+q |
| ⌥⇧T | alt+shift+t |
The matcher also checks for substring matches on token boundaries, so Type("press cmd+q to quit") still triggers when cmd+q is in the forbidden list.
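
One way to get the normalization in the table is to treat modifier glyphs as self-delimiting tokens, split on + and -, lowercase everything, and map long names to short ones. The sketch below follows that approach; it is not the shipped matcher. The function names, the alias list beyond what the table shows (control, option, opt, ⌃), and the whitespace-based token boundary in mentions_forbidden are all assumptions.

```rust
/// Maps a modifier glyph to its canonical token.
fn symbol_modifier(c: char) -> Option<&'static str> {
    match c {
        '⌘' => Some("cmd"),
        '^' | '⌃' => Some("ctrl"),
        '⌥' => Some("alt"),
        '⇧' => Some("shift"),
        _ => None,
    }
}

/// Maps spelled-out modifier names to their short form.
fn canonical_token(token: &str) -> String {
    match token {
        "command" | "cmd" => "cmd".to_string(),
        "control" | "ctrl" => "ctrl".to_string(),
        "option" | "opt" | "alt" => "alt".to_string(),
        other => other.to_string(),
    }
}

/// Pushes the pending token (if any) and resets the buffer.
fn flush(current: &mut String, tokens: &mut Vec<String>) {
    if !current.is_empty() {
        tokens.push(canonical_token(current.as_str()));
        current.clear();
    }
}

/// Normalizes one combo spelling to lowercase tokens joined by '+'.
/// Limitation of this sketch: '+' and '-' are always separators, so combos
/// that use those keys themselves cannot be expressed.
pub fn normalize_combo(input: &str) -> String {
    let mut tokens: Vec<String> = Vec::new();
    let mut current = String::new();
    for c in input.trim().chars() {
        if let Some(name) = symbol_modifier(c) {
            flush(&mut current, &mut tokens);
            tokens.push(name.to_string());
        } else if c == '+' || c == '-' {
            flush(&mut current, &mut tokens);
        } else {
            current.extend(c.to_lowercase());
        }
    }
    flush(&mut current, &mut tokens);
    tokens.join("+")
}

/// Token-boundary check for literal text: each whitespace-delimited word is
/// normalized and compared against the normalized forbidden list.
pub fn mentions_forbidden(text: &str, forbidden: &[&str]) -> bool {
    let forbidden: Vec<String> = forbidden.iter().map(|f| normalize_combo(f)).collect();
    text.split_whitespace()
        .map(normalize_combo)
        .any(|word| forbidden.contains(&word))
}
```

A word with trailing punctuation, such as "cmd+q.", would slip past this simplified boundary rule; a production matcher would need to strip or tokenize punctuation as well.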
First-time-per-app approval (default on)
By default, the first op against any application the engine has not driven before routes to Suspend for human-in-the-loop (HITL) approval. Once the host approves, mark_app_seen(app_id) is called and the app is recorded in <data_dir>/wayland/cua/seen-apps.json. Subsequent ops on the same app skip the prompt.
The app-id check is a composite key of <plugin_id>::<lowercased-app-id>, so two different plugins sharing the same app name carry independent approval records.
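
The composite key can be illustrated with an in-memory set. This is a sketch under assumptions: SeenApps, seen_key, and needs_first_time_approval are hypothetical names, mark_app_seen takes an explicit plugin_id here for clarity (the source shows only an app_id argument), and persistence to seen-apps.json is left out.

```rust
use std::collections::HashSet;

/// Builds the <plugin_id>::<lowercased-app-id> key. Only the app id is
/// lowercased; the plugin id is used as given.
pub fn seen_key(plugin_id: &str, app_id: &str) -> String {
    format!("{}::{}", plugin_id, app_id.to_lowercase())
}

/// In-memory stand-in for the seen-apps record.
#[derive(Debug, Default)]
pub struct SeenApps {
    keys: HashSet<String>,
}

impl SeenApps {
    /// True when this plugin has not yet been approved for this app.
    pub fn needs_first_time_approval(&self, plugin_id: &str, app_id: &str) -> bool {
        !self.keys.contains(&seen_key(plugin_id, app_id))
    }

    /// Records an approval so later ops on the same app skip the prompt.
    pub fn mark_app_seen(&mut self, plugin_id: &str, app_id: &str) {
        self.keys.insert(seen_key(plugin_id, app_id));
    }
}
```

With this keying, approving Terminal for one plugin leaves a second plugin's first op on Terminal still subject to approval.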

    [cua.policy]
    # Disable first-time-per-app approval (not recommended for production)
    first_time_per_app_approval = false

Fail-closed on unknown frontmost app
When the frontmost-app identifier cannot be resolved (empty string) and any app-scoped rule is configured, the op routes to Suspend rather than Allow. This matters because on Windows and Linux Wayland the frontmost-app probe has no production implementation, and on macOS the osascript probe can fail if TCC grants are missing or the login window is active. Treating an unresolved ID as “no restrictions apply” would silently disable all app-scoped gates.
App-independent checks (forbidden key combos, control-char denylist) still run and can hard-Reject before the Suspend fires.
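
The fail-closed branch can be sketched as a small decision function. Everything here is illustrative: Decision, Policy, and decide are hypothetical shapes, the app-independent checks are collapsed into a single boolean, app matching is assumed to be case-insensitive equality, and first-time-per-app approval is left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Allow,
    Suspend,
    Reject,
}

#[derive(Debug, Default)]
pub struct Policy {
    pub forbidden_apps: Vec<String>,
    pub require_approval_for_app: Vec<String>,
}

impl Policy {
    fn has_app_scoped_rules(&self) -> bool {
        !self.forbidden_apps.is_empty() || !self.require_approval_for_app.is_empty()
    }

    /// `app_independent_reject` stands in for the outcome of the forbidden
    /// key-combo and control-character checks.
    pub fn decide(&self, frontmost_app: &str, app_independent_reject: bool) -> Decision {
        // App-independent checks can hard-reject before any Suspend fires.
        if app_independent_reject {
            return Decision::Reject;
        }
        // Unresolved app id: fail closed if any app-scoped rule exists.
        if frontmost_app.is_empty() {
            return if self.has_app_scoped_rules() {
                Decision::Suspend
            } else {
                Decision::Allow
            };
        }
        let app = frontmost_app.to_lowercase();
        if self.forbidden_apps.iter().any(|a| a.to_lowercase() == app) {
            return Decision::Reject;
        }
        if self.require_approval_for_app.iter().any(|a| a.to_lowercase() == app) {
            return Decision::Suspend;
        }
        Decision::Allow
    }
}
```

The key property is that an empty identifier never reaches the Allow arm while app-scoped rules are configured.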
Screenshot redaction
When redact: true is set on a Screenshot op, the engine runs a two-pass redaction pipeline before returning the bytes:
- Heuristic (always runs): detects “asterisk runs” typical of password fields by scanning for uniform-foreground-glyph rows. Blurs the bounding box of each detected run.
- OCR-backed (platform-specific): extracts text regions from the PNG and blurs any that match sensitive-content patterns (email addresses, SSNs, credit card numbers, API key prefixes such as sk-, ghp_, xoxb-, etc.).
OCR backends by platform:
| Platform | Backend | Notes |
|---|---|---|
| macOS | VNRecognizeTextRequest (Apple Vision) | Ships with the OS; no feature flag needed. |
| Windows | Windows.Media.Ocr | Ships with the OS; no feature flag needed. |
| Linux | leptess (Tesseract bindings) | Opt-in via redact-ocr Cargo feature; adds a large native-lib dependency. |
| Other | None | Heuristic pass only. |
Redaction is best-effort: if the OCR backend errors, the engine falls back to heuristic-only and logs a warning. A decode or encode failure in the pipeline passes the bytes through unchanged with a tracing::warn!. Redaction never blocks the tool result.
Redaction is off by default; enable it per-tool via CuaToolSpec::redact_screenshots.
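
The best-effort behavior described in this section amounts to a fallback chain. The sketch below models each pass as a function over PNG bytes; RedactPass, RedactionOutcome, and redact_best_effort are hypothetical names, the passes themselves are supplied by the caller, and eprintln! stands in for tracing::warn! so the example needs only the standard library.

```rust
/// Hypothetical signature for one redaction pass over encoded image bytes.
pub type RedactPass = fn(&[u8]) -> Result<Vec<u8>, String>;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RedactionOutcome {
    /// Both the heuristic and OCR passes ran.
    Full,
    /// Only the heuristic pass was applied.
    HeuristicOnly,
    /// The pipeline failed and the original bytes were returned.
    PassedThrough,
}

/// Never returns an error: redaction must not block the tool result.
pub fn redact_best_effort(
    png: Vec<u8>,
    heuristic: RedactPass,
    ocr: Option<RedactPass>,
) -> (Vec<u8>, RedactionOutcome) {
    // Pass 1: heuristic. A decode or encode failure passes the input through.
    let after_heuristic = match heuristic(&png) {
        Ok(bytes) => bytes,
        Err(err) => {
            eprintln!("warn: redaction failed, passing bytes through: {}", err);
            return (png, RedactionOutcome::PassedThrough);
        }
    };
    // Pass 2: OCR, only where a backend exists for the platform.
    match ocr {
        None => (after_heuristic, RedactionOutcome::HeuristicOnly),
        Some(ocr_pass) => match ocr_pass(&after_heuristic) {
            Ok(bytes) => (bytes, RedactionOutcome::Full),
            Err(err) => {
                eprintln!("warn: OCR redaction failed, heuristic only: {}", err);
                (after_heuristic, RedactionOutcome::HeuristicOnly)
            }
        },
    }
}
```

Returning the outcome alongside the bytes is a design choice of this sketch; it lets a caller log or surface how much redaction was actually applied.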
Per-session isolation
Each CuaSession carries a stable session_id and an optional sub_agent name. Sub-agents get independent sessions so in-flight modifier state (a held Shift key, Caps Lock) does not bleed across concurrent agents.
Protocol events
When the capabilities.computer_use flag is set on a JSON-stream session, the engine emits two typed events:
- CuaEvent: per-completed-op trail carrying op kind, screen coordinates where applicable, and a human-readable summary.
- CuaPolicyDenied: emitted on a policy violation, carrying op kind, frontmost app identifier, and the denial reason.
Environment variables
| Variable | Effect |
|---|---|
| WCORE_CUA_TEST_WAYLAND_PERMISSIVE=1 | Override compositor probe to return permissive (test/CI use). |
| WCORE_CUA_TEST_WAYLAND_RESTRICTED=1 | Override compositor probe to return restricted (test/CI use). |