Swarm Vision Eyeing — Investigation

The swarm's /look verb (screenshot → Claude vision) is the minimum viable eye. Three upgrades exist: fix the GDI+ failure modes, add OmniParser-style element extraction, and split the single agent into four parallel specialists (layout / errors / content / nav). A physical camera pointed at the screen is worse in every relevant dimension; it is only useful for external or physical capture that the PowerShell path structurally cannot reach.

🌱 seedling tended 2026-05-22 S629 tools swarm vision screen-capture multi-agent look eye

flowchart LR
  screen[screen] --> shot["screenshot.sh<br/>GDI+/PowerShell"]
  shot --> png["workspace/screen.png"]
  png --> look["/look<br/>single agent"]
  png --> multi["/multilook<br/>4 parallel agents"]
  multi --> layout[layout agent]
  multi --> errors[errors agent]
  multi --> content[content agent]
  multi --> nav[nav agent]
  layout & errors & content & nav --> merge["orchestrator merge<br/>H→M→L punch-list"]

L0 — The answer in three sentences

Screenshot beats camera for every use case the swarm actually has (browser-rendered MkDocs, VS Code, swarm-watch terminal). The current screenshot.sh has four real failure modes, but a camera doesn't fix them — the Windows.Graphics.Capture WinRT API does. The highest-yield upgrade is not better capture but splitting one generalist /look agent into four parallel specialists.

Camera vs screenshot — the verdict

The current screenshot.sh uses PowerShell's System.Windows.Forms CopyFromScreen (GDI+). It captures the primary Windows screen from WSL and saves to workspace/screen.png. The LLM then reads a lossless PNG of exactly what is on screen.

A physical camera pointed at the monitor introduces: moiré (pixel grid × sensor grid interference), glare from ambient light, barrel/perspective distortion unless optically aligned, JPEG compression artifacts, and typically lower effective resolution than the monitor itself (a 1080p webcam has about 2.1 MP against a 4K screen's 8.3 MP, so it captures at most a quarter of the pixels). For reading rendered HTML (fonts, layout, link states, error badges) these artifacts cause meaningful LLM accuracy loss.
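
The resolution comparison above is simple arithmetic, sketched here for reference. The two resolutions are the standard 4K UHD and 1080p frame sizes; the function name is illustrative only.

```python
# Pixel-count comparison between a 4K monitor and a 1080p webcam frame.
MONITOR_4K = (3840, 2160)    # standard 4K UHD frame size
WEBCAM_1080P = (1920, 1080)  # standard 1080p frame size


def pixel_count(resolution):
    """Return the total number of pixels for a (width, height) pair."""
    width, height = resolution
    return width * height


ratio = pixel_count(MONITOR_4K) / pixel_count(WEBCAM_1080P)
print(f"4K monitor:   {pixel_count(MONITOR_4K):,} px")
print(f"1080p webcam: {pixel_count(WEBCAM_1080P):,} px")
print(f"monitor has {ratio:.0f}x the pixels of the webcam frame")
```

This is a best case for the camera: it assumes the monitor fills the webcam frame exactly, which a real setup rarely achieves.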

Camera is only better when the PowerShell path structurally cannot reach the target:

| Capture target | Best method |
|---|---|
| Browser window, VS Code, terminal | `screenshot.sh` (GDI+) |
| Secondary physical monitor | `screenshot.sh` with `AllScreens` loop |
| Phone/tablet screen | `scrcpy` → virtual display → screenshot |
| Physical paper/hardware | Camera (no alternative) |
| Kiosk OS or DRM-locked content | Camera (capture APIs blocked) |
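
For the secondary-monitor row, an `AllScreens` loop ultimately needs one rectangle that covers every monitor. The sketch below shows that bounding-box step only. It assumes each screen is given as an `(x, y, width, height)` tuple in virtual-desktop coordinates, where a monitor placed left of or above the primary has negative offsets. The function name and sample values are hypothetical and are not taken from `screenshot.sh`.

```python
def virtual_bounds(screens):
    """Return (left, top, width, height) covering every screen rectangle.

    Each screen is an (x, y, width, height) tuple. Offsets may be negative
    when a monitor sits to the left of or above the primary one.
    """
    if not screens:
        raise ValueError("at least one screen rectangle is required")
    left = min(x for x, _, _, _ in screens)
    top = min(y for _, y, _, _ in screens)
    right = max(x + w for x, _, w, _ in screens)
    bottom = max(y + h for _, y, _, h in screens)
    return left, top, right - left, bottom - top


# Hypothetical layout: primary 1920x1080 plus a 1280x1024 monitor on its left.
layout = [(0, 0, 1920, 1080), (-1280, 0, 1280, 1024)]
print(virtual_bounds(layout))
```

Capturing from the returned origin rather than from `(0, 0)` is what keeps a left-hand monitor from being cropped out.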

GDI+ failure modes — the four gaps

The current screenshot.sh fails silently, usually by returning a black or incomplete frame, in four situations:

RDP / remote desktop session: Graphics.CopyFromScreen is a GDI+ call that cannot capture GPU-composited output when the session is disconnected or running over RDP. Returns black.
Locked screen: captures the lock screen wallpaper or black. Does not bypass the lock.
Hardware-accelerated windows (games, video players, GPU compositor): GDI+ cannot capture DirectX/Vulkan surfaces. Those regions render black.
Multiple monitors: PrimaryScreen only; secondary monitors are silently ignored.

Fix: Windows.Graphics.Capture (WinRT API, .NET 6+ or PowerShell 7+ shim). Captures GPU-composited and DirectX windows. Handles RDP, secondary monitors, and accelerated content. The shim is a one-file replacement for the PowerShell block in screenshot.sh.

For the normal godding environment (local desktop, unlocked, primary monitor, browser + VS Code), GDI+ works reliably. The fix is worth making before the swarm runs headless or over RDP.
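
Because these failures are silent, a cheap guard before the vision call can catch an all-black frame. The following is a minimal sketch, assuming the PNG has already been decoded into a flat RGB byte buffer by whatever image library is available. The function names, the darkness threshold, and the 98% limit are illustrative assumptions, not part of the existing tooling.

```python
def dark_fraction(rgb: bytes, threshold: int = 8) -> float:
    """Return the fraction of pixels whose R, G and B are all <= threshold.

    `rgb` is a flat buffer with 3 bytes per pixel (assumed already decoded).
    """
    if len(rgb) == 0 or len(rgb) % 3 != 0:
        raise ValueError("expected a non-empty RGB buffer, 3 bytes per pixel")
    pixel_total = len(rgb) // 3
    dark = sum(
        1
        for i in range(0, len(rgb), 3)
        if max(rgb[i], rgb[i + 1], rgb[i + 2]) <= threshold
    )
    return dark / pixel_total


def looks_like_failed_capture(rgb: bytes, limit: float = 0.98) -> bool:
    """Flag frames that are almost entirely black (assumed limit: 98%)."""
    return dark_fraction(rgb) >= limit


# An all-black 10x10 frame is flagged; a half-lit frame is not.
black_frame = bytes(3 * 100)
half_lit = bytes([0, 0, 0] * 50 + [200, 200, 200] * 50)
print(looks_like_failed_capture(black_frame), looks_like_failed_capture(half_lit))
```

This check only covers the black-frame cases. A lock-screen wallpaper or a missing secondary monitor produces a plausible-looking image and would pass it.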

What the current `/look` and `eye.py` miss

/look is a single generalist agent: describe what's visible, flag anything "broken/misaligned/off", return one next move. It works but has four blind spots:

No content-correctness pass: wrong data, stale numbers, logical errors in visible text are outside a layout-focused agent's attention.
No dedicated error-state pass: browser console badges, terminal stderr lines, VS Code red squiggles, Problems pane counts.
No navigation/UX pass: broken breadcrumbs, missing MkDocs sidebar entries, 404 indicators in the rendered browser.
Single "next move" output: kills multi-path findings — a layout issue and a content issue are both real but only one survives.

eye.py is entirely static analysis (broken links, mermaid blocks, .mmd extraction, site-URL list). It never reads a pixel. The gap between eye.py and /look is large — structural checking vs visual checking — but neither is doing multi-specialist visual analysis.

Multi-agent `/multilook` architecture

Four parallel agents, each receiving the same workspace/screen.png with a tightly-scoped single-sentence role:

| Agent | Focus |
|---|---|
| `layout` | Visual alignment, overflow, truncation, CSS artifacts |
| `errors` | Error/warning indicators: badges, red text, stderr, squiggles |
| `content` | Data accuracy, stale values, logical inconsistencies in visible text |
| `nav` | Navigation, links, breadcrumbs, menu state |

Each agent returns [H|M|L] finding lines prefixed with its role tag (e.g. [layout/H]). The orchestrator deduplicates by (bounding-region, finding-type) key and sorts H→M→L.
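
The merge step described above can be sketched as follows. The text fixes only the `[role/severity]` prefix, so the rest of the line format here (`region | finding-type | message`) is an assumption made for illustration, as are all names in the code. On a duplicate key the sketch keeps the most severe finding, which is one possible policy among several.

```python
import re
from typing import NamedTuple


class Finding(NamedTuple):
    role: str
    severity: str
    region: str
    kind: str
    message: str


# Assumed line shape: "[role/severity] region | finding-type | message"
LINE_RE = re.compile(
    r"^\[(?P<role>layout|errors|content|nav)/(?P<severity>[HML])\]\s*"
    r"(?P<region>[^|]+)\|(?P<kind>[^|]+)\|(?P<message>.+)$"
)
SEVERITY_ORDER = {"H": 0, "M": 1, "L": 2}


def parse_findings(lines):
    """Parse tagged finding lines; lines that do not match are skipped."""
    findings = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        findings.append(
            Finding(*(match.group(name).strip() for name in Finding._fields))
        )
    return findings


def merge_findings(findings):
    """Deduplicate by (region, finding-type), then sort H before M before L."""
    best = {}
    for finding in findings:
        key = (finding.region.lower(), finding.kind.lower())
        kept = best.get(key)
        if kept is None or (
            SEVERITY_ORDER[finding.severity] < SEVERITY_ORDER[kept.severity]
        ):
            best[key] = finding
    return sorted(best.values(), key=lambda f: SEVERITY_ORDER[f.severity])


raw = [
    "[layout/M] sidebar | overflow | nav labels are clipped",
    "[nav/H] sidebar | overflow | nav entries cut off, links unreachable",
    "[errors/H] terminal | stderr | traceback visible in swarm-watch",
    "[content/L] footer | stale-value | build date looks old",
    "no issues elsewhere",
]
for item in merge_findings(parse_findings(raw)):
    print(f"[{item.role}/{item.severity}] {item.region}: {item.message}")
```

Two agents reporting the same sidebar overflow collapse into one entry, while unrelated findings from different agents all survive, which is the multi-path behavior the single "next move" output lacks.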

Implement as .claude/commands/multilook.md — a new slash command that spawns four parallel Task calls, waits, then merges. /look stays unchanged as the fast single-agent path.
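
The fan-out and wait pattern can be illustrated outside the slash-command format. The sketch below is not the `multilook.md` command itself: it uses Python threads as a stand-in for parallel Task calls, and `run_agent` is a hypothetical stub where a real specialist call would go. The role prompts are paraphrased from the agent table.

```python
from concurrent.futures import ThreadPoolExecutor

# One tightly scoped sentence per specialist, paraphrased from the table.
ROLE_PROMPTS = {
    "layout": "Report visual alignment, overflow, truncation, and CSS artifacts.",
    "errors": "Report error and warning indicators: badges, red text, stderr, squiggles.",
    "content": "Report inaccurate data, stale values, and logical inconsistencies in visible text.",
    "nav": "Report problems with navigation, links, breadcrumbs, and menu state.",
}


def run_agent(role, prompt, image_path):
    """Hypothetical stand-in for one specialist call; returns finding lines."""
    return [f"[{role}/L] stub finding for {image_path}"]


def multilook(image_path, agent=run_agent):
    """Run every specialist on the same image in parallel and gather lines."""
    with ThreadPoolExecutor(max_workers=len(ROLE_PROMPTS)) as pool:
        futures = {
            role: pool.submit(agent, role, prompt, image_path)
            for role, prompt in ROLE_PROMPTS.items()
        }
        # Collect in a fixed role order so the merge input is deterministic.
        return [line for role in ROLE_PROMPTS for line in futures[role].result()]


for line in multilook("workspace/screen.png"):
    print(line)
```

Passing the agent as a parameter keeps the fan-out testable with a stub, and the returned lines are what the orchestrator's merge step would consume.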

Use cases by screen view:

MkDocs site: layout checks nav/sidebar; content checks rendered markdown; nav checks 404 indicators; errors checks build warnings.
Investigation pages (e.g. GODDING-EXPLANATIONS.md rendered): content agent checks claim density; nav agent checks read_next links.
swarm-watch dashboard: errors agent reads stuck/crashed lane signals; content agent reads session age vs expected cycle time.
VS Code: errors agent reads Problems pane and red squiggles; nav agent reads the source control panel for unexpected staged changes.

Guard mode — periodic screen capture

The highest-value single guard check: capture the swarm-watch terminal every N sessions and run a single errors agent looking for stuck/crashed/overdue signals. If it finds [H] tier error states, post a swarm_signal. This slots into a CronCreate-driven periodic /look guard mode.
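
The guard logic reduces to two decisions: whether a capture is due, and whether any finding is H tier. This is a minimal sketch with the capture, the errors agent, and the signal poster passed in as callables. All function names are hypothetical, and the assumption that H-tier lines start with `[H]` or `[role/H]` follows the tag format described for `/multilook`.

```python
import re

# Matches "[H] ..." as well as role-tagged lines such as "[errors/H] ...".
HIGH_TIER = re.compile(r"^\[(?:\w+/)?H\]")


def capture_due(session_number: int, every_n: int) -> bool:
    """Return True on every N-th session."""
    if every_n <= 0:
        raise ValueError("every_n must be positive")
    return session_number % every_n == 0


def guard_step(session_number, every_n, capture_and_analyze, post_signal):
    """Run one guard check and return how many signals were posted.

    `capture_and_analyze` returns finding lines from the errors agent;
    `post_signal` receives each H-tier line. Both are injected stand-ins.
    """
    if not capture_due(session_number, every_n):
        return 0
    urgent = [line for line in capture_and_analyze() if HIGH_TIER.match(line)]
    for line in urgent:
        post_signal(line)
    return len(urgent)


posted = []
fake_agent = lambda: ["[errors/H] lane 3 stuck", "[errors/L] slow refresh"]
print(guard_step(10, 5, fake_agent, posted.append), posted)
```

Only the H-tier line reaches the signal channel, so a noisy terminal does not flood it with low-priority findings.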

Three screen views worth periodic capture:

1. swarm-watch terminal: active agents, lane state, error signals
2. MkDocs live-preview at the most recently committed investigation page
3. git log pane: last 3 commits match expected session output

External repos — what exists beyond the minimal loop

The public screen-capture + LLM vision ecosystem now covers the full range from the minimal loop to full computer-use agents. Sorted by distance from godding's current shape:

Closest to godding's current shape

Anthropic Computer-Use Demo — reference loop: screenshot → Claude 3.5 Sonnet → coordinates + actions → repeat. Godding's /look is this loop minus the action step. Good reference for extending to action-capable eyeing.

Claude Video Vision — Claude Code plugin. Extracts frames via ffmpeg; processes audio separately. The temporal extension: /look today captures one frame; video vision captures a session replay.

Element extraction (reduces LLM noise)

OmniParser (Microsoft) — parses UI screenshots → structured element data (bounding boxes, text, icons). Front-ends GPT-4V for action generation. The key insight: structured element extraction before LLM vision reduces hallucination about element positions and identities. Worth integrating as a pre-pass in /multilook.
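
To show what a pre-pass might hand to the vision agents, the sketch below turns a list of extracted elements into a text preamble for the prompt. The element dictionary shape (`type`, `bbox`, `text`) is an assumption made for illustration and is not OmniParser's actual output schema. The function name is hypothetical.

```python
def elements_to_preamble(elements):
    """Serialize extracted UI elements into a prompt preamble.

    Each element is assumed to be a dict with "type", "bbox" (x1, y1, x2, y2
    in pixels) and "text". This shape is illustrative, not a real schema.
    """
    lines = ["Detected UI elements (id: type [x1,y1,x2,y2] text):"]
    for index, element in enumerate(elements):
        x1, y1, x2, y2 = element["bbox"]
        lines.append(
            f"{index}: {element['type']} [{x1},{y1},{x2},{y2}] {element['text']!r}"
        )
    return "\n".join(lines)


# Hypothetical elements for a rendered MkDocs page.
sample = [
    {"type": "link", "bbox": (24, 180, 210, 204), "text": "Swarm Vision Eyeing"},
    {"type": "badge", "bbox": (980, 12, 1010, 36), "text": "3"},
]
print(elements_to_preamble(sample))
```

With element ids and boxes stated in text, an agent can cite "element 1" instead of guessing coordinates from pixels, which is where the reduction in position hallucination would come from.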

Persistent visual memory

Agentic Vision (agentralabs) — captures screenshots → embeds via CLIP ViT-B/32 → recalls by similarity + metadata. MCP server + Rust core. Closest to "visual memory across sessions" — godding has no persistent visual memory; every /look is amnesiac.
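
Recall by similarity comes down to ranking stored embeddings by cosine similarity against a query embedding. The sketch below shows that ranking step with toy 3-dimensional vectors standing in for real CLIP embeddings. The memory record shape and labels are hypothetical and do not reflect Agentic Vision's storage format.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall(query, memory, top_k=1):
    """Return the top_k stored captures most similar to the query embedding."""
    ranked = sorted(
        memory,
        key=lambda item: cosine(query, item["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy vectors; a real embedding model would produce much longer ones.
memory = [
    {"label": "mkdocs-page", "embedding": [1.0, 0.0, 0.0]},
    {"label": "terminal", "embedding": [0.0, 1.0, 0.0]},
    {"label": "vscode", "embedding": [0.0, 0.0, 1.0]},
]
print(recall([0.9, 0.1, 0.0], memory)[0]["label"])
```

A metadata filter (session, window title, date) would normally narrow `memory` before the ranking runs.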

Browser-specific

browser-use — web automation with vision mode (auto / always / never). Playwright + LLM. For the MkDocs-checking use case, this gives programmatic navigation + screenshot without manual browser positioning.

mcp-browser-screenshot / browserloop — Playwright MCP servers. Low-friction integration; godding could add one to .claude/settings.json for remote browser oversight without touching screenshot.sh.

State machines beyond the single shot

ScreenAgent (IJCAI-24) — adds plan→action→reflection state machine around the screenshot loop. Supports GPT-4V, LLaVA-1.5, CogAgent. The "reflection" step is the structural gap in godding's current /look — it sees once, reports, stops.
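
A plan→action→reflection loop can be pictured as a small state machine in which a failed reflection sends control back to planning. This is a generic sketch of that shape, not ScreenAgent's code. The phase names, the `run` helper, and the step limit are illustrative assumptions.

```python
from enum import Enum


class Phase(Enum):
    PLAN = "plan"
    ACT = "act"
    REFLECT = "reflect"
    DONE = "done"


def next_phase(phase, reflection_ok=False):
    """Advance one step; a failed reflection loops back to planning."""
    if phase is Phase.PLAN:
        return Phase.ACT
    if phase is Phase.ACT:
        return Phase.REFLECT
    if phase is Phase.REFLECT:
        return Phase.DONE if reflection_ok else Phase.PLAN
    return Phase.DONE


def run(reflections, max_steps=20):
    """Walk the loop and return the visited phases.

    `reflections` supplies one verdict per REFLECT phase (True means the goal
    is judged met). If verdicts run out, the loop treats the goal as met.
    """
    phase, trace = Phase.PLAN, []
    verdicts = iter(reflections)
    for _ in range(max_steps):
        trace.append(phase)
        if phase is Phase.DONE:
            break
        ok = next(verdicts, True) if phase is Phase.REFLECT else False
        phase = next_phase(phase, ok)
    return trace


# First reflection fails, second succeeds: the loop runs twice, then stops.
print([p.value for p in run([False, True])])
```

In these terms the current /look is a single pass that ends after its first report, with no path back to planning.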

UI-TARS (ByteDance, 27K stars) — raw screenshot as sole input, outputs mouse/keyboard actions. No DOM/accessibility APIs. State of the art on OSWorld. Reference for what "screenshot-only computer control" looks like at its best.

Priority action list

| Priority | Action |
|---|---|
| H | Implement `/multilook` command (4 parallel Task agents, merge step) |
| H | Fix GDI+ → WinRT capture for RDP/GPU-accelerated windows |
| M | Add OmniParser pre-pass to `/multilook` for element extraction |
| M | Add a periodic guard capture of swarm-watch terminal |
| L | Add Agentic Vision / CLIP embedding for persistent visual memory across sessions |
| L | Add browser-use MCP server for programmatic MkDocs navigation + screenshot |

References

L-2057 (cited in source S629) — primary lesson from swarmgodmultiagentforage S629; screen-capture architecture findings.
microsoft/OmniParser (cited in body) — UI screenshot → structured element data (bounding boxes, text, icons); OmniParser pre-pass for element extraction.
agentralabs/agentic-vision (cited in body) — persistent visual memory via CLIP ViT-B/32 embeddings recalled by similarity; the "visual memory across sessions" pattern godding lacks.
niuzaisheng/ScreenAgent IJCAI-24 (cited in body) — plan→action→reflection state machine; reference for the reflection step the /look verb currently lacks.
bytedance/ui-tars (cited in body) — screenshot-only computer control, OSWorld SOTA (27K stars); reference ceiling for what screenshot-to-action looks like at its best.
SWARM-TOOLING-REPOS investigation — external repo catalog the S629 vision forage drew from; full forage notes at references/ai/forage-swarm-repos-s629.md.