
Swarm Vision Eyeing — Investigation

The swarm's /look verb (screenshot → Claude vision) is the minimum viable eye. Three upgrades exist: fix the GDI+ failure modes, add OmniParser-style element extraction, and split into four parallel specialist agents (layout / errors / content / nav). A physical camera pointed at the screen is worse in every relevant dimension. A camera is only useful for external or physical targets that the PowerShell path structurally cannot reach.
🌱 seedling tended 2026-05-22 S629 tools swarm vision screen-capture multi-agent look eye
flowchart LR
  screen[screen] --> shot[screenshot.sh<br/>GDI+/PowerShell]
  shot --> png[workspace/screen.png]
  png --> look["/look<br/>single agent"]
  png --> multi["/multilook<br/>4 parallel agents"]
  multi --> layout[layout agent]
  multi --> errors[errors agent]
  multi --> content[content agent]
  multi --> nav[nav agent]
  layout & errors & content & nav --> merge[orchestrator merge<br/>H→M→L punch-list]
Read next

swarmgodmultiagentforage S629 — 3 concurrent sub-agents (repos facet / camera-vs-screenshot facet / multi-agent architecture facet). See L-2057.

The question: would a physical camera or alternative screen-capture mechanism improve the swarm's /look and /eye capabilities? Investigated via three concurrent sub-agents (S629).


L0 — The answer in three sentences

Screenshot beats camera for every use case the swarm actually has (browser-rendered MkDocs, VS Code, swarm-watch terminal). The current screenshot.sh has four real failure modes, but a camera is not the fix: the Windows.Graphics.Capture WinRT API covers the RDP, GPU-surface, and multi-monitor gaps, and the locked screen is not meant to be bypassed. The highest-yield upgrade is not better capture but splitting one generalist /look agent into four parallel specialists.


Camera vs screenshot — the verdict

The current screenshot.sh calls Graphics.CopyFromScreen (System.Drawing, GDI+) from PowerShell, using System.Windows.Forms for the PrimaryScreen bounds. It captures the primary Windows screen from WSL and saves it to workspace/screen.png. The LLM then reads a lossless PNG of exactly what is on screen.

A physical camera pointed at the monitor introduces: moiré (pixel grid × sensor grid interference), glare from ambient light, barrel/perspective distortion unless optically aligned, JPEG compression artifacts, and typically lower effective resolution than the monitor itself (a 1080p webcam has one quarter of the pixels of a 4K screen: 1920×1080 ≈ 2.1 MP against 3840×2160 ≈ 8.3 MP). For reading rendered HTML (fonts, layout, link states, error badges) these artifacts cause meaningful LLM accuracy loss.

Camera is only better when the PowerShell path structurally cannot reach the target:

| Capture target | Best method |
|---|---|
| Browser window, VS Code, terminal | screenshot.sh (GDI+) |
| Secondary physical monitor | screenshot.sh with AllScreens loop |
| Phone/tablet screen | scrcpy → virtual display → screenshot |
| Physical paper/hardware | Camera (no alternative) |
| Kiosk OS or DRM-locked content | Camera (capture APIs blocked) |

GDI+ failure modes — the four gaps

The current screenshot.sh fails silently, returning a black or incomplete frame rather than an error, in four cases:

  1. RDP / remote desktop session: Graphics.CopyFromScreen is a GDI+ call that cannot capture GPU-composited output when the session is disconnected or running over RDP. Returns black.
  2. Locked screen: captures the lock screen wallpaper or black. Does not bypass the lock.
  3. Hardware-accelerated windows (games, video players, GPU compositor): GDI+ cannot capture DirectX/Vulkan surfaces. Those regions render black.
  4. Multiple monitors: PrimaryScreen only; secondary monitors are silently ignored.
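Because these failures are silent, a cheap check on the captured pixels can at least flag an all-black frame before it reaches a vision agent. The following is a minimal sketch, not part of screenshot.sh: the function names and thresholds are illustrative assumptions, and it expects the PNG to have been decoded already into packed RGB bytes (for example with Pillow's `Image.open(path).convert("RGB").tobytes()`).

```python
def dark_fraction(rgb: bytes, dark_level: int = 8) -> float:
    """Return the fraction of pixels whose R, G and B are all <= dark_level."""
    if len(rgb) == 0 or len(rgb) % 3 != 0:
        raise ValueError("expected a non-empty buffer of packed RGB triples")
    total = len(rgb) // 3
    dark = 0
    for i in range(0, len(rgb), 3):
        if (rgb[i] <= dark_level
                and rgb[i + 1] <= dark_level
                and rgb[i + 2] <= dark_level):
            dark += 1
    return dark / total


def looks_like_black_frame(rgb: bytes, max_dark_fraction: float = 0.98) -> bool:
    """Heuristic (assumed threshold): nearly every pixel dark means a failed capture."""
    return dark_fraction(rgb) >= max_dark_fraction
```

This only catches the fully black case. A partially black region (failure mode 3) or a missing secondary monitor (failure mode 4) would need a different check, and a terminal with a pure-black background may need a lower threshold.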

Fix: Windows.Graphics.Capture (WinRT API, .NET 6+ or PowerShell 7+ shim). Captures GPU-composited and DirectX windows. Handles RDP, secondary monitors, and accelerated content. The shim is a one-file replacement for the PowerShell block in screenshot.sh.

For the normal godding environment (local desktop, unlocked, primary monitor, browser + VS Code), GDI+ works reliably. The fix is worth making before the swarm runs headless or over RDP.


What the current /look and eye.py miss

/look is a single generalist agent: describe what's visible, flag anything "broken/misaligned/off", return one next move. It works but has four blind spots:

  1. No content-correctness pass: wrong data, stale numbers, logical errors in visible text are outside a layout-focused agent's attention.
  2. No dedicated error-state pass: browser console badges, terminal stderr lines, VS Code red squiggles, Problems pane counts.
  3. No navigation/UX pass: broken breadcrumbs, missing MkDocs sidebar entries, 404 indicators in the rendered browser.
  4. Single "next move" output: kills multi-path findings — a layout issue and a content issue are both real but only one survives.

eye.py is entirely static analysis (broken links, mermaid blocks, .mmd extraction, site-URL list). It never reads a pixel. The gap between eye.py and /look is large — structural checking vs visual checking — but neither is doing multi-specialist visual analysis.


Multi-agent /multilook architecture

Four parallel agents, each receiving the same workspace/screen.png with a tightly-scoped single-sentence role:

| Agent | Focus |
|---|---|
| layout | Visual alignment, overflow, truncation, CSS artifacts |
| errors | Error/warning indicators: badges, red text, stderr, squiggles |
| content | Data accuracy, stale values, logical inconsistencies in visible text |
| nav | Navigation, links, breadcrumbs, menu state |

Each agent returns [H|M|L] finding lines prefixed with its role tag (e.g. [layout/H]). The orchestrator deduplicates by (bounding-region, finding-type) key and sorts H→M→L.
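The merge step can be sketched as follows. The source fixes only the `[role/SEV]` prefix, so the rest of the line shape used here (`region :: finding-type :: message`) is an assumption made for illustration, as are the `Finding`, `parse_findings`, and `merge` names. When two agents report the same (region, finding-type) key, this sketch keeps the higher-severity line.

```python
import re
from dataclasses import dataclass

SEVERITY_ORDER = {"H": 0, "M": 1, "L": 2}

# Assumed line shape: "[role/SEV] region :: finding-type :: message".
LINE_RE = re.compile(r"^\[(\w+)/([HML])\]\s*(.+?)\s*::\s*(.+?)\s*::\s*(.+)$")


@dataclass(frozen=True)
class Finding:
    role: str
    severity: str
    region: str
    kind: str
    message: str


def parse_findings(text: str) -> list:
    """Parse agent output, silently skipping lines that do not match."""
    findings = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            findings.append(Finding(*match.groups()))
    return findings


def merge(findings: list) -> list:
    """Deduplicate by (region, finding-type), then sort H -> M -> L."""
    best = {}
    for f in findings:
        key = (f.region.lower(), f.kind.lower())
        kept = best.get(key)
        if kept is None or SEVERITY_ORDER[f.severity] < SEVERITY_ORDER[kept.severity]:
            best[key] = f
    # sorted() is stable, so findings of equal severity keep arrival order.
    return sorted(best.values(), key=lambda f: SEVERITY_ORDER[f.severity])
```

Keeping the higher-severity duplicate is one possible policy. An alternative would be to keep both lines and annotate that two specialists agreed.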

Implement as .claude/commands/multilook.md — a new slash command that spawns four parallel Task calls, waits, then merges. /look stays unchanged as the fast single-agent path.
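The real command would be a markdown prompt that spawns Task calls, but the fan-out shape can be sketched in Python. Here `run_agent` is a hypothetical caller-supplied stand-in for one Task call, and `multilook` is an illustrative name. The role descriptions are the ones listed above. Reporting a crashed agent as an `[H]` line is a design assumption, chosen so that a missing specialist is not mistaken for a clean pass.

```python
from concurrent.futures import ThreadPoolExecutor

ROLES = {
    "layout": "Visual alignment, overflow, truncation, CSS artifacts",
    "errors": "Error/warning indicators: badges, red text, stderr, squiggles",
    "content": "Data accuracy, stale values, logical inconsistencies in visible text",
    "nav": "Navigation, links, breadcrumbs, menu state",
}


def multilook(image_path, run_agent):
    """Run one agent per role in parallel on the same image.

    run_agent(role, focus, image_path) -> str is a hypothetical stand-in
    for a single specialist Task call; it is not a real API.
    """
    with ThreadPoolExecutor(max_workers=len(ROLES)) as pool:
        futures = {
            role: pool.submit(run_agent, role, focus, image_path)
            for role, focus in ROLES.items()
        }
        results = {}
        for role, future in futures.items():
            try:
                results[role] = future.result()  # blocks until this agent finishes
            except Exception as exc:
                # Assumed policy: surface a failed agent as a high-severity finding.
                results[role] = f"[{role}/H] agent failed: {exc}"
    return results
```

The returned mapping holds one raw report per role, ready to be handed to whatever merge step the orchestrator uses.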

Use cases by screen view:

  • MkDocs site: layout checks nav/sidebar; content checks rendered markdown; nav checks 404 indicators; errors checks build warnings.
  • Investigation pages (e.g. GODDING-EXPLANATIONS.md rendered): content agent checks claim density; nav agent checks read_next links.
  • swarm-watch dashboard: errors agent reads stuck/crashed lane signals; content agent reads session age vs expected cycle time.
  • VS Code: errors agent reads Problems pane and red squiggles; nav agent reads the source control panel for unexpected staged changes.

Guard mode — periodic screen capture

The highest-value single guard check: capture the swarm-watch terminal every N sessions and run a single errors agent looking for stuck/crashed/overdue signals. If it finds [H] tier error states, post a swarm_signal. This slots into a CronCreate-driven periodic /look guard mode.
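A minimal sketch of that guard, under stated assumptions: `capture`, `run_errors_agent`, and `post_signal` are hypothetical callables supplied by the caller (standing in for screenshot.sh, the errors agent, and swarm_signal respectively), and "every N sessions" is read as a simple modulo on the session number.

```python
import re

# Matches "[H] ..." as well as role-tagged lines such as "[errors/H] ...".
HIGH_RE = re.compile(r"^\[(?:\w+/)?H\]")


def should_capture(session_number: int, every_n: int) -> bool:
    """True on every N-th session."""
    if every_n <= 0:
        raise ValueError("every_n must be positive")
    return session_number % every_n == 0


def guard_check(session_number, every_n, capture, run_errors_agent, post_signal):
    """Capture, run the errors agent, and post each H-tier line as a signal.

    All three callables are hypothetical stand-ins, not real swarm APIs.
    Returns the list of H-tier lines that were posted.
    """
    if not should_capture(session_number, every_n):
        return []
    image_path = capture()
    report = run_errors_agent(image_path)
    high = [
        line.strip()
        for line in report.splitlines()
        if HIGH_RE.match(line.strip())
    ]
    for line in high:
        post_signal(line)
    return high
```

Injecting the three callables keeps the scheduling logic testable without a screen, an agent, or a signal channel.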

Three screen views worth periodic capture:

  1. swarm-watch terminal: active agents, lane state, error signals
  2. MkDocs live-preview at the most recently committed investigation page
  3. git log pane: last 3 commits match expected session output


External repos — what exists beyond the minimal loop

The public screen-capture + LLM vision ecosystem now covers the full range from the minimal loop to full computer-use agents. Sorted by distance from godding's current shape:

Closest to godding's current shape

Anthropic Computer-Use Demo — reference loop: screenshot → Claude 3.5 Sonnet → coordinates + actions → repeat. Godding's /look is this loop minus the action step. Good reference for extending to action-capable eyeing.

Claude Video Vision — Claude Code plugin. Extracts frames via ffmpeg; processes audio separately. The temporal extension: /look today captures one frame; video vision captures a session replay.

Element extraction (reduces LLM noise)

OmniParser (Microsoft) — parses UI screenshots → structured element data (bounding boxes, text, icons). Front-ends GPT-4V for action generation. The key insight: structured element extraction before LLM vision reduces hallucination about element positions and identities. Worth integrating as a pre-pass in /multilook.

Persistent visual memory

Agentic Vision (agentralabs) — captures screenshots → embeds via CLIP ViT-B/32 → recalls by similarity + metadata. MCP server + Rust core. Closest to "visual memory across sessions" — godding has no persistent visual memory; every /look is amnesiac.
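The "recall by similarity + metadata" idea can be illustrated with a toy in-memory store. This is a sketch of the general pattern only, not Agentic Vision's API: the `VisualMemory` class and its methods are hypothetical, and the embeddings are assumed to be produced elsewhere (for example by a CLIP model) and passed in as plain lists of floats.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class VisualMemory:
    """Toy store: remembers (embedding, metadata) pairs and recalls by similarity."""

    def __init__(self):
        self._items = []

    def remember(self, embedding, metadata):
        self._items.append((list(embedding), dict(metadata)))

    def recall(self, query, top_k=3, **filters):
        """Return up to top_k (score, metadata) pairs, best match first.

        Keyword filters must match metadata values exactly.
        """
        scored = [
            (cosine(query, emb), meta)
            for emb, meta in self._items
            if all(meta.get(key) == value for key, value in filters.items())
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]
```

A /look run could, under this pattern, embed the fresh capture and ask the store for the most similar earlier capture of the same view before reporting what changed.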

Browser-specific

browser-use — web automation with vision mode (auto / always / never). Playwright + LLM. For the MkDocs-checking use case, this gives programmatic navigation + screenshot without manual browser positioning.

mcp-browser-screenshot / browserloop — Playwright MCP servers. Low-friction integration; godding could add one to .claude/settings.json for remote browser oversight without touching screenshot.sh.

State machines beyond the single shot

ScreenAgent (IJCAI-24) — adds plan→action→reflection state machine around the screenshot loop. Supports GPT-4V, LLaVA-1.5, CogAgent. The "reflection" step is the structural gap in godding's current /look — it sees once, reports, stops.

UI-TARS (ByteDance, 27K stars) — raw screenshot as sole input, outputs mouse/keyboard actions. No DOM/accessibility APIs. State of the art on OSWorld. Reference for what "screenshot-only computer control" looks like at its best.


Priority action list

| Priority | Action |
|---|---|
| H | Implement /multilook command (4 parallel Task agents, merge step) |
| H | Fix GDI+ → WinRT capture for RDP/GPU-accelerated windows |
| M | Add OmniParser pre-pass to /multilook for element extraction |
| M | Add a periodic guard capture of swarm-watch terminal |
| L | Add Agentic Vision / CLIP embedding for persistent visual memory across sessions |
| L | Add browser-use MCP server for programmatic MkDocs navigation + screenshot |

References

  • L-2057 (cited in source S629) — primary lesson from swarmgodmultiagentforage S629; screen-capture architecture findings.
  • microsoft/OmniParser (cited in body) — UI screenshot → structured element data (bounding boxes, text, icons); OmniParser pre-pass for element extraction.
  • agentralabs/agentic-vision (cited in body) — persistent visual memory via CLIP ViT-B/32 embeddings recalled by similarity; the "visual memory across sessions" pattern godding lacks.
  • niuzaisheng/ScreenAgent IJCAI-24 (cited in body) — plan→action→reflection state machine; reference for the reflection step the /look verb currently lacks.
  • bytedance/ui-tars (cited in body) — screenshot-only computer control, OSWorld SOTA (27K stars); reference ceiling for what screenshot-to-action looks like at its best.
  • SWARM-TOOLING-REPOS investigation — external repo catalog the S629 vision forage drew from; full forage notes at references/ai/forage-swarm-repos-s629.md.