Swarm Vision Eyeing — Investigation¶
flowchart LR
screen[screen] --> shot[screenshot.sh<br/>GDI+/PowerShell]
shot --> png[workspace/screen.png]
png --> look["/look<br/>single agent"]
png --> multi["/multilook<br/>4 parallel agents"]
multi --> layout[layout agent]
multi --> errors[errors agent]
multi --> content[content agent]
multi --> nav[nav agent]
layout & errors & content & nav --> merge[orchestrator merge<br/>H→M→L punch-list]
- commands — look + eye verbs
- swarm tooling repos — external repos for screen-capture patterns
- self-organization — multi-agent convergence substrate
swarmgodmultiagentforage S629 — 3 concurrent sub-agents (repos facet / camera-vs-screenshot facet / multi-agent architecture facet). See L-2057.
The question: would a physical camera or alternative screen-capture mechanism improve the swarm's
/look and /eye capabilities? Investigated via three concurrent sub-agents (S629).
L0 — The answer in three sentences¶
Screenshot beats camera for every use case the swarm actually has
(browser-rendered MkDocs, VS Code, swarm-watch terminal). The current
screenshot.sh has four real failure modes, but a camera doesn't fix
them — the Windows.Graphics.Capture WinRT API does. The highest-yield
upgrade is not better capture but splitting one generalist /look
agent into four parallel specialists.
Camera vs screenshot — the verdict¶
The current screenshot.sh uses PowerShell's System.Windows.Forms
CopyFromScreen (GDI+). It captures the primary Windows screen from
WSL and saves to workspace/screen.png. The LLM then reads a
lossless PNG of exactly what is on screen.
A physical camera pointed at the monitor introduces moiré (pixel grid × sensor grid interference), glare from ambient light, barrel and perspective distortion unless optically aligned, JPEG compression artifacts, and typically a lower effective resolution than the monitor itself (a 1080p webcam has one quarter of the pixels of a 4K screen). For reading rendered HTML (fonts, layout, link states, error badges), these artifacts cause meaningful LLM accuracy loss.
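The resolution gap in the paragraph above can be checked with a few lines of arithmetic. This is a sketch under two assumptions that are not in the investigation itself: standard 16:9 dimensions (3840×2160 for 4K UHD, 1920×1080 for 1080p), and an ideal camera that frames the monitor edge to edge, which a real webcam will not achieve.

```python
# Pixel budget of a 4K UHD monitor versus a 1080p webcam sensor.
# Assumption: the camera frames the monitor exactly, wasting no sensor area.
MONITOR = (3840, 2160)  # 4K UHD
WEBCAM = (1920, 1080)   # 1080p


def pixel_count(size):
    """Total pixels for a (width, height) pair."""
    width, height = size
    return width * height


ratio = pixel_count(MONITOR) / pixel_count(WEBCAM)
print(f"monitor pixels per camera pixel: {ratio:.1f}")
```

Any framing slack or perspective correction makes the real ratio worse, since part of the sensor then records the bezel and the room instead of the screen.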
A camera is only better when the PowerShell path structurally cannot reach the target. Best method by capture target:
| Capture target | Best method |
|---|---|
| Browser window, VS Code, terminal | screenshot.sh (GDI+) |
| Secondary physical monitor | screenshot.sh with AllScreens loop |
| Phone/tablet screen | scrcpy → virtual display → screenshot |
| Physical paper/hardware | Camera (no alternative) |
| Kiosk OS or DRM-locked content | Camera (capture APIs blocked) |
GDI+ failure modes — the four gaps¶
The current screenshot.sh fails silently, usually by returning a black frame, in four situations:
- RDP / remote desktop session: Graphics.CopyFromScreen is a GDI+ call that cannot capture GPU-composited output when the session is disconnected or running over RDP. Returns black.
- Locked screen: captures the lock screen wallpaper or black. Does not bypass the lock.
- Hardware-accelerated windows (games, video players, GPU compositor): GDI+ cannot capture DirectX/Vulkan surfaces. Those regions render black.
- Multiple monitors: PrimaryScreen only; secondary monitors are silently ignored.
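Because these failures are silent, a cheap sanity check on the captured pixels can turn most of them into explicit errors. The following is a minimal sketch and not part of the current screenshot.sh. It assumes the PNG has already been decoded into rows of (R, G, B) tuples by some image library, and the names `black_fraction` and `looks_like_failed_capture`, the darkness cutoff of 8, and the 0.98 threshold are all illustrative choices.

```python
from typing import Iterable, Sequence, Tuple

Pixel = Tuple[int, int, int]


def black_fraction(rows: Iterable[Sequence[Pixel]], cutoff: int = 8) -> float:
    """Return the fraction of pixels whose R, G and B are all <= cutoff."""
    total = 0
    dark = 0
    for row in rows:
        for r, g, b in row:
            total += 1
            if r <= cutoff and g <= cutoff and b <= cutoff:
                dark += 1
    if total == 0:
        raise ValueError("frame contains no pixels")
    return dark / total


def looks_like_failed_capture(rows, threshold: float = 0.98) -> bool:
    """Hypothetical guard: treat an almost entirely black frame as a failed capture."""
    return black_fraction(rows) >= threshold


# Synthetic frames standing in for decoded screenshots.
black_frame = [[(0, 0, 0)] * 4 for _ in range(3)]
normal_frame = [
    [(30, 30, 30), (255, 255, 255), (0, 0, 0), (200, 10, 10)] for _ in range(3)
]

print(looks_like_failed_capture(black_frame))
print(looks_like_failed_capture(normal_frame))
```

A dark-themed terminal can legitimately be mostly dark, so the cutoff and threshold would need tuning. The check also cannot detect the multi-monitor gap, where the frame is valid but incomplete.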
Fix: Windows.Graphics.Capture (WinRT API, .NET 6+ or PowerShell
7+ shim). Captures GPU-composited and DirectX windows. Handles RDP,
secondary monitors, and accelerated content. The shim is a one-file
replacement for the PowerShell block in screenshot.sh.
For the normal godding environment (local desktop, unlocked, primary monitor, browser + VS Code), GDI+ works reliably. The fix is worth making before the swarm runs headless or over RDP.
What the current /look and eye.py miss¶
/look is a single generalist agent: describe what's visible, flag
anything "broken/misaligned/off", return one next move. It works but
has four blind spots:
- No content-correctness pass: wrong data, stale numbers, logical errors in visible text are outside a layout-focused agent's attention.
- No dedicated error-state pass: browser console badges, terminal stderr lines, VS Code red squiggles, Problems pane counts.
- No navigation/UX pass: broken breadcrumbs, missing MkDocs sidebar entries, 404 indicators in the rendered browser.
- Single "next move" output: kills multi-path findings — a layout issue and a content issue are both real but only one survives.
eye.py is entirely static analysis (broken links, mermaid blocks,
.mmd extraction, site-URL list). It never reads a pixel. The
gap between eye.py and /look is large — structural checking vs
visual checking — but neither is doing multi-specialist visual
analysis.
Multi-agent /multilook architecture¶
Four parallel agents, each receiving the same workspace/screen.png
with a tightly-scoped single-sentence role:
| Agent | Focus |
|---|---|
| layout | Visual alignment, overflow, truncation, CSS artifacts |
| errors | Error/warning indicators: badges, red text, stderr, squiggles |
| content | Data accuracy, stale values, logical inconsistencies in visible text |
| nav | Navigation, links, breadcrumbs, menu state |
Each agent returns finding lines, each tagged with a severity (H, M, or L)
and prefixed with the agent's role (e.g. [layout/H]). The orchestrator
deduplicates by (bounding-region, finding-type) key and sorts H→M→L.
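The merge step can be sketched as follows. Only the `[role/SEV]` prefix comes from the investigation. The rest of the line shape (`region | kind | message`), the `Finding` type, and the rule of keeping the most severe duplicate are assumptions made for illustration, and the sample findings are invented.

```python
import re
from typing import Dict, List, NamedTuple, Tuple


class Finding(NamedTuple):
    role: str
    severity: str  # "H", "M" or "L"
    region: str    # assumed coarse region label, e.g. "sidebar"
    kind: str      # assumed finding type, e.g. "overflow"
    message: str


# Assumed line shape: "[role/SEV] region | kind | message".
LINE_RE = re.compile(r"^\[(\w+)/([HML])\]\s*([^|]+)\|([^|]+)\|(.+)$")
SEVERITY_RANK = {"H": 0, "M": 1, "L": 2}


def parse(line: str) -> Finding:
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised finding line: {line!r}")
    role, severity, region, kind, message = (p.strip() for p in match.groups())
    return Finding(role, severity, region, kind, message)


def merge(lines: List[str]) -> List[Finding]:
    """Deduplicate by (region, kind), keep the most severe copy, sort H, M, L."""
    best: Dict[Tuple[str, str], Finding] = {}
    for line in lines:
        finding = parse(line)
        key = (finding.region.lower(), finding.kind.lower())
        kept = best.get(key)
        if kept is None or SEVERITY_RANK[finding.severity] < SEVERITY_RANK[kept.severity]:
            best[key] = finding
    # sorted() is stable, so findings of equal severity keep first-seen order.
    return sorted(best.values(), key=lambda f: SEVERITY_RANK[f.severity])


agent_output = [
    "[layout/M] sidebar | overflow | nav entries clipped at narrow width",
    "[nav/H] sidebar | overflow | last sidebar entry unreachable",
    "[content/L] body | stale-value | session count looks out of date",
    "[errors/H] terminal | stderr | traceback visible in swarm-watch",
]
punch_list = merge(agent_output)
for f in punch_list:
    print(f"[{f.role}/{f.severity}] {f.region}: {f.message}")
```

Keeping every distinct key in the output is what lets a layout issue and a content issue both survive, instead of collapsing to a single next move.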
Implement as .claude/commands/multilook.md — a new slash command
that spawns four parallel Task calls, waits, then merges. /look
stays unchanged as the fast single-agent path.
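The real command would delegate to Task calls inside the slash command. The sketch below only mirrors the fan-out and wait shape in Python so the control flow is concrete. `ROLES`, `stub_agent`, and `multilook` are hypothetical names, the role sentences are paraphrased from the focus table, and the stub stands in for an actual vision call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# One-sentence roles, paraphrased from the focus table.
ROLES: Dict[str, str] = {
    "layout": "Report visual alignment, overflow, truncation and CSS artifacts.",
    "errors": "Report error and warning indicators: badges, red text, stderr, squiggles.",
    "content": "Report wrong data, stale values and logical inconsistencies in visible text.",
    "nav": "Report problems with navigation, links, breadcrumbs and menu state.",
}

Agent = Callable[[str, str, str], List[str]]


def stub_agent(role: str, prompt: str, image_path: str) -> List[str]:
    """Placeholder for a real vision call; returns one canned finding line."""
    return [f"[{role}/L] no issues found in {image_path}"]


def multilook(image_path: str, agent: Agent = stub_agent) -> List[str]:
    """Run every role against the same image in parallel and collect the lines."""
    with ThreadPoolExecutor(max_workers=len(ROLES)) as pool:
        futures = {
            role: pool.submit(agent, role, prompt, image_path)
            for role, prompt in ROLES.items()
        }
        findings: List[str] = []
        for role in ROLES:  # fixed iteration order keeps the output deterministic
            findings.extend(futures[role].result())
    return findings


lines = multilook("workspace/screen.png")
for line in lines:
    print(line)
```

Every agent receives the same image path, so the capture happens once and the specialists differ only in their prompt.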
Use cases by screen view:
- MkDocs site: layout checks nav/sidebar; content checks rendered markdown; nav checks 404 indicators; errors checks build warnings.
- Investigation pages (e.g. GODDING-EXPLANATIONS.md rendered): content agent checks claim density; nav agent checks read_next links.
- swarm-watch dashboard: errors agent reads stuck/crashed lane signals; content agent reads session age vs expected cycle time.
- VS Code: errors agent reads Problems pane and red squiggles; nav agent reads the source control panel for unexpected staged changes.
Guard mode — periodic screen capture¶
The highest-value single guard check: capture the swarm-watch terminal
every N sessions and run a single errors agent looking for
stuck/crashed/overdue signals. If it finds [H] tier error states,
post a swarm_signal. This slots into a CronCreate-driven periodic
/look guard mode.
Three screen views worth periodic capture:
1. swarm-watch terminal: active agents, lane state, error signals
2. MkDocs live-preview at the most recently committed investigation page
3. git log pane: last 3 commits match expected session output
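The guard logic for the swarm-watch check can be sketched as below. This is an illustration under stated assumptions, not the existing tooling: `should_capture` and `guard` are hypothetical names, the capture plus errors-agent call and the swarm_signal posting are passed in as plain callables, and the sample finding lines are invented.

```python
import re
from typing import Callable, List

HIGH_ERROR = re.compile(r"^\[errors/H\]")


def should_capture(session_number: int, every_n: int) -> bool:
    """Capture on every N-th session (N is a tuning knob the source leaves open)."""
    if every_n <= 0:
        raise ValueError("every_n must be positive")
    return session_number % every_n == 0


def guard(
    session_number: int,
    every_n: int,
    run_errors_agent: Callable[[], List[str]],
    post_signal: Callable[[str], None],
) -> int:
    """Run the errors agent when due and forward each H-tier line.

    Returns the number of signals posted.
    """
    if not should_capture(session_number, every_n):
        return 0
    high = [line for line in run_errors_agent() if HIGH_ERROR.match(line)]
    for line in high:
        post_signal(line)
    return len(high)


def fake_errors_agent() -> List[str]:
    """Stand-in for capture plus the errors agent."""
    return [
        "[errors/H] lane 3 stuck for 40 minutes",
        "[errors/L] one deprecation warning",
    ]


posted: List[str] = []  # stand-in for swarm_signal posting
count_due = guard(10, 5, fake_errors_agent, posted.append)
count_skipped = guard(11, 5, fake_errors_agent, posted.append)
print(count_due, count_skipped, posted)
```

Filtering to H-tier lines before posting keeps the guard quiet during normal operation, so a posted signal always means something needs attention.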
External repos — what exists beyond the minimal loop¶
The public screen-capture + LLM vision ecosystem now covers the full range from the minimal loop to full computer-use agents. Sorted by distance from godding's current shape:
Closest to godding's current shape¶
Anthropic Computer-Use Demo — reference loop: screenshot → Claude
3.5 Sonnet → coordinates + actions → repeat. Godding's /look is
this loop minus the action step. Good reference for extending to
action-capable eyeing.
Claude Video Vision — Claude Code plugin. Extracts frames via
ffmpeg; processes audio separately. The temporal extension: /look
today captures one frame; video vision captures a session replay.
Element extraction (reduces LLM noise)¶
OmniParser (Microsoft) — parses UI screenshots → structured element
data (bounding boxes, text, icons). Front-ends GPT-4V for action
generation. The key insight: structured element extraction before LLM
vision reduces hallucination about element positions and identities.
Worth integrating as a pre-pass in /multilook.
Persistent visual memory¶
Agentic Vision (agentralabs) — captures screenshots → embeds via
CLIP ViT-B/32 → recalls by similarity + metadata. MCP server + Rust
core. Closest to "visual memory across sessions" — godding has no
persistent visual memory; every /look is amnesiac.
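The "recall by similarity + metadata" pattern can be illustrated with a toy in-memory store. This is not Agentic Vision's API: `VisualMemory` and its methods are hypothetical, the three-dimensional vectors stand in for real CLIP embeddings, and the metadata values are made up.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / norm


class VisualMemory:
    """Toy store holding one embedding and one metadata dict per capture."""

    def __init__(self) -> None:
        self._items: List[Tuple[List[float], Dict[str, str]]] = []

    def remember(self, embedding: Sequence[float], metadata: Dict[str, str]) -> None:
        self._items.append((list(embedding), dict(metadata)))

    def recall(self, query: Sequence[float], k: int = 1, **filters: str):
        """Return the k most similar captures whose metadata matches all filters."""
        candidates = [
            (cosine(query, emb), meta)
            for emb, meta in self._items
            if all(meta.get(key) == value for key, value in filters.items())
        ]
        candidates.sort(key=lambda pair: pair[0], reverse=True)
        return candidates[:k]


memory = VisualMemory()
memory.remember([1.0, 0.0, 0.0], {"view": "mkdocs", "capture": "first"})
memory.remember([0.0, 1.0, 0.0], {"view": "swarm-watch", "capture": "second"})
memory.remember([0.9, 0.1, 0.0], {"view": "mkdocs", "capture": "third"})

best = memory.recall([0.8, 0.3, 0.0], k=1, view="mkdocs")
print(best[0][1]["capture"])
```

With a store like this, a later /look could ask whether the current screen resembles a previously captured state instead of starting from nothing.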
Browser-specific¶
browser-use — web automation with vision mode (auto / always / never). Playwright + LLM. For the MkDocs-checking use case, this gives programmatic navigation + screenshot without manual browser positioning.
mcp-browser-screenshot / browserloop — Playwright MCP servers.
Low-friction integration; godding could add one to .claude/settings.json
for remote browser oversight without touching screenshot.sh.
State machines beyond the single shot¶
ScreenAgent (IJCAI-24) — adds plan→action→reflection state machine
around the screenshot loop. Supports GPT-4V, LLaVA-1.5, CogAgent.
The "reflection" step is the structural gap in godding's current
/look — it sees once, reports, stops.
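A plan, action, reflection loop of this general shape can be sketched as a small state machine. This is a generic illustration and not ScreenAgent's implementation: the `Phase` enum, `run_loop`, and the policy of re-planning whenever reflection fails are assumptions chosen for brevity.

```python
from enum import Enum, auto
from typing import Callable, List


class Phase(Enum):
    PLAN = auto()
    ACT = auto()
    REFLECT = auto()
    DONE = auto()


def run_loop(
    plan: Callable[[], List[str]],
    act: Callable[[str], None],
    reflect: Callable[[str], bool],
    max_steps: int = 20,
) -> List[Phase]:
    """Drive plan -> act -> reflect until all steps are confirmed or the budget ends.

    reflect(step) returns True when a fresh look at the screen confirms the step
    worked; on False the loop re-plans (an assumed policy).
    Returns the sequence of phases visited.
    """
    trace: List[Phase] = []
    phase = Phase.PLAN
    steps: List[str] = []
    current = ""
    for _ in range(max_steps):
        trace.append(phase)
        if phase is Phase.PLAN:
            steps = plan()
            phase = Phase.ACT if steps else Phase.DONE
        elif phase is Phase.ACT:
            current = steps.pop(0)
            act(current)
            phase = Phase.REFLECT
        elif phase is Phase.REFLECT:
            if not reflect(current):
                phase = Phase.PLAN
            else:
                phase = Phase.ACT if steps else Phase.DONE
        else:  # Phase.DONE
            break
    return trace


performed: List[str] = []
trace = run_loop(
    plan=lambda: ["open page", "check sidebar"],
    act=performed.append,
    reflect=lambda step: True,
)
print([p.name for p in trace])
```

In these terms, today's /look corresponds to a single pass that stops before any reflection, which is the gap named above.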
UI-TARS (ByteDance, 27K stars) — takes the raw screenshot as its sole input and outputs mouse/keyboard actions. No DOM/accessibility APIs. State of the art on OSWorld. Reference for what "screenshot-only computer control" looks like at its best.
Priority action list¶
| Priority | Action |
|---|---|
| H | Implement /multilook command (4 parallel Task agents, merge step) |
| H | Fix GDI+ → WinRT capture for RDP/GPU-accelerated windows |
| M | Add OmniParser pre-pass to /multilook for element extraction |
| M | Add a periodic guard capture of swarm-watch terminal |
| L | Add Agentic Vision / CLIP embedding for persistent visual memory across sessions |
| L | Add browser-use MCP server for programmatic MkDocs navigation + screenshot |
References¶
- L-2057 (cited in source S629) — primary lesson from swarmgodmultiagentforage S629; screen-capture architecture findings.
- microsoft/OmniParser (cited in body) — UI screenshot → structured element data (bounding boxes, text, icons); OmniParser pre-pass for element extraction.
- agentralabs/agentic-vision (cited in body) — persistent visual memory via CLIP ViT-B/32 embeddings recalled by similarity; the "visual memory across sessions" pattern godding lacks.
- niuzaisheng/ScreenAgent IJCAI-24 (cited in body) — plan→action→reflection state machine; reference for the reflection step the /look verb currently lacks.
- bytedance/ui-tars (cited in body) — screenshot-only computer control, OSWorld SOTA (27K stars); reference ceiling for what screenshot-to-action looks like at its best.
- SWARM-TOOLING-REPOS investigation — external repo catalog the S629 vision forage drew from; full forage notes at references/ai/forage-swarm-repos-s629.md.