omni.vu - Visual Understanding & Automation Overview

omni.vu gives you eyes and hands on the user's macOS screen. Use it to:

See what the user sees (screen capture) Understand UI state with AI vision Detect changes and wait for events Act with mouse and keyboard automation When to Use This Skill Proactively Use omni.vu When: Debugging UI issues - "The button isn't working" → capture and analyze Verifying changes - After modifying UI code, check if it rendered correctly Waiting for operations - Build finishing, deployment completing, tests running Understanding context - User describes something on screen you can't see Automating repetitive tasks - Clicking through UI flows, filling forms Documentation - Capturing screenshots of features Do NOT Use When: Reading/writing files (use Read/Write tools) Running terminal commands (use Bash) Making API calls (use appropriate tools) User hasn't granted screen recording permission Tool Reference Capture Tools vu - Full Screen Capture vu(monitor=0, save_to_history=True)

Captures the entire screen. Returns base64 image.

Use when: You need to see everything on screen.

vu_window - Window Capture vu_window(window_id=None, title="VS Code", include_frame=False)

Captures a specific window by ID or title (partial match).

Use when: You only need one application's content.

Workflow:

Call vu_list_windows to see available windows Find the window_id or use title matching Call vu_window with that ID/title vu_region - Region Capture vu_region(x=100, y=100, width=500, height=300)

Captures a specific rectangle of the screen.

Use when: You need a precise area (error message, specific component).

vu_list_windows - List Windows vu_list_windows(filter_app="Chrome")

Lists all visible windows with metadata.

Returns: List of {window_id, title, owner, x, y, width, height}

vu_list_monitors - List Displays vu_list_monitors()

Lists connected displays with resolution and scale factor.

Vision Tools vu_describe - AI Vision Analysis vu_describe( prompt="What errors are visible?", provider="claude", # or openai, gemini, ollama max_tokens=1024, capture_first=True )

Captures screen and analyzes with AI.

Best prompts:

"Describe what you see on screen" "Are there any error messages visible?" "What is the state of the build/test output?" "Is the login form filled correctly?" "What color is the status indicator?"

Providers:

Provider Speed Quality Cost claude Medium Excellent $$ openai Fast Very Good $$ gemini Fast Good $ ollama Slow Varies Free Utility Tools vu_diff - Change Detection vu_diff(threshold=0.02, monitor=0)

Detects if screen changed since last capture.

Returns: {changed: bool, diff_percentage: float}

Use for polling patterns:

Wait for build to finish

while True: result = vu_diff(threshold=0.05) if result["changed"]: # Screen changed, check what happened analysis = vu_describe(prompt="Did the build succeed or fail?") break # Wait before checking again

vu_history - Capture History vu_history(limit=10, capture_type="full_screen")

Gets recent capture metadata.

vu_last - Last Capture vu_last()

Returns the most recent capture with image data.

Use when: You need to re-analyze without re-capturing.

vu_status - System Status vu_status()

Returns system info: monitors, providers, safety settings.

Automation Tools vu_click - Mouse Click vu_click(x=500, y=300, button="left", count=1)

Clicks at screen coordinates.

Buttons: left, right, middle Count: 1 (single), 2 (double), 3 (triple)

Safety: Coordinates validated against screen bounds.

vu_move - Move Cursor vu_move(x=500, y=300, duration_ms=0)

Moves cursor to position. Use duration_ms for animated movement.

vu_drag - Drag Operation vu_drag(start_x=100, start_y=100, end_x=500, end_y=300, duration_ms=500)

Drags from start to end position.

Safety: Maximum 2000px drag distance.

vu_scroll - Scroll vu_scroll(direction="down", amount=3, x=None, y=None)

Scrolls at current or specified position.

Directions: up, down, left, right Amount: Lines to scroll (1-20)

vu_type - Type Text vu_type(text="Hello World", delay_between_ms=0)

Types text character by character.

Note: Click on target field first!

vu_hotkey - Keyboard Shortcut vu_hotkey(keys="cmd+s")

Executes keyboard shortcuts.

Format: modifier+key (e.g., cmd+c, ctrl+shift+s, cmd+opt+i) Modifiers: cmd, ctrl, alt/opt, shift

Blocked hotkeys (for safety):

cmd+q (quit) cmd+shift+q (logout) cmd+opt+esc (force quit) Common Workflows 1. Debug UI Issue User: "The submit button doesn't do anything when I click it"

vu_describe(prompt="Describe the form and submit button state")
Analyze: Is button disabled? Is there validation error?
If needed: vu_window(title="Chrome") for focused capture
Check console: vu_describe(prompt="Are there any errors in the developer console?")
Verify Build/Deploy User: "Run the build and let me know when it's done"
Run: npm run build (via Bash)
Loop: vu_diff(threshold=0.05)
When changed: vu_describe(prompt="What is the build status? Did it succeed or fail?")
Report result
Fill a Form User: "Fill out the registration form with test data"
vu_describe(prompt="What form fields are visible?")
vu_click(x=field_x, y=field_y) # Click first field
vu_type(text="testuser@example.com")
vu_hotkey(keys="tab") # Move to next field
vu_type(text="Test User")
Continue for each field...
vu_click(x=submit_x, y=submit_y) # Submit
vu_describe(prompt="Was the form submitted successfully?")
Navigate UI User: "Open the settings panel in VS Code"
vu_list_windows(filter_app="Code")
vu_window(title="Code") # Capture VS Code
vu_hotkey(keys="cmd+,") # Open settings
vu_diff() # Wait for settings to open
vu_describe(prompt="Are the VS Code settings now visible?")
Monitor for Changes User: "Watch the dashboard and tell me when the status changes"
vu() # Initial capture
Loop every 5 seconds:
vu_diff(threshold=0.02)
If changed: vu_describe(prompt="What changed on the dashboard?")
Report changes to user

Best Practices 1. Capture Before Acting

Always capture and analyze before clicking/typing:

❌ Bad: vu_click(x=500, y=300) # Hope it's the right spot

✅ Good: 1. vu_describe(prompt="Where is the submit button?") 2. Parse coordinates from description 3. vu_click(x=parsed_x, y=parsed_y)

Use Appropriate Capture Scope Full screen → General context, multi-app workflows Window → Single app focus, cleaner output Region → Specific element, faster/smaller
Specific Vision Prompts ❌ Vague: "What do you see?"

✅ Specific: - "Is there an error message visible? If so, what does it say?" - "What is the status of the test runner? Passing, failing, or running?" - "List all form fields visible and their current values"

Rate Limit Automation

Don't spam clicks. Wait between actions:

vu_click(x=100, y=100)

Wait for UI response

vu_diff(threshold=0.01) vu_click(x=200, y=200)

Verify After Actions

Always verify automation succeeded:

vu_type(text="hello@example.com") vu_describe(prompt="What text is now in the email field?")

Troubleshooting "Permission denied" or blank captures

→ Grant Screen Recording permission: System Settings > Privacy > Screen Recording

Automation not working

→ Grant Accessibility permission: System Settings > Privacy > Accessibility

Vision analysis failing

→ Check API keys in environment variables → Try different provider: vu_describe(provider="openai")

Coordinates seem off

→ Retina display? Coordinates are in logical points, not pixels → Multi-monitor? Check vu_list_monitors() for display arrangement

Hotkey blocked

→ Some dangerous hotkeys (cmd+q) are blocked for safety → Unblock in config if absolutely necessary

Configuration

Environment variables:

OMNI_VU_ANTHROPIC_API_KEY=sk-ant-... # For Claude vision OMNI_VU_OPENAI_API_KEY=sk-... # For GPT-4 vision OMNI_VU_DEFAULT_PROVIDER=claude # Default AI provider OMNI_VU_SAFETY_LEVEL=medium # low/medium/high

History stored at: ~/.omni.vu/captures/

omni-vu

安装

Wait for build to finish

Wait for UI response