omni.vu - Visual Understanding & Automation Overview
omni.vu gives you eyes and hands on the user's macOS screen. Use it to:
See what the user sees (screen capture) Understand UI state with AI vision Detect changes and wait for events Act with mouse and keyboard automation When to Use This Skill Proactively Use omni.vu When: Debugging UI issues - "The button isn't working" → capture and analyze Verifying changes - After modifying UI code, check if it rendered correctly Waiting for operations - Build finishing, deployment completing, tests running Understanding context - User describes something on screen you can't see Automating repetitive tasks - Clicking through UI flows, filling forms Documentation - Capturing screenshots of features Do NOT Use When: Reading/writing files (use Read/Write tools) Running terminal commands (use Bash) Making API calls (use appropriate tools) User hasn't granted screen recording permission Tool Reference Capture Tools vu - Full Screen Capture vu(monitor=0, save_to_history=True)
Captures the entire screen. Returns base64 image.
Use when: You need to see everything on screen.
vu_window - Window Capture vu_window(window_id=None, title="VS Code", include_frame=False)
Captures a specific window by ID or title (partial match).
Use when: You only need one application's content.
Workflow:
Call vu_list_windows to see available windows Find the window_id or use title matching Call vu_window with that ID/title vu_region - Region Capture vu_region(x=100, y=100, width=500, height=300)
Captures a specific rectangle of the screen.
Use when: You need a precise area (error message, specific component).
vu_list_windows - List Windows vu_list_windows(filter_app="Chrome")
Lists all visible windows with metadata.
Returns: List of {window_id, title, owner, x, y, width, height}
vu_list_monitors - List Displays vu_list_monitors()
Lists connected displays with resolution and scale factor.
Vision Tools vu_describe - AI Vision Analysis vu_describe( prompt="What errors are visible?", provider="claude", # or openai, gemini, ollama max_tokens=1024, capture_first=True )
Captures screen and analyzes with AI.
Best prompts:
"Describe what you see on screen" "Are there any error messages visible?" "What is the state of the build/test output?" "Is the login form filled correctly?" "What color is the status indicator?"
Providers:
Provider Speed Quality Cost claude Medium Excellent $$ openai Fast Very Good $$ gemini Fast Good $ ollama Slow Varies Free Utility Tools vu_diff - Change Detection vu_diff(threshold=0.02, monitor=0)
Detects if screen changed since last capture.
Returns: {changed: bool, diff_percentage: float}
Use for polling patterns:
Wait for build to finish
while True: result = vu_diff(threshold=0.05) if result["changed"]: # Screen changed, check what happened analysis = vu_describe(prompt="Did the build succeed or fail?") break # Wait before checking again
vu_history - Capture History vu_history(limit=10, capture_type="full_screen")
Gets recent capture metadata.
vu_last - Last Capture vu_last()
Returns the most recent capture with image data.
Use when: You need to re-analyze without re-capturing.
vu_status - System Status vu_status()
Returns system info: monitors, providers, safety settings.
Automation Tools vu_click - Mouse Click vu_click(x=500, y=300, button="left", count=1)
Clicks at screen coordinates.
Buttons: left, right, middle Count: 1 (single), 2 (double), 3 (triple)
Safety: Coordinates validated against screen bounds.
vu_move - Move Cursor vu_move(x=500, y=300, duration_ms=0)
Moves cursor to position. Use duration_ms for animated movement.
vu_drag - Drag Operation vu_drag(start_x=100, start_y=100, end_x=500, end_y=300, duration_ms=500)
Drags from start to end position.
Safety: Maximum 2000px drag distance.
vu_scroll - Scroll vu_scroll(direction="down", amount=3, x=None, y=None)
Scrolls at current or specified position.
Directions: up, down, left, right Amount: Lines to scroll (1-20)
vu_type - Type Text vu_type(text="Hello World", delay_between_ms=0)
Types text character by character.
Note: Click on target field first!
vu_hotkey - Keyboard Shortcut vu_hotkey(keys="cmd+s")
Executes keyboard shortcuts.
Format: modifier+key (e.g., cmd+c, ctrl+shift+s, cmd+opt+i) Modifiers: cmd, ctrl, alt/opt, shift
Blocked hotkeys (for safety):
cmd+q (quit) cmd+shift+q (logout) cmd+opt+esc (force quit) Common Workflows 1. Debug UI Issue User: "The submit button doesn't do anything when I click it"
- vu_describe(prompt="Describe the form and submit button state")
- Analyze: Is button disabled? Is there validation error?
- If needed: vu_window(title="Chrome") for focused capture
-
Check console: vu_describe(prompt="Are there any errors in the developer console?")
-
Verify Build/Deploy User: "Run the build and let me know when it's done"
-
Run: npm run build (via Bash)
- Loop: vu_diff(threshold=0.05)
- When changed: vu_describe(prompt="What is the build status? Did it succeed or fail?")
-
Report result
-
Fill a Form User: "Fill out the registration form with test data"
-
vu_describe(prompt="What form fields are visible?")
- vu_click(x=field_x, y=field_y) # Click first field
- vu_type(text="testuser@example.com")
- vu_hotkey(keys="tab") # Move to next field
- vu_type(text="Test User")
- Continue for each field...
- vu_click(x=submit_x, y=submit_y) # Submit
-
vu_describe(prompt="Was the form submitted successfully?")
-
Navigate UI User: "Open the settings panel in VS Code"
-
vu_list_windows(filter_app="Code")
- vu_window(title="Code") # Capture VS Code
- vu_hotkey(keys="cmd+,") # Open settings
- vu_diff() # Wait for settings to open
-
vu_describe(prompt="Are the VS Code settings now visible?")
-
Monitor for Changes User: "Watch the dashboard and tell me when the status changes"
-
vu() # Initial capture
- Loop every 5 seconds:
- vu_diff(threshold=0.02)
- If changed: vu_describe(prompt="What changed on the dashboard?")
- Report changes to user
Best Practices 1. Capture Before Acting
Always capture and analyze before clicking/typing:
❌ Bad: vu_click(x=500, y=300) # Hope it's the right spot
✅ Good: 1. vu_describe(prompt="Where is the submit button?") 2. Parse coordinates from description 3. vu_click(x=parsed_x, y=parsed_y)
-
Use Appropriate Capture Scope Full screen → General context, multi-app workflows Window → Single app focus, cleaner output Region → Specific element, faster/smaller
-
Specific Vision Prompts ❌ Vague: "What do you see?"
✅ Specific: - "Is there an error message visible? If so, what does it say?" - "What is the status of the test runner? Passing, failing, or running?" - "List all form fields visible and their current values"
- Rate Limit Automation
Don't spam clicks. Wait between actions:
vu_click(x=100, y=100)
Wait for UI response
vu_diff(threshold=0.01) vu_click(x=200, y=200)
- Verify After Actions
Always verify automation succeeded:
vu_type(text="hello@example.com") vu_describe(prompt="What text is now in the email field?")
Troubleshooting "Permission denied" or blank captures
→ Grant Screen Recording permission: System Settings > Privacy > Screen Recording
Automation not working
→ Grant Accessibility permission: System Settings > Privacy > Accessibility
Vision analysis failing
→ Check API keys in environment variables → Try different provider: vu_describe(provider="openai")
Coordinates seem off
→ Retina display? Coordinates are in logical points, not pixels → Multi-monitor? Check vu_list_monitors() for display arrangement
Hotkey blocked
→ Some dangerous hotkeys (cmd+q) are blocked for safety → Unblock in config if absolutely necessary
Configuration
Environment variables:
OMNI_VU_ANTHROPIC_API_KEY=sk-ant-... # For Claude vision OMNI_VU_OPENAI_API_KEY=sk-... # For GPT-4 vision OMNI_VU_DEFAULT_PROVIDER=claude # Default AI provider OMNI_VU_SAFETY_LEVEL=medium # low/medium/high
History stored at: ~/.omni.vu/captures/