OS Use - Cross-Platform OS Automation

A comprehensive cross-platform toolkit for OS automation: screenshot capture, visual recognition, mouse/keyboard control, and window management. Supports macOS 12+ and Windows 10+.

Platform Support Matrix

| Feature | macOS Implementation | Windows Implementation |
|---|---|---|
| Screenshot | pyautogui + PIL | pyautogui + PIL |
| Visual Recognition | opencv-python + pyautogui | opencv-python + pyautogui |
| Mouse/Keyboard | pyautogui | pyautogui |
| Window Management | AppleScript (native) | pywinauto / pygetwindow |
| Application Control | AppleScript / subprocess | subprocess / pywinauto |
| Browser Automation | Chrome DevTools MCP | Chrome DevTools MCP |

Capabilities

1. Screenshot Capture 📸

Universal (macOS & Windows):
- Full screen capture
- Region capture (specified coordinates)
- Window capture (specific application window)
- Clipboard screenshot access

Implementation: pyautogui.screenshot() + PIL.Image

2. Visual Recognition 👁️

Universal (macOS & Windows):
- Image matching/locating on screen
- Template matching with confidence threshold
- Multi-scale matching (handles different resolutions)
- Color detection and region extraction

Optional OCR:
- Text recognition from screenshots (requires pytesseract + Tesseract OCR engine)

Implementation: opencv-python + pyautogui.locateOnScreen()

3. Mouse & Keyboard Control 🖱️⌨️

Universal (macOS & Windows):
- Mouse movement (absolute and relative coordinates)
- Mouse clicking (left, right, middle, double-click)
- Mouse dragging and dropping
- Scroll wheel operations
- Keyboard text input
- Keyboard shortcuts and hotkeys
- Special key combinations

Implementation: pyautogui

4. Window Management 🪟

macOS Implementation:
- List all application windows
- Get window position, size, title
- Activate/minimize/close windows
- Move and resize windows
- Launch/quit applications

Implementation: AppleScript via subprocess

Windows Implementation:
- Same capabilities as macOS
- Additional: get window handle (HWND) and process information
- Better integration with the Windows window manager

Implementation: pywinauto or pygetwindow

5. Browser Automation 🌐

Universal (macOS & Windows):
- Webpage screenshots
- Element screenshots
- Page navigation
- Form filling and clicking
- Network monitoring
- Performance analysis

Implementation: Chrome DevTools MCP (separate tool)

6. System Integration 🔧

Clipboard Operations:
- Read/write clipboard content
- Supports images and text

Implementation: pyperclip + pyautogui

Technical Implementation Details

Python Environment Setup
    # Create virtual environment
    python3 -m venv ~/.nanobot/workspace/macos-automation/.venv

    # Activate
    source ~/.nanobot/workspace/macos-automation/.venv/bin/activate

    # Install dependencies
    pip install pyautogui opencv-python-headless numpy Pillow pyperclip

    # macOS specific
    # (AppleScript is built-in, no installation needed)

    # Windows specific
    pip install pywinauto pygetwindow
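The platform split in the setup commands can also be expressed in a small script. The following is a minimal sketch, not part of the skill itself: `build_install_command` and the two package-list constants are hypothetical names, and the package lists are simply copied from the commands above.

```python
import sys

# Package lists copied from the setup commands above.
COMMON_PACKAGES = ["pyautogui", "opencv-python-headless", "numpy", "Pillow", "pyperclip"]
WINDOWS_PACKAGES = ["pywinauto", "pygetwindow"]


def build_install_command(platform_name=None):
    """Return the pip command (as an argument list) for a platform.

    platform_name follows sys.platform conventions ("darwin", "win32").
    """
    if platform_name is None:
        platform_name = sys.platform
    packages = list(COMMON_PACKAGES)
    if platform_name.startswith("win"):
        packages += WINDOWS_PACKAGES
    # "python -m pip" targets the interpreter that is running this script.
    interpreter = sys.executable or "python3"
    return [interpreter, "-m", "pip", "install", *packages]


if __name__ == "__main__":
    print(" ".join(build_install_command()))
```

The returned list can be passed to `subprocess.run` from inside the activated virtual environment; printing it first makes it easy to review what would be installed.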
Key Libraries Reference

| Library | Version | Purpose |
|---|---|---|
| pyautogui | 0.9.54+ | Screenshot, mouse/keyboard control |
| opencv-python-headless | 4.11.0.84+ | Image recognition, computer vision |
| numpy | 2.4.2+ | Numerical operations for OpenCV |
| Pillow | 12.1.1+ | Image processing |
| pyperclip | Latest | Clipboard operations |
| pywinauto | Latest | Windows window management |
| pygetwindow | Latest | Cross-platform window control |
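To check an environment against the minimum versions in the table, a sketch along these lines may help. It uses only the standard library; `version_tuple`, `meets_minimum`, and `check_environment` are hypothetical helper names, and the version comparison is deliberately simplified (a real project would more likely use the `packaging` library).

```python
from importlib import metadata

# Minimum versions taken from the table above; packages listed as "Latest"
# are only checked for presence (minimum of None).
MINIMUM_VERSIONS = {
    "pyautogui": "0.9.54",
    "opencv-python-headless": "4.11.0.84",
    "numpy": "2.4.2",
    "Pillow": "12.1.1",
    "pyperclip": None,
}


def version_tuple(version):
    """Turn '4.11.0.84' into (4, 11, 0, 84); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum (or no minimum is set)."""
    if minimum is None:
        return True
    return version_tuple(installed) >= version_tuple(minimum)


def check_environment(requirements=MINIMUM_VERSIONS):
    """Map each package name to 'ok', 'too old (<version>)', or 'missing'."""
    report = {}
    for name, minimum in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if meets_minimum(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed})"
    return report


if __name__ == "__main__":
    for package, status in check_environment().items():
        print(f"{package}: {status}")
```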
Platform-Specific Notes

macOS Specifics

Permissions Required:
- Accessibility: System Settings > Privacy & Security > Accessibility
- Screen Recording: System Settings > Privacy & Security > Screen Recording

AppleScript Quirks:
- Some modern apps (e.g., Chrome) may have limited AppleScript support
- Window titles may be truncated or localized
- Some operations require the app to be frontmost

Coordinate System:
- Origin (0, 0) is at the top-left
- Retina displays: pyautogui uses logical points for mouse coordinates and pyautogui.size(), while screenshots may be captured at the physical pixel resolution, so positions found in a screenshot may need to be divided by the display scale factor before clicking

Windows Specifics

Administrator Privileges:
- Some operations (e.g., interacting with elevated windows) may require admin rights

High DPI Displays:
- Windows scaling may affect coordinate accuracy
- Use pyautogui.size() to get the actual screen dimensions

Window Handle (HWND):
- Windows provides low-level window handles for precise control
- pywinauto provides both high-level and low-level access

Error Handling Patterns

    import pyautogui
    import time
    # Pattern 1: Retry with backoff
    def retry_with_backoff(func, max_retries=3, base_delay=1):
        for i in range(max_retries):
            try:
                return func()
            except Exception as e:
                if i == max_retries - 1:
                    raise
                delay = base_delay * (2 ** i)
                print(f"Retry {i + 1}/{max_retries} after {delay}s: {e}")
                time.sleep(delay)

    # Pattern 2: Safe operations with fallback
    def safe_screenshot(output_path):
        try:
            screenshot = pyautogui.screenshot()
            screenshot.save(output_path)
            return output_path
        except Exception as e:
            print(f"Screenshot failed: {e}")
            return None

    # Pattern 3: Coordinate boundary checking
    def safe_click(x, y, max_x=None, max_y=None):
        """Click safely by clamping the coordinates to the screen bounds."""
        if max_x is None or max_y is None:
            max_x, max_y = pyautogui.size()
        x = max(0, min(x, max_x - 1))
        y = max(0, min(y, max_y - 1))
        pyautogui.click(x, y)

Usage Examples by Scenario

Scenario 1: Automated Testing

    """
    Automated UI test example.
    Tests a hypothetical login page.
    """
    import pyautogui
    import time

    def locate(image, confidence):
        """Return the match box, or None (some pyautogui versions raise instead)."""
        try:
            return pyautogui.locateOnScreen(image, confidence=confidence)
        except pyautogui.ImageNotFoundException:
            return None

    def test_login_flow():
        # 1. Capture the initial state
        initial_screenshot = pyautogui.screenshot()
        initial_screenshot.save("test_01_initial.png")

        # 2. Find and click the login button
        button_location = locate("login_button.png", confidence=0.9)
        if button_location:
            center = pyautogui.center(button_location)
            pyautogui.click(center.x, center.y)
            time.sleep(1)

        # 3. Enter the username
        pyautogui.typewrite("testuser@example.com", interval=0.01)
        pyautogui.press('tab')

        # 4. Enter the password
        pyautogui.typewrite("TestPassword123", interval=0.01)

        # 5. Submit
        pyautogui.press('return')
        time.sleep(2)

        # 6. Verify the result
        result_screenshot = pyautogui.screenshot()
        result_screenshot.save("test_02_result.png")

        # Check whether the success message appeared
        success_indicator = locate("success_message.png", confidence=0.8)
        if success_indicator:
            print("✅ Test passed: login succeeded")
            return True
        else:
            print("❌ Test failed: success message not found")
            return False

    # Run the test
    if __name__ == "__main__":
        test_login_flow()

Scenario 2: Data Entry Automation

    """
    Data entry automation example.
    Fills a web form automatically from Excel data.
    """
    import sys
    import time

    import pandas as pd
    import pyautogui

    # "Select all" uses Command on macOS and Ctrl on Windows
    SELECT_ALL_MODIFIER = 'command' if sys.platform == 'darwin' else 'ctrl'

    def locate(image, confidence):
        """Return the match box, or None (some pyautogui versions raise instead)."""
        try:
            return pyautogui.locateOnScreen(image, confidence=confidence)
        except pyautogui.ImageNotFoundException:
            return None

    def automate_data_entry(excel_file, form_template):
        """
        Read data from Excel and fill it into the form.

        Args:
            excel_file: path to the Excel file
            form_template: mapping of form field names to Excel column names
        """
        # 1. Read the Excel data
        df = pd.read_excel(excel_file)
        print(f"Read {len(df)} records")

        # 2. Iterate over each record
        for index, row in df.iterrows():
            print(f"\nProcessing record {index + 1}...")

            # 3. Fill in each field
            for field_name, column_name in form_template.items():
                value = row.get(column_name, '')

                # Find the form field (field screenshots must be prepared in advance)
                field_location = locate(f"form_field_{field_name}.png", confidence=0.8)
                if field_location:
                    # Click the field
                    center = pyautogui.center(field_location)
                    pyautogui.click(center.x, center.y)
                    time.sleep(0.2)

                    # Enter the value
                    pyautogui.hotkey(SELECT_ALL_MODIFIER, 'a')  # Select all
                    pyautogui.typewrite(str(value), interval=0.01)
                    time.sleep(0.2)
                else:
                    print(f"  ⚠️ Field not found: {field_name}")

            # 4. Submit the form
            submit_btn = locate("submit_button.png", confidence=0.8)
            if submit_btn:
                center = pyautogui.center(submit_btn)
                pyautogui.click(center.x, center.y)
                print("  ✅ Submitted")
                time.sleep(2)  # Wait for the submission to complete
            else:
                print("  ⚠️ Submit button not found")

            # 5. Prepare for the next record
            # You may need to click "Add new record" or return to the list
            time.sleep(1)

        print("\n🎉 All records processed!")

    # Usage example
    if __name__ == "__main__":
        # Form template: field name -> Excel column name
        form_template = {
            "name": "Name",
            "email": "Email",
            "phone": "Phone",
            "address": "Address"
        }
        automate_data_entry("data.xlsx", form_template)

Scenario 3: Screen Monitoring & Alerting

    """
    Screen monitoring and alerting example.
    Watches a screen region and sends a notification when it changes.
    """
    import pyautogui
    import cv2
    import numpy as np
    import time
    from datetime import datetime

    def monitor_screen_region(region, template_image=None, check_interval=5, callback=None):
        """
        Monitor a screen region for changes.

        Args:
            region: (left, top, width, height) region to monitor
            template_image: path of a template image to look for (optional)
            check_interval: seconds between checks
            callback: function called when a change is detected

        Returns:
            The monitoring session object (call stop() to stop).
            Note that this function blocks until the loop ends.
        """
        class MonitorSession:
            def __init__(self):
                self.running = True
                self.baseline = None

            def stop(self):
                self.running = False

        session = MonitorSession()
        print(f"🔍 Monitoring region: {region}")
        print(f"⏱️ Check interval: {check_interval} seconds")
        print("Press Ctrl+C to stop monitoring\n")

        try:
            while session.running:
                # Capture the current region (converted to RGB in case the capture has an alpha channel)
                current = pyautogui.screenshot(region=region).convert("RGB")
                current_array = np.array(current)

                if template_image:
                    # Mode 1: look for the template image
                    try:
                        template_location = pyautogui.locateOnScreen(template_image, confidence=0.8)
                    except pyautogui.ImageNotFoundException:
                        template_location = None
                    if template_location:
                        print(f"✅ [{datetime.now()}] Template image found: {template_location}")
                        if callback:
                            callback('template_found', {
                                'location': template_location,
                                'screenshot': current
                            })
                else:
                    # Mode 2: detect changes
                    if session.baseline is None:
                        session.baseline = current_array
                        print(f"📸 [{datetime.now()}] Baseline image captured")
                    else:
                        # Compute the difference
                        diff = cv2.absdiff(session.baseline, current_array)
                        diff_gray = cv2.cvtColor(diff, cv2.COLOR_RGB2GRAY)
                        diff_score = np.mean(diff_gray)
                        if diff_score > 10:  # Threshold is adjustable
                            print(f"⚠️ [{datetime.now()}] Change detected! Difference score: {diff_score:.2f}")
                            if callback:
                                callback('change_detected', {
                                    'diff_score': diff_score,
                                    'screenshot': current,
                                    'baseline': session.baseline
                                })
                            # Update the baseline
                            session.baseline = current_array

                time.sleep(check_interval)
        except KeyboardInterrupt:
            print("\n🛑 Monitoring stopped")
        return session

    # Usage example
    def alert_callback(event_type, data):
        """Example alert callback."""
        if event_type == 'template_found':
            print(f"🎯 Template appeared at: {data['location']}")
            # Send a notification, send an email, run an action, etc.
        elif event_type == 'change_detected':
            print(f"📊 Change intensity: {data['diff_score']}")
            # Save a screenshot of the change
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            data['screenshot'].save(f"change_{timestamp}.png")

    if __name__ == "__main__":
        # Example 1: monitor screen changes
        print("=== Monitoring screen changes ===")
        monitor = monitor_screen_region(
            region=(0, 0, 1920, 1080),  # Full screen
            check_interval=5,           # Check every 5 seconds
            callback=alert_callback
        )

        # Stop after 10 minutes (in real use it can run indefinitely)
        # time.sleep(600)
        # monitor.stop()

        # Example 2: look for a specific image
        # monitor = monitor_screen_region(
        #     region=(0, 0, 1920, 1080),
        #     template_image="target_button.png",  # Image to look for
        #     check_interval=2,
        #     callback=alert_callback
        # )
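The change-detection rule in the monitoring loop (mean absolute pixel difference compared against a threshold of 10) can be tried without a screen. The sketch below is a pure-Python stand-in for `cv2.absdiff` followed by `np.mean`, operating on flat lists of grayscale values; `mean_abs_diff` and `has_changed` are hypothetical names introduced only for illustration.

```python
def mean_abs_diff(baseline, current):
    """Mean absolute difference between two equal-length pixel sequences."""
    if len(baseline) != len(current):
        raise ValueError("frames must have the same number of pixels")
    if not baseline:
        return 0.0
    return sum(abs(a - b) for a, b in zip(baseline, current)) / len(baseline)


def has_changed(baseline, current, threshold=10):
    """Apply the same 'score > threshold' rule used in the monitoring loop."""
    return mean_abs_diff(baseline, current) > threshold


if __name__ == "__main__":
    baseline = [0, 0, 0, 0]          # four dark grayscale pixels
    small_change = [0, 0, 0, 40]     # one pixel brightens: score 10.0
    large_change = [0, 0, 200, 200]  # half the region brightens: score 100.0
    print(mean_abs_diff(baseline, small_change), has_changed(baseline, small_change))
    print(mean_abs_diff(baseline, large_change), has_changed(baseline, large_change))
```

Because the score is an average over the whole region, a small but important change inside a large region can stay below the threshold. Monitoring a tighter region is one way to make the score more sensitive.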
Advanced Techniques

Handling Multiple Monitors

    import pyautogui

    def get_all_screen_sizes():
        """Get screen sizes (detailed multi-monitor info is Windows-only here)."""
        # On macOS this returns the primary screen size.
        # On Windows, pygetwindow or win32api can provide multi-monitor information.
        primary = pyautogui.size()
        print(f"Primary screen size: {primary}")

        # Windows example (requires pywin32)
        try:
            import win32api
            monitors = win32api.EnumDisplayMonitors()
            for i, monitor in enumerate(monitors):
                print(f"Monitor {i + 1}: {monitor[2]}")
        except ImportError:
            pass

        return primary

    def screenshot_specific_monitor(monitor_num=0):
        """Screenshot a specific monitor (experimental, not implemented)."""
        # pyautogui currently targets mainly the primary monitor;
        # multi-monitor support needs platform-specific code.
        pass

Performance Optimization

    import cv2
    import numpy as np
    import pyautogui
    import time

    class ScreenCache:
        """Screenshot cache."""

        def __init__(self, cache_duration=0.5):
            self.cache_duration = cache_duration
            self.last_capture = None
            self.last_capture_time = 0

        def get_screenshot(self, region=None):
            """Get a screenshot (cached for full-screen captures)."""
            current_time = time.time()

            # Check whether the cache is still valid
            if (self.last_capture is not None
                    and current_time - self.last_capture_time < self.cache_duration
                    and region is None):
                return self.last_capture

            # Capture a new screenshot
            screenshot = pyautogui.screenshot(region=region)
            if region is None:
                self.last_capture = screenshot
                self.last_capture_time = current_time
            return screenshot

        def clear_cache(self):
            """Clear the cache."""
            self.last_capture = None
            self.last_capture_time = 0

    class FastImageFinder:
        """Image finder that tries the template at several scales."""

        def __init__(self, scales=(0.8, 0.9, 1.0, 1.1, 1.2)):
            self.scales = scales

        def find_multi_scale(self, template_path, screenshot=None, confidence=0.8):
            """
            Multi-scale image search.

            Returns:
                (x, y, scale) or None
            """
            if screenshot is None:
                screenshot = pyautogui.screenshot()

            template = cv2.imread(template_path)
            if template is None:
                return None

            screenshot_cv = cv2.cvtColor(np.array(screenshot.convert("RGB")), cv2.COLOR_RGB2BGR)

            for scale in self.scales:
                # Scale the template
                scaled_template = cv2.resize(template, None, fx=scale, fy=scale,
                                             interpolation=cv2.INTER_AREA)
                h, w = scaled_template.shape[:2]
                # matchTemplate requires the template to fit inside the screenshot
                if h > screenshot_cv.shape[0] or w > screenshot_cv.shape[1]:
                    continue

                # Template matching
                result = cv2.matchTemplate(screenshot_cv, scaled_template, cv2.TM_CCOEFF_NORMED)
                _, max_val, _, max_loc = cv2.minMaxLoc(result)

                if max_val >= confidence:
                    center_x = max_loc[0] + w // 2
                    center_y = max_loc[1] + h // 2
                    return (center_x, center_y, scale)

            return None

    # Usage example
    cache = ScreenCache()
    finder = FastImageFinder()

    # Fast screenshot (cached)
    screenshot = cache.get_screenshot()

    # Multi-scale image search
    result = finder.find_multi_scale("button.png", screenshot)
    if result:
        x, y, scale = result
        print(f"Image found at ({x}, {y}), scale: {scale}")

Security Considerations

    """
    Security best practices
    """
    import pyautogui
    import hashlib
    import time

    class SecureAutomation:
        """Security-minded automation wrapper."""

        def __init__(self):
            self.action_log = []
            self.max_retries = 3
            self.rate_limit_delay = 0.1  # Delay between actions

        def log_action(self, action, details):
            """Record an action in the log."""
            timestamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log_entry = {
                'timestamp': timestamp,
                'action': action,
                'details': details,
                # Short fingerprint for correlating log lines (not a security measure)
                'hash': hashlib.md5(f"{timestamp}{action}{details}".encode()).hexdigest()[:8]
            }
            self.action_log.append(log_entry)

        def safe_click(self, x, y, description=""):
            """Click with validation."""
            try:
                # Verify the coordinates are within the screen bounds
                screen_width, screen_height = pyautogui.size()
                if not (0 <= x < screen_width and 0 <= y < screen_height):
                    raise ValueError(f"Coordinates ({x}, {y}) are outside the screen bounds")

                # Perform the click
                pyautogui.moveTo(x, y, duration=0.2)
                time.sleep(self.rate_limit_delay)
                pyautogui.click()

                # Log the action
                self.log_action('click', f"({x}, {y}) - {description}")
                return True
            except Exception as e:
                self.log_action('click_failed', f"({x}, {y}) - Error: {str(e)}")
                return False

        def safe_typewrite(self, text, interval=0.01):
            """Type text without logging its (possibly sensitive) content."""
            try:
                pyautogui.typewrite(text, interval=interval)
                self.log_action('typewrite', f"Typed {len(text)} characters [content hidden]")
                return True
            except Exception as e:
                self.log_action('typewrite_failed', f"Error: {str(e)}")
                return False

        def get_action_report(self):
            """Generate an action report."""
            total = len(self.action_log)
            successful = sum(1 for log in self.action_log if 'failed' not in log['action'])
            failed = total - successful
            # Avoid dividing by zero when nothing has been logged yet
            success_rate = successful / total * 100 if total else 0.0

            report = (
                "=== Automation Report ===\n"
                f"Total actions: {total}\n"
                f"Successful: {successful}\n"
                f"Failed: {failed}\n"
                f"Success rate: {success_rate:.1f}%\n"
                "Detailed log:\n"
            )
            for log in self.action_log:
                report += f"[{log['timestamp']}] [{log['hash']}] {log['action']}: {log['details']}\n"
            return report

    # Usage example
    secure = SecureAutomation()

    # Perform safe operations
    secure.safe_click(500, 400, "Login button")
    secure.safe_typewrite("username@example.com")
    secure.safe_click(500, 450, "Password field")
    secure.safe_typewrite("**")
    secure.safe_click(500, 500, "Submit button")

    # Generate the report
    print(secure.get_action_report())

Troubleshooting Guide

Common Issues and Solutions

1. Permission Errors

Symptom: pyautogui fails with permission errors or captures black screenshots.

macOS Solution:
- Open System Settings > Privacy & Security > Accessibility
- Add your terminal application (e.g., Terminal.app, iTerm.app, or the Python executable)
- Repeat for the Screen Recording permission

Windows Solution:
- Run as Administrator if needed
- Check that Windows Defender or antivirus software isn't blocking the script

2. Coordinate Inaccuracy

Symptom: Clicks or screenshots miss the intended target.

Possible Causes:
- High DPI / Retina display scaling
- Multiple monitors with different resolutions
- Window decorations or taskbar affecting coordinates

Solution:

    import sys
    import pyautogui

    # Debug: print screen info
    print(f"Screen size: {pyautogui.size()}")
    print(f"Mouse position: {pyautogui.position()}")

    # Handle high DPI (Windows only)
    if sys.platform == "win32":
        import ctypes
        ctypes.windll.user32.SetProcessDPIAware()
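When scaling is the cause, a position measured in a screenshot can be converted back to click coordinates by dividing by the ratio between the screenshot size and the logical screen size. The sketch below shows only that arithmetic; `display_scale` and `to_click_point` are hypothetical helpers, and the numbers in the demo are made up.

```python
def display_scale(screenshot_width, logical_width):
    """Ratio of screenshot pixels to logical screen units (2.0 on many Retina displays)."""
    if logical_width <= 0:
        raise ValueError("logical_width must be positive")
    return screenshot_width / logical_width


def to_click_point(pixel_x, pixel_y, scale):
    """Convert a position measured in screenshot pixels to click coordinates."""
    return (round(pixel_x / scale), round(pixel_y / scale))


if __name__ == "__main__":
    # Hypothetical numbers: a 2880-pixel-wide screenshot of a 1440-unit-wide screen.
    scale = display_scale(2880, 1440)
    print(scale, to_click_point(1000, 600, scale))
```

With pyautogui, the two widths would typically come from `pyautogui.screenshot().width` and `pyautogui.size().width`. If both report the same value, the scale is 1.0 and no conversion is needed.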
3. Image Recognition Failures

Symptom: locateOnScreen returns None (or raises ImageNotFoundException, depending on the pyautogui version) even when the image is visible.

Common Causes:
- Resolution mismatch (template captured at a different scale)
- Color depth differences
- Transparency or alpha channel issues
- Confidence threshold too high

Solutions:

    import pyautogui
    import cv2
    import numpy as np

    # Solution 1: Lower confidence (pyscreeze's default is 0.999)
    location = pyautogui.locateOnScreen('button.png', confidence=0.7)

    # Solution 2: Multi-scale matching (see FastImageFinder class in Performance section)
    finder = FastImageFinder(scales=[0.5, 0.75, 1.0, 1.25, 1.5])
    result = finder.find_multi_scale('button.png')

    # Solution 3: Convert to grayscale for matching
    screenshot = pyautogui.screenshot()
    screenshot_cv = cv2.cvtColor(np.array(screenshot.convert("RGB")), cv2.COLOR_RGB2GRAY)
    template = cv2.imread('button.png', cv2.IMREAD_GRAYSCALE)
    result = cv2.matchTemplate(screenshot_cv, template, cv2.TM_CCOEFF_NORMED)
    min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(result)
    if max_val >= 0.8:
        print(f"Match found, confidence: {max_val}")
        h, w = template.shape
        center_x = max_loc[0] + w // 2
        center_y = max_loc[1] + h // 2
        pyautogui.click(center_x, center_y)

4. Slow Performance

Symptom: Operations are slow, CPU usage is high, or there are noticeable delays.

Optimization Strategies:
- Reduce Screenshot Frequency
  - Cache screenshots when possible
  - Use region-specific captures instead of full screen
- Optimize Image Matching
  - Resize large images before matching
  - Use grayscale matching when color isn't important
  - Set appropriate confidence levels
- Batch Operations
  - Group multiple actions together
  - Minimize unnecessary delays

See the "Performance Optimization" section for detailed code examples.

5. Application-Specific Issues

Browser Automation:
- Modern browsers may block automation
- Use Chrome DevTools Protocol instead of pyautogui for web pages
- Consider Playwright or Selenium for complex web automation

Game/Graphics Applications:
- DirectX/OpenGL apps may not be capturable by standard screenshots
- May require specialized tools (e.g., OBS Studio's capture API)

Protected Content:
- DRM-protected content (Netflix, etc.) cannot be screenshotted
- This is a system-level restriction

Integration with Other Tools

With ChatGPT/AI Assistants

This skill is designed to work with AI assistants like nanobot. Here's how to integrate:
    # Example: AI assistant using this skill
    from datetime import datetime
    import pyautogui

    # parse_intent() and extract_button_name() are placeholders for the
    # assistant's own language-understanding logic; they are not defined here.

    def ai_assisted_automation(user_request):
        """
        AI assistant using the automation skill.

        Args:
            user_request: the user's natural-language request
        """
        # 1. The AI parses the user's intent
        intent = parse_intent(user_request)

        if intent == 'screenshot':
            # 2. Take a screenshot
            screenshot = pyautogui.screenshot()
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            path = f"screenshot_{timestamp}.png"
            screenshot.save(path)
            return f"Screenshot saved to: {path}"

        elif intent == 'click_button':
            # 2. Find and click the button
            button_name = extract_button_name(user_request)
            try:
                location = pyautogui.locateOnScreen(f"{button_name}.png")
            except pyautogui.ImageNotFoundException:
                location = None
            if location:
                pyautogui.click(pyautogui.center(location))
                return f"Clicked button: {button_name}"
            else:
                return f"Button not found: {button_name}"

        # ... handle other intents
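The example calls `parse_intent` and `extract_button_name` without defining them. A real assistant would likely delegate this to its language model; purely as an illustration, a keyword-based stand-in could look like the following. The keyword table and the regular expression are assumptions, not part of the skill.

```python
import re

# Hypothetical keyword table; a real assistant would use its language model.
INTENT_KEYWORDS = {
    "screenshot": ("screenshot", "screen shot", "capture the screen"),
    "click_button": ("click", "press the", "tap"),
}


def parse_intent(user_request):
    """Return the first intent whose keyword appears in the request, else 'unknown'."""
    text = user_request.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return intent
    return "unknown"


def extract_button_name(user_request):
    """Pull a button name out of phrases like 'click the submit button'."""
    match = re.search(
        r"(?:click|press|tap)\s+(?:on\s+)?(?:the\s+)?([\w-]+)",
        user_request.lower(),
    )
    return match.group(1) if match else None


if __name__ == "__main__":
    request = "Click the submit button"
    print(parse_intent(request), extract_button_name(request))
```

Keyword matching is brittle (the first matching intent wins), so this is only suitable for trying out the control flow above before wiring in a real intent parser.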
With CI/CD Pipelines

    # Example: GitHub Actions using this skill for visual testing
    name: Visual Regression Tests
    on: [push, pull_request]
    jobs:
      visual-test:
        runs-on: macos-latest  # or windows-latest
        steps:
          - uses: actions/checkout@v4
          - name: Set up Python
            uses: actions/setup-python@v5
            with:
              python-version: '3.11'
          - name: Install dependencies
            run: |
              pip install pyautogui opencv-python-headless numpy Pillow
          - name: Run visual tests
            run: python tests/visual_regression.py
          - name: Upload screenshots
            uses: actions/upload-artifact@v4
            with:
              name: screenshots
              path: screenshots/

With Monitoring Systems

    # Example: Integration with Prometheus/Grafana for screen monitoring
    from prometheus_client import Gauge, start_http_server
    import cv2
    import numpy as np
    import pyautogui
    import time

    # Define metrics
    screen_change_gauge = Gauge('screen_change_score', 'Screen change detection score')
    template_match_gauge = Gauge('template_match_confidence', 'Template matching confidence')

    start_http_server(8000)

    def monitoring_loop():
        baseline = None
        while True:
            # Capture screen
            current = pyautogui.screenshot()
            current_array = np.array(current)

            if baseline is not None:
                # Calculate change
                diff = cv2.absdiff(baseline, current_array)
                diff_score = np.mean(diff)
                screen_change_gauge.set(diff_score)
            baseline = current_array

            # Check for template
            try:
                location = pyautogui.locateOnScreen('alert_icon.png', confidence=0.8)
                template_match_gauge.set(1.0 if location else 0.0)
            except Exception:
                template_match_gauge.set(0.0)

            time.sleep(5)

    monitoring_loop()

Future Roadmap

Planned Features

Linux Support
- X11 and Wayland compatibility
- xdotool and scrot integration
- mss for multi-monitor support

AI-Powered Recognition
- Integration with OpenAI GPT-4V or Google Gemini for visual understanding
- Natural language element finding ("click the blue submit button")
- OCR-free text extraction using vision models

Mobile Device Support
- Android: ADB (Android Debug Bridge) integration
- iOS: WebDriverAgent via Appium
- Screenshot and touch simulation

Cloud Integration
- AWS Lambda support for serverless automation
- Azure Functions and GCP Cloud Functions compatibility
- Distributed screenshot processing

Advanced Analytics
- Built-in A/B testing framework for UI changes
- Heatmap generation from user interactions
- Performance regression detection

Contributing

We welcome contributions! Please see the Contributing Guide for details on:
- Code style and formatting
- Testing requirements
- Documentation standards
- Pull request process

License

This skill is licensed under the MIT License. See LICENSE for details.

Last Updated: 2026-03-06
Version: 1.0.0
Maintainer: nanobot skills team