video-understand

Understand video content locally using ffmpeg for frame extraction and Whisper for transcription. Fully offline, no API keys required.

Prerequisites

ffmpeg + ffprobe (required): brew install ffmpeg
openai-whisper (optional, for transcription): pip install openai-whisper

Commands

Scene detection + transcribe (default)

python3 skills/video-understand/scripts/understand_video.py video.mp4
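
For readers who want to see what scene detection looks like at the ffmpeg level, here is a minimal sketch that builds an ffmpeg command line around the `select='gt(scene,0.3)'` filter this skill documents. The function name `build_scene_command`, the output pattern, and the `-vsync vfr` flag are assumptions for illustration; the actual script may build its command differently, and some newer ffmpeg releases prefer `-fps_mode vfr`.

```python
from typing import List


def build_scene_command(
    video: str,
    out_pattern: str = "frame_%04d.jpg",  # hypothetical output naming
    threshold: float = 0.3,               # scene-change score threshold
) -> List[str]:
    """Return an ffmpeg argv list that keeps only frames at scene changes."""
    # The single quotes are part of ffmpeg's filter syntax: they stop the
    # comma inside gt(...) from being read as a filter separator.
    select_expr = f"select='gt(scene,{threshold})'"
    return [
        "ffmpeg",
        "-i", video,
        "-vf", select_expr,
        "-vsync", "vfr",  # assumption: emit only the selected frames
        out_pattern,
    ]


if __name__ == "__main__":
    print(" ".join(build_scene_command("video.mp4")))
```

Passing the result to `subprocess.run` as a list avoids shell quoting issues with the filter expression.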

Keyframe extraction

python3 skills/video-understand/scripts/understand_video.py video.mp4 -m keyframe

Regular interval extraction

python3 skills/video-understand/scripts/understand_video.py video.mp4 -m interval
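
Interval mode is described as taking evenly spaced frames based on duration and max-frames. One plausible way to compute such timestamps is sketched below; `interval_timestamps` and `format_timestamp` are hypothetical helpers, and sampling at the midpoint of each slice is an assumption, not necessarily what the script does.

```python
from typing import List


def interval_timestamps(duration: float, max_frames: int = 20) -> List[float]:
    """Split the video into max_frames equal slices and sample each midpoint."""
    if duration <= 0 or max_frames <= 0:
        return []
    step = duration / max_frames
    return [round(step * (i + 0.5), 3) for i in range(max_frames)]


def format_timestamp(seconds: float) -> str:
    """Format seconds as MM:SS, in the style of the timestamp_formatted field."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    for t in interval_timestamps(10.0, 5):
        print(t, format_timestamp(t))
```

Midpoint sampling avoids picking the very first and very last frames, which are often black or fade frames.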

Limit frames extracted

python3 skills/video-understand/scripts/understand_video.py video.mp4 --max-frames 10

Use a larger Whisper model

python3 skills/video-understand/scripts/understand_video.py video.mp4 --whisper-model small

Frames only, skip transcription

python3 skills/video-understand/scripts/understand_video.py video.mp4 --no-transcribe

Quiet mode (JSON only, no progress)

python3 skills/video-understand/scripts/understand_video.py video.mp4 -q
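
Quiet mode is convenient when another program consumes the result. The sketch below wraps the script with `subprocess` and parses its stdout as JSON. The helper names `build_args` and `understand` are hypothetical, and it assumes the script is run from the repository root and exits non-zero on failure.

```python
import json
import subprocess
import sys
from typing import List, Optional

SCRIPT = "skills/video-understand/scripts/understand_video.py"


def build_args(
    video: str,
    mode: str = "scene",
    max_frames: Optional[int] = None,
    transcribe: bool = True,
) -> List[str]:
    """Build the argv list for a quiet (JSON-only) run of the script."""
    args = [sys.executable, SCRIPT, video, "-m", mode, "-q"]
    if max_frames is not None:
        args += ["--max-frames", str(max_frames)]
    if not transcribe:
        args.append("--no-transcribe")
    return args


def understand(video: str, **options) -> dict:
    """Run the script and return its JSON output as a dict."""
    completed = subprocess.run(
        build_args(video, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)
```

For example, `understand("video.mp4", max_frames=10, transcribe=False)` would return the parsed result for a frames-only run.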

Output to file

python3 skills/video-understand/scripts/understand_video.py video.mp4 -o result.json

CLI Options

| Flag | Description |
| --- | --- |
| video | Input video file (positional, required) |
| -m, --mode | Extraction mode: scene (default), keyframe, interval |
| --max-frames | Maximum frames to keep (default: 20) |
| --whisper-model | Whisper model size: tiny, base, small, medium, large (default: base) |
| --no-transcribe | Skip audio transcription, extract frames only |
| -o, --output | Write result JSON to file instead of stdout |
| -q, --quiet | Suppress progress messages, output only JSON |

Extraction Modes

| Mode | How it works | Best for |
| --- | --- | --- |
| scene | Detects scene changes via ffmpeg select='gt(scene,0.3)' | Most videos, varied content |
| keyframe | Extracts I-frames (codec keyframes) | Encoded video with natural keyframe placement |
| interval | Evenly spaced frames based on duration and max-frames | Fixed sampling, predictable output |

If scene mode detects no scene changes, it automatically falls back to interval mode.

Output

The script outputs JSON to stdout (or to a file with -o). See references/output-format.md for the full schema.

{
  "video": "video.mp4",
  "duration": 18.076,
  "resolution": { "width": 1224, "height": 1080 },
  "mode": "scene",
  "frames": [
    { "path": "/abs/path/frame_0001.jpg", "timestamp": 0.0, "timestamp_formatted": "00:00" }
  ],
  "frame_count": 12,
  "transcript": [
    { "start": 0.0, "end": 2.5, "text": "Hello and welcome..." }
  ],
  "text": "Full transcript...",
  "note": "Use the Read tool to view frame images for visual understanding."
}

Use the Read tool on frame image paths to visually inspect extracted frames.

References

references/output-format.md -- Full JSON output schema documentation
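
Because the result JSON carries both frame timestamps and transcript segments with start and end times, the two can be lined up to see what was being said when each frame was captured. The following is a minimal sketch using only the fields shown in the sample output; `speech_at_frames` is a hypothetical helper, and treating a missing or empty `transcript` as "no speech" (for example after `--no-transcribe`) is an assumption.

```python
import json
from typing import Dict, List


def speech_at_frames(result: dict) -> List[Dict[str, str]]:
    """Pair each extracted frame with the transcript segment covering it."""
    segments = result.get("transcript") or []  # may be absent without Whisper
    rows = []
    for frame in result.get("frames", []):
        t = frame["timestamp"]
        # First segment whose [start, end) range contains the frame time.
        spoken = next(
            (s["text"] for s in segments if s["start"] <= t < s["end"]),
            "",
        )
        rows.append(
            {
                "path": frame["path"],
                "time": frame["timestamp_formatted"],
                "text": spoken,
            }
        )
    return rows


if __name__ == "__main__":
    # Assumes a prior run wrote its output with `-o result.json`.
    with open("result.json", encoding="utf-8") as fh:
        for row in speech_at_frames(json.load(fh)):
            print(row["time"], row["path"], row["text"])
```

Frames that fall in silent stretches get an empty string, which makes them easy to filter or flag for visual-only inspection.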
