# FFmpeg Video Analysis

Extract frames from video files with ffmpeg. Delegate frame reading to sub-agents to preserve the main context window. Synthesise a structured, timestamped summary from text-only sub-agent reports.
## Architecture: Context-Efficient Sub-Agent Pipeline

### Problem

Reading dozens of images into the main conversation context consumes most of the context window, leaving little room for synthesis and follow-up.

### Solution
A 3-phase pipeline:

```
Main Agent                          Sub-Agents (disposable context)
──────────                          ──────────────────────────────
1. ffprobe metadata
2. ffmpeg frame extraction
3. Split frames into batches  ──►   4. Read images (vision)
                                       Write text descriptions to
                                       batch_N_analysis.md
5. Read text files only       ◄───     (context discarded)
6. Synthesise final output
```

Images only ever exist inside sub-agent contexts. The main agent reads only lightweight text files. This cuts context usage by ~90%.

## 1. Prerequisites

```bash
which ffmpeg && which ffprobe
```

If either is missing, show platform-specific install instructions and STOP:

- **macOS**: `brew install ffmpeg`
- **Ubuntu/Debian**: `sudo apt install ffmpeg`
- **Windows**: `choco install ffmpeg` or `winget install ffmpeg`
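A minimal sketch of that guard in POSIX shell (the message text and exit code are illustrative):

```bash
# Sketch: stop early if either binary is missing.
if ! command -v ffmpeg >/dev/null 2>&1 || ! command -v ffprobe >/dev/null 2>&1; then
  echo "ffmpeg/ffprobe not found; see the install instructions above" >&2
  exit 1
fi
```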
## 2. Setup Temp Directory

**macOS/Linux**

```bash
TMPDIR="/tmp/video-analysis-$(date +%s)"
mkdir -p "$TMPDIR"
```
**Windows (PowerShell)**

```powershell
$TMPDIR = "$env:TEMP\video-analysis-$(Get-Date -UFormat %s)"
New-Item -ItemType Directory -Path $TMPDIR
```
## 3. Extract Video Metadata

```bash
ffprobe -v quiet -print_format json -show_format -show_streams "VIDEO_PATH"
```

Extract and report: duration, resolution (width x height), fps, codec, file size, and whether audio is present.

- If no video stream is found, report "audio-only file" and STOP.
- If the file size is > 2GB, warn the user and suggest analysing a time range with `-ss START -to END`.
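A minimal sketch of pulling those fields out of the ffprobe JSON, assuming `jq` is available (it is not listed in the prerequisites, so treat it as optional):

```bash
# Sketch: extract the reported fields from ffprobe's JSON (assumes jq).
META=$(ffprobe -v quiet -print_format json -show_format -show_streams "$VIDEO_PATH")
DURATION=$(echo "$META" | jq -r '.format.duration')
SIZE=$(echo "$META" | jq -r '.format.size')
WIDTH=$(echo "$META" | jq -r '[.streams[] | select(.codec_type=="video")][0].width')
HEIGHT=$(echo "$META" | jq -r '[.streams[] | select(.codec_type=="video")][0].height')
FPS=$(echo "$META" | jq -r '[.streams[] | select(.codec_type=="video")][0].r_frame_rate')
HAS_AUDIO=$(echo "$META" | jq '[.streams[] | select(.codec_type=="audio")] | length > 0')

# No video stream at all -> audio-only file, STOP.
if [ "$WIDTH" = "null" ] || [ -z "$WIDTH" ]; then
  echo "audio-only file"
  exit 0
fi
```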
## 4. Extract Frames

Choose a strategy based on duration:

| Duration | Strategy | Command |
|----------|----------|---------|
| 0-60s | 1 frame every 2s | `ffmpeg -hide_banner -y -i INPUT -vf "fps=1/2,scale='min(1280,iw)':-2" -q:v 5 DIR/frame_%04d.jpg` |
| 1-10min | Scene detection (threshold 0.3) | `ffmpeg -hide_banner -y -i INPUT -vf "select='gt(scene,0.3)',scale='min(1280,iw)':-2" -vsync vfr -q:v 5 DIR/scene_%04d.jpg` |
| 10-30min | Keyframe extraction | `ffmpeg -hide_banner -y -skip_frame nokey -i INPUT -vf "scale='min(1280,iw)':-2" -vsync vfr -q:v 5 DIR/key_%04d.jpg` |
| 30min+ | Thumbnail filter | `ffmpeg -hide_banner -y -i INPUT -vf "thumbnail=SEGMENT_FRAMES,scale='min(1280,iw)':-2" -vsync vfr -q:v 5 DIR/thumb_%04d.jpg` |

For the thumbnail filter, calculate `SEGMENT_FRAMES = total_frames / 60` to cap output at ~60 frames.

Fallbacks:

- Scene detection yields 0 frames → retry with interval extraction at 1 frame/5s
- More than 100 frames extracted → subsample evenly to 80
- Frame extraction fails → try the next simpler strategy (scene → interval, keyframe → interval)

**Time range analysis:** when the user specifies a range, prepend `-ss START -to END` before `-i`.

**Higher detail mode:** if requested, double the fps rate and lower the scene threshold to 0.2.

After extraction, list all frame files and calculate each frame's timestamp from its sequence number and the extraction rate (sketched below).
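A minimal sketch of the `SEGMENT_FRAMES` calculation and the timestamp arithmetic, assuming the interval strategy at 1 frame every 2s; `DURATION` and `FPS` come from the metadata step, and timestamps are approximate:

```bash
# Sketch: cap the thumbnail filter at ~60 output frames.
# FPS may be a fraction such as 30000/1001; awk evaluates it fine.
TOTAL_FRAMES=$(awk "BEGIN { printf \"%d\", ($DURATION) * ($FPS) }")
SEGMENT_FRAMES=$(( TOTAL_FRAMES / 60 ))

# Sketch: timestamp of each interval-extracted frame (fps=1/2 -> one frame per 2s).
INTERVAL=2
for f in "$TMPDIR"/frame_*.jpg; do
  n=$(basename "$f" .jpg); n=${n#frame_}; n=$((10#$n))   # frame_0003.jpg -> 3
  ts=$(( (n - 1) * INTERVAL ))
  printf '%s  %02d:%02d\n' "$f" $((ts / 60)) $((ts % 60))
done
```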
## 5. Delegate Frame Analysis to Sub-Agents

This is the critical context-saving step. Do NOT read frame images in the main conversation. Instead, split frames into batches and delegate each batch to a sub-agent.

### 5a. Prepare Batch Manifest

Split the extracted frame file list into batches of 8-10 frames each. For each batch, record:

- Batch number (1, 2, 3, ...)
- Frame file paths (absolute)
- Frame timestamps (calculated from sequence number)
- Output file path: `TMPDIR/batch_N_analysis.md`

### 5b. Spawn Sub-Agents

For each batch, spawn a sub-agent with the prompt below. Launch all batches in parallel where the tool supports it; they are fully independent.

**Sub-Agent Prompt Template**

Use this prompt verbatim, substituting the placeholders:

```
You are analysing frames extracted from a video file.

VIDEO: {filename}
DURATION: {duration}
BATCH: {batch_number} of {total_batches}

Read each frame image listed below using the Read tool (or an equivalent
file-reading tool that supports images). For each frame, write a structured
description.

FRAMES:
{for each frame}
- {absolute_path_to_frame} (timestamp: {MM:SS})
{end for}

For each frame, describe:
- SCENE: What is visible (layout, UI elements, environment)
- CONTENT: Text, code, labels, menus, or dialogue visible on screen
- ACTION: What is happening or has changed since the likely previous frame
- DETAILS: Any notable specifics (error messages, URLs, file names, button states)

After describing all frames, add a BATCH SUMMARY section with:
- Content type (one of: Screencast, Presentation, Tutorial, Footage, Animation)
- Key events in this batch's time range
- Any text/prompts/commands the user typed (quote exactly)

Write the complete analysis to: {TMPDIR}/batch_{N}_analysis.md
```

Format the output file as:
```markdown
# Batch {N} Analysis ({start_timestamp} - {end_timestamp})

## Frame-by-Frame

### Frame {sequence} ({timestamp})
- Scene: ...
- Content: ...
- Action: ...
- Details: ...

(repeat for each frame)

## Batch Summary
- Content Type: ...
- Key Events: ...
- Quoted Text/Prompts: ...
```

**How to Spawn**

Use whatever sub-agent, background task, or independent agent mechanism your tool provides. The requirements are simple; each sub-agent needs to:

- Read image files (the frame JPEGs)
- Write a text file (the batch analysis markdown)

Launch all batches in parallel if your tool supports it; they are fully independent with no shared state.

**If your tool has no sub-agent mechanism**, fall back to reading frames directly in the main context, but limit it to 20 frames maximum and warn the user about context usage.

### 5c. Collect Results

After all sub-agents complete, read the text analysis files. These are lightweight markdown; no images enter the main context.

```bash
ls TMPDIR/batch_*_analysis.md
```

Read each `batch_N_analysis.md` file **in order**. These contain only text descriptions; the context cost is minimal compared to reading the original images.
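A minimal sketch of a completeness check before synthesis (`TOTAL_BATCHES` is illustrative and comes from the batch manifest in 5a):

```bash
# Sketch: verify every expected batch file exists and is non-empty.
TOTAL_BATCHES=5   # illustrative; taken from the 5a manifest
for n in $(seq 1 "$TOTAL_BATCHES"); do
  f="$TMPDIR/batch_${n}_analysis.md"
  [ -s "$f" ] || echo "WARNING: missing or empty analysis for batch $n: $f" >&2
done
```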
## 6. Synthesise Output

Using only the text from the batch analysis files, perform synthesis in the main context:

1. Merge all frame descriptions into a single chronological timeline
2. Group frames into natural segments (same scene, slide, or screen)
3. Detect the dominant content type across all batches
4. Identify 3-7 key moments
5. Extract all quoted text, prompts, or commands the user typed
6. Write a 2-5 sentence narrative summary

Format the output as:
```markdown
# Video Analysis: [filename]

## Metadata

| Property | Value |
|----------|-------|
| Duration | M:SS |
| Resolution | WxH |
| FPS | N |
| Content Type | [detected] |
| Frames Analysed | N |

## Timeline

### [Segment Title] (M:SS - M:SS)
Description of what happens in this segment.

### [Segment Title] (M:SS - M:SS)
Description of what happens in this segment.

## Key Moments

1. **[M:SS] Title** - Description
2. **[M:SS] Title** - Description
3. **[M:SS] Title** - Description

## Summary

[2-5 sentence narrative paragraph summarising the entire video]
```

## 7. Cleanup

Remove the temp directory after output is complete:
**macOS/Linux**

```bash
rm -rf "$TMPDIR"
```

**Windows (PowerShell)**

```powershell
Remove-Item -Recurse -Force $TMPDIR
```
Skip cleanup if the user asks to keep the frames.
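On macOS/Linux, a minimal sketch of making cleanup automatic with a shell trap (do not install the trap when the user wants to keep the frames):

```bash
# Sketch: remove the temp directory whenever the shell session exits.
trap 'rm -rf "$TMPDIR"' EXIT
```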
## Advanced Options

- **Time range**: "Analyse 2:00 to 5:00 of video.mp4" → use `-ss 120 -to 300`
- **Higher detail**: "Analyse in high detail" → double the frame rate, lower the scene threshold to 0.2
- **Focus area**: "Focus on the code shown" → prioritise text/code extraction in the sub-agent prompts
- **Sprite sheet**: For a visual overview, generate a contact sheet (choosing `EVERY_N` and `ROWS` is sketched after the error-handling list below):

```bash
ffmpeg -hide_banner -y -i INPUT -vf "select='not(mod(n,EVERY_N))',scale='min(320,iw)':-2,tile=5xROWS" -frames:v 1 DIR/sprite.jpg
```

## Error Handling

- ffmpeg not found → install instructions per platform, STOP
- No video stream → report audio-only, STOP
- Scene detection yields 0 frames → fall back to interval extraction
- Too many frames (>100) → subsample evenly to 80
- Large files (>2GB) → warn, suggest a time range
- Sub-agent fails or times out → read that batch's frames directly as a fallback, warn about context usage
- Frame read failure in a sub-agent → skip the frame, note the gap in the batch analysis file
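For the sprite sheet above, a minimal sketch of deriving `EVERY_N` and `ROWS`, assuming a 5-column sheet capped at ~30 tiles (the cap is illustrative; `TOTAL_FRAMES` comes from the metadata step):

```bash
# Sketch: pick EVERY_N and ROWS for a 5-column contact sheet of ~30 tiles.
TILES=30
COLS=5
EVERY_N=$(( TOTAL_FRAMES / TILES ))
[ "$EVERY_N" -lt 1 ] && EVERY_N=1        # short clips: keep every frame
ROWS=$(( (TILES + COLS - 1) / COLS ))    # ceil(30 / 5) = 6
ffmpeg -hide_banner -y -i "$INPUT" \
  -vf "select='not(mod(n,$EVERY_N))',scale='min(320,iw)':-2,tile=${COLS}x${ROWS}" \
  -frames:v 1 "$TMPDIR/sprite.jpg"
```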