Comprehensive guide for processing multimodal inputs with Gemini 3 Pro, including image understanding, video analysis, audio processing, and PDF document extraction. This skill focuses on INPUT processing (analyzing media) - see gemini-3-image-generation for OUTPUT (generating images).
Overview
Gemini 3 Pro provides native multimodal capabilities for understanding and analyzing various media types. This skill covers all input processing operations with granular control over quality, performance, and token consumption.
Key Capabilities
-
Image Understanding: Object detection, OCR, visual Q&A, code from screenshots
-
Video Processing: Up to 1 hour of video, frame analysis, OCR
-
Audio Processing: Up to 9.5 hours of audio, speech understanding
-
PDF Documents: Native PDF support, multi-page analysis, text extraction
-
Media Resolution Control: Low/medium/high resolution for token optimization
-
Token Optimization: Granular control over processing costs
When to Use This Skill
-
Analyzing images, photos, or screenshots
-
Processing video content for insights
-
Transcribing or understanding audio/speech
-
Extracting information from PDF documents
-
Building multimodal applications
-
Optimizing media processing costs
Quick Start
Prerequisites
-
Gemini API setup (see
gemini-3-pro-apiskill) -
Media files in supported formats
Python Quick Start
import google.generativeai as genai
from pathlib import Path
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-3-pro-preview")
# Upload and analyze image
image_file = genai.upload_file(Path("photo.jpg"))
response = model.generate_content([
"What's in this image?",
image_file
])
print(response.text)
Node.js Quick Start
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager } from "@google/generative-ai/server";
import fs from "fs";
const genAI = new GoogleGenerativeAI("YOUR_API_KEY");
const fileManager = new GoogleAIFileManager("YOUR_API_KEY");
// Upload and analyze image
const uploadResult = await fileManager.uploadFile("photo.jpg", {
mimeType: "image/jpeg"
});
const model = genAI.getGenerativeModel({ model: "gemini-3-pro-preview" });
const result = await model.generateContent([
"What's in this image?",
{ fileData: { fileUri: uploadResult.file.uri, mimeType: uploadResult.file.mimeType } }
]);
console.log(result.response.text());
Core Tasks
Task 1: Analyze Image Content
Goal: Extract information, objects, text, or insights from images.
Use Cases:
-
Object detection and recognition
-
OCR (text extraction from images)
-
Visual Q&A
-
Code generation from UI screenshots
-
Chart/diagram analysis
-
Product identification
Python Example:
import google.generativeai as genai
from pathlib import Path
genai.configure(api_key="YOUR_API_KEY")
# Configure model with high resolution for best quality
model = genai.GenerativeModel(
"gemini-3-pro-preview",
generation_config={
"thinking_level": "high",
"media_resolution": "high" # 1,120 tokens per image
}
)
# Upload image
image_path = Path("screenshot.png")
image_file = genai.upload_file(image_path)
# Analyze with specific prompt
response = model.generate_content([
"""Analyze this image and provide:
1. Main objects and their locations
2. Any visible text (OCR)
3. Overall context and purpose
4. If code/UI: describe the functionality
""",
image_file
])
print(response.text)
# Check token usage
print(f"Tokens used: {response.usage_metadata.total_token_count}")
Node.js Example:
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager } from "@google/generative-ai/server";
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);
// Upload image
const uploadResult = await fileManager.uploadFile("screenshot.png", {
mimeType: "image/png"
});
// Configure model with high resolution
const model = genAI.getGenerativeModel({
model: "gemini-3-pro-preview",
generationConfig: {
thinking_level: "high",
media_resolution: "high" // Best quality for OCR
}
});
const result = await model.generateContent([
`Analyze this image and provide:
1. Main objects and their locations
2. Any visible text (OCR)
3. Overall context and purpose`,
{ fileData: { fileUri: uploadResult.file.uri, mimeType: uploadResult.file.mimeType } }
]);
console.log(result.response.text());
Resolution Options:
| low
| 280 tokens
| Quick analysis, low detail
| medium
| 560 tokens
| Balanced quality/cost
| high
| 1,120 tokens
| OCR, fine details, small text
Supported Formats: JPEG, PNG, WEBP, HEIC, HEIF
See: references/image-understanding.md for advanced patterns
Task 2: Process Video Content
Goal: Analyze video content, extract insights, perform frame-by-frame analysis.
Use Cases:
-
Video summarization
-
Object tracking
-
Scene detection
-
Video OCR
-
Content moderation
-
Educational video analysis
Python Example:
import google.generativeai as genai
from pathlib import Path
genai.configure(api_key="YOUR_API_KEY")
# Configure for video processing
model = genai.GenerativeModel(
"gemini-3-pro-preview",
generation_config={
"thinking_level": "high",
"media_resolution": "medium" # 70 tokens/frame (balanced)
}
)
# Upload video (up to 1 hour supported)
video_path = Path("tutorial.mp4")
video_file = genai.upload_file(video_path)
# Wait for processing
import time
while video_file.state.name == "PROCESSING":
time.sleep(5)
video_file = genai.get_file(video_file.name)
if video_file.state.name == "FAILED":
raise ValueError("Video processing failed")
# Analyze video
response = model.generate_content([
"""Analyze this video and provide:
1. Overall summary of content
2. Key scenes and timestamps
3. Main topics covered
4. Any visible text throughout the video
""",
video_file
])
print(response.text)
print(f"Tokens used: {response.usage_metadata.total_token_count}")
Node.js Example:
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager, FileState } from "@google/generative-ai/server";
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);
// Upload video
const uploadResult = await fileManager.uploadFile("tutorial.mp4", {
mimeType: "video/mp4"
});
// Wait for processing
let file = await fileManager.getFile(uploadResult.file.name);
while (file.state === FileState.PROCESSING) {
await new Promise(resolve => setTimeout(resolve, 5000));
file = await fileManager.getFile(uploadResult.file.name);
}
if (file.state === FileState.FAILED) {
throw new Error("Video processing failed");
}
// Analyze video
const model = genAI.getGenerativeModel({
model: "gemini-3-pro-preview",
generationConfig: {
media_resolution: "medium"
}
});
const result = await model.generateContent([
`Analyze this video and provide:
1. Overall summary
2. Key scenes and timestamps
3. Main topics covered`,
{ fileData: { fileUri: file.uri, mimeType: file.mimeType } }
]);
console.log(result.response.text());
Video Specs:
-
Max Duration: 1 hour
-
Formats: MP4, MOV, AVI, etc.
-
Resolution Options: Low (70 tokens/frame), Medium (70 tokens/frame), High (280 tokens/frame)
-
OCR: Available with high resolution
See: references/video-processing.md for advanced patterns
Task 3: Process Audio/Speech
Goal: Transcribe and understand audio content, process speech.
Use Cases:
-
Audio transcription
-
Speech analysis
-
Podcast summarization
-
Meeting notes
-
Language understanding
-
Audio classification
Python Example:
import google.generativeai as genai
from pathlib import Path
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-3-pro-preview")
# Upload audio file (up to 9.5 hours supported)
audio_path = Path("podcast.mp3")
audio_file = genai.upload_file(audio_path)
# Wait for processing
import time
while audio_file.state.name == "PROCESSING":
time.sleep(5)
audio_file = genai.get_file(audio_file.name)
# Process audio
response = model.generate_content([
"""Process this audio and provide:
1. Full transcription
2. Summary of main points
3. Key speakers (if multiple)
4. Important timestamps
5. Action items or conclusions
""",
audio_file
])
print(response.text)
print(f"Tokens used: {response.usage_metadata.total_token_count}")
Node.js Example:
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager, FileState } from "@google/generative-ai/server";
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);
// Upload audio
const uploadResult = await fileManager.uploadFile("podcast.mp3", {
mimeType: "audio/mp3"
});
// Wait for processing
let file = await fileManager.getFile(uploadResult.file.name);
while (file.state === FileState.PROCESSING) {
await new Promise(resolve => setTimeout(resolve, 5000));
file = await fileManager.getFile(uploadResult.file.name);
}
const model = genAI.getGenerativeModel({ model: "gemini-3-pro-preview" });
const result = await model.generateContent([
`Process this audio and provide:
1. Full transcription
2. Summary of main points
3. Key timestamps`,
{ fileData: { fileUri: file.uri, mimeType: file.mimeType } }
]);
console.log(result.response.text());
Audio Specs:
-
Max Duration: 9.5 hours
-
Formats: WAV, MP3, FLAC, AAC, etc.
-
Languages: Supports multiple languages
See: references/audio-processing.md for advanced patterns
Task 4: Process PDF Documents
Goal: Extract and analyze content from PDF documents.
Use Cases:
-
Document analysis
-
Information extraction
-
Form processing
-
Research paper analysis
-
Contract review
-
Multi-page document understanding
Python Example:
import google.generativeai as genai
from pathlib import Path
genai.configure(api_key="YOUR_API_KEY")
# Configure with medium resolution (recommended for PDFs)
model = genai.GenerativeModel(
"gemini-3-pro-preview",
generation_config={
"thinking_level": "high",
"media_resolution": "medium" # 560 tokens/page (saturation point)
}
)
# Upload PDF
pdf_path = Path("research_paper.pdf")
pdf_file = genai.upload_file(pdf_path)
# Wait for processing
import time
while pdf_file.state.name == "PROCESSING":
time.sleep(5)
pdf_file = genai.get_file(pdf_file.name)
# Analyze PDF
response = model.generate_content([
"""Analyze this PDF document and provide:
1. Document type and purpose
2. Main sections and structure
3. Key findings or arguments
4. Important data or statistics
5. Conclusions or recommendations
""",
pdf_file
])
print(response.text)
print(f"Tokens used: {response.usage_metadata.total_token_count}")
Node.js Example:
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager, FileState } from "@google/generative-ai/server";
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);
// Upload PDF
const uploadResult = await fileManager.uploadFile("research_paper.pdf", {
mimeType: "application/pdf"
});
// Wait for processing
let file = await fileManager.getFile(uploadResult.file.name);
while (file.state === FileState.PROCESSING) {
await new Promise(resolve => setTimeout(resolve, 5000));
file = await fileManager.getFile(uploadResult.file.name);
}
// Analyze with medium resolution (recommended)
const model = genAI.getGenerativeModel({
model: "gemini-3-pro-preview",
generationConfig: {
media_resolution: "medium"
}
});
const result = await model.generateContent([
`Analyze this PDF and extract:
1. Main sections
2. Key findings
3. Important data`,
{ fileData: { fileUri: file.uri, mimeType: file.mimeType } }
]);
console.log(result.response.text());
PDF Processing Tips:
-
Recommended Resolution:
medium(560 tokens/page) - saturation point for quality -
Multi-page: Automatically processes all pages
-
Native Support: No conversion to images needed
-
Text Extraction: High-quality text extraction built-in
See: references/document-processing.md for advanced patterns
Task 5: Optimize Media Processing Costs
Goal: Balance quality and token consumption based on use case.
Strategy:
| Images
| low
| 280
| Quick scan, thumbnails
| Images
| medium
| 560
| General analysis
| Images
| high
| 1,120
| OCR, fine details, code
| PDFs
| medium
| 560/page
| Recommended (saturation point)
| PDFs
| high
| 1,120/page
| Diminishing returns
| Video
| low/medium
| 70/frame
| Most use cases
| Video
| high
| 280/frame
| OCR from video
Python Optimization Example:
import google.generativeai as genai
genai.configure(api_key="YOUR_API_KEY")
# Different resolutions for different use cases
def analyze_image_optimized(image_path, need_ocr=False):
"""Analyze image with appropriate resolution"""
resolution = "high" if need_ocr else "medium"
model = genai.GenerativeModel(
"gemini-3-pro-preview",
generation_config={
"media_resolution": resolution
}
)
image_file = genai.upload_file(image_path)
response = model.generate_content([
"Describe this image" if not need_ocr else "Extract all text from this image",
image_file
])
# Log token usage for cost tracking
tokens = response.usage_metadata.total_token_count
cost = (tokens / 1_000_000) * 2.00 # Input pricing
print(f"Resolution: {resolution}, Tokens: {tokens}, Cost: ${cost:.6f}")
return response.text
# Use appropriate resolution
analyze_image_optimized("photo.jpg", need_ocr=False) # medium
analyze_image_optimized("document.png", need_ocr=True) # high
Per-Item Resolution Control:
# Set different resolutions for different media in same request
response = model.generate_content([
"Compare these images",
{"file": image1, "media_resolution": "high"}, # High detail
{"file": image2, "media_resolution": "low"}, # Low detail OK
])
Cost Monitoring:
def log_media_costs(response):
"""Log media processing costs"""
usage = response.usage_metadata
# Pricing for ≤200k context
input_cost = (usage.prompt_token_count / 1_000_000) * 2.00
output_cost = (usage.candidates_token_count / 1_000_000) * 12.00
print(f"Input tokens: {usage.prompt_token_count} (${input_cost:.6f})")
print(f"Output tokens: {usage.candidates_token_count} (${output_cost:.6f})")
print(f"Total cost: ${input_cost + output_cost:.6f}")
See: references/token-optimization.md for comprehensive strategies
Media Resolution Control
Resolution Options
| low
| 280 tokens
| 280 tokens
| 70 tokens
| Quick analysis, low detail
| medium
| 560 tokens
| 560 tokens
| 70 tokens
| Balanced quality/cost
| high
| 1,120 tokens
| 1,120 tokens
| 280 tokens
| OCR, fine text, details
Configuration
Global Setting (all media):
model = genai.GenerativeModel(
"gemini-3-pro-preview",
generation_config={
"media_resolution": "high" # Applies to all media
}
)
Per-Item Setting (mixed resolutions):
response = model.generate_content([
"Analyze these files",
{"file": high_detail_image, "media_resolution": "high"},
{"file": low_detail_image, "media_resolution": "low"}
])
Best Practices
-
Images: Use
highfor OCR/text extraction,mediumfor general analysis -
PDFs: Use
medium(saturation point - higher resolutions show diminishing returns) -
Video: Use
lowormediumunless OCR needed -
Cost Control: Start with
low, increase only if quality insufficient
See: references/media-resolution.md for detailed guide
File Management
Upload Files
import google.generativeai as genai
# Upload file
file = genai.upload_file("path/to/file.jpg")
print(f"Uploaded: {file.name}")
# Check processing status
while file.state.name == "PROCESSING":
time.sleep(5)
file = genai.get_file(file.name)
print(f"Status: {file.state.name}")
List Uploaded Files
# List all files
for file in genai.list_files():
print(f"{file.name} - {file.display_name}")
Delete Files
# Delete specific file
genai.delete_file(file.name)
# Delete all files
for file in genai.list_files():
genai.delete_file(file.name)
print(f"Deleted: {file.name}")
File Lifecycle
-
Upload: Immediate
-
Processing: Async (especially for video/audio)
-
Storage: Files persist until deleted
-
Expiration: Files may expire after period (check docs)
Multi-File Processing
Process Multiple Images
# Upload multiple images
images = [
genai.upload_file("photo1.jpg"),
genai.upload_file("photo2.jpg"),
genai.upload_file("photo3.jpg")
]
# Analyze together
response = model.generate_content([
"Compare these images and identify common elements",
*images
])
print(response.text)
Mixed Media Types
# Combine different media types
image = genai.upload_file("chart.png")
pdf = genai.upload_file("report.pdf")
response = model.generate_content([
"Does the chart match the data in the report?",
image,
pdf
])
References
Core Guides
-
Image Understanding - Complete image analysis patterns
-
Video Processing - Video analysis and frame extraction
-
Audio Processing - Audio transcription and analysis
-
Document Processing - PDF and document extraction
Optimization
-
Media Resolution - Resolution control and quality tuning
-
OCR Extraction - Text extraction best practices
-
Token Optimization - Cost control and efficiency
Scripts
-
Analyze Image Script - Production-ready image analysis
-
Process Video Script - Video processing automation
-
Process Audio Script - Audio transcription
-
Process PDF Script - PDF extraction
Official Resources
Related Skills
-
gemini-3-pro-api - Basic setup, authentication, text generation
-
gemini-3-image-generation - Image OUTPUT (generating images)
-
gemini-3-advanced - Function calling, tools, caching, batch processing
Common Use Cases
Visual Q&A Application
Combine image understanding with chat:
model = genai.GenerativeModel("gemini-3-pro-preview")
chat = model.start_chat()
# Upload image
image = genai.upload_file("product.jpg")
# Ask questions about it
response1 = chat.send_message(["What product is this?", image])
response2 = chat.send_message("What are its main features?")
response3 = chat.send_message("What's the price range for similar products?")
Document Analysis Pipeline
Process multiple PDFs and extract insights:
import google.generativeai as genai from pathlib import Path
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel( "gemini-3-pro-preview", generation_config={"media_resolution": "medium"} )
Process all PDFs in directory
pdf_dir = Path("documents/") results = {}
for pdf_path in pdf_dir.glob("*.pdf"): pdf_file = genai.upload_file(pdf_path)
# Wait for processing
while pdf_file.state.name == "PROCESSING":
time.sleep(5)
pdf_file = genai.get_file(pdf_file.name)
# Extract key information
response = model.generate_content([
"Extract: 1) Document type, 2) Key dates, 3) Important numbers, 4) Summary",
pdf_file
])
results[pdf_path.name] = response.text
# Clean up
genai.delete_file(pdf_file.name)
Save results
import json with open("analysis_results.json", "w") as f: json.dump(results<span class="token pun