gemini-3-multimodal

Installs: 39
Rank: #18255

Install

npx skills add https://github.com/adaptationio/skrillz --skill gemini-3-multimodal

Comprehensive guide for processing multimodal inputs with Gemini 3 Pro, including image understanding, video analysis, audio processing, and PDF document extraction. This skill focuses on INPUT processing (analyzing media) - see gemini-3-image-generation for OUTPUT (generating images).

Overview

Gemini 3 Pro provides native multimodal capabilities for understanding and analyzing various media types. This skill covers all input processing operations with granular control over quality, performance, and token consumption.

Key Capabilities

  • Image Understanding: Object detection, OCR, visual Q&A, code from screenshots

  • Video Processing: Up to 1 hour of video, frame analysis, OCR

  • Audio Processing: Up to 9.5 hours of audio, speech understanding

  • PDF Documents: Native PDF support, multi-page analysis, text extraction

  • Media Resolution Control: Low/medium/high resolution for token optimization

  • Token Optimization: Granular control over processing costs

When to Use This Skill

  • Analyzing images, photos, or screenshots

  • Processing video content for insights

  • Transcribing or understanding audio/speech

  • Extracting information from PDF documents

  • Building multimodal applications

  • Optimizing media processing costs

Quick Start

Prerequisites

  • Gemini API setup (see gemini-3-pro-api skill)

  • Media files in supported formats

Python Quick Start

import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-3-pro-preview")

# Upload and analyze image
image_file = genai.upload_file(Path("photo.jpg"))
response = model.generate_content([
    "What's in this image?",
    image_file
])
print(response.text)

Node.js Quick Start

import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager } from "@google/generative-ai/server";
import fs from "fs";

const genAI = new GoogleGenerativeAI("YOUR_API_KEY");
const fileManager = new GoogleAIFileManager("YOUR_API_KEY");

// Upload and analyze image
const uploadResult = await fileManager.uploadFile("photo.jpg", {
  mimeType: "image/jpeg"
});

const model = genAI.getGenerativeModel({ model: "gemini-3-pro-preview" });
const result = await model.generateContent([
  "What's in this image?",
  { fileData: { fileUri: uploadResult.file.uri, mimeType: uploadResult.file.mimeType } }
]);

console.log(result.response.text());

Core Tasks

Task 1: Analyze Image Content

Goal: Extract information, objects, text, or insights from images.

Use Cases:

  • Object detection and recognition

  • OCR (text extraction from images)

  • Visual Q&A

  • Code generation from UI screenshots

  • Chart/diagram analysis

  • Product identification

Python Example:

import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")

# Configure model with high resolution for best quality
model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={
        "thinking_level": "high",
        "media_resolution": "high"  # 1,120 tokens per image
    }
)

# Upload image
image_path = Path("screenshot.png")
image_file = genai.upload_file(image_path)

# Analyze with specific prompt
response = model.generate_content([
    """Analyze this image and provide:
    1. Main objects and their locations
    2. Any visible text (OCR)
    3. Overall context and purpose
    4. If code/UI: describe the functionality
    """,
    image_file
])

print(response.text)

# Check token usage
print(f"Tokens used: {response.usage_metadata.total_token_count}")

Node.js Example:

import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager } from "@google/generative-ai/server";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);

// Upload image
const uploadResult = await fileManager.uploadFile("screenshot.png", {
  mimeType: "image/png"
});

// Configure model with high resolution
const model = genAI.getGenerativeModel({
  model: "gemini-3-pro-preview",
  generationConfig: {
    thinking_level: "high",
    media_resolution: "high"  // Best quality for OCR
  }
});

const result = await model.generateContent([
  `Analyze this image and provide:
  1. Main objects and their locations
  2. Any visible text (OCR)
  3. Overall context and purpose`,
  { fileData: { fileUri: uploadResult.file.uri, mimeType: uploadResult.file.mimeType } }
]);

console.log(result.response.text());

Resolution Options:

| Resolution | Tokens per image | Best for |
|------------|------------------|----------|
| low | 280 | Quick analysis, low detail |
| medium | 560 | Balanced quality/cost |
| high | 1,120 | OCR, fine details, small text |

Supported Formats: JPEG, PNG, WEBP, HEIC, HEIF
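The Node.js examples above pass `mimeType` explicitly. When you need to derive it from a filename, Python's standard `mimetypes` module covers the formats listed above; `guess_image_mime` is an illustrative helper, not part of the SDK.

```python
import mimetypes

# Register extensions the stdlib table may not know on every platform.
mimetypes.add_type("image/webp", ".webp")
mimetypes.add_type("image/heic", ".heic")
mimetypes.add_type("image/heif", ".heif")

# Formats this skill lists as supported image input.
SUPPORTED_IMAGE_MIMES = {
    "image/jpeg", "image/png", "image/webp", "image/heic", "image/heif",
}

def guess_image_mime(path: str) -> str:
    """Guess a MIME type from the file extension and verify it is a
    supported image format before uploading."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or mime not in SUPPORTED_IMAGE_MIMES:
        raise ValueError(f"Unsupported or unknown image format: {path}")
    return mime

print(guess_image_mime("photo.jpg"))  # image/jpeg
```

Validating the MIME type locally fails fast on unsupported files instead of waiting for an upload error.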

See: references/image-understanding.md for advanced patterns

Task 2: Process Video Content

Goal: Analyze video content, extract insights, perform frame-by-frame analysis.

Use Cases:

  • Video summarization

  • Object tracking

  • Scene detection

  • Video OCR

  • Content moderation

  • Educational video analysis

Python Example:

import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")

# Configure for video processing
model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={
        "thinking_level": "high",
        "media_resolution": "medium"  # 70 tokens/frame (balanced)
    }
)

# Upload video (up to 1 hour supported)
video_path = Path("tutorial.mp4")
video_file = genai.upload_file(video_path)

# Wait for processing
import time
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

if video_file.state.name == "FAILED":
    raise ValueError("Video processing failed")

# Analyze video
response = model.generate_content([
    """Analyze this video and provide:
    1. Overall summary of content
    2. Key scenes and timestamps
    3. Main topics covered
    4. Any visible text throughout the video
    """,
    video_file
])

print(response.text)
print(f"Tokens used: {response.usage_metadata.total_token_count}")

Node.js Example:

import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager, FileState } from "@google/generative-ai/server";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);

// Upload video
const uploadResult = await fileManager.uploadFile("tutorial.mp4", {
  mimeType: "video/mp4"
});

// Wait for processing
let file = await fileManager.getFile(uploadResult.file.name);
while (file.state === FileState.PROCESSING) {
  await new Promise(resolve => setTimeout(resolve, 5000));
  file = await fileManager.getFile(uploadResult.file.name);
}

if (file.state === FileState.FAILED) {
  throw new Error("Video processing failed");
}

// Analyze video
const model = genAI.getGenerativeModel({
  model: "gemini-3-pro-preview",
  generationConfig: {
    media_resolution: "medium"
  }
});

const result = await model.generateContent([
  `Analyze this video and provide:
  1. Overall summary
  2. Key scenes and timestamps
  3. Main topics covered`,
  { fileData: { fileUri: file.uri, mimeType: file.mimeType } }
]);

console.log(result.response.text());

Video Specs:

  • Max Duration: 1 hour

  • Formats: MP4, MOV, AVI, etc.

  • Resolution Options: Low/medium (70 tokens/frame), high (280 tokens/frame)

  • OCR: Available with high resolution
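As a back-of-envelope check before uploading, the per-frame figures above give a rough token budget. The 1 frame-per-second sampling rate used here is an assumption (this document does not state the rate); confirm it against the official docs.

```python
# Per-frame token costs from the video specs above.
TOKENS_PER_FRAME = {"low": 70, "medium": 70, "high": 280}

def estimate_video_tokens(duration_seconds: float, resolution: str = "medium",
                          frames_per_second: float = 1.0) -> int:
    """Rough token estimate for video input.

    frames_per_second is an assumed sampling rate (commonly 1 fps);
    verify the actual rate in the official documentation.
    """
    frames = duration_seconds * frames_per_second
    return int(frames * TOKENS_PER_FRAME[resolution])

# A 10-minute tutorial at medium resolution:
print(estimate_video_tokens(600, "medium"))  # 42000
```

An estimate like this helps decide up front whether a long video fits your context and cost budget, or whether to drop to low resolution.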

See: references/video-processing.md for advanced patterns

Task 3: Process Audio/Speech

Goal: Transcribe and understand audio content, process speech.

Use Cases:

  • Audio transcription

  • Speech analysis

  • Podcast summarization

  • Meeting notes

  • Language understanding

  • Audio classification

Python Example:

import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel("gemini-3-pro-preview")

# Upload audio file (up to 9.5 hours supported)
audio_path = Path("podcast.mp3")
audio_file = genai.upload_file(audio_path)

# Wait for processing
import time
while audio_file.state.name == "PROCESSING":
    time.sleep(5)
    audio_file = genai.get_file(audio_file.name)

# Process audio
response = model.generate_content([
    """Process this audio and provide:
    1. Full transcription
    2. Summary of main points
    3. Key speakers (if multiple)
    4. Important timestamps
    5. Action items or conclusions
    """,
    audio_file
])

print(response.text)
print(f"Tokens used: {response.usage_metadata.total_token_count}")

Node.js Example:

import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager, FileState } from "@google/generative-ai/server";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);

// Upload audio
const uploadResult = await fileManager.uploadFile("podcast.mp3", {
  mimeType: "audio/mp3"
});

// Wait for processing
let file = await fileManager.getFile(uploadResult.file.name);
while (file.state === FileState.PROCESSING) {
  await new Promise(resolve => setTimeout(resolve, 5000));
  file = await fileManager.getFile(uploadResult.file.name);
}

const model = genAI.getGenerativeModel({ model: "gemini-3-pro-preview" });

const result = await model.generateContent([
  `Process this audio and provide:
  1. Full transcription
  2. Summary of main points
  3. Key timestamps`,
  { fileData: { fileUri: file.uri, mimeType: file.mimeType } }
]);

console.log(result.response.text());

Audio Specs:

  • Max Duration: 9.5 hours

  • Formats: WAV, MP3, FLAC, AAC, etc.

  • Languages: Supports multiple languages

See: references/audio-processing.md for advanced patterns

Task 4: Process PDF Documents

Goal: Extract and analyze content from PDF documents.

Use Cases:

  • Document analysis

  • Information extraction

  • Form processing

  • Research paper analysis

  • Contract review

  • Multi-page document understanding

Python Example:

import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")

# Configure with medium resolution (recommended for PDFs)
model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={
        "thinking_level": "high",
        "media_resolution": "medium"  # 560 tokens/page (saturation point)
    }
)

# Upload PDF
pdf_path = Path("research_paper.pdf")
pdf_file = genai.upload_file(pdf_path)

# Wait for processing
import time
while pdf_file.state.name == "PROCESSING":
    time.sleep(5)
    pdf_file = genai.get_file(pdf_file.name)

# Analyze PDF
response = model.generate_content([
    """Analyze this PDF document and provide:
    1. Document type and purpose
    2. Main sections and structure
    3. Key findings or arguments
    4. Important data or statistics
    5. Conclusions or recommendations
    """,
    pdf_file
])

print(response.text)
print(f"Tokens used: {response.usage_metadata.total_token_count}")

Node.js Example:

import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager, FileState } from "@google/generative-ai/server";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);

// Upload PDF
const uploadResult = await fileManager.uploadFile("research_paper.pdf", {
  mimeType: "application/pdf"
});

// Wait for processing
let file = await fileManager.getFile(uploadResult.file.name);
while (file.state === FileState.PROCESSING) {
  await new Promise(resolve => setTimeout(resolve, 5000));
  file = await fileManager.getFile(uploadResult.file.name);
}

// Analyze with medium resolution (recommended)
const model = genAI.getGenerativeModel({
  model: "gemini-3-pro-preview",
  generationConfig: {
    media_resolution: "medium"
  }
});

const result = await model.generateContent([
  `Analyze this PDF and extract:
  1. Main sections
  2. Key findings
  3. Important data`,
  { fileData: { fileUri: file.uri, mimeType: file.mimeType } }
]);

console.log(result.response.text());

PDF Processing Tips:

  • Recommended Resolution: medium (560 tokens/page) - saturation point for quality

  • Multi-page: Automatically processes all pages

  • Native Support: No conversion to images needed

  • Text Extraction: High-quality text extraction built-in

See: references/document-processing.md for advanced patterns

Task 5: Optimize Media Processing Costs

Goal: Balance quality and token consumption based on use case.

Strategy:

| Media | Resolution | Tokens | Use case |
|--------|------------|--------|----------|
| Images | low | 280 | Quick scan, thumbnails |
| Images | medium | 560 | General analysis |
| Images | high | 1,120 | OCR, fine details, code |
| PDFs | medium | 560/page | Recommended (saturation point) |
| PDFs | high | 1,120/page | Diminishing returns |
| Video | low/medium | 70/frame | Most use cases |
| Video | high | 280/frame | OCR from video |

Python Optimization Example:

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Different resolutions for different use cases
def analyze_image_optimized(image_path, need_ocr=False):
    """Analyze image with appropriate resolution"""
    resolution = "high" if need_ocr else "medium"

    model = genai.GenerativeModel(
        "gemini-3-pro-preview",
        generation_config={
            "media_resolution": resolution
        }
    )

    image_file = genai.upload_file(image_path)
    response = model.generate_content([
        "Describe this image" if not need_ocr else "Extract all text from this image",
        image_file
    ])

    # Log token usage for cost tracking
    tokens = response.usage_metadata.total_token_count
    cost = (tokens / 1_000_000) * 2.00  # Input pricing
    print(f"Resolution: {resolution}, Tokens: {tokens}, Cost: ${cost:.6f}")

    return response.text

# Use appropriate resolution
analyze_image_optimized("photo.jpg", need_ocr=False)  # medium
analyze_image_optimized("document.png", need_ocr=True)  # high

Per-Item Resolution Control:

# Set different resolutions for different media in same request
response = model.generate_content([
    "Compare these images",
    {"file": image1, "media_resolution": "high"},  # High detail
    {"file": image2, "media_resolution": "low"},   # Low detail OK
])

Cost Monitoring:

def log_media_costs(response):
    """Log media processing costs"""
    usage = response.usage_metadata

    # Pricing for ≤200k context
    input_cost = (usage.prompt_token_count / 1_000_000) * 2.00
    output_cost = (usage.candidates_token_count / 1_000_000) * 12.00

    print(f"Input tokens: {usage.prompt_token_count} (${input_cost:.6f})")
    print(f"Output tokens: {usage.candidates_token_count} (${output_cost:.6f})")
    print(f"Total cost: ${input_cost + output_cost:.6f}")

See: references/token-optimization.md for comprehensive strategies

Media Resolution Control

Resolution Options

| Resolution | Images | PDFs (per page) | Video (per frame) | Best for |
|------------|--------|-----------------|-------------------|----------|
| low | 280 tokens | 280 tokens | 70 tokens | Quick analysis, low detail |
| medium | 560 tokens | 560 tokens | 70 tokens | Balanced quality/cost |
| high | 1,120 tokens | 1,120 tokens | 280 tokens | OCR, fine text, details |

Configuration

Global Setting (all media):

model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={
        "media_resolution": "high"  # Applies to all media
    }
)

Per-Item Setting (mixed resolutions):

response = model.generate_content([
    "Analyze these files",
    {"file": high_detail_image, "media_resolution": "high"},
    {"file": low_detail_image, "media_resolution": "low"}
])

Best Practices

  • Images: Use high for OCR/text extraction, medium for general analysis

  • PDFs: Use medium (saturation point - higher resolutions show diminishing returns)

  • Video: Use low or medium unless OCR needed

  • Cost Control: Start with low, increase only if quality insufficient
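The best practices above can be encoded as a small lookup so callers do not hard-code resolutions. The table and function names here are illustrative defaults, not part of the SDK.

```python
# Defaults following the best practices above; tweak per workload.
DEFAULT_RESOLUTION = {
    ("image", "ocr"): "high",
    ("image", "general"): "medium",
    ("pdf", "general"): "medium",   # saturation point for PDFs
    ("video", "general"): "medium",
    ("video", "ocr"): "high",
}

def pick_resolution(media_type: str, task: str = "general") -> str:
    """Pick a media_resolution value, falling back to "low" for
    anything not covered (start low, raise only if quality suffers)."""
    return DEFAULT_RESOLUTION.get((media_type, task), "low")

print(pick_resolution("image", "ocr"))  # high
print(pick_resolution("audio"))         # low (fallback)
```

Centralizing the choice makes it easy to audit and adjust token spend in one place.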

See: references/media-resolution.md for detailed guide

File Management

Upload Files

import google.generativeai as genai
import time

# Upload file
file = genai.upload_file("path/to/file.jpg")
print(f"Uploaded: {file.name}")

# Check processing status
while file.state.name == "PROCESSING":
    time.sleep(5)
    file = genai.get_file(file.name)

print(f"Status: {file.state.name}")

List Uploaded Files

# List all files
for file in genai.list_files():
    print(f"{file.name} - {file.display_name}")

Delete Files

# Delete specific file
genai.delete_file(file.name)

# Delete all files
for file in genai.list_files():
    genai.delete_file(file.name)
    print(f"Deleted: {file.name}")

File Lifecycle

  • Upload: Immediate

  • Processing: Async (especially for video/audio)

  • Storage: Files persist until deleted

  • Expiration: Files may expire after period (check docs)
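The upload → processing → active lifecycle above is the same polling loop the earlier examples repeat inline. A reusable sketch, assuming the `google.generativeai` SDK used throughout (the helper name and timeout handling are this document's additions, not SDK API):

```python
import time

def wait_until_active(file, get_file, poll_seconds=5, timeout_seconds=600,
                      sleep=time.sleep):
    """Poll an uploaded file until it leaves the PROCESSING state.

    `get_file` is the SDK lookup (e.g. genai.get_file), injected so the
    loop can be exercised without network access. Raises on failure or
    timeout; returns the active file handle.
    """
    waited = 0.0
    while file.state.name == "PROCESSING":
        if waited >= timeout_seconds:
            raise TimeoutError(f"{file.name} still processing after {timeout_seconds}s")
        sleep(poll_seconds)
        waited += poll_seconds
        file = get_file(file.name)
    if file.state.name == "FAILED":
        raise ValueError(f"Processing failed for {file.name}")
    return file
```

Usage with the SDK: `video = wait_until_active(genai.upload_file("clip.mp4"), genai.get_file)`.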

Multi-File Processing

Process Multiple Images

# Upload multiple images
images = [
    genai.upload_file("photo1.jpg"),
    genai.upload_file("photo2.jpg"),
    genai.upload_file("photo3.jpg")
]

# Analyze together
response = model.generate_content([
    "Compare these images and identify common elements",
    *images
])

print(response.text)

Mixed Media Types

# Combine different media types
image = genai.upload_file("chart.png")
pdf = genai.upload_file("report.pdf")

response = model.generate_content([
    "Does the chart match the data in the report?",
    image,
    pdf
])

References

Related Skills

  • gemini-3-pro-api - Basic setup, authentication, text generation

  • gemini-3-image-generation - Image OUTPUT (generating images)

  • gemini-3-advanced - Function calling, tools, caching, batch processing

Common Use Cases

Visual Q&A Application

Combine image understanding with chat:

model = genai.GenerativeModel("gemini-3-pro-preview")
chat = model.start_chat()

# Upload image
image = genai.upload_file("product.jpg")

# Ask questions about it
response1 = chat.send_message(["What product is this?", image])
response2 = chat.send_message("What are its main features?")
response3 = chat.send_message("What's the price range for similar products?")

Document Analysis Pipeline

Process multiple PDFs and extract insights:

import google.generativeai as genai
from pathlib import Path
import time
import json

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={"media_resolution": "medium"}
)

# Process all PDFs in a directory
pdf_dir = Path("documents/")
results = {}

for pdf_path in pdf_dir.glob("*.pdf"):
    pdf_file = genai.upload_file(pdf_path)

    # Wait for processing
    while pdf_file.state.name == "PROCESSING":
        time.sleep(5)
        pdf_file = genai.get_file(pdf_file.name)

    # Extract key information
    response = model.generate_content([
        "Extract: 1) Document type, 2) Key dates, 3) Important numbers, 4) Summary",
        pdf_file
    ])

    results[pdf_path.name] = response.text

    # Clean up
    genai.delete_file(pdf_file.name)

# Save results
with open("analysis_results.json", "w") as f:
    json.dump(results, f, indent=2)
