# Multimodal LLM Patterns

Integrate vision, audio, and video generation capabilities from leading multimodal models. Covers image analysis, document understanding, real-time voice agents, speech-to-text, text-to-speech, and AI video generation (Kling 3.0, Sora 2, Veo 3.1, Runway Gen-4.5).

## Quick Reference

| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Vision: Image Analysis | 1 | HIGH | Image captioning, VQA, multi-image comparison, object detection |
| Vision: Document Understanding | 1 | HIGH | OCR, chart/diagram analysis, PDF processing, table extraction |
| Vision: Model Selection | 1 | MEDIUM | Choosing provider, cost optimization, image size limits |
| Audio: Speech-to-Text | 1 | HIGH | Transcription, speaker diarization, long-form audio |
| Audio: Text-to-Speech | 1 | MEDIUM | Voice synthesis, expressive TTS, multi-speaker dialogue |
| Audio: Model Selection | 1 | MEDIUM | Real-time voice agents, provider comparison, pricing |
| Video: Model Selection | 1 | HIGH | Choosing video gen provider (Kling, Sora, Veo, Runway) |
| Video: API Patterns | 1 | HIGH | Async task polling, SDK integration, webhook callbacks |
| Video: Multi-Shot | 1 | HIGH | Storyboarding, character elements, scene consistency |

Total: 9 rules across 3 categories (Vision, Audio, Video Generation)

## Vision: Image Analysis

Send images to multimodal LLMs for captioning, visual QA, and object detection. Always set `max_tokens` and resize images before encoding.

| Rule | File | Key Pattern |
|---|---|---|
| Image Analysis | rules/vision-image-analysis.md | Base64 encoding, multi-image, bounding boxes |

## Vision: Document Understanding

Extract structured data from documents, charts, and PDFs using vision models.

| Rule | File | Key Pattern |
|---|---|---|
| Document Vision | rules/vision-document.md | PDF page ranges, detail levels, OCR strategies |

## Vision: Model Selection

Choose the right vision provider based on accuracy, cost, and context window needs.

| Rule | File | Key Pattern |
|---|---|---|
| Vision Models | rules/vision-models.md | Provider comparison, token costs, image limits |

## Audio: Speech-to-Text

Convert audio to text with speaker diarization, timestamps, and sentiment analysis.

| Rule | File | Key Pattern |
|---|---|---|
| Speech-to-Text | rules/audio-speech-to-text.md | Gemini long-form, GPT-4o-Transcribe, AssemblyAI features |

## Audio: Text-to-Speech

Generate natural speech from text with voice selection and expressive cues.

| Rule | File | Key Pattern |
|---|---|---|
| Text-to-Speech | rules/audio-text-to-speech.md | Gemini TTS, voice config, auditory cues |

## Audio: Model Selection

Select the right audio/voice provider for real-time, transcription, or TTS use cases.

| Rule | File | Key Pattern |
|---|---|---|
| Audio Models | rules/audio-models.md | Real-time voice comparison, STT benchmarks, pricing |

## Video: Model Selection

Choose the right video generation provider based on use case, duration, and budget.

| Rule | File | Key Pattern |
|---|---|---|
| Video Models | rules/video-generation-models.md | Kling vs Sora vs Veo vs Runway, pricing, capabilities |

## Video: API Patterns

Integrate video generation APIs with proper async polling, SDKs, and webhook callbacks.

| Rule | File | Key Pattern |
|---|---|---|
| API Integration | rules/video-generation-patterns.md | Kling REST, fal.ai SDK, Vercel AI SDK, task polling |

## Video: Multi-Shot

Generate multi-scene videos with consistent characters using storyboarding and character elements.

| Rule | File | Key Pattern |
|---|---|---|
| Multi-Shot | rules/video-multi-shot.md | Kling 3.0 character elements, 6-shot storyboards, identity binding |

## Key Decisions

| Decision | Recommendation |
|---|---|
| High accuracy vision | Claude Opus 4.6 or GPT-5 |
| Long documents | Gemini 2.5 Pro (1M context) |
| Cost-efficient vision | Gemini 2.5 Flash ($0.15/M tokens) |
| Video analysis | Gemini 2.5/3 Pro (native video) |
| Voice assistant | Grok Voice Agent (fastest, <1s) |
| Emotional voice AI | Gemini Live API |
| Long audio transcription | Gemini 2.5 Pro (9.5hr) |
| Speaker diarization | AssemblyAI or Gemini |
| Self-hosted STT | Whisper Large V3 |
| Character-consistent video | Kling 3.0 (Character Elements 3.0) |
| Narrative video / storytelling | Sora 2 (best cause-and-effect coherence) |
| Cinematic B-roll | Veo 3.1 (camera control + polished motion) |
| Professional VFX | Runway Gen-4.5 (Act-Two motion transfer) |
| High-volume social video | Kling 3.0 Standard ($0.20/video) |
| Open-source video gen | Wan 2.6 or LTX-2 |
| Lip-sync / avatar video | Kling 3.0 (native lip-sync API) |

## Example

```python
import base64

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Base64-encode the image so it can be sent inline with the request.
with open("image.png", "rb") as f:
    b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,  # always set an explicit output budget on vision requests
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
            {"type": "text", "text": "Describe this image"},
        ],
    }],
)
```

## Common Mistakes

- Not setting `max_tokens` on vision requests (responses get truncated)
- Sending oversized images (>2048px) without resizing them first
- Using a high detail level for simple yes/no classification
- Using an STT+LLM+TTS pipeline instead of native speech-to-speech
- Not leveraging barge-in support for natural voice conversations
- Using deprecated models (GPT-4V, Whisper-1)
- Ignoring rate limits on vision and audio endpoints
- Calling video generation APIs synchronously (they are async: poll or use callbacks)
- Generating separate clips without character elements (characters look different each time)
- Using Sora for high-volume social content (expensive and slow; use Kling Standard instead)
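
Two of the mistakes above, oversized images and synchronous video calls, can be sketched as small provider-agnostic helpers. This is a minimal illustration, not any provider's API: `fit_within` and `poll_until_done` are hypothetical names, and the `"status"` field with its `"succeeded"` / `"failed"` values is an assumption, since each video provider defines its own task schema.

```python
import time
from typing import Callable


def fit_within(width: int, height: int, max_side: int = 2048) -> tuple[int, int]:
    """Return (width, height) scaled down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def poll_until_done(
    fetch_status: Callable[[], dict],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status() until the task succeeds, fails, or the timeout passes.

    fetch_status is a caller-supplied function that asks the provider for the
    current task state and returns it as a dict (hypothetical shape).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        task = fetch_status()
        status = task.get("status")  # assumed field name; adapt per provider
        if status == "succeeded":
            return task
        if status == "failed":
            raise RuntimeError(f"video task failed: {task}")
        if time.monotonic() >= deadline:
            raise TimeoutError("video task did not finish in time")
        sleep(interval_s)  # wait before asking again instead of busy-looping
```

For example, `fit_within(4096, 2048)` returns `(2048, 1024)`; the resulting size can then be handed to whatever image library does the actual resize before base64 encoding. Passing `sleep` as a parameter keeps the polling loop easy to test without real delays.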
multimodal-llm
Install
npx skills add https://github.com/yonatangross/orchestkit --skill multimodal-llm