OrchestKit v7.43.0 — 104 skills, 36 agents, 173 hooks · Claude Code 2.1.105+

Multimodal LLM

Vision, audio, video generation, and multimodal LLM integration patterns. Use when processing images, transcribing audio, generating speech, generating AI video (Kling, Sora, Veo, Runway), or building multimodal AI pipelines.

Reference · high

Auto-activated — this skill loads automatically when Claude detects matching context.

Multimodal LLM Patterns

Integrate vision, audio, and video generation capabilities from leading multimodal models. Covers image analysis, document understanding, real-time voice agents, speech-to-text, text-to-speech, and AI video generation (Kling 3.0, Sora 2, Veo 3.1, Runway Gen-4.5).

Quick Reference

| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Vision: Image Analysis | 1 | HIGH | Image captioning, VQA, multi-image comparison, object detection |
| Vision: Document Understanding | 1 | HIGH | OCR, chart/diagram analysis, PDF processing, table extraction |
| Vision: Model Selection | 1 | MEDIUM | Choosing provider, cost optimization, image size limits |
| Audio: Speech-to-Text | 1 | HIGH | Transcription, speaker diarization, long-form audio |
| Audio: Text-to-Speech | 1 | MEDIUM | Voice synthesis, expressive TTS, multi-speaker dialogue |
| Audio: Model Selection | 1 | MEDIUM | Real-time voice agents, provider comparison, pricing |
| Video: Model Selection | 1 | HIGH | Choosing video gen provider (Kling, Sora, Veo, Runway) |
| Video: API Patterns | 1 | HIGH | Async task polling, SDK integration, webhook callbacks |
| Video: Multi-Shot | 1 | HIGH | Storyboarding, character elements, scene consistency |

Total: 9 rules across 3 categories (Vision, Audio, Video Generation)

Vision: Image Analysis

Send images to multimodal LLMs for captioning, visual QA, and object detection. Always set max_tokens and resize images before encoding.

| Rule | File | Key Pattern |
|---|---|---|
| Image Analysis | rules/vision-image-analysis.md | Base64 encoding, multi-image, bounding boxes |
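Oversized images waste tokens and latency. A minimal Pillow sketch that caps the longest edge at 2048px and re-encodes before base64 encoding (function name and the JPEG/quality choice are illustrative):

```python
from io import BytesIO
from PIL import Image

def resize_for_vision(path: str, max_edge: int = 2048) -> bytes:
    """Downscale so the longest edge is at most max_edge, re-encode as JPEG."""
    img = Image.open(path)
    img.thumbnail((max_edge, max_edge))  # in-place, no-op if already small enough
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=90)
    return buf.getvalue()
```

The returned bytes go straight into the base64 encoding step shown in the rule below.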

Vision: Document Understanding

Extract structured data from documents, charts, and PDFs using vision models.

| Rule | File | Key Pattern |
|---|---|---|
| Document Vision | rules/vision-document.md | PDF page ranges, detail levels, OCR strategies |

Vision: Model Selection

Choose the right vision provider based on accuracy, cost, and context window needs.

| Rule | File | Key Pattern |
|---|---|---|
| Vision Models | rules/vision-models.md | Provider comparison, token costs, image limits |

Audio: Speech-to-Text

Convert audio to text with speaker diarization, timestamps, and sentiment analysis.

| Rule | File | Key Pattern |
|---|---|---|
| Speech-to-Text | rules/audio-speech-to-text.md | Gemini long-form, GPT-4o-Transcribe, AssemblyAI features |

Audio: Text-to-Speech

Generate natural speech from text with voice selection and expressive cues.

| Rule | File | Key Pattern |
|---|---|---|
| Text-to-Speech | rules/audio-text-to-speech.md | Gemini TTS, voice config, auditory cues |

Audio: Model Selection

Select the right audio/voice provider for real-time, transcription, or TTS use cases.

| Rule | File | Key Pattern |
|---|---|---|
| Audio Models | rules/audio-models.md | Real-time voice comparison, STT benchmarks, pricing |

Video: Model Selection

Choose the right video generation provider based on use case, duration, and budget.

| Rule | File | Key Pattern |
|---|---|---|
| Video Models | rules/video-generation-models.md | Kling vs Sora vs Veo vs Runway, pricing, capabilities |

Video: API Patterns

Integrate video generation APIs with proper async polling, SDKs, and webhook callbacks.

| Rule | File | Key Pattern |
|---|---|---|
| API Integration | rules/video-generation-patterns.md | Kling REST, fal.ai SDK, Vercel AI SDK, task polling |

Video: Multi-Shot

Generate multi-scene videos with consistent characters using storyboarding and character elements.

| Rule | File | Key Pattern |
|---|---|---|
| Multi-Shot | rules/video-multi-shot.md | Kling 3.0 character elements, 6-shot storyboards, identity binding |

Key Decisions

| Decision | Recommendation |
|---|---|
| High accuracy vision | Claude Opus 4.6 or GPT-5 |
| Long documents | Gemini 2.5 Pro (1M context) |
| Cost-efficient vision | Gemini 2.5 Flash ($0.15/M tokens) |
| Video analysis | Gemini 2.5/3 Pro (native video) |
| Voice assistant | Grok Voice Agent (fastest, <1s) |
| Emotional voice AI | Gemini Live API |
| Long audio transcription | Gemini 2.5 Pro (9.5hr) |
| Speaker diarization | AssemblyAI or Gemini |
| Self-hosted STT | Whisper Large V3 |
| Character-consistent video | Kling 3.0 (Character Elements 3.0) |
| Narrative video / storytelling | Sora 2 (best cause-and-effect coherence) |
| Cinematic B-roll | Veo 3.1 (camera control + polished motion) |
| Professional VFX | Runway Gen-4.5 (Act-Two motion transfer) |
| High-volume social video | Kling 3.0 Standard ($0.20/video) |
| Open-source video gen | Wan 2.6 or LTX-2 |
| Lip-sync / avatar video | Kling 3.0 (native lip-sync API) |

Example

import anthropic, base64

client = anthropic.Anthropic()
with open("image.png", "rb") as f:
    b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
        {"type": "text", "text": "Describe this image"}
    ]}]
)

Common Mistakes

  1. Not setting max_tokens on vision requests (responses truncated)
  2. Sending oversized images without resizing (>2048px)
  3. Using high detail level for simple yes/no classification
  4. Using STT+LLM+TTS pipeline instead of native speech-to-speech
  5. Not leveraging barge-in support for natural voice conversations
  6. Using deprecated models (GPT-4V, Whisper-1)
  7. Ignoring rate limits on vision and audio endpoints
  8. Calling video generation APIs synchronously (they're async — poll or use callbacks)
  9. Generating separate clips without character elements (characters look different each time)
  10. Using Sora for high-volume social content (expensive, slow — use Kling Standard instead)
Related Skills

  • ork:rag-retrieval - Multimodal RAG with image + text retrieval
  • ork:llm-integration - General LLM function calling patterns
  • streaming-api-patterns - WebSocket patterns for real-time audio
  • ork:demo-producer - Terminal demo videos (VHS, asciinema) — not AI video gen

Rules (9)

Select the right audio model architecture to avoid unnecessary pipeline latency — MEDIUM

Audio Model Selection

Choose the right audio provider based on latency, features, and cost.

Incorrect — building STT+LLM+TTS pipeline for voice assistants:

# 3-step pipeline adds 2-5x latency vs native speech-to-speech
text = transcribe(audio)          # STT: ~500ms
response = llm.generate(text)     # LLM: ~1000ms
audio = text_to_speech(response)  # TTS: ~500ms
# Total: ~2000ms minimum

Correct — use native speech-to-speech for voice assistants:

# Grok Voice Agent: <1s time-to-first-audio (5x faster)
async with websockets.connect("wss://api.x.ai/v1/realtime", extra_headers=headers) as ws:
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "model": "grok-4-voice",
            "voice": "Aria",
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
            "turn_detection": {"type": "server_vad"}
        }
    }))
    # Direct audio in -> audio out, no intermediary transcription

Real-time voice provider comparison:

| Model | Latency | Languages | Price | Best For |
|---|---|---|---|---|
| Grok Voice Agent | <1s TTFA | 100+ | $0.05/min | Fastest, lowest cost |
| Gemini Live API | Low | 24 (30 voices) | Usage-based | Emotional awareness |
| OpenAI Realtime | ~1s | 50+ | $0.10/min | Ecosystem integration |

Gemini Live — emotional awareness and barge-in:

async with model.connect(config=config) as session:
    # Supports barge-in (user can interrupt anytime)
    # Affective dialog (understands and responds to emotions)
    # Proactive audio (responds only when relevant)
    async for response in session.receive():
        if response.data:
            yield response.data  # Audio bytes

Pricing comparison:

| Provider | Type | Price | Notes |
|---|---|---|---|
| Grok Voice Agent | Real-time | $0.05/min | Cheapest real-time |
| Gemini Live | Real-time | Usage-based | 30 HD voices |
| OpenAI Realtime | Real-time | $0.10/min | |
| Gemini 2.5 Pro | Transcription | $1.25/M tokens | 9.5hr audio |
| GPT-4o-Transcribe | Transcription | $0.01/min | |
| AssemblyAI | Transcription | ~$0.15/hr | Best features |
| Deepgram | Transcription | ~$0.0043/min | Cheapest STT |

Selection guide:

| Scenario | Recommendation |
|---|---|
| Voice assistant (speed) | Grok Voice Agent (<1s) |
| Emotional AI | Gemini Live API |
| Long audio (hours) | Gemini 2.5 Pro (9.5hr) |
| Speaker diarization | AssemblyAI or Gemini |
| Lowest latency STT | Deepgram Nova-3 (<300ms) |
| Self-hosted STT | Whisper Large V3 |
| Cheapest real-time | Grok ($0.05/min) |

Key rules:

  • Use native speech-to-speech for voice assistants — never chain STT+LLM+TTS
  • Grok Voice Agent is OpenAI Realtime API compatible (easy migration)
  • Gemini Live supports barge-in — essential for natural conversations
  • Always test latency with real users under realistic network conditions
  • For phone agents, <1s time-to-first-audio is the quality threshold
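The <1s time-to-first-audio threshold in the last bullet is straightforward to measure around any streaming loop. A provider-agnostic sketch (class and method names are illustrative): call `request_sent()` right after sending the user's turn, `chunk_received()` on every audio frame, and only the first frame is counted.

```python
import time

class TTFATimer:
    """Track time-to-first-audio: start when the request goes out,
    record latency when the first audio chunk comes back."""

    def __init__(self) -> None:
        self._start: float | None = None
        self.ttfa: float | None = None

    def request_sent(self) -> None:
        self._start = time.monotonic()

    def chunk_received(self) -> None:
        if self.ttfa is None and self._start is not None:
            self.ttfa = time.monotonic() - self._start  # later chunks don't overwrite

    def meets_threshold(self, limit: float = 1.0) -> bool:
        return self.ttfa is not None and self.ttfa < limit
```

Run this against real user networks, not localhost, per the bullet above.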

Configure speech-to-text with the right provider and model for accuracy and cost — HIGH

Audio Speech-to-Text

Convert audio to text with speaker labels, timestamps, and structured output.

Incorrect — using deprecated model:

# Whisper-1 is deprecated — use GPT-4o-Transcribe for better accuracy
response = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

Correct — GPT-4o-Transcribe with structured output:

from openai import OpenAI
client = OpenAI()

with open(audio_path, "rb") as audio_file:
    response = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        language="en",  # Optional: improves accuracy for known language
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"]
    )

result = {
    "text": response.text,
    "words": response.words,        # Word-level timestamps
    "segments": response.segments,  # Segment-level with speaker info
    "duration": response.duration
}

Gemini — best for long-form audio (up to 9.5 hours):

import google.generativeai as genai
model = genai.GenerativeModel("gemini-2.5-pro")

audio_file = genai.upload_file(audio_path)
response = model.generate_content([
    audio_file,
    """Transcribe with:
    1. Speaker labels (Speaker 1, Speaker 2)
    2. Timestamps: [HH:MM:SS]
    3. Punctuation and formatting"""
])

AssemblyAI — best feature set (diarization + sentiment + entities):

import assemblyai as aai
aai.settings.api_key = "YOUR_API_KEY"

config = aai.TranscriptionConfig(
    speaker_labels=True,
    sentiment_analysis=True,
    entity_detection=True,
    auto_highlights=True,
    language_detection=True
)
transcript = aai.Transcriber().transcribe(audio_url, config=config)

STT model comparison:

| Model | WER | Latency | Best For |
|---|---|---|---|
| Gemini 2.5 Pro | ~5% | Medium | 9.5hr audio, diarization |
| GPT-4o-Transcribe | ~7% | Medium | Accuracy + accents |
| AssemblyAI Universal-2 | 8.4% | 200ms | Best features |
| Deepgram Nova-3 | ~18% | <300ms | Lowest latency |
| Whisper Large V3 | 7.4% | Slow | Self-host, 99+ langs |

Key rules:

  • Use GPT-4o-Transcribe, not Whisper-1 (deprecated)
  • For audio >1 hour, prefer Gemini 2.5 Pro (handles up to 9.5 hours natively)
  • AssemblyAI provides the richest metadata (sentiment, entities, highlights)
  • Deepgram Nova-3 for lowest latency STT (<300ms)
  • Whisper Large V3 only for self-hosted / air-gapped environments
  • Always specify language when known — improves accuracy significantly

Configure text-to-speech voice selection and style prompts for natural-sounding output — MEDIUM

Audio Text-to-Speech

Generate natural speech from text with voice selection and expressive control.

Incorrect — no voice configuration:

# Default voice with no style control — sounds robotic
response = model.generate_content(contents=text)

Correct — Gemini TTS with voice and style config:

import google.generativeai as genai

def text_to_speech(text: str, voice: str = "Kore") -> bytes:
    """Gemini 2.5 Flash TTS with voice selection.

    Available voices: Puck, Charon, Kore, Fenrir, Aoede (30 total)
    """
    model = genai.GenerativeModel("gemini-2.5-flash-tts")

    response = model.generate_content(
        contents=text,
        generation_config=genai.GenerationConfig(
            response_mime_type="audio/mp3",
            speech_config=genai.SpeechConfig(
                voice_config=genai.VoiceConfig(
                    prebuilt_voice_config=genai.PrebuiltVoiceConfig(
                        voice_name=voice
                    )
                )
            )
        )
    )
    return response.audio

Expressive voice with auditory cues (Grok Voice Agent):

# Supports: [whisper], [sigh], [laugh], [pause]
await ws.send(json.dumps({
    "type": "response.create",
    "response": {
        "instructions": "[sigh] Let me think about that... [pause] Here's what I found."
    }
}))

Gemini Live — real-time TTS with emotional awareness:

config = live.LiveConnectConfig(
    response_modalities=["AUDIO"],
    speech_config=live.SpeechConfig(
        voice_config=live.VoiceConfig(
            prebuilt_voice_config=live.PrebuiltVoiceConfig(
                voice_name="Puck"  # 30 HD voices in 24 languages
            )
        )
    ),
    system_instruction="Speak warmly and naturally."
)

Key rules:

  • Always configure voice explicitly — defaults vary by provider
  • Gemini TTS supports enhanced expressivity with style prompts in system instructions
  • Grok supports inline auditory cues: [whisper], [sigh], [laugh], [pause]
  • For multi-speaker dialogue, use consistent voice assignments per character
  • Gemini offers 30 HD voices across 24 languages
  • Test with real users — perceived quality varies by use case and language
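The consistent-voice-per-character rule above can be enforced with a small casting map. A sketch (function, cast names, and voice assignments are illustrative; `tts` is any callable with the `text_to_speech(text, voice=...)` shape shown earlier):

```python
def synthesize_dialogue(lines, cast, tts):
    """Render (speaker, text) pairs with a fixed voice per character.

    lines: list of (speaker, text) tuples
    cast:  dict mapping speaker name -> voice name
    tts:   callable like tts(text, voice=...) -> bytes
    """
    audio = []
    for speaker, text in lines:
        if speaker not in cast:
            raise KeyError(f"no voice assigned for {speaker!r}")
        audio.append(tts(text, voice=cast[speaker]))
    return audio

# Hypothetical casting — voice names from the Gemini prebuilt list above
cast = {"Narrator": "Kore", "Ava": "Puck", "Ben": "Charon"}
```

Failing fast on an uncast speaker prevents a mid-dialogue voice switch, which listeners notice immediately.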

Select the right video generation model by use case, cost, and output quality — HIGH

Video Generation Model Selection

Choose the right video generation provider based on use case, quality needs, and budget.

Model comparison (February 2026):

| Model | Max Duration | Resolution | Native Audio | Multi-Shot | Character Consistency | Best For |
|---|---|---|---|---|---|---|
| Kling 3.0/O3 | 15s | 4K/60fps | Yes (multi-speaker) | Yes (6 shots) | Excellent (3+ chars) | Social content, IP, ads |
| Sora 2 | 60s | 1080p | Yes | No | Moderate | Narrative, storytelling |
| Veo 3.1 | 8s | 4K | Yes | No | Good | Cinematic B-roll, polished explainers |
| Runway Gen-4.5 | 10s | 4K | Limited | No | Excellent (Act-Two) | Professional filmmaking |
| Wan 2.6 | varies | 1080p | No | No | Good | Open-source, brand narrative |
| Pika 2.5 | 5s | 1080p | Limited | No | Low | Viral effects, social |

Incorrect — using Sora for high-volume social content:

# Wrong: Sora is expensive and slow for volume work
# $1+ per video, 300-600s generation time
result = openai.videos.generate(
    model="sora-2",
    prompt="Product showcase with character",
    duration=10,
)
# Better: Use Kling 3.0 Standard — $0.20/video, 60-90s, character elements

Correct — matching model to use case:

# Social/ads volume → Kling 3.0 (cheap, fast, character consistent)
# Narrative film → Sora 2 (realism, cause-and-effect, 60s duration)
# Cinematic B-roll → Veo 3.1 (camera control, polished motion)
# Professional VFX → Runway Gen-4.5 (Act-Two motion transfer)
# Self-hosted/private → Wan 2.6 or LTX-2 (open-source)

Pricing comparison (5s video, standard mode):

| Provider | Kling 3.0 | Sora 2 | Veo 3.1 | Runway Gen-4.5 | LTX 2.0 | Wan 2.5 |
|---|---|---|---|---|---|---|
| Official API | ~$0.20 | ChatGPT Plus ($20/mo) | Google AI Pro ($29/mo) | From $12/mo | | |
| fal.ai | $0.07/s ($0.14 w/audio) | Available | $0.40/s | | Available | $0.05/s |
| Third-party (PiAPI, Segmind) | $0.20-$0.96 | | | | | |

Selection guide:

| Scenario | Recommendation |
|---|---|
| Character consistency across shots | Kling 3.0 (Character Elements 3.0, 3+ chars) |
| Realistic narrative storytelling | Sora 2 (best cause-and-effect coherence) |
| Cinematic B-roll / product shots | Veo 3.1 (best camera control + motion) |
| Professional VFX / motion capture | Runway Gen-4.5 (Act-Two motion transfer) |
| High-volume social content | Kling 3.0 Standard (cheapest, 60-90s gen) |
| Open-source / self-hosted | Wan 2.6 or LTX-2 |
| Lip-sync / avatar | Kling 3.0 (native lip-sync API) |

Key rules:

  • Kling 3.0 has two variants: V3 (standard generation) and O3 (AI Director — better scene composition and physics)
  • Native audio generation doubles credit cost on Kling — disable sound: false for silent clips
  • Sora 2 is region-restricted (US/Canada primarily) — check availability before committing
  • Runway Gen-4.5 has no third-party API access — must use Runway's own API
  • Video generation is async — always implement polling or callback patterns for task status
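Following the audio-cost rule above, a silent-clip request can disable sound explicitly. A payload sketch (the `sound` field name follows the rule above; verify it against the current Kling API docs before relying on it):

```python
# Silent-clip request: native audio doubles Kling's credit cost, so turn it
# off explicitly when the soundtrack will be added in post-production.
payload = {
    "model": "kling-v3.0",
    "prompt": "Product rotating on a pedestal, soft studio lighting",
    "duration": "5",
    "mode": "std",
    "sound": False,  # assumption: field name taken from the rule above
}
```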

Use async task polling and provider-specific SDKs for reliable video generation — HIGH

Video Generation API Patterns

All video generation APIs are async — submit a task, poll for completion. Never attempt synchronous generation.

Incorrect — synchronous call expecting immediate result:

# WRONG: Video generation takes 60-300s, this will timeout
response = requests.post(url, json={"prompt": "..."}, timeout=30)
video_url = response.json()["video_url"]  # Not available yet!

Correct — Kling API with task polling:

import requests
import time

API_KEY = "your-api-key"
BASE_URL = "https://api.klingai.com/v1"

# 1. Submit generation task
response = requests.post(
    f"{BASE_URL}/videos/text2video",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "kling-v3.0",
        "prompt": "A chef plating a dish in a modern kitchen, warm lighting",
        "duration": "5",
        "aspect_ratio": "16:9",
        "mode": "std",  # std | pro
    }
)
task_id = response.json()["data"]["task_id"]

# 2. Poll for completion (60-90s typical for Kling)
while True:
    status = requests.get(
        f"{BASE_URL}/videos/text2video/{task_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    ).json()
    if status["data"]["status"] == "completed":
        video_url = status["data"]["video_url"]
        break
    elif status["data"]["status"] == "failed":
        raise RuntimeError(f"Generation failed: {status['data']['error']}")
    time.sleep(10)
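The fixed 10-second sleep above is fine for scripts; for production, the key rules below call for exponential backoff. A minimal sketch (the generator and its defaults are illustrative, tuned to the 60-300s generation times and 10-minute timeout mentioned in this rule):

```python
import time

def backoff_delays(initial: float = 5.0, factor: float = 1.5,
                   cap: float = 30.0, limit: float = 600.0):
    """Yield poll-wait times: 5s, 7.5s, 11.25s, ... capped at 30s,
    stopping once roughly `limit` seconds of waiting have elapsed."""
    delay, waited = initial, 0.0
    while waited < limit:
        step = min(delay, cap)
        yield step
        waited += step
        delay *= factor

# Drop-in replacement for the fixed sleep in the loop above:
# for delay in backoff_delays():
#     status = poll_task(task_id)  # hypothetical helper wrapping the GET above
#     if status in ("completed", "failed"):
#         break
#     time.sleep(delay)
```

The cap keeps the worst-case detection lag bounded while still easing pressure off the API early on.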

Correct — fal.ai SDK (serverless, no GPU management):

import fal_client

# Kling V3 text-to-video via fal.ai
result = fal_client.run(
    "fal-ai/kling-video/v3/pro/text-to-video",
    arguments={
        "prompt": "A knight in weathered armor, cinematic lighting",
        "duration": "10",
        "aspect_ratio": "16:9",
    }
)
video_url = result["video"]["url"]

Correct — Vercel AI SDK (TypeScript):

import { klingai, type KlingAIVideoModelOptions } from '@ai-sdk/klingai';
import { experimental_generateVideo as generateVideo } from 'ai';

// Text-to-video
const { videos } = await generateVideo({
  model: klingai.video('kling-v3.0-t2v'),
  prompt: 'A cat playing in autumn leaves, warm afternoon light',
  aspectRatio: '16:9',
  duration: 5,
  providerOptions: {
    klingai: { mode: 'std' } satisfies KlingAIVideoModelOptions,
  },
});

// Image-to-video with start + end frame
const { videos: i2v } = await generateVideo({
  model: klingai.video('kling-v3.0-i2v'),
  prompt: { image: startFrameUrl, text: 'The subject turns and smiles' },
  duration: 5,
  providerOptions: {
    klingai: {
      mode: 'pro',        // Pro required for end-frame
      imageTail: endFrameUrl,
    } satisfies KlingAIVideoModelOptions,
  },
});

Callback pattern (preferred over polling for production):

# Register a webhook URL when creating the task
response = requests.post(
    f"{BASE_URL}/videos/text2video",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "prompt": "...",
        "callback_url": "https://your-app.com/api/video-webhook",
    }
)
# Your webhook receives POST with task_id + video_url on completion
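On the receiving side, the handler can stay framework-agnostic. A sketch (the payload shape is an assumption based on the callback description above; check the provider's webhook docs for the actual fields):

```python
def handle_video_webhook(payload: dict, store: dict) -> str:
    """Record a completion callback by task_id.

    payload: parsed JSON body of the webhook POST (shape assumed, see above)
    store:   any dict-like persistence layer keyed by task_id
    """
    task_id = payload["task_id"]
    if payload.get("status") == "failed":
        store[task_id] = {"status": "failed", "error": payload.get("error")}
    else:
        store[task_id] = {"status": "completed", "video_url": payload["video_url"]}
    return task_id
```

Mount this behind a POST route in your web framework, return 200 immediately, and download the video out of band — webhook senders typically retry on slow responses.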

Key rules:

  • Always implement exponential backoff on polling — don't hammer the API every second
  • Set generation timeout at 10 minutes minimum — complex scenes take longer
  • Kling mode: "pro" costs ~2x more than mode: "std" but enables end-frame control and higher quality
  • fal.ai and AI SDK abstract the polling — prefer these over raw REST for new projects
  • fal MCP server (mcp.fal.ai) gives Claude Code / Cursor direct access to all fal models via run_model / submit_job tools — no SDK code needed for prototyping
  • Store task_id persistently — if your process crashes, you can resume polling
  • Validate image URLs are publicly accessible before passing to image-to-video endpoints

Use multi-shot storyboarding and character elements for consistent multi-scene video — HIGH

Multi-Shot Storyboarding & Character Consistency

Kling 3.0+ supports multi-shot generation with character elements for identity-consistent scenes. Without this, characters look different in every clip.

Incorrect — generating separate clips and hoping characters match:

# WRONG: Each generation creates a different-looking character
clip1 = generate_video("A woman walks into a café")
clip2 = generate_video("A woman orders coffee at a counter")
clip3 = generate_video("A woman sits down and opens a laptop")
# Result: 3 different women, inconsistent clothing, hair, face

Correct — Kling 3.0 multi-shot with character elements:

import { klingai, type KlingAIVideoModelOptions } from '@ai-sdk/klingai';
import { experimental_generateVideo as generateVideo } from 'ai';

const { videos } = await generateVideo({
  model: klingai.video('kling-v3.0-t2v'),
  prompt: 'A woman walks into a café, orders coffee, and sits down',
  providerOptions: {
    klingai: {
      shot_type: 'customize',
      // Multi-shot: each shot gets its own prompt and duration
      multi_prompt: [
        { prompt: '@Element1 walks into a bright café', duration: 5 },
        { prompt: '@Element1 orders coffee at the counter, smiling', duration: 5 },
        { prompt: '@Element1 sits by the window and opens a laptop', duration: 5 },
      ],
      // Character element: up to 3 reference images for identity lock
      elements: [{
        images: [
          'https://...frontal-face.jpg',   // Required: frontal view
          'https://...side-profile.jpg',    // Optional: side angle
          'https://...full-body.jpg',       // Optional: full body
        ],
        // Optional: assign a voice to this character
        voice_id: 'voice_abc123',
      }],
    } satisfies KlingAIVideoModelOptions,
  },
});

Correct — Kling REST API multi-shot (direct):

response = requests.post(
    f"{BASE_URL}/videos/text2video",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "kling-v3.0",
        "mode": "pro",
        "shot_type": "customize",
        "multi_prompt": [
            {"prompt": "@Element1 enters a modern office", "duration": 5},
            {"prompt": "@Element1 presents to a group, gesturing at a screen", "duration": 5},
        ],
        "elements": [{
            "images": ["https://...frontal.jpg", "https://...profile.jpg"]
        }],
        "aspect_ratio": "16:9",
    }
)

Character element best practices:

| Guideline | Reason |
|---|---|
| Always include a frontal face image | Required for identity lock — model needs clear face features |
| Add side profile as 2nd image | Prevents hallucination when character turns |
| Add full body as 3rd image | Locks clothing, posture, proportions across shots |
| Max 3 images per element | API limit — choose the 3 most informative angles |
| Reference as @Element1, @Element2 | Prompt syntax for character binding in multi-shot |
| 3+ characters in one scene | Use Kling O3 — supports multi-character coreference |

Scene transition controls:

  • Up to 6 shots per generation with customizable per-shot duration
  • Each shot can have independent camera movement and prompt
  • Kling handles transitions between shots automatically
  • Total duration: sum of all shot durations (max 15s total)
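The shot-count and duration limits above are worth validating before submitting, since a rejected task still costs a round trip. A sketch (function name is illustrative; limits taken from the list above):

```python
MAX_SHOTS = 6
MAX_TOTAL_SECONDS = 15

def validate_storyboard(multi_prompt: list[dict]) -> int:
    """Check a multi_prompt list against Kling's shot-count and
    total-duration limits; returns the total duration in seconds."""
    if len(multi_prompt) > MAX_SHOTS:
        raise ValueError(f"{len(multi_prompt)} shots exceeds the {MAX_SHOTS}-shot limit")
    total = sum(shot["duration"] for shot in multi_prompt)
    if total > MAX_TOTAL_SECONDS:
        raise ValueError(f"total duration {total}s exceeds {MAX_TOTAL_SECONDS}s")
    return total
```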

Key rules:

  • Multi-shot requires mode: "pro" — standard mode only supports single-shot
  • Character elements only work with Kling 3.0+ — earlier versions have no identity binding
  • Reference images must be publicly accessible HTTPS URLs (no local files)
  • Use descriptive prompts per shot — "walks left" is better than "moves" for camera planning
  • For 3+ character scenes, use Kling O3 (not V3) — O3 has better multi-character coreference
  • Voice IDs are optional — get them from the Kling create-voice endpoint first

Process documents with vision models using page ranges to avoid context overflow — HIGH

Vision Document Understanding

Extract structured data from documents, charts, and PDFs using vision models.

Incorrect — sending entire large PDF at once:

# Exceeds context window or times out on 100+ page PDFs
response = model.generate_content([genai.upload_file("large_doc.pdf"), "Summarize this"])

Correct — incremental PDF processing:

# Claude Code Read tool with page ranges (max 20 pages per request)
Read(file_path="/path/to/document.pdf", pages="1-5")    # TOC/structure scan
Read(file_path="/path/to/document.pdf", pages="45-55")   # Target section
Read(file_path="/path/to/document.pdf", pages="80-90")   # Appendix
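For a full sequential pass rather than targeted sections, the 20-page cap can drive the chunking. A sketch (helper name is illustrative; the cap comes from the constraints below):

```python
MAX_PAGES_PER_REQUEST = 20

def page_ranges(total_pages: int, chunk: int = MAX_PAGES_PER_REQUEST):
    """Split a long PDF into request-sized (start, end) page ranges,
    e.g. 45 pages -> (1, 20), (21, 40), (41, 45)."""
    return [
        (start, min(start + chunk - 1, total_pages))
        for start in range(1, total_pages + 1, chunk)
    ]

# for lo, hi in page_ranges(137):
#     Read(file_path="/path/to/document.pdf", pages=f"{lo}-{hi}")
```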

Chart/diagram analysis — use high detail and explicit extraction prompt:

response = client.chat.completions.create(
    model="gpt-5.2",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": (
                "Extract all data from this chart. Return as structured JSON with:\n"
                "1. Chart type (bar, line, pie, etc.)\n"
                "2. Axis labels and units\n"
                "3. All data points with values\n"
                "4. Title and legend entries"
            )},
            {"type": "image_url", "image_url": {
                "url": f"data:{mime_type};base64,{base64_data}",
                "detail": "high"  # Required for chart OCR accuracy
            }}
        ]
    }]
)

Key rules:

  • For PDFs >10 pages, always use page ranges — never send all at once
  • Max 20 pages per Read request in Claude Code
  • Use detail: "high" for documents with small text, tables, or charts
  • Gemini 2.5 Pro handles longest documents (1M+ context)
  • Always validate extracted numbers against source when accuracy is critical
  • Claude supports up to 600 images or PDF pages per request (useful for multi-page document analysis)

PDF constraints:

| Constraint | Value |
|---|---|
| Max pages per request | 20 |
| Max file size | 20MB |
| Large PDF threshold | >10 pages |

Encode images correctly with content array structure to prevent silent vision failures — HIGH

Vision Image Analysis

Encode images correctly and structure multi-modal messages for each provider.

Incorrect — string content instead of content array:

# OpenAI — image silently ignored
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[{"role": "user", "content": f"Describe this image: {base64_data}"}]
)

Correct — structured content array with image_url:

import base64, mimetypes

def encode_image(path: str) -> tuple[str, str]:
    mime_type = mimetypes.guess_type(path)[0] or "image/png"
    with open(path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8"), mime_type

# OpenAI (GPT-5, GPT-4o)
base64_data, mime_type = encode_image(image_path)
response = client.chat.completions.create(
    model="gpt-5.2",
    max_tokens=4096,  # Required for vision — omitting truncates response
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {
                "url": f"data:{mime_type};base64,{base64_data}",
                "detail": "high"  # low=65 tokens, high=129+ tokens/tile
            }}
        ]
    }]
)

Claude — different content structure (type: "image", not "image_url"):

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {
                "type": "base64",
                "media_type": media_type,
                "data": base64_data
            }},
            {"type": "text", "text": prompt}
        ]
    }]
)

Gemini — uses PIL Image directly:

from PIL import Image
model = genai.GenerativeModel("gemini-2.5-pro")
image = Image.open(image_path)
response = model.generate_content([prompt, image])

Multi-image comparison (Claude supports up to 100):

content = []
for img_path in images:
    b64, mt = encode_image(img_path)
    content.append({"type": "image", "source": {"type": "base64", "media_type": mt, "data": b64}})
content.append({"type": "text", "text": "Compare these images..."})
response = client.messages.create(model="claude-opus-4-6", max_tokens=8192, messages=[{"role": "user", "content": content}])

Object detection with bounding boxes (Gemini 2.5+):

response = model.generate_content([
    "Detect all objects. Return JSON: {objects: [{label, box: [x1,y1,x2,y2]}]}",
    image
])
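Model replies to detection prompts often wrap the JSON in prose or a markdown fence, so pull it out defensively. A sketch (function name is illustrative; coordinate conventions vary by model and prompt, so validate boxes against the actual image dimensions before drawing them):

```python
import json
import re

def parse_boxes(reply_text: str) -> list[dict]:
    """Extract the objects list from a detection reply, tolerating
    surrounding prose or a markdown code fence."""
    match = re.search(r"\{.*\}", reply_text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object in model reply")
    return json.loads(match.group(0))["objects"]
```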

Key rules:

  • Always set max_tokens on vision requests (responses truncated without it)
  • Resize images to max 2048px before encoding (reduces cost and latency)
  • Use detail: "low" (65 tokens) for simple classification, "high" for OCR/charts
  • Each provider has different content structure — do not mix formats
  • Claude takes media_type as a separate field; OpenAI embeds the MIME type in the data URI

Select the right vision model to balance cost against task complexity requirements — MEDIUM

Vision Model Selection

Choose the right vision provider based on accuracy, cost, and context needs.

Model comparison:

| Model | Context | Strengths | Max Images |
|---|---|---|---|
| GPT-5.2 | 128K | Best general reasoning | 10/request |
| Claude Opus 4.6 | 1M | Best coding, sustained agent tasks | 600/request |
| Gemini 2.5 Pro | 1M+ | Longest context, native video | 3,600/request |
| Gemini 3 Pro | 1M | Deep Think, enhanced segmentation | 3,600/request |
| Grok 4 | 2M | Real-time X integration | Limited |

Token cost by detail level:

| Provider | Detail Level | Token Cost | Use For |
|---|---|---|---|
| OpenAI | low | 65 tokens | Classification (yes/no) |
| OpenAI | high | 129+ tokens/tile | OCR, charts, detailed analysis |
| Gemini | base | 258 tokens | Scales with resolution |
| Claude | per-image | Fixed | Batch for efficiency |

Incorrect — using expensive model for simple classification:

# Wastes tokens: high detail + large model for yes/no
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Is there a person? Reply: yes/no"},
        {"type": "image_url", "image_url": {"url": img_url, "detail": "high"}}
    ]}]
)

Correct — cost-optimized for simple tasks:

response = client.chat.completions.create(
    model="gpt-5.2-mini",  # Cheaper model
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Is there a person? Reply: yes/no"},
        {"type": "image_url", "image_url": {"url": img_url, "detail": "low"}}  # 65 tokens
    ]}]
)

Image size limits:

| Provider | Max Size | Max Images | Notes |
|---|---|---|---|
| OpenAI | 20MB | 10/request | GPT-5 series |
| Claude | 8000x8000 px | 600/request | 2000px if >20 images |
| Gemini | 20MB | 3,600/request | Best for batch |
| Grok | 20MB | Limited | |
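The key rules below warn that corrupt files cause silent failures, so it pays to validate before encoding. A Pillow sketch (function name and the 20MB default are illustrative, matching the size limits above):

```python
import os
from PIL import Image, UnidentifiedImageError

def validate_image(path: str, max_bytes: int = 20 * 1024 * 1024) -> str:
    """Reject oversized or corrupt files before spending vision tokens
    on them; returns the detected format (e.g. 'png', 'jpeg')."""
    if os.path.getsize(path) > max_bytes:
        raise ValueError(f"{path} exceeds {max_bytes} bytes")
    try:
        with Image.open(path) as img:
            img.verify()  # integrity check without decoding the full image
            return img.format.lower()
    except UnidentifiedImageError as exc:
        raise ValueError(f"{path} is not a recognizable image") from exc
```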

Selection guide:

| Scenario | Recommendation |
|---|---|
| High accuracy | Claude Opus 4.6 or GPT-5 |
| Long documents | Gemini 2.5 Pro (1M context) |
| Cost efficiency | Gemini 2.5 Flash ($0.15/M tokens) |
| Real-time/X data | Grok 4 with DeepSearch |
| Video analysis | Gemini 2.5/3 Pro (native) |
| Batch images | Claude (600/req) or Gemini (3,600/req) |

Key rules:

  • Cannot identify specific people (privacy restriction, all providers)
  • May hallucinate on low-quality or rotated images (<200px)
  • No real-time video except Gemini — use frame extraction for others
  • Validate image format before encoding (corrupt files cause silent failures)
  • Always check rate limits on vision endpoints — they are lower than text-only
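Since vision rate limits are lower than text-only (last bullet), wrap calls in a retry. A generic sketch (function name and defaults are illustrative; pass your SDK's actual rate-limit exception class, e.g. `openai.RateLimitError` or `anthropic.RateLimitError`):

```python
import random
import time

def call_with_backoff(fn, *, retryable=(Exception,), retries=5,
                      base=1.0, cap=60.0):
    """Retry a vision/audio API call on rate-limit errors with jittered
    exponential backoff; re-raises after the final attempt."""
    for attempt in range(retries):
        try:
            return fn()
        except retryable:
            if attempt == retries - 1:
                raise
            # full-jitter-ish delay: half to full of the exponential step
            delay = min(cap, base * 2 ** attempt) * (0.5 + random.random() / 2)
            time.sleep(delay)
```

Usage: `call_with_backoff(lambda: client.messages.create(...), retryable=(anthropic.RateLimitError,))`.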