OrchestKit v6.7.1 — 67 skills, 38 agents, 77 hooks with Opus 4.6 support

Multimodal LLM

Vision, audio, and multimodal LLM integration patterns. Use when processing images, transcribing audio, generating speech, or building multimodal AI pipelines.

Category: Reference · Impact: high

Multimodal LLM Patterns

Integrate vision and audio capabilities from leading multimodal models. Covers image analysis, document understanding, real-time voice agents, speech-to-text, and text-to-speech.

Quick Reference

| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Vision: Image Analysis | 1 | HIGH | Image captioning, VQA, multi-image comparison, object detection |
| Vision: Document Understanding | 1 | HIGH | OCR, chart/diagram analysis, PDF processing, table extraction |
| Vision: Model Selection | 1 | MEDIUM | Choosing provider, cost optimization, image size limits |
| Audio: Speech-to-Text | 1 | HIGH | Transcription, speaker diarization, long-form audio |
| Audio: Text-to-Speech | 1 | MEDIUM | Voice synthesis, expressive TTS, multi-speaker dialogue |
| Audio: Model Selection | 1 | MEDIUM | Real-time voice agents, provider comparison, pricing |

Total: 6 rules across 2 categories (Vision, Audio)

Vision: Image Analysis

Send images to multimodal LLMs for captioning, visual QA, and object detection. Always set max_tokens and resize images before encoding.

| Rule | File | Key Pattern |
|---|---|---|
| Image Analysis | rules/vision-image-analysis.md | Base64 encoding, multi-image, bounding boxes |

Vision: Document Understanding

Extract structured data from documents, charts, and PDFs using vision models.

| Rule | File | Key Pattern |
|---|---|---|
| Document Vision | rules/vision-document.md | PDF page ranges, detail levels, OCR strategies |

Vision: Model Selection

Choose the right vision provider based on accuracy, cost, and context window needs.

| Rule | File | Key Pattern |
|---|---|---|
| Vision Models | rules/vision-models.md | Provider comparison, token costs, image limits |

Audio: Speech-to-Text

Convert audio to text with speaker diarization, timestamps, and sentiment analysis.

| Rule | File | Key Pattern |
|---|---|---|
| Speech-to-Text | rules/audio-speech-to-text.md | Gemini long-form, GPT-4o-Transcribe, AssemblyAI features |

Audio: Text-to-Speech

Generate natural speech from text with voice selection and expressive cues.

| Rule | File | Key Pattern |
|---|---|---|
| Text-to-Speech | rules/audio-text-to-speech.md | Gemini TTS, voice config, auditory cues |

Audio: Model Selection

Select the right audio/voice provider for real-time, transcription, or TTS use cases.

| Rule | File | Key Pattern |
|---|---|---|
| Audio Models | rules/audio-models.md | Real-time voice comparison, STT benchmarks, pricing |

Key Decisions

| Decision | Recommendation |
|---|---|
| High accuracy vision | Claude Opus 4.6 or GPT-5 |
| Long documents | Gemini 2.5 Pro (1M context) |
| Cost-efficient vision | Gemini 2.5 Flash ($0.15/M tokens) |
| Video analysis | Gemini 2.5/3 Pro (native video) |
| Voice assistant | Grok Voice Agent (fastest, <1s) |
| Emotional voice AI | Gemini Live API |
| Long audio transcription | Gemini 2.5 Pro (9.5hr) |
| Speaker diarization | AssemblyAI or Gemini |
| Self-hosted STT | Whisper Large V3 |

Example

import anthropic, base64

client = anthropic.Anthropic()
with open("image.png", "rb") as f:
    b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
        {"type": "text", "text": "Describe this image"}
    ]}]
)

Common Mistakes

  1. Not setting max_tokens on vision requests (responses truncated)
  2. Sending oversized images without resizing (>2048px)
  3. Using high detail level for simple yes/no classification
  4. Using STT+LLM+TTS pipeline instead of native speech-to-speech
  5. Not leveraging barge-in support for natural voice conversations
  6. Using deprecated models (GPT-4V, Whisper-1)
  7. Ignoring rate limits on vision and audio endpoints
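Mistake 7 can be mitigated with simple retry logic. A minimal sketch, assuming a generic `RateLimitError` as a stand-in for each SDK's own 429 exception (e.g. `anthropic.RateLimitError`):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for a provider's 429 error (e.g. anthropic.RateLimitError)."""

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry a zero-arg API call on rate-limit errors.

    Exponential backoff with proportional jitter: delays grow as
    base_delay * 2**attempt, scaled by a random factor in [1, 2).
    """
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the 429
            time.sleep(base_delay * 2 ** attempt * (1 + random.random()))
```

Wrap any vision or audio request in it, e.g. `with_backoff(lambda: client.messages.create(...))`.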
Related Skills

  • ork:rag-retrieval - Multimodal RAG with image + text retrieval
  • ork:llm-integration - General LLM function calling patterns
  • streaming-api-patterns - WebSocket patterns for real-time audio

Rules (6)

Select the right audio model architecture to avoid unnecessary pipeline latency — MEDIUM

Audio Model Selection

Choose the right audio provider based on latency, features, and cost.

Incorrect — building STT+LLM+TTS pipeline for voice assistants:

# 3-step pipeline adds 2-5x latency vs native speech-to-speech
text = transcribe(audio)          # STT: ~500ms
response = llm.generate(text)     # LLM: ~1000ms
audio = text_to_speech(response)  # TTS: ~500ms
# Total: ~2000ms minimum

Correct — use native speech-to-speech for voice assistants:

# Grok Voice Agent: <1s time-to-first-audio (5x faster)
import json, websockets

async with websockets.connect("wss://api.x.ai/v1/realtime", extra_headers=headers) as ws:
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "model": "grok-4-voice",
            "voice": "Aria",
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
            "turn_detection": {"type": "server_vad"}
        }
    }))
    # Direct audio in -> audio out, no intermediary transcription

Real-time voice provider comparison:

| Model | Latency | Languages | Price | Best For |
|---|---|---|---|---|
| Grok Voice Agent | <1s TTFA | 100+ | $0.05/min | Fastest, lowest cost |
| Gemini Live API | Low | 24 (30 voices) | Usage-based | Emotional awareness |
| OpenAI Realtime | ~1s | 50+ | $0.10/min | Ecosystem integration |

Gemini Live — emotional awareness and barge-in:

async with model.connect(config=config) as session:
    # Supports barge-in (user can interrupt anytime)
    # Affective dialog (understands and responds to emotions)
    # Proactive audio (responds only when relevant)
    async for response in session.receive():
        if response.data:
            yield response.data  # Audio bytes

Pricing comparison:

| Provider | Type | Price | Notes |
|---|---|---|---|
| Grok Voice Agent | Real-time | $0.05/min | Cheapest real-time |
| Gemini Live | Real-time | Usage-based | 30 HD voices |
| OpenAI Realtime | Real-time | $0.10/min | |
| Gemini 2.5 Pro | Transcription | $1.25/M tokens | 9.5hr audio |
| GPT-4o-Transcribe | Transcription | $0.01/min | |
| AssemblyAI | Transcription | ~$0.15/hr | Best features |
| Deepgram | Transcription | ~$0.0043/min | Cheapest STT |

Selection guide:

| Scenario | Recommendation |
|---|---|
| Voice assistant (speed) | Grok Voice Agent (<1s) |
| Emotional AI | Gemini Live API |
| Long audio (hours) | Gemini 2.5 Pro (9.5hr) |
| Speaker diarization | AssemblyAI or Gemini |
| Lowest latency STT | Deepgram Nova-3 (<300ms) |
| Self-hosted STT | Whisper Large V3 |
| Cheapest real-time | Grok ($0.05/min) |

Key rules:

  • Use native speech-to-speech for voice assistants — never chain STT+LLM+TTS
  • Grok Voice Agent is OpenAI Realtime API compatible (easy migration)
  • Gemini Live supports barge-in — essential for natural conversations
  • Always test latency with real users under realistic network conditions
  • For phone agents, <1s time-to-first-audio is the quality threshold
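The <1s time-to-first-audio threshold can be checked with a timer around the first received chunk. A sketch; `audio_chunks` stands in for whatever async iterator your provider's session exposes (e.g. the `session.receive()` loop above):

```python
import asyncio
import time

async def measure_ttfa(audio_chunks):
    """Seconds from call start to the first non-empty audio chunk.

    audio_chunks is any async iterator of audio byte chunks, standing in
    for a realtime session's receive loop.
    """
    start = time.monotonic()
    async for chunk in audio_chunks:
        if chunk:  # first audible payload marks time-to-first-audio
            return time.monotonic() - start
    return None  # stream ended with no audio
```

Run it against real network conditions, not localhost, since transport latency dominates the budget.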

Configure speech-to-text with the right provider and model for accuracy and cost — HIGH

Audio Speech-to-Text

Convert audio to text with speaker labels, timestamps, and structured output.

Incorrect — using deprecated model:

# Whisper-1 is deprecated — use GPT-4o-Transcribe for better accuracy
response = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

Correct — GPT-4o-Transcribe with structured output:

from openai import OpenAI
client = OpenAI()

with open(audio_path, "rb") as audio_file:
    response = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        language="en",  # Optional: improves accuracy for known language
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"]
    )

result = {
    "text": response.text,
    "words": response.words,        # Word-level timestamps
    "segments": response.segments,  # Segment-level with speaker info
    "duration": response.duration
}

Gemini — best for long-form audio (up to 9.5 hours):

import google.generativeai as genai
model = genai.GenerativeModel("gemini-2.5-pro")

audio_file = genai.upload_file(audio_path)
response = model.generate_content([
    audio_file,
    """Transcribe with:
    1. Speaker labels (Speaker 1, Speaker 2)
    2. Timestamps: [HH:MM:SS]
    3. Punctuation and formatting"""
])

AssemblyAI — best feature set (diarization + sentiment + entities):

import assemblyai as aai
aai.settings.api_key = "YOUR_API_KEY"

config = aai.TranscriptionConfig(
    speaker_labels=True,
    sentiment_analysis=True,
    entity_detection=True,
    auto_highlights=True,
    language_detection=True
)
transcript = aai.Transcriber().transcribe(audio_url, config=config)

STT model comparison:

| Model | WER | Latency | Best For |
|---|---|---|---|
| Gemini 2.5 Pro | ~5% | Medium | 9.5hr audio, diarization |
| GPT-4o-Transcribe | ~7% | Medium | Accuracy + accents |
| AssemblyAI Universal-2 | 8.4% | 200ms | Best features |
| Deepgram Nova-3 | ~18% | <300ms | Lowest latency |
| Whisper Large V3 | 7.4% | Slow | Self-host, 99+ langs |

Key rules:

  • Use GPT-4o-Transcribe, not Whisper-1 (deprecated)
  • For audio >1 hour, prefer Gemini 2.5 Pro (handles up to 9.5 hours natively)
  • AssemblyAI provides the richest metadata (sentiment, entities, highlights)
  • Deepgram Nova-3 for lowest latency STT (<300ms)
  • Whisper Large V3 only for self-hosted / air-gapped environments
  • Always specify language when known — improves accuracy significantly
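The rules above condense into a small routing helper. A sketch; the model identifiers mirror this document, so verify current IDs against each provider's docs before use:

```python
def pick_stt_model(duration_seconds: float, self_hosted: bool = False,
                   need_rich_metadata: bool = False) -> str:
    """Route audio to an STT provider following the rules above."""
    if self_hosted:
        return "whisper-large-v3"        # only option for air-gapped
    if duration_seconds > 3600:
        return "gemini-2.5-pro"          # handles up to 9.5h natively
    if need_rich_metadata:
        return "assemblyai-universal-2"  # diarization, sentiment, entities
    return "gpt-4o-transcribe"           # accuracy default
```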

Configure text-to-speech voice selection and style prompts for natural-sounding output — MEDIUM

Audio Text-to-Speech

Generate natural speech from text with voice selection and expressive control.

Incorrect — no voice configuration:

# Default voice with no style control — sounds robotic
response = model.generate_content(contents=text)

Correct — Gemini TTS with voice and style config:

import google.generativeai as genai

def text_to_speech(text: str, voice: str = "Kore") -> bytes:
    """Gemini 2.5 Flash TTS with voice selection.

    Available voices: Puck, Charon, Kore, Fenrir, Aoede (30 total)
    """
    model = genai.GenerativeModel("gemini-2.5-flash-tts")

    response = model.generate_content(
        contents=text,
        generation_config=genai.GenerationConfig(
            response_mime_type="audio/mp3",
            speech_config=genai.SpeechConfig(
                voice_config=genai.VoiceConfig(
                    prebuilt_voice_config=genai.PrebuiltVoiceConfig(
                        voice_name=voice
                    )
                )
            )
        )
    )
    return response.audio

Expressive voice with auditory cues (Grok Voice Agent):

# Supports: [whisper], [sigh], [laugh], [pause]
await ws.send(json.dumps({
    "type": "response.create",
    "response": {
        "instructions": "[sigh] Let me think about that... [pause] Here's what I found."
    }
}))

Gemini Live — real-time TTS with emotional awareness:

config = live.LiveConnectConfig(
    response_modalities=["AUDIO"],
    speech_config=live.SpeechConfig(
        voice_config=live.VoiceConfig(
            prebuilt_voice_config=live.PrebuiltVoiceConfig(
                voice_name="Puck"  # 30 HD voices in 24 languages
            )
        )
    ),
    system_instruction="Speak warmly and naturally."
)

Key rules:

  • Always configure voice explicitly — defaults vary by provider
  • Gemini TTS supports enhanced expressivity with style prompts in system instructions
  • Grok supports inline auditory cues: [whisper], [sigh], [laugh], [pause]
  • For multi-speaker dialogue, use consistent voice assignments per character
  • Gemini offers 30 HD voices across 24 languages
  • Test with real users — perceived quality varies by use case and language
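For the multi-speaker rule, one approach is a fixed speaker-to-voice mapping built in order of first appearance. A hypothetical helper, not a provider API; the voice names come from the Gemini prebuilt list above:

```python
DIALOGUE_VOICES = ["Kore", "Puck", "Charon", "Fenrir", "Aoede"]

def assign_voices(lines):
    """Map each speaker to one fixed voice, in order of first appearance.

    lines: list of (speaker, text) tuples.
    Returns (voice, text) pairs ready for per-line TTS calls.
    """
    voices = {}
    out = []
    for speaker, text in lines:
        if speaker not in voices:
            voices[speaker] = DIALOGUE_VOICES[len(voices) % len(DIALOGUE_VOICES)]
        out.append((voices[speaker], text))
    return out
```

Each resulting (voice, text) pair can then be fed to a per-line synthesis call such as the `text_to_speech()` sketch above.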

Process documents with vision models using page ranges to avoid context overflow — HIGH

Vision Document Understanding

Extract structured data from documents, charts, and PDFs using vision models.

Incorrect — sending entire large PDF at once:

# Exceeds context window or times out on 100+ page PDFs
response = model.generate_content([genai.upload_file("large_doc.pdf"), "Summarize this"])

Correct — incremental PDF processing:

# Claude Code Read tool with page ranges (max 20 pages per request)
Read(file_path="/path/to/document.pdf", pages="1-5")    # TOC/structure scan
Read(file_path="/path/to/document.pdf", pages="45-55")   # Target section
Read(file_path="/path/to/document.pdf", pages="80-90")   # Appendix

Chart/diagram analysis — use high detail and explicit extraction prompt:

response = client.chat.completions.create(
    model="gpt-5.2",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": (
                "Extract all data from this chart. Return as structured JSON with:\n"
                "1. Chart type (bar, line, pie, etc.)\n"
                "2. Axis labels and units\n"
                "3. All data points with values\n"
                "4. Title and legend entries"
            )},
            {"type": "image_url", "image_url": {
                "url": f"data:{mime_type};base64,{base64_data}",
                "detail": "high"  # Required for chart OCR accuracy
            }}
        ]
    }]
)

Key rules:

  • For PDFs >10 pages, always use page ranges — never send all at once
  • Max 20 pages per Read request in Claude Code
  • Use detail: "high" for documents with small text, tables, or charts
  • Gemini 2.5 Pro handles longest documents (1M+ context)
  • Always validate extracted numbers against source when accuracy is critical
  • Claude supports up to 100 images per request (useful for multi-page document analysis)

PDF constraints:

| Constraint | Value |
|---|---|
| Max pages per request | 20 |
| Max file size | 20MB |
| Large PDF threshold | >10 pages |
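The page-range pattern generalizes to a loop that stays under the 20-page limit. A small helper to generate the range strings:

```python
def page_ranges(total_pages: int, chunk: int = 20):
    """Yield "start-end" page-range strings within the 20-page limit."""
    for start in range(1, total_pages + 1, chunk):
        yield f"{start}-{min(start + chunk - 1, total_pages)}"
```

Each yielded range can then be passed to a page-scoped read such as the Read calls shown above.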

Encode images correctly with content array structure to prevent silent vision failures — HIGH

Vision Image Analysis

Encode images correctly and structure multi-modal messages for each provider.

Incorrect — string content instead of content array:

# OpenAI — image silently ignored
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[{"role": "user", "content": f"Describe this image: {base64_data}"}]
)

Correct — structured content array with image_url:

import base64, mimetypes

def encode_image(path: str) -> tuple[str, str]:
    mime_type = mimetypes.guess_type(path)[0] or "image/png"
    with open(path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8"), mime_type

# OpenAI (GPT-5, GPT-4o)
base64_data, mime_type = encode_image(image_path)
response = client.chat.completions.create(
    model="gpt-5.2",
    max_tokens=4096,  # Required for vision — omitting truncates response
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {
                "url": f"data:{mime_type};base64,{base64_data}",
                "detail": "high"  # low=65 tokens, high=129+ tokens/tile
            }}
        ]
    }]
)

Claude — different content structure (type: "image", not "image_url"):

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {
                "type": "base64",
                "media_type": media_type,
                "data": base64_data
            }},
            {"type": "text", "text": prompt}
        ]
    }]
)

Gemini — uses PIL Image directly:

from PIL import Image
import google.generativeai as genai

model = genai.GenerativeModel("gemini-2.5-pro")
image = Image.open(image_path)
response = model.generate_content([prompt, image])

Multi-image comparison (Claude supports up to 100):

content = []
for img_path in images:
    b64, mt = encode_image(img_path)
    content.append({"type": "image", "source": {"type": "base64", "media_type": mt, "data": b64}})
content.append({"type": "text", "text": "Compare these images..."})
response = client.messages.create(model="claude-opus-4-6", max_tokens=8192, messages=[{"role": "user", "content": content}])

Object detection with bounding boxes (Gemini 2.5+):

response = model.generate_content([
    "Detect all objects. Return JSON: {objects: [{label, box: [x1,y1,x2,y2]}]}",
    image
])
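The detection output still has to be parsed and scaled. A sketch assuming the model returns coordinates normalized to 0-1000 (typical for Gemini detection output); drop the scaling if your prompt requests pixel coordinates directly:

```python
import json

def parse_boxes(response_text: str, width: int, height: int):
    """Scale detection boxes from the JSON shape requested above to pixels."""
    data = json.loads(response_text)
    boxes = []
    for obj in data["objects"]:
        x1, y1, x2, y2 = obj["box"]
        boxes.append({
            "label": obj["label"],
            "box": (x1 * width // 1000, y1 * height // 1000,
                    x2 * width // 1000, y2 * height // 1000),
        })
    return boxes
```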

Key rules:

  • Always set max_tokens on vision requests (responses truncated without it)
  • Resize images to max 2048px before encoding (reduces cost and latency)
  • Use detail: "low" (65 tokens) for simple classification, "high" for OCR/charts
  • Each provider has different content structure — do not mix formats
  • Claude uses a media_type field; OpenAI embeds the MIME type in the data URI
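The 2048px resize rule reduces to a dimension calculation before the actual resize. A sketch of the math; the resize itself can then be done with any image library (e.g. PIL's `Image.thumbnail` performs the same fit in place):

```python
def fit_within(width: int, height: int, max_side: int = 2048):
    """Target dimensions with aspect ratio kept and longest side <= max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height          # already small enough; skip resize
    scale = max_side / longest
    return round(width * scale), round(height * scale)
```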

Select the right vision model to balance cost against task complexity requirements — MEDIUM

Vision Model Selection

Choose the right vision provider based on accuracy, cost, and context needs.

Model comparison:

| Model | Context | Strengths | Max Images |
|---|---|---|---|
| GPT-5.2 | 128K | Best general reasoning | 10/request |
| Claude Opus 4.6 | 1M | Best coding, sustained agent tasks | 100/request |
| Gemini 2.5 Pro | 1M+ | Longest context, native video | 3,600/request |
| Gemini 3 Pro | 1M | Deep Think, enhanced segmentation | 3,600/request |
| Grok 4 | 2M | Real-time X integration | Limited |

Token cost by detail level:

| Provider | Detail Level | Token Cost | Use For |
|---|---|---|---|
| OpenAI | low | 65 tokens | Classification (yes/no) |
| OpenAI | high | 129+ tokens/tile | OCR, charts, detailed analysis |
| Gemini | base | 258 tokens | Scales with resolution |
| Claude | per-image | Fixed | Batch for efficiency |

Incorrect — using expensive model for simple classification:

# Wastes tokens: high detail + large model for yes/no
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Is there a person? Reply: yes/no"},
        {"type": "image_url", "image_url": {"url": img_url, "detail": "high"}}
    ]}]
)

Correct — cost-optimized for simple tasks:

response = client.chat.completions.create(
    model="gpt-5.2-mini",  # Cheaper model
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Is there a person? Reply: yes/no"},
        {"type": "image_url", "image_url": {"url": img_url, "detail": "low"}}  # 65 tokens
    ]}]
)

Image size limits:

| Provider | Max Size | Max Images | Notes |
|---|---|---|---|
| OpenAI | 20MB | 10/request | GPT-5 series |
| Claude | 8000x8000 px | 100/request | 2000px if >20 images |
| Gemini | 20MB | 3,600/request | Best for batch |
| Grok | 20MB | Limited | |

Selection guide:

| Scenario | Recommendation |
|---|---|
| High accuracy | Claude Opus 4.6 or GPT-5 |
| Long documents | Gemini 2.5 Pro (1M context) |
| Cost efficiency | Gemini 2.5 Flash ($0.15/M tokens) |
| Real-time/X data | Grok 4 with DeepSearch |
| Video analysis | Gemini 2.5/3 Pro (native) |
| Batch images | Claude (100/req) or Gemini (3,600/req) |
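The selection guide can be encoded as a lookup. A sketch; the task categories are illustrative and the model IDs follow this document, so verify current identifiers with each provider:

```python
def pick_vision_setup(task: str) -> tuple[str, str]:
    """Return (model, detail) per the selection guide above."""
    table = {
        "classification": ("gpt-5.2-mini", "low"),    # 65 tokens/image
        "ocr": ("gpt-5.2", "high"),                   # charts, small text
        "long_document": ("gemini-2.5-pro", "high"),  # 1M+ context
        "batch": ("gemini-2.5-flash", "low"),         # cheapest at volume
        "high_accuracy": ("claude-opus-4-6", "high"),
    }
    return table.get(task, ("gpt-5.2", "high"))
```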

Key rules:

  • Cannot identify specific people (privacy restriction, all providers)
  • May hallucinate on low-quality or rotated images (<200px)
  • No real-time video except Gemini — use frame extraction for others
  • Validate image format before encoding (corrupt files cause silent failures)
  • Always check rate limits on vision endpoints — they are lower than text-only
