# Multimodal LLM
Vision, audio, and multimodal LLM integration patterns. Use when processing images, transcribing audio, generating speech, or building multimodal AI pipelines.
## Multimodal LLM Patterns
Integrate vision and audio capabilities from leading multimodal models. Covers image analysis, document understanding, real-time voice agents, speech-to-text, and text-to-speech.
## Quick Reference
| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Vision: Image Analysis | 1 | HIGH | Image captioning, VQA, multi-image comparison, object detection |
| Vision: Document Understanding | 1 | HIGH | OCR, chart/diagram analysis, PDF processing, table extraction |
| Vision: Model Selection | 1 | MEDIUM | Choosing provider, cost optimization, image size limits |
| Audio: Speech-to-Text | 1 | HIGH | Transcription, speaker diarization, long-form audio |
| Audio: Text-to-Speech | 1 | MEDIUM | Voice synthesis, expressive TTS, multi-speaker dialogue |
| Audio: Model Selection | 1 | MEDIUM | Real-time voice agents, provider comparison, pricing |
Total: 6 rules across 2 categories (Vision, Audio)
## Vision: Image Analysis
Send images to multimodal LLMs for captioning, visual QA, and object detection. Always set `max_tokens` and resize images before encoding.
| Rule | File | Key Pattern |
|---|---|---|
| Image Analysis | rules/vision-image-analysis.md | Base64 encoding, multi-image, bounding boxes |
## Vision: Document Understanding
Extract structured data from documents, charts, and PDFs using vision models.
| Rule | File | Key Pattern |
|---|---|---|
| Document Vision | rules/vision-document.md | PDF page ranges, detail levels, OCR strategies |
## Vision: Model Selection
Choose the right vision provider based on accuracy, cost, and context window needs.
| Rule | File | Key Pattern |
|---|---|---|
| Vision Models | rules/vision-models.md | Provider comparison, token costs, image limits |
## Audio: Speech-to-Text
Convert audio to text with speaker diarization, timestamps, and sentiment analysis.
| Rule | File | Key Pattern |
|---|---|---|
| Speech-to-Text | rules/audio-speech-to-text.md | Gemini long-form, GPT-4o-Transcribe, AssemblyAI features |
## Audio: Text-to-Speech
Generate natural speech from text with voice selection and expressive cues.
| Rule | File | Key Pattern |
|---|---|---|
| Text-to-Speech | rules/audio-text-to-speech.md | Gemini TTS, voice config, auditory cues |
## Audio: Model Selection
Select the right audio/voice provider for real-time, transcription, or TTS use cases.
| Rule | File | Key Pattern |
|---|---|---|
| Audio Models | rules/audio-models.md | Real-time voice comparison, STT benchmarks, pricing |
## Key Decisions
| Decision | Recommendation |
|---|---|
| High accuracy vision | Claude Opus 4.6 or GPT-5 |
| Long documents | Gemini 2.5 Pro (1M context) |
| Cost-efficient vision | Gemini 2.5 Flash ($0.15/M tokens) |
| Video analysis | Gemini 2.5/3 Pro (native video) |
| Voice assistant | Grok Voice Agent (fastest, <1s) |
| Emotional voice AI | Gemini Live API |
| Long audio transcription | Gemini 2.5 Pro (9.5hr) |
| Speaker diarization | AssemblyAI or Gemini |
| Self-hosted STT | Whisper Large V3 |
## Example

```python
import anthropic, base64

client = anthropic.Anthropic()
with open("image.png", "rb") as f:
    b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
        {"type": "text", "text": "Describe this image"}
    ]}]
)
```

## Common Mistakes
- Not setting `max_tokens` on vision requests (responses truncated)
- Sending oversized images without resizing (>2048px)
- Using `high` detail level for simple yes/no classification
- Using STT+LLM+TTS pipeline instead of native speech-to-speech
- Not leveraging barge-in support for natural voice conversations
- Using deprecated models (GPT-4V, Whisper-1)
- Ignoring rate limits on vision and audio endpoints
## Related Skills
- `ork:rag-retrieval` - Multimodal RAG with image + text retrieval
- `ork:llm-integration` - General LLM function calling patterns
- `streaming-api-patterns` - WebSocket patterns for real-time audio
## Rules (6)
### Audio Model Selection — MEDIUM

Select the right audio model architecture to avoid unnecessary pipeline latency. Choose a provider based on latency, features, and cost.
Incorrect — building STT+LLM+TTS pipeline for voice assistants:
```python
# 3-step pipeline adds 2-5x latency vs native speech-to-speech
text = transcribe(audio)          # STT: ~500ms
response = llm.generate(text)     # LLM: ~1000ms
audio = text_to_speech(response)  # TTS: ~500ms
# Total: ~2000ms minimum
```

Correct — use native speech-to-speech for voice assistants:
```python
# Grok Voice Agent: <1s time-to-first-audio (5x faster)
async with websockets.connect("wss://api.x.ai/v1/realtime", extra_headers=headers) as ws:
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "model": "grok-4-voice",
            "voice": "Aria",
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
            "turn_detection": {"type": "server_vad"}
        }
    }))
    # Direct audio in -> audio out, no intermediary transcription
```

Real-time voice provider comparison:
| Model | Latency | Languages | Price | Best For |
|---|---|---|---|---|
| Grok Voice Agent | <1s TTFA | 100+ | $0.05/min | Fastest, lowest cost |
| Gemini Live API | Low | 24 (30 voices) | Usage-based | Emotional awareness |
| OpenAI Realtime | ~1s | 50+ | $0.10/min | Ecosystem integration |
Gemini Live — emotional awareness and barge-in:
```python
async with model.connect(config=config) as session:
    # Supports barge-in (user can interrupt anytime)
    # Affective dialog (understands and responds to emotions)
    # Proactive audio (responds only when relevant)
    async for response in session.receive():
        if response.data:
            yield response.data  # Audio bytes
```

Pricing comparison:
| Provider | Type | Price | Notes |
|---|---|---|---|
| Grok Voice Agent | Real-time | $0.05/min | Cheapest real-time |
| Gemini Live | Real-time | Usage-based | 30 HD voices |
| OpenAI Realtime | Real-time | $0.10/min | |
| Gemini 2.5 Pro | Transcription | $1.25/M tokens | 9.5hr audio |
| GPT-4o-Transcribe | Transcription | $0.01/min | |
| AssemblyAI | Transcription | ~$0.15/hr | Best features |
| Deepgram | Transcription | ~$0.0043/min | Cheapest STT |
Selection guide:
| Scenario | Recommendation |
|---|---|
| Voice assistant (speed) | Grok Voice Agent (<1s) |
| Emotional AI | Gemini Live API |
| Long audio (hours) | Gemini 2.5 Pro (9.5hr) |
| Speaker diarization | AssemblyAI or Gemini |
| Lowest latency STT | Deepgram Nova-3 (<300ms) |
| Self-hosted STT | Whisper Large V3 |
| Cheapest real-time | Grok ($0.05/min) |
Key rules:
- Use native speech-to-speech for voice assistants — never chain STT+LLM+TTS
- Grok Voice Agent is OpenAI Realtime API compatible (easy migration)
- Gemini Live supports barge-in — essential for natural conversations
- Always test latency with real users under realistic network conditions (a measurement sketch follows this list)
- For phone agents, <1s time-to-first-audio is the quality threshold
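
Time-to-first-audio is easy to measure before launch. Here is a minimal sketch, assuming the Realtime-style WebSocket session from the Grok example above, and assuming audio arrives as `response.audio.delta` events (the OpenAI Realtime convention; verify the event names for your provider):

```python
import json
import time

async def measure_ttfa(ws, audio_b64: str) -> float:
    """Time from committing one user audio turn to the first audio chunk back."""
    t0 = time.monotonic()
    # Send one audio turn and request a response
    await ws.send(json.dumps({"type": "input_audio_buffer.append", "audio": audio_b64}))
    await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
    await ws.send(json.dumps({"type": "response.create"}))
    async for raw in ws:
        event = json.loads(raw)
        if event.get("type") == "response.audio.delta":  # first audio back
            return time.monotonic() - t0
    raise RuntimeError("session closed before any audio arrived")
```

Run it over a realistic network path rather than localhost, and compare the result against the <1s threshold above.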
### Audio Speech-to-Text — HIGH

Configure speech-to-text with the right provider and model for accuracy and cost. Convert audio to text with speaker labels, timestamps, and structured output.
Incorrect — using deprecated model:
```python
# Whisper-1 is deprecated — use GPT-4o-Transcribe for better accuracy
response = client.audio.transcriptions.create(model="whisper-1", file=audio_file)
```

Correct — GPT-4o-Transcribe with structured output:
```python
from openai import OpenAI

client = OpenAI()
with open(audio_path, "rb") as audio_file:
    response = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        language="en",  # Optional: improves accuracy for known language
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"]
    )

result = {
    "text": response.text,
    "words": response.words,        # Word-level timestamps
    "segments": response.segments,  # Segment-level with speaker info
    "duration": response.duration
}
```

Gemini — best for long-form audio (up to 9.5 hours):
```python
import google.generativeai as genai

model = genai.GenerativeModel("gemini-2.5-pro")
audio_file = genai.upload_file(audio_path)
response = model.generate_content([
    audio_file,
    """Transcribe with:
    1. Speaker labels (Speaker 1, Speaker 2)
    2. Timestamps: [HH:MM:SS]
    3. Punctuation and formatting"""
])
```

AssemblyAI — best feature set (diarization + sentiment + entities):
```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"
config = aai.TranscriptionConfig(
    speaker_labels=True,
    sentiment_analysis=True,
    entity_detection=True,
    auto_highlights=True,
    language_detection=True
)
transcript = aai.Transcriber().transcribe(audio_url, config=config)
```

STT model comparison (a provider-routing sketch based on this table follows the key rules):
| Model | WER | Latency | Best For |
|---|---|---|---|
| Gemini 2.5 Pro | ~5% | Medium | 9.5hr audio, diarization |
| GPT-4o-Transcribe | ~7% | Medium | Accuracy + accents |
| AssemblyAI Universal-2 | 8.4% | 200ms | Best features |
| Deepgram Nova-3 | ~18% | <300ms | Lowest latency |
| Whisper Large V3 | 7.4% | Slow | Self-host, 99+ langs |
Key rules:
- Use GPT-4o-Transcribe, not Whisper-1 (deprecated)
- For audio >1 hour, prefer Gemini 2.5 Pro (handles up to 9.5 hours natively)
- AssemblyAI provides the richest metadata (sentiment, entities, highlights)
- Deepgram Nova-3 for lowest latency STT (<300ms)
- Whisper Large V3 only for self-hosted / air-gapped environments
- Always specify language when known — improves accuracy significantly
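
The comparison above collapses into a simple dispatcher. A minimal routing sketch: the provider names mirror the table, but the cutoffs are illustrative defaults, not benchmarks.

```python
def pick_stt_provider(
    duration_hours: float,
    needs_diarization: bool = False,
    needs_realtime: bool = False,
    self_hosted_only: bool = False,
) -> str:
    """Route to an STT provider using the comparison table above."""
    if self_hosted_only:
        return "whisper-large-v3"        # air-gapped / on-prem
    if needs_realtime:
        return "deepgram-nova-3"         # <300ms latency
    if duration_hours > 1:
        return "gemini-2.5-pro"          # handles up to 9.5 hours natively
    if needs_diarization:
        return "assemblyai-universal-2"  # richest metadata
    return "gpt-4o-transcribe"           # best general accuracy

# e.g. pick_stt_provider(duration_hours=3.0) -> "gemini-2.5-pro"
```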
### Audio Text-to-Speech — MEDIUM

Configure voice selection and style prompts for natural-sounding output. Generate natural speech from text with voice selection and expressive control.
Incorrect — no voice configuration:
```python
# Default voice with no style control — sounds robotic
response = model.generate_content(contents=text)
```

Correct — Gemini TTS with voice and style config:
```python
import google.generativeai as genai

def text_to_speech(text: str, voice: str = "Kore") -> bytes:
    """Gemini 2.5 Flash TTS with voice selection.

    Available voices: Puck, Charon, Kore, Fenrir, Aoede (30 total)
    """
    model = genai.GenerativeModel("gemini-2.5-flash-tts")
    response = model.generate_content(
        contents=text,
        generation_config=genai.GenerationConfig(
            response_mime_type="audio/mp3",
            speech_config=genai.SpeechConfig(
                voice_config=genai.VoiceConfig(
                    prebuilt_voice_config=genai.PrebuiltVoiceConfig(
                        voice_name=voice
                    )
                )
            )
        )
    )
    return response.audio
```

Expressive voice with auditory cues (Grok Voice Agent):
```python
# Supports: [whisper], [sigh], [laugh], [pause]
await ws.send(json.dumps({
    "type": "response.create",
    "response": {
        "instructions": "[sigh] Let me think about that... [pause] Here's what I found."
    }
}))
```

Gemini Live — real-time TTS with emotional awareness:
```python
config = live.LiveConnectConfig(
    response_modalities=["AUDIO"],
    speech_config=live.SpeechConfig(
        voice_config=live.VoiceConfig(
            prebuilt_voice_config=live.PrebuiltVoiceConfig(
                voice_name="Puck"  # 30 HD voices in 24 languages
            )
        )
    ),
    system_instruction="Speak warmly and naturally."
)
```

Key rules:
- Always configure voice explicitly — defaults vary by provider
- Gemini TTS supports enhanced expressivity with style prompts in system instructions
- Grok supports inline auditory cues: `[whisper]`, `[sigh]`, `[laugh]`, `[pause]`
- For multi-speaker dialogue, use consistent voice assignments per character (see the sketch after this list)
- Gemini offers 30 HD voices across 24 languages
- Test with real users — perceived quality varies by use case and language
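
One way to keep assignments consistent is a fixed character-to-voice map. A minimal sketch reusing the `text_to_speech` helper defined above; the voice names come from its docstring, and the `(character, text)` line format is an assumption:

```python
# Map each character to one voice and never change it mid-dialogue
VOICE_MAP = {"Narrator": "Kore", "Alice": "Aoede", "Bob": "Charon"}

def render_dialogue(lines: list[tuple[str, str]]) -> list[bytes]:
    """lines: (character, text) pairs, e.g. ("Alice", "Hi there!")."""
    return [text_to_speech(text, voice=VOICE_MAP[character])
            for character, text in lines]

clips = render_dialogue([
    ("Narrator", "It was a quiet morning."),
    ("Alice", "Did you hear that?"),
    ("Bob", "Hear what?"),
])
```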
### Vision Document Understanding — HIGH

Process documents with vision models using page ranges to avoid context overflow. Extract structured data from documents, charts, and PDFs.
Incorrect — sending entire large PDF at once:
```python
# Exceeds context window or times out on 100+ page PDFs
response = model.generate_content([genai.upload_file("large_doc.pdf"), "Summarize this"])
```

Correct — incremental PDF processing:
```python
# Claude Code Read tool with page ranges (max 20 pages per request)
Read(file_path="/path/to/document.pdf", pages="1-5")    # TOC/structure scan
Read(file_path="/path/to/document.pdf", pages="45-55")  # Target section
Read(file_path="/path/to/document.pdf", pages="80-90")  # Appendix
```
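
Outside Claude Code, the same incremental pattern works against the API by slicing the PDF locally. A minimal sketch using `pypdf` and Anthropic's PDF document content block (verify the block shape against current docs); `client` is the `anthropic.Anthropic()` client from the Example section:

```python
import base64
from io import BytesIO

from pypdf import PdfReader, PdfWriter

def pdf_slice_b64(path: str, first: int, last: int) -> str:
    """Extract pages [first, last] (1-indexed) and return base64 PDF bytes."""
    reader, writer = PdfReader(path), PdfWriter()
    for page in reader.pages[first - 1:last]:
        writer.add_page(page)
    buf = BytesIO()
    writer.write(buf)
    return base64.standard_b64encode(buf.getvalue()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=4096,
    messages=[{"role": "user", "content": [
        {"type": "document", "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": pdf_slice_b64("large_doc.pdf", 45, 55),  # stay under 20 pages
        }},
        {"type": "text", "text": "Summarize this section"},
    ]}],
)
```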
Chart/diagram analysis — use high detail and explicit extraction prompt:

```python
response = client.chat.completions.create(
    model="gpt-5.2",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": (
                "Extract all data from this chart. Return as structured JSON with:\n"
                "1. Chart type (bar, line, pie, etc.)\n"
                "2. Axis labels and units\n"
                "3. All data points with values\n"
                "4. Title and legend entries"
            )},
            {"type": "image_url", "image_url": {
                "url": f"data:{mime_type};base64,{base64_data}",
                "detail": "high"  # Required for chart OCR accuracy
            }}
        ]
    }]
)
```
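
Models often wrap the requested JSON in a Markdown fence, so parse defensively and check for the fields the prompt asked for. A minimal sketch; the required keys are an assumed schema matching the prompt above:

```python
import json
import re

REQUIRED_KEYS = {"chart_type", "axes", "data_points", "title"}  # assumed schema

def parse_chart_json(raw: str) -> dict:
    """Strip an optional ```json fence, parse, and sanity-check keys."""
    match = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    payload = match.group(1) if match else raw
    data = json.loads(payload)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"chart extraction missing keys: {missing}")
    return data

chart = parse_chart_json(response.choices[0].message.content)
```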
Key rules:
- For PDFs >10 pages, always use page ranges — never send all at once
- Max 20 pages per Read request in Claude Code
- Use `detail: "high"` for documents with small text, tables, or charts
- Gemini 2.5 Pro handles longest documents (1M+ context)
- Always validate extracted numbers against source when accuracy is critical
- Claude supports up to 100 images per request (useful for multi-page document analysis)
PDF constraints:
| Constraint | Value |
|---|---|
| Max pages per request | 20 |
| Max file size | 20MB |
| Large PDF threshold | >10 pages |
### Vision Image Analysis — HIGH

Encode images correctly with a content array structure to prevent silent vision failures, and structure multimodal messages for each provider's format.
Incorrect — string content instead of content array:
```python
# OpenAI — image silently ignored
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[{"role": "user", "content": f"Describe this image: {base64_data}"}]
)
```

Correct — structured content array with image_url:
```python
import base64, mimetypes

def encode_image(path: str) -> tuple[str, str]:
    mime_type = mimetypes.guess_type(path)[0] or "image/png"
    with open(path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8"), mime_type

# OpenAI (GPT-5, GPT-4o)
base64_data, mime_type = encode_image(image_path)
response = client.chat.completions.create(
    model="gpt-5.2",
    max_tokens=4096,  # Required for vision — omitting truncates response
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {
                "url": f"data:{mime_type};base64,{base64_data}",
                "detail": "high"  # low=65 tokens, high=129+ tokens/tile
            }}
        ]
    }]
)
```

Claude — different content structure (type: "image", not "image_url"):
```python
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {
                "type": "base64",
                "media_type": media_type,
                "data": base64_data
            }},
            {"type": "text", "text": prompt}
        ]
    }]
)
```

Gemini — uses PIL Image directly:
```python
import google.generativeai as genai
from PIL import Image

model = genai.GenerativeModel("gemini-2.5-pro")
image = Image.open(image_path)
response = model.generate_content([prompt, image])
```

Multi-image comparison (Claude supports up to 100):
```python
content = []
for img_path in images:
    b64, mt = encode_image(img_path)
    content.append({"type": "image", "source": {"type": "base64", "media_type": mt, "data": b64}})
content.append({"type": "text", "text": "Compare these images..."})

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=8192,
    messages=[{"role": "user", "content": content}]
)
```

Object detection with bounding boxes (Gemini 2.5+):
```python
response = model.generate_content([
    "Detect all objects. Return JSON: {objects: [{label, box: [x1,y1,x2,y2]}]}",
    image
])
```
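
Box coordinates come back in the model's own convention. Gemini typically normalizes to a 0-1000 grid, but treat that as an assumption and verify on a known image. A minimal sketch converting the JSON above to pixel coordinates:

```python
import json

def to_pixel_boxes(raw: str, width: int, height: int) -> list[dict]:
    """Scale [x1,y1,x2,y2] boxes to pixels, assuming 0-1000 normalization."""
    data = json.loads(raw)  # assumes bare JSON; strip Markdown fences if present
    return [{
        "label": obj["label"],
        "box": tuple(v * size // 1000
                     for v, size in zip(obj["box"], (width, height, width, height))),
    } for obj in data["objects"]]

pixel_boxes = to_pixel_boxes(response.text, *image.size)  # PIL size is (w, h)
```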
Key rules:
- Always set `max_tokens` on vision requests (responses truncated without it)
- Resize images to max 2048px before encoding (reduces cost and latency; see the helper after this list)
- Use `detail: "low"` (65 tokens) for simple classification, `"high"` for OCR/charts
- Each provider has different content structure — do not mix formats
- Claude takes `media_type` as a separate field; OpenAI embeds the MIME type in the data URI
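
A minimal resize-then-encode helper with Pillow, assuming the 2048px cap from the rule above. Use it in place of `encode_image` when source images may exceed the limit:

```python
import base64
from io import BytesIO

from PIL import Image

def prepare_image(path: str, max_px: int = 2048) -> tuple[str, str]:
    """Downscale so the longest edge is <= max_px, then return (base64, mime)."""
    img = Image.open(path)
    img.thumbnail((max_px, max_px))  # in-place, preserves aspect ratio
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=90)
    return base64.standard_b64encode(buf.getvalue()).decode("utf-8"), "image/jpeg"
```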
### Vision Model Selection — MEDIUM

Select the right vision model to balance cost against task complexity. Choose a provider based on accuracy, cost, and context window needs.
Model comparison:
| Model | Context | Strengths | Max Images |
|---|---|---|---|
| GPT-5.2 | 128K | Best general reasoning | 10/request |
| Claude Opus 4.6 | 1M | Best coding, sustained agent tasks | 100/request |
| Gemini 2.5 Pro | 1M+ | Longest context, native video | 3,600/request |
| Gemini 3 Pro | 1M | Deep Think, enhanced segmentation | 3,600/request |
| Grok 4 | 2M | Real-time X integration | Limited |
Token cost by detail level:
| Provider | Detail Level | Token Cost | Use For |
|---|---|---|---|
| OpenAI | low | 65 tokens | Classification (yes/no) |
| OpenAI | high | 129+ tokens/tile | OCR, charts, detailed analysis |
| Gemini | base | 258 tokens | Scales with resolution |
| Claude | per-image | Fixed | Batch for efficiency |
Incorrect — using expensive model for simple classification:
```python
# Wastes tokens: high detail + large model for yes/no
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Is there a person? Reply: yes/no"},
        {"type": "image_url", "image_url": {"url": img_url, "detail": "high"}}
    ]}]
)
```

Correct — cost-optimized for simple tasks:
```python
response = client.chat.completions.create(
    model="gpt-5.2-mini",  # Cheaper model
    max_tokens=10,  # Short yes/no answer (per the max_tokens rule above)
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Is there a person? Reply: yes/no"},
        {"type": "image_url", "image_url": {"url": img_url, "detail": "low"}}  # 65 tokens
    ]}]
)
```

Image size limits:
| Provider | Max Size | Max Images | Notes |
|---|---|---|---|
| OpenAI | 20MB | 10/request | GPT-5 series |
| Claude | 8000x8000 px | 100/request | 2000px if >20 images |
| Gemini | 20MB | 3,600/request | Best for batch |
| Grok | 20MB | Limited | |
Selection guide:
| Scenario | Recommendation |
|---|---|
| High accuracy | Claude Opus 4.6 or GPT-5 |
| Long documents | Gemini 2.5 Pro (1M context) |
| Cost efficiency | Gemini 2.5 Flash ($0.15/M tokens) |
| Real-time/X data | Grok 4 with DeepSearch |
| Video analysis | Gemini 2.5/3 Pro (native) |
| Batch images | Claude (100/req) or Gemini (3,600/req) |
Key rules:
- Cannot identify specific people (privacy restriction, all providers)
- May hallucinate on low-quality or rotated images (<200px)
- No real-time video except Gemini — use frame extraction for others (see the sketch below)
- Validate image format before encoding (corrupt files cause silent failures)
- Always check rate limits on vision endpoints — they are lower than text-only
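
For video with non-Gemini models, extract frames and send them as ordinary images. A minimal sketch with OpenCV; the one-frame-per-two-seconds rate and the 20-frame cap are arbitrary starting points:

```python
import base64

import cv2

def sample_frames(video_path: str, every_s: float = 2.0, max_frames: int = 20) -> list[str]:
    """Return base64 JPEG frames sampled every `every_s` seconds."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    step = int(fps * every_s)
    frames, i = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            ok, jpg = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.standard_b64encode(jpg.tobytes()).decode("utf-8"))
        i += 1
    cap.release()
    return frames
```

The resulting frames drop straight into the image request formats from the Vision Image Analysis rule.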