OrchestKit v6.7.1 — 67 skills, 38 agents, 77 hooks with Opus 4.6 support

Multimodal LLM

Vision, audio, and multimodal LLM integration patterns. Use when processing images, transcribing audio, generating speech, or building multimodal AI pipelines.

Category: Reference · Impact: high

Multimodal LLM Patterns

Integrate vision and audio capabilities from leading multimodal models. Covers image analysis, document understanding, real-time voice agents, speech-to-text, and text-to-speech.

Quick Reference

| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Vision: Image Analysis | 1 | HIGH | Image captioning, VQA, multi-image comparison, object detection |
| Vision: Document Understanding | 1 | HIGH | OCR, chart/diagram analysis, PDF processing, table extraction |
| Vision: Model Selection | 1 | MEDIUM | Choosing provider, cost optimization, image size limits |
| Audio: Speech-to-Text | 1 | HIGH | Transcription, speaker diarization, long-form audio |
| Audio: Text-to-Speech | 1 | MEDIUM | Voice synthesis, expressive TTS, multi-speaker dialogue |
| Audio: Model Selection | 1 | MEDIUM | Real-time voice agents, provider comparison, pricing |

Total: 6 rules across 2 categories (Vision, Audio)

Vision: Image Analysis

Send images to multimodal LLMs for captioning, visual QA, and object detection. Always set max_tokens and resize images before encoding.

| Rule | File | Key Pattern |
|---|---|---|
| Image Analysis | rules/vision-image-analysis.md | Base64 encoding, multi-image, bounding boxes |

Vision: Document Understanding

Extract structured data from documents, charts, and PDFs using vision models.

| Rule | File | Key Pattern |
|---|---|---|
| Document Vision | rules/vision-document.md | PDF page ranges, detail levels, OCR strategies |

Vision: Model Selection

Choose the right vision provider based on accuracy, cost, and context window needs.

| Rule | File | Key Pattern |
|---|---|---|
| Vision Models | rules/vision-models.md | Provider comparison, token costs, image limits |

Audio: Speech-to-Text

Convert audio to text with speaker diarization, timestamps, and sentiment analysis.

| Rule | File | Key Pattern |
|---|---|---|
| Speech-to-Text | rules/audio-speech-to-text.md | Gemini long-form, GPT-4o-Transcribe, AssemblyAI features |

Audio: Text-to-Speech

Generate natural speech from text with voice selection and expressive cues.

| Rule | File | Key Pattern |
|---|---|---|
| Text-to-Speech | rules/audio-text-to-speech.md | Gemini TTS, voice config, auditory cues |

Audio: Model Selection

Select the right audio/voice provider for real-time, transcription, or TTS use cases.

| Rule | File | Key Pattern |
|---|---|---|
| Audio Models | rules/audio-models.md | Real-time voice comparison, STT benchmarks, pricing |

Key Decisions

| Decision | Recommendation |
|---|---|
| High accuracy vision | Claude Opus 4.6 or GPT-5 |
| Long documents | Gemini 2.5 Pro (1M context) |
| Cost-efficient vision | Gemini 2.5 Flash ($0.15/M tokens) |
| Video analysis | Gemini 2.5/3 Pro (native video) |
| Voice assistant | Grok Voice Agent (fastest, <1s) |
| Emotional voice AI | Gemini Live API |
| Long audio transcription | Gemini 2.5 Pro (9.5hr) |
| Speaker diarization | AssemblyAI or Gemini |
| Self-hosted STT | Whisper Large V3 |

Example

import anthropic, base64

client = anthropic.Anthropic()
with open("image.png", "rb") as f:
    b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
        {"type": "text", "text": "Describe this image"}
    ]}]
)

Common Mistakes

  1. Not setting max_tokens on vision requests (responses truncated)
  2. Sending oversized images without resizing (>2048px)
  3. Using high detail level for simple yes/no classification
  4. Using STT+LLM+TTS pipeline instead of native speech-to-speech
  5. Not leveraging barge-in support for natural voice conversations
  6. Using deprecated models (GPT-4V, Whisper-1)
  7. Ignoring rate limits on vision and audio endpoints
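Mistake 7 can be mitigated with simple retry logic. A minimal sketch, assuming a generic `RateLimitError` as a stand-in for each SDK's own 429 exception (e.g. `anthropic.RateLimitError`):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for a provider's 429 error (e.g. anthropic.RateLimitError)."""

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry a zero-arg API call on rate-limit errors.

    Exponential backoff with proportional jitter: delays grow as
    base_delay * 2**attempt, scaled by a random factor in [1, 2).
    """
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the 429
            time.sleep(base_delay * 2 ** attempt * (1 + random.random()))
```

Wrap any vision or audio request in it, e.g. `with_backoff(lambda: client.messages.create(...))`.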
Related Skills

  • ork:rag-retrieval - Multimodal RAG with image + text retrieval
  • ork:llm-integration - General LLM function calling patterns
  • streaming-api-patterns - WebSocket patterns for real-time audio

Rules (6)

Select the right audio model architecture to avoid unnecessary pipeline latency — MEDIUM

Audio Model Selection

Choose the right audio provider based on latency, features, and cost.

Incorrect — building STT+LLM+TTS pipeline for voice assistants:

# 3-step pipeline adds 2-5x latency vs native speech-to-speech
text = transcribe(audio)          # STT: ~500ms
response = llm.generate(text)     # LLM: ~1000ms
audio = text_to_speech(response)  # TTS: ~500ms
# Total: ~2000ms minimum

Correct — use native speech-to-speech for voice assistants:

# Grok Voice Agent: <1s time-to-first-audio (5x faster)
import json, websockets

async with websockets.connect("wss://api.x.ai/v1/realtime", extra_headers=headers) as ws:
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "model": "grok-4-voice",
            "voice": "Aria",
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
            "turn_detection": {"type": "server_vad"}
        }
    }))
    # Direct audio in -> audio out, no intermediary transcription

Real-time voice provider comparison:

| Model | Latency | Languages | Price | Best For |
|---|---|---|---|---|
| Grok Voice Agent | <1s TTFA | 100+ | $0.05/min | Fastest, lowest cost |
| Gemini Live API | Low | 24 (30 voices) | Usage-based | Emotional awareness |
| OpenAI Realtime | ~1s | 50+ | $0.10/min | Ecosystem integration |

Gemini Live — emotional awareness and barge-in:

async with model.connect(config=config) as session:
    # Supports barge-in (user can interrupt anytime)
    # Affective dialog (understands and responds to emotions)
    # Proactive audio (responds only when relevant)
    async for response in session.receive():
        if response.data:
            yield response.data  # Audio bytes

Pricing comparison:

| Provider | Type | Price | Notes |
|---|---|---|---|
| Grok Voice Agent | Real-time | $0.05/min | Cheapest real-time |
| Gemini Live | Real-time | Usage-based | 30 HD voices |
| OpenAI Realtime | Real-time | $0.10/min | |
| Gemini 2.5 Pro | Transcription | $1.25/M tokens | 9.5hr audio |
| GPT-4o-Transcribe | Transcription | $0.01/min | |
| AssemblyAI | Transcription | ~$0.15/hr | Best features |
| Deepgram | Transcription | ~$0.0043/min | Cheapest STT |

Selection guide:

| Scenario | Recommendation |
|---|---|
| Voice assistant (speed) | Grok Voice Agent (<1s) |
| Emotional AI | Gemini Live API |
| Long audio (hours) | Gemini 2.5 Pro (9.5hr) |
| Speaker diarization | AssemblyAI or Gemini |
| Lowest latency STT | Deepgram Nova-3 (<300ms) |
| Self-hosted STT | Whisper Large V3 |
| Cheapest real-time | Grok ($0.05/min) |

Key rules:

  • Use native speech-to-speech for voice assistants — never chain STT+LLM+TTS
  • Grok Voice Agent is OpenAI Realtime API compatible (easy migration)
  • Gemini Live supports barge-in — essential for natural conversations
  • Always test latency with real users under realistic network conditions
  • For phone agents, <1s time-to-first-audio is the quality threshold
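The <1s time-to-first-audio threshold can be checked with a timer around the first received chunk. A sketch; `audio_chunks` stands in for whatever async iterator your provider's session exposes (e.g. the `session.receive()` loop above):

```python
import asyncio
import time

async def measure_ttfa(audio_chunks):
    """Seconds from call start to the first non-empty audio chunk.

    audio_chunks is any async iterator of audio byte chunks, standing in
    for a realtime session's receive loop.
    """
    start = time.monotonic()
    async for chunk in audio_chunks:
        if chunk:  # first audible payload marks time-to-first-audio
            return time.monotonic() - start
    return None  # stream ended with no audio
```

Run it against real network conditions, not localhost, since transport latency dominates the budget.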

Configure speech-to-text with the right provider and model for accuracy and cost — HIGH

Audio Speech-to-Text

Convert audio to text with speaker labels, timestamps, and structured output.

Incorrect — using deprecated model:

# Whisper-1 is deprecated — use GPT-4o-Transcribe for better accuracy
response = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

Correct — GPT-4o-Transcribe with structured output:

from openai import OpenAI
client = OpenAI()

with open(audio_path, "rb") as audio_file:
    response = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        language="en",  # Optional: improves accuracy for known language
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"]
    )

result = {
    "text": response.text,
    "words": response.words,        # Word-level timestamps
    "segments": response.segments,  # Segment-level with speaker info
    "duration": response.duration
}

Gemini — best for long-form audio (up to 9.5 hours):

import google.generativeai as genai
model = genai.GenerativeModel("gemini-2.5-pro")

audio_file = genai.upload_file(audio_path)
response = model.generate_content([
    audio_file,
    """Transcribe with:
    1. Speaker labels (Speaker 1, Speaker 2)
    2. Timestamps: [HH:MM:SS]
    3. Punctuation and formatting"""
])

AssemblyAI — best feature set (diarization + sentiment + entities):

import assemblyai as aai
aai.settings.api_key = "YOUR_API_KEY"

config = aai.TranscriptionConfig(
    speaker_labels=True,
    sentiment_analysis=True,
    entity_detection=True,
    auto_highlights=True,
    language_detection=True
)
transcript = aai.Transcriber().transcribe(audio_url, config=config)

STT model comparison:

| Model | WER | Latency | Best For |
|---|---|---|---|
| Gemini 2.5 Pro | ~5% | Medium | 9.5hr audio, diarization |
| GPT-4o-Transcribe | ~7% | Medium | Accuracy + accents |
| AssemblyAI Universal-2 | 8.4% | 200ms | Best features |
| Deepgram Nova-3 | ~18% | <300ms | Lowest latency |
| Whisper Large V3 | 7.4% | Slow | Self-host, 99+ langs |

Key rules:

  • Use GPT-4o-Transcribe, not Whisper-1 (deprecated)
  • For audio >1 hour, prefer Gemini 2.5 Pro (handles up to 9.5 hours natively)
  • AssemblyAI provides the richest metadata (sentiment, entities, highlights)
  • Deepgram Nova-3 for lowest latency STT (<300ms)
  • Whisper Large V3 only for self-hosted / air-gapped environments
  • Always specify language when known — improves accuracy significantly
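The rules above condense into a small routing helper. A sketch; the model identifiers mirror this document, so verify current IDs against each provider's docs before use:

```python
def pick_stt_model(duration_seconds: float, self_hosted: bool = False,
                   need_rich_metadata: bool = False) -> str:
    """Route audio to an STT provider following the rules above."""
    if self_hosted:
        return "whisper-large-v3"        # only option for air-gapped
    if duration_seconds > 3600:
        return "gemini-2.5-pro"          # handles up to 9.5h natively
    if need_rich_metadata:
        return "assemblyai-universal-2"  # diarization, sentiment, entities
    return "gpt-4o-transcribe"           # accuracy default
```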

Configure text-to-speech voice selection and style prompts for natural-sounding output — MEDIUM

Audio Text-to-Speech

Generate natural speech from text with voice selection and expressive control.

Incorrect — no voice configuration:

# Default voice with no style control — sounds robotic
response = model.generate_content(contents=text)

Correct — Gemini TTS with voice and style config:

import google.generativeai as genai

def text_to_speech(text: str, voice: str = "Kore") -> bytes:
    """Gemini 2.5 Flash TTS with voice selection.

    Available voices: Puck, Charon, Kore, Fenrir, Aoede (30 total)
    """
    model = genai.GenerativeModel("gemini-2.5-flash-tts")

    response = model.generate_content(
        contents=text,
        generation_config=genai.GenerationConfig(
            response_mime_type="audio/mp3",
            speech_config=genai.SpeechConfig(
                voice_config=genai.VoiceConfig(
                    prebuilt_voice_config=genai.PrebuiltVoiceConfig(
                        voice_name=voice
                    )
                )
            )
        )
    )
    return response.audio

Expressive voice with auditory cues (Grok Voice Agent):

# Supports: [whisper], [sigh], [laugh], [pause]
await ws.send(json.dumps({
    "type": "response.create",
    "response": {
        "instructions": "[sigh] Let me think about that... [pause] Here's what I found."
    }
}))

Gemini Live — real-time TTS with emotional awareness:

config = live.LiveConnectConfig(
    response_modalities=["AUDIO"],
    speech_config=live.SpeechConfig(
        voice_config=live.VoiceConfig(
            prebuilt_voice_config=live.PrebuiltVoiceConfig(
                voice_name="Puck"  # 30 HD voices in 24 languages
            )
        )
    ),
    system_instruction="Speak warmly and naturally."
)

Key rules:

  • Always configure voice explicitly — defaults vary by provider
  • Gemini TTS supports enhanced expressivity with style prompts in system instructions
  • Grok supports inline auditory cues: [whisper], [sigh], [laugh], [pause]
  • For multi-speaker dialogue, use consistent voice assignments per character
  • Gemini offers 30 HD voices across 24 languages
  • Test with real users — perceived quality varies by use case and language
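For the multi-speaker rule, one approach is a fixed speaker-to-voice mapping built in order of first appearance. A hypothetical helper, not a provider API; the voice names come from the Gemini prebuilt list above:

```python
DIALOGUE_VOICES = ["Kore", "Puck", "Charon", "Fenrir", "Aoede"]

def assign_voices(lines):
    """Map each speaker to one fixed voice, in order of first appearance.

    lines: list of (speaker, text) tuples.
    Returns (voice, text) pairs ready for per-line TTS calls.
    """
    voices = {}
    out = []
    for speaker, text in lines:
        if speaker not in voices:
            voices[speaker] = DIALOGUE_VOICES[len(voices) % len(DIALOGUE_VOICES)]
        out.append((voices[speaker], text))
    return out
```

Each resulting (voice, text) pair can then be fed to a per-line synthesis call such as the `text_to_speech()` sketch above.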

Process documents with vision models using page ranges to avoid context overflow — HIGH

Vision Document Understanding

Extract structured data from documents, charts, and PDFs using vision models.

Incorrect — sending entire large PDF at once:

# Exceeds context window or times out on 100+ page PDFs
response = model.generate_content([genai.upload_file("large_doc.pdf"), "Summarize this"])

Correct — incremental PDF processing:

# Claude Code Read tool with page ranges (max 20 pages per request)
Read(file_path="/path/to/document.pdf", pages="1-5")    # TOC/structure scan
Read(file_path="/path/to/document.pdf", pages="45-55")   # Target section
Read(file_path="/path/to/document.pdf", pages="80-90")   # Appendix

Chart/diagram analysis — use high detail and explicit extraction prompt:

response = client.chat.completions.create(
    model="gpt-5.2",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": (
                "Extract all data from this chart. Return as structured JSON with:\n"
                "1. Chart type (bar, line, pie, etc.)\n"
                "2. Axis labels and units\n"
                "3. All data points with values\n"
                "4. Title and legend entries"
            )},
            {"type": "image_url", "image_url": {
                "url": f"data:{mime_type};base64,{base64_data}",
                "detail": "high"  # Required for chart OCR accuracy
            }}
        ]
    }]
)

Key rules:

  • For PDFs >10 pages, always use page ranges — never send all at once
  • Max 20 pages per Read request in Claude Code
  • Use detail: "high" for documents with small text, tables, or charts
  • Gemini 2.5 Pro handles longest documents (1M+ context)
  • Always validate extracted numbers against source when accuracy is critical
  • Claude supports up to 100 images per request (useful for multi-page document analysis)

PDF constraints:

| Constraint | Value |
|---|---|
| Max pages per request | 20 |
| Max file size | 20MB |
| Large PDF threshold | >10 pages |
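The page-range pattern generalizes to a loop that stays under the 20-page limit. A small helper to generate the range strings:

```python
def page_ranges(total_pages: int, chunk: int = 20):
    """Yield "start-end" page-range strings within the 20-page limit."""
    for start in range(1, total_pages + 1, chunk):
        yield f"{start}-{min(start + chunk - 1, total_pages)}"
```

Each yielded range can then be passed to a page-scoped read such as the Read calls shown above.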

Encode images correctly with content array structure to prevent silent vision failures — HIGH

Vision Image Analysis

Encode images correctly and structure multi-modal messages for each provider.

Incorrect — string content instead of content array:

# OpenAI — image silently ignored
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[{"role": "user", "content": f"Describe this image: {base64_data}"}]
)

Correct — structured content array with image_url:

import base64, mimetypes

def encode_image(path: str) -> tuple[str, str]:
    mime_type = mimetypes.guess_type(path)[0] or "image/png"
    with open(path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8"), mime_type

# OpenAI (GPT-5, GPT-4o)
base64_data, mime_type = encode_image(image_path)
response = client.chat.completions.create(
    model="gpt-5.2",
    max_tokens=4096,  # Required for vision — omitting truncates response
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {
                "url": f"data:{mime_type};base64,{base64_data}",
                "detail": "high"  # low=65 tokens, high=129+ tokens/tile
            }}
        ]
    }]
)

Claude — different content structure (type: "image", not "image_url"):

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {
                "type": "base64",
                "media_type": media_type,
                "data": base64_data
            }},
            {"type": "text", "text": prompt}
        ]
    }]
)

Gemini — uses PIL Image directly:

from PIL import Image
import google.generativeai as genai

model = genai.GenerativeModel("gemini-2.5-pro")
image = Image.open(image_path)
response = model.generate_content([prompt, image])

Multi-image comparison (Claude supports up to 100):

content = []
for img_path in images:
    b64, mt = encode_image(img_path)
    content.append({"type": "image", "source": {"type": "base64", "media_type": mt, "data": b64}})
content.append({"type": "text", "text": "Compare these images..."})
response = client.messages.create(model="claude-opus-4-6", max_tokens=8192, messages=[{"role": "user", "content": content}])

Object detection with bounding boxes (Gemini 2.5+):

response = model.generate_content([
    "Detect all objects. Return JSON: {objects: [{label, box: [x1,y1,x2,y2]}]}",
    image
])
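The detection output still has to be parsed and scaled. A sketch assuming the model returns coordinates normalized to 0-1000 (typical for Gemini detection output); drop the scaling if your prompt requests pixel coordinates directly:

```python
import json

def parse_boxes(response_text: str, width: int, height: int):
    """Scale detection boxes from the JSON shape requested above to pixels."""
    data = json.loads(response_text)
    boxes = []
    for obj in data["objects"]:
        x1, y1, x2, y2 = obj["box"]
        boxes.append({
            "label": obj["label"],
            "box": (x1 * width // 1000, y1 * height // 1000,
                    x2 * width // 1000, y2 * height // 1000),
        })
    return boxes
```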

Key rules:

  • Always set max_tokens on vision requests (responses truncated without it)
  • Resize images to max 2048px before encoding (reduces cost and latency)
  • Use detail: "low" (65 tokens) for simple classification, "high" for OCR/charts
  • Each provider has different content structure — do not mix formats
  • Claude uses a media_type field; OpenAI embeds the MIME type in the data URI
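The 2048px resize rule reduces to a dimension calculation before the actual resize. A sketch of the math; the resize itself can then be done with any image library (e.g. PIL's `Image.thumbnail` performs the same fit in place):

```python
def fit_within(width: int, height: int, max_side: int = 2048):
    """Target dimensions with aspect ratio kept and longest side <= max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height          # already small enough; skip resize
    scale = max_side / longest
    return round(width * scale), round(height * scale)
```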

Select the right vision model to balance cost against task complexity requirements — MEDIUM

Vision Model Selection

Choose the right vision provider based on accuracy, cost, and context needs.

Model comparison:

| Model | Context | Strengths | Max Images |
|---|---|---|---|
| GPT-5.2 | 128K | Best general reasoning | 10/request |
| Claude Opus 4.6 | 1M | Best coding, sustained agent tasks | 100/request |
| Gemini 2.5 Pro | 1M+ | Longest context, native video | 3,600/request |
| Gemini 3 Pro | 1M | Deep Think, enhanced segmentation | 3,600/request |
| Grok 4 | 2M | Real-time X integration | Limited |

Token cost by detail level:

| Provider | Detail Level | Token Cost | Use For |
|---|---|---|---|
| OpenAI | low | 65 tokens | Classification (yes/no) |
| OpenAI | high | 129+ tokens/tile | OCR, charts, detailed analysis |
| Gemini | base | 258 tokens | Scales with resolution |
| Claude | per-image | Fixed | Batch for efficiency |

Incorrect — using expensive model for simple classification:

# Wastes tokens: high detail + large model for yes/no
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Is there a person? Reply: yes/no"},
        {"type": "image_url", "image_url": {"url": img_url, "detail": "high"}}
    ]}]
)

Correct — cost-optimized for simple tasks:

response = client.chat.completions.create(
    model="gpt-5.2-mini",  # Cheaper model
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Is there a person? Reply: yes/no"},
        {"type": "image_url", "image_url": {"url": img_url, "detail": "low"}}  # 65 tokens
    ]}]
)

Image size limits:

| Provider | Max Size | Max Images | Notes |
|---|---|---|---|
| OpenAI | 20MB | 10/request | GPT-5 series |
| Claude | 8000x8000 px | 100/request | 2000px if >20 images |
| Gemini | 20MB | 3,600/request | Best for batch |
| Grok | 20MB | Limited | |

Selection guide:

| Scenario | Recommendation |
|---|---|
| High accuracy | Claude Opus 4.6 or GPT-5 |
| Long documents | Gemini 2.5 Pro (1M context) |
| Cost efficiency | Gemini 2.5 Flash ($0.15/M tokens) |
| Real-time/X data | Grok 4 with DeepSearch |
| Video analysis | Gemini 2.5/3 Pro (native) |
| Batch images | Claude (100/req) or Gemini (3,600/req) |
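The selection guide can be encoded as a lookup. A sketch; the task categories are illustrative and the model IDs follow this document, so verify current identifiers with each provider:

```python
def pick_vision_setup(task: str) -> tuple[str, str]:
    """Return (model, detail) per the selection guide above."""
    table = {
        "classification": ("gpt-5.2-mini", "low"),    # 65 tokens/image
        "ocr": ("gpt-5.2", "high"),                   # charts, small text
        "long_document": ("gemini-2.5-pro", "high"),  # 1M+ context
        "batch": ("gemini-2.5-flash", "low"),         # cheapest at volume
        "high_accuracy": ("claude-opus-4-6", "high"),
    }
    return table.get(task, ("gpt-5.2", "high"))
```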

Key rules:

  • Cannot identify specific people (privacy restriction, all providers)
  • May hallucinate on low-quality or rotated images (<200px)
  • No real-time video except Gemini — use frame extraction for others
  • Validate image format before encoding (corrupt files cause silent failures)
  • Always check rate limits on vision endpoints — they are lower than text-only
