Performance
Performance optimization patterns covering Core Web Vitals, React render optimization, lazy loading, image optimization, backend profiling, and LLM inference. Use when improving page speed, debugging slow renders, optimizing bundles, reducing image payload, profiling backend services, or deploying LLMs efficiently.
Primary Agent: frontend-ui-developer
Performance
Comprehensive performance optimization patterns for frontend, backend, and LLM inference.
Quick Reference
| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Core Web Vitals | 3 | CRITICAL | LCP, INP, CLS optimization with 2026 thresholds |
| Render Optimization | 3 | HIGH | React Compiler, memoization, virtualization |
| Lazy Loading | 3 | HIGH | Code splitting, route splitting, preloading |
| Image Optimization | 3 | HIGH | Next.js Image, AVIF/WebP, responsive images |
| Profiling & Backend | 3 | MEDIUM | React DevTools, py-spy, bundle analysis |
| LLM Inference | 3 | MEDIUM | vLLM, quantization, speculative decoding |
| Caching | 2 | HIGH | Redis cache-aside, prompt caching, HTTP cache headers |
| Query & Data Fetching | 2 | HIGH | TanStack Query prefetching, optimistic updates, rollback |
Total: 22 rules across 8 categories
Core Web Vitals
Google's Core Web Vitals with 2026 stricter thresholds.
| Rule | File | Key Pattern |
|---|---|---|
| LCP Optimization | rules/cwv-lcp.md | Preload hero, SSR, fetchpriority="high" |
| INP Optimization | rules/cwv-inp.md | scheduler.yield, useTransition, requestIdleCallback |
| CLS Prevention | rules/cwv-cls.md | Explicit dimensions, aspect-ratio, font-display |
2026 Thresholds
| Metric | Current Good | 2026 Good |
|---|---|---|
| LCP | <= 2.5s | <= 2.0s |
| INP | <= 200ms | <= 150ms |
| CLS | <= 0.1 | <= 0.08 |
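To verify these targets with field data, here is a minimal sketch assuming the `web-vitals` npm package (its `onLCP`/`onINP`/`onCLS` callbacks report real-user values):
import { onCLS, onINP, onLCP, type Metric } from 'web-vitals';
// 2026 "good" thresholds from the table above (ms for LCP/INP, unitless for CLS)
const THRESHOLDS: Record<string, number> = { LCP: 2000, INP: 150, CLS: 0.08 };
function report(metric: Metric) {
  const limit = THRESHOLDS[metric.name];
  if (limit !== undefined && metric.value > limit) {
    // In production, send this to a RUM endpoint instead of logging
    console.warn(`${metric.name} ${metric.value} exceeds 2026 target ${limit}`);
  }
}
onLCP(report);
onINP(report);
onCLS(report);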
Render Optimization
React render performance patterns for React 19+.
| Rule | File | Key Pattern |
|---|---|---|
| React Compiler | rules/render-compiler.md | Auto-memoization, "Memo" badge verification |
| Manual Memoization | rules/render-memo.md | useMemo/useCallback escape hatches, state colocation |
| Virtualization | rules/render-virtual.md | TanStack Virtual for 100+ item lists |
Lazy Loading
Code splitting and lazy loading with React.lazy and Suspense.
| Rule | File | Key Pattern |
|---|---|---|
| React.lazy + Suspense | rules/loading-lazy.md | Component lazy loading, error boundaries |
| Route Splitting | rules/loading-splitting.md | React Router 7.x, Vite manual chunks |
| Preloading | rules/loading-preload.md | Prefetch on hover, modulepreload hints |
Image Optimization
Production image optimization for modern web applications.
| Rule | File | Key Pattern |
|---|---|---|
| Next.js Image | rules/images-nextjs.md | Image component, priority, blur placeholder |
| Format Selection | rules/images-formats.md | AVIF/WebP, quality 75-85, picture element |
| Responsive Images | rules/images-responsive.md | sizes prop, art direction, CDN loaders |
Profiling & Backend
Profiling tools and backend optimization patterns.
| Rule | File | Key Pattern |
|---|---|---|
| React Profiling | rules/profiling-react.md | DevTools Profiler, flamegraph, render counts |
| Backend Profiling | rules/profiling-backend.md | py-spy, cProfile, memory_profiler, flame graphs |
| Bundle Analysis | rules/profiling-bundle.md | vite-bundle-visualizer, tree shaking, performance budgets |
LLM Inference
High-performance LLM inference with vLLM, quantization, and speculative decoding.
| Rule | File | Key Pattern |
|---|---|---|
| vLLM Deployment | rules/inference-vllm.md | PagedAttention, continuous batching, tensor parallelism |
| Quantization | rules/inference-quantization.md | AWQ, GPTQ, FP8, INT8 method selection |
| Speculative Decoding | rules/inference-speculative.md | N-gram, draft model, 1.5-2.5x throughput |
Caching
Backend Redis caching and LLM prompt caching for cost savings and performance.
| Rule | File | Key Pattern |
|---|---|---|
| Redis & Backend | rules/caching-redis.md | Cache-aside, write-through, invalidation, stampede prevention |
| HTTP & Prompt | rules/caching-http.md | HTTP cache headers, LLM prompt caching, semantic caching |
Query & Data Fetching
TanStack Query v5 patterns for prefetching and optimistic updates.
| Rule | File | Key Pattern |
|---|---|---|
| Prefetching | rules/query-prefetching.md | Hover prefetch, route loaders, queryOptions, Suspense |
| Optimistic Updates | rules/query-optimistic.md | Optimistic mutations, rollback, cache invalidation |
Quick Start Example
// LCP: Priority hero image with SSR
import Image from 'next/image';
export default async function Page() {
const data = await fetchHeroData();
return (
<Image
src={data.heroImage}
alt="Hero"
priority
placeholder="blur"
sizes="100vw"
fill
/>
);
}
Key Decisions
| Decision | Recommendation |
|---|---|
| Memoization | Let React Compiler handle it (2026 default) |
| Lists 100+ items | Use TanStack Virtual |
| Image format | AVIF with WebP fallback (30-50% smaller) |
| LCP content | SSR/SSG, never client-side fetch |
| Code splitting | Per-route for most apps, per-component for heavy widgets |
| Prefetch strategy | On hover for nav links, viewport for content |
| Quantization | AWQ for 4-bit, FP8 for H100/H200 |
| Bundle budget | Hard fail in CI to prevent regression |
Common Mistakes
- Client-side fetching LCP content (delays render)
- Images without explicit dimensions (causes CLS)
- Lazy loading LCP images (delays largest paint)
- Heavy computation in event handlers (blocks INP)
- Layout-shifting animations (use transform instead)
- Lazy loading tiny components < 5KB (overhead > savings)
- Missing error boundaries on lazy components
- Using GPTQ without calibration data
- Not benchmarking actual workload patterns
- Only measuring in lab environment (need RUM)
Related Skills
- ork:react-server-components-framework - Server-first rendering
- ork:vite-advanced - Build optimization
- caching - Cache strategies for responses
- ork:monitoring-observability - Production monitoring and alerting
- ork:database-patterns - Query and index optimization
- ork:llm-integration - Local inference with Ollama
Capability Details
lcp-optimization
Keywords: LCP, largest-contentful-paint, hero, preload, priority, SSR
Solves:
- Optimize hero image loading
- Server-render critical content
- Preload and prioritize LCP resources
inp-optimization
Keywords: INP, interaction, responsiveness, long-task, transition, yield
Solves:
- Break up long tasks with scheduler.yield
- Defer non-urgent updates with useTransition
- Optimize event handler performance
cls-prevention
Keywords: CLS, layout-shift, dimensions, aspect-ratio, font-display
Solves:
- Reserve space for dynamic content
- Prevent font flash and image pop-in
- Use transform for animations
react-compiler
Keywords: react-compiler, auto-memo, memoization, React 19
Solves:
- Enable automatic memoization
- Identify when manual memoization needed
- Verify compiler is working
virtualization
Keywords: virtual, TanStack, large-list, scroll, overscan
Solves:
- Render 100+ item lists efficiently
- Dynamic height virtualization
- Window scrolling patterns
lazy-loading
Keywords: React.lazy, Suspense, code-splitting, dynamic-import
Solves:
- Route-based code splitting
- Component lazy loading with error boundaries
- Prefetch on hover and viewport
image-optimization
Keywords: next/image, AVIF, WebP, responsive, blur-placeholder
Solves:
- Next.js Image component patterns
- Format selection and quality settings
- Responsive sizing and CDN configuration
profiling
Keywords: profiler, flame-graph, py-spy, DevTools, bundle-analyzer
Solves:
- Profile React renders and backend code
- Generate and interpret flame graphs
- Analyze and optimize bundle size
llm-inference
Keywords: vllm, quantization, speculative-decoding, inference, throughput
Solves:
- Deploy LLMs with vLLM for production
- Choose quantization method for hardware
- Accelerate generation with speculative decoding
References
- RUM Setup - Real User Monitoring
- React Compiler Migration - Compiler adoption
- TanStack Virtual - Virtualization patterns
- vLLM Deployment - Production vLLM config
- Quantization Guide - Method comparison
- CDN Setup - Image CDN configuration
Rules (22)
Configure HTTP and LLM prompt caching with correct breakpoint ordering for maximum savings — HIGH
HTTP & Prompt Caching
HTTP cache headers for CDN/browser caching and LLM prompt caching for 90% token savings.
Incorrect — variable content before cached prefix:
# WRONG: Variable content before static content breaks prompt cache
messages = [
{"role": "user", "content": f"User {user_id} asks: {question}"}, # Variable first!
{"role": "system", "content": long_system_prompt}, # Static content after = never cached
]
Correct — static prefix first, then variable content:
# Claude prompt caching: static content first with cache_control
response = await client.messages.create(
model="claude-sonnet-4-6",
system=[
{
"type": "text",
"text": long_system_prompt, # Static: cached across calls
"cache_control": {"type": "ephemeral"}, # 5-minute TTL
},
],
messages=[
{"role": "user", "content": user_question}, # Variable: after cache breakpoint
],
)
# Result: ~90% token savings on system prompt after first call
# OpenAI: automatic prefix caching (no markers needed)
# Just ensure static content comes first in messages array
response = await openai.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": long_system_prompt}, # Cached automatically
{"role": "user", "content": user_question},
],
)
HTTP cache headers for API responses:
from fastapi import FastAPI, Response
app = FastAPI()
@app.get("/api/products/{product_id}")
async def get_product(product_id: str, response: Response):
product = await fetch_product(product_id)
# Browser caches 60s, CDN caches 1h
response.headers["Cache-Control"] = "public, max-age=60, s-maxage=3600"
response.headers["CDN-Cache-Control"] = "max-age=3600"
return product
@app.get("/api/user/profile")
async def get_profile(response: Response):
# Private: only browser cache, not CDN
response.headers["Cache-Control"] = "private, max-age=300"
    return await get_current_user_profile()
Key rules:
- Claude: use `cache_control` with `ephemeral` type (5min default, 1h if >10 reads/hour)
- OpenAI: automatic prefix caching, no markers needed — just put static content first
- HTTP: `public, max-age=60, stale-while-revalidate=300` for API responses
- Use `s-maxage` or `CDN-Cache-Control` for different CDN vs browser TTLs
- Semantic caching: start threshold at 0.92, tune based on hit rate (see the sketch below)
- Never cache error responses or authentication tokens
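A minimal semantic-caching sketch for the threshold rule above (hypothetical, not from the source: `embed()` stands in for any embedding API, and an in-memory array stands in for a vector store):
declare function embed(text: string): Promise<number[]>; // hypothetical embedding call
type CachedEntry = { embedding: number[]; response: string };
const store: CachedEntry[] = [];
const THRESHOLD = 0.92; // starting point; tune based on observed hit rate

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function semanticLookup(query: string): Promise<string | null> {
  const qv = await embed(query);
  let best: CachedEntry | null = null;
  let bestSim = 0;
  for (const entry of store) {
    const sim = cosine(qv, entry.embedding);
    if (sim > bestSim) { bestSim = sim; best = entry; }
  }
  // Serve the cached response only above the similarity threshold
  return best && bestSim >= THRESHOLD ? best.response : null;
}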
Implement Redis cache-aside pattern with TTL and stampede prevention for backend caching — HIGH
Redis & Backend Caching
Cache-aside, write-through, and invalidation patterns for Redis-backed backend services.
Incorrect — caching without TTL (memory leak):
# WRONG: No TTL = memory grows forever
async def get_user(user_id: str):
cached = await redis.get(f"user:{user_id}")
if cached:
return json.loads(cached)
user = await db.fetch_user(user_id)
await redis.set(f"user:{user_id}", json.dumps(user)) # No expiry!
    return user
Correct — cache-aside with TTL and stampede prevention:
import redis.asyncio as redis
import json
import asyncio
class CacheAside:
def __init__(self, redis_client: redis.Redis, default_ttl: int = 3600):
self.redis = redis_client
self.ttl = default_ttl
async def get_or_set(self, key: str, fetch_fn, ttl: int | None = None):
"""Cache-aside with stampede prevention via lock."""
cached = await self.redis.get(key)
if cached:
return json.loads(cached)
# Stampede prevention: only one caller computes
lock_key = f"lock:{key}"
acquired = await self.redis.set(lock_key, "1", ex=30, nx=True)
        if not acquired:
            # Another process is computing, wait and retry
            await asyncio.sleep(0.1)
            cached = await self.redis.get(key)
            if cached:
                return json.loads(cached)
        try:
            value = await fetch_fn()
            await self.redis.setex(key, ttl or self.ttl, json.dumps(value))
            return value
        finally:
            # Only the lock holder should release the lock
            if acquired:
                await self.redis.delete(lock_key)
# Write-through: update cache and DB atomically
async def update_user(user_id: str, data: dict, db, cache: CacheAside):
async with db.transaction():
await db.execute("UPDATE users SET ... WHERE id = $1", user_id)
await cache.redis.setex(
f"user:{user_id}",
cache.ttl,
json.dumps(data),
)
# Event-based invalidation
async def on_user_updated(event: UserUpdatedEvent, cache: CacheAside):
await cache.redis.delete(f"user:{event.user_id}")
# Related caches too
await cache.redis.delete(f"user-profile:{event.user_id}")Key rules:
- Always set TTL (1h default, 5min for volatile data)
- Use `orjson` for serialization performance over `json`
- Key naming: `{entity}:{id}` or `{entity}:{id}:{field}`
- Stampede prevention: use distributed locks for expensive computations
- Event-based invalidation for writes, TTL for reads
- Never use cache as primary storage (data loss risk)
Prevent Cumulative Layout Shift that causes content jumping and hurts search rankings — CRITICAL
CLS Prevention
Prevent Cumulative Layout Shift for the 2026 threshold of <= 0.08.
Reserve Space for Dynamic Content
/* Reserve space for images */
.image-container {
aspect-ratio: 16 / 9;
width: 100%;
}
/* Reserve space for ads */
.ad-slot {
min-height: 250px;
}
Explicit Dimensions
// Always set width and height
<img src="/photo.jpg" width={800} height={600} alt="Photo" />
// Next.js Image handles this automatically
<Image src="/photo.jpg" width={800} height={600} alt="Photo" />
// For responsive images
<Image src="/photo.jpg" fill sizes="(max-width: 768px) 100vw, 50vw" />Avoid Layout-Shifting Fonts
/* Use font-display: optional for non-critical fonts */
@font-face {
font-family: 'CustomFont';
src: url('/fonts/custom.woff2') format('woff2');
font-display: optional;
}
/* Or use size-adjust for fallback */
@font-face {
font-family: 'Fallback';
src: local('Arial');
size-adjust: 105%;
ascent-override: 95%;
}
Animations That Don't Cause Layout Shift
/* BAD: Changes layout properties */
.expanding {
height: 0;
transition: height 0.3s;
}
.expanding.open {
height: 200px; /* Causes layout shift */
}
/* GOOD: Use transform */
.expanding {
transform: scaleY(0);
transform-origin: top;
transition: transform 0.3s;
}
.expanding.open {
transform: scaleY(1);
}
Incorrect — Image without dimensions causes layout shift:
<img src="/photo.jpg" alt="Photo" />Correct — Explicit dimensions reserve space:
<img src="/photo.jpg" width={800} height={600} alt="Photo" />Key Rules
- Always set width/height on images
- Use `aspect-ratio` for responsive containers
- Use `font-display: optional` for non-critical fonts
- Never animate layout properties (width, height, top, left)
- Use `transform` and `opacity` for animations
- Reserve space for ads, embeds, and dynamic content
- Target <= 0.08 for 2026 thresholds
Optimize Interaction to Next Paint to ensure responsive button clicks and interactions — CRITICAL
INP Optimization
Optimize Interaction to Next Paint for the 2026 threshold of <= 150ms.
Break Up Long Tasks
// BAD: Long synchronous task (blocks main thread)
function processLargeArray(items: Item[]) {
items.forEach(processItem); // Blocks for entire duration
}
// GOOD: Yield to main thread
async function processLargeArray(items: Item[]) {
  let lastYield = performance.now();
  for (const item of items) {
    processItem(item);
    // Yield roughly every 50ms so pending input events can run
    if (performance.now() - lastYield > 50) {
      await (globalThis.scheduler?.yield() ?? new Promise((r) => setTimeout(r, 0)));
      lastYield = performance.now();
    }
  }
}
Use Transitions for Non-Urgent Updates
import { useState, useTransition, type ChangeEvent } from 'react';
function SearchResults() {
  const [query, setQuery] = useState('');
  const [filteredResults, setFilteredResults] = useState(() => filterResults(''));
  const [isPending, startTransition] = useTransition();
const handleChange = (e: ChangeEvent<HTMLInputElement>) => {
// Urgent: Update input immediately
setQuery(e.target.value);
// Non-urgent: Defer expensive filter
startTransition(() => {
setFilteredResults(filterResults(e.target.value));
});
};
return (
<>
<input value={query} onChange={handleChange} />
{isPending && <Spinner />}
<ResultsList results={filteredResults} />
</>
);
}
Optimize Event Handlers
// BAD: Heavy computation in click handler
<button onClick={() => {
const result = heavyComputation(); // Blocks paint
setResult(result);
}}>Calculate</button>
// GOOD: Defer heavy work
<button onClick={() => {
setLoading(true);
requestIdleCallback(() => {
const result = heavyComputation();
setResult(result);
setLoading(false);
});
}}>Calculate</button>
Incorrect — Blocking click handler delays visual feedback:
<button onClick={() => {
const result = heavyComputation(); // Blocks paint
setResult(result);
}}>Calculate</button>
Correct — Deferred work keeps UI responsive:
<button onClick={() => {
setLoading(true);
requestIdleCallback(() => {
const result = heavyComputation();
setResult(result);
setLoading(false);
});
}}>Calculate</button>
Key Rules
- Break long tasks > 50ms with `scheduler.yield()`
- Use `useTransition` for non-urgent state updates
- Defer heavy computation with `requestIdleCallback`
- Never block the main thread in event handlers
- Use `useDeferredValue` for expensive derived values (see the sketch below)
- Target <= 150ms for 2026 thresholds
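A minimal sketch of the `useDeferredValue` rule above, assuming a plain `items: string[]` prop: typing stays responsive while the filtered list lags a render behind.
import { useDeferredValue, useState } from 'react';
function FilteredList({ items }: { items: string[] }) {
  const [query, setQuery] = useState('');
  const deferredQuery = useDeferredValue(query);
  // Derived work tracks the deferred value, so keystrokes never block on it
  const filtered = items.filter((item) => item.includes(deferredQuery));
  return (
    <>
      <input value={query} onChange={(e) => setQuery(e.target.value)} />
      <ul>{filtered.map((item) => <li key={item}>{item}</li>)}</ul>
    </>
  );
}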
Optimize Largest Contentful Paint to improve search rankings and perceived page speed — CRITICAL
LCP Optimization
Optimize Largest Contentful Paint for the 2026 threshold of <= 2.0s.
Identify LCP Element
new PerformanceObserver((entryList) => {
const entries = entryList.getEntries();
const lastEntry = entries[entries.length - 1];
console.log('LCP element:', lastEntry.element);
console.log('LCP time:', lastEntry.startTime);
}).observe({ type: 'largest-contentful-paint', buffered: true });
Optimize LCP Images
// Priority loading for hero image
<img
src="/hero.webp"
alt="Hero"
fetchpriority="high"
loading="eager"
decoding="async"
/>
// Next.js Image with priority
import Image from 'next/image';
<Image
src="/hero.webp"
alt="Hero"
priority
sizes="100vw"
quality={85}
/>
Preload Critical Resources
<!-- Preload LCP image -->
<link rel="preload" as="image" href="/hero.webp" fetchpriority="high" />
<!-- Preload critical font -->
<link rel="preload" as="font" href="/fonts/inter.woff2" type="font/woff2" crossorigin />
<!-- Preconnect to critical origins -->
<link rel="preconnect" href="https://api.example.com" />
<link rel="dns-prefetch" href="https://analytics.example.com" />Server-Side Rendering
// Next.js - ensure SSR for LCP content
export default async function Page() {
const data = await fetchCriticalData();
return <HeroSection data={data} />; // Rendered on server
}
// BAD: LCP content loaded client-side
const [data, setData] = useState(null);
useEffect(() => { fetchData().then(setData); }, []);
Incorrect — Lazy-loading LCP image delays paint:
<img src="/hero.webp" alt="Hero" loading="lazy" />Correct — Priority loading for LCP image:
<img
src="/hero.webp"
alt="Hero"
fetchpriority="high"
loading="eager"
decoding="async"
/>
Key Rules
- Never lazy-load the LCP image
- Always use `fetchpriority="high"` on LCP images
- Always server-render LCP content
- Preload critical resources in `<head>`
- Preconnect to third-party origins used for LCP
- Target <= 2.0s for 2026 thresholds
Serve AVIF and WebP formats for 30-50% smaller files than JPEG at equivalent quality — HIGH
Modern Image Formats
Choose the right image format and quality settings for optimal compression.
Format Decision Matrix
| Format | Best For | Browser Support | Quality Setting |
|---|---|---|---|
| AVIF | Photos, gradients | 93%+ (2026) | 60-75 |
| WebP | Universal fallback | 97%+ | 75-82 |
| JPEG | Legacy fallback | 100% | 80-85 |
| PNG | Transparency, icons | 100% | N/A |
| SVG | Icons, logos | 100% | N/A |
Picture Element with Fallback
<picture>
<source srcset="/photo.avif" type="image/avif" />
<source srcset="/photo.webp" type="image/webp" />
<img src="/photo.jpg" alt="Photo" width="800" height="600" loading="lazy" />
</picture>
Build-Time Conversion
// vite.config.ts with vite-plugin-image-optimizer
import { imageOptimizer } from 'vite-plugin-image-optimizer';
export default defineConfig({
plugins: [
imageOptimizer({
avif: { quality: 72, effort: 4 },
webp: { quality: 78 },
jpeg: { quality: 82, progressive: true },
}),
],
});
Quality Guidelines
AVIF 60-75 — Best compression, slight encoding time cost
WebP 75-82 — Good balance, fastest encoding
JPEG 80-85 — Legacy only, use progressive encoding
Rule of thumb: lower quality for large hero images (more compression gain),
higher quality for small thumbnails (already small files).
Incorrect — Single JPEG format misses 30-50% compression savings:
<img src="/photo.jpg" alt="Photo" width="800" height="600" />
Correct — Modern formats with fallback:
<picture>
<source srcset="/photo.avif" type="image/avif" />
<source srcset="/photo.webp" type="image/webp" />
<img src="/photo.jpg" alt="Photo" width="800" height="600" />
</picture>
Key rules:
- Prefer AVIF as primary format with WebP fallback
- Use quality 72-78 for AVIF and WebP (visually lossless for most photos)
- Always include a JPEG/PNG fallback in `<picture>`
- Use progressive JPEG for any remaining JPEG images
- Automate format conversion in the build pipeline, not manually
Use Next.js Image component for automatic lazy loading, responsive sizing, and format negotiation — HIGH
Next.js Image Component
Use the Next.js Image component for automatic optimization, format negotiation, and responsive sizing.
Priority Hero Image
import Image from 'next/image';
export default function Hero() {
return (
<Image
src="/hero.webp"
alt="Product hero"
width={1200}
height={630}
priority // Disables lazy loading, adds preload hint
sizes="100vw"
quality={85}
/>
);
}
Blur Placeholder
// Static imports generate blurDataURL automatically
import heroImg from '@/public/hero.jpg';
<Image
src={heroImg}
alt="Hero"
placeholder="blur" // Uses auto-generated blurDataURL
priority
/>
// For remote images, provide blurDataURL manually
<Image
src="https://cdn.example.com/photo.jpg"
alt="Photo"
width={800}
height={600}
placeholder="blur"
blurDataURL="data:image/jpeg;base64,/9j/4AAQ..."
/>
Custom Loader for CDN
// next.config.js
module.exports = {
images: {
loader: 'custom',
loaderFile: './lib/image-loader.ts',
},
};
// lib/image-loader.ts
export default function cloudflareLoader({
src, width, quality,
}: { src: string; width: number; quality?: number }) {
const params = [`width=${width}`, `quality=${quality || 80}`, 'format=auto'];
return `https://cdn.example.com/cdn-cgi/image/${params.join(',')}/${src}`;
}
Responsive Fill Layout
<div style={{ position: 'relative', width: '100%', aspectRatio: '16/9' }}>
<Image
src="/banner.jpg"
alt="Banner"
fill
sizes="(max-width: 768px) 100vw, (max-width: 1200px) 50vw, 33vw"
style={{ objectFit: 'cover' }}
/>
</div>
Incorrect — Missing sizes causes incorrect srcset selection:
<Image src="/banner.jpg" alt="Banner" fill />Correct — Sizes hint ensures optimal image size:
<Image
src="/banner.jpg"
alt="Banner"
fill
sizes="(max-width: 768px) 100vw, 50vw"
/>
Key rules:
- Set `priority` on the LCP image (only one per page)
- Always provide `sizes` for responsive images
- Use `placeholder="blur"` for visible images to prevent CLS
- Use a custom loader for external CDN image transformation
- Use `fill` with `sizes` for responsive containers instead of fixed dimensions
Serve appropriately sized responsive images per viewport to avoid oversized mobile downloads — HIGH
Responsive Images
Serve the right image size for every viewport and device pixel ratio.
Srcset with Sizes
<img
src="/photo-800.jpg"
srcset="
/photo-400.jpg 400w,
/photo-800.jpg 800w,
/photo-1200.jpg 1200w,
/photo-1600.jpg 1600w
"
sizes="(max-width: 640px) 100vw,
(max-width: 1024px) 50vw,
33vw"
alt="Product photo"
loading="lazy"
width="800"
height="600"
/>
Art Direction with Picture
<!-- Different crops for different viewports -->
<picture>
<source
media="(max-width: 640px)"
srcset="/hero-mobile.avif 640w, /hero-mobile-2x.avif 1280w"
sizes="100vw"
type="image/avif"
/>
<source
media="(min-width: 641px)"
srcset="/hero-desktop.avif 1200w, /hero-desktop-2x.avif 2400w"
sizes="66vw"
type="image/avif"
/>
<img src="/hero-desktop.jpg" alt="Hero" width="1200" height="630" />
</picture>
CDN Image Transformation URLs
// Cloudflare Image Resizing
function cfImage(src: string, width: number, quality = 80) {
return `https://cdn.example.com/cdn-cgi/image/w=${width},q=${quality},f=auto/${src}`;
}
// Imgix
function imgixUrl(src: string, width: number, quality = 80) {
return `${src}?w=${width}&q=${quality}&auto=format,compress`;
}
// Usage in React
<img
src={cfImage('/photos/product.jpg', 800)}
srcset={`
${cfImage('/photos/product.jpg', 400)} 400w,
${cfImage('/photos/product.jpg', 800)} 800w,
${cfImage('/photos/product.jpg', 1200)} 1200w
`}
sizes="(max-width: 768px) 100vw, 50vw"
alt="Product"
loading="lazy"
/>
Incorrect — srcset without sizes lets browser guess:
<img
srcset="/photo-400.jpg 400w, /photo-800.jpg 800w"
src="/photo-800.jpg"
alt="Photo"
/>
Correct — sizes guides browser to optimal choice:
<img
srcset="/photo-400.jpg 400w, /photo-800.jpg 800w"
sizes="(max-width: 640px) 100vw, 50vw"
src="/photo-800.jpg"
alt="Photo"
width="800"
height="600"
/>
Key rules:
- Always provide `sizes` alongside `srcset` for width descriptors
- Use 3-4 srcset breakpoints (400, 800, 1200, 1600) for most images
- Use `<picture>` with `media` for art direction (different crops)
- Delegate resizing to a CDN rather than shipping multiple static files
- Set explicit `width` and `height` to prevent CLS
Quantize models to reduce size 2-4x with minimal quality loss for fewer GPUs — MEDIUM
Model Quantization
Reduce model memory footprint and increase throughput with quantization.
Method Decision Matrix
| Method | Precision | Speed | Quality | Best For |
|---|---|---|---|---|
| FP16 | 16-bit | Baseline | Best | When VRAM allows |
| FP8 | 8-bit | 1.5x | Near-FP16 | Hopper/Ada GPUs (H100, L40S) |
| AWQ | 4-bit | 1.8x | Good | Production serving, speed priority |
| GPTQ | 4-bit | 1.6x | Better | Quality-sensitive tasks |
| GGUF | 2-8 bit | Varies | Varies | CPU/hybrid inference (llama.cpp) |
vLLM with AWQ
# Serve a pre-quantized AWQ model
docker run --gpus '"device=0"' \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model TheBloke/Llama-3.1-8B-Instruct-AWQ \
--quantization awq \
  --max-model-len 8192
vLLM with FP8 (Hopper GPUs)
# FP8 on H100 — native hardware support, no pre-quantized model needed
docker run --gpus all \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-70B-Instruct \
--quantization fp8 \
  --tensor-parallel-size 4
VRAM Requirements (Approximate)
Model FP16 FP8 AWQ/GPTQ (4-bit)
7-8B 16 GB 9 GB 5 GB
13B 26 GB 14 GB 8 GB
70B 140 GB 75 GB 40 GB
Formula: VRAM ≈ params × bytes_per_param × 1.2 (KV cache overhead)
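Worked through for an 8B model (the table's figures appear to be weights-only; the ×1.2 factor adds KV-cache headroom on top):
Example: Llama-3.1-8B at FP16
  weights: 8e9 params × 2 bytes = 16 GB (matches the table row)
  with KV-cache headroom: 16 GB × 1.2 ≈ 19.2 GB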
Quality Validation
# Always benchmark quantized vs full precision on YOUR task
def eval_quantized(client, test_cases):
results = []
for case in test_cases:
response = client.chat.completions.create(
model="quantized-model",
messages=case["messages"],
max_tokens=case["max_tokens"],
)
results.append(score(response, case["expected"]))
return sum(results) / len(results)
# Accept quantization if quality >= 95% of FP16 baseline
Incorrect — FP16 on smaller GPUs wastes VRAM:
docker run --gpus all \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 8
# Requires 140 GB VRAM
Correct — FP8 quantization reduces VRAM by ~45%:
docker run --gpus all \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-70B-Instruct \
--quantization fp8 \
--tensor-parallel-size 4
# Requires 75 GB VRAM
Key rules:
- Use FP8 on Hopper/Ada GPUs (best speed/quality tradeoff)
- Use AWQ for maximum throughput on older GPUs
- Use GPTQ when quality matters more than speed
- Always validate quantized model quality on your specific task
- Pre-quantized models (e.g., TheBloke) save quantization time
Apply speculative decoding to generate draft tokens in parallel and reduce inference latency — MEDIUM
Speculative Decoding
Use speculative decoding to reduce per-token latency without sacrificing output quality.
How It Works
Traditional: token1 → token2 → token3 → token4 (4 forward passes)
Speculative: draft: token1, token2, token3 (fast, cheap)
verify: accept/reject all 3 (1 forward pass)
Result: 3 tokens in ~1.3 forward passes
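The ~1.3 figure follows from the standard speculative-sampling expectation (Leviathan et al., 2023), sketched here assuming an independent per-token draft acceptance probability α:
E[tokens per verification pass] = (1 - α^(γ+1)) / (1 - α)
  γ = draft tokens per pass, α = per-token acceptance probability

Example: α = 0.8, γ = 3
  (1 - 0.8^4) / 0.2 ≈ 2.95 tokens/pass, so 4 tokens ≈ 1.35 passes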
N-Gram Speculation (No Draft Model)
# vLLM n-gram speculation — uses prompt tokens as draft source
# Best for repetitive/structured output (JSON, code, templates)
docker run --gpus '"device=0"' \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--speculative-model [ngram] \
--num-speculative-tokens 5 \
  --ngram-prompt-lookup-max 4
Draft Model Speculation
# Use a smaller model as the draft (must share tokenizer)
docker run --gpus '"device=0"' \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-70B-Instruct \
--speculative-model meta-llama/Llama-3.1-8B-Instruct \
--num-speculative-tokens 5 \
  --tensor-parallel-size 4
Acceptance Rate Tuning
--num-speculative-tokens:
3 → Conservative, high acceptance rate (~85%)
5 → Balanced (default recommendation)
8 → Aggressive, lower acceptance rate (~60%)
Monitor via vLLM metrics:
vllm:spec_decode_acceptance_rate → target > 70%
If acceptance < 60%:
1. Reduce --num-speculative-tokens
2. Try n-gram for structured output
3. Verify draft model matches target model's style
When to Use Each Approach
N-gram speculation:
+ Structured output (JSON, SQL, code)
+ Repetitive patterns
+ No extra GPU memory needed
- Creative / diverse text
Draft model speculation:
+ General text generation
+ Large target models (70B+)
+ Higher acceptance rates on diverse tasks
- Requires extra GPU memory for draft model
Incorrect — No speculation means sequential token generation:
docker run --gpus '"device=0"' \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct
# 4 tokens = 4 forward passes
Correct — N-gram speculation reduces passes by 30-60%:
docker run --gpus '"device=0"' \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--speculative-model [ngram] \
--num-speculative-tokens 5
# 4 tokens ≈ 1.3 forward passes
Key rules:
- Use n-gram speculation for structured/repetitive output (free, no extra VRAM)
- Use draft model speculation for general text with large target models
- Start with `--num-speculative-tokens 5` and tune based on acceptance rate
- Monitor acceptance rate; reduce tokens if below 60%
- Output quality is identical to non-speculative decoding (mathematically guaranteed)
Deploy vLLM with PagedAttention and continuous batching for 2-4x higher inference throughput — MEDIUM
vLLM Deployment
Deploy LLMs with vLLM for high-throughput, low-latency inference.
Docker Deployment
# Single GPU
docker run --gpus '"device=0"' \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 8192 \
--gpu-memory-utilization 0.90
# Multi-GPU with tensor parallelism
docker run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 8192 \
  --gpu-memory-utilization 0.92
Python Client
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Explain PagedAttention briefly."}],
max_tokens=256,
temperature=0.7,
)
print(response.choices[0].message.content)
Key Architecture Concepts
PagedAttention:
- KV cache stored in non-contiguous pages (like OS virtual memory)
- Eliminates memory waste from pre-allocated contiguous blocks
- Enables 2-4x more concurrent sequences
Continuous Batching:
- New requests join running batch immediately
- No waiting for longest sequence to finish
- Throughput: 10-30 requests/second on single A100 (8B model)
Tensor Parallelism:
- Splits model across GPUs (--tensor-parallel-size N)
- Rule: N = number of GPUs, must evenly divide model layers
- Use for models > single GPU VRAM
Incorrect — Untuned memory settings limit concurrent requests:
docker run --gpus '"device=0"' \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct
# Defaults reserve KV cache for the model's full context window
Correct — Higher utilization enables more concurrent requests:
docker run --gpus '"device=0"' \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--gpu-memory-utilization 0.90 \
--max-model-len 8192
# 2-4x more concurrent requests
Key rules:
- Set `--gpu-memory-utilization 0.90` (leave headroom for KV cache)
- Use `--tensor-parallel-size` equal to the number of GPUs
- Use the OpenAI-compatible API for drop-in compatibility
- Monitor the `vllm:num_requests_running` Prometheus metric for load
- Set `--max-model-len` to the actual max you need (lower = more concurrent requests)
Defer component loading with React.lazy to reduce initial bundle size and improve TTI — HIGH
Lazy Component Loading
Use React.lazy with Suspense to load components on demand and reduce initial bundle size.
Basic Pattern
import { lazy, Suspense } from 'react';
const Dashboard = lazy(() => import('./Dashboard'));
const Settings = lazy(() => import('./Settings'));
function App() {
return (
<Suspense fallback={<DashboardSkeleton />}>
<Dashboard />
</Suspense>
);
}
Error Boundary for Failed Imports
import { Component, lazy, Suspense } from 'react';
class LazyErrorBoundary extends Component<
{ fallback: React.ReactNode; children: React.ReactNode },
{ hasError: boolean }
> {
state = { hasError: false };
static getDerivedStateFromError() {
return { hasError: true };
}
retry = () => this.setState({ hasError: false });
render() {
if (this.state.hasError) {
      return <>{this.props.fallback}<button onClick={this.retry}>Retry</button></>;
}
return this.props.children;
}
}
// Usage
<LazyErrorBoundary fallback={<p>Failed to load</p>}>
<Suspense fallback={<Skeleton />}>
<LazyComponent />
</Suspense>
</LazyErrorBoundary>
Skeleton Fallback
function DashboardSkeleton() {
return (
<div className="animate-pulse space-y-4">
<div className="h-8 bg-gray-200 rounded w-1/3" />
<div className="h-64 bg-gray-200 rounded" />
</div>
);
}
Incorrect — Missing Suspense fallback causes error:
const Dashboard = lazy(() => import('./Dashboard'));
function App() {
return <Dashboard />; // Error: no Suspense boundary
}
Correct — Suspense with skeleton fallback:
const Dashboard = lazy(() => import('./Dashboard'));
function App() {
return (
<Suspense fallback={<DashboardSkeleton />}>
<Dashboard />
</Suspense>
);
}
Key rules:
- Wrap every `lazy()` component in a `Suspense` boundary
- Add an error boundary around Suspense for network failures
- Use skeleton fallbacks that match the loaded component's layout
- Never lazy-load above-the-fold or LCP-critical components
- Group related lazy components under a single Suspense boundary
Prefetch resources before user needs them to make navigation feel instant — HIGH
Prefetch Strategies
Proactively load resources before the user navigates to reduce perceived latency.
Module Preload Hints
<!-- Preload critical JS modules -->
<link rel="modulepreload" href="/assets/dashboard-abc123.js" />
<link rel="modulepreload" href="/assets/shared-chunk-def456.js" />
<!-- Prefetch likely next pages (low priority) -->
<link rel="prefetch" href="/assets/settings-ghi789.js" />Prefetch on Hover
function NavLink({ to, children }: { to: string; children: React.ReactNode }) {
const prefetchRef = useRef(false);
const handlePointerEnter = () => {
if (prefetchRef.current) return;
prefetchRef.current = true;
import(`./pages/${to}.tsx`); // Triggers prefetch
};
return (
<a href={`/${to}`} onPointerEnter={handlePointerEnter}>
{children}
</a>
);
}
Prefetch on Viewport Intersection
function usePrefetchOnVisible(importFn: () => Promise<unknown>) {
const ref = useRef<HTMLElement>(null);
const loaded = useRef(false);
useEffect(() => {
const el = ref.current;
if (!el) return;
const observer = new IntersectionObserver(([entry]) => {
if (entry.isIntersecting && !loaded.current) {
loaded.current = true;
importFn();
observer.disconnect();
}
}, { rootMargin: '200px' });
observer.observe(el);
return () => observer.disconnect();
}, [importFn]);
return ref;
}
// Usage
const ref = usePrefetchOnVisible(() => import('./HeavySection'));
<div ref={ref}><Suspense fallback={null}><HeavySection /></Suspense></div>
Import on Interaction
// Load a heavy module only when the user clicks
async function handleExport() {
const { exportToPDF } = await import('./exportUtils');
await exportToPDF(data);
}
<button onClick={handleExport}>Export PDF</button>
Incorrect — No prefetching causes delayed navigation:
<a href="/dashboard">Dashboard</a>Correct — Hover prefetch gives 200-400ms head start:
function NavLink({ to, children }) {
const prefetchRef = useRef(false);
const handlePointerEnter = () => {
if (prefetchRef.current) return;
prefetchRef.current = true;
import(`./pages/${to}.tsx`);
};
return (
<a href={`/${to}`} onPointerEnter={handlePointerEnter}>
{children}
</a>
);
}
Key rules:
- Use `modulepreload` for critical JS the current page needs
- Use `prefetch` for resources the user will likely need next
- Prefetch on hover for navigation links (200-400ms head start)
- Prefetch on intersection for below-the-fold heavy sections
- Import on interaction for rarely-used heavy features
Split code at route boundaries so users only download code for the visited page — HIGH
Route-Based Code Splitting
Split your bundle at route boundaries so each page loads only its own code.
React Router 7.x Lazy Routes
import { createBrowserRouter } from 'react-router';
const router = createBrowserRouter([
{
path: '/',
lazy: () => import('./pages/Home'),
},
{
path: '/dashboard',
lazy: () => import('./pages/Dashboard'),
},
{
path: '/settings',
lazy: () => import('./pages/Settings'),
},
]);
Named Exports for Lazy Routes
// pages/Dashboard.tsx — export Component and loader
export async function loader() {
return fetchDashboardData();
}
export function Component() {
const data = useLoaderData();
return <DashboardView data={data} />;
}
Component.displayName = 'Dashboard';
Chunk Naming
// Webpack — webpackChunkName magic comment
const Dashboard = lazy(
() => import(/* webpackChunkName: "dashboard" */ './pages/Dashboard')
);
// Vite/Rollup — use rollupOptions for manual chunks
// vite.config.ts
export default defineConfig({
build: {
rollupOptions: {
output: {
manualChunks: {
vendor: ['react', 'react-dom'],
charts: ['recharts', 'd3'],
},
},
},
},
});
Incorrect — Eager imports bundle all routes together:
import Home from './pages/Home';
import Dashboard from './pages/Dashboard';
import Settings from './pages/Settings';
const router = createBrowserRouter([
{ path: '/', element: <Home /> },
{ path: '/dashboard', element: <Dashboard /> },
{ path: '/settings', element: <Settings /> },
]);
Correct — Lazy routes split per-page bundles:
const router = createBrowserRouter([
{ path: '/', lazy: () => import('./pages/Home') },
{ path: '/dashboard', lazy: () => import('./pages/Dashboard') },
{ path: '/settings', lazy: () => import('./pages/Settings') },
]);
Key rules:
- Split at route boundaries as the minimum splitting strategy
- Use React Router `lazy` for automatic route-level splitting
- Export `Component` and `loader` as named exports for lazy routes
- Name chunks for readable build output and caching
- Group vendor libraries into shared chunks to avoid duplication
Profile Python backends with py-spy to find CPU hotspots and memory leaks in production — MEDIUM
Python Backend Profiling
Profile Python services to find CPU bottlenecks and memory leaks.
py-spy for Production Sampling
# Attach to running process (no restart needed)
py-spy top --pid 12345
# Generate flamegraph SVG
py-spy record -o profile.svg --pid 12345 --duration 30
# Profile a script directly
py-spy record -o profile.svg -- python manage.py runserver
# Sample at higher rate for short-lived operations
py-spy record --rate 250 -o profile.svg -- python batch_job.py
cProfile for Development
import cProfile
import pstats
# Profile a function
with cProfile.Profile() as pr:
result = expensive_function()
stats = pstats.Stats(pr)
stats.sort_stats('cumulative')
stats.print_stats(20) # Top 20 functions
# One-liner from command line
# python -m cProfile -s cumulative app.py
memory_profiler for Memory Leaks
from memory_profiler import profile
@profile
def process_data():
data = load_large_dataset() # +500 MiB
filtered = filter_items(data) # +200 MiB
del data # -500 MiB
return summarize(filtered)
# Command line: python -m memory_profiler script.py
FastAPI Middleware Profiling
import time
from fastapi import Request
@app.middleware("http")
async def profile_requests(request: Request, call_next):
start = time.perf_counter()
response = await call_next(request)
duration = time.perf_counter() - start
if duration > 0.5: # Log slow requests
print(f"SLOW: {request.method} {request.url.path} took {duration:.2f}s")
response.headers["X-Response-Time"] = f"{duration:.3f}"
    return response
Incorrect — cProfile in production requires code changes:
# Must instrument code manually
with cProfile.Profile() as pr:
    result = expensive_function()
Correct — py-spy attaches to running process with zero overhead:
# No code changes, no restart needed
py-spy record -o profile.svg --pid 12345 --duration 30
Key rules:
- Use py-spy in production (zero overhead when not profiling, no code changes)
- Use cProfile in development for detailed call graphs
- Use memory_profiler to track per-line memory allocation
- Profile under realistic load, not just unit test conditions
- Focus on the top 3-5 hotspots by cumulative time
Analyze bundles to reveal bloated dependencies and missed tree-shaking that inflate load times — MEDIUM
Bundle Analysis
Analyze and optimize JavaScript bundle size with visualization tools and CI budgets.
Webpack Bundle Analyzer
// webpack.config.js
const { BundleAnalyzerPlugin } = require('webpack-bundle-analyzer');
module.exports = {
plugins: [
new BundleAnalyzerPlugin({
analyzerMode: 'static', // Generates HTML report
openAnalyzer: false,
reportFilename: 'bundle-report.html',
}),
],
};
Vite / Rollup Visualizer
// vite.config.ts
import { visualizer } from 'rollup-plugin-visualizer';
export default defineConfig({
plugins: [
visualizer({
filename: 'bundle-report.html',
gzipSize: true,
brotliSize: true,
}),
],
});
Performance Budgets in CI
// bundlesize.config.json
{
"files": [
{ "path": "dist/assets/index-*.js", "maxSize": "150 kB", "compression": "gzip" },
{ "path": "dist/assets/vendor-*.js", "maxSize": "80 kB", "compression": "gzip" },
{ "path": "dist/assets/*.css", "maxSize": "30 kB", "compression": "gzip" }
]
}
# .github/workflows/bundle-check.yml
- name: Check bundle size
run: npx bundlesize
env:
    BUNDLESIZE_GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
Import Cost Awareness
// BAD: Imports entire library (70 kB)
import _ from 'lodash';
const sorted = _.sortBy(items, 'name');
// GOOD: Import single function (4 kB)
import sortBy from 'lodash/sortBy';
const sorted = sortBy(items, 'name');
// BEST: Use native (0 kB)
const sorted = items.toSorted((a, b) => a.name.localeCompare(b.name));
Incorrect — Importing entire lodash adds 70 kB:
import _ from 'lodash';
const sorted = _.sortBy(items, 'name');
Correct — Import single function or use native API:
// Option 1: Import only what you need (4 kB)
import sortBy from 'lodash/sortBy';
const sorted = sortBy(items, 'name');
// Option 2: Use native API (0 kB)
const sorted = items.toSorted((a, b) => a.name.localeCompare(b.name));
Key rules:
- Run bundle analysis on every release to catch regressions
- Set CI performance budgets (fail build if exceeded)
- Import only what you use from large libraries
- Check gzip/brotli sizes, not raw sizes
- Replace large dependencies with native APIs when possible
Profile React components with DevTools to identify unnecessary re-renders and their causes — MEDIUM
React DevTools Profiler
Use the React DevTools Profiler to identify and fix unnecessary re-renders.
Flamegraph Workflow
1. Open React DevTools → Profiler tab
2. Click "Record" → Interact with the UI → Click "Stop"
3. Read the flamegraph:
- Yellow/red bars = slow renders (> 16ms)
- Gray bars = did not render
- Click a bar → see "Why did this render?"
4. Focus on components that render often AND take long
Programmatic Profiler
import { Profiler } from 'react';
function onRenderCallback(
id: string,
phase: 'mount' | 'update',
actualDuration: number,
) {
if (actualDuration > 16) {
console.warn(`Slow render: ${id} (${phase}) took ${actualDuration.toFixed(1)}ms`);
}
}
<Profiler id="Dashboard" onRender={onRenderCallback}>
<Dashboard />
</Profiler>
Why Did You Render Setup
// wdyr.ts — import BEFORE React in development
import React from 'react';
if (process.env.NODE_ENV === 'development') {
const { default: whyDidYouRender } = await import(
'@welldone-software/why-did-you-render'
);
whyDidYouRender(React, {
trackAllPureComponents: true,
logOnDifferentValues: true,
});
}
// Mark specific components for tracking
MyComponent.whyDidYouRender = true;
Interpreting Render Reasons
"Props changed" → Check which prop, was it a new object/array?
"State changed" → Expected, verify state is colocated
"Parent rendered" → Parent re-renders, child doesn't memo
"Context changed" → Split context or use selectors
"Hooks changed" → useMemo/useCallback dependency changedIncorrect — Blind memoization without profiling:
const MemoizedComponent = memo(Component);
const memoizedValue = useMemo(() => value, []);
const callback = useCallback(() => {}, []);
// Added optimization without measurement
Correct — Profile first, then optimize actual bottlenecks:
// 1. Open React DevTools Profiler
// 2. Record interaction
// 3. Identify slow renders (yellow/red bars > 16ms)
// 4. Check "Why did this render?"
// 5. Apply targeted fix only where needed
Key rules:
- Profile first before adding any memoization
- Focus on components that are both frequent AND slow (> 16ms)
- Use "Why did this render?" to find the root cause
- Use Why Did You Render in development for automatic detection
- Ignore gray (not rendered) components in the flamegraph
Apply TanStack Query optimistic updates for instant UI feedback with automatic rollback — HIGH
TanStack Query Optimistic Updates
Show immediate UI feedback before server confirmation with proper rollback on error.
Incorrect — mutation without optimistic update:
// WRONG: User waits for server roundtrip
const mutation = useMutation({
mutationFn: updateTodo,
onSuccess: () => {
queryClient.invalidateQueries({ queryKey: ['todos'] }); // Refetches after delay
},
});
// UI feels sluggish — user sees spinner for 200-500ms
Correct — optimistic update with rollback:
import { useMutation, useQueryClient } from '@tanstack/react-query';
function useUpdateTodo() {
const queryClient = useQueryClient();
return useMutation({
mutationFn: updateTodo,
onMutate: async (newTodo) => {
// 1. Cancel outgoing refetches (prevent race condition)
await queryClient.cancelQueries({ queryKey: ['todos', newTodo.id] });
// 2. Snapshot previous value for rollback
const previousTodo = queryClient.getQueryData(['todos', newTodo.id]);
// 3. Optimistically update cache
queryClient.setQueryData(['todos', newTodo.id], newTodo);
// 4. Return context for rollback
return { previousTodo };
},
onError: (_err, newTodo, context) => {
// Rollback to previous value on error
queryClient.setQueryData(['todos', newTodo.id], context?.previousTodo);
},
onSettled: (_data, _error, variables) => {
// Always reconcile with server after mutation
queryClient.invalidateQueries({ queryKey: ['todos', variables.id] });
},
});
}
// Optimistic list update (add to list)
function useAddTodo() {
const queryClient = useQueryClient();
return useMutation({
mutationFn: createTodo,
onMutate: async (newTodo) => {
await queryClient.cancelQueries({ queryKey: ['todos'] });
const previousTodos = queryClient.getQueryData<Todo[]>(['todos']);
// Immutable update (NEVER mutate cache directly)
      queryClient.setQueryData<Todo[]>(['todos'], (old) =>
        old ? [...old, { ...newTodo, id: 'temp-id' }] : [{ ...newTodo, id: 'temp-id' }]
      );
return { previousTodos };
},
onError: (_err, _newTodo, context) => {
queryClient.setQueryData(['todos'], context?.previousTodos);
},
onSettled: () => {
queryClient.invalidateQueries({ queryKey: ['todos'] });
},
});
}
Track pending mutations:
import { useMutationState } from '@tanstack/react-query';
function PendingTodos() {
const pendingMutations = useMutationState({
filters: { mutationKey: ['addTodo'], status: 'pending' },
select: (mutation) => mutation.state.variables as Todo,
});
return (
<>
{pendingMutations.map((todo) => (
<TodoItem key={todo.id} todo={todo} isPending />
))}
</>
);
}
Key rules:
- Always cancel outgoing queries in `onMutate` to prevent race conditions
- Always return context from `onMutate` for rollback capability
- Use immutable updates: `[...old, newItem]` not `old.push(newItem)`
- Always `invalidateQueries` in `onSettled` to reconcile with server
- Use `useMutationState` to show pending items in the UI
- Selective invalidation: `queryKey: ['todos', id]` not a bare `queryClient.invalidateQueries()` (which invalidates everything)
Prefetch TanStack queries on hover or in route loaders for instant page transitions — HIGH
TanStack Query Prefetching
Prefetch data before navigation for instant page transitions using TanStack Query v5.
Incorrect — fetching data only when component mounts:
// WRONG: User clicks link, waits for data to load
function UserProfile({ userId }: { userId: string }) {
const { data, isPending } = useQuery({
queryKey: ['user', userId],
queryFn: () => fetchUser(userId),
});
if (isPending) return <Skeleton />; // User sees skeleton every time
return <div>{data.name}</div>;
}
Correct — prefetch on hover and in route loaders:
// 1. Define reusable query options (v5 pattern)
const userQueryOptions = (id: string) => queryOptions({
queryKey: ['user', id] as const,
queryFn: () => fetchUser(id),
staleTime: 5 * 60 * 1000, // Fresh for 5 min
});
// 2. Prefetch on hover
function UserLink({ userId }: { userId: string }) {
const queryClient = useQueryClient();
const prefetchUser = () => {
queryClient.prefetchQuery(userQueryOptions(userId));
};
return (
<Link
to={`/users/${userId}`}
onMouseEnter={prefetchUser}
onFocus={prefetchUser}
>
View User
</Link>
);
}
// 3. Prefetch in route loader (React Router 7.x)
export const loader = (queryClient: QueryClient) =>
async ({ params }: { params: { id: string } }) => {
await queryClient.ensureQueryData(userQueryOptions(params.id));
return null;
};
// 4. Use with Suspense for instant render
function UserProfile({ userId }: { userId: string }) {
// Data already loaded by prefetch — no loading state!
const { data } = useSuspenseQuery(userQueryOptions(userId));
return <div>{data.name}</div>;
}
QueryClient configuration:
const queryClient = new QueryClient({
defaultOptions: {
queries: {
staleTime: 1000 * 60, // 1 min fresh
gcTime: 1000 * 60 * 5, // 5 min in cache (formerly cacheTime)
refetchOnWindowFocus: true, // Refetch on tab focus
retry: 3,
retryDelay: (attemptIndex) => Math.min(1000 * 2 ** attemptIndex, 30000),
},
},
});
Key rules:
- Use the `queryOptions()` helper for reusable query definitions across prefetch/useQuery/loader
- Prefetch on `onMouseEnter` AND `onFocus` for keyboard users
- Use `ensureQueryData` in loaders (waits for data), `prefetchQuery` for fire-and-forget
- Use `useSuspenseQuery` for components where data is guaranteed by a loader
- `gcTime` (v5) replaces `cacheTime` (v4) — controls how long unused data stays in memory
- `isPending` (v5) replaces `isLoading` for initial load state
Let React Compiler auto-memoize components, values, callbacks, and JSX automatically — HIGH
React Compiler
React 19's compiler is the primary approach to render optimization in 2026.
Decision Tree
Is React Compiler enabled?
├─ YES → Let compiler handle memoization automatically
│ Only use useMemo/useCallback as escape hatches
│ DevTools shows "Memo ✨" badge
│
└─ NO → Profile first, then optimize
1. React DevTools Profiler
2. Identify actual bottlenecks
  3. Apply targeted optimizations
What the Compiler Memoizes
- Component re-renders
- Intermediate values (like useMemo)
- Callback references (like useCallback)
- JSX elements
Enabling the Compiler
// next.config.js (Next.js 16+)
const nextConfig = {
reactCompiler: true,
}
// Expo SDK 54+ enables by default
Verification
Open React DevTools and look for the "Memo ✨" badge on components. If present, the compiler is successfully memoizing that component.
Incorrect — Manual memoization when compiler is enabled:
// next.config.js has reactCompiler: true
const value = useMemo(() => compute(data), [data]);
const callback = useCallback(() => handle(), []);
// Compiler already handles this automatically
Correct — Let compiler auto-memoize:
// Compiler handles memoization automatically
function Component({ data }) {
const value = compute(data); // Auto-memoized
const handle = () => {}; // Auto-memoized
return <div onClick={handle}>{value}</div>;
}
// Check DevTools for "Memo ✨" badge
Key Rules
- Enable React Compiler as the first step
- Let the compiler handle memoization automatically
- Verify with DevTools "Memo ✨" badge
- Only use manual memoization as escape hatches
- Profile before adding any manual optimization
Use manual useMemo and useCallback escape hatches when React Compiler cannot optimize — HIGH
Manual Memoization Escape Hatches
Use useMemo/useCallback as escape hatches when React Compiler is insufficient.
When Manual Memoization Is Needed
// 1. Effect dependencies that shouldn't trigger re-runs
const stableConfig = useMemo(() => ({
apiUrl: process.env.API_URL
}), [])
useEffect(() => {
initializeSDK(stableConfig)
}, [stableConfig])
// 2. Third-party libraries without compiler support
const memoizedValue = useMemo(() =>
expensiveThirdPartyComputation(data), [data])
// 3. Precise control over memoization boundaries
const handleClick = useCallback(() => {
// Critical callback that must be stable
}, [dependency])
State Colocation
Move state as close to where it's used as possible:
// BAD: State too high - causes unnecessary re-renders
function App() {
const [filter, setFilter] = useState('')
  return (
    <>
      <Header /> {/* Re-renders on filter change! */}
      <FilterInput value={filter} onChange={setFilter} />
      <List filter={filter} />
    </>
  )
}
// GOOD: State colocated - minimal re-renders
function App() {
  return (
    <>
      <Header />
      <FilterableList /> {/* State inside */}
    </>
  )
}
Profiling Workflow
- React DevTools Profiler: Record, interact, analyze
- Identify: Components with high render counts or duration
- Verify: Is the re-render actually causing perf issues?
- Fix: Apply targeted optimization
- Measure: Confirm improvement
Incorrect — State too high causes unnecessary re-renders:
function App() {
const [filter, setFilter] = useState('');
return (
<>
<Header /> {/* Re-renders on filter change! */}
<FilterInput value={filter} onChange={setFilter} />
<List filter={filter} />
</>
);
}
Correct — State colocated minimizes re-renders:
function App() {
return (
<>
<Header />
<FilterableList /> {/* State inside, Header unaffected */}
</>
);
}
Key Rules
- Profile first — never optimize without measurement
- Colocate state as close to usage as possible
- Use `useMemo` for effect dependencies that must be stable
- Use `useCallback` for callbacks passed to memoized children
- Split context into state and dispatch providers (see the sketch below)
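A minimal sketch of the state/dispatch split from the last rule (hypothetical counter reducer): `dispatch` has a stable identity, so components that consume only `DispatchContext` never re-render when state changes.
import { createContext, useReducer, type Dispatch, type ReactNode } from 'react';
type State = { count: number };
type Action = { type: 'increment' };
const StateContext = createContext<State | null>(null);
const DispatchContext = createContext<Dispatch<Action> | null>(null);
function reducer(state: State, action: Action): State {
  return action.type === 'increment' ? { count: state.count + 1 } : state;
}
function AppProvider({ children }: { children: ReactNode }) {
  const [state, dispatch] = useReducer(reducer, { count: 0 });
  return (
    <DispatchContext.Provider value={dispatch}>
      <StateContext.Provider value={state}>{children}</StateContext.Provider>
    </DispatchContext.Provider>
  );
}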
Virtualize long lists to render only visible items for smooth scrolling performance — HIGH
List Virtualization
Use TanStack Virtual for efficient rendering of large lists.
Virtualization Thresholds
| Item Count | Recommendation |
|---|---|
| < 100 | Regular rendering usually fine |
| 100-500 | Consider virtualization |
| 500+ | Virtualization required |
Basic Setup
import { useVirtualizer } from '@tanstack/react-virtual'
function VirtualList({ items }) {
const parentRef = useRef(null)
const virtualizer = useVirtualizer({
count: items.length,
getScrollElement: () => parentRef.current,
estimateSize: () => 50,
overscan: 5,
})
return (
<div ref={parentRef} style={{ height: '400px', overflow: 'auto' }}>
<div style={{ height: `${virtualizer.getTotalSize()}px`, position: 'relative' }}>
{virtualizer.getVirtualItems().map((virtualRow) => (
<div
key={virtualRow.key}
style={{
position: 'absolute',
top: 0,
left: 0,
width: '100%',
height: `${virtualRow.size}px`,
transform: `translateY(${virtualRow.start}px)`,
}}
>
{items[virtualRow.index].name}
</div>
))}
</div>
</div>
)
}
Dynamic Height
const virtualizer = useVirtualizer({
count: items.length,
getScrollElement: () => parentRef.current,
estimateSize: () => 50,
overscan: 5,
measureElement: (element) => element.getBoundingClientRect().height,
})
Incorrect — Rendering 1000 items causes scroll jank:
function List({ items }) {
return (
<div style={{ height: '400px', overflow: 'auto' }}>
{items.map(item => (
<div key={item.id}>{item.name}</div>
))}
</div>
);
}
Correct — Virtualization renders only visible items:
import { useVirtualizer } from '@tanstack/react-virtual';
function VirtualList({ items }) {
const parentRef = useRef(null);
const virtualizer = useVirtualizer({
count: items.length,
getScrollElement: () => parentRef.current,
estimateSize: () => 50,
overscan: 5,
});
return (
<div ref={parentRef} style={{ height: '400px', overflow: 'auto' }}>
<div style={{ height: `${virtualizer.getTotalSize()}px`, position: 'relative' }}>
{virtualizer.getVirtualItems().map(virtualRow => (
<div
key={virtualRow.key}
style={{
position: 'absolute',
transform: `translateY(${virtualRow.start}px)`,
}}
>
{items[virtualRow.index].name}
</div>
))}
</div>
</div>
);
}
Key Rules
- Virtualize lists with 100+ items
- Set overscan: 5 for smooth scrolling
- Use an estimateSize close to the actual average
- Use measureElement for variable-height items
- Position items with transform: translateY() (avoids layout recalculation)
Caching Strategies
Caching Strategies
Multi-level caching patterns for performance optimization.
Cache Hierarchy
L1: In-Memory (LRU, memoization) - fastest, per-process
L2: Distributed (Redis/Memcached) - shared across instances
L3: CDN (edge, static assets) - global, closest to user
L4: Database (materialized views) - fallback, queryable
Cache-Aside Pattern (Read-Through)
Most common caching pattern:
async function getAnalysis(id: string): Promise<Analysis> {
const cacheKey = `analysis:${id}`;
// Try cache first (L2)
const cached = await redis.get(cacheKey);
if (cached) {
return JSON.parse(cached);
}
// Cache miss - fetch from database (L4)
const analysis = await db.query('SELECT * FROM analyses WHERE id = $1', [id]);
// Store in cache for future requests
await redis.setex(cacheKey, 3600, JSON.stringify(analysis)); // 1 hour TTL
return analysis;
}
Write-Through Pattern
Update cache when writing to database:
async function updateAnalysis(id: string, updates: Partial<Analysis>) {
// Update database
const updated = await db.query(
'UPDATE analyses SET ... WHERE id = $1 RETURNING *',
[id]
);
// Update cache immediately
const cacheKey = `analysis:${id}`;
await redis.setex(cacheKey, 3600, JSON.stringify(updated));
return updated;
}
Cache Invalidation Strategies
1. Time-Based (TTL)
// Short TTL for frequently changing data
await redis.setex('trending:articles', 300, data); // 5 min
// Long TTL for static data
await redis.setex('user:profile:123', 86400, data); // 24 hours
2. Event-Based
// Invalidate when data changes
async function deleteAnalysis(id: string) {
await db.query('DELETE FROM analyses WHERE id = $1', [id]);
// Invalidate all related cache keys
await redis.del(`analysis:${id}`);
await redis.del(`analysis:${id}:chunks`);
await redis.del('analysis:list:recent'); // List cache
}
3. Tag-Based
// Tag related cache entries
await redis.set('analysis:123', data);
await redis.sadd('tag:user:456', 'analysis:123');
// Invalidate all entries with tag
async function invalidateUserData(userId: string) {
const keys = await redis.smembers(`tag:user:${userId}`);
if (keys.length > 0) {
await redis.del(...keys);
await redis.del(`tag:user:${userId}`);
}
}
Redis Patterns
1. String Cache (Most Common)
// Get/set
await redis.set('key', 'value');
const value = await redis.get('key');
// With TTL
await redis.setex('key', 3600, 'value');
// Atomic increment
await redis.incr('page:views:123');
2. Hash Cache (Objects)
// Store object fields separately
await redis.hset('user:123', 'name', 'Alice');
await redis.hset('user:123', 'email', 'alice@example.com');
// Get specific field
const name = await redis.hget('user:123', 'name');
// Get all fields
const user = await redis.hgetall('user:123');
3. List Cache (Queues, Recent Items)
// Recent analyses (FIFO)
await redis.lpush('analyses:recent', analysisId);
await redis.ltrim('analyses:recent', 0, 99); // Keep only 100 most recent
// Get recent
const recent = await redis.lrange('analyses:recent', 0, 9); // First 10
4. Set Cache (Unique Items, Tags)
// Track unique visitors
await redis.sadd('article:123:visitors', userId);
// Check membership
const hasVisited = await redis.sismember('article:123:visitors', userId);
// Count unique
const uniqueCount = await redis.scard('article:123:visitors');
In-Memory Cache (L1)
For per-process caching:
import { LRUCache } from 'lru-cache';
const cache = new LRUCache<string, Analysis>({
max: 500, // Maximum items
ttl: 1000 * 60 * 5, // 5 minutes
updateAgeOnGet: true, // Refresh on access
});
async function getAnalysis(id: string): Promise<Analysis> {
// Check L1 first
if (cache.has(id)) {
return cache.get(id)!;
}
// Fetch from L2 or database
const analysis = await fetchAnalysis(id);
cache.set(id, analysis);
return analysis;
}
HTTP Caching (Browser/CDN)
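The handler below calls a generateETag helper that the snippet leaves undefined; a minimal sketch using Node's built-in crypto module (assuming a hash of the serialized payload is acceptable for your data):
import { createHash } from 'crypto';
function generateETag(payload: unknown): string {
const hash = createHash('sha1').update(JSON.stringify(payload)).digest('hex');
return `"${hash}"`; // ETags are quoted strings per RFC 9110
}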
// Express.js example
app.get('/api/analyses/:id', async (req, res) => {
const analysis = await getAnalysis(req.params.id);
// Cache in browser and CDN for 1 hour
res.set('Cache-Control', 'public, max-age=3600');
// ETag for conditional requests
const etag = generateETag(analysis);
res.set('ETag', etag);
// Return 304 if unchanged
if (req.headers['if-none-match'] === etag) {
return res.status(304).end();
}
res.json(analysis);
});
Cache Warming
Preload cache before traffic arrives:
async function warmCache() {
// Load hot data
const recentAnalyses = await db.query(
'SELECT * FROM analyses ORDER BY created_at DESC LIMIT 100'
);
// Populate cache
for (const analysis of recentAnalyses) {
await redis.setex(
`analysis:${analysis.id}`,
3600,
JSON.stringify(analysis)
);
}
console.log(`Warmed cache with ${recentAnalyses.length} analyses`);
}
// Run on server startup
await warmCache();
Cache Stampede Prevention
Prevent multiple requests from hitting database simultaneously:
const locks = new Map<string, Promise<Analysis>>();
async function getAnalysis(id: string): Promise<Analysis> {
const cacheKey = `analysis:${id}`;
// Check cache
const cached = await redis.get(cacheKey);
if (cached) return JSON.parse(cached);
// Check if fetch is already in progress
if (locks.has(cacheKey)) {
return locks.get(cacheKey)!;
}
// Start fetch
const fetchPromise = (async () => {
try {
const analysis = await db.query('SELECT * FROM analyses WHERE id = $1', [id]);
await redis.setex(cacheKey, 3600, JSON.stringify(analysis));
return analysis;
} finally {
locks.delete(cacheKey); // Clean up even if the fetch throws
}
})();
locks.set(cacheKey, fetchPromise);
return fetchPromise;
}
Best Practices
- Cache frequently accessed, slow-to-compute data
- Use appropriate TTL - shorter for dynamic data
- Monitor cache hit rate - aim for > 80%
- Handle cache failures gracefully - always fall back to the database (see the sketch below)
- Invalidate proactively when data changes
- Monitor memory usage - set max memory and eviction policy
- Use compression for large cached values
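As an illustration of the graceful-failure rule, a sketch of the cache-aside reader from above with the Redis calls made best-effort (same assumed redis and db clients); a cache outage should degrade latency, never availability:
async function getAnalysisSafe(id: string): Promise<Analysis> {
const cacheKey = `analysis:${id}`;
try {
const cached = await redis.get(cacheKey);
if (cached) return JSON.parse(cached);
} catch (err) {
console.warn('Cache read failed, falling back to DB', err);
}
const analysis = await db.query('SELECT * FROM analyses WHERE id = $1', [id]);
try {
await redis.setex(cacheKey, 3600, JSON.stringify(analysis));
} catch (err) {
console.warn('Cache write failed (non-fatal)', err); // serve the data anyway
}
return analysis;
}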
References
- Redis Best Practices
- HTTP Caching
- See scripts/caching-patterns.ts for a complete implementation
CDN Setup
Image CDN Configuration
Complete guide to configuring image CDNs and optimization pipelines.
┌─────────────────────────────────────────────────────────────────────────────┐
│ Image Delivery Pipeline │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Source CDN / Optimizer Browser │
│ ┌────────┐ ┌─────────────┐ ┌──────────┐ │
│ │ Origin │──────────►│ Resize │──AVIF────►│ Chrome │ │
│ │ Server │ │ Format │ │ Safari │ │
│ │ /CMS │ │ Quality │──WebP────►│ Firefox │ │
│ └────────┘ │ Cache │ │ Edge │ │
│ └─────────────┘ └──────────┘ │
│ │ │
│ ┌──────▼──────┐ │
│ │ Edge Cache │ │
│ │ (Global) │ │
│ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Next.js Remote Patterns
Basic Configuration
// next.config.js
module.exports = {
images: {
// Enable modern formats
formats: ['image/avif', 'image/webp'],
// Allowed remote sources (required for external images)
remotePatterns: [
{
protocol: 'https',
hostname: 'cdn.example.com',
pathname: '/images/**',
},
{
protocol: 'https',
hostname: '*.cloudinary.com',
},
{
protocol: 'https',
hostname: 'images.unsplash.com',
},
{
protocol: 'https',
hostname: 's3.amazonaws.com',
pathname: '/my-bucket/**',
},
],
// Responsive breakpoints
deviceSizes: [640, 750, 828, 1080, 1200, 1920, 2048, 3840],
imageSizes: [16, 32, 48, 64, 96, 128, 256, 384],
// Cache TTL (seconds) - default 60, increase for CDN
minimumCacheTTL: 60 * 60 * 24 * 30, // 30 days
// Disable optimization in development (faster builds)
unoptimized: process.env.NODE_ENV === 'development',
},
};
Environment-Based Configuration
// next.config.js
const isProd = process.env.NODE_ENV === 'production';
module.exports = {
images: {
formats: ['image/avif', 'image/webp'],
// Different patterns per environment
remotePatterns: [
// Production CDN
...(isProd
? [
{
protocol: 'https',
hostname: 'cdn.example.com',
},
]
: []),
// Development/staging
...(!isProd
? [
{
protocol: 'https',
hostname: 'staging-cdn.example.com',
},
{
protocol: 'http',
hostname: 'localhost',
port: '3001',
},
]
: []),
],
},
};
Cloudinary Integration
Loader Implementation
// lib/loaders/cloudinary.ts
import type { ImageLoader } from 'next/image';
const CLOUD_NAME = process.env.NEXT_PUBLIC_CLOUDINARY_CLOUD_NAME;
export const cloudinaryLoader: ImageLoader = ({ src, width, quality }) => {
// Build transformation string
const transforms = [
`w_${width}`,
`q_${quality || 'auto:good'}`,
'f_auto', // Auto format (AVIF > WebP > JPEG)
'c_limit', // Don't upscale
'dpr_auto', // Auto DPR
].join(',');
// Handle both full URLs and paths
const imagePath = src.startsWith('http')
? src.replace(/^https?:\/\/[^/]+/, '')
: src;
return `https://res.cloudinary.com/${CLOUD_NAME}/image/upload/${transforms}/${imagePath}`;
};
// Advanced loader with more options
export const cloudinaryAdvancedLoader: ImageLoader = ({ src, width, quality }) => {
// Transformation string with additional optimizations
const transforms = [
`w_${width}`,
`q_${quality || 'auto:good'}`,
'f_auto', // Best format for browser
'c_limit', // Don't upscale
'fl_progressive', // Progressive loading
'fl_immutable_cache', // Long cache
].join(',');
return `https://res.cloudinary.com/${CLOUD_NAME}/image/upload/${transforms}/${src}`;
};
export default cloudinaryLoader;
Usage
import Image from 'next/image';
import { cloudinaryLoader } from '@/lib/loaders/cloudinary';
// Component usage
<Image
loader={cloudinaryLoader}
src="products/shoe-red.jpg" // Path in Cloudinary
alt="Red running shoe"
width={400}
height={400}
sizes="(max-width: 768px) 100vw, 400px"
/>
// Global loader configuration
// next.config.js
module.exports = {
images: {
loader: 'custom',
loaderFile: './lib/loaders/cloudinary.ts',
},
};
Imgix Integration
// lib/loaders/imgix.ts
import type { ImageLoader } from 'next/image';
const IMGIX_DOMAIN = process.env.NEXT_PUBLIC_IMGIX_DOMAIN;
export const imgixLoader: ImageLoader = ({ src, width, quality }) => {
const url = new URL(`https://${IMGIX_DOMAIN}${src}`);
// Auto format negotiation
url.searchParams.set('auto', 'format,compress');
// Width
url.searchParams.set('w', width.toString());
// Quality
url.searchParams.set('q', (quality || 75).toString());
// Fit mode (contain, cover, fill, etc.)
url.searchParams.set('fit', 'max');
return url.toString();
};
// With advanced features
export const imgixAdvancedLoader: ImageLoader = ({ src, width, quality }) => {
const url = new URL(`https://${IMGIX_DOMAIN}${src}`);
url.searchParams.set('auto', 'format,compress');
url.searchParams.set('w', width.toString());
url.searchParams.set('q', (quality || 75).toString());
url.searchParams.set('fit', 'max');
// Face detection for portraits
// url.searchParams.set('fit', 'facearea');
// url.searchParams.set('facepad', '2');
// Blur for placeholders
// url.searchParams.set('blur', '200');
// url.searchParams.set('px', '16');
return url.toString();
};
Cloudflare Images
// lib/loaders/cloudflare.ts
import type { ImageLoader } from 'next/image';
// Using Cloudflare Image Resizing
export const cloudflareResizingLoader: ImageLoader = ({ src, width, quality }) => {
// src should be the full URL of the original image
const params = [
`width=${width}`,
`quality=${quality || 85}`,
'format=auto', // Auto AVIF/WebP
'fit=scale-down', // Don't upscale
].join(',');
return `https://yourdomain.com/cdn-cgi/image/${params}/${src}`;
};
// Using Cloudflare Images (upload API)
const ACCOUNT_HASH = process.env.NEXT_PUBLIC_CLOUDFLARE_ACCOUNT_HASH;
export const cloudflareImagesLoader: ImageLoader = ({ src, width }) => {
// src is the image ID from Cloudflare
// Variants are predefined in Cloudflare dashboard
const variant = width <= 640 ? 'small' : width <= 1024 ? 'medium' : 'large';
return `https://imagedelivery.net/${ACCOUNT_HASH}/${src}/${variant}`;
};
AWS S3 + CloudFront
// lib/loaders/aws.ts
import type { ImageLoader } from 'next/image';
const CLOUDFRONT_DOMAIN = process.env.NEXT_PUBLIC_CLOUDFRONT_DOMAIN;
// Basic CloudFront loader (requires Lambda@Edge for resizing)
export const cloudfrontLoader: ImageLoader = ({ src, width, quality }) => {
// Lambda@Edge parses these query params
const params = new URLSearchParams({
w: width.toString(),
q: (quality || 80).toString(),
f: 'auto',
});
return `https://${CLOUDFRONT_DOMAIN}${src}?${params}`;
};
// For static S3 images (no resizing)
export const s3Loader: ImageLoader = ({ src }) => {
return `https://${CLOUDFRONT_DOMAIN}${src}`;
};
Vercel Image Optimization
// Automatically enabled on Vercel
// Configure in next.config.js
module.exports = {
images: {
// Use Vercel's built-in optimizer
loader: 'default',
// External domains need explicit allowlist
remotePatterns: [
{
protocol: 'https',
hostname: 'cdn.example.com',
},
],
// Increase cache for static images
minimumCacheTTL: 60 * 60 * 24 * 365, // 1 year
},
};
// For non-Vercel deployments, use external loader
module.exports = {
images: {
loader: 'custom',
loaderFile: './lib/loaders/cloudinary.ts',
},
};
Self-Hosted with Sharp
// For self-hosted Next.js (Docker, Node.js)
// 1. Install Sharp
// npm install sharp
// 2. Configure next.config.js
module.exports = {
images: {
loader: 'default', // Uses Sharp internally
formats: ['image/avif', 'image/webp'],
minimumCacheTTL: 60 * 60 * 24 * 30, // 30 days
// Important for self-hosted
dangerouslyAllowSVG: false,
contentDispositionType: 'attachment',
},
};
// 3. Dockerfile - ensure Sharp can build
FROM node:20-alpine AS builder
RUN apk add --no-cache libc6-compat
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
FROM node:20-alpine AS runner
RUN apk add --no-cache libc6-compat
WORKDIR /app
COPY --from=builder /app/.next/standalone ./
COPY --from=builder /app/.next/static ./.next/static
COPY --from=builder /app/public ./public
EXPOSE 3000
CMD ["node", "server.js"]
CDN Headers & Caching
Nginx Configuration
# /etc/nginx/conf.d/images.conf
# Image caching
location ~* \.(jpg|jpeg|png|webp|avif|gif|ico|svg)$ {
# Long cache for immutable assets
expires 1y;
add_header Cache-Control "public, immutable";
# Vary by Accept header for format negotiation
add_header Vary "Accept";
# Security headers
add_header X-Content-Type-Options "nosniff";
}
# Next.js optimized images
location /_next/image {
proxy_pass http://nextjs_upstream;
proxy_cache_valid 200 365d;
# Cache key includes Accept header for format
proxy_cache_key "$scheme$request_method$host$request_uri$http_accept";
add_header X-Cache-Status $upstream_cache_status;
}
Cloudflare Page Rules
{
"targets": [
{
"target": "url",
"constraint": {
"operator": "matches",
"value": "*.example.com/*.(jpg|jpeg|png|webp|avif|gif)"
}
}
],
"actions": [
{
"id": "cache_level",
"value": "cache_everything"
},
{
"id": "edge_cache_ttl",
"value": 2592000
},
{
"id": "browser_cache_ttl",
"value": 31536000
},
{
"id": "polish",
"value": "lossless"
}
]
}
Blur Placeholder Generation
Build-Time with Plaiceholder
// lib/blur.ts
import { getPlaiceholder } from 'plaiceholder';
import fs from 'fs/promises';
import path from 'path';
export async function getBlurDataURL(imagePath: string): Promise<string> {
try {
const file = await fs.readFile(path.join(process.cwd(), 'public', imagePath));
const { base64 } = await getPlaiceholder(file);
return base64;
} catch {
// Return a tiny transparent placeholder on error
return 'data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7';
}
}
// Usage in getStaticProps
export async function getStaticProps() {
const blurDataURL = await getBlurDataURL('/images/hero.jpg');
return {
props: { blurDataURL },
};
}
Remote Image Blur
// lib/remote-blur.ts
import { getPlaiceholder } from 'plaiceholder';
export async function getRemoteBlurDataURL(imageUrl: string): Promise<string> {
try {
const response = await fetch(imageUrl);
const buffer = Buffer.from(await response.arrayBuffer());
const { base64 } = await getPlaiceholder(buffer);
return base64;
} catch {
return 'data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7';
}
}
// Cache blur data URLs
const blurCache = new Map<string, string>();
export async function getCachedBlurDataURL(imageUrl: string): Promise<string> {
if (blurCache.has(imageUrl)) {
return blurCache.get(imageUrl)!;
}
const blur = await getRemoteBlurDataURL(imageUrl);
blurCache.set(imageUrl, blur);
return blur;
}
Image Validation & Error Handling
// lib/image-validation.ts
export function isValidImageUrl(url: string): boolean {
try {
const parsed = new URL(url);
const allowedHosts = ['cdn.example.com', 'images.unsplash.com'];
return allowedHosts.some(
(host) => parsed.hostname === host || parsed.hostname.endsWith(`.${host}`)
);
} catch {
return false;
}
}
export function getOptimizedImageUrl(
src: string,
options: { width: number; quality?: number }
): string {
// Use your CDN loader
const { width, quality = 80 } = options;
if (src.includes('cloudinary.com')) {
return src.replace('/upload/', `/upload/w_${width},q_${quality},f_auto/`);
}
// Default: return as-is
return src;
}
Core Web Vitals
Core Web Vitals Optimization
Google's Core Web Vitals are the key metrics for measuring user experience.
The Three Metrics
| Metric | Target | Measures | Impact |
|---|---|---|---|
| LCP (Largest Contentful Paint) | < 2.5s | Loading performance | First impression |
| INP (Interaction to Next Paint) | < 200ms | Responsiveness | User frustration |
| CLS (Cumulative Layout Shift) | < 0.1 | Visual stability | Accidental clicks |
LCP (Largest Contentful Paint)
What It Measures
Time until the largest visible element (hero image, heading, video) renders.
Common Causes
- Large, unoptimized images
- Slow server response time (TTFB > 600ms)
- Render-blocking JavaScript/CSS
- Client-side rendering
Fixes
1. Optimize Images
<!-- Preload LCP image -->
<link rel="preload" as="image" href="/hero.jpg" />
<!-- Use modern formats -->
<picture>
<source srcset="/hero.avif" type="image/avif" />
<source srcset="/hero.webp" type="image/webp" />
<img src="/hero.jpg" alt="Hero" width="1200" height="600" />
</picture>
<!-- Or use next/image -->
<Image src="/hero.jpg" priority quality={85} />2. Reduce Server Response Time
- Use CDN for static assets
- Enable HTTP/2 or HTTP/3
- Optimize database queries
- Implement caching (Redis, CDN)
3. Eliminate Render-Blocking Resources
<!-- Defer non-critical CSS -->
<link rel="preload" as="style" href="/styles.css" onload="this.onload=null;this.rel='stylesheet'" />
<!-- Defer JavaScript -->
<script src="/app.js" defer></script>
<!-- Inline critical CSS -->
<style>
/* Critical above-the-fold styles */
.hero { ... }
</style>
4. Use Server-Side Rendering (SSR)
// Next.js SSR
export async function getServerSideProps() {
const data = await fetchData();
return { props: { data } };
}
// React Server Components
async function Page() {
const data = await fetchData(); // Runs on server
return <div>{data}</div>;
}
INP (Interaction to Next Paint)
What It Measures
Time from user interaction (click, tap, key press) to visual feedback.
Common Causes
- Heavy JavaScript execution blocking main thread
- Long-running event handlers
- Expensive DOM updates
- Third-party scripts
Fixes
1. Debounce/Throttle Expensive Operations
import { debounce } from 'lodash';
// Without debounce: runs on EVERY keystroke
function handleSearch(query: string) {
const results = expensiveSearch(query); // Blocks for 100ms
setResults(results);
}
// With debounce: runs 300ms after user stops typing
const handleSearch = debounce((query: string) => {
const results = expensiveSearch(query);
setResults(results);
}, 300);
2. Use Web Workers for Heavy Computation
// worker.ts
self.onmessage = (e) => {
const result = expensiveComputation(e.data);
self.postMessage(result);
};
// main.ts
const worker = new Worker('/worker.js');
worker.postMessage(data);
worker.onmessage = (e) => {
setResult(e.data);
};
3. Split Long Tasks
// Before: Blocks main thread for 500ms
function processItems(items) {
items.forEach(item => {
processItem(item); // 5ms each × 100 items = 500ms
});
}
// After: Yields to browser between batches
async function processItems(items) {
for (let i = 0; i < items.length; i += 10) {
const batch = items.slice(i, i + 10);
batch.forEach(processItem);
// Yield to browser
await new Promise(resolve => setTimeout(resolve, 0));
}
}
// Or use Scheduler API (modern)
async function processItems(items) {
for (let i = 0; i < items.length; i += 10) {
const batch = items.slice(i, i + 10);
batch.forEach(processItem);
await scheduler.yield(); // Yield to higher priority tasks
}
}
4. Optimize React Rendering
// Memoize expensive components
const Chart = memo(({ data }) => <ExpensiveChart data={data} />);
// Use startTransition for non-urgent updates
import { useTransition } from 'react';
function Search() {
const [query, setQuery] = useState('');
const [results, setResults] = useState([]);
const [isPending, startTransition] = useTransition();
function handleChange(e) {
setQuery(e.target.value); // Urgent: update input immediately
startTransition(() => {
// Non-urgent: can be interrupted
const filtered = filterResults(e.target.value);
setResults(filtered);
});
}
return <input value={query} onChange={handleChange} />;
}
CLS (Cumulative Layout Shift)
What It Measures
Visual stability - how much elements unexpectedly shift during load.
Common Causes
- Images without dimensions
- Ads/embeds injected after layout
- Web fonts causing FOIT/FOUT
- Dynamically injected content
Fixes
1. Always Set Image Dimensions
<!-- ❌ BAD: No dimensions, causes layout shift -->
<img src="/photo.jpg" alt="Photo" />
<!-- ✅ GOOD: Reserves space -->
<img src="/photo.jpg" alt="Photo" width="800" height="600" />
<!-- Or with aspect ratio (CSS) -->
<img src="/photo.jpg" alt="Photo" style="aspect-ratio: 4/3; width: 100%;" />2. Reserve Space for Ads/Embeds
.ad-container {
min-height: 250px; /* Reserve space before ad loads */
background: #f0f0f0;
}
3. Optimize Web Font Loading
/* Prevent FOIT (flash of invisible text) */
@font-face {
font-family: 'CustomFont';
src: url('/font.woff2') format('woff2');
font-display: swap; /* Show fallback immediately, swap when ready */
}
<!-- Preload critical fonts -->
<link rel="preload" as="font" href="/font.woff2" type="font/woff2" crossorigin />4. Avoid Inserting Content Above Existing Content
// ❌ BAD: Inserts notification at top, shifts everything down
function addNotification(message) {
container.insertAdjacentHTML('afterbegin', `<div>${message}</div>`);
}
// ✅ GOOD: Append to bottom or use fixed positioning
function addNotification(message) {
const notification = document.createElement('div');
notification.className = 'notification-fixed'; // position: fixed
notification.textContent = message;
document.body.appendChild(notification);
}
Measuring Core Web Vitals
In Development
// Use web-vitals library
import { onCLS, onINP, onLCP } from 'web-vitals';
onLCP(console.log); // Log LCP
onINP(console.log); // Log INP
onCLS(console.log); // Log CLS
In Production (RUM - Real User Monitoring)
import { onCLS, onINP, onLCP } from 'web-vitals';
function sendToAnalytics(metric) {
fetch('/api/analytics', {
method: 'POST',
body: JSON.stringify(metric),
});
}
onLCP(sendToAnalytics);
onINP(sendToAnalytics);
onCLS(sendToAnalytics);
Lighthouse (Lab Testing)
# Run Lighthouse audit
lighthouse https://your-site.com --output=html
# Or use Chrome DevTools
# Open DevTools → Lighthouse tab → Generate report
Targets by Percentile
Google measures at the 75th percentile of all page loads:
| Grade | LCP | INP | CLS |
|---|---|---|---|
| Good (Green) | < 2.5s | < 200ms | < 0.1 |
| Needs Improvement (Orange) | 2.5-4s | 200-500ms | 0.1-0.25 |
| Poor (Red) | > 4s | > 500ms | > 0.25 |
Goal: 75% of page loads should be "Good" for all three metrics.
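Because scoring happens at the 75th percentile, aggregate your RUM data the same way rather than averaging it. A minimal sketch (assumes you have collected raw metric values, e.g. via the /api/analytics endpoint shown above):
// Sketch: the 75th percentile of raw samples (ms for LCP/INP, unitless for CLS)
function p75(samples: number[]): number {
const sorted = [...samples].sort((a, b) => a - b);
return sorted[Math.max(0, Math.ceil(sorted.length * 0.75) - 1)];
}
// Example: p75([1800, 2100, 2600, 1900]) === 2100 → "Good" LCP (< 2500ms)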
Quick Wins Checklist
- Add width and height to all images
- Preload the LCP image
- Use font-display: swap for web fonts
- Defer non-critical JavaScript
- Enable HTTP/2 and compression
- Use CDN for static assets
- Implement lazy loading for below-fold images
- Memoize expensive React components
- Debounce search inputs and expensive handlers
Database Optimization
Database Query Optimization
Strategies for optimizing database performance and eliminating slow queries.
Key Patterns
- Add Missing Indexes - Turn Seq Scan into Index Scan
- Fix N+1 Queries - Use JOINs or include instead of loops
- Cursor Pagination - Never load all records
- Connection Pooling - Manage connection lifecycle
Quick Diagnostics
-- Find slow queries (PostgreSQL)
SELECT query, calls, mean_time / 1000 as mean_seconds
FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;
-- Verify index usage
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123;
-- Check for sequential scans
SELECT schemaname, tablename, seq_scan, seq_tup_read
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 10;
N+1 Query Detection
Symptoms:
- One query to get parent records, then N queries for related data
- Rapid sequential database calls in logs
- Linear growth in query count with data size
Example Problem:
# ❌ BAD: N+1 query (1 + 8 queries)
analyses = (await session.execute(select(Analysis).limit(8))).scalars().all()
for analysis in analyses:
# Each iteration hits DB again!
chunks = (await session.execute(
select(Chunk).where(Chunk.analysis_id == analysis.id)
)).scalars().all()
Solution:
# ✅ GOOD: Single query with JOIN (1 query)
from sqlalchemy.orm import selectinload
analyses = (await session.execute(
select(Analysis)
.options(selectinload(Analysis.chunks)) # Eager load
.limit(8)
)).scalars().all()
# Now analyses[0].chunks is already loaded (no extra query)
Index Selection Strategies
| Index Type | Use Case | Example |
|---|---|---|
| B-tree | Equality, range queries | WHERE created_at > '2025-01-01' |
| GIN | Full-text search, JSONB | WHERE content_tsvector @@ to_tsquery('python') |
| HNSW | Vector similarity | ORDER BY embedding <=> '[0.1, 0.2, ...]' |
| Hash | Exact equality only | WHERE id = 'abc123' (rare) |
Index Creation Examples:
-- B-tree index for range queries
CREATE INDEX idx_analyses_created_at ON analyses(created_at);
-- GIN index for full-text search
CREATE INDEX idx_chunks_tsvector ON chunks USING GIN(content_tsvector);
-- HNSW index for vector similarity
CREATE INDEX idx_chunks_embedding ON chunks
USING hnsw (embedding vector_cosine_ops);
-- Partial index for active records only
CREATE INDEX idx_active_users ON users(email)
WHERE deleted_at IS NULL;
-- Composite index for common query pattern
CREATE INDEX idx_analyses_user_status ON analyses(user_id, status);
Connection Pooling
Problem: Creating new connections is expensive (50-100ms overhead)
Solution: Use connection pools
# SQLAlchemy async pool
engine = create_async_engine(
DATABASE_URL,
pool_size=20, # Base connections
max_overflow=10, # Additional if needed
pool_pre_ping=True, # Verify connections are alive
pool_recycle=3600 # Recycle after 1 hour
)
Pagination: Cursor vs Offset
Offset-Based (❌ Slow for large datasets)
SELECT * FROM analyses ORDER BY created_at DESC
LIMIT 20 OFFSET 1000; -- Must scan 1020 rows!
Cursor-Based (✅ Fast, scales to millions)
SELECT * FROM analyses
WHERE created_at < '2025-01-15 10:00:00' -- Last cursor
ORDER BY created_at DESC
LIMIT 20; -- Only scans 20 rows
Best Practices
- Always use EXPLAIN ANALYZE before deploying queries
- Index foreign keys used in JOINs
- Avoid SELECT * - request only needed columns
- Use prepared statements to prevent SQL injection and enable query caching
- Monitor pg_stat_statements weekly
- Set query timeouts to prevent runaway queries (see the sketch below)
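Most drivers expose timeouts directly; a sketch with node-postgres (the option names are pg-specific — check your driver's equivalent):
import { Pool } from 'pg';
// Sketch: enforce timeouts at the pool level so no single query
// can hold a connection indefinitely.
const pool = new Pool({
connectionString: process.env.DATABASE_URL,
max: 20, // pool size, mirroring the SQLAlchemy example above
statement_timeout: 5000, // server aborts statements running > 5s
query_timeout: 7000, // client-side safety net
});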
References
- PostgreSQL Performance Tips
- Use The Index, Luke
- See scripts/database-optimization.ts for implementation patterns
DevTools Profiler Workflow
React DevTools Profiler Workflow
Finding and fixing performance bottlenecks.
Setup
- Install React DevTools browser extension
- Open DevTools (F12)
- Navigate to Profiler tab
- Ensure React is in development mode
Basic Profiling Flow
1. Start Recording
- Click the blue Record button
- Perform the slow interaction
- Click Stop
2. Analyze the Flamegraph
The flamegraph shows component render times:
[App (2ms)]
├── [Header (0.5ms)]
├── [Sidebar (15ms)] ← Slow!
│ ├── [NavItem (1ms)]
│ ├── [NavItem (1ms)]
│ └── [HeavyWidget (12ms)] ← Found it!
└── [Content (1ms)]
3. Key Metrics
| Metric | Meaning |
|---|---|
| Render time | How long component took to render |
| Commit time | Time to apply changes to DOM |
| Interactions | What triggered the render |
Reading the Profiler
Color Coding
- Gray: Did not render
- Blue/Teal: Rendered (fast)
- Yellow: Rendered (medium)
- Red/Orange: Rendered (slow)
"Why did this render?"
Enable in DevTools settings:
- Click gear icon in Profiler
- Check "Record why each component rendered"
Common reasons:
- Props changed
- State changed
- Parent rendered
- Context changed
- Hooks changed
Identifying Problems
Problem 1: Component Renders Too Often
Look for components that render on every interaction:
Render 1: [List (50ms)] - items changed ✓
Render 2: [List (50ms)] - items same, parent rendered ✗
Render 3: [List (50ms)] - items same, parent rendered ✗
Solution: Isolate state, use React.memo as escape hatch
Problem 2: Single Render Too Slow
Look for wide bars in the flamegraph:
[SlowComponent (200ms)]
├── [Child1 (5ms)]
├── [Child2 (190ms)] ← Find the slow child
│ └── [GrandChild (185ms)] ← Root cause
└── [Child3 (5ms)]
Solution: Virtualize, lazy load, or optimize computation
Problem 3: Cascading Re-renders
Many components re-render for one change:
[Parent] → [Child1] → [GrandChild1]
→ [Child2] → [GrandChild2]
→ [Child3] → [GrandChild3]
Solution: Move state down, split context
Profiler Settings
Click the gear icon for options:
- Record why each component rendered: Essential for debugging
- Hide commits below X ms: Filter noise
- Highlight updates: Visual indicator during interaction
Ranked View
Switch from Flamegraph to Ranked view:
1. HeavyWidget 12ms
2. Sidebar 3ms
3. NavItem 1ms
4. Content 1ms
5. Header 0.5ms
This shows components sorted by render time.
Timeline View
Shows renders over time, useful for:
- Finding render cascades
- Identifying what triggered re-renders
- Seeing interaction-to-render timing
Console Integration
// Add profiling in code
import { Profiler } from 'react'
function onRenderCallback(
id, // Component tree id
phase, // "mount" | "update"
actualDuration,
baseDuration,
startTime,
commitTime
) {
console.log(`${id} ${phase}: ${actualDuration.toFixed(2)}ms`)
}
<Profiler id="Navigation" onRender={onRenderCallback}>
<Navigation />
</Profiler>
Quick Checklist
- Record the slow interaction
- Find the slowest component (ranked view)
- Check why it rendered (DevTools setting)
- Verify if render was necessary
- Apply targeted fix
- Re-profile to confirm improvement
Common Fixes by Cause
| Why Rendered | Fix |
|---|---|
| Props changed (but same value) | Check prop references |
| Parent rendered | Isolate state, split component |
| Context changed | Split context |
| Hooks changed | Check effect dependencies |
| State changed | Verify state is necessary |
Edge Deployment
Edge Deployment
Overview
Deploy LLMs on resource-constrained devices: mobile, edge servers, embedded systems.
Key constraints:
- Limited GPU/NPU memory (4-24 GB)
- Power efficiency requirements
- Latency-sensitive applications
- Offline/disconnected operation
Model Selection for Edge
| Device | Memory | Recommended Models |
|---|---|---|
| Mobile (iOS/Android) | 4-8 GB | Llama-3.3-1B, Phi-3-mini |
| Edge Server | 16-24 GB | Llama-3.3-3B, Mistral-7B-4bit |
| Raspberry Pi 5 | 8 GB | Gemma-2B, TinyLlama |
| Jetson Orin | 32-64 GB | Llama-3.1-8B, Mixtral-8x7B-4bit |
Aggressive Quantization
For edge, prioritize memory over quality:
from gptqmodel import GPTQModel, QuantizeConfig
# 2-bit quantization for extreme memory constraints
quant_config = QuantizeConfig(
bits=2,
group_size=32,
damp_percent=0.1,
)
model = GPTQModel.load("meta-llama/Llama-3.3-1B-Instruct", quant_config)
model.quantize(calibration_data)
model.save("Llama-3.3-1B-2bit-edge")Quality vs Memory Trade-off:
| Bits | Memory (1B model) | Quality Retention |
|---|---|---|
| 4 | ~600 MB | ~95% |
| 3 | ~450 MB | ~85% |
| 2 | ~300 MB | ~70% |
llama.cpp for Edge
Optimized C++ inference for CPU/edge:
# Build llama.cpp with optimizations
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# For Apple Silicon
make LLAMA_METAL=1
# For CUDA
make LLAMA_CUDA=1
# For Vulkan (cross-platform GPU)
make LLAMA_VULKAN=1
# Run inference
./main -m models/llama-3.3-1b-q4_k_m.gguf \
-p "Hello, how are you?" \
-n 128 \
-ngl 99 # Offload all layers to GPU
GGUF quantization types:
| Type | Bits | Quality | Speed |
|---|---|---|---|
| Q8_0 | 8 | Best | Good |
| Q5_K_M | 5 | Very Good | Better |
| Q4_K_M | 4 | Good | Best |
| Q3_K_M | 3 | Acceptable | Best |
| Q2_K | 2 | Degraded | Best |
Mobile Deployment
iOS with MLX
# Convert to MLX format for Apple Silicon
import mlx.core as mx
from mlx_lm import load, generate
# Load quantized model
model, tokenizer = load("mlx-community/Llama-3.3-1B-Instruct-4bit")
# Generate on device
prompt = "Explain machine learning briefly:"
response = generate(model, tokenizer, prompt=prompt, max_tokens=100)
Android with MLC-LLM
# Build for Android
mlc_llm compile meta-llama/Llama-3.3-1B-Instruct \
--quantization q4f16_1 \
--target android
# Deploy APK with bundled model
mlc_llm package \
--model-lib ./dist/llama-3.3-1b-q4f16_1-android.tar \
--apk-output ./LlamaApp.apk
Jetson/NVIDIA Edge
Optimized for Jetson Orin and embedded NVIDIA:
# Use TensorRT-LLM for Jetson
from tensorrt_llm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Llama-3.3-3B-Instruct",
max_batch_size=4, # Limit for memory
max_input_len=2048,
max_output_len=512,
)
# Optimized for Jetson memory constraints
outputs = llm.generate(
prompts=["Hello!"],
sampling_params=SamplingParams(max_tokens=100),
)
Memory Optimization Techniques
KV Cache Reduction
# Limit context length for edge
llm = LLM(
model="meta-llama/Llama-3.3-1B-Instruct",
max_model_len=1024, # Reduce from default 4096
gpu_memory_utilization=0.95, # Maximize usage
)
Sliding Window Attention
# Models with built-in sliding window
# Mistral-7B: 4096 sliding window
# Reduces memory O(n^2) -> O(n*window)
llm = LLM(
model="mistralai/Mistral-7B-Instruct-v0.3",
sliding_window=4096, # Use model's native window
)
Flash Attention
# Enable Flash Attention for memory efficiency
llm = LLM(
model="meta-llama/Llama-3.3-1B-Instruct",
use_flash_attention=True, # Default on supported hardware
)
Power Efficiency
Dynamic Frequency Scaling
# Limit GPU frequency for power savings (Jetson)
sudo nvpmodel -m 2 # Medium power mode
sudo jetson_clocks --show
# For inference-heavy workloads
sudo nvpmodel -m 0 # Max performance
Batch Size Optimization
# Smaller batches = lower peak power
llm = LLM(
model="meta-llama/Llama-3.3-1B-Instruct",
max_num_seqs=8, # Limit concurrent requests
)
# Process requests sequentially for power
for prompt in prompts:
output = llm.generate([prompt], sampling_params)
yield output
Offline Deployment
Model Bundling
# Download and cache model for offline use
from huggingface_hub import snapshot_download
# Pre-download model
snapshot_download(
"meta-llama/Llama-3.3-1B-Instruct",
local_dir="./models/llama-3.3-1b",
local_dir_use_symlinks=False,
)
# Use local path
llm = LLM(model="./models/llama-3.3-1b")Air-gapped Environments
# Export model to portable format
python -m llama_cpp.convert \
--model meta-llama/Llama-3.3-1B-Instruct \
--output ./llama-3.3-1b.gguf \
--quantize q4_k_m
# Transfer and run on air-gapped device
./main -m ./llama-3.3-1b.gguf -p "Hello"
Benchmarking Edge Performance
import time
def benchmark_edge(model_path: str, prompts: list[str]):
"""Benchmark for edge deployment."""
from vllm import LLM, SamplingParams
llm = LLM(
model=model_path,
max_model_len=1024,
gpu_memory_utilization=0.95,
)
# Warmup
llm.generate(["Warmup"], SamplingParams(max_tokens=10))
# Benchmark
times = []
for prompt in prompts:
start = time.perf_counter()
output = llm.generate([prompt], SamplingParams(max_tokens=100))
elapsed = time.perf_counter() - start
times.append(elapsed)
avg_latency = sum(times) / len(times)
p99_latency = sorted(times)[int(len(times) * 0.99)]
print(f"Avg latency: {avg_latency*1000:.1f}ms")
print(f"P99 latency: {p99_latency*1000:.1f}ms")Related Skills
- ollama-local - Easy local deployment
- quantization-guide - Quantization methods
Frontend Performance
Frontend Performance Optimization
Techniques for optimizing bundle size, loading speed, and rendering performance.
Bundle Optimization
1. Code Splitting
Split your bundle into smaller chunks that load on-demand:
// Route-based splitting (React 19)
import { lazy, Suspense } from 'react';
import { Routes, Route } from 'react-router'; // 'react-router-dom' in v6
const AdminPanel = lazy(() => import('./AdminPanel'));
const Dashboard = lazy(() => import('./Dashboard'));
function App() {
return (
<Suspense fallback={<Loading />}>
<Routes>
<Route path="/admin" element={<AdminPanel />} />
<Route path="/dashboard" element={<Dashboard />} />
</Routes>
</Suspense>
);
}
2. Tree Shaking
Import only what you need:
// ❌ BAD: Imports entire library
import _ from 'lodash';
_.debounce(fn, 100);
// ✅ GOOD: Import specific function
import debounce from 'lodash/debounce';
debounce(fn, 100);
// ✅ EVEN BETTER: Use native or lightweight alternative
const debounce = (fn, delay) => {
let timeout;
return (...args) => {
clearTimeout(timeout);
timeout = setTimeout(() => fn(...args), delay);
};
};
3. Image Optimization
// Use next/image for automatic optimization
import Image from 'next/image';
<Image
src="/hero.jpg"
width={1200}
height={600}
alt="Hero"
loading="lazy" // Lazy load below fold
quality={85} // Balance quality/size
placeholder="blur" // Show blur while loading
/>
// Or use modern formats manually
<picture>
<source srcset="/hero.avif" type="image/avif" />
<source srcset="/hero.webp" type="image/webp" />
<img src="/hero.jpg" alt="Hero" loading="lazy" />
</picture>
Rendering Optimization
1. Memoization
Prevent unnecessary re-renders:
import { memo, useMemo, useCallback, useState } from 'react';
// Memoize expensive component
const ExpensiveChart = memo(({ data }) => {
return <Chart data={data} />;
});
// Memoize expensive computation
function AnalyticsDashboard({ analyses }) {
const stats = useMemo(() => {
return analyses.reduce((acc, a) => ({
totalCost: acc.totalCost + a.cost,
avgDuration: acc.avgDuration + a.duration
}), { totalCost: 0, avgDuration: 0 });
}, [analyses]); // Only recompute if analyses change
return <div>{stats.totalCost}</div>;
}
// Memoize callback to prevent child re-renders
function Parent() {
const [count, setCount] = useState(0);
const handleClick = useCallback(() => {
setCount(c => c + 1);
}, []); // Function identity stays same
return <Child onClick={handleClick} />;
}
2. Virtualization
Render only visible items in long lists:
import { useRef } from 'react';
import { useVirtualizer } from '@tanstack/react-virtual';
function AnalysisList({ analyses }) {
const parentRef = useRef(null);
const virtualizer = useVirtualizer({
count: analyses.length,
getScrollElement: () => parentRef.current,
estimateSize: () => 100, // Estimated row height
});
return (
<div ref={parentRef} style={{ height: '600px', overflow: 'auto' }}>
<div style={{ height: `${virtualizer.getTotalSize()}px` }}>
{virtualizer.getVirtualItems().map(virtualItem => (
<div
key={virtualItem.index}
style={{
position: 'absolute',
top: 0,
left: 0,
width: '100%',
height: `${virtualItem.size}px`,
transform: `translateY(${virtualItem.start}px)`,
}}
>
<AnalysisCard analysis={analyses[virtualItem.index]} />
</div>
))}
</div>
</div>
);
}
3. Batch DOM Operations
Minimize layout thrashing:
// ❌ BAD: Read-write-read-write causes layout thrashing
elements.forEach(el => {
const height = el.offsetHeight; // Read (triggers layout)
el.style.height = height + 10 + 'px'; // Write
});
// ✅ GOOD: Batch reads, then writes
const heights = elements.map(el => el.offsetHeight); // All reads
elements.forEach((el, i) => {
el.style.height = heights[i] + 10 + 'px'; // All writes
});
Core Web Vitals Optimization
LCP (Largest Contentful Paint) - Target: < 2.5s
Causes:
- Large images not optimized
- Slow server response (TTFB)
- Render-blocking JS/CSS
Fixes:
- Preload LCP image: <link rel="preload" as="image" href="/hero.jpg">
- Use CDN for assets
- Inline critical CSS
- Server-side rendering (SSR)
INP (Interaction to Next Paint) - Target: < 200ms
Causes:
- Heavy JavaScript execution
- Long-running event handlers
- Main thread blocked
Fixes:
- Debounce expensive operations
- Use Web Workers for heavy computation
- Split long tasks with setTimeout() or scheduler.postTask() (see the sketch below)
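scheduler.postTask appears above only by name; a minimal sketch that feature-detects it (Chromium-only at the time of writing) and falls back to setTimeout. The items and processItem names are placeholders for your own work queue:
// Sketch: yield between batches so input handling stays responsive
type Scheduler = { postTask<T>(cb: () => T, opts?: { priority?: string }): Promise<T> };
async function processInBackground<T>(items: T[], processItem: (item: T) => void) {
const scheduler = (globalThis as { scheduler?: Scheduler }).scheduler;
for (let i = 0; i < items.length; i += 10) {
const batch = items.slice(i, i + 10);
if (scheduler) {
// 'background' priority keeps this work from competing with interactions
await scheduler.postTask(() => batch.forEach(processItem), { priority: 'background' });
} else {
await new Promise<void>((resolve) =>
setTimeout(() => { batch.forEach(processItem); resolve(); }, 0)
);
}
}
}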
CLS (Cumulative Layout Shift) - Target: < 0.1
Causes:
- Images without dimensions
- Ads/embeds loading late
- Web fonts causing FOIT/FOUT
Fixes:
- Always set width and height on images
- Reserve space for ads: min-height: 250px
- Use font-display: swap for web fonts
- Preload fonts: <link rel="preload" as="font">
Bundle Analysis
# Lighthouse audit
lighthouse http://localhost:3000 --output=html
# Bundle analysis (Next.js)
ANALYZE=true npm run build
# Bundle analysis (Vite)
npm run build && npx vite-bundle-visualizer
# Check bundle size
du -sh dist/
Best Practices
- Lazy load below-the-fold content
- Use modern image formats (WebP, AVIF)
- Enable compression (Brotli > gzip)
- Minimize third-party scripts (see the sketch below for deferring the rest)
- Use CDN for static assets
- Monitor Core Web Vitals in production with RUM
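For third-party scripts you must keep, Next.js's next/script can keep them off the critical path; outside Next.js, a plain <script defer> or dynamic injection achieves the same. A sketch with a placeholder script URL:
import Script from 'next/script';
export function Analytics() {
return (
<Script
src="https://example.com/analytics.js" // hypothetical third-party script
strategy="lazyOnload" // loads during browser idle time, after hydration
/>
);
}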
References
- Core Web Vitals
- React Profiler
- See scripts/frontend-optimization.tsx for complete examples
Memoization Escape Hatches
Memoization Escape Hatches
When to still use useMemo and useCallback with React Compiler.
Overview
React Compiler handles most memoization automatically. Use manual memoization only as escape hatches for specific cases.
Escape Hatch 1: Effect Dependencies
When a value is used as an effect dependency and you need precise control:
// Problem: Effect runs on every render
function UserDashboard({ userId }) {
const config = {
userId,
includeStats: true,
format: 'detailed',
}
useEffect(() => {
fetchData(config) // Runs every render! config is new object
}, [config])
}
// Solution: Memoize the config
function UserDashboard({ userId }) {
const config = useMemo(() => ({
userId,
includeStats: true,
format: 'detailed',
}), [userId]) // Only changes when userId changes
useEffect(() => {
fetchData(config)
}, [config])
}
Escape Hatch 2: Third-Party Libraries
Libraries without React Compiler support may expect stable references:
// Some charting libraries compare references
function Chart({ data }) {
// Ensure stable reference for library
const chartOptions = useMemo(() => ({
animation: true,
responsive: true,
data: transformData(data),
}), [data])
return <ThirdPartyChart options={chartOptions} />
}
Escape Hatch 3: Expensive Computations
When you know a computation is expensive and want explicit control:
function SearchResults({ items, query }) {
// Explicitly expensive - want to ensure it's memoized
const filteredItems = useMemo(() => {
console.log('Filtering...')
return items
.filter(item => matchesQuery(item, query))
.sort(complexSortFn)
.slice(0, 100)
}, [items, query])
return <List items={filteredItems} />
}
Escape Hatch 4: Referential Equality for Children
When passing objects/arrays to components that use referential equality:
function Parent() {
// Child component uses Object.is() comparison
const contextValue = useMemo(() => ({
theme: 'dark',
locale: 'en',
}), [])
return (
<MyContext.Provider value={contextValue}>
<Children />
</MyContext.Provider>
)
}
When NOT to Use Escape Hatches
Don't Memoize Primitives
// ❌ Unnecessary - primitives are already stable
const memoizedId = useMemo(() => props.id, [props.id])
// ✅ Just use it directly
<Child id={props.id} />
Don't Memoize Simple JSX
// ❌ Unnecessary with React Compiler
const memoizedButton = useMemo(() => (
<Button onClick={handleClick}>Click</Button>
), [handleClick])
// ✅ Compiler handles this
<Button onClick={handleClick}>Click</Button>
Don't Memoize Everything "Just in Case"
// ❌ Over-memoization
function Component({ user }) {
const name = useMemo(() => user.name, [user.name])
const email = useMemo(() => user.email, [user.email])
const avatar = useMemo(() => user.avatar, [user.avatar])
return <Profile name={name} email={email} avatar={avatar} />
}
// ✅ Trust the compiler
function Component({ user }) {
return <Profile name={user.name} email={user.email} avatar={user.avatar} />
}
useCallback Escape Hatches
Stable Event Handlers for Effects
function DataFetcher({ onDataLoaded }) {
// Need stable reference for effect dependency
const stableCallback = useCallback(
(data) => onDataLoaded(data),
[onDataLoaded]
)
useEffect(() => {
fetchData().then(stableCallback)
}, [stableCallback])
}
Refs in Callbacks
function Form() {
const inputRef = useRef<HTMLInputElement>(null)
// Callback that uses ref - may need stability
const focusInput = useCallback(() => {
inputRef.current?.focus()
}, [])
return (
<>
<input ref={inputRef} />
<Button onClick={focusInput}>Focus</Button>
</>
)
}
Decision Tree
Is it an effect dependency?
├─ YES → Does the effect need to run less often?
│ └─ YES → useMemo/useCallback
└─ NO → Is it passed to a third-party library?
├─ YES → Check library docs, may need useMemo
└─ NO → Is it a known expensive computation?
├─ YES → Consider useMemo for explicit control
└─ NO → Trust React Compiler
Verifying Compiler Coverage
// In development, check DevTools for Memo badge
// If component doesn't have badge, compiler may have skipped it
// You can also add console logs to verify:
const value = useMemo(() => {
console.log('Computing...') // Should only log when deps change
return expensiveComputation()
}, [deps])
Profiling
Performance Profiling
Tools and techniques for identifying performance bottlenecks.
Profiling Workflow
- Measure - Establish baseline metrics
- Profile - Identify bottlenecks
- Optimize - Fix the slowest operations first
- Verify - Measure improvement
- Repeat - Iterate until targets met
Backend Profiling (Python)
1. cProfile (Built-in)
# Profile entire script
python -m cProfile -s cumulative backend/app/main.py
# Save profile for analysis
python -m cProfile -o profile.prof backend/app/main.py
# Analyze with snakeviz
pip install snakeviz
snakeviz profile.prof # Opens interactive flame graph
2. py-spy (Sampling Profiler)
# Install
pip install py-spy
# Profile running process
sudo py-spy top --pid 12345
# Generate flame graph
sudo py-spy record -o profile.svg --pid 12345 --duration 60
# Profile from start
py-spy record -o profile.svg -- python app.py
3. memory_profiler
# Install
pip install memory_profiler
# Decorate functions to profile
from memory_profiler import profile
@profile
def expensive_function():
data = [0] * (10 ** 6) # 1M integers
return sum(data)
# Run with profiling
python -m memory_profiler script.py
4. Line Profiler
# Install
pip install line_profiler
# Add decorator
from line_profiler import profile
@profile
def slow_function():
result = 0
for i in range(1000000):
result += i
return result
# Run with kernprof
kernprof -l -v script.py
Frontend Profiling
1. Chrome DevTools Performance Tab
Steps:
- Open DevTools (F12)
- Go to Performance tab
- Click Record (Cmd+E)
- Interact with page
- Stop recording
- Analyze flame graph
What to Look For:
- Long tasks (> 50ms) - shows as red in timeline
- Layout/reflow - indicates DOM thrashing
- Scripting time - JavaScript execution
- Rendering time - paint and composite
2. React DevTools Profiler
import { Profiler } from 'react';
function onRenderCallback(
id: string,
phase: 'mount' | 'update',
actualDuration: number,
baseDuration: number,
startTime: number,
commitTime: number
) {
console.log(`${id} (${phase}) took ${actualDuration}ms`);
}
<Profiler id="AnalysisList" onRender={onRenderCallback}>
<AnalysisList analyses={analyses} />
</Profiler>
In DevTools:
- Open React DevTools
- Go to Profiler tab
- Click Record
- Interact with app
- Stop and analyze
What to Look For:
- Components that render frequently but haven't changed
- Components with long render times
- Unnecessary re-renders (use memo())
3. Lighthouse Performance Audit
# CLI
npm install -g lighthouse
lighthouse https://localhost:3000 --view
# Or use Chrome DevTools → Lighthouse tab
Metrics Analyzed:
- First Contentful Paint (FCP)
- Largest Contentful Paint (LCP)
- Speed Index
- Time to Interactive (TTI)
- Total Blocking Time (TBT)
- Cumulative Layout Shift (CLS)
4. Bundle Analyzer
# Next.js
npm install @next/bundle-analyzer
ANALYZE=true npm run build
# Vite
npm install -D rollup-plugin-visualizer
npx vite-bundle-visualizer
# Webpack
npm install -D webpack-bundle-analyzer
webpack --profile --json > stats.json
webpack-bundle-analyzer stats.json
Database Profiling
PostgreSQL
1. Enable Query Logging
-- Enable slow query log
ALTER SYSTEM SET log_min_duration_statement = 100; -- Log queries > 100ms
SELECT pg_reload_conf();
2. pg_stat_statements
-- Enable extension
CREATE EXTENSION pg_stat_statements;
-- Find slowest queries
SELECT
query,
calls,
total_time / 1000 as total_seconds,
mean_time / 1000 as mean_seconds,
max_time / 1000 as max_seconds
FROM pg_stat_statements
WHERE query NOT LIKE '%pg_stat_statements%'
ORDER BY total_time DESC
LIMIT 10;
-- Reset stats
SELECT pg_stat_statements_reset();
3. EXPLAIN ANALYZE
-- Analyze query execution
EXPLAIN ANALYZE
SELECT a.*, COUNT(c.id) as chunk_count
FROM analyses a
LEFT JOIN chunks c ON c.analysis_id = a.id
WHERE a.user_id = 'user_123'
GROUP BY a.id
ORDER BY a.created_at DESC
LIMIT 20;
-- Look for:
-- - Seq Scan (bad for large tables)
-- - High actual time
-- - High actual rows vs estimated rows
Memory Profiling
Python (memory_profiler)
from memory_profiler import profile
@profile
def load_analyses():
# Shows line-by-line memory usage
analyses = []
for i in range(10000):
analyses.append({
'id': i,
'content': 'x' * 1000, # Memory spike here!
})
return analyses
Chrome DevTools (Heap Snapshot)
Steps:
- Open DevTools → Memory tab
- Take Heap Snapshot
- Interact with app
- Take another snapshot
- Compare snapshots
What to Look For:
- Detached DOM nodes (memory leaks)
- Large arrays/objects
- Unreleased event listeners
Memory Leak Detection
// ❌ BAD: Memory leak (event listener never removed)
useEffect(() => {
window.addEventListener('resize', handleResize);
}, []);
// ✅ GOOD: Cleanup on unmount
useEffect(() => {
window.addEventListener('resize', handleResize);
return () => {
window.removeEventListener('resize', handleResize);
};
}, []);
Flame Graphs
Visual representation of call stacks showing where time is spent.
Reading Flame Graphs:
- Width = Time spent (wider = slower)
- Height = Call stack depth
- Color = Usually just for differentiation
- Top = Leaf functions (where actual work happens)
Generate Flame Graph (Python):
# With py-spy
sudo py-spy record -o flamegraph.svg --pid 12345
# Open in browser
open flamegraph.svg
Load Testing
k6 (HTTP Load Testing)
// load-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';
export let options = {
stages: [
{ duration: '2m', target: 100 }, // Ramp up to 100 users
{ duration: '5m', target: 100 }, // Stay at 100 users
{ duration: '2m', target: 0 }, // Ramp down to 0 users
],
thresholds: {
http_req_duration: ['p(95)<500'], // 95% of requests < 500ms
},
};
export default function () {
const res = http.get('http://localhost:8500/api/v1/analyses');
check(res, {
'status is 200': (r) => r.status === 200,
'response time < 500ms': (r) => r.timings.duration < 500,
});
sleep(1);
}
# Run load test
k6 run load-test.jsLocust (Python Load Testing)
# locustfile.py
from locust import HttpUser, task, between
class ApiUser(HttpUser):
wait_time = between(1, 3)
@task
def get_analyses(self):
self.client.get("/api/v1/analyses")
@task(3) # 3x more frequent
def get_analysis(self):
self.client.get("/api/v1/analyses/abc123")# Run with web UI
locust -f locustfile.py --host=http://localhost:8500
# Or headless
locust -f locustfile.py --host=http://localhost:8500 --users 100 --spawn-rate 10 --run-time 5m --headlessProfiling Best Practices
- Profile in production-like environments - Dev may not show real bottlenecks
- Profile with realistic data volumes - Empty databases hide performance issues
- Focus on the slowest operations first - 80/20 rule applies
- Measure before and after - Verify optimizations actually help
- Profile regularly - Catch regressions early
- Use sampling profilers for production - Low overhead (py-spy, not cProfile)
Quick Profiling Commands
# Python CPU profiling
python -m cProfile -s cumulative script.py | head -20
# Python memory profiling
python -m memory_profiler script.py
# Node.js profiling
node --prof app.js
node --prof-process isolate-*.log > processed.txt
# PostgreSQL slow queries
psql -c "SELECT query, mean_time FROM pg_stat_statements ORDER BY mean_time DESC LIMIT 10"
# Chrome DevTools (programmatic)
node --inspect app.js
# Then open chrome://inspect
References
- py-spy Documentation
- Chrome DevTools Performance
- React Profiler
- k6 Load Testing
- See scripts/performance-metrics.ts for Prometheus metrics setup
Quantization Guide
Quantization Guide
Overview
Quantization reduces model precision to decrease memory usage and increase throughput.
| Method | Bits | Calibration | Memory Savings | Throughput | Quality Loss |
|---|---|---|---|---|---|
| FP16 | 16 | None | Baseline | Baseline | None |
| FP8 | 8 | None | 50% | +30-50% | Minimal |
| INT8 | 8 | Optional | 50% | +10-20% | Minimal |
| AWQ | 4 | Required | 75% | +20-40% | Small |
| GPTQ | 4 | Required | 75% | +15-30% | Small |
AWQ (Activation-aware Weight Quantization)
Best 4-bit method for quality preservation:
# Use pre-quantized AWQ model
vllm serve TheBloke/Llama-2-70B-chat-AWQ \
--quantization awq \
--tensor-parallel-size 2
from vllm import LLM
# AWQ quantized model
llm = LLM(
model="TheBloke/Llama-2-70B-chat-AWQ",
quantization="awq",
dtype="half",
tensor_parallel_size=2,
)
AWQ Benefits:
- Activation-aware: Preserves important weights
- Better quality than GPTQ at same bit-width
- Faster inference on modern GPUs
GPTQ Quantization
Create your own GPTQ quantized model:
from gptqmodel import GPTQModel, QuantizeConfig
from datasets import load_dataset
# Load calibration data
calibration_data = load_dataset(
"allenai/c4",
data_files="en/c4-train.00001-of-01024.json.gz",
split="train",
).select(range(1024))["text"]
# Configure quantization
quant_config = QuantizeConfig(
bits=4, # 4-bit quantization
group_size=128, # Group size for quantization
damp_percent=0.1, # Dampening for Hessian
desc_act=True, # Activation order (better quality)
)
# Load and quantize
model = GPTQModel.load(
"meta-llama/Llama-3.2-1B-Instruct",
quant_config,
)
model.quantize(calibration_data, batch_size=4)
# Save quantized model
model.save("Llama-3.2-1B-Instruct-gptq-4bit")Using GPTQ with vLLM:
from vllm import LLM
llm = LLM(
model="TheBloke/Llama-2-70B-GPTQ",
quantization="gptq",
dtype="half",
)
FP8 Quantization
Best for H100/H200 GPUs with native FP8 support:
from vllm import LLM
# FP8 on H100
llm = LLM(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
quantization="fp8", # Native FP8
kv_cache_dtype="fp8", # FP8 KV cache
)
FP8 Advantages:
- Near-FP16 quality
- 50% memory reduction
- Best throughput on H100/H200
- No calibration needed
INT8 Quantization
Balanced option with minimal quality loss:
# INT8 weight quantization
llm = LLM(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
quantization="int8",
dtype="float16",
)
Quantization Comparison
Memory Usage (70B Model)
| Precision | Memory (per GPU) | GPUs Needed |
|---|---|---|
| FP32 | ~280 GB | 8x A100 80GB |
| FP16 | ~140 GB | 4x A100 80GB |
| INT8/FP8 | ~70 GB | 2x A100 80GB |
| AWQ/GPTQ | ~35 GB | 1x A100 80GB |
Quality Benchmarks (MMLU)
| Model | FP16 | INT8 | AWQ-4bit | GPTQ-4bit |
|---|---|---|---|---|
| Llama-3.1-8B | 66.2% | 65.8% | 65.1% | 64.8% |
| Llama-3.1-70B | 79.3% | 79.0% | 78.2% | 77.9% |
Best Practices
Calibration Data
Use representative data for your use case:
# Domain-specific calibration
calibration_data = [
# Include examples similar to production queries
"Customer support query example...",
"Technical documentation example...",
"Code generation example...",
]
# Minimum 128 samples, recommended 512-1024
assert len(calibration_data) >= 128
Group Size Selection
| Group Size | Memory | Quality | Speed |
|---|---|---|---|
| 32 | Lowest | Best | Slowest |
| 64 | Low | Very Good | Fast |
| 128 | Medium | Good | Fastest |
# Higher group size = faster but lower quality
quant_config = QuantizeConfig(
bits=4,
group_size=128, # Balance of speed and quality
)
Mixed Precision
Keep critical layers at higher precision:
# Some layers benefit from higher precision
quant_config = QuantizeConfig(
bits=4,
group_size=128,
inside_layer_modules=[
# Keep attention at higher precision
"self_attn.q_proj",
"self_attn.k_proj",
"self_attn.v_proj",
],
)
Troubleshooting
OOM During Quantization
# Reduce batch size
model.quantize(calibration_data, batch_size=1)
# Use gradient checkpointing
model.quantize(
calibration_data,
batch_size=2,
use_checkpoint=True,
)
Quality Degradation
- Increase calibration data diversity
- Reduce group size (32 or 64)
- Try AWQ instead of GPTQ
- Enable `desc_act=True` for GPTQ
Related Skills
- `ollama-local` - Local inference with quantized models
- `embeddings` - Quantized embedding models
React Compiler Migration
React Compiler Migration Guide
Adopting React 19's automatic memoization.
What is React Compiler?
React Compiler automatically memoizes components and values, eliminating the need for manual useMemo, useCallback, and React.memo in most cases.
Prerequisites
- React 19+
- Compatible framework (Next.js 16+, Expo SDK 54+)
- Code follows Rules of React
Quick Setup
Next.js 16+
// next.config.js
const nextConfig = {
reactCompiler: true,
}
module.exports = nextConfig
Expo SDK 54+
Enabled by default in new projects.
Babel (Manual)
npm install -D babel-plugin-react-compiler
// babel.config.js
module.exports = {
plugins: [
['babel-plugin-react-compiler', {
// Optional: sources to compile
sources: (filename) => {
return filename.indexOf('src') !== -1
},
}],
],
}
Verification
- Open React DevTools in browser
- Go to Components tab
- Look for "Memo ✨" badge next to component names
- If you see the sparkle emoji, compiler is working
What Gets Optimized
The compiler automatically memoizes:
| Before (Manual) | After (Compiler) |
|---|---|
| React.memo(Component) | Component re-renders only when needed |
| useMemo(() => value, [deps]) | Intermediate values cached |
| useCallback(() => fn, [deps]) | Callback references stable |
| Conditional JSX | JSX elements memoized |
Rules of React (Must Follow)
For the compiler to work correctly:
1. Components Must Be Idempotent
// ✅ Same input → same output
function Profile({ user }) {
return <h1>{user.name}</h1>
}
// ❌ Non-deterministic
function Profile({ user }) {
return <h1>{user.name} at {Date.now()}</h1>
}
2. Props and State Are Immutable
// ✅ Create new object
setUser({ ...user, name: 'New Name' })
// ❌ Mutate existing
user.name = 'New Name'
setUser(user)
3. Side Effects Outside Render
// ✅ In useEffect
useEffect(() => {
analytics.track('view')
}, [])
// ❌ During render
function Component() {
analytics.track('view') // BAD
return <div>...</div>
}
4. Hooks at Top Level
// ✅ Always at top
function Component() {
const [state, setState] = useState()
// ...
}
// ❌ Conditional hooks
function Component({ show }) {
if (show) {
const [state, setState] = useState() // BAD
}
}
Migration Strategy
New Projects
Enable compiler immediately. No reason not to.
Existing Projects
- Enable compiler in config
- Run tests to catch issues
- Check DevTools for Memo badges
- Gradually remove manual memoization
// Before (manual)
const MemoizedChild = React.memo(Child)
const memoizedValue = useMemo(() => compute(data), [data])
const handleClick = useCallback(() => onClick(id), [id, onClick])
// After (compiler handles it)
// Just use Child, compute(data), and onClick directly
// Compiler determines what needs memoization
When Manual Memoization Still Needed
Keep useMemo/useCallback for:
// 1. Effect dependencies that shouldn't trigger re-runs
const stableConfig = useMemo(() => ({
apiUrl: process.env.API_URL,
timeout: 5000,
}), [])
useEffect(() => {
initSDK(stableConfig) // Should only run once
}, [stableConfig])
// 2. Third-party libraries without compiler support
const memoizedData = useMemo(() =>
thirdPartyLib.transform(data), [data])
// 3. Precise control over boundaries
const handleSubmit = useCallback(async () => {
// Complex async logic that must be stable
}, [criticalDep])
Debugging Issues
Component Not Getting Memo Badge
- Check if file is in compiler's sources
- Look for Rules of React violations
- Check for unsupported patterns
Performance Regression
- Profile with React DevTools
- Check if compiler skipped problematic code
- Add manual memoization as escape hatch
Compatibility Notes
- Works with existing `useMemo`/`useCallback` (won't double-memoize)
- Safe to leave existing memoization during migration
- Compiler output is equivalent to manual optimization
Route Splitting
Route-Based Code Splitting
React Router 7.x Lazy Routes
import { createBrowserRouter } from 'react-router';
// Define lazy routes
const routes = [
{
path: '/',
lazy: () => import('./pages/Home'),
},
{
path: '/dashboard',
lazy: () => import('./pages/Dashboard'),
children: [
{
path: 'analytics',
lazy: () => import('./pages/Analytics'),
},
{
path: 'settings',
lazy: () => import('./pages/Settings'),
},
],
},
];
const router = createBrowserRouter(routes);
Vite Manual Chunks
// vite.config.ts
export default defineConfig({
build: {
rollupOptions: {
output: {
manualChunks: {
// Vendor chunks
'react-vendor': ['react', 'react-dom', 'react-router'],
'query-vendor': ['@tanstack/react-query'],
// Feature chunks (match route structure)
'dashboard': [
'./src/pages/Dashboard',
'./src/pages/Analytics',
],
'settings': [
'./src/pages/Settings',
'./src/pages/Profile',
],
},
},
},
},
});
Prefetch on Route Hover
import { useQueryClient } from '@tanstack/react-query';
import { Link } from 'react-router';
function NavLink({ to, children }: { to: string; children: React.ReactNode }) {
const queryClient = useQueryClient();
const prefetch = () => {
// Prefetch route data
queryClient.prefetchQuery({
queryKey: ['route', to],
queryFn: () => fetchRouteData(to),
});
};
return (
<Link
to={to}
onMouseEnter={prefetch}
onFocus={prefetch}
prefetch="intent"
>
{children}
</Link>
);
}
Bundle Size Monitoring
# After build, check chunk sizes
npx vite build
# Output shows chunk sizes
# For detailed analysis
npx vite-bundle-visualizer
RUM Setup
Real User Monitoring (RUM) Setup
Complete guide to implementing Real User Monitoring for Core Web Vitals.
┌─────────────────────────────────────────────────────────────────────────┐
│ RUM Data Flow │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Browser Server Analytics │
│ ┌────────┐ ┌────────┐ ┌────────────┐ │
│ │ User │──interaction──►│ web-vitals │ │ │ │
│ │Session │ │ library │ │ Dashboard │ │
│ └────────┘ └─────┬──────┘ │ + Alerts │ │
│ │ └─────┬──────┘ │
│ │ │ │
│ ┌────────────▼────────────┐ │ │
│ │ sendBeacon / fetch │─────────────► │
│ │ (keepalive: true) │ │ │
│ └────────────┬────────────┘ │ │
│ │ │ │
│ ┌────────────▼────────────┐ │ │
│ │ /api/vitals │─────────────► │
│ │ (batch + process) │ metrics │ │
│ └─────────────────────────┘ │ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
web-vitals Library Setup
Installation
npm install web-vitals
# or
pnpm add web-vitals
Basic Implementation
// lib/vitals.ts
import {
onCLS,
onINP,
onLCP,
onFCP,
onTTFB,
type Metric,
type ReportOpts,
} from 'web-vitals';
// Metric type for your analytics
export interface VitalsMetric {
name: 'CLS' | 'INP' | 'LCP' | 'FCP' | 'TTFB';
value: number;
rating: 'good' | 'needs-improvement' | 'poor';
delta: number;
id: string;
navigationType: 'navigate' | 'reload' | 'back-forward' | 'back-forward-cache' | 'prerender';
// Custom metadata
url: string;
userAgent: string;
connectionType?: string;
deviceMemory?: number;
timestamp: number;
}
// Collect device and connection info for debugging
function getDeviceInfo(): Partial<VitalsMetric> {
const nav = navigator as Navigator & {
connection?: { effectiveType?: string };
deviceMemory?: number;
};
return {
userAgent: navigator.userAgent,
connectionType: nav.connection?.effectiveType,
deviceMemory: nav.deviceMemory,
};
}
function createMetricPayload(metric: Metric): VitalsMetric {
return {
name: metric.name as VitalsMetric['name'],
value: metric.value,
rating: metric.rating,
delta: metric.delta,
id: metric.id,
navigationType: metric.navigationType,
url: window.location.href,
timestamp: Date.now(),
...getDeviceInfo(),
};
}
// Reliable transmission even during page unload
function sendToAnalytics(metric: Metric) {
const payload = createMetricPayload(metric);
const body = JSON.stringify(payload);
// sendBeacon is most reliable for unload scenarios
if (navigator.sendBeacon) {
navigator.sendBeacon('/api/vitals', body);
} else {
// Fallback with keepalive for browsers without sendBeacon
fetch('/api/vitals', {
method: 'POST',
body,
headers: { 'Content-Type': 'application/json' },
keepalive: true, // Keeps request alive even if page unloads
});
}
}
// Report all web vitals
export function reportWebVitals(opts?: ReportOpts) {
// Core Web Vitals (affect SEO)
onCLS(sendToAnalytics, opts);
onINP(sendToAnalytics, opts);
onLCP(sendToAnalytics, opts);
// Additional useful metrics
onFCP(sendToAnalytics, opts);
onTTFB(sendToAnalytics, opts);
}
Next.js App Router Integration
Client Component for Vitals
// app/components/web-vitals.tsx
'use client';
import { useEffect } from 'react';
import { reportWebVitals } from '@/lib/vitals';
export function WebVitals() {
useEffect(() => {
// Report immediately (first value)
reportWebVitals({ reportAllChanges: false });
}, []);
return null;
}
// For debugging during development
export function WebVitalsDebug() {
useEffect(() => {
// Report all changes, not just final values
reportWebVitals({ reportAllChanges: true });
}, []);
return null;
}
Layout Integration
// app/layout.tsx
import { WebVitals } from '@/components/web-vitals';
export default function RootLayout({
children,
}: {
children: React.ReactNode;
}) {
return (
<html lang="en">
<body>
<WebVitals />
{children}
</body>
</html>
);
}
API Endpoint Implementation
Next.js Route Handler
// app/api/vitals/route.ts
import { NextRequest, NextResponse } from 'next/server';
// Thresholds from web.dev
const THRESHOLDS = {
LCP: { good: 2500, poor: 4000 },
INP: { good: 200, poor: 500 },
CLS: { good: 0.1, poor: 0.25 },
FCP: { good: 1800, poor: 3000 },
TTFB: { good: 800, poor: 1800 },
} as const;
// 2026 thresholds (plan ahead!)
const THRESHOLDS_2026 = {
LCP: { good: 2000, poor: 4000 },
INP: { good: 150, poor: 500 },
CLS: { good: 0.08, poor: 0.25 },
} as const;
interface VitalsMetric {
name: string;
value: number;
rating: string;
delta: number;
id: string;
navigationType: string;
url: string;
userAgent: string;
connectionType?: string;
deviceMemory?: number;
timestamp: number;
}
// Validate incoming metric
function isValidMetric(data: unknown): data is VitalsMetric {
if (!data || typeof data !== 'object') return false;
const metric = data as Record<string, unknown>;
return (
typeof metric.name === 'string' &&
typeof metric.value === 'number' &&
typeof metric.rating === 'string'
);
}
export async function POST(request: NextRequest) {
try {
const metric = await request.json();
if (!isValidMetric(metric)) {
return NextResponse.json(
{ error: 'Invalid metric format' },
{ status: 400 }
);
}
// Enrich with server-side data
const enrichedMetric = {
...metric,
receivedAt: new Date().toISOString(),
clientIP: request.headers.get('x-forwarded-for') ?? 'unknown',
country: request.headers.get('x-vercel-ip-country') ?? 'unknown',
};
// Log for debugging (replace with your analytics service)
console.log('[Web Vital]', JSON.stringify(enrichedMetric));
// Store in your analytics database
await storeMetric(enrichedMetric);
// Alert on poor metrics (optional)
if (metric.rating === 'poor') {
await alertOnPoorMetric(enrichedMetric);
}
return NextResponse.json({ received: true });
} catch (error) {
console.error('[Vitals API Error]', error);
return NextResponse.json(
{ error: 'Failed to process metric' },
{ status: 500 }
);
}
}
// Example: Store in PostgreSQL
async function storeMetric(metric: VitalsMetric & { receivedAt: string }) {
// Replace with your database client
// await db.insert('web_vitals').values({
// name: metric.name,
// value: metric.value,
// rating: metric.rating,
// url: metric.url,
// user_agent: metric.userAgent,
// connection_type: metric.connectionType,
// timestamp: new Date(metric.timestamp),
// received_at: new Date(metric.receivedAt),
// });
}
// Example: Alert via Slack/PagerDuty
async function alertOnPoorMetric(metric: VitalsMetric) {
const threshold = THRESHOLDS[metric.name as keyof typeof THRESHOLDS];
if (!threshold) return;
// await fetch(process.env.SLACK_WEBHOOK_URL!, {
// method: 'POST',
// body: JSON.stringify({
// text: `🚨 Poor ${metric.name}: ${metric.value}${metric.name === 'CLS' ? '' : 'ms'} on ${metric.url}`,
// }),
// });
}
Batching for High-Traffic Sites
// lib/vitals-batched.ts
import { onCLS, onINP, onLCP, type Metric } from 'web-vitals';
const BATCH_SIZE = 10;
const FLUSH_INTERVAL = 5000; // 5 seconds
class MetricsBatcher {
private queue: Metric[] = [];
private flushTimer: ReturnType<typeof setTimeout> | null = null;
add(metric: Metric) {
this.queue.push(metric);
if (this.queue.length >= BATCH_SIZE) {
this.flush();
} else if (!this.flushTimer) {
this.flushTimer = setTimeout(() => this.flush(), FLUSH_INTERVAL);
}
}
private flush() {
if (this.queue.length === 0) return;
const metrics = [...this.queue];
this.queue = [];
if (this.flushTimer) {
clearTimeout(this.flushTimer);
this.flushTimer = null;
}
// Send batch
navigator.sendBeacon(
'/api/vitals/batch',
JSON.stringify({ metrics, timestamp: Date.now() })
);
}
// Flush on page unload
flushSync() {
if (this.flushTimer) {
clearTimeout(this.flushTimer);
this.flushTimer = null;
}
this.flush();
}
}
const batcher = new MetricsBatcher();
// Ensure flush on unload
if (typeof window !== 'undefined') {
window.addEventListener('visibilitychange', () => {
if (document.visibilityState === 'hidden') {
batcher.flushSync();
}
});
}
export function reportWebVitalsBatched() {
onCLS((metric) => batcher.add(metric));
onINP((metric) => batcher.add(metric));
onLCP((metric) => batcher.add(metric));
}
Database Schema
PostgreSQL Schema
CREATE TABLE web_vitals (
id BIGSERIAL PRIMARY KEY,
name VARCHAR(10) NOT NULL,
value DECIMAL(10, 4) NOT NULL,
rating VARCHAR(20) NOT NULL,
delta DECIMAL(10, 4),
metric_id VARCHAR(50),
navigation_type VARCHAR(30),
url TEXT NOT NULL,
user_agent TEXT,
connection_type VARCHAR(20),
device_memory INT,
client_ip INET,
country VARCHAR(2),
timestamp TIMESTAMPTZ NOT NULL,
received_at TIMESTAMPTZ DEFAULT NOW()
);
-- Indexes for common queries (PostgreSQL requires separate CREATE INDEX statements)
CREATE INDEX idx_vitals_name_timestamp ON web_vitals (name, timestamp DESC);
CREATE INDEX idx_vitals_url ON web_vitals (url);
CREATE INDEX idx_vitals_rating ON web_vitals (rating);
-- Partition by month for large datasets (requires creating the parent table
-- with PARTITION BY RANGE (timestamp))
CREATE TABLE web_vitals_2025_01 PARTITION OF web_vitals
FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
Analytics Queries
-- Daily Core Web Vitals summary (p75 is Google's standard)
SELECT
DATE(timestamp) as date,
name,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) as p75,
COUNT(CASE WHEN rating = 'good' THEN 1 END)::float / COUNT(*) * 100 as good_pct,
COUNT(*) as samples
FROM web_vitals
WHERE timestamp > NOW() - INTERVAL '30 days'
AND name IN ('LCP', 'INP', 'CLS')
GROUP BY DATE(timestamp), name
ORDER BY date DESC, name;
-- Worst performing pages by LCP
SELECT
url,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) as p75_lcp,
COUNT(*) as samples
FROM web_vitals
WHERE name = 'LCP'
AND timestamp > NOW() - INTERVAL '7 days'
GROUP BY url
HAVING COUNT(*) > 100
ORDER BY p75_lcp DESC
LIMIT 20;
-- Performance by connection type
SELECT
connection_type,
name,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) as p75,
COUNT(*) as samples
FROM web_vitals
WHERE timestamp > NOW() - INTERVAL '7 days'
AND connection_type IS NOT NULL
GROUP BY connection_type, name
ORDER BY connection_type, name;
-- Trend analysis: Week-over-week comparison
WITH current_week AS (
SELECT name, PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) as p75
FROM web_vitals
WHERE timestamp > NOW() - INTERVAL '7 days'
GROUP BY name
),
previous_week AS (
SELECT name, PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) as p75
FROM web_vitals
WHERE timestamp BETWEEN NOW() - INTERVAL '14 days' AND NOW() - INTERVAL '7 days'
GROUP BY name
)
SELECT
c.name,
c.p75 as current_p75,
p.p75 as previous_p75,
ROUND((c.p75 - p.p75) / p.p75 * 100, 2) as change_pct
FROM current_week c
JOIN previous_week p ON c.name = p.name;
Grafana Dashboard
Prometheus Metrics Export
// lib/metrics-exporter.ts
import { Histogram, Counter, Registry } from 'prom-client';
const registry = new Registry();
// Histogram for percentile calculations
const webVitalsHistogram = new Histogram({
name: 'web_vitals_value',
help: 'Web Vitals metric values',
labelNames: ['name', 'rating'],
buckets: {
LCP: [1000, 1500, 2000, 2500, 3000, 4000, 5000],
INP: [50, 100, 150, 200, 300, 500, 1000],
CLS: [0.01, 0.05, 0.1, 0.15, 0.25, 0.5],
}['LCP'], // Default buckets
registers: [registry],
});
const webVitalsCounter = new Counter({
name: 'web_vitals_total',
help: 'Total count of Web Vitals reports',
labelNames: ['name', 'rating'],
registers: [registry],
});
export function recordMetric(name: string, value: number, rating: string) {
webVitalsHistogram.labels(name, rating).observe(value);
webVitalsCounter.labels(name, rating).inc();
}
export { registry };
Grafana Alert Rules
# grafana-alerts.yaml
groups:
- name: core-web-vitals
interval: 5m
rules:
# LCP Alert
- alert: HighLCP
expr: histogram_quantile(0.75, sum(rate(web_vitals_value_bucket{name="LCP"}[15m])) by (le)) > 2500
for: 10m
labels:
severity: warning
annotations:
summary: "LCP p75 is {{ $value | printf \"%.0f\" }}ms (threshold: 2500ms)"
description: "Largest Contentful Paint has degraded. Check recent deployments."
# INP Alert
- alert: HighINP
expr: histogram_quantile(0.75, sum(rate(web_vitals_value_bucket{name="INP"}[15m])) by (le)) > 200
for: 10m
labels:
severity: warning
annotations:
summary: "INP p75 is {{ $value | printf \"%.0f\" }}ms (threshold: 200ms)"
description: "Interaction to Next Paint has degraded. Check for long tasks."
# CLS Alert
- alert: HighCLS
expr: histogram_quantile(0.75, sum(rate(web_vitals_value_bucket{name="CLS"}[15m])) by (le)) > 0.1
for: 10m
labels:
severity: warning
annotations:
summary: "CLS p75 is {{ $value | printf \"%.3f\" }} (threshold: 0.1)"
description: "Cumulative Layout Shift has degraded. Check for layout shifts."
# Good rate dropping
- alert: GoodRateDrop
expr: |
(sum(rate(web_vitals_total{rating="good"}[1h])) by (name) /
sum(rate(web_vitals_total[1h])) by (name)) < 0.75
for: 30m
labels:
severity: critical
annotations:
summary: "{{ $labels.name }} good rate dropped below 75%"
description: "Less than 75% of users are experiencing good {{ $labels.name }}"
Sampling Strategy for High Traffic
// lib/vitals-sampled.ts
import { onCLS, onINP, onLCP, type Metric } from 'web-vitals';
// Assumes sendToAnalytics is exported from the basic implementation above
import { sendToAnalytics } from './vitals';
interface SamplingConfig {
// Base sample rate (0-1)
baseRate: number;
// Always sample poor metrics
alwaysSamplePoor: boolean;
// Sample more on specific pages
pageMultipliers?: Record<string, number>;
}
const DEFAULT_CONFIG: SamplingConfig = {
baseRate: 0.1, // 10% baseline
alwaysSamplePoor: true,
pageMultipliers: {
'/': 1.0, // Always sample homepage
'/checkout': 1.0, // Always sample checkout
},
};
function shouldSample(metric: Metric, config: SamplingConfig): boolean {
// Always sample poor metrics for debugging
if (config.alwaysSamplePoor && metric.rating === 'poor') {
return true;
}
// Check page-specific multiplier
const path = window.location.pathname;
const multiplier = config.pageMultipliers?.[path] ?? 1;
const effectiveRate = config.baseRate * multiplier;
return Math.random() < effectiveRate;
}
export function reportWebVitalsSampled(config = DEFAULT_CONFIG) {
const report = (metric: Metric) => {
if (shouldSample(metric, config)) {
sendToAnalytics(metric);
}
};
onCLS(report);
onINP(report);
onLCP(report);
}
Testing RUM in Development
// lib/vitals-dev.ts
import { onCLS, onINP, onLCP, type Metric } from 'web-vitals';
const RATING_COLORS = {
good: 'color: green',
'needs-improvement': 'color: orange',
poor: 'color: red',
} as const;
function logToConsole(metric: Metric) {
const color = RATING_COLORS[metric.rating];
const unit = metric.name === 'CLS' ? '' : 'ms';
console.log(
`%c[${metric.name}] ${metric.value.toFixed(2)}${unit} (${metric.rating})`,
color,
{
delta: metric.delta,
id: metric.id,
navigationType: metric.navigationType,
}
);
}
export function reportWebVitalsDev() {
// Report all changes for debugging
onCLS(logToConsole, { reportAllChanges: true });
onINP(logToConsole, { reportAllChanges: true });
onLCP(logToConsole, { reportAllChanges: true });
}
// Usage in development
if (process.env.NODE_ENV === 'development') {
reportWebVitalsDev();
}
Integration with Analytics Providers
Google Analytics 4
// lib/vitals-ga4.ts
import { onCLS, onINP, onLCP, type Metric } from 'web-vitals';
declare global {
interface Window {
gtag?: (...args: unknown[]) => void;
}
}
function sendToGA4(metric: Metric) {
if (typeof window.gtag !== 'function') return;
window.gtag('event', metric.name, {
event_category: 'Web Vitals',
event_label: metric.id,
value: Math.round(metric.name === 'CLS' ? metric.value * 1000 : metric.value),
metric_rating: metric.rating,
non_interaction: true,
});
}
export function reportWebVitalsGA4() {
onCLS(sendToGA4);
onINP(sendToGA4);
onLCP(sendToGA4);
}
Vercel Analytics
// Next.js built-in support
// next.config.js
module.exports = {
// Vercel Analytics automatically collects Web Vitals
// No additional setup needed when deployed on Vercel
};
// For self-hosted, use @vercel/analytics
import { Analytics } from '@vercel/analytics/react';
export default function RootLayout({ children }) {
return (
<html>
<body>
{children}
<Analytics />
</body>
</html>
);
}
Speculative Decoding
Overview
Speculative decoding accelerates autoregressive generation by predicting multiple tokens at once, then verifying in parallel.
How it works:
- Draft model (or n-gram) proposes N candidate tokens
- Target model verifies all N tokens in one forward pass
- Accept verified tokens, reject incorrect ones
- Repeat from first rejected position
Expected gains: 1.5-2.5x throughput for compatible workloads.
N-gram Speculation
No extra model needed - uses prompt patterns:
# vLLM CLI with n-gram speculation
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
--speculative-config '{
"method": "ngram",
"num_speculative_tokens": 5,
"prompt_lookup_max": 5,
"prompt_lookup_min": 2
}'
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
speculative_config={
"method": "ngram",
"num_speculative_tokens": 5,
"prompt_lookup_max": 5,
"prompt_lookup_min": 2,
},
)
# Works best with repetitive/structured output
outputs = llm.generate(
["Generate a JSON object with user data:"],
SamplingParams(max_tokens=500),
)
Best for:
- Structured output (JSON, code)
- Repetitive patterns
- Low additional memory
Draft Model Speculation
Use a smaller model to draft tokens:
# Draft model speculation
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
--speculative-config '{
"method": "draft_model",
"draft_model": "meta-llama/Llama-3.2-1B-Instruct",
"num_speculative_tokens": 3
}' \
--tensor-parallel-size 4
from vllm import LLM
llm = LLM(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
speculative_config={
"method": "draft_model",
"draft_model": "meta-llama/Llama-3.2-1B-Instruct",
"num_speculative_tokens": 3,
},
tensor_parallel_size=4,
)
Draft model selection:
| Target Model | Recommended Draft | Size Ratio |
|---|---|---|
| 70B | 7B or 8B | ~10% |
| 70B | 1B-3B | ~2-5% |
| 8B | 1B | ~12% |
| 405B | 8B-70B | ~2-17% |
Medusa-style Speculation
Multiple prediction heads for parallel token generation:
# Medusa-style model (requires trained heads)
llm = LLM(
model="lmsys/vicuna-7b-v1.5-16k-medusa",
speculative_config={
"method": "medusa",
"num_heads": 4, # Number of speculation heads
},
)
Advantages:
- No separate draft model
- Lower memory than draft model
- Works well with fine-tuned models
Performance Tuning
Optimal Token Count
# Benchmark different speculation depths
for num_tokens in [1, 3, 5, 7]:
llm = LLM(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
speculative_config={
"method": "ngram",
"num_speculative_tokens": num_tokens,
},
)
throughput = benchmark(llm)
print(f"Tokens: {num_tokens}, Throughput: {throughput:.1f} tok/s")
General guidelines:
| Scenario | Recommended Tokens |
|---|---|
| Code generation | 5-7 |
| JSON output | 5-7 |
| Free-form text | 2-4 |
| Creative writing | 1-3 |
Acceptance Rate Monitoring
# vLLM logs acceptance rates
# Look for: "Speculative decoding acceptance rate: X%"
# High acceptance (>70%): Increase num_speculative_tokens
# Low acceptance (<40%): Decrease or disable speculation
When NOT to Use
Speculative decoding may hurt performance when:
- High randomness (temperature > 1.0)
- Short outputs (overhead > benefit)
- Diverse outputs (low acceptance rate)
- Memory constrained (draft model overhead)
# Disable speculation for creative tasks
sampling_params = SamplingParams(
temperature=1.2,
top_p=0.95,
max_tokens=100, # Short output
)
# Use standard decoding instead
Benchmarking
import time
from vllm import LLM, SamplingParams
def benchmark_speculation(model_path: str, prompts: list[str]):
"""Compare with and without speculative decoding."""
# Without speculation
llm_base = LLM(model=model_path)
start = time.perf_counter()
outputs_base = llm_base.generate(prompts, SamplingParams(max_tokens=512))
time_base = time.perf_counter() - start
# With speculation
llm_spec = LLM(
model=model_path,
speculative_config={
"method": "ngram",
"num_speculative_tokens": 5,
},
)
start = time.perf_counter()
outputs_spec = llm_spec.generate(prompts, SamplingParams(max_tokens=512))
time_spec = time.perf_counter() - start
tokens_base = sum(len(o.outputs[0].token_ids) for o in outputs_base)
tokens_spec = sum(len(o.outputs[0].token_ids) for o in outputs_spec)
print(f"Baseline: {tokens_base/time_base:.1f} tok/s")
print(f"Speculative: {tokens_spec/time_spec:.1f} tok/s")
print(f"Speedup: {(time_base/time_spec):.2f}x")
# JSON/code prompts benefit most
prompts = [
"Generate a Python function that implements binary search:",
"Create a JSON schema for a user profile with validation:",
"Write a SQL query to find top 10 customers by revenue:",
]
benchmark_speculation("meta-llama/Meta-Llama-3.1-8B-Instruct", prompts)
Related Skills
- `llm-streaming` - Streaming with speculation
- `prompt-caching` - Combine with prefix caching
State Colocation
Keep state as close to where it's used as possible.
The Principle
State should live in the component that needs it. Only lift state when truly necessary for sibling communication.
Problem: State Too High
// ❌ State at app level causes unnecessary re-renders
function App() {
const [searchQuery, setSearchQuery] = useState('')
const [selectedId, setSelectedId] = useState(null)
return (
<div>
<Header /> {/* Re-renders on search! */}
<Sidebar /> {/* Re-renders on search! */}
<SearchInput
value={searchQuery}
onChange={setSearchQuery}
/>
<SearchResults
query={searchQuery}
selectedId={selectedId}
onSelect={setSelectedId}
/>
<Footer /> {/* Re-renders on search! */}
</div>
)
}
Solution: Colocate State
// ✅ State colocated with components that use it
function App() {
return (
<div>
<Header />
<Sidebar />
<SearchSection /> {/* Contains its own state */}
<Footer />
</div>
)
}
function SearchSection() {
const [searchQuery, setSearchQuery] = useState('')
const [selectedId, setSelectedId] = useState(null)
return (
<>
<SearchInput
value={searchQuery}
onChange={setSearchQuery}
/>
<SearchResults
query={searchQuery}
selectedId={selectedId}
onSelect={setSelectedId}
/>
</>
)
}
When to Lift State
Lift state ONLY when:
- Siblings need to share it
// Both components need selectedUser
function Parent() {
const [selectedUser, setSelectedUser] = useState(null)
return (
<>
<UserList onSelect={setSelectedUser} selected={selectedUser} />
<UserDetails user={selectedUser} />
</>
)
}
- Parent needs to coordinate
// Parent manages form submission
function Form() {
const [values, setValues] = useState({})
const handleSubmit = () => {
api.submit(values)
}
return (
<>
<FormFields values={values} onChange={setValues} />
<SubmitButton onClick={handleSubmit} />
</>
)
}
Component Splitting
Split components to isolate state:
// ❌ Before: Counter re-renders entire card
function Card() {
const [count, setCount] = useState(0)
return (
<div className="card">
<ExpensiveHeader /> {/* Re-renders on count change */}
<ExpensiveContent /> {/* Re-renders on count change */}
<button onClick={() => setCount(c => c + 1)}>
Count: {count}
</button>
</div>
)
}
// ✅ After: Counter isolated
function Card() {
return (
<div className="card">
<ExpensiveHeader /> {/* Doesn't re-render */}
<ExpensiveContent /> {/* Doesn't re-render */}
<Counter /> {/* Only this re-renders */}
</div>
)
}
function Counter() {
const [count, setCount] = useState(0)
return (
<button onClick={() => setCount(c => c + 1)}>
Count: {count}
</button>
)
}
Context for Cross-Cutting Concerns
Use Context for truly global state, not local UI state:
// ✅ Good: Theme is app-wide
<ThemeContext.Provider value={theme}>
<App />
</ThemeContext.Provider>
// ✅ Good: Auth is app-wide
<AuthContext.Provider value={user}>
<App />
</AuthContext.Provider>
// ❌ Bad: Search query is local
<SearchQueryContext.Provider value={query}> {/* Don't do this */}
<Header />
<SearchResults />
</SearchQueryContext.Provider>
Context Splitting
Split contexts to prevent unnecessary re-renders:
// ❌ Single context - all consumers re-render
const AppContext = createContext({ user, theme, locale })
// ✅ Split contexts - targeted re-renders
const UserContext = createContext(null)
const ThemeContext = createContext('light')
const LocaleContext = createContext('en')
Signs State Should Move
Move state DOWN when:
- Only one component uses it
- Child components don't need it
- Re-renders are affecting unrelated components
Move state UP when:
- Multiple children need to read it
- Children need to update each other
- State represents shared domain concept
Quick Checklist
- Is state used by only one component? → Keep it there
- Do siblings need this state? → Lift to parent
- Is it causing unnecessary re-renders? → Consider splitting
- Is it truly global? → Use Context
- Is it URL state? → Use router params
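For the router-params item, a minimal sketch using React Router's useSearchParams (component and param names are illustrative):
import { useSearchParams } from 'react-router'
function FilterBar() {
  // Query lives in the URL: it survives refresh and is shareable
  const [searchParams, setSearchParams] = useSearchParams()
  const query = searchParams.get('q') ?? ''
  return (
    <input
      value={query}
      onChange={(e) => setSearchParams({ q: e.target.value })}
    />
  )
}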
TanStack Virtual Patterns
Efficient virtualization for large lists and grids.
When to Virtualize
| Item Count | Recommendation |
|---|---|
| < 50 | Not needed |
| 50-100 | Consider if items are complex |
| 100-500 | Recommended |
| 500+ | Required |
Basic List Virtualization
import { useVirtualizer } from '@tanstack/react-virtual'
function VirtualList({ items }) {
const parentRef = useRef<HTMLDivElement>(null)
const virtualizer = useVirtualizer({
count: items.length,
getScrollElement: () => parentRef.current,
estimateSize: () => 50, // Estimated row height in px
overscan: 5, // Render 5 extra items for smooth scrolling
})
return (
<div
ref={parentRef}
style={{ height: '400px', overflow: 'auto' }}
>
<div
style={{
height: `${virtualizer.getTotalSize()}px`,
width: '100%',
position: 'relative',
}}
>
{virtualizer.getVirtualItems().map((virtualItem) => (
<div
key={virtualItem.key}
style={{
position: 'absolute',
top: 0,
left: 0,
width: '100%',
height: `${virtualItem.size}px`,
transform: `translateY(${virtualItem.start}px)`,
}}
>
{items[virtualItem.index].name}
</div>
))}
</div>
</div>
)
}
Variable Height Rows
For rows with different heights:
function VariableHeightList({ items }) {
const parentRef = useRef<HTMLDivElement>(null)
const virtualizer = useVirtualizer({
count: items.length,
getScrollElement: () => parentRef.current,
estimateSize: (index) => {
// Return estimated height based on content
return items[index].type === 'header' ? 80 : 50
},
overscan: 5,
})
return (
<div ref={parentRef} style={{ height: '400px', overflow: 'auto' }}>
<div
style={{
height: `${virtualizer.getTotalSize()}px`,
position: 'relative',
}}
>
{virtualizer.getVirtualItems().map((virtualItem) => (
<div
key={virtualItem.key}
data-index={virtualItem.index}
ref={virtualizer.measureElement} // Enable dynamic measurement
style={{
position: 'absolute',
top: 0,
left: 0,
width: '100%',
transform: `translateY(${virtualItem.start}px)`,
}}
>
<ItemComponent item={items[virtualItem.index]} />
</div>
))}
</div>
</div>
)
}
Dynamic Measurement
When content determines height:
const virtualizer = useVirtualizer({
count: items.length,
getScrollElement: () => parentRef.current,
estimateSize: () => 50, // Initial estimate
// measureElement enables dynamic re-measurement
})
// Add ref to each item
<div
key={virtualItem.key}
data-index={virtualItem.index}
ref={virtualizer.measureElement}
>
{/* Content with unknown height */}
</div>
Horizontal Virtualization
const columnVirtualizer = useVirtualizer({
horizontal: true,
count: columns.length,
getScrollElement: () => parentRef.current,
estimateSize: () => 150, // Column width
overscan: 3,
})
Grid Virtualization
Combine row and column virtualizers:
function VirtualGrid({ rows, columns }) {
const parentRef = useRef<HTMLDivElement>(null)
const rowVirtualizer = useVirtualizer({
count: rows.length,
getScrollElement: () => parentRef.current,
estimateSize: () => 50,
overscan: 5,
})
const columnVirtualizer = useVirtualizer({
horizontal: true,
count: columns.length,
getScrollElement: () => parentRef.current,
estimateSize: () => 100,
overscan: 3,
})
return (
<div ref={parentRef} style={{ height: '400px', overflow: 'auto' }}>
<div
style={{
height: `${rowVirtualizer.getTotalSize()}px`,
width: `${columnVirtualizer.getTotalSize()}px`,
position: 'relative',
}}
>
{rowVirtualizer.getVirtualItems().map((virtualRow) => (
<React.Fragment key={virtualRow.key}>
{columnVirtualizer.getVirtualItems().map((virtualColumn) => (
<div
key={virtualColumn.key}
style={{
position: 'absolute',
top: 0,
left: 0,
width: `${virtualColumn.size}px`,
height: `${virtualRow.size}px`,
transform: `translateX(${virtualColumn.start}px) translateY(${virtualRow.start}px)`,
}}
>
{/* Cell content */}
</div>
))}
</React.Fragment>
))}
</div>
</div>
)
}
Scroll to Index
const virtualizer = useVirtualizer({/* ... */})
// Scroll to specific item
virtualizer.scrollToIndex(50, { align: 'start' })
// Align options: 'start' | 'center' | 'end' | 'auto'
Window Scroller
For document-level scrolling:
import { useWindowVirtualizer } from '@tanstack/react-virtual'
function WindowList({ items }) {
const virtualizer = useWindowVirtualizer({
count: items.length,
estimateSize: () => 50,
overscan: 5,
})
return (
<div
style={{
height: `${virtualizer.getTotalSize()}px`,
position: 'relative',
}}
>
{virtualizer.getVirtualItems().map((virtualItem) => (
<div
key={virtualItem.key}
style={{
position: 'absolute',
top: 0,
left: 0,
width: '100%',
transform: `translateY(${virtualItem.start}px)`,
}}
>
{items[virtualItem.index].name}
</div>
))}
</div>
)
}
Performance Tips
- Use stable keys: Avoid array index as key
- Memoize items: If item rendering is expensive
- Adjust overscan: More overscan = smoother scroll, more DOM nodes
- Measure sparingly: Only use `measureElement` when needed
- Debounce scroll: For very heavy computations
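A sketch of the first two tips; the Row component and item shape are illustrative:
import { memo } from 'react'
// Row re-renders only when its item changes, not on every scroll
const Row = memo(function Row({ item }: { item: { id: string; name: string } }) {
  return <div>{item.name}</div>
})
// In the virtualizer loop, key by data id rather than array index:
// <Row key={items[virtualItem.index].id} item={items[virtualItem.index]} />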
vLLM Deployment
PagedAttention
vLLM's PagedAttention manages KV cache memory in non-contiguous blocks, enabling:
- Efficient memory: Only allocates what's needed per request
- Dynamic batching: Handles variable sequence lengths
- Up to 24x throughput: Compared to naive implementations
from vllm import LLM, SamplingParams
# PagedAttention is enabled by default
llm = LLM(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
gpu_memory_utilization=0.9, # Use 90% GPU memory for KV cache
max_num_seqs=256, # Max concurrent sequences
max_model_len=8192, # Max context length
)
Continuous Batching
Dynamic batching that doesn't wait for batch completion:
from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams
# Configure async engine for continuous batching
engine_args = AsyncEngineArgs(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
max_num_batched_tokens=8192, # Max tokens per batch
max_num_seqs=64, # Max concurrent sequences
enable_chunked_prefill=True, # Better latency for long prompts
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
# Requests are automatically batched
async def generate(prompt: str):
sampling_params = SamplingParams(max_tokens=512)
generator = engine.generate(prompt, sampling_params, request_id="req-1")
async for output in generator:
yield output.outputs[0].text
CUDA Graphs
Capture and replay CUDA operations for faster execution:
# Enable via CLI
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--enforce-eager false # Enable CUDA graphs (default)
# Disable for debugging
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--enforce-eager true # Disable CUDA graphs
# Python API
llm = LLM(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
enforce_eager=False, # Enable CUDA graphs
)
Note: CUDA graphs require fixed input shapes. vLLM handles this automatically with padding.
Tensor Parallelism
Scale across multiple GPUs:
# 4-GPU tensor parallelism
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
--tensor-parallel-size 4
# With pipeline parallelism (for very large models)
vllm serve meta-llama/Meta-Llama-3.3-405B-Instruct \
--tensor-parallel-size 4 \
--pipeline-parallel-size 2
llm = LLM(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
tensor_parallel_size=4,
distributed_executor_backend="ray", # For multi-node
)
GPU Requirements:
| Model Size | GPUs (FP16) | GPUs (INT8) | GPUs (AWQ/GPTQ) |
|---|---|---|---|
| 7B | 1 | 1 | 1 |
| 13B | 1 | 1 | 1 |
| 70B | 4 | 2 | 1-2 |
| 405B | 8+ | 4+ | 4+ |
Prefix Caching
Reuse KV cache for shared prompt prefixes:
llm = LLM(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
enable_prefix_caching=True, # Enable prefix caching
)
# Shared system prompt benefits from caching
system_prompt = "You are a helpful assistant. Be concise and accurate."
prompts = [
f"{system_prompt}\n\nUser: What is Python?",
f"{system_prompt}\n\nUser: Explain REST APIs.",
f"{system_prompt}\n\nUser: What is Docker?",
]
# First request computes system prompt KV cache
# Subsequent requests reuse cached prefix
outputs = llm.generate(prompts, SamplingParams(max_tokens=256))
Benefits:
- Reduced TTFT (time to first token) for shared prefixes
- Lower GPU memory for batch requests
- Ideal for: chat systems, RAG with fixed context
Production Server Configuration
# Production vLLM server
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 4 \
--max-model-len 8192 \
--max-num-seqs 128 \
--gpu-memory-utilization 0.9 \
--enable-prefix-caching \
--disable-log-requests \
--api-key $VLLM_API_KEY
# With quantization
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
--quantization awq \
--dtype half \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.85
OpenAI-compatible API:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="your-api-key",
)
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=256,
)
Monitoring and Metrics
vLLM exposes Prometheus metrics:
# Enable metrics
vllm serve ... --enable-metrics
# Metrics endpoint
curl http://localhost:8000/metrics
Key metrics:
- `vllm:num_requests_running`: Active requests
- `vllm:num_requests_waiting`: Queued requests
- `vllm:gpu_cache_usage_perc`: KV cache utilization
- `vllm:avg_prompt_throughput_toks_per_s`: Input throughput
- `vllm:avg_generation_throughput_toks_per_s`: Output throughput
Related Skills
- `observability-monitoring` - Production monitoring patterns
- `performance-testing` - Load testing inference endpoints
Checklists (5)
CWV Checklist
Core Web Vitals Optimization Checklist
Comprehensive checklist for achieving and maintaining good Core Web Vitals scores.
Thresholds Reference
| Metric | Good | Needs Improvement | Poor |
|---|---|---|---|
| LCP | ≤ 2.5s | ≤ 4.0s | > 4.0s |
| INP | ≤ 200ms | ≤ 500ms | > 500ms |
| CLS | ≤ 0.1 | ≤ 0.25 | > 0.25 |
2026 Stricter Thresholds (plan ahead!):
- LCP: ≤ 2.0s
- INP: ≤ 150ms
- CLS: ≤ 0.08
LCP (Largest Contentful Paint) ≤ 2.5s
Identify the LCP Element
- Run Lighthouse to identify LCP element
- Use Performance Observer to confirm in production
- LCP is typically: hero image, hero heading, or above-the-fold banner
// Debug: Find LCP element
new PerformanceObserver((list) => {
const entries = list.getEntries();
console.log('LCP element:', entries[entries.length - 1].element);
}).observe({ type: 'largest-contentful-paint', buffered: true });
Server Response Time (TTFB)
- Server response time (TTFB) < 800ms
- Use edge/CDN for static content
- Enable HTTP/2 or HTTP/3
- Compress responses (gzip/brotli)
- Database queries optimized
- Caching strategy implemented (Redis, CDN cache)
Critical Resource Loading
- LCP image has `fetchpriority="high"` attribute
- LCP image has `loading="eager"` (not lazy)
- LCP image preloaded in `<head>`
- Critical CSS inlined or preloaded
- Font preloaded with `crossorigin` attribute
- Preconnect to critical third-party origins
<!-- Preload critical resources -->
<link rel="preload" as="image" href="/hero.webp" fetchpriority="high" />
<link rel="preload" as="font" href="/font.woff2" type="font/woff2" crossorigin />
<link rel="preconnect" href="https://api.example.com" />
Image Optimization
- LCP image in modern format (WebP/AVIF)
- Image properly sized (not oversized)
- Responsive images with `srcset`
- Image CDN used (Cloudinary, imgix, Vercel)
Rendering Strategy
- LCP content rendered server-side (SSR/SSG)
- LCP content NOT loaded client-side via fetch
- No render-blocking JavaScript
- No render-blocking CSS below the fold
- Third-party scripts deferred
// ✅ GOOD: Server-rendered LCP content
export default async function Page() {
const hero = await getHeroData();
return <Hero data={hero} />;
}
// ❌ BAD: Client-loaded LCP content
function Page() {
const [hero, setHero] = useState(null);
useEffect(() => { fetchHero().then(setHero); }, []); // Delays LCP!
}
INP (Interaction to Next Paint) ≤ 200ms
Identify Long Tasks
- Chrome DevTools Performance tab analyzed
- Long tasks (>50ms) identified
- Main thread blockers removed/optimized
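To confirm long tasks outside of a DevTools recording, a small observer sketch (Long Tasks API, Chromium-based browsers):
// Debug: Log tasks blocking the main thread for more than 50ms
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    console.log(`Long task: ${entry.duration.toFixed(0)}ms`, entry);
  }
}).observe({ type: 'longtask', buffered: true });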
JavaScript Optimization
- Heavy computation moved to Web Workers
- Large arrays processed in chunks with yielding
- `requestIdleCallback` used for non-critical work
- Bundle size minimized (code splitting)
- Tree shaking enabled
// ✅ GOOD: Yield to main thread
async function processItems(items: Item[]) {
for (const item of items) {
processItem(item);
// Yield so the browser can paint between items
if (typeof scheduler !== 'undefined' && 'yield' in scheduler) {
await scheduler.yield();
} else {
await new Promise((r) => setTimeout(r, 0));
}
}
}
React Optimization
- `useTransition` for non-urgent updates
- `useDeferredValue` for expensive derivations
- Memoization where appropriate (`useMemo`, `memo`)
- Virtualization for long lists (`react-window`, `@tanstack/virtual`)
- Suspense boundaries for code splitting
// ✅ GOOD: Non-blocking state updates
const [isPending, startTransition] = useTransition();
function handleSearch(query: string) {
setQuery(query); // Urgent: update input
startTransition(() => {
setFilteredResults(filter(query)); // Non-urgent: defer
});
}
Event Handler Optimization
- No heavy computation in event handlers
- Handlers don't cause layout thrashing
- Passive event listeners for scroll/touch
- Debounced input handlers where appropriate
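A sketch for the last two items (onScroll and runSearch are placeholder handlers):
// Passive listener: promises not to call preventDefault(), so scrolling never blocks on it
window.addEventListener('scroll', onScroll, { passive: true });
// Debounce: run expensive work only after input pauses
function debounce<T extends (...args: never[]) => void>(fn: T, ms: number) {
  let timer: ReturnType<typeof setTimeout>;
  return (...args: Parameters<T>) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), ms);
  };
}
const onSearchInput = debounce((value: string) => runSearch(value), 200);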
// ✅ GOOD: Defer heavy work
onClick={() => {
setLoading(true);
startTransition(() => {
const result = heavyComputation();
setResult(result);
setLoading(false);
});
}}
// ❌ BAD: Blocking handler
onClick={() => {
const result = heavyComputation(); // Blocks paint!
setResult(result);
}
Animation Performance
- Animations use `transform` and `opacity` only
- No animations on layout properties (width, height, top, left)
- `will-change` used sparingly
- Animations run at 60fps (checked in DevTools)
CLS (Cumulative Layout Shift) ≤ 0.1
Image Dimensions
- ALL images have explicit `width` and `height`
- Responsive images use `aspect-ratio` container
- `fill` prop images have sized container
- No images cause layout shift on load
// ✅ GOOD: Explicit dimensions
<img src="/photo.jpg" width={800} height={600} alt="Photo" />
// ✅ GOOD: Aspect ratio container
<div className="aspect-[16/9]">
<Image src="/photo.jpg" fill alt="Photo" />
</div>
Dynamic Content
- Space reserved for dynamic content (ads, embeds)
- Skeleton loaders match final content size
- No content inserted above existing content
- Lazy-loaded content has reserved space
// ✅ GOOD: Reserved space
<div className="min-h-[250px]">
{ad ? <Ad data={ad} /> : <Skeleton height={250} />}
</div>
Font Loading
- `font-display: optional` or `swap` used
- Fallback font has `size-adjust` to match
- Critical font preloaded
- System font stack as fallback
/* Fallback with size adjustment */
@font-face {
font-family: 'Inter Fallback';
src: local('Arial');
size-adjust: 107%;
ascent-override: 90%;
}
body {
font-family: 'Inter', 'Inter Fallback', sans-serif;
}
Animation Stability
- Animations use `transform`, not layout properties
- Expanding/collapsing uses `scaleY`, not `height`
- Modals/overlays don't shift page content
- Toast notifications positioned fixed/absolute
/* ✅ GOOD: Transform-based animation */
.drawer {
transform: translateX(-100%);
transition: transform 0.3s;
}
.drawer.open {
transform: translateX(0);
}
/* ❌ BAD: Layout-shifting animation */
.drawer {
width: 0;
transition: width 0.3s;
}
Iframes and Embeds
- Iframes have explicit dimensions
- Third-party embeds wrapped with sized container
- Lazy iframes have placeholder
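A sketch for the items above; the embed URL, dimensions, and ThirdPartyEmbed component are illustrative:
{/* Explicit dimensions reserve space before the iframe loads */}
<iframe
  src="https://www.youtube.com/embed/VIDEO_ID"
  width={560}
  height={315}
  loading="lazy"
  title="Product demo"
/>
{/* Or wrap third-party embeds in a sized container */}
<div className="aspect-video">
  <ThirdPartyEmbed />
</div>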
Measurement & Monitoring
Lab Testing
- Lighthouse CI in build pipeline
- Performance budgets enforced
- Regular manual Lighthouse audits
- Testing on throttled CPU/network
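A minimal lighthouserc.js sketch for the Lighthouse CI items (URLs and thresholds are examples to adapt):
// lighthouserc.js
module.exports = {
  ci: {
    collect: {
      url: ['http://localhost:3000/'],
      numberOfRuns: 3,
    },
    assert: {
      assertions: {
        // Fail the build when budgets are exceeded
        'categories:performance': ['error', { minScore: 0.9 }],
        'largest-contentful-paint': ['error', { maxNumericValue: 2500 }],
        'cumulative-layout-shift': ['error', { maxNumericValue: 0.1 }],
      },
    },
  },
};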
Field Data (RUM)
- `web-vitals` library installed
- Metrics sent to analytics endpoint
- p75 percentile tracked (Google's standard)
- Alerts configured for regressions
// Essential RUM setup
import { onLCP, onINP, onCLS } from 'web-vitals';
onLCP(sendToAnalytics);
onINP(sendToAnalytics);
onCLS(sendToAnalytics);
Data Analysis
- Dashboard showing daily/weekly trends
- Segmentation by page, device, connection
- Comparison of lab vs field data
- Week-over-week regression detection
Alerting
- Alert when p75 exceeds threshold
- Alert when good rate drops below 75%
- Alert on significant week-over-week regression
- Escalation path defined
Build & Deploy
Performance Budgets
- Bundle size limits configured
- Build fails on budget exceeded
- Per-route budgets for large apps
// webpack.config.js
module.exports = {
performance: {
maxAssetSize: 150000, // 150KB
maxEntrypointSize: 250000, // 250KB
hints: 'error', // Fail build
},
};
CI/CD Integration
- Lighthouse CI runs on PRs
- Performance regression blocks merge
- Bundle analyzer report generated
- Preview deployments for testing
CDN & Caching
- Static assets on CDN
- Immutable caching for hashed assets
- Stale-while-revalidate for HTML
- Edge caching where appropriate
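One way to express these caching rules, sketched as Next.js custom headers (CDNs can set the same Cache-Control values directly; paths and durations are typical starting points):
// next.config.js
module.exports = {
  async headers() {
    return [
      {
        // Hashed build assets never change: cache for a year, immutable
        source: '/_next/static/:path*',
        headers: [
          { key: 'Cache-Control', value: 'public, max-age=31536000, immutable' },
        ],
      },
      {
        // HTML: serve from edge cache, revalidate in the background
        source: '/:path*',
        headers: [
          { key: 'Cache-Control', value: 'public, s-maxage=60, stale-while-revalidate=300' },
        ],
      },
    ];
  },
};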
Debugging Checklist
Slow LCP
- Check TTFB (server response time)
- Verify LCP element has `fetchpriority="high"`
- Confirm LCP content is server-rendered
- Check for render-blocking resources
- Verify image is optimized and properly sized
High INP
- Run Performance recording during interaction
- Look for long tasks in flame chart
- Check for forced synchronous layouts
- Verify heavy work is deferred
- Check for excessive re-renders
High CLS
- Run Lighthouse with "Layout Shift Regions" enabled
- Check images for missing dimensions
- Look for late-loading content
- Verify fonts have fallbacks
- Check for content inserted above viewport
Testing Protocol
Before Deployment
- Lighthouse score ≥ 90 on Performance
- All Core Web Vitals in "good" range
- No performance budget violations
- Tested on throttled 4G + slow CPU
After Deployment
- Monitor RUM for 24-48 hours
- Compare p75 to pre-deployment baseline
- Check for unexpected regressions
- Verify alerting is working
Weekly Review
- Review p75 trends
- Identify worst-performing pages
- Check for new issues in CrUX
- Plan optimizations for next sprint
Image Checklist
Image Optimization Checklist
Comprehensive checklist for production-ready image optimization.
Format Selection
Photo Content
- Use AVIF as primary format (30-50% smaller than JPEG)
- Configure WebP as fallback for older browsers
- JPEG only for browsers without AVIF/WebP support
- Configure Next.js: `formats: ['image/avif', 'image/webp']`
Graphics & Icons
- SVG for logos, icons, and simple graphics
- PNG only when transparency is required
- Consider SVG sprites for icon sets (reduces requests)
- Inline small SVGs (< 1KB) to avoid network requests
Format Decision Tree
Is it a photo/complex image?
├── Yes → Use AVIF/WebP (Next.js Image handles this)
└── No → Is transparency needed?
├── Yes → PNG or SVG
└── No → Is it an icon/logo?
├── Yes → SVG (scalable, tiny file size)
└── No → AVIF/WebP
Dimensions & Sizing
Always Set Dimensions
- Every `<Image>` has `width` and `height` OR uses `fill`
- Fill mode images have sized container (relative + dimensions)
- Dimensions match actual display size (not larger)
- No CLS from images (Layout Shift score = 0)
// ✅ GOOD: Explicit dimensions
<Image src="/photo.jpg" width={800} height={600} />
// ✅ GOOD: Fill with sized container
<div className="relative h-[400px]">
<Image src="/photo.jpg" fill />
</div>
// ❌ BAD: Missing dimensions
<Image src="/photo.jpg" />
Responsive Images
- `sizes` prop set for all responsive images
- Sizes match actual layout breakpoints
- Don't serve images larger than needed
- Test with DevTools Network tab (check actual sizes served)
// ✅ GOOD: Accurate sizes prop
<Image
src="/photo.jpg"
fill
sizes="(max-width: 640px) 100vw, (max-width: 1024px) 50vw, 33vw"
/>
// Common sizes patterns:
// Full width hero: sizes="100vw"
// Half width on desktop: sizes="(max-width: 768px) 100vw, 50vw"
// Grid of 4: sizes="(max-width: 640px) 50vw, 25vw"
Loading Strategy
LCP Images (Above the Fold)
- Hero/banner image has `priority` prop
- ONLY one image per page has `priority` (usually LCP element)
- LCP image preloaded in `<head>` if not using Next.js Image
- No lazy loading on LCP images
// ✅ GOOD: Priority on LCP image
<Image src="/hero.jpg" priority fill sizes="100vw" />
// ❌ BAD: Priority on all images
{images.map(img => <Image src={img} priority />)} // Wrong!
Below-the-Fold Images
- Default lazy loading (Next.js Image default)
- No `priority` prop on non-LCP images
- Consider `loading="lazy"` for native `<img>` elements
- Use Intersection Observer for custom lazy loading
Preloading
- Critical hero image preloaded
- Don't preload below-fold images
- Use `fetchpriority="high"` for critical images
<link rel="preload" as="image" href="/hero.webp" fetchpriority="high" />
Placeholders
Blur Placeholders
- Static imports use `placeholder="blur"` (automatic)
- Remote images have `blurDataURL` generated
- Placeholder improves perceived performance
- Consider plaiceholder library for build-time generation
// ✅ Static import with automatic blur
import heroImage from '@/public/hero.jpg';
<Image src={heroImage} placeholder="blur" />
// ✅ Remote image with blur
<Image
src="https://cdn.example.com/photo.jpg"
placeholder="blur"
blurDataURL="data:image/jpeg;base64,..."
/>
Color Placeholders
- Consider dominant color placeholder for cards
- Skeleton placeholders for loading states
- Smooth transition from placeholder to image
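A sketch of a dominant-color placeholder behind a card image (the color value stands in for one extracted at build time; product is an assumed prop):
// Sized container painted with the image's dominant color
<div className="relative aspect-[4/3]" style={{ backgroundColor: '#8a7f72' }}>
  <Image src={product.image} alt={product.name} fill className="object-cover" />
</div>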
Quality Settings
Compression
- Quality set to 75-85 (not 100)
- Test quality visually - often 75 is indistinguishable
- Higher quality (85-90) only for hero/product images
- Lower quality (60-70) acceptable for thumbnails
// ✅ GOOD: Appropriate quality
<Image src="/hero.jpg" quality={85} /> // Important hero
<Image src="/thumbnail.jpg" quality={70} /> // Small thumbnail
// ❌ BAD: Unnecessary quality
<Image src="/photo.jpg" quality={100} /> // Huge file, no benefit
AVIF-Specific
- AVIF quality can be 10-15 points lower than JPEG
- Test AVIF vs WebP on your content type
- Some images compress better with WebP
CDN & Infrastructure
Next.js Configuration
- `remotePatterns` configured for all external domains
- `deviceSizes` matches your breakpoints
- `formats` includes AVIF and WebP
- `minimumCacheTTL` set appropriately (30+ days for static)
// next.config.js
images: {
formats: ['image/avif', 'image/webp'],
remotePatterns: [
{ hostname: 'cdn.example.com' },
{ hostname: '*.cloudinary.com' },
],
deviceSizes: [640, 750, 828, 1080, 1200, 1920],
minimumCacheTTL: 60 * 60 * 24 * 30, // 30 days
}
CDN Setup
- Images served from CDN (not origin server)
- Edge caching enabled
- Cache headers set correctly (1 year for hashed assets)
- `Vary: Accept` header for format negotiation
Self-Hosted
- Sharp installed: `npm install sharp`
- Docker image includes Sharp dependencies
- Adequate disk space for image cache
- Memory limits account for Sharp processing
Accessibility
Alt Text
- ALL images have `alt` attribute
- Meaningful alt for informative images
- Empty `alt=""` for decorative images
- Alt text describes content, not appearance
- No "image of" or "picture of" prefix
// ✅ GOOD: Meaningful alt
<Image src="/product.jpg" alt="Red Nike Air Max 90 running shoe, side view" />
// ✅ GOOD: Decorative image
<Image src="/decorative-pattern.svg" alt="" />
// ❌ BAD: Generic alt
<Image src="/product.jpg" alt="Image" />
// ❌ BAD: Missing alt
<Image src="/product.jpg" />
Additional A11y
- No text in images (use real text)
- Sufficient color contrast for overlaid text
- Images don't convey information unavailable in text
- Decorative images marked with `role="presentation"`
Performance Monitoring
Metrics to Track
- LCP (Largest Contentful Paint) < 2.5s
- CLS (Cumulative Layout Shift) = 0 for images
- Image load times in RUM data
- Total image bytes transferred
Debugging
- Check DevTools Network tab for actual sizes
- Verify format negotiation (AVIF/WebP served)
- Test on slow connections (DevTools throttling)
- Run Lighthouse for image recommendations
Error Handling
Fallbacks
- Fallback image configured for load errors
- Graceful degradation for broken images
- Error boundaries for image-heavy components
const [error, setError] = useState(false);
<Image
src={error ? '/fallback.jpg' : product.image}
onError={() => setError(true)}
/>
Monitoring
- Image errors logged to monitoring service
- Alerts for high error rates
- 404s for images tracked
Build Pipeline
Optimization
- Images optimized at build time (where possible)
- Source images stored at high resolution
- Build includes image processing (Sharp, Squoosh)
- CI validates image configurations
Version Control
- Large images in Git LFS (not regular Git)
- Or: Images stored externally (CMS, CDN)
- Build pulls images from source
Security
Content Security
- Only allow trusted image domains
- SVG sanitization if user-uploaded
- `dangerouslyAllowSVG: false` in production
- Rate limiting on image optimization endpoints
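The corresponding Next.js options, as a sketch (dangerouslyAllowSVG defaults to false; the contentSecurityPolicy value shown is the commonly recommended restrictive policy):
// next.config.js
module.exports = {
  images: {
    dangerouslyAllowSVG: false, // keep disabled for user-uploaded content
    contentSecurityPolicy: "default-src 'self'; script-src 'none'; sandbox;",
  },
};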
Privacy
- Strip EXIF metadata from user uploads
- No personally identifiable information in image URLs
- Consider image hashing for user content
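A sketch of EXIF stripping with Sharp: re-encoding drops metadata by default, and a bare .rotate() bakes in EXIF orientation before it is removed (sanitizeUpload is an illustrative name):
import sharp from 'sharp';

// Re-encode the upload without EXIF/GPS metadata
export async function sanitizeUpload(input: Buffer): Promise<Buffer> {
  return sharp(input)
    .rotate() // apply EXIF orientation so stripping it doesn't rotate the image
    .jpeg({ quality: 80 })
    .toBuffer();
}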
Inference Optimization
Inference Optimization Checklist
Performance validation for LLM inference.
vLLM Configuration
- Tensor parallelism configured for GPU count
- Max model length set appropriately
- GPU memory utilization optimized (0.85-0.95)
- Prefix caching enabled for shared contexts
- Continuous batching active
Quantization
- Quantization method selected:
- FP16: Maximum quality, baseline
- INT8/FP8: Balance quality/efficiency
- AWQ: Best 4-bit quality
- GPTQ: Faster quantization
- Calibration data used (for GPTQ)
- Quality validated post-quantization
Speculative Decoding
- Method selected:
- N-gram: No extra model, lower overhead
- Draft model: Higher quality speculation
- Speculative tokens tuned (3-5 typical)
- Throughput improvement validated
Hardware Utilization
- GPU memory fully utilized
- Multi-GPU scaling verified
- NVLink/PCIe bandwidth sufficient
- CPU not bottlenecking
Batching Strategy
- Continuous batching enabled
- Max batch size configured
- Request prioritization (if needed)
- Queue management configured
Caching
- KV cache optimized (PagedAttention)
- Prefix caching for shared prompts
- Response caching (semantic if applicable)
- Cache invalidation strategy
Benchmarking
- Baseline latency measured
- Throughput (tokens/sec) benchmarked
- Time to first token (TTFT) measured
- Latency under load tested
- Memory usage profiled
Production Readiness
- Warmup requests sent before traffic
- Health checks configured
- Graceful shutdown handling
- Request timeout configured
- Error recovery tested
Monitoring
- Latency metrics (p50, p95, p99)
- Throughput tracking
- GPU utilization monitoring
- Memory usage tracking
- Error rate alerting
Cost Optimization
- Instance size appropriate
- Spot instances (if applicable)
- Auto-scaling configured
- Usage patterns analyzed
- Cost per request tracked
Performance Audit Checklist
Comprehensive guide for identifying and fixing performance bottlenecks, based on OrchestKit's real optimization process.
Prerequisites
- Access to production metrics (Prometheus, Grafana)
- Profiling tools installed (py-spy, Chrome DevTools)
- Baseline performance metrics captured
- Test environment with production-like data
Phase 1: Establish Baselines
Backend Metrics
Capture current performance:
# Database query performance
psql -c "SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 20;"
# API latency
curl 'http://localhost:9090/api/v1/query?query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))'
# Cache hit rate
curl 'http://localhost:9090/api/v1/query?query=sum(rate(cache_operations_total{result="hit"}[5m])) / sum(rate(cache_operations_total[5m]))'
- Record p50/p95/p99 latency for all endpoints
- Document slow queries (>100ms)
- Measure cache hit rates
- Capture database connection pool usage
- Record LLM token usage and costs
Frontend Metrics
Run Lighthouse audit:
# Lighthouse CLI
lighthouse http://localhost:3000 \
--output json \
--output-path lighthouse-report.json
# Or use Chrome DevTools → Lighthouse tab
- Record Core Web Vitals (LCP, INP, CLS, TTFB)
- Measure bundle size (JS, CSS)
- Check for render-blocking resources
- Analyze long tasks (>50ms)
- Measure First Contentful Paint (FCP)
Baseline Targets
| Metric | Good | Needs Work | Current |
|---|---|---|---|
| p95 API latency | <500ms | <1s | ___ms |
| p95 DB query | <100ms | <500ms | ___ms |
| Cache hit rate | >70% | >50% | __% |
| LCP | <2.5s | <4s | ___s |
| INP | <200ms | <500ms | ___ms |
| CLS | <0.1 | <0.25 | ___ |
| Bundle size | <300KB | <500KB | ___KB |
Phase 2: Identify Bottlenecks
Backend Profiling
1. Find Slow Endpoints
# Top 10 slowest endpoints (p95 latency)
topk(10,
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
) by (endpoint)
)
- List endpoints with p95 > 500ms
- Prioritize by traffic volume (high traffic = high impact)
- Document expected vs actual latency
2. Identify Slow Database Queries
-- Top 10 slowest queries
SELECT
LEFT(query, 80) as query_preview,
calls,
ROUND(mean_exec_time::numeric, 2) as avg_ms,
ROUND(total_exec_time::numeric, 2) as total_ms,
ROUND(100.0 * shared_blks_hit / NULLIF(shared_blks_hit + shared_blks_read, 0), 2) as cache_hit_ratio
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
- Run EXPLAIN ANALYZE on slow queries
- Check for sequential scans (should use indexes)
- Look for low cache hit ratios (<90%)
- Identify N+1 query patterns
3. Python Profiling with py-spy
# Profile running FastAPI server
py-spy record --pid $(pgrep -f uvicorn) \
--output profile.svg \
--duration 60
# Top functions by time
py-spy top --pid $(pgrep -f uvicorn)
- Generate flame graph
- Identify hot paths (wide bars = time spent)
- Look for unexpected CPU usage
- Check for blocking I/O in async code
4. LLM Cost Analysis
-- Cost breakdown by model (Langfuse)
SELECT
model,
COUNT(*) as calls,
SUM(input_tokens) as total_input,
SUM(output_tokens) as total_output,
SUM(calculated_total_cost) as total_cost
FROM langfuse.traces
WHERE timestamp > NOW() - INTERVAL '7 days'
GROUP BY model
ORDER BY total_cost DESC;
- Identify most expensive models
- Calculate cache hit rate potential
- Find repetitive queries (caching candidates)
- Measure prompt token waste
Frontend Profiling
1. Chrome DevTools Performance Tab
- Record 6s of user interaction
- Identify long tasks (yellow bars >50ms)
- Check for dropped frames (should be 60fps)
- Measure main thread blocking time
2. React DevTools Profiler
// Add Profiler to key components
import { Profiler } from 'react';
function onRenderCallback(
id, phase, actualDuration, baseDuration
) {
if (actualDuration > 16) {
console.warn(`Slow render: ${id} took ${actualDuration}ms`);
}
}
<Profiler id="AnalysisCard" onRender={onRenderCallback}>
<AnalysisCard />
</Profiler>
- Find components with >16ms render time
- Identify unnecessary re-renders
- Check for missing memoization
3. Bundle Analysis
# Vite
npm run build
npx vite-bundle-visualizer
# Next.js
ANALYZE=true npm run build
- Identify largest chunks
- Find duplicate dependencies
- Check for tree-shaking failures
- Measure code splitting effectiveness
Phase 3: Database Optimization
Add Missing Indexes
1. Identify Missing Indexes
-- Find sequential scans that should use indexes
SELECT
schemaname,
tablename,
seq_scan,
idx_scan,
seq_scan - idx_scan as too_much_seq
FROM pg_stat_user_tables
WHERE seq_scan - idx_scan > 0
ORDER BY too_much_seq DESC
LIMIT 10;
- Run EXPLAIN ANALYZE on slow queries
- Look for "Seq Scan" in query plans
- Identify columns in WHERE/JOIN clauses
- Create indexes for high-cardinality columns
2. Create Indexes
-- B-tree for exact matches and ranges
CREATE INDEX idx_analysis_status ON analyses(status);
CREATE INDEX idx_analysis_created ON analyses(created_at DESC);
-- GIN for full-text search
CREATE INDEX idx_chunk_tsvector ON chunks USING GIN(content_tsvector);
-- HNSW for vector similarity (pgvector)
CREATE INDEX idx_chunk_embedding ON chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Composite index for common filter combinations
CREATE INDEX idx_chunk_analysis_created ON chunks(analysis_id, created_at DESC);
- Create indexes for WHERE clause columns
- Use composite indexes for multi-column filters
- Add indexes for JOIN columns
- Use CONCURRENTLY for production
- Verify indexes are used (EXPLAIN ANALYZE)
Index Selection Guide:
| Query Pattern | Index Type | Example |
|---|---|---|
| Exact match | B-tree | WHERE status = 'completed' |
| Range query | B-tree | WHERE created_at > '2025-01-01' |
| Full-text search | GIN | WHERE content_tsvector @@ query |
| Vector similarity | HNSW | ORDER BY embedding <=> query_vec |
| JSONB queries | GIN | WHERE metadata @> '{"key": "value"}' |
Fix N+1 Queries
1. Detect N+1 Patterns
# ❌ BAD: N+1 query (1 query + N queries in loop)
analyses = await session.execute(select(Analysis).limit(10))
for analysis in analyses.scalars():
# Each iteration = 1 query!
chunks = await session.execute(
select(Chunk).where(Chunk.analysis_id == analysis.id)
)
- Review logs for rapid sequential queries
- Check for queries inside loops
- Use query count logging in tests
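One way to make the query-count item concrete: count statements with a SQLAlchemy event hook and assert on the count in tests. A minimal sketch, assuming an async engine; `engine` and the helper being tested are assumptions:
from sqlalchemy import event

class QueryCounter:
    """Context manager that counts SQL statements executed by an engine."""
    def __init__(self, engine):
        self.count = 0
        self._engine = engine.sync_engine  # async engines wrap a sync engine

    def _on_execute(self, conn, cursor, statement, parameters, context, executemany):
        self.count += 1

    def __enter__(self):
        event.listen(self._engine, "before_cursor_execute", self._on_execute)
        return self

    def __exit__(self, *exc):
        event.remove(self._engine, "before_cursor_execute", self._on_execute)

# Usage in a test (hypothetical helper): fail if 10 analyses need more than 2 queries.
# with QueryCounter(engine) as qc:
#     await list_analyses_with_chunks(limit=10)
# assert qc.count <= 2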
2. Fix with Eager Loading
# ✅ GOOD: Single query with JOIN
from sqlalchemy.orm import selectinload
result = await session.execute(
    select(Analysis)
    .options(selectinload(Analysis.chunks))  # Eager load
    .limit(10)
)
analyses = result.scalars().all()
# Now analyses[0].chunks is preloaded (no extra query)
- Replace lazy loading with eager loading
- Use selectinload() for one-to-many
- Use joinedload() for one-to-one
- Verify query count reduced (N+1 → 1-2 queries)
Optimize Connection Pooling
1. Check Current Pool Usage
# Connection pool saturation
db_connections_active / db_connections_max
- Measure active vs max connections
- Check for pool exhaustion (ratio >0.8)
- Monitor connection wait times
2. Configure Pool
# backend/app/core/config.py
from sqlalchemy import create_engine
engine = create_engine(
database_url,
pool_size=5, # Connections to maintain
max_overflow=10, # Extra connections allowed
pool_recycle=3600, # Recycle after 1 hour
pool_pre_ping=True # Validate before checkout
)
- Set pool_size based on traffic (5-20 typical)
- Allow overflow for spikes
- Enable pool_pre_ping for stale detection
- Set pool_recycle to avoid timeouts
Phase 4: Caching Strategy
Identify Caching Opportunities
1. Find Repetitive Queries
-- Most frequently called queries
SELECT
LEFT(query, 80),
calls,
ROUND(mean_exec_time::numeric, 2) as avg_ms
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 20;
- Identify high-frequency queries
- Check if data changes frequently
- Calculate potential savings (calls × avg_time)
2. Find Repetitive LLM Calls
-- Similar prompts (Langfuse)
SELECT
LEFT(input::text, 100) as prompt_preview,
COUNT(*) as occurrences,
SUM(calculated_total_cost) as total_cost
FROM langfuse.generations
GROUP BY LEFT(input::text, 100)
HAVING COUNT(*) > 5
ORDER BY total_cost DESC;
- Identify repetitive prompts
- Calculate cost savings potential
- Determine appropriate cache TTL
Implement Multi-Level Cache
L1: In-Memory Cache (Application)
from functools import lru_cache
@lru_cache(maxsize=128)
def get_agent_system_prompt(agent_type: str) -> str:
"""Cache agent prompts in memory."""
return load_prompt_from_file(f"prompts/{agent_type}.txt")
- Cache static data (prompts, configs)
- Use LRU cache for bounded memory
- Set appropriate maxsize (128-1024)
L2: Redis Cache (Distributed)
async def get_analysis(analysis_id: str) -> Analysis:
"""Cache analysis results in Redis."""
# Try cache first
cached = await redis.get(f"analysis:{analysis_id}")
if cached:
return Analysis.parse_raw(cached)
# Cache miss - fetch from DB
analysis = await db.get_analysis(analysis_id)
# Store in cache (5 min TTL)
await redis.setex(
f"analysis:{analysis_id}",
300,
analysis.json()
)
return analysis
- Cache query results
- Set appropriate TTL (seconds to hours)
- Invalidate on writes
- Track cache hit rate
L3: Semantic Cache (Vector Search)
async def get_llm_response(query: str) -> str:
"""Check semantic cache before calling LLM."""
# Generate query embedding
embedding = await embed_text(query)
# Search for similar cached queries
cached = await semantic_cache.search(embedding, threshold=0.92)
if cached:
return cached.response
# Call LLM
response = await llm.complete(query)
# Store in cache
await semantic_cache.store(embedding, response)
return response
- Cache LLM responses by semantic similarity
- Set similarity threshold (0.90-0.95)
- Measure cost savings
- Monitor false positive rate
Cache Invalidation
Write-Through Pattern:
async def update_analysis(analysis: Analysis):
"""Update DB and cache atomically."""
# 1. Write to DB
await db.update(analysis)
# 2. Update cache
await redis.setex(
f"analysis:{analysis.id}",
300,
analysis.json()
)
- Invalidate cache on writes
- Use TTL for time-sensitive data
- Add cache versioning for schema changes
Phase 5: Frontend Optimization
Code Splitting
1. Route-Based Splitting
// Before: All routes in one bundle
import AnalysisPage from './pages/AnalysisPage';
import DashboardPage from './pages/DashboardPage';
// After: Lazy load routes
const AnalysisPage = lazy(() => import('./pages/AnalysisPage'));
const DashboardPage = lazy(() => import('./pages/DashboardPage'));
<Suspense fallback={<Loading />}>
<Routes>
<Route path="/analysis" element={<AnalysisPage />} />
<Route path="/dashboard" element={<DashboardPage />} />
</Routes>
</Suspense>
- Lazy load routes
- Add loading states
- Measure bundle size reduction
2. Component-Level Splitting
// Lazy load heavy components
const ChartComponent = lazy(() => import('./ChartComponent'));
{showChart && (
<Suspense fallback={<Skeleton />}>
<ChartComponent data={data} />
</Suspense>
)}
- Split large dependencies (charts, editors)
- Use dynamic imports for modals
- Prefetch on user intent (hover, focus)
Memoization
React.memo for Components:
// Prevent re-renders when props unchanged
const AnalysisCard = memo(({ analysis }: Props) => {
return <div>{analysis.title}</div>;
});
- Wrap expensive components with memo()
- Verify props don't change unnecessarily
- Use React DevTools Profiler to confirm
useMemo for Expensive Calculations:
const expensiveValue = useMemo(() => {
return processLargeDataset(data);
}, [data]); // Only recompute if data changes
- Memoize expensive calculations
- Memoize filtered/sorted arrays
- Don't over-memoize (profiling first!)
useCallback for Event Handlers:
const handleClick = useCallback(() => {
doSomething(id);
}, [id]); // Only recreate if id changes
<ChildComponent onClick={handleClick} />
- Wrap callbacks passed to memoized children
- Avoid inline functions in props
- Include all dependencies
Image Optimization
// Use next/image or similar for optimization
<Image
src="/photo.jpg"
alt="Description"
width={800}
height={600}
loading="lazy" // Lazy load images
placeholder="blur" // Show blur while loading
/>
- Use WebP/AVIF formats
- Lazy load images below the fold
- Set explicit width/height (prevent CLS)
- Use responsive images (srcset)
Phase 6: Measure Impact
Re-Run Benchmarks
Backend:
# Query performance
psql -c "SELECT query, mean_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;"
# API latency
curl 'http://localhost:9090/api/v1/query?query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))'
Frontend:
lighthouse http://localhost:3000 --output json
- Compare p95 latency (before vs after)
- Verify query performance improved
- Check cache hit rates increased
- Measure Core Web Vitals improvement
Calculate Savings
Cost Savings:
# LLM cost reduction
baseline_cost = 35000 # Annual
cache_hit_rate = 0.90
savings = baseline_cost * cache_hit_rate * 0.90 # 90% discount on cache hits
final_cost = baseline_cost - savings
Performance Gains:
# Query speedup
before_latency = 85 # ms
after_latency = 5 # ms
speedup = before_latency / after_latency # 17x
- Document cost savings
- Calculate ROI (savings vs implementation time)
- Measure user experience improvement
Create Performance Budget
Set ongoing targets:
- p95 API latency < 500ms
- p95 DB query < 100ms
- Cache hit rate > 70%
- LCP < 2.5s
- Bundle size < 300KB
Monitor continuously:
- Add Lighthouse CI to pipeline
- Alert on budget violations
- Review metrics weekly
Phase 7: Ongoing Optimization
Weekly Reviews
- Review top 10 slowest endpoints
- Check for new slow queries
- Monitor cache hit rates
- Review LLM cost trends
- Check Core Web Vitals in RUM
Monthly Audits
- Run full Lighthouse audit
- Profile with py-spy/Chrome DevTools
- Review database index usage
- Check for unused dependencies
- Update performance budget
Continuous Monitoring
- Set up alerts for degradation
- Track performance in CI/CD
- Monitor real user metrics (RUM)
- A/B test optimizations
References
- Example: ../examples/orchestkit-performance-wins.md
- Template: ../scripts/caching-patterns.ts
- Template: ../scripts/database-optimization.ts
- Lighthouse Documentation
- PostgreSQL EXPLAIN
Render Audit
React Performance Audit Checklist
Pre-deployment performance verification.
React Compiler Check
- React Compiler enabled in build config
- Components show "Memo ✨" badge in DevTools
- Code follows Rules of React:
- Components are idempotent
- Props/state treated as immutable
- Side effects in useEffect only
- Hooks at top level
Render Performance
- No unnecessary re-renders (verified with Profiler)
- State colocated close to usage
- Context split to prevent cascading updates
- Expensive computations have escape hatch memoization
- Lists > 100 items are virtualized
Large Lists / Data
- TanStack Virtual for lists > 100 items
- Pagination or infinite scroll for API data
- Table virtualization for grids > 50 rows
- Images lazy loaded below fold
Code Splitting
- Route-based code splitting (lazy routes)
- Heavy components lazy loaded
- Dynamic imports for large libraries
- Bundle analyzer run, no unexpected large chunks
Network Performance
- API calls deduplicated (React Query, SWR)
- Data prefetched on hover/intent
- Optimistic updates for mutations
- Appropriate cache headers set
Images & Media
- Images optimized (WebP, AVIF)
- Responsive images with srcset
- Lazy loading for below-fold images
- Placeholder/skeleton during load
Third-Party Scripts
- Analytics loaded async/deferred
- Third-party widgets lazy loaded
- Font loading optimized (preload critical)
- No render-blocking resources
Profiling Verification
Before Optimization
- Record baseline interaction times
- Document slowest components
- Note current bundle size
After Optimization
- Re-profile all interactions
- Verify improvements in numbers
- Check bundle size delta
Key Metrics to Track
| Metric | Target | Current |
|---|---|---|
| LCP (Largest Contentful Paint) | < 2.5s | ___ |
| INP (Interaction to Next Paint) | < 200ms | ___ |
| CLS (Cumulative Layout Shift) | < 0.1 | ___ |
| Time to Interactive | < 3s | ___ |
| Main thread blocking | < 200ms | ___ |
Quick Profiler Commands
# React DevTools Profiler
# 1. Open DevTools → Profiler tab
# 2. Click Record
# 3. Perform interaction
# 4. Click Stop
# 5. Analyze flamegraph
# Lighthouse
npx lighthouse http://localhost:3000 --view
# Bundle Analyzer (Next.js)
ANALYZE=true npm run build
# Bundle Analyzer (Vite)
npx vite-bundle-visualizer
Common Issues Checklist
- No anonymous functions as props in hot paths
- No object/array literals as props in hot paths
- Context providers near consumers
- useEffect dependencies correct
- No state updates in render
Sign-Off
- All critical interactions < 100ms
- No visible jank during scroll
- Page load acceptable on 3G
- Bundle size within budget
- Performance regression tests in CI
Examples (3)
Cwv Examples
Core Web Vitals Examples
Real-world optimization examples for LCP, INP, and CLS.
1. LCP Optimization: E-Commerce Hero Section
Complete optimization of a hero section with product image and CTA.
Before: Slow LCP (3.5s+)
// ❌ BAD: Multiple LCP issues
function Hero() {
const [product, setProduct] = useState(null);
useEffect(() => {
// Problem 1: LCP content loaded client-side
fetch('/api/featured-product')
.then(res => res.json())
.then(setProduct);
}, []);
if (!product) return <div className="h-[600px]" />; // Problem 2: No skeleton
return (
<div className="relative">
{/* Problem 3: No priority, lazy by default */}
<img src={product.image} alt={product.name} />
<h1>{product.name}</h1>
<a href={`/product/${product.id}`}>Shop Now</a>
</div>
);
}
After: Optimized LCP (1.2s)
// ✅ GOOD: Server-rendered with optimized image
import Image from 'next/image';
import { Suspense } from 'react';
// Server Component - data fetched on server
async function Hero() {
// Fetched on server, included in initial HTML
const product = await getFeaturedProduct();
return (
<section className="relative h-[600px] overflow-hidden">
{/* Priority image with explicit dimensions */}
<Image
src={product.image}
alt={product.name}
fill
priority // Preloads, eager loading
sizes="100vw"
quality={85}
placeholder="blur"
blurDataURL={product.blurPlaceholder}
style={{ objectFit: 'cover' }}
/>
{/* Content overlay */}
<div className="relative z-10 flex flex-col items-center justify-center h-full text-white">
<h1 className="text-5xl font-bold">{product.name}</h1>
<p className="mt-4 text-xl">{product.tagline}</p>
<a
href={`/product/${product.id}`}
className="mt-8 px-8 py-4 bg-white text-black rounded-lg font-semibold"
>
Shop Now
</a>
</div>
</section>
);
}
// Loading skeleton for Suspense boundary
function HeroSkeleton() {
return (
<section className="relative h-[600px] bg-gray-200 animate-pulse">
<div className="flex flex-col items-center justify-center h-full">
<div className="h-12 w-64 bg-gray-300 rounded" />
<div className="mt-4 h-6 w-48 bg-gray-300 rounded" />
<div className="mt-8 h-14 w-40 bg-gray-300 rounded-lg" />
</div>
</section>
);
}
// Usage in page
export default function HomePage() {
return (
<Suspense fallback={<HeroSkeleton />}>
<Hero />
</Suspense>
);
}
// Also add preload in head (layout.tsx or page metadata)
export const metadata = {
other: {
'link': [
{
rel: 'preload',
as: 'image',
href: '/featured-product-hero.webp',
fetchpriority: 'high',
},
],
},
};
Document Head Optimizations
<!-- Add to <head> for fastest LCP -->
<head>
<!-- Preload hero image -->
<link rel="preload" as="image" href="/hero.webp" fetchpriority="high" />
<!-- Preload critical font -->
<link
rel="preload"
as="font"
href="/fonts/inter-bold.woff2"
type="font/woff2"
crossorigin
/>
<!-- Preconnect to image CDN -->
<link rel="preconnect" href="https://images.example.com" />
<!-- DNS prefetch for analytics -->
<link rel="dns-prefetch" href="https://analytics.example.com" />
</head>
2. INP Optimization: Product Search Filter
Optimizing a search filter that was causing 400ms+ INP.
Before: Blocking INP (400ms+)
// ❌ BAD: Blocks main thread on every keystroke
function ProductSearch({ products }: { products: Product[] }) {
const [query, setQuery] = useState('');
const [results, setResults] = useState(products);
const handleChange = (e: ChangeEvent<HTMLInputElement>) => {
const value = e.target.value;
setQuery(value);
// Problem: Expensive filter runs synchronously
// Blocks paint until complete
const filtered = products.filter(p =>
p.name.toLowerCase().includes(value.toLowerCase()) ||
p.description.toLowerCase().includes(value.toLowerCase()) ||
p.tags.some(t => t.toLowerCase().includes(value.toLowerCase()))
);
setResults(filtered);
};
return (
<>
<input
value={query}
onChange={handleChange}
placeholder="Search products..."
/>
<ProductGrid products={results} />
</>
);
}
After: Responsive INP (50ms)
// ✅ GOOD: Non-blocking with useDeferredValue
import {
useState,
useDeferredValue,
useMemo,
useTransition,
memo
} from 'react';
function ProductSearch({ products }: { products: Product[] }) {
const [query, setQuery] = useState('');
const [isPending, startTransition] = useTransition();
// Deferred value for expensive computation
const deferredQuery = useDeferredValue(query);
const isStale = query !== deferredQuery;
// Memoized filter only runs when deferredQuery changes
const results = useMemo(() => {
if (!deferredQuery) return products;
const searchLower = deferredQuery.toLowerCase();
return products.filter(p =>
p.name.toLowerCase().includes(searchLower) ||
p.description.toLowerCase().includes(searchLower) ||
p.tags.some(t => t.toLowerCase().includes(searchLower))
);
}, [products, deferredQuery]);
return (
<div>
<div className="relative">
<input
value={query}
onChange={(e) => setQuery(e.target.value)}
placeholder="Search products..."
className="w-full px-4 py-2 border rounded-lg"
/>
{/* Loading indicator during filter */}
{isPending && (
<div className="absolute right-3 top-1/2 -translate-y-1/2">
<Spinner size="sm" />
</div>
)}
</div>
{/* Fade during pending state */}
<div
className="mt-4 transition-opacity"
style={{ opacity: isStale ? 0.7 : 1 }}
>
<ProductGrid products={results} />
</div>
</div>
);
}
// Memoized grid to prevent unnecessary re-renders
const ProductGrid = memo(function ProductGrid({
products
}: {
products: Product[]
}) {
return (
<div className="grid grid-cols-4 gap-4">
{products.map(product => (
<ProductCard key={product.id} product={product} />
))}
</div>
);
});
For Very Large Lists: Virtual Scrolling
// ✅ BEST: Virtualization for huge lists
import { useVirtualizer } from '@tanstack/react-virtual';
function VirtualizedProductList({ products }: { products: Product[] }) {
const parentRef = useRef<HTMLDivElement>(null);
const virtualizer = useVirtualizer({
count: products.length,
getScrollElement: () => parentRef.current,
estimateSize: () => 200, // Estimated row height
overscan: 5, // Render 5 extra items above/below
});
return (
<div
ref={parentRef}
className="h-[600px] overflow-auto"
>
<div
style={{
height: `${virtualizer.getTotalSize()}px`,
width: '100%',
position: 'relative',
}}
>
{virtualizer.getVirtualItems().map((virtualRow) => (
<div
key={virtualRow.key}
style={{
position: 'absolute',
top: 0,
left: 0,
width: '100%',
height: `${virtualRow.size}px`,
transform: `translateY(${virtualRow.start}px)`,
}}
>
<ProductCard product={products[virtualRow.index]} />
</div>
))}
</div>
</div>
);
}
3. CLS Optimization: News Article Page
Fixing layout shifts from images, ads, and fonts.
Before: High CLS (0.35)
// ❌ BAD: Multiple CLS issues
function Article({ article }: { article: Article }) {
const [ad, setAd] = useState(null);
useEffect(() => {
loadAd().then(setAd);
}, []);
return (
<article>
<h1>{article.title}</h1>
{/* Problem 1: Image without dimensions */}
<img src={article.heroImage} alt="" />
{/* Problem 2: Ad appears after load, shifts content */}
{ad && <div className="ad-banner"><img src={ad.image} /></div>}
<div dangerouslySetInnerHTML={{ __html: article.content }} />
{/* Problem 3: Related articles load and shift */}
<RelatedArticles />
</article>
);
}
// Problem 4: Font causes layout shift
// CSS
/* No font-display, no fallback sizing */
@font-face {
font-family: 'CustomFont';
src: url('/font.woff2');
}
After: Zero CLS (0.0)
// ✅ GOOD: All layout shifts prevented
import Image from 'next/image';
function Article({ article }: { article: Article }) {
return (
<article className="max-w-3xl mx-auto">
<h1 className="text-4xl font-bold">{article.title}</h1>
{/* Fixed dimensions prevent shift */}
<div className="relative aspect-[16/9] my-6">
<Image
src={article.heroImage}
alt={article.heroAlt}
fill
sizes="(max-width: 768px) 100vw, 768px"
priority
style={{ objectFit: 'cover' }}
/>
</div>
{/* Reserved space for ad */}
<AdSlot
slot="article-top"
className="my-6"
minHeight={250}
/>
<div
className="prose prose-lg"
dangerouslySetInnerHTML={{ __html: article.content }}
/>
{/* Reserved space for related */}
<RelatedArticles articleId={article.id} />
</article>
);
}
// Ad component with reserved space
function AdSlot({
slot,
className,
minHeight
}: {
slot: string;
className?: string;
minHeight: number;
}) {
const [ad, setAd] = useState<Ad | null>(null);
const [loaded, setLoaded] = useState(false);
useEffect(() => {
loadAd(slot).then(ad => {
setAd(ad);
setLoaded(true);
});
}, [slot]);
return (
<div
className={className}
style={{ minHeight: `${minHeight}px` }} // Reserved space
>
{loaded ? (
ad ? (
<Image
src={ad.image}
alt={ad.alt}
width={ad.width}
height={ad.height}
/>
) : null // No ad, space collapses gracefully
) : (
<Skeleton height={minHeight} /> // Placeholder during load
)}
</div>
);
}
// Related articles with skeleton
function RelatedArticles({ articleId }: { articleId: string }) {
const [articles, setArticles] = useState<Article[] | null>(null);
useEffect(() => {
fetchRelated(articleId).then(setArticles);
}, [articleId]);
return (
<section className="mt-12">
<h2 className="text-2xl font-bold mb-6">Related Articles</h2>
{/* Fixed grid prevents shift */}
<div className="grid grid-cols-3 gap-6">
{articles ? (
articles.map(article => (
<ArticleCard key={article.id} article={article} />
))
) : (
// Skeleton matches final layout exactly
<>
<ArticleCardSkeleton />
<ArticleCardSkeleton />
<ArticleCardSkeleton />
</>
)}
</div>
</section>
);
}
// Skeleton that matches card dimensions exactly
function ArticleCardSkeleton() {
return (
<div className="animate-pulse">
<div className="aspect-[16/9] bg-gray-200 rounded-lg" />
<div className="mt-3 h-5 bg-gray-200 rounded w-3/4" />
<div className="mt-2 h-4 bg-gray-200 rounded w-1/2" />
</div>
);
}
Font Loading Without CLS
/* ✅ Optimized font loading */
/* Main font with swap and metrics */
@font-face {
font-family: 'Inter';
src: url('/fonts/inter-var.woff2') format('woff2');
font-display: swap;
font-weight: 100 900;
}
/* Fallback font with matched metrics */
@font-face {
font-family: 'Inter Fallback';
src: local('Arial');
size-adjust: 107.64%;
ascent-override: 90%;
descent-override: 22.43%;
line-gap-override: 0%;
}
body {
font-family: 'Inter', 'Inter Fallback', system-ui, sans-serif;
}
/* Alternative: font-display: optional for non-critical fonts */
@font-face {
font-family: 'DisplayFont';
src: url('/fonts/display.woff2') format('woff2');
font-display: optional; /* Won't cause FOUT - uses fallback if not cached */
}
4. Complete RUM Implementation
Full Real User Monitoring setup with Next.js.
// lib/performance.ts
import { onCLS, onINP, onLCP, onFCP, onTTFB, type Metric } from 'web-vitals';
const ENDPOINT = '/api/vitals';
interface EnrichedMetric {
name: string;
value: number;
rating: 'good' | 'needs-improvement' | 'poor';
delta: number;
id: string;
navigationType: string;
url: string;
timestamp: number;
connection?: string;
deviceMemory?: number;
viewport: { width: number; height: number };
}
function getConnectionInfo() {
const nav = navigator as Navigator & {
connection?: { effectiveType?: string };
deviceMemory?: number;
};
return {
connection: nav.connection?.effectiveType,
deviceMemory: nav.deviceMemory,
};
}
function sendMetric(metric: Metric) {
const enriched: EnrichedMetric = {
name: metric.name,
value: metric.value,
rating: metric.rating,
delta: metric.delta,
id: metric.id,
navigationType: metric.navigationType,
url: window.location.href,
timestamp: Date.now(),
...getConnectionInfo(),
viewport: {
width: window.innerWidth,
height: window.innerHeight,
},
};
// Use sendBeacon for reliability
if (navigator.sendBeacon) {
navigator.sendBeacon(ENDPOINT, JSON.stringify(enriched));
} else {
fetch(ENDPOINT, {
method: 'POST',
body: JSON.stringify(enriched),
keepalive: true,
});
}
// Debug in development
if (process.env.NODE_ENV === 'development') {
const color = {
good: 'green',
'needs-improvement': 'orange',
poor: 'red',
}[metric.rating];
console.log(
`%c[${metric.name}] ${metric.value.toFixed(1)}${metric.name === 'CLS' ? '' : 'ms'}`,
`color: ${color}; font-weight: bold`
);
}
}
export function initWebVitals() {
onCLS(sendMetric);
onINP(sendMetric);
onLCP(sendMetric);
onFCP(sendMetric);
onTTFB(sendMetric);
}
// app/components/web-vitals.tsx
'use client';
import { useEffect } from 'react';
import { initWebVitals } from '@/lib/performance';
export function WebVitals() {
useEffect(() => {
initWebVitals();
}, []);
return null;
}
// app/api/vitals/route.ts
import { NextRequest, NextResponse } from 'next/server';
interface VitalMetric {
name: string;
value: number;
rating: string;
url: string;
timestamp: number;
}
export async function POST(request: NextRequest) {
const metric: VitalMetric = await request.json();
// Log for debugging
console.log('[Vital]', metric.name, metric.value, metric.rating);
// Store in database (example with Drizzle)
// await db.insert(webVitals).values({
// name: metric.name,
// value: metric.value,
// rating: metric.rating,
// url: metric.url,
// timestamp: new Date(metric.timestamp),
// });
// Alert on poor metrics
if (metric.rating === 'poor') {
// await alertService.send({
// severity: 'warning',
// message: `Poor ${metric.name}: ${metric.value} on ${metric.url}`,
// });
}
return NextResponse.json({ ok: true });
}
5. Performance Budget Enforcement
CI/CD integration with Lighthouse CI.
lighthouserc.js
module.exports = {
ci: {
collect: {
url: [
'http://localhost:3000/',
'http://localhost:3000/products',
'http://localhost:3000/checkout',
],
numberOfRuns: 3,
settings: {
preset: 'desktop',
// Throttle to simulate 4G
// throttling: { ... }
},
},
assert: {
assertions: {
// Core Web Vitals
'largest-contentful-paint': ['error', { maxNumericValue: 2500 }],
'cumulative-layout-shift': ['error', { maxNumericValue: 0.1 }],
'total-blocking-time': ['error', { maxNumericValue: 200 }], // Proxy for INP
// Other performance metrics
'first-contentful-paint': ['warn', { maxNumericValue: 1800 }],
'speed-index': ['warn', { maxNumericValue: 3400 }],
// Resource budgets
'resource-summary:script:size': ['error', { maxNumericValue: 150000 }],
'resource-summary:image:size': ['error', { maxNumericValue: 300000 }],
'resource-summary:total:size': ['error', { maxNumericValue: 500000 }],
// Scores
'categories:performance': ['error', { minScore: 0.9 }],
'categories:accessibility': ['error', { minScore: 0.9 }],
},
},
upload: {
target: 'temporary-public-storage',
},
},
};
GitHub Actions Workflow
# .github/workflows/lighthouse.yml
name: Lighthouse CI
on:
pull_request:
branches: [main]
jobs:
lighthouse:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Node
uses: actions/setup-node@v4
with:
node-version: '20'
- name: Install dependencies
run: npm ci
- name: Build
run: npm run build
- name: Start server
run: npm start &
- name: Wait for server
run: npx wait-on http://localhost:3000
- name: Run Lighthouse CI
run: |
npm install -g @lhci/cli
lhci autorun
env:
LHCI_GITHUB_APP_TOKEN: ${{ secrets.LHCI_GITHUB_APP_TOKEN }}
- name: Upload results
uses: actions/upload-artifact@v4
with:
name: lighthouse-results
path: .lighthouseci/
Quick Reference
// ✅ LCP: Server-render, preload, priority
export default async function Page() {
const data = await getData(); // Server-side
return <Image src={data.hero} priority fill />;
}
// ✅ INP: useTransition for expensive updates
const [isPending, startTransition] = useTransition();
onChange={(e) => {
setQuery(e.target.value);
startTransition(() => setResults(filter(e.target.value)));
}}
// ✅ CLS: Always set dimensions
<Image src="/photo.jpg" width={800} height={600} />
<div className="aspect-[16/9]"><Image fill /></div>
<div className="min-h-[250px]">{content}</div>
// ✅ RUM: Send metrics reliably
navigator.sendBeacon('/api/vitals', JSON.stringify(metric));
// ✅ Debug: Find LCP element
new PerformanceObserver((list) => {
console.log('LCP:', list.getEntries().at(-1)?.element);
}).observe({ type: 'largest-contentful-paint', buffered: true });
Image Examples
Image Optimization Examples
Hero Image with Blur Placeholder
import Image from 'next/image';
import heroImage from '@/public/hero.jpg'; // Static import
function Hero() {
return (
<div className="relative h-[600px] w-full">
<Image
src={heroImage}
alt="Beautiful landscape"
fill
priority
placeholder="blur" // Automatic with static import
sizes="100vw"
style={{ objectFit: 'cover' }}
/>
<div className="absolute inset-0 flex items-center justify-center">
<h1 className="text-5xl font-bold text-white">Welcome</h1>
</div>
</div>
);
}
Product Grid with Responsive Sizes
function ProductGrid({ products }) {
return (
<div className="grid grid-cols-2 md:grid-cols-3 lg:grid-cols-4 gap-4">
{products.map((product) => (
<div key={product.id} className="relative aspect-square">
<Image
src={product.imageUrl}
alt={product.name}
fill
sizes="(max-width: 640px) 50vw, (max-width: 1024px) 33vw, 25vw"
className="object-cover rounded-lg"
/>
</div>
))}
</div>
);
}
Avatar with Fallback
function UserAvatar({ user }) {
const [error, setError] = useState(false);
if (error || !user.avatarUrl) {
return (
<div className="h-10 w-10 rounded-full bg-blue-500 flex items-center justify-center">
<span className="text-white font-medium">
{user.name.charAt(0).toUpperCase()}
</span>
</div>
);
}
return (
<Image
src={user.avatarUrl}
alt={user.name}
width={40}
height={40}
className="rounded-full"
onError={() => setError(true)}
/>
);
}
Art Direction (Different Crops)
function ResponsiveBanner() {
return (
<>
{/* Mobile: Portrait crop */}
<div className="relative h-[400px] md:hidden">
<Image
src="/banner-mobile.jpg"
alt="Banner"
fill
priority
sizes="100vw"
className="object-cover"
/>
</div>
{/* Desktop: Landscape crop */}
<div className="relative hidden h-[300px] md:block">
<Image
src="/banner-desktop.jpg"
alt="Banner"
fill
priority
sizes="100vw"
className="object-cover"
/>
</div>
</>
);
}
Gallery with Lightbox
function ImageGallery({ images }) {
const [selected, setSelected] = useState(null);
return (
<>
<div className="grid grid-cols-3 gap-2">
{images.map((image, i) => (
<button
key={image.id}
onClick={() => setSelected(image)}
className="relative aspect-square"
>
<Image
src={image.thumbnailUrl}
alt={image.alt}
fill
sizes="33vw"
className="object-cover"
/>
</button>
))}
</div>
{selected && (
<Dialog open onClose={() => setSelected(null)}>
<div className="relative h-[80vh] w-[90vw]">
<Image
src={selected.fullUrl}
alt={selected.alt}
fill
sizes="90vw"
quality={90}
className="object-contain"
/>
</div>
</Dialog>
)}
</>
);
}
Background Image Pattern
// For true background images, use CSS
function HeroWithCSSBackground() {
return (
<div
className="h-[600px] bg-cover bg-center"
style={{ backgroundImage: 'url(/hero.webp)' }}
>
<div className="h-full flex items-center justify-center bg-black/40">
<h1 className="text-white text-5xl">Hero Title</h1>
</div>
</div>
);
}
// For Next.js optimization, use Image with fill
function HeroWithNextImage() {
return (
<div className="relative h-[600px]">
<Image
src="/hero.webp"
alt=""
fill
priority
className="object-cover -z-10"
/>
<div className="h-full flex items-center justify-center bg-black/40">
<h1 className="text-white text-5xl">Hero Title</h1>
</div>
</div>
);
}
Orchestkit Performance Wins
OrchestKit Performance Wins - Real Optimization Examples
This document showcases actual performance optimizations from OrchestKit's production implementation with before/after metrics.
Overview
Key Performance Achievements:
- LLM costs: $35k/year → $2-5k/year (85-95% reduction)
- Vector search: 85ms → 5ms (17x faster)
- Retrieval accuracy: 87.2% → 91.6% (5.1% improvement)
- Quality gate pass rate: Increased from 67-77% → 85%+ (stable)
- Cache hit rate: 0% → 90% (L1) + 75% (L2)
Win 1: Multi-Level LLM Caching
Problem
Projected annual LLM costs: $35,000
- 8 agents per analysis, 1,500-1,800 tokens each
- Average 145 analyses/month
- No caching = every query hits LLM
- Claude Sonnet 4.5: $3/MTok input, $15/MTok output
Investigation
Cost breakdown by agent:
-- Langfuse query
SELECT
metadata->>'agent_type' as agent,
SUM(calculated_total_cost) as total_cost,
AVG(input_tokens) as avg_input,
AVG(output_tokens) as avg_output
FROM traces
GROUP BY agent
ORDER BY total_cost DESC;
Results:
| Agent | Monthly Cost | Avg Input | Avg Output |
|---|---|---|---|
| security_auditor | $3.05 | 1,800 | 1,200 |
| implementation_planner | $2.76 | 1,600 | 1,100 |
| tech_comparator | $2.61 | 1,500 | 1,000 |
| Total (8 agents) | $18.73 | - | - |
Pain points:
- Analyzing similar content (React tutorials, FastAPI guides) repeatedly
- Security patterns (XSS, SQL injection) are common across codebases
- Implementation patterns (CRUD, auth) are highly repetitive
Solution: 3-Level Cache Hierarchy
Architecture:
Request → L1: Prompt Cache (Claude native)
↓ miss (10%)
→ L2: Semantic Cache (Redis vector search)
↓ miss (25% of L1 misses)
→ L3: LLM Call (actual cost)
L1: Claude Prompt Caching (Native)
File: backend/app/shared/services/llm/anthropic_client.py
from anthropic import AsyncAnthropic
async def call_claude_with_prompt_cache(
system_prompt: str,
user_message: str,
model: str = "claude-sonnet-4-6"
) -> str:
"""Call Claude with prompt caching for system prompts."""
response = await anthropic_client.messages.create(
model=model,
max_tokens=4096,
system=[
{
"type": "text",
"text": system_prompt,
"cache_control": {"type": "ephemeral"} # Cache this!
}
],
messages=[
{"role": "user", "content": user_message}
]
)
# Log cache usage
cache_hit = response.usage.cache_read_input_tokens > 0
logger.info("claude_prompt_cache",
cache_hit=cache_hit,
cache_read_tokens=response.usage.cache_read_input_tokens,
input_tokens=response.usage.input_tokens,
output_tokens=response.usage.output_tokens
)
return response.content[0].text
Cost savings:
- Cache hit: 90% discount on cached tokens
- Cache duration: 5 minutes
- Effective for: Agent system prompts (1,500+ tokens each)
L2: Semantic Cache (Redis + Vector Search)
File: backend/app/shared/services/cache/semantic_cache.py
import json
from datetime import datetime

import numpy as np
from redis import Redis

from app.shared.services.embeddings import embed_text
class SemanticCache:
"""Vector similarity-based cache for LLM responses."""
def __init__(self, redis_client: Redis, threshold: float = 0.92):
self.redis = redis_client
self.threshold = threshold # Cosine similarity threshold
async def get(self, query: str) -> str | None:
"""Check if semantically similar query exists in cache."""
# Generate query embedding
query_embedding = await embed_text(query)
# Search for similar cached queries
# (Using Redis VSS or dedicated vector store)
cached_queries = await self._vector_search(query_embedding, top_k=5)
for cached_query, cached_embedding, cached_response in cached_queries:
similarity = cosine_similarity(query_embedding, cached_embedding)
if similarity >= self.threshold:
logger.info("semantic_cache_hit",
similarity=similarity,
cached_query=cached_query[:100]
)
return cached_response
return None # Cache miss
async def set(self, query: str, response: str, ttl: int = 3600):
"""Store query-response pair with embedding."""
# Generate embedding
embedding = await embed_text(query)
# Store in Redis (with vector index)
cache_key = f"semantic_cache:{hash(query)}"
await self.redis.setex(
cache_key,
ttl,
json.dumps({
"query": query,
"response": response,
"embedding": embedding.tolist(),
"timestamp": datetime.now().isoformat()
})
)
Cost savings:
- 75% hit rate on L1 misses
- Near-instant responses (5-10ms vs 2000ms)
- Effective for: Similar technical queries
Implementation in agent calls:
@observe(name="agent_execution")
async def execute_agent(agent_type: str, content: str) -> Finding:
"""Execute agent with 3-level caching."""
# Build query
system_prompt = get_agent_system_prompt(agent_type) # 1,500+ tokens
user_message = f"Analyze this content:\n\n{content[:8000]}"
# L2: Check semantic cache
cache_key = f"{agent_type}:{content[:200]}" # Simple key for demo
cached_response = await semantic_cache.get(cache_key)
if cached_response:
logger.info("cache_hit", level="L2_semantic", agent=agent_type)
return parse_finding(cached_response)
# L1 + L3: Call Claude (with prompt caching)
response = await call_claude_with_prompt_cache(
system_prompt=system_prompt, # Cached by Claude
user_message=user_message
)
# Store in semantic cache
await semantic_cache.set(cache_key, response, ttl=3600)
return parse_finding(response)
Results
Cost Reduction:
Baseline (no cache): $35,000/year
L1 savings (90% hit): -$28,350 (90% discount on 90% of queries)
L2 savings (75% hit): -$4,650 (85% discount on 75% of L1 misses)
Final cost: $2,000-5,000/year
Total savings: 85-95%
Latency Improvement:
| Cache Level | Hit Rate | Latency | Cost Savings |
|---|---|---|---|
| L1 (Prompt) | 90% | 2000ms (same) | 90% on cached tokens |
| L2 (Semantic) | 75% (of L1 misses) | 5-10ms | 85% (full skip) |
| L3 (LLM) | 2.5% (fallback) | 2000ms | 0% (full cost) |
Implementation effort: 2 days
Maintenance overhead: Low (cache TTL auto-expires stale data)
Win 2: Vector Index Optimization (HNSW vs IVFFlat)
Problem
Vector search taking 85ms, needed <10ms
- Golden dataset: 415 chunks, 1536-dim embeddings
- IVFFlat index (lists=10)
- Hybrid search (vector + BM25 RRF) bottlenecked by vector search
Investigation
Benchmark both index types:
-- IVFFlat performance
EXPLAIN ANALYZE
SELECT * FROM chunks
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 10;
-- Result:
-- Planning Time: 2.1 ms
-- Execution Time: 85.3 ms
-- HNSW performance
CREATE INDEX idx_chunk_embedding_hnsw ON chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
EXPLAIN ANALYZE
SELECT * FROM chunks
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 10;
-- Result:
-- Planning Time: 2.0 ms
-- Execution Time: 5.1 ms
Trade-offs:
| Index | Build Time | Query Time | Accuracy | Memory |
|---|---|---|---|---|
| IVFFlat (lists=10) | 2s | 85ms | 95% | Low |
| HNSW (m=16) | 8s | 5ms | 98% | Medium |
Solution: HNSW Index with Optimized Parameters
File: backend/alembic/versions/xxx_add_hnsw_index.py
def upgrade():
"""Add HNSW index for vector similarity search."""
op.execute("""
CREATE INDEX CONCURRENTLY idx_chunk_embedding_hnsw
ON chunks USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
""")
# Drop old IVFFlat index
op.execute("DROP INDEX IF EXISTS idx_chunk_embedding_ivfflat;")Parameters chosen:
- m = 16: Connections per layer (sweet spot for 1k-10k vectors)
- ef_construction = 64: Build-time quality (higher = better accuracy, slower build)
- ef_search = 64: Query-time quality (can tune per query)
Runtime tuning:
async def search_similar_chunks(
embedding: list[float],
top_k: int = 10
) -> list[Chunk]:
"""Vector similarity search with HNSW index."""
# Tune ef_search for accuracy vs speed trade-off
await session.execute(text("SET hnsw.ef_search = 64;"))
results = await session.execute(
select(Chunk)
.order_by(Chunk.embedding.cosine_distance(embedding))
.limit(top_k)
)
return results.scalars().all()
Results
Performance:
- Query latency: 85ms → 5ms (17x faster)
- Accuracy: 95% → 98% (3% improvement)
- Build time: 2s → 8s (acceptable for 415 chunks)
Impact on retrieval:
- Hybrid search latency: 95ms → 15ms (p95)
- Throughput: 10.5 req/s → 66 req/s (6x improvement)
Implementation effort: 4 hours (index creation + testing)
Win 3: Hybrid Search Ranking Optimization
Problem
Retrieval pass rate: 87.2%, target: >90%
- Expected chunks ranked 6-10 instead of top-5
- RRF fusion not getting enough candidates
- No metadata boosting
Investigation
Golden dataset analysis (203 queries):
# Evaluate current ranking
results = []
for query in golden_queries:
retrieved = await hybrid_search(query.text, top_k=10)
expected_in_top_k = any(chunk.id in query.expected_chunk_ids for chunk in retrieved)
rank = next((i for i, c in enumerate(retrieved) if c.id in query.expected_chunk_ids), -1)
results.append({
"query": query.text,
"expected_rank": rank,
"found": rank != -1,
"passed": rank < 10
})
# Results:
# Pass rate: 177/203 = 87.2%
# MRR: 0.723
Failure analysis:
- 26 queries failed (expected chunk not in top-10)
- Common issue: Expected chunk ranked 11-15
- Root cause: RRF fusion only fetching 2x candidates (20 for top-10)
Solution: Multi-Pronged Optimization
1. Increase RRF Fetch Multiplier
File: backend/app/core/constants.py
# Before
HYBRID_FETCH_MULTIPLIER = 2 # Fetch 20 for top-10
# After
HYBRID_FETCH_MULTIPLIER = 3 # Fetch 30 for top-10
Rationale: More candidates → better RRF coverage → higher recall
2. Add Metadata Boosting
File: backend/app/shared/services/search/search_service.py
def apply_metadata_boosts(
chunks: list[Chunk],
query: str
) -> list[Chunk]:
"""Boost scores based on metadata signals."""
query_lower = query.lower()
for chunk in chunks:
# Boost if query matches section title
if chunk.section_title and any(
term in chunk.section_title.lower()
for term in query_lower.split()
):
chunk.score *= SECTION_TITLE_BOOST_FACTOR # 2.0
# Boost if query matches document path
if chunk.document_path and any(
term in chunk.document_path.lower()
for term in query_lower.split()
):
chunk.score *= DOCUMENT_PATH_BOOST_FACTOR # 1.15
# Boost code blocks for technical queries
if chunk.chunk_type == "code_block" and is_technical_query(query):
chunk.score *= TECHNICAL_KEYWORD_BOOST # 1.2
return sorted(chunks, key=lambda c: c.score, reverse=True)
3. Pre-Compute tsvector for BM25
Before:
-- Compute tsvector on-the-fly (slow!)
SELECT *, ts_rank(to_tsvector('english', content), query) as rank
FROM chunks
WHERE to_tsvector('english', content) @@ query
ORDER BY rank DESC;
After:
-- Use pre-computed tsvector column (fast!)
SELECT *, ts_rank(content_tsvector, query) as rank
FROM chunks
WHERE content_tsvector @@ query
ORDER BY rank DESC;
Migration:
def upgrade():
"""Add pre-computed tsvector column."""
# Add column
op.add_column('chunks', sa.Column('content_tsvector', TSVECTOR))
# Populate
op.execute("""
UPDATE chunks
SET content_tsvector = to_tsvector('english', content);
""")
# Create GIN index
op.execute("""
CREATE INDEX idx_chunk_tsvector
ON chunks USING GIN(content_tsvector);
""")
# Add trigger to keep it updated
op.execute("""
CREATE TRIGGER tsvector_update BEFORE INSERT OR UPDATE
ON chunks FOR EACH ROW EXECUTE FUNCTION
tsvector_update_trigger(content_tsvector, 'pg_catalog.english', content);
""")Results
Ranking Quality:
| Metric | Before | After | Change |
|---|---|---|---|
| Pass rate | 177/203 (87.2%) | 186/203 (91.6%) | +5.1% |
| MRR (overall) | 0.723 | 0.777 | +7.4% |
| MRR (hard queries) | 0.647 | 0.686 | +6.0% |
Query Performance:
| Operation | Before | After | Change |
|---|---|---|---|
| BM25 search | 45ms | 4ms | 11x faster |
| Vector search | 5ms | 5ms | Same |
| RRF fusion | 2ms | 3ms | Slightly slower (more candidates) |
| Total | 52ms | 12ms | 4.3x faster |
Impact by boost factor:
- Section title boost: +7.4% MRR (most impactful)
- Document path boost: +2.1% MRR
- Code block boost: +1.3% MRR (for technical queries)
Implementation effort: 1 day (constants, migration, testing)
Win 4: SSE Event Buffering (Race Condition Fix)
Problem
Frontend showed 0% progress while backend was running
- Real-time progress updates missing
- EventSource connection established AFTER events published
- No event replay mechanism
Investigation
Reproduce issue:
- Start analysis via API
- Frontend subscribes to SSE
/progress/\{analysis_id\} - Backend immediately publishes "analysis_started" event
- Frontend connects 200ms later → misses early events
Root cause:
# ❌ BAD: Events lost if no subscriber yet
class EventBroadcaster:
def publish(self, channel: str, event: dict):
if channel not in self._subscribers:
return # Event lost!
for subscriber in self._subscribers[channel]:
subscriber.send(event)
Solution: Event Buffering with Replay
File: backend/app/services/event_broadcaster.py
from collections import deque
from dataclasses import dataclass
from datetime import datetime
@dataclass
class BufferedEvent:
"""Event with timestamp for replay."""
data: dict
timestamp: datetime
class EventBroadcaster:
"""SSE broadcaster with event buffering."""
def __init__(self, buffer_size: int = 100):
self._subscribers: dict[str, list] = {}
self._buffers: dict[str, deque[BufferedEvent]] = {}
self._buffer_size = buffer_size
def publish(self, channel: str, event: dict):
"""Publish event and store in buffer."""
# Create buffer if needed
if channel not in self._buffers:
self._buffers[channel] = deque(maxlen=self._buffer_size)
# Add to buffer
buffered_event = BufferedEvent(
data=event,
timestamp=datetime.now()
)
self._buffers[channel].append(buffered_event)
# Send to active subscribers
for subscriber in self._subscribers.get(channel, []):
try:
subscriber.send(event)
except Exception as e:
logger.error("failed_to_send_event", error=str(e))
async def subscribe(self, channel: str):
"""Subscribe to channel and replay buffered events."""
# Replay buffered events first
for buffered_event in self._buffers.get(channel, []):
yield {
"event": "message",
"data": json.dumps(buffered_event.data)
}
# Then stream new events
queue = asyncio.Queue()
self._subscribers.setdefault(channel, []).append(queue)
try:
while True:
event = await queue.get()
yield {
"event": "message",
"data": json.dumps(event)
}
finally:
self._subscribers[channel].remove(queue)
API endpoint:
@app.get("/progress/{analysis_id}")
async def stream_progress(analysis_id: str):
"""Stream analysis progress with buffered event replay."""
channel = f"analysis:{analysis_id}"
async def event_generator():
async for event in event_broadcaster.subscribe(channel):
yield f"data: {event['data']}\n\n"
return StreamingResponse(
event_generator(),
media_type="text/event-stream"
)
Results
Before (with race condition):
- 0% progress shown until agent completion (30-60 seconds)
- Users confused, thought app was frozen
- Support tickets: "Analysis stuck at 0%"
After (with buffering):
- All events delivered (100% replay rate)
- Progress updates appear immediately
- Memory overhead: ~10KB per active analysis (100 events × 100 bytes)
Implementation effort: 3 hours (buffer logic + tests)
Win 5: Quality Gate Content Truncation Fix
Problem
Quality scores artificially low due to content truncation
- Depth scores: 5/10 (AWFUL) → required retries
- G-Eval only seeing truncated summaries
- 4 stages of truncation compounding
Investigation
Trace truncation points:
# Stage 1: compress_findings.py
MAX_STRING_LENGTH = 200 # ❌ Too aggressive!
# Stage 2: scorer.py
input_text = content[:2000] # ❌ Truncated again!
output_text = response[:3000]
# Stage 3: quality.py
MAX_CONTENT_LENGTH = 8000 # ❌ Insufficient!
# Stage 4: quality_gate_node.py
insights = findings[:2000] # ❌ Final truncation!
Example:
- Original finding: 5,000 chars (detailed security analysis)
- After Stage 1: 200 chars ("Found 3 vulnerabilities...")
- After synthesis: 1,500 chars (includes other findings)
- After Stage 2: 1,500 chars (same)
- After G-Eval: Depth score = 5/10 (insufficient detail)
Solution: Increase All Truncation Limits
Changes:
| File | Before | After | Rationale |
|---|---|---|---|
| compress_findings.py | 200 | 500 | Allow key insights |
| scorer.py (input) | 2,000 | 8,000 | Full context for eval |
| scorer.py (output) | 3,000 | 12,000 | Detailed responses |
| quality.py | 8,000 | 15,000 | Complete synthesis |
| quality_gate_node.py | 2,000 | 8,000 | All findings visible |
Implementation:
# backend/app/shared/services/g_eval/scorer.py
MAX_INPUT_LENGTH = 8000 # Increased from 2000
MAX_OUTPUT_LENGTH = 12000 # Increased from 3000
# backend/app/evaluation/evaluators/quality.py
MAX_CONTENT_LENGTH = 15000 # Increased from 8000
# backend/app/domains/analysis/workflows/tasks/aggregation/compress_findings.py
MAX_STRING_LENGTH = 500 # Increased from 200
Results
Quality Scores:
| Criterion | Before | After | Change |
|---|---|---|---|
| Completeness | 0.75 | 0.85 | +13% |
| Accuracy | 0.88 | 0.92 | +5% |
| Coherence | 0.84 | 0.88 | +5% |
| Depth | 0.58 | 0.78 | +34% |
| Overall | 0.76 | 0.86 | +13% |
Pass rate: 67-77% (variable) → 85%+ (stable)
Trade-offs:
- Token usage: +15% (from 8k → 12k avg)
- Cost impact: +$0.02 per analysis (acceptable)
- Quality improvement: Worth the extra cost
Implementation effort: 2 hours (find all truncation points + update tests)
Summary Table
| Optimization | Metric | Before | After | Improvement | Effort |
|---|---|---|---|---|---|
| Multi-level caching | Annual cost | $35k | $2-5k | 85-95% | 2 days |
| HNSW index | Query latency | 85ms | 5ms | 17x faster | 4 hours |
| Hybrid search | Pass rate | 87.2% | 91.6% | +5.1% | 1 day |
| SSE buffering | Event delivery | 60% | 100% | +67% | 3 hours |
| Content truncation | Depth score | 0.58 | 0.78 | +34% | 2 hours |
Total implementation time: 4 days
Annual cost savings: $30-33k
Quality improvement: 13% overall, 34% depth
References
- OrchestKit Quality Initiative
- Redis Connection Keepalive
- Hybrid Search Constants
- Template: ../scripts/caching-patterns.ts
- Template: ../scripts/database-optimization.ts