Performance
Performance optimization patterns covering Core Web Vitals, React render optimization, lazy loading, image optimization, backend profiling, and LLM inference. Use when improving page speed, debugging slow renders, optimizing bundles, reducing image payload, profiling backend services, or deploying LLMs efficiently.
Primary Agent: frontend-ui-developer
Performance
Comprehensive performance optimization patterns for frontend, backend, and LLM inference.
Quick Reference
| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Core Web Vitals | 3 | CRITICAL | LCP, INP, CLS optimization with 2026 thresholds |
| Render Optimization | 3 | HIGH | React Compiler, memoization, virtualization |
| Lazy Loading | 3 | HIGH | Code splitting, route splitting, preloading |
| Image Optimization | 3 | HIGH | Next.js Image, AVIF/WebP, responsive images |
| Profiling & Backend | 3 | MEDIUM | React DevTools, py-spy, bundle analysis |
| LLM Inference | 3 | MEDIUM | vLLM, quantization, speculative decoding |
| Caching | 2 | HIGH | Redis cache-aside, prompt caching, HTTP cache headers |
| Query & Data Fetching | 2 | HIGH | TanStack Query prefetching, optimistic updates, rollback |
Total: 22 rules across 8 categories
Core Web Vitals
Google's Core Web Vitals with 2026 stricter thresholds.
| Rule | File | Key Pattern |
|---|---|---|
| LCP Optimization | rules/cwv-lcp.md | Preload hero, SSR, fetchpriority="high" |
| INP Optimization | rules/cwv-inp.md | scheduler.yield, useTransition, requestIdleCallback |
| CLS Prevention | rules/cwv-cls.md | Explicit dimensions, aspect-ratio, font-display |
2026 Thresholds
| Metric | Current Good | 2026 Good |
|---|---|---|
| LCP | <= 2.5s | <= 2.0s |
| INP | <= 200ms | <= 150ms |
| CLS | <= 0.1 | <= 0.08 |
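To verify these targets with field data, here is a minimal sketch assuming the `web-vitals` npm package (its `onLCP`/`onINP`/`onCLS` callbacks report real-user values):
import { onCLS, onINP, onLCP, type Metric } from 'web-vitals';
// 2026 "good" thresholds from the table above (ms for LCP/INP, unitless for CLS)
const THRESHOLDS: Record<string, number> = { LCP: 2000, INP: 150, CLS: 0.08 };
function report(metric: Metric) {
  const limit = THRESHOLDS[metric.name];
  if (limit !== undefined && metric.value > limit) {
    // In production, send this to a RUM endpoint instead of logging
    console.warn(`${metric.name} ${metric.value} exceeds 2026 target ${limit}`);
  }
}
onLCP(report);
onINP(report);
onCLS(report);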
Render Optimization
React render performance patterns for React 19+.
| Rule | File | Key Pattern |
|---|---|---|
| React Compiler | rules/render-compiler.md | Auto-memoization, "Memo" badge verification |
| Manual Memoization | rules/render-memo.md | useMemo/useCallback escape hatches, state colocation |
| Virtualization | rules/render-virtual.md | TanStack Virtual for 100+ item lists |
Lazy Loading
Code splitting and lazy loading with React.lazy and Suspense.
| Rule | File | Key Pattern |
|---|---|---|
| React.lazy + Suspense | rules/loading-lazy.md | Component lazy loading, error boundaries |
| Route Splitting | rules/loading-splitting.md | React Router 7.x, Vite manual chunks |
| Preloading | rules/loading-preload.md | Prefetch on hover, modulepreload hints |
Image Optimization
Production image optimization for modern web applications.
| Rule | File | Key Pattern |
|---|---|---|
| Next.js Image | rules/images-nextjs.md | Image component, priority, blur placeholder |
| Format Selection | rules/images-formats.md | AVIF/WebP, quality 75-85, picture element |
| Responsive Images | rules/images-responsive.md | sizes prop, art direction, CDN loaders |
Profiling & Backend
Profiling tools and backend optimization patterns.
| Rule | File | Key Pattern |
|---|---|---|
| React Profiling | rules/profiling-react.md | DevTools Profiler, flamegraph, render counts |
| Backend Profiling | rules/profiling-backend.md | py-spy, cProfile, memory_profiler, flame graphs |
| Bundle Analysis | rules/profiling-bundle.md | vite-bundle-visualizer, tree shaking, performance budgets |
LLM Inference
High-performance LLM inference with vLLM, quantization, and speculative decoding.
| Rule | File | Key Pattern |
|---|---|---|
| vLLM Deployment | rules/inference-vllm.md | PagedAttention, continuous batching, tensor parallelism |
| Quantization | rules/inference-quantization.md | AWQ, GPTQ, FP8, INT8 method selection |
| Speculative Decoding | rules/inference-speculative.md | N-gram, draft model, 1.5-2.5x throughput |
Caching
Backend Redis caching and LLM prompt caching for cost savings and performance.
| Rule | File | Key Pattern |
|---|---|---|
| Redis & Backend | rules/caching-redis.md | Cache-aside, write-through, invalidation, stampede prevention |
| HTTP & Prompt | rules/caching-http.md | HTTP cache headers, LLM prompt caching, semantic caching |
Query & Data Fetching
TanStack Query v5 patterns for prefetching and optimistic updates.
| Rule | File | Key Pattern |
|---|---|---|
| Prefetching | rules/query-prefetching.md | Hover prefetch, route loaders, queryOptions, Suspense |
| Optimistic Updates | rules/query-optimistic.md | Optimistic mutations, rollback, cache invalidation |
Quick Start Example
// LCP: Priority hero image with SSR
import Image from 'next/image';
export default async function Page() {
const data = await fetchHeroData();
return (
<Image
src={data.heroImage}
alt="Hero"
priority
placeholder="blur"
sizes="100vw"
fill
/>
);
}
Key Decisions
| Decision | Recommendation |
|---|---|
| Memoization | Let React Compiler handle it (2026 default) |
| Lists 100+ items | Use TanStack Virtual |
| Image format | AVIF with WebP fallback (30-50% smaller) |
| LCP content | SSR/SSG, never client-side fetch |
| Code splitting | Per-route for most apps, per-component for heavy widgets |
| Prefetch strategy | On hover for nav links, viewport for content |
| Quantization | AWQ for 4-bit, FP8 for H100/H200 |
| Bundle budget | Hard fail in CI to prevent regression |
Common Mistakes
- Client-side fetching LCP content (delays render)
- Images without explicit dimensions (causes CLS)
- Lazy loading LCP images (delays largest paint)
- Heavy computation in event handlers (blocks INP)
- Layout-shifting animations (use transform instead)
- Lazy loading tiny components < 5KB (overhead > savings)
- Missing error boundaries on lazy components
- Using GPTQ without calibration data
- Not benchmarking actual workload patterns
- Only measuring in lab environment (need RUM)
Related Skills
- ork:react-server-components-framework - Server-first rendering
- ork:vite-advanced - Build optimization
- caching - Cache strategies for responses
- ork:monitoring-observability - Production monitoring and alerting
- ork:database-patterns - Query and index optimization
- ork:llm-integration - Local inference with Ollama
Capability Details
lcp-optimization
Keywords: LCP, largest-contentful-paint, hero, preload, priority, SSR
Solves:
- Optimize hero image loading
- Server-render critical content
- Preload and prioritize LCP resources
inp-optimization
Keywords: INP, interaction, responsiveness, long-task, transition, yield
Solves:
- Break up long tasks with scheduler.yield
- Defer non-urgent updates with useTransition
- Optimize event handler performance
cls-prevention
Keywords: CLS, layout-shift, dimensions, aspect-ratio, font-display
Solves:
- Reserve space for dynamic content
- Prevent font flash and image pop-in
- Use transform for animations
react-compiler
Keywords: react-compiler, auto-memo, memoization, React 19
Solves:
- Enable automatic memoization
- Identify when manual memoization needed
- Verify compiler is working
virtualization
Keywords: virtual, TanStack, large-list, scroll, overscan
Solves:
- Render 100+ item lists efficiently
- Dynamic height virtualization
- Window scrolling patterns
lazy-loading
Keywords: React.lazy, Suspense, code-splitting, dynamic-import
Solves:
- Route-based code splitting
- Component lazy loading with error boundaries
- Prefetch on hover and viewport
image-optimization
Keywords: next/image, AVIF, WebP, responsive, blur-placeholder
Solves:
- Next.js Image component patterns
- Format selection and quality settings
- Responsive sizing and CDN configuration
profiling
Keywords: profiler, flame-graph, py-spy, DevTools, bundle-analyzer
Solves:
- Profile React renders and backend code
- Generate and interpret flame graphs
- Analyze and optimize bundle size
llm-inference
Keywords: vllm, quantization, speculative-decoding, inference, throughput
Solves:
- Deploy LLMs with vLLM for production
- Choose quantization method for hardware
- Accelerate generation with speculative decoding
References
- RUM Setup - Real User Monitoring
- React Compiler Migration - Compiler adoption
- TanStack Virtual - Virtualization patterns
- vLLM Deployment - Production vLLM config
- Quantization Guide - Method comparison
- CDN Setup - Image CDN configuration
Rules (22)
Configure HTTP and LLM prompt caching with correct breakpoint ordering for maximum savings — HIGH
HTTP & Prompt Caching
HTTP cache headers for CDN/browser caching and LLM prompt caching for 90% token savings.
Incorrect — variable content before cached prefix:
# WRONG: Variable content before static content breaks prompt cache
messages = [
{"role": "user", "content": f"User {user_id} asks: {question}"}, # Variable first!
{"role": "system", "content": long_system_prompt}, # Static content after = never cached
]
Correct — static prefix first, then variable content:
# Claude prompt caching: static content first with cache_control
response = await client.messages.create(
model="claude-sonnet-4-6",
system=[
{
"type": "text",
"text": long_system_prompt, # Static: cached across calls
"cache_control": {"type": "ephemeral"}, # 5-minute TTL
},
],
messages=[
{"role": "user", "content": user_question}, # Variable: after cache breakpoint
],
)
# Result: ~90% token savings on system prompt after first call
# OpenAI: automatic prefix caching (no markers needed)
# Just ensure static content comes first in messages array
response = await openai.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": long_system_prompt}, # Cached automatically
{"role": "user", "content": user_question},
],
)
HTTP cache headers for API responses:
from fastapi import FastAPI, Response
app = FastAPI()
@app.get("/api/products/{product_id}")
async def get_product(product_id: str, response: Response):
product = await fetch_product(product_id)
# Browser caches 60s, CDN caches 1h
response.headers["Cache-Control"] = "public, max-age=60, s-maxage=3600"
response.headers["CDN-Cache-Control"] = "max-age=3600"
return product
@app.get("/api/user/profile")
async def get_profile(response: Response):
# Private: only browser cache, not CDN
response.headers["Cache-Control"] = "private, max-age=300"
    return await get_current_user_profile()
Key rules:
- Claude: use `cache_control` with `ephemeral` type (5min default, 1h if >10 reads/hour)
- OpenAI: automatic prefix caching, no markers needed — just put static content first
- HTTP: `public, max-age=60, stale-while-revalidate=300` for API responses
- Use `s-maxage` or `CDN-Cache-Control` for different CDN vs browser TTLs
- Semantic caching: start threshold at 0.92, tune based on hit rate (see the sketch below)
- Never cache error responses or authentication tokens
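A minimal semantic-caching sketch for the threshold rule above (hypothetical, not from the source: `embed()` stands in for any embedding API, and an in-memory array stands in for a vector store):
declare function embed(text: string): Promise<number[]>; // hypothetical embedding call
type CachedEntry = { embedding: number[]; response: string };
const store: CachedEntry[] = [];
const THRESHOLD = 0.92; // starting point; tune based on observed hit rate

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function semanticLookup(query: string): Promise<string | null> {
  const qv = await embed(query);
  let best: CachedEntry | null = null;
  let bestSim = 0;
  for (const entry of store) {
    const sim = cosine(qv, entry.embedding);
    if (sim > bestSim) { bestSim = sim; best = entry; }
  }
  // Serve the cached response only above the similarity threshold
  return best && bestSim >= THRESHOLD ? best.response : null;
}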
Implement Redis cache-aside pattern with TTL and stampede prevention for backend caching — HIGH
Redis & Backend Caching
Cache-aside, write-through, and invalidation patterns for Redis-backed backend services.
Incorrect — caching without TTL (memory leak):
# WRONG: No TTL = memory grows forever
async def get_user(user_id: str):
cached = await redis.get(f"user:{user_id}")
if cached:
return json.loads(cached)
user = await db.fetch_user(user_id)
await redis.set(f"user:{user_id}", json.dumps(user)) # No expiry!
    return user
Correct — cache-aside with TTL and stampede prevention:
import redis.asyncio as redis
import json
import asyncio
class CacheAside:
def __init__(self, redis_client: redis.Redis, default_ttl: int = 3600):
self.redis = redis_client
self.ttl = default_ttl
async def get_or_set(self, key: str, fetch_fn, ttl: int | None = None):
"""Cache-aside with stampede prevention via lock."""
cached = await self.redis.get(key)
if cached:
return json.loads(cached)
# Stampede prevention: only one caller computes
lock_key = f"lock:{key}"
acquired = await self.redis.set(lock_key, "1", ex=30, nx=True)
        if not acquired:
            # Another process is computing, wait and retry
            await asyncio.sleep(0.1)
            cached = await self.redis.get(key)
            if cached:
                return json.loads(cached)
        try:
            value = await fetch_fn()
            await self.redis.setex(key, ttl or self.ttl, json.dumps(value))
            return value
        finally:
            # Only the lock holder should release the lock
            if acquired:
                await self.redis.delete(lock_key)
# Write-through: update cache and DB atomically
async def update_user(user_id: str, data: dict, db, cache: CacheAside):
async with db.transaction():
await db.execute("UPDATE users SET ... WHERE id = $1", user_id)
await cache.redis.setex(
f"user:{user_id}",
cache.ttl,
json.dumps(data),
)
# Event-based invalidation
async def on_user_updated(event: UserUpdatedEvent, cache: CacheAside):
await cache.redis.delete(f"user:{event.user_id}")
# Related caches too
await cache.redis.delete(f"user-profile:{event.user_id}")Key rules:
- Always set TTL (1h default, 5min for volatile data)
- Use `orjson` for serialization performance over `json`
- Key naming: `{entity}:{id}` or `{entity}:{id}:{field}`
- Stampede prevention: use distributed locks for expensive computations
- Event-based invalidation for writes, TTL for reads
- Never use cache as primary storage (data loss risk)
Prevent Cumulative Layout Shift that causes content jumping and hurts search rankings — CRITICAL
CLS Prevention
Prevent Cumulative Layout Shift for the 2026 threshold of <= 0.08.
Reserve Space for Dynamic Content
/* Reserve space for images */
.image-container {
aspect-ratio: 16 / 9;
width: 100%;
}
/* Reserve space for ads */
.ad-slot {
min-height: 250px;
}
Explicit Dimensions
// Always set width and height
<img src="/photo.jpg" width={800} height={600} alt="Photo" />
// Next.js Image handles this automatically
<Image src="/photo.jpg" width={800} height={600} alt="Photo" />
// For responsive images
<Image src="/photo.jpg" fill sizes="(max-width: 768px) 100vw, 50vw" />Avoid Layout-Shifting Fonts
/* Use font-display: optional for non-critical fonts */
@font-face {
font-family: 'CustomFont';
src: url('/fonts/custom.woff2') format('woff2');
font-display: optional;
}
/* Or use size-adjust for fallback */
@font-face {
font-family: 'Fallback';
src: local('Arial');
size-adjust: 105%;
ascent-override: 95%;
}
Animations That Don't Cause Layout Shift
/* BAD: Changes layout properties */
.expanding {
height: 0;
transition: height 0.3s;
}
.expanding.open {
height: 200px; /* Causes layout shift */
}
/* GOOD: Use transform */
.expanding {
transform: scaleY(0);
transform-origin: top;
transition: transform 0.3s;
}
.expanding.open {
transform: scaleY(1);
}
Incorrect — Image without dimensions causes layout shift:
<img src="/photo.jpg" alt="Photo" />Correct — Explicit dimensions reserve space:
<img src="/photo.jpg" width={800} height={600} alt="Photo" />Key Rules
- Always set width/height on images
- Use `aspect-ratio` for responsive containers
- Use `font-display: optional` for non-critical fonts
- Never animate layout properties (width, height, top, left)
- Use `transform` and `opacity` for animations
- Reserve space for ads, embeds, and dynamic content
- Target <= 0.08 for 2026 thresholds
Optimize Interaction to Next Paint to ensure responsive button clicks and interactions — CRITICAL
INP Optimization
Optimize Interaction to Next Paint for the 2026 threshold of <= 150ms.
Break Up Long Tasks
// BAD: Long synchronous task (blocks main thread)
function processLargeArray(items: Item[]) {
items.forEach(processItem); // Blocks for entire duration
}
// GOOD: Yield to main thread
async function processLargeArray(items: Item[]) {
  let lastYield = performance.now();
  for (const item of items) {
    processItem(item);
    // Yield roughly every 50ms so pending input events can run
    if (performance.now() - lastYield > 50) {
      await (globalThis.scheduler?.yield() ?? new Promise((r) => setTimeout(r, 0)));
      lastYield = performance.now();
    }
  }
}
Use Transitions for Non-Urgent Updates
import { useState, useTransition, type ChangeEvent } from 'react';
function SearchResults() {
  const [query, setQuery] = useState('');
  const [filteredResults, setFilteredResults] = useState(() => filterResults(''));
  const [isPending, startTransition] = useTransition();
const handleChange = (e: ChangeEvent<HTMLInputElement>) => {
// Urgent: Update input immediately
setQuery(e.target.value);
// Non-urgent: Defer expensive filter
startTransition(() => {
setFilteredResults(filterResults(e.target.value));
});
};
return (
<>
<input value={query} onChange={handleChange} />
{isPending && <Spinner />}
<ResultsList results={filteredResults} />
</>
);
}
Optimize Event Handlers
// BAD: Heavy computation in click handler
<button onClick={() => {
const result = heavyComputation(); // Blocks paint
setResult(result);
}}>Calculate</button>
// GOOD: Defer heavy work
<button onClick={() => {
setLoading(true);
requestIdleCallback(() => {
const result = heavyComputation();
setResult(result);
setLoading(false);
});
}}>Calculate</button>
Incorrect — Blocking click handler delays visual feedback:
<button onClick={() => {
const result = heavyComputation(); // Blocks paint
setResult(result);
}}>Calculate</button>
Correct — Deferred work keeps UI responsive:
<button onClick={() => {
setLoading(true);
requestIdleCallback(() => {
const result = heavyComputation();
setResult(result);
setLoading(false);
});
}}>Calculate</button>
Key Rules
- Break long tasks > 50ms with `scheduler.yield()`
- Use `useTransition` for non-urgent state updates
- Defer heavy computation with `requestIdleCallback`
- Never block the main thread in event handlers
- Use `useDeferredValue` for expensive derived values (see the sketch below)
- Target <= 150ms for 2026 thresholds
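A minimal sketch of the `useDeferredValue` rule above, assuming a plain `items: string[]` prop: typing stays responsive while the filtered list lags a render behind.
import { useDeferredValue, useState } from 'react';
function FilteredList({ items }: { items: string[] }) {
  const [query, setQuery] = useState('');
  const deferredQuery = useDeferredValue(query);
  // Derived work tracks the deferred value, so keystrokes never block on it
  const filtered = items.filter((item) => item.includes(deferredQuery));
  return (
    <>
      <input value={query} onChange={(e) => setQuery(e.target.value)} />
      <ul>{filtered.map((item) => <li key={item}>{item}</li>)}</ul>
    </>
  );
}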
Optimize Largest Contentful Paint to improve search rankings and perceived page speed — CRITICAL
LCP Optimization
Optimize Largest Contentful Paint for the 2026 threshold of <= 2.0s.
Identify LCP Element
new PerformanceObserver((entryList) => {
const entries = entryList.getEntries();
const lastEntry = entries[entries.length - 1];
console.log('LCP element:', lastEntry.element);
console.log('LCP time:', lastEntry.startTime);
}).observe({ type: 'largest-contentful-paint', buffered: true });
Optimize LCP Images
// Priority loading for hero image
<img
src="/hero.webp"
alt="Hero"
fetchpriority="high"
loading="eager"
decoding="async"
/>
// Next.js Image with priority
import Image from 'next/image';
<Image
src="/hero.webp"
alt="Hero"
priority
sizes="100vw"
quality={85}
/>
Preload Critical Resources
<!-- Preload LCP image -->
<link rel="preload" as="image" href="/hero.webp" fetchpriority="high" />
<!-- Preload critical font -->
<link rel="preload" as="font" href="/fonts/inter.woff2" type="font/woff2" crossorigin />
<!-- Preconnect to critical origins -->
<link rel="preconnect" href="https://api.example.com" />
<link rel="dns-prefetch" href="https://analytics.example.com" />Server-Side Rendering
// Next.js - ensure SSR for LCP content
export default async function Page() {
const data = await fetchCriticalData();
return <HeroSection data={data} />; // Rendered on server
}
// BAD: LCP content loaded client-side
const [data, setData] = useState(null);
useEffect(() => { fetchData().then(setData); }, []);
Incorrect — Lazy-loading LCP image delays paint:
<img src="/hero.webp" alt="Hero" loading="lazy" />Correct — Priority loading for LCP image:
<img
src="/hero.webp"
alt="Hero"
fetchpriority="high"
loading="eager"
decoding="async"
/>
Key Rules
- Never lazy-load the LCP image
- Always use `fetchpriority="high"` on LCP images
- Always server-render LCP content
- Preload critical resources in `<head>`
- Preconnect to third-party origins used for LCP
- Target <= 2.0s for 2026 thresholds
Serve AVIF and WebP formats for 30-50% smaller files than JPEG at equivalent quality — HIGH
Modern Image Formats
Choose the right image format and quality settings for optimal compression.
Format Decision Matrix
| Format | Best For | Browser Support | Quality Setting |
|---|---|---|---|
| AVIF | Photos, gradients | 93%+ (2026) | 60-75 |
| WebP | Universal fallback | 97%+ | 75-82 |
| JPEG | Legacy fallback | 100% | 80-85 |
| PNG | Transparency, icons | 100% | N/A |
| SVG | Icons, logos | 100% | N/A |
Picture Element with Fallback
<picture>
<source srcset="/photo.avif" type="image/avif" />
<source srcset="/photo.webp" type="image/webp" />
<img src="/photo.jpg" alt="Photo" width="800" height="600" loading="lazy" />
</picture>
Build-Time Conversion
// vite.config.ts with vite-plugin-image-optimizer
import { imageOptimizer } from 'vite-plugin-image-optimizer';
export default defineConfig({
plugins: [
imageOptimizer({
avif: { quality: 72, effort: 4 },
webp: { quality: 78 },
jpeg: { quality: 82, progressive: true },
}),
],
});
Quality Guidelines
AVIF 60-75 — Best compression, slight encoding time cost
WebP 75-82 — Good balance, fastest encoding
JPEG 80-85 — Legacy only, use progressive encoding
Rule of thumb: lower quality for large hero images (more compression gain),
higher quality for small thumbnails (already small files).
Incorrect — Single JPEG format misses 30-50% compression savings:
<img src="/photo.jpg" alt="Photo" width="800" height="600" />
Correct — Modern formats with fallback:
<picture>
<source srcset="/photo.avif" type="image/avif" />
<source srcset="/photo.webp" type="image/webp" />
<img src="/photo.jpg" alt="Photo" width="800" height="600" />
</picture>
Key rules:
- Prefer AVIF as primary format with WebP fallback
- Use quality 72-78 for AVIF and WebP (visually lossless for most photos)
- Always include a JPEG/PNG fallback in `<picture>`
- Use progressive JPEG for any remaining JPEG images
- Automate format conversion in the build pipeline, not manually
Use Next.js Image component for automatic lazy loading, responsive sizing, and format negotiation — HIGH
Next.js Image Component
Use the Next.js Image component for automatic optimization, format negotiation, and responsive sizing.
Priority Hero Image
import Image from 'next/image';
export default function Hero() {
return (
<Image
src="/hero.webp"
alt="Product hero"
width={1200}
height={630}
priority // Disables lazy loading, adds preload hint
sizes="100vw"
quality={85}
/>
);
}
Blur Placeholder
// Static imports generate blurDataURL automatically
import heroImg from '@/public/hero.jpg';
<Image
src={heroImg}
alt="Hero"
placeholder="blur" // Uses auto-generated blurDataURL
priority
/>
// For remote images, provide blurDataURL manually
<Image
src="https://cdn.example.com/photo.jpg"
alt="Photo"
width={800}
height={600}
placeholder="blur"
blurDataURL="data:image/jpeg;base64,/9j/4AAQ..."
/>
Custom Loader for CDN
// next.config.js
module.exports = {
images: {
loader: 'custom',
loaderFile: './lib/image-loader.ts',
},
};
// lib/image-loader.ts
export default function cloudflareLoader({
src, width, quality,
}: { src: string; width: number; quality?: number }) {
const params = [`width=${width}`, `quality=${quality || 80}`, 'format=auto'];
return `https://cdn.example.com/cdn-cgi/image/${params.join(',')}/${src}`;
}
Responsive Fill Layout
<div style={{ position: 'relative', width: '100%', aspectRatio: '16/9' }}>
<Image
src="/banner.jpg"
alt="Banner"
fill
sizes="(max-width: 768px) 100vw, (max-width: 1200px) 50vw, 33vw"
style={{ objectFit: 'cover' }}
/>
</div>
Incorrect — Missing sizes causes incorrect srcset selection:
<Image src="/banner.jpg" alt="Banner" fill />Correct — Sizes hint ensures optimal image size:
<Image
src="/banner.jpg"
alt="Banner"
fill
sizes="(max-width: 768px) 100vw, 50vw"
/>
Key rules:
- Set `priority` on the LCP image (only one per page)
- Always provide `sizes` for responsive images
- Use `placeholder="blur"` for visible images to prevent CLS
- Use a custom loader for external CDN image transformation
- Use `fill` with `sizes` for responsive containers instead of fixed dimensions
Serve appropriately sized responsive images per viewport to avoid oversized mobile downloads — HIGH
Responsive Images
Serve the right image size for every viewport and device pixel ratio.
Srcset with Sizes
<img
src="/photo-800.jpg"
srcset="
/photo-400.jpg 400w,
/photo-800.jpg 800w,
/photo-1200.jpg 1200w,
/photo-1600.jpg 1600w
"
sizes="(max-width: 640px) 100vw,
(max-width: 1024px) 50vw,
33vw"
alt="Product photo"
loading="lazy"
width="800"
height="600"
/>
Art Direction with Picture
<!-- Different crops for different viewports -->
<picture>
<source
media="(max-width: 640px)"
srcset="/hero-mobile.avif 640w, /hero-mobile-2x.avif 1280w"
sizes="100vw"
type="image/avif"
/>
<source
media="(min-width: 641px)"
srcset="/hero-desktop.avif 1200w, /hero-desktop-2x.avif 2400w"
sizes="66vw"
type="image/avif"
/>
<img src="/hero-desktop.jpg" alt="Hero" width="1200" height="630" />
</picture>
CDN Image Transformation URLs
// Cloudflare Image Resizing
function cfImage(src: string, width: number, quality = 80) {
return `https://cdn.example.com/cdn-cgi/image/w=${width},q=${quality},f=auto/${src}`;
}
// Imgix
function imgixUrl(src: string, width: number, quality = 80) {
return `${src}?w=${width}&q=${quality}&auto=format,compress`;
}
// Usage in React
<img
src={cfImage('/photos/product.jpg', 800)}
srcset={`
${cfImage('/photos/product.jpg', 400)} 400w,
${cfImage('/photos/product.jpg', 800)} 800w,
${cfImage('/photos/product.jpg', 1200)} 1200w
`}
sizes="(max-width: 768px) 100vw, 50vw"
alt="Product"
loading="lazy"
/>
Incorrect — srcset without sizes lets browser guess:
<img
srcset="/photo-400.jpg 400w, /photo-800.jpg 800w"
src="/photo-800.jpg"
alt="Photo"
/>
Correct — sizes guides browser to optimal choice:
<img
srcset="/photo-400.jpg 400w, /photo-800.jpg 800w"
sizes="(max-width: 640px) 100vw, 50vw"
src="/photo-800.jpg"
alt="Photo"
width="800"
height="600"
/>
Key rules:
- Always provide `sizes` alongside `srcset` for width descriptors
- Use 3-4 srcset breakpoints (400, 800, 1200, 1600) for most images
- Use `<picture>` with `media` for art direction (different crops)
- Delegate resizing to a CDN rather than shipping multiple static files
- Set explicit `width` and `height` to prevent CLS
Quantize models to reduce size 2-4x with minimal quality loss for fewer GPUs — MEDIUM
Model Quantization
Reduce model memory footprint and increase throughput with quantization.
Method Decision Matrix
| Method | Precision | Speed | Quality | Best For |
|---|---|---|---|---|
| FP16 | 16-bit | Baseline | Best | When VRAM allows |
| FP8 | 8-bit | 1.5x | Near-FP16 | Hopper/Ada GPUs (H100, L40S) |
| AWQ | 4-bit | 1.8x | Good | Production serving, speed priority |
| GPTQ | 4-bit | 1.6x | Better | Quality-sensitive tasks |
| GGUF | 2-8 bit | Varies | Varies | CPU/hybrid inference (llama.cpp) |
vLLM with AWQ
# Serve a pre-quantized AWQ model
docker run --gpus '"device=0"' \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model TheBloke/Llama-3.1-8B-Instruct-AWQ \
--quantization awq \
  --max-model-len 8192
vLLM with FP8 (Hopper GPUs)
# FP8 on H100 — native hardware support, no pre-quantized model needed
docker run --gpus all \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-70B-Instruct \
--quantization fp8 \
  --tensor-parallel-size 4
VRAM Requirements (Approximate)
Model FP16 FP8 AWQ/GPTQ (4-bit)
7-8B 16 GB 9 GB 5 GB
13B 26 GB 14 GB 8 GB
70B 140 GB 75 GB 40 GB
Formula: VRAM ≈ params × bytes_per_param × 1.2 (KV cache overhead)
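Worked through for an 8B model (the table's figures appear to be weights-only; the ×1.2 factor adds KV-cache headroom on top):
Example: Llama-3.1-8B at FP16
  weights: 8e9 params × 2 bytes = 16 GB (matches the table row)
  with KV-cache headroom: 16 GB × 1.2 ≈ 19.2 GB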
Quality Validation
# Always benchmark quantized vs full precision on YOUR task
def eval_quantized(client, test_cases):
results = []
for case in test_cases:
response = client.chat.completions.create(
model="quantized-model",
messages=case["messages"],
max_tokens=case["max_tokens"],
)
results.append(score(response, case["expected"]))
return sum(results) / len(results)
# Accept quantization if quality >= 95% of FP16 baseline
Incorrect — FP16 on smaller GPUs wastes VRAM:
docker run --gpus all \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 8
# Requires 140 GB VRAM
Correct — FP8 quantization reduces VRAM by ~45%:
docker run --gpus all \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-70B-Instruct \
--quantization fp8 \
--tensor-parallel-size 4
# Requires 75 GB VRAM
Key rules:
- Use FP8 on Hopper/Ada GPUs (best speed/quality tradeoff)
- Use AWQ for maximum throughput on older GPUs
- Use GPTQ when quality matters more than speed
- Always validate quantized model quality on your specific task
- Pre-quantized models (e.g., TheBloke) save quantization time
Apply speculative decoding to generate draft tokens in parallel and reduce inference latency — MEDIUM
Speculative Decoding
Use speculative decoding to reduce per-token latency without sacrificing output quality.
How It Works
Traditional: token1 → token2 → token3 → token4 (4 forward passes)
Speculative: draft: token1, token2, token3 (fast, cheap)
verify: accept/reject all 3 (1 forward pass)
Result: 3 tokens in ~1.3 forward passes
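The ~1.3 figure follows from the standard speculative-sampling expectation (Leviathan et al., 2023), sketched here assuming an independent per-token draft acceptance probability α:
E[tokens per verification pass] = (1 - α^(γ+1)) / (1 - α)
  γ = draft tokens per pass, α = per-token acceptance probability

Example: α = 0.8, γ = 3
  (1 - 0.8^4) / 0.2 ≈ 2.95 tokens/pass, so 4 tokens ≈ 1.35 passes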
N-Gram Speculation (No Draft Model)
# vLLM n-gram speculation — uses prompt tokens as draft source
# Best for repetitive/structured output (JSON, code, templates)
docker run --gpus '"device=0"' \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--speculative-model [ngram] \
--num-speculative-tokens 5 \
  --ngram-prompt-lookup-max 4
Draft Model Speculation
# Use a smaller model as the draft (must share tokenizer)
docker run --gpus '"device=0"' \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-70B-Instruct \
--speculative-model meta-llama/Llama-3.1-8B-Instruct \
--num-speculative-tokens 5 \
  --tensor-parallel-size 4
Acceptance Rate Tuning
--num-speculative-tokens:
3 → Conservative, high acceptance rate (~85%)
5 → Balanced (default recommendation)
8 → Aggressive, lower acceptance rate (~60%)
Monitor via vLLM metrics:
vllm:spec_decode_acceptance_rate → target > 70%
If acceptance < 60%:
1. Reduce --num-speculative-tokens
2. Try n-gram for structured output
3. Verify draft model matches target model's style
When to Use Each Approach
N-gram speculation:
+ Structured output (JSON, SQL, code)
+ Repetitive patterns
+ No extra GPU memory needed
- Creative / diverse text
Draft model speculation:
+ General text generation
+ Large target models (70B+)
+ Higher acceptance rates on diverse tasks
- Requires extra GPU memory for draft model
Incorrect — No speculation means sequential token generation:
docker run --gpus '"device=0"' \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct
# 4 tokens = 4 forward passes
Correct — N-gram speculation reduces passes by 30-60%:
docker run --gpus '"device=0"' \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--speculative-model [ngram] \
--num-speculative-tokens 5
# 4 tokens ≈ 1.3 forward passes
Key rules:
- Use n-gram speculation for structured/repetitive output (free, no extra VRAM)
- Use draft model speculation for general text with large target models
- Start with `--num-speculative-tokens 5` and tune based on acceptance rate
- Monitor acceptance rate; reduce tokens if below 60%
- Output quality is identical to non-speculative decoding (mathematically guaranteed)
Deploy vLLM with PagedAttention and continuous batching for 2-4x higher inference throughput — MEDIUM
vLLM Deployment
Deploy LLMs with vLLM for high-throughput, low-latency inference.
Docker Deployment
# Single GPU
docker run --gpus '"device=0"' \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 8192 \
--gpu-memory-utilization 0.90
# Multi-GPU with tensor parallelism
docker run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 8192 \
  --gpu-memory-utilization 0.92
Python Client
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Explain PagedAttention briefly."}],
max_tokens=256,
temperature=0.7,
)
print(response.choices[0].message.content)
Key Architecture Concepts
PagedAttention:
- KV cache stored in non-contiguous pages (like OS virtual memory)
- Eliminates memory waste from pre-allocated contiguous blocks
- Enables 2-4x more concurrent sequences
Continuous Batching:
- New requests join running batch immediately
- No waiting for longest sequence to finish
- Throughput: 10-30 requests/second on single A100 (8B model)
Tensor Parallelism:
- Splits model across GPUs (--tensor-parallel-size N)
- Rule: N = number of GPUs, must evenly divide model layers
- Use for models > single GPU VRAM
Incorrect — Untuned memory settings limit concurrent requests:
docker run --gpus '"device=0"' \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct
# Defaults reserve KV cache for the model's full context window
Correct — Higher utilization enables more concurrent requests:
docker run --gpus '"device=0"' \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--gpu-memory-utilization 0.90 \
--max-model-len 8192
# 2-4x more concurrent requests
Key rules:
- Set `--gpu-memory-utilization 0.90` (leave headroom for KV cache)
- Use `--tensor-parallel-size` equal to the number of GPUs
- Use the OpenAI-compatible API for drop-in compatibility
- Monitor the `vllm:num_requests_running` Prometheus metric for load
- Set `--max-model-len` to the actual max you need (lower = more concurrent requests)
Defer component loading with React.lazy to reduce initial bundle size and improve TTI — HIGH
Lazy Component Loading
Use React.lazy with Suspense to load components on demand and reduce initial bundle size.
Basic Pattern
import { lazy, Suspense } from 'react';
const Dashboard = lazy(() => import('./Dashboard'));
const Settings = lazy(() => import('./Settings'));
function App() {
return (
<Suspense fallback={<DashboardSkeleton />}>
<Dashboard />
</Suspense>
);
}
Error Boundary for Failed Imports
import { Component, lazy, Suspense } from 'react';
class LazyErrorBoundary extends Component<
{ fallback: React.ReactNode; children: React.ReactNode },
{ hasError: boolean }
> {
state = { hasError: false };
static getDerivedStateFromError() {
return { hasError: true };
}
retry = () => this.setState({ hasError: false });
render() {
if (this.state.hasError) {
      return <>{this.props.fallback}<button onClick={this.retry}>Retry</button></>;
}
return this.props.children;
}
}
// Usage
<LazyErrorBoundary fallback={<p>Failed to load</p>}>
<Suspense fallback={<Skeleton />}>
<LazyComponent />
</Suspense>
</LazyErrorBoundary>
Skeleton Fallback
function DashboardSkeleton() {
return (
<div className="animate-pulse space-y-4">
<div className="h-8 bg-gray-200 rounded w-1/3" />
<div className="h-64 bg-gray-200 rounded" />
</div>
);
}
Incorrect — Missing Suspense fallback causes error:
const Dashboard = lazy(() => import('./Dashboard'));
function App() {
return <Dashboard />; // Error: no Suspense boundary
}
Correct — Suspense with skeleton fallback:
const Dashboard = lazy(() => import('./Dashboard'));
function App() {
return (
<Suspense fallback={<DashboardSkeleton />}>
<Dashboard />
</Suspense>
);
}
Key rules:
- Wrap every `lazy()` component in a `Suspense` boundary
- Add an error boundary around Suspense for network failures
- Use skeleton fallbacks that match the loaded component's layout
- Never lazy-load above-the-fold or LCP-critical components
- Group related lazy components under a single Suspense boundary
Prefetch resources before user needs them to make navigation feel instant — HIGH
Prefetch Strategies
Proactively load resources before the user navigates to reduce perceived latency.
Module Preload Hints
<!-- Preload critical JS modules -->
<link rel="modulepreload" href="/assets/dashboard-abc123.js" />
<link rel="modulepreload" href="/assets/shared-chunk-def456.js" />
<!-- Prefetch likely next pages (low priority) -->
<link rel="prefetch" href="/assets/settings-ghi789.js" />Prefetch on Hover
function NavLink({ to, children }: { to: string; children: React.ReactNode }) {
const prefetchRef = useRef(false);
const handlePointerEnter = () => {
if (prefetchRef.current) return;
prefetchRef.current = true;
import(`./pages/${to}.tsx`); // Triggers prefetch
};
return (
<a href={`/${to}`} onPointerEnter={handlePointerEnter}>
{children}
</a>
);
}
Prefetch on Viewport Intersection
function usePrefetchOnVisible(importFn: () => Promise<unknown>) {
const ref = useRef<HTMLElement>(null);
const loaded = useRef(false);
useEffect(() => {
const el = ref.current;
if (!el) return;
const observer = new IntersectionObserver(([entry]) => {
if (entry.isIntersecting && !loaded.current) {
loaded.current = true;
importFn();
observer.disconnect();
}
}, { rootMargin: '200px' });
observer.observe(el);
return () => observer.disconnect();
}, [importFn]);
return ref;
}
// Usage
const ref = usePrefetchOnVisible(() => import('./HeavySection'));
<div ref={ref}><Suspense fallback={null}><HeavySection /></Suspense></div>
Import on Interaction
// Load a heavy module only when the user clicks
async function handleExport() {
const { exportToPDF } = await import('./exportUtils');
await exportToPDF(data);
}
<button onClick={handleExport}>Export PDF</button>
Incorrect — No prefetching causes delayed navigation:
<a href="/dashboard">Dashboard</a>Correct — Hover prefetch gives 200-400ms head start:
function NavLink({ to, children }) {
const prefetchRef = useRef(false);
const handlePointerEnter = () => {
if (prefetchRef.current) return;
prefetchRef.current = true;
import(`./pages/${to}.tsx`);
};
return (
<a href={`/${to}`} onPointerEnter={handlePointerEnter}>
{children}
</a>
);
}
Key rules:
- Use `modulepreload` for critical JS the current page needs
- Use `prefetch` for resources the user will likely need next
- Prefetch on hover for navigation links (200-400ms head start)
- Prefetch on intersection for below-the-fold heavy sections
- Import on interaction for rarely-used heavy features
Split code at route boundaries so users only download code for the visited page — HIGH
Route-Based Code Splitting
Split your bundle at route boundaries so each page loads only its own code.
React Router 7.x Lazy Routes
import { createBrowserRouter } from 'react-router';
const router = createBrowserRouter([
{
path: '/',
lazy: () => import('./pages/Home'),
},
{
path: '/dashboard',
lazy: () => import('./pages/Dashboard'),
},
{
path: '/settings',
lazy: () => import('./pages/Settings'),
},
]);
Named Exports for Lazy Routes
// pages/Dashboard.tsx — export Component and loader
export async function loader() {
return fetchDashboardData();
}
export function Component() {
const data = useLoaderData();
return <DashboardView data={data} />;
}
Component.displayName = 'Dashboard';
Chunk Naming
// Webpack — webpackChunkName magic comment
const Dashboard = lazy(
() => import(/* webpackChunkName: "dashboard" */ './pages/Dashboard')
);
// Vite/Rollup — use rollupOptions for manual chunks
// vite.config.ts
export default defineConfig({
build: {
rollupOptions: {
output: {
manualChunks: {
vendor: ['react', 'react-dom'],
charts: ['recharts', 'd3'],
},
},
},
},
});
Incorrect — Eager imports bundle all routes together:
import Home from './pages/Home';
import Dashboard from './pages/Dashboard';
import Settings from './pages/Settings';
const router = createBrowserRouter([
{ path: '/', element: <Home /> },
{ path: '/dashboard', element: <Dashboard /> },
{ path: '/settings', element: <Settings /> },
]);
Correct — Lazy routes split per-page bundles:
const router = createBrowserRouter([
{ path: '/', lazy: () => import('./pages/Home') },
{ path: '/dashboard', lazy: () => import('./pages/Dashboard') },
{ path: '/settings', lazy: () => import('./pages/Settings') },
]);
Key rules:
- Split at route boundaries as the minimum splitting strategy
- Use React Router `lazy` for automatic route-level splitting
- Export `Component` and `loader` as named exports for lazy routes
- Name chunks for readable build output and caching
- Group vendor libraries into shared chunks to avoid duplication
Profile Python backends with py-spy to find CPU hotspots and memory leaks in production — MEDIUM
Python Backend Profiling
Profile Python services to find CPU bottlenecks and memory leaks.
py-spy for Production Sampling
# Attach to running process (no restart needed)
py-spy top --pid 12345
# Generate flamegraph SVG
py-spy record -o profile.svg --pid 12345 --duration 30
# Profile a script directly
py-spy record -o profile.svg -- python manage.py runserver
# Sample at higher rate for short-lived operations
py-spy record --rate 250 -o profile.svg -- python batch_job.py
cProfile for Development
import cProfile
import pstats
# Profile a function
with cProfile.Profile() as pr:
result = expensive_function()
stats = pstats.Stats(pr)
stats.sort_stats('cumulative')
stats.print_stats(20) # Top 20 functions
# One-liner from command line
# python -m cProfile -s cumulative app.py
memory_profiler for Memory Leaks
from memory_profiler import profile
@profile
def process_data():
data = load_large_dataset() # +500 MiB
filtered = filter_items(data) # +200 MiB
del data # -500 MiB
return summarize(filtered)
# Command line: python -m memory_profiler script.py
FastAPI Middleware Profiling
import time
from fastapi import Request
@app.middleware("http")
async def profile_requests(request: Request, call_next):
start = time.perf_counter()
response = await call_next(request)
duration = time.perf_counter() - start
if duration > 0.5: # Log slow requests
print(f"SLOW: {request.method} {request.url.path} took {duration:.2f}s")
response.headers["X-Response-Time"] = f"{duration:.3f}"
    return response
Incorrect — cProfile in production requires code changes:
# Must instrument code manually
with cProfile.Profile() as pr:
    result = expensive_function()
Correct — py-spy attaches to running process with zero overhead:
# No code changes, no restart needed
py-spy record -o profile.svg --pid 12345 --duration 30
Key rules:
- Use py-spy in production (zero overhead when not profiling, no code changes)
- Use cProfile in development for detailed call graphs
- Use memory_profiler to track per-line memory allocation
- Profile under realistic load, not just unit test conditions
- Focus on the top 3-5 hotspots by cumulative time
Analyze bundles to reveal bloated dependencies and missed tree-shaking that inflate load times — MEDIUM
Bundle Analysis
Analyze and optimize JavaScript bundle size with visualization tools and CI budgets.
Webpack Bundle Analyzer
// webpack.config.js
const { BundleAnalyzerPlugin } = require('webpack-bundle-analyzer');
module.exports = {
plugins: [
new BundleAnalyzerPlugin({
analyzerMode: 'static', // Generates HTML report
openAnalyzer: false,
reportFilename: 'bundle-report.html',
}),
],
};
Vite / Rollup Visualizer
// vite.config.ts
import { visualizer } from 'rollup-plugin-visualizer';
export default defineConfig({
plugins: [
visualizer({
filename: 'bundle-report.html',
gzipSize: true,
brotliSize: true,
}),
],
});
Performance Budgets in CI
// bundlesize.config.json
{
"files": [
{ "path": "dist/assets/index-*.js", "maxSize": "150 kB", "compression": "gzip" },
{ "path": "dist/assets/vendor-*.js", "maxSize": "80 kB", "compression": "gzip" },
{ "path": "dist/assets/*.css", "maxSize": "30 kB", "compression": "gzip" }
]
}
# .github/workflows/bundle-check.yml
- name: Check bundle size
run: npx bundlesize
env:
    BUNDLESIZE_GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
Import Cost Awareness
// BAD: Imports entire library (70 kB)
import _ from 'lodash';
const sorted = _.sortBy(items, 'name');
// GOOD: Import single function (4 kB)
import sortBy from 'lodash/sortBy';
const sorted = sortBy(items, 'name');
// BEST: Use native (0 kB)
const sorted = items.toSorted((a, b) => a.name.localeCompare(b.name));
Incorrect — Importing entire lodash adds 70 kB:
import _ from 'lodash';
const sorted = _.sortBy(items, 'name');
Correct — Import single function or use native API:
// Option 1: Import only what you need (4 kB)
import sortBy from 'lodash/sortBy';
const sorted = sortBy(items, 'name');
// Option 2: Use native API (0 kB)
const sorted = items.toSorted((a, b) => a.name.localeCompare(b.name));
Key rules:
- Run bundle analysis on every release to catch regressions
- Set CI performance budgets (fail build if exceeded)
- Import only what you use from large libraries
- Check gzip/brotli sizes, not raw sizes
- Replace large dependencies with native APIs when possible
Profile React components with DevTools to identify unnecessary re-renders and their causes — MEDIUM
React DevTools Profiler
Use the React DevTools Profiler to identify and fix unnecessary re-renders.
Flamegraph Workflow
1. Open React DevTools → Profiler tab
2. Click "Record" → Interact with the UI → Click "Stop"
3. Read the flamegraph:
- Yellow/red bars = slow renders (> 16ms)
- Gray bars = did not render
- Click a bar → see "Why did this render?"
4. Focus on components that render often AND take long
Programmatic Profiler
import { Profiler } from 'react';
function onRenderCallback(
id: string,
phase: 'mount' | 'update',
actualDuration: number,
) {
if (actualDuration > 16) {
console.warn(`Slow render: ${id} (${phase}) took ${actualDuration.toFixed(1)}ms`);
}
}
<Profiler id="Dashboard" onRender={onRenderCallback}>
<Dashboard />
</Profiler>
Why Did You Render Setup
// wdyr.ts — import BEFORE React in development
import React from 'react';
if (process.env.NODE_ENV === 'development') {
const { default: whyDidYouRender } = await import(
'@welldone-software/why-did-you-render'
);
whyDidYouRender(React, {
trackAllPureComponents: true,
logOnDifferentValues: true,
});
}
// Mark specific components for tracking
MyComponent.whyDidYouRender = true;
Interpreting Render Reasons
"Props changed" → Check which prop, was it a new object/array?
"State changed" → Expected, verify state is colocated
"Parent rendered" → Parent re-renders, child doesn't memo
"Context changed" → Split context or use selectors
"Hooks changed" → useMemo/useCallback dependency changedIncorrect — Blind memoization without profiling:
const MemoizedComponent = memo(Component);
const memoizedValue = useMemo(() => value, []);
const callback = useCallback(() => {}, []);
// Added optimization without measurement
Correct — Profile first, then optimize actual bottlenecks:
// 1. Open React DevTools Profiler
// 2. Record interaction
// 3. Identify slow renders (yellow/red bars > 16ms)
// 4. Check "Why did this render?"
// 5. Apply targeted fix only where needed
Key rules:
- Profile first before adding any memoization
- Focus on components that are both frequent AND slow (> 16ms)
- Use "Why did this render?" to find the root cause
- Use Why Did You Render in development for automatic detection
- Ignore gray (not rendered) components in the flamegraph
Apply TanStack Query optimistic updates for instant UI feedback with automatic rollback — HIGH
TanStack Query Optimistic Updates
Show immediate UI feedback before server confirmation with proper rollback on error.
Incorrect — mutation without optimistic update:
// WRONG: User waits for server roundtrip
const mutation = useMutation({
mutationFn: updateTodo,
onSuccess: () => {
queryClient.invalidateQueries({ queryKey: ['todos'] }); // Refetches after delay
},
});
// UI feels sluggish — user sees spinner for 200-500ms
Correct — optimistic update with rollback:
import { useMutation, useQueryClient } from '@tanstack/react-query';
function useUpdateTodo() {
const queryClient = useQueryClient();
return useMutation({
mutationFn: updateTodo,
onMutate: async (newTodo) => {
// 1. Cancel outgoing refetches (prevent race condition)
await queryClient.cancelQueries({ queryKey: ['todos', newTodo.id] });
// 2. Snapshot previous value for rollback
const previousTodo = queryClient.getQueryData(['todos', newTodo.id]);
// 3. Optimistically update cache
queryClient.setQueryData(['todos', newTodo.id], newTodo);
// 4. Return context for rollback
return { previousTodo };
},
onError: (_err, newTodo, context) => {
// Rollback to previous value on error
queryClient.setQueryData(['todos', newTodo.id], context?.previousTodo);
},
onSettled: (_data, _error, variables) => {
// Always reconcile with server after mutation
queryClient.invalidateQueries({ queryKey: ['todos', variables.id] });
},
});
}
// Optimistic list update (add to list)
function useAddTodo() {
const queryClient = useQueryClient();
return useMutation({
mutationFn: createTodo,
onMutate: async (newTodo) => {
await queryClient.cancelQueries({ queryKey: ['todos'] });
const previousTodos = queryClient.getQueryData<Todo[]>(['todos']);
// Immutable update (NEVER mutate cache directly)
      queryClient.setQueryData<Todo[]>(['todos'], (old) =>
        old ? [...old, { ...newTodo, id: 'temp-id' }] : [{ ...newTodo, id: 'temp-id' }]
      );
return { previousTodos };
},
onError: (_err, _newTodo, context) => {
queryClient.setQueryData(['todos'], context?.previousTodos);
},
onSettled: () => {
queryClient.invalidateQueries({ queryKey: ['todos'] });
},
});
}
Track pending mutations:
import { useMutationState } from '@tanstack/react-query';
function PendingTodos() {
const pendingMutations = useMutationState({
filters: { mutationKey: ['addTodo'], status: 'pending' },
select: (mutation) => mutation.state.variables as Todo,
});
return (
<>
{pendingMutations.map((todo) => (
<TodoItem key={todo.id} todo={todo} isPending />
))}
</>
);
}
Key rules:
- Always cancel outgoing queries in `onMutate` to prevent race conditions
- Always return context from `onMutate` for rollback capability
- Use immutable updates: `[...old, newItem]` not `old.push(newItem)`
- Always `invalidateQueries` in `onSettled` to reconcile with server
- Use `useMutationState` to show pending items in the UI
- Selective invalidation: `queryKey: ['todos', id]` not a bare `queryClient.invalidateQueries()` (which invalidates everything)
Prefetch TanStack queries on hover or in route loaders for instant page transitions — HIGH
TanStack Query Prefetching
Prefetch data before navigation for instant page transitions using TanStack Query v5.
Incorrect — fetching data only when component mounts:
// WRONG: User clicks link, waits for data to load
function UserProfile({ userId }: { userId: string }) {
const { data, isPending } = useQuery({
queryKey: ['user', userId],
queryFn: () => fetchUser(userId),
});
if (isPending) return <Skeleton />; // User sees skeleton every time
return <div>{data.name}</div>;
}
Correct — prefetch on hover and in route loaders:
// 1. Define reusable query options (v5 pattern)
const userQueryOptions = (id: string) => queryOptions({
queryKey: ['user', id] as const,
queryFn: () => fetchUser(id),
staleTime: 5 * 60 * 1000, // Fresh for 5 min
});
// 2. Prefetch on hover
function UserLink({ userId }: { userId: string }) {
const queryClient = useQueryClient();
const prefetchUser = () => {
queryClient.prefetchQuery(userQueryOptions(userId));
};
return (
<Link
to={`/users/${userId}`}
onMouseEnter={prefetchUser}
onFocus={prefetchUser}
>
View User
</Link>
);
}
// 3. Prefetch in route loader (React Router 7.x)
export const loader = (queryClient: QueryClient) =>
async ({ params }: { params: { id: string } }) => {
await queryClient.ensureQueryData(userQueryOptions(params.id));
return null;
};
// 4. Use with Suspense for instant render
function UserProfile({ userId }: { userId: string }) {
// Data already loaded by prefetch — no loading state!
const { data } = useSuspenseQuery(userQueryOptions(userId));
return <div>{data.name}</div>;
}
QueryClient configuration:
const queryClient = new QueryClient({
defaultOptions: {
queries: {
staleTime: 1000 * 60, // 1 min fresh
gcTime: 1000 * 60 * 5, // 5 min in cache (formerly cacheTime)
refetchOnWindowFocus: true, // Refetch on tab focus
retry: 3,
retryDelay: (attemptIndex) => Math.min(1000 * 2 ** attemptIndex, 30000),
},
},
});
Key rules:
- Use the `queryOptions()` helper for reusable query definitions across prefetch/useQuery/loader
- Prefetch on `onMouseEnter` AND `onFocus` for keyboard users
- Use `ensureQueryData` in loaders (waits for data), `prefetchQuery` for fire-and-forget
- Use `useSuspenseQuery` for components where data is guaranteed by a loader
- `gcTime` (v5) replaces `cacheTime` (v4) — controls how long unused data stays in memory
- `isPending` (v5) replaces `isLoading` for initial load state
Let React Compiler auto-memoize components, values, callbacks, and JSX automatically — HIGH
React Compiler
React 19's compiler is the primary approach to render optimization in 2026.
Decision Tree
Is React Compiler enabled?
├─ YES → Let compiler handle memoization automatically
│ Only use useMemo/useCallback as escape hatches
│ DevTools shows "Memo ✨" badge
│
└─ NO → Profile first, then optimize
1. React DevTools Profiler
2. Identify actual bottlenecks
  3. Apply targeted optimizations
What the Compiler Memoizes
- Component re-renders
- Intermediate values (like useMemo)
- Callback references (like useCallback)
- JSX elements
Enabling the Compiler
// next.config.js (Next.js 16+)
const nextConfig = {
reactCompiler: true,
}
// Expo SDK 54+ enables by default
Verification
Open React DevTools and look for the "Memo ✨" badge on components. If present, the compiler is successfully memoizing that component.
Incorrect — Manual memoization when compiler is enabled:
// next.config.js has reactCompiler: true
const value = useMemo(() => compute(data), [data]);
const callback = useCallback(() => handle(), []);
// Compiler already handles this automatically
Correct — Let compiler auto-memoize:
// Compiler handles memoization automatically
function Component({ data }) {
const value = compute(data); // Auto-memoized
const handle = () => {}; // Auto-memoized
return <div onClick={handle}>{value}</div>;
}
// Check DevTools for "Memo ✨" badge
Key Rules
- Enable React Compiler as the first step
- Let the compiler handle memoization automatically
- Verify with DevTools "Memo ✨" badge
- Only use manual memoization as escape hatches
- Profile before adding any manual optimization
Use manual useMemo and useCallback escape hatches when React Compiler cannot optimize — HIGH
Manual Memoization Escape Hatches
Use useMemo/useCallback as escape hatches when React Compiler is insufficient.
When Manual Memoization Is Needed
// 1. Effect dependencies that shouldn't trigger re-runs
const stableConfig = useMemo(() => ({
apiUrl: process.env.API_URL
}), [])
useEffect(() => {
initializeSDK(stableConfig)
}, [stableConfig])
// 2. Third-party libraries without compiler support
const memoizedValue = useMemo(() =>
expensiveThirdPartyComputation(data), [data])
// 3. Precise control over memoization boundaries
const handleClick = useCallback(() => {
// Critical callback that must be stable
}, [dependency])
State Colocation
Move state as close to where it's used as possible:
// BAD: State too high - causes unnecessary re-renders
function App() {
const [filter, setFilter] = useState('')
  return (
    <>
      <Header /> {/* Re-renders on filter change! */}
      <FilterInput value={filter} onChange={setFilter} />
      <List filter={filter} />
    </>
  )
}
// GOOD: State colocated - minimal re-renders
function App() {
  return (
    <>
      <Header />
      <FilterableList /> {/* State inside */}
    </>
  )
}
Profiling Workflow
- React DevTools Profiler: Record, interact, analyze
- Identify: Components with high render counts or duration
- Verify: Is the re-render actually causing perf issues?
- Fix: Apply targeted optimization
- Measure: Confirm improvement
Incorrect — State too high causes unnecessary re-renders:
function App() {
const [filter, setFilter] = useState('');
return (
<>
<Header /> {/* Re-renders on filter change! */}
<FilterInput value={filter} onChange={setFilter} />
<List filter={filter} />
</>
);
}
Correct — State colocated minimizes re-renders:
function App() {
return (
<>
<Header />
<FilterableList /> {/* State inside, Header unaffected */}
</>
);
}
Key Rules
- Profile first — never optimize without measurement
- Colocate state as close to usage as possible
- Use `useMemo` for effect dependencies that must be stable
- Use `useCallback` for callbacks passed to memoized children
- Split context into state and dispatch providers (see the sketch below)
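A minimal sketch of the state/dispatch split from the last rule (hypothetical counter reducer): `dispatch` has a stable identity, so components that consume only `DispatchContext` never re-render when state changes.
import { createContext, useReducer, type Dispatch, type ReactNode } from 'react';
type State = { count: number };
type Action = { type: 'increment' };
const StateContext = createContext<State | null>(null);
const DispatchContext = createContext<Dispatch<Action> | null>(null);
function reducer(state: State, action: Action): State {
  return action.type === 'increment' ? { count: state.count + 1 } : state;
}
function AppProvider({ children }: { children: ReactNode }) {
  const [state, dispatch] = useReducer(reducer, { count: 0 });
  return (
    <DispatchContext.Provider value={dispatch}>
      <StateContext.Provider value={state}>{children}</StateContext.Provider>
    </DispatchContext.Provider>
  );
}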
Virtualize long lists to render only visible items for smooth scrolling performance — HIGH
List Virtualization
Use TanStack Virtual for efficient rendering of large lists.
Virtualization Thresholds
| Item Count | Recommendation |
|---|---|
| < 100 | Regular rendering usually fine |
| 100-500 | Consider virtualization |
| 500+ | Virtualization required |
Basic Setup
import { useVirtualizer } from '@tanstack/react-virtual'
function VirtualList({ items }) {
const parentRef = useRef(null)
const virtualizer = useVirtualizer({
count: items.length,
getScrollElement: () => parentRef.current,
estimateSize: () => 50,
overscan: 5,
})
return (
<div ref={parentRef} style={{ height: '400px', overflow: 'auto' }}>
<div style={{ height: `${virtualizer.getTotalSize()}px`, position: 'relative' }}>
{virtualizer.getVirtualItems().map((virtualRow) => (
<div
key={virtualRow.key}
style={{
position: 'absolute',
top: 0,
left: 0,
width: '100%',
height: `${virtualRow.size}px`,
transform: `translateY(${virtualRow.start}px)`,
}}
>
{items[virtualRow.index].name}
</div>
))}
</div>
</div>
)
}
Dynamic Height
const virtualizer = useVirtualizer({
count: items.length,
getScrollElement: () => parentRef.current,
estimateSize: () => 50,
overscan: 5,
measureElement: (element) => element.getBoundingClientRect().height,
})
Incorrect — Rendering 1000 items causes scroll jank:
function List({ items }) {
return (
<div style={{ height: '400px', overflow: 'auto' }}>
{items.map(item => (
<div key={item.id}>{item.name}</div>
))}
</div>
);
}
Correct — Virtualization renders only visible items:
import { useVirtualizer } from '@tanstack/react-virtual';
function VirtualList({ items }) {
const parentRef = useRef(null);
const virtualizer = useVirtualizer({
count: items.length,
getScrollElement: () => parentRef.current,
estimateSize: () => 50,
overscan: 5,
});
return (
<div ref={parentRef} style={{ height: '400px', overflow: 'auto' }}>
<div style={{ height: `${virtualizer.getTotalSize()}px`, position: 'relative' }}>
{virtualizer.getVirtualItems().map(virtualRow => (
<div
key={virtualRow.key}
style={{
position: 'absolute',
transform: `translateY(${virtualRow.start}px)`,
}}
>
{items[virtualRow.index].name}
</div>
))}
</div>
</div>
);
}
Key Rules
- Virtualize lists with 100+ items
- Set overscan: 5 for smooth scrolling
- Use an estimateSize close to the actual average
- Use measureElement for variable-height items
- Position items with transform: translateY() (avoids layout recalculation)
Caching Strategies
Caching Strategies
Multi-level caching patterns for performance optimization.
Cache Hierarchy
L1: In-Memory (LRU, memoization) - fastest, per-process
L2: Distributed (Redis/Memcached) - shared across instances
L3: CDN (edge, static assets) - global, closest to user
L4: Database (materialized views) - fallback, queryable
Cache-Aside Pattern (Read-Through)
Most common caching pattern:
async function getAnalysis(id: string): Promise<Analysis> {
const cacheKey = `analysis:${id}`;
// Try cache first (L2)
const cached = await redis.get(cacheKey);
if (cached) {
return JSON.parse(cached);
}
// Cache miss - fetch from database (L4)
const analysis = await db.query('SELECT * FROM analyses WHERE id = $1', [id]);
// Store in cache for future requests
await redis.setex(cacheKey, 3600, JSON.stringify(analysis)); // 1 hour TTL
return analysis;
}
Write-Through Pattern
Update cache when writing to database:
async function updateAnalysis(id: string, updates: Partial<Analysis>) {
// Update database
const updated = await db.query(
'UPDATE analyses SET ... WHERE id = $1 RETURNING *',
[id]
);
// Update cache immediately
const cacheKey = `analysis:${id}`;
await redis.setex(cacheKey, 3600, JSON.stringify(updated));
return updated;
}
Cache Invalidation Strategies
1. Time-Based (TTL)
// Short TTL for frequently changing data
await redis.setex('trending:articles', 300, data); // 5 min
// Long TTL for static data
await redis.setex('user:profile:123', 86400, data); // 24 hours
2. Event-Based
// Invalidate when data changes
async function deleteAnalysis(id: string) {
await db.query('DELETE FROM analyses WHERE id = $1', [id]);
// Invalidate all related cache keys
await redis.del(`analysis:${id}`);
await redis.del(`analysis:${id}:chunks`);
await redis.del('analysis:list:recent'); // List cache
}
3. Tag-Based
// Tag related cache entries
await redis.set('analysis:123', data);
await redis.sadd('tag:user:456', 'analysis:123');
// Invalidate all entries with tag
async function invalidateUserData(userId: string) {
const keys = await redis.smembers(`tag:user:${userId}`);
if (keys.length > 0) {
await redis.del(...keys);
await redis.del(`tag:user:${userId}`);
}
}
Redis Patterns
1. String Cache (Most Common)
// Get/set
await redis.set('key', 'value');
const value = await redis.get('key');
// With TTL
await redis.setex('key', 3600, 'value');
// Atomic increment
await redis.incr('page:views:123');
2. Hash Cache (Objects)
// Store object fields separately
await redis.hset('user:123', 'name', 'Alice');
await redis.hset('user:123', 'email', 'alice@example.com');
// Get specific field
const name = await redis.hget('user:123', 'name');
// Get all fields
const user = await redis.hgetall('user:123');
3. List Cache (Queues, Recent Items)
// Recent analyses (FIFO)
await redis.lpush('analyses:recent', analysisId);
await redis.ltrim('analyses:recent', 0, 99); // Keep only 100 most recent
// Get recent
const recent = await redis.lrange('analyses:recent', 0, 9); // First 10
4. Set Cache (Unique Items, Tags)
// Track unique visitors
await redis.sadd('article:123:visitors', userId);
// Check membership
const hasVisited = await redis.sismember('article:123:visitors', userId);
// Count unique
const uniqueCount = await redis.scard('article:123:visitors');
In-Memory Cache (L1)
For per-process caching:
import { LRUCache } from 'lru-cache';
const cache = new LRUCache<string, Analysis>({
max: 500, // Maximum items
ttl: 1000 * 60 * 5, // 5 minutes
updateAgeOnGet: true, // Refresh on access
});
async function getAnalysis(id: string): Promise<Analysis> {
// Check L1 first
if (cache.has(id)) {
return cache.get(id)!;
}
// Fetch from L2 or database
const analysis = await fetchAnalysis(id);
cache.set(id, analysis);
return analysis;
}
HTTP Caching (Browser/CDN)
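The handler below calls a generateETag helper that the snippet leaves undefined; a minimal sketch using Node's built-in crypto module (assuming a hash of the serialized payload is acceptable for your data):
import { createHash } from 'crypto';
function generateETag(payload: unknown): string {
const hash = createHash('sha1').update(JSON.stringify(payload)).digest('hex');
return `"${hash}"`; // ETags are quoted strings per RFC 9110
}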
// Express.js example
app.get('/api/analyses/:id', async (req, res) => {
const analysis = await getAnalysis(req.params.id);
// Cache in browser and CDN for 1 hour
res.set('Cache-Control', 'public, max-age=3600');
// ETag for conditional requests
const etag = generateETag(analysis);
res.set('ETag', etag);
// Return 304 if unchanged
if (req.headers['if-none-match'] === etag) {
return res.status(304).end();
}
res.json(analysis);
});
Cache Warming
Preload cache before traffic arrives:
async function warmCache() {
// Load hot data
const recentAnalyses = await db.query(
'SELECT * FROM analyses ORDER BY created_at DESC LIMIT 100'
);
// Populate cache
for (const analysis of recentAnalyses) {
await redis.setex(
`analysis:${analysis.id}`,
3600,
JSON.stringify(analysis)
);
}
console.log(`Warmed cache with ${recentAnalyses.length} analyses`);
}
// Run on server startup
await warmCache();
Cache Stampede Prevention
Prevent multiple requests from hitting database simultaneously:
const locks = new Map<string, Promise<Analysis>>();
async function getAnalysis(id: string): Promise<Analysis> {
const cacheKey = `analysis:${id}`;
// Check cache
const cached = await redis.get(cacheKey);
if (cached) return JSON.parse(cached);
// Check if fetch is already in progress
if (locks.has(cacheKey)) {
return locks.get(cacheKey)!;
}
// Start fetch
const fetchPromise = (async () => {
try {
const analysis = await db.query('SELECT * FROM analyses WHERE id = $1', [id]);
await redis.setex(cacheKey, 3600, JSON.stringify(analysis));
return analysis;
} finally {
locks.delete(cacheKey); // Clean up even if the fetch throws
}
})();
locks.set(cacheKey, fetchPromise);
return fetchPromise;
}
Best Practices
- Cache frequently accessed, slow-to-compute data
- Use appropriate TTL - shorter for dynamic data
- Monitor cache hit rate - aim for > 80%
- Handle cache failures gracefully - always fall back to the database (see the sketch below)
- Invalidate proactively when data changes
- Monitor memory usage - set max memory and eviction policy
- Use compression for large cached values
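As an illustration of the graceful-failure rule, a sketch of the cache-aside reader from above with the Redis calls made best-effort (same assumed redis and db clients); a cache outage should degrade latency, never availability:
async function getAnalysisSafe(id: string): Promise<Analysis> {
const cacheKey = `analysis:${id}`;
try {
const cached = await redis.get(cacheKey);
if (cached) return JSON.parse(cached);
} catch (err) {
console.warn('Cache read failed, falling back to DB', err);
}
const analysis = await db.query('SELECT * FROM analyses WHERE id = $1', [id]);
try {
await redis.setex(cacheKey, 3600, JSON.stringify(analysis));
} catch (err) {
console.warn('Cache write failed (non-fatal)', err); // serve the data anyway
}
return analysis;
}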
References
- Redis Best Practices
- HTTP Caching
- See scripts/caching-patterns.ts for a complete implementation
CDN Setup
Image CDN Configuration
Complete guide to configuring image CDNs and optimization pipelines.
┌─────────────────────────────────────────────────────────────────────────────┐
│ Image Delivery Pipeline │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Source CDN / Optimizer Browser │
│ ┌────────┐ ┌─────────────┐ ┌──────────┐ │
│ │ Origin │──────────►│ Resize │──AVIF────►│ Chrome │ │
│ │ Server │ │ Format │ │ Safari │ │
│ │ /CMS │ │ Quality │──WebP────►│ Firefox │ │
│ └────────┘ │ Cache │ │ Edge │ │
│ └─────────────┘ └──────────┘ │
│ │ │
│ ┌──────▼──────┐ │
│ │ Edge Cache │ │
│ │ (Global) │ │
│ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Next.js Remote Patterns
Basic Configuration
// next.config.js
module.exports = {
images: {
// Enable modern formats
formats: ['image/avif', 'image/webp'],
// Allowed remote sources (required for external images)
remotePatterns: [
{
protocol: 'https',
hostname: 'cdn.example.com',
pathname: '/images/**',
},
{
protocol: 'https',
hostname: '*.cloudinary.com',
},
{
protocol: 'https',
hostname: 'images.unsplash.com',
},
{
protocol: 'https',
hostname: 's3.amazonaws.com',
pathname: '/my-bucket/**',
},
],
// Responsive breakpoints
deviceSizes: [640, 750, 828, 1080, 1200, 1920, 2048, 3840],
imageSizes: [16, 32, 48, 64, 96, 128, 256, 384],
// Cache TTL (seconds) - default 60, increase for CDN
minimumCacheTTL: 60 * 60 * 24 * 30, // 30 days
// Disable optimization in development (faster builds)
unoptimized: process.env.NODE_ENV === 'development',
},
};
Environment-Based Configuration
// next.config.js
const isProd = process.env.NODE_ENV === 'production';
module.exports = {
images: {
formats: ['image/avif', 'image/webp'],
// Different patterns per environment
remotePatterns: [
// Production CDN
...(isProd
? [
{
protocol: 'https',
hostname: 'cdn.example.com',
},
]
: []),
// Development/staging
...(!isProd
? [
{
protocol: 'https',
hostname: 'staging-cdn.example.com',
},
{
protocol: 'http',
hostname: 'localhost',
port: '3001',
},
]
: []),
],
},
};
Cloudinary Integration
Loader Implementation
// lib/loaders/cloudinary.ts
import type { ImageLoader } from 'next/image';
const CLOUD_NAME = process.env.NEXT_PUBLIC_CLOUDINARY_CLOUD_NAME;
export const cloudinaryLoader: ImageLoader = ({ src, width, quality }) => {
// Build transformation string
const transforms = [
`w_${width}`,
`q_${quality || 'auto:good'}`,
'f_auto', // Auto format (AVIF > WebP > JPEG)
'c_limit', // Don't upscale
'dpr_auto', // Auto DPR
].join(',');
// Handle both full URLs and paths
const imagePath = src.startsWith('http')
? src.replace(/^https?:\/\/[^/]+/, '')
: src;
return `https://res.cloudinary.com/${CLOUD_NAME}/image/upload/${transforms}/${imagePath}`;
};
// Advanced loader with more options
export const cloudinaryAdvancedLoader: ImageLoader = ({ src, width, quality }) => {
// Transformation string with additional optimizations
const transforms = [
`w_${width}`,
`q_${quality || 'auto:good'}`,
'f_auto', // Best format for browser
'c_limit', // Don't upscale
'fl_progressive', // Progressive loading
'fl_immutable_cache', // Long cache
].join(',');
return `https://res.cloudinary.com/${CLOUD_NAME}/image/upload/${transforms}/${src}`;
};
export default cloudinaryLoader;
Usage
import Image from 'next/image';
import { cloudinaryLoader } from '@/lib/loaders/cloudinary';
// Component usage
<Image
loader={cloudinaryLoader}
src="products/shoe-red.jpg" // Path in Cloudinary
alt="Red running shoe"
width={400}
height={400}
sizes="(max-width: 768px) 100vw, 400px"
/>
// Global loader configuration
// next.config.js
module.exports = {
images: {
loader: 'custom',
loaderFile: './lib/loaders/cloudinary.ts',
},
};
Imgix Integration
// lib/loaders/imgix.ts
import type { ImageLoader } from 'next/image';
const IMGIX_DOMAIN = process.env.NEXT_PUBLIC_IMGIX_DOMAIN;
export const imgixLoader: ImageLoader = ({ src, width, quality }) => {
const url = new URL(`https://${IMGIX_DOMAIN}${src}`);
// Auto format negotiation
url.searchParams.set('auto', 'format,compress');
// Width
url.searchParams.set('w', width.toString());
// Quality
url.searchParams.set('q', (quality || 75).toString());
// Fit mode (contain, cover, fill, etc.)
url.searchParams.set('fit', 'max');
return url.toString();
};
// With advanced features
export const imgixAdvancedLoader: ImageLoader = ({ src, width, quality }) => {
const url = new URL(`https://${IMGIX_DOMAIN}${src}`);
url.searchParams.set('auto', 'format,compress');
url.searchParams.set('w', width.toString());
url.searchParams.set('q', (quality || 75).toString());
url.searchParams.set('fit', 'max');
// Face detection for portraits
// url.searchParams.set('fit', 'facearea');
// url.searchParams.set('facepad', '2');
// Blur for placeholders
// url.searchParams.set('blur', '200');
// url.searchParams.set('px', '16');
return url.toString();
};
Cloudflare Images
// lib/loaders/cloudflare.ts
import type { ImageLoader } from 'next/image';
// Using Cloudflare Image Resizing
export const cloudflareResizingLoader: ImageLoader = ({ src, width, quality }) => {
// src should be the full URL of the original image
const params = [
`width=${width}`,
`quality=${quality || 85}`,
'format=auto', // Auto AVIF/WebP
'fit=scale-down', // Don't upscale
].join(',');
return `https://yourdomain.com/cdn-cgi/image/${params}/${src}`;
};
// Using Cloudflare Images (upload API)
const ACCOUNT_HASH = process.env.NEXT_PUBLIC_CLOUDFLARE_ACCOUNT_HASH;
export const cloudflareImagesLoader: ImageLoader = ({ src, width }) => {
// src is the image ID from Cloudflare
// Variants are predefined in Cloudflare dashboard
const variant = width <= 640 ? 'small' : width <= 1024 ? 'medium' : 'large';
return `https://imagedelivery.net/${ACCOUNT_HASH}/${src}/${variant}`;
};
AWS S3 + CloudFront
// lib/loaders/aws.ts
import type { ImageLoader } from 'next/image';
const CLOUDFRONT_DOMAIN = process.env.NEXT_PUBLIC_CLOUDFRONT_DOMAIN;
// Basic CloudFront loader (requires Lambda@Edge for resizing)
export const cloudfrontLoader: ImageLoader = ({ src, width, quality }) => {
// Lambda@Edge parses these query params
const params = new URLSearchParams({
w: width.toString(),
q: (quality || 80).toString(),
f: 'auto',
});
return `https://${CLOUDFRONT_DOMAIN}${src}?${params}`;
};
// For static S3 images (no resizing)
export const s3Loader: ImageLoader = ({ src }) => {
return `https://${CLOUDFRONT_DOMAIN}${src}`;
};
Vercel Image Optimization
// Automatically enabled on Vercel
// Configure in next.config.js
module.exports = {
images: {
// Use Vercel's built-in optimizer
loader: 'default',
// External domains need explicit allowlist
remotePatterns: [
{
protocol: 'https',
hostname: 'cdn.example.com',
},
],
// Increase cache for static images
minimumCacheTTL: 60 * 60 * 24 * 365, // 1 year
},
};
// For non-Vercel deployments, use external loader
module.exports = {
images: {
loader: 'custom',
loaderFile: './lib/loaders/cloudinary.ts',
},
};
Self-Hosted with Sharp
// For self-hosted Next.js (Docker, Node.js)
// 1. Install Sharp
// npm install sharp
// 2. Configure next.config.js
module.exports = {
images: {
loader: 'default', // Uses Sharp internally
formats: ['image/avif', 'image/webp'],
minimumCacheTTL: 60 * 60 * 24 * 30, // 30 days
// Important for self-hosted
dangerouslyAllowSVG: false,
contentDispositionType: 'attachment',
},
};
// 3. Dockerfile - ensure Sharp can build
FROM node:20-alpine AS builder
RUN apk add --no-cache libc6-compat
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
FROM node:20-alpine AS runner
RUN apk add --no-cache libc6-compat
WORKDIR /app
COPY --from=builder /app/.next/standalone ./
COPY --from=builder /app/.next/static ./.next/static
COPY --from=builder /app/public ./public
EXPOSE 3000
CMD ["node", "server.js"]
CDN Headers & Caching
Nginx Configuration
# /etc/nginx/conf.d/images.conf
# Image caching
location ~* \.(jpg|jpeg|png|webp|avif|gif|ico|svg)$ {
# Long cache for immutable assets
expires 1y;
add_header Cache-Control "public, immutable";
# Vary by Accept header for format negotiation
add_header Vary "Accept";
# Security headers
add_header X-Content-Type-Options "nosniff";
}
# Next.js optimized images
location /_next/image {
proxy_pass http://nextjs_upstream;
proxy_cache_valid 200 365d;
# Cache key includes Accept header for format
proxy_cache_key "$scheme$request_method$host$request_uri$http_accept";
add_header X-Cache-Status $upstream_cache_status;
}
Cloudflare Page Rules
{
"targets": [
{
"target": "url",
"constraint": {
"operator": "matches",
"value": "*.example.com/*.(jpg|jpeg|png|webp|avif|gif)"
}
}
],
"actions": [
{
"id": "cache_level",
"value": "cache_everything"
},
{
"id": "edge_cache_ttl",
"value": 2592000
},
{
"id": "browser_cache_ttl",
"value": 31536000
},
{
"id": "polish",
"value": "lossless"
}
]
}
Blur Placeholder Generation
Build-Time with Plaiceholder
// lib/blur.ts
import { getPlaiceholder } from 'plaiceholder';
import fs from 'fs/promises';
import path from 'path';
export async function getBlurDataURL(imagePath: string): Promise<string> {
try {
const file = await fs.readFile(path.join(process.cwd(), 'public', imagePath));
const { base64 } = await getPlaiceholder(file);
return base64;
} catch {
// Return a tiny transparent placeholder on error
return 'data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7';
}
}
// Usage in getStaticProps
export async function getStaticProps() {
const blurDataURL = await getBlurDataURL('/images/hero.jpg');
return {
props: { blurDataURL },
};
}
Remote Image Blur
// lib/remote-blur.ts
import { getPlaiceholder } from 'plaiceholder';
export async function getRemoteBlurDataURL(imageUrl: string): Promise<string> {
try {
const response = await fetch(imageUrl);
const buffer = Buffer.from(await response.arrayBuffer());
const { base64 } = await getPlaiceholder(buffer);
return base64;
} catch {
return 'data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7';
}
}
// Cache blur data URLs
const blurCache = new Map<string, string>();
export async function getCachedBlurDataURL(imageUrl: string): Promise<string> {
if (blurCache.has(imageUrl)) {
return blurCache.get(imageUrl)!;
}
const blur = await getRemoteBlurDataURL(imageUrl);
blurCache.set(imageUrl, blur);
return blur;
}
Image Validation & Error Handling
// lib/image-validation.ts
export function isValidImageUrl(url: string): boolean {
try {
const parsed = new URL(url);
const allowedHosts = ['cdn.example.com', 'images.unsplash.com'];
return allowedHosts.some(
(host) => parsed.hostname === host || parsed.hostname.endsWith(`.${host}`)
);
} catch {
return false;
}
}
export function getOptimizedImageUrl(
src: string,
options: { width: number; quality?: number }
): string {
// Use your CDN loader
const { width, quality = 80 } = options;
if (src.includes('cloudinary.com')) {
return src.replace('/upload/', `/upload/w_${width},q_${quality},f_auto/`);
}
// Default: return as-is
return src;
}
Core Web Vitals
Core Web Vitals Optimization
Google's Core Web Vitals are the key metrics for measuring user experience.
The Three Metrics
| Metric | Target | Measures | Impact |
|---|---|---|---|
| LCP (Largest Contentful Paint) | < 2.5s | Loading performance | First impression |
| INP (Interaction to Next Paint) | < 200ms | Responsiveness | User frustration |
| CLS (Cumulative Layout Shift) | < 0.1 | Visual stability | Accidental clicks |
LCP (Largest Contentful Paint)
What It Measures
Time until the largest visible element (hero image, heading, video) renders.
Common Causes
- Large, unoptimized images
- Slow server response time (TTFB > 600ms)
- Render-blocking JavaScript/CSS
- Client-side rendering
Fixes
1. Optimize Images
<!-- Preload LCP image -->
<link rel="preload" as="image" href="/hero.jpg" />
<!-- Use modern formats -->
<picture>
<source srcset="/hero.avif" type="image/avif" />
<source srcset="/hero.webp" type="image/webp" />
<img src="/hero.jpg" alt="Hero" width="1200" height="600" />
</picture>
<!-- Or use next/image -->
<Image src="/hero.jpg" priority quality={85} />2. Reduce Server Response Time
- Use CDN for static assets
- Enable HTTP/2 or HTTP/3
- Optimize database queries
- Implement caching (Redis, CDN)
3. Eliminate Render-Blocking Resources
<!-- Defer non-critical CSS -->
<link rel="preload" as="style" href="/styles.css" onload="this.onload=null;this.rel='stylesheet'" />
<!-- Defer JavaScript -->
<script src="/app.js" defer></script>
<!-- Inline critical CSS -->
<style>
/* Critical above-the-fold styles */
.hero { ... }
</style>
4. Use Server-Side Rendering (SSR)
// Next.js SSR
export async function getServerSideProps() {
const data = await fetchData();
return { props: { data } };
}
// React Server Components
async function Page() {
const data = await fetchData(); // Runs on server
return <div>{data}</div>;
}
INP (Interaction to Next Paint)
What It Measures
Time from user interaction (click, tap, key press) to visual feedback.
Common Causes
- Heavy JavaScript execution blocking main thread
- Long-running event handlers
- Expensive DOM updates
- Third-party scripts
Fixes
1. Debounce/Throttle Expensive Operations
import { debounce } from 'lodash';
// Without debounce: runs on EVERY keystroke
function handleSearch(query: string) {
const results = expensiveSearch(query); // Blocks for 100ms
setResults(results);
}
// With debounce: runs 300ms after user stops typing
const handleSearch = debounce((query: string) => {
const results = expensiveSearch(query);
setResults(results);
}, 300);
2. Use Web Workers for Heavy Computation
// worker.ts
self.onmessage = (e) => {
const result = expensiveComputation(e.data);
self.postMessage(result);
};
// main.ts
const worker = new Worker('/worker.js');
worker.postMessage(data);
worker.onmessage = (e) => {
setResult(e.data);
};
3. Split Long Tasks
// Before: Blocks main thread for 500ms
function processItems(items) {
items.forEach(item => {
processItem(item); // 5ms each × 100 items = 500ms
});
}
// After: Yields to browser between batches
async function processItems(items) {
for (let i = 0; i < items.length; i += 10) {
const batch = items.slice(i, i + 10);
batch.forEach(processItem);
// Yield to browser
await new Promise(resolve => setTimeout(resolve, 0));
}
}
// Or use Scheduler API (modern)
async function processItems(items) {
for (let i = 0; i < items.length; i += 10) {
const batch = items.slice(i, i + 10);
batch.forEach(processItem);
await scheduler.yield(); // Yield to higher priority tasks
}
}
4. Optimize React Rendering
// Memoize expensive components
const Chart = memo(({ data }) => <ExpensiveChart data={data} />);
// Use startTransition for non-urgent updates
import { useTransition } from 'react';
function Search() {
const [query, setQuery] = useState('');
const [results, setResults] = useState([]);
const [isPending, startTransition] = useTransition();
function handleChange(e) {
setQuery(e.target.value); // Urgent: update input immediately
startTransition(() => {
// Non-urgent: can be interrupted
const filtered = filterResults(e.target.value);
setResults(filtered);
});
}
return <input value={query} onChange={handleChange} />;
}
CLS (Cumulative Layout Shift)
What It Measures
Visual stability - how much elements unexpectedly shift during load.
Common Causes
- Images without dimensions
- Ads/embeds injected after layout
- Web fonts causing FOIT/FOUT
- Dynamically injected content
Fixes
1. Always Set Image Dimensions
<!-- ❌ BAD: No dimensions, causes layout shift -->
<img src="/photo.jpg" alt="Photo" />
<!-- ✅ GOOD: Reserves space -->
<img src="/photo.jpg" alt="Photo" width="800" height="600" />
<!-- Or with aspect ratio (CSS) -->
<img src="/photo.jpg" alt="Photo" style="aspect-ratio: 4/3; width: 100%;" />2. Reserve Space for Ads/Embeds
.ad-container {
min-height: 250px; /* Reserve space before ad loads */
background: #f0f0f0;
}
3. Optimize Web Font Loading
/* Prevent FOIT (flash of invisible text) */
@font-face {
font-family: 'CustomFont';
src: url('/font.woff2') format('woff2');
font-display: swap; /* Show fallback immediately, swap when ready */
}
<!-- Preload critical fonts -->
<link rel="preload" as="font" href="/font.woff2" type="font/woff2" crossorigin />4. Avoid Inserting Content Above Existing Content
// ❌ BAD: Inserts notification at top, shifts everything down
function addNotification(message) {
container.insertAdjacentHTML('afterbegin', `<div>${message}</div>`);
}
// ✅ GOOD: Append to bottom or use fixed positioning
function addNotification(message) {
const notification = document.createElement('div');
notification.className = 'notification-fixed'; // position: fixed
notification.textContent = message;
document.body.appendChild(notification);
}
Measuring Core Web Vitals
In Development
// Use web-vitals library
import { onCLS, onINP, onLCP } from 'web-vitals';
onLCP(console.log); // Log LCP
onINP(console.log); // Log INP
onCLS(console.log); // Log CLS
In Production (RUM - Real User Monitoring)
import { onCLS, onINP, onLCP } from 'web-vitals';
function sendToAnalytics(metric) {
fetch('/api/analytics', {
method: 'POST',
body: JSON.stringify(metric),
});
}
onLCP(sendToAnalytics);
onINP(sendToAnalytics);
onCLS(sendToAnalytics);
Lighthouse (Lab Testing)
# Run Lighthouse audit
lighthouse https://your-site.com --output=html
# Or use Chrome DevTools
# Open DevTools → Lighthouse tab → Generate report
Targets by Percentile
Google measures at the 75th percentile of all page loads:
| Grade | LCP | INP | CLS |
|---|---|---|---|
| Good (Green) | < 2.5s | < 200ms | < 0.1 |
| Needs Improvement (Orange) | 2.5-4s | 200-500ms | 0.1-0.25 |
| Poor (Red) | > 4s | > 500ms | > 0.25 |
Goal: 75% of page loads should be "Good" for all three metrics.
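Because scoring happens at the 75th percentile, aggregate your RUM data the same way rather than averaging it. A minimal sketch (assumes you have collected raw metric values, e.g. via the /api/analytics endpoint shown above):
// Sketch: the 75th percentile of raw samples (ms for LCP/INP, unitless for CLS)
function p75(samples: number[]): number {
const sorted = [...samples].sort((a, b) => a - b);
return sorted[Math.max(0, Math.ceil(sorted.length * 0.75) - 1)];
}
// Example: p75([1800, 2100, 2600, 1900]) === 2100 → "Good" LCP (< 2500ms)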
Quick Wins Checklist
- Add width and height to all images
- Preload the LCP image
- Use font-display: swap for web fonts
- Defer non-critical JavaScript
- Enable HTTP/2 and compression
- Use CDN for static assets
- Implement lazy loading for below-fold images
- Memoize expensive React components
- Debounce search inputs and expensive handlers
Database Optimization
Database Query Optimization
Strategies for optimizing database performance and eliminating slow queries.
Key Patterns
- Add Missing Indexes - Turn Seq Scan into Index Scan
- Fix N+1 Queries - Use JOINs or include instead of loops
- Cursor Pagination - Never load all records
- Connection Pooling - Manage connection lifecycle
Quick Diagnostics
-- Find slow queries (PostgreSQL)
SELECT query, calls, mean_time / 1000 as mean_seconds
FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;
-- Verify index usage
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123;
-- Check for sequential scans
SELECT schemaname, tablename, seq_scan, seq_tup_read
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 10;
N+1 Query Detection
Symptoms:
- One query to get parent records, then N queries for related data
- Rapid sequential database calls in logs
- Linear growth in query count with data size
Example Problem:
# ❌ BAD: N+1 query (1 + 8 queries)
analyses = (await session.execute(select(Analysis).limit(8))).scalars().all()
for analysis in analyses:
# Each iteration hits DB again!
chunks = (await session.execute(
select(Chunk).where(Chunk.analysis_id == analysis.id)
)).scalars().all()
Solution:
# ✅ GOOD: Single query with JOIN (1 query)
from sqlalchemy.orm import selectinload
analyses = (await session.execute(
select(Analysis)
.options(selectinload(Analysis.chunks)) # Eager load
.limit(8)
)).scalars().all()
# Now analyses[0].chunks is already loaded (no extra query)
Index Selection Strategies
| Index Type | Use Case | Example |
|---|---|---|
| B-tree | Equality, range queries | WHERE created_at > '2025-01-01' |
| GIN | Full-text search, JSONB | WHERE content_tsvector @@ to_tsquery('python') |
| HNSW | Vector similarity | ORDER BY embedding <=> '[0.1, 0.2, ...]' |
| Hash | Exact equality only | WHERE id = 'abc123' (rare) |
Index Creation Examples:
-- B-tree index for range queries
CREATE INDEX idx_analyses_created_at ON analyses(created_at);
-- GIN index for full-text search
CREATE INDEX idx_chunks_tsvector ON chunks USING GIN(content_tsvector);
-- HNSW index for vector similarity
CREATE INDEX idx_chunks_embedding ON chunks
USING hnsw (embedding vector_cosine_ops);
-- Partial index for active records only
CREATE INDEX idx_active_users ON users(email)
WHERE deleted_at IS NULL;
-- Composite index for common query pattern
CREATE INDEX idx_analyses_user_status ON analyses(user_id, status);
Connection Pooling
Problem: Creating new connections is expensive (50-100ms overhead)
Solution: Use connection pools
# SQLAlchemy async pool
engine = create_async_engine(
DATABASE_URL,
pool_size=20, # Base connections
max_overflow=10, # Additional if needed
pool_pre_ping=True, # Verify connections are alive
pool_recycle=3600 # Recycle after 1 hour
)
Pagination: Cursor vs Offset
Offset-Based (❌ Slow for large datasets)
SELECT * FROM analyses ORDER BY created_at DESC
LIMIT 20 OFFSET 1000; -- Must scan 1020 rows!
Cursor-Based (✅ Fast, scales to millions)
SELECT * FROM analyses
WHERE created_at < '2025-01-15 10:00:00' -- Last cursor
ORDER BY created_at DESC
LIMIT 20; -- Only scans 20 rows
Best Practices
- Always use EXPLAIN ANALYZE before deploying queries
- Index foreign keys used in JOINs
- Avoid SELECT * - request only needed columns
- Use prepared statements to prevent SQL injection and enable query caching
- Monitor pg_stat_statements weekly
- Set query timeouts to prevent runaway queries (see the sketch below)
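Most drivers expose timeouts directly; a sketch with node-postgres (the option names are pg-specific — check your driver's equivalent):
import { Pool } from 'pg';
// Sketch: enforce timeouts at the pool level so no single query
// can hold a connection indefinitely.
const pool = new Pool({
connectionString: process.env.DATABASE_URL,
max: 20, // pool size, mirroring the SQLAlchemy example above
statement_timeout: 5000, // server aborts statements running > 5s
query_timeout: 7000, // client-side safety net
});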
References
- PostgreSQL Performance Tips
- Use The Index, Luke
- See scripts/database-optimization.ts for implementation patterns
DevTools Profiler Workflow
React DevTools Profiler Workflow
Finding and fixing performance bottlenecks.
Setup
- Install React DevTools browser extension
- Open DevTools (F12)
- Navigate to Profiler tab
- Ensure React is in development mode
Basic Profiling Flow
1. Start Recording
- Click the blue Record button
- Perform the slow interaction
- Click Stop
2. Analyze the Flamegraph
The flamegraph shows component render times:
[App (2ms)]
├── [Header (0.5ms)]
├── [Sidebar (15ms)] ← Slow!
│ ├── [NavItem (1ms)]
│ ├── [NavItem (1ms)]
│ └── [HeavyWidget (12ms)] ← Found it!
└── [Content (1ms)]
3. Key Metrics
| Metric | Meaning |
|---|---|
| Render time | How long component took to render |
| Commit time | Time to apply changes to DOM |
| Interactions | What triggered the render |
Reading the Profiler
Color Coding
- Gray: Did not render
- Blue/Teal: Rendered (fast)
- Yellow: Rendered (medium)
- Red/Orange: Rendered (slow)
"Why did this render?"
Enable in DevTools settings:
- Click gear icon in Profiler
- Check "Record why each component rendered"
Common reasons:
- Props changed
- State changed
- Parent rendered
- Context changed
- Hooks changed
Identifying Problems
Problem 1: Component Renders Too Often
Look for components that render on every interaction:
Render 1: [List (50ms)] - items changed ✓
Render 2: [List (50ms)] - items same, parent rendered ✗
Render 3: [List (50ms)] - items same, parent rendered ✗
Solution: Isolate state, use React.memo as escape hatch
Problem 2: Single Render Too Slow
Look for wide bars in the flamegraph:
[SlowComponent (200ms)]
├── [Child1 (5ms)]
├── [Child2 (190ms)] ← Find the slow child
│ └── [GrandChild (185ms)] ← Root cause
└── [Child3 (5ms)]
Solution: Virtualize, lazy load, or optimize computation
Problem 3: Cascading Re-renders
Many components re-render for one change:
[Parent] → [Child1] → [GrandChild1]
→ [Child2] → [GrandChild2]
→ [Child3] → [GrandChild3]
Solution: Move state down, split context
Profiler Settings
Click the gear icon for options:
- Record why each component rendered: Essential for debugging
- Hide commits below X ms: Filter noise
- Highlight updates: Visual indicator during interaction
Ranked View
Switch from Flamegraph to Ranked view:
1. HeavyWidget 12ms
2. Sidebar 3ms
3. NavItem 1ms
4. Content 1ms
5. Header 0.5ms
This shows components sorted by render time.
Timeline View
Shows renders over time, useful for:
- Finding render cascades
- Identifying what triggered re-renders
- Seeing interaction-to-render timing
Console Integration
// Add profiling in code
import { Profiler } from 'react'
function onRenderCallback(
id, // Component tree id
phase, // "mount" | "update"
actualDuration,
baseDuration,
startTime,
commitTime
) {
console.log(`${id} ${phase}: ${actualDuration.toFixed(2)}ms`)
}
<Profiler id="Navigation" onRender={onRenderCallback}>
<Navigation />
</Profiler>
Quick Checklist
- Record the slow interaction
- Find the slowest component (ranked view)
- Check why it rendered (DevTools setting)
- Verify if render was necessary
- Apply targeted fix
- Re-profile to confirm improvement
Common Fixes by Cause
| Why Rendered | Fix |
|---|---|
| Props changed (but same value) | Check prop references |
| Parent rendered | Isolate state, split component |
| Context changed | Split context |
| Hooks changed | Check effect dependencies |
| State changed | Verify state is necessary |
Edge Deployment
Edge Deployment
Overview
Deploy LLMs on resource-constrained devices: mobile, edge servers, embedded systems.
Key constraints:
- Limited GPU/NPU memory (4-24 GB)
- Power efficiency requirements
- Latency-sensitive applications
- Offline/disconnected operation
Model Selection for Edge
| Device | Memory | Recommended Models |
|---|---|---|
| Mobile (iOS/Android) | 4-8 GB | Llama-3.3-1B, Phi-3-mini |
| Edge Server | 16-24 GB | Llama-3.3-3B, Mistral-7B-4bit |
| Raspberry Pi 5 | 8 GB | Gemma-2B, TinyLlama |
| Jetson Orin | 32-64 GB | Llama-3.1-8B, Mixtral-8x7B-4bit |
Aggressive Quantization
For edge, prioritize memory over quality:
from gptqmodel import GPTQModel, QuantizeConfig
# 2-bit quantization for extreme memory constraints
quant_config = QuantizeConfig(
bits=2,
group_size=32,
damp_percent=0.1,
)
model = GPTQModel.load("meta-llama/Llama-3.3-1B-Instruct", quant_config)
model.quantize(calibration_data)
model.save("Llama-3.3-1B-2bit-edge")Quality vs Memory Trade-off:
| Bits | Memory (1B model) | Quality Retention |
|---|---|---|
| 4 | ~600 MB | ~95% |
| 3 | ~450 MB | ~85% |
| 2 | ~300 MB | ~70% |
llama.cpp for Edge
Optimized C++ inference for CPU/edge:
# Build llama.cpp with optimizations
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# For Apple Silicon
make LLAMA_METAL=1
# For CUDA
make LLAMA_CUDA=1
# For Vulkan (cross-platform GPU)
make LLAMA_VULKAN=1
# Run inference
./main -m models/llama-3.3-1b-q4_k_m.gguf \
-p "Hello, how are you?" \
-n 128 \
-ngl 99 # Offload all layers to GPU
GGUF quantization types:
| Type | Bits | Quality | Speed |
|---|---|---|---|
| Q8_0 | 8 | Best | Good |
| Q5_K_M | 5 | Very Good | Better |
| Q4_K_M | 4 | Good | Best |
| Q3_K_M | 3 | Acceptable | Best |
| Q2_K | 2 | Degraded | Best |
Mobile Deployment
iOS with MLX
# Convert to MLX format for Apple Silicon
import mlx.core as mx
from mlx_lm import load, generate
# Load quantized model
model, tokenizer = load("mlx-community/Llama-3.3-1B-Instruct-4bit")
# Generate on device
prompt = "Explain machine learning briefly:"
response = generate(model, tokenizer, prompt=prompt, max_tokens=100)
Android with MLC-LLM
# Build for Android
mlc_llm compile meta-llama/Llama-3.3-1B-Instruct \
--quantization q4f16_1 \
--target android
# Deploy APK with bundled model
mlc_llm package \
--model-lib ./dist/llama-3.3-1b-q4f16_1-android.tar \
--apk-output ./LlamaApp.apk
Jetson/NVIDIA Edge
Optimized for Jetson Orin and embedded NVIDIA:
# Use TensorRT-LLM for Jetson
from tensorrt_llm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Llama-3.3-3B-Instruct",
max_batch_size=4, # Limit for memory
max_input_len=2048,
max_output_len=512,
)
# Optimized for Jetson memory constraints
outputs = llm.generate(
prompts=["Hello!"],
sampling_params=SamplingParams(max_tokens=100),
)
Memory Optimization Techniques
KV Cache Reduction
# Limit context length for edge
llm = LLM(
model="meta-llama/Llama-3.3-1B-Instruct",
max_model_len=1024, # Reduce from default 4096
gpu_memory_utilization=0.95, # Maximize usage
)
Sliding Window Attention
# Models with built-in sliding window
# Mistral-7B: 4096 sliding window
# Reduces memory O(n^2) -> O(n*window)
llm = LLM(
model="mistralai/Mistral-7B-Instruct-v0.3",
sliding_window=4096, # Use model's native window
)
Flash Attention
# Enable Flash Attention for memory efficiency
llm = LLM(
model="meta-llama/Llama-3.3-1B-Instruct",
use_flash_attention=True, # Default on supported hardware
)
Power Efficiency
Dynamic Frequency Scaling
# Limit GPU frequency for power savings (Jetson)
sudo nvpmodel -m 2 # Medium power mode
sudo jetson_clocks --show
# For inference-heavy workloads
sudo nvpmodel -m 0 # Max performance
Batch Size Optimization
# Smaller batches = lower peak power
llm = LLM(
model="meta-llama/Llama-3.3-1B-Instruct",
max_num_seqs=8, # Limit concurrent requests
)
# Process requests sequentially for power
for prompt in prompts:
output = llm.generate([prompt], sampling_params)
yield output
Offline Deployment
Model Bundling
# Download and cache model for offline use
from huggingface_hub import snapshot_download
# Pre-download model
snapshot_download(
"meta-llama/Llama-3.3-1B-Instruct",
local_dir="./models/llama-3.3-1b",
local_dir_use_symlinks=False,
)
# Use local path
llm = LLM(model="./models/llama-3.3-1b")Air-gapped Environments
# Export model to portable format
python -m llama_cpp.convert \
--model meta-llama/Llama-3.3-1B-Instruct \
--output ./llama-3.3-1b.gguf \
--quantize q4_k_m
# Transfer and run on air-gapped device
./main -m ./llama-3.3-1b.gguf -p "Hello"
Benchmarking Edge Performance
import time
def benchmark_edge(model_path: str, prompts: list[str]):
"""Benchmark for edge deployment."""
from vllm import LLM, SamplingParams
llm = LLM(
model=model_path,
max_model_len=1024,
gpu_memory_utilization=0.95,
)
# Warmup
llm.generate(["Warmup"], SamplingParams(max_tokens=10))
# Benchmark
times = []
for prompt in prompts:
start = time.perf_counter()
output = llm.generate([prompt], SamplingParams(max_tokens=100))
elapsed = time.perf_counter() - start
times.append(elapsed)
avg_latency = sum(times) / len(times)
p99_latency = sorted(times)[int(len(times) * 0.99)]
print(f"Avg latency: {avg_latency*1000:.1f}ms")
print(f"P99 latency: {p99_latency*1000:.1f}ms")Related Skills
- ollama-local - Easy local deployment
- quantization-guide - Quantization methods
Frontend Performance
Frontend Performance Optimization
Techniques for optimizing bundle size, loading speed, and rendering performance.
Bundle Optimization
1. Code Splitting
Split your bundle into smaller chunks that load on-demand:
// Route-based splitting (React 19)
import { lazy, Suspense } from 'react';
import { Routes, Route } from 'react-router'; // 'react-router-dom' in v6
const AdminPanel = lazy(() => import('./AdminPanel'));
const Dashboard = lazy(() => import('./Dashboard'));
function App() {
return (
<Suspense fallback={<Loading />}>
<Routes>
<Route path="/admin" element={<AdminPanel />} />
<Route path="/dashboard" element={<Dashboard />} />
</Routes>
</Suspense>
);
}
2. Tree Shaking
Import only what you need:
// ❌ BAD: Imports entire library
import _ from 'lodash';
_.debounce(fn, 100);
// ✅ GOOD: Import specific function
import debounce from 'lodash/debounce';
debounce(fn, 100);
// ✅ EVEN BETTER: Use native or lightweight alternative
const debounce = (fn, delay) => {
let timeout;
return (...args) => {
clearTimeout(timeout);
timeout = setTimeout(() => fn(...args), delay);
};
};
3. Image Optimization
// Use next/image for automatic optimization
import Image from 'next/image';
<Image
src="/hero.jpg"
width={1200}
height={600}
alt="Hero"
loading="lazy" // Lazy load below fold
quality={85} // Balance quality/size
placeholder="blur" // Show blur while loading
/>
// Or use modern formats manually
<picture>
<source srcset="/hero.avif" type="image/avif" />
<source srcset="/hero.webp" type="image/webp" />
<img src="/hero.jpg" alt="Hero" loading="lazy" />
</picture>
Rendering Optimization
1. Memoization
Prevent unnecessary re-renders:
import { memo, useMemo, useCallback, useState } from 'react';
// Memoize expensive component
const ExpensiveChart = memo(({ data }) => {
return <Chart data={data} />;
});
// Memoize expensive computation
function AnalyticsDashboard({ analyses }) {
const stats = useMemo(() => {
return analyses.reduce((acc, a) => ({
totalCost: acc.totalCost + a.cost,
avgDuration: acc.avgDuration + a.duration
}), { totalCost: 0, avgDuration: 0 });
}, [analyses]); // Only recompute if analyses change
return <div>{stats.totalCost}</div>;
}
// Memoize callback to prevent child re-renders
function Parent() {
const [count, setCount] = useState(0);
const handleClick = useCallback(() => {
setCount(c => c + 1);
}, []); // Function identity stays same
return <Child onClick={handleClick} />;
}
2. Virtualization
Render only visible items in long lists:
import { useRef } from 'react';
import { useVirtualizer } from '@tanstack/react-virtual';
function AnalysisList({ analyses }) {
const parentRef = useRef(null);
const virtualizer = useVirtualizer({
count: analyses.length,
getScrollElement: () => parentRef.current,
estimateSize: () => 100, // Estimated row height
});
return (
<div ref={parentRef} style={{ height: '600px', overflow: 'auto' }}>
<div style={{ height: `${virtualizer.getTotalSize()}px` }}>
{virtualizer.getVirtualItems().map(virtualItem => (
<div
key={virtualItem.index}
style={{
position: 'absolute',
top: 0,
left: 0,
width: '100%',
height: `${virtualItem.size}px`,
transform: `translateY(${virtualItem.start}px)`,
}}
>
<AnalysisCard analysis={analyses[virtualItem.index]} />
</div>
))}
</div>
</div>
);
}
3. Batch DOM Operations
Minimize layout thrashing:
// ❌ BAD: Read-write-read-write causes layout thrashing
elements.forEach(el => {
const height = el.offsetHeight; // Read (triggers layout)
el.style.height = height + 10 + 'px'; // Write
});
// ✅ GOOD: Batch reads, then writes
const heights = elements.map(el => el.offsetHeight); // All reads
elements.forEach((el, i) => {
el.style.height = heights[i] + 10 + 'px'; // All writes
});
Core Web Vitals Optimization
LCP (Largest Contentful Paint) - Target: < 2.5s
Causes:
- Large images not optimized
- Slow server response (TTFB)
- Render-blocking JS/CSS
Fixes:
- Preload LCP image: <link rel="preload" as="image" href="/hero.jpg">
- Use CDN for assets
- Inline critical CSS
- Server-side rendering (SSR)
INP (Interaction to Next Paint) - Target: < 200ms
Causes:
- Heavy JavaScript execution
- Long-running event handlers
- Main thread blocked
Fixes:
- Debounce expensive operations
- Use Web Workers for heavy computation
- Split long tasks with setTimeout() or scheduler.postTask() (see the sketch below)
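scheduler.postTask appears above only by name; a minimal sketch that feature-detects it (Chromium-only at the time of writing) and falls back to setTimeout. The items and processItem names are placeholders for your own work queue:
// Sketch: yield between batches so input handling stays responsive
type Scheduler = { postTask<T>(cb: () => T, opts?: { priority?: string }): Promise<T> };
async function processInBackground<T>(items: T[], processItem: (item: T) => void) {
const scheduler = (globalThis as { scheduler?: Scheduler }).scheduler;
for (let i = 0; i < items.length; i += 10) {
const batch = items.slice(i, i + 10);
if (scheduler) {
// 'background' priority keeps this work from competing with interactions
await scheduler.postTask(() => batch.forEach(processItem), { priority: 'background' });
} else {
await new Promise<void>((resolve) =>
setTimeout(() => { batch.forEach(processItem); resolve(); }, 0)
);
}
}
}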
CLS (Cumulative Layout Shift) - Target: < 0.1
Causes:
- Images without dimensions
- Ads/embeds loading late
- Web fonts causing FOIT/FOUT
Fixes:
- Always set width and height on images
- Reserve space for ads: min-height: 250px
- Use font-display: swap for web fonts
- Preload fonts: <link rel="preload" as="font">
Bundle Analysis
# Lighthouse audit
lighthouse http://localhost:3000 --output=html
# Bundle analysis (Next.js)
ANALYZE=true npm run build
# Bundle analysis (Vite)
npm run build && npx vite-bundle-visualizer
# Check bundle size
du -sh dist/
Best Practices
- Lazy load below-the-fold content
- Use modern image formats (WebP, AVIF)
- Enable compression (Brotli > gzip)
- Minimize third-party scripts (see the sketch below for deferring the rest)
- Use CDN for static assets
- Monitor Core Web Vitals in production with RUM
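For third-party scripts you must keep, Next.js's next/script can keep them off the critical path; outside Next.js, a plain <script defer> or dynamic injection achieves the same. A sketch with a placeholder script URL:
import Script from 'next/script';
export function Analytics() {
return (
<Script
src="https://example.com/analytics.js" // hypothetical third-party script
strategy="lazyOnload" // loads during browser idle time, after hydration
/>
);
}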
References
- Core Web Vitals
- React Profiler
- See scripts/frontend-optimization.tsx for complete examples
Memoization Escape Hatches
Memoization Escape Hatches
When to still use useMemo and useCallback with React Compiler.
Overview
React Compiler handles most memoization automatically. Use manual memoization only as escape hatches for specific cases.
Escape Hatch 1: Effect Dependencies
When a value is used as an effect dependency and you need precise control:
// Problem: Effect runs on every render
function UserDashboard({ userId }) {
const config = {
userId,
includeStats: true,
format: 'detailed',
}
useEffect(() => {
fetchData(config) // Runs every render! config is new object
}, [config])
}
// Solution: Memoize the config
function UserDashboard({ userId }) {
const config = useMemo(() => ({
userId,
includeStats: true,
format: 'detailed',
}), [userId]) // Only changes when userId changes
useEffect(() => {
fetchData(config)
}, [config])
}
Escape Hatch 2: Third-Party Libraries
Libraries without React Compiler support may expect stable references:
// Some charting libraries compare references
function Chart({ data }) {
// Ensure stable reference for library
const chartOptions = useMemo(() => ({
animation: true,
responsive: true,
data: transformData(data),
}), [data])
return <ThirdPartyChart options={chartOptions} />
}
Escape Hatch 3: Expensive Computations
When you know a computation is expensive and want explicit control:
function SearchResults({ items, query }) {
// Explicitly expensive - want to ensure it's memoized
const filteredItems = useMemo(() => {
console.log('Filtering...')
return items
.filter(item => matchesQuery(item, query))
.sort(complexSortFn)
.slice(0, 100)
}, [items, query])
return <List items={filteredItems} />
}
Escape Hatch 4: Referential Equality for Children
When passing objects/arrays to components that use referential equality:
function Parent() {
// Child component uses Object.is() comparison
const contextValue = useMemo(() => ({
theme: 'dark',
locale: 'en',
}), [])
return (
<MyContext.Provider value={contextValue}>
<Children />
</MyContext.Provider>
)
}
When NOT to Use Escape Hatches
Don't Memoize Primitives
// ❌ Unnecessary - primitives are already stable
const memoizedId = useMemo(() => props.id, [props.id])
// ✅ Just use it directly
<Child id={props.id} />
Don't Memoize Simple JSX
// ❌ Unnecessary with React Compiler
const memoizedButton = useMemo(() => (
<Button onClick={handleClick}>Click</Button>
), [handleClick])
// ✅ Compiler handles this
<Button onClick={handleClick}>Click</Button>
Don't Memoize Everything "Just in Case"
// ❌ Over-memoization
function Component({ user }) {
const name = useMemo(() => user.name, [user.name])
const email = useMemo(() => user.email, [user.email])
const avatar = useMemo(() => user.avatar, [user.avatar])
return <Profile name={name} email={email} avatar={avatar} />
}
// ✅ Trust the compiler
function Component({ user }) {
return <Profile name={user.name} email={user.email} avatar={user.avatar} />
}
useCallback Escape Hatches
Stable Event Handlers for Effects
function DataFetcher({ onDataLoaded }) {
// Need stable reference for effect dependency
const stableCallback = useCallback(
(data) => onDataLoaded(data),
[onDataLoaded]
)
useEffect(() => {
fetchData().then(stableCallback)
}, [stableCallback])
}
Refs in Callbacks
function Form() {
const inputRef = useRef<HTMLInputElement>(null)
// Callback that uses ref - may need stability
const focusInput = useCallback(() => {
inputRef.current?.focus()
}, [])
return (
<>
<input ref={inputRef} />
<Button onClick={focusInput}>Focus</Button>
</>
)
}
Decision Tree
Is it an effect dependency?
├─ YES → Does the effect need to run less often?
│ └─ YES → useMemo/useCallback
└─ NO → Is it passed to a third-party library?
├─ YES → Check library docs, may need useMemo
└─ NO → Is it a known expensive computation?
├─ YES → Consider useMemo for explicit control
└─ NO → Trust React Compiler
Verifying Compiler Coverage
// In development, check DevTools for Memo badge
// If component doesn't have badge, compiler may have skipped it
// You can also add console logs to verify:
const value = useMemo(() => {
console.log('Computing...') // Should only log when deps change
return expensiveComputation()
}, [deps])
Profiling
Performance Profiling
Tools and techniques for identifying performance bottlenecks.
Profiling Workflow
- Measure - Establish baseline metrics
- Profile - Identify bottlenecks
- Optimize - Fix the slowest operations first
- Verify - Measure improvement
- Repeat - Iterate until targets met
Backend Profiling (Python)
1. cProfile (Built-in)
# Profile entire script
python -m cProfile -s cumulative backend/app/main.py
# Save profile for analysis
python -m cProfile -o profile.prof backend/app/main.py
# Analyze with snakeviz
pip install snakeviz
snakeviz profile.prof # Opens interactive flame graph
2. py-spy (Sampling Profiler)
# Install
pip install py-spy
# Profile running process
sudo py-spy top --pid 12345
# Generate flame graph
sudo py-spy record -o profile.svg --pid 12345 --duration 60
# Profile from start
py-spy record -o profile.svg -- python app.py
3. memory_profiler
# Install
pip install memory_profiler
# Decorate functions to profile
from memory_profiler import profile
@profile
def expensive_function():
data = [0] * (10 ** 6) # 1M integers
return sum(data)
# Run with profiling
python -m memory_profiler script.py
4. Line Profiler
# Install
pip install line_profiler
# Add decorator
from line_profiler import profile
@profile
def slow_function():
result = 0
for i in range(1000000):
result += i
return result
# Run with kernprof
kernprof -l -v script.py
Frontend Profiling
1. Chrome DevTools Performance Tab
Steps:
- Open DevTools (F12)
- Go to Performance tab
- Click Record (Cmd+E)
- Interact with page
- Stop recording
- Analyze flame graph
What to Look For:
- Long tasks (> 50ms) - shows as red in timeline
- Layout/reflow - indicates DOM thrashing
- Scripting time - JavaScript execution
- Rendering time - paint and composite
2. React DevTools Profiler
import { Profiler } from 'react';
function onRenderCallback(
id: string,
phase: 'mount' | 'update',
actualDuration: number,
baseDuration: number,
startTime: number,
commitTime: number
) {
console.log(`${id} (${phase}) took ${actualDuration}ms`);
}
<Profiler id="AnalysisList" onRender={onRenderCallback}>
<AnalysisList analyses={analyses} />
</Profiler>
In DevTools:
- Open React DevTools
- Go to Profiler tab
- Click Record
- Interact with app
- Stop and analyze
What to Look For:
- Components that render frequently but haven't changed
- Components with long render times
- Unnecessary re-renders (use memo())
3. Lighthouse Performance Audit
# CLI
npm install -g lighthouse
lighthouse https://localhost:3000 --view
# Or use Chrome DevTools → Lighthouse tab
Metrics Analyzed:
- First Contentful Paint (FCP)
- Largest Contentful Paint (LCP)
- Speed Index
- Time to Interactive (TTI)
- Total Blocking Time (TBT)
- Cumulative Layout Shift (CLS)
4. Bundle Analyzer
# Next.js
npm install @next/bundle-analyzer
ANALYZE=true npm run build
# Vite
npm install -D rollup-plugin-visualizer
npx vite-bundle-visualizer
# Webpack
npm install -D webpack-bundle-analyzer
webpack --profile --json > stats.json
webpack-bundle-analyzer stats.json
Database Profiling
PostgreSQL
1. Enable Query Logging
-- Enable slow query log
ALTER SYSTEM SET log_min_duration_statement = 100; -- Log queries > 100ms
SELECT pg_reload_conf();
2. pg_stat_statements
-- Enable extension
CREATE EXTENSION pg_stat_statements;
-- Find slowest queries
SELECT
query,
calls,
total_time / 1000 as total_seconds,
mean_time / 1000 as mean_seconds,
max_time / 1000 as max_seconds
FROM pg_stat_statements
WHERE query NOT LIKE '%pg_stat_statements%'
ORDER BY total_time DESC
LIMIT 10;
-- Reset stats
SELECT pg_stat_statements_reset();
3. EXPLAIN ANALYZE
-- Analyze query execution
EXPLAIN ANALYZE
SELECT a.*, COUNT(c.id) as chunk_count
FROM analyses a
LEFT JOIN chunks c ON c.analysis_id = a.id
WHERE a.user_id = 'user_123'
GROUP BY a.id
ORDER BY a.created_at DESC
LIMIT 20;
-- Look for:
-- - Seq Scan (bad for large tables)
-- - High actual time
-- - High actual rows vs estimated rows
Memory Profiling
Python (memory_profiler)
from memory_profiler import profile
@profile
def load_analyses():
# Shows line-by-line memory usage
analyses = []
for i in range(10000):
analyses.append({
'id': i,
'content': 'x' * 1000, # Memory spike here!
})
return analyses
Chrome DevTools (Heap Snapshot)
Steps:
- Open DevTools → Memory tab
- Take Heap Snapshot
- Interact with app
- Take another snapshot
- Compare snapshots
What to Look For:
- Detached DOM nodes (memory leaks)
- Large arrays/objects
- Unreleased event listeners
Memory Leak Detection
// ❌ BAD: Memory leak (event listener never removed)
useEffect(() => {
window.addEventListener('resize', handleResize);
}, []);
// ✅ GOOD: Cleanup on unmount
useEffect(() => {
window.addEventListener('resize', handleResize);
return () => {
window.removeEventListener('resize', handleResize);
};
}, []);
Flame Graphs
Visual representation of call stacks showing where time is spent.
Reading Flame Graphs:
- Width = Time spent (wider = slower)
- Height = Call stack depth
- Color = Usually just for differentiation
- Top = Leaf functions (where actual work happens)
Generate Flame Graph (Python):
# With py-spy
sudo py-spy record -o flamegraph.svg --pid 12345
# Open in browser
open flamegraph.svg
Load Testing
k6 (HTTP Load Testing)
// load-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';
export let options = {
stages: [
{ duration: '2m', target: 100 }, // Ramp up to 100 users
{ duration: '5m', target: 100 }, // Stay at 100 users
{ duration: '2m', target: 0 }, // Ramp down to 0 users
],
thresholds: {
http_req_duration: ['p(95)<500'], // 95% of requests < 500ms
},
};
export default function () {
const res = http.get('http://localhost:8500/api/v1/analyses');
check(res, {
'status is 200': (r) => r.status === 200,
'response time < 500ms': (r) => r.timings.duration < 500,
});
sleep(1);
}
# Run load test
k6 run load-test.jsLocust (Python Load Testing)
# locustfile.py
from locust import HttpUser, task, between
class ApiUser(HttpUser):
wait_time = between(1, 3)
@task
def get_analyses(self):
self.client.get("/api/v1/analyses")
@task(3) # 3x more frequent
def get_analysis(self):
self.client.get("/api/v1/analyses/abc123")# Run with web UI
locust -f locustfile.py --host=http://localhost:8500
# Or headless
locust -f locustfile.py --host=http://localhost:8500 --users 100 --spawn-rate 10 --run-time 5m --headlessProfiling Best Practices
- Profile in production-like environments - Dev may not show real bottlenecks
- Profile with realistic data volumes - Empty databases hide performance issues
- Focus on the slowest operations first - 80/20 rule applies
- Measure before and after - Verify optimizations actually help
- Profile regularly - Catch regressions early
- Use sampling profilers for production - Low overhead (py-spy, not cProfile)
Quick Profiling Commands
# Python CPU profiling
python -m cProfile -s cumulative script.py | head -20
# Python memory profiling
python -m memory_profiler script.py
# Node.js profiling
node --prof app.js
node --prof-process isolate-*.log > processed.txt
# PostgreSQL slow queries
psql -c "SELECT query, mean_time FROM pg_stat_statements ORDER BY mean_time DESC LIMIT 10"
# Chrome DevTools (programmatic)
node --inspect app.js
# Then open chrome://inspect
References
- py-spy Documentation
- Chrome DevTools Performance
- React Profiler
- k6 Load Testing
- See scripts/performance-metrics.ts for Prometheus metrics setup
Quantization Guide
Quantization Guide
Overview
Quantization reduces model precision to decrease memory usage and increase throughput.
| Method | Bits | Calibration | Memory Savings | Throughput | Quality Loss |
|---|---|---|---|---|---|
| FP16 | 16 | None | Baseline | Baseline | None |
| FP8 | 8 | None | 50% | +30-50% | Minimal |
| INT8 | 8 | Optional | 50% | +10-20% | Minimal |
| AWQ | 4 | Required | 75% | +20-40% | Small |
| GPTQ | 4 | Required | 75% | +15-30% | Small |
AWQ (Activation-aware Weight Quantization)
Best 4-bit method for quality preservation:
# Use pre-quantized AWQ model
vllm serve TheBloke/Llama-2-70B-chat-AWQ \
--quantization awq \
--tensor-parallel-size 2
from vllm import LLM
# AWQ quantized model
llm = LLM(
model="TheBloke/Llama-2-70B-chat-AWQ",
quantization="awq",
dtype="half",
tensor_parallel_size=2,
)
AWQ Benefits:
- Activation-aware: Preserves important weights
- Better quality than GPTQ at same bit-width
- Faster inference on modern GPUs
GPTQ Quantization
Create your own GPTQ quantized model:
from gptqmodel import GPTQModel, QuantizeConfig
from datasets import load_dataset
# Load calibration data
calibration_data = load_dataset(
"allenai/c4",
data_files="en/c4-train.00001-of-01024.json.gz",
split="train",
).select(range(1024))["text"]
# Configure quantization
quant_config = QuantizeConfig(
bits=4, # 4-bit quantization
group_size=128, # Group size for quantization
damp_percent=0.1, # Dampening for Hessian
desc_act=True, # Activation order (better quality)
)
# Load and quantize
model = GPTQModel.load(
"meta-llama/Llama-3.2-1B-Instruct",
quant_config,
)
model.quantize(calibration_data, batch_size=4)
# Save quantized model
model.save("Llama-3.2-1B-Instruct-gptq-4bit")Using GPTQ with vLLM:
from vllm import LLM
llm = LLM(
model="TheBloke/Llama-2-70B-GPTQ",
quantization="gptq",
dtype="half",
)
FP8 Quantization
Best for H100/H200 GPUs with native FP8 support:
from vllm import LLM
# FP8 on H100
llm = LLM(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
quantization="fp8", # Native FP8
kv_cache_dtype="fp8", # FP8 KV cache
)
FP8 Advantages:
- Near-FP16 quality
- 50% memory reduction
- Best throughput on H100/H200
- No calibration needed
INT8 Quantization
Balanced option with minimal quality loss:
# INT8 weight quantization
llm = LLM(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
quantization="int8",
dtype="float16",
)
Quantization Comparison
Memory Usage (70B Model)
| Precision | Memory (per GPU) | GPUs Needed |
|---|---|---|
| FP32 | ~280 GB | 8x A100 80GB |
| FP16 | ~140 GB | 4x A100 80GB |
| INT8/FP8 | ~70 GB | 2x A100 80GB |
| AWQ/GPTQ | ~35 GB | 1x A100 80GB |
Quality Benchmarks (MMLU)
| Model | FP16 | INT8 | AWQ-4bit | GPTQ-4bit |
|---|---|---|---|---|
| Llama-3.1-8B | 66.2% | 65.8% | 65.1% | 64.8% |
| Llama-3.1-70B | 79.3% | 79.0% | 78.2% | 77.9% |
Best Practices
Calibration Data
Use representative data for your use case:
# Domain-specific calibration
calibration_data = [
# Include examples similar to production queries
"Customer support query example...",
"Technical documentation example...",
"Code generation example...",
]
# Minimum 128 samples, recommended 512-1024
assert len(calibration_data) >= 128
Group Size Selection
| Group Size | Memory | Quality | Speed |
|---|---|---|---|
| 32 | Lowest | Best | Slowest |
| 64 | Low | Very Good | Fast |
| 128 | Medium | Good | Fastest |
# Higher group size = faster but lower quality
quant_config = QuantizeConfig(
bits=4,
group_size=128, # Balance of speed and quality
)
Mixed Precision
Keep critical layers at higher precision:
# Some layers benefit from higher precision
quant_config = QuantizeConfig(
bits=4,
group_size=128,
inside_layer_modules=[
# Keep attention at higher precision
"self_attn.q_proj",
"self_attn.k_proj",
"self_attn.v_proj",
],
)
Troubleshooting
OOM During Quantization
# Reduce batch size
model.quantize(calibration_data, batch_size=1)
# Use gradient checkpointing
model.quantize(
calibration_data,
batch_size=2,
use_checkpoint=True,
)
Quality Degradation
- Increase calibration data diversity
- Reduce group size (32 or 64)
- Try AWQ instead of GPTQ
- Enable `desc_act=True` for GPTQ
Related Skills
- `ollama-local` - Local inference with quantized models
- `embeddings` - Quantized embedding models
React Compiler Migration
React Compiler Migration Guide
Adopting React 19's automatic memoization.
What is React Compiler?
React Compiler automatically memoizes components and values, eliminating the need for manual useMemo, useCallback, and React.memo in most cases.
Prerequisites
- React 19+
- Compatible framework (Next.js 16+, Expo SDK 54+)
- Code follows Rules of React
Quick Setup
Next.js 16+
// next.config.js
const nextConfig = {
reactCompiler: true,
}
module.exports = nextConfig
Expo SDK 54+
Enabled by default in new projects.
Babel (Manual)
npm install -D babel-plugin-react-compiler
// babel.config.js
module.exports = {
plugins: [
['babel-plugin-react-compiler', {
// Optional: sources to compile
sources: (filename) => {
return filename.indexOf('src') !== -1
},
}],
],
}
Verification
- Open React DevTools in browser
- Go to Components tab
- Look for "Memo ✨" badge next to component names
- If you see the sparkle emoji, compiler is working
What Gets Optimized
The compiler automatically memoizes:
| Before (Manual) | After (Compiler) |
|---|---|
| React.memo(Component) | Component re-renders only when needed |
| useMemo(() => value, [deps]) | Intermediate values cached |
| useCallback(() => fn, [deps]) | Callback references stable |
| Conditional JSX | JSX elements memoized |
Rules of React (Must Follow)
For the compiler to work correctly:
1. Components Must Be Idempotent
// ✅ Same input → same output
function Profile({ user }) {
return <h1>{user.name}</h1>
}
// ❌ Non-deterministic
function Profile({ user }) {
return <h1>{user.name} at {Date.now()}</h1>
}
2. Props and State Are Immutable
// ✅ Create new object
setUser({ ...user, name: 'New Name' })
// ❌ Mutate existing
user.name = 'New Name'
setUser(user)
3. Side Effects Outside Render
// ✅ In useEffect
useEffect(() => {
analytics.track('view')
}, [])
// ❌ During render
function Component() {
analytics.track('view') // BAD
return <div>...</div>
}
4. Hooks at Top Level
// ✅ Always at top
function Component() {
const [state, setState] = useState()
// ...
}
// ❌ Conditional hooks
function Component({ show }) {
if (show) {
const [state, setState] = useState() // BAD
}
}
Migration Strategy
New Projects
Enable compiler immediately. No reason not to.
Existing Projects
- Enable compiler in config
- Run tests to catch issues
- Check DevTools for Memo badges
- Gradually remove manual memoization
// Before (manual)
const MemoizedChild = React.memo(Child)
const memoizedValue = useMemo(() => compute(data), [data])
const handleClick = useCallback(() => onClick(id), [id, onClick])
// After (compiler handles it)
// Just use Child, compute(data), and onClick directly
// Compiler determines what needs memoization
When Manual Memoization Still Needed
Keep useMemo/useCallback for:
// 1. Effect dependencies that shouldn't trigger re-runs
const stableConfig = useMemo(() => ({
apiUrl: process.env.API_URL,
timeout: 5000,
}), [])
useEffect(() => {
initSDK(stableConfig) // Should only run once
}, [stableConfig])
// 2. Third-party libraries without compiler support
const memoizedData = useMemo(() =>
thirdPartyLib.transform(data), [data])
// 3. Precise control over boundaries
const handleSubmit = useCallback(async () => {
// Complex async logic that must be stable
}, [criticalDep])
Debugging Issues
Component Not Getting Memo Badge
- Check if file is in compiler's sources
- Look for Rules of React violations
- Check for unsupported patterns
Performance Regression
- Profile with React DevTools
- Check if compiler skipped problematic code
- Add manual memoization as escape hatch
Compatibility Notes
- Works with existing `useMemo`/`useCallback` (won't double-memoize)
- Safe to leave existing memoization during migration
- Compiler output is equivalent to manual optimization
Route Splitting
Route-Based Code Splitting
React Router 7.x Lazy Routes
import { createBrowserRouter } from 'react-router';
// Define lazy routes
const routes = [
{
path: '/',
lazy: () => import('./pages/Home'),
},
{
path: '/dashboard',
lazy: () => import('./pages/Dashboard'),
children: [
{
path: 'analytics',
lazy: () => import('./pages/Analytics'),
},
{
path: 'settings',
lazy: () => import('./pages/Settings'),
},
],
},
];
const router = createBrowserRouter(routes);
Vite Manual Chunks
// vite.config.ts
export default defineConfig({
build: {
rollupOptions: {
output: {
manualChunks: {
// Vendor chunks
'react-vendor': ['react', 'react-dom', 'react-router'],
'query-vendor': ['@tanstack/react-query'],
// Feature chunks (match route structure)
'dashboard': [
'./src/pages/Dashboard',
'./src/pages/Analytics',
],
'settings': [
'./src/pages/Settings',
'./src/pages/Profile',
],
},
},
},
},
});
Prefetch on Route Hover
import { useQueryClient } from '@tanstack/react-query';
import { Link } from 'react-router';
function NavLink({ to, children }: { to: string; children: React.ReactNode }) {
const queryClient = useQueryClient();
const prefetch = () => {
// Prefetch route data
queryClient.prefetchQuery({
queryKey: ['route', to],
queryFn: () => fetchRouteData(to),
});
};
return (
<Link
to={to}
onMouseEnter={prefetch}
onFocus={prefetch}
prefetch="intent"
>
{children}
</Link>
);
}
Bundle Size Monitoring
# After build, check chunk sizes
npx vite build
# Output shows chunk sizes
# For detailed analysis
npx vite-bundle-visualizer
RUM Setup
Real User Monitoring (RUM) Setup
Complete guide to implementing Real User Monitoring for Core Web Vitals.
┌─────────────────────────────────────────────────────────────────────────┐
│ RUM Data Flow │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Browser Server Analytics │
│ ┌────────┐ ┌────────┐ ┌────────────┐ │
│ │ User │──interaction──►│ web-vitals │ │ │ │
│ │Session │ │ library │ │ Dashboard │ │
│ └────────┘ └─────┬──────┘ │ + Alerts │ │
│ │ └─────┬──────┘ │
│ │ │ │
│ ┌────────────▼────────────┐ │ │
│ │ sendBeacon / fetch │─────────────► │
│ │ (keepalive: true) │ │ │
│ └────────────┬────────────┘ │ │
│ │ │ │
│ ┌────────────▼────────────┐ │ │
│ │ /api/vitals │─────────────► │
│ │ (batch + process) │ metrics │ │
│ └─────────────────────────┘ │ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
web-vitals Library Setup
Installation
npm install web-vitals
# or
pnpm add web-vitals
Basic Implementation
// lib/vitals.ts
import {
onCLS,
onINP,
onLCP,
onFCP,
onTTFB,
type Metric,
type ReportOpts,
} from 'web-vitals';
// Metric type for your analytics
export interface VitalsMetric {
name: 'CLS' | 'INP' | 'LCP' | 'FCP' | 'TTFB';
value: number;
rating: 'good' | 'needs-improvement' | 'poor';
delta: number;
id: string;
navigationType: 'navigate' | 'reload' | 'back-forward' | 'back-forward-cache' | 'prerender';
// Custom metadata
url: string;
userAgent: string;
connectionType?: string;
deviceMemory?: number;
timestamp: number;
}
// Collect device and connection info for debugging
function getDeviceInfo(): Partial<VitalsMetric> {
const nav = navigator as Navigator & {
connection?: { effectiveType?: string };
deviceMemory?: number;
};
return {
userAgent: navigator.userAgent,
connectionType: nav.connection?.effectiveType,
deviceMemory: nav.deviceMemory,
};
}
function createMetricPayload(metric: Metric): VitalsMetric {
return {
name: metric.name as VitalsMetric['name'],
value: metric.value,
rating: metric.rating,
delta: metric.delta,
id: metric.id,
navigationType: metric.navigationType,
url: window.location.href,
timestamp: Date.now(),
...getDeviceInfo(),
};
}
// Reliable transmission even during page unload
function sendToAnalytics(metric: Metric) {
const payload = createMetricPayload(metric);
const body = JSON.stringify(payload);
// sendBeacon is most reliable for unload scenarios
if (navigator.sendBeacon) {
navigator.sendBeacon('/api/vitals', body);
} else {
// Fallback with keepalive for browsers without sendBeacon
fetch('/api/vitals', {
method: 'POST',
body,
headers: { 'Content-Type': 'application/json' },
keepalive: true, // Keeps request alive even if page unloads
});
}
}
// Report all web vitals
export function reportWebVitals(opts?: ReportOpts) {
// Core Web Vitals (affect SEO)
onCLS(sendToAnalytics, opts);
onINP(sendToAnalytics, opts);
onLCP(sendToAnalytics, opts);
// Additional useful metrics
onFCP(sendToAnalytics, opts);
onTTFB(sendToAnalytics, opts);
}
Next.js App Router Integration
Client Component for Vitals
// app/components/web-vitals.tsx
'use client';
import { useEffect } from 'react';
import { reportWebVitals } from '@/lib/vitals';
export function WebVitals() {
useEffect(() => {
// Report immediately (first value)
reportWebVitals({ reportAllChanges: false });
}, []);
return null;
}
// For debugging during development
export function WebVitalsDebug() {
useEffect(() => {
// Report all changes, not just final values
reportWebVitals({ reportAllChanges: true });
}, []);
return null;
}
Layout Integration
// app/layout.tsx
import { WebVitals } from '@/components/web-vitals';
export default function RootLayout({
children,
}: {
children: React.ReactNode;
}) {
return (
<html lang="en">
<body>
<WebVitals />
{children}
</body>
</html>
);
}
API Endpoint Implementation
Next.js Route Handler
// app/api/vitals/route.ts
import { NextRequest, NextResponse } from 'next/server';
// Thresholds from web.dev
const THRESHOLDS = {
LCP: { good: 2500, poor: 4000 },
INP: { good: 200, poor: 500 },
CLS: { good: 0.1, poor: 0.25 },
FCP: { good: 1800, poor: 3000 },
TTFB: { good: 800, poor: 1800 },
} as const;
// 2026 thresholds (plan ahead!)
const THRESHOLDS_2026 = {
LCP: { good: 2000, poor: 4000 },
INP: { good: 150, poor: 500 },
CLS: { good: 0.08, poor: 0.25 },
} as const;
interface VitalsMetric {
name: string;
value: number;
rating: string;
delta: number;
id: string;
navigationType: string;
url: string;
userAgent: string;
connectionType?: string;
deviceMemory?: number;
timestamp: number;
}
// Validate incoming metric
function isValidMetric(data: unknown): data is VitalsMetric {
if (!data || typeof data !== 'object') return false;
const metric = data as Record<string, unknown>;
return (
typeof metric.name === 'string' &&
typeof metric.value === 'number' &&
typeof metric.rating === 'string'
);
}
export async function POST(request: NextRequest) {
try {
const metric = await request.json();
if (!isValidMetric(metric)) {
return NextResponse.json(
{ error: 'Invalid metric format' },
{ status: 400 }
);
}
// Enrich with server-side data
const enrichedMetric = {
...metric,
receivedAt: new Date().toISOString(),
clientIP: request.headers.get('x-forwarded-for') ?? 'unknown',
country: request.headers.get('x-vercel-ip-country') ?? 'unknown',
};
// Log for debugging (replace with your analytics service)
console.log('[Web Vital]', JSON.stringify(enrichedMetric));
// Store in your analytics database
await storeMetric(enrichedMetric);
// Alert on poor metrics (optional)
if (metric.rating === 'poor') {
await alertOnPoorMetric(enrichedMetric);
}
return NextResponse.json({ received: true });
} catch (error) {
console.error('[Vitals API Error]', error);
return NextResponse.json(
{ error: 'Failed to process metric' },
{ status: 500 }
);
}
}
// Example: Store in PostgreSQL
async function storeMetric(metric: VitalsMetric & { receivedAt: string }) {
// Replace with your database client
// await db.insert('web_vitals').values({
// name: metric.name,
// value: metric.value,
// rating: metric.rating,
// url: metric.url,
// user_agent: metric.userAgent,
// connection_type: metric.connectionType,
// timestamp: new Date(metric.timestamp),
// received_at: new Date(metric.receivedAt),
// });
}
// Example: Alert via Slack/PagerDuty
async function alertOnPoorMetric(metric: VitalsMetric) {
const threshold = THRESHOLDS[metric.name as keyof typeof THRESHOLDS];
if (!threshold) return;
// await fetch(process.env.SLACK_WEBHOOK_URL!, {
// method: 'POST',
// body: JSON.stringify({
// text: `🚨 Poor ${metric.name}: ${metric.value}${metric.name === 'CLS' ? '' : 'ms'} on ${metric.url}`,
// }),
// });
}
Batching for High-Traffic Sites
// lib/vitals-batched.ts
import { onCLS, onINP, onLCP, type Metric } from 'web-vitals';
const BATCH_SIZE = 10;
const FLUSH_INTERVAL = 5000; // 5 seconds
class MetricsBatcher {
private queue: Metric[] = [];
private flushTimer: ReturnType<typeof setTimeout> | null = null;
add(metric: Metric) {
this.queue.push(metric);
if (this.queue.length >= BATCH_SIZE) {
this.flush();
} else if (!this.flushTimer) {
this.flushTimer = setTimeout(() => this.flush(), FLUSH_INTERVAL);
}
}
private flush() {
if (this.queue.length === 0) return;
const metrics = [...this.queue];
this.queue = [];
if (this.flushTimer) {
clearTimeout(this.flushTimer);
this.flushTimer = null;
}
// Send batch
navigator.sendBeacon(
'/api/vitals/batch',
JSON.stringify({ metrics, timestamp: Date.now() })
);
}
// Flush on page unload
flushSync() {
if (this.flushTimer) {
clearTimeout(this.flushTimer);
this.flushTimer = null;
}
this.flush();
}
}
const batcher = new MetricsBatcher();
// Ensure flush on unload
if (typeof window !== 'undefined') {
window.addEventListener('visibilitychange', () => {
if (document.visibilityState === 'hidden') {
batcher.flushSync();
}
});
}
export function reportWebVitalsBatched() {
onCLS((metric) => batcher.add(metric));
onINP((metric) => batcher.add(metric));
onLCP((metric) => batcher.add(metric));
}
Database Schema
PostgreSQL Schema
CREATE TABLE web_vitals (
id BIGSERIAL PRIMARY KEY,
name VARCHAR(10) NOT NULL,
value DECIMAL(10, 4) NOT NULL,
rating VARCHAR(20) NOT NULL,
delta DECIMAL(10, 4),
metric_id VARCHAR(50),
navigation_type VARCHAR(30),
url TEXT NOT NULL,
user_agent TEXT,
connection_type VARCHAR(20),
device_memory INT,
client_ip INET,
country VARCHAR(2),
timestamp TIMESTAMPTZ NOT NULL,
received_at TIMESTAMPTZ DEFAULT NOW()
);
-- Indexes for common queries (PostgreSQL requires separate CREATE INDEX statements)
CREATE INDEX idx_vitals_name_timestamp ON web_vitals (name, timestamp DESC);
CREATE INDEX idx_vitals_url ON web_vitals (url);
CREATE INDEX idx_vitals_rating ON web_vitals (rating);
-- Partition by month for large datasets (requires creating the parent table
-- with PARTITION BY RANGE (timestamp))
CREATE TABLE web_vitals_2025_01 PARTITION OF web_vitals
FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
Analytics Queries
-- Daily Core Web Vitals summary (p75 is Google's standard)
SELECT
DATE(timestamp) as date,
name,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) as p75,
COUNT(CASE WHEN rating = 'good' THEN 1 END)::float / COUNT(*) * 100 as good_pct,
COUNT(*) as samples
FROM web_vitals
WHERE timestamp > NOW() - INTERVAL '30 days'
AND name IN ('LCP', 'INP', 'CLS')
GROUP BY DATE(timestamp), name
ORDER BY date DESC, name;
-- Worst performing pages by LCP
SELECT
url,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) as p75_lcp,
COUNT(*) as samples
FROM web_vitals
WHERE name = 'LCP'
AND timestamp > NOW() - INTERVAL '7 days'
GROUP BY url
HAVING COUNT(*) > 100
ORDER BY p75_lcp DESC
LIMIT 20;
-- Performance by connection type
SELECT
connection_type,
name,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) as p75,
COUNT(*) as samples
FROM web_vitals
WHERE timestamp > NOW() - INTERVAL '7 days'
AND connection_type IS NOT NULL
GROUP BY connection_type, name
ORDER BY connection_type, name;
-- Trend analysis: Week-over-week comparison
WITH current_week AS (
SELECT name, PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) as p75
FROM web_vitals
WHERE timestamp > NOW() - INTERVAL '7 days'
GROUP BY name
),
previous_week AS (
SELECT name, PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) as p75
FROM web_vitals
WHERE timestamp BETWEEN NOW() - INTERVAL '14 days' AND NOW() - INTERVAL '7 days'
GROUP BY name
)
SELECT
c.name,
c.p75 as current_p75,
p.p75 as previous_p75,
ROUND((c.p75 - p.p75) / p.p75 * 100, 2) as change_pct
FROM current_week c
JOIN previous_week p ON c.name = p.name;
Grafana Dashboard
Prometheus Metrics Export
// lib/metrics-exporter.ts
import { Histogram, Counter, Registry } from 'prom-client';
const registry = new Registry();
// Histogram for percentile calculations
const webVitalsHistogram = new Histogram({
name: 'web_vitals_value',
help: 'Web Vitals metric values',
labelNames: ['name', 'rating'],
buckets: {
LCP: [1000, 1500, 2000, 2500, 3000, 4000, 5000],
INP: [50, 100, 150, 200, 300, 500, 1000],
CLS: [0.01, 0.05, 0.1, 0.15, 0.25, 0.5],
}['LCP'], // Default buckets
registers: [registry],
});
const webVitalsCounter = new Counter({
name: 'web_vitals_total',
help: 'Total count of Web Vitals reports',
labelNames: ['name', 'rating'],
registers: [registry],
});
export function recordMetric(name: string, value: number, rating: string) {
webVitalsHistogram.labels(name, rating).observe(value);
webVitalsCounter.labels(name, rating).inc();
}
export { registry };
Grafana Alert Rules
# grafana-alerts.yaml
groups:
- name: core-web-vitals
interval: 5m
rules:
# LCP Alert
- alert: HighLCP
expr: histogram_quantile(0.75, sum(rate(web_vitals_value_bucket{name="LCP"}[15m])) by (le)) > 2500
for: 10m
labels:
severity: warning
annotations:
summary: "LCP p75 is {{ $value | printf \"%.0f\" }}ms (threshold: 2500ms)"
description: "Largest Contentful Paint has degraded. Check recent deployments."
# INP Alert
- alert: HighINP
expr: histogram_quantile(0.75, sum(rate(web_vitals_value_bucket{name="INP"}[15m])) by (le)) > 200
for: 10m
labels:
severity: warning
annotations:
summary: "INP p75 is {{ $value | printf \"%.0f\" }}ms (threshold: 200ms)"
description: "Interaction to Next Paint has degraded. Check for long tasks."
# CLS Alert
- alert: HighCLS
expr: histogram_quantile(0.75, sum(rate(web_vitals_value_bucket{name="CLS"}[15m])) by (le)) > 0.1
for: 10m
labels:
severity: warning
annotations:
summary: "CLS p75 is {{ $value | printf \"%.3f\" }} (threshold: 0.1)"
description: "Cumulative Layout Shift has degraded. Check for layout shifts."
# Good rate dropping
- alert: GoodRateDrop
expr: |
(sum(rate(web_vitals_total{rating="good"}[1h])) by (name) /
sum(rate(web_vitals_total[1h])) by (name)) < 0.75
for: 30m
labels:
severity: critical
annotations:
summary: "{{ $labels.name }} good rate dropped below 75%"
description: "Less than 75% of users are experiencing good {{ $labels.name }}"
Sampling Strategy for High Traffic
// lib/vitals-sampled.ts
import { onCLS, onINP, onLCP, type Metric } from 'web-vitals';
// Assumes sendToAnalytics is exported from the basic implementation above
import { sendToAnalytics } from './vitals';
interface SamplingConfig {
// Base sample rate (0-1)
baseRate: number;
// Always sample poor metrics
alwaysSamplePoor: boolean;
// Sample more on specific pages
pageMultipliers?: Record<string, number>;
}
const DEFAULT_CONFIG: SamplingConfig = {
baseRate: 0.1, // 10% baseline
alwaysSamplePoor: true,
pageMultipliers: {
'/': 1.0, // Always sample homepage
'/checkout': 1.0, // Always sample checkout
},
};
function shouldSample(metric: Metric, config: SamplingConfig): boolean {
// Always sample poor metrics for debugging
if (config.alwaysSamplePoor && metric.rating === 'poor') {
return true;
}
// Check page-specific multiplier
const path = window.location.pathname;
const multiplier = config.pageMultipliers?.[path] ?? 1;
const effectiveRate = config.baseRate * multiplier;
return Math.random() < effectiveRate;
}
export function reportWebVitalsSampled(config = DEFAULT_CONFIG) {
const report = (metric: Metric) => {
if (shouldSample(metric, config)) {
sendToAnalytics(metric);
}
};
onCLS(report);
onINP(report);
onLCP(report);
}
Testing RUM in Development
// lib/vitals-dev.ts
import { onCLS, onINP, onLCP, type Metric } from 'web-vitals';
const RATING_COLORS = {
good: 'color: green',
'needs-improvement': 'color: orange',
poor: 'color: red',
} as const;
function logToConsole(metric: Metric) {
const color = RATING_COLORS[metric.rating];
const unit = metric.name === 'CLS' ? '' : 'ms';
console.log(
`%c[${metric.name}] ${metric.value.toFixed(2)}${unit} (${metric.rating})`,
color,
{
delta: metric.delta,
id: metric.id,
navigationType: metric.navigationType,
}
);
}
export function reportWebVitalsDev() {
// Report all changes for debugging
onCLS(logToConsole, { reportAllChanges: true });
onINP(logToConsole, { reportAllChanges: true });
onLCP(logToConsole, { reportAllChanges: true });
}
// Usage in development
if (process.env.NODE_ENV === 'development') {
reportWebVitalsDev();
}
Integration with Analytics Providers
Google Analytics 4
// lib/vitals-ga4.ts
import { onCLS, onINP, onLCP, type Metric } from 'web-vitals';
declare global {
interface Window {
gtag?: (...args: unknown[]) => void;
}
}
function sendToGA4(metric: Metric) {
if (typeof window.gtag !== 'function') return;
window.gtag('event', metric.name, {
event_category: 'Web Vitals',
event_label: metric.id,
value: Math.round(metric.name === 'CLS' ? metric.value * 1000 : metric.value),
metric_rating: metric.rating,
non_interaction: true,
});
}
export function reportWebVitalsGA4() {
onCLS(sendToGA4);
onINP(sendToGA4);
onLCP(sendToGA4);
}
Vercel Analytics
// Next.js built-in support
// next.config.js
module.exports = {
// Vercel Analytics automatically collects Web Vitals
// No additional setup needed when deployed on Vercel
};
// For self-hosted, use @vercel/analytics
import { Analytics } from '@vercel/analytics/react';
export default function RootLayout({ children }) {
return (
<html>
<body>
{children}
<Analytics />
</body>
</html>
);
}
Speculative Decoding
Overview
Speculative decoding accelerates autoregressive generation by predicting multiple tokens at once, then verifying in parallel.
How it works:
- Draft model (or n-gram) proposes N candidate tokens
- Target model verifies all N tokens in one forward pass
- Accept verified tokens, reject incorrect ones
- Repeat from first rejected position
Expected gains: 1.5-2.5x throughput for compatible workloads.
N-gram Speculation
No extra model needed - uses prompt patterns:
# vLLM CLI with n-gram speculation
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
--speculative-config '{
"method": "ngram",
"num_speculative_tokens": 5,
"prompt_lookup_max": 5,
"prompt_lookup_min": 2
}'
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
speculative_config={
"method": "ngram",
"num_speculative_tokens": 5,
"prompt_lookup_max": 5,
"prompt_lookup_min": 2,
},
)
# Works best with repetitive/structured output
outputs = llm.generate(
["Generate a JSON object with user data:"],
SamplingParams(max_tokens=500),
)
Best for:
- Structured output (JSON, code)
- Repetitive patterns
- Low additional memory
Draft Model Speculation
Use a smaller model to draft tokens:
# Draft model speculation
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
--speculative-config '{
"method": "draft_model",
"draft_model": "meta-llama/Llama-3.2-1B-Instruct",
"num_speculative_tokens": 3
}' \
--tensor-parallel-size 4
from vllm import LLM
llm = LLM(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
speculative_config={
"method": "draft_model",
"draft_model": "meta-llama/Llama-3.2-1B-Instruct",
"num_speculative_tokens": 3,
},
tensor_parallel_size=4,
)
Draft model selection:
| Target Model | Recommended Draft | Size Ratio |
|---|---|---|
| 70B | 7B or 8B | ~10% |
| 70B | 1B-3B | ~2-5% |
| 8B | 1B | ~12% |
| 405B | 8B-70B | ~2-17% |
Medusa-style Speculation
Multiple prediction heads for parallel token generation:
# Medusa-style model (requires trained heads)
llm = LLM(
model="lmsys/vicuna-7b-v1.5-16k-medusa",
speculative_config={
"method": "medusa",
"num_heads": 4, # Number of speculation heads
},
)
Advantages:
- No separate draft model
- Lower memory than draft model
- Works well with fine-tuned models
Performance Tuning
Optimal Token Count
# Benchmark different speculation depths
for num_tokens in [1, 3, 5, 7]:
llm = LLM(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
speculative_config={
"method": "ngram",
"num_speculative_tokens": num_tokens,
},
)
throughput = benchmark(llm)
print(f"Tokens: {num_tokens}, Throughput: {throughput:.1f} tok/s")
General guidelines:
| Scenario | Recommended Tokens |
|---|---|
| Code generation | 5-7 |
| JSON output | 5-7 |
| Free-form text | 2-4 |
| Creative writing | 1-3 |
Acceptance Rate Monitoring
# vLLM logs acceptance rates
# Look for: "Speculative decoding acceptance rate: X%"
# High acceptance (>70%): Increase num_speculative_tokens
# Low acceptance (<40%): Decrease or disable speculation
When NOT to Use
Speculative decoding may hurt performance when:
- High randomness (temperature > 1.0)
- Short outputs (overhead > benefit)
- Diverse outputs (low acceptance rate)
- Memory constrained (draft model overhead)
# Disable speculation for creative tasks
sampling_params = SamplingParams(
temperature=1.2,
top_p=0.95,
max_tokens=100, # Short output
)
# Use standard decoding instead
Benchmarking
import time
from vllm import LLM, SamplingParams
def benchmark_speculation(model_path: str, prompts: list[str]):
"""Compare with and without speculative decoding."""
# Without speculation
llm_base = LLM(model=model_path)
start = time.perf_counter()
outputs_base = llm_base.generate(prompts, SamplingParams(max_tokens=512))
time_base = time.perf_counter() - start
# With speculation
llm_spec = LLM(
model=model_path,
speculative_config={
"method": "ngram",
"num_speculative_tokens": 5,
},
)
start = time.perf_counter()
outputs_spec = llm_spec.generate(prompts, SamplingParams(max_tokens=512))
time_spec = time.perf_counter() - start
tokens_base = sum(len(o.outputs[0].token_ids) for o in outputs_base)
tokens_spec = sum(len(o.outputs[0].token_ids) for o in outputs_spec)
print(f"Baseline: {tokens_base/time_base:.1f} tok/s")
print(f"Speculative: {tokens_spec/time_spec:.1f} tok/s")
print(f"Speedup: {(time_base/time_spec):.2f}x")
# JSON/code prompts benefit most
prompts = [
"Generate a Python function that implements binary search:",
"Create a JSON schema for a user profile with validation:",
"Write a SQL query to find top 10 customers by revenue:",
]
benchmark_speculation("meta-llama/Meta-Llama-3.1-8B-Instruct", prompts)
Related Skills
- `llm-streaming` - Streaming with speculation
- `prompt-caching` - Combine with prefix caching
State Colocation
Keep state as close to where it's used as possible.
The Principle
State should live in the component that needs it. Only lift state when truly necessary for sibling communication.
Problem: State Too High
// ❌ State at app level causes unnecessary re-renders
function App() {
const [searchQuery, setSearchQuery] = useState('')
const [selectedId, setSelectedId] = useState(null)
return (
<div>
<Header /> {/* Re-renders on search! */}
<Sidebar /> {/* Re-renders on search! */}
<SearchInput
value={searchQuery}
onChange={setSearchQuery}
/>
<SearchResults
query={searchQuery}
selectedId={selectedId}
onSelect={setSelectedId}
/>
<Footer /> {/* Re-renders on search! */}
</div>
)
}
Solution: Colocate State
// ✅ State colocated with components that use it
function App() {
return (
<div>
<Header />
<Sidebar />
<SearchSection /> {/* Contains its own state */}
<Footer />
</div>
)
}
function SearchSection() {
const [searchQuery, setSearchQuery] = useState('')
const [selectedId, setSelectedId] = useState(null)
return (
<>
<SearchInput
value={searchQuery}
onChange={setSearchQuery}
/>
<SearchResults
query={searchQuery}
selectedId={selectedId}
onSelect={setSelectedId}
/>
</>
)
}
When to Lift State
Lift state ONLY when:
- Siblings need to share it
// Both components need selectedUser
function Parent() {
const [selectedUser, setSelectedUser] = useState(null)
return (
<>
<UserList onSelect={setSelectedUser} selected={selectedUser} />
<UserDetails user={selectedUser} />
</>
)
}
- Parent needs to coordinate
// Parent manages form submission
function Form() {
const [values, setValues] = useState({})
const handleSubmit = () => {
api.submit(values)
}
return (
<>
<FormFields values={values} onChange={setValues} />
<SubmitButton onClick={handleSubmit} />
</>
)
}
Component Splitting
Split components to isolate state:
// ❌ Before: Counter re-renders entire card
function Card() {
const [count, setCount] = useState(0)
return (
<div className="card">
<ExpensiveHeader /> {/* Re-renders on count change */}
<ExpensiveContent /> {/* Re-renders on count change */}
<button onClick={() => setCount(c => c + 1)}>
Count: {count}
</button>
</div>
)
}
// ✅ After: Counter isolated
function Card() {
return (
<div className="card">
<ExpensiveHeader /> {/* Doesn't re-render */}
<ExpensiveContent /> {/* Doesn't re-render */}
<Counter /> {/* Only this re-renders */}
</div>
)
}
function Counter() {
const [count, setCount] = useState(0)
return (
<button onClick={() => setCount(c => c + 1)}>
Count: {count}
</button>
)
}
Context for Cross-Cutting Concerns
Use Context for truly global state, not local UI state:
// ✅ Good: Theme is app-wide
<ThemeContext.Provider value={theme}>
<App />
</ThemeContext.Provider>
// ✅ Good: Auth is app-wide
<AuthContext.Provider value={user}>
<App />
</AuthContext.Provider>
// ❌ Bad: Search query is local
<SearchQueryContext.Provider value={query}> {/* Don't do this */}
<Header />
<SearchResults />
</SearchQueryContext.Provider>
Context Splitting
Split contexts to prevent unnecessary re-renders:
// ❌ Single context - all consumers re-render
const AppContext = createContext({ user, theme, locale })
// ✅ Split contexts - targeted re-renders
const UserContext = createContext(null)
const ThemeContext = createContext('light')
const LocaleContext = createContext('en')
Signs State Should Move
Move state DOWN when:
- Only one component uses it
- Child components don't need it
- Re-renders are affecting unrelated components
Move state UP when:
- Multiple children need to read it
- Children need to update each other
- State represents shared domain concept
Quick Checklist
- Is state used by only one component? → Keep it there
- Do siblings need this state? → Lift to parent
- Is it causing unnecessary re-renders? → Consider splitting
- Is it truly global? → Use Context
- Is it URL state? → Use router params
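For the router-params item, a minimal sketch using React Router's useSearchParams (component and param names are illustrative):
import { useSearchParams } from 'react-router'
function FilterBar() {
  // Query lives in the URL: it survives refresh and is shareable
  const [searchParams, setSearchParams] = useSearchParams()
  const query = searchParams.get('q') ?? ''
  return (
    <input
      value={query}
      onChange={(e) => setSearchParams({ q: e.target.value })}
    />
  )
}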
TanStack Virtual Patterns
Efficient virtualization for large lists and grids.
When to Virtualize
| Item Count | Recommendation |
|---|---|
| < 50 | Not needed |
| 50-100 | Consider if items are complex |
| 100-500 | Recommended |
| 500+ | Required |
Basic List Virtualization
import { useVirtualizer } from '@tanstack/react-virtual'
function VirtualList({ items }) {
const parentRef = useRef<HTMLDivElement>(null)
const virtualizer = useVirtualizer({
count: items.length,
getScrollElement: () => parentRef.current,
estimateSize: () => 50, // Estimated row height in px
overscan: 5, // Render 5 extra items for smooth scrolling
})
return (
<div
ref={parentRef}
style={{ height: '400px', overflow: 'auto' }}
>
<div
style={{
height: `${virtualizer.getTotalSize()}px`,
width: '100%',
position: 'relative',
}}
>
{virtualizer.getVirtualItems().map((virtualItem) => (
<div
key={virtualItem.key}
style={{
position: 'absolute',
top: 0,
left: 0,
width: '100%',
height: `${virtualItem.size}px`,
transform: `translateY(${virtualItem.start}px)`,
}}
>
{items[virtualItem.index].name}
</div>
))}
</div>
</div>
)
}
Variable Height Rows
For rows with different heights:
function VariableHeightList({ items }) {
const parentRef = useRef<HTMLDivElement>(null)
const virtualizer = useVirtualizer({
count: items.length,
getScrollElement: () => parentRef.current,
estimateSize: (index) => {
// Return estimated height based on content
return items[index].type === 'header' ? 80 : 50
},
overscan: 5,
})
return (
<div ref={parentRef} style={{ height: '400px', overflow: 'auto' }}>
<div
style={{
height: `${virtualizer.getTotalSize()}px`,
position: 'relative',
}}
>
{virtualizer.getVirtualItems().map((virtualItem) => (
<div
key={virtualItem.key}
data-index={virtualItem.index}
ref={virtualizer.measureElement} // Enable dynamic measurement
style={{
position: 'absolute',
top: 0,
left: 0,
width: '100%',
transform: `translateY(${virtualItem.start}px)`,
}}
>
<ItemComponent item={items[virtualItem.index]} />
</div>
))}
</div>
</div>
)
}
Dynamic Measurement
When content determines height:
const virtualizer = useVirtualizer({
count: items.length,
getScrollElement: () => parentRef.current,
estimateSize: () => 50, // Initial estimate
// measureElement enables dynamic re-measurement
})
// Add ref to each item
<div
key={virtualItem.key}
data-index={virtualItem.index}
ref={virtualizer.measureElement}
>
{/* Content with unknown height */}
</div>
Horizontal Virtualization
const columnVirtualizer = useVirtualizer({
horizontal: true,
count: columns.length,
getScrollElement: () => parentRef.current,
estimateSize: () => 150, // Column width
overscan: 3,
})
Grid Virtualization
Combine row and column virtualizers:
function VirtualGrid({ rows, columns }) {
const parentRef = useRef<HTMLDivElement>(null)
const rowVirtualizer = useVirtualizer({
count: rows.length,
getScrollElement: () => parentRef.current,
estimateSize: () => 50,
overscan: 5,
})
const columnVirtualizer = useVirtualizer({
horizontal: true,
count: columns.length,
getScrollElement: () => parentRef.current,
estimateSize: () => 100,
overscan: 3,
})
return (
<div ref={parentRef} style={{ height: '400px', overflow: 'auto' }}>
<div
style={{
height: `${rowVirtualizer.getTotalSize()}px`,
width: `${columnVirtualizer.getTotalSize()}px`,
position: 'relative',
}}
>
{rowVirtualizer.getVirtualItems().map((virtualRow) => (
<React.Fragment key={virtualRow.key}>
{columnVirtualizer.getVirtualItems().map((virtualColumn) => (
<div
key={virtualColumn.key}
style={{
position: 'absolute',
top: 0,
left: 0,
width: `${virtualColumn.size}px`,
height: `${virtualRow.size}px`,
transform: `translateX(${virtualColumn.start}px) translateY(${virtualRow.start}px)`,
}}
>
{/* Cell content */}
</div>
))}
</React.Fragment>
))}
</div>
</div>
)
}
Scroll to Index
const virtualizer = useVirtualizer({/* ... */})
// Scroll to specific item
virtualizer.scrollToIndex(50, { align: 'start' })
// Align options: 'start' | 'center' | 'end' | 'auto'
Window Scroller
For document-level scrolling:
import { useWindowVirtualizer } from '@tanstack/react-virtual'
function WindowList({ items }) {
const virtualizer = useWindowVirtualizer({
count: items.length,
estimateSize: () => 50,
overscan: 5,
})
return (
<div
style={{
height: `${virtualizer.getTotalSize()}px`,
position: 'relative',
}}
>
{virtualizer.getVirtualItems().map((virtualItem) => (
<div
key={virtualItem.key}
style={{
position: 'absolute',
top: 0,
left: 0,
width: '100%',
transform: `translateY(${virtualItem.start}px)`,
}}
>
{items[virtualItem.index].name}
</div>
))}
</div>
)
}
Performance Tips
- Use stable keys: Avoid array index as key
- Memoize items: If item rendering is expensive
- Adjust overscan: More overscan = smoother scroll, more DOM nodes
- Measure sparingly: Only use `measureElement` when needed
- Debounce scroll: For very heavy computations
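A sketch of the first two tips; the Row component and item shape are illustrative:
import { memo } from 'react'
// Row re-renders only when its item changes, not on every scroll
const Row = memo(function Row({ item }: { item: { id: string; name: string } }) {
  return <div>{item.name}</div>
})
// In the virtualizer loop, key by data id rather than array index:
// <Row key={items[virtualItem.index].id} item={items[virtualItem.index]} />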
vLLM Deployment
PagedAttention
vLLM's PagedAttention manages KV cache memory in non-contiguous blocks, enabling:
- Efficient memory: Only allocates what's needed per request
- Dynamic batching: Handles variable sequence lengths
- Up to 24x throughput: Compared to naive implementations
from vllm import LLM, SamplingParams
# PagedAttention is enabled by default
llm = LLM(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
gpu_memory_utilization=0.9, # Use 90% GPU memory for KV cache
max_num_seqs=256, # Max concurrent sequences
max_model_len=8192, # Max context length
)
Continuous Batching
Dynamic batching that doesn't wait for batch completion:
from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams
# Configure async engine for continuous batching
engine_args = AsyncEngineArgs(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
max_num_batched_tokens=8192, # Max tokens per batch
max_num_seqs=64, # Max concurrent sequences
enable_chunked_prefill=True, # Better latency for long prompts
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
# Requests are automatically batched
async def generate(prompt: str):
sampling_params = SamplingParams(max_tokens=512)
generator = engine.generate(prompt, sampling_params, request_id="req-1")
async for output in generator:
yield output.outputs[0].text
CUDA Graphs
Capture and replay CUDA operations for faster execution:
# Enable via CLI
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--enforce-eager false # Enable CUDA graphs (default)
# Disable for debugging
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--enforce-eager true # Disable CUDA graphs
# Python API
llm = LLM(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
enforce_eager=False, # Enable CUDA graphs
)
Note: CUDA graphs require fixed input shapes. vLLM handles this automatically with padding.
Tensor Parallelism
Scale across multiple GPUs:
# 4-GPU tensor parallelism
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
--tensor-parallel-size 4
# With pipeline parallelism (for very large models)
vllm serve meta-llama/Meta-Llama-3.3-405B-Instruct \
--tensor-parallel-size 4 \
--pipeline-parallel-size 2
llm = LLM(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
tensor_parallel_size=4,
distributed_executor_backend="ray", # For multi-node
)
GPU Requirements:
| Model Size | GPUs (FP16) | GPUs (INT8) | GPUs (AWQ/GPTQ) |
|---|---|---|---|
| 7B | 1 | 1 | 1 |
| 13B | 1 | 1 | 1 |
| 70B | 4 | 2 | 1-2 |
| 405B | 8+ | 4+ | 4+ |
Prefix Caching
Reuse KV cache for shared prompt prefixes:
llm = LLM(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
enable_prefix_caching=True, # Enable prefix caching
)
# Shared system prompt benefits from caching
system_prompt = "You are a helpful assistant. Be concise and accurate."
prompts = [
f"{system_prompt}\n\nUser: What is Python?",
f"{system_prompt}\n\nUser: Explain REST APIs.",
f"{system_prompt}\n\nUser: What is Docker?",
]
# First request computes system prompt KV cache
# Subsequent requests reuse cached prefix
outputs = llm.generate(prompts, SamplingParams(max_tokens=256))
Benefits:
- Reduced TTFT (time to first token) for shared prefixes
- Lower GPU memory for batch requests
- Ideal for: chat systems, RAG with fixed context
Production Server Configuration
# Production vLLM server
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 4 \
--max-model-len 8192 \
--max-num-seqs 128 \
--gpu-memory-utilization 0.9 \
--enable-prefix-caching \
--disable-log-requests \
--api-key $VLLM_API_KEY
# With quantization
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
--quantization awq \
--dtype half \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.85
OpenAI-compatible API:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="your-api-key",
)
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=256,
)
Monitoring and Metrics
vLLM exposes Prometheus metrics:
# Enable metrics
vllm serve ... --enable-metrics
# Metrics endpoint
curl http://localhost:8000/metrics
Key metrics:
- `vllm:num_requests_running`: Active requests
- `vllm:num_requests_waiting`: Queued requests
- `vllm:gpu_cache_usage_perc`: KV cache utilization
- `vllm:avg_prompt_throughput_toks_per_s`: Input throughput
- `vllm:avg_generation_throughput_toks_per_s`: Output throughput
Related Skills
- `observability-monitoring` - Production monitoring patterns
- `performance-testing` - Load testing inference endpoints
Checklists (5)
CWV Checklist
Core Web Vitals Optimization Checklist
Comprehensive checklist for achieving and maintaining good Core Web Vitals scores.
Thresholds Reference
| Metric | Good | Needs Improvement | Poor |
|---|---|---|---|
| LCP | ≤ 2.5s | ≤ 4.0s | > 4.0s |
| INP | ≤ 200ms | ≤ 500ms | > 500ms |
| CLS | ≤ 0.1 | ≤ 0.25 | > 0.25 |
2026 Stricter Thresholds (plan ahead!):
- LCP: ≤ 2.0s
- INP: ≤ 150ms
- CLS: ≤ 0.08
LCP (Largest Contentful Paint) ≤ 2.5s
Identify the LCP Element
- Run Lighthouse to identify LCP element
- Use Performance Observer to confirm in production
- LCP is typically: hero image, hero heading, or above-the-fold banner
// Debug: Find LCP element
new PerformanceObserver((list) => {
const entries = list.getEntries();
console.log('LCP element:', entries[entries.length - 1].element);
}).observe({ type: 'largest-contentful-paint', buffered: true });
Server Response Time (TTFB)
- Server response time (TTFB) < 800ms
- Use edge/CDN for static content
- Enable HTTP/2 or HTTP/3
- Compress responses (gzip/brotli)
- Database queries optimized
- Caching strategy implemented (Redis, CDN cache)
Critical Resource Loading
- LCP image has `fetchpriority="high"` attribute
- LCP image has `loading="eager"` (not lazy)
- LCP image preloaded in `<head>`
- Critical CSS inlined or preloaded
- Font preloaded with `crossorigin` attribute
- Preconnect to critical third-party origins
<!-- Preload critical resources -->
<link rel="preload" as="image" href="/hero.webp" fetchpriority="high" />
<link rel="preload" as="font" href="/font.woff2" type="font/woff2" crossorigin />
<link rel="preconnect" href="https://api.example.com" />
Image Optimization
- LCP image in modern format (WebP/AVIF)
- Image properly sized (not oversized)
- Responsive images with `srcset`
- Image CDN used (Cloudinary, imgix, Vercel)
Rendering Strategy
- LCP content rendered server-side (SSR/SSG)
- LCP content NOT loaded client-side via fetch
- No render-blocking JavaScript
- No render-blocking CSS below the fold
- Third-party scripts deferred
// ✅ GOOD: Server-rendered LCP content
export default async function Page() {
const hero = await getHeroData();
return <Hero data={hero} />;
}
// ❌ BAD: Client-loaded LCP content
function Page() {
const [hero, setHero] = useState(null);
useEffect(() => { fetchHero().then(setHero); }, []); // Delays LCP!
}
INP (Interaction to Next Paint) ≤ 200ms
Identify Long Tasks
- Chrome DevTools Performance tab analyzed
- Long tasks (>50ms) identified
- Main thread blockers removed/optimized
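To confirm long tasks outside of a DevTools recording, a small observer sketch (Long Tasks API, Chromium-based browsers):
// Debug: Log tasks blocking the main thread for more than 50ms
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    console.log(`Long task: ${entry.duration.toFixed(0)}ms`, entry);
  }
}).observe({ type: 'longtask', buffered: true });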
JavaScript Optimization
- Heavy computation moved to Web Workers
- Large arrays processed in chunks with yielding
- `requestIdleCallback` used for non-critical work
- Bundle size minimized (code splitting)
- Tree shaking enabled
// ✅ GOOD: Yield to main thread
async function processItems(items: Item[]) {
for (const item of items) {
processItem(item);
// Yield so the browser can paint between items
if (typeof scheduler !== 'undefined' && 'yield' in scheduler) {
await scheduler.yield();
} else {
await new Promise((r) => setTimeout(r, 0));
}
}
}
React Optimization
- `useTransition` for non-urgent updates
- `useDeferredValue` for expensive derivations
- Memoization where appropriate (`useMemo`, `memo`)
- Virtualization for long lists (`react-window`, `@tanstack/virtual`)
- Suspense boundaries for code splitting
// ✅ GOOD: Non-blocking state updates
const [isPending, startTransition] = useTransition();
function handleSearch(query: string) {
setQuery(query); // Urgent: update input
startTransition(() => {
setFilteredResults(filter(query)); // Non-urgent: defer
});
}
Event Handler Optimization
- No heavy computation in event handlers
- Handlers don't cause layout thrashing
- Passive event listeners for scroll/touch
- Debounced input handlers where appropriate
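A sketch for the last two items (onScroll and runSearch are placeholder handlers):
// Passive listener: promises not to call preventDefault(), so scrolling never blocks on it
window.addEventListener('scroll', onScroll, { passive: true });
// Debounce: run expensive work only after input pauses
function debounce<T extends (...args: never[]) => void>(fn: T, ms: number) {
  let timer: ReturnType<typeof setTimeout>;
  return (...args: Parameters<T>) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), ms);
  };
}
const onSearchInput = debounce((value: string) => runSearch(value), 200);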
// ✅ GOOD: Defer heavy work
onClick={() => {
setLoading(true);
startTransition(() => {
const result = heavyComputation();
setResult(result);
setLoading(false);
});
}}
// ❌ BAD: Blocking handler
onClick={() => {
const result = heavyComputation(); // Blocks paint!
setResult(result);
}
Animation Performance
- Animations use `transform` and `opacity` only
- No animations on layout properties (width, height, top, left)
- `will-change` used sparingly
- Animations run at 60fps (checked in DevTools)
CLS (Cumulative Layout Shift) ≤ 0.1
Image Dimensions
- ALL images have explicit `width` and `height`
- Responsive images use `aspect-ratio` container
- `fill` prop images have sized container
- No images cause layout shift on load
// ✅ GOOD: Explicit dimensions
<img src="/photo.jpg" width={800} height={600} alt="Photo" />
// ✅ GOOD: Aspect ratio container
<div className="aspect-[16/9]">
<Image src="/photo.jpg" fill alt="Photo" />
</div>
Dynamic Content
- Space reserved for dynamic content (ads, embeds)
- Skeleton loaders match final content size
- No content inserted above existing content
- Lazy-loaded content has reserved space
// ✅ GOOD: Reserved space
<div className="min-h-[250px]">
{ad ? <Ad data={ad} /> : <Skeleton height={250} />}
</div>
Font Loading
- `font-display: optional` or `swap` used
- Fallback font has `size-adjust` to match
- Critical font preloaded
- System font stack as fallback
/* Fallback with size adjustment */
@font-face {
font-family: 'Inter Fallback';
src: local('Arial');
size-adjust: 107%;
ascent-override: 90%;
}
body {
font-family: 'Inter', 'Inter Fallback', sans-serif;
}
Animation Stability
- Animations use `transform`, not layout properties
- Expanding/collapsing uses `scaleY`, not `height`
- Modals/overlays don't shift page content
- Toast notifications positioned fixed/absolute
/* ✅ GOOD: Transform-based animation */
.drawer {
transform: translateX(-100%);
transition: transform 0.3s;
}
.drawer.open {
transform: translateX(0);
}
/* ❌ BAD: Layout-shifting animation */
.drawer {
width: 0;
transition: width 0.3s;
}
Iframes and Embeds
- Iframes have explicit dimensions
- Third-party embeds wrapped with sized container
- Lazy iframes have placeholder
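A sketch for the items above; the embed URL, dimensions, and ThirdPartyEmbed component are illustrative:
{/* Explicit dimensions reserve space before the iframe loads */}
<iframe
  src="https://www.youtube.com/embed/VIDEO_ID"
  width={560}
  height={315}
  loading="lazy"
  title="Product demo"
/>
{/* Or wrap third-party embeds in a sized container */}
<div className="aspect-video">
  <ThirdPartyEmbed />
</div>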
Measurement & Monitoring
Lab Testing
- Lighthouse CI in build pipeline
- Performance budgets enforced
- Regular manual Lighthouse audits
- Testing on throttled CPU/network
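A minimal lighthouserc.js sketch for the Lighthouse CI items (URLs and thresholds are examples to adapt):
// lighthouserc.js
module.exports = {
  ci: {
    collect: {
      url: ['http://localhost:3000/'],
      numberOfRuns: 3,
    },
    assert: {
      assertions: {
        // Fail the build when budgets are exceeded
        'categories:performance': ['error', { minScore: 0.9 }],
        'largest-contentful-paint': ['error', { maxNumericValue: 2500 }],
        'cumulative-layout-shift': ['error', { maxNumericValue: 0.1 }],
      },
    },
  },
};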
Field Data (RUM)
- `web-vitals` library installed
- Metrics sent to analytics endpoint
- p75 percentile tracked (Google's standard)
- Alerts configured for regressions
// Essential RUM setup
import { onLCP, onINP, onCLS } from 'web-vitals';
onLCP(sendToAnalytics);
onINP(sendToAnalytics);
onCLS(sendToAnalytics);
Data Analysis
- Dashboard showing daily/weekly trends
- Segmentation by page, device, connection
- Comparison of lab vs field data
- Week-over-week regression detection
Alerting
- Alert when p75 exceeds threshold
- Alert when good rate drops below 75%
- Alert on significant week-over-week regression
- Escalation path defined
Build & Deploy
Performance Budgets
- Bundle size limits configured
- Build fails on budget exceeded
- Per-route budgets for large apps
// webpack.config.js
module.exports = {
performance: {
maxAssetSize: 150000, // 150KB
maxEntrypointSize: 250000, // 250KB
hints: 'error', // Fail build
},
};
CI/CD Integration
- Lighthouse CI runs on PRs
- Performance regression blocks merge
- Bundle analyzer report generated
- Preview deployments for testing
CDN & Caching
- Static assets on CDN
- Immutable caching for hashed assets
- Stale-while-revalidate for HTML
- Edge caching where appropriate
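One way to express these caching rules, sketched as Next.js custom headers (CDNs can set the same Cache-Control values directly; paths and durations are typical starting points):
// next.config.js
module.exports = {
  async headers() {
    return [
      {
        // Hashed build assets never change: cache for a year, immutable
        source: '/_next/static/:path*',
        headers: [
          { key: 'Cache-Control', value: 'public, max-age=31536000, immutable' },
        ],
      },
      {
        // HTML: serve from edge cache, revalidate in the background
        source: '/:path*',
        headers: [
          { key: 'Cache-Control', value: 'public, s-maxage=60, stale-while-revalidate=300' },
        ],
      },
    ];
  },
};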
Debugging Checklist
Slow LCP
- Check TTFB (server response time)
- Verify LCP element has `fetchpriority="high"`
- Confirm LCP content is server-rendered
- Check for render-blocking resources
- Verify image is optimized and properly sized
High INP
- Run Performance recording during interaction
- Look for long tasks in flame chart
- Check for forced synchronous layouts
- Verify heavy work is deferred
- Check for excessive re-renders
High CLS
- Run Lighthouse with "Layout Shift Regions" enabled
- Check images for missing dimensions
- Look for late-loading content
- Verify fonts have fallbacks
- Check for content inserted above viewport
Testing Protocol
Before Deployment
- Lighthouse score ≥ 90 on Performance
- All Core Web Vitals in "good" range
- No performance budget violations
- Tested on throttled 4G + slow CPU
After Deployment
- Monitor RUM for 24-48 hours
- Compare p75 to pre-deployment baseline
- Check for unexpected regressions
- Verify alerting is working
Weekly Review
- Review p75 trends
- Identify worst-performing pages
- Check for new issues in CrUX
- Plan optimizations for next sprint
Image Checklist
Image Optimization Checklist
Comprehensive checklist for production-ready image optimization.
Format Selection
Photo Content
- Use AVIF as primary format (30-50% smaller than JPEG)
- Configure WebP as fallback for older browsers
- JPEG only for browsers without AVIF/WebP support
- Configure Next.js: `formats: ['image/avif', 'image/webp']`
Graphics & Icons
- SVG for logos, icons, and simple graphics
- PNG only when transparency is required
- Consider SVG sprites for icon sets (reduces requests)
- Inline small SVGs (< 1KB) to avoid network requests
Format Decision Tree
Is it a photo/complex image?
├── Yes → Use AVIF/WebP (Next.js Image handles this)
└── No → Is transparency needed?
├── Yes → PNG or SVG
└── No → Is it an icon/logo?
├── Yes → SVG (scalable, tiny file size)
└── No → AVIF/WebP
Dimensions & Sizing
Always Set Dimensions
- Every `<Image>` has `width` and `height` OR uses `fill`
- Fill mode images have sized container (relative + dimensions)
- Dimensions match actual display size (not larger)
- No CLS from images (Layout Shift score = 0)
// ✅ GOOD: Explicit dimensions
<Image src="/photo.jpg" width={800} height={600} />
// ✅ GOOD: Fill with sized container
<div className="relative h-[400px]">
<Image src="/photo.jpg" fill />
</div>
// ❌ BAD: Missing dimensions
<Image src="/photo.jpg" />
Responsive Images
- `sizes` prop set for all responsive images
- Sizes match actual layout breakpoints
- Don't serve images larger than needed
- Test with DevTools Network tab (check actual sizes served)
// ✅ GOOD: Accurate sizes prop
<Image
src="/photo.jpg"
fill
sizes="(max-width: 640px) 100vw, (max-width: 1024px) 50vw, 33vw"
/>
// Common sizes patterns:
// Full width hero: sizes="100vw"
// Half width on desktop: sizes="(max-width: 768px) 100vw, 50vw"
// Grid of 4: sizes="(max-width: 640px) 50vw, 25vw"
Loading Strategy
LCP Images (Above the Fold)
- Hero/banner image has `priority` prop
- ONLY one image per page has `priority` (usually LCP element)
- LCP image preloaded in `<head>` if not using Next.js Image
- No lazy loading on LCP images
// ✅ GOOD: Priority on LCP image
<Image src="/hero.jpg" priority fill sizes="100vw" />
// ❌ BAD: Priority on all images
{images.map(img => <Image src={img} priority />)} // Wrong!
Below-the-Fold Images
- Default lazy loading (Next.js Image default)
- No `priority` prop on non-LCP images
- Consider `loading="lazy"` for native `<img>` elements
- Use Intersection Observer for custom lazy loading
Preloading
- Critical hero image preloaded
- Don't preload below-fold images
- Use `fetchpriority="high"` for critical images
<link rel="preload" as="image" href="/hero.webp" fetchpriority="high" />
Placeholders
Blur Placeholders
- Static imports use `placeholder="blur"` (automatic)
- Remote images have `blurDataURL` generated
- Placeholder improves perceived performance
- Consider plaiceholder library for build-time generation
// ✅ Static import with automatic blur
import heroImage from '@/public/hero.jpg';
<Image src={heroImage} placeholder="blur" />
// ✅ Remote image with blur
<Image
src="https://cdn.example.com/photo.jpg"
placeholder="blur"
blurDataURL="data:image/jpeg;base64,..."
/>
Color Placeholders
- Consider dominant color placeholder for cards
- Skeleton placeholders for loading states
- Smooth transition from placeholder to image
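A sketch of a dominant-color placeholder behind a card image (the color value stands in for one extracted at build time; product is an assumed prop):
// Sized container painted with the image's dominant color
<div className="relative aspect-[4/3]" style={{ backgroundColor: '#8a7f72' }}>
  <Image src={product.image} alt={product.name} fill className="object-cover" />
</div>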
Quality Settings
Compression
- Quality set to 75-85 (not 100)
- Test quality visually - often 75 is indistinguishable
- Higher quality (85-90) only for hero/product images
- Lower quality (60-70) acceptable for thumbnails
// ✅ GOOD: Appropriate quality
<Image src="/hero.jpg" quality={85} /> // Important hero
<Image src="/thumbnail.jpg" quality={70} /> // Small thumbnail
// ❌ BAD: Unnecessary quality
<Image src="/photo.jpg" quality={100} /> // Huge file, no benefit
AVIF-Specific
- AVIF quality can be 10-15 points lower than JPEG
- Test AVIF vs WebP on your content type
- Some images compress better with WebP
CDN & Infrastructure
Next.js Configuration
- `remotePatterns` configured for all external domains
- `deviceSizes` matches your breakpoints
- `formats` includes AVIF and WebP
- `minimumCacheTTL` set appropriately (30+ days for static)
// next.config.js
images: {
formats: ['image/avif', 'image/webp'],
remotePatterns: [
{ hostname: 'cdn.example.com' },
{ hostname: '*.cloudinary.com' },
],
deviceSizes: [640, 750, 828, 1080, 1200, 1920],
minimumCacheTTL: 60 * 60 * 24 * 30, // 30 days
}
CDN Setup
- Images served from CDN (not origin server)
- Edge caching enabled
- Cache headers set correctly (1 year for hashed assets)
- `Vary: Accept` header for format negotiation
Self-Hosted
- Sharp installed: `npm install sharp`
- Docker image includes Sharp dependencies
- Adequate disk space for image cache
- Memory limits account for Sharp processing
Accessibility
Alt Text
- ALL images have `alt` attribute
- Meaningful alt for informative images
- Empty `alt=""` for decorative images
- Alt text describes content, not appearance
- No "image of" or "picture of" prefix
// ✅ GOOD: Meaningful alt
<Image src="/product.jpg" alt="Red Nike Air Max 90 running shoe, side view" />
// ✅ GOOD: Decorative image
<Image src="/decorative-pattern.svg" alt="" />
// ❌ BAD: Generic alt
<Image src="/product.jpg" alt="Image" />
// ❌ BAD: Missing alt
<Image src="/product.jpg" />
Additional A11y
- No text in images (use real text)
- Sufficient color contrast for overlaid text
- Images don't convey information unavailable in text
- Decorative images marked with `role="presentation"`
Performance Monitoring
Metrics to Track
- LCP (Largest Contentful Paint) < 2.5s
- CLS (Cumulative Layout Shift) = 0 for images
- Image load times in RUM data
- Total image bytes transferred
Debugging
- Check DevTools Network tab for actual sizes
- Verify format negotiation (AVIF/WebP served)
- Test on slow connections (DevTools throttling)
- Run Lighthouse for image recommendations
Error Handling
Fallbacks
- Fallback image configured for load errors
- Graceful degradation for broken images
- Error boundaries for image-heavy components
const [error, setError] = useState(false);
<Image
src={error ? '/fallback.jpg' : product.image}
onError={() => setError(true)}
/>
Monitoring
- Image errors logged to monitoring service
- Alerts for high error rates
- 404s for images tracked
Build Pipeline
Optimization
- Images optimized at build time (where possible)
- Source images stored at high resolution
- Build includes image processing (Sharp, Squoosh)
- CI validates image configurations
Version Control
- Large images in Git LFS (not regular Git)
- Or: Images stored externally (CMS, CDN)
- Build pulls images from source
Security
Content Security
- Only allow trusted image domains
- SVG sanitization if user-uploaded
- `dangerouslyAllowSVG: false` in production
- Rate limiting on image optimization endpoints
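The corresponding Next.js options, as a sketch (dangerouslyAllowSVG defaults to false; the contentSecurityPolicy value shown is the commonly recommended restrictive policy):
// next.config.js
module.exports = {
  images: {
    dangerouslyAllowSVG: false, // keep disabled for user-uploaded content
    contentSecurityPolicy: "default-src 'self'; script-src 'none'; sandbox;",
  },
};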
Privacy
- Strip EXIF metadata from user uploads
- No personally identifiable information in image URLs
- Consider image hashing for user content
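A sketch of EXIF stripping with Sharp: re-encoding drops metadata by default, and a bare .rotate() bakes in EXIF orientation before it is removed (sanitizeUpload is an illustrative name):
import sharp from 'sharp';

// Re-encode the upload without EXIF/GPS metadata
export async function sanitizeUpload(input: Buffer): Promise<Buffer> {
  return sharp(input)
    .rotate() // apply EXIF orientation so stripping it doesn't rotate the image
    .jpeg({ quality: 80 })
    .toBuffer();
}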
Inference Optimization
Inference Optimization Checklist
Performance validation for LLM inference.
vLLM Configuration
- Tensor parallelism configured for GPU count
- Max model length set appropriately
- GPU memory utilization optimized (0.85-0.95)
- Prefix caching enabled for shared contexts
- Continuous batching active
Quantization
- Quantization method selected:
- FP16: Maximum quality, baseline
- INT8/FP8: Balance quality/efficiency
- AWQ: Best 4-bit quality
- GPTQ: Faster quantization
- Calibration data used (for GPTQ)
- Quality validated post-quantization
Speculative Decoding
- Method selected:
- N-gram: No extra model, lower overhead
- Draft model: Higher quality speculation
- Speculative tokens tuned (3-5 typical)
- Throughput improvement validated
Hardware Utilization
- GPU memory fully utilized
- Multi-GPU scaling verified
- NVLink/PCIe bandwidth sufficient
- CPU not bottlenecking
Batching Strategy
- Continuous batching enabled
- Max batch size configured
- Request prioritization (if needed)
- Queue management configured
Caching
- KV cache optimized (PagedAttention)
- Prefix caching for shared prompts
- Response caching (semantic if applicable)
- Cache invalidation strategy
Benchmarking
- Baseline latency measured
- Throughput (tokens/sec) benchmarked
- Time to first token (TTFT) measured
- Latency under load tested
- Memory usage profiled
Production Readiness
- Warmup requests sent before traffic
- Health checks configured
- Graceful shutdown handling
- Request timeout configured
- Error recovery tested
Monitoring
- Latency metrics (p50, p95, p99)
- Throughput tracking
- GPU utilization monitoring
- Memory usage tracking
- Error rate alerting
Cost Optimization
- Instance size appropriate
- Spot instances (if applicable)
- Auto-scaling configured
- Usage patterns analyzed
- Cost per request tracked
Performance Audit Checklist
Comprehensive guide for identifying and fixing performance bottlenecks, based on OrchestKit's real optimization process.
Prerequisites
- Access to production metrics (Prometheus, Grafana)
- Profiling tools installed (py-spy, Chrome DevTools)
- Baseline performance metrics captured
- Test environment with production-like data
Phase 1: Establish Baselines
Backend Metrics
Capture current performance:
# Database query performance
psql -c "SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 20;"
# API latency
curl 'http://localhost:9090/api/v1/query?query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))'
# Cache hit rate
curl 'http://localhost:9090/api/v1/query?query=sum(rate(cache_operations_total{result="hit"}[5m])) / sum(rate(cache_operations_total[5m]))'
- Record p50/p95/p99 latency for all endpoints
- Document slow queries (>100ms)
- Measure cache hit rates
- Capture database connection pool usage
- Record LLM token usage and costs
Frontend Metrics
Run Lighthouse audit:
# Lighthouse CLI
lighthouse http://localhost:3000 \
--output json \
--output-path lighthouse-report.json
# Or use Chrome DevTools → Lighthouse tab
- Record Core Web Vitals (LCP, INP, CLS, TTFB)
- Measure bundle size (JS, CSS)
- Check for render-blocking resources
- Analyze long tasks (>50ms)
- Measure First Contentful Paint (FCP)
Baseline Targets
| Metric | Good | Needs Work | Current |
|---|---|---|---|
| p95 API latency | <500ms | <1s | ___ms |
| p95 DB query | <100ms | <500ms | ___ms |
| Cache hit rate | >70% | >50% | __% |
| LCP | <2.5s | <4s | ___s |
| INP | <200ms | <500ms | ___ms |
| CLS | <0.1 | <0.25 | ___ |
| Bundle size | <300KB | <500KB | ___KB |
Phase 2: Identify Bottlenecks
Backend Profiling
1. Find Slow Endpoints
# Top 10 slowest endpoints (p95 latency)
topk(10,
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
) by (endpoint)
)
- List endpoints with p95 > 500ms
- Prioritize by traffic volume (high traffic = high impact)
- Document expected vs actual latency
2. Identify Slow Database Queries
-- Top 10 slowest queries
SELECT
LEFT(query, 80) as query_preview,
calls,
ROUND(mean_exec_time::numeric, 2) as avg_ms,
ROUND(total_exec_time::numeric, 2) as total_ms,
ROUND(100.0 * shared_blks_hit / NULLIF(shared_blks_hit + shared_blks_read, 0), 2) as cache_hit_ratio
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
- Run EXPLAIN ANALYZE on slow queries
- Check for sequential scans (should use indexes)
- Look for low cache hit ratios (<90%)
- Identify N+1 query patterns
3. Python Profiling with py-spy
# Profile running FastAPI server
py-spy record --pid $(pgrep -f uvicorn) \
--output profile.svg \
--duration 60
# Top functions by time
py-spy top --pid $(pgrep -f uvicorn)
- Generate flame graph
- Identify hot paths (wide bars = time spent)
- Look for unexpected CPU usage
- Check for blocking I/O in async code
4. LLM Cost Analysis
-- Cost breakdown by model (Langfuse)
SELECT
model,
COUNT(*) as calls,
SUM(input_tokens) as total_input,
SUM(output_tokens) as total_output,
SUM(calculated_total_cost) as total_cost
FROM langfuse.traces
WHERE timestamp > NOW() - INTERVAL '7 days'
GROUP BY model
ORDER BY total_cost DESC;
- Identify most expensive models
- Calculate cache hit rate potential
- Find repetitive queries (caching candidates)
- Measure prompt token waste
Frontend Profiling
1. Chrome DevTools Performance Tab
- Record 6s of user interaction
- Identify long tasks (yellow bars >50ms)
- Check for dropped frames (should be 60fps)
- Measure main thread blocking time
2. React DevTools Profiler
// Add Profiler to key components
import { Profiler } from 'react';
function onRenderCallback(
id, phase, actualDuration, baseDuration
) {
if (actualDuration > 16) {
console.warn(`Slow render: ${id} took ${actualDuration}ms`);
}
}
<Profiler id="AnalysisCard" onRender={onRenderCallback}>
<AnalysisCard />
</Profiler>
- Find components with >16ms render time
- Identify unnecessary re-renders
- Check for missing memoization
3. Bundle Analysis
# Vite
npm run build
npx vite-bundle-visualizer
# Next.js
ANALYZE=true npm run build
- Identify largest chunks
- Find duplicate dependencies
- Check for tree-shaking failures
- Measure code splitting effectiveness
Phase 3: Database Optimization
Add Missing Indexes
1. Identify Missing Indexes
-- Find sequential scans that should use indexes
SELECT
schemaname,
tablename,
seq_scan,
idx_scan,
seq_scan - idx_scan as too_much_seq
FROM pg_stat_user_tables
WHERE seq_scan - idx_scan > 0
ORDER BY too_much_seq DESC
LIMIT 10;
- Run EXPLAIN ANALYZE on slow queries
- Look for "Seq Scan" in query plans
- Identify columns in WHERE/JOIN clauses
- Create indexes for high-cardinality columns
2. Create Indexes
-- B-tree for exact matches and ranges
CREATE INDEX idx_analysis_status ON analyses(status);
CREATE INDEX idx_analysis_created ON analyses(created_at DESC);
-- GIN for full-text search
CREATE INDEX idx_chunk_tsvector ON chunks USING GIN(content_tsvector);
-- HNSW for vector similarity (pgvector)
CREATE INDEX idx_chunk_embedding ON chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Composite index for common filter combinations
CREATE INDEX idx_chunk_analysis_created ON chunks(analysis_id, created_at DESC);
- Create indexes for WHERE clause columns
- Use composite indexes for multi-column filters
- Add indexes for JOIN columns
- Use CONCURRENTLY for production
- Verify indexes are used (EXPLAIN ANALYZE)
Index Selection Guide:
| Query Pattern | Index Type | Example |
|---|---|---|
| Exact match | B-tree | WHERE status = 'completed' |
| Range query | B-tree | WHERE created_at > '2025-01-01' |
| Full-text search | GIN | WHERE content_tsvector @@ query |
| Vector similarity | HNSW | ORDER BY embedding <=> query_vec |
| JSONB queries | GIN | WHERE metadata @> '{"key": "value"}' |
Fix N+1 Queries
1. Detect N+1 Patterns
# ❌ BAD: N+1 query (1 query + N queries in loop)
analyses = await session.execute(select(Analysis).limit(10))
for analysis in analyses.scalars():
# Each iteration = 1 query!
chunks = await session.execute(
select(Chunk).where(Chunk.analysis_id == analysis.id)
)
- Review logs for rapid sequential queries
- Check for queries inside loops
- Use query count logging in tests
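One way to make the query-count item concrete: count statements with a SQLAlchemy event hook and assert on the count in tests. A minimal sketch, assuming an async engine; `engine` and the helper being tested are assumptions:
from sqlalchemy import event

class QueryCounter:
    """Context manager that counts SQL statements executed by an engine."""
    def __init__(self, engine):
        self.count = 0
        self._engine = engine.sync_engine  # async engines wrap a sync engine

    def _on_execute(self, conn, cursor, statement, parameters, context, executemany):
        self.count += 1

    def __enter__(self):
        event.listen(self._engine, "before_cursor_execute", self._on_execute)
        return self

    def __exit__(self, *exc):
        event.remove(self._engine, "before_cursor_execute", self._on_execute)

# Usage in a test (hypothetical helper): fail if 10 analyses need more than 2 queries.
# with QueryCounter(engine) as qc:
#     await list_analyses_with_chunks(limit=10)
# assert qc.count <= 2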
2. Fix with Eager Loading
# ✅ GOOD: Single query with JOIN
from sqlalchemy.orm import selectinload
result = await session.execute(
    select(Analysis)
    .options(selectinload(Analysis.chunks))  # Eager load
    .limit(10)
)
analyses = result.scalars().all()
# Now analyses[0].chunks is preloaded (no extra query)
- Replace lazy loading with eager loading
- Use selectinload() for one-to-many
- Use joinedload() for one-to-one
- Verify query count reduced (N+1 → 1-2 queries)
Optimize Connection Pooling
1. Check Current Pool Usage
# Connection pool saturation
db_connections_active / db_connections_max
- Measure active vs max connections
- Check for pool exhaustion (ratio >0.8)
- Monitor connection wait times
2. Configure Pool
# backend/app/core/config.py
from sqlalchemy import create_engine
engine = create_engine(
database_url,
pool_size=5, # Connections to maintain
max_overflow=10, # Extra connections allowed
pool_recycle=3600, # Recycle after 1 hour
pool_pre_ping=True # Validate before checkout
)
- Set pool_size based on traffic (5-20 typical)
- Allow overflow for spikes
- Enable pool_pre_ping for stale detection
- Set pool_recycle to avoid timeouts
Phase 4: Caching Strategy
Identify Caching Opportunities
1. Find Repetitive Queries
-- Most frequently called queries
SELECT
LEFT(query, 80),
calls,
ROUND(mean_exec_time::numeric, 2) as avg_ms
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 20;
- Identify high-frequency queries
- Check if data changes frequently
- Calculate potential savings (calls × avg_time)
2. Find Repetitive LLM Calls
-- Similar prompts (Langfuse)
SELECT
LEFT(input::text, 100) as prompt_preview,
COUNT(*) as occurrences,
SUM(calculated_total_cost) as total_cost
FROM langfuse.generations
GROUP BY LEFT(input::text, 100)
HAVING COUNT(*) > 5
ORDER BY total_cost DESC;
- Identify repetitive prompts
- Calculate cost savings potential
- Determine appropriate cache TTL
Implement Multi-Level Cache
L1: In-Memory Cache (Application)
from functools import lru_cache
@lru_cache(maxsize=128)
def get_agent_system_prompt(agent_type: str) -> str:
"""Cache agent prompts in memory."""
return load_prompt_from_file(f"prompts/{agent_type}.txt")
- Cache static data (prompts, configs)
- Use LRU cache for bounded memory
- Set appropriate maxsize (128-1024)
L2: Redis Cache (Distributed)
async def get_analysis(analysis_id: str) -> Analysis:
"""Cache analysis results in Redis."""
# Try cache first
cached = await redis.get(f"analysis:{analysis_id}")
if cached:
return Analysis.parse_raw(cached)
# Cache miss - fetch from DB
analysis = await db.get_analysis(analysis_id)
# Store in cache (5 min TTL)
await redis.setex(
f"analysis:{analysis_id}",
300,
analysis.json()
)
return analysis
- Cache query results
- Set appropriate TTL (seconds to hours)
- Invalidate on writes
- Track cache hit rate
L3: Semantic Cache (Vector Search)
async def get_llm_response(query: str) -> str:
"""Check semantic cache before calling LLM."""
# Generate query embedding
embedding = await embed_text(query)
# Search for similar cached queries
cached = await semantic_cache.search(embedding, threshold=0.92)
if cached:
return cached.response
# Call LLM
response = await llm.complete(query)
# Store in cache
await semantic_cache.store(embedding, response)
return response
- Cache LLM responses by semantic similarity
- Set similarity threshold (0.90-0.95)
- Measure cost savings
- Monitor false positive rate
Cache Invalidation
Write-Through Pattern:
async def update_analysis(analysis: Analysis):
"""Update DB and cache atomically."""
# 1. Write to DB
await db.update(analysis)
# 2. Update cache
await redis.setex(
f"analysis:{analysis.id}",
300,
analysis.json()
)
- Invalidate cache on writes
- Use TTL for time-sensitive data
- Add cache versioning for schema changes
Phase 5: Frontend Optimization
Code Splitting
1. Route-Based Splitting
// Before: All routes in one bundle
import AnalysisPage from './pages/AnalysisPage';
import DashboardPage from './pages/DashboardPage';
// After: Lazy load routes
const AnalysisPage = lazy(() => import('./pages/AnalysisPage'));
const DashboardPage = lazy(() => import('./pages/DashboardPage'));
<Suspense fallback={<Loading />}>
<Routes>
<Route path="/analysis" element={<AnalysisPage />} />
<Route path="/dashboard" element={<DashboardPage />} />
</Routes>
</Suspense>
- Lazy load routes
- Add loading states
- Measure bundle size reduction
2. Component-Level Splitting
// Lazy load heavy components
const ChartComponent = lazy(() => import('./ChartComponent'));
{showChart && (
<Suspense fallback={<Skeleton />}>
<ChartComponent data={data} />
</Suspense>
)}
- Split large dependencies (charts, editors)
- Use dynamic imports for modals
- Prefetch on user intent (hover, focus)
Memoization
React.memo for Components:
// Prevent re-renders when props unchanged
const AnalysisCard = memo(({ analysis }: Props) => {
return <div>{analysis.title}</div>;
});
- Wrap expensive components with memo()
- Verify props don't change unnecessarily
- Use React DevTools Profiler to confirm
useMemo for Expensive Calculations:
const expensiveValue = useMemo(() => {
return processLargeDataset(data);
}, [data]); // Only recompute if data changes
- Memoize expensive calculations
- Memoize filtered/sorted arrays
- Don't over-memoize (profiling first!)
useCallback for Event Handlers:
const handleClick = useCallback(() => {
doSomething(id);
}, [id]); // Only recreate if id changes
<ChildComponent onClick={handleClick} />
- Wrap callbacks passed to memoized children
- Avoid inline functions in props
- Include all dependencies
Image Optimization
// Use next/image or similar for optimization
<Image
src="/photo.jpg"
alt="Description"
width={800}
height={600}
loading="lazy" // Lazy load images
placeholder="blur" // Show blur while loading
/>
- Use WebP/AVIF formats
- Lazy load images below the fold
- Set explicit width/height (prevent CLS)
- Use responsive images (srcset)
Phase 6: Measure Impact
Re-Run Benchmarks
Backend:
# Query performance
psql -c "SELECT query, mean_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;"
# API latency
curl 'http://localhost:9090/api/v1/query?query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))'
Frontend:
lighthouse http://localhost:3000 --output json
- Compare p95 latency (before vs after)
- Verify query performance improved
- Check cache hit rates increased
- Measure Core Web Vitals improvement
Calculate Savings
Cost Savings:
# LLM cost reduction
baseline_cost = 35000 # Annual
cache_hit_rate = 0.90
savings = baseline_cost * cache_hit_rate * 0.90 # 90% discount on cache hits
final_cost = baseline_cost - savings
Performance Gains:
# Query speedup
before_latency = 85 # ms
after_latency = 5 # ms
speedup = before_latency / after_latency # 17x
- Document cost savings
- Calculate ROI (savings vs implementation time)
- Measure user experience improvement
Create Performance Budget
Set ongoing targets:
- p95 API latency < 500ms
- p95 DB query < 100ms
- Cache hit rate > 70%
- LCP < 2.5s
- Bundle size < 300KB
Monitor continuously:
- Add Lighthouse CI to pipeline
- Alert on budget violations
- Review metrics weekly
Phase 7: Ongoing Optimization
Weekly Reviews
- Review top 10 slowest endpoints
- Check for new slow queries
- Monitor cache hit rates
- Review LLM cost trends
- Check Core Web Vitals in RUM
Monthly Audits
- Run full Lighthouse audit
- Profile with py-spy/Chrome DevTools
- Review database index usage
- Check for unused dependencies
- Update performance budget
Continuous Monitoring
- Set up alerts for degradation
- Track performance in CI/CD
- Monitor real user metrics (RUM)
- A/B test optimizations
References
- Example: ../examples/orchestkit-performance-wins.md
- Template: ../scripts/caching-patterns.ts
- Template: ../scripts/database-optimization.ts
- Lighthouse Documentation
- PostgreSQL EXPLAIN
Render Audit
React Performance Audit Checklist
Pre-deployment performance verification.
React Compiler Check
- React Compiler enabled in build config
- Components show "Memo ✨" badge in DevTools
- Code follows Rules of React:
- Components are idempotent
- Props/state treated as immutable
- Side effects in useEffect only
- Hooks at top level
Render Performance
- No unnecessary re-renders (verified with Profiler)
- State colocated close to usage
- Context split to prevent cascading updates
- Expensive computations have escape hatch memoization
- Lists > 100 items are virtualized
Large Lists / Data
- TanStack Virtual for lists > 100 items
- Pagination or infinite scroll for API data
- Table virtualization for grids > 50 rows
- Images lazy loaded below fold
Code Splitting
- Route-based code splitting (lazy routes)
- Heavy components lazy loaded
- Dynamic imports for large libraries
- Bundle analyzer run, no unexpected large chunks
Network Performance
- API calls deduplicated (React Query, SWR)
- Data prefetched on hover/intent
- Optimistic updates for mutations
- Appropriate cache headers set
Images & Media
- Images optimized (WebP, AVIF)
- Responsive images with srcset
- Lazy loading for below-fold images
- Placeholder/skeleton during load
Third-Party Scripts
- Analytics loaded async/deferred
- Third-party widgets lazy loaded
- Font loading optimized (preload critical)
- No render-blocking resources
Profiling Verification
Before Optimization
- Record baseline interaction times
- Document slowest components
- Note current bundle size
After Optimization
- Re-profile all interactions
- Verify improvements in numbers
- Check bundle size delta
Key Metrics to Track
| Metric | Target | Current |
|---|---|---|
| LCP (Largest Contentful Paint) | < 2.5s | ___ |
| INP (Interaction to Next Paint) | < 200ms | ___ |
| CLS (Cumulative Layout Shift) | < 0.1 | ___ |
| Time to Interactive | < 3s | ___ |
| Main thread blocking | < 200ms | ___ |
Quick Profiler Commands
# React DevTools Profiler
# 1. Open DevTools → Profiler tab
# 2. Click Record
# 3. Perform interaction
# 4. Click Stop
# 5. Analyze flamegraph
# Lighthouse
npx lighthouse http://localhost:3000 --view
# Bundle Analyzer (Next.js)
ANALYZE=true npm run build
# Bundle Analyzer (Vite)
npx vite-bundle-visualizer
Common Issues Checklist
- No anonymous functions as props in hot paths
- No object/array literals as props in hot paths
- Context providers near consumers
- useEffect dependencies correct
- No state updates in render
Sign-Off
- All critical interactions < 100ms
- No visible jank during scroll
- Page load acceptable on 3G
- Bundle size within budget
- Performance regression tests in CI
Examples (3)
Cwv Examples
Core Web Vitals Examples
Real-world optimization examples for LCP, INP, and CLS.
1. LCP Optimization: E-Commerce Hero Section
Complete optimization of a hero section with product image and CTA.
Before: Slow LCP (3.5s+)
// ❌ BAD: Multiple LCP issues
function Hero() {
const [product, setProduct] = useState(null);
useEffect(() => {
// Problem 1: LCP content loaded client-side
fetch('/api/featured-product')
.then(res => res.json())
.then(setProduct);
}, []);
if (!product) return <div className="h-[600px]" />; // Problem 2: No skeleton
return (
<div className="relative">
{/* Problem 3: No priority, lazy by default */}
<img src={product.image} alt={product.name} />
<h1>{product.name}</h1>
<a href={`/product/${product.id}`}>Shop Now</a>
</div>
);
}
After: Optimized LCP (1.2s)
// ✅ GOOD: Server-rendered with optimized image
import Image from 'next/image';
import { Suspense } from 'react';
// Server Component - data fetched on server
async function Hero() {
// Fetched on server, included in initial HTML
const product = await getFeaturedProduct();
return (
<section className="relative h-[600px] overflow-hidden">
{/* Priority image with explicit dimensions */}
<Image
src={product.image}
alt={product.name}
fill
priority // Preloads, eager loading
sizes="100vw"
quality={85}
placeholder="blur"
blurDataURL={product.blurPlaceholder}
style={{ objectFit: 'cover' }}
/>
{/* Content overlay */}
<div className="relative z-10 flex flex-col items-center justify-center h-full text-white">
<h1 className="text-5xl font-bold">{product.name}</h1>
<p className="mt-4 text-xl">{product.tagline}</p>
<a
href={`/product/${product.id}`}
className="mt-8 px-8 py-4 bg-white text-black rounded-lg font-semibold"
>
Shop Now
</a>
</div>
</section>
);
}
// Loading skeleton for Suspense boundary
function HeroSkeleton() {
return (
<section className="relative h-[600px] bg-gray-200 animate-pulse">
<div className="flex flex-col items-center justify-center h-full">
<div className="h-12 w-64 bg-gray-300 rounded" />
<div className="mt-4 h-6 w-48 bg-gray-300 rounded" />
<div className="mt-8 h-14 w-40 bg-gray-300 rounded-lg" />
</div>
</section>
);
}
// Usage in page
export default function HomePage() {
return (
<Suspense fallback={<HeroSkeleton />}>
<Hero />
</Suspense>
);
}
// Also add preload in head (layout.tsx or page metadata)
export const metadata = {
other: {
'link': [
{
rel: 'preload',
as: 'image',
href: '/featured-product-hero.webp',
fetchpriority: 'high',
},
],
},
};
Document Head Optimizations
<!-- Add to <head> for fastest LCP -->
<head>
<!-- Preload hero image -->
<link rel="preload" as="image" href="/hero.webp" fetchpriority="high" />
<!-- Preload critical font -->
<link
rel="preload"
as="font"
href="/fonts/inter-bold.woff2"
type="font/woff2"
crossorigin
/>
<!-- Preconnect to image CDN -->
<link rel="preconnect" href="https://images.example.com" />
<!-- DNS prefetch for analytics -->
<link rel="dns-prefetch" href="https://analytics.example.com" />
</head>
2. INP Optimization: Product Search Filter
Optimizing a search filter that was causing 400ms+ INP.
Before: Blocking INP (400ms+)
// ❌ BAD: Blocks main thread on every keystroke
function ProductSearch({ products }: { products: Product[] }) {
const [query, setQuery] = useState('');
const [results, setResults] = useState(products);
const handleChange = (e: ChangeEvent<HTMLInputElement>) => {
const value = e.target.value;
setQuery(value);
// Problem: Expensive filter runs synchronously
// Blocks paint until complete
const filtered = products.filter(p =>
p.name.toLowerCase().includes(value.toLowerCase()) ||
p.description.toLowerCase().includes(value.toLowerCase()) ||
p.tags.some(t => t.toLowerCase().includes(value.toLowerCase()))
);
setResults(filtered);
};
return (
<>
<input
value={query}
onChange={handleChange}
placeholder="Search products..."
/>
<ProductGrid products={results} />
</>
);
}
After: Responsive INP (50ms)
// ✅ GOOD: Non-blocking with useDeferredValue
import {
useState,
useDeferredValue,
useMemo,
useTransition,
memo
} from 'react';
function ProductSearch({ products }: { products: Product[] }) {
const [query, setQuery] = useState('');
const [isPending, startTransition] = useTransition();
// Deferred value for expensive computation
const deferredQuery = useDeferredValue(query);
const isStale = query !== deferredQuery;
// Memoized filter only runs when deferredQuery changes
const results = useMemo(() => {
if (!deferredQuery) return products;
const searchLower = deferredQuery.toLowerCase();
return products.filter(p =>
p.name.toLowerCase().includes(searchLower) ||
p.description.toLowerCase().includes(searchLower) ||
p.tags.some(t => t.toLowerCase().includes(searchLower))
);
}, [products, deferredQuery]);
return (
<div>
<div className="relative">
<input
value={query}
onChange={(e) => setQuery(e.target.value)}
placeholder="Search products..."
className="w-full px-4 py-2 border rounded-lg"
/>
{/* Loading indicator during filter */}
{isPending && (
<div className="absolute right-3 top-1/2 -translate-y-1/2">
<Spinner size="sm" />
</div>
)}
</div>
{/* Fade during pending state */}
<div
className="mt-4 transition-opacity"
style={{ opacity: isStale ? 0.7 : 1 }}
>
<ProductGrid products={results} />
</div>
</div>
);
}
// Memoized grid to prevent unnecessary re-renders
const ProductGrid = memo(function ProductGrid({
products
}: {
products: Product[]
}) {
return (
<div className="grid grid-cols-4 gap-4">
{products.map(product => (
<ProductCard key={product.id} product={product} />
))}
</div>
);
});
For Very Large Lists: Virtual Scrolling
// ✅ BEST: Virtualization for huge lists
import { useVirtualizer } from '@tanstack/react-virtual';
function VirtualizedProductList({ products }: { products: Product[] }) {
const parentRef = useRef<HTMLDivElement>(null);
const virtualizer = useVirtualizer({
count: products.length,
getScrollElement: () => parentRef.current,
estimateSize: () => 200, // Estimated row height
overscan: 5, // Render 5 extra items above/below
});
return (
<div
ref={parentRef}
className="h-[600px] overflow-auto"
>
<div
style={{
height: `${virtualizer.getTotalSize()}px`,
width: '100%',
position: 'relative',
}}
>
{virtualizer.getVirtualItems().map((virtualRow) => (
<div
key={virtualRow.key}
style={{
position: 'absolute',
top: 0,
left: 0,
width: '100%',
height: `${virtualRow.size}px`,
transform: `translateY(${virtualRow.start}px)`,
}}
>
<ProductCard product={products[virtualRow.index]} />
</div>
))}
</div>
</div>
);
}
3. CLS Optimization: News Article Page
Fixing layout shifts from images, ads, and fonts.
Before: High CLS (0.35)
// ❌ BAD: Multiple CLS issues
function Article({ article }: { article: Article }) {
const [ad, setAd] = useState(null);
useEffect(() => {
loadAd().then(setAd);
}, []);
return (
<article>
<h1>{article.title}</h1>
{/* Problem 1: Image without dimensions */}
<img src={article.heroImage} alt="" />
{/* Problem 2: Ad appears after load, shifts content */}
{ad && <div className="ad-banner"><img src={ad.image} /></div>}
<div dangerouslySetInnerHTML={{ __html: article.content }} />
{/* Problem 3: Related articles load and shift */}
<RelatedArticles />
</article>
);
}
// Problem 4: Font causes layout shift
// CSS
/* No font-display, no fallback sizing */
@font-face {
font-family: 'CustomFont';
src: url('/font.woff2');
}
After: Zero CLS (0.0)
// ✅ GOOD: All layout shifts prevented
import Image from 'next/image';
function Article({ article }: { article: Article }) {
return (
<article className="max-w-3xl mx-auto">
<h1 className="text-4xl font-bold">{article.title}</h1>
{/* Fixed dimensions prevent shift */}
<div className="relative aspect-[16/9] my-6">
<Image
src={article.heroImage}
alt={article.heroAlt}
fill
sizes="(max-width: 768px) 100vw, 768px"
priority
style={{ objectFit: 'cover' }}
/>
</div>
{/* Reserved space for ad */}
<AdSlot
slot="article-top"
className="my-6"
minHeight={250}
/>
<div
className="prose prose-lg"
dangerouslySetInnerHTML={{ __html: article.content }}
/>
{/* Reserved space for related */}
<RelatedArticles articleId={article.id} />
</article>
);
}
// Ad component with reserved space
function AdSlot({
slot,
className,
minHeight
}: {
slot: string;
className?: string;
minHeight: number;
}) {
const [ad, setAd] = useState<Ad | null>(null);
const [loaded, setLoaded] = useState(false);
useEffect(() => {
loadAd(slot).then(ad => {
setAd(ad);
setLoaded(true);
});
}, [slot]);
return (
<div
className={className}
style={{ minHeight: `${minHeight}px` }} // Reserved space
>
{loaded ? (
ad ? (
<Image
src={ad.image}
alt={ad.alt}
width={ad.width}
height={ad.height}
/>
) : null // No ad, space collapses gracefully
) : (
<Skeleton height={minHeight} /> // Placeholder during load
)}
</div>
);
}
// Related articles with skeleton
function RelatedArticles({ articleId }: { articleId: string }) {
const [articles, setArticles] = useState<Article[] | null>(null);
useEffect(() => {
fetchRelated(articleId).then(setArticles);
}, [articleId]);
return (
<section className="mt-12">
<h2 className="text-2xl font-bold mb-6">Related Articles</h2>
{/* Fixed grid prevents shift */}
<div className="grid grid-cols-3 gap-6">
{articles ? (
articles.map(article => (
<ArticleCard key={article.id} article={article} />
))
) : (
// Skeleton matches final layout exactly
<>
<ArticleCardSkeleton />
<ArticleCardSkeleton />
<ArticleCardSkeleton />
</>
)}
</div>
</section>
);
}
// Skeleton that matches card dimensions exactly
function ArticleCardSkeleton() {
return (
<div className="animate-pulse">
<div className="aspect-[16/9] bg-gray-200 rounded-lg" />
<div className="mt-3 h-5 bg-gray-200 rounded w-3/4" />
<div className="mt-2 h-4 bg-gray-200 rounded w-1/2" />
</div>
);
}
Font Loading Without CLS
/* ✅ Optimized font loading */
/* Main font with swap and metrics */
@font-face {
font-family: 'Inter';
src: url('/fonts/inter-var.woff2') format('woff2');
font-display: swap;
font-weight: 100 900;
}
/* Fallback font with matched metrics */
@font-face {
font-family: 'Inter Fallback';
src: local('Arial');
size-adjust: 107.64%;
ascent-override: 90%;
descent-override: 22.43%;
line-gap-override: 0%;
}
body {
font-family: 'Inter', 'Inter Fallback', system-ui, sans-serif;
}
/* Alternative: font-display: optional for non-critical fonts */
@font-face {
font-family: 'DisplayFont';
src: url('/fonts/display.woff2') format('woff2');
font-display: optional; /* Won't cause FOUT - uses fallback if not cached */
}
4. Complete RUM Implementation
Full Real User Monitoring setup with Next.js.
// lib/performance.ts
import { onCLS, onINP, onLCP, onFCP, onTTFB, type Metric } from 'web-vitals';
const ENDPOINT = '/api/vitals';
interface EnrichedMetric {
name: string;
value: number;
rating: 'good' | 'needs-improvement' | 'poor';
delta: number;
id: string;
navigationType: string;
url: string;
timestamp: number;
connection?: string;
deviceMemory?: number;
viewport: { width: number; height: number };
}
function getConnectionInfo() {
const nav = navigator as Navigator & {
connection?: { effectiveType?: string };
deviceMemory?: number;
};
return {
connection: nav.connection?.effectiveType,
deviceMemory: nav.deviceMemory,
};
}
function sendMetric(metric: Metric) {
const enriched: EnrichedMetric = {
name: metric.name,
value: metric.value,
rating: metric.rating,
delta: metric.delta,
id: metric.id,
navigationType: metric.navigationType,
url: window.location.href,
timestamp: Date.now(),
...getConnectionInfo(),
viewport: {
width: window.innerWidth,
height: window.innerHeight,
},
};
// Use sendBeacon for reliability
if (navigator.sendBeacon) {
navigator.sendBeacon(ENDPOINT, JSON.stringify(enriched));
} else {
fetch(ENDPOINT, {
method: 'POST',
body: JSON.stringify(enriched),
keepalive: true,
});
}
// Debug in development
if (process.env.NODE_ENV === 'development') {
const color = {
good: 'green',
'needs-improvement': 'orange',
poor: 'red',
}[metric.rating];
console.log(
`%c[${metric.name}] ${metric.value.toFixed(1)}${metric.name === 'CLS' ? '' : 'ms'}`,
`color: ${color}; font-weight: bold`
);
}
}
export function initWebVitals() {
onCLS(sendMetric);
onINP(sendMetric);
onLCP(sendMetric);
onFCP(sendMetric);
onTTFB(sendMetric);
}
// app/components/web-vitals.tsx
'use client';
import { useEffect } from 'react';
import { initWebVitals } from '@/lib/performance';
export function WebVitals() {
useEffect(() => {
initWebVitals();
}, []);
return null;
}
// app/api/vitals/route.ts
import { NextRequest, NextResponse } from 'next/server';
interface VitalMetric {
name: string;
value: number;
rating: string;
url: string;
timestamp: number;
}
export async function POST(request: NextRequest) {
const metric: VitalMetric = await request.json();
// Log for debugging
console.log('[Vital]', metric.name, metric.value, metric.rating);
// Store in database (example with Drizzle)
// await db.insert(webVitals).values({
// name: metric.name,
// value: metric.value,
// rating: metric.rating,
// url: metric.url,
// timestamp: new Date(metric.timestamp),
// });
// Alert on poor metrics
if (metric.rating === 'poor') {
// await alertService.send({
// severity: 'warning',
// message: `Poor ${metric.name}: ${metric.value} on ${metric.url}`,
// });
}
return NextResponse.json({ ok: true });
}
5. Performance Budget Enforcement
CI/CD integration with Lighthouse CI.
lighthouserc.js
module.exports = {
ci: {
collect: {
url: [
'http://localhost:3000/',
'http://localhost:3000/products',
'http://localhost:3000/checkout',
],
numberOfRuns: 3,
settings: {
preset: 'desktop',
// Throttle to simulate 4G
// throttling: { ... }
},
},
assert: {
assertions: {
// Core Web Vitals
'largest-contentful-paint': ['error', { maxNumericValue: 2500 }],
'cumulative-layout-shift': ['error', { maxNumericValue: 0.1 }],
'total-blocking-time': ['error', { maxNumericValue: 200 }], // Proxy for INP
// Other performance metrics
'first-contentful-paint': ['warn', { maxNumericValue: 1800 }],
'speed-index': ['warn', { maxNumericValue: 3400 }],
// Resource budgets
'resource-summary:script:size': ['error', { maxNumericValue: 150000 }],
'resource-summary:image:size': ['error', { maxNumericValue: 300000 }],
'resource-summary:total:size': ['error', { maxNumericValue: 500000 }],
// Scores
'categories:performance': ['error', { minScore: 0.9 }],
'categories:accessibility': ['error', { minScore: 0.9 }],
},
},
upload: {
target: 'temporary-public-storage',
},
},
};
GitHub Actions Workflow
# .github/workflows/lighthouse.yml
name: Lighthouse CI
on:
pull_request:
branches: [main]
jobs:
lighthouse:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Node
uses: actions/setup-node@v4
with:
node-version: '20'
- name: Install dependencies
run: npm ci
- name: Build
run: npm run build
- name: Start server
run: npm start &
- name: Wait for server
run: npx wait-on http://localhost:3000
- name: Run Lighthouse CI
run: |
npm install -g @lhci/cli
lhci autorun
env:
LHCI_GITHUB_APP_TOKEN: ${{ secrets.LHCI_GITHUB_APP_TOKEN }}
- name: Upload results
uses: actions/upload-artifact@v4
with:
name: lighthouse-results
path: .lighthouseci/
Quick Reference
// ✅ LCP: Server-render, preload, priority
export default async function Page() {
const data = await getData(); // Server-side
return <Image src={data.hero} priority fill />;
}
// ✅ INP: useTransition for expensive updates
const [isPending, startTransition] = useTransition();
onChange={(e) => {
setQuery(e.target.value);
startTransition(() => setResults(filter(e.target.value)));
}}
// ✅ CLS: Always set dimensions
<Image src="/photo.jpg" width={800} height={600} />
<div className="aspect-[16/9]"><Image fill /></div>
<div className="min-h-[250px]">{content}</div>
// ✅ RUM: Send metrics reliably
navigator.sendBeacon('/api/vitals', JSON.stringify(metric));
// ✅ Debug: Find LCP element
new PerformanceObserver((list) => {
console.log('LCP:', list.getEntries().at(-1)?.element);
}).observe({ type: 'largest-contentful-paint', buffered: true });
Image Examples
Image Optimization Examples
Hero Image with Blur Placeholder
import Image from 'next/image';
import heroImage from '@/public/hero.jpg'; // Static import
function Hero() {
return (
<div className="relative h-[600px] w-full">
<Image
src={heroImage}
alt="Beautiful landscape"
fill
priority
placeholder="blur" // Automatic with static import
sizes="100vw"
style={{ objectFit: 'cover' }}
/>
<div className="absolute inset-0 flex items-center justify-center">
<h1 className="text-5xl font-bold text-white">Welcome</h1>
</div>
</div>
);
}
Product Grid with Responsive Sizes
function ProductGrid({ products }) {
return (
<div className="grid grid-cols-2 md:grid-cols-3 lg:grid-cols-4 gap-4">
{products.map((product) => (
<div key={product.id} className="relative aspect-square">
<Image
src={product.imageUrl}
alt={product.name}
fill
sizes="(max-width: 640px) 50vw, (max-width: 1024px) 33vw, 25vw"
className="object-cover rounded-lg"
/>
</div>
))}
</div>
);
}
Avatar with Fallback
function UserAvatar({ user }) {
const [error, setError] = useState(false);
if (error || !user.avatarUrl) {
return (
<div className="h-10 w-10 rounded-full bg-blue-500 flex items-center justify-center">
<span className="text-white font-medium">
{user.name.charAt(0).toUpperCase()}
</span>
</div>
);
}
return (
<Image
src={user.avatarUrl}
alt={user.name}
width={40}
height={40}
className="rounded-full"
onError={() => setError(true)}
/>
);
}
Art Direction (Different Crops)
function ResponsiveBanner() {
return (
<>
{/* Mobile: Portrait crop */}
<div className="relative h-[400px] md:hidden">
<Image
src="/banner-mobile.jpg"
alt="Banner"
fill
priority
sizes="100vw"
className="object-cover"
/>
</div>
{/* Desktop: Landscape crop */}
<div className="relative hidden h-[300px] md:block">
<Image
src="/banner-desktop.jpg"
alt="Banner"
fill
priority
sizes="100vw"
className="object-cover"
/>
</div>
</>
);
}
Gallery with Lightbox
function ImageGallery({ images }) {
const [selected, setSelected] = useState(null);
return (
<>
<div className="grid grid-cols-3 gap-2">
{images.map((image, i) => (
<button
key={image.id}
onClick={() => setSelected(image)}
className="relative aspect-square"
>
<Image
src={image.thumbnailUrl}
alt={image.alt}
fill
sizes="33vw"
className="object-cover"
/>
</button>
))}
</div>
{selected && (
<Dialog open onClose={() => setSelected(null)}>
<div className="relative h-[80vh] w-[90vw]">
<Image
src={selected.fullUrl}
alt={selected.alt}
fill
sizes="90vw"
quality={90}
className="object-contain"
/>
</div>
</Dialog>
)}
</>
);
}
Background Image Pattern
// For true background images, use CSS
function HeroWithCSSBackground() {
return (
<div
className="h-[600px] bg-cover bg-center"
style={{ backgroundImage: 'url(/hero.webp)' }}
>
<div className="h-full flex items-center justify-center bg-black/40">
<h1 className="text-white text-5xl">Hero Title</h1>
</div>
</div>
);
}
// For Next.js optimization, use Image with fill
function HeroWithNextImage() {
return (
<div className="relative h-[600px]">
<Image
src="/hero.webp"
alt=""
fill
priority
className="object-cover -z-10"
/>
<div className="h-full flex items-center justify-center bg-black/40">
<h1 className="text-white text-5xl">Hero Title</h1>
</div>
</div>
);
}
Orchestkit Performance Wins
OrchestKit Performance Wins - Real Optimization Examples
This document showcases actual performance optimizations from OrchestKit's production implementation with before/after metrics.
Overview
Key Performance Achievements:
- LLM costs: $35k/year → $2-5k/year (85-95% reduction)
- Vector search: 85ms → 5ms (17x faster)
- Retrieval accuracy: 87.2% → 91.6% (5.1% improvement)
- Quality gate pass rate: Increased from 67-77% → 85%+ (stable)
- Cache hit rate: 0% → 90% (L1) + 75% (L2)
Win 1: Multi-Level LLM Caching
Problem
Projected annual LLM costs: $35,000
- 8 agents per analysis, 1,500-1,800 tokens each
- Average 145 analyses/month
- No caching = every query hits LLM
- Claude Sonnet 4.5: $3/MTok input, $15/MTok output
Investigation
Cost breakdown by agent:
-- Langfuse query
SELECT
metadata->>'agent_type' as agent,
SUM(calculated_total_cost) as total_cost,
AVG(input_tokens) as avg_input,
AVG(output_tokens) as avg_output
FROM traces
GROUP BY agent
ORDER BY total_cost DESC;
Results:
| Agent | Monthly Cost | Avg Input | Avg Output |
|---|---|---|---|
| security_auditor | $3.05 | 1,800 | 1,200 |
| implementation_planner | $2.76 | 1,600 | 1,100 |
| tech_comparator | $2.61 | 1,500 | 1,000 |
| Total (8 agents) | $18.73 | - | - |
Pain points:
- Analyzing similar content (React tutorials, FastAPI guides) repeatedly
- Security patterns (XSS, SQL injection) are common across codebases
- Implementation patterns (CRUD, auth) are highly repetitive
Solution: 3-Level Cache Hierarchy
Architecture:
Request → L1: Prompt Cache (Claude native)
↓ miss (10%)
→ L2: Semantic Cache (Redis vector search)
↓ miss (25% of L1 misses)
→ L3: LLM Call (actual cost)
L1: Claude Prompt Caching (Native)
File: backend/app/shared/services/llm/anthropic_client.py
from anthropic import AsyncAnthropic
async def call_claude_with_prompt_cache(
system_prompt: str,
user_message: str,
model: str = "claude-sonnet-4-6"
) -> str:
"""Call Claude with prompt caching for system prompts."""
response = await anthropic_client.messages.create(
model=model,
max_tokens=4096,
system=[
{
"type": "text",
"text": system_prompt,
"cache_control": {"type": "ephemeral"} # Cache this!
}
],
messages=[
{"role": "user", "content": user_message}
]
)
# Log cache usage
cache_hit = response.usage.cache_read_input_tokens > 0
logger.info("claude_prompt_cache",
cache_hit=cache_hit,
cache_read_tokens=response.usage.cache_read_input_tokens,
input_tokens=response.usage.input_tokens,
output_tokens=response.usage.output_tokens
)
return response.content[0].text
Cost savings:
- Cache hit: 90% discount on cached tokens
- Cache duration: 5 minutes
- Effective for: Agent system prompts (1,500+ tokens each)
L2: Semantic Cache (Redis + Vector Search)
File: backend/app/shared/services/cache/semantic_cache.py
import json
from datetime import datetime

import numpy as np
from redis import Redis

from app.shared.services.embeddings import embed_text
class SemanticCache:
"""Vector similarity-based cache for LLM responses."""
def __init__(self, redis_client: Redis, threshold: float = 0.92):
self.redis = redis_client
self.threshold = threshold # Cosine similarity threshold
async def get(self, query: str) -> str | None:
"""Check if semantically similar query exists in cache."""
# Generate query embedding
query_embedding = await embed_text(query)
# Search for similar cached queries
# (Using Redis VSS or dedicated vector store)
cached_queries = await self._vector_search(query_embedding, top_k=5)
for cached_query, cached_embedding, cached_response in cached_queries:
similarity = cosine_similarity(query_embedding, cached_embedding)
if similarity >= self.threshold:
logger.info("semantic_cache_hit",
similarity=similarity,
cached_query=cached_query[:100]
)
return cached_response
return None # Cache miss
async def set(self, query: str, response: str, ttl: int = 3600):
"""Store query-response pair with embedding."""
# Generate embedding
embedding = await embed_text(query)
# Store in Redis (with vector index)
cache_key = f"semantic_cache:{hash(query)}"
await self.redis.setex(
cache_key,
ttl,
json.dumps({
"query": query,
"response": response,
"embedding": embedding.tolist(),
"timestamp": datetime.now().isoformat()
})
)
Cost savings:
- 75% hit rate on L1 misses
- Near-instant responses (5-10ms vs 2000ms)
- Effective for: Similar technical queries
Implementation in agent calls:
@observe(name="agent_execution")
async def execute_agent(agent_type: str, content: str) -> Finding:
"""Execute agent with 3-level caching."""
# Build query
system_prompt = get_agent_system_prompt(agent_type) # 1,500+ tokens
user_message = f"Analyze this content:\n\n{content[:8000]}"
# L2: Check semantic cache
cache_key = f"{agent_type}:{content[:200]}" # Simple key for demo
cached_response = await semantic_cache.get(cache_key)
if cached_response:
logger.info("cache_hit", level="L2_semantic", agent=agent_type)
return parse_finding(cached_response)
# L1 + L3: Call Claude (with prompt caching)
response = await call_claude_with_prompt_cache(
system_prompt=system_prompt, # Cached by Claude
user_message=user_message
)
# Store in semantic cache
await semantic_cache.set(cache_key, response, ttl=3600)
return parse_finding(response)
Results
Cost Reduction:
Baseline (no cache): $35,000/year
L1 savings (90% hit): -$28,350 (90% discount on 90% of queries)
L2 savings (75% hit): -$4,650 (85% discount on 75% of L1 misses)
Final cost: $2,000-5,000/year
Total savings: 85-95%
Latency Improvement:
| Cache Level | Hit Rate | Latency | Cost Savings |
|---|---|---|---|
| L1 (Prompt) | 90% | 2000ms (same) | 90% on cached tokens |
| L2 (Semantic) | 75% (of L1 misses) | 5-10ms | 85% (full skip) |
| L3 (LLM) | 2.5% (fallback) | 2000ms | 0% (full cost) |
Implementation effort: 2 days
Maintenance overhead: Low (cache TTL auto-expires stale data)
Win 2: Vector Index Optimization (HNSW vs IVFFlat)
Problem
Vector search taking 85ms, needed <10ms
- Golden dataset: 415 chunks, 1536-dim embeddings
- IVFFlat index (lists=10)
- Hybrid search (vector + BM25 RRF) bottlenecked by vector search
Investigation
Benchmark both index types:
-- IVFFlat performance
EXPLAIN ANALYZE
SELECT * FROM chunks
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 10;
-- Result:
-- Planning Time: 2.1 ms
-- Execution Time: 85.3 ms
-- HNSW performance
CREATE INDEX idx_chunk_embedding_hnsw ON chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
EXPLAIN ANALYZE
SELECT * FROM chunks
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 10;
-- Result:
-- Planning Time: 2.0 ms
-- Execution Time: 5.1 ms
Trade-offs:
| Index | Build Time | Query Time | Accuracy | Memory |
|---|---|---|---|---|
| IVFFlat (lists=10) | 2s | 85ms | 95% | Low |
| HNSW (m=16) | 8s | 5ms | 98% | Medium |
Solution: HNSW Index with Optimized Parameters
File: backend/alembic/versions/xxx_add_hnsw_index.py
def upgrade():
"""Add HNSW index for vector similarity search."""
op.execute("""
CREATE INDEX CONCURRENTLY idx_chunk_embedding_hnsw
ON chunks USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
""")
# Drop old IVFFlat index
op.execute("DROP INDEX IF EXISTS idx_chunk_embedding_ivfflat;")Parameters chosen:
- m = 16: Connections per layer (sweet spot for 1k-10k vectors)
- ef_construction = 64: Build-time quality (higher = better accuracy, slower build)
- ef_search = 64: Query-time quality (can tune per query)
Runtime tuning:
async def search_similar_chunks(
embedding: list[float],
top_k: int = 10
) -> list[Chunk]:
"""Vector similarity search with HNSW index."""
# Tune ef_search for accuracy vs speed trade-off
await session.execute(text("SET hnsw.ef_search = 64;"))
results = await session.execute(
select(Chunk)
.order_by(Chunk.embedding.cosine_distance(embedding))
.limit(top_k)
)
return results.scalars().all()
Results
Performance:
- Query latency: 85ms → 5ms (17x faster)
- Accuracy: 95% → 98% (3% improvement)
- Build time: 2s → 8s (acceptable for 415 chunks)
Impact on retrieval:
- Hybrid search latency: 95ms → 15ms (p95)
- Throughput: 10.5 req/s → 66 req/s (6x improvement)
Implementation effort: 4 hours (index creation + testing)
Win 3: Hybrid Search Ranking Optimization
Problem
Retrieval pass rate: 87.2%, target: >90%
- Expected chunks ranked 6-10 instead of top-5
- RRF fusion not getting enough candidates
- No metadata boosting
Investigation
Golden dataset analysis (203 queries):
# Evaluate current ranking
results = []
for query in golden_queries:
retrieved = await hybrid_search(query.text, top_k=10)
expected_in_top_k = any(chunk.id in query.expected_chunk_ids for chunk in retrieved)
rank = next((i for i, c in enumerate(retrieved) if c.id in query.expected_chunk_ids), -1)
results.append({
"query": query.text,
"expected_rank": rank,
"found": rank != -1,
"passed": rank < 10
})
# Results:
# Pass rate: 177/203 = 87.2%
# MRR: 0.723
Failure analysis:
- 26 queries failed (expected chunk not in top-10)
- Common issue: Expected chunk ranked 11-15
- Root cause: RRF fusion only fetching 2x candidates (20 for top-10)
Solution: Multi-Pronged Optimization
1. Increase RRF Fetch Multiplier
File: backend/app/core/constants.py
# Before
HYBRID_FETCH_MULTIPLIER = 2 # Fetch 20 for top-10
# After
HYBRID_FETCH_MULTIPLIER = 3 # Fetch 30 for top-10
Rationale: More candidates → better RRF coverage → higher recall
2. Add Metadata Boosting
File: backend/app/shared/services/search/search_service.py
def apply_metadata_boosts(
chunks: list[Chunk],
query: str
) -> list[Chunk]:
"""Boost scores based on metadata signals."""
query_lower = query.lower()
for chunk in chunks:
# Boost if query matches section title
if chunk.section_title and any(
term in chunk.section_title.lower()
for term in query_lower.split()
):
chunk.score *= SECTION_TITLE_BOOST_FACTOR # 2.0
# Boost if query matches document path
if chunk.document_path and any(
term in chunk.document_path.lower()
for term in query_lower.split()
):
chunk.score *= DOCUMENT_PATH_BOOST_FACTOR # 1.15
# Boost code blocks for technical queries
if chunk.chunk_type == "code_block" and is_technical_query(query):
chunk.score *= TECHNICAL_KEYWORD_BOOST # 1.2
return sorted(chunks, key=lambda c: c.score, reverse=True)
3. Pre-Compute tsvector for BM25
Before:
-- Compute tsvector on-the-fly (slow!)
SELECT *, ts_rank(to_tsvector('english', content), query) as rank
FROM chunks
WHERE to_tsvector('english', content) @@ query
ORDER BY rank DESC;
After:
-- Use pre-computed tsvector column (fast!)
SELECT *, ts_rank(content_tsvector, query) as rank
FROM chunks
WHERE content_tsvector @@ query
ORDER BY rank DESC;
Migration:
def upgrade():
"""Add pre-computed tsvector column."""
# Add column
op.add_column('chunks', sa.Column('content_tsvector', TSVECTOR))
# Populate
op.execute("""
UPDATE chunks
SET content_tsvector = to_tsvector('english', content);
""")
# Create GIN index
op.execute("""
CREATE INDEX idx_chunk_tsvector
ON chunks USING GIN(content_tsvector);
""")
# Add trigger to keep it updated
op.execute("""
CREATE TRIGGER tsvector_update BEFORE INSERT OR UPDATE
ON chunks FOR EACH ROW EXECUTE FUNCTION
tsvector_update_trigger(content_tsvector, 'pg_catalog.english', content);
""")Results
Ranking Quality:
| Metric | Before | After | Change |
|---|---|---|---|
| Pass rate | 177/203 (87.2%) | 186/203 (91.6%) | +5.1% |
| MRR (overall) | 0.723 | 0.777 | +7.4% |
| MRR (hard queries) | 0.647 | 0.686 | +6.0% |
Query Performance:
| Operation | Before | After | Change |
|---|---|---|---|
| BM25 search | 45ms | 4ms | 11x faster |
| Vector search | 5ms | 5ms | Same |
| RRF fusion | 2ms | 3ms | Slightly slower (more candidates) |
| Total | 52ms | 12ms | 4.3x faster |
Impact by boost factor:
- Section title boost: +7.4% MRR (most impactful)
- Document path boost: +2.1% MRR
- Code block boost: +1.3% MRR (for technical queries)
Implementation effort: 1 day (constants, migration, testing)
Win 4: SSE Event Buffering (Race Condition Fix)
Problem
Frontend showed 0% progress while backend was running
- Real-time progress updates missing
- EventSource connection established AFTER events published
- No event replay mechanism
Investigation
Reproduce issue:
- Start analysis via API
- Frontend subscribes to SSE
/progress/\{analysis_id\} - Backend immediately publishes "analysis_started" event
- Frontend connects 200ms later → misses early events
Root cause:
# ❌ BAD: Events lost if no subscriber yet
class EventBroadcaster:
def publish(self, channel: str, event: dict):
if channel not in self._subscribers:
return # Event lost!
for subscriber in self._subscribers[channel]:
subscriber.send(event)
Solution: Event Buffering with Replay
File: backend/app/services/event_broadcaster.py
from collections import deque
from dataclasses import dataclass
from datetime import datetime
@dataclass
class BufferedEvent:
"""Event with timestamp for replay."""
data: dict
timestamp: datetime
class EventBroadcaster:
"""SSE broadcaster with event buffering."""
def __init__(self, buffer_size: int = 100):
self._subscribers: dict[str, list] = {}
self._buffers: dict[str, deque[BufferedEvent]] = {}
self._buffer_size = buffer_size
def publish(self, channel: str, event: dict):
"""Publish event and store in buffer."""
# Create buffer if needed
if channel not in self._buffers:
self._buffers[channel] = deque(maxlen=self._buffer_size)
# Add to buffer
buffered_event = BufferedEvent(
data=event,
timestamp=datetime.now()
)
self._buffers[channel].append(buffered_event)
# Send to active subscribers
for subscriber in self._subscribers.get(channel, []):
try:
subscriber.send(event)
except Exception as e:
logger.error("failed_to_send_event", error=str(e))
async def subscribe(self, channel: str):
"""Subscribe to channel and replay buffered events."""
# Replay buffered events first
for buffered_event in self._buffers.get(channel, []):
yield {
"event": "message",
"data": json.dumps(buffered_event.data)
}
# Then stream new events
queue = asyncio.Queue()
self._subscribers.setdefault(channel, []).append(queue)
try:
while True:
event = await queue.get()
yield {
"event": "message",
"data": json.dumps(event)
}
finally:
self._subscribers[channel].remove(queue)
API endpoint:
@app.get("/progress/{analysis_id}")
async def stream_progress(analysis_id: str):
"""Stream analysis progress with buffered event replay."""
channel = f"analysis:{analysis_id}"
async def event_generator():
async for event in event_broadcaster.subscribe(channel):
yield f"data: {event['data']}\n\n"
return StreamingResponse(
event_generator(),
media_type="text/event-stream"
)
Results
Before (with race condition):
- 0% progress shown until agent completion (30-60 seconds)
- Users confused, thought app was frozen
- Support tickets: "Analysis stuck at 0%"
After (with buffering):
- All events delivered (100% replay rate)
- Progress updates appear immediately
- Memory overhead: ~10KB per active analysis (100 events × 100 bytes)
Implementation effort: 3 hours (buffer logic + tests)
Win 5: Quality Gate Content Truncation Fix
Problem
Quality scores artificially low due to content truncation
- Depth scores: 5/10 (AWFUL) → required retries
- G-Eval only seeing truncated summaries
- 4 stages of truncation compounding
Investigation
Trace truncation points:
# Stage 1: compress_findings.py
MAX_STRING_LENGTH = 200 # ❌ Too aggressive!
# Stage 2: scorer.py
input_text = content[:2000] # ❌ Truncated again!
output_text = response[:3000]
# Stage 3: quality.py
MAX_CONTENT_LENGTH = 8000 # ❌ Insufficient!
# Stage 4: quality_gate_node.py
insights = findings[:2000] # ❌ Final truncation!
Example:
- Original finding: 5,000 chars (detailed security analysis)
- After Stage 1: 200 chars ("Found 3 vulnerabilities...")
- After synthesis: 1,500 chars (includes other findings)
- After Stage 2: 1,500 chars (same)
- After G-Eval: Depth score = 5/10 (insufficient detail)
Solution: Increase All Truncation Limits
Changes:
| File | Before | After | Rationale |
|---|---|---|---|
| compress_findings.py | 200 | 500 | Allow key insights |
| scorer.py (input) | 2,000 | 8,000 | Full context for eval |
| scorer.py (output) | 3,000 | 12,000 | Detailed responses |
| quality.py | 8,000 | 15,000 | Complete synthesis |
| quality_gate_node.py | 2,000 | 8,000 | All findings visible |
Implementation:
# backend/app/shared/services/g_eval/scorer.py
MAX_INPUT_LENGTH = 8000 # Increased from 2000
MAX_OUTPUT_LENGTH = 12000 # Increased from 3000
# backend/app/evaluation/evaluators/quality.py
MAX_CONTENT_LENGTH = 15000 # Increased from 8000
# backend/app/domains/analysis/workflows/tasks/aggregation/compress_findings.py
MAX_STRING_LENGTH = 500 # Increased from 200
Results
Quality Scores:
| Criterion | Before | After | Change |
|---|---|---|---|
| Completeness | 0.75 | 0.85 | +13% |
| Accuracy | 0.88 | 0.92 | +5% |
| Coherence | 0.84 | 0.88 | +5% |
| Depth | 0.58 | 0.78 | +34% |
| Overall | 0.76 | 0.86 | +13% |
Pass rate: 67-77% (variable) → 85%+ (stable)
Trade-offs:
- Token usage: +15% (from 8k → 12k avg)
- Cost impact: +$0.02 per analysis (acceptable)
- Quality improvement: Worth the extra cost
Implementation effort: 2 hours (find all truncation points + update tests)
Summary Table
| Optimization | Metric | Before | After | Improvement | Effort |
|---|---|---|---|---|---|
| Multi-level caching | Annual cost | $35k | $2-5k | 85-95% | 2 days |
| HNSW index | Query latency | 85ms | 5ms | 17x faster | 4 hours |
| Hybrid search | Pass rate | 87.2% | 91.6% | +5.1% | 1 day |
| SSE buffering | Event delivery | 60% | 100% | +67% | 3 hours |
| Content truncation | Depth score | 0.58 | 0.78 | +34% | 2 hours |
Total implementation time: 4 days
Annual cost savings: $30-33k
Quality improvement: 13% overall, 34% depth
References
- OrchestKit Quality Initiative
- Redis Connection Keepalive
- Hybrid Search Constants
- Template: ../scripts/caching-patterns.ts
- Template: ../scripts/database-optimization.ts