OrchestKit v6.7.1 — 67 skills, 38 agents, 77 hooks with Opus 4.6 support

Performance

Performance optimization patterns covering Core Web Vitals, React render optimization, lazy loading, image optimization, backend profiling, and LLM inference. Use when improving page speed, debugging slow renders, optimizing bundles, reducing image payload, profiling backend code, or deploying LLMs efficiently.

Reference high

Primary Agent: frontend-ui-developer

Performance

Comprehensive performance optimization patterns for frontend, backend, and LLM inference.

Quick Reference

| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Core Web Vitals | 3 | CRITICAL | LCP, INP, CLS optimization with 2026 thresholds |
| Render Optimization | 3 | HIGH | React Compiler, memoization, virtualization |
| Lazy Loading | 3 | HIGH | Code splitting, route splitting, preloading |
| Image Optimization | 3 | HIGH | Next.js Image, AVIF/WebP, responsive images |
| Profiling & Backend | 3 | MEDIUM | React DevTools, py-spy, bundle analysis |
| LLM Inference | 3 | MEDIUM | vLLM, quantization, speculative decoding |
| Caching | 2 | HIGH | Redis cache-aside, prompt caching, HTTP cache headers |
| Query & Data Fetching | 2 | HIGH | TanStack Query prefetching, optimistic updates, rollback |

Total: 22 rules across 8 categories

Core Web Vitals

Google's Core Web Vitals with 2026 stricter thresholds.

| Rule | File | Key Pattern |
|---|---|---|
| LCP Optimization | rules/cwv-lcp.md | Preload hero, SSR, fetchpriority="high" |
| INP Optimization | rules/cwv-inp.md | scheduler.yield, useTransition, requestIdleCallback |
| CLS Prevention | rules/cwv-cls.md | Explicit dimensions, aspect-ratio, font-display |

2026 Thresholds

| Metric | Current Good | 2026 Good |
|---|---|---|
| LCP | <= 2.5s | <= 2.0s |
| INP | <= 200ms | <= 150ms |
| CLS | <= 0.1 | <= 0.08 |
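As a rough sketch, the thresholds above can be encoded in a small helper for bucketing field data. The names here (`THRESHOLDS_2026`, `grade`) are illustrative, not part of any web-vitals library:

```python
# 2026 "good" thresholds from the table above.
# LCP and INP in milliseconds, CLS unitless.
THRESHOLDS_2026 = {
    "LCP": 2000,
    "INP": 150,
    "CLS": 0.08,
}

def grade(metric: str, value: float) -> str:
    """Return 'good' if a field sample meets the 2026 threshold."""
    return "good" if value <= THRESHOLDS_2026[metric] else "needs-improvement"
```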

Render Optimization

React render performance patterns for React 19+.

| Rule | File | Key Pattern |
|---|---|---|
| React Compiler | rules/render-compiler.md | Auto-memoization, "Memo" badge verification |
| Manual Memoization | rules/render-memo.md | useMemo/useCallback escape hatches, state colocation |
| Virtualization | rules/render-virtual.md | TanStack Virtual for 100+ item lists |

Lazy Loading

Code splitting and lazy loading with React.lazy and Suspense.

| Rule | File | Key Pattern |
|---|---|---|
| React.lazy + Suspense | rules/loading-lazy.md | Component lazy loading, error boundaries |
| Route Splitting | rules/loading-splitting.md | React Router 7.x, Vite manual chunks |
| Preloading | rules/loading-preload.md | Prefetch on hover, modulepreload hints |

Image Optimization

Production image optimization for modern web applications.

| Rule | File | Key Pattern |
|---|---|---|
| Next.js Image | rules/images-nextjs.md | Image component, priority, blur placeholder |
| Format Selection | rules/images-formats.md | AVIF/WebP, quality 75-85, picture element |
| Responsive Images | rules/images-responsive.md | sizes prop, art direction, CDN loaders |

Profiling & Backend

Profiling tools and backend optimization patterns.

| Rule | File | Key Pattern |
|---|---|---|
| React Profiling | rules/profiling-react.md | DevTools Profiler, flamegraph, render counts |
| Backend Profiling | rules/profiling-backend.md | py-spy, cProfile, memory_profiler, flame graphs |
| Bundle Analysis | rules/profiling-bundle.md | vite-bundle-visualizer, tree shaking, performance budgets |

LLM Inference

High-performance LLM inference with vLLM, quantization, and speculative decoding.

| Rule | File | Key Pattern |
|---|---|---|
| vLLM Deployment | rules/inference-vllm.md | PagedAttention, continuous batching, tensor parallelism |
| Quantization | rules/inference-quantization.md | AWQ, GPTQ, FP8, INT8 method selection |
| Speculative Decoding | rules/inference-speculative.md | N-gram, draft model, 1.5-2.5x throughput |

Caching

Backend Redis caching and LLM prompt caching for cost savings and performance.

| Rule | File | Key Pattern |
|---|---|---|
| Redis & Backend | rules/caching-redis.md | Cache-aside, write-through, invalidation, stampede prevention |
| HTTP & Prompt | rules/caching-http.md | HTTP cache headers, LLM prompt caching, semantic caching |

Query & Data Fetching

TanStack Query v5 patterns for prefetching and optimistic updates.

| Rule | File | Key Pattern |
|---|---|---|
| Prefetching | rules/query-prefetching.md | Hover prefetch, route loaders, queryOptions, Suspense |
| Optimistic Updates | rules/query-optimistic.md | Optimistic mutations, rollback, cache invalidation |

Quick Start Example

// LCP: Priority hero image with SSR
import Image from 'next/image';

export default async function Page() {
  const data = await fetchHeroData();
  return (
    <Image
      src={data.heroImage}
      alt="Hero"
      priority
      placeholder="blur"
      blurDataURL={data.heroBlurDataURL} // required with a remote src string
      sizes="100vw"
      fill
    />
  );
}

Key Decisions

| Decision | Recommendation |
|---|---|
| Memoization | Let React Compiler handle it (2026 default) |
| Lists 100+ items | Use TanStack Virtual |
| Image format | AVIF with WebP fallback (30-50% smaller) |
| LCP content | SSR/SSG, never client-side fetch |
| Code splitting | Per-route for most apps, per-component for heavy widgets |
| Prefetch strategy | On hover for nav links, viewport for content |
| Quantization | AWQ for 4-bit, FP8 for H100/H200 |
| Bundle budget | Hard fail in CI to prevent regression |

Common Mistakes

  1. Client-side fetching LCP content (delays render)
  2. Images without explicit dimensions (causes CLS)
  3. Lazy loading LCP images (delays largest paint)
  4. Heavy computation in event handlers (blocks INP)
  5. Layout-shifting animations (use transform instead)
  6. Lazy loading tiny components < 5KB (overhead > savings)
  7. Missing error boundaries on lazy components
  8. Using GPTQ without calibration data
  9. Not benchmarking actual workload patterns
  10. Only measuring in lab environment (need RUM)

Related Skills

  • ork:react-server-components-framework - Server-first rendering
  • ork:vite-advanced - Build optimization
  • caching - Cache strategies for responses
  • ork:monitoring-observability - Production monitoring and alerting
  • ork:database-patterns - Query and index optimization
  • ork:llm-integration - Local inference with Ollama

Capability Details

lcp-optimization

Keywords: LCP, largest-contentful-paint, hero, preload, priority, SSR

Solves:

  • Optimize hero image loading
  • Server-render critical content
  • Preload and prioritize LCP resources

inp-optimization

Keywords: INP, interaction, responsiveness, long-task, transition, yield

Solves:

  • Break up long tasks with scheduler.yield
  • Defer non-urgent updates with useTransition
  • Optimize event handler performance

cls-prevention

Keywords: CLS, layout-shift, dimensions, aspect-ratio, font-display

Solves:

  • Reserve space for dynamic content
  • Prevent font flash and image pop-in
  • Use transform for animations

react-compiler

Keywords: react-compiler, auto-memo, memoization, React 19

Solves:

  • Enable automatic memoization
  • Identify when manual memoization needed
  • Verify compiler is working

virtualization

Keywords: virtual, TanStack, large-list, scroll, overscan

Solves:

  • Render 100+ item lists efficiently
  • Dynamic height virtualization
  • Window scrolling patterns

lazy-loading

Keywords: React.lazy, Suspense, code-splitting, dynamic-import

Solves:

  • Route-based code splitting
  • Component lazy loading with error boundaries
  • Prefetch on hover and viewport

image-optimization

Keywords: next/image, AVIF, WebP, responsive, blur-placeholder

Solves:

  • Next.js Image component patterns
  • Format selection and quality settings
  • Responsive sizing and CDN configuration

profiling

Keywords: profiler, flame-graph, py-spy, DevTools, bundle-analyzer

Solves:

  • Profile React renders and backend code
  • Generate and interpret flame graphs
  • Analyze and optimize bundle size

llm-inference

Keywords: vllm, quantization, speculative-decoding, inference, throughput

Solves:

  • Deploy LLMs with vLLM for production
  • Choose quantization method for hardware
  • Accelerate generation with speculative decoding

References


Rules (22)

Configure HTTP and LLM prompt caching with correct breakpoint ordering for maximum savings — HIGH

HTTP & Prompt Caching

HTTP cache headers for CDN/browser caching and LLM prompt caching for 90% token savings.

Incorrect — variable content before cached prefix:

# WRONG: Variable content before static content breaks prompt cache
messages = [
    {"role": "user", "content": f"User {user_id} asks: {question}"},  # Variable first!
    {"role": "system", "content": long_system_prompt},  # Static content after = never cached
]

Correct — static prefix first, then variable content:

# Claude prompt caching: static content first with cache_control
response = await client.messages.create(
    model="claude-sonnet-4-6",
    system=[
        {
            "type": "text",
            "text": long_system_prompt,  # Static: cached across calls
            "cache_control": {"type": "ephemeral"},  # 5-minute TTL
        },
    ],
    messages=[
        {"role": "user", "content": user_question},  # Variable: after cache breakpoint
    ],
)
# Result: ~90% token savings on system prompt after first call

# OpenAI: automatic prefix caching (no markers needed)
# Just ensure static content comes first in messages array
response = await openai.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": long_system_prompt},  # Cached automatically
        {"role": "user", "content": user_question},
    ],
)

HTTP cache headers for API responses:

from fastapi import FastAPI, Response

app = FastAPI()

@app.get("/api/products/{product_id}")
async def get_product(product_id: str, response: Response):
    product = await fetch_product(product_id)
    # Browser caches 60s, CDN caches 1h
    response.headers["Cache-Control"] = "public, max-age=60, s-maxage=3600"
    response.headers["CDN-Cache-Control"] = "max-age=3600"
    return product

@app.get("/api/user/profile")
async def get_profile(response: Response):
    # Private: only browser cache, not CDN
    response.headers["Cache-Control"] = "private, max-age=300"
    return await get_current_user_profile()

Key rules:

  • Claude: use cache_control with ephemeral type (5min default, 1h if >10 reads/hour)
  • OpenAI: automatic prefix caching, no markers needed — just put static content first
  • HTTP: public, max-age=60, stale-while-revalidate=300 for API responses
  • Use s-maxage or CDN-Cache-Control for different CDN vs browser TTLs
  • Semantic caching: start threshold at 0.92, tune based on hit rate
  • Never cache error responses or authentication tokens

Implement Redis cache-aside pattern with TTL and stampede prevention for backend caching — HIGH

Redis & Backend Caching

Cache-aside, write-through, and invalidation patterns for Redis-backed backend services.

Incorrect — caching without TTL (memory leak):

# WRONG: No TTL = memory grows forever
async def get_user(user_id: str):
    cached = await redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    user = await db.fetch_user(user_id)
    await redis.set(f"user:{user_id}", json.dumps(user))  # No expiry!
    return user

Correct — cache-aside with TTL and stampede prevention:

import redis.asyncio as redis
import json
import asyncio

class CacheAside:
    def __init__(self, redis_client: redis.Redis, default_ttl: int = 3600):
        self.redis = redis_client
        self.ttl = default_ttl

    async def get_or_set(self, key: str, fetch_fn, ttl: int | None = None):
        """Cache-aside with stampede prevention via lock."""
        cached = await self.redis.get(key)
        if cached:
            return json.loads(cached)

        # Stampede prevention: only one caller computes
        lock_key = f"lock:{key}"
        acquired = await self.redis.set(lock_key, "1", ex=30, nx=True)
        if not acquired:
            # Another process is computing: poll briefly for its result
            for _ in range(20):
                await asyncio.sleep(0.1)
                cached = await self.redis.get(key)
                if cached:
                    return json.loads(cached)
            # Lock holder may have died; fall through and recompute

        try:
            value = await fetch_fn()
            await self.redis.setex(key, ttl or self.ttl, json.dumps(value))
            return value
        finally:
            if acquired:
                # Only the lock owner releases the lock
                await self.redis.delete(lock_key)

# Write-through: refresh the cache in the same code path as the DB write
async def update_user(user_id: str, data: dict, db, cache: CacheAside):
    async with db.transaction():
        await db.execute("UPDATE users SET ... WHERE id = $1", user_id)
        await cache.redis.setex(
            f"user:{user_id}",
            cache.ttl,
            json.dumps(data),
        )

# Event-based invalidation
async def on_user_updated(event: UserUpdatedEvent, cache: CacheAside):
    await cache.redis.delete(f"user:{event.user_id}")
    # Related caches too
    await cache.redis.delete(f"user-profile:{event.user_id}")

Key rules:

  • Always set TTL (1h default, 5min for volatile data)
  • Use orjson for serialization performance over json
  • Key naming: {entity}:{id} or {entity}:{id}:{field}
  • Stampede prevention: use distributed locks for expensive computations
  • Event-based invalidation for writes, TTL for reads
  • Never use cache as primary storage (data loss risk)

Prevent Cumulative Layout Shift that causes content jumping and hurts search rankings — CRITICAL

CLS Prevention

Prevent Cumulative Layout Shift for the 2026 threshold of <= 0.08.

Reserve Space for Dynamic Content

/* Reserve space for images */
.image-container {
  aspect-ratio: 16 / 9;
  width: 100%;
}

/* Reserve space for ads */
.ad-slot {
  min-height: 250px;
}

Explicit Dimensions

// Always set width and height
<img src="/photo.jpg" width={800} height={600} alt="Photo" />

// Next.js Image handles this automatically
<Image src="/photo.jpg" width={800} height={600} alt="Photo" />

// For responsive images
<Image src="/photo.jpg" fill sizes="(max-width: 768px) 100vw, 50vw" />

Avoid Layout-Shifting Fonts

/* Use font-display: optional for non-critical fonts */
@font-face {
  font-family: 'CustomFont';
  src: url('/fonts/custom.woff2') format('woff2');
  font-display: optional;
}

/* Or use size-adjust for fallback */
@font-face {
  font-family: 'Fallback';
  src: local('Arial');
  size-adjust: 105%;
  ascent-override: 95%;
}

Animations That Don't Cause Layout Shift

/* BAD: Changes layout properties */
.expanding {
  height: 0;
  transition: height 0.3s;
}
.expanding.open {
  height: 200px; /* Causes layout shift */
}

/* GOOD: Use transform */
.expanding {
  transform: scaleY(0);
  transform-origin: top;
  transition: transform 0.3s;
}
.expanding.open {
  transform: scaleY(1);
}

Incorrect — Image without dimensions causes layout shift:

<img src="/photo.jpg" alt="Photo" />

Correct — Explicit dimensions reserve space:

<img src="/photo.jpg" width={800} height={600} alt="Photo" />

Key Rules

  1. Always set width/height on images
  2. Use aspect-ratio for responsive containers
  3. Use font-display: optional for non-critical fonts
  4. Never animate layout properties (width, height, top, left)
  5. Use transform and opacity for animations
  6. Reserve space for ads, embeds, and dynamic content
  7. Target <= 0.08 for 2026 thresholds
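Each layout shift is scored as impact fraction times distance fraction, and CLS is the sum over the worst session window. A minimal sketch of that arithmetic (session windowing omitted; function names are illustrative):

```python
def shift_score(impact_fraction: float, distance_fraction: float) -> float:
    """Score of one layout shift: impact fraction x distance fraction."""
    return impact_fraction * distance_fraction

def cls_for_window(shifts: list[tuple[float, float]]) -> float:
    """CLS for one session window is the sum of its shift scores.
    The page's CLS is the largest window (windowing not modeled here)."""
    return sum(shift_score(i, d) for i, d in shifts)

# Example: a 50%-viewport element moving 25% of the viewport scores 0.125,
# already exceeding the 0.08 budget on its own.
```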

Optimize Interaction to Next Paint to ensure responsive button clicks and interactions — CRITICAL

INP Optimization

Optimize Interaction to Next Paint for the 2026 threshold of <= 150ms.

Break Up Long Tasks

// BAD: Long synchronous task (blocks main thread)
function processLargeArray(items: Item[]) {
  items.forEach(processItem); // Blocks for entire duration
}

// GOOD: Yield to the main thread every ~50ms
async function processLargeArray(items: Item[]) {
  let lastYield = performance.now();
  for (const item of items) {
    processItem(item);
    if (performance.now() - lastYield > 50) {
      await (scheduler.yield?.() ?? new Promise((r) => setTimeout(r, 0)));
      lastYield = performance.now();
    }
  }
}

Use Transitions for Non-Urgent Updates

import { useState, useTransition, type ChangeEvent } from 'react';

function SearchResults() {
  const [query, setQuery] = useState('');
  const [filteredResults, setFilteredResults] = useState<Result[]>([]);
  const [isPending, startTransition] = useTransition();

  const handleChange = (e: ChangeEvent<HTMLInputElement>) => {
    const value = e.target.value;
    // Urgent: update the input immediately
    setQuery(value);

    // Non-urgent: defer the expensive filter
    startTransition(() => {
      setFilteredResults(filterResults(value));
    });
  };

  return (
    <>
      <input value={query} onChange={handleChange} />
      {isPending && <Spinner />}
      <ResultsList results={filteredResults} />
    </>
  );
}

Optimize Event Handlers

// BAD: Heavy computation in click handler
<button onClick={() => {
  const result = heavyComputation(); // Blocks paint
  setResult(result);
}}>Calculate</button>

// GOOD: Defer heavy work
<button onClick={() => {
  setLoading(true);
  requestIdleCallback(() => {
    const result = heavyComputation();
    setResult(result);
    setLoading(false);
  });
}}>Calculate</button>

Incorrect — Blocking click handler delays visual feedback:

<button onClick={() => {
  const result = heavyComputation(); // Blocks paint
  setResult(result);
}}>Calculate</button>

Correct — Deferred work keeps UI responsive:

<button onClick={() => {
  setLoading(true);
  requestIdleCallback(() => {
    const result = heavyComputation();
    setResult(result);
    setLoading(false);
  });
}}>Calculate</button>

Key Rules

  1. Break long tasks > 50ms with scheduler.yield()
  2. Use useTransition for non-urgent state updates
  3. Defer heavy computation with requestIdleCallback
  4. Never block the main thread in event handlers
  5. Use useDeferredValue for expensive derived values
  6. Target <= 150ms for 2026 thresholds

Optimize Largest Contentful Paint to improve search rankings and perceived page speed — CRITICAL

LCP Optimization

Optimize Largest Contentful Paint for the 2026 threshold of <= 2.0s.

Identify LCP Element

new PerformanceObserver((entryList) => {
  const entries = entryList.getEntries();
  const lastEntry = entries[entries.length - 1];
  console.log('LCP element:', lastEntry.element);
  console.log('LCP time:', lastEntry.startTime);
}).observe({ type: 'largest-contentful-paint', buffered: true });

Optimize LCP Images

// Priority loading for hero image
<img
  src="/hero.webp"
  alt="Hero"
  fetchpriority="high"
  loading="eager"
  decoding="async"
/>

// Next.js Image with priority
import Image from 'next/image';

<Image
  src="/hero.webp"
  alt="Hero"
  priority
  sizes="100vw"
  quality={85}
/>

Preload Critical Resources

<!-- Preload LCP image -->
<link rel="preload" as="image" href="/hero.webp" fetchpriority="high" />

<!-- Preload critical font -->
<link rel="preload" as="font" href="/fonts/inter.woff2" type="font/woff2" crossorigin />

<!-- Preconnect to critical origins -->
<link rel="preconnect" href="https://api.example.com" />
<link rel="dns-prefetch" href="https://analytics.example.com" />

Server-Side Rendering

// Next.js - ensure SSR for LCP content
export default async function Page() {
  const data = await fetchCriticalData();
  return <HeroSection data={data} />; // Rendered on server
}

// BAD: LCP content loaded client-side
const [data, setData] = useState(null);
useEffect(() => { fetchData().then(setData); }, []);

Incorrect — Lazy-loading LCP image delays paint:

<img src="/hero.webp" alt="Hero" loading="lazy" />

Correct — Priority loading for LCP image:

<img
  src="/hero.webp"
  alt="Hero"
  fetchpriority="high"
  loading="eager"
  decoding="async"
/>

Key Rules

  1. Never lazy-load the LCP image
  2. Always use fetchpriority="high" on LCP images
  3. Always server-render LCP content
  4. Preload critical resources in <head>
  5. Preconnect to third-party origins used for LCP
  6. Target <= 2.0s for 2026 thresholds

Serve AVIF and WebP formats for 30-50% smaller files than JPEG at equivalent quality — HIGH

Modern Image Formats

Choose the right image format and quality settings for optimal compression.

Format Decision Matrix

| Format | Best For | Browser Support | Quality Setting |
|---|---|---|---|
| AVIF | Photos, gradients | 93%+ (2026) | 60-75 |
| WebP | Universal fallback | 97%+ | 75-82 |
| JPEG | Legacy fallback | 100% | 80-85 |
| PNG | Transparency, icons | 100% | N/A |
| SVG | Icons, logos | 100% | N/A |

Picture Element with Fallback

<picture>
  <source srcset="/photo.avif" type="image/avif" />
  <source srcset="/photo.webp" type="image/webp" />
  <img src="/photo.jpg" alt="Photo" width="800" height="600" loading="lazy" />
</picture>

Build-Time Conversion

// vite.config.ts with vite-plugin-image-optimizer
import { imageOptimizer } from 'vite-plugin-image-optimizer';

export default defineConfig({
  plugins: [
    imageOptimizer({
      avif: { quality: 72, effort: 4 },
      webp: { quality: 78 },
      jpeg: { quality: 82, progressive: true },
    }),
  ],
});

Quality Guidelines

AVIF  60-75  — Best compression, slight encoding time cost
WebP  75-82  — Good balance, fastest encoding
JPEG  80-85  — Legacy only, use progressive encoding

Rule of thumb: lower quality for large hero images (more compression gain),
higher quality for small thumbnails (already small files).

Incorrect — Single JPEG format misses 30-50% compression savings:

<img src="/photo.jpg" alt="Photo" width="800" height="600" />

Correct — Modern formats with fallback:

<picture>
  <source srcset="/photo.avif" type="image/avif" />
  <source srcset="/photo.webp" type="image/webp" />
  <img src="/photo.jpg" alt="Photo" width="800" height="600" />
</picture>

Key rules:

  • Prefer AVIF as primary format with WebP fallback
  • Use quality 72-78 for AVIF and WebP (visually lossless for most photos)
  • Always include a JPEG/PNG fallback in <picture>
  • Use progressive JPEG for any remaining JPEG images
  • Automate format conversion in the build pipeline, not manually

Use Next.js Image component for automatic lazy loading, responsive sizing, and format negotiation — HIGH

Next.js Image Component

Use the Next.js Image component for automatic optimization, format negotiation, and responsive sizing.

Priority Hero Image

import Image from 'next/image';

export default function Hero() {
  return (
    <Image
      src="/hero.webp"
      alt="Product hero"
      width={1200}
      height={630}
      priority           // Disables lazy loading, adds preload hint
      sizes="100vw"
      quality={85}
    />
  );
}

Blur Placeholder

// Static imports generate blurDataURL automatically
import heroImg from '@/public/hero.jpg';

<Image
  src={heroImg}
  alt="Hero"
  placeholder="blur"      // Uses auto-generated blurDataURL
  priority
/>

// For remote images, provide blurDataURL manually
<Image
  src="https://cdn.example.com/photo.jpg"
  alt="Photo"
  width={800}
  height={600}
  placeholder="blur"
  blurDataURL="data:image/jpeg;base64,/9j/4AAQ..."
/>

Custom Loader for CDN

// next.config.js
module.exports = {
  images: {
    loader: 'custom',
    loaderFile: './lib/image-loader.ts',
  },
};

// lib/image-loader.ts
export default function cloudflareLoader({
  src, width, quality,
}: { src: string; width: number; quality?: number }) {
  const params = [`width=${width}`, `quality=${quality || 80}`, 'format=auto'];
  return `https://cdn.example.com/cdn-cgi/image/${params.join(',')}/${src}`;
}

Responsive Fill Layout

<div style={{ position: 'relative', width: '100%', aspectRatio: '16/9' }}>
  <Image
    src="/banner.jpg"
    alt="Banner"
    fill
    sizes="(max-width: 768px) 100vw, (max-width: 1200px) 50vw, 33vw"
    style={{ objectFit: 'cover' }}
  />
</div>

Incorrect — Missing sizes causes incorrect srcset selection:

<Image src="/banner.jpg" alt="Banner" fill />

Correct — Sizes hint ensures optimal image size:

<Image
  src="/banner.jpg"
  alt="Banner"
  fill
  sizes="(max-width: 768px) 100vw, 50vw"
/>

Key rules:

  • Set priority on the LCP image (only one per page)
  • Always provide sizes for responsive images
  • Use placeholder="blur" for visible images to prevent CLS
  • Use a custom loader for external CDN image transformation
  • Use fill with sizes for responsive containers instead of fixed dimensions

Serve appropriately sized responsive images per viewport to avoid oversized mobile downloads — HIGH

Responsive Images

Serve the right image size for every viewport and device pixel ratio.

Srcset with Sizes

<img
  src="/photo-800.jpg"
  srcset="
    /photo-400.jpg   400w,
    /photo-800.jpg   800w,
    /photo-1200.jpg 1200w,
    /photo-1600.jpg 1600w
  "
  sizes="(max-width: 640px) 100vw,
         (max-width: 1024px) 50vw,
         33vw"
  alt="Product photo"
  loading="lazy"
  width="800"
  height="600"
/>

Art Direction with Picture

<!-- Different crops for different viewports -->
<picture>
  <source
    media="(max-width: 640px)"
    srcset="/hero-mobile.avif 640w, /hero-mobile-2x.avif 1280w"
    sizes="100vw"
    type="image/avif"
  />
  <source
    media="(min-width: 641px)"
    srcset="/hero-desktop.avif 1200w, /hero-desktop-2x.avif 2400w"
    sizes="66vw"
    type="image/avif"
  />
  <img src="/hero-desktop.jpg" alt="Hero" width="1200" height="630" />
</picture>

CDN Image Transformation URLs

// Cloudflare Image Resizing
function cfImage(src: string, width: number, quality = 80) {
  return `https://cdn.example.com/cdn-cgi/image/w=${width},q=${quality},f=auto/${src}`;
}

// Imgix
function imgixUrl(src: string, width: number, quality = 80) {
  return `${src}?w=${width}&q=${quality}&auto=format,compress`;
}

// Usage in React
<img
  src={cfImage('/photos/product.jpg', 800)}
  srcset={`
    ${cfImage('/photos/product.jpg', 400)} 400w,
    ${cfImage('/photos/product.jpg', 800)} 800w,
    ${cfImage('/photos/product.jpg', 1200)} 1200w
  `}
  sizes="(max-width: 768px) 100vw, 50vw"
  alt="Product"
  loading="lazy"
/>

Incorrect — srcset without sizes lets browser guess:

<img
  srcset="/photo-400.jpg 400w, /photo-800.jpg 800w"
  src="/photo-800.jpg"
  alt="Photo"
/>

Correct — sizes guides browser to optimal choice:

<img
  srcset="/photo-400.jpg 400w, /photo-800.jpg 800w"
  sizes="(max-width: 640px) 100vw, 50vw"
  src="/photo-800.jpg"
  alt="Photo"
  width="800"
  height="600"
/>

Key rules:

  • Always provide sizes alongside srcset for width descriptors
  • Use 3-4 srcset breakpoints (400, 800, 1200, 1600) for most images
  • Use <picture> with media for art direction (different crops)
  • Delegate resizing to a CDN rather than shipping multiple static files
  • Set explicit width and height to prevent CLS

Quantize models to reduce size 2-4x with minimal quality loss for fewer GPUs — MEDIUM

Model Quantization

Reduce model memory footprint and increase throughput with quantization.

Method Decision Matrix

| Method | Precision | Speed | Quality | Best For |
|---|---|---|---|---|
| FP16 | 16-bit | Baseline | Best | When VRAM allows |
| FP8 | 8-bit | 1.5x | Near-FP16 | Hopper/Ada GPUs (H100, L40S) |
| AWQ | 4-bit | 1.8x | Good | Production serving, speed priority |
| GPTQ | 4-bit | 1.6x | Better | Quality-sensitive tasks |
| GGUF | 2-8 bit | Varies | Varies | CPU/hybrid inference (llama.cpp) |

vLLM with AWQ

# Serve a pre-quantized AWQ model
docker run --gpus '"device=0"' \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model TheBloke/Llama-3.1-8B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 8192

vLLM with FP8 (Hopper GPUs)

# FP8 on H100 — native hardware support, no pre-quantized model needed
docker run --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 4

VRAM Requirements (Approximate)

Model       FP16    FP8     AWQ/GPTQ (4-bit)
7-8B        16 GB   9 GB    5 GB
13B         26 GB   14 GB   8 GB
70B         140 GB  75 GB   40 GB

Formula: VRAM ≈ params × bytes_per_param × 1.2 (KV cache overhead)
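The formula above translates into a quick estimator. Treat the output as a rough upper bound: it will not exactly match the table, and real usage varies with context length and batch size.

```python
def vram_estimate_gb(params: float, bytes_per_param: float) -> float:
    """VRAM ≈ params x bytes_per_param x 1.2 (the 1.2 covers KV-cache overhead)."""
    return params * bytes_per_param * 1.2 / 1e9

# 8B model: FP16 uses 2 bytes/param (~19 GB), 4-bit AWQ/GPTQ ~0.5 bytes/param (~5 GB)
```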

Quality Validation

# Always benchmark quantized vs full precision on YOUR task
def eval_quantized(client, test_cases):
    results = []
    for case in test_cases:
        response = client.chat.completions.create(
            model="quantized-model",
            messages=case["messages"],
            max_tokens=case["max_tokens"],
        )
        results.append(score(response, case["expected"]))
    return sum(results) / len(results)

# Accept quantization if quality >= 95% of FP16 baseline
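The 95% acceptance bar can be made explicit with a tiny helper (hypothetical name, not a vLLM or eval-framework API):

```python
def accept_quantized(quant_score: float, fp16_score: float, floor: float = 0.95) -> bool:
    """Accept the quantized model only if it retains >= floor of FP16 quality."""
    return quant_score >= floor * fp16_score
```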

Incorrect — FP16 on smaller GPUs wastes VRAM:

docker run --gpus all \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 8
# Requires 140 GB VRAM

Correct — FP8 quantization reduces VRAM by ~45%:

docker run --gpus all \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 4
# Requires 75 GB VRAM

Key rules:

  • Use FP8 on Hopper/Ada GPUs (best speed/quality tradeoff)
  • Use AWQ for maximum throughput on older GPUs
  • Use GPTQ when quality matters more than speed
  • Always validate quantized model quality on your specific task
  • Pre-quantized models (e.g., TheBloke) save quantization time

Apply speculative decoding to generate draft tokens in parallel and reduce inference latency — MEDIUM

Speculative Decoding

Use speculative decoding to reduce per-token latency without sacrificing output quality.

How It Works

Traditional:     token1 → token2 → token3 → token4  (4 forward passes)

Speculative:     draft: token1, token2, token3       (fast, cheap)
                 verify: accept/reject all 3          (1 forward pass)
                 Result: 3 tokens in ~1.3 forward passes
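Under a simplified model where each draft token is accepted independently with probability p, the expected tokens produced per target forward pass is a geometric sum: one guaranteed token from the verify pass plus p^i for each accepted draft prefix of length i. This is an illustrative approximation, not vLLM's exact accounting:

```python
def expected_tokens_per_pass(k: int, p: float) -> float:
    """Expected tokens per target forward pass with draft length k and
    i.i.d. per-token acceptance probability p: sum of p^i for i = 0..k."""
    return sum(p ** i for i in range(k + 1))

# With k=5 drafts at 80% acceptance, each verify pass yields ~3.7 tokens,
# i.e. roughly 3.7x fewer target forward passes than sequential decoding.
```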

N-Gram Speculation (No Draft Model)

# vLLM n-gram speculation — uses prompt tokens as draft source
# Best for repetitive/structured output (JSON, code, templates)
docker run --gpus '"device=0"' \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --speculative-model [ngram] \
  --num-speculative-tokens 5 \
  --ngram-prompt-lookup-max 4

Draft Model Speculation

# Use a smaller model as the draft (must share tokenizer)
docker run --gpus '"device=0"' \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.1-8B-Instruct \
  --num-speculative-tokens 5 \
  --tensor-parallel-size 4

Acceptance Rate Tuning

--num-speculative-tokens:
  3  → Conservative, high acceptance rate (~85%)
  5  → Balanced (default recommendation)
  8  → Aggressive, lower acceptance rate (~60%)

Monitor via vLLM metrics:
  vllm:spec_decode_acceptance_rate  → target > 70%

If acceptance < 60%:
  1. Reduce --num-speculative-tokens
  2. Try n-gram for structured output
  3. Verify draft model matches target model's style

When to Use Each Approach

N-gram speculation:
  + Structured output (JSON, SQL, code)
  + Repetitive patterns
  + No extra GPU memory needed
  - Creative / diverse text

Draft model speculation:
  + General text generation
  + Large target models (70B+)
  + Higher acceptance rates on diverse tasks
  - Requires extra GPU memory for draft model

Incorrect — No speculation means sequential token generation:

docker run --gpus '"device=0"' \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
# 4 tokens = 4 forward passes

Correct — N-gram speculation reduces passes by 30-60%:

docker run --gpus '"device=0"' \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --speculative-model [ngram] \
  --num-speculative-tokens 5
# 4 tokens ≈ 1.3 forward passes

Key rules:

  • Use n-gram speculation for structured/repetitive output (free, no extra VRAM)
  • Use draft model speculation for general text with large target models
  • Start with --num-speculative-tokens 5 and tune based on acceptance rate
  • Monitor acceptance rate; reduce tokens if below 60%
  • Output quality is identical to non-speculative decoding (mathematically guaranteed)

Deploy vLLM with PagedAttention and continuous batching for 2-4x higher inference throughput — MEDIUM

vLLM Deployment

Deploy LLMs with vLLM for high-throughput, low-latency inference.

Docker Deployment

# Single GPU
docker run --gpus '"device=0"' \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

# Multi-GPU with tensor parallelism
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92

Python Client

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention briefly."}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)

Key Architecture Concepts

PagedAttention:
  - KV cache stored in non-contiguous pages (like OS virtual memory)
  - Eliminates memory waste from pre-allocated contiguous blocks
  - Enables 2-4x more concurrent sequences

Continuous Batching:
  - New requests join running batch immediately
  - No waiting for longest sequence to finish
  - Throughput: 10-30 requests/second on single A100 (8B model)

Tensor Parallelism:
  - Splits model across GPUs (--tensor-parallel-size N)
  - Rule: N = number of GPUs; the model's attention-head count must be divisible by N
  - Use for models > single GPU VRAM
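
The KV-cache pressure that PagedAttention relieves can be estimated with back-of-envelope arithmetic. A sketch assuming Llama 3.1 8B's published GQA shape (32 layers, 8 KV heads, head dim 128, fp16); treat the figures as illustrative:

```typescript
// Per-token KV cache: K and V tensors, per layer, per KV head, per head dim.
const layers = 32, kvHeads = 8, headDim = 128, bytesPerParam = 2; // fp16
const kvBytesPerToken = 2 * layers * kvHeads * headDim * bytesPerParam;
console.log(kvBytesPerToken / 1024); // 128 (KiB per token)

// An 8192-token sequence therefore needs ~1 GiB of KV cache.
const kvBytesPerSeq = kvBytesPerToken * 8192;
console.log((kvBytesPerSeq / 1024 ** 3).toFixed(2)); // 1.00 (GiB)

// With contiguous pre-allocation every sequence reserves that full slab
// up front; PagedAttention allocates pages on demand, so short sequences
// only consume what they actually generate.
```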

Incorrect — Defaults reserve KV cache for the model's full context length:

docker run --gpus '"device=0"' \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
# Default --max-model-len is the model's full context window,
# so each sequence can claim a very large slice of KV cache

Correct — Capping context length and raising utilization fit more concurrent requests:

docker run --gpus '"device=0"' \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
# 2-4x more concurrent requests

Key rules:

  • Set --gpu-memory-utilization 0.90 (maximizes KV cache while leaving headroom for CUDA overhead)
  • Use --tensor-parallel-size equal to the number of GPUs
  • Use the OpenAI-compatible API for drop-in compatibility
  • Monitor vllm:num_requests_running Prometheus metric for load
  • Set --max-model-len to the actual max you need (lower = more concurrent requests)
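
For the monitoring rule, here is a minimal sketch of reading vllm:num_requests_running from vLLM's Prometheus /metrics endpoint; the parsing helper is an illustration, not part of any vLLM client library:

```typescript
// Extract a gauge value from Prometheus exposition text.
// Data lines look like: vllm:num_requests_running{model_name="..."} 7.0
function readGauge(metricsText: string, name: string): number | null {
  for (const line of metricsText.split("\n")) {
    if (!line.startsWith(name)) continue; // skips "# HELP"/"# TYPE" lines
    const value = Number(line.trim().split(/\s+/).pop());
    if (!Number.isNaN(value)) return value;
  }
  return null;
}

// Poll a running server (assumes the deployment above on port 8000):
async function currentLoad(): Promise<number | null> {
  const res = await fetch("http://localhost:8000/metrics");
  return readGauge(await res.text(), "vllm:num_requests_running");
}
```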

Defer component loading with React.lazy to reduce initial bundle size and improve TTI — HIGH

Lazy Component Loading

Use React.lazy with Suspense to load components on demand and reduce initial bundle size.

Basic Pattern

import { lazy, Suspense } from 'react';

const Dashboard = lazy(() => import('./Dashboard'));
const Settings = lazy(() => import('./Settings'));

function App() {
  return (
    <Suspense fallback={<DashboardSkeleton />}>
      <Dashboard />
    </Suspense>
  );
}

Error Boundary for Failed Imports

import { Component, lazy, Suspense } from 'react';

class LazyErrorBoundary extends Component<
  { fallback: React.ReactNode; children: React.ReactNode },
  { hasError: boolean }
> {
  state = { hasError: false };

  static getDerivedStateFromError() {
    return { hasError: true };
  }

  retry = () => this.setState({ hasError: false });

  render() {
    if (this.state.hasError) {
      return (
        <>
          {this.props.fallback}
          <button onClick={this.retry}>Retry</button>
        </>
      );
    }
    return this.props.children;
  }
}

// Usage
<LazyErrorBoundary fallback={<p>Failed to load</p>}>
  <Suspense fallback={<Skeleton />}>
    <LazyComponent />
  </Suspense>
</LazyErrorBoundary>

Skeleton Fallback

function DashboardSkeleton() {
  return (
    <div className="animate-pulse space-y-4">
      <div className="h-8 bg-gray-200 rounded w-1/3" />
      <div className="h-64 bg-gray-200 rounded" />
    </div>
  );
}

Incorrect — No Suspense boundary causes a render error:

const Dashboard = lazy(() => import('./Dashboard'));

function App() {
  return <Dashboard />; // Error: no Suspense boundary
}

Correct — Suspense with skeleton fallback:

const Dashboard = lazy(() => import('./Dashboard'));

function App() {
  return (
    <Suspense fallback={<DashboardSkeleton />}>
      <Dashboard />
    </Suspense>
  );
}

Key rules:

  • Wrap every lazy() component in a Suspense boundary
  • Add an error boundary around Suspense for network failures
  • Use skeleton fallbacks that match the loaded component's layout
  • Never lazy-load above-the-fold or LCP-critical components
  • Group related lazy components under a single Suspense boundary

Prefetch resources before user needs them to make navigation feel instant — HIGH

Prefetch Strategies

Proactively load resources before the user navigates to reduce perceived latency.

Module Preload Hints

<!-- Preload critical JS modules -->
<link rel="modulepreload" href="/assets/dashboard-abc123.js" />
<link rel="modulepreload" href="/assets/shared-chunk-def456.js" />

<!-- Prefetch likely next pages (low priority) -->
<link rel="prefetch" href="/assets/settings-ghi789.js" />

Prefetch on Hover

function NavLink({ to, children }: { to: string; children: React.ReactNode }) {
  const prefetchRef = useRef(false);

  const handlePointerEnter = () => {
    if (prefetchRef.current) return;
    prefetchRef.current = true;
    import(`./pages/${to}.tsx`); // Triggers prefetch
  };

  return (
    <a href={`/${to}`} onPointerEnter={handlePointerEnter}>
      {children}
    </a>
  );
}

Prefetch on Viewport Intersection

function usePrefetchOnVisible<T extends HTMLElement = HTMLDivElement>(importFn: () => Promise<unknown>) {
  const ref = useRef<T>(null);
  const loaded = useRef(false);

  useEffect(() => {
    const el = ref.current;
    if (!el) return;

    const observer = new IntersectionObserver(([entry]) => {
      if (entry.isIntersecting && !loaded.current) {
        loaded.current = true;
        importFn();
        observer.disconnect();
      }
    }, { rootMargin: '200px' });

    observer.observe(el);
    return () => observer.disconnect();
  }, [importFn]);

  return ref;
}

// Usage — HeavySection is the lazy component whose chunk the hook warms
const HeavySection = lazy(() => import('./HeavySection'));
const ref = usePrefetchOnVisible(() => import('./HeavySection'));
<div ref={ref}><Suspense fallback={null}><HeavySection /></Suspense></div>

Import on Interaction

// Load a heavy module only when the user clicks
async function handleExport() {
  const { exportToPDF } = await import('./exportUtils');
  await exportToPDF(data);
}

<button onClick={handleExport}>Export PDF</button>

Incorrect — No prefetching causes delayed navigation:

<a href="/dashboard">Dashboard</a>

Correct — Hover prefetch gives 200-400ms head start:

function NavLink({ to, children }) {
  const prefetchRef = useRef(false);

  const handlePointerEnter = () => {
    if (prefetchRef.current) return;
    prefetchRef.current = true;
    import(`./pages/${to}.tsx`);
  };

  return (
    <a href={`/${to}`} onPointerEnter={handlePointerEnter}>
      {children}
    </a>
  );
}

Key rules:

  • Use modulepreload for critical JS the current page needs
  • Use prefetch for resources the user will likely need next
  • Prefetch on hover for navigation links (200-400ms head start)
  • Prefetch on intersection for below-the-fold heavy sections
  • Import on interaction for rarely-used heavy features

Split code at route boundaries so users only download code for the visited page — HIGH

Route-Based Code Splitting

Split your bundle at route boundaries so each page loads only its own code.

React Router 7.x Lazy Routes

import { createBrowserRouter } from 'react-router';

const router = createBrowserRouter([
  {
    path: '/',
    lazy: () => import('./pages/Home'),
  },
  {
    path: '/dashboard',
    lazy: () => import('./pages/Dashboard'),
  },
  {
    path: '/settings',
    lazy: () => import('./pages/Settings'),
  },
]);

Named Exports for Lazy Routes

// pages/Dashboard.tsx — export Component and loader
export async function loader() {
  return fetchDashboardData();
}

export function Component() {
  const data = useLoaderData();
  return <DashboardView data={data} />;
}

Component.displayName = 'Dashboard';

Chunk Naming

// Webpack — webpackChunkName magic comment
const Dashboard = lazy(
  () => import(/* webpackChunkName: "dashboard" */ './pages/Dashboard')
);

// Vite/Rollup — use rollupOptions for manual chunks
// vite.config.ts
export default defineConfig({
  build: {
    rollupOptions: {
      output: {
        manualChunks: {
          vendor: ['react', 'react-dom'],
          charts: ['recharts', 'd3'],
        },
      },
    },
  },
});

Incorrect — Eager imports bundle all routes together:

import Home from './pages/Home';
import Dashboard from './pages/Dashboard';
import Settings from './pages/Settings';

const router = createBrowserRouter([
  { path: '/', element: <Home /> },
  { path: '/dashboard', element: <Dashboard /> },
  { path: '/settings', element: <Settings /> },
]);

Correct — Lazy routes split per-page bundles:

const router = createBrowserRouter([
  { path: '/', lazy: () => import('./pages/Home') },
  { path: '/dashboard', lazy: () => import('./pages/Dashboard') },
  { path: '/settings', lazy: () => import('./pages/Settings') },
]);

Key rules:

  • Split at route boundaries as the minimum splitting strategy
  • Use React Router lazy for automatic route-level splitting
  • Export Component and loader as named exports for lazy routes
  • Name chunks for readable build output and caching
  • Group vendor libraries into shared chunks to avoid duplication

Profile Python backends with py-spy to find CPU hotspots and memory leaks in production — MEDIUM

Python Backend Profiling

Profile Python services to find CPU bottlenecks and memory leaks.

py-spy for Production Sampling

# Attach to running process (no restart needed)
py-spy top --pid 12345

# Generate flamegraph SVG
py-spy record -o profile.svg --pid 12345 --duration 30

# Profile a script directly
py-spy record -o profile.svg -- python manage.py runserver

# Sample at higher rate for short-lived operations
py-spy record --rate 250 -o profile.svg -- python batch_job.py

cProfile for Development

import cProfile
import pstats

# Profile a function
with cProfile.Profile() as pr:
    result = expensive_function()

stats = pstats.Stats(pr)
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20 functions

# One-liner from command line
# python -m cProfile -s cumulative app.py

memory_profiler for Memory Leaks

from memory_profiler import profile

@profile
def process_data():
    data = load_large_dataset()     # +500 MiB
    filtered = filter_items(data)   # +200 MiB
    del data                        # -500 MiB
    return summarize(filtered)

# Command line: python -m memory_profiler script.py

FastAPI Middleware Profiling

import time
from fastapi import Request

@app.middleware("http")
async def profile_requests(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    duration = time.perf_counter() - start
    if duration > 0.5:  # Log slow requests
        print(f"SLOW: {request.method} {request.url.path} took {duration:.2f}s")
    response.headers["X-Response-Time"] = f"{duration:.3f}"
    return response

Incorrect — cProfile in production requires code changes:

# Must instrument code manually
with cProfile.Profile() as pr:
    result = expensive_function()

Correct — py-spy attaches to running process with zero overhead:

# No code changes, no restart needed
py-spy record -o profile.svg --pid 12345 --duration 30

Key rules:

  • Use py-spy in production (zero overhead when not profiling, no code changes)
  • Use cProfile in development for detailed call graphs
  • Use memory_profiler to track per-line memory allocation
  • Profile under realistic load, not just unit test conditions
  • Focus on the top 3-5 hotspots by cumulative time

Analyze bundles to reveal bloated dependencies and missed tree-shaking that inflate load times — MEDIUM

Bundle Analysis

Analyze and optimize JavaScript bundle size with visualization tools and CI budgets.

Webpack Bundle Analyzer

// webpack.config.js
const { BundleAnalyzerPlugin } = require('webpack-bundle-analyzer');

module.exports = {
  plugins: [
    new BundleAnalyzerPlugin({
      analyzerMode: 'static',      // Generates HTML report
      openAnalyzer: false,
      reportFilename: 'bundle-report.html',
    }),
  ],
};

Vite / Rollup Visualizer

// vite.config.ts
import { visualizer } from 'rollup-plugin-visualizer';

export default defineConfig({
  plugins: [
    visualizer({
      filename: 'bundle-report.html',
      gzipSize: true,
      brotliSize: true,
    }),
  ],
});

Performance Budgets in CI

// bundlesize.config.json
{
  "files": [
    { "path": "dist/assets/index-*.js", "maxSize": "150 kB", "compression": "gzip" },
    { "path": "dist/assets/vendor-*.js", "maxSize": "80 kB", "compression": "gzip" },
    { "path": "dist/assets/*.css", "maxSize": "30 kB", "compression": "gzip" }
  ]
}

# .github/workflows/bundle-check.yml
- name: Check bundle size
  run: npx bundlesize
  env:
    BUNDLESIZE_GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Import Cost Awareness

// BAD: Imports entire library (70 kB)
import _ from 'lodash';
const sorted = _.sortBy(items, 'name');

// GOOD: Import single function (4 kB)
import sortBy from 'lodash/sortBy';
const sorted = sortBy(items, 'name');

// BEST: Use native (0 kB)
const sorted = items.toSorted((a, b) => a.name.localeCompare(b.name));

Incorrect — Importing entire lodash adds 70 kB:

import _ from 'lodash';
const sorted = _.sortBy(items, 'name');

Correct — Import single function or use native API:

// Option 1: Import only what you need (4 kB)
import sortBy from 'lodash/sortBy';
const sorted = sortBy(items, 'name');

// Option 2: Use native API (0 kB)
const sorted = items.toSorted((a, b) => a.name.localeCompare(b.name));

Key rules:

  • Run bundle analysis on every release to catch regressions
  • Set CI performance budgets (fail build if exceeded)
  • Import only what you use from large libraries
  • Check gzip/brotli sizes, not raw sizes
  • Replace large dependencies with native APIs when possible

Profile React components with DevTools to identify unnecessary re-renders and their causes — MEDIUM

React DevTools Profiler

Use the React DevTools Profiler to identify and fix unnecessary re-renders.

Flamegraph Workflow

1. Open React DevTools → Profiler tab
2. Click "Record" → Interact with the UI → Click "Stop"
3. Read the flamegraph:
   - Yellow/red bars = slow renders (> 16ms)
   - Gray bars = did not render
   - Click a bar → see "Why did this render?"
4. Focus on components that render often AND take long

Programmatic Profiler

import { Profiler } from 'react';

function onRenderCallback(
  id: string,
  phase: 'mount' | 'update',
  actualDuration: number,
) {
  if (actualDuration > 16) {
    console.warn(`Slow render: ${id} (${phase}) took ${actualDuration.toFixed(1)}ms`);
  }
}

<Profiler id="Dashboard" onRender={onRenderCallback}>
  <Dashboard />
</Profiler>

Why Did You Render Setup

// wdyr.ts — import BEFORE React in development
import React from 'react';

if (process.env.NODE_ENV === 'development') {
  const { default: whyDidYouRender } = await import(
    '@welldone-software/why-did-you-render'
  );
  whyDidYouRender(React, {
    trackAllPureComponents: true,
    logOnDifferentValues: true,
  });
}

// Mark specific components for tracking
MyComponent.whyDidYouRender = true;

Interpreting Render Reasons

"Props changed"       → Check which prop, was it a new object/array?
"State changed"       → Expected, verify state is colocated
"Parent rendered"     → Parent re-rendered; child isn't memoized
"Context changed"     → Split context or use selectors
"Hooks changed"       → useMemo/useCallback dependency changed
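
The "Props changed" row usually comes down to referential inequality: an object or array literal recreated on every render fails React.memo's shallow comparison even when its contents are identical. A simplified sketch of that comparison:

```typescript
// Simplified version of the shallow comparison React.memo uses by default.
function shallowEqual(a: Record<string, unknown>, b: Record<string, unknown>): boolean {
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  if (keysA.length !== keysB.length) return false;
  return keysA.every((key) => Object.is(a[key], b[key]));
}

// Same contents, but the inline literal is a new reference every render:
shallowEqual({ style: { color: "red" } }, { style: { color: "red" } }); // false
// A hoisted or memoized reference passes the check:
const style = { color: "red" };
shallowEqual({ style }, { style }); // true
```

When the profiler blames a prop, check whether it is rebuilt inline in the parent's JSX.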

Incorrect — Blind memoization without profiling:

const MemoizedComponent = memo(Component);
const memoizedValue = useMemo(() => value, []);
const callback = useCallback(() => {}, []);
// Added optimization without measurement

Correct — Profile first, then optimize actual bottlenecks:

// 1. Open React DevTools Profiler
// 2. Record interaction
// 3. Identify slow renders (yellow/red bars > 16ms)
// 4. Check "Why did this render?"
// 5. Apply targeted fix only where needed

Key rules:

  • Profile first before adding any memoization
  • Focus on components that are both frequent AND slow (> 16ms)
  • Use "Why did this render?" to find the root cause
  • Use Why Did You Render in development for automatic detection
  • Ignore gray (not rendered) components in the flamegraph

Apply TanStack Query optimistic updates for instant UI feedback with automatic rollback — HIGH

TanStack Query Optimistic Updates

Show immediate UI feedback before server confirmation with proper rollback on error.

Incorrect — mutation without optimistic update:

// WRONG: User waits for server roundtrip
const mutation = useMutation({
  mutationFn: updateTodo,
  onSuccess: () => {
    queryClient.invalidateQueries({ queryKey: ['todos'] }); // Refetches after delay
  },
});
// UI feels sluggish — user sees spinner for 200-500ms

Correct — optimistic update with rollback:

import { useMutation, useQueryClient } from '@tanstack/react-query';

function useUpdateTodo() {
  const queryClient = useQueryClient();

  return useMutation({
    mutationFn: updateTodo,
    onMutate: async (newTodo) => {
      // 1. Cancel outgoing refetches (prevent race condition)
      await queryClient.cancelQueries({ queryKey: ['todos', newTodo.id] });

      // 2. Snapshot previous value for rollback
      const previousTodo = queryClient.getQueryData(['todos', newTodo.id]);

      // 3. Optimistically update cache
      queryClient.setQueryData(['todos', newTodo.id], newTodo);

      // 4. Return context for rollback
      return { previousTodo };
    },
    onError: (_err, newTodo, context) => {
      // Rollback to previous value on error
      queryClient.setQueryData(['todos', newTodo.id], context?.previousTodo);
    },
    onSettled: (_data, _error, variables) => {
      // Always reconcile with server after mutation
      queryClient.invalidateQueries({ queryKey: ['todos', variables.id] });
    },
  });
}

// Optimistic list update (add to list)
function useAddTodo() {
  const queryClient = useQueryClient();

  return useMutation({
    mutationFn: createTodo,
    onMutate: async (newTodo) => {
      await queryClient.cancelQueries({ queryKey: ['todos'] });
      const previousTodos = queryClient.getQueryData<Todo[]>(['todos']);

      // Immutable update (NEVER mutate cache directly)
      queryClient.setQueryData<Todo[]>(['todos'], (old) =>
        old ? [...old, { ...newTodo, id: 'temp-id' }] : [{ ...newTodo, id: 'temp-id' }]
      );

      return { previousTodos };
    },
    onError: (_err, _newTodo, context) => {
      queryClient.setQueryData(['todos'], context?.previousTodos);
    },
    onSettled: () => {
      queryClient.invalidateQueries({ queryKey: ['todos'] });
    },
  });
}

Track pending mutations:

import { useMutationState } from '@tanstack/react-query';

function PendingTodos() {
  const pendingMutations = useMutationState({
    filters: { mutationKey: ['addTodo'], status: 'pending' },
    select: (mutation) => mutation.state.variables as Todo,
  });

  return (
    <>
      {pendingMutations.map((todo) => (
        <TodoItem key={todo.id} todo={todo} isPending />
      ))}
    </>
  );
}

Key rules:

  • Always cancel outgoing queries in onMutate to prevent race conditions
  • Always return context from onMutate for rollback capability
  • Use immutable updates: [...old, newItem] not old.push(newItem)
  • Always invalidateQueries in onSettled to reconcile with server
  • Use useMutationState to show pending items in the UI
  • Selective invalidation: queryKey: ['todos', id] not queryClient.invalidateQueries() (invalidates everything)

Prefetch TanStack queries on hover or in route loaders for instant page transitions — HIGH

TanStack Query Prefetching

Prefetch data before navigation for instant page transitions using TanStack Query v5.

Incorrect — fetching data only when component mounts:

// WRONG: User clicks link, waits for data to load
function UserProfile({ userId }: { userId: string }) {
  const { data, isPending } = useQuery({
    queryKey: ['user', userId],
    queryFn: () => fetchUser(userId),
  });

  if (isPending) return <Skeleton />; // User sees skeleton every time
  return <div>{data.name}</div>;
}

Correct — prefetch on hover and in route loaders:

// 1. Define reusable query options (v5 pattern)
const userQueryOptions = (id: string) => queryOptions({
  queryKey: ['user', id] as const,
  queryFn: () => fetchUser(id),
  staleTime: 5 * 60 * 1000, // Fresh for 5 min
});

// 2. Prefetch on hover
function UserLink({ userId }: { userId: string }) {
  const queryClient = useQueryClient();

  const prefetchUser = () => {
    queryClient.prefetchQuery(userQueryOptions(userId));
  };

  return (
    <Link
      to={`/users/${userId}`}
      onMouseEnter={prefetchUser}
      onFocus={prefetchUser}
    >
      View User
    </Link>
  );
}

// 3. Prefetch in route loader (React Router 7.x)
export const loader = (queryClient: QueryClient) =>
  async ({ params }: { params: { id: string } }) => {
    await queryClient.ensureQueryData(userQueryOptions(params.id));
    return null;
  };

// 4. Use with Suspense for instant render
function UserProfile({ userId }: { userId: string }) {
  // Data already loaded by prefetch — no loading state!
  const { data } = useSuspenseQuery(userQueryOptions(userId));
  return <div>{data.name}</div>;
}

QueryClient configuration:

const queryClient = new QueryClient({
  defaultOptions: {
    queries: {
      staleTime: 1000 * 60,       // 1 min fresh
      gcTime: 1000 * 60 * 5,      // 5 min in cache (formerly cacheTime)
      refetchOnWindowFocus: true,  // Refetch on tab focus
      retry: 3,
      retryDelay: (attemptIndex) => Math.min(1000 * 2 ** attemptIndex, 30000),
    },
  },
});

Key rules:

  • Use queryOptions() helper for reusable query definitions across prefetch/useQuery/loader
  • Prefetch on onMouseEnter AND onFocus for keyboard users
  • Use ensureQueryData in loaders (waits for data), prefetchQuery for fire-and-forget
  • useSuspenseQuery for components where data is guaranteed by loader
  • gcTime (v5) replaces cacheTime (v4) — controls how long unused data stays in memory
  • isPending (v5) replaces isLoading for initial load state

Let React Compiler auto-memoize components, values, callbacks, and JSX automatically — HIGH

React Compiler

React 19's compiler is the primary approach to render optimization in 2026.

Decision Tree

Is React Compiler enabled?
├─ YES → Let compiler handle memoization automatically
│        Only use useMemo/useCallback as escape hatches
│        DevTools shows "Memo ✨" badge

└─ NO → Profile first, then optimize
         1. React DevTools Profiler
         2. Identify actual bottlenecks
         3. Apply targeted optimizations

What the Compiler Memoizes

  • Component re-renders
  • Intermediate values (like useMemo)
  • Callback references (like useCallback)
  • JSX elements

Enabling the Compiler

// next.config.js (Next.js 16+)
const nextConfig = {
  reactCompiler: true,
}

// Expo SDK 54+ enables by default

Verification

Open React DevTools and look for the "Memo ✨" badge on components. If present, the compiler is successfully memoizing that component.

Incorrect — Manual memoization when compiler is enabled:

// next.config.js has reactCompiler: true
const value = useMemo(() => compute(data), [data]);
const callback = useCallback(() => handle(), []);
// Compiler already handles this automatically

Correct — Let compiler auto-memoize:

// Compiler handles memoization automatically
function Component({ data }) {
  const value = compute(data); // Auto-memoized
  const handle = () => {}; // Auto-memoized
  return <div onClick={handle}>{value}</div>;
}
// Check DevTools for "Memo ✨" badge

Key Rules

  1. Enable React Compiler as the first step
  2. Let the compiler handle memoization automatically
  3. Verify with DevTools "Memo ✨" badge
  4. Only use manual memoization as escape hatches
  5. Profile before adding any manual optimization

Use manual useMemo and useCallback escape hatches when React Compiler cannot optimize — HIGH

Manual Memoization Escape Hatches

Use useMemo/useCallback as escape hatches when React Compiler is insufficient.

When Manual Memoization Is Needed

// 1. Effect dependencies that shouldn't trigger re-runs
const stableConfig = useMemo(() => ({
  apiUrl: process.env.API_URL
}), [])

useEffect(() => {
  initializeSDK(stableConfig)
}, [stableConfig])

// 2. Third-party libraries without compiler support
const memoizedValue = useMemo(() =>
  expensiveThirdPartyComputation(data), [data])

// 3. Precise control over memoization boundaries
const handleClick = useCallback(() => {
  // Critical callback that must be stable
}, [dependency])

State Colocation

Move state as close to where it's used as possible:

// BAD: State too high - causes unnecessary re-renders
function App() {
  const [filter, setFilter] = useState('')
  return (
    <>
      <Header />  {/* Re-renders on filter change! */}
      <FilterInput value={filter} onChange={setFilter} />
      <List filter={filter} />
    </>
  )
}

// GOOD: State colocated - minimal re-renders
function App() {
  return (
    <>
      <Header />
      <FilterableList />  {/* State inside */}
    </>
  )
}

Profiling Workflow

  1. React DevTools Profiler: Record, interact, analyze
  2. Identify: Components with high render counts or duration
  3. Verify: Is the re-render actually causing perf issues?
  4. Fix: Apply targeted optimization
  5. Measure: Confirm improvement

Incorrect — State too high causes unnecessary re-renders:

function App() {
  const [filter, setFilter] = useState('');
  return (
    <>
      <Header />  {/* Re-renders on filter change! */}
      <FilterInput value={filter} onChange={setFilter} />
      <List filter={filter} />
    </>
  );
}

Correct — State colocated minimizes re-renders:

function App() {
  return (
    <>
      <Header />
      <FilterableList />  {/* State inside, Header unaffected */}
    </>
  );
}

Key Rules

  1. Profile first — never optimize without measurement
  2. Colocate state as close to usage as possible
  3. Use useMemo for effect dependencies that must be stable
  4. Use useCallback for callbacks passed to memoized children
  5. Split context into state and dispatch providers

Virtualize long lists to render only visible items for smooth scrolling performance — HIGH

List Virtualization

Use TanStack Virtual for efficient rendering of large lists.

Virtualization Thresholds

| Item Count | Recommendation                 |
|------------|--------------------------------|
| < 100      | Regular rendering usually fine |
| 100-500    | Consider virtualization        |
| 500+       | Virtualization required        |

Basic Setup

import { useRef } from 'react'
import { useVirtualizer } from '@tanstack/react-virtual'

function VirtualList({ items }) {
  const parentRef = useRef(null)

  const virtualizer = useVirtualizer({
    count: items.length,
    getScrollElement: () => parentRef.current,
    estimateSize: () => 50,
    overscan: 5,
  })

  return (
    <div ref={parentRef} style={{ height: '400px', overflow: 'auto' }}>
      <div style={{ height: `${virtualizer.getTotalSize()}px`, position: 'relative' }}>
        {virtualizer.getVirtualItems().map((virtualRow) => (
          <div
            key={virtualRow.key}
            style={{
              position: 'absolute',
              top: 0,
              left: 0,
              width: '100%',
              height: `${virtualRow.size}px`,
              transform: `translateY(${virtualRow.start}px)`,
            }}
          >
            {items[virtualRow.index].name}
          </div>
        ))}
      </div>
    </div>
  )
}

Dynamic Height

const virtualizer = useVirtualizer({
  count: items.length,
  getScrollElement: () => parentRef.current,
  estimateSize: () => 50,
  overscan: 5,
  measureElement: (element) => element.getBoundingClientRect().height,
})

Incorrect — Rendering 1000 items causes scroll jank:

function List({ items }) {
  return (
    <div style={{ height: '400px', overflow: 'auto' }}>
      {items.map(item => (
        <div key={item.id}>{item.name}</div>
      ))}
    </div>
  );
}

Correct — Virtualization renders only visible items:

import { useRef } from 'react';
import { useVirtualizer } from '@tanstack/react-virtual';

function VirtualList({ items }) {
  const parentRef = useRef(null);
  const virtualizer = useVirtualizer({
    count: items.length,
    getScrollElement: () => parentRef.current,
    estimateSize: () => 50,
    overscan: 5,
  });

  return (
    <div ref={parentRef} style={{ height: '400px', overflow: 'auto' }}>
      <div style={{ height: `${virtualizer.getTotalSize()}px`, position: 'relative' }}>
        {virtualizer.getVirtualItems().map(virtualRow => (
          <div
            key={virtualRow.key}
            style={{
              position: 'absolute',
              transform: `translateY(${virtualRow.start}px)`,
            }}
          >
            {items[virtualRow.index].name}
          </div>
        ))}
      </div>
    </div>
  );
}

Key Rules

  1. Virtualize lists with 100+ items
  2. Set overscan: 5 for smooth scrolling
  3. Use estimateSize close to actual average
  4. Use measureElement for variable height items
  5. Position items with transform: translateY() (avoids layout recalculation)

Caching Strategies

Multi-level caching patterns for performance optimization.

Cache Hierarchy

L1: In-Memory (LRU, memoization) - fastest, per-process
L2: Distributed (Redis/Memcached) - shared across instances
L3: CDN (edge, static assets) - global, closest to user
L4: Database (materialized views) - fallback, queryable

Cache-Aside Pattern

The most common caching pattern: the application checks the cache first and fills it on a miss (in read-through caching, by contrast, the cache layer itself loads the data):

async function getAnalysis(id: string): Promise<Analysis> {
  const cacheKey = `analysis:${id}`;

  // Try cache first (L2)
  const cached = await redis.get(cacheKey);
  if (cached) {
    return JSON.parse(cached);
  }

  // Cache miss - fetch from database (L4)
  const analysis = await db.query('SELECT * FROM analyses WHERE id = $1', [id]);

  // Store in cache for future requests
  await redis.setex(cacheKey, 3600, JSON.stringify(analysis));  // 1 hour TTL

  return analysis;
}

Write-Through Pattern

Update cache when writing to database:

async function updateAnalysis(id: string, updates: Partial<Analysis>) {
  // Update database
  const updated = await db.query(
    'UPDATE analyses SET ... WHERE id = $1 RETURNING *',
    [id]
  );

  // Update cache immediately
  const cacheKey = `analysis:${id}`;
  await redis.setex(cacheKey, 3600, JSON.stringify(updated));

  return updated;
}

Cache Invalidation Strategies

1. Time-Based (TTL)

// Short TTL for frequently changing data
await redis.setex('trending:articles', 300, data);  // 5 min

// Long TTL for static data
await redis.setex('user:profile:123', 86400, data);  // 24 hours
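
Keys written in the same burst with a fixed TTL all expire in the same instant, producing a refill spike. A common refinement, sketched here with an illustrative helper, is to add random jitter to each TTL:

```typescript
// TTL with ±10% random jitter so keys written together
// don't all expire together.
function jitteredTtl(baseSeconds: number, spread = 0.1): number {
  const factor = 1 + (Math.random() * 2 - 1) * spread; // [1 - spread, 1 + spread)
  return Math.round(baseSeconds * factor);
}

// e.g. await redis.setex('trending:articles', jitteredTtl(300), data);
```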

2. Event-Based

// Invalidate when data changes
async function deleteAnalysis(id: string) {
  await db.query('DELETE FROM analyses WHERE id = $1', [id]);

  // Invalidate all related cache keys
  await redis.del(`analysis:${id}`);
  await redis.del(`analysis:${id}:chunks`);
  await redis.del('analysis:list:recent');  // List cache
}

3. Tag-Based

// Tag related cache entries
await redis.set('analysis:123', data);
await redis.sadd('tag:user:456', 'analysis:123');

// Invalidate all entries with tag
async function invalidateUserData(userId: string) {
  const keys = await redis.smembers(`tag:user:${userId}`);
  if (keys.length > 0) {
    await redis.del(...keys);
    await redis.del(`tag:user:${userId}`);
  }
}

Redis Patterns

1. String Cache (Most Common)

// Get/set
await redis.set('key', 'value');
const value = await redis.get('key');

// With TTL
await redis.setex('key', 3600, 'value');

// Atomic increment
await redis.incr('page:views:123');

2. Hash Cache (Objects)

// Store object fields separately
await redis.hset('user:123', 'name', 'Alice');
await redis.hset('user:123', 'email', 'alice@example.com');

// Get specific field
const name = await redis.hget('user:123', 'name');

// Get all fields
const user = await redis.hgetall('user:123');

3. List Cache (Queues, Recent Items)

// Recent analyses (newest first, capped list)
await redis.lpush('analyses:recent', analysisId);
await redis.ltrim('analyses:recent', 0, 99);  // Keep only 100 most recent

// Get recent
const recent = await redis.lrange('analyses:recent', 0, 9);  // First 10

4. Set Cache (Unique Items, Tags)

// Track unique visitors
await redis.sadd('article:123:visitors', userId);

// Check membership
const hasVisited = await redis.sismember('article:123:visitors', userId);

// Count unique
const uniqueCount = await redis.scard('article:123:visitors');

In-Memory Cache (L1)

For per-process caching:

import { LRUCache } from 'lru-cache';

const cache = new LRUCache<string, Analysis>({
  max: 500,  // Maximum items
  ttl: 1000 * 60 * 5,  // 5 minutes
  updateAgeOnGet: true,  // Refresh on access
});

async function getAnalysis(id: string): Promise<Analysis> {
  // Check L1 first
  const cached = cache.get(id);
  if (cached !== undefined) {
    return cached;
  }

  // Fetch from L2 or database
  const analysis = await fetchAnalysis(id);
  cache.set(id, analysis);

  return analysis;
}

HTTP Caching (Browser/CDN)

// Express.js example
app.get('/api/analyses/:id', async (req, res) => {
  const analysis = await getAnalysis(req.params.id);

  // Cache in browser and CDN for 1 hour
  res.set('Cache-Control', 'public, max-age=3600');

  // ETag for conditional requests
  const etag = generateETag(analysis);
  res.set('ETag', etag);

  // Return 304 if unchanged
  if (req.headers['if-none-match'] === etag) {
    return res.status(304).end();
  }

  res.json(analysis);
});

Cache Warming

Preload cache before traffic arrives:

async function warmCache() {
  // Load hot data
  const recentAnalyses = await db.query(
    'SELECT * FROM analyses ORDER BY created_at DESC LIMIT 100'
  );

  // Populate cache
  for (const analysis of recentAnalyses) {
    await redis.setex(
      `analysis:${analysis.id}`,
      3600,
      JSON.stringify(analysis)
    );
  }

  console.log(`Warmed cache with ${recentAnalyses.length} analyses`);
}

// Run on server startup
await warmCache();

Cache Stampede Prevention

Prevent multiple requests from hitting database simultaneously:

const locks = new Map<string, Promise<Analysis>>();

async function getAnalysis(id: string): Promise<Analysis> {
  const cacheKey = `analysis:${id}`;

  // Check cache
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  // Check if fetch is already in progress
  if (locks.has(cacheKey)) {
    return locks.get(cacheKey)!;
  }

  // Start fetch; release the lock even if the fetch fails
  const fetchPromise = (async () => {
    try {
      const analysis = await db.query('SELECT * FROM analyses WHERE id = $1', [id]);
      await redis.setex(cacheKey, 3600, JSON.stringify(analysis));
      return analysis;
    } finally {
      locks.delete(cacheKey);  // Clean up
    }
  })();

  locks.set(cacheKey, fetchPromise);
  return fetchPromise;
}

Best Practices

  1. Cache frequently accessed, slow-to-compute data
  2. Use appropriate TTL - shorter for dynamic data
  3. Monitor cache hit rate - aim for > 80%
  4. Handle cache failures gracefully - always fall back to database
  5. Invalidate proactively when data changes
  6. Monitor memory usage - set max memory and eviction policy
  7. Use compression for large cached values
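Practice 4 (graceful cache failures) deserves a concrete sketch: a generic cache-aside wrapper where any cache error falls through to the source of truth. `cachedFetch`, `cacheGet`, and `cacheSet` are illustrative names standing in for `redis.get`/`redis.setex` and your database loader — not a library API.

```typescript
type Loader<T> = () => Promise<T>;

// Cache-aside with graceful degradation: the cache is strictly optional.
export async function cachedFetch<T>(
  key: string,
  loader: Loader<T>,
  cacheGet: (k: string) => Promise<string | null>,
  cacheSet: (k: string, v: string, ttlSeconds: number) => Promise<void>,
  ttlSeconds = 300,
): Promise<T> {
  // 1. Try the cache, but never let a cache outage break the request.
  try {
    const hit = await cacheGet(key);
    if (hit !== null) return JSON.parse(hit) as T;
  } catch {
    // Cache read failed (e.g. Redis down): fall through to the loader.
  }

  // 2. Load from the source of truth.
  const value = await loader();

  // 3. Best-effort write-back; a failed write is not an error.
  try {
    await cacheSet(key, JSON.stringify(value), ttlSeconds);
  } catch {
    // Ignore cache write failures.
  }
  return value;
}
```

The same wrapper also gives you a single place to hook in hit-rate metrics (practice 3).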

CDN Setup

Image CDN Configuration

Complete guide to configuring image CDNs and optimization pipelines.

┌─────────────────────────────────────────────────────────────────────────────┐
│                         Image Delivery Pipeline                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Source              CDN / Optimizer              Browser                   │
│  ┌────────┐          ┌─────────────┐            ┌──────────┐               │
│  │ Origin │──────────►│   Resize   │──AVIF────►│  Chrome  │               │
│  │ Server │           │   Format   │            │  Safari  │               │
│  │  /CMS  │           │   Quality  │──WebP────►│  Firefox │               │
│  └────────┘           │   Cache    │            │  Edge    │               │
│                       └─────────────┘            └──────────┘               │
│                             │                                                │
│                      ┌──────▼──────┐                                        │
│                      │  Edge Cache │                                        │
│                      │  (Global)   │                                        │
│                      └─────────────┘                                        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Next.js Remote Patterns

Basic Configuration

// next.config.js
module.exports = {
  images: {
    // Enable modern formats
    formats: ['image/avif', 'image/webp'],

    // Allowed remote sources (required for external images)
    remotePatterns: [
      {
        protocol: 'https',
        hostname: 'cdn.example.com',
        pathname: '/images/**',
      },
      {
        protocol: 'https',
        hostname: '*.cloudinary.com',
      },
      {
        protocol: 'https',
        hostname: 'images.unsplash.com',
      },
      {
        protocol: 'https',
        hostname: 's3.amazonaws.com',
        pathname: '/my-bucket/**',
      },
    ],

    // Responsive breakpoints
    deviceSizes: [640, 750, 828, 1080, 1200, 1920, 2048, 3840],
    imageSizes: [16, 32, 48, 64, 96, 128, 256, 384],

    // Cache TTL (seconds) - default 60, increase for CDN
    minimumCacheTTL: 60 * 60 * 24 * 30, // 30 days

    // Disable optimization in development (faster builds)
    unoptimized: process.env.NODE_ENV === 'development',
  },
};

Environment-Based Configuration

// next.config.js
const isProd = process.env.NODE_ENV === 'production';

module.exports = {
  images: {
    formats: ['image/avif', 'image/webp'],

    // Different patterns per environment
    remotePatterns: [
      // Production CDN
      ...(isProd
        ? [
            {
              protocol: 'https',
              hostname: 'cdn.example.com',
            },
          ]
        : []),

      // Development/staging
      ...(!isProd
        ? [
            {
              protocol: 'https',
              hostname: 'staging-cdn.example.com',
            },
            {
              protocol: 'http',
              hostname: 'localhost',
              port: '3001',
            },
          ]
        : []),
    ],
  },
};

Cloudinary Integration

Loader Implementation

// lib/loaders/cloudinary.ts
import type { ImageLoader } from 'next/image';

const CLOUD_NAME = process.env.NEXT_PUBLIC_CLOUDINARY_CLOUD_NAME;

export const cloudinaryLoader: ImageLoader = ({ src, width, quality }) => {
  // Build transformation string
  const transforms = [
    `w_${width}`,
    `q_${quality || 'auto:good'}`,
    'f_auto', // Auto format (AVIF > WebP > JPEG)
    'c_limit', // Don't upscale
    'dpr_auto', // Auto DPR
  ].join(',');

  // Handle both full URLs and paths
  const imagePath = src.startsWith('http')
    ? src.replace(/^https?:\/\/[^/]+/, '')
    : src;

  return `https://res.cloudinary.com/${CLOUD_NAME}/image/upload/${transforms}/${imagePath}`;
};

// Advanced loader with more options
export const cloudinaryAdvancedLoader: ImageLoader = ({ src, width, quality }) => {
  const transforms = [
    `w_${width}`,
    `q_${quality || 'auto:good'}`,
    'f_auto', // Best format for browser
    'c_limit', // Don't upscale
    'fl_progressive', // Progressive loading
    'fl_immutable_cache', // Long cache
  ].join(',');

  return `https://res.cloudinary.com/${CLOUD_NAME}/image/upload/${transforms}/${src}`;
};

export default cloudinaryLoader;

Usage

import Image from 'next/image';
import { cloudinaryLoader } from '@/lib/loaders/cloudinary';

// Component usage
<Image
  loader={cloudinaryLoader}
  src="products/shoe-red.jpg" // Path in Cloudinary
  alt="Red running shoe"
  width={400}
  height={400}
  sizes="(max-width: 768px) 100vw, 400px"
/>

// Global loader configuration
// next.config.js
module.exports = {
  images: {
    loader: 'custom',
    loaderFile: './lib/loaders/cloudinary.ts',
  },
};

Imgix Integration

// lib/loaders/imgix.ts
import type { ImageLoader } from 'next/image';

const IMGIX_DOMAIN = process.env.NEXT_PUBLIC_IMGIX_DOMAIN;

export const imgixLoader: ImageLoader = ({ src, width, quality }) => {
  const url = new URL(`https://${IMGIX_DOMAIN}${src}`);

  // Auto format negotiation
  url.searchParams.set('auto', 'format,compress');

  // Width
  url.searchParams.set('w', width.toString());

  // Quality
  url.searchParams.set('q', (quality || 75).toString());

  // Fit mode (contain, cover, fill, etc.)
  url.searchParams.set('fit', 'max');

  return url.toString();
};

// With advanced features
export const imgixAdvancedLoader: ImageLoader = ({ src, width, quality }) => {
  const url = new URL(`https://${IMGIX_DOMAIN}${src}`);

  url.searchParams.set('auto', 'format,compress');
  url.searchParams.set('w', width.toString());
  url.searchParams.set('q', (quality || 75).toString());
  url.searchParams.set('fit', 'max');

  // Face detection for portraits
  // url.searchParams.set('fit', 'facearea');
  // url.searchParams.set('facepad', '2');

  // Blur for placeholders
  // url.searchParams.set('blur', '200');
  // url.searchParams.set('px', '16');

  return url.toString();
};

Cloudflare Images

// lib/loaders/cloudflare.ts
import type { ImageLoader } from 'next/image';

// Using Cloudflare Image Resizing
export const cloudflareResizingLoader: ImageLoader = ({ src, width, quality }) => {
  // src should be the full URL of the original image
  const params = [
    `width=${width}`,
    `quality=${quality || 85}`,
    'format=auto', // Auto AVIF/WebP
    'fit=scale-down', // Don't upscale
  ].join(',');

  return `https://yourdomain.com/cdn-cgi/image/${params}/${src}`;
};

// Using Cloudflare Images (upload API)
const ACCOUNT_HASH = process.env.NEXT_PUBLIC_CLOUDFLARE_ACCOUNT_HASH;

export const cloudflareImagesLoader: ImageLoader = ({ src, width }) => {
  // src is the image ID from Cloudflare
  // Variants are predefined in Cloudflare dashboard
  const variant = width <= 640 ? 'small' : width <= 1024 ? 'medium' : 'large';

  return `https://imagedelivery.net/${ACCOUNT_HASH}/${src}/${variant}`;
};

AWS S3 + CloudFront

// lib/loaders/aws.ts
import type { ImageLoader } from 'next/image';

const CLOUDFRONT_DOMAIN = process.env.NEXT_PUBLIC_CLOUDFRONT_DOMAIN;

// Basic CloudFront loader (requires Lambda@Edge for resizing)
export const cloudfrontLoader: ImageLoader = ({ src, width, quality }) => {
  // Lambda@Edge parses these query params
  const params = new URLSearchParams({
    w: width.toString(),
    q: (quality || 80).toString(),
    f: 'auto',
  });

  return `https://${CLOUDFRONT_DOMAIN}${src}?${params}`;
};

// For static S3 images (no resizing)
export const s3Loader: ImageLoader = ({ src }) => {
  return `https://${CLOUDFRONT_DOMAIN}${src}`;
};

Vercel Image Optimization

// Automatically enabled on Vercel
// Configure in next.config.js
module.exports = {
  images: {
    // Use Vercel's built-in optimizer
    loader: 'default',

    // External domains need explicit allowlist
    remotePatterns: [
      {
        protocol: 'https',
        hostname: 'cdn.example.com',
      },
    ],

    // Increase cache for static images
    minimumCacheTTL: 60 * 60 * 24 * 365, // 1 year
  },
};

// For non-Vercel deployments, use external loader
module.exports = {
  images: {
    loader: 'custom',
    loaderFile: './lib/loaders/cloudinary.ts',
  },
};

Self-Hosted with Sharp

// For self-hosted Next.js (Docker, Node.js)

// 1. Install Sharp
// npm install sharp

// 2. Configure next.config.js
module.exports = {
  output: 'standalone', // Required for the standalone output copied in the Dockerfile below
  images: {
    loader: 'default', // Uses Sharp internally
    formats: ['image/avif', 'image/webp'],
    minimumCacheTTL: 60 * 60 * 24 * 30, // 30 days

    // Important for self-hosted
    dangerouslyAllowSVG: false,
    contentDispositionType: 'attachment',
  },
};

// 3. Dockerfile - ensure Sharp can build
FROM node:20-alpine AS builder
RUN apk add --no-cache libc6-compat
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM node:20-alpine AS runner
RUN apk add --no-cache libc6-compat
WORKDIR /app
COPY --from=builder /app/.next/standalone ./
COPY --from=builder /app/.next/static ./.next/static
COPY --from=builder /app/public ./public
EXPOSE 3000
CMD ["node", "server.js"]

CDN Headers & Caching

Nginx Configuration

# /etc/nginx/conf.d/images.conf

# Image caching
location ~* \.(jpg|jpeg|png|webp|avif|gif|ico|svg)$ {
    # Long cache for immutable assets
    expires 1y;
    add_header Cache-Control "public, immutable";

    # Vary by Accept header for format negotiation
    add_header Vary "Accept";

    # Security headers
    add_header X-Content-Type-Options "nosniff";
}

# Next.js optimized images
location /_next/image {
    proxy_pass http://nextjs_upstream;
    proxy_cache_valid 200 365d;

    # Cache key includes Accept header for format
    proxy_cache_key "$scheme$request_method$host$request_uri$http_accept";

    add_header X-Cache-Status $upstream_cache_status;
}

Cloudflare Page Rules

{
  "targets": [
    {
      "target": "url",
      "constraint": {
        "operator": "matches",
        "value": "*.example.com/*.(jpg|jpeg|png|webp|avif|gif)"
      }
    }
  ],
  "actions": [
    {
      "id": "cache_level",
      "value": "cache_everything"
    },
    {
      "id": "edge_cache_ttl",
      "value": 2592000
    },
    {
      "id": "browser_cache_ttl",
      "value": 31536000
    },
    {
      "id": "polish",
      "value": "lossless"
    }
  ]
}

Blur Placeholder Generation

Build-Time with Plaiceholder

// lib/blur.ts
import { getPlaiceholder } from 'plaiceholder';
import fs from 'fs/promises';
import path from 'path';

export async function getBlurDataURL(imagePath: string): Promise<string> {
  try {
    const file = await fs.readFile(path.join(process.cwd(), 'public', imagePath));
    const { base64 } = await getPlaiceholder(file);
    return base64;
  } catch {
    // Return a tiny transparent placeholder on error
    return 'data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7';
  }
}

// Usage in getStaticProps
export async function getStaticProps() {
  const blurDataURL = await getBlurDataURL('/images/hero.jpg');
  return {
    props: { blurDataURL },
  };
}

Remote Image Blur

// lib/remote-blur.ts
import { getPlaiceholder } from 'plaiceholder';

export async function getRemoteBlurDataURL(imageUrl: string): Promise<string> {
  try {
    const response = await fetch(imageUrl);
    const buffer = Buffer.from(await response.arrayBuffer());
    const { base64 } = await getPlaiceholder(buffer);
    return base64;
  } catch {
    return 'data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7';
  }
}

// Cache blur data URLs
const blurCache = new Map<string, string>();

export async function getCachedBlurDataURL(imageUrl: string): Promise<string> {
  if (blurCache.has(imageUrl)) {
    return blurCache.get(imageUrl)!;
  }

  const blur = await getRemoteBlurDataURL(imageUrl);
  blurCache.set(imageUrl, blur);
  return blur;
}

Image Validation & Error Handling

// lib/image-validation.ts
export function isValidImageUrl(url: string): boolean {
  try {
    const parsed = new URL(url);
    const allowedHosts = ['cdn.example.com', 'images.unsplash.com'];
    return allowedHosts.some(
      (host) => parsed.hostname === host || parsed.hostname.endsWith(`.${host}`)
    );
  } catch {
    return false;
  }
}

export function getOptimizedImageUrl(
  src: string,
  options: { width: number; quality?: number }
): string {
  // Use your CDN loader
  const { width, quality = 80 } = options;

  if (src.includes('cloudinary.com')) {
    return src.replace('/upload/', `/upload/w_${width},q_${quality},f_auto/`);
  }

  // Default: return as-is
  return src;
}

Core Web Vitals

Core Web Vitals Optimization

Google's Core Web Vitals are the key metrics for measuring user experience.

The Three Metrics

| Metric | Target | Measures | Impact |
|---|---|---|---|
| LCP (Largest Contentful Paint) | < 2.5s | Loading performance | First impression |
| INP (Interaction to Next Paint) | < 200ms | Responsiveness | User frustration |
| CLS (Cumulative Layout Shift) | < 0.1 | Visual stability | Accidental clicks |

LCP (Largest Contentful Paint)

What It Measures

Time until the largest visible element (hero image, heading, video) renders.

Common Causes

  • Large, unoptimized images
  • Slow server response time (TTFB > 600ms)
  • Render-blocking JavaScript/CSS
  • Client-side rendering

Fixes

1. Optimize Images

<!-- Preload LCP image -->
<link rel="preload" as="image" href="/hero.jpg" fetchpriority="high" />

<!-- Use modern formats -->
<picture>
  <source srcset="/hero.avif" type="image/avif" />
  <source srcset="/hero.webp" type="image/webp" />
  <img src="/hero.jpg" alt="Hero" width="1200" height="600" />
</picture>

<!-- Or use next/image -->
<Image src="/hero.jpg" alt="Hero" width={1200} height={600} priority quality={85} />

2. Reduce Server Response Time

  • Use CDN for static assets
  • Enable HTTP/2 or HTTP/3
  • Optimize database queries
  • Implement caching (Redis, CDN)

3. Eliminate Render-Blocking Resources

<!-- Defer non-critical CSS -->
<link rel="preload" as="style" href="/styles.css" onload="this.onload=null;this.rel='stylesheet'" />

<!-- Defer JavaScript -->
<script src="/app.js" defer></script>

<!-- Inline critical CSS -->
<style>
  /* Critical above-the-fold styles */
  .hero { ... }
</style>

4. Use Server-Side Rendering (SSR)

// Next.js SSR
export async function getServerSideProps() {
  const data = await fetchData();
  return { props: { data } };
}

// React Server Components
async function Page() {
  const data = await fetchData();  // Runs on server
  return <div>{data}</div>;
}

INP (Interaction to Next Paint)

What It Measures

Time from user interaction (click, tap, key press) to visual feedback.

Common Causes

  • Heavy JavaScript execution blocking main thread
  • Long-running event handlers
  • Expensive DOM updates
  • Third-party scripts

Fixes

1. Debounce/Throttle Expensive Operations

import { debounce } from 'lodash';

// Without debounce: runs on EVERY keystroke
function handleSearch(query: string) {
  const results = expensiveSearch(query);  // Blocks for 100ms
  setResults(results);
}

// With debounce: runs 300ms after user stops typing
const handleSearch = debounce((query: string) => {
  const results = expensiveSearch(query);
  setResults(results);
}, 300);

2. Use Web Workers for Heavy Computation

// worker.ts
self.onmessage = (e) => {
  const result = expensiveComputation(e.data);
  self.postMessage(result);
};

// main.ts
const worker = new Worker('/worker.js');
worker.postMessage(data);
worker.onmessage = (e) => {
  setResult(e.data);
};

3. Split Long Tasks

// Before: Blocks main thread for 500ms
function processItems(items) {
  items.forEach(item => {
    processItem(item);  // 5ms each × 100 items = 500ms
  });
}

// After: Yields to browser between batches
async function processItems(items) {
  for (let i = 0; i < items.length; i += 10) {
    const batch = items.slice(i, i + 10);
    batch.forEach(processItem);

    // Yield to browser
    await new Promise(resolve => setTimeout(resolve, 0));
  }
}

// Or use Scheduler API (modern)
async function processItems(items) {
  for (let i = 0; i < items.length; i += 10) {
    const batch = items.slice(i, i + 10);
    batch.forEach(processItem);

    await scheduler.yield();  // Yield to higher priority tasks
  }
}

4. Optimize React Rendering

import { memo, useState, useTransition } from 'react';

// Memoize expensive components
const Chart = memo(({ data }) => <ExpensiveChart data={data} />);

// Use startTransition for non-urgent updates

function Search() {
  const [query, setQuery] = useState('');
  const [results, setResults] = useState([]);
  const [isPending, startTransition] = useTransition();

  function handleChange(e) {
    setQuery(e.target.value);  // Urgent: update input immediately

    startTransition(() => {
      // Non-urgent: can be interrupted
      const filtered = filterResults(e.target.value);
      setResults(filtered);
    });
  }

  return <input value={query} onChange={handleChange} />;
}

CLS (Cumulative Layout Shift)

What It Measures

Visual stability - how much elements unexpectedly shift during load.

Common Causes

  • Images without dimensions
  • Ads/embeds injected after layout
  • Web fonts causing FOIT/FOUT
  • Dynamically injected content

Fixes

1. Always Set Image Dimensions

<!-- ❌ BAD: No dimensions, causes layout shift -->
<img src="/photo.jpg" alt="Photo" />

<!-- ✅ GOOD: Reserves space -->
<img src="/photo.jpg" alt="Photo" width="800" height="600" />

<!-- Or with aspect ratio (CSS) -->
<img src="/photo.jpg" alt="Photo" style="aspect-ratio: 4/3; width: 100%;" />

2. Reserve Space for Ads/Embeds

.ad-container {
  min-height: 250px;  /* Reserve space before ad loads */
  background: #f0f0f0;
}

3. Optimize Web Font Loading

/* Prevent FOIT (flash of invisible text) */
@font-face {
  font-family: 'CustomFont';
  src: url('/font.woff2') format('woff2');
  font-display: swap;  /* Show fallback immediately, swap when ready */
}
<!-- Preload critical fonts -->
<link rel="preload" as="font" href="/font.woff2" type="font/woff2" crossorigin />

4. Avoid Inserting Content Above Existing Content

// ❌ BAD: Inserts notification at top, shifts everything down
function addNotification(message) {
  container.insertAdjacentHTML('afterbegin', `<div>${message}</div>`);
}

// ✅ GOOD: Append to bottom or use fixed positioning
function addNotification(message) {
  const notification = document.createElement('div');
  notification.className = 'notification-fixed';  // position: fixed
  notification.textContent = message;
  document.body.appendChild(notification);
}

Measuring Core Web Vitals

In Development

// Use web-vitals library
import { onCLS, onINP, onLCP } from 'web-vitals';

onLCP(console.log);  // Log LCP
onINP(console.log);  // Log INP
onCLS(console.log);  // Log CLS

In Production (RUM - Real User Monitoring)

import { onCLS, onINP, onLCP } from 'web-vitals';

function sendToAnalytics(metric) {
  fetch('/api/analytics', {
    method: 'POST',
    body: JSON.stringify(metric),
  });
}

onLCP(sendToAnalytics);
onINP(sendToAnalytics);
onCLS(sendToAnalytics);

Lighthouse (Lab Testing)

# Run Lighthouse audit
lighthouse https://your-site.com --output=html

# Or use Chrome DevTools
# Open DevTools → Lighthouse tab → Generate report

Targets by Percentile

Google measures at the 75th percentile of all page loads:

| Grade | LCP | INP | CLS |
|---|---|---|---|
| Good (Green) | < 2.5s | < 200ms | < 0.1 |
| Needs Improvement (Orange) | 2.5-4s | 200-500ms | 0.1-0.25 |
| Poor (Red) | > 4s | > 500ms | > 0.25 |

Goal: 75% of page loads should be "Good" for all three metrics.
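The 75th-percentile grading above can be sketched directly. A minimal helper (illustrative names; samples assumed to be LCP values in milliseconds, graded against the 2.5s/4s thresholds from the table):

```typescript
// 75th percentile of a sample set (nearest-rank method).
export function p75(samples: number[]): number {
  const sorted = [...samples].sort((x, y) => x - y);
  const idx = Math.ceil(0.75 * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

// Grade LCP samples (in ms) the way Google does: by the p75 value.
export function gradeLCP(samples: number[]): 'good' | 'needs-improvement' | 'poor' {
  const v = p75(samples);
  if (v <= 2500) return 'good';
  if (v <= 4000) return 'needs-improvement';
  return 'poor';
}
```

Note the consequence: a quarter of your page loads can be slow and the metric still grades "good" — which is why RUM data (above) matters more than a single lab run.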

Quick Wins Checklist

  • Add width and height to all images
  • Preload LCP image
  • Use font-display: swap for web fonts
  • Defer non-critical JavaScript
  • Enable HTTP/2 and compression
  • Use CDN for static assets
  • Implement lazy loading for below-fold images
  • Memoize expensive React components
  • Debounce search inputs and expensive handlers

Database Optimization

Database Query Optimization

Strategies for optimizing database performance and eliminating slow queries.

Key Patterns

  1. Add Missing Indexes - Turn Seq Scan into Index Scan
  2. Fix N+1 Queries - Use JOINs or include instead of loops
  3. Cursor Pagination - Never load all records
  4. Connection Pooling - Manage connection lifecycle

Quick Diagnostics

-- Find slow queries (PostgreSQL; pg_stat_statements uses
-- mean_exec_time/total_exec_time on PostgreSQL 13+)
SELECT query, calls, mean_exec_time / 1000 AS mean_seconds
FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;

-- Verify index usage
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123;

-- Check for sequential scans
SELECT schemaname, tablename, seq_scan, seq_tup_read
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 10;

N+1 Query Detection

Symptoms:

  • One query to get parent records, then N queries for related data
  • Rapid sequential database calls in logs
  • Linear growth in query count with data size

Example Problem:

# ❌ BAD: N+1 query (1 + 8 queries)
result = await session.execute(select(Analysis).limit(8))
analyses = result.scalars().all()
for analysis in analyses:
    # Each iteration hits the DB again!
    chunk_result = await session.execute(
        select(Chunk).where(Chunk.analysis_id == analysis.id)
    )
    chunks = chunk_result.scalars().all()

Solution:

# ✅ GOOD: Eager load (2 queries total, regardless of row count)
from sqlalchemy.orm import selectinload

result = await session.execute(
    select(Analysis)
    .options(selectinload(Analysis.chunks))  # Eager load via one SELECT ... IN
    .limit(8)
)
analyses = result.scalars().all()

# Now analyses[0].chunks is already loaded (no per-row query)

Index Selection Strategies

| Index Type | Use Case | Example |
|---|---|---|
| B-tree | Equality, range queries | `WHERE created_at > '2025-01-01'` |
| GIN | Full-text search, JSONB | `WHERE content_tsvector @@ to_tsquery('python')` |
| HNSW | Vector similarity | `ORDER BY embedding <=> '[0.1, 0.2, ...]'` |
| Hash | Exact equality only | `WHERE id = 'abc123'` (rare) |

Index Creation Examples:

-- B-tree index for range queries
CREATE INDEX idx_analyses_created_at ON analyses(created_at);

-- GIN index for full-text search
CREATE INDEX idx_chunks_tsvector ON chunks USING GIN(content_tsvector);

-- HNSW index for vector similarity
CREATE INDEX idx_chunks_embedding ON chunks
USING hnsw (embedding vector_cosine_ops);

-- Partial index for active records only
CREATE INDEX idx_active_users ON users(email)
WHERE deleted_at IS NULL;

-- Composite index for common query pattern
CREATE INDEX idx_analyses_user_status ON analyses(user_id, status);

Connection Pooling

Problem: Creating new connections is expensive (50-100ms overhead)

Solution: Use connection pools

# SQLAlchemy async pool
engine = create_async_engine(
    DATABASE_URL,
    pool_size=20,  # Base connections
    max_overflow=10,  # Additional if needed
    pool_pre_ping=True,  # Verify connections are alive
    pool_recycle=3600  # Recycle after 1 hour
)

Pagination: Cursor vs Offset

Offset-Based (❌ Slow for large datasets)

SELECT * FROM analyses ORDER BY created_at DESC
LIMIT 20 OFFSET 1000;  -- Must scan 1020 rows!

Cursor-Based (✅ Fast, scales to millions)

SELECT * FROM analyses
WHERE created_at < '2025-01-15 10:00:00'  -- Last cursor
ORDER BY created_at DESC
LIMIT 20;  -- Only scans 20 rows
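On the application side, cursor pagination is mostly cursor encoding plus query selection. A minimal sketch (table and function names are illustrative; the cursor is the last row's `created_at`, base64url-encoded so it stays opaque to clients — `base64url` needs Node 15.7+):

```typescript
// Encode/decode an opaque cursor from the last row's created_at.
export function encodeCursor(createdAt: string): string {
  return Buffer.from(createdAt).toString('base64url');
}

export function decodeCursor(cursor: string): string {
  return Buffer.from(cursor, 'base64url').toString('utf8');
}

// Build the parameterized query for a page: first page has no cursor.
export function buildPageQuery(cursor: string | null, limit = 20) {
  if (cursor === null) {
    return {
      sql: 'SELECT * FROM analyses ORDER BY created_at DESC LIMIT $1',
      params: [limit] as unknown[],
    };
  }
  return {
    sql: 'SELECT * FROM analyses WHERE created_at < $1 ORDER BY created_at DESC LIMIT $2',
    params: [decodeCursor(cursor), limit] as unknown[],
  };
}
```

If `created_at` is not unique, add a tiebreaker column (e.g. `(created_at, id)`) to both the cursor and the `WHERE` clause so rows are never skipped or repeated.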

Best Practices

  1. Always use EXPLAIN ANALYZE before deploying queries
  2. Index foreign keys used in JOINs
  3. Avoid SELECT * - request only needed columns
  4. Use prepared statements to prevent SQL injection and enable query caching
  5. Monitor pg_stat_statements weekly
  6. Set query timeouts to prevent runaway queries
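For item 6, PostgreSQL offers a server-side `statement_timeout`, and most drivers expose it; a driver-agnostic application-level timeout is a useful complement. A hedged sketch (`withTimeout` is an illustrative helper, not a library API):

```typescript
// Reject a promise if it does not settle within `ms` milliseconds.
export function withTimeout<T>(p: Promise<T>, ms: number, label = 'query'): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms,
    );
    p.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); },
    );
  });
}

// Usage (illustrative):
// const result = await withTimeout(pool.query(sql, params), 5_000);
```

Note this only stops your code from waiting; the query may keep running on the server, which is why the server-side `statement_timeout` matters too.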

DevTools Profiler Workflow

React DevTools Profiler Workflow

Finding and fixing performance bottlenecks.

Setup

  1. Install React DevTools browser extension
  2. Open DevTools (F12)
  3. Navigate to Profiler tab
  4. Ensure React is in development mode

Basic Profiling Flow

1. Start Recording

  • Click the blue Record button
  • Perform the slow interaction
  • Click Stop

2. Analyze the Flamegraph

The flamegraph shows component render times:

[App (2ms)]
├── [Header (0.5ms)]
├── [Sidebar (15ms)]  ← Slow!
│   ├── [NavItem (1ms)]
│   ├── [NavItem (1ms)]
│   └── [HeavyWidget (12ms)]  ← Found it!
└── [Content (1ms)]

3. Key Metrics

| Metric | Meaning |
|---|---|
| Render time | How long the component took to render |
| Commit time | Time to apply changes to the DOM |
| Interactions | What triggered the render |

Reading the Profiler

Color Coding

  • Gray: Did not render
  • Blue/Teal: Rendered (fast)
  • Yellow: Rendered (medium)
  • Red/Orange: Rendered (slow)

"Why did this render?"

Enable in DevTools settings:

  1. Click gear icon in Profiler
  2. Check "Record why each component rendered"

Common reasons:

  • Props changed
  • State changed
  • Parent rendered
  • Context changed
  • Hooks changed

Identifying Problems

Problem 1: Component Renders Too Often

Look for components that render on every interaction:

Render 1: [List (50ms)] - items changed ✓
Render 2: [List (50ms)] - items same, parent rendered ✗
Render 3: [List (50ms)] - items same, parent rendered ✗

Solution: Isolate state, use React.memo as escape hatch

Problem 2: Single Render Too Slow

Look for wide bars in the flamegraph:

[SlowComponent (200ms)]
├── [Child1 (5ms)]
├── [Child2 (190ms)]  ← Find the slow child
│   └── [GrandChild (185ms)]  ← Root cause
└── [Child3 (5ms)]

Solution: Virtualize, lazy load, or optimize computation

Problem 3: Cascading Re-renders

Many components re-render for one change:

[Parent] → [Child1] → [GrandChild1]
        → [Child2] → [GrandChild2]
        → [Child3] → [GrandChild3]

Solution: Move state down, split context

Profiler Settings

Click the gear icon for options:

  • Record why each component rendered: Essential for debugging
  • Hide commits below X ms: Filter noise
  • Highlight updates: Visual indicator during interaction

Ranked View

Switch from Flamegraph to Ranked view:

1. HeavyWidget      12ms
2. Sidebar          3ms
3. NavItem          1ms
4. Content          1ms
5. Header           0.5ms

This shows components sorted by render time.

Timeline View

Shows renders over time, useful for:

  • Finding render cascades
  • Identifying what triggered re-renders
  • Seeing interaction-to-render timing

Console Integration

// Add profiling in code
import { Profiler } from 'react'

function onRenderCallback(
  id,           // Component tree id
  phase,        // "mount" | "update"
  actualDuration,
  baseDuration,
  startTime,
  commitTime
) {
  console.log(`${id} ${phase}: ${actualDuration.toFixed(2)}ms`)
}

<Profiler id="Navigation" onRender={onRenderCallback}>
  <Navigation />
</Profiler>

Quick Checklist

  1. Record the slow interaction
  2. Find the slowest component (ranked view)
  3. Check why it rendered (DevTools setting)
  4. Verify if render was necessary
  5. Apply targeted fix
  6. Re-profile to confirm improvement

Common Fixes by Cause

| Why Rendered | Fix |
|---|---|
| Props changed (but same value) | Check prop references |
| Parent rendered | Isolate state, split component |
| Context changed | Split context |
| Hooks changed | Check effect dependencies |
| State changed | Verify state is necessary |

Edge Deployment

Edge Deployment

Overview

Deploy LLMs on resource-constrained devices: mobile, edge servers, embedded systems.

Key constraints:

  • Limited GPU/NPU memory (4-24 GB)
  • Power efficiency requirements
  • Latency-sensitive applications
  • Offline/disconnected operation

Model Selection for Edge

| Device | Memory | Recommended Models |
|---|---|---|
| Mobile (iOS/Android) | 4-8 GB | Llama-3.2-1B, Phi-3-mini |
| Edge Server | 16-24 GB | Llama-3.2-3B, Mistral-7B-4bit |
| Raspberry Pi 5 | 8 GB | Gemma-2B, TinyLlama |
| Jetson Orin | 32-64 GB | Llama-3.1-8B, Mixtral-8x7B-4bit |

Aggressive Quantization

For edge, prioritize memory over quality:

from gptqmodel import GPTQModel, QuantizeConfig

# 2-bit quantization for extreme memory constraints
quant_config = QuantizeConfig(
    bits=2,
    group_size=32,
    damp_percent=0.1,
)

model = GPTQModel.load("meta-llama/Llama-3.2-1B-Instruct", quant_config)
model.quantize(calibration_data)
model.save("Llama-3.2-1B-2bit-edge")

Quality vs Memory Trade-off:

| Bits | Memory (1B model) | Quality Retention |
|---|---|---|
| 4 | ~600 MB | ~95% |
| 3 | ~450 MB | ~85% |
| 2 | ~300 MB | ~70% |
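The memory column follows from simple arithmetic: weight memory is roughly params × bits / 8 bytes, plus overhead for embeddings, quantization scales/zero-points, and activations. A ~20% overhead factor (an assumption, chosen because it reproduces the table) gives a quick estimator — sketched in TypeScript for consistency with the rest of this document:

```typescript
// Back-of-envelope memory estimate for a quantized model.
// params: parameter count; bits: quantization width;
// overhead: fraction added for embeddings, scales, and activations (assumed).
export function estimateWeightMemoryMB(
  params: number,
  bits: number,
  overhead = 0.2,
): number {
  const rawBytes = (params * bits) / 8; // bytes for the weights alone
  return Math.round((rawBytes * (1 + overhead)) / 1e6);
}

estimateWeightMemoryMB(1e9, 4); // ≈ 600 MB, matching the 4-bit row
```

The same formula explains why halving bit width roughly halves memory but, per the quality column, at a steeply increasing accuracy cost below 4 bits.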

llama.cpp for Edge

Optimized C++ inference for CPU/edge:

# Build llama.cpp (CMake; the old Makefile build has been removed)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# For Apple Silicon (Metal)
cmake -B build -DGGML_METAL=ON && cmake --build build -j

# For CUDA
cmake -B build -DGGML_CUDA=ON && cmake --build build -j

# For Vulkan (cross-platform GPU)
cmake -B build -DGGML_VULKAN=ON && cmake --build build -j

# Run inference
./build/bin/llama-cli -m models/llama-3.2-1b-q4_k_m.gguf \
    -p "Hello, how are you?" \
    -n 128 \
    -ngl 99  # Offload all layers to GPU

GGUF quantization types:

| Type | Bits | Quality | Speed |
|---|---|---|---|
| Q8_0 | 8 | Best | Good |
| Q5_K_M | 5 | Very Good | Better |
| Q4_K_M | 4 | Good | Best |
| Q3_K_M | 3 | Acceptable | Best |
| Q2_K | 2 | Degraded | Best |

Mobile Deployment

iOS with MLX

# Convert to MLX format for Apple Silicon
import mlx.core as mx
from mlx_lm import load, generate

# Load quantized model
model, tokenizer = load("mlx-community/Llama-3.2-1B-Instruct-4bit")

# Generate on device
prompt = "Explain machine learning briefly:"
response = generate(model, tokenizer, prompt=prompt, max_tokens=100)

Android with MLC-LLM

# Build for Android
mlc_llm compile meta-llama/Llama-3.3-1B-Instruct \
    --quantization q4f16_1 \
    --target android

# Deploy APK with bundled model
mlc_llm package \
    --model-lib ./dist/llama-3.3-1b-q4f16_1-android.tar \
    --apk-output ./LlamaApp.apk

Jetson/NVIDIA Edge

Optimized inference for Jetson Orin and other embedded NVIDIA platforms:

# Use TensorRT-LLM for Jetson
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-3B-Instruct",
    max_batch_size=4,  # Limit for memory
    max_input_len=2048,
    max_output_len=512,
)

# Optimized for Jetson memory constraints
outputs = llm.generate(
    prompts=["Hello!"],
    sampling_params=SamplingParams(max_tokens=100),
)

Memory Optimization Techniques

KV Cache Reduction

# Limit context length for edge
llm = LLM(
    model="meta-llama/Llama-3.3-1B-Instruct",
    max_model_len=1024,  # Cap context well below the model's default to shrink KV cache
    gpu_memory_utilization=0.95,  # Use nearly all available GPU memory
)

Sliding Window Attention

# Models with built-in sliding window attention
# Mistral-7B: 4096-token window
# Attention cost drops from O(n^2) to O(n * window),
# and the KV cache is capped at the window size

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    # vLLM picks up the model's native sliding window from its config
)

Flash Attention

# vLLM enables FlashAttention automatically on supported hardware;
# to force a backend explicitly, set VLLM_ATTENTION_BACKEND=FLASH_ATTN
# in the environment
llm = LLM(model="meta-llama/Llama-3.3-1B-Instruct")

Power Efficiency

Dynamic Frequency Scaling

# Limit GPU frequency for power savings (Jetson)
sudo nvpmodel -m 2  # Medium power mode
sudo jetson_clocks --show

# For inference-heavy workloads
sudo nvpmodel -m 0  # Max performance

Batch Size Optimization

# Smaller batches = lower peak power
llm = LLM(
    model="meta-llama/Llama-3.3-1B-Instruct",
    max_num_seqs=8,  # Limit concurrent requests
)

# Process requests sequentially for power
for prompt in prompts:
    output = llm.generate([prompt], sampling_params)
    yield output

Offline Deployment

Model Bundling

# Download and cache model for offline use
from huggingface_hub import snapshot_download

# Pre-download model
snapshot_download(
    "meta-llama/Llama-3.3-1B-Instruct",
    local_dir="./models/llama-3.3-1b",
    local_dir_use_symlinks=False,
)

# Use local path
llm = LLM(model="./models/llama-3.3-1b")

Air-gapped Environments

# Convert to GGUF with llama.cpp's converter, then quantize
python convert_hf_to_gguf.py ./models/llama-3.3-1b \
    --outfile ./llama-3.3-1b-f16.gguf
./build/bin/llama-quantize \
    ./llama-3.3-1b-f16.gguf ./llama-3.3-1b-q4_k_m.gguf q4_k_m

# Transfer and run on the air-gapped device
./build/bin/llama-cli -m ./llama-3.3-1b-q4_k_m.gguf -p "Hello"

Benchmarking Edge Performance

import time

def benchmark_edge(model_path: str, prompts: list[str]):
    """Benchmark for edge deployment."""
    from vllm import LLM, SamplingParams

    llm = LLM(
        model=model_path,
        max_model_len=1024,
        gpu_memory_utilization=0.95,
    )

    # Warmup
    llm.generate(["Warmup"], SamplingParams(max_tokens=10))

    # Benchmark
    times = []
    for prompt in prompts:
        start = time.perf_counter()
        output = llm.generate([prompt], SamplingParams(max_tokens=100))
        elapsed = time.perf_counter() - start
        times.append(elapsed)

    avg_latency = sum(times) / len(times)
    p99_latency = sorted(times)[int(len(times) * 0.99)]

    print(f"Avg latency: {avg_latency*1000:.1f}ms")
    print(f"P99 latency: {p99_latency*1000:.1f}ms")
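The p99 above uses nearest-rank indexing into the sorted sample; as a standalone helper (a ceiling-based variant of the same idea):

```python
import math

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile (0 < p <= 100): sort, then take the
    ceil(p/100 * n)-th smallest sample."""
    if not values:
        raise ValueError("empty sample")
    xs = sorted(values)
    rank = math.ceil(p / 100 * len(xs))
    return xs[rank - 1]

latencies_ms = [12, 10, 11, 50, 13, 12, 11, 14, 12, 100]
print(percentile(latencies_ms, 50))  # 12
print(percentile(latencies_ms, 99))  # 100
```

With small samples like this one, p99 simply returns the worst observed latency.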

  • ollama-local - Easy local deployment
  • quantization-guide - Quantization methods

Frontend Performance

Frontend Performance Optimization

Techniques for optimizing bundle size, loading speed, and rendering performance.

Bundle Optimization

1. Code Splitting

Split your bundle into smaller chunks that load on-demand:

// Route-based splitting (React 19)
import { lazy, Suspense } from 'react';

const AdminPanel = lazy(() => import('./AdminPanel'));
const Dashboard = lazy(() => import('./Dashboard'));

function App() {
  return (
    <Suspense fallback={<Loading />}>
      <Routes>
        <Route path="/admin" element={<AdminPanel />} />
        <Route path="/dashboard" element={<Dashboard />} />
      </Routes>
    </Suspense>
  );
}

2. Tree Shaking

Import only what you need:

// ❌ BAD: Imports entire library
import _ from 'lodash';
_.debounce(fn, 100);

// ✅ GOOD: Import specific function
import debounce from 'lodash/debounce';
debounce(fn, 100);

// ✅ EVEN BETTER: Use native or lightweight alternative
const debounce = (fn, delay) => {
  let timeout;
  return (...args) => {
    clearTimeout(timeout);
    timeout = setTimeout(() => fn(...args), delay);
  };
};

3. Image Optimization

// Use next/image for automatic optimization
import Image from 'next/image';

<Image
  src="/hero.jpg"
  width={1200}
  height={600}
  alt="Hero"
  loading="lazy"  // Lazy load below fold
  quality={85}    // Balance quality/size
  placeholder="blur"  // Show blur while loading
/>

// Or use modern formats manually
<picture>
  <source srcset="/hero.avif" type="image/avif" />
  <source srcset="/hero.webp" type="image/webp" />
  <img src="/hero.jpg" alt="Hero" loading="lazy" />
</picture>

Rendering Optimization

1. Memoization

Prevent unnecessary re-renders:

import { memo, useMemo, useCallback } from 'react';

// Memoize expensive component
const ExpensiveChart = memo(({ data }) => {
  return <Chart data={data} />;
});

// Memoize expensive computation
function AnalyticsDashboard({ analyses }) {
  const stats = useMemo(() => {
    const totals = analyses.reduce((acc, a) => ({
      totalCost: acc.totalCost + a.cost,
      totalDuration: acc.totalDuration + a.duration
    }), { totalCost: 0, totalDuration: 0 });
    return {
      totalCost: totals.totalCost,
      avgDuration: analyses.length ? totals.totalDuration / analyses.length : 0
    };
  }, [analyses]);  // Only recompute if analyses change

  return <div>{stats.totalCost}</div>;
}

// Memoize callback to prevent child re-renders
function Parent() {
  const [count, setCount] = useState(0);

  const handleClick = useCallback(() => {
    setCount(c => c + 1);
  }, []);  // Function identity stays same

  return <Child onClick={handleClick} />;
}

2. Virtualization

Render only visible items in long lists:

import { useVirtualizer } from '@tanstack/react-virtual';

function AnalysisList({ analyses }) {
  const parentRef = useRef(null);

  const virtualizer = useVirtualizer({
    count: analyses.length,
    getScrollElement: () => parentRef.current,
    estimateSize: () => 100,  // Estimated row height
  });

  return (
    <div ref={parentRef} style={{ height: '600px', overflow: 'auto' }}>
      <div style={{ height: `${virtualizer.getTotalSize()}px` }}>
        {virtualizer.getVirtualItems().map(virtualItem => (
          <div
            key={virtualItem.index}
            style={{
              position: 'absolute',
              top: 0,
              left: 0,
              width: '100%',
              height: `${virtualItem.size}px`,
              transform: `translateY(${virtualItem.start}px)`,
            }}
          >
            <AnalysisCard analysis={analyses[virtualItem.index]} />
          </div>
        ))}
      </div>
    </div>
  );
}

3. Batch DOM Operations

Minimize layout thrashing:

// ❌ BAD: Read-write-read-write causes layout thrashing
elements.forEach(el => {
  const height = el.offsetHeight;  // Read (triggers layout)
  el.style.height = height + 10 + 'px';  // Write
});

// ✅ GOOD: Batch reads, then writes
const heights = elements.map(el => el.offsetHeight);  // All reads
elements.forEach((el, i) => {
  el.style.height = heights[i] + 10 + 'px';  // All writes
});

Core Web Vitals Optimization

LCP (Largest Contentful Paint) - Target: < 2.5s

Causes:

  • Large images not optimized
  • Slow server response (TTFB)
  • Render-blocking JS/CSS

Fixes:

  • Preload LCP image: <link rel="preload" as="image" href="/hero.jpg" fetchpriority="high">
  • Use CDN for assets
  • Inline critical CSS
  • Server-side rendering (SSR)

INP (Interaction to Next Paint) - Target: < 200ms

Causes:

  • Heavy JavaScript execution
  • Long-running event handlers
  • Main thread blocked

Fixes:

  • Debounce expensive operations
  • Use Web Workers for heavy computation
  • Split long tasks with setTimeout() or scheduler.postTask()

CLS (Cumulative Layout Shift) - Target: < 0.1

Causes:

  • Images without dimensions
  • Ads/embeds loading late
  • Web fonts causing FOIT/FOUT

Fixes:

  • Always set width and height on images
  • Reserve space for ads: min-height: 250px
  • Use font-display: swap for web fonts
  • Preload fonts: <link rel="preload" as="font" type="font/woff2" crossorigin>

Bundle Analysis

# Lighthouse audit
lighthouse http://localhost:3000 --output=html

# Bundle analysis (Next.js)
ANALYZE=true npm run build

# Bundle analysis (Vite)
npm run build && npx vite-bundle-visualizer

# Check bundle size
du -sh dist/

Best Practices

  1. Lazy load below-the-fold content
  2. Use modern image formats (WebP, AVIF)
  3. Enable compression (Brotli > gzip)
  4. Minimize third-party scripts
  5. Use CDN for static assets
  6. Monitor Core Web Vitals in production with RUM


Memoization Escape Hatches


When to still use useMemo and useCallback with React Compiler.

Overview

React Compiler handles most memoization automatically. Use manual memoization only as escape hatches for specific cases.

Escape Hatch 1: Effect Dependencies

When a value is used as an effect dependency and you need precise control:

// Problem: Effect runs on every render
function UserDashboard({ userId }) {
  const config = {
    userId,
    includeStats: true,
    format: 'detailed',
  }

  useEffect(() => {
    fetchData(config) // Runs every render! config is new object
  }, [config])
}

// Solution: Memoize the config
function UserDashboard({ userId }) {
  const config = useMemo(() => ({
    userId,
    includeStats: true,
    format: 'detailed',
  }), [userId]) // Only changes when userId changes

  useEffect(() => {
    fetchData(config)
  }, [config])
}

Escape Hatch 2: Third-Party Libraries

Libraries without React Compiler support may expect stable references:

// Some charting libraries compare references
function Chart({ data }) {
  // Ensure stable reference for library
  const chartOptions = useMemo(() => ({
    animation: true,
    responsive: true,
    data: transformData(data),
  }), [data])

  return <ThirdPartyChart options={chartOptions} />
}

Escape Hatch 3: Expensive Computations

When you know a computation is expensive and want explicit control:

function SearchResults({ items, query }) {
  // Explicitly expensive - want to ensure it's memoized
  const filteredItems = useMemo(() => {
    console.log('Filtering...')
    return items
      .filter(item => matchesQuery(item, query))
      .sort(complexSortFn)
      .slice(0, 100)
  }, [items, query])

  return <List items={filteredItems} />
}

Escape Hatch 4: Referential Equality for Children

When passing objects/arrays to components that use referential equality:

function Parent() {
  // Child component uses Object.is() comparison
  const contextValue = useMemo(() => ({
    theme: 'dark',
    locale: 'en',
  }), [])

  return (
    <MyContext.Provider value={contextValue}>
      <Children />
    </MyContext.Provider>
  )
}

When NOT to Use Escape Hatches

Don't Memoize Primitives

// ❌ Unnecessary - primitives are already stable
const memoizedId = useMemo(() => props.id, [props.id])

// ✅ Just use it directly
<Child id={props.id} />

Don't Memoize Simple JSX

// ❌ Unnecessary with React Compiler
const memoizedButton = useMemo(() => (
  <Button onClick={handleClick}>Click</Button>
), [handleClick])

// ✅ Compiler handles this
<Button onClick={handleClick}>Click</Button>

Don't Memoize Everything "Just in Case"

// ❌ Over-memoization
function Component({ user }) {
  const name = useMemo(() => user.name, [user.name])
  const email = useMemo(() => user.email, [user.email])
  const avatar = useMemo(() => user.avatar, [user.avatar])

  return <Profile name={name} email={email} avatar={avatar} />
}

// ✅ Trust the compiler
function Component({ user }) {
  return <Profile name={user.name} email={user.email} avatar={user.avatar} />
}

useCallback Escape Hatches

Stable Event Handlers for Effects

function DataFetcher({ onDataLoaded }) {
  // Need stable reference for effect dependency
  const stableCallback = useCallback(
    (data) => onDataLoaded(data),
    [onDataLoaded]
  )

  useEffect(() => {
    fetchData().then(stableCallback)
  }, [stableCallback])
}

Refs in Callbacks

function Form() {
  const inputRef = useRef<HTMLInputElement>(null)

  // Callback that uses ref - may need stability
  const focusInput = useCallback(() => {
    inputRef.current?.focus()
  }, [])

  return (
    <>
      <input ref={inputRef} />
      <Button onClick={focusInput}>Focus</Button>
    </>
  )
}

Decision Tree

Is it an effect dependency?
├─ YES → Does the effect need to run less often?
│        └─ YES → useMemo/useCallback
└─ NO → Is it passed to a third-party library?
        ├─ YES → Check library docs, may need useMemo
        └─ NO → Is it a known expensive computation?
                ├─ YES → Consider useMemo for explicit control
                └─ NO → Trust React Compiler

Verifying Compiler Coverage

// In development, check DevTools for Memo badge
// If component doesn't have badge, compiler may have skipped it

// You can also add console logs to verify:
const value = useMemo(() => {
  console.log('Computing...') // Should only log when deps change
  return expensiveComputation()
}, [deps])

Profiling

Performance Profiling

Tools and techniques for identifying performance bottlenecks.

Profiling Workflow

  1. Measure - Establish baseline metrics
  2. Profile - Identify bottlenecks
  3. Optimize - Fix the slowest operations first
  4. Verify - Measure improvement
  5. Repeat - Iterate until targets met
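Step 1 ("Measure") can be as simple as a timing decorator around the suspect code path (a minimal sketch; the function names are illustrative):

```python
import time
from functools import wraps

def timed(fn):
    """Print wall-clock time per call to establish a baseline."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"{fn.__name__}: {elapsed_ms:.1f}ms")
        return result
    return wrapper

@timed
def sum_of_squares(n: int) -> int:
    return sum(i * i for i in range(n))

sum_of_squares(100_000)
```

Baseline numbers from a decorator like this also make step 4 ("Verify") a direct before/after comparison.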

Backend Profiling (Python)

1. cProfile (Built-in)

# Profile entire script
python -m cProfile -s cumulative backend/app/main.py

# Save profile for analysis
python -m cProfile -o profile.prof backend/app/main.py

# Analyze with snakeviz
pip install snakeviz
snakeviz profile.prof  # Opens interactive flame graph
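cProfile can also be driven programmatically, to profile a single code path instead of the whole script:

```python
import cProfile
import io
import pstats

def work() -> int:
    return sum(i * i for i in range(200_000))

# Profile just the call to work(), not the interpreter startup
profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Print the five most expensive entries by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```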

2. py-spy (Sampling Profiler)

# Install
pip install py-spy

# Profile running process
sudo py-spy top --pid 12345

# Generate flame graph
sudo py-spy record -o profile.svg --pid 12345 --duration 60

# Profile from start
py-spy record -o profile.svg -- python app.py

3. memory_profiler

# Install
pip install memory_profiler

# Decorate functions to profile
from memory_profiler import profile

@profile
def expensive_function():
    data = [0] * (10 ** 6)  # 1M integers
    return sum(data)

# Run with profiling
python -m memory_profiler script.py

4. Line Profiler

# Install
pip install line_profiler

# Add decorator
from line_profiler import profile

@profile
def slow_function():
    result = 0
    for i in range(1000000):
        result += i
    return result

# Run with kernprof
kernprof -l -v script.py

Frontend Profiling

1. Chrome DevTools Performance Tab

Steps:

  1. Open DevTools (F12)
  2. Go to Performance tab
  3. Click Record (Cmd+E)
  4. Interact with page
  5. Stop recording
  6. Analyze flame graph

What to Look For:

  • Long tasks (> 50ms) - shows as red in timeline
  • Layout/reflow - indicates DOM thrashing
  • Scripting time - JavaScript execution
  • Rendering time - paint and composite

2. React DevTools Profiler

import { Profiler } from 'react';

function onRenderCallback(
  id: string,
  phase: 'mount' | 'update',
  actualDuration: number,
  baseDuration: number,
  startTime: number,
  commitTime: number
) {
  console.log(`${id} (${phase}) took ${actualDuration}ms`);
}

<Profiler id="AnalysisList" onRender={onRenderCallback}>
  <AnalysisList analyses={analyses} />
</Profiler>

In DevTools:

  1. Open React DevTools
  2. Go to Profiler tab
  3. Click Record
  4. Interact with app
  5. Stop and analyze

What to Look For:

  • Components that render frequently but haven't changed
  • Components with long render times
  • Unnecessary re-renders (use memo())

3. Lighthouse Performance Audit

# CLI
npm install -g lighthouse
lighthouse http://localhost:3000 --view

# Or use Chrome DevTools → Lighthouse tab

Metrics Analyzed:

  • First Contentful Paint (FCP)
  • Largest Contentful Paint (LCP)
  • Speed Index
  • Time to Interactive (TTI)
  • Total Blocking Time (TBT)
  • Cumulative Layout Shift (CLS)

4. Bundle Analyzer

# Next.js
npm install @next/bundle-analyzer
ANALYZE=true npm run build

# Vite
npm install -D rollup-plugin-visualizer
npx vite-bundle-visualizer

# Webpack
npm install -D webpack-bundle-analyzer
webpack --profile --json > stats.json
webpack-bundle-analyzer stats.json

Database Profiling

PostgreSQL

1. Enable Query Logging

-- Enable slow query log
ALTER SYSTEM SET log_min_duration_statement = 100;  -- Log queries > 100ms
SELECT pg_reload_conf();

2. pg_stat_statements

-- Enable extension
CREATE EXTENSION pg_stat_statements;

-- Find slowest queries
SELECT
  query,
  calls,
  total_exec_time / 1000 AS total_seconds,  -- PostgreSQL 13+ column names
  mean_exec_time / 1000 AS mean_seconds,    -- (total_time/mean_time before 13)
  max_exec_time / 1000 AS max_seconds
FROM pg_stat_statements
WHERE query NOT LIKE '%pg_stat_statements%'
ORDER BY total_exec_time DESC
LIMIT 10;

-- Reset stats
SELECT pg_stat_statements_reset();

3. EXPLAIN ANALYZE

-- Analyze query execution
EXPLAIN ANALYZE
SELECT a.*, COUNT(c.id) as chunk_count
FROM analyses a
LEFT JOIN chunks c ON c.analysis_id = a.id
WHERE a.user_id = 'user_123'
GROUP BY a.id
ORDER BY a.created_at DESC
LIMIT 20;

-- Look for:
-- - Seq Scan (bad for large tables)
-- - High actual time
-- - High actual rows vs estimated rows

Memory Profiling

Python (memory_profiler)

from memory_profiler import profile

@profile
def load_analyses():
    # Shows line-by-line memory usage
    analyses = []
    for i in range(10000):
        analyses.append({
            'id': i,
            'content': 'x' * 1000,  # Memory spike here!
        })
    return analyses

Chrome DevTools (Heap Snapshot)

Steps:

  1. Open DevTools → Memory tab
  2. Take Heap Snapshot
  3. Interact with app
  4. Take another snapshot
  5. Compare snapshots

What to Look For:

  • Detached DOM nodes (memory leaks)
  • Large arrays/objects
  • Unreleased event listeners

Memory Leak Detection

// ❌ BAD: Memory leak (event listener never removed)
useEffect(() => {
  window.addEventListener('resize', handleResize);
}, []);

// ✅ GOOD: Cleanup on unmount
useEffect(() => {
  window.addEventListener('resize', handleResize);
  return () => {
    window.removeEventListener('resize', handleResize);
  };
}, []);

Flame Graphs

Visual representation of call stacks showing where time is spent.

Reading Flame Graphs:

  • Width = Time spent (wider = slower)
  • Height = Call stack depth
  • Color = Usually just for differentiation
  • Top = Leaf functions (where actual work happens)

Generate Flame Graph (Python):

# With py-spy
sudo py-spy record -o flamegraph.svg --pid 12345

# Open in browser
open flamegraph.svg

Load Testing

k6 (HTTP Load Testing)

// load-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export let options = {
  stages: [
    { duration: '2m', target: 100 },  // Ramp up to 100 users
    { duration: '5m', target: 100 },  // Stay at 100 users
    { duration: '2m', target: 0 },    // Ramp down to 0 users
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],  // 95% of requests < 500ms
  },
};

export default function () {
  const res = http.get('http://localhost:8500/api/v1/analyses');

  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });

  sleep(1);
}

# Run load test
k6 run load-test.js

Locust (Python Load Testing)

# locustfile.py
from locust import HttpUser, task, between

class ApiUser(HttpUser):
    wait_time = between(1, 3)

    @task
    def get_analyses(self):
        self.client.get("/api/v1/analyses")

    @task(3)  # 3x more frequent
    def get_analysis(self):
        self.client.get("/api/v1/analyses/abc123")

# Run with web UI
locust -f locustfile.py --host=http://localhost:8500

# Or headless
locust -f locustfile.py --host=http://localhost:8500 --users 100 --spawn-rate 10 --run-time 5m --headless

Profiling Best Practices

  1. Profile in production-like environments - Dev may not show real bottlenecks
  2. Profile with realistic data volumes - Empty databases hide performance issues
  3. Focus on the slowest operations first - 80/20 rule applies
  4. Measure before and after - Verify optimizations actually help
  5. Profile regularly - Catch regressions early
  6. Use sampling profilers for production - Low overhead (py-spy, not cProfile)
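Practice 4 ("measure before and after") as a minimal harness with timeit (the workload is illustrative — a loop versus its closed form):

```python
import timeit

def measure(fn, number: int = 2000) -> float:
    """Per-call seconds; taking the min over repeats filters scheduler noise."""
    return min(timeit.repeat(fn, number=number, repeat=5)) / number

n = 1000
before = measure(lambda: sum(i * i for i in range(n)))   # loop
after = measure(lambda: n * (n - 1) * (2 * n - 1) // 6)  # closed form
print(f"speedup: {before / after:.0f}x")
```

Comparing minima rather than single runs keeps the comparison honest on a noisy machine.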

Quick Profiling Commands

# Python CPU profiling
python -m cProfile -s cumulative script.py | head -20

# Python memory profiling
python -m memory_profiler script.py

# Node.js profiling
node --prof app.js
node --prof-process isolate-*.log > processed.txt

# PostgreSQL slow queries
psql -c "SELECT query, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10"

# Chrome DevTools (programmatic)
node --inspect app.js
# Then open chrome://inspect


Quantization Guide


Overview

Quantization reduces model precision to decrease memory usage and increase throughput.

| Method | Bits | Calibration | Memory Savings | Throughput | Quality Loss |
|---|---|---|---|---|---|
| FP16 | 16 | None | Baseline | Baseline | None |
| FP8 | 8 | None | 50% | +30-50% | Minimal |
| INT8 | 8 | Optional | 50% | +10-20% | Minimal |
| AWQ | 4 | Required | 75% | +20-40% | Small |
| GPTQ | 4 | Required | 75% | +15-30% | Small |

AWQ (Activation-aware Weight Quantization)

Best 4-bit method for quality preservation:

# Use pre-quantized AWQ model
vllm serve TheBloke/Llama-2-70B-chat-AWQ \
    --quantization awq \
    --tensor-parallel-size 2

Or load it offline via the Python API:

from vllm import LLM

# AWQ quantized model
llm = LLM(
    model="TheBloke/Llama-2-70B-chat-AWQ",
    quantization="awq",
    dtype="half",
    tensor_parallel_size=2,
)

AWQ Benefits:

  • Activation-aware: Preserves important weights
  • Better quality than GPTQ at same bit-width
  • Faster inference on modern GPUs

GPTQ Quantization

Create your own GPTQ quantized model:

from gptqmodel import GPTQModel, QuantizeConfig
from datasets import load_dataset

# Load calibration data
calibration_data = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(1024))["text"]

# Configure quantization
quant_config = QuantizeConfig(
    bits=4,              # 4-bit quantization
    group_size=128,      # Group size for quantization
    damp_percent=0.1,    # Dampening for Hessian
    desc_act=True,       # Activation order (better quality)
)

# Load and quantize
model = GPTQModel.load(
    "meta-llama/Llama-3.2-1B-Instruct",
    quant_config,
)
model.quantize(calibration_data, batch_size=4)

# Save quantized model
model.save("Llama-3.2-1B-Instruct-gptq-4bit")

Using GPTQ with vLLM:

from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-70B-GPTQ",
    quantization="gptq",
    dtype="half",
)

FP8 Quantization

Best for H100/H200 GPUs with native FP8 support:

from vllm import LLM

# FP8 on H100
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    quantization="fp8",  # Native FP8
    kv_cache_dtype="fp8",  # FP8 KV cache
)

FP8 Advantages:

  • Near-FP16 quality
  • 50% memory reduction
  • Best throughput on H100/H200
  • No calibration needed

INT8 Quantization

Balanced option with minimal quality loss:

# INT8 weight quantization
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    quantization="int8",
    dtype="float16",
)

Quantization Comparison

Memory Usage (70B Model)

| Precision | Total Memory | GPUs Needed |
|---|---|---|
| FP32 | ~280 GB | 8x A100 80GB |
| FP16 | ~140 GB | 4x A100 80GB |
| INT8/FP8 | ~70 GB | 2x A100 80GB |
| AWQ/GPTQ | ~35 GB | 1x A100 80GB |

Quality Benchmarks (MMLU)

| Model | FP16 | INT8 | AWQ-4bit | GPTQ-4bit |
|---|---|---|---|---|
| Llama-3.1-8B | 66.2% | 65.8% | 65.1% | 64.8% |
| Llama-3.1-70B | 79.3% | 79.0% | 78.2% | 77.9% |

Best Practices

Calibration Data

Use representative data for your use case:

# Domain-specific calibration
calibration_data = [
    # Include examples similar to production queries
    "Customer support query example...",
    "Technical documentation example...",
    "Code generation example...",
]

# Minimum 128 samples, recommended 512-1024
assert len(calibration_data) >= 128

Group Size Selection

| Group Size | Memory Overhead | Quality | Speed |
|---|---|---|---|
| 32 | Highest | Best | Slowest |
| 64 | Medium | Very Good | Fast |
| 128 | Lowest | Good | Fastest |

# Larger group size = faster and smaller, but slightly lower quality
quant_config = QuantizeConfig(
    bits=4,
    group_size=128,  # Balance of speed and quality
)

Mixed Precision

Keep critical layers at higher precision:

# Some layers benefit from higher precision
quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    inside_layer_modules=[
        # Keep attention at higher precision
        "self_attn.q_proj",
        "self_attn.k_proj",
        "self_attn.v_proj",
    ],
)

Troubleshooting

OOM During Quantization

# Reduce batch size
model.quantize(calibration_data, batch_size=1)

# Use gradient checkpointing
model.quantize(
    calibration_data,
    batch_size=2,
    use_checkpoint=True,
)

Quality Degradation

  1. Increase calibration data diversity
  2. Reduce group size (32 or 64)
  3. Try AWQ instead of GPTQ
  4. Enable desc_act=True for GPTQ

  • ollama-local - Local inference with quantized models
  • embeddings - Quantized embedding models

React Compiler Migration

React Compiler Migration Guide

Adopting React 19's automatic memoization.

What is React Compiler?

React Compiler automatically memoizes components and values, eliminating the need for manual useMemo, useCallback, and React.memo in most cases.

Prerequisites

  • React 19+
  • Compatible framework (Next.js 16+, Expo SDK 54+)
  • Code follows Rules of React

Quick Setup

Next.js 16+

// next.config.js
const nextConfig = {
  reactCompiler: true,
}

module.exports = nextConfig

Expo SDK 54+

Enabled by default in new projects.

Babel (Manual)

npm install -D babel-plugin-react-compiler

// babel.config.js
module.exports = {
  plugins: [
    ['babel-plugin-react-compiler', {
      // Optional: sources to compile
      sources: (filename) => {
        return filename.indexOf('src') !== -1
      },
    }],
  ],
}

Verification

  1. Open React DevTools in browser
  2. Go to Components tab
  3. Look for "Memo ✨" badge next to component names
  4. If you see the sparkle emoji, compiler is working

What Gets Optimized

The compiler automatically memoizes:

| Before (Manual) | After (Compiler) |
|---|---|
| React.memo(Component) | Component re-renders only when needed |
| useMemo(() => value, [deps]) | Intermediate values cached |
| useCallback(() => fn, [deps]) | Callback references stable |
| Conditional JSX | JSX elements memoized |

Rules of React (Must Follow)

For the compiler to work correctly:

1. Components Must Be Idempotent

// ✅ Same input → same output
function Profile({ user }) {
  return <h1>{user.name}</h1>
}

// ❌ Non-deterministic
function Profile({ user }) {
  return <h1>{user.name} at {Date.now()}</h1>
}

2. Props and State Are Immutable

// ✅ Create new object
setUser({ ...user, name: 'New Name' })

// ❌ Mutate existing
user.name = 'New Name'
setUser(user)

3. Side Effects Outside Render

// ✅ In useEffect
useEffect(() => {
  analytics.track('view')
}, [])

// ❌ During render
function Component() {
  analytics.track('view') // BAD
  return <div>...</div>
}

4. Hooks at Top Level

// ✅ Always at top
function Component() {
  const [state, setState] = useState()
  // ...
}

// ❌ Conditional hooks
function Component({ show }) {
  if (show) {
    const [state, setState] = useState() // BAD
  }
}

Migration Strategy

New Projects

Enable compiler immediately. No reason not to.

Existing Projects

  1. Enable compiler in config
  2. Run tests to catch issues
  3. Check DevTools for Memo badges
  4. Gradually remove manual memoization

// Before (manual)
const MemoizedChild = React.memo(Child)
const memoizedValue = useMemo(() => compute(data), [data])
const handleClick = useCallback(() => onClick(id), [id, onClick])

// After (compiler handles it)
// Just use Child, compute(data), and onClick directly
// Compiler determines what needs memoization

When Manual Memoization Still Needed

Keep useMemo/useCallback for:

// 1. Effect dependencies that shouldn't trigger re-runs
const stableConfig = useMemo(() => ({
  apiUrl: process.env.API_URL,
  timeout: 5000,
}), [])

useEffect(() => {
  initSDK(stableConfig) // Should only run once
}, [stableConfig])

// 2. Third-party libraries without compiler support
const memoizedData = useMemo(() =>
  thirdPartyLib.transform(data), [data])

// 3. Precise control over boundaries
const handleSubmit = useCallback(async () => {
  // Complex async logic that must be stable
}, [criticalDep])

Debugging Issues

Component Not Getting Memo Badge

  1. Check if file is in compiler's sources
  2. Look for Rules of React violations
  3. Check for unsupported patterns

Performance Regression

  1. Profile with React DevTools
  2. Check if compiler skipped problematic code
  3. Add manual memoization as escape hatch

Compatibility Notes

  • Works with existing useMemo/useCallback (won't double-memoize)
  • Safe to leave existing memoization during migration
  • Compiler output is equivalent to manual optimization

Route Splitting

Route-Based Code Splitting

React Router 7.x Lazy Routes

import { createBrowserRouter } from 'react-router';

// Define lazy routes
const routes = [
  {
    path: '/',
    lazy: () => import('./pages/Home'),
  },
  {
    path: '/dashboard',
    lazy: () => import('./pages/Dashboard'),
    children: [
      {
        path: 'analytics',
        lazy: () => import('./pages/Analytics'),
      },
      {
        path: 'settings',
        lazy: () => import('./pages/Settings'),
      },
    ],
  },
];

const router = createBrowserRouter(routes);

Vite Manual Chunks

// vite.config.ts
export default defineConfig({
  build: {
    rollupOptions: {
      output: {
        manualChunks: {
          // Vendor chunks
          'react-vendor': ['react', 'react-dom', 'react-router'],
          'query-vendor': ['@tanstack/react-query'],

          // Feature chunks (match route structure)
          'dashboard': [
            './src/pages/Dashboard',
            './src/pages/Analytics',
          ],
          'settings': [
            './src/pages/Settings',
            './src/pages/Profile',
          ],
        },
      },
    },
  },
});

Prefetch on Route Hover

import { useQueryClient } from '@tanstack/react-query';
import { Link } from 'react-router';

function NavLink({ to, children }: { to: string; children: React.ReactNode }) {
  const queryClient = useQueryClient();

  const prefetch = () => {
    // Prefetch route data
    queryClient.prefetchQuery({
      queryKey: ['route', to],
      queryFn: () => fetchRouteData(to),
    });
  };

  return (
    <Link
      to={to}
      onMouseEnter={prefetch}
      onFocus={prefetch}
    >
      {children}
    </Link>
  );
}

Bundle Size Monitoring

# After build, check chunk sizes
npx vite build
# Output shows chunk sizes

# For detailed analysis
npx vite-bundle-visualizer

Rum Setup

Real User Monitoring (RUM) Setup

Complete guide to implementing Real User Monitoring for Core Web Vitals.

┌──────────────────────────────────────────────────────────────┐
│                        RUM Data Flow                         │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  Browser                     Server              Analytics   │
│  ┌─────────┐   ┌────────────┐                 ┌────────────┐ │
│  │  User   │──►│ web-vitals │                 │ Dashboard  │ │
│  │ Session │   │  library   │                 │  + Alerts  │ │
│  └─────────┘   └─────┬──────┘                 └─────▲──────┘ │
│                      │                              │        │
│         ┌────────────▼────────────┐                 │        │
│         │   sendBeacon / fetch    │                 │        │
│         │   (keepalive: true)     │                 │        │
│         └────────────┬────────────┘                 │        │
│                      │                              │        │
│         ┌────────────▼────────────┐     metrics     │        │
│         │      /api/vitals        │─────────────────┘        │
│         │   (batch + process)     │                          │
│         └─────────────────────────┘                          │
│                                                              │
└──────────────────────────────────────────────────────────────┘

web-vitals Library Setup

Installation

npm install web-vitals
# or
pnpm add web-vitals

Basic Implementation

// lib/vitals.ts
import {
  onCLS,
  onINP,
  onLCP,
  onFCP,
  onTTFB,
  type Metric,
  type ReportOpts,
} from 'web-vitals';

// Metric type for your analytics
export interface VitalsMetric {
  name: 'CLS' | 'INP' | 'LCP' | 'FCP' | 'TTFB';
  value: number;
  rating: 'good' | 'needs-improvement' | 'poor';
  delta: number;
  id: string;
  navigationType: 'navigate' | 'reload' | 'back-forward' | 'back-forward-cache' | 'prerender';
  // Custom metadata
  url: string;
  userAgent: string;
  connectionType?: string;
  deviceMemory?: number;
  timestamp: number;
}

// Collect device and connection info for debugging
function getDeviceInfo(): Partial<VitalsMetric> {
  const nav = navigator as Navigator & {
    connection?: { effectiveType?: string };
    deviceMemory?: number;
  };

  return {
    userAgent: navigator.userAgent,
    connectionType: nav.connection?.effectiveType,
    deviceMemory: nav.deviceMemory,
  };
}

function createMetricPayload(metric: Metric): VitalsMetric {
  return {
    name: metric.name as VitalsMetric['name'],
    value: metric.value,
    rating: metric.rating,
    delta: metric.delta,
    id: metric.id,
    navigationType: metric.navigationType,
    url: window.location.href,
    timestamp: Date.now(),
    ...getDeviceInfo(),
  };
}

// Reliable transmission even during page unload
function sendToAnalytics(metric: Metric) {
  const payload = createMetricPayload(metric);
  const body = JSON.stringify(payload);

  // sendBeacon is most reliable for unload scenarios
  if (navigator.sendBeacon) {
    navigator.sendBeacon('/api/vitals', body);
  } else {
    // Fallback with keepalive for browsers without sendBeacon
    fetch('/api/vitals', {
      method: 'POST',
      body,
      headers: { 'Content-Type': 'application/json' },
      keepalive: true, // Keeps request alive even if page unloads
    });
  }
}

// Report all web vitals
export function reportWebVitals(opts?: ReportOpts) {
  // Core Web Vitals (affect SEO)
  onCLS(sendToAnalytics, opts);
  onINP(sendToAnalytics, opts);
  onLCP(sendToAnalytics, opts);

  // Additional useful metrics
  onFCP(sendToAnalytics, opts);
  onTTFB(sendToAnalytics, opts);
}

Next.js App Router Integration

Client Component for Vitals

// app/components/web-vitals.tsx
'use client';

import { useEffect } from 'react';
import { reportWebVitals } from '@/lib/vitals';

export function WebVitals() {
  useEffect(() => {
    // Report immediately (first value)
    reportWebVitals({ reportAllChanges: false });
  }, []);

  return null;
}

// For debugging during development
export function WebVitalsDebug() {
  useEffect(() => {
    // Report all changes, not just final values
    reportWebVitals({ reportAllChanges: true });
  }, []);

  return null;
}

Layout Integration

// app/layout.tsx
import { WebVitals } from '@/components/web-vitals';

export default function RootLayout({
  children,
}: {
  children: React.ReactNode;
}) {
  return (
    <html lang="en">
      <body>
        <WebVitals />
        {children}
      </body>
    </html>
  );
}

API Endpoint Implementation

Next.js Route Handler

// app/api/vitals/route.ts
import { NextRequest, NextResponse } from 'next/server';

// Thresholds from web.dev
const THRESHOLDS = {
  LCP: { good: 2500, poor: 4000 },
  INP: { good: 200, poor: 500 },
  CLS: { good: 0.1, poor: 0.25 },
  FCP: { good: 1800, poor: 3000 },
  TTFB: { good: 800, poor: 1800 },
} as const;

// 2026 thresholds (plan ahead!)
const THRESHOLDS_2026 = {
  LCP: { good: 2000, poor: 4000 },
  INP: { good: 150, poor: 500 },
  CLS: { good: 0.08, poor: 0.25 },
} as const;

interface VitalsMetric {
  name: string;
  value: number;
  rating: string;
  delta: number;
  id: string;
  navigationType: string;
  url: string;
  userAgent: string;
  connectionType?: string;
  deviceMemory?: number;
  timestamp: number;
}

// Validate incoming metric
function isValidMetric(data: unknown): data is VitalsMetric {
  if (!data || typeof data !== 'object') return false;
  const metric = data as Record<string, unknown>;
  return (
    typeof metric.name === 'string' &&
    typeof metric.value === 'number' &&
    typeof metric.rating === 'string'
  );
}

export async function POST(request: NextRequest) {
  try {
    const metric = await request.json();

    if (!isValidMetric(metric)) {
      return NextResponse.json(
        { error: 'Invalid metric format' },
        { status: 400 }
      );
    }

    // Enrich with server-side data
    const enrichedMetric = {
      ...metric,
      receivedAt: new Date().toISOString(),
      clientIP: request.headers.get('x-forwarded-for') ?? 'unknown',
      country: request.headers.get('x-vercel-ip-country') ?? 'unknown',
    };

    // Log for debugging (replace with your analytics service)
    console.log('[Web Vital]', JSON.stringify(enrichedMetric));

    // Store in your analytics database
    await storeMetric(enrichedMetric);

    // Alert on poor metrics (optional)
    if (metric.rating === 'poor') {
      await alertOnPoorMetric(enrichedMetric);
    }

    return NextResponse.json({ received: true });
  } catch (error) {
    console.error('[Vitals API Error]', error);
    return NextResponse.json(
      { error: 'Failed to process metric' },
      { status: 500 }
    );
  }
}

// Example: Store in PostgreSQL
async function storeMetric(metric: VitalsMetric & { receivedAt: string }) {
  // Replace with your database client
  // await db.insert('web_vitals').values({
  //   name: metric.name,
  //   value: metric.value,
  //   rating: metric.rating,
  //   url: metric.url,
  //   user_agent: metric.userAgent,
  //   connection_type: metric.connectionType,
  //   timestamp: new Date(metric.timestamp),
  //   received_at: new Date(metric.receivedAt),
  // });
}

// Example: Alert via Slack/PagerDuty
async function alertOnPoorMetric(metric: VitalsMetric) {
  const threshold = THRESHOLDS[metric.name as keyof typeof THRESHOLDS];
  if (!threshold) return;

  // await fetch(process.env.SLACK_WEBHOOK_URL!, {
  //   method: 'POST',
  //   body: JSON.stringify({
  //     text: `🚨 Poor ${metric.name}: ${metric.value}${metric.name === 'CLS' ? '' : 'ms'} on ${metric.url}`,
  //   }),
  // });
}

Batching for High-Traffic Sites

// lib/vitals-batched.ts
import { onCLS, onINP, onLCP, type Metric } from 'web-vitals';

const BATCH_SIZE = 10;
const FLUSH_INTERVAL = 5000; // 5 seconds

class MetricsBatcher {
  private queue: Metric[] = [];
  private flushTimer: ReturnType<typeof setTimeout> | null = null;

  add(metric: Metric) {
    this.queue.push(metric);

    if (this.queue.length >= BATCH_SIZE) {
      this.flush();
    } else if (!this.flushTimer) {
      this.flushTimer = setTimeout(() => this.flush(), FLUSH_INTERVAL);
    }
  }

  private flush() {
    if (this.queue.length === 0) return;

    const metrics = [...this.queue];
    this.queue = [];

    if (this.flushTimer) {
      clearTimeout(this.flushTimer);
      this.flushTimer = null;
    }

    // Send batch
    navigator.sendBeacon(
      '/api/vitals/batch',
      JSON.stringify({ metrics, timestamp: Date.now() })
    );
  }

  // Flush on page unload
  flushSync() {
    if (this.flushTimer) {
      clearTimeout(this.flushTimer);
      this.flushTimer = null;
    }
    this.flush();
  }
}

const batcher = new MetricsBatcher();

// Ensure flush on unload
if (typeof window !== 'undefined') {
  window.addEventListener('visibilitychange', () => {
    if (document.visibilityState === 'hidden') {
      batcher.flushSync();
    }
  });
}

export function reportWebVitalsBatched() {
  onCLS((metric) => batcher.add(metric));
  onINP((metric) => batcher.add(metric));
  onLCP((metric) => batcher.add(metric));
}

Database Schema

PostgreSQL Schema

CREATE TABLE web_vitals (
  id BIGSERIAL,
  name VARCHAR(10) NOT NULL,
  value DECIMAL(10, 4) NOT NULL,
  rating VARCHAR(20) NOT NULL,
  delta DECIMAL(10, 4),
  metric_id VARCHAR(50),
  navigation_type VARCHAR(30),
  url TEXT NOT NULL,
  user_agent TEXT,
  connection_type VARCHAR(20),
  device_memory INT,
  client_ip INET,
  country VARCHAR(2),
  timestamp TIMESTAMPTZ NOT NULL,
  received_at TIMESTAMPTZ DEFAULT NOW(),
  PRIMARY KEY (id, timestamp)  -- the partition key must be part of the PK
) PARTITION BY RANGE (timestamp);

-- Indexes for common queries (PostgreSQL has no inline INDEX syntax)
CREATE INDEX idx_vitals_name_timestamp ON web_vitals (name, timestamp DESC);
CREATE INDEX idx_vitals_url ON web_vitals (url);
CREATE INDEX idx_vitals_rating ON web_vitals (rating);

-- One partition per month for large datasets
CREATE TABLE web_vitals_2025_01 PARTITION OF web_vitals
  FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

Analytics Queries

-- Daily Core Web Vitals summary (p75 is Google's standard)
SELECT
  DATE(timestamp) as date,
  name,
  PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) as p75,
  COUNT(CASE WHEN rating = 'good' THEN 1 END)::float / COUNT(*) * 100 as good_pct,
  COUNT(*) as samples
FROM web_vitals
WHERE timestamp > NOW() - INTERVAL '30 days'
  AND name IN ('LCP', 'INP', 'CLS')
GROUP BY DATE(timestamp), name
ORDER BY date DESC, name;

-- Worst performing pages by LCP
SELECT
  url,
  PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) as p75_lcp,
  COUNT(*) as samples
FROM web_vitals
WHERE name = 'LCP'
  AND timestamp > NOW() - INTERVAL '7 days'
GROUP BY url
HAVING COUNT(*) > 100
ORDER BY p75_lcp DESC
LIMIT 20;

-- Performance by connection type
SELECT
  connection_type,
  name,
  PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) as p75,
  COUNT(*) as samples
FROM web_vitals
WHERE timestamp > NOW() - INTERVAL '7 days'
  AND connection_type IS NOT NULL
GROUP BY connection_type, name
ORDER BY connection_type, name;

-- Trend analysis: Week-over-week comparison
WITH current_week AS (
  SELECT name, PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) as p75
  FROM web_vitals
  WHERE timestamp > NOW() - INTERVAL '7 days'
  GROUP BY name
),
previous_week AS (
  SELECT name, PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) as p75
  FROM web_vitals
  WHERE timestamp BETWEEN NOW() - INTERVAL '14 days' AND NOW() - INTERVAL '7 days'
  GROUP BY name
)
SELECT
  c.name,
  c.p75 as current_p75,
  p.p75 as previous_p75,
  ROUND((c.p75 - p.p75) / p.p75 * 100, 2) as change_pct
FROM current_week c
JOIN previous_week p ON c.name = p.name;
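
The queries above lean on `PERCENTILE_CONT(0.75)`, Google's standard p75 aggregation. As a sanity check, the same linearly interpolated percentile can be computed from raw samples in Python — `statistics.quantiles` with `method="inclusive"` matches `PERCENTILE_CONT`'s interpolation:

```python
from statistics import quantiles

def p75(values: list[float]) -> float:
    """Linearly interpolated 75th percentile, matching PERCENTILE_CONT(0.75)."""
    # quantiles(n=4) returns the quartile cut points [q1, median, q3]
    return quantiles(values, n=4, method="inclusive")[2]

# Example: p75([1, 2, 3, 4]) interpolates between 3 and 4 -> 3.25
```

Useful for spot-checking a dashboard number against a handful of exported rows.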

Grafana Dashboard

Prometheus Metrics Export

// lib/metrics-exporter.ts
import { Histogram, Counter, Registry } from 'prom-client';

const registry = new Registry();

// Histogram for percentile calculations.
// prom-client allows only one bucket list per histogram, so merge the
// CLS (sub-1), INP (ms), and LCP (ms) ranges into one sorted set.
const webVitalsHistogram = new Histogram({
  name: 'web_vitals_value',
  help: 'Web Vitals metric values',
  labelNames: ['name', 'rating'],
  buckets: [
    0.01, 0.05, 0.1, 0.25, 0.5, // CLS
    50, 100, 150, 200, 300, 500, // INP (ms)
    1000, 1500, 2000, 2500, 3000, 4000, 5000, // LCP (ms)
  ],
  registers: [registry],
});

const webVitalsCounter = new Counter({
  name: 'web_vitals_total',
  help: 'Total count of Web Vitals reports',
  labelNames: ['name', 'rating'],
  registers: [registry],
});

export function recordMetric(name: string, value: number, rating: string) {
  webVitalsHistogram.labels(name, rating).observe(value);
  webVitalsCounter.labels(name, rating).inc();
}

export { registry };

Grafana Alert Rules

# grafana-alerts.yaml
groups:
  - name: core-web-vitals
    interval: 5m
    rules:
      # LCP Alert
      - alert: HighLCP
        expr: histogram_quantile(0.75, sum(rate(web_vitals_value_bucket{name="LCP"}[15m])) by (le)) > 2500
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "LCP p75 is {{ $value | printf \"%.0f\" }}ms (threshold: 2500ms)"
          description: "Largest Contentful Paint has degraded. Check recent deployments."

      # INP Alert
      - alert: HighINP
        expr: histogram_quantile(0.75, sum(rate(web_vitals_value_bucket{name="INP"}[15m])) by (le)) > 200
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "INP p75 is {{ $value | printf \"%.0f\" }}ms (threshold: 200ms)"
          description: "Interaction to Next Paint has degraded. Check for long tasks."

      # CLS Alert
      - alert: HighCLS
        expr: histogram_quantile(0.75, sum(rate(web_vitals_value_bucket{name="CLS"}[15m])) by (le)) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CLS p75 is {{ $value | printf \"%.3f\" }} (threshold: 0.1)"
          description: "Cumulative Layout Shift has degraded. Check for layout shifts."

      # Good rate dropping
      - alert: GoodRateDrop
        expr: |
          (sum(rate(web_vitals_total{rating="good"}[1h])) by (name) /
           sum(rate(web_vitals_total[1h])) by (name)) < 0.75
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.name }} good rate dropped below 75%"
          description: "Less than 75% of users are experiencing good {{ $labels.name }}"

Sampling Strategy for High Traffic

// lib/vitals-sampled.ts
import { onCLS, onINP, onLCP, type Metric } from 'web-vitals';
import { sendToAnalytics } from './vitals'; // export sendToAnalytics from lib/vitals.ts

interface SamplingConfig {
  // Base sample rate (0-1)
  baseRate: number;
  // Always sample poor metrics
  alwaysSamplePoor: boolean;
  // Sample more on specific pages
  pageMultipliers?: Record<string, number>;
}

const DEFAULT_CONFIG: SamplingConfig = {
  baseRate: 0.1, // 10% baseline
  alwaysSamplePoor: true,
  pageMultipliers: {
    '/': 1.0, // Always sample homepage
    '/checkout': 1.0, // Always sample checkout
  },
};

function shouldSample(metric: Metric, config: SamplingConfig): boolean {
  // Always sample poor metrics for debugging
  if (config.alwaysSamplePoor && metric.rating === 'poor') {
    return true;
  }

  // Check page-specific multiplier
  const path = window.location.pathname;
  const multiplier = config.pageMultipliers?.[path] ?? 1;
  const effectiveRate = config.baseRate * multiplier;

  return Math.random() < effectiveRate;
}

export function reportWebVitalsSampled(config = DEFAULT_CONFIG) {
  const report = (metric: Metric) => {
    if (shouldSample(metric, config)) {
      sendToAnalytics(metric);
    }
  };

  onCLS(report);
  onINP(report);
  onLCP(report);
}

Testing RUM in Development

// lib/vitals-dev.ts
import { onCLS, onINP, onLCP, type Metric } from 'web-vitals';

const RATING_COLORS = {
  good: 'color: green',
  'needs-improvement': 'color: orange',
  poor: 'color: red',
} as const;

function logToConsole(metric: Metric) {
  const color = RATING_COLORS[metric.rating];
  const unit = metric.name === 'CLS' ? '' : 'ms';

  console.log(
    `%c[${metric.name}] ${metric.value.toFixed(2)}${unit} (${metric.rating})`,
    color,
    {
      delta: metric.delta,
      id: metric.id,
      navigationType: metric.navigationType,
    }
  );
}

export function reportWebVitalsDev() {
  // Report all changes for debugging
  onCLS(logToConsole, { reportAllChanges: true });
  onINP(logToConsole, { reportAllChanges: true });
  onLCP(logToConsole, { reportAllChanges: true });
}

// Usage in development
if (process.env.NODE_ENV === 'development') {
  reportWebVitalsDev();
}

Integration with Analytics Providers

Google Analytics 4

// lib/vitals-ga4.ts
import { onCLS, onINP, onLCP, type Metric } from 'web-vitals';

declare global {
  interface Window {
    gtag?: (...args: unknown[]) => void;
  }
}

function sendToGA4(metric: Metric) {
  if (typeof window.gtag !== 'function') return;

  window.gtag('event', metric.name, {
    event_category: 'Web Vitals',
    event_label: metric.id,
    value: Math.round(metric.name === 'CLS' ? metric.value * 1000 : metric.value),
    metric_rating: metric.rating,
    non_interaction: true,
  });
}

export function reportWebVitalsGA4() {
  onCLS(sendToGA4);
  onINP(sendToGA4);
  onLCP(sendToGA4);
}

Vercel Analytics

// Next.js built-in support
// next.config.js
module.exports = {
  // Vercel Analytics automatically collects Web Vitals
  // No additional setup needed when deployed on Vercel
};

// For self-hosted, use @vercel/analytics
import { Analytics } from '@vercel/analytics/react';

export default function RootLayout({ children }) {
  return (
    <html>
      <body>
        {children}
        <Analytics />
      </body>
    </html>
  );
}

Speculative Decoding

Overview

Speculative decoding accelerates autoregressive generation by letting a cheap drafter propose several tokens at once, which the target model then verifies in a single parallel forward pass.

How it works:

  1. Draft model (or n-gram) proposes N candidate tokens
  2. Target model verifies all N tokens in one forward pass
  3. Accept verified tokens, reject incorrect ones
  4. Repeat from first rejected position

Expected gains: 1.5-2.5x throughput for compatible workloads.
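
The four steps above can be sketched as a toy greedy accept/verify loop. `draft` and `target` here are stand-in callables that return one next token per call — not vLLM APIs:

```python
def speculative_step(prefix: list, draft, target, k: int = 5) -> list:
    """One speculative decoding step: draft k tokens, verify, keep the accepted run."""
    # 1. Draft model proposes k candidate tokens autoregressively
    candidates = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        candidates.append(tok)
        ctx.append(tok)

    # 2./3. Target verifies every position (one parallel pass in practice);
    # accept the matching prefix, replace the first mismatch with the target's token
    accepted = []
    ctx = list(prefix)
    for tok in candidates:
        expect = target(ctx)
        if tok != expect:
            accepted.append(expect)  # 4. resume from first rejected position
            break
        accepted.append(tok)
        ctx.append(tok)
    else:
        accepted.append(target(ctx))  # bonus token when all k were accepted
    return accepted
```

With a perfect drafter every step yields k+1 tokens for one target pass, which is where the throughput gain comes from.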


N-gram Speculation

No extra model needed - uses prompt patterns:

# vLLM CLI with n-gram speculation
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
    --speculative-config '{
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 5,
        "prompt_lookup_min": 2
    }'

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 5,
        "prompt_lookup_min": 2,
    },
)

# Works best with repetitive/structured output
outputs = llm.generate(
    ["Generate a JSON object with user data:"],
    SamplingParams(max_tokens=500),
)

Best for:

  • Structured output (JSON, code)
  • Repetitive patterns
  • Low additional memory

Draft Model Speculation

Use a smaller model to draft tokens:

# Draft model speculation
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
    --speculative-config '{
        "method": "draft_model",
        "draft_model": "meta-llama/Llama-3.2-1B-Instruct",
        "num_speculative_tokens": 3
    }' \
    --tensor-parallel-size 4

from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    speculative_config={
        "method": "draft_model",
        "draft_model": "meta-llama/Llama-3.2-1B-Instruct",
        "num_speculative_tokens": 3,
    },
    tensor_parallel_size=4,
)

Draft model selection:

Target Model    Recommended Draft    Size Ratio
70B             7B or 8B             ~10%
70B             1B-3B                ~2-5%
8B              1B                   ~12%
405B            8B-70B               ~2-17%

Medusa-style Speculation

Multiple prediction heads for parallel token generation:

# Medusa-style model (requires trained heads)
llm = LLM(
    model="lmsys/vicuna-7b-v1.5-16k-medusa",
    speculative_config={
        "method": "medusa",
        "num_heads": 4,  # Number of speculation heads
    },
)

Advantages:

  • No separate draft model
  • Lower memory than draft model
  • Works well with fine-tuned models

Performance Tuning

Optimal Token Count

# Benchmark different speculation depths
for num_tokens in [1, 3, 5, 7]:
    llm = LLM(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",
        speculative_config={
            "method": "ngram",
            "num_speculative_tokens": num_tokens,
        },
    )
    throughput = benchmark(llm)  # benchmark() is your own throughput harness
    print(f"Tokens: {num_tokens}, Throughput: {throughput:.1f} tok/s")

General guidelines:

Scenario            Recommended Tokens
Code generation     5-7
JSON output         5-7
Free-form text      2-4
Creative writing    1-3

Acceptance Rate Monitoring

# vLLM logs acceptance rates
# Look for: "Speculative decoding acceptance rate: X%"

# High acceptance (>70%): Increase num_speculative_tokens
# Low acceptance (<40%): Decrease or disable speculation

When NOT to Use

Speculative decoding may hurt performance when:

  1. High randomness (temperature > 1.0)
  2. Short outputs (overhead > benefit)
  3. Diverse outputs (low acceptance rate)
  4. Memory constrained (draft model overhead)

# Disable speculation for creative tasks
sampling_params = SamplingParams(
    temperature=1.2,
    top_p=0.95,
    max_tokens=100,  # Short output
)
# Use standard decoding instead

Benchmarking

import time
from vllm import LLM, SamplingParams

def benchmark_speculation(model_path: str, prompts: list[str]):
    """Compare with and without speculative decoding."""

    # Without speculation
    llm_base = LLM(model=model_path)
    start = time.perf_counter()
    outputs_base = llm_base.generate(prompts, SamplingParams(max_tokens=512))
    time_base = time.perf_counter() - start

    # With speculation
    llm_spec = LLM(
        model=model_path,
        speculative_config={
            "method": "ngram",
            "num_speculative_tokens": 5,
        },
    )
    start = time.perf_counter()
    outputs_spec = llm_spec.generate(prompts, SamplingParams(max_tokens=512))
    time_spec = time.perf_counter() - start

    tokens_base = sum(len(o.outputs[0].token_ids) for o in outputs_base)
    tokens_spec = sum(len(o.outputs[0].token_ids) for o in outputs_spec)

    print(f"Baseline: {tokens_base/time_base:.1f} tok/s")
    print(f"Speculative: {tokens_spec/time_spec:.1f} tok/s")
    print(f"Speedup: {(time_base/time_spec):.2f}x")


# JSON/code prompts benefit most
prompts = [
    "Generate a Python function that implements binary search:",
    "Create a JSON schema for a user profile with validation:",
    "Write a SQL query to find top 10 customers by revenue:",
]
benchmark_speculation("meta-llama/Meta-Llama-3.1-8B-Instruct", prompts)

Related rules:

  • llm-streaming - Streaming with speculation
  • prompt-caching - Combine with prefix caching

State Colocation

Keep state as close to where it's used as possible.

The Principle

State should live in the component that needs it. Only lift state when truly necessary for sibling communication.

Problem: State Too High

// ❌ State at app level causes unnecessary re-renders
function App() {
  const [searchQuery, setSearchQuery] = useState('')
  const [selectedId, setSelectedId] = useState(null)

  return (
    <div>
      <Header />                    {/* Re-renders on search! */}
      <Sidebar />                   {/* Re-renders on search! */}
      <SearchInput
        value={searchQuery}
        onChange={setSearchQuery}
      />
      <SearchResults
        query={searchQuery}
        selectedId={selectedId}
        onSelect={setSelectedId}
      />
      <Footer />                    {/* Re-renders on search! */}
    </div>
  )
}

Solution: Colocate State

// ✅ State colocated with components that use it
function App() {
  return (
    <div>
      <Header />
      <Sidebar />
      <SearchSection />  {/* Contains its own state */}
      <Footer />
    </div>
  )
}

function SearchSection() {
  const [searchQuery, setSearchQuery] = useState('')
  const [selectedId, setSelectedId] = useState(null)

  return (
    <>
      <SearchInput
        value={searchQuery}
        onChange={setSearchQuery}
      />
      <SearchResults
        query={searchQuery}
        selectedId={selectedId}
        onSelect={setSelectedId}
      />
    </>
  )
}

When to Lift State

Lift state ONLY when:

  1. Siblings need to share it
// Both components need selectedUser
function Parent() {
  const [selectedUser, setSelectedUser] = useState(null)

  return (
    <>
      <UserList onSelect={setSelectedUser} selected={selectedUser} />
      <UserDetails user={selectedUser} />
    </>
  )
}
  1. Parent needs to coordinate
// Parent manages form submission
function Form() {
  const [values, setValues] = useState({})

  const handleSubmit = () => {
    api.submit(values)
  }

  return (
    <>
      <FormFields values={values} onChange={setValues} />
      <SubmitButton onClick={handleSubmit} />
    </>
  )
}

Component Splitting

Split components to isolate state:

// ❌ Before: Counter re-renders entire card
function Card() {
  const [count, setCount] = useState(0)

  return (
    <div className="card">
      <ExpensiveHeader />           {/* Re-renders on count change */}
      <ExpensiveContent />          {/* Re-renders on count change */}
      <button onClick={() => setCount(c => c + 1)}>
        Count: {count}
      </button>
    </div>
  )
}

// ✅ After: Counter isolated
function Card() {
  return (
    <div className="card">
      <ExpensiveHeader />           {/* Doesn't re-render */}
      <ExpensiveContent />          {/* Doesn't re-render */}
      <Counter />                   {/* Only this re-renders */}
    </div>
  )
}

function Counter() {
  const [count, setCount] = useState(0)
  return (
    <button onClick={() => setCount(c => c + 1)}>
      Count: {count}
    </button>
  )
}

Context for Cross-Cutting Concerns

Use Context for truly global state, not local UI state:

// ✅ Good: Theme is app-wide
<ThemeContext.Provider value={theme}>
  <App />
</ThemeContext.Provider>

// ✅ Good: Auth is app-wide
<AuthContext.Provider value={user}>
  <App />
</AuthContext.Provider>

// ❌ Bad: Search query is local
<SearchQueryContext.Provider value={query}>  {/* Don't do this */}
  <Header />
  <SearchResults />
</SearchQueryContext.Provider>

Context Splitting

Split contexts to prevent unnecessary re-renders:

// ❌ Single context - all consumers re-render
const AppContext = createContext({ user, theme, locale })

// ✅ Split contexts - targeted re-renders
const UserContext = createContext(null)
const ThemeContext = createContext('light')
const LocaleContext = createContext('en')

Signs State Should Move

Move state DOWN when:

  • Only one component uses it
  • Child components don't need it
  • Re-renders are affecting unrelated components

Move state UP when:

  • Multiple children need to read it
  • Children need to update each other
  • State represents shared domain concept

Quick Checklist

  • Is state used by only one component? → Keep it there
  • Do siblings need this state? → Lift to parent
  • Is it causing unnecessary re-renders? → Consider splitting
  • Is it truly global? → Use Context
  • Is it URL state? → Use router params

TanStack Virtual Patterns

Efficient virtualization for large lists and grids.

When to Virtualize

Item Count    Recommendation
< 50          Not needed
50-100        Consider if items are complex
100-500       Recommended
500+          Required

Basic List Virtualization

import { useVirtualizer } from '@tanstack/react-virtual'

function VirtualList({ items }) {
  const parentRef = useRef<HTMLDivElement>(null)

  const virtualizer = useVirtualizer({
    count: items.length,
    getScrollElement: () => parentRef.current,
    estimateSize: () => 50, // Estimated row height in px
    overscan: 5, // Render 5 extra items for smooth scrolling
  })

  return (
    <div
      ref={parentRef}
      style={{ height: '400px', overflow: 'auto' }}
    >
      <div
        style={{
          height: `${virtualizer.getTotalSize()}px`,
          width: '100%',
          position: 'relative',
        }}
      >
        {virtualizer.getVirtualItems().map((virtualItem) => (
          <div
            key={virtualItem.key}
            style={{
              position: 'absolute',
              top: 0,
              left: 0,
              width: '100%',
              height: `${virtualItem.size}px`,
              transform: `translateY(${virtualItem.start}px)`,
            }}
          >
            {items[virtualItem.index].name}
          </div>
        ))}
      </div>
    </div>
  )
}

Variable Height Rows

For rows with different heights:

function VariableHeightList({ items }) {
  const parentRef = useRef<HTMLDivElement>(null)

  const virtualizer = useVirtualizer({
    count: items.length,
    getScrollElement: () => parentRef.current,
    estimateSize: (index) => {
      // Return estimated height based on content
      return items[index].type === 'header' ? 80 : 50
    },
    overscan: 5,
  })

  return (
    <div ref={parentRef} style={{ height: '400px', overflow: 'auto' }}>
      <div
        style={{
          height: `${virtualizer.getTotalSize()}px`,
          position: 'relative',
        }}
      >
        {virtualizer.getVirtualItems().map((virtualItem) => (
          <div
            key={virtualItem.key}
            data-index={virtualItem.index}
            ref={virtualizer.measureElement} // Enable dynamic measurement
            style={{
              position: 'absolute',
              top: 0,
              left: 0,
              width: '100%',
              transform: `translateY(${virtualItem.start}px)`,
            }}
          >
            <ItemComponent item={items[virtualItem.index]} />
          </div>
        ))}
      </div>
    </div>
  )
}

Dynamic Measurement

When content determines height:

const virtualizer = useVirtualizer({
  count: items.length,
  getScrollElement: () => parentRef.current,
  estimateSize: () => 50, // Initial estimate
  // measureElement enables dynamic re-measurement
})

// Add ref to each item
<div
  key={virtualItem.key}
  data-index={virtualItem.index}
  ref={virtualizer.measureElement}
>
  {/* Content with unknown height */}
</div>

Horizontal Virtualization

const columnVirtualizer = useVirtualizer({
  horizontal: true,
  count: columns.length,
  getScrollElement: () => parentRef.current,
  estimateSize: () => 150, // Column width
  overscan: 3,
})

Grid Virtualization

Combine row and column virtualizers:

function VirtualGrid({ rows, columns }) {
  const parentRef = useRef<HTMLDivElement>(null)

  const rowVirtualizer = useVirtualizer({
    count: rows.length,
    getScrollElement: () => parentRef.current,
    estimateSize: () => 50,
    overscan: 5,
  })

  const columnVirtualizer = useVirtualizer({
    horizontal: true,
    count: columns.length,
    getScrollElement: () => parentRef.current,
    estimateSize: () => 100,
    overscan: 3,
  })

  return (
    <div ref={parentRef} style={{ height: '400px', overflow: 'auto' }}>
      <div
        style={{
          height: `${rowVirtualizer.getTotalSize()}px`,
          width: `${columnVirtualizer.getTotalSize()}px`,
          position: 'relative',
        }}
      >
        {rowVirtualizer.getVirtualItems().map((virtualRow) => (
          <React.Fragment key={virtualRow.key}>
            {columnVirtualizer.getVirtualItems().map((virtualColumn) => (
              <div
                key={virtualColumn.key}
                style={{
                  position: 'absolute',
                  top: 0,
                  left: 0,
                  width: `${virtualColumn.size}px`,
                  height: `${virtualRow.size}px`,
                  transform: `translateX(${virtualColumn.start}px) translateY(${virtualRow.start}px)`,
                }}
              >
                {/* Cell content */}
              </div>
            ))}
          </React.Fragment>
        ))}
      </div>
    </div>
  )
}

Scroll to Index

const virtualizer = useVirtualizer({/* ... */})

// Scroll to specific item
virtualizer.scrollToIndex(50, { align: 'start' })

// Align options: 'start' | 'center' | 'end' | 'auto'

Window Scroller

For document-level scrolling:

import { useWindowVirtualizer } from '@tanstack/react-virtual'

function WindowList({ items }) {
  const virtualizer = useWindowVirtualizer({
    count: items.length,
    estimateSize: () => 50,
    overscan: 5,
  })

  return (
    <div
      style={{
        height: `${virtualizer.getTotalSize()}px`,
        position: 'relative',
      }}
    >
      {virtualizer.getVirtualItems().map((virtualItem) => (
        <div
          key={virtualItem.key}
          style={{
            position: 'absolute',
            top: 0,
            left: 0,
            width: '100%',
            transform: `translateY(${virtualItem.start}px)`,
          }}
        >
          {items[virtualItem.index].name}
        </div>
      ))}
    </div>
  )
}

Performance Tips

  1. Use stable keys: Avoid array index as key
  2. Memoize items: If item rendering is expensive
  3. Adjust overscan: More overscan = smoother scroll, more DOM nodes
  4. Measure sparingly: Only use measureElement when needed
  5. Debounce scroll: For very heavy computations

vLLM Deployment

PagedAttention

vLLM's PagedAttention manages KV cache memory in non-contiguous blocks, enabling:

  • Efficient memory: Only allocates what's needed per request
  • Dynamic batching: Handles variable sequence lengths
  • Up to 24x throughput: Compared to naive implementations
from vllm import LLM, SamplingParams

# PagedAttention is enabled by default
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    gpu_memory_utilization=0.9,  # Use 90% GPU memory for KV cache
    max_num_seqs=256,  # Max concurrent sequences
    max_model_len=8192,  # Max context length
)

Continuous Batching

Dynamic batching that doesn't wait for batch completion:

from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams

# Configure async engine for continuous batching
engine_args = AsyncEngineArgs(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    max_num_batched_tokens=8192,  # Max tokens per batch
    max_num_seqs=64,  # Max concurrent sequences
    enable_chunked_prefill=True,  # Better latency for long prompts
)

engine = AsyncLLMEngine.from_engine_args(engine_args)

# Requests are automatically batched
async def generate(prompt: str):
    sampling_params = SamplingParams(max_tokens=512)
    generator = engine.generate(prompt, sampling_params, request_id="req-1")
    async for output in generator:
        yield output.outputs[0].text

CUDA Graphs

Capture and replay CUDA operations for faster execution:

# CUDA graphs are enabled by default; no flag needed
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct

# Disable for debugging (--enforce-eager is a boolean flag)
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --enforce-eager
# Python API
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    enforce_eager=False,  # Enable CUDA graphs
)

Note: CUDA graphs require fixed input shapes. vLLM handles this automatically with padding.


Tensor Parallelism

Scale across multiple GPUs:

# 4-GPU tensor parallelism
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4

# With pipeline parallelism (for very large models)
vllm serve meta-llama/Meta-Llama-3.3-405B-Instruct \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 2
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    distributed_executor_backend="ray",  # For multi-node
)

GPU Requirements:

| Model Size | GPUs (FP16) | GPUs (INT8) | GPUs (AWQ/GPTQ) |
|------------|-------------|-------------|-----------------|
| 7B         | 1           | 1           | 1               |
| 13B        | 1           | 1           | 1               |
| 70B        | 4           | 2           | 1-2             |
| 405B       | 8+          | 4+          | 4+              |

Prefix Caching

Reuse KV cache for shared prompt prefixes:

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,  # Enable prefix caching
)

# Shared system prompt benefits from caching
system_prompt = "You are a helpful assistant. Be concise and accurate."
prompts = [
    f"{system_prompt}\n\nUser: What is Python?",
    f"{system_prompt}\n\nUser: Explain REST APIs.",
    f"{system_prompt}\n\nUser: What is Docker?",
]

# First request computes system prompt KV cache
# Subsequent requests reuse cached prefix
outputs = llm.generate(prompts, SamplingParams(max_tokens=256))

Benefits:

  • Reduced TTFT (time to first token) for shared prefixes
  • Lower GPU memory for batch requests
  • Ideal for: chat systems, RAG with fixed context
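A back-of-envelope way to see why shared prefixes help (a character-level approximation with a hypothetical helper; real savings depend on tokenization and vLLM's cache-block granularity):

```python
def prefix_cache_savings(prompts: list[str], shared_prefix: str) -> float:
    """Fraction of total prompt characters that can be served from a
    cached prefix: the first request computes it, the rest reuse it."""
    total = sum(len(p) for p in prompts)
    cached = len(shared_prefix) * (len(prompts) - 1)
    return cached / total if total else 0.0
```

With three prompts sharing an 8-character prefix out of 10 characters each, roughly half the prompt processing is reusable, which is why long shared system prompts benefit most.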

Production Server Configuration

# Production vLLM server
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --max-model-len 8192 \
    --max-num-seqs 128 \
    --gpu-memory-utilization 0.9 \
    --enable-prefix-caching \
    --disable-log-requests \
    --api-key $VLLM_API_KEY

# With quantization
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
    --quantization awq \
    --dtype half \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.85

OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-api-key",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)

Monitoring and Metrics

vLLM exposes Prometheus metrics:

# Metrics are served by default at the /metrics endpoint
curl http://localhost:8000/metrics

Key metrics:

  • vllm:num_requests_running: Active requests
  • vllm:num_requests_waiting: Queued requests
  • vllm:gpu_cache_usage_perc: KV cache utilization
  • vllm:avg_prompt_throughput_toks_per_s: Input throughput
  • vllm:avg_generation_throughput_toks_per_s: Output throughput
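A minimal sketch for pulling these gauges into a script (a hand-rolled parser for the Prometheus text format; it ignores labels and timestamps, so prefer prometheus_client or an actual scraper in production):

```python
def parse_vllm_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus exposition text into {metric_name: value}.
    Skips HELP/TYPE comment lines; drops label sets; keeps the
    last sample seen per metric name."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]  # strip {label="..."} if present
        try:
            metrics[name] = float(value)
        except ValueError:
            continue
    return metrics
```

Useful for quick alerting scripts, e.g. flagging when `vllm:gpu_cache_usage_perc` approaches 1.0.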

Related skills:

  • observability-monitoring - Production monitoring patterns
  • performance-testing - Load testing inference endpoints

Checklists (5)

Core Web Vitals Optimization Checklist

Comprehensive checklist for achieving and maintaining good Core Web Vitals scores.

Thresholds Reference

| Metric | Good    | Needs Improvement | Poor    |
|--------|---------|-------------------|---------|
| LCP    | ≤ 2.5s  | ≤ 4.0s            | > 4.0s  |
| INP    | ≤ 200ms | ≤ 500ms           | > 500ms |
| CLS    | ≤ 0.1   | ≤ 0.25            | > 0.25  |

2026 Stricter Thresholds (plan ahead!):

  • LCP: ≤ 2.0s
  • INP: ≤ 150ms
  • CLS: ≤ 0.08

LCP (Largest Contentful Paint) ≤ 2.5s

Identify the LCP Element

  • Run Lighthouse to identify LCP element
  • Use Performance Observer to confirm in production
  • LCP is typically: hero image, hero heading, or above-the-fold banner
// Debug: Find LCP element
new PerformanceObserver((list) => {
  const entries = list.getEntries();
  console.log('LCP element:', entries[entries.length - 1].element);
}).observe({ type: 'largest-contentful-paint', buffered: true });

Server Response Time (TTFB)

  • Server response time (TTFB) < 800ms
  • Use edge/CDN for static content
  • Enable HTTP/2 or HTTP/3
  • Compress responses (gzip/brotli)
  • Database queries optimized
  • Caching strategy implemented (Redis, CDN cache)

Critical Resource Loading

  • LCP image has fetchpriority="high" attribute
  • LCP image has loading="eager" (not lazy)
  • LCP image preloaded in <head>
  • Critical CSS inlined or preloaded
  • Font preloaded with crossorigin attribute
  • Preconnect to critical third-party origins
<!-- Preload critical resources -->
<link rel="preload" as="image" href="/hero.webp" fetchpriority="high" />
<link rel="preload" as="font" href="/font.woff2" type="font/woff2" crossorigin />
<link rel="preconnect" href="https://api.example.com" />

Image Optimization

  • LCP image in modern format (WebP/AVIF)
  • Image properly sized (not oversized)
  • Responsive images with srcset
  • Image CDN used (Cloudinary, imgix, Vercel)

Rendering Strategy

  • LCP content rendered server-side (SSR/SSG)
  • LCP content NOT loaded client-side via fetch
  • No render-blocking JavaScript
  • No render-blocking CSS below the fold
  • Third-party scripts deferred
// ✅ GOOD: Server-rendered LCP content
export default async function Page() {
  const hero = await getHeroData();
  return <Hero data={hero} />;
}

// ❌ BAD: Client-loaded LCP content
function Page() {
  const [hero, setHero] = useState(null);
  useEffect(() => { fetchHero().then(setHero); }, []); // Delays LCP!
}

INP (Interaction to Next Paint) ≤ 200ms

Identify Long Tasks

  • Chrome DevTools Performance tab analyzed
  • Long tasks (>50ms) identified
  • Main thread blockers removed/optimized

JavaScript Optimization

  • Heavy computation moved to Web Workers
  • Large arrays processed in chunks with yielding
  • requestIdleCallback used for non-critical work
  • Bundle size minimized (code splitting)
  • Tree shaking enabled
// ✅ GOOD: Yield to main thread
async function processItems(items: Item[]) {
  for (const item of items) {
    processItem(item);
    // Yield so the browser can paint (fallback where scheduler.yield is unsupported)
    await (globalThis.scheduler?.yield?.() ?? new Promise((r) => setTimeout(r, 0)));
  }
}

React Optimization

  • useTransition for non-urgent updates
  • useDeferredValue for expensive derivations
  • Memoization where appropriate (useMemo, memo)
  • Virtualization for long lists (react-window, @tanstack/virtual)
  • Suspense boundaries for code splitting
// ✅ GOOD: Non-blocking state updates
const [isPending, startTransition] = useTransition();

function handleSearch(query: string) {
  setQuery(query); // Urgent: update input
  startTransition(() => {
    setFilteredResults(filter(query)); // Non-urgent: defer
  });
}

Event Handler Optimization

  • No heavy computation in event handlers
  • Handlers don't cause layout thrashing
  • Passive event listeners for scroll/touch
  • Debounced input handlers where appropriate
// ✅ GOOD: Let the loading state paint before the heavy work runs
// (note: startTransition defers renders, not synchronous computation)
onClick={() => {
  setLoading(true);
  setTimeout(() => {
    setResult(heavyComputation());
    setLoading(false);
  }, 0);
}}

// ❌ BAD: Blocking handler
onClick={() => {
  const result = heavyComputation(); // Blocks paint!
  setResult(result);
}}

Animation Performance

  • Animations use transform and opacity only
  • No animations on layout properties (width, height, top, left)
  • will-change used sparingly
  • Animations run at 60fps (checked in DevTools)

CLS (Cumulative Layout Shift) ≤ 0.1

Image Dimensions

  • ALL images have explicit width and height
  • Responsive images use aspect-ratio container
  • fill prop images have sized container
  • No images cause layout shift on load
// ✅ GOOD: Explicit dimensions
<img src="/photo.jpg" width={800} height={600} alt="Photo" />

// ✅ GOOD: Aspect ratio container
<div className="aspect-[16/9]">
  <Image src="/photo.jpg" fill alt="Photo" />
</div>

Dynamic Content

  • Space reserved for dynamic content (ads, embeds)
  • Skeleton loaders match final content size
  • No content inserted above existing content
  • Lazy-loaded content has reserved space
// ✅ GOOD: Reserved space
<div className="min-h-[250px]">
  {ad ? <Ad data={ad} /> : <Skeleton height={250} />}
</div>

Font Loading

  • font-display: optional or swap used
  • Fallback font has size-adjust to match
  • Critical font preloaded
  • System font stack as fallback
/* Fallback with size adjustment */
@font-face {
  font-family: 'Inter Fallback';
  src: local('Arial');
  size-adjust: 107%;
  ascent-override: 90%;
}

body {
  font-family: 'Inter', 'Inter Fallback', sans-serif;
}

Animation Stability

  • Animations use transform, not layout properties
  • Expanding/collapsing uses scaleY, not height
  • Modals/overlays don't shift page content
  • Toast notifications positioned fixed/absolute
/* ✅ GOOD: Transform-based animation */
.drawer {
  transform: translateX(-100%);
  transition: transform 0.3s;
}
.drawer.open {
  transform: translateX(0);
}

/* ❌ BAD: Layout-shifting animation */
.drawer {
  width: 0;
  transition: width 0.3s;
}

Iframes and Embeds

  • Iframes have explicit dimensions
  • Third-party embeds wrapped with sized container
  • Lazy iframes have placeholder

Measurement & Monitoring

Lab Testing

  • Lighthouse CI in build pipeline
  • Performance budgets enforced
  • Regular manual Lighthouse audits
  • Testing on throttled CPU/network

Field Data (RUM)

  • web-vitals library installed
  • Metrics sent to analytics endpoint
  • p75 percentile tracked (Google's standard)
  • Alerts configured for regressions
// Essential RUM setup
import { onLCP, onINP, onCLS } from 'web-vitals';

onLCP(sendToAnalytics);
onINP(sendToAnalytics);
onCLS(sendToAnalytics);

Data Analysis

  • Dashboard showing daily/weekly trends
  • Segmentation by page, device, connection
  • Comparison of lab vs field data
  • Week-over-week regression detection

Alerting

  • Alert when p75 exceeds threshold
  • Alert when good rate drops below 75%
  • Alert on significant week-over-week regression
  • Escalation path defined

Build & Deploy

Performance Budgets

  • Bundle size limits configured
  • Build fails on budget exceeded
  • Per-route budgets for large apps
// webpack.config.js
module.exports = {
  performance: {
    maxAssetSize: 150000, // 150KB
    maxEntrypointSize: 250000, // 250KB
    hints: 'error', // Fail build
  },
};

CI/CD Integration

  • Lighthouse CI runs on PRs
  • Performance regression blocks merge
  • Bundle analyzer report generated
  • Preview deployments for testing

CDN & Caching

  • Static assets on CDN
  • Immutable caching for hashed assets
  • Stale-while-revalidate for HTML
  • Edge caching where appropriate

Debugging Checklist

Slow LCP

  • Check TTFB (server response time)
  • Verify LCP element has fetchpriority="high"
  • Confirm LCP content is server-rendered
  • Check for render-blocking resources
  • Verify image is optimized and properly sized

High INP

  • Run Performance recording during interaction
  • Look for long tasks in flame chart
  • Check for forced synchronous layouts
  • Verify heavy work is deferred
  • Check for excessive re-renders

High CLS

  • Run Lighthouse with "Layout Shift Regions" enabled
  • Check images for missing dimensions
  • Look for late-loading content
  • Verify fonts have fallbacks
  • Check for content inserted above viewport

Testing Protocol

Before Deployment

  • Lighthouse score ≥ 90 on Performance
  • All Core Web Vitals in "good" range
  • No performance budget violations
  • Tested on throttled 4G + slow CPU

After Deployment

  • Monitor RUM for 24-48 hours
  • Compare p75 to pre-deployment baseline
  • Check for unexpected regressions
  • Verify alerting is working

Weekly Review

  • Review p75 trends
  • Identify worst-performing pages
  • Check for new issues in CrUX
  • Plan optimizations for next sprint

Image Optimization Checklist

Comprehensive checklist for production-ready image optimization.

Format Selection

Photo Content

  • Use AVIF as primary format (30-50% smaller than JPEG)
  • Configure WebP as fallback for older browsers
  • JPEG only for browsers without AVIF/WebP support
  • Configure Next.js: formats: ['image/avif', 'image/webp']

Graphics & Icons

  • SVG for logos, icons, and simple graphics
  • PNG only when transparency is required
  • Consider SVG sprites for icon sets (reduces requests)
  • Inline small SVGs (< 1KB) to avoid network requests

Format Decision Tree

Is it a photo/complex image?
├── Yes → Use AVIF/WebP (Next.js Image handles this)
└── No → Is transparency needed?
    ├── Yes → PNG or SVG
    └── No → Is it an icon/logo?
        ├── Yes → SVG (scalable, tiny file size)
        └── No → AVIF/WebP

Dimensions & Sizing

Always Set Dimensions

  • Every <Image> has width and height OR uses fill
  • Fill mode images have sized container (relative + dimensions)
  • Dimensions match actual display size (not larger)
  • No CLS from images (Layout Shift score = 0)
// ✅ GOOD: Explicit dimensions
<Image src="/photo.jpg" width={800} height={600} />

// ✅ GOOD: Fill with sized container
<div className="relative h-[400px]">
  <Image src="/photo.jpg" fill />
</div>

// ❌ BAD: Missing dimensions
<Image src="/photo.jpg" />

Responsive Images

  • sizes prop set for all responsive images
  • Sizes match actual layout breakpoints
  • Don't serve images larger than needed
  • Test with DevTools Network tab (check actual sizes served)
// ✅ GOOD: Accurate sizes prop
<Image
  src="/photo.jpg"
  fill
  sizes="(max-width: 640px) 100vw, (max-width: 1024px) 50vw, 33vw"
/>

// Common sizes patterns:
// Full width hero: sizes="100vw"
// Half width on desktop: sizes="(max-width: 768px) 100vw, 50vw"
// Grid of 4: sizes="(max-width: 640px) 50vw, 25vw"

Loading Strategy

LCP Images (Above the Fold)

  • Hero/banner image has priority prop
  • ONLY one image per page has priority (usually LCP element)
  • LCP image preloaded in <head> if not using Next.js Image
  • No lazy loading on LCP images
// ✅ GOOD: Priority on LCP image
<Image src="/hero.jpg" priority fill sizes="100vw" />

// ❌ BAD: Priority on all images
{images.map(img => <Image src={img} priority />)} // Wrong!

Below-the-Fold Images

  • Default lazy loading (Next.js Image default)
  • No priority prop on non-LCP images
  • Consider loading="lazy" for native <img> elements
  • Use Intersection Observer for custom lazy loading

Preloading

  • Critical hero image preloaded
  • Don't preload below-fold images
  • Use fetchpriority="high" for critical images
<link rel="preload" as="image" href="/hero.webp" fetchpriority="high" />

Placeholders

Blur Placeholders

  • Static imports use placeholder="blur" (automatic)
  • Remote images have blurDataURL generated
  • Placeholder improves perceived performance
  • Consider plaiceholder library for build-time generation
// ✅ Static import with automatic blur
import heroImage from '@/public/hero.jpg';
<Image src={heroImage} placeholder="blur" />

// ✅ Remote image with blur
<Image
  src="https://cdn.example.com/photo.jpg"
  placeholder="blur"
  blurDataURL="data:image/jpeg;base64,..."
/>

Color Placeholders

  • Consider dominant color placeholder for cards
  • Skeleton placeholders for loading states
  • Smooth transition from placeholder to image

Quality Settings

Compression

  • Quality set to 75-85 (not 100)
  • Test quality visually - often 75 is indistinguishable
  • Higher quality (85-90) only for hero/product images
  • Lower quality (60-70) acceptable for thumbnails
// ✅ GOOD: Appropriate quality
<Image src="/hero.jpg" quality={85} /> // Important hero
<Image src="/thumbnail.jpg" quality={70} /> // Small thumbnail

// ❌ BAD: Unnecessary quality
<Image src="/photo.jpg" quality={100} /> // Huge file, no benefit

AVIF-Specific

  • AVIF quality can be 10-15 points lower than JPEG
  • Test AVIF vs WebP on your content type
  • Some images compress better with WebP

CDN & Infrastructure

Next.js Configuration

  • remotePatterns configured for all external domains
  • deviceSizes matches your breakpoints
  • formats includes AVIF and WebP
  • minimumCacheTTL set appropriately (30+ days for static)
// next.config.js
images: {
  formats: ['image/avif', 'image/webp'],
  remotePatterns: [
    { hostname: 'cdn.example.com' },
    { hostname: '*.cloudinary.com' },
  ],
  deviceSizes: [640, 750, 828, 1080, 1200, 1920],
  minimumCacheTTL: 60 * 60 * 24 * 30, // 30 days
}

CDN Setup

  • Images served from CDN (not origin server)
  • Edge caching enabled
  • Cache headers set correctly (1 year for hashed assets)
  • Vary: Accept header for format negotiation

Self-Hosted

  • Sharp installed: npm install sharp
  • Docker image includes Sharp dependencies
  • Adequate disk space for image cache
  • Memory limits account for Sharp processing

Accessibility

Alt Text

  • ALL images have alt attribute
  • Meaningful alt for informative images
  • Empty alt="" for decorative images
  • Alt text describes content, not appearance
  • No "image of" or "picture of" prefix
// ✅ GOOD: Meaningful alt
<Image src="/product.jpg" alt="Red Nike Air Max 90 running shoe, side view" />

// ✅ GOOD: Decorative image
<Image src="/decorative-pattern.svg" alt="" />

// ❌ BAD: Generic alt
<Image src="/product.jpg" alt="Image" />

// ❌ BAD: Missing alt
<Image src="/product.jpg" />

Additional A11y

  • No text in images (use real text)
  • Sufficient color contrast for overlaid text
  • Images don't convey information unavailable in text
  • Decorative images marked with role="presentation"

Performance Monitoring

Metrics to Track

  • LCP (Largest Contentful Paint) < 2.5s
  • CLS (Cumulative Layout Shift) = 0 for images
  • Image load times in RUM data
  • Total image bytes transferred

Debugging

  • Check DevTools Network tab for actual sizes
  • Verify format negotiation (AVIF/WebP served)
  • Test on slow connections (DevTools throttling)
  • Run Lighthouse for image recommendations

Error Handling

Fallbacks

  • Fallback image configured for load errors
  • Graceful degradation for broken images
  • Error boundaries for image-heavy components
const [error, setError] = useState(false);

<Image
  src={error ? '/fallback.jpg' : product.image}
  onError={() => setError(true)}
/>

Monitoring

  • Image errors logged to monitoring service
  • Alerts for high error rates
  • 404s for images tracked

Build Pipeline

Optimization

  • Images optimized at build time (where possible)
  • Source images stored at high resolution
  • Build includes image processing (Sharp, Squoosh)
  • CI validates image configurations

Version Control

  • Large images in Git LFS (not regular Git)
  • Or: Images stored externally (CMS, CDN)
  • Build pulls images from source

Security

Content Security

  • Only allow trusted image domains
  • SVG sanitization if user-uploaded
  • dangerouslyAllowSVG: false in production
  • Rate limiting on image optimization endpoints

Privacy

  • Strip EXIF metadata from user uploads
  • No personally identifiable information in image URLs
  • Consider image hashing for user content

Inference Optimization Checklist

Performance validation for LLM inference.

vLLM Configuration

  • Tensor parallelism configured for GPU count
  • Max model length set appropriately
  • GPU memory utilization optimized (0.85-0.95)
  • Prefix caching enabled for shared contexts
  • Continuous batching active

Quantization

  • Quantization method selected:
    • FP16: Maximum quality, baseline
    • INT8/FP8: Balance quality/efficiency
    • AWQ: Best 4-bit quality
    • GPTQ: Faster quantization
  • Calibration data used (for GPTQ)
  • Quality validated post-quantization
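To sanity-check GPU counts before quantizing, weights-only memory is roughly parameters × bytes per parameter (a back-of-envelope sketch; it excludes KV cache, activations, and quantization scales/zero-points, so real footprints are higher):

```python
# Approximate bytes per parameter for common serving precisions
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "fp8": 1.0, "awq": 0.5, "gptq": 0.5}

def model_weight_gb(n_params_billions: float, method: str) -> float:
    """Weights-only memory in GB: billions of params x bytes/param."""
    return n_params_billions * BYTES_PER_PARAM[method]
```

For example, a 70B model needs about 140 GB of weight memory at FP16 but only about 35 GB with 4-bit AWQ, which is what makes the 4 GPU vs 1-2 GPU difference in the table above.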

Speculative Decoding

  • Method selected:
    • N-gram: No extra model, lower overhead
    • Draft model: Higher quality speculation
  • Speculative tokens tuned (3-5 typical)
  • Throughput improvement validated
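Under an idealized i.i.d. acceptance model from the speculative-decoding literature, the expected tokens emitted per target-model step with k speculative tokens and per-token acceptance rate a is (1 - a^(k+1)) / (1 - a). A quick calculator for tuning k (an illustrative sketch; measured throughput is the real arbiter):

```python
def expected_tokens_per_step(accept_rate: float, k: int) -> float:
    """Expected tokens per target-model step with k speculative tokens,
    assuming each draft token is accepted independently with probability a."""
    a = accept_rate
    if a >= 1.0:
        return float(k + 1)  # every draft token accepted
    return (1 - a ** (k + 1)) / (1 - a)
```

At a = 0.5, going from k = 3 to k = 5 barely helps (1.875 vs about 1.98 tokens/step), which is why 3-5 speculative tokens is the typical sweet spot.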

Hardware Utilization

  • GPU memory fully utilized
  • Multi-GPU scaling verified
  • NVLink/PCIe bandwidth sufficient
  • CPU not bottlenecking

Batching Strategy

  • Continuous batching enabled
  • Max batch size configured
  • Request prioritization (if needed)
  • Queue management configured

Caching

  • KV cache optimized (PagedAttention)
  • Prefix caching for shared prompts
  • Response caching (semantic if applicable)
  • Cache invalidation strategy

Benchmarking

  • Baseline latency measured
  • Throughput (tokens/sec) benchmarked
  • Time to first token (TTFT) measured
  • Latency under load tested
  • Memory usage profiled
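For quick benchmarking scripts, p50/p95/p99 can be summarized with a nearest-rank percentile (a minimal sketch; use a stats library or your metrics backend for production dashboards):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile of a list of latency samples."""
    s = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[idx]
```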

Production Readiness

  • Warmup requests sent before traffic
  • Health checks configured
  • Graceful shutdown handling
  • Request timeout configured
  • Error recovery tested

Monitoring

  • Latency metrics (p50, p95, p99)
  • Throughput tracking
  • GPU utilization monitoring
  • Memory usage tracking
  • Error rate alerting

Cost Optimization

  • Instance size appropriate
  • Spot instances (if applicable)
  • Auto-scaling configured
  • Usage patterns analyzed
  • Cost per request tracked
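As a sanity check on instance sizing, a rough steady-state cost model (hypothetical helper; it ignores idle capacity, autoscaling, and spot pricing) ties the hourly GPU price to sustained throughput:

```python
def cost_per_request(gpu_hourly_usd: float, requests_per_second: float) -> float:
    """Serving cost per request at full, steady utilization."""
    return gpu_hourly_usd / (requests_per_second * 3600)
```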

Performance Audit Checklist

Comprehensive guide for identifying and fixing performance bottlenecks, based on OrchestKit's real optimization process.

Prerequisites

  • Access to production metrics (Prometheus, Grafana)
  • Profiling tools installed (py-spy, Chrome DevTools)
  • Baseline performance metrics captured
  • Test environment with production-like data

Phase 1: Establish Baselines

Backend Metrics

Capture current performance:

# Database query performance
psql -c "SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 20;"

# API latency
curl 'http://localhost:9090/api/v1/query?query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))'

# Cache hit rate
curl 'http://localhost:9090/api/v1/query?query=sum(rate(cache_operations_total{result="hit"}[5m])) / sum(rate(cache_operations_total[5m]))'
  • Record p50/p95/p99 latency for all endpoints
  • Document slow queries (>100ms)
  • Measure cache hit rates
  • Capture database connection pool usage
  • Record LLM token usage and costs

Frontend Metrics

Run Lighthouse audit:

# Lighthouse CLI
lighthouse http://localhost:3000 \
  --output json \
  --output-path lighthouse-report.json

# Or use Chrome DevTools → Lighthouse tab
  • Record Core Web Vitals (LCP, INP, CLS, TTFB)
  • Measure bundle size (JS, CSS)
  • Check for render-blocking resources
  • Analyze long tasks (>50ms)
  • Measure First Contentful Paint (FCP)

Baseline Targets

| Metric          | Good   | Needs Work | Current |
|-----------------|--------|------------|---------|
| p95 API latency | <500ms | <1s        | ___ms   |
| p95 DB query    | <100ms | <500ms     | ___ms   |
| Cache hit rate  | >70%   | >50%       | __%     |
| LCP             | <2.5s  | <4s        | ___s    |
| INP             | <200ms | <500ms     | ___ms   |
| CLS             | <0.1   | <0.25      | ___     |
| Bundle size     | <300KB | <500KB     | ___KB   |

Phase 2: Identify Bottlenecks

Backend Profiling

1. Find Slow Endpoints

# Top 10 slowest endpoints (p95 latency)
topk(10,
  histogram_quantile(0.95,
    sum by (le, endpoint) (
      rate(http_request_duration_seconds_bucket[5m])
    )
  )
)
  • List endpoints with p95 > 500ms
  • Prioritize by traffic volume (high traffic = high impact)
  • Document expected vs actual latency

2. Identify Slow Database Queries

-- Top 10 slowest queries
SELECT
    LEFT(query, 80) as query_preview,
    calls,
    ROUND(mean_exec_time::numeric, 2) as avg_ms,
    ROUND(total_exec_time::numeric, 2) as total_ms,
    ROUND(100.0 * shared_blks_hit / NULLIF(shared_blks_hit + shared_blks_read, 0), 2) as cache_hit_ratio
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
  • Run EXPLAIN ANALYZE on slow queries
  • Check for sequential scans (should use indexes)
  • Look for low cache hit ratios (<90%)
  • Identify N+1 query patterns

3. Python Profiling with py-spy

# Profile running FastAPI server
py-spy record --pid $(pgrep -f uvicorn) \
  --output profile.svg \
  --duration 60

# Top functions by time
py-spy top --pid $(pgrep -f uvicorn)
  • Generate flame graph
  • Identify hot paths (wide bars = time spent)
  • Look for unexpected CPU usage
  • Check for blocking I/O in async code

4. LLM Cost Analysis

-- Cost breakdown by model (Langfuse)
SELECT
    model,
    COUNT(*) as calls,
    SUM(input_tokens) as total_input,
    SUM(output_tokens) as total_output,
    SUM(calculated_total_cost) as total_cost
FROM langfuse.traces
WHERE timestamp > NOW() - INTERVAL '7 days'
GROUP BY model
ORDER BY total_cost DESC;
  • Identify most expensive models
  • Calculate cache hit rate potential
  • Find repetitive queries (caching candidates)
  • Measure prompt token waste

Frontend Profiling

1. Chrome DevTools Performance Tab

  • Record 6s of user interaction
  • Identify long tasks (yellow bars >50ms)
  • Check for dropped frames (should be 60fps)
  • Measure main thread blocking time

2. React DevTools Profiler

// Add Profiler to key components
import { Profiler } from 'react';

function onRenderCallback(
    id, phase, actualDuration, baseDuration
) {
    if (actualDuration > 16) {
        console.warn(`Slow render: ${id} took ${actualDuration}ms`);
    }
}

<Profiler id="AnalysisCard" onRender={onRenderCallback}>
    <AnalysisCard />
</Profiler>
  • Find components with >16ms render time
  • Identify unnecessary re-renders
  • Check for missing memoization

3. Bundle Analysis

# Vite
npm run build
npx vite-bundle-visualizer

# Next.js
ANALYZE=true npm run build
  • Identify largest chunks
  • Find duplicate dependencies
  • Check for tree-shaking failures
  • Measure code splitting effectiveness

Phase 3: Database Optimization

Add Missing Indexes

1. Identify Missing Indexes

-- Find sequential scans that should use indexes
SELECT
    schemaname,
    tablename,
    seq_scan,
    idx_scan,
    seq_scan - idx_scan as too_much_seq
FROM pg_stat_user_tables
WHERE seq_scan - idx_scan > 0
ORDER BY too_much_seq DESC
LIMIT 10;
  • Run EXPLAIN ANALYZE on slow queries
  • Look for "Seq Scan" in query plans
  • Identify columns in WHERE/JOIN clauses
  • Create indexes for high-cardinality columns

2. Create Indexes

-- B-tree for exact matches and ranges
CREATE INDEX idx_analysis_status ON analyses(status);
CREATE INDEX idx_analysis_created ON analyses(created_at DESC);

-- GIN for full-text search
CREATE INDEX idx_chunk_tsvector ON chunks USING GIN(content_tsvector);

-- HNSW for vector similarity (pgvector)
CREATE INDEX idx_chunk_embedding ON chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Composite index for common filter combinations
CREATE INDEX idx_chunk_analysis_created ON chunks(analysis_id, created_at DESC);
  • Create indexes for WHERE clause columns
  • Use composite indexes for multi-column filters
  • Add indexes for JOIN columns
  • Use CONCURRENTLY for production
  • Verify indexes are used (EXPLAIN ANALYZE)

Index Selection Guide:

| Query Pattern     | Index Type | Example                              |
|-------------------|------------|--------------------------------------|
| Exact match       | B-tree     | WHERE status = 'completed'           |
| Range query       | B-tree     | WHERE created_at > '2025-01-01'      |
| Full-text search  | GIN        | WHERE content_tsvector @@ query      |
| Vector similarity | HNSW       | ORDER BY embedding <=> query_vec     |
| JSONB queries     | GIN        | WHERE metadata @> '{"key": "value"}' |

Fix N+1 Queries

1. Detect N+1 Patterns

# ❌ BAD: N+1 query (1 query + N queries in loop)
analyses = await session.execute(select(Analysis).limit(10))
for analysis in analyses.scalars():
    # Each iteration = 1 query!
    chunks = await session.execute(
        select(Chunk).where(Chunk.analysis_id == analysis.id)
    )
  • Review logs for rapid sequential queries
  • Check for queries inside loops
  • Use query count logging in tests

2. Fix with Eager Loading

# ✅ GOOD: Eager loading (1 + 1 batched query instead of 1 + N)
from sqlalchemy.orm import selectinload

result = await session.execute(
    select(Analysis)
    .options(selectinload(Analysis.chunks))  # Eager load
    .limit(10)
)
analyses = result.scalars().all()

# Now analyses[0].chunks is preloaded (no extra query per row)
  • Replace lazy loading with eager loading
  • Use selectinload() for one-to-many
  • Use joinedload() for one-to-one
  • Verify query count reduced (N+1 → 1-2 queries)

Optimize Connection Pooling

1. Check Current Pool Usage

# Connection pool saturation
db_connections_active / db_connections_max
  • Measure active vs max connections
  • Check for pool exhaustion (ratio >0.8)
  • Monitor connection wait times

2. Configure Pool

# backend/app/core/config.py
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    database_url,
    pool_size=5,           # Connections to maintain
    max_overflow=10,       # Extra connections allowed
    pool_recycle=3600,     # Recycle after 1 hour
    pool_pre_ping=True,    # Validate before checkout
)
  • Set pool_size based on traffic (5-20 typical)
  • Allow overflow for spikes
  • Enable pool_pre_ping for stale detection
  • Set pool_recycle to avoid timeouts
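For a starting pool_size, one widely cited heuristic (from the HikariCP pool-sizing guidance; treat it as a starting point to load-test, not a rule) is cores × 2 plus effective spindle count:

```python
def suggested_pool_size(cpu_cores: int, spindles: int = 1) -> int:
    """Starting-point pool size: cores * 2 + effective disk spindles.
    SSD-backed databases usually count as spindles=1."""
    return cpu_cores * 2 + spindles
```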

Phase 4: Caching Strategy

Identify Caching Opportunities

1. Find Repetitive Queries

-- Most frequently called queries
SELECT
    LEFT(query, 80),
    calls,
    ROUND(mean_exec_time::numeric, 2) as avg_ms
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 20;
  • Identify high-frequency queries
  • Check if data changes frequently
  • Calculate potential savings (calls × avg_time)

2. Find Repetitive LLM Calls

-- Similar prompts (Langfuse)
SELECT
    LEFT(input::text, 100) as prompt_preview,
    COUNT(*) as occurrences,
    SUM(calculated_total_cost) as total_cost
FROM langfuse.generations
GROUP BY LEFT(input::text, 100)
HAVING COUNT(*) > 5
ORDER BY total_cost DESC;
  • Identify repetitive prompts
  • Calculate cost savings potential
  • Determine appropriate cache TTL
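The SQL grouping above can be mirrored in Python when prompts are already in hand (an illustrative helper using the same 100-character prefix and >5-occurrence cutoff):

```python
from collections import Counter

def caching_candidates(
    prompts: list[str], min_count: int = 5, prefix_len: int = 100
) -> dict[str, int]:
    """Group prompts by their first prefix_len characters and keep
    groups that occur more than min_count times (cache candidates)."""
    counts = Counter(p[:prefix_len] for p in prompts)
    return {k: v for k, v in counts.items() if v > min_count}
```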

Implement Multi-Level Cache

L1: In-Memory Cache (Application)

from functools import lru_cache

@lru_cache(maxsize=128)
def get_agent_system_prompt(agent_type: str) -> str:
    """Cache agent prompts in memory."""
    return load_prompt_from_file(f"prompts/{agent_type}.txt")
  • Cache static data (prompts, configs)
  • Use LRU cache for bounded memory
  • Set appropriate maxsize (128-1024)

L2: Redis Cache (Distributed)

async def get_analysis(analysis_id: str) -> Analysis:
    """Cache analysis results in Redis."""

    # Try cache first
    cached = await redis.get(f"analysis:{analysis_id}")
    if cached:
        return Analysis.parse_raw(cached)

    # Cache miss - fetch from DB
    analysis = await db.get_analysis(analysis_id)

    # Store in cache (5 min TTL)
    await redis.setex(
        f"analysis:{analysis_id}",
        300,
        analysis.json()
    )

    return analysis
  • Cache query results
  • Set appropriate TTL (seconds to hours)
  • Invalidate on writes
  • Track cache hit rate
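Tracking the hit rate can start as a simple in-process counter; a sketch (production setups usually export these counters to a metrics backend such as Prometheus instead):

```python
class CacheStats:
    """Minimal in-process cache hit-rate tracker."""

    def __init__(self) -> None:
        self.hits = 0
        self.misses = 0

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

# In get_analysis above: stats.record(cached is not None)
```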

L3: Semantic Cache (Vector Search)

async def get_llm_response(query: str) -> str:
    """Check semantic cache before calling LLM."""

    # Generate query embedding
    embedding = await embed_text(query)

    # Search for similar cached queries
    cached = await semantic_cache.search(embedding, threshold=0.92)
    if cached:
        return cached.response

    # Call LLM
    response = await llm.complete(query)

    # Store in cache
    await semantic_cache.store(embedding, response)

    return response
  • Cache LLM responses by semantic similarity
  • Set similarity threshold (0.90-0.95)
  • Measure cost savings
  • Monitor false positive rate
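The threshold trade-off above can be made concrete with plain cosine similarity: too low and the cache returns wrong answers (false positives), too high and near-duplicate queries still pay for LLM calls. A self-contained sketch:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def is_cache_hit(query_emb: list[float], cached_emb: list[float],
                 threshold: float = 0.92) -> bool:
    """Treat a cached entry as reusable only above the similarity threshold."""
    return cosine_similarity(query_emb, cached_emb) >= threshold
```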

Cache Invalidation

Write-Through Pattern:

async def update_analysis(analysis: Analysis):
    """Update DB and cache atomically."""

    # 1. Write to DB
    await db.update(analysis)

    # 2. Update cache
    await redis.setex(
        f"analysis:{analysis.id}",
        300,
        analysis.json()
    )
  • Update (or delete) cached entries on writes
  • Keep TTLs as a backstop for time-sensitive data
  • Add cache versioning for schema changes
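Cache versioning can be as simple as namespacing keys by a schema version, so stale entries miss cleanly instead of deserializing into the wrong shape after a deploy. A sketch (the constant and helper names are illustrative):

```python
CACHE_SCHEMA_VERSION = 3  # bump on any change to the cached payload shape

def versioned_key(entity: str, entity_id: str) -> str:
    """Build a cache key namespaced by schema version."""
    return f"{entity}:v{CACHE_SCHEMA_VERSION}:{entity_id}"

assert versioned_key("analysis", "abc123") == "analysis:v3:abc123"
```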

Phase 5: Frontend Optimization

Code Splitting

1. Route-Based Splitting

// Before: All routes in one bundle
import AnalysisPage from './pages/AnalysisPage';
import DashboardPage from './pages/DashboardPage';

// After: Lazy load routes
const AnalysisPage = lazy(() => import('./pages/AnalysisPage'));
const DashboardPage = lazy(() => import('./pages/DashboardPage'));

<Suspense fallback={<Loading />}>
    <Routes>
        <Route path="/analysis" element={<AnalysisPage />} />
        <Route path="/dashboard" element={<DashboardPage />} />
    </Routes>
</Suspense>
  • Lazy load routes
  • Add loading states
  • Measure bundle size reduction

2. Component-Level Splitting

// Lazy load heavy components
const ChartComponent = lazy(() => import('./ChartComponent'));

{showChart && (
    <Suspense fallback={<Skeleton />}>
        <ChartComponent data={data} />
    </Suspense>
)}
  • Split large dependencies (charts, editors)
  • Use dynamic imports for modals
  • Prefetch on user intent (hover, focus)

Memoization

React.memo for Components:

// Prevent re-renders when props unchanged
const AnalysisCard = memo(({ analysis }: Props) => {
    return <div>{analysis.title}</div>;
});
  • Wrap expensive components with memo()
  • Verify props don't change unnecessarily
  • Use React DevTools Profiler to confirm

useMemo for Expensive Calculations:

const expensiveValue = useMemo(() => {
    return processLargeDataset(data);
}, [data]);  // Only recompute if data changes
  • Memoize expensive calculations
  • Memoize filtered/sorted arrays
  • Don't over-memoize (profiling first!)

useCallback for Event Handlers:

const handleClick = useCallback(() => {
    doSomething(id);
}, [id]);  // Only recreate if id changes

<ChildComponent onClick={handleClick} />
  • Wrap callbacks passed to memoized children
  • Avoid inline functions in props
  • Include all dependencies

Image Optimization

// Use next/image or similar for optimization
<Image
    src="/photo.jpg"
    alt="Description"
    width={800}
    height={600}
    loading="lazy"  // Lazy load images
    placeholder="blur"  // Show blur while loading
/>
  • Use WebP/AVIF formats
  • Lazy load images below the fold
  • Set explicit width/height (prevent CLS)
  • Use responsive images (srcset)

Phase 6: Measure Impact

Re-Run Benchmarks

Backend:

# Query performance
psql -c "SELECT LEFT(query, 80), mean_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;"

# API latency
curl 'http://localhost:9090/api/v1/query?query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))'

Frontend:

lighthouse http://localhost:3000 --output json
  • Compare p95 latency (before vs after)
  • Verify query performance improved
  • Check cache hit rates increased
  • Measure Core Web Vitals improvement
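Comparing p95 before vs. after requires a consistent percentile definition; a nearest-rank sketch over raw latency samples (the sample lists are hypothetical):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile; good enough for before/after comparisons."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(p * len(ordered)) - 1)
    return ordered[rank]

before = [85.0] * 95 + [300.0] * 5   # hypothetical latency samples (ms)
after = [5.0] * 99 + [40.0]
assert percentile(before, 0.95) == 85.0
assert percentile(after, 0.95) == 5.0
```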

Calculate Savings

Cost Savings:

# LLM cost reduction
baseline_cost = 35000  # Annual LLM spend ($)
cache_hit_rate = 0.90
savings = baseline_cost * cache_hit_rate * 0.90  # 90% of cost avoided on each hit = $28,350
final_cost = baseline_cost - savings             # $6,650/year

Performance Gains:

# Query speedup
before_latency = 85  # ms
after_latency = 5    # ms
speedup = before_latency / after_latency  # 17x
  • Document cost savings
  • Calculate ROI (savings vs implementation time)
  • Measure user experience improvement

Create Performance Budget

Set ongoing targets:

  • p95 API latency < 500ms
  • p95 DB query < 100ms
  • Cache hit rate > 70%
  • LCP < 2.5s
  • Bundle size < 300KB
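Those targets can be encoded as a machine-checkable budget. A sketch — metric names are hypothetical and all treated as "lower is better" (cache hit rate, being "higher is better", would need the inverted comparison):

```python
# Hypothetical metric names mirroring the budget above (all lower-is-better)
BUDGET = {
    "p95_api_ms": 500,
    "p95_db_ms": 100,
    "lcp_s": 2.5,
    "bundle_kb": 300,
}

def budget_violations(measured: dict[str, float]) -> list[str]:
    """Metrics that exceed their budget; metrics not measured are skipped."""
    return [name for name, limit in BUDGET.items()
            if name in measured and measured[name] > limit]

# 620 ms p95 busts the API budget; LCP at 2.1 s is within budget
assert budget_violations({"p95_api_ms": 620, "lcp_s": 2.1}) == ["p95_api_ms"]
```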

Monitor continuously:

  • Add Lighthouse CI to pipeline
  • Alert on budget violations
  • Review metrics weekly

Phase 7: Ongoing Optimization

Weekly Reviews

  • Review top 10 slowest endpoints
  • Check for new slow queries
  • Monitor cache hit rates
  • Review LLM cost trends
  • Check Core Web Vitals in RUM

Monthly Audits

  • Run full Lighthouse audit
  • Profile with py-spy/Chrome DevTools
  • Review database index usage
  • Check for unused dependencies
  • Update performance budget

Continuous Monitoring

  • Set up alerts for degradation
  • Track performance in CI/CD
  • Monitor real user metrics (RUM)
  • A/B test optimizations

References

React Performance Audit Checklist

Pre-deployment performance verification.

React Compiler Check

  • React Compiler enabled in build config
  • Components show "Memo ✨" badge in DevTools
  • Code follows Rules of React:
    • Components are idempotent
    • Props/state treated as immutable
    • Side effects in useEffect only
    • Hooks at top level

Render Performance

  • No unnecessary re-renders (verified with Profiler)
  • State colocated close to usage
  • Context split to prevent cascading updates
  • Expensive computations have escape hatch memoization
  • Lists > 100 items are virtualized

Large Lists / Data

  • TanStack Virtual for lists > 100 items
  • Pagination or infinite scroll for API data
  • Table virtualization for grids > 50 rows
  • Images lazy loaded below fold

Code Splitting

  • Route-based code splitting (lazy routes)
  • Heavy components lazy loaded
  • Dynamic imports for large libraries
  • Bundle analyzer run, no unexpected large chunks

Network Performance

  • API calls deduplicated (React Query, SWR)
  • Data prefetched on hover/intent
  • Optimistic updates for mutations
  • Appropriate cache headers set

Images & Media

  • Images optimized (WebP, AVIF)
  • Responsive images with srcset
  • Lazy loading for below-fold images
  • Placeholder/skeleton during load

Third-Party Scripts

  • Analytics loaded async/deferred
  • Third-party widgets lazy loaded
  • Font loading optimized (preload critical)
  • No render-blocking resources

Profiling Verification

Before Optimization

  1. Record baseline interaction times
  2. Document slowest components
  3. Note current bundle size

After Optimization

  1. Re-profile all interactions
  2. Verify improvements in numbers
  3. Check bundle size delta

Key Metrics to Track

| Metric | Target | Current |
| --- | --- | --- |
| LCP (Largest Contentful Paint) | < 2.5s | ___ |
| INP (Interaction to Next Paint) | < 200ms | ___ |
| CLS (Cumulative Layout Shift) | < 0.1 | ___ |
| Time to Interactive | < 3s | ___ |
| Main thread blocking | < 200ms | ___ |

Quick Profiler Commands

# React DevTools Profiler
# 1. Open DevTools → Profiler tab
# 2. Click Record
# 3. Perform interaction
# 4. Click Stop
# 5. Analyze flamegraph

# Lighthouse
npx lighthouse http://localhost:3000 --view

# Bundle Analyzer (Next.js)
ANALYZE=true npm run build

# Bundle Analyzer (Vite)
npx vite-bundle-visualizer

Common Issues Checklist

  • No anonymous functions as props in hot paths
  • No object/array literals as props in hot paths
  • Context providers near consumers
  • useEffect dependencies correct
  • No state updates in render

Sign-Off

  • All critical interactions < 100ms
  • No visible jank during scroll
  • Page load acceptable on 3G
  • Bundle size within budget
  • Performance regression tests in CI

Examples (3)

Core Web Vitals Examples

Real-world optimization examples for LCP, INP, and CLS.


1. LCP Optimization: E-Commerce Hero Section

Complete optimization of a hero section with product image and CTA.

Before: Slow LCP (3.5s+)

// ❌ BAD: Multiple LCP issues
function Hero() {
  const [product, setProduct] = useState(null);

  useEffect(() => {
    // Problem 1: LCP content loaded client-side
    fetch('/api/featured-product')
      .then(res => res.json())
      .then(setProduct);
  }, []);

  if (!product) return <div className="h-[600px]" />; // Problem 2: No skeleton

  return (
    <div className="relative">
      {/* Problem 3: No priority, lazy by default */}
      <img src={product.image} alt={product.name} />
      <h1>{product.name}</h1>
      <a href={`/product/${product.id}`}>Shop Now</a>
    </div>
  );
}

After: Optimized LCP (1.2s)

// ✅ GOOD: Server-rendered with optimized image
import Image from 'next/image';
import { Suspense } from 'react';

// Server Component - data fetched on server
async function Hero() {
  // Fetched on server, included in initial HTML
  const product = await getFeaturedProduct();

  return (
    <section className="relative h-[600px] overflow-hidden">
      {/* Priority image with explicit dimensions */}
      <Image
        src={product.image}
        alt={product.name}
        fill
        priority // Preloads, eager loading
        sizes="100vw"
        quality={85}
        placeholder="blur"
        blurDataURL={product.blurPlaceholder}
        style={{ objectFit: 'cover' }}
      />

      {/* Content overlay */}
      <div className="relative z-10 flex flex-col items-center justify-center h-full text-white">
        <h1 className="text-5xl font-bold">{product.name}</h1>
        <p className="mt-4 text-xl">{product.tagline}</p>
        <a
          href={`/product/${product.id}`}
          className="mt-8 px-8 py-4 bg-white text-black rounded-lg font-semibold"
        >
          Shop Now
        </a>
      </div>
    </section>
  );
}

// Loading skeleton for Suspense boundary
function HeroSkeleton() {
  return (
    <section className="relative h-[600px] bg-gray-200 animate-pulse">
      <div className="flex flex-col items-center justify-center h-full">
        <div className="h-12 w-64 bg-gray-300 rounded" />
        <div className="mt-4 h-6 w-48 bg-gray-300 rounded" />
        <div className="mt-8 h-14 w-40 bg-gray-300 rounded-lg" />
      </div>
    </section>
  );
}

// Usage in page
export default function HomePage() {
  return (
    <Suspense fallback={<HeroSkeleton />}>
      <Hero />
    </Suspense>
  );
}

// Also add preload in head (layout.tsx or page metadata)
export const metadata = {
  other: {
    'link': [
      {
        rel: 'preload',
        as: 'image',
        href: '/featured-product-hero.webp',
        fetchpriority: 'high',
      },
    ],
  },
};

Document Head Optimizations

<!-- Add to <head> for fastest LCP -->
<head>
  <!-- Preload hero image -->
  <link rel="preload" as="image" href="/hero.webp" fetchpriority="high" />

  <!-- Preload critical font -->
  <link
    rel="preload"
    as="font"
    href="/fonts/inter-bold.woff2"
    type="font/woff2"
    crossorigin
  />

  <!-- Preconnect to image CDN -->
  <link rel="preconnect" href="https://images.example.com" />

  <!-- DNS prefetch for analytics -->
  <link rel="dns-prefetch" href="https://analytics.example.com" />
</head>

2. INP Optimization: Product Search Filter

Optimizing a search filter that was causing 400ms+ INP.

Before: Blocking INP (400ms+)

// ❌ BAD: Blocks main thread on every keystroke
function ProductSearch({ products }: { products: Product[] }) {
  const [query, setQuery] = useState('');
  const [results, setResults] = useState(products);

  const handleChange = (e: ChangeEvent<HTMLInputElement>) => {
    const value = e.target.value;
    setQuery(value);

    // Problem: Expensive filter runs synchronously
    // Blocks paint until complete
    const filtered = products.filter(p =>
      p.name.toLowerCase().includes(value.toLowerCase()) ||
      p.description.toLowerCase().includes(value.toLowerCase()) ||
      p.tags.some(t => t.toLowerCase().includes(value.toLowerCase()))
    );
    setResults(filtered);
  };

  return (
    <>
      <input
        value={query}
        onChange={handleChange}
        placeholder="Search products..."
      />
      <ProductGrid products={results} />
    </>
  );
}

After: Responsive INP (50ms)

// ✅ GOOD: Non-blocking with useDeferredValue
import {
  useState,
  useDeferredValue,
  useMemo,
  memo
} from 'react';

function ProductSearch({ products }: { products: Product[] }) {
  const [query, setQuery] = useState('');

  // Deferred value lets React paint the keystroke before filtering
  const deferredQuery = useDeferredValue(query);
  const isStale = query !== deferredQuery;

  // Memoized filter only runs when deferredQuery changes
  const results = useMemo(() => {
    if (!deferredQuery) return products;

    const searchLower = deferredQuery.toLowerCase();
    return products.filter(p =>
      p.name.toLowerCase().includes(searchLower) ||
      p.description.toLowerCase().includes(searchLower) ||
      p.tags.some(t => t.toLowerCase().includes(searchLower))
    );
  }, [products, deferredQuery]);

  return (
    <div>
      <div className="relative">
        <input
          value={query}
          onChange={(e) => setQuery(e.target.value)}
          placeholder="Search products..."
          className="w-full px-4 py-2 border rounded-lg"
        />
        {/* Loading indicator while results are stale */}
        {isStale && (
          <div className="absolute right-3 top-1/2 -translate-y-1/2">
            <Spinner size="sm" />
          </div>
        )}
      </div>
      </div>

      {/* Fade during pending state */}
      <div
        className="mt-4 transition-opacity"
        style={{ opacity: isStale ? 0.7 : 1 }}
      >
        <ProductGrid products={results} />
      </div>
    </div>
  );
}

// Memoized grid to prevent unnecessary re-renders
const ProductGrid = memo(function ProductGrid({
  products
}: {
  products: Product[]
}) {
  return (
    <div className="grid grid-cols-4 gap-4">
      {products.map(product => (
        <ProductCard key={product.id} product={product} />
      ))}
    </div>
  );
});

For Very Large Lists: Virtual Scrolling

// ✅ BEST: Virtualization for huge lists
import { useVirtualizer } from '@tanstack/react-virtual';

function VirtualizedProductList({ products }: { products: Product[] }) {
  const parentRef = useRef<HTMLDivElement>(null);

  const virtualizer = useVirtualizer({
    count: products.length,
    getScrollElement: () => parentRef.current,
    estimateSize: () => 200, // Estimated row height
    overscan: 5, // Render 5 extra items above/below
  });

  return (
    <div
      ref={parentRef}
      className="h-[600px] overflow-auto"
    >
      <div
        style={{
          height: `${virtualizer.getTotalSize()}px`,
          width: '100%',
          position: 'relative',
        }}
      >
        {virtualizer.getVirtualItems().map((virtualRow) => (
          <div
            key={virtualRow.key}
            style={{
              position: 'absolute',
              top: 0,
              left: 0,
              width: '100%',
              height: `${virtualRow.size}px`,
              transform: `translateY(${virtualRow.start}px)`,
            }}
          >
            <ProductCard product={products[virtualRow.index]} />
          </div>
        ))}
      </div>
    </div>
  );
}

3. CLS Optimization: News Article Page

Fixing layout shifts from images, ads, and fonts.

Before: High CLS (0.35)

// ❌ BAD: Multiple CLS issues
function Article({ article }: { article: Article }) {
  const [ad, setAd] = useState(null);

  useEffect(() => {
    loadAd().then(setAd);
  }, []);

  return (
    <article>
      <h1>{article.title}</h1>

      {/* Problem 1: Image without dimensions */}
      <img src={article.heroImage} alt="" />

      {/* Problem 2: Ad appears after load, shifts content */}
      {ad && <div className="ad-banner"><img src={ad.image} /></div>}

      <div dangerouslySetInnerHTML={{ __html: article.content }} />

      {/* Problem 3: Related articles load and shift */}
      <RelatedArticles />
    </article>
  );
}

// Problem 4: Font causes layout shift
// CSS
/* No font-display, no fallback sizing */
@font-face {
  font-family: 'CustomFont';
  src: url('/font.woff2');
}

After: Zero CLS (0.0)

// ✅ GOOD: All layout shifts prevented
import Image from 'next/image';

function Article({ article }: { article: Article }) {
  return (
    <article className="max-w-3xl mx-auto">
      <h1 className="text-4xl font-bold">{article.title}</h1>

      {/* Fixed dimensions prevent shift */}
      <div className="relative aspect-[16/9] my-6">
        <Image
          src={article.heroImage}
          alt={article.heroAlt}
          fill
          sizes="(max-width: 768px) 100vw, 768px"
          priority
          style={{ objectFit: 'cover' }}
        />
      </div>

      {/* Reserved space for ad */}
      <AdSlot
        slot="article-top"
        className="my-6"
        minHeight={250}
      />

      <div
        className="prose prose-lg"
        dangerouslySetInnerHTML={{ __html: article.content }}
      />

      {/* Reserved space for related */}
      <RelatedArticles articleId={article.id} />
    </article>
  );
}

// Ad component with reserved space
function AdSlot({
  slot,
  className,
  minHeight
}: {
  slot: string;
  className?: string;
  minHeight: number;
}) {
  const [ad, setAd] = useState<Ad | null>(null);
  const [loaded, setLoaded] = useState(false);

  useEffect(() => {
    loadAd(slot).then(ad => {
      setAd(ad);
      setLoaded(true);
    });
  }, [slot]);

  return (
    <div
      className={className}
      style={{ minHeight: `${minHeight}px` }} // Reserved space
    >
      {loaded ? (
        ad ? (
          <Image
            src={ad.image}
            alt={ad.alt}
            width={ad.width}
            height={ad.height}
          />
        ) : null // No ad, space collapses gracefully
      ) : (
        <Skeleton height={minHeight} /> // Placeholder during load
      )}
    </div>
  );
}

// Related articles with skeleton
function RelatedArticles({ articleId }: { articleId: string }) {
  const [articles, setArticles] = useState<Article[] | null>(null);

  useEffect(() => {
    fetchRelated(articleId).then(setArticles);
  }, [articleId]);

  return (
    <section className="mt-12">
      <h2 className="text-2xl font-bold mb-6">Related Articles</h2>

      {/* Fixed grid prevents shift */}
      <div className="grid grid-cols-3 gap-6">
        {articles ? (
          articles.map(article => (
            <ArticleCard key={article.id} article={article} />
          ))
        ) : (
          // Skeleton matches final layout exactly
          <>
            <ArticleCardSkeleton />
            <ArticleCardSkeleton />
            <ArticleCardSkeleton />
          </>
        )}
      </div>
    </section>
  );
}

// Skeleton that matches card dimensions exactly
function ArticleCardSkeleton() {
  return (
    <div className="animate-pulse">
      <div className="aspect-[16/9] bg-gray-200 rounded-lg" />
      <div className="mt-3 h-5 bg-gray-200 rounded w-3/4" />
      <div className="mt-2 h-4 bg-gray-200 rounded w-1/2" />
    </div>
  );
}

Font Loading Without CLS

/* ✅ Optimized font loading */

/* Main font with swap and metrics */
@font-face {
  font-family: 'Inter';
  src: url('/fonts/inter-var.woff2') format('woff2');
  font-display: swap;
  font-weight: 100 900;
}

/* Fallback font with matched metrics */
@font-face {
  font-family: 'Inter Fallback';
  src: local('Arial');
  size-adjust: 107.64%;
  ascent-override: 90%;
  descent-override: 22.43%;
  line-gap-override: 0%;
}

body {
  font-family: 'Inter', 'Inter Fallback', system-ui, sans-serif;
}

/* Alternative: font-display: optional for non-critical fonts */
@font-face {
  font-family: 'DisplayFont';
  src: url('/fonts/display.woff2') format('woff2');
  font-display: optional; /* Won't cause FOUT - uses fallback if not cached */
}

4. Complete RUM Implementation

Full Real User Monitoring setup with Next.js.

// lib/performance.ts
import { onCLS, onINP, onLCP, onFCP, onTTFB, type Metric } from 'web-vitals';

const ENDPOINT = '/api/vitals';

interface EnrichedMetric {
  name: string;
  value: number;
  rating: 'good' | 'needs-improvement' | 'poor';
  delta: number;
  id: string;
  navigationType: string;
  url: string;
  timestamp: number;
  connection?: string;
  deviceMemory?: number;
  viewport: { width: number; height: number };
}

function getConnectionInfo() {
  const nav = navigator as Navigator & {
    connection?: { effectiveType?: string };
    deviceMemory?: number;
  };

  return {
    connection: nav.connection?.effectiveType,
    deviceMemory: nav.deviceMemory,
  };
}

function sendMetric(metric: Metric) {
  const enriched: EnrichedMetric = {
    name: metric.name,
    value: metric.value,
    rating: metric.rating,
    delta: metric.delta,
    id: metric.id,
    navigationType: metric.navigationType,
    url: window.location.href,
    timestamp: Date.now(),
    ...getConnectionInfo(),
    viewport: {
      width: window.innerWidth,
      height: window.innerHeight,
    },
  };

  // Use sendBeacon for reliability
  if (navigator.sendBeacon) {
    navigator.sendBeacon(ENDPOINT, JSON.stringify(enriched));
  } else {
    fetch(ENDPOINT, {
      method: 'POST',
      body: JSON.stringify(enriched),
      keepalive: true,
    });
  }

  // Debug in development
  if (process.env.NODE_ENV === 'development') {
    const color = {
      good: 'green',
      'needs-improvement': 'orange',
      poor: 'red',
    }[metric.rating];

    console.log(
      `%c[${metric.name}] ${metric.value.toFixed(1)}${metric.name === 'CLS' ? '' : 'ms'}`,
      `color: ${color}; font-weight: bold`
    );
  }
}

export function initWebVitals() {
  onCLS(sendMetric);
  onINP(sendMetric);
  onLCP(sendMetric);
  onFCP(sendMetric);
  onTTFB(sendMetric);
}
// app/components/web-vitals.tsx
'use client';

import { useEffect } from 'react';
import { initWebVitals } from '@/lib/performance';

export function WebVitals() {
  useEffect(() => {
    initWebVitals();
  }, []);

  return null;
}
// app/api/vitals/route.ts
import { NextRequest, NextResponse } from 'next/server';

interface VitalMetric {
  name: string;
  value: number;
  rating: string;
  url: string;
  timestamp: number;
}

export async function POST(request: NextRequest) {
  const metric: VitalMetric = await request.json();

  // Log for debugging
  console.log('[Vital]', metric.name, metric.value, metric.rating);

  // Store in database (example with Drizzle)
  // await db.insert(webVitals).values({
  //   name: metric.name,
  //   value: metric.value,
  //   rating: metric.rating,
  //   url: metric.url,
  //   timestamp: new Date(metric.timestamp),
  // });

  // Alert on poor metrics
  if (metric.rating === 'poor') {
    // await alertService.send({
    //   severity: 'warning',
    //   message: `Poor ${metric.name}: ${metric.value} on ${metric.url}`,
    // });
  }

  return NextResponse.json({ ok: true });
}

5. Performance Budget Enforcement

CI/CD integration with Lighthouse CI.

lighthouserc.js

module.exports = {
  ci: {
    collect: {
      url: [
        'http://localhost:3000/',
        'http://localhost:3000/products',
        'http://localhost:3000/checkout',
      ],
      numberOfRuns: 3,
      settings: {
        preset: 'desktop',
        // Throttle to simulate 4G
        // throttling: { ... }
      },
    },
    assert: {
      assertions: {
        // Core Web Vitals
        'largest-contentful-paint': ['error', { maxNumericValue: 2500 }],
        'cumulative-layout-shift': ['error', { maxNumericValue: 0.1 }],
        'total-blocking-time': ['error', { maxNumericValue: 200 }], // Proxy for INP

        // Other performance metrics
        'first-contentful-paint': ['warn', { maxNumericValue: 1800 }],
        'speed-index': ['warn', { maxNumericValue: 3400 }],

        // Resource budgets
        'resource-summary:script:size': ['error', { maxNumericValue: 150000 }],
        'resource-summary:image:size': ['error', { maxNumericValue: 300000 }],
        'resource-summary:total:size': ['error', { maxNumericValue: 500000 }],

        // Scores
        'categories:performance': ['error', { minScore: 0.9 }],
        'categories:accessibility': ['error', { minScore: 0.9 }],
      },
    },
    upload: {
      target: 'temporary-public-storage',
    },
  },
};

GitHub Actions Workflow

# .github/workflows/lighthouse.yml
name: Lighthouse CI

on:
  pull_request:
    branches: [main]

jobs:
  lighthouse:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Install dependencies
        run: npm ci

      - name: Build
        run: npm run build

      - name: Start server
        run: npm start &

      - name: Wait for server
        run: npx wait-on http://localhost:3000

      - name: Run Lighthouse CI
        run: |
          npm install -g @lhci/cli
          lhci autorun
        env:
          LHCI_GITHUB_APP_TOKEN: ${{ secrets.LHCI_GITHUB_APP_TOKEN }}

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: lighthouse-results
          path: .lighthouseci/

Quick Reference

// ✅ LCP: Server-render, preload, priority
export default async function Page() {
  const data = await getData(); // Server-side
  return <Image src={data.hero} priority fill />;
}

// ✅ INP: useTransition for expensive updates
const [isPending, startTransition] = useTransition();
onChange={(e) => {
  setQuery(e.target.value);
  startTransition(() => setResults(filter(e.target.value)));
}}

// ✅ CLS: Always set dimensions
<Image src="/photo.jpg" width={800} height={600} />
<div className="aspect-[16/9]"><Image fill /></div>
<div className="min-h-[250px]">{content}</div>

// ✅ RUM: Send metrics reliably
navigator.sendBeacon('/api/vitals', JSON.stringify(metric));

// ✅ Debug: Find LCP element
new PerformanceObserver((list) => {
  console.log('LCP:', list.getEntries().at(-1)?.element);
}).observe({ type: 'largest-contentful-paint', buffered: true });

Image Optimization Examples

Hero Image with Blur Placeholder

import Image from 'next/image';
import heroImage from '@/public/hero.jpg'; // Static import

function Hero() {
  return (
    <div className="relative h-[600px] w-full">
      <Image
        src={heroImage}
        alt="Beautiful landscape"
        fill
        priority
        placeholder="blur" // Automatic with static import
        sizes="100vw"
        style={{ objectFit: 'cover' }}
      />
      <div className="absolute inset-0 flex items-center justify-center">
        <h1 className="text-5xl font-bold text-white">Welcome</h1>
      </div>
    </div>
  );
}

Product Grid with Responsive Sizes

function ProductGrid({ products }) {
  return (
    <div className="grid grid-cols-2 md:grid-cols-3 lg:grid-cols-4 gap-4">
      {products.map((product) => (
        <div key={product.id} className="relative aspect-square">
          <Image
            src={product.imageUrl}
            alt={product.name}
            fill
            sizes="(max-width: 640px) 50vw, (max-width: 1024px) 33vw, 25vw"
            className="object-cover rounded-lg"
          />
        </div>
      ))}
    </div>
  );
}

Avatar with Fallback

function UserAvatar({ user }) {
  const [error, setError] = useState(false);

  if (error || !user.avatarUrl) {
    return (
      <div className="h-10 w-10 rounded-full bg-blue-500 flex items-center justify-center">
        <span className="text-white font-medium">
          {user.name.charAt(0).toUpperCase()}
        </span>
      </div>
    );
  }

  return (
    <Image
      src={user.avatarUrl}
      alt={user.name}
      width={40}
      height={40}
      className="rounded-full"
      onError={() => setError(true)}
    />
  );
}

Art Direction (Different Crops)

function ResponsiveBanner() {
  return (
    <>
      {/* Mobile: Portrait crop */}
      <div className="relative h-[400px] md:hidden">
        <Image
          src="/banner-mobile.jpg"
          alt="Banner"
          fill
          priority
          sizes="100vw"
          className="object-cover"
        />
      </div>

      {/* Desktop: Landscape crop */}
      <div className="relative hidden h-[300px] md:block">
        <Image
          src="/banner-desktop.jpg"
          alt="Banner"
          fill
          priority
          sizes="100vw"
          className="object-cover"
        />
      </div>
    </>
  );
}

Gallery with Lightbox

function ImageGallery({ images }) {
  const [selected, setSelected] = useState(null);

  return (
    <>
      <div className="grid grid-cols-3 gap-2">
        {images.map((image, i) => (
          <button
            key={image.id}
            onClick={() => setSelected(image)}
            className="relative aspect-square"
          >
            <Image
              src={image.thumbnailUrl}
              alt={image.alt}
              fill
              sizes="33vw"
              className="object-cover"
            />
          </button>
        ))}
      </div>

      {selected && (
        <Dialog open onClose={() => setSelected(null)}>
          <div className="relative h-[80vh] w-[90vw]">
            <Image
              src={selected.fullUrl}
              alt={selected.alt}
              fill
              sizes="90vw"
              quality={90}
              className="object-contain"
            />
          </div>
        </Dialog>
      )}
    </>
  );
}

Background Image Pattern

// For true background images, use CSS
function HeroWithCSSBackground() {
  return (
    <div
      className="h-[600px] bg-cover bg-center"
      style={{ backgroundImage: 'url(/hero.webp)' }}
    >
      <div className="h-full flex items-center justify-center bg-black/40">
        <h1 className="text-white text-5xl">Hero Title</h1>
      </div>
    </div>
  );
}

// For Next.js optimization, use Image with fill
function HeroWithNextImage() {
  return (
    <div className="relative h-[600px]">
      <Image
        src="/hero.webp"
        alt=""
        fill
        priority
        className="object-cover -z-10"
      />
      <div className="h-full flex items-center justify-center bg-black/40">
        <h1 className="text-white text-5xl">Hero Title</h1>
      </div>
    </div>
  );
}

Orchestkit Performance Wins

OrchestKit Performance Wins - Real Optimization Examples

This document showcases actual performance optimizations from OrchestKit's production implementation with before/after metrics.

Overview

Key Performance Achievements:

  • LLM costs: $35k/year → $2-5k/year (85-95% reduction)
  • Vector search: 85ms → 5ms (17x faster)
  • Retrieval accuracy: 87.2% → 91.6% (5.1% improvement)
  • Quality gate pass rate: Increased from 67-77% → 85%+ (stable)
  • Cache hit rate: 0% → 90% (L1) + 75% (L2)
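The two hit rates compound: L2 only sees L1 misses, so the effective hit rate is L1 + (1 − L1) × L2. A quick check:

```python
l1_hit = 0.90   # prompt cache
l2_hit = 0.75   # semantic cache, measured on L1 misses only
combined = l1_hit + (1 - l1_hit) * l2_hit
# Only ~2.5% of requests pay full LLM price
assert abs(combined - 0.975) < 1e-9
```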

Win 1: Multi-Level LLM Caching

Problem

Projected annual LLM costs: $35,000

  • 8 agents per analysis, 1,500-1,800 tokens each
  • Average 145 analyses/month
  • No caching = every query hits LLM
  • Claude Sonnet 4.5: $3/MTok input, $15/MTok output

Investigation

Cost breakdown by agent:

-- Langfuse query
SELECT
    metadata->>'agent_type' as agent,
    SUM(calculated_total_cost) as total_cost,
    AVG(input_tokens) as avg_input,
    AVG(output_tokens) as avg_output
FROM traces
GROUP BY agent
ORDER BY total_cost DESC;

Results:

| Agent | Monthly Cost | Avg Input | Avg Output |
| --- | --- | --- | --- |
| security_auditor | $3.05 | 1,800 | 1,200 |
| implementation_planner | $2.76 | 1,600 | 1,100 |
| tech_comparator | $2.61 | 1,500 | 1,000 |
| Total (8 agents) | $18.73 | - | - |

Pain points:

  • Analyzing similar content (React tutorials, FastAPI guides) repeatedly
  • Security patterns (XSS, SQL injection) are common across codebases
  • Implementation patterns (CRUD, auth) are highly repetitive

Solution: 3-Level Cache Hierarchy

Architecture:

Request → L1: Prompt Cache (Claude native)
         ↓ miss (10%)
         → L2: Semantic Cache (Redis vector search)
         ↓ miss (25% of L1 misses)
         → L3: LLM Call (actual cost)
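The hit rates in this diagram compound: only the residual misses of each level reach the next. A quick sanity-check of the arithmetic, using the illustrative rates from this document:

```python
# Share of all requests resolved at each cache level, given a 90% L1
# hit rate and a 75% L2 hit rate measured on L1 misses.
l1_hit = 0.90
l2_hit_given_l1_miss = 0.75

l1_share = l1_hit
l2_share = (1 - l1_hit) * l2_hit_given_l1_miss
l3_share = (1 - l1_hit) * (1 - l2_hit_given_l1_miss)

# 90% / 7.5% / 2.5% -- only 2.5% of requests pay full LLM cost
assert round(l2_share, 4) == 0.075
assert round(l3_share, 4) == 0.025
```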

L1: Claude Prompt Caching (Native)

File: backend/app/shared/services/llm/anthropic_client.py

from anthropic import AsyncAnthropic

anthropic_client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

async def call_claude_with_prompt_cache(
    system_prompt: str,
    user_message: str,
    model: str = "claude-sonnet-4-6"
) -> str:
    """Call Claude with prompt caching for system prompts."""

    response = await anthropic_client.messages.create(
        model=model,
        max_tokens=4096,
        system=[
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"}  # Cache this!
            }
        ],
        messages=[
            {"role": "user", "content": user_message}
        ]
    )

    # Log cache usage
    cache_hit = response.usage.cache_read_input_tokens > 0
    logger.info("claude_prompt_cache",
        cache_hit=cache_hit,
        cache_read_tokens=response.usage.cache_read_input_tokens,
        input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens
    )

    return response.content[0].text

Cost savings:

  • Cache hit: 90% discount on cached tokens
  • Cache duration: 5 minutes
  • Effective for: Agent system prompts (1,500+ tokens each)
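For a 1,500-token system prompt the discount adds up quickly. A rough sketch of the input-side cost, assuming the commonly published multipliers (cache reads at ~10% of base input price, 5-minute cache writes at ~1.25x; check current Anthropic pricing before relying on these numbers):

```python
BASE_INPUT_PER_MTOK = 3.00   # Sonnet input price used throughout this doc
CACHE_WRITE_MULT = 1.25      # assumed premium on cache-creating calls
CACHE_READ_MULT = 0.10       # the 90% discount on cache hits

def system_prompt_cost(prompt_tokens: int, calls: int, hit_rate: float) -> float:
    """Dollar cost of the system-prompt portion across `calls` requests."""
    hits = calls * hit_rate
    misses = calls - hits
    per_token = BASE_INPUT_PER_MTOK / 1_000_000
    return prompt_tokens * per_token * (misses * CACHE_WRITE_MULT + hits * CACHE_READ_MULT)

uncached = 1500 * (BASE_INPUT_PER_MTOK / 1_000_000) * 100  # 100 calls, no caching
cached = system_prompt_cost(1500, 100, 0.90)               # 90% hit rate
assert cached / uncached < 0.25  # roughly a 78% reduction on the prompt portion
```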

L2: Semantic Cache (Redis + Vector Search)

File: backend/app/shared/services/cache/semantic_cache.py

import hashlib
import json
from datetime import datetime

import numpy as np
from redis import Redis

from app.shared.services.embeddings import embed_text


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


class SemanticCache:
    """Vector similarity-based cache for LLM responses."""

    def __init__(self, redis_client: Redis, threshold: float = 0.92):
        self.redis = redis_client
        self.threshold = threshold  # Cosine similarity threshold

    async def get(self, query: str) -> str | None:
        """Check if a semantically similar query exists in the cache."""

        # Generate query embedding
        query_embedding = await embed_text(query)

        # Search for similar cached queries
        # (via Redis VSS or a dedicated vector store)
        cached_queries = await self._vector_search(query_embedding, top_k=5)

        for cached_query, cached_embedding, cached_response in cached_queries:
            similarity = cosine_similarity(query_embedding, cached_embedding)

            if similarity >= self.threshold:
                logger.info("semantic_cache_hit",
                    similarity=similarity,
                    cached_query=cached_query[:100]
                )
                return cached_response

        return None  # Cache miss

    async def set(self, query: str, response: str, ttl: int = 3600):
        """Store query-response pair with embedding."""

        # Generate embedding
        embedding = await embed_text(query)

        # Store in Redis (with vector index); hashlib gives a stable key,
        # unlike builtin hash(), which is randomized per process
        cache_key = f"semantic_cache:{hashlib.sha256(query.encode()).hexdigest()}"
        await self.redis.setex(
            cache_key,
            ttl,
            json.dumps({
                "query": query,
                "response": response,
                "embedding": embedding.tolist(),
                "timestamp": datetime.now().isoformat()
            })
        )

Cost savings:

  • 75% hit rate on L1 misses
  • Near-instant responses (5-10ms vs 2000ms)
  • Effective for: Similar technical queries

Implementation in agent calls:

@observe(name="agent_execution")
async def execute_agent(agent_type: str, content: str) -> Finding:
    """Execute agent with 3-level caching."""

    # Build query
    system_prompt = get_agent_system_prompt(agent_type)  # 1,500+ tokens
    user_message = f"Analyze this content:\n\n{content[:8000]}"

    # L2: Check semantic cache
    cache_key = f"{agent_type}:{content[:200]}"  # Simple key for demo
    cached_response = await semantic_cache.get(cache_key)

    if cached_response:
        logger.info("cache_hit", level="L2_semantic", agent=agent_type)
        return parse_finding(cached_response)

    # L1 + L3: Call Claude (with prompt caching)
    response = await call_claude_with_prompt_cache(
        system_prompt=system_prompt,  # Cached by Claude
        user_message=user_message
    )

    # Store in semantic cache
    await semantic_cache.set(cache_key, response, ttl=3600)

    return parse_finding(response)

Results

Cost Reduction:

Baseline (no cache):     $35,000/year
L1 savings (90% hit):    -$28,350  (90% discount on 90% of queries)
L2 savings (75% hit):    -$4,650   (85% discount on 75% of L1 misses)
Final cost:              $2,000-5,000/year

Total savings: 85-95%

Latency Improvement:

| Cache Level | Hit Rate | Latency | Cost Savings |
| --- | --- | --- | --- |
| L1 (Prompt) | 90% | 2000ms (same) | 90% on cached tokens |
| L2 (Semantic) | 75% (of L1 misses) | 5-10ms | 85% (full skip) |
| L3 (LLM) | 2.5% (fallback) | 2000ms | 0% (full cost) |

Implementation effort: 2 days
Maintenance overhead: Low (cache TTL auto-expires stale data)

Win 2: Vector Index Optimization (HNSW vs IVFFlat)

Problem

Vector search taking 85ms, needed <10ms

  • Golden dataset: 415 chunks, 1536-dim embeddings
  • IVFFlat index (lists=10)
  • Hybrid search (vector + BM25 RRF) bottlenecked by vector search

Investigation

Benchmark both index types:

-- IVFFlat performance
EXPLAIN ANALYZE
SELECT * FROM chunks
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 10;

-- Result:
-- Planning Time: 2.1 ms
-- Execution Time: 85.3 ms

-- HNSW performance
CREATE INDEX idx_chunk_embedding_hnsw ON chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

EXPLAIN ANALYZE
SELECT * FROM chunks
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 10;

-- Result:
-- Planning Time: 2.0 ms
-- Execution Time: 5.1 ms

Trade-offs:

| Index | Build Time | Query Time | Accuracy | Memory |
| --- | --- | --- | --- | --- |
| IVFFlat (lists=10) | 2s | 85ms | 95% | Low |
| HNSW (m=16) | 8s | 5ms | 98% | Medium |

Solution: HNSW Index with Optimized Parameters

File: backend/alembic/versions/xxx_add_hnsw_index.py

def upgrade():
    """Add HNSW index for vector similarity search."""

    # CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
    # so step outside Alembic's per-migration transaction first.
    with op.get_context().autocommit_block():
        op.execute("""
            CREATE INDEX CONCURRENTLY idx_chunk_embedding_hnsw
            ON chunks USING hnsw (embedding vector_cosine_ops)
            WITH (m = 16, ef_construction = 64);
        """)

    # Drop old IVFFlat index
    op.execute("DROP INDEX IF EXISTS idx_chunk_embedding_ivfflat;")

Parameters chosen:

  • m = 16: Connections per layer (sweet spot for 1k-10k vectors)
  • ef_construction = 64: Build-time quality (higher = better accuracy, slower build)
  • ef_search = 64: Query-time quality (can tune per query)

Runtime tuning:

from sqlalchemy import select, text

async def search_similar_chunks(
    embedding: list[float],
    top_k: int = 10
) -> list[Chunk]:
    """Vector similarity search with HNSW index."""

    # Tune ef_search for accuracy vs speed trade-off
    await session.execute(text("SET hnsw.ef_search = 64;"))

    results = await session.execute(
        select(Chunk)
        .order_by(Chunk.embedding.cosine_distance(embedding))
        .limit(top_k)
    )

    return results.scalars().all()

Results

Performance:

  • Query latency: 85ms → 5ms (17x faster)
  • Accuracy: 95% → 98% (3% improvement)
  • Build time: 2s → 8s (acceptable for 415 chunks)

Impact on retrieval:

  • Hybrid search latency: 95ms → 15ms (p95)
  • Throughput: 10.5 req/s → 66 req/s (6x improvement)

Implementation effort: 4 hours (index creation + testing)

Win 3: Hybrid Search Ranking Optimization

Problem

Retrieval pass rate: 87.2%, target: >90%

  • Expected chunks ranked 6-10 instead of top-5
  • RRF fusion not getting enough candidates
  • No metadata boosting

Investigation

Golden dataset analysis (203 queries):

# Evaluate current ranking
results = []
for query in golden_queries:
    retrieved = await hybrid_search(query.text, top_k=10)
    expected_in_top_k = any(chunk.id in query.expected_chunk_ids for chunk in retrieved)
    rank = next((i for i, c in enumerate(retrieved) if c.id in query.expected_chunk_ids), -1)

    results.append({
        "query": query.text,
        "expected_rank": rank,
        "found": rank != -1,
        "passed": rank < 10
    })

# Results:
# Pass rate: 177/203 = 87.2%
# MRR: 0.723
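From the collected ranks, MRR is one line. A minimal sketch consistent with the 0-indexed `enumerate` above (-1 marks queries whose expected chunk was never retrieved):

```python
def mean_reciprocal_rank(ranks: list[int]) -> float:
    """MRR over 0-indexed ranks; -1 means the expected chunk was never retrieved."""
    return sum(1 / (r + 1) for r in ranks if r != -1) / len(ranks)

# Hits at rank 0 and rank 1, plus one complete miss
assert mean_reciprocal_rank([0, 1, -1]) == 0.5
```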

Failure analysis:

  • 26 queries failed (expected chunk not in top-10)
  • Common issue: Expected chunk ranked 11-15
  • Root cause: RRF fusion only fetching 2x candidates (20 for top-10)

Solution: Multi-Pronged Optimization

1. Increase RRF Fetch Multiplier

File: backend/app/core/constants.py

# Before
HYBRID_FETCH_MULTIPLIER = 2  # Fetch 20 for top-10

# After
HYBRID_FETCH_MULTIPLIER = 3  # Fetch 30 for top-10

Rationale: More candidates → better RRF coverage → higher recall
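Raising the multiplier matters because RRF can only fuse what each retriever actually returns: a chunk ranked 11th by both retrievers scores zero if only the top 10 are fetched. A minimal sketch of standard RRF (k=60 is the conventional constant; the document IDs here are illustrative):

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A doc ranked well in BOTH lists beats one ranked first in only one:
fused = rrf_fuse([["a", "b", "c"], ["b", "c", "d"]])
assert fused[0] == "b"
```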

2. Add Metadata Boosting

File: backend/app/shared/services/search/search_service.py

def apply_metadata_boosts(
    chunks: list[Chunk],
    query: str
) -> list[Chunk]:
    """Boost scores based on metadata signals."""

    query_lower = query.lower()

    for chunk in chunks:
        # Boost if query matches section title
        if chunk.section_title and any(
            term in chunk.section_title.lower()
            for term in query_lower.split()
        ):
            chunk.score *= SECTION_TITLE_BOOST_FACTOR  # 2.0

        # Boost if query matches document path
        if chunk.document_path and any(
            term in chunk.document_path.lower()
            for term in query_lower.split()
        ):
            chunk.score *= DOCUMENT_PATH_BOOST_FACTOR  # 1.15

        # Boost code blocks for technical queries
        if chunk.chunk_type == "code_block" and is_technical_query(query):
            chunk.score *= TECHNICAL_KEYWORD_BOOST  # 1.2

    return sorted(chunks, key=lambda c: c.score, reverse=True)

3. Pre-Compute tsvector for BM25

Before:

-- Compute tsvector on-the-fly (slow!)
SELECT *, ts_rank(to_tsvector('english', content), query) as rank
FROM chunks
WHERE to_tsvector('english', content) @@ query
ORDER BY rank DESC;

After:

-- Use pre-computed tsvector column (fast!)
SELECT *, ts_rank(content_tsvector, query) as rank
FROM chunks
WHERE content_tsvector @@ query
ORDER BY rank DESC;

Migration:

def upgrade():
    """Add pre-computed tsvector column."""

    # Add column
    op.add_column('chunks', sa.Column('content_tsvector', TSVECTOR))

    # Populate
    op.execute("""
        UPDATE chunks
        SET content_tsvector = to_tsvector('english', content);
    """)

    # Create GIN index
    op.execute("""
        CREATE INDEX idx_chunk_tsvector
        ON chunks USING GIN(content_tsvector);
    """)

    # Add trigger to keep it updated
    op.execute("""
        CREATE TRIGGER tsvector_update BEFORE INSERT OR UPDATE
        ON chunks FOR EACH ROW EXECUTE FUNCTION
        tsvector_update_trigger(content_tsvector, 'pg_catalog.english', content);
    """)

Results

Ranking Quality:

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Pass rate | 177/203 (87.2%) | 186/203 (91.6%) | +5.1% |
| MRR (overall) | 0.723 | 0.777 | +7.4% |
| MRR (hard queries) | 0.647 | 0.686 | +6.0% |

Query Performance:

| Operation | Before | After | Change |
| --- | --- | --- | --- |
| BM25 search | 45ms | 4ms | 11x faster |
| Vector search | 5ms | 5ms | Same |
| RRF fusion | 2ms | 3ms | Slightly slower (more candidates) |
| Total | 52ms | 12ms | 4.3x faster |

Impact by boost factor:

  • Section title boost: +7.4% MRR (most impactful)
  • Document path boost: +2.1% MRR
  • Code block boost: +1.3% MRR (for technical queries)

Implementation effort: 1 day (constants, migration, testing)

Win 4: SSE Event Buffering (Race Condition Fix)

Problem

Frontend showed 0% progress while backend was running

  • Real-time progress updates missing
  • EventSource connection established AFTER events published
  • No event replay mechanism

Investigation

Reproduce issue:

  1. Start analysis via API
  2. Frontend subscribes to SSE /progress/{analysis_id}
  3. Backend immediately publishes "analysis_started" event
  4. Frontend connects 200ms later → misses early events

Root cause:

# ❌ BAD: Events lost if no subscriber yet
class EventBroadcaster:
    def publish(self, channel: str, event: dict):
        if channel not in self._subscribers:
            return  # Event lost!

        for subscriber in self._subscribers[channel]:
            subscriber.send(event)

Solution: Event Buffering with Replay

File: backend/app/services/event_broadcaster.py

import asyncio
import json
from collections import deque
from dataclasses import dataclass
from datetime import datetime

@dataclass
class BufferedEvent:
    """Event with timestamp for replay."""
    data: dict
    timestamp: datetime

class EventBroadcaster:
    """SSE broadcaster with event buffering."""

    def __init__(self, buffer_size: int = 100):
        self._subscribers: dict[str, list] = {}
        self._buffers: dict[str, deque[BufferedEvent]] = {}
        self._buffer_size = buffer_size

    def publish(self, channel: str, event: dict):
        """Publish event and store in buffer."""

        # Create buffer if needed
        if channel not in self._buffers:
            self._buffers[channel] = deque(maxlen=self._buffer_size)

        # Add to buffer
        buffered_event = BufferedEvent(
            data=event,
            timestamp=datetime.now()
        )
        self._buffers[channel].append(buffered_event)

        # Send to active subscribers (each subscriber is an asyncio.Queue,
        # registered in subscribe() below)
        for subscriber in self._subscribers.get(channel, []):
            try:
                subscriber.put_nowait(event)
            except Exception as e:
                logger.error("failed_to_send_event", error=str(e))

    async def subscribe(self, channel: str):
        """Subscribe to channel and replay buffered events."""

        # Replay buffered events first
        for buffered_event in self._buffers.get(channel, []):
            yield {
                "event": "message",
                "data": json.dumps(buffered_event.data)
            }

        # Then stream new events
        queue = asyncio.Queue()
        self._subscribers.setdefault(channel, []).append(queue)

        try:
            while True:
                event = await queue.get()
                yield {
                    "event": "message",
                    "data": json.dumps(event)
                }
        finally:
            self._subscribers[channel].remove(queue)
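The replay guarantee is easy to unit-test in isolation: publish before any subscriber exists, then subscribe and confirm the buffered event arrives first. A self-contained sketch of that property, using only the deque buffer and queue rather than the full class:

```python
import asyncio
from collections import deque

async def demo_replay() -> list[dict]:
    buffer: deque = deque(maxlen=100)

    # Publisher fires before anyone is listening -- the event survives
    # in the buffer instead of being dropped.
    buffer.append({"type": "analysis_started", "progress": 0})

    received = []
    # Late subscriber: replay buffered events first...
    received.extend(buffer)
    # ...then drain live events from its queue as they arrive.
    queue: asyncio.Queue = asyncio.Queue()
    queue.put_nowait({"type": "progress", "progress": 50})
    received.append(await queue.get())
    return received

events = asyncio.run(demo_replay())
assert events[0]["type"] == "analysis_started"
```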

API endpoint:

@app.get("/progress/{analysis_id}")
async def stream_progress(analysis_id: str):
    """Stream analysis progress with buffered event replay."""

    channel = f"analysis:{analysis_id}"

    async def event_generator():
        async for event in event_broadcaster.subscribe(channel):
            yield f"data: {event['data']}\n\n"

    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream"
    )

Results

Before (with race condition):

  • 0% progress shown until agent completion (30-60 seconds)
  • Users confused, thought app was frozen
  • Support tickets: "Analysis stuck at 0%"

After (with buffering):

  • All events delivered (100% replay rate)
  • Progress updates appear immediately
  • Memory overhead: ~10KB per active analysis (100 events × 100 bytes)

Implementation effort: 3 hours (buffer logic + tests)

Win 5: Quality Gate Content Truncation Fix

Problem

Quality scores artificially low due to content truncation

  • Depth scores: 5/10 (AWFUL) → required retries
  • G-Eval only seeing truncated summaries
  • 4 stages of truncation compounding

Investigation

Trace truncation points:

# Stage 1: compress_findings.py
MAX_STRING_LENGTH = 200  # ❌ Too aggressive!

# Stage 2: scorer.py
input_text = content[:2000]  # ❌ Truncated again!
output_text = response[:3000]

# Stage 3: quality.py
MAX_CONTENT_LENGTH = 8000  # ❌ Insufficient!

# Stage 4: quality_gate_node.py
insights = findings[:2000]  # ❌ Final truncation!

Example:

  1. Original finding: 5,000 chars (detailed security analysis)
  2. After Stage 1: 200 chars ("Found 3 vulnerabilities...")
  3. After synthesis: 1,500 chars (includes other findings)
  4. After Stage 2: 1,500 chars (same)
  5. After G-Eval: Depth score = 5/10 (insufficient detail)

Solution: Increase All Truncation Limits

Changes:

| File | Before | After | Rationale |
| --- | --- | --- | --- |
| compress_findings.py | 200 | 500 | Allow key insights |
| scorer.py (input) | 2,000 | 8,000 | Full context for eval |
| scorer.py (output) | 3,000 | 12,000 | Detailed responses |
| quality.py | 8,000 | 15,000 | Complete synthesis |
| quality_gate_node.py | 2,000 | 8,000 | All findings visible |

Implementation:

# backend/app/shared/services/g_eval/scorer.py
MAX_INPUT_LENGTH = 8000  # Increased from 2000
MAX_OUTPUT_LENGTH = 12000  # Increased from 3000

# backend/app/evaluation/evaluators/quality.py
MAX_CONTENT_LENGTH = 15000  # Increased from 8000

# backend/app/domains/analysis/workflows/tasks/aggregation/compress_findings.py
MAX_STRING_LENGTH = 500  # Increased from 200

Results

Quality Scores:

| Criterion | Before | After | Change |
| --- | --- | --- | --- |
| Completeness | 0.75 | 0.85 | +13% |
| Accuracy | 0.88 | 0.92 | +5% |
| Coherence | 0.84 | 0.88 | +5% |
| Depth | 0.58 | 0.78 | +34% |
| Overall | 0.76 | 0.86 | +13% |

Pass rate: 67-77% (variable) → 85%+ (stable)

Trade-offs:

  • Token usage: +15% (from 8k → 12k avg)
  • Cost impact: +$0.02 per analysis (acceptable)
  • Quality improvement: Worth the extra cost

Implementation effort: 2 hours (find all truncation points + update tests)

Summary Table

| Optimization | Metric | Before | After | Improvement | Effort |
| --- | --- | --- | --- | --- | --- |
| Multi-level caching | Annual cost | $35k | $2-5k | 85-95% | 2 days |
| HNSW index | Query latency | 85ms | 5ms | 17x faster | 4 hours |
| Hybrid search | Pass rate | 87.2% | 91.6% | +5.1% | 1 day |
| SSE buffering | Event delivery | 60% | 100% | +67% | 3 hours |
| Content truncation | Depth score | 0.58 | 0.78 | +34% | 2 hours |

Total implementation time: 4 days
Annual cost savings: $30-33k
Quality improvement: 13% overall, 34% depth
