OrchestKit v7.43.0 — 104 skills, 36 agents, 173 hooks · Claude Code 2.1.105+

Verify

Comprehensive verification with parallel test agents. Use when verifying implementations or validating changes.

Type: command · Priority: high

Invoke: /ork:verify

Verify Feature

Comprehensive verification using parallel specialized agents with nuanced grading (0-10 scale) and improvement suggestions.

Quick Start

/ork:verify authentication flow
/ork:verify --model=opus user profile feature
/ork:verify --scope=backend database migrations

Argument Resolution

SCOPE = "$ARGUMENTS"       # Full argument string, e.g., "authentication flow"
SCOPE_TOKEN = "$ARGUMENTS[0]"  # First token for flag detection (e.g., "--scope=backend")
# $ARGUMENTS[0], $ARGUMENTS[1] etc. for indexed access (CC 2.1.59)

# Model override detection (CC 2.1.72)
MODEL_OVERRIDE = None
for token in "$ARGUMENTS".split():
    if token.startswith("--model="):
        MODEL_OVERRIDE = token.split("=", 1)[1]  # "opus", "sonnet", "haiku"
        SCOPE = SCOPE.replace(token, "").strip()

Pass MODEL_OVERRIDE to all Agent() calls via model=MODEL_OVERRIDE when set. Accepts symbolic names (opus, sonnet, haiku) or full IDs (claude-opus-4-6) per CC 2.1.74.

Opus 4.6: Agents use native adaptive thinking (no MCP sequential-thinking needed). Extended 128K output supports comprehensive verification reports.


STEP 0: Effort-Aware Verification Scaling (CC 2.1.76)

Scale verification depth based on /effort level:

| Effort Level | Phases Run | Agents | Output |
|---|---|---|---|
| low | Run tests only → pass/fail | 0 agents | Quick check |
| medium | Tests + code quality + security | 3 agents | Score + top issues |
| high (default) | All 8 phases + visual capture | 6-7 agents | Full report + grades |

Override: Explicit user selection (e.g., "Full verification") overrides /effort downscaling.
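The scaling rules above can be sketched as a simple dispatch table. This is an illustrative sketch only; the plan keys and field names are assumptions, not part of the skill's API:

```python
# Illustrative effort-aware scaling — rows mirror the table above.
EFFORT_PLANS = {
    "low":    {"phases": ["tests"], "agents": 0, "output": "pass/fail"},
    "medium": {"phases": ["tests", "code_quality", "security"], "agents": 3,
               "output": "score + top issues"},
    "high":   {"phases": ["all"], "agents": 7, "output": "full report + grades"},
}

def resolve_plan(effort: str, explicit_full: bool = False) -> dict:
    """Explicit user selection (e.g. 'Full verification') overrides /effort downscaling."""
    if explicit_full:
        return EFFORT_PLANS["high"]
    return EFFORT_PLANS.get(effort, EFFORT_PLANS["high"])  # unknown level: default to high
```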

STEP 0a: Verify User Intent with AskUserQuestion

BEFORE creating tasks, clarify verification scope:

AskUserQuestion(
  questions=[{
    "question": "What scope for this verification?",
    "header": "Scope",
    "options": [
      {"label": "Full verification (Recommended)", "description": "All tests + security + code quality + visual + grades", "markdown": "```\nFull Verification (10 phases)\n─────────────────────────────\n  7 parallel agents:\n  ┌────────────┐ ┌────────────┐\n  │ Code       │ │ Security   │\n  │ Quality    │ │ Auditor    │\n  ├────────────┤ ├────────────┤\n  │ Test       │ │ Backend    │\n  │ Generator  │ │ Architect  │\n  ├────────────┤ ├────────────┤\n  │ Frontend   │ │ Performance│\n  │ Developer  │ │ Engineer   │\n  ├────────────┤ └────────────┘\n  │ Visual     │\n  │ Capture    │ → gallery.html\n  └────────────┘\n\n    Composite Score (0-10)\n    8 dimensions + Grade\n    + Visual Gallery\n```"},
      {"label": "Tests only", "description": "Run unit + integration + e2e tests", "markdown": "```\nTests Only\n──────────\n  npm test ──▶ Results\n  ┌─────────────────────┐\n  │ Unit tests     ✓/✗  │\n  │ Integration    ✓/✗  │\n  │ E2E            ✓/✗  │\n  │ Coverage       NN%  │\n  └─────────────────────┘\n  Skip: security, quality, UI\n  Output: Pass/fail + coverage\n```"},
      {"label": "Security audit", "description": "Focus on security vulnerabilities", "markdown": "```\nSecurity Audit\n──────────────\n  security-auditor agent:\n  ┌─────────────────────────┐\n  │ OWASP Top 10       ✓/✗ │\n  │ Dependency CVEs    ✓/✗ │\n  │ Secrets scan       ✓/✗ │\n  │ Auth flow review   ✓/✗ │\n  │ Input validation   ✓/✗ │\n  └─────────────────────────┘\n  Output: Security score 0-10\n          + vulnerability list\n```"},
      {"label": "Code quality", "description": "Lint, types, complexity analysis", "markdown": "```\nCode Quality\n────────────\n  code-quality-reviewer agent:\n  ┌─────────────────────────┐\n  │ Lint errors         N   │\n  │ Type coverage       NN% │\n  │ Cyclomatic complex  N.N │\n  │ Dead code           N   │\n  │ Pattern violations  N   │\n  └─────────────────────────┘\n  Output: Quality score 0-10\n          + refactor suggestions\n```"},
      {"label": "Quick check", "description": "Just run tests, skip detailed analysis", "markdown": "```\nQuick Check (~1 min)\n────────────────────\n  Run tests ──▶ Pass/Fail\n\n  Output:\n  ├── Test results\n  ├── Build status\n  └── Lint status\n  No agents, no grading,\n  no report generation\n```"}
    ],
    "multiSelect": true
  }]
)

Based on answer, adjust workflow:

  • Full verification: All 10 phases (8 + 2.5 + 8.5), 7 parallel agents including visual capture
  • Tests only: Skip phases 2 (security), 5 (UI/UX analysis)
  • Security audit: Focus on security-auditor agent
  • Code quality: Focus on code-quality-reviewer agent
  • Quick check: Run tests only, skip grading and suggestions

STEP 0b: Select Orchestration Mode

Load details: Read("${CLAUDE_SKILL_DIR}/references/orchestration-mode.md") for env var check logic, Agent Teams vs Task Tool comparison, and mode selection rules.

Choose Agent Teams (mesh -- verifiers share findings) or Task tool (star -- all report to lead) based on the orchestration mode reference.


MCP Probe + Resume

ToolSearch(query="select:mcp__memory__search_nodes")
Write(".claude/chain/capabilities.json", { memory, timestamp })

Read(".claude/chain/state.json")  # resume if exists

Handoff File

After verification completes, write results:

Write(".claude/chain/verify-results.json", JSON.stringify({
  "phase": "verify", "skill": "verify",
  "timestamp": now(), "status": "completed",
  "outputs": {
    "tests_passed": N, "tests_failed": N,
    "coverage": "87%", "security_scan": "clean"
  }
}))

Regression Monitor (CC 2.1.71)

Optionally schedule post-verification monitoring:

# Guard: Skip cron in headless/CI (CLAUDE_CODE_DISABLE_CRON)
# if env CLAUDE_CODE_DISABLE_CRON is set, run a single check instead
CronCreate(
  schedule="0 8 * * *",
  prompt="Daily regression check: npm test.
    If 7 consecutive passes → CronDelete.
    If failures → alert with details."
)

Task Management (CC 2.1.16)

# 1. Create main verification task
TaskCreate(
  subject="Verify [feature-name] implementation",
  description="Comprehensive verification with nuanced grading",
  activeForm="Verifying [feature-name] implementation"
)

# 2. Create subtasks for 8-phase process
TaskCreate(subject="Run code quality checks", activeForm="Running quality checks")    # id=2
TaskCreate(subject="Execute security audit", activeForm="Running security audit")     # id=3
TaskCreate(subject="Verify test coverage", activeForm="Verifying test coverage")      # id=4
TaskCreate(subject="Validate API", activeForm="Validating API")                       # id=5
TaskCreate(subject="Check UI/UX", activeForm="Checking UI/UX")                       # id=6
TaskCreate(subject="Calculate grades", activeForm="Calculating grades")               # id=7
TaskCreate(subject="Generate suggestions", activeForm="Generating suggestions")       # id=8
TaskCreate(subject="Compile report", activeForm="Compiling report")                   # id=9

# 3. Set dependencies — phases 2-6 run in parallel, 7-9 are sequential
TaskUpdate(taskId="7", addBlockedBy=["2", "3", "4", "5", "6"])  # Grading needs all checks
TaskUpdate(taskId="8", addBlockedBy=["7"])  # Suggestions need grades
TaskUpdate(taskId="9", addBlockedBy=["8"])  # Report needs suggestions

# 4. Before starting each task, verify it's unblocked
task = TaskGet(taskId="2")  # Verify blockedBy is empty

# 5. Update status as you progress
TaskUpdate(taskId="2", status="in_progress")  # When starting
TaskUpdate(taskId="2", status="completed")    # When done — repeat for each subtask

8-Phase Workflow

Load details: Read("${CLAUDE_SKILL_DIR}/references/verification-phases.md") for complete phase details, agent spawn definitions, Agent Teams alternative, and team teardown.

| Phase | Activities | Output |
|---|---|---|
| 1. Context Gathering | Git diff, commit history | Changes summary |
| 2. Parallel Agent Dispatch | 6 agents evaluate | 0-10 scores |
| 2.5 Visual Capture | Screenshot routes, AI vision eval | Gallery + visual score |
| 3. Test Execution | Backend + frontend tests | Coverage data |
| 4. Nuanced Grading | Composite score calculation | Grade (A-F) |
| 5. Improvement Suggestions | Effort vs impact analysis | Prioritized list |
| 6. Alternative Comparison | Compare approaches (optional) | Recommendation |
| 7. Metrics Tracking | Trend analysis | Historical data |
| 8. Report Compilation | Evidence artifacts + gallery.html | Final report |
| 8.5 Agentation Loop | User annotates, ui-feedback fixes | Before/after diffs |

Phase 2 Agents (Quick Reference)

| Agent | Focus | Output |
|---|---|---|
| code-quality-reviewer | Lint, types, patterns | Quality 0-10 |
| security-auditor | OWASP, secrets, CVEs | Security 0-10 |
| test-generator | Coverage, test quality | Coverage 0-10 |
| backend-system-architect | API design, async | API 0-10 |
| frontend-ui-developer | React 19, Zod, a11y | UI 0-10 |
| python-performance-engineer | Latency, resources, scaling | Performance 0-10 |

Launch ALL agents in ONE message with run_in_background=True and max_turns=25.

Progressive Output (CC 2.1.76+)

Output each agent's score as soon as it completes — don't wait for all 6-7 agents.

Focus mode (CC 2.1.101): In focus mode, include the full composite score, all dimension scores, and the verdict in your final message — the user didn't see the incremental outputs.

Security:     8.2/10 — No critical vulnerabilities found
Code Quality: 7.5/10 — 3 complexity hotspots identified
[...remaining agents still running...]

This gives users real-time visibility into multi-agent verification. If any dimension scores below the security_minimum threshold (default 5.0), flag it as a blocker immediately — the user can terminate early without waiting for remaining agents.
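A minimal sketch of that incremental check, assuming the default security_minimum of 5.0 mentioned above (the function name is illustrative):

```python
SECURITY_MINIMUM = 5.0  # default threshold from the text above

def check_incremental(dimension: str, score: float, minimum: float = SECURITY_MINIMUM) -> str:
    """Format one agent's score as it arrives; flag blockers immediately."""
    line = f"{dimension}: {score:.1f}/10"
    if score < minimum:
        line += " -- BLOCKER: below minimum, user may terminate early"
    return line
```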

Monitor + Partial Results (CC 2.1.98)

Use Monitor for streaming test execution output from background scripts:

# Stream test output in real-time instead of waiting for completion
Bash(command="npm test 2>&1", run_in_background=true)
Monitor(pid=test_task_id)  # Each line → notification

Partial results (CC 2.1.98): If a verification agent fails mid-analysis, synthesize partial scores rather than re-spawning:

for agent_result in verification_results:
    if "[PARTIAL RESULT]" in agent_result.output:
        # Extract whatever scores the agent produced before crashing
        partial_score = parse_score(agent_result.output)  # May be incomplete
        scores[agent_result.dimension] = {
            "score": partial_score, "partial": True,
            "note": "Agent crashed — score based on partial analysis"
        }
        # A 4-dimension score is better than no score. Do NOT re-spawn.

Phase 2.5: Visual Capture (NEW — runs in parallel with Phase 2)

Load details: Read("${CLAUDE_SKILL_DIR}/references/visual-capture.md") for auto-detection, route discovery, screenshot capture, and AI vision evaluation.

Summary: Auto-detects project framework, starts dev server, discovers routes, uses agent-browser to screenshot each route, evaluates with Claude vision, generates self-contained gallery.html with base64-embedded images.

Output: verification-output/{timestamp}/gallery.html — open in browser to see all screenshots with AI evaluations, scores, and annotation diffs.

Graceful degradation: If no frontend detected or server won't start, skips visual capture with a warning — never blocks verification.

Phase 8.5: Agentation Visual Feedback (opt-in)

Load details: Read("${CLAUDE_SKILL_DIR}/references/visual-capture.md") (Phase 8.5 section) for agentation loop workflow.

Trigger: Only when agentation MCP is configured. Offers user the choice to annotate the live UI. ui-feedback agent processes annotations, re-screenshots show before/after.


Grading & Scoring

Load on demand:

  • Read("${CLAUDE_PLUGIN_ROOT}/skills/quality-gates/references/unified-scoring-framework.md") for dimensions, weights, grade thresholds, and improvement prioritization.
  • Read("${CLAUDE_SKILL_DIR}/references/quality-model.md") for verify-specific extensions (Visual dimension).
  • Read("${CLAUDE_SKILL_DIR}/references/grading-rubric.md") for per-agent scoring criteria.


Evidence & Test Execution

Load details: Read("${CLAUDE_SKILL_DIR}/rules/evidence-collection.md") for git commands, test execution patterns, metrics tracking, and post-verification feedback.


Policy-as-Code

Load details: Read("${CLAUDE_SKILL_DIR}/references/policy-as-code.md") for configuration.

Define verification rules in .claude/policies/verification-policy.json:

{
  "thresholds": {
    "composite_minimum": 6.0,
    "security_minimum": 7.0,
    "coverage_minimum": 70
  },
  "blocking_rules": [
    {"dimension": "security", "below": 5.0, "action": "block"}
  ]
}
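A sketch of how these thresholds and blocking rules might be evaluated against agent scores. The function and its return shape are assumptions; only the JSON field names come from the example above:

```python
import json

def evaluate_policy(policy: dict, scores: dict) -> list[str]:
    """Return blocker messages; an empty list means verification may proceed."""
    blockers = []
    t = policy.get("thresholds", {})
    if scores.get("composite", 10.0) < t.get("composite_minimum", 0.0):
        blockers.append("composite below minimum")
    for rule in policy.get("blocking_rules", []):
        dim = rule["dimension"]
        if scores.get(dim, 10.0) < rule["below"] and rule.get("action") == "block":
            blockers.append(f"{dim} score below {rule['below']}")
    return blockers

# The policy from .claude/policies/verification-policy.json above:
policy = json.loads("""{
  "thresholds": {"composite_minimum": 6.0, "security_minimum": 7.0, "coverage_minimum": 70},
  "blocking_rules": [{"dimension": "security", "below": 5.0, "action": "block"}]
}""")
```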

Report Format

Load details: Read("${CLAUDE_SKILL_DIR}/references/report-template.md") for full format. Summary:

# Feature Verification Report

**Composite Score: [N.N]/10** (Grade: [LETTER])

## Verdict
**[READY FOR MERGE | IMPROVEMENTS RECOMMENDED | BLOCKED]**

References

Load on demand with Read("${CLAUDE_SKILL_DIR}/references/<file>"):

| File | Content |
|---|---|
| verification-phases.md | 8-phase workflow, agent spawn definitions, Agent Teams mode |
| visual-capture.md | Phase 2.5 + 8.5: screenshot capture, AI vision, gallery generation, agentation loop |
| quality-model.md | Scoring dimensions and weights (8 unified) |
| grading-rubric.md | Per-agent scoring criteria |
| report-template.md | Full report format with visual evidence section |
| alternative-comparison.md | Approach comparison template |
| orchestration-mode.md | Agent Teams vs Task Tool |
| policy-as-code.md | Verification policy configuration |
| verification-checklist.md | Pre-flight checklist |

Rules

Load on demand with Read("${CLAUDE_SKILL_DIR}/rules/<file>"):

| File | Content |
|---|---|
| scoring-rubric.md | Composite scoring, grades, verdicts |
| evidence-collection.md | Evidence gathering and test patterns |

Verification Gate (Cross-Cutting)

Load Read("${CLAUDE_PLUGIN_ROOT}/skills/shared/rules/verification-gate.md") — the minimum 5-step gate that applies to ALL completion claims across all skills. This is non-negotiable: NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION EVIDENCE.

Anti-Sycophancy Protocol

Load Read("${CLAUDE_PLUGIN_ROOT}/skills/shared/rules/anti-sycophancy.md") — all verification agents report findings directly without performative agreement. "Should be fine" is not evidence. "Tests pass (exit 0, 47/47)" is.

Agent Status Protocol

All verification agents MUST report using the standardized protocol: Read("${CLAUDE_PLUGIN_ROOT}/agents/shared/status-protocol.md"). Never report DONE if concerns exist. Never silently produce work you're unsure about.


Agent Coordination

SendMessage (Cross-Agent Findings)

When a security agent finds a critical issue, share it with other verification agents:

SendMessage(to="test-generator", message="Security: SQL injection in user_service.py:88 — add parameterized query test")
SendMessage(to="code-quality-reviewer", message="Security finding at user_service.py:88 — flag in review")

Skill Chain

After verification, chain to commit if all gates pass:

TaskCreate(subject="Commit verified changes", activeForm="Committing", addBlockedBy=[verify_task_id])
# Then: /ork:commit
Related skills:

  • ork:implement - Full implementation with verification
  • ork:review-pr - PR-specific verification
  • testing-unit / testing-integration / testing-e2e - Test execution patterns
  • ork:quality-gates - Quality gate patterns
  • browser-tools - Browser automation for visual capture

Version: 4.2.0 (March 2026) — Added progressive output for incremental agent scores


Rules (2)

Evidence Collection Patterns — HIGH

Evidence Collection Patterns

Phase 1: Context Gathering

Run these commands in parallel in ONE message:

git diff main --stat
git log main..HEAD --oneline
git diff main --name-only | sort -u

Incorrect:

# Sequential — wastes time, no coverage data
cd backend && pytest tests/
cd frontend && npm test

Correct:

# Parallel with coverage — run both in ONE message
cd backend && poetry run pytest tests/ -v --cov=app --cov-report=json
cd frontend && npm run test -- --coverage

Phase 3: Parallel Test Execution

Run backend and frontend tests in parallel:

# PARALLEL - Backend and frontend
cd backend && poetry run pytest tests/ -v --cov=app --cov-report=json
cd frontend && npm run test -- --coverage

Phase 7: Metrics Tracking

Store verification metrics in memory for trend analysis:

mcp__memory__create_entities(entities=[{
  "name": "verification-{date}-{feature}",
  "entityType": "VerificationMetrics",
  "observations": [f"composite_score: {score}", ...]
}])

Query trends: mcp__memory__search_nodes(query="VerificationMetrics")

Phase 2.5: Visual Evidence Collection

Run in parallel with Phase 2 agents. Auto-detects frontend framework and captures screenshots.

Incorrect:

# Manual screenshots with no structure
open http://localhost:3000
# Take manual screenshot...

Correct:

# Automated visual capture with AI evaluation
Agent(
  subagent_type="general-purpose",
  prompt="Visual capture: detect framework, start server, screenshot routes via agent-browser, evaluate with Claude vision, generate gallery.html",
  run_in_background=True
)

Output structure:

verification-output/{timestamp}/
├── screenshots/          (PNGs per route, base64 in gallery)
├── ai-evaluations/       (JSON per screenshot with score + issues)
├── annotations/          (before/after if agentation used)
│   ├── before/
│   └── after/
└── gallery.html          (self-contained, open in browser)

Phase 8.5: Post-Verification Feedback

After report compilation, store verification scores in the memory graph for KPI baseline tracking:

Query trends: mcp__memory__search_nodes(query="VerificationScores")

Scoring Rubric — HIGH

Scoring Rubric

Composite Score

Each agent produces a 0-10 score with decimals for nuance. The composite score is a weighted sum using the weights from Quality Model.
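Sketched below is that weighted sum, using the adjusted weights from the Quality Model reference (the weights are from this document; the function itself is illustrative):

```python
# Adjusted weights from the Quality Model (Visual active); they sum to 1.00.
WEIGHTS = {
    "correctness": 0.14, "maintainability": 0.14, "performance": 0.11,
    "security": 0.18, "scalability": 0.09, "testability": 0.12,
    "compliance": 0.12, "visual": 0.10,
}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension 0-10 scores."""
    return round(sum(WEIGHTS[d] * scores[d] for d in WEIGHTS), 2)
```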

Grade Thresholds

<!-- Canonical source: ../references/quality-model.md — keep in sync -->

| Grade | Score Range | Verdict |
|---|---|---|
| A+ | 9.0-10.0 | EXCELLENT |
| A | 8.0-8.9 | READY FOR MERGE |
| B | 7.0-7.9 | READY FOR MERGE |
| C | 6.0-6.9 | IMPROVEMENTS RECOMMENDED |
| D | 5.0-5.9 | IMPROVEMENTS RECOMMENDED |
| F | 0.0-4.9 | BLOCKED |
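The threshold bands above translate directly into a lookup (an illustrative sketch; the function name is not part of the skill):

```python
def grade(composite: float) -> tuple[str, str]:
    """Map a composite 0-10 score to (grade, verdict) per the thresholds above."""
    bands = [
        (9.0, "A+", "EXCELLENT"),
        (8.0, "A",  "READY FOR MERGE"),
        (7.0, "B",  "READY FOR MERGE"),
        (6.0, "C",  "IMPROVEMENTS RECOMMENDED"),
        (5.0, "D",  "IMPROVEMENTS RECOMMENDED"),
    ]
    for floor, letter, verdict in bands:
        if composite >= floor:
            return letter, verdict
    return "F", "BLOCKED"
```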

Key Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Scoring scale | 0-10 with decimals | Nuanced, not binary |
| Improvement priority | Impact / Effort ratio | Do high-value first |
| Alternative comparison | Optional phase | Only when multiple valid approaches |
| Metrics persistence | Memory MCP | Track trends over time |

Incorrect:

Security: "looks fine"  → 8/10    # No evidence, subjective
Performance: "fast enough" → 7/10  # No benchmarks

Correct:

Security: "11/11 injection tests pass, 13 deny patterns, 0 CVEs" → 9/10
Performance: "p99 latency 142ms (budget: 300ms), 0 N+1 queries" → 8.5/10

Improvement Suggestions

Each suggestion includes effort (1-5) and impact (1-5) with priority = impact/effort. See Quality Model for scale definitions and quick wins formula.
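The priority ordering can be sketched as a one-line sort (illustrative; the suggestion dict shape is an assumption):

```python
def prioritize(suggestions: list[dict]) -> list[dict]:
    """Sort by priority = impact / effort, highest first (both on 1-5 scales)."""
    return sorted(suggestions, key=lambda s: s["impact"] / s["effort"], reverse=True)
```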

Blocking Rules

Verification can be blocked by policy-as-code rules. See Policy-as-Code for configuration of composite minimums, dimension minimums, and blocking rules.


References (9)

Alternative Comparison

Alternative Comparison

Evaluate current implementation against alternative approaches.

When to Compare

  • Multiple valid architectures exist
  • User asks "is this the best way?"
  • Major patterns were chosen (ORM vs raw SQL, REST vs GraphQL)
  • Performance/scalability concerns raised

Comparison Criteria

For Each Alternative

| Criterion | Weight | Description |
|---|---|---|
| Effort | 30% | Implementation complexity (1-5 scale) |
| Risk | 25% | Technical and operational risk (1-5 scale) |
| Benefit | 45% | Value delivered, performance, maintainability (1-5 scale) |

Migration Cost

| Factor | Estimate |
|---|---|
| Code changes | Files/lines affected |
| Data migration | Schema changes, backfill |
| Testing | New test coverage needed |
| Rollback risk | Reversibility |

Decision Matrix Format

| Approach | Effort | Risk | Benefit | Score |
|---|---|---|---|---|
| Current | N | N | N | (E\*0.3 + R\*0.25 + B\*0.45) |
| Alt A | N | N | N | calculated |
| Alt B | N | N | N | calculated |

Note: Higher effort and risk are bad (invert for scoring), higher benefit is good.

Recommendation Formula:

Score = (5 - Effort) * 0.3 + (5 - Risk) * 0.25 + Benefit * 0.45
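The formula, as runnable code (the inputs are the 1-5 ratings from the criteria table; the function name is illustrative):

```python
def alternative_score(effort: int, risk: int, benefit: int) -> float:
    """Score = (5 - Effort)*0.3 + (5 - Risk)*0.25 + Benefit*0.45.
    Effort and risk are inverted because higher values are worse."""
    return round((5 - effort) * 0.3 + (5 - risk) * 0.25 + benefit * 0.45, 2)
```

For example, a low-effort, low-risk, high-benefit alternative (2, 2, 4) scores 3.45, beating a high-effort option (4, 3, 5) at 3.05.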

Output Template

### Alternative Comparison: [Topic]

**Current Approach:** [description]
- Score: N/10
- Pros: [strengths]
- Cons: [weaknesses]

**Alternative A:** [description]
- Score: N/10
- Pros: [strengths]
- Cons: [weaknesses]
- Migration effort: [1-5]

**Recommendation:** [Keep current / Switch to Alt A]
**Justification:** [1-2 sentences]

Grading Rubric

Verification Grading Rubric

0-10 scoring criteria for each verification dimension.

Score Levels

| Range | Level | Description |
|---|---|---|
| 0-3 | Poor | Critical issues, blocks merge |
| 4-6 | Adequate | Functional but needs improvement |
| 7-9 | Good | Ready for merge, minor suggestions |
| 10 | Excellent | Exemplary, reference quality |

Dimension Rubrics

<!-- Weights from canonical source: ../references/quality-model.md — keep in sync -->

Correctness (Weight: 14%)

| Score | Criteria |
|---|---|
| 10 | All functional requirements met, edge cases handled, zero regressions |
| 8-9 | Core requirements met, most edge cases handled |
| 6-7 | Core paths work, some edge cases missing |
| 4-5 | Partial functionality, notable gaps |
| 1-3 | Broken core paths |
| 0 | Does not run |

Maintainability (Weight: 14%)

| Score | Criteria |
|---|---|
| 10 | Zero lint errors/warnings, strict types, exemplary patterns, low complexity |
| 8-9 | Zero errors, < 5 warnings, minimal any, good patterns |
| 6-7 | 1-3 errors, some warnings, acceptable patterns |
| 4-5 | 4-10 errors, pattern issues, needs refactoring |
| 1-3 | Many errors, poor patterns, high complexity |
| 0 | Lint/type check fails to run |

Performance (Weight: 11%)

| Score | Criteria |
|---|---|
| 10 | p99 within budget, zero N+1, optimal caching, efficient resource usage |
| 8-9 | Good latency, no N+1, reasonable caching |
| 6-7 | Acceptable latency, minor inefficiencies |
| 4-5 | Notable bottlenecks, missing caching |
| 1-3 | Severe bottlenecks, resource leaks |
| 0 | Unresponsive or crashes under load |

Security (Weight: 18%)

| Score | Criteria |
|---|---|
| 10 | No vulnerabilities, all OWASP compliant, secure by design |
| 8-9 | No critical/high, all OWASP, excellent practices |
| 6-7 | No critical, 1-2 high, most OWASP compliant |
| 4-5 | No critical, 3-5 high, some gaps |
| 1-3 | 1+ critical or many high vulnerabilities |
| 0 | Multiple critical, secrets exposed |

Scalability (Weight: 9%)

| Score | Criteria |
|---|---|
| 10 | Horizontal scaling ready, stateless design, efficient data patterns |
| 8-9 | Good scaling patterns, minor bottlenecks |
| 6-7 | Scales for current needs, some concerns |
| 4-5 | Will hit limits soon, needs rework |
| 1-3 | Single-instance only, monolithic state |
| 0 | Cannot handle production load |

Testability (Weight: 12%)

| Score | Criteria |
|---|---|
| 10 | >= 90% coverage, meaningful assertions, edge cases, no flaky tests |
| 8-9 | >= 80% coverage, good assertions, critical paths |
| 6-7 | >= 70% coverage (target), basic assertions |
| 4-5 | 50-69% coverage |
| 1-3 | 30-49% coverage |
| 0 | < 30% coverage or tests fail to run |

Compliance (Weight: 12%)

| Score | Criteria |
|---|---|
| 10 | Perfect REST/UI contracts, RFC 9457 errors, full Zod, WCAG AA |
| 8-9 | Good conventions, proper validation, accessibility |
| 6-7 | Acceptable patterns, minor inconsistencies |
| 4-5 | Several convention violations |
| 1-3 | Poor API/UI design, missing validation |
| 0 | Broken contracts or inaccessible |

Visual (Weight: 10%)

| Score | Criteria |
|---|---|
| 10 | Pixel-perfect layout, full a11y, complete content, responsive |
| 8-9 | Good layout, minor visual issues, WCAG AA |
| 6-7 | Acceptable layout, some a11y gaps |
| 4-5 | Layout issues, missing content, a11y problems |
| 1-3 | Broken layout, major content missing |
| 0 | Page fails to render |

Note: Visual weight is 0.00 for API-only projects — redistributed proportionally. See Quality Model.
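A sketch of that proportional redistribution, using the adjusted weights listed in this document (the function is illustrative, not the Quality Model's actual code):

```python
# Adjusted weights (Visual active) from the dimension rubrics above.
WEIGHTS = {
    "correctness": 0.14, "maintainability": 0.14, "performance": 0.11,
    "security": 0.18, "scalability": 0.09, "testability": 0.12,
    "compliance": 0.12, "visual": 0.10,
}

def redistribute_without_visual(weights: dict[str, float]) -> dict[str, float]:
    """Drop Visual and scale the remaining weights proportionally back to 1.0."""
    remaining = {d: w for d, w in weights.items() if d != "visual"}
    total = sum(remaining.values())  # 0.90 with the weights above
    return {d: round(w / total, 4) for d, w in remaining.items()}
```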


Grade Interpretation

<!-- Canonical source: quality-model.md — keep in sync -->

| Composite | Grade | Verdict |
|---|---|---|
| 9.0-10.0 | A+ | EXCELLENT |
| 8.0-8.9 | A | READY FOR MERGE |
| 7.0-7.9 | B | READY FOR MERGE |
| 6.0-6.9 | C | IMPROVEMENTS RECOMMENDED |
| 5.0-5.9 | D | IMPROVEMENTS RECOMMENDED |
| 0.0-4.9 | F | BLOCKED |

Orchestration Mode

<!-- SHARED: keep in sync with ../../../assess/references/orchestration-mode.md -->

Orchestration Mode Selection

Shared logic for choosing between Agent Teams and Task tool orchestration in assess/verify skills.

Environment Check

# Agent Teams is GA since CC 2.1.33
import os
force_task_tool = os.environ.get("ORCHESTKIT_FORCE_TASK_TOOL") == "1"

if force_task_tool:
    mode = "task_tool"
else:
    # Teams available by default — use for full multi-dimensional work
    mode = "agent_teams" if scope == "full" else "task_tool"

Decision Rules

  1. Full assessment/verification scope --> Agent Teams mode (GA since CC 2.1.33)
  2. Quick/single-dimension scope --> Task tool mode
  3. ORCHESTKIT_FORCE_TASK_TOOL=1 --> Task tool (override)

Agent Teams vs Task Tool

| Aspect | Task Tool (Star) | Agent Teams (Mesh) |
|---|---|---|
| Topology | All agents report to lead | Agents communicate with each other |
| Finding correlation | Lead cross-references after completion | Agents share findings in real-time |
| Cross-domain overlap | Independent scoring | Agents alert each other about overlapping concerns |
| Cost | ~200K tokens | ~500K tokens |
| Best for | Focused/single-dimension work | Full multi-dimensional assessment/verification |

Fallback

If Agent Teams encounters issues mid-execution, fall back to Task tool for remaining work. This is safe because both modes produce the same output format (dimensional scores 0-10).

Context Window Note

For full codebase work (>20 files), use the 1M context window to avoid agent context exhaustion. On 200K context, scope discovery should limit files to prevent overflow.

Policy As Code

Policy-as-Code

Define verification policies as machine-readable configuration.

Policy Structure

version: "1.0"
name: policy-name
description: What this policy enforces

thresholds:
  composite_minimum: 6.0
  coverage_minimum: 70

rules:
  blockers: []    # Fail verification
  warnings: []    # Note but continue
  info: []        # Informational only

Rule Definition

Blocker Rules (Must Pass)

blockers:
  - dimension: security
    condition: below
    value: 5.0
    message: "Security score below minimum"

  - check: critical_vulnerabilities
    condition: above
    value: 0
    message: "Critical vulnerabilities found"

  - check: type_errors
    condition: above
    value: 0
    message: "TypeScript errors must be zero"

Warning Rules (Should Fix)

warnings:
  - dimension: code_quality
    condition: below
    value: 7.0
    message: "Code quality could be improved"

  - check: test_coverage
    condition: below
    value: 80
    message: "Coverage below recommended 80%"

Info Rules (Awareness)

info:
  - check: todo_count
    condition: above
    value: 5
    message: "Multiple TODOs found in code"

Threshold Configuration

| Threshold | Type | Description |
|---|---|---|
| composite_minimum | float | Overall score minimum (0-10) |
| coverage_minimum | int | Test coverage percentage |
| critical_vulnerabilities | int | Max critical vulns (0) |
| high_vulnerabilities | int | Max high vulns |
| lint_errors | int | Max lint errors (0) |
| type_errors | int | Max type errors (0) |

Custom Rules

custom_rules:
  - name: no_console_log
    pattern: "console\\.log"
    file_glob: "**/*.ts"
    exclude: ["**/*.test.ts"]
    severity: warning
    message: "Remove console.log from production"
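A minimal sketch of how such a custom rule could be applied, assuming glob-style path matching and regex content matching as the rule fields suggest (note that Python's fnmatch treats `**` loosely, so this is illustrative, not a full glob engine):

```python
import fnmatch
import re

def rule_matches(rule: dict, path: str, content: str) -> bool:
    """True when the file path matches the rule's glob (and no exclude glob)
    and the content contains the rule's pattern."""
    if not fnmatch.fnmatch(path, rule["file_glob"]):
        return False
    if any(fnmatch.fnmatch(path, g) for g in rule.get("exclude", [])):
        return False
    return re.search(rule["pattern"], content) is not None

# The no_console_log rule from the YAML above:
rule = {
    "name": "no_console_log", "pattern": r"console\.log",
    "file_glob": "**/*.ts", "exclude": ["**/*.test.ts"],
}
```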

Policy Location

Store at: .claude/policies/verification-policy.yaml

Multiple policies: .claude/policies/{name}-policy.yaml

Quality Model

Quality Model (verify)

Extends the unified scoring framework with Visual as the 8th dimension.

Canonical source: quality-gates/references/unified-scoring-framework.md. Load: Read("${CLAUDE_PLUGIN_ROOT}/skills/quality-gates/references/unified-scoring-framework.md")

verify-Specific Extensions

Visual Dimension (8th)

| Dimension | Weight | What It Measures |
|---|---|---|
| Visual | 0.10 | Layout correctness, a11y, content completeness, responsiveness |

When Visual is active, base dimensions scale: adjusted = base_weight * (1.0 / 1.10). When Visual is skipped (API-only), base weights stay at 1.00.

Dimensions Used (with Visual)

| Dimension | Adjusted Weight |
|---|---|
| Correctness | 0.14 |
| Maintainability | 0.14 |
| Performance | 0.11 |
| Security | 0.18 |
| Scalability | 0.09 |
| Testability | 0.12 |
| Compliance | 0.12 |
| Visual | 0.10 |

See unified framework for grade thresholds, improvement prioritization, effort/impact scales, and blocking rules.

Report Template

Verification Report Template

Copy this template and fill in results from parallel agent verification.

Quick Copy Template

# Feature Verification Report

**Date**: [TODAY'S DATE]
**Branch**: [branch-name]
**Feature**: [feature description]
**Reviewer**: Claude Code with [N] parallel subagents
**Verification Duration**: [X minutes]

---

## Summary

**Status**: [READY FOR MERGE | NEEDS ATTENTION | BLOCKED]

[1-2 sentence summary of verification results]

---

## Agent Results

### 1. Code Quality (code-quality-reviewer)

| Check | Tool | Exit Code | Errors | Warnings | Status |
|-------|------|-----------|--------|----------|--------|
| Backend Lint | Ruff | 0/1 | N | N | PASS/FAIL |
| Backend Types | ty | 0/1 | N | N | PASS/FAIL |
| Frontend Lint | Biome | 0/1 | N | N | PASS/FAIL |
| Frontend Types | tsc | 0/1 | N | N | PASS/FAIL |

**Pattern Compliance:**
- [ ] No `console.log` in production code
- [ ] No `any` types in TypeScript
- [ ] Exhaustive switches with `assertNever`
- [ ] SOLID principles followed
- [ ] Cyclomatic complexity < 10

**Findings:**
- [List any pattern violations]

---

### 2. Security Audit (security-auditor)

| Check | Tool | Critical | High | Medium | Low | Status |
|-------|------|----------|------|--------|-----|--------|
| JS Dependencies | npm audit | N | N | N | N | PASS/BLOCK |
| Python Dependencies | pip-audit | N | N | N | N | PASS/BLOCK |
| Secrets Scan | grep/gitleaks | N/A | N/A | N/A | N | PASS/BLOCK |

**OWASP Top 10 Compliance:**
- [ ] A01: Broken Access Control
- [ ] A02: Cryptographic Failures
- [ ] A03: Injection
- [ ] A04: Insecure Design
- [ ] A05: Security Misconfiguration
- [ ] A06: Vulnerable Components
- [ ] A07: Auth Failures
- [ ] A08: Data Integrity Failures
- [ ] A09: Logging Failures
- [ ] A10: SSRF

**Findings:**
- [List any security issues]

---

### 3. Test Coverage (test-generator)

| Suite | Total | Passed | Failed | Skipped | Coverage | Target | Status |
|-------|-------|--------|--------|---------|----------|--------|--------|
| Backend Unit | N | N | N | N | X% | 70% | PASS/FAIL |
| Backend Integration | N | N | N | N | X% | 70% | PASS/FAIL |
| Frontend Unit | N | N | N | N | X% | 70% | PASS/FAIL |
| E2E | N | N | N | N | N/A | N/A | PASS/FAIL |

**Test Quality:**
- [ ] Meaningful assertions (not just `assert result`)
- [ ] Edge cases covered (empty, error, timeout)
- [ ] No flaky tests (no sleep, no timing deps)
- [ ] MSW used for API mocking (not jest.mock)

**Coverage Gaps:**
- [List uncovered critical paths]

---

### 4. API Compliance (backend-system-architect)

| Check | Compliant | Issues |
|-------|-----------|--------|
| REST Conventions | Yes/No | [details] |
| Pydantic v2 Validation | Yes/No | [details] |
| RFC 9457 Error Handling | Yes/No | [details] |
| Async Timeout Protection | Yes/No | [details] |
| No N+1 Queries | Yes/No | [details] |

**Findings:**
- [List any API compliance issues]

---

### 5. UI Compliance (frontend-ui-developer)

| Check | Compliant | Issues |
|-------|-----------|--------|
| React 19 APIs (useOptimistic, useFormStatus, use()) | Yes/No | [details] |
| Zod Validation on API Responses | Yes/No | [details] |
| Exhaustive Type Checking | Yes/No | [details] |
| Skeleton Loading States | Yes/No | [details] |
| Prefetching on Navigation | Yes/No | [details] |
| WCAG 2.1 AA Accessibility | Yes/No | [details] |

**Findings:**
- [List any UI compliance issues]

---

## Quality Gates Summary

| Gate | Required | Actual | Status |
|------|----------|--------|--------|
| Test Coverage | >= 70% | X% | PASS/FAIL |
| Security Critical | 0 | N | PASS/FAIL |
| Security High | <= 5 | N | PASS/FAIL |
| Type Errors | 0 | N | PASS/FAIL |
| Lint Errors | 0 | N | PASS/FAIL |

**Overall Gate Status**: [ALL PASS | SOME FAIL]

---

## Blockers (Must Fix Before Merge)

1. [Blocker description with file:line reference]
2. [Blocker description with file:line reference]

---

## Suggestions (Non-Blocking)

1. [Suggestion for improvement]
2. [Suggestion for improvement]

---

## Visual Verification

**Visual Score: [N.N]/10**

| Route | Screenshot | AI Score | Issues | Status |
|-------|-----------|----------|--------|--------|
| / | [thumbnail] | N.N/10 | N | PASS/WARN/FAIL |
| /dashboard | [thumbnail] | N.N/10 | N | PASS/WARN/FAIL |
| /settings | [thumbnail] | N.N/10 | N | PASS/WARN/FAIL |

**Gallery**: Open `verification-output/{timestamp}/gallery.html` for full screenshots with AI evaluations.

### Agentation Annotations (if applicable)

| Annotation | Route | Resolution | Before/After |
|-----------|-------|------------|--------------|
| [user comment] | /dashboard | [fix summary] | [see gallery] |

---

## Evidence Artifacts

| Artifact | Location | Generated |
|----------|----------|-----------|
| Test Results | `/tmp/test_results.log` | [timestamp] |
| Coverage Report | `/tmp/coverage.json` | [timestamp] |
| Security Scan | `/tmp/security_audit.json` | [timestamp] |
| Lint Report | `/tmp/lint_results.log` | [timestamp] |
| Visual Gallery | `verification-output/{timestamp}/gallery.html` | [timestamp] |
| Screenshots | `verification-output/{timestamp}/screenshots/` | [timestamp] |
| AI Evaluations | `verification-output/{timestamp}/ai-evaluations/` | [timestamp] |

---

## Verification Metadata

- **Agents Used**: 7 (code-quality-reviewer, security-auditor, test-generator, backend-system-architect, frontend-ui-developer, python-performance-engineer, visual-capture)
- **Parallel Execution**: Yes
- **Total Tool Calls**: ~N
- **Context Usage**: ~N tokens

Status Definitions

| Status | Emoji | Meaning | Action Required |
|--------|-------|---------|-----------------|
| READY FOR MERGE | Green | All checks pass, no blockers | Approve PR |
| NEEDS ATTENTION | Yellow | Minor issues found | Review suggestions, optionally fix |
| BLOCKED | Red | Critical issues found | Must fix before merge |

Severity Levels

| Level | Threshold | Action | Blocks Merge |
|-------|-----------|--------|--------------|
| Critical | Any | Fix immediately | YES |
| High | > 5 | Fix before merge | YES |
| Medium | > 20 | Should fix | NO (with justification) |
| Low | > 50 | Nice to have | NO |
| Info | N/A | Informational | NO |

Agent Output JSON Schemas

code-quality-reviewer Output

{
  "linting": {"tool": "ruff|biome", "exit_code": 0, "errors": 0, "warnings": 0},
  "type_check": {"tool": "ty|tsc", "exit_code": 0, "errors": 0},
  "patterns": {"violations": [], "compliance": "PASS|FAIL"},
  "approval": {"status": "APPROVED|NEEDS_FIXES", "blockers": []}
}

security-auditor Output

{
  "scan_summary": {"files_scanned": 100, "vulnerabilities_found": 0},
  "critical": [],
  "high": [],
  "secrets_detected": [],
  "recommendations": [],
  "approval": {"status": "PASS|BLOCK", "blockers": []}
}

test-generator Output

{
  "coverage": {"current": 85, "target": 70, "passed": true},
  "test_summary": {"total": 100, "passed": 98, "failed": 2, "skipped": 0},
  "gaps": ["file:line - reason"],
  "quality_issues": [],
  "approval": {"status": "PASS|FAIL", "blockers": []}
}

backend-system-architect Output

{
  "api_compliance": {"rest_conventions": true, "issues": []},
  "validation": {"pydantic_v2": true, "issues": []},
  "error_handling": {"rfc9457": true, "issues": []},
  "async_safety": {"timeouts": true, "issues": []},
  "approval": {"status": "PASS|FAIL", "blockers": []}
}

frontend-ui-developer Output

{
  "react_19": {"apis_used": ["useOptimistic"], "missing": [], "compliant": true},
  "zod_validation": {"validated_endpoints": 10, "unvalidated": []},
  "type_safety": {"exhaustive_switches": true, "any_types": 0},
  "ux_patterns": {"skeletons": true, "prefetching": true},
  "accessibility": {"wcag_issues": []},
  "approval": {"status": "PASS|FAIL", "blockers": []}
}
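Downstream, the lead folds the per-agent `approval` blocks into a single verdict. A minimal sketch assuming the schemas above (`merge_approvals` is a hypothetical helper name, not an OrchestKit API):

```python
def merge_approvals(agent_outputs: dict) -> dict:
    """Combine each agent's approval block into one verdict.
    agent_outputs maps agent name -> parsed JSON matching the schemas above."""
    blockers = []
    for agent, output in agent_outputs.items():
        approval = output.get("approval", {})
        if approval.get("status") not in ("APPROVED", "PASS"):
            found = approval.get("blockers") or ["unspecified blocker"]
            blockers.extend(f"{agent}: {b}" for b in found)
    return {"status": "BLOCKED" if blockers else "READY FOR MERGE",
            "blockers": blockers}
```
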

Verification Checklist


Pre-flight checklist for comprehensive feature verification with parallel agents.

Pre-Verification Setup

Context Gathering

  • Run git diff main --stat to understand change scope
  • Run git log main..HEAD --oneline to see commit history
  • Identify affected domains (backend/frontend/both)
  • Check for any existing failing tests

Task Creation (CC 2.1.16)

  • Create parent verification task
  • Create subtasks for each agent domain
  • Set proper dependencies if needed

Agent Dispatch Checklist

Required Agents (Full-Stack)

| Agent | Launched | Completed | Status |
|-------|----------|-----------|--------|
| code-quality-reviewer | [ ] | [ ] | Pending |
| security-auditor | [ ] | [ ] | Pending |
| test-generator | [ ] | [ ] | Pending |
| backend-system-architect | [ ] | [ ] | Pending |
| frontend-ui-developer | [ ] | [ ] | Pending |

Optional Agents (Add as Needed)

| Condition | Agent | Launched |
|-----------|-------|----------|
| AI/ML features | llm-integrator | [ ] |
| Performance-critical | frontend-performance-engineer | [ ] |
| Database changes | database-engineer | [ ] |

Quality Gate Checklist

Mandatory Gates

| Gate | Threshold | Actual | Pass |
|------|-----------|--------|------|
| Test Coverage | >= 70% | ___% | [ ] |
| Security Critical | 0 | ___ | [ ] |
| Security High | <= 5 | ___ | [ ] |
| Type Errors | 0 | ___ | [ ] |
| Lint Errors | 0 | ___ | [ ] |

Code Quality Gates

| Check | Status |
|-------|--------|
| No console.log in production | [ ] |
| No any types | [ ] |
| Exhaustive switches (assertNever) | [ ] |
| Proper error handling | [ ] |
| No hardcoded secrets | [ ] |

Frontend-Specific Gates (if applicable)

| Check | Status |
|-------|--------|
| React 19 APIs used | [ ] |
| Zod validation on API responses | [ ] |
| Skeleton loading states | [ ] |
| Prefetching on links | [ ] |
| WCAG 2.1 AA compliance | [ ] |

Backend-Specific Gates (if applicable)

| Check | Status |
|-------|--------|
| REST conventions followed | [ ] |
| Pydantic v2 validation | [ ] |
| RFC 9457 error handling | [ ] |
| Async timeout protection | [ ] |
| No N+1 queries | [ ] |

Evidence Collection

Required Evidence

  • Test results with exit code
  • Coverage report (JSON format)
  • Linting results
  • Type checking results
  • Security scan results

Optional Evidence

  • E2E test screenshots
  • Performance benchmarks
  • Bundle size analysis
  • Accessibility audit

Report Generation

Report Sections

  • Summary (READY/NEEDS ATTENTION/BLOCKED)
  • Agent Results (all 5 domains)
  • Quality Gates table
  • Blockers list (if any)
  • Suggestions list
  • Evidence links

Final Steps

  • Update all task statuses to completed
  • Store verification evidence in context
  • Generate final report markdown

Quick Reference: Agent Prompts

code-quality-reviewer

Focus: Lint, type check, anti-patterns, SOLID, complexity

security-auditor

Focus: Dependency audit, secrets, OWASP Top 10, rate limiting

test-generator

Focus: Coverage gaps, test quality, edge cases, flaky tests

backend-system-architect

Focus: REST, Pydantic v2, RFC 9457, async safety, N+1

frontend-ui-developer

Focus: React 19, Zod, exhaustive types, skeletons, prefetch, a11y

Troubleshooting

Agent Not Responding

  1. Check if agent was launched with run_in_background=True
  2. Verify agent name matches exactly
  3. Check for context window limits

Tests Failing

  1. Run tests locally first
  2. Check for missing dependencies
  3. Verify test database state
  4. Look for timing-dependent tests

Coverage Below Threshold

  1. Identify uncovered files
  2. Check for excluded patterns
  3. Focus on critical paths first

Verification Phases

Verification Phases — Detailed Workflow

Phase Overview

| Phase | Activities | Output |
|-------|------------|--------|
| 1. Context Gathering | Git diff, commit history | Changes summary |
| 2. Parallel Agent Dispatch | 6 agents evaluate | 0-10 scores |
| 2.5 Visual Capture | Screenshot routes, AI vision eval | Gallery + visual score |
| 3. Test Execution | Backend + frontend tests | Coverage data |
| 4. Nuanced Grading | Composite score calculation | Grade (A-F) |
| 5. Improvement Suggestions | Effort vs impact analysis | Prioritized list |
| 6. Alternative Comparison | Compare approaches (optional) | Recommendation |
| 7. Metrics Tracking | Trend analysis | Historical data |
| 8. Report Compilation | Evidence artifacts + gallery.html | Final report |
| 8.5 Agentation Loop | User annotates, ui-feedback fixes | Before/after diffs |

Phase 2: Parallel Agent Dispatch (6 Agents)

Launch ALL agents in ONE message with run_in_background=True and max_turns=25. Pass model=MODEL_OVERRIDE when user specifies --model=opus (CC 2.1.72).

| Agent | Focus | Output |
|-------|-------|--------|
| code-quality-reviewer | Lint, types, patterns | Quality 0-10 |
| security-auditor | OWASP, secrets, CVEs | Security 0-10 |
| test-generator | Coverage, test quality | Coverage 0-10 |
| backend-system-architect | API design, async | API 0-10 |
| frontend-ui-developer | React 19, Zod, a11y | UI 0-10 |
| python-performance-engineer | Latency, resources, scaling | Performance 0-10 |

Use python-performance-engineer for backend-focused verification or frontend-performance-engineer for frontend-focused verification. See Quality Model for Performance (0.11) and Scalability (0.09) weights.

See Grading Rubric for detailed scoring criteria.

Task Tool Mode (Default)

# PARALLEL — All 6 in ONE message
Agent(
  subagent_type="code-quality-reviewer",
  model=MODEL_OVERRIDE,  # None inherits default; "opus" for thorough verification (CC 2.1.72)
  prompt="""# Cache-optimized: stable content first (CC 2.1.72)
  Verify code quality. Score 0-10.
  Check: lint errors, type coverage, cyclomatic complexity, DRY, SOLID.
  Budget: 15 tool calls max.
  Return: score (0-10), reasoning, evidence, 2-3 improvement suggestions.
  Feature: {feature}. Scope: ONLY review files in {scope_files}.""",
  run_in_background=True, max_turns=25
)
Agent(
  subagent_type="security-auditor",
  model=MODEL_OVERRIDE,
  prompt="""# Cache-optimized: stable content first (CC 2.1.72)
  Security verification. Score 0-10.
  Check: OWASP Top 10, secrets in code, dependency CVEs, auth patterns.
  Budget: 15 tool calls max.
  Return: score (0-10), vulnerabilities found, severity ratings.
  Feature: {feature}. Scope: ONLY review files in {scope_files}.""",
  run_in_background=True, max_turns=25
)
Agent(
  subagent_type="test-generator",
  model=MODEL_OVERRIDE,
  prompt="""# Cache-optimized: stable content first (CC 2.1.72)
  Verify test coverage. Score 0-10.
  Check: test existence, type matching, quality, edge cases, coverage %.
  Run existing tests and report results.
  Budget: 15 tool calls max.
  Return: score (0-10), coverage %, gaps identified.
  Feature: {feature}. Scope: ONLY review files in {scope_files}.""",
  run_in_background=True, max_turns=25
)
Agent(
  subagent_type="backend-system-architect",
  model=MODEL_OVERRIDE,
  prompt="""# Cache-optimized: stable content first (CC 2.1.72)
  Verify API design and backend patterns. Score 0-10.
  Check: REST conventions, async patterns, transaction boundaries, error handling.
  Budget: 15 tool calls max.
  Return: score (0-10), pattern compliance, issues found.
  Feature: {feature}. Scope: ONLY review files in {scope_files}.""",
  run_in_background=True, max_turns=25
)
Agent(
  subagent_type="frontend-ui-developer",
  model=MODEL_OVERRIDE,
  prompt="""# Cache-optimized: stable content first (CC 2.1.72)
  Verify frontend implementation. Score 0-10.
  Check: React 19 patterns, Zod schemas, accessibility (WCAG 2.1 AA), loading states.
  Budget: 15 tool calls max.
  Return: score (0-10), pattern compliance, a11y issues.
  Feature: {feature}. Scope: ONLY review files in {scope_files}.""",
  run_in_background=True, max_turns=25
)
Agent(
  subagent_type="python-performance-engineer",
  model=MODEL_OVERRIDE,
  prompt="""# Cache-optimized: stable content first (CC 2.1.72)
  Verify performance and scalability. Score 0-10.
  Check: latency hotspots, N+1 queries, resource usage, caching, scaling patterns.
  Budget: 15 tool calls max.
  Return: score (0-10), bottlenecks found, optimization suggestions.
  Feature: {feature}. Scope: ONLY review files in {scope_files}.""",
  run_in_background=True, max_turns=25
)

Agent Teams Alternative

In Agent Teams mode, form a verification team where agents share findings and coordinate scoring:

TeamCreate(team_name="verify-{feature}", description="Verify {feature}")

Agent(subagent_type="code-quality-reviewer", name="quality-verifier",
     team_name="verify-{feature}", model=MODEL_OVERRIDE,
     prompt="""# Cache-optimized: stable content first (CC 2.1.72)
     Verify code quality. Score 0-10.
     When you find patterns that affect security, message security-verifier.
     When you find untested code paths, message test-verifier.
     Share your quality score with all teammates for composite calculation.
     Feature: {feature}.""")

Agent(subagent_type="security-auditor", name="security-verifier",
     team_name="verify-{feature}", model=MODEL_OVERRIDE,
     prompt="""# Cache-optimized: stable content first (CC 2.1.72)
     Security verification. Score 0-10.
     When quality-verifier flags security-relevant patterns, investigate deeper.
     When you find vulnerabilities in API endpoints, message api-verifier.
     Share severity findings with test-verifier for test gap analysis.
     Feature: {feature}.""")

Agent(subagent_type="test-generator", name="test-verifier",
     team_name="verify-{feature}", model=MODEL_OVERRIDE,
     prompt="""# Cache-optimized: stable content first (CC 2.1.72)
     Verify test coverage. Score 0-10.
     When quality-verifier or security-verifier flag untested paths, quantify the gap.
     Run existing tests and report coverage metrics.
     Message the lead with coverage data for composite scoring.
     Feature: {feature}.""")

Agent(subagent_type="backend-system-architect", name="api-verifier",
     team_name="verify-{feature}", model=MODEL_OVERRIDE,
     prompt="""# Cache-optimized: stable content first (CC 2.1.72)
     Verify API design and backend patterns. Score 0-10.
     When security-verifier flags endpoint issues, validate and score.
     Share API compliance findings with ui-verifier for consistency check.
     Feature: {feature}.""")

Agent(subagent_type="frontend-ui-developer", name="ui-verifier",
     team_name="verify-{feature}", model=MODEL_OVERRIDE,
     prompt="""# Cache-optimized: stable content first (CC 2.1.72)
     Verify frontend implementation. Score 0-10.
     When api-verifier shares API patterns, verify frontend matches.
     Check React 19 patterns, accessibility, and loading states.
     Share findings with quality-verifier for overall assessment.
     Feature: {feature}.""")

# Conditional 6th agent — use python-performance-engineer for backend,
# frontend-performance-engineer for frontend
Agent(subagent_type="python-performance-engineer", name="perf-verifier",
     team_name="verify-{feature}", model=MODEL_OVERRIDE,
     prompt="""# Cache-optimized: stable content first (CC 2.1.72)
     Verify performance and scalability. Score 0-10.
     Assess latency, resource usage, caching, and scaling patterns.
     When security-verifier flags resource-intensive endpoints, profile them.
     Share performance findings with api-verifier and quality-verifier.
     Feature: {feature}.""")

Team teardown after report compilation:

# After composite grading and report generation
SendMessage(type="shutdown_request", recipient="quality-verifier", content="Verification complete")
SendMessage(type="shutdown_request", recipient="security-verifier", content="Verification complete")
SendMessage(type="shutdown_request", recipient="test-verifier", content="Verification complete")
SendMessage(type="shutdown_request", recipient="api-verifier", content="Verification complete")
SendMessage(type="shutdown_request", recipient="ui-verifier", content="Verification complete")
SendMessage(type="shutdown_request", recipient="perf-verifier", content="Verification complete")
TeamDelete()

# Worktree cleanup (CC 2.1.72)
ExitWorktree(action="keep")

Fallback: If team formation fails, use standard Phase 2 Task spawns above.

Manual cleanup: If TeamDelete() doesn't terminate all agents, press Ctrl+F twice to force-kill remaining background agents.


Phase 2.5: Visual Capture (Parallel with Phase 2)

Runs as a 7th parallel agent alongside the 6 verification agents. See Visual Capture for full details.

# Launch IN THE SAME MESSAGE as Phase 2 agents
Agent(
  subagent_type="general-purpose",
  description="Visual capture and AI evaluation",
  prompt="""Visual verification capture for: {feature}
  1. Detect project type from package.json
  2. Start dev server (auto-detect framework)
  3. Discover routes (framework-aware scan)
  4. Use agent-browser to screenshot each route (max 20)
  5. Read each screenshot PNG for AI vision evaluation
  6. Score layout, accessibility, content completeness (0-10 per route)
  7. Read gallery template from ${CLAUDE_SKILL_DIR}/assets/gallery-template.html
  8. Generate gallery.html with base64-embedded screenshots
  9. Write to verification-output/{timestamp}/gallery.html
  10. Kill dev server

  If no frontend detected, write skip notice and exit.
  If server fails to start, write warning and exit.
  Never block — graceful degradation only.""",
  run_in_background=True, max_turns=30
)

Output: verification-output/{timestamp}/ folder with screenshots, AI evaluations (JSON), and gallery.html.


Phase 8.5: Agentation Visual Feedback (Opt-In)

Trigger: Only when agentation MCP is configured in .mcp.json. Runs AFTER Phase 8 report compilation.

# Check agentation availability
ToolSearch(query="select:mcp__agentation__agentation_get_all_pending")

# If available, offer user choice
AskUserQuestion(questions=[{
  "question": "Agentation detected. Annotate the live UI before finalizing?",
  "header": "Visual Feedback Loop",
  "options": [
    {"label": "Yes", "description": "I'll mark issues, ui-feedback agent fixes them, gallery updates with before/after"},
    {"label": "Skip", "description": "Finalize with current screenshots"}
  ]
}])

# If yes: watch → acknowledge → dispatch ui-feedback → re-screenshot → update gallery
# Max 3 rounds (configurable in verification-config.yaml)

Phase 4: Nuanced Grading

See Quality Model for scoring dimensions, weights, and grade interpretation. See Grading Rubric for detailed per-agent scoring criteria.
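As a sketch, the composite calculation using the Grading Rubric's dimension weights (Correctness 14%, Maintainability 14%, Performance 11%, Security 18%, Scalability 9%, Testability 12%, Compliance 12%, Visual 10%). The letter-grade cutoffs below are illustrative only; the Quality Model defines the authoritative A+ to F bands:

```python
WEIGHTS = {
    "correctness": 0.14, "maintainability": 0.14, "performance": 0.11,
    "security": 0.18, "scalability": 0.09, "testability": 0.12,
    "compliance": 0.12, "visual": 0.10,
}  # sums to 1.00

def composite_score(scores: dict) -> float:
    """Weighted average of per-dimension 0-10 scores."""
    return round(sum(WEIGHTS[d] * scores[d] for d in WEIGHTS), 1)

def grade(score: float) -> str:
    # Illustrative cutoffs; see Quality Model for the real grade bands.
    for cutoff, letter in ((9.0, "A"), (8.0, "B"), (7.0, "C"), (6.0, "D")):
        if score >= cutoff:
            return letter
    return "F"
```
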


Phase 5: Improvement Suggestions

Each suggestion includes effort (1-5) and impact (1-5) with priority = impact/effort. See Quality Model for scale definitions and quick wins formula.
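A minimal sketch of that ranking (the `text`/`impact`/`effort` field names are assumed for illustration):

```python
def prioritize(suggestions: list) -> list:
    """Rank suggestions by priority = impact / effort (both on a 1-5 scale).
    Quick wins (low effort, high impact) naturally sort to the top."""
    for s in suggestions:
        s["priority"] = round(s["impact"] / s["effort"], 1)
    return sorted(suggestions, key=lambda s: s["priority"], reverse=True)
```
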


Phase 6: Alternative Comparison (Optional)

See Alternative Comparison for template.

Use when:

  • Multiple valid approaches exist
  • User asked "is this the best way?"
  • Major architectural decisions made

Phase 8: Report Compilation

See Report Template for full format.

# Feature Verification Report

**Composite Score: [N.N]/10** (Grade: [LETTER])

## Top Improvement Suggestions
| # | Suggestion | Effort | Impact | Priority |
|---|------------|--------|--------|----------|
| 1 | [highest] | [N] | [N] | [N.N] |

## Verdict
**[READY FOR MERGE | IMPROVEMENTS RECOMMENDED | BLOCKED]**

Visual Capture

Visual Capture — Phase 2.5

Visual verification that produces browsable screenshot evidence with AI evaluation.

Architecture

Phase 2 agents (parallel)
         |
    Phase 2.5 (runs IN PARALLEL with Phase 2 agents)
         |
         v
┌─────────────────────────────────────────────────┐
│  1. Detect project type (package.json scan)      │
│  2. Start dev server (framework-aware)           │
│  3. Wait for server ready (poll localhost)        │
│  4. Discover routes (framework-aware)            │
│  5. agent-browser: navigate + screenshot each    │
│  6. Claude vision: evaluate each screenshot      │
│  7. Generate gallery.html (self-contained)       │
│  8. Stop dev server                              │
└─────────────────────────────────────────────────┘

Step 1: Project Type Detection

Scan codebase to determine framework and dev server command:

# PARALLEL — detect framework signals
Grep(pattern="\"next\":", glob="package.json", output_mode="content")
Grep(pattern="\"vite\":", glob="package.json", output_mode="content")
Grep(pattern="\"react-scripts\":", glob="package.json", output_mode="content")
Grep(pattern="\"vue\":", glob="package.json", output_mode="content")
Grep(pattern="\"nuxt\":", glob="package.json", output_mode="content")
Grep(pattern="\"@angular/core\":", glob="package.json", output_mode="content")
Glob(pattern="**/manage.py")
Glob(pattern="**/main.py")
Glob(pattern="**/app.py")
Glob(pattern="**/index.html")

Detection Matrix

| Signal | Framework | Start Command | Default Port |
|--------|-----------|---------------|--------------|
| "next": in package.json | Next.js | npm run dev | 3000 |
| "vite": in package.json | Vite | npm run dev | 5173 |
| "react-scripts": | CRA | npm start | 3000 |
| "vue": + no vite | Vue CLI | npm run serve | 8080 |
| "nuxt": | Nuxt | npm run dev | 3000 |
| "@angular/core": | Angular | npx ng serve | 4200 |
| manage.py exists | Django | python manage.py runserver | 8000 |
| main.py/app.py + FastAPI | FastAPI | uvicorn app:app | 8000 |
| index.html only | Static | npx serve . | 3000 |
| None of the above | Skip visual capture | N/A | N/A |
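The package.json half of this matrix can be sketched as follows; signal order matters so that Nuxt and Vite win over a bare "vue" dependency, and the helper name is illustrative:

```python
import json
from pathlib import Path

# Ordered subset of the detection matrix; first matching signal wins.
PKG_SIGNALS = [
    ("next", ("Next.js", "npm run dev", 3000)),
    ("vite", ("Vite", "npm run dev", 5173)),
    ("react-scripts", ("CRA", "npm start", 3000)),
    ("nuxt", ("Nuxt", "npm run dev", 3000)),
    ("@angular/core", ("Angular", "npx ng serve", 4200)),
    ("vue", ("Vue CLI", "npm run serve", 8080)),  # only when vite/nuxt absent
]

def detect_framework(project_dir: str):
    """Return (framework, start_command, port), or None to skip visual capture."""
    pkg = Path(project_dir) / "package.json"
    if pkg.exists():
        data = json.loads(pkg.read_text())
        deps = {**data.get("dependencies", {}), **data.get("devDependencies", {})}
        for signal, result in PKG_SIGNALS:
            if signal in deps:
                return result
    if (Path(project_dir) / "manage.py").exists():
        return ("Django", "python manage.py runserver", 8000)
    return None
```
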

Override via Config

If .claude/verification-config.yaml exists with a visual section, use those settings instead of auto-detection.

Step 2: Start Dev Server

Bash(
  command=f"{start_command} &",
  description="Start dev server for visual capture",
  run_in_background=True
)

Wait for server readiness:

Bash(command=f"for i in $(seq 1 30); do curl -s http://localhost:{port} > /dev/null && exit 0; sleep 1; done; exit 1",
     description="Wait for dev server to be ready (max 30s)")

If server fails to start: Skip visual capture with a warning in the report. Do NOT block verification.

Step 3: Route Discovery

Next.js App Router

Glob(pattern="**/app/**/page.{tsx,jsx,ts,js}")
# Extract route from file path: app/dashboard/page.tsx → /dashboard

Next.js Pages Router

Glob(pattern="**/pages/**/*.{tsx,jsx,ts,js}")
# Exclude _app, _document, _error, api/
# Extract route: pages/about.tsx → /about
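Both extractions can be sketched with one helper (the function name is illustrative; the exclusion rules follow the comments above):

```python
import re

def path_to_route(file_path: str):
    """Derive a URL route from a Next.js source path (App or Pages router)."""
    # App Router: app/dashboard/page.tsx -> /dashboard, app/page.tsx -> /
    m = re.search(r"app/(.*?)/?page\.(tsx|jsx|ts|js)$", file_path)
    if m:
        return "/" + m.group(1).rstrip("/") if m.group(1) else "/"
    # Pages Router: pages/about.tsx -> /about; skip _app, _document, _error, api/
    m = re.search(r"pages/(.+)\.(tsx|jsx|ts|js)$", file_path)
    if m:
        name = m.group(1)
        if name.startswith("_") or name.startswith("api/"):
            return None
        return "/" if name == "index" else "/" + name
    return None
```
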

React Router

Grep(pattern="<Route.*path=[\"']([^\"']+)", glob="**/*.{tsx,jsx}", output_mode="content")

FastAPI / Express

Grep(pattern="@(app|router)\\.(get|post)\\([\"'](/[^\"']*)", glob="**/*.py", output_mode="content")
Grep(pattern="(app|router)\\.(get|post)\\([\"'](/[^\"']*)", glob="**/*.{ts,js}", output_mode="content")

Fallback

If no routes discovered, screenshot just the root URL: http://localhost:{port}/

Max Routes

Cap at 20 routes to keep gallery manageable and generation fast. Prioritize:

  1. Root /
  2. Routes matching changed files (from Phase 1 git diff)
  3. Routes with most sub-routes (likely important sections)
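The cap-and-prioritize step above can be sketched as follows; the changed-file match is a crude substring heuristic, offered as an assumption rather than the skill's actual logic:

```python
def select_routes(routes: list, changed_files: list, cap: int = 20) -> list:
    """Cap discovered routes at `cap`: root first, then routes whose path
    appears in the Phase 1 git diff, then sections with most sub-routes."""
    def rank(route: str):
        # Crude heuristic: route segment appears somewhere in a changed path.
        touched = any(route.strip("/") and route.strip("/") in f
                      for f in changed_files)
        children = sum(1 for r in routes
                       if r != route and r.startswith(route.rstrip("/") + "/"))
        return (route != "/", not touched, -children)  # False sorts first
    return sorted(routes, key=rank)[:cap]
```
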

Step 4: Screenshot Capture

Use agent-browser to navigate and screenshot each route:

# For each route:
# 1. Navigate
agent-browser navigate http://localhost:{port}{route_path}
# 2. Wait for content
agent-browser wait-for-network-idle
# 3. Capture
agent-browser screenshot --full-page --path verification-output/{timestamp}/screenshots/{idx}-{slug}.png

Auth-Protected Routes

If verification-config.yaml specifies auth:

# Login first
agent-browser navigate http://localhost:{port}/login
agent-browser fill "#email" "test@example.com"
agent-browser fill "#password" "test123"
agent-browser click "button[type=submit]"
agent-browser wait-for-navigation
# Then screenshot protected routes

Viewport Options

Default: 1280x720. If mobile: true in config, also capture at 375x812.

Step 5: AI Vision Evaluation

For each screenshot, use Claude's vision (Read tool on PNG) with a structured evaluation prompt:

Read(file_path=f"verification-output/{timestamp}/screenshots/{idx}-{slug}.png")

Then evaluate using this prompt template (include it in the visual capture agent's instructions):

Evaluate this screenshot of route "{route_path}" against these 6 criteria.
For EACH criterion, provide a severity (ok/warning/error) and specific observation.
Do NOT use generic "looks good" — cite what you actually see.

1. LAYOUT: Overflow, alignment, spacing, responsive grid. Check: content cut off? Overlapping elements? Scroll needed?
2. NAVIGATION: Is nav present and functional? Sidebar, breadcrumbs, TOC visible? Active state correct?
3. CONTENT: Text readable? Headings hierarchical? Data populated (not placeholder/loading)? Counts/numbers accurate?
4. ACCESSIBILITY: Contrast sufficient? Focus indicators visible? Text size adequate? Color-only information?
5. INTERACTIVITY: Buttons/links styled consistently? Hover/focus states? Forms labeled? CTAs discoverable?
6. BRANDING: Consistent with site theme? Dark/light mode correct? Typography matches design system?

Output as JSON array — exactly 6 items, one per criterion:
[{"severity": "ok|warning|error", "message": "CRITERION: specific observation with evidence"}]
Score 0-10 based on: no errors = 9+, 1-2 warnings = 7-8, one error = 5-6, multiple errors = <5.

Per-route evaluation output (6+ items, never a single line):

{
  "route": "/dashboard",
  "score": 7.5,
  "evaluation": [
    {"severity": "ok", "message": "LAYOUT: Content within viewport, no horizontal overflow, grid columns align properly"},
    {"severity": "ok", "message": "NAVIGATION: Sidebar present with 8 sections, 'Dashboard' correctly highlighted as active"},
    {"severity": "warning", "message": "CONTENT: Stats show '79 skills' but should be '89 skills' — stale count detected"},
    {"severity": "ok", "message": "ACCESSIBILITY: Body text ~16px on dark bg (#e6edf3 on #0d1117), contrast ratio ~13:1, passes WCAG AAA"},
    {"severity": "warning", "message": "INTERACTIVITY: Code block copy buttons present but no visible hover state change"},
    {"severity": "ok", "message": "BRANDING: Dark theme consistent, green accent (#3fb950) used for active states, monospace for code"}
  ]
}
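The scoring rule from the prompt template can be sketched numerically; the exact sub-values within each band are illustrative assumptions:

```python
def score_route(evaluation: list) -> float:
    """Map the 6 severity items to a 0-10 score per the rubric:
    no issues -> 9+, 1-2 warnings -> 7-8, one error -> 5-6, multiple -> <5."""
    errors = sum(1 for e in evaluation if e["severity"] == "error")
    warnings = sum(1 for e in evaluation if e["severity"] == "warning")
    if errors >= 2:
        return max(0.0, 4.5 - (errors - 2))
    if errors == 1:
        return 5.5
    if warnings >= 3:
        return 6.5
    if warnings:
        return 8.0 - 0.5 * (warnings - 1)  # 1 warning -> 8.0, 2 -> 7.5
    return 9.5
```
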

Cross-Route Summary

After evaluating all routes, synthesize a summary object for the gallery:

# Build summary from all per-route evaluations
summary = {
  "total_routes": len(routes),
  "avg_score": round(sum(r.score for r in routes) / len(routes), 1),
  "pass_count": len([r for r in routes if r.score >= 7]),
  "warn_count": len([r for r in routes if 5 <= r.score < 7]),
  "fail_count": len([r for r in routes if r.score < 5]),
  "common_issues": [  # Issues appearing on 2+ routes
    {"count": 3, "message": "Stale skill count (79 instead of 89) on 3/5 pages"},
    {"count": 2, "message": "Code block copy buttons lack hover state feedback"}
  ],
  "strengths": [  # Positive patterns across routes
    "Consistent dark theme and typography across all pages",
    "Sidebar navigation present and correctly highlights active page"
  ]
}

Include this summary in GALLERY_JSON alongside routes.

Step 6: Gallery Generation

Read the gallery template:

Read(file_path="${CLAUDE_SKILL_DIR}/assets/gallery-template.html")

Build the GALLERY_JSON data structure:

{
  "branch": "feat/new-feature",
  "date": "2026-03-10",
  "timestamp": "2026-03-10T14:30:00Z",
  "compositeScore": 8.2,
  "visualScore": 7.8,
  "routes": [
    {
      "id": "homepage",
      "name": "Homepage",
      "path": "/",
      "screenshot": "data:image/png;base64,...",
      "score": 8.5,
      "evaluation": [
        {"severity": "ok", "message": "Layout consistent"},
        {"severity": "warning", "message": "Hero image loading slowly"}
      ],
      "annotations": [],
      "apiResponse": null
    }
  ]
}

Base64 encoding: Convert each PNG to base64 for self-contained HTML:

base64 < screenshots/01-homepage.png  # stdin form works on both GNU and BSD/macOS base64

Size guard: If total HTML > 10MB, compress screenshots (downscale or reduce PNG quality) or reduce to the top 10 routes.
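The embed-plus-size-guard step can be sketched as follows; the per-route `file`/`score` keys and the helper name are illustrative:

```python
import base64
from pathlib import Path

MAX_HTML_BYTES = 10 * 1024 * 1024  # 10MB size guard

def embed_screenshots(routes: list, shot_dir: str) -> list:
    """Inline each route's PNG as a data URI; if the payload exceeds the
    size guard, keep only the 10 highest-scoring routes."""
    total = 0
    for r in routes:
        png = Path(shot_dir, r["file"]).read_bytes()
        r["screenshot"] = ("data:image/png;base64,"
                           + base64.b64encode(png).decode("ascii"))
        total += len(r["screenshot"])
    if total > MAX_HTML_BYTES:
        routes = sorted(routes, key=lambda r: r["score"], reverse=True)[:10]
    return routes
```
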

Write the final gallery:

Write(file_path=f"verification-output/{timestamp}/gallery.html", content=rendered_html)

Step 7: Cleanup

# Kill dev server
Bash(command="kill $(lsof -ti :PORT) 2>/dev/null || true", description="Stop dev server")

Phase 8.5: Agentation Loop (Opt-In)

Trigger: Only when agentation MCP is configured in .mcp.json.

# Check if agentation is available
ToolSearch(query="select:mcp__agentation__agentation_get_all_pending")

If available, offer the user:

AskUserQuestion(questions=[{
  "question": "Agentation is configured. Want to annotate the UI before finalizing?",
  "header": "Visual Feedback Loop",
  "options": [
    {"label": "Yes, let me annotate", "description": "I'll mark issues on the live UI, then ui-feedback agent fixes them"},
    {"label": "Skip", "description": "Finalize gallery with current screenshots"}
  ]
}])

If yes:

# 1. Watch for annotations
mcp__agentation__agentation_get_all_pending()

# 2. For each annotation:
mcp__agentation__agentation_acknowledge(annotationId=id)

# 3. Dispatch ui-feedback agent
Agent(subagent_type="ork:ui-feedback",
  prompt="Process agentation annotation: {annotation}. Fix the issue, then resolve.",
  run_in_background=True)

# 4. After fixes, re-screenshot affected routes
# 5. Save before/after pairs
# 6. Update gallery with annotation diffs

Max Rounds

Default 3 rounds of annotate-fix-verify. Configurable in verification-config.yaml.

Graceful Degradation

| Failure | Behavior |
|---------|----------|
| No frontend detected | Skip visual capture, log info in report |
| Dev server won't start | Skip visual capture with warning |
| agent-browser unavailable | Skip screenshots, try curl for API-only |
| Screenshot fails on a route | Skip that route, continue with others |
| Base64 output too large | Compress or reduce route count |
| Agentation not configured | Skip Layer 2 entirely (no prompt) |
| Auth flow fails | Skip protected routes, screenshot public only |

Checklists (1)

Verification Checklist


Quick checklist for comprehensive feature verification.

Grading Complete

  • All 5 dimensions rated (0-10 scale)
  • Weights applied correctly (20/25/20/20/15)
  • Composite score calculated
  • Grade letter assigned (A+ to F)

Evidence Collected

  • Test results with exit codes
  • Coverage report (JSON)
  • Security scan results
  • Lint/type check output
  • Evidence files linked in report

Improvements Documented

  • Each suggestion has effort estimate (1-5)
  • Each suggestion has impact estimate (1-5)
  • Priority calculated (Impact / Effort)
  • Quick wins identified (low effort, high impact)

Alternatives Considered

  • Current approach scored
  • At least one alternative evaluated
  • Migration cost estimated
  • Recommendation documented

Policy Compliance

  • No blocking rule violations
  • Warning rules acknowledged
  • Thresholds checked (composite, security, coverage)

Report Generated

  • All sections filled
  • Verdict assigned (Ready/Recommended/Blocked)
  • Tasks updated to completed