OrchestKit v6.7.1 — 67 skills, 38 agents, 77 hooks with Opus 4.6 support

Assess

Assesses and rates quality 0-10 with pros/cons analysis. Use when evaluating code, designs, or approaches.

Command · high

Assess

Comprehensive assessment skill for answering "is this good?" with structured evaluation, scoring, and actionable recommendations.

Quick Start

/ork:assess backend/app/services/auth.py
/ork:assess our caching strategy
/ork:assess the current database schema
/ork:assess frontend/src/components/Dashboard

STEP 0: Verify User Intent with AskUserQuestion

BEFORE creating tasks, clarify assessment dimensions:

AskUserQuestion(
  questions=[{
    "question": "What dimensions to assess?",
    "header": "Dimensions",
    "options": [
      {"label": "Full assessment (Recommended)", "description": "All dimensions: quality, maintainability, security, performance"},
      {"label": "Code quality only", "description": "Readability, complexity, best practices"},
      {"label": "Security focus", "description": "Vulnerabilities, attack surface, compliance"},
      {"label": "Quick score", "description": "Just give me a 0-10 score with brief notes"}
    ],
    "multiSelect": false
  }]
)

Based on answer, adjust workflow:

  • Full assessment: All 7 phases, parallel agents
  • Code quality only: Skip security and performance phases
  • Security focus: Prioritize security-auditor agent
  • Quick score: Single pass, brief output

STEP 0b: Select Orchestration Mode

See Orchestration Mode for env var check logic, Agent Teams vs Task Tool comparison, and mode selection rules.


Task Management (CC 2.1.16)

TaskCreate(
  subject="Assess: {target}",
  description="Comprehensive evaluation with quality scores and recommendations",
  activeForm="Assessing {target}"
)

What This Skill Answers

| Question | How It's Answered |
|----------|-------------------|
| "Is this good?" | Quality score 0-10 with reasoning |
| "What are the trade-offs?" | Structured pros/cons list |
| "Should we change this?" | Improvement suggestions with effort |
| "What are the alternatives?" | Comparison with scores |
| "Where should we focus?" | Prioritized recommendations |

Workflow Overview

| Phase | Activities | Output |
|-------|------------|--------|
| 1. Target Understanding | Read code/design, identify scope | Context summary |
| 1.5. Scope Discovery | Build bounded file list | Scoped file list |
| 2. Quality Rating | 7-dimension scoring (0-10) | Scores with reasoning |
| 3. Pros/Cons Analysis | Strengths and weaknesses | Balanced evaluation |
| 4. Alternative Comparison | Score alternatives | Comparison matrix |
| 5. Improvement Suggestions | Actionable recommendations | Prioritized list |
| 6. Effort Estimation | Time and complexity estimates | Effort breakdown |
| 7. Assessment Report | Compile findings | Final report |

Phase 1: Target Understanding

Identify what's being assessed and gather context:

# PARALLEL - Gather context
Read(file_path="$ARGUMENTS")  # If file path
Grep(pattern="$ARGUMENTS", output_mode="files_with_matches")
mcp__memory__search_nodes(query="$ARGUMENTS")  # Past decisions

Phase 1.5: Scope Discovery

See Scope Discovery for the full file discovery, limit application (MAX 30 files), and sampling priority logic. Always include the scoped file list in every agent prompt.


Phase 2: Quality Rating (7 Dimensions)

Rate each dimension 0-10 with weighted composite score. See Quality Model for dimensions, weights, and grade interpretation. See Scoring Rubric for per-dimension criteria.

See Agent Spawn Definitions for Task Tool mode spawn patterns and Agent Teams alternative.

Composite Score: Weighted average of all 7 dimensions (see quality-model.md).


Phases 3-7: Analysis, Comparison & Report

See Phase Templates for output templates for pros/cons, alternatives, improvements, effort, and the final report.

See also: Alternative Analysis | Improvement Prioritization


Grade Interpretation

See Quality Model for scoring dimensions, weights, and grade interpretation.


Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| 7 dimensions | Comprehensive coverage | All quality aspects without overwhelming |
| 0-10 scale | Industry standard | Easy to understand and compare |
| Parallel assessment | 4 agents (7 dimensions) | Fast, thorough evaluation |
| Effort/Impact scoring | 1-5 scale | Simple prioritization math |

Rules Quick Reference

| Rule | Impact | What It Covers |
|------|--------|----------------|
| complexity-metrics | HIGH | 7-criterion scoring (1-5), complexity levels, thresholds |
| complexity-breakdown | HIGH | Task decomposition strategies, risk assessment |

Related Skills

  • assess-complexity - Task complexity assessment
  • ork:verify - Post-implementation verification
  • ork:code-review-playbook - Code review patterns
  • ork:quality-gates - Quality gate patterns

Version: 1.1.0 (February 2026)


Rules (2)

Decompose complex tasks into isolated subtasks to reduce failure risk and enable parallelism — HIGH

Task Decomposition Strategies

When a task scores Level 4-5 (Complex/Very Complex), decompose it into subtasks that each score Level 1-3.

Decomposition Approach

  1. Identify independent axes — separate concerns that can be worked on independently
  2. Isolate unknowns — create a spike/research task for each unknown
  3. Reduce cross-cutting scope — break into single-module changes
  4. Sequence by dependency — order subtasks so blocked items come last

Strategies by Complexity Driver

| High Criterion | Decomposition Strategy |
|----------------|------------------------|
| Lines of Code (4-5) | Split by component or layer |
| Files Affected (4-5) | Split by directory or module |
| Dependencies (4-5) | Isolate external integrations into adapter tasks |
| Unknowns (4-5) | Create spike tasks to resolve unknowns first |
| Cross-Cutting (4-5) | Split by layer (DB, API, UI) or by concern |
| Risk Level (4-5) | Add validation/testing tasks before implementation |

Codebase Analysis for Decomposition

# Gather metrics to inform breakdown
./scripts/analyze-codebase.sh "$TARGET"

# Key signals:
# - File count > 10: split by directory
# - Import count > 5: isolate dependency interfaces
# - Test coverage < 50%: add test-first subtask

Subtask Validation

Each subtask should score:

  • Average: 1.0-3.4 (Level 1-3) — manageable
  • Unknowns: <= 2 — no major research needed
  • Cross-cutting: <= 2 — limited to 2-3 modules

If any subtask still scores Level 4+, decompose it again.
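
These thresholds can be expressed as a small validation check. A minimal sketch, assuming subtask scores are kept as a dict over the 7 criteria (the function and key names are illustrative, not part of the skill API):

```python
def subtask_ok(scores: dict[str, int]) -> bool:
    """Check a subtask against the decomposition thresholds.

    `scores` maps each of the 7 criteria to a 1-5 rating; the
    'unknowns' and 'cross_cutting' keys additionally get their own caps.
    """
    avg = sum(scores.values()) / len(scores)
    return (
        avg <= 3.4                          # Level 1-3 on average
        and scores["unknowns"] <= 2         # no major research needed
        and scores["cross_cutting"] <= 2    # limited to 2-3 modules
    )

scores = {"loc": 2, "time": 2, "files": 3, "deps": 2,
          "unknowns": 2, "cross_cutting": 2, "risk": 3}
print(subtask_ok(scores))  # True: average ~2.3, both caps respected
```

A subtask failing any of the three checks goes back for another round of decomposition.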

Risk Assessment Integration

| Risk Factor | Mitigation Task |
|-------------|-----------------|
| No test coverage | Add regression tests first |
| Complex business logic | Write specification/invariant tests |
| External API dependency | Create mock/adapter layer first |
| Database migration | Test migration on staging first |
| Multiple team coordination | Define interface contracts first |

Key Rules

  • Level 4-5 tasks are never directly implemented — decompose first
  • Every subtask must score Level 3 or below individually
  • Resolve unknowns before starting dependent implementation tasks
  • Use TaskCreate with addBlockedBy to enforce subtask ordering
  • Each subtask should be independently verifiable with its own tests
  • Prefer vertical slices (end-to-end for one feature) over horizontal layers

Incorrect — Starting Level 5 task without decomposition:

Task: "Migrate authentication to OAuth2"
Complexity: 4.8 (Level 5)
Action: Start implementing directly
// High failure risk, scope creep likely

Correct — Breaking into Level 1-3 subtasks:

1. Research OAuth2 providers (Level 2, 1-2h)
2. Add OAuth library dependency (Level 1, 30m)
3. Implement OAuth callback handler (Level 2, 2-4h)
4. Migrate existing sessions (Level 3, 4-8h)
5. Add regression tests (Level 2, 2h)

Score task complexity with structured frameworks to prevent scope creep and estimate drift — HIGH

Complexity Scoring Frameworks

Score task complexity across 7 criteria (1-5 each) to determine if a task should proceed or be decomposed first.

The 7 Criteria

| Criterion | 1 (Low) | 3 (Medium) | 5 (High) |
|-----------|---------|------------|----------|
| Lines of Code | < 50 | 200-500 | 1500+ |
| Time Estimate | < 30 min | 2-8 hours | 24+ hours |
| Files Affected | 1 file | 4-10 files | 26+ files |
| Dependencies | 0 deps | 2-3 deps | 7+ deps |
| Unknowns | None | Several, researchable | Unclear scope |
| Cross-Cutting | Single module | 4-5 modules | System-wide |
| Risk Level | Trivial | Testable complexity | Mission-critical |

Complexity Levels

| Average Score | Level | Classification | Action |
|---------------|-------|----------------|--------|
| 1.0 - 1.4 | 1 | Trivial | Proceed immediately |
| 1.5 - 2.4 | 2 | Simple | Proceed |
| 2.5 - 3.4 | 3 | Moderate | Proceed with caution |
| 3.5 - 4.4 | 4 | Complex | Break down first |
| 4.5 - 5.0 | 5 | Very Complex | Decompose and reassess |
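
The scoring arithmetic above reduces to a few lines. A minimal sketch (function name illustrative):

```python
def complexity_level(scores: list[int]) -> tuple[float, int, str]:
    """Average the 7 criterion scores (1-5) and map to a complexity level."""
    assert len(scores) == 7, "score all 7 criteria, even for simple tasks"
    avg = sum(scores) / 7
    for upper, level, name in [(1.4, 1, "Trivial"), (2.4, 2, "Simple"),
                               (3.4, 3, "Moderate"), (4.4, 4, "Complex"),
                               (5.0, 5, "Very Complex")]:
        if avg <= upper:
            return round(avg, 1), level, name
    raise ValueError("criterion scores must be in the 1-5 range")

# The "real-time notifications" worked example later in this page:
print(complexity_level([4, 4, 3, 4, 3, 4, 3]))  # (3.6, 4, 'Complex')
```

Anything returning level 4 or 5 goes to decomposition before implementation.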

Output Format

## Complexity Assessment: [Target]

| Criterion | Score |
|-----------|-------|
| Lines of Code | X/5 |
| Time Estimate | X/5 |
| Files Affected | X/5 |
| Dependencies | X/5 |
| Unknowns | X/5 |
| Cross-Cutting | X/5 |
| Risk Level | X/5 |
| **Total** | **XX/35** |

**Average Score:** X.X
**Complexity Level:** X ([Classification])
**Can Proceed:** Yes/No

Key Rules

  • Score all 7 criteria even for seemingly simple tasks
  • Total of 35 points maximum, divide by 7 for average
  • Level 4-5 tasks must be decomposed before starting implementation
  • Unknowns (criterion 5) is the highest variance factor — resolve unknowns first
  • Cross-cutting (criterion 6) indicates coordination overhead — account for it in estimates

Incorrect — Skipping complexity assessment:

Task: "Add real-time notifications"
Action: Start coding
// No idea of scope, likely 3-5x over estimate

Correct — Scoring all 7 criteria first:

| Criterion | Score |
|-----------|-------|
| Lines of Code | 4/5 (800+ lines) |
| Time Estimate | 4/5 (16+ hours) |
| Files Affected | 3/5 (8 files) |
| Dependencies | 4/5 (WebSockets, Redis) |
| Unknowns | 3/5 (some research needed) |
| Cross-Cutting | 4/5 (DB, API, UI, workers) |
| Risk Level | 3/5 (testable) |
| **Average** | **3.6 (Level 4 - Complex)** |

Action: Decompose into subtasks before starting

References (9)

Agent Spawn Definitions

Agent Spawn Definitions

Dimension-to-agent mapping and spawn patterns for Phase 2.

Task Tool Mode (Default)

For each dimension, spawn a background agent with scope constraints:

for dimension, agent_type in [
    ("CORRECTNESS + MAINTAINABILITY", "code-quality-reviewer"),
    ("SECURITY", "security-auditor"),
    ("PERFORMANCE + SCALABILITY", "python-performance-engineer"),  # Use python-performance-engineer for backend; frontend-performance-engineer for frontend
    ("TESTABILITY", "test-generator"),
]:
    Task(subagent_type=agent_type, run_in_background=True, max_turns=25,
         prompt=f"""Assess {dimension} (0-10) for: {target}

## Scope Constraint
ONLY read and analyze the following {len(scope_files)} files -- do NOT explore beyond this list:
{file_list}

Budget: Use at most 15 tool calls. Read files from the list above, then produce your score
with reasoning, evidence, and 2-3 specific improvement suggestions.
Do NOT use Glob or Grep to discover additional files.""")

Then collect results from all agents and proceed to Phase 3.

Agent Teams Alternative

See agent-teams-mode.md for Agent Teams assessment workflow with cross-validation and team teardown.

Context Window Note

For full codebase assessments (>20 files), use the 1M context window to avoid agent context exhaustion. On 200K context, the scope discovery in scope-discovery.md limits files to prevent overflow.

Agent Teams Mode

Agent Teams Assessment Mode

In Agent Teams mode, form an assessment team where dimension assessors cross-validate scores and discuss disagreements:

TeamCreate(team_name="assess-{target-slug}", description="Assess {target}")

# SCOPE CONSTRAINT (injected into every agent prompt):
SCOPE_INSTRUCTIONS = f"""
## Scope Constraint
ONLY read and analyze the following {len(scope_files)} files — do NOT explore beyond this list:
{file_list}

Budget: Use at most 15 tool calls. Read files from the list above, then score.
Do NOT use Glob or Grep to discover additional files.
"""

Task(subagent_type="code-quality-reviewer", name="correctness-assessor",
     team_name="assess-{target-slug}", max_turns=25,
     prompt=f"""Assess CORRECTNESS (0-10) and MAINTAINABILITY (0-10) for: {target}
     {SCOPE_INSTRUCTIONS}
     When you find issues that affect security, message security-assessor.
     When you find issues that affect performance, message perf-assessor.
     Share your scores with all teammates for calibration — if scores diverge
     significantly (>2 points), discuss the disagreement.""")

Task(subagent_type="security-auditor", name="security-assessor",
     team_name="assess-{target-slug}", max_turns=25,
     prompt=f"""Assess SECURITY (0-10) for: {target}
     {SCOPE_INSTRUCTIONS}
     When correctness-assessor flags security-relevant patterns, investigate deeper.
     When you find performance-impacting security measures, message perf-assessor.
     Share your score and flag any cross-dimension trade-offs.""")

Task(subagent_type="python-performance-engineer", name="perf-assessor",  # or frontend-performance-engineer for frontend
     team_name="assess-{target-slug}", max_turns=25,
     prompt=f"""Assess PERFORMANCE (0-10) and SCALABILITY (0-10) for: {target}
     {SCOPE_INSTRUCTIONS}
     When security-assessor flags performance trade-offs, evaluate the impact.
     When you find testability issues (hard-to-benchmark code), message test-assessor.
     Share your scores with reasoning for the composite calculation.""")

Task(subagent_type="test-generator", name="test-assessor",
     team_name="assess-{target-slug}", max_turns=25,
     prompt=f"""Assess TESTABILITY (0-10) for: {target}
     {SCOPE_INSTRUCTIONS}
     Evaluate test coverage, test quality, and ease of testing.
     When other assessors flag dimension-specific concerns, verify test coverage
     for those areas. Share your score and any coverage gaps found.""")

Team teardown after report compilation:

SendMessage(type="shutdown_request", recipient="correctness-assessor", content="Assessment complete")
SendMessage(type="shutdown_request", recipient="security-assessor", content="Assessment complete")
SendMessage(type="shutdown_request", recipient="perf-assessor", content="Assessment complete")
SendMessage(type="shutdown_request", recipient="test-assessor", content="Assessment complete")
TeamDelete()

Fallback — Team Formation Failure: If team formation fails, use standard Phase 2 Task spawns.

Fallback — Context Exhaustion: If agents hit "Context limit reached" before returning scores, collect whatever partial results were produced, then score remaining dimensions yourself using the scoped file list from Phase 1.5. Do NOT re-spawn agents — assess the remaining dimensions inline and proceed to Phase 3.

Alternative Analysis

Alternative Analysis Reference

How to identify, evaluate, and compare alternatives to the current approach.

Identifying Alternatives

  1. Direct Substitutes: Different implementations of the same solution
  2. Architectural Alternatives: Different design patterns or approaches
  3. Technology Alternatives: Different libraries, frameworks, or tools
  4. Hybrid Approaches: Combinations of multiple alternatives

Comparison Dimensions

| Dimension | Question | Weight |
|-----------|----------|--------|
| Score | How does it rate on the quality dimensions? | 0.30 |
| Effort | How hard to implement/migrate? | 0.25 |
| Risk | What could go wrong? | 0.25 |
| Benefit | What's the expected improvement? | 0.20 |

Migration Effort Scale

| Level | Description | Time Estimate |
|-------|-------------|---------------|
| 1 | Drop-in replacement | < 1 hour |
| 2 | Minor refactoring | 1-4 hours |
| 3 | Moderate changes | 1-2 days |
| 4 | Significant rework | 3-5 days |
| 5 | Major rewrite | 1+ weeks |

Risk Categories

  • Technical: Will it work? Compatibility issues?
  • Team: Does team know this? Learning curve?
  • Timeline: Can we afford the migration time?
  • Dependencies: What else needs to change?

Decision Criteria

Switch if:

  • Score improvement >= 1.5 points AND effort <= 3
  • Current has critical security/correctness issues
  • Alternative has significantly lower maintenance burden

Stay if:

  • Score difference < 1.0 point
  • Migration effort >= 4 AND no critical issues
  • Team familiarity strongly favors current
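
The switch/stay rules can be sketched as a decision function, assuming a numeric score delta and a 1-5 migration-effort rating (the qualitative team-familiarity rule is omitted; the function name is illustrative):

```python
def migration_verdict(score_delta: float, effort: int,
                      critical_issues: bool) -> str:
    """Apply the switch/stay decision criteria; borderline cases defer."""
    if critical_issues:
        return "switch"                      # security/correctness blocker in current
    if score_delta >= 1.5 and effort <= 3:
        return "switch"                      # big win at affordable migration cost
    if score_delta < 1.0 or effort >= 4:
        return "stay"                        # not worth it, or too expensive
    return "defer"                           # borderline: revisit later

print(migration_verdict(2.0, 2, False))  # switch
print(migration_verdict(0.5, 1, False))  # stay
```

The "defer" branch covers the gap the prose leaves open (delta between 1.0 and 1.5 at moderate effort).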

Trade-off Documentation

## Alternative: [Name]

**Score Delta:** +/-[N.N] points
**Migration Effort:** [1-5]
**Risk Level:** Low/Medium/High

### Why Consider
- [Benefit 1]
- [Benefit 2]

### Why Not
- [Drawback 1]
- [Drawback 2]

### Verdict: [Adopt/Defer/Reject]

Improvement Prioritization

Improvement Prioritization Reference

Systematic approach to ranking improvements by value delivered per effort invested.

Impact/Effort Scoring

Impact Scale (1-5)

| Score | Label | Effect on Quality |
|-------|-------|-------------------|
| 5 | Critical | Fixes blocker, +2.0+ points |
| 4 | High | Major improvement, +1.0-2.0 |
| 3 | Medium | Notable improvement, +0.5-1.0 |
| 2 | Low | Minor improvement, +0.2-0.5 |
| 1 | Minimal | Cosmetic, +0.1-0.2 |

Effort Scale (1-5)

| Score | Label | Time Required |
|-------|-------|---------------|
| 1 | Trivial | < 15 minutes |
| 2 | Easy | 15-60 minutes |
| 3 | Medium | 1-4 hours |
| 4 | Hard | 4-8 hours |
| 5 | Very Hard | 1+ days |

Priority Formula

Priority = Impact / Effort

Higher priority = do first. At equal priority, prefer lower effort.
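
The formula plus tie-break can be expressed as a single sort key. A minimal sketch (item names are illustrative):

```python
def rank_improvements(items: list[dict]) -> list[dict]:
    """Sort by priority = impact / effort (descending), lower effort on ties."""
    return sorted(items, key=lambda it: (-it["impact"] / it["effort"], it["effort"]))

todo = [
    {"name": "add input validation", "impact": 4, "effort": 1},  # priority 4.0
    {"name": "refactor module",      "impact": 4, "effort": 4},  # priority 1.0
    {"name": "rename variables",     "impact": 1, "effort": 1},  # priority 1.0, cheaper
]
print([it["name"] for it in rank_improvements(todo)])
# ['add input validation', 'rename variables', 'refactor module']
```

Note the tie at priority 1.0 resolves to the lower-effort rename, per the rule above.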

Improvement Categories

| Category | Impact | Effort | Action |
|----------|--------|--------|--------|
| Quick Wins | High (4-5) | Low (1-2) | Do immediately |
| Strategic | High (4-5) | High (4-5) | Plan carefully |
| Fill-ins | Low (1-2) | Low (1-2) | Do when idle |
| Avoid | Low (1-2) | High (4-5) | Skip or defer |

Time Estimation Guidelines

  • Add buffer: Estimate x1.5 for unknowns
  • Include testing: Add 30% for test updates
  • Account for review: Add time for PR process
  • Consider dependencies: Chain effects on other work
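
The first two buffer rules compound multiplicatively; a minimal arithmetic sketch (review-process time is not quantified in these guidelines, so it is left out):

```python
def buffered_estimate(base_hours: float) -> float:
    """Apply the x1.5 unknowns buffer, then add 30% for test updates."""
    return base_hours * 1.5 * 1.3

# A nominal 4-hour task budgets to nearly double before review time.
print(round(buffered_estimate(4), 1))  # 7.8
```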

Sequencing Dependencies

  1. Blockers first: Changes that unblock other work
  2. Foundation changes: Structural changes before features
  3. Shared code: Common utilities before consumers
  4. Leaf nodes last: Isolated changes can wait

Quick Reference

Priority 5.0+  = Do NOW (high impact, trivial effort)
Priority 2.0+  = Do soon (good ROI)
Priority 1.0+  = Schedule it
Priority <1.0  = Backlog or skip

Orchestration Mode

<!-- SHARED: keep in sync with ../../../verify/references/orchestration-mode.md -->

Orchestration Mode Selection

Shared logic for choosing between Agent Teams and Task tool orchestration in assess/verify skills.

Environment Check

import os
teams_available = os.environ.get("CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS") is not None
force_task_tool = os.environ.get("ORCHESTKIT_FORCE_TASK_TOOL") == "1"

if force_task_tool or not teams_available:
    mode = "task_tool"
else:
    # Teams available — use for full multi-dimensional work
    mode = "agent_teams" if scope == "full" else "task_tool"

Decision Rules

  1. CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS set --> Agent Teams mode (for full assessment/verification)
  2. Flag not set --> Task tool mode (default)
  3. Quick/single-dimension scope --> Task tool (regardless of flag)
  4. ORCHESTKIT_FORCE_TASK_TOOL=1 --> Task tool (override)

Agent Teams vs Task Tool

| Aspect | Task Tool (Star) | Agent Teams (Mesh) |
|--------|------------------|--------------------|
| Topology | All agents report to lead | Agents communicate with each other |
| Finding correlation | Lead cross-references after completion | Agents share findings in real-time |
| Cross-domain overlap | Independent scoring | Agents alert each other about overlapping concerns |
| Cost | ~200K tokens | ~500K tokens |
| Best for | Focused/single-dimension work | Full multi-dimensional assessment/verification |

Fallback

If Agent Teams encounters issues mid-execution, fall back to Task tool for remaining work. This is safe because both modes produce the same output format (dimensional scores 0-10).

Context Window Note

For full codebase work (>20 files), use the 1M context window to avoid agent context exhaustion. On 200K context, scope discovery should limit files to prevent overflow.

Phase Templates

Phase Output Templates

Markdown templates for assessment phases 3-7.

Phase 3: Pros/Cons Analysis

## Pros (Strengths)
| # | Strength | Impact | Evidence |
|---|----------|--------|----------|
| 1 | [strength] | High/Med/Low | [example] |

## Cons (Weaknesses)
| # | Weakness | Severity | Evidence |
|---|----------|----------|----------|
| 1 | [weakness] | High/Med/Low | [example] |

**Net Assessment:** [Strengths outweigh / Balanced / Weaknesses dominate]
**Recommended action:** [Keep as-is / Improve / Reconsider / Rewrite]

Phase 4: Alternative Comparison

See alternative-analysis.md for full comparison template.

| Criteria | Current | Alternative A | Alternative B |
|----------|---------|---------------|---------------|
| Composite | [N.N] | [N.N] | [N.N] |
| Migration Effort | N/A | [1-5] | [1-5] |

Phase 5: Improvement Suggestions

See improvement-prioritization.md for effort/impact guidelines.

| Suggestion | Effort (1-5) | Impact (1-5) | Priority (I/E) |
|------------|--------------|--------------|----------------|
| [action] | [N] | [N] | [ratio] |

Quick Wins = Effort <= 2 AND Impact >= 4. Always highlight these first.

Phase 6: Effort Estimation

| Timeframe | Tasks | Total |
|-----------|-------|-------|
| Quick wins (< 1hr) | [list] | X min |
| Short-term (< 1 day) | [list] | X hrs |
| Medium-term (1-3 days) | [list] | X days |

Phase 7: Assessment Report

See scoring-rubric.md for full report template.

# Assessment Report: $ARGUMENTS

**Overall Score: [N.N]/10** (Grade: [A+/A/B/C/D/F])

**Verdict:** [EXCELLENT | GOOD | ADEQUATE | NEEDS WORK | CRITICAL]

## Answer: Is This Good?
**[YES / MOSTLY / SOMEWHAT / NO]**
[Reasoning]

Quality Model

<!-- SHARED: keep in sync with ../../../verify/references/quality-model.md -->

Quality Model

Canonical scoring reference for assess and verify skills. Defines unified dimensions, weights, grade thresholds, and improvement prioritization.

Scoring Dimensions (7 Unified)

| Dimension | Weight | What It Measures |
|-----------|--------|------------------|
| Correctness | 0.15 | Does it work correctly? Functional accuracy, edge cases handled |
| Maintainability | 0.15 | Easy to understand and modify? Readability, complexity, patterns |
| Performance | 0.12 | Efficient execution? No bottlenecks, resource usage, latency |
| Security | 0.20 | Follows security best practices? OWASP, secrets, CVEs, input validation |
| Scalability | 0.10 | Handles growth? Load patterns, data volume, horizontal scaling |
| Testability | 0.13 | Easy to test? Coverage, test quality, isolation, mocking |
| Compliance | 0.15 | Meets API and UI contracts? Conditional on scope (see below) |

Total: 1.00

Compliance Dimension — Scope Rules

Compliance weight (0.15) applies differently based on project scope:

| Scope | Compliance Covers |
|-------|-------------------|
| Backend-only | API compliance (contracts, schema validation, versioning) |
| Frontend-only | UI compliance (design system, a11y, responsive) |
| Full-stack | API + UI compliance (split evenly: 0.075 each) |

Composite Score

composite = sum(dimension_score * weight for each dimension)

Each dimension is scored 0-10 with decimal precision. Composite is also 0-10.

Grade Thresholds

| Score | Grade | Verdict | Action |
|-------|-------|---------|--------|
| 9.0-10.0 | A+ | EXCELLENT | Ship it! |
| 8.0-8.9 | A | GOOD | Ready for merge |
| 7.0-7.9 | B | GOOD | Minor improvements optional |
| 6.0-6.9 | C | ADEQUATE | Consider improvements |
| 5.0-5.9 | D | NEEDS WORK | Improvements recommended |
| 0.0-4.9 | F | CRITICAL | Do not merge |
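
The composite formula and grade lookup can be sketched together (weights copied from the 7-dimension table; function names are illustrative):

```python
# Canonical dimension weights from the quality model (sum to 1.00).
WEIGHTS = {"correctness": 0.15, "maintainability": 0.15, "performance": 0.12,
           "security": 0.20, "scalability": 0.10, "testability": 0.13,
           "compliance": 0.15}

def composite(scores: dict[str, float]) -> float:
    """Weighted average of 0-10 dimension scores; result is also 0-10."""
    return sum(scores[d] * w for d, w in WEIGHTS.items())

def grade(score: float) -> str:
    """Map a composite score onto the grade thresholds."""
    for floor, letter in [(9.0, "A+"), (8.0, "A"), (7.0, "B"),
                          (6.0, "C"), (5.0, "D")]:
        if score >= floor:
            return letter
    return "F"

s = round(composite({d: 8.0 for d in WEIGHTS}), 1)
print(s, grade(s))  # 8.0 A
```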

Improvement Prioritization

Effort Scale (1-5)

| Points | Effort | Description |
|--------|--------|-------------|
| 1 | Trivial | < 15 minutes, single file change |
| 2 | Low | 15-60 minutes, few files |
| 3 | Medium | 1-4 hours, moderate scope |
| 4 | High | 4-8 hours, significant refactoring |
| 5 | Major | 1+ days, architectural change |

Impact Scale (1-5)

| Points | Impact | Description |
|--------|--------|-------------|
| 1 | Minimal | Cosmetic, no functional change |
| 2 | Low | Minor improvement, limited scope |
| 3 | Medium | Noticeable quality improvement |
| 4 | High | Significant quality or security gain |
| 5 | Critical | Blocks shipping or fixes major vulnerability |

Priority Formula

priority = impact / effort

Higher ratio = do first.

Quick Wins

Effort <= 2 AND Impact >= 4

Always highlight quick wins at the top of improvement suggestions. These are high-value changes that can be done fast.

Scope Discovery

Phase 1.5: Scope Discovery (CRITICAL -- prevents context exhaustion)

Before spawning any agents, build a bounded file list. Agents that receive unbounded targets will exhaust their context windows reading the entire codebase.

# 1. Discover target files
if is_file(target):
    scope_files = [target]
elif is_directory(target):
    scope_files = Glob(f"{target}/**/*.{{py,ts,tsx,js,jsx,go,rs,java}}")
else:
    # Concept/topic -- search for relevant files
    scope_files = Grep(pattern=target, output_mode="files_with_matches", head_limit=50)

# 2. Apply limits -- MAX 30 files for agent assessment
MAX_FILES = 30
if len(scope_files) > MAX_FILES:
    total_found = len(scope_files)  # capture count before truncation
    # Prioritize: entry points, configs, security-critical, then sample rest
    # Skip: test files (except for testability agent), generated files, vendor/
    prioritized = prioritize_files(scope_files)  # entry points first
    scope_files = prioritized[:MAX_FILES]
    # Tell user about sampling
    print(f"Target has {total_found} files. Sampling {MAX_FILES} representative files.")

# 3. Format as file list string for agent prompts
file_list = "\n".join(f"- {f}" for f in scope_files)

Sampling Priorities (when >30 files)

  1. Entry points (main, index, app, server)
  2. Config files (settings, env, config)
  3. Security-sensitive (auth, middleware, api routes)
  4. Core business logic (services, models, domain)
  5. Representative samples from remaining directories
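
One way the `prioritize_files` helper referenced in the snippet above could implement this ordering. A sketch only: the regex bucket patterns are illustrative guesses, not the skill's actual matching rules:

```python
import re

# Ordered bucket patterns mirroring the sampling priorities above.
BUCKETS = [
    r"(^|/)(main|index|app|server)\.",  # 1. entry points
    r"(settings|config|env)",           # 2. config files
    r"(auth|middleware|routes?|api)",   # 3. security-sensitive
    r"(services?|models?|domain)",      # 4. core business logic
]

def prioritize_files(paths: list[str]) -> list[str]:
    """Stable-sort paths by the first bucket they match; unmatched files last."""
    def bucket(path: str) -> int:
        for i, pattern in enumerate(BUCKETS):
            if re.search(pattern, path):
                return i
        return len(BUCKETS)  # 5. representative remainder
    return sorted(paths, key=bucket)

files = ["src/utils/fmt.py", "src/auth/session.py", "src/main.py"]
print(prioritize_files(files))
# ['src/main.py', 'src/auth/session.py', 'src/utils/fmt.py']
```

Because the sort is stable, files within the same bucket keep their discovery order, so truncating to `MAX_FILES` keeps a spread of directories.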

Scoring Rubric

Scoring Rubric Reference

Detailed scoring guidelines for each quality dimension.

Correctness (Weight: 0.15)

Score 9-10: Excellent

  • All functionality works as documented
  • All edge cases handled gracefully
  • Comprehensive error handling
  • No known bugs
  • Types are accurate and complete

Score 7-8: Good

  • Core functionality works correctly
  • Most edge cases handled
  • Good error handling
  • Minor edge cases might be missing
  • Types mostly accurate

Score 5-6: Adequate

  • Main happy path works
  • Some edge cases unhandled
  • Basic error handling
  • Known minor bugs exist
  • Some type inaccuracies

Score 3-4: Poor

  • Functionality partially works
  • Many edge cases fail
  • Minimal error handling
  • Multiple bugs present
  • Significant type issues

Score 1-2: Critical

  • Core functionality broken
  • No edge case handling
  • Errors cause crashes
  • Critical bugs
  • Types unreliable

Score 0: Broken

  • Does not function at all
  • Cannot be used

Maintainability (Weight: 0.15)

Score 9-10: Excellent

  • Crystal clear code, self-documenting
  • Perfect naming conventions
  • Single responsibility everywhere
  • Cyclomatic complexity < 5
  • Any developer can understand immediately

Score 7-8: Good

  • Clear code with minor clarifications needed
  • Good naming, occasional ambiguity
  • Mostly single responsibility
  • Cyclomatic complexity < 10
  • Reasonable onboarding time

Score 5-6: Adequate

  • Understandable with effort
  • Mixed naming quality
  • Some large functions
  • Cyclomatic complexity < 15
  • Requires context to understand

Score 3-4: Poor

  • Difficult to understand
  • Poor naming choices
  • Multiple responsibilities mixed
  • Cyclomatic complexity 15-20
  • Requires original author to explain

Score 1-2: Critical

  • Incomprehensible
  • Meaningless names
  • Massive functions
  • Cyclomatic complexity > 20
  • "Here be dragons"

Score 0: Broken

  • Cannot be maintained at all

Performance (Weight: 0.12)

Score 9-10: Excellent

  • Optimal algorithm choices
  • No unnecessary operations
  • Proper caching
  • Async where beneficial
  • Measured and optimized

Score 7-8: Good

  • Good algorithm choices
  • Minor inefficiencies
  • Some caching
  • Async used appropriately
  • No major bottlenecks

Score 5-6: Adequate

  • Acceptable algorithms
  • Some unnecessary operations
  • Limited caching
  • Missing async opportunities
  • Noticeable but tolerable delays

Score 3-4: Poor

  • Suboptimal algorithms (O(n^2) in hot paths)
  • Many unnecessary operations
  • No caching strategy
  • Blocking where should be async
  • Noticeable performance issues

Score 1-2: Critical

  • Wrong algorithm choices
  • Excessive operations
  • Performance blockers
  • User-impacting delays

Score 0: Broken

  • Unusable due to performance

Security (Weight: 0.20)

Score 9-10: Excellent

  • All OWASP Top 10 addressed
  • Input validation everywhere
  • Proper authentication/authorization
  • Secrets managed correctly
  • Security reviewed

Score 7-8: Good

  • Most security concerns addressed
  • Good input validation
  • Proper auth patterns
  • No obvious vulnerabilities
  • Minor improvements possible

Score 5-6: Adequate

  • Basic security in place
  • Some validation gaps
  • Auth works but could be tighter
  • No critical vulnerabilities
  • Needs security review

Score 3-4: Poor

  • Security gaps present
  • Missing input validation
  • Auth issues
  • Potential vulnerabilities
  • Should not be in production

Score 1-2: Critical

  • Security vulnerabilities present
  • No input validation
  • Broken auth
  • Active exploit potential

Score 0: Broken

  • Actively exploitable

Scalability (Weight: 0.10)

Score 9-10: Excellent

  • Horizontally scalable
  • Stateless design
  • Proper queuing/caching
  • Handles 10x growth easily
  • Load tested

Score 7-8: Good

  • Mostly scalable
  • Minimal state
  • Some bottlenecks identified
  • Handles 5x growth
  • Scaling path clear

Score 5-6: Adequate

  • Scales with limitations
  • Some state management
  • Known bottlenecks
  • Handles 2x growth
  • Scaling requires work

Score 3-4: Poor

  • Limited scalability
  • Stateful design
  • Multiple bottlenecks
  • Near capacity
  • Scaling is a project

Score 1-2: Critical

  • Does not scale
  • Single point of failure
  • Already at capacity

Score 0: Broken

  • Cannot handle current load

Testability (Weight: 0.13)

Score 9-10: Excellent

  • >90% coverage
  • Meaningful assertions
  • Edge cases tested
  • Fast, deterministic tests
  • Easy to add new tests

Score 7-8: Good

  • >80% coverage
  • Good assertions
  • Main paths tested
  • Mostly fast tests
  • Reasonable to add tests

Score 5-6: Adequate

  • >70% coverage
  • Basic assertions
  • Happy path tested
  • Some slow tests
  • Tests can be added

Score 3-4: Poor

  • >50% coverage
  • Weak assertions
  • Coverage gaps
  • Flaky tests
  • Hard to test

Score 1-2: Critical

  • <50% coverage
  • Minimal assertions
  • Critical paths untested
  • Many flaky tests

Score 0: Broken

  • No tests or tests don't run

Checklists (1)

Assessment Checklist

Assessment Checklist

Pre-completion validation for comprehensive assessments.

Quality Rating

  • All dimensions rated (Correctness, Maintainability, Performance, Security, Scalability, Testability, and Compliance when in scope)
  • Each dimension has specific evidence cited
  • Composite score calculated with correct weights
  • Grade assigned matches score range

Pros/Cons Analysis

  • At least 3 pros identified
  • At least 3 cons identified
  • Pros and cons are balanced (not all positive or negative)
  • Each item has impact/severity rating
  • Evidence provided for each claim

Alternative Comparison

  • At least 2 alternatives considered
  • Each alternative scored on same dimensions
  • Migration effort estimated (1-5 scale)
  • Clear recommendation with rationale
  • Trade-offs documented

Improvement Suggestions

  • Suggestions prioritized by Impact/Effort
  • Quick wins identified (high impact, low effort)
  • Effort estimates provided for each
  • Expected score improvement stated
  • Dependencies between improvements noted

Verdict

  • Clear YES/NO/MOSTLY/SOMEWHAT answer
  • Reasoning explains the verdict
  • Actionable next steps provided
  • Strongest and weakest dimensions highlighted

Report Quality

  • Executive summary is 2-3 sentences
  • All sections completed
  • No contradictions between sections
  • Evidence is specific (file:line when applicable)