Assess
Assesses and rates quality 0-10 with pros/cons analysis. Use when evaluating code, designs, or approaches.
Assess
Comprehensive assessment skill for answering "is this good?" with structured evaluation, scoring, and actionable recommendations.
Quick Start
/ork:assess backend/app/services/auth.py
/ork:assess our caching strategy
/ork:assess the current database schema
/ork:assess frontend/src/components/Dashboard
STEP 0: Verify User Intent with AskUserQuestion
BEFORE creating tasks, clarify assessment dimensions:
AskUserQuestion(
questions=[{
"question": "What dimensions to assess?",
"header": "Dimensions",
"options": [
{"label": "Full assessment (Recommended)", "description": "All dimensions: quality, maintainability, security, performance"},
{"label": "Code quality only", "description": "Readability, complexity, best practices"},
{"label": "Security focus", "description": "Vulnerabilities, attack surface, compliance"},
{"label": "Quick score", "description": "Just give me a 0-10 score with brief notes"}
],
"multiSelect": false
}]
)
Based on answer, adjust workflow:
- Full assessment: All 7 phases, parallel agents
- Code quality only: Skip security and performance phases
- Security focus: Prioritize security-auditor agent
- Quick score: Single pass, brief output
STEP 0b: Select Orchestration Mode
See Orchestration Mode for env var check logic, Agent Teams vs Task Tool comparison, and mode selection rules.
Task Management (CC 2.1.16)
TaskCreate(
subject="Assess: {target}",
description="Comprehensive evaluation with quality scores and recommendations",
activeForm="Assessing {target}"
)
What This Skill Answers
| Question | How It's Answered |
|---|---|
| "Is this good?" | Quality score 0-10 with reasoning |
| "What are the trade-offs?" | Structured pros/cons list |
| "Should we change this?" | Improvement suggestions with effort |
| "What are the alternatives?" | Comparison with scores |
| "Where should we focus?" | Prioritized recommendations |
Workflow Overview
| Phase | Activities | Output |
|---|---|---|
| 1. Target Understanding | Read code/design, identify scope | Context summary |
| 1.5. Scope Discovery | Build bounded file list | Scoped file list |
| 2. Quality Rating | 7-dimension scoring (0-10) | Scores with reasoning |
| 3. Pros/Cons Analysis | Strengths and weaknesses | Balanced evaluation |
| 4. Alternative Comparison | Score alternatives | Comparison matrix |
| 5. Improvement Suggestions | Actionable recommendations | Prioritized list |
| 6. Effort Estimation | Time and complexity estimates | Effort breakdown |
| 7. Assessment Report | Compile findings | Final report |
Phase 1: Target Understanding
Identify what's being assessed and gather context:
# PARALLEL - Gather context
Read(file_path="$ARGUMENTS") # If file path
Grep(pattern="$ARGUMENTS", output_mode="files_with_matches")
mcp__memory__search_nodes(query="$ARGUMENTS") # Past decisions
Phase 1.5: Scope Discovery
See Scope Discovery for the full file discovery, limit application (MAX 30 files), and sampling priority logic. Always include the scoped file list in every agent prompt.
Phase 2: Quality Rating (7 Dimensions)
Rate each dimension 0-10 with weighted composite score. See Quality Model for dimensions, weights, and grade interpretation. See Scoring Rubric for per-dimension criteria.
See Agent Spawn Definitions for Task Tool mode spawn patterns and Agent Teams alternative.
Composite Score: Weighted average of all 7 dimensions (see quality-model.md).
Phases 3-7: Analysis, Comparison & Report
See Phase Templates for output templates for pros/cons, alternatives, improvements, effort, and the final report.
See also: Alternative Analysis | Improvement Prioritization
Grade Interpretation
See Quality Model for scoring dimensions, weights, and grade interpretation.
Key Decisions
| Decision | Choice | Rationale |
|---|---|---|
| 7 dimensions | Comprehensive coverage | All quality aspects without overwhelming |
| 0-10 scale | Industry standard | Easy to understand and compare |
| Parallel assessment | 4 agents (7 dimensions) | Fast, thorough evaluation |
| Effort/Impact scoring | 1-5 scale | Simple prioritization math |
Rules Quick Reference
| Rule | Impact | What It Covers |
|---|---|---|
| complexity-metrics | HIGH | 7-criterion scoring (1-5), complexity levels, thresholds |
| complexity-breakdown | HIGH | Task decomposition strategies, risk assessment |
Related Skills
- assess-complexity - Task complexity assessment
- ork:verify - Post-implementation verification
- ork:code-review-playbook - Code review patterns
- ork:quality-gates - Quality gate patterns
Version: 1.1.0 (February 2026)
Rules (2)
Decompose complex tasks into isolated subtasks to reduce failure risk and enable parallelism — HIGH
Task Decomposition Strategies
When a task scores Level 4-5 (Complex/Very Complex), decompose it into subtasks that each score Level 1-3.
Decomposition Approach
- Identify independent axes — separate concerns that can be worked on independently
- Isolate unknowns — create a spike/research task for each unknown
- Reduce cross-cutting scope — break into single-module changes
- Sequence by dependency — order subtasks so blocked items come last
Strategies by Complexity Driver
| High Criterion | Decomposition Strategy |
|---|---|
| Lines of Code (4-5) | Split by component or layer |
| Files Affected (4-5) | Split by directory or module |
| Dependencies (4-5) | Isolate external integrations into adapter tasks |
| Unknowns (4-5) | Create spike tasks to resolve unknowns first |
| Cross-Cutting (4-5) | Split by layer (DB, API, UI) or by concern |
| Risk Level (4-5) | Add validation/testing tasks before implementation |
Codebase Analysis for Decomposition
# Gather metrics to inform breakdown
./scripts/analyze-codebase.sh "$TARGET"
# Key signals:
# - File count > 10: split by directory
# - Import count > 5: isolate dependency interfaces
# - Test coverage < 50%: add test-first subtask
Subtask Validation
Each subtask should score:
- Average: 1.0-3.4 (Level 1-3) — manageable
- Unknowns: <= 2 — no major research needed
- Cross-cutting: <= 2 — limited to 2-3 modules
If any subtask still scores Level 4+, decompose it again.
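The validation thresholds above can be sketched as a small check. This is illustrative only; the criterion keys are assumptions, not a fixed schema:

```python
def subtask_ok(scores: dict[str, int]) -> bool:
    """Check a subtask's 7-criterion scores against the validation rules."""
    average = sum(scores.values()) / len(scores)
    return (
        average <= 3.4                            # Level 1-3: manageable
        and scores.get("unknowns", 0) <= 2        # no major research needed
        and scores.get("cross_cutting", 0) <= 2   # limited to 2-3 modules
    )
```

If `subtask_ok` returns False, decompose that subtask again before scheduling it.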
Risk Assessment Integration
| Risk Factor | Mitigation Task |
|---|---|
| No test coverage | Add regression tests first |
| Complex business logic | Write specification/invariant tests |
| External API dependency | Create mock/adapter layer first |
| Database migration | Test migration on staging first |
| Multiple team coordination | Define interface contracts first |
Key Rules
- Level 4-5 tasks are never directly implemented — decompose first
- Every subtask must score Level 3 or below individually
- Resolve unknowns before starting dependent implementation tasks
- Use TaskCreate with addBlockedBy to enforce subtask ordering
- Each subtask should be independently verifiable with its own tests
- Prefer vertical slices (end-to-end for one feature) over horizontal layers
Incorrect — Starting Level 5 task without decomposition:
Task: "Migrate authentication to OAuth2"
Complexity: 4.8 (Level 5)
Action: Start implementing directly
// High failure risk, scope creep likely
Correct — Breaking into Level 1-3 subtasks:
1. Research OAuth2 providers (Level 2, 1-2h)
2. Add OAuth library dependency (Level 1, 30m)
3. Implement OAuth callback handler (Level 2, 2-4h)
4. Migrate existing sessions (Level 3, 4-8h)
5. Add regression tests (Level 2, 2h)
Score task complexity with structured frameworks to prevent scope creep and estimate drift — HIGH
Complexity Scoring Frameworks
Score task complexity across 7 criteria (1-5 each) to determine if a task should proceed or be decomposed first.
The 7 Criteria
| Criterion | 1 (Low) | 3 (Medium) | 5 (High) |
|---|---|---|---|
| Lines of Code | < 50 | 200-500 | 1500+ |
| Time Estimate | < 30 min | 2-8 hours | 24+ hours |
| Files Affected | 1 file | 4-10 files | 26+ files |
| Dependencies | 0 deps | 2-3 deps | 7+ deps |
| Unknowns | None | Several, researchable | Unclear scope |
| Cross-Cutting | Single module | 4-5 modules | System-wide |
| Risk Level | Trivial | Testable complexity | Mission-critical |
Complexity Levels
| Average Score | Level | Classification | Action |
|---|---|---|---|
| 1.0 - 1.4 | 1 | Trivial | Proceed immediately |
| 1.5 - 2.4 | 2 | Simple | Proceed |
| 2.5 - 3.4 | 3 | Moderate | Proceed with caution |
| 3.5 - 4.4 | 4 | Complex | Break down first |
| 4.5 - 5.0 | 5 | Very Complex | Decompose and reassess |
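The average-to-level mapping above can be sketched in Python (an illustrative helper, not part of the skill's API):

```python
def complexity_level(scores: list[int]) -> tuple[float, int, str]:
    """Map the average of the 7 criterion scores (1-5 each) to a complexity level."""
    avg = sum(scores) / len(scores)
    bands = [
        (1.4, 1, "Trivial"), (2.4, 2, "Simple"), (3.4, 3, "Moderate"),
        (4.4, 4, "Complex"), (5.0, 5, "Very Complex"),
    ]
    for upper, level, name in bands:
        if avg <= upper + 1e-9:  # tolerance for float averages
            return round(avg, 1), level, name
    return round(avg, 1), 5, "Very Complex"
```

For the notification example later in this rule, scores of (4, 4, 3, 4, 3, 4, 3) average to 3.6, Level 4 (Complex).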
Output Format
## Complexity Assessment: [Target]
| Criterion | Score |
|-----------|-------|
| Lines of Code | X/5 |
| Time Estimate | X/5 |
| Files Affected | X/5 |
| Dependencies | X/5 |
| Unknowns | X/5 |
| Cross-Cutting | X/5 |
| Risk Level | X/5 |
| **Total** | **XX/35** |
**Average Score:** X.X
**Complexity Level:** X ([Classification])
**Can Proceed:** Yes/No
Key Rules
- Score all 7 criteria even for seemingly simple tasks
- Total of 35 points maximum, divide by 7 for average
- Level 4-5 tasks must be decomposed before starting implementation
- Unknowns (criterion 5) is the highest variance factor — resolve unknowns first
- Cross-cutting (criterion 6) indicates coordination overhead — account for it in estimates
Incorrect — Skipping complexity assessment:
Task: "Add real-time notifications"
Action: Start coding
// No idea of scope, likely 3-5x over estimate
Correct — Scoring all 7 criteria first:
| Criterion | Score |
|-----------|-------|
| Lines of Code | 4/5 (800+ lines) |
| Time Estimate | 4/5 (16+ hours) |
| Files Affected | 3/5 (8 files) |
| Dependencies | 4/5 (WebSockets, Redis) |
| Unknowns | 3/5 (some research needed) |
| Cross-Cutting | 4/5 (DB, API, UI, workers) |
| Risk Level | 3/5 (testable) |
| **Average** | **3.6 (Level 4 - Complex)** |
Action: Decompose into subtasks before starting
References (9)
Agent Spawn Definitions
Agent Spawn Definitions
Dimension-to-agent mapping and spawn patterns for Phase 2.
Task Tool Mode (Default)
For each dimension, spawn a background agent with scope constraints:
for dimension, agent_type in [
("CORRECTNESS + MAINTAINABILITY", "code-quality-reviewer"),
("SECURITY", "security-auditor"),
("PERFORMANCE + SCALABILITY", "python-performance-engineer"), # Use python-performance-engineer for backend; frontend-performance-engineer for frontend
("TESTABILITY", "test-generator"),
]:
Task(subagent_type=agent_type, run_in_background=True, max_turns=25,
prompt=f"""Assess {dimension} (0-10) for: {target}
## Scope Constraint
ONLY read and analyze the following {len(scope_files)} files -- do NOT explore beyond this list:
{file_list}
Budget: Use at most 15 tool calls. Read files from the list above, then produce your score
with reasoning, evidence, and 2-3 specific improvement suggestions.
Do NOT use Glob or Grep to discover additional files.""")
Then collect results from all agents and proceed to Phase 3.
Agent Teams Alternative
See agent-teams-mode.md for Agent Teams assessment workflow with cross-validation and team teardown.
Context Window Note
For full codebase assessments (>20 files), use the 1M context window to avoid agent context exhaustion. On 200K context, the scope discovery in scope-discovery.md limits files to prevent overflow.
Agent Teams Mode
Agent Teams Assessment Mode
In Agent Teams mode, form an assessment team where dimension assessors cross-validate scores and discuss disagreements:
TeamCreate(team_name="assess-{target-slug}", description="Assess {target}")
# SCOPE CONSTRAINT (injected into every agent prompt):
SCOPE_INSTRUCTIONS = f"""
## Scope Constraint
ONLY read and analyze the following {len(scope_files)} files — do NOT explore beyond this list:
{file_list}
Budget: Use at most 15 tool calls. Read files from the list above, then score.
Do NOT use Glob or Grep to discover additional files.
"""
Task(subagent_type="code-quality-reviewer", name="correctness-assessor",
team_name="assess-{target-slug}", max_turns=25,
prompt=f"""Assess CORRECTNESS (0-10) and MAINTAINABILITY (0-10) for: {target}
{SCOPE_INSTRUCTIONS}
When you find issues that affect security, message security-assessor.
When you find issues that affect performance, message perf-assessor.
Share your scores with all teammates for calibration — if scores diverge
significantly (>2 points), discuss the disagreement.""")
Task(subagent_type="security-auditor", name="security-assessor",
team_name="assess-{target-slug}", max_turns=25,
prompt=f"""Assess SECURITY (0-10) for: {target}
{SCOPE_INSTRUCTIONS}
When correctness-assessor flags security-relevant patterns, investigate deeper.
When you find performance-impacting security measures, message perf-assessor.
Share your score and flag any cross-dimension trade-offs.""")
Task(subagent_type="python-performance-engineer", name="perf-assessor", # or frontend-performance-engineer for frontend
team_name="assess-{target-slug}", max_turns=25,
prompt=f"""Assess PERFORMANCE (0-10) and SCALABILITY (0-10) for: {target}
{SCOPE_INSTRUCTIONS}
When security-assessor flags performance trade-offs, evaluate the impact.
When you find testability issues (hard-to-benchmark code), message test-assessor.
Share your scores with reasoning for the composite calculation.""")
Task(subagent_type="test-generator", name="test-assessor",
team_name="assess-{target-slug}", max_turns=25,
prompt=f"""Assess TESTABILITY (0-10) for: {target}
{SCOPE_INSTRUCTIONS}
Evaluate test coverage, test quality, and ease of testing.
When other assessors flag dimension-specific concerns, verify test coverage
for those areas. Share your score and any coverage gaps found.""")
Team teardown after report compilation:
SendMessage(type="shutdown_request", recipient="correctness-assessor", content="Assessment complete")
SendMessage(type="shutdown_request", recipient="security-assessor", content="Assessment complete")
SendMessage(type="shutdown_request", recipient="perf-assessor", content="Assessment complete")
SendMessage(type="shutdown_request", recipient="test-assessor", content="Assessment complete")
TeamDelete()
Fallback — Team Formation Failure: If team formation fails, use standard Phase 2 Task spawns.
Fallback — Context Exhaustion: If agents hit "Context limit reached" before returning scores, collect whatever partial results were produced, then score remaining dimensions yourself using the scoped file list from Phase 1.5. Do NOT re-spawn agents — assess the remaining dimensions inline and proceed to Phase 3.
Alternative Analysis
Alternative Analysis Reference
How to identify, evaluate, and compare alternatives to the current approach.
Identifying Alternatives
- Direct Substitutes: Different implementations of the same solution
- Architectural Alternatives: Different design patterns or approaches
- Technology Alternatives: Different libraries, frameworks, or tools
- Hybrid Approaches: Combinations of multiple alternatives
Comparison Dimensions
| Dimension | Question | Weight |
|---|---|---|
| Score | How does it rate on the 7 dimensions? | 0.30 |
| Effort | How hard to implement/migrate? | 0.25 |
| Risk | What could go wrong? | 0.25 |
| Benefit | What's the expected improvement? | 0.20 |
Migration Effort Scale
| Level | Description | Time Estimate |
|---|---|---|
| 1 | Drop-in replacement | < 1 hour |
| 2 | Minor refactoring | 1-4 hours |
| 3 | Moderate changes | 1-2 days |
| 4 | Significant rework | 3-5 days |
| 5 | Major rewrite | 1+ weeks |
Risk Categories
- Technical: Will it work? Compatibility issues?
- Team: Does team know this? Learning curve?
- Timeline: Can we afford the migration time?
- Dependencies: What else needs to change?
Decision Criteria
Switch if:
- Score improvement >= 1.5 points AND effort <= 3
- Current has critical security/correctness issues
- Alternative has significantly lower maintenance burden
Stay if:
- Score difference < 1.0 point
- Migration effort >= 4 AND no critical issues
- Team familiarity strongly favors current
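The switch/stay criteria can be sketched as a decision helper. The maintenance-burden and team-familiarity criteria are qualitative and omitted here, so treat this as a first-pass filter only:

```python
def should_switch(score_delta: float, migration_effort: int,
                  current_has_critical: bool) -> bool:
    """Apply the quantitative switch/stay criteria (sketch)."""
    if current_has_critical:
        return True   # critical security/correctness issues force a change
    if score_delta >= 1.5 and migration_effort <= 3:
        return True   # big enough gain at acceptable migration cost
    if score_delta < 1.0 or migration_effort >= 4:
        return False  # too small a gain, or too costly to migrate
    return False      # borderline cases default to staying
```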
Trade-off Documentation
## Alternative: [Name]
**Score Delta:** +/-[N.N] points
**Migration Effort:** [1-5]
**Risk Level:** Low/Medium/High
### Why Consider
- [Benefit 1]
- [Benefit 2]
### Why Not
- [Drawback 1]
- [Drawback 2]
### Verdict: [Adopt/Defer/Reject]
Improvement Prioritization
Improvement Prioritization Reference
Systematic approach to ranking improvements by value delivered per effort invested.
Impact/Effort Scoring
Impact Scale (1-5)
| Score | Label | Effect on Quality |
|---|---|---|
| 5 | Critical | Fixes blocker, +2.0+ points |
| 4 | High | Major improvement, +1.0-2.0 |
| 3 | Medium | Notable improvement, +0.5-1.0 |
| 2 | Low | Minor improvement, +0.2-0.5 |
| 1 | Minimal | Cosmetic, +0.1-0.2 |
Effort Scale (1-5)
| Score | Label | Time Required |
|---|---|---|
| 1 | Trivial | < 15 minutes |
| 2 | Easy | 15-60 minutes |
| 3 | Medium | 1-4 hours |
| 4 | Hard | 4-8 hours |
| 5 | Very Hard | 1+ days |
Priority Formula
Priority = Impact / Effort
Higher priority = do first. At equal priority, prefer lower effort.
Improvement Categories
| Category | Impact | Effort | Action |
|---|---|---|---|
| Quick Wins | High (4-5) | Low (1-2) | Do immediately |
| Strategic | High (4-5) | High (4-5) | Plan carefully |
| Fill-ins | Low (1-2) | Low (1-2) | Do when idle |
| Avoid | Low (1-2) | High (4-5) | Skip or defer |
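The priority formula and the tie-break rule can be sketched as a sort. The `impact`/`effort` dict keys are illustrative assumptions:

```python
def prioritize(improvements: list[dict]) -> list[dict]:
    """Rank improvements by Impact/Effort; at equal priority, prefer lower effort."""
    return sorted(
        improvements,
        key=lambda i: (-(i["impact"] / i["effort"]), i["effort"]),
    )
```

Quick Wins (Impact >= 4, Effort <= 2) naturally sort to the top with this key.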
Time Estimation Guidelines
- Add buffer: Estimate x1.5 for unknowns
- Include testing: Add 30% for test updates
- Account for review: Add time for PR process
- Consider dependencies: Chain effects on other work
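A minimal sketch of the buffer guidelines above; the review and dependency adjustments are context-dependent and omitted:

```python
def estimate_hours(base: float, has_unknowns: bool = True) -> float:
    """Apply the x1.5 unknowns buffer and +30% testing overhead (sketch)."""
    est = base * (1.5 if has_unknowns else 1.0)  # buffer for unknowns
    est *= 1.3                                    # include test updates
    return round(est, 1)
```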
Sequencing Dependencies
- Blockers first: Changes that unblock other work
- Foundation changes: Structural changes before features
- Shared code: Common utilities before consumers
- Leaf nodes last: Isolated changes can wait
Quick Reference
Priority 5.0+ = Do NOW (high impact, trivial effort)
Priority 2.0+ = Do soon (good ROI)
Priority 1.0+ = Schedule it
Priority <1.0 = Backlog or skip
Orchestration Mode
<!-- SHARED: keep in sync with ../../../verify/references/orchestration-mode.md -->
Orchestration Mode Selection
Shared logic for choosing between Agent Teams and Task tool orchestration in assess/verify skills.
Environment Check
import os
teams_available = os.environ.get("CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS") is not None
force_task_tool = os.environ.get("ORCHESTKIT_FORCE_TASK_TOOL") == "1"
if force_task_tool or not teams_available:
mode = "task_tool"
else:
# Teams available — use for full multi-dimensional work
mode = "agent_teams" if scope == "full" else "task_tool"
Decision Rules
- CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS set --> Agent Teams mode (for full assessment/verification)
- Flag not set --> Task tool mode (default)
- Quick/single-dimension scope --> Task tool (regardless of flag)
- ORCHESTKIT_FORCE_TASK_TOOL=1 --> Task tool (override)
Agent Teams vs Task Tool
| Aspect | Task Tool (Star) | Agent Teams (Mesh) |
|---|---|---|
| Topology | All agents report to lead | Agents communicate with each other |
| Finding correlation | Lead cross-references after completion | Agents share findings in real-time |
| Cross-domain overlap | Independent scoring | Agents alert each other about overlapping concerns |
| Cost | ~200K tokens | ~500K tokens |
| Best for | Focused/single-dimension work | Full multi-dimensional assessment/verification |
Fallback
If Agent Teams encounters issues mid-execution, fall back to Task tool for remaining work. This is safe because both modes produce the same output format (dimensional scores 0-10).
Context Window Note
For full codebase work (>20 files), use the 1M context window to avoid agent context exhaustion. On 200K context, scope discovery should limit files to prevent overflow.
Phase Templates
Phase Output Templates
Markdown templates for assessment phases 3-7.
Phase 3: Pros/Cons Analysis
## Pros (Strengths)
| # | Strength | Impact | Evidence |
|---|----------|--------|----------|
| 1 | [strength] | High/Med/Low | [example] |
## Cons (Weaknesses)
| # | Weakness | Severity | Evidence |
|---|----------|----------|----------|
| 1 | [weakness] | High/Med/Low | [example] |
**Net Assessment:** [Strengths outweigh / Balanced / Weaknesses dominate]
**Recommended action:** [Keep as-is / Improve / Reconsider / Rewrite]
Phase 4: Alternative Comparison
See alternative-analysis.md for full comparison template.
| Criteria | Current | Alternative A | Alternative B |
|---|---|---|---|
| Composite | [N.N] | [N.N] | [N.N] |
| Migration Effort | N/A | [1-5] | [1-5] |
Phase 5: Improvement Suggestions
See improvement-prioritization.md for effort/impact guidelines.
| Suggestion | Effort (1-5) | Impact (1-5) | Priority (I/E) |
|---|---|---|---|
| [action] | [N] | [N] | [ratio] |
Quick Wins = Effort <= 2 AND Impact >= 4. Always highlight these first.
Phase 6: Effort Estimation
| Timeframe | Tasks | Total |
|---|---|---|
| Quick wins (< 1hr) | [list] | X min |
| Short-term (< 1 day) | [list] | X hrs |
| Medium-term (1-3 days) | [list] | X days |
Phase 7: Assessment Report
See scoring-rubric.md for full report template.
# Assessment Report: $ARGUMENTS
**Overall Score: [N.N]/10** (Grade: [A+/A/B/C/D/F])
**Verdict:** [EXCELLENT | GOOD | ADEQUATE | NEEDS WORK | CRITICAL]
## Answer: Is This Good?
**[YES / MOSTLY / SOMEWHAT / NO]**
[Reasoning]
Quality Model
<!-- SHARED: keep in sync with ../../../verify/references/quality-model.md -->
Quality Model
Canonical scoring reference for assess and verify skills. Defines unified dimensions, weights, grade thresholds, and improvement prioritization.
Scoring Dimensions (7 Unified)
| Dimension | Weight | What It Measures |
|---|---|---|
| Correctness | 0.15 | Does it work correctly? Functional accuracy, edge cases handled |
| Maintainability | 0.15 | Easy to understand and modify? Readability, complexity, patterns |
| Performance | 0.12 | Efficient execution? No bottlenecks, resource usage, latency |
| Security | 0.20 | Follows security best practices? OWASP, secrets, CVEs, input validation |
| Scalability | 0.10 | Handles growth? Load patterns, data volume, horizontal scaling |
| Testability | 0.13 | Easy to test? Coverage, test quality, isolation, mocking |
| Compliance | 0.15 | Meets API and UI contracts? Conditional on scope (see below) |
Total: 1.00
Compliance Dimension — Scope Rules
Compliance weight (0.15) applies differently based on project scope:
| Scope | Compliance Covers |
|---|---|
| Backend-only | API compliance (contracts, schema validation, versioning) |
| Frontend-only | UI compliance (design system, a11y, responsive) |
| Full-stack | API + UI compliance (split evenly: 0.075 each) |
Composite Score
composite = sum(dimension_score * weight for each dimension)
Each dimension is scored 0-10 with decimal precision. Composite is also 0-10.
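With the weights from the dimensions table above, the composite calculation looks like this (a sketch; dimension names are keyed informally):

```python
WEIGHTS = {
    "correctness": 0.15, "maintainability": 0.15, "performance": 0.12,
    "security": 0.20, "scalability": 0.10, "testability": 0.13,
    "compliance": 0.15,
}

def composite(scores: dict[str, float]) -> float:
    """Weighted composite of the 7 dimension scores (each 0-10)."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must total 1.00
    return round(sum(scores[d] * w for d, w in WEIGHTS.items()), 1)
```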
Grade Thresholds
| Score | Grade | Verdict | Action |
|---|---|---|---|
| 9.0-10.0 | A+ | EXCELLENT | Ship it! |
| 8.0-8.9 | A | GOOD | Ready for merge |
| 7.0-7.9 | B | GOOD | Minor improvements optional |
| 6.0-6.9 | C | ADEQUATE | Consider improvements |
| 5.0-5.9 | D | NEEDS WORK | Improvements recommended |
| 0.0-4.9 | F | CRITICAL | Do not merge |
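The threshold table can be sketched as a simple lookup (illustrative):

```python
def grade(score: float) -> tuple[str, str]:
    """Map a composite score (0-10) to its grade and verdict."""
    bands = [
        (9.0, "A+", "EXCELLENT"), (8.0, "A", "GOOD"), (7.0, "B", "GOOD"),
        (6.0, "C", "ADEQUATE"), (5.0, "D", "NEEDS WORK"),
    ]
    for floor, letter, verdict in bands:
        if score >= floor:
            return letter, verdict
    return "F", "CRITICAL"  # anything below 5.0: do not merge
```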
Improvement Prioritization
Effort Scale (1-5)
| Points | Effort | Description |
|---|---|---|
| 1 | Trivial | < 15 minutes, single file change |
| 2 | Low | 15-60 minutes, few files |
| 3 | Medium | 1-4 hours, moderate scope |
| 4 | High | 4-8 hours, significant refactoring |
| 5 | Major | 1+ days, architectural change |
Impact Scale (1-5)
| Points | Impact | Description |
|---|---|---|
| 1 | Minimal | Cosmetic, no functional change |
| 2 | Low | Minor improvement, limited scope |
| 3 | Medium | Noticeable quality improvement |
| 4 | High | Significant quality or security gain |
| 5 | Critical | Blocks shipping or fixes major vulnerability |
Priority Formula
priority = impact / effort
Higher ratio = do first.
Quick Wins
Effort <= 2 AND Impact >= 4
Always highlight quick wins at the top of improvement suggestions. These are high-value changes that can be done fast.
Scope Discovery
Phase 1.5: Scope Discovery (CRITICAL -- prevents context exhaustion)
Before spawning any agents, build a bounded file list. Agents that receive unbounded targets will exhaust their context windows reading the entire codebase.
# 1. Discover target files
if is_file(target):
scope_files = [target]
elif is_directory(target):
scope_files = Glob(f"{target}/**/*.{{py,ts,tsx,js,jsx,go,rs,java}}")
else:
# Concept/topic -- search for relevant files
scope_files = Grep(pattern=target, output_mode="files_with_matches", head_limit=50)
# 2. Apply limits -- MAX 30 files for agent assessment
MAX_FILES = 30
if len(scope_files) > MAX_FILES:
# Tell user about sampling BEFORE truncating, so the count reflects the full target
print(f"Target has {len(scope_files)} files. Sampling {MAX_FILES} representative files.")
# Prioritize: entry points, configs, security-critical, then sample rest
# Skip: test files (except for testability agent), generated files, vendor/
prioritized = prioritize_files(scope_files) # entry points first
scope_files = prioritized[:MAX_FILES]
# 3. Format as file list string for agent prompts
file_list = "\n".join(f"- {f}" for f in scope_files)
Sampling Priorities (when >30 files)
- Entry points (main, index, app, server)
- Config files (settings, env, config)
- Security-sensitive (auth, middleware, api routes)
- Core business logic (services, models, domain)
- Representative samples from remaining directories
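The `prioritize_files` call in the discovery snippet above is otherwise undefined; a minimal sketch implementing these priorities (the regex patterns are illustrative assumptions):

```python
import re

def prioritize_files(files: list[str]) -> list[str]:
    """Order files by the sampling priorities: entry points, configs,
    security-sensitive, core logic, then everything else (sketch)."""
    buckets = [
        r"(main|index|app|server)\.",       # 1. entry points
        r"(settings|env|config)",           # 2. config files
        r"(auth|middleware|routes?|api)",   # 3. security-sensitive
        r"(services?|models?|domain)",      # 4. core business logic
    ]
    def rank(path: str) -> int:
        for i, pattern in enumerate(buckets):
            if re.search(pattern, path):
                return i
        return len(buckets)                 # 5. remaining directories
    return sorted(files, key=rank)          # stable sort keeps input order within a bucket
```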
Scoring Rubric
Scoring Rubric Reference
Detailed scoring guidelines for each quality dimension.
Correctness (Weight: 0.15)
Score 9-10: Excellent
- All functionality works as documented
- All edge cases handled gracefully
- Comprehensive error handling
- No known bugs
- Types are accurate and complete
Score 7-8: Good
- Core functionality works correctly
- Most edge cases handled
- Good error handling
- Minor edge cases might be missing
- Types mostly accurate
Score 5-6: Adequate
- Main happy path works
- Some edge cases unhandled
- Basic error handling
- Known minor bugs exist
- Some type inaccuracies
Score 3-4: Poor
- Functionality partially works
- Many edge cases fail
- Minimal error handling
- Multiple bugs present
- Significant type issues
Score 1-2: Critical
- Core functionality broken
- No edge case handling
- Errors cause crashes
- Critical bugs
- Types unreliable
Score 0: Broken
- Does not function at all
- Cannot be used
Maintainability (Weight: 0.15)
Score 9-10: Excellent
- Crystal clear code, self-documenting
- Perfect naming conventions
- Single responsibility everywhere
- Cyclomatic complexity < 5
- Any developer can understand immediately
Score 7-8: Good
- Clear code with minor clarifications needed
- Good naming, occasional ambiguity
- Mostly single responsibility
- Cyclomatic complexity < 10
- Reasonable onboarding time
Score 5-6: Adequate
- Understandable with effort
- Mixed naming quality
- Some large functions
- Cyclomatic complexity < 15
- Requires context to understand
Score 3-4: Poor
- Difficult to understand
- Poor naming choices
- Multiple responsibilities mixed
- Cyclomatic complexity 15-20
- Requires original author to explain
Score 1-2: Critical
- Incomprehensible
- Meaningless names
- Massive functions
- Cyclomatic complexity > 20
- "Here be dragons"
Score 0: Broken
- Cannot be maintained at all
Performance (Weight: 0.12)
Score 9-10: Excellent
- Optimal algorithm choices
- No unnecessary operations
- Proper caching
- Async where beneficial
- Measured and optimized
Score 7-8: Good
- Good algorithm choices
- Minor inefficiencies
- Some caching
- Async used appropriately
- No major bottlenecks
Score 5-6: Adequate
- Acceptable algorithms
- Some unnecessary operations
- Limited caching
- Missing async opportunities
- Noticeable but tolerable delays
Score 3-4: Poor
- Suboptimal algorithms (O(n^2) in hot paths)
- Many unnecessary operations
- No caching strategy
- Blocking where should be async
- Noticeable performance issues
Score 1-2: Critical
- Wrong algorithm choices
- Excessive operations
- Performance blockers
- User-impacting delays
Score 0: Broken
- Unusable due to performance
Security (Weight: 0.20)
Score 9-10: Excellent
- All OWASP Top 10 addressed
- Input validation everywhere
- Proper authentication/authorization
- Secrets managed correctly
- Security reviewed
Score 7-8: Good
- Most security concerns addressed
- Good input validation
- Proper auth patterns
- No obvious vulnerabilities
- Minor improvements possible
Score 5-6: Adequate
- Basic security in place
- Some validation gaps
- Auth works but could be tighter
- No critical vulnerabilities
- Needs security review
Score 3-4: Poor
- Security gaps present
- Missing input validation
- Auth issues
- Potential vulnerabilities
- Should not be in production
Score 1-2: Critical
- Security vulnerabilities present
- No input validation
- Broken auth
- Active exploit potential
Score 0: Broken
- Actively exploitable
Scalability (Weight: 0.10)
Score 9-10: Excellent
- Horizontally scalable
- Stateless design
- Proper queuing/caching
- Handles 10x growth easily
- Load tested
Score 7-8: Good
- Mostly scalable
- Minimal state
- Some bottlenecks identified
- Handles 5x growth
- Scaling path clear
Score 5-6: Adequate
- Scales with limitations
- Some state management
- Known bottlenecks
- Handles 2x growth
- Scaling requires work
Score 3-4: Poor
- Limited scalability
- Stateful design
- Multiple bottlenecks
- Near capacity
- Scaling is a project
Score 1-2: Critical
- Does not scale
- Single point of failure
- Already at capacity
Score 0: Broken
- Cannot handle current load
Testability (Weight: 0.13)
Score 9-10: Excellent
- >= 90% coverage
- Meaningful assertions
- Edge cases tested
- Fast, deterministic tests
- Easy to add new tests
Score 7-8: Good
- >= 80% coverage
- Good assertions
- Main paths tested
- Mostly fast tests
- Reasonable to add tests
Score 5-6: Adequate
- >= 70% coverage
- Basic assertions
- Happy path tested
- Some slow tests
- Tests can be added
Score 3-4: Poor
- >= 50% coverage
- Weak assertions
- Coverage gaps
- Flaky tests
- Hard to test
Score 1-2: Critical
- <50% coverage
- Minimal assertions
- Critical paths untested
- Many flaky tests
Score 0: Broken
- No tests or tests don't run
Checklists (1)
Assessment Checklist
Assessment Checklist
Pre-completion validation for comprehensive assessments.
Quality Rating
- All 7 dimensions rated (Correctness, Maintainability, Performance, Security, Scalability, Testability, Compliance)
- Each dimension has specific evidence cited
- Composite score calculated with correct weights
- Grade assigned matches score range
Pros/Cons Analysis
- At least 3 pros identified
- At least 3 cons identified
- Pros and cons are balanced (not all positive or negative)
- Each item has impact/severity rating
- Evidence provided for each claim
Alternative Comparison
- At least 2 alternatives considered
- Each alternative scored on same dimensions
- Migration effort estimated (1-5 scale)
- Clear recommendation with rationale
- Trade-offs documented
Improvement Suggestions
- Suggestions prioritized by Impact/Effort
- Quick wins identified (high impact, low effort)
- Effort estimates provided for each
- Expected score improvement stated
- Dependencies between improvements noted
Verdict
- Clear YES/NO/MOSTLY/SOMEWHAT answer
- Reasoning explains the verdict
- Actionable next steps provided
- Strongest and weakest dimensions highlighted
Report Quality
- Executive summary is 2-3 sentences
- All sections completed
- No contradictions between sections
- Evidence is specific (file:line when applicable)