Assess
Assesses and rates quality 0-10 with pros/cons analysis. Use when evaluating code, designs, or approaches.
Assess
Comprehensive assessment skill for answering "is this good?" with structured evaluation, scoring, and actionable recommendations.
Quick Start
/ork:assess backend/app/services/auth.py
/ork:assess our caching strategy
/ork:assess the current database schema
/ork:assess frontend/src/components/Dashboard
STEP 0: Verify User Intent with AskUserQuestion
BEFORE creating tasks, clarify assessment dimensions:
AskUserQuestion(
questions=[{
"question": "What dimensions to assess?",
"header": "Dimensions",
"options": [
{"label": "Full assessment (Recommended)", "description": "All dimensions: quality, maintainability, security, performance"},
{"label": "Code quality only", "description": "Readability, complexity, best practices"},
{"label": "Security focus", "description": "Vulnerabilities, attack surface, compliance"},
{"label": "Quick score", "description": "Just give me a 0-10 score with brief notes"}
],
"multiSelect": false
}]
)
Based on answer, adjust workflow:
- Full assessment: All 7 phases, parallel agents
- Code quality only: Skip security and performance phases
- Security focus: Prioritize security-auditor agent
- Quick score: Single pass, brief output
STEP 0b: Select Orchestration Mode
See Orchestration Mode for env var check logic, Agent Teams vs Task Tool comparison, and mode selection rules.
Task Management (CC 2.1.16)
TaskCreate(
subject="Assess: {target}",
description="Comprehensive evaluation with quality scores and recommendations",
activeForm="Assessing {target}"
)
What This Skill Answers
| Question | How It's Answered |
|---|---|
| "Is this good?" | Quality score 0-10 with reasoning |
| "What are the trade-offs?" | Structured pros/cons list |
| "Should we change this?" | Improvement suggestions with effort |
| "What are the alternatives?" | Comparison with scores |
| "Where should we focus?" | Prioritized recommendations |
Workflow Overview
| Phase | Activities | Output |
|---|---|---|
| 1. Target Understanding | Read code/design, identify scope | Context summary |
| 1.5. Scope Discovery | Build bounded file list | Scoped file list |
| 2. Quality Rating | 7-dimension scoring (0-10) | Scores with reasoning |
| 3. Pros/Cons Analysis | Strengths and weaknesses | Balanced evaluation |
| 4. Alternative Comparison | Score alternatives | Comparison matrix |
| 5. Improvement Suggestions | Actionable recommendations | Prioritized list |
| 6. Effort Estimation | Time and complexity estimates | Effort breakdown |
| 7. Assessment Report | Compile findings | Final report |
Phase 1: Target Understanding
Identify what's being assessed and gather context:
# PARALLEL - Gather context
Read(file_path="$ARGUMENTS") # If file path
Grep(pattern="$ARGUMENTS", output_mode="files_with_matches")
mcp__memory__search_nodes(query="$ARGUMENTS") # Past decisions
Phase 1.5: Scope Discovery
See Scope Discovery for the full file discovery, limit application (MAX 30 files), and sampling priority logic. Always include the scoped file list in every agent prompt.
Phase 2: Quality Rating (7 Dimensions)
Rate each dimension 0-10 with weighted composite score. See Quality Model for dimensions, weights, and grade interpretation. See Scoring Rubric for per-dimension criteria.
See Agent Spawn Definitions for Task Tool mode spawn patterns and Agent Teams alternative.
Composite Score: Weighted average of all 7 dimensions (see quality-model.md).
Phases 3-7: Analysis, Comparison & Report
See Phase Templates for output templates for pros/cons, alternatives, improvements, effort, and the final report.
See also: Alternative Analysis | Improvement Prioritization
Grade Interpretation
See Quality Model for scoring dimensions, weights, and grade interpretation.
Key Decisions
| Decision | Choice | Rationale |
|---|---|---|
| 7 dimensions | Comprehensive coverage | All quality aspects without overwhelming |
| 0-10 scale | Industry standard | Easy to understand and compare |
| Parallel assessment | 4 agents (7 dimensions) | Fast, thorough evaluation |
| Effort/Impact scoring | 1-5 scale | Simple prioritization math |
Rules Quick Reference
| Rule | Impact | What It Covers |
|---|---|---|
| complexity-metrics | HIGH | 7-criterion scoring (1-5), complexity levels, thresholds |
| complexity-breakdown | HIGH | Task decomposition strategies, risk assessment |
Related Skills
- assess-complexity - Task complexity assessment
- ork:verify - Post-implementation verification
- ork:code-review-playbook - Code review patterns
- ork:quality-gates - Quality gate patterns
Version: 1.1.0 (February 2026)
Rules (2)
Decompose complex tasks into isolated subtasks to reduce failure risk and enable parallelism — HIGH
Task Decomposition Strategies
When a task scores Level 4-5 (Complex/Very Complex), decompose it into subtasks that each score Level 1-3.
Decomposition Approach
- Identify independent axes — separate concerns that can be worked on independently
- Isolate unknowns — create a spike/research task for each unknown
- Reduce cross-cutting scope — break into single-module changes
- Sequence by dependency — order subtasks so blocked items come last
Strategies by Complexity Driver
| High Criterion | Decomposition Strategy |
|---|---|
| Lines of Code (4-5) | Split by component or layer |
| Files Affected (4-5) | Split by directory or module |
| Dependencies (4-5) | Isolate external integrations into adapter tasks |
| Unknowns (4-5) | Create spike tasks to resolve unknowns first |
| Cross-Cutting (4-5) | Split by layer (DB, API, UI) or by concern |
| Risk Level (4-5) | Add validation/testing tasks before implementation |
Codebase Analysis for Decomposition
# Gather metrics to inform breakdown
./scripts/analyze-codebase.sh "$TARGET"
# Key signals:
# - File count > 10: split by directory
# - Import count > 5: isolate dependency interfaces
# - Test coverage < 50%: add test-first subtask
Subtask Validation
Each subtask should score:
- Average: 1.0-3.4 (Level 1-3) — manageable
- Unknowns: <= 2 — no major research needed
- Cross-cutting: <= 2 — limited to 2-3 modules
If any subtask still scores Level 4+, decompose it again.
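The validation thresholds above can be sketched as a small check. This is illustrative only; the criterion keys are assumptions, not a fixed schema:

```python
def subtask_ok(scores: dict[str, int]) -> bool:
    """Check a subtask's 7-criterion scores against the validation rules."""
    average = sum(scores.values()) / len(scores)
    return (
        average <= 3.4                            # Level 1-3: manageable
        and scores.get("unknowns", 0) <= 2        # no major research needed
        and scores.get("cross_cutting", 0) <= 2   # limited to 2-3 modules
    )
```

If `subtask_ok` returns False, decompose that subtask again before scheduling it.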
Risk Assessment Integration
| Risk Factor | Mitigation Task |
|---|---|
| No test coverage | Add regression tests first |
| Complex business logic | Write specification/invariant tests |
| External API dependency | Create mock/adapter layer first |
| Database migration | Test migration on staging first |
| Multiple team coordination | Define interface contracts first |
Key Rules
- Level 4-5 tasks are never directly implemented — decompose first
- Every subtask must score Level 3 or below individually
- Resolve unknowns before starting dependent implementation tasks
- Use TaskCreate with addBlockedBy to enforce subtask ordering
- Each subtask should be independently verifiable with its own tests
- Prefer vertical slices (end-to-end for one feature) over horizontal layers
Incorrect — Starting Level 5 task without decomposition:
Task: "Migrate authentication to OAuth2"
Complexity: 4.8 (Level 5)
Action: Start implementing directly
// High failure risk, scope creep likely
Correct — Breaking into Level 1-3 subtasks:
1. Research OAuth2 providers (Level 2, 1-2h)
2. Add OAuth library dependency (Level 1, 30m)
3. Implement OAuth callback handler (Level 2, 2-4h)
4. Migrate existing sessions (Level 3, 4-8h)
5. Add regression tests (Level 2, 2h)
Score task complexity with structured frameworks to prevent scope creep and estimate drift — HIGH
Complexity Scoring Frameworks
Score task complexity across 7 criteria (1-5 each) to determine if a task should proceed or be decomposed first.
The 7 Criteria
| Criterion | 1 (Low) | 3 (Medium) | 5 (High) |
|---|---|---|---|
| Lines of Code | < 50 | 200-500 | 1500+ |
| Time Estimate | < 30 min | 2-8 hours | 24+ hours |
| Files Affected | 1 file | 4-10 files | 26+ files |
| Dependencies | 0 deps | 2-3 deps | 7+ deps |
| Unknowns | None | Several, researchable | Unclear scope |
| Cross-Cutting | Single module | 4-5 modules | System-wide |
| Risk Level | Trivial | Testable complexity | Mission-critical |
Complexity Levels
| Average Score | Level | Classification | Action |
|---|---|---|---|
| 1.0 - 1.4 | 1 | Trivial | Proceed immediately |
| 1.5 - 2.4 | 2 | Simple | Proceed |
| 2.5 - 3.4 | 3 | Moderate | Proceed with caution |
| 3.5 - 4.4 | 4 | Complex | Break down first |
| 4.5 - 5.0 | 5 | Very Complex | Decompose and reassess |
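The average-to-level mapping above can be sketched in Python (an illustrative helper, not part of the skill's API):

```python
def complexity_level(scores: list[int]) -> tuple[float, int, str]:
    """Map the average of the 7 criterion scores (1-5 each) to a complexity level."""
    avg = sum(scores) / len(scores)
    bands = [
        (1.4, 1, "Trivial"), (2.4, 2, "Simple"), (3.4, 3, "Moderate"),
        (4.4, 4, "Complex"), (5.0, 5, "Very Complex"),
    ]
    for upper, level, name in bands:
        if avg <= upper + 1e-9:  # tolerance for float averages
            return round(avg, 1), level, name
    return round(avg, 1), 5, "Very Complex"
```

For the notification example later in this rule, scores of (4, 4, 3, 4, 3, 4, 3) average to 3.6, Level 4 (Complex).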
Output Format
## Complexity Assessment: [Target]
| Criterion | Score |
|-----------|-------|
| Lines of Code | X/5 |
| Time Estimate | X/5 |
| Files Affected | X/5 |
| Dependencies | X/5 |
| Unknowns | X/5 |
| Cross-Cutting | X/5 |
| Risk Level | X/5 |
| **Total** | **XX/35** |
**Average Score:** X.X
**Complexity Level:** X ([Classification])
**Can Proceed:** Yes/No
Key Rules
- Score all 7 criteria even for seemingly simple tasks
- Total of 35 points maximum, divide by 7 for average
- Level 4-5 tasks must be decomposed before starting implementation
- Unknowns (criterion 5) is the highest variance factor — resolve unknowns first
- Cross-cutting (criterion 6) indicates coordination overhead — account for it in estimates
Incorrect — Skipping complexity assessment:
Task: "Add real-time notifications"
Action: Start coding
// No idea of scope, likely 3-5x over estimate
Correct — Scoring all 7 criteria first:
| Criterion | Score |
|-----------|-------|
| Lines of Code | 4/5 (800+ lines) |
| Time Estimate | 4/5 (16+ hours) |
| Files Affected | 3/5 (8 files) |
| Dependencies | 4/5 (WebSockets, Redis) |
| Unknowns | 3/5 (some research needed) |
| Cross-Cutting | 4/5 (DB, API, UI, workers) |
| Risk Level | 3/5 (testable) |
| **Average** | **3.6 (Level 4 - Complex)** |
Action: Decompose into subtasks before starting
References (9)
Agent Spawn Definitions
Agent Spawn Definitions
Dimension-to-agent mapping and spawn patterns for Phase 2.
Task Tool Mode (Default)
For each dimension, spawn a background agent with scope constraints:
for dimension, agent_type in [
("CORRECTNESS + MAINTAINABILITY", "code-quality-reviewer"),
("SECURITY", "security-auditor"),
("PERFORMANCE + SCALABILITY", "python-performance-engineer"), # Use python-performance-engineer for backend; frontend-performance-engineer for frontend
("TESTABILITY", "test-generator"),
]:
Task(subagent_type=agent_type, run_in_background=True, max_turns=25,
prompt=f"""Assess {dimension} (0-10) for: {target}
## Scope Constraint
ONLY read and analyze the following {len(scope_files)} files -- do NOT explore beyond this list:
{file_list}
Budget: Use at most 15 tool calls. Read files from the list above, then produce your score
with reasoning, evidence, and 2-3 specific improvement suggestions.
Do NOT use Glob or Grep to discover additional files.""")
Then collect results from all agents and proceed to Phase 3.
Agent Teams Alternative
See agent-teams-mode.md for Agent Teams assessment workflow with cross-validation and team teardown.
Context Window Note
For full codebase assessments (>20 files), use the 1M context window to avoid agent context exhaustion. On 200K context, the scope discovery in scope-discovery.md limits files to prevent overflow.
Agent Teams Mode
Agent Teams Assessment Mode
In Agent Teams mode, form an assessment team where dimension assessors cross-validate scores and discuss disagreements:
TeamCreate(team_name="assess-{target-slug}", description="Assess {target}")
# SCOPE CONSTRAINT (injected into every agent prompt):
SCOPE_INSTRUCTIONS = f"""
## Scope Constraint
ONLY read and analyze the following {len(scope_files)} files — do NOT explore beyond this list:
{file_list}
Budget: Use at most 15 tool calls. Read files from the list above, then score.
Do NOT use Glob or Grep to discover additional files.
"""
Task(subagent_type="code-quality-reviewer", name="correctness-assessor",
team_name="assess-{target-slug}", max_turns=25,
prompt=f"""Assess CORRECTNESS (0-10) and MAINTAINABILITY (0-10) for: {target}
{SCOPE_INSTRUCTIONS}
When you find issues that affect security, message security-assessor.
When you find issues that affect performance, message perf-assessor.
Share your scores with all teammates for calibration — if scores diverge
significantly (>2 points), discuss the disagreement.""")
Task(subagent_type="security-auditor", name="security-assessor",
team_name="assess-{target-slug}", max_turns=25,
prompt=f"""Assess SECURITY (0-10) for: {target}
{SCOPE_INSTRUCTIONS}
When correctness-assessor flags security-relevant patterns, investigate deeper.
When you find performance-impacting security measures, message perf-assessor.
Share your score and flag any cross-dimension trade-offs.""")
Task(subagent_type="python-performance-engineer", name="perf-assessor", # or frontend-performance-engineer for frontend
team_name="assess-{target-slug}", max_turns=25,
prompt=f"""Assess PERFORMANCE (0-10) and SCALABILITY (0-10) for: {target}
{SCOPE_INSTRUCTIONS}
When security-assessor flags performance trade-offs, evaluate the impact.
When you find testability issues (hard-to-benchmark code), message test-assessor.
Share your scores with reasoning for the composite calculation.""")
Task(subagent_type="test-generator", name="test-assessor",
team_name="assess-{target-slug}", max_turns=25,
prompt=f"""Assess TESTABILITY (0-10) for: {target}
{SCOPE_INSTRUCTIONS}
Evaluate test coverage, test quality, and ease of testing.
When other assessors flag dimension-specific concerns, verify test coverage
for those areas. Share your score and any coverage gaps found.""")
Team teardown after report compilation:
SendMessage(type="shutdown_request", recipient="correctness-assessor", content="Assessment complete")
SendMessage(type="shutdown_request", recipient="security-assessor", content="Assessment complete")
SendMessage(type="shutdown_request", recipient="perf-assessor", content="Assessment complete")
SendMessage(type="shutdown_request", recipient="test-assessor", content="Assessment complete")
TeamDelete()
Fallback — Team Formation Failure: If team formation fails, use standard Phase 2 Task spawns.
Fallback — Context Exhaustion: If agents hit "Context limit reached" before returning scores, collect whatever partial results were produced, then score remaining dimensions yourself using the scoped file list from Phase 1.5. Do NOT re-spawn agents — assess the remaining dimensions inline and proceed to Phase 3.
Alternative Analysis
Alternative Analysis Reference
How to identify, evaluate, and compare alternatives to the current approach.
Identifying Alternatives
- Direct Substitutes: Different implementations of the same solution
- Architectural Alternatives: Different design patterns or approaches
- Technology Alternatives: Different libraries, frameworks, or tools
- Hybrid Approaches: Combinations of multiple alternatives
Comparison Dimensions
| Dimension | Question | Weight |
|---|---|---|
| Score | How does it rate on the 7 dimensions? | 0.30 |
| Effort | How hard to implement/migrate? | 0.25 |
| Risk | What could go wrong? | 0.25 |
| Benefit | What's the expected improvement? | 0.20 |
Migration Effort Scale
| Level | Description | Time Estimate |
|---|---|---|
| 1 | Drop-in replacement | < 1 hour |
| 2 | Minor refactoring | 1-4 hours |
| 3 | Moderate changes | 1-2 days |
| 4 | Significant rework | 3-5 days |
| 5 | Major rewrite | 1+ weeks |
Risk Categories
- Technical: Will it work? Compatibility issues?
- Team: Does team know this? Learning curve?
- Timeline: Can we afford the migration time?
- Dependencies: What else needs to change?
Decision Criteria
Switch if:
- Score improvement >= 1.5 points AND effort <= 3
- Current has critical security/correctness issues
- Alternative has significantly lower maintenance burden
Stay if:
- Score difference < 1.0 point
- Migration effort >= 4 AND no critical issues
- Team familiarity strongly favors current
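The switch/stay criteria can be sketched as a decision helper. The maintenance-burden and team-familiarity criteria are qualitative and omitted here, so treat this as a first-pass filter only:

```python
def should_switch(score_delta: float, migration_effort: int,
                  current_has_critical: bool) -> bool:
    """Apply the quantitative switch/stay criteria (sketch)."""
    if current_has_critical:
        return True   # critical security/correctness issues force a change
    if score_delta >= 1.5 and migration_effort <= 3:
        return True   # big enough gain at acceptable migration cost
    if score_delta < 1.0 or migration_effort >= 4:
        return False  # too small a gain, or too costly to migrate
    return False      # borderline cases default to staying
```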
Trade-off Documentation
## Alternative: [Name]
**Score Delta:** +/-[N.N] points
**Migration Effort:** [1-5]
**Risk Level:** Low/Medium/High
### Why Consider
- [Benefit 1]
- [Benefit 2]
### Why Not
- [Drawback 1]
- [Drawback 2]
### Verdict: [Adopt/Defer/Reject]
Improvement Prioritization
Improvement Prioritization Reference
Systematic approach to ranking improvements by value delivered per effort invested.
Impact/Effort Scoring
Impact Scale (1-5)
| Score | Label | Effect on Quality |
|---|---|---|
| 5 | Critical | Fixes blocker, +2.0+ points |
| 4 | High | Major improvement, +1.0-2.0 |
| 3 | Medium | Notable improvement, +0.5-1.0 |
| 2 | Low | Minor improvement, +0.2-0.5 |
| 1 | Minimal | Cosmetic, +0.1-0.2 |
Effort Scale (1-5)
| Score | Label | Time Required |
|---|---|---|
| 1 | Trivial | < 15 minutes |
| 2 | Easy | 15-60 minutes |
| 3 | Medium | 1-4 hours |
| 4 | Hard | 4-8 hours |
| 5 | Very Hard | 1+ days |
Priority Formula
Priority = Impact / Effort
Higher priority = do first. At equal priority, prefer lower effort.
Improvement Categories
| Category | Impact | Effort | Action |
|---|---|---|---|
| Quick Wins | High (4-5) | Low (1-2) | Do immediately |
| Strategic | High (4-5) | High (4-5) | Plan carefully |
| Fill-ins | Low (1-2) | Low (1-2) | Do when idle |
| Avoid | Low (1-2) | High (4-5) | Skip or defer |
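The priority formula and the tie-break rule can be sketched as a sort. The `impact`/`effort` dict keys are illustrative assumptions:

```python
def prioritize(improvements: list[dict]) -> list[dict]:
    """Rank improvements by Impact/Effort; at equal priority, prefer lower effort."""
    return sorted(
        improvements,
        key=lambda i: (-(i["impact"] / i["effort"]), i["effort"]),
    )
```

Quick Wins (Impact >= 4, Effort <= 2) naturally sort to the top with this key.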
Time Estimation Guidelines
- Add buffer: Estimate x1.5 for unknowns
- Include testing: Add 30% for test updates
- Account for review: Add time for PR process
- Consider dependencies: Chain effects on other work
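A minimal sketch of the buffer guidelines above; the review and dependency adjustments are context-dependent and omitted:

```python
def estimate_hours(base: float, has_unknowns: bool = True) -> float:
    """Apply the x1.5 unknowns buffer and +30% testing overhead (sketch)."""
    est = base * (1.5 if has_unknowns else 1.0)  # buffer for unknowns
    est *= 1.3                                    # include test updates
    return round(est, 1)
```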
Sequencing Dependencies
- Blockers first: Changes that unblock other work
- Foundation changes: Structural changes before features
- Shared code: Common utilities before consumers
- Leaf nodes last: Isolated changes can wait
Quick Reference
Priority 5.0+ = Do NOW (high impact, trivial effort)
Priority 2.0+ = Do soon (good ROI)
Priority 1.0+ = Schedule it
Priority <1.0 = Backlog or skip
Orchestration Mode
<!-- SHARED: keep in sync with ../../../verify/references/orchestration-mode.md -->
Orchestration Mode Selection
Shared logic for choosing between Agent Teams and Task tool orchestration in assess/verify skills.
Environment Check
import os
teams_available = os.environ.get("CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS") is not None
force_task_tool = os.environ.get("ORCHESTKIT_FORCE_TASK_TOOL") == "1"
if force_task_tool or not teams_available:
mode = "task_tool"
else:
# Teams available — use for full multi-dimensional work
mode = "agent_teams" if scope == "full" else "task_tool"
Decision Rules
- CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS set --> Agent Teams mode (for full assessment/verification)
- Flag not set --> Task tool mode (default)
- Quick/single-dimension scope --> Task tool (regardless of flag)
- ORCHESTKIT_FORCE_TASK_TOOL=1 --> Task tool (override)
Agent Teams vs Task Tool
| Aspect | Task Tool (Star) | Agent Teams (Mesh) |
|---|---|---|
| Topology | All agents report to lead | Agents communicate with each other |
| Finding correlation | Lead cross-references after completion | Agents share findings in real-time |
| Cross-domain overlap | Independent scoring | Agents alert each other about overlapping concerns |
| Cost | ~200K tokens | ~500K tokens |
| Best for | Focused/single-dimension work | Full multi-dimensional assessment/verification |
Fallback
If Agent Teams encounters issues mid-execution, fall back to Task tool for remaining work. This is safe because both modes produce the same output format (dimensional scores 0-10).
Context Window Note
For full codebase work (>20 files), use the 1M context window to avoid agent context exhaustion. On 200K context, scope discovery should limit files to prevent overflow.
Phase Templates
Phase Output Templates
Markdown templates for assessment phases 3-7.
Phase 3: Pros/Cons Analysis
## Pros (Strengths)
| # | Strength | Impact | Evidence |
|---|----------|--------|----------|
| 1 | [strength] | High/Med/Low | [example] |
## Cons (Weaknesses)
| # | Weakness | Severity | Evidence |
|---|----------|----------|----------|
| 1 | [weakness] | High/Med/Low | [example] |
**Net Assessment:** [Strengths outweigh / Balanced / Weaknesses dominate]
**Recommended action:** [Keep as-is / Improve / Reconsider / Rewrite]
Phase 4: Alternative Comparison
See alternative-analysis.md for full comparison template.
| Criteria | Current | Alternative A | Alternative B |
|---|---|---|---|
| Composite | [N.N] | [N.N] | [N.N] |
| Migration Effort | N/A | [1-5] | [1-5] |
Phase 5: Improvement Suggestions
See improvement-prioritization.md for effort/impact guidelines.
| Suggestion | Effort (1-5) | Impact (1-5) | Priority (I/E) |
|---|---|---|---|
| [action] | [N] | [N] | [ratio] |
Quick Wins = Effort <= 2 AND Impact >= 4. Always highlight these first.
Phase 6: Effort Estimation
| Timeframe | Tasks | Total |
|---|---|---|
| Quick wins (< 1hr) | [list] | X min |
| Short-term (< 1 day) | [list] | X hrs |
| Medium-term (1-3 days) | [list] | X days |
Phase 7: Assessment Report
See scoring-rubric.md for full report template.
# Assessment Report: $ARGUMENTS
**Overall Score: [N.N]/10** (Grade: [A+/A/B/C/D/F])
**Verdict:** [EXCELLENT | GOOD | ADEQUATE | NEEDS WORK | CRITICAL]
## Answer: Is This Good?
**[YES / MOSTLY / SOMEWHAT / NO]**
[Reasoning]
Quality Model
<!-- SHARED: keep in sync with ../../../verify/references/quality-model.md -->
Quality Model
Canonical scoring reference for assess and verify skills. Defines unified dimensions, weights, grade thresholds, and improvement prioritization.
Scoring Dimensions (7 Unified)
| Dimension | Weight | What It Measures |
|---|---|---|
| Correctness | 0.15 | Does it work correctly? Functional accuracy, edge cases handled |
| Maintainability | 0.15 | Easy to understand and modify? Readability, complexity, patterns |
| Performance | 0.12 | Efficient execution? No bottlenecks, resource usage, latency |
| Security | 0.20 | Follows security best practices? OWASP, secrets, CVEs, input validation |
| Scalability | 0.10 | Handles growth? Load patterns, data volume, horizontal scaling |
| Testability | 0.13 | Easy to test? Coverage, test quality, isolation, mocking |
| Compliance | 0.15 | Meets API and UI contracts? Conditional on scope (see below) |
Total: 1.00
Compliance Dimension — Scope Rules
Compliance weight (0.15) applies differently based on project scope:
| Scope | Compliance Covers |
|---|---|
| Backend-only | API compliance (contracts, schema validation, versioning) |
| Frontend-only | UI compliance (design system, a11y, responsive) |
| Full-stack | API + UI compliance (split evenly: 0.075 each) |
Composite Score
composite = sum(dimension_score * weight for each dimension)
Each dimension is scored 0-10 with decimal precision. Composite is also 0-10.
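With the weights from the dimensions table above, the composite calculation looks like this (a sketch; dimension names are keyed informally):

```python
WEIGHTS = {
    "correctness": 0.15, "maintainability": 0.15, "performance": 0.12,
    "security": 0.20, "scalability": 0.10, "testability": 0.13,
    "compliance": 0.15,
}

def composite(scores: dict[str, float]) -> float:
    """Weighted composite of the 7 dimension scores (each 0-10)."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must total 1.00
    return round(sum(scores[d] * w for d, w in WEIGHTS.items()), 1)
```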
Grade Thresholds
| Score | Grade | Verdict | Action |
|---|---|---|---|
| 9.0-10.0 | A+ | EXCELLENT | Ship it! |
| 8.0-8.9 | A | GOOD | Ready for merge |
| 7.0-7.9 | B | GOOD | Minor improvements optional |
| 6.0-6.9 | C | ADEQUATE | Consider improvements |
| 5.0-5.9 | D | NEEDS WORK | Improvements recommended |
| 0.0-4.9 | F | CRITICAL | Do not merge |
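The threshold table can be sketched as a simple lookup (illustrative):

```python
def grade(score: float) -> tuple[str, str]:
    """Map a composite score (0-10) to its grade and verdict."""
    bands = [
        (9.0, "A+", "EXCELLENT"), (8.0, "A", "GOOD"), (7.0, "B", "GOOD"),
        (6.0, "C", "ADEQUATE"), (5.0, "D", "NEEDS WORK"),
    ]
    for floor, letter, verdict in bands:
        if score >= floor:
            return letter, verdict
    return "F", "CRITICAL"  # anything below 5.0: do not merge
```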
Improvement Prioritization
Effort Scale (1-5)
| Points | Effort | Description |
|---|---|---|
| 1 | Trivial | < 15 minutes, single file change |
| 2 | Low | 15-60 minutes, few files |
| 3 | Medium | 1-4 hours, moderate scope |
| 4 | High | 4-8 hours, significant refactoring |
| 5 | Major | 1+ days, architectural change |
Impact Scale (1-5)
| Points | Impact | Description |
|---|---|---|
| 1 | Minimal | Cosmetic, no functional change |
| 2 | Low | Minor improvement, limited scope |
| 3 | Medium | Noticeable quality improvement |
| 4 | High | Significant quality or security gain |
| 5 | Critical | Blocks shipping or fixes major vulnerability |
Priority Formula
priority = impact / effort
Higher ratio = do first.
Quick Wins
Effort <= 2 AND Impact >= 4
Always highlight quick wins at the top of improvement suggestions. These are high-value changes that can be done fast.
Scope Discovery
Phase 1.5: Scope Discovery (CRITICAL -- prevents context exhaustion)
Before spawning any agents, build a bounded file list. Agents that receive unbounded targets will exhaust their context windows reading the entire codebase.
# 1. Discover target files
if is_file(target):
scope_files = [target]
elif is_directory(target):
scope_files = Glob(f"{target}/**/*.{{py,ts,tsx,js,jsx,go,rs,java}}")
else:
# Concept/topic -- search for relevant files
scope_files = Grep(pattern=target, output_mode="files_with_matches", head_limit=50)
# 2. Apply limits -- MAX 30 files for agent assessment
MAX_FILES = 30
if len(scope_files) > MAX_FILES:
# Tell user about sampling BEFORE truncating, so the count reflects the full target
print(f"Target has {len(scope_files)} files. Sampling {MAX_FILES} representative files.")
# Prioritize: entry points, configs, security-critical, then sample rest
# Skip: test files (except for testability agent), generated files, vendor/
prioritized = prioritize_files(scope_files) # entry points first
scope_files = prioritized[:MAX_FILES]
# 3. Format as file list string for agent prompts
file_list = "\n".join(f"- {f}" for f in scope_files)
Sampling Priorities (when >30 files)
- Entry points (main, index, app, server)
- Config files (settings, env, config)
- Security-sensitive (auth, middleware, api routes)
- Core business logic (services, models, domain)
- Representative samples from remaining directories
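The `prioritize_files` call in the discovery snippet above is otherwise undefined; a minimal sketch implementing these priorities (the regex patterns are illustrative assumptions):

```python
import re

def prioritize_files(files: list[str]) -> list[str]:
    """Order files by the sampling priorities: entry points, configs,
    security-sensitive, core logic, then everything else (sketch)."""
    buckets = [
        r"(main|index|app|server)\.",       # 1. entry points
        r"(settings|env|config)",           # 2. config files
        r"(auth|middleware|routes?|api)",   # 3. security-sensitive
        r"(services?|models?|domain)",      # 4. core business logic
    ]
    def rank(path: str) -> int:
        for i, pattern in enumerate(buckets):
            if re.search(pattern, path):
                return i
        return len(buckets)                 # 5. remaining directories
    return sorted(files, key=rank)          # stable sort keeps input order within a bucket
```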
Scoring Rubric
Scoring Rubric Reference
Detailed scoring guidelines for each quality dimension.
Correctness (Weight: 0.15)
Score 9-10: Excellent
- All functionality works as documented
- All edge cases handled gracefully
- Comprehensive error handling
- No known bugs
- Types are accurate and complete
Score 7-8: Good
- Core functionality works correctly
- Most edge cases handled
- Good error handling
- Minor edge cases might be missing
- Types mostly accurate
Score 5-6: Adequate
- Main happy path works
- Some edge cases unhandled
- Basic error handling
- Known minor bugs exist
- Some type inaccuracies
Score 3-4: Poor
- Functionality partially works
- Many edge cases fail
- Minimal error handling
- Multiple bugs present
- Significant type issues
Score 1-2: Critical
- Core functionality broken
- No edge case handling
- Errors cause crashes
- Critical bugs
- Types unreliable
Score 0: Broken
- Does not function at all
- Cannot be used
Maintainability (Weight: 0.15)
Score 9-10: Excellent
- Crystal clear code, self-documenting
- Perfect naming conventions
- Single responsibility everywhere
- Cyclomatic complexity < 5
- Any developer can understand immediately
Score 7-8: Good
- Clear code with minor clarifications needed
- Good naming, occasional ambiguity
- Mostly single responsibility
- Cyclomatic complexity < 10
- Reasonable onboarding time
Score 5-6: Adequate
- Understandable with effort
- Mixed naming quality
- Some large functions
- Cyclomatic complexity < 15
- Requires context to understand
Score 3-4: Poor
- Difficult to understand
- Poor naming choices
- Multiple responsibilities mixed
- Cyclomatic complexity 15-20
- Requires original author to explain
Score 1-2: Critical
- Incomprehensible
- Meaningless names
- Massive functions
- Cyclomatic complexity > 20
- "Here be dragons"
Score 0: Broken
- Cannot be maintained at all
Performance (Weight: 0.12)
Score 9-10: Excellent
- Optimal algorithm choices
- No unnecessary operations
- Proper caching
- Async where beneficial
- Measured and optimized
Score 7-8: Good
- Good algorithm choices
- Minor inefficiencies
- Some caching
- Async used appropriately
- No major bottlenecks
Score 5-6: Adequate
- Acceptable algorithms
- Some unnecessary operations
- Limited caching
- Missing async opportunities
- Noticeable but tolerable delays
Score 3-4: Poor
- Suboptimal algorithms (O(n^2) in hot paths)
- Many unnecessary operations
- No caching strategy
- Blocking where should be async
- Noticeable performance issues
Score 1-2: Critical
- Wrong algorithm choices
- Excessive operations
- Performance blockers
- User-impacting delays
Score 0: Broken
- Unusable due to performance
Security (Weight: 0.20)
Score 9-10: Excellent
- All OWASP Top 10 addressed
- Input validation everywhere
- Proper authentication/authorization
- Secrets managed correctly
- Security reviewed
Score 7-8: Good
- Most security concerns addressed
- Good input validation
- Proper auth patterns
- No obvious vulnerabilities
- Minor improvements possible
Score 5-6: Adequate
- Basic security in place
- Some validation gaps
- Auth works but could be tighter
- No critical vulnerabilities
- Needs security review
Score 3-4: Poor
- Security gaps present
- Missing input validation
- Auth issues
- Potential vulnerabilities
- Should not be in production
Score 1-2: Critical
- Security vulnerabilities present
- No input validation
- Broken auth
- Active exploit potential
Score 0: Broken
- Actively exploitable
Scalability (Weight: 0.10)
Score 9-10: Excellent
- Horizontally scalable
- Stateless design
- Proper queuing/caching
- Handles 10x growth easily
- Load tested
Score 7-8: Good
- Mostly scalable
- Minimal state
- Some bottlenecks identified
- Handles 5x growth
- Scaling path clear
Score 5-6: Adequate
- Scales with limitations
- Some state management
- Known bottlenecks
- Handles 2x growth
- Scaling requires work
Score 3-4: Poor
- Limited scalability
- Stateful design
- Multiple bottlenecks
- Near capacity
- Scaling is a project
Score 1-2: Critical
- Does not scale
- Single point of failure
- Already at capacity
Score 0: Broken
- Cannot handle current load
Testability (Weight: 0.13)
Score 9-10: Excellent
- >= 90% coverage
- Meaningful assertions
- Edge cases tested
- Fast, deterministic tests
- Easy to add new tests
Score 7-8: Good
- >= 80% coverage
- Good assertions
- Main paths tested
- Mostly fast tests
- Reasonable to add tests
Score 5-6: Adequate
- >= 70% coverage
- Basic assertions
- Happy path tested
- Some slow tests
- Tests can be added
Score 3-4: Poor
- >= 50% coverage
- Weak assertions
- Coverage gaps
- Flaky tests
- Hard to test
Score 1-2: Critical
- <50% coverage
- Minimal assertions
- Critical paths untested
- Many flaky tests
Score 0: Broken
- No tests or tests don't run
Checklists (1)
Assessment Checklist
Assessment Checklist
Pre-completion validation for comprehensive assessments.
Quality Rating
- All 7 dimensions rated (Correctness, Maintainability, Performance, Security, Scalability, Testability, Compliance)
- Each dimension has specific evidence cited
- Composite score calculated with correct weights
- Grade assigned matches score range
Pros/Cons Analysis
- At least 3 pros identified
- At least 3 cons identified
- Pros and cons are balanced (not all positive or negative)
- Each item has impact/severity rating
- Evidence provided for each claim
Alternative Comparison
- At least 2 alternatives considered
- Each alternative scored on same dimensions
- Migration effort estimated (1-5 scale)
- Clear recommendation with rationale
- Trade-offs documented
Improvement Suggestions
- Suggestions prioritized by Impact/Effort
- Quick wins identified (high impact, low effort)
- Effort estimates provided for each
- Expected score improvement stated
- Dependencies between improvements noted
Verdict
- Clear YES/NO/MOSTLY/SOMEWHAT answer
- Reasoning explains the verdict
- Actionable next steps provided
- Strongest and weakest dimensions highlighted
Report Quality
- Executive summary is 2-3 sentences
- All sections completed
- No contradictions between sections
- Evidence is specific (file:line when applicable)