OrchestKit v7.85.0 — 107 skills, 37 agents, 188 hooks · Claude Code 2.1.132+

Bare Eval

Run isolated eval and grading calls using CC 2.1.81 --bare mode. Constructs claude -p --bare invocations for skill evaluation, trigger testing, and LLM grading without plugin/hook interference. Use when running eval pipelines, grading skill outputs, benchmarking prompt quality, or testing trigger accuracy in isolation.

Reference medium

Auto-activated — this skill loads automatically when Claude detects matching context.

Bare Eval — Isolated Evaluation Calls

Run claude -p --bare for fast, clean eval/grading without plugin overhead.

CC 2.1.81 required. The --bare flag skips hooks, LSP, plugin sync, and skill directory walks.

When to Use

  • Grading skill outputs against assertions
  • Trigger classification (which skill matches a prompt)
  • Description optimization iterations
  • Any scripted -p call that doesn't need plugins

When NOT to Use

  • Testing skill routing (needs --plugin-dir)
  • Testing agent orchestration (needs full plugin context)
  • Interactive sessions

Prerequisites

# --bare requires ANTHROPIC_API_KEY (OAuth/keychain disabled)
export ANTHROPIC_API_KEY="sk-ant-..."

# Verify CC version
claude --version  # Must be >= 2.1.81
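The version gate can be scripted with sort -V. A minimal sketch; the version string is hardcoded here for illustration, and in a real script you would derive it from claude --version output (format assumed):

```shell
# Sketch: gate bare mode on the minimum CC version (2.1.81, per the note above).
min="2.1.81"
ver="2.1.132"   # in practice: parse from `claude --version`

# sort -V orders version strings numerically; if $min sorts first, ver >= min
lowest=$(printf '%s\n%s\n' "$ver" "$min" | sort -V | head -n 1)
if [ "$lowest" = "$min" ]; then
    echo "bare mode available"
else
    echo "CC too old for --bare"
fi
```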

Quick Reference

| Call type | Command pattern |
| --- | --- |
| Grading | claude -p "$prompt" --bare --max-turns 1 --output-format text |
| Trigger | claude -p "$prompt" --bare --json-schema "$schema" --output-format json |
| Streaming grade | claude -p "$prompt" --bare --max-turns 1 --output-format stream-json |
| Optimize | echo "$prompt" \| claude -p --bare --max-turns 1 --output-format text |
| Force-skill | claude -p "$prompt" --bare --print --append-system-prompt "$content" |
| @-file in prompt | claude -p "grade @fixtures/case-1.md against rubric" --bare (CC 2.1.113 Remote Control autocomplete) |

--output-format stream-json

Newline-delimited JSON events (one per token/tool-call) — lets a runner score partial output or abort early on a failing probe without waiting for the full response.

claude -p "$prompt" --bare --max-turns 1 --output-format stream-json \
  | while IFS= read -r line; do
      # line is a single JSON event; inspect $.type == "content_block_delta"
      jq -r 'select(.type == "content_block_delta") | .delta.text' <<< "$line"
    done

Use stream-json over json when:

  • grading long outputs and you want incremental scoring,
  • piping into another CLI step-by-step (e.g. ork:eval-runner),
  • you need per-token timing data alongside the content.
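The early-abort behavior described above can be sketched as follows. The event shape (type == "content_block_delta" with delta.text) matches the loop above; the claude call is replaced by a printf simulation so the block is self-contained, and jq is assumed to be installed:

```shell
# Simulated stream-json events, standing in for:
#   claude -p "$prompt" --bare --max-turns 1 --output-format stream-json
stream_events() {
    printf '%s\n' \
        '{"type":"content_block_delta","delta":{"text":"PASS assertion-1"}}' \
        '{"type":"content_block_delta","delta":{"text":"FAIL assertion-2"}}' \
        '{"type":"content_block_delta","delta":{"text":"never reached"}}'
}

consume_stream() {
    while IFS= read -r line; do
        text=$(jq -r 'select(.type == "content_block_delta") | .delta.text' <<< "$line")
        [[ -z "$text" ]] && continue
        echo "$text"
        # Abort on the first failing probe instead of draining the stream
        [[ "$text" == FAIL* ]] && break
    done
}

RESULT=$(stream_events | consume_stream)
echo "$RESULT"
```

Against a real run, replace stream_events with the claude pipeline; breaking out of the loop lets the runner kill the producer early.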

Invocation Patterns

Load detailed patterns and examples:

Read("${CLAUDE_SKILL_DIR}/references/invocation-patterns.md")

Grading Schemas

JSON schemas for structured eval output:

Read("${CLAUDE_SKILL_DIR}/references/grading-schemas.md")

Pipeline Integration

OrchestKit's eval scripts (npm run eval:skill) auto-detect bare mode:

# eval-common.sh detects ANTHROPIC_API_KEY → sets BARE_MODE=true
# Scripts add --bare to all non-plugin calls automatically

Bare calls: Trigger classification, force-skill, baseline, all grading. Never bare: run_with_skill (needs plugin context for routing tests).

CC 2.1.119: --print honors agent tools: / disallowedTools: (M122)

Before CC 2.1.119, --print mode ran with the full default tool set regardless of the agent's frontmatter tools: and disallowedTools:. Bare-eval grading was effectively ungated — graders could call any tool they wanted, even if the agent definition restricted them.

As of 2.1.119, --print enforces the agent's declared tool surface. Implications for eval design:

| Consequence | Action |
| --- | --- |
| Eval graders that relied on unrestricted tool access may now fail | Audit grader prompts for the tools they actually need; whitelist them explicitly via the agent's tools: frontmatter |
| Eval results match interactive runs | Reproducibility improves: grading reflects what the model can actually do, not what it could do in an unsandboxed --print |
| --agent <name> also honors permissionMode in --print | Permission-gated tools (Bash, Edit) require either permissionMode: acceptEdits or explicit allowlists in the agent definition |
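For reference, an agent definition with a deliberately tight tool surface might look like this. The field names (tools:, permissionMode:) come from the text above; the exact frontmatter schema is defined by Claude Code, so treat this as a sketch:

```yaml
---
name: grader-test
description: Grading-only agent for bare-eval migration testing
tools: Read, Grep            # everything else is denied under 2.1.119+ --print
permissionMode: acceptEdits  # needed only if permission-gated tools are added later
---
```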

Migration test:

# Run an eval against an agent with a deliberately tight tools: list.
# Graders that previously called Read/Bash freely will now fail unless those
# tools are declared on the agent.
claude -p "$prompt" --bare --print --agent grader-test

If the grader fails with a "tool not permitted" error, add the required tool to the agent's tools: frontmatter and re-run.

CC 2.1.121: CLAUDE_CODE_FORK_SUBAGENT=1 for grader determinism (#1545)

Before CC 2.1.121, the env var only worked in interactive sessions. As of 2.1.121, non-interactive paths (claude -p, SDK) honor it too — each grader invocation gets a fresh forked subagent context.

The cross-eval state-leak problem this fixes:

Without forking, sequential claude -p --bare graders inherit harness state:

| Inherited state | Symptom |
| --- | --- |
| memory MCP query cache | Grader sees a stale hit from a previous run; the same fixture grades differently |
| .claude/chain/*.json on disk | Grader for "implement" thinks "explore" already ran (the file is left over from a previous test) |
| ToolSearch deferred-tool cache | First grader's MCP loads bleed into the next grader's tool registry |
| Model picker preference | Grader N inherits --model=opus from grader N-1 |

This produced ~5–10% retry rate and non-reproducible scores — the eval baseline drifted between runs, engineers chased phantom regressions.

Fix: tests/evals/scripts/lib/eval-common.sh exports CLAUDE_CODE_FORK_SUBAGENT=1, so every script that sources it (run-trigger-eval, run-quality-eval, run-agent-eval, optimize-description, etc.) gets forked graders automatically. The CI workflow .github/workflows/orchestkit-eval.yml also sets it at the workflow level. Older CC silently ignores the env var (no-op).

Determinism contract: running the same grader on the same fixture twice in a row produces the same score. Verified by tests/evals/scripts/test-grader-determinism.sh.
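The contract can be spot-checked with a small helper. This is a sketch; the real check lives in tests/evals/scripts/test-grader-determinism.sh and may differ:

```shell
# check_determinism CMD ARGS... : run the grader twice, compare outputs verbatim
check_determinism() {
    local first second
    first=$("$@")
    second=$("$@")
    if [[ "$first" == "$second" ]]; then
        echo "deterministic"
    else
        echo "non-deterministic"
    fi
}

# Real usage would be, e.g.:
#   export CLAUDE_CODE_FORK_SUBAGENT=1
#   check_determinism claude -p "$grading_prompt" --bare --max-turns 1 --output-format text
check_determinism echo "PASS fixture-1"
```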

Performance

| Scenario | Without --bare | With --bare | Savings |
| --- | --- | --- | --- |
| Single grading call | ~3-5s startup | ~0.5-1s | 2-4x |
| Trigger (per prompt) | ~3-5s | ~0.5-1s | 2-4x |
| Full eval (50 calls) | ~150-250s overhead | ~25-50s | 3-5x |

Rules

Read("${CLAUDE_SKILL_DIR}/rules/_sections.md")

Troubleshooting

Read("${CLAUDE_SKILL_DIR}/references/troubleshooting.md")

Related

  • eval:skill npm script — unified skill evaluation runner
  • eval:trigger — trigger accuracy testing
  • eval:quality — A/B quality comparison
  • optimize-description.sh — iterative description improvement
  • Version compatibility: doctor/references/version-compatibility.md

Rules (3)

Use --bare for grading/classification only — MEDIUM

Use --bare for Grading/Classification Only

--bare should only be used for calls that don't need plugin context.

Bare-Safe Calls

  • Assertion grading (batch or per-assertion)
  • Trigger classification (skill matching)
  • Description optimization
  • Force-skill eval (--append-system-prompt)
  • Baseline comparison (no plugin)

Never-Bare Calls

  • run_with_skill — tests plugin routing and skill loading
  • Agent eval generation — tests agent spawning via plugin
  • Any call using --plugin-dir

Incorrect

# BAD: run_with_skill with --bare defeats the purpose of testing plugin routing
build_claude_flags() {
    flags+=(--bare)  # Wrong — this function is called with include_plugin=true
    flags+=(--plugin-dir "$PLUGIN_DIR")
}

Correct

# GOOD: only add --bare when NOT using plugins
build_claude_flags() {
    local include_plugin="$1"
    if [[ "$include_plugin" == "true" ]]; then
        flags+=(--plugin-dir "$PLUGIN_DIR")
    elif [[ "$BARE_MODE" == "true" ]]; then
        flags+=(--bare)
    fi
}

Why

The eval pipeline's value comes from testing skills in their real environment. --bare is for the grading/classification overhead calls, not the core eval itself.

Never combine --bare with --plugin-dir — HIGH

Never Combine --bare with --plugin-dir

--bare explicitly skips plugin sync and skill directory walks. Adding --plugin-dir alongside it creates a silent conflict — the plugin may not load correctly.

Incorrect

# BAD: --bare + --plugin-dir is contradictory
claude -p "$prompt" --bare --plugin-dir plugins/ork --output-format json

Correct

# For plugin-routed tests: no --bare
claude -p "$prompt" --plugin-dir plugins/ork --dangerously-skip-permissions --output-format json

# For grading/classification: --bare, no --plugin-dir
claude -p "$prompt" --bare --max-turns 1 --output-format text

Why

The --bare flag was designed for scripted -p calls that don't need the full Claude Code environment. Using it with --plugin-dir defeats the purpose and may cause unpredictable behavior since hooks and skill discovery are disabled.

Always check ANTHROPIC_API_KEY before using --bare — HIGH

Always Check ANTHROPIC_API_KEY Before --bare

--bare disables OAuth and keychain auth. If ANTHROPIC_API_KEY is not set, the call will fail.

Incorrect

# BAD: no API key check before --bare
claude -p "$prompt" --bare --max-turns 1 --output-format text

Correct

# GOOD: conditional bare mode
BARE_MODE=false
if [[ -n "${ANTHROPIC_API_KEY:-}" ]]; then
    BARE_MODE=true
fi

local -a bare_flag=()
if [[ "$BARE_MODE" == "true" ]]; then bare_flag=(--bare); fi

claude -p "$prompt" "${bare_flag[@]}" --max-turns 1 --output-format text

Why

Users authenticating via claude auth login (OAuth) or macOS keychain won't have ANTHROPIC_API_KEY set. Using --bare unconditionally would break their eval pipeline. The conditional pattern degrades gracefully — eval still works, just slower.


References (3)

Grading Schemas

Grading Schemas

JSON schemas for structured eval output with --json-schema.

Batch Assertion Grading

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "name": { "type": "string", "description": "Assertion name" },
      "verdict": { "enum": ["PASS", "FAIL"], "description": "Whether the output satisfies the assertion" },
      "reason": { "type": "string", "description": "One-line explanation" }
    },
    "required": ["name", "verdict", "reason"]
  }
}
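A grader response can be sanity-checked against this schema shape before verdicts are counted. A sketch with jq (assumed installed), using an inline result for illustration:

```shell
result='[{"name":"has-header","verdict":"PASS","reason":"header present"},
         {"name":"cites-source","verdict":"FAIL","reason":"no citation"}]'

# Every item must carry name/verdict/reason, and verdict must be PASS or FAIL
if jq -e 'type == "array" and all(.[];
            has("name") and has("reason")
            and (.verdict == "PASS" or .verdict == "FAIL"))' \
      <<< "$result" > /dev/null; then
    echo "schema ok"
fi

# Count failures for the pipeline exit code
fails=$(jq '[.[] | select(.verdict == "FAIL")] | length' <<< "$result")
echo "failures: $fails"
```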

Trigger Classification

{
  "type": "object",
  "properties": {
    "skill_name": { "type": "string", "description": "The skill that would be triggered" },
    "confidence": { "type": "number", "minimum": 0, "maximum": 1, "description": "Confidence score 0-1" },
    "reasoning": { "type": "string", "description": "Why this skill matches" }
  },
  "required": ["skill_name", "confidence"]
}

Quality Score

{
  "type": "object",
  "properties": {
    "score": { "type": "integer", "minimum": 0, "maximum": 10 },
    "dimensions": {
      "type": "object",
      "properties": {
        "accuracy": { "type": "integer", "minimum": 0, "maximum": 10 },
        "completeness": { "type": "integer", "minimum": 0, "maximum": 10 },
        "actionability": { "type": "integer", "minimum": 0, "maximum": 10 },
        "format": { "type": "integer", "minimum": 0, "maximum": 10 }
      }
    },
    "verdict": { "enum": ["PASS", "FAIL", "PARTIAL"] },
    "reason": { "type": "string" }
  },
  "required": ["score", "verdict"]
}

Description Quality

For optimize-description.sh iterations:

{
  "type": "object",
  "properties": {
    "improved_description": { "type": "string", "maxLength": 1024 },
    "changes_made": {
      "type": "array",
      "items": { "type": "string" }
    },
    "trigger_keywords_added": {
      "type": "array",
      "items": { "type": "string" }
    }
  },
  "required": ["improved_description"]
}

Usage

# Save schema to file
cat > /tmp/grading-schema.json << 'EOF'
{ ... schema above ... }
EOF

# Use with --json-schema
claude -p "$prompt" --bare --json-schema /tmp/grading-schema.json --output-format json

Invocation Patterns

Bare Eval Invocation Patterns

Detailed patterns for each --bare eval scenario.

1. Batch Grading (Most Common)

Grade all assertions for an output in a single call:

grading_prompt="You are an assertion grader. Grade this output against EACH assertion.
For each, return: {\"name\": \"...\", \"verdict\": \"PASS\"|\"FAIL\", \"reason\": \"...\"}
Return ONLY a valid JSON array.

ASSERTIONS:
$assertions_json

OUTPUT:
$output_text"

result=$(claude -p "$grading_prompt" --bare --max-turns 1 --output-format text)

Fallback: Per-Assertion Grading

If batch fails (malformed JSON), fall back to grading one assertion at a time.

CC 2.1.88 note: The StructuredOutput schema cache bug that caused ~50% failure rate with multiple schemas has been fixed. Schema failures are now exceptional. If you see repeated failures, investigate the prompt/schema rather than assuming cache corruption.

grading_prompt="Grade this output against the assertion.
Output ONLY 'PASS' or 'FAIL' followed by a one-line reason.

ASSERTION: $assertion_check

OUTPUT:
$output_text"

result=$(claude -p "$grading_prompt" --bare --max-turns 1 --output-format text)
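The batch-or-fallback decision can be sketched as follows, including the fence stripping for the common malformed-JSON case. The grader output is simulated with printf so the block is self-contained; jq is assumed installed:

```shell
# Strip ```json fences the model sometimes wraps around the array
strip_fences() {
    sed -e 's/^```json$//' -e 's/^```$//' <<< "$1"
}

# Simulated fenced grader output (printf keeps this block self-contained)
raw=$(printf '%s\n' '```json' '[{"name":"a","verdict":"PASS","reason":"ok"}]' '```')

cleaned=$(strip_fences "$raw")
if jq -e 'type == "array"' <<< "$cleaned" > /dev/null 2>&1; then
    echo "batch ok"
    mode="batch"
else
    echo "falling back to per-assertion grading"
    mode="per-assertion"   # grade one assertion per call, as above
fi
```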

2. Trigger Classification

Test which skill a prompt would match:

classification_prompt="Which skill would be triggered by this user prompt: \"$prompt\"

Available skills:
$SKILLS_CATALOG"

claude -p "$classification_prompt" \
    --bare \
    --system-prompt "$CLASSIFIER_SYSTEM_PROMPT" \
    --output-format json \
    --json-schema "$TRIGGER_SCHEMA" \
    --max-turns 2

Repetition for Confidence

Run multiple reps and check consistency:

declare -A votes
for ((i=1; i<=reps; i++)); do
    result=$(claude -p "$prompt" --bare --json-schema "$schema" --output-format json)
    skill=$(jq -r '.skill_name' <<< "$result")    # structured output assumed at top level
    votes["$skill"]=$(( ${votes["$skill"]:-0} + 1 ))
done
# Consistent when a single skill wins every rep

3. Description Optimization

Iteratively improve a skill description for better trigger accuracy:

improve_prompt="Current description: $current_desc

These prompts SHOULD trigger but DIDN'T:
$failures

Rules:
- Under 200 words
- Include WHAT and WHEN
- Third person
- Specific trigger keywords

Output ONLY the improved description."

new_desc=$(echo "$improve_prompt" | claude -p --bare --max-turns 1 --output-format text)
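A cheap guard keeps a bad optimizer iteration from being accepted. The 200-word cap comes from the rules above; the sample description is illustrative:

```shell
new_desc="Use when grading skill outputs or testing trigger accuracy in isolation."
words=$(wc -w <<< "$new_desc")
words=$((words))   # normalize wc's whitespace padding on macOS

if [[ -z "$new_desc" ]] || (( words > 200 )); then
    echo "rejected ($words words); keeping previous description"
else
    echo "accepted ($words words)"
fi
```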

4. Force-Skill Eval (Isolated Quality)

Test skill content quality without plugin routing — inject SKILL.md body directly:

# Strip YAML frontmatter (macOS-compatible)
skill_content=$(awk 'BEGIN{skip=0} /^---$/{skip++; next} skip>=2{print}' "$SKILL_PATH")

claude -p "$prompt" \
    --bare \
    --print \
    --no-session-persistence \
    --max-budget-usd 0.50 \
    --append-system-prompt "$skill_content"

Key: --print forces text-only output (no tool calls). Combined with --bare, this is the fastest eval mode — pure text generation with skill context.

5. Baseline Comparison

Run without any skill context for A/B comparison:

claude -p "$prompt" \
    --bare \
    --dangerously-skip-permissions \
    --max-turns 3 \
    --output-format json \
    --json-schema "$QUALITY_SCHEMA" \
    --no-session-persistence \
    --max-budget-usd 0.50

CC 2.1.88: The --json-schema flag is now safe to use with multiple schemas in a single session. The StructuredOutput schema cache bug (~50% failure rate) has been fixed. You can safely chain --json-schema calls across batch grading, trigger classification, quality scoring, and baseline comparison.

Flag Compatibility Matrix

| Flag | Compatible with --bare? | Notes |
| --- | --- | --- |
| --max-turns | Yes | Limits turn count |
| --output-format | Yes | json, text, stream-json |
| --json-schema | Yes | Structured output |
| --system-prompt | Yes | Overrides the system prompt |
| --append-system-prompt | Yes | Appends to the system prompt |
| --print | Yes | Text-only, no tools |
| --no-session-persistence | Yes | No session file |
| --max-budget-usd | Yes | Cost cap |
| --model | Yes | Model override |
| --plugin-dir | No | Contradicts --bare |
| --mcp-config | Check | MCP may need network |
| --dangerously-skip-permissions | Yes | Useful for CI |
| --settings | Yes | Custom settings file |

Timeout Wrapper

Use a portable timeout for CI environments:

# eval-common.sh provides run_with_timeout()
run_with_timeout 120 claude -p "$prompt" --bare --max-turns 1 --output-format text
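eval-common.sh's helper isn't reproduced here; a portable sketch with the same shape (GNU timeout when available, a perl alarm fallback for macOS without coreutils) looks like:

```shell
# run_with_timeout SECONDS CMD ARGS... : kill CMD if it outlives SECONDS
run_with_timeout() {
    local secs="$1"; shift
    if command -v timeout > /dev/null 2>&1; then
        timeout "$secs" "$@"
    else
        # perl delivers SIGALRM after $secs to the exec'd command
        perl -e 'alarm shift; exec @ARGV' "$secs" "$@"
    fi
}

run_with_timeout 5 echo "fast call finishes"
```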

Troubleshooting

Bare Eval Troubleshooting

Common Errors

| Error | Cause | Fix |
| --- | --- | --- |
| --bare requires ANTHROPIC_API_KEY | OAuth/keychain not supported in bare mode | Set the ANTHROPIC_API_KEY env var |
| Unknown flag --bare | CC version < 2.1.81 | Upgrade: npm i -g @anthropic-ai/claude-code@latest |
| Empty response | Model returned empty or budget exceeded | Increase --max-budget-usd or check the prompt |
| Malformed JSON from batch grader | Model wrapped JSON in markdown fences | Strip fences: sed 's/^```json//;s/^```//;s/```$//' |
| BARE_MODE=false despite key set | ANTHROPIC_API_KEY is an empty string | Ensure the key has a value: echo $ANTHROPIC_API_KEY |
| Timeout in CI | Model taking too long | Use run_with_timeout with an adequate limit |

Debugging

Check if bare mode is active

# In eval scripts:
echo "BARE_MODE=$BARE_MODE"

# Manual test:
ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" claude -p "Hello" --bare --max-turns 1 --output-format text

Verbose output

# Capture stderr for diagnostics
claude -p "$prompt" --bare --max-turns 1 --output-format text > output.txt 2> stderr.txt
cat stderr.txt

Fallback when --bare unavailable

If CC < 2.1.81, the eval scripts still work — they just don't add --bare:

# eval-common.sh:
BARE_MODE=false  # No ANTHROPIC_API_KEY → no bare mode
# Scripts run normally, just slower (full plugin/hook overhead)

Performance Profiling

Compare eval times with and without bare:

# With bare
time ANTHROPIC_API_KEY="$KEY" claude -p "test" --bare --max-turns 1 --output-format text > /dev/null

# Without bare (full stack)
time claude -p "test" --max-turns 1 --output-format text > /dev/null

Security Notes

  • --bare disables auto-memory — no session learnings stored
  • --bare requires explicit API key — no credential leakage via keychain
  • Eval scripts already run outside Claude Code sessions (unset CLAUDECODE)
  • SKILL_NAME_RE regex prevents path traversal in eval inputs (SEC-001)