vt-c-skill-eval¶
Run eval test cases against a skill. Supports structural assertions, LLM-as-judge grading, and with/without benchmarking.
Plugin: core-standards
Category: Other
Command: /vt-c-skill-eval
Skill Eval Runner¶
Run standardized test cases against toolkit skills to measure output quality, trigger accuracy, and uplift over baseline.
Invocation¶
/vt-c-skill-eval vt-c-spec-from-requirements # eval one skill
/vt-c-skill-eval --all # eval all skills with test suites
Execution¶
Step 1: Resolve Test Cases¶
- If the argument is a skill name (e.g., `vt-c-spec-from-requirements`):
  - Glob `TOOLKIT_ROOT/tests/evals/{skill-name}/*.yaml`
  - If no test cases are found: display an error and stop
- If the argument is `--all`:
  - Glob `TOOLKIT_ROOT/tests/evals/*/` (excluding `_graders/`)
  - For each directory, glob `*.yaml` files
  - Display: "Found {N} test cases across {M} skills"
- Parse each YAML test case file. Validate required fields:
  - `skill` (string)
  - `query` (string)
  - `expected` (object with at least one assertion)
  - If validation fails: warn and skip the file
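The field validation above can be sketched in Python, assuming each YAML file has already been parsed into a dict (`validate_case` is a hypothetical helper name, not part of the command):

```python
def validate_case(case: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the case is valid."""
    errors = []
    if not isinstance(case.get("skill"), str):
        errors.append("skill: required string")
    if not isinstance(case.get("query"), str):
        errors.append("query: required string")
    expected = case.get("expected")
    # `expected` must be an object carrying at least one assertion.
    if not isinstance(expected, dict) or not expected:
        errors.append("expected: object with at least one assertion")
    return errors

good = {"skill": "vt-c-spec-from-requirements",
        "query": "Write a spec for a login page",
        "expected": {"activates": True}}
bad = {"query": "no skill field", "expected": {}}
print(validate_case(good))  # []
print(validate_case(bad))
```

A file that fails validation is warned about and skipped rather than aborting the whole run.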
Step 2: Load Target Skill Content¶
For each unique skill referenced in test cases:
- Find the skill's SKILL.md:
  - Search `plugins/*/skills/*/SKILL.md` where the frontmatter `name:` matches the skill name
  - If not found: warn "Skill {name} not found in toolkit" and skip its test cases
- Read the full SKILL.md content — this will be injected into the sub-agent prompt for "WITH skill" runs.
Step 3: Execute Test Cases¶
For each test case:
3a. WITH skill run¶
Dispatch a `context: fork` sub-agent with `model: haiku`:
- Prompt: {skill SKILL.md content}\n\n---\n\nUser request:\n{test case query}
- Collect the output text
3b. Structural assertions¶
Check the output against `expected`:
- `activates`: If `true`, check that the output is non-trivial (> 100 chars and not a generic "I can't help" response). If `false`, check the opposite.
- `output_contains`: For each entry, check that it appears in the output. Try a literal substring match first, then fall back to treating the entry as a regex.
- `output_not_contains`: For each entry, check that it does NOT appear in the output.
Record: structural_pass: true/false, structural_details: [list of check results]
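A sketch of these checks in Python; the 100-character threshold and the "can't help" phrase test are the heuristics described above, and `check_structural` is a hypothetical helper name:

```python
import re

def check_structural(output: str, expected: dict) -> tuple[bool, list[str]]:
    """Run the structural assertions from a test case's `expected` block."""
    details = []
    nontrivial = len(output) > 100 and "can't help" not in output.lower()
    if "activates" in expected:
        ok = nontrivial if expected["activates"] else not nontrivial
        details.append(f"activates: {'pass' if ok else 'FAIL'}")
    for needle in expected.get("output_contains", []):
        ok = needle in output  # literal substring first
        if not ok:
            try:
                ok = re.search(needle, output) is not None  # then as a regex
            except re.error:
                pass  # entry is neither present literally nor a valid regex
        details.append(f"output_contains {needle!r}: {'pass' if ok else 'FAIL'}")
    for needle in expected.get("output_not_contains", []):
        ok = needle not in output
        details.append(f"output_not_contains {needle!r}: {'pass' if ok else 'FAIL'}")
    return all(d.endswith("pass") for d in details), details
```

Every check result is kept in `details` so the report can name exactly which assertion failed.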
3c. LLM-as-judge grading (if grading_prompt specified)¶
Dispatch the quality-grader agent (tests/evals/_graders/quality-grader.md) with:
- The output from step 3a
- The grading_prompt from the test case
Parse the returned JSON for score and reasoning.
Record: judge_score, judge_reasoning, judge_pass: score >= min_grade
3d. Benchmark run (if benchmark: true)¶
Dispatch another `context: fork` sub-agent with `model: haiku`:
- Prompt: {test case query} (NO skill content — baseline)
- Run the same structural assertions
- If grading_prompt: run LLM-as-judge on baseline output too
Record: baseline_structural_pass, baseline_judge_score
3e. Determine pass/fail¶
A test case passes if ALL of:
- All structural assertions pass (or none specified)
- LLM judge score >= min_grade (or no grading_prompt specified)
Step 4: Aggregate and Report¶
Group results by skill. For each skill:
- Count: total, passed, failed
- Calculate pass rate
- If any benchmark runs: calculate uplift delta (with-skill vs without-skill)
- List failed test cases with reasons
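The per-skill aggregation, including the uplift delta, could be sketched as follows; the output keys mirror the `.last-run.json` shape below, while the `passed`/`baseline_passed` keys on each result are assumed internal names:

```python
def aggregate(results: list[dict]) -> dict:
    """Summarize one skill's results: counts, pass rate, and benchmark uplift."""
    total = len(results)
    passed = sum(1 for r in results if r["passed"])
    summary = {"total": total, "passed": passed, "failed": total - passed,
               "pass_rate": round(passed / total, 2) if total else 0.0}
    # Only cases that had a benchmark (without-skill) run contribute to uplift.
    bench = [r for r in results if "baseline_passed" in r]
    if bench:
        with_rate = sum(r["passed"] for r in bench) / len(bench)
        without_rate = sum(r["baseline_passed"] for r in bench) / len(bench)
        summary["benchmark"] = {"with_skill": with_rate,
                                "without_skill": without_rate,
                                "uplift": round(with_rate - without_rate, 2)}
    return summary
```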
Display the report (see Report Format below).
Write machine-readable summary to TOOLKIT_ROOT/tests/evals/.last-run.json:
{
"timestamp": "ISO-8601",
"skills": {
"vt-c-spec-from-requirements": {
"total": 5, "passed": 4, "failed": 1,
"pass_rate": 0.8,
"benchmark": {"with_skill": 0.8, "without_skill": 0.4, "uplift": 0.4},
"failures": [{"case": "edge-case-minimal", "reason": "output_contains: 'User Stories' not found"}]
}
},
"overall": {"total": 15, "passed": 13, "failed": 2, "pass_rate": 0.87}
}
Exit description:
- If all passed: "All {N} test cases passed."
- If failures: "FAILURES: {N} of {M} test cases failed. See report above."
Report Format¶
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill Eval Report — {skill_name}
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Test Cases: {total}
Passed: {pass_count} ({pass_pct}%)
Failed: {fail_count}
Failed Cases:
{case_name}: {reason}
LLM Judge Scores:
Average: {avg_score}
Min: {min_score} ({worst_case_name})
Benchmark (with/without skill):
Pass rate: {with_pct}% vs {without_pct}% (Δ {delta}%)
Avg score: {with_score} vs {without_score} (Δ {score_delta})
Exit code: {0 if all passed, 1 if failures}
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
When running --all, display one report block per skill, followed by an overall summary:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Overall Eval Summary
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skills evaluated: {M}
Total test cases: {N}
Overall pass rate: {pct}%
Per-skill results:
vt-c-spec-from-requirements: 5/5 (100%)
vt-c-4-review: 4/5 (80%)
vt-c-2-plan: 5/5 (100%)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━