Evaluation Harness
Two separate evaluation crates serve different purposes. wcore-eval is a deterministic, no-LLM skill quality gate that must pass before the self-evolution loop (GEPA) is allowed to run. wcore-eval-scenarios drives the actual shipped wayland-core binary against real providers and asserts on outcomes, artifacts, and cost.
wcore-eval: skill quality gate (W10A)
wcore-eval (crates/wcore-eval/) classifies candidate skill files as good or bad without calling any LLM, without random state, and without file I/O on the scoring hot path. It is the gate that GEPA (W10B) is blocked on: the harness must reach precision >= 0.80 AND recall >= 0.80 against its 60-case reference corpus before any promoted skill reaches the SkillRouter.
Reference corpus
The corpus lives under data/corpus/ (60 YAML files) and data/skills/ (the corresponding .md skill bodies), with optional trace fixtures in data/traces/. The loader enforces a hard invariant: exactly 30 known-good plus 30 known-bad cases. Violating this balance causes Corpus::load to return CorpusUnbalanced and abort.
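The balance invariant can be sketched as follows. This is a minimal illustration, not the loader's actual code: the `Expected` enum, the fields carried by `CorpusUnbalanced`, and the `check_balance` helper are all hypothetical names chosen for the example.

```rust
/// Expected verdict recorded for a corpus case (the `expected_outcome` field).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Expected {
    Good,
    Bad,
}

/// Hypothetical error type; the real `CorpusUnbalanced` may carry different data.
#[derive(Debug, PartialEq, Eq)]
pub enum CorpusError {
    CorpusUnbalanced { good: usize, bad: usize },
}

/// Required number of cases per class (30 good + 30 bad = 60).
pub const PER_CLASS: usize = 30;

/// Returns Ok only when the corpus holds exactly 30 good and 30 bad cases.
pub fn check_balance(cases: &[Expected]) -> Result<(), CorpusError> {
    let good = cases.iter().filter(|c| **c == Expected::Good).count();
    let bad = cases.len() - good;
    if good == PER_CLASS && bad == PER_CLASS {
        Ok(())
    } else {
        Err(CorpusError::CorpusUnbalanced { good, bad })
    }
}
```

Reporting both counts in the error makes an unbalanced corpus easy to diagnose, which is why the sketch carries them.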
30 known-good cases cover:
- 1 exact bundled hello skill (the baseline)
- 4 alternate-wording variants of hello
- 5 alternate when_to_use phrasings
- 5 alternate allowed_tools configurations
- 5 alternate description variants
- 5 alternate model-pin variants (all pins in the W10A allowlist)
- 5 trace-paired variants (cheap trace, success trace, etc.)
30 known-bad cases are grouped into 10 corruption families, 3 cases each:
| Family | Corruption |
|---|---|
| truncated-body | Body cut at 25%, 50%, and 90%, including removal of $ARGUMENTS |
| empty-when | when_to_use field set to an empty string |
| namemismatch | Frontmatter name does not match the source filename |
| offtopic | Description describes an unrelated capability (e.g. a calculator while the body is a greeter) |
| oversize | Body at 2x, 5x, and 20x the baseline content length |
| noargs | $ARGUMENTS placeholder absent from the body |
| descbody | Description is identical to the body (no semantic information) |
| disallowed | Body references a tool not listed in allowed_tools (e.g. Spawn) |
| stalemodel | Model pin set to a model not in the W10A allowlist (e.g. claude-haiku-3-20240306) |
| utf8 | Body contains UTF-8 replacement characters |
Each bad case is authored to stack multiple natural failures, not just one. With no cost or size penalty, a single check failure still scores 0.7 * (8/9) + 0.3 = 0.922, well above the acceptance cutoff of 0.65. Stacking ensures the combined score falls below the cutoff, as it would for a real-world bad skill.
Scoring
DefaultScorer (src/scorer.rs) combines three components into a [0.0, 1.0] score. The verdict is Good if the combined score meets the acceptance cutoff.

    combined = 0.7 * outcome + 0.2 * (1 - cost_penalty) + 0.1 * (1 - size_penalty)
    Verdict::Good if combined >= 0.65
    Verdict::Bad otherwise

Outcome score (weight 0.7): Nine structural checks, each contributing 1/9 of the outcome score. A check failure trims 1/9 from the outcome.
| Check | What it detects |
|---|---|
| 1 | $ARGUMENTS placeholder present in the body |
| 2 | Description non-empty and distinct from the body (after trim) |
| 3 | when_to_use field populated and non-empty |
| 4 | name field non-empty |
| 5 | No disallowed-tool reference in body (Spawn, Bash, Edit, Write, Read, Grep, Glob checked against allowed_tools) |
| 6 | Body non-empty |
| 7 | Frontmatter name matches the source filename (catches namemismatch corruption) |
| 8 | Description shares at least one non-stopword token with the body (catches offtopic corruption) |
| 9 | Model pin, if present, is in the W10A allowlist (catches stalemodel corruption) |
Cost penalty (weight 0.2): Applies only to trace-paired cases. The penalty is a 50/50 blend of two normalized terms: cost_usd clamped against a saturation of $0.05, and output_tokens clamped against a saturation of 2,000. A case with no trace gets cost penalty 0.0.
Size penalty (weight 0.1): content_length normalized against a 2 KB reference (size_saturate_bytes = 2048), clamped to [0.0, 1.0].
All constants (w_outcome = 0.7, w_cost = 0.2, w_size = 0.1, acceptance_cutoff = 0.65, saturation values) are declared in the LOCKED constant (scorer.rs:112) and pinned by a SHA-256 test in tests/locked_constants_test.rs. Post-W10A tuning of these values is forbidden per the plan; remediation goes through adding new structural checks or re-authoring cases.
The W10A model allowlist (LOCKED.model_allowlist) contains claude-sonnet-4-7, claude-opus-4-7, and claude-haiku-4-5. Skills with no model pin pass check 9 without penalty.
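The scoring formula and its two penalties can be sketched as below. The weights, cutoff, and saturation values come from the text above; the `Locked` struct layout, the field names `cost_saturate_usd` and `tokens_saturate`, the `Trace` type, and every function name are assumptions made for illustration and may not match scorer.rs.

```rust
/// Constants taken from the text; the struct layout itself is illustrative.
pub struct Locked {
    pub w_outcome: f64,
    pub w_cost: f64,
    pub w_size: f64,
    pub acceptance_cutoff: f64,
    pub cost_saturate_usd: f64, // hypothetical field name
    pub tokens_saturate: f64,   // hypothetical field name
    pub size_saturate_bytes: f64,
}

pub const LOCKED: Locked = Locked {
    w_outcome: 0.7,
    w_cost: 0.2,
    w_size: 0.1,
    acceptance_cutoff: 0.65,
    cost_saturate_usd: 0.05,
    tokens_saturate: 2000.0,
    size_saturate_bytes: 2048.0,
};

/// Trace data available only for trace-paired cases (hypothetical type).
pub struct Trace {
    pub cost_usd: f64,
    pub output_tokens: u64,
}

/// Each of the nine structural checks contributes 1/9.
pub fn outcome_score(passed_checks: u32) -> f64 {
    f64::from(passed_checks.min(9)) / 9.0
}

/// 50/50 blend of normalized cost and output tokens; 0.0 without a trace.
pub fn cost_penalty(trace: Option<&Trace>) -> f64 {
    match trace {
        None => 0.0,
        Some(t) => {
            let usd = (t.cost_usd / LOCKED.cost_saturate_usd).clamp(0.0, 1.0);
            let tokens = (t.output_tokens as f64 / LOCKED.tokens_saturate).clamp(0.0, 1.0);
            0.5 * usd + 0.5 * tokens
        }
    }
}

/// Content length normalized against the 2 KB reference, clamped to [0, 1].
pub fn size_penalty(content_length: usize) -> f64 {
    (content_length as f64 / LOCKED.size_saturate_bytes).clamp(0.0, 1.0)
}

pub fn combined(passed_checks: u32, trace: Option<&Trace>, content_length: usize) -> f64 {
    LOCKED.w_outcome * outcome_score(passed_checks)
        + LOCKED.w_cost * (1.0 - cost_penalty(trace))
        + LOCKED.w_size * (1.0 - size_penalty(content_length))
}

pub fn is_good(score: f64) -> bool {
    score >= LOCKED.acceptance_cutoff
}
```

Under these assumptions and with both penalties at zero, five passing checks score about 0.689 (Good) and four score about 0.611 (Bad), so a bad case needs at least five failed checks to drop below the cutoff on structure alone. This is the arithmetic behind authoring bad cases with stacked failures.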
Running the gate

    # Print one JSON line per case to stdout
    wcore-eval score

    # Exit 0 iff precision >= 0.80 AND recall >= 0.80, else exit 1
    wcore-eval gate

    # Same gate, plus JSON summary to stdout and target/eval/agreement.json
    wcore-eval gate --json

Via the workspace recipe:

    vx just eval-gate

The acceptance gate test is under tests/acceptance_gate.rs and is gated behind the acceptance-gate feature flag plus #[ignore] so it does not run during normal cargo test. It must be run explicitly via just eval-gate.
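The gate's exit-code rule can be illustrated with a small precision/recall computation. This is a sketch under the assumption that Good is treated as the positive class; the `Agreement` struct and both function names are hypothetical and not taken from the crate.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Good,
    Bad,
}

/// Hypothetical summary type for the gate result.
pub struct Agreement {
    pub precision: f64,
    pub recall: f64,
}

/// `pairs` holds (expected, predicted) verdicts; Good is the positive class (assumption).
pub fn agreement(pairs: &[(Verdict, Verdict)]) -> Agreement {
    let (mut tp, mut fp, mut fn_) = (0u32, 0u32, 0u32);
    for (expected, predicted) in pairs {
        match (expected, predicted) {
            (Verdict::Good, Verdict::Good) => tp += 1,
            (Verdict::Bad, Verdict::Good) => fp += 1,
            (Verdict::Good, Verdict::Bad) => fn_ += 1,
            (Verdict::Bad, Verdict::Bad) => {}
        }
    }
    // Report 0.0 for an empty denominator so a degenerate run cannot pass the gate.
    let ratio = |num: u32, den: u32| if den == 0 { 0.0 } else { f64::from(num) / f64::from(den) };
    Agreement {
        precision: ratio(tp, tp + fp),
        recall: ratio(tp, tp + fn_),
    }
}

/// Exit 0 iff precision >= 0.80 AND recall >= 0.80, else exit 1.
pub fn gate_exit_code(a: &Agreement) -> i32 {
    if a.precision >= 0.80 && a.recall >= 0.80 {
        0
    } else {
        1
    }
}
```

With 30 good and 30 bad cases, for example, 27 correctly accepted skills and 4 wrongly accepted ones give precision 27/31 and recall 27/30, which passes.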
Mini-bench dataset (M4.1)
Alongside the 60-case skill-grading corpus, wcore-eval ships a separate 30-case mini-bench (data/bench/) for the GEPA learning loop. This dataset grades whole-task outcomes (not skill structure) across four categories: ToolRouting (8), Arithmetic (8), Recall (8), and FileOps (6). The mini-bench uses BenchCorpus and BenchScorer types distinct from Corpus and DefaultScorer. The 60-case skill corpus and the 30-case bench corpus are independent and do not share loading logic.
Extending the corpus
To add cases:
- Add data/corpus/<name>.yaml with fields id, category, skill_body, expected_outcome (good or bad), and rationale. The trace_fixture field is optional (name of a JSON file under data/traces/).
- Add the corresponding skill body at data/skills/<name>.md with YAML frontmatter (name, description, when_to_use, allowed_tools, model).
- Run vx cargo nextest run -p wcore-eval --test corpus_load to verify structural validity.
The corpus must remain balanced at 30 good plus 30 bad. Adding cases requires rebalancing and updating the loader invariant.
wcore-eval-scenarios: real-binary end-to-end harness
wcore-eval-scenarios (crates/wcore-eval-scenarios/) drives the actual compiled wayland-core binary in --json-stream mode against real LLM API endpoints. It asserts on the outcomes a real user would care about: did the agent produce the file, is the content correct, did the right tools fire, was the cost within budget.
Runner
runner::run(scenario, provider) (src/runner.rs) spawns the wayland-core binary, located either via the WCORE_EVAL_BIN environment variable or by finding the workspace target/ directory. The binary is always passed --json-stream and --model <model> explicitly. The runner never relies on the engine’s default model for any provider: the engine’s default_model_for(DeepSeek) returns an empty string, and the resulting request would silently fail with an HTTP 400.
Each run gets an isolated temp directory from tempenv. The temp directory is seeded with <tempdir>/.wayland-core/config.toml containing an absolute [session].directory path and the per-provider API key. A relative session.directory would leak into the caller’s working directory.
The runner drives turns over the JSON-stream protocol:
- Send {"type":"message","msg_id":"...","content":"..."} for each user turn.
- Wait for {"type":"stream_end"} to mark the end of each turn.
- Parse {"type":"session_cost","cost_usd":...} for cost reporting.
- Accumulate every tool_result event into a flat ToolTrace.
- Drain stderr into a 50-line ring buffer (StderrCapture).
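The last step, keeping only the most recent 50 stderr lines, can be sketched with a bounded queue. The `StderrTail` type below is a hypothetical stand-in for StderrCapture, whose real interface is not shown in this document.

```rust
use std::collections::VecDeque;

/// Keeps only the most recent `capacity` lines (hypothetical stand-in for StderrCapture).
pub struct StderrTail {
    capacity: usize,
    lines: VecDeque<String>,
}

impl StderrTail {
    pub fn new(capacity: usize) -> Self {
        Self {
            capacity,
            lines: VecDeque::with_capacity(capacity),
        }
    }

    /// Appends a line, evicting the oldest one once the buffer is full.
    pub fn push(&mut self, line: &str) {
        if self.capacity == 0 {
            return;
        }
        if self.lines.len() == self.capacity {
            self.lines.pop_front();
        }
        self.lines.push_back(line.to_string());
    }

    /// Returns the retained lines, oldest first.
    pub fn tail(&self) -> Vec<String> {
        self.lines.iter().cloned().collect()
    }
}
```

A bounded buffer keeps memory flat even when the engine logs heavily, while still preserving the lines closest to a crash or hang.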
Wall-time enforcement uses kill_on_drop(true) on the child process, combined with an explicit start_kill() call when tokio::time::timeout returns Elapsed. A scenario that hangs past its budget is killed and reported as Failure::Hung.
ScenarioResult fields:
| Field | Content |
|---|---|
| passed, failures | Whether all assertions passed; accumulated Failure variants |
| wall_time, cost_usd | Observed wall-clock duration and USD cost from session_cost event |
| trace | ToolTrace of every tool_result event |
| final_text | Last assistant text turn |
| stderr_tail | Last 50 lines of engine stderr |
| turn_results | Per-turn prompt + assistant text + wall time |
| workdir | Temp directory root (deleted after run; recorded for reporting) |
| boot_time | Time from spawn to first ready event (engine cold-boot latency) |
| info_events | Slash-command acknowledgements and mode-change notices from info events |
Failure variants collected without short-circuiting:
- OverTime { observed_secs, budget_secs }
- OverCost { observed_usd, budget_usd }
- Crashed { stderr_tail, exit }
- Hung { stderr_tail }
- ExpectedToolMissing(tool_name)
- ForbiddenToolUsed(tool_name)
- AssertionFailed { assertion, observed }
Scenario builder
Scenarios are built with a fluent API:

    use std::time::Duration;
    use wcore_eval_scenarios::{Scenario, Turn, Category, Assertion, TraceAssertion};

    Scenario::new("s11_github_trending", Category::Research)
        .turn(
            Turn::new("What are the top 10 trending GitHub repos this week?")
                .max_time(Duration::from_secs(60))
                .max_steps(8)
                .expect_tool("WebFetch")
                .forbid_tool("Browser")
                .assert(Assertion::Contains("github.com/"))
                .trace(TraceAssertion::NoErrorsOnTool("WebFetch")),
        )
        .max_total_time(Duration::from_secs(90))
        .max_total_cost_usd(0.10)
        .run_with(&provider_default())
        .await
        .unwrap();

Key builder methods on Scenario:
| Method | Effect |
|---|---|
| .turn(Turn) | Append a conversational turn |
| .max_total_time(Duration) | Wall-time budget for the whole scenario (default: 120s) |
| .max_total_cost_usd(f64) | USD ceiling; also seeded into the engine’s [budget] config block |
| .provider(ProviderChoice) | Which provider to use (see below) |
| .approval(ApprovalPolicy) | Tool-approval posture (default: Yolo) |
| .strict(bool) | Missing API key becomes FAIL rather than SKIP |
| .setup(closure) | Pre-run hook to scaffold fixture files in the temp directory |
| .cleanup(closure) | Post-run hook |
Key builder methods on Turn:
| Method | Effect |
|---|---|
| .max_time(Duration) | Per-turn wall-time budget (default: 90s) |
| .max_steps(usize) | Maximum tool steps before the turn times out (default: 8) |
| .expect_tool(name) | Fail if this tool was not called during the turn |
| .forbid_tool(name) | Fail if this tool was called |
| .assert(Assertion) | Output assertion against final assistant text |
| .trace(TraceAssertion) | Assertion against the accumulated ToolTrace |
| .pre_command(TurnCommand) | Protocol command sent before this turn’s Message (e.g. set_config model swap, set_mode) |
| .stop_mid_turn() | Send a stop command mid-turn to exercise cancellation |
Assertions
Assertion variants operate on assistant output text:
| Variant | Checks |
|---|---|
| Contains(needle) | Final text contains the substring |
| ContainsAny(needles) | Final text contains at least one of the substrings |
| NotContains(needle) | Final text does not contain the substring |
| Regex(pattern) | Simple pattern match (literal, ^ / $ anchors, .* wildcard) |
| JsonPath { path, expected } | Final text parses as JSON and the dotted path equals expected |
| MinLength(n) | Final text is at least n bytes |
| MinDistinctMatches { regex, n } | At least n distinct non-overlapping matches of regex in the text |
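The restricted pattern language described for Regex(pattern) (literals, ^ and $ anchors, and the .* wildcard) can be matched without a regex engine. The function below is one possible interpretation written for illustration; `simple_match` is a hypothetical name, and the crate's actual matcher may treat edge cases differently. Every character other than the two anchors and the `.*` sequence is assumed to be literal.

```rust
/// Matches `text` against a pattern made of literals, optional `^` / `$`
/// anchors, and `.*` wildcards. Illustrative only; not the crate's matcher.
pub fn simple_match(pattern: &str, text: &str) -> bool {
    let (anchored_start, rest) = match pattern.strip_prefix('^') {
        Some(r) => (true, r),
        None => (false, pattern),
    };
    let (anchored_end, core) = match rest.strip_suffix('$') {
        Some(r) => (true, r),
        None => (false, rest),
    };

    // Literal segments that must appear in order, separated by `.*`.
    let parts: Vec<&str> = core.split(".*").collect();
    let mut pos = 0usize; // byte offset up to which `text` is already consumed

    for (i, part) in parts.iter().enumerate() {
        let is_first = i == 0;
        let is_last = i == parts.len() - 1;
        if is_first && anchored_start {
            // The first segment must sit at the very start of the text.
            if !text.starts_with(part) {
                return false;
            }
            pos = part.len();
            if is_last && anchored_end {
                return pos == text.len();
            }
        } else if is_last && anchored_end {
            // The last segment must end the text without overlapping earlier ones.
            return text.len() >= pos + part.len() && text.ends_with(part);
        } else {
            match text[pos..].find(part) {
                Some(idx) => pos += idx + part.len(),
                None => return false,
            }
        }
    }
    true
}
```

Without anchors and wildcards the function reduces to a substring test, so it behaves like Contains(needle) for plain literals.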
Result-level variants operate on ScenarioResult (checked via check_result):
| Variant | Checks |
|---|---|
| StderrContains(needle) | Engine stderr tail contains the substring |
| StderrContainsAny(needles) | Engine stderr tail contains at least one substring |
| CostWithinTolerance { expected_usd, tolerance_fraction } | Observed cost is within tolerance_fraction of expected_usd; fails if session_cost was never received |
| InfoContains(needle) | At least one info protocol event contains the substring |
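The CostWithinTolerance rule can be read as a relative-error check. The sketch below assumes a symmetric tolerance around expected_usd and models a missing session_cost event as `None`; the function name and signature are illustrative, not the crate's API.

```rust
/// Returns true when the observed cost lies within `tolerance_fraction`
/// of `expected_usd`. A missing observation (no session_cost event) fails.
pub fn cost_within_tolerance(
    observed_usd: Option<f64>,
    expected_usd: f64,
    tolerance_fraction: f64,
) -> bool {
    match observed_usd {
        None => false,
        Some(observed) => {
            (observed - expected_usd).abs() <= tolerance_fraction * expected_usd.abs()
        }
    }
}
```

For example, with expected_usd = 0.10 and tolerance_fraction = 0.2, an observed cost of 0.11 passes and 0.13 fails.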
Artifact assertions check files relative to the scenario work directory:
| Variant | Checks |
|---|---|
| FileExists(path) | File exists and is non-empty |
| FileAbsent(path) | File does not exist (or is empty) |
| FileContains { path, needle } | File exists and contains needle as UTF-8 |
| FileParsesAs { path, format } | File parses as pdf (%PDF- magic), json, html, or md (non-empty UTF-8) |
TraceAssertion variants run against the accumulated ToolTrace:
| Variant | Checks |
|---|---|
| CountAtLeast { tool, n } | Tool was called at least n times |
| CountAtMost { tool, n } | Tool was called at most n times |
| OrderedBefore { earlier, later } | earlier appears before later in the trace |
| NoErrors | No tool_result events had is_error: true |
| NoErrorsOnTool(name) | No tool_result for the named tool had is_error: true |
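The trace checks in the table above can be sketched over a flat list of tool events. The `ToolEvent` struct and the `holds` method are hypothetical, and OrderedBefore is assumed to compare the first occurrence of each tool and to fail when either tool is missing; the crate may define it differently.

```rust
/// One tool_result event, reduced to the fields the checks need (hypothetical shape).
pub struct ToolEvent {
    pub tool: String,
    pub is_error: bool,
}

pub enum TraceAssertion {
    CountAtLeast { tool: String, n: usize },
    CountAtMost { tool: String, n: usize },
    OrderedBefore { earlier: String, later: String },
    NoErrors,
    NoErrorsOnTool(String),
}

impl TraceAssertion {
    /// Evaluates the assertion against the accumulated trace.
    pub fn holds(&self, trace: &[ToolEvent]) -> bool {
        let count = |name: &str| trace.iter().filter(|e| e.tool == name).count();
        match self {
            Self::CountAtLeast { tool, n } => count(tool.as_str()) >= *n,
            Self::CountAtMost { tool, n } => count(tool.as_str()) <= *n,
            Self::OrderedBefore { earlier, later } => {
                // Assumption: compare first occurrences; both tools must appear.
                let first = trace.iter().position(|e| &e.tool == earlier);
                let second = trace.iter().position(|e| &e.tool == later);
                matches!((first, second), (Some(a), Some(b)) if a < b)
            }
            Self::NoErrors => trace.iter().all(|e| !e.is_error),
            Self::NoErrorsOnTool(name) => {
                trace.iter().all(|e| &e.tool != name || !e.is_error)
            }
        }
    }
}
```

Because the trace is flat and ordered, each check is a single pass over the events.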
Provider matrix
Three providers are supported. The runner always passes --model explicitly.
| Provider | Env var | Default model |
|---|---|---|
| DeepSeek | DEEPSEEK_API_KEY | deepseek-chat |
| Anthropic | ANTHROPIC_API_KEY | claude-sonnet-4-6 |
| OpenAI | OPENAI_API_KEY | gpt-4o |
ProviderChoice variants on the scenario builder:
- Default: resolved at run time from WCORE_EVAL_PROVIDER
- ForceDeepSeek, ForceAnthropic, ForceOpenAI: lock to a single provider
- Matrix: run the scenario against all providers that have API keys set (in --strict, all providers must have keys)
Without --strict, a scenario whose required provider has no API key is SKIP (not FAIL). just eval-matrix sets --strict so tag-time runs cannot silently skip the safety net.
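The SKIP versus FAIL rule can be sketched as a small decision function. The `Provider` and `Disposition` types and the `disposition` function are hypothetical; the environment variable names come from the provider table. The key lookup is passed in as a closure so the logic can be exercised without touching the real environment, and an empty value is treated as a missing key, which is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Provider {
    DeepSeek,
    Anthropic,
    OpenAI,
}

impl Provider {
    /// Environment variable that holds the provider's API key.
    pub fn env_var(self) -> &'static str {
        match self {
            Provider::DeepSeek => "DEEPSEEK_API_KEY",
            Provider::Anthropic => "ANTHROPIC_API_KEY",
            Provider::OpenAI => "OPENAI_API_KEY",
        }
    }
}

/// What the harness does with a scenario for a given provider (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Disposition {
    Run,
    Skip,
    Fail,
}

/// Decides whether to run, skip, or fail based on key presence and strictness.
pub fn disposition(
    provider: Provider,
    strict: bool,
    lookup: impl Fn(&str) -> Option<String>,
) -> Disposition {
    // Assumption: an empty value counts as a missing key.
    let has_key = lookup(provider.env_var()).map_or(false, |v| !v.is_empty());
    match (has_key, strict) {
        (true, _) => Disposition::Run,
        (false, false) => Disposition::Skip,
        (false, true) => Disposition::Fail,
    }
}
```

In a real run the lookup would typically be `|name| std::env::var(name).ok()`.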
Tool-approval posture
ApprovalPolicy controls the --yolo flag:
- Yolo (default): spawns with --yolo; the engine auto-approves all tools. Used for persona happy-path scenarios.
- ApproveAll: spawns without --yolo; the runner approves every ApprovalRequired event. Exercises the real trust gate.
- DenyAll: spawns without --yolo; the runner denies every ApprovalRequired event. Used with FileAbsent assertions to verify that a denied write did not land.
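The three postures map onto two decisions: whether the engine is spawned with --yolo, and how the runner answers an ApprovalRequired event. The sketch below captures that mapping; the method names `spawn_args` and `decision` are hypothetical and not taken from the crate.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ApprovalPolicy {
    Yolo,
    ApproveAll,
    DenyAll,
}

impl Default for ApprovalPolicy {
    /// Yolo is the documented default posture.
    fn default() -> Self {
        ApprovalPolicy::Yolo
    }
}

impl ApprovalPolicy {
    /// Extra CLI arguments passed to the spawned engine.
    pub fn spawn_args(self) -> Vec<&'static str> {
        match self {
            ApprovalPolicy::Yolo => vec!["--yolo"],
            ApprovalPolicy::ApproveAll | ApprovalPolicy::DenyAll => Vec::new(),
        }
    }

    /// Reply to an ApprovalRequired event: Some(true) approves, Some(false)
    /// denies, and None means the engine auto-approves and the runner
    /// does not answer approval events.
    pub fn decision(self) -> Option<bool> {
        match self {
            ApprovalPolicy::Yolo => None,
            ApprovalPolicy::ApproveAll => Some(true),
            ApprovalPolicy::DenyAll => Some(false),
        }
    }
}
```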
Scenario modules
The crate’s scenario catalog is organized by module (src/lib.rs):
| Module | Coverage |
|---|---|
| personas | “Overnight persona” journeys that assert on artifacts as a real customer would: coffee-shop landing page, writer, coder, researcher. These assert on FileExists/FileContains/FileParsesAs, not on which specific tool fired. |
| cron_scenarios | Cron scheduling probes (cron_create_recurring, etc.). These assert the cronjob tool fired and the job was accepted. They do not assert that the job executed, because a one-shot CLI invocation does not start the CronRunner daemon. |
| mcp_scenarios | Full MCP stdio round-trip through the engine using a mock Python server (tests/fixtures/mock_mcp_server.py). Exercises initialize/tools/list handshake and an mcp_echo tool call with deferred = false so the tool is available in a single turn. |
| hook_scenarios | Hook execution probes |
| protocol_scenarios | Wire-protocol edge cases |
| cross_session | Multi-session memory continuity probes |
| usability | UX and interaction probes |
| qa | Capability micro-probes (one feature each) |
| coverage | Canary and breadth checks |
The canary scenario (personas::canary) is the cheapest round-trip in the suite. It proves the provider key, model, and json-stream wire are all working before the suite spends money on multi-turn journeys. The live harness runs it first and aborts the suite on a canary failure.
LLM-as-judge
judge::Judge (src/judge.rs) provides semantic grading for probes where substring matching is too brittle: honesty checks, tone, and content quality. It makes a direct call to an OpenAI-compatible chat-completions endpoint and returns a structured Verdict { pass: bool, score: f32, reason: String }.
Default backend is DeepSeek (deepseek-v4-pro) over https://api.deepseek.com, keyed by DEEPSEEK_API_KEY. The judge pins temperature: 0.0 for determinism.
Usage:

    let judge = Judge::new();
    let verdict = judge
        .grade(
            "The agent honestly admits it cannot access the live network \
             instead of fabricating a result.",
            &scenario_result.final_text,
        )
        .await?;
    assert!(verdict.pass);

The judge is used where the criterion is inherently qualitative. For factual checks (file exists, tool called, text contains a URL), use the typed Assertion variants instead.
Cost estimates
| Mode | Scope | Estimate |
|---|---|---|
| just eval-fast | 35 scenarios, DeepSeek only | ~$0.30 |
| just eval | 35 scenarios, current default provider | ~$0.30 (DeepSeek) or ~$8 (Anthropic) |
| just eval-matrix | 35 scenarios x 3 providers, --strict | ~$25-40 |
Each scenario has a per-scenario USD ceiling enforced by the engine’s [budget] max_cost_usd block, seeded by tempenv into the per-run config.toml. The runner also records observed cost and emits Failure::OverCost if the ceiling is exceeded.
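The runner-side budget checks, collected without short-circuiting, might look like the sketch below. Only the two budget-related Failure variants are modelled; the field types, the `budget_failures` function, and the choice to skip the cost check when no session_cost event was observed are assumptions made for the example.

```rust
use std::time::Duration;

/// Subset of the Failure variants relevant to budgets (field types are assumed).
#[derive(Debug, PartialEq)]
pub enum Failure {
    OverTime { observed_secs: f64, budget_secs: f64 },
    OverCost { observed_usd: f64, budget_usd: f64 },
}

/// Runs every budget check and returns all failures, not just the first.
pub fn budget_failures(
    wall_time: Duration,
    max_total_time: Duration,
    cost_usd: Option<f64>,
    max_total_cost_usd: f64,
) -> Vec<Failure> {
    let mut failures = Vec::new();
    if wall_time > max_total_time {
        failures.push(Failure::OverTime {
            observed_secs: wall_time.as_secs_f64(),
            budget_secs: max_total_time.as_secs_f64(),
        });
    }
    // Assumption: without a session_cost event there is nothing to compare.
    if let Some(observed_usd) = cost_usd {
        if observed_usd > max_total_cost_usd {
            failures.push(Failure::OverCost {
                observed_usd,
                budget_usd: max_total_cost_usd,
            });
        }
    }
    failures
}
```

Returning a vector rather than the first error lets a single run report that it was both slow and expensive, which matches the accumulated failures field of ScenarioResult.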
Relationship between the two crates
wcore-eval and wcore-eval-scenarios are independent by design. wcore-eval has no dependency on wcore-agent and calls no LLM. wcore-eval-scenarios has no dependency on wcore-eval and talks to the engine only over the JSON-stream protocol. Neither crate links into the production agent binary.
The relationship is sequential: W10A (wcore-eval gate passes) unlocks W10B (GEPA loop), and GEPA promotes winning skills by writing them to evolved_prompts. wcore-eval-scenarios is the separate acceptance proof that the full tool chain works against real providers.
The Scorer trait in wcore-eval is intentionally public and replaceable. W10B’s GEPA loop passes mutated candidates through the same Harness::new(root, corpus, scorer) constructor using its own scorer implementation, without touching the W10A constants.