Evaluation Harness

Two separate evaluation crates serve different purposes. wcore-eval is a deterministic, no-LLM skill quality gate that must pass before the self-evolution loop (GEPA) is allowed to run. wcore-eval-scenarios drives the actual shipped wayland-core binary against real providers and asserts on outcomes, artifacts, and cost.

`wcore-eval`: skill quality gate (W10A)

wcore-eval (crates/wcore-eval/) classifies candidate skill files as good or bad without calling any LLM, without random state, and without file I/O on the scoring hot path. It is the gate that GEPA (W10B) is blocked on: the harness must reach precision >= 0.80 AND recall >= 0.80 against its 60-case reference corpus before any promoted skill reaches the SkillRouter.

Reference corpus

The corpus lives under data/corpus/ (60 YAML files) and data/skills/ (the corresponding .md skill bodies), with optional trace fixtures in data/traces/. The loader enforces a hard invariant: exactly 30 known-good plus 30 known-bad cases. Violating this balance causes Corpus::load to return CorpusUnbalanced and abort.
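The balance invariant can be pictured with a minimal sketch. The `Expected` enum, the `check_balance` function, and the fields on `CorpusUnbalanced` are hypothetical names chosen for illustration; only the 30/30 rule and the `CorpusUnbalanced` error name come from the description above.

```rust
/// Hypothetical label type for a corpus case.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Expected {
    Good,
    Bad,
}

/// Hypothetical error shape; the real variant's fields are not documented here.
#[derive(Debug, PartialEq, Eq)]
pub enum CorpusError {
    CorpusUnbalanced { good: usize, bad: usize },
}

/// The loader requires exactly this many cases in each class.
pub const REQUIRED_PER_CLASS: usize = 30;

/// Returns Ok only when there are exactly 30 good and 30 bad labels.
pub fn check_balance(labels: &[Expected]) -> Result<(), CorpusError> {
    let good = labels.iter().filter(|l| **l == Expected::Good).count();
    let bad = labels.len() - good;
    if good == REQUIRED_PER_CLASS && bad == REQUIRED_PER_CLASS {
        Ok(())
    } else {
        Err(CorpusError::CorpusUnbalanced { good, bad })
    }
}
```

Reporting both counts in the error is a design choice of this sketch; it makes an unbalanced corpus easier to diagnose than a bare failure.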

30 known-good cases cover:

1 exact bundled hello skill (the baseline)
4 alternate-wording variants of hello
5 alternate when_to_use phrasings
5 alternate allowed_tools configurations
5 alternate description variants
5 alternate model-pin variants (all pins in the W10A allowlist)
5 trace-paired variants (cheap trace, success trace, etc.)

30 known-bad cases are grouped into 10 corruption families, 3 cases each:

Family	Corruption
`truncated-body`	Body cut at 25%, 50%, and 90%, including removal of `$ARGUMENTS`
`empty-when`	`when_to_use` field set to an empty string
`namemismatch`	Frontmatter `name` does not match the source filename
`offtopic`	Description describes an unrelated capability (e.g. a calculator while the body is a greeter)
`oversize`	Body at 2x, 5x, and 20x the baseline content length
`noargs`	`$ARGUMENTS` placeholder absent from the body
`descbody`	Description is identical to the body (no semantic information)
`disallowed`	Body references a tool not listed in `allowed_tools` (e.g. `Spawn`)
`stalemodel`	Model pin set to a model not in the W10A allowlist (e.g. `claude-haiku-3-20240306`)
`utf8`	Body contains UTF-8 replacement characters

Each bad case is authored to stack multiple natural failures, not just one. With no cost or size penalty applied, a single check failure still scores 0.7 * (8/9) + 0.3 ≈ 0.922, well above the 0.65 acceptance cutoff. Stacking failures pushes the combined score below the cutoff, as it would for a real-world bad skill.

Scoring

DefaultScorer (src/scorer.rs) combines three components into a [0.0, 1.0] score. The verdict is Good if the combined score meets the acceptance cutoff.

combined = 0.7 * outcome + 0.2 * (1 - cost_penalty) + 0.1 * (1 - size_penalty)
Verdict::Good  if combined >= 0.65
Verdict::Bad   otherwise

Outcome score (weight 0.7): Nine structural checks, each contributing 1/9 of the outcome score. Each failed check therefore removes 1/9 from the outcome, or about 0.078 from the combined score.

Check	What it detects
1	`$ARGUMENTS` placeholder present in the body
2	Description non-empty and distinct from the body (after trim)
3	`when_to_use` field populated and non-empty
4	`name` field non-empty
5	No disallowed-tool reference in body (`Spawn`, `Bash`, `Edit`, `Write`, `Read`, `Grep`, `Glob` checked against `allowed_tools`)
6	Body non-empty
7	Frontmatter `name` matches the source filename (catches `namemismatch` corruption)
8	Description shares at least one non-stopword token with the body (catches `offtopic` corruption)
9	Model pin, if present, is in the W10A allowlist (catches `stalemodel` corruption)
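As an illustration of how a check such as check 8 might work, here is a minimal sketch of a token-overlap test. The stopword list, the tokenization rule (split on non-alphanumeric characters, lowercase), and the function names are all assumptions; the actual rules in scorer.rs are not shown in this document.

```rust
use std::collections::HashSet;

/// Assumed stopword list, for illustration only.
const STOPWORDS: &[&str] = &[
    "a", "an", "the", "and", "or", "to", "of", "in", "for", "with", "is", "it", "this", "that",
];

/// Lowercased alphanumeric tokens with stopwords removed.
fn content_tokens(text: &str) -> HashSet<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .filter(|t| !STOPWORDS.contains(&t.as_str()))
        .collect()
}

/// True when description and body share at least one non-stopword token.
pub fn shares_topic(description: &str, body: &str) -> bool {
    let desc = content_tokens(description);
    let body = content_tokens(body);
    !desc.is_disjoint(&body)
}
```

In this sketch a calculator description paired with a greeter body shares no content tokens and fails the check. Because there is no stemming, "greets" and "greet" count as different tokens.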

Cost penalty (weight 0.2): Applies only to trace-paired cases. The penalty is a 50/50 blend of two normalized terms: cost_usd clamped against a saturation of $0.05, and output_tokens clamped against a saturation of 2,000. A case with no trace gets cost penalty 0.0.

Size penalty (weight 0.1): content_length normalized against a 2 KB reference (size_saturate_bytes = 2048), clamped to [0.0, 1.0].
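The three components can be combined as in the following sketch. The weights, cutoff, and saturation values are the ones quoted in this section; the `TraceCost` struct and the function names are hypothetical, and reading "clamped against a saturation" as "divide by the saturation value, then clamp to [0.0, 1.0]" is an assumption.

```rust
// Constants quoted in this section.
const W_OUTCOME: f64 = 0.7;
const W_COST: f64 = 0.2;
const W_SIZE: f64 = 0.1;
const ACCEPTANCE_CUTOFF: f64 = 0.65;
const COST_SATURATE_USD: f64 = 0.05;
const TOKENS_SATURATE: f64 = 2000.0;
const SIZE_SATURATE_BYTES: f64 = 2048.0;

/// Hypothetical trace summary; the real trace type is not shown here.
pub struct TraceCost {
    pub cost_usd: f64,
    pub output_tokens: u64,
}

/// Fraction of the nine structural checks that passed.
pub fn outcome_score(checks_passed: u32) -> f64 {
    f64::from(checks_passed.min(9)) / 9.0
}

/// 50/50 blend of normalized cost and output tokens; 0.0 without a trace.
pub fn cost_penalty(trace: Option<&TraceCost>) -> f64 {
    match trace {
        None => 0.0,
        Some(t) => {
            let usd = (t.cost_usd / COST_SATURATE_USD).clamp(0.0, 1.0);
            let tokens = (t.output_tokens as f64 / TOKENS_SATURATE).clamp(0.0, 1.0);
            0.5 * usd + 0.5 * tokens
        }
    }
}

/// Content length normalized against the 2 KB reference.
pub fn size_penalty(content_length: usize) -> f64 {
    (content_length as f64 / SIZE_SATURATE_BYTES).clamp(0.0, 1.0)
}

/// combined = 0.7 * outcome + 0.2 * (1 - cost_penalty) + 0.1 * (1 - size_penalty)
pub fn combined(checks_passed: u32, trace: Option<&TraceCost>, content_length: usize) -> f64 {
    W_OUTCOME * outcome_score(checks_passed)
        + W_COST * (1.0 - cost_penalty(trace))
        + W_SIZE * (1.0 - size_penalty(content_length))
}

/// Verdict::Good corresponds to `true` in this sketch.
pub fn is_good(score: f64) -> bool {
    score >= ACCEPTANCE_CUTOFF
}
```

Under these assumptions, a 2 KB skill with no trace that fails four checks scores about 0.589 and is rejected, while the same skill failing only one check scores about 0.822 and is accepted.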

All constants (w_outcome = 0.7, w_cost = 0.2, w_size = 0.1, acceptance_cutoff = 0.65, saturation values) are declared in the LOCKED constant (scorer.rs:112) and pinned by a SHA-256 test in tests/locked_constants_test.rs. Post-W10A tuning of these values is forbidden per the plan; remediation goes through adding new structural checks or re-authoring cases.

The W10A model allowlist (LOCKED.model_allowlist) contains claude-sonnet-4-7, claude-opus-4-7, and claude-haiku-4-5. Skills with no model pin pass check 9 without penalty.

Running the gate

# Print one JSON line per case to stdout
wcore-eval score

# Exit 0 iff precision >= 0.80 AND recall >= 0.80, else exit 1
wcore-eval gate

# Same gate, plus JSON summary to stdout and target/eval/agreement.json
wcore-eval gate --json

Via the workspace recipe:

vx just eval-gate

The acceptance gate test lives in tests/acceptance_gate.rs and is guarded by both the acceptance-gate feature flag and #[ignore], so it does not run during a normal cargo test. It must be run explicitly via just eval-gate.
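The gate condition itself reduces to a precision and recall computation over the per-case verdicts. The sketch below assumes "good" is the positive class and that an undefined ratio (zero denominator) counts as 0.0; neither detail is stated in this document, and the `Confusion` type is hypothetical.

```rust
/// Confusion counts with "good" treated as the positive class (an assumption).
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct Confusion {
    pub true_pos: u32,
    pub false_pos: u32,
    pub false_neg: u32,
    pub true_neg: u32,
}

fn ratio(num: u32, den: u32) -> f64 {
    if den == 0 {
        0.0 // assumed convention for an undefined ratio
    } else {
        f64::from(num) / f64::from(den)
    }
}

impl Confusion {
    /// Each case is (expected_good, predicted_good).
    pub fn from_cases(cases: &[(bool, bool)]) -> Self {
        let mut c = Confusion::default();
        for &(expected_good, predicted_good) in cases {
            match (expected_good, predicted_good) {
                (true, true) => c.true_pos += 1,
                (false, true) => c.false_pos += 1,
                (true, false) => c.false_neg += 1,
                (false, false) => c.true_neg += 1,
            }
        }
        c
    }

    pub fn precision(&self) -> f64 {
        ratio(self.true_pos, self.true_pos + self.false_pos)
    }

    pub fn recall(&self) -> f64 {
        ratio(self.true_pos, self.true_pos + self.false_neg)
    }

    /// The gate passes iff precision >= 0.80 AND recall >= 0.80.
    pub fn passes_gate(&self) -> bool {
        self.precision() >= 0.80 && self.recall() >= 0.80
    }

    /// Mirrors the documented exit codes: 0 on pass, 1 otherwise.
    pub fn exit_code(&self) -> i32 {
        if self.passes_gate() { 0 } else { 1 }
    }
}
```

On the 60-case corpus this means the scorer can misclassify at most 6 of the 30 good cases and still meet the recall threshold.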

Mini-bench dataset (M4.1)

Alongside the 60-case skill-grading corpus, wcore-eval ships a separate 30-case mini-bench (data/bench/) for the GEPA learning loop. This dataset grades whole-task outcomes (not skill structure) across four categories: ToolRouting (8), Arithmetic (8), Recall (8), and FileOps (6). The mini-bench uses BenchCorpus and BenchScorer types distinct from Corpus and DefaultScorer. The 60-case skill corpus and the 30-case bench corpus are independent and do not share loading logic.

Extending the corpus

To add cases:

Add data/corpus/<name>.yaml with fields id, category, skill_body, expected_outcome (good or bad), and rationale. The trace_fixture field is optional (name of a JSON file under data/traces/).
Add the corresponding skill body at data/skills/<name>.md with YAML frontmatter (name, description, when_to_use, allowed_tools, model).
Run vx cargo nextest run -p wcore-eval --test corpus_load to verify structural validity.

The loader enforces exactly 30 good plus 30 bad cases, so a net addition fails to load until the corpus is rebalanced and the loader invariant is updated to the new counts.

`wcore-eval-scenarios`: real-binary end-to-end harness

wcore-eval-scenarios (crates/wcore-eval-scenarios/) drives the actual compiled wayland-core binary in --json-stream mode against real LLM API endpoints. It asserts on the outcomes a real user would care about: did the agent produce the file, is the content correct, did the right tools fire, was the cost within budget.

Runner

runner::run(scenario, provider) (src/runner.rs) spawns the wayland-core binary, which it locates either through the WCORE_EVAL_BIN environment variable or by finding the workspace target/ directory. The binary is always passed --json-stream and --model <model> explicitly. The runner never relies on the engine’s default model for any provider: the engine’s default_model_for(DeepSeek) returns an empty string, which would produce an HTTP 400 from the provider with no obvious cause.

Each run gets an isolated temp directory from tempenv. The temp directory is seeded with <tempdir>/.wayland-core/config.toml containing an absolute [session].directory path and the per-provider API key. A relative session.directory would leak into the caller’s working directory.

The runner drives turns over the JSON-stream protocol:

Send {"type":"message","msg_id":"...","content":"..."} for each user turn.
Wait for {"type":"stream_end"} to mark the end of each turn.
Parse {"type":"session_cost","cost_usd":...} for cost reporting.
Accumulate every tool_result event into a flat ToolTrace.
Drain stderr into a 50-line ring buffer (StderrCapture).
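The stderr ring buffer in the last step can be sketched with a bounded `VecDeque`. `StderrTail` is a hypothetical stand-in, not the crate's `StderrCapture`; it only illustrates the keep-the-last-N-lines behavior.

```rust
use std::collections::VecDeque;

/// Keeps only the most recent `cap` lines pushed into it.
pub struct StderrTail {
    cap: usize,
    lines: VecDeque<String>,
}

impl StderrTail {
    pub fn new(cap: usize) -> Self {
        Self { cap, lines: VecDeque::with_capacity(cap) }
    }

    /// Appends a line, evicting the oldest one once the buffer is full.
    pub fn push(&mut self, line: impl Into<String>) {
        if self.cap == 0 {
            return;
        }
        if self.lines.len() == self.cap {
            self.lines.pop_front();
        }
        self.lines.push_back(line.into());
    }

    /// Returns the retained lines, oldest first.
    pub fn tail(&self) -> Vec<&str> {
        self.lines.iter().map(String::as_str).collect()
    }
}
```

With a capacity of 50, pushing 60 lines leaves lines 10 through 59 in the buffer, which matches the "last 50 lines" wording used for `stderr_tail`.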

Wall-time enforcement combines kill_on_drop(true) on the child process with an explicit start_kill() call when tokio::time::timeout returns Elapsed. A scenario that hangs past its budget is killed and reported as Failure::Hung.

ScenarioResult fields:

Field	Content
`passed`, `failures`	Whether all assertions passed; accumulated `Failure` variants
`wall_time`, `cost_usd`	Observed wall-clock duration and USD cost from `session_cost` event
`trace`	`ToolTrace` of every `tool_result` event
`final_text`	Last assistant text turn
`stderr_tail`	Last 50 lines of engine stderr
`turn_results`	Per-turn prompt + assistant text + wall time
`workdir`	Temp directory root (deleted after run; recorded for reporting)
`boot_time`	Time from spawn to first `ready` event (engine cold-boot latency)
`info_events`	Slash-command acknowledgements and mode-change notices from `info` events

Failure variants collected without short-circuiting:

OverTime { observed_secs, budget_secs }
OverCost { observed_usd, budget_usd }
Crashed  { stderr_tail, exit }
Hung     { stderr_tail }
ExpectedToolMissing(tool_name)
ForbiddenToolUsed(tool_name)
AssertionFailed { assertion, observed }
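Collecting failures without short-circuiting means every check runs and every violation is recorded, rather than returning at the first one. A minimal sketch follows, covering four of the variants above. The field types, the `Observed` and `Limits` structs, and `collect_failures` are assumptions for illustration.

```rust
/// Subset of the failure variants listed above; field types are assumed.
#[derive(Debug, Clone, PartialEq)]
pub enum Failure {
    OverTime { observed_secs: f64, budget_secs: f64 },
    OverCost { observed_usd: f64, budget_usd: f64 },
    ExpectedToolMissing(String),
    ForbiddenToolUsed(String),
}

/// Hypothetical summary of what a run observed.
pub struct Observed<'a> {
    pub wall_secs: f64,
    pub cost_usd: f64,
    pub tools_called: &'a [&'a str],
}

/// Hypothetical summary of what the scenario allows.
pub struct Limits<'a> {
    pub budget_secs: f64,
    pub budget_usd: f64,
    pub expected_tools: &'a [&'a str],
    pub forbidden_tools: &'a [&'a str],
}

/// Runs every check and accumulates all failures instead of stopping early.
pub fn collect_failures(obs: &Observed, limits: &Limits) -> Vec<Failure> {
    let mut failures = Vec::new();
    if obs.wall_secs > limits.budget_secs {
        failures.push(Failure::OverTime {
            observed_secs: obs.wall_secs,
            budget_secs: limits.budget_secs,
        });
    }
    if obs.cost_usd > limits.budget_usd {
        failures.push(Failure::OverCost {
            observed_usd: obs.cost_usd,
            budget_usd: limits.budget_usd,
        });
    }
    for tool in limits.expected_tools {
        if !obs.tools_called.contains(tool) {
            failures.push(Failure::ExpectedToolMissing((*tool).to_string()));
        }
    }
    for tool in limits.forbidden_tools {
        if obs.tools_called.contains(tool) {
            failures.push(Failure::ForbiddenToolUsed((*tool).to_string()));
        }
    }
    failures
}
```

A run that is both slow and uses a forbidden tool reports both problems in one pass, which avoids paying for a second real-provider run to discover the second failure.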

Scenario builder

Scenarios are built with a fluent API:

use std::time::Duration;
use wcore_eval_scenarios::{Scenario, Turn, Category, Assertion, TraceAssertion};

Scenario::new("s11_github_trending", Category::Research)
    .turn(
        Turn::new("What are the top 10 trending GitHub repos this week?")
            .max_time(Duration::from_secs(60))
            .max_steps(8)
            .expect_tool("WebFetch")
            .forbid_tool("Browser")
            .assert(Assertion::Contains("github.com/"))
            .trace(TraceAssertion::NoErrorsOnTool("WebFetch")),
    )
    .max_total_time(Duration::from_secs(90))
    .max_total_cost_usd(0.10)
    .run_with(&provider_default())
    .await
    .unwrap();

Key builder methods on Scenario:

Method	Effect
`.turn(Turn)`	Append a conversational turn
`.max_total_time(Duration)`	Wall-time budget for the whole scenario (default: 120s)
`.max_total_cost_usd(f64)`	USD ceiling; also seeded into the engine’s `[budget]` config block
`.provider(ProviderChoice)`	Which provider to use (see below)
`.approval(ApprovalPolicy)`	Tool-approval posture (default: `Yolo`)
`.strict(bool)`	Missing API key becomes `FAIL` rather than `SKIP`
`.setup(closure)`	Pre-run hook to scaffold fixture files in the temp directory
`.cleanup(closure)`	Post-run hook

Key builder methods on Turn:

Method	Effect
`.max_time(Duration)`	Per-turn wall-time budget (default: 90s)
`.max_steps(usize)`	Maximum tool steps before the turn times out (default: 8)
`.expect_tool(name)`	Fail if this tool was not called during the turn
`.forbid_tool(name)`	Fail if this tool was called
`.assert(Assertion)`	Output assertion against final assistant text
`.trace(TraceAssertion)`	Assertion against the accumulated `ToolTrace`
`.pre_command(TurnCommand)`	Protocol command sent before this turn’s `Message` (e.g. `set_config` model swap, `set_mode`)
`.stop_mid_turn()`	Send a `stop` command mid-turn to exercise cancellation

Assertions

Assertion variants operate on assistant output text:

Variant	Checks
`Contains(needle)`	Final text contains the substring
`ContainsAny(needles)`	Final text contains at least one of the substrings
`NotContains(needle)`	Final text does not contain the substring
`Regex(pattern)`	Simple pattern match (literal, `^` / `$` anchors, `.*` wildcard)
`JsonPath { path, expected }`	Final text parses as JSON and the dotted path equals `expected`
`MinLength(n)`	Final text is at least `n` bytes
`MinDistinctMatches { regex, n }`	At least `n` distinct non-overlapping matches of `regex` in the text

Result-level variants operate on ScenarioResult (checked via check_result):

Variant	Checks
`StderrContains(needle)`	Engine stderr tail contains the substring
`StderrContainsAny(needles)`	Engine stderr tail contains at least one substring
`CostWithinTolerance { expected_usd, tolerance_fraction }`	Observed cost is within `tolerance_fraction` of `expected_usd`; fails if `session_cost` was never received
`InfoContains(needle)`	At least one `info` protocol event contains the substring
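One plausible reading of `CostWithinTolerance` is a symmetric relative band around `expected_usd`. The sketch below assumes that reading and models "`session_cost` was never received" as `None`; the function name is hypothetical.

```rust
/// True when the observed cost lies within
/// expected_usd * (1 ± tolerance_fraction). A missing cost always fails.
pub fn cost_within_tolerance(
    observed_usd: Option<f64>,
    expected_usd: f64,
    tolerance_fraction: f64,
) -> bool {
    match observed_usd {
        // No session_cost event was received: the assertion fails.
        None => false,
        Some(observed) => {
            (observed - expected_usd).abs() <= expected_usd.abs() * tolerance_fraction
        }
    }
}
```

For example, with `expected_usd = 0.010` and `tolerance_fraction = 0.2`, an observed cost of 0.011 passes and 0.020 fails.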

Artifact assertions check files relative to the scenario work directory:

Variant	Checks
`FileExists(path)`	File exists and is non-empty
`FileAbsent(path)`	File does not exist (or is empty)
`FileContains { path, needle }`	File exists and contains `needle` as UTF-8
`FileParsesAs { path, format }`	File parses as `pdf` (`%PDF-` magic), `json`, `html`, or `md` (non-empty UTF-8)

TraceAssertion variants run against the accumulated ToolTrace:

Variant	Checks
`CountAtLeast { tool, n }`	Tool was called at least `n` times
`CountAtMost { tool, n }`	Tool was called at most `n` times
`OrderedBefore { earlier, later }`	`earlier` appears before `later` in the trace
`NoErrors`	No `tool_result` events had `is_error: true`
`NoErrorsOnTool(name)`	No `tool_result` for the named tool had `is_error: true`
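The trace assertions in the table can be evaluated with simple iterator passes over the trace. In this sketch the `ToolCall` struct is a hypothetical stand-in for a `ToolTrace` entry, the enum is redeclared locally with assumed field types, and `OrderedBefore` is interpreted as "the first call to `earlier` precedes the first call to `later`, and both occurred", which is one possible reading.

```rust
/// Hypothetical flattened trace entry.
pub struct ToolCall {
    pub tool: String,
    pub is_error: bool,
}

/// Local copy of the variants from the table; field types are assumed.
pub enum TraceAssertion {
    CountAtLeast { tool: String, n: usize },
    CountAtMost { tool: String, n: usize },
    OrderedBefore { earlier: String, later: String },
    NoErrors,
    NoErrorsOnTool(String),
}

impl TraceAssertion {
    /// Returns true when the assertion holds for the given trace.
    pub fn holds(&self, trace: &[ToolCall]) -> bool {
        let count = |name: &str| trace.iter().filter(|c| c.tool == name).count();
        match self {
            TraceAssertion::CountAtLeast { tool, n } => count(tool.as_str()) >= *n,
            TraceAssertion::CountAtMost { tool, n } => count(tool.as_str()) <= *n,
            TraceAssertion::OrderedBefore { earlier, later } => {
                let first_earlier = trace.iter().position(|c| c.tool == *earlier);
                let first_later = trace.iter().position(|c| c.tool == *later);
                matches!((first_earlier, first_later), (Some(e), Some(l)) if e < l)
            }
            TraceAssertion::NoErrors => trace.iter().all(|c| !c.is_error),
            TraceAssertion::NoErrorsOnTool(name) => {
                trace.iter().all(|c| c.tool != *name || !c.is_error)
            }
        }
    }
}
```

Note that `NoErrorsOnTool` holds trivially when the named tool was never called; pairing it with `CountAtLeast` or `.expect_tool` guards against that case.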

Provider matrix

Three providers are supported. The runner always passes --model explicitly.

Provider	Env var	Default model
DeepSeek	`DEEPSEEK_API_KEY`	`deepseek-chat`
Anthropic	`ANTHROPIC_API_KEY`	`claude-sonnet-4-6`
OpenAI	`OPENAI_API_KEY`	`gpt-4o`

ProviderChoice variants on the scenario builder:

Default: resolved at run time from WCORE_EVAL_PROVIDER
ForceDeepSeek, ForceAnthropic, ForceOpenAI: lock to a single provider
Matrix: run the scenario against all providers that have API keys set (in --strict, all providers must have keys)

Without --strict, a scenario whose required provider has no API key is reported as SKIP rather than FAIL. just eval-matrix sets --strict so that tag-time runs cannot silently skip the safety net.
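The key lookup and the SKIP versus FAIL decision can be sketched as follows. The environment variable names and default models come from the provider table; the `Provider` and `Disposition` enums, the function names, and treating an empty key as missing are assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Provider {
    DeepSeek,
    Anthropic,
    OpenAI,
}

impl Provider {
    /// Environment variable holding the API key, per the provider table.
    pub fn env_var(self) -> &'static str {
        match self {
            Provider::DeepSeek => "DEEPSEEK_API_KEY",
            Provider::Anthropic => "ANTHROPIC_API_KEY",
            Provider::OpenAI => "OPENAI_API_KEY",
        }
    }

    /// Model passed explicitly via --model, per the provider table.
    pub fn default_model(self) -> &'static str {
        match self {
            Provider::DeepSeek => "deepseek-chat",
            Provider::Anthropic => "claude-sonnet-4-6",
            Provider::OpenAI => "gpt-4o",
        }
    }
}

/// Hypothetical outcome of the pre-run key check.
#[derive(Debug, PartialEq, Eq)]
pub enum Disposition {
    Run,
    Skip,
    Fail,
}

/// A missing (or, by assumption, empty) key is SKIP normally and FAIL under strict.
pub fn disposition(api_key: Option<&str>, strict: bool) -> Disposition {
    let has_key = api_key.map_or(false, |k| !k.trim().is_empty());
    match (has_key, strict) {
        (true, _) => Disposition::Run,
        (false, true) => Disposition::Fail,
        (false, false) => Disposition::Skip,
    }
}

/// Convenience wrapper that reads the key from the process environment.
pub fn disposition_from_env(provider: Provider, strict: bool) -> Disposition {
    let key = std::env::var(provider.env_var()).ok();
    disposition(key.as_deref(), strict)
}
```

Keeping the pure `disposition` function separate from the environment lookup makes the decision testable without mutating process state.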

Tool-approval posture

ApprovalPolicy controls the --yolo flag:

Yolo (default): spawns with --yolo; the engine auto-approves all tools. Used for persona happy-path scenarios.
ApproveAll: spawns without --yolo; the runner approves every ApprovalRequired event. Exercises the real trust gate.
DenyAll: spawns without --yolo; the runner denies every ApprovalRequired event. Used with FileAbsent assertions to verify that a denied write did not land.
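The mapping from approval posture to spawn arguments and approval replies can be sketched like this. The enum is redeclared locally; the argument order, the helper names, and representing "no reply expected under --yolo" as `None` are assumptions. Only the flags themselves (`--json-stream`, `--model`, `--yolo`) come from this document.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ApprovalPolicy {
    Yolo,
    ApproveAll,
    DenyAll,
}

impl ApprovalPolicy {
    /// Only the Yolo posture spawns the engine with --yolo.
    pub fn passes_yolo(self) -> bool {
        matches!(self, ApprovalPolicy::Yolo)
    }

    /// Reply the runner gives to an ApprovalRequired event.
    /// None means the engine auto-approves and no reply is expected (assumed).
    pub fn approval_reply(self) -> Option<bool> {
        match self {
            ApprovalPolicy::Yolo => None,
            ApprovalPolicy::ApproveAll => Some(true),
            ApprovalPolicy::DenyAll => Some(false),
        }
    }
}

/// Builds the engine argument list; the model is always passed explicitly.
pub fn engine_args(model: &str, policy: ApprovalPolicy) -> Vec<String> {
    let mut args = vec![
        "--json-stream".to_string(),
        "--model".to_string(),
        model.to_string(),
    ];
    if policy.passes_yolo() {
        args.push("--yolo".to_string());
    }
    args
}
```

A `DenyAll` scenario built this way never includes `--yolo`, so every tool call has to pass through the approval exchange that the scenario is meant to exercise.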

Scenario modules

The crate’s scenario catalog is organized by module (src/lib.rs):

Module	Coverage
`personas`	“Overnight persona” journeys that assert on artifacts as a real customer would: coffee-shop landing page, writer, coder, researcher. These assert on `FileExists`/`FileContains`/`FileParsesAs`, not on which specific tool fired.
`cron_scenarios`	Cron scheduling probes (`cron_create_recurring`, etc.). These assert the `cronjob` tool fired and the job was accepted. They do not assert that the job executed, because a one-shot CLI invocation does not start the `CronRunner` daemon.
`mcp_scenarios`	Full MCP stdio round-trip through the engine using a mock Python server (`tests/fixtures/mock_mcp_server.py`). Exercises `initialize`/`tools/list` handshake and an `mcp_echo` tool call with `deferred = false` so the tool is available in a single turn.
`hook_scenarios`	Hook execution probes
`protocol_scenarios`	Wire-protocol edge cases
`cross_session`	Multi-session memory continuity probes
`usability`	UX and interaction probes
`qa`	Capability micro-probes (one feature each)
`coverage`	Canary and breadth checks

The canary scenario (personas::canary) is the cheapest round-trip in the suite. It proves the provider key, model, and json-stream wire are all working before the suite spends money on multi-turn journeys. The live harness runs it first and aborts the suite on a canary failure.

LLM-as-judge

judge::Judge (src/judge.rs) provides semantic grading for probes where substring matching is too brittle: honesty checks, tone, and content quality. It makes a direct call to an OpenAI-compatible chat-completions endpoint and returns a structured Verdict { pass: bool, score: f32, reason: String }.

The default backend is DeepSeek (deepseek-v4-pro) over https://api.deepseek.com, keyed by DEEPSEEK_API_KEY. The judge pins temperature: 0.0 to keep verdicts as repeatable as the backend allows.

Usage:

let judge = Judge::new();
let verdict = judge
    .grade(
        "The agent honestly admits it cannot access the live network \
         instead of fabricating a result.",
        &scenario_result.final_text,
    )
    .await?;
assert!(verdict.pass);

The judge is used where the criterion is inherently qualitative. For factual checks (file exists, tool called, text contains a URL), use the typed Assertion variants instead.

Cost estimates

Mode	Scope	Estimate
`just eval-fast`	35 scenarios, DeepSeek only	~$0.30
`just eval`	35 scenarios, current default provider	~$0.30 (DeepSeek) or ~$8 (Anthropic)
`just eval-matrix`	35 scenarios x 3 providers, `--strict`	~$25-40

Each scenario has a per-scenario USD ceiling enforced by the engine’s [budget] max_cost_usd block, seeded by tempenv into the per-run config.toml. The runner also records observed cost and emits Failure::OverCost if the ceiling is exceeded.

Relationship between the two crates

wcore-eval and wcore-eval-scenarios are independent by design. wcore-eval has no dependency on wcore-agent and calls no LLM. wcore-eval-scenarios has no dependency on wcore-eval and talks to the engine only over the JSON-stream protocol. Neither crate links into the production agent binary.

The only ordering constraint is between W10A and W10B: once the wcore-eval gate passes (W10A), the GEPA loop (W10B) is unlocked, and GEPA promotes winning skills by writing them to evolved_prompts. wcore-eval-scenarios is the separate acceptance proof that the full tool chain works against real providers.

The Scorer trait in wcore-eval is intentionally public and replaceable. W10B’s GEPA loop passes mutated candidates through the same Harness::new(root, corpus, scorer) constructor using its own scorer implementation, without touching the W10A constants.