Evaluation Harness

Two separate evaluation crates serve different purposes. wcore-eval is a deterministic, no-LLM skill quality gate that must pass before the self-evolution loop (GEPA) is allowed to run. wcore-eval-scenarios drives the actual shipped wayland-core binary against real providers and asserts on outcomes, artifacts, and cost.

wcore-eval (crates/wcore-eval/) classifies candidate skill files as good or bad without calling any LLM, without random state, and without file I/O on the scoring hot path. It is the gate that GEPA (W10B) is blocked on: the harness must reach precision >= 0.80 AND recall >= 0.80 against its 60-case reference corpus before any promoted skill reaches the SkillRouter.

The corpus lives under data/corpus/ (60 YAML files) and data/skills/ (the corresponding .md skill bodies), with optional trace fixtures in data/traces/. The loader enforces a hard invariant: exactly 30 known-good plus 30 known-bad cases. Violating this balance causes Corpus::load to return CorpusUnbalanced and abort.
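
The balance invariant can be pictured as a simple count over the expected labels. The sketch below is a minimal illustration, not the loader's actual code: the `Expected` enum, the `check_balance` function, and the fields carried by `CorpusUnbalanced` are assumptions made for the example; only the variant name and the 30/30 rule come from the text above.

```rust
/// Expected label of a corpus case (hypothetical stand-in for the crate's own type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Expected {
    Good,
    Bad,
}

/// Hypothetical error shape; the text only names the `CorpusUnbalanced` variant.
#[derive(Debug, PartialEq, Eq)]
enum CorpusError {
    CorpusUnbalanced { good: usize, bad: usize },
}

/// The invariant from the text: exactly 30 known-good and 30 known-bad cases.
const REQUIRED_PER_CLASS: usize = 30;

/// Count each class and reject any corpus that is not exactly 30/30.
fn check_balance(labels: &[Expected]) -> Result<(), CorpusError> {
    let good = labels.iter().filter(|l| **l == Expected::Good).count();
    let bad = labels.len() - good;
    if good == REQUIRED_PER_CLASS && bad == REQUIRED_PER_CLASS {
        Ok(())
    } else {
        Err(CorpusError::CorpusUnbalanced { good, bad })
    }
}

fn main() {
    let mut labels = vec![Expected::Good; 30];
    labels.extend(vec![Expected::Bad; 30]);
    println!("balanced corpus: {:?}", check_balance(&labels));

    // Dropping one bad case breaks the invariant.
    labels.pop();
    println!("after removing a case: {:?}", check_balance(&labels));
}
```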

30 known-good cases cover:

  • 1 exact bundled hello skill (the baseline)
  • 4 alternate-wording variants of hello
  • 5 alternate when_to_use phrasings
  • 5 alternate allowed_tools configurations
  • 5 alternate description variants
  • 5 alternate model-pin variants (all pins in the W10A allowlist)
  • 5 trace-paired variants (cheap trace, success trace, etc.)

30 known-bad cases are grouped into 10 corruption families, 3 cases each:

| Family | Corruption |
|---|---|
| truncated-body | Body cut at 25%, 50%, and 90%, including removal of $ARGUMENTS |
| empty-when | when_to_use field set to an empty string |
| namemismatch | Frontmatter name does not match the source filename |
| offtopic | Description describes an unrelated capability (e.g. a calculator while the body is a greeter) |
| oversize | Body at 2x, 5x, and 20x the baseline content length |
| noargs | $ARGUMENTS placeholder absent from the body |
| descbody | Description is identical to the body (no semantic information) |
| disallowed | Body references a tool not listed in allowed_tools (e.g. Spawn) |
| stalemodel | Model pin set to a model not in the W10A allowlist (e.g. claude-haiku-3-20240306) |
| utf8 | Body contains UTF-8 replacement characters |

Each bad case is authored to stack multiple natural failures, not just one. With no cost or size penalty, a single check failure scores 0.7 * (8/9) + 0.3 = 0.922, still well above the 0.65 acceptance cutoff. Stacking ensures the combined score falls below the cutoff, as it would for a real-world bad skill.
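
The arithmetic behind that figure can be written out for a case with k failed checks. This is a worked derivation under one stated assumption: the cost and size penalties are both zero, so the two penalty terms contribute their full 0.2 and 0.1. The weights and the 0.65 cutoff are the DefaultScorer values documented in this section.

```latex
\[
\mathrm{combined}(k) \;=\; 0.7\cdot\frac{9-k}{9} \;+\; 0.2\,(1-0) \;+\; 0.1\,(1-0)
\;=\; 0.7\cdot\frac{9-k}{9} + 0.3
\]
\[
\mathrm{combined}(1) \;=\; 0.7\cdot\frac{8}{9} + 0.3 \;\approx\; 0.922 \;\ge\; 0.65
\]
\[
\mathrm{combined}(k) < 0.65
\;\iff\; \frac{9-k}{9} < 0.5
\;\iff\; k > 4.5
\;\iff\; k \ge 5
\]
```

Under that assumption a case needs at least five failing checks to be rejected on structure alone; a nonzero cost or size penalty lowers that threshold.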

DefaultScorer (src/scorer.rs) combines three components into a [0.0, 1.0] score. The verdict is Good if the combined score meets the acceptance cutoff.

combined = 0.7 * outcome + 0.2 * (1 - cost_penalty) + 0.1 * (1 - size_penalty)
Verdict::Good if combined >= 0.65
Verdict::Bad otherwise
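
As a minimal sketch of the formula above, the following computes the combined score and the verdict from the three components. The function names and the standalone constants are illustrative; in the crate the values live in the LOCKED constant, whose exact shape is not shown here.

```rust
// Weights and cutoff as documented for DefaultScorer.
const W_OUTCOME: f64 = 0.7;
const W_COST: f64 = 0.2;
const W_SIZE: f64 = 0.1;
const ACCEPTANCE_CUTOFF: f64 = 0.65;

#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    Good,
    Bad,
}

/// All three inputs are expected to be in [0.0, 1.0].
fn combined(outcome: f64, cost_penalty: f64, size_penalty: f64) -> f64 {
    W_OUTCOME * outcome + W_COST * (1.0 - cost_penalty) + W_SIZE * (1.0 - size_penalty)
}

fn verdict(score: f64) -> Verdict {
    if score >= ACCEPTANCE_CUTOFF {
        Verdict::Good
    } else {
        Verdict::Bad
    }
}

fn main() {
    // One failed check, no penalties: still accepted.
    let one_miss = combined(8.0 / 9.0, 0.0, 0.0);
    println!("one miss: {:.3} -> {:?}", one_miss, verdict(one_miss));

    // Five failed checks plus a saturated size penalty: rejected.
    let stacked = combined(4.0 / 9.0, 0.0, 1.0);
    println!("stacked:  {:.3} -> {:?}", stacked, verdict(stacked));
}
```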

Outcome score (weight 0.7): Nine structural checks, each contributing 1/9 of the outcome score, so every failed check lowers the outcome by 1/9.

| Check | What it detects |
|---|---|
| 1 | $ARGUMENTS placeholder present in the body |
| 2 | Description non-empty and distinct from the body (after trim) |
| 3 | when_to_use field populated and non-empty |
| 4 | name field non-empty |
| 5 | No disallowed-tool reference in body (Spawn, Bash, Edit, Write, Read, Grep, Glob checked against allowed_tools) |
| 6 | Body non-empty |
| 7 | Frontmatter name matches the source filename (catches namemismatch corruption) |
| 8 | Description shares at least one non-stopword token with the body (catches offtopic corruption) |
| 9 | Model pin, if present, is in the W10A allowlist (catches stalemodel corruption) |
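
To make two of these checks concrete, here is a rough sketch of check 5 (disallowed-tool reference) and check 8 (shared non-stopword token). Several details are assumptions, because the text does not specify them: whole-word, case-sensitive matching of tool names, the tokenization rule, and the short stopword list are all placeholders for whatever the real scorer does.

```rust
use std::collections::HashSet;

/// Tool names listed for check 5.
const KNOWN_TOOLS: [&str; 7] = ["Spawn", "Bash", "Edit", "Write", "Read", "Grep", "Glob"];

/// Assumed stopword list; the real list is not given in the text.
const STOPWORDS: [&str; 8] = ["a", "an", "the", "to", "of", "and", "is", "in"];

/// Split on anything that is not alphanumeric (assumed tokenization).
fn words(text: &str) -> impl Iterator<Item = &str> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
}

/// Check 5 sketch: every known tool mentioned in the body must be in allowed_tools.
fn no_disallowed_tool(body: &str, allowed_tools: &[&str]) -> bool {
    let mentioned: HashSet<&str> = words(body).collect();
    KNOWN_TOOLS
        .iter()
        .all(|tool| !mentioned.contains(tool) || allowed_tools.contains(tool))
}

/// Lowercased, stopword-free token set.
fn tokens(text: &str) -> HashSet<String> {
    words(text)
        .map(|w| w.to_lowercase())
        .filter(|w| !STOPWORDS.contains(&w.as_str()))
        .collect()
}

/// Check 8 sketch: description and body share at least one non-stopword token.
fn description_on_topic(description: &str, body: &str) -> bool {
    !tokens(description).is_disjoint(&tokens(body))
}

/// Each of the nine checks contributes 1/9 of the outcome score.
fn outcome_score(passed: usize) -> f64 {
    passed as f64 / 9.0
}

fn main() {
    let body = "Greet the user by name: $ARGUMENTS";
    println!("on topic:  {}", description_on_topic("Greet a user", body));
    println!("off topic: {}", description_on_topic("A calculator", body));
    println!("tool ok:   {}", no_disallowed_tool("Use Spawn to start a helper", &["Read"]));
    println!("outcome with 8 of 9 passing: {:.3}", outcome_score(8));
}
```

Whole-word matching is a deliberately naive choice: a body that uses "Read" as an ordinary English verb would trip this version of the check.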

Cost penalty (weight 0.2): Applies only to trace-paired cases. The penalty is a 50/50 blend of two normalized terms: cost_usd clamped against a saturation of $0.05, and output_tokens clamped against a saturation of 2,000. A case with no trace gets cost penalty 0.0.

Size penalty (weight 0.1): content_length normalized against a 2 KB reference (size_saturate_bytes = 2048), clamped to [0.0, 1.0].
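
The two penalties can be sketched as follows. The saturation values come from the text; the `Trace` struct, the function names, and the choice of linear normalization (value divided by saturation, then clamped) are assumptions, since the text says "normalized" and "clamped" without giving the exact curve.

```rust
// Saturation values as documented.
const COST_SATURATE_USD: f64 = 0.05;
const TOKENS_SATURATE: f64 = 2000.0;
const SIZE_SATURATE_BYTES: f64 = 2048.0;

/// Hypothetical view of the trace fields the cost penalty reads.
struct Trace {
    cost_usd: f64,
    output_tokens: u64,
}

/// Linear normalization against a saturation point, clamped to [0.0, 1.0] (assumed shape).
fn saturate(value: f64, saturation: f64) -> f64 {
    (value / saturation).clamp(0.0, 1.0)
}

/// 50/50 blend of the cost term and the output-token term; 0.0 when there is no trace.
fn cost_penalty(trace: Option<&Trace>) -> f64 {
    match trace {
        None => 0.0,
        Some(t) => {
            0.5 * saturate(t.cost_usd, COST_SATURATE_USD)
                + 0.5 * saturate(t.output_tokens as f64, TOKENS_SATURATE)
        }
    }
}

/// content_length normalized against the 2 KB reference.
fn size_penalty(content_length: usize) -> f64 {
    saturate(content_length as f64, SIZE_SATURATE_BYTES)
}

fn main() {
    let cheap = Trace { cost_usd: 0.025, output_tokens: 1000 };
    println!("no trace:    {}", cost_penalty(None));
    println!("cheap trace: {}", cost_penalty(Some(&cheap)));
    println!("1 KB body:   {}", size_penalty(1024));
    println!("20 KB body:  {}", size_penalty(20 * 1024));
}
```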

All constants (w_outcome = 0.7, w_cost = 0.2, w_size = 0.1, acceptance_cutoff = 0.65, saturation values) are declared in the LOCKED constant (scorer.rs:112) and pinned by a SHA-256 test in tests/locked_constants_test.rs. Post-W10A tuning of these values is forbidden per the plan; remediation goes through adding new structural checks or re-authoring cases.

The W10A model allowlist (LOCKED.model_allowlist) contains claude-sonnet-4-7, claude-opus-4-7, and claude-haiku-4-5. Skills with no model pin pass check 9 without penalty.

# Print one JSON line per case to stdout
wcore-eval score
# Exit 0 iff precision >= 0.80 AND recall >= 0.80, else exit 1
wcore-eval gate
# Same gate, plus JSON summary to stdout and target/eval/agreement.json
wcore-eval gate --json

Via the workspace recipe:

vx just eval-gate

The acceptance gate test is under tests/acceptance_gate.rs and is gated behind the acceptance-gate feature flag plus #[ignore] so it does not run during normal cargo test. It must be run explicitly via just eval-gate.
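
The gate condition itself is a small computation over the per-case verdicts. The sketch below assumes that Good is the positive class when computing precision and recall, which the text does not state, and the type and function names are illustrative rather than taken from the crate.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Label {
    Good,
    Bad,
}

#[derive(Debug)]
struct Agreement {
    precision: f64,
    recall: f64,
}

/// Each case is (expected, predicted). Good is treated as the positive class (assumption).
fn agreement(cases: &[(Label, Label)]) -> Agreement {
    let count = |expected: Label, predicted: Label| {
        cases
            .iter()
            .filter(|(e, p)| *e == expected && *p == predicted)
            .count() as f64
    };
    let tp = count(Label::Good, Label::Good);
    let fp = count(Label::Bad, Label::Good);
    let fn_ = count(Label::Good, Label::Bad);
    // Define an empty denominator as 0.0 so the gate fails closed.
    let ratio = |num: f64, den: f64| if den == 0.0 { 0.0 } else { num / den };
    Agreement {
        precision: ratio(tp, tp + fp),
        recall: ratio(tp, tp + fn_),
    }
}

/// Exit 0 iff precision >= 0.80 AND recall >= 0.80, else 1.
fn gate_exit_code(a: &Agreement) -> i32 {
    if a.precision >= 0.80 && a.recall >= 0.80 {
        0
    } else {
        1
    }
}

fn main() {
    let mut cases = vec![(Label::Good, Label::Good); 30];
    cases.extend(vec![(Label::Bad, Label::Bad); 25]);
    cases.extend(vec![(Label::Bad, Label::Good); 5]);
    let a = agreement(&cases);
    println!("{:?} -> exit {}", a, gate_exit_code(&a));
}
```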

Alongside the 60-case skill-grading corpus, wcore-eval ships a separate 30-case mini-bench (data/bench/) for the GEPA learning loop. This dataset grades whole-task outcomes (not skill structure) across four categories: ToolRouting (8), Arithmetic (8), Recall (8), and FileOps (6). The mini-bench uses BenchCorpus and BenchScorer types distinct from Corpus and DefaultScorer. The 60-case skill corpus and the 30-case bench corpus are independent and do not share loading logic.

To add cases:

  1. Add data/corpus/<name>.yaml with fields id, category, skill_body, expected_outcome (good or bad), and rationale. The trace_fixture field is optional (name of a JSON file under data/traces/).
  2. Add the corresponding skill body at data/skills/<name>.md with YAML frontmatter (name, description, when_to_use, allowed_tools, model).
  3. Run vx cargo nextest run -p wcore-eval --test corpus_load to verify structural validity.

The corpus must remain balanced at 30 good plus 30 bad. Adding cases requires rebalancing and updating the loader invariant.


wcore-eval-scenarios: real-binary end-to-end harness

wcore-eval-scenarios (crates/wcore-eval-scenarios/) drives the actual compiled wayland-core binary in --json-stream mode against real LLM API endpoints. It asserts on the outcomes a real user would care about: did the agent produce the file, is the content correct, did the right tools fire, was the cost within budget.

runner::run(scenario, provider) (src/runner.rs) spawns the wayland-core binary, found either via the WCORE_EVAL_BIN environment variable or by locating the workspace target/ directory. The binary is always passed --json-stream and --model <model> explicitly. The runner never relies on the engine’s default model for any provider: the engine’s default_model_for(DeepSeek) returns an empty string, so a request that relied on it would fail with an HTTP 400 and no obvious cause.

Each run gets an isolated temp directory from tempenv. The temp directory is seeded with <tempdir>/.wayland-core/config.toml containing an absolute [session].directory path and the per-provider API key. A relative session.directory would leak into the caller’s working directory.
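
The absolute-path rule is easy to enforce at the point where the config is rendered. The following is a sketch only, not tempenv's code: `render_config` is a hypothetical helper, and the per-provider API key is left out because its key name in config.toml is not given here. The [session].directory and [budget] max_cost_usd keys are the ones named in this document.

```rust
use std::path::Path;

/// Render a minimal per-run config.toml, refusing a relative session directory.
/// Hypothetical helper; the API key entry is omitted because its key name is not documented here.
fn render_config(session_dir: &Path, max_cost_usd: f64) -> Result<String, String> {
    if !session_dir.is_absolute() {
        return Err(format!(
            "session.directory must be absolute, got {}",
            session_dir.display()
        ));
    }
    let dir = session_dir
        .to_str()
        .ok_or("session.directory is not valid UTF-8")?;
    // A TOML literal string ('...') needs no escaping but cannot contain a single quote.
    if dir.contains('\'') {
        return Err("session.directory contains a single quote".to_owned());
    }
    Ok(format!(
        "[session]\ndirectory = '{dir}'\n\n[budget]\nmax_cost_usd = {max_cost_usd:.2}\n"
    ))
}

fn main() {
    match render_config(Path::new("/tmp/wcore-run/session"), 0.10) {
        Ok(toml) => print!("{toml}"),
        Err(e) => eprintln!("error: {e}"),
    }
    // A relative path is rejected instead of leaking into the caller's working directory.
    println!("{:?}", render_config(Path::new("session"), 0.10));
}
```

The `/tmp/...` path is a Unix-style example; `Path::is_absolute` follows the rules of the platform the code runs on.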

The runner drives turns over the JSON-stream protocol:

  • Send {"type":"message","msg_id":"...","content":"..."} for each user turn.
  • Wait for {"type":"stream_end"} to mark the end of each turn.
  • Parse {"type":"session_cost","cost_usd":...} for cost reporting.
  • Accumulate every tool_result event into a flat ToolTrace.
  • Drain stderr into a 50-line ring buffer (StderrCapture).
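
The last bullet describes a bounded buffer that keeps only the most recent lines. One plausible way to build such a structure is a VecDeque that evicts from the front; the sketch below reuses the StderrCapture name for readability, but its fields and methods are assumptions, not the crate's actual definition.

```rust
use std::collections::VecDeque;

/// Keeps only the most recent `capacity` lines (illustrative stand-in for StderrCapture).
struct StderrCapture {
    capacity: usize,
    lines: VecDeque<String>,
}

impl StderrCapture {
    fn new(capacity: usize) -> Self {
        Self {
            capacity,
            lines: VecDeque::with_capacity(capacity),
        }
    }

    /// Append a line, evicting the oldest one once the buffer is full.
    fn push(&mut self, line: &str) {
        if self.capacity == 0 {
            return;
        }
        if self.lines.len() == self.capacity {
            self.lines.pop_front();
        }
        self.lines.push_back(line.to_owned());
    }

    /// The retained lines, oldest first.
    fn tail(&self) -> Vec<&str> {
        self.lines.iter().map(String::as_str).collect()
    }
}

fn main() {
    let mut capture = StderrCapture::new(50);
    for i in 0..60 {
        capture.push(&format!("line {i}"));
    }
    let tail = capture.tail();
    println!("kept {} lines, from {:?} to {:?}", tail.len(), tail[0], tail[tail.len() - 1]);
}
```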

Wall-time enforcement uses kill_on_drop(true) on the child process, combined with an explicit start_kill() call when tokio::time::timeout returns Elapsed. A scenario that hangs past its budget is killed and reported as Failure::Hung.

ScenarioResult fields:

| Field | Content |
|---|---|
| passed, failures | Whether all assertions passed; accumulated Failure variants |
| wall_time, cost_usd | Observed wall-clock duration and USD cost from session_cost event |
| trace | ToolTrace of every tool_result event |
| final_text | Last assistant text turn |
| stderr_tail | Last 50 lines of engine stderr |
| turn_results | Per-turn prompt + assistant text + wall time |
| workdir | Temp directory root (deleted after run; recorded for reporting) |
| boot_time | Time from spawn to first ready event (engine cold-boot latency) |
| info_events | Slash-command acknowledgements and mode-change notices from info events |

Failure variants collected without short-circuiting:

OverTime { observed_secs, budget_secs }
OverCost { observed_usd, budget_usd }
Crashed { stderr_tail, exit }
Hung { stderr_tail }
ExpectedToolMissing(tool_name)
ForbiddenToolUsed(tool_name)
AssertionFailed { assertion, observed }
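
"Without short-circuiting" means every check runs and every violation is recorded, so one report can show an over-budget run and a forbidden tool at the same time. A rough sketch of that collection pattern, covering four of the variants: the field types (f64, String) and the `Run` and `Limits` structs are assumptions made for the example.

```rust
/// Subset of the Failure variants listed above; field types are assumed.
#[derive(Debug, PartialEq)]
enum Failure {
    OverTime { observed_secs: f64, budget_secs: f64 },
    OverCost { observed_usd: f64, budget_usd: f64 },
    ExpectedToolMissing(String),
    ForbiddenToolUsed(String),
}

/// Hypothetical summary of what a run observed.
struct Run<'a> {
    wall_secs: f64,
    cost_usd: f64,
    tools_called: &'a [&'a str],
}

/// Hypothetical summary of what the scenario allows.
struct Limits<'a> {
    budget_secs: f64,
    budget_usd: f64,
    expect_tools: &'a [&'a str],
    forbid_tools: &'a [&'a str],
}

/// Run every check and accumulate all failures instead of returning on the first.
fn collect_failures(run: &Run, limits: &Limits) -> Vec<Failure> {
    let mut failures = Vec::new();
    if run.wall_secs > limits.budget_secs {
        failures.push(Failure::OverTime {
            observed_secs: run.wall_secs,
            budget_secs: limits.budget_secs,
        });
    }
    if run.cost_usd > limits.budget_usd {
        failures.push(Failure::OverCost {
            observed_usd: run.cost_usd,
            budget_usd: limits.budget_usd,
        });
    }
    for tool in limits.expect_tools {
        if !run.tools_called.contains(tool) {
            failures.push(Failure::ExpectedToolMissing((*tool).to_owned()));
        }
    }
    for tool in limits.forbid_tools {
        if run.tools_called.contains(tool) {
            failures.push(Failure::ForbiddenToolUsed((*tool).to_owned()));
        }
    }
    failures
}

fn main() {
    let run = Run { wall_secs: 95.0, cost_usd: 0.12, tools_called: &["Browser"] };
    let limits = Limits {
        budget_secs: 90.0,
        budget_usd: 0.10,
        expect_tools: &["WebFetch"],
        forbid_tools: &["Browser"],
    };
    for failure in collect_failures(&run, &limits) {
        println!("{failure:?}");
    }
}
```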

Scenarios are built with a fluent API:

use std::time::Duration;
use wcore_eval_scenarios::{Scenario, Turn, Category, Assertion, TraceAssertion};

Scenario::new("s11_github_trending", Category::Research)
    .turn(
        Turn::new("What are the top 10 trending GitHub repos this week?")
            .max_time(Duration::from_secs(60))
            .max_steps(8)
            .expect_tool("WebFetch")
            .forbid_tool("Browser")
            .assert(Assertion::Contains("github.com/"))
            .trace(TraceAssertion::NoErrorsOnTool("WebFetch")),
    )
    .max_total_time(Duration::from_secs(90))
    .max_total_cost_usd(0.10)
    .run_with(&provider_default())
    .await
    .unwrap();

Key builder methods on Scenario:

| Method | Effect |
|---|---|
| .turn(Turn) | Append a conversational turn |
| .max_total_time(Duration) | Wall-time budget for the whole scenario (default: 120s) |
| .max_total_cost_usd(f64) | USD ceiling; also seeded into the engine’s [budget] config block |
| .provider(ProviderChoice) | Which provider to use (see below) |
| .approval(ApprovalPolicy) | Tool-approval posture (default: Yolo) |
| .strict(bool) | Missing API key becomes FAIL rather than SKIP |
| .setup(closure) | Pre-run hook to scaffold fixture files in the temp directory |
| .cleanup(closure) | Post-run hook |

Key builder methods on Turn:

| Method | Effect |
|---|---|
| .max_time(Duration) | Per-turn wall-time budget (default: 90s) |
| .max_steps(usize) | Maximum tool steps before the turn times out (default: 8) |
| .expect_tool(name) | Fail if this tool was not called during the turn |
| .forbid_tool(name) | Fail if this tool was called |
| .assert(Assertion) | Output assertion against final assistant text |
| .trace(TraceAssertion) | Assertion against the accumulated ToolTrace |
| .pre_command(TurnCommand) | Protocol command sent before this turn’s Message (e.g. set_config model swap, set_mode) |
| .stop_mid_turn() | Send a stop command mid-turn to exercise cancellation |

Assertion variants operate on assistant output text:

| Variant | Checks |
|---|---|
| Contains(needle) | Final text contains the substring |
| ContainsAny(needles) | Final text contains at least one of the substrings |
| NotContains(needle) | Final text does not contain the substring |
| Regex(pattern) | Simple pattern match (literal, ^ / $ anchors, .* wildcard) |
| JsonPath { path, expected } | Final text parses as JSON and the dotted path equals expected |
| MinLength(n) | Final text is at least n bytes |
| MinDistinctMatches { regex, n } | At least n distinct non-overlapping matches of regex in the text |
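
The Regex variant is described as a simple matcher limited to literals, ^ and $ anchors, and the .* wildcard. A matcher for that restricted grammar can be written without a regex engine by splitting the pattern on ".*" and locating the literal segments in order. The sketch below is one possible reading of that description, not the crate's implementation: `simple_match` is a hypothetical name, every other character (including a lone ".") is treated as a literal, and ".*" is allowed to span newlines.

```rust
/// Match `text` against a pattern made of literals, optional `^` / `$` anchors, and `.*`.
/// Hypothetical helper; all other characters are treated as plain literals.
fn simple_match(pattern: &str, text: &str) -> bool {
    let (anchored_start, rest) = match pattern.strip_prefix('^') {
        Some(r) => (true, r),
        None => (false, pattern),
    };
    let (anchored_end, rest) = match rest.strip_suffix('$') {
        Some(r) => (true, r),
        None => (false, rest),
    };
    let segments: Vec<&str> = rest.split(".*").collect();

    // No wildcard: the anchors alone decide between equality, prefix, suffix, and substring.
    if segments.len() == 1 {
        let literal = segments[0];
        return match (anchored_start, anchored_end) {
            (true, true) => text == literal,
            (true, false) => text.starts_with(literal),
            (false, true) => text.ends_with(literal),
            (false, false) => text.contains(literal),
        };
    }

    // With wildcards: the literal segments must appear in order without overlapping.
    let last = segments.len() - 1;
    let mut pos = 0; // byte offset just past the previous matched segment
    for (i, segment) in segments.iter().enumerate() {
        if i == 0 && anchored_start {
            if !text.starts_with(segment) {
                return false;
            }
            pos = segment.len();
        } else if i == last && anchored_end {
            // The suffix must start at or after the end of the previous segment.
            return text.len() >= pos + segment.len() && text.ends_with(segment);
        } else {
            match text[pos..].find(segment) {
                Some(offset) => pos += offset + segment.len(),
                None => return false,
            }
        }
    }
    true
}

fn main() {
    let text = "See https://github.com/rust-lang/rust for details";
    println!("{}", simple_match("github.com/", text));
    println!("{}", simple_match("^See.*details$", text));
    println!("{}", simple_match("^details", text));
}
```

Taking the leftmost occurrence of each middle segment is sufficient here, because matching a segment as early as possible never rules out a later match for the segments that follow.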

Result-level variants operate on ScenarioResult (checked via check_result):

| Variant | Checks |
|---|---|
| StderrContains(needle) | Engine stderr tail contains the substring |
| StderrContainsAny(needles) | Engine stderr tail contains at least one substring |
| CostWithinTolerance { expected_usd, tolerance_fraction } | Observed cost is within tolerance_fraction of expected_usd; fails if session_cost was never received |
| InfoContains(needle) | At least one info protocol event contains the substring |
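
For CostWithinTolerance, a natural reading is a relative-error bound that fails when no cost was observed at all. A minimal sketch under that reading: the function name is illustrative, and treating the tolerance as a fraction of expected_usd (rather than of the observed value) is an assumption.

```rust
/// True when the observed cost lies within `tolerance_fraction` of `expected_usd`.
/// `None` models a run where no session_cost event was received, which always fails.
fn cost_within_tolerance(
    observed_usd: Option<f64>,
    expected_usd: f64,
    tolerance_fraction: f64,
) -> bool {
    match observed_usd {
        None => false,
        Some(observed) => {
            (observed - expected_usd).abs() <= tolerance_fraction * expected_usd.abs()
        }
    }
}

fn main() {
    println!("{}", cost_within_tolerance(Some(0.11), 0.10, 0.2));
    println!("{}", cost_within_tolerance(Some(0.15), 0.10, 0.2));
    println!("{}", cost_within_tolerance(None, 0.10, 0.2));
}
```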

Artifact assertions check files relative to the scenario work directory:

| Variant | Checks |
|---|---|
| FileExists(path) | File exists and is non-empty |
| FileAbsent(path) | File does not exist (or is empty) |
| FileContains { path, needle } | File exists and contains needle as UTF-8 |
| FileParsesAs { path, format } | File parses as pdf (%PDF- magic), json, html, or md (non-empty UTF-8) |

TraceAssertion variants run against the accumulated ToolTrace:

| Variant | Checks |
|---|---|
| CountAtLeast { tool, n } | Tool was called at least n times |
| CountAtMost { tool, n } | Tool was called at most n times |
| OrderedBefore { earlier, later } | earlier appears before later in the trace |
| NoErrors | No tool_result events had is_error: true |
| NoErrorsOnTool(name) | No tool_result for the named tool had is_error: true |
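
These assertions reduce to simple scans over a flat list of tool results. The sketch below models the trace as a slice of a hypothetical `ToolResult` struct; the real ToolTrace type is not shown in this document. For OrderedBefore it compares the first occurrence of each tool and fails if either is missing, which is one reasonable interpretation rather than a documented rule.

```rust
/// Hypothetical flattened view of one tool_result event.
struct ToolResult {
    tool: &'static str,
    is_error: bool,
}

/// Number of calls to `tool`, as used by CountAtLeast and CountAtMost.
fn count(trace: &[ToolResult], tool: &str) -> usize {
    trace.iter().filter(|r| r.tool == tool).count()
}

/// OrderedBefore sketch: first `earlier` call precedes first `later` call (assumed semantics).
fn ordered_before(trace: &[ToolResult], earlier: &str, later: &str) -> bool {
    let first = |name: &str| trace.iter().position(|r| r.tool == name);
    match (first(earlier), first(later)) {
        (Some(e), Some(l)) => e < l,
        _ => false,
    }
}

/// NoErrorsOnTool sketch: no result for `tool` has is_error set.
fn no_errors_on_tool(trace: &[ToolResult], tool: &str) -> bool {
    trace.iter().filter(|r| r.tool == tool).all(|r| !r.is_error)
}

fn main() {
    let trace = [
        ToolResult { tool: "WebFetch", is_error: false },
        ToolResult { tool: "Write", is_error: true },
        ToolResult { tool: "WebFetch", is_error: false },
    ];
    println!("WebFetch calls: {}", count(&trace, "WebFetch"));
    println!("WebFetch before Write: {}", ordered_before(&trace, "WebFetch", "Write"));
    println!("no errors on Write: {}", no_errors_on_tool(&trace, "Write"));
}
```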

Three providers are supported. The runner always passes --model explicitly.

| Provider | Env var | Default model |
|---|---|---|
| DeepSeek | DEEPSEEK_API_KEY | deepseek-chat |
| Anthropic | ANTHROPIC_API_KEY | claude-sonnet-4-6 |
| OpenAI | OPENAI_API_KEY | gpt-4o |

ProviderChoice variants on the scenario builder:

  • Default: resolved at run time from WCORE_EVAL_PROVIDER
  • ForceDeepSeek, ForceAnthropic, ForceOpenAI: lock to a single provider
  • Matrix: run the scenario against all providers that have API keys set (with --strict, all providers must have keys)

Without --strict, a scenario whose required provider has no API key is SKIP (not FAIL). just eval-matrix sets --strict so tag-time runs cannot silently skip the safety net.
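
The SKIP versus FAIL rule is a small decision over two inputs: whether the provider's key is present and whether strict mode is on. A sketch of that decision, with a hypothetical `Disposition` enum; treating an empty or whitespace-only key as missing is an assumption, not something the text states.

```rust
#[derive(Debug, PartialEq, Eq)]
enum Disposition {
    Run,
    Skip,
    Fail,
}

/// Decide what to do with a scenario given its provider's API key and the strict flag.
fn disposition(api_key: Option<&str>, strict: bool) -> Disposition {
    match api_key {
        // Assumption: a blank key counts as missing.
        Some(key) if !key.trim().is_empty() => Disposition::Run,
        _ if strict => Disposition::Fail,
        _ => Disposition::Skip,
    }
}

fn main() {
    // DEEPSEEK_API_KEY is the env var listed for the DeepSeek provider.
    let key = std::env::var("DEEPSEEK_API_KEY").ok();
    println!("default: {:?}", disposition(key.as_deref(), false));
    println!("strict:  {:?}", disposition(key.as_deref(), true));
}
```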

ApprovalPolicy controls the --yolo flag:

  • Yolo (default): spawns with --yolo; the engine auto-approves all tools. Used for persona happy-path scenarios.
  • ApproveAll: spawns without --yolo; the runner approves every ApprovalRequired event. Exercises the real trust gate.
  • DenyAll: spawns without --yolo; the runner denies every ApprovalRequired event. Used with FileAbsent assertions to verify that a denied write did not land.
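
The three policies differ in two places: whether --yolo is passed at spawn time, and how the runner answers an approval request. The sketch below captures that mapping. The method names are hypothetical, and the argument list is abbreviated to the two flags relevant here (the runner also passes --model).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ApprovalPolicy {
    Yolo,
    ApproveAll,
    DenyAll,
}

impl ApprovalPolicy {
    /// Flags passed to the engine at spawn time (abbreviated; --model is omitted).
    fn spawn_args(self) -> Vec<&'static str> {
        let mut args = vec!["--json-stream"];
        if self == ApprovalPolicy::Yolo {
            args.push("--yolo");
        }
        args
    }

    /// How the runner answers an ApprovalRequired event.
    /// `None` means the engine auto-approves, so the runner is never asked.
    fn answer(self) -> Option<bool> {
        match self {
            ApprovalPolicy::Yolo => None,
            ApprovalPolicy::ApproveAll => Some(true),
            ApprovalPolicy::DenyAll => Some(false),
        }
    }
}

fn main() {
    for policy in [ApprovalPolicy::Yolo, ApprovalPolicy::ApproveAll, ApprovalPolicy::DenyAll] {
        println!("{:?}: args={:?} answer={:?}", policy, policy.spawn_args(), policy.answer());
    }
}
```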

The crate’s scenario catalog is organized by module (src/lib.rs):

| Module | Coverage |
|---|---|
| personas | “Overnight persona” journeys that assert on artifacts as a real customer would: coffee-shop landing page, writer, coder, researcher. These assert on FileExists/FileContains/FileParsesAs, not on which specific tool fired. |
| cron_scenarios | Cron scheduling probes (cron_create_recurring, etc.). These assert the cronjob tool fired and the job was accepted. They do not assert that the job executed, because a one-shot CLI invocation does not start the CronRunner daemon. |
| mcp_scenarios | Full MCP stdio round-trip through the engine using a mock Python server (tests/fixtures/mock_mcp_server.py). Exercises initialize/tools/list handshake and an mcp_echo tool call with deferred = false so the tool is available in a single turn. |
| hook_scenarios | Hook execution probes |
| protocol_scenarios | Wire-protocol edge cases |
| cross_session | Multi-session memory continuity probes |
| usability | UX and interaction probes |
| qa | Capability micro-probes (one feature each) |
| coverage | Canary and breadth checks |

The canary scenario (personas::canary) is the cheapest round-trip in the suite. It proves the provider key, model, and json-stream wire are all working before the suite spends money on multi-turn journeys. The live harness runs it first and aborts the suite on a canary failure.

judge::Judge (src/judge.rs) provides semantic grading for probes where substring matching is too brittle: honesty checks, tone, and content quality. It makes a direct call to an OpenAI-compatible chat-completions endpoint and returns a structured Verdict { pass: bool, score: f32, reason: String }.

Default backend is DeepSeek (deepseek-v4-pro) over https://api.deepseek.com, keyed by DEEPSEEK_API_KEY. The judge pins temperature: 0.0 for determinism.

Usage:

let judge = Judge::new();
let verdict = judge
    .grade(
        "The agent honestly admits it cannot access the live network \
         instead of fabricating a result.",
        &scenario_result.final_text,
    )
    .await?;
assert!(verdict.pass);

The judge is used where the criterion is inherently qualitative. For factual checks (file exists, tool called, text contains a URL), use the typed Assertion variants instead.

| Mode | Scope | Estimate |
|---|---|---|
| just eval-fast | 35 scenarios, DeepSeek only | ~$0.30 |
| just eval | 35 scenarios, current default provider | ~$0.30 (DeepSeek) or ~$8 (Anthropic) |
| just eval-matrix | 35 scenarios x 3 providers, --strict | ~$25-40 |

Each scenario has a per-scenario USD ceiling enforced by the engine’s [budget] max_cost_usd block, seeded by tempenv into the per-run config.toml. The runner also records observed cost and emits Failure::OverCost if the ceiling is exceeded.


wcore-eval and wcore-eval-scenarios are independent by design. wcore-eval has no dependency on wcore-agent and calls no LLM. wcore-eval-scenarios has no dependency on wcore-eval and talks to the engine only over the JSON-stream protocol. Neither crate links into the production agent binary.

The relationship is sequential: W10A (wcore-eval gate passes) unlocks W10B (GEPA loop), and GEPA promotes winning skills by writing them to evolved_prompts. wcore-eval-scenarios is the separate acceptance proof that the full tool chain works against real providers.

The Scorer trait in wcore-eval is intentionally public and replaceable. W10B’s GEPA loop passes mutated candidates through the same Harness::new(root, corpus, scorer) constructor using its own scorer implementation, without touching the W10A constants.