Failover and Resilience

Wayland Core layers four distinct mechanisms between your request and a provider failure. They compose: per-request HTTP retry fires first, key rotation runs alongside it, ProviderChain falls through to the next provider on retryable errors, and ResilientProvider wraps each slot with a circuit breaker.

Retry

wcore-providers/src/retry.rs implements with_retry(max_retries, f).

3 total attempts: 1 initial call plus 2 retries (DEFAULT_MAX_RETRIES = 2).
Exponential backoff: 250 ms, 1 s, 4 s (4x multiplier per step, capped at 4 s). With the default of 2 retries, only the first two delays (250 ms and 1 s) are reached.
429 override: if the response carries a Retry-After header (RFC 9110 delta-seconds or HTTP-date) or a retry_after_ms / retry_after field in the JSON body, that value is used instead of the backoff schedule. Hints of up to 60 seconds are honored as given; a hint above 60 seconds is not slept on and instead fails fast, so the resilience layer above can pick a fallback.
Retryable errors: RateLimited, Connection, transient HTTP 5xx, 408, 429 (ProviderError::is_retryable()).
Terminal errors (returned immediately, no retry): 4xx other than 408/429, Parse, PromptTooLong, non-transient Http.
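
The delay rules above can be sketched as two small functions. This is a minimal illustration, not the code in retry.rs: the names backoff_delay, next_delay, and MAX_RETRY_AFTER are hypothetical, and the sketch assumes the hint has already been parsed into a Duration.

```rust
use std::time::Duration;

/// Longest Retry-After hint that is still slept on (60 s, per the text above).
const MAX_RETRY_AFTER: Duration = Duration::from_secs(60);

/// Backoff before retry number `retry` (0-based): 250 ms, 1 s, 4 s, then flat at 4 s.
fn backoff_delay(retry: u32) -> Duration {
    const BASE_MS: u64 = 250;
    const CAP_MS: u64 = 4_000;
    // 4x multiplier per step; saturating math keeps large retry counts safe.
    let factor = 4u64.saturating_pow(retry);
    Duration::from_millis(BASE_MS.saturating_mul(factor).min(CAP_MS))
}

/// Delay to sleep before the next retry, or `None` to fail fast.
/// `hint` is a server-provided Retry-After value, if any.
fn next_delay(retry: u32, hint: Option<Duration>) -> Option<Duration> {
    match hint {
        // Hint is too long to wait out: give up so a fallback can take over.
        Some(h) if h > MAX_RETRY_AFTER => None,
        // A usable hint replaces the backoff schedule.
        Some(h) => Some(h),
        // No hint: fall back to the exponential schedule.
        None => Some(backoff_delay(retry)),
    }
}
```

Returning an Option keeps the fail-fast case explicit: the caller sleeps on Some(delay) and returns the error immediately on None.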

The Retry-After resolution precedence is: HTTP header, then retry_after_ms / retry_after at the top level of the JSON error body, then parameters.retry_after_ms, then body.retry_after_ms, then a default of 5 s.
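
As a rough sketch of that precedence, the lookup can be written as a chain of Option fallbacks. The RetryHints struct is hypothetical and stands in for an already-parsed response. Two further assumptions: retry_after is taken to be in seconds, and retry_after_ms is checked before retry_after when both appear at the top level.

```rust
use std::time::Duration;

/// Hypothetical, already-parsed view of the retry hints in a 429 response.
#[derive(Default)]
struct RetryHints {
    header: Option<Duration>,               // Retry-After header
    top_retry_after_ms: Option<u64>,        // body: retry_after_ms
    top_retry_after_secs: Option<u64>,      // body: retry_after (assumed seconds)
    parameters_retry_after_ms: Option<u64>, // body: parameters.retry_after_ms
    body_retry_after_ms: Option<u64>,       // body: body.retry_after_ms
}

/// Used when no hint is present anywhere.
const DEFAULT_RETRY_AFTER: Duration = Duration::from_secs(5);

/// Walks the sources in precedence order; the first one present wins.
fn resolve_retry_after(h: &RetryHints) -> Duration {
    h.header
        .or(h.top_retry_after_ms.map(Duration::from_millis))
        .or(h.top_retry_after_secs.map(Duration::from_secs))
        .or(h.parameters_retry_after_ms.map(Duration::from_millis))
        .or(h.body_retry_after_ms.map(Duration::from_millis))
        .unwrap_or(DEFAULT_RETRY_AFTER)
}
```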

URL-borne API keys are stripped from all error messages before they are stored or logged (H-2 fix, without_url() on every reqwest error).

Key rotation pool

wcore-providers/src/key_rotation.rs implements KeyPool.

Holds N API keys per provider (empty and duplicate keys are filtered at construction).
Sticky “last good”: returns the most recently successful key first.
Round-robin rotation when the last-good key is absent or cooling.
Per-key cooldown of 60 seconds by default (KeyPool::with_cooldown accepts a custom Duration). A failed key is excluded from rotation until the cooldown expires.
Thread safety: KeyPool takes &mut self on all methods. Callers sharing a pool across tasks must wrap it in Arc<Mutex<KeyPool>>.

Methods: next_key() -> Option<&str>, mark_success(key), mark_failure(key).
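
The behavior described above (filtering, sticky last-good, round-robin, per-key cooldown) can be sketched as follows. This is an illustration only: the fields, the constructor signatures, and the choice to clear last-good on failure are assumptions, not the contents of key_rotation.rs.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Illustrative key pool; field names and internals are assumptions.
struct KeyPool {
    keys: Vec<String>,
    last_good: Option<usize>,               // index of the most recently successful key
    cursor: usize,                          // next round-robin position
    cooling_until: HashMap<usize, Instant>, // per-key cooldown deadlines
    cooldown: Duration,
}

impl KeyPool {
    /// Builds a pool with the default 60 s cooldown.
    fn new(keys: &[&str]) -> Self {
        Self::with_cooldown(keys, Duration::from_secs(60))
    }

    /// Builds a pool with a custom cooldown, dropping empty and duplicate keys.
    fn with_cooldown(keys: &[&str], cooldown: Duration) -> Self {
        let mut unique: Vec<String> = Vec::new();
        for key in keys {
            if !key.is_empty() && !unique.iter().any(|k| k.as_str() == *key) {
                unique.push((*key).to_owned());
            }
        }
        KeyPool { keys: unique, last_good: None, cursor: 0, cooling_until: HashMap::new(), cooldown }
    }

    fn len(&self) -> usize {
        self.keys.len()
    }

    fn index_of(&self, key: &str) -> Option<usize> {
        self.keys.iter().position(|k| k == key)
    }

    fn is_cooling(&self, idx: usize, now: Instant) -> bool {
        self.cooling_until.get(&idx).map_or(false, |until| now < *until)
    }

    /// Sticky last-good key first, then round-robin over keys that are not cooling.
    fn next_key(&mut self) -> Option<&str> {
        let now = Instant::now();
        if let Some(idx) = self.last_good {
            if !self.is_cooling(idx, now) {
                return Some(self.keys[idx].as_str());
            }
        }
        for step in 0..self.keys.len() {
            let idx = (self.cursor + step) % self.keys.len();
            if !self.is_cooling(idx, now) {
                self.cursor = (idx + 1) % self.keys.len();
                return Some(self.keys[idx].as_str());
            }
        }
        None // every key is cooling, or the pool is empty
    }

    fn mark_success(&mut self, key: &str) {
        if let Some(idx) = self.index_of(key) {
            self.last_good = Some(idx);
            self.cooling_until.remove(&idx);
        }
    }

    fn mark_failure(&mut self, key: &str) {
        if let Some(idx) = self.index_of(key) {
            self.cooling_until.insert(idx, Instant::now() + self.cooldown);
            if self.last_good == Some(idx) {
                self.last_good = None;
            }
        }
    }
}
```

Because every method takes &mut self, a pool shared across tasks would be wrapped as Arc<Mutex<KeyPool>>, with the lock held only while selecting or marking a key.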

ProviderChain

wcore-providers/src/chain.rs implements ProviderChain, a stateless sequential fallback list.

On a retryable error from the active provider, the chain moves to the next slot. On a terminal error, the error is returned immediately without trying further slots.

Retryable at the chain level (fall through to the next provider):

Connection (timeout, DNS, TLS)
Http with is_timeout() or is_connect()
RateLimited
Api { status >= 500 } (all 5xx)

Terminal at the chain level (propagated immediately):

Api { status 4xx } except 429 (which is RateLimited)
Parse
PromptTooLong
Egress::Denied (an egress-gate block applies to every provider, so trying the next slot cannot help)

On full exhaustion, the error message includes the attempt count: "all 2 provider(s) in chain failed: ...".
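
The fall-through rule can be sketched with a simplified, synchronous chain. Everything here is an assumption for illustration: the error enum is a reduced stand-in for ProviderError, Provider is a plain function pointer rather than the real provider trait, and ChainError and run_chain are hypothetical names.

```rust
/// Reduced stand-in for the error variants named above.
#[derive(Debug, Clone, PartialEq)]
enum ProviderError {
    Connection(String),
    RateLimited,
    Api { status: u16 },
    Parse(String),
    PromptTooLong,
    EgressDenied,
}

impl ProviderError {
    /// Chain-level rule: fall through only on provider-side, retryable failures.
    fn falls_through(&self) -> bool {
        match self {
            ProviderError::Connection(_) | ProviderError::RateLimited => true,
            ProviderError::Api { status } => *status >= 500,
            ProviderError::Parse(_)
            | ProviderError::PromptTooLong
            | ProviderError::EgressDenied => false,
        }
    }
}

/// Hypothetical synchronous provider; the real interface is likely async.
type Provider = fn(&str) -> Result<String, ProviderError>;

#[derive(Debug, PartialEq)]
enum ChainError {
    Terminal(ProviderError), // returned without trying further slots
    Exhausted(String),       // every slot failed with a retryable error
}

/// Tries each slot in order and stops at the first success or terminal error.
fn run_chain(slots: &[Provider], prompt: &str) -> Result<String, ChainError> {
    let mut last = None;
    for slot in slots {
        match slot(prompt) {
            Ok(reply) => return Ok(reply),
            Err(e) if e.falls_through() => last = Some(e),
            Err(e) => return Err(ChainError::Terminal(e)),
        }
    }
    let detail = last.map_or_else(|| "no providers configured".to_string(), |e| format!("{e:?}"));
    Err(ChainError::Exhausted(format!(
        "all {} provider(s) in chain failed: {}",
        slots.len(),
        detail
    )))
}
```

The sketch keeps no state between calls, which mirrors the stateless description above: each request starts again from the first slot.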

The chain logs the routing_hint from LlmRequest in a tracing span when present, so dispatch decisions are visible in traces without affecting fallback order.

ProviderChain has no circuit breaker. Wrap individual slots in ResilientProvider if you want per-slot circuit-breaking.

Circuit breaker (ResilientProvider)

wcore-providers/src/resilient.rs implements ResilientProvider, which wraps any LlmProvider.

Three states:

State	Description
`Closed`	Normal operation. Failures count toward the threshold.
`Open`	Primary is skipped. Requests route directly to the fallback list.
`HalfOpen`	One probe is allowed after the recovery timeout. Success restores `Closed`; failure returns to `Open`.

Configuration (in ~/.wayland-core/config.toml or .wayland-core.toml):

[provider_chain]
enabled = true
failure_threshold = 3      # failures within window before Open
recovery_timeout_secs = 30 # how long to stay Open before a HalfOpen probe

Default values: failure_threshold = 3, recovery_timeout_secs = 30.

Semantic errors (ContextOverflow, Format, ModelNotFound) do not count toward the threshold. A bad input that always fails is not a provider health signal; counting it would open the circuit on a wedged request, not a wedged provider. Only transient and permanent provider-side failures (Connection, 5xx, RateLimited, 401/403) trip the breaker.
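
A minimal sketch of the three-state breaker follows. It is not the ResilientProvider implementation: the type and method names are hypothetical, the failure window mentioned in the config comment is ignored (failures are simply counted), and the caller passes a flag saying whether an error counts toward the threshold.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
enum CircuitState {
    Closed,
    Open { since: Instant },
    HalfOpen,
}

/// Illustrative breaker; defaults in the text are threshold 3 and timeout 30 s.
struct CircuitBreaker {
    state: CircuitState,
    failures: u32,
    failure_threshold: u32,
    recovery_timeout: Duration,
}

impl CircuitBreaker {
    fn new(failure_threshold: u32, recovery_timeout: Duration) -> Self {
        CircuitBreaker { state: CircuitState::Closed, failures: 0, failure_threshold, recovery_timeout }
    }

    fn state(&self) -> CircuitState {
        self.state
    }

    /// Returns true if the primary may be called at `now`.
    fn allow_request(&mut self, now: Instant) -> bool {
        match self.state {
            CircuitState::Closed => true,
            // A probe is already outstanding; keep routing to the fallbacks.
            CircuitState::HalfOpen => false,
            CircuitState::Open { since } => {
                if now.saturating_duration_since(since) >= self.recovery_timeout {
                    self.state = CircuitState::HalfOpen; // allow exactly one probe
                    true
                } else {
                    false
                }
            }
        }
    }

    fn record_success(&mut self) {
        self.state = CircuitState::Closed;
        self.failures = 0;
    }

    /// `counts` is false for semantic errors, which never trip the breaker.
    fn record_failure(&mut self, counts: bool, now: Instant) {
        if !counts {
            return;
        }
        match self.state {
            CircuitState::HalfOpen => self.state = CircuitState::Open { since: now },
            CircuitState::Closed => {
                self.failures += 1;
                if self.failures >= self.failure_threshold {
                    self.state = CircuitState::Open { since: now };
                }
            }
            CircuitState::Open { .. } => {}
        }
    }
}
```

Passing the clock in as a parameter is a choice made for this sketch so the transitions can be exercised without sleeping.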

When no fallback providers are configured, a retryable primary error is surfaced as-is rather than replaced by the generic “all providers in chain failed” message.

The circuit state is reported as a ProviderCircuitEvent on the wcore-protocol event stream.

Failover classifier

wcore-providers/src/classify.rs implements classify_failover(err, http_status, body_text, sdk_code) -> FailoverReason.

Eleven reasons are recognized. Classification uses three signal tiers in precedence order (first match wins):

Tier 1: HTTP status code

Status	Reason
401	`Auth`
403	`AuthPermanent`
404	`ModelNotFound`
408	`Timeout`
413	`ContextOverflow`
429	`RateLimit`
402	`Billing`
503, 529	`Overloaded`
500, 502, 504, and any other 5xx	`Timeout`
400	`Format`

Tier 2: body text patterns (case-insensitive substring match)

Patterns include "session expired", "insufficient_quota", "billing", "invalid api key", "rate limit", "overloaded", "context length exceeded", and others.

Tier 3: SDK and vendor error codes

OpenAI: insufficient_quota, context_length_exceeded
Anthropic: overloaded_error, authentication_error
AWS Bedrock: ThrottlingException, ModelNotReadyException
Google Vertex: RESOURCE_EXHAUSTED, DEADLINE_EXCEEDED, UNAVAILABLE
OS errno: ETIMEDOUT, ECONNRESET

If no tier matches, the result is Unknown.
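
The tier precedence can be sketched as three lookups chained with first-match-wins. The Tier 1 table is taken from above; the pattern-to-reason and code-to-reason mappings in Tiers 2 and 3 are assumptions made for illustration, since this page lists the signals but not their targets. The function is named classify to mark it as a sketch, and it omits the err argument of classify_failover.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FailoverReason {
    Auth, AuthPermanent, Billing, ModelNotFound, Timeout, ContextOverflow,
    RateLimit, Overloaded, Format, SessionExpired, Unknown,
}

impl FailoverReason {
    /// Tier 1: HTTP status code, per the table above.
    fn from_status(status: u16) -> Option<Self> {
        Some(match status {
            401 => Self::Auth,
            402 => Self::Billing,
            403 => Self::AuthPermanent,
            404 => Self::ModelNotFound,
            408 => Self::Timeout,
            413 => Self::ContextOverflow,
            429 => Self::RateLimit,
            503 | 529 => Self::Overloaded,
            500..=599 => Self::Timeout, // remaining 5xx
            400 => Self::Format,
            _ => return None,
        })
    }
}

/// Tier 2 (assumed mapping): lowercase substrings of the response body.
const BODY_PATTERNS: &[(&str, FailoverReason)] = &[
    ("session expired", FailoverReason::SessionExpired),
    ("insufficient_quota", FailoverReason::Billing),
    ("billing", FailoverReason::Billing),
    ("invalid api key", FailoverReason::Auth),
    ("rate limit", FailoverReason::RateLimit),
    ("overloaded", FailoverReason::Overloaded),
    ("context length exceeded", FailoverReason::ContextOverflow),
];

/// Tier 3 (assumed mapping): exact SDK, vendor, or errno codes.
const SDK_CODES: &[(&str, FailoverReason)] = &[
    ("insufficient_quota", FailoverReason::Billing),
    ("context_length_exceeded", FailoverReason::ContextOverflow),
    ("overloaded_error", FailoverReason::Overloaded),
    ("authentication_error", FailoverReason::Auth),
    ("ThrottlingException", FailoverReason::RateLimit),
    ("RESOURCE_EXHAUSTED", FailoverReason::RateLimit),
    ("DEADLINE_EXCEEDED", FailoverReason::Timeout),
    ("ETIMEDOUT", FailoverReason::Timeout),
];

fn from_body(body: &str) -> Option<FailoverReason> {
    let lower = body.to_lowercase(); // case-insensitive substring match
    for (pattern, reason) in BODY_PATTERNS {
        if lower.contains(*pattern) {
            return Some(*reason);
        }
    }
    None
}

fn from_sdk_code(code: &str) -> Option<FailoverReason> {
    SDK_CODES.iter().find(|entry| entry.0 == code).map(|entry| entry.1)
}

/// Tiers are consulted in order; the first one that matches wins.
fn classify(http_status: Option<u16>, body_text: Option<&str>, sdk_code: Option<&str>) -> FailoverReason {
    http_status
        .and_then(FailoverReason::from_status)
        .or_else(|| body_text.and_then(from_body))
        .or_else(|| sdk_code.and_then(from_sdk_code))
        .unwrap_or(FailoverReason::Unknown)
}
```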

Cooldown state machine

wcore-providers/src/cooldown.rs implements CooldownTracker per provider.

Each FailoverReason maps to one of three classes:

Class	Reasons	Behavior
`Transient`	`RateLimit`, `Overloaded`, `Timeout`, `Auth`, `Unknown`	Short cooldown, 5 s base, exponential backoff (2^failure_count multiplier), capped at 5 minutes. Transitions `Cooling -> HalfOpen` on expiry.
`Permanent`	`AuthPermanent`, `Billing`, `SessionExpired`	Long cooldown, 15 minutes flat, no HalfOpen probe. Requires operator intervention.
`Semantic`	`ContextOverflow`, `Format`, `ModelNotFound`	No cooldown. The caller must change inputs; retrying the same provider is pointless.
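
The class table translates directly into a lookup plus a duration rule. The sketch below assumes failure_count is 0 for the first failure, so the first transient cooldown is 5 s; the names class_of and cooldown_for are hypothetical.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FailoverReason {
    Auth, AuthPermanent, Billing, ModelNotFound, Timeout, ContextOverflow,
    RateLimit, Overloaded, Format, SessionExpired, Unknown,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CooldownClass {
    Transient,
    Permanent,
    Semantic,
}

/// Maps each reason to its class, following the table above.
fn class_of(reason: FailoverReason) -> CooldownClass {
    match reason {
        FailoverReason::RateLimit
        | FailoverReason::Overloaded
        | FailoverReason::Timeout
        | FailoverReason::Auth
        | FailoverReason::Unknown => CooldownClass::Transient,
        FailoverReason::AuthPermanent
        | FailoverReason::Billing
        | FailoverReason::SessionExpired => CooldownClass::Permanent,
        FailoverReason::ContextOverflow
        | FailoverReason::Format
        | FailoverReason::ModelNotFound => CooldownClass::Semantic,
    }
}

/// Cooldown to apply after a failure, or `None` when no cooldown applies.
/// Assumption: `failure_count` is 0 on the first failure.
fn cooldown_for(reason: FailoverReason, failure_count: u32) -> Option<Duration> {
    match class_of(reason) {
        CooldownClass::Transient => {
            // 5 s base times 2^failure_count, capped at 5 minutes.
            let secs = 5u64.saturating_mul(2u64.saturating_pow(failure_count)).min(300);
            Some(Duration::from_secs(secs))
        }
        CooldownClass::Permanent => Some(Duration::from_secs(15 * 60)), // flat 15 minutes
        CooldownClass::Semantic => None,
    }
}
```

Under that assumption the transient schedule runs 5 s, 10 s, 20 s, 40 s, 80 s, 160 s, and then stays at 300 s.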

State transitions:

Ready
  → (failure)  → Cooling { until, reason }
  → (expiry)   → HalfOpen { reason }
  → (success during probe) → Ready
  → (failure during probe) → Cooling (failure_count incremented)

is_available() returns true when the tracker is in Ready or HalfOpen state.
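
The transitions above can be sketched for the transient class as a small state machine. This is an illustration under stated assumptions: the reason payload carried by Cooling and HalfOpen is omitted, failure_count is incremented after every failure and reset on success, and the tick method that moves Cooling to HalfOpen on expiry is a hypothetical name.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
enum CooldownState {
    Ready,
    Cooling { until: Instant },
    HalfOpen,
}

/// Illustrative tracker covering transient failures only.
struct CooldownTracker {
    state: CooldownState,
    failure_count: u32,
}

impl CooldownTracker {
    fn new() -> Self {
        CooldownTracker { state: CooldownState::Ready, failure_count: 0 }
    }

    fn state(&self) -> CooldownState {
        self.state
    }

    /// Failure from Ready or during a HalfOpen probe: enter Cooling with a longer delay each time.
    fn on_failure(&mut self, now: Instant) {
        // 5 s base times 2^failure_count, capped at 5 minutes.
        let secs = 5u64.saturating_mul(2u64.saturating_pow(self.failure_count)).min(300);
        self.state = CooldownState::Cooling { until: now + Duration::from_secs(secs) };
        self.failure_count = self.failure_count.saturating_add(1);
    }

    /// Moves Cooling to HalfOpen once the deadline has passed.
    fn tick(&mut self, now: Instant) {
        if let CooldownState::Cooling { until } = self.state {
            if now >= until {
                self.state = CooldownState::HalfOpen;
            }
        }
    }

    /// A successful call (including a successful probe) restores Ready.
    fn on_success(&mut self) {
        self.state = CooldownState::Ready;
        self.failure_count = 0;
    }

    fn is_available(&self) -> bool {
        matches!(self.state, CooldownState::Ready | CooldownState::HalfOpen)
    }
}
```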

Putting it together

A single request flows through the stack in this order:

KeyPool.next_key() selects the current key for the provider.
with_retry attempts the call up to 3 times on transient HTTP errors, honoring Retry-After on 429.
On a provider-side failure (semantic errors excluded), ResilientProvider counts the failure against the circuit breaker threshold. If the circuit opens, the primary is skipped.
ProviderChain moves to the next slot when the current slot returns a retryable error.
classify_failover classifies each failure reason so CooldownTracker applies the right duration.

All circuit state changes are emitted as ProviderCircuitEvent events on the JSON-stream protocol.