Failover and Resilience
Wayland Core layers four distinct mechanisms between your request and a provider failure. They compose: per-request HTTP retry fires first, key rotation runs alongside it, ProviderChain falls through to the next provider on retryable errors, and ResilientProvider wraps each slot with a circuit breaker.
wcore-providers/src/retry.rs implements with_retry(max_retries, f).
- 3 total attempts: 1 initial call plus 2 retries (`DEFAULT_MAX_RETRIES = 2`).
- Exponential backoff: 250 ms, 1 s, 4 s (4x multiplier per step, capped at 4 s).
- 429 override: if the response carries a `Retry-After` header (RFC 9110 delta-seconds or HTTP-date) or a `retry_after_ms` / `retry_after` field in the JSON body, that value is used instead of the backoff schedule. The actual sleep is capped at 60 seconds; a hint above that threshold causes fail-fast so the resilience layer above can pick a fallback.
- Retryable errors: `RateLimited`, `Connection`, transient HTTP 5xx, 408, 429 (`ProviderError::is_retryable()`).
- Terminal errors (returned immediately, no retry): 4xx other than 408/429, `Parse`, `PromptTooLong`, non-transient `Http`.
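The backoff schedule in the list above can be sketched as a small pure function. This is an illustration under stated assumptions, not the code in retry.rs: the constant names and `backoff_delay` are hypothetical, and only the 250 ms base, the 4x multiplier, and the 4 s cap come from the description.

```rust
use std::time::Duration;

// Hypothetical constants mirroring the documented schedule.
const BASE_DELAY_MS: u64 = 250;
const MULTIPLIER: u64 = 4;
const MAX_DELAY_MS: u64 = 4_000;

/// Delay before retry number `retry` (0-based):
/// 250 ms, 1 s, 4 s, and 4 s for every later retry because of the cap.
fn backoff_delay(retry: u32) -> Duration {
    // Saturating arithmetic keeps very large retry counts from overflowing.
    let factor = MULTIPLIER.saturating_pow(retry);
    let ms = BASE_DELAY_MS.saturating_mul(factor).min(MAX_DELAY_MS);
    Duration::from_millis(ms)
}
```

With the default of 2 retries, only the first two delays of this schedule would be slept between attempts.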
The Retry-After resolution precedence is: HTTP header, then retry_after_ms / retry_after at the top level of the JSON error body, then parameters.retry_after_ms, then body.retry_after_ms, then a default of 5 s.
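A minimal sketch of that precedence, assuming the hints have already been parsed out of the response. `RetryHints`, `resolve_retry_after`, and `sleep_or_fail_fast` are hypothetical names used only for illustration; treating a hint of exactly 60 s as still sleepable is also an assumption.

```rust
use std::time::Duration;

/// Hypothetical container for hints already extracted from a 429 response.
#[derive(Default)]
struct RetryHints {
    header: Option<Duration>,     // Retry-After header
    top_level: Option<Duration>,  // retry_after_ms / retry_after at the top level of the body
    parameters: Option<Duration>, // parameters.retry_after_ms
    body: Option<Duration>,       // body.retry_after_ms
}

const DEFAULT_RETRY_AFTER: Duration = Duration::from_secs(5);
const MAX_SLEEP: Duration = Duration::from_secs(60);

/// The first hint that is present wins, in the documented order; otherwise 5 s.
fn resolve_retry_after(h: &RetryHints) -> Duration {
    h.header
        .or(h.top_level)
        .or(h.parameters)
        .or(h.body)
        .unwrap_or(DEFAULT_RETRY_AFTER)
}

/// `Some(delay)` when sleeping is acceptable, `None` to signal fail-fast.
fn sleep_or_fail_fast(hint: Duration) -> Option<Duration> {
    if hint > MAX_SLEEP {
        None
    } else {
        Some(hint)
    }
}
```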
URL-borne API keys are stripped from all error messages before they are stored or logged (H-2 fix, without_url() on every reqwest error).
Key rotation pool
wcore-providers/src/key_rotation.rs implements KeyPool.
- Holds N API keys per provider (empty and duplicate keys are filtered at construction).
- Sticky “last good”: returns the most recently successful key first.
- Round-robin rotation when the last-good key is absent or cooling.
- Per-key cooldown of 60 seconds by default (`KeyPool::with_cooldown` accepts a custom `Duration`). A failed key is excluded from rotation until the cooldown expires.
- Thread safety: `KeyPool` takes `&mut self` on all methods. Callers sharing a pool across tasks must wrap it in `Arc<Mutex<KeyPool>>`.
Methods: next_key() -> Option<&str>, mark_success(key), mark_failure(key).
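The selection rules above can be sketched as follows. This is a simplified stand-in, not the real `KeyPool`: `SketchKeyPool`, its fields, and its constructor are hypothetical, and only the three method names and the behaviors listed above are taken from the description.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct SketchKeyPool {
    keys: Vec<String>,
    last_good: Option<usize>,               // index of the sticky "last good" key
    cursor: usize,                          // next round-robin position
    cooling_until: HashMap<usize, Instant>, // per-key cooldown deadlines
    cooldown: Duration,
}

impl SketchKeyPool {
    fn new(keys: Vec<String>, cooldown: Duration) -> Self {
        // Drop empty and duplicate keys at construction.
        let mut unique: Vec<String> = Vec::new();
        for k in keys {
            if !k.is_empty() && !unique.contains(&k) {
                unique.push(k);
            }
        }
        Self { keys: unique, last_good: None, cursor: 0, cooling_until: HashMap::new(), cooldown }
    }

    fn is_cooling(&self, idx: usize) -> bool {
        self.cooling_until.get(&idx).map_or(false, |until| Instant::now() < *until)
    }

    fn next_key(&mut self) -> Option<&str> {
        // Prefer the most recently successful key while it is not cooling.
        if let Some(i) = self.last_good {
            if !self.is_cooling(i) {
                return Some(self.keys[i].as_str());
            }
        }
        // Otherwise rotate round-robin over the keys that are not cooling.
        for offset in 0..self.keys.len() {
            let i = (self.cursor + offset) % self.keys.len();
            if !self.is_cooling(i) {
                self.cursor = (i + 1) % self.keys.len();
                return Some(self.keys[i].as_str());
            }
        }
        None // every key is cooling, or the pool is empty
    }

    fn mark_success(&mut self, key: &str) {
        if let Some(i) = self.keys.iter().position(|k| k == key) {
            self.last_good = Some(i);
            self.cooling_until.remove(&i);
        }
    }

    fn mark_failure(&mut self, key: &str) {
        if let Some(i) = self.keys.iter().position(|k| k == key) {
            self.cooling_until.insert(i, Instant::now() + self.cooldown);
            if self.last_good == Some(i) {
                self.last_good = None;
            }
        }
    }
}
```

Because every method takes `&mut self`, sharing a pool like this across tasks would need the same `Arc<Mutex<...>>` wrapping described above.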
ProviderChain
wcore-providers/src/chain.rs implements ProviderChain, a stateless sequential fallback list.
On a retryable error from the active provider, the chain moves to the next slot. On a terminal error, the error is returned immediately without trying further slots.
Retryable at the chain level (fall through to the next provider):
- `Connection` (timeout, DNS, TLS)
- `Http` with `is_timeout()` or `is_connect()`
- `RateLimited`
- `Api { status >= 500 }` (all 5xx)
Terminal at the chain level (propagated immediately):
- `Api { status 4xx }` except 429 (which is `RateLimited`)
- `Parse`
- `PromptTooLong`
- `Egress::Denied` (an egress-gate block will apply to every provider)
On full exhaustion, the error message includes the attempt count: "all 2 provider(s) in chain failed: ...".
The chain logs the routing_hint from LlmRequest in a tracing span when present, so dispatch decisions are visible in traces without affecting fallback order.
ProviderChain has no circuit breaker. Wrap individual slots in ResilientProvider if you want per-slot circuit-breaking.
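The fall-through rule can be sketched with a reduced error type and synchronous slots. Everything here is hypothetical and simplified for illustration: `SketchError` models only part of the error set, `Slot` is a plain function pointer rather than an `LlmProvider`, and `run_chain` and `ChainFailure` are made-up names.

```rust
#[derive(Debug, Clone, PartialEq)]
enum SketchError {
    Connection,
    RateLimited,
    Api { status: u16 },
    Parse,
    PromptTooLong,
}

impl SketchError {
    /// True when the chain should try the next slot.
    fn falls_through(&self) -> bool {
        match self {
            SketchError::Connection | SketchError::RateLimited => true,
            SketchError::Api { status } => *status >= 500,
            SketchError::Parse | SketchError::PromptTooLong => false,
        }
    }
}

#[derive(Debug, PartialEq)]
enum ChainFailure {
    /// A terminal error stopped the chain before later slots were tried.
    Terminal(SketchError),
    /// Every slot returned a retryable error.
    Exhausted { attempts: usize, last: Option<SketchError> },
}

/// A slot is modeled as a plain function from prompt to reply.
type Slot = fn(&str) -> Result<String, SketchError>;

fn run_chain(slots: &[Slot], prompt: &str) -> Result<String, ChainFailure> {
    let mut last = None;
    for slot in slots {
        match slot(prompt) {
            Ok(reply) => return Ok(reply),
            // Retryable: remember the error and move to the next slot.
            Err(e) if e.falls_through() => last = Some(e),
            // Terminal: propagate immediately without trying further slots.
            Err(e) => return Err(ChainFailure::Terminal(e)),
        }
    }
    Err(ChainFailure::Exhausted { attempts: slots.len(), last })
}
```

The loop keeps no state between calls, which matches the stateless design: every request starts again at the first slot.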
Circuit breaker (ResilientProvider)
wcore-providers/src/resilient.rs implements ResilientProvider, which wraps any LlmProvider.
Three states:
| State | Description |
|---|---|
| Closed | Normal operation. Failures count toward the threshold. |
| Open | Primary is skipped. Requests route directly to the fallback list. |
| HalfOpen | One probe is allowed after the recovery timeout. Success restores Closed; failure returns to Open. |
Configuration (in ~/.wayland-core/config.toml or .wayland-core.toml):
```toml
[provider_chain]
enabled = true
failure_threshold = 3       # failures within window before Open
recovery_timeout_secs = 30  # how long to stay Open before a HalfOpen probe
```

Default values: failure_threshold = 3, recovery_timeout_secs = 30.
Semantic errors (ContextOverflow, Format, ModelNotFound) do not count toward the threshold. A bad input that always fails is not a provider health signal; counting it would open the circuit on a wedged request, not a wedged provider. Only transient and permanent provider-side failures (Connection, 5xx, RateLimited, 401/403) trip the breaker.
When no fallback providers are configured, a retryable primary error is surfaced as-is rather than replaced by the generic “all providers in chain failed” message.
The circuit state is reported as a ProviderCircuitEvent on the wcore-protocol event stream.
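The three states and their transitions can be sketched as a small state machine. This is an illustrative model, not the code in resilient.rs: `SketchBreaker` and `Circuit` are hypothetical names, the failure window is omitted (failures are simply counted until a success), and the clock is passed in explicitly so the transitions are easy to follow.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
enum Circuit {
    Closed { failures: u32 },
    Open { since: Instant },
    HalfOpen, // a single probe is in flight
}

struct SketchBreaker {
    state: Circuit,
    failure_threshold: u32,
    recovery_timeout: Duration,
}

impl SketchBreaker {
    fn new(failure_threshold: u32, recovery_timeout: Duration) -> Self {
        Self { state: Circuit::Closed { failures: 0 }, failure_threshold, recovery_timeout }
    }

    /// Whether the primary may be tried at `now`.
    fn allow_primary(&mut self, now: Instant) -> bool {
        match self.state {
            Circuit::Closed { .. } => true,
            // Only one probe is allowed; further calls wait for its outcome.
            Circuit::HalfOpen => false,
            Circuit::Open { since } => {
                if now.duration_since(since) >= self.recovery_timeout {
                    self.state = Circuit::HalfOpen;
                    true // this call is the probe
                } else {
                    false
                }
            }
        }
    }

    fn record_success(&mut self) {
        self.state = Circuit::Closed { failures: 0 };
    }

    /// Call only for failures that signal provider health, not semantic errors.
    fn record_failure(&mut self, now: Instant) {
        self.state = match self.state {
            Circuit::Closed { failures } if failures + 1 >= self.failure_threshold => {
                Circuit::Open { since: now }
            }
            Circuit::Closed { failures } => Circuit::Closed { failures: failures + 1 },
            // A failed probe reopens the circuit and restarts the timeout.
            Circuit::HalfOpen | Circuit::Open { .. } => Circuit::Open { since: now },
        };
    }
}
```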
Failover classifier
wcore-providers/src/classify.rs implements classify_failover(err, http_status, body_text, sdk_code) -> FailoverReason.
Eleven reasons are recognized. Classification uses three signal tiers in precedence order (first match wins):
Tier 1: HTTP status code
| Status | Reason |
|---|---|
| 401 | Auth |
| 403 | AuthPermanent |
| 404 | ModelNotFound |
| 408 | Timeout |
| 413 | ContextOverflow |
| 429 | RateLimit |
| 402 | Billing |
| 503, 529 | Overloaded |
| 500, 502, 504 (other 5xx) | Timeout |
| 400 | Format |
Tier 2: body text patterns (case-insensitive substring match)
Patterns include "session expired", "insufficient_quota", "billing", "invalid api key", "rate limit", "overloaded", "context length exceeded", and others.
Tier 3: SDK and vendor error codes
- OpenAI: `insufficient_quota`, `context_length_exceeded`
- Anthropic: `overloaded_error`, `authentication_error`
- AWS Bedrock: `ThrottlingException`, `ModelNotReadyException`
- Google Vertex: `RESOURCE_EXHAUSTED`, `DEADLINE_EXCEEDED`, `UNAVAILABLE`
- OS errno: `ETIMEDOUT`, `ECONNRESET`
If no tier matches, the result is Unknown.
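A reduced sketch of the first-match-wins tiering, covering Tier 1 and a few Tier 2 patterns. The function names are hypothetical and the signature is simpler than `classify_failover`. The status mapping follows the Tier 1 table; which reason each body pattern maps to is an assumption, since the text lists the patterns without their targets. Tier 3 is left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FailoverReason {
    Auth, AuthPermanent, Billing, ModelNotFound, Timeout, ContextOverflow,
    RateLimit, Overloaded, Format, SessionExpired, Unknown,
}

/// Tier 1: HTTP status code, following the table above.
fn classify_status(status: u16) -> Option<FailoverReason> {
    use FailoverReason::*;
    Some(match status {
        400 => Format,
        401 => Auth,
        402 => Billing,
        403 => AuthPermanent,
        404 => ModelNotFound,
        408 => Timeout,
        413 => ContextOverflow,
        429 => RateLimit,
        503 | 529 => Overloaded,
        500..=599 => Timeout, // remaining 5xx
        _ => return None,
    })
}

/// Tier 2: case-insensitive substring match (assumed pattern-to-reason mapping).
fn classify_body(body: &str) -> Option<FailoverReason> {
    const PATTERNS: &[(&str, FailoverReason)] = &[
        ("session expired", FailoverReason::SessionExpired),
        ("insufficient_quota", FailoverReason::Billing),
        ("billing", FailoverReason::Billing),
        ("invalid api key", FailoverReason::Auth),
        ("rate limit", FailoverReason::RateLimit),
        ("overloaded", FailoverReason::Overloaded),
        ("context length exceeded", FailoverReason::ContextOverflow),
    ];
    let lower = body.to_lowercase();
    PATTERNS.iter().find(|(p, _)| lower.contains(*p)).map(|(_, r)| *r)
}

/// Tiers are consulted in order; the first one that matches decides.
fn classify_sketch(status: Option<u16>, body: &str) -> FailoverReason {
    status
        .and_then(classify_status)
        .or_else(|| classify_body(body))
        .unwrap_or(FailoverReason::Unknown)
}
```

Because the status tier is consulted first, a 502 whose body mentions a rate limit is still classified as Timeout in this sketch.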
Cooldown state machine
wcore-providers/src/cooldown.rs implements CooldownTracker per provider.
Each FailoverReason maps to one of three classes:
| Class | Reasons | Behavior |
|---|---|---|
| Transient | RateLimit, Overloaded, Timeout, Auth, Unknown | Short cooldown, 5 s base, exponential backoff (2^failure_count multiplier), capped at 5 minutes. Transitions Cooling -> HalfOpen on expiry. |
| Permanent | AuthPermanent, Billing, SessionExpired | Long cooldown, 15 minutes flat, no HalfOpen probe. Requires operator intervention. |
| Semantic | ContextOverflow, Format, ModelNotFound | No cooldown. The caller must change inputs; retrying the same provider is pointless. |
State transitions:
- Ready → (failure) → Cooling { until, reason }
- Cooling → (expiry) → HalfOpen { reason }
- HalfOpen → (success during probe) → Ready
- HalfOpen → (failure during probe) → Cooling (failure_count incremented)

is_available() returns true when the tracker is in Ready or HalfOpen state.
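The cooldown durations from the class table can be sketched as a pure function. `CooldownClass`, `cooldown_for`, and the constant names are hypothetical. Whether `failure_count` is 0 or 1 on the first failure is not stated, so this sketch assumes 0, which makes the first transient cooldown exactly the 5 s base.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq)]
enum CooldownClass {
    Transient,
    Permanent,
    Semantic,
}

const TRANSIENT_BASE: Duration = Duration::from_secs(5);
const TRANSIENT_CAP: Duration = Duration::from_secs(5 * 60);
const PERMANENT_COOLDOWN: Duration = Duration::from_secs(15 * 60);

/// Cooldown to apply after a failure, or `None` when no cooldown applies.
fn cooldown_for(class: CooldownClass, failure_count: u32) -> Option<Duration> {
    match class {
        CooldownClass::Transient => {
            // 5 s * 2^failure_count, capped at 5 minutes; saturating to avoid overflow.
            let factor = 2u32.saturating_pow(failure_count);
            Some(TRANSIENT_BASE.saturating_mul(factor).min(TRANSIENT_CAP))
        }
        // Flat 15 minutes, regardless of how many failures came before.
        CooldownClass::Permanent => Some(PERMANENT_COOLDOWN),
        // The caller must change inputs; cooling the provider would not help.
        CooldownClass::Semantic => None,
    }
}
```

Under that assumption the transient sequence runs 5 s, 10 s, 20 s, 40 s, 80 s, 160 s, and then stays at the 300 s cap.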
Putting it together
A single request flows through the stack in this order:
1. `KeyPool.next_key()` selects the current key for the provider.
2. `with_retry` attempts the call up to 3 times on transient HTTP errors, honoring `Retry-After` on 429.
3. On retryable failure, `ResilientProvider` counts the failure against the circuit breaker threshold. If the circuit opens, the primary is skipped.
4. `ProviderChain` moves to the next slot when the current slot returns a retryable error.
5. `classify_failover` classifies each failure reason so `CooldownTracker` applies the right duration.
All circuit state changes are emitted as ProviderCircuitEvent events on the JSON-stream protocol.