Failover and Resilience

Wayland Core layers four distinct mechanisms between your request and a provider failure. They compose: per-request HTTP retry fires first, key rotation runs alongside it, ProviderChain falls through to the next provider on retryable errors, and ResilientProvider wraps each slot with a circuit breaker.

wcore-providers/src/retry.rs implements with_retry(max_retries, f).

  • 3 total attempts: 1 initial call plus 2 retries (DEFAULT_MAX_RETRIES = 2).
  • Exponential backoff: 250 ms, 1 s, 4 s (4x multiplier per step, capped at 4 s). With the default of 2 retries, only the first two delays (250 ms and 1 s) are used.
  • 429 override: if the response carries a Retry-After header (RFC 9110 delta-seconds or HTTP-date) or a retry_after_ms / retry_after field in the JSON body, that value is used instead of the backoff schedule. The actual sleep is capped at 60 seconds; a hint above that threshold causes fail-fast so the resilience layer above can pick a fallback.
  • Retryable errors: RateLimited, Connection, transient HTTP 5xx, 408, 429 (ProviderError::is_retryable()).
  • Terminal errors (returned immediately, no retry): 4xx other than 408/429, Parse, PromptTooLong, non-transient Http.
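The backoff schedule and the 60-second hint cap described above can be sketched as follows. This is a minimal illustration, not the code in retry.rs: the constant names, backoff_delay, and next_delay are hypothetical, and the sketch assumes the retry index is 0-based.

```rust
use std::time::Duration;

/// Delay before the first retry (assumed name; 250 ms per the schedule above).
const BASE_DELAY_MS: u64 = 250;
/// Multiplier applied at each retry step.
const MULTIPLIER: u64 = 4;
/// Upper bound for a computed backoff delay.
const MAX_BACKOFF_MS: u64 = 4_000;
/// Longest server-provided hint the retry loop will sleep for.
const MAX_HINT: Duration = Duration::from_secs(60);

/// Backoff before retry number `retry` (0-based): 250 ms, 1 s, 4 s, then capped at 4 s.
fn backoff_delay(retry: u32) -> Duration {
    let factor = MULTIPLIER.saturating_pow(retry);
    Duration::from_millis(BASE_DELAY_MS.saturating_mul(factor).min(MAX_BACKOFF_MS))
}

/// Pick the sleep before the next retry.
/// A server hint replaces the schedule; a hint above the cap yields `None`,
/// which the caller treats as fail-fast so a fallback can be chosen.
fn next_delay(retry: u32, hint: Option<Duration>) -> Option<Duration> {
    match hint {
        Some(h) if h > MAX_HINT => None,
        Some(h) => Some(h),
        None => Some(backoff_delay(retry)),
    }
}
```

Returning `None` rather than clamping the hint to 60 seconds mirrors the fail-fast behavior described above: sleeping for less time than the server asked for would most likely produce another 429.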

The Retry-After resolution precedence is: HTTP header, then retry_after_ms / retry_after at the top level of the JSON error body, then parameters.retry_after_ms, then body.retry_after_ms, then a default of 5 s.
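The precedence chain above maps naturally onto Option::or. The sketch below assumes the header and body fields have already been parsed into durations; the RetryHints struct and resolve_retry_after are hypothetical names used only for illustration.

```rust
use std::time::Duration;

/// Hypothetical container for retry hints already parsed from one 429 response.
#[derive(Default)]
struct RetryHints {
    /// Retry-After HTTP header.
    header: Option<Duration>,
    /// retry_after_ms / retry_after at the top level of the JSON error body.
    top_level: Option<Duration>,
    /// parameters.retry_after_ms.
    parameters: Option<Duration>,
    /// body.retry_after_ms.
    body: Option<Duration>,
}

/// Fallback used when the response carries no hint at all.
const DEFAULT_RETRY_AFTER: Duration = Duration::from_secs(5);

/// First present hint wins, in the documented precedence order.
fn resolve_retry_after(hints: &RetryHints) -> Duration {
    hints
        .header
        .or(hints.top_level)
        .or(hints.parameters)
        .or(hints.body)
        .unwrap_or(DEFAULT_RETRY_AFTER)
}
```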

URL-borne API keys are stripped from all error messages before they are stored or logged (H-2 fix, without_url() on every reqwest error).

wcore-providers/src/key_rotation.rs implements KeyPool.

  • Holds N API keys per provider (empty and duplicate keys are filtered at construction).
  • Sticky “last good”: returns the most recently successful key first.
  • Round-robin rotation when the last-good key is absent or cooling.
  • Per-key cooldown of 60 seconds by default (KeyPool::with_cooldown accepts a custom Duration). A failed key is excluded from rotation until the cooldown expires.
  • Thread safety: KeyPool takes &mut self on all methods. Callers sharing a pool across tasks must wrap it in Arc<Mutex<KeyPool>>.

Methods: next_key() -> Option<&str>, mark_success(key), mark_failure(key).
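To make the selection rules concrete, here is a minimal sketch of a pool with the same method names. It is an illustration written from the description above, not the contents of key_rotation.rs: the field layout and the KeyPool::new constructor are assumptions, as is the choice to clear the sticky key when it fails.

```rust
use std::time::{Duration, Instant};

/// Illustrative key pool; field names are assumptions.
struct KeyPool {
    keys: Vec<String>,
    cooling_until: Vec<Option<Instant>>,
    last_good: Option<usize>,
    cursor: usize,
    cooldown: Duration,
}

impl KeyPool {
    /// Assumed convenience constructor using the 60-second default cooldown.
    fn new(keys: Vec<String>) -> Self {
        Self::with_cooldown(keys, Duration::from_secs(60))
    }

    fn with_cooldown(keys: Vec<String>, cooldown: Duration) -> Self {
        // Drop empty and duplicate keys, keeping first occurrences in order.
        let mut unique: Vec<String> = Vec::new();
        for key in keys {
            if !key.is_empty() && !unique.contains(&key) {
                unique.push(key);
            }
        }
        let n = unique.len();
        Self { keys: unique, cooling_until: vec![None; n], last_good: None, cursor: 0, cooldown }
    }

    fn is_cooling(&self, i: usize) -> bool {
        matches!(self.cooling_until[i], Some(until) if Instant::now() < until)
    }

    fn next_key(&mut self) -> Option<&str> {
        // Sticky "last good": prefer the most recently successful key.
        if let Some(i) = self.last_good {
            if !self.is_cooling(i) {
                return Some(self.keys[i].as_str());
            }
        }
        // Otherwise round-robin over keys that are not cooling.
        let n = self.keys.len();
        for step in 0..n {
            let i = (self.cursor + step) % n;
            if !self.is_cooling(i) {
                self.cursor = (i + 1) % n;
                return Some(self.keys[i].as_str());
            }
        }
        None
    }

    fn mark_success(&mut self, key: &str) {
        if let Some(i) = self.keys.iter().position(|k| k == key) {
            self.cooling_until[i] = None;
            self.last_good = Some(i);
        }
    }

    fn mark_failure(&mut self, key: &str) {
        if let Some(i) = self.keys.iter().position(|k| k == key) {
            self.cooling_until[i] = Some(Instant::now() + self.cooldown);
            if self.last_good == Some(i) {
                self.last_good = None;
            }
        }
    }
}
```

Because every method takes &mut self, sharing a pool like this across tasks means wrapping it in Arc<Mutex<KeyPool>>, as noted above.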

wcore-providers/src/chain.rs implements ProviderChain, a stateless sequential fallback list.

On a retryable error from the active provider, the chain moves to the next slot. On a terminal error, the error is returned immediately without trying further slots.

Retryable at the chain level (fall through to the next provider):

  • Connection (timeout, DNS, TLS)
  • Http with is_timeout() or is_connect()
  • RateLimited
  • Api { status >= 500 } (all 5xx)

Terminal at the chain level (propagated immediately):

  • Api { status 4xx } except 429 (which is RateLimited)
  • Parse
  • PromptTooLong
  • Egress::Denied (an egress-gate block will apply to every provider)

On full exhaustion, the error message includes the attempt count: "all 2 provider(s) in chain failed: ...".
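The fall-through rule can be sketched with a simplified error type and closures standing in for providers. Everything here is hypothetical scaffolding: the real ProviderError has more variants (the Http timeout and connect cases are folded into Connection below), and Slot, run_chain, and ChainExhausted are names invented for the example.

```rust
/// Simplified stand-in for the provider error type (assumed shape).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum ProviderError {
    Connection(String),
    RateLimited,
    Api { status: u16, message: String },
    Parse(String),
    PromptTooLong,
    EgressDenied,
    ChainExhausted(String),
}

impl ProviderError {
    /// True when the chain should try the next slot.
    fn falls_through(&self) -> bool {
        match self {
            ProviderError::Connection(_) | ProviderError::RateLimited => true,
            ProviderError::Api { status, .. } => *status >= 500,
            _ => false,
        }
    }
}

/// A provider slot, modelled here as a plain closure.
type Slot = Box<dyn Fn() -> Result<String, ProviderError>>;

/// Try each slot in order; stop on success or on the first terminal error.
fn run_chain(slots: &[Slot]) -> Result<String, ProviderError> {
    let mut last = None;
    for slot in slots {
        match slot() {
            Ok(value) => return Ok(value),
            Err(e) if e.falls_through() => last = Some(e),
            Err(e) => return Err(e),
        }
    }
    let detail = last
        .map(|e| format!("{:?}", e))
        .unwrap_or_else(|| "no providers configured".to_string());
    Err(ProviderError::ChainExhausted(format!(
        "all {} provider(s) in chain failed: {}",
        slots.len(),
        detail
    )))
}
```

The function keeps no state between calls, which matches the description of the chain as stateless: each request starts again from the first slot.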

The chain logs the routing_hint from LlmRequest in a tracing span when present, so dispatch decisions are visible in traces without affecting fallback order.

ProviderChain has no circuit breaker. Wrap individual slots in ResilientProvider if you want per-slot circuit-breaking.

wcore-providers/src/resilient.rs implements ResilientProvider, which wraps any LlmProvider.

Three states:

| State | Description |
|---|---|
| Closed | Normal operation. Failures count toward the threshold. |
| Open | Primary is skipped. Requests route directly to the fallback list. |
| HalfOpen | One probe is allowed after the recovery timeout. Success restores Closed; failure returns to Open. |

Configuration (in ~/.wayland-core/config.toml or .wayland-core.toml):

[provider_chain]
enabled = true
failure_threshold = 3 # failures within window before Open
recovery_timeout_secs = 30 # how long to stay Open before a HalfOpen probe

Default values: failure_threshold = 3, recovery_timeout_secs = 30.

Semantic errors (ContextOverflow, Format, ModelNotFound) do not count toward the threshold. A bad input that always fails is not a provider health signal; counting it would open the circuit on a wedged request, not a wedged provider. Only transient and permanent provider-side failures (Connection, 5xx, RateLimited, 401/403) trip the breaker.
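The three states and the threshold rule can be sketched as a small state machine. This is an assumption-laden illustration rather than the ResilientProvider code: the type and method names are hypothetical, the failure window mentioned in the config comment is omitted, and the caller is expected to pass counts_toward_threshold = false for semantic errors.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
enum CircuitState {
    Closed,
    Open,
    HalfOpen,
}

/// Illustrative breaker; names and layout are assumptions.
struct CircuitBreaker {
    state: CircuitState,
    failures: u32,
    failure_threshold: u32,
    recovery_timeout: Duration,
    opened_at: Option<Instant>,
}

impl CircuitBreaker {
    fn new(failure_threshold: u32, recovery_timeout: Duration) -> Self {
        Self { state: CircuitState::Closed, failures: 0, failure_threshold, recovery_timeout, opened_at: None }
    }

    fn state(&self) -> CircuitState {
        self.state
    }

    /// Whether the primary may be called at `now`.
    /// An Open circuit admits exactly one probe once the recovery timeout has elapsed.
    fn allow_request(&mut self, now: Instant) -> bool {
        match self.state {
            CircuitState::Closed => true,
            CircuitState::HalfOpen => false, // a probe is already outstanding
            CircuitState::Open => {
                let recovered = self
                    .opened_at
                    .map_or(false, |t| now.saturating_duration_since(t) >= self.recovery_timeout);
                if recovered {
                    self.state = CircuitState::HalfOpen;
                }
                recovered
            }
        }
    }

    fn record_success(&mut self) {
        self.state = CircuitState::Closed;
        self.failures = 0;
        self.opened_at = None;
    }

    /// Semantic errors are passed with `counts_toward_threshold = false` and ignored.
    fn record_failure(&mut self, now: Instant, counts_toward_threshold: bool) {
        if !counts_toward_threshold {
            return;
        }
        match self.state {
            CircuitState::HalfOpen => self.trip(now),
            CircuitState::Closed => {
                self.failures += 1;
                if self.failures >= self.failure_threshold {
                    self.trip(now);
                }
            }
            CircuitState::Open => {}
        }
    }

    fn trip(&mut self, now: Instant) {
        self.state = CircuitState::Open;
        self.opened_at = Some(now);
    }
}
```

Passing the clock in as a parameter keeps the sketch deterministic to test; a production breaker would more likely read Instant::now() internally.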

When no fallback providers are configured, a retryable primary error is surfaced as-is rather than replaced by the generic “all providers in chain failed” message.

The circuit state is reported as a ProviderCircuitEvent on the wcore-protocol event stream.

wcore-providers/src/classify.rs implements classify_failover(err, http_status, body_text, sdk_code) -> FailoverReason.

Eleven reasons are recognized. Classification uses three signal tiers in precedence order (first match wins):

Tier 1: HTTP status code

| Status | Reason |
|---|---|
| 401 | Auth |
| 403 | AuthPermanent |
| 404 | ModelNotFound |
| 408 | Timeout |
| 413 | ContextOverflow |
| 429 | RateLimit |
| 402 | Billing |
| 503, 529 | Overloaded |
| 500, 502, 504 (other 5xx) | Timeout |
| 400 | Format |
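The Tier 1 mapping is a plain lookup on the status code. A minimal sketch, assuming a FailoverReason enum with the eleven variants named on this page; classify_status is a hypothetical helper that returns None when Tier 1 has no match, so the later tiers can run.

```rust
/// The eleven reasons named on this page (assumed enum shape).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FailoverReason {
    Auth,
    AuthPermanent,
    Billing,
    ModelNotFound,
    Timeout,
    ContextOverflow,
    RateLimit,
    Overloaded,
    Format,
    SessionExpired,
    Unknown,
}

/// Tier 1 only: map an HTTP status to a reason, or `None` if no rule applies.
fn classify_status(status: u16) -> Option<FailoverReason> {
    use FailoverReason::*;
    Some(match status {
        400 => Format,
        401 => Auth,
        402 => Billing,
        403 => AuthPermanent,
        404 => ModelNotFound,
        408 => Timeout,
        413 => ContextOverflow,
        429 => RateLimit,
        // Specific 5xx codes must come before the catch-all range below.
        503 | 529 => Overloaded,
        500..=599 => Timeout,
        _ => return None,
    })
}
```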

Tier 2: body text patterns (case-insensitive substring match)

Patterns include "session expired", "insufficient_quota", "billing", "invalid api key", "rate limit", "overloaded", "context length exceeded", and others.

Tier 3: SDK and vendor error codes

  • OpenAI: insufficient_quota, context_length_exceeded
  • Anthropic: overloaded_error, authentication_error
  • AWS Bedrock: ThrottlingException, ModelNotReadyException
  • Google Vertex: RESOURCE_EXHAUSTED, DEADLINE_EXCEEDED, UNAVAILABLE
  • OS errno: ETIMEDOUT, ECONNRESET

If no tier matches, the result is Unknown.

wcore-providers/src/cooldown.rs implements CooldownTracker per provider.

Each FailoverReason maps to one of three classes:

| Class | Reasons | Behavior |
|---|---|---|
| Transient | RateLimit, Overloaded, Timeout, Auth, Unknown | Short cooldown, 5 s base, exponential backoff (2^failure_count multiplier), capped at 5 minutes. Transitions Cooling -> HalfOpen on expiry. |
| Permanent | AuthPermanent, Billing, SessionExpired | Long cooldown, 15 minutes flat, no HalfOpen probe. Requires operator intervention. |
| Semantic | ContextOverflow, Format, ModelNotFound | No cooldown. The caller must change inputs; retrying the same provider is pointless. |

State transitions:

Ready
  → (failure) → Cooling { until, reason }
Cooling
  → (expiry) → HalfOpen { reason }
HalfOpen
  → (success during probe) → Ready
  → (failure during probe) → Cooling (failure_count incremented)

is_available() returns true when the tracker is in Ready or HalfOpen state.
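The class table translates into a cooldown duration as sketched below. The names class_of and cooldown_for are hypothetical, and the sketch assumes failure_count is the number of failures recorded before the current one, so a first failure cools for the 5 s base; the source does not say whether the count starts at 0 or 1.

```rust
use std::time::Duration;

/// The eleven reasons named on this page (assumed enum shape).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FailoverReason {
    Auth, AuthPermanent, Billing, ModelNotFound, Timeout, ContextOverflow,
    RateLimit, Overloaded, Format, SessionExpired, Unknown,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CooldownClass {
    Transient,
    Permanent,
    Semantic,
}

const TRANSIENT_BASE: Duration = Duration::from_secs(5);
const TRANSIENT_CAP: Duration = Duration::from_secs(5 * 60);
const PERMANENT_COOLDOWN: Duration = Duration::from_secs(15 * 60);

/// Map each reason to its class, following the table above.
fn class_of(reason: FailoverReason) -> CooldownClass {
    use FailoverReason::*;
    match reason {
        RateLimit | Overloaded | Timeout | Auth | Unknown => CooldownClass::Transient,
        AuthPermanent | Billing | SessionExpired => CooldownClass::Permanent,
        ContextOverflow | Format | ModelNotFound => CooldownClass::Semantic,
    }
}

/// Cooldown for a failure, or `None` when the class has no cooldown.
/// Assumption: `failure_count` counts failures recorded before this one.
fn cooldown_for(reason: FailoverReason, failure_count: u32) -> Option<Duration> {
    match class_of(reason) {
        CooldownClass::Semantic => None,
        CooldownClass::Permanent => Some(PERMANENT_COOLDOWN),
        CooldownClass::Transient => {
            // 5 s * 2^failure_count, saturating and capped at 5 minutes.
            let factor = 2u32.saturating_pow(failure_count);
            Some(TRANSIENT_BASE.saturating_mul(factor).min(TRANSIENT_CAP))
        }
    }
}
```

Under that assumption the transient schedule runs 5 s, 10 s, 20 s, 40 s, 80 s, 160 s, and then stays at the 5-minute cap.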

A single request flows through the stack in this order:

  1. KeyPool.next_key() selects the current key for the provider.
  2. with_retry attempts the call up to 3 times on transient HTTP errors, honoring Retry-After on 429.
  3. On retryable failure, ResilientProvider counts the failure against the circuit breaker threshold. If the circuit opens, the primary is skipped.
  4. ProviderChain moves to the next slot when the current slot returns a retryable error.
  5. classify_failover classifies each failure reason so CooldownTracker applies the right duration.

All circuit state changes are emitted as ProviderCircuitEvent events on the JSON-stream protocol.