Saw brief tail instability around Replit Database during the upstream turbulence on Thu, 30 Oct 2025, roughly 18:50–21:10 CET (Berlin). p95/p99 grew, short 5xx bursts, and queue length amplified via retries. Looks self-exciting under provider error spikes.
Minimal rollback guardrail I’ve used elsewhere, two steps:
1. jittered exponential backoff, tighter upstream timeouts on 5xx/429, and lower max in-flight per tenant;
2. rollback trigger when a short moving window breaches on p95 or error-rate, to contain queue expansion and keep autoscaler stable under intermittent upstream failure.
If helpful, I can share a tiny diff and a quick benchmark chart over email.
Reach me at [email protected].
@cbrewster @Zabot