Most circuit breakers model recovery as a simple state transition. Something fails, the breaker opens, time passes, the breaker “half-opens,” a single request succeeds, and the system is declared healthy again.
That framing treats recovery as a moment. A single check. A yes-or-no answer to the question: “Is it working?”
In real-world distributed systems, recovery is not a moment. It is the gradual return of confidence under increasing load. The difference matters because systems that recover too optimistically tend to fail again in ways that are much harder to debug.
What “Half-Open” Is Supposed to Mean
The “Half-Open” state is usually described as a probe. The idea is that traffic is cautiously reintroduced so the system can observe whether a dependency is healthy again.
In practice, many implementations don’t actually probe. They just wait.
- The Idle Wait: Sometimes no traffic flows at all. The timer expires, the breaker closes, and full traffic resumes without a single request having been observed.
- The Lucky Strike: A single successful request is treated as definitive proof that the dependency is ready for 100% load.
Both cases confuse the passage of time with evidence of health. Time passing does not make a system healthier; it only creates an opportunity to observe health.
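To make the two failure modes concrete, here is a minimal sketch of a timer-driven breaker. The class name `NaiveBreaker`, its methods, and the `close_on_timeout` flag are hypothetical and not taken from any particular library; timestamps are passed in explicitly so the behavior is easy to follow.

```python
from typing import Optional


class NaiveBreaker:
    """Hypothetical timer-driven breaker, for illustration only."""

    def __init__(self, cooldown_s: float, close_on_timeout: bool = False) -> None:
        self.cooldown_s = cooldown_s
        self.close_on_timeout = close_on_timeout
        self.opened_at: Optional[float] = None  # None means the breaker is closed

    def trip(self, now: float) -> None:
        """Open the breaker after a failure."""
        self.opened_at = now

    def state(self, now: float) -> str:
        if self.opened_at is None:
            return "closed"
        if now - self.opened_at < self.cooldown_s:
            return "open"
        if self.close_on_timeout:
            # The Idle Wait: the clock alone closes the breaker,
            # without a single request having been observed.
            self.opened_at = None
            return "closed"
        return "half-open"

    def record_success(self, now: float) -> None:
        # The Lucky Strike: one success in half-open restores 100% of traffic.
        if self.state(now) == "half-open":
            self.opened_at = None
```

In both branches the only inputs to the decision are a timestamp and, at most, one response. Nothing in the object records how the dependency behaves under load.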
Recovery Is Not Binary
Dependencies do not move directly from “broken” to “fixed.” They move through intermediate states where they might respond correctly under light load but collapse the moment pressure increases.
A binary recovery model collapses that reality into a single decision point. This simplification is attractive because it’s easy to code, but it pushes complexity downstream. When recovery is treated as a switch instead of a process, failures reappear as:
- Sudden regressions.
- Oscillating “flapping” states.
- Cascading retries that are more destabilizing than the original outage.
A Different Question
Instead of asking, “Is this dependency healthy again?” a more useful question is:
How much traffic do we trust this dependency with right now?
That question admits a range of answers. It allows recovery to be gradual. It forces the system to acknowledge uncertainty instead of pretending it doesn’t exist.
In this framing, recovery becomes a controlled increase in exposure. Time determines how much trust is possible, but evidence determines how much trust is actually granted.
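One way to express this is to take the smaller of two numbers: a ceiling that grows with elapsed time, and a level earned through observed successes. The sketch below is only an illustration. The linear ramp, the doubling rule, the 1% first step, and all function names are assumptions chosen for clarity.

```python
def time_ceiling(elapsed_s: float, ramp_s: float) -> float:
    """Upper bound on trust that elapsed time makes possible (0.0 to 1.0)."""
    return max(0.0, min(1.0, elapsed_s / ramp_s))


def earned_trust(consecutive_successes: int, first_step: float = 0.01) -> float:
    """Trust granted by evidence: first_step after one success, doubling after each further one."""
    if consecutive_successes <= 0:
        return 0.0
    return min(1.0, first_step * 2 ** (consecutive_successes - 1))


def allowed_fraction(elapsed_s: float, ramp_s: float, consecutive_successes: int) -> float:
    """Share of traffic to send: never more than time allows or evidence supports."""
    return min(time_ceiling(elapsed_s, ramp_s), earned_trust(consecutive_successes))
```

With these assumed parameters, `allowed_fraction(3600, 60, 1)` is 0.01: an hour of waiting raises the ceiling to 100%, but a single success has only earned 1%. Conversely, ten quick successes six seconds into a 60-second ramp are still held to the 10% that time allows.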
The First Success Is Not Full Trust
One of the most common mistakes in recovery logic is treating the first successful request as decisive.
A single success only proves the dependency can respond. It doesn’t prove it can sustain load, or that the recovery is durable rather than accidental. For that reason, the first success should unlock only a tiny sliver of traffic—enough to continue observing behavior, but not enough to cause damage if the system is still fragile.
Even if you’ve been waiting an hour, the first success should still be capped. Time defines the ceiling of possible confidence; it should not accelerate the process by itself.
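A "tiny sliver" has to be enforced somewhere. One simple option, sketched here as an assumption rather than a recommendation, is a probabilistic gate that admits roughly a given fraction of requests and rejects the rest. The name `make_gate` and the 1% figure in the usage note are hypothetical.

```python
import random
from typing import Callable


def make_gate(fraction: float, rng: Callable[[], float] = random.random) -> Callable[[], bool]:
    """Return a function that admits roughly `fraction` of the calls made to it."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")

    def admit() -> bool:
        # rng() yields a float in [0.0, 1.0), so 0.0 admits nothing and 1.0 admits everything.
        return rng() < fraction

    return admit
```

After the first success, `make_gate(0.01)` would let about one request in a hundred through to the dependency. That is enough to keep observing it, while the other ninety-nine are still served by whatever fallback the open breaker uses.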
Failure During Recovery Must Be Decisive
Some systems attempt to be “forgiving” during recovery. If 90% of requests succeed but 10% fail, they reduce the rate slightly and keep trying.
That approach is a recipe for oscillation. Half-open is not a statistical smoothing phase; it is a probationary one.
Any failure during recovery means confidence was misplaced. The correct response is not to negotiate or compromise—it is to reset and protect the system immediately. That strictness prevents repeated partial failures from turning into prolonged instability.
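The strict rule is short to write down. The sketch below uses a hypothetical `RecoveryRamp` class with an assumed 1% first step and doubling growth. It also assumes that a separate probe request is what produces the first success while the trusted fraction is still zero.

```python
class RecoveryRamp:
    """Hypothetical probation tracker: any failure resets confidence to zero."""

    def __init__(self, first_step: float = 0.01, growth: float = 2.0) -> None:
        self.first_step = first_step
        self.growth = growth
        self.fraction = 0.0    # share of traffic currently trusted
        self.tripped_at = 0.0  # time of the most recent failure

    def on_success(self) -> None:
        """Grow exposure one step at a time, starting from the small first step."""
        if self.fraction == 0.0:
            self.fraction = self.first_step
        else:
            self.fraction = min(1.0, self.fraction * self.growth)

    def on_failure(self, now: float) -> None:
        """No partial credit: drop to zero and restart the clock."""
        self.fraction = 0.0
        self.tripped_at = now
```

There is deliberately no branch that lowers the fraction slightly and carries on. A failure at 64% exposure and a failure at 1% exposure both lead back to the same starting point.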
Why This Is a Product Decision
Recovery behavior shapes the user experience just as much as failure behavior does.
- Aggressive Recovery: Users see repeated regressions and “glitchy” performance as the system repeatedly slams into a wall.
- Conservative Recovery: Users experience unnecessary downtime even after the underlying issue is resolved.
Neither outcome is accidental. Choosing how confidence is rebuilt, how much evidence is required, and how quickly exposure increases is a fundamental decision about risk and trust.
That makes recovery a product decision.
Confidence Over Clocks
The simplest recovery strategy is driven by a clock: wait X seconds, then close the breaker and resume full traffic. A safer strategy is driven by confidence: increase exposure only when success justifies it, and never faster than time allows.
When recovery works well, it is boring. Traffic returns smoothly. There are no sudden floods of restored traffic, no lucky probes, and no mysterious regressions. That boredom is not a default state; it is earned through better design.