Load Balancer Failover Time Calculator
Model load balancer failover timing from health probes, thresholds, reroute delay, DNS TTL, drain windows, RTO status, and scenarios.
Introduction:
Failover time is the delay between a backend becoming unusable and users reliably reaching a healthy destination. In a load-balanced service, that delay is rarely a single timer. It is usually the sum of health-check detection, control-plane rerouting, client connection behavior, and sometimes DNS cache exposure.
Health checks are the starting signal. A probe interval controls how often a backend is tested, a timeout defines how long a probe can wait, and an unhealthy threshold says how many consecutive failures are needed before traffic should move away. Short intervals detect failures faster, but they also make false positives more likely when an application has brief pauses, network jitter, slow startup, or maintenance windows.
Backend failover and DNS failover should be kept separate in planning. Backend failover moves new traffic inside the load balancer or target group. DNS failover changes the answers returned to resolvers, so users may keep using an old endpoint until cached records expire. A low time to live (TTL) can reduce that exposure, but it cannot force every resolver or client to re-query at the same instant.
| Term | Practical meaning |
|---|---|
| Detection window | How long probes need to classify the backend as unhealthy. |
| Reroute delay | Time for the service to update routing after the unhealthy decision. |
| TTL exposure | Potential time for clients or resolvers to keep a cached DNS answer. |
| Connection drain | Grace time for existing connections, which may outlive the new-traffic cutover. |
A recovery time objective, or RTO, is a target for how quickly service should be restored from the user's point of view. Failover math helps test whether probe settings and DNS choices are aligned with that target. It does not prove application correctness or data consistency, and it does not show that the standby path has enough capacity for the full load.
How to Use This Tool:
- Pick the provider preset that is closest to the system being modeled, or start from custom values when your production settings are known.
- Select the failover path: backend-only, DNS/GSLB endpoint, or backend plus DNS exposure.
- Enter the health-check interval, probe timeout, unhealthy threshold, healthy threshold, reroute delay, and connection-drain time.
- For DNS-based paths, add the recursive DNS TTL and authoritative propagation time that users may experience after the backend decision.
- Set the target RTO so the result can show whether the modeled user-visible window is inside, near, or beyond the objective.
- Use the timing assumption, successful probe response, and jitter buffer to model worst-case, average, or first-failed-probe timing.
- Compare the scenario matrix and sensitivity charts before changing production thresholds, because faster probes can increase false-positive risk.
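The comparison in the last step can be approximated offline with a small sweep. The sketch below assumes worst-case detection is one interval of alignment wait, plus one interval between each pair of consecutive failed probes, plus one probe timeout. The function names and that formula are illustrative assumptions, not the calculator's confirmed internals.

```python
from itertools import product


def worst_case_detection(interval_s, timeout_s, unhealthy_threshold):
    """Assumed model: alignment wait + gaps between failed probes + final timeout."""
    alignment_wait = interval_s  # worst case: the backend fails just after a probe passed
    gaps = (unhealthy_threshold - 1) * interval_s
    return alignment_wait + gaps + timeout_s


def sweep(intervals, thresholds, timeout_s=5.0):
    """Return {(interval, threshold): detection_seconds} for every combination."""
    return {
        (i, u): worst_case_detection(i, timeout_s, u)
        for i, u in product(intervals, thresholds)
    }


if __name__ == "__main__":
    for (i, u), seconds in sorted(sweep([5, 10, 30], [1, 2, 3]).items()):
        print(f"interval={i:>2}s threshold={u} -> detection ~{seconds:.0f}s")
```

Reading the rows side by side shows how quickly detection grows with the interval, which is the cost to weigh against the false-positive risk of faster probes.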
Interpreting Results:
The new-backend traffic window is the estimated time until new requests can be steered away from the failed backend. The user-visible failover window can be longer when DNS TTL, authoritative propagation, or other endpoint-level behavior is part of the path.
| Result | How to read it |
|---|---|
| Detection | Probe timeout and interval time needed to cross the unhealthy threshold. |
| New traffic | Detection plus reroute delay, before old connections are fully drained. |
| User-visible failover | New traffic plus DNS exposure when the selected path includes DNS. |
| RTO status | A comparison between the modeled user-visible window and the target objective. |
| Recovery duration | Estimated time for an unhealthy backend to be marked healthy again, once consecutive successful probes reach the healthy threshold. |
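The RTO status row is a three-way comparison: inside, near, or beyond the target. The sketch below assumes that "near" means within 10% of the target. That margin, the function name, and the labels are hypothetical choices for illustration; the calculator may draw the line elsewhere.

```python
def rto_status(user_visible_s, target_rto_s, near_margin=0.10):
    """Classify a modeled user-visible window against a target RTO.

    near_margin is an assumed tolerance: windows within this fraction
    below the target are reported as "near" rather than "inside".
    """
    if target_rto_s <= 0:
        raise ValueError("target_rto_s must be positive")
    if user_visible_s > target_rto_s:
        return "beyond"
    if user_visible_s >= target_rto_s * (1 - near_margin):
        return "near"
    return "inside"
```

For example, under these assumptions a 30 second window against a 60 second target is "inside", while a 55 second window is "near".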
Technical Details:
Probe-based failover is governed by consecutive results. An unhealthy threshold of 3 does not mean three seconds; it means three failed probe attempts, each separated by the configured interval and each subject to timeout behavior. The first failed probe can occur immediately after failure or nearly one full interval later, which is why worst-case and average timing differ.
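The effect of probe alignment can be sketched in a few lines. The code assumes the three timing assumptions map to an alignment wait of zero, half an interval, or one full interval, and that the final failed probe costs one timeout. The names and the additive formula are illustrative, not taken from the calculator's source.

```python
# Assumed mapping from timing assumption to the fraction of an interval
# that passes before the first failed probe starts.
ALIGNMENT_FACTOR = {
    "first_failed_probe": 0.0,
    "average": 0.5,
    "worst_case": 1.0,
}


def detection_window(interval_s, timeout_s, unhealthy_threshold, assumption="worst_case"):
    """Seconds until the unhealthy threshold is crossed under the assumed model."""
    if unhealthy_threshold < 1:
        raise ValueError("unhealthy_threshold must be at least 1")
    alignment_wait = ALIGNMENT_FACTOR[assumption] * interval_s
    gaps_between_probes = (unhealthy_threshold - 1) * interval_s
    return alignment_wait + gaps_between_probes + timeout_s


if __name__ == "__main__":
    for name in ALIGNMENT_FACTOR:
        print(name, detection_window(10, 5, 3, name))
```

With a 10 second interval, 5 second timeout, and threshold of 3, this model gives 25, 30, and 35 seconds for the three assumptions, none of which is "three seconds".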
Connection draining and DNS exposure affect different audiences. New traffic can move to healthy backends while existing connections continue until the drain window ends. DNS TTL applies to clients and resolvers that already cached an endpoint answer before failover took effect.
Formula Core:
The simplified timing model adds probe detection, control-plane delay, and optional DNS exposure.
| Symbol | Meaning |
|---|---|
| a | Alignment wait: zero, half interval, or one interval depending on timing assumption. |
| u | Unhealthy threshold count. |
| t | Probe timeout. |
| i | Probe interval. |
| j | Jitter or automation buffer. |
| c, p, ttl | Reroute delay, authoritative propagation, and recursive DNS TTL. |
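One plausible way to combine these symbols is written out below. It is a reconstruction that reproduces the 25 second detection figure in the worked example, not a formula confirmed by the calculator. In particular, adding the buffer j to the detection term is an assumption.

```latex
\begin{align}
T_{\text{detect}} &= a + (u - 1)\,i + t + j \\
T_{\text{new}}    &= T_{\text{detect}} + c \\
T_{\text{user}}   &= T_{\text{new}} + p + \mathit{ttl}
  \quad \text{(DNS terms apply only when the path includes DNS)}
\end{align}
```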
With a 30 second interval, 5 second timeout, threshold of 2, worst-case alignment of 30 seconds, and 10 second reroute delay, detection is 30 + 30 + 5 seconds, or 65 seconds, and new traffic starts around 75 seconds. If DNS TTL is 60 seconds and authoritative propagation is 15 seconds, the modeled user-visible window becomes about 150 seconds.
Accuracy Notes:
The estimate models timing from configuration values. Real incidents can be affected by application-level health semantics, regional control-plane delays, resolver behavior, client retry policy, sticky sessions, TLS reuse, standby capacity, and whether all backends fail at once. Test failover in a controlled environment before relying on a modeled RTO.
Worked Example:
An application load balancer checks a backend every 10 seconds, waits 5 seconds for a response, and marks the backend unhealthy after 2 failed checks. Under the worst-case timing assumption, detection is 5 + 10 + 10 seconds, or 25 seconds. Add a 5 second reroute delay and new traffic can move at about 30 seconds.
If the public endpoint also depends on a DNS answer cached for 120 seconds, user-visible recovery can be much longer than backend failover. That is why DNS paths should be tested with resolver behavior, not just target-group health.
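The arithmetic in this example can be checked with a short script. It is a minimal sketch that assumes the additive worst-case model used above and treats authoritative propagation as zero, since the example does not give one. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverInputs:
    interval_s: float
    timeout_s: float
    unhealthy_threshold: int
    reroute_delay_s: float
    dns_ttl_s: float = 0.0
    dns_propagation_s: float = 0.0  # not stated in the example; assumed zero
    includes_dns: bool = False


def windows(cfg):
    """Worst-case windows in seconds under the assumed additive model."""
    detection = (
        cfg.interval_s                                    # alignment wait
        + (cfg.unhealthy_threshold - 1) * cfg.interval_s  # gaps between failed probes
        + cfg.timeout_s                                   # final probe timing out
    )
    new_traffic = detection + cfg.reroute_delay_s
    dns_exposure = (cfg.dns_propagation_s + cfg.dns_ttl_s) if cfg.includes_dns else 0.0
    return {
        "detection": detection,
        "new_traffic": new_traffic,
        "user_visible": new_traffic + dns_exposure,
    }


if __name__ == "__main__":
    backend_only = FailoverInputs(10, 5, 2, 5)
    with_dns = FailoverInputs(10, 5, 2, 5, dns_ttl_s=120, includes_dns=True)
    print(windows(backend_only))  # detection 25, new traffic 30
    print(windows(with_dns))      # user-visible 150
```

Under these assumptions the DNS path adds 120 seconds on top of a 30 second backend cutover, so the cached answer dominates the user-visible window.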
FAQ:
Why can DNS failover be slower than backend failover?
Backend failover changes where the load balancer sends new traffic. DNS failover also waits for cached DNS answers to age out, so users may continue using an old endpoint until their resolver re-queries.
Does connection drain delay new traffic?
Usually no. Drain time protects existing connections. New traffic can be routed elsewhere while old connections finish or time out.
Why not set the unhealthy threshold to 1?
A threshold of 1 detects faster, but a single slow probe, transient packet loss, or brief application pause can remove a healthy backend from service.
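A rough way to see this tradeoff: if each probe against a healthy backend is assumed to fail independently with some small probability, the chance that a run of consecutive probes all fail shrinks geometrically with the threshold. Independence is a simplifying assumption, because real probe failures are often correlated, and the function name is illustrative.

```python
def false_removal_probability(probe_failure_rate, unhealthy_threshold):
    """Chance that `unhealthy_threshold` consecutive probes all fail on a
    healthy backend, assuming independent failures at `probe_failure_rate`."""
    if not 0.0 <= probe_failure_rate <= 1.0:
        raise ValueError("probe_failure_rate must be between 0 and 1")
    if unhealthy_threshold < 1:
        raise ValueError("unhealthy_threshold must be at least 1")
    return probe_failure_rate ** unhealthy_threshold


if __name__ == "__main__":
    for threshold in (1, 2, 3):
        print(threshold, false_removal_probability(0.01, threshold))
```

With a 1% spurious failure rate, moving from a threshold of 1 to a threshold of 2 cuts the modeled false-removal chance per probe run from about 1 in 100 to about 1 in 10,000.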
What should I check if the RTO status is over target?
Compare the detection window, reroute delay, and DNS TTL separately. The largest contributor is usually the first setting to test and tune.
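Finding the largest contributor is a simple comparison once the components are listed separately. The helper below is a hypothetical sketch. The component names and values are illustrative inputs, not output from the calculator.

```python
def largest_contributor(components):
    """Return (name, share_of_total) for the biggest timing component.

    `components` maps a component name to its duration in seconds.
    """
    total = sum(components.values())
    if not components or total <= 0:
        raise ValueError("components must contain at least one positive duration")
    name = max(components, key=components.get)
    return name, components[name] / total


if __name__ == "__main__":
    name, share = largest_contributor(
        {"detection": 25, "reroute_delay": 5, "dns_ttl": 120}
    )
    print(f"{name} accounts for {share:.0%} of the modeled window")
```

In this illustrative case the DNS TTL accounts for most of the window, so shortening probe intervals would do little until the TTL is addressed.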