Load Balancer Failover Time Calculator
Estimate load balancer failover time from probe intervals, thresholds, reroute delay, DNS TTL, drain windows, and RTO scenario pressure.
Introduction:
Failover time is the user-facing delay between a failed serving path and a healthy destination taking over. In a load-balanced system, that delay is not just one health-check timer. It usually includes probe detection, routing updates, client connection behavior, and, when the public endpoint changes, DNS cache exposure.
Health checks decide when a backend should stop receiving new traffic. The interval controls how often the backend is tested, the timeout controls how long a probe waits for a response before it counts as failed, and the unhealthy threshold controls how many consecutive failures are required before the backend is treated as down. Faster checks reduce the detection window, but aggressive thresholds can remove a healthy instance during a short pause, a slow startup, packet loss, or a noisy deploy.
Backend failover and DNS failover belong in separate parts of the plan. Backend failover moves new traffic inside the same client endpoint, such as a listener, target group, backend pool, or service mesh route. DNS failover changes the address or name that resolvers return, so users can keep reaching an old endpoint until cached A, AAAA, or CNAME answers expire. A low time to live reduces that exposure, but it does not force every resolver, browser, or application cache to re-query at the same instant.
Connection draining adds another boundary. It can protect existing sessions while new traffic moves away, but it does not necessarily represent outage duration for new requests. A service may be healthy for new users and still have old connections attached to a backend that is being removed. That distinction matters during blue-green deployments, regional failover, maintenance drains, and incident reviews.
| Term | Practical meaning |
|---|---|
| Detection window | How long probes need to classify the backend as unhealthy. |
| Reroute delay | Time for the service to update routing after the unhealthy decision. |
| TTL exposure | Potential time for clients or resolvers to keep a cached DNS answer. |
| Connection drain | Grace time for existing connections, which may outlive the new-traffic cutover. |
A recovery time objective, or RTO, is a promise about how quickly service should recover from the user's point of view. Failover math can show whether probe settings, routing delays, and DNS choices are aligned with that promise. It cannot prove application correctness, database consistency, standby capacity, session behavior, or whether the health endpoint is checking the part of the service that actually matters.
How to Use This Tool:
Model the path that users actually experience, then compare the result with the service objective rather than only with the first failed health check.
- Pick the Provider preset closest to the production platform, or start from custom values when listener, target group, backend pool, or traffic-manager settings are known.
- Select Failover path as backend-only, DNS/GSLB endpoint, or backend plus DNS exposure.
Use backend-only when the load balancer keeps the same client endpoint; include DNS when A, AAAA, or CNAME answers can keep users on an old endpoint until cache expiry.
- Enter Health check interval, Probe timeout, Unhealthy threshold, Healthy threshold, Reroute update delay, and Existing connection drain. Keep timeout no longer than the interval unless the platform documents overlapping probes clearly.
- Add Recursive DNS TTL and Authoritative propagation when the selected path includes DNS or GSLB behavior.
The warning area flags TTL values above 300 seconds because DNS cache exposure can dominate the user-visible window.
- Set Target RTO so the summary can mark the modeled user-visible failover window as inside, near, or beyond the objective.
- Open Advanced to choose the failure timing assumption, successful probe response, and jitter or automation buffer.
- Review Timing Ledger, Provider Preset Review, Scenario Matrix, and Probe Sensitivity Chart before changing production thresholds.
Very short probe intervals with an unhealthy threshold of 1 may meet the RTO model while increasing false failover risk during short application pauses.
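The input checks described in the steps above can be sketched as a small review function. This is a minimal illustration, not the tool's own code: the `FailoverInputs` class, its field names, and the warning texts are hypothetical, and only the three conditions (timeout longer than interval, unhealthy threshold of 1, TTL above 300 seconds on a DNS path) come from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailoverInputs:
    """Hypothetical container for the calculator inputs (names are assumptions)."""
    interval_s: float
    timeout_s: float
    unhealthy_threshold: int
    dns_ttl_s: float = 0.0
    includes_dns: bool = False


def review_inputs(cfg: FailoverInputs) -> List[str]:
    """Return human-readable warnings for risky settings."""
    warnings: List[str] = []
    # Timeout should not exceed the interval unless overlapping probes are documented.
    if cfg.timeout_s > cfg.interval_s:
        warnings.append("Probe timeout is longer than the interval.")
    # A threshold of 1 reacts to a single failed probe, which raises false-failover risk.
    if cfg.unhealthy_threshold <= 1:
        warnings.append("Unhealthy threshold of 1 can remove a healthy backend.")
    # TTL only matters when the failover path includes DNS or GSLB behavior.
    if cfg.includes_dns and cfg.dns_ttl_s > 300:
        warnings.append("DNS TTL above 300 seconds can dominate the user-visible window.")
    return warnings


if __name__ == "__main__":
    risky = FailoverInputs(interval_s=5, timeout_s=10, unhealthy_threshold=1,
                           dns_ttl_s=600, includes_dns=True)
    for message in review_inputs(risky):
        print(message)
```

A backend-only configuration with a 10 second interval, 5 second timeout, and threshold of 2 produces no warnings under these checks.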
Interpreting Results:
The new-traffic window is the estimated time until new requests can be steered away from the failed backend. The user-visible failover window can be longer when DNS TTL, authoritative propagation, or another endpoint-level step is part of the path.
| Result | How to read it |
|---|---|
| Detection | Probe timeout, interval spacing, timing alignment, and any jitter buffer needed to cross the unhealthy threshold. |
| New traffic | Detection plus reroute delay, before old connections are fully drained. |
| User-visible failover | New traffic plus authoritative propagation and recursive DNS TTL when the selected path includes DNS. |
| RTO status | A comparison between the modeled user-visible window and the target objective. |
| Recovery duration | Estimated time for an unhealthy backend to become healthy again after successful probes. |
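The RTO status row can be illustrated with a simple comparison. The sketch below is an assumption about how such a classification might work: the function name, the labels, and especially the `near_margin` cutoff of 90 percent are hypothetical, since the text does not state where "near" begins.

```python
def rto_status(user_visible_s: float, target_rto_s: float,
               near_margin: float = 0.9) -> str:
    """Classify a modeled user-visible window against a target RTO.

    near_margin is an assumed cutoff: windows at or above this fraction of
    the target, but not over it, are reported as "near".
    """
    if target_rto_s <= 0:
        raise ValueError("target_rto_s must be positive")
    if user_visible_s > target_rto_s:
        return "beyond"
    if user_visible_s >= near_margin * target_rto_s:
        return "near"
    return "inside"


if __name__ == "__main__":
    # A 35 second window against a 60 second objective.
    print(rto_status(35, 60))
    # A 185 second window against the same objective.
    print(rto_status(185, 60))
```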
Technical Details:
Probe-based failover is governed by consecutive results. An unhealthy threshold of 3 does not mean three seconds; it means three failed probe attempts, each with timeout behavior and interval spacing. The first failed probe can occur immediately after failure, about halfway into an interval on average, or nearly one full interval later, which is why first-failed-probe, average-case, and worst-case timing produce different windows.
Connection draining and DNS exposure affect different audiences. New traffic can move to healthy backends while existing connections continue until the drain window ends. DNS TTL applies to clients and resolvers that already cached an endpoint answer before failover took effect, while authoritative propagation represents the time for the traffic-management layer to publish or converge on the new answer.
Formula Core:
The timing model adds probe detection, reroute delay, and optional DNS exposure. For backend-only paths, the DNS terms are omitted from the user-visible window.
| Symbol | Meaning |
|---|---|
| a | Alignment wait: zero, half interval, or one interval depending on timing assumption. |
| u | Unhealthy threshold count. |
| t | Probe timeout. |
| i | Probe interval. |
| j | Jitter or automation buffer. |
| c, p, ttl | Reroute delay, authoritative propagation, and recursive DNS TTL. |
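Read against the worked numbers in this section, the symbols above appear to combine as follows. This is a reconstruction rather than a quoted formula; the result names D (detection window), N (new-traffic window), and V (user-visible window) are introduced here for illustration.

```latex
\begin{aligned}
D &= a + u\,t + (u - 1)\,i + j \\
N &= D + c \\
V &= N + p + \mathit{ttl}
\end{aligned}
```

For backend-only paths, p and ttl are treated as zero, so V equals N.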
With a 30 second interval, 5 second timeout, unhealthy threshold of 2, worst-case alignment of 30 seconds, no jitter buffer, and a 10 second reroute delay, the detection window is 70 seconds (30 seconds of alignment, two 5 second timeouts, and one 30 second interval between the failed probes) and new traffic moves at about 80 seconds. If the same failover also includes a 60 second DNS TTL and 15 seconds of authoritative propagation, the modeled user-visible window becomes about 155 seconds.
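The arithmetic in the paragraph above can be checked with a short sketch. The function and parameter names are hypothetical; the structure (alignment wait, one timeout per failed probe, one interval between consecutive probes, then reroute delay and DNS terms) is inferred from the example's numbers.

```python
def detection_window(alignment_s: float, unhealthy_threshold: int,
                     timeout_s: float, interval_s: float,
                     jitter_s: float = 0.0) -> float:
    """Time for probes to classify a backend as unhealthy."""
    # One timeout per failed probe, one interval between consecutive probes.
    timeouts = unhealthy_threshold * timeout_s
    gaps = (unhealthy_threshold - 1) * interval_s
    return alignment_s + timeouts + gaps + jitter_s


def new_traffic_window(detection_s: float, reroute_s: float) -> float:
    """Detection plus the routing update delay."""
    return detection_s + reroute_s


def user_visible_window(new_traffic_s: float, propagation_s: float = 0.0,
                        ttl_s: float = 0.0) -> float:
    """New-traffic window plus DNS terms; leave both at zero for backend-only paths."""
    return new_traffic_s + propagation_s + ttl_s


if __name__ == "__main__":
    detection = detection_window(alignment_s=30, unhealthy_threshold=2,
                                 timeout_s=5, interval_s=30)
    new_traffic = new_traffic_window(detection, reroute_s=10)
    visible = user_visible_window(new_traffic, propagation_s=15, ttl_s=60)
    print(detection, new_traffic, visible)  # 70 80 155
```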
| Timing choice | Rule used | Planning effect |
|---|---|---|
| First failed probe | Adds no alignment wait. | Useful for lab traces that start at the first observed failed check. |
| Average case | Adds half of the probe interval. | Useful for expected-case planning when failure can occur anywhere in the interval. |
| Worst case | Adds one full probe interval. | Useful for customer promises and conservative RTO checks. |
| Recovery duration | Uses successful response time, healthy threshold, interval spacing, reroute delay, and jitter. | Useful during restarts and deployments, not just incidents. |
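The timing choices in the table map naturally onto a small lookup. The sketch below uses hypothetical names throughout. The alignment factors (zero, half, one interval) come from the table; the recovery formula is an assumption that mirrors the detection structure, because the text lists its inputs but not how they combine.

```python
from enum import Enum


class Timing(Enum):
    FIRST_FAILED_PROBE = "first-failed-probe"
    AVERAGE_CASE = "average-case"
    WORST_CASE = "worst-case"


# Fraction of one probe interval added as alignment wait.
_ALIGNMENT_FACTOR = {
    Timing.FIRST_FAILED_PROBE: 0.0,
    Timing.AVERAGE_CASE: 0.5,
    Timing.WORST_CASE: 1.0,
}


def alignment_wait(timing: Timing, interval_s: float) -> float:
    """Wait between the actual failure and the first failed probe."""
    return _ALIGNMENT_FACTOR[timing] * interval_s


def recovery_duration(healthy_threshold: int, response_s: float,
                      interval_s: float, reroute_s: float = 0.0,
                      jitter_s: float = 0.0) -> float:
    """Assumed recovery model: one successful response per required probe,
    one interval between consecutive probes, then reroute delay and jitter."""
    probes = healthy_threshold * response_s
    gaps = (healthy_threshold - 1) * interval_s
    return probes + gaps + reroute_s + jitter_s


if __name__ == "__main__":
    for timing in Timing:
        print(timing.value, alignment_wait(timing, interval_s=10))
    print(recovery_duration(healthy_threshold=3, response_s=1,
                            interval_s=10, reroute_s=5))
```

Keeping the alignment rule separate from the rest of the model makes it easy to report all three windows side by side without changing any probe settings.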
Accuracy Notes:
The estimate models timing from configuration values. Real incidents can be affected by application-level health semantics, regional control-plane delays, resolver behavior, client retry policy, sticky sessions, TLS reuse, standby capacity, and whether all backends fail at once. Test failover in a controlled environment before relying on a modeled RTO, and verify that the health endpoint fails when the user-facing dependency fails.
Advanced Tips:
- Use Worst case from actual failure for commitments made to users or customers; first-failed-probe timing is better for lab traces that start after the first failed check.
- Keep Probe timeout no longer than the interval unless the provider documents overlapping checks clearly.
- Compare New traffic failover and Existing connection drain separately; drain protects sessions but can keep old clients attached to a degraded backend.
- Model DNS/GSLB paths with the TTL on the failover record itself, not a static zone default or an unrelated TXT record.
- Use Jitter and automation buffer for external monitors, scheduler delays, manual approval steps, or scripted failover hooks that sit outside native target removal.
- Use the Scenario Matrix to compare fast, balanced, and conservative settings before lowering thresholds in production.
- Treat Over RTO or Well over RTO as a test-plan trigger: validate standby capacity, health endpoint semantics, DNS resolver behavior, and client retry policy before tightening production probes.
Worked Example:
An application load balancer checks a backend every 10 seconds, waits 5 seconds for a response, and marks the backend unhealthy after 2 failed checks. Under the worst-case timing assumption, detection is 5 + 5 + 10 + 10, or 30 seconds: two timeouts, one interval between the failed probes, and one full interval of alignment wait. Add a 5 second reroute delay and new traffic can move at about 35 seconds.
If the public endpoint also depends on a DNS answer cached for 120 seconds, user-visible recovery can be much longer than backend failover. With 30 seconds of authoritative propagation added to the same example, the modeled user-visible window becomes about 185 seconds. DNS paths should be tested with resolver behavior, not just target-group health.
FAQ:
Why can DNS failover be slower than backend failover?
Backend failover changes where the load balancer sends new traffic. DNS failover also waits for cached DNS answers to age out, so users may continue using an old endpoint until their resolver re-queries.
Does connection drain delay new traffic?
Usually no. Drain time protects existing connections. New traffic can be routed elsewhere while old connections finish or time out.
Why not set the unhealthy threshold to 1?
A threshold of 1 detects faster, but a single slow probe, transient packet loss, or brief application pause can remove a healthy backend from service.
What should I check if the RTO status is over target?
Compare the detection window, reroute delay, and DNS TTL separately. The largest contributor is usually the first setting to test and tune.
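One way to make that comparison concrete is to list each contributor and pick the largest. This is a minimal sketch with hypothetical names; the sample values are the ones from the worked example (30 seconds of detection, 5 of reroute delay, 30 of authoritative propagation, and a 120 second TTL).

```python
from typing import Dict, Tuple


def largest_contributor(detection_s: float, reroute_s: float,
                        propagation_s: float = 0.0,
                        ttl_s: float = 0.0) -> Tuple[str, float]:
    """Return the name and size of the biggest term in the user-visible window."""
    parts: Dict[str, float] = {
        "detection": detection_s,
        "reroute delay": reroute_s,
        "authoritative propagation": propagation_s,
        "DNS TTL": ttl_s,
    }
    # max over the keys, ranked by each contributor's duration.
    name = max(parts, key=lambda key: parts[key])
    return name, parts[name]


if __name__ == "__main__":
    name, seconds = largest_contributor(detection_s=30, reroute_s=5,
                                        propagation_s=30, ttl_s=120)
    print(f"Largest contributor: {name} ({seconds} s)")
```

In that example the DNS TTL accounts for 120 of the 185 modeled seconds, so tightening probe settings alone would not bring the window under a short objective.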