Load balancer failover time inputs:

Provider preset: Start from a known platform profile, then replace values with your listener, target group, or traffic-manager settings.
Failover path: Use backend target only for ALB/NLB/backend-pool routing; use DNS/GSLB when the endpoint address itself changes.
Health-check interval (sec): Shorter intervals reduce detection time but increase probe traffic and sensitivity to transient stalls.
Probe timeout (sec): Keep timeout no longer than the interval for most probe designs; the calculator warns when they overlap.
Unhealthy threshold (fails): Lower values fail over faster but can mark healthy backends down during a short application pause.
Healthy threshold (passes): Recovery time matters during deployments and rolling restarts, even when failover is the main concern.
Reroute delay (sec): Use zero for simple in-LB target removal; add provider processing time for GSLB, automation, or external monitors.
Connection drain (sec): This does not delay new-traffic failover, but it can keep old sessions tied to a failing backend.
Recursive DNS TTL (sec): Use the actual TTL on the failover A/AAAA/CNAME record, not a static TXT or unrelated zone value.
Authoritative propagation (sec): Use a normal or upper-bound propagation estimate from your traffic-management provider.
Target RTO (sec): Compare the modeled failover window with the recovery objective promised for this service.
Timing assumption: Use worst case for user-facing RTO commitments; use first failed probe for lab timing from the first failing check.
Successful probe response (sec): TCP probes can be near zero; HTTP probes should use a realistic health endpoint response time.
Jitter buffer (sec): Keep this small for native load balancers; increase it for external health monitors or scripted failover.

Introduction:

Failover time is the delay between a backend becoming unusable and users reliably reaching a healthy destination. In a load-balanced service, that delay is rarely a single timer. It is usually the sum of health-check detection, control-plane rerouting, client connection behavior, and sometimes DNS cache exposure.

Health checks are the starting signal. A probe interval controls how often a backend is tested, a timeout defines how long a probe can wait, and an unhealthy threshold says how many consecutive failures are needed before traffic should move away. Short intervals detect failures faster, but they also make false positives more likely when an application has brief pauses, network jitter, slow startup, or maintenance windows.
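The consecutive-result behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not any provider's implementation: the `ProbeTracker` class, its field names, and the sample probe sequence are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    """Tracks consecutive probe results for one backend (hypothetical helper)."""

    unhealthy_threshold: int = 3
    healthy_threshold: int = 2
    healthy: bool = True
    _fails: int = 0
    _passes: int = 0

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the backend's health state."""
        if probe_ok:
            self._passes += 1
            self._fails = 0  # any success resets the failure streak
            if not self.healthy and self._passes >= self.healthy_threshold:
                self.healthy = True
        else:
            self._fails += 1
            self._passes = 0  # any failure resets the success streak
            if self.healthy and self._fails >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy


def states(tracker: ProbeTracker, results: list[bool]) -> list[bool]:
    """Feed a sequence of probe results and collect the state after each one."""
    return [tracker.record(r) for r in results]


# One transient failure, then a real outage of three probes, then recovery.
results = [True, False, True, False, False, False, True, True]
strict = states(ProbeTracker(unhealthy_threshold=3), results)
eager = states(ProbeTracker(unhealthy_threshold=1), results)
print(strict)  # [True, True, True, True, True, False, False, True]
print(eager)   # [True, False, False, False, False, False, False, True]
```

With a threshold of 3 the single transient failure is ignored, while a threshold of 1 removes the backend on the first failed probe and keeps it out until two consecutive passes arrive.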

Diagram showing client, DNS, load balancer, backend, probe failures, reroute, and TTL exposure.

Backend failover and DNS failover should be kept separate in planning. Backend failover moves new traffic inside the load balancer or target group. DNS failover changes the answers returned to resolvers, so users may keep using an old endpoint until cached records expire. A low time to live can reduce that exposure, but it cannot force every resolver or client to re-query at the same instant.

| Term | Practical meaning |
| --- | --- |
| Detection window | How long probes need to classify the backend as unhealthy. |
| Reroute delay | Time for the service to update routing after the unhealthy decision. |
| TTL exposure | Potential time for clients or resolvers to keep a cached DNS answer. |
| Connection drain | Grace time for existing connections, which may outlive the new-traffic cutover. |

A recovery time objective, or RTO, is a promise about how quickly service should be restored from the user's point of view. Failover math helps test whether probe settings and DNS choices are aligned with that promise. It does not prove application correctness, data consistency, capacity in the standby path, or whether the fallback can handle the full load.

How to Use This Tool:

  1. Pick the provider preset that is closest to the system being modeled, or start from custom values when your production settings are known.
  2. Select the failover path: backend-only, DNS/GSLB endpoint, or backend plus DNS exposure.
  3. Enter the health-check interval, probe timeout, unhealthy threshold, healthy threshold, reroute delay, and connection-drain time.
  4. For DNS-based paths, add the recursive DNS TTL and authoritative propagation time that users may experience after the backend decision.
  5. Set the target RTO so the result can show whether the modeled user-visible window is inside, near, or beyond the objective.
  6. Use the timing assumption, successful probe response, and jitter buffer to model worst-case, average, or first-failed-probe timing.
  7. Compare the scenario matrix and sensitivity charts before changing production thresholds, because faster probes can increase false-positive risk.

Interpreting Results:

The new-traffic window is the estimated time until new requests can be steered away from the failed backend. The user-visible failover window can be longer when DNS TTL, authoritative propagation, or other endpoint-level behavior is part of the path.

| Result | How to read it |
| --- | --- |
| Detection | Probe timeout and interval time needed to cross the unhealthy threshold. |
| New traffic | Detection plus reroute delay, before old connections are fully drained. |
| User-visible failover | New traffic plus DNS exposure when the selected path includes DNS. |
| RTO status | A comparison between the modeled user-visible window and the target objective. |
| Recovery duration | Estimated time for an unhealthy backend to become healthy again after successful probes. |
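The RTO status row can be pictured as a three-way comparison. The sketch below is an assumption about how such a check might look: the function name `rto_status` and the 10% "near" margin are hypothetical and are not taken from the calculator itself.

```python
def rto_status(user_visible_s: float, target_rto_s: float,
               near_margin: float = 0.10) -> str:
    """Classify a modeled window against a target RTO.

    near_margin is an assumed band: windows within the last 10% of the
    target count as "near" rather than comfortably "inside".
    """
    if target_rto_s <= 0:
        raise ValueError("target RTO must be positive")
    if user_visible_s > target_rto_s:
        return "beyond"
    if user_visible_s >= target_rto_s * (1 - near_margin):
        return "near"
    return "inside"


print(rto_status(50, 120))   # inside
print(rto_status(115, 120))  # near
print(rto_status(155, 120))  # beyond
```

Choosing the margin is a policy decision; a wider band flags configurations that leave little headroom for the factors the model does not capture.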

Technical Details:

Probe-based failover is governed by consecutive results. An unhealthy threshold of 3 does not mean three seconds; it means three failed probe attempts, each separated by the configured interval and each subject to timeout behavior. The first failed probe can occur immediately after failure or nearly one full interval later, which is why worst-case and average timing differ.
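The difference between the timing assumptions comes down to how long the failure waits for the next probe to start. A minimal sketch, assuming the three assumptions map to zero, half an interval, and one full interval; the function name and the assumption keys are hypothetical:

```python
def alignment_wait(interval_s: float, assumption: str) -> float:
    """Return the wait before the first failing probe starts, in seconds."""
    waits = {
        "first_failed_probe": 0.0,     # timing starts at the first failing check
        "average": interval_s / 2,     # failure lands mid-interval on average
        "worst_case": interval_s,      # failure lands just after a passing probe
    }
    if assumption not in waits:
        raise ValueError(f"unknown timing assumption: {assumption!r}")
    return waits[assumption]


for name in ("first_failed_probe", "average", "worst_case"):
    print(name, alignment_wait(30, name))
```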

Connection draining and DNS exposure affect different audiences. New traffic can move to healthy backends while existing connections continue until the drain window ends. DNS TTL applies to clients and resolvers that already cached an endpoint answer before failover took effect.

Formula Core:

The simplified timing model adds probe detection, control-plane delay, and optional DNS exposure.

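$$T_d = a + \left[\, u \times t + (u - 1) \times i \,\right] + j, \qquad T_v = T_d + c + p + \mathrm{ttl}$$

Here $T_d$ is the detection window and $T_v$ is the user-visible failover window; $T_d + c$ is the new-traffic window. For backend-only paths, $p$ and $\mathrm{ttl}$ are zero.

| Symbol | Meaning |
| --- | --- |
| a | Alignment wait: zero, half interval, or one interval depending on timing assumption. |
| u | Unhealthy threshold count. |
| t | Probe timeout. |
| i | Probe interval. |
| j | Jitter or automation buffer. |
| c, p, ttl | Reroute delay, authoritative propagation, and recursive DNS TTL. |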
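The simplified model translates directly into code. The following Python sketch implements the two formulas as written; the `FailoverInputs` class, its field names, and the sample values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverInputs:
    """Inputs to the simplified timing model, all in seconds except the count."""

    interval_s: float            # i
    timeout_s: float             # t
    unhealthy_threshold: int     # u
    alignment_s: float           # a
    jitter_s: float = 0.0        # j
    reroute_s: float = 0.0       # c
    propagation_s: float = 0.0   # p (zero for backend-only paths)
    ttl_s: float = 0.0           # ttl (zero for backend-only paths)


def detection_window(x: FailoverInputs) -> float:
    """T_d = a + [u*t + (u-1)*i] + j"""
    probes = (x.unhealthy_threshold * x.timeout_s
              + (x.unhealthy_threshold - 1) * x.interval_s)
    return x.alignment_s + probes + x.jitter_s


def new_traffic_window(x: FailoverInputs) -> float:
    """Detection plus reroute delay."""
    return detection_window(x) + x.reroute_s


def user_visible_window(x: FailoverInputs) -> float:
    """T_v = T_d + c + p + ttl"""
    return new_traffic_window(x) + x.propagation_s + x.ttl_s


sample = FailoverInputs(interval_s=15, timeout_s=3, unhealthy_threshold=3,
                        alignment_s=15, reroute_s=5,
                        propagation_s=10, ttl_s=30)
print(detection_window(sample))     # 54.0
print(new_traffic_window(sample))   # 59.0
print(user_visible_window(sample))  # 99.0
```

Keeping the three windows as separate functions makes it easy to see which stage dominates before any setting is changed.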

With a 30 second interval, 5 second timeout, threshold of 2, worst-case alignment of 30 seconds, no jitter buffer, and a 10 second reroute delay, detection is 30 + (2 × 5 + 1 × 30) = 70 seconds and new traffic starts moving at around 80 seconds. If DNS TTL is 60 seconds and authoritative propagation is 15 seconds, the modeled user-visible window becomes about 155 seconds.

Accuracy Notes:

The estimate models timing from configuration values. Real incidents can be affected by application-level health semantics, regional control-plane delays, resolver behavior, client retry policy, sticky sessions, TLS reuse, standby capacity, and whether all backends fail at once. Test failover in a controlled environment before relying on a modeled RTO.

Worked Example:

An application load balancer checks a backend every 10 seconds, waits 5 seconds for a response, and marks the backend unhealthy after 2 failed checks. Under the worst-case timing assumption and with no jitter buffer, detection is 10 + (2 × 5) + 10 seconds, or 30 seconds: one interval of alignment wait, two probe timeouts, and one interval between the two failed checks. Add a 5 second reroute delay and new traffic can move at about 35 seconds.

If the public endpoint also depends on a DNS answer cached for 120 seconds, user-visible recovery can be much longer than backend failover. That is why DNS paths should be tested with resolver behavior, not just target-group health.

FAQ:

Why can DNS failover be slower than backend failover?

Backend failover changes where the load balancer sends new traffic. DNS failover also waits for cached DNS answers to age out, so users may continue using an old endpoint until their resolver re-queries.

Does connection drain delay new traffic?

Usually no. Drain time protects existing connections. New traffic can be routed elsewhere while old connections finish or time out.

Why not set the unhealthy threshold to 1?

A threshold of 1 detects faster, but a single slow probe, transient packet loss, or brief application pause can remove a healthy backend from service.

What should I check if the RTO status is over target?

Compare the detection window, reroute delay, and DNS TTL separately. The largest contributor is usually the first setting to test and tune.
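One way to find the largest contributor is to rank the components of the modeled window. This is a small sketch with hypothetical names and sample values, not output from the calculator:

```python
def rank_contributors(components: dict[str, float]) -> list[tuple[str, float]]:
    """Sort timing components from largest to smallest, in seconds."""
    return sorted(components.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical breakdown of a modeled user-visible window.
breakdown = {
    "detection": 54.0,
    "reroute delay": 5.0,
    "authoritative propagation": 10.0,
    "DNS TTL": 30.0,
}

ranked = rank_contributors(breakdown)
total = sum(breakdown.values())
for name, seconds in ranked:
    # Show each component's share of the total window.
    print(f"{name}: {seconds:.0f} s ({seconds / total:.0%})")
```

In this sample the detection window dominates, so probe interval and unhealthy threshold would be the first settings to test, ahead of the DNS TTL.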