HA Failover Budget Calculator
Calculate online HA failover budgets from health-check cadence, promotion, traffic convergence, client cache delay, and RTO targets for cleaner recovery planning.
Introduction:
A failover budget is the recovery clock for a high-availability service. It starts when the failed path is detected and ends when users can reliably reach the replacement path. The budget matters because a service can have redundant nodes, replicas, and load balancers yet still miss its recovery time objective (RTO) if detection, promotion, routing, client retries, or validation take too long.
The useful estimate is a chain of delays, not a single magic number. A two-second health check with three required failures creates a six-second detection base before any standby promotion begins. A promoted node may still need fencing, state catch-up, traffic convergence, and a client retry or DNS cache tail before the visible outage clears. For a 60 second RTO, a 20 second client tail can be the difference between a comfortable plan and a tight one.
RTO and recovery point objective (RPO) answer different failure questions. RTO is about how long the service may be unavailable. RPO is about how much data freshness gap the business can tolerate. Replication lag may matter a great deal during a database handoff, but it should not be hidden inside the RTO clock unless the service truly cannot be declared recovered until that lag is resolved.
A budget estimate does not prove that failover automation is correct. It gives operators and service owners a concrete timing model to compare with tested failovers, synthetic probes, incident runbooks, and the stated recovery target. The safest use is to model the path, find the longest delay, and then confirm the result with a controlled failover test.
Technical Details:
High-availability failover timing is governed by serial waits. Monitoring must first decide that the old path is down. Control-plane or cluster logic then has to promote a replacement, fence the old owner when needed, and make the replacement safe to receive traffic. Routing, virtual IP movement, load-balancer health state, DNS steering, service discovery, and client retry behavior can add visible time after the backend is already healthy.
Health-check cadence is often the first lever because it creates a lower bound on detection time. If probes run every 2 seconds and the system requires 3 consecutive failures, the base detection delay is 6 seconds before jitter or timeout allowance. Tightening that cadence can help, but overly aggressive probes can create false failovers when a dependency is merely slow or briefly unreachable.
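The detection lower bound described above is simple multiplication plus the jitter allowance. A minimal sketch (the function name and signature are illustrative, not part of the calculator):

```python
def detection_base(interval_s: float, failed_checks: int, jitter_s: float = 0.0) -> float:
    """Lower bound on detection delay: the monitor must observe
    `failed_checks` consecutive misses at `interval_s` spacing,
    plus any jitter allowance, before failover can begin."""
    return interval_s * failed_checks + jitter_s

# 2 s probes, 3 required failures -> 6 s base before promotion starts
print(detection_base(2.0, 3))  # 6.0
```

Note that halving the interval halves only this term; it does nothing for promotion, convergence, or the client tail.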
The recovery clock should be compared with an RTO that was chosen for business impact, not only with the fastest lab result. A safety buffer reserves part of the RTO for real incident variance. With a 60 second RTO and a 10% buffer, the comfortable target is 54 seconds even though the formal RTO is still 60 seconds.
Formula Core:
The model adds each recovery phase in seconds, then compares the total against the RTO and the buffer-adjusted comfort target.
| Symbol | Meaning | Unit or rule |
|---|---|---|
| I | Health-check interval | Seconds, minimum 0.1. |
| F | Failed checks before failover | Whole count, minimum 1. |
| J | Detection jitter allowance | Non-negative seconds. |
| P, C, G | Promotion, state catch-up, and traffic convergence | Non-negative seconds each. |
| L, V, A | Client retry/cache tail, ready validation, and manual approval | Non-negative seconds each. |
| S | Safety buffer | 0% to 80%. |
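Using those symbols, the total and the buffer-adjusted comfort target reduce to straight addition. A minimal Python sketch of the model (function names are mine, not the calculator's):

```python
def failover_budget(interval: float, fails: int, jitter: float,
                    promotion: float, catchup: float, convergence: float,
                    client_tail: float, validation: float,
                    approval: float = 0.0) -> float:
    """Total modeled recovery time in seconds:
    detection (I*F + J) plus every serial phase P + C + G + L + V + A."""
    detection = interval * fails + jitter
    return detection + promotion + catchup + convergence + client_tail + validation + approval

def comfort_target(rto: float, buffer_pct: float) -> float:
    """Buffer-adjusted target: a 10% buffer on a 60 s RTO leaves 54 s."""
    return rto * (1.0 - buffer_pct / 100.0)

# Checkout-style path: 2 s checks x 3 fails, 10 s promotion,
# 6 s convergence, 20 s client tail, 5 s validation.
print(failover_budget(2.0, 3, 0.0, 10.0, 0.0, 6.0, 20.0, 5.0))  # 47.0
print(comfort_target(60.0, 10.0))  # 54.0
```

Because the phases are strictly serial in this model, the largest single term is always the best tuning candidate.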
Readiness Rules:
| Condition | Result meaning | Operator action |
|---|---|---|
| RTO = 0 | Phase timing is shown, but no fit check is made. | Set an RTO target before using the estimate as a commitment. |
| Total <= comfort target | The model clears the RTO and retains the requested buffer. | Attach recent failover-test evidence before relying on it. |
| Comfort target < Total <= RTO | The plan is inside RTO, but normal variance can consume the reserve. | Tune the longest phase or lower the buffer expectation. |
| Total > RTO | The modeled failover misses the target. | Use the tuning scenarios to find the first RTO-clearing change. |
| Detection >= 35% of total | Probe cadence and failed-check count dominate the start of recovery. | Review interval, threshold, timeout, and false-positive risk together. |
| Convergence + client tail >= 40% of total | User-visible recovery is dominated by routing, DNS, retry, or endpoint cache delay. | Measure from a client-side probe, not only from the promoted node. |
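The first four rows above can be read as a small classifier. This sketch uses illustrative status strings; the tool's exact badge wording may differ:

```python
def readiness(total: float, rto: float, buffer_pct: float) -> str:
    """Classify a modeled total against the RTO per the rules above."""
    if rto == 0:
        return "no fit check"            # phase timing only, no commitment
    comfort = rto * (1.0 - buffer_pct / 100.0)
    if total <= comfort:
        return "RTO clears"              # target met with buffer intact
    if total <= rto:
        return "buffer consumed"         # inside RTO, reserve gone
    return "RTO risk"                    # target missed

print(readiness(47.0, 60.0, 10.0))    # RTO clears (47 <= 54)
print(readiness(118.0, 120.0, 10.0))  # buffer consumed (108 < 118 <= 120)
print(readiness(137.0, 60.0, 10.0))   # RTO risk
```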
Replication lag is reported as data-freshness context rather than added to the recovery clock. That keeps RTO and RPO separate: a service can become reachable within its RTO while still carrying a data gap that requires a separate business decision.
Everyday Use & Decision Guide:
Start with a measured service path if you have one. Enter the service name, set the health-check interval and failed-check count from the monitor or load balancer, then add the measured seconds for promotion, state catch-up, traffic convergence, client retry or cache delay, and ready validation. Use 0 only when a phase genuinely does not apply, such as a stateless service with no catch-up work.
The presets are useful for a first pass when the exact numbers are still rough. Load balancer pool failover, VRRP or floating IP pair, database primary promotion, and DNS steering failover all emphasize different delays. Apply the closest preset, then replace the placeholder seconds with values from failover tests, logs, synthetic checks, or runbook timing.
- Use **RTO target** as the committed recovery target in seconds. Set it to **0** only when you want a phase ledger without an RTO judgment.
- Use **Safety buffer** when the production incident path is noisier than the lab path. A **10%** buffer means a **60** second RTO should fit within **54** seconds to be comfortable.
- Add **Manual approval delay** when a human gate is still part of the cutover. Leaving it out can make a manual failover look automated.
- Enter **Replication lag at failover** for RPO context. It will appear in readiness output, but it is not added to the RTO total.
- Check **Failover Tuning Scenarios** before changing monitors. Faster probes are not always the best fix if the longest phase is the client retry tail or a manual gate.
This estimate is a good fit for planning HA pairs, database failovers, VIP moves, service-discovery changes, DNS steering, and load-balancer health behavior. It is a poor fit for proving data correctness, quorum safety, split-brain prevention, or whether the replacement node will actually serve valid responses. Those need failover tests, logs, and application-level checks.
Treat RTO risk, buffer consumed, detection-heavy, and tail dominates as stop-and-verify cues. Before a recovery commitment goes into a runbook, compare the Failover Phase Ledger with at least one real failover or game-day run.
Step-by-Step Guide:
Build the estimate from detection through user-visible recovery, then compare it with the target and scenario output.
- Enter **Service name** so copied rows and JSON identify the workload being modeled.
- Set **Health-check interval** and **Failed checks before failover**. The summary badge should show the resulting detection time, including any later jitter allowance.
- Enter **Promotion and fencing time**, **State catch-up time**, **Traffic convergence time**, **Client retry or cache delay**, and **Ready validation time**. The headline **Failover budget** total updates as each phase changes.
- Set **RTO target** and, if needed, open **Advanced** to add **Safety buffer**. Watch whether the status remains **RTO clears** or moves to **RTO risk**.
- Use **Failover profile** only when you want to replace current values with a preset. After applying it, review every phase because the preset overwrites the timing inputs.
- Add **Detection jitter allowance**, **Manual approval delay**, and **Replication lag at failover** when those conditions are part of the real path. Blank or negative timing values are normalized to bounded non-negative values, so re-enter the measured number if a phase appears lower than expected.
- Review **Failover Phase Ledger** for elapsed timing, **RTO Readiness Ledger** for action cues, and **Failover Tuning Scenarios** for candidate improvements.
- Use **Phase Budget Chart**, **Detection Cadence Map**, and **JSON** when the estimate needs to be checked visually or carried into an incident review.
Interpreting Results:
Failover budget is the total modeled recovery time. Read it with the RTO badge, the status badge, and the longest-phase badge. A total that clears RTO can still deserve attention if the safety buffer is consumed or if one phase is much larger than the rest.
RTO clears does not prove the failover is safe. It only means the entered phase times fit the selected target. Confirm the result with failover-test evidence, client-side synthetic checks, and data-freshness review when replication lag is nonzero.
| Output cue | What it means | Useful follow-up |
|---|---|---|
| RTO risk | The total failover time is greater than the RTO target. | Open Failover Tuning Scenarios and find the first change that clears the target. |
| buffer consumed | The plan may be inside RTO but above the buffer-adjusted comfort target. | Reduce the longest phase or document why the reserve is acceptable. |
| detection-heavy | Probe interval, failed-check count, and jitter take at least 35% of the total. | Use Detection Cadence Map and check false-positive risk before tightening probes. |
| tail dominates | Traffic convergence plus client retry or cache delay takes at least 40% of the total. | Test from outside the load balancer, DNS resolver, or client pool that users actually hit. |
| data freshness gap modeled | Replication lag is present as RPO context but not part of RTO time. | Compare the lag with the service's RPO before declaring the plan acceptable. |
The scenario rows are comparisons, not promises. A pre-promoted standby or shorter client retry tail is only useful if the system can actually be changed that way without creating a split-brain, stale-data, or false-failover risk.
Worked Examples:
Checkout API with a 60 second RTO:
The default checkout-style path uses a 2 second health-check interval, 3 failed checks, 10 seconds for promotion and fencing, 6 seconds for traffic convergence, 20 seconds for client retry or cache delay, and 5 seconds for ready validation. The Failover budget is 47.0 sec, which clears a 60 second RTO by 13.0 sec. RTO Readiness Ledger still points to the client retry or cache tail as the longest phase, so a client-side synthetic check is the best confidence test.
Database primary promotion with a tight buffer:
The database preset models 5 second checks, 3 failed checks, 3 seconds of detection jitter, 35 seconds of promotion, 20 seconds of catch-up, 10 seconds of convergence, 15 seconds of client tail, and 20 seconds of validation. The total is 118.0 sec against a 120 second RTO. The formal target clears by 2.0 sec, but a 10% safety buffer sets a 108 second comfort target, so Safety buffer reports the reserve as consumed.
Manual gate that breaks an otherwise good plan:
A service can look healthy on technical timing and still miss RTO after a human approval delay is added. Starting from the checkout example, adding 90 seconds of Manual approval delay raises the total to 137.0 sec against the same 60 second RTO. The status moves to RTO risk, the longest phase becomes Manual approval delay, and the practical fix is to move approval earlier, automate the gate, or change the recovery commitment.
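All three worked-example totals can be re-derived by straight addition of the stated phase times:

```python
# Checkout: 2 s checks x 3 fails, 10 s promotion, 6 s convergence,
# 20 s client tail, 5 s validation.
checkout = 2 * 3 + 10 + 6 + 20 + 5

# Database: 5 s checks x 3 fails, 3 s jitter, 35 s promotion,
# 20 s catch-up, 10 s convergence, 15 s client tail, 20 s validation.
database = 5 * 3 + 3 + 35 + 20 + 10 + 15 + 20

# Manual gate: the checkout path plus a 90 s human approval delay.
manual = checkout + 90

print(checkout)  # 47  -> clears a 60 s RTO by 13 s
print(database)  # 118 -> clears a 120 s RTO by 2 s, misses the 108 s comfort target
print(manual)    # 137 -> RTO risk against the same 60 s target
```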
FAQ:
What is included in the failover budget?
The total includes health-check detection, optional detection jitter, promotion and fencing, state catch-up, traffic convergence, client retry or cache delay, ready validation, and optional manual approval delay.
Why is replication lag not added to the RTO total?
Replication lag is shown as RPO context because it describes data freshness, not elapsed recovery time. It still matters for the recovery decision, but mixing it into the RTO clock can hide whether the time target and data-loss target are being met separately.
Why did tightening health checks not fix the RTO miss?
Health-check tuning only reduces the detection part of the total. If Longest phase points to client retry or cache delay, manual approval, promotion, or validation, reducing probe interval will not remove the main delay.
What should I do if a blank or negative entry gives a surprising result?
Re-enter the measured timing value in seconds. The numeric fields are bounded: timing phases cannot go below zero, failed checks are rounded to a whole count of at least one, and the safety buffer is limited to 0% through 80%.
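The bounding rules in this answer can be sketched as a normalization pass. This is a reading of the stated rules, not the tool's actual code; the function name and argument shape are mine:

```python
def normalize_inputs(interval, fails, phases, buffer_pct):
    """Clamp raw inputs as the FAQ describes: interval floor of 0.1 s,
    failed checks rounded to a whole count of at least 1, timing phases
    floored at zero, and safety buffer limited to 0-80%."""
    interval = max(0.1, float(interval or 0))           # minimum 0.1 s
    fails = max(1, round(float(fails or 0)))            # whole count >= 1
    phases = [max(0.0, float(p or 0)) for p in phases]  # non-negative seconds
    buffer_pct = min(80.0, max(0.0, float(buffer_pct or 0)))
    return interval, fails, phases, buffer_pct

print(normalize_inputs(-2, 0.4, [10, -5, None], 95))
# (0.1, 1, [10.0, 0.0, 0.0], 80.0)
```

A clamped value silently replacing a typo is exactly why the FAQ recommends re-entering the measured number when a phase looks lower than expected.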
Can a result that clears RTO still be unsafe?
Yes. RTO clears only checks the entered timing model. It does not prove fencing safety, quorum behavior, data consistency, application readiness, DNS behavior at every resolver, or client retry behavior under a real incident.
Glossary:
- RTO
- Recovery time objective, the maximum acceptable delay before service is restored.
- RPO
- Recovery point objective, the maximum acceptable data freshness gap after a failure.
- Health-check interval
- The time between probes that decide whether the old service path is unhealthy.
- Failed checks
- The consecutive failed probes required before the system starts failover.
- Traffic convergence
- The delay before routing, VIP, DNS, load-balancer, or service-discovery changes send traffic to the replacement path.
- Client retry or cache tail
- The user-visible delay caused by DNS TTLs, stale endpoint caches, connection retry backoff, or pool refresh behavior.
- Safety buffer
- The part of the RTO reserved for test variance, incident noise, or operator overhead.
- Manual approval delay
- Human confirmation or escalation time that occurs before cutover can finish.
References:
- Define recovery objectives for downtime and data loss, AWS Well-Architected Framework, 2022-03-31.
- Service Level Objectives, Google SRE Book.
- Health checks for Application Load Balancer target groups, AWS Elastic Load Balancing.
- Time to Live (TTL), Cloudflare DNS Docs.