Replication Backlog Time Calculator
Calculate online replication backlog drain time from change rate, capacity, outage, compression, and headroom so recovery plans catch target misses.
Introduction:
Replication backlog is the change data that has not yet reached a replica, or has not yet been flushed or replayed there. It grows during outages, link interruptions, overloaded apply workers, throttled storage, and maintenance windows where writes continue while replication cannot keep up. The planning question is not just how many bytes are waiting. The more useful question is whether the recovery path has enough spare throughput to drain those bytes while new changes keep arriving.
That spare throughput is the difference between usable replication capacity and the incoming replicated change rate. A replica with 500 Mbps of usable capacity and 140 Mbps of continuing replicated load has 360 Mbps of net drain for backlog recovery. A replica with 500 Mbps of usable capacity and 540 Mbps of continuing replicated load does not catch up at all, because every recovery second adds more work than it removes.
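The convergence check above can be sketched in a few lines of Python. The function name and values are illustrative, not the calculator's internals:

```python
def net_drain_mbps(usable_capacity_mbps: float, replicated_load_mbps: float) -> float:
    """Spare throughput left for backlog drain; negative means falling behind."""
    return usable_capacity_mbps - replicated_load_mbps

print(net_drain_mbps(500, 140))  # 360 Mbps of net drain: backlog can fall
print(net_drain_mbps(500, 540))  # -40 Mbps: the replica never catches up
```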
Backlog time is closely tied to recovery point objective, recovery time objective, and reader staleness. A large backlog can be acceptable if it drains before dependent systems need current data. A smaller backlog can be a real incident if the replica keeps falling farther behind, if stale reads are exposed to users, or if failover would lose changes that have not reached the recovery site.
The estimate remains a planning model. It does not prove that replication is healthy, that every byte has the same apply cost, or that a single average rate will hold through a noisy recovery period. It is best used beside live replication counters, database or storage apply metrics, and a short measured catch-up test when the window is tight.
Technical Details:
Asynchronous replication usually has two separate timing concerns. First, changes must be shipped from the source to the replica. Second, the replica must write, flush, or replay those changes before they become useful. A backlog can therefore come from a network lane that cannot send fast enough, a target that cannot apply fast enough, or a maintenance window where replication stopped while writes continued.
Byte lag and time lag are related but not identical. A database can report byte distance between source and replica positions, while another monitor reports the age of the transaction being replayed. Byte lag is easier to model from throughput because it behaves like a drainable queue. Time lag is better for user impact because it describes how stale the replica may be. A throughput model connects those views by estimating both the modeled backlog and its write-time equivalent.
The core condition is convergence. Replication catches up only when usable capacity is greater than the incoming replicated change rate. Compression or dedupe reduces the replicated change rate when the source rate is logical bytes, while protocol overhead and an optional apply ceiling reduce usable capacity.
Formula Core:
The model converts all rates to Mbps and all backlog sizes to bytes, then uses net drain to calculate catch-up duration. Using the symbols from the variable guide below:

Rin = Rlogical / Cratio
Cusable = Craw × (1 − overhead), capped by the apply ceiling when one is entered
Btotal = (Bcurrent + Rin × Toutage) × (1 + s)
Rnet = Cusable − Rin
Tcatchup = Btotal / Rnet

In the capacity formula, the apply ceiling is used only when a positive ceiling is entered. If no separate apply ceiling is entered, usable capacity is the raw replication capacity after protocol overhead. Catch-up time is finite only when Rnet is greater than zero.
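As a hedged sketch, the arithmetic described in this section can be written out in Python. The function name, argument names, and defaults below are illustrative, not the calculator's own code:

```python
MBPS_TO_BYTES_PER_SEC = 1e6 / 8   # 1 Mbps = 125,000 bytes per second
GIB = 2**30                        # backlog counters here use binary GiB

def catchup_hours(r_logical_mbps, c_ratio, raw_capacity_mbps,
                  overhead_frac=0.0, apply_ceiling_mbps=None,
                  outage_min=0.0, current_backlog_gib=0.0, buffer_frac=0.0):
    # Incoming replicated load after compression/dedupe (Cratio >= 1.0).
    r_in = r_logical_mbps / c_ratio
    # Usable capacity: raw lane minus protocol overhead, optionally
    # capped by a positive replica apply ceiling.
    c_usable = raw_capacity_mbps * (1 - overhead_frac)
    if apply_ceiling_mbps and apply_ceiling_mbps > 0:
        c_usable = min(c_usable, apply_ceiling_mbps)
    # Total modeled backlog: existing bytes plus outage-generated bytes,
    # scaled by the safety buffer.
    outage_bytes = r_in * MBPS_TO_BYTES_PER_SEC * outage_min * 60
    backlog_bytes = (current_backlog_gib * GIB + outage_bytes) * (1 + buffer_frac)
    # Net drain; catch-up is finite only when it is positive.
    r_net = c_usable - r_in
    if r_net <= 0:
        return float("inf")  # the replica cannot converge
    return backlog_bytes / (r_net * MBPS_TO_BYTES_PER_SEC) / 3600
```

With the inputs from the first worked example (180 Mbps, 1.30x ratio, 500 Mbps capacity, 90 minute outage, 120 GiB backlog), this sketch returns roughly 1.37 hours, matching the result quoted later on this page.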
Variable Guide:
| Symbol | Meaning | Why it changes the result |
|---|---|---|
| Rlogical | Incoming change rate before compression or dedupe. | Higher source write volume creates more replicated work during outage and catch-up. |
| Cratio | Compression or dedupe ratio, with 1.0 meaning no reduction. | A larger ratio lowers replicated bytes when the source rate is logical data. |
| Cusable | Capacity left for replication after overhead and any target-side apply limit. | This must exceed continuing replicated load before backlog can fall. |
| Bcurrent | Existing bytes already waiting before the modeled outage window. | Existing backlog adds directly to the outage backlog. |
| s | Safety buffer as a decimal share. | Adds reserve for bursts, retransmits, metadata, or uncertain counters. |
| Rnet | Usable capacity minus incoming replicated change rate. | Positive values drain backlog; zero or negative values mean the replica cannot converge. |
Validation and Boundary Rules:
| Input | Accepted range or rule | Practical effect |
|---|---|---|
| Incoming change rate | Zero or greater, in Mbps, MB/s, Gbps, or GiB/hr. | Zero removes continuing write pressure and makes only existing backlog matter. |
| Replication capacity | Greater than zero, in the same supported rate units. | A nonpositive value blocks the estimate because no drain rate can be formed. |
| Replication outage | Zero or greater, in minutes, hours, or days. | Set to zero when modeling a measured backlog without adding outage-generated bytes. |
| Current backlog | Zero or greater, in GiB, TiB, GB, or TB. | Use the unit family that matches the source counter. |
| Compression or dedupe ratio | At least 1.0. | Values below 1.0 are rejected because they would inflate data through a reduction field. |
| Protocol overhead | 0% to 95%. | Reduces raw capacity before comparing it with continuing replicated load. |
| Backlog safety buffer | 0% to 300%. | Scales total modeled backlog upward after existing and outage backlog are combined. |
| Steady-state warning | 1% to 100%. | Sets the utilization point where normal replication headroom is flagged as thin. |
The model uses deterministic arithmetic and simple conversions. It does not simulate bursty write patterns, queue scheduling, database lock conflicts, storage stalls, checksum cost, or multi-replica fan-out. If actual replication lag is high while network lag is low, target-side apply or replay is often the next place to inspect rather than raw network bandwidth alone.
Everyday Use & Decision Guide:
Begin with measured counters when you have them. Use sustained write or WAL generation for Incoming change rate, not a short spike. Use a proven send, receive, or apply rate for Replication capacity, not the label on a network interface. If the replica's storage or database apply path is slower than the link, put that lower rate in Replica apply ceiling.
The strongest first pass is conservative. Keep Compression or dedupe ratio at 1.0 unless your counters are logical bytes and you have evidence for the reduction. Add Protocol overhead only when the entered capacity is raw rather than already usable. Add a Backlog safety buffer when the outage includes bursty writes, retransmits, or uncertain measurements.
- Set `Replication outage` to the time changes accumulated without normal replication. Use `0` when the only source of backlog is a measured byte counter.
- Use `Catch-up target` when a runbook, maintenance window, recovery time objective, or reader freshness promise needs a fit or miss answer.
- Keep `Steady-state warning` near your normal headroom policy. The default `80%` flags cases where ordinary write load is already consuming most of the usable lane.
- Add `Catch-up start time` only when a projected completion timestamp helps with handoff or incident notes.
A good fit is a planned outage, failback window, replica rebuild, database reader recovery, storage mirror repair, or disaster recovery drill where you can estimate change rate and capacity. A poor fit is a live incident with unknown write bursts, changing throttles, or missing apply counters. In that case, use the result as a rough bound and keep measuring the actual drain rate during recovery.
Do not treat a short Catch-up time as proof that the system is safe to fail over. Read Net backlog drain, Steady-state utilization, Primary bottleneck, and RPO pressure before making the call. A positive drain estimate still needs live evidence that source, network, and replica apply metrics are moving in the same direction.
Step-by-Step Guide:
Work from backlog creation first, then add recovery capacity and target checks.
- Enter `Incoming change rate` and choose the unit that matches the source counter. If the red validation box says the rate is negative, correct that field before reading the result.
- Enter `Replication capacity` as the sustained usable lane. The estimate requires this value to be greater than zero.
- Set `Replication outage` and `Current backlog`. The `Total modeled backlog` row will show existing backlog plus the bytes generated during the outage.
- Set `Compression or dedupe ratio`. Leave it at `1.0` when rate and backlog counters already describe transferred replication bytes.
- Add `Catch-up target` if a deadline matters. A target of `0` turns off target checks and keeps the headline focused on raw drain time.
- Open `Advanced` for `Workload name`, `Protocol overhead`, `Replica apply ceiling`, `Backlog safety buffer`, `Steady-state warning`, and `Catch-up start time`. If any advanced value is outside its range, fix the validation message before using the estimate.
- Read the summary badges first. `can catch up`, `target risk`, `thin headroom`, and `falls behind` quickly show whether the current inputs converge.
- Open `Backlog Metrics`, `Catch-up Checkpoints`, `Capacity Scenarios`, `Backlog Burn-down Chart`, `Capacity Sensitivity Curve`, and `JSON` when you need table evidence, scenario comparison, chart evidence, or structured output.
Interpreting Results:
Net backlog drain is the result to check first. Positive drain means the modeled backlog can fall. Zero or negative drain means Catch-up time becomes not reachable, and the next action is to raise usable capacity, reduce write load, improve compression, or remove an apply ceiling before expecting convergence.
| Output cue | Read it as | Check next |
|---|---|---|
| Total modeled backlog | The bytes that must be sent, flushed, or replayed after existing backlog, outage bytes, and buffer are combined. | Verify the source counter unit, especially GiB versus GB and TiB versus TB. |
| Incoming replication load | Continuing change traffic after compression or dedupe. | Make sure logical write rate and compression ratio are describing the same data family. |
| Usable replication capacity | The active send or apply ceiling after overhead and optional replica apply limit. | If Primary bottleneck says replica apply ceiling, inspect target replay and storage counters. |
| Catch-up target fit | Whether the modeled drain time meets the configured runbook target. | Use Capacity Scenarios or the sensitivity chart to see how much extra capacity would help. |
| RPO pressure | The modeled backlog expressed as write-time lag at the current incoming load. | Compare it with the workload recovery point objective and stale-read tolerance. |
The target badge is a planning cue, not a guarantee. `spare 2.63 hr` means the current arithmetic fits the target by that margin, while `short 2.06 hr` means the model misses by that amount. The wording does not prove that the actual recovery will hold the same rate for the whole window.
A low steady-state utilization is reassuring only when the entered capacity is real. If the replication lane shares traffic with backups, snapshots, user reads, or cross-region work, compare the estimated Usable replication capacity with a measured transfer or live replication test before relying on the drain time.
Worked Examples:
Default database recovery window
Use a workload named orders-db-prod, an Incoming change rate of 180 Mbps, Replication capacity of 500 Mbps, a 90 minute outage, 120 GiB current backlog, a 1.30x compression ratio, and a 4 hour target. The result shows Total modeled backlog of 207.04 GiB, Incoming replication load of 138 Mbps, Net backlog drain of +362 Mbps, and Catch-up time of 1.37 hr. The target status is inside target with about 2.63 hr spare.
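The headline numbers in this example can be reproduced with a few lines of arithmetic (unit constants: 1 Mbps = 125,000 bytes/s, 1 GiB = 2^30 bytes); this is a check of the figures above, not the tool's code:

```python
# Reproduce the worked example; inputs come from the paragraph above.
r_in = 180 / 1.30                                # incoming replication load, Mbps
outage_gib = r_in * 125_000 * 90 * 60 / 2**30    # bytes generated during the outage
total_gib = 120 + outage_gib                     # total modeled backlog, GiB
net_drain = 500 - r_in                           # net backlog drain, Mbps
catchup_hr = total_gib * 2**30 * 8 / (net_drain * 1e6) / 3600
spare_hr = 4 - catchup_hr                        # margin against the 4 hr target

print(round(total_gib, 2))   # 207.04
print(round(catchup_hr, 2))  # 1.37
print(round(spare_hr, 2))    # 2.63
```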
Apply ceiling misses the target
Keep the same numbers, but set Replica apply ceiling to 220 Mbps. The lower target-side ceiling becomes Usable replication capacity, so Net backlog drain drops to +81.5 Mbps. Catch-up time stretches to 6.06 hr, and Catch-up target fit reports a miss by about 2.06 hr. This points to replica replay or storage apply work, not only more network bandwidth.
Write surge that cannot converge
Raise Incoming change rate to 700 Mbps while keeping 500 Mbps usable capacity and the 1.30x compression ratio. Incoming replication load becomes 538 Mbps, which is higher than usable capacity. The page shows `not reachable`, `falls behind`, Net backlog drain of -38.5 Mbps, and Steady-state utilization of 107.7%. The corrective path is capacity, throttling, or a better reduction ratio before catch-up can begin.
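The non-convergent arithmetic is quick to confirm (illustrative Python, not the tool's internals):

```python
r_in = 700 / 1.30        # incoming replication load after 1.30x reduction, Mbps
capacity = 500           # usable replication capacity, Mbps

print(round(r_in, 1))                    # 538.5 Mbps of continuing load
print(round(100 * r_in / capacity, 1))   # 107.7 (% steady-state utilization)
print(round(capacity - r_in, 1))         # -38.5 Mbps net drain: falls behind
```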
Measured backlog without an outage window
If a monitor already reports 80 GiB of backlog, set Replication outage to 0, set Current backlog to 80 GiB, and enter the current write and capacity rates. Total modeled backlog then stays tied to the measured counter instead of double-counting an outage estimate. If a validation message appears after changing units, correct the negative or out-of-range field before copying results into a runbook.
FAQ:
Why does the result say not reachable?
The incoming replicated change rate is equal to or higher than usable replication capacity. Raise capacity, reduce write load, improve compression or dedupe, lower protocol overhead, or remove a replica apply ceiling before using the catch-up estimate.
Should I use write rate or replication bytes?
Use the rate that matches your compression setting. If your counter is logical write data, enter that rate and the reduction ratio. If your counter already reports transferred replication bytes, keep Compression or dedupe ratio at 1.0.
What is the difference between catch-up time and backlog age equivalent?
Catch-up time estimates how long backlog takes to drain after replication resumes. Backlog age equivalent expresses the same modeled bytes as write-time lag at the current incoming replication load.
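A small sketch makes the distinction concrete: both quantities divide the same modeled backlog, but by different rates. The numbers below are illustrative (they match the first worked example on this page):

```python
backlog_bits = 207.04 * 2**30 * 8   # modeled backlog expressed in bits
incoming_mbps = 138.46              # continuing incoming replication load
drain_mbps = 361.54                 # net drain: usable capacity minus load

age_hr = backlog_bits / (incoming_mbps * 1e6) / 3600     # backlog age equivalent
catchup_hr = backlog_bits / (drain_mbps * 1e6) / 3600    # catch-up time
print(round(age_hr, 2), round(catchup_hr, 2))            # 3.57 1.37
```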
Can a replica miss the target even when it can catch up?
Yes. Positive Net backlog drain means convergence is possible, but the drain can still be too slow for the configured Catch-up target. Check Capacity Scenarios to see whether more capacity, less write load, or removing an apply ceiling changes the target result.
Why did the validation box appear?
The calculator rejects negative rates, negative backlog, negative outage time, compression ratios below 1.0, zero or negative replication capacity, protocol overhead outside 0% to 95%, and warning thresholds outside 1% to 100%.
Where do the inputs go?
The calculation runs in your browser. Treat copied tables, downloaded JSON, CSV, DOCX, chart images, and shared links carefully because they can preserve workload names, rates, capacity assumptions, and recovery targets.
Glossary:
- Replication backlog: Change data waiting to be sent, flushed, or replayed on a replica.
- Net backlog drain: Usable replication capacity minus continuing incoming replicated load.
- Replica apply ceiling: A target-side replay, flush, or storage limit that is lower than the network lane.
- Backlog age equivalent: The modeled backlog expressed as write-time lag at the current incoming replication rate.
- RPO: Recovery point objective, the maximum acceptable data freshness gap for a recovery scenario.
- RTO: Recovery time objective, the maximum acceptable time to restore service after an interruption.
References:
- Log-Shipping Standby Servers, PostgreSQL Global Development Group.
- Replication lag, Google Cloud, last updated 2026-05-01.
- AWS Elastic Disaster Recovery FAQs, Amazon Web Services.