Introduction:

Replication backlog is the change data that has not yet reached a replica, or has arrived but has not yet been flushed or replayed. It grows during outages, link interruptions, overloaded apply workers, throttled storage, and maintenance windows where writes continue while replication cannot keep up. The planning question is not just how many bytes are waiting. The more useful question is whether the recovery path has enough spare throughput to drain those bytes while new changes keep arriving.

That spare throughput is the difference between usable replication capacity and the incoming replicated change rate. A replica with 500 Mbps of usable capacity and 140 Mbps of continuing replicated load has 360 Mbps of net drain for backlog recovery. A replica with 500 Mbps of usable capacity and 540 Mbps of continuing replicated load does not catch up at all, because every recovery second adds more work than it removes.
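The convergence check above is a single subtraction. A minimal sketch in Python (the function name is illustrative):

```python
def net_drain_mbps(usable_capacity_mbps, incoming_load_mbps):
    """Spare throughput left to drain backlog; zero or negative means no convergence."""
    return usable_capacity_mbps - incoming_load_mbps

# The two scenarios from the text:
print(net_drain_mbps(500, 140))  # 360 Mbps of net drain, backlog can fall
print(net_drain_mbps(500, 540))  # negative: every second adds more work than it removes
```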

Replication backlog diagram showing write rate and outage time creating backlog, then usable capacity minus continuing load creating net drain toward a target.

Backlog time is closely tied to recovery point objective, recovery time objective, and reader staleness. A large backlog can be acceptable if it drains before dependent systems need current data. A smaller backlog can be a real incident if the replica keeps falling farther behind, if stale reads are exposed to users, or if failover would lose changes that have not reached the recovery site.

The estimate remains a planning model. It does not prove that replication is healthy, that every byte has the same apply cost, or that a single average rate will hold through a noisy recovery period. It is best used beside live replication counters, database or storage apply metrics, and a short measured catch-up test when the window is tight.

Technical Details:

Asynchronous replication usually has two separate timing concerns. First, changes must be shipped from the source to the replica. Second, the replica must write, flush, or replay those changes before they become useful. A backlog can therefore come from a network lane that cannot send fast enough, a target that cannot apply fast enough, or a maintenance window where replication stopped while writes continued.

Byte lag and time lag are related but not identical. A database can report byte distance between source and replica positions, while another monitor reports the age of the transaction being replayed. Byte lag is easier to model from throughput because it behaves like a drainable queue. Time lag is better for user impact because it describes how stale the replica may be. A throughput model connects those views by estimating both the modeled backlog and its write-time equivalent.

The core condition is convergence. Replication catches up only when usable capacity is greater than the incoming replicated change rate. Compression or dedupe reduces the replicated change rate when the source rate is logical bytes, while protocol overhead and an optional apply ceiling reduce usable capacity.

Formula Core:

The model converts all rates to Mbps, converts backlog sizes to bytes, then uses net drain to calculate catch-up duration.

R_incoming = R_logical / C_ratio
C_usable = min(C_raw × (1 − o), C_apply)
B_outage = R_incoming × 1,000,000 × t / 8
B_total = (B_current + B_outage) × (1 + s)
R_net = C_usable − R_incoming
T_catchup = B_total / (R_net × 1,000,000 / 8)

Rates are in Mbps, t is the outage duration in seconds, o is protocol overhead as a decimal, and s is the safety buffer as a decimal.

In the capacity formula, the apply ceiling is used only when a positive ceiling is entered. If no separate apply ceiling is entered, usable capacity is the raw replication capacity after protocol overhead. Catch-up time is finite only when Rnet is greater than zero.
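The full chain can be written out in a few lines. A sketch whose names mirror the symbols above (the function name and signature are illustrative):

```python
def catchup_model(r_logical_mbps, c_raw_mbps, outage_s, current_backlog_bytes,
                  c_ratio=1.0, overhead=0.0, c_apply_mbps=0.0, safety=0.0):
    """Return (total_backlog_bytes, net_drain_mbps, catchup_seconds or None)."""
    r_incoming = r_logical_mbps / c_ratio                # replicated load after reduction
    c_usable = c_raw_mbps * (1 - overhead)               # overhead trims raw capacity
    if c_apply_mbps > 0:                                 # ceiling used only when entered
        c_usable = min(c_usable, c_apply_mbps)
    b_outage = r_incoming * 1_000_000 * outage_s / 8     # Mbps over seconds -> bytes
    b_total = (current_backlog_bytes + b_outage) * (1 + safety)
    r_net = c_usable - r_incoming
    if r_net <= 0:
        return b_total, r_net, None                      # cannot converge
    return b_total, r_net, b_total / (r_net * 1_000_000 / 8)
```

Feeding in the first worked example from later in this page (180 Mbps logical rate, 500 Mbps capacity, 90 minute outage, 120 GiB existing backlog, 1.30x ratio) reproduces its headline numbers.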

Variable Guide:

Replication backlog formula variables and meanings
Symbol Meaning Why it changes the result
R_logical Incoming change rate before compression or dedupe. Higher source write volume creates more replicated work during outage and catch-up.
C_ratio Compression or dedupe ratio, with 1.0 meaning no reduction. A larger ratio lowers replicated bytes when the source rate is logical data.
C_usable Capacity left for replication after overhead and any target-side apply limit. This must exceed continuing replicated load before backlog can fall.
B_current Existing bytes already waiting before the modeled outage window. Existing backlog adds directly to the outage backlog.
s Safety buffer as a decimal share. Adds reserve for bursts, retransmits, metadata, or uncertain counters.
R_net Usable capacity minus incoming replicated change rate. Positive values drain backlog; zero or negative values mean the replica cannot converge.

Validation and Boundary Rules:

Accepted input ranges for the replication backlog model
Input Accepted range or rule Practical effect
Incoming change rate Zero or greater, in Mbps, MB/s, Gbps, or GiB/hr. Zero removes continuing write pressure and makes only existing backlog matter.
Replication capacity Greater than zero, in the same supported rate units. A nonpositive value blocks the estimate because no drain rate can be formed.
Replication outage Zero or greater, in minutes, hours, or days. Set to zero when modeling a measured backlog without adding outage-generated bytes.
Current backlog Zero or greater, in GiB, TiB, GB, or TB. Use the unit family that matches the source counter.
Compression or dedupe ratio At least 1.0. Values below 1.0 are rejected because they would inflate data through a reduction field.
Protocol overhead 0% to 95%. Reduces raw capacity before comparing it with continuing replicated load.
Backlog safety buffer 0% to 300%. Scales total modeled backlog upward after existing and outage backlog are combined.
Steady-state warning 1% to 100%. Sets the utilization point where normal replication headroom is flagged as thin.
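The boundary rules in the table translate directly into a validation pass. A sketch, with illustrative field names and percentages expressed as decimals:

```python
def validate(inputs):
    """Return a list of human-readable violations; an empty list means usable inputs."""
    errors = []
    if inputs.get("incoming_rate", 0) < 0:
        errors.append("Incoming change rate must be zero or greater.")
    if inputs.get("capacity", 0) <= 0:
        errors.append("Replication capacity must be greater than zero.")
    if inputs.get("outage", 0) < 0:
        errors.append("Replication outage must be zero or greater.")
    if inputs.get("backlog", 0) < 0:
        errors.append("Current backlog must be zero or greater.")
    if inputs.get("ratio", 1.0) < 1.0:
        errors.append("Compression or dedupe ratio must be at least 1.0.")
    if not 0 <= inputs.get("overhead", 0) <= 0.95:
        errors.append("Protocol overhead must be between 0% and 95%.")
    if not 0 <= inputs.get("safety", 0) <= 3.0:
        errors.append("Backlog safety buffer must be between 0% and 300%.")
    if not 0.01 <= inputs.get("warn", 0.8) <= 1.0:
        errors.append("Steady-state warning must be between 1% and 100%.")
    return errors
```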

The model uses deterministic arithmetic and simple conversions. It does not simulate bursty write patterns, queue scheduling, database lock conflicts, storage stalls, checksum cost, or multi-replica fan-out. If actual replication lag is high while network lag is low, target-side apply or replay is often the next place to inspect rather than raw network bandwidth alone.

Everyday Use & Decision Guide:

Begin with measured counters when you have them. Use sustained write or WAL generation for Incoming change rate, not a short spike. Use a proven send, receive, or apply rate for Replication capacity, not the label on a network interface. If the replica's storage or database apply path is slower than the link, put that lower rate in Replica apply ceiling.

The strongest first pass is conservative. Keep Compression or dedupe ratio at 1.0 unless your counters are logical bytes and you have evidence for the reduction. Add Protocol overhead only when the entered capacity is raw rather than already usable. Add a Backlog safety buffer when the outage includes bursty writes, retransmits, or uncertain measurements.

  • Set Replication outage to the time changes accumulated without normal replication. Use 0 when the only source of backlog is a measured byte counter.
  • Use Catch-up target when a runbook, maintenance window, recovery time objective, or reader freshness promise needs a fit or miss answer.
  • Keep Steady-state warning near your normal headroom policy. The default 80% flags cases where ordinary write load is already consuming most of the usable lane.
  • Add Catch-up start time only when a projected completion timestamp helps with handoff or incident notes.

A good fit is a planned outage, failback window, replica rebuild, database reader recovery, storage mirror repair, or disaster recovery drill where you can estimate change rate and capacity. A poor fit is a live incident with unknown write bursts, changing throttles, or missing apply counters. In that case, use the result as a rough bound and keep measuring the actual drain rate during recovery.

Do not treat a short Catch-up time as proof that the system is safe to fail over. Read Net backlog drain, Steady-state utilization, Primary bottleneck, and RPO pressure before making the call. A positive drain estimate still needs live evidence that source, network, and replica apply metrics are moving in the same direction.

Step-by-Step Guide:

Work from backlog creation first, then add recovery capacity and target checks.

  1. Enter Incoming change rate and choose the unit that matches the source counter. If the red validation box says the rate is negative, correct that field before reading the result.
  2. Enter Replication capacity as the sustained usable lane. The estimate requires this value to be greater than zero.
  3. Set Replication outage and Current backlog. The Total modeled backlog row will show existing backlog plus the bytes generated during the outage.
  4. Set Compression or dedupe ratio. Leave it at 1.0 when rate and backlog counters already describe transferred replication bytes.
  5. Add Catch-up target if a deadline matters. A target of 0 turns off target checks and keeps the headline focused on raw drain time.
  6. Open Advanced for Workload name, Protocol overhead, Replica apply ceiling, Backlog safety buffer, Steady-state warning, and Catch-up start time. If any advanced value is outside its range, fix the validation message before using the estimate.
  7. Read the summary badges first. can catch up, target risk, thin headroom, and falls behind quickly show whether the current inputs converge.
  8. Open Backlog Metrics, Catch-up Checkpoints, Capacity Scenarios, Backlog Burn-down Chart, Capacity Sensitivity Curve, and JSON when you need table evidence, scenario comparison, chart evidence, or structured output.

Interpreting Results:

Net backlog drain is the result to check first. Positive drain means the modeled backlog can fall. Zero or negative drain means Catch-up time becomes not reachable, and the next action is to raise usable capacity, reduce write load, improve compression, or remove an apply ceiling before expecting convergence.

How to interpret replication backlog outputs
Output cue Read it as Check next
Total modeled backlog The bytes that must be sent, flushed, or replayed after existing backlog, outage bytes, and buffer are combined. Verify the source counter unit, especially GiB versus GB and TiB versus TB.
Incoming replication load Continuing change traffic after compression or dedupe. Make sure logical write rate and compression ratio are describing the same data family.
Usable replication capacity The active send or apply ceiling after overhead and optional replica apply limit. If Primary bottleneck says replica apply ceiling, inspect target replay and storage counters.
Catch-up target fit Whether the modeled drain time meets the configured runbook target. Use Capacity Scenarios or the sensitivity chart to see how much extra capacity would help.
RPO pressure The modeled backlog expressed as write-time lag at the current incoming load. Compare it with the workload recovery point objective and stale-read tolerance.

The target badge is a planning cue, not a guarantee. spare 2.63 hr means the current arithmetic fits the target by that margin, while short 2.06 hr means the model misses by that amount. The wording does not prove that the actual recovery will hold the same rate for the whole window.

A low steady-state utilization is reassuring only when the entered capacity is real. If the replication lane shares traffic with backups, snapshots, user reads, or cross-region work, compare the estimated Usable replication capacity with a measured transfer or live replication test before relying on the drain time.

Worked Examples:

Default database recovery window

Use a workload named orders-db-prod, an Incoming change rate of 180 Mbps, Replication capacity of 500 Mbps, a 90 minute outage, 120 GiB current backlog, a 1.30x compression ratio, and a 4 hour target. The result shows Total modeled backlog of 207.04 GiB, Incoming replication load of 138 Mbps, Net backlog drain of +362 Mbps, and Catch-up time of 1.37 hr. The target status is inside target with about 2.63 hr spare.
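The headline numbers can be checked with plain arithmetic, assuming GiB means 2^30 bytes as the unit tables distinguish:

```python
# Worked example recomputed step by step (values from the text above).
r_incoming = 180 / 1.30                      # replicated load after the 1.30x ratio
b_outage = r_incoming * 1e6 * 90 * 60 / 8    # bytes generated during the 90 min outage
b_total = 120 * 2**30 + b_outage             # add 120 GiB of existing backlog
net = 500 - r_incoming                       # net drain in Mbps
hours = b_total / (net * 1e6 / 8) / 3600     # catch-up duration

print(round(b_total / 2**30, 2))  # total modeled backlog in GiB
print(round(hours, 2))            # catch-up time in hours, against the 4 hr target
```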

Apply ceiling misses the target

Keep the same numbers, but set Replica apply ceiling to 220 Mbps. The lower target-side ceiling becomes Usable replication capacity, so Net backlog drain drops to +81.5 Mbps. Catch-up time stretches to 6.06 hr, and Catch-up target fit reports a miss by about 2.06 hr. This points to replica replay or storage apply work, not only more network bandwidth.

Write surge that cannot converge

Raise Incoming change rate to 700 Mbps while keeping 500 Mbps usable capacity and the 1.30x compression ratio. Incoming replication load becomes 538 Mbps, which is higher than usable capacity. The page shows not reachable, falls behind, Net backlog drain of -38.5 Mbps, and Steady-state utilization of 107.7%. The corrective path is capacity, throttling, or a better reduction ratio before catch-up can begin.
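The same arithmetic shows why this surge cannot converge:

```python
incoming = 700 / 1.30            # replicated load after the 1.30x reduction
usable = 500.0                   # Mbps of usable capacity
net = usable - incoming          # negative: backlog grows every second
utilization = incoming / usable  # steady-state utilization above 100%
print(round(net, 1), round(utilization * 100, 1))
```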

Measured backlog without an outage window

If a monitor already reports 80 GiB of backlog, set Replication outage to 0, set Current backlog to 80 GiB, and enter the current write and capacity rates. Total modeled backlog then stays tied to the measured counter instead of double-counting an outage estimate. If a validation message appears after changing units, correct the negative or out-of-range field before copying results into a runbook.

FAQ:

Why does the result say not reachable?

The incoming replicated change rate is equal to or higher than usable replication capacity. Raise capacity, reduce write load, improve compression or dedupe, lower protocol overhead, or remove a replica apply ceiling before using the catch-up estimate.

Should I use write rate or replication bytes?

Use the rate that matches your compression setting. If your counter is logical write data, enter that rate and the reduction ratio. If your counter already reports transferred replication bytes, keep Compression or dedupe ratio at 1.0.

What is the difference between catch-up time and backlog age equivalent?

Catch-up time estimates how long backlog takes to drain after replication resumes. Backlog age equivalent expresses the same modeled bytes as write-time lag at the current incoming replication load.
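The distinction is one divisor: catch-up time divides by net drain, while backlog age equivalent divides by the incoming load itself. A sketch (function names are illustrative; both take bytes and a rate in Mbps):

```python
def catchup_hours(backlog_bytes, net_drain_mbps):
    """Time to drain the backlog at the spare throughput after replication resumes."""
    return backlog_bytes / (net_drain_mbps * 1e6 / 8) / 3600

def backlog_age_hours(backlog_bytes, incoming_mbps):
    """The same bytes expressed as write-time lag at the current replicated load."""
    return backlog_bytes / (incoming_mbps * 1e6 / 8) / 3600

b = 207.04 * 2**30  # total modeled backlog from the first worked example
print(round(catchup_hours(b, 361.5), 2))       # drain time at net drain
print(round(backlog_age_hours(b, 138.46), 2))  # larger: lag measured against load alone
```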

Can a replica miss the target even when it can catch up?

Yes. Positive Net backlog drain means convergence is possible, but the drain can still be too slow for the configured Catch-up target. Check Capacity Scenarios to see whether more capacity, less write load, or removing an apply ceiling changes the target result.

Why did the validation box appear?

The calculator rejects negative rates, negative backlog, negative outage time, compression ratios below 1.0, zero or negative replication capacity, protocol overhead outside 0% to 95%, and warning thresholds outside 1% to 100%.

Where do the inputs go?

The calculation runs in your browser. Treat copied tables, downloaded JSON, CSV, DOCX, chart images, and shared links carefully because they can preserve workload names, rates, capacity assumptions, and recovery targets.

Glossary:

Replication backlog
Change data waiting to be sent, flushed, or replayed on a replica.
Net backlog drain
Usable replication capacity minus continuing incoming replicated load.
Replica apply ceiling
A target-side replay, flush, or storage limit that is lower than the network lane.
Backlog age equivalent
The modeled backlog expressed as write-time lag at the current incoming replication rate.
RPO
Recovery point objective, the maximum acceptable data freshness gap for a recovery scenario.
RTO
Recovery time objective, the maximum acceptable time to restore service after an interruption.
