
Replication backlog is the queue of change data a secondary system still has to receive, flush, or replay before it matches the primary. The queued work may be database write-ahead log (WAL), binlog events, storage deltas, changed blocks, search-index operations, or file changes, but the planning question is the same: how much old work is waiting, and how much spare recovery rate remains while new writes keep arriving?

Catch-up planning is different from reading a lag counter at one instant. A seconds-based lag value describes how stale the replica may be, while a byte backlog describes the amount of work still queued. The two can move differently during recovery. A replica can have a small time lag but a large byte queue after a burst, or a modest byte queue that still takes a long time to replay because the target disk, database apply process, or network path is the bottleneck.

Useful backlog estimates separate three quantities instead of treating lag as one mystery number:

  • Backlog volume, which combines bytes already waiting with bytes created during an outage or maintenance window.
  • Continuing replicated load, which is the write stream that keeps arriving while catch-up is underway.
  • Usable recovery capacity, which is the practical send or apply rate after overhead, throttling, and target-side limits.
Figure: Replication backlog queue with new writes, queued backlog, usable capacity, continuing load, and net drain.

Recovery objectives give the number a practical purpose. Recovery time objective (RTO) is about how long service can remain unavailable, while recovery point objective (RPO) is about how much data freshness or possible data loss is acceptable. A backlog that drains within a maintenance window may still violate an RPO if readers or failover targets see old data during the catch-up period, so byte backlog and time lag need to be checked together.

The main mistake is using a nominal interface speed as if every bit were available for replication. Cross-region latency, packet loss, encryption, compression cost, disk throughput, lock conflicts, and replay process limits can all lower the real drain rate. A backlog estimate should be paired with live lag metrics and at least one measured recovery run before it becomes a runbook commitment.

How to Use This Tool:

Use counters from the same workload and time window, then check whether the recovery path has positive spare capacity.

  1. Enter Incoming change rate from a sustained write, WAL, binlog, or changed-block counter. Choose Mbps, MB/s, Gbps, or GiB/hr to match the source counter.
  2. Enter Replication capacity as the sustained usable send or apply rate. Avoid raw interface speed unless testing shows that replication can actually use it.
  3. Set Replication outage for the time changes accumulated without normal replication. Use 0 when only a measured backlog is being drained.
  4. Set Current backlog in GB, TB, GiB, or TiB. Keep the size unit consistent with the monitoring system that produced the backlog counter.
  5. Adjust Compression or dedupe ratio. Use 1.0 when the entered rate already measures transferred replication bytes instead of logical change volume.
  6. Set Catch-up target when a maintenance window, runbook, or recovery promise needs an inside-target or target-miss result.
  7. Open Advanced when protocol overhead, a Replica apply ceiling, a Backlog safety buffer, a Steady-state warning, or a Catch-up start time materially changes the plan.
  8. Read Backlog Metrics first. Use Catch-up Checkpoints, Capacity Scenarios, Backlog Burn-down Chart, and Capacity Sensitivity Curve to compare bottlenecks and alternatives.

If a validation message appears, fix the named field before using the summary. The checks cover negative rates, negative durations, negative backlog, a compression ratio below 1.0, nonpositive replication capacity, and invalid advanced percentages. The form also caps protocol overhead, safety buffer, and steady-state warning fields.
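The checks described above can be sketched as a single function. This is a minimal illustration, not the tool's own code: the dictionary keys, the message wording, and the example warning threshold of 70% are hypothetical, and the percentage ranges are the ones listed in the validation table later on this page. The sketch reports out-of-range percentages as problems, whereas the form may cap them instead.

```python
def validate(inputs: dict) -> list:
    """Return a list of problem messages; an empty list means the inputs pass."""
    problems = []
    if inputs["incoming_rate"] < 0:
        problems.append("Incoming change rate must be zero or greater.")
    if inputs["capacity"] <= 0:
        problems.append("Replication capacity must be greater than zero.")
    if inputs["outage"] < 0:
        problems.append("Replication outage must be zero or greater.")
    if inputs["backlog"] < 0:
        problems.append("Current backlog must be zero or greater.")
    if inputs["ratio"] < 1.0:
        problems.append("Compression or dedupe ratio must be at least 1.0.")
    if not 0 <= inputs["overhead_pct"] <= 95:
        problems.append("Protocol overhead must be between 0% and 95%.")
    if not 0 <= inputs["safety_pct"] <= 300:
        problems.append("Backlog safety buffer must be between 0% and 300%.")
    if not 0 < inputs["warn_pct"] <= 100:
        problems.append("Steady-state warning must be above 0% and at most 100%.")
    return problems


# Hypothetical input set; warn_pct=70 is an illustrative value only.
example = {
    "incoming_rate": 180, "capacity": 500, "outage": 90, "backlog": 120,
    "ratio": 1.3, "overhead_pct": 0, "safety_pct": 0, "warn_pct": 70,
}
print(validate(example))
print(validate(dict(example, capacity=0, ratio=0.9, backlog=-1)))
```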

Interpreting Results:

Start with Net backlog drain. Positive drain means usable capacity remains after continuing replicated writes. Zero or negative drain means the replica cannot converge, so Catch-up time should be treated as not reachable rather than slow-but-acceptable.

Total modeled backlog is the amount of data being drained after existing backlog, outage-generated bytes, and the optional safety buffer are combined. Backlog age equivalent restates that byte count as write-time lag at the modeled incoming replication load, which helps compare the estimate with RPO and stale-read tolerance.
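One way to read that description, offered here as an assumption and not as a formula this page states, is to divide the modeled backlog by the incoming load expressed in bytes per second. The symbol T_age is introduced only for this illustration:

```latex
\[
T_{\text{age}} = \frac{B_{\text{total}}}{R_{\text{incoming}} \times 1{,}000{,}000 / 8}
\]
\[
T_{\text{age}} \approx \frac{2.2231 \times 10^{11}\ \text{bytes}}{1.7308 \times 10^{7}\ \text{bytes/s}}
\approx 12{,}845\ \text{s} \approx 3.57\ \text{hr}
\]
```

The second line plugs in 207.04 GiB of total backlog and 138.46 Mbps of incoming load. Under this reading the outage contributes exactly its own 90 minutes, and the 120 GiB of existing backlog contributes the rest.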

Steady-state utilization warns about normal headroom. A plan can meet the catch-up target and still be fragile if ordinary writes consume most of the usable capacity, because a routine write spike can create a new lag event.
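A headroom check of this kind might look like the sketch below. It assumes utilization is incoming load divided by usable capacity; the function name and the 70% threshold are hypothetical and chosen only for illustration.

```python
def steady_state_check(incoming_mbps: float, usable_mbps: float, warn_pct: float):
    """Return (utilization in percent, True when headroom is thin)."""
    utilization_pct = incoming_mbps / usable_mbps * 100
    return utilization_pct, utilization_pct >= warn_pct


incoming = 180 / 1.3  # logical writes divided by the compression ratio

# Hypothetical 70% warning threshold.
roomy_pct, roomy_warn = steady_state_check(incoming, 500, warn_pct=70)
thin_pct, thin_warn = steady_state_check(incoming, 160, warn_pct=70)

print(f"500 Mbps lane: {roomy_pct:.1f}% used, warn={roomy_warn}")
print(f"160 Mbps lane: {thin_pct:.1f}% used, warn={thin_warn}")
```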

Do not treat an inside-target result as proof that failover is ready. Check Primary bottleneck, compare the modeled Net backlog drain with live byte-lag movement, and verify that target replay, storage, or apply metrics improve during the recovery window.

Technical Details:

Replication catch-up is a queue problem with a moving input. Old backlog drains only from the spare rate left after the current write stream has been accounted for. The calculation therefore starts with normalized throughput, then converts the backlog and outage window into a common byte-per-second model.

Usable capacity can be lower than the raw replication lane. Protocol overhead removes a percentage of the entered capacity, and a replica apply ceiling caps the result when the target can receive data faster than it can replay or flush it. Compression or dedupe lowers incoming replicated load only when the entered change rate describes logical source data.
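The two adjustments in this paragraph can be sketched as small helpers. The function and parameter names are hypothetical; the logic follows the description above, with a ceiling of 0 meaning no separate apply ceiling.

```python
def usable_capacity_mbps(raw_mbps: float, overhead_pct: float = 0.0,
                         apply_ceiling_mbps: float = 0.0) -> float:
    """Capacity left after protocol overhead and an optional apply ceiling."""
    path = raw_mbps * (1.0 - overhead_pct / 100.0)
    # A ceiling of 0 is treated as "no separate apply ceiling".
    if apply_ceiling_mbps > 0:
        return min(path, apply_ceiling_mbps)
    return path


def incoming_load_mbps(logical_mbps: float, ratio: float = 1.0) -> float:
    """Replicated load after the compression or dedupe ratio (ratio >= 1.0)."""
    return logical_mbps / ratio


print(usable_capacity_mbps(500))                    # no overhead, no ceiling
print(usable_capacity_mbps(500, overhead_pct=10))   # overhead only
print(usable_capacity_mbps(500, 10, 160))           # apply ceiling is the limit
print(incoming_load_mbps(180, 1.3))                 # logical rate after compression
print(incoming_load_mbps(180))                      # rate already in replication bytes
```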

Formula Core:

The model normalizes rates to Mbps, durations to seconds, and backlog to bytes before deriving catch-up time.

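\[
\begin{aligned}
R_{\text{incoming}} &= \frac{R_{\text{logical}}}{q} \\
C_{\text{path}} &= C_{\text{raw}} \times (1 - o) \\
C_{\text{usable}} &=
  \begin{cases}
    \min(C_{\text{path}}, C_{\text{apply}}) & \text{if apply ceiling} > 0 \\
    C_{\text{path}} & \text{otherwise}
  \end{cases} \\
B_{\text{outage}} &= \frac{R_{\text{incoming}} \times 1000000 \times t_{\text{outage}}}{8} \\
B_{\text{total}} &= (B_{\text{current}} + B_{\text{outage}}) \times (1 + s) \\
R_{\text{net}} &= C_{\text{usable}} - R_{\text{incoming}} \\
T_{\text{catchup}} &= \frac{B_{\text{total}}}{R_{\text{net}} \times 1000000 / 8}, \quad \text{when } R_{\text{net}} > 0
\end{aligned}
\]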
Replication backlog variable guide

| Symbol | Meaning | Unit handling |
| --- | --- | --- |
| R_incoming | Continuing replicated load after the compression or dedupe ratio. | Mbps after rate normalization. |
| C_usable | Capacity left after protocol overhead and any replica apply ceiling. | Mbps, with the apply ceiling used only when it is greater than zero. |
| B_outage | Backlog generated during the outage window. | Bytes, using 1,000,000 bits per Mbps and 8 bits per byte. |
| B_total | Existing backlog plus outage-generated backlog after the safety buffer. | Bytes, displayed with binary byte units. |
| R_net | Usable capacity minus continuing replicated load. | Positive Mbps values drain backlog; zero or negative values cannot catch up. |
| T_catchup | Modeled duration from restart until the backlog reaches zero. | Seconds internally, displayed as seconds, minutes, hours, or days. |

For the default values, 180 Mbps of logical writes with a 1.3 compression ratio becomes 138.46 Mbps of incoming replicated load. A 90-minute outage creates about 87.04 GiB of new backlog, which combines with 120 GiB of existing backlog for 207.04 GiB total. With 500 Mbps of usable capacity, net drain is 361.54 Mbps and catch-up time is about 1.37 hours.
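The default-value arithmetic can be reproduced with a short sketch of the model. The class, field, and function names below are hypothetical; the steps follow the formula core, with protocol overhead and the safety buffer expressed as fractions and both left at zero for this run.

```python
from dataclasses import dataclass
from typing import Optional

GIB = 1024 ** 3
BITS_PER_MBIT = 1_000_000


@dataclass
class Plan:
    logical_mbps: float               # logical change rate
    ratio: float                      # compression or dedupe ratio, >= 1.0
    raw_capacity_mbps: float          # entered replication capacity
    outage_s: float                   # outage duration in seconds
    current_backlog_bytes: float
    overhead: float = 0.0             # protocol overhead as a fraction
    apply_ceiling_mbps: float = 0.0   # 0 means no apply ceiling
    safety: float = 0.0               # backlog safety buffer as a fraction


@dataclass
class Result:
    incoming_mbps: float
    usable_mbps: float
    total_bytes: float
    net_mbps: float
    catchup_s: Optional[float]        # None when the replica cannot converge


def model(p: Plan) -> Result:
    incoming = p.logical_mbps / p.ratio
    path = p.raw_capacity_mbps * (1 - p.overhead)
    usable = min(path, p.apply_ceiling_mbps) if p.apply_ceiling_mbps > 0 else path
    outage_bytes = incoming * BITS_PER_MBIT * p.outage_s / 8
    total = (p.current_backlog_bytes + outage_bytes) * (1 + p.safety)
    net = usable - incoming
    # Catch-up time is defined only for positive net drain.
    catchup = total / (net * BITS_PER_MBIT / 8) if net > 0 else None
    return Result(incoming, usable, total, net, catchup)


r = model(Plan(logical_mbps=180, ratio=1.3, raw_capacity_mbps=500,
               outage_s=90 * 60, current_backlog_bytes=120 * GIB))
print(f"incoming load  {r.incoming_mbps:.2f} Mbps")
print(f"total backlog  {r.total_bytes / GIB:.2f} GiB")
print(f"net drain      {r.net_mbps:+.2f} Mbps")
print(f"catch-up time  {r.catchup_s / 3600:.2f} hr")
```

Returning `None` for nonpositive net drain keeps an unreachable catch-up from being mistaken for a very long one.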

Replication backlog validation and conversion rules

| Area | Rule | Practical effect |
| --- | --- | --- |
| Rate values | Incoming change rate must be zero or greater; replication capacity must be greater than zero. | A static backlog is valid, but a nonpositive recovery lane cannot produce catch-up. |
| Rate units | Gbps is converted to 1000 Mbps; MB/s is converted by multiplying by 8; GiB/hr is converted from binary bytes per hour. | Mixed monitoring units can be compared after normalization. |
| Backlog units | GB and TB use decimal bytes; GiB and TiB use binary bytes. | Choose the unit family that matches the source counter to avoid a silent size shift. |
| Compression ratio | The ratio must be at least 1.0. | Values above 1.0 reduce logical write volume into transferred replication bytes. |
| Protocol overhead | Accepted from 0% to 95%. | The value reduces raw capacity before any apply ceiling is considered. |
| Backlog safety buffer | Accepted from 0% to 300%. | The value scales the combined backlog after current and outage backlog are added. |
| Steady-state warning | Accepted above 0% and up to 100%. | This threshold decides when normal write load is labeled thin headroom. |
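The unit rules can be illustrated with two lookup tables. This is a sketch under stated assumptions: the table and function names are hypothetical, and MB/s is taken to mean decimal megabytes per second, which is what a plain multiply-by-8 conversion implies.

```python
# Multipliers that turn each rate unit into Mbps.
RATE_TO_MBPS = {
    "Mbps": 1.0,
    "Gbps": 1000.0,
    "MB/s": 8.0,                                   # decimal megabytes per second
    "GiB/hr": 1024 ** 3 * 8 / 1_000_000 / 3600,    # binary bytes per hour
}

# Multipliers that turn each size unit into bytes.
SIZE_TO_BYTES = {
    "GB": 10 ** 9,
    "TB": 10 ** 12,
    "GiB": 1024 ** 3,
    "TiB": 1024 ** 4,
}


def rate_to_mbps(value: float, unit: str) -> float:
    return value * RATE_TO_MBPS[unit]


def size_to_bytes(value: float, unit: str) -> float:
    return value * SIZE_TO_BYTES[unit]


print(rate_to_mbps(22.5, "MB/s"))
print(rate_to_mbps(1, "GiB/hr"))

# The same number read as GiB instead of GB is a larger backlog.
shift = size_to_bytes(120, "GiB") / size_to_bytes(120, "GB")
print(f"120 GiB is {(shift - 1) * 100:.1f}% larger than 120 GB")
```

The last line shows the silent size shift the table warns about: picking the wrong unit family changes the modeled backlog by several percent without any validation message.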

Scenario comparisons use the same queue model while changing one pressure point at a time: 75%, 125%, and 200% usable capacity, a 150% write surge, no compression benefit, and an uncapped apply path when an apply ceiling is present. Sensitivity rows also recompute catch-up from 0.5x to 3x usable capacity, which exposes the break-even point where continuing writes consume the entire recovery lane.
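A capacity sweep of this kind can be sketched as follows. The step values, the 200 Mbps base lane, and the names are hypothetical, chosen so that the break-even point falls inside the 0.5x to 3x range; the sketch holds the total backlog fixed and varies only usable capacity.

```python
GIB = 1024 ** 3


def catch_up_hours(total_bytes: float, usable_mbps: float, incoming_mbps: float):
    """Hours to drain the backlog, or None when net drain is not positive."""
    net = usable_mbps - incoming_mbps
    if net <= 0:
        return None
    return total_bytes / (net * 1_000_000 / 8) / 3600


incoming_mbps = 180 / 1.3
total_bytes = 120 * GIB + incoming_mbps * 1_000_000 * 90 * 60 / 8
base_usable_mbps = 200.0  # hypothetical lane, smaller than the page defaults

rows = []
for factor in (0.5, 0.75, 1.0, 1.25, 2.0, 3.0):
    hours = catch_up_hours(total_bytes, base_usable_mbps * factor, incoming_mbps)
    rows.append((factor, hours))
    label = "not reachable" if hours is None else f"{hours:.2f} hr"
    print(f"{factor:>4}x  {base_usable_mbps * factor:6.1f} Mbps  {label}")

# Below this multiple, continuing writes consume the whole recovery lane.
break_even_factor = incoming_mbps / base_usable_mbps
print(f"break-even at {break_even_factor:.2f}x")
```

Catch-up time grows without bound as the multiplier approaches the break-even factor from above, which is why small capacity losses near that point have outsized effects.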

Limitations:

The calculation uses average rates, so it cannot prove that a live replication path is healthy. Real recovery can be shaped by bursty writes, archive fetches, long-running read queries, DDL locks, packet loss, target I/O, replay process limits, and storage throttling.

  • Use byte-lag counters for backlog size and replay-lag or commit-age counters for user-visible staleness.
  • Check both network and apply metrics; a healthy link can still leave the replica behind when replay or storage is slow.
  • Treat the optional completion time as a schedule estimate, not an availability guarantee.

Worked Examples:

Maintenance restart with enough headroom

A workload writes 180 Mbps of logical change data, has 120 GiB already waiting, and spends 90 minutes without normal replication. With a 1.3 compression ratio and 500 Mbps usable capacity, Incoming replication load is 138.46 Mbps, Total modeled backlog is 207.04 GiB, and Net backlog drain is +361.54 Mbps. The resulting Catch-up time is 1.37 hr, so a 4 hr Catch-up target has 2.63 hr of spare time.

Target replay becomes the bottleneck

The same workload changes sharply when Replica apply ceiling is set to 160 Mbps. Usable replication capacity becomes 160 Mbps, Net backlog drain drops to +21.54 Mbps, and Catch-up time stretches to 22.94 hr. Primary bottleneck points to the replica apply ceiling, so adding network bandwidth alone would not meet the target.
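The effect of the ceiling can be checked with a small sketch. The function name is hypothetical, protocol overhead is assumed to be zero, and the 1000 Mbps and 320 Mbps alternatives are illustrative values that do not come from the example above.

```python
GIB = 1024 ** 3


def catch_up_hours(raw_mbps, apply_ceiling_mbps, incoming_mbps, total_bytes):
    """Hours to drain the backlog, or None when net drain is not positive."""
    # A ceiling of 0 means no separate apply ceiling.
    usable = min(raw_mbps, apply_ceiling_mbps) if apply_ceiling_mbps > 0 else raw_mbps
    net = usable - incoming_mbps
    if net <= 0:
        return None
    return total_bytes / (net * 1_000_000 / 8) / 3600


incoming = 180 / 1.3
total = 120 * GIB + incoming * 1_000_000 * 90 * 60 / 8

capped = catch_up_hours(500, 160, incoming, total)
more_network = catch_up_hours(1000, 160, incoming, total)   # double the link
higher_ceiling = catch_up_hours(500, 320, incoming, total)  # double the ceiling

print(f"160 Mbps ceiling, 500 Mbps link:  {capped:.2f} hr")
print(f"160 Mbps ceiling, 1000 Mbps link: {more_network:.2f} hr")
print(f"320 Mbps ceiling, 500 Mbps link:  {higher_ceiling:.2f} hr")
```

Doubling the link leaves the result unchanged because the apply ceiling still sets usable capacity, while raising the ceiling shortens catch-up directly.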

The replica cannot converge

If usable capacity is only 120 Mbps while the modeled incoming load stays at 138.46 Mbps, Net backlog drain becomes -18.46 Mbps and the summary reports that the replica falls behind. Catch-up time is not reachable until the plan reduces incoming load, increases usable capacity, improves compression, or removes the apply bottleneck.

FAQ:

Why can a replica fail to catch up when bandwidth looks available?

The usable capacity must first cover continuing replicated writes. If Incoming replication load is equal to or higher than Usable replication capacity, Net backlog drain is zero or negative and the backlog keeps growing.

Should I enter link speed or observed replication throughput?

Enter the sustained rate replication can actually use. Raw link speed can overstate catch-up when protocol overhead, throttling, encryption, storage, or target replay lowers Usable replication capacity.

What if my monitoring gives lag in seconds?

Use seconds-based lag as an interpretation check, not as Current backlog. The backlog field expects a byte size, and Backlog age equivalent then restates the modeled bytes as write-time lag.

Why does compression change the answer?

Compression or dedupe ratio divides the logical write rate before backlog bytes are calculated. Leave it at 1.0 when the incoming rate already measures transferred replication traffic.

Why do I get a validation warning?

A warning appears when a required number is outside the accepted range, such as negative backlog, nonpositive replication capacity, a compression ratio below 1.0, or an invalid advanced percentage. Fix the named field before reading Backlog Metrics.

Glossary:

Replication backlog
Change data waiting to be transferred, flushed, or applied by a replica.
Byte lag
A backlog measurement expressed as bytes instead of elapsed time.
WAL
Write-ahead log, a database change stream commonly used by PostgreSQL replication.
RTO
Recovery time objective, the maximum acceptable time to restore service.
RPO
Recovery point objective, the maximum acceptable data freshness gap or possible data loss window.
Replica apply ceiling
The target-side replay, flush, or storage limit when it is lower than the network lane.
Net backlog drain
Usable replication capacity minus continuing replicated load.
Backlog age equivalent
The modeled backlog expressed as write-time lag at the incoming replication load.