Database replica lag catch-up inputs

  • Replica name: Label the stream being modeled so exported rows stay traceable.
  • Current lag backlog: Use the byte backlog metric when available; it gives a more stable catch-up estimate than timestamp lag alone.
  • Primary write generation: Use a sustained recent rate, not a short spike, unless the spike is expected to continue.
  • Replica apply throughput: Use the observed replay/apply rate after decompression and disk limits, before the reserve below.
  • Catch-up target window: Set the maintenance, failover, freshness, or promotion deadline to compare against the modeled ETA.
  • Apply efficiency: Lower this when replay stalls, checkpoints, or long transactions make the observed peak rate optimistic.
  • Apply reserve: Use 0 for emergency catch-up, or 5-25% when keeping replica service steady matters.
  • Retained log limit: Set the log retention or slot backlog budget; use a high value when retention pressure is not the current concern.
  • Scenario write throttle: Models a temporary write throttle, queue drain, or maintenance pause without changing the baseline inputs.
  • Scenario apply boost: Models faster storage, more workers, bigger instance class, or replay tuning.
  • Chart horizon: Shorten for incident calls; lengthen when modeling a long replica rebuild.

Replica lag is the distance between changes committed on a primary database and changes that a standby, read replica, logical subscriber, or change-stream consumer has applied. Small lag is normal in asynchronous replication, but the same delay becomes a production issue during failovers, migrations, reporting cutovers, maintenance windows, and incidents where stale reads can lead people to the wrong decision.

Operational teams often mix two different lag questions. Timestamp lag asks how old the replica's visible data might be. Byte or log-position backlog asks how much change data still has to be received, flushed, replayed, or applied. A quiet database can show old timestamps with little work left, while a busy system can show only seconds of lag but still hold a large write-ahead log, binary log, relay log, or replication-slot backlog.

Common replica lag measurements and what they answer

| Measurement | Operational question | Common trap |
| --- | --- | --- |
| Timestamp lag | How old is the latest applied transaction? | A quiet system can show a large time gap even when little work remains. |
| Byte backlog | How much log or change data still needs to be applied? | A small byte gap can still take time if apply is blocked by locks or I/O. |
| Apply throughput | How fast can the replica turn backlog into visible database changes? | Short bursts can overstate the sustained rate available during recovery. |
| Retention headroom | How much retained log space remains before required changes may be unavailable? | Enough time to catch up does not always mean enough retained log space. |
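
As one illustration of a byte backlog measurement, the sketch below converts two PostgreSQL log sequence numbers (LSNs) into a byte gap. It assumes the standard PostgreSQL text form of an LSN, two hexadecimal numbers separated by a slash; the function names and sample LSNs are hypothetical. PostgreSQL can compute the same difference server-side with `pg_wal_lsn_diff`, and other engines expose different position formats.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the high 32 bits, the part after is the low 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def byte_backlog_mib(primary_lsn: str, replay_lsn: str) -> float:
    """Gap between the primary's current LSN and the replica's replay LSN, in MiB."""
    gap = lsn_to_bytes(primary_lsn) - lsn_to_bytes(replay_lsn)
    return max(gap, 0) / (1024 * 1024)


# Hypothetical sample positions, not taken from a real system.
print(byte_backlog_mib("1A/40000000", "16/C0000000"))  # 14336.0 (14 GiB)
```

The result is a byte backlog that can be entered directly, which avoids inferring remaining work from timestamp lag.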

Catch-up is a stock-and-flow problem. The backlog is the stock. Incoming primary writes add to it, and replica apply throughput drains it. If a primary produces 450 MiB of new log data per minute while the replica effectively applies 600 MiB per minute, only 150 MiB per minute is available to reduce old lag. When write generation rises above effective apply throughput, the replica can be busy every second and still fall farther behind.
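
The stock-and-flow arithmetic above can be sketched in a few lines of Python. The function name and the 3,000 MiB starting backlog are illustrative assumptions; only the 450 and 600 MiB/min rates come from the text, and both rates are assumed to stay constant.

```python
def backlog_after(backlog_mib: float, write_rate: float, apply_rate: float, minutes: float) -> float:
    """Backlog (MiB) after `minutes`, assuming both rates (MiB/min) stay constant."""
    net_drain = apply_rate - write_rate  # positive means the backlog shrinks
    return max(float(backlog_mib - net_drain * minutes), 0.0)


# Rates from the text; the 3,000 MiB starting backlog is a made-up value.
print(backlog_after(3000, write_rate=450, apply_rate=600, minutes=10))  # 1500.0
# If writes rise above apply throughput, the backlog grows instead.
print(backlog_after(3000, write_rate=700, apply_rate=600, minutes=10))  # 4000.0
```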

[Figure: Replica backlog drains only when apply throughput exceeds write generation. A backlog curve slopes down when effective apply throughput is higher than incoming writes, with a retained log limit shown above the draining line. Chart labels: Current backlog, Net drain, Retained log limit. Caption: Backlog reaches zero only while effective apply rate stays above incoming writes.]

Retention is the hard boundary behind the timing question. PostgreSQL write-ahead log, MySQL binary logs, relay logs, replication slots, and managed-service transaction logs are kept so a lagging replica can resume from the point it last consumed. If the required records disappear, or retained logs fill the available storage budget, the problem may stop being a wait-and-monitor exercise and become a rebuild, slot cleanup, or storage incident.

Any catch-up estimate depends on rate assumptions holding long enough to matter. Sustained recent rates are more useful than peak replay bursts, and comparable workload matters more than a tidy formula. Long transactions, blocked apply workers, slow disks, network stalls, checkpoint pressure, replica-side reads, schema changes, and single-threaded apply behavior can all make a constant-rate forecast too optimistic.
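
One way to see the gap between a peak and a sustained rate is to compare the maximum of recent samples with their median. This is only a sketch: the sample values are hypothetical, and using the median as the "sustained" figure is an assumption made here, not something the tool prescribes.

```python
from statistics import median

# Hypothetical per-minute apply samples in MiB/min, with one short burst at 980.
apply_samples = [610, 640, 980, 600, 625, 590, 615]

peak_rate = max(apply_samples)
sustained_rate = median(apply_samples)  # less sensitive to a single burst than the peak

print(peak_rate, sustained_rate)  # 980 615
```

Entering the peak here would overstate apply throughput by more than half, which is the kind of optimism the paragraph above warns about.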

How to Use This Tool:

Model one replica stream at a time. Use values from the same recent interval whenever possible so the backlog, write generation, and apply throughput describe the same workload.

  1. Enter a Replica name that will make exported rows traceable to the standby, read replica, shard copy, or replication stream you are checking.
  2. Set Current lag backlog from an unapplied WAL, binlog, relay log, slot, or change-stream byte metric. Choose MiB, GiB, or TiB to match the source value.
  3. Enter Primary write generation as the sustained rate at which the primary is adding new log data while catch-up is happening.
  4. Enter Replica apply throughput as the observed replay or apply rate. Use the Advanced controls to reduce it with Apply efficiency and Apply reserve when peaks are not sustainable or some capacity must remain available.
  5. Set the Catch-up target window to the promotion, maintenance, migration, freshness, or incident deadline you need to compare against.
  6. Add a Retained log limit when log storage, slot retention, or relay-log availability could become the blocking risk.
  7. Use Scenario write throttle and Scenario apply boost to compare mitigations without changing the baseline numbers.
  8. Adjust Chart horizon when the lag curve needs to cover a longer rebuild or a shorter incident-review window.

If an input warning appears, fix that before relying on the result. The most important warning is a retained log limit below the current backlog because it means the modeled retention budget is already exhausted or the units need review.

Interpreting Results:

The summary focuses on Catch-up ETA and Net drain rate. Positive net drain means the backlog is shrinking. Zero or negative net drain means ongoing writes are consuming all available apply capacity, so the ETA is not finite while the workload continues.

  • Inside target means the modeled catch-up ETA is less than or equal to the selected target window.
  • After target means the replica can catch up eventually, but not by the chosen deadline.
  • Will not catch up means effective apply throughput is less than or equal to primary write generation.
  • Retention risk means retained-log headroom may run out within the target window while backlog is growing.
  • Already over limit means the current backlog is greater than the retained log budget entered for the model.

The Replica Metrics tab breaks out current backlog, raw apply throughput, effective apply throughput, required apply throughput, and retained-log headroom. The required apply value is the rate needed to absorb ongoing writes and drain the existing backlog before the target closes.

The Apply Scenario Ladder compares baseline, write throttle, apply boost, throttle plus boost, paused writes, and no apply reserve. These rows are useful during incident calls because they show whether a smaller write throttle is enough or whether storage, instance, parallel apply, or replay tuning must add apply capacity.

The Throughput Margin Stack chart compares write rate, modeled apply rate, required apply rate, and net drain. The Lag Drain Curve chart shows how the backlog changes over time and where the retained log limit sits. Treat both charts as planning views; confirm live replication status before promotion, failover, or rebuild decisions.
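
The shape of the Lag Drain Curve can be approximated by sampling a constant-rate backlog line. This is a rough sketch rather than the tool's implementation: the function name, the sampling step, and the clamp at zero are assumptions, and the sample inputs reuse the 18 GiB example from Technical Details.

```python
def lag_curve(backlog_mib, net_drain, horizon_min, step_min=15):
    """Return (minute, backlog MiB) points under a constant net drain rate (MiB/min)."""
    points = []
    minute = 0
    while minute <= horizon_min:
        # A positive net drain lowers the backlog; a negative one raises it.
        remaining = max(backlog_mib - net_drain * minute, 0.0)
        points.append((minute, round(remaining, 2)))
        minute += step_min
    return points


for minute, remaining in lag_curve(18432, 179.28, horizon_min=120, step_min=30):
    print(minute, remaining)
```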

Advanced Tips:

  • Prefer a byte, WAL, binary-log, relay-log, slot, or change-stream backlog when one is available. Timestamp lag is useful for freshness alerts, but catch-up time depends on how much work remains.
  • Lower Apply efficiency when replay is slowed by checkpoints, lock waits, long transactions, single-threaded apply, replica reads, or bursty storage. A peak replay rate can make the ETA look safer than the next hour will be.
  • Keep Apply reserve above zero when the replica still has to serve reads or maintenance work. Use zero only for an emergency estimate where all observed apply capacity can be spent on catch-up.
  • Test Scenario write throttle and Scenario apply boost separately before combining them. Separate rows make it easier to see whether the cheaper mitigation is enough before changing instance size, replay workers, or write traffic.
  • Set Retained log limit from the real WAL, binlog, relay-log, slot, or managed-service storage budget. Leaving it unmodeled turns the output into a timing estimate without a retention-pressure check.
  • Match Chart horizon to the decision window. Short horizons work for incident calls, while long horizons make rebuild and migration plans easier to review after the immediate pressure passes.

Technical Details:

Replica catch-up is a stock-and-flow problem. The stock is the current backlog of unapplied change data. The flows are primary write generation and replica apply throughput. The stock falls only when adjusted apply throughput is higher than the incoming write flow.

Raw apply throughput usually needs adjustment before it can be used for an operational estimate. Apply efficiency reduces the observed rate for replay stalls, lock waits, checkpoint interference, transaction shape, and bursty storage behavior. Apply reserve subtracts capacity that should remain available for read traffic, background maintenance, or safety margin.

Formula Core

All backlog values are normalized to MiB, and all rates are normalized to MiB per minute before the model compares them.
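
A minimal sketch of that normalization step follows. The backlog units match the MiB, GiB, and TiB choices named earlier; the per-second and per-hour rate options and all function names are assumptions added for illustration.

```python
MIB_PER_UNIT = {"MiB": 1, "GiB": 1024, "TiB": 1024 * 1024}
# Assumed rate intervals: how many of each interval fit in one minute.
INTERVALS_PER_MINUTE = {"s": 60, "min": 1, "h": 1 / 60}


def to_mib(value: float, unit: str) -> float:
    """Convert a backlog value to MiB."""
    return value * MIB_PER_UNIT[unit]


def rate_to_mib_per_min(value: float, unit: str, per: str) -> float:
    """Convert a rate such as 7.5 MiB per second to MiB per minute."""
    return to_mib(value, unit) * INTERVALS_PER_MINUTE[per]


print(to_mib(18, "GiB"))                      # 18432
print(rate_to_mib_per_min(7.5, "MiB", "s"))   # 450.0
```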

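$$
\begin{aligned}
A_{\text{effective}} &= A_{\text{raw}} \times E \times (1 - R) \\
D &= A_{\text{effective}} - W \\
\text{Catch-up ETA} &= \frac{L}{D} \quad \text{when } D > 0 \\
A_{\text{required}} &= W + \frac{L}{T}
\end{aligned}
$$
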
Replica catch-up formula symbols

| Symbol | Meaning | Unit or range after conversion |
| --- | --- | --- |
| L | Current lag backlog | MiB |
| W | Primary write generation | MiB/min |
| A_raw | Replica apply throughput before adjustments | MiB/min |
| E | Apply efficiency as a decimal | 0.10 to 1.00 |
| R | Apply reserve as a decimal | 0.00 to 0.50 |
| D | Net drain rate | MiB/min |
| T | Catch-up target window | minutes |

For example, 18 GiB of backlog is 18,432 MiB. A raw apply rate of 720 MiB/min with 92% efficiency and 5% reserve becomes 629.28 MiB/min of effective apply. With primary writes at 450 MiB/min, net drain is 179.28 MiB/min, so the ETA is about 102.8 minutes. A two-hour target requires 603.6 MiB/min of effective apply (450 + 18,432 / 120), so the example clears the requirement by about 25.7 MiB/min.
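
The same arithmetic can be checked with a short Python sketch. The function and key names are hypothetical; the equations and input values are the ones given in this section, with efficiency and reserve passed as decimals.

```python
def catch_up(backlog_mib, write_rate, raw_apply, efficiency, reserve, target_min):
    """Apply the core formulas; rates are MiB/min, efficiency and reserve are decimals."""
    effective = raw_apply * efficiency * (1 - reserve)
    net_drain = effective - write_rate
    # The ETA is only finite while net drain is positive.
    eta_min = backlog_mib / net_drain if net_drain > 0 else None
    required = write_rate + backlog_mib / target_min
    return {"effective": effective, "net_drain": net_drain,
            "eta_min": eta_min, "required": required}


result = catch_up(18 * 1024, write_rate=450, raw_apply=720,
                  efficiency=0.92, reserve=0.05, target_min=120)
print(round(result["effective"], 2))  # 629.28
print(round(result["net_drain"], 2))  # 179.28
print(round(result["eta_min"], 1))    # 102.8
print(round(result["required"], 1))   # 603.6
```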

Verdict Rules

Replica lag catch-up verdict rules

| Signal | Condition | Operational meaning |
| --- | --- | --- |
| Inside target | Net drain > 0 and ETA <= target | The modeled backlog reaches zero before the selected deadline. |
| After target | Net drain > 0 and ETA > target | The replica catches up, but the deadline is missed. |
| Will not catch up | Net drain <= 0 | The backlog is flat or growing while writes continue. |
| Retention exhausted | Retained log limit <= current backlog | The modeled retained-log budget is already consumed. |
| Retention risk | Backlog is growing and headroom fills within the target window | Required logs or storage headroom may run out before the deadline. |

Retention time is calculated only when backlog is growing and the retained-log limit is above the current backlog. In that case, remaining headroom is divided by the absolute value of the negative net drain rate. When backlog is shrinking, retained-log pressure can still matter, but the model treats the headroom as not filling under the current rate assumptions.
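
The verdict rules and the retention-time calculation could be combined roughly as follows. The order in which the checks run is an assumption, since the rules above do not state a precedence; the function name and the use of `None` for an unmodeled retention limit are also hypothetical.

```python
def verdict(backlog_mib, net_drain, target_min, retained_limit_mib=None):
    """Classify a modeled replica; rates are MiB/min and sizes are MiB."""
    if retained_limit_mib is not None and retained_limit_mib <= backlog_mib:
        return "Retention exhausted"
    if net_drain < 0 and retained_limit_mib is not None:
        # Headroom divided by the absolute value of the negative net drain rate.
        minutes_to_limit = (retained_limit_mib - backlog_mib) / abs(net_drain)
        if minutes_to_limit <= target_min:
            return "Retention risk"
    if net_drain <= 0:
        return "Will not catch up"
    eta_min = backlog_mib / net_drain
    return "Inside target" if eta_min <= target_min else "After target"


print(verdict(18432, 179.28, target_min=120))                             # Inside target
print(verdict(61440, -270.72, target_min=120, retained_limit_mib=65536))  # Retention risk
```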

Scenario rows reuse the same equations with changed rates. A write throttle lowers W, an apply boost raises adjusted apply throughput, paused writes sets W to zero, and no apply reserve removes the reserve reduction while keeping efficiency in place. That makes the scenarios comparable because every row starts from the same backlog and target window.
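
A sketch of how such a ladder might be built is shown below. It assumes the throttle is a fractional reduction of W and the boost is a fractional increase of adjusted apply throughput; those interpretations, the function name, and the 20% and 15% sample values are assumptions rather than documented behavior.

```python
def scenario_ladder(backlog_mib, write_rate, raw_apply, efficiency, reserve, throttle, boost):
    """Return catch-up ETA in minutes per scenario, or None when backlog does not drain."""
    effective = raw_apply * efficiency * (1 - reserve)
    throttled = write_rate * (1 - throttle)
    boosted = effective * (1 + boost)
    rows = {
        "Baseline": (write_rate, effective),
        "Write throttle": (throttled, effective),
        "Apply boost": (write_rate, boosted),
        "Throttle plus boost": (throttled, boosted),
        "Paused writes": (0.0, effective),
        # Reserve removed, efficiency kept in place.
        "No apply reserve": (write_rate, raw_apply * efficiency),
    }
    ladder = {}
    for name, (writes, apply_rate) in rows.items():
        drain = apply_rate - writes
        ladder[name] = backlog_mib / drain if drain > 0 else None
    return ladder


ladder = scenario_ladder(18432, 450, 720, 0.92, 0.05, throttle=0.20, boost=0.15)
for name, eta in ladder.items():
    print(name, None if eta is None else round(eta, 1))
```

Because every row starts from the same backlog, the ETAs can be compared directly to see which mitigation moves the result most.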

Displayed durations and rates are rounded for readability, so manual arithmetic may differ slightly in the last decimal. The verdict uses the unrounded normalized values.

Limitations:

The output is an arithmetic estimate, not a live replication monitor or a promotion-safety check.

  • Use sustained rates from a comparable load period. Peak replay numbers can make catch-up look faster than it will be during an incident.
  • Timestamp lag and byte backlog can disagree. A replica may show low timestamp lag while still retaining a large amount of log data, or high timestamp lag after a quiet period with little backlog.
  • Long transactions, schema changes, blocked apply threads, vacuum or checkpoint pressure, slow disks, network interruptions, and replica-side reads can break the constant-rate assumption.
  • Managed-service lag metrics may use engine-specific definitions. Check the source metric before mixing PostgreSQL, MySQL, SQL Server, Oracle, or cloud-provider values.
  • Promotion safety also depends on consistency, replication errors, failover design, read routing, and data-loss tolerance.

Worked Examples:

Replica catches up before the deadline

A replica has 18 GiB of backlog, primary writes of 450 MiB/min, raw apply of 720 MiB/min, 92% efficiency, and 5% reserve. Effective apply is about 629 MiB/min, net drain is about 179 MiB/min, and the catch-up ETA is about 1.7 hours. Against a two-hour target, the verdict is inside target.

Writes keep outrunning replay

If the same replica faces 900 MiB/min of primary write generation, net drain becomes negative (about -271 MiB/min). The verdict changes to Will not catch up because every minute adds more backlog than the replica can apply. The scenario rows can show whether throttling writes, boosting apply capacity, or pausing writes changes the result.

Retention turns delay into recovery risk

A replica with 60 GiB of lag and a 64 GiB retained-log limit has only about 4 GiB of headroom. If backlog is growing at roughly 271 MiB/min, that headroom lasts about 15 minutes. The immediate concern is not just the missed target; it is the chance that required log records or storage space run out.

A warning blocks the estimate

If Current lag backlog is zero, Replica apply throughput is zero, or the retained-log limit is lower than the backlog, the warning list appears. Correct the missing value, unit mismatch, or retention budget before sharing the result.
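
The three warning conditions listed above could be checked with a small validation helper. The function name, the message strings, and the use of `None` for an unmodeled retention limit are hypothetical; the tool's actual wording may differ.

```python
def input_warnings(backlog_mib, raw_apply, retained_limit_mib=None):
    """Return a list of warning strings; an empty list means the inputs look usable."""
    warnings = []
    if backlog_mib <= 0:
        warnings.append("Current lag backlog is zero.")
    if raw_apply <= 0:
        warnings.append("Replica apply throughput is zero.")
    # None stands for "retention not modeled" in this sketch.
    if retained_limit_mib is not None and retained_limit_mib < backlog_mib:
        warnings.append("Retained log limit is below the current backlog.")
    return warnings


# A 60 GiB backlog entered against a 32 GiB limit triggers the retention warning.
print(input_warnings(61440, 720, retained_limit_mib=32768))
```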

FAQ:

Why does write rate get subtracted from apply rate?

The replica is applying old backlog while the primary keeps creating new changes. Only the apply capacity left after absorbing new writes can reduce existing lag.

Should I use seconds behind or byte backlog?

Use byte backlog for catch-up time when it is available because the estimate depends on how much change data remains. Seconds behind is still useful for user-facing freshness and alerting.

What should I enter for apply efficiency?

Use 100% only when the observed apply rate is stable and repeatable. Lower it when checkpoints, lock waits, long transactions, replica reads, or bursty storage make the raw rate optimistic.

Why is retained log pressure separate from catch-up ETA?

Catch-up ETA asks when lag reaches zero. Retained log pressure asks whether the log budget survives long enough for that to happen. A replica can have a finite ETA and still be close to a retention problem.

Why do I see a retention input warning?

The warning appears when the retained log limit is below the current backlog. Increase the limit value, correct the backlog units, or treat the replica as already outside the modeled retention budget.

Glossary:

Backlog
Unapplied change data waiting for a replica, standby, subscriber, or consumer to receive, replay, or apply.
Apply throughput
The rate at which the replica turns backlog into applied database changes.
Net drain rate
Effective apply throughput minus ongoing primary write generation.
Replication slot
A PostgreSQL mechanism that can retain required WAL for a standby or subscriber until it has consumed it.
Relay log
A MySQL replica-side log that stores events received from the source before the applier processes them.
Retention limit
The log, slot, or storage budget available before required change data may be removed or storage pressure forces intervention.
Target window
The deadline used to judge whether the modeled catch-up time is acceptable.

References: