DB Replica Lag Catch-up Calculator
Calculate online database replica catch-up time from backlog, write rate, apply throughput, and retention headroom to plan failover, rebuild, or incident recovery.
Introduction:
Database replica lag is the backlog of changes that exist on a primary or source database but have not yet been applied on a replica. The backlog may be reported as time, bytes of write-ahead log, binary log, relay log, or another change stream measure. Catch-up planning asks a simpler operational question: is the replica applying old work faster than the source is creating new work?
That question matters during failover preparation, read-replica incidents, delayed analytics refreshes, maintenance windows, cross-region recovery, and replica rebuilds. A replica can look usable for reads while still being too far behind for promotion, and a fast-looking apply rate can still miss a deadline if incoming writes consume most of the capacity.
Time-based lag is useful for freshness, but it is not always a catch-up estimate. PostgreSQL, MySQL, managed database services, and change-data-capture systems all expose lag differently, and some metrics can read as unknown, stale, or zero under conditions that still need investigation. For catch-up planning, a byte backlog plus sustained write and apply rates usually gives a clearer answer than a single seconds-behind number.
Replica catch-up is still a model, not a guarantee. Long transactions, checkpoints, storage stalls, network pauses, locks, replay conflicts, table rebuilds, and cloud instance limits can change the real apply rate. A good estimate gives operators a defensible next move: wait, throttle writes, add apply capacity, protect retained logs, or rebuild before the missing log range disappears.
Technical Details:
Replication catch-up depends on the difference between incoming change generation and effective apply throughput. Incoming writes add to the backlog while the replica is trying to drain it. Apply throughput removes backlog only after the replica has received, written, flushed, and replayed or applied the change records that matter for the database engine in use.
The useful catch-up rate is therefore a net rate. If the source produces 450 MiB of new log data per minute and the replica can apply 630 MiB per minute after efficiency and reserve are considered, the backlog shrinks at about 180 MiB per minute. If the source produces more than the replica can apply, the backlog grows and no finite catch-up time exists while writes continue at that rate.
Retention is a separate limit. A replica may be mathematically able to catch up but still fail if the source removes required WAL, binary log, relay log, slot data, or archived change records before the replica reaches them. This is why backlog size and retained-log headroom should be read together, especially during cross-region lag, stopped replication, or a long maintenance pause.
Formula Core:
The model normalizes backlog and rates to MiB and MiB per minute, then compares the remaining drain rate with the selected target window.
| Quantity | Meaning | Practical reading |
|---|---|---|
| Current lag backlog | Unapplied change data waiting for the replica. | Use byte backlog when available, such as WAL distance, relay log bytes, or change-stream backlog. |
| Source write generation | New log volume produced while the replica is catching up. | Use a sustained recent rate unless an incident spike is expected to continue. |
| Raw apply throughput | The observed rate at which the replica applies or replays change records. | Measure after storage, decompression, replay, and transaction costs are visible. |
| Apply efficiency | A reduction for stalls, checkpoints, locks, query conflicts, and burst variability. | Lower it when a short peak rate would make the estimate too optimistic. |
| Apply reserve | Capacity held back so the replica is not modeled at its absolute limit. | Keep a reserve for read traffic or storage recovery; use zero only for emergency drain planning. |
| Target window | The deadline used to judge whether catch-up is early, late, or impossible. | Match the maintenance window, promotion deadline, data freshness objective, or incident checkpoint. |
| Retained log limit | Optional storage or slot budget for the changes the replica still needs. | A low or exhausted limit can force rebuild even when the throughput math looks recoverable. |
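Read together, the Apply efficiency and Apply reserve rows act as multiplicative discounts on raw apply throughput. A minimal sketch of that reading (the function name is illustrative; the figures match the first worked example later on this page):

```python
def effective_apply(raw_mib_min: float, efficiency: float, reserve: float) -> float:
    """Discount raw apply throughput by efficiency, then hold back a reserve.

    efficiency and reserve are fractions in [0, 1].
    """
    return raw_mib_min * efficiency * (1.0 - reserve)

# 720 MiB/min raw, 92% efficiency, 5% reserve -> ~629 MiB/min effective
print(round(effective_apply(720.0, 0.92, 0.05)))  # 629
```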
PostgreSQL exposes write, flush, and replay positions and also reports lag intervals for recent WAL, but its documentation cautions that those lag intervals are not predictions of catch-up time. MySQL's seconds-behind value can be unknown or misleading when receiver and applier behavior diverge. Managed database metrics such as read-replica lag are still valuable, but they should be paired with byte backlog and throughput when the decision is about how long recovery will take.
A finite ETA means the current modeled net drain is positive. It does not prove the replica is safe to promote, consistent for a particular read, or protected from log loss. Promotion and failover decisions still need engine-specific health checks, replication status, application consistency rules, and confirmation that every required log segment remains available.
Everyday Use & Decision Guide:
Start with a measured backlog and two sustained rates. Enter the current lag backlog, the source write generation rate, and the replica apply throughput using the units that match your monitoring output. The best first pass usually comes from a recent 10 to 30 minute window, not from a single spike or a quiet minute.
The advanced controls are there to keep the estimate conservative. Lower Apply efficiency when replay is uneven, checkpoints are heavy, or long transactions keep blocking apply work. Add Apply reserve when the replica must keep serving reads while it catches up. Add Retained log limit when WAL, binlog, relay log, or slot storage pressure is part of the incident.
- Use **Catch-up Verdict** first. It names whether the replica catches up inside the target, after the target, or not at all under the current rates.
- Open **Replica Metrics** when you need the normalized backlog, effective apply throughput, net drain rate, required apply rate, and retention headroom in one place.
- Use **Apply Scenario Ladder** to compare write throttling, apply boost, both together, paused writes, and no-reserve emergency drain.
- Check **Throughput Margin Stack** when the question is how far current apply capacity sits from the target rate.
- Use **Lag Drain Curve** to show whether backlog reaches zero before the target or crosses a retained-log line before recovery.
The calculation runs in the browser from the values you enter. It does not connect to your database, query a replica, inspect logs, or test live replication. Shared URLs, CSV files, DOCX exports, chart images, and JSON exports can still reveal operational details such as replica names, write rates, backlog sizes, and recovery targets, so handle them like incident notes.
Treat a healthy result as permission to keep monitoring, not as permission to ignore the replica. If Retained log pressure shows risk, if the required apply rate is above modeled capacity, or if the net drain is negative, move from waiting to mitigation before the target window closes.
Step-by-Step Guide:
- Enter a clear **Replica name** so copied rows and exports identify the stream being modeled.
- Enter **Current lag backlog**. Prefer a byte-based backlog such as WAL distance, retained slot bytes, relay log backlog, or change-stream bytes over a seconds-only freshness metric.
- Enter **Primary write generation** from the source workload rate expected during catch-up.
- Enter **Replica apply throughput** from observed replay or apply throughput on the lagging replica.
- Set **Catch-up target window** to the deadline that matters for the operation, such as a failover checkpoint, maintenance window, or freshness objective.
- Open Advanced and adjust **Apply efficiency**, **Apply reserve**, and **Retained log limit** when the default assumptions are too optimistic for the incident.
- Use **Scenario write throttle**, **Scenario apply boost**, and **Chart horizon** to compare mitigation options without changing the baseline inputs.
Interpreting Results:
The summary figure is the modeled catch-up time when net drain is positive. If the figure reads Lag grows, the replica is falling farther behind while the current source write rate continues. In that state, a target window cannot be met without lowering writes, increasing apply throughput, removing reserve, or pausing writes long enough for the replica to drain.
| If you see | Read it as | Check next |
|---|---|---|
| Inside target | The backlog drains to zero before the selected deadline at the current modeled net drain. | Keep monitoring actual backlog and verify the apply rate stays close to the modeled rate. |
| After target | The replica drains backlog, but not fast enough for the current window. | Compare Required apply throughput with Effective apply throughput. |
| Will not catch up | Source writes are equal to or greater than effective apply throughput. | Use the scenario ladder to estimate throttle, boost, or write-pause options. |
| Retention risk or Already over limit | The retained-log budget may run out before the replica reaches needed records. | Increase retention, protect the slot or logs, reduce writes, or plan a rebuild. |
| Needs margin | Current effective apply capacity is below the rate needed to hit the target window. | Look for storage, instance, worker, lock, checkpoint, or query-conflict limits before relying on waiting. |
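The "After target" row asks you to compare required with effective apply throughput. One plausible formulation of the required rate, assuming the replica must absorb new writes and also clear the current backlog inside the window (the function name is illustrative, not the calculator's internal definition):

```python
def required_apply_mib_min(backlog_mib: float, write_rate_mib_min: float,
                           target_window_min: float) -> float:
    """Apply rate needed to absorb new writes and drain the backlog in time."""
    return write_rate_mib_min + backlog_mib / target_window_min

# 18 GiB backlog, 450 MiB/min writes, 120 min window -> ~603.6 MiB/min required
print(required_apply_mib_min(18 * 1024, 450.0, 120.0))
```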
Negative net drain is the most important stop sign. It means the backlog is growing, not merely draining slowly. A retention limit below the current backlog is another stop sign because the missing log range may already be unavailable or too close to loss for a wait-only plan.
Read scenario results as planning comparisons. A modeled apply boost may mean faster storage, a larger instance class, more apply workers, replay tuning, or removing read load from the replica. A modeled write throttle may mean queueing writes, pausing batch jobs, shifting traffic, or temporarily lowering ingest until the backlog reaches a safe size.
Worked Examples:
Replica catches up before the deadline
The default-style case has an 18 GiB backlog, 450 MiB per minute of source writes, and 720 MiB per minute of raw apply throughput. With 92% efficiency and a 5% reserve, effective apply is about 629 MiB per minute. Net drain is about 179 MiB per minute, so the backlog drains in roughly 1 hour 43 minutes. Against a 2 hour target, that is inside the window, but only by about 17 minutes.
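The arithmetic in this example can be checked end to end; this is plain arithmetic on the stated figures, with no calculator internals assumed:

```python
MIB_PER_GIB = 1024

backlog = 18 * MIB_PER_GIB             # 18 GiB backlog, in MiB
writes = 450.0                         # source writes, MiB/min
effective = 720.0 * 0.92 * (1 - 0.05)  # ~629.28 MiB/min after efficiency and reserve
drain = effective - writes             # ~179.28 MiB/min net drain
eta_min = backlog / drain              # ~103 min, i.e. roughly 1 h 43 min
print(f"ETA ~{eta_min:.0f} min, margin vs 120 min target ~{120 - eta_min:.0f} min")
```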
Replica falls behind during a write spike
A replica has 250 GiB waiting, source writes are running at 1.2 GiB per minute, and raw apply is 900 MiB per minute. After 85% efficiency and a 10% reserve, effective apply is about 689 MiB per minute. The backlog grows by more than 500 MiB per minute, so the correct operational reading is not "slow catch-up." It is "no catch-up while this write rate continues."
Retention turns a late plan into a rebuild risk
A cross-region replica is 40 GiB behind with a retained-log budget of 50 GiB. If writes keep exceeding apply by 200 MiB per minute, the remaining 10 GiB of headroom lasts about 51 minutes. A 2 hour target is irrelevant unless retention is increased or writes are reduced, because the replica may lose required records long before the recovery target.
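Under the stated assumption that the backlog eats the remaining retention budget at the net growth rate, the 51-minute figure falls out directly; a minimal sketch with an illustrative function name:

```python
def retention_headroom_min(retained_limit_mib: float, backlog_mib: float,
                           growth_mib_min: float) -> float:
    """Minutes until a growing backlog exhausts the retained-log budget."""
    headroom = retained_limit_mib - backlog_mib
    if headroom <= 0:
        return 0.0           # already at or over the retention budget
    if growth_mib_min <= 0:
        return float("inf")  # backlog is not growing toward the limit
    return headroom / growth_mib_min

# 50 GiB budget, 40 GiB backlog, growing 200 MiB/min -> ~51 minutes of headroom
print(round(retention_headroom_min(50 * 1024, 40 * 1024, 200.0)))  # 51
```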
Scenario ladder narrows the mitigation
If the baseline misses by 30 minutes, compare the write-throttle row with the apply-boost row before changing production. A small throttle may be enough when writes are the dominant pressure. A boost is more credible when storage, apply workers, or instance class clearly cap replay. If neither row reaches the target, the combined row and pause-writes row show how aggressive the incident response must become.
FAQ:
Does this measure my live replica?
No. It models catch-up from the values you enter. Use database monitoring, engine views, cloud metrics, or log data to collect backlog, write generation, and apply throughput before trusting the estimate.
Should I enter seconds of lag or bytes of backlog?
Use bytes of backlog when you have them. Seconds of lag are useful for freshness, but catch-up time depends on how much unapplied data remains and how quickly new data is being produced and applied.
Why does the result say the replica will not catch up?
The modeled effective apply throughput is not greater than the source write generation rate. The backlog cannot shrink until writes fall, apply capacity rises, or writes pause.
Why does a retained-log limit trigger an input warning?
A retained-log limit below the current backlog means the model starts past the stated storage or slot budget. That can mean the limit is wrong, the backlog metric is wrong, or replication already needs urgent attention.
What does apply reserve change?
Apply reserve subtracts capacity from the raw apply rate before net drain is calculated. It is useful when the replica still needs room for reads, checkpoints, recovery I/O, or normal system overhead during catch-up.
Can I use the chart as an incident artifact?
Yes, if the inputs are clearly labeled and measured. Exported charts and tables are planning evidence, not proof that the database engine will keep the same rate under live load.
Glossary:
- **Replica lag**: The gap between the source database state and the replica state, reported as time, bytes, log position, or another engine-specific measure.
- **Backlog**: The unapplied change data the replica still needs to receive, write, flush, replay, or apply.
- **Source write generation**: The rate at which the source database creates new log records while catch-up is underway.
- **Apply throughput**: The rate at which the replica can process backlog into database state visible to the relevant workload.
- **Net drain**: Effective apply throughput minus source write generation. Positive net drain shrinks backlog; negative net drain grows it.
- **Retained log limit**: The available log, slot, archive, or relay storage budget that must hold records until the replica no longer needs them.
- **Target window**: The deadline used to judge whether the modeled catch-up time is acceptable.