DR Runbook Report
Build a disaster recovery runbook from RTO, RPO, owners, dependencies, recovery steps, validation checks, and readiness gates for review.
Introduction:
Disaster recovery work becomes difficult when the plan exists but the recovery path is not testable. A service owner may know the backup location, an engineer may know the failover commands, and an incident commander may know who can declare a disaster, but those facts do not help enough during an outage unless they are written in one usable runbook. A recovery runbook turns scattered continuity knowledge into a sequence that responders can follow, challenge, and rehearse.
A disaster recovery runbook is narrower than a full business continuity plan. It focuses on a specific system or service, the scenario that triggers recovery, the environments involved, the accountable people, the dependencies that must recover first, and the proof required before users can trust the restored service. A strong runbook does not stop at failover commands. It records activation authority, stop conditions, validation checks, communications, recovery-mode limits, failback steps, and the evidence that should be captured along the way.
Two time objectives shape almost every recovery decision. The recovery time objective, or RTO, is the target for how long the service can be unavailable. The recovery point objective, or RPO, is the acceptable data-loss window measured back from the disruption. A daily backup may satisfy a low-priority internal app with a long RPO, but it will not satisfy a payment, identity, routing, or customer-facing system that needs recent data and short downtime. The objectives have to match the architecture that actually exists, not the strategy a team hopes to build later.
Recovery strategy names are easy to misuse. Backup and restore means rebuilding or restoring the service when needed. Pilot light keeps the critical core ready but still needs startup and scale-out work. Warm standby keeps a reduced service running. Hot standby active/passive keeps a passive environment close to full service. Active/active runs multiple serving locations at once, but it still needs data consistency rules, clean rollback paths, and corruption recovery. Faster strategies usually cost more and require more testing, monitoring, and operational discipline.
| Runbook Area | What It Settles | Common Gap |
|---|---|---|
| Objectives | How much downtime and data loss the service can tolerate. | Targets copied from policy but never tested against the real recovery design. |
| Dependencies | Which identity, DNS, data, network, vendor, and monitoring paths must recover first. | A single application component is listed while upstream access and routing paths are missing. |
| Validation | How responders prove the recovered service is usable and data is trustworthy. | Infrastructure is checked, but user paths and data-integrity proof are absent. |
| Exercise history | Whether the written plan still matches service reality. | Old tabletop notes survive after roles, access, dependencies, or topology changed. |
A readiness score is not a recovery guarantee. Real confidence comes from running restore drills, tabletop exercises, failover tests, and post-incident updates that keep the runbook aligned with the service people actually operate.
How to Use This Tool:
Use the form to turn recovery facts into a reviewable runbook draft, readiness gates, a dependency ledger, a timeline, and a JSON report. Start with the service identity and authority fields, then add the operational evidence that would matter during an exercise or outage.
- Enter the Service or system, Recovery tier, Disaster scenario, Recovery strategy, RTO, RPO, primary environment, recovery environment, runbook owner, and activation authority.
- Add Dependencies one per line. The strongest row format is dependency name, criticality, owner, recovery note, and verification focus separated by pipes, tabs, semicolons, or commas.
- Write the Failover steps, Validation checks, Restore or failback steps, and Communications and escalation as concrete actions, acceptance checks, or notification routes.
- Open Advanced and record backup evidence, replication health, privileged access and secrets, monitoring coverage, last exercise date, exercise cadence, data-integrity proof, and recovery-mode operating notes.
- Resolve every "Fix before using the runbook" message. Missing required fields, empty dependency rows, absent validation checks, or incomplete communications should be closed before the draft is used in review.
- Review DR Draft, DR Gates, Dependency Log, DR Timeline, DR Readiness, and JSON. The result is ready for owner review when blockers are gone and remaining review gates have clear actions.
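The dependency row format described in the steps above can be sketched as a small parser. This is a minimal illustration, not the tool's actual code: the `Dependency` class, the `parse_dependencies` function, and the rule of choosing the first delimiter found in the order pipe, tab, semicolon, comma are all assumptions made for the example.

```python
from dataclasses import dataclass

# Delimiters the text lists; the precedence order is an assumption.
DELIMITERS = ("|", "\t", ";", ",")


@dataclass
class Dependency:
    """Hypothetical row shape: the five fields named in the text."""
    name: str
    criticality: str = ""
    owner: str = ""
    recovery_note: str = ""
    verification_focus: str = ""


def parse_dependencies(text):
    """Parse one dependency per line, skipping blank lines."""
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Use the first listed delimiter that appears in the line, if any.
        delim = next((d for d in DELIMITERS if d in line), None)
        # maxsplit=4 keeps any extra separators inside the last field.
        parts = line.split(delim, 4) if delim else [line]
        parts = [p.strip() for p in parts]
        parts += [""] * (5 - len(parts))  # pad short rows with empty fields
        if parts[0]:
            rows.append(Dependency(*parts))
    return rows


example = """Identity DB | critical | DBA team | promote replica | login read/write
DNS; high; NetOps

Monitoring"""
for dep in parse_dependencies(example):
    print(dep)
```

Rows with fewer than five fields are padded rather than rejected, which mirrors the idea that a bare dependency name is still a usable row while a full row gives reviewers more to check.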
Interpreting Results:
The summary shows the readiness percentage, residual risk, service tier, recovery strategy, RTO, RPO, and count of ready or blocked gates. A high percentage means the entered evidence covers more of the model. It does not prove that the service can recover inside the target until the runbook is exercised and evidence is recorded.
Ready gates have enough information for practical review or exercise use. Review gates contain usable draft material but need owner confirmation, stronger proof, fresher exercise evidence, or more detailed steps. Blocked gates identify missing or conflicting inputs that could stop activation, failover, validation, failback, or acceptance.
Residual risk combines gate status with service tier. Tier 0 and Tier 1 services are more sensitive to review items because small ambiguities can become incident blockers. Any blocked gate raises risk sharply because one missing authority, access path, data proof, dependency owner, or validation check can prevent responders from declaring the service recovered.
The DR Readiness chart groups gates into objectives, dependencies, data, execution, validation, failback, and coordination. Weak segments are good agenda items for the next tabletop, restore drill, failover test, or business-owner review.
Technical Details:
Disaster recovery readiness depends on targets, architecture, evidence, and rehearsal. RTO and RPO set the business tolerance, but the recovery strategy determines whether those targets are plausible. A backup-and-restore plan needs enough time to rebuild infrastructure, restore data, and validate the service. A low RPO needs current backup or replication evidence. A short RTO needs pre-provisioned capacity, tested traffic steering, access that survives the primary outage, and validation checks that can be run quickly.
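To make the backup-and-restore point concrete, the following sketch adds up the recovery phases and compares the total with the RTO. The function name `rto_budget`, the phase names, and every duration are illustrative assumptions, not measurements or values taken from the tool.

```python
def rto_budget(rto_minutes, phase_minutes):
    """Compare the summed recovery phases against the RTO target."""
    total = sum(phase_minutes.values())
    return {
        "total_minutes": total,
        "slack_minutes": rto_minutes - total,  # negative means the target is missed
        "fits": total <= rto_minutes,
    }


# Hypothetical phase estimates for a backup-and-restore plan.
phases = {"rebuild_infrastructure": 90, "restore_data": 120, "validate_service": 45}
print(rto_budget(240, phases))
```

With these made-up estimates the phases exceed a 240-minute target, which is the kind of gap a restore drill is meant to expose before an outage does.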
The readiness model converts time objectives to minutes, compares the selected recovery strategy with reference windows, scores each gate, and averages related gates into a seven-part readiness profile. The model is intentionally conservative: missing required sections block the runbook, stale exercises reduce confidence, and dependency maps need enough rows to show the surrounding service chain rather than only the application name.
Strategy Reference Windows
| Recovery Strategy | Reference RTO | Reference RPO | Readiness Meaning |
|---|---|---|---|
| Backup and restore | 240 minutes | 1440 minutes | Infrastructure, configuration, and data are restored when needed. Lower operating cost usually means slower recovery. |
| Pilot light | 60 minutes | 60 minutes | Core data and minimal infrastructure exist in the recovery location, but more capacity must be started. |
| Warm standby | 30 minutes | 30 minutes | A reduced but functional recovery environment is already running and can scale up. |
| Hot standby active/passive | 15 minutes | 15 minutes | A near-full passive environment is ready to take traffic after routing and decision steps. |
| Multi-site active/active | 5 minutes | 5 minutes | Multiple active sites serve traffic, but data-corruption events still need backup, isolation, and integrity checks. |
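One way to read the reference windows is as a lookup that flags targets tighter than the chosen strategy normally supports. The sketch below assumes that "more aggressive" means a target below the reference value. The dictionary keys and the `strategy_fit` function are hypothetical names, and the sketch does not model the separate case of a stronger strategy paired with a loose RPO.

```python
# Reference windows from the table above: (RTO minutes, RPO minutes).
# The short keys are hypothetical identifiers for the five strategies.
REFERENCE_WINDOWS = {
    "backup_restore": (240, 1440),
    "pilot_light": (60, 60),
    "warm_standby": (30, 30),
    "hot_standby": (15, 15),
    "active_active": (5, 5),
}


def strategy_fit(strategy, rto_minutes, rpo_minutes):
    """Return findings for targets tighter than the strategy's reference window."""
    ref_rto, ref_rpo = REFERENCE_WINDOWS[strategy]
    findings = []
    if rto_minutes < ref_rto:
        findings.append(
            f"RTO {rto_minutes} min is tighter than the {ref_rto} min reference"
        )
    if rpo_minutes < ref_rpo:
        findings.append(
            f"RPO {rpo_minutes} min is tighter than the {ref_rpo} min reference"
        )
    return findings


print(strategy_fit("warm_standby", 120, 15))
print(strategy_fit("backup_restore", 480, 1440))
```

An empty list means both targets sit inside the reference window. Any finding is a prompt to confirm the architecture or relax the target, not a verdict on the service.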
Gate Scoring
| Gate Status | Score | Typical Cause |
|---|---|---|
| Ready | 100 | The required evidence is complete enough for owner review or exercise use. |
| Review | 62 | The section exists, but proof, freshness, owner confirmation, or detail is not strong enough yet. |
| Blocked | 20 | A missing or invalid element could prevent activation, failover, validation, failback, or acceptance. |
Readiness Rules
| Area | Ready Signal | Review or Block Signal |
|---|---|---|
| Objectives and authority | Service name, activation authority, positive RTO, and non-negative RPO are present. | Any required objective or authority field is missing. |
| Recovery strategy fit | The RTO is at least the strategy's reference RTO, and the RPO fits the strategy's reference data-loss window. | A target is more aggressive than the selected strategy normally supports, or a stronger strategy has a loose RPO that needs business confirmation. |
| Dependency map | At least three dependency rows are entered, with criticality and ownership inferred where provided. | One or two dependencies need review; no dependency rows block the runbook. |
| Data protection | Backup evidence and replication health are both complete. | Missing backup evidence blocks the gate; RPO targets of 60 minutes or less also block when replication evidence is missing. |
| Execution and validation | At least five failover steps, three validation checks, and four restore or failback steps are present. | Short lists remain in review, empty lists block, and required data-integrity proof expects a matching validation clue such as integrity, checksum, read, write, restore, data, or audit. |
| Exercise currency | The last exercise date is inside the selected quarterly, semiannual, or annual cadence. | No cadence, no date, invalid date, or a stale exercise blocks or downgrades the gate. |
| Access and monitoring | Privileged access and monitoring coverage are both complete. | Scheduled items need review; missing access or monitoring blocks operational confidence. |
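Several of the rules in the table above can be sketched as small functions that return a gate status. This is a rough approximation under stated assumptions, not the tool's implementation: the function names, the boolean evidence flags, the cadence day counts, the choice of "review" when replication evidence is missing but the RPO is above 60 minutes, and the choice of "review" for a stale exercise are all guesses where the table leaves the outcome open.

```python
from datetime import date

# Clue words listed in the execution and validation rule.
INTEGRITY_CLUES = ("integrity", "checksum", "read", "write", "restore", "data", "audit")
# Assumed day counts for each cadence.
CADENCE_DAYS = {"quarterly": 92, "semiannual": 183, "annual": 366}


def dependency_gate(rows):
    """Three or more rows are ready, one or two need review, none blocks."""
    if not rows:
        return "blocked"
    return "ready" if len(rows) >= 3 else "review"


def data_protection_gate(backup_complete, replication_complete, rpo_minutes):
    """Missing backup evidence blocks; missing replication blocks at RPO <= 60."""
    if not backup_complete:
        return "blocked"
    if not replication_complete:
        # The "review" outcome for longer RPOs is an assumption.
        return "blocked" if rpo_minutes <= 60 else "review"
    return "ready"


def execution_gate(failover, validation, failback, require_integrity=False):
    """Empty lists block, short lists stay in review, full lists are ready."""
    if not failover or not validation or not failback:
        return "blocked"
    if require_integrity:
        # Simple substring match, so "read" also matches words like "already".
        has_clue = any(c in v.lower() for v in validation for c in INTEGRITY_CLUES)
        if not has_clue:
            return "review"
    if len(failover) >= 5 and len(validation) >= 3 and len(failback) >= 4:
        return "ready"
    return "review"


def exercise_gate(last_exercise, cadence, today):
    """Missing cadence or date blocks; a stale or future date is downgraded."""
    if cadence not in CADENCE_DAYS or last_exercise is None:
        return "blocked"
    age_days = (today - last_exercise).days
    return "ready" if 0 <= age_days <= CADENCE_DAYS[cadence] else "review"


print(dependency_gate(["Identity DB", "DNS"]))
print(data_protection_gate(True, False, 15))
print(exercise_gate(date(2023, 1, 10), "annual", date(2024, 3, 1)))
```

Passing `today` explicitly instead of reading the system clock keeps the exercise check repeatable when the same runbook is re-scored later.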
Formula Core
Durations are normalized to minutes before strategy and score checks run. Gate scores are averaged into profile groups, and the readiness percentage is the rounded average of those group scores.
For example, an RTO of 2 hours becomes 120 minutes, while an RPO of 15 minutes stays 15 minutes. A Ready gate contributes 100, Review contributes 62, and Blocked contributes 20. The profile groups then average related gates for objectives, dependencies, data, execution, validation, failback, and coordination.
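The arithmetic can be sketched in a few lines. The gate scores and the seven group names come from the text above; the function names, the "days" unit, and the assignment of gates to groups in the sample profile are assumptions made for illustration.

```python
# "minutes" and "hours" appear in the text; "days" is an assumed extra unit.
UNIT_MINUTES = {"minutes": 1, "hours": 60, "days": 1440}
GATE_SCORES = {"ready": 100, "review": 62, "blocked": 20}


def to_minutes(value, unit):
    """Normalize a duration to minutes."""
    return value * UNIT_MINUTES[unit]


def average(values):
    values = list(values)
    return sum(values) / len(values)


def readiness_percent(profile):
    """Average gate scores per group, then round the average of the groups."""
    group_scores = {
        group: average(GATE_SCORES[status] for status in statuses)
        for group, statuses in profile.items()
        if statuses  # skip groups with no gates
    }
    return round(average(group_scores.values())), group_scores


# Hypothetical gate-to-group assignment across the seven profile groups.
profile = {
    "objectives": ["ready", "review"],
    "dependencies": ["ready"],
    "data": ["blocked"],
    "execution": ["ready"],
    "validation": ["review"],
    "failback": ["ready"],
    "coordination": ["ready", "ready"],
}
percent, groups = readiness_percent(profile)
print(to_minutes(2, "hours"), to_minutes(15, "minutes"))
print(percent, groups)
```

Averaging within groups first means one blocked gate pulls down its own group heavily but the overall percentage only partly, which is why the blocked and review gate lists matter more than the headline number.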
Security and Privacy Notes:
Runbook entries can reveal sensitive service names, topology, recovery locations, dependency relationships, vendor contacts, escalation paths, break-glass assumptions, monitoring gaps, and privileged-access readiness. Treat the draft, tables, chart, and JSON as operational continuity material.
The calculation and report generation run in the browser. The main privacy risk is what you type, copy, download, paste into tickets, or share with reviewers. Do not include passwords, private keys, recovery-console credentials, customer data, live exploit details, or vault secrets. Reference approved vaults, access procedures, incident channels, and credential owners instead of placing secret values in the runbook text.
Before using an export in a real incident, have the accountable service owner and activation authority review it, then store it where incident responders can reach it during a primary-environment outage.
Worked Examples:
- Identity platform failover: Set a Tier 1 service, primary site outage scenario, warm standby strategy, 2-hour RTO, and 15-minute RPO. Add identity database replica, DNS, firewall policy, monitoring, login validation, and failback steps. Review gates will show whether monitoring, privileged access, replication evidence, and exercise currency are strong enough for the next tabletop.
- Backup-only internal app: Choose backup and restore, enter a longer RTO, list the database restore, configuration redeploy, owner validation, and communications steps, then use blocked gates to show where restore evidence or validation checks are still missing.
- Ransomware recovery: Choose the data corruption or ransomware scenario, require data integrity proof, and include clean restore point checks, checksum or audit validation, malware isolation, business acceptance, and a cautious failback plan. The validation gate stays in review until the checks include a clear data-integrity clue.
- Stale exercise record: Keep the runbook details complete but leave the last exercise date older than the selected cadence. The exercise gate will downgrade or block the report, making the next drill a visible readiness action instead of an informal reminder.
FAQ:
Can the readiness score prove the service will recover?
No. The score reflects the entered evidence and gate rules. Real proof comes from restore drills, failover tests, monitoring evidence, data checks, and owner review.
Why does a stronger recovery strategy still need review?
Hot standby or active/active architecture can reduce downtime, but it still needs validated access, traffic steering, monitoring, data consistency, corruption recovery, and failback decisions.
Why does the dependency map expect three rows?
A real service usually depends on more than itself. Identity, DNS, databases, storage, networking, monitoring, vendor support, and communication paths often determine whether recovery succeeds.
What should go in validation checks?
Include user-path checks, data read/write or integrity proof, monitoring checks, dependency reachability, business acceptance, and evidence that responders can timestamp during recovery.
Should secrets be placed in the generated runbook?
No. Name the vault, access procedure, credential owner, or break-glass process. Do not paste passwords, keys, tokens, customer records, or private exploit details into the runbook text.
Glossary:
- Disaster recovery runbook: A service-specific document that records activation, recovery steps, validation evidence, communications, and failback actions.
- RTO: Recovery time objective, the target time allowed to restore service after disruption.
- RPO: Recovery point objective, the acceptable data-loss window measured back from the disruption.
- Failover: The process of moving service from the failed primary environment to the recovery environment.
- Failback: The process of returning service from recovery mode to the repaired or rebuilt primary environment.
- Dependency: An upstream or downstream service, vendor, data path, access path, or operational system that must work for recovery to succeed.
- Exercise cadence: The recurring tabletop, restore drill, or failover test schedule used to keep the runbook current.
- Residual risk: The remaining recovery concern after gate status and service tier are considered.
References:
- SP 800-34 Rev. 1: Contingency Planning Guide for Federal Information Systems, NIST CSRC.
- Disaster recovery options in the cloud, AWS documentation.
- How to set up Elasticsearch cross-cluster replication, Simplified Guide.
- How to check GlusterFS geo-replication status, Simplified Guide.
- How to set a GlusterFS geo-replication checkpoint, Simplified Guide.