DR Runbook Report

Service or system:

Use the name responders and business owners will recognize during an incident.

Recovery tier:

Choose the priority tier agreed through business impact analysis or service ownership.

Disaster scenario:

Pick the failure mode this runbook is meant to handle first.

Recovery strategy:

Use the strategy the service has actually built, not the one planned for a future architecture.

Recovery time objective:

Enter the RTO agreed by the business owner or continuity plan.

Recovery point objective:

Use the data-loss target the backup or replication design can prove.

Primary environment:

Be specific enough that responders know the failed source environment.

Recovery environment:

Name the destination that will carry service while the primary environment is unavailable.

Runbook owner:

Use a role when the assigned person changes by rota.

Activation authority:

Name the role that can declare a disaster; it appears in the declaration and rollback decision sections.

Dependencies:

Include identity, DNS, databases, queues, object storage, network paths, vendors, and monitoring sources.

Failover steps:

Keep every step observable and reversible where possible.

Validation checks:

These become acceptance criteria before declaring the service recovered.

Restore or failback steps:

Include data resync, traffic return, monitoring, and closeout evidence.

Communications and escalation:

Add the bridge, business owner, service desk, vendor, on-call, and executive update path.


Backup evidence:

Backup evidence drives the data protection gate and the post-exercise notes.

Replication health:

For backup-only recovery, Complete means restore evidence proves the data point.

Privileged access and secrets:

Mark Complete only when responders have verified access without depending on the failed environment.

Monitoring coverage:

Monitoring appears in the recovery-mode and closeout sections.

Last DR exercise:

Use the last tabletop, restore drill, game day, or full failover test date.

Exercise cadence:

Choose the cadence expected by the service owner or compliance program.

Data integrity proof:

When enabled, the validation gate expects explicit data-integrity evidence.

Recovery-mode operating notes:

One note per line; these become recovery-mode operating instructions.

Introduction:

Disaster recovery work becomes difficult when the plan exists but the recovery path is not testable. A service owner may know the backup location, an engineer may know the failover commands, and an incident commander may know who can declare a disaster, but those facts are of little use during an outage unless they are written down in one usable runbook. A recovery runbook turns scattered continuity knowledge into a sequence that responders can follow, challenge, and rehearse.

A disaster recovery runbook is narrower than a full business continuity plan. It focuses on a specific system or service, the scenario that triggers recovery, the environments involved, the accountable people, the dependencies that must recover first, and the proof required before users can trust the restored service. A strong runbook does not stop at failover commands. It records activation authority, stop conditions, validation checks, communications, recovery-mode limits, failback steps, and the evidence that should be captured along the way.

Two time objectives shape almost every recovery decision. The recovery time objective, or RTO, is the target for how long the service can be unavailable. The recovery point objective, or RPO, is the acceptable data-loss window measured back from the disruption. A daily backup may satisfy a low-priority internal app with a long RPO, but it will not satisfy a payment, identity, routing, or customer-facing system that needs recent data and short downtime. The objectives have to match the architecture that actually exists, not the strategy a team hopes to build later.
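
The RPO comparison above can be checked with simple arithmetic. The sketch below is illustrative only: it assumes the worst-case data loss of a periodic backup equals the backup interval (a failure just before the next backup runs) and ignores how long the backup itself takes. The function names are hypothetical and are not part of the tool.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Assumption: failure strikes just before the next backup would have run."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True when the worst-case loss window fits inside the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


daily = timedelta(days=1)
print(meets_rpo(daily, timedelta(hours=24)))    # True: a 24-hour RPO tolerates daily backups
print(meets_rpo(daily, timedelta(minutes=15)))  # False: a 15-minute RPO needs replication or frequent backups
```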

Recovery strategy names are easy to misuse. Backup and restore means rebuilding or restoring the service when needed. Pilot light keeps the critical core ready but still needs startup and scale-out work. Warm standby keeps a reduced service running. Hot standby active/passive keeps a passive environment close to full service. Active/active runs multiple serving locations at once, but it still needs data consistency rules, clean rollback paths, and corruption recovery. Faster strategies usually cost more and require more testing, monitoring, and operational discipline.

Runbook Area	What It Settles	Common Gap
Objectives	How much downtime and data loss the service can tolerate.	Targets copied from policy but never tested against the real recovery design.
Dependencies	Which identity, DNS, data, network, vendor, and monitoring paths must recover first.	A single application component is listed while upstream access and routing paths are missing.
Validation	How responders prove the recovered service is usable and data is trustworthy.	Infrastructure is checked, but user paths and data-integrity proof are absent.
Exercise history	Whether the written plan still matches service reality.	Old tabletop notes survive after roles, access, dependencies, or topology changed.

A readiness score is not a recovery guarantee. Real confidence comes from running restore drills, tabletop exercises, failover tests, and post-incident updates that keep the runbook aligned with the service people actually operate.

How to Use This Tool:

Use the form to turn recovery facts into a reviewable runbook draft, readiness gates, a dependency ledger, a timeline, and a JSON report. Start with the service identity and authority fields, then add the operational evidence that would matter during an exercise or outage.

Enter the Service or system, Recovery tier, Disaster scenario, Recovery strategy, RTO, RPO, primary environment, recovery environment, runbook owner, and activation authority.
Add Dependencies one per line. The strongest row format is dependency name, criticality, owner, recovery note, and verification focus separated by pipes, tabs, semicolons, or commas.
Write the Failover steps, Validation checks, Restore or failback steps, and Communications and escalation as concrete actions, acceptance checks, or notification routes.
Open Advanced and record backup evidence, replication health, privileged access and secrets, monitoring coverage, last exercise date, exercise cadence, data-integrity proof, and recovery-mode operating notes.
Resolve every message listed under "Fix before using the runbook". Missing required fields, empty dependency rows, absent validation checks, or incomplete communications should be closed before the draft is used in review.
Review DR Draft, DR Gates, Dependency Log, DR Timeline, DR Readiness, and JSON. The result is ready for owner review when blockers are gone and remaining review gates have clear actions.
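
The dependency row format described in the steps above can be sketched as a small parser. This is a minimal illustration, not the tool's actual parsing logic: it assumes the first delimiter found in the order pipe, tab, semicolon, comma is used for the whole line, and that missing trailing fields are left empty. The `Dependency` class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

FIELD_COUNT = 5  # name, criticality, owner, recovery note, verification focus
DELIMITERS = ("|", "\t", ";", ",")  # assumed priority order


@dataclass
class Dependency:
    name: str
    criticality: str = ""
    owner: str = ""
    recovery_note: str = ""
    verification_focus: str = ""


def parse_dependency_line(line: str) -> Optional[Dependency]:
    """Parse one row; return None for blank rows or rows without a name."""
    line = line.strip()
    if not line:
        return None
    delimiter = next((d for d in DELIMITERS if d in line), None)
    if delimiter is None:
        parts = [line]
    else:
        # Extra delimiters past the fifth field stay inside the last field.
        parts = [p.strip() for p in line.split(delimiter, FIELD_COUNT - 1)]
    if not parts[0]:
        return None
    return Dependency(*parts)


def parse_dependencies(text: str) -> List[Dependency]:
    rows = (parse_dependency_line(line) for line in text.splitlines())
    return [row for row in rows if row is not None]


sample = "Identity provider | critical | IAM team | use secondary tenant | test login\n\nDNS; high; NetOps\n"
for dep in parse_dependencies(sample):
    print(dep)
```

Rows that give only a name still parse, which mirrors the idea that criticality and ownership are used where provided rather than required.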

Interpreting Results:

The summary shows the readiness percentage, residual risk, service tier, recovery strategy, RTO, RPO, and the counts of ready and blocked gates. A high percentage means the entered evidence covers more of the model. It does not prove that the service can recover inside the target until the runbook is exercised and evidence is recorded.

Ready gates have enough information for practical review or exercise use. Review gates contain usable draft material but need owner confirmation, stronger proof, fresher exercise evidence, or more detailed steps. Blocked gates identify missing or conflicting inputs that could stop activation, failover, validation, failback, or acceptance.

Residual risk combines gate status with service tier. Tier 0 and Tier 1 services are more sensitive to review items because small ambiguities can become incident blockers. Any blocked gate raises risk sharply because one missing authority, access path, data proof, dependency owner, or validation check can prevent responders from declaring the service recovered.

The DR Readiness chart groups gates into objectives, dependencies, data, execution, validation, failback, and coordination. Weak segments are good agenda items for the next tabletop, restore drill, failover test, or business-owner review.

Technical Details:

Disaster recovery readiness depends on targets, architecture, evidence, and rehearsal. RTO and RPO set the business tolerance, but the recovery strategy determines whether those targets are plausible. A backup-and-restore plan needs enough time to rebuild infrastructure, restore data, and validate the service. A low RPO needs current backup or replication evidence. A short RTO needs pre-provisioned capacity, tested traffic steering, access that survives the primary outage, and validation checks that can be run quickly.

The readiness model converts time objectives to minutes, compares the selected recovery strategy with reference windows, scores each gate, and averages related gates into a seven-part readiness profile. The model is intentionally conservative: missing required sections block the runbook, stale exercises reduce confidence, and dependency maps need enough rows to show the surrounding service chain rather than only the application name.

Strategy Reference Windows

Recovery Strategy	Reference RTO	Reference RPO	Readiness Meaning
Backup and restore	240 minutes	1440 minutes	Infrastructure, configuration, and data are restored when needed. Lower operating cost usually means slower recovery.
Pilot light	60 minutes	60 minutes	Core data and minimal infrastructure exist in the recovery location, but more capacity must be started.
Warm standby	30 minutes	30 minutes	A reduced but functional recovery environment is already running and can scale up.
Hot standby active/passive	15 minutes	15 minutes	A near-full passive environment is ready to take traffic after routing and decision steps.
Multi-site active/active	5 minutes	5 minutes	Multiple active sites serve traffic, but data-corruption events still need backup, isolation, and integrity checks.
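
The reference windows in the table can be turned into a simple fit check. The sketch below is a hedged reading of the model, not its implementation: it assumes a target "fits" when it is no shorter than the strategy's reference value, and it does not model the separate case where a stronger strategy carries a loose RPO. The function name `strategy_fit` is hypothetical.

```python
# Reference (RTO, RPO) windows in minutes, copied from the table above.
REFERENCE_WINDOWS = {
    "Backup and restore": (240, 1440),
    "Pilot light": (60, 60),
    "Warm standby": (30, 30),
    "Hot standby active/passive": (15, 15),
    "Multi-site active/active": (5, 5),
}


def strategy_fit(strategy: str, rto_minutes: float, rpo_minutes: float) -> list:
    """Return findings; an empty list means both targets sit inside the windows."""
    ref_rto, ref_rpo = REFERENCE_WINDOWS[strategy]
    findings = []
    if rto_minutes < ref_rto:
        findings.append(
            f"RTO {rto_minutes} min is more aggressive than the {ref_rto} min reference"
        )
    if rpo_minutes < ref_rpo:
        findings.append(
            f"RPO {rpo_minutes} min is more aggressive than the {ref_rpo} min reference"
        )
    return findings


# Warm standby with a 2-hour RTO and a 15-minute RPO: the RPO needs review.
print(strategy_fit("Warm standby", 120, 15))
# Backup and restore with an 8-hour RTO and a 24-hour RPO: no findings.
print(strategy_fit("Backup and restore", 480, 1440))
```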

Gate Scoring

Gate Status	Score	Typical Cause
Ready	100	The required evidence is complete enough for owner review or exercise use.
Review	62	The section exists, but proof, freshness, owner confirmation, or detail is not strong enough yet.
Blocked	20	A missing or invalid element could prevent activation, failover, validation, failback, or acceptance.

Readiness Rules

Area	Ready Signal	Review or Block Signal
Objectives and authority	Service name, activation authority, positive RTO, and non-negative RPO are present.	Any required objective or authority field is missing.
Recovery strategy fit	The RTO is at least the strategy's reference RTO, and the RPO fits the strategy's reference data-loss window.	A target is more aggressive than the selected strategy normally supports, or a stronger strategy has a loose RPO that needs business confirmation.
Dependency map	At least three dependency rows are entered, with criticality and ownership inferred where provided.	One or two dependencies need review; no dependency rows block the runbook.
Data protection	Backup evidence and replication health are both complete.	Missing backup evidence blocks the gate; RPO targets of 60 minutes or less also block when replication evidence is missing.
Execution and validation	At least five failover steps, three validation checks, and four restore or failback steps are present.	Short lists remain in review, empty lists block, and required data-integrity proof expects a matching validation clue such as integrity, checksum, read, write, restore, data, or audit.
Exercise currency	The last exercise date is inside the selected quarterly, semiannual, or annual cadence.	No cadence, no date, invalid date, or a stale exercise blocks or downgrades the gate.
Access and monitoring	Privileged access and monitoring coverage are both complete.	Scheduled items need review; missing access or monitoring blocks operational confidence.
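
Three of the rules in the table can be sketched as small functions. This is an illustrative reading under stated assumptions, not the tool's code: the data protection gate is assumed to fall back to review when backup evidence is complete but replication is not and the RPO is above 60 minutes; the execution and validation area is treated as a single gate; and data-integrity clues are matched as case-insensitive substrings. All function names are hypothetical.

```python
INTEGRITY_CLUES = ("integrity", "checksum", "read", "write", "restore", "data", "audit")


def dependency_gate(row_count: int) -> str:
    """Three or more rows are ready, one or two need review, none blocks."""
    if row_count == 0:
        return "blocked"
    return "ready" if row_count >= 3 else "review"


def data_protection_gate(backup_complete: bool, replication_complete: bool,
                         rpo_minutes: float) -> str:
    if not backup_complete:
        return "blocked"
    if not replication_complete:
        # Tight RPO targets block without replication evidence.
        # Assumption: looser targets only downgrade to review.
        return "blocked" if rpo_minutes <= 60 else "review"
    return "ready"


def execution_gate(failover, validation, failback, require_integrity=False) -> str:
    """Empty lists block; short lists or a missing integrity clue stay in review."""
    if not failover or not validation or not failback:
        return "blocked"
    long_enough = len(failover) >= 5 and len(validation) >= 3 and len(failback) >= 4
    has_clue = any(clue in check.lower()
                   for check in validation for clue in INTEGRITY_CLUES)
    if long_enough and (has_clue or not require_integrity):
        return "ready"
    return "review"


checks = ["Login works", "Alerts firing", "Status page green"]
print(execution_gate(["step"] * 5, checks, ["step"] * 4, require_integrity=True))  # review
checks.append("Checksum matches source")
print(execution_gate(["step"] * 5, checks, ["step"] * 4, require_integrity=True))  # ready
```

Substring matching is deliberately loose here; a word such as "already" would satisfy the "read" clue, so a stricter word-boundary match may be preferable in practice.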

Formula Core

Durations are normalized to minutes before strategy and score checks run. Gate scores are averaged into profile groups, and the readiness percentage is the rounded average of those group scores.

\text{durationMinutes} = \text{value} \times \text{unitMultiplier}

\text{readinessPercent} = \operatorname{round}\left(\frac{\text{sum of profile group scores}}{\text{number of profile groups}}\right)

For example, an RTO of 2 hours becomes 120 minutes, while an RPO of 15 minutes stays 15 minutes. A Ready gate contributes 100, Review contributes 62, and Blocked contributes 20. The profile groups then average related gates for objectives, dependencies, data, execution, validation, failback, and coordination.
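
The arithmetic above can be worked through in a few lines. The sketch below is illustrative: the unit names, the assignment of gates to profile groups, and the sample statuses are assumptions made for the example, while the gate scores (100, 62, 20) and the group-then-average order come from the text.

```python
UNIT_MINUTES = {"minutes": 1, "hours": 60, "days": 1440}  # assumed unit names
GATE_SCORES = {"ready": 100, "review": 62, "blocked": 20}


def to_minutes(value: float, unit: str) -> float:
    """Normalize a duration to minutes before any comparison."""
    return value * UNIT_MINUTES[unit]


def group_score(statuses) -> float:
    """Average the gate scores inside one profile group."""
    return sum(GATE_SCORES[s] for s in statuses) / len(statuses)


def readiness_percent(groups: dict) -> int:
    """Rounded average of the profile group scores."""
    scores = [group_score(statuses) for statuses in groups.values()]
    return round(sum(scores) / len(scores))


# Hypothetical gate statuses for the seven profile groups.
profile = {
    "objectives": ["ready", "ready"],
    "dependencies": ["review"],
    "data": ["blocked"],
    "execution": ["ready"],
    "validation": ["review"],
    "failback": ["ready"],
    "coordination": ["ready", "review"],
}

print(to_minutes(2, "hours"))      # 120
print(to_minutes(15, "minutes"))   # 15
print(readiness_percent(profile))  # 75
```

In this sample the group scores are 100, 62, 20, 100, 62, 100, and 81, which sum to 525 and average to 75 across seven groups. A single blocked gate in a one-gate group pulls the overall figure down noticeably, which matches the conservative intent of the model.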

Security and Privacy Notes:

Runbook entries can reveal sensitive service names, topology, recovery locations, dependency relationships, vendor contacts, escalation paths, break-glass assumptions, monitoring gaps, and privileged-access readiness. Treat the draft, tables, chart, and JSON as operational continuity material.

The calculation and report generation run in the browser. The main privacy risk is what you type, copy, download, paste into tickets, or share with reviewers. Do not include passwords, private keys, recovery-console credentials, customer data, live exploit details, or vault secrets. Reference approved vaults, access procedures, incident channels, and credential owners instead of placing secret values in the runbook text.
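
A quick pre-share check can catch the most obvious mistakes before runbook text is pasted into a ticket. The sketch below is a rough heuristic under stated assumptions, not a substitute for a dedicated secret scanner: the two patterns (a PEM private key header and a `key = value` style credential) are illustrative choices, and the function name is hypothetical.

```python
import re

# Illustrative patterns only; real scanners use far larger rule sets.
SUSPECT_PATTERNS = {
    "private key block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "inline credential": re.compile(
        r"(?i)\b(password|passwd|secret|token|api[_-]?key)\b\s*[:=]\s*\S+"
    ),
}


def find_suspect_lines(text: str) -> list:
    """Return (line number, label) pairs for lines that look like secrets."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SUSPECT_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, label))
    return findings


draft = (
    "Step 1: fetch database credentials from the approved vault\n"
    "Step 2: password = hunter2\n"
)
print(find_suspect_lines(draft))  # [(2, 'inline credential')]
```

The first step passes because it names the vault rather than the value, which is the pattern the runbook text should follow.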

Before using an export in a real incident, have the accountable service owner and activation authority review it, then store it where incident responders can reach it during a primary-environment outage.

Worked Examples:

Identity platform failover: Set a Tier 1 service, primary site outage scenario, warm standby strategy, 2-hour RTO, and 15-minute RPO. Add identity database replica, DNS, firewall policy, monitoring, login validation, and failback steps. Review gates will show whether monitoring, privileged access, replication evidence, and exercise currency are strong enough for the next tabletop.
Backup-only internal app: Choose backup and restore, enter a longer RTO, list the database restore, configuration redeploy, owner validation, and communications steps, then use blocked gates to show where restore evidence or validation checks are still missing.
Ransomware recovery: Choose the data corruption or ransomware scenario, require data integrity proof, and include clean restore point checks, checksum or audit validation, malware isolation, business acceptance, and a cautious failback plan. The validation gate stays in review until the checks include a clear data-integrity clue.
Stale exercise record: Keep the runbook details complete but leave the last exercise date older than the selected cadence. The exercise gate will downgrade or block the report, making the next drill a visible readiness action instead of an informal reminder.
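
The stale exercise case can be sketched as a date comparison. This is a minimal illustration with loud assumptions: the cadence lengths in days are approximations chosen for the example, a missing or invalid input is treated as blocked, and a stale but valid date is treated as review. The text says only that these cases block or downgrade the gate, so the exact split is a guess, and the function name is hypothetical.

```python
from datetime import date

# Assumed cadence lengths in days; the tool may use different values.
CADENCE_DAYS = {"quarterly": 90, "semiannual": 180, "annual": 365}


def exercise_gate(last_exercise: str, cadence: str, today: date) -> str:
    """Classify exercise currency from an ISO date string and a cadence name."""
    if cadence not in CADENCE_DAYS or not last_exercise:
        return "blocked"  # assumption: missing inputs block
    try:
        exercised_on = date.fromisoformat(last_exercise)
    except ValueError:
        return "blocked"  # assumption: invalid dates block
    age_days = (today - exercised_on).days
    if age_days < 0:
        return "blocked"  # assumption: a future date is treated as invalid
    return "ready" if age_days <= CADENCE_DAYS[cadence] else "review"


today = date(2024, 3, 1)
print(exercise_gate("2024-01-10", "quarterly", today))  # ready (51 days old)
print(exercise_gate("2023-01-10", "annual", today))     # review (416 days old)
print(exercise_gate("not-a-date", "annual", today))     # blocked
```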

FAQ:

Can the readiness score prove the service will recover?

No. The score reflects the entered evidence and gate rules. Real proof comes from restore drills, failover tests, monitoring evidence, data checks, and owner review.

Why does a stronger recovery strategy still need review?

Hot standby or active/active architecture can reduce downtime, but it still needs validated access, traffic steering, monitoring, data consistency, corruption recovery, and failback decisions.

Why does the dependency map expect three rows?

A real service usually depends on more than itself. Identity, DNS, databases, storage, networking, monitoring, vendor support, and communication paths often determine whether recovery succeeds.

What should go in validation checks?

Include user-path checks, data read/write or integrity proof, monitoring checks, dependency reachability, business acceptance, and evidence that responders can timestamp during recovery.

Should secrets be placed in the generated runbook?

No. Name the vault, access procedure, credential owner, or break-glass process. Do not paste passwords, keys, tokens, customer records, or private exploit details into the runbook text.

Glossary:

Disaster recovery runbook: A service-specific document that records activation, recovery steps, validation evidence, communications, and failback actions.
RTO: Recovery time objective, the target time allowed to restore service after disruption.
RPO: Recovery point objective, the acceptable data-loss window measured back from the disruption.
Failover: The process of moving service from the failed primary environment to the recovery environment.
Failback: The process of returning service from recovery mode to the repaired or rebuilt primary environment.
Dependency: An upstream or downstream service, vendor, data path, access path, or operational system that must work for recovery to succeed.
Exercise cadence: The recurring tabletop, restore drill, or failover test schedule used to keep the runbook current.
Residual risk: The remaining recovery concern after gate status and service tier are considered.
