HA Pair Role Consistency Checker
Check HA pair role rows for active/standby drift, stale sync, heartbeat failures, VIP mismatches, and priority conflicts before failover.
Introduction
High availability pairs are meant to make one service survive a node failure without forcing users to choose a new address or restart their work. The usual active/passive design has one peer serving traffic and another peer waiting with enough configuration, state, and control-plane contact to take over. That simple story becomes harder to prove during maintenance windows, firewall upgrades, routing changes, database patching, or incident review because the evidence rarely arrives as one clean yes-or-no value.
A useful HA review starts by separating role from readiness. A row that says active is making an ownership claim. A row that says up is making a health claim. A row that says synced, heartbeat up, or preempt on is describing supporting evidence around that claim. These signals can disagree. The active peer can be marked down, the standby peer can be healthy but stale, two peers can both claim the active role, or the virtual IP address can differ between rows that are supposed to describe the same protected service.
| Evidence | Question it answers | Common mistake |
|---|---|---|
| Role | Which peer claims the protected service right now? | Treating primary, master, owner, active, backup, secondary, and standby wording as interchangeable without checking health. |
| State | Is the peer actually healthy enough for its role? | Calling a pair ready because the standby exists, even though it is down, degraded, or in maintenance. |
| VIP or service | Are the rows protecting the same shared address or workload? | Mixing rows from different services and then trusting the active and standby counts. |
| Sync and heartbeat | Can state and control messages still move between peers? | Assuming an active peer is safe because traffic is still flowing, while replication or the HA link has already failed. |
| Priority and preempt | Should the current owner match the election or failback policy? | Flagging every lower-priority active peer as wrong when the platform intentionally disables automatic failback. |
Split brain is the most urgent role-consistency failure. It happens when more than one peer acts as if it owns the same protected service. Even when no split-brain symptom appears, a false sense of redundancy can still be costly. A down standby, stale session sync, or broken heartbeat path may leave the active peer serving traffic with no trustworthy failover target. Planned work that would normally be routine can become risky because the pair is no longer in the state the runbook assumes.
Priority rules deserve extra care because they are not universal. VRRP-style systems normally prefer a higher-priority peer, and preemption can let a recovered preferred peer reclaim ownership. Other platforms pin roles, invert scores, suppress failback, or use priority-like values only as hints. The safe question is not only which number is larger, but whether the current owner agrees with the platform convention that applies to that service.
A row-based consistency check can catch drift in the evidence before a failover test or change window. It cannot prove that every dependency will move correctly. Quorum, fencing, route advertisements, storage locks, session replication, application warm-up, and vendor-specific failover rules still need direct product checks.
How to Use This Tool:
Paste the HA evidence you already have, then set the expected ownership and readiness rules for the service group you are reviewing.
- Paste CSV or tab-delimited rows into HA role rows. A header row is optional. Without a header, the expected order is node, role, priority, VIP, state, sync, heartbeat, and preempt. Lines that start with # are ignored.
- Include a group, service, or VIP value when one paste contains more than one protected service. Rows with the same group are checked together, and rows without a group fall back to the VIP or a default pair name.
- Set Expected active-up to the number of peers that should be both active-style and up-style. Ordinary active/passive pairs usually stay at 1.
- Set Minimum standby-up to the number of standby-style peers that must be up before the group counts as ready for failover. A two-node pair usually needs 1.
- Choose Priority rule according to the platform convention. Use higher priority should be active for VRRP-like election behavior, lower priority should be active for inverted conventions, or ignore priority when ownership is intentionally pinned outside numeric priority.
- Set Sync stale after for numeric sync lag. Lag above that value is treated as stale. Text values such as synced, stale, lagging, and out-of-sync are read by meaning.
- Start with Pair Health Table for the group verdict. Use Node Role Ledger to inspect each peer, Failover Findings for failed checks and next steps, and Priority Role Ladder when numeric priorities need visual comparison.
If Check HA input rows appears, fix missing node names, missing roles, or too few node rows before trusting the results. Parsed rows may still appear, but the redundancy judgment is incomplete until the validation warning is gone.
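The input rules above (optional header, default column order, ignored # lines, and the validation alert) can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the tool's actual parser. The function names `parse_ha_rows` and `validation_problems` are hypothetical, and the header test (a first row that names both node and role) is an assumption.

```python
import csv

# Default column order when the paste has no header row (from the steps above).
DEFAULT_COLUMNS = ["node", "role", "priority", "vip", "state", "sync", "heartbeat", "preempt"]


def parse_ha_rows(text):
    """Parse CSV or tab-delimited HA rows into a list of dicts."""
    # Drop blank lines and lines that start with '#'.
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.lstrip().startswith("#")]
    if not lines:
        return []
    delimiter = "\t" if "\t" in lines[0] else ","
    records = [[cell.strip() for cell in rec] for rec in csv.reader(lines, delimiter=delimiter)]
    first = [cell.lower() for cell in records[0]]
    # Assumption: a first row naming both "node" and "role" is treated as a header.
    if "node" in first and "role" in first:
        columns, records = first, records[1:]
    else:
        columns = DEFAULT_COLUMNS
    return [dict(zip(columns, rec)) for rec in records]


def validation_problems(rows):
    """Return the reasons a paste would need fixing before the verdict is trusted."""
    problems = []
    if len(rows) < 2:
        problems.append("fewer than two node rows")
    for index, row in enumerate(rows, start=1):
        if not row.get("node"):
            problems.append(f"row {index}: missing node name")
        if not row.get("role"):
            problems.append(f"row {index}: missing role")
    return problems
```

Calling `validation_problems(parse_ha_rows(paste))` on a one-line paste returns the too-few-rows reason, which mirrors the case where parsed rows still appear but the redundancy judgment is incomplete.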
Interpreting Results:
The most important fields are Active-up, Standby-up, Sync, and Verdict in the Pair Health Table. A clean verdict means the supplied rows match the selected rules that could be checked. A critical verdict means at least one core failover condition failed, such as the active owner count, standby readiness, active state, or heartbeat health.
| Output cue | What it means | Verification to perform |
|---|---|---|
| Critical | A core rule failed. Typical causes are no active-up peer, too many active-up peers, too few standby-up peers, an active peer that is not up, or a failed heartbeat. | Confirm live ownership and peer state from the HA platform before planned failover, restart, routing work, or device maintenance. |
| Warning | The service may still be running, but the rows show drift such as stale sync, standby-down evidence, VIP disagreement, duplicate node rows, unknown role/state wording, or priority conflict. | Open Failover Findings and decide whether the warning is real drift, stale inventory, or an intentional vendor-specific exception. |
| Review | The available rows do not prove a failure, but they leave a confidence gap such as missing active priority or mixed preempt values. | Add the missing fields or compare the preempt setting with the runbook before using the result as a change gate. |
| Clean | The supplied active, standby, VIP, sync, heartbeat, priority, and duplicate-row checks did not find drift under the selected rules. | Still verify quorum, fencing, route advertisement, replication, storage locks, and application readiness for the actual failover path. |
Avoid false confidence from a clean group count alone. The Pair Health Table summarizes the group, while Node Role Ledger and Failover Findings show which row and which rule created the verdict. Use the row evidence when deciding whether to proceed, pause, or collect a fresher export.
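One way to picture how individual findings roll up into the four output cues is a worst-severity-wins reduction. The sketch below assumes that informational findings surface as the Review cue, which matches the examples in the table (missing active priority, mixed preempt values) but is not confirmed as the tool's exact logic. `group_verdict` is a hypothetical name.

```python
# Rank of each output cue; a higher rank wins when findings disagree.
SEVERITY_RANK = {"clean": 0, "review": 1, "warning": 2, "critical": 3}

# Assumption: informational findings are shown under the Review cue.
FINDING_TO_CUE = {"info": "review", "warning": "warning", "critical": "critical"}


def group_verdict(finding_severities):
    """Return the worst cue among a group's findings, or 'clean' when there are none."""
    cues = [FINDING_TO_CUE[severity] for severity in finding_severities]
    return max(cues, key=SEVERITY_RANK.__getitem__, default="clean")
```

Because only the worst cue reaches the group row, a critical verdict can hide additional warnings, which is why the row-level tables remain worth reading.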
Technical Details:
The checker treats each row as one HA peer observation. The evaluation boundary is the group name when present, then the VIP or service value when the group is blank, and finally a default pair label when neither is supplied. Inside each group, the rules compare normalized role, normalized state, numeric priority when available, sync freshness, heartbeat status, preempt values, duplicate node names, and service identity.
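The fallback order for the evaluation boundary can be sketched as follows. The field names, the relative order of VIP and service, and the `default-pair` label are assumptions made for illustration. The text only states that the group wins, then the VIP or service value, then a default label.

```python
from collections import defaultdict

DEFAULT_GROUP = "default-pair"  # hypothetical label; the tool's real default name is not stated


def group_key(row):
    """Pick the evaluation boundary: group first, then VIP or service, then a default."""
    for field in ("group", "vip", "service"):
        value = (row.get(field) or "").strip()
        if value:
            return value
    return DEFAULT_GROUP


def group_rows(rows):
    """Bucket peer rows by their evaluation boundary."""
    groups = defaultdict(list)
    for row in rows:
        groups[group_key(row)].append(row)
    return dict(groups)
```

Under this sketch, two rows with blank group values but different VIPs land in different buckets, so a paste that mixes services without a group column would be evaluated as several small groups.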
Role normalization keeps common HA vocabulary comparable while preserving the original row text in the ledger. Active-style roles include active, primary, master, and owner. Standby-style roles include standby, passive, secondary, backup, and spare. State is evaluated separately because role claims do not prove readiness. A peer can be active and down, standby and up, standby and stale, or unknown enough to require manual review.
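A minimal sketch of this normalization, using the role vocabulary listed above and the up-style state words from the field table in this section. The function names and the returned labels are illustrative assumptions, not the tool's identifiers.

```python
ACTIVE_ROLES = {"active", "primary", "master", "owner"}
STANDBY_ROLES = {"standby", "passive", "secondary", "backup", "spare"}
UP_STATES = {"up", "healthy", "ready", "ok", "running", "normal"}


def normalize_role(text):
    """Map vendor role wording to 'active', 'standby', or 'unknown'."""
    word = (text or "").strip().lower()
    if word in ACTIVE_ROLES:
        return "active"
    if word in STANDBY_ROLES:
        return "standby"
    return "unknown"


def is_up(text):
    """State is judged separately from role: only up-style words count as ready."""
    return (text or "").strip().lower() in UP_STATES


def classify_peer(role_text, state_text):
    """Combine the role claim and the health claim into one label."""
    kind = normalize_role(role_text)
    if kind == "unknown":
        return "unknown"
    return f"{kind}-up" if is_up(state_text) else f"{kind}-not-up"
```

Keeping `normalize_role` and `is_up` as separate functions reflects the point made above: a row such as master with state down is still an active claim, just not a healthy one.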
Rule Core:
| Check | Clean condition | Finding condition |
|---|---|---|
| Active ownership | active-up count = expected active-up | A lower or higher count is critical. A higher count can indicate split ownership or duplicate active evidence. |
| Standby readiness | standby-up count >= minimum standby-up | A lower count is critical because the group lacks the required ready standby count. |
| Pair member count | Rows in a group are at least expected active-up + minimum standby-up. | A smaller row set is a warning because a peer may be missing from the pasted evidence. |
| Active state | Every active-style peer is also up-style. | An active peer marked down, degraded, maintenance, missing, or unknown is critical. |
| Standby state | Enough standby-style peers are up-style. | A down standby creates a warning and may also create a critical standby-readiness failure. |
| Sync freshness | Healthy sync text is accepted. Numeric lag is clean when it is less than or equal to Sync stale after. | Out-of-sync wording or numeric lag greater than the selected threshold is a warning. |
| Heartbeat health | Heartbeat values such as up, on, enabled, healthy, true, or yes are clean. | Down, off, disabled, failed, false, or no creates a critical finding. Missing heartbeat is informational rather than failed by itself. |
| Shared service identity | Rows in one group name the same service or VIP value, or leave it consistently unlisted. | Multiple service or VIP values in one group are warnings because the rows may describe different workloads. |
| Priority ownership | When priority checking is enabled, exactly one active-up peer should match the selected higher-value or lower-value convention. | An up peer that outranks the active owner creates a warning. A missing active priority creates an informational finding. |
| Duplicate node evidence | Each node name appears once inside its group. | Repeated node names are warnings because active-up and standby-up counts may be inflated. |
| Preempt consistency | Preempt values are absent or consistent across the group. | Mixed preempt values are informational because they may explain expected or suppressed failback. |
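The count, state, service-identity, and duplicate-node rows of the table can be sketched as one pass over already-normalized peers. This is an illustration under stated assumptions, not the tool's code: `core_findings` and the peer dictionary keys are hypothetical, and blank VIP values are simply ignored rather than checked for consistent omission.

```python
from collections import Counter


def core_findings(peers, expected_active_up=1, min_standby_up=1):
    """peers: dicts with 'node', 'kind' ('active' or 'standby'), 'up' (bool), 'vip' (str)."""
    findings = []
    active_up = sum(1 for p in peers if p["kind"] == "active" and p["up"])
    standby_up = sum(1 for p in peers if p["kind"] == "standby" and p["up"])

    if active_up != expected_active_up:
        findings.append(("critical", "Active ownership", f"{active_up}/{expected_active_up}"))
    if standby_up < min_standby_up:
        findings.append(("critical", "Standby readiness", f"{standby_up}/{min_standby_up}"))
    if len(peers) < expected_active_up + min_standby_up:
        findings.append(("warning", "Pair member count", f"{len(peers)} rows"))

    # Per-peer state checks: an active peer that is not up is critical,
    # a standby peer that is not up is a warning.
    for p in peers:
        if p["kind"] == "active" and not p["up"]:
            findings.append(("critical", "Active state", p["node"]))
        elif p["kind"] == "standby" and not p["up"]:
            findings.append(("warning", "Standby state", p["node"]))

    # Simplification: only non-blank VIP values are compared.
    vips = {p["vip"] for p in peers if p["vip"]}
    if len(vips) > 1:
        findings.append(("warning", "Shared service identity", ", ".join(sorted(vips))))

    for node, seen in Counter(p["node"] for p in peers).items():
        if seen > 1:
            findings.append(("warning", "Duplicate node evidence", node))
    return findings
```

With the defaults of 1 and 1, a pair whose standby is down yields both a critical Standby readiness finding and a warning Standby state finding, matching the note in the table that a down standby can produce both.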
Sync and heartbeat are intentionally judged differently. Sync freshness warns about stale configuration or session state. It affects confidence in a clean handoff, but it does not always mean the current active peer stopped serving traffic. Heartbeat failure is more severe because the peers may not be exchanging the control messages needed to detect failure and coordinate ownership.
| Field | Clean or recognized values | Review or failure values |
|---|---|---|
| Role | active, primary, master, owner, standby, passive, secondary, backup, spare | Blank role, unknown wording, maintenance, suspended, or disabled when active/standby evidence is expected. |
| State | up, healthy, ready, ok, running, normal | down, offline, failed, degraded, maintenance, missing, or unknown. |
| Sync | synced, in sync, current, ok, healthy, or numeric lag within the selected threshold. | stale, out-of-sync, unsynced, lagging, failed, error, or numeric lag above the threshold. |
| Heartbeat | up, on, enabled, healthy, true, yes | down, off, disabled, failed, false, no, or unrecognized wording. |
| Priority | Numeric values can be compared when the priority rule is not ignored. | Blank, non-numeric, or vendor-specific priority text is not used for ranking. |
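The sync and heartbeat rows of this table can be read as two small classifiers. The sketch below uses the word lists from the table and the rule that numeric lag is stale only when it is strictly greater than Sync stale after. The function names and return labels are hypothetical, and how the tool ranks unrecognized wording is an assumption, so it is reported here as a separate `unknown` label.

```python
SYNC_FRESH = {"synced", "in sync", "current", "ok", "healthy"}
SYNC_STALE = {"stale", "out-of-sync", "unsynced", "lagging", "failed", "error"}
HEARTBEAT_UP = {"up", "on", "enabled", "healthy", "true", "yes"}
HEARTBEAT_DOWN = {"down", "off", "disabled", "failed", "false", "no"}


def classify_sync(value, stale_after):
    """Return 'fresh', 'stale', 'unlisted', or 'unknown' for one sync cell."""
    text = (value or "").strip().lower()
    if not text:
        return "unlisted"
    try:
        lag = float(text)
    except ValueError:
        # Not a number, so read the wording by meaning.
        if text in SYNC_FRESH:
            return "fresh"
        if text in SYNC_STALE:
            return "stale"
        return "unknown"
    # Lag equal to the threshold still counts as fresh.
    return "stale" if lag > stale_after else "fresh"


def classify_heartbeat(value):
    """Return 'clean', 'critical', 'info' (missing), or 'unknown' for one heartbeat cell."""
    text = (value or "").strip().lower()
    if not text:
        return "info"  # a missing heartbeat is informational, not a failure by itself
    if text in HEARTBEAT_UP:
        return "clean"
    if text in HEARTBEAT_DOWN:
        return "critical"
    return "unknown"
```

For instance, a lag value of 95 with a threshold of 30 classifies as stale, while a value of exactly 30 classifies as fresh.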
Priority ownership is checked only when it can support a fair conclusion. The group must have exactly one active-up peer, at least one other up peer with numeric priority, and a selected priority convention other than ignore. If the active peer has no numeric priority, the result stays informational. If another up peer outranks the active peer under the selected rule, the result becomes a warning rather than a critical failure because no-preempt designs and manual role pinning can be valid.
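The preconditions in this paragraph can be sketched as a guarded comparison. This is an illustration only: `priority_finding`, the rule names `higher`, `lower`, and `ignore`, and the peer dictionary keys are assumptions, and the peers are expected to arrive already normalized with `priority` set to a number or `None`.

```python
def priority_finding(peers, rule="higher"):
    """peers: dicts with 'node', 'kind', 'up' (bool), 'priority' (float or None).

    Returns None when the check does not apply or passes, otherwise a
    (severity, evidence) tuple.
    """
    if rule == "ignore":
        return None
    active_up = [p for p in peers if p["kind"] == "active" and p["up"]]
    if len(active_up) != 1:
        return None  # ownership is already ambiguous; other rules report that
    owner = active_up[0]
    rivals = [p for p in peers if p is not owner and p["up"] and p["priority"] is not None]
    if not rivals:
        return None  # nothing numeric to compare against
    if owner["priority"] is None:
        return ("info", f"{owner['node']} has no numeric priority")
    if rule == "higher":
        outranking = [p for p in rivals if p["priority"] > owner["priority"]]
    else:  # "lower": an inverted convention where the smaller value should own the service
        outranking = [p for p in rivals if p["priority"] < owner["priority"]]
    if outranking:
        return ("warning", f"{outranking[0]['node']} outranks active {owner['node']}")
    return None
```

The result tops out at a warning by design, since no-preempt configurations and manually pinned roles can legitimately leave a lower-ranked peer active.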
The Priority Role Ladder is a chart of numeric priority rows. It is useful for spotting a high-priority standby, a down preferred peer, or an active owner below another up peer. The formal pass/fail decision still comes from the rule checks and the Failover Findings table.
Limitations and Privacy Notes:
The checker reviews consistency in the rows you provide. It does not contact HA devices, query controllers, confirm traffic ownership, or replace vendor failover procedures.
- Pasted HA rows are evaluated in the browser. There is no server lookup or upload step for the inventory data.
- Review the address bar before sharing a filled page because repeatable checks can carry values in the current URL.
- A clean result does not prove quorum, fencing, route convergence, session replication, shared-storage locking, or application readiness.
- Normalize uncommon vendor wording when the ledger marks role, state, sync, or heartbeat values as unknown.
Worked Examples:
Clean firewall pair
A corp-fw export lists fw-a as active, priority 150, up, synced, heartbeat up, and fw-b as standby, priority 100, up, synced, heartbeat up. Both rows share 10.44.20.1. With Expected active-up set to 1 and Minimum standby-up set to 1, Pair Health Table shows Active-up 1/1, Standby-up 1/1, and a clean verdict. Treat that as a row-consistency pass, then confirm the live device state before the maintenance window.
Split ownership and stale sync
An edge-fw snapshot has two rows marked active and up for the same VIP, with one peer reporting out-of-sync. Pair Health Table reports Active-up 2/1 and a critical verdict. Failover Findings names Active ownership and State synchronization. Resolve the duplicate active evidence first: stale sync is dangerous, but two active owners can affect traffic and shared state immediately.
Standby down with numeric lag
A branch-fw pair has one active-up peer, one standby peer marked down, and sync lag of 95 seconds while Sync stale after is 30. The group shows Standby-up 0/1, Sync 1 stale of 2 reported, and a critical standby-readiness result. The active peer may still be serving traffic, but the pasted evidence says the normal failover path is not ready.
Validation warning from a short export
A one-line paste with only fw-a,active,150,10.44.20.1,up triggers Check HA input rows because at least two node rows are needed for a useful pair check. Add the standby row, or include a header when your export places role, state, sync, and heartbeat columns in a different order. After the validation warning clears, compare the group verdict with the Failover Findings list.
FAQ:
Does a clean result mean failover will work?
No. Clean means the supplied rows agree with the selected active, standby, VIP, sync, heartbeat, priority, and duplicate-row rules. Real failover still depends on the HA control plane, quorum, fencing, routing, replication, storage, and application behavior.
Can I paste rows without a header?
Yes. Without a header, rows are read as node, role, priority, VIP, state, sync, heartbeat, and preempt. Add a header when the export uses another order or includes group, cluster, pair, service, site, or similar grouping columns.
Why do I see Check HA input rows?
That alert appears when the paste is missing required row evidence, such as a node name or role, or when fewer than two node rows are available. Fix the row text first, then review the Pair Health Table and Failover Findings.
Why is priority checking optional?
Priority does not mean the same thing on every HA platform. Use the priority rule when numeric priority should select the owner, and choose ignore priority when roles are pinned manually, preemption is disabled, or the platform uses another ownership rule.
What does the stale-sync threshold do?
Numeric sync lag greater than Sync stale after becomes a warning. Numeric lag equal to or below the threshold is treated as fresh enough. Text values such as synced, stale, lagging, and out-of-sync are interpreted directly.
Why do duplicate node rows matter?
Duplicate rows can inflate active-up or standby-up counts. Check whether they came from repeated exports, interface-specific rows, stale inventory, or truly separate service instances before trusting the group verdict.
Glossary:
- Active: The peer currently claiming ownership of the protected service or virtual address.
- Standby: The peer expected to be ready to take over when the active peer fails, is maintained, or is moved.
- Active-up: A row that is both active-style and up-style, counted against the expected owner count.
- Standby-up: A row that is both standby-style and up-style, counted against the required ready peer count.
- VIP: A virtual IP address or shared service address that should follow the active role.
- Heartbeat: The HA control-path signal used to monitor whether a peer is reachable and participating.
- Preempt: A behavior where a preferred peer may reclaim ownership after it becomes eligible again.
- Split brain: A dangerous condition where more than one peer acts as if it owns the same protected service.