Introduction

Ceph capacity planning is about safe occupancy, not just raw disk totals. The number that matters is the protected data space you can keep at a chosen operating threshold while still leaving room for rebalance and recovery when a host or OSD disappears.

This calculator turns that planning problem into explicit arithmetic. It starts with either one shared node size or a per-node list, subtracts optional overhead and skew, applies a recovery reserve tied to the selected failure domain, and then converts the remaining raw space into usable data through replication or erasure coding.

The result is broader than a single capacity number. You get a summary card, a Capacity Ledger, a Protection Breakdown chart, a Cluster Fit view for headroom and spread checks, a Scheme Matrix that compares common protection profiles on the same raw estate, and JSON, CSV, image, or DOCX exports for review notes.

The calculator is strongest when you are comparing planning choices. It helps answer questions such as whether a six-node replicated pool still has enough reserve at an 85 percent nearfull target, how much usable space a 4+2 erasure-coded layout would return on the same cluster, or how much a few larger nodes distort the recovery margin in a mixed fleet.

It does not inspect a live cluster, a real CRUSH map, or active pool telemetry. Compression, device classes, balancer state, and workload suitability still need separate operational review. All calculations and exports stay in the browser, so the node sizes and thresholds you enter are not sent to a server.

Technical Details

Ceph documents nearfull, backfillfull, and full as ascending thresholds. The calculator keeps that order and starts with the same default ratios many operators recognize: 85 percent, 90 percent, and 95 percent. That matters because rebalancing can stall once OSDs reach backfillfull, and writes can stop when an OSD crosses full.
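As a tiny illustration of that ordering, with the default ratios named above (a sketch in TypeScript, not the calculator's own code):

  // The three default ratios from the text; Ceph expects them to ascend.
  const thresholds = { nearfull: 0.85, backfillfull: 0.90, full: 0.95 };
  console.assert(
    thresholds.nearfull < thresholds.backfillfull && thresholds.backfillfull < thresholds.full,
    'nearfull < backfillfull < full must hold'
  );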

The calculation follows a deliberate order. First it sums the node capacities in the selected unit. Next it removes the share reserved for OSD or metadata overhead and any optional skew penalty that represents uneven fill or mismatched weights. After that it estimates how much raw space must stay free so the loss of the chosen failure domain can be absorbed. Only then does it apply the protection efficiency for the selected profile.

Capacity flow used by the calculator:

  Raw node total (shared size or manual node list)
  → Post-overhead raw (subtract overhead and skew first)
  → Recovery reserve (host, OSD, or custom domain loss target; clamps nearfull if needed)
  → Usable data (after efficiency) + Redundancy + Reserved raw (free room left)
The chart tab visualizes the last three boxes. The headroom clamp happens before protection efficiency is applied, which is why a strict recovery target can lower the usable result even when raw storage looks large.
  Raw post-overhead    = (sum of node capacities Ci) × (1 - o) × (1 - s)
  Recommended nearfull = 1 - H / Raw post-overhead
  Nearfull used        = min(Requested nearfull, Recommended nearfull)
  Usable capacity      = Raw post-overhead × Nearfull used × η

Here o is the overhead share, s the skew penalty, H the recovery reserve, and η the protection efficiency.
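A minimal TypeScript sketch of the same pipeline; the function and parameter names are illustrative, not the calculator's internals:

  // Capacity pipeline sketch: overhead and skew first, then the reserve clamp,
  // then protection efficiency.
  function usableCapacity(
    nodesTB: number[],          // per-node raw capacity in the selected unit
    overhead: number,           // o, e.g. 0.05 for 5 percent
    skew: number,               // s, e.g. 0.05
    reserveTB: number,          // H, from the failure-domain model
    requestedNearfull: number,  // e.g. 0.85
    efficiency: number          // η: 1 / size or k / (k + m)
  ) {
    const raw = nodesTB.reduce((a, b) => a + b, 0);
    const postOverhead = raw * (1 - overhead) * (1 - skew);
    const recommended = 1 - reserveTB / postOverhead;
    const nearfullUsed = Math.min(requestedNearfull, recommended);
    const usable = postOverhead * nearfullUsed * efficiency;
    return { postOverhead, recommended, nearfullUsed, usable };
  }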

The reserve term H changes with the failure domain. In host mode, the calculator sorts node capacities from largest to smallest and reserves enough raw space to absorb the number of host losses you set. In OSD mode, it estimates average OSD size from total raw capacity and OSDs per node. In custom mode, it multiplies your custom capacity block by the failure count and leaves the spread check for manual review.
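A sketch of the three reserve modes as described, under the same naming assumptions (the worked examples later suggest the reserve is taken from the entered node capacities rather than post-overhead values):

  type Domain = 'host' | 'osd' | 'custom';

  function recoveryReserveTB(
    domain: Domain,
    nodesTB: number[],     // entered node capacities
    losses: number,        // domain losses to absorb
    osdsPerNode = 1,       // only used in 'osd' mode
    customBlockTB = 0      // only used in 'custom' mode
  ): number {
    const raw = nodesTB.reduce((a, b) => a + b, 0);
    switch (domain) {
      case 'host':
        // Largest hosts first, so mixed fleets are not averaged away.
        return [...nodesTB]
          .sort((a, b) => b - a)
          .slice(0, losses)
          .reduce((a, b) => a + b, 0);
      case 'osd':
        // Average OSD size from total raw capacity and OSD count.
        return (raw / (nodesTB.length * osdsPerNode)) * losses;
      case 'custom':
        // Spread stays a manual check in this mode.
        return customBlockTB * losses;
    }
  }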

Protection efficiency is simple in the healthy state: replication uses 1 / size, and erasure coding uses k / (k + m). Ceph also uses min_size to decide how far a pool can continue taking I/O in degraded conditions. The calculator exposes that as Degraded write tolerance and Efficiency @ min_size, but it does not treat degraded behavior as the normal usable-capacity answer.
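The same rules as code, mirroring the protection table below:

  // Healthy-state protection arithmetic; names are illustrative.
  function replicationProfile(size: number, minSize: number) {
    return {
      efficiency: 1 / size,                    // healthy efficiency
      healthyFaultTolerance: size - 1,         // domains lost while fully protected
      degradedWriteTolerance: size - minSize,  // missing domains before writes stop
      minimumSpread: size,                     // distinct domains required
    };
  }

  function erasureProfile(k: number, m: number, minSize: number) {
    return {
      efficiency: k / (k + m),
      healthyFaultTolerance: m,
      degradedWriteTolerance: k + m - minSize,
      minimumSpread: k + m,
    };
  }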

Protection rules used by the calculator

  Mode           | Healthy efficiency | Healthy fault tolerance | Degraded-write cue                                                         | Minimum spread
  Replication    | 1 / size           | size - 1 domains        | size - min_size domains can be missing while writes still meet the minimum | size distinct domains
  Erasure coding | k / (k + m)        | m domains               | (k + m) - min_size missing shards before writes fall below the minimum     | k + m distinct domains
Failure reserve methods in the calculator

  Failure domain        | How reserve is estimated                                     | What to verify outside the calculator
  Host (node)           | Reserves the largest node capacities first                   | That the pool really spreads replicas or shards across hosts the way you expect
  OSD                   | Uses average OSD size from total raw capacity and OSD count  | Real OSD size mix, device-class layout, and uneven fill on the fullest devices
  Custom capacity block | Multiplies the entered block size by the loss target         | That the CRUSH rule actually gives you enough distinct domains for the selected profile

Everyday Use & Decision Guide

Start by matching the node model to the hardware you really have. Uniform nodes are fine for first-pass sizing, but mixed clusters deserve manual per-node entry because one or two larger hosts can dominate the reserve calculation. The calculator uses the largest node capacities first when the failure domain is set to host, so heterogeneity is not averaged away.

The preset selector is useful when you want to compare common Ceph protection profiles quickly. Replication 2x, 3x, and 4x make the cost of full copies obvious. The EC presets let you compare space efficiency directly, and the Custom option lets you set your own size, min_size, k, and m values when the built-in rows are close but not exact.

Failure domain choice changes the story more than many raw-capacity spreadsheets admit. A host-level reserve is a sensible default for many pool designs, but an OSD-level model can be useful for tight experiments on dense nodes, and the custom block is there when your real blast radius is better expressed as a capacity chunk than as a whole host. If you use custom mode, treat the spread warning seriously because the calculator cannot prove your CRUSH rule for you.

The Accept degraded PGs switch is an emergency planning knob, not a normal sizing mode. When it is off, the calculator clamps the requested nearfull target to the recovery-safe recommendation. When it is on, the requested target stays in place even if that leaves less room for backfill or recovery than the reserve model calls for. That is useful for short-term what-if work, but it is a weak baseline for procurement or steady-state planning.
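The switch reduces to one line; a sketch of the described behavior:

  // Keep the requested target, or clamp it to the recovery-safe recommendation.
  function effectiveNearfull(
    requested: number,
    recommended: number,
    acceptDegradedPGs: boolean
  ): number {
    return acceptDegradedPGs ? requested : Math.min(requested, recommended);
  }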

Use the tabs in sequence. The ledger tells you where the capacity went. The chart makes redundancy versus reserved raw easy to see. Cluster Fit explains whether headroom, protection, spread, and PG posture line up with the fault target. Scheme Matrix keeps the raw cluster fixed and shows which built-in protection profiles look stronger or weaker under the same guardrails. The exports are there when the conversation needs to move into a design review, change ticket, or capacity note.

Step-by-Step Guide

  1. Enter the node count and choose either one shared node capacity or manual per-node capacities. Manual mode is the honest choice for mixed clusters and staged expansions.
  2. Pick a protection preset or switch to custom details. Replication uses size and min_size. Erasure coding uses k, m, and min_size.
  3. Set the failure domain and the number of domain losses you want the cluster to absorb. That choice drives the reserve model and the minimum spread check.
  4. Adjust the nearfull target, then open Advanced for OSDs per node, backfillfull, full, target PG per OSD, pool count, overhead, skew, and the degraded-PG override.
  5. Read the summary and Cluster Fit cards together. If Nearfull used falls below the target you entered, the safety clamp is active. If spread or healthy protection fails, raw capacity alone will not fix the whole problem.
  6. Use Capacity Ledger for exact figures, Protection Breakdown for the raw-space split, Scheme Matrix for profile comparison, and JSON or document exports when you need to pass the scenario to someone else.

Interpreting Results

The first thing to read is the summary title. Recovery-Safe Usable Capacity means the current profile, spread, and reserve target line up cleanly. Projected Usable Capacity usually means the calculator can still produce a planning number, but one or more guardrails need attention. Usable Capacity With Topology Risk means the selected protection profile does not fit the available spread or the requested fault target while healthy.

The next trust check is the trio of Requested nearfull, Nearfull used, and Recommended nearfull. If the first number is higher than the second, the reserve model has stepped in and lowered the operating threshold. That is not lost capacity caused by a bug. It is the cost of leaving room for recovery.

Ceph's own troubleshooting guidance warns that pool headroom is constrained by the most full OSD, not by cluster average use. That is why the skew control matters and why mixed-node plans often look tighter than a quick raw-total spreadsheet suggests. A large cluster can still be brittle if one part of the placement map runs hotter than the rest.

How to read the main outputs

  Output                                      | What it means                                                                        | How to use it
  Usable capacity                             | Protected data space at the effective nearfull threshold                            | Use it for sizing only after the spread and headroom checks look healthy
  Headroom to backfillfull                    | Raw space left before rebalance space becomes tight                                 | If this is thin, the cluster may look large on paper but still struggle during recovery
  Healthy fault tolerance                     | How many domains the selected profile can lose while staying fully protected        | Compare it directly with the failure target before trusting any usable number
  Degraded write tolerance                    | How many domains can be missing before writes fall below min_size                   | Read it as a degraded-state cue, not as a healthy-state promise
  Minimum spread and Available spread         | The distinct domains the profile needs versus the ones the selected model can offer | If spread fails, change topology or protection profile before arguing about raw capacity
  Suggested PG per pool and PG shards per OSD | A planning estimate based on OSD count, target PGs, pool count, and shard fan-out   | Use it to compare scenarios, then confirm the final answer with Ceph autoscaler and pool design (see the sketch after this table)
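The article does not spell out the PG formula, so the sketch below assumes the conventional pgcalc-style estimate; treat the exact arithmetic as a placeholder and confirm with the Ceph autoscaler:

  // ASSUMED pgcalc-style estimate, not taken from the calculator's code.
  function suggestedPgPerPool(
    osdCount: number,
    targetPgPerOsd: number,  // e.g. 100
    poolCount: number,
    shardFanout: number      // size for replication, k + m for erasure coding
  ): number {
    const rawPgs = (osdCount * targetPgPerOsd) / (shardFanout * poolCount);
    // Round to a power of two, following the common pgcalc convention.
    return 2 ** Math.round(Math.log2(Math.max(rawPgs, 1)));
  }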

The Protection Breakdown chart separates usable data, redundancy, and reserved raw space. That makes it easier to see whether the headline loss is caused by protection overhead or by space deliberately left free for safe operation. Scheme Matrix answers a different question: if the raw estate and nearfull policy stay fixed, which standard protection profile produces a better fit or more usable capacity?

One more boundary matters. Higher usable capacity is not always the better operational choice. Ceph documentation notes that erasure-coded pools often need more failure domains and can bring workload limits that pure capacity math does not cover. The calculator helps you narrow the options. It does not certify that every workload or service should use the most space-efficient row in the matrix.

Worked Examples

Default six-node replication plan

Start with six 10 TB nodes, replication size 3, one host loss to tolerate, and the default 85 percent nearfull target. Raw storage is 60 TB. Because the reserve model keeps one 10 TB host in hand, the recommended nearfull falls to about 83.3 percent. The calculator therefore uses 50 TB of raw space, reports roughly 16.67 TB of usable protected data, 33.33 TB of redundancy, and 10 TB of reserved raw headroom.
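The arithmetic behind those numbers, spelled out:

  Raw total            = 6 × 10 TB = 60 TB
  Recovery reserve H   = one largest host = 10 TB
  Recommended nearfull = 1 - 10 / 60 ≈ 83.3 %
  Raw space used       = 60 × 0.8333 = 50 TB
  Usable (size 3)      = 50 × 1/3 ≈ 16.67 TB
  Redundancy           = 50 - 16.67 = 33.33 TB
  Reserved raw         = 60 - 50 = 10 TB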

Same cluster, different protection profile

Keep the same six nodes and the same one-host reserve target, then compare the current replicated profile with EC 4+2 in Scheme Matrix. The raw post-overhead estate and nearfull policy do not change. What changes is efficiency. At the same 83.3 percent effective nearfull, EC 4+2 yields about 33.33 TB of usable data instead of 16.67 TB. The tradeoff is not hidden: the profile needs six distinct placement domains and should still be judged against workload fit outside the calculator.
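Only the efficiency term changes:

  Raw space used (unchanged) = 50 TB
  EC 4+2 efficiency          = 4 / (4 + 2) = 2/3
  Usable                     = 50 × 2/3 ≈ 33.33 TB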

Mixed-node cluster with overhead and skew

Now enter four nodes at 12, 12, 8, and 8 TB, keep replication size 3, set one host loss to tolerate, and add 5 percent overhead plus 5 percent skew. Raw post-overhead space drops to about 36.10 TB. The reserve still has to cover the largest 12 TB host, so recommended nearfull falls to roughly 66.8 percent. Usable protected data lands near 8.03 TB. In this case the small answer is the warning: the cluster shape, not the replication formula, is the real limit.
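Tracing the same pipeline:

  Raw total             = 12 + 12 + 8 + 8 = 40 TB
  Raw post-overhead     = 40 × (1 - 0.05) × (1 - 0.05) ≈ 36.10 TB
  Recovery reserve H    = largest host as entered = 12 TB
  Recommended nearfull  = 1 - 12 / 36.10 ≈ 66.8 %
  Usable (size 3)       = 36.10 × 0.668 × 1/3 ≈ 8.03 TB

Note that the reserve uses the entered 12 TB figure rather than its post-overhead value; that is what pulls the recommended nearfull down to roughly 66.8 percent.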

FAQ

Why does Nearfull used differ from the target slider?

The slider sets the requested operating point. If the degraded override is off, the calculator compares that request with the recovery-safe recommendation derived from the reserve model and uses the lower number.

What does min_size change in this calculator?

min_size does not change healthy-state efficiency. It changes the degraded-write boundary. That is why the calculator shows Degraded write tolerance and Efficiency @ min_size as separate cues instead of folding them into the main usable result.

Why can spread fail even when there is plenty of raw storage?

Because protection profiles need distinct placement domains, not just bytes. A profile can be spacious on paper and still be invalid if the selected host, OSD, or custom-domain model cannot provide enough distinct targets.

Does Scheme Matrix change my node inventory or thresholds?

No. It reuses the same raw post-overhead estate and current nearfull policy, then compares standard protection profiles against the same fault target and spread rules.

Is any cluster information uploaded?

No. The calculations, chart rendering, and exports stay in the browser.

Glossary

Nearfull
The operating fill threshold used for the main usable-capacity estimate.
Backfillfull
A higher threshold where rebalance and recovery room becomes tight.
Full
The threshold where writes can stop until raw capacity is freed.
Failure domain
The unit the reserve model is protecting against, such as a host, an OSD, or a custom capacity block.
min_size
The minimum replica or shard count required for I/O in degraded conditions.
Placement group (PG)
Ceph's data-placement grouping unit, used here for rough per-pool and per-OSD planning.
Skew
An optional penalty that reduces effective raw capacity to account for uneven fill or mismatched weights.