Ceph capacity planning is mostly a question of how much raw disk you can safely consume before recovery work becomes difficult. A cluster can look large on paper and still run short on practical headroom once replication, erasure coding, and failure recovery are accounted for.
This calculator turns those tradeoffs into a planning estimate. It starts with the raw capacity of your nodes, applies optional overhead and skew penalties, then shows how much protected data space remains at the nearfull level you can actually run. The result is paired with redundancy cost, reserved raw headroom, and a small set of guardrail metrics that help you judge whether the cluster still has room to rebalance.
A common sizing question is whether six 10 TB nodes with replication size 3 are enough, or whether a 4+2 erasure-coded layout would carry the same workload more efficiently. Another is whether a cluster that must survive the loss of one host can still tolerate an 85 percent nearfull target without crowding recovery. Those are exactly the comparisons this tool is built to make.
The model also helps when node sizes are uneven. You can enter each node manually, choose the failure domain you care about, and see how much raw reserve must stay unused if you want room to absorb one or more host or OSD losses. That matters because the safe fill point is not just an arbitrary threshold. It determines whether the cluster still has space to reshuffle data when something fails.
What the estimate does not mean is that the cluster will behave exactly this way under live load. The package does not inspect a real CRUSH map, pool mix, device class layout, compression ratio, or current placement imbalance. It is a planning calculator, not a replacement for live Ceph telemetry. As shipped, it runs entirely in the browser, so the capacities and assumptions you enter stay on the local device.
For a first pass, enter the total node count, keep the shared per-node capacity if the hardware is uniform, and choose the protection method that matches the pool you are planning. Replication is the faster sanity check because the efficiency is easy to read. Erasure coding becomes more interesting when you want to compare how much additional usable data you gain once the same raw cluster is protected with k+m instead of full copies.
If the hardware is mixed, switch on Define each node manually (heterogeneous) before trusting the result. That moves the calculation away from a single average node size and lets the headroom model treat the largest host as the most expensive failure when the domain is Host (node). In uneven clusters that change can move the recommended nearfull point much lower than the slider value you first chose.
- Match Failure domain for headroom to the way the pool is actually placed. A host-level rule and an OSD-level rule do not imply the same reserve requirement.
- Treat Accept degraded PGs as an emergency what-if. It deliberately lets Nearfull used exceed the recovery-safe recommendation, which can be useful for short-term ingest planning but weak for baseline sizing.
- Check Headroom to backfillfull and Headroom to full before focusing on the large usable-capacity number. A strong protected-capacity figure is not enough if the recovery window is tiny.
- Use Suggested PG per pool as a planning cue, then compare it with your real pool layout and Ceph autoscaler choices instead of treating it as a mandatory final setting.

The most common misread is to assume that the slider value for OSD nearfull ratio is always the value used in the estimate. It is not. When degraded writes are not allowed, the tool can clamp the effective nearfull value down to the safer recommendation implied by your failure-domain reserve. The next step is simple: compare Recommended nearfull with Nearfull used, then decide whether to add raw capacity, lower the target, or relax the failure assumption.
The calculator works from four layers of arithmetic. First it sums the node capacities. Next it reduces that raw total by any OSD or metadata overhead and by the optional CRUSH imbalance skew penalty. Then it multiplies the remaining raw capacity by the effective nearfull fraction, which is either the operator's requested nearfull value or a lower recovery-safe value if the safety clamp is active. Only after that does it apply the protection efficiency for replication or erasure coding.
That ordering matters. A replication size of 3 turns one unit of protected data into three units of consumed raw space, so the healthy-state efficiency is 1/3. A 4+2 erasure-coded profile keeps four data chunks and two coding chunks, so the healthy-state efficiency is 4/(4+2), or about 66.7 percent. The package also shows Efficiency @ min_size because degraded write thresholds can temporarily change how much of the written raw space still represents durable data. That field is a guardrail for degraded behavior, not the normal usable-capacity result.
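The four-layer ordering can be sketched in a few lines. This is an illustrative reconstruction, not the package's actual code; the function and variable names are assumptions made for the example:

```python
def usable_capacity(raw_tb, overhead, skew, nearfull, efficiency):
    """Apply the four layers in the order described above:
    overhead, then skew, then the effective nearfull fraction,
    then protection efficiency."""
    effective_raw = raw_tb * (1 - overhead) * (1 - skew)
    raw_at_nearfull = effective_raw * nearfull
    return raw_at_nearfull * efficiency

# Healthy-state efficiencies for the two protection modes.
replication_3 = 1 / 3          # size=3: one third of consumed raw is data
erasure_4_2 = 4 / (4 + 2)      # k=4, m=2: about 66.7 percent

print(round(usable_capacity(60, 0.0, 0.0, 0.85, replication_3), 2))  # 17.0
```

Because efficiency is applied last, the same raw cluster protected with 4+2 erasure coding would hold twice as much protected data as replication size 3 at the same nearfull level.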
The failure-domain reserve is intentionally simple and conservative. If you choose Host (node), the tool sorts node capacities from largest to smallest and reserves enough raw space to cover the number of host failures you entered. If you choose OSD, it estimates average OSD size from the node count and OSDs-per-node figure. If you choose Custom capacity, it multiplies the custom domain size by the number of domain failures to tolerate. That reserve is converted into a recommended nearfull value, and the package enforces the Ceph ordering rule that nearfull < backfillfull < full.
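For the Host (node) domain, the reserve-to-recommendation step can be sketched as below. This is a plausible reconstruction consistent with the worked heterogeneous example later in the document, not the package's actual implementation:

```python
def recommended_nearfull(node_tb, host_failures, overhead=0.0, skew=0.0):
    """Reserve raw space for the largest `host_failures` hosts, then
    express the remainder as a fraction of post-overhead, post-skew
    capacity. Illustrative sketch of the Host (node) domain only."""
    effective_raw = sum(node_tb) * (1 - overhead) * (1 - skew)
    # Sort largest-first so the most expensive failures are covered.
    reserve = sum(sorted(node_tb, reverse=True)[:host_failures])
    return max(0.0, (effective_raw - reserve) / effective_raw)

# Four uneven hosts, one tolerated host failure, 5% overhead, 5% skew.
print(round(recommended_nearfull([12, 12, 8, 8], 1, 0.05, 0.05), 3))  # 0.668
```

The package then enforces nearfull < backfillfull < full on top of this recommendation, so the final value can be pulled lower still by the other two thresholds.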
| Term | Meaning in this tool | Source |
|---|---|---|
| C_i | Per-node capacity, either repeated from the shared node figure or read from the manual node list | Input |
| o | OSD or metadata overhead fraction | Input |
| s | Skew penalty for uneven CRUSH weights or heterogeneous fill behavior | Input |
| f_eff | Effective nearfull fraction after optional safety clamping | Derived |
| η | Protection efficiency: 1/r for replication, k/(k+m) for erasure coding | Derived |
| Suggested PG per pool | Power-of-two estimate from OSD count, target PG/OSD, pool count, and the replication or erasure-coding shard fan-out | Derived |
The PG estimate is intentionally approximate. The package calculates a target based on Estimated OSDs, PG/OSD, Data pools, and the data-shard factor for the chosen protection mode, then rounds to the nearest power of two. The companion field Estimated PG shards per OSD shows what that rounded value turns into after replication or erasure-coding fan-out. This is useful for comparing scenarios, but it still needs to be checked against the real autoscaler, pool weights, and hardware limits in a live cluster.
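The rounding step described above can be sketched as follows; the function name and the exact form of the target expression are assumptions for illustration, not the package's verified formula:

```python
import math

def suggested_pg_per_pool(osds, pg_per_osd, pools, shard_factor):
    """Round (osds * pg_per_osd) / (pools * shard_factor) to the nearest
    power of two. shard_factor is the protection fan-out: size for
    replication, k+m shards for erasure coding. Illustrative sketch."""
    target = osds * pg_per_osd / (pools * shard_factor)
    if target < 1:
        return 1
    lower = 2 ** math.floor(math.log2(target))
    upper = lower * 2
    # Pick whichever power of two is closer to the raw target.
    return lower if target - lower <= upper - target else upper

# 24 OSDs, 100 PG/OSD target, one data pool, replication size 3.
print(suggested_pg_per_pool(24, 100, 1, 3))  # 1024
```

As the text notes, this is a comparison aid; the real autoscaler and pool weights decide the final value.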
All calculations are deterministic and local to the page. There is no lambda.mjs helper in this package, so the browser computes the table, donut chart, and JSON payload without sending cluster assumptions to a server.
Use the calculator as a sizing worksheet: enter the cluster shape first, then add protection and headroom assumptions until the guardrail outputs match the failure scenario you need to survive.
1. Enter Total nodes and either Capacity per node or Define each node manually (heterogeneous). The summary becomes meaningful only after the node list produces a positive raw total.
2. Choose Data protection. Use Rep when you want copy-based protection with size and min, or switch to EC when you want k, m, and min for erasure coding.
3. Set the OSD nearfull ratio slider to the operating target you want to test, then open Advanced to set Backfillfull ratio, Full ratio, Failure domain for headroom, Domains to tolerate (fail), OSD/metadata overhead, and CRUSH imbalance skew.
4. Read the Capacity Overview table. If Nearfull used comes out lower than the target you entered, the safety clamp is active. Lower the plan target, add more raw capacity, or revise the failure assumption before moving on.
5. Review Protection Breakdown for the reserved-versus-redundancy-versus-data split, then check Ceph Guardrails for Headroom to backfillfull, Headroom to full, Suggested PG per pool, and Estimated PG shards per OSD.
6. Copy JSON if you need a structured copy of the inputs and outputs for another worksheet or review note.

The final outcome to carry forward is not just Projected Usable Capacity, but the combination of usable capacity, effective nearfull, and recovery headroom. Projected Usable Capacity is the amount of protected data the model says you can hold at the effective nearfull level after overhead, skew, and protection costs have all been applied. It is not the same as total raw disk, and it is not a promise that every pool in a real cluster can consume that amount evenly.
The most important trust check is the relationship between the fill thresholds and the reserve outputs. A large usable figure can still be weak if Headroom to backfillfull is narrow or if Nearfull used has already been pushed above the recommendation through the degraded override. That combination means the plan leaves little room for recovery traffic before another threshold is reached.
- Protected efficiency describes healthy-state raw-to-data conversion for the selected protection mode.
- Efficiency @ min_size is a degraded-write reference, not a second usable-capacity answer.
- Recommended nearfull is driven by the failure-domain reserve model. If it is much lower than your target, the cluster shape is telling you the recovery cushion is thin.
- Suggested PG per pool helps compare scenarios, but a final Ceph design still needs pool-specific review and autoscaler context.

With the default six nodes at 10 TB each, replication size 3, zero overhead, zero skew, and an 85 percent nearfull target, the math is straightforward. Raw storage is 60 TB. The model uses 51 TB of that raw space at nearfull, converts it through replication efficiency of one third, and lands on 17.00 TB of usable protected data. The remaining 34 TB is redundancy cost and 9 TB stays reserved raw headroom.
Consider four heterogeneous nodes at 12, 12, 8, and 8 TB with replication size 3, 5 percent overhead, 5 percent skew, and a requirement to survive one host failure. Raw post-overhead capacity falls to about 36.10 TB. Because the largest host contributes 12 TB, the recommended nearfull level drops to roughly 66.8 percent. If you leave OSD nearfull ratio at 85 percent and keep degraded override off, the package will use the lower 66.8 percent value and usable protected data falls to about 8.03 TB.
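Both worked examples can be reproduced with a few lines of arithmetic. The sketch below uses illustrative names, not the package's code:

```python
def plan(node_tb, overhead, skew, nearfull, efficiency):
    """Overhead and skew first, then nearfull, then protection."""
    effective_raw = sum(node_tb) * (1 - overhead) * (1 - skew)
    return effective_raw * nearfull * efficiency

# Uniform case: six 10 TB nodes, size=3, 85 percent nearfull.
print(round(plan([10] * 6, 0.0, 0.0, 0.85, 1 / 3), 2))    # 17.0

# Heterogeneous case: reserve the largest host (12 TB raw) against the
# post-overhead capacity, which clamps nearfull to about 66.8 percent.
nodes = [12, 12, 8, 8]
effective_raw = sum(nodes) * 0.95 * 0.95                  # ~36.10 TB
clamped = (effective_raw - max(nodes)) / effective_raw    # ~0.668
print(round(plan(nodes, 0.05, 0.05, clamped, 1 / 3), 2))  # 8.03
```

The second print shows why the heterogeneous answer shrinks: the clamp, not the protection mode, removes the extra headroom.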
That second case is the troubleshooting pattern worth remembering. When the result looks smaller than expected, first check whether the cluster shape or failure-domain rule has forced the effective nearfull point down. If it has, the smaller answer is the warning, not the bug.
Why does Nearfull used not match the nearfull slider? The slider sets your requested operating target. If Accept degraded PGs is off, the package compares that target with the recovery-safe recommendation implied by the chosen failure domain and tolerated failures, then uses the lower of the two values.
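That lower-of-two rule reduces to a one-line comparison. The function name and flag below are illustrative, not the package's identifiers:

```python
def effective_nearfull(requested, recommended, accept_degraded=False):
    """Use the requested target only when degraded PGs are accepted;
    otherwise clamp to the recovery-safe recommendation."""
    return requested if accept_degraded else min(requested, recommended)

print(effective_nearfull(0.85, 0.668))                        # 0.668
print(effective_nearfull(0.85, 0.668, accept_degraded=True))  # 0.85
```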
When should you choose replication versus erasure coding? Use replication when you want a simple copy-based pool model or when rebuild simplicity matters more than raw efficiency. Use erasure coding when you want to compare how much additional usable data you gain from k+m protection, then validate the operational tradeoff in the real Ceph design.
Does Suggested PG per pool replace Ceph autoscaling decisions? No. It is a planning estimate derived from OSD count, target PG/OSD, pool count, and protection fan-out. It is useful for sizing conversations, but Ceph autoscaler behavior and pool-specific weight distribution still need separate review.
Does the calculator send anything to a server? No. This package has no server-side helper. The browser computes the estimates, chart, and JSON payload locally from the numbers you enter.