Kubernetes Cluster Capacity Calculator
Plan a Kubernetes replica ceiling from node allocatable resources, pod requests, surge, PDB, failure drills, zone loss, and pod IP limits.
Introduction:
Kubernetes capacity planning starts with what the scheduler can reserve, not with the largest number on a cloud invoice or a dashboard average. Pods ask for CPU, memory, and sometimes local ephemeral storage. Nodes expose allocatable resources after system and kubelet reservations. A workload stops fitting as soon as one required dimension runs out, even when the other dimensions still look generous.
Replica capacity also changes under rollout and failure pressure. A Deployment may create surge pods while new replicas come up. A PodDisruptionBudget (PDB) can require a minimum number of pods to remain available during drains. A multi-zone pool can pass a healthy-cluster check but fail the more useful question: can the same target survive the loss of the largest zone or a selected number of worker nodes?
Capacity work usually starts with four terms. Allocatable is what a node exposes to pods after system and kubelet reservations. Requests are the CPU, memory, and storage amounts pods ask the scheduler to reserve, not the usage graphs seen after the pod is running. Overhead includes DaemonSet pods and repeated runtime, sidecar, or pod overhead that lands with every application pod. Policy envelope means the reduced capacity after surge, disruption, and zone-loss assumptions are applied.
| Planning Question | What Usually Decides It | Common Misread |
|---|---|---|
| Will the healthy pool fit steady replicas? | The smallest ceiling across CPU requests, memory requests, pod slots, explicit storage requests, and optional pod IP budget. | Looking only at total CPU or observed usage. |
| Will a rolling update fit? | Desired replicas plus temporary surge pods. | Approving a steady replica count that has no room for maxSurge. |
| Will drains or node loss fit? | The same ceilings recalculated with fewer active nodes. | Testing only the all-nodes-healthy cluster. |
| Will availability policy fit? | The PDB floor and the smallest modeled failure envelope. | Assuming maxUnavailable alone controls disruption behavior. |
The common mistake is to divide total cluster CPU by average CPU usage and call the answer a replica limit. That misses memory, pod slots, container network interface (CNI) address limits, platform pods, update peaks, and disruption rules. Another mistake is to plan from healthy-cluster capacity only. The number that is safe to publish is usually lower than the number that fits before a rollout or failure.
No planning number can replace a real scheduler event, a staging rollout, or provider-specific CNI documentation. It can still narrow the review: if the first limiter is memory, changing maxPods will not help; if the first limiter is pod IP budget, adding CPU does not create addresses; if the largest-zone outage binds first, spreading nodes more evenly may matter as much as buying more total capacity.
How to Use This Tool:
Describe one schedulable worker pool and one average workload shape, then let the calculator reduce that pool through resource ceilings and availability policy.
- Start with Preset if one matches the workload class, such as Stateless Web Service, Service Mesh / API Platform, Batch / Worker Queue, or Stateful Edge / Cache Pods. Switch to Custom (keep manual values) before changing the seeded pod shape or overhead values.
- Enter Worker nodes, Allocatable CPU per node, Allocatable memory per node, and Max pods per node from Ready nodes that can actually run the workload. Use node allocatable values, not instance vCPU or raw machine memory.
- Set Average pod CPU request and Average pod memory request from the workload manifests or admission-time defaults. Observed usage belongs in a right-sizing exercise, not in this request-based capacity calculation.
- Adjust the visible planning controls: Reserve buffer, Target utilization, Rolling update maxSurge, PDB minAvailable target, Node failures to tolerate, and Desired safe replicas. Watch the summary for Safe replica ceiling, the status badge, and the named limiter.
- Open Advanced for storage, zone, DaemonSet, pod overhead, topology spread, maxUnavailable, subnet/IP, failure-envelope, fragmentation, and precision settings. Use Usable pod-subnet IPs only when pod addresses are a known capacity constraint.
- Review Capacity Metrics, Capacity Constraints, Scheduler Ceiling Stack, Failure Envelope, and Failure Capacity Curve before changing a rollout target. If Desired safe replicas is set, use Scale Path and Capacity Levers to compare adding nodes against request, surge, PDB, and packing changes.
- Fix input warnings before treating the number as review-ready. The calculator may clamp values such as zones above node count, failed nodes that leave no node alive, unsupported presets, or DaemonSet pods above Max pods per node.
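Node allocatable values are usually reported as Kubernetes quantity strings (for example under a Node's status.allocatable) rather than as the plain vCPU and GiB numbers the input fields expect. The following is a minimal Python sketch of that conversion. The function names and the sample strings are hypothetical, and the parser only covers plain numbers and the common suffixes.

```python
from decimal import Decimal

# Binary (power-of-two) and decimal suffixes used by Kubernetes quantities.
BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60}
DECIMAL = {"m": Decimal("0.001"), "k": 10**3, "M": 10**6, "G": 10**9,
           "T": 10**12, "P": 10**15, "E": 10**18}

def parse_quantity(text: str) -> Decimal:
    """Convert a quantity string such as '13450m' or '52Gi' to base units."""
    text = text.strip()
    if text[-2:] in BINARY:
        return Decimal(text[:-2]) * BINARY[text[-2:]]
    if text[-1:] in DECIMAL:
        return Decimal(text[:-1]) * DECIMAL[text[-1:]]
    return Decimal(text)  # plain number, no suffix

def cpu_cores(text: str) -> float:
    """CPU quantity expressed in cores (vCPU)."""
    return float(parse_quantity(text))

def memory_gib(text: str) -> float:
    """Memory quantity expressed in GiB."""
    return float(parse_quantity(text) / 2**30)

# Hypothetical allocatable strings for one node.
print(cpu_cores("13450m"))         # 13.45
print(memory_gib("54525952Ki"))    # 52.0
print(int(parse_quantity("110")))  # 110
```

Decimal is used so that milli-CPU values such as 13450m convert without binary floating-point noise before they are typed into the form.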
Interpreting Results:
Recommended replicas (safe target) is the planning ceiling to compare with a proposed Deployment replica count. It is the smallest value left after the healthy baseline, selected node-loss drill, largest-zone outage when applicable, rollout surge, and PDB floor are all checked. Binding policy check names the availability rule that capped the final target.
Limiting dimension explains which healthy-cluster ceiling is tightest before policy is applied. A CPU limiter points toward request tuning or more CPU runway. A memory limiter points toward memory requests or node memory. A pod-slot limiter points toward Max pods per node, DaemonSet pod count, or CNI density. A pod IP limiter means the optional subnet guardrail is tighter than scheduler resources.
- Trust Failure Envelope when approving a target for drains or node-loss drills, because it compares rollout peak and PDB floor against smaller active-node counts.
- Use Scheduler Ceiling Stack to avoid tuning the wrong quantity. Raising CPU does not improve a memory, slot, storage, or IP-bound plan.
- Do not treat a green badge as proof that Kubernetes will schedule every future pod. Affinity, taints, actual topology spread, autoscaler timing, image pull pressure, and provider CNI rules can still reduce usable capacity.
- Use Input Snapshot or Copy link for review handoff, but avoid sharing sensitive cluster names or private notes in URL-carried context.
Technical Details:
Kubernetes scheduling capacity begins with Node allocatable resources. Allocatable CPU and memory already account for the node's system and kube reservations. Application capacity then loses more room to DaemonSet pods, DaemonSet requests, extra per-pod overhead, reserve buffers, target utilization, topology spread allowance, and fragmentation allowance. Pod slots follow a count-based path because maxPods is not divided by CPU or memory.
Every active dimension produces a pod ceiling. CPU, memory, and pod slots are always active. Ephemeral storage becomes a hard ceiling only when the effective per-pod ephemeral-storage request is above zero. The pod IP budget is optional and conservative: when enabled, usable pod-subnet IPs are reduced by worker nodes, upgrade-surge nodes, and system pod IP reserve before workload pods are counted.
Formula Core:
The healthy-cluster ceiling is the minimum of the active resource ceilings after overhead and planning factors are applied.
Rollout and failure checks convert steady capacity into a safe target. maxSurge adds temporary pods on top of the target during a rollout, so the target must be low enough that the rollout peak still fits; the higher the surge, the further the safe target sits below steady capacity. Node-loss and zone-loss checks repeat the same resource calculation with fewer active nodes.
| Ceiling | Quantity used | When it binds |
|---|---|---|
| CPU requests | Allocatable CPU minus DaemonSet CPU, multiplied by planning factors and divided by effective pod CPU request. | CPU is the smallest pod ceiling after flooring. |
| Memory requests | Allocatable memory minus DaemonSet memory, multiplied by planning factors and divided by effective pod memory request. | Memory is the smallest pod ceiling after flooring. |
| Pod slots | Max pods per node minus DaemonSet pods, multiplied by active nodes and count-based planning factors. | Small pods hit pod density before CPU or memory. |
| Ephemeral storage | Allocatable ephemeral storage minus DaemonSet storage, divided by effective per-pod ephemeral request. | Only active when the effective per-pod ephemeral request is greater than 0 GiB. |
| Pod IP budget | Usable pod-subnet IPs minus node, upgrade-surge-node, and system pod IP reservations. | Only active when a positive usable IP count is entered. |
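The ceiling table can be read as "floor each active dimension, then take the minimum". Below is a minimal Python sketch of that selection, not the calculator's own code. The function names are hypothetical, the CPU figures come from the service-mesh example in this section, and the memory and pod-slot numbers are made up for illustration.

```python
import math
from typing import Dict, Optional, Tuple

def pod_ceiling(budget: float, per_pod_request: float) -> int:
    """Whole pods that fit in a budget; the epsilon guards float round-off."""
    return math.floor(budget / per_pod_request + 1e-9)

def healthy_ceiling(ceilings: Dict[str, Optional[int]]) -> Tuple[str, int]:
    """Return (limiting dimension, pods). None marks an inactive dimension."""
    active = {name: pods for name, pods in ceilings.items() if pods is not None}
    limiter = min(active, key=active.get)
    return limiter, active[limiter]

ceilings = {
    "CPU requests": pod_ceiling(81.43, 0.35),    # budget and request from the text
    "Memory requests": pod_ceiling(300.0, 0.7),  # hypothetical memory budget
    "Pod slots": 600,                            # hypothetical slot ceiling
    "Ephemeral storage": None,                   # inactive: 0 GiB per-pod request
    "Pod IP budget": None,                       # inactive: no usable IP count entered
}

print(healthy_ceiling(ceilings))  # ('CPU requests', 232)
```

Marking optional dimensions as None rather than as a very large number keeps them from ever being reported as the limiter.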
| Policy check | Boundary rule | Result effect |
|---|---|---|
| Healthy baseline | rollout peak <= healthy steady capacity | Caps the target before failure assumptions. |
| Selected node-loss drill | rollout peak <= post-loss steady capacity | Reduces the target when the chosen failed-node count removes too much capacity. |
| Largest-zone outage | rollout peak <= capacity after losing ceil(nodes / zones) | Applies only when more than one availability zone is selected. |
| PDB floor | ceil(target × PDB%) <= smallest failure capacity | Can cap the target even when surge peak fits. |
| Goal search | recommended replicas >= Desired safe replicas | Scans larger node counts up to the built-in node ceiling for Scale Path. |
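The boundary rules in this table can be sketched as a downward search for the largest target that passes every check. This is an illustrative Python sketch under stated assumptions: surge and PDB percentages are rounded up, the function and scenario names are hypothetical, the 213-pod node-loss capacity and the 80% PDB value are assumed for the example, and the calculator's real search may differ.

```python
from typing import Dict

def ceil_pct(count: int, percent: int) -> int:
    """ceil(count * percent / 100) using integer arithmetic only."""
    return -(-count * percent // 100)

def rollout_peak(target: int, max_surge_pct: int) -> int:
    """Target replicas plus temporary surge pods."""
    return target + ceil_pct(target, max_surge_pct)

def safe_target(peak_capacities: Dict[str, int], smallest_failure_capacity: int,
                max_surge_pct: int, pdb_pct: int) -> int:
    """Largest target whose rollout peak fits every scenario and whose
    PDB floor fits the smallest modeled failure capacity."""
    peak_limit = min(peak_capacities.values())
    for target in range(peak_limit, 0, -1):
        peak_ok = rollout_peak(target, max_surge_pct) <= peak_limit
        pdb_ok = ceil_pct(target, pdb_pct) <= smallest_failure_capacity
        if peak_ok and pdb_ok:
            return target
    return 0

scenarios = {
    "healthy baseline": 232,       # from the worked example in this section
    "selected node-loss drill": 213,  # assumed value for illustration
    "largest-zone outage": 155,    # from the worked example in this section
}
print(safe_target(scenarios, 155, max_surge_pct=25, pdb_pct=80))  # 124
print(safe_target(scenarios, 90, max_surge_pct=25, pdb_pct=80))   # 112
```

The second call shows the PDB floor capping the target even though the surge peak would fit, which happens in this sketch only when the smallest failure capacity is lower than the capacities used for the peak checks.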
With the service-mesh defaults, 12 nodes with 14 vCPU allocatable and 0.55 vCPU of DaemonSet CPU leave 13.45 vCPU per node. An 18% reserve, 76% target utilization, 8% fragmentation allowance, and 12% topology reserve multiply to a planning factor of 0.82 × 0.76 × 0.92 × 0.88, or about 0.5045. CPU budget is therefore about 12 × 13.45 × 0.5045 = 81.4 vCPU, and a 0.35 vCPU effective pod request yields 232 pods by CPU after flooring, before memory, slots, storage, network, rollout, and failure checks are applied.
In that same default run, CPU is the healthy limiter at 232 steady pods, but the final Recommended replicas (safe target) is 124. The binding policy is the largest-zone outage: with 3 zones, the model removes 4 nodes, leaving 8 active nodes and a zone steady capacity of 155 pods. A 25% surge turns 124 replicas into a Rollout peak at target of 155 pods, leaving no zone-outage slack.
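The arithmetic in the two paragraphs above can be replayed in a few lines of Python. The sketch assumes the four allowances multiply together, which reproduces the quoted planning factor of about 0.5045, and it assumes the surge percentage rounds up. The variable and function names are hypothetical.

```python
import math

NODES, ZONES = 12, 3
usable_cpu_per_node = 14 - 0.55   # allocatable vCPU minus DaemonSet CPU
# reserve, target utilization, fragmentation allowance, topology reserve
factor = (1 - 0.18) * 0.76 * (1 - 0.08) * (1 - 0.12)

def cpu_pod_ceiling(active_nodes: int) -> int:
    """Whole pods that fit the CPU budget at a 0.35 vCPU effective request."""
    return math.floor(active_nodes * usable_cpu_per_node * factor / 0.35)

healthy = cpu_pod_ceiling(NODES)                   # all nodes active
zone_nodes = NODES - math.ceil(NODES / ZONES)      # largest zone removed
zone_capacity = cpu_pod_ceiling(zone_nodes)

target = 124
peak = target + math.ceil(target * 0.25)           # 25% maxSurge, rounded up
slack = zone_capacity - peak

print(round(factor, 4), healthy, zone_nodes, zone_capacity, peak, slack)
# 0.5045 232 8 155 155 0
```

A slack of 0 is what makes the largest-zone outage the binding policy: one more replica would push the rollout peak past the 155 pods that survive the outage.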
Accuracy Notes:
The result is a deterministic planning estimate, not a scheduler replay. It uses average pod shape and explicit assumptions, so the answer can be too optimistic or too conservative when real placement rules differ from those assumptions.
- Use current node allocatable values and actual request totals from the workload that will be deployed.
- Keep reserve, target utilization, topology, fragmentation, and failure assumptions unchanged when comparing two runs.
- Check scheduler events, provider CNI limits, affinity and anti-affinity rules, taints, autoscaler settings, and real zone distribution before approving a high-risk rollout.
- Treat ephemeral storage as a hard modeled ceiling only when pods declare nonzero local ephemeral-storage requests.
Worked Examples:
A service-mesh API pool starts with 12 workers, 14 allocatable vCPU per node, 52 GiB memory per node, 110 Max pods per node, 0.35 vCPU pods, 0.7 GiB pods, a 25% Rolling update maxSurge, 3 zones, and 1 Node failures to tolerate. The output shows Recommended replicas (safe target) as 124, Limiting dimension as CPU requests, Rollout peak at target as 155 pods, and Binding policy check as largest-zone outage.
A dense small-pod pool changes the same workload to 28 Max pods per node, 0.05 vCPU pods, 0.1 GiB pods, and 8 DaemonSet pods per node. CPU and memory now have plenty of room, but Limiting dimension becomes pod slots. The expected Recommended replicas (safe target) drops to 78 because the largest-zone outage leaves 98 steady pod slots and a 25% surge turns 78 replicas into 98 peak pods.
A troubleshooting review sets Desired safe replicas to 180 with the default service-mesh inputs. The summary still shows 124 safe replicas, so the target is short by 56. Scale Path reaches the goal at 18 total nodes, and Capacity Levers helps compare that node increase against lowering requests, reducing surge, relaxing the PDB floor, or improving packing. If the limiter were Pod IP budget, the corrective path would need more usable pod IPs, not only more CPU.
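The node search in this example can be approximated with a short loop. The Python sketch below models only the CPU ceiling and the largest-zone outage, which are the limiter and binding policy for the default inputs; it ignores the PDB floor and the other dimensions, so it is a simplification rather than the calculator's Scale Path logic. The names and the 200-node search limit are hypothetical.

```python
import math
from typing import Optional

FACTOR = (1 - 0.18) * 0.76 * (1 - 0.08) * (1 - 0.12)  # about 0.5045
USABLE_CPU_PER_NODE = 14 - 0.55                        # vCPU after DaemonSet CPU

def cpu_pod_ceiling(active_nodes: int) -> int:
    """Whole pods that fit the CPU budget at 0.35 vCPU per pod."""
    return math.floor(active_nodes * USABLE_CPU_PER_NODE * FACTOR / 0.35)

def zone_safe_target(total_nodes: int, zones: int = 3, surge_pct: int = 25) -> int:
    """Largest target whose rollout peak fits after losing the largest zone."""
    surviving = total_nodes - math.ceil(total_nodes / zones)
    capacity = cpu_pod_ceiling(surviving)
    target = capacity
    # Surge pods are rounded up with integer arithmetic: ceil(target * pct / 100).
    while target > 0 and target + -(-target * surge_pct // 100) > capacity:
        target -= 1
    return target

def scale_path(goal: int, start_nodes: int, limit: int = 200) -> Optional[int]:
    """First total node count whose safe target reaches the goal, if any."""
    for nodes in range(start_nodes, limit + 1):
        if zone_safe_target(nodes) >= goal:
            return nodes
    return None

print(zone_safe_target(12))   # 124
print(scale_path(180, 12))    # 18
```

Because the zone-loss rule removes ceil(nodes / zones) nodes, the safe target rises in uneven steps as nodes are added, so some node additions do not increase it at all.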
FAQ:
Should I enter observed CPU usage or CPU requests?
Enter requests. Kubernetes scheduling reserves capacity from declared requests, while observed usage is useful for later right-sizing and alert review.
Why is the safe target lower than steady planner capacity?
The safe target accounts for rollout surge, selected node loss, largest-zone loss when enabled, and the PDB floor. Steady planner capacity is the healthy resource ceiling before those availability checks reduce the publishable replica count.
Why does storage say advisory only?
Storage stays advisory when the effective per-pod ephemeral request is 0 GiB. Enter a nonzero Average pod ephemeral request or extra pod ephemeral overhead when local ephemeral-storage requests should constrain scheduling.
What should I do with an adjusted input warning?
Review the changed field before using the result. Warnings commonly mean zones exceeded worker nodes, failed-node count left no active node, DaemonSet pods exceeded Max pods per node, or a number was outside the accepted range.
Does the calculator send my cluster values away?
The capacity calculation runs in the browser. Shared links can carry selected values in the URL, so treat copied links as review artifacts and avoid sharing sensitive context in them.
Glossary:
- Allocatable: Node CPU, memory, and other resources available to pods after Kubernetes and system reservations.
- Request: The amount of CPU, memory, or ephemeral storage a pod asks Kubernetes to reserve for scheduling.
- DaemonSet tax: Per-node platform pod count and resource requests removed before application pod capacity is calculated.
- Effective pod request: The application pod request plus any extra per-pod overhead included in the planning model.
- Rollout peak: The safe replica target plus temporary surge pods created during a rolling update.
- PodDisruptionBudget: A Kubernetes policy that limits voluntary disruption by requiring enough pods to remain available.
- Pod IP budget: An optional subnet-based ceiling for workload pod addresses after node, surge-node, and system reservations.