{{ summaryHeading }}
{{ summaryPrimary }}
{{ summaryLine }}
{{ badge.label }}
Canary deployment inputs
Target service for the canary rollout.
Version or artifact identifier that will receive canary traffic.
Pick the traffic-shift mechanism closest to the production rollout.
Choose the blast-radius profile for this service path.
Comma or line separated percentages, for example 1, 5, 10, 25, 50, 100.
Minutes to observe each traffic step before the next promotion.
min
Minutes of telemetry each gate should evaluate before promotion.
min
Minutes to keep the release under elevated monitoring after full promotion.
min
Approximate production requests per minute for this service path.
req/min
Approximate total serving replicas or tasks during the rollout.
replicas
Maximum acceptable 5xx or failed-request rate for the canary slice.
%
Maximum accepted canary p95 latency during each analysis window.
ms
Maximum acceptable saturation signal for the canary slice.
%
Choose how much stable capacity remains available while canary traffic increases.
How the rollout should react when a gate fails.
The exact action operators should take when a canary gate fails.
Person, team, or role accountable for the rollout gate decisions.
Optional release governance reference.
Optional pre-traffic checks added to the schedule.
min
Optional custom metric and threshold for the gate checklist.
Optional newline-separated checks, such as dashboards, feature flag state, or alert routing.
Aspect Value Planning note Copy
{{ row.aspect }} {{ row.value }} {{ row.note }}
Phase Canary Stable Bake Elapsed Gate action Copy
{{ row.phase }} {{ row.canary }} {{ row.stable }} {{ row.bake }} {{ row.elapsed }} {{ row.action }}
Signal Threshold Window Cadence Abort action Copy
{{ row.signal }} {{ row.threshold }} {{ row.window }} {{ row.cadence }} {{ row.action }}
Trigger Detection Rollback action Recovery validation Copy
{{ row.trigger }} {{ row.detection }} {{ row.action }} {{ row.validation }}
Customize
Advanced
:

A canary deployment releases a new version to a small slice of production traffic before the full audience sees it. The point is to learn from real requests while the blast radius is still small enough to stop quickly.

The release decision depends on three things at once: how much traffic is exposed, how long each step is observed, and what signals are trusted enough to halt the rollout. A 1% step for a busy checkout service may produce thousands of requests in a few minutes, while a quiet internal service may need a longer bake window before the same percentage says much.

Canary rollout path from stable traffic to a small canary, guardrail gate, promotion, or rollback
Traffic increases only after each guardrail gate has enough evidence; a failed gate sends traffic back to the stable release.

Canary planning can give false comfort when the percentages look careful but the checks are weak. A rollout with tiny steps still needs useful telemetry, clear abort authority, and enough stable capacity to take traffic back. The schedule is a release plan, not proof that the new version is safe.

A good canary plan names the service, the release version, the traffic mechanism, the observation windows, and the exact rollback action. Those details make the plan usable in a release review and help operators avoid debating basics while customer traffic is already moving.

Technical Details:

Canary rollout mechanics start with a traffic split between the stable version and the release candidate. Each step assigns a canary percentage, holds that weight for a bake interval, then evaluates guardrail telemetry before the next promotion. A final stable promotion closes the rollout only after the earlier gates pass.

The same percentage can mean very different evidence depending on request volume and duration. Modeled canary exposure is the request rate multiplied by the canary percentage and the step duration. Higher exposure gives the gate more samples, but it also increases the impact of a bad release.

canary requests = request rate × canary percent 100 × bake minutes

Traffic steps are normalized into ascending promotion order. Invalid percentages outside 1 to 100 are ignored, duplicate percentages are collapsed, and a missing 100% promotion is added so the schedule has a closure point. If no usable traffic step remains, the planner reports an input error.

Rule Core

Canary planner scoring and warning rules
Planning factor Rule used Result effect
Release risk Routine starts at 8, customer-facing starts at 20, critical starts at 34. Sets the base risk score before schedule and rollback choices are added.
Stable capacity Full stable capacity adds 0, shared capacity adds 6, reduced stable capacity adds 14. Reduced stable capacity also creates a rollback warning.
Rollback mode Automated rollback adds 0, manual halt adds 8, manual promote only adds 18. Manual-only rollback creates a warning because recovery depends on operator action.
Traffic jumps First canary step above 10% adds 12. Largest jump above 25 points adds 7; above 50 points adds 12. Large exposure changes raise the score and add review notes.
Telemetry window Bake interval shorter than the analysis window adds 10. The gate may not see a complete telemetry window before promotion.
Request rate At least 3,000 requests per minute adds 4; at least 10,000 adds 8. Busy services increase potential exposure during each bad step.
Kubernetes replica granularity Without a traffic router, fewer than 10 replicas and a first step below 10% add 10. Small replica counts cannot represent very small percentages accurately.
Rollback action A blank rollback action adds 12. The playbook falls back to platform guidance instead of a concrete operator command.

Recommendations are score bands with one extra warning rule. A score of 58 or higher returns Tighten before launch. A score from 34 to 57, or at least three warnings, returns Needs guarded review. Lower scores return Ready for gated canary when required fields are present.

Canary planner guardrail signals
Gate signal Threshold source Abort meaning
5xx or failed-request rate Error-rate abort threshold and analysis window. Canary errors breached the allowed rate for the gate window.
User-facing p95 latency p95 latency abort threshold in milliseconds. The canary is slower than the release limit or materially worse than stable.
Saturation CPU, memory, queue, connection, or similar saturation threshold. Resource pressure makes the next promotion unsafe until capacity recovers.
Custom business guardrail Optional custom metric text supplied by the release team. A domain signal such as authorization success or order completion is below the chosen limit.
Operator or stakeholder stop Any credible regression report during the canary. Human stop authority remains available throughout the rollout.

The platform choice changes the wording of promotion, verification, and rollback actions. Argo Rollouts, CodeDeploy for ECS, CodeDeploy for Lambda, Google Cloud Deploy, and generic traffic routers are treated as traffic-routing platforms, while a basic Kubernetes workload rollout warns when replica math cannot represent small canary weights well.

Everyday Use & Decision Guide:

Start with the production Service name and Release version that will appear in dashboards, events, and rollback notes. Pick the deployment platform closest to the real traffic-shift mechanism. Choose Kubernetes workload rollout only when pod or task counts are the main approximation; choose a traffic-router option when weights can be controlled independently of replicas.

The default traffic steps of 1, 5, 10, 25, 50, 100 are a cautious first pass for a customer-facing service. Increase bake time when traffic is low or when the analysis window needs more telemetry. For critical paths, smaller first steps and longer holds usually matter more than adding a long final watch after the release is already at 100%.

  • Use Estimated request rate to make canary exposure visible instead of judging percentages alone.
  • Keep Bake interval at least as long as Analysis window so each gate can evaluate a complete window.
  • Set error, p95 latency, and saturation thresholds to values that should stop promotion, not merely values that create alerts.
  • Use Stable stays fully warm until final promotion when fast rollback matters more than temporary capacity cost.
  • Write Rollback action as an operator-ready sentence, such as setting canary traffic to 0%, restoring stable traffic, disabling a risky flag, and verifying stable metrics.

The result tabs cover different release-review jobs. Rollout Brief summarizes service, platform, schedule, score, capacity posture, rollback authority, and primary gates. Traffic Schedule shows each phase with canary and stable percentages, bake time, elapsed time, and gate action. Guardrail Gates and Rollback Playbook turn thresholds and owner information into reviewable runbook rows.

A common misread is treating Ready for gated canary as approval to release. It means the entered plan is internally consistent under the planner rules. It does not mean the dashboards exist, the alarms are wired, the rollout controller is healthy, or the service owner has accepted the risk.

Before using the plan, read any Review before promotion warnings and compare the Canary Gate Ladder with the rollback posture. A chart that reaches 100% quickly while warnings remain visible should slow the release review down.

Step-by-Step Guide:

Use this flow to turn a release idea into a canary rollout brief and rollback playbook.

  1. Enter Service name and Release version. If either is blank, Canary inputs need attention reports the missing field.
  2. Choose Deployment platform and Release risk. The summary badges update with the platform short name, risk tier, rollback mode, gate duration, and warning count.
  3. Enter Canary traffic steps. Commas, spaces, line breaks, and percent signs are accepted; duplicates are collapsed and a missing 100% step is added.
  4. Set Bake interval, Analysis window, and Final watch. Check Traffic Schedule to confirm elapsed time and gate actions match the release window.
  5. Fill in Estimated request rate, Target replicas, and the three abort thresholds. The brief uses these values to model request exposure and build guardrail rows.
  6. Choose Stable capacity posture and Rollback mode, then write the exact Rollback action. A blank action creates a warning and makes the rollback playbook fall back to platform guidance.
  7. Open Advanced when the plan needs a release owner, change ticket, preflight duration, custom guardrail, or predeploy checks. These values appear in the brief, guardrail gates, schedule, or rollback rows.
  8. Review Rollout Brief, Guardrail Gates, Rollback Playbook, and Canary Gate Ladder. Clear input errors before copying JSON or exporting tables.

Interpreting Results:

The most important result is the combination of recommendation, warnings, and rollback posture. A low score with several warnings still deserves review because warnings call out specific release hazards, such as an incomplete analysis window or a first canary step above 10%.

Canary rollout result cues and follow-up actions
Result cue Meaning Follow-up
Ready for gated canary The schedule, rollback posture, and guardrails fit the planner's lower-risk rules. Confirm dashboards, alerts, deployment permissions, and owner availability before release.
Needs guarded review The score or warning count is high enough that accountable approval should be explicit. Reduce jumps, lengthen bake windows, strengthen rollback automation, or document acceptance.
Tighten before launch The canary exposes too much risk for the chosen path, capacity, traffic, or rollback mode. Use smaller steps, keep stable capacity warm, and move rollback closer to automation.
Review before promotion Inputs are valid, but one or more planning warnings need a decision. Read every warning before using the schedule or playbook in a change review.
Canary inputs need attention A required field or traffic-step rule failed. Fix the listed error before relying on the brief, schedule, exports, or JSON.

Do not overread modeled request exposure as statistical proof. The planner multiplies request rate, percentage, and duration; it does not inspect real telemetry quality, sampling bias, retry storms, regional skew, alert routing, or business approval.

The safest follow-up is to compare Guardrail Gates with live dashboards before the first traffic shift. Every threshold should map to a chart, alert, or query that the release owner can read during the bake interval.

Worked Examples:

Customer checkout release. With checkout-api, version 2.8.0, Argo Rollouts, customer-facing risk, traffic steps 1, 5, 10, 25, 50, 100, a 10 minute bake, a 5 minute analysis window, 1,800 requests per minute, full stable capacity, and automated rollback, Rollout Brief shows 6 traffic gate(s), 1.1 hr elapsed. Risk posture is 27/100 - Ready for gated canary, and the modeled canary exposure is 43,380 requests across planned bake windows.

Critical path with aggressive settings. A billing release using steps 25, 50, 100, critical risk, a 5 minute bake, a 10 minute analysis window, 12,000 requests per minute, reduced stable capacity, and manual promote only reaches Tighten before launch. The warnings identify the large first exposure, incomplete telemetry window, reduced rollback capacity, manual-only rollback, and a promotion jump above 25 percentage points.

Small Kubernetes replica count. A routine service with Kubernetes workload rollout, 4 replicas, and steps 1, 5, 10, 100 may still produce a valid schedule, but Review before promotion warns that tiny percentages cannot be represented accurately without a traffic router. The corrective path is to use a real router for percentage weights, raise serving capacity, or choose steps that match whole replica counts.

Missing rollback action. Leaving Rollback action blank does not stop the planner from building tables, but it adds risk and warning text. Rollback Playbook then uses platform guidance instead of the operator-specific action, so the release owner should add a concrete traffic, flag, deployment, and verification sentence before the plan is used.

Security and Privacy Notes:

Planning calculations run in the browser after the page loads. The entered service name, release version, rollback text, guardrails, generated tables, chart data, and JSON output are handled client-side by this tool, with no server-side planning endpoint in the tool behavior.

Treat the output as operationally sensitive. It may include service names, release identifiers, change tickets, rollback commands, feature flags, capacity assumptions, and threshold values that belong in normal release-management controls.

FAQ:

Does the planner deploy the canary?

No. It prepares a rollout brief, schedule, guardrail table, rollback playbook, chart data, and JSON. The actual deployment still happens in Argo Rollouts, Kubernetes, CodeDeploy, Cloud Deploy, or another release system.

Why was 100% added to my traffic steps?

The schedule needs a final promotion point. If the entered percentages stop below 100, the planner adds a final stable-promotion step and reports that change as a warning.

Why does a small Kubernetes rollout warn about low percentages?

A basic workload rollout approximates percentages with whole replicas. With fewer than 10 replicas, a 1% or 5% canary cannot be represented accurately unless a traffic router controls weights separately.

Which guardrail should stop promotion first?

Any sustained breach in the entered error-rate, p95 latency, saturation, or custom business guardrail should stop the next promotion. The rollback playbook also includes an operator or stakeholder stop trigger.

Can I compare two plans by score alone?

Use the score as a triage cue, not the whole decision. Compare warnings, first exposure, largest promotion jump, rollback capacity, rollback mode, and whether the guardrail thresholds match real dashboards.

Glossary:

Canary deployment
A release pattern that sends a small share of production traffic to a new version before full promotion.
Stable version
The current production version that should remain able to serve traffic during rollback.
Bake interval
The hold time at one canary percentage before the next promotion is considered.
Analysis window
The telemetry period used to evaluate guardrail thresholds at a gate.
Guardrail
A metric or business signal that can halt promotion when it breaches the chosen threshold.
p95 latency
The response-time value below which 95% of observed requests fall.
Saturation
Resource pressure such as CPU, memory, queue depth, connection use, or another capacity signal.
Rollback posture
The combination of stable capacity, automation, owner authority, and concrete action that determines recovery speed.