Blue-Green Deployment Planner
Plan blue-green traffic shifts with routing surface, bake intervals, health gates, risk warnings, rollback commands, and exposure estimates before release review.
Introduction:
Blue-green deployment keeps two production-capable environments available during a release. The current environment continues serving users while the target environment is deployed, warmed, checked, and then placed behind the production route. The plan succeeds only when traffic can move forward with clear health gates and move back quickly if the target shows a customer-impacting problem.
Traffic movement can be gradual or all at once, depending on the routing surface. Weighted load balancers, weighted DNS records, service mesh routes, and Gateway-style routing can split requests by proportion. A native Kubernetes Service selector change normally points the service at one matching set of Pods, so it behaves like a full switch unless another routing layer provides weights.
A useful blue-green plan names the exact route that will change, the first production percentage, the wait time between waves, and the rollback trigger that stops the rollout. It also keeps capacity honest. A target environment that cannot carry the expected peak is not ready for a full cutover, even when early low-percent traffic looks healthy.
The schedule is a planning aid, not a deployment guarantee. Request weights do not always map exactly to user impact because clients, caches, connection reuse, sticky sessions, and DNS resolvers can keep some traffic on an older path after a change. Treat every wave as a checkpoint that needs telemetry, an operator decision, and a tested route back to the current environment.
Technical Details:
Blue-green traffic shifting is governed by route weights, observation time, target capacity, and rollback reachability. Weighted surfaces use a series of target percentages that begin with the first shift and increase by the entered increment until the target reaches 100%. The final wave is capped at 100%, so a 10% first shift with 20% increments becomes 10%, 30%, 50%, 70%, 90%, and 100%.
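The wave sequence described above can be sketched as a short function. This is a minimal illustration of the capping rule, not the planner's actual code; the function name and signature are made up for this example.

```python
def wave_targets(first_shift: float, increment: float) -> list:
    """Build the target-traffic percentages for each wave, capped at 100."""
    if first_shift <= 0 or increment <= 0:
        raise ValueError("first shift and increment must be positive")
    waves = [min(first_shift, 100.0)]
    while waves[-1] < 100.0:
        # Each wave adds the increment; the final wave is capped at 100%.
        waves.append(min(waves[-1] + increment, 100.0))
    return waves

print(wave_targets(10, 20))  # [10, 30, 50, 70, 90, 100.0]
```

A 10% first shift with 20% increments yields the six waves named in the text, with the last wave clamped from 110% down to 100%.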
The elapsed model adds pre-warm time before production exposure, then adds one bake-plus-lag interval for every traffic wave. Bake time is the planned observation period after a route change. Metric lag is extra wait time for alarms, logs, queues, and delayed dependencies to reflect the new request path. A blue hold is added after promotion so the old environment remains available while logs, drains, and rollback confidence settle.
Kubernetes Service selector mode is modeled differently because a selector patch points the production Service at the target label set. The schedule therefore uses one 100% target wave, and the planner warns when partial weights are entered for that mode. Use a service mesh, ingress, Gateway route, or load balancer if the rollout needs weighted waves inside Kubernetes.
Formula Core:
The timing and exposure model uses minutes, request rate, and target percentage. Exposure is an estimate of requests sent to the target during each wave, not a count of distinct users.
When bake time and metric lag are both zero, the request exposure formula still uses a one-minute minimum so a nonzero traffic percentage does not produce a misleading zero request estimate.
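The per-wave exposure estimate, including the one-minute floor, can be expressed as follows. This is a sketch of the modeled formula as described above; the function name is illustrative.

```python
def wave_exposure(baseline_rpm: float, bake: float,
                  metric_lag: float, target_pct: float) -> float:
    """Estimated requests sent to the target during one wave."""
    # One-minute floor so a nonzero percentage never models zero requests.
    interval = max(bake + metric_lag, 1)
    return baseline_rpm * interval * target_pct / 100

print(wave_exposure(4200, 30, 5, 10))  # 14700.0
print(wave_exposure(4200, 0, 0, 10))   # 420.0, not 0
```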
Risk Rules:
| Condition | Planner response | Why it matters |
|---|---|---|
| Fewer than three health checks | thin guardrails | Readiness, error rate, latency, saturation, and business signals should not be collapsed into one vague check. |
| No rollback trigger | missing rollback trigger | The on-call operator needs a concrete stop condition before production traffic moves. |
| Target capacity below 100% of expected peak | target below peak capacity | The target cannot safely become the only production environment if it is intentionally undersized. |
| Bake interval below 10 minutes | short bake | Delayed errors, slow dependencies, and alert windows may not settle before the next wave. |
| First shift above 25% or increment above 50% | large first shift or large increment | Large jumps increase blast radius and can skip useful checkpoints. |
| DNS TTL longer than the modeled bake comparison | ttl exceeds bake | Recursive resolvers may keep older answers after the route is changed back. |
| Sticky sessions above 20% or blue hold below 30 minutes | sticky sessions or short blue hold | Observed weights can blur, and teardown can remove the fast rollback path too soon. |
Risk points roll up into four labels: ready path, watch closely, cautious rollout, and hold for review. The label is a planning cue. It does not approve the release, execute commands, or prove that the target environment is healthy.
Result Surfaces:
| Result | What it shows | Best use |
|---|---|---|
| Shift Schedule | Preflight, optional pre-warm, each wave, elapsed time, blue share, target share, and modeled target requests. | Check the actual route sequence before the deployment window. |
| Gate Checklist | Readiness, capacity, wave, routing, approval, rollback, and blue-hold checks. | Turn the plan into operator review points. |
| Traffic Commands | Route-change and rollback command templates for the selected routing surface. | Prepare commands for manual review before adapting them to real infrastructure identifiers. |
| Traffic Shift Curve | Blue and target traffic share over elapsed gate time. | Spot abrupt jumps, one-wave selector switches, and long hold periods. |
| JSON | Inputs, summary, warnings, schedule rows, gate rows, chart data, and command runbook. | Carry the same modeled plan into change records or review notes. |
Everyday Use & Decision Guide:
Start with the route mechanism. Choose weighted load balancer target groups when one listener can split traffic between blue and target groups. Use weighted DNS only when propagation behavior and TTL are already understood. Use service mesh or ingress weights when the application route is controlled inside the cluster. Choose Kubernetes Service selector switch when the real plan is an all-at-once selector move.
Enter the production service name, the current environment, and the target environment as operators will recognize them during a release. The labels appear in schedules, gate rows, command text, and the JSON payload, so vague names create vague handoff material. Keep Current environment and Target environment different; the page reports an input error if they match.
- Keep First shift small when the target has not seen production traffic for this release. A 5% or 10% first wave usually makes guardrail failures easier to contain than a 50% jump.
- Use Bake interval for the minimum observation time after each route change, then add Metric lag when dashboards, queues, or delayed jobs trail the live request path.
- Set Baseline traffic to the expected request rate for the deployment window, not the daily average if traffic has strong peaks.
- Keep Target capacity at or above 100% unless the release intentionally has an external cap and will not promote to full traffic.
- Write health checks as action-ready guardrails such as readiness, 5xx rate, p95 latency, saturation, queue depth, and a business signal.
- Make the Rollback trigger explicit enough that the on-call person can act without a debate during the bake interval.
The summary is the fastest sanity check. ready path means the configured plan avoided the built-in caution rules, not that the deployment is riskless. hold for review means the entered plan has enough severe cautions that route movement should pause until the missing checks, capacity, timing, or rollback language is fixed.
Entered values are processed in the browser, and generated commands are text templates. Review identifiers, namespaces, listener ARNs, hosted zones, route objects, and rollback steps against the real deployment system before anyone runs a command.
Step-by-Step Guide:
Build the plan from routing reality first, then verify schedule, gates, and rollback text before sharing the result.
- Choose Routing surface. Confirm it matches the system that will actually shift production traffic.
- Enter Service name, Current environment, and Target environment. Fix any input errors before reading the schedule.
- Set First shift, Shift increment, Bake interval, and Baseline traffic. The summary updates with wave count, total shift time, and modeled exposure.
- Set Target capacity. Treat any capacity warning as a release-readiness issue, not a formatting issue.
- List concrete Health checks and write the Rollback trigger. The gate checklist turns these into review rows.
- Open Advanced for route target, DNS TTL, metric lag, pre-warm time, sticky sessions, blue hold, and promotion approval.
- Review Shift Schedule and Traffic Shift Curve. Check that the first wave, final promotion, and elapsed gate time match the intended release window.
- Review Gate Checklist and Traffic Commands. Replace placeholders and infrastructure-specific identifiers outside the planner before using command text in a real runbook.
- Use JSON only after warnings, route commands, gate language, and rollback wording match the deployment plan you intend to review.
Interpreting Results:
Shift Schedule is the main planning output. Read the elapsed time and target percentage together. A six-wave plan with 35 minutes between waves takes much longer than the percentages alone suggest, and it exposes the target to more requests as the target share rises.
The risk label should be read with the warning list. A single danger condition, such as target capacity below peak, can matter more than several mild timing cautions. The score is useful for sorting plans that need review, but the release decision still belongs to the team that owns health checks, rollback, and customer impact.
| Result cue | What it means | What to verify |
|---|---|---|
| ready path | No built-in risk rule contributed points. | Health checks, command identifiers, and rollback action still need human review. |
| watch closely | The plan has moderate cautions, usually around guardrails or timing. | Check the warning list before increasing the first wave or shortening bake time. |
| cautious rollout | Risk points are high enough that the plan needs release-owner review. | Look for missing rollback language, capacity limits, large shifts, or weak checks. |
| hold for review | The entered plan crosses the highest risk threshold. | Pause route movement until the blocking conditions are corrected or explicitly accepted. |
| Requests at wave | Estimated target requests for that wave based on request rate, interval, and target percent. | Compare with real monitoring after the shift because stickiness and DNS caching can change observed distribution. |
Use command output as a review draft. The planner cannot know the final hosted zone, listener ARN, namespace, route object, target group ARN, or organizational approval path. Those details must be supplied and checked in the real deployment process.
Worked Examples:
Default Weighted Load Balancer Plan:
With the default values, the plan starts at 10% target traffic and adds 20% each wave, producing six waves: 10%, 30%, 50%, 70%, 90%, and 100%. The 30 min bake and 5 min metric lag create a 35 min interval. With 15 min pre-warm time, total shift time is 225 min, or about 3.8 hr. At 4,200 req/min, modeled cumulative target exposure is about 514,500 req.
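The arithmetic in this example can be checked directly. A minimal sketch of the default plan's totals, using the values stated above:

```python
targets = [10, 30, 50, 70, 90, 100]            # six waves, capped at 100%
interval = 30 + 5                              # bake + metric lag, in minutes
total_minutes = 15 + len(targets) * interval   # pre-warm plus six intervals
exposure = sum(4200 * interval * p / 100 for p in targets)

print(total_minutes)        # 225
print(round(exposure))      # 514500
```

The cumulative exposure works out to 4,200 req/min x 35 min x 3.5 (the sum of the wave fractions), which matches the 514,500 req figure.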
DNS Plan With Long TTL:
A weighted DNS plan with 10 min bake time and a 1,800 sec TTL triggers a TTL warning because cached DNS answers can last longer than the bake comparison. The schedule can still show a clean sequence, but rollback may not reach every resolver as quickly as the route update suggests. Lower or confirm TTL before using short bake intervals for DNS-controlled production traffic.
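The TTL comparison in this example amounts to converting the record TTL to minutes and checking it against the bake time. This is an assumption about the shape of the check, not the planner's exact rule:

```python
def ttl_exceeds_bake(ttl_seconds: float, bake_minutes: float) -> bool:
    """Flag plans where cached DNS answers can outlive the observation window."""
    return ttl_seconds / 60 > bake_minutes

print(ttl_exceeds_bake(1800, 10))  # True: 30 min of cached answers vs a 10 min bake
```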
Kubernetes Selector Switch:
If Kubernetes Service selector switch is selected while First shift is 10%, the schedule still uses one 100% target wave. The warning explains that a native selector switch is atomic. For a gradual Kubernetes rollout, the route needs a mesh, Gateway, ingress, or load balancer layer that can represent weighted traffic.
Capacity Below Peak:
A target capacity of 80% peak creates a danger warning even if the first wave is small. The early traffic may pass, but the promotion wave asks the target to carry all production traffic. The practical fix is to raise capacity, pre-warm autoscaling, lower expected exposure with a separate control, or stop before full promotion.
FAQ:
Does the planner execute a deployment?
No. It builds schedule rows, gate rows, command text, a traffic curve, and JSON from the entered values. Operators still review, adapt, and run commands in their own deployment system.
Why does selector mode ignore partial percentages?
A native Kubernetes Service selector points the Service at Pods matching the selected labels. Without another routing layer, that change behaves like a full switch to the target environment.
Why can the request exposure estimate differ from monitoring?
The estimate uses baseline requests per minute, interval length, and target percentage. Real traffic can differ because of client behavior, connection reuse, sticky sessions, caches, and DNS resolver timing.
How many health checks should a plan include?
At least three concrete checks avoid the thin-guardrails warning. A practical set usually includes readiness, error rate, latency, saturation, and one business or user-path signal.
What should happen after 100% target traffic?
Keep the old environment available for the blue hold period. Use that time to verify drains, background jobs, logs, and rollback confidence before teardown.
Are entered values sent to a deployment service?
The planning calculation runs in the browser from the values entered on the page. Treat copied commands, CSV, DOCX, chart images, and JSON as deployment records when sharing them.
Glossary:
- Blue environment
- The current production environment that starts with all production traffic.
- Green or target environment
- The candidate environment that receives traffic during the release and becomes production after promotion.
- Routing surface
- The mechanism that moves production traffic, such as a load balancer, DNS record, service mesh route, or Kubernetes Service selector.
- Bake interval
- The planned observation time after a traffic shift before the next wave can continue.
- Metric lag
- Extra wait time for telemetry, alarms, queues, and delayed dependencies to reflect a route change.
- Target exposure
- The estimated request count sent to the target environment during one or more waves.
- Blue hold
- The period after promotion when the old environment remains available for rollback confidence.