Blue-Green Deployment Planner
Plan a blue-green release schedule with traffic waves, bake windows, capacity warnings, rollback triggers, command drafts, and exposure estimates.
Introduction:
A blue-green release keeps two production-capable environments available during the same deployment window. One environment continues to serve users while the replacement is deployed, warmed, checked, and then promoted by moving the traffic route. The pattern reduces downtime, but it also makes routing, observation, and rollback rules central to the release plan.
The names are conventional rather than magical. Blue usually means the current production environment, and green usually means the candidate environment. Some teams use labels such as stable and candidate, old and new, or active and standby. The important part is that both sides can be named unambiguously during an incident, and that operators know which side is currently receiving production traffic.
Blue-green planning works best when the new version can be validated with production-like traffic before the previous version is removed. Web services, APIs, worker pools, and cluster workloads often fit that pattern. The release still needs data that both versions can read and write, warmed caches, ready dependencies, observability that separates old-version and new-version signals, and enough capacity for the target to carry the final load.
| Term | Planning meaning |
|---|---|
| Bake interval | Observation time after a traffic change, long enough for alarms, logs, queues, and user-path signals to settle. |
| Traffic wave | A planned target percentage, such as 10%, 30%, or 100%, applied through a route that supports weighting. |
| Rollback trigger | A concrete condition that sends traffic back to the current environment instead of continuing the promotion. |
| Blue hold | Time after promotion when the old environment stays healthy so rollback remains fast. |
Route choice changes the risk picture. A load balancer, weighted DNS record, service mesh, ingress rule, or Gateway route may express a gradual shift. A plain Kubernetes Service selector change points the Service at a matching set of Pods, so it behaves more like an atomic switch unless another routing layer controls weights. Sticky sessions, long-lived connections, resolver caches, and client-side retries can also make observed traffic differ from the nominal percentage.
A release schedule cannot prove that the new version is correct. It turns release intent into traffic steps, observation gates, capacity checks, rollback language, and old-environment hold time that a team can review before production traffic moves. The strongest plans name the route, the health signals, the approval owner, and the path back before users are exposed.
How to Use This Tool:
Start with Routing surface because that choice decides whether the plan can move gradually or must model a full switch. Weighted load balancer, weighted DNS, and service mesh or ingress modes use traffic percentages. Kubernetes Service switch models one 100% target wave.
- Enter Service name, Current environment, and Target environment. The current and target names must be different, and the labels should be recognizable during a release incident.
- Set First shift and Shift increment. The final modeled wave is capped at 100% target traffic, so a 10% first shift and 20% increment becomes 10%, 30%, 50%, 70%, 90%, and 100%.
- Set Bake interval, Metric lag, and Pre-warm time. These values shape how long the plan waits before each next gate and how much time is reserved before first production exposure.
- Enter Baseline traffic as expected requests per minute during the release window, then set Target capacity against expected peak rather than a quiet daily average.
- List concrete Health checks, one per line when possible. Useful checks include readiness, error rate, latency, saturation, queue depth, and a business or user-path signal.
- Write a measurable Rollback trigger, then add Route target when command drafts should name a listener, hosted zone, virtual service, Kubernetes Service, or other routing object.
- Use the advanced fields for DNS TTL, Sticky sessions, Blue hold time, and Promotion gate. If the summary reports hold for review, fix or explicitly accept the warning before using the plan in a change record.
The summary gives a fast sanity check: wave count, modeled shift time, target capacity, bake length, and caution state. A ready path label means the built-in cautions did not find a problem in the entered values. It is not deployment approval.
The planning calculation runs in the browser. Generated commands are review drafts, not tested commands. Replace placeholders and verify real infrastructure identifiers before copying any command into a deployment runbook.
Interpreting Results:
Shift Schedule is the main timing output. Read each wave as the planned traffic state after preflight, optional pre-warm, and a bake-plus-lag checkpoint. Requests at wave is estimated from the entered request rate and target percentage; it is not a count of distinct users or sessions.
The warning label is a shortcut, but the warning list matters more than the label alone. One severe capacity warning can matter more than several small timing cautions. Treat hold for review as a stop sign until the missing rollback text, low capacity, short bake, route mismatch, or weak guardrail is fixed or accepted by the release owner.
| Output | What it shows | How to review it |
|---|---|---|
| Shift Schedule | Elapsed time, target share, blue share, and modeled target requests for each wave. | Check whether the first wave, final promotion, and total shift time fit the release window. |
| Gate Checklist | Validation issues, cautions, readiness, capacity, routing, approval, rollback, and old-environment hold rows. | Use it as a release-review checklist before anyone changes production traffic. |
| Traffic Commands | Route-change and rollback command drafts for the selected routing surface. | Replace placeholders and confirm namespaces, listeners, target groups, hosted zones, routes, and service names against the real environment. |
| Traffic Shift Curve | Blue and target traffic share over elapsed gate time. | Look for jumps that are too large, one-wave selector switches, or a plan that waits longer than expected. |
| JSON | Entered values, summary, warnings, schedule rows, gate rows, chart data, and command text. | Use it for change records only after warnings and placeholders have been reviewed. |
Exports preserve the modeled plan, not the live state of production. If monitoring later shows different traffic distribution, trust monitoring first and update the plan assumptions before continuing.
Technical Details:
The schedule is driven by target percentage, observation interval, baseline request rate, pre-warm time, and old-environment hold time. Weighted routing surfaces advance through target percentages that start at the first shift and increase by the shift increment until the target reaches 100%. The last step is capped at 100%, and the schedule is limited to a finite set of waves so a bad input cannot create an endless plan.
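The wave progression described above can be sketched in a few lines. This is a minimal illustration, not the planner's own code: the function name `target_waves` and the `max_waves` limit of 20 are assumptions, since the text only says the schedule is limited to a finite set of waves.

```python
def target_waves(first_shift: float, increment: float, max_waves: int = 20) -> list[float]:
    """Return target traffic percentages, starting at first_shift and capped at 100."""
    if first_shift <= 0 or increment <= 0:
        raise ValueError("first_shift and increment must be positive")
    waves: list[float] = []
    pct = float(first_shift)
    # max_waves is a hypothetical safety limit so a tiny increment cannot loop forever.
    while len(waves) < max_waves:
        capped = min(pct, 100.0)
        waves.append(capped)
        if capped >= 100.0:
            break
        pct += increment
    return waves


print(target_waves(10, 20))  # [10.0, 30.0, 50.0, 70.0, 90.0, 100.0]
```

If the wave limit is reached before 100%, this sketch simply stops; how the real planner handles that case is not stated in the text.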
Selector-switch mode uses one 100% target wave because a normal Kubernetes Service selector does not express request weight. Gradual traffic within Kubernetes needs another traffic-routing mechanism, such as a mesh, ingress, Gateway route, or load balancer rule that can send defined shares to separate destinations.
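To make the atomic nature of a selector switch concrete, the sketch below builds a `kubectl patch` draft that repoints a Service selector. The Service name `checkout`, the namespace `prod`, and the label key `version` are hypothetical placeholders, and the helper name is an assumption; the real labels depend on how the Pods are tagged.

```python
import json
import shlex


def selector_patch_command(service: str, namespace: str, label_key: str, target_value: str) -> str:
    """Draft a kubectl command that points a Service selector at the target label value."""
    # A merge patch only changes the named selector key; other selector keys stay as they are.
    patch = {"spec": {"selector": {label_key: target_value}}}
    body = json.dumps(patch, separators=(",", ":"))
    args = ["kubectl", "patch", "service", service, "-n", namespace,
            "--type", "merge", "-p", body]
    return " ".join(shlex.quote(arg) for arg in args)


# Hypothetical identifiers; replace them before using the draft anywhere.
print(selector_patch_command("checkout", "prod", "version", "green"))
```

Once the patch applies, every new connection through the Service goes to Pods carrying the target label, which is why the planner models this surface as a single 100% wave. Rolling back is the same command with the previous label value.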
Formula Core:
The model adds pre-warm time before production exposure, then repeats one bake-plus-lag interval for every traffic wave. Exposure is estimated as requests during the interval multiplied by the target share. A one-minute minimum prevents a nonzero traffic percentage from showing zero exposure when the interval is set to zero.
For example, 4,200 requests per minute, a 30 minute bake, a 5 minute lag, and a 10% first wave produce 4,200 * 35 * 0.10 = 14,700 modeled target requests for that first wave. The same interval at 100% target traffic models 147,000 target requests.
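The arithmetic above can be checked with a small function. This is a sketch under the stated model (steady request rate, bake plus lag as the interval, a one-minute minimum); the function and parameter names are illustrative and not taken from the tool.

```python
def modeled_exposure(requests_per_minute: float, bake_minutes: float,
                     lag_minutes: float, target_percent: float) -> int:
    """Estimate requests sent to the target during one bake-plus-lag interval."""
    # One-minute minimum so a nonzero percentage never shows zero exposure.
    interval = max(bake_minutes + lag_minutes, 1.0)
    return round(requests_per_minute * interval * target_percent / 100)


print(modeled_exposure(4200, 30, 5, 10))   # 14700
print(modeled_exposure(4200, 30, 5, 100))  # 147000
```

The result counts modeled requests, not distinct users or sessions.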
Risk Rules:
| Condition | Planner cue | Why it matters |
|---|---|---|
| Fewer than three health checks | thin guardrails | Readiness, errors, latency, saturation, and user-path behavior can fail separately. |
| Missing rollback trigger | missing rollback trigger | Operators need a pre-agreed stop condition before production traffic moves. |
| Target capacity below 100% of expected peak | target below peak capacity | The target may pass early exposure and still fail when it becomes the only production environment. |
| Bake interval under 10 minutes | short bake | Slow failures, delayed jobs, alert windows, and cache effects may not appear before the next wave. |
| First shift above 25% or increment above 50% | large first shift or large increment | Large jumps raise blast radius and can skip useful observation points. |
| DNS TTL longer than the modeled observation window | ttl exceeds bake | Resolvers may keep older answers after the team expects a route change or rollback to be visible. |
| Sticky sessions above 20% or blue hold below 30 minutes | sticky sessions or short blue hold | Observed traffic can blur across old connections, and early teardown removes the fastest rollback path. |
| Kubernetes selector mode with partial-wave inputs | atomic selector switch | A selector patch models one 100% switch unless another routing mechanism controls weights. |
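The conditions in this table translate naturally into a checker that returns the matching cues. The sketch below is an illustration only: the `PlanInputs` fields, the routing surface strings, the choice to compare the TTL against the bake interval in seconds, and the choice to apply the TTL rule only in weighted DNS mode are all assumptions. The text does not state how each cue contributes to the caution score, so no weights are modeled.

```python
from dataclasses import dataclass


@dataclass
class PlanInputs:
    health_checks: list[str]
    rollback_trigger: str
    target_capacity_pct: float   # capacity as a percentage of expected peak
    bake_minutes: float
    first_shift_pct: float
    increment_pct: float
    dns_ttl_seconds: int
    sticky_sessions_pct: float
    blue_hold_minutes: float
    routing_surface: str         # hypothetical values: "weighted-lb", "weighted-dns", "mesh", "k8s-selector"


def planner_cues(p: PlanInputs) -> list[str]:
    """Return the cues from the risk table that the entered values trigger."""
    cues: list[str] = []
    if len([c for c in p.health_checks if c.strip()]) < 3:
        cues.append("thin guardrails")
    if not p.rollback_trigger.strip():
        cues.append("missing rollback trigger")
    if p.target_capacity_pct < 100:
        cues.append("target below peak capacity")
    if p.bake_minutes < 10:
        cues.append("short bake")
    if p.first_shift_pct > 25:
        cues.append("large first shift")
    if p.increment_pct > 50:
        cues.append("large increment")
    # Assumption: the TTL is compared with the bake interval, and only for weighted DNS.
    if p.routing_surface == "weighted-dns" and p.dns_ttl_seconds > p.bake_minutes * 60:
        cues.append("ttl exceeds bake")
    if p.sticky_sessions_pct > 20:
        cues.append("sticky sessions")
    if p.blue_hold_minutes < 30:
        cues.append("short blue hold")
    if p.routing_surface == "k8s-selector" and p.first_shift_pct < 100:
        cues.append("atomic selector switch")
    return cues


dns_plan = PlanInputs(
    health_checks=["readiness", "error rate", "p95 latency"],
    rollback_trigger="error rate above 1% for 5 minutes",
    target_capacity_pct=120, bake_minutes=10, first_shift_pct=10, increment_pct=20,
    dns_ttl_seconds=1800, sticky_sessions_pct=0, blue_hold_minutes=60,
    routing_surface="weighted-dns",
)
print(planner_cues(dns_plan))  # ['ttl exceeds bake']
```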
The caution score is capped at 100. A score below 18 displays ready path, 18 to 39 displays watch closely, 40 to 69 displays cautious rollout, and 70 or higher displays hold for review. These labels are planning cues, not approval outcomes.
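The score bands can be expressed as a small lookup. The function name is hypothetical, and the sketch assumes scores are non-negative numbers; it only mirrors the thresholds listed above.

```python
def caution_label(score: float) -> str:
    """Map a caution score to the planning label described in the text."""
    score = min(score, 100)  # the score is capped at 100
    if score < 18:
        return "ready path"
    if score < 40:
        return "watch closely"
    if score < 70:
        return "cautious rollout"
    return "hold for review"


for value in (0, 18, 40, 70):
    print(value, caution_label(value))
```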
Routing Surface Differences:
| Surface | Modeled behavior | Main caveat |
|---|---|---|
| Weighted load balancer | Traffic waves are represented as blue and target weights. | Health, draining, stickiness, and target-group readiness still decide whether observed traffic matches the plan. |
| Weighted DNS | Weights estimate the share of DNS answers sent toward each environment. | Resolver caching and TTL can delay both promotion and rollback visibility. |
| Service mesh or ingress | Route rules can express weighted destinations inside the application traffic path. | Destination subsets, route precedence, outlier detection, retries, and gateway behavior need separate validation. |
| Kubernetes Service switch | The schedule uses one 100% target wave. | Native selector changes do not provide a percentage ramp by themselves. |
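For the weighted surfaces, each wave reduces to a pair of weights that sum to 100. The sketch below builds one such pair in the shape of an Istio VirtualService HTTP route entry, as an example of a mesh route that can express weights. The host `checkout.prod.svc.cluster.local` and the subset names `blue` and `green` are hypothetical, and the subsets would need a matching DestinationRule that is not shown here.

```python
def weighted_route(host: str, blue_subset: str, target_subset: str, target_pct: int) -> dict:
    """Build a weighted route entry with blue and target weights that sum to 100."""
    if not 0 <= target_pct <= 100:
        raise ValueError("target_pct must be between 0 and 100")
    return {
        "route": [
            {"destination": {"host": host, "subset": blue_subset}, "weight": 100 - target_pct},
            {"destination": {"host": host, "subset": target_subset}, "weight": target_pct},
        ]
    }


# Hypothetical host and subset names for a 30% target wave.
entry = weighted_route("checkout.prod.svc.cluster.local", "blue", "green", 30)
print([d["weight"] for d in entry["route"]])  # [70, 30]
```

Rollback in this shape is the same entry with `target_pct` set back to 0.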
Limitations and Privacy Notes:
The planner estimates request exposure from a steady request rate. Real production traffic can spike, drain slowly, retry, reconnect, or stay pinned to an older destination. Use the schedule and curve to prepare the release, then compare each wave against live monitoring before continuing.
Command drafts intentionally use placeholders for infrastructure identifiers. A correct plan still needs human review of accounts, regions, hosted zones, listener names, target groups, namespaces, route objects, service labels, and rollback authority. Do not run generated text until it has been adapted to the real environment.
Entered values are processed locally in the browser. Treat copied CSV, DOCX, chart images, command text, and JSON as release records when they include service names, route names, rollback language, or operational details.
Worked Examples:
Default Weighted Load Balancer Plan:
With a 10% first shift and 20% increments, the schedule produces six target waves: 10%, 30%, 50%, 70%, 90%, and 100%. A 30 minute bake plus a 5 minute metric lag creates a 35 minute interval. With 15 minutes of pre-warm time, the target reaches full traffic after 225 minutes, before the old-environment hold is added.
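The 225 minute figure follows from adding the pre-warm time once and one bake-plus-lag interval per wave. The sketch below shows one reading of the schedule that is consistent with that total: each wave's elapsed time is taken at the end of its checkpoint. The function name is illustrative, and the planner's own row layout may differ.

```python
def wave_checkpoints(wave_count: int, bake_minutes: float,
                     lag_minutes: float, prewarm_minutes: float) -> list[float]:
    """Elapsed minutes at the end of each wave's bake-plus-lag checkpoint."""
    interval = bake_minutes + lag_minutes
    return [prewarm_minutes + interval * (i + 1) for i in range(wave_count)]


# Six waves (10%, 30%, 50%, 70%, 90%, 100%), 30 min bake, 5 min lag, 15 min pre-warm.
print(wave_checkpoints(6, 30, 5, 15))  # [50, 85, 120, 155, 190, 225]
```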
DNS Plan With Long TTL:
A weighted DNS plan with a 10 minute bake and a 1,800 second TTL triggers a TTL caution because recursive resolvers can keep older answers longer than the modeled observation window. The plan may still show clean percentages, but rollback visibility depends on cached DNS answers outside the route change itself.
Kubernetes Selector Switch:
Selecting Kubernetes Service switch with a 10% first shift still creates one 100% target wave. That is the expected model for a native Service selector change. To plan gradual Kubernetes traffic, choose a routing surface that can represent weights.
Capacity Below Peak:
A target capacity of 80% of expected peak creates a severe capacity caution even if the first wave is small. The green environment may survive early exposure and still fail at promotion. Raise capacity, warm autoscaling, lower the release scope, or stop before full traffic until the capacity gap is solved.
FAQ:
Does this tool deploy anything?
No. It builds a schedule, gates, command drafts, a traffic curve, and exportable records from the entered values. Operators still review and run deployment actions in their own systems.
Why can a small first wave still miss defects?
Low percentages reduce exposure, but rare paths, background jobs, regional traffic, cache misses, and high-load behavior may not appear until later waves. That is why bake time and diverse health checks matter.
Why does a Kubernetes Service switch ignore partial percentages?
A normal selector picks the Pods behind the Service. Without another weighted routing mechanism, changing the selector behaves like a full switch to the target labels.
What should a rollback trigger include?
Use a measurable condition and an action. Examples include repeated failed health checks, error rate above an agreed threshold, p95 latency above the release limit, queue growth, or a customer-impacting business signal.
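A measurable trigger can be written down as explicit thresholds and checked after each wave. The sketch below is an assumption-heavy illustration: the `WaveMetrics` fields and the default limits (1% error rate, 800 ms p95 latency, 3 failed health checks) are hypothetical and stand in for whatever the team has agreed on.

```python
from dataclasses import dataclass


@dataclass
class WaveMetrics:
    error_rate_pct: float
    p95_latency_ms: float
    failed_health_checks: int


def rollback_reasons(m: WaveMetrics, max_error_rate_pct: float = 1.0,
                     max_p95_ms: float = 800.0, max_failed_checks: int = 3) -> list[str]:
    """Return the breached conditions; any entry means traffic should go back to blue."""
    reasons: list[str] = []
    if m.error_rate_pct > max_error_rate_pct:
        reasons.append(f"error rate {m.error_rate_pct}% above {max_error_rate_pct}%")
    if m.p95_latency_ms > max_p95_ms:
        reasons.append(f"p95 latency {m.p95_latency_ms} ms above {max_p95_ms} ms")
    if m.failed_health_checks >= max_failed_checks:
        reasons.append(f"{m.failed_health_checks} failed health checks")
    return reasons


# Hypothetical readings after a wave: latency breaches the limit, the rest are fine.
print(rollback_reasons(WaveMetrics(error_rate_pct=0.4, p95_latency_ms=950.0, failed_health_checks=0)))
```

Returning the reasons rather than a bare boolean keeps the rollback decision traceable in the change record.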
Why keep the old environment after promotion?
A hold period preserves the fastest rollback option while drains, logs, delayed jobs, and customer signals settle. Tearing it down immediately can turn a reversible release into a restore or redeploy incident.
Glossary:
- Blue environment: The current production environment at the start of the release.
- Green environment: The replacement environment that receives traffic during the release and may become production after promotion.
- Routing surface: The mechanism that moves traffic between environments, such as a load balancer, DNS record, mesh route, ingress rule, or Service selector.
- Bake interval: The planned observation time after each traffic change before the next wave continues.
- Metric lag: Additional wait time for alerts, logs, queues, delayed jobs, and dashboards to reflect the new traffic path.
- Target exposure: The estimated number of requests sent to the target environment during a traffic wave.
- Blue hold: The time after promotion when the old environment remains available for rollback confidence.