SLO burn rate alert inputs
  • Service name: Use the production service or user journey protected by this SLO.
  • SLO label: Distinguish this SLO from other objectives on the same service.
  • SLO target (%): Enter the success objective used for the compliance window.
  • Compliance period (days): Use the same period your SLO report and error budget policy use.
  • Rule format: Pick the YAML wrapper that matches your deployment target.
  • Error ratio query template: Use one PromQL expression template that produces a bad-events / eligible-events ratio for each lookback window.
  • Burn-rate tiers: Keep each tier as key,severity,long_window,short_window,burn_multiple,for_duration. Defaults emit two page tiers and two ticket tiers for a 30-day SLO.
  • Rule-group name: Override only when your rule file naming policy requires it.
  • Evaluation interval: Leave as 30s unless your Prometheus server evaluates SLO recording rules at a different cadence.
  • Severity label key: Use the label key your notification policy already routes on.
  • Extra routing labels: Use comma or newline separated labels such as team=payments,env=prod.
  • Runbook link: Link the first responder to the recovery or triage runbook.
  • Dashboard link: Link the alert to a dashboard scoped to this SLO and service.
  • keep_firing_for: Leave off when your Prometheus version or rule policy does not use keep_firing_for. When enabled, use a Prometheus duration such as 5m, 10m, or 30m.

Introduction:

Reliability alerting works best when a page represents real user risk instead of every noisy metric spike. Service level objectives give teams a budget-based way to make that judgment. The service level indicator measures the behavior users care about, the objective sets the success target, and the remaining error budget shows how much failure is still allowed before the target is missed.

Burn rate turns that budget into a pace. A service burning at 1x is spending its error budget at exactly the planned speed for the compliance period. A service burning at 14.4x is spending the same budget 14.4 times faster, so a one-hour spike can matter even if the monthly SLO still looks healthy.
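To make the pace concrete, burn rate can be computed as the observed error ratio divided by the error budget ratio. The following is a minimal Python sketch; the function name burn_rate is illustrative and is not part of the tool.

```python
def burn_rate(error_ratio: float, slo_target_percent: float) -> float:
    """Return the burn multiple: observed error ratio over the budget ratio."""
    budget_ratio = (100.0 - slo_target_percent) / 100.0
    if budget_ratio <= 0:
        raise ValueError("SLO target must be below 100 percent")
    return error_ratio / budget_ratio


# A 99.9% SLO leaves a 0.1% budget, so 1.44% errors burn it about 14.4x too fast.
fast = burn_rate(0.0144, 99.9)
# Errors at exactly the budget ratio burn at about 1x, the planned pace.
planned = burn_rate(0.001, 99.9)
print(f"{fast:.1f}x, {planned:.1f}x")
```

Both inputs are decimal ratios or percents exactly as labeled: the error ratio is a decimal (0.0144), while the SLO target is a percent (99.9).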

SLI
The service level indicator, usually a success ratio, latency measurement, or bad-events ratio measured from real traffic or synthetic checks.
SLO
The service level objective, such as 99.9 percent successful requests over a 30-day compliance period.
SLA
The service level agreement, usually a user-facing or contractual promise that may depend on one or more SLOs.
Error budget
The portion of requests, minutes, or events that may fail before the objective is missed.
Burn rate
The speed at which current failures spend that budget compared with the planned pace.
Lookback window
The recent time range used to judge whether the current burn rate should alert.

Multi-window burn-rate alerting handles two common failures in SLO paging. A long lookback window proves that enough budget is at stake, but it can stay above threshold long after the incident is over. A short lookback window proves that the burn is current, but it can react to a burst that is too small to threaten the objective. Requiring both windows to cross the tier threshold gives the alert a budget test and a recency test.
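The two-window test can be sketched in a few lines of Python. This is an illustration of the logic only, with made-up ratios; the function name tier_should_alert is an assumption and the real check runs as a PromQL expression inside Prometheus.

```python
def tier_should_alert(long_ratio: float, short_ratio: float, threshold: float) -> bool:
    """Require both windows to exceed the tier threshold (strict greater-than)."""
    return long_ratio > threshold and short_ratio > threshold


THRESHOLD = 0.0144  # 14.4x burn on a 99.9% SLO

# Incident already over: long window still elevated, short window recovered.
recovered = tier_should_alert(0.020, 0.001, THRESHOLD)
# Brief burst: short window spikes, long window shows little budget at stake.
burst = tier_should_alert(0.002, 0.050, THRESHOLD)
# Sustained burn: both windows exceed the threshold.
sustained = tier_should_alert(0.020, 0.030, THRESHOLD)
print(recovered, burst, sustained)
```

Only the sustained case passes both the budget test and the recency test.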

[Figure: SLO target converted to error budget, multiplied by a burn tier, and checked across long and short windows.]

Tier selection is partly arithmetic and partly incident policy. Fast page tiers should catch budget loss that needs immediate response. Slower ticket tiers should catch sustained drift before the reporting period is lost. A low-traffic service needs separate review because one failed request can create a dramatic ratio without enough samples to diagnose a real outage.

The error-ratio expression is the foundation. It should divide bad events by eligible events for the same service, user journey, and compliance period used by the SLO. A clean rule built on the wrong denominator can page the wrong team or miss the reliability signal the objective was meant to protect.

How to Use This Tool:

Start with the service objective and a PromQL expression that already measures bad events divided by eligible events. Then review the generated YAML and the budget ledger before using the rules in monitoring.

  1. Enter Service name and SLO label with stable values such as orders-api and availability. Those values shape the generated alert names, labels, annotations, and downloaded filenames.
  2. Set SLO target and Compliance period. The summary should show the error budget that the target leaves and the highest configured burn tier.
  3. Choose Rule format. Pick Prometheus rule file for plain rule YAML, PrometheusRule CRD for Prometheus Operator, or Kubernetes ConfigMap rule file when a ConfigMap mounts the rules.
  4. Paste an Error ratio query template that returns a decimal ratio, not a percent. Keep {window} in the template so every tier can render both lookback windows; {service} and {slo} are optional replacements.
  5. Edit Burn-rate tiers as CSV rows with key, severity, long window, short window, burn multiple, and for duration. Load SRE defaults restores the starter page and ticket rows.
  6. Use Advanced for a custom rule-group name, evaluation interval, severity label key, extra routing labels, runbook link, dashboard link, or optional keep_firing_for. Extra labels should match the team and environment keys your alert routing already understands.
  7. Resolve every item in Review SLO alert inputs. If Generation notes remain, read them before copying Alert YAML or exporting the ledger, routing guidance, chart data, or JSON report.
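The placeholder substitution described in step 4 can be sketched as follows. This is a minimal Python illustration, not the tool's implementation: the metric http_requests_total, its service and code labels, and the function name render_query are all assumptions chosen for the example.

```python
# Hypothetical template; http_requests_total and its labels are assumptions.
TEMPLATE = (
    'sum(rate(http_requests_total{service="{service}",code=~"5.."}[{window}]))'
    " / "
    'sum(rate(http_requests_total{service="{service}"}[{window}]))'
)


def render_query(template: str, window: str, service: str = "", slo: str = "") -> str:
    """Substitute placeholders with plain string replacement.

    str.format is avoided because PromQL selectors contain literal braces.
    """
    if "{window}" not in template:
        raise ValueError("template must include {window}")
    return (
        template.replace("{window}", window)
        .replace("{service}", service)
        .replace("{slo}", slo)
    )


long_expr = render_query(TEMPLATE, "1h", service="orders-api", slo="availability")
short_expr = render_query(TEMPLATE, "5m", service="orders-api", slo="availability")
print(long_expr)
print(short_expr)
```

Rendering the same template once per window keeps the numerator and denominator identical across the long and short expressions, so only the lookback range differs.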

Interpreting Results:

Alert YAML is ready to inspect only after validation errors are gone. Passing validation means the inputs have usable shape, not that the monitoring expression measures the right traffic. Test the long-window and short-window expressions in Prometheus before promoting any generated rule.

  • Threshold Ledger shows the budget math behind each rule. Error threshold is the bad-event ratio both windows must exceed, Budget share estimates the percentage of the total budget spent if the burn holds at that multiple for the full long window, and Exhaustion time shows how long the full budget lasts at that burn multiple.
  • Routing Guidance maps severity labels to notification intent, annotation focus, and review action. Treat it as a checklist for your policy review, not as proof that Alertmanager routes are configured correctly.
  • Burn Threshold Ladder compares the tier thresholds against the 1x budget pace. Page tiers should sit clearly above the baseline, while ticket tiers should still create work early enough to protect the compliance period.
  • JSON gathers the configuration, rendered YAML, rule math, table rows, chart rows, errors, and warnings into one structured export for review automation.

Technical Details:

Burn-rate alerting starts with a normalized error ratio: bad events divided by eligible events over a lookback window. The ratio is decimal, so one percent errors is 0.01. The SLO target leaves an error budget ratio, and each burn multiple scales that budget into the threshold used by an alert tier.

The long lookback estimates budget impact. The short lookback confirms that the budget burn is still happening. Each generated rule uses a strict greater-than comparison on both windows, so the tier becomes active only when the long-window ratio and the short-window ratio are both above the same threshold.

Formula Core:

The main calculations depend on the SLO target, compliance duration, burn multiple, and long-window duration.

$$
B = \frac{100 - S}{100}, \qquad T = B \times R, \qquad P = \frac{R \times W \times 100}{C}, \qquad E = \frac{C}{R}
$$
SLO burn-rate formula symbols
Symbol | Meaning | Unit
S | SLO target entered as a percent. | percent
B | Error budget ratio. | ratio
R | Burn multiple for one tier. | x
T | Error-ratio threshold used in both alert windows. | ratio
W | Long-window duration converted to hours. | hours
C | Compliance period converted to hours. | hours
P | Share of the full error budget spent if the burn holds for the whole long window. | percent
E | Full-budget exhaustion time at the same burn multiple. | hours

With S = 99.9, C = 720 hours for 30 days, R = 14.4, and W = 1 hour, the budget ratio is 0.001, the threshold is 0.0144, the long-window budget share is 2.00%, and the full-budget exhaustion time is 50.0 hr.
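The arithmetic above can be reproduced with a short Python sketch. The names tier_math and TierMath are illustrative assumptions; the calculations follow the B, T, P, and E formulas in this section.

```python
from dataclasses import dataclass


@dataclass
class TierMath:
    budget_ratio: float          # B
    threshold: float             # T
    budget_share_percent: float  # P
    exhaustion_hours: float      # E


def tier_math(slo_target_percent: float, compliance_hours: float,
              burn_multiple: float, long_window_hours: float) -> TierMath:
    """Compute the budget ratio, threshold, budget share, and exhaustion time."""
    budget_ratio = (100.0 - slo_target_percent) / 100.0
    threshold = budget_ratio * burn_multiple
    share = burn_multiple * long_window_hours * 100.0 / compliance_hours
    exhaustion = compliance_hours / burn_multiple
    return TierMath(budget_ratio, threshold, share, exhaustion)


m = tier_math(99.9, 720, 14.4, 1)
print(f"B={m.budget_ratio:.3f} T={m.threshold:.4f} "
      f"P={m.budget_share_percent:.2f}% E={m.exhaustion_hours:.1f} hr")
```

Calling tier_math(99.9, 720, 3, 24) gives the slow-ticket numbers in the same way: a threshold of about 0.003, a 10 percent budget share, and 240 hours to exhaustion.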

Rule Core:

Each burn tier produces one alert rule. The rendered expression replaces the window placeholder twice, compares both ratios to the same threshold, and joins those comparisons with and.

SLO burn-rate alert rule conditions
Condition | Boundary | Meaning
Long-window error ratio | > T | The budget loss is large enough over the main lookback window.
Short-window error ratio | > T | The burn is still active in the recent lookback window.
For duration (for) | Positive Prometheus duration | The expression must remain active for this hold time before the alert fires.
keep_firing_for | Optional positive duration | When enabled, Prometheus can keep the alert firing briefly after the expression clears.
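As a rough illustration of how these conditions combine into one rule entry, the Python sketch below assembles rule text by hand. It is not the generator's actual output: the alert name, the two recording-rule names used as expressions, and the fixed severity label key are hypothetical, and real output should be validated with your usual rule checks.

```python
from typing import Optional


def render_rule(alert_name: str, long_expr: str, short_expr: str,
                threshold: float, for_duration: str, severity: str,
                keep_firing_for: Optional[str] = None) -> str:
    """Build one alert rule entry as YAML text, joining both windows with `and`."""
    lines = [
        f"- alert: {alert_name}",
        "  expr: |",
        f"    ({long_expr}) > {threshold}",
        "    and",
        f"    ({short_expr}) > {threshold}",
        f"  for: {for_duration}",
    ]
    if keep_firing_for:
        # Optional field; omitted entirely when not enabled.
        lines.append(f"  keep_firing_for: {keep_firing_for}")
    lines += ["  labels:", f"    severity: {severity}"]
    return "\n".join(lines)


rule = render_rule(
    alert_name="OrdersApiAvailabilityFastPage",  # hypothetical alert name
    long_expr="slo:error_ratio:rate1h",          # hypothetical recording rule
    short_expr="slo:error_ratio:rate5m",         # hypothetical recording rule
    threshold=0.0144,
    for_duration="2m",
    severity="page",
    keep_firing_for="5m",
)
print(rule)
```

The block scalar for expr avoids quoting problems when the rendered PromQL contains braces or double quotes.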

Default Tier Pattern:

Default SLO burn-rate tiers
Tier | Severity | Windows | Burn | Hold
fast-page | page | 1h and 5m | 14.4x | 2m
medium-page | page | 6h and 30m | 6x | 15m
slow-ticket | ticket | 24h and 2h | 3x | 30m
budget-ticket | ticket | 3d and 6h | 1x | 1h

The default rows follow a common multi-window pattern: high burn rates route to paging severities, while slower burn rates route to ticket-style follow-up. The 24h/2h and 3d/6h ticket tiers are useful for drift that does not justify an immediate incident but can still cause the SLO to be missed if it is left untreated.
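The default tiers can be written in the CSV row format the tool expects and parsed with a short Python sketch. The Tier dataclass and parse_tiers function are illustrative assumptions, not the tool's own parser.

```python
import csv
from dataclasses import dataclass
from io import StringIO

# Rows follow key,severity,long_window,short_window,burn_multiple,for_duration.
DEFAULT_TIERS = """\
fast-page,page,1h,5m,14.4,2m
medium-page,page,6h,30m,6,15m
slow-ticket,ticket,24h,2h,3,30m
budget-ticket,ticket,3d,6h,1,1h
"""


@dataclass
class Tier:
    key: str
    severity: str
    long_window: str
    short_window: str
    burn_multiple: float
    for_duration: str


def parse_tiers(text: str) -> list:
    """Parse CSV tier rows, skipping blank lines and rejecting malformed rows."""
    tiers = []
    for row in csv.reader(StringIO(text)):
        if not row:
            continue
        if len(row) != 6:
            raise ValueError(f"expected 6 fields, got {len(row)}: {row}")
        key, severity, long_w, short_w, burn, hold = (f.strip() for f in row)
        tiers.append(Tier(key, severity, long_w, short_w, float(burn), hold))
    return tiers


tiers = parse_tiers(DEFAULT_TIERS)
for tier in tiers:
    print(tier.key, tier.severity, tier.burn_multiple)
```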

Validation Boundaries:

SLO burn-rate alert input validation boundaries
Input | Accepted rule | Why it matters
SLO target | Greater than 0 and less than 100 percent. | Targets outside this range do not leave a meaningful error budget.
Compliance period | Greater than zero days. | The budget-share and exhaustion calculations need a positive period.
Error ratio query template | Must include {window}. | Both long-window and short-window expressions are rendered from the same template.
Durations | Positive compact units such as 30s, 5m, 1h, 3d, or 1w. | Rule intervals, windows, hold times, and optional keep-firing times must parse as Prometheus durations.
Tier windows | Short window must be shorter than long window. | The short lookback is the recency check; it cannot be the wider budget-impact check.
Label keys | Prometheus-style label names. | Routing labels must be valid before Alertmanager or a PrometheusRule controller can use them reliably.

Two non-blocking warnings deserve attention. A threshold at or above a 100 percent error ratio usually means the SLO target or burn multiple is unrealistic. A long-to-short window ratio below 6:1 or above 24:1 can behave differently from the common multi-window pattern, where roughly 12:1 balances recency, reset time, and noise control.
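The window checks and the two warnings can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it accepts only a single integer with one of the units s, m, h, d, or w, which is narrower than the full Prometheus duration syntax, and the function names are hypothetical.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def duration_seconds(text: str) -> int:
    """Parse a compact positive duration such as 30s, 5m, 1h, 3d, or 1w."""
    match = re.fullmatch(r"([1-9][0-9]*)([smhdw])", text.strip())
    if match is None:
        raise ValueError(f"not a positive compact duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def tier_warnings(long_window: str, short_window: str, threshold: float) -> list:
    """Raise on a blocking window error; return non-blocking warnings otherwise."""
    long_s = duration_seconds(long_window)
    short_s = duration_seconds(short_window)
    if short_s >= long_s:
        raise ValueError("short window must be shorter than long window")
    warnings = []
    if threshold >= 1.0:
        warnings.append("threshold is at or above a 100 percent error ratio")
    ratio = long_s / short_s
    if ratio < 6 or ratio > 24:
        warnings.append(f"window ratio {ratio:g}:1 is outside 6:1 to 24:1")
    return warnings


print(tier_warnings("1h", "5m", 0.0144))   # default fast-page tier, 12:1 ratio
print(tier_warnings("1h", "30m", 1.44))    # 2:1 ratio on a 90% SLO at 14.4x
```

The first call returns an empty list, while the second returns both warnings because the threshold exceeds 1.0 and the window ratio is 2:1.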

Accuracy Notes:

The generated YAML depends on the correctness of the error-ratio query and the monitoring system that evaluates it. The generator does not query live Prometheus data, inspect time series cardinality, validate Kubernetes admission rules, or apply anything to a cluster.

  • Run each rendered long-window and short-window expression in Prometheus before promoting the rule.
  • Validate the final rule file or PrometheusRule object with the checks your deployment path normally uses.
  • Confirm that severity labels, service labels, runbook links, and dashboard links match existing notification routes.
  • Review low-traffic services separately because a small number of failed requests can create a high burn rate with weak diagnostic signal.

Worked Examples:

The default orders-api availability SLO at 99.9% over 30 days produces four alert rules. Threshold Ledger shows fast-page at 14.4x, an Error threshold of 1.440%, a Budget share of 2.00%, and an Exhaustion time near 50.0 hr. That tier is intended for paging because the burn is fast enough to spend meaningful budget quickly.

A slower row such as slow-ticket,ticket,24h,2h,3,30m produces a 0.300% threshold for the same 99.9 percent SLO. The Budget share is 10.00%, and the Exhaustion time is about 10.0 days. That output is usually better suited to scheduled remediation than immediate incident response.

A 90 percent SLO paired with a 14.4x burn tier creates a threshold above 100 percent errors, because the 0.1 budget ratio multiplied by 14.4 gives 1.44. The YAML can still render, but the Generation notes warn that the tier is probably not useful. Lower the burn multiple, raise the SLO target, or remove that row before treating it as production monitoring code.

A row such as bad,page,5m,1h,6,15m blocks generation because the short window is longer than the long window. Swap the windows or choose a larger long-window value, then confirm that Alert YAML appears again.

FAQ:

Should the query return a percent or a ratio?

It should return a ratio. For one percent errors, the expression should evaluate near 0.01, not 1. The generated thresholds are also ratios.

Why do both windows use the same threshold?

The burn multiple defines one threshold for the tier. The long and short windows both need to exceed that threshold so the alert reflects budget impact and recent activity.

Why did I get a threshold over 100 percent?

That warning means the SLO target leaves such a large error budget, or the burn multiple is so high, that the tier threshold is at least 1.0. Review the target and tier before using the rule.

What does keep_firing_for change?

When enabled, keep_firing_for keeps an alert firing for the selected duration after the expression clears. Use it only if your Prometheus version and alert policy expect that behavior.

Does the generator check my live Prometheus data?

No. It generates rule text and review outputs from your inputs. Test the rendered expressions in Prometheus and validate the YAML before applying it.

Glossary:

SLI
Service level indicator, the measured ratio or quantity used to judge reliability.
SLO
Service level objective, the target reliability level for a service or user journey.
SLA
Service level agreement, a user-facing or contractual commitment that may use one or more SLOs as evidence.
Error budget
The allowed miss ratio left by the SLO target.
Burn rate
The speed at which errors spend the error budget compared with the planned pace.
Lookback window
The recent time range used to calculate an alerting error ratio.
Compliance period
The time window over which the SLO is judged, such as 30 or 90 days.
For duration
The time an alert condition must remain continuously active before the alert moves from pending to firing.

References: