SLO Burn Rate Alerts Generator
Generate Prometheus SLO burn-rate alert rules with multi-window tiers, budget math, routing labels, YAML formats, and review tables.
Introduction:
Reliability alerting works best when a page represents real user risk instead of every noisy metric spike. Service level objectives give teams a budget-based way to make that judgment. The service level indicator measures the behavior users care about, the objective sets the success target, and the remaining error budget shows how much failure is still allowed before the target is missed.
Burn rate turns that budget into a pace. A service burning at 1x is spending its error budget at exactly the planned speed for the compliance period. A service burning at 14.4x is spending the same budget 14.4 times faster. At that pace, one hour consumes about 2 percent of a 30-day budget, so a one-hour spike can matter even if the monthly SLO still looks healthy.
- SLI: The service level indicator, usually a success ratio, latency target, or bad-events ratio measured from real traffic or synthetic checks.
- SLO: The service level objective, such as 99.9 percent successful requests over a 30-day compliance period.
- SLA: The service level agreement, usually a user-facing or contractual promise that may depend on one or more SLOs.
- Error budget: The portion of requests, minutes, or events that may fail before the objective is missed.
- Burn rate: The speed at which current failures spend that budget compared with the planned pace.
- Lookback window: The recent time range used to judge whether the current burn rate should alert.
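As a minimal sketch of the burn-rate definition, the observed error ratio can be divided by the budget ratio that the SLO target leaves. The function name and sample values are illustrative assumptions, not the generator's own code.

```python
def burn_rate(error_ratio: float, slo_percent: float) -> float:
    """Return how many times faster than planned the error budget is being spent."""
    budget_ratio = 1 - slo_percent / 100  # a 99.9% target leaves a 0.001 budget
    if budget_ratio <= 0:
        raise ValueError("SLO target must be below 100 percent")
    return error_ratio / budget_ratio


# 1.44% errors against a 99.9% objective burn the budget roughly 14.4 times too fast.
print(round(burn_rate(0.0144, 99.9), 2))
# Errors exactly at the budget ratio correspond to the planned 1x pace.
print(round(burn_rate(0.001, 99.9), 2))
```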
Multi-window burn-rate alerting handles two common failures in SLO paging. A long lookback window proves that enough budget is at stake, but it can stay above threshold long after the incident is over. A short lookback window proves that the burn is current, but it can react to a burst that is too small to threaten the objective. Requiring both windows to cross the tier threshold gives the alert a budget test and a recency test.
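The two-window test can be sketched in a few lines. The helper name and the sample ratios below are hypothetical; the point is only that a tier is active when both windows exceed the same threshold.

```python
def tier_is_active(long_ratio: float, short_ratio: float, threshold: float) -> bool:
    """Return True only when both lookback windows exceed the tier threshold."""
    return long_ratio > threshold and short_ratio > threshold


threshold = 0.0144  # 14.4x burn on a 99.9% SLO

# Incident already over: the long window is still elevated, the short window recovered.
print(tier_is_active(long_ratio=0.02, short_ratio=0.0005, threshold=threshold))
# Brief burst: the short window spikes, but too little budget is at stake.
print(tier_is_active(long_ratio=0.002, short_ratio=0.05, threshold=threshold))
# Sustained and current burn: both tests pass, so the tier is active.
print(tier_is_active(long_ratio=0.02, short_ratio=0.03, threshold=threshold))
```

The first two cases show the failure modes each single window would have on its own: a stale alert and a noisy alert.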
Tier selection is partly arithmetic and partly incident policy. Fast page tiers should catch budget loss that needs immediate response. Slower ticket tiers should catch sustained drift before the reporting period is lost. A low-traffic service needs separate review because one failed request can create a dramatic ratio without enough samples to diagnose a real outage.
The error-ratio expression is the foundation. It should divide bad events by eligible events for the same service, user journey, and compliance period used by the SLO. A clean rule built on the wrong denominator can page the wrong team or miss the reliability signal the objective was meant to protect.
How to Use This Tool:
Start with the service objective and a PromQL expression that already measures bad events divided by eligible events. Then review the generated YAML and the budget ledger before using the rules in monitoring.
- Enter Service name and SLO label with stable values such as orders-api and availability. Those values shape the generated alert names, labels, annotations, and downloaded filenames.
- Set SLO target and Compliance period. The summary should show the remaining error budget and the highest configured burn tier.
- Choose Rule format. Pick Prometheus rule file for plain rule YAML, PrometheusRule CRD for Prometheus Operator, or Kubernetes ConfigMap rule file when a ConfigMap mounts the rules.
- Paste an Error ratio query template that returns a decimal ratio, not a percent. Keep {window} in the template so every tier can render both lookback windows; {service} and {slo} are optional replacements.
- Edit Burn-rate tiers as CSV rows with key, severity, long window, short window, burn multiple, and for duration. Load SRE defaults restores the starter page and ticket rows.
- Use Advanced for a custom rule-group name, evaluation interval, severity label key, extra routing labels, runbook link, dashboard link, or optional keep_firing_for. Extra labels should match the team and environment keys your alert routing already understands.
- Resolve every item in Review SLO alert inputs. If Generation notes remain, read them before copying Alert YAML or exporting the ledger, routing guidance, chart data, or JSON report.
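The placeholder substitution described in the steps above might look like the following sketch. The metric name http_requests_total, its labels, and the render_query helper are assumptions for illustration; the generator's actual rendering code is not shown in this page.

```python
# Hypothetical template: bad events divided by eligible events, as a decimal ratio.
TEMPLATE = (
    'sum(rate(http_requests_total{service="{service}",code=~"5.."}[{window}]))'
    ' / sum(rate(http_requests_total{service="{service}"}[{window}]))'
)


def render_query(template: str, window: str, service: str = "", slo: str = "") -> str:
    """Fill the {window}, {service}, and {slo} placeholders for one lookback window."""
    if "{window}" not in template:
        raise ValueError("template must include {window}")
    # Plain replace() is used instead of str.format() because PromQL
    # label matchers already contain literal braces.
    return (
        template.replace("{window}", window)
        .replace("{service}", service)
        .replace("{slo}", slo)
    )


print(render_query(TEMPLATE, "1h", service="orders-api"))
print(render_query(TEMPLATE, "5m", service="orders-api"))
```

Rendering the same template once per window keeps the long and short expressions identical apart from the range selector, which is what makes the two ratios comparable.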
Interpreting Results:
Alert YAML is ready to inspect only after validation errors are gone. Passing validation means the inputs have usable shape, not that the monitoring expression measures the right traffic. Test the long-window and short-window expressions in Prometheus before promoting any generated rule.
- Threshold Ledger shows the budget math behind each rule. Error threshold is the bad-event ratio both windows must exceed, Budget share estimates how much budget the long window represents, and Exhaustion time shows how long the full budget lasts at that burn multiple.
- Routing Guidance maps severity labels to notification intent, annotation focus, and review action. Treat it as a checklist for your policy review, not as proof that Alertmanager routes are configured correctly.
- Burn Threshold Ladder compares the tier thresholds against the 1x budget pace. Page tiers should sit clearly above the baseline, while ticket tiers should still create work early enough to protect the compliance period.
- JSON gathers the configuration, rendered YAML, rule math, table rows, chart rows, errors, and warnings into one structured export for review automation.
Technical Details:
Burn-rate alerting starts with a normalized error ratio: bad events divided by eligible events over a lookback window. The ratio is decimal, so one percent errors is 0.01. The SLO target leaves an error budget ratio, and each burn multiple scales that budget into the threshold used by an alert tier.
The long lookback estimates budget impact. The short lookback confirms that the budget burn is still happening. Each generated rule uses a strict greater-than comparison on both windows, so the tier becomes active only when the long-window ratio and the short-window ratio are both above the same threshold.
Formula Core:
The main calculations depend on the SLO target, compliance duration, burn multiple, and long-window duration.
| Symbol | Meaning | Unit |
|---|---|---|
| S | SLO target entered as a percent. | percent |
| B | Error budget ratio. | ratio |
| R | Burn multiple for one tier. | x |
| T | Error-ratio threshold used in both alert windows. | ratio |
| W | Long-window duration converted to hours. | hours |
| C | Compliance period converted to hours. | hours |
| E | Full-budget exhaustion time at the same burn multiple. | hours |
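The relationships among these symbols are not written out here, so the following is a reconstruction inferred from the symbol definitions and the worked values in this section. Treat it as a sketch; the generator's exact rounding is not specified.

```latex
\begin{aligned}
B &= 1 - \frac{S}{100} \\
T &= R \cdot B \\
\text{Budget share} &= \frac{R \cdot W}{C} \\
E &= \frac{C}{R}
\end{aligned}
```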
With S = 99.9, C = 720 hours for 30 days, R = 14.4, and W = 1 hour, the budget ratio is 0.001, the threshold is 0.0144, the long-window budget share is 2.00%, and the full-budget exhaustion time is 50.0 hr.
Rule Core:
Each burn tier produces one alert rule. The rendered expression replaces the window placeholder twice, compares both ratios to the same threshold, and joins those comparisons with and.
| Condition | Boundary | Meaning |
|---|---|---|
| Long-window error ratio | > T | The budget loss is large enough over the main lookback window. |
| Short-window error ratio | > T | The burn is still active in the recent lookback window. |
| For duration (for) | Positive Prometheus duration | The expression must remain active for this hold time before the alert fires. |
| keep_firing_for | Optional positive duration | When enabled, Prometheus can keep the alert firing briefly after the expression clears. |
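To make the rule shape concrete, here is a hedged sketch that assembles one tier into a rule mapping. The field names alert, expr, for, labels, and keep_firing_for are standard Prometheus alerting-rule fields; the build_rule helper, the alert name, and the recorded series name slo:error_ratio:rate{window} are hypothetical, and the generator's real output may be laid out differently.

```python
import json


def build_rule(alert, template, long_w, short_w, threshold, hold, severity,
               keep_firing_for=None):
    """Assemble one burn-rate tier as a Prometheus-style alerting rule mapping."""
    long_expr = template.replace("{window}", long_w)
    short_expr = template.replace("{window}", short_w)
    rule = {
        "alert": alert,
        # Both windows are compared with the same threshold and joined with `and`.
        "expr": f"({long_expr}) > {threshold:g} and ({short_expr}) > {threshold:g}",
        "for": hold,
        "labels": {"severity": severity},
    }
    if keep_firing_for:  # optional, only emitted when set
        rule["keep_firing_for"] = keep_firing_for
    return rule


rule = build_rule(
    alert="OrdersApiAvailabilityFastPage",   # hypothetical alert name
    template="slo:error_ratio:rate{window}",  # hypothetical recorded series
    long_w="1h", short_w="5m", threshold=0.0144, hold="2m", severity="page",
)
print(json.dumps(rule, indent=2))
```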
Default Tier Pattern:
| Tier | Severity | Windows | Burn | Hold |
|---|---|---|---|---|
| fast-page | page | 1h and 5m | 14.4x | 2m |
| medium-page | page | 6h and 30m | 6x | 15m |
| slow-ticket | ticket | 24h and 2h | 3x | 30m |
| budget-ticket | ticket | 3d and 6h | 1x | 1h |
The default rows follow a common multi-window pattern: high burn rates route to paging severities, while slower burn rates route to ticket-style follow-up. The 24h/2h and 3d/6h ticket tiers are useful for drift that does not justify an immediate incident but can still miss the SLO if it remains untreated.
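The default rows can be written in the CSV layout the tool accepts (key, severity, long window, short window, burn multiple, for duration). The parser below is an illustrative sketch; the Tier class and parse_tiers function are assumed names, not the generator's internals.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Tier:
    key: str
    severity: str
    long_window: str
    short_window: str
    burn: float
    hold: str


# Default tier rows from the table above, one CSV row per tier.
DEFAULT_ROWS = "\n".join([
    "fast-page,page,1h,5m,14.4,2m",
    "medium-page,page,6h,30m,6,15m",
    "slow-ticket,ticket,24h,2h,3,30m",
    "budget-ticket,ticket,3d,6h,1,1h",
])


def parse_tiers(text: str) -> list:
    """Parse CSV tier rows into Tier objects, skipping blank lines."""
    tiers = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        if len(row) != 6:
            raise ValueError(f"expected 6 fields, got {len(row)}: {row}")
        key, severity, long_w, short_w, burn, hold = (cell.strip() for cell in row)
        tiers.append(Tier(key, severity, long_w, short_w, float(burn), hold))
    return tiers


for tier in parse_tiers(DEFAULT_ROWS):
    print(tier)
```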
Validation Boundaries:
| Input | Accepted rule | Why it matters |
|---|---|---|
| SLO target | Greater than 0 and less than 100 percent. | Targets outside this range do not leave a meaningful error budget. |
| Compliance period | Greater than zero days. | The budget-share and exhaustion calculations need a positive period. |
| Error ratio query template | Must include {window}. | Both long-window and short-window expressions are rendered from the same template. |
| Durations | Positive compact units such as 30s, 5m, 1h, 3d, or 1w. | Rule intervals, windows, hold times, and optional keep-firing times must parse as Prometheus durations. |
| Tier windows | Short window must be shorter than long window. | The short lookback is the recency check; it cannot be the wider budget-impact check. |
| Label keys | Prometheus-style label names. | Routing labels must be valid before Alertmanager or a PrometheusRule controller can use them reliably. |
Two non-blocking warnings deserve attention. A threshold at or above a 100 percent error ratio usually means the SLO target or burn multiple is unrealistic. A long-to-short window ratio below 6:1 or above 24:1 can behave differently from the common multi-window pattern, where roughly 12:1 balances recency, reset time, and noise control.
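The window checks can be approximated as follows. This sketch handles only single-unit compact durations such as 30s or 3d, which is an assumption; Prometheus itself also accepts other forms, and the generator's own validator may differ. The function names are hypothetical.

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}
DURATION_RE = re.compile(r"^(\d+)([smhdw])$")


def duration_seconds(text: str) -> int:
    """Convert a compact single-unit duration such as '5m' or '3d' to seconds."""
    match = DURATION_RE.match(text.strip())
    if not match or int(match.group(1)) == 0:
        raise ValueError(f"not a positive compact duration: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def check_windows(long_w: str, short_w: str) -> list:
    """Raise on a blocking error; return non-blocking warnings as strings."""
    long_s, short_s = duration_seconds(long_w), duration_seconds(short_w)
    if short_s >= long_s:
        raise ValueError("short window must be shorter than long window")
    warnings = []
    ratio = long_s / short_s
    if ratio < 6 or ratio > 24:
        warnings.append(f"long-to-short ratio {ratio:g}:1 is outside 6:1 to 24:1")
    return warnings


print(check_windows("1h", "5m"))   # 12:1, inside the common range
print(check_windows("1h", "30m"))  # 2:1, valid but flagged
```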
Accuracy Notes:
The generated YAML depends on the correctness of the error-ratio query and the monitoring system that evaluates it. The generator does not query live Prometheus data, inspect time series cardinality, validate Kubernetes admission rules, or apply anything to a cluster.
- Run each rendered long-window and short-window expression in Prometheus before promoting the rule.
- Validate the final rule file or PrometheusRule object with the checks your deployment path normally uses.
- Confirm that severity labels, service labels, runbook links, and dashboard links match existing notification routes.
- Review low-traffic services separately because a small number of failed requests can create a high burn rate with weak diagnostic signal.
Worked Examples:
The default orders-api availability SLO at 99.9% over 30 days produces four alert rules. Threshold Ledger shows fast-page at 14.4x, an Error threshold of 1.440%, a Budget share of 2.00%, and an Exhaustion time near 50.0 hr. That tier is intended for paging because the burn is fast enough to spend meaningful budget quickly.
A slower row such as slow-ticket,ticket,24h,2h,3,30m produces a 0.300% threshold for the same 99.9 percent SLO. The Budget share is 10.00%, and the Exhaustion time is about 10.0 days. That output is usually better suited to scheduled remediation than immediate incident response.
A 90 percent SLO paired with a 14.4x burn tier creates a threshold above 100 percent errors. The YAML can still render, but Generation notes warns that the tier is probably not useful. Lower the burn multiple, raise the SLO target, or remove that row before treating it as production monitoring code.
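The figures in these examples can be rechecked with a short script. The formulas are inferred from the values quoted in this section (threshold as burn multiple times budget ratio, budget share as burn multiple times long window over compliance period, exhaustion as compliance period over burn multiple); tier_math is a hypothetical helper name.

```python
def tier_math(slo_percent: float, period_days: float, burn: float,
              long_window_hours: float) -> dict:
    """Compute threshold, long-window budget share, and exhaustion time for a tier."""
    period_hours = period_days * 24
    budget = 1 - slo_percent / 100
    return {
        "threshold": burn * budget,
        "budget_share": burn * long_window_hours / period_hours,
        "exhaustion_hours": period_hours / burn,
    }


fast = tier_math(99.9, 30, 14.4, 1)   # fast-page row
slow = tier_math(99.9, 30, 3, 24)     # slow-ticket row
loose = tier_math(90, 30, 14.4, 1)    # 90 percent SLO with a 14.4x tier

for name, row in [("fast-page", fast), ("slow-ticket", slow)]:
    print(
        f"{name}: threshold {row['threshold']:.3%}, "
        f"share {row['budget_share']:.2%}, "
        f"exhaustion {row['exhaustion_hours']:.1f} hr"
    )

# A threshold at or above 1.0 means the tier could only fire above 100 percent errors.
print("unrealistic tier:", loose["threshold"] >= 1.0)
```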
A row such as bad,page,5m,1h,6,15m blocks generation because the short window is longer than the long window. Swap the windows or choose a larger long-window value, then confirm that Alert YAML appears again.
FAQ:
Should the query return a percent or a ratio?
It should return a ratio. For one percent errors, the expression should evaluate near 0.01, not 1. The generated thresholds are also ratios.
Why do both windows use the same threshold?
The burn multiple defines one threshold for the tier. The long and short windows both need to exceed that threshold so the alert reflects budget impact and recent activity.
Why did I get a threshold over 100 percent?
That warning means the SLO target leaves such a large error budget, or the burn multiple is so high, that the tier threshold is at least 1.0. Review the target and tier before using the rule.
What does keep_firing_for change?
When enabled, keep_firing_for keeps an alert firing for the selected duration after the expression clears. Use it only if your Prometheus version and alert policy expect that behavior.
Does the generator check my live Prometheus data?
No. It generates rule text and review outputs from your inputs. Test the rendered expressions in Prometheus and validate the YAML before applying it.
Glossary:
- SLI: Service level indicator, the measured ratio or quantity used to judge reliability.
- SLO: Service level objective, the target reliability level for a service or user journey.
- SLA: Service level agreement, a user-facing or contractual commitment that may use one or more SLOs as evidence.
- Error budget: The allowed miss ratio left by the SLO target.
- Burn rate: The speed at which errors spend the error budget compared with the planned pace.
- Lookback window: The recent time range used to calculate an alerting error ratio.
- Compliance period: The time window over which the SLO is judged, such as 30 or 90 days.
- For duration: The time an alert condition must remain active before it becomes firing.
References:
- Alerting on SLOs, Google Site Reliability Engineering Workbook.
- Alerting on your burn rate, Google Cloud Observability, last updated 2026-06-02 UTC.
- Alerting rules, Prometheus documentation.
- API reference, Prometheus Operator documentation.