SLO Burn Rate Alerts Generator
Generate Prometheus SLO burn-rate alert YAML from service, objective, query, and tier inputs, with thresholds, routing labels, and review outputs.
Introduction:
Service level objective burn-rate alerts warn when a service is spending its error budget faster than the reliability target allows. For a 99.9% availability SLO, the monthly error budget is 0.1% of eligible requests or events. A burn rate of 1x spends that budget evenly across the full compliance period, while 14.4x spends it much faster and usually deserves immediate attention.
Burn-rate alerting matters because ordinary error-rate alarms can be noisy during small blips and too slow during sustained damage. A short outage that hurts many users should page quickly. A quieter degradation that drains budget over one to three days should still create a ticket before the service owner discovers the miss during an SLO review.
Multi-window burn-rate alerts reduce false confidence by checking a long window and a short window at the same time. The long window confirms the issue is large enough to matter to the budget. The short window confirms the issue is still happening recently enough to act on. That pairing helps separate an old incident tail from an active service problem.
An SLO alert rule is still only as good as its service-level indicator. The error-ratio query should count bad events divided by eligible events for the same user journey, service, and compliance period used in the SLO. A generated rule cannot prove that the metric is correct, that the routing policy is safe, or that the alert volume is acceptable for the on-call team.
Technical Details:
An SLO target defines the allowed miss ratio. Subtracting the target from 100% gives the error budget ratio. Burn rate multiplies that budget ratio to produce the error-ratio threshold used by the alert expression. For a 99.9% SLO, the error budget ratio is 0.001. A 14.4x burn tier therefore fires above an error ratio of 0.0144, displayed as 1.440%.
The alert expression uses the same threshold for both windows in a tier. The long-window query and short-window query are joined with `and`, so both windows must be above the threshold before the alert can become active. The `for` duration then controls how long the condition must remain active before the alert is treated as firing.
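The joined expression can be sketched by substituting each tier's windows into the query template. The query text and function name below are illustrative assumptions, not the tool's actual internals:

```python
# Hypothetical error-ratio query template; any template with a {window}
# placeholder that returns a bad/eligible ratio behaves the same way.
QUERY = ('sum(rate(http_requests_total{code=~"5.."}[{window}])) '
         '/ sum(rate(http_requests_total[{window}]))')

def burn_rate_expr(query: str, threshold: float, long_w: str, short_w: str) -> str:
    """Join the long- and short-window checks so both must exceed the threshold."""
    long_q = query.replace("{window}", long_w)
    short_q = query.replace("{window}", short_w)
    return f"({long_q}) > {threshold} and ({short_q}) > {threshold}"

# Fast-page tier on a 99.9% SLO: threshold 0.0144, windows 1h and 5m.
expr = burn_rate_expr(QUERY, 0.0144, "1h", "5m")
```

Because both window checks share one threshold, the short window only changes *when* the alert can fire, never *how much* error it takes.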
Formula Core
The calculation starts with the SLO target and compliance period, then applies each tier's burn multiple and long window.
| Symbol | Meaning | Unit |
|---|---|---|
| S | SLO target entered as a percent, such as 99.9. | percent |
| B | Error budget ratio left by the target. | ratio |
| R | Burn multiple from the tier row. | multiple |
| T | Error-ratio threshold used in both PromQL window checks. | ratio |
| W | Long-window length converted to hours. | hours |
| C | Compliance period converted to hours. | hours |
| P | Share of the full error budget represented by the long window at that burn rate. | percent |
| E | Time to drain the full error budget at the tier's burn multiple. | hours |
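Under these definitions, B = (100 − S)/100, T = B·R, P = 100·W·R/C, and E = C/R. A short sketch recomputes the quantities for the default 30-day, 99.9% setup:

```python
# Recompute the Formula Core quantities for the default tiers.
S = 99.9                 # SLO target, percent
B = (100 - S) / 100      # error budget ratio, about 0.001
C = 30 * 24              # 30-day compliance period in hours

def tier_math(R: float, W: float):
    T = B * R            # error-ratio threshold used in both window checks
    P = 100 * W * R / C  # budget share of the long window, percent
    E = C / R            # hours until the full budget is exhausted
    return T, P, E

T, P, E = tier_math(R=14.4, W=1)   # fast-page: 14.4x burn, 1h long window
# T ≈ 0.0144 (1.440%), P = 2.0%, E = 50 hours ≈ 2.1 days

T2, P2, E2 = tier_math(R=1, W=72)  # budget-ticket: 1x burn, 3d long window
# T2 ≈ 0.001 (0.100%), P2 = 10.0%, E2 = 720 hours = 30 days
```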
The default 30-day setup uses four tier rows. The first two are page-like severities, and the last two are ticket-like severities. The 24-hour, 3x tier and the 3-day, 1x tier both represent 10% of a 30-day budget, but they describe different operational cadences.
| Tier key | Severity | Windows | Burn | Threshold for 99.9% SLO | Budget share | Budget exhaustion |
|---|---|---|---|---|---|---|
| fast-page | page | 1h and 5m | 14.4x | 1.440% | 2.00% | 2.1 days |
| medium-page | page | 6h and 30m | 6x | 0.600% | 5.00% | 5.0 days |
| slow-ticket | ticket | 24h and 2h | 3x | 0.300% | 10.00% | 10.0 days |
| budget-ticket | ticket | 3d and 6h | 1x | 0.100% | 10.00% | 30.0 days |
Validation protects the generated rule shape. The SLO target must be greater than 0 and less than 100. The compliance period must be positive. The error-ratio query template must include `{window}`. Duration fields accept Prometheus-style units such as `5m`, `1h`, `3d`, and `1w`. Each tier needs a short window shorter than its long window and a burn multiple greater than zero.
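These checks can be sketched as follows; the duration grammar and function names are assumptions about the described validation, not the tool's exact code:

```python
import re

DURATION = re.compile(r"^(\d+)(ms|s|m|h|d|w|y)$")  # Prometheus-style durations
UNIT_HOURS = {"ms": 1 / 3.6e6, "s": 1 / 3600, "m": 1 / 60,
              "h": 1.0, "d": 24.0, "w": 168.0, "y": 8760.0}

def hours(duration: str) -> float:
    m = DURATION.match(duration)
    if not m:
        raise ValueError(f"invalid duration: {duration}")
    return int(m.group(1)) * UNIT_HOURS[m.group(2)]

def check_tier(query: str, long_w: str, short_w: str, burn: float) -> list:
    """Return the list of review issues for one tier row (empty means OK)."""
    issues = []
    if "{window}" not in query:
        issues.append("query template must include {window}")
    if burn <= 0:
        issues.append("burn multiple must be greater than zero")
    try:
        if hours(short_w) >= hours(long_w):
            issues.append("short window must be shorter than the long window")
    except ValueError as err:
        issues.append(str(err))
    return issues
```

With this shape, a well-formed default tier returns no issues, while `1hour` or a swapped window pair produces a named problem instead of silently generating a bad rule.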
| Generated field | How it is formed | Review meaning |
|---|---|---|
| `alert` | Pascal-cased service, SLO label, tier key, and BurnRate. | Stable alert names help routing, dashboards, and silence matching. |
| `expr` | Long-window query above threshold and short-window query above threshold. | Both windows must agree before the rule can fire. |
| `for` | Comes from the tier row's final duration column. | Controls how long the condition must stay active before firing. |
| `labels` | Combines extra labels with severity, service, SLO, tier, windows, and burn multiple. | Routing and grouping should match the notification policy. |
| `annotations` | Adds summary and description, plus runbook and dashboard links when entered. | Responders get context without editing each generated rule by hand. |
| `keep_firing_for` | Added only when the switch is enabled and the duration is valid. | Can reduce clear and re-fire churn after the expression drops below threshold. |
The output can use a plain Prometheus rule group or a PrometheusRule custom resource. Both forms contain the same alert rules. The custom resource adds Kubernetes metadata and a `spec.groups` wrapper for operator-managed Prometheus deployments.
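The relationship between the two shapes can be sketched with plain dictionaries; the names and metadata values here are illustrative, and the rule content is abbreviated:

```python
# One rule group, shown in both output shapes. A real group would hold
# every generated tier rule, not just one.
rule_group = {
    "name": "orders-api-availability-burn-rate",
    "interval": "30s",
    "rules": [{
        "alert": "OrdersApiAvailabilityFastPageBurnRate",
        "expr": "...",                    # two-window burn-rate expression
        "for": "2m",
        "labels": {"severity": "page"},
    }],
}

# Plain Prometheus rule file: a top-level "groups" list.
plain_file = {"groups": [rule_group]}

# PrometheusRule custom resource: the same group under spec.groups,
# plus Kubernetes metadata for operator-managed deployments.
prometheus_rule = {
    "apiVersion": "monitoring.coreos.com/v1",
    "kind": "PrometheusRule",
    "metadata": {"name": "orders-api-slo-burn-rate"},
    "spec": {"groups": [rule_group]},
}
```

Only the wrapper differs, so switching formats should never change thresholds, windows, or labels inside the rules themselves.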
Everyday Use & Decision Guide:
Start with the service and SLO label that responders already know, such as orders-api and availability. Use the same SLO target and compliance period as the SLO report. If the report is a 30-day rolling availability SLO, leave the period at 30 days; if the review policy uses 28 or 90 days, change it before trusting the threshold ledger.
The error-ratio query template is the most important input. It should return a bad-events divided by eligible-events ratio for the requested lookback window. Do not enter a percentage expression. A query that returns 0.0144 for 1.44% matches the generated threshold for the default fast page tier on a 99.9% SLO.
- Use **Prometheus rule file** when the result will be placed under the server's rule files.
- Use **PrometheusRule CRD** when the deployment is managed by the Prometheus Operator or a compatible platform.
- Keep the default tier rows for a first pass, then adjust severities and windows only to match local paging policy.
- Use **Extra labels** for stable routing values such as team and environment.
- Add **Runbook URL** and **Dashboard URL** when responders should land on a specific triage path.
- Enable **Keep firing for** only when your Prometheus version and alert policy support that field.
Stop when **Review SLO alert inputs** appears. The result should not be copied until the missing placeholder, bad duration, invalid label key, or tier-row problem is fixed. Generation notes are different: they warn about unusual thresholds or long-to-short window ratios while still allowing the YAML to be generated.
A green result does not mean the alert should page immediately in production. Compare **Threshold Ledger** against historical error ratios, read **Routing Guidance** for page versus ticket behavior, and test the generated PromQL against real time series before handing it to Alertmanager.
Step-by-Step Guide:
Build the alert set from the SLO definition first, then review the generated rule fields and routing impact.
- Enter **Service name** and **SLO label**. The summary and generated alert names should reflect the same service and objective used in dashboards and runbooks.
- Set **SLO target** and **Compliance period**. The summary line should show the derived error budget, such as a 99.9% SLO leaving 0.100% budget.
- Choose **Rule format**. The **Alert YAML** tab should show either a `groups` document or a PrometheusRule resource with `apiVersion: monitoring.coreos.com/v1`.
- Enter the **Error ratio query template**. Keep `{window}` in the expression so each tier can build its long-window and short-window PromQL checks.
- Review **Burn-rate tiers**. Each row needs `key`, `severity`, `long_window`, `short_window`, `burn_multiple`, and `for_duration`; use **Load SRE defaults** to restore the default four-tier set.
- Open **Advanced** for group name, evaluation interval, severity label key, extra labels, runbook URL, dashboard URL, and optional keep-firing duration.
- If **Review SLO alert inputs** appears, fix the listed issue before using the result. Common fixes include adding `{window}`, changing `1hour` to `1h`, or making the short window shorter than the long window.
- Read **Threshold Ledger** for exact alert names, thresholds, budget share, and exhaustion time. Then read **Routing Guidance** to confirm each severity maps to the intended response path.
- Use **Burn Threshold Ladder** to compare burn multiples visually and **JSON** when another review or deployment workflow needs the same generated values in structured form.
Interpreting Results:
The first value to check is the error threshold in **Threshold Ledger**. It is an error ratio, displayed as a percent for readability. For a 99.9% SLO, fast-page at 14.4x shows 1.440%, while budget-ticket at 1x shows 0.100%.
- **Alert YAML** is the deployable rule text, but it still needs Prometheus syntax validation and a real-data query check.
- **Budget share** estimates how much of the compliance-period error budget the long window represents at that burn multiple.
- **Exhaustion time** shows how long the full error budget would last if the burn rate continued.
- **Routing Guidance** treats `page`, `critical`, and `sev1` as immediate-response severities. Other severities are routed as ticket or team-channel review.
- **Generation notes** deserve review when a threshold exceeds a 100% error ratio or the long-to-short window ratio falls outside the expected range.
The false-confidence risk is treating a generated rule as proof of SLO coverage. Confirm that the query measures the right SLI, that both windows return data, that labels match Alertmanager routing, and that the default page tiers do not create alert fatigue for known background errors.
Worked Examples:
Default 99.9% availability SLO
An orders-api availability SLO uses a 30-day compliance period and the default query template. The fast-page tier has 1h and 5m windows at 14.4x. Threshold Ledger should show `OrdersApiAvailabilityFastPageBurnRate`, an error threshold of 1.440%, budget share of 2.00%, and exhaustion time near 2.1 days. That is a page-level signal because active burn at that pace can damage the objective quickly.
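The alert name in this example follows the Pascal-casing rule from the generated-field table. A sketch of that construction, where the exact token-splitting rule is an assumption:

```python
import re

def pascal(text: str) -> str:
    """Capitalize each run of letters and digits: 'orders-api' -> 'OrdersApi'."""
    return "".join(p.capitalize() for p in re.split(r"[^A-Za-z0-9]+", text) if p)

def alert_name(service: str, slo_label: str, tier_key: str) -> str:
    # Service, SLO label, and tier key are concatenated, then "BurnRate".
    return pascal(service) + pascal(slo_label) + pascal(tier_key) + "BurnRate"

name = alert_name("orders-api", "availability", "fast-page")
# -> "OrdersApiAvailabilityFastPageBurnRate"
```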
Ticket path for slower budget drain
The same service keeps the row `budget-ticket,ticket,3d,6h,1,1h`. On a 99.9% SLO, Threshold Ledger shows a 0.100% threshold, 10.00% budget share, and 30.0 days of exhaustion time. Routing Guidance maps the ticket severity to next-working-day review rather than immediate paging, which fits a sustained burn that needs ownership without waking the on-call for a fast incident.
PrometheusRule for operator-managed monitoring
A Kubernetes platform team changes **Rule format** to **PrometheusRule CRD**, leaves **Evaluation interval** at `30s`, and adds `team=payments` plus `env=prod` as extra labels. Alert YAML should start with `apiVersion: monitoring.coreos.com/v1` and include those labels on every generated rule. Routing Guidance should show the same team and environment values in the label set used for notification review.
Troubleshooting a tier-row mistake
If a row is entered as `fast-page,page,5m,1h,14.4,2m`, the short window is longer than the long window. Review SLO alert inputs reports that the short window must be shorter than the long window. Changing the row back to `fast-page,page,1h,5m,14.4,2m` clears the issue and restores the 1-hour plus 5-minute page tier.
FAQ:
Should the query return a ratio or a percent?
Use a ratio. A value of 0.0144 means 1.44%. A query that already multiplies by 100 will sit roughly one hundred times above the generated thresholds, so the alerts will fire almost constantly even when the service is healthy.
Why does each tier need two windows?
The generated expression checks the long and short windows with and. The long window catches meaningful budget spend, and the short window confirms the burn is recent enough for the selected response.
What does the 12:1 window note mean?
The generator warns when a tier's long window divided by short window is below 6 or above 24. The default tiers use a 12:1 ratio, such as 1h with 5m.
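The trigger condition for that note can be sketched directly; the 6 and 24 bounds come from the answer above, and the function name is illustrative:

```python
def window_ratio_note(long_hours: float, short_hours: float):
    """Return a warning string when long/short falls outside the 6..24 range."""
    ratio = long_hours / short_hours
    if ratio < 6 or ratio > 24:
        return f"unusual long-to-short window ratio {ratio:.1f}:1"
    return None

# Default fast-page tier: 1h long, 5m short -> 12:1, no note.
note_default = window_ratio_note(1, 5 / 60)
# Narrow tier: 1h long, 30m short -> 2:1, note fires.
note_narrow = window_ratio_note(1, 0.5)
```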
Can I add my own labels?
Yes. Use **Extra labels** with comma- or newline-separated `key=value` entries. Label keys must follow Prometheus label naming rules, and generated rule labels take precedence over duplicate keys.
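A sketch of that parsing and precedence behavior; the function is illustrative, while the `[a-zA-Z_][a-zA-Z0-9_]*` pattern is the actual Prometheus label-name rule:

```python
import re

LABEL_NAME = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")  # Prometheus label-name rule

def parse_extra_labels(text: str) -> dict:
    """Parse comma- or newline-separated key=value entries into a label dict."""
    labels = {}
    for entry in re.split(r"[,\n]", text):
        entry = entry.strip()
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        key, value = key.strip(), value.strip()
        if not sep or not LABEL_NAME.match(key):
            raise ValueError(f"invalid label entry: {entry!r}")
        labels[key] = value
    return labels

extra = parse_extra_labels("team=payments, env=prod")
# Generated rule labels win on duplicate keys:
merged = {**extra, "severity": "page", "service": "orders-api"}
```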
What should I do when input review blocks generation?
Read the **Review SLO alert inputs** list and fix the named field. Common causes are a missing `{window}` placeholder, invalid duration text, an invalid severity label key, or a tier row with fewer than six columns.
Does the page validate the PromQL against my metrics?
No. It builds the rule text and derived review tables from the entered values. Run the PromQL in Prometheus, confirm the series labels, and use your normal rule validation before deploying.
Glossary:
- SLI
- Service level indicator, the measured ratio or metric used to judge a user-visible reliability promise.
- SLO
- Service level objective, the target value the service aims to meet over a compliance period.
- Error budget
- The allowed miss ratio left by the SLO target, such as 0.1% for a 99.9% objective.
- Burn rate
- The speed of error-budget consumption relative to spending the budget evenly across the full period.
- PromQL
- Prometheus Query Language, used here to express the long-window and short-window error-ratio checks.
- PrometheusRule
- A Kubernetes custom resource shape used by operator-managed Prometheus deployments to load alerting and recording rules.
References:
- Alerting on SLOs, Google SRE Workbook.
- Service Level Objectives, Google SRE Book.
- Alerting rules, Prometheus Authors.
- PrometheusRule [monitoring.coreos.com/v1], Red Hat Documentation.