Introduction:

Service level objective burn-rate alerts warn when a service is spending its error budget faster than the reliability target allows. For a 99.9% availability SLO, the monthly error budget is 0.1% of eligible requests or events. A burn rate of 1x spends that budget evenly across the full compliance period, while 14.4x spends it much faster and usually deserves immediate attention.
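
As a quick check of that arithmetic for the 30-day period used in the defaults:

  error budget at 99.9%    = 100% − 99.9% = 0.1% of eligible events
  exhaustion at 1x burn    = 30 days / 1 = 30 days
  exhaustion at 14.4x burn = 30 days / 14.4 ≈ 2.1 days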

Burn-rate alerting matters because ordinary error-rate alarms can be noisy during small blips and too slow during sustained damage. A short outage that hurts many users should page quickly. A slower degradation that drains the budget over a day or more should still create a ticket before the service owner discovers the miss during an SLO review.

Diagram showing an SLO target converted to error budget, multiplied by a burn tier, then checked across long and short windows.

Multi-window burn-rate alerts reduce false confidence by checking a long window and a short window at the same time. The long window confirms the issue is large enough to matter to the budget. The short window confirms the issue is still happening recently enough to act on. That pairing helps separate an old incident tail from an active service problem.
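
A minimal sketch of that pairing for the default fast-page tier, assuming an illustrative request metric named http_requests_total with a code label (substitute your own SLI series):

  (
    sum(rate(http_requests_total{job="orders-api", code=~"5.."}[1h]))
    /
    sum(rate(http_requests_total{job="orders-api"}[1h]))
  ) > (14.4 * 0.001)
  and
  (
    sum(rate(http_requests_total{job="orders-api", code=~"5.."}[5m]))
    /
    sum(rate(http_requests_total{job="orders-api"}[5m]))
  ) > (14.4 * 0.001)

The and keeps the alert silent when only the long window is elevated by an incident tail that has already recovered.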

An SLO alert rule is still only as good as its service-level indicator. The error-ratio query should count bad events divided by eligible events for the same user journey, service, and compliance period used in the SLO. A generated rule cannot prove that the metric is correct, that the routing policy is safe, or that the alert volume is acceptable for the on-call team.
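
In template form, the same SLI query uses the {window} placeholder the generator substitutes per tier. The metric and label names here are again illustrative:

  sum(rate(http_requests_total{job="orders-api", code=~"5.."}[{window}]))
  /
  sum(rate(http_requests_total{job="orders-api"}[{window}]))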

Technical Details:

An SLO target defines the allowed miss ratio. Subtracting the target from 100% gives the error budget ratio. Burn rate multiplies that budget ratio to produce the error-ratio threshold used by the alert expression. For a 99.9% SLO, the error budget ratio is 0.001. A 14.4x burn tier therefore fires above an error ratio of 0.0144, displayed as 1.440%.

The alert expression uses the same threshold for both windows in a tier. The long-window query and short-window query are joined with the PromQL and operator, so both windows must be above the threshold before the alert can become active. The for duration then controls how long the condition must remain active before the alert is treated as firing.

Formula Core

The calculation starts with the SLO target and compliance period, then applies each tier's burn multiple and long window.

B = (100 − S) / 100
T = B × R
P = R × W × 100 / C
E = C / R
SLO burn-rate formula symbols
S: SLO target entered as a percent, such as 99.9 (percent).
B: error budget ratio left by the target (ratio).
R: burn multiple from the tier row (multiple).
T: error-ratio threshold used in both PromQL window checks (ratio).
W: long-window length converted to hours (hours).
C: compliance period converted to hours (hours).
P: share of the full error budget represented by the long window at that burn rate (percent).
E: time to drain the full error budget at the tier's burn multiple (hours).
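
Worked through for the default fast-page tier, with S = 99.9, R = 14.4, W = 1 hour, and C = 720 hours (30 days):

  B = (100 − 99.9) / 100 = 0.001
  T = 0.001 × 14.4 = 0.0144, displayed as 1.440%
  P = 14.4 × 1 × 100 / 720 = 2.00%
  E = 720 / 14.4 = 50 hours, about 2.1 days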

The default 30-day setup uses four tier rows. The first two are page-like severities, and the last two are ticket-like severities. The 24-hour, 3x tier and the 3-day, 1x tier both represent 10% of a 30-day budget, but they describe different operational cadences.

Default burn-rate tiers for a 99.9 percent 30-day SLO

Tier key       Severity  Windows     Burn   Threshold  Budget share  Exhaustion
fast-page      page      1h and 5m   14.4x  1.440%     2.00%         2.1 days
medium-page    page      6h and 30m  6x     0.600%     5.00%         5.0 days
slow-ticket    ticket    24h and 2h  3x     0.300%     10.00%        10.0 days
budget-ticket  ticket    3d and 6h   1x     0.100%     10.00%        30.0 days
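
Typed as tier rows in the key,severity,long_window,short_window,burn_multiple,for_duration shape, the defaults look like the lines below. The for durations on the two middle rows are illustrative, since only the fast-page and budget-ticket rows are spelled out elsewhere on this page:

  fast-page,page,1h,5m,14.4,2m
  medium-page,page,6h,30m,6,5m
  slow-ticket,ticket,24h,2h,3,30m
  budget-ticket,ticket,3d,6h,1,1h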

Validation protects the generated rule shape. The SLO target must be greater than 0 and less than 100. The compliance period must be positive. The error-ratio query template must include {window}. Duration fields accept Prometheus-style units such as 5m, 1h, 3d, and 1w. Each tier needs a short window shorter than its long window and a burn multiple greater than zero.

Generated alert rule fields and their meaning
alert: Pascal-cased service, SLO label, tier key, and BurnRate. Stable alert names help routing, dashboards, and silence matching.
expr: long-window query above threshold and short-window query above threshold. Both windows must agree before the rule can fire.
for: taken from the tier row's final duration column. Controls how long the condition must stay active before firing.
labels: extra labels combined with severity, service, SLO, tier, windows, and burn multiple. Routing and grouping should match the notification policy.
annotations: summary and description, plus runbook and dashboard links when entered. Responders get context without editing each generated rule by hand.
keep_firing_for: added only when the switch is enabled and the duration is valid. Can reduce clear-and-re-fire churn after the expression drops below the threshold.
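
Assembled, one generated rule might look like the sketch below. The recording-rule metric names, runbook URL, and dashboard URL are hypothetical stand-ins, not tool output:

  - alert: OrdersApiAvailabilityFastPageBurnRate
    expr: |
      slo:orders_api_availability:error_ratio1h > (14.4 * 0.001)
      and
      slo:orders_api_availability:error_ratio5m > (14.4 * 0.001)
    for: 2m
    keep_firing_for: 10m
    labels:
      severity: page
      service: orders-api
      slo: availability
      tier: fast-page
      team: payments
    annotations:
      summary: orders-api availability SLO is burning error budget at 14.4x
      description: Long- and short-window error ratios are above 1.440%.
      runbook_url: https://runbooks.example.com/orders-api-availability
      dashboard_url: https://dashboards.example.com/orders-api-slo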

The output can use a plain Prometheus rule group or a PrometheusRule custom resource. Both forms contain the same alert rules. The custom resource adds Kubernetes metadata and a spec.groups wrapper for operator-managed Prometheus deployments.
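
A sketch of the CRD wrapper, assuming illustrative metadata values:

  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    name: orders-api-slo-burn-rate    # name and namespace are assumptions
    namespace: monitoring
  spec:
    groups:
      - name: orders-api-availability-slo-burn
        interval: 30s
        rules:
          - alert: OrdersApiAvailabilityFastPageBurnRate
            # remaining fields as in the rule sketch above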

Everyday Use & Decision Guide:

Start with the service and SLO label that responders already know, such as orders-api and availability. Use the same SLO target and compliance period as the SLO report. If the report is a 30-day rolling availability SLO, leave the period at 30 days; if the review policy uses 28 or 90 days, change it before trusting the threshold ledger.

The error-ratio query template is the most important input. It should return a bad-events divided by eligible-events ratio for the requested lookback window. Do not enter a percentage expression. A query that returns 0.0144 for 1.44% matches the generated threshold for the default fast page tier on a 99.9% SLO.
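
With illustrative metric names, the difference looks like this:

  # Ratio form the generator expects: returns 0.0144 when 1.44% of events are bad
  sum(rate(requests_errors_total[{window}])) / sum(rate(requests_total[{window}]))

  # Percent form to avoid: returns 1.44 for the same traffic, 100x the ratio threshold
  100 * sum(rate(requests_errors_total[{window}])) / sum(rate(requests_total[{window}]))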

  • Use Prometheus rule file when the result will be placed under the server's rule files.
  • Use PrometheusRule CRD when the deployment is managed by the Prometheus Operator or a compatible platform.
  • Keep the default tier rows for a first pass, then adjust severities and windows only to match local paging policy.
  • Use Extra labels for stable routing values such as team and environment.
  • Add Runbook URL and Dashboard URL when responders should land on a specific triage path.
  • Enable Keep firing for only when your Prometheus version and alert policy support that field.

Stop when Review SLO alert inputs appears. The result should not be copied until the missing placeholder, bad duration, invalid label key, or tier-row problem is fixed. Generation notes are different: they warn about unusual thresholds or long-to-short window ratios while still allowing the YAML to be generated.

A green result does not mean the alert should page immediately in production. Compare Threshold Ledger against historical error ratios, read Routing Guidance for page versus ticket behavior, and test the generated PromQL against real time series before handing it to Alertmanager.

Step-by-Step Guide:

Build the alert set from the SLO definition first, then review the generated rule fields and routing impact.

  1. Enter Service name and SLO label. The summary and generated alert names should reflect the same service and objective used in dashboards and runbooks.
  2. Set SLO target and Compliance period. The summary line should show the derived error budget, such as a 99.9% SLO leaving 0.100% budget.
  3. Choose Rule format. The Alert YAML tab should show either a groups document or a PrometheusRule resource with apiVersion: monitoring.coreos.com/v1.
  4. Enter the Error ratio query template. Keep {window} in the expression so each tier can build its long-window and short-window PromQL checks.
  5. Review Burn-rate tiers. Each row needs key,severity,long_window,short_window,burn_multiple,for_duration; use Load SRE defaults to restore the default four-tier set.
  6. Open Advanced for group name, evaluation interval, severity label key, extra labels, runbook URL, dashboard URL, and optional keep-firing duration.
  7. If Review SLO alert inputs appears, fix the listed issue before using the result. Common fixes include adding {window}, changing 1hour to 1h, or making the short window shorter than the long window.
  8. Read Threshold Ledger for exact alert names, thresholds, budget share, and exhaustion time. Then read Routing Guidance to confirm each severity maps to the intended response path.
  9. Use Burn Threshold Ladder to compare burn multiples visually, and use the JSON view when another review or deployment workflow needs the same generated values in structured form.

Interpreting Results:

The first value to check is the error threshold in Threshold Ledger. It is an error ratio, displayed as a percent for readability. For a 99.9% SLO, fast-page at 14.4x shows 1.440%, while budget-ticket at 1x shows 0.100%.

  • Alert YAML is the deployable rule text, but it still needs Prometheus syntax validation and a real-data query check.
  • Budget share estimates how much of the compliance-period error budget the long window represents at that burn multiple.
  • Exhaustion time shows how long the full error budget would last if the burn rate continued.
  • Routing Guidance treats page, critical, and sev1 as immediate-response severities. Other severities are routed as ticket or team-channel review.
  • Generation notes deserve review when a threshold exceeds a 100% error ratio or the long-to-short window ratio falls outside the expected range.

The false-confidence risk is treating a generated rule as proof of SLO coverage. Confirm that the query measures the right SLI, that both windows return data, that labels match Alertmanager routing, and that the default page tiers do not create alert fatigue for known background errors.

Worked Examples:

Default 99.9% availability SLO

An orders-api availability SLO uses a 30-day compliance period and the default query template. The fast-page tier has 1h and 5m windows at 14.4x. Threshold Ledger should show OrdersApiAvailabilityFastPageBurnRate, an error threshold of 1.440%, budget share of 2.00%, and exhaustion time near 2.1 days. That is a page-level signal because active burn at that pace can damage the objective quickly.

Ticket path for slower budget drain

The same service keeps budget-ticket,ticket,3d,6h,1,1h. On a 99.9% SLO, Threshold Ledger shows a 0.100% threshold, 10.00% budget share, and 30.0 days of exhaustion time. Routing Guidance maps the ticket severity to review on the next working day rather than immediate paging, which fits a sustained burn that needs an owner without waking the on-call.
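
The ledger values match the Formula Core arithmetic for this tier, with R = 1, W = 72 hours, and C = 720 hours:

  T = 0.001 × 1 = 0.001, displayed as 0.100%
  P = 1 × 72 × 100 / 720 = 10.00%
  E = 720 / 1 = 720 hours, exactly 30.0 days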

PrometheusRule for operator-managed monitoring

A Kubernetes platform team changes Rule format to PrometheusRule CRD, leaves Evaluation interval at 30s, and adds team=payments plus env=prod as extra labels. Alert YAML should start with apiVersion: monitoring.coreos.com/v1 and include those labels on every generated rule. Routing Guidance should show the same team and environment values in the label set used for notification review.

Troubleshooting a tier-row mistake

If a row is entered as fast-page,page,5m,1h,14.4,2m, the short window is longer than the long window. Review SLO alert inputs reports that the short window must be shorter than the long window. Changing the row back to fast-page,page,1h,5m,14.4,2m clears the issue and restores the 1-hour plus 5-minute page tier.

FAQ:

Should the query return a ratio or a percent?

Use a ratio. A value of 0.0144 means 1.44%. Entering a query that already multiplies by 100 will make the generated thresholds too low for the displayed percent labels.

Why does each tier need two windows?

The generated expression joins the long-window and short-window checks with the PromQL and operator. The long window catches meaningful budget spend, and the short window confirms the burn is recent enough for the selected response.

What does the 12:1 window note mean?

The generator warns when a tier's long window divided by short window is below 6 or above 24. The default tiers use a 12:1 ratio, such as 1h with 5m.
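
All four default tiers keep that 12:1 shape:

  1h / 5m = 12    6h / 30m = 12    24h / 2h = 12    3d / 6h = 12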

Can I add my own labels?

Yes. Use Extra labels with comma or newline separated key=value entries. Label keys must follow Prometheus label naming rules, and generated rule labels take precedence over duplicate keys.

What should I do when input review blocks generation?

Read the Review SLO alert inputs list and fix the named field. Common causes are a missing {window} placeholder, invalid duration text, an invalid severity label key, or a tier row with fewer than six columns.

Does the page validate the PromQL against my metrics?

No. It builds the rule text and derived review tables from the entered values. Run the PromQL in Prometheus, confirm the series labels, and use your normal rule validation before deploying.

Glossary:

SLI
Service level indicator, the measured ratio or metric used to judge a user-visible reliability promise.
SLO
Service level objective, the target value the service aims to meet over a compliance period.
Error budget
The allowed miss ratio left by the SLO target, such as 0.1% for a 99.9% objective.
Burn rate
The speed of error-budget consumption relative to spending the budget evenly across the full period.
PromQL
Prometheus Query Language, used here to express the long-window and short-window error-ratio checks.
PrometheusRule
A Kubernetes custom resource shape used by operator-managed Prometheus deployments to load alerting and recording rules.
