Error Budget Calculator
Calculate online request error budget use, burn rate, projected exhaustion, policy gates, and runway points for SLO release or incident review.
Introduction
Request error budgets turn a service-level objective into a count of eligible events that may fail during a reporting window. A 99.9% SLO leaves 0.1% of requests or events for budget-consuming outcomes. If a service handles 12,000,000 eligible events in a 30-day period, that target allows 12,000 bad events before the period misses the objective.
The count matters because a small failure percentage can still be urgent when traffic is high or when the reporting window is only partly elapsed. A service can look acceptable in the current sample while the same pace would spend most of the period allowance before the next review. Error budget math gives SRE, operations, and product teams a shared way to discuss release risk, incident follow-up, and whether a service is burning reliability faster than planned.
Burn rate is the pace signal. A burn rate of 1.00x means the current bad-event ratio would use the budget exactly by the end of a matching period. Higher values spend budget faster. Lower values leave room if traffic and error mix stay similar. The same number still needs context because low-traffic services can swing sharply after a handful of failures, and request volume can change after a launch, migration, or traffic event.
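A minimal sketch of that arithmetic in TypeScript, using the ratio definitions above (function names are illustrative, not the calculator's internal API):

```ts
// Allowed failure ratio from an SLO given as a percentage (e.g. 99.9).
function allowedFailureRatio(slo: number): number {
  return (100 - slo) / 100;
}

// Burn rate: observed bad-event ratio divided by the allowed ratio.
// 1.00x means the pace would spend the budget exactly over a matching period.
function burnRate(badEvents: number, eligibleEvents: number, slo: number): number {
  return badEvents / eligibleEvents / allowedFailureRatio(slo);
}

// Intro example: 12,000,000 eligible events at 99.9% leave 12,000 bad events.
const periodBudget = 12_000_000 * allowedFailureRatio(99.9); // ~12,000
```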
A useful readout separates the observed slice from the full reporting window. The current sample shows what has happened so far, while the projection asks what happens if that bad-event pace continues.
An error budget calculation is not root-cause analysis. It does not prove which deployment, dependency, route, customer segment, or retry pattern caused the failures. It tells you how much of the reliability allowance has been spent and how quickly the current pace would exhaust the rest.
Technical Details:
A request-based SLO starts with an eligibility rule. Eligible events are the requests or events included in the SLI denominator. Budget-consuming events are the eligible events that the SLO definition treats as failures, such as errors, slow responses, dropped jobs, or another condition chosen by the service owner. The allowed failure ratio is the gap between 100% and the SLO target.
Traffic projection affects the size of a request-based budget. When the full-window event count is unknown, the observed eligible-event pace can be stretched across the reporting window. When a known period forecast is available, that forecast can be used instead. The forecast should never be lower than the eligible events already observed because the period cannot end with fewer events than the sample has already counted.
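A sketch of that projection rule, assuming linear pace scaling and treating a too-small forecast as an input error:

```ts
// Full-window eligible events: stretch the observed pace across the
// reporting window, or take a manual forecast when one is supplied.
function projectedPeriodEvents(
  eligibleObserved: number,
  elapsedDays: number,
  windowDays: number,
  forecast: number // 0 means "project from observed pace"
): number {
  if (forecast === 0) {
    return (eligibleObserved / elapsedDays) * windowDays;
  }
  if (forecast < eligibleObserved) {
    // The period cannot end with fewer events than already observed.
    throw new Error("forecast is below observed eligible events");
  }
  return forecast;
}
```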
Formula Core
The calculation keeps ratios and counts separate. Counts show how many events are available or spent. Ratios show whether the current event mix is sustainable for the selected SLO target.
For a 99.9% SLO, the budget ratio is 0.001. With 12,500,000 eligible events and 8,200 bad events across 10 elapsed days, the observed error ratio is 0.000656 and burn rate is 0.656x. If the reporting window is 30 days and no manual forecast is provided, projected period events are 37,500,000 and the full-window budget is 37,500 bad events. The same bad-event pace projects 24,600 bad events by period end, or 65.6% of the period budget.
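The same walk-through as a short script, so the intermediate values can be checked:

```ts
// Formula Core example: 99.9% SLO, 30-day window, 10 elapsed days.
const allowed = (100 - 99.9) / 100;               // 0.001
const eligible = 12_500_000;
const bad = 8_200;

const observedErrorRatio = bad / eligible;        // 0.000656
const burn = observedErrorRatio / allowed;        // 0.656x

const periodEvents = (eligible / 10) * 30;        // 37,500,000
const periodBudget = periodEvents * allowed;      // 37,500 bad events
const projectedBad = (bad / 10) * 30;             // 24,600
const projectedUse = projectedBad / periodBudget; // 0.656 -> 65.6%
```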
| Input | Calculation role | Boundary to check |
|---|---|---|
| SLO target | Sets the allowed failure ratio as (100 - SLO) / 100. | Must be greater than 0% and less than 100%. |
| Reporting window | Defines the full period used for compliance and projection. | Must be greater than 0 days. |
| Elapsed window | Defines the observed slice behind the event counts. | Must be greater than 0 days and cannot exceed the reporting window. |
| Eligible events observed | Supplies the SLI denominator for the observed slice. | Must be greater than 0. |
| Budget-consuming events | Supplies the bad-event numerator for burn, budget use, and projection. | Cannot be negative or greater than eligible events observed. |
| Expected period events | Overrides pace-based traffic projection when a full-window forecast is known. | Use 0 for observed-pace projection, or a value at least as large as observed eligible events. |
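A sketch of those boundary checks; the interface and function names are hypothetical, not the calculator's internal API:

```ts
// Illustrative input validation matching the boundaries in the table above.
interface BudgetInputs {
  sloTarget: number;        // percent, 0 < slo < 100
  windowDays: number;       // > 0
  elapsedDays: number;      // > 0 and <= windowDays
  eligibleObserved: number; // > 0
  badObserved: number;      // 0 <= bad <= eligibleObserved
  forecastEvents: number;   // 0, or >= eligibleObserved
}

function validate(i: BudgetInputs): string[] {
  const errors: string[] = [];
  if (!(i.sloTarget > 0 && i.sloTarget < 100)) errors.push("SLO target out of range");
  if (!(i.windowDays > 0)) errors.push("reporting window must be positive");
  if (!(i.elapsedDays > 0 && i.elapsedDays <= i.windowDays)) errors.push("elapsed window invalid");
  if (!(i.eligibleObserved > 0)) errors.push("eligible events must be positive");
  if (i.badObserved < 0 || i.badObserved > i.eligibleObserved) errors.push("bad events out of range");
  if (i.forecastEvents !== 0 && i.forecastEvents < i.eligibleObserved) errors.push("forecast below observed events");
  return errors;
}
```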
Policy gates translate the same budget math into review cues. They do not change the core budget calculation. They decide whether the result reads as sustainable, needs a release-risk review, resembles a fast-burn incident candidate, or is too small to trust without a wider sample.
| Gate | Default boundary | Meaning |
|---|---|---|
| Sustainable burn | burn rate <= 1.00x | The current bad-event ratio is at or below the long-period allowance. |
| Slow-burn watch | burn rate < 2.0x | Above this boundary, the result calls for a longer-window reliability check. |
| Fast-burn incident | burn rate < 10.0x | Above this boundary, active traffic should be checked for a possible short-window incident. |
| Release-risk watch | projected use < 70% | At or above this boundary, risky launches and mitigations deserve review before the budget is spent. |
| Incident share | current bad events < 20% of full-window budget | At or above this boundary, one incident or failure class consumed enough budget to document against policy. |
| Sample confidence | observed allowed failures >= 10 | Smaller samples can be mathematically valid but too noisy for strong operational decisions. |
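One way to express the default gates in code; the thresholds are the documented defaults, and the structure is illustrative rather than the calculator's implementation:

```ts
type GateStatus = "pass" | "watch";

function gate(pass: boolean): GateStatus {
  return pass ? "pass" : "watch";
}

// Ratios are fractions (0.70 = 70%); allowedInSample is the expected
// allowed-failure count in the observed slice.
function evaluateGates(burn: number, projectedUse: number,
                       incidentShare: number, allowedInSample: number) {
  return {
    sustainableBurn: gate(burn <= 1.0),
    slowBurnWatch: gate(burn < 2.0),
    fastBurnIncident: gate(burn < 10.0),
    releaseRiskWatch: gate(projectedUse < 0.70),
    incidentShare: gate(incidentShare < 0.20),
    sampleConfidence: gate(allowedInSample >= 10),
  };
}
```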
The output is deterministic from the entered values. It does not query a monitoring system, inspect live traffic, or verify that the service name maps to a real SLO. That makes the calculation reproducible for review notes, but the SLO owner still has to confirm that the event counts, reporting window, and bad-event definition match the policy being reviewed.
Everyday Use & Decision Guide:
Start with the exact reliability slice you need to defend. Use Service or SLO name for the route group, job, user journey, or SLO label that will appear in review notes. Set SLO target and Reporting window from the policy first, then enter the observed event counts from the same SLI definition.
The most common setup mistake is mixing windows. Elapsed window should describe the sample behind the counts, not the full reporting period. Ten days of observed data inside a 30-day period answers a different question from a completed 30-day report. The projection assumes the current bad-event pace continues, so daily and weekly traffic cycles can make one short slice look calmer or louder than the full period.
Leave Expected period events at 0 when the observed traffic pace is a reasonable forecast. Enter a forecast only when the period volume is known or when a launch, migration, release pause, or seasonal event means the observed pace is not representative. If the forecast is lower than the observed eligible count, fix the forecast input before using the budget numbers.
- Budget Ledger is the main accounting view for SLO target, elapsed share, eligible events, bad events, observed success rate, burn rate, projected use, remaining budget, and exhaustion timing.
- Burn Policy Gates explains sustainable burn, slow-burn watch, fast-burn review, release-risk watch, incident-share impact, and sample confidence.
- Budget Runway Curve compares projected budget use with the sustainable line, the configured watch gate, and the 100% budget limit.
- Runway Points gives representative day-by-day checkpoints for the same curve when a table is easier to quote.
- JSON gives a structured handoff of the current inputs, summary values, rows, and curve points.
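The exact JSON field names depend on the build; a purely hypothetical sketch of the handoff shape, for orientation only:

```ts
// Hypothetical shape; these field names are illustrative, not the
// calculator's actual JSON contract.
interface BudgetHandoff {
  inputs: { sloTarget: number; windowDays: number; elapsedDays: number;
            eligibleObserved: number; badObserved: number; forecastEvents: number };
  summary: { burnRate: number; projectedUse: number; remainingBudget: number };
  gates: Array<{ gate: string; threshold: string; current: string; action: string }>;
  runway: Array<{ day: number; projectedUsed: number; remainingEvents: number }>;
}
```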
A calm headline does not prove the service is safe for every release. Check Current burn rate, Projected period use, and Sample confidence together. A low burn rate with a tiny sample may need more data. A 70% projected-use watch may still pass the SLO today, but it leaves less room for a future incident or rollout error.
Use the output for release review prep, post-incident budget accounting, weekly SLO status, or comparing a forecast with the observed traffic pace. Before sharing the table or JSON, confirm that the bad-event count really represents the same failure definition used by the SLO.
Step-by-Step Guide:
Work from policy to counts, then read the ledger before using the chart or handoff outputs.
- Enter Service or SLO name using the label you would use in a dashboard, incident review, or release note.
- Set SLO target as the success objective, such as 99.9%, and set Reporting window to the period the service is judged against.
- Set Elapsed window to the duration covered by the observed counts. Fix it if the validation message says it is zero, negative, or longer than the reporting window.
- Enter Eligible events observed and Budget-consuming events from the same SLI slice. Budget-consuming events must not exceed eligible events.
- Open Advanced only when policy gates need adjustment or a full-window traffic forecast should replace observed-pace projection.
- Read the summary and Budget Ledger. Confirm Observed success rate, Current burn rate, Projected period use, and Projected exhaustion.
- Open Burn Policy Gates when the summary moves into watch or miss risk. The gate rows explain whether the trigger came from burn rate, projected use, incident share, or sample size.
- Use Budget Runway Curve, Runway Points, and JSON after the ledger matches the SLO definition you intend to review.
Interpreting Results:
The headline percentage is Projected period use. It estimates how much of the full-window budget would be consumed by period end if the observed bad-event pace continued. Values below the policy watch gate leave room, values at or above the watch gate call for review, and values at or above 100% indicate a projected SLO miss.
Current burn rate is the quickest pace signal. At 1.00x, the current error ratio is exactly sustainable for the selected SLO. At 2.00x, the service is spending twice as fast as the long-period allowance. At 10.00x, the same failure mix can eat through budget fast enough that recent traffic and alert windows should be checked before treating the result as routine release data.
| Output | Use it for | Do not overread |
|---|---|---|
| Observed success rate | Checking whether the sample itself is above or below the target. | It does not prove the full reporting window will pass. |
| Allowed events at elapsed pace | Comparing the observed bad-event count with the local allowance for the sample size. | It changes with eligible traffic volume and is not a fixed incident quota. |
| Budget used now | Seeing how much of the projected full-window budget has already been spent by observed bad events. | It is not the same as projected period use when only part of the window has elapsed. |
| Projected exhaustion | Estimating the reporting day when the budget crosses 100% at the current bad-event pace. | It is not a promise that traffic and failures will stay unchanged. |
| Incident share | Judging whether the entered bad-event count consumed a policy-significant budget share. | It does not identify whether the count came from one incident or several unrelated causes. |
| Sample confidence | Deciding whether the sample has enough expected allowed failures for a practical readout. | A low value does not make the math invalid, but it should slow high-risk decisions. |
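Projected exhaustion follows from the same pace arithmetic. A sketch, assuming a constant bad-event pace across the window:

```ts
// Day when the full-window budget would cross 100% at the observed
// bad-event pace. Returns null when the pace never exhausts the budget
// inside the reporting window.
function projectedExhaustionDay(
  badObserved: number, elapsedDays: number,
  periodBudget: number, windowDays: number
): number | null {
  const dailyBad = badObserved / elapsedDays;
  if (dailyBad <= 0) return null;
  const day = periodBudget / dailyBad;
  return day <= windowDays ? day : null;
}

// Formula Core example: 37,500 budget, 8,200 bad events in 10 days.
// 37,500 / 820 ≈ 45.7 days, beyond the 30-day window, so no projected
// exhaustion inside the period.
```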
Negative remaining budget or a projected miss should trigger a policy conversation, not a blind conclusion. Check whether the bad-event definition, traffic forecast, elapsed window, and SLO target are all correct before changing release plans or declaring the period out of compliance.
Worked Examples:
Default checkout service sample
With a 99.9% SLO, a 30-day reporting window, 10 elapsed days, 12,500,000 eligible events, and 8,200 budget-consuming events, the observed success rate is 99.934%. The burn rate is about 0.66x, projected period use is about 65.6%, and the service remains below the default 70% policy watch gate. The same run still deserves a check against launch plans because it has already used 21.9% of the full-window budget.
Manual forecast after a traffic increase
A service has 6,000,000 eligible events and 9,000 bad events after 5 days of a 30-day period. At 99.9%, observed-pace projection would expect 36,000,000 period events and 36,000 budget events. If the team knows a launch will raise the full-window volume to 60,000,000 events, entering that forecast raises the full-window budget to 60,000 events. The burn rate stays 1.50x because the observed error ratio has not changed, but current budget share and projected period use are read against the larger forecast.
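Checking those numbers in code; only the budget denominator changes with the forecast:

```ts
// Manual forecast example: 5 of 30 days elapsed at a 99.9% SLO.
const allowed = 0.001;
const eligible = 6_000_000, bad = 9_000;

const burn = (bad / eligible) / allowed;         // 1.50x either way

const paceEvents = (eligible / 5) * 30;          // 36,000,000
const paceBudget = paceEvents * allowed;         // 36,000 bad events

const forecastEvents = 60_000_000;               // known launch volume
const forecastBudget = forecastEvents * allowed; // 60,000 bad events

// Current budget share: 9,000 / 36,000 = 25% at observed pace,
// but 9,000 / 60,000 = 15% against the launch forecast.
```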
Fast-burn incident candidate
For a 99.95% SLO, 400,000 eligible events and 3,000 bad events in a half-day slice produce a 0.75% observed error ratio. The allowed failure ratio is 0.05%, so burn rate is 15.00x. Even if the full-window projection changes with traffic forecast, the burn policy gate treats this as a fast-burn incident candidate that should be checked against recent monitors and active user impact.
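The burn-rate arithmetic for this example:

```ts
// Fast-burn example: 99.95% SLO over a half-day slice.
const allowed = (100 - 99.95) / 100;   // 0.0005
const observedRatio = 3_000 / 400_000; // 0.0075, i.e. 0.75%
const burn = observedRatio / allowed;  // 15.00x, above the 10x gate
```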
Low-traffic sample
A small internal job with 2,000 eligible events, 3 bad events, a 99.9% SLO, and one elapsed day has only 2 allowed bad events in the observed slice. The burn rate is 1.50x, but the sample-confidence row warns that the allowance is below 10. The result is useful as a prompt to investigate, but a longer window or a broader SLO slice is safer before changing release policy.
FAQ:
What counts as a budget-consuming event?
The SLO definition decides that. It may be failed requests, slow responses, unsuccessful jobs, dropped messages, or another event class. The calculator expects you to enter the count after that policy decision has already been made.
Why can the observed sample pass while the projection is risky?
The sample can be under its local allowance while still burning too quickly for the full reporting window. Projection stretches the current bad-event pace across the period, so the elapsed window and the full window can tell different stories.
What does burn rate above 1.00x mean?
It means the observed bad-event ratio is higher than the SLO allows over a sustained full period. If the same mix continued, the budget would be exhausted before the reporting window ended.
When should I enter expected period events?
Enter a forecast when the full-window traffic volume is known or when observed traffic is not representative. Leave it at 0 when the current eligible-event pace is the best available projection.
Does this verify my monitoring system's SLO?
No. It calculates from the values you enter. It does not query SLO objects, dashboards, alert policies, or live service traffic.
Why does low traffic make burn rate noisy?
A strict SLO can allow only a few bad events in a small sample. When the observed allowance is below 10, one or two failures can move burn rate sharply, so the result should be checked against a longer window before a high-impact decision.
Glossary:
- Service-level objective (SLO)
- The target success level for a service or SLI slice over a reporting window.
- Eligible events
- The request or event count included in the SLI denominator for the observed window.
- Budget-consuming events
- The eligible events that count as failures for the SLO.
- Error budget
- The allowed number of budget-consuming events before the SLO target is missed.
- Burn rate
- The observed error ratio divided by the allowed failure ratio.
- Projected period use
- The share of the full-window budget expected to be spent by period end if the current bad-event pace continues.
- Exhaustion day
- The reporting day when the full-window budget would reach 100% at the current bad-event pace.
References:
- Service Level Objectives, Site Reliability Engineering, Google and O'Reilly Media.
- Alerting on SLOs, Site Reliability Engineering Workbook, Google and O'Reilly Media.
- Concepts in service monitoring, Google Cloud Observability, updated 2026-04-29.
- Alerting on your burn rate, Google Cloud Observability.