API Error Budget Analyzer
Measure API error-budget burn from status counts or access-log snippets, with SLO projections, status-code policy checks, and burn guardrails.
Introduction:
API reliability work often starts after a release, a traffic spike, or an incident export raises the same practical concern: did the service spend too much of its reliability budget? A raw count of 500 responses is useful, but it becomes more meaningful when it is compared with the service level objective, the amount of traffic, and the time window behind the sample.
An availability SLO describes the success rate a service aims to meet over a stated period. The gap between that target and perfect availability is the error budget. A 99.9% availability target leaves 0.1% of eligible requests for failures during the compliance period. For high-volume APIs, that allowance can be thousands of requests. For a low-volume endpoint, one failed request can be enough to make a short sample look alarming.
- SLI: The measured reliability signal, such as the fraction of eligible API requests that succeed.
- SLO: The target value for that signal, such as 99.9% successful requests over 30 days.
- Error budget: The allowed miss rate for the SLO, converted into requests or events for the period.
- Burn rate: How fast the sample is spending that allowance compared with the sustainable pace.
HTTP status codes need a policy decision before they become SLO failures. Many teams count 5xx responses because those usually point to a server-side problem. Some count 429 when rate limiting causes a user-visible failure. Others include client disconnects or selected 4xx codes when those responses reflect service behavior rather than user mistakes. The important part is consistency: changing the budget-consuming status rule can change the apparent reliability of the same traffic sample.
| Status area | Typical meaning | SLO caution |
|---|---|---|
| 2xx and selected 3xx | The API completed the requested work or redirected as intended. | Usually counted as successful, subject to the service's own SLI definition. |
| 4xx | The request could not be completed as sent, or the client was rate-limited. | Some codes may be user error, while 429 or selected edge failures may still harm users. |
| 5xx | The server failed or could not complete the method. | Often treated as budget-consuming for availability SLOs. |
An error-budget calculation is a triage aid, not a full observability system. It can explain a pasted status sample, compare policies, and show whether the current failure rate is sustainable. It cannot know which requests your SLI excludes, whether retries duplicated failures, or whether the sample captures normal traffic for the whole compliance period.
How to Use This Tool:
Use the same SLO wording, period, and status-code policy that the service owner would use in an incident note or SLO dashboard.
- In API name, enter the service, route group, or SLO slice that should appear in the Budget Ledger.
- Set Availability SLO and Compliance period to the target and reporting window you actually use. A 99.9% 30-day SLO leaves a 0.1% request budget over 30 days.
- Set Observed window to the hours covered by the pasted counts or log excerpt, not the full SLO period.
- Fill Budget-consuming statuses with classes such as 5xx, exact codes such as 429, or inclusive ranges such as 500-599. Changing this policy can move the same status sample from budget available to budget overrun, so keep the pattern fixed when comparing runs.
- Paste Status counts, drop one TXT/CSV/LOG file onto the textarea, choose Browse TXT/CSV, or choose Load sample. Accepted rows include 500=12 or 500,12 pairs, named status and count fields, and access-log lines that contain HTTP status codes. If an input issue says no rows were parsed, simplify the source to one status=count pair per line before using the burn-rate result.
- Review Budget Ledger for current burn rate and projected period usage, then check Status Mix, Burn Guardrails, and Budget Burn Curve when you need to explain which codes spent the budget and whether the projection crosses 70% or 100%.
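To pre-check a source before pasting it, the accepted row shapes can be approximated offline. The sketch below is not the tool's parser: the function name `parse_status_counts`, both regular expressions, and the assumption that access-log lines follow the common log format (status right after the quoted request line) are illustrative choices.

```python
import re
from collections import Counter

# Assumed row shapes, modelled on the examples above: "500=12" and "500,12".
PAIR_RE = re.compile(r"^([1-5]\d{2})\s*[=,]\s*(\d+)$")
# Assumed access-log shape: the status follows the quoted request line,
# as in the common log format ('"GET /x HTTP/1.1" 503 512').
LOG_RE = re.compile(r'"\s+([1-5]\d{2})(?:\s|$)')


def parse_status_counts(text: str) -> Counter:
    """Count requests per HTTP status (100-599) from mixed rows and log lines."""
    counts: Counter = Counter()
    for raw in text.splitlines():
        line = raw.strip()
        pair = PAIR_RE.match(line)
        if pair:
            # Aggregated row: add the stated count to that status.
            counts[int(pair.group(1))] += int(pair.group(2))
            continue
        log = LOG_RE.search(line)
        if log:
            # One access-log line represents one request.
            counts[int(log.group(1))] += 1
        # Lines matching neither shape are skipped silently.
    return counts
```

An empty result from a helper like this corresponds to the "no rows were parsed" case, which is the cue to reduce the source to one status=count pair per line.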
Interpreting Results:
Current burn rate is the fastest sustainability readout. A value of 1.00x means the observed error ratio is exactly equal to the SLO's allowed error ratio. Values above 1.00x mean the same mix would run out of budget before the compliance period ends.
Projected period usage and Projected remaining budget show the consequence in period terms. A projection near 100% deserves review even if the short sample has not yet caused a visible outage. A negative remaining budget means the observed pace points to an SLO miss unless traffic, failures, or the policy changes.
| Result cue | Boundary | How to respond |
|---|---|---|
| Sustainable burn | Current burn rate <= 1.00x | The sample stays within the long-window budget if it represents normal traffic. |
| Budget policy watch | Projected period use >= 70% | Review release risk, ownership, and mitigation while the budget is still recoverable. |
| Budget spent | Projected period use >= 100% or the sample already exceeds its allowed bad-request count | Confirm with production SLI telemetry and treat the sample as an SLO threat if active traffic agrees. |
| Fast-burn watch | Current burn rate >= 6.00x | Check a short-window metric before paging from a small or partial sample. |
| Critical burn gate | Current burn rate >= 14.40x | Escalate when the same rate is still present in live traffic. |
Avoid false confidence from a clean-looking status mix. If the Sample confidence guardrail says the sample has fewer than 10 allowed bad events, one failure can swing the burn rate sharply. Aggregate a longer window or a related API slice before treating the result as a paging signal.
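The result cues and the Sample confidence check can be written as a handful of comparisons. This is a minimal sketch under stated assumptions, not the tool's code: the function name and dictionary keys are hypothetical, and projected period use is taken to equal burn rate times 100, which holds only if the sample pace continues for the whole compliance period.

```python
def evaluate_guardrails(total_requests: int, bad_requests: int, slo_percent: float) -> dict:
    """Map a status sample onto the result cues described above (illustrative only).

    Assumes total_requests > 0 and 0 < slo_percent < 100.
    """
    allowed_ratio = 1 - slo_percent / 100          # e.g. about 0.001 for 99.9%
    allowed_bad = total_requests * allowed_ratio   # bad events the sample can absorb
    burn_rate = (bad_requests / total_requests) / allowed_ratio
    # Assumption: the sample pace holds for the whole period, so projected
    # period use in percent equals burn rate times 100.
    projected_use = burn_rate * 100
    return {
        "burn_rate": burn_rate,
        "sustainable": burn_rate <= 1.0,
        "policy_watch": projected_use >= 70,
        "budget_spent": projected_use >= 100 or bad_requests > allowed_bad,
        "fast_burn_watch": burn_rate >= 6.0,
        "critical_burn_gate": burn_rate >= 14.4,
        "low_sample_confidence": allowed_bad < 10,
    }
```

Several cues can be true at once: a sample can be over budget without reaching the fast-burn threshold, and a low-confidence sample can trip any of them on a single failure.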
Technical Details:
Error-budget math converts an availability target into an allowed failure ratio, then compares that allowance with the observed ratio of budget-consuming responses. Request-count SLOs are naturally proportional: twice the traffic creates twice the allowed number of bad responses, but the allowed percentage stays the same.
Burn rate is a pace measurement. A 30-day SLO does not require waiting 30 days to detect risk, because a short sample can be scaled by its observed request rate. That projection is useful for triage, but it inherits every weakness of the sample: missing routes, retry amplification, planned maintenance, bot traffic, and one-off deploy spikes all change the meaning of the count.
Formula Core:
The core calculation uses request counts rather than downtime minutes. The same equations apply whether the input came from aggregated status counts or parsed log lines.
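The equations are not reproduced in this text. The block below is a reconstruction inferred from the term table that follows and from the worked numbers in this section, so treat it as a sketch. The symbols $N_{req}$ (eligible requests in the sample), $N_{bad}$ (budget-consuming responses in the sample), $H_{obs}$ (observed window in hours), and $D_{period}$ (compliance period in days) are assumed names, not the tool's own notation.

```latex
\begin{aligned}
B_{ratio} &= 1 - \frac{SLO}{100} \\
E_{obs} &= \frac{N_{bad}}{N_{req}} \\
\text{Burn rate} &= \frac{E_{obs}}{B_{ratio}} \\
\text{Projected budget events} &= N_{req} \cdot \frac{24}{H_{obs}} \cdot D_{period} \cdot B_{ratio} \\
\text{Projected period usage} &= \frac{N_{bad} \cdot \frac{24}{H_{obs}} \cdot D_{period}}{\text{Projected budget events}} \times 100\%
\end{aligned}
```

Because requests and bad responses are scaled by the same factor, projected period usage under these assumptions reduces to the burn rate expressed as a percentage.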
| Term | Meaning | Unit |
|---|---|---|
| SLO | Availability target entered as a percent, greater than 0 and less than 100. | percent |
| B_ratio | Allowed fraction of eligible requests that may consume budget. | ratio |
| E_obs | Observed bad-response ratio from the selected status policy. | ratio |
| Projected budget events | Observed request pace scaled to the compliance period, multiplied by the allowed error ratio. | requests |
With a 99.9% SLO, the allowed error ratio is 0.001. In the sample counts, 5xx plus 429 produces 493 budget-consuming responses out of 253,093 eligible requests. The observed error ratio is about 0.1948%, so the burn rate is about 1.95x. At a 24-hour observed window and a 30-day compliance period, that pace projects to about 194.8% of the period budget.
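The sample arithmetic can be reproduced in a few lines. This is a sketch of the calculation as described here, not the tool's source; the function and key names are hypothetical.

```python
def error_budget_summary(total_requests: int, bad_requests: int,
                         slo_percent: float, observed_hours: float,
                         period_days: float) -> dict:
    """Burn rate and period projection from request counts (illustrative)."""
    allowed_ratio = 1 - slo_percent / 100            # allowed error ratio
    observed_ratio = bad_requests / total_requests   # observed error ratio
    burn_rate = observed_ratio / allowed_ratio

    # Scale the observed window to a daily pace, then to the full period.
    scale = (24 / observed_hours) * period_days
    projected_bad = bad_requests * scale
    projected_budget = total_requests * scale * allowed_ratio

    return {
        "observed_ratio": observed_ratio,
        "burn_rate": burn_rate,
        "projected_budget_events": projected_budget,
        "projected_usage_percent": projected_bad / projected_budget * 100,
        # Left unclipped so the size of an overrun stays visible.
        "projected_remaining_events": projected_budget - projected_bad,
    }


summary = error_budget_summary(253_093, 493, 99.9, observed_hours=24, period_days=30)
print(f"burn rate {summary['burn_rate']:.2f}x")
print(f"projected usage {summary['projected_usage_percent']:.1f}%")
```

With the sample counts this prints a burn rate of 1.95x and projected usage of 194.8%, matching the figures above, and the remaining budget comes out negative.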
Parsing and Rule Core:
| Mechanism | Rule | Limit or boundary |
|---|---|---|
| Status source parsing | Rows can be parsed from status=count, CSV-like pairs, named status/count fields, access-log status positions, or the first HTTP status code in a line. | Only three-digit HTTP status codes from 100 to 599 are recognized. |
| Budget-consuming pattern | Tokens can be classes such as 5xx, exact codes such as 429, or inclusive ranges such as 500-599. | Unsupported tokens or reversed ranges stop the analysis until corrected. |
| Daily pace | Observed requests and bad responses are scaled by 24 / observed window hours. | A short incident window is valid only if that pace is the scenario being tested. |
| Summary status | Budget overrun appears when projected usage reaches 100% or the observed sample already exceeds its allowed bad-request count. | Burn watch appears at 70% projected use or when burn rate is above 1.00x. |
| Sample confidence | The guardrail expects at least 10 allowed bad events in the sample. | Below that point, a small number of failures can create a noisy burn-rate estimate. |
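The budget-consuming pattern rule can be illustrated with a small matcher. It is a sketch under assumptions: the name `compile_budget_pattern` is hypothetical, tokens are assumed to be comma-separated (as in 5xx,429), and errors are raised as `ValueError` to mirror the "stop until corrected" behavior rather than to reproduce the tool's messages.

```python
import re


def compile_budget_pattern(pattern: str):
    """Turn a pattern such as '5xx,429,500-599' into a status predicate."""
    ranges = []
    for token in pattern.split(","):
        token = token.strip().lower()
        if not token:
            continue
        range_match = re.fullmatch(r"([1-5]\d{2})\s*-\s*([1-5]\d{2})", token)
        if re.fullmatch(r"[1-5]xx", token):
            # Class token: 5xx covers 500 through 599.
            base = int(token[0]) * 100
            ranges.append((base, base + 99))
        elif re.fullmatch(r"[1-5]\d{2}", token):
            # Exact code token.
            ranges.append((int(token), int(token)))
        elif range_match:
            low, high = int(range_match.group(1)), int(range_match.group(2))
            if low > high:
                raise ValueError(f"reversed range: {token}")
            ranges.append((low, high))
        else:
            raise ValueError(f"unsupported token: {token}")
    if not ranges:
        raise ValueError("at least one budget-consuming status pattern is required")
    return lambda status: any(low <= status <= high for low, high in ranges)


is_budget_consuming = compile_budget_pattern("5xx,429")
```

Here `is_budget_consuming(503)` and `is_budget_consuming(429)` are true while `is_budget_consuming(404)` is false, and a reversed range such as 599-500 is rejected instead of being silently reordered.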
Percentages are displayed with rounded decimal places, while request counts are rounded to whole events for readability. Negative remaining budget is not clipped to zero because the size of the overrun is useful when comparing policies or deciding how much recovery time is needed.
Accuracy and Privacy Notes:
Pasted text and local TXT, CSV, or LOG files are read in the browser for parsing. It is still wise to sanitize request paths, identifiers, tokens, and user data before pasting logs, because status-code analysis usually needs counts and codes rather than full request details.
- The result does not replace production SLI telemetry, alert windows, or incident policy.
- Partial logs, retry storms, canary traffic, maintenance windows, and low request volume can skew burn-rate projections.
- Use the same status policy and compliance period when comparing two runs. A policy change can make the same traffic look better or worse.
Advanced Tips:
- Use Normalize after pasting a mixed log excerpt when you want to audit the parsed status counts before sharing the result.
- Compare 5xx alone with 5xx,429 when rate limiting is user-visible and the SLO owner has not settled the policy yet.
- Treat Fast-burn watch and Critical burn gate as investigation cues unless the same burn rate is present in the live short-window SLI.
- Use Budget Burn Curve to explain timing to release managers, then keep Compliance period unchanged so the curve matches the reported SLO window.
- Use CSV, DOCX, or JSON exports only after removing request paths, tokens, and user identifiers from pasted log lines.
Worked Examples:
These cases use the shipped sample and common incident-review situations so the burn-rate numbers can be checked against visible result fields.
Release watch from the sample source
The sample has 253,093 eligible requests over 24 hours. Counting 5xx and 429 creates 493 budget-consuming responses, so Budget Ledger reports about 1.95x current burn rate and 194.8% projected period usage for a 99.9% 30-day SLO. That is a burn watch and a projected budget overrun, so confirm the same pace in production SLI telemetry before changing release status.
Rate-limit policy comparison
The same sample counted with only 5xx has 73 budget-consuming responses, which works out to about 28.8% projected period usage for the same 99.9% 30-day SLO. Adding 429 changes the Status Mix role for rate-limited requests and raises Projected period usage to about 194.8%, an over-budget projection. That comparison is useful when a team is deciding whether user-visible rate limits should spend the availability budget.
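The two policies can be compared side by side from the aggregated counts alone. The snippet is a sketch: the counts come from the sample described here, while the variable and function names are illustrative.

```python
TOTAL_REQUESTS = 253_093
ALLOWED_RATIO = 1 - 99.9 / 100  # allowed error ratio for a 99.9% SLO

# Budget-consuming responses under each policy, as stated for the sample.
POLICY_BAD_COUNTS = {"5xx": 73, "5xx,429": 493}


def burn_rate(bad_requests: int) -> float:
    """Observed bad-response ratio divided by the allowed error ratio."""
    return (bad_requests / TOTAL_REQUESTS) / ALLOWED_RATIO


for policy, bad in POLICY_BAD_COUNTS.items():
    print(f"{policy:8s} bad={bad:4d} burn={burn_rate(bad):.2f}x")
```

The traffic is identical in both rows; only the policy differs, which is why the pattern should stay fixed when comparing runs.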
Low-traffic boundary
A small endpoint with 1,000 requests and one 500 response under a 99.9% SLO can show a 1.00x burn rate, but Sample confidence remains weak because the sample allows only one bad event. A longer window or an aggregated route group gives a more stable read before treating the single failure as a paging signal.
Parsing cleanup
A pasted excerpt that contains text but no parseable three-digit statuses returns an input issue asking for rows such as 200=10000 and 500=12. Reformatting the excerpt into one status/count pair per line and choosing Normalize should populate Status Mix with request counts and budget roles.
FAQ:
What should count as a bad API response?
Use the same rule your SLO uses. A common starting point is 5xx, and the status pattern also accepts exact codes such as 429 or inclusive ranges when your policy counts them.
Why does a short sample project across the whole period?
The observed request and bad-response counts are converted to a daily pace, then scaled to the compliance period. That helps with triage, but the projection is only as representative as the observed window.
Does projected exhaustion mean the SLO has already failed?
No. Projected exhaustion assumes the observed pace continues. Compare it with the live SLI source and the actual remaining budget before declaring an SLO miss.
Why can one failure create a large burn rate?
Strict SLOs leave small budgets, and low-traffic samples may have fewer than 10 allowed bad events. Check Sample confidence and aggregate a longer window when the request count is small.
What should I fix if the input is rejected?
Use an SLO greater than 0 and less than 100, enter positive period and window values, add at least one valid budget-consuming status pattern, and provide parseable status counts or log lines.
Glossary:
- Service level indicator (SLI): The measured signal used to judge reliability, such as the fraction of eligible API requests that succeed.
- Service level objective (SLO): The target value for the SLI over a stated period.
- Error budget: The portion of eligible requests that may fail while the service still meets its SLO.
- Burn rate: The observed bad-response ratio divided by the allowed error ratio.
- Compliance period: The full reporting window used for the SLO projection.
- Budget-consuming status: An HTTP status code or status class that the selected policy treats as spending error budget.