Large Language Model (LLM) Usage Cost Calculator
Estimate LLM API spend from token mix, request volume, caching, retries, fixed fees, and budget caps with model price comparisons.
Introduction
LLM spending is a volume problem disguised as a tiny unit price. One request may cost less than a cent, but a support assistant, search workflow, coding agent, report generator, or document summarizer repeats that request pattern all month. The bill grows from the token mix, the number of attempts, the model rate card, and the fixed services wrapped around the model.
Tokens are the working unit. A token is a chunk of text or other model-readable content, and providers usually bill input and output tokens at different rates. Input covers the prompt, system instructions, retrieved context, tool results, and any repeated conversation history. Output covers the model's answer and, depending on the provider and model family, may include billed reasoning tokens or other generated content that does not appear in the visible reply.
- Input tokens: Prompt, system text, retrieved context, tool results, and other content sent into the model.
- Output tokens: Visible answer text and, for some models, billed reasoning or generated content categories.
- Cached reads: Repeated input tokens billed at a reduced read price when the provider recognizes reusable context.
- Billed attempts: The successful request volume plus expected retry, fallback, reconnect, or duplicate-call overhead.
Rate cards are usually quoted per million tokens because the per-token price is too small to read comfortably. Budget planning often works better at the per-request level: convert the rate into the unit used by the estimate, multiply by the average prompt and answer size, then multiply again by daily traffic. A small unit mistake can change the estimate by a factor of 1,000, especially when a published per-million price is entered into a per-thousand field.
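The unit conversion described above can be sketched in a few lines. This is a minimal illustration, not the calculator's own code: the function names are hypothetical, and the rates are illustrative placeholder values rather than any provider's current prices.

```python
def per_million_to_per_thousand(rate_per_million: float) -> float:
    """Convert a USD-per-1M-token rate to USD per 1K tokens."""
    return rate_per_million / 1000.0


def request_cost(prompt_tokens: float, completion_tokens: float,
                 prompt_rate_per_1k: float, completion_rate_per_1k: float) -> float:
    """Token cost of one request, with both rates expressed per 1K tokens."""
    return (prompt_tokens / 1000.0) * prompt_rate_per_1k \
        + (completion_tokens / 1000.0) * completion_rate_per_1k


# Illustrative rate card quoted per 1M tokens (placeholder values).
prompt_rate = per_million_to_per_thousand(2.50)       # 0.0025 per 1K
completion_rate = per_million_to_per_thousand(10.00)  # 0.0100 per 1K

cost = request_cost(1400, 600, prompt_rate, completion_rate)
monthly = cost * 240 * 30  # 240 requests/day over 30 billing days

print(f"per request: ${cost:.4f}, per month: ${monthly:.2f}")
```

Skipping the conversion and passing 2.50 and 10.00 straight into `request_cost` would inflate both figures by a factor of 1,000.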
| Workload pattern | Cost driver to watch | Common mistake |
|---|---|---|
| Customer support chat | Repeated system prompt and answer length | Counting only the user's short message. |
| Retrieval-augmented answers | Retrieved context and cache hit rate | Ignoring documents injected before the model call. |
| Agent or tool workflow | Retries, fallback calls, tool schemas, and tool results | Pricing one visible answer as if it came from one request. |
| Batch report writing | Long output tokens and peak-month volume | Assuming input price matters more than generated text. |
Good estimates start from representative traffic, not from a clean demo prompt. A production trace may include hidden instructions, policy text, retrieved documents, function results, tool schemas, reconnects, and automatic retries. Measured token counts from real requests give a stronger baseline. The same workload can then be tested under changes such as shorter answers, higher cache reuse, higher request volume, or a different model price card.
A cost estimate cannot decide model quality. A cheaper row in a price comparison can show where cost pressure sits, but production selection still depends on latency, accuracy, context-window fit, safety behavior, regional terms, data-handling requirements, and the provider's current billing rules.
How to Use This Tool:
Start with one representative request, then widen the assumptions after the base request cost looks believable.
- Choose Preset for a built-in planning rate card, or choose Custom when a dashboard, invoice, enterprise quote, region, or batch plan gives different prices.
- Enter Prompt tokens and Completion tokens for an average successful request. Use Prompt draft for a rough input estimate when no tokenizer count is available.
- Set Requests per day and Billing days per month. The request count should represent successful user-facing work before retry overhead is added.
- Open Advanced to set prompt, cached-read, and completion rates, then add Cache hit rate, Retry multiplier, Margin uplift, Fixed monthly fees, Growth scenario, and Monthly budget cap.
- Read Usage Cost Brief for the monthly total and cap status. Use Cost Components and Token Usage Ledger to find the part of the bill that moved.
- Use Unit Economics, Spend Pressure Moves, Model Price Ladder, and Scenario Burn Curve for pricing comparisons, savings moves, and growth checks without losing the original workload shape.
If the total is off by hundreds or thousands of times, check the rate units first. Public pages often quote dollars per 1M tokens, while the editable rate fields use dollars per 1K tokens.
Interpreting Results:
Monthly total (tokens + fees) is the operating estimate. Per request is better for comparing prompt designs, but the budget decision depends on repeated volume, retry overhead, margin, fixed fees, and active billing days.
Cost Components is the best place to diagnose pressure. Cached-read relief lowers only the prompt side of the bill. Retry overhead raises billed attempts without adding user-facing volume. Fixed fees create a monthly floor that remains even when token counts shrink.
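One way to see how these components add up is to split a monthly estimate into signed line items. The sketch below is an assumption about how such a breakdown could be computed, not the tool's actual implementation. The function name, the labels, and the ordering (retry overhead applied to post-cache token spend, margin applied after retries) are all illustrative choices that happen to agree with the worked examples on this page.

```python
def monthly_components(prompt_tokens, completion_tokens,
                       prompt_rate, cached_rate, completion_rate,
                       cache_hit, requests_per_day, billing_days,
                       retry_multiplier, margin, fixed_fees):
    """Split a monthly estimate into additive components (rates are USD per 1K tokens)."""
    requests = requests_per_day * billing_days  # successful requests only
    prompt_full = prompt_tokens / 1000 * prompt_rate * requests
    # Cached reads discount only the prompt side, so relief is a negative prompt item.
    cache_relief = -prompt_tokens / 1000 * cache_hit * (prompt_rate - cached_rate) * requests
    completion = completion_tokens / 1000 * completion_rate * requests
    first_attempts = prompt_full + cache_relief + completion
    # Retries add billed attempts without adding user-facing volume.
    retry_overhead = first_attempts * (max(retry_multiplier, 1.0) - 1.0)
    # Assumption: margin applies to all variable token spend, including retries.
    margin_uplift = (first_attempts + retry_overhead) * margin
    return {
        "Prompt at full rate": prompt_full,
        "Cached-read relief": cache_relief,
        "Completion": completion,
        "Retry overhead": retry_overhead,
        "Margin uplift": margin_uplift,
        "Fixed fees": fixed_fees,  # monthly floor, independent of token counts
    }


components = monthly_components(1400, 600, 0.0025, 0.00125, 0.0100,
                                0.60, 240, 30, 1.15, 0.15, 49.0)
for label, amount in components.items():
    print(f"{label:<22}{amount:>10.2f}")
total = sum(components.values())
print(f"{'Monthly total':<22}{total:>10.2f}")
```

Setting every token count to zero in this sketch still leaves the fixed-fee line, which is the monthly floor described above.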
| Output field | Boundary or cue | How to read it |
|---|---|---|
| Budget headroom | >= 0 | The modeled month fits inside the configured cap. |
| Budget overage | < 0 headroom | The workload misses the cap; reduce token size, attempts, rates, margin, or fixed fees. |
| Billed attempts | Higher than successful requests when Retry multiplier is above 1.00 | Retries, reconnects, or fallback calls are adding cost. |
| Effective prompt card | Falls as Cache hit rate rises | Prompt caching is improving input cost, but completion tokens are unchanged. |
| Model Price Ladder | Same workload under different built-in rates | Use it for cost pressure, not as proof that a cheaper model is suitable. |
Do not treat the prompt draft estimate as invoice-grade tokenization. Verify high-stakes budgets against provider usage data, including hidden system text, retrieved context, tool calls, retries, cache-write charges, and generated output categories that your provider bills separately.
Technical Details:
LLM usage cost is a variable-cost model with fixed add-ons. Variable cost comes from token counts and token prices. Fixed cost comes from platform, logging, support, observability, marketplace, or internal chargeback amounts that do not shrink when an individual request becomes shorter.
Prompt and completion tokens are priced separately, then multiplied by billed attempts. Cached prompt reads are represented as a blended prompt rate. The cache hit ratio shifts a share of prompt tokens from the normal prompt rate to the cached-read rate, while completion tokens continue to use the completion rate. Cache writes, storage duration, batch discounts, regional multipliers, or tool-specific charges may exist in a provider contract, so the custom rate fields should reflect the actual billing terms being modeled.
Formula Core:
The main calculation blends the prompt rate, converts successful traffic into billed attempts, applies margin to variable token spend, and adds fixed fees at the end.
| Symbol | Meaning | Unit or handling |
|---|---|---|
| P | Prompt tokens per successful request | Tokens |
| O | Completion tokens per successful request | Tokens |
| H | Cache hit rate | Percent converted to a ratio from 0 to 1 |
| Q | Successful requests per day | Requests/day |
| S | Retry multiplier | Clamped to at least 1.00 |
| M | Margin uplift | Percent added to variable token spend |
| D | Billing days per month | Days/month |
| F | Fixed monthly fees | Added after variable token spend |
| B | Monthly budget cap | Used for headroom and cap-envelope outputs |
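Written out with the symbols above, the calculation can be expressed roughly as follows. This is a reconstruction from the description and the worked figures on this page, not a formula published by the tool. The rate symbols $R_p$, $R_c$, and $R_o$ (prompt, cached-read, and completion rates in USD per 1K tokens) and the intermediate names $R_{\text{eff}}$, $C_{\text{req}}$, $A$, $V$, and $T$ are labels introduced here for illustration. $H$ and $M$ are taken as ratios, so 60% is 0.60.

```latex
\begin{aligned}
R_{\text{eff}} &= (1 - H)\,R_p + H\,R_c
  && \text{blended prompt rate} \\
C_{\text{req}} &= \frac{P}{1000}\,R_{\text{eff}} + \frac{O}{1000}\,R_o
  && \text{token cost per billed attempt} \\
A &= Q \cdot \max(S, 1) \cdot D
  && \text{billed attempts per month} \\
V &= A \cdot C_{\text{req}} \cdot (1 + M)
  && \text{variable spend with margin} \\
T &= V + F
  && \text{monthly total} \\
\text{Headroom} &= B - T
  && \text{negative values indicate overage}
\end{aligned}
```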
With 1,400 prompt tokens, 600 completion tokens, 60% cached prompt reads, prompt rate $0.0025 per 1K, cached prompt rate $0.00125 per 1K, and output rate $0.0100 per 1K, the effective prompt rate is $0.00175 per 1K. The base token cost is $0.00845 before margin. Adding 15% margin gives about $0.0097 per request; at 240 requests/day, 1.15x retry multiplier, and 30 days, monthly variable spend is about $80.46 before fixed fees.
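The arithmetic in this example can be reproduced with a short function. This is a sketch under the assumptions stated in the text (margin applied to variable token spend, fixed fees added last). The function name and the returned keys are hypothetical, and the parameter letters follow the symbol table.

```python
def estimate_monthly(P, O, H, Q, S, M, D, F,
                     prompt_rate, cached_rate, completion_rate):
    """Monthly LLM cost estimate; all rates are USD per 1K tokens, H and M are ratios."""
    S = max(S, 1.0)  # the retry multiplier is clamped to at least 1.00
    effective_prompt_rate = (1 - H) * prompt_rate + H * cached_rate
    base_request_cost = P / 1000 * effective_prompt_rate + O / 1000 * completion_rate
    request_cost = base_request_cost * (1 + M)   # margin on variable token spend
    billed_attempts = Q * S * D                  # successful requests plus retry overhead
    variable_spend = billed_attempts * request_cost
    return {
        "effective_prompt_rate": effective_prompt_rate,
        "base_request_cost": base_request_cost,
        "request_cost": request_cost,
        "billed_attempts": billed_attempts,
        "variable_spend": variable_spend,
        "monthly_total": variable_spend + F,     # fixed fees are added at the end
    }


result = estimate_monthly(P=1400, O=600, H=0.60, Q=240, S=1.15, M=0.15, D=30, F=0.0,
                          prompt_rate=0.0025, cached_rate=0.00125, completion_rate=0.0100)

print(f"effective prompt rate: ${result['effective_prompt_rate']:.5f} per 1K")
print(f"base request cost:     ${result['base_request_cost']:.5f}")
print(f"request with margin:   ${result['request_cost']:.4f}")
print(f"billed attempts:       {result['billed_attempts']:.0f}")
print(f"monthly variable:      ${result['variable_spend']:.2f}")
```

With these inputs the sketch reports an effective prompt rate of $0.00175 per 1K, a base request cost of $0.00845, and monthly variable spend of $80.46 across 8,280 billed attempts, matching the figures above.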
| Derived result | Rule | Interpretation limit |
|---|---|---|
| Growth month | Uses the configured growth percentage on billed attempts. | Token size and rates stay constant. |
| Peak month | Uses the larger of growth plus 25% or a 1.5x attempt multiplier. | It is a stress test, not a long-range forecast. |
| Max tokens per request at cap | Divides the variable monthly budget by current request volume and effective per-token cost. | Fixed fees reduce the variable budget first. |
| Max prompt tokens at cap | Holds completion cost fixed and solves the remaining per-request budget against effective prompt rate. | Only valid while request volume, completion tokens, rates, and margin remain unchanged. |
| Max completion tokens at cap | Holds prompt cost fixed and solves the remaining per-request budget against completion rate. | Only valid while request volume, prompt tokens, rates, and margin remain unchanged. |
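The derived rules in the table can be sketched as below. Several details are assumptions made for illustration, because the table does not pin them down: "growth plus 25%" is read as additive (growth of 20% becomes a 1.45x multiplier), "effective per-token cost" is read as the blended cost of the current prompt and completion mix, and margin is divided out of the budget before solving for tokens. All function and key names are hypothetical.

```python
def per_request_cost(prompt_tokens, completion_tokens, effective_prompt_rate, completion_rate):
    """Token cost of one billed attempt; rates are USD per 1K tokens."""
    return prompt_tokens / 1000 * effective_prompt_rate \
        + completion_tokens / 1000 * completion_rate


def cap_envelope(budget_cap, fixed_fees, billed_attempts, margin,
                 prompt_tokens, completion_tokens, effective_prompt_rate, completion_rate):
    """Largest token sizes that still fit the cap at the current volume and rates."""
    # Fixed fees come off the cap first; dividing out margin leaves a raw
    # token budget per billed attempt.
    variable_budget = max(budget_cap - fixed_fees, 0.0)
    request_budget = variable_budget / (billed_attempts * (1 + margin))

    prompt_cost = prompt_tokens / 1000 * effective_prompt_rate
    completion_cost = completion_tokens / 1000 * completion_rate
    # Assumption: blended per-token cost of the current prompt/completion mix.
    blended_per_token = (prompt_cost + completion_cost) / (prompt_tokens + completion_tokens)

    return {
        "max_tokens": request_budget / blended_per_token,
        "max_prompt_tokens": max(request_budget - completion_cost, 0.0) / effective_prompt_rate * 1000,
        "max_completion_tokens": max(request_budget - prompt_cost, 0.0) / completion_rate * 1000,
    }


def scenario_attempts(billed_attempts, growth):
    """Billed attempts for the growth month and the peak stress month."""
    growth_month = billed_attempts * (1 + growth)
    # Assumption: additive reading of "growth plus 25%", floored at 1.5x.
    peak_month = billed_attempts * max(1 + growth + 0.25, 1.5)
    return growth_month, peak_month


attempts = 240 * 1.15 * 30
envelope = cap_envelope(500.0, 49.0, attempts, 0.15, 1400, 600, 0.00175, 0.0100)
growth_month, peak_month = scenario_attempts(attempts, 0.20)

# Round-trip check: pricing the maximum prompt size should land on the cap.
check_total = attempts * 1.15 * per_request_cost(
    envelope["max_prompt_tokens"], 600, 0.00175, 0.0100) + 49.0

print({name: round(value) for name, value in envelope.items()})
print(round(growth_month), round(peak_month), round(check_total, 2))
```

Each envelope value moves one input while holding the others fixed, so the three limits cannot be combined: raising prompt tokens to their maximum leaves no room to raise completion tokens.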
The prompt draft estimator is deliberately rough. It normalizes whitespace, estimates from character count and word count, averages those two estimates, and rounds to a whole-token value. Provider tokenizers can differ by language, punctuation, code, hidden messages, tool schemas, images, and model family.
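A rough estimator of this kind might look like the sketch below. The two ratios are assumptions chosen for illustration (common rules of thumb for English text of about 4 characters per token and about 0.75 words per token). The page does not state the constants the tool actually uses, and the function name is hypothetical.

```python
import re

CHARS_PER_TOKEN = 4.0   # assumption: rule of thumb for English text
WORDS_PER_TOKEN = 0.75  # assumption: rule of thumb for English text


def estimate_tokens(draft: str) -> int:
    """Rough token estimate: average of a character-based and a word-based guess."""
    # Collapse every run of whitespace to a single space and trim the ends.
    text = re.sub(r"\s+", " ", draft).strip()
    if not text:
        return 0
    by_chars = len(text) / CHARS_PER_TOKEN
    by_words = len(text.split(" ")) / WORDS_PER_TOKEN
    return round((by_chars + by_words) / 2)


sample = "Summarize   the attached\nreport in three bullet points."
print(estimate_tokens(sample))
```

Because whitespace is normalized first, extra spaces and line breaks in a pasted draft do not change the estimate. A real tokenizer count can still differ noticeably, especially for code or non-English text.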
Displayed currency is rounded for readability, but the underlying arithmetic uses the numeric field values. Small rounding differences are normal when comparing the page to a provider invoice that reports more decimal places or separates extra billing categories.
Pricing Accuracy:
Built-in presets are a planning snapshot, not a live provider contract. The public-card refresh date is March 12, 2026, and provider pages can change model names, tokenizers, batch discounts, cache-write charges, cache-read discounts, data-residency multipliers, tool-call pricing, and deprecation status after that date.
Use Custom when a provider dashboard, invoice, procurement quote, cloud marketplace, region, batch mode, or enterprise agreement gives a different rate. For production budgets, reconcile the estimate against real usage exports before setting alerts, customer-facing prices, or internal chargebacks.
Worked Examples:
A report generator with no cache discount. A workload with 1,400 prompt tokens, 600 completion tokens, 240 requests/day, 30 billing days, prompt rate $0.0025 per 1K, and completion rate $0.0100 per 1K produces about $0.0095 in Per request cost. Monthly total (tokens + fees) is about $68.40 before fixed fees or margin.
A cached workload with retry overhead. Keeping the same token counts and rates, then adding 60% cache hits, 15% margin, $49 fixed monthly fees, and a 1.15x retry multiplier produces a monthly total near $129.46. With a $500 budget cap, Budget headroom remains positive at roughly $370.54, but Billed attempts is higher than successful request volume.
A rate-unit mistake during budget review. If a provider rate of $2.50 per 1M input tokens is pasted as 2.50 in the per-1K prompt-rate field, the prompt cost is overstated by a factor of 1,000. A surprising Budget overage should be checked against Effective prompt card and Completion card before changing the model choice.
FAQ:
Should I use preset pricing or custom rates?
Use presets for early comparison. Use Custom for current quotes, invoices, regional pricing, cloud marketplace pricing, batch pricing, or enterprise terms.
Does cache hit rate discount completion tokens?
No. The cache hit rate blends the prompt-token rate only. Completion tokens still use the completion rate.
Why are billed attempts higher than requests per day?
Retry multiplier turns successful requests into billed attempts. A value above 1.00 means repeated calls, fallback calls, reconnects, or retries are expected to be billed.
Why does the prompt draft estimate differ from my provider bill?
The draft estimator uses text length and word count. Provider tokenizers, system messages, retrieved context, tool schemas, images, reasoning tokens, and hidden billing categories can change the final count.
What should I check when the budget cap fails?
Check Completion tokens, Retry multiplier, Fixed monthly fees, and the per-1K rate fields first. Then use Spend Pressure Moves to see which single adjustment saves the most.
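The idea of testing one adjustment at a time can be sketched as follows. This is a hypothetical illustration, not the logic behind Spend Pressure Moves: the candidate moves, their sizes, and all names are invented for the example, and the baseline reuses the cached workload from the worked examples.

```python
def monthly_total(w):
    """Monthly total for a workload dict; rates are USD per 1K tokens."""
    retry = max(w["retry"], 1.0)
    rate = (1 - w["cache_hit"]) * w["prompt_rate"] + w["cache_hit"] * w["cached_rate"]
    per_request = w["prompt_tokens"] / 1000 * rate \
        + w["completion_tokens"] / 1000 * w["completion_rate"]
    variable = per_request * (1 + w["margin"]) * w["requests_per_day"] * retry * w["days"]
    return variable + w["fixed_fees"]


baseline = {
    "prompt_tokens": 1400, "completion_tokens": 600,
    "prompt_rate": 0.0025, "cached_rate": 0.00125, "completion_rate": 0.0100,
    "cache_hit": 0.60, "requests_per_day": 240, "days": 30,
    "retry": 1.15, "margin": 0.15, "fixed_fees": 49.0,
}

# Hypothetical single-field adjustments; each is applied alone to the baseline.
moves = {
    "Trim completion tokens by 20%": {"completion_tokens": 480},
    "Cut retry multiplier to 1.05": {"retry": 1.05},
    "Raise cache hit rate to 80%": {"cache_hit": 0.80},
}

base_total = monthly_total(baseline)
ranked = sorted(
    ((name, base_total - monthly_total({**baseline, **change}))
     for name, change in moves.items()),
    key=lambda item: item[1],
    reverse=True,  # largest monthly saving first
)

for name, saving in ranked:
    print(f"{name:<32} saves ${saving:.2f}/month")
```

For this workload, trimming completion tokens ranks first because completion spend is the largest share of the variable bill. A prompt-heavy workload could rank the same moves differently.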
Glossary:
- Prompt tokens: Input tokens sent to the model, including user text and any added context.
- Completion tokens: Output tokens generated by the model for a request.
- Cached reads: Repeated input tokens billed at a reduced read rate when provider caching applies.
- Billed attempts: Successful request volume after retry and fallback overhead is included.
- Retry multiplier: The factor that scales successful requests up to expected billed attempts.
- Margin uplift: A percentage added to variable token spend for chargeback, markup, or risk buffer.
- Fixed monthly fees: Monthly costs added after token spend, such as platform or observability fees.
- Budget headroom: The amount left between the modeled monthly total and the budget cap.