Large language model billing usually starts with a simple rule and ends with a messy planning question. Providers commonly charge one rate for prompt tokens and another for completion tokens, but the real month total also depends on how many requests you serve, how often retries happen, whether repeated prompt text is cached, and which fixed platform costs sit on top of token spend. This calculator turns those moving parts into one workload model so you can estimate per-request cost, daily run-rate, and monthly spend from the same assumptions.
That makes it useful well before an invoice exists. You can size a prototype, compare one model card against another, check whether longer answers are still affordable, or test whether a monthly cap survives growth. The tool accepts either a public pricing preset or custom rates, then layers in traffic, retry overhead, cached prompt reads, margin uplift, fixed monthly fees, a growth scenario, and an optional budget cap.
The result is not just a headline number. You get a summary table, a component breakdown, a usage table that separates successful requests from billed attempts, a unit-economics view, ranked savings suggestions in Spend Pressure, a cross-provider Model Ladder, a Scenario Burn chart, and a structured JSON export. Table exports support copy, CSV download, and DOCX export, while the chart tabs add image and CSV downloads for the visual views.
All calculations stay in the browser. The tool has no server-side pricing path, so pasted prompt drafts and manual assumptions are processed locally on the page.
The calculator builds monthly spend from four layers. First it prices prompt tokens and completion tokens separately. Prompt tokens can be blended between the normal prompt rate and the cached-read rate according to the cache hit percentage, while completion tokens always use the completion rate. Second it converts successful daily traffic into billed attempts by multiplying requests per day by the retry multiplier. Third it applies any margin uplift to the variable token cost. Fourth it adds fixed monthly fees after the usage-driven cost has already been calculated.
The main summary rows use one baseline formula set. Growth and peak views do not introduce a new pricing method; they rerun the same per-request economics at higher billed-attempt counts. The growth month scales billed attempts by your configured growth percentage, while the peak month uses whichever multiplier is larger: the growth percentage plus another 25 percentage points, or a flat 1.50 times the base billed-attempt rate. That makes the chart a workload stress test rather than a forecasting engine.
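That peak rule can be sketched as a small helper. This is a sketch of the behavior described above, not the page's actual source:

```python
def scenario_multipliers(growth_pct):
    """Growth scales billed attempts by the configured percentage; peak takes
    whichever is larger: growth plus 25 percentage points, or a flat 1.50x."""
    growth = 1 + growth_pct / 100
    peak = max(1 + (growth_pct + 25) / 100, 1.50)
    return growth, peak

# At 20% growth the flat 1.50x floor wins the peak month;
# at 35% growth the growth-plus-25-points path (1.60x) wins instead.
g20, p20 = scenario_multipliers(20)
g35, p35 = scenario_multipliers(35)
```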
Written out, the baseline month is: effective prompt rate = (1 − H) × prompt rate + H × cached-read rate; per-request token cost = P × effective prompt rate + O × completion rate; billed attempts per day = Q × S; monthly token cost = per-request token cost × Q × S × D × (1 + M / 100); monthly total = monthly token cost + F. In those expressions, P is prompt tokens per request, O is completion tokens per request, H is the cache-hit ratio from 0 to 1, Q is successful requests per day, S is the retry multiplier, M is the margin percentage, D is billing days per month, and F is fixed monthly fees.
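The four layers translate directly into a few lines of Python. The rates and workload numbers below are illustrative placeholders, not values taken from the page:

```python
def monthly_total(P, O, H, Q, S, M, D, F,
                  prompt_rate, cached_rate, completion_rate):
    """Baseline month; rates are USD per token (e.g. 2.00 / 1e6 for $2 per 1M)."""
    effective_prompt_rate = (1 - H) * prompt_rate + H * cached_rate
    per_request = P * effective_prompt_rate + O * completion_rate  # token cost per success
    billed_attempts_per_day = Q * S                                # retries inflate billing
    token_month = per_request * billed_attempts_per_day * D * (1 + M / 100)
    return token_month + F                                         # fixed fees land last

# Illustrative card: $2/M prompt, $1/M cached read, $8/M completion.
total = monthly_total(P=1000, O=500, H=0.5, Q=100, S=1.1, M=10, D=30, F=50,
                      prompt_rate=2 / 1e6, cached_rate=1 / 1e6,
                      completion_rate=8 / 1e6)
```

Note that margin uplift multiplies only the token layer, while fixed fees are added afterward, matching the ordering described above.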
The prompt draft box is only an estimator. The page normalizes the pasted text, counts both characters and words, estimates tokens from roughly four characters per token and 1.32 tokens per word, averages the two estimates, then rounds to the nearest whole token. That is a practical starting point for English-like text, but real tokenizer counts still vary by model, language, punctuation, hidden system text, and retrieved context.
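A minimal sketch of that heuristic, assuming simple whitespace normalization (the page's exact normalization rules are not specified):

```python
def estimate_tokens(text: str) -> int:
    """Average a chars/4 estimate with a words*1.32 estimate, then round."""
    normalized = " ".join(text.split())   # assumed: collapse runs of whitespace
    if not normalized:
        return 0
    by_chars = len(normalized) / 4
    by_words = len(normalized.split()) * 1.32
    return round((by_chars + by_words) / 2)

# "hello world foo" is 15 characters and 3 words:
# chars/4 = 3.75, words*1.32 = 3.96, average 3.855, rounded to 4.
tokens = estimate_tokens("hello world foo")
```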
| Lever | How the calculator uses it | What to keep in mind |
|---|---|---|
| Prompt tokens | Priced at the effective prompt card after any cached-read blend | Long system prompts and retrieval context often hide here |
| Completion tokens | Priced only at the completion card | Shorter answers often cut cost faster than model swaps |
| Cache hit rate | Moves a share of prompt tokens onto the cached-read price | This tool does not add cache-write or cache-storage charges |
| Retry multiplier | Turns successful requests into billed attempts | Retry cost can rise even when the user-facing traffic stays flat |
| Fixed monthly fees | Added after the variable token total is calculated | Useful for platform, observability, or support costs |
| Budget cap | Compares the monthly total against a ceiling and derives safe envelopes | It does not change the cost math; it only evaluates the result |
The preset ladder covers public rate cards from OpenAI, Anthropic, Google, Mistral, and Cohere. If you switch to Custom, the live rate fields stay editable and the ladder still compares your current rates against the verified public presets. Some rows may also carry provider status notes, which are useful when a public card is already deprecated but still worth keeping as a comparison point.
Start with the parts of the workload you can measure directly. Typical prompt size, typical answer size, successful requests per day, and active billing days usually give a better first estimate than trying to guess a perfect model choice up front. Once those numbers look realistic, move to the pricing preset or custom card that best matches the workload you are planning.
If you are trying to reduce spend, it helps to separate three different questions. Token-size questions live in Prompt tokens and Completion tokens. Reliability questions live in Retry multiplier. Budget-floor questions live in Fixed monthly fees. Mixing those together can make a cost problem feel harder than it is, so the tool keeps them in separate tables and recommendations.
Use each control for the job it was built for:

- Use Prompt draft when you need a fast estimate, but replace it with observed token counts when you move from rough planning to launch budgeting.
- Set Retry multiplier early if your flow can fan out, reconnect, or fall back. A small retry increase can matter more than a minor price-card change.
- Treat Cache hit rate as prompt-side relief only. In this model, completion tokens never get a cache discount.
- Add Fixed monthly fees if you want the result to reflect the real operating floor rather than token spend alone.
- Set Monthly budget cap when you need to answer capacity questions such as how many requests per day or how many tokens per request fit inside a monthly target.

When a preset shows a status note, treat that row as a planning reference, not an automatic recommendation. A deprecated public card may still be worth comparing for cost history, but it is rarely the right answer for fresh production work.
A typical setup pass looks like this:

1. Pick a pricing preset, or Custom if you already know the exact prompt, cached-read, and completion rates.
2. Enter Prompt tokens and Completion tokens. If prompt size is still rough, paste representative text into Prompt draft and let the estimator seed the prompt field.
3. Set Requests per day and Billing days per month. These two fields turn a unit-cost idea into an operating-cost estimate.
4. Open Advanced and add the assumptions that change planning decisions most: cached-read pricing, cache hit rate, retry multiplier, margin uplift, fixed monthly fees, growth scenario, and budget cap.
5. Read Summary first to confirm the per-request, daily, and monthly totals move the way you expect.
6. Work through Components, Usage, Unit Economics, Spend Pressure, Model Ladder, and Scenario Burn to figure out whether the next move should be shorter prompts, shorter answers, fewer retries, a different model card, or a tighter budget target.

Per request is the unit price of the current mix. It matters, but it is usually not the decision number. The operational number is Monthly total (tokens + fees), because that is where token size, billed attempts, margin, and fixed fees finally meet. A cheap-looking request can still miss a cap once it is repeated enough times.
Components needs a careful read. Those rows are diagnostic layers, not a strict additive ledger. Cached-read relief and retry overhead describe pressure already folded into the billed total, so you should use them to explain the current month rather than sum them again by hand.
| Result area | Best question it answers | Common misread |
|---|---|---|
| Summary | What will this workload cost per request, per day, and per month? | Treating per-request cost as the whole budget story |
| Components | Which billing layer is pushing the total upward? | Adding diagnostic rows together as if they were separate charges |
| Usage | How much traffic is successful volume versus retry overhead? | Ignoring the difference between requests and billed attempts |
| Unit Economics | How expensive is the current mix per 1K or 1M tokens, and what fits under the cap? | Assuming a lower effective token cost automatically means a lower invoice under every workload |
| Spend Pressure | Which single change saves the most money from the current baseline? | Taking a model switch as a quality recommendation instead of a cost-only comparison |
| Model Ladder | How does the same workload price out across verified preset cards? | Forgetting that only the rate card changes while all workload assumptions stay fixed |
| Scenario Burn | What happens if prompt size, answer size, cache share, growth, or peak demand changes? | Reading it as a long-range forecast instead of a nearby stress test |
Budget results are straightforward. Positive headroom means the configured month still fits under the cap. Overage means it does not. The derived request and token ceilings are useful planning aids, but they are only as good as the workload assumptions already entered above them.
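One plausible way to derive a request ceiling from the cap is to subtract fixed fees, then divide by the fully loaded monthly cost of one daily request. The page does not publish its exact derivation or rounding, so treat this formula as an assumption:

```python
import math

def max_requests_per_day(cap, fixed_fees, per_request_token_cost,
                         retry_multiplier, margin_pct, billing_days):
    """Assumed envelope: (cap - fees) / monthly cost of one daily request."""
    cost_per_daily_request = (per_request_token_cost * retry_multiplier
                              * (1 + margin_pct / 100) * billing_days)
    return math.floor((cap - fixed_fees) / cost_per_daily_request)

# Illustrative: $100 cap, $20 in fees, $0.01 per request, 1.1x retries, 30 days.
ceiling = max_requests_per_day(100, 20, 0.01, 1.1, 0, 30)
```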
With the default setup, the workload uses 1,400 prompt tokens, 600 completion tokens, 240 requests per day, 30 billing days, and the GPT-4o preset card. The calculator returns Per request = $0.0095, Daily = $2.28, and Monthly total (tokens + fees) = $68.40 across 14.4 million monthly tokens. The growth scenario rises to $82.08 and the peak month rises to $102.60. That is a clean baseline because the rate card, token mix, and traffic are all visible in one place.
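Those figures can be reproduced by hand, assuming the commonly published GPT-4o card of $2.50 per million prompt tokens and $10.00 per million completion tokens (quoted here as an assumption; check the live preset):

```python
prompt_rate = 2.50 / 1_000_000       # assumed GPT-4o prompt rate, USD per token
completion_rate = 10.00 / 1_000_000  # assumed GPT-4o completion rate

per_request = 1400 * prompt_rate + 600 * completion_rate  # -> 0.0095
daily = per_request * 240                                 # -> 2.28
monthly = daily * 30                                      # -> 68.40
monthly_tokens = (1400 + 600) * 240 * 30                  # -> 14,400,000
growth = monthly * 1.20                                   # 20% growth -> 82.08
peak = monthly * 1.50                                     # 1.50x floor -> 102.60
```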
Keep the same baseline workload, set Cache hit rate to 60%, and add a Monthly budget cap of $60. The blended prompt card drops enough to move Per request to $0.00845, and the summary shows about $7.56 in monthly cached-read relief. Even so, the monthly total lands at $60.84, so the budget result still shows overage. This is a good example of why efficiency wins and budget fit are related but not identical questions.
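The cached-read blend checks out arithmetically, again assuming the commonly published GPT-4o rates with a $1.25 per million cached-read card (an assumption; verify against the live preset):

```python
prompt_rate = 2.50 / 1_000_000       # assumed GPT-4o prompt rate
cached_rate = 1.25 / 1_000_000       # assumed GPT-4o cached-read rate
completion_rate = 10.00 / 1_000_000  # assumed GPT-4o completion rate
hit = 0.60

blended_prompt_rate = (1 - hit) * prompt_rate + hit * cached_rate  # $1.75 per 1M
per_request = 1400 * blended_prompt_rate + 600 * completion_rate   # -> 0.00845
monthly = per_request * 240 * 30                                   # -> 60.84
relief = (0.0095 - per_request) * 240 * 30                         # -> 7.56 saved
```

Even with $7.56 of cached-read relief, $60.84 still overshoots a $60 cap, which is exactly the efficiency-versus-budget-fit distinction the example makes.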
Suppose a custom workflow averages 2,200 prompt tokens, 900 completion tokens, 1,200 successful requests per day, 22 billing days, 40% cached prompt reads, a 1.12 retry multiplier, 15% margin uplift, and $250 in fixed monthly fees. The calculator treats that as 1,344 billed attempts per day and produces a monthly total of $705.64, of which $250 is the fixed-fee floor. With a 35% growth scenario the month rises to $865.12, and the peak month reaches $979.03. This is the kind of setup where Usage, Components, and Spend Pressure usually matter more than the headline rate card.
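The scenario math behind that example can be cross-checked from the quoted totals. The custom rate card itself is not shown, so the variable base here is simply the stated monthly total minus the fixed-fee floor, and the results land within a cent of the page's rounded figures:

```python
billed_attempts = 1200 * 1.12          # -> 1,344 billed attempts per day
monthly_total = 705.64
fixed_fees = 250.0
variable = monthly_total - fixed_fees  # token-driven spend after margin: 455.64

growth = variable * 1.35 + fixed_fees  # 35% growth -> about 865.11
# Peak: growth plus 25 points (1.60x) beats the flat 1.50x floor here.
peak = variable * max(1 + (35 + 25) / 100, 1.50) + fixed_fees  # -> about 979.02
```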
The token estimate differs from real tokenizer counts because the page uses a blended character-and-word heuristic, not a provider tokenizer. It is meant to get you close enough for planning, not to reproduce a billing ledger exactly.

Cached pricing never discounts completions. In this model it only affects prompt tokens; completion tokens always use the completion rate.

Cache-side extras are out of scope. The calculator models discounted cached reads only, so if your provider charges separate cache writes, storage, search grounding, tool use, taxes, or regional surcharges, you need to account for those outside this page.

The budget cap does not change the cost math. It is a comparison layer: it reports headroom or overage and derives request or token envelopes, but it does not alter the core cost formula.

The Model Ladder keeps prompt size, completion size, traffic, retries, margin, cache hit rate, and fixed fees constant. Only the preset rate card changes, so the ladder is a price comparison under the same workload.

No text leaves the page. This tool has no server-side processing path, so the calculations and prompt estimate stay in the browser.

Preset rates are not guaranteed to stay current. Public pricing pages can change, so use the presets as a checked starting point, then refresh the rates or switch to custom values when accuracy matters for procurement or launch approval.