How to Monitor LLM Costs in Production (Without Going Broke)
The first month of running an LLM feature in production is usually cheap. The second month is often a surprise. The third month is when someone on the team opens the billing dashboard, stares at the number, and asks how this happened. This post walks through how to instrument LLM cost from day one, catch cost regressions before they blow up the bill, and systematically cut token waste across your prompt portfolio.
The techniques here are generic, but we will call out where EmberLM ships the piece out of the box.
Why LLM cost is harder to track than it looks
Traditional software cost is easy to track. You pay for compute, storage, and bandwidth; those grow roughly linearly with usage, and you can cap them at the infrastructure layer.
LLM cost has three properties that make it harder.
Per-call variance. The same prompt can cost wildly different amounts depending on input length, output length, and model. A simple summarization might cost a tenth of a cent on one user input and three cents on another.
Silent regressions. A small prompt edit that adds fifty tokens to the system prompt multiplies across every call. A week later you discover your cost per call is up 30 percent.
Cascading calls. Agent frameworks often make multiple model calls per user turn. The user sees one action. The bill sees five calls, each billed independently.
Without instrumentation, none of this is visible until the invoice arrives.
Step 1: log every call
The baseline. Every single model call in your production system emits a log line with:
- Timestamp
- Prompt name or ID
- Model
- Input token count
- Output token count
- Cost in USD
- Latency in milliseconds
- User ID or workspace ID
- Optional metadata (source, experiment flag, etc.)
If your provider returns token counts on each call, log them; both OpenAI and Anthropic include a usage object with input and output token counts in their API responses. If your provider does not, compute the counts with a tokenizer library matched to the model.
Never log the full prompt or output in the cost-tracking log. That is a separate system with a separate retention policy. The cost log needs counts and costs only.
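A minimal sketch of what emitting that log line can look like. The field names and the `emit_cost_log` helper are illustrative, not a real SDK API:

```python
import json
import time

def emit_cost_log(prompt_id, model, in_tokens, out_tokens, cost_usd,
                  latency_ms, workspace_id, metadata=None):
    """Emit one structured log line per model call.

    Counts and costs only; never the prompt or output text.
    """
    record = {
        "ts": time.time(),
        "prompt_id": prompt_id,
        "model": model,
        "input_tokens": in_tokens,
        "output_tokens": out_tokens,
        "cost_usd": cost_usd,
        "latency_ms": latency_ms,
        "workspace_id": workspace_id,
        "metadata": metadata or {},
    }
    print(json.dumps(record))  # in production, write to your log pipeline
    return record
```

Whatever shape you choose, keep it stable: every downstream chart, alert, and regression check in the following steps queries these fields.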
EmberLM's SDK automatically emits this log for every call. You do not need to wire it yourself.
Step 2: compute cost per call consistently
Different models cost different amounts. Within a single provider, input and output tokens cost different amounts. Keep a single source of truth for pricing, keyed by model name, for both input and output.
```python
PRICING = {
    # USD per 1,000 tokens
    "claude-haiku-4-5": {"input": 0.0008, "output": 0.004},
    "claude-sonnet-4-6": {"input": 0.003, "output": 0.015},
    "claude-opus-4-7": {"input": 0.015, "output": 0.075},
    "gpt-5-2": {"input": 0.005, "output": 0.02},
}

def cost_usd(model, in_tokens, out_tokens):
    p = PRICING[model]
    return (in_tokens * p["input"] + out_tokens * p["output"]) / 1000

# Example: cost_usd("claude-sonnet-4-6", 1200, 300)
#   = (1200 * 0.003 + 300 * 0.015) / 1000 = 0.0081
```
Compute cost at log time, not at report time. If you recompute from raw token counts later, a pricing-table update silently rewrites history; each log entry should carry the price in effect when the call was made.
Step 3: build the dashboard
The minimum useful dashboard has four charts.
Daily spend. Total cost per day, over the last 30 days. Spikes are visible at a glance.
Spend by prompt. Cost attributed to each prompt name. Shows which prompts are the budget leaders.
Cost per call, per prompt. Average cost of a single run of each prompt. Shows which prompts have become more expensive over time.
Token distribution. Histogram of output tokens per call. Outliers at the right tail are usually bugs.
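Two of these rollups, "spend by prompt" and "cost per call, per prompt", can be sketched in plain Python over the Step 1 log records; a real dashboard would run the equivalent query in your BI tool. The field names follow the log schema above:

```python
from collections import defaultdict

def spend_by_prompt(records):
    """Total spend and average cost per call, keyed by prompt ID."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for r in records:
        totals[r["prompt_id"]] += r["cost_usd"]
        counts[r["prompt_id"]] += 1
    return {p: {"total": totals[p], "per_call": totals[p] / counts[p]}
            for p in totals}
```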
Any BI tool works. Metabase, Grafana, Superset, Looker. EmberLM ships these four charts in the Cost Analytics page out of the box, no dashboard construction required.
Step 4: set a budget and alert on overages
Every production workload needs a monthly budget. The budget should be realistic but not padded. If your team thinks the feature should cost $200 per month, set the budget at $250 and alert when you hit $200.
Alerts should fire on three conditions.
Daily spend anomaly. Today's spend is more than three standard deviations above the 30-day average. Fire immediately.
Budget pace. If today is the 15th of the month and you have already spent 75 percent of the budget, fire.
Per-prompt cost jump. Cost per call on a specific prompt is up more than 20 percent week-over-week. Fire.
Set these to Slack or email. Treat the third one especially seriously; it often means someone changed a prompt and did not realize they made it longer.
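The three conditions can be sketched as simple predicates over your cost log. The exact thresholds here (the pace multiplier in particular) are assumptions to tune for your workload:

```python
import calendar
import statistics
from datetime import date

def spend_anomaly(daily_spend, today_spend):
    """Fire if today is more than 3 standard deviations above the mean
    of the trailing daily totals."""
    mean = statistics.mean(daily_spend)
    stdev = statistics.stdev(daily_spend)
    return today_spend > mean + 3 * stdev

def budget_pace(spent, budget, today=None, pace_factor=1.5):
    """Fire if spend-to-date is running ahead of the month's elapsed
    fraction by pace_factor (assumed threshold, e.g. 75% spent by the
    15th of a 30-day month)."""
    today = today or date.today()
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    elapsed = today.day / days_in_month
    return spent / budget >= elapsed * pace_factor

def prompt_cost_jump(this_week_avg, last_week_avg, threshold=0.20):
    """Fire if per-call cost on a prompt is up >20% week-over-week."""
    return this_week_avg > last_week_avg * (1 + threshold)
```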
Step 5: attribute cost to features and users
At any real scale you want to know who is driving the cost. Tag every call with enough metadata that you can slice later.
At minimum: user ID, workspace ID, feature name. Ideally also: experiment variant, client version, entry point.
With tagging, questions like "which workspace is generating 40 percent of our API calls" become one query instead of a week of detective work. If your pricing plan caps calls per workspace, you can also bill accurately and cut off heavy users when they hit the cap.
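With the workspace tag in place, the "who is driving the cost" question reduces to a single aggregation. A sketch over the Step 1 log records:

```python
from collections import Counter

def top_workspaces(records, n=5):
    """Return the n workspaces with the highest total spend."""
    spend = Counter()
    for r in records:
        spend[r["workspace_id"]] += r["cost_usd"]
    return spend.most_common(n)
```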
EmberLM's Cost Analytics page segments spend by workspace automatically. For per-feature attribution, pass a feature metadata field with each SDK call.
Step 6: run cost regressions alongside quality regressions
Here is the practice most teams miss. When you edit a prompt, you run a regression to make sure quality does not drop. You also need to run a regression to make sure cost does not spike.
The check is simple. Before the edit, measure average cost per call on your golden dataset. After the edit, measure it again. If the new cost is more than 15 percent higher, flag it for review. Either the change should be smaller, or the quality win should justify the cost.
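The gate itself is a one-function sketch. The 15 percent threshold comes from the text; the per-call cost lists are whatever your eval harness produces for each run over the golden dataset:

```python
def cost_regression(baseline_costs, candidate_costs, threshold=0.15):
    """Return True (flag for review) if average cost per call rose more
    than `threshold` between the baseline and candidate prompt versions."""
    base = sum(baseline_costs) / len(baseline_costs)
    cand = sum(candidate_costs) / len(candidate_costs)
    return cand > base * (1 + threshold)
```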
EmberLM's regression runs compute cost per run automatically. Cost delta between runs is visible next to pass rate delta.
Step 7: cut token waste systematically
Once you have visibility, you will find waste. Three patterns keep appearing.
Bloated system prompts. Teams accrete instructions over time. "Also, never say X." "Also, always include Y." By month six, the system prompt is three thousand tokens, and every single call is paying for all of them.
Fix. Audit your top five prompts. Read the system prompt. Strike anything that is redundant, never-happened, or already obvious to the model. Target a 30 percent reduction on any prompt over two thousand tokens. Run the regression afterward to make sure quality held.
Over-long outputs. The model returns five paragraphs when one would do. Users do not read five paragraphs. You are paying for tokens nobody consumes.
Fix. Set explicit length caps in the system prompt: "Respond in under 150 words. Do not include preamble." Most models respect these limits, and output token counts drop dramatically.
Wrong model tier. Teams use Opus for tasks Haiku handles just fine. The quality is indistinguishable on the dataset but the cost is 10x.
Fix. For every prompt, run the golden dataset with three model tiers and compare pass rate and cost. If a cheaper model has pass rate within three points, use it. Only upgrade to a more expensive model when the quality delta is real.
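The tier-selection rule can be sketched as: pick the cheapest model whose pass rate is within three points of the best. The `results` mapping (model name to pass rate percent and average cost per call) is illustrative data, not measured numbers:

```python
def pick_model_tier(results, max_quality_drop=3.0):
    """results: {model_name: (pass_rate_percent, avg_cost_per_call)}.
    Returns the cheapest model within max_quality_drop points of the
    best pass rate."""
    best_pass = max(pr for pr, _ in results.values())
    eligible = {m: (pr, c) for m, (pr, c) in results.items()
                if pr >= best_pass - max_quality_drop}
    return min(eligible, key=lambda m: eligible[m][1])
```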
EmberLM's Token Optimizer takes a prompt, analyzes it for redundancy, and returns a shorter version with an estimated token savings and suggested rewrites. Use it on any prompt that seems bloated.
Step 8: quarterly cost review
Once a quarter, review:
- Total spend vs budget
- Top 10 most expensive prompts
- Prompts whose cost per call grew more than 20 percent over the quarter
- Prompts that could move to a cheaper model tier based on regression evidence
This is a 60-minute meeting. Outputs are either changes to prompts or changes to the budget. Do it on the calendar. Do not skip it.
Common mistakes
Ignoring the SDK layer. Teams build their own HTTP calls to the provider, roll their own cost tracking, and end up with inconsistent logs. Use the provider SDK or a higher-level wrapper. Both OpenAI and Anthropic SDKs return usage. EmberLM's SDK unifies this across providers.
Not logging until after the first surprise bill. Log from day one. The retroactive cost question is much harder to answer without data.
Treating cost as engineering-only. Cost is a product concern. Product managers should see the dashboard. Product decisions should include cost. A feature that costs $20 per active user per month is a different business from a feature that costs $2.
Optimizing without measuring. Do not guess which prompt is expensive. Measure. The expensive one is often not the one you think.
What the stack looks like end-to-end
From bottom to top:
- SDK wrapper emits a structured cost log per call
- Logs pipe to a durable store (Postgres, BigQuery, Clickhouse)
- A dashboard slices the data by prompt, feature, workspace
- Alerts fire on anomalies, budget pace, per-prompt jumps
- Cost regressions run on every prompt change
- Quarterly reviews roll up everything
EmberLM bundles layers 1 through 5 into one product. You wire the SDK once, and the Cost Analytics page delivers the rest.
Getting started
If you have not started logging, start today. You will thank yourself next month.
If you are already logging but have no dashboard, build one in Metabase in an afternoon. Or sign up for EmberLM and skip the setup.
If you have a dashboard but no regression gate, add one. The first time it catches a 30 percent cost regression before ship, it pays for itself.
Start at emberlm.dev/signup.
Summary
LLM cost is not controllable without instrumentation. With instrumentation, it is boring. Log every call. Budget and alert. Regression-test cost alongside quality. Review quarterly. Cut waste. Done.
Teams that do this spend less, ship more, and never have the "what happened to our bill" meeting. Teams that do not will have that meeting, and it is a bad meeting.
Start logging at emberlm.dev/signup.