Every cost audit we run starts the same way: we ask to see the LLM bill, and then we ask what’s actually in it. The second question almost never has an answer. Teams know the number is going up. They rarely know which feature, which model, or which careless prompt is driving it.
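Answering that second question is mostly bookkeeping: tag every call with the feature that made it, then sum cost per feature and model. The sketch below is one way that could look. It assumes a usage log with hypothetical field names (`feature`, `model`, `input_tokens`, `output_tokens`) and placeholder prices; none of these come from a real provider or from our audits.

```python
from collections import defaultdict

# Placeholder prices in USD per million tokens: (input, output).
# Model names and figures are hypothetical; substitute your provider's rates.
PRICES = {
    "frontier-model": (5.00, 15.00),
    "small-model": (0.20, 0.60),
}

# Hypothetical usage records, one per LLM call, tagged with the calling feature.
USAGE_LOG = [
    {"feature": "ticket_triage", "model": "frontier-model",
     "input_tokens": 1200, "output_tokens": 20},
    {"feature": "ticket_triage", "model": "frontier-model",
     "input_tokens": 1100, "output_tokens": 18},
    {"feature": "contract_review", "model": "frontier-model",
     "input_tokens": 9000, "output_tokens": 800},
]


def cost_by_feature_and_model(log, prices):
    """Sum spend in USD for each (feature, model) pair in the log."""
    totals = defaultdict(float)
    for record in log:
        input_price, output_price = prices[record["model"]]
        cost = (
            record["input_tokens"] * input_price
            + record["output_tokens"] * output_price
        ) / 1_000_000
        totals[(record["feature"], record["model"])] += cost
    return dict(totals)


if __name__ == "__main__":
    breakdown = cost_by_feature_and_model(USAGE_LOG, PRICES)
    # Print the most expensive (feature, model) pairs first.
    for key, usd in sorted(breakdown.items(), key=lambda kv: -kv[1]):
        print(key, round(usd, 6))
```

Sorting by spend puts the features worth investigating at the top of the list.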
When we do the line-by-line, the same three patterns show up nearly every time.
1. One model for everything
The first AI feature ships on the best frontier model available, because that’s what was open in the playground when the prototype worked. Then the second feature reuses the same client. And the third. Six months later, a request that classifies a support ticket into one of five buckets is being answered by a model priced for graduate-level reasoning.
The fix isn’t exotic: route each task to the cheapest model that still nails it. Bulk classification, extraction, and routine summarisation go to small or open models. Frontier capacity is reserved for the work that genuinely needs it. Done well, this alone usually takes a third off the bill — and often makes the product faster, because the small models respond quicker.
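As a rough illustration, routing can start as a lookup table from task type to model tier, with the frontier model as the fallback for anything unlisted. This is a minimal sketch: the task labels, model names, and prices are hypothetical placeholders, and a production router would also need quality checks before a task is moved to a cheaper tier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    usd_per_million_input_tokens: float  # placeholder price, not a real rate


# Hypothetical tiers; swap in the models you actually use.
SMALL = ModelTier("small-model", 0.20)
FRONTIER = ModelTier("frontier-model", 5.00)

# Tasks that evaluation has shown the small tier handles well (assumed here).
ROUTES = {
    "classification": SMALL,
    "extraction": SMALL,
    "summarisation": SMALL,
}


def pick_model(task_type: str) -> ModelTier:
    """Return the cheapest approved tier; unknown tasks fall back to frontier."""
    return ROUTES.get(task_type, FRONTIER)


def input_cost_usd(model: ModelTier, tokens: int) -> float:
    """Input-side cost in USD of sending `tokens` tokens to `model`."""
    return tokens / 1_000_000 * model.usd_per_million_input_tokens


if __name__ == "__main__":
    for task in ("classification", "multi_step_reasoning"):
        model = pick_model(task)
        print(task, "->", model.name, input_cost_usd(model, 1_000_000))
```

Defaulting to the frontier tier means a new or mislabelled task costs more rather than degrading quietly, which is usually the safer failure mode.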
2. Paying for the same answer twice
A surprising share of production traffic is near-duplicate. The same FAQ, the same product question, the same document summarised again because nobody cached the first result. You are paying full price, every time, for an answer you already generated.
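Semantic caching, which matches on meaning rather than exact string, collapses a lot of that traffic to zero generation tokens: a cache hit returns the stored answer and skips the model call, leaving only the cost of a cheap embedding lookup. For support-style workloads we routinely see cache hit rates that would make a CFO emotional.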
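The mechanics can be sketched in a few lines: embed the query, compare it with the embeddings of previously answered queries, and reuse the stored answer when similarity clears a threshold. To stay self-contained, the sketch below swaps in a toy bag-of-words vector for a real embedding model, so it only catches rewordings that share most of their words. `toy_embed`, `SemanticCache`, `answer`, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold  # assumed cut-off; tune on real traffic
        self.entries = []  # list of (vector, answer) pairs

    def get(self, query: str):
        """Return the best stored answer at or above the threshold, else None."""
        vector = toy_embed(query)
        best_score, best_answer = 0.0, None
        for stored_vector, stored_answer in self.entries:
            score = cosine(vector, stored_vector)
            if score > best_score:
                best_score, best_answer = score, stored_answer
        return best_answer if best_score >= self.threshold else None

    def put(self, query: str, answer_text: str) -> None:
        self.entries.append((toy_embed(query), answer_text))


def answer(query: str, cache: SemanticCache, call_model):
    """Serve from cache when possible; otherwise call the model and store."""
    cached = cache.get(query)
    if cached is not None:
        return cached  # cache hit: no model call is made
    result = call_model(query)
    cache.put(query, result)
    return result
```

Here `call_model` is whatever function wraps the LLM client. The linear scan is fine for a sketch; at volume the lookup would normally sit on a vector index, and the threshold needs tuning so that similar-looking questions with different correct answers are not merged.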
3. Prompts that quietly tax every call
Long system prompts, entire documents stuffed into context “just in case,” few-shot examples that stopped earning their place months ago — each one is a tax applied to every single request. At scale, trimming the prompt is one of the highest-leverage things you can do, and it costs nothing but attention.
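The arithmetic behind that tax is simple enough to run on a napkin: fixed prompt tokens, times request volume, times the input price. The sketch below uses made-up figures (a 3,000-token prompt trimmed to 1,200, two million requests a month, $3 per million input tokens) purely to show the shape of the calculation; they are assumptions, not measurements.

```python
def monthly_overhead_usd(
    overhead_tokens: int,
    requests_per_month: int,
    usd_per_million_input_tokens: float,
) -> float:
    """Monthly cost in USD of tokens that are sent with every request."""
    total_tokens = overhead_tokens * requests_per_month
    return total_tokens / 1_000_000 * usd_per_million_input_tokens


if __name__ == "__main__":
    # All figures below are hypothetical.
    requests = 2_000_000
    price = 3.00  # USD per million input tokens

    before = monthly_overhead_usd(3_000, requests, price)
    after = monthly_overhead_usd(1_200, requests, price)

    print(f"before trimming: ${before:,.2f}")  # before trimming: $18,000.00
    print(f"after trimming:  ${after:,.2f}")   # after trimming:  $7,200.00
    print(f"saved per month: ${before - after:,.2f}")  # saved per month: $10,800.00
```

Because the overhead scales linearly with volume, the same trim is worth ten times as much at ten times the traffic.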
The point
None of this requires a model change you’ll have to defend to your users. It’s routing, caching, and hygiene. We guarantee a 40% reduction because, across the stacks we’ve seen, clearing that bar has been the easy part — the hard part is just looking.
If your AI bill is growing faster than your usage can explain, that gap is the opportunity. Let’s find it.