When a finance team first looks at the GenAI line item, the instinct is to treat it like cloud compute and optimise per call. Reduce token counts. Pick a cheaper model. Negotiate volume pricing. Those moves help, but they are not the levers that change a bill from "alarming" to "predictable." The structural levers sit one layer deeper, in how the application is architected and how usage scales as the product matures.
1. Model Cascade vs Single-Model Design
Most production GenAI systems route every request through a single capable model. That is the most expensive way to ship AI. A cascade routes simple requests to a small, cheap model first, escalating to a more capable model only when the small one cannot handle it, a condition usually detected via a confidence score, a failed validation check, or an explicit refusal. Done well, the cascade can cut cost by 60–80% with no measurable quality drop, because most requests in real systems are mundane.
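To make that concrete, here is a minimal sketch of the routing logic. The model names, the `call_model` placeholder, and the acceptance check are all illustrative, not prescriptive; in a real system the escalation trigger is the part worth the most design effort.

```python
# Sketch of a two-tier model cascade. `call_model` stands in for your
# provider's SDK; model names and the acceptance check are illustrative.

CHEAP_MODEL = "small-model-v1"      # hypothetical cheap tier
CAPABLE_MODEL = "large-model-v1"    # hypothetical expensive tier

def call_model(model: str, prompt: str) -> str:
    """Placeholder for a real SDK call (e.g. a chat completion)."""
    raise NotImplementedError

def is_acceptable(response: str) -> bool:
    # Escalation trigger: in practice this might be a confidence score,
    # a schema validation, or a check for an explicit refusal.
    return bool(response) and "cannot answer" not in response.lower()

def cascade(prompt: str) -> str:
    draft = call_model(CHEAP_MODEL, prompt)
    if is_acceptable(draft):
        return draft                          # most requests stop here
    return call_model(CAPABLE_MODEL, prompt)  # escalate the hard minority
```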
2. Caching at the Right Layer
Prompt caching amortises the cost of processing the stable portion of a prompt across the many requests that share it. For systems with long, stable system prompts and short, varying user inputs, the savings are dramatic: 50% or more on the cached portion. The catch is that caching only helps if the cacheable portion is genuinely stable, and most providers cache a prefix, so the stable content must come first. Systems that rebuild the prompt context on every request (tool definitions, conversation history, retrieved documents in random order) get little benefit.
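A back-of-the-envelope calculation shows why the stable prefix matters. The price and the 90% discount on cached input tokens below are assumptions for illustration; substitute your provider's actual rates.

```python
# Rough estimate of prompt-cache savings. The price and the 90% discount
# on cached input tokens are assumptions; plug in your provider's rates.

PRICE_PER_MTOK_INPUT = 3.00   # assumed $/1M input tokens
CACHED_DISCOUNT = 0.90        # assumed: cached tokens cost 10% of normal

def monthly_input_cost(requests: int, stable_tokens: int,
                       varying_tokens: int, cache_hit_rate: float) -> float:
    full_price = (stable_tokens + varying_tokens) / 1e6 * PRICE_PER_MTOK_INPUT
    cached_price = (
        stable_tokens * (1 - CACHED_DISCOUNT) + varying_tokens
    ) / 1e6 * PRICE_PER_MTOK_INPUT
    per_request = (cache_hit_rate * cached_price
                   + (1 - cache_hit_rate) * full_price)
    return requests * per_request

# 1M requests/month, 5,000-token stable system prompt, 300-token user input:
without = monthly_input_cost(1_000_000, 5000, 300, cache_hit_rate=0.0)
with_cache = monthly_input_cost(1_000_000, 5000, 300, cache_hit_rate=0.95)
print(f"${without:,.0f} -> ${with_cache:,.0f}")  # $15,900 -> $3,075
```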
3. Output Length Discipline
Output tokens are typically 4–5x more expensive than input tokens. A model that generates a 2,000-token response when you needed 200 is not just wasteful in volume — it is wasteful at the most expensive part of the pricing curve. Explicit output length constraints in the system prompt, plus structured output formats that the model cannot pad, are some of the highest-ROI changes you can make.
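A sketch of what that looks like in a request, assuming a common chat-completion-style API with a `max_tokens` parameter; the cap, the model name, and the cost figures in the comments are illustrative.

```python
# Sketch of output-length discipline: a hard token cap plus a structured
# format the model cannot pad. Parameter names follow common
# chat-completion APIs (`max_tokens`); adapt to your SDK.

SYSTEM_PROMPT = (
    "Answer in at most three sentences. "
    "Return JSON with exactly two keys: 'answer' (string) and "
    "'confidence' (float 0-1). No prose outside the JSON."
)

request = {
    "model": "your-model",   # placeholder
    "max_tokens": 300,       # hard ceiling: bounds worst-case output spend
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Summarise the attached invoice dispute."},
    ],
}
# A 300-token cap turns a potential 2,000-token ramble into a bounded cost:
# at an assumed $15/1M output tokens, worst case falls from $0.03 to $0.0045.
```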
4. Batch vs Real-Time
Asynchronous batch APIs cost 50% less than the equivalent real-time calls at most major providers. Anything that can wait — overnight summaries, scheduled enrichment, bulk classification — should run through batch. The mental shift teams need to make is treating real-time as a budget item rather than a default.
A surprisingly common pattern: a feature ships with a real-time API call to generate content the user would have been just as happy to read ten seconds later. Latency requirements should be explicit per feature, not implicit. Treat real-time inference as a premium service, used only where it actually matters.
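For the work that can wait, the mechanics are simple. The sketch below builds the JSONL input for an OpenAI-style batch endpoint; field names vary by provider, so verify against current documentation.

```python
# Sketch of moving bulk classification from real-time calls to a batch
# job, assuming an OpenAI-style Batch API (JSONL input, 24h completion
# window). Verify field names against your provider's documentation.
import json

def build_batch_file(texts: list[str], path: str = "batch_input.jsonl") -> str:
    with open(path, "w") as f:
        for i, text in enumerate(texts):
            line = {
                "custom_id": f"classify-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "your-model",   # placeholder
                    "max_tokens": 20,
                    "messages": [
                        {"role": "system",
                         "content": "Classify the ticket: billing, bug, or other."},
                        {"role": "user", "content": text},
                    ],
                },
            }
            f.write(json.dumps(line) + "\n")
    return path

# Upload the file and create the batch with your provider's SDK; results
# arrive within the completion window at roughly half the real-time price.
```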
5. Self-Hosted vs API for Steady-State Volume
API pricing is excellent at low volume and unsustainable at high steady-state volume. The breakeven point depends on the model and the use case, but somewhere around several billion tokens per month, self-hosting an open-weight model on cloud GPUs becomes cheaper than the equivalent API spend. The decision is not just about cost: operational complexity, latency, and model upgrade cadence all matter. But if you are scaling rapidly, run the calculation.
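The calculation itself is short. Every number below is an assumption to be replaced with your own quotes; the point is the shape of the comparison, not the specific figures.

```python
# Back-of-the-envelope breakeven between API pricing and self-hosting.
# Every number here is an assumption; plug in your own quotes.

api_price_per_mtok = 3.00    # assumed blended $/1M tokens via API
gpu_hourly = 4.00            # assumed $/hour per cloud GPU
gpus = 8                     # assumed cluster for an open-weight model
throughput_tok_s = 20_000    # assumed aggregate tokens/second served
utilisation = 0.5            # realistic average load, not peak

hours_per_month = 730
self_host_monthly = gpu_hourly * gpus * hours_per_month
tokens_served = throughput_tok_s * utilisation * 3600 * hours_per_month
self_host_per_mtok = self_host_monthly / (tokens_served / 1e6)

breakeven_tokens = self_host_monthly / api_price_per_mtok * 1e6
print(f"self-hosting: ${self_host_monthly:,.0f}/month, "
      f"${self_host_per_mtok:.2f}/MTok at {utilisation:.0%} load")
print(f"breakeven vs API: {breakeven_tokens / 1e9:.1f}B tokens/month")
# With these assumptions: $23,360/month, $0.89/MTok, breakeven at ~7.8B
# tokens/month, consistent with "several billion" above.
```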
6. Embedding Strategy in RAG Systems
Retrieval-augmented generation systems can spend more on embeddings than on completions if the architecture is wrong. Re-embedding the entire corpus on every change, embedding whole documents before chunking, and using oversized embedding models for content that does not benefit are all common and expensive mistakes. The right strategy: chunk first, embed once, version and reuse, and right-size the embedding model for the retrieval task.
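A content-hash index is the simplest way to enforce "embed once". In the sketch below, `embed` is a placeholder for your embedding client and the in-memory dict stands in for a real vector store.

```python
# Sketch of "chunk first, embed once, version and reuse": a content-hash
# index so unchanged chunks are never re-embedded.
import hashlib

embedding_cache: dict[str, list[float]] = {}  # hash -> vector (use a real store)

def chunk(document: str, size: int = 1000) -> list[str]:
    # Naive fixed-size chunking for illustration; production systems
    # usually split on semantic or structural boundaries.
    return [document[i:i + size] for i in range(0, len(document), size)]

def embed(text: str) -> list[float]:
    """Placeholder for a real embedding API call."""
    raise NotImplementedError

def index_document(document: str) -> list[list[float]]:
    vectors = []
    for piece in chunk(document):
        key = hashlib.sha256(piece.encode()).hexdigest()
        if key not in embedding_cache:           # embed only on cache miss,
            embedding_cache[key] = embed(piece)  # so edits re-embed changed chunks only
        vectors.append(embedding_cache[key])
    return vectors
```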
7. Per-Customer Cost Visibility
If you cannot tell which customers, products, or features are driving your AI bill, you cannot manage it. Tag every request with at least a customer ID, feature ID, and environment. Aggregate the usage data into your existing cost reporting. Most organisations are surprised by the distribution: a small minority of users typically drive a large majority of cost. That is fine, and sometimes desirable, but only knowable if you measure.
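The instrumentation does not need to be elaborate. A minimal sketch, assuming the three-field tag set above and an in-memory sink standing in for your warehouse:

```python
# Sketch of per-request cost tagging. The field names mirror the minimum
# set suggested above; the event list is a placeholder for a real
# analytics pipeline.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class UsageEvent:
    customer_id: str
    feature_id: str
    environment: str
    input_tokens: int
    output_tokens: int
    cost_usd: float

events: list[UsageEvent] = []   # stand-in for your analytics sink

def record(event: UsageEvent) -> None:
    events.append(event)        # in production: emit to your warehouse

def cost_by_customer() -> dict[str, float]:
    totals: dict[str, float] = defaultdict(float)
    for e in events:
        totals[e.customer_id] += e.cost_usd
    # Sorted descending, so the heavy tail is visible at a glance.
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))
```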
The pattern across all seven levers is the same: GenAI costs are an architectural problem more than a procurement problem. The team that wins on cost is not the team with the cheapest contract — it is the team that designed for cost from the start.