LLM Cost Optimization Guide
LLM API costs can scale quickly. This guide covers the most effective levers: model selection, prompt caching, batch processing, token compression, output discipline, and request routing — with real pricing examples from current provider rates.
1. Model selection is your biggest lever
The most impactful cost decision is which model you use. Flagship models cost 10x–100x more than mini-tier models for the same token count. The key question is whether your task actually requires that capability gap.
Consider a workload processing 1 million requests per month, each with 800 input tokens and 200 output tokens:
| Model | Input $/M | Output $/M | Monthly cost |
|---|---|---|---|
| GPT-4.1 (flagship) | $2.00 | $8.00 | $3,200 |
| GPT-4.1 Mini (standard) | $0.40 | $1.60 | $640 |
| Claude Sonnet 4 | $3.00 | $15.00 | $5,400 |
| Claude 3.5 Haiku (fast) | $0.80 | $4.00 | $1,440 |
| Gemini 2.5 Flash | $0.15 | $0.60 | $240 |
| Nova Micro (mini) | $0.035 | $0.14 | $56 |
At 1M requests/month, switching from GPT-4.1 to GPT-4.1 Mini saves approximately $2,560 per month. For simple tasks (classification, extraction, summarization), mini-tier models match flagship quality at a fraction of the cost.
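The figures in the table follow from a simple formula: tokens per month (in millions) times the per-million-token price, computed separately for input and output. A minimal sketch using the prices from the table above:

```python
def monthly_cost(requests, input_tokens, output_tokens,
                 input_price_per_m, output_price_per_m):
    """Estimate monthly spend given per-million-token prices."""
    input_cost = requests * input_tokens / 1_000_000 * input_price_per_m
    output_cost = requests * output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost

# 1M requests/month, 800 input + 200 output tokens each
flagship = monthly_cost(1_000_000, 800, 200, 2.00, 8.00)  # GPT-4.1
mini = monthly_cost(1_000_000, 800, 200, 0.40, 1.60)      # GPT-4.1 Mini
print(flagship, mini, flagship - mini)
```

Plugging in your own request volume and token profile is usually the first step before touching any other lever.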
2. Prompt caching: pay once, reuse many times
If your requests share a large static prefix — a system prompt, retrieval context, few-shot examples, or a policy document — you can use provider-native caching to avoid paying for those tokens on every request.
Caching is most effective when the cached prefix is large relative to the variable part. Consider a 4,000-token system prompt reused across 100 requests: the first request writes the cache, and the remaining 99 read it at roughly a 90% discount, cutting the cost of those prefix tokens by close to 90% overall.
When not to rely on caching: Caches are typically evicted after 5–60 minutes of inactivity. Bursty workloads with long idle periods will miss the cache frequently. Where possible, schedule related requests close together in time so the cache stays warm.
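The amortization math can be sketched as a quick estimator. The 90% hit discount and 25% write premium below are illustrative assumptions (some providers charge a premium to write the cache, others don't), not any one provider's exact rates:

```python
def cached_prefix_cost(prefix_tokens, requests, price_per_m,
                       hit_discount=0.90, write_premium=0.25):
    """Cost of a shared prompt prefix under provider-native caching.

    Assumes the first request writes the cache (at a premium) and the
    remaining requests read it at a discount.
    """
    base = prefix_tokens / 1_000_000 * price_per_m
    write = base * (1 + write_premium)                   # first request
    reads = base * (1 - hit_discount) * (requests - 1)   # cache hits
    return write + reads

# 4,000-token prefix, 100 requests, $3.00/M input tokens
uncached = 4_000 / 1_000_000 * 3.00 * 100
cached = cached_prefix_cost(4_000, 100, 3.00)
print(f"{1 - cached / uncached:.0%} saved")  # → 89% saved
```

The savings converge toward the hit discount as the request count grows, since the one-time write cost stops mattering.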
3. Batch processing for non-real-time workloads
Most providers offer a batch API that processes requests asynchronously at 50% of the standard price (OpenAI, Anthropic). The tradeoff is completion time: batches typically complete within 24 hours rather than milliseconds.
Workloads well-suited for batch processing:
- Document classification at scale
- Offline content generation (product descriptions, SEO copy)
- Evaluation pipelines and benchmark runs
- Data enrichment and entity extraction
- Training data generation and annotation
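Submitting a batch job typically means uploading a file of requests, one JSON object per line. A minimal sketch in the shape OpenAI's Batch API expects (a `custom_id` per request for matching results back); the model name, prompts, and documents are placeholders:

```python
import json

# Hypothetical documents to classify offline
docs = {
    "doc-1": "Quarterly revenue rose 12% on strong subscription growth.",
    "doc-2": "The patch fixes a memory leak in the connection pool.",
}

lines = []
for doc_id, text in docs.items():
    request = {
        "custom_id": doc_id,            # your key for matching results later
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4.1-mini",    # right-sized model for classification
            "messages": [
                {"role": "system",
                 "content": "Classify the document: finance, engineering, or other."},
                {"role": "user", "content": text},
            ],
            "max_tokens": 5,            # single-label output
        },
    }
    lines.append(json.dumps(request))

with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(lines) + "\n")
```

Note the sketch stacks two other levers from this guide: a mini-tier model and a hard output cap, on top of the 50% batch discount.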
Combining batch mode with caching and model right-sizing can reduce costs by 80–95% for the right workloads. A job that costs $500 at standard rates could cost $25–$100 with all three optimizations applied.
4. Token compression: trim every prompt
Every token you send is a token you pay for. Common sources of unnecessary tokens:
- Verbose or repetitive system-prompt instructions
- More few-shot examples than the task needs
- Unpruned retrieval context (irrelevant chunks, duplicate passages)
- Boilerplate, excess whitespace, and templating artifacts
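Part of this trimming can be automated. A minimal sketch that collapses indentation and blank-line runs, a common source of waste in templated or indentation-heavy prompts:

```python
import re

def compact(prompt: str) -> str:
    """Collapse indentation and blank-line runs that inflate token counts."""
    lines = [" ".join(line.split()) for line in prompt.splitlines()]
    text = "\n".join(lines)
    return re.sub(r"\n{3,}", "\n\n", text).strip()

raw = """
    You are a   helpful assistant.


    Answer    concisely.
"""
print(compact(raw))
```

Whitespace normalization is safe for almost any prompt; deeper compression (dropping few-shot examples, pruning retrieval chunks) needs evaluation to confirm quality holds.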
5. Output token discipline
Output tokens are consistently 3–5x more expensive than input tokens. Reducing output verbosity is high-leverage:
- Request structured output (JSON, XML) instead of prose when you only need data fields
- Set `max_tokens` explicitly to cap runaway outputs
- Instruct the model to be concise: "Respond in 2 sentences or fewer"
- For classification tasks, constrain to a single token response where possible
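In practice these levers are just request parameters plus prompt instructions. A sketch of a classification request with the output hard-capped at one token (parameter names follow the common chat-completions shape; the model name and prompts are placeholders):

```python
request = {
    "model": "gpt-4.1-mini",
    "messages": [
        {"role": "system",
         "content": "Classify the sentiment. Respond with exactly one word: "
                    "positive, negative, or neutral."},
        {"role": "user",
         "content": "The new release fixed every bug I reported."},
    ],
    "max_tokens": 1,    # hard cap: a single-token label
    "temperature": 0,   # deterministic labels
}
```

The prompt instruction and the `max_tokens` cap work together: the instruction shapes the model's answer, while the cap guarantees you never pay for more than one output token even if the model ignores it.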
6. Intelligent request routing
Not all requests require the same capability. A routing layer that classifies incoming requests and dispatches them to the appropriate model tier can reduce average cost significantly without degrading quality for users.
Common routing patterns:
- Rule-based: dispatch by request type or length (e.g. classification → mini tier)
- Classifier-based: a small, cheap model scores each request's difficulty and picks the tier
- Cascade: try the cheap model first and escalate to the flagship only when confidence is low
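A minimal rule-based router might look like the sketch below; the task categories, tier names, and length threshold are illustrative assumptions, not a recommended policy:

```python
# Task types that mini-tier models handle reliably (illustrative)
SIMPLE_TASKS = {"classify", "extract", "summarize"}

def route(task_type: str, prompt: str) -> str:
    """Pick a model tier for a request; thresholds are illustrative."""
    if task_type in SIMPLE_TASKS and len(prompt) < 2_000:
        return "mini"        # e.g. GPT-4.1 Mini, Haiku, Flash
    return "flagship"        # complex reasoning or long context

print(route("classify", "Is this email spam? ..."))
print(route("plan", "Design a database migration strategy ..."))
```

Even a crude router pays off when the traffic mix is skewed: if 70% of requests are simple and the mini tier is 5x cheaper, average cost drops by more than half.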
Quick reference
Use the cost calculator to model your specific workload across multiple providers, or read the model selection guide if you're still deciding which model tier fits your use case.