LLM Cost Optimization Guide

LLM API costs can scale quickly. This guide covers the most effective levers: model selection, prompt caching, batching, and token compression — with real pricing examples from current provider rates.

1. Model selection is your biggest lever

The most impactful cost decision is which model you use. Flagship models cost 10x–100x more than mini-tier models for the same token count. The key question is whether your task actually requires that capability gap.

Consider a workload processing 1 million requests per month, each with 800 input tokens and 200 output tokens:

| Model | Input $/M | Output $/M | Monthly cost |
|---|---|---|---|
| GPT-4.1 (flagship) | $2.00 | $8.00 | $3,200 |
| GPT-4.1 Mini (standard) | $0.40 | $1.60 | $640 |
| Claude Sonnet 4 | $3.00 | $15.00 | $5,400 |
| Claude 3.5 Haiku (fast) | $0.80 | $4.00 | $1,440 |
| Gemini 2.5 Flash | $0.15 | $0.60 | $240 |
| Nova Micro (mini) | $0.035 | $0.14 | $56 |

At 1M requests/month, switching from GPT-4.1 to GPT-4.1 Mini saves approximately $2,560 per month. For simple tasks (classification, extraction, summarization), mini-tier models match flagship quality at a fraction of the cost.
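The arithmetic behind the table generalizes into a small helper. A minimal sketch (the prices are the illustrative per-million-token rates from the table; substitute your provider's current numbers):

```python
def monthly_cost(requests, input_tokens, output_tokens,
                 input_price_per_m, output_price_per_m):
    """Estimate monthly spend for a uniform workload, given per-million-token rates."""
    total_in = requests * input_tokens / 1_000_000    # input tokens, in millions
    total_out = requests * output_tokens / 1_000_000  # output tokens, in millions
    return total_in * input_price_per_m + total_out * output_price_per_m

# The 1M-request workload above: 800 input / 200 output tokens per request.
flagship = monthly_cost(1_000_000, 800, 200, 2.00, 8.00)  # GPT-4.1
mini = monthly_cost(1_000_000, 800, 200, 0.40, 1.60)      # GPT-4.1 Mini
print(f"flagship=${flagship:,.0f}  mini=${mini:,.0f}  savings=${flagship - mini:,.0f}")
```

Running the same function across candidate models for your real token counts makes the right-sizing decision concrete before you commit to a migration.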

2. Prompt caching: pay once, reuse many times

If your requests share a large static prefix — a system prompt, retrieval context, few-shot examples, or a policy document — you can use provider-native caching to avoid paying for those tokens on every request.

Caching is most effective when the cached prefix is large relative to the variable part. With a 90% read discount, a 4,000-token system prompt reused across 100 requests pays the full rate (or a small write premium, depending on the provider) once, and only 10% of the input rate on those tokens for the other 99 requests.

When not to rely on caching: Caches are typically evicted after 5–60 minutes of inactivity, so bursty workloads with long idle periods will miss the cache frequently. Where possible, group requests close together in time so the cache stays warm between them.
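As a rough model of the savings, the sketch below assumes a 0.1x price multiplier on cache reads and a 1.25x premium on cache writes (these mirror Anthropic's published cache pricing; OpenAI's discounts differ, so check your provider's pricing page):

```python
def caching_cost(requests, prefix_tokens, variable_tokens, price_per_m,
                 hit_rate=0.99, read_multiplier=0.10, write_multiplier=1.25):
    """Input-token cost with a cached static prefix vs. without.

    hit_rate: fraction of requests that find the prefix in cache.
    read_multiplier / write_multiplier: price factors for cache reads and
    writes relative to the normal input rate (assumed values).
    """
    per_token = price_per_m / 1_000_000
    baseline = requests * (prefix_tokens + variable_tokens) * per_token
    hits = requests * hit_rate
    misses = requests - hits
    cached = (hits * prefix_tokens * read_multiplier      # discounted reads
              + misses * prefix_tokens * write_multiplier  # cache writes on misses
              + requests * variable_tokens) * per_token    # variable part, full price
    return baseline, cached

base, cached = caching_cost(100, 4000, 500, 3.00)
print(f"baseline=${base:.2f}  with_caching=${cached:.2f}")
```

The larger the prefix and the higher the hit rate, the closer the savings on those tokens approach the full read discount.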

3. Batch processing for non-real-time workloads

Most providers offer a batch API that processes requests asynchronously at 50% of the standard price (OpenAI, Anthropic). The tradeoff is completion time: batches typically complete within 24 hours rather than milliseconds.

Workloads well-suited for batch processing:

  • Document classification at scale
  • Offline content generation (product descriptions, SEO copy)
  • Evaluation pipelines and benchmark runs
  • Data enrichment and entity extraction
  • Training data generation and annotation
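With OpenAI's Batch API, for example, you upload a JSONL file where each line is one self-contained request. A sketch of building that file (the model name and prompts are placeholders):

```python
import json

def build_batch_file(prompts, path="batch_input.jsonl",
                     model="gpt-4.1-mini", max_tokens=200):
    """Write one JSONL line per request in the Batch API input format."""
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts):
            line = {
                "custom_id": f"request-{i}",  # your key for matching results later
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                    "max_tokens": max_tokens,
                },
            }
            f.write(json.dumps(line) + "\n")
    return path

build_batch_file(["Classify: 'great product!'", "Classify: 'arrived broken'"])
```

You then upload the file and create a batch job via the API; results arrive as a JSONL file keyed by `custom_id`.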

Combining batch mode with caching and model right-sizing can reduce costs by 80–95% for the right workloads. A job that costs $500 at standard rates could cost $25–$100 with all three optimizations applied.

4. Token compression: trim every prompt

Every token you send is a token you pay for. Common sources of unnecessary tokens:

  • Verbose system prompts: replace prose instructions with structured lists. "Always respond in JSON format containing keys: name, type, confidence" is shorter than a paragraph explaining the same thing.
  • Redundant few-shot examples: start with 2–3 examples and evaluate whether adding more actually improves output quality before paying for them.
  • Full document retrieval: for RAG, retrieve and send only the most relevant chunks rather than entire documents. Semantic chunking and reranking reduce context size significantly.
  • Conversation history bloat: truncate old turns, summarize history, or use a sliding window rather than appending the full transcript indefinitely.
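A minimal sliding-window sketch for the last point. Token counts here are crudely approximated by word count; a real implementation would use the provider's tokenizer:

```python
def sliding_window(messages, max_tokens=1000):
    """Keep the system message plus the most recent turns that fit the budget."""
    def approx_tokens(msg):
        return len(msg["content"].split())  # crude stand-in for a real tokenizer

    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    kept, budget = [], max_tokens - sum(approx_tokens(m) for m in system)
    for msg in reversed(turns):  # walk newest-to-oldest
        cost = approx_tokens(msg)
        if cost > budget:
            break                # oldest turns fall off the window
        kept.append(msg)
        budget -= cost
    return system + list(reversed(kept))
```

Summarizing the dropped turns into a single short message is a common refinement when older context still matters.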

5. Output token discipline

Output tokens typically cost 4–5x more than input tokens at current rates (every model in the pricing table above charges 4–5x more per output token). Reducing output verbosity is high-leverage:

  • Request structured output (JSON, XML) instead of prose when you only need data fields
  • Set max_tokens explicitly to cap runaway outputs
  • Instruct the model to be concise: "Respond in 2 sentences or fewer"
  • For classification tasks, constrain to a single token response where possible
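Combining these in a single request body might look like the following sketch (parameter names match the OpenAI Chat Completions API; other providers differ slightly):

```python
def classification_request(text, labels, model="gpt-4.1-mini"):
    """Build a request body that caps output at a single short label."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Classify the user's text. Respond with exactly one "
                        f"label from: {', '.join(labels)}. No other words."},
            {"role": "user", "content": text},
        ],
        "max_tokens": 5,   # hard cap: a label is only a few tokens
        "temperature": 0,  # deterministic output for classification
    }

body = classification_request("I want a refund", ["billing", "support", "sales"])
```

The `max_tokens` cap is a safety net, not the primary control: the instruction does most of the work, and the cap bounds the worst case.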

6. Intelligent request routing

Not all requests require the same capability. A routing layer that classifies incoming requests and dispatches them to the appropriate model tier can reduce average cost significantly without degrading quality for users.

Common routing patterns:

| Request type | Recommended tier | Example models |
|---|---|---|
| Simple Q&A, short lookups | Mini-tier | GPT-4.1 Nano, Nova Micro |
| Classification, entity extraction | Fast-tier | Claude Haiku, Gemini Flash |
| Code generation, analysis | Standard-tier | GPT-4.1 Mini, Claude Sonnet |
| Complex reasoning, long-form writing | Flagship | GPT-4.1, Claude Opus |
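A keyword-heuristic router is the simplest starting point; many teams later replace it with a small classifier model. A sketch, where the tier names and keywords are purely illustrative:

```python
def route(request_text):
    """Pick a model tier with cheap keyword heuristics (illustrative only)."""
    text = request_text.lower()
    if any(k in text for k in ("prove", "analyze in depth", "write an essay")):
        return "flagship"   # complex reasoning, long-form writing
    if any(k in text for k in ("write code", "refactor", "debug")):
        return "standard"   # code generation, analysis
    if any(k in text for k in ("classify", "extract", "label")):
        return "fast"       # classification, entity extraction
    return "mini"           # default: cheapest tier for simple Q&A
```

In production you would also log routing decisions and spot-check quality per tier, so misrouted requests surface quickly.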

Quick reference

| Technique | Effort | Typical savings |
|---|---|---|
| Right-size model | Medium | 60–90% |
| Batch API | Low | 50% |
| Prompt caching | Low–Medium | 50–90% on cached tokens |
| Token compression | Medium | 10–40% |
| Output limits | Low | 10–30% |
| Request routing | High | 40–80% |

Use the cost calculator to model your specific workload across multiple providers, or read the model selection guide if you're still deciding which model tier fits your use case.