LLM Cost Optimization Guide

LLM API costs can scale quickly. This guide covers the most effective levers: model selection, prompt caching, batching, and token compression — with real pricing examples from current provider rates.

1. Model selection is your biggest lever

The most impactful cost decision is which model you use. Flagship models cost 10x–100x more than mini-tier models for the same token count. The key question is whether your task actually requires that capability gap.

Consider a workload processing 1 million requests per month, each with 800 input tokens and 200 output tokens:

| Model | Input $/M | Output $/M | Monthly cost |
|---|---|---|---|
| GPT-4.1 (flagship) | $2.00 | $8.00 | $3,200 |
| GPT-4.1 Mini (standard) | $0.40 | $1.60 | $640 |
| Claude Sonnet 4 | $3.00 | $15.00 | $5,400 |
| Claude 3.5 Haiku (fast) | $0.80 | $4.00 | $1,440 |
| Gemini 2.5 Flash | $0.15 | $0.60 | $240 |
| Nova Micro (mini) | $0.035 | $0.14 | $56 |

At 1M requests/month, switching from GPT-4.1 to GPT-4.1 Mini saves approximately $2,560 per month. For simple tasks (classification, extraction, summarization), mini-tier models match flagship quality at a fraction of the cost.
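The arithmetic behind the table generalizes into a small helper. A minimal sketch (the prices are the illustrative per-million-token rates from the table; substitute your provider's current numbers):

```python
def monthly_cost(requests, input_tokens, output_tokens,
                 input_price_per_m, output_price_per_m):
    """Estimate monthly spend for a uniform workload, given per-million-token rates."""
    total_in = requests * input_tokens / 1_000_000    # input tokens, in millions
    total_out = requests * output_tokens / 1_000_000  # output tokens, in millions
    return total_in * input_price_per_m + total_out * output_price_per_m

# The 1M-request workload above: 800 input / 200 output tokens per request.
flagship = monthly_cost(1_000_000, 800, 200, 2.00, 8.00)  # GPT-4.1
mini = monthly_cost(1_000_000, 800, 200, 0.40, 1.60)      # GPT-4.1 Mini
print(f"flagship=${flagship:,.0f}  mini=${mini:,.0f}  savings=${flagship - mini:,.0f}")
```

Running the same function across candidate models for your real token counts makes the right-sizing decision concrete before you commit to a migration.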

2. Prompt caching: pay once, reuse many times

If your requests share a large static prefix — a system prompt, retrieval context, few-shot examples, or a policy document — you can use provider-native caching to avoid paying for those tokens on every request.

Caching is most effective when the cached prefix is large relative to the variable part. With a 90% read discount, a 4,000-token system prompt reused across 100 requests pays the full rate (or a small write premium, depending on the provider) once, and only 10% of the input rate on those tokens for the other 99 requests.

When not to rely on caching: Caches are typically evicted after 5–60 minutes of inactivity, so bursty workloads with long idle periods will miss the cache frequently. Where possible, group requests close together in time so the cache stays warm between them.
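As a rough model of the savings, the sketch below assumes a 0.1x price multiplier on cache reads and a 1.25x premium on cache writes (these mirror Anthropic's published cache pricing; OpenAI's discounts differ, so check your provider's pricing page):

```python
def caching_cost(requests, prefix_tokens, variable_tokens, price_per_m,
                 hit_rate=0.99, read_multiplier=0.10, write_multiplier=1.25):
    """Input-token cost with a cached static prefix vs. without.

    hit_rate: fraction of requests that find the prefix in cache.
    read_multiplier / write_multiplier: price factors for cache reads and
    writes relative to the normal input rate (assumed values).
    """
    per_token = price_per_m / 1_000_000
    baseline = requests * (prefix_tokens + variable_tokens) * per_token
    hits = requests * hit_rate
    misses = requests - hits
    cached = (hits * prefix_tokens * read_multiplier      # discounted reads
              + misses * prefix_tokens * write_multiplier  # cache writes on misses
              + requests * variable_tokens) * per_token    # variable part, full price
    return baseline, cached

base, cached = caching_cost(100, 4000, 500, 3.00)
print(f"baseline=${base:.2f}  with_caching=${cached:.2f}")
```

The larger the prefix and the higher the hit rate, the closer the savings on those tokens approach the full read discount.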

3. Batch processing for non-real-time workloads

Most providers offer a batch API that processes requests asynchronously at 50% of the standard price (OpenAI, Anthropic). The tradeoff is completion time: batches typically complete within 24 hours rather than milliseconds.

Workloads well-suited for batch processing:

  • Document classification at scale
  • Offline content generation (product descriptions, SEO copy)
  • Evaluation pipelines and benchmark runs
  • Data enrichment and entity extraction
  • Training data generation and annotation
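With OpenAI's Batch API, for example, you upload a JSONL file where each line is one self-contained request. A sketch of building that file (the model name and prompts are placeholders):

```python
import json

def build_batch_file(prompts, path="batch_input.jsonl",
                     model="gpt-4.1-mini", max_tokens=200):
    """Write one JSONL line per request in the Batch API input format."""
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts):
            line = {
                "custom_id": f"request-{i}",  # your key for matching results later
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                    "max_tokens": max_tokens,
                },
            }
            f.write(json.dumps(line) + "\n")
    return path

build_batch_file(["Classify: 'great product!'", "Classify: 'arrived broken'"])
```

You then upload the file and create a batch job via the API; results arrive as a JSONL file keyed by `custom_id`.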

Combining batch mode with caching and model right-sizing can reduce costs by 80–95% for the right workloads. A job that costs $500 at standard rates could cost $25–$100 with all three optimizations applied.

4. Token compression: trim every prompt

Every token you send is a token you pay for. Common sources of unnecessary tokens:

  • Verbose system prompts: replace prose instructions with structured lists. "Always respond in JSON format containing keys: name, type, confidence" is shorter than a paragraph explaining the same thing.
  • Redundant few-shot examples: start with 2–3 examples and evaluate whether adding more actually improves output quality before paying for them.
  • Full document retrieval: for RAG, retrieve and send only the most relevant chunks rather than entire documents. Semantic chunking and reranking reduce context size significantly.
  • Conversation history bloat: truncate old turns, summarize history, or use a sliding window rather than appending the full transcript indefinitely.
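A minimal sliding-window sketch for the last point. Token counts here are crudely approximated by word count; a real implementation would use the provider's tokenizer:

```python
def sliding_window(messages, max_tokens=1000):
    """Keep the system message plus the most recent turns that fit the budget."""
    def approx_tokens(msg):
        return len(msg["content"].split())  # crude stand-in for a real tokenizer

    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    kept, budget = [], max_tokens - sum(approx_tokens(m) for m in system)
    for msg in reversed(turns):  # walk newest-to-oldest
        cost = approx_tokens(msg)
        if cost > budget:
            break                # oldest turns fall off the window
        kept.append(msg)
        budget -= cost
    return system + list(reversed(kept))
```

Summarizing the dropped turns into a single short message is a common refinement when older context still matters.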

5. Output token discipline

Output tokens typically cost 4–5x more than input tokens at current rates (every model in the pricing table above charges 4–5x more per output token). Reducing output verbosity is high-leverage:

  • Request structured output (JSON, XML) instead of prose when you only need data fields
  • Set max_tokens explicitly to cap runaway outputs
  • Instruct the model to be concise: "Respond in 2 sentences or fewer"
  • For classification tasks, constrain to a single token response where possible
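Combining these in a single request body might look like the following sketch (parameter names match the OpenAI Chat Completions API; other providers differ slightly):

```python
def classification_request(text, labels, model="gpt-4.1-mini"):
    """Build a request body that caps output at a single short label."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Classify the user's text. Respond with exactly one "
                        f"label from: {', '.join(labels)}. No other words."},
            {"role": "user", "content": text},
        ],
        "max_tokens": 5,   # hard cap: a label is only a few tokens
        "temperature": 0,  # deterministic output for classification
    }

body = classification_request("I want a refund", ["billing", "support", "sales"])
```

The `max_tokens` cap is a safety net, not the primary control: the instruction does most of the work, and the cap bounds the worst case.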

6. Intelligent request routing

Not all requests require the same capability. A routing layer that classifies incoming requests and dispatches them to the appropriate model tier can reduce average cost significantly without degrading quality for users.

Common routing patterns:

| Request type | Recommended tier | Example models |
|---|---|---|
| Simple Q&A, short lookups | Mini-tier | GPT-4.1 Nano, Nova Micro |
| Classification, entity extraction | Fast-tier | Claude Haiku, Gemini Flash |
| Code generation, analysis | Standard-tier | GPT-4.1 Mini, Claude Sonnet |
| Complex reasoning, long-form writing | Flagship | GPT-4.1, Claude Opus |
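A keyword-heuristic router is the simplest starting point; many teams later replace it with a small classifier model. A sketch, where the tier names and keywords are purely illustrative:

```python
def route(request_text):
    """Pick a model tier with cheap keyword heuristics (illustrative only)."""
    text = request_text.lower()
    if any(k in text for k in ("prove", "analyze in depth", "write an essay")):
        return "flagship"   # complex reasoning, long-form writing
    if any(k in text for k in ("write code", "refactor", "debug")):
        return "standard"   # code generation, analysis
    if any(k in text for k in ("classify", "extract", "label")):
        return "fast"       # classification, entity extraction
    return "mini"           # default: cheapest tier for simple Q&A
```

In production you would also log routing decisions and spot-check quality per tier, so misrouted requests surface quickly.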

Quick reference

| Technique | Effort | Typical savings |
|---|---|---|
| Right-size model | Medium | 60–90% |
| Batch API | Low | 50% |
| Prompt caching | Low–Medium | 50–90% on cached tokens |
| Token compression | Medium | 10–40% |
| Output limits | Low | 10–30% |
| Request routing | High | 40–80% |

Use the cost calculator to model your specific workload across multiple providers, or read the model selection guide if you're still deciding which model tier fits your use case.