How to Choose the Right LLM

The right model depends on what you're building, not which model scores highest on benchmarks. This guide maps use cases to model tiers and highlights the tradeoffs.

The three-axis framework

Every model selection involves three competing pressures. Knowing which axis matters most for your use case leads directly to the right tier.

Speed

How fast does the user need a response? Real-time chat and autocomplete have <500ms budgets. Batch analysis can wait minutes or hours.

If speed is critical → mini or fast tier

Quality

How much does output quality affect your product? Classification labels can tolerate 2–5% error. Medical or legal output cannot.

If quality is critical → standard or flagship tier

Cost

What is your cost per 1,000 requests? High-volume background jobs need cheap tokens. Low-volume human-in-the-loop tasks can afford more.

If cost is critical → mini or fast tier + batching
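
The three decision rules above can be condensed into a small helper. A Python sketch, assuming the tier names used in this guide and an assumed tie-break: when quality pressure and cost pressure collide, the standard tier is the usual compromise.

```python
def pick_tier(speed_critical: bool, quality_critical: bool, cost_critical: bool) -> str:
    """Rule-of-thumb mapping from the three axes to a starting tier.

    Tier names mirror this guide; the tie-breaking order is an assumption.
    """
    if quality_critical and cost_critical:
        # Conflicting pressures: standard tier is the usual compromise.
        return "standard"
    if quality_critical:
        return "flagship"
    if speed_critical or cost_critical:
        return "mini/fast"
    # No dominant pressure: default to the workhorse tier.
    return "standard"
```

This is a starting point, not a verdict: run a small evaluation set on the suggested tier before committing.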

Model tiers explained

Mini / Nano
Under $0.50/M input
Use for: Classification, routing, simple extraction, high-volume pipelines
Skip for: Complex reasoning, nuanced writing, multi-step agents
Examples: GPT-4.1 Nano, Nova Micro, Gemini 2.0 Flash

Fast
$0.10–$1.00/M input
Use for: Summarization, moderate extraction, customer-facing chatbots, structured output
Skip for: Complex code generation, deep analysis
Examples: Claude 3.5 Haiku, Gemini 2.5 Flash, Mistral Small

Standard
$0.40–$3.00/M input
Use for: Code generation, RAG, tool use, general assistant work, most production workloads
Skip for: Very simple tasks (overpaying) and frontier reasoning tasks
Examples: GPT-4.1 Mini, Claude Sonnet 4, Mistral Medium

Flagship
$2.00–$15.00/M input
Use for: Complex code, long-form writing, nuanced analysis, vision tasks that need the highest accuracy
Skip for: Bulk processing — costs add up fast
Examples: GPT-4.1, Claude Opus 4, Gemini 2.5 Pro

Reasoning
$0.55–$2.00/M input (plus chain-of-thought token overhead)
Use for: Math, science, multi-step logic, verification tasks
Skip for: Latency-sensitive apps — reasoning steps add seconds
Examples: o3, o4-mini, DeepSeek R1
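
To compare tiers concretely, the per-million-token prices above translate into cost per 1,000 requests with simple arithmetic. A sketch; the token counts in the examples are hypothetical workload assumptions:

```python
def cost_per_1k_requests(input_price_per_m: float, avg_input_tokens: int,
                         output_price_per_m: float = 0.0,
                         avg_output_tokens: int = 0) -> float:
    """Dollar cost of 1,000 requests, given per-million-token prices."""
    per_request = (avg_input_tokens * input_price_per_m
                   + avg_output_tokens * output_price_per_m) / 1_000_000
    return per_request * 1_000

# 500 input tokens per request at $0.50/M (mini-tier ceiling):
print(cost_per_1k_requests(0.50, 500))   # → 0.25 (dollars per 1,000 requests)
# Same workload at $15.00/M (flagship ceiling):
print(cost_per_1k_requests(15.00, 500))  # → 7.5
```

The 30× spread between those two numbers is why tier choice dominates every other cost lever at high volume.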

Capability requirements by feature

Some features require specific model support regardless of tier. Check before committing to a model:

Feature | Minimum tier | Notes
Vision / image input | Fast+ | Not all fast-tier models support it — check per-model
Function calling / tools | Mini+ | Widely supported, but reliability improves in standard tier
JSON mode / structured output | Mini+ | Universally supported across current generation models
Long context (>128K tokens) | Standard+ | Gemini and Claude offer up to 1M–2M tokens; most others 128K
Streaming | Mini+ | Supported by all models on this site
Multi-turn agents | Standard+ | Mini-tier models can fail on complex tool orchestration
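
The table can be encoded as a pre-flight check before committing to a tier. A sketch, where the feature keys and the tier ordering are assumptions from this guide, and real support still varies per model:

```python
# Tiers in ascending capability order (assumed ordering from this guide).
TIER_RANK = {"mini": 0, "fast": 1, "standard": 2, "flagship": 3}

# Minimum tier per feature, mirroring the table above; per-model support varies.
MIN_TIER = {
    "vision": "fast",
    "function_calling": "mini",
    "json_mode": "mini",
    "long_context": "standard",
    "streaming": "mini",
    "multi_turn_agents": "standard",
}

def tier_covers(tier: str, required_features: list[str]) -> bool:
    """True if every required feature is typically available at this tier."""
    return all(TIER_RANK[tier] >= TIER_RANK[MIN_TIER[f]]
               for f in required_features)
```

A check like `tier_covers("mini", ["vision"])` failing tells you up front that the cheapest tier is off the table for that feature set, before any per-model comparison.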

Recommendation matrix by use case

Text classification
Speed: High · Budget: Low

Mini-tier models handle classification reliably at minimal cost. Fine-tuning can further reduce cost.

Entity extraction / NER
Speed: High · Budget: Low

JSON mode support is required. Fast-tier models extract structured data accurately.
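
Since JSON mode is a hard requirement here, it is worth validating the model's output before trusting it downstream. A minimal sketch, assuming a hypothetical `[{"text": ..., "label": ...}]` entity schema:

```python
import json

def parse_entities(model_output: str) -> list[dict]:
    """Parse and sanity-check JSON-mode NER output (assumed schema)."""
    entities = json.loads(model_output)  # raises ValueError on malformed JSON
    if not isinstance(entities, list):
        raise ValueError("expected a JSON array of entities")
    for e in entities:
        if not {"text", "label"} <= e.keys():
            raise ValueError(f"entity missing required keys: {e}")
    return entities

# parse_entities('[{"text": "Paris", "label": "LOC"}]') → one valid entity
```

Even with JSON mode enabled, a validation layer like this catches schema drift cheaply; fast-tier models occasionally return valid JSON in the wrong shape.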

Summarization
Speed: Medium · Budget: Low–Medium

Short-to-medium summaries: fast-tier. Long documents or abstractive summarization: standard-tier.

RAG / question answering
Speed: Medium · Budget: Medium

Retrieval accuracy depends heavily on chunking and reranking, not just model quality. Cohere Command R+ is purpose-built for RAG.

Code generation
Speed: Medium · Budget: Medium–High

Claude Sonnet and GPT-4.1 lead on multi-file code editing. Codestral specializes in code completion at lower cost.

Complex reasoning / math
Speed: Low · Budget: High

Reasoning-tier models use chain-of-thought internally. Expect slower responses and higher cost per request.

Long-form writing
Speed: Low · Budget: Medium–High

Flagship models produce more coherent long documents. Anthropic models tend to be preferred for creative and nuanced writing.

Vision / image understanding
Speed: Medium · Budget: Medium

The major providers (OpenAI, Anthropic, and Google) all support vision natively in their flagship models. Check that your chosen model tier supports image inputs.

Tool use / agents
Speed: Medium · Budget: Medium

Reliable function calling requires a model that understands tool schemas. Avoid mini-tier for complex multi-step agents.

High-volume production API
Speed: Very High · Budget: Very Low

Prioritize tokens/second and cost per token. Combine with caching and batching for maximum efficiency.
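
Caching is the simplest of those levers: deduplicate identical prompts so you only pay for the first occurrence. A sketch; `call_model` here stands in for whatever client function you use, not a specific SDK:

```python
import hashlib
from typing import Callable

def make_cached_caller(call_model: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a model-calling function so repeated prompts hit a local cache."""
    cache: dict[str, str] = {}

    def cached(prompt: str) -> str:
        # Hash the prompt so arbitrarily long inputs make compact keys.
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = call_model(prompt)  # only pay for the first occurrence
        return cache[key]

    return cached
```

In production you would back this with a shared store (e.g. Redis) and an expiry policy rather than an in-process dict, but the cost math is the same: every cache hit is a request you did not pay for.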

Next steps

Ready to compare costs? Use the cost calculator to model your specific workload, or read the cost optimization guide for strategies to reduce spend once you've chosen a model.