How to Choose the Right LLM

The right model depends on what you're building, not which model scores highest on benchmarks. This guide maps use cases to model tiers and highlights the tradeoffs.

The three-axis framework

Every model selection involves three competing pressures. Knowing which axis matters most for your use case leads directly to the right tier.

Speed

How fast does the user need a response? Real-time chat and autocomplete have <500ms budgets. Batch analysis can wait minutes or hours.

If speed is critical → mini or fast tier

Quality

How much does output quality affect your product? Classification labels can tolerate 2–5% error. Medical or legal output cannot.

If quality is critical → standard or flagship tier

Cost

What is your cost per 1,000 requests? High-volume background jobs need cheap tokens. Low-volume human-in-the-loop tasks can afford more.

If cost is critical → mini or fast tier + batching
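
The three decision rules above can be condensed into a small helper. A Python sketch, assuming the tier names used in this guide and an assumed tie-break: when quality pressure and cost pressure collide, the standard tier is the usual compromise.

```python
def pick_tier(speed_critical: bool, quality_critical: bool, cost_critical: bool) -> str:
    """Rule-of-thumb mapping from the three axes to a starting tier.

    Tier names mirror this guide; the tie-breaking order is an assumption.
    """
    if quality_critical and cost_critical:
        # Conflicting pressures: standard tier is the usual compromise.
        return "standard"
    if quality_critical:
        return "flagship"
    if speed_critical or cost_critical:
        return "mini/fast"
    # No dominant pressure: default to the workhorse tier.
    return "standard"
```

This is a starting point, not a verdict: run a small evaluation set on the suggested tier before committing.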

Model tiers explained

Mini / Nano
Under $0.50/M input
Use for: Classification, routing, simple extraction, high-volume pipelines
Skip for: Complex reasoning, nuanced writing, multi-step agents
Examples: GPT-4.1 Nano, Nova Micro, Gemini 2.0 Flash

Fast
$0.10–$1.00/M input
Use for: Summarization, moderate extraction, customer-facing chatbots, structured output
Skip for: Complex code generation, deep analysis
Examples: Claude 3.5 Haiku, Gemini 2.5 Flash, Mistral Small

Standard
$0.40–$3.00/M input
Use for: Code generation, RAG, tool use, general assistant work, most production workloads
Skip for: Very simple tasks (overpaying) and frontier reasoning tasks
Examples: GPT-4.1 Mini, Claude Sonnet 4, Mistral Medium

Flagship
$2.00–$15.00/M input
Use for: Complex code, long-form writing, nuanced analysis, vision tasks that need the highest accuracy
Skip for: Bulk processing — costs add up fast
Examples: GPT-4.1, Claude Opus 4, Gemini 2.5 Pro

Reasoning
$0.55–$2.00/M input (plus chain-of-thought token overhead)
Use for: Math, science, multi-step logic, verification tasks
Skip for: Latency-sensitive apps — reasoning steps add seconds
Examples: o3, o4-mini, DeepSeek R1
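
To compare tiers concretely, the per-million-token prices above translate into cost per 1,000 requests with simple arithmetic. A sketch; the token counts in the examples are hypothetical workload assumptions:

```python
def cost_per_1k_requests(input_price_per_m: float, avg_input_tokens: int,
                         output_price_per_m: float = 0.0,
                         avg_output_tokens: int = 0) -> float:
    """Dollar cost of 1,000 requests, given per-million-token prices."""
    per_request = (avg_input_tokens * input_price_per_m
                   + avg_output_tokens * output_price_per_m) / 1_000_000
    return per_request * 1_000

# 500 input tokens per request at $0.50/M (mini-tier ceiling):
print(cost_per_1k_requests(0.50, 500))   # → 0.25 (dollars per 1,000 requests)
# Same workload at $15.00/M (flagship ceiling):
print(cost_per_1k_requests(15.00, 500))  # → 7.5
```

The 30× spread between those two numbers is why tier choice dominates every other cost lever at high volume.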

Capability requirements by feature

Some features require specific model support regardless of tier. Check before committing to a model:

Feature | Minimum tier | Notes
Vision / image input | Fast+ | Not all fast-tier models support it — check per-model
Function calling / tools | Mini+ | Widely supported, but reliability improves in standard tier
JSON mode / structured output | Mini+ | Universally supported across current generation models
Long context (>128K tokens) | Standard+ | Gemini and Claude offer up to 1M–2M tokens; most others 128K
Streaming | Mini+ | Supported by all models on this site
Multi-turn agents | Standard+ | Mini-tier models can fail on complex tool orchestration
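
The table can be encoded as a pre-flight check before committing to a tier. A sketch, where the feature keys and the tier ordering are assumptions from this guide, and real support still varies per model:

```python
# Tiers in ascending capability order (assumed ordering from this guide).
TIER_RANK = {"mini": 0, "fast": 1, "standard": 2, "flagship": 3}

# Minimum tier per feature, mirroring the table above; per-model support varies.
MIN_TIER = {
    "vision": "fast",
    "function_calling": "mini",
    "json_mode": "mini",
    "long_context": "standard",
    "streaming": "mini",
    "multi_turn_agents": "standard",
}

def tier_covers(tier: str, required_features: list[str]) -> bool:
    """True if every required feature is typically available at this tier."""
    return all(TIER_RANK[tier] >= TIER_RANK[MIN_TIER[f]]
               for f in required_features)
```

A check like `tier_covers("mini", ["vision"])` failing tells you up front that the cheapest tier is off the table for that feature set, before any per-model comparison.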

Recommendation matrix by use case

Text classification
Speed: High · Budget: Low

Mini-tier models handle classification reliably at minimal cost. Fine-tuning can further reduce cost.

Entity extraction / NER
Speed: High · Budget: Low

JSON mode support is required. Fast-tier models extract structured data accurately.
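
Since JSON mode is a hard requirement here, it is worth validating the model's output before trusting it downstream. A minimal sketch, assuming a hypothetical `[{"text": ..., "label": ...}]` entity schema:

```python
import json

def parse_entities(model_output: str) -> list[dict]:
    """Parse and sanity-check JSON-mode NER output (assumed schema)."""
    entities = json.loads(model_output)  # raises ValueError on malformed JSON
    if not isinstance(entities, list):
        raise ValueError("expected a JSON array of entities")
    for e in entities:
        if not {"text", "label"} <= e.keys():
            raise ValueError(f"entity missing required keys: {e}")
    return entities

# parse_entities('[{"text": "Paris", "label": "LOC"}]') → one valid entity
```

Even with JSON mode enabled, a validation layer like this catches schema drift cheaply; fast-tier models occasionally return valid JSON in the wrong shape.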

Summarization
Speed: Medium · Budget: Low–Medium

Short-to-medium summaries: fast-tier. Long documents or abstractive summarization: standard-tier.

RAG / question answering
Speed: Medium · Budget: Medium

Retrieval accuracy depends heavily on chunking and reranking, not just model quality. Cohere Command R+ is purpose-built for RAG.

Code generation
Speed: Medium · Budget: Medium–High

Claude Sonnet and GPT-4.1 lead on multi-file code editing. Codestral specializes in code completion at lower cost.

Complex reasoning / math
Speed: Low · Budget: High

Reasoning-tier models use chain-of-thought internally. Expect slower responses and higher cost per request.

Long-form writing
Speed: Low · Budget: Medium–High

Flagship models produce more coherent long documents. Anthropic models tend to be preferred for creative and nuanced writing.

Vision / image understanding
Speed: Medium · Budget: Medium

The major providers (OpenAI, Anthropic, and Google) all support vision natively in their flagship models. Check that your chosen model tier supports image inputs.

Tool use / agents
Speed: Medium · Budget: Medium

Reliable function calling requires a model that understands tool schemas. Avoid mini-tier for complex multi-step agents.

High-volume production API
Speed: Very High · Budget: Very Low

Prioritize tokens/second and cost per token. Combine with caching and batching for maximum efficiency.
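
Caching is the simplest of those levers: deduplicate identical prompts so you only pay for the first occurrence. A sketch; `call_model` here stands in for whatever client function you use, not a specific SDK:

```python
import hashlib
from typing import Callable

def make_cached_caller(call_model: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a model-calling function so repeated prompts hit a local cache."""
    cache: dict[str, str] = {}

    def cached(prompt: str) -> str:
        # Hash the prompt so arbitrarily long inputs make compact keys.
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = call_model(prompt)  # only pay for the first occurrence
        return cache[key]

    return cached
```

In production you would back this with a shared store (e.g. Redis) and an expiry policy rather than an in-process dict, but the cost math is the same: every cache hit is a request you did not pay for.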

Next steps

Ready to compare costs? Use the cost calculator to model your specific workload, or read the cost optimization guide for strategies to reduce spend once you've chosen a model.