LLM API Quick Reference
Authentication patterns, example requests, and JSON response shapes for the major LLM provider APIs. All examples use curl and Python.
Common parameters
Most providers follow the OpenAI chat completions schema. Parameters below apply broadly; check provider docs for exact names and defaults.
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | The model identifier. Provider-specific format (e.g., gpt-4.1-mini, claude-sonnet-4-20250514). |
| messages | array | Yes | Array of message objects with "role" (system/user/assistant) and "content" fields. |
| max_tokens | integer | No | Maximum tokens in the response. Set explicitly to avoid unexpectedly long outputs. Required by Anthropic. |
| temperature | float | No | Sampling temperature. 0 = near-deterministic; higher values increase randomness. Allowed range and default vary by provider (e.g., Anthropic accepts 0–1, OpenAI up to 2). |
| stream | boolean | No | Stream partial response tokens as server-sent events. Reduces time-to-first-token for real-time UIs. |
| top_p | float | No | Nucleus sampling probability mass. Use temperature OR top_p, not both. |
| stop | string or string[] | No | One or more sequences where the model will stop generating. Useful for structured output parsing. |
| tools | array | No | Function definitions the model can call. Format differs slightly between providers. |
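Taken together, the parameters above map onto a single OpenAI-style request body. A minimal sketch, assuming the OpenAI schema; the model name and the get_weather tool are illustrative placeholders, and note that Anthropic nests the tool's JSON schema under input_schema rather than function.parameters:

```python
import json

# Illustrative OpenAI-style request body exercising the common parameters.
# get_weather is a placeholder tool, not a real function you must define.
payload = {
    "model": "gpt-4.1-mini",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the weather in Paris?"},
    ],
    "max_tokens": 256,
    "temperature": 0.7,  # omit top_p when temperature is set
    "stop": ["\n\n"],    # stop generating at the first blank line
    "stream": False,     # True => partial tokens as server-sent events
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

# The body is plain JSON, so it serializes directly for curl or an SDK.
body = json.dumps(payload)
```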
OpenAI
- Base URL: https://api.openai.com/v1
- Auth header: Authorization: Bearer $OPENAI_API_KEY
- Capabilities: JSON, Streaming, Tools
curl

```bash
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4.1-mini",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Explain prompt caching in one sentence." }
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
```

Example response

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "gpt-4.1-mini",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Prompt caching stores the KV cache of a repeated token prefix so subsequent requests pay a fraction of the normal input token cost."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 32,
    "completion_tokens": 38,
    "total_tokens": 70
  }
}
```

Python

```python
import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from env

response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain prompt caching in one sentence."},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```

Anthropic
- Base URL: https://api.anthropic.com/v1
- Auth header: x-api-key: $ANTHROPIC_API_KEY
- Required header: anthropic-version: 2023-06-01
- Capabilities: JSON, Streaming, Tools
curl

```bash
curl https://api.anthropic.com/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 256,
    "system": "You are a helpful assistant.",
    "messages": [
      { "role": "user", "content": "Explain prompt caching in one sentence." }
    ]
  }'
```

Example response

```json
{
  "id": "msg_01XFDUDYJgAACzvnptvVoYEL",
  "type": "message",
  "role": "assistant",
  "model": "claude-sonnet-4-20250514",
  "content": [
    {
      "type": "text",
      "text": "Prompt caching stores the key-value cache of a static prompt prefix so that repeated requests pay only 10% of the normal input token cost for cached tokens."
    }
  ],
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 32,
    "output_tokens": 38
  }
}
```

Python

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=256,
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "Explain prompt caching in one sentence."},
    ],
)
print(message.content[0].text)
```

Google Gemini
- Base URL: https://generativelanguage.googleapis.com/v1beta
- Auth: API key passed as a query parameter (?key=$GOOGLE_API_KEY), not a header
- Capabilities: JSON, Streaming, Tools
curl

```bash
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent?key=$GOOGLE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {
        "role": "user",
        "parts": [{ "text": "Explain prompt caching in one sentence." }]
      }
    ],
    "generationConfig": {
      "maxOutputTokens": 256,
      "temperature": 0.7
    },
    "systemInstruction": {
      "parts": [{ "text": "You are a helpful assistant." }]
    }
  }'
```

Example response

```json
{
  "candidates": [
    {
      "content": {
        "parts": [
          {
            "text": "Prompt caching stores the computation for a repeated prefix so subsequent API calls skip re-processing those tokens, reducing cost and latency."
          }
        ],
        "role": "model"
      },
      "finishReason": "STOP"
    }
  ],
  "usageMetadata": {
    "promptTokenCount": 29,
    "candidatesTokenCount": 35,
    "totalTokenCount": 64
  }
}
```

Python

```python
import os

import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel(
    model_name="gemini-2.5-flash",
    system_instruction="You are a helpful assistant.",
)
response = model.generate_content(
    "Explain prompt caching in one sentence.",
    generation_config=genai.types.GenerationConfig(
        max_output_tokens=256,
        temperature=0.7,
    ),
)
print(response.text)
```

Mistral
- Base URL: https://api.mistral.ai/v1
- Auth header: Authorization: Bearer $MISTRAL_API_KEY
- Capabilities: JSON, Streaming, Tools
curl

```bash
curl https://api.mistral.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $MISTRAL_API_KEY" \
  -d '{
    "model": "mistral-small-latest",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Explain prompt caching in one sentence." }
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
```

Example response

```json
{
  "id": "cmpl-e5cc70bb28c444948073e77776eb30ef",
  "object": "chat.completion",
  "model": "mistral-small-latest",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Prompt caching stores the key-value attention states for a repeated prompt prefix, allowing subsequent requests to skip recomputing those tokens and reducing both cost and latency."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 29,
    "completion_tokens": 42,
    "total_tokens": 71
  }
}
```

Python

```python
import os

from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

response = client.chat.complete(
    model="mistral-small-latest",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain prompt caching in one sentence."},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```

DeepSeek
- Base URL: https://api.deepseek.com
- Auth header: Authorization: Bearer $DEEPSEEK_API_KEY
- Capabilities: JSON, Streaming, Tools
curl

```bash
curl https://api.deepseek.com/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPSEEK_API_KEY" \
  -d '{
    "model": "deepseek-chat",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Explain prompt caching in one sentence." }
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
```

Example response

```json
{
  "id": "930c37e5-a7ab-4e42-8da3-4e7d4e2f7a22",
  "object": "chat.completion",
  "model": "deepseek-chat",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Prompt caching reuses the computed key-value states for a static prompt prefix across requests, reducing token computation cost for repeated context."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 29,
    "prompt_cache_hit_tokens": 0,
    "prompt_cache_miss_tokens": 29,
    "completion_tokens": 34,
    "total_tokens": 63
  }
}
```

Python

```python
import os

from openai import OpenAI  # DeepSeek uses an OpenAI-compatible API

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain prompt caching in one sentence."},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```

xAI (Grok)
- Base URL: https://api.x.ai/v1
- Auth header: Authorization: Bearer $XAI_API_KEY
- Capabilities: JSON, Streaming, Tools
curl

```bash
curl https://api.x.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -d '{
    "model": "grok-3-mini",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Explain prompt caching in one sentence." }
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
```

Example response

```json
{
  "id": "dc5fec56-6db8-40f2-8e27-f5faf4a8a2e6",
  "object": "chat.completion",
  "model": "grok-3-mini",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Prompt caching stores the transformer key-value states for a fixed prompt prefix so that repeated API calls with the same prefix skip redundant computation."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 29,
    "completion_tokens": 39,
    "total_tokens": 68
  }
}
```

Python

```python
import os

from openai import OpenAI  # xAI uses an OpenAI-compatible API

client = OpenAI(
    api_key=os.environ["XAI_API_KEY"],
    base_url="https://api.x.ai/v1",
)
response = client.chat.completions.create(
    model="grok-3-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain prompt caching in one sentence."},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```

OpenAI-compatible providers
Several providers implement the OpenAI chat completions schema. If you're using the OpenAI Python SDK, you can point it at these providers by setting base_url and api_key:
| Provider | base_url | API key env var |
|---|---|---|
| xAI (Grok) | https://api.x.ai/v1 | XAI_API_KEY |
| DeepSeek | https://api.deepseek.com | DEEPSEEK_API_KEY |
| Together (Llama, etc.) | https://api.together.xyz/v1 | TOGETHER_API_KEY |
| Mistral | https://api.mistral.ai/v1 | MISTRAL_API_KEY |
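Because these providers share one schema, switching between them reduces to two constructor arguments. A minimal sketch, assuming the OpenAI Python SDK; the client_kwargs helper is ours for illustration, not part of any SDK:

```python
import os

# base_url and API-key env var for each OpenAI-compatible provider,
# taken from the table above.
PROVIDERS = {
    "xai": ("https://api.x.ai/v1", "XAI_API_KEY"),
    "deepseek": ("https://api.deepseek.com", "DEEPSEEK_API_KEY"),
    "together": ("https://api.together.xyz/v1", "TOGETHER_API_KEY"),
    "mistral": ("https://api.mistral.ai/v1", "MISTRAL_API_KEY"),
}

def client_kwargs(provider: str) -> dict:
    """Constructor kwargs for openai.OpenAI(**client_kwargs(provider))."""
    base_url, env_var = PROVIDERS[provider]
    return {"base_url": base_url, "api_key": os.environ.get(env_var, "")}

# Usage (requires the openai package and the provider's API key):
#   client = openai.OpenAI(**client_kwargs("deepseek"))
#   client.chat.completions.create(model="deepseek-chat", messages=[...])
```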
See the cost optimization guide for strategies to reduce spend, or open the calculator to compare costs before choosing a provider.
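The example responses earlier in this reference come in three shapes: OpenAI-style choices (also used by Mistral, DeepSeek, and xAI), Anthropic content blocks, and Gemini candidates. A hedged sketch of a normalizer over parsed response dicts; it covers only the fields shown in the examples above, and real responses may carry fields it ignores:

```python
def extract_text(resp: dict) -> str:
    """Pull the assistant text out of a parsed JSON response body.

    Handles the three shapes shown in this reference: OpenAI-style
    (choices[0].message.content), Gemini (candidates[0].content.parts),
    and Anthropic (content blocks with type "text").
    """
    if "choices" in resp:  # OpenAI / Mistral / DeepSeek / xAI
        return resp["choices"][0]["message"]["content"]
    if "candidates" in resp:  # Gemini
        parts = resp["candidates"][0]["content"]["parts"]
        return "".join(p.get("text", "") for p in parts)
    if "content" in resp:  # Anthropic
        return "".join(b["text"] for b in resp["content"] if b["type"] == "text")
    raise ValueError("unrecognized response shape")
```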