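Prompt caching cuts the cost of repeated context by up to 90%: cached input tokens are billed at 10% of the standard input rate. The tables below list pricing for all Claude 4 models, along with worked examples of real-world savings.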
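A cache write occurs on the first request that includes the cached content. The entry then lives for a 5-minute TTL, which is refreshed each time the cache is read. All subsequent requests within that window read from cache at 10% of the standard input cost.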
| Model | Standard Input | Cache Write (1.25×) | Cache Read (0.10×) | Cache Savings | Min Cache Size |
|---|---|---|---|---|---|
| Claude Haiku 4.5 | $0.80/MTok | $1.00/MTok | $0.08/MTok | 90% | 2,048 tokens |
| Claude Sonnet 4.6 | $3.00/MTok | $3.75/MTok | $0.30/MTok | 90% | 1,024 tokens |
| Claude Opus 4.7 | $15.00/MTok | $18.75/MTok | $1.50/MTok | 90% | 1,024 tokens |
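The write and read columns follow directly from the standard input price and the two multipliers in the table header. A minimal sketch of that arithmetic, using the prices listed above; the function name `cache_prices` is illustrative and not part of any SDK:

```python
# Standard input prices in USD per million tokens, copied from the table above.
STANDARD_INPUT_PER_MTOK = {
    "Claude Haiku 4.5": 0.80,
    "Claude Sonnet 4.6": 3.00,
    "Claude Opus 4.7": 15.00,
}

CACHE_WRITE_MULTIPLIER = 1.25  # cache writes cost 125% of standard input
CACHE_READ_MULTIPLIER = 0.10   # cache reads cost 10% of standard input


def cache_prices(standard_per_mtok: float) -> tuple[float, float]:
    """Return (cache_write, cache_read) prices in USD per million tokens."""
    write = round(standard_per_mtok * CACHE_WRITE_MULTIPLIER, 4)
    read = round(standard_per_mtok * CACHE_READ_MULTIPLIER, 4)
    return write, read


for model, price in STANDARD_INPUT_PER_MTOK.items():
    write, read = cache_prices(price)
    print(f"{model}: write ${write:.2f}/MTok, read ${read:.2f}/MTok")
```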
A chatbot with a 2,000-token system prompt, 300-token average user message, 200-token average response, 10,000 conversations/day, 3 turns per conversation:
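One way to estimate this scenario is the sketch below. It covers only the system-prompt share of the input cost, and it rests on assumptions that are not from the source: Sonnet 4.6 pricing ($3.00/MTok), exactly one cache write per conversation with the remaining turns served as cache reads, and a hypothetical helper named `daily_prefix_cost`. Under steady traffic the cache may stay warm across conversations, which would mean fewer writes and a lower cached cost.

```python
def daily_prefix_cost(prefix_tokens: int, requests: int, cache_writes: int,
                      standard_per_mtok: float) -> tuple[float, float]:
    """Return (uncached, cached) daily USD cost for a repeated prompt prefix."""
    if not 0 <= cache_writes <= requests:
        raise ValueError("cache_writes must be between 0 and requests")
    cache_reads = requests - cache_writes
    # Convert token totals to millions of tokens before applying the price.
    uncached = prefix_tokens * requests / 1_000_000 * standard_per_mtok
    write_cost = prefix_tokens * cache_writes / 1_000_000 * standard_per_mtok * 1.25
    read_cost = prefix_tokens * cache_reads / 1_000_000 * standard_per_mtok * 0.10
    return uncached, write_cost + read_cost


conversations = 10_000
turns = 3
uncached, cached = daily_prefix_cost(
    prefix_tokens=2_000,             # system prompt size from the scenario
    requests=conversations * turns,  # 30,000 requests per day
    cache_writes=conversations,      # assumption: one write per conversation
    standard_per_mtok=3.00,          # Sonnet 4.6 standard input price
)
print(f"System prompt, uncached: ${uncached:.2f}/day")
print(f"System prompt, cached:   ${cached:.2f}/day")
```

Under these assumptions the system prompt costs $180.00 per day uncached and $87.00 per day cached. User messages, conversation history, and output tokens are billed on top and are not modeled here.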
A document Q&A app that loads a 20,000-token document context per session, 500 questions/day, 10 turns per session:
| Model | Without Caching/day | With Caching/day | Savings/day | Savings/year |
|---|---|---|---|---|
| Haiku 4.5 | $81.60 | $12.08 | $69.52 (85%) | $25,375 |
| Sonnet 4.6 | $306.00 | $45.30 | $260.70 (85%) | $95,156 |
Add cache_control to the last content block you want cached; everything in the prompt up to and including that block is cached as a prefix. You can set up to 4 cache breakpoints per request.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
user_message = "..."  # placeholder for the end user's message

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": "You are a helpful assistant...",  # your system prompt
        "cache_control": {"type": "ephemeral"}  # ← add this
    }],
    messages=[{
        "role": "user",
        "content": user_message
    }]
)

# Check cache hit in response
print(response.usage)
# First call: Usage(cache_creation_input_tokens=1247, cache_read_input_tokens=0, ...)
# Second call: cache_read_input_tokens=1247 (90% cheaper)
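To act on those usage fields programmatically, a small helper can classify each response. This is a sketch, not part of the SDK: the name `cache_status` is hypothetical, and the stand-in objects below only mimic the two usage fields shown above so the example runs without an API call.

```python
from types import SimpleNamespace


def cache_status(usage) -> str:
    """Classify a usage object as a cache 'hit', 'write', or 'miss'."""
    # getattr with a default, plus `or 0`, guards against absent or None fields.
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if read > 0:
        return "hit"    # at least part of the prompt was served from cache
    if written > 0:
        return "write"  # the prefix was stored for later requests
    return "miss"       # nothing was cached or read


# Stand-ins for response.usage on the first and second calls.
first_call = SimpleNamespace(cache_creation_input_tokens=1247, cache_read_input_tokens=0)
second_call = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=1247)

print(cache_status(first_call))
print(cache_status(second_call))
```

In real code, pass `response.usage` instead of the stand-in objects. A request that both reads an existing prefix and writes a new one is reported as a hit here; that is a simplification of this sketch.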
# Reuses the client created in the previous example
document_content = "..."  # placeholder for the large document text
user_question = "..."  # placeholder for the user's question

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": "Answer questions based on the document below.",
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": document_content,  # large document
                    "cache_control": {"type": "ephemeral"}  # ← cache the doc
                },
                {"type": "text", "text": user_question}
            ]
        }
    ]
)
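Because a request may carry at most 4 cache breakpoints, it can help to count them before sending. The sketch below is a hypothetical pre-flight check, not an SDK feature; it inspects only the system and messages payloads in the shape used above and ignores any other place a breakpoint might be set.

```python
MAX_CACHE_BREAKPOINTS = 4  # limit stated in the text above


def count_cache_breakpoints(system, messages) -> int:
    """Count content blocks that carry a cache_control marker."""
    blocks = list(system) if isinstance(system, list) else []
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):  # plain-string content has no blocks
            blocks.extend(content)
    return sum(
        1 for block in blocks
        if isinstance(block, dict) and "cache_control" in block
    )


# Same payload shape as the document Q&A request, with placeholder text.
system = [{
    "type": "text",
    "text": "Answer questions based on the document below.",
    "cache_control": {"type": "ephemeral"},
}]
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "<document>", "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": "<question>"},
    ],
}]

used = count_cache_breakpoints(system, messages)
print(f"{used} of {MAX_CACHE_BREAKPOINTS} cache breakpoints used")
```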
Caching pays off whenever you send the same context more than once within 5 minutes, provided the context meets the model's minimum cacheable size (2,048 tokens for Haiku 4.5, 1,024 tokens for Sonnet 4.6). For cacheable content, the write cost is recovered on the second request:
| Scenario | Cached Tokens | Haiku Breakeven | Sonnet Breakeven |
|---|---|---|---|
| Small system prompt (500 tokens) | 500 tokens | Not cacheable (below minimum) | Not cacheable (below minimum) |
| Large system prompt (2K tokens) | 2,000 tokens | Not cacheable (below minimum) | 2nd request |
| Document context (20K tokens) | 20,000 tokens | 2nd request | 2nd request |
| Full book (100K tokens) | 100,000 tokens | 2nd request | 2nd request |
Cache writes cost 125% (1.25×) of standard input and cache reads cost 10% (0.10×). You break even on the 2nd request, which is the first cache read: two uncached requests cost 2.00× the standard input price, while one write plus one read costs 1.25× + 0.10× = 1.35×. From the 2nd request onward, every cache hit is 90% off for that token segment.
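The breakeven point can be checked with a few lines of arithmetic. This sketch works in multiples of the standard input price and assumes every request after the first lands inside the TTL window; the function names are illustrative.

```python
def relative_costs(n_requests: int) -> tuple[float, float]:
    """Return (uncached, cached) cost in multiples of the standard input price."""
    uncached = 1.00 * n_requests
    # One cache write, then a cache read on every later request.
    cached = 1.25 + 0.10 * (n_requests - 1)
    return uncached, cached


def breakeven_request(max_requests: int = 100):
    """Return the first request count at which caching is no more expensive."""
    for n in range(1, max_requests + 1):
        uncached, cached = relative_costs(n)
        if cached <= uncached:
            return n
    return None


for n in (1, 2, 3, 10):
    uncached, cached = relative_costs(n)
    print(f"{n} request(s): uncached {uncached:.2f}x, cached {cached:.2f}x")

print("Breakeven at request", breakeven_request())
```

A single request costs more with caching (1.25× versus 1.00×), which is why caching only pays off when the same prefix is sent again before the cache expires.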
Cache writes: 125% of standard input rate (e.g., $1.00/MTok for Haiku, $3.75/MTok for Sonnet). Cache reads: 10% of standard input rate (e.g., $0.08/MTok for Haiku, $0.30/MTok for Sonnet). Output tokens are unchanged. From the 2nd request within the TTL window, every cached token costs 90% less.
Minimum cacheable block: 2,048 tokens for Haiku 4.5, 1,024 tokens for Sonnet 4.6 and Opus 4.7. Content shorter than the minimum is not cached even when cache_control is set on the block. Large document contexts and long system prompts typically exceed this threshold; short system prompts may not.
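A quick guard against marking content that is too short to cache might look like the following. The minimums are the ones stated above; the dictionary keys and the `is_cacheable` helper are hypothetical, and the token count is assumed to come from whatever tokenizer or counting method you already use.

```python
# Minimum cacheable sizes in tokens, as stated above.
MIN_CACHEABLE_TOKENS = {
    "Claude Haiku 4.5": 2048,
    "Claude Sonnet 4.6": 1024,
    "Claude Opus 4.7": 1024,
}


def is_cacheable(token_count: int, model: str) -> bool:
    """Return True if a block of token_count tokens meets the model's minimum."""
    return token_count >= MIN_CACHEABLE_TOKENS[model]


for model in MIN_CACHEABLE_TOKENS:
    for tokens in (500, 2_000, 20_000):
        verdict = "cacheable" if is_cacheable(tokens, model) else "too short"
        print(f"{model}, {tokens:,} tokens: {verdict}")
```

Note that a 2,000-token prompt clears the Sonnet 4.6 and Opus 4.7 minimum but falls just short of the Haiku 4.5 minimum.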
Add "cache_control": {"type": "ephemeral"} to the content block you want to cache. Typically: the last block of your system prompt, or a large document in the messages array. You can have up to 4 cache breakpoints. Cache TTL is 5 minutes.
Yes — all Claude 4 models (Haiku 4.5, Sonnet 4.6, Opus 4.7) support prompt caching. Rates and minimum sizes differ slightly. The 90% cache read discount applies to all models. Cache is per-API-key and per-model.
Claude uses explicit cache_control markers on content blocks (you choose what to cache); OpenAI's caching is automatic. Claude cache reads cost 10% of the standard input rate (a 90% discount); OpenAI cached inputs cost 50%. Claude charges a premium for cache writes (1.25×); OpenAI does not. For large fixed contexts that are reused several times, Claude's 90% discount typically outweighs its write premium and outperforms OpenAI's 50% discount.
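Using only the multipliers quoted above, the two discount schemes can be compared as a function of how often a prefix is reused. This is a rough sketch: each figure is relative to that provider's own standard input price, so it compares discount structures, not absolute dollar costs, and it assumes every request after the first is a cache hit.

```python
def claude_relative_cost(n_requests: int) -> float:
    """One 1.25x cache write, then 0.10x cache reads."""
    return 1.25 + 0.10 * (n_requests - 1)


def openai_relative_cost(n_requests: int) -> float:
    """One full-price request, then 0.50x cached inputs."""
    return 1.00 + 0.50 * (n_requests - 1)


for n in (1, 2, 5, 10):
    print(
        f"{n} request(s): Claude {claude_relative_cost(n):.2f}x, "
        f"OpenAI {openai_relative_cost(n):.2f}x"
    )
```

With these figures the write premium makes a single request relatively more expensive under Claude's scheme, and the deeper read discount pulls ahead from the second request onward.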