Prompt Caching——Claude Code's "Money-Saving Secret"

Have you noticed that asking Claude Code the same question about a codebase gives you a faster, cheaper response the second time?
This isn’t an illusion—it’s Prompt Caching at work. It’s like restaurant prep—prepare common broth in advance, when a customer orders, just heat it up. Fast and cost-effective.
Today we’re揭开ing this “money-saving secret” mask.
The diagram: Prompt caching is like restaurant prep, preparing common ingredients in advance
Why Caching Is Needed
Let’s do the math first.
Claude API charges by token: input tokens (sent to the model) + output tokens (model responses). System prompts and tool definitions are sent every time—this is the “fixed overhead.”
Assume:
- System prompt: 5K tokens
- Tool definitions: 15K tokens
- Historical conversation: 10K tokens
- Single request total: 30K tokens
Without caching, every request costs 30K. But if you ask 10 consecutive questions in one codebase, the system prompt and tool definitions are actually the same—paying repeatedly is wasteful.
Prompt caching solves this: cache the unchanged “prefix,” pay once, and subsequent requests reuse it.
Cache Breakpoints: Where to Cut
Caching doesn’t cache the entire prompt—there are “breakpoints.”
Imagine shipping packages: you don’t label the entire box, but put labels at specific positions for sorting.
Claude Code’s cache breakpoint design:
System Prompts: From the beginning to before the tool list. This is the least-variable part.
Tool List: Tool definitions after stable sorting. This is relatively stable.
Dynamic Content: User input, historical messages, tool results—this part isn’t cached.
[System Prompt] ← Cache Breakpoint 1
[Tool List] ← Cache Breakpoint 2
[Historical Messages]
[User Input]
The diagram: Cache breakpoint design divides prompts into cacheable and non-cacheable parts
When the model API receives a request, it checks if the part before cache breakpoints is already cached. If so, it reuses directly and only bills for the part after breakpoints.
Tool Ordering: Why Order Matters
Tool list ordering directly affects cache hit rate.
Assume you have these tools: FileRead, FileEdit, Bash, Grep, Glob. If the order keeps changing:
Request 1: [Bash, FileRead, FileEdit, Glob, Grep]
Request 2: [Bash, FileEdit, FileRead, Grep, Glob] ← Order changed! Cache miss
As long as one tool’s position changes, the entire cache key changes, and all previous cache becomes invalid.
Claude Code’s solution: Stable Sorting.
- Built-in tools in fixed order (like alphabetical)
- MCP tools also alphabetically sorted
- Built-in tools first, MCP tools after
This way, as long as the tool set doesn’t change, order doesn’t change, and cache hits.
Request 1: [Bash, FileEdit, FileRead, Glob, Grep] + [MCP-A, MCP-B]
Request 2: [Bash, FileEdit, FileRead, Glob, Grep] + [MCP-A, MCP-B] ← Cache hit!
Cache Interruption: When It Becomes Invalid
Cache isn’t permanent—certain situations cause “interruption”:
Tool Changes: If you add a new MCP tool or disable a tool, the tool list changes and cache becomes invalid.
System Prompt Updates: If Claude Code version updates, system prompts change and cache becomes invalid.
Model Switching: Different models have different system prompts—switching models invalidates cache.
Session Timeout: Cache has an expiration time—prolonged disuse invalidates it.
Explicit Refresh: Certain operations trigger cache refresh.
Understanding these interruption scenarios helps you optimize usage:
- Don’t frequently add/remove MCP tools
- Ask consecutive questions in the same session for higher hit rates
- Avoid frequently switching models
The diagram: Common causes of cache interruption
Practical: Maximizing Cache Benefits
If you want to maximize prompt caching benefits:
Maintain Session Continuity: Ask consecutive questions in the same codebase, don’t frequently start new sessions.
Stabilize Tool Configuration: Finalize which MCP tools you want to use, don’t frequently add/remove.
Use CLAUDE.md: Project-level configuration goes in CLAUDE.md—it gets included in the cached prefix, maintaining stability.
Monitor Cache Hit Rate: Claude Code’s status output shows cache hit information.
Understand Cost Structure: When cache hits, you only pay for user input + output tokens—the system part is free.
Cost Comparison: How Much Can Be Saved
How much can actually be saved? Example:
Scenario: 10 consecutive questions in a large codebase
Without Cache:
- Each request: 30K tokens × 10 requests = 300K tokens
With Cache (assuming 80% hit rate):
- First request: 30K tokens (establish cache)
- Subsequent 9 requests: 5K tokens (user input) × 9 = 45K tokens
- Total: 75K tokens
Savings: (300K - 75K) / 300K = 75%
Actual savings depend on usage patterns, but typically reach 50-80%.
Implications for Building AI Agents
If you want to implement prompt caching in your own AI applications:
Identify Stable Prefixes: System prompts and tool definitions are relatively stable—suitable for caching.
Design Stable Ordering: Ensure tool list order remains consistent between requests.
Set Reasonable Breakpoints: Set breakpoints between “stable parts” and “dynamic parts.”
Handle Cache Invalidation: Gracefully fall back to non-cached mode when cache becomes invalid.
Monitor and Optimize: Provide visibility into cache hit rates to help users optimize.
Summary
Prompt caching is Claude Code’s “money-saving secret”—by caching unchanged system prompts and tool definitions, it dramatically reduces API call costs.
Key designs:
- Cache breakpoints: Divide prompts into cacheable and non-cacheable parts
- Stable sorting: Tool list maintains fixed order to ensure cache hits
- Interruption handling: Identify cache invalidation scenarios and handle gracefully
Understanding this helps you:
- Use Claude Code more effectively (maintain session continuity)
- Understand cost structure (know where money goes)
- Implement similar optimizations in your own AI applications
