Prompt Caching——Claude Code's "Money-Saving Secret"

Have you noticed that asking Claude Code the same question about a codebase gives you a faster, cheaper response the second time?

This isn’t an illusion—it’s Prompt Caching at work. It’s like restaurant prep—prepare common broth in advance, when a customer orders, just heat it up. Fast and cost-effective.

Today we’re揭开ing this “money-saving secret” mask.

Prompt Caching Like Prep The diagram: Prompt caching is like restaurant prep, preparing common ingredients in advance

Why Caching Is Needed

Let’s do the math first.

Claude API charges by token: input tokens (sent to the model) + output tokens (model responses). System prompts and tool definitions are sent every time—this is the “fixed overhead.”

Assume:

System prompt: 5K tokens
Tool definitions: 15K tokens
Historical conversation: 10K tokens
Single request total: 30K tokens

Without caching, every request costs 30K. But if you ask 10 consecutive questions in one codebase, the system prompt and tool definitions are actually the same—paying repeatedly is wasteful.

Prompt caching solves this: cache the unchanged “prefix,” pay once, and subsequent requests reuse it.

Cache Breakpoints: Where to Cut

Caching doesn’t cache the entire prompt—there are “breakpoints.”

Imagine shipping packages: you don’t label the entire box, but put labels at specific positions for sorting.

Claude Code’s cache breakpoint design:

System Prompts: From the beginning to before the tool list. This is the least-variable part.

Tool List: Tool definitions after stable sorting. This is relatively stable.

Dynamic Content: User input, historical messages, tool results—this part isn’t cached.

[System Prompt] ← Cache Breakpoint 1
[Tool List]    ← Cache Breakpoint 2
[Historical Messages]
[User Input]

Cache Breakpoints The diagram: Cache breakpoint design divides prompts into cacheable and non-cacheable parts

When the model API receives a request, it checks if the part before cache breakpoints is already cached. If so, it reuses directly and only bills for the part after breakpoints.

Tool Ordering: Why Order Matters

Tool list ordering directly affects cache hit rate.

Assume you have these tools: FileRead, FileEdit, Bash, Grep, Glob. If the order keeps changing:

Request 1: [Bash, FileRead, FileEdit, Glob, Grep]
Request 2: [Bash, FileEdit, FileRead, Grep, Glob]  ← Order changed! Cache miss

As long as one tool’s position changes, the entire cache key changes, and all previous cache becomes invalid.

Claude Code’s solution: Stable Sorting.

Built-in tools in fixed order (like alphabetical)
MCP tools also alphabetically sorted
Built-in tools first, MCP tools after

This way, as long as the tool set doesn’t change, order doesn’t change, and cache hits.

Request 1: [Bash, FileEdit, FileRead, Glob, Grep] + [MCP-A, MCP-B]
Request 2: [Bash, FileEdit, FileRead, Glob, Grep] + [MCP-A, MCP-B]  ← Cache hit!

Cache Interruption: When It Becomes Invalid

Cache isn’t permanent—certain situations cause “interruption”:

Tool Changes: If you add a new MCP tool or disable a tool, the tool list changes and cache becomes invalid.

System Prompt Updates: If Claude Code version updates, system prompts change and cache becomes invalid.

Model Switching: Different models have different system prompts—switching models invalidates cache.

Session Timeout: Cache has an expiration time—prolonged disuse invalidates it.

Explicit Refresh: Certain operations trigger cache refresh.

Understanding these interruption scenarios helps you optimize usage:

Don’t frequently add/remove MCP tools
Ask consecutive questions in the same session for higher hit rates
Avoid frequently switching models

The diagram: Common causes of cache interruption

Practical: Maximizing Cache Benefits

If you want to maximize prompt caching benefits:

Maintain Session Continuity: Ask consecutive questions in the same codebase, don’t frequently start new sessions.

Stabilize Tool Configuration: Finalize which MCP tools you want to use, don’t frequently add/remove.

Use CLAUDE.md: Project-level configuration goes in CLAUDE.md—it gets included in the cached prefix, maintaining stability.

Monitor Cache Hit Rate: Claude Code’s status output shows cache hit information.

Understand Cost Structure: When cache hits, you only pay for user input + output tokens—the system part is free.

Cost Comparison: How Much Can Be Saved

How much can actually be saved? Example:

Scenario: 10 consecutive questions in a large codebase

Without Cache:

Each request: 30K tokens × 10 requests = 300K tokens

With Cache (assuming 80% hit rate):

First request: 30K tokens (establish cache)
Subsequent 9 requests: 5K tokens (user input) × 9 = 45K tokens
Total: 75K tokens

Savings: (300K - 75K) / 300K = 75%

Actual savings depend on usage patterns, but typically reach 50-80%.

Implications for Building AI Agents

If you want to implement prompt caching in your own AI applications:

Identify Stable Prefixes: System prompts and tool definitions are relatively stable—suitable for caching.

Design Stable Ordering: Ensure tool list order remains consistent between requests.

Set Reasonable Breakpoints: Set breakpoints between “stable parts” and “dynamic parts.”

Handle Cache Invalidation: Gracefully fall back to non-cached mode when cache becomes invalid.

Monitor and Optimize: Provide visibility into cache hit rates to help users optimize.

Summary

Prompt caching is Claude Code’s “money-saving secret”—by caching unchanged system prompts and tool definitions, it dramatically reduces API call costs.

Key designs:

Cache breakpoints: Divide prompts into cacheable and non-cacheable parts
Stable sorting: Tool list maintains fixed order to ensure cache hits
Interruption handling: Identify cache invalidation scenarios and handle gracefully

Understanding this helps you:

Use Claude Code more effectively (maintain session continuity)
Understand cost structure (know where money goes)
Implement similar optimizations in your own AI applications

Why Caching Is Needed#

Cache Breakpoints: Where to Cut#

Tool Ordering: Why Order Matters#

Cache Interruption: When It Becomes Invalid#

Practical: Maximizing Cache Benefits#

Cost Comparison: How Much Can Be Saved#

Implications for Building AI Agents#

Summary#