Token Budget - How to Spend 200K Wisely?

Imagine you have a $200 budget for grocery shopping, but a huge list of things to buy: rice, flour, oil, salt, vegetables, fruits, snacks, drinks. How do you allocate to eat well without overspending? And if you’re running low on money, what should you cut?

Claude Code faces the same challenge with its 200K token budget.

Budget Allocation: Spend Where It Counts

200K tokens is a “fortune,” but there are also many places to spend it. Here’s roughly how Claude Code allocates its budget:

Expense Item	Percentage	Purpose
System Prompt	10-20%	AI’s “onboarding manual”
Current Conversation	30-40%	Recent Q&A exchanges
Historical Context	20-30%	Previous conversation (may be compressed)
Reserved Buffer	10-20%	Space for new messages

It’s like your monthly salary allocation:

System Prompt: Rent and utilities (fixed cost, can’t cut)
Current Conversation: Meals (necessary spending, but can control)
Historical Context: Savings (save what you can, withdraw if needed)
Reserved Buffer: Emergency fund (just in case)

Three Gates: Insurance Against Overspending

Claude Code has three “gates” to prevent budget overruns, like a dam’s staged flood release:

Gate 1: Single Tool 50K Limit

Imagine asking AI to read a 10MB log file. Without control, this would devour the entire budget.

So Claude Code sets a rule: if a single tool result exceeds 50K characters, store it on disk and show the AI only a preview of the first 2KB.

It’s like a package too large for the concierge to bring into the building: you get the tracking number, and you pick the package up from the warehouse yourself if you need it.

Special rules:

Read tool: Never persisted to disk (it has its own size control)
Images: Pass through directly (counted as fixed 2000 tokens)
Empty results: Filled with placeholders (preventing certain models from misjudging)
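The rules for Gate 1 can be sketched roughly as follows. This is a minimal illustration, not Claude Code's actual implementation: the function name `gate_tool_result`, the placeholder text, the file naming, and the choice to treat "2KB" as 2,000 characters are all assumptions made for the example.

```python
import hashlib
from pathlib import Path

MAX_RESULT_CHARS = 50_000  # per-result limit described in the text
PREVIEW_CHARS = 2_000      # "first 2KB", approximated here as 2,000 characters


def gate_tool_result(tool_name: str, result: str, store_dir: Path) -> str:
    """Return what the model should see for a single tool result."""
    if not result:
        # Empty results get a placeholder (hypothetical wording).
        return "(no output)"
    if tool_name == "Read" or len(result) <= MAX_RESULT_CHARS:
        # Read controls its own size; small results pass through untouched.
        return result
    # Oversized: keep the full text on disk and hand back only a preview.
    store_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(result.encode("utf-8")).hexdigest()[:12]
    path = store_dir / f"{tool_name}-{digest}.txt"
    path.write_text(result, encoding="utf-8")
    return f"{result[:PREVIEW_CHARS]}\n[truncated; full output saved to {path}]"
```

The preview keeps a pointer to the saved file, so the model can still fetch the rest on demand.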

Gate 2: Single Message 200K Limit

If the AI makes 5 searches at once, each returning 40K characters, that’s 200K in total. Gate 1 can’t help here, since no single result exceeded 50K, yet the combined load is at the limit.

So Gate 2 rules: All tool results in a single message cannot exceed 200K characters combined.

It’s like ordering food delivery: individual meals aren’t capped, but a single order can’t exceed $200 total.

How is over-budget handled? Sort by size, persist the largest ones first, until total drops below 200K.
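A minimal sketch of that "largest first" selection, assuming each result is identified by a hypothetical id and described only by its character count. The function name is an assumption, and the small preview left behind after persisting is ignored for simplicity.

```python
MAX_MESSAGE_CHARS = 200_000  # combined limit described in the text


def choose_results_to_persist(result_sizes: dict[str, int]) -> list[str]:
    """Return the ids of results to move to disk, largest first."""
    total = sum(result_sizes.values())
    persisted: list[str] = []
    by_size = sorted(result_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for result_id, size in by_size:
        if total <= MAX_MESSAGE_CHARS:
            break  # back under budget, keep the remaining results inline
        persisted.append(result_id)
        total -= size  # simplification: the leftover preview is not counted
    return persisted
```

Persisting the largest results first frees the most space with the fewest replacements, so more results stay fully visible.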

There’s an elegant design here: state freezing. Once the AI has “seen” a tool result, the decision about how it was shown (full content or preview) is fixed for good, which keeps the prompt cache stable. If the AI saw the full content on one turn but got a preview on the next, the cached prefix would no longer match and the cache would miss.

It’s like telling the waiter “I’ll have this dish,” you can’t change your mind after the chef finishes cooking.

Gate 3: Context Window Tracking

The first two gates are “character” limits, the third is a “token” limit. Claude Code estimates token count two ways:

Precise counting: The API returns a usage field with:

input_tokens (this input)
cache_creation_input_tokens (newly written to cache)
cache_read_input_tokens (read from cache)
output_tokens (AI output)
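One way to turn these four fields into a single context-size number is to add them up. The sketch below assumes that the sum is a reasonable proxy for how much of the window is occupied after the response; the helper name `context_tokens` is hypothetical.

```python
USAGE_FIELDS = (
    "input_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
    "output_tokens",
)


def context_tokens(usage: dict) -> int:
    """Sum the usage fields, treating missing or null fields as zero."""
    return sum(usage.get(field) or 0 for field in USAGE_FIELDS)
```

The output tokens are included because the response becomes part of the input on the next turn.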

Rough estimation: Character count ÷ 4. Different file types use different coefficients:

Plain text: ÷ 4
JSON: ÷ 2 (JSON has many symbols, higher token density)
Images/PDFs: Fixed 2000 tokens
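The coefficients above translate into a very small estimator. This is a sketch under the stated ratios only; the function name and the `kind` labels are assumptions for illustration.

```python
import math

CHARS_PER_TOKEN = {"text": 4, "json": 2}  # divisors from the list above
FIXED_MEDIA_TOKENS = 2_000                # flat cost for images and PDFs


def estimate_tokens(content: str, kind: str = "text") -> int:
    """Rough token estimate based on character count and content kind."""
    if kind in ("image", "pdf"):
        return FIXED_MEDIA_TOKENS
    # Unknown kinds fall back to the plain-text ratio.
    return math.ceil(len(content) / CHARS_PER_TOKEN.get(kind, 4))
```

With these ratios, 100,000 characters of JSON are estimated at 50,000 tokens, while the same amount of plain text is estimated at 25,000.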

Why estimate at all? Because between API calls there’s no way to know the exact token count. The estimate can be off by as much as ±50%, but that’s fine: compaction is the safety net anyway.

Parallel Tool Counting Pitfalls

Here’s a trap: token counting is easy to get wrong with parallel tool calls.

When AI simultaneously makes two tool calls, the internal message array looks like:

[..., AI message, tool result 1, AI message, tool result 2]

Notice that the two AI messages share the same ID (they come from the same API response). If you simply search backward for “the last AI message with usage,” you land on the second one and count only what follows it, so tool result 1 is missed.

Claude Code’s solution: Trace backward to the first AI message with the same ID, then count all subsequent messages from that position.

It’s like counting a line of people - you can’t just count those near you, you have to count the whole line.
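The backward trace can be sketched as below. The message shape (dicts with `role`, `id`, `usage`, and `content`) and the function names are assumptions for the example, not Claude Code's internal types. Messages after the anchor are estimated with the characters ÷ 4 rule.

```python
import math


def rough_tokens(message: dict) -> int:
    """Characters / 4 estimate for a message's text content."""
    return math.ceil(len(message.get("content", "")) / 4)


def estimate_context_tokens(messages: list[dict]) -> int:
    """Exact usage from the last API response plus an estimate of what followed."""
    last = next(
        (i for i in range(len(messages) - 1, -1, -1)
         if messages[i].get("role") == "assistant" and messages[i].get("usage")),
        None,
    )
    if last is None:
        # No API response yet: everything has to be estimated.
        return sum(rough_tokens(m) for m in messages)

    response_id = messages[last]["id"]
    first = last
    # Walk back to the first assistant message from the same API response.
    for i in range(last - 1, -1, -1):
        if messages[i].get("role") != "assistant":
            continue  # tool results sit between the split assistant messages
        if messages[i].get("id") != response_id:
            break
        first = i

    usage_total = sum(messages[last]["usage"].values())
    # Count everything after the first sibling, so no tool result is skipped.
    return usage_total + sum(rough_tokens(m) for m in messages[first + 1:])
```

Counting from the first sibling also re-estimates the later assistant fragments, which leans toward overestimating rather than underestimating.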

Token Budget Practical Strategies

Understanding the budget mechanism, here’s how to optimize your usage:

1. Control Search Scope

Instead of letting AI search the entire codebase, narrow the scope:

# Bad approach
Search "TODO"

# Good approach
Search "TODO", but only in the src/ directory

It’s like shopping with a list - more effective than aimless wandering.

2. Batch Processing

If you need to analyze many files, process in batches:

First analyze core modules, then handle edge functionality

Don’t have the AI pull in 10 files at once. That might trigger persistence, leaving the AI with only previews.

3. Use the Read Tool

Read tool gets special treatment: never persisted to disk. So if a file is important, Read is more reliable than cat commands.

It’s like the VIP lane - no waiting.

4. Watch Out for JSON Files

JSON files have 2x the token density of regular files. A 100KB JSON actually consumes about 50K tokens, not 25K.

Be extra careful when reading large JSON config files.

5. Anticipate Compaction Timing

Claude Code triggers auto-compaction when context approaches the limit. But estimation can be off:

If content is primarily English, estimation is conservative (actual tokens fewer)
If content has lots of code/JSON, estimation might be low (actual tokens more)

So compaction might come earlier or later than expected - be prepared.

What to Do When Budget Runs Out?

When tokens are truly running low, Claude Code will:

Trigger auto-compaction: Pack historical conversation into a summary
Clear old tool results: Microcompaction kicks in
Prompt user: “Context is filling up, consider starting a new session”

It’s like when your credit card maxes out:

First try installment payments (compaction)
If that doesn’t work, freeze some credit (clear tool results)
Finally remind you: “Time to pay up” (start new session)

When Should You Start a New Session?

Conversation has gone on a long time (dozens of rounds)
Topic has switched (from fixing bugs to discussing architecture)
AI starts “forgetting” early content (compressed away)

Starting a new session isn’t surrender - it’s smart budget management.

Dynamic Adjustment: Budget Isn’t Fixed

Claude Code’s budget allocation is dynamic:

Large tool results → Compress more history
Short conversation → Keep more history
Near limit → Aggressive compression

It’s like spending strategies while traveling:

Just arrived, money plentiful, eat well and stay comfortably
Halfway through, start budgeting carefully
Running low on funds, hostel and instant noodles

Implications for Building AI Agents

If you want to design your own AI agent, token budget management is worth learning from:

1. Layered Defense

Don’t rely on a single gate - layer them:

First layer: Single result limit (prevent individual bombs)
Second layer: Single message limit (prevent accumulation effects)
Third layer: Global tracking (final safety net)

2. Prefer Overestimation to Underestimation

When estimating tokens, be conservative:

Overestimation cost: Early compaction (performance loss)
Underestimation cost: API call failure (user experience breaks)

3. State Management Matters

Once a model “sees” content, that decision must remain consistent. This requires state management, not stateless pure functions.

4. Consider Cache Impact

The existence of prompt cache changes many designs:

Can’t arbitrarily change sent content
State must persist to session files
State must be restored when resuming sessions

Summary

Token budget is the “goalkeeper” of Claude Code’s context management:

Three gates: Single tool result 50K characters, single message 200K characters, global 200K tokens
Two estimations: Precise count (API return) + rough estimate (characters ÷ 4)
Dynamic adjustment: Flexible allocation based on usage
State freezing: Seen content never changes, ensuring cache stability

It’s like financial management:

Budget enables planning
Gates prevent overspending
Contingency plans enable calm responses

Understanding token budget lets you:

Anticipate when compaction might trigger
Wisely arrange tool call strategies
Implement similar mechanisms in your own AI agents

Next up: Cache Interruptions - What Causes Them to Fail?
