Context Management - The Core Capability of AI Coding

Table of Contents
- Why Context Management is a Core Capability
- Principle 1: Set a Budget for Everything
- Principle 2: Preserve Important Content
- Principle 3: Inform, Don’t Hide
- Principle 4: Circuit Breaker for Runaway Loops
- Principle 5: Conservative Estimation
- How the Five Principles Relate
- Implications for Building AI Agents
- Practical: Auditing Your Agent
- Summary
Have you ever wondered why Claude Code can remember what you said dozens of turns into a conversation, while regular AI chat seems to “forget” as you go along?
The secret lies in context management - it’s not just about “how many characters can be remembered,” but rather like a professional organizer maximizing useful information density in limited closet space. Today, let’s talk about how Claude Code turns context management into a core capability.
Why Context Management is a Core Capability
Imagine your desk is 200 centimeters wide. Seems like enough space, right? But then you casually place: coffee cup, notebook, reference materials, scratch paper, snack packaging… and soon there’s no room to work.
An AI’s context window (200K tokens) is like this desk. System prompts take 15-20K, each tool call returns 5-50K, read a few files and search some code, and you’ve already used up half the space. What’s worse is that expired tool results, redundant file contents, and resolved discussions pile up on the “desktop,” diluting the AI’s attention and causing response quality to decline.
The core proposition of context management: Maximize information density in limited space while ensuring critical content isn’t lost.
Claude Code solves this problem with five principles.
Principle 1: Set a Budget for Everything
Have you ever done a household budget? Rent can’t exceed 30% of income, food is kept to 20%, entertainment at most 10%… if one category overspends, others must cut back.
Claude Code’s context management is also strict “budgeting”:
Budget System Overview
| Content Source | Budget Limit | Engineering Rationale |
|---|---|---|
| Single tool result | 50K characters | Prevents a single oversized result from filling context |
| All tool results in one message | 200K characters | Prevents 10 parallel tools at 50K each from causing 500K flood |
| Skill list description | 1% of context window | 1000 skills can’t dominate the conversation |
| Post-compact restored files | Max 5 files, 5K tokens per file, 50K total | Restoring too much means no compression |
| Post-compact restored skills | 5K per skill, 25K total | Keep skill system usable without bloat |
These numbers weren’t pulled out of thin air. Take the 1% budget for skill lists: as users install more and more skills, without limits, skill descriptions could fill the entire context. Claude Code’s solution is a three-level truncation cascade - truncate descriptions to 250 characters first, then truncate low-priority skills, finally keeping only built-in skill names. Even if you install 1000 skills, the skill list will never exceed 1% of the context window.
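The cascade described above can be sketched as follows. This is a minimal illustration, not Claude Code’s actual implementation: the `Skill` shape, the `builtIn` priority flag, and the 4-characters-per-token heuristic are all assumptions.

```typescript
// Hypothetical sketch of the three-level truncation cascade for a skill list.
interface Skill {
  name: string
  description: string
  builtIn: boolean // stands in for "high priority" in this sketch
}

const CONTEXT_WINDOW_TOKENS = 200_000
const SKILL_LIST_BUDGET_TOKENS = Math.floor(CONTEXT_WINDOW_TOKENS * 0.01) // 1% budget

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4) // rough plain-text heuristic
}

function renderSkillList(skills: Skill[]): string {
  // Level 1: truncate every description to 250 characters.
  let entries = skills.map(s => `${s.name}: ${s.description.slice(0, 250)}`)
  if (estimateTokens(entries.join("\n")) <= SKILL_LIST_BUDGET_TOKENS) {
    return entries.join("\n")
  }
  // Level 2: drop descriptions of lower-priority (non-built-in) skills.
  entries = skills.map(s =>
    s.builtIn ? `${s.name}: ${s.description.slice(0, 250)}` : s.name
  )
  if (estimateTokens(entries.join("\n")) <= SKILL_LIST_BUDGET_TOKENS) {
    return entries.join("\n")
  }
  // Level 3: keep only built-in skill names.
  return skills.filter(s => s.builtIn).map(s => s.name).join("\n")
}
```

Each level sacrifices more detail, but the output can never outgrow the 1% budget, no matter how many skills are installed.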
The Wisdom of Double Protection
Notice the design behind the 50K and 200K numbers:
// Single tool result cannot exceed 50K
export const DEFAULT_MAX_RESULT_SIZE_CHARS = 50_000
// But all tool results in one message cannot exceed 200K
export const MAX_TOOL_RESULTS_PER_MESSAGE_CHARS = 200_000
This is like: each drawer can hold a maximum of 50 items (reasonable), but if you open 10 drawers at the same time, the total cannot exceed 200 items (preventing “legal but dangerous” combinations).
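A minimal sketch of this double protection. The two constants come from the source above; the clamping function and its behavior at the boundary are illustrative assumptions:

```typescript
const DEFAULT_MAX_RESULT_SIZE_CHARS = 50_000
const MAX_TOOL_RESULTS_PER_MESSAGE_CHARS = 200_000

// Hypothetical helper: apply both layers of protection to a batch of results.
function clampToolResults(results: string[]): string[] {
  // First layer: no single result may exceed 50K characters.
  const perItem = results.map(r => r.slice(0, DEFAULT_MAX_RESULT_SIZE_CHARS))

  // Second layer: stop adding results once the message-level 200K budget
  // is spent, truncating the result that crosses the line.
  const kept: string[] = []
  let used = 0
  for (const r of perItem) {
    const remaining = MAX_TOOL_RESULTS_PER_MESSAGE_CHARS - used
    if (remaining <= 0) break
    const piece = r.slice(0, remaining)
    kept.push(piece)
    used += piece.length
  }
  return kept
}
```

Ten parallel tools each returning 50K individually pass the first layer, but the second layer caps the combined total at 200K, closing the “legal but dangerous” gap.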
Anti-pattern: Unbounded Content Injection
Injecting tool results and file contents into the context without limits is like dumping everything from a supermarket shopping cart onto your desk - it looks abundant, but there’s no room left to work.
Principle 2: Preserve Important Content
When moving houses, what do you do? Pack everything and unpack it all at the new place? That’s foolish. The smart approach is: pack most things in boxes, but keep the most frequently used items (toiletries, change of clothes) in a carry-on bag so you can use them on the first day at the new place.
Claude Code’s compression-restoration mechanism follows this exact thinking.
Compression Loses Details
Auto-compact summarizes an entire conversation into a few sentences, freeing up context space. But compression has a side effect: specific code content, file paths, and exact line number references are all lost. If the model completely forgets what files it read after compression, it has to read them again - wasting tool calls and waiting time.
Selective Restoration Strategy
// Restore the 5 most recent files after compression
export const POST_COMPACT_MAX_FILES_TO_RESTORE = 5
The elegance of the restoration strategy:
- Snapshot before compression: use `cacheToObject()` to record the current state
- Execute compression: the conversation becomes a summary
- Selective restoration:
  - the 5 most recent files (not all)
  - max 5K tokens per file (not the full content)
  - total does not exceed 50K tokens
  - already-sent skills are not re-injected (saves 4K tokens)
This is like the “carry-on bag” when moving - only bring the most urgently needed items, not carrying the entire house on your back.
Key insight: Restoring too much means no compression happened; restoring too little means over-compression. Five files, 5K tokens, 50K total - these are the “sweet spots” validated through engineering.
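The selection logic might look like the sketch below. The `FileSnapshot` shape, the recency field, and the chars-per-token heuristic are assumptions for illustration; only the three limits come from the source:

```typescript
const POST_COMPACT_MAX_FILES_TO_RESTORE = 5
const MAX_TOKENS_PER_RESTORED_FILE = 5_000
const MAX_TOTAL_RESTORE_TOKENS = 50_000
const CHARS_PER_TOKEN = 4 // rough plain-text heuristic

// Hypothetical snapshot entry recorded before compression.
interface FileSnapshot {
  path: string
  content: string
  lastAccessedAt: number // unix ms, used to rank recency
}

function selectFilesToRestore(files: FileSnapshot[]): FileSnapshot[] {
  // Most recently accessed files first.
  const byRecency = [...files].sort((a, b) => b.lastAccessedAt - a.lastAccessedAt)
  const restored: FileSnapshot[] = []
  let totalTokens = 0
  for (const file of byRecency) {
    if (restored.length >= POST_COMPACT_MAX_FILES_TO_RESTORE) break
    // Trim each file to its per-file token budget.
    const content = file.content.slice(0, MAX_TOKENS_PER_RESTORED_FILE * CHARS_PER_TOKEN)
    const tokens = Math.ceil(content.length / CHARS_PER_TOKEN)
    // Stop entirely once the total budget would be exceeded.
    if (totalTokens + tokens > MAX_TOTAL_RESTORE_TOKENS) break
    restored.push({ ...file, content })
    totalTokens += tokens
  }
  return restored
}
```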
Anti-pattern: Full Compression or Full Preservation
Either restore nothing (forcing the model to start from scratch) or try to preserve everything (compression effect is zero). Like when moving: either pack everything (can’t live the first day) or pack nothing (the moving truck can’t hold it all).
Principle 3: Inform, Don’t Hide
Have you ever experienced this: someone hands you a “condensed version” of some materials without telling you what was cut. You make decisions based on it, only to discover later that the critical information had already been removed - and you had no idea.
AI encounters the same problem. If context is truncated or compressed without the model knowing, it may make wrong decisions based on incomplete information, or even “fabricate” content it can’t remember - because it doesn’t know its information is incomplete.
Four Implementations of Transparent Truncation
1. Tool Result Truncation Notification
When a tool result exceeds 50K characters, Claude Code doesn’t silently truncate. Instead:
- Full result written to disk
- Model receives preview message with “this is truncated, full version at xxx path”
- Model knows: (1) what it sees isn’t complete, (2) how to get the full version
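The steps above can be sketched as follows. The file naming, return shape, and notice wording are all hypothetical; only the 50K limit and the write-to-disk-plus-preview behavior come from the text:

```typescript
import { writeFileSync } from "node:fs"
import { join } from "node:path"
import { tmpdir } from "node:os"

const DEFAULT_MAX_RESULT_SIZE_CHARS = 50_000

interface PreparedResult {
  text: string      // what the model actually sees
  fullPath?: string // where the untruncated result lives, if truncated
}

function prepareToolResult(toolName: string, result: string): PreparedResult {
  if (result.length <= DEFAULT_MAX_RESULT_SIZE_CHARS) return { text: result }

  // Persist the full result to disk so it remains retrievable.
  const fullPath = join(tmpdir(), `${toolName}-${Date.now()}.txt`)
  writeFileSync(fullPath, result)

  // The model learns both that truncation happened and where the rest lives.
  const preview = result.slice(0, DEFAULT_MAX_RESULT_SIZE_CHARS)
  const notice =
    `\n\n[Truncated: showing first ${DEFAULT_MAX_RESULT_SIZE_CHARS} of ` +
    `${result.length} characters. Full result saved to ${fullPath}]`
  return { text: preview + notice, fullPath }
}
```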
2. Cache Micro-compact Notification
When old tool results are cleaned, notifyCacheDeletion() explicitly tells the model “some old tool results have been deleted.” Prevents the model from referencing content that no longer exists.
3. File Read Pagination
FileReadTool reads 2000 lines by default, with pagination support via offset/limit parameters. The tool description explicitly states this behavior - the model knows it only sees the first 2000 lines by default and can specify offset for later content.
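This pagination behavior can be sketched as below. The `offset`/`limit` parameter names come from the text; the function shape and the `hasMore` flag are assumptions:

```typescript
const DEFAULT_READ_LIMIT_LINES = 2000

interface PagedRead {
  lines: string[]
  hasMore: boolean // signals that the view is partial, so the caller can page on
}

function readFileLines(
  content: string,
  offset = 0,
  limit = DEFAULT_READ_LIMIT_LINES
): PagedRead {
  const all = content.split("\n")
  return {
    lines: all.slice(offset, offset + limit),
    hasMore: offset + limit < all.length,
  }
}
```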
4. Explicit Declarations in Compression Summary
The compression prompt requires the summary to include “where we are in the task” and “what still needs to be done” - ensuring the post-compression model knows exactly what stage of the task it’s at.
This is like labeling storage boxes when organizing: “this has reference materials, available to retrieve if needed” - instead of just throwing everything in the basement pretending it doesn’t exist.
Anti-pattern: Silent Truncation
Truncating tool results or deleting context content without the model’s knowledge. The model might reference content that no longer exists and make wrong decisions. Like when you ask a friend “where’s that report,” they say “I threw it away,” but you don’t know, and keep asking “what about the data on page three of the report” - your friend is confused.
Principle 4: Circuit Breaker for Runaway Loops
Have you ever seen a vending machine swallow your money without dispensing, and you frantically press the refund button, but it just doesn’t respond? If this were an AI Agent, it might infinitely retry the “refund operation,” failing each time but trying again until the end of time.
In AI systems, infinite retries are especially dangerous - each retry costs money (API calls), and the failure reason is often systemic (like context being too large to compress), so retries won’t change the outcome.
Claude Code’s Circuit Breaker Design
// Stop trying after 3 consecutive failures
// BQ data: 1,279 sessions had 50+ consecutive failures, up to 3,272
// Wasting ~250K API calls/day
const MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3
The number 3 comes from real data: BigQuery logs showed 1,279 sessions with 50+ consecutive compact failures, one reaching as high as 3,272, wasting approximately 250,000 API calls per day. The circuit breaker turns these infinite loops into “stop after 3 failures.”
Circuit Breaker Family
| Subsystem | Circuit Condition | Circuit Behavior |
|---|---|---|
| Auto-compact | 3 consecutive failures | Stop compacting until session ends |
| YOLO Classifier | 3 consecutive / 20 total denials | Fall back to manual user confirmation |
| max_output_tokens recovery | Max 3 retries | Stop retrying, accept truncated output |
| Prompt-too-long handling | Still exceeds after dropping oldest turns | Drop 20%, don’t drop infinitely |
Each circuit breaker follows the same pattern: set a reasonable retry limit, and when exceeded, degrade to a safe but limited-functionality state, rather than crashing or infinite looping.
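The shared pattern can be sketched as a small generic breaker; the class and method names are illustrative, not from Claude Code:

```typescript
// Consecutive-failure circuit breaker: after N failures in a row,
// skip the operation and degrade to a safe fallback.
class CircuitBreaker<T> {
  private consecutiveFailures = 0

  constructor(private readonly maxConsecutiveFailures = 3) {}

  get isOpen(): boolean {
    return this.consecutiveFailures >= this.maxConsecutiveFailures
  }

  run(operation: () => T, fallback: () => T): T {
    // Once open, the operation is never attempted again.
    if (this.isOpen) return fallback()
    try {
      const result = operation()
      this.consecutiveFailures = 0 // any success resets the count
      return result
    } catch {
      this.consecutiveFailures += 1
      return fallback()
    }
  }
}
```

A caller retrying a failing auto-compact through this breaker pays for at most three attempts, then gets the degraded fallback for the rest of the session.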
Anti-pattern: Infinite Retry
“Compression failed? Try again. Still failed? Change parameters and try again.” In AI Agents, this is like a vending machine frantically pressing the refund button after swallowing money - the machine is broken, pressing more won’t help, and it’s wasting your time.
Principle 5: Conservative Estimation
Have you ever planned a budget only to find you spent 1,200 after budgeting 1,000? Token estimation for AI is similar - underestimating leads to overflow, while overestimating only wastes a little space.
Claude Code chooses conservative estimation in every scenario:
Estimation Strategy Comparison
| Content Type | Estimation Strategy | Conservative Level | Reason |
|---|---|---|---|
| Plain text | 4 bytes/token | Medium | English text averages ~3.5-4.5 bytes/token |
| JSON content | 2 bytes/token | Highly conservative | Structural characters have low tokenization efficiency |
| Images/Documents | Fixed 2000 tokens | Highly conservative | Use fixed value when metadata unavailable |
| Cached tokens | From API usage | Exact | Only API-returned counts are authoritative |
The choice to estimate JSON at 2 bytes/token is particularly interesting. JSON structural characters ({}, [], "", :, ,) have far lower tokenization efficiency than natural language - 100 bytes of JSON might consume 40-50 tokens, while 100 bytes of English only needs 25-30. If using the universal 4 bytes/token estimate, JSON-dense tool results would be severely underestimated, potentially causing context overflow.
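A sketch of content-type-aware conservative estimation following the ratios in the table above; the JSON detection heuristic and the function names are assumptions for illustration:

```typescript
const BYTES_PER_TOKEN_TEXT = 4       // medium: English averages ~3.5-4.5 bytes/token
const BYTES_PER_TOKEN_JSON = 2       // highly conservative: structure tokenizes poorly
const FIXED_ATTACHMENT_TOKENS = 2000 // images/documents without usable metadata

// Crude heuristic standing in for real content-type detection.
function looksLikeJson(text: string): boolean {
  const t = text.trim()
  return t.startsWith("{") || t.startsWith("[")
}

function estimateTokens(content: string | { kind: "attachment" }): number {
  if (typeof content !== "string") return FIXED_ATTACHMENT_TOKENS
  const bytes = new TextEncoder().encode(content).length
  const bytesPerToken = looksLikeJson(content) ? BYTES_PER_TOKEN_JSON : BYTES_PER_TOKEN_TEXT
  // Round up: when in doubt, assume the content costs more.
  return Math.ceil(bytes / bytesPerToken)
}
```

Note that the same 16 bytes are estimated at twice the cost when they look like JSON - exactly the “biased toward safety” behavior the principle demands.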
Conservative Estimation’s Benefits Outweigh Costs
The worst case of overestimating token consumption: triggers compression slightly early, user waits a few more seconds.
The worst case of underestimating token consumption: prompt_too_long error, API call fails, emergency context dropping required, possibly losing critical information.
This is like buying insurance - better to pay slightly more premiums than to find coverage insufficient when disaster strikes.
Anti-pattern: The Illusion of Precise Counting
Trying to precisely calculate token counts on the client side. Only the API server’s tokenizer can give precise values - any client-side count is an estimate. Since it’s an estimate, it should be biased toward safety, not pursuing false “precision.”
How the Five Principles Relate
Set Budget for Everything
↓
Preserve Important Content ←┐
↓ │
Inform, Don't Hide │
↓ │
Circuit Breaker Loops ──────┤
↓ │
Conservative Estimation ────┘
Budget is the foundation: Define token limits for each content source.
Preservation and disclosure are execution: Decide what to restore after compression and ensure the model knows what was truncated.
Circuit breakers and estimation are safeguards: Prevent automated processes from exceeding budget, ensure budget isn’t bypassed through underestimation.
Implications for Building AI Agents
Pattern 1: Layered Token Budget
- Problem solved: Multiple content sources competing for limited context space
- Core approach: Set independent budget + total budget for each source, truncation cascade handles overflow
- Code template: Per-item limit (50K) → Aggregate limit (200K/message) → Global limit (context window - output reserve - buffer)
Pattern 2: Compression-Restoration Loop
- Problem solved: Compression loses critical context
- Core approach: Snapshot before compression → Compress → Selectively restore most recent/important content
- Prerequisites: Ability to track which content is “most recently used”
Pattern 3: Circuit Breaker
- Problem solved: Automated processes infinite loop under abnormal conditions
- Core approach: Stop after N consecutive failures, degrade to safe state
- Prerequisites: Defined failure criteria and post-degradation behavior
Practical: Auditing Your Agent
1. Measure Context Consumption
Measure how many tokens each content source consumes in real scenarios, find the biggest consumers. Is it tool results? File reads? Skill lists?
2. Set Size Limits for Tool Results
Ensure file reads, database queries, API call results have character/line limits. Don’t wait until prompt_too_long before remembering to limit.
3. Implement Post-Compression Restoration
If your Agent uses context compression, design a restoration strategy - let the post-compression model not need to start from scratch. Recently used files? Recently edited locations? Key configuration information?
4. Inform the Model When Truncating
Tell the model “this is truncated, full version is here” - far better than silent truncation where the model discovers missing information on its own.
5. Add Circuit Breakers
Set retry limits for any automated process that might loop. Better to degrade than to infinite loop.
Summary
Five principles of context management:
| Principle | Core Approach | Anti-pattern |
|---|---|---|
| Set Budget for Everything | Each source has token limits, multi-level protection | Unbounded content injection |
| Preserve Important Content | Selectively restore critical info after compression | Full compression or full preservation |
| Inform, Don’t Hide | Explicitly tell model when truncating/compressing | Silent truncation |
| Circuit Breaker Loops | Stop retrying after N consecutive failures | Infinite retry |
| Conservative Estimation | Better overestimate than underestimate tokens | Pursuing false precision |
This is like a professional organizer’s work:
- Budget: Every item has a designated storage spot, can’t be placed randomly
- Preserve: Pack seasonal clothes, but keep daily items in the handiest drawer
- Inform: Label storage boxes, know where things are
- Circuit breaker: Stop buying when storage is full, don’t stack to the ceiling
- Conservative estimation: When buying storage boxes, better slightly larger than too small to fit
Understanding context management enables you to:
- Maximize information density within limited context windows
- Avoid the awkwardness of “forgetting as the conversation goes on”
- Implement professional-grade context management in your own AI Agents
Next: Production-Grade AI Coding Patterns - Engineering Practices Learned from Claude Code.
