Have you ever wondered why Claude Code can remember what you said dozens of turns into a conversation, while regular AI chat seems to “forget” as you go along?

The secret lies in context management - it’s not just a question of “how many characters can be remembered.” It’s more like a professional organizer maximizing useful information density in limited closet space. Today, let’s talk about how Claude Code turns context management into a core capability.

Why Context Management is a Core Capability

Imagine your desk is 200 centimeters wide. Seems like enough space, right? But then you casually set down a coffee cup, a notebook, reference materials, scratch paper, snack wrappers… and soon there’s no room to work.

An AI’s context window (200K tokens) is like this desk. System prompts take 15-20K, each tool call returns 5-50K, read a few files and search some code, and you’ve already used up half the space. What’s worse is that expired tool results, redundant file contents, and resolved discussions pile up on the “desktop,” diluting the AI’s attention and causing response quality to decline.

The core proposition of context management: Maximize information density in limited space while ensuring critical content isn’t lost.

Claude Code solves this problem with five principles.

Principle 1: Set a Budget for Everything

Have you ever done a household budget? Rent can’t exceed 30% of income, food stays within 20%, entertainment gets at most 10%… if one category overspends, the others must cut back.

Claude Code’s context management is also strict “budgeting”:

Budget System Overview

| Content Source | Budget Limit | Engineering Rationale |
| --- | --- | --- |
| Single tool result | 50K characters | Prevents a single oversized result from filling the context |
| All tool results in one message | 200K characters | Prevents 10 parallel tools at 50K each from causing a 500K flood |
| Skill list description | 1% of the context window | Even 1,000 skills can’t dominate the conversation |
| Post-compact restored files | Max 5 files, 5K tokens per file, 50K total | Restoring too much means no compression happened |
| Post-compact restored skills | 5K per skill, 25K total | Keeps the skill system usable without bloat |

These numbers weren’t pulled out of thin air. Take the 1% budget for skill lists: as users install more and more skills, their descriptions could, without a limit, fill the entire context. Claude Code’s solution is a three-level truncation cascade - first truncate each description to 250 characters, then drop low-priority skills, and finally keep only built-in skill names. Even if you install 1,000 skills, the skill list will never exceed 1% of the context window.
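
In code, the cascade might look roughly like the sketch below. The `Skill` shape, the `priority` field, and the budget arithmetic are assumptions for illustration, not Claude Code’s actual implementation:

```typescript
interface Skill {
  name: string
  description: string
  builtIn: boolean
  priority: number // hypothetical ranking; higher = more important
}

// Hypothetical budget: 1% of a 200K-token window, at ~4 characters per token
const SKILL_LIST_BUDGET_CHARS = 200_000 * 0.01 * 4 // = 8,000 characters

function renderSkillList(skills: Skill[]): string {
  const render = (list: Skill[]) =>
    list.map(s => `${s.name}: ${s.description}`).join("\n")

  // Level 1: truncate every description to 250 characters
  const entries = skills.map(s => ({ ...s, description: s.description.slice(0, 250) }))
  let text = render(entries)
  if (text.length <= SKILL_LIST_BUDGET_CHARS) return text

  // Level 2: drop the lowest-priority skills until the list fits
  entries.sort((a, b) => b.priority - a.priority)
  while (entries.length > 0 && text.length > SKILL_LIST_BUDGET_CHARS) {
    entries.pop()
    text = render(entries)
  }
  if (entries.length > 0) return text

  // Level 3: fall back to built-in skill names only
  return skills.filter(s => s.builtIn).map(s => s.name).join("\n")
}
```

The key property: each level is strictly more aggressive than the last, so the output fits no matter how many skills are installed.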

The Wisdom of Double Protection

Notice the design behind the 50K and 200K numbers:

```typescript
// A single tool result cannot exceed 50K
export const DEFAULT_MAX_RESULT_SIZE_CHARS = 50_000

// But all tool results in one message cannot exceed 200K
export const MAX_TOOL_RESULTS_PER_MESSAGE_CHARS = 200_000
```

This is like: each drawer can hold a maximum of 50 items (reasonable), but if you open 10 drawers at the same time, the total cannot exceed 200 items (preventing “legal but dangerous” combinations).
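
A minimal sketch of enforcing both layers together - the constants match those above, while the helper itself is hypothetical:

```typescript
const DEFAULT_MAX_RESULT_SIZE_CHARS = 50_000
const MAX_TOOL_RESULTS_PER_MESSAGE_CHARS = 200_000

// Cap each individual result, then cap the running total for the whole message.
function capToolResults(results: string[]): string[] {
  let remaining = MAX_TOOL_RESULTS_PER_MESSAGE_CHARS
  return results.map(result => {
    const capped = result.slice(0, Math.min(DEFAULT_MAX_RESULT_SIZE_CHARS, remaining))
    remaining -= capped.length
    return capped
  })
}
```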

Anti-pattern: Unbounded Content Injection

Injecting tool results and file contents into the context without limits is like dumping an entire supermarket shopping cart onto your desk - it looks abundant, but there’s no room left to work.

Principle 2: Preserve Important Content

When moving houses, what do you do? Pack everything and unpack it all at the new place? That’s foolish. The smart approach is: pack most things in boxes, but keep the most frequently used items (toiletries, change of clothes) in a carry-on bag so you can use them on the first day at the new place.

Claude Code’s compression-restoration mechanism follows this exact thinking.

Compression Loses Details

Auto-compact summarizes an entire conversation into a few sentences, freeing up context space. But compression has a side effect: specific code content, file paths, and exact line number references are all lost. If the model completely forgets what files it read after compression, it has to read them again - wasting tool calls and waiting time.

Selective Restoration Strategy

```typescript
// Restore the 5 most recent files after compression
export const POST_COMPACT_MAX_FILES_TO_RESTORE = 5
```

The elegance of the restoration strategy:

  1. Snapshot before compression: Use cacheToObject() to record current state
  2. Execute compression: Conversation becomes summary
  3. Selective restoration:
    • 5 most recent files (not all)
    • Max 5K tokens per file (not full content)
    • Total does not exceed 50K tokens
    • Already-sent skills are not re-injected (saves 4K tokens)

This is like the “carry-on bag” when moving - only bring the most urgently needed items, not carrying the entire house on your back.

Key insight: Restoring too much means no compression happened; restoring too little means over-compression. Five files, 5K tokens, 50K total - these are the “sweet spots” validated through engineering.
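
Putting those constraints together, the selection logic might look like this sketch (the `FileSnapshot` shape, the recency field, and the 4-characters-per-token estimator are assumptions for illustration):

```typescript
const POST_COMPACT_MAX_FILES_TO_RESTORE = 5
const MAX_TOKENS_PER_RESTORED_FILE = 5_000
const MAX_TOTAL_RESTORED_TOKENS = 50_000

interface FileSnapshot {
  path: string
  content: string
  lastAccessed: number // hypothetical recency timestamp (ms)
}

// Rough, conservative estimate: ~4 characters per token for plain text
const estimateTokens = (text: string) => Math.ceil(text.length / 4)

function selectFilesToRestore(snapshot: FileSnapshot[]): FileSnapshot[] {
  const restored: FileSnapshot[] = []
  let totalTokens = 0
  // Most recently accessed files first, capped at 5
  const recent = [...snapshot]
    .sort((a, b) => b.lastAccessed - a.lastAccessed)
    .slice(0, POST_COMPACT_MAX_FILES_TO_RESTORE)
  for (const file of recent) {
    // Trim each file to its per-file budget before checking the total
    const content = file.content.slice(0, MAX_TOKENS_PER_RESTORED_FILE * 4)
    const tokens = estimateTokens(content)
    if (totalTokens + tokens > MAX_TOTAL_RESTORED_TOKENS) break
    restored.push({ ...file, content })
    totalTokens += tokens
  }
  return restored
}
```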

Anti-pattern: Full Compression or Full Preservation

Either restore nothing (forcing the model to start from scratch) or try to preserve everything (compression effect is zero). Like when moving: either pack everything (can’t live the first day) or pack nothing (the moving truck can’t hold it all).

Principle 3: Inform, Don’t Hide

Have you ever experienced this? Someone hands you a “condensed version” of some materials without telling you what was cut. You make decisions based on it, never realizing that the critical information was already removed.

AI encounters the same problem. If context is truncated or compressed without the model knowing, it may make wrong decisions based on incomplete information, or even “fabricate” content it can’t remember - because it doesn’t know its information is incomplete.

Four Implementations of Transparent Truncation

1. Tool Result Truncation Notification

When a tool result exceeds 50K characters, Claude Code doesn’t silently truncate. Instead:

  • Full result written to disk
  • Model receives preview message with “this is truncated, full version at xxx path”
  • Model knows: (1) what it sees isn’t complete, (2) how to get the full version
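
A hedged sketch of the idea - the file path scheme and the message wording here are illustrative, not Claude Code’s actual behavior:

```typescript
import { writeFileSync } from "node:fs"
import { tmpdir } from "node:os"
import { join } from "node:path"

const DEFAULT_MAX_RESULT_SIZE_CHARS = 50_000

function truncateWithNotice(result: string, toolName: string): string {
  if (result.length <= DEFAULT_MAX_RESULT_SIZE_CHARS) return result
  // Persist the full result so the model can fetch it later if needed
  const fullPath = join(tmpdir(), `${toolName}-${Date.now()}.txt`)
  writeFileSync(fullPath, result)
  // Return a preview plus an explicit notice: what happened, and where the rest is
  const preview = result.slice(0, DEFAULT_MAX_RESULT_SIZE_CHARS)
  return (
    preview +
    `\n\n[Truncated: showing the first ${DEFAULT_MAX_RESULT_SIZE_CHARS} of ` +
    `${result.length} characters. Full output saved to ${fullPath}]`
  )
}
```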

2. Cache Micro-compact Notification

When old tool results are cleaned up, notifyCacheDeletion() explicitly tells the model “some old tool results have been deleted.” This prevents the model from referencing content that no longer exists.

3. File Read Pagination

FileReadTool reads 2000 lines by default, with pagination support via offset/limit parameters. The tool description explicitly states this behavior - the model knows it only sees the first 2000 lines by default and can specify offset for later content.
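
A minimal sketch of that pagination contract - the general idea, not FileReadTool’s actual signature:

```typescript
import { readFileSync } from "node:fs"

const DEFAULT_READ_LIMIT_LINES = 2_000

// offset is 0-based; the trailing note tells the model more content exists
function readFilePage(path: string, offset = 0, limit = DEFAULT_READ_LIMIT_LINES): string {
  const lines = readFileSync(path, "utf8").split("\n")
  const page = lines.slice(offset, offset + limit).join("\n")
  if (offset + limit < lines.length) {
    return `${page}\n[Showing lines ${offset + 1}-${offset + limit} of ` +
      `${lines.length}; pass a larger offset to read further]`
  }
  return page
}
```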

4. Explicit Declarations in Compression Summary

The compression prompt requires the summary to include “where we are in the task” and “what still needs to be done” - ensuring the post-compression model knows exactly what stage of the task it’s at.

This is like labeling storage boxes when organizing: “this has reference materials, available to retrieve if needed” - instead of just throwing everything in the basement pretending it doesn’t exist.

Anti-pattern: Silent Truncation

Truncating tool results or deleting context content without the model’s knowledge means the model might reference content that no longer exists and make wrong decisions. It’s like a friend throwing away your report without telling you: you keep asking “what about the data on page three of the report?” while your friend has no idea what you’re talking about.

Principle 4: Circuit Breaker for Runaway Loops

Have you ever seen a vending machine swallow your money without dispensing anything, while you frantically press the refund button and nothing happens? If this were an AI Agent, it might retry the “refund operation” forever - failing each time but trying again until the end of time.

In AI systems, infinite retries are especially dangerous - each retry costs money (API calls), and the failure reason is often systemic (like context being too large to compress), so retries won’t change the outcome.

Claude Code’s Circuit Breaker Design

```typescript
// Stop trying after 3 consecutive failures
// BQ data: 1,279 sessions had 50+ consecutive failures, up to 3,272
// Wasting ~250K API calls/day
const MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3
```

The number 3 comes from real data: BigQuery showed 1,279 sessions experiencing 50+ consecutive compact failures, with one session reaching 3,272, collectively wasting approximately 250,000 API calls per day. The circuit breaker turns these infinite loops into “stop after 3 failures.”
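
The breaker itself is a small state machine. Here is a sketch, assuming a compaction function that throws on failure; the wrapper and its state handling are mine, not Claude Code’s code:

```typescript
const MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3

let consecutiveFailures = 0
let breakerTripped = false

// Returns false when the breaker is open and compaction was skipped
async function tryAutoCompact(compact: () => Promise<void>): Promise<boolean> {
  if (breakerTripped) return false // degraded state: no more attempts this session
  try {
    await compact()
    consecutiveFailures = 0 // any success resets the counter
  } catch {
    consecutiveFailures += 1
    if (consecutiveFailures >= MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES) {
      breakerTripped = true // trip: stop compacting until the session ends
    }
  }
  return true
}
```

Note that any success resets the counter: the breaker trips only on consecutive failures, which usually point to a systemic cause rather than a transient one.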

Circuit Breaker Family

| Subsystem | Circuit Condition | Circuit Behavior |
| --- | --- | --- |
| Auto-compact | 3 consecutive failures | Stop compacting until the session ends |
| YOLO Classifier | 3 consecutive / 20 total denials | Fall back to manual user confirmation |
| max_output_tokens recovery | Max 3 retries | Stop retrying, accept truncated output |
| Prompt-too-long handling | Still exceeds after dropping oldest turns | Drop 20%, don’t drop infinitely |

Each circuit breaker follows the same pattern: set a reasonable retry limit, and when exceeded, degrade to a safe but limited-functionality state, rather than crashing or infinite looping.

Anti-pattern: Infinite Retry

“Compression failed? Try again. Still failed? Change parameters and try again.” In AI Agents, this is like frantically pressing the refund button on a vending machine that swallowed your money - the machine is broken, pressing harder won’t help, and it wastes your time.

Principle 5: Conservative Estimation

Have you ever planned a budget and found that you planned to spend 1,000 but actually spent 1,200? Token estimation for AI is similar - underestimating leads to overflow, while overestimating only wastes a little space.

Claude Code chooses conservative estimation in every scenario:

Estimation Strategy Comparison

| Content Type | Estimation Strategy | Conservative Level | Reason |
| --- | --- | --- | --- |
| Plain text | 4 bytes/token | Medium | English actually runs ~3.5-4.5 bytes/token |
| JSON content | 2 bytes/token | Highly conservative | Structural characters tokenize inefficiently |
| Images/Documents | Fixed 2,000 tokens | Highly conservative | A fixed value is used when metadata is unavailable |
| Cached tokens | From API usage | Exact | Only API-returned counts are authoritative |

The choice to estimate JSON at 2 bytes/token is particularly interesting. JSON structural characters ({}, [], "", :, ,) have far lower tokenization efficiency than natural language - 100 bytes of JSON might consume 40-50 tokens, while 100 bytes of English only needs 25-30. If using the universal 4 bytes/token estimate, JSON-dense tool results would be severely underestimated, potentially causing context overflow.
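
Those heuristics translate into a tiny estimator. A sketch, where the function and its `kind` parameter are assumptions and only the ratios come from the table above:

```typescript
// Conservative bytes-per-token ratios, one per content type
const BYTES_PER_TOKEN_TEXT = 4 // English averages roughly 3.5-4.5
const BYTES_PER_TOKEN_JSON = 2 // structural characters tokenize poorly
const FIXED_TOKENS_PER_ATTACHMENT = 2_000 // used when metadata is unavailable

type ContentKind = "text" | "json" | "attachment"

function estimateTokens(content: string, kind: ContentKind): number {
  const bytes = Buffer.byteLength(content, "utf8")
  switch (kind) {
    case "attachment":
      return FIXED_TOKENS_PER_ATTACHMENT
    case "json":
      return Math.ceil(bytes / BYTES_PER_TOKEN_JSON)
    case "text":
      return Math.ceil(bytes / BYTES_PER_TOKEN_TEXT)
  }
}
```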

Conservative Estimation’s Benefits Outweigh Costs

The worst case of overestimating token consumption: triggers compression slightly early, user waits a few more seconds.

The worst case of underestimating token consumption: prompt_too_long error, API call fails, emergency context dropping required, possibly losing critical information.

This is like buying insurance - better to pay slightly more premiums than to find coverage insufficient when disaster strikes.

Anti-pattern: The Illusion of Precise Counting

Trying to precisely calculate token counts on the client side. Only the API server’s tokenizer can give precise values - any client-side count is an estimate. Since it’s an estimate, it should be biased toward safety, not pursuing false “precision.”

How the Five Principles Relate

```
    Set Budget for Everything
               ↓
    Preserve Important Content ←┐
               ↓                │
    Inform, Don't Hide          │
               ↓                │
    Circuit Breaker Loops ──────┤
               ↓                │
    Conservative Estimation ────┘
```

Budget is the foundation: Define token limits for each content source.

Preservation and disclosure are execution: Decide what to restore after compression and ensure the model knows what was truncated.

Circuit breakers and estimation are safeguards: Prevent automated processes from exceeding budget, ensure budget isn’t bypassed through underestimation.

Implications for Building AI Agents

Pattern 1: Layered Token Budget

  • Problem solved: Multiple content sources competing for limited context space
  • Core approach: Set independent budget + total budget for each source, truncation cascade handles overflow
  • Code template: Per-item limit (50K) → Aggregate limit (200K/message) → Global limit (context window - output reserve - buffer); see the sketch below
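
A sketch of the global layer of that template; the reserve and buffer sizes here are assumed for illustration, not Claude Code’s values:

```typescript
const CONTEXT_WINDOW_TOKENS = 200_000
const OUTPUT_RESERVE_TOKENS = 8_000 // assumed reserve for the model's reply
const SAFETY_BUFFER_TOKENS = 4_000 // assumed cushion for estimation error

// Everything injected into the prompt must fit under this ceiling
const GLOBAL_INPUT_BUDGET_TOKENS =
  CONTEXT_WINDOW_TOKENS - OUTPUT_RESERVE_TOKENS - SAFETY_BUFFER_TOKENS

function fitsGlobalBudget(estimatedInputTokens: number): boolean {
  return estimatedInputTokens <= GLOBAL_INPUT_BUDGET_TOKENS
}
```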

Pattern 2: Compression-Restoration Loop

  • Problem solved: Compression loses critical context
  • Core approach: Snapshot before compression → Compress → Selectively restore most recent/important content
  • Prerequisites: Ability to track which content is “most recently used”

Pattern 3: Circuit Breaker

  • Problem solved: Automated processes infinite loop under abnormal conditions
  • Core approach: Stop after N consecutive failures, degrade to safe state
  • Prerequisites: Defined failure criteria and post-degradation behavior

Practical: Auditing Your Agent

1. Measure Context Consumption

Measure how many tokens each content source consumes in real scenarios, find the biggest consumers. Is it tool results? File reads? Skill lists?

2. Set Size Limits for Tool Results

Ensure file reads, database queries, API call results have character/line limits. Don’t wait until prompt_too_long before remembering to limit.

3. Implement Post-Compression Restoration

If your Agent uses context compression, design a restoration strategy - let the post-compression model not need to start from scratch. Recently used files? Recently edited locations? Key configuration information?

4. Inform the Model When Truncating

Tell the model “this is truncated, full version is here” - far better than silent truncation where the model discovers missing information on its own.

5. Add Circuit Breakers

Set retry limits for any automated process that might loop. Better to degrade than to infinite loop.

Summary

Five principles of context management:

| Principle | Core Approach | Anti-pattern |
| --- | --- | --- |
| Set Budget for Everything | Each source has token limits, with multi-level protection | Unbounded content injection |
| Preserve Important Content | Selectively restore critical info after compression | Full compression or full preservation |
| Inform, Don’t Hide | Explicitly tell the model when truncating/compressing | Silent truncation |
| Circuit Breaker Loops | Stop retrying after N consecutive failures | Infinite retry |
| Conservative Estimation | Better to overestimate than underestimate tokens | Pursuing false precision |

This is like a professional organizer’s work:

  • Budget: Every item has a designated storage spot, can’t be placed randomly
  • Preserve: Pack seasonal clothes, but keep daily items in the handiest drawer
  • Inform: Label storage boxes, know where things are
  • Circuit breaker: Stop buying when storage is full, don’t stack to the ceiling
  • Conservative estimation: When buying storage boxes, better slightly larger than too small to fit

Understanding context management enables you to:

  • Maximize information density within limited context windows
  • Avoid the awkwardness of “forgetting as the conversation goes on”
  • Implement professional-grade context management in your own AI Agents

Next: Production-Grade AI Coding Patterns - Engineering Practices Learned from Claude Code.