Context compression is a set of techniques that reduce the token count of prompts while preserving semantic content, enabling more information to fit within a model's fixed context window. It addresses the fundamental constraint that context windows are finite while information needs grow unbounded.
Compression approaches span a spectrum from lossless to lossy. Lossless techniques include removing whitespace, abbreviating repetitive structures, and using concise formatting. Lossy approaches include LLM-powered summarization, extractive compression (keeping only relevant sentences), and learned soft-prompt compression where a smaller model distills documents into compact token sequences that a larger model can decode.
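The snippet below sketches both ends of that spectrum: a lossless JSON minifier and a naive extractive compressor that keeps the sentences sharing the most terms with a query. The helper names (`minify_json`, `extract_relevant`) and the term-overlap scoring are illustrative choices, not a reference implementation.

```python
import json
import re

def minify_json(raw: str) -> str:
    """Lossless: re-serialize JSON with all insignificant whitespace removed."""
    return json.dumps(json.loads(raw), separators=(",", ":"))

def extract_relevant(text: str, query: str, keep: int = 3) -> str:
    """Lossy extractive compression: keep the `keep` sentences that share
    the most terms with the query, preserving their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    terms = set(query.lower().split())
    top = sorted(
        range(len(sentences)),
        key=lambda i: len(terms & set(sentences[i].lower().split())),
        reverse=True,
    )[:keep]
    return " ".join(sentences[i] for i in sorted(top))
```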
For agent systems, context compression is particularly critical. Multi-turn conversations accumulate thousands of tokens of history, and tool outputs (API responses, file contents, search results) are often 90% irrelevant to the task at hand. Without compression, agents hit context limits within 5-10 turns. Production systems apply tiered compression: keep recent turns verbatim, summarize older turns, and extract only relevant fragments from tool outputs.
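A minimal sketch of that tiered scheme follows, assuming a `summarize()` function that stands in for an LLM summarization call; the six-turn verbatim window is an arbitrary illustrative boundary.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str
    content: str

def summarize(text: str) -> str:
    """Placeholder for an LLM summarization call (an assumption, not a real API)."""
    return text[:200] + "..." if len(text) > 200 else text

def tiered_compress(history: list[Turn], verbatim_turns: int = 6) -> list[Turn]:
    """Keep the most recent turns verbatim; collapse everything older
    into a single summary turn prepended to the history."""
    if len(history) <= verbatim_turns:
        return history
    older, recent = history[:-verbatim_turns], history[-verbatim_turns:]
    digest = summarize("\n".join(f"{t.role}: {t.content}" for t in older))
    return [Turn("system", f"Summary of earlier conversation: {digest}")] + recent
```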
Context compression directly extends the effective capability of any fixed-context model. An agent with a 128K-token window and 4x compression can represent as much source material as a 512K window would hold uncompressed. This means longer conversations, more tool outputs, and larger codebases can be processed without upgrading to more expensive long-context models.
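The arithmetic is simple; a throwaway helper makes the claim concrete, using the numbers from the paragraph above.

```python
def effective_context(window_tokens: int, compression_ratio: float) -> int:
    """Tokens of source material representable after compression."""
    return int(window_tokens * compression_ratio)

print(effective_context(128_000, 4.0))  # 512000 tokens of pre-compression content
```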
A code review agent compresses file contents before adding them to context: instead of including entire 2,000-line files, it extracts only the changed functions plus their immediate dependencies (typically 200-400 lines). This allows reviewing a 50-file pull request within a 128K context window that would otherwise overflow after 6-7 full files.
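One way to implement that extraction for Python sources is shown below, using the standard-library `ast` module. The `changed_lines` set is assumed to come from a diff parser (not shown), and dependency tracing is omitted for brevity; this keeps only the top-level definitions that overlap the diff.

```python
import ast

def extract_changed_definitions(source: str, changed_lines: set[int]) -> str:
    """Keep only the top-level functions/classes whose line spans
    overlap the changed lines from a diff (diff parsing not shown)."""
    tree = ast.parse(source)
    kept = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            span = set(range(node.lineno, node.end_lineno + 1))
            if changed_lines & span:
                kept.append(ast.get_source_segment(source, node))
    return "\n\n".join(s for s in kept if s)
```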
Aaron is an engineering leader, software architect, and founder with 18 years building distributed systems and cloud infrastructure. Now focused on LLM-powered platforms, agent orchestration, and production AI. He shares hands-on technical guides and framework comparisons at fp8.co.