TL;DR: Context engineering is the discipline of dynamically managing what information an AI agent sees at each step of execution. Unlike prompt engineering, which focuses on crafting a single instruction, context engineering orchestrates the entire information environment across multi-step agent workflows. This article distills six production-tested principles -- KV cache-aware design, tool masking, filesystem-as-context, attention manipulation through recitation, error preservation, and few-shot discipline -- into a practical framework for building reliable, cost-effective AI agents.
The term "prompt engineering" dominated AI discourse through 2024 and into 2025. But as AI systems evolved from single-turn chatbots into autonomous agents executing dozens of sequential tool calls, the limitations of prompt-centric thinking became apparent. A well-crafted prompt matters little if the agent's 35th step drowns in irrelevant context, if cache misses inflate costs by an order of magnitude, or if the model forgets its objective halfway through a complex task.
Context engineering addresses these challenges directly. It is the practice of designing, managing, and optimizing the entire information environment that surrounds an LLM at each point in an agentic workflow. Where prompt engineering asks "what should I say to the model?", context engineering asks "what should the model see, and when should it see it?"
The Manus team -- builders of one of the most capable general-purpose AI agents in production -- recently shared six hard-won lessons from their engineering experience. These lessons, independently validated by production systems at other organizations, form a practical framework that any team building AI agents should internalize. This article examines each principle in depth, adds original analysis and code examples, and connects the lessons to broader developments in agent architecture.
The KV (key-value) cache is a fundamental optimization in transformer-based inference. During autoregressive generation, the model computes attention over all preceding tokens. Without caching, each new token would require recomputing attention keys and values for every prior token -- an O(n^2) operation per step. The KV cache stores these intermediate computations, allowing the model to process only the new token against cached representations.
For AI agents, the KV cache has profound economic and performance implications. Consider a typical agent session: the system prompt, tool definitions, and conversation history form a growing prefix that is largely identical between steps. If the cache hits on this prefix, only the new observation needs fresh computation. If it misses, the entire context must be reprocessed.
The cost difference is substantial. With Claude Sonnet, cached input tokens cost $0.30 per million tokens, while uncached tokens cost $3.00 per million -- a 10x difference. For an agent that averages a 100:1 input-to-output token ratio (as Manus reports), cache optimization is not a nice-to-have; it is the single largest lever for cost reduction.
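A back-of-the-envelope model makes the stakes concrete. The rates below are the ones quoted above; the token counts are hypothetical, chosen only to illustrate the arithmetic:

```python
def step_cost(prefix_tokens, new_tokens, cache_hit,
              uncached_rate=3.00, cached_rate=0.30):
    """Input-side cost in dollars of one agent step, given per-million
    token rates for cached vs. uncached tokens."""
    if cache_hit:
        # Only the new suffix is processed fresh; the prefix bills at the cached rate.
        billable_uncached = new_tokens
        billable_cached = prefix_tokens
    else:
        # A cache miss reprocesses everything at the full rate.
        billable_uncached = prefix_tokens + new_tokens
        billable_cached = 0
    return (billable_uncached * uncached_rate
            + billable_cached * cached_rate) / 1_000_000

# A 50,000-token prefix plus a 500-token observation:
hit = step_cost(50_000, 500, cache_hit=True)
miss = step_cost(50_000, 500, cache_hit=False)
print(f"cache hit: ${hit:.4f}  cache miss: ${miss:.4f}")
```

With these illustrative numbers, a single cache miss costs roughly nine times as much as a hit -- and that penalty recurs on every subsequent step until the prefix stabilizes again.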
The KV cache is prefix-sensitive. Because transformers process tokens autoregressively, a single token change at position N invalidates the cache for every token from position N onward. Common cache-breaking patterns include:

- A timestamp in the system prompt -- especially one with second-level precision, which guarantees a different prefix on every single request
- Non-deterministic serialization, such as JSON libraries that do not guarantee stable key ordering when emitting tool definitions
- Rewriting, compacting, or reordering earlier conversation history instead of treating it as strictly append-only
The following pseudocode illustrates a cache-aware context assembly pattern:
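This sketch uses hypothetical helper names; the point is the layering, not the specific API:

```python
import json

SYSTEM_PROMPT = "You are a coding agent."                  # stable: never changes
TOOL_SCHEMA = [{"name": "shell_exec"}, {"name": "file_read"}]

def serialize_tools(tools):
    # sort_keys=True yields deterministic JSON; unstable key order
    # is a classic silent cache-breaker.
    return json.dumps(tools, sort_keys=True)

def assemble_context(history, new_observation):
    """Build the messages for one step. History is append-only, so
    everything before the final message is a stable, cacheable prefix."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},                 # layer 1: static
        {"role": "system", "content": serialize_tools(TOOL_SCHEMA)},  # layer 2: static
        *history,                                                     # layer 3: append-only
        {"role": "user", "content": new_observation},                 # layer 4: volatile tail
    ]
```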
The key insight is structural: arrange context in layers ordered from most stable to most volatile. The system prompt and tool schema, which rarely change, form the prefix. Conversation history is append-only, never rewritten. Only the newest observation -- the most volatile element -- sits at the end where cache invalidation has minimal impact.
For teams running self-hosted inference with frameworks like vLLM, session affinity adds another dimension. Routing requests from the same agent session to the same worker ensures the GPU's KV cache remains warm. Without session affinity, a load balancer might route consecutive steps to different GPUs, each computing the full context from scratch.
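A minimal affinity router can be sketched as a hash over the session ID (worker names here are hypothetical; production systems typically use consistent hashing so that adding or removing workers does not reshuffle every session):

```python
import hashlib

WORKERS = ["worker-0", "worker-1", "worker-2"]

def route(session_id: str) -> str:
    """Map every request from one agent session to the same worker,
    so that worker's GPU keeps a warm KV cache for the session prefix."""
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return WORKERS[int(digest, 16) % len(WORKERS)]
```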
As agent capabilities grow, so does the temptation to dynamically manage the tool set. A coding agent might have access to 40 tools spanning file operations, web browsing, terminal commands, and database queries. When the agent enters a phase focused purely on code editing, it seems logical to remove irrelevant tools like browser_navigate or database_query to reduce confusion and context length.
This intuition is wrong for two reasons.
First, removal breaks the KV cache. Tool definitions typically appear near the beginning of the context (in the system prompt or as structured tool schemas). Removing a tool changes this prefix, invalidating the cache for everything that follows -- including the entire conversation history. The cost of cache invalidation almost always exceeds the savings from shorter context.
Second, removal creates dangling references. If the conversation history contains a previous step where the agent called browser_navigate, and that tool no longer exists in the schema, the model encounters a contradiction. It sees evidence of a tool call that the current schema says is impossible. This can cause hallucination, schema violations, or confused reasoning.
The solution is to keep all tool definitions in the schema but constrain which tools the agent can actually invoke at each step. This can be implemented at the API level or through logit manipulation:
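The sketch below shows one way to do this, assuming a prefix-based naming convention; the tool names and phase labels are illustrative:

```python
# All tools stay in the schema at all times -- the cache-stable prefix
# never changes. Only the set of *invokable* tools varies per phase.
ALL_TOOLS = ["browser_navigate", "browser_click",
             "shell_exec", "file_read", "file_write", "db_query"]

PHASE_PREFIXES = {
    "research": ("browser_", "file_"),
    "coding":   ("shell_", "file_"),
    "analysis": ("db_", "file_"),
}

def allowed_tools(phase: str) -> list:
    """Tools the agent may invoke in this phase (e.g. via tool_choice)."""
    return [t for t in ALL_TOOLS if t.startswith(PHASE_PREFIXES[phase])]

def logit_mask(phase: str) -> dict:
    """Bias map for logit manipulation: 0.0 leaves a tool reachable,
    -inf masks its name tokens out of the decoder's choices."""
    allowed = set(allowed_tools(phase))
    return {t: (0.0 if t in allowed else float("-inf")) for t in ALL_TOOLS}
```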
Many inference APIs now support this natively. Anthropic's API, for example, offers tool_choice modes: auto (the model decides freely), any (it must call some tool), and tool (it must call a specific named tool). OpenAI's tool_choice parameter provides similar semantics, including a required mode and a named-function option. Combined with consistent naming conventions -- prefixing related tools like browser_, shell_, db_ -- masking becomes straightforward to implement without custom logit processors.
Modern LLMs support context windows of 128K tokens or more. This capacity is generous but not unlimited, and three practical constraints make it insufficient for agentic workloads: observations such as web pages, PDFs, and log files can be enormous, easily overwhelming any window; model performance measurably degrades long before the window is full; and even with prefix caching, long inputs still cost real money and latency on every step.
The filesystem offers a solution: treat it as an external memory that the agent can read from and write to, keeping only the most relevant information in the active context window.
The critical principle is that compression must be reversible. When summarizing a web page to fit in context, preserve the URL. When condensing a log file, preserve the file path. When extracting key findings from a document, note the page numbers. This ensures the agent can always retrieve the full information if needed:
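A minimal sketch of reversible compression, assuming a simple truncation stand-in (a real system would call a summarizer model, but the invariant is the same -- the reference survives):

```python
def compress_observation(obs: dict, max_chars: int = 500) -> dict:
    """Shrink an observation for the context window while keeping the
    reference (URL or file path) needed to re-fetch the full content."""
    compressed = dict(obs)
    if len(obs.get("content", "")) > max_chars:
        compressed["content"] = obs["content"][:max_chars] + " ...[truncated]"
        compressed["full_content_at"] = obs["source"]  # reversibility anchor
    return compressed

page = {"source": "https://example.com/docs", "content": "x" * 10_000}
small = compress_observation(page)
```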
A more powerful application of filesystem-as-context is the agent workspace -- a dedicated directory where the agent maintains structured notes, intermediate results, and task state:
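One possible layout, sketched below -- the directory names and state file are illustrative choices, not a prescribed standard (the todo.md convention is discussed further on):

```python
from pathlib import Path
import tempfile

def init_workspace(root: Path) -> Path:
    """Create a structured scratch area the agent reads and writes
    instead of holding everything in the context window."""
    for sub in ("notes", "results", "sources"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    # Task plan the agent rewrites as it progresses.
    (root / "todo.md").write_text("# Task plan\n\n- [ ] (pending)\n")
    # Machine-readable state that survives context truncation.
    (root / "state.json").write_text('{"step": 0}\n')
    return root

workspace = init_workspace(Path(tempfile.mkdtemp()) / "agent")
```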
This pattern mirrors how human knowledge workers operate. A researcher does not keep every paper they have ever read in working memory. They take notes, organize findings, and refer back to source material as needed. The filesystem gives AI agents the same capability.
This approach also connects to a broader trend in agent architecture. DeepMind's "Era of Experience" paper envisions AI systems that learn continuously from streams of experience rather than static datasets. Externalizing state to the filesystem is a step in this direction -- it allows agents to persist knowledge across context windows and even across sessions. The paper's emphasis on agents that generate, collect, and learn from their own experience parallels the filesystem-as-memory pattern: both treat the agent's accumulated experience as a first-class resource rather than ephemeral computation.
Some researchers have speculated that State Space Models (SSMs) like Mamba, with their linear-time sequence processing and natural affinity for streaming data, could further enhance this pattern. An SSM-based agent that externalizes memory to files might combine the efficiency of linear attention with the persistence of filesystem storage -- an architectural direction reminiscent of Neural Turing Machines, updated for the modern agent paradigm.
Transformer attention is not uniform. Research on the positional properties of attention has consistently shown that tokens at the beginning and end of the context receive disproportionate attention weight, while tokens in the middle -- the so-called "lost in the middle" zone -- receive less. For a chatbot answering a single question, this is a minor concern. For an agent executing 50 sequential tool calls, it is a fundamental challenge.
Consider an agent tasked with refactoring a large codebase to replace one logging library with another. By step 30, the agent's context contains hundreds of tool calls, file reads, and edit operations. The original task description -- "replace all uses of log4j with the structured logging library" -- sits at the very beginning of the context, buried under thousands of tokens of operational history. The agent begins to drift, making edits that are locally sensible but globally inconsistent with the objective.
The solution is deceptively simple: periodically instruct the agent to restate its current objectives. This pushes the task plan into the most recent portion of the context, where attention is strongest:
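A minimal recitation loop might look like this (the cadence and message format are illustrative):

```python
RECITE_EVERY = 5  # restate objectives every N steps

def maybe_recite(step: int, objective: str, remaining: list,
                 messages: list) -> list:
    """Every few steps, append a restatement of the objective and the
    remaining plan to the *end* of the context, where attention is
    strongest. Returns a new message list; the input is not mutated."""
    if step > 0 and step % RECITE_EVERY == 0:
        recitation = (
            f"Reminder -- objective: {objective}\n"
            "Remaining steps:\n"
            + "\n".join(f"- {item}" for item in remaining)
        )
        return messages + [{"role": "user", "content": recitation}]
    return messages

msgs = maybe_recite(10, "replace log4j with the structured logging library",
                    ["migrate module B", "run tests"], [])
```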
The Manus team implements this by maintaining a todo.md file that the agent rewrites at each step. This combines the filesystem-as-context pattern (Section 3) with attention manipulation: the act of writing and reading the todo file forces key objectives into the most recent context, ensuring they receive strong attention weight during the next planning step.
This technique works because it exploits a fundamental property of the transformer architecture rather than fighting against it. Instead of attempting to modify how the model attends to different positions (which would require architectural changes), recitation moves important information to positions where it naturally receives strong attention.
When a naive agent encounters an error -- a failed API call, a syntax error in generated code, a permission denied exception -- the simplest response is to clear the failed step from history and retry. This approach seems clean: why pollute the context with failed attempts?
The answer is that errors are information, and removing them removes the model's ability to learn within the session.
When a model sees that a particular approach failed, it implicitly updates its internal representations to assign lower probability to similar approaches. A stack trace from a failed database query tells the model about schema constraints. A permission denied error reveals access control boundaries. A timeout error suggests the need for a different strategy entirely.
Consider an agent attempting to install a software dependency:
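A hypothetical trace (step numbers, package names, and error messages are all illustrative) shows why the failed steps matter:

```
Step 12: shell_exec("pip install torch-vision")
  -> ERROR: No matching distribution found for torch-vision

Step 13: shell_exec("pip install torchvision==0.99.0")
  -> ERROR: Could not find a version that satisfies torchvision==0.99.0

Step 14: shell_exec("pip install torchvision")
  -> Successfully installed torchvision
```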
If steps 12 and 13 were erased, the agent (or a future agent in the same session) might repeat those exact failed attempts. With the errors preserved, the model has evidence that the first two approaches do not work and can reason about why the third succeeded.
The principle extends beyond simple retries. Error traces form a kind of in-context learning signal. Academic benchmarks tend to measure agent performance under ideal conditions -- clean APIs, correct documentation, stable environments. Production systems face flaky networks, outdated documentation, ambiguous error messages, and race conditions. An agent that preserves and learns from its own failures handles these realities far better than one that starts fresh after each mistake.
For maximum utility, structure error information to help the model extract patterns:
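One possible record format, sketched with illustrative field names -- the goal is a consistent shape the model can pattern-match across failures:

```python
def format_error(tool: str, args: dict, error: str, diagnosis: str) -> str:
    """Render a failed tool call as a structured, scannable record
    rather than a raw wall of stack-trace text."""
    return (
        f"[FAILED] {tool}({args})\n"
        f"  error: {error}\n"
        f"  diagnosis: {diagnosis}\n"
        f"  constraint learned: do not retry this exact approach"
    )

record = format_error(
    "db_query",
    {"sql": "SELECT * FROM users"},
    "permission denied for table users",
    "agent role lacks SELECT on users; query the reporting view instead",
)
```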
Few-shot prompting -- providing examples of desired input-output behavior within the context -- has been a cornerstone of prompt engineering since GPT-3. For single-turn tasks, it remains highly effective. For agentic workflows, it introduces a subtle but serious risk: behavioral rigidity.
Language models are exceptional pattern matchers. When the context contains three examples of how to handle a task, the model will closely mimic those examples -- even when the current situation demands a different approach. The Manus team observed this in resume review tasks: after seeing a few examples of structured analysis, the agent fell into a rhythm, producing nearly identical analyses regardless of significant differences between resumes.
The problem compounds over time. In a multi-step workflow, the agent's own previous actions become implicit few-shot examples for future steps. If the first three files are edited with a particular pattern, the model treats that pattern as a strong prior for editing the fourth file -- even if the fourth file requires fundamentally different handling.
This creates a brittleness that is difficult to diagnose. The agent appears to work well on cases similar to the examples and fails silently on edge cases, producing outputs that look structurally correct but are substantively wrong.
Modern LLMs, particularly those trained with RLHF and instruction tuning, are remarkably capable of following clear natural language instructions without examples. Consider the contrast:
Few-shot approach (rigid, cache-expensive):
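A hypothetical version of the resume-review prompt described above -- three full worked examples pasted into every request, which is long, cache-expensive, and invites the model to clone the examples' structure verbatim:

```python
# Illustrative few-shot prompt; resume snippets and verdicts are invented.
FEW_SHOT_PROMPT = """Review each resume and produce an assessment.

Example 1:
Resume: 5 years Java, AWS certified...
Assessment: Strengths: backend depth. Weaknesses: no frontend. Verdict: interview.

Example 2:
Resume: bootcamp grad, 3 personal projects...
Assessment: Strengths: motivated. Weaknesses: no production experience. Verdict: phone screen.

Example 3:
Resume: 10 years management, little hands-on coding...
Assessment: Strengths: leadership. Weaknesses: stale technical skills. Verdict: reject.

Now review this resume:
{resume}
"""
```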
Instruction-based approach (flexible, cache-friendly):
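The same task expressed as instructions alone (again a hypothetical sketch): the criteria are stated once, and the model adapts its analysis to each resume instead of echoing a template:

```python
# Illustrative instruction-only prompt for the same resume-review task.
INSTRUCTION_PROMPT = """Review the resume below. Assess technical depth,
relevant experience, and growth trajectory against the role's actual
requirements. Note anything unusual -- career changes, gaps, atypical
backgrounds -- and weigh it on its merits rather than against a template.
End with a recommendation and a one-sentence justification.

Resume:
{resume}
"""
```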
The instruction-based approach consumes fewer tokens, handles edge cases the few-shot examples never anticipated, and does not anchor the model to a rigid output format.
When examples are genuinely necessary -- for teaching a domain-specific format or demonstrating a complex multi-step procedure -- introduce controlled variation to prevent pattern lock-in:
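One way to introduce that variation, sketched with illustrative templates and a seeded RNG for reproducibility -- jitter the ordering and phrasing of examples so no single surface form dominates:

```python
import random

# Several equivalent phrasings of the same example structure.
PHRASINGS = [
    "Result: {outcome}",
    "Outcome -> {outcome}",
    "The check returned: {outcome}",
]

def render_example(outcome: str, rng: random.Random) -> str:
    # Pick a phrasing at random so examples don't share one template.
    return rng.choice(PHRASINGS).format(outcome=outcome)

def render_examples(outcomes: list, seed: int = 7) -> list:
    """Render examples with varied ordering and phrasing. A fixed seed
    keeps the serialization stable across retries of the same step."""
    rng = random.Random(seed)
    shuffled = list(outcomes)
    rng.shuffle(shuffled)  # vary ordering as well as wording
    return [render_example(o, rng) for o in shuffled]
```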
The goal is to teach the model the principle behind the examples, not the surface pattern.
The distinction between context engineering and prompt engineering is not merely semantic. It reflects a fundamental shift in how we think about building AI systems.
Prompt engineering operates at the level of a single interaction. It asks: "How do I phrase this instruction to get the best response?" The unit of work is the prompt -- a static text artifact that is crafted, tested, and refined.
Context engineering operates at the system level across many interactions. It asks: "How do I design the information architecture so that the model has the right information at the right time across a multi-step workflow?" The unit of work is the context management system -- a dynamic process that assembles, compresses, prioritizes, and structures information for each step.
As AI agents become the primary interface for complex tasks, context engineering is becoming the more critical discipline. A perfectly engineered prompt is necessary but not sufficient; without effective context management, the agent will still drift, waste compute, and produce inconsistent results over long execution sequences.
Context engineering represents the maturation of AI agent development from an art of clever prompting into a rigorous engineering discipline. The six principles examined here -- KV cache-aware design, tool masking, filesystem-as-context, attention manipulation through recitation, error preservation, and few-shot discipline -- are not isolated tricks. They form a coherent framework grounded in the fundamental properties of transformer architectures and the practical realities of production deployment.
These principles share a common thread: they treat the context window not as a passive container for text, but as a carefully managed computational resource. Every token in the context influences every generated token. Every cache miss costs real money. Every piece of information competes for the model's finite attention. Engineering this resource effectively is what separates agents that work in demos from agents that work in production.
As the field moves toward the vision described in DeepMind's "Era of Experience" -- agents that continuously learn from their own interactions with the world -- context engineering will only grow in importance. The agents of tomorrow will not just consume static prompts; they will manage rich, evolving information environments across long-running workflows. The teams that master context engineering today are building the foundation for that future.
Context engineering is the practice of dynamically managing the information environment that an AI agent operates within across multi-step workflows. It encompasses decisions about what information to include in the model's context window, how to structure that information for optimal cache utilization and attention patterns, when to compress or externalize information to the filesystem, and how to maintain coherence across dozens or hundreds of sequential inference calls.
Prompt engineering focuses on crafting the text of a single instruction to maximize response quality for a specific task. Context engineering operates at a higher level, managing the entire information lifecycle across a multi-step agent workflow. While prompt engineering asks "what should I say?", context engineering asks "what should the model see at each step, and how should that information be structured for cost efficiency, latency optimization, and sustained coherence?"
The KV (key-value) cache stores intermediate attention computations from transformer inference, allowing the model to reuse calculations from previous tokens rather than recomputing them. For AI agents, which process long, growing contexts across many sequential steps, KV cache hits can reduce inference costs by up to 10x and significantly decrease latency. Designing agent architectures that maximize cache hit rates -- through stable prefixes, append-only history, and deterministic serialization -- is one of the highest-impact optimizations available.
Removing tools from the schema during an agent session breaks the KV cache (since tool definitions typically appear early in the context) and creates dangling references when the conversation history contains calls to tools that no longer exist. Tool masking -- keeping all tools in the schema but constraining which ones the agent can actually invoke -- preserves cache stability, avoids hallucination from schema inconsistencies, and achieves the same goal of directing the agent toward appropriate tools.
The filesystem-as-context pattern treats the local filesystem as an extension of the agent's memory. Instead of keeping all intermediate results, research notes, and observations in the context window, the agent writes them to files and reads them back when needed. This keeps the active context lean and focused while preserving full information fidelity. The key principle is reversible compression: always preserve file paths, URLs, and references so the agent can retrieve full content on demand.
Aaron is a senior software engineer and AI researcher specializing in generative AI, multimodal systems, and cloud-native AI infrastructure. He writes about cutting-edge AI developments, practical tutorials, and deep technical analysis at fp8.co.