Outcome focus: Turned long-context failure modes into an engineering playbook for selecting, isolating, pruning, summarizing, offloading, and evaluating context in agent systems.
agentscontext engineeringllm evaluationtool usemulti-agent systems
Long context is not the same thing as good context.
That is the main lesson I take from Drew Breunig's two posts on context failure and context repair, Simon Willison's follow-up, and Anthropic's write-up on its multi-agent research system. The through-line is not subtle: bigger windows give agent builders more room, but they do not remove the need for information management.
In some cases, bigger windows make the problem easier to hide.
A model can accept a million tokens, but that does not mean every token helps. Old mistakes remain in the transcript. Irrelevant tools compete for attention. Contradictory snippets sit next to each other. Long histories can cause the model to repeat stale patterns instead of forming a better plan. The context window becomes a workbench, a junk drawer, and a memory leak at the same time.
This is why "context engineering" is a useful phrase.
It shifts attention away from prompt writing as a single text artifact and toward the whole system that decides what the model sees. Context is assembled. It is selected, ordered, compressed, isolated, retrieved, offloaded, and evaluated. Every part of that assembly process changes model behavior.
Breunig gives the failure taxonomy. Willison sharpens the developer framing. Anthropic shows what the pattern looks like in a production multi-agent research system.
Taken together, they point to a practical rule:
The context window should be treated like a scarce, behavioral resource, not like free storage.
Long context changes the failure mode#
The early promise of long context was simple. If the model can read more, maybe we can stop worrying so much about retrieval, summarization, and tool selection. Put all the documents in. Put every tool description in. Put the whole conversation in. Let the model sort it out.
Breunig's "How Long Contexts Fail" is a direct argument against that instinct. He names four failure modes that show up when context becomes unmanaged: context poisoning, context distraction, context confusion, and context clash.
The names are useful because they make the failures easier to diagnose.
Context poisoning happens when a wrong intermediate result, hallucination, or bad assumption gets written into the context and then keeps steering later outputs. The agent is no longer making one mistake. It is carrying the mistake forward as state.
Context distraction happens when the model over-focuses on accumulated context and under-uses its broader learned capability. Breunig points to examples where very long agent histories encouraged repeated old actions instead of new plans. That is a different issue from merely filling the window. The context may still fit, but it is shaping the model badly.
Context confusion happens when irrelevant context influences the response. This is common with tools. If every tool definition is available, the model has to decide which ones matter. Some models call irrelevant tools. Smaller models can fail when too many tool descriptions overlap, even when the total token count is within the formal limit.
Context clash happens when the assembled context contains conflicting information. Multi-turn conversations make this easy. Early attempts, partial answers, corrected instructions, retrieved documents, and tool outputs all stay in the transcript. The model sees the whole mess and may rely on the wrong piece.
These are not theoretical edge cases.
They are exactly the problems agent systems create for themselves. Agents gather data, call tools, summarize, revise plans, ask submodels for help, and append everything into a working history. Long context gives them more room to accumulate useful evidence. It also gives them more room to accumulate bad state.
The token budget is behavioral, not just numerical#
A token budget is usually discussed as a limit.
How much can the model fit? How much does it cost? How much latency does it add?
Those questions matter, but they are incomplete. The more important question is behavioral: what does this token do to the model's next decision?
A tool description is not inert. A stale plan is not inert. A previous wrong answer is not inert. A long document section that is only loosely relevant is not inert. Every added token changes the distribution of attention and possible next actions.
This is the part that makes context engineering different from storage design.
A database can hold irrelevant rows without changing the meaning of a query if the query is written correctly. A model context does not behave that way. Irrelevant text can still influence generation. Contradictory text can still fight with the right answer. Tool descriptions can still invite unnecessary calls.
The context window is not a filing cabinet.
It is part of the computation.
That is why the right response to long context is not "use less context" in a crude way. The right response is to make context intentional. Use the context that changes the answer in the right direction. Keep out the context that creates noise, contradiction, stale plans, or accidental authority.
RAG is not dead#
Breunig's follow-up, "How to Fix Your Context," starts with a useful correction: RAG is still alive.
Long windows revive the same debate every time. If we can fit more in the prompt, why retrieve? Why not provide everything and avoid the risk of missing a document?
Because retrieval is not only about fitting.
It is about selection.
RAG, when done well, is a context selection mechanism. It says: for this question, these are the few pieces of outside information that should influence the answer. It is not perfect. Retrieval quality can fail. Chunking can be bad. Embeddings can miss the point. But dumping the entire corpus into context is not the cure. It creates a different failure mode.
The better posture is to treat RAG as one tool in a context pipeline.
Retrieve candidates. Rerank when the task requires it. De-duplicate overlapping chunks. Preserve source metadata. Keep enough text for the model to reason, but not so much that the relevant evidence is buried under adjacent noise.
The same principle applies beyond documents.
Context engineering is not the death of RAG. It is the generalization of RAG's best idea: the model should see what it needs, not everything we can technically supply.
Tool loadout is product design#
The term "tool loadout" comes from Breunig's second post, and Willison highlights it in his link post because it gives developers a concrete handle.
The idea is simple: do not give the model every tool by default. Select the tools likely to matter for the task.
That sounds obvious until a system integrates MCP servers, internal APIs, cloud services, databases, file tools, browser tools, project tools, and workflow tools. The easiest implementation path is to expose everything and let the model choose.
That is where context confusion starts.
Tool descriptions overlap. Some tools are generic. Some tools are dangerous. Some tools are stale. Some tools sound relevant but are not. The model may call a tool because it is visible, not because it is needed.
Anthropic's multi-agent write-up makes the same point from a production angle: tool design and tool selection are critical. A model searching the web for information that only exists in Slack is already on the wrong path. Bad tool descriptions send agents into wasted work. Distinct purpose and clear descriptions matter.
Tool loadout is not just an optimization.
It is product design for agent action.
For each request, the system should ask which tools belong in the room. A simple fact lookup may need one search tool. A code task may need file read, shell, and test tools. A research task may need web search, source fetch, citation extraction, and synthesis. A high-risk action may need only read tools until a human approves the write path.
I would rather expose five sharp tools than fifty vague ones.
Context quarantine is why subagents work#
Context quarantine is Breunig's phrase for isolating work into separate threads with separate contexts.
Willison connects this to subagents and to the pattern used by Claude Code and Anthropic's research system. Anthropic gives the most concrete production example: an orchestrator-worker architecture where a lead agent plans and delegates specialized tasks to subagents, each with its own context window, prompt, tools, and exploration path.
That is not only a parallelism trick.
It is context hygiene.
If a single agent handles a broad research problem, every search result, failed path, intermediate thought, tool result, and partial conclusion can accumulate in one window. By splitting work, each subagent sees only the task slice it needs. It can explore independently. Then it returns a condensed result to the lead agent, which synthesizes across reports instead of dragging every raw trace into one giant prompt.
Anthropic describes this as compression. Search is the work of distilling insight from a huge corpus. Subagents help by exploring different parts of the search space in parallel and returning the important tokens.
That design also reduces path dependency.
One subagent's mistake does not necessarily poison another subagent's context. A bad source in one branch can be corrected during synthesis. The lead agent receives outputs that can be compared, reconciled, and cited.
But quarantine is not free.
Anthropic reports that multi-agent systems can be much more expensive in token usage. They also point out that not every task fits the pattern. Work that requires tight shared context or real-time coordination may not benefit. Many coding tasks are less parallelizable than broad research tasks.
So the decision is not "single-agent bad, multi-agent good."
The decision is whether the task has separable subproblems whose outputs can be synthesized cleanly.
Summarization is a component, not a convenience#
Summarization used to be a workaround for small context windows.
When the chat got too long, summarize it and start over.
Now summarization has a different role. It reduces context distraction, preserves intent across long-running workflows, and gives the system a way to carry forward only the state that should matter.
The hard part is deciding what the summary must preserve.
A generic summary may be fluent and useless. It may omit constraints, uncertainty, open questions, source provenance, decisions, failed approaches, or user preferences. It may flatten conflict into false agreement. It may turn a temporary hypothesis into a settled fact.
That is why Breunig argues that summarization deserves to be its own component that can be evaluated. I agree.
For an agent workflow, I would separate summaries by purpose:
- A plan summary that preserves goals, constraints, and next steps.
- A research summary that preserves claims, sources, and confidence.
- A tool summary that preserves outputs, errors, and state changes.
- A user preference summary that preserves only durable preferences.
- A decision summary that preserves tradeoffs and chosen direction.
Different summaries need different schemas.
The model should not have to infer which details matter from a vague instruction to "summarize this conversation."
Pruning is deletion with judgment#
Context pruning is the removal of irrelevant or unneeded material.
This is harder than it sounds because relevance changes over time. A failed tool call may be crucial for debugging now and irrelevant later. A retrieved document may be useful for one subquestion and distracting for the final synthesis. A previous wrong answer may need to be removed or marked as wrong, not allowed to linger as competing context.
Pruning should protect the instructions and goals while cutting accumulated cruft.
One practical pattern is to maintain structured context internally and compile the final prompt at the boundary. Instead of treating the prompt as one growing string, keep sections: system instruction, task goal, constraints, active plan, selected evidence, tool results, prior decisions, and open questions.
Then pruning can operate by section.
Old tool results can be dropped. Superseded plans can be archived. Irrelevant documents can be removed. Main instructions can remain untouched. A contradiction can be resolved by marking one source as stale instead of letting both fight for attention.
This is ordinary engineering discipline applied to prompts.
The prompt is not just text. It is a rendered view of state.
Offloading keeps state out of the window#
Context offloading means storing information outside the model context and retrieving it when needed.
Willison notes examples such as agents writing and updating plan.md files. Anthropic's research system also describes saving plans to memory so important context survives truncation. The point is simple: not all state belongs in the live prompt.
Some state belongs in files. Some belongs in a database. Some belongs in a vector store. Some belongs in a task graph. Some belongs in an audit log. Some belongs in a durable memory service. The context window should pull in the slices needed for the next decision.
This makes agent systems more inspectable.
If a plan exists only as chat history, it is hard to review and easy to lose. If it lives in a file or structured store, the agent can update it, the user can inspect it, and another process can validate it.
Offloading also helps with restart and recovery. Long-running agents fail. Tools time out. Containers restart. Context windows truncate. If the only copy of the plan is the current prompt, the system is fragile.
External state turns memory into architecture.
Multi-agent systems need effort budgets#
One of the most useful parts of Anthropic's write-up is its honesty about coordination.
Early multi-agent systems can overdo everything. They spawn too many subagents. They keep searching after they have enough evidence. They duplicate work. They choose the wrong tool. They chase nonexistent sources. They distract each other.
Anthropic's mitigation was not magic. It was orchestration discipline.
They taught the lead agent how to delegate. Each subagent needs an objective, an output format, tool guidance, source guidance, and task boundaries. They also embedded effort rules: simple fact-finding might require one agent and a small number of tool calls; direct comparisons might need a few subagents; complex research might justify more than ten subagents with divided responsibilities.
That is context engineering at the orchestration level.
Effort is part of the prompt contract.
Without an effort budget, an agent may spend too little and answer shallowly, or spend too much and burn tokens on marginal improvement. A good orchestrator should scale effort to task value and complexity.
This is where I think agent design starts to look like operations research. The system is allocating compute, tools, context windows, and attention under uncertainty.
That deserves explicit rules.
Evaluation has to watch outcomes and process#
Anthropic also makes a point that should be standard in agent work: evaluate early, and evaluate the outcome rather than only the exact path.
For research agents, there may not be one prescribed sequence of steps. One agent might search three sources, another might search ten. The question is whether the final answer is factual, complete, well-cited, based on good sources, and produced with reasonable tool use.
Their rubric categories are practical: factual accuracy, citation accuracy, completeness, source quality, and tool efficiency. They used LLM-as-judge evaluation and human evaluation because each catches different failures.
That pattern maps well beyond research.
For context engineering, I would evaluate:
- Did retrieval include the necessary evidence?
- Did tool loadout include the right tools and exclude distracting ones?
- Did subagents avoid duplicate work?
- Did summaries preserve constraints and uncertainty?
- Did pruning remove stale or irrelevant material without dropping critical facts?
- Did offloaded state survive restart or truncation?
- Did the final answer cite or explain the evidence it used?
- Did the system stay within cost and latency budgets?
If those questions are not measured, context quality becomes vibes.
Failure modes I would watch#
The first failure mode is long-context dumping. The system sends everything because it can.
The second is tool sprawl. The model sees too many tool definitions and starts choosing poorly.
The third is poisoned state. A hallucination, bad extraction, or incorrect subresult becomes part of the ongoing context.
The fourth is stale plan persistence. The agent keeps following an old plan because the plan stayed in context after the facts changed.
The fifth is false summarization. A summary reads well but drops the one constraint that mattered.
The sixth is over-parallelization. The system spawns subagents for work that needs shared context or a single coherent thread.
The seventh is missing synthesis. Subagents return outputs, but the lead agent does not reconcile contradictions or source quality.
The eighth is hidden cost. Multi-agent context quarantine improves quality but burns tokens far beyond the value of the task.
The ninth is no context observability. Nobody can tell which documents, tools, summaries, or prior messages shaped the answer.
The tenth is no recovery path. The agent runs long, fails midway, and cannot reconstruct its plan because the state lived only in the prompt.
These are design failures, not model personality quirks.
A working playbook#
If I were building an agent today, I would start with a context budget for each stage.
Planning gets the user request, constraints, available tool categories, and a small amount of relevant history. It does not get every document.
Tool selection gets the task and tool descriptions indexed for retrieval. It returns a small loadout.
Research or execution gets only the selected tools, the task slice, and the evidence needed for that slice.
Subagents run in isolated contexts when the work is separable.
Summaries are generated into purpose-built schemas. They are not free-form recaps.
Durable state lives outside the prompt. Plans, findings, decisions, and long-lived memories should be inspectable.
The final synthesis gets selected evidence, subagent findings, source metadata, known contradictions, and the output requirements.
Evaluation checks both the answer and the context pipeline that produced it.
That is the shape I trust:
request
-> classify task and risk
-> select tools and documents
-> split separable work
-> run isolated contexts
-> summarize and prune
-> offload durable state
-> synthesize with citations
-> evaluate answer and processThe details change by product. The discipline does not.
The real lesson#
Long context is powerful.
It lets models read more, remember more, inspect more, and coordinate larger jobs. That matters. But a larger window is not a substitute for system design.
Breunig's taxonomy helps name the ways context fails. His follow-up gives the repair tools. Willison makes the developer implication plain: tool loadout, quarantine, pruning, summarization, and offloading are practical patterns, not academic labels. Anthropic shows the production version: orchestrated subagents, effort budgets, tool heuristics, memory, citation synthesis, evals, retries, checkpoints, and human testing.
This is the lesson I want to carry forward:
Context engineering is not about stuffing the right magic prompt into a model.
It is about controlling the information environment in which the model acts.
That environment decides what the model notices, what it ignores, what it repeats, what it trusts, and what it can safely do next.
For agent systems, that is architecture.