Research Notes - Multi-Pass Processing and Context Engineering for AI Research Agent Reliability

AI Generated by claude-sonnet-4-6 · human-supervised · Created: 2026-03-10

Research: Multi-Pass Processing and Context Engineering for AI Research Agent Reliability

Date: 2026-03-10

Search queries used:

  • “multi-pass processing AI research agents reliability”
  • “context engineering LLM agents reliability 2025”
  • “context engineering definition AI agents structured prompting”
  • “multi-pass LLM reasoning iterative refinement research agent accuracy”
  • “context rot LLM long context degradation attention mechanism 2024 2025”
  • “Anthropic multi-agent research system how we built it 2025”

Executive Summary

Context engineering has emerged as the successor to prompt engineering for agentic AI systems: rather than crafting individual instructions, it involves curating what information enters a model’s bounded attention budget at each inference step. Multi-pass processing — iterative loops where agents revise, compress, or hand off context between inference calls — is the primary architectural mechanism that makes context engineering tractable at scale. Empirical evidence from both Anthropic’s production research system and a 2026 peer-reviewed study of clinical agents (npj Health Systems) shows that multi-agent, multi-pass architectures sustain high accuracy under load where single-pass agents collapse. The core problem these techniques solve is “context rot”: as token count grows, LLM attention degrades in a nonlinear way, and single-agent single-pass designs structurally fail at complex research tasks. Tenet alignment is strong for Context as Infrastructure and Symbiotic Intelligence; mild tension with Human Intent First arises when agent autonomy in context management obscures the provenance of decisions.

Key Sources

Effective Context Engineering for AI Agents

  • URL: https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
  • Type: Engineering blog post (Anthropic)
  • Date: September 29, 2025
  • Key points:
    • Context engineering = “strategies for curating and maintaining the optimal set of tokens during LLM inference”
    • Natural progression from prompt engineering; the engineering problem shifts from words to information configuration
    • “Context rot” — performance degrades as tokens accumulate due to the n² attention complexity of transformers
    • Guiding principle: find the smallest set of high-signal tokens that maximises the likelihood of desired outcome
    • Three techniques for long-horizon tasks: compaction, structured note-taking, multi-agent sub-architectures
    • Sub-agent architecture: each sub-agent explores with a clean context window and returns 1,000–2,000 token summaries to a lead agent
    • Compaction = summarising an overloaded context and restarting with the condensed version
    • Just-in-time retrieval: agents hold lightweight identifiers (file paths, URLs) and load data only when needed, mirroring human cognition
  • Tenet alignment: Strong alignment with Context as Infrastructure; compaction and note-taking are literally treating context as persistent, reusable infrastructure
  • Quote: “Context, therefore, must be treated as a finite resource with diminishing marginal returns.”
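The compaction technique above can be sketched as a simple loop: when the accumulated context exceeds a budget, replace it with a condensed version and continue from there. This is a minimal sketch, not Anthropic's implementation; `summarise` (a plain truncation) and all thresholds are stand-ins for a real model call and real token accounting:

```python
# Minimal sketch of context compaction: when the accumulated context
# exceeds a budget, restart from a condensed summary of it.
# `summarise` is a stand-in for an LLM summarisation call; character
# counts stand in for token counts.

def summarise(text: str, max_chars: int = 100) -> str:
    """Stand-in for a model summarisation call: condense the context."""
    return text[:max_chars] + " [condensed]"

def append_with_compaction(context: str, new_message: str,
                           budget_chars: int = 400) -> str:
    """Append a message; if the context would exceed the budget, compact first."""
    candidate = context + "\n" + new_message
    if len(candidate) > budget_chars:
        # Compaction: replace the overloaded context with its summary,
        # then re-append the new message on top of the condensed version.
        context = summarise(context)
        candidate = context + "\n" + new_message
    return candidate

context = ""
for i in range(50):
    context = append_with_compaction(context, f"tool result {i}: ...")
# The context never grows past the budget, yet the latest message survives intact.
```

The design choice worth noting: compaction keeps the conversation moving inside a bounded window at the cost of losing detail from earlier turns, which is why the source pairs it with structured note-taking and external storage.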

How We Built Our Multi-Agent Research System

  • URL: https://www.anthropic.com/engineering/multi-agent-research-system
  • Type: Engineering case study (Anthropic)
  • Date: June 13, 2025
  • Key points:
    • Production research system uses orchestrator-worker pattern: lead agent (Claude Opus 4) spawns parallel sub-agents (Claude Sonnet 4)
    • Multi-agent system outperformed single-agent Claude Opus 4 by 90.2% on internal research evals
    • Token usage alone explains 80% of performance variance on BrowseComp; more tokens (via multi-agent parallelism) = better results
    • Multi-agent systems use ~15× more tokens than chat interactions; justified only when task value is high
    • Multi-pass search: start broad (short queries), progressively narrow — mirrors expert human research behaviour
    • Interleaved thinking: sub-agents use visible reasoning scratchpad after each tool call to evaluate quality and refine next query
    • Prompt engineering encodes heuristics (not rigid rules): scale effort to query complexity; give orchestrators delegation templates
    • Sub-agents write to external filesystem; coordinator gets lightweight reference, not full output — prevents “game of telephone” degradation
    • Failure mode documented: sub-agents duplicating work without division of labour; solved by explicit task boundaries
    • Extended thinking improved instruction-following and reasoning efficiency across the board
  • Tenet alignment: Aligns with Symbiotic Intelligence — the system is designed to expand research reach, not just automate; human intent is still the entry point. Also Context as Infrastructure: persistent memory, checkpoints, plan storage.
  • Quote: “The essence of search is compression: distilling insights from a vast corpus. Subagents facilitate compression by operating in parallel with their own context windows.”
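The orchestrator-worker pattern described above can be sketched as follows. This is an illustrative skeleton, not the production system: `run_subagent` stands in for a sub-agent model call, and the names are invented. Each worker operates with a clean context for its sub-task, full output goes to the filesystem, and only a short summary plus a lightweight reference reach the coordinator, which is the "game of telephone" safeguard the source describes:

```python
import os
import tempfile

def run_subagent(task: str) -> str:
    """Stand-in for a sub-agent model call with its own clean context window."""
    return f"findings for '{task}': ..."

def orchestrate(tasks):
    """Lead agent: delegate each explicit sub-task, collect summaries + references.

    Explicit task boundaries prevent the documented failure mode of
    sub-agents duplicating each other's work.
    """
    workdir = tempfile.mkdtemp()
    summaries = []
    for i, task in enumerate(tasks):
        full_output = run_subagent(task)            # clean context per worker
        path = os.path.join(workdir, f"worker_{i}.txt")
        with open(path, "w") as f:
            f.write(full_output)                    # full output to the filesystem
        summaries.append({
            "task": task,
            "summary": full_output[:80],            # condensed (1,000–2,000 tokens in the source)
            "ref": path,                            # lightweight reference, not full output
        })
    return summaries

report = orchestrate(["define terms", "find benchmarks", "survey failures"])
```

The coordinator can later dereference `ref` for any finding it needs in full, without ever ingesting every worker's complete output into its own context.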

Orchestrated Multi-Agents Sustain Accuracy Under Clinical-Scale Workloads

  • URL: https://www.nature.com/articles/s44401-026-00077-0
  • Type: Peer-reviewed journal article (npj Health Systems, Nature Portfolio)
  • Date: March 9, 2026
  • Key points:
    • Experiment: single agent vs. orchestrated multi-agent across retrieval, extraction, dosing tasks at batch sizes 5–80
    • Single-agent accuracy: 73.1% (batch=5) → 16.6% (batch=80); multi-agent: 90.6% → 65.3%
    • GPT-4.1-mini: single agent dropped to 33.9% at batch=80; multi-agent stayed at 91.4%
    • Multi-agent used up to 65-fold fewer tokens than single agent at maximum load
    • Mechanism: context insulation — each worker sees only the tokens relevant to its single decision; attention is not diluted by irrelevant material
    • Orchestrator re-assembles answers without expanding any single model call’s context
    • Model scale matters: larger models hold accuracy longer under multi-agent load
  • Tenet alignment: Strongly supports Symbiotic Intelligence — task partitioning preserves accuracy and enables scale without replacing human judgment. Transparent audit trail addresses regulatory/accountability concern.
  • Quote: “Delegating each task to its own worker appears to insulate the LLM from context interference, so accuracy remains high even when many unrelated prompts arrive at once.”
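The context-insulation mechanism can be illustrated by contrasting how prompts are constructed in the two designs: a single agent sees every task in one context, while the orchestrated design gives each worker only its own task's tokens. A minimal sketch, with the `Task` structure and all sizes invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Task:
    question: str
    evidence: str  # only the material relevant to this one decision

def single_agent_prompt(tasks) -> str:
    """One call sees every task: context grows linearly with batch size,
    and attention is diluted across all of them."""
    return "\n\n".join(t.evidence + "\nQ: " + t.question for t in tasks)

def worker_prompts(tasks) -> list:
    """One call per task: each worker's context stays the size of a single
    task no matter how large the batch grows (context insulation)."""
    return [t.evidence + "\nQ: " + t.question for t in tasks]

# Batch size 80, the load at which the study's single agents collapsed.
batch = [Task(f"question {i}", f"evidence {i} " * 50) for i in range(80)]
big = single_agent_prompt(batch)
small = worker_prompts(batch)
```

The orchestrator then re-assembles the 80 individual answers, so no single model call's context ever expands with the batch, which is the mechanism the study credits for sustained accuracy.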

Context Rot: How Increasing Input Tokens Impacts LLM Performance

  • URL: https://research.trychroma.com/context-rot
  • Type: Technical research report (Chroma)
  • Date: July 14, 2025
  • Key points:
    • Evaluated 18 LLMs including GPT-4.1, Claude 4, Gemini 2.5, Qwen3
    • Models do not use context uniformly — performance grows unreliable as input length grows
    • The “lost in the middle” effect: information in the middle of long contexts has significantly lower retrieval accuracy than information at start or end
    • Transformer architecture creates n² pairwise token relationships — stretched thin at long contexts
    • Near-perfect Needle-in-a-Haystack scores do not generalise; context rot appears on more realistic tasks
  • Tenet alignment: Foundational empirical grounding for why context engineering is necessary
  • Quote (from Kudra secondary source, citing this research): “Information in the middle gets effectively ignored despite being ‘in context’.”

8 AI Agent Concepts Every AI Developer Needs in 2026

  • URL: https://kudra.ai/8-ai-agent-concepts-every-ai-developer-needs-in-2026-visually-explained/
  • Type: Technical explainer (Kudra, February 2026)
  • Key points:
    • Memory hierarchy: short-term (context window) → session memory → long-term (vector DB). Skipping session memory causes quality problems.
    • Context window management: retrieval accuracy by position — first 10%: 87%, middle 50%: 52%, last 10%: 81%
    • Multi-pass RAG strategies table: iterative retrieval achieves “very high” precision but “very high” latency; appropriate for multi-hop reasoning
    • Recommendation: “reserve multi-pass for complex analytical tasks”
    • Agent-directed RAG (agent decides when and what to retrieve) beats auto-retrieve-always for complex research
  • Tenet alignment: Neutral/supportive; practical framing
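The three-tier memory hierarchy above can be sketched as a promotion chain: items evicted from the bounded context window are demoted to session memory rather than lost, and session memory is persisted at session end. All class and method names here are invented for illustration, and a keyed dict stands in for a real vector DB:

```python
from collections import deque

class MemoryHierarchy:
    """Sketch of the three tiers: context window -> session memory -> long-term store."""

    def __init__(self, window_size: int = 5):
        self.context_window = deque(maxlen=window_size)  # short-term, bounded
        self.session_memory = []                          # mid-term, this session only
        self.long_term = {}                               # stand-in for a vector DB

    def observe(self, item: str) -> None:
        if len(self.context_window) == self.context_window.maxlen:
            # The oldest window item is about to be evicted: demote it to
            # session memory instead of dropping it. (Skipping this tier is
            # the quality failure the source warns about.)
            self.session_memory.append(self.context_window[0])
        self.context_window.append(item)

    def end_session(self) -> None:
        # Persist session memory to long-term storage; keyed lookup stands
        # in for embedding + vector search.
        for item in self.session_memory:
            self.long_term[item] = item
        self.session_memory.clear()

mem = MemoryHierarchy()
for i in range(8):
    mem.observe(f"note {i}")
mem.end_session()
```

With a window of 5 and 8 observations, the three oldest notes end up in long-term storage while the five most recent remain in the context window.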

Major Positions

Position 1: Context Engineering as Information Architecture

  • Proponents: Anthropic Applied AI team (Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, Jeremy Hadfield)
  • Core claim: Agent reliability is primarily an information management problem, not a model capability problem. The key variable is what goes into the context window at each inference step.
  • Key arguments:
    • LLMs have a finite “attention budget” that context rot depletes
    • System prompts, tools, examples, and message history must each be minimised for high signal density
    • Just-in-time retrieval (agents load data on demand) outperforms preloaded RAG for dynamic tasks
    • Compaction + note-taking + multi-agent architectures are the three levers for long-horizon work
  • Relation to site tenets: Strong alignment with Context as Infrastructure — context is treated as infrastructure that requires active curation and maintenance. Challenges the naive “just load everything” pattern.
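The just-in-time retrieval idea in this position can be sketched directly: the agent's context carries only lightweight identifiers (here, file paths), and a document's tokens enter the context only at the step that needs them. A minimal sketch under invented names; `load` is the hypothetical dereferencing step:

```python
import tempfile
from pathlib import Path

# Prepare a small corpus on disk; the agent's context holds only the paths.
tmp = Path(tempfile.mkdtemp())
corpus = {}
for name, text in [("notes.md", "context rot findings ..."),
                   ("evals.md", "browsecomp variance ...")]:
    path = tmp / name
    path.write_text(text)
    corpus[name] = path

# Agent context: lightweight identifiers only, not preloaded contents.
context_refs = list(corpus)

def load(ref: str) -> str:
    """Dereference an identifier only at the moment a step needs it."""
    return corpus[ref].read_text()

# The agent decides which reference the current step requires.
step_context = load("notes.md")   # only this document's tokens enter the context
```

This mirrors the human-cognition analogy in the source: we keep pointers (filenames, bookmarks) in working memory and fetch content on demand, rather than memorising every document up front.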

Position 2: Multi-Agent Parallelism as the Reliability Mechanism

  • Proponents: Anthropic Research engineering team; Klang, Omar et al. (Mount Sinai/Nature 2026)
  • Core claim: Context isolation — giving each agent only the tokens it needs for its specific sub-task — is the primary driver of reliability under load. Multi-pass architectures (lead agent ↔ sub-agents) implement this isolation.
  • Key arguments:
    • Token count explains 80% of performance variance (Anthropic BrowseComp analysis)
    • Multi-agent outperforms single-agent by 90.2% on research evals
    • Clinical study: single-agent accuracy collapses from 73% to 17% as task batch grows; multi-agent holds at 65%+
    • The orchestrator-worker pattern scales to any task volume without degrading individual decision quality
  • Relation to site tenets: Aligns with Symbiotic Intelligence — the system expands human research capability without reducing human interpretability. The audit trail (each agent call logged) supports accountable collaboration.

Position 3: Iterative Refinement (Multi-Pass Within a Single Agent)

  • Proponents: Various — Amazon (multi-pass code refinement), general ReAct/chain-of-thought literature
  • Core claim: Even without multiple agents, running multiple inference passes over a task (draft → critique → revise) improves output quality. This is a simpler form of multi-pass processing.
  • Key arguments:
    • Multi-pass iterative refinement achieves higher accuracy across performance metrics (Amazon code suggestion research)
    • Interleaved thinking — agent reasons after each tool result — is a lightweight form of multi-pass processing
    • Extended thinking as a controllable scratchpad: agents plan, act, evaluate, adapt in one context window
  • Relation to site tenets: Aligns with Symbiotic Intelligence — iterative refinement encodes epistemic humility; outputs are treated as provisional, not final.
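The draft → critique → revise loop can be sketched as below. This is a schematic, not any cited system's implementation: `draft`, `critique`, and `revise` are stand-ins for model calls, with the critic hard-coded to be satisfied after two revisions so the loop terminates deterministically:

```python
def draft(task: str) -> str:
    """Stand-in for a first-pass model call."""
    return f"draft answer to '{task}'"

def critique(answer: str) -> list:
    """Stand-in critic: keeps flagging issues until a second revision exists."""
    return ["needs a source", "check the numbers"] if "rev2" not in answer else []

def revise(answer: str, issues: list) -> str:
    """Stand-in reviser: append a revision marker noting the issues addressed."""
    n = answer.count("rev") + 1
    return answer + f" [rev{n}: addressed {len(issues)} issues]"

def multi_pass(task: str, max_passes: int = 4) -> str:
    """Multi-pass refinement: treat each output as provisional until the
    critic finds nothing to flag, or the pass budget runs out."""
    answer = draft(task)
    for _ in range(max_passes):
        issues = critique(answer)
        if not issues:
            break          # converged: the critic is satisfied
        answer = revise(answer, issues)
    return answer

result = multi_pass("summarise context findings")
```

The `max_passes` budget matters in practice: without it, a critic that always finds something would loop indefinitely, and each pass costs another inference call.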

Position 4: Failure Mode Realism — Multi-Agent Has Costs

  • Proponents: Anthropic (same source); Reddit/practitioner community
  • Core claim: Multi-agent and multi-pass architectures introduce coordination complexity, token cost, and emergent failure modes that make them brittle in practice if not carefully engineered.
  • Key arguments:
    • Multi-agent systems use ~15× more tokens than chat; economic viability requires high-value tasks
    • Sub-agents can duplicate work without explicit division of labour
    • Small prompt changes cascade unpredictably across the full system
    • Synchronous lead-agent execution creates bottlenecks; async coordination is complex
    • Failure rates in peer-reviewed multi-agent studies reported at 60–80% without careful engineering
  • Relation to site tenets: Tension with Always Scalable — multi-pass/multi-agent is expensive and may not be proportionate for simpler tasks. Aligns with the tenet’s call to balance effort with results.

Key Debates

Debate 1: Context Window Size vs. Context Quality

  • Sides: One camp advocates for larger context windows (more capacity = more flexibility); the other argues context rot makes window size secondary to curation quality
  • Core disagreement: Will ever-larger context windows eventually make context engineering obsolete?
  • Current state: Ongoing. Chroma’s context rot research and Anthropic’s engineering guidance both argue quality will remain the binding constraint regardless of window size, because attention degrades nonlinearly. Counter-position: models with specialised long-context training may break the pattern.

Debate 2: Pre-Retrieved RAG vs. Agent-Directed Just-in-Time Retrieval

  • Sides: RAG (retrieve at query time, preloaded) vs. agentic search (agent decides what to retrieve and when)
  • Core disagreement: Speed and simplicity (RAG) vs. adaptability and relevance (agentic)
  • Current state: Field is converging on hybrid approaches; RAG for stable, structured knowledge; agentic retrieval for open-ended research tasks.

Debate 3: How Much Should Agents Self-Manage Their Context?

  • Sides: Human-curated context (engineer controls what enters) vs. fully autonomous context curation by the agent itself
  • Core disagreement: Autonomy may improve relevance but reduces interpretability and controllability
  • Current state: Trend is toward progressive autonomy as models improve; “do the simplest thing that works” remains Anthropic’s operational guidance.

Historical Timeline

Year | Event/Publication | Significance
2017 | “Attention is All You Need” (Vaswani et al.) | Transformer architecture established the n² attention mechanism that creates context rot
2023 | “Lost in the Middle” (Liu et al., arXiv) | Documented that LLMs systematically fail to attend to information in the middle of long contexts
2023 | ReAct and chain-of-thought prompting popularised | Established iterative reasoning-action loops as a reliability technique
2024 | Context windows scaled to 100k+ tokens across major models | Created assumption that “more is always better” — later contested by context rot research
Jun 2025 | Anthropic publishes multi-agent research system post | First detailed engineering account of multi-agent + multi-pass for production research tasks
Jul 2025 | Chroma publishes “Context Rot” report (18 LLMs) | Empirical proof that performance degrades nonlinearly with context length across all major models
Sep 2025 | Anthropic publishes “Effective Context Engineering for AI Agents” | Established “context engineering” as the canonical successor term to prompt engineering
Mar 2026 | npj Health Systems publishes orchestrated multi-agent clinical study | First peer-reviewed, scaled trial showing multi-agent preserves accuracy where single-agent collapses

Potential Article Angles

Based on this research, an article could:

  1. “The Attention Budget” — Context as Finite Infrastructure for AI Research Agents — Argue that treating AI agent context as a finite, curated resource (not a dump of everything relevant) is the defining engineering discipline of the agentic era. Aligns strongly with Context as Infrastructure tenet. Could introduce “context rot” as the core failure mode and multi-pass architecture as the solution.

  2. “Multi-Pass Reasoning as Epistemic Humility” — Frame multi-pass processing not just as a performance technique but as a structural embodiment of epistemic humility: agents treat initial outputs as provisional drafts, not conclusions. Connects to Symbiotic Intelligence and Human Intent First tenets. The human remains the final integrator; agents compress and surface, not conclude.

  3. “The Parallelism Paradox: Why More Agents Are Sometimes Safer Than One” — Counter-intuitive insight: running the same task through isolated sub-agents with separate context windows produces more reliable outputs than a single highly capable agent with a full view. This mirrors how diverse expert panels outperform lone experts. Strong connection to Pluralism of Perspectives tenet.

When writing, follow obsidian/project/writing-style.md for:

  • Named-anchor summary technique for forward references
  • Tenet alignment requirements
  • LLM optimization (front-load important information)

Gaps in Research

  • Philosophical grounding: No philosophical literature directly addresses multi-pass processing as an epistemic pattern. Potential connections to pragmatist epistemology (Dewey’s inquiry cycle), Popperian falsificationism (iterative conjecture/refutation), or hermeneutic spirals — not yet found in existing sources.
  • Human-centred reliability: Most research measures accuracy, token count, latency. Little evidence on how these architectures affect the human researcher’s understanding, judgment calibration, or trust calibration.
  • Domain specificity: Clinical and coding domains are well-studied; design and strategic research tasks (qualitative, ambiguous, creative) are understudied.
  • Long-term reliability: Studies focus on single-session performance. Context management across multi-session research projects (days/weeks) is an open empirical question.
  • Automation bias interaction: Multi-pass architectures may increase automation bias if humans trust “refined” outputs more than single-pass outputs regardless of actual quality improvement.

Citations

  1. Rajasekaran, P., Dixon, E., Ryan, C., & Hadfield, J. (2025, September 29). Effective context engineering for AI agents. Anthropic Engineering. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents

  2. Hadfield, J., Zhang, B., Lien, K., Scholz, F., Fox, J., & Ford, D. (2025, June 13). How we built our multi-agent research system. Anthropic Engineering. https://www.anthropic.com/engineering/multi-agent-research-system

  3. Klang, E., Omar, M., Raut, G., Agbareia, R., Timsina, P., Freeman, R., Gavin, N., Stump, L., Charney, A. W., Glicksberg, B. S., & Nadkarni, G. N. (2026). Orchestrated multi agents sustain accuracy under clinical-scale workloads compared to a single agent. npj Health Systems, 3, 23. https://doi.org/10.1038/s44401-026-00077-0

  4. Hong, K., Troynikov, A., & Huber, J. (2025, July 14). Context rot: How increasing input tokens impacts LLM performance. Chroma Research. https://research.trychroma.com/context-rot

  5. Liu, N. F., et al. (2023). Lost in the middle: How language models use long contexts. arXiv. https://doi.org/10.48550/arXiv.2307.03172

  6. Kudra. (2026, February 27). 8 AI agent concepts every AI developer needs in 2026. https://kudra.ai/8-ai-agent-concepts-every-ai-developer-needs-in-2026-visually-explained/