The context window as working memory — The Agent Loop — what's actually running when an agent "works for hours"

Stop treating the context window as "the prompt."

The prompt is a single input to a single completion. The context window is a living, turn-by-turn accumulation that the agent reads fresh on every loop iteration. What lands in that window, in what order, and what gets evicted when it fills up — those decisions determine whether the agent at turn thirty still knows what the agent at turn one was told to do.

Most agent drift failures trace back to this gap. The instruction was given. The model received it. Somewhere between turn three and turn twenty-five, it fell off the desk.

Think of the context window as a desk with a fixed surface area. At the start of a run you arrange it: system instruction in the top left, injected memory summaries just below, any retrieved evidence beside that, tool-call history accumulating in the middle, the current user message at the front. The arrangement matters because models do not attend uniformly across a long context. They attend more reliably to what is at the beginning and what is at the end. What sits in the middle of a very long context can effectively disappear.

The desk fills up as the run progresses. When it overflows, something gets evicted, and unless you have decided what that something is, the framework has decided for you. A common framework default is to evict the oldest tool-call results first. That is often the wrong call. The oldest tool-call result might be the scaffolding that the rest of the run depends on.
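One way to make the eviction decision explicit is to write it down as code. The sketch below is an illustration under stated assumptions, not any framework's actual behavior: `Item`, `count_tokens`, and `evict_to_fit` are hypothetical names, and the whitespace word count is a crude stand-in for a real tokenizer. It drops the oldest unpinned entries first and never drops a pinned one.

```python
from dataclasses import dataclass


@dataclass
class Item:
    band: str            # e.g. "system", "memory", "evidence", "tools", "message"
    text: str
    pinned: bool = False  # pinned items are never evicted


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    # Swap in your model's tokenizer for real budgets.
    return len(text.split())


def evict_to_fit(items: list[Item], max_tokens: int) -> list[Item]:
    """Drop the oldest unpinned items until the total fits max_tokens.

    `items` is assumed to be ordered oldest first. If the pinned items
    alone exceed the budget, they are all kept and the result still
    overflows; the caller has to handle that case.
    """
    kept = list(items)
    total = sum(count_tokens(item.text) for item in kept)
    idx = 0
    while total > max_tokens and idx < len(kept):
        if kept[idx].pinned:
            idx += 1  # skip over pinned items, keep scanning forward
            continue
        total -= count_tokens(kept[idx].text)
        del kept[idx]
    return kept
```

The design choice worth noticing is that pinning is a property of the item, not of the band: a single load-bearing tool result can be pinned while the rest of the tool-call history stays evictable.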

Naming the five bands on the desk is the first step toward deliberate management.

System instruction is the standing charter. It tells the model what it is, what it is trying to do, and what constraints bind it. It is typically the most stable band; it changes rarely or never across a run. Keep it concise. Every token here is purchased from the budget that the remaining bands share.

Injected memory is what the memory layer pumped in at the start of this turn. Summaries of prior sessions, persisted preferences, standing facts about the task. This band should be curated — a distillation, not a transcript dump.

Retrieved evidence is what a retrieval step fetched for this specific turn. In RAG-style pipelines (retrieval-augmented generation: the model fetches external context at runtime rather than relying solely on its training), this can be the largest band by token count. It is also the most volatile: it changes every turn.

Tool-call history is the accumulating record of what the agent has done. It grows with every turn. In long runs it becomes the biggest consumer on the desk. Most frameworks default to keeping it all; many production systems should be summarizing or compressing it.
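A minimal sketch of compressing tool-call history, assuming the history is a list of strings ordered oldest first. The function name `compress_history` and the default truncating summarizer are hypothetical; in practice the `summarize` callable would more likely be a model call or a structured digest. The most recent entries stay verbatim and everything older is folded into a single summary entry.

```python
from typing import Callable, Optional


def _naive_summary(older: list[str]) -> str:
    # Placeholder summarizer: keeps the first 40 characters of each entry.
    # An assumption for illustration only, not a recommended strategy.
    heads = "; ".join(entry[:40] for entry in older)
    return f"[summary of {len(older)} earlier tool calls] {heads}"


def compress_history(
    history: list[str],
    keep_recent: int = 3,
    summarize: Optional[Callable[[list[str]], str]] = None,
) -> list[str]:
    """Keep the last `keep_recent` entries verbatim, summarize the rest."""
    split = max(len(history) - max(keep_recent, 0), 0)
    older, recent = history[:split], history[split:]
    if not older:
        return list(history)  # nothing old enough to compress
    summarizer = summarize or _naive_summary
    return [summarizer(older)] + recent
```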

Current message is what arrived this turn. It is the smallest band and should stay that way.

Understanding the five bands is useful even if you never look at the raw token counts. It gives you a vocabulary for debugging. "The agent stopped applying the safety constraint after turn fifteen" stops being mysterious when you know that the constraint lived in the injected-memory band, the injected-memory band was oversized, and the eviction policy silently dropped it to make room for tool-call history. The fix is not a model upgrade. The fix is pinning the constraint and compressing the history.
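To make the vocabulary concrete, here is a hedged sketch of assembling a context from the five bands in a fixed order. The band keys, the `[band]` label format, and the `assemble` function are all assumptions made for illustration; real frameworks structure messages differently. The point is only that the order is chosen deliberately: the system band goes first and the current message goes last, the two positions the lesson describes as most reliably attended.

```python
# Hypothetical band names, in the order they are laid out on the desk.
BAND_ORDER = ["system", "memory", "evidence", "tools", "message"]


def assemble(bands: dict[str, list[str]]) -> str:
    """Render the bands as one text block in a fixed, deliberate order."""
    unknown = set(bands) - set(BAND_ORDER)
    if unknown:
        # Fail loudly: content in an unnamed band has no budget or policy.
        raise ValueError(f"unnamed bands: {sorted(unknown)}")
    lines = []
    for name in BAND_ORDER:
        for entry in bands.get(name, []):
            lines.append(f"[{name}] {entry}")
    return "\n".join(lines)
```

Under this layout, a constraint that must survive the whole run would be placed in the `system` list rather than in `memory`, so that it sits in a band that is never evicted.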

Why it matters now

By mid-2026, 200K-token windows are in standard use and one-million-token windows are shipping. The naïve takeaway is "just put everything in the context." That is wrong in three ways.

First, input cost scales roughly linearly with the number of tokens in the window, and you pay it on every invocation. A 200K context costs roughly 10× a 20K context per call. Across a long run with many turns, the arithmetic compounds quickly.
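The compounding can be worked through with a small calculation. The sketch below assumes a context that starts at a base size and grows by a fixed number of tokens each turn; the function name and the example figures (20K base, 2K growth per turn, 50 turns) are hypothetical, and no real pricing is implied.

```python
def total_input_tokens(base: int, growth_per_turn: int, turns: int) -> int:
    """Sum the input tokens sent across a run.

    Turn t (starting at 0) sends base + growth_per_turn * t tokens,
    because the whole window is re-read on every invocation.
    """
    return sum(base + growth_per_turn * t for t in range(turns))


flat = total_input_tokens(base=20_000, growth_per_turn=0, turns=50)
growing = total_input_tokens(base=20_000, growth_per_turn=2_000, turns=50)
print(flat)     # 1000000
print(growing)  # 3450000
```

With these assumed numbers the growing run sends about 3.45 times the input tokens of the flat run, even though each individual turn only adds 2K tokens.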

Second, attention degrades non-linearly in very long contexts. The middle of the window is the worst-attended region. Critical instructions buried at turn 40 of a hundred-turn run may as well be absent.

Third, larger windows let bad context-hygiene hide longer before it bites. An agent running on a 4K context shows drift at turn five. The same agent on a 200K context shows drift at turn fifty. You will not catch it in staging. You will catch it in production, at the wrong moment.

The window got bigger. The discipline got more important, not less.

Sources you should trust

"Lost in the Middle" (Liu et al., 2023). The empirical paper that made "just stuff everything in" measurably wrong. The finding: performance on retrieval tasks degrades when the relevant document is in the middle of the context, even when the model technically "sees" it. Required reading before any decision to expand context size.
Anthropic's long-context cookbook. Operator-grade examples of structuring multi-turn agent context for retention: how to layer bands, when to summarize tool-call history, how to pin high-priority instructions.
Anthropic's prompt caching documentation. Caching is a context-hygiene tool as well as a cost tool — understanding what is cacheable tells you which bands to stabilize and which to keep volatile.

A recipe

A context-budget checklist for any agent loop before it ships:

Name the five bands. Write them on a whiteboard or in a document; giving them names forces you to think about them explicitly.
Assign a token budget to each band based on the maximum context window and expected run length. Add them up. If they exceed the window at turn twenty, you have a problem you can solve now rather than at 3am.
Write down the eviction policy explicitly: when the desk overflows, what goes first? If your answer is "whatever the framework does," read the framework docs and write down what it does. Inherit deliberately, not by accident.
Log actual band sizes per turn in development. Most teams have never seen this graph. The tool-call history band in particular often looks fine for fifteen turns and then doubles in four.
Set a per-band alarm: when a single band exceeds its budget by more than 20%, that is where drift starts. An alert in development is free insurance against a production incident.
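The budgeting, logging, and alarm steps of the checklist above can be sketched in a few lines. Everything named here is an assumption for illustration: the budget figures, the 200K window, and the helper names `budgets_fit`, `log_band_sizes`, and `over_budget` are hypothetical, and the per-band sizes would come from your own token counting.

```python
import logging

log = logging.getLogger("context_bands")

# Hypothetical per-band token budgets for a hypothetical 200K window.
WINDOW = 200_000
BUDGETS = {
    "system": 2_000,
    "memory": 4_000,
    "evidence": 20_000,
    "tools": 30_000,
    "message": 2_000,
}


def budgets_fit(budgets: dict[str, int], window: int) -> bool:
    """Check that the band budgets add up to no more than the window."""
    return sum(budgets.values()) <= window


def over_budget(
    sizes: dict[str, int], budgets: dict[str, int], tolerance: float = 0.20
) -> list[str]:
    """Return the bands whose size exceeds budget by more than `tolerance`."""
    return [
        name
        for name, size in sizes.items()
        if size > budgets[name] * (1 + tolerance)
    ]


def log_band_sizes(turn: int, sizes: dict[str, int]) -> list[str]:
    """Log per-turn band sizes and warn on any band past its alarm line."""
    log.info("turn=%d sizes=%s", turn, sizes)
    alarms = over_budget(sizes, BUDGETS)
    for name in alarms:
        log.warning(
            "turn=%d band=%s size=%d budget=%d",
            turn, name, sizes[name], BUDGETS[name],
        )
    return alarms
```

Calling `log_band_sizes` once per turn in development produces the per-band series the checklist asks for; plotting it is left to whatever tooling the team already uses.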

The smell of it going wrong

The agent's behavior at turn twenty is meaningfully different from turn two in ways that are not explained by the task progressing. That is context drift until proven otherwise.
The same instruction is repeated in the system prompt and in every user message, turn after turn, because otherwise it reliably gets evicted. This is the symptom of an instruction that belongs in pinned memory, not in the volatile window.
Cost-per-turn is growing as the run progresses even though the user's messages are short. Tool-call history is accumulating uncompressed.
A retrieved document is demonstrably in the context but the agent acts as though it is not. The document landed in the middle of a long window. It lost the attention lottery.
The team knows the context window size but cannot name the eviction policy. They are flying blind on the most consequential per-turn decision the framework makes.

A judgment call from real work

The Ostronaut named-vector retrieval harness — the system underneath PL's AI-powered lesson and case recommendations — hit this failure mode during load testing of longer learning sessions.

The symptom was odd: lesson recommendations that were coherent at the start of a session degraded in quality around the 25-minute mark, reliably, regardless of which user triggered the session. The retrieval results were objectively relevant. The model was capable of using them. But quality dropped.

The root cause was the orphan-gap content. Markdown headings and introductory paragraphs from longer lessons were being chunked into their own vector records without the body that gave them meaning. In a short context, the orphan chunk would sit close to the substantive chunk that followed it — the model could infer the relationship. In a long context, where tool-call history had accumulated, the orphan chunk and its parent chunk were separated by enough tokens that the attention connection broke.

The fix was not "make the context bigger." The fix was deciding what was load-bearing. The orphan chunks needed to be fused with their parent paragraphs at index time, not patched at retrieval time. And the tool-call history band needed a summarization step at turn fifteen to keep the window from becoming a transcript.
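The index-time fusion can be illustrated with a short sketch. This is not the harness's actual code: `is_orphan` and `fuse_orphans` are hypothetical names, and the orphan test used here (a chunk whose non-empty lines are all Markdown headings) is a simplifying assumption that does not cover orphaned introductory paragraphs. Each heading-only chunk is attached to the next chunk that carries body text, before anything is embedded.

```python
def is_orphan(chunk: str) -> bool:
    """True if every non-empty line in the chunk is a Markdown heading."""
    lines = [line for line in chunk.splitlines() if line.strip()]
    return bool(lines) and all(line.lstrip().startswith("#") for line in lines)


def fuse_orphans(chunks: list[str]) -> list[str]:
    """Merge heading-only chunks into the body chunk that follows them."""
    fused: list[str] = []
    pending: list[str] = []  # headings waiting for a body
    for chunk in chunks:
        if is_orphan(chunk):
            pending.append(chunk)
            continue
        fused.append("\n\n".join(pending + [chunk]))
        pending = []
    if pending:
        # Trailing headings with no body: keep them rather than lose content.
        fused.append("\n\n".join(pending))
    return fused
```

Because the heading and its body now live in one vector record, they arrive in the context together and no longer depend on sitting close to each other in a long window.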

Context hygiene saved the session quality. Bigger context would have hidden the problem for another twenty turns and then caused the same failure with higher per-turn cost.

The generalizable principle: when session quality degrades at a predictable turn count, the diagnosis is almost always in the context bands, not the model weights. Check the bands first. They are inspectable. The model weights are not.

Rules from this lesson

The context window is a desk with bands, not a bag with a size limit.
Every band has a budget and an eviction policy; if you have not decided them, the framework has decided them for you.
Drift at long turn counts is usually a context-hygiene bug before it is a model bug; check the bands first.
Bigger context windows make discipline more important, not less; they just let bad hygiene hide longer.