Whenever you hear "we just put it in the context," ask: what is the eviction policy?
The context window is the most legible part of the agent's information environment. You can see what is in it. You can add things to it. The model reads it reliably. It is tempting to use it for everything, and from 2020 through 2024, when context windows were 4,000 to 32,000 tokens, the discipline of not using it for everything was enforced by scarcity. You could not stuff everything in even if you wanted to.
Then context windows reached one million tokens. The scarcity pressure lifted. And teams that had built reasonably disciplined memory architectures under constraint began to backslide. Why maintain a semantic store when you can just put everything in the context? The answer is four-fold: eviction, cost, relevance degradation, and session boundaries.
The picture
Two systems side by side. Left system: a million-token context window, filled with chat history, document fragments, session notes, user facts, and team conventions. Grows every turn. At turn 1, the agent has everything. At turn 400, the window is full and the platform's eviction policy — usually "drop the oldest" — starts silently removing information. The agent continues answering. Some answers start to be wrong. The user does not know why.
Right system: a bounded working memory (current turn plus relevant retrieved context), backed by a four-regime memory layer. The context stays small and fresh. The memory layer persists. When the agent needs something from last week, it retrieves it explicitly. The agent knows what it knows and when it retrieved it.
Both are useful. They are not interchangeable. The left system is appropriate for a task that begins and ends in one session and does not require the agent to know anything about the user's history. The right system is appropriate for a product that people use repeatedly over time, where part of the value proposition is that the system gets better at helping you.
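To make the left system's failure concrete, here is a minimal sketch of a "drop the oldest" window. It is a toy, not any platform's real eviction logic: the class name `DropOldestContext`, the word-count tokenizer, and the 50-token budget are all assumptions chosen for illustration.

```python
from collections import deque


class DropOldestContext:
    """Toy context window that silently evicts the oldest entries (hypothetical)."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.entries = deque()  # (text, token_count) pairs, oldest first
        self.used = 0
        self.evicted = []  # kept only so the sketch can show what was lost

    @staticmethod
    def count_tokens(text: str) -> int:
        # Crude stand-in for a tokenizer: one token per whitespace-separated word.
        return len(text.split())

    def add(self, text: str) -> None:
        n = self.count_tokens(text)
        self.entries.append((text, n))
        self.used += n
        # Evict from the front until the window fits again. Nothing is raised
        # or logged to the caller, which is exactly the problem.
        while self.used > self.max_tokens and len(self.entries) > 1:
            old_text, old_n = self.entries.popleft()
            self.used -= old_n
            self.evicted.append(old_text)

    def contains(self, needle: str) -> bool:
        return any(needle in text for text, _ in self.entries)


ctx = DropOldestContext(max_tokens=50)
ctx.add("user fact: the deploy target is eu-west")  # told once, at turn 1
for turn in range(2, 40):
    ctx.add(f"turn {turn}: some routine chat message here")

fact_still_present = ctx.contains("eu-west")
print(fact_still_present)
print(ctx.evicted[0])
```

The turn-1 fact ends up in `evicted` while the window keeps accepting new turns without complaint. A real platform would not even expose the `evicted` list, so the only visible symptom is a wrong answer later.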
Why it matters now
The arrival of one-million-token context windows in 2024 and 2025 produced a visible pattern in how teams built agents: they deferred memory architecture, reasoning that the context window was large enough to handle whatever came up. This worked on demos and early prototypes. It failed at scale for predictable reasons.
Cost is the first signal. Context is not free. A system that resends accumulated history on every call burns tokens at a rate proportional to how much history it carries. A user who has been with the product for six months carries six months of history, so the cost per call for that user is a multiple of the cost for a new user.
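A rough back-of-the-envelope sketch shows the shape of the cost curve. Every number here is an assumption picked for illustration (200 tokens per turn, 1,000 retrieved tokens per call, 1,000 turns), and the model ignores prompt caching and output tokens, both of which change the real bill.

```python
# All constants are illustrative assumptions, not measurements.
TOKENS_PER_TURN = 200      # average size of one turn of history
RETRIEVED_TOKENS = 1_000   # retrieved memory added to each bounded call
TURNS = 1_000


def stuffed_input_tokens(turn: int) -> int:
    """Input tokens on call number `turn` when all history so far is resent."""
    return turn * TOKENS_PER_TURN


def bounded_input_tokens(turn: int) -> int:
    """Input tokens when only the current turn plus retrieved memory is sent."""
    return TOKENS_PER_TURN + RETRIEVED_TOKENS


total_stuffed = sum(stuffed_input_tokens(t) for t in range(1, TURNS + 1))
total_bounded = sum(bounded_input_tokens(t) for t in range(1, TURNS + 1))

print(f"call 1:     stuffed={stuffed_input_tokens(1):>9,}  bounded={bounded_input_tokens(1):>9,}")
print(f"call 1000:  stuffed={stuffed_input_tokens(1000):>9,}  bounded={bounded_input_tokens(1000):>9,}")
print(f"cumulative: stuffed={total_stuffed:,}  bounded={total_bounded:,}")
```

Under these assumptions the stuffed design is cheaper for the first few calls, since it pays no retrieval overhead. But its per-call cost grows linearly with history, so its cumulative cost grows quadratically: 100,100,000 input tokens over 1,000 turns against 1,200,000 for the bounded design.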
The "Lost in the Middle" finding (Liu et al., cited in Course 1 Lesson 2) is the second signal. Attention on long contexts is not uniform. Facts in the middle of a large context window receive less reliable attention than facts near the edges. A system that stuffs everything into the context does not guarantee the agent will attend to all of it equally. Relevant facts can be in the context and still be effectively ignored.
Session restart is the third signal. If the value proposition includes "the system remembers you," then the agent's memory must survive a session restart. Context windows do not. An agent that "remembers" only within a session is a context feature wearing a memory costume.
A source you should trust
The Anthropic long-context cookbook is explicit about what context windows can and cannot substitute for. The relevant section covers the difference between "information is in the context" and "the model will reliably attend to the information in the context." These are not the same guarantee.
"Lost in the Middle" (Liu et al., 2023) provides the empirical grounding. When context windows are long, model attention degrades on information positioned away from the beginning and end. The practical implication: do not assume that placing something in the context is equivalent to making it reliably available to the model.
A recipe
Three diagnostic questions. Set the code aside and answer them about the system you are designing:
- Close and reopen the session. Does the agent still know the thing the user told it in session one? If the answer is "no" and the product's value proposition suggests "yes," you have context masquerading as memory.
- Switch to a new conversation thread. Does knowledge carry over? If losing that information at the thread boundary would surprise the user, the design needs a memory layer.
- Run for one thousand turns. Is early-session knowledge still load-bearing at turn 1,000? If yes, and if that knowledge is only in the context, your system will silently degrade for long-term users.
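The first question can also be turned into an automated check. The sketch below is one possible shape for it, with hypothetical names throughout: `ContextOnlyAgent` and `MemoryBackedAgent` are stand-ins for a real agent, a new object stands in for a new session, and a JSON file stands in for whatever store a real memory layer would use.

```python
import json
import tempfile
from pathlib import Path


class ContextOnlyAgent:
    """Hypothetical agent whose only state is the in-session context."""

    def __init__(self):
        self.context = {}

    def tell(self, key: str, value: str) -> None:
        self.context[key] = value

    def recall(self, key: str):
        return self.context.get(key)


class MemoryBackedAgent:
    """Hypothetical agent that also writes facts to a store outside the session."""

    def __init__(self, store_path: Path):
        self.store_path = store_path
        self.context = {}

    def _load(self) -> dict:
        if self.store_path.exists():
            return json.loads(self.store_path.read_text())
        return {}

    def tell(self, key: str, value: str) -> None:
        self.context[key] = value
        facts = self._load()
        facts[key] = value
        self.store_path.write_text(json.dumps(facts))

    def recall(self, key: str):
        if key in self.context:
            return self.context[key]
        return self._load().get(key)  # explicit retrieval from the store


def survives_restart(make_agent) -> bool:
    """Tell a fact in session one, then ask a fresh session for it."""
    first = make_agent()
    first.tell("preferred_language", "Rust")
    second = make_agent()  # a new object models closing and reopening the session
    return second.recall("preferred_language") == "Rust"


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "memory.json"
    context_only_ok = survives_restart(ContextOnlyAgent)
    memory_backed_ok = survives_restart(lambda: MemoryBackedAgent(store))

print(context_only_ok, memory_backed_ok)
```

The context-only agent fails the check and the store-backed one passes. Whether a failure is a bug depends on the product: if nothing in the value proposition promises persistence, a failing check is the expected result.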
The smell of it going wrong
- The system prompt has grown to 30,000 tokens of "everything we ever told the agent." This is context substituting for memory, and it means every single call pays for knowledge the current turn may not need.
- Switching sessions loses information the user expected to persist. This is the trust failure: the user believes the system knows them, the system does not.
- Cost-per-turn scales linearly with how long the user has been with the product. This is the business problem: the users who trust you most are the most expensive to serve.
- The agent answers correctly about a fact at turn 5 and incorrectly about the same fact at turn 50. The information was evicted — or was present but received insufficient attention — and the team blames the model.
- Memory design was deferred because "the context window is big enough for now." The deferred work becomes a rewrite after the product ships.
A judgment call from real work
PL's auto-memory was deliberately designed to separate "what fits in this turn's context" from "what persists across sessions." The design had two specific constraints.
First, MEMORY.md (the always-loaded index) was capped at around 200 lines. Not because 200 is a magic number, but because every byte in MEMORY.md is paid for on every call, by every session, forever. An uncapped index becomes a context tax. The rule is: MEMORY.md contains only what is relevant to most future sessions. Topic-narrow references stay in their own files and are loaded only when retrieved.
Second, individual memory files are not auto-loaded. They are retrieved by the system when the session topic suggests they are relevant. This means a memory about a specific project deployment does not tax a session about content authoring. The eviction policy question is answered by design: memory files persist indefinitely, but the context only pays for what it needs.
The early version of this system did not have these constraints. The MEMORY.md file grew without discipline, and by session 50 it was costing meaningful context on every call for memories that had not been relevant in weeks. The discipline emerged from watching the cost and applying the eviction-policy question to the design retrospectively.
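The two constraints can be sketched in a few lines. This is not PL's implementation, which the lesson does not show. It is an illustration under stated assumptions: the 200-line cap comes from the text, while the class names, the hard error on overflow, and keyword overlap as the relevance test are hypothetical stand-ins.

```python
from dataclasses import dataclass, field

MAX_INDEX_LINES = 200  # the text says "around 200 lines"; a hard cap is an assumption


@dataclass
class MemoryFile:
    name: str
    keywords: set  # hypothetical relevance signal
    body: str


@dataclass
class MemoryLayer:
    index_lines: list = field(default_factory=list)  # always loaded, like MEMORY.md
    files: list = field(default_factory=list)        # loaded only when relevant

    def add_index_line(self, line: str) -> None:
        # Constraint one: the always-loaded index has a budget.
        if len(self.index_lines) >= MAX_INDEX_LINES:
            raise ValueError("index is full; move topic-narrow notes to their own file")
        self.index_lines.append(line)

    def build_context(self, session_topic: str) -> list:
        # Constraint two: individual files are not auto-loaded. Keyword overlap
        # stands in for whatever relevance test a real system would use.
        topic_words = set(session_topic.lower().split())
        context = list(self.index_lines)
        for memory_file in self.files:
            if memory_file.keywords & topic_words:
                context.append(memory_file.body)
        return context


layer = MemoryLayer()
layer.add_index_line("User prefers concise answers.")
layer.files.append(
    MemoryFile("deploy.md", {"deploy", "deployment"}, "Deploys go out from the release branch.")
)
layer.files.append(
    MemoryFile("style.md", {"content", "authoring"}, "Headings use sentence case.")
)

authoring_context = layer.build_context("content authoring for the blog")
print(authoring_context)
```

In this sample data, the deployment note stays on disk during the authoring session and costs nothing, while the index line is paid for on every call. That asymmetry is the reason the index is the part that needs a cap.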
Rules from this lesson
- Context is a desk; memory is a notebook — designate which one each fact lives in before you build the system.
- If knowledge has to survive a session restart, it is memory; if not, context is fine and cheaper.
- Large context windows do not make memory design optional; they make bad memory design less visible until it becomes a cost, reliability, or trust problem.