All notes

Memory tiers, not bigger models, cut your agent token bill

Memory tiers, not bigger models, cut your agent token bill

An AI agent has no memory of its own. Every time your chatbot replies, the model behind it has forgotten the entire conversation and is reading it again from scratch. Everything an agent "remembers" is something an engineer decided to stuff back into the prompt, and that decision is what you pay for at the end of the month.

I watch most teams reach for a bigger or cheaper model the moment the bill hurts. In my read, that is the wrong lever. The dominant cost driver is which tier of memory you wrap around the model: in-prompt working memory, recent-turn episodic memory, vector-retrieved semantic memory, or cold archival memory. Pick the wrong tier and you can pay roughly twenty times more in tokens for the same answer quality, per the Azure Cosmos DB Conf 2026 session recap. The model is a rounding error next to that.

Key takeaways

  • Token cost is a memory-architecture problem in a model-selection costume; the tier choice dominates, the model choice trims.
  • Four tiers have stabilised in 2026: working, episodic, semantic, archival; map your use case to one before benchmarking models.
  • Sliding-window memory wins below thirty turns; entity-graph wins when every fact must survive; hierarchical bridges the middle band.
  • Long context windows did not abolish tiering; measured context rot keeps semantic retrieval cheaper than stuffing one million tokens.
  • For a fifty-seat Mittelstand firm the cheapest tier upgrade is a cache header plus a session TTL, not a vector database migration.

In this article

Why memory, not model, sets your token bill

The model is stateless. You are paying for memory. Every API call sends a fresh blob of text to the model and gets a fresh blob back. There is no hidden cache of "what we said yesterday" inside the model weights. If your agent appears to remember a customer's name, an engineer decided that name belongs in the prompt for this turn, and your invoice line for that turn reflects that decision.

That detail is invisible while you are prototyping. You feed the agent ten or fifteen test turns, everything answers correctly, and the dashboard says you are spending pennies per session. The trap is that the cost curve and the recall curve both behave nicely below thirty turns and then diverge sharply once real users start dropping facts on Monday and asking about them on Friday. The Azure Cosmos DB Conf demo measured exactly that gap, and the team that ran it noticed only because they pushed past the usual proof-of-concept turn count.

The other half of the trap, and the one I see repeated most often in client kickoffs, is the assumption that a frontier model with a one-million-token window solves the problem by absorbing the conversation. It does not. The Chroma Context Rot study showed monotonic recall degradation across all eighteen frontier models tested as input length grew. Long context buys you headroom, not memory. Headroom you pay per token to refill on every call.

Here is my thought: token cost is an architecture decision masquerading as a procurement one. You can keep procurement, you can keep your favourite model, and you can still cut the bill by an order of magnitude if you put the right tier in front of the right use case.

The four tiers and what each one actually costs

Four tiers have stabilised across the agent-memory market in 2026, and naming them the same way internally is the first move that pays off. The vocabulary is not mine. It is the same shape Letta exposes as core, recall, and archival, the same shape LangGraph calls short-term plus long-term store, and the same shape Mem0 and Zep ship as managed services.

  • Working memory. The current prompt. Tokens you pay for on every call, no exceptions. Bounded by the model's context window.
  • Episodic memory. Recent turns held verbatim or lightly compressed. Cheap to read, lossy to compress, sufficient for short chats.
  • Semantic memory. Consolidated facts stored as entities with embeddings and key-value pairs, retrieved by vector or hybrid search. Higher fixed cost, near-perfect recall on the facts you stored.
  • Archival memory. Cold storage with on-demand pulls. Almost free at rest. Slow to recall, but the right home for compliance logs and last quarter's tickets.

Prompt caching is the fifth tier and the cheapest one to add this week. Per the Anthropic prompt caching documentation, a cache write costs 1.25 times base input for a five-minute window and a cache read costs 0.1 times. One cache hit on a five-minute cache pays for the write. You almost never have to argue for that ROI.

The signal is not which tier sounds best. It is which tier matches your turn count, your fact density, and your tolerance for the wrong answer. Monday move: write a one-line tier label next to every agent use case in your roadmap. If you cannot label it, you cannot price it.

💡 Treat memory tier like database normalisation. Put each fact in exactly one tier that matches its access pattern, then stop arguing about which model is "smartest" for the bot.

Replaying the Azure Cosmos DB Conf demo

The most concrete numbers I have seen this quarter come from replaying the Microsoft Developer channel walkthrough of the Azure Cosmos DB Conf 2026 session on memory patterns. I verified publisher attribution via YouTube oEmbed for the underlying recording before quoting any of its numbers here. The session benchmarked three memory strategies on the same sixty-seed-message dataset with ten recall questions, five easy and five nuanced.

The reported numbers were 92 tokens per call at 0 percent recall with no memory at all, roughly 1,100 tokens per call at 60 percent overall recall with a sliding window, and roughly 1,660 tokens per call at 100 percent recall with an entity graph, with hierarchical memory landing between them at 80 percent overall. The relevant gap is not in the headline. The interesting collapse is in the nuanced subset: sliding window dropped to 20 percent on five questions whose answers lived more than thirty turns back. Hierarchical recovered to 60 percent. Entity graph held at 100 percent.

A pseudocode sketch of the three patterns in one process, with session id as partition key and embeddings co-located with the operational record:

def respond(turn_text: str, session_id: str, mode: str) -> str:
    base_prompt = system_prompt()

    if mode == "sliding_window":
        recent = store.last_n_turns(session_id, n=30)
        summary = store.summary_before(session_id, cutoff=30)
        context = summary + recent

    elif mode == "hierarchical":
        hot = store.last_n_turns(session_id, n=10)
        warm = store.weekly_digest(session_id)
        cold = store.archived_facts(session_id, k=5)
        context = cold + warm + hot

    elif mode == "entity_graph":
        entities = store.vector_search(
            query=turn_text,
            partition_key=session_id,
            k=8,
        )
        facts = store.facts_for(entities)
        edges = store.relationships(entities)
        context = render_graph(facts, edges)

    return llm.complete(base_prompt + context + turn_text)

The hard cost is fifty percent more tokens for entity graph than for sliding window, in exchange for the forty-point recall jump on the nuanced subset. The cost worth comparing is not "1,660 versus 1,100". It is "1,660 with the right answer versus 1,100 with the wrong one and a human cleaning up afterwards".

Monday move: pick the agent use case where a wrong answer costs the most, instrument exactly which turns its current memory strategy drops, and price the gap. The number you produce is the only one that lets you defend a tier upgrade to a CFO.

Where the thesis loses

The thesis loses when the model is genuinely the bottleneck. If the agent is answering legal questions and you are running a 2024-class model with a known reasoning ceiling on contracts, no memory tier rescues you; the failures are at the inference step, not the retrieval step. Swap the model first, then come back to tiering.

From the small builds I have priced, it also loses when the use case is so small that the infrastructure cost of a tiered memory exceeds the token savings. A twelve-seat firm running a single internal FAQ bot with fifty queries a day will not break even on a vector database and a separate operational store. For that footprint, prompt caching plus a flat episodic buffer is the correct answer, and any tiering beyond that is over-engineering.

A third failure mode is the long-context maximalist position, summarised in the widely circulated "RAG is dead, long context won" essay (linked in Resources). The honest version of that argument is that for tasks with a single long document and one question over it, modern long-context models often beat naive RAG. The dishonest version is that this generalises to multi-turn agents with thousands of conversations. The Chroma context-rot evidence is what I keep returning to here, and it is also the argument my sibling post on structured retrieval for enterprise agents makes from the retrieval-quality angle. Treat that post as the cousin to this one: same enemy, different cost lens.

Finally, single-vendor recall numbers should be treated as marketing until reproduced. The LOCOMO benchmark dispute is the canonical case, where Zep originally claimed 84 percent, Mem0 corrected to 58.44 percent, and Zep counter-claimed 75.14 percent on the same benchmark, per the Mem0 State of AI Agent Memory 2026 report (linked in Resources). My thesis still holds. The tier still dominates the cost. But whose tier scores best on which benchmark is a question you should answer with your own evals, not someone else's slide deck.

A Mittelstand scenario, Monday morning

Picture Sabine, the operations lead at a sixty-person logistics firm in Westfalen. The company runs a customer-service chatbot that answers shipment status questions, reads from an order database, and handles around four thousand sessions a month. The current stack is a Claude Haiku call wired straight to the chat UI, with full conversation history shoved into the prompt every turn. The monthly token spend just crossed three thousand euros, and her CFO is asking why.

Sabine does not need MemGPT. She needs three tier changes she can ship inside two weeks.

  • Step one, prompt caching on the system prompt and product catalogue. A five-minute cache write at 1.25 times input, cache reads at 0.1 times, and the catalogue stops being repriced on every turn. Monday: add the cache control header to the first message in the conversation, set the TTL to five minutes, and confirm via the response usage block that cache reads are firing.
  • Step two, an episodic buffer with a session TTL. Hold the last thirty turns verbatim per session in Postgres, summarise everything older, and drop sessions older than thirty days. Stack: Postgres plus pgvector, both already in the existing application database. Monday: write the migration that adds a chat_sessions table partitioned by tenant_id and a chat_turns child table with a TTL trigger.
  • Step three, a semantic tier only for the long-tail FAQ. Embed the FAQ documents once into the same Postgres database via pgvector, retrieve top three matches per turn, and inject them as additional context only when the episodic buffer misses. Monday: a hundred-line ingestion script and a feature flag that lets you A/B the tier against the baseline.

My working target for Sabine is a fifty percent reduction in monthly token spend on the same recall floor, measured weekly on a fixed eval set of thirty real customer questions. That is an authorial projection sized off the twenty-times spread cited above, not a vendor benchmark; I expect the real number to land between thirty and seventy depending on FAQ overlap. No model swap. No new vendor. The bill comes down because each fact is finally living in the tier whose price matches its access pattern.

The reason this works at sixty seats and would not work at six is that Sabine already has Postgres, already has a developer on staff who knows how to write a migration, and already pays for the tokens the cache header will eliminate. The tier upgrade is cheaper than the line items it removes.

How to pick the tier without a six-month spike

Tier selection is a one-page decision, not a research project. I ship this matrix to clients and the conversation takes thirty minutes:

| Use case shape | Default tier | Upgrade trigger | |---|---|---| | FAQ bot, under thirty turns per session | Sliding window + prompt caching | Recall under 90 percent on nuanced eval | | Planning assistant, thirty to one hundred turns | Hierarchical | Compliance need for verbatim recall | | CRM or financial agent, every fact matters | Entity graph or semantic store | Hybrid only when sliding window covers chitchat | | Compliance log read-back | Archival with on-demand pull | Read frequency exceeds once per session per record |

Three rules go with the matrix. First, instrument the recall gap before you propose the upgrade; pick five nuanced questions that live past turn thirty in your real data and report the miss rate. Second, never benchmark fewer than sixty seed messages, because every tier looks perfect under thirty. Third, keep sliding window as the default for ninety percent of routine traffic even after you adopt entity graph, and route only premium queries to the expensive tier; the hybrid pattern is what makes the cost math work.

Monday move: take your top three production agents, drop them into the matrix, and write the tier-mismatch line item next to each. The agent currently using working-only memory for a fact that lives past turn thirty is the one quietly costing you the most, and it is the one you will fix this week.

Open questions

I am still watching three threads on this. First, the toolkits Microsoft previewed in June 2026 around agent memory and agentic retrieval reduce the cost of running an entity graph at small scale; whether they push the break-even down to twenty seats or stay at the F500 floor is something I expect to update on after a few real builds. Second, the prompt-caching tier in particular is moving fast on workspace isolation and TTL semantics, so the cache-as-tier math could shift twice this year. Third, the contested LOCOMO numbers across vendors still bother me; I want to see independent reproduction with shared seeds before I trust any vendor's recall figure for an architectural decision.

If the next twelve months prove anything stable about agent memory, it is that the tiering vocabulary is here to stay. The cost numbers will move; the principle that tier dominates model will not.

Compare notes. If you are sitting on an agent whose token bill keeps drifting and your reflex is to ask which model to swap, I would push back on that and ask which tier you are actually paying for. I run this exercise with Mittelstand teams every few weeks, usually inside a one-hour working session that ends with a single page of tier labels. If you want to compare notes on what your current spend curve looks like or argue the framing back at me, the contact page is the right door. I am especially curious about cases where the tier matrix above broke down on you, because those are the cases that move the doctrine forward.

Resources