AI agent memory is the system an agent uses to retain, retrieve, and reason over information across its operation. It maps to three tiers borrowed from cognitive science: working memory (the context window), short-term memory (session history), and long-term memory (persistent cross-session storage). Each tier has different capacity, speed, and failure modes. For coding agents, working memory is the bottleneck.
Three Types of AI Agent Memory
The field draws from Endel Tulving's 1972 taxonomy of human memory. A 2025 survey of agent memory systems identified three dominant forms: token-level (context window), parametric (model weights), and latent memory. For practical purposes, agent builders work with three tiers.
Working Memory
The context window. Everything the model can reason over at inference time: the system prompt, conversation history, retrieved documents, tool outputs. Capacity is finite and every token costs attention.
Short-Term Memory
The current session. Conversation turns, tool call results, intermediate reasoning, and accumulated state. Persists only until the session ends or context is compacted.
Long-Term Memory
Persistent storage across sessions. User preferences, project knowledge, learned patterns, past decisions. Requires external systems: databases, vector stores, or files like CLAUDE.md.
Recent research further subdivides long-term memory into episodic (specific past events and their outcomes), semantic (accumulated facts and user preferences), and procedural (learned workflows and decision logic). These map to different storage backends and retrieval strategies.
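These subdivisions are easy to make concrete. The sketch below tags memory records by type so each can route to a different retrieval strategy; the `MemoryRecord` shape and its field names are illustrative, not a standard schema.

```python
from dataclasses import dataclass, field
from typing import List, Literal

@dataclass
class MemoryRecord:
    # Illustrative record shape, not any framework's actual schema.
    kind: Literal["episodic", "semantic", "procedural"]
    content: str
    tags: List[str] = field(default_factory=list)

store = [
    MemoryRecord("episodic", "2024-06-01: shipped webhook fix; retryCount null bug"),
    MemoryRecord("semantic", "User prefers TypeScript with strict mode"),
    MemoryRecord("procedural", "Deploy: run tests, bump version, tag release"),
]

# Different kinds suggest different retrieval strategies:
# episodic -> time-ordered scan, semantic -> similarity search,
# procedural -> exact lookup by task name.
episodic = [m for m in store if m.kind == "episodic"]
```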
| Dimension | Working Memory | Short-Term Memory | Long-Term Memory |
|---|---|---|---|
| Analogy | CPU registers / RAM | Desktop workspace | Hard drive / database |
| Capacity | 128K-1M tokens | Unbounded (within session) | Unlimited |
| Speed | Instant (in-context) | Instant (in-context) | Retrieval required |
| Persistence | Single inference call | Single session | Across sessions |
| Failure mode | Context rot, attention dilution | Noise accumulation, compaction loss | Low retention (~37%), retrieval errors |
| Key challenge | Signal-to-noise ratio | Compression timing | What to store, how to retrieve |
Working Memory: The Context Window Is All You Get
LLMs are stateless functions. Their weights are frozen by the time they reach inference. The model does not learn from your conversation, does not remember your last session, and does not update its parameters based on your feedback. The only information it can reason over is what you put into the context window.
This makes the context window the agent's working memory: the bottleneck through which all reasoning must pass. Context engineering is the discipline of managing this scarce resource. Every token matters. Every irrelevant token degrades performance.
Attention budget is finite
Like humans who can hold roughly 7 items in working memory, LLMs have a finite "attention budget." Every new token depletes this budget. Chroma's research tested 18 frontier models and found that all of them degrade as input length increases, even on simple retrieval tasks. The degradation follows three mechanisms: the lost-in-the-middle effect, attention dilution at scale, and distractor interference.
The lost-in-the-middle effect (Liu et al., Stanford/TACL 2024) showed that LLM performance drops over 30% when relevant information sits in the middle of the context. Transformer attention follows a U-shaped curve: strong at the start and end, weak in the middle. For an agent that reads 8 files and finds relevant code in file #4, that code sits in the model's blind spot.
Attention scales quadratically. At 100K tokens, the model tracks 10 billion pairwise relationships. Adding more context does not just dilute relevance. It makes the model physically worse at attending to what matters.
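The quadratic growth is simple arithmetic: with n tokens in context, self-attention computes an n × n matrix of pairwise scores per head.

```python
def attention_pairs(n_tokens: int) -> int:
    # Each token attends to every token (including itself): n^2 score entries.
    return n_tokens * n_tokens

# At 100K tokens the score matrix has 10 billion entries per head:
assert attention_pairs(100_000) == 10_000_000_000
# Doubling the context quadruples the work:
assert attention_pairs(200_000) == 4 * attention_pairs(100_000)
```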
Short-Term Memory: Session State and Its Decay
Short-term memory is everything the agent accumulates during a single session: conversation turns, tool outputs, file contents, error traces, and its own reasoning. Unlike working memory (which is the window for a single inference call), short-term memory spans the full session and feeds into working memory at each step.
The problem is accumulation. A typical coding task generates thousands of tokens per step:
Token accumulation in a multi-step coding session

```text
Step 1: Read issue description                        500 tokens
Step 2: Search codebase, read 5 candidate files     8,000 tokens
Step 3: Read related tests, config files            6,000 tokens
Step 4: Backtrack, explore alternative approach     5,000 tokens
Step 5: Found correct file, ready to edit
----------
Total context: ~20,000 tokens (60%+ is noise)
```

The agent now has the right information buried in 20K tokens of search traces, dead ends, and irrelevant file contents. Most of this hurts performance. It does not help.

Every production agent handles this differently. Claude Code triggers auto-compaction when context approaches the window limit, summarizing history into a structured format. OpenAI recommends compaction as a "default long-run primitive," not an emergency fallback. The question is not whether to compress short-term memory, but when and how.
Compaction is lossy
When Claude Code compacts, it produces documentation-style summaries that capture the gist of what happened but lose specific events, decisions, and exact code references. Users report losing early conversation detail after compaction. This is why compaction vs. summarization matters: the compression method determines what survives.
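A minimal sketch of a threshold-triggered compaction step, assuming a fixed window limit and a keep-recent-turns policy. Real agents use their own triggers and compression models; every name and number here is hypothetical.

```python
from typing import List

def maybe_compact(history: List[str], token_counts: List[int],
                  window_limit: int = 200_000,
                  threshold: float = 0.8) -> List[str]:
    """Compact session history once it nears the window limit.

    Hypothetical sketch: keep the most recent turns verbatim and collapse
    older ones into a single placeholder summary line.
    """
    total = sum(token_counts)
    if total < threshold * window_limit:
        return history  # plenty of headroom, do nothing
    keep = 3  # last few turns survive verbatim
    summary = f"[summary of {len(history) - keep} earlier turns]"
    return [summary] + history[-keep:]
```

Whatever produces the summary line is where lossiness enters: everything before the kept turns survives only as whatever the compression step wrote down.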
Long-Term Memory: The Cross-Session Problem
When a session ends, the context window clears. The agent starts fresh. Unless you have built an external memory system, every decision, every discovery, and every learned preference is gone.
Cross-session memory retention is the hardest unsolved problem in agent memory. The best approaches achieve only around 37% retention across sessions according to compression benchmarks. Mem0's research on the LOCOMO benchmark showed that retrieval-augmented memory achieved 26% higher accuracy than OpenAI's native memory (66.9% vs. 52.9%). A Letta agent reached 74% on the same benchmark with GPT-4o mini. These numbers are improving, but they are far from solved.
| Approach | How It Works | Tradeoffs |
|---|---|---|
| Config files (CLAUDE.md) | Always-loaded text files with project instructions | Manual maintenance; limited to what you write down |
| Vector stores / RAG | Embed past interactions, retrieve by similarity | Math ceiling at ~500K docs; code structure is hard to embed |
| Structured databases | Store facts, preferences, decisions in relational/KV stores | Requires schema design; retrieval queries add latency |
| Auto-memory (Claude) | Agent writes notes to MEMORY.md during sessions | First 200 lines loaded per session; can drift or bloat |
| MCP memory servers | SQLite-backed tools the agent reads/writes at runtime | Flexible but requires integration; no standard protocol yet |
Coding agents have converged on a practical pattern: files as memory. Claude Code uses CLAUDE.md for project instructions and MEMORY.md for auto-discovered patterns. This is simple, inspectable, and version-controlled. The first 200 lines of MEMORY.md load into every session automatically.
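The files-as-memory pattern is a few lines of code. The sketch below loads only a bounded prefix of a memory file, mirroring the 200-line limit described above; the function name and defaults are illustrative.

```python
from pathlib import Path

def load_memory_prefix(path: str = "MEMORY.md", max_lines: int = 200) -> str:
    """Load only the first max_lines of a memory file into the prompt.

    Sketch of the files-as-memory pattern: the always-loaded prefix stays
    bounded so it cannot crowd out working memory.
    """
    p = Path(path)
    if not p.exists():
        return ""
    lines = p.read_text().splitlines()
    return "\n".join(lines[:max_lines])
```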
The deeper research direction is memory management, not just memory storage. A-Mem (Agentic Memory) treats memory as a living system that merges related memories, marks outdated ones as invalid, and resolves contradictions. This mirrors how human memory consolidates and forgets. Agents need to forget strategically, not just accumulate.
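One way to sketch that kind of management, assuming a toy record shape (not A-Mem's actual schema): newer facts about the same key invalidate older ones rather than piling up as contradictions.

```python
from typing import Dict, List

def consolidate(memories: List[dict]) -> List[dict]:
    """Toy A-Mem-style management: the latest fact per key wins.

    Records are {'key': ..., 'value': ..., 'ts': ...}; field names are
    invented for illustration.
    """
    latest: Dict[str, dict] = {}
    for m in sorted(memories, key=lambda m: m["ts"]):
        latest[m["key"]] = m  # later timestamp overwrites the older entry
    return list(latest.values())

notes = [
    {"key": "test_runner", "value": "jest", "ts": 1},
    {"key": "test_runner", "value": "vitest", "ts": 2},  # project migrated
    {"key": "pkg_manager", "value": "pnpm", "ts": 1},
]
current = consolidate(notes)
```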
How Coding Agents Manage Memory in Practice
Each major coding agent handles the memory problem differently. The differences are not theoretical. They directly determine session length, accuracy over time, and token cost.
| Feature | Claude Code | OpenAI Codex | Cursor | Devin |
|---|---|---|---|---|
| Working memory | 200K context window | 200K context window | 120K context window | 200K context window |
| Persistent memory | CLAUDE.md + MEMORY.md | None (stateless) | Cursor rules files | Task lists, to-do files |
| Auto-compaction | Yes (at capacity) | Yes (/compact endpoint) | No | Partial (premature) |
| Context isolation | Subagent Task tool | Sandboxed execution | Background indexing | Parallel sandboxes |
| Degradation onset | Gradual (compaction helps) | Gradual | 20-30 exchanges | ~2.5 hours |
| Token efficiency | 5.5x fewer than Cursor | Baseline | High token usage | Variable |
Claude Code uses 5.5x fewer tokens than Cursor for equivalent coding tasks. That gap comes from better context management, not a better base model. Structured memory files, selective loading via .claudeignore, auto-compaction, and subagent isolation each contribute.
Devin takes a different approach for long-running tasks. It maintains a persistent to-do list and iterates over hours or days, using parallel sandboxes for isolation. But it exhibits "context anxiety" where the model prematurely summarizes to avoid hitting limits, losing detail before it needs to.
The memory hierarchy is real
Production coding agents now implement a recognizable memory hierarchy: always-loaded config files (L1 cache), on-demand file loading (L2), session history with compaction (L3), and external retrieval for rare queries (L4). The pattern mirrors CPU memory hierarchies from computer architecture: fast/small at the top, slow/large at the bottom.
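The hierarchy reduces to a cheapest-first fall-through lookup. A sketch, with hypothetical tier names and an external-retrieval callback standing in for L4:

```python
from typing import Callable, Dict, Optional

def lookup(query: str, l1: Dict[str, str], l2: Dict[str, str],
           l3: Dict[str, str],
           fetch_l4: Callable[[str], Optional[str]]) -> Optional[str]:
    """Fall through the memory tiers, cheapest first.

    l1: always-loaded config, l2: on-demand file contents,
    l3: compacted session history, fetch_l4: external retrieval
    (vector store, database). All names are illustrative.
    """
    for tier in (l1, l2, l3):
        if query in tier:
            return tier[query]
    return fetch_l4(query)  # slow path, only for rare queries
```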
Memory Architectures: MemGPT, Letta, and Virtual Context
MemGPT (Packer et al., 2023) introduced the most influential agent memory architecture. It treats the LLM as an operating system where the context window is "main memory" (RAM) and external storage is "disk." The agent uses function calls to page information in and out of context, just like an OS manages virtual memory.
MemGPT-style virtual context management

```python
# The agent's context window (main memory) is limited.
# External storage (disk) holds everything else.

# Agent decides what to keep in "RAM" (context window):
core_memory = {
    "persona": "I am a coding assistant working on project X",
    "user": "Prefers TypeScript, uses Next.js, strict on types",
    "current_task": "Fix the webhook retry logic in stripe.ts",
}

# When the agent needs old information, it "pages in" from disk:
agent.call_function("archival_search", query="previous webhook fixes")
# Returns relevant memories from the vector store into context

# When context gets full, the agent "pages out" to disk:
agent.call_function(
    "archival_insert",
    content="Discovered retryCount can be null for new customers",
)
# Saves to long-term storage, frees context window space

# The result: effectively unlimited memory through intelligent paging
```

MemGPT now lives as part of the Letta framework, which extends the pattern with memory blocks: dedicated modules for core memory, episodic memory, semantic memory, and procedural memory. Each module uses data structures suited to its content type.
Virtual Context Management
Inspired by OS virtual memory. The agent pages information between the context window (fast, limited) and external storage (slow, unlimited) using function calls. Enables working with information that far exceeds the context window.
Memory Blocks
Letta's extension of MemGPT. Dedicated modules for different memory types: core (persona + user facts), episodic (time-series events), semantic (abstract knowledge), and procedural (step-by-step workflows). Each block has its own update and retrieval logic.
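A rough sketch of the memory-block idea, assuming each block enforces its own budget before being rendered into the prompt. The field names and rendering format are invented for illustration, not Letta's actual API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MemoryBlock:
    # Illustrative block shape; not Letta's actual schema.
    label: str      # e.g. "core", "episodic", "semantic", "procedural"
    content: str
    limit: int      # per-block character budget

    def render(self) -> str:
        # Each block enforces its own budget before entering the prompt.
        body = self.content[: self.limit]
        return f"<{self.label}>\n{body}\n</{self.label}>"

blocks: List[MemoryBlock] = [
    MemoryBlock("core", "User prefers TypeScript; project uses Next.js", 200),
    MemoryBlock("procedural", "Deploy: test -> version bump -> tag", 200),
]
prompt_memory = "\n".join(b.render() for b in blocks)
```

Because each block has its own budget and update logic, one bloated memory type cannot crowd the others out of the context window.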
The key insight from MemGPT: the agent itself should manage its memory. Rather than relying on fixed rules (compact at 80% capacity, retrieve top-5 documents), the agent decides what to remember, what to forget, and when to retrieve. This agentic approach to memory is now a core research direction, with papers like Agentic Memory (2026) proposing unified frameworks for short-term and long-term memory learning.
Optimizing Working Memory with Compression
Long-term memory is a hard research problem. Working memory is an engineering problem you can solve today. The approach: remove noise tokens from the context window so the model's attention budget goes to high-signal information.
Three compression approaches have emerged, each with different tradeoffs:
| Method | Mechanism | Hallucination Risk | Best For |
|---|---|---|---|
| Structured summarization | LLM rewrites into organized sections | Medium (paraphrasing can alter details) | High-level progress tracking |
| Opaque compression | Model-internal compression (black box) | Medium (unverifiable) | API-level simplicity |
| Verbatim compaction | Deletes noise, keeps text word-for-word | Zero (no rewriting) | Code, errors, file paths |
For coding agents where exact file paths, error messages, and code snippets must survive compression, the distinction between summarization and compaction is critical. Summarization might compress src/api/webhooks/stripe.ts:98 into "the Stripe webhook handler," losing the exact reference the agent needs for its next edit.
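The distinction can be made concrete with a toy verbatim filter: delete lines judged low-signal, keep survivors byte-for-byte. The noise pattern below is invented for illustration and is not any production compaction model.

```python
import re

# Hypothetical low-signal prefixes for a coding agent's tool trace:
NOISE = re.compile(r"^(Searching|Reading|No matches|Listing)\b")

def verbatim_compact(context: str) -> str:
    """Toy verbatim compaction: delete low-signal lines, never rewrite.

    The invariant that matters: every surviving line is byte-identical
    to the original, so exact file:line references cannot be mangled.
    """
    kept = [ln for ln in context.splitlines() if ln and not NOISE.match(ln)]
    return "\n".join(kept)

trace = """Searching for 'retry' in src/...
No matches in src/utils/
Reading src/api/webhooks/stripe.ts
src/api/webhooks/stripe.ts:98 throws when retryCount is null"""

compacted = verbatim_compact(trace)
# The exact file:line reference survives untouched:
assert "src/api/webhooks/stripe.ts:98" in compacted
```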
Morph Compact takes the deletion approach. The model identifies which tokens carry signal and which are noise, then removes the noise. Every sentence that survives is verbatim from the original. No paraphrasing. No summarization. This means the agent's working memory after compression contains a strict subset of the original content with zero risk of the compression step introducing errors.
Working memory optimization with Morph Compact

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-morph-api-key",
    base_url="https://api.morphllm.com/v1",
)

# accumulated_context: the session history gathered so far (~20K tokens).
# The agent's working memory is getting noisy after many tool calls;
# compact it before the next reasoning step:
response = client.chat.completions.create(
    model="morph-compact",
    messages=[{
        "role": "user",
        "content": accumulated_context,
    }],
)

compacted = response.choices[0].message.content
# Result: 6-10K tokens, every surviving sentence verbatim.
# The agent's next reasoning step sees only high-signal tokens.
```

The ACON framework from academic research validated this direction: adaptive compression of agent observations achieved 26-54% peak token reduction while preserving 95%+ task accuracy. JetBrains found that simple observation masking matched full LLM summarization quality at a fraction of the cost. The evidence is consistent: most of what fills an agent's working memory is noise, and removing it improves performance.
Compaction is momentum
Jason Liu framed the value precisely: "If in-context learning is gradient descent, then compaction is momentum." It preserves the trajectory of the conversation while shedding the weight of irrelevant history. The agent keeps its direction without dragging dead tokens forward. This is the practical path to better working memory: not bigger windows, but cleaner ones.
Frequently Asked Questions
What is AI agent memory?
AI agent memory is the system an agent uses to retain, retrieve, and reason over information across its operation. It includes three tiers: working memory (the context window), short-term memory (session history), and long-term memory (persistent cross-session storage). Each tier has different capacity, speed, and failure modes.
What is the difference between working memory and long-term memory in AI agents?
Working memory is the context window, the only information the model can reason over during inference. It is fast but capacity-limited (128K-1M tokens). Long-term memory persists across sessions using external storage like vector databases or files. It has unlimited capacity but requires retrieval mechanisms to load relevant information back into working memory.
Why do AI agents forget between sessions?
LLMs are stateless. Their weights are frozen and do not update during use. The only information the model knows about your task is what's in the context window. When a session ends, the context clears. Cross-session retention requires external memory systems, and the best current approaches achieve only around 37% retention accuracy.
How does MemGPT manage agent memory?
MemGPT treats the LLM like an operating system with main memory (context window) and disk storage (external databases). The agent uses function calls to page information in and out of its context, similar to how an OS manages virtual memory. This now forms the basis of the Letta agent framework.
How do coding agents like Claude Code handle memory?
Claude Code uses CLAUDE.md files as always-loaded project memory, auto-compaction to summarize history at context limits, subagent isolation through its Task tool, and .claudeignore to exclude irrelevant files. These strategies map to Anthropic's four pillars of context engineering: Write, Select, Compress, and Isolate.
How does context compression improve agent working memory?
Context compression removes noise tokens from the context window, freeing attention budget for high-signal information. Morph Compact achieves 50-70% token reduction with 98% verbatim accuracy by deleting low-signal content rather than rewriting it. Every surviving sentence is identical to the original, eliminating hallucination risk from the compression step.
Optimize Your Agent's Working Memory
Morph Compact removes noise tokens from the context window so your agent's attention budget goes to what matters. 50-70% reduction, 3,300+ tok/s, and zero hallucination risk. Every surviving sentence is verbatim from the original.