AI agent memory is the system an agent uses to retain, retrieve, and reason over information across its operation. It maps to three tiers borrowed from cognitive science: working memory (the context window), short-term memory (session history), and long-term memory (persistent cross-session storage). Each tier has different capacity, speed, and failure modes. For coding agents, working memory is the bottleneck.
Three Types of AI Agent Memory
The field draws from Endel Tulving's 1972 taxonomy of human memory. A 2025 survey of agent memory systems identified three dominant forms: token-level (context window), parametric (model weights), and latent memory. For practical purposes, agent builders work with three tiers.
Working Memory
The context window. Everything the model can reason over at inference time: the system prompt, conversation history, retrieved documents, tool outputs. Capacity is finite and every token costs attention.
Short-Term Memory
The current session. Conversation turns, tool call results, intermediate reasoning, and accumulated state. Persists only until the session ends or context is compacted.
Long-Term Memory
Persistent storage across sessions. User preferences, project knowledge, learned patterns, past decisions. Requires external systems: databases, vector stores, or files like CLAUDE.md.
Recent research further subdivides long-term memory into episodic (specific past events and their outcomes), semantic (accumulated facts and user preferences), and procedural (learned workflows and decision logic). These map to different storage backends and retrieval strategies.
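These subdivisions are easy to make concrete. The sketch below tags memory records by type so each can route to a different retrieval strategy; the `MemoryRecord` shape and its field names are illustrative, not a standard schema.

```python
from dataclasses import dataclass, field
from typing import List, Literal

@dataclass
class MemoryRecord:
    # Illustrative record shape, not any framework's actual schema.
    kind: Literal["episodic", "semantic", "procedural"]
    content: str
    tags: List[str] = field(default_factory=list)

store = [
    MemoryRecord("episodic", "2024-06-01: shipped webhook fix; retryCount null bug"),
    MemoryRecord("semantic", "User prefers TypeScript with strict mode"),
    MemoryRecord("procedural", "Deploy: run tests, bump version, tag release"),
]

# Different kinds suggest different retrieval strategies:
# episodic -> time-ordered scan, semantic -> similarity search,
# procedural -> exact lookup by task name.
episodic = [m for m in store if m.kind == "episodic"]
```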
| Dimension | Working Memory | Short-Term Memory | Long-Term Memory |
|---|---|---|---|
| Analogy | CPU registers / RAM | Desktop workspace | Hard drive / database |
| Capacity | 128K-1M tokens | Unbounded (within session) | Unlimited |
| Speed | Instant (in-context) | Instant (in-context) | Retrieval required |
| Persistence | Single inference call | Single session | Across sessions |
| Failure mode | Context rot, attention dilution | Noise accumulation, compaction loss | Low retention (~37%), retrieval errors |
| Key challenge | Signal-to-noise ratio | Compression timing | What to store, how to retrieve |
Working Memory: The Context Window Is All You Get
LLMs are stateless functions. Their weights are frozen by the time they reach inference. The model does not learn from your conversation, does not remember your last session, and does not update its parameters based on your feedback. The only information it can reason over is what you put into the context window.
This makes the context window the agent's working memory: the bottleneck through which all reasoning must pass. Context engineering is the discipline of managing this scarce resource. Every token matters. Every irrelevant token degrades performance.
Attention budget is finite
Like humans who can hold roughly 7 items in working memory, LLMs have a finite "attention budget." Every new token depletes this budget. Chroma's research tested 18 frontier models and found that all of them degrade as input length increases, even on simple retrieval tasks. The degradation follows three mechanisms: the lost-in-the-middle effect, attention dilution at scale, and distractor interference.
The lost-in-the-middle effect (Liu et al., Stanford/TACL 2024) showed that LLM performance drops over 30% when relevant information sits in the middle of the context. Transformer attention follows a U-shaped curve: strong at the start and end, weak in the middle. For an agent that reads 8 files and finds relevant code in file #4, that code sits in the model's blind spot.
Attention scales quadratically. At 100K tokens, the model tracks 10 billion pairwise relationships. Adding more context does not just dilute relevance. It makes the model physically worse at attending to what matters.
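The quadratic growth is simple arithmetic: with n tokens in context, self-attention computes an n × n matrix of pairwise scores per head.

```python
def attention_pairs(n_tokens: int) -> int:
    # Each token attends to every token (including itself): n^2 score entries.
    return n_tokens * n_tokens

# At 100K tokens the score matrix has 10 billion entries per head:
assert attention_pairs(100_000) == 10_000_000_000
# Doubling the context quadruples the work:
assert attention_pairs(200_000) == 4 * attention_pairs(100_000)
```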
Short-Term Memory: Session State and Its Decay
Short-term memory is everything the agent accumulates during a single session: conversation turns, tool outputs, file contents, error traces, and its own reasoning. Unlike working memory (which is the window for a single inference call), short-term memory spans the full session and feeds into working memory at each step.
The problem is accumulation. A typical coding task generates thousands of tokens per step:
Token accumulation in a multi-step coding session

```text
Step 1: Read issue description                        500 tokens
Step 2: Search codebase, read 5 candidate files     8,000 tokens
Step 3: Read related tests, config files            6,000 tokens
Step 4: Backtrack, explore alternative approach     5,000 tokens
Step 5: Found correct file, ready to edit
----------
Total context: ~20,000 tokens (60%+ is noise)
```

The agent now has the right information buried in 20K tokens of search traces, dead ends, and irrelevant file contents. Most of this hurts performance. It does not help.

Every production agent handles this differently. Claude Code triggers auto-compaction when context approaches the window limit, summarizing history into a structured format. OpenAI recommends compaction as a "default long-run primitive," not an emergency fallback. The question is not whether to compress short-term memory, but when and how.
Compaction is lossy
When Claude Code compacts, it produces documentation-style summaries that capture the gist of what happened but lose specific events, decisions, and exact code references. Users report losing early conversation detail after compaction. This is why compaction vs. summarization matters: the compression method determines what survives.
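A minimal sketch of a threshold-triggered compaction step, assuming a fixed window limit and a keep-recent-turns policy. Real agents use their own triggers and compression models; every name and number here is hypothetical.

```python
from typing import List

def maybe_compact(history: List[str], token_counts: List[int],
                  window_limit: int = 200_000,
                  threshold: float = 0.8) -> List[str]:
    """Compact session history once it nears the window limit.

    Hypothetical sketch: keep the most recent turns verbatim and collapse
    older ones into a single placeholder summary line.
    """
    total = sum(token_counts)
    if total < threshold * window_limit:
        return history  # plenty of headroom, do nothing
    keep = 3  # last few turns survive verbatim
    summary = f"[summary of {len(history) - keep} earlier turns]"
    return [summary] + history[-keep:]
```

Whatever produces the summary line is where lossiness enters: everything before the kept turns survives only as whatever the compression step wrote down.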
Long-Term Memory: The Cross-Session Problem
When a session ends, the context window clears. The agent starts fresh. Unless you have built an external memory system, every decision, every discovery, and every learned preference is gone.
Cross-session memory retention is the hardest unsolved problem in agent memory. The best approaches achieve only around 37% retention across sessions according to compression benchmarks. Mem0's research on the LOCOMO benchmark showed that retrieval-augmented memory achieved 26% higher accuracy than OpenAI's native memory (66.9% vs. 52.9%). A Letta agent reached 74% on the same benchmark with GPT-4o mini. These numbers are improving, but they are far from solved.
| Approach | How It Works | Tradeoffs |
|---|---|---|
| Config files (CLAUDE.md) | Always-loaded text files with project instructions | Manual maintenance; limited to what you write down |
| Vector stores / RAG | Embed past interactions, retrieve by similarity | Math ceiling at ~500K docs; code structure is hard to embed |
| Structured databases | Store facts, preferences, decisions in relational/KV stores | Requires schema design; retrieval queries add latency |
| Auto-memory (Claude) | Agent writes notes to MEMORY.md during sessions | First 200 lines loaded per session; can drift or bloat |
| MCP memory servers | SQLite-backed tools the agent reads/writes at runtime | Flexible but requires integration; no standard protocol yet |
Coding agents have converged on a practical pattern: files as memory. Claude Code uses CLAUDE.md for project instructions and MEMORY.md for auto-discovered patterns. This is simple, inspectable, and version-controlled. The first 200 lines of MEMORY.md load into every session automatically.
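The files-as-memory pattern is a few lines of code. The sketch below loads only a bounded prefix of a memory file, mirroring the 200-line limit described above; the function name and defaults are illustrative.

```python
from pathlib import Path

def load_memory_prefix(path: str = "MEMORY.md", max_lines: int = 200) -> str:
    """Load only the first max_lines of a memory file into the prompt.

    Sketch of the files-as-memory pattern: the always-loaded prefix stays
    bounded so it cannot crowd out working memory.
    """
    p = Path(path)
    if not p.exists():
        return ""
    lines = p.read_text().splitlines()
    return "\n".join(lines[:max_lines])
```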
The deeper research direction is memory management, not just memory storage. A-Mem (Agentic Memory) treats memory as a living system that merges related memories, marks outdated ones as invalid, and resolves contradictions. This mirrors how human memory consolidates and forgets. Agents need to forget strategically, not just accumulate.
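One way to sketch that kind of management, assuming a toy record shape (not A-Mem's actual schema): newer facts about the same key invalidate older ones rather than piling up as contradictions.

```python
from typing import Dict, List

def consolidate(memories: List[dict]) -> List[dict]:
    """Toy A-Mem-style management: the latest fact per key wins.

    Records are {'key': ..., 'value': ..., 'ts': ...}; field names are
    invented for illustration.
    """
    latest: Dict[str, dict] = {}
    for m in sorted(memories, key=lambda m: m["ts"]):
        latest[m["key"]] = m  # later timestamp overwrites the older entry
    return list(latest.values())

notes = [
    {"key": "test_runner", "value": "jest", "ts": 1},
    {"key": "test_runner", "value": "vitest", "ts": 2},  # project migrated
    {"key": "pkg_manager", "value": "pnpm", "ts": 1},
]
current = consolidate(notes)
```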
How Coding Agents Manage Memory in Practice
Each major coding agent handles the memory problem differently. The differences are not theoretical. They directly determine session length, accuracy over time, and token cost.
| Feature | Claude Code | OpenAI Codex | Cursor | Devin |
|---|---|---|---|---|
| Working memory | 200K context window | 200K context window | 120K context window | 200K context window |
| Persistent memory | CLAUDE.md + MEMORY.md | None (stateless) | Cursor rules files | Task lists, to-do files |
| Auto-compaction | Yes (at capacity) | Yes (/compact endpoint) | No | Partial (premature) |
| Context isolation | Subagent Task tool | Sandboxed execution | Background indexing | Parallel sandboxes |
| Degradation onset | Gradual (compaction helps) | Gradual | 20-30 exchanges | ~2.5 hours |
| Token efficiency | 5.5x fewer than Cursor | Baseline | High token usage | Variable |
Claude Code uses 5.5x fewer tokens than Cursor for equivalent coding tasks. That gap comes from better context management, not a better base model. Structured memory files, selective loading via .claudeignore, auto-compaction, and subagent isolation each contribute.
Devin takes a different approach for long-running tasks. It maintains a persistent to-do list and iterates over hours or days, using parallel sandboxes for isolation. But it exhibits "context anxiety" where the model prematurely summarizes to avoid hitting limits, losing detail before it needs to.
The memory hierarchy is real
Production coding agents now implement a recognizable memory hierarchy: always-loaded config files (L1 cache), on-demand file loading (L2), session history with compaction (L3), and external retrieval for rare queries (L4). The pattern mirrors CPU memory hierarchies from computer architecture: fast/small at the top, slow/large at the bottom.
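The hierarchy reduces to a cheapest-first fall-through lookup. A sketch, with hypothetical tier names and an external-retrieval callback standing in for L4:

```python
from typing import Callable, Dict, Optional

def lookup(query: str, l1: Dict[str, str], l2: Dict[str, str],
           l3: Dict[str, str],
           fetch_l4: Callable[[str], Optional[str]]) -> Optional[str]:
    """Fall through the memory tiers, cheapest first.

    l1: always-loaded config, l2: on-demand file contents,
    l3: compacted session history, fetch_l4: external retrieval
    (vector store, database). All names are illustrative.
    """
    for tier in (l1, l2, l3):
        if query in tier:
            return tier[query]
    return fetch_l4(query)  # slow path, only for rare queries
```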
Memory Architectures: MemGPT, Letta, and Virtual Context
MemGPT (Packer et al., 2023) introduced the most influential agent memory architecture. It treats the LLM as an operating system where the context window is "main memory" (RAM) and external storage is "disk." The agent uses function calls to page information in and out of context, just like an OS manages virtual memory.
MemGPT-style virtual context management

```python
# The agent's context window (main memory) is limited.
# External storage (disk) holds everything else.

# Agent decides what to keep in "RAM" (context window):
core_memory = {
    "persona": "I am a coding assistant working on project X",
    "user": "Prefers TypeScript, uses Next.js, strict on types",
    "current_task": "Fix the webhook retry logic in stripe.ts",
}

# When the agent needs old information, it "pages in" from disk:
agent.call_function("archival_search", query="previous webhook fixes")
# Returns relevant memories from the vector store into context

# When context gets full, the agent "pages out" to disk:
agent.call_function(
    "archival_insert",
    content="Discovered retryCount can be null for new customers",
)
# Saves to long-term storage, frees context window space

# The result: effectively unlimited memory through intelligent paging
```

MemGPT now lives as part of the Letta framework, which extends the pattern with memory blocks: dedicated modules for core memory, episodic memory, semantic memory, and procedural memory. Each module uses data structures suited to its content type.
Virtual Context Management
Inspired by OS virtual memory. The agent pages information between the context window (fast, limited) and external storage (slow, unlimited) using function calls. Enables working with information that far exceeds the context window.
Memory Blocks
Letta's extension of MemGPT. Dedicated modules for different memory types: core (persona + user facts), episodic (time-series events), semantic (abstract knowledge), and procedural (step-by-step workflows). Each block has its own update and retrieval logic.
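A rough sketch of the memory-block idea, assuming each block enforces its own budget before being rendered into the prompt. The field names and rendering format are invented for illustration, not Letta's actual API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MemoryBlock:
    # Illustrative block shape; not Letta's actual schema.
    label: str      # e.g. "core", "episodic", "semantic", "procedural"
    content: str
    limit: int      # per-block character budget

    def render(self) -> str:
        # Each block enforces its own budget before entering the prompt.
        body = self.content[: self.limit]
        return f"<{self.label}>\n{body}\n</{self.label}>"

blocks: List[MemoryBlock] = [
    MemoryBlock("core", "User prefers TypeScript; project uses Next.js", 200),
    MemoryBlock("procedural", "Deploy: test -> version bump -> tag", 200),
]
prompt_memory = "\n".join(b.render() for b in blocks)
```

Because each block has its own budget and update logic, one bloated memory type cannot crowd the others out of the context window.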
The key insight from MemGPT: the agent itself should manage its memory. Rather than relying on fixed rules (compact at 80% capacity, retrieve top-5 documents), the agent decides what to remember, what to forget, and when to retrieve. This agentic approach to memory is now a core research direction, with papers like Agentic Memory (2026) proposing unified frameworks for short-term and long-term memory learning.
Optimizing Working Memory with Compression
Long-term memory is a hard research problem. Working memory is an engineering problem you can solve today. The approach: remove noise tokens from the context window so the model's attention budget goes to high-signal information.
Three compression approaches have emerged, each with different tradeoffs:
| Method | Mechanism | Hallucination Risk | Best For |
|---|---|---|---|
| Structured summarization | LLM rewrites into organized sections | Medium (paraphrasing can alter details) | High-level progress tracking |
| Opaque compression | Model-internal compression (black box) | Medium (unverifiable) | API-level simplicity |
| Verbatim compaction | Deletes noise, keeps text word-for-word | Zero (no rewriting) | Code, errors, file paths |
For coding agents where exact file paths, error messages, and code snippets must survive compression, the distinction between summarization and compaction is critical. Summarization might compress src/api/webhooks/stripe.ts:98 into "the Stripe webhook handler," losing the exact reference the agent needs for its next edit.
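The distinction can be made concrete with a toy verbatim filter: delete lines judged low-signal, keep survivors byte-for-byte. The noise pattern below is invented for illustration and is not any production compaction model.

```python
import re

# Hypothetical low-signal prefixes for a coding agent's tool trace:
NOISE = re.compile(r"^(Searching|Reading|No matches|Listing)\b")

def verbatim_compact(context: str) -> str:
    """Toy verbatim compaction: delete low-signal lines, never rewrite.

    The invariant that matters: every surviving line is byte-identical
    to the original, so exact file:line references cannot be mangled.
    """
    kept = [ln for ln in context.splitlines() if ln and not NOISE.match(ln)]
    return "\n".join(kept)

trace = """Searching for 'retry' in src/...
No matches in src/utils/
Reading src/api/webhooks/stripe.ts
src/api/webhooks/stripe.ts:98 throws when retryCount is null"""

compacted = verbatim_compact(trace)
# The exact file:line reference survives untouched:
assert "src/api/webhooks/stripe.ts:98" in compacted
```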
Morph Compact takes the deletion approach. The model identifies which tokens carry signal and which are noise, then removes the noise. Every sentence that survives is verbatim from the original. No paraphrasing. No summarization. This means the agent's working memory after compression contains a strict subset of the original content with zero risk of the compression step introducing errors.
Working memory optimization with Morph Compact

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-morph-api-key",
    base_url="https://api.morphllm.com/v1",
)

# accumulated_context: the session history gathered so far (~20K tokens).
# The agent's working memory is getting noisy after many tool calls;
# compact it before the next reasoning step:
response = client.chat.completions.create(
    model="morph-compact",
    messages=[{
        "role": "user",
        "content": accumulated_context,
    }],
)

compacted = response.choices[0].message.content
# Result: 6-10K tokens, every surviving sentence verbatim.
# The agent's next reasoning step sees only high-signal tokens.
```

The ACON framework from academic research validated this direction: adaptive compression of agent observations achieved 26-54% peak token reduction while preserving 95%+ task accuracy. JetBrains found that simple observation masking matched full LLM summarization quality at a fraction of the cost. The evidence is consistent: most of what fills an agent's working memory is noise, and removing it improves performance.
Compaction is momentum
Jason Liu framed the value precisely: "If in-context learning is gradient descent, then compaction is momentum." It preserves the trajectory of the conversation while shedding the weight of irrelevant history. The agent keeps its direction without dragging dead tokens forward. This is the practical path to better working memory: not bigger windows, but cleaner ones.
Frequently Asked Questions
What is AI agent memory?
AI agent memory is the system an agent uses to retain, retrieve, and reason over information across its operation. It includes three tiers: working memory (the context window), short-term memory (session history), and long-term memory (persistent cross-session storage). Each tier has different capacity, speed, and failure modes.
What is the difference between working memory and long-term memory in AI agents?
Working memory is the context window, the only information the model can reason over during inference. It is fast but capacity-limited (128K-1M tokens). Long-term memory persists across sessions using external storage like vector databases or files. It has unlimited capacity but requires retrieval mechanisms to load relevant information back into working memory.
Why do AI agents forget between sessions?
LLMs are stateless. Their weights are frozen and do not update during use. The only information the model knows about your task is what's in the context window. When a session ends, the context clears. Cross-session retention requires external memory systems, and the best current approaches achieve only around 37% retention accuracy.
How does MemGPT manage agent memory?
MemGPT treats the LLM like an operating system with main memory (context window) and disk storage (external databases). The agent uses function calls to page information in and out of its context, similar to how an OS manages virtual memory. This now forms the basis of the Letta agent framework.
How do coding agents like Claude Code handle memory?
Claude Code uses CLAUDE.md files as always-loaded project memory, auto-compaction to summarize history at context limits, subagent isolation through its Task tool, and .claudeignore to exclude irrelevant files. These strategies map to Anthropic's four pillars of context engineering: Write, Select, Compress, and Isolate.
How does context compression improve agent working memory?
Context compression removes noise tokens from the context window, freeing attention budget for high-signal information. Morph Compact achieves 50-70% token reduction with 98% verbatim accuracy by deleting low-signal content rather than rewriting it. Every surviving sentence is identical to the original, eliminating hallucination risk from the compression step.
Optimize Your Agent's Working Memory
Morph Compact removes noise tokens from the context window so your agent's attention budget goes to what matters. 50-70% reduction, 3,300+ tok/s, and zero hallucination risk. Every surviving sentence is verbatim from the original.