Context Distillation: How LLMs Internalize and Compress Context

Context distillation extracts essential information from large contexts into compact form. From Anthropic's Constitutional AI training to production coding agents, this technique reduces tokens by 50-70% while preserving signal. Academic definitions, practical implementations, and how it relates to context compression.

February 27, 2026 · 4 min read

Context distillation extracts and preserves only the essential information from a large context into a compact form. In training, it teaches models to internalize instructions so they don't need them at inference time. In production, it compresses runtime context so coding agents stay fast and accurate as sessions grow longer.

- 50-70% token reduction with verbatim compaction
- 80-95% compression via full distillation
- 3,300+ tokens/sec (Morph Compact)
- 98% verbatim accuracy preserved

What Is Context Distillation

Context distillation is the process of extracting essential information from a large context and encoding it into a more compact form. The term spans two domains: model training and runtime inference. Both share the same goal: reduce the tokens required while preserving the information that matters.

In training, context distillation means fine-tuning a model to produce outputs as if instructions, chain-of-thought reasoning, or few-shot examples were present, without actually including them. The Learning by Distilling Context paper formalized this: condition the model on full context to generate output, then fine-tune the model to predict that same output from the bare input alone.

In production systems, context distillation means compressing the accumulated context of a running agent (conversation history, tool outputs, file contents, error traces) into the smallest token set that preserves task-relevant signal. This is the practical problem every coding agent faces.

Two meanings, one principle

Training-time context distillation bakes context into model weights. Runtime context distillation compresses context into fewer tokens. Both reduce what the model needs to process while preserving what it needs to know. The first is a training technique. The second is an inference optimization.

Context Distillation in Model Training

The original context distillation technique works in two steps. First, the model generates output conditioned on full context: task instructions, scratchpad reasoning, and few-shot examples. Second, the model is fine-tuned to produce that same output conditioned only on the task input, without the instructions or scratchpad.

This incentivizes the model to internalize the context into its parameters. After distillation, the model behaves as if the instructions were present, because the information has been absorbed into the weights themselves.
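As a toy sketch of the two-step recipe (the `generate` stub below stands in for a real LLM call; this is not Anthropic's actual pipeline, and the fine-tuning step itself is omitted):

```python
# Toy sketch of the two-step context distillation recipe.
# `generate` is a stand-in for sampling from a real model.

INSTRUCTIONS = "Answer in one word.\n\n"

def generate(prompt: str) -> str:
    # Stand-in for an LLM conditioned on `prompt`.
    return f"<output for: {prompt!r}>"

def build_distillation_pairs(inputs: list[str]) -> list[dict]:
    pairs = []
    for x in inputs:
        # Step 1: generate conditioned on the full context
        # (instructions + input).
        y = generate(INSTRUCTIONS + x)
        # Step 2: fine-tune on (bare input -> that output), so the
        # model reproduces the behavior without the instructions.
        pairs.append({"input": x, "target": y})
    return pairs

pairs = build_distillation_pairs(["What is the capital of France?"])
```

The fine-tuning targets are context-conditioned outputs, but the inputs are bare: that asymmetry is what pushes the context into the weights.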

Instruction Internalization

Models learn to follow task instructions without needing them in the prompt. The instructions become part of the model's default behavior after fine-tuning on context-conditioned outputs.

Reasoning Absorption

Step-by-step scratchpad reasoning gets absorbed into model weights. The model learns to produce correct answers directly, without generating intermediate reasoning steps at inference time.

Safety Alignment

Anthropic's Constitutional AI used context distillation to internalize safety principles. The model was trained to follow principles without needing them in the prompt, though RLHF later proved more effective.

Anthropic's Constitutional AI

Anthropic's Constitutional AI paper applied context distillation to safety training. The model generated outputs conditioned on a set of constitutional principles, then was fine-tuned to produce those same safe outputs without the principles in the prompt. The technique worked, but Anthropic found that RLHF provided a larger improvement. Their later models were fine-tuned directly from pretrained checkpoints, skipping the context distillation step.

DeepSeek R1's Reasoning Traces

DeepSeek demonstrated a related pattern with R1. They generated 800,000 reasoning traces from the full R1 model, then fine-tuned smaller models (Qwen 2.5, Llama 3) on those traces. The distilled 14B model outperformed larger open-source models on reasoning benchmarks. The reasoning context that the large model needed to solve problems was distilled into the smaller model's weights. This is closer to knowledge distillation, but the training data itself is distilled context: the chain-of-thought that R1 used to reach correct answers.

On-Policy Context Distillation (OPCD)

Recent research on on-policy context distillation addresses two practical applications: experiential knowledge distillation, where models extract transferable knowledge from their own historical solution traces, and system prompt distillation, where models internalize behaviors encoded in optimized prompts. Instead of relying on teacher-generated data, the model learns from its own on-policy outputs.

Context Distillation vs. Knowledge Distillation

These two techniques share a name but solve different problems. Understanding the distinction matters for choosing the right approach.

| Dimension | Knowledge Distillation | Context Distillation |
| --- | --- | --- |
| Goal | Transfer capabilities from large to small model | Remove dependency on context tokens at inference |
| Teacher/Student | Different models (large teacher, small student) | Same model (with context vs. without context) |
| What changes | Model size decreases | Context requirement decreases |
| Training data | Teacher's output distributions | Model's own context-conditioned outputs |
| Result | Smaller model with similar capabilities | Same model that internalizes instructions/reasoning |
| Example | DeepSeek R1 → R1-Distill-14B | Anthropic Constitutional AI safety principles |

Knowledge distillation compresses models. Context distillation compresses context. DeepSeek's R1 distillation pipeline blurs the line: the reasoning traces are distilled context, but the student is a smaller model. In practice, the two techniques often work together. A distilled model that has internalized instructions through context distillation can then be knowledge-distilled into an even smaller model.

The overlap with DeepSeek R1

DeepSeek R1's distillation is technically knowledge distillation (large model to small model), but the training data is distilled context (chain-of-thought reasoning traces). The distilled 14B model doesn't just mimic R1's outputs. It internalizes R1's reasoning patterns. This shows how the two techniques reinforce each other.

Runtime Context Distillation for Coding Agents

For coding agents, the practical problem is not training-time distillation. It is runtime context distillation: compressing the accumulated context of a live session into fewer tokens without losing the information the agent needs.

A typical coding agent session accumulates context fast. File reads add 200-2,000 tokens each. Grep results add 500-5,000 tokens. Error traces add 100-1,000 tokens. Conversation turns add 50-500 tokens each. After 20-30 tool calls, the agent can be carrying 30,000+ tokens of context, much of it irrelevant to the current task.
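A back-of-envelope calculation using midpoints of those ranges shows how quickly this compounds (the session mix below is illustrative, not measured):

```python
# Midpoints of the per-item token ranges above (illustrative).
TOKENS = {
    "file_read": 1_100,    # 200-2,000 tokens per read
    "grep": 2_750,         # 500-5,000 tokens per search
    "error_trace": 550,    # 100-1,000 tokens per trace
    "turn": 275,           # 50-500 tokens per conversation turn
}

# A plausible 25-tool-call session.
session = {"file_read": 12, "grep": 6, "error_trace": 7, "turn": 20}

total = sum(TOKENS[kind] * count for kind, count in session.items())
print(total)  # well past 30,000 tokens
```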

As context rot research shows, LLM performance degrades as input length increases. The degradation starts well before the context window fills up. Every irrelevant token makes the model worse at attending to the tokens that matter. Runtime context distillation is the fix: reduce tokens to keep the model focused.

- 60-80% of context is tool observations
- 30K+ tokens after 20-30 tool calls
- 37% cross-session memory retention
- 1-2K tokens returned per subagent

Subagent Architecture: Natural Distillation

Anthropic's context engineering research demonstrates that subagent architectures achieve natural context distillation. Each subagent explores extensively in its own context window, using tens of thousands of tokens. But it returns only a condensed summary of 1,000-2,000 tokens to the lead agent. The lead agent never sees the noise.

This is context distillation in the runtime sense: a large body of information (the subagent's exploration) gets distilled into a compact representation (the summary) that preserves the essential findings. The context engineering discipline is, at its core, the practice of distilling context to maintain signal density.
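A minimal sketch of the pattern (the `explore` and `summarize` callables are hypothetical stand-ins for real LLM-backed agent calls):

```python
# Sketch of subagent-style distillation: explore large, return small.

def run_subagent(task: str, explore, summarize,
                 budget_chars: int = 2_000) -> str:
    # The subagent burns through a large exploration in its own context...
    exploration = explore(task)
    # ...but only a condensed summary crosses back to the lead agent.
    summary = summarize(exploration)
    return summary[:budget_chars]

# Toy stand-ins: a huge exploration, a short extractive summary.
explore = lambda task: ("searched repo... " * 5_000) + \
    f"FOUND: bug in parser for {task}"
summarize = lambda text: text[text.index("FOUND:"):]

result = run_subagent("unicode handling", explore, summarize)
```

The lead agent's context grows only by the budgeted summary, regardless of how many tokens the subagent consumed.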

The Token Budget Problem

Multi-session memory retention sits at roughly 37%. If an agent cannot carry context forward reliably between sessions, it needs to maximize what fits in the current window. This makes within-session context distillation critical. Every token of noise that survives in the context is a token of signal that could have been there instead.

Three Methods of Runtime Context Distillation

Three distinct approaches have emerged for distilling context at runtime. Each trades off differently between compression ratio, fidelity, and hallucination risk. For a deeper comparison, see compaction vs. summarization.

Summarization

Rewrites context into condensed natural language. 70-90% compression. High risk of altering code, file paths, and error messages during rewriting. Used by Claude Code's auto-compact.

Opaque Compression

Model-internal compressed representation. 99%+ compression. Not inspectable, not portable, locked to provider infrastructure. Used by OpenAI Codex.

Verbatim Compaction

Deletes noise tokens, keeps every surviving sentence word-for-word from the original. 50-70% compression. Zero hallucination risk. Used by Morph Compact.

| Dimension | Summarization | Opaque Compression | Verbatim Compaction |
| --- | --- | --- | --- |
| Compression ratio | 70-90% | 99%+ | 50-70% |
| Information retention | 50-80% (rewrites) | Unknown (opaque) | High (exact original) |
| Hallucination risk | Medium (paraphrasing) | Medium (black box) | Zero (no rewriting) |
| Code fidelity | Low (code altered) | Unknown | High (verbatim text) |
| Inspectability | High (readable) | None (opaque) | High (subset of original) |
| Speed | Slow (full LLM call) | Variable | 3,300+ tok/s |
| Provider lock-in | None | High (OpenAI only) | None |

The choice depends on what information you need to preserve. Summarization works when the agent needs high-level progress context. Verbatim compaction works when the agent needs exact code, file paths, and error messages. For context compression in coding agents, where precision matters more than maximum compression ratio, verbatim compaction eliminates the risk of the distillation step itself introducing errors.
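The verbatim property can be stated precisely: every sentence in the output must appear word-for-word in the input. A toy noise filter (not Morph's actual algorithm) makes that invariant concrete:

```python
# Toy verbatim compactor (not Morph's algorithm): drop sentences that
# match noise markers, keep every surviving sentence unmodified.

def compact(text: str,
            noise_markers: tuple[str, ...] = ("fyi", "aside")) -> str:
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    kept = [s for s in sentences
            if not any(m in s.lower() for m in noise_markers)]
    return ". ".join(kept)

original = ("Build failed on step three. FYI the runner was slow today. "
            "Error: missing module config")
compacted = compact(original)

# The invariant: every surviving sentence is a substring of the original.
assert all(s in original for s in compacted.split(". "))
```

Because nothing is rewritten, the distillation step cannot corrupt code, file paths, or error text; it can only omit.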

Tiered Context Distillation Strategy

The optimal production approach combines multiple distillation methods at different time horizons. Recent context stays at full fidelity. Older context gets progressively more distilled.

| Tier | Method | Information Retained | Compression |
| --- | --- | --- | --- |
| Immediate (last 2-3 turns) | Consolidation (full detail) | 80-95% | 20-50% |
| Recent (last 10-15 turns) | Verbatim compaction | High (exact text) | 50-70% |
| Session history | Summarization | 50-80% (rewritten) | 70-90% |
| Cross-session memory | Full distillation (key patterns) | 30-60% (conceptual) | 80-95% |

This layered approach matches compression depth to relevance. The agent sees exact details for what it is working on now, compacted-but-verbatim details for recent work, summaries of older session history, and distilled patterns from past sessions. Each tier uses the distillation method that best fits its role.

Sourcegraph's radical approach

Sourcegraph retired compaction entirely in their Amp agent. When context fills up, Amp spawns a new agent with a task summary rather than compressing the existing conversation. This treats context exhaustion as a coordination problem, not a compression problem. It is the most extreme form of context distillation: compress an entire session into a single task handoff.
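A sketch of that handoff pattern (hypothetical interfaces and threshold; this is not Amp's actual code):

```python
# Sketch of the handoff pattern: when the context fills, distill the
# whole session into a task brief and hand it to a fresh agent.

CONTEXT_LIMIT = 150_000  # tokens; illustrative threshold

def maybe_handoff(messages: list[dict], token_count: int,
                  summarize) -> list[dict]:
    if token_count < CONTEXT_LIMIT:
        return messages  # keep going in the current context
    brief = summarize(messages)  # entire session -> one task handoff
    return [{"role": "user", "content": brief}]

# Toy summarizer standing in for an LLM call.
summarize = lambda msgs: (
    f"Continue task: {len(msgs)} prior turns, resume at failing test."
)
fresh = maybe_handoff([{"role": "user", "content": "..."}] * 40,
                      200_000, summarize)
```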

Tiered distillation in an agent loop

// Tiered context distillation strategy
// (`formatMessages` is an assumed helper that serializes messages to text)
interface ContextTier {
  turns: Message[];
  method: "full" | "compact" | "summarize" | "distill";
}

function distillContext(history: Message[]): ContextTier[] {
  return [
    // Immediate: last 3 turns at full fidelity
    { turns: history.slice(-3), method: "full" },
    // Recent: turns 4-15 get verbatim compaction
    { turns: history.slice(-15, -3), method: "compact" },
    // Older: everything else gets summarized
    { turns: history.slice(0, -15), method: "summarize" },
  ];
}

async function buildDistilledContext(
  history: Message[],
  morph: OpenAI
): Promise<string> {
  const tiers = distillContext(history);
  const parts: string[] = [];

  for (const tier of tiers) {
    if (tier.turns.length === 0) continue; // skip empty tiers
    if (tier.method === "full") {
      parts.push(formatMessages(tier.turns));
    } else if (tier.method === "compact") {
      // Morph Compact: verbatim deletion, zero hallucination
      const response = await morph.chat.completions.create({
        model: "morph-compact",
        messages: [{ role: "user", content: formatMessages(tier.turns) }],
      });
      parts.push(response.choices[0].message.content ?? "");
    } else {
      // Summarize older context with a general-purpose model
      const response = await morph.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [{
          role: "user",
          content: `Summarize the key decisions and state:\n${formatMessages(tier.turns)}`
        }],
      });
      parts.push(response.choices[0].message.content ?? "");
    }
  }

  return parts.join("\n\n");
}

Code Example: Morph Compact SDK

Morph Compact provides verbatim compaction through the standard OpenAI SDK. Point the base URL at Morph's API and use the morph-compact model. For full product details, see the Compact product page.

Basic context distillation with Morph Compact (Python)

from openai import OpenAI

client = OpenAI(
    api_key="your-morph-api-key",
    base_url="https://api.morphllm.com/v1"
)

# Distill a long context into its essential tokens
response = client.chat.completions.create(
    model="morph-compact",
    messages=[
        {
            "role": "user",
            "content": long_context_string  # Agent's accumulated context
        }
    ]
)

distilled = response.choices[0].message.content
# Every surviving sentence is verbatim from the original
# 50-70% smaller, zero hallucination risk

Inline context distillation for tool outputs (TypeScript)

import OpenAI from "openai";

const morph = new OpenAI({
  apiKey: process.env.MORPH_API_KEY,
  baseURL: "https://api.morphllm.com/v1",
});

// Distill tool outputs before they enter the agent's context
// (`estimateTokens` is an assumed helper, e.g. Math.ceil(chars / 4))
async function distillToolOutput(output: string): Promise<string> {
  const tokens = estimateTokens(output);
  if (tokens < 500) return output; // Short outputs pass through

  const response = await morph.chat.completions.create({
    model: "morph-compact",
    messages: [{ role: "user", content: output }],
  });

  return response.choices[0].message.content ?? output;
}

// Agent loop with inline distillation
for (const toolCall of pendingToolCalls) {
  const result = await executeTool(toolCall);
  const distilled = await distillToolOutput(result.output);
  conversation.addToolResult(toolCall.id, distilled);
  // Context stays clean — only high-signal tokens survive
}

Frequently Asked Questions

What is context distillation in LLMs?

Context distillation extracts essential information from a large context into a compact form. In training, it means fine-tuning a model to produce outputs as if instructions or reasoning steps were present, without actually including them. In production, it means compressing runtime context (conversation history, tool outputs, file contents) into the smallest token set that preserves task-relevant signal.

How is context distillation different from knowledge distillation?

Knowledge distillation transfers capabilities from a large teacher model to a smaller student model. Context distillation transfers the effect of context tokens into model parameters. The teacher and student can be the same model. Knowledge distillation compresses models. Context distillation compresses context.

How did Anthropic use context distillation?

Anthropic applied context distillation in their Constitutional AI research. The model generated outputs conditioned on safety principles, then was fine-tuned to produce those outputs without the principles present. The model internalized the safety behavior. Later work found RLHF provided larger improvements, so subsequent models skipped the context distillation step.

What are the methods of runtime context distillation?

Three methods exist. Summarization rewrites context into condensed natural language (70-90% compression, hallucination risk). Opaque compression uses model-internal representations (99%+ compression, not inspectable). Verbatim compaction deletes low-signal tokens while keeping every surviving sentence identical to the original (50-70% compression, zero hallucination risk). Morph Compact uses verbatim compaction.

How does context distillation apply to coding agents?

Coding agents accumulate tokens across sessions: file reads, grep results, error traces, conversation history. Context distillation reduces this growing context to the smallest token set that preserves task-relevant information. Subagent architectures achieve natural distillation by exploring in isolated contexts and returning condensed summaries of 1,000-2,000 tokens. Context engineering is the discipline of applying distillation principles to keep agent context clean.

What is the difference between consolidation, summarization, and distillation?

Consolidation retains 80-95% of information with 20-50% compression, preserving breadth. Summarization retains 50-80% with 50-80% compression, balancing detail and brevity. Distillation retains 30-60% of raw information but captures the conceptual essence, achieving 80-95% compression. A tiered approach uses all three: consolidation for immediate context, summarization for recent history, distillation for older sessions.

Distill Your Agent's Context

Morph Compact is verbatim compaction for coding agents. 50-70% token reduction, 3,300+ tok/s, and zero hallucination risk. Every surviving sentence is word-for-word identical to the original.