Prompt Compression: Techniques to Reduce LLM Costs and Improve Agent Performance

Prompt compression reduces the tokens in an LLM prompt while preserving meaning. This guide covers six techniques, benchmarks from real coding sessions, and concrete cost savings.

February 27, 2026 · 2 min read

Prompt compression reduces the number of tokens in an LLM prompt while preserving the information the model needs. LLM costs scale linearly with tokens, so a 50% reduction in input tokens is a 50% reduction in input cost. For agents running hundreds of turns, this is the difference between viable infrastructure and a cost center that scales out of control.

  • 50-70% compression ratio with verbatim compaction
  • 3,300+ tokens per second (Morph Compact)
  • 98% verbatim accuracy on real sessions
  • $11K/mo savings at 100 agent sessions/day

What Is Prompt Compression

Prompt compression encompasses any technique that reduces the token count of an LLM input while retaining the semantic content the model needs to produce a correct response. The term covers two distinct scenarios:

  • Input compression: reducing a prompt before it's sent to the model. This includes removing boilerplate, compressing retrieved documents, and pruning low-information tokens.
  • Context compression: reducing accumulated context during long-running agent sessions. As agents read files, search codebases, and explore solution paths, their context windows fill with content that was useful at one point but is no longer relevant.

Both forms solve the same underlying problem: LLMs charge per token, and their performance degrades as input length increases. Sending fewer, higher-signal tokens costs less and produces better results.

Compression vs. prompt engineering

Prompt engineering optimizes how you phrase a request. Prompt compression optimizes how much context accompanies that request. They're complementary. A well-engineered prompt with 50K tokens of noisy context will still underperform a mediocre prompt with 10K tokens of relevant context.

Why It Matters: The Token-Cost Equation

LLM pricing is straightforward: you pay per token, with input tokens typically cheaper than output tokens. But input tokens dominate total cost in agent workflows because agents consume far more context than they generate.

| Scenario | Tokens/Session | Cost/Session (Opus) | Monthly (100/day) |
| --- | --- | --- | --- |
| No compression | 500K input | $7.50 | $22,500 |
| 30% compression | 350K input | $5.25 | $15,750 |
| 50% compression | 250K input | $3.75 | $11,250 |
| 70% compression | 150K input | $2.25 | $6,750 |

The savings compound. An agent team running 100 sessions per day at 500K tokens each consumes 50 million input tokens daily. At Claude Opus pricing (~$15/M input tokens), that's $750/day or $22,500/month. A 50% compression ratio cuts that to $11,250/month, saving $135,000 annually.
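The arithmetic above is easy to reproduce. The sketch below uses the same illustrative figures as this article (Claude Opus at ~$15/M input tokens, 100 sessions/day, 30-day months); the function name is my own.

```python
def monthly_input_cost(tokens_per_session: int, sessions_per_day: int,
                       price_per_million: float, compression: float = 0.0,
                       days: int = 30) -> float:
    """Monthly input-token spend after applying a compression ratio."""
    effective_tokens = tokens_per_session * (1 - compression)
    daily_cost = effective_tokens / 1_000_000 * price_per_million * sessions_per_day
    return daily_cost * days

# 100 sessions/day at 500K tokens, Claude Opus at ~$15/M input tokens
baseline = monthly_input_cost(500_000, 100, 15.00)         # 22500.0
compressed = monthly_input_cost(500_000, 100, 15.00, 0.5)  # 11250.0
annual_savings = (baseline - compressed) * 12              # 135000.0
```

Swap in your own model price and session volume to size the savings for your workload.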

Cost is only half the story. Context rot research shows that LLM performance degrades as input length increases, even when the context window isn't full. Compression doesn't just save money. It produces better outputs by reducing the noise the model has to reason through.

Prompt Compression Techniques

Six approaches dominate the prompt compression landscape today. Each makes different tradeoffs between compression ratio, speed, accuracy, and hallucination risk.

1. LLMLingua / LongLLMLingua (Microsoft Research)

LLMLingua uses a small language model (typically GPT-2 or LLaMA-7B) to score each token by its information content, then removes tokens with the lowest perplexity scores. The intuition: tokens that are highly predictable from surrounding context carry little information and can be dropped without meaningful loss.

  • 2-10x compression ratio
  • ~1.4B parameters for the scoring model
  • Extra inference pass required

LongLLMLingua extends this to long-context scenarios, adding document reordering to place high-relevance content at the start and end of the prompt (mitigating the lost-in-the-middle effect).
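The core idea, drop the most predictable tokens, can be sketched without the real machinery. This toy version scores tokens by unigram self-information instead of running a small LM like GPT-2, so it is an illustration of the principle, not the actual LLMLingua algorithm:

```python
import math
from collections import Counter

def prune_tokens(text: str, keep_ratio: float = 0.5) -> str:
    """Drop the most predictable tokens, LLMLingua-style.

    Real LLMLingua scores tokens with a small language model;
    this toy version uses corpus-level unigram self-information,
    so frequent (predictable) tokens are dropped first.
    """
    tokens = text.split()
    counts = Counter(tokens)
    total = len(tokens)
    # self-information: -log p(token); rarer tokens score higher
    scores = {t: -math.log(c / total) for t, c in counts.items()}
    keep_n = max(1, int(len(tokens) * keep_ratio))
    # keep the highest-information tokens, preserving original order
    ranked = sorted(range(len(tokens)), key=lambda i: scores[tokens[i]], reverse=True)
    kept = sorted(ranked[:keep_n])
    return " ".join(tokens[i] for i in kept)
```

Repeated filler words score low and get pruned first; rare, specific tokens survive.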

Limitations

Token-level pruning can break structured content. Removing tokens from a JSON object, a code block, or a file path produces malformed output. LLMLingua works best on natural language paragraphs, not structured data that agents commonly process.

2. Selective Context

Selective Context operates at the sentence level rather than the token level. It computes self-information scores for each sentence and removes those below a threshold. Simple and effective for filtering boilerplate, but coarser than token-level methods.

The advantage is that it preserves sentence boundaries, so it won't break structured content the way token-level pruning can. The disadvantage is lower compression ratios since you can only remove or keep entire sentences.
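A minimal sketch of sentence-level filtering in the spirit of Selective Context (again using a toy unigram model rather than a real LM): each sentence gets the mean self-information of its words, and the lowest-scoring sentences are dropped while survivors stay verbatim and in order.

```python
import math
import re
from collections import Counter

def filter_sentences(text: str, keep_ratio: float = 0.6) -> str:
    """Drop low-information sentences, keep the rest verbatim and in order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for s in sentences for w in s.split()]
    counts = Counter(words)
    total = len(words)

    def score(sentence: str) -> float:
        # mean self-information of the sentence's words (toy unigram model)
        ws = sentence.split()
        return sum(-math.log(counts[w] / total) for w in ws) / max(len(ws), 1)

    keep_n = max(1, round(len(sentences) * keep_ratio))
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    kept = sorted(ranked[:keep_n])
    return " ".join(sentences[i] for i in kept)
```

Because whole sentences are kept or dropped, structured content inside a surviving sentence is never mangled, which is exactly the tradeoff described above.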

3. RAG as Compression

Retrieval-augmented generation is prompt compression by another name: instead of sending full documents, you retrieve and send only the relevant chunks. This is the most widely deployed form of prompt compression in production systems.

The limitation is that retrieval quality caps output quality. If the retriever misses a relevant chunk, the model can't compensate. For code, this is particularly acute because relevance often depends on multi-hop relationships (imports, call graphs, type definitions) that embedding-based retrieval struggles with.

| Factor | Strength | Weakness |
| --- | --- | --- |
| Compression ratio | Very high (only send relevant chunks) | Retrieval misses = information loss |
| Latency | Indexing is a one-time cost | Retrieval adds per-query latency |
| Code applicability | Works for documentation, specs | Struggles with cross-file dependencies |
| Maintenance | Standard infrastructure | Index staleness on active codebases |

4. Context Caching (Anthropic, Google)

Context caching isn't compression per se, but it reduces the cost of repeated prefixes. Anthropic's prompt caching bills cache reads at 10% of the base input price (a 90% discount), with a small surcharge on cache writes. Google's Gemini offers similar context caching.

This works well when your prompts share a long, stable prefix (system prompts, reference documentation, few-shot examples). It doesn't help with dynamic context that changes every request, which is most of what agents deal with.

  • 90% cost reduction on cached tokens
  • 5 min default cache TTL (Anthropic)
  • Static: only helps repeated prefixes

5. Summarization-Based Compression

Use a cheaper, faster model to summarize content before sending it to an expensive model. This is the standard approach in Claude Code (context compaction), Factory, and other agent frameworks.

The tradeoff is accuracy. Summarization rewrites content, which means the model can introduce errors, drop specific details, or hallucinate. Factory's benchmarks on 36K real coding messages found that the biggest performance gap between compression methods was accuracy: preserving file paths, error codes, stack traces, and other specific details that agents need to function correctly.

The accuracy problem

A summary that says "there was an error in the auth module" is less useful than the original "TypeError: Cannot read property 'jwt' of undefined at src/middleware/auth.ts:47". For coding agents, the specific details are the information. Lossy compression of specifics is information destruction.
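The mechanics of summarization-based compaction are simple; the risk lives entirely inside the summarizer. This sketch keeps the orchestration logic pure by injecting the summarizer as a callable (in practice, a call to a cheaper model):

```python
def summarize_compress(messages: list[str], summarize, keep_recent: int = 3) -> list[str]:
    """Summarization-based compaction: older turns are rewritten by a
    cheaper model (the injected `summarize` callable); recent turns are
    kept verbatim. The tradeoff: whatever `summarize` drops or rewrites,
    such as file paths or error codes, is gone for good.
    """
    if len(messages) <= keep_recent:
        return messages
    summary = summarize("\n".join(messages[:-keep_recent]))
    return [f"[Summary of earlier turns]\n{summary}"] + messages[-keep_recent:]
```

The structure is identical to verbatim compaction pipelines; only the guarantee on the compressed portion differs.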

6. Verbatim Compaction (Morph Compact)

Morph Compact takes a different approach: delete low-signal content while keeping every surviving token identical to the input. Nothing is rewritten. Nothing is paraphrased. The output is a strict subset of the input tokens.

This eliminates the hallucination risk inherent in summarization. When an agent needs a file path, error code, or code snippet, verbatim compaction guarantees it's either present exactly as it appeared in the original or absent entirely. No corrupted paths. No approximate error messages.
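The verbatim property is mechanically checkable. The checker below is my own sketch, not part of any SDK: it confirms every non-empty line of the compacted output appears unchanged, and in order, in the original.

```python
def is_verbatim_subset(original: str, compacted: str) -> bool:
    """True if every non-empty line of `compacted` appears verbatim in
    `original`, with lines in the same relative order."""
    source_lines = original.splitlines()
    pos = 0
    for line in compacted.splitlines():
        if not line.strip():
            continue
        # advance through the original looking for this exact line
        while pos < len(source_lines) and source_lines[pos] != line:
            pos += 1
        if pos == len(source_lines):
            return False  # line was rewritten, reordered, or invented
        pos += 1
    return True
```

A summarizer will routinely fail this check; a deletion-only compactor passes it by construction.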

  • 50-70% compression ratio
  • 3,300+ tokens/second throughput
  • 98% verbatim accuracy
  • 0% hallucination risk

| Technique | Compression | Speed | Accuracy Risk | Best For |
| --- | --- | --- | --- | --- |
| LLMLingua | 2-10x | Slow (extra inference) | Breaks structured content | Natural language docs |
| Selective Context | 2-5x | Fast | Coarse (sentence-level) | Boilerplate removal |
| RAG | Very high | Index + retrieve | Retrieval misses | Document Q&A |
| Context caching | N/A (cost only) | Instant | None | Repeated prefixes |
| Summarization | High (98%+) | Model-dependent | Hallucination, detail loss | Conversation history |
| Morph Compact | 50-70% | 3,300+ tok/s | Zero (verbatim) | Agent context, code |

Benchmarks: How Methods Compare

Factory ran the most comprehensive public evaluation of prompt compression for coding agents. They tested three compression approaches on 36,000 messages from real Claude Code coding sessions, scoring each on overall quality (1-5 scale) and compression ratio.

| Method | Overall Score | Compression | Key Weakness |
| --- | --- | --- | --- |
| Factory structured summaries | 3.70/5 | 98.6% | Custom, not generally available |
| Anthropic summaries | 3.44/5 | 98.7% | Detail loss in file paths/errors |
| OpenAI opaque | 3.35/5 | 99.3% | Lowest accuracy on specifics |

The critical finding: the biggest differentiator wasn't compression ratio (all methods achieved 98%+). It was accuracy on specific details. File paths, line numbers, error messages, and debugging context are exactly the information coding agents need, and summarization-based approaches systematically degraded them.

This is the core argument for verbatim compaction. When the output must contain exact tokens from the input (paths, errors, code), a method that guarantees zero rewriting outperforms methods that achieve higher compression ratios but corrupt the details that matter most.

Compression ratio vs. accuracy

A 99% compression ratio that loses a critical file path is worse than a 60% compression ratio that preserves it exactly. For coding agents, accuracy-per-token matters more than raw compression ratio.

Cost Savings at Scale

The math is straightforward. Input tokens dominate agent costs because agents consume far more context than they produce. Compressing input tokens creates a direct, linear cost reduction.

  • $7.50 saved per 1M tokens at 50% compression
  • $3.75 saved per 500K-token session
  • $375 daily savings (100 sessions)
  • $135K annual savings at scale

These numbers assume Claude Opus at ~$15/M input tokens. The savings scale proportionally with any model:

| Model | Input Price/1M | No Compression | 50% Compressed | Monthly Savings |
| --- | --- | --- | --- | --- |
| Claude Opus | $15.00 | $22,500 | $11,250 | $11,250 |
| GPT-4.1 | $2.00 | $3,000 | $1,500 | $1,500 |
| Claude Sonnet | $3.00 | $4,500 | $2,250 | $2,250 |
| Gemini 2.5 Pro | $1.25 | $1,875 | $937 | $937 |

The cost savings are amplified by the performance improvement. Compressed context means fewer wasted turns, fewer hallucinations, and faster task completion. Factory measured that better compression approaches reduced the number of retries and corrections agents needed, compounding the per-token savings.

Implementation: Morph Compact SDK

Morph Compact exposes a simple API that takes a block of text and returns a verbatim-compacted version. The API is compatible with the OpenAI SDK, so integration requires minimal code changes.

Basic prompt compression with Morph Compact

from openai import OpenAI

client = OpenAI(
    base_url="https://api.morphllm.com/v1",
    api_key="your-morph-api-key"
)

# Compress a long context before sending to your main model
long_context = open("conversation_history.txt").read()

response = client.chat.completions.create(
    model="morph-compact",
    messages=[
        {"role": "user", "content": long_context}
    ]
)

compressed = response.choices[0].message.content
# compressed is a strict subset of the original tokens
# no rewriting, no hallucination — just the high-signal content

Agent context compression pipeline

# Compress accumulated agent context before each reasoning step,
# reusing the Morph-configured `client` from the previous example
def compress_context(messages: list[dict]) -> list[dict]:
    """Compress old messages, keep recent ones intact."""
    if len(messages) <= 5:
        return messages  # nothing to compress yet

    # Compress older messages, keep last 5 untouched
    old_messages = messages[:-5]
    recent_messages = messages[-5:]

    old_text = "\n".join(m["content"] for m in old_messages)

    response = client.chat.completions.create(
        model="morph-compact",
        messages=[{"role": "user", "content": old_text}]
    )

    compressed_msg = {
        "role": "user",
        "content": f"[Compressed context]\n{response.choices[0].message.content}"
    }

    return [compressed_msg] + recent_messages

Using with LangChain

from langchain.schema import Document
from openai import OpenAI

morph = OpenAI(
    base_url="https://api.morphllm.com/v1",
    api_key="your-morph-api-key"
)

def compact_documents(docs: list[Document]) -> list[Document]:
    """Compress retrieved documents with verbatim compaction."""
    compressed = []
    for doc in docs:
        response = morph.chat.completions.create(
            model="morph-compact",
            messages=[{"role": "user", "content": doc.page_content}]
        )
        compressed.append(Document(
            page_content=response.choices[0].message.content,
            metadata=doc.metadata
        ))
    return compressed

# Use in your RAG pipeline:
# 1. Retrieve documents normally
# 2. Compact them before sending to the reasoning model
# 3. Every token in the output existed in the original — zero hallucination

The key integration pattern: compress context before it enters your main model's context window. This applies whether you're building a RAG pipeline, an agent framework, or a simple chat application with long conversation history.
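One practical refinement: compaction itself costs a model call, so it's worth gating on context size. This sketch (my own pattern, not part of the Morph SDK) uses character count as a cheap token proxy and takes the compressor as a callable, such as a `compress_context`-style pipeline:

```python
def maybe_compress(messages: list[dict], compress, budget_chars: int = 60_000) -> list[dict]:
    """Only invoke compaction when the context is actually large.

    Uses character count as a cheap token proxy (roughly 4 chars per
    token for English); below the budget, the overhead of an extra
    compression call isn't worth it.
    """
    size = sum(len(m["content"]) for m in messages)
    if size <= budget_chars:
        return messages  # small enough: pass through untouched
    return compress(messages)
```

Tune `budget_chars` to your model's context window and price point; the right threshold is where compression savings exceed the cost of the compaction call.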

Frequently Asked Questions

What is prompt compression?

Prompt compression is the practice of reducing the number of tokens in an LLM prompt while preserving the meaning and information needed for accurate output. Techniques include token-level pruning (LLMLingua), sentence-level filtering (Selective Context), retrieval-based approaches (RAG), prefix caching, summarization, and verbatim compaction (Morph Compact).

How do you compress prompts and reduce LLM costs?

The most effective approach depends on your use case. For repeated prefixes, use context caching (Anthropic bills cache reads at 10% of the base input price). For long agent sessions, use verbatim compaction to remove low-signal tokens while preserving exact content. For document-heavy prompts, use RAG to retrieve only relevant chunks. A 50% compression ratio on Claude Opus saves $7.50 per million tokens.

What is the difference between prompt compression and summarization?

Summarization rewrites content in fewer words, which risks introducing hallucinated details or losing specific information like file paths and error codes. Verbatim compaction removes low-signal tokens without rewriting anything. Every surviving token is identical to the original input, so there is zero hallucination risk. Factory's benchmarks found that accuracy on specific details was the biggest differentiator between methods.

Does prompt compression affect output quality?

It depends on the technique. Token-level pruning can break structured content like code blocks. Summarization can lose critical details. Verbatim compaction preserves 98% accuracy on real coding sessions because it keeps every surviving token identical to the input. Beyond accuracy preservation, compression often improves output quality by reducing context rot, the performance degradation LLMs experience as input length increases.

What is LLMLingua?

LLMLingua is a prompt compression method from Microsoft Research that uses a small language model to score each token by information content and removes low-signal tokens. It achieves 2-10x compression ratios. LongLLMLingua extends the approach to long-context scenarios with document reordering to mitigate the lost-in-the-middle effect.

Can I use prompt compression with LangChain?

Yes. LangChain provides a ContextualCompressionRetriever that wraps any retriever with a compression layer. You can call the Morph Compact API as a pre-processing step to verbatim-compact retrieved documents before passing them to your chain. This gives you RAG's chunk selection plus verbatim compaction's accuracy guarantees.

How does Morph Compact compare to other compression methods?

Morph Compact achieves 50-70% compression at 3,300+ tokens per second with 98% verbatim accuracy. Unlike summarization, it introduces zero hallucination risk because nothing is rewritten. Unlike token-level pruning, it doesn't break structured content because it operates on semantic units rather than individual tokens.

Compress Prompts Without Losing Accuracy

Morph Compact removes low-signal tokens while keeping every surviving token identical to the input. 50-70% compression, 3,300+ tok/s, zero hallucination risk. Cut your LLM costs in half without sacrificing the details your agents need.