LLM Cost Optimization: A Practical Guide to Cutting Your API Spend by 70-90%

Six proven techniques to reduce LLM API costs: prompt caching (90% savings), batch processing (50% discount), model routing, distillation, context compression, and stacking them for compound savings. Real pricing data for GPT-5, Claude Opus 4.5, Gemini 2.5 Pro, and DeepSeek V3.2.

February 27, 2026 · 2 min read

LLM API spending hit $8.4 billion by mid-2025 and doubled year over year. 37% of enterprises spend over $250K/year on LLM APIs. The models that work best cost the most, output tokens cost 3-10x more than input tokens, and usage grows faster than prices fall. Optimization is not optional.

70-90%: Total cost reduction possible with combined techniques
10x/yr: Price decrease for equivalent LLM performance
$8.4B: API spending mid-2025, doubling YoY
3-10x: Output token cost vs. input tokens

The LLM Cost Equation: Why Output Tokens Break the Budget

LLM API pricing follows a simple formula: (input tokens x input price) + (output tokens x output price). But the ratio between input and output pricing is where costs get unintuitive.

Input tokens are processed in parallel through a single forward pass. The model reads your entire prompt at once. Output tokens are generated sequentially, one at a time, with each token depending on every previous token. This autoregressive decoding requires 3-10x more compute per token than input processing.

The pricing reflects this asymmetry. Claude Opus 4.5 charges $5.00/M for input but $25.00/M for output, a 5x ratio. GPT-5 charges $1.25/M input and $10.00/M output, an 8x ratio. Gemini 2.5 Pro mirrors GPT-5 at $1.25/$10.00.

Output tokens dominate your bill

If your application generates long responses (code generation, document drafting, data analysis), output tokens likely account for 60-80% of your total spend. Reducing output length or switching to a model with lower output pricing often has more impact than compressing input.
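The cost equation can be sketched directly. Prices below are the February 2026 figures quoted in this article; `request_cost` is a hypothetical helper, not a provider SDK call:

```python
# Prices in USD per million tokens (February 2026 figures from this article).
PRICES = {
    "claude-opus-4.5": {"input": 5.00, "output": 25.00},
    "gpt-5": {"input": 1.25, "output": 10.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """(input tokens x input price) + (output tokens x output price)."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A code-generation call on GPT-5: 2,000 tokens in, 1,500 tokens out.
cost = request_cost("gpt-5", 2_000, 1_500)  # $0.0175
# Output contributes $0.0150 of that, roughly 86% of the bill.
```

Even with fewer output tokens than input tokens, the 8x price ratio makes output the dominant term.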

Two forces shape the cost landscape. LLMflation drives prices down roughly 10x per year for equivalent performance. GPT-4-level quality cost $30/M output in early 2024 and costs $2-3/M through GPT-4.1 or Claude Sonnet 4.5 in 2026. But frontier reasoning models (GPT-5, Claude Opus 4.5) remain expensive because they push the performance boundary outward. The cheap tier gets cheaper. The top tier stays pricey.

Current LLM Pricing Landscape (February 2026)

Every cost optimization decision starts with knowing what each model actually charges. These are the current API prices across the major providers.

| Provider | Model | Input /M tokens | Output /M tokens | Output:Input ratio |
|---|---|---|---|---|
| Anthropic | Claude Opus 4.5 | $5.00 | $25.00 | 5x |
| Anthropic | Claude Sonnet 4.5 | $3.00 | $15.00 | 5x |
| Anthropic | Claude Haiku 4.5 | $1.00 | $5.00 | 5x |
| OpenAI | GPT-5 | $1.25 | $10.00 | 8x |
| OpenAI | GPT-4.1 | $2.00 | $8.00 | 4x |
| OpenAI | GPT-4o-mini | $0.15 | $0.60 | 4x |
| Google | Gemini 2.5 Pro | $1.25 | $10.00 | 8x |
| Google | Gemini 2.0 Flash | $0.10 | $0.40 | 4x |
| DeepSeek | V3.2 | $0.28 | $0.42 | 1.5x |

A few patterns stand out. Anthropic charges the most across all tiers, with Claude Opus 4.5 at $25/M output. OpenAI and Google have converged at identical frontier pricing ($1.25/$10.00 for GPT-5 and Gemini 2.5 Pro). DeepSeek is the outlier with near-parity between input and output pricing at $0.28/$0.42.

The cheapest option for high-volume, quality-tolerant workloads is Gemini 2.0 Flash at $0.10/$0.40. That is 62.5x cheaper on output than Claude Opus 4.5. If your task does not require frontier reasoning, you are likely overspending by 10-60x.

Price is not the only cost

Latency, rate limits, and quality all factor into total cost of ownership. A model that costs 5x less but produces outputs requiring 2x human review may cost more in total. Measure cost per successful completion, not just cost per token.
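That metric can be sketched with a small helper; the numbers here are hypothetical, chosen only to illustrate how review costs flip the comparison:

```python
# Hypothetical numbers: API spend plus downstream human-review spend,
# divided by the completions that actually pass review.
def cost_per_success(api_cost: float, review_cost: float,
                     requests: int, success_rate: float) -> float:
    return (api_cost + review_cost) / (requests * success_rate)

# Model A: cheap tokens, but 20% of outputs fail and review is heavy.
a = cost_per_success(api_cost=100.0, review_cost=400.0,
                     requests=10_000, success_rate=0.80)
# Model B: 5x the token cost, light review, 99% of outputs pass.
b = cost_per_success(api_cost=500.0, review_cost=50.0,
                     requests=10_000, success_rate=0.99)
# Despite cheaper tokens, Model A costs more per successful completion.
```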

Six Techniques That Actually Work

Each technique works independently. Several stack together for compound savings. The right combination depends on your workload.

Prompt Caching

Cache repeated prompt prefixes. Anthropic gives 90% savings on cache reads. One dev: $720/mo to $72/mo.

Batch Processing

50% flat discount for non-urgent work. Stacks with caching for 95%+ total savings on eligible requests.

Model Routing

Send easy tasks to cheap models, hard tasks to frontier. 40-85% savings. Most production requests are routine.

Distillation

Train smaller models to mimic frontier ones. 5-30x cost reduction retaining 97% of performance on your specific tasks.

Context Compression

Reduce input tokens by 50-80%. LLMLingua: 20x compression, 1.5% perf loss. Morph Compact: 50-70% with zero hallucination.

Combined Approach

Stack caching + routing + compression for 70-90% total cost reduction across your full production workload.

1. Prompt Caching: 90% Savings on Repeated Prefixes

Most LLM applications send the same system prompt, few-shot examples, or context documents with every request. Prompt caching stores the processed representation of these repeated prefixes so the model does not recompute them.

Anthropic offers 90% savings on cached input tokens. That means cached tokens on Claude Sonnet 4.5 cost $0.30/M instead of $3.00/M. OpenAI offers a 50% discount on cached tokens. One developer reported their monthly spend dropped from $720 to $72 after enabling prompt caching for system prompts that repeated across every API call.

Prompt caching with Anthropic (Python)

from anthropic import Anthropic

client = Anthropic()

# The system prompt and context are cached after the first call.
# Subsequent calls with the same prefix get 90% input discount.
response = client.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,  # e.g. 4000 tokens
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": user_query}]
)

# First call: full price on input ($3.00/M)
# All subsequent calls: cached prefix at $0.30/M (90% off)
# With a 4K-token system prompt and 100 calls/day:
#   Before: 4000 * 100 * $3.00/M = $1.20/day
#   After:  4000 * 100 * $0.30/M = $0.12/day

Prompt caching is the highest-ROI optimization for most applications because it requires zero changes to your prompt quality and works immediately. If your system prompt or context prefix exceeds 1,024 tokens and you make repeated calls, enable caching first.

2. Batch Processing: 50% Flat Discount

Both OpenAI and Anthropic offer batch APIs that process requests asynchronously at a 50% discount. The tradeoff: results are delivered within 24 hours instead of real-time.

Batch processing stacks with prompt caching. If you batch requests that also use cached prefixes, you get 50% off the already-90%-discounted cached tokens. Combined savings exceed 95% on input tokens for high-volume batch workloads.

When batching makes sense

Evaluation pipelines, data labeling, content generation for CMS, nightly report generation, and any workflow where the user is not waiting for a response. If latency tolerance is over 1 hour, batching should be your default.
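As a sketch of what a nightly batch job looks like, assuming the Anthropic Message Batches request shape (`custom_id` plus `params`); `build_batch_requests` is a hypothetical helper:

```python
# Package prompts into Message Batches API request objects.
# Assumes the Anthropic batch request shape; helper name is hypothetical.
def build_batch_requests(prompts: list[str],
                         model: str = "claude-sonnet-4-5") -> list[dict]:
    return [
        {
            "custom_id": f"req-{i}",
            "params": {
                "model": model,
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": p}],
            },
        }
        for i, p in enumerate(prompts)
    ]

demo = build_batch_requests(["label ticket 1", "label ticket 2"])

# Submitting runs at 50% of standard pricing; results arrive within 24 hours:
# from anthropic import Anthropic
# batch = Anthropic().messages.batches.create(requests=demo)
```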

3. Model Routing: 40-85% Savings

Model routing classifies incoming requests by difficulty and sends simple tasks to cheap models while routing complex tasks to frontier models. The economics work because 60-80% of production requests are routine: simple classification, formatting, extraction, or short-answer queries that GPT-4o-mini handles as well as GPT-5.

A concrete example: a team processing 100,000 daily requests on GPT-4-class models paid $4,500/month. After implementing routing that sent 70% of requests to GPT-4o-mini and 30% to GPT-5, the bill dropped to $1,500/month, a 67% reduction with no measurable quality loss on the routed tasks.

Simple model routing (TypeScript)

import OpenAI from "openai";

const openai = new OpenAI();

type Difficulty = "simple" | "complex";

// Crude stand-in for a real difficulty signal; a trained classifier or
// a cheap-model judge works better in production.
function requiresReasoning(prompt: string): boolean {
  const keywords = ["analyze", "debug", "prove", "plan", "refactor"];
  const lower = prompt.toLowerCase();
  return keywords.some((k) => lower.includes(k));
}

function classifyRequest(prompt: string): Difficulty {
  // Heuristic routing: length, keyword complexity, task type
  if (prompt.length < 500 && !requiresReasoning(prompt)) {
    return "simple";
  }
  return "complex";
}

const MODEL_MAP = {
  simple: "gpt-4o-mini",    // $0.15 / $0.60 per M tokens
  complex: "gpt-5",          // $1.25 / $10.00 per M tokens
} as const;

async function route(prompt: string) {
  const difficulty = classifyRequest(prompt);
  return openai.chat.completions.create({
    model: MODEL_MAP[difficulty],
    messages: [{ role: "user", content: prompt }],
  });
}

// 70% simple (GPT-4o-mini) + 30% complex (GPT-5)
// vs. 100% GPT-5: ~67% cost reduction

4. Distillation: 5-30x Cost Reduction

Distillation trains a smaller, cheaper model to mimic the outputs of a frontier model on your specific task. The distilled model retains ~97% of the teacher model's performance on the target domain while running 5-30x cheaper.

OpenAI provides built-in distillation tools that let you fine-tune GPT-4o-mini on outputs from GPT-5. The workflow: run your production queries through the frontier model, collect input-output pairs, and fine-tune the smaller model on those pairs.

Distillation requires upfront investment (collecting data, training, evaluation) but delivers the largest per-token savings of any technique. It works best for well-defined, high-volume tasks where you can measure quality objectively: classification, extraction, code generation for specific patterns, or structured output formatting.
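A minimal sketch of the data-collection half, assuming OpenAI's chat-format JSONL for fine-tuning data; `to_finetune_example` is a hypothetical helper:

```python
import json

# One JSONL line pairing a production query with the teacher's answer,
# in OpenAI's chat fine-tuning format.
def to_finetune_example(system: str, user: str, teacher_output: str) -> str:
    return json.dumps({
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": teacher_output},
        ]
    })

line = to_finetune_example(
    "Classify support tickets.",
    "My invoice total is wrong.",
    "billing",  # captured from the frontier teacher model
)

# Write one line per collected pair to train.jsonl, then (sketch):
# client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
# client.fine_tuning.jobs.create(training_file=file_id, model="gpt-4o-mini")
```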

5. Context Compression: 50-80% Token Reduction

Context compression reduces the number of input tokens sent to the model while preserving the information it needs. This directly reduces the input side of your cost equation.

LLMLingua achieves up to 20x compression with only 1.5% performance loss through prompt-aware token pruning. For a contract analysis pipeline processing 15,000-token documents, compression to 4,500 tokens cuts input costs by 70% per call.

Morph Compact takes a different approach: deletion, not rewriting. Every sentence that survives compression is verbatim from the original. It achieves 50-70% reduction with 98% verbatim accuracy at 3,300+ tokens per second. Because it deletes noise rather than paraphrasing, there is zero hallucination risk in the compressed output.

6. Combined Approach: 70-90% Total Reduction

These techniques are not mutually exclusive. The highest savings come from stacking them.

| Combination | Mechanism | Typical savings |
|---|---|---|
| Caching alone | Cache repeated prefixes | Up to 90% on input |
| Caching + Batching | Batch non-urgent + cache | 95%+ on input |
| Routing + Caching | Cheap models for simple tasks + cache | 70-85% |
| Compression + Routing | Shrink context + route by difficulty | 75-90% |
| All combined | Cache + batch + route + compress | 70-90% total |

A realistic production pipeline: cache your system prompts (90% off repeated prefix), compress long context inputs with Morph Compact (50-70% fewer input tokens), route simple requests to GPT-4o-mini ($0.15/M instead of $1.25/M), and batch non-urgent work for an additional 50% off. Each layer compounds. A workload costing $10,000/month can realistically drop to $1,000-3,000/month.
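The stacking can be sketched with assumed per-technique factors; the blended discounts below are scenario assumptions for illustration, not provider-guaranteed math:

```python
# Illustrative stacking: each technique multiplies the remaining bill.
def stacked_monthly_cost(base: float) -> float:
    cost = base
    cost *= 0.55  # prompt caching: ~45% off blended (90% off a large cached prefix)
    cost *= 0.70  # context compression: ~30% fewer blended tokens
    cost *= 0.60  # model routing: ~40% off by sending routine work to cheap models
    return cost

optimized = stacked_monthly_cost(10_000.0)  # $2,310, inside the $1K-3K range
```

Because the discounts multiply rather than add, three moderate techniques together reach savings no single technique delivers alone.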

Context Compression: The Biggest Lever for Long-Context Workloads

For applications that process long documents, maintain agent conversation history, or work with large codebases, context compression delivers the largest single cost reduction. The reason: these workloads are input-token dominated, and compression directly attacks input volume.

50-70%: Token reduction (Morph Compact)
20x: Max compression ratio (LLMLingua)
1.5%: Performance loss at 20x compression
$0.007/M: Morph Compact pricing

Consider a contract analysis pipeline. Each contract averages 15,000 input tokens. At Claude Sonnet 4.5 pricing ($3.00/M input), that is $0.045 per document just for input. Processing 1,000 contracts per day costs $45/day or $1,350/month in input tokens alone.

With 70% compression (15,000 to 4,500 tokens), input cost drops to $0.0135 per document. At 1,000 documents per day, that is $13.50/day or $405/month. Savings: $945/month from a single optimization. The compression step itself costs fractions of a cent per document.

Cost savings with Morph Compact (Python)

from anthropic import Anthropic
from openai import OpenAI

anthropic = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
morph = OpenAI(
    api_key="your-morph-api-key",
    base_url="https://api.morphllm.com/v1"
)

# Compress before sending to the expensive model
def analyze_contract(contract_text: str) -> str:
    # Step 1: Compress with Morph Compact ($0.007/M tokens)
    compressed = morph.chat.completions.create(
        model="morph-compact",
        messages=[{"role": "user", "content": contract_text}]
    ).choices[0].message.content

    # Step 2: Analyze with Claude Sonnet ($3.00/M input)
    result = anthropic.messages.create(
        model="claude-sonnet-4-5-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": f"Analyze: {compressed}"}]
    )
    return result.content[0].text

# Before: 15K tokens x $3.00/M = $0.045/doc
# After:  4.5K tokens x $3.00/M = $0.0135/doc + ~$0.0001 compression
# Savings: 70% on input costs per document

Morph Compact works through deletion, not rewriting. Every sentence that survives is verbatim from the original. For cost optimization, this means the model receives exact quotes, exact numbers, and exact terms from the source document. No risk of compression introducing errors that the downstream model acts on.

Compression compounds with other techniques

Compress first, then cache the compressed version. You reduce the tokens entering the cache, which means the cache stores less and serves faster. For agent workloads, compress tool outputs inline so the conversation history stays lean throughout the session. Each turn saved is tokens not re-sent on the next turn.

Agent Cost Profiles: What AI Coding Sessions Actually Cost

Agentic workloads are the fastest-growing category of LLM spend and the hardest to optimize. Each agent session involves dozens of API calls with growing context windows. Real cost data from production coding agents:

| Metric | Value |
|---|---|
| Average Claude Code session | ~$0.34 (45K input + 13K output + 38K cache) |
| Heavy user (50+ sessions/day) | $5,623/month API equivalent |
| Optimized CLAUDE.md | $0.024/session vs. $0.063 bloated |
| 20-dev team (50 sessions/day each) | $340/day unoptimized; $67.50/day saved with compression |

The average Claude Code session costs ~$0.34, broken down as 45K input tokens, 13K output tokens, and 38K cache read tokens. That sounds cheap until you scale it. A heavy user running 50 sessions per day pays the API equivalent of $5,623/month. A 20-person engineering team at that usage rate faces $112,460/month in agent API costs.

The biggest cost driver in agent sessions is context re-sending. Every turn of an agent conversation re-sends the full conversation history plus tool outputs. A 20-turn session with growing context means the early messages are paid for 20 times. Compression reduces this compound cost by shrinking what gets re-sent each turn.

Optimizing Agent Configuration

Agent system prompts (like CLAUDE.md files) are sent with every turn. A bloated 3,000-token system prompt in a 20-turn session costs 60,000 input tokens just for the repeated prompt. Trimming it to 1,000 tokens saves 40,000 tokens per session. At Claude Sonnet pricing, that is $0.12 per session or $180/month for a heavy user.
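The arithmetic above as a small helper; `prompt_trim_savings` is a hypothetical name, and Claude Sonnet input pricing is assumed:

```python
SONNET_INPUT_PER_TOKEN = 3.00 / 1_000_000  # Claude Sonnet 4.5 input pricing

# Dollars saved per billing period from trimming a system prompt
# that is re-sent on every turn of every session.
def prompt_trim_savings(bloated: int, trimmed: int, turns: int,
                        sessions_per_day: int, days: int = 30) -> float:
    tokens_saved = (bloated - trimmed) * turns * sessions_per_day * days
    return tokens_saved * SONNET_INPUT_PER_TOKEN

# 3,000 -> 1,000 tokens, 20 turns/session, 50 sessions/day:
monthly = prompt_trim_savings(3_000, 1_000, 20, 50)  # $180.00/month
```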

An optimized CLAUDE.md configuration produces sessions costing $0.024 versus $0.063 for the bloated version, a 62% reduction purely from prompt engineering. Pair that with context compression on tool outputs and prompt caching on the system prompt, and agent costs drop dramatically.

The compound cost of agent context

Every unnecessary token in an agent conversation is paid for on every subsequent turn. 100 wasted tokens in turn 1 of a 30-turn session costs 3,000 tokens total. At $3/M, that is nearly a cent of pure waste per session. Across thousands of sessions, it adds up fast. Keep agent context lean from the start.

Cost Savings Calculator

Concrete numbers for common scenarios. These assume current February 2026 pricing.

$15K/day saved: 1M calls/day at 10K avg input with 50% compression on Claude Sonnet
$2,025/mo saved: 20 devs x 50 Claude Code sessions/day with context compression
$945/mo saved: 1K contracts/day at 70% compression on Claude Sonnet
$3,000/mo saved: routing 100K daily requests (70% to GPT-4o-mini, 30% to GPT-5)
$648/mo saved: prompt caching on a 4K system prompt across 100K daily calls
$7K-9K/mo saved: combined caching + compression + routing on a $10K/mo workload

Scenario: High-Volume API (1M calls/day)

1 million calls per day with an average 10,000 input tokens per call on Claude Sonnet 4.5 ($3.00/M input): $30,000/day in input costs. With 50% context compression, input drops to 5,000 tokens per call: $15,000/day. Annual savings: $5.4 million.

Scenario: Engineering Team (20 devs)

20 developers each running 50 Claude Code sessions per day. Each session averages $0.34. Unoptimized daily cost: $340/day or $10,200/month. With context compression reducing per-session cost by ~20%: $67.50/day saved, or $2,025/month.

Scenario: Document Processing Pipeline

1,000 documents per day, 15K tokens each, on Claude Sonnet 4.5. Monthly input cost: $1,350/month. After 70% compression: $405/month. Add prompt caching for the analysis instructions (90% off the repeated prefix): under $400/month total. An 85% reduction from two optimizations.
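The same arithmetic as a reusable sketch; `pipeline_monthly_cost` is a hypothetical helper that models only the input side and assumes 30-day months:

```python
# Monthly input spend for a document pipeline, optionally after compression.
def pipeline_monthly_cost(docs_per_day: int, tokens_per_doc: int,
                          price_per_m: float, compression: float = 0.0,
                          days: int = 30) -> float:
    tokens = docs_per_day * tokens_per_doc * (1 - compression) * days
    return tokens * price_per_m / 1_000_000

before = pipeline_monthly_cost(1_000, 15_000, 3.00)                   # $1,350/month
after = pipeline_monthly_cost(1_000, 15_000, 3.00, compression=0.70)  # $405/month
```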

Frequently Asked Questions

How much can you reduce LLM API costs?

By combining prompt caching (90% savings on cache hits), batch processing (50% discount), model routing (40-85% savings), and context compression (50-80% token reduction), teams routinely achieve 70-90% total cost reduction. One developer cut their monthly bill from $720 to $72 using prompt caching alone.

What is the cheapest LLM API in 2026?

Google's Gemini 2.0 Flash is the cheapest mainstream option at $0.10 per million input tokens and $0.40 per million output. DeepSeek V3.2 costs $0.28/$0.42. GPT-4o-mini is $0.15/$0.60. For frontier-quality output, GPT-5 and Gemini 2.5 Pro both cost $1.25/$10.00 per million tokens.

Why do LLM output tokens cost more than input tokens?

Output generation requires sequential autoregressive decoding where each token depends on all previous tokens. Input tokens are processed in parallel through a single forward pass. This means output requires 3-10x more compute per token. Claude Opus 4.5 charges $5/M input but $25/M output (5x). GPT-5 charges $1.25/$10.00 (8x).

What is prompt caching and how much does it save?

Prompt caching stores the processed representation of repeated prompt prefixes so the model does not recompute them. Anthropic offers 90% savings on cache reads, meaning cached input tokens cost $0.30/M instead of $3.00/M on Claude Sonnet 4.5. One developer reported their monthly spend dropped from $720 to $72 after enabling caching for system prompts.

How does model routing reduce LLM costs?

Model routing classifies incoming requests by difficulty and sends simple tasks to cheap models (GPT-4o-mini at $0.15/M) while routing complex tasks to frontier models (GPT-5 at $1.25/M). Since 60-80% of production requests are routine, this saves 40-85%. A team processing 100K daily requests cut their bill from $4,500/month to $1,500/month with routing.

What is context compression and how does it reduce costs?

Context compression reduces the number of input tokens sent to the LLM while preserving the information the model needs. LLMLingua achieves 20x compression with 1.5% performance loss. Morph Compact achieves 50-70% reduction with 98% verbatim accuracy. For a contract analysis pipeline processing 15K-token documents, compression to 4,500 tokens cuts input costs by 70% per call.

Cut Your LLM Costs with Context Compression

Morph Compact reduces input tokens by 50-70% with zero hallucination risk. Every surviving sentence is verbatim from the original. Works with any LLM provider. Drop-in compatible with the OpenAI SDK.