Context windows have grown from 4K tokens to 10 million. But the number on a model card does not tell you what you actually get. Effective context degrades long before you fill the window. Pricing surcharges kick in at hidden thresholds. And more context often produces worse output, not better. This is every model, compared honestly.
LLM Context Window Comparison Table (February 2026)
Every major model, sorted by context window size. Pricing is per million tokens. Where models have tiered pricing (e.g., different rates above 128K or 200K tokens), the base rate is listed first with the surcharge rate after the slash.
| Model | Provider | Context | Max Output | Input $/M | Output $/M |
|---|---|---|---|---|---|
| Llama 4 Scout | Meta | 10M | - | Free | Free |
| Grok 4 | xAI | 2M | - | $3.00 | $15.00 |
| Grok 4.1 Fast | xAI | 2M | - | $0.20 | $0.50 |
| Gemini 2.5 Pro | Google | 1M (2M beta) | 64K | $1.25 / $2.50 | $10 / $15 |
| Gemini 2.5 Flash | Google | 1M | 8K | $0.15 | $0.60 |
| GPT-4.1 | OpenAI | 1M | 32K | $2.00 | $8.00 |
| GPT-4.1 Mini | OpenAI | 1M | 32K | $0.40 | $1.60 |
| Llama 4 Maverick | Meta | 1M | - | Free | Free |
| GPT-5.2 | OpenAI | 400K | 128K | $1.75 | $14.00 |
| GPT-5 | OpenAI | 400K | 128K | $1.25 | $10.00 |
| GPT-5 Nano | OpenAI | 400K | 128K | $0.05 | $0.40 |
| o3 | OpenAI | 200K | 100K | $2.00 | $8.00 |
| Claude Opus 4.6 | Anthropic | 200K (1M beta) | 64K | $5.00 | $25.00 |
| Claude Sonnet 4.6 | Anthropic | 200K (1M beta) | 64K | $3.00 | $15.00 |
| Claude Haiku 4.5 | Anthropic | 200K | - | $1.00 | $5.00 |
| DeepSeek R1 | DeepSeek | 128K | 64K | $0.55 | $2.19 |
| DeepSeek V3 | DeepSeek | 128K | 8K | $0.14 | $0.28 |
| Mistral Large 3 | Mistral | 128K | - | $2.00 | $6.00 |
| Qwen3-235B | Alibaba | 128K | - | ~$0.30-0.70 | ~$3-8 |
| GPT-4o | OpenAI | 128K | 16K | $2.50 | $10.00 |
About this table
Pricing reflects API rates from each provider as of February 2026. Free models (Llama 4 Scout, Maverick) require self-hosting or third-party inference. "Max Output" marked as "-" means the provider does not publish a separate output cap. Gemini and Claude tiered pricing: the rate after the slash applies when the request exceeds 200K tokens.
Understanding Context Windows
A context window is the total number of tokens an LLM can process in a single request. This includes both your input (the prompt, system instructions, conversation history, files) and the model's output (its response). The two share the same budget.
Input vs. Output Tokens
Most models set separate limits. GPT-5.2 has a 400K context window with a 128K max output, meaning your input can be up to 272K tokens. Claude Opus 4.6 offers 200K context with 64K output. The gap between total context and max output is your input budget.
This distinction matters for coding agents. A model with 1M context but 8K max output (like Gemini 2.5 Flash) can ingest a massive codebase but can only generate a short response per turn. A model with 400K context but 128K max output (like GPT-5) can generate much longer responses, which matters for multi-file edits.
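The input-budget arithmetic above can be sketched in a few lines. This is a minimal illustration using the figures from the table; `input_budget` is a hypothetical helper name, not a provider API.

```python
def input_budget(context_window: int, max_output: int) -> int:
    """Tokens left for the prompt once the output reservation is subtracted.

    The context window is a shared budget: input + output together
    must fit inside it, so reserving max_output shrinks the input side.
    """
    return context_window - max_output

# GPT-5.2: 400K context, 128K max output -> 272K available for input
print(input_budget(400_000, 128_000))  # 272000

# Gemini 2.5 Flash: 1M context, 8K max output -> almost all of it is input
print(input_budget(1_000_000, 8_000))  # 992000
```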
How Tokenization Works
Tokens are not words. They are sub-word units determined by each model's tokenizer. A rough rule of thumb: 1 token is about 0.75 English words, or roughly 4 characters. Code typically tokenizes less efficiently than prose because of special characters, indentation, and variable names.
Practical equivalents for a 128K context window:
- ~96,000 English words (~300 pages of text)
- ~50,000-70,000 lines of code (depending on language)
- A medium-sized codebase, or a large codebase with selective file inclusion
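The rules of thumb above translate directly into back-of-envelope estimators. These are rough heuristics, not tokenizer-accurate counts; real tokenizers vary by model and by content (code tokenizes less efficiently than prose).

```python
def estimate_tokens_from_words(word_count: int) -> int:
    # Rule of thumb: 1 token is about 0.75 English words
    return round(word_count / 0.75)

def estimate_tokens_from_chars(char_count: int) -> int:
    # Alternative heuristic: roughly 4 characters per token
    return round(char_count / 4)

# ~96,000 English words fills a 128K window: 96,000 / 0.75 = 128,000 tokens
print(estimate_tokens_from_words(96_000))  # 128000
```

For anything cost-sensitive, count with the model's actual tokenizer rather than these heuristics.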
Context window != memory
A context window is not persistent memory. Every API call starts fresh. If you send 100K tokens in one request, the model does not "remember" them on the next request. Conversation history must be re-sent each time, which is why long sessions accumulate tokens fast and why context compression becomes essential.
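The accumulation described above can be made concrete with a toy simulation. This is an illustrative sketch, not a real client: `send` and the token counts are hypothetical, but the billing pattern (every call re-sends the full history) mirrors how stateless chat APIs work.

```python
# Each call must carry the full prior conversation, so the input
# tokens billed per call grow linearly with session length.
history = []  # list of (role, token_count) pairs; counts are illustrative

def send(user_tokens: int, reply_tokens: int) -> int:
    """Simulate one API call; returns input tokens billed for this request."""
    billed = sum(t for _, t in history) + user_tokens
    history.append(("user", user_tokens))
    history.append(("assistant", reply_tokens))
    return billed

print(send(1_000, 500))  # 1000 -- first call carries no history
print(send(1_000, 500))  # 2500 -- prior history (1500) + new prompt (1000)
```

By the 50th call, the history term dominates the bill, which is exactly why compression pays off in long sessions.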
Long-Context Pricing: The Hidden Surcharges
Most comparison pages list per-token pricing without mentioning the surcharges that activate at longer context lengths. These surcharges can double your actual cost.
| Provider | Surcharge Threshold | Input Multiplier | Output Multiplier | Scope |
|---|---|---|---|---|
| Anthropic | 200K tokens | 2x | 1.5x | ALL tokens (not just overage) |
| Google | 200K tokens | 2x | 2x | ALL tokens (not just overage) |
| OpenAI | None | 1x | 1x | No surcharge at any length |
| xAI | None published | 1x | 1x | No surcharge documented |
| DeepSeek | None published | 1x | 1x | No surcharge documented |
The surcharge applies to ALL tokens
Anthropic and Google do not charge extra only on the tokens above the threshold. When a request crosses 200K, the higher rate applies to every token in the request. A 199K-token request to Claude Sonnet 4.6 costs $3.00/M on input and $15.00/M on output. At 200K+, Anthropic switches to $6.00/M input and $22.50/M output for the entire request. This cliff-edge pricing means crossing 200K by even one token doubles your input cost.
What This Means in Practice
Consider a coding agent that routinely sends 250K-token requests to Claude Sonnet 4.6. Without the surcharge, input would cost $0.75 per request. With the surcharge (2x on all tokens), it costs $1.50 per request. Over 1,000 requests, that is an extra $750. The surcharge is not a rounding error. It is a line item in your infrastructure budget.
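The cliff can be expressed as a small pricing function. This is a sketch of the surcharge behavior described above, not a billing API; the default threshold and multiplier match Anthropic's documented input surcharge.

```python
def input_cost(tokens: int, base_rate: float,
               surcharge_mult: float = 2.0,
               threshold: int = 200_000) -> float:
    """Input cost in dollars; rates are $ per million tokens.

    Past the threshold, the surcharged rate applies to ALL tokens
    in the request, not just the overage -- hence the cliff.
    """
    rate = base_rate * surcharge_mult if tokens > threshold else base_rate
    return tokens / 1_000_000 * rate

# Claude Sonnet 4.6 input at $3.00/M:
print(input_cost(199_000, 3.00))  # 0.597 -- base rate
print(input_cost(250_000, 3.00))  # 1.5   -- 2x rate on every token
```

Note that 250K tokens costs more than 2.5x what 199K costs, even though it is only ~25% more input.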
OpenAI's decision not to surcharge at any length is a meaningful competitive advantage for workloads that consistently exceed 200K tokens. GPT-4.1 at $2.00/M stays at $2.00/M whether you send 10K or 900K tokens.
Effective Context vs. Advertised Context
A model's advertised context window is its theoretical capacity. Effective context is how well the model actually performs at that length. The gap between the two is often enormous.
RULER Benchmark Results
The RULER benchmark tests model performance at increasing context lengths with tasks like needle-in-a-haystack retrieval, multi-key-value lookup, and pattern matching. Performance at 4K tokens serves as the baseline.
| Model | Claimed Context | Score @ 4K | Score @ 128K | Performance Drop |
|---|---|---|---|---|
| Gemini 1.5 Pro | 1M | 96.7 | 94.4 | -2.3 pts |
| GPT-4-1106 | 128K | 96.6 | 81.2 | -15.4 pts |
| Llama 3.1-70B | 128K | 96.5 | 66.6 | -29.9 pts |
| Mixtral-8x22B | 64K | 95.6 | 31.7 | -63.9 pts |
Gemini 1.5 Pro is the clear outlier. It loses only 2.3 points going from 4K to 128K, meaning it uses nearly its full context window effectively. Every other model tested shows significant degradation. Mixtral-8x22B collapses to 31.7 at 128K (double its advertised 64K window), a 63.9-point drop to near-random performance.
The practical implication: a model advertising 128K context might give you GPT-4-level quality at 4K tokens but mid-tier quality at 128K. You are not buying the same model at every context length. Performance at your typical input size matters more than the maximum advertised number.
Context Rot: Why More Context Means Worse Output
Context rot is the degradation in LLM output quality as input length grows. It is not a theoretical concern. Chroma tested 18 models and measured it directly.
Chroma's Findings Across 18 Models
Every model tested showed degradation as context grew. No exceptions. But the most counterintuitive finding was about text ordering:
| Condition | Performance | Why |
|---|---|---|
| Shuffled text | Higher accuracy | No narrative structure to create positional bias |
| Coherent text | Lower accuracy | Recency bias causes models to over-weight later passages |
Models performed better on shuffled text than on coherent text. This is not a bug. Coherent text creates stronger positional patterns. The model develops recency bias, attending disproportionately to passages near the end of the input and neglecting earlier content. Shuffled text disrupts this bias, forcing more uniform attention across the input.
The implication for real-world use: ordering matters. If you put critical information at the start of a long prompt and less important content at the end, the model may still prioritize the later material. This makes naive "stuff everything into the context" approaches unreliable. It is not just a matter of fitting within the window. It is a matter of what the model actually attends to within that window.
Context rot is not about context limits
Context rot happens well before you hit the model's context window limit. A model with 200K tokens of capacity can start degrading at 50K. The window tells you what fits. It does not tell you what the model will actually use effectively. This is why context compression improves output quality, not just cost.
Cost at Scale: What Context Actually Costs
Per-token rates look similar in isolation. At scale, the gaps are massive. Here is what 1 billion input tokens per month costs for each tier of model.
The budget tier (DeepSeek V3, Gemini Flash-Lite, GPT-4.1 Nano) clusters around $100-140 per billion input tokens. The premium tier (Claude Sonnet, Opus) runs $3,000-5,000. That is a 35x spread. And this is before accounting for Anthropic's 2x surcharge above 200K tokens, which would push Opus to $10,000 per billion tokens for long-context workloads.
Where Context Costs Compound
Coding agents are the worst case for context costs. A single agentic session might make 50-100 API calls, each carrying an expanding conversation history. If the agent reads files, runs commands, and accumulates tool outputs, the 50th call might include 150K tokens of conversation history. Multiply by thousands of users and the costs scale nonlinearly.
This is where context compression pays for itself. Reducing token count by 50% does not save 50% on a single request. It saves 50% on every subsequent request in the session, because the compressed history is carried forward. For a 100-call agent session, compressing context early can cut total session cost by 60-70%.
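The compounding effect can be modeled with a simple session simulator. This is an illustrative sketch under stated assumptions: each call appends a fixed `growth` of history tokens, and compression scales the carried history by `keep_ratio`. The figures are hypothetical, not measured.

```python
def session_input_tokens(calls: int, growth: int,
                         keep_ratio: float = 1.0) -> int:
    """Total input tokens billed across an agent session.

    Each call sends the (optionally compressed) accumulated history
    plus `growth` new tokens, then the new tokens join the history.
    """
    total = 0
    history = 0
    for _ in range(calls):
        total += int(history * keep_ratio) + growth
        history += growth
    return total

full = session_input_tokens(100, 2_000)        # uncompressed
half = session_input_tokens(100, 2_000, 0.5)   # history compressed 50%
print(f"savings: {1 - half / full:.0%}")       # savings: 49%
```

Because the history term dominates late in the session, per-request compression savings approach the compression ratio for the whole session, and compressing earlier (or more aggressively) pushes total savings higher still.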
The Alternative: Less Context, Better Results
The race to bigger context windows assumes more context is always better. The research says otherwise.
Compression Outperforms Raw Context
CompLLM research demonstrated that 2x compressed context actually surpasses uncompressed performance on very long sequences. The mechanism is straightforward: removing noise improves the signal-to-noise ratio. The model has less to process and more of what remains is relevant.
The retrieve-then-solve approach produced an even starker result. By selecting only relevant context instead of feeding the full input, Mistral improved from 35.5% to 66.7% accuracy. Nearly double. Not by giving the model more context, but by giving it better context.
| Approach | Strategy | Result |
|---|---|---|
| Full context (Mistral) | Send everything | 35.5% accuracy |
| Retrieve-then-solve (Mistral) | Select relevant context | 66.7% accuracy |
| 2x compressed (CompLLM) | Compress before sending | Surpasses uncompressed |
Morph Compact: Verbatim Context Reduction
Morph Compact reduces context by 50-70% while keeping every surviving sentence word-for-word identical to the original. No paraphrasing. No summarization. No hallucination risk in the compressed output. It runs at 3,300+ tokens per second with 98% verbatim accuracy.
For teams paying $3,000-5,000 per billion tokens on premium models, compacting context before sending it can cut that to $1,000-2,500 while also improving output quality by reducing noise. The cost savings and quality improvements work in the same direction.
Frequently Asked Questions
Which LLM has the largest context window in 2026?
Llama 4 Scout from Meta holds the largest context window at 10 million tokens. Grok 4 from xAI follows at 2 million. Gemini 2.5 Pro, GPT-4.1, GPT-4.1 Mini, Gemini 2.5 Flash, and Llama 4 Maverick all support 1 million tokens. But the largest window does not mean the best performance. RULER benchmark tests show most models degrade significantly before reaching their advertised limits.
How much does it cost to use a 1 million token context window?
Costs vary dramatically. Filling a 1M context with GPT-4.1 costs $2.00 per request. Gemini 2.5 Flash costs $0.15 for the same input. Llama 4 Scout and Maverick are free to use (but require self-hosting or third-party inference). Anthropic and Google charge 2x surcharges above 200K tokens, which can double effective cost for long-context requests.
What is the difference between context window and max output tokens?
The context window is the total token budget for input plus output. Max output tokens is the ceiling on how much the model can generate. GPT-5.2 has a 400K context window with 128K max output, so your input can be up to 272K tokens. A model with large context but small max output (like Gemini 2.5 Flash at 1M/8K) can read a lot but writes short responses per turn.
Do LLMs actually use their full context window effectively?
No. RULER benchmark testing shows significant degradation at longer contexts. GPT-4-1106 drops from 96.6 at 4K to 81.2 at 128K. Llama 3.1-70B drops from 96.5 to 66.6. Gemini 1.5 Pro is the exception, holding at 94.4 at 128K with only a 2.3-point drop. Performance at your typical input size matters more than the maximum advertised number.
What is context rot and why does it matter?
Context rot is the degradation in LLM output quality as input length grows. Chroma tested 18 models and found every one degrades. Models performed better on shuffled text than coherent text because coherent text creates stronger recency bias. Filling a large context window with ordered information can produce worse results than a shorter, more focused input.
Is it better to use a larger context window or compress the context?
Research increasingly favors compression. CompLLM showed 2x compressed context surpasses uncompressed on long sequences. Morph Compact achieves 50-70% reduction with 98% verbatim accuracy at 3,300+ tok/s. The retrieve-then-solve approach improved Mistral from 35.5% to 66.7% by selecting relevant context. For most applications, focused input outperforms stuffed input.
Stop Paying for Wasted Context
Morph Compact reduces context by 50-70% while keeping every surviving sentence verbatim. Cut your token costs and improve output quality at the same time. 3,300+ tok/s, 98% verbatim accuracy, zero hallucination risk.